0% found this document useful (0 votes)
16 views19 pages

Data Integration

Data integration involves combining data from various sources into a unified view for better business intelligence and decision-making, distinguishing it from data ingestion, which is a simpler process of data collection. The integration process includes complex tasks like ETL and requires skilled engineers, while modern challenges include managing real-time data and hybrid environments. Data transformation is essential for converting raw data into usable formats, and validation ensures the accuracy and reliability of the transformed data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views19 pages

Data Integration

Data integration involves combining data from various sources into a unified view for better business intelligence and decision-making, distinguishing it from data ingestion, which is a simpler process of data collection. The integration process includes complex tasks like ETL and requires skilled engineers, while modern challenges include managing real-time data and hybrid environments. Data transformation is essential for converting raw data into usable formats, and validation ensures the accuracy and reliability of the transformed data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

1.

Data Integration
What is Data Integration?
Data integration is the process of combining data from disparate sources across an organization into a unified
view, such as a data warehouse or data lake. The primary goal is to provide a single, consolidated dataset
that serves as a consistent fact base for business intelligence (BI) and analytics.
By unifying these datasets, organizations can align departments, improve customer experiences, and make
better-informed decisions.

Data Integration vs. Data Ingestion


The lecture notes make a critical distinction between Data Ingestion and Data Integration. While often
used interchangeably, they represent different stages of the data lifecycle:
• Data Ingestion is a simpler process that involves collecting and transferring data from various input
sources (like IoT, APIs, or databases) to a target storage system. It does not automatically ensure
data quality and requires less domain expertise.
• Data Integration is more complex. It involves unifying raw data, applying transformations (such as
cleaning, merging, and filtering), and loading it into a destination for analysis. This process ensures
that the data is high-quality and credible. Because it involves complex processes like ETL (Extract,
Transform, Load), metadata management, and governance, it typically requires skilled data engineers.

The Integration Process


Moving data effectively requires a data pipeline that understands the structure and meaning of the data. *
Simple pipelines may just perform ingestion (integrating data on a regular basis). * Complex pipelines
(ETL) include cleansing, sorting, and enrichment to make data ready for its final destination.

Modern Integration Challenges


The landscape of data integration has shifted due to the explosion of data sources (SaaS applications, cloud
services) and the demand for real-time analytics. Modern challenges include: * Continuous Data: Real-
time decision-making requires data to be transformed “in flight”. * Hybrid Environments: Integration
must span on-premises data centers and cloud environments. * Lifecycle Management: There is a need
for continuous integration and delivery (CI/CD) of data, focusing not just on implementation but on business
needs.

Data Integration Tools


The market offers stand-alone software products to handle data access, transformation, and delivery. These
tools are characterized by several key features: * Support for Multiple Styles: They handle bulk/batch
movement, replication, and stream integration. * Transformation Capabilities: They support basic
(string manipulation), intermediate (merging), and advanced (text mining) transformations. * Augmenta-
tion: Modern tools leverage Generative AI and ML to auto-generate pipeline code and detect anomalies. *
Governance and Ops: They include features for metadata management, data governance (lineage, mask-
ing), DataOps (automated testing, Git integration), and FinOps (cost optimization).

Integration Patterns: The Medallion Architecture


A common design pattern for organizing integrated data is the Medallion Architecture, which progres-
sively improves data quality as it flows through three layers: 1. Bronze Layer (Raw): Data is ingested
“as-is” from source systems. This layer focuses on quick capture and historical archiving. 2. Silver Layer
(Cleansed): Data is matched, merged, and cleansed to provide an “Enterprise view” of key entities (e.g.,
master customers, stores). 3. Gold Layer (Curated): Data is transformed into consumption-ready,
project-specific databases (e.g., for sales analytics or churn prediction), often using denormalized models for
reporting.

1
[Link] Transformations
Data transformation is the critical process of converting raw, unstructured, or inconsistent data into clean,
usable formats that fuel analytics, reporting, and machine learning. It serves as the bridge between raw data
ingestion and actionable business insights, ensuring that data is standardized, accurate, and aligned with
business requirements.

Key Types of Transformations Data transformation encompasses various techniques to refine data for
consumption: * Data Cleaning: Involves removing duplicates, handling missing values, correcting typos,
and validating formats such as email addresses or phone numbers. * Data Standardization: Converts
units of measurement, aligns date/time formats, and harmonizes naming conventions to ensure consistency.
* Data Aggregation: Summarizes granular data to provide higher-level insights. * Data Enrichment:
Enhances a dataset by adding new information from internal or external sources. * Data Integration:
Combines data from multiple disparate sources into a unified view. * Data Normalization: Structures
data to minimize redundancy and improve overall integrity. * Data Filtering and Selection: Narrows
down datasets based on specific conditions or rules.

The Data Transformation Process The lifecycle of transforming data typically follows a structured
six-step process:
1. Discovery: Before transforming data, teams must profile the source data to understand its structure,
types, volumes, and quality. This step identifies anomalies and inconsistencies, providing metadata
crucial for automation and lineage.
2. Data Mapping: This step defines how source data fields correspond to destination fields, ensuring the
correct values map to the correct schema (e.g., mapping first_name in a CRM to customerFirstName
in a warehouse).
3. Transformation Logic Design: Teams define the specific logic—such as regex-based cleaning, con-
ditional logic, or SQL aggregations—that will reshape the data.
4. Execution: The logic is executed via batch or streaming jobs. In modern ELT (Extract, Load,
Transform) workflows, this often happens within the data warehouse using tools like dbt, rather than
before loading (ETL).
5. Validation and Testing: Transformed data undergoes rigorous testing to ensure accuracy, complete-
ness, and conformity to expected formats.
6. Documentation and Monitoring: Workflows are documented, and pipelines are monitored for
failures or schema drift to maintain trust in the data.
Based on the “Analytics Engineering” lecture notes, here is an explanation of Data Transformation
and Modelling Patterns, structured for an academic understanding suitable for postgraduate engineering
students.

3. Data Transformation and Modelling Patterns


In analytics engineering, patterns represent reusable solutions to common data organization problems. Just
as design patterns exist in software engineering, data engineering relies on architecture patterns to struc-
ture data flow and ensure quality. The lecture notes highlight two primary, interconnected patterns: the
Medallion Architecture and the ELT-DBT Pattern.

1. The Medallion Architecture (Multi-Hop Architecture) The Medallion Architecture is a data


design pattern used to logically organize data within a lakehouse. Its primary goal is to incrementally
improve the structure and quality of data as it flows through three distinct layers: Bronze, Silver, and Gold.
• Bronze Layer (Raw Data)
– Function: Acts as a “Landing Zone” for data extracted from external source systems.
– Structure: Tables are a 1:1 reflection of source system structures. Data is stored “as-is” with
additional metadata columns (e.g., load date/time, process ID).

2
– Purpose: Facilitates quick Change Data Capture (CDC), provides a historical archive (cold
storage), ensures auditability, and allows for data reprocessing without re-reading from source
systems.
• Silver Layer (Cleansed and Conformed)
– Function: Provides an “Enterprise view” of key business entities (e.g., master customers, stores)
and transactions.
– Structure: Data from the Bronze layer is matched, merged, conformed, and cleansed.
– Transformation Intensity: Only “just-enough” transformations and cleansing rules are applied
to create a unified view.
• Gold Layer (Curated/Consumption-Ready)
– Function: Serves as the final presentation layer for specific projects, such as Customer Analytics,
Inventory Analytics, or Sales Reporting.
– Structure: Data is organized into project-specific databases using denormalized, read-optimized
data models (often Star Schemas) requiring fewer joins.
– Transformation Intensity: This layer applies the final data transformations and complex busi-
ness quality rules to make data ready for reporting and decision-making.

2. The ELT-DBT Modelling Pattern This pattern is a specific implementation of the Medallion
Architecture, often utilizing the dbt (data build tool) framework. It shifts from traditional ETL (Extract,
Transform, Load) to ELT (Extract, Load, Transform), where transformations occur inside the data
warehouse.
This pattern structures data modelling into four distinct layers:
• Landing Layer (Source Layer)
– Definition: The initial entry point, also known as the raw zone.
– Characteristics: A 1:1 reflection of source tables with no transformations applied.
• Staging Layer (stg_*)
– Definition: A replica of the source data prepared for downstream usage.
– Characteristics:
∗ Applies light modifications such as data type casting, column renaming, and filtering of
deleted records.
∗ Typically involves a 1:1 reflection of source tables with no joins between models.
– Purpose: To cleanse and standardize raw data.
• Intermediate Layer (int_*)
– Definition: The bridge between staging data and final business entities, applying business logic.
– Characteristics:
∗ Involves heavier transformations such as aggregations, calculations, CASE statements,
and window functions.
∗ Prepares data specifically for fact modelling.
• Marts Layer (dim_*, fact_*)
– Definition: The final presentation layer containing Dimensional Models (Dimensions) and Fact
Models (Measures).
– Characteristics:
∗ Utilizes joins and unions to combine data from intermediate models, facts, and dimensions.
∗ No complex transformation logic should reside here; it is purely for reflecting the business
perspective.

Summary of Mapping The industry standard often maps these two patterns together: * Bronze ↔
Landing / Source Data * Silver ↔ Staging & Intermediate Layers (Cleansing and Business Logic)
* Gold ↔ Marts Layer (Reporting and Analytics)
Some mature organizations extend this to a “Platinum” layer, which includes data enhanced by AI/ML
predictions.

3
4. Data Transformation Validation and Testing.
Introduction to Data Transformation Validation
Data transformation validation is the process of verifying that the data logic applied during the pipeline exe-
cution produces accurate, reliable, and consistent results. Even well-designed transformations can introduce
“silent errors” into analytics if not properly validated. The goal is to ensure the output is not just technically
executable, but business-accurate.

What to Test?
Testing is categorized into verifying basic technical transformations and complex logic.

1. Basic Transformations These tests focus on the structural and fundamental integrity of the data: *
Data Type Conversions: Ensuring fields are correctly cast (e.g., strings to integers). * Handling Null
Values: Verifying that missing values are handled or replaced appropriately. * String & Date Manip-
ulations: Checking that text is formatted correctly and date-time conversions are accurate. * Filtering:
Confirming that data exclusion logic (e.g., removing deleted records) works as expected. * Uniqueness &
Constraints: Validating that primary keys remain unique and mandatory fields are not empty.

2. Complex Transformations These tests focus on the analytical and business logic applied to the data:
* Aggregations: Verifying summary operations (e.g., monthly sales totals) match expectations. * Data
Integration: Ensuring data from diverse sources (e.g., SQL and NoSQL) is unified correctly without data
loss. * Data Normalization: Checking that redundancy is reduced and data is structured efficiently. *
Business Rules: Verifying specific logic, such as “Apply 15% VAT only to Category X”.

The Validation Process


The lecture notes outline a structured approach to validation:
1. Schema Validation:
• Ensures the structure conforms to specifications.
• Checks for the presence of required columns, correct data types, and adherence to constraints
(e.g., NOT NULL, UNIQUE).
• Validates referential integrity to ensure relationships between tables (e.g., Orders and Customers)
are preserved.
2. Contract Validation:
• Validates “Producer-Consumer contracts” to ensure that changes in the source do not break
downstream consumption.
3. Business Logic & Data Profiling:
• Source-to-Target Verification: Compares transformed data against original source values to
ensure accuracy.
• Data Profiling: Uses profiling tools to check data distributions and patterns (e.g., ensuring age
values fall within a realistic range).

Levels of Testing
Testing should occur at different granularities to catch different types of errors:
• Level 1: Unit Tests:
– Tests single transformation logic in isolation (e.g., testing a specific regex pattern or calculation).
Tools like dbt allow for generic and singular tests at this level.
• Level 2: Integration Tests:
– Verifies multiple transformations working together in an end-to-end pipeline to ensure the flow of
data is correct.
• Level 3: Regression Tests:

4
– Compares current outputs against historical outputs to detect unintended changes or “breaks” in
logic caused by new code.

Best Practices
To maintain a robust data pipeline, the notes recommend the following best practices: * Document Every
Rule: Maintain a mapping sheet linking source fields, transformation logic, and target fields. * Automate
Validation: Use SQL scripts or tools to run checks automatically, rather than relying on manual inspection.
* Test with Realistic Volumes: Simulate production-scale data to catch performance bottlenecks that
small test datasets might hide. * Test Edge Cases: specifically test for missing data, extreme values, or
special characters to ensure the pipeline doesn’t fail under specific conditions.

4. Documentation of Data Models.


Introduction to Data Model Documentation
Data model documentation is the process of recording the structure, logic, and intended use of data systems.
It is not limited to simple text descriptions; it encompasses diagrams, table definitions, usage scenarios,
and design notes.
The primary goal of this documentation is to help teams—both technical and non-technical—understand
how data is organized within the system and the reasoning behind specific design decisions. It serves as a
bridge ensuring alignment between data structures and business goals.

Types of Data Models


Documentation typically covers three distinct levels of abstraction, each serving a different audience and
purpose:
1. Conceptual Data Model:
• Purpose: Acts as a blueprint for the future system.
• Audience: Business Stakeholders.
• Content: A high-level overview defining components and rules in strictly business terms.
2. Logical Data Model:
• Purpose: Provides detail on the system’s entities, attributes, and relationships. It is used to
develop the database schema.
• Audience: Data Architects and Business Analysts.
• Content: Clarification of business rules and data structures.
3. Physical Data Model:
• Purpose: Represents the technical implementation on a specific database.
• Audience: DBAs (Database Administrators) and Developers.
• Content: Detailed physical structures, including data types, indexing, and access methods.

Benefits of Documentation
Proper documentation is essential for the scalability and maintainability of data systems. The notes highlight
five key benefits: * Improved Clarity: It translates complex data structures into formats understandable by
non-technical stakeholders. * Easier Maintenance: It accelerates updates and bug fixes by clearly showing
how components are connected. * Faster Onboarding: It allows new team members to quickly grasp
the model’s logic and assumptions. * Consistent Data Use: It standardizes definitions and structures,
preventing misinterpretation of data. * Better Collaboration: It fosters a shared understanding across
analysts, engineers, and decision-makers.

Tools for Documentation and Management


The lecture notes identify several tools that facilitate data modeling and documentation:

5
• dbt(data build tool):
– Provides robust features for version control, modularization, and documentation.
–Allows stakeholders to access the latest version of the data model in a collaborative environment.
– Enables effective documentation of models, helping team members understand the purpose and
functionality of the data.
• DataHub:
– A modern data catalog used for metadata management, discovery, and governance.
– It helps track data lineage, profile datasets, and establish data contracts.
• Amundsen:
– A data discovery and metadata engine (originally from Lyft) designed to create a centralized
repository.
– It features “popularity ranking” for frequently visited data tags and provides in-line metadata to
save time for users.

5. Data Versioning
What is Data Versioning?
Data versioning is the practice of storing different versions of data that were created or modified at specific
points in time. Much like code version control (e.g., Git) allows developers to track changes in software, data
versioning allows data engineers and scientists to track changes in datasets. This ensures that organizations
can access, retrieve, or restore data exactly as it existed at any previous moment.

Why is Data Versioning Important?


The lecture notes highlight several critical reasons for implementing data versioning:
1. Risk-Free Testing for AI/ML: Data scientists frequently test new Machine Learning (ML) models
to improve efficiency. These tests often require modifying datasets. Versioning preserves the original
“working” dataset, allowing teams to revert instantly if a new experiment fails or proves inefficient.
2. Reproducibility: Datasets change naturally over time (e.g., live sales transactions). To reproduce
a specific analysis or model result from the past, the organization must be able to restore the exact
environment and data state from that time.
3. Business Performance Analysis: By storing historical versions of data (e.g., sales data over several
years), businesses can analyze consumer trends and preferences over time, leading to better strategic
decisions.
4. Compliance and Auditing: Regulations like GDPR often require companies to maintain historical
records of data. Versioning simplifies internal and external audits by providing a clear trail of what
data existed and when.

Naming Conventions for Versions


To manage versions effectively, standardized naming conventions are recommended: * Semantic Version-
ing: Uses a three-part number (e.g., 3.2.4). * The first number (3) indicates significant changes. * The
second number (2) indicates new features. * The third number (4) indicates minor bug fixes. * Status-
Based: Naming based on the data’s state, such as raw, filtered, or cleaned. * Process-Based: Naming
based on the transformation applied, such as normalized.

Implementation Approaches
The notes outline three primary methods for implementing data versioning:

1. Full Duplication This involves manually saving a full copy of the dataset to a new location every
time a version is created. * Pros: Simple to understand and implement; works for small datasets. * Cons:
Highly inefficient regarding storage space; error-prone because users must manually hardcode date values or
paths in their queries. * Best For: Small firms or individual work where sensitive data must be isolated.

6
2. Valid From/To Metadata This is an incremental approach where two metadata columns (valid_from
and valid_to) are added to the table. * Mechanism: When a record is updated, the old record is preserved,
and its valid_to date is updated. The new record is appended with a new valid_from date. * Pros: More
space-efficient than full duplication. * Cons: Interaction is limited to adding date filters to SQL queries,
which can become complex. * Best For: “Time traveling” within a single tabular dataset.

3. Tool-Based Versioning (Recommended) This approach treats versioning as a core capability of


the infrastructure, often extending the Git model to data. It aims to minimize storage footprints (by not
copying unchanged data) and provides direct operations like “create version” or “compare versions”.

Challenges
• Storage Space: Storing multiple versions increases storage requirements and costs. Organizations
must balance the granularity of versioning with their budget.
• Security: More stored versions increase the surface area for potential data leaks. This is a significant
concern in cloud environments where IT functions are outsourced.
• Tool Selection: Teams must evaluate tools based on criteria like storage capacity, open-source avail-
ability, cloud support, and ease of use.

Popular Data Versioning Tools


The lecture notes list several tools designed to handle these challenges: * DVC (Data Version Con-
trol): An open-source tool that brings Git-like version control to machine learning projects. It tracks data
provenance and ensures reproducibility. * Delta Lake: An open-source storage layer that brings ACID
transactions to Apache Spark and big data workloads. It supports “time travel” to query earlier versions of
data. * Pachyderm: A platform that offers containerized data lineage and versioning, allowing teams to
track data commits in separate branches. * LakeFS: Provides a Git-like branching and committing model
specifically for data lakes (S3/GCS), scaling to petabytes of data. * Git LFS (Large File Storage): An
extension for Git that replaces large files with text pointers, storing the actual content on a remote server.
* Dolt: A SQL database that supports forking, cloning, and merging, allowing data and schema to evolve
together.

6. Data Lineage
Definition and Purpose Data lineage is the process of tracing and visualizing the flow of data from its
origin to its final destination. It tracks the data’s journey, documenting every alteration, transformation,
split, or merger it undergoes along the way.
The primary objective of data lineage is to answer critical questions regarding data provenance: * Where did
a specific piece of data originate? * When and where did the data separate or merge with other datasets?
Mechanism Data lineage functions by creating a data mapping framework. This is achieved by collecting
and managing metadata from every step of the data pipeline. This metadata is updated for each process
applied to the data and stored in a central metadata repository, which serves as the foundation for lineage
analysis.
Components of a Lineage Tracker To ensure comprehensive lineage, especially in modern engineering
environments, a lineage tracker must account for various traceable resources:
1. Infrastructure as Code (IaC):
• Cloud computing resources (compute, storage, network) are provisioned via automation (IaC).
• To maintain lineage, IaC code changes must be committed to a version-controlled repository to
avoid configuration drift.
2. Data:
• Data schemas and metadata should be stored in version control systems.
• Actual data is typically stored in storage media like a data lake, with location links maintained
in configuration files.

7
3. Implementation Code:
• All changes to the code responsible for processing data must be stored as point-in-time commits
in version control.
4. Model Feature List:
• A Feature Store is used to maintain details of machine learning features, including their previous
versions, ensuring point-in-time accuracy.
5. Model Algorithm Code:
• The specific algorithms used for modeling must be version-controlled to track changes over time.
6. Model Container Image:
• Versions of the model container images (used for deployment) must be stored in container repos-
itories managed by a container registry.

7. Reproducibility (Context of Training Datasets)


Reproducibility is the ability to recreate a specific data environment or result from a past point in time. In
the context of Machine Learning (ML) and Analytics Engineering, it is inextricably linked to data lineage.
Importance in ML The lecture notes emphasize a core tenet: “If you can’t explain the data that
trained the model, you can’t trust the model”. Reproducibility ensures that if issues arise in a
production or training environment, engineers can reproduce the “last best version” of the system.
Key Elements to Track for Reproducibility To guarantee reproducibility for training datasets, the
following elements must be tracked via lineage:
1. Data Origin and Transformations: Pinpointing the exact source (e.g., APIs, databases) and every
processing step (cleaning, feature engineering, aggregation).
2. Code and Parameters: Tracking specific source code versions (via Git) and the hyperparameters
used during the training phase.
3. Model Artifacts: Capturing different versions of trained models along with their associated perfor-
mance metrics and evaluation reports.
4. Deployment and Usage: Monitoring where specific models are deployed and used for making pre-
dictions.
Below are case-study based, long-answer university-style questions created strictly from your slide
“L9 – Analytics Engineering” .
The cases integrate:
• Data Integration
• ETL vs ELT
• Medallion Architecture
• ELT-DBT pattern
• Transformation testing
• Schema validation
• Data versioning
• Lineage & reproducibility
• Data governance
• Tool selection
All questions are structured for 15–20 mark descriptive answers.

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL


A large retail enterprise wants to unify data from:
• POS systems (on-premise database)
• E-commerce platform (cloud SaaS API)

8
• Inventory system (ERP)
• Marketing tools
• IoT sensors from warehouses
They want:
• Real-time dashboards
• AI-based demand forecasting
• Historical analysis
• Regulatory compliance
• Scalable cloud architecture
However:
• Data arrives in different formats.
• Schemas change frequently.
• Business teams demand self-service analytics.
• Model training datasets must be reproducible.
Question:
(a) Explain the difference between Data Ingestion and Data Integration in this context.
(b) Propose an architecture using Medallion Architecture.
(c) Explain whether ETL or ELT is more appropriate and why.
(d) Discuss validation and testing strategies.
(e) Explain how data versioning and lineage would support AI use cases.
Model Answer Outline:
(a) Data Ingestion vs Integration:
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.

9
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE


PIPELINE
An insurance company processes claims through an ELT pipeline.
Recent incident:
• VAT incorrectly applied to all categories.
• Monthly payout aggregation incorrect.
• Performance slowed after adding fraud-detection logic.
• No documentation exists for transformation rules.
Question:
(a) Identify where in the transformation process failure likely occurred.
(b) Explain what should have been tested.
(c) Describe levels of testing required.
(d) Suggest best practices to avoid recurrence.
Model Answer Outline:
(a) Likely Failure:
• Transformation Logic Design stage.
• Lack of Business Rule Validation.
• Possibly improper aggregation logic.
(b) What to Test:
Basic Transformations:
• Data type conversions.
• String manipulation.
• Date transformation.
• Aggregation validation.
• Null handling.
Complex Transformations:
• Conditional VAT logic.
• Fraud-detection integration.

10
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT


A company migrates from traditional ETL to a lakehouse architecture using dbt.
Requirements:
• Maintain raw historical archive.
• Standardize multiple CRM sources.
• Build dimensional models.
• Support BI dashboards.
• Maintain modular SQL transformations.
Question:
(a) Map Medallion Architecture to ELT-DBT layers.
(b) Explain the purpose of each dbt layer.
(c) Why should marts not contain transformation logic?
(d) Explain how dbt supports collaboration and documentation.
Model Answer Outline:
(a) Mapping:
Bronze → Landing (raw, 1:1 source reflection). Silver → Staging + Intermediate layers. Gold → Marts
layer.
(b) DBT Layers:
Landing:
• Raw data.
• No transformations.
Staging:
• Light transformations.
• Casting, renaming.
Intermediate:
• Business logic.
• Aggregations.
• Calculations.
• Window functions.
Marts:
• Fact and dimension tables.
• Joins across layers.
• Analytics-ready model.

11
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM


An ML team frequently retrains models using updated sales data.
Problems:
• Overwriting datasets.
• Cannot reproduce old experiments.
• Compliance audit requires past data snapshots.
• Storage costs increasing rapidly.
Question:
(a) Compare Full Duplication vs Valid From/To vs Tool-Based versioning.
(b) Recommend a strategy for large-scale production.
(c) Discuss challenges in versioning.
(d) Explain how versioning improves ML lifecycle management.
Model Answer Outline:
(a) Comparison:
Full Duplication:
• Simple.
• High storage cost.
• Error-prone.
Valid From/To:
• Space efficient.
• Time-travel capability.
• Requires metadata filtering.
Tool-Based:
• Git-like versioning.
• Efficient storage.
• Branching, merging.
• ACID compliance (Delta Lake, lakeFS).
(b) Recommendation:
• Tool-based (e.g., Delta Lake, lakeFS, DVC).
• Supports large-scale structured and unstructured data.
(c) Challenges:

12
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT


A deployed recommendation model starts producing biased outputs.
Investigation reveals:
• New transformation added in staging.
• No metadata updated.
• Feature store version unclear.
• Model retrained without documentation.
• No audit trail of dataset version used.
Question:
(a) Define Data Lineage and its importance.
(b) Explain components required for full lineage tracking.
(c) How does lineage support reproducibility?
(d) Design a lineage-aware ML pipeline.
Model Answer Outline:
(a) Data Lineage:
• Track origin, transformation, movement of data.
• Identify where data came from and how modified.
• Essential for trust and debugging.
(b) Components:
• Infrastructure as Code.
• Data schemas in version control.
• Implementation code in Git.
• Feature store versioning.
• Model algorithm version.
• Container image version.
• Metadata repository.
(c) Reproducibility:
• Recreate training environment.
• Restore exact data version.
• Re-run experiment with same hyperparameters.
• Diagnose production failures.
(d) Lineage-aware pipeline:
• Store metadata at each transformation step.

13
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.
Below are case-study based, long-answer university-style questions created strictly from your slide
“L9 – Analytics Engineering” .
The cases integrate:
• Data Integration
• ETL vs ELT
• Medallion Architecture
• ELT-DBT pattern
• Transformation testing
• Schema validation
• Data versioning
• Lineage & reproducibility
• Data governance
• Tool selection
All questions are structured for 15–20 mark descriptive answers.

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL


A large retail enterprise wants to unify data from:
• POS systems (on-premise database)
• E-commerce platform (cloud SaaS API)
• Inventory system (ERP)
• Marketing tools
• IoT sensors from warehouses
They want:
• Real-time dashboards
• AI-based demand forecasting
• Historical analysis
• Regulatory compliance
• Scalable cloud architecture
However:
• Data arrives in different formats.
• Schemas change frequently.
• Business teams demand self-service analytics.
• Model training datasets must be reproducible.
Question:
(a) Explain the difference between Data Ingestion and Data Integration in this context.
(b) Propose an architecture using Medallion Architecture.
(c) Explain whether ETL or ELT is more appropriate and why.
(d) Discuss validation and testing strategies.
(e) Explain how data versioning and lineage would support AI use cases.
Model Answer Outline:
(a) Data Ingestion vs Integration:

14
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE


PIPELINE
An insurance company processes claims through an ELT pipeline.
Recent incident:
• VAT incorrectly applied to all categories.
• Monthly payout aggregation incorrect.

15
• Performance slowed after adding fraud-detection logic.
• No documentation exists for transformation rules.
Question:
(a) Identify where in the transformation process failure likely occurred.
(b) Explain what should have been tested.
(c) Describe levels of testing required.
(d) Suggest best practices to avoid recurrence.
Model Answer Outline:
(a) Likely Failure:
• Transformation Logic Design stage.
• Lack of Business Rule Validation.
• Possibly improper aggregation logic.
(b) What to Test:
Basic Transformations:
• Data type conversions.
• String manipulation.
• Date transformation.
• Aggregation validation.
• Null handling.
Complex Transformations:
• Conditional VAT logic.
• Fraud-detection integration.
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT


A company migrates from traditional ETL to a lakehouse architecture using dbt.
Requirements:
• Maintain raw historical archive.
• Standardize multiple CRM sources.
• Build dimensional models.
• Support BI dashboards.
• Maintain modular SQL transformations.
Question:
(a) Map Medallion Architecture to ELT-DBT layers.
(b) Explain the purpose of each dbt layer.
(c) Why should marts not contain transformation logic?

16
(d) Explain how dbt supports collaboration and documentation.
Model Answer Outline:
(a) Mapping:
Bronze → Landing (raw, 1:1 source reflection). Silver → Staging + Intermediate layers. Gold → Marts
layer.
(b) DBT Layers:
Landing:
• Raw data.
• No transformations.
Staging:
• Light transformations.
• Casting, renaming.
Intermediate:
• Business logic.
• Aggregations.
• Calculations.
• Window functions.
Marts:
• Fact and dimension tables.
• Joins across layers.
• Analytics-ready model.
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM


An ML team frequently retrains models using updated sales data.
Problems:
• Overwriting datasets.
• Cannot reproduce old experiments.
• Compliance audit requires past data snapshots.
• Storage costs increasing rapidly.
Question:
(a) Compare Full Duplication vs Valid From/To vs Tool-Based versioning.

17
(b) Recommend a strategy for large-scale production.
(c) Discuss challenges in versioning.
(d) Explain how versioning improves ML lifecycle management.
Model Answer Outline:
(a) Comparison:
Full Duplication:
• Simple.
• High storage cost.
• Error-prone.
Valid From/To:
• Space efficient.
• Time-travel capability.
• Requires metadata filtering.
Tool-Based:
• Git-like versioning.
• Efficient storage.
• Branching, merging.
• ACID compliance (Delta Lake, lakeFS).
(b) Recommendation:
• Tool-based (e.g., Delta Lake, lakeFS, DVC).
• Supports large-scale structured and unstructured data.
(c) Challenges:
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT


A deployed recommendation model starts producing biased outputs.
Investigation reveals:
• New transformation added in staging.
• No metadata updated.
• Feature store version unclear.
• Model retrained without documentation.
• No audit trail of dataset version used.
Question:
(a) Define Data Lineage and its importance.

18
(b) Explain components required for full lineage tracking.
(c) How does lineage support reproducibility?
(d) Design a lineage-aware ML pipeline.
Model Answer Outline:
(a) Data Lineage:
• Track origin, transformation, movement of data.
• Identify where data came from and how modified.
• Essential for trust and debugging.
(b) Components:
• Infrastructure as Code.
• Data schemas in version control.
• Implementation code in Git.
• Feature store versioning.
• Model algorithm version.
• Container image version.
• Metadata repository.
(c) Reproducibility:
• Recreate training environment.
• Restore exact data version.
• Re-run experiment with same hyperparameters.
• Diagnose production failures.
(d) Lineage-aware pipeline:
• Store metadata at each transformation step.
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.

19

You might also like