0% found this document useful (0 votes)

16 views19 pages

Data Integration

Data integration involves combining data from various sources into a unified view for better business intelligence and decision-making, distinguishing it from data ingestion, which is a simpler process of data collection. The integration process includes complex tasks like ETL and requires skilled engineers, while modern challenges include managing real-time data and hybrid environments. Data transformation is essential for converting raw data into usable formats, and validation ensures the accuracy and reliability of the transformed data.

Uploaded by

jmdestateanddevelopers

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views19 pages

Data Integration

Uploaded by

jmdestateanddevelopers

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

1.

Data Integration
What is Data Integration?
Data integration is the process of combining data from disparate sources across an organization into a unified
view, such as a data warehouse or data lake. The primary goal is to provide a single, consolidated dataset
that serves as a consistent fact base for business intelligence (BI) and analytics.
By unifying these datasets, organizations can align departments, improve customer experiences, and make
better-informed decisions.

Data Integration vs. Data Ingestion

The lecture notes make a critical distinction between Data Ingestion and Data Integration. While often
used interchangeably, they represent different stages of the data lifecycle:
• Data Ingestion is a simpler process that involves collecting and transferring data from various input
sources (like IoT, APIs, or databases) to a target storage system. It does not automatically ensure
data quality and requires less domain expertise.
• Data Integration is more complex. It involves unifying raw data, applying transformations (such as
cleaning, merging, and filtering), and loading it into a destination for analysis. This process ensures
that the data is high-quality and credible. Because it involves complex processes like ETL (Extract,
Transform, Load), metadata management, and governance, it typically requires skilled data engineers.

The Integration Process

Moving data effectively requires a data pipeline that understands the structure and meaning of the data. *
Simple pipelines may just perform ingestion (integrating data on a regular basis). * Complex pipelines
(ETL) include cleansing, sorting, and enrichment to make data ready for its final destination.

Modern Integration Challenges

The landscape of data integration has shifted due to the explosion of data sources (SaaS applications, cloud
services) and the demand for real-time analytics. Modern challenges include: * Continuous Data: Real-
time decision-making requires data to be transformed “in flight”. * Hybrid Environments: Integration
must span on-premises data centers and cloud environments. * Lifecycle Management: There is a need
for continuous integration and delivery (CI/CD) of data, focusing not just on implementation but on business
needs.

Data Integration Tools

The market offers stand-alone software products to handle data access, transformation, and delivery. These
tools are characterized by several key features: * Support for Multiple Styles: They handle bulk/batch
movement, replication, and stream integration. * Transformation Capabilities: They support basic
(string manipulation), intermediate (merging), and advanced (text mining) transformations. * Augmenta-
tion: Modern tools leverage Generative AI and ML to auto-generate pipeline code and detect anomalies. *
Governance and Ops: They include features for metadata management, data governance (lineage, mask-
ing), DataOps (automated testing, Git integration), and FinOps (cost optimization).

Integration Patterns: The Medallion Architecture

A common design pattern for organizing integrated data is the Medallion Architecture, which progres-
sively improves data quality as it flows through three layers: 1. Bronze Layer (Raw): Data is ingested
“as-is” from source systems. This layer focuses on quick capture and historical archiving. 2. Silver Layer
(Cleansed): Data is matched, merged, and cleansed to provide an “Enterprise view” of key entities (e.g.,
master customers, stores). 3. Gold Layer (Curated): Data is transformed into consumption-ready,
project-specific databases (e.g., for sales analytics or churn prediction), often using denormalized models for
reporting.

1
[Link] Transformations
Data transformation is the critical process of converting raw, unstructured, or inconsistent data into clean,
usable formats that fuel analytics, reporting, and machine learning. It serves as the bridge between raw data
ingestion and actionable business insights, ensuring that data is standardized, accurate, and aligned with
business requirements.

Key Types of Transformations Data transformation encompasses various techniques to refine data for
consumption: * Data Cleaning: Involves removing duplicates, handling missing values, correcting typos,
and validating formats such as email addresses or phone numbers. * Data Standardization: Converts
units of measurement, aligns date/time formats, and harmonizes naming conventions to ensure consistency.
* Data Aggregation: Summarizes granular data to provide higher-level insights. * Data Enrichment:
Enhances a dataset by adding new information from internal or external sources. * Data Integration:
Combines data from multiple disparate sources into a unified view. * Data Normalization: Structures
data to minimize redundancy and improve overall integrity. * Data Filtering and Selection: Narrows
down datasets based on specific conditions or rules.

The Data Transformation Process The lifecycle of transforming data typically follows a structured
six-step process:
1. Discovery: Before transforming data, teams must profile the source data to understand its structure,
types, volumes, and quality. This step identifies anomalies and inconsistencies, providing metadata
crucial for automation and lineage.
2. Data Mapping: This step defines how source data fields correspond to destination fields, ensuring the
correct values map to the correct schema (e.g., mapping first_name in a CRM to customerFirstName
in a warehouse).
3. Transformation Logic Design: Teams define the specific logic—such as regex-based cleaning, con-
ditional logic, or SQL aggregations—that will reshape the data.
4. Execution: The logic is executed via batch or streaming jobs. In modern ELT (Extract, Load,
Transform) workflows, this often happens within the data warehouse using tools like dbt, rather than
before loading (ETL).
5. Validation and Testing: Transformed data undergoes rigorous testing to ensure accuracy, complete-
ness, and conformity to expected formats.
6. Documentation and Monitoring: Workflows are documented, and pipelines are monitored for
failures or schema drift to maintain trust in the data.
Based on the “Analytics Engineering” lecture notes, here is an explanation of Data Transformation
and Modelling Patterns, structured for an academic understanding suitable for postgraduate engineering
students.

3. Data Transformation and Modelling Patterns

In analytics engineering, patterns represent reusable solutions to common data organization problems. Just
as design patterns exist in software engineering, data engineering relies on architecture patterns to struc-
ture data flow and ensure quality. The lecture notes highlight two primary, interconnected patterns: the
Medallion Architecture and the ELT-DBT Pattern.

1. The Medallion Architecture (Multi-Hop Architecture) The Medallion Architecture is a data

design pattern used to logically organize data within a lakehouse. Its primary goal is to incrementally
improve the structure and quality of data as it flows through three distinct layers: Bronze, Silver, and Gold.
• Bronze Layer (Raw Data)
– Function: Acts as a “Landing Zone” for data extracted from external source systems.
– Structure: Tables are a 1:1 reflection of source system structures. Data is stored “as-is” with
additional metadata columns (e.g., load date/time, process ID).

2
– Purpose: Facilitates quick Change Data Capture (CDC), provides a historical archive (cold
storage), ensures auditability, and allows for data reprocessing without re-reading from source
systems.
• Silver Layer (Cleansed and Conformed)
– Function: Provides an “Enterprise view” of key business entities (e.g., master customers, stores)
and transactions.
– Structure: Data from the Bronze layer is matched, merged, conformed, and cleansed.
– Transformation Intensity: Only “just-enough” transformations and cleansing rules are applied
to create a unified view.
• Gold Layer (Curated/Consumption-Ready)
– Function: Serves as the final presentation layer for specific projects, such as Customer Analytics,
Inventory Analytics, or Sales Reporting.
– Structure: Data is organized into project-specific databases using denormalized, read-optimized
data models (often Star Schemas) requiring fewer joins.
– Transformation Intensity: This layer applies the final data transformations and complex busi-
ness quality rules to make data ready for reporting and decision-making.

2. The ELT-DBT Modelling Pattern This pattern is a specific implementation of the Medallion
Architecture, often utilizing the dbt (data build tool) framework. It shifts from traditional ETL (Extract,
Transform, Load) to ELT (Extract, Load, Transform), where transformations occur inside the data
warehouse.
This pattern structures data modelling into four distinct layers:
• Landing Layer (Source Layer)
– Definition: The initial entry point, also known as the raw zone.
– Characteristics: A 1:1 reflection of source tables with no transformations applied.
• Staging Layer (stg_*)
– Definition: A replica of the source data prepared for downstream usage.
– Characteristics:
∗ Applies light modifications such as data type casting, column renaming, and filtering of
deleted records.
∗ Typically involves a 1:1 reflection of source tables with no joins between models.
– Purpose: To cleanse and standardize raw data.
• Intermediate Layer (int_*)
– Definition: The bridge between staging data and final business entities, applying business logic.
– Characteristics:
∗ Involves heavier transformations such as aggregations, calculations, CASE statements,
and window functions.
∗ Prepares data specifically for fact modelling.
• Marts Layer (dim_*, fact_*)
– Definition: The final presentation layer containing Dimensional Models (Dimensions) and Fact
Models (Measures).
– Characteristics:
∗ Utilizes joins and unions to combine data from intermediate models, facts, and dimensions.
∗ No complex transformation logic should reside here; it is purely for reflecting the business
perspective.

Summary of Mapping The industry standard often maps these two patterns together: * Bronze ↔
Landing / Source Data * Silver ↔ Staging & Intermediate Layers (Cleansing and Business Logic)
* Gold ↔ Marts Layer (Reporting and Analytics)
Some mature organizations extend this to a “Platinum” layer, which includes data enhanced by AI/ML
predictions.

3
4. Data Transformation Validation and Testing.
Introduction to Data Transformation Validation
Data transformation validation is the process of verifying that the data logic applied during the pipeline exe-
cution produces accurate, reliable, and consistent results. Even well-designed transformations can introduce
“silent errors” into analytics if not properly validated. The goal is to ensure the output is not just technically
executable, but business-accurate.

What to Test?
Testing is categorized into verifying basic technical transformations and complex logic.

1. Basic Transformations These tests focus on the structural and fundamental integrity of the data: *
Data Type Conversions: Ensuring fields are correctly cast (e.g., strings to integers). * Handling Null
Values: Verifying that missing values are handled or replaced appropriately. * String & Date Manip-
ulations: Checking that text is formatted correctly and date-time conversions are accurate. * Filtering:
Confirming that data exclusion logic (e.g., removing deleted records) works as expected. * Uniqueness &
Constraints: Validating that primary keys remain unique and mandatory fields are not empty.

2. Complex Transformations These tests focus on the analytical and business logic applied to the data:
* Aggregations: Verifying summary operations (e.g., monthly sales totals) match expectations. * Data
Integration: Ensuring data from diverse sources (e.g., SQL and NoSQL) is unified correctly without data
loss. * Data Normalization: Checking that redundancy is reduced and data is structured eﬀiciently. *
Business Rules: Verifying specific logic, such as “Apply 15% VAT only to Category X”.

The Validation Process

The lecture notes outline a structured approach to validation:
1. Schema Validation:
• Ensures the structure conforms to specifications.
• Checks for the presence of required columns, correct data types, and adherence to constraints
(e.g., NOT NULL, UNIQUE).
• Validates referential integrity to ensure relationships between tables (e.g., Orders and Customers)
are preserved.
2. Contract Validation:
• Validates “Producer-Consumer contracts” to ensure that changes in the source do not break
downstream consumption.
3. Business Logic & Data Profiling:
• Source-to-Target Verification: Compares transformed data against original source values to
ensure accuracy.
• Data Profiling: Uses profiling tools to check data distributions and patterns (e.g., ensuring age
values fall within a realistic range).

Levels of Testing
Testing should occur at different granularities to catch different types of errors:
• Level 1: Unit Tests:
– Tests single transformation logic in isolation (e.g., testing a specific regex pattern or calculation).
Tools like dbt allow for generic and singular tests at this level.
• Level 2: Integration Tests:
– Verifies multiple transformations working together in an end-to-end pipeline to ensure the flow of
data is correct.
• Level 3: Regression Tests:

4
– Compares current outputs against historical outputs to detect unintended changes or “breaks” in
logic caused by new code.

Best Practices
To maintain a robust data pipeline, the notes recommend the following best practices: * Document Every
Rule: Maintain a mapping sheet linking source fields, transformation logic, and target fields. * Automate
Validation: Use SQL scripts or tools to run checks automatically, rather than relying on manual inspection.
* Test with Realistic Volumes: Simulate production-scale data to catch performance bottlenecks that
small test datasets might hide. * Test Edge Cases: specifically test for missing data, extreme values, or
special characters to ensure the pipeline doesn’t fail under specific conditions.

4. Documentation of Data Models.

Introduction to Data Model Documentation
Data model documentation is the process of recording the structure, logic, and intended use of data systems.
It is not limited to simple text descriptions; it encompasses diagrams, table definitions, usage scenarios,
and design notes.
The primary goal of this documentation is to help teams—both technical and non-technical—understand
how data is organized within the system and the reasoning behind specific design decisions. It serves as a
bridge ensuring alignment between data structures and business goals.

Types of Data Models

Documentation typically covers three distinct levels of abstraction, each serving a different audience and
purpose:
1. Conceptual Data Model:
• Purpose: Acts as a blueprint for the future system.
• Audience: Business Stakeholders.
• Content: A high-level overview defining components and rules in strictly business terms.
2. Logical Data Model:
• Purpose: Provides detail on the system’s entities, attributes, and relationships. It is used to
develop the database schema.
• Audience: Data Architects and Business Analysts.
• Content: Clarification of business rules and data structures.
3. Physical Data Model:
• Purpose: Represents the technical implementation on a specific database.
• Audience: DBAs (Database Administrators) and Developers.
• Content: Detailed physical structures, including data types, indexing, and access methods.

Benefits of Documentation
Proper documentation is essential for the scalability and maintainability of data systems. The notes highlight
five key benefits: * Improved Clarity: It translates complex data structures into formats understandable by
non-technical stakeholders. * Easier Maintenance: It accelerates updates and bug fixes by clearly showing
how components are connected. * Faster Onboarding: It allows new team members to quickly grasp
the model’s logic and assumptions. * Consistent Data Use: It standardizes definitions and structures,
preventing misinterpretation of data. * Better Collaboration: It fosters a shared understanding across
analysts, engineers, and decision-makers.

Tools for Documentation and Management

The lecture notes identify several tools that facilitate data modeling and documentation:

5
• dbt(data build tool):
– Provides robust features for version control, modularization, and documentation.
–Allows stakeholders to access the latest version of the data model in a collaborative environment.
– Enables effective documentation of models, helping team members understand the purpose and
functionality of the data.
• DataHub:
– A modern data catalog used for metadata management, discovery, and governance.
– It helps track data lineage, profile datasets, and establish data contracts.
• Amundsen:
– A data discovery and metadata engine (originally from Lyft) designed to create a centralized
repository.
– It features “popularity ranking” for frequently visited data tags and provides in-line metadata to
save time for users.

5. Data Versioning
What is Data Versioning?
Data versioning is the practice of storing different versions of data that were created or modified at specific
points in time. Much like code version control (e.g., Git) allows developers to track changes in software, data
versioning allows data engineers and scientists to track changes in datasets. This ensures that organizations
can access, retrieve, or restore data exactly as it existed at any previous moment.

Why is Data Versioning Important?

The lecture notes highlight several critical reasons for implementing data versioning:
1. Risk-Free Testing for AI/ML: Data scientists frequently test new Machine Learning (ML) models
to improve eﬀiciency. These tests often require modifying datasets. Versioning preserves the original
“working” dataset, allowing teams to revert instantly if a new experiment fails or proves ineﬀicient.
2. Reproducibility: Datasets change naturally over time (e.g., live sales transactions). To reproduce
a specific analysis or model result from the past, the organization must be able to restore the exact
environment and data state from that time.
3. Business Performance Analysis: By storing historical versions of data (e.g., sales data over several
years), businesses can analyze consumer trends and preferences over time, leading to better strategic
decisions.
4. Compliance and Auditing: Regulations like GDPR often require companies to maintain historical
records of data. Versioning simplifies internal and external audits by providing a clear trail of what
data existed and when.

Naming Conventions for Versions

To manage versions effectively, standardized naming conventions are recommended: * Semantic Version-
ing: Uses a three-part number (e.g., 3.2.4). * The first number (3) indicates significant changes. * The
second number (2) indicates new features. * The third number (4) indicates minor bug fixes. * Status-
Based: Naming based on the data’s state, such as raw, filtered, or cleaned. * Process-Based: Naming
based on the transformation applied, such as normalized.

Implementation Approaches
The notes outline three primary methods for implementing data versioning:

1. Full Duplication This involves manually saving a full copy of the dataset to a new location every
time a version is created. * Pros: Simple to understand and implement; works for small datasets. * Cons:
Highly ineﬀicient regarding storage space; error-prone because users must manually hardcode date values or
paths in their queries. * Best For: Small firms or individual work where sensitive data must be isolated.

6
2. Valid From/To Metadata This is an incremental approach where two metadata columns (valid_from
and valid_to) are added to the table. * Mechanism: When a record is updated, the old record is preserved,
and its valid_to date is updated. The new record is appended with a new valid_from date. * Pros: More
space-eﬀicient than full duplication. * Cons: Interaction is limited to adding date filters to SQL queries,
which can become complex. * Best For: “Time traveling” within a single tabular dataset.

3. Tool-Based Versioning (Recommended) This approach treats versioning as a core capability of

the infrastructure, often extending the Git model to data. It aims to minimize storage footprints (by not
copying unchanged data) and provides direct operations like “create version” or “compare versions”.

Challenges
• Storage Space: Storing multiple versions increases storage requirements and costs. Organizations
must balance the granularity of versioning with their budget.
• Security: More stored versions increase the surface area for potential data leaks. This is a significant
concern in cloud environments where IT functions are outsourced.
• Tool Selection: Teams must evaluate tools based on criteria like storage capacity, open-source avail-
ability, cloud support, and ease of use.

Popular Data Versioning Tools

The lecture notes list several tools designed to handle these challenges: * DVC (Data Version Con-
trol): An open-source tool that brings Git-like version control to machine learning projects. It tracks data
provenance and ensures reproducibility. * Delta Lake: An open-source storage layer that brings ACID
transactions to Apache Spark and big data workloads. It supports “time travel” to query earlier versions of
data. * Pachyderm: A platform that offers containerized data lineage and versioning, allowing teams to
track data commits in separate branches. * LakeFS: Provides a Git-like branching and committing model
specifically for data lakes (S3/GCS), scaling to petabytes of data. * Git LFS (Large File Storage): An
extension for Git that replaces large files with text pointers, storing the actual content on a remote server.
* Dolt: A SQL database that supports forking, cloning, and merging, allowing data and schema to evolve
together.

6. Data Lineage
Definition and Purpose Data lineage is the process of tracing and visualizing the flow of data from its
origin to its final destination. It tracks the data’s journey, documenting every alteration, transformation,
split, or merger it undergoes along the way.
The primary objective of data lineage is to answer critical questions regarding data provenance: * Where did
a specific piece of data originate? * When and where did the data separate or merge with other datasets?
Mechanism Data lineage functions by creating a data mapping framework. This is achieved by collecting
and managing metadata from every step of the data pipeline. This metadata is updated for each process
applied to the data and stored in a central metadata repository, which serves as the foundation for lineage
analysis.
Components of a Lineage Tracker To ensure comprehensive lineage, especially in modern engineering
environments, a lineage tracker must account for various traceable resources:
1. Infrastructure as Code (IaC):
• Cloud computing resources (compute, storage, network) are provisioned via automation (IaC).
• To maintain lineage, IaC code changes must be committed to a version-controlled repository to
avoid configuration drift.
2. Data:
• Data schemas and metadata should be stored in version control systems.
• Actual data is typically stored in storage media like a data lake, with location links maintained
in configuration files.

7
3. Implementation Code:
• All changes to the code responsible for processing data must be stored as point-in-time commits
in version control.
4. Model Feature List:
• A Feature Store is used to maintain details of machine learning features, including their previous
versions, ensuring point-in-time accuracy.
5. Model Algorithm Code:
• The specific algorithms used for modeling must be version-controlled to track changes over time.
6. Model Container Image:
• Versions of the model container images (used for deployment) must be stored in container repos-
itories managed by a container registry.

7. Reproducibility (Context of Training Datasets)

Reproducibility is the ability to recreate a specific data environment or result from a past point in time. In
the context of Machine Learning (ML) and Analytics Engineering, it is inextricably linked to data lineage.
Importance in ML The lecture notes emphasize a core tenet: “If you can’t explain the data that
trained the model, you can’t trust the model”. Reproducibility ensures that if issues arise in a
production or training environment, engineers can reproduce the “last best version” of the system.
Key Elements to Track for Reproducibility To guarantee reproducibility for training datasets, the
following elements must be tracked via lineage:
1. Data Origin and Transformations: Pinpointing the exact source (e.g., APIs, databases) and every
processing step (cleaning, feature engineering, aggregation).
2. Code and Parameters: Tracking specific source code versions (via Git) and the hyperparameters
used during the training phase.
3. Model Artifacts: Capturing different versions of trained models along with their associated perfor-
mance metrics and evaluation reports.
4. Deployment and Usage: Monitoring where specific models are deployed and used for making pre-
dictions.
Below are case-study based, long-answer university-style questions created strictly from your slide
“L9 – Analytics Engineering” .
The cases integrate:
• Data Integration
• ETL vs ELT
• Medallion Architecture
• ELT-DBT pattern
• Transformation testing
• Schema validation
• Data versioning
• Lineage & reproducibility
• Data governance
• Tool selection
All questions are structured for 15–20 mark descriptive answers.

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL

A large retail enterprise wants to unify data from:
• POS systems (on-premise database)
• E-commerce platform (cloud SaaS API)

8
• Inventory system (ERP)
• Marketing tools
• IoT sensors from warehouses
They want:
• Real-time dashboards
• AI-based demand forecasting
• Historical analysis
• Regulatory compliance
• Scalable cloud architecture
However:
• Data arrives in different formats.
• Schemas change frequently.
• Business teams demand self-service analytics.
• Model training datasets must be reproducible.
Question:
(a) Explain the difference between Data Ingestion and Data Integration in this context.
(b) Propose an architecture using Medallion Architecture.
(c) Explain whether ETL or ELT is more appropriate and why.
(d) Discuss validation and testing strategies.
(e) Explain how data versioning and lineage would support AI use cases.
Model Answer Outline:
(a) Data Ingestion vs Integration:
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.

9
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE

PIPELINE
An insurance company processes claims through an ELT pipeline.
Recent incident:
• VAT incorrectly applied to all categories.
• Monthly payout aggregation incorrect.
• Performance slowed after adding fraud-detection logic.
• No documentation exists for transformation rules.
Question:
(a) Identify where in the transformation process failure likely occurred.
(b) Explain what should have been tested.
(c) Describe levels of testing required.
(d) Suggest best practices to avoid recurrence.
Model Answer Outline:
(a) Likely Failure:
• Transformation Logic Design stage.
• Lack of Business Rule Validation.
• Possibly improper aggregation logic.
(b) What to Test:
Basic Transformations:
• Data type conversions.
• String manipulation.
• Date transformation.
• Aggregation validation.
• Null handling.
Complex Transformations:
• Conditional VAT logic.
• Fraud-detection integration.

10
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT

A company migrates from traditional ETL to a lakehouse architecture using dbt.
Requirements:
• Maintain raw historical archive.
• Standardize multiple CRM sources.
• Build dimensional models.
• Support BI dashboards.
• Maintain modular SQL transformations.
Question:
(a) Map Medallion Architecture to ELT-DBT layers.
(b) Explain the purpose of each dbt layer.
(c) Why should marts not contain transformation logic?
(d) Explain how dbt supports collaboration and documentation.
Model Answer Outline:
(a) Mapping:
Bronze → Landing (raw, 1:1 source reflection). Silver → Staging + Intermediate layers. Gold → Marts
layer.
(b) DBT Layers:
Landing:
• Raw data.
• No transformations.
Staging:
• Light transformations.
• Casting, renaming.
Intermediate:
• Business logic.
• Aggregations.
• Calculations.
• Window functions.
Marts:
• Fact and dimension tables.
• Joins across layers.
• Analytics-ready model.

11
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM

An ML team frequently retrains models using updated sales data.
Problems:
• Overwriting datasets.
• Cannot reproduce old experiments.
• Compliance audit requires past data snapshots.
• Storage costs increasing rapidly.
Question:
(a) Compare Full Duplication vs Valid From/To vs Tool-Based versioning.
(b) Recommend a strategy for large-scale production.
(c) Discuss challenges in versioning.
(d) Explain how versioning improves ML lifecycle management.
Model Answer Outline:
(a) Comparison:
Full Duplication:
• Simple.
• High storage cost.
• Error-prone.
Valid From/To:
• Space eﬀicient.
• Time-travel capability.
• Requires metadata filtering.
Tool-Based:
• Git-like versioning.
• Eﬀicient storage.
• Branching, merging.
• ACID compliance (Delta Lake, lakeFS).
(b) Recommendation:
• Tool-based (e.g., Delta Lake, lakeFS, DVC).
• Supports large-scale structured and unstructured data.
(c) Challenges:

12
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT

A deployed recommendation model starts producing biased outputs.
Investigation reveals:
• New transformation added in staging.
• No metadata updated.
• Feature store version unclear.
• Model retrained without documentation.
• No audit trail of dataset version used.
Question:
(a) Define Data Lineage and its importance.
(b) Explain components required for full lineage tracking.
(c) How does lineage support reproducibility?
(d) Design a lineage-aware ML pipeline.
Model Answer Outline:
(a) Data Lineage:
• Track origin, transformation, movement of data.
• Identify where data came from and how modified.
• Essential for trust and debugging.
(b) Components:
• Infrastructure as Code.
• Data schemas in version control.
• Implementation code in Git.
• Feature store versioning.
• Model algorithm version.
• Container image version.
• Metadata repository.
(c) Reproducibility:
• Recreate training environment.
• Restore exact data version.
• Re-run experiment with same hyperparameters.
• Diagnose production failures.
(d) Lineage-aware pipeline:
• Store metadata at each transformation step.

13
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.
Below are case-study based, long-answer university-style questions created strictly from your slide
“L9 – Analytics Engineering” .
The cases integrate:
• Data Integration
• ETL vs ELT
• Medallion Architecture
• ELT-DBT pattern
• Transformation testing
• Schema validation
• Data versioning
• Lineage & reproducibility
• Data governance
• Tool selection
All questions are structured for 15–20 mark descriptive answers.

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL

A large retail enterprise wants to unify data from:
• POS systems (on-premise database)
• E-commerce platform (cloud SaaS API)
• Inventory system (ERP)
• Marketing tools
• IoT sensors from warehouses
They want:
• Real-time dashboards
• AI-based demand forecasting
• Historical analysis
• Regulatory compliance
• Scalable cloud architecture
However:
• Data arrives in different formats.
• Schemas change frequently.
• Business teams demand self-service analytics.
• Model training datasets must be reproducible.
Question:
(a) Explain the difference between Data Ingestion and Data Integration in this context.
(b) Propose an architecture using Medallion Architecture.
(c) Explain whether ETL or ELT is more appropriate and why.
(d) Discuss validation and testing strategies.
(e) Explain how data versioning and lineage would support AI use cases.
Model Answer Outline:
(a) Data Ingestion vs Integration:

14
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE

PIPELINE
An insurance company processes claims through an ELT pipeline.
Recent incident:
• VAT incorrectly applied to all categories.
• Monthly payout aggregation incorrect.

15
• Performance slowed after adding fraud-detection logic.
• No documentation exists for transformation rules.
Question:
(a) Identify where in the transformation process failure likely occurred.
(b) Explain what should have been tested.
(c) Describe levels of testing required.
(d) Suggest best practices to avoid recurrence.
Model Answer Outline:
(a) Likely Failure:
• Transformation Logic Design stage.
• Lack of Business Rule Validation.
• Possibly improper aggregation logic.
(b) What to Test:
Basic Transformations:
• Data type conversions.
• String manipulation.
• Date transformation.
• Aggregation validation.
• Null handling.
Complex Transformations:
• Conditional VAT logic.
• Fraud-detection integration.
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT

16
(d) Explain how dbt supports collaboration and documentation.
Model Answer Outline:
(a) Mapping:
Bronze → Landing (raw, 1:1 source reflection). Silver → Staging + Intermediate layers. Gold → Marts
layer.
(b) DBT Layers:
Landing:
• Raw data.
• No transformations.
Staging:
• Light transformations.
• Casting, renaming.
Intermediate:
• Business logic.
• Aggregations.
• Calculations.
• Window functions.
Marts:
• Fact and dimension tables.
• Joins across layers.
• Analytics-ready model.
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM

17
(b) Recommend a strategy for large-scale production.
(c) Discuss challenges in versioning.
(d) Explain how versioning improves ML lifecycle management.
Model Answer Outline:
(a) Comparison:
Full Duplication:
• Simple.
• High storage cost.
• Error-prone.
Valid From/To:
• Space eﬀicient.
• Time-travel capability.
• Requires metadata filtering.
Tool-Based:
• Git-like versioning.
• Eﬀicient storage.
• Branching, merging.
• ACID compliance (Delta Lake, lakeFS).
(b) Recommendation:
• Tool-based (e.g., Delta Lake, lakeFS, DVC).
• Supports large-scale structured and unstructured data.
(c) Challenges:
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT

18
(b) Explain components required for full lineage tracking.
(c) How does lineage support reproducibility?
(d) Design a lineage-aware ML pipeline.
Model Answer Outline:
(a) Data Lineage:
• Track origin, transformation, movement of data.
• Identify where data came from and how modified.
• Essential for trust and debugging.
(b) Components:
• Infrastructure as Code.
• Data schemas in version control.
• Implementation code in Git.
• Feature store versioning.
• Model algorithm version.
• Container image version.
• Metadata repository.
(c) Reproducibility:
• Recreate training environment.
• Restore exact data version.
• Re-run experiment with same hyperparameters.
• Diagnose production failures.
(d) Lineage-aware pipeline:
• Store metadata at each transformation step.
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.

Data Integration and ELT Overview
No ratings yet
Data Integration and ELT Overview
34 pages
Business Analytics Course Overview 2024/25
No ratings yet
Business Analytics Course Overview 2024/25
32 pages
Cae 7123 Bda - Module II
No ratings yet
Cae 7123 Bda - Module II
25 pages
Business Intelligence Architecture Overview
No ratings yet
Business Intelligence Architecture Overview
4 pages
Data Integration in Data Warehousing
No ratings yet
Data Integration in Data Warehousing
7 pages
Business Intelligence Project Report
No ratings yet
Business Intelligence Project Report
22 pages
Data Integration
No ratings yet
Data Integration
9 pages
Overview of Data Integration Techniques
No ratings yet
Overview of Data Integration Techniques
20 pages
M1. What Is Data Engineering All Topic
No ratings yet
M1. What Is Data Engineering All Topic
34 pages
1 Cor. Introduction To ETL Processes
No ratings yet
1 Cor. Introduction To ETL Processes
32 pages
Introduction to Data Engineering Basics
No ratings yet
Introduction to Data Engineering Basics
12 pages
Business Analytics Course Overview 2024
No ratings yet
Business Analytics Course Overview 2024
31 pages
Data Engineering Lifecycle Overview
No ratings yet
Data Engineering Lifecycle Overview
14 pages
Data Integration vs. ETL Explained
No ratings yet
Data Integration vs. ETL Explained
11 pages
Comprehensive Guide to Data Integration
No ratings yet
Comprehensive Guide to Data Integration
20 pages
Data Warehousing Fundamentals Explained
No ratings yet
Data Warehousing Fundamentals Explained
9 pages
Data Cleaning and Transformation Techniques
No ratings yet
Data Cleaning and Transformation Techniques
7 pages
The Ultimate Guide: To Data Integration
No ratings yet
The Ultimate Guide: To Data Integration
14 pages
Data Warehouse Architecture and Preprocessing
No ratings yet
Data Warehouse Architecture and Preprocessing
8 pages
Weighing Data Integration Techniques
No ratings yet
Weighing Data Integration Techniques
8 pages
CRISP-DM and ETL Process Overview
No ratings yet
CRISP-DM and ETL Process Overview
17 pages
Data Integration and Analytics Overview
No ratings yet
Data Integration and Analytics Overview
3 pages
Data Integration Techniques Explained
No ratings yet
Data Integration Techniques Explained
7 pages
Data Warehouse Architecture Explained
No ratings yet
Data Warehouse Architecture Explained
10 pages
Data Transformation Explained: ETL Process
No ratings yet
Data Transformation Explained: ETL Process
7 pages
Integrating Data
No ratings yet
Integrating Data
4 pages
Data Warehouse Fundamentals Overview
No ratings yet
Data Warehouse Fundamentals Overview
26 pages
Advanced Database Management Techniques
No ratings yet
Advanced Database Management Techniques
42 pages
UNIT - I: Data Management
No ratings yet
UNIT - I: Data Management
32 pages
Data Integration Strategies and Tools
No ratings yet
Data Integration Strategies and Tools
27 pages
Understanding Data Marts in Warehousing
No ratings yet
Understanding Data Marts in Warehousing
5 pages
Essential Steps in Data Analysis
No ratings yet
Essential Steps in Data Analysis
6 pages
06-Data-Integration Quality Profiling
No ratings yet
06-Data-Integration Quality Profiling
39 pages
Data Warehouse Fundamentals Explained
No ratings yet
Data Warehouse Fundamentals Explained
30 pages
Revision Sheet Unit16, Session4
No ratings yet
Revision Sheet Unit16, Session4
11 pages
ETL Process and Data Warehouse Design
No ratings yet
ETL Process and Data Warehouse Design
31 pages
Building Big Data Pipelines Overview
No ratings yet
Building Big Data Pipelines Overview
37 pages
Data Analytics Course Overview
No ratings yet
Data Analytics Course Overview
4 pages
Acceptance Testing and ETL Overview
No ratings yet
Acceptance Testing and ETL Overview
19 pages
Data Warehousing Life Cycle Overview
No ratings yet
Data Warehousing Life Cycle Overview
3 pages
Business Intelligence: Data Warehouse Overview
No ratings yet
Business Intelligence: Data Warehouse Overview
33 pages
Business Intelligence and Data Warehousing
No ratings yet
Business Intelligence and Data Warehousing
155 pages
Data Integration in Data Warehousing
No ratings yet
Data Integration in Data Warehousing
33 pages
Stages of Big Data Analytics Process
100% (1)
Stages of Big Data Analytics Process
8 pages
Metadata-Driven Data Integration Guide
No ratings yet
Metadata-Driven Data Integration Guide
9 pages
Data Integration Fundamentals
100% (1)
Data Integration Fundamentals
61 pages
Data Warehousing and Analysis Workflows
No ratings yet
Data Warehousing and Analysis Workflows
5 pages
Data Mart Architectures Explained
No ratings yet
Data Mart Architectures Explained
12 pages
IoT Data Management Overview and Techniques
No ratings yet
IoT Data Management Overview and Techniques
101 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
48 pages
Data Warehouse Overview and Implementation
No ratings yet
Data Warehouse Overview and Implementation
22 pages
Business Analytics: Data Warehousing Insights
No ratings yet
Business Analytics: Data Warehousing Insights
160 pages
Data Warehouse Process Overview
No ratings yet
Data Warehouse Process Overview
14 pages
Cloud Data Ingestion and Transformation Guide
No ratings yet
Cloud Data Ingestion and Transformation Guide
24 pages
Understanding OLAP in Data Warehousing
No ratings yet
Understanding OLAP in Data Warehousing
12 pages
Data Transformation in Data Pipelines
No ratings yet
Data Transformation in Data Pipelines
11 pages
Understanding the ETL Process
No ratings yet
Understanding the ETL Process
4 pages
ETL Process: Streamlining Data Management
No ratings yet
ETL Process: Streamlining Data Management
44 pages
M1. The Data Engineering Lifecycle
No ratings yet
M1. The Data Engineering Lifecycle
162 pages
Heritage City Apartments, MG Road
No ratings yet
Heritage City Apartments, MG Road
89 pages
DLF City Phase 5 Property Listings
No ratings yet
DLF City Phase 5 Property Listings
23 pages
Verandas Apartments Ownership Details
No ratings yet
Verandas Apartments Ownership Details
25 pages
Icon 1
No ratings yet
Icon 1
46 pages
CNC Technology and Its Advantages
No ratings yet
CNC Technology and Its Advantages
37 pages
VLSI & Semiconductor Training Program
No ratings yet
VLSI & Semiconductor Training Program
2 pages
ADAPTERS WOODCOUSApdf
No ratings yet
ADAPTERS WOODCOUSApdf
36 pages
IoT Based Smart Washroom Using Automated Sensor
No ratings yet
IoT Based Smart Washroom Using Automated Sensor
3 pages
AI ChatGPT Clone Development Guide
No ratings yet
AI ChatGPT Clone Development Guide
29 pages
Single Phase Fault Current Calculation
No ratings yet
Single Phase Fault Current Calculation
11 pages
PA Hire Services in CV37 8HW
No ratings yet
PA Hire Services in CV37 8HW
2 pages
USB Type-C Power Delivery Solutions
No ratings yet
USB Type-C Power Delivery Solutions
30 pages
Analog vs Digital Signals Explained
No ratings yet
Analog vs Digital Signals Explained
64 pages
Enactus REC Marketing Overview
No ratings yet
Enactus REC Marketing Overview
6 pages
Incubation Clubs and Innovation Management
No ratings yet
Incubation Clubs and Innovation Management
7 pages
Python Interview Questions for Networking
No ratings yet
Python Interview Questions for Networking
21 pages
Conference 1 - Copy (College) Final
No ratings yet
Conference 1 - Copy (College) Final
6 pages
DCOM Technology Company Overview
No ratings yet
DCOM Technology Company Overview
24 pages
Airtel Broadband User Information Guide
No ratings yet
Airtel Broadband User Information Guide
4 pages
Storage Devices
No ratings yet
Storage Devices
50 pages
Set-Top Box Software Architectures For Digital Video Broadcast
No ratings yet
Set-Top Box Software Architectures For Digital Video Broadcast
6 pages
Harsh Jain: Education & Experience Overview
No ratings yet
Harsh Jain: Education & Experience Overview
2 pages
Understanding Security Topologies and Zones
No ratings yet
Understanding Security Topologies and Zones
14 pages
Microsoft Word: History and Versions
No ratings yet
Microsoft Word: History and Versions
7 pages
Physical Layer Aspects For Evolved Universal Terrestrial Radio Access (Utra)
No ratings yet
Physical Layer Aspects For Evolved Universal Terrestrial Radio Access (Utra)
132 pages
Network Infra Sdn Bhd Overview
No ratings yet
Network Infra Sdn Bhd Overview
66 pages
New Holland Tractor Service Manuals
50% (2)
New Holland Tractor Service Manuals
3 pages
LM324 Series Operational Amplifiers
No ratings yet
LM324 Series Operational Amplifiers
16 pages
Sonicos 3 Billion 4 TTT
No ratings yet
Sonicos 3 Billion 4 TTT
18 pages
Lead Piping Engineer Profile & Experience
No ratings yet
Lead Piping Engineer Profile & Experience
2 pages
Senior Full Stack .Net Developer Resume
No ratings yet
Senior Full Stack .Net Developer Resume
11 pages
Search: Upload Saved Top Charts Documents Books Audiobooks Magazines News Sheet Music
No ratings yet
Search: Upload Saved Top Charts Documents Books Audiobooks Magazines News Sheet Music
8 pages
Phishing Awareness Tool Framework
No ratings yet
Phishing Awareness Tool Framework
2 pages
AI and ML Overview: Key Concepts & Trends
No ratings yet
AI and ML Overview: Key Concepts & Trends
3 pages

Data Integration

Uploaded by

Data Integration

Uploaded by

1.

Data Integration vs. Data Ingestion

The Integration Process

Modern Integration Challenges

Data Integration Tools

Integration Patterns: The Medallion Architecture

3. Data Transformation and Modelling Patterns

1. The Medallion Architecture (Multi-Hop Architecture) The Medallion Architecture is a data

The Validation Process

4. Documentation of Data Models.

Types of Data Models

Tools for Documentation and Management

Why is Data Versioning Important?

Naming Conventions for Versions

3. Tool-Based Versioning (Recommended) This approach treats versioning as a core capability of

Popular Data Versioning Tools

7. Reproducibility (Context of Training Datasets)

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT

CASE STUDY 1: BUILDING A MODERN DATA PLATFORM FOR RETAIL

CASE STUDY 2: TRANSFORMATION FAILURE IN INSURANCE

CASE STUDY 3: MIGRATION TO LAKEHOUSE WITH DBT

CASE STUDY 4: DATA VERSIONING STRATEGY FOR ML TEAM

CASE STUDY 5: DATA LINEAGE FAILURE DURING MODEL INCIDENT

You might also like