Data Integration
Data Integration
Data Integration
What is Data Integration?
Data integration is the process of combining data from disparate sources across an organization into a unified
view, such as a data warehouse or data lake. The primary goal is to provide a single, consolidated dataset
that serves as a consistent fact base for business intelligence (BI) and analytics.
By unifying these datasets, organizations can align departments, improve customer experiences, and make
better-informed decisions.
1
[Link] Transformations
Data transformation is the critical process of converting raw, unstructured, or inconsistent data into clean,
usable formats that fuel analytics, reporting, and machine learning. It serves as the bridge between raw data
ingestion and actionable business insights, ensuring that data is standardized, accurate, and aligned with
business requirements.
Key Types of Transformations Data transformation encompasses various techniques to refine data for
consumption: * Data Cleaning: Involves removing duplicates, handling missing values, correcting typos,
and validating formats such as email addresses or phone numbers. * Data Standardization: Converts
units of measurement, aligns date/time formats, and harmonizes naming conventions to ensure consistency.
* Data Aggregation: Summarizes granular data to provide higher-level insights. * Data Enrichment:
Enhances a dataset by adding new information from internal or external sources. * Data Integration:
Combines data from multiple disparate sources into a unified view. * Data Normalization: Structures
data to minimize redundancy and improve overall integrity. * Data Filtering and Selection: Narrows
down datasets based on specific conditions or rules.
The Data Transformation Process The lifecycle of transforming data typically follows a structured
six-step process:
1. Discovery: Before transforming data, teams must profile the source data to understand its structure,
types, volumes, and quality. This step identifies anomalies and inconsistencies, providing metadata
crucial for automation and lineage.
2. Data Mapping: This step defines how source data fields correspond to destination fields, ensuring the
correct values map to the correct schema (e.g., mapping first_name in a CRM to customerFirstName
in a warehouse).
3. Transformation Logic Design: Teams define the specific logic—such as regex-based cleaning, con-
ditional logic, or SQL aggregations—that will reshape the data.
4. Execution: The logic is executed via batch or streaming jobs. In modern ELT (Extract, Load,
Transform) workflows, this often happens within the data warehouse using tools like dbt, rather than
before loading (ETL).
5. Validation and Testing: Transformed data undergoes rigorous testing to ensure accuracy, complete-
ness, and conformity to expected formats.
6. Documentation and Monitoring: Workflows are documented, and pipelines are monitored for
failures or schema drift to maintain trust in the data.
Based on the “Analytics Engineering” lecture notes, here is an explanation of Data Transformation
and Modelling Patterns, structured for an academic understanding suitable for postgraduate engineering
students.
2
– Purpose: Facilitates quick Change Data Capture (CDC), provides a historical archive (cold
storage), ensures auditability, and allows for data reprocessing without re-reading from source
systems.
• Silver Layer (Cleansed and Conformed)
– Function: Provides an “Enterprise view” of key business entities (e.g., master customers, stores)
and transactions.
– Structure: Data from the Bronze layer is matched, merged, conformed, and cleansed.
– Transformation Intensity: Only “just-enough” transformations and cleansing rules are applied
to create a unified view.
• Gold Layer (Curated/Consumption-Ready)
– Function: Serves as the final presentation layer for specific projects, such as Customer Analytics,
Inventory Analytics, or Sales Reporting.
– Structure: Data is organized into project-specific databases using denormalized, read-optimized
data models (often Star Schemas) requiring fewer joins.
– Transformation Intensity: This layer applies the final data transformations and complex busi-
ness quality rules to make data ready for reporting and decision-making.
2. The ELT-DBT Modelling Pattern This pattern is a specific implementation of the Medallion
Architecture, often utilizing the dbt (data build tool) framework. It shifts from traditional ETL (Extract,
Transform, Load) to ELT (Extract, Load, Transform), where transformations occur inside the data
warehouse.
This pattern structures data modelling into four distinct layers:
• Landing Layer (Source Layer)
– Definition: The initial entry point, also known as the raw zone.
– Characteristics: A 1:1 reflection of source tables with no transformations applied.
• Staging Layer (stg_*)
– Definition: A replica of the source data prepared for downstream usage.
– Characteristics:
∗ Applies light modifications such as data type casting, column renaming, and filtering of
deleted records.
∗ Typically involves a 1:1 reflection of source tables with no joins between models.
– Purpose: To cleanse and standardize raw data.
• Intermediate Layer (int_*)
– Definition: The bridge between staging data and final business entities, applying business logic.
– Characteristics:
∗ Involves heavier transformations such as aggregations, calculations, CASE statements,
and window functions.
∗ Prepares data specifically for fact modelling.
• Marts Layer (dim_*, fact_*)
– Definition: The final presentation layer containing Dimensional Models (Dimensions) and Fact
Models (Measures).
– Characteristics:
∗ Utilizes joins and unions to combine data from intermediate models, facts, and dimensions.
∗ No complex transformation logic should reside here; it is purely for reflecting the business
perspective.
Summary of Mapping The industry standard often maps these two patterns together: * Bronze ↔
Landing / Source Data * Silver ↔ Staging & Intermediate Layers (Cleansing and Business Logic)
* Gold ↔ Marts Layer (Reporting and Analytics)
Some mature organizations extend this to a “Platinum” layer, which includes data enhanced by AI/ML
predictions.
3
4. Data Transformation Validation and Testing.
Introduction to Data Transformation Validation
Data transformation validation is the process of verifying that the data logic applied during the pipeline exe-
cution produces accurate, reliable, and consistent results. Even well-designed transformations can introduce
“silent errors” into analytics if not properly validated. The goal is to ensure the output is not just technically
executable, but business-accurate.
What to Test?
Testing is categorized into verifying basic technical transformations and complex logic.
1. Basic Transformations These tests focus on the structural and fundamental integrity of the data: *
Data Type Conversions: Ensuring fields are correctly cast (e.g., strings to integers). * Handling Null
Values: Verifying that missing values are handled or replaced appropriately. * String & Date Manip-
ulations: Checking that text is formatted correctly and date-time conversions are accurate. * Filtering:
Confirming that data exclusion logic (e.g., removing deleted records) works as expected. * Uniqueness &
Constraints: Validating that primary keys remain unique and mandatory fields are not empty.
2. Complex Transformations These tests focus on the analytical and business logic applied to the data:
* Aggregations: Verifying summary operations (e.g., monthly sales totals) match expectations. * Data
Integration: Ensuring data from diverse sources (e.g., SQL and NoSQL) is unified correctly without data
loss. * Data Normalization: Checking that redundancy is reduced and data is structured efficiently. *
Business Rules: Verifying specific logic, such as “Apply 15% VAT only to Category X”.
Levels of Testing
Testing should occur at different granularities to catch different types of errors:
• Level 1: Unit Tests:
– Tests single transformation logic in isolation (e.g., testing a specific regex pattern or calculation).
Tools like dbt allow for generic and singular tests at this level.
• Level 2: Integration Tests:
– Verifies multiple transformations working together in an end-to-end pipeline to ensure the flow of
data is correct.
• Level 3: Regression Tests:
4
– Compares current outputs against historical outputs to detect unintended changes or “breaks” in
logic caused by new code.
Best Practices
To maintain a robust data pipeline, the notes recommend the following best practices: * Document Every
Rule: Maintain a mapping sheet linking source fields, transformation logic, and target fields. * Automate
Validation: Use SQL scripts or tools to run checks automatically, rather than relying on manual inspection.
* Test with Realistic Volumes: Simulate production-scale data to catch performance bottlenecks that
small test datasets might hide. * Test Edge Cases: specifically test for missing data, extreme values, or
special characters to ensure the pipeline doesn’t fail under specific conditions.
Benefits of Documentation
Proper documentation is essential for the scalability and maintainability of data systems. The notes highlight
five key benefits: * Improved Clarity: It translates complex data structures into formats understandable by
non-technical stakeholders. * Easier Maintenance: It accelerates updates and bug fixes by clearly showing
how components are connected. * Faster Onboarding: It allows new team members to quickly grasp
the model’s logic and assumptions. * Consistent Data Use: It standardizes definitions and structures,
preventing misinterpretation of data. * Better Collaboration: It fosters a shared understanding across
analysts, engineers, and decision-makers.
5
• dbt(data build tool):
– Provides robust features for version control, modularization, and documentation.
–Allows stakeholders to access the latest version of the data model in a collaborative environment.
– Enables effective documentation of models, helping team members understand the purpose and
functionality of the data.
• DataHub:
– A modern data catalog used for metadata management, discovery, and governance.
– It helps track data lineage, profile datasets, and establish data contracts.
• Amundsen:
– A data discovery and metadata engine (originally from Lyft) designed to create a centralized
repository.
– It features “popularity ranking” for frequently visited data tags and provides in-line metadata to
save time for users.
5. Data Versioning
What is Data Versioning?
Data versioning is the practice of storing different versions of data that were created or modified at specific
points in time. Much like code version control (e.g., Git) allows developers to track changes in software, data
versioning allows data engineers and scientists to track changes in datasets. This ensures that organizations
can access, retrieve, or restore data exactly as it existed at any previous moment.
Implementation Approaches
The notes outline three primary methods for implementing data versioning:
1. Full Duplication This involves manually saving a full copy of the dataset to a new location every
time a version is created. * Pros: Simple to understand and implement; works for small datasets. * Cons:
Highly inefficient regarding storage space; error-prone because users must manually hardcode date values or
paths in their queries. * Best For: Small firms or individual work where sensitive data must be isolated.
6
2. Valid From/To Metadata This is an incremental approach where two metadata columns (valid_from
and valid_to) are added to the table. * Mechanism: When a record is updated, the old record is preserved,
and its valid_to date is updated. The new record is appended with a new valid_from date. * Pros: More
space-efficient than full duplication. * Cons: Interaction is limited to adding date filters to SQL queries,
which can become complex. * Best For: “Time traveling” within a single tabular dataset.
Challenges
• Storage Space: Storing multiple versions increases storage requirements and costs. Organizations
must balance the granularity of versioning with their budget.
• Security: More stored versions increase the surface area for potential data leaks. This is a significant
concern in cloud environments where IT functions are outsourced.
• Tool Selection: Teams must evaluate tools based on criteria like storage capacity, open-source avail-
ability, cloud support, and ease of use.
6. Data Lineage
Definition and Purpose Data lineage is the process of tracing and visualizing the flow of data from its
origin to its final destination. It tracks the data’s journey, documenting every alteration, transformation,
split, or merger it undergoes along the way.
The primary objective of data lineage is to answer critical questions regarding data provenance: * Where did
a specific piece of data originate? * When and where did the data separate or merge with other datasets?
Mechanism Data lineage functions by creating a data mapping framework. This is achieved by collecting
and managing metadata from every step of the data pipeline. This metadata is updated for each process
applied to the data and stored in a central metadata repository, which serves as the foundation for lineage
analysis.
Components of a Lineage Tracker To ensure comprehensive lineage, especially in modern engineering
environments, a lineage tracker must account for various traceable resources:
1. Infrastructure as Code (IaC):
• Cloud computing resources (compute, storage, network) are provisioned via automation (IaC).
• To maintain lineage, IaC code changes must be committed to a version-controlled repository to
avoid configuration drift.
2. Data:
• Data schemas and metadata should be stored in version control systems.
• Actual data is typically stored in storage media like a data lake, with location links maintained
in configuration files.
7
3. Implementation Code:
• All changes to the code responsible for processing data must be stored as point-in-time commits
in version control.
4. Model Feature List:
• A Feature Store is used to maintain details of machine learning features, including their previous
versions, ensuring point-in-time accuracy.
5. Model Algorithm Code:
• The specific algorithms used for modeling must be version-controlled to track changes over time.
6. Model Container Image:
• Versions of the model container images (used for deployment) must be stored in container repos-
itories managed by a container registry.
8
• Inventory system (ERP)
• Marketing tools
• IoT sensors from warehouses
They want:
• Real-time dashboards
• AI-based demand forecasting
• Historical analysis
• Regulatory compliance
• Scalable cloud architecture
However:
• Data arrives in different formats.
• Schemas change frequently.
• Business teams demand self-service analytics.
• Model training datasets must be reproducible.
Question:
(a) Explain the difference between Data Ingestion and Data Integration in this context.
(b) Propose an architecture using Medallion Architecture.
(c) Explain whether ETL or ELT is more appropriate and why.
(d) Discuss validation and testing strategies.
(e) Explain how data versioning and lineage would support AI use cases.
Model Answer Outline:
(a) Data Ingestion vs Integration:
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.
9
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.
10
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.
11
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.
12
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.
13
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.
Below are case-study based, long-answer university-style questions created strictly from your slide
“L9 – Analytics Engineering” .
The cases integrate:
• Data Integration
• ETL vs ELT
• Medallion Architecture
• ELT-DBT pattern
• Transformation testing
• Schema validation
• Data versioning
• Lineage & reproducibility
• Data governance
• Tool selection
All questions are structured for 15–20 mark descriptive answers.
14
• Ingestion: Collect and transfer data from multiple sources into storage.
• Integration: Apply transformations, cleansing, merging, and business logic to create unified datasets.
• Ingestion ensures consolidation.
• Integration ensures quality and usability for analytics.
(b) Medallion Architecture:
Bronze Layer:
• Raw data from POS, APIs, IoT.
• Schema preserved as-is.
• Includes metadata such as load time.
Silver Layer:
• Cleansing and conformance.
• Deduplication of customers.
• Standardization of date/time and currency.
• Light transformations.
Gold Layer:
• Aggregated sales marts.
• Customer 360 view.
• AI-ready datasets.
• Reporting tables.
(c) ETL vs ELT:
• ELT preferred.
• Raw data loaded first.
• Transformations done in cloud data warehouse.
• Supports scalability and agility.
• Compatible with dbt pattern.
(d) Validation and Testing:
• Schema validation: column presence, data types.
• Business rule validation: pricing logic.
• Source-to-target verification.
• Regression testing.
• Performance testing for real-time workloads.
(e) Data Versioning and Lineage:
• Version datasets before retraining ML models.
• Track transformation logic via Git.
• Store model artifacts and hyperparameters.
• Enable reproducibility and auditability.
• Support rollback in case of incorrect predictions.
15
• Performance slowed after adding fraud-detection logic.
• No documentation exists for transformation rules.
Question:
(a) Identify where in the transformation process failure likely occurred.
(b) Explain what should have been tested.
(c) Describe levels of testing required.
(d) Suggest best practices to avoid recurrence.
Model Answer Outline:
(a) Likely Failure:
• Transformation Logic Design stage.
• Lack of Business Rule Validation.
• Possibly improper aggregation logic.
(b) What to Test:
Basic Transformations:
• Data type conversions.
• String manipulation.
• Date transformation.
• Aggregation validation.
• Null handling.
Complex Transformations:
• Conditional VAT logic.
• Fraud-detection integration.
(c) Levels of Testing: Level 1: Unit Tests (single logic rule). Level 2: Integration Tests (multiple transfor-
mations). Level 3: Regression Tests (compare historical outputs).
(d) Best Practices:
• Document transformation mapping sheet.
• Automate validation.
• Test edge cases.
• Use version control for SQL/dbt code.
• Maintain documentation of data models.
16
(d) Explain how dbt supports collaboration and documentation.
Model Answer Outline:
(a) Mapping:
Bronze → Landing (raw, 1:1 source reflection). Silver → Staging + Intermediate layers. Gold → Marts
layer.
(b) DBT Layers:
Landing:
• Raw data.
• No transformations.
Staging:
• Light transformations.
• Casting, renaming.
Intermediate:
• Business logic.
• Aggregations.
• Calculations.
• Window functions.
Marts:
• Fact and dimension tables.
• Joins across layers.
• Analytics-ready model.
(c) Why no logic in marts:
• Maintain modularity.
• Separation of concerns.
• Avoid duplication of business rules.
• Improve maintainability.
(d) dbt Benefits:
• Version control integration.
• Modular models.
• Built-in documentation.
• Reusable components.
• Collaboration through Git.
17
(b) Recommend a strategy for large-scale production.
(c) Discuss challenges in versioning.
(d) Explain how versioning improves ML lifecycle management.
Model Answer Outline:
(a) Comparison:
Full Duplication:
• Simple.
• High storage cost.
• Error-prone.
Valid From/To:
• Space efficient.
• Time-travel capability.
• Requires metadata filtering.
Tool-Based:
• Git-like versioning.
• Efficient storage.
• Branching, merging.
• ACID compliance (Delta Lake, lakeFS).
(b) Recommendation:
• Tool-based (e.g., Delta Lake, lakeFS, DVC).
• Supports large-scale structured and unstructured data.
(c) Challenges:
• Storage cost.
• Security risks.
• Choosing right provider.
• Adoption by team.
(d) ML Benefits:
• Reproducibility.
• Compare experiments.
• Rollback datasets.
• Audit trail.
• Track data-code-model linkage.
18
(b) Explain components required for full lineage tracking.
(c) How does lineage support reproducibility?
(d) Design a lineage-aware ML pipeline.
Model Answer Outline:
(a) Data Lineage:
• Track origin, transformation, movement of data.
• Identify where data came from and how modified.
• Essential for trust and debugging.
(b) Components:
• Infrastructure as Code.
• Data schemas in version control.
• Implementation code in Git.
• Feature store versioning.
• Model algorithm version.
• Container image version.
• Metadata repository.
(c) Reproducibility:
• Recreate training environment.
• Restore exact data version.
• Re-run experiment with same hyperparameters.
• Diagnose production failures.
(d) Lineage-aware pipeline:
• Store metadata at each transformation step.
• Integrate data catalog (DataHub/Amundsen).
• Version data via Delta Lake/lakeFS.
• Track experiments via Neptune/DVC.
• Automate lineage capture in CI/CD.
19