0% found this document useful (0 votes)
5 views16 pages

DataWarehousing Notes

This document provides a comprehensive overview of Data Warehousing concepts, including definitions, architecture, dimensional modeling, schema design, and various optimization techniques. It covers key topics such as ETL vs ELT, indexing, partitioning, and common SQL patterns used in data warehouses. Additionally, it discusses modern cloud data warehousing platforms and the advantages of using surrogate keys and materialized views.

Uploaded by

rp105263
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
5 views16 pages

DataWarehousing Notes

This document provides a comprehensive overview of Data Warehousing concepts, including definitions, architecture, dimensional modeling, schema design, and various optimization techniques. It covers key topics such as ETL vs ELT, indexing, partitioning, and common SQL patterns used in data warehouses. Additionally, it discusses modern cloud data warehousing platforms and the advantages of using surrogate keys and materialized views.

Uploaded by

rp105263
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Data Warehousing

Quick Reference Notes


Concepts · Schemas · SQL · Indexing · Partitioning · Dimensional Modelling · Optimization

Index
# Topic What's Covered
1 What is a Data Warehouse? Definition, OLTP vs OLAP, use cases
2 DW Architecture Layers, Kimball vs Inmon, Lambda/Kappa
3 Dimensional Modelling Facts, dimensions, grain, conformed dims
4 Schema Design Star schema, snowflake schema, galaxy schema
5 Slowly Changing Dimensions SCD Type 1, 2, 3, 4 — definitions and SQL
6 ETL vs ELT Differences, when to use each, tools
7 Surrogate Keys & Natural Keys Why surrogate keys, sequences, identity columns
8 Indexing in DW Clustered, non-clustered, columnstore indexes
9 Partitioning Range, list, hash partitioning with examples
10 Aggregations & Materialized Pre-aggregation, indexed views, refresh strategies
Views
11 Window Functions ROW_NUMBER, RANK, LAG, LEAD, running totals
12 Common DW SQL Patterns Incremental load, deduplication, pivot, unpivot
13 Data Vault Modelling Hubs, links, satellites — concepts and example
14 Modern Cloud DW Platforms Snowflake, BigQuery, Synapse, Redshift — comparison
15 Performance Optimization Statistics, query hints, distribution, caching
1. What is a Data Warehouse?
A Data Warehouse (DW) is a central repository of integrated, historical data from multiple sources — optimised
for analytical queries (OLAP), not transactional processing (OLTP).

Feature OLTP (Operational DB) OLAP (Data Warehouse)


Purpose Day-to-day transactions Historical analysis & reporting
Data Current, real-time Historical, aggregated
Queries Simple, short (INSERT/UPDATE) Complex, long-running
(SELECT/JOIN/GROUP)
Schema Highly normalised (3NF) Denormalised (star/snowflake)
Users Thousands of concurrent users Dozens of analysts
Examples MySQL, PostgreSQL, SQL Server Snowflake, Synapse, BigQuery,
Redshift

💡 Data Warehouses are read-heavy. Denormalisation, columnstore indexes, and pre-aggregation are key
optimisations — unlike OLTP where write speed and normalisation matter.
2. DW Architecture
Most modern DW architectures follow a layered approach from raw ingestion to clean, business-ready data.

Medallion / Layered Architecture


Layer Also called What it contains
Bronze / Raw Landing / Staging Raw data as-is from source systems — no
transformation
Silver / Clean Cleansed / ODS Deduplicated, standardised, validated data
Gold / Curated DW / Data Mart Business-ready facts and dimensions,
aggregations

Kimball vs Inmon
Approach Philosophy Start with
Kimball (Bottom-up) Build data marts first; integrate later via Dimensional model (star
conformed dimensions schema)
Inmon (Top-down) Build enterprise DW first in 3NF; derive data Normalised enterprise DW
marts from it

💡 Most modern cloud DWs follow a Kimball-style dimensional model combined with a Medallion
architecture — Bronze/Silver/Gold layers feeding star-schema Gold tables.
3. Dimensional Modelling
Dimensional modelling organises data into Facts (measurable events) and Dimensions (context about those
events). It makes queries fast and intuitive.

Concept What it is Example


Fact Table Stores measurable, numeric fact_sales (sale_id, amount, qty,
events date_key, customer_key)
Dimension Table Descriptive context for fact rows dim_customer (customer_key, name,
city, country)
Grain The lowest level of detail one fact 'One row per product per customer per
row represents day'
Measure A numeric value in the fact table sales_amount, quantity_sold, discount
Surrogate Key System-generated integer PK for customer_key = 1001 (not the source
dimension rows CRM ID)
Conformed Dim Dimension shared across multiple dim_date used in both fact_sales and
fact tables / data marts fact_returns
Degenerate Dim Dimension attribute stored in fact order_number, invoice_id
(no separate dim table)
Junk Dimension Combines low-cardinality flags dim_flags (is_online, is_returned,
into one dimension is_discounted)

💡 Always define the grain before designing the fact table. Every measure and foreign key must be at the
same grain — mixing grains causes incorrect aggregations.
4. Schema Design
Star Schema
One central fact table surrounded by denormalised dimension tables. Simple joins, fast queries. Most common
DW pattern.
fact_sales
sale_id INT (PK)
date_key INT (FK → dim_date)
customer_key INT (FK → dim_customer)
product_key INT (FK → dim_product)
store_key INT (FK → dim_store)
amount DECIMAL
quantity INT

Snowflake Schema
Dimension tables are normalised into sub-dimensions. Saves storage but requires more joins. Less common in
modern cloud DWs where storage is cheap.
dim_product → dim_subcategory → dim_category
-- Instead of one flat dim_product with category columns

Galaxy / Fact Constellation Schema


Multiple fact tables sharing conformed dimension tables. Used when a DW covers more than one business
process.
fact_sales ─────┐
├── dim_date, dim_customer, dim_product
fact_returns ─────┘

💡 Prefer Star Schema for most DW projects — simpler SQL, better query optimiser hints, and easier for BI
tools to auto-generate joins.
5. Slowly Changing Dimensions (SCD)
SCDs handle how dimension data changes over time. The SCD type determines whether history is preserved.

Type Strategy History kept? Use when…


SCD Type 1 Overwrite old value with new value No History does not matter (fix typos,
correct data)
SCD Type 2 Add a new row for each change with Yes (full) Full history needed (customer
start/end date + is_current flag address, product price)
SCD Type 3 Add a new column for previous Partial Only previous value needed, not
value full history
SCD Type 4 Separate history table; current Yes Frequent changes; keep current
values in main dimension table fast

Example — SCD Type 2


-- dim_customer with SCD Type 2
customer_key INT -- surrogate PK
customer_id INT -- natural/source key
name VARCHAR(100)
city VARCHAR(100)
start_date DATE -- when this version became active
end_date DATE -- NULL means current record
is_current BIT -- 1 = active row

-- When customer moves city:


-- 1. UPDATE old row: end_date = today, is_current = 0
-- 2. INSERT new row: new city, start_date = today, end_date = NULL, is_current
= 1

💡 Always join fact tables to dimensions using the surrogate key (customer_key), not the natural key
(customer_id) — this ensures the fact row links to the correct historical version.
6. ETL vs ELT
ETL (Extract → Transform → Load) ELT (Extract → Load → Transform)
Transform where In a separate ETL server / tool Inside the target DW (SQL / Spark)
When to use Legacy on-prem DW, sensitive data Cloud DW with powerful compute
masking before load (Snowflake, BigQuery, Synapse)
Tools SSIS, Informatica, Talend, ADF Data dbt, Spark, Databricks SQL, BigQuery
Flows SQL
Advantage Data is clean before entering DW Simpler pipeline; DW compute does the
heavy lifting
Disadvantage ETL server becomes a bottleneck Raw/messy data lands in DW first

💡 Modern cloud DWs favour ELT — load raw data fast, then transform using SQL or dbt inside the
warehouse where compute scales elastically.
7. Surrogate Keys & Natural Keys
Key Type What it is Example
Natural Key Business identifier from the source system customer_id = 'CUST-7823' or
NationalID
Surrogate Key System-generated integer PK — customer_key = 10045
meaningless outside DW (IDENTITY / SEQUENCE)

Why use Surrogate Keys?


• Natural keys can change (customer number recycled, email changes)
• Integers join faster than strings or composite keys
• Support SCD Type 2 — multiple rows for one natural key need different surrogate keys
• Insulate DW from source system changes

Example — Generate Surrogate Keys


-- SQL Server / Synapse
customer_key INT IDENTITY(1,1) PRIMARY KEY

-- PostgreSQL
customer_key SERIAL PRIMARY KEY

-- Spark / Databricks
from [Link] import monotonically_increasing_id
df = [Link]('customer_key', monotonically_increasing_id())
8. Indexing in Data Warehouses
Index Type Best for Notes
Clustered Index Range scans on date/key columns Physically sorts table rows — only
one per table
Non-clustered Index Point lookups on foreign keys or Separate structure; multiple allowed
filters per table
Columnstore Index Analytical queries — Best for DW — compresses columns,
aggregations, scans batch reads
Bitmap Index Low-cardinality columns (gender, Used in Oracle DW; not in SQL
region) Server/Snowflake
Covering Index Queries that only need specific INCLUDE extra columns to avoid
columns table lookups

Example — Columnstore Index (SQL Server / Synapse)


-- Clustered columnstore on fact table (best for DW)
CREATE CLUSTERED COLUMNSTORE INDEX cci_fact_sales
ON fact_sales;

-- Non-clustered columnstore on existing heap/rowstore


CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_sales_date
ON fact_sales (date_key, amount, quantity);

💡 Always use Clustered Columnstore Indexes on fact tables in SQL Server / Synapse. They compress data 5–
10x and speed up aggregation queries dramatically.
9. Partitioning
Partitioning splits a large table into smaller physical segments. Queries that filter on the partition column skip
irrelevant partitions (partition elimination).

Type How it works Best for


Range Rows assigned to partition based on Date columns — one partition per
value range month/year
List Rows assigned based on explicit value list Region, country, category — fixed set of
values
Hash Rows assigned by hash(column) mod N Even distribution when no natural
partitions range/list key
Round-Robin Rows assigned in rotation regardless of No filtering benefit — used for even
value distribution only

Example — Range Partitioning by Year (SQL Server)


-- Step 1: Partition function
CREATE PARTITION FUNCTION pf_sales_year (DATE)
AS RANGE RIGHT FOR VALUES ('2022-01-01','2023-01-01','2024-01-01');

-- Step 2: Partition scheme


CREATE PARTITION SCHEME ps_sales_year
AS PARTITION pf_sales_year ALL TO ([PRIMARY]);

-- Step 3: Create partitioned table


CREATE TABLE fact_sales (
sale_id INT, sale_dt DATE, amount DECIMAL
) ON ps_sales_year(sale_dt);

-- Partition elimination in action:


SELECT SUM(amount) FROM fact_sales WHERE sale_dt >= '2024-01-01'; -- reads only
2024 partition
10. Aggregations & Materialized Views
Pre-aggregating data avoids re-computing expensive GROUP BY operations on every query. Materialized Views
persist the result of a query as a physical table.

Approach Description Platform


Aggregate Table Manually created summary table populated All platforms
by ETL job
Materialized View DB auto-maintains aggregated result; Snowflake, BigQuery, Oracle
queries rewrite to use it
Indexed View SQL Server version of materialized view — SQL Server, Synapse
WITH SCHEMABINDING
Rollup Cube Pre-compute all GROUP BY combinations All SQL platforms
with ROLLUP / CUBE
Result Cache Warehouse caches query result; identical Snowflake, BigQuery
query returns instantly

Example — Materialized View (Snowflake)


CREATE MATERIALIZED VIEW mv_monthly_sales AS
SELECT
DATE_TRUNC('month', sale_dt) AS month,
product_key,
SUM(amount) AS total_sales,
COUNT(*) AS order_count
FROM fact_sales
GROUP BY 1, 2;

-- Queries on fact_sales with similar GROUP BY auto-use this MV

💡 Materialized Views in Snowflake and BigQuery are automatically refreshed and transparently used by the
query optimiser — no changes needed in your SQL queries.
11. Window Functions
Window functions perform calculations across a set of rows related to the current row — without collapsing
rows like GROUP BY does.

Function Purpose Example


ROW_NUMBER() Unique sequential number ROW_NUMBER() OVER (PARTITION BY dept
per partition ORDER BY salary DESC)
RANK() Rank with gaps for ties RANK() OVER (ORDER BY sales DESC) →
1,2,2,4
DENSE_RANK() Rank without gaps for ties DENSE_RANK() OVER (ORDER BY sales DESC)
→ 1,2,2,3
LAG(col,n) Value from n rows before LAG(amount,1) OVER (ORDER BY sale_dt)
current row
LEAD(col,n) Value from n rows after LEAD(amount,1) OVER (ORDER BY sale_dt)
current row
SUM() OVER() Running / cumulative total SUM(amount) OVER (PARTITION BY dept
ORDER BY sale_dt)
AVG() OVER() Moving average AVG(amount) OVER (ORDER BY sale_dt ROWS
BETWEEN 6 PRECEDING AND CURRENT ROW)
NTILE(n) Divide rows into n equal NTILE(4) OVER (ORDER BY salary) → quartile
buckets 1–4
FIRST_VALUE() First value in window FIRST_VALUE(amount) OVER (PARTITION BY
customer_key ORDER BY sale_dt)

Example — Deduplicate with ROW_NUMBER


-- Keep only the latest record per customer
WITH ranked AS (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
FROM staging_customers
)
SELECT * FROM ranked WHERE rn = 1;
12. Common DW SQL Patterns
Incremental Load — Watermark
-- 1. Get last loaded timestamp from control table
SELECT MAX(last_load_dt) FROM [Link] WHERE table_name = 'fact_sales';

-- 2. Load only new / changed rows


INSERT INTO fact_sales
SELECT * FROM [Link]
WHERE updated_at > '2024-06-01 00:00:00'; -- replace with watermark value

-- 3. Update watermark
UPDATE [Link] SET last_load_dt = GETDATE() WHERE table_name =
'fact_sales';

UPSERT / MERGE
MERGE INTO dim_customer AS target
USING staging_customer AS src
ON target.customer_id = src.customer_id AND target.is_current = 1
WHEN MATCHED AND [Link] <> [Link] THEN
UPDATE SET target.end_date = GETDATE(), target.is_current = 0 -- SCD2 expire
WHEN NOT MATCHED THEN
INSERT (customer_id, name, city, start_date, end_date, is_current)
VALUES (src.customer_id, [Link], [Link], GETDATE(), NULL, 1);

PIVOT
-- Turn row values into columns
SELECT * FROM (
SELECT region, product, amount FROM fact_sales
) src
PIVOT (
SUM(amount) FOR product IN ([Laptop],[Phone],[Tablet])
) AS pvt;
13. Data Vault Modelling
Data Vault is a DW modelling approach designed for agility, auditability, and handling multiple source systems. It
separates structure (Hubs, Links) from context (Satellites).

Component What it stores Example


Hub Unique business keys from source hub_customer (customer_hk,
systems customer_id, load_dt, record_src)
Link Relationships between Hubs link_order_customer (order_hk,
(associations) customer_hk, load_dt, record_src)
Satellite Descriptive attributes and history for sat_customer (customer_hk, name, city,
Hubs or Links load_dt, end_dt, record_src)

Data Vault vs Dimensional Model


Data Vault Dimensional Model (Kimball)
History All history by design in satellites Only if SCD Type 2 implemented
Flexibility Easy to add new sources / attributes Schema changes require more work
Query simplicity Complex — many joins Simple — star schema
Best for Enterprise DW, regulatory compliance Data marts, BI reporting

💡 Data Vault is great for raw vault layer (Silver) in a Medallion architecture. Build Business Vault or
Information Mart (Gold) on top for BI queries — simpler star schemas derived from Data Vault.
14. Modern Cloud DW Platforms
Platform Best for Key Feature
Snowflake Multi-cloud, data sharing, semi- Separate storage & compute; Time
structured data Travel; Zero-copy cloning
Google BigQuery Serverless analytics, massive scale, ML Serverless (no cluster mgmt); BQML;
inside SQL streaming inserts
Azure Synapse Microsoft ecosystem, Spark + SQL in Dedicated SQL Pools + Serverless
one platform SQL + Spark Pools
Amazon Redshift AWS ecosystem, tight S3 integration RA3 nodes with managed storage;
Redshift Spectrum for S3
Databricks SQL Lakehouse — SQL on Delta Lake with Delta Lake + Photon; Unity Catalog;
Photon engine open format
Clickhouse Real-time analytics, high-throughput Columnar, extremely fast
inserts aggregations on billions of rows

Quick Comparison — Key Differences


Feature Snowflake BigQuery Synapse Redshift
Pricing model Per second compute Per TB scanned DWU Per node-hour
+ storage / slot (dedicated) / per (RA3)
reservations TB (serverless)
Scaling Auto Fully serverless Manual scale Elastic resize
suspend/resume per — no config DWUs or (minutes)
query serverless
Semi-structured VARIANT type — JSON / ARRAY JSON via SUPER type for
native JSON/XML natively in SQL OPENJSON JSON
Time Travel Up to 90 days 7 days (table Not built-in (use Not built-in
snapshots) Delta)
15. Performance Optimization
Technique What it does Example / Note
Update Statistics Keeps query optimiser row count UPDATE STATISTICS fact_sales; or
estimates accurate ANALYZE TABLE
Partition Elimination Skip reading irrelevant partitions Filter on the partition column —
WHERE sale_dt >= '2024-01-01'
Columnstore Index Compress & batch-read columns CREATE CLUSTERED COLUMNSTORE
for aggregation queries INDEX ON fact_sales
Clustered Index Physically sort table for fast range Cluster on date_key for time-series
scans reads
Result Set Caching Return identical query instantly Snowflake & BigQuery cache query
from cache results automatically
Materialized / Indexed View Pre-compute expensive GROUP CREATE MATERIALIZED VIEW
BY mv_monthly_sales AS SELECT…
Avoid SELECT * Read only needed columns — SELECT sale_id, amount FROM
critical for columnar stores fact_sales (not *)
Avoid correlated subquery Rewrite as JOIN or CTE for set- Use CTE + JOIN instead of subquery
based execution in SELECT list
Query Hints (SQL Server) Override optimiser choices when WITH (NOLOCK), OPTION (HASH
plan is wrong JOIN), MAXDOP
Data Distribution Choose distribution key to avoid Synapse: DISTRIBUTE BY
data shuffling HASH(customer_key) on large fact
tables

Example — Common Optimisations


-- 1. Update stats after large loads
UPDATE STATISTICS fact_sales;

-- 2. Query only needed columns (columnar store benefit)


SELECT date_key, SUM(amount) FROM fact_sales
WHERE sale_dt >= '2024-01-01'
GROUP BY date_key;

-- 3. Synapse: Distributed table on fact join key


CREATE TABLE fact_sales
WITH (DISTRIBUTION = HASH(customer_key),
CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM staging_sales;

-- 4. Avoid row-by-row RBAR (Row By Agonising Row)


-- Bad: CURSOR / WHILE loop updating one row at a time
-- Good: Single UPDATE or MERGE statement on the whole set

💡 In Synapse Analytics and Redshift, choosing the right distribution key is the single biggest performance
decision — distribute large fact tables on the column most commonly used in JOINs.

— End of Data Warehousing Notes —

You might also like