0% found this document useful (0 votes)

14 views49 pages

Module1 Notes WithDiagrams

Uploaded by

udittatia561

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

14 views49 pages

Module1 Notes WithDiagrams

Uploaded by

udittatia561

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

MODULE 1: DATA MINING AND DATA

WAREHOUSING
CSA3006 — Comprehensive Study Notes with Diagrams
Dr. Rudra Kalyan Nayak | VIT Bhopal

Topics: Data Warehousing • OLAP Operations • DWH Architecture • Data Cube Computation • DWH
Implementation
1. Data Warehousing — Basic Concepts

1.1 DBMS vs RDBMS (Background)

Figure 1.1 — DBMS vs RDBMS Comparison Table

1.2 What is a Data Warehouse?

Figure 1.2 — Data Warehouse Concept (Databases + Flat Files → DWH → Data Analysis &
Visualization)

• Definition (W. H. Inmon): "A data warehouse is a subject-oriented, integrated, time-variant,

and nonvolatile collection of data in support of management's decision-making process."
• Designed for analytical needs — functions on the basis of OLAP (Online Analytical
Processing).
• Central location where consolidated data from multiple databases are stored.
• A decision support database maintained separately from the organization's operational
database.
1.3 What is Data Warehousing?

Figure 1.3 — Data Warehousing Pipeline: Data Sources → ETL Staging → DWH → Data Marts → User
Groups

• Data Warehousing = organizing & storing data to make retrieval efficient and insightful.
• Also called the process of transforming data into information.
• Flow: Data Sources → Staging Area (ETL) → Data Warehouse (Raw/Meta/Aggregate Data) →
Data Marts (Sales/Purchase/Stock) → User Groups

1.4 Four Key Properties of a Data Warehouse

Property Explanation
Subject-Oriented Organized around major subjects (customer, product, sales). Focuses on
modeling & analysis for decision makers, not on daily operations or
transaction processing.
Integrated Built by integrating multiple heterogeneous sources (relational DBs, flat files,
OLTP records). Data cleaning & integration ensures consistency in naming,
encoding, attribute measures.
Time-Variant Longer time horizon than operational systems (5–10 years historical data).
Every key structure contains an element of time. Operational DBs = current
data; DWH = historical perspective.
Nonvolatile Physically separate from operational environment. Operational updates do
Property Explanation
NOT occur here. Only two operations: initial data loading and data access.
2. Operational DB Systems vs. Data Warehouses (OLTP vs
OLAP)

2.1 Definitions
• OLTP (On-Line Transaction Processing) — Traditional relational DBMS task. Day-to-day
operations: purchasing, inventory, banking, payroll, registration, accounting.
• OLAP (On-Line Analytical Processing) — Data warehouse system task. Data analysis and
decision making.

2.2 OLTP vs OLAP — Comparison

Figure 2.1 — OLTP vs OLAP Detailed Comparison Table

Attribute OLTP OLAP

Users Clerk, IT professional Knowledge worker
Function Day-to-day operations Decision support
Attribute OLTP OLAP
DB Design Application-oriented Subject-oriented
Data Current, up-to-date, detailed, flat Historical, summarized,
relational, isolated multidimensional, integrated,
consolidated
Usage Repetitive Ad-hoc
Access Read/write; index/hash on primary key Lots of scans
Unit of Work Short, simple transaction Complex query
# Records Tens Millions
# Users Thousands Hundreds
DB Size 100MB – GB 100GB – TB
Metric Transaction throughput Query throughput, response time

2.3 Why Separate DWH from Operational DBs?

• Missing data: Decision support requires historical data which operational DBs do not typically
maintain.
• Data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources.
• Data quality: Different sources use inconsistent representations, codes, and formats — must
be reconciled.
• High performance: DBMS tuned for OLTP (indexing, concurrency, recovery); Warehouse
tuned for OLAP (complex queries, multidimensional views, consolidation).
3. Multidimensional Data Model

3.1 The Data Cube Concept

Figure 3.1 — 3D Data Cube: Products × Branches × Dates. Each cell = a measure value (e.g., £36 of
Drinks sold in Sheffield on 2nd Jan)

• A data cube is a multidimensional representation of data where each dimension = a different

attribute (Products, Branches, Dates).
• Each cell contains a measure value (e.g., £27 of Crisps sold in Leeds on 1st January).
• The cube can be pivoted in any direction — underlying data remains unchanged.
• For n dimensions → 2ⁿ possible cuboids (from apex/0-D to base/n-D).
3.2 Sample Data Cube (Product × Date × Country)

Figure 3.2 — Sample Data Cube: Product(TV,PC,VCR) × Date(1Qtr–4Qtr) ×

Country(USA,Canada,Mexico). Total Annual Sales of TV in USA highlighted.

• Dimensions: Product (TV, PC, VCR), Date (1Qtr–4Qtr + sum), Country (USA, Canada, Mexico +
sum).
• The 'sum' row/column shows aggregated totals for that dimension.
• All, All, All at the corner = apex cuboid (the grand total across all dimensions).

3.3 Key Terminology

Term Definition
Data Cube n-dimensional data representation. n dimensions → 2ⁿ possible
cuboids.
Dimension Table Stores attributes of a dimension (e.g., time_key, day, month,
quarter, year for Time).
Fact Table Central table containing dimension keys (FKs) and measures
(numerical values like units_sold, dollars_sold).
Measure Numerical function evaluated at each point in the data cube space.
Base Cuboid Lowest-level cuboid (all dimensions specified) — most detailed
Term Definition
data.
Apex Cuboid 0-D cuboid (all dimensions aggregated) — grand total. Denoted
'all'.
Cuboid A view of data cube obtained by aggregating over one or more
dimensions.

3.4 Facts & Measures

Figure 3.3 — Fact Table structure: Product dimension → Fact Table contains Product ID (dimension
key) and Number of units sold (measure)

• A fact is a measure that can be summed, averaged, or manipulated.

• Fact Table contains 2 kinds of data: a dimension key (FK) and a measure (numerical value).
• Every Dimension table is linked to a Fact table via its key.

3.5 Types of Measures

Type Condition Examples
Distributive Result from n aggregates = count(), sum(), min(), max()
result from all data
Algebraic Computed by algebraic avg(), min_N(), standard_deviation()
function from distributive
aggregates
Type Condition Examples
Holistic No constant bound on storage median(), mode(), rank()
for sub-aggregate
4. Data Warehouse Models — Schemas

4.0 Schema Overview

Figure 4.0 — Schemas: Employee table linked to Department table. A schema gives logical
description of DB with constraints and key-value links between tables.

• A schema gives the logical description of the entire database.

• Details: constraints on tables, key values present, and how key values are linked between
tables.
• DWH uses: Star Schema, Snowflake Schema, and Fact Constellation (Galaxy Schema).
4.1 Star Schema

Figure 4.1 — Star Schema: Sales Fact Table at centre, connected to Time, Item, Branch, and Location
dimension tables

• Fact table in the middle connected to a set of dimension tables.

• Simplest schema — dimensions are fully denormalized (one table per dimension, no sub-
tables).
• Each dimension table joins directly to the fact table via a foreign key.

DMQL Definition (Star Schema):

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
4.2 Snowflake Schema

Figure 4.2 — Snowflake Schema: item dimension extended with supplier table; location dimension
extended with city table

• A refinement of star schema where dimensional hierarchy is normalized into smaller dimension
tables.
• Forms a shape similar to a snowflake.
• Reduces data redundancy but requires more joins.

DMQL Definition (normalized dimensions):

define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
4.3 Fact Constellation (Galaxy Schema)

Figure 4.3 — Fact Constellation: Sales Fact Table + Shipping Fact Table both sharing Time, Item, and
Location dimension tables

• Multiple fact tables share dimension tables.

• Also called Galaxy Schema — viewed as a collection of stars.

DMQL for Shipping cube (using shared dimensions from Sales):

define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

4.4 Concept Hierarchies

Hierarchies allow data to be summarized at multiple levels of abstraction:
• Location: office → city → country → region → all
• Time: day → week → month → quarter → year → all
• Product: product → category → industry → all
5. OLAP Operations in the Multidimensional Model

5.1 Roll-Up (Drill-Up / Aggregation)

Figure 5.1 — Roll-Up: Locations climb from cities (Chicago, New York, Toronto, Vancouver) up to
countries (USA, Canada)
• Roll-Up = aggregation by climbing UP a concept hierarchy or by dimension reduction.
• Zoom OUT — moves from detailed data to more summarized, aggregated data.
• Example (Location): Roll-up from cities → countries. Individual city sales merged into country
totals.
• Example (Temperature): Values (64,65,68...) grouped into cool (64–69), mild (70–75), hot (80–
85).
• Result: Fewer cells — data at a higher, more abstract level.
5.2 Drill-Down (Roll-Down)

Figure 5.2 — Drill-Down: Time dimension expands from quarters (Q1,Q2,Q3,Q4) to months
(Jan,Feb,Mar...)

• Drill-Down = reverse of roll-up. Navigates from less detailed to MORE detailed data.
• Zoom IN — adds more detail to the current view.
• Can be performed by stepping DOWN a concept hierarchy OR by adding additional dimensions.
• Example: Drill-down on time from quarters → months. Q1 expands into Jan, Feb, Mar.
5.3 Slice

Figure 5.3 — Slice: Fixing time = Q1 produces a 2D sub-table of Location × Item for Q1 only (removes
the time dimension)

• Slice = subset of cube for a SINGLE VALUE on one dimension.

• Reduces cube dimensionality by 1 — a 3D cube becomes a 2D table.
• Example: Slice for time = 'Q1' → 2D table (Location × Item) showing only Q1 data.
• Selection: equality on one dimension value (e.g., time = Q1).

5.4 Dice

Figure 5.4 — Dice: Sub-cube for location ∈ {Toronto,Vancouver} AND time ∈ {Q1,Q2} AND item ∈
{Mobile,Modem}

• Dice = sub-cube by applying selection on TWO OR MORE dimensions.

• Unlike slice (one equality), dice selects ranges/multiple values on multiple dimensions
simultaneously.
• Result is still a sub-cube (not a flat table).
Example — three criteria simultaneously:
◦ location = 'Toronto' OR 'Vancouver'
◦ time = 'Q1' OR 'Q2'
◦ item = 'Mobile' OR 'Modem'
5.5 Pivot (Rotate)

Figure 5.5 — Pivot/Rotation: cube axes reoriented — Temperature axis and Time axis swap positions
Figure 5.6 — Pivot example table: Location×Item → Item×Location; and Other OLAP operations
summary

• Pivot (also called Rotation) — rotates data axes to give an alternative presentation.
• May involve swapping rows and columns, or moving a row-dimension into column dimension.
• The underlying data does NOT change — only the view orientation changes.
• Example: Rows=Item(types), Columns=Location(cities) → pivoted to Rows=Location,
Columns=Item.
5.6 Other OLAP Operations
• Drill-Across: Executes queries involving more than one fact table.
• Drill-Through: Uses relational SQL to drill through the bottom level of the cube to back-end
relational tables.
• Ranking: Ranking top-N or bottom-N elements in lists.
• Statistical: Moving average, growth rates, interest, IRR, depreciation, currency conversions.
• Analytical Modeling: Calculation engine for ratios, variance, forecasting, trend analysis, and
statistical analysis.
6. Data Warehouse Architecture — Multi-Tiered

6.1 Multi-Tiered Architecture

Figure 6.1 — Multi-Tiered DWH Architecture: Data Sources (Tier1) → Data Storage with ETL, DWH,
Data Marts, Metadata (Tier2) → OLAP Engine → Front-End Tools (Tier3)

• Tier 1 — Data Sources: Operational DBs and external sources (raw data producers).
• Tier 2 — Data Storage: ETL (Extract/Transform/Load/Refresh) → Data Warehouse (Metadata
+ Monitor & Integrator) → Data Marts.
• Tier 3 — OLAP Engine + Front-End: OLAP Server → Analysis, Query, Reports, and Data
Mining tools.

6.2 ETL — Extract, Transform, Load

• Data Extraction: Get data from multiple, heterogeneous, and external sources.
• Data Cleaning: Detect errors in the data and rectify them when possible.
• Data Transformation: Convert data from legacy/host format to warehouse format.
• Load: Sort, summarize, consolidate, compute views, check integrity, build indices and
partitions.
• Refresh: Propagate updates from data sources to the warehouse.

6.3 Metadata Repository

• Structure: Schema, view, dimensions, hierarchies, derived data definitions, data mart
locations/contents.
• Operational: Data lineage, currency of data (active/archived/purged), monitoring info (usage
stats, error reports, audit trails).
• Summarization algorithms, Mapping metadata, Performance metadata, Business
metadata.

6.4 Data Warehouse Model Scope Types

Figure 6.2 — Enterprise DWH model hierarchy: Enterprise DWH → Multi-Tier DWH → Distributed Data
Marts

• Enterprise Warehouse: ALL information spanning the entire organization.

• Data Mart: Subset of corporate-wide data for a specific user group (e.g., marketing). Can be
independent or dependent.
• Virtual Warehouse: Set of views over operational databases — only some summary views
materialized.
6.5 OLAP Server Types

Figure 6.3 — ROLAP vs MOLAP vs HOLAP comparison: characteristics and trade-offs

Type Description Key Points

ROLAP Relational OLAP — Uses Greater scalability; handles large/volatile
relational/extended-relational DBMS for data; poor query performance; requires
storage. expert users.
MOLAP Multidimensional OLAP — Array-based Fast indexing to pre-computed data;
storage with sparse matrix techniques. easy to use; fast IR; cannot handle
detailed data; weak DBMS facility.
HOLAP Hybrid OLAP — Relational for low-level, Combines advantages of both ROLAP
array-based for high-level. and MOLAP.

6.6 Design Process

1. Choose a business process to model (e.g., orders, invoices, payroll).
2. Choose the grain — the atomic level of data in the business process.
3. Choose the dimensions that will apply to each fact table record.
4. Choose the measures that will populate each fact table record.
7. Data Cube Computation

7.1 Cube Materialization Concepts

Figure 7.1 — Full Cube vs Iceberg Cube vs Closed Cube. Shows base cells, aggregate cells (marked
*), and ancestor/descendant relationships.
• Base Cell: A cell in the base cuboid.
• Aggregate Cell: A cell in a non-base cuboid. Each aggregated dimension is indicated as '*'.
• Ancestor/Descendant Cells: In a 3D cube, 1D and 2D cells are ancestors of 3D cells; 3D cells
are descendants.

Three choices for materialization:

• No Materialization: Do not pre-compute any 'nonbase' cuboids → expensive on-the-fly
multidimensional aggregation — extremely slow.
• Full Materialization: Pre-compute ALL cuboids. Resulting lattice = full cube. Requires huge
memory space.
• Partial Materialization: Selectively compute a proper subset of cuboids. (i) Identify subset to
materialize, (ii) Exploit during query processing, (iii) Efficiently update during load/refresh.

Three types of cubes:

• Full Cube: All cells of all cuboids computed.
• Iceberg Cube: Only cells whose measure satisfies the iceberg condition (min support
threshold). Only small portion 'above the water' in sparse cube.
• Closed Cube: Only closed cells. A cell is closed if it has no descendant cell with the same
measure value.

compute cube sales_iceberg as select month, city, customer_group, count(*)

from salesInfo cube by month, city, customer_group having count(*) >=
min_support

7.2 Algorithms for Computing Cubes

7.2.1 Multi-Way Array Aggregation (MOLAP — Bottom-Up)

Figure 7.2 — Multi-Way Array: 3D array partitioned into 64 memory-based chunks (a0b0c0 to a3b3c3).
Dimensions A,B,C each split into 4 partitions.
Figure 7.3 — Multi-Way Array traversal order: sequential 1..64 vs interleaved. Sequential order
minimizes memory (1,56,000 vs 16,41,000 units).

• Computes full cube using a multi-dimensional array partitioned into chunks (small sub-cubes
fitting in memory).
• Compressed sparse array addressing: (chunk_id, offset).
• Compute aggregates 'multiway' by visiting cells in order that minimizes visits → reduces
memory and storage cost.
Best traversal order for dimensions A=40, B=400, C=4000 (partitions 10,100,1000):
◦ Ordering 1,2,...,64: memory = 40×400 + 40×1000 + 100×1000 = 1,56,000 units ✓
◦ Ordering 1,17,33,...: memory = 400×4000 + 40×1000 + 10×100 = 16,41,000 units ✗
◦ Conclusion: Sequential 1..64 order is most memory-efficient.

7.2.2 BUC — Bottom-Up Construction

Figure 7.4 — BUC algorithm: builds from apex toward base cuboid. Star-Cubing lattice showing
shared dimensions (ABC/ABC, ABD/AB, ACD/A, etc.)
• BUC = 'bottom-up construction' algorithm for computing sparse and iceberg cubes.
• Despite the name, BUC builds from APEX cuboid toward the base cuboid.
• Allows BUC to share data partitioning costs.
• Processing order allows PRUNING during construction using the Apriori property.
• Best order minimizes memory requirement and reduces I/Os.

7.2.3 Star-Cubing — Integrating Top-Down and Bottom-Up

• Computes iceberg cubes by integrating both top-down and bottom-up approaches.
• Shared dimensions: E.g., dimension A is the shared dimension of ACD and AD. ABD/AB
means cuboid ABD has shared dimensions AB.
• Allows shared computations: AB computed simultaneously with ABD.
• Aggregates top-down but with bottom-up sub-layer allowing Apriori pruning.
• Shared dimensions grow in bottom-up fashion.
7.3 Compute Cube Operator

Figure 7.5 — DWH Implementation: compute cube operator, cuboid lattice diagram for (city,item,year),
and three materialization strategies

• Data cube = lattice of cuboids. Bottom-most = base cuboid; top-most (apex) = one cell.
• Total cuboids = 2ⁿ where n = number of dimensions.
• Example: item, city, year → 2³ = 8 cuboids: {(city,item,year),(city,item),(city,year),(item,year),
(city),(item),(year),()}
8. Indexing OLAP Data

8.1 Bitmap Indexing

Figure 8.1 — Bitmap Index tables: Base table (RID, item, city), Item bitmap index (H/C/P/S bit
columns), City bitmap index (V/T bit columns)
• Alternative representation of base table for quick searching in data cubes.
• For each attribute value v in its domain, there is a distinct bit vector Bv.
• If domain has n values → n bits needed per entry in bitmap index.
• Example: Item with 4 values (H,C,P,S) → 4-column bitmap. City with (V,T) → 2-column bitmap.
• Enables fast bitwise AND/OR operations to identify matching records.
8.2 Join Indexing

Figure 8.2 — Join Index: location/sales join index, item/sales join index, and combined
location/item/sales join index linking fact table rows to dimension values

• Join indexing registers joinable rows of two relations from a relational database.
• If R(RID, A) and S(B, SID) join on A and B → join index record = pair (RID, SID).
• Identifies joinable tuples without full joins.
• In DWH: join indexing is useful for cross-table search because of the star schema model.
• Maintains relationship between dimension attribute values and corresponding rows in the fact
table.

8.3 Efficient OLAP Query Processing

• Purpose of materialization and OLAP indexing = speed up query processing.
5. Determine which operation should be performed on available data cubes.
6. Determine to which materialized cuboids the relevant operations should be applied.
9. Data Warehouse Usage — From DWH to Data Mining

9.1 Three Kinds of DWH Applications

Figure 9.1 — DWH Application types (Information Processing, Analytical Processing, Data Mining)
and OLAM architecture with four layers
7. Information Processing: Querying, basic statistical analysis, reporting using crosstabs, tables,
charts, graphs.
8. Analytical Processing: Basic OLAP operations — slice-dice, drilling, pivoting. Multidimensional
analysis of historical data.
9. Data Mining: Knowledge discovery from hidden patterns, associations, analytical models,
classification, prediction.

9.2 From OLAP to OLAM (Online Analytical Mining)

• OLAM = Online Analytical Mining = OLAP + Data Mining in a multidimensional database or
DWH.
Why Online Analytical Mining? Four key reasons:
10. High quality data: DWH has integrated, consistent, cleaned data. Costly preprocessing already
done.
11. Available infrastructure: ODBC, OLEDB, Web access, OLAP tools constructed systematically.
12. OLAP-based exploratory analysis: Mining at different levels/subsets; mining with drilling,
dicing, pivoting.
13. On-line selection of mining functions: Integration with multiple mining functions, algorithms,
and tasks.
9.3 OLAM System Architecture (4 Layers)

Figure 9.2 — OLAM Architecture: Layer 1=Data Repository, Layer 2=MDDB, Layer 3=OLAP+OLAM
Engines, Layer 4=User Interface

• Layer 1 — Data Repository: Databases → data cleaning → data integration → Data

Warehouse.
• Layer 2 — MDDB: Multidimensional Database accessed via Database API and Data Cube API.
Includes Meta Data.
• Layer 3 — OLAP/OLAM Engine: Both engines accept user queries via GUI API and work with
data cube via Data Cube API.
• Layer 4 — User Interface: Mining queries sent in; Mining results returned to the user.
• OLAM server performs multiple data mining tasks — more sophisticated than OLAP server.
10. Practice Questions

Q1 — Doctor/Patient Data Warehouse

DWH with 3 dimensions: time, doctor, patient; and 2 measures: count and charge.

(a) Three schema classes:

14. Star Schema
15. Snowflake Schema
16. Fact Constellation (Galaxy Schema)

(b) Star schema for this DWH:

• Fact: VISIT (time_key, doctor_key, patient_key, count, charge)
• Dimension: Time (time_key, day, month, year)
• Dimension: Doctor (doctor_key, doctor_name, specialization, hospital)
• Dimension: Patient (patient_key, patient_name, age, address)

(c) OLAP operations to list total fee per doctor in 2004:

17. Roll-up on time from 'day' to 'year'.
18. Slice on year = 2004.
19. Roll-up on patient (aggregate over all patients).
20. Measure: sum(charge) grouped by doctor.

(d) SQL query:

SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2004
GROUP BY doctor;

Q2 — Big-University DWH
4 dimensions: student, course, semester, instructor. 2 measures: count, avg_grade.

(a) Snowflake schema:

• Fact: GRADES (student_key, course_key, semester_key, instructor_key, count, avg_grade)
• student → major → status → university → all
• course → department → all
• semester → year → all
• instructor → department → all

(b) OLAP ops to list avg_grade of CS courses for each student:

21. Slice on course: select only CS courses.
22. Roll-up on semester to 'all'.
23. Roll-up on instructor to 'all'.

(c) Number of cuboids (each dimension has 5 levels):

• Total = 5 × 5 × 5 × 5 = 625 cuboids.
11. Quick Revision Summary Table

Concept Key Point / One-Line Summary

DWH (Inmon) Subject-Oriented + Integrated + Time-Variant + Nonvolatile
OLTP vs OLAP OLTP = current, detailed, write-heavy; OLAP = historical,
summarized, read-heavy
Star Schema Fact table in centre + denormalized dimension tables directly
connected
Snowflake Schema Star schema with normalized dimension tables (extra sub-tables)
Fact Constellation Multiple fact tables sharing common dimension tables
Roll-Up Climb concept hierarchy = less detail, more aggregation (ZOOM
OUT)
Drill-Down Go down concept hierarchy = more detail (ZOOM IN)
Slice Fix ONE dimension to a single value = reduce cube dimensionality
by 1
Dice Fix MULTIPLE dimensions to value ranges = produce sub-cube
Pivot/Rotate Reorient cube axes for alternative 2D view — data unchanged
Full Cube Pre-compute ALL cuboids — huge memory required
Iceberg Cube Compute only cells meeting minimum support threshold
Closed Cube Only cells with no descendant having the same measure value
Multi-Way Array Bottom-up MOLAP: partition into chunks; sequential order most
efficient
BUC 'Bottom-up construction' — apex to base, uses Apriori pruning
Star-Cubing Integrates top-down + bottom-up; exploits shared dimensions
ROLAP Relational OLAP — scalable, handles large/volatile data, slower
queries
MOLAP Multidimensional OLAP — fast, pre-computed, array-based
HOLAP Hybrid OLAP — combines ROLAP (low-level) + MOLAP (high-
level)
ETL Extract → Transform → Load (populates and refreshes DWH)
Metadata Data about data: structure, lineage, algorithms, business terms
Data Mart Subject-specific subset of DWH for a specific user group
Bitmap Index Bit vector per attribute value — fast bitwise search in cube
Join Index Registers joinable rows — useful in star schema cross-table search
OLAM Online Analytical Mining = OLAP + Data Mining integration
2ⁿ cuboids Total cuboids in an n-dimensional data cube = 2 raised to n
— End of Module 1 (Parts 1, 2, and 3) Notes —
Upload remaining PDFs to continue with Introduction to Data Mining.

Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
26 pages
Data Warehousing Concepts Explained
No ratings yet
Data Warehousing Concepts Explained
40 pages
Data Warehouse and OLAP Overview
No ratings yet
Data Warehouse and OLAP Overview
60 pages
Cognos Data Warehouse Overview
100% (2)
Cognos Data Warehouse Overview
56 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
47 pages
OLAP Tools in Data Warehousing
No ratings yet
OLAP Tools in Data Warehousing
20 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
34 pages
Introduction To DataWarehouse
No ratings yet
Introduction To DataWarehouse
32 pages
Data Warehouse and OLAP Fundamentals
No ratings yet
Data Warehouse and OLAP Fundamentals
28 pages
Data Warehouse Concepts Overview
No ratings yet
Data Warehouse Concepts Overview
58 pages
Data Warehouse Concepts and OLAP Operations
No ratings yet
Data Warehouse Concepts and OLAP Operations
49 pages
Data Warehousing and OLAP Overview
100% (1)
Data Warehousing and OLAP Overview
47 pages
Data Warehousing Fundamentals Explained
No ratings yet
Data Warehousing Fundamentals Explained
32 pages
Data Warehousing Concepts and Techniques
No ratings yet
Data Warehousing Concepts and Techniques
16 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
60 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
53 pages
Data Warehouse Concepts and Models
No ratings yet
Data Warehouse Concepts and Models
65 pages
Data Warehouse Operations Overview
No ratings yet
Data Warehouse Operations Overview
18 pages
Data Warehousing and OLAP Concepts
No ratings yet
Data Warehousing and OLAP Concepts
18 pages
Data Cube vs. Data Warehouse Explained
No ratings yet
Data Cube vs. Data Warehouse Explained
30 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
58 pages
Data Warehouse Components Explained
No ratings yet
Data Warehouse Components Explained
27 pages
Introduction to Data Warehousing Concepts
No ratings yet
Introduction to Data Warehousing Concepts
69 pages
Defining Data Warehousing Concepts
No ratings yet
Defining Data Warehousing Concepts
53 pages
(2026!04!27) - Data Warehouse - Lecture 3
No ratings yet
(2026!04!27) - Data Warehouse - Lecture 3
40 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
55 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
36 pages
Data Warehouse and OLAP Overview
No ratings yet
Data Warehouse and OLAP Overview
71 pages
Data Warehousing Concepts by Hammad Afzal
No ratings yet
Data Warehousing Concepts by Hammad Afzal
25 pages
Data Warehousing Basics and Concepts
No ratings yet
Data Warehousing Basics and Concepts
50 pages
Data Warehouse and OLAP Overview
No ratings yet
Data Warehouse and OLAP Overview
55 pages
Data Warehousing Concepts and Techniques
No ratings yet
Data Warehousing Concepts and Techniques
31 pages
Data Warehousing Overview for B.Tech
No ratings yet
Data Warehousing Overview for B.Tech
60 pages
Data Warehouse Concepts and Benefits
No ratings yet
Data Warehouse Concepts and Benefits
24 pages
Data Warehouse 1
No ratings yet
Data Warehouse 1
152 pages
Data Warehousing Fundamentals
No ratings yet
Data Warehousing Fundamentals
74 pages
Unit 2
No ratings yet
Unit 2
4 pages
Defining OLAP in Data Warehousing
No ratings yet
Defining OLAP in Data Warehousing
37 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
63 pages
Data Warehousing and OLAP Techniques
No ratings yet
Data Warehousing and OLAP Techniques
57 pages
Understanding Data Warehousing Concepts
100% (1)
Understanding Data Warehousing Concepts
44 pages
MG4DSECSC200: Data Mining
No ratings yet
MG4DSECSC200: Data Mining
48 pages
DMDW Unit-1
No ratings yet
DMDW Unit-1
47 pages
NuoDB Analyst Tool Overview
No ratings yet
NuoDB Analyst Tool Overview
29 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
30 pages
Data Warehouse: Hierarchical Data Processing
No ratings yet
Data Warehouse: Hierarchical Data Processing
68 pages
Data Warehouse Architecture Overview
100% (1)
Data Warehouse Architecture Overview
43 pages
Data Warehousing and OLAP Overview
No ratings yet
Data Warehousing and OLAP Overview
30 pages
Data Warehousing and OLAP for Mining
No ratings yet
Data Warehousing and OLAP for Mining
57 pages
Understanding Data Warehousing Concepts
No ratings yet
Understanding Data Warehousing Concepts
249 pages
Understanding Cuboid Data Structures
No ratings yet
Understanding Cuboid Data Structures
38 pages
Data Warehousing and OLAP Concepts
No ratings yet
Data Warehousing and OLAP Concepts
93 pages
Understanding Data Warehouse Concepts
100% (1)
Understanding Data Warehouse Concepts
29 pages
Data Warehousing and OLAP Concepts
No ratings yet
Data Warehousing and OLAP Concepts
58 pages
Data Warehousing and OLAP Techniques
No ratings yet
Data Warehousing and OLAP Techniques
57 pages
Data Warehouse & OLAP Overview Guide
No ratings yet
Data Warehouse & OLAP Overview Guide
36 pages
Introduction to Data Warehousing Concepts
No ratings yet
Introduction to Data Warehousing Concepts
35 pages
Concepts and Techniques: - Chapter 4
No ratings yet
Concepts and Techniques: - Chapter 4
51 pages
Data Warehouse Modeling: OLAP & Data Cubes
No ratings yet
Data Warehouse Modeling: OLAP & Data Cubes
12 pages
IoT Applications and Architecture Overview
No ratings yet
IoT Applications and Architecture Overview
26 pages
Database Integrity and Security Overview
No ratings yet
Database Integrity and Security Overview
25 pages
Job Application Tracker Template
No ratings yet
Job Application Tracker Template
12 pages
Introduction to Big Data Analytics
No ratings yet
Introduction to Big Data Analytics
14 pages
Installing OpenVAS On Kali Linux
No ratings yet
Installing OpenVAS On Kali Linux
10 pages
IBM CMIS For Content Manager OnDemand For Dummies
No ratings yet
IBM CMIS For Content Manager OnDemand For Dummies
54 pages
HND Mock Exam: Information Systems Management
No ratings yet
HND Mock Exam: Information Systems Management
3 pages
Peripherals and I/O Communication Explained
No ratings yet
Peripherals and I/O Communication Explained
28 pages
DBMS Normalization and SQL Overview
No ratings yet
DBMS Normalization and SQL Overview
4 pages
Oracle Database Installation Guide
No ratings yet
Oracle Database Installation Guide
9 pages
Zscaler Private Access Training Overview
100% (1)
Zscaler Private Access Training Overview
89 pages
ICT Safety and E-Safety Guidelines
No ratings yet
ICT Safety and E-Safety Guidelines
31 pages
Django Framework Course Syllabus
No ratings yet
Django Framework Course Syllabus
6 pages
Huawei OceanProtect Backup Overview
No ratings yet
Huawei OceanProtect Backup Overview
33 pages
DB2 11 Query Optimizer Perspective
No ratings yet
DB2 11 Query Optimizer Perspective
47 pages
Oracle Data Integrator Overview and Guide
No ratings yet
Oracle Data Integrator Overview and Guide
4 pages
Zabbix vs. Datadog Comparison Guide
No ratings yet
Zabbix vs. Datadog Comparison Guide
81 pages
Cisco AMP for Endpoints User Guide
No ratings yet
Cisco AMP for Endpoints User Guide
225 pages
NSC March 2016 Exam MS Final
No ratings yet
NSC March 2016 Exam MS Final
13 pages
CIS Audit Quiz: Key Terms & Concepts
No ratings yet
CIS Audit Quiz: Key Terms & Concepts
3 pages
School Management System Overview
No ratings yet
School Management System Overview
6 pages
CI/CD in Microsoft Fabric with Azure DevOps
No ratings yet
CI/CD in Microsoft Fabric with Azure DevOps
20 pages
Overview of the DoD TCP/IP Model
No ratings yet
Overview of the DoD TCP/IP Model
3 pages
Spicy Bistro: Real-Time Project Report
No ratings yet
Spicy Bistro: Real-Time Project Report
50 pages
Cisco Nexus Dashboard Overview and Tools
No ratings yet
Cisco Nexus Dashboard Overview and Tools
22 pages
Introduction to C++ Programming Guide
No ratings yet
Introduction to C++ Programming Guide
14 pages
Software Analysis & Development Course
No ratings yet
Software Analysis & Development Course
2 pages
Pre-Screening Assessment for Tech Candidates
No ratings yet
Pre-Screening Assessment for Tech Candidates
4 pages
Ensuring The Quality of Information
No ratings yet
Ensuring The Quality of Information
17 pages
JYSK Nordic Streamlines AD Management
No ratings yet
JYSK Nordic Streamlines AD Management
3 pages

Module1 Notes WithDiagrams

Uploaded by

Module1 Notes WithDiagrams

Uploaded by

MODULE 1: DATA MINING AND DATA

1.1 DBMS vs RDBMS (Background)

Figure 1.1 — DBMS vs RDBMS Comparison Table

• Definition (W. H. Inmon): "A data warehouse is a subject-oriented, integrated, time-variant,

1.4 Four Key Properties of a Data Warehouse

2.2 OLTP vs OLAP — Comparison

Figure 2.1 — OLTP vs OLAP Detailed Comparison Table

Attribute OLTP OLAP

2.3 Why Separate DWH from Operational DBs?

3.1 The Data Cube Concept

• A data cube is a multidimensional representation of data where each dimension = a different

Figure 3.2 — Sample Data Cube: Product(TV,PC,VCR) × Date(1Qtr–4Qtr) ×

3.3 Key Terminology

3.4 Facts & Measures

• A fact is a measure that can be summed, averaged, or manipulated.

3.5 Types of Measures

4.0 Schema Overview

• A schema gives the logical description of the entire database.

• Fact table in the middle connected to a set of dimension tables.

DMQL Definition (Star Schema):

DMQL Definition (normalized dimensions):

• Multiple fact tables share dimension tables.

DMQL for Shipping cube (using shared dimensions from Sales):

4.4 Concept Hierarchies

5.1 Roll-Up (Drill-Up / Aggregation)

• Slice = subset of cube for a SINGLE VALUE on one dimension.

• Dice = sub-cube by applying selection on TWO OR MORE dimensions.

6.1 Multi-Tiered Architecture

6.2 ETL — Extract, Transform, Load

6.3 Metadata Repository

6.4 Data Warehouse Model Scope Types

• Enterprise Warehouse: ALL information spanning the entire organization.

Figure 6.3 — ROLAP vs MOLAP vs HOLAP comparison: characteristics and trade-offs

Type Description Key Points

6.6 Design Process

7.1 Cube Materialization Concepts

Three choices for materialization:

Three types of cubes:

compute cube sales_iceberg as select month, city, customer_group, count(*)

7.2 Algorithms for Computing Cubes

7.2.2 BUC — Bottom-Up Construction

7.2.3 Star-Cubing — Integrating Top-Down and Bottom-Up

8.1 Bitmap Indexing

8.3 Efficient OLAP Query Processing

9.1 Three Kinds of DWH Applications

9.2 From OLAP to OLAM (Online Analytical Mining)

• Layer 1 — Data Repository: Databases → data cleaning → data integration → Data

Q1 — Doctor/Patient Data Warehouse

(a) Three schema classes:

(b) Star schema for this DWH:

(c) OLAP operations to list total fee per doctor in 2004:

(d) SQL query:

(a) Snowflake schema:

(b) OLAP ops to list avg_grade of CS courses for each student:

(c) Number of cuboids (each dimension has 5 levels):

Concept Key Point / One-Line Summary

You might also like