0% found this document useful (0 votes)
14 views49 pages

Module1 Notes WithDiagrams

Uploaded by

udittatia561
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views49 pages

Module1 Notes WithDiagrams

Uploaded by

udittatia561
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

MODULE 1: DATA MINING AND DATA

WAREHOUSING
CSA3006 — Comprehensive Study Notes with Diagrams
Dr. Rudra Kalyan Nayak | VIT Bhopal

Topics: Data Warehousing • OLAP Operations • DWH Architecture • Data Cube Computation • DWH
Implementation
1. Data Warehousing — Basic Concepts

1.1 DBMS vs RDBMS (Background)

Figure 1.1 — DBMS vs RDBMS Comparison Table


1.2 What is a Data Warehouse?

Figure 1.2 — Data Warehouse Concept (Databases + Flat Files → DWH → Data Analysis &
Visualization)

• Definition (W. H. Inmon): "A data warehouse is a subject-oriented, integrated, time-variant,


and nonvolatile collection of data in support of management's decision-making process."
• Designed for analytical needs — functions on the basis of OLAP (Online Analytical
Processing).
• Central location where consolidated data from multiple databases are stored.
• A decision support database maintained separately from the organization's operational
database.
1.3 What is Data Warehousing?

Figure 1.3 — Data Warehousing Pipeline: Data Sources → ETL Staging → DWH → Data Marts → User
Groups

• Data Warehousing = organizing & storing data to make retrieval efficient and insightful.
• Also called the process of transforming data into information.
• Flow: Data Sources → Staging Area (ETL) → Data Warehouse (Raw/Meta/Aggregate Data) →
Data Marts (Sales/Purchase/Stock) → User Groups

1.4 Four Key Properties of a Data Warehouse


Property Explanation
Subject-Oriented Organized around major subjects (customer, product, sales). Focuses on
modeling & analysis for decision makers, not on daily operations or
transaction processing.
Integrated Built by integrating multiple heterogeneous sources (relational DBs, flat files,
OLTP records). Data cleaning & integration ensures consistency in naming,
encoding, attribute measures.
Time-Variant Longer time horizon than operational systems (5–10 years historical data).
Every key structure contains an element of time. Operational DBs = current
data; DWH = historical perspective.
Nonvolatile Physically separate from operational environment. Operational updates do
Property Explanation
NOT occur here. Only two operations: initial data loading and data access.
2. Operational DB Systems vs. Data Warehouses (OLTP vs
OLAP)

2.1 Definitions
• OLTP (On-Line Transaction Processing) — Traditional relational DBMS task. Day-to-day
operations: purchasing, inventory, banking, payroll, registration, accounting.
• OLAP (On-Line Analytical Processing) — Data warehouse system task. Data analysis and
decision making.

2.2 OLTP vs OLAP — Comparison

Figure 2.1 — OLTP vs OLAP Detailed Comparison Table

Attribute OLTP OLAP


Users Clerk, IT professional Knowledge worker
Function Day-to-day operations Decision support
Attribute OLTP OLAP
DB Design Application-oriented Subject-oriented
Data Current, up-to-date, detailed, flat Historical, summarized,
relational, isolated multidimensional, integrated,
consolidated
Usage Repetitive Ad-hoc
Access Read/write; index/hash on primary key Lots of scans
Unit of Work Short, simple transaction Complex query
# Records Tens Millions
# Users Thousands Hundreds
DB Size 100MB – GB 100GB – TB
Metric Transaction throughput Query throughput, response time

2.3 Why Separate DWH from Operational DBs?


• Missing data: Decision support requires historical data which operational DBs do not typically
maintain.
• Data consolidation: DS requires consolidation (aggregation, summarization) of data from
heterogeneous sources.
• Data quality: Different sources use inconsistent representations, codes, and formats — must
be reconciled.
• High performance: DBMS tuned for OLTP (indexing, concurrency, recovery); Warehouse
tuned for OLAP (complex queries, multidimensional views, consolidation).
3. Multidimensional Data Model

3.1 The Data Cube Concept

Figure 3.1 — 3D Data Cube: Products × Branches × Dates. Each cell = a measure value (e.g., £36 of
Drinks sold in Sheffield on 2nd Jan)

• A data cube is a multidimensional representation of data where each dimension = a different


attribute (Products, Branches, Dates).
• Each cell contains a measure value (e.g., £27 of Crisps sold in Leeds on 1st January).
• The cube can be pivoted in any direction — underlying data remains unchanged.
• For n dimensions → 2ⁿ possible cuboids (from apex/0-D to base/n-D).
3.2 Sample Data Cube (Product × Date × Country)

Figure 3.2 — Sample Data Cube: Product(TV,PC,VCR) × Date(1Qtr–4Qtr) ×


Country(USA,Canada,Mexico). Total Annual Sales of TV in USA highlighted.

• Dimensions: Product (TV, PC, VCR), Date (1Qtr–4Qtr + sum), Country (USA, Canada, Mexico +
sum).
• The 'sum' row/column shows aggregated totals for that dimension.
• All, All, All at the corner = apex cuboid (the grand total across all dimensions).

3.3 Key Terminology


Term Definition
Data Cube n-dimensional data representation. n dimensions → 2ⁿ possible
cuboids.
Dimension Table Stores attributes of a dimension (e.g., time_key, day, month,
quarter, year for Time).
Fact Table Central table containing dimension keys (FKs) and measures
(numerical values like units_sold, dollars_sold).
Measure Numerical function evaluated at each point in the data cube space.
Base Cuboid Lowest-level cuboid (all dimensions specified) — most detailed
Term Definition
data.
Apex Cuboid 0-D cuboid (all dimensions aggregated) — grand total. Denoted
'all'.
Cuboid A view of data cube obtained by aggregating over one or more
dimensions.

3.4 Facts & Measures

Figure 3.3 — Fact Table structure: Product dimension → Fact Table contains Product ID (dimension
key) and Number of units sold (measure)

• A fact is a measure that can be summed, averaged, or manipulated.


• Fact Table contains 2 kinds of data: a dimension key (FK) and a measure (numerical value).
• Every Dimension table is linked to a Fact table via its key.

3.5 Types of Measures


Type Condition Examples
Distributive Result from n aggregates = count(), sum(), min(), max()
result from all data
Algebraic Computed by algebraic avg(), min_N(), standard_deviation()
function from distributive
aggregates
Type Condition Examples
Holistic No constant bound on storage median(), mode(), rank()
for sub-aggregate
4. Data Warehouse Models — Schemas

4.0 Schema Overview

Figure 4.0 — Schemas: Employee table linked to Department table. A schema gives logical
description of DB with constraints and key-value links between tables.

• A schema gives the logical description of the entire database.


• Details: constraints on tables, key values present, and how key values are linked between
tables.
• DWH uses: Star Schema, Snowflake Schema, and Fact Constellation (Galaxy Schema).
4.1 Star Schema

Figure 4.1 — Star Schema: Sales Fact Table at centre, connected to Time, Item, Branch, and Location
dimension tables

• Fact table in the middle connected to a set of dimension tables.


• Simplest schema — dimensions are fully denormalized (one table per dimension, no sub-
tables).
• Each dimension table joins directly to the fact table via a foreign key.

DMQL Definition (Star Schema):


define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state,
country)
4.2 Snowflake Schema

Figure 4.2 — Snowflake Schema: item dimension extended with supplier table; location dimension
extended with city table

• A refinement of star schema where dimensional hierarchy is normalized into smaller dimension
tables.
• Forms a shape similar to a snowflake.
• Reduces data redundancy but requires more joins.

DMQL Definition (normalized dimensions):


define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
units_sold = count(*)
define dimension item as (item_key, item_name, brand, type,
supplier(supplier_key, supplier_type))
define dimension location as (location_key, street, city(city_key,
province_or_state, country))
4.3 Fact Constellation (Galaxy Schema)

Figure 4.3 — Fact Constellation: Sales Fact Table + Shipping Fact Table both sharing Time, Item, and
Location dimension tables

• Multiple fact tables share dimension tables.


• Also called Galaxy Schema — viewed as a collection of stars.

DMQL for Shipping cube (using shared dimensions from Sales):


define cube shipping [time, item, shipper, from_location, to_location]:
dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

4.4 Concept Hierarchies


Hierarchies allow data to be summarized at multiple levels of abstraction:
• Location: office → city → country → region → all
• Time: day → week → month → quarter → year → all
• Product: product → category → industry → all
5. OLAP Operations in the Multidimensional Model

5.1 Roll-Up (Drill-Up / Aggregation)

Figure 5.1 — Roll-Up: Locations climb from cities (Chicago, New York, Toronto, Vancouver) up to
countries (USA, Canada)
• Roll-Up = aggregation by climbing UP a concept hierarchy or by dimension reduction.
• Zoom OUT — moves from detailed data to more summarized, aggregated data.
• Example (Location): Roll-up from cities → countries. Individual city sales merged into country
totals.
• Example (Temperature): Values (64,65,68...) grouped into cool (64–69), mild (70–75), hot (80–
85).
• Result: Fewer cells — data at a higher, more abstract level.
5.2 Drill-Down (Roll-Down)

Figure 5.2 — Drill-Down: Time dimension expands from quarters (Q1,Q2,Q3,Q4) to months
(Jan,Feb,Mar...)

• Drill-Down = reverse of roll-up. Navigates from less detailed to MORE detailed data.
• Zoom IN — adds more detail to the current view.
• Can be performed by stepping DOWN a concept hierarchy OR by adding additional dimensions.
• Example: Drill-down on time from quarters → months. Q1 expands into Jan, Feb, Mar.
5.3 Slice

Figure 5.3 — Slice: Fixing time = Q1 produces a 2D sub-table of Location × Item for Q1 only (removes
the time dimension)

• Slice = subset of cube for a SINGLE VALUE on one dimension.


• Reduces cube dimensionality by 1 — a 3D cube becomes a 2D table.
• Example: Slice for time = 'Q1' → 2D table (Location × Item) showing only Q1 data.
• Selection: equality on one dimension value (e.g., time = Q1).

5.4 Dice

Figure 5.4 — Dice: Sub-cube for location ∈ {Toronto,Vancouver} AND time ∈ {Q1,Q2} AND item ∈
{Mobile,Modem}

• Dice = sub-cube by applying selection on TWO OR MORE dimensions.


• Unlike slice (one equality), dice selects ranges/multiple values on multiple dimensions
simultaneously.
• Result is still a sub-cube (not a flat table).
Example — three criteria simultaneously:
◦ location = 'Toronto' OR 'Vancouver'
◦ time = 'Q1' OR 'Q2'
◦ item = 'Mobile' OR 'Modem'
5.5 Pivot (Rotate)

Figure 5.5 — Pivot/Rotation: cube axes reoriented — Temperature axis and Time axis swap positions
Figure 5.6 — Pivot example table: Location×Item → Item×Location; and Other OLAP operations
summary

• Pivot (also called Rotation) — rotates data axes to give an alternative presentation.
• May involve swapping rows and columns, or moving a row-dimension into column dimension.
• The underlying data does NOT change — only the view orientation changes.
• Example: Rows=Item(types), Columns=Location(cities) → pivoted to Rows=Location,
Columns=Item.
5.6 Other OLAP Operations
• Drill-Across: Executes queries involving more than one fact table.
• Drill-Through: Uses relational SQL to drill through the bottom level of the cube to back-end
relational tables.
• Ranking: Ranking top-N or bottom-N elements in lists.
• Statistical: Moving average, growth rates, interest, IRR, depreciation, currency conversions.
• Analytical Modeling: Calculation engine for ratios, variance, forecasting, trend analysis, and
statistical analysis.
6. Data Warehouse Architecture — Multi-Tiered

6.1 Multi-Tiered Architecture

Figure 6.1 — Multi-Tiered DWH Architecture: Data Sources (Tier1) → Data Storage with ETL, DWH,
Data Marts, Metadata (Tier2) → OLAP Engine → Front-End Tools (Tier3)

• Tier 1 — Data Sources: Operational DBs and external sources (raw data producers).
• Tier 2 — Data Storage: ETL (Extract/Transform/Load/Refresh) → Data Warehouse (Metadata
+ Monitor & Integrator) → Data Marts.
• Tier 3 — OLAP Engine + Front-End: OLAP Server → Analysis, Query, Reports, and Data
Mining tools.

6.2 ETL — Extract, Transform, Load


• Data Extraction: Get data from multiple, heterogeneous, and external sources.
• Data Cleaning: Detect errors in the data and rectify them when possible.
• Data Transformation: Convert data from legacy/host format to warehouse format.
• Load: Sort, summarize, consolidate, compute views, check integrity, build indices and
partitions.
• Refresh: Propagate updates from data sources to the warehouse.

6.3 Metadata Repository


• Structure: Schema, view, dimensions, hierarchies, derived data definitions, data mart
locations/contents.
• Operational: Data lineage, currency of data (active/archived/purged), monitoring info (usage
stats, error reports, audit trails).
• Summarization algorithms, Mapping metadata, Performance metadata, Business
metadata.

6.4 Data Warehouse Model Scope Types

Figure 6.2 — Enterprise DWH model hierarchy: Enterprise DWH → Multi-Tier DWH → Distributed Data
Marts

• Enterprise Warehouse: ALL information spanning the entire organization.


• Data Mart: Subset of corporate-wide data for a specific user group (e.g., marketing). Can be
independent or dependent.
• Virtual Warehouse: Set of views over operational databases — only some summary views
materialized.
6.5 OLAP Server Types

Figure 6.3 — ROLAP vs MOLAP vs HOLAP comparison: characteristics and trade-offs

Type Description Key Points


ROLAP Relational OLAP — Uses Greater scalability; handles large/volatile
relational/extended-relational DBMS for data; poor query performance; requires
storage. expert users.
MOLAP Multidimensional OLAP — Array-based Fast indexing to pre-computed data;
storage with sparse matrix techniques. easy to use; fast IR; cannot handle
detailed data; weak DBMS facility.
HOLAP Hybrid OLAP — Relational for low-level, Combines advantages of both ROLAP
array-based for high-level. and MOLAP.

6.6 Design Process


1. Choose a business process to model (e.g., orders, invoices, payroll).
2. Choose the grain — the atomic level of data in the business process.
3. Choose the dimensions that will apply to each fact table record.
4. Choose the measures that will populate each fact table record.
7. Data Cube Computation

7.1 Cube Materialization Concepts

Figure 7.1 — Full Cube vs Iceberg Cube vs Closed Cube. Shows base cells, aggregate cells (marked
*), and ancestor/descendant relationships.
• Base Cell: A cell in the base cuboid.
• Aggregate Cell: A cell in a non-base cuboid. Each aggregated dimension is indicated as '*'.
• Ancestor/Descendant Cells: In a 3D cube, 1D and 2D cells are ancestors of 3D cells; 3D cells
are descendants.

Three choices for materialization:


• No Materialization: Do not pre-compute any 'nonbase' cuboids → expensive on-the-fly
multidimensional aggregation — extremely slow.
• Full Materialization: Pre-compute ALL cuboids. Resulting lattice = full cube. Requires huge
memory space.
• Partial Materialization: Selectively compute a proper subset of cuboids. (i) Identify subset to
materialize, (ii) Exploit during query processing, (iii) Efficiently update during load/refresh.

Three types of cubes:


• Full Cube: All cells of all cuboids computed.
• Iceberg Cube: Only cells whose measure satisfies the iceberg condition (min support
threshold). Only small portion 'above the water' in sparse cube.
• Closed Cube: Only closed cells. A cell is closed if it has no descendant cell with the same
measure value.

compute cube sales_iceberg as select month, city, customer_group, count(*)


from salesInfo cube by month, city, customer_group having count(*) >=
min_support

7.2 Algorithms for Computing Cubes


7.2.1 Multi-Way Array Aggregation (MOLAP — Bottom-Up)

Figure 7.2 — Multi-Way Array: 3D array partitioned into 64 memory-based chunks (a0b0c0 to a3b3c3).
Dimensions A,B,C each split into 4 partitions.
Figure 7.3 — Multi-Way Array traversal order: sequential 1..64 vs interleaved. Sequential order
minimizes memory (1,56,000 vs 16,41,000 units).

• Computes full cube using a multi-dimensional array partitioned into chunks (small sub-cubes
fitting in memory).
• Compressed sparse array addressing: (chunk_id, offset).
• Compute aggregates 'multiway' by visiting cells in order that minimizes visits → reduces
memory and storage cost.
Best traversal order for dimensions A=40, B=400, C=4000 (partitions 10,100,1000):
◦ Ordering 1,2,...,64: memory = 40×400 + 40×1000 + 100×1000 = 1,56,000 units ✓
◦ Ordering 1,17,33,...: memory = 400×4000 + 40×1000 + 10×100 = 16,41,000 units ✗
◦ Conclusion: Sequential 1..64 order is most memory-efficient.

7.2.2 BUC — Bottom-Up Construction

Figure 7.4 — BUC algorithm: builds from apex toward base cuboid. Star-Cubing lattice showing
shared dimensions (ABC/ABC, ABD/AB, ACD/A, etc.)
• BUC = 'bottom-up construction' algorithm for computing sparse and iceberg cubes.
• Despite the name, BUC builds from APEX cuboid toward the base cuboid.
• Allows BUC to share data partitioning costs.
• Processing order allows PRUNING during construction using the Apriori property.
• Best order minimizes memory requirement and reduces I/Os.

7.2.3 Star-Cubing — Integrating Top-Down and Bottom-Up


• Computes iceberg cubes by integrating both top-down and bottom-up approaches.
• Shared dimensions: E.g., dimension A is the shared dimension of ACD and AD. ABD/AB
means cuboid ABD has shared dimensions AB.
• Allows shared computations: AB computed simultaneously with ABD.
• Aggregates top-down but with bottom-up sub-layer allowing Apriori pruning.
• Shared dimensions grow in bottom-up fashion.
7.3 Compute Cube Operator

Figure 7.5 — DWH Implementation: compute cube operator, cuboid lattice diagram for (city,item,year),
and three materialization strategies

• Data cube = lattice of cuboids. Bottom-most = base cuboid; top-most (apex) = one cell.
• Total cuboids = 2ⁿ where n = number of dimensions.
• Example: item, city, year → 2³ = 8 cuboids: {(city,item,year),(city,item),(city,year),(item,year),
(city),(item),(year),()}
8. Indexing OLAP Data

8.1 Bitmap Indexing

Figure 8.1 — Bitmap Index tables: Base table (RID, item, city), Item bitmap index (H/C/P/S bit
columns), City bitmap index (V/T bit columns)
• Alternative representation of base table for quick searching in data cubes.
• For each attribute value v in its domain, there is a distinct bit vector Bv.
• If domain has n values → n bits needed per entry in bitmap index.
• Example: Item with 4 values (H,C,P,S) → 4-column bitmap. City with (V,T) → 2-column bitmap.
• Enables fast bitwise AND/OR operations to identify matching records.
8.2 Join Indexing

Figure 8.2 — Join Index: location/sales join index, item/sales join index, and combined
location/item/sales join index linking fact table rows to dimension values

• Join indexing registers joinable rows of two relations from a relational database.
• If R(RID, A) and S(B, SID) join on A and B → join index record = pair (RID, SID).
• Identifies joinable tuples without full joins.
• In DWH: join indexing is useful for cross-table search because of the star schema model.
• Maintains relationship between dimension attribute values and corresponding rows in the fact
table.

8.3 Efficient OLAP Query Processing


• Purpose of materialization and OLAP indexing = speed up query processing.
5. Determine which operation should be performed on available data cubes.
6. Determine to which materialized cuboids the relevant operations should be applied.
9. Data Warehouse Usage — From DWH to Data Mining

9.1 Three Kinds of DWH Applications

Figure 9.1 — DWH Application types (Information Processing, Analytical Processing, Data Mining)
and OLAM architecture with four layers
7. Information Processing: Querying, basic statistical analysis, reporting using crosstabs, tables,
charts, graphs.
8. Analytical Processing: Basic OLAP operations — slice-dice, drilling, pivoting. Multidimensional
analysis of historical data.
9. Data Mining: Knowledge discovery from hidden patterns, associations, analytical models,
classification, prediction.

9.2 From OLAP to OLAM (Online Analytical Mining)


• OLAM = Online Analytical Mining = OLAP + Data Mining in a multidimensional database or
DWH.
Why Online Analytical Mining? Four key reasons:
10. High quality data: DWH has integrated, consistent, cleaned data. Costly preprocessing already
done.
11. Available infrastructure: ODBC, OLEDB, Web access, OLAP tools constructed systematically.
12. OLAP-based exploratory analysis: Mining at different levels/subsets; mining with drilling,
dicing, pivoting.
13. On-line selection of mining functions: Integration with multiple mining functions, algorithms,
and tasks.
9.3 OLAM System Architecture (4 Layers)

Figure 9.2 — OLAM Architecture: Layer 1=Data Repository, Layer 2=MDDB, Layer 3=OLAP+OLAM
Engines, Layer 4=User Interface

• Layer 1 — Data Repository: Databases → data cleaning → data integration → Data


Warehouse.
• Layer 2 — MDDB: Multidimensional Database accessed via Database API and Data Cube API.
Includes Meta Data.
• Layer 3 — OLAP/OLAM Engine: Both engines accept user queries via GUI API and work with
data cube via Data Cube API.
• Layer 4 — User Interface: Mining queries sent in; Mining results returned to the user.
• OLAM server performs multiple data mining tasks — more sophisticated than OLAP server.
10. Practice Questions

Q1 — Doctor/Patient Data Warehouse


DWH with 3 dimensions: time, doctor, patient; and 2 measures: count and charge.

(a) Three schema classes:


14. Star Schema
15. Snowflake Schema
16. Fact Constellation (Galaxy Schema)

(b) Star schema for this DWH:


• Fact: VISIT (time_key, doctor_key, patient_key, count, charge)
• Dimension: Time (time_key, day, month, year)
• Dimension: Doctor (doctor_key, doctor_name, specialization, hospital)
• Dimension: Patient (patient_key, patient_name, age, address)

(c) OLAP operations to list total fee per doctor in 2004:


17. Roll-up on time from 'day' to 'year'.
18. Slice on year = 2004.
19. Roll-up on patient (aggregate over all patients).
20. Measure: sum(charge) grouped by doctor.

(d) SQL query:


SELECT doctor, SUM(charge) AS total_fee
FROM fee
WHERE year = 2004
GROUP BY doctor;

Q2 — Big-University DWH
4 dimensions: student, course, semester, instructor. 2 measures: count, avg_grade.

(a) Snowflake schema:


• Fact: GRADES (student_key, course_key, semester_key, instructor_key, count, avg_grade)
• student → major → status → university → all
• course → department → all
• semester → year → all
• instructor → department → all

(b) OLAP ops to list avg_grade of CS courses for each student:


21. Slice on course: select only CS courses.
22. Roll-up on semester to 'all'.
23. Roll-up on instructor to 'all'.

(c) Number of cuboids (each dimension has 5 levels):


• Total = 5 × 5 × 5 × 5 = 625 cuboids.
11. Quick Revision Summary Table

Concept Key Point / One-Line Summary


DWH (Inmon) Subject-Oriented + Integrated + Time-Variant + Nonvolatile
OLTP vs OLAP OLTP = current, detailed, write-heavy; OLAP = historical,
summarized, read-heavy
Star Schema Fact table in centre + denormalized dimension tables directly
connected
Snowflake Schema Star schema with normalized dimension tables (extra sub-tables)
Fact Constellation Multiple fact tables sharing common dimension tables
Roll-Up Climb concept hierarchy = less detail, more aggregation (ZOOM
OUT)
Drill-Down Go down concept hierarchy = more detail (ZOOM IN)
Slice Fix ONE dimension to a single value = reduce cube dimensionality
by 1
Dice Fix MULTIPLE dimensions to value ranges = produce sub-cube
Pivot/Rotate Reorient cube axes for alternative 2D view — data unchanged
Full Cube Pre-compute ALL cuboids — huge memory required
Iceberg Cube Compute only cells meeting minimum support threshold
Closed Cube Only cells with no descendant having the same measure value
Multi-Way Array Bottom-up MOLAP: partition into chunks; sequential order most
efficient
BUC 'Bottom-up construction' — apex to base, uses Apriori pruning
Star-Cubing Integrates top-down + bottom-up; exploits shared dimensions
ROLAP Relational OLAP — scalable, handles large/volatile data, slower
queries
MOLAP Multidimensional OLAP — fast, pre-computed, array-based
HOLAP Hybrid OLAP — combines ROLAP (low-level) + MOLAP (high-
level)
ETL Extract → Transform → Load (populates and refreshes DWH)
Metadata Data about data: structure, lineage, algorithms, business terms
Data Mart Subject-specific subset of DWH for a specific user group
Bitmap Index Bit vector per attribute value — fast bitwise search in cube
Join Index Registers joinable rows — useful in star schema cross-table search
OLAM Online Analytical Mining = OLAP + Data Mining integration
2ⁿ cuboids Total cuboids in an n-dimensional data cube = 2 raised to n
— End of Module 1 (Parts 1, 2, and 3) Notes —
Upload remaining PDFs to continue with Introduction to Data Mining.

You might also like