Data Mining

The document compares data warehouses (DW) and operational databases (OD), highlighting their differences in data structure, performance, and usage. It also discusses the distinctions between data warehouses and data marts, outlines multi-tier data warehouse architecture, and describes various data warehouse models. Additionally, it covers OLAP operations, the relationship between data warehousing and data mining, and the data warehouse implementation process.

Uploaded by

amitchowdhuryleo

1) Compare DW and OD.

A data warehouse (DW) is a repository of structured, filtered data that has already been processed
for a specific purpose. It collects data from multiple sources and transforms it using the
ETL (Extract, Transform, Load) process.

An operational database (OD), on the other hand, is a database whose data changes frequently.
It is designed mainly for a high volume of transactions.

| Key | Data Warehouse | Operational Database |
|---|---|---|
| 1. Basic | Stores structured, filtered data for analysis. | Data changes frequently for daily operations. |
| 2. Data Structure | Uses de-normalized schema (simpler for analysis). | Uses normalized schema (efficient for transactions). |
| 3. Performance | Fast for analytical queries. | Slower for analytics, optimized for transactional queries. |
| 4. Type of Data | Focuses on historical data. | Focuses on current data. |
| 5. Use Case | Used for OLAP (Online Analytical Processing). | Used for OLTP (Online Transaction Processing). |
| 6. Update Frequency | Data is updated periodically (batch process). | Data is updated continuously in real-time. |
| 7. Users | Used by analysts and managers for decision-making. | Used by operational staff for daily activities. |
| 8. Purpose | Supports reporting, forecasting, and business intelligence. | Supports day-to-day operations like order processing and billing. |

2) Compare between Data warehouse and Data Mart.

| Key | Data Warehouse | Data Mart |
|---|---|---|
| 1. Scope | Enterprise-wide, covers all business areas. | Department-level, focuses on a specific business area. |
| 2. Size | Large in size (terabytes or more). | Smaller in size (gigabytes to terabytes). |
| 3. Data Source | Integrates data from multiple sources. | May use data from a subset of sources or the data warehouse itself. |
| 4. Complexity | More complex and comprehensive. | Less complex and easier to manage. |
| 5. Implementation | Takes longer to build and requires more resources. | Faster and cheaper to implement. |
| 6. Usage | Used by top management and analysts for organization-wide insights. | Used by specific teams (e.g., sales, finance) for focused analysis. |
| 7. Maintenance | Harder to maintain due to size and complexity. | Easier to maintain. |
| 8. Data Type | Contains detailed and summary data. | Often contains summarized or specific data. |

3) Describe with an appropriate figure the multi-tier Data Warehouse Architecture.

| Tier | Purpose |
|---|---|
| Bottom Tier | Collects and loads raw data |
| Middle Tier | Stores, organizes, and processes data |
| Top Tier | Presents data to users for analysis |

A typical Data Warehouse architecture is divided into three tiers:

1. Bottom Tier – Data Source Layer

● This layer contains databases, flat files, or external data sources.

● Data is collected from different operational systems (e.g., sales, HR, finance).

● ETL (Extract, Transform, Load) tools are used to clean, transform, and load the data
into the warehouse.

2. Middle Tier – Data Storage and Processing Layer

● This is the core of the data warehouse.

● Stores the processed, structured data in a centralized repository.

● Uses OLAP servers for fast querying and data analysis.

○ OLAP can be:

■ ROLAP (Relational OLAP)

■ MOLAP (Multidimensional OLAP)

● Supports complex queries and multidimensional analysis.

3. Top Tier – Front-End or Presentation Layer

● This is the user interface layer.

● Provides access to the data through:

○ Reporting tools
○ Dashboards

○ Data mining tools

○ Query tools

● Used by business analysts, managers, and decision-makers.

4) Describe about Different Data Warehouse Models.

| Feature | Enterprise Data Warehouse (EDW) | Data Mart | Operational Data Store (ODS) |
|---|---|---|---|
| Definition | Centralized warehouse storing data for the whole organization | Mini-warehouse focused on a specific department | Stores current real-time data from operational systems |
| Scope | Organization-wide | Department-specific (e.g., sales, HR, marketing) | Real-time or daily operations |
| Data Type | Historical | Historical | Current |
| Use Case | Strategic business decisions | Department-level analysis | Daily operations and quick access |
| Example | Retail company analyzing total sales across all regions | Sales team analyzing regional performance | Bank tracking live transactions |

5) What is data model and multi dimensional data model?

Data Model

● A data model is a way to organize and define the structure of data.

● It shows how data is stored, connected, and processed.

Types of Data Models:

1. Conceptual Model – High-level overview (what data is needed)

2. Logical Model – Detailed structure (how data is related)

3. Physical Model – Actual database structure (tables, columns, keys)

+------------+ +-------------+
| Customer |----------| Orders |
+------------+ +-------------+
| CustomerID | | OrderID |
| Name | | CustomerID |
| Address | | OrderDate |
+------------+ +-------------+

Multidimensional Data Model

● A multidimensional data model is used in data warehousing and OLAP.

● It represents data in the form of a cube, where:

○ Each dimension represents a way to analyze the data (e.g., time, location,
product).

○ The fact contains measurable data (e.g., sales amount, quantity).

Example:

Let’s say you're analyzing sales data:

● Fact Table: Sales amount

● Dimensions:

○ Time (e.g., year, month)

○ Product (e.g., phone, laptop)

                +--------------------+
                |   Time Dimension   |
                | (Year, Month, Day) |
                +---------+----------+
                          |
+-----------+       +-----+-----+       +-----------+
|  Product  |       |   Fact    |       | Location  |
| Dimension |-------|   Table   |-------| Dimension |
|  (Phone,  |       |  (Sales)  |       |  (City,   |
|  Laptop)  |       |           |       | Country)  |
+-----------+       +-----------+       +-----------+

6) What are facts, dimensions and schemas?

| Term | Definition | Example |
|---|---|---|
| Fact | Numeric, measurable data | Sales, Revenue, Quantity |
| Dimension | Descriptive data giving context to facts | Time, Product, Location |
| Schema | Structure of fact and dimension tables | Star, Snowflake, Galaxy schemas |

Different multidimensional model schemas along with appropriate figures

1. Star Schema
Definition:

● Simplest and most common schema.

● A central fact table is directly connected to dimension tables.

● Dimension tables are denormalized (flattened).

The center table is the Fact Table — it holds the numbers you want to analyze.

The tables around it are Dimension Tables — they tell you more about who, what, where, and
when.

| Component | Meaning |
|---|---|
| Fact Table | Stores data to analyze, like Sales and Quantity. |
| Product | What product was sold? (e.g., Laptop, Phone) |
| Time | When was it sold? (e.g., Date, Month, Year) |
| Store | Where was it sold? (e.g., Location, Branch) |
| Customer | Who bought it? (e.g., Name, Age, Gender) |
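
As a rough illustration of how a star schema is queried, the sketch below joins a fact table to one dimension table and totals a measure. The table contents, key names (`product_id`, `store_id`), and the helper `sales_by_product` are hypothetical, chosen only for this example.

```python
# Hypothetical dimension tables: surrogate key -> descriptive attributes.
dim_product = {1: {"name": "Laptop"}, 2: {"name": "Phone"}}
dim_store = {10: {"branch": "Dhaka"}, 11: {"branch": "Khulna"}}

# Hypothetical fact table: foreign keys to the dimensions plus numeric measures.
fact_sales = [
    {"product_id": 1, "store_id": 10, "quantity": 2, "sales": 2000.0},
    {"product_id": 2, "store_id": 10, "quantity": 5, "sales": 1500.0},
    {"product_id": 1, "store_id": 11, "quantity": 1, "sales": 1000.0},
]


def sales_by_product(facts, products):
    """Join each fact row to its product dimension row and total the sales measure."""
    totals = {}
    for row in facts:
        name = products[row["product_id"]]["name"]
        totals[name] = totals.get(name, 0.0) + row["sales"]
    return totals


print(sales_by_product(fact_sales, dim_product))
```

Because every dimension hangs directly off the fact table, each lookup needs only one key hop, which is the main reason star schemas are easy to query.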

2. Snowflake Schema
Definition:

● An extension of Star Schema.

● Dimension tables are normalized (split into sub-tables).

● Reduces data redundancy but is slightly more complex.

Diagram:
                  +--------------+
                  | Sub-Category |
                  +------+-------+
                         |
                    +----+----+
                    | Product |
                    +----+----+
                         |
+--------+      +--------+--------+      +--------+
|  Time  |------|      Fact       |------| Store  |
+--------+      | (Sales, Profit) |      +--------+
                +-----------------+

At the center is the Fact Table, where we store the values we want to measure — like Sales
and Profit.

The tables around it are Dimension Tables that describe details like time, product, and store.

One of the dimension tables (Product) is linked to another table (Sub-Category) — this is what
makes it a Snowflake schema.

| Component | Meaning |
|---|---|
| Fact Table | Stores measurable data like Sales and Profit. |
| Product | What was sold? (e.g., iPhone, Laptop) |
| Sub-Category | Group of products (e.g., Mobiles, Laptops under Electronics) |
| Time | When was it sold? (Date, Month, Year) |
| Store | Where was it sold? (e.g., Dhaka Branch) |

3. Fact Constellation Schema (Galaxy Schema)
Definition:

● Multiple fact tables share common dimension tables.

● Suitable for complex data warehouses.

Diagram:
+--------------+       +--------+       +------------------+
|              |-------|  Time  |-------|                  |
| Sales (fact) |       +--------+       | Inventory (fact) |
|              |       +---------+      |                  |
|              |-------| Product |------|                  |
+--------------+       +---------+      +------------------+

● There are two fact tables here:

➤ Sales — stores data like units sold, revenue
➤ Inventory — stores data like stock quantity, restock info

● Both fact tables share common dimensions:

○ Time — when the event happened (e.g., sale or restock)

○ Product — what item was involved (e.g., Mobile, Laptop)

7) What is OLAP? Describe the different OLAP operations.

OLAP stands for Online Analytical Processing. It is a software technology that allows
users to analyze information from multiple database systems at the same time. It is based on
the multidimensional data model and lets the user query multidimensional data (e.g.,
Delhi -> 2018 -> Sales data).
OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly

detailed data. It can be done by:

● Moving down in the concept hierarchy

● Adding a new dimension

In the cube given in overview section, the drill down operation is performed by

moving down in the concept hierarchy of Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
● Climbing up in the concept hierarchy

● Reducing the dimensions

In the cube given in the overview section, the roll-up operation is performed by climbing up in
the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by applying criteria on two or more dimensions. In the
cube given in the overview section, a sub-cube is selected using the following dimensions
and criteria:
● Location = "Delhi" or "Kolkata"

● Time = "Q1" or "Q2"

● Item = "Car" or "Bus"

4. Slice: It fixes a single dimension of the OLAP cube to one value, which results in a new sub-cube.
In the cube given in the overview section, slice is performed on the dimension Time =
"Q1".
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing the pivot
operation gives a new view of it.
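
The operations above can be sketched on a tiny in-memory cube. This is a minimal illustration in plain Python, not how an OLAP server works internally; the cell values, the `Toronto` and `Canada` members, and all function names are assumptions made for the example. Drill-down is simply the reverse of the roll-up shown here, so it is not listed separately.

```python
from collections import defaultdict

# Hypothetical cube cells: (city, country, quarter, item, units sold).
cells = [
    ("Delhi", "India", "Q1", "Car", 10),
    ("Delhi", "India", "Q2", "Bus", 4),
    ("Kolkata", "India", "Q1", "Car", 6),
    ("Toronto", "Canada", "Q1", "Car", 8),
    ("Toronto", "Canada", "Q3", "Bus", 3),
]


def roll_up_to_country(rows):
    """Roll up: climb the Location hierarchy from City to Country and sum units."""
    totals = defaultdict(int)
    for _city, country, quarter, item, units in rows:
        totals[(country, quarter, item)] += units
    return dict(totals)


def slice_quarter(rows, quarter):
    """Slice: fix the Time dimension to a single value."""
    return [r for r in rows if r[2] == quarter]


def dice(rows, cities, quarters, items):
    """Dice: restrict two or more dimensions to sets of allowed values."""
    return [r for r in rows if r[0] in cities and r[2] in quarters and r[3] in items]


def pivot(rows):
    """Pivot: regroup the same cells as item -> {quarter: units} to rotate the view."""
    view = defaultdict(lambda: defaultdict(int))
    for _city, _country, quarter, item, units in rows:
        view[item][quarter] += units
    return {item: dict(by_quarter) for item, by_quarter in view.items()}


q1 = slice_quarter(cells, "Q1")
print(roll_up_to_country(cells))
print(dice(cells, {"Delhi", "Kolkata"}, {"Q1", "Q2"}, {"Car", "Bus"}))
print(pivot(q1))
```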

8) What is the relation between data warehousing and data mining?

Data warehousing and data mining are closely connected, but they serve different purposes
in data analysis.

Relationship Summary:

● Data Warehouse is like a library where all your data is stored neatly.

● Data Mining is like reading and studying that library to discover new knowledge.

| Aspect | Data Warehousing | Data Mining |
|---|---|---|
| Definition | A system that stores large amounts of organized data for analysis. | A technique to discover patterns, trends, or insights from data. |
| Purpose | To store, clean, and organize data from multiple sources. | To analyze that data and find useful knowledge. |
| Dependency | Provides the source data for mining. | Needs data from warehouse or other structured systems. |
| Process Type | Mainly storage and management of data. | Mainly analysis and discovery from data. |
| Example | A company stores 5 years of customer data. | They use that data to find which customers may leave. |

[Figure: Working of the top-down approach]

[Figure: Bottom-up approach]

9) How is a data warehouse different from a database? How are they similar?

| Feature | Database | Data Warehouse |
|---|---|---|
| Purpose | Stores current data for daily operations | Stores historical data for analysis and reporting |
| Data Type | Real-time, transactional data | Historical, analytical data |
| Usage | Used in apps like banking, e-commerce | Used in BI tools, reporting, decision making |
| Query Type | Simple insert, update, delete | Complex queries for trends, summaries |
| Data Structure | Usually normalized | Usually de-normalized |
| Update Frequency | Frequently updated | Periodically updated (e.g., daily, weekly) |
| Speed Focus | Optimized for fast write operations | Optimized for fast read operations |

Similarities

| Similarity Point | Description |
|---|---|
| Store Data | Both are used to store and manage structured data |
| Use SQL | Both can use SQL to query and manage data |
| Have Tables | Both organize data in tables with rows and columns |
| Need Backup & Security | Both require data protection and backup systems |

10) Define the data warehouse implementation process in detail.

| Step | Name | What Happens | Example |
|---|---|---|---|
| 1 | Requirement Gathering | Understand business goals and data needs | Sales team wants monthly and yearly sales reports |
| 2 | Data Modeling | Design warehouse structure (schemas, tables) | Create a Star Schema with Fact_Sales and Dim_Product tables |
| 3 | Data Integration & ETL | Extract, clean, and load data from sources | Extract data from ERP and CRM systems, clean, and load into warehouse |
| 4 | Data Cleansing & Validation | Fix and verify data accuracy | Remove duplicate customer entries, correct product codes |
| 5 | Building Data Marts | Create department-focused mini warehouses | Build separate data marts for Sales, HR, and Finance departments |
| 6 | Data Security & Governance | Control access and ensure data privacy | Give HR team access only to employee data, not financial data |
| 7 | Testing & QA | Test data quality, performance, and reliability | Verify that sales totals match source data, and queries run efficiently |
| 8 | Deployment & Maintenance | Go live and maintain system health | Launch the system and schedule regular data refreshes |
| 9 | User Training & Adoption | Train staff to use the warehouse effectively | Conduct sessions for managers on using dashboards and reports |
| 10 | Ongoing Optimization | Improve and expand over time | Add new data sources like mobile app usage or customer feedback |

11) In the context of data warehousing, what is data transformation?

Data Transformation is the process of converting raw data from different sources into a clean,
consistent, and usable format before loading it into the data warehouse.

It is part of the ETL process (Extract, Transform, Load).

| Task | What it does | Example |
|---|---|---|
| Data Cleaning | Removes errors or inconsistencies | Fixing missing or incorrect customer names |
| Data Standardization | Converts data into a standard format | Date: “12/06/25” → “2025-06-12” |
| Data Mapping | Matches fields from source to target | “Cust_ID” in source → “Customer_ID” in DW |
| Data Aggregation | Summarizes data for analysis | Daily sales → Monthly sales |
| Data Encoding | Converts categories into numbers or formats | “Male/Female” → 1/0 |
| Data Merging | Combines data from multiple sources | Merge customer data from two different systems |
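
Several of these tasks can be sketched in a few lines of Python. The source rows, the `Amount` field, and the assumption that source dates are in DD/MM/YY order are all hypothetical; a real ETL job would read from the actual source systems and handle bad values.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical source rows; field names and the DD/MM/YY date format are assumptions.
source_rows = [
    {"Cust_ID": "C1", "Date": "12/06/25", "Gender": "Male", "Amount": "100.50"},
    {"Cust_ID": "C2", "Date": "30/06/25", "Gender": "Female", "Amount": "49.50"},
    {"Cust_ID": "C1", "Date": "01/07/25", "Gender": "Male", "Amount": "20.00"},
]


def transform(row):
    """Apply mapping, standardization, and encoding to one source row."""
    date = datetime.strptime(row["Date"], "%d/%m/%y")
    return {
        "Customer_ID": row["Cust_ID"],                  # data mapping
        "Date": date.strftime("%Y-%m-%d"),              # data standardization
        "Gender": 1 if row["Gender"] == "Male" else 0,  # data encoding
        "Amount": float(row["Amount"]),
    }


def monthly_sales(rows):
    """Data aggregation: roll daily amounts up to YYYY-MM totals."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["Date"][:7]] += r["Amount"]
    return dict(totals)


clean = [transform(r) for r in source_rows]
print(clean[0])
print(monthly_sales(clean))
```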

12) What do you mean by mining frequent patterns?

Frequent pattern mining is the process of discovering repeated combinations or common

patterns in a large dataset.

Example (Market Basket Analysis):

Suppose you have many customer transactions at a grocery store.

One frequent pattern could be:

{Bread, Butter, Jam} → These items are often bought together.

So, if a customer buys Bread and Butter, there's a high chance they also buy Jam.

Popular Algorithms:

● Apriori Algorithm

● FP-Growth (Frequent Pattern Growth)

Summary:

Mining frequent patterns helps in finding valuable relationships or associations in

big data, which supports better decision-making and predictions.

13. Discuss about support, confidence and lift.

| Concept | Meaning | Formula | Example Result |
|---|---|---|---|
| Support | How often an itemset appears in all data | Support(A) = Transactions containing A / Total transactions | If {bread, eggs} appears in 5 out of 10 transactions → 5/10 = 0.5 or 50% |
| Confidence | How likely B is bought when A is bought | Confidence(A⇒B) = Support(A∪B) / Support(A) | If {bread, milk} appears in 5 transactions and {bread, milk, eggs} in 4 → 4/5 = 0.8 or 80% |
| Lift | How much more likely A and B occur together than independently | Lift(A⇒B) = Support(A∪B) / (Support(A) × Support(B)) | If Support({bread, milk, eggs}) = 0.4, Support({bread, milk}) = 0.5, Support({eggs}) = 0.6 → 0.4 / (0.5 × 0.6) = 1.33 |

14) Describe about association rule mining with proper example.

Association Rule Mining is a key technique in data mining used to discover interesting
relationships, patterns, or associations among a set of items in large datasets.

Definition:

Association Rule Mining finds rules like:

If item A is bought, item B is likely to be bought.

This is written as:
A⇒B

Key Terms:

1. Support:

○ How frequently the itemset appears in the dataset.

○ Example: 20% of all transactions contain both A and B → support = 0.20

2. Confidence:

○ How often B is bought when A is bought.

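○ Formula: Confidence(A ⇒ B) = Support(A ∪ B) / Support(A), where Support(A ∪ B) is the share of transactions containing both A and B.

3. Lift:

○ How much more likely B is bought when A is bought, compared with how often B is bought
overall.

○ Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

○ Lift > 1 means positive correlation, Lift = 1 means independence, and Lift < 1 means negative correlation.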

Example (Market Basket Analysis):

Let’s say we have the following transactions:

| Transaction ID | Items Bought |
|---|---|
| 1 | Bread, Milk |
| 2 | Bread, Diaper, Beer, Eggs |
| 3 | Milk, Diaper, Beer, Coke |
| 4 | Bread, Milk, Diaper, Beer |
| 5 | Bread, Milk, Diaper, Coke |

We can find the rule:

{Diaper} ⇒ {Beer}

● Support = 3/5 = 0.6 → Diaper and Beer occur together in 3 out of 5 transactions.

● Confidence = 3/4 = 0.75 → Diaper appears in 4 transactions, and in 3 of them Beer is
also there.

● Lift = 0.75 / (3/5) = 0.75 / 0.6 = 1.25 → Beer appears in 3 of the 5 transactions, so Support(Beer) = 0.6. Lift is greater than 1, so Diaper and Beer are positively correlated.
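
The three measures can be checked in code on the five transactions listed above. This is a minimal sketch; the function names are chosen for the example, and `Fraction` is used only to keep the arithmetic exact. Note that lift divides the confidence by Support({Beer}) = 3/5, the support of the consequent.

```python
from fractions import Fraction

# The five transactions from the table above.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support(itemset, db):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return Fraction(sum(itemset <= t for t in db), len(db))


def confidence(a, b, db):
    """Support of A and B together, divided by the support of A."""
    return support(set(a) | set(b), db) / support(a, db)


def lift(a, b, db):
    """Confidence of A => B divided by the support of the consequent B."""
    return confidence(a, b, db) / support(b, db)


print(float(support({"Diaper", "Beer"}, transactions)))
print(float(confidence({"Diaper"}, {"Beer"}, transactions)))
print(float(lift({"Diaper"}, {"Beer"}, transactions)))
```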

15) What are candidate generation and pruning?

Candidate Generation

● It means making possible combinations of items that might appear frequently in

transactions.

● These combinations are called candidates.

● We create bigger itemsets from smaller frequent ones.

Example:
If {Milk}, {Bread}, and {Butter} are frequent 1-itemsets, we combine them to form:

● {Milk, Bread}, {Milk, Butter}, {Bread, Butter} → These are 2-item candidates

Pruning

● It means removing the candidates that cannot be frequent.

● We use the rule:

➤ If any part (subset) of a candidate is not frequent, remove it.

Example:

If we have a 3-item candidate {Milk, Bread, Diaper}, but {Milk, Diaper} is not frequent, then we
prune (remove) {Milk, Bread, Diaper}.
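
Both steps can be sketched as two small functions. This is a simplified illustration, assuming itemsets are stored as frozensets; the names `generate_candidates` and `prune` are hypothetical, and the join here unions any two frequent (k-1)-itemsets rather than using the prefix-based join of optimized implementations.

```python
from itertools import combinations


def generate_candidates(frequent_prev, k):
    """Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset."""
    prev = [frozenset(s) for s in frequent_prev]
    return {a | b for a, b in combinations(prev, 2) if len(a | b) == k}


def prune(candidates, frequent_prev, k):
    """Prune step: drop any candidate that has an infrequent (k-1)-subset."""
    prev = {frozenset(s) for s in frequent_prev}
    return {
        c for c in candidates
        if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
    }


# 2-item candidates from the frequent 1-itemsets in the example above.
pairs = generate_candidates([{"Milk"}, {"Bread"}, {"Butter"}], 2)
print(pairs)

# {Milk, Diaper} is not frequent, so {Milk, Bread, Diaper} is pruned.
frequent_2 = [{"Milk", "Bread"}, {"Bread", "Diaper"}]
triples = generate_candidates(frequent_2, 3)
print(prune(triples, frequent_2, 3))
```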

16)Discuss about frequent pattern growth algorithm with fp-tree.

● FP-Growth is a fast method to find frequent itemsets without generating candidate
itemsets the way Apriori does.

● It uses a special tree structure called an FP-Tree to store the transaction data in a
compressed form.

Why FP-Growth?

● Apriori generates many candidate itemsets and takes more time.

● FP-Growth avoids this by using a tree and is faster and more efficient.

Steps of FP-Growth Algorithm:

Step 1: Build the FP-Tree

1. Scan the data to find the frequent items and their counts (support).
2. Sort the items in each transaction by frequency (most frequent first).

3. Build the FP-Tree:

○ Start with a root node (null)

○ For each transaction, add items as a path in the tree.

○ If a path already exists, increase the count.

Example:

Let’s say we have 3 transactions:

T1: a, b, d
T2: b, c, d
T3: a, b, c

Step 1: Count item frequency:

● a:2, b:3, c:2, d:2

Step 2: Sort the items in each transaction by descending frequency (a, c, and d tie at 2; the tie is broken here in the order a, d, c):

● T1: b, a, d

● T2: b, d, c

● T3: b, a, c

Step 3: Build FP-tree:

          (null)
            |
           b(3)
          /    \
       a(2)    d(1)
       /  \       \
    d(1)  c(1)    c(1)
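
The tree construction can be sketched as follows. This is a minimal illustration of the build step only (no header table and no mining). The minimum support of 2 and the `tie_order` argument are assumptions: the notes do not state a threshold, and the tie order a, d, c is chosen so the result matches the sorted transactions in the example.

```python
from collections import Counter


class FPNode:
    """One FP-tree node: an item, its count, and child nodes keyed by item."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support, tie_order=()):
    """Build an FP-tree; tie_order (an assumption) fixes the order of equal-count items."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i for i, c in counts.items() if c >= min_support}

    def rank(item):
        pos = tie_order.index(item) if item in tie_order else len(tie_order)
        return (-counts[item], pos, item)

    root = FPNode(None)
    for t in transactions:
        node = root
        # Insert the frequent items of this transaction, most frequent first.
        for item in sorted((i for i in set(t) if i in frequent), key=rank):
            child = node.children.get(item)
            if child is None:
                child = node.children[item] = FPNode(item, node)
            child.count += 1  # shared prefixes only increase the count
            node = child
    return root, counts


data = [["a", "b", "d"], ["b", "c", "d"], ["a", "b", "c"]]
root, counts = build_fp_tree(data, min_support=2, tie_order=("a", "d", "c"))
b = root.children["b"]
print(b.count, sorted((c.item, c.count) for c in b.children.values()))
```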

Step 2: Mine Frequent Patterns from the Tree

● Start from the bottom of the FP-Tree, and create conditional patterns.

● Build smaller FP-trees for each item and extract frequent itemsets.

Summary Table:

| Term | Explanation |
|---|---|
| FP-Growth | Algorithm to find frequent patterns fast |
| FP-Tree | A tree structure to store itemsets |
| Step 1 | Build the FP-tree from transactions |
| Step 2 | Extract frequent itemsets from the tree |

17) Compare between apriori and fp-growth algorithm.

| Feature | Apriori Algorithm | FP-Growth Algorithm |
|---|---|---|
| Method | Uses candidate generation | Uses a tree (FP-Tree), no candidates |
| Speed | Slower for large data | Faster and more efficient |
| Memory Use | Uses more memory due to many candidates | Uses less memory (compressed tree) |
| Working Steps | Repeatedly scans database | Scans database only twice |
| Main Structure | Uses itemset list | Uses FP-Tree |
| Drawback | Time-consuming for big data | Tree can be complex if data has low overlap |
| Best For | Small datasets | Large datasets with many frequent patterns |

18) What is the importance of association rule mining?

Association rule mining helps us find relationships between items in a dataset.

It shows what items are often bought or used together.

Why it's important (with rules and examples):

1. Market Basket Analysis

○ Finds what people buy together.

○ Rule: If a customer buys bread, they also buy butter.
→ Bread ⇒ Butter

2. Product Recommendation

○ Helps online stores suggest items.
○ Rule: If someone buys a phone, they also buy a phone cover.
→ Phone ⇒ Phone Cover

3. Sales Improvement

○ Stores can place related items together to boost sales.
○ Rule: If people buy shampoo, they also buy conditioner.
→ Shampoo ⇒ Conditioner

4. Detecting Patterns

○ Useful in banking or healthcare to find hidden patterns.
○ Rule: If a patient has symptom A and B, they might have disease C.
→ A, B ⇒ C

5. Better Decisions

○ Helps businesses make smart choices using data.
○ Rule: If a user visits page A, they often go to page B.
→ Page A ⇒ Page B

19) Define some correlation measures : chi-square, all_conf, max_conf, kulc and
cosine.

1. Chi-Square (Chi²)

● Checks whether two items are independent or related.

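● A high Chi² value means the items are unlikely to be independent. Whether the relationship is positive or negative is seen by comparing the observed count with the expected count.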
● It compares the expected frequency with the actual frequency.
● Use: To test how strong the relationship is between two items.

2. All_Conf (All Confidence)

● Measures the strength of the weaker rule direction: it equals min(Confidence(X ⇒ Y), Confidence(Y ⇒ X)), so the more frequent item determines the value.
● Formula:
All_Conf = Support(X and Y) / max(Support(X), Support(Y))

● Value is between 0 and 1.

● Higher value = stronger relationship
3. Max_Conf (Maximum Confidence)

● Takes the maximum confidence from both directions.

● Formula:
Max_Conf = max(Confidence(X ⇒ Y), Confidence(Y ⇒ X))
● Helps to find the strongest direction of the rule.

4. Kulc (Kulczynski Measure)

● It is the average of confidence in both directions.

● Formula:
Kulc = (Confidence(X ⇒ Y) + Confidence(Y ⇒ X)) / 2
● It gives a balanced view of how strong the relationship is both ways.

5. Cosine

● Measures similarity between two items.

● Formula:
Cosine = Support(X and Y) / square root of (Support(X) × Support(Y))
● Value is between 0 and 1.
● Closer to 1 = stronger similarity

Lift

Lift is a measure used to check how strong or useful an association rule is compared to
random chance.

Lift(X ⇒ Y) = Support(X and Y) / (Support(X) × Support(Y))

How to Understand Lift:

● Lift = 1 → X and Y are independent (no effect on each other)

● Lift > 1 → X and Y are positively related (X increases chance of Y)

● Lift < 1 → X and Y are negatively related (X reduces chance of Y)
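
The measures in this section can be computed together from four counts. The sketch below is illustrative: the function name `measures` and its parameter names are assumptions, and it assumes neither item appears in zero or in all transactions (otherwise a division by zero occurs). The formulas are written with counts, which is equivalent to the support-based forms above. The example numbers come from the Diaper and Beer transactions used earlier in these notes (5 transactions, Diaper in 4, Beer in 3, both in 3).

```python
from math import sqrt


def measures(n, n_x, n_y, n_xy):
    """Correlation measures for items X and Y from transaction counts.

    n: total transactions; n_x, n_y: transactions containing X, Y;
    n_xy: transactions containing both.
    """
    conf_xy = n_xy / n_x  # Confidence(X => Y)
    conf_yx = n_xy / n_y  # Confidence(Y => X)
    # 2x2 contingency table: (X,Y), (X,not Y), (not X,Y), (not X,not Y).
    observed = [n_xy, n_x - n_xy, n_y - n_xy, n - n_x - n_y + n_xy]
    expected = [
        n_x * n_y / n,
        n_x * (n - n_y) / n,
        (n - n_x) * n_y / n,
        (n - n_x) * (n - n_y) / n,
    ]
    return {
        "chi2": sum((o - e) ** 2 / e for o, e in zip(observed, expected)),
        "all_conf": n_xy / max(n_x, n_y),
        "max_conf": max(conf_xy, conf_yx),
        "kulc": (conf_xy + conf_yx) / 2,
        "cosine": n_xy / sqrt(n_x * n_y),
        "lift": n_xy * n / (n_x * n_y),
    }


m = measures(n=5, n_x=4, n_y=3, n_xy=3)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```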

20) How does the Apriori algorithm work?

What is Apriori Algorithm?

Apriori is a data mining algorithm used to find frequent itemsets and create association rules
(like "if a customer buys X, they also buy Y").

How It Works (Step-by-step):

Step 1: Count single items (1-itemsets)

● Scan the whole dataset.

● Count how many times each item appears.
● Keep only those that meet minimum support.

Example:
If "Milk" appears in 3 out of 5 transactions, support = 3
If minimum support = 2, keep it.

Step 2: Generate 2-itemsets

● Combine frequent 1-itemsets to form 2-item combinations.

● Count how many times each 2-itemset appears.
● Keep only those with enough support.

Example:
{Milk, Bread} appears in 2 transactions → keep it

Step 3: Generate 3-itemsets (and so on)

● Combine frequent 2-itemsets to make 3-itemsets.

● Again, count and keep those that meet minimum support.
● Continue until no more frequent itemsets are found.

Step 4: Generate Association Rules

● From the frequent itemsets, generate rules like:

If X, then Y
● Calculate confidence = Support(X and Y) / Support(X)
● Keep rules that meet the minimum confidence threshold.
Example Rule:
If a customer buys Milk ⇒ they also buy Bread
Confidence = 2/3 ≈ 67%
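
The four steps can be put together in a compact sketch. This is a simplified, unoptimized version that uses absolute support counts as in the examples above; the five transactions are hypothetical, chosen so that Milk appears in 3 transactions and {Milk, Bread} in 2, and the names `apriori` and `rules` are assumptions.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset whose count >= min_support."""
    db = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in db for i in t}  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(c <= t for t in db) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the subset rule.
        joined = {a | b for a, b in combinations(current, 2) if len(a | b) == k}
        level = {
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
    return frequent


def rules(frequent, min_confidence):
    """List (antecedent, consequent, confidence) for rules meeting min_confidence."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for ante in map(frozenset, combinations(itemset, r)):
                conf = n / frequent[ante]  # subsets of a frequent itemset are frequent
                if conf >= min_confidence:
                    out.append((set(ante), set(itemset - ante), conf))
    return out


# Hypothetical transactions for illustration.
db = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Milk"},
    {"Bread", "Butter"},
    {"Butter"},
]
freq = apriori(db, min_support=2)
found = rules(freq, min_confidence=0.6)
for ante, cons, conf in found:
    print(ante, "=>", cons, round(conf, 2))
```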

21) How does knowledge discovery in databases (KDD) work?

Steps of KDD Process:

1. Data Selection

● Choose the relevant data from a large database.

● Example: From a hospital database, select only patient age, symptoms, and test
results.

2. Data Preprocessing / Cleaning

● Remove missing, duplicate, or noisy data.

● Example: Fill missing values, remove rows with errors.

3. Data Transformation

● Convert data into a suitable format for mining.

● Example: Normalize values, convert categories to numbers.

4. Data Mining

● Apply algorithms to find patterns and rules.

● Example: Use Apriori to find which items are often bought together.

5. Pattern Evaluation

● Check if the patterns are interesting, useful, and valid.

● Example: Ignore patterns that are already obvious or not meaningful.

6. Knowledge Presentation
● Show the discovered knowledge using graphs, charts, or reports.
● Example: Show association rules in an easy-to-read format.

22) Briefly describe about sequential pattern mining.

What is Sequential Pattern Mining?

Sequential Pattern Mining is a type of data mining that finds frequent sequences or patterns
in ordered data.

Why it's useful?

It helps businesses or systems to understand sequences of actions, like:

● What customers usually do step by step

● What products are bought in sequence
● What events happen one after another

Example:

Let’s say we have this data from three customers:

| Customer | Transactions (by time) |
|---|---|
| C1 | A→B→C |
| C2 | A→D |
| C3 | A→B→D |

From this, a frequent sequence could be:

A → B (because it appears in C1 and C3)

Steps of Sequential Pattern Mining:

1. Scan database to find frequent individual items.

2. Generate sequences from frequent items.
3. Count how often each sequence appears.
4. Keep only those with support ≥ minimum threshold.
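
Counting the support of a sequence comes down to a subsequence test. The sketch below uses the three customers from the example above; it assumes each step in a sequence is a single item and that gaps between matched items are allowed. The function names are chosen for the example.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's items occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    # "item in it" advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def sequence_support(pattern, database):
    """Number of customers whose sequence contains the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


database = {
    "C1": ["A", "B", "C"],
    "C2": ["A", "D"],
    "C3": ["A", "B", "D"],
}
print(sequence_support(["A", "B"], database))
print(sequence_support(["B", "A"], database))
```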

23)Discuss about breadth first search based generalized sequential pattern (GSP)
algorithm for the

GSP is an algorithm used in sequential pattern mining to discover frequent sequences in

ordered transaction data (like shopping history or time-series data).

It works using a breadth-first search (BFS) strategy — meaning it explores all short patterns
before moving to longer ones.

How GSP Algorithm Works (Step-by-step):

Step 1: Find Frequent 1-Sequences

● Scan the database and find all frequent individual items (called 1-sequences) that
meet the minimum support.

Step 2: Candidate Generation (k-sequences)

● Use the frequent (k-1)-sequences to create candidate k-sequences (just like Apriori).
● For example, combine frequent 1-sequences to make 2-sequences.

Step 3: Support Counting

● Scan the database again and count how many times each candidate sequence appears.
● Only keep those that meet the minimum support threshold.

Step 4: Repeat

● Repeat Steps 2 and 3, increasing sequence length each time (k = 2, 3, 4...), until no
more frequent sequences are found.

Example:

Let’s say we have a database of 3 customers:

| Customer | Sequence |
|---|---|
| C1 | A→B→C |
| C2 | A→C |
| C3 | A→B |

● Frequent 1-sequences: A, B, C
● Generate 2-sequences: A→B, A→C, B→C
● Check support, keep frequent ones
● Then try 3-sequences: A→B→C, etc.
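
The level-wise search can be sketched as below. This is a simplified version of GSP, not the full algorithm: it assumes every element of a sequence is a single item and it ignores time constraints, and the minimum support of 2 is an assumption since the example does not state one. Unlike the hand-written list above, the join step also generates candidates such as B→A, which are then discarded by support counting.

```python
def is_subsequence(pattern, sequence):
    """True if the pattern's items occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def gsp(database, min_support):
    """Breadth-first search for frequent sequences of single items."""
    items = sorted({i for seq in database for i in seq})
    level = [(i,) for i in items]  # candidate 1-sequences
    frequent = {}
    while level:
        # Support counting: one pass over the database per level.
        counts = {p: sum(is_subsequence(p, s) for s in database) for p in level}
        current = {p: n for p, n in counts.items() if n >= min_support}
        frequent.update(current)
        # Join: p and q combine when p minus its first item equals q minus its last.
        level = [p + q[-1:] for p in current for q in current if p[1:] == q[:-1]]
    return frequent


db = [["A", "B", "C"], ["A", "C"], ["A", "B"]]
result = gsp(db, min_support=2)
for pattern, count in result.items():
    print("→".join(pattern), count)
```

With a minimum support of 2, A→B and A→C survive at length 2, while B→C (found only in C1) does not, so no 3-sequence candidate is generated.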

No ratings yet
Hibernate 3.2.7 Reference Guide
232 pages
ADO.NET Programming Guide
No ratings yet
ADO.NET Programming Guide
476 pages
CADWorx Plant Professional Features Guide
No ratings yet
CADWorx Plant Professional Features Guide
23 pages
HikariCP Connection Pool Configuration Guide
No ratings yet
HikariCP Connection Pool Configuration Guide
6 pages
Vaishnavi Karale Resume
No ratings yet
Vaishnavi Karale Resume
1 page
Text Mining and Information Retrieval
No ratings yet
Text Mining and Information Retrieval
37 pages
Introduction to Database Management
No ratings yet
Introduction to Database Management
13 pages
OpenAPI REST API Guide and Assignment
No ratings yet
OpenAPI REST API Guide and Assignment
5 pages
OData ABAP Get_Expanded_Entity Guide
No ratings yet
OData ABAP Get_Expanded_Entity Guide
4 pages
Azani Isp Kcse Full Project
100% (1)
Azani Isp Kcse Full Project
6 pages
SQL Functions, Joins, and Subqueries Guide
No ratings yet
SQL Functions, Joins, and Subqueries Guide
7 pages
Apache Sqoop Fundamentals and Usage
100% (1)
Apache Sqoop Fundamentals and Usage
66 pages
Configuring Accounts Payable in EBS R12.2
No ratings yet
Configuring Accounts Payable in EBS R12.2
21 pages
Database Management Fundamentals
No ratings yet
Database Management Fundamentals
13 pages
Online Assignment Submission System
No ratings yet
Online Assignment Submission System
20 pages
Oracle Predefined PL/SQL Exceptions
No ratings yet
Oracle Predefined PL/SQL Exceptions
4 pages
INDIACom-2025 Paper Submission Guide
No ratings yet
INDIACom-2025 Paper Submission Guide
2 pages
Web-Based Information System Project
No ratings yet
Web-Based Information System Project
4 pages
SQL Anywhere Server SQL Usage Guide
No ratings yet
SQL Anywhere Server SQL Usage Guide
906 pages
Database Security Interview Questions
No ratings yet
Database Security Interview Questions
31 pages

1) Compare a Data Warehouse (DW) and an Operational Database (OD).

Key Data Warehouse Operational Database

2. Data Uses de-normalized schema Uses normalized schema (efficient

4. Type of Data Focuses on historical data. Focuses on current data.

6. Update Data is updated periodically Data is updated continuously in

8. Purpose Supports reporting, forecasting, Supports day-to-day operations like

2) Compare a Data Warehouse and a Data Mart.

Key Data Warehouse Data Mart

1. Scope Enterprise-wide, covers all business Department-level, focuses on a

4. Complexity More complex and comprehensive. Less complex and easier to

5. Takes longer to build and requires Faster and cheaper to implement.

6. Usage Used by top management and Used by specific teams (e.g.,

7. Maintenance Harder to maintain due to size and Easier to maintain.

8. Data Type Contains detailed and summary Often contains summarized or

3) Describe with an appropriate figure the multi-tier Data Warehouse Architecture.

Bottom Tier Collects and loads raw data

Middle Tier Stores, organizes, and processes data

Top Tier Presents data to users for analysis

A typical Data Warehouse architecture is divided into three tiers:

1. Bottom Tier – Data Source Layer

● This layer contains databases, flat files, or external data sources.

2. Middle Tier – Data Storage and Processing Layer

● This is the core of the data warehouse.

● Stores the processed, structured data in a centralized repository.

● Uses OLAP servers for fast querying and data analysis.

○ OLAP can be:

■ ROLAP (Relational OLAP)

■ MOLAP (Multidimensional OLAP)

● Supports complex queries and multidimensional analysis.

3. Top Tier – Front-End or Presentation Layer

● This is the user interface layer.

● Provides access to the data through:

○ Data mining tools

● Used by business analysts, managers, and decision-makers.

4) Describe the different Data Warehouse models.

Feature Enterprise Data Data Mart Operational Data Store

Definition Centralized warehouse Mini-warehouse Stores current real-time

Scope Organization-wide Department-specific Real-time or daily

Data Historical Historical Current

Use Case Strategic business Department-level Daily operations and

5) What is a data model, and what is a multidimensional data model?

● A data model is a way to organize and define the structure of data.

● It shows how data is stored, connected, and processed.

1. Conceptual Model – High-level overview (what data is needed)

2. Logical Model – Detailed structure (how data is related)

3. Physical Model – Actual database structure (tables, columns, keys)

Multidimensional Data Model

● A multidimensional data model is used in data warehousing and OLAP.

● It represents data in the form of a cube, where:

○ The fact contains measurable data (e.g., sales amount, quantity).

Let's say you are analyzing sales data:

● Fact Table: Sales amount

○ Time (e.g., year, month)

○ Product (e.g., phone, laptop)

6) What are facts, dimensions and schemas?

Term        Definition                                 Example

Fact        Numeric, measurable data                   Sales, Revenue, Quantity

Dimension   Descriptive data giving context to facts   Time, Product, Location

Schema      Structure of fact and dimension tables     Star, Snowflake, Galaxy

Different multidimensional model schemas along with appropriate figures

● Simplest and most common schema.

● A central fact table is directly connected to dimension tables.

Fact Table Stores data to analyze, like Sales and

Product What product was sold? (e.g., Laptop, Phone)

Time When was it sold? (e.g., Date, Month, Year)

Store Where was it sold? (e.g., Location, Branch)

Customer Who bought it? (e.g., Name, Age, Gender)
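
The fact-and-dimension layout described above can be illustrated with a minimal Python sketch. All table contents, key names, and the `sales_by` helper are hypothetical and chosen only for illustration; a real warehouse would hold these as database tables joined with SQL.

```python
# Hypothetical dimension tables, each keyed by an id that the fact table references.
product_dim = {1: {"name": "Laptop"}, 2: {"name": "Phone"}}
store_dim = {10: {"location": "Dhaka Branch"}, 11: {"location": "Branch B"}}

# Hypothetical fact table: foreign keys into the dimensions plus a numeric measure.
sales_fact = [
    {"product_id": 1, "store_id": 10, "sales": 1200},
    {"product_id": 2, "store_id": 10, "sales": 800},
    {"product_id": 1, "store_id": 11, "sales": 500},
]

def sales_by(dim, key, attr):
    """Total the sales measure grouped by one attribute of one dimension."""
    totals = {}
    for row in sales_fact:
        label = dim[row[key]][attr]  # direct fact-to-dimension lookup (the "star" join)
        totals[label] = totals.get(label, 0) + row["sales"]
    return totals

by_product = sales_by(product_dim, "product_id", "name")
by_store = sales_by(store_dim, "store_id", "location")
print(by_product)  # {'Laptop': 1700, 'Phone': 800}
print(by_store)    # {'Dhaka Branch': 2000, 'Branch B': 500}
```

Each query touches the fact table and exactly one dimension table, which is why the star layout keeps analytical queries simple.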

● An extension of Star Schema.

● Reduces data redundancy but is slightly more complex.

Fact Table Stores measurable data like Sales and Profit.

Product What was sold? (e.g., iPhone, Laptop)

Sub-Category Group of products (e.g., Mobiles, Laptops under Electronics)

Time When was it sold? (Date, Month, Year)

Store Where was it sold? (e.g., Dhaka Branch)

● Multiple fact tables share common dimension tables.

● Suitable for complex data warehouses.

● There are two fact tables here:

● Both fact tables share common dimensions:

○ Time — when the event happened (e.g., sale or restock)

○ Product — what item was involved (e.g., Mobile, Laptop)

● Moving down in the concept hierarchy

● Adding a new dimension

● Reducing the dimensions

● Time = "Q1" or "Q2"

● Item = "Car" or "Bus"
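
As a rough illustration of the operations above, the sketch below applies a dice with the criteria Time = "Q1" or "Q2" and Item = "Car" or "Bus", and a roll-up that reduces the dimensions. The cube cells, the sales figures, and the `dice` and `roll_up` helpers are assumptions made for this example only.

```python
# Hypothetical cube cells: two dimensions (time, item) and one measure (sales).
cube = [
    {"time": "Q1", "item": "Car", "sales": 100},
    {"time": "Q1", "item": "Bus", "sales": 50},
    {"time": "Q1", "item": "Truck", "sales": 20},
    {"time": "Q2", "item": "Car", "sales": 70},
    {"time": "Q2", "item": "Bus", "sales": 40},
    {"time": "Q3", "item": "Car", "sales": 30},
]

def dice(rows, **allowed):
    """Keep cells whose value on every named dimension is in the allowed set."""
    return [r for r in rows
            if all(r[dim] in values for dim, values in allowed.items())]

def roll_up(rows, keep):
    """Reduce dimensions: total sales grouped only by the dimensions in keep."""
    totals = {}
    for r in rows:
        key = tuple(r[d] for d in keep)
        totals[key] = totals.get(key, 0) + r["sales"]
    return totals

sub_cube = dice(cube, time={"Q1", "Q2"}, item={"Car", "Bus"})
by_quarter = roll_up(cube, keep=["time"])
print(len(sub_cube))  # 4
print(by_quarter)     # {('Q1',): 170, ('Q2',): 110, ('Q3',): 30}
```

Drill-down is the reverse of `roll_up`: it adds a dimension back or moves to a finer level of the hierarchy, which requires the more detailed cells to be available.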

● FP-Growth (Frequent Pattern Growth)

If item A is bought, item B is likely to be bought.

○ How frequently the itemset appears in the dataset.

○ Example: 20% of all transactions contain both A and B → support = 0.20

○ How often B is bought when A is bought.

○ Formula: Confidence(A ⇒ B) = Support(A ∩ B) / Support(A)

○ Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

○ Lift > 1 means positive correlation

● Confidence = 3/4 = 0.75 → Diaper appears in 4 transactions, and in 3 of them Beer is
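
The support, confidence, and lift formulas above can be checked with a small Python sketch. The five transactions are an assumption, not taken from these notes. They are chosen so that Diaper appears in 4 transactions and Beer accompanies it in 3 of them, matching the 3/4 confidence quoted above. `Fraction` is used so the ratios stay exact.

```python
from fractions import Fraction

# Hypothetical transactions (an assumption for illustration only).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(a, b):
    """How often b is bought when a is bought."""
    return support(a | b) / support(a)

def lift(a, b):
    """Confidence of a => b relative to the baseline support of b."""
    return confidence(a, b) / support(b)

a, b = {"Diaper"}, {"Beer"}
print(float(support(a | b)))    # 0.6
print(float(confidence(a, b)))  # 0.75
print(float(lift(a, b)))        # 1.25
```

With this assumed data the lift is above 1, so Diaper and Beer are positively correlated in the sense described above.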

● It means making possible combinations of items that might appear frequently in

● These combinations are called candidates.

● We create bigger itemsets from smaller frequent ones.

● It means removing the candidates that cannot be frequent.

● We use the rule:

● Apriori generates many candidate itemsets and takes more time.
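
The candidate generation and pruning steps described above can be sketched as follows. The pruning here relies on the standard Apriori property: every subset of a frequent itemset must itself be frequent. The input itemsets and the function names are hypothetical, and this is a simplified illustration rather than a full Apriori implementation (it does not count support against a dataset).

```python
from itertools import combinations

def generate_candidates(frequent, k):
    """Join step: union pairs of frequent (k-1)-itemsets into k-itemsets."""
    return {a | b for a in frequent for b in frequent if len(a | b) == k}

def prune(candidates, frequent, k):
    """Prune step: drop any candidate that has an infrequent (k-1)-subset."""
    return {
        c for c in candidates
        if all(frozenset(s) in frequent for s in combinations(c, k - 1))
    }

# Hypothetical frequent 2-itemsets found in an earlier pass.
frequent_2 = {
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "C"}),
    frozenset({"B", "D"}),
}

candidates_3 = generate_candidates(frequent_2, 3)
pruned_3 = prune(candidates_3, frequent_2, 3)
print(sorted(sorted(c) for c in candidates_3))
# [['A', 'B', 'C'], ['A', 'B', 'D'], ['B', 'C', 'D']]
print(sorted(sorted(c) for c in pruned_3))
# [['A', 'B', 'C']]
```

{A, B, D} is removed because {A, D} is not frequent, and {B, C, D} because {C, D} is not frequent. Only the surviving candidates need their support counted against the dataset.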