
Data Mining

The document compares data warehouses (DW) and operational databases (OD), highlighting their differences in data structure, performance, and usage. It also discusses the distinctions between data warehouses and data marts, outlines multi-tier data warehouse architecture, and describes various data warehouse models. Additionally, it covers OLAP operations, the relationship between data warehousing and data mining, and the data warehouse implementation process.

Uploaded by

amitchowdhuryleo

1) Compare DW and OD.

A data warehouse (DW) is a repository for structured, filtered data that has already been processed
for a specific purpose. It collects data from multiple sources and transforms it using the
ETL (Extract, Transform, Load) process.

An operational database (OD), on the other hand, is a database where the data changes frequently.
It is mainly designed for a high volume of data transactions.

| Key | Data Warehouse | Operational Database |
|---|---|---|
| 1. Basic | Stores structured, filtered data for analysis. | Data changes frequently for daily operations. |
| 2. Data Structure | Uses de-normalized schema (simpler for analysis). | Uses normalized schema (efficient for transactions). |
| 3. Performance | Fast for analytical queries. | Slower for analytics, optimized for transactional queries. |
| 4. Type of Data | Focuses on historical data. | Focuses on current data. |
| 5. Use Case | Used for OLAP (Online Analytical Processing). | Used for OLTP (Online Transaction Processing). |
| 6. Update Frequency | Data is updated periodically (batch process). | Data is updated continuously in real-time. |
| 7. Users | Used by analysts and managers for decision-making. | Used by operational staff for daily activities. |
| 8. Purpose | Supports reporting, forecasting, and business intelligence. | Supports day-to-day operations like order processing and billing. |

2) Compare a data warehouse and a data mart.

| Key | Data Warehouse | Data Mart |
|---|---|---|
| 1. Scope | Enterprise-wide, covers all business areas. | Department-level, focuses on a specific business area. |
| 2. Size | Large in size (terabytes or more). | Smaller in size (gigabytes to terabytes). |
| 3. Data Source | Integrates data from multiple sources. | May use data from a subset of sources or the data warehouse itself. |
| 4. Complexity | More complex and comprehensive. | Less complex and easier to manage. |
| 5. Implementation | Takes longer to build and requires more resources. | Faster and cheaper to implement. |
| 6. Usage | Used by top management and analysts for organization-wide insights. | Used by specific teams (e.g., sales, finance) for focused analysis. |
| 7. Maintenance | Harder to maintain due to size and complexity. | Easier to maintain. |
| 8. Data Type | Contains detailed and summary data. | Often contains summarized or specific data. |

3) Describe with an appropriate figure the multi-tier Data Warehouse Architecture.


| Tier | Purpose |
|---|---|
| Bottom Tier | Collects and loads raw data |
| Middle Tier | Stores, organizes, and processes data |
| Top Tier | Presents data to users for analysis |

A typical Data Warehouse architecture is divided into three tiers:

1. Bottom Tier – Data Source Layer

●​ This layer contains databases, flat files, or external data sources.​

●​ Data is collected from different operational systems (e.g., sales, HR, finance).​

●​ ETL (Extract, Transform, Load) tools are used to clean, transform, and load the data
into the warehouse.

2. Middle Tier – Data Storage and Processing Layer

●​ This is the core of the data warehouse.​

●​ Stores the processed, structured data in a centralized repository.​

●​ Uses OLAP servers for fast querying and data analysis.​

○​ OLAP can be:​

■​ ROLAP (Relational OLAP)​

■​ MOLAP (Multidimensional OLAP)​

●​ Supports complex queries and multidimensional analysis.

3. Top Tier – Front-End or Presentation Layer

●​ This is the user interface layer.​

●​ Provides access to the data through:​

○​ Reporting tools​
○​ Dashboards​

○​ Data mining tools​

○​ Query tools​

●​ Used by business analysts, managers, and decision-makers.​

4) Describe the different data warehouse models.

| Feature | Enterprise Data Warehouse (EDW) | Data Mart | Operational Data Store (ODS) |
|---|---|---|---|
| Definition | Centralized warehouse storing data for the whole organization | Mini-warehouse focused on a specific department | Stores current real-time data from operational systems |
| Scope | Organization-wide | Department-specific (e.g., sales, HR, marketing) | Real-time or daily operations |
| Data Type | Historical | Historical | Current |
| Use Case | Strategic business decisions | Department-level analysis | Daily operations and quick access |
| Example | Retail company analyzing total sales across all regions | Sales team analyzing regional performance | Bank tracking live transactions |

5) What is a data model and a multidimensional data model?

Data Model

●​ A data model is a way to organize and define the structure of data.​

●​ It shows how data is stored, connected, and processed.


Types of Data Models:

1.​ Conceptual Model – High-level overview (what data is needed)​

2.​ Logical Model – Detailed structure (how data is related)​

3.​ Physical Model – Actual database structure (tables, columns, keys)

+------------+          +-------------+
| Customer   |----------| Orders      |
+------------+          +-------------+
| CustomerID |          | OrderID     |
| Name       |          | CustomerID  |
| Address    |          | OrderDate   |
+------------+          +-------------+

Multidimensional Data Model

●​ A multidimensional data model is used in data warehousing and OLAP.​

●​ It represents data in the form of a cube, where:​

○​ Each dimension represents a way to analyze the data (e.g., time, location,
product).​

○​ The fact contains measurable data (e.g., sales amount, quantity).​

Example:

Let’s say you're analyzing sales data:

●​ Fact Table: Sales amount​

●​ Dimensions:​

○​ Time (e.g., year, month)​

○​ Product (e.g., phone, laptop)


              +--------------------+
              |   Time Dimension   |
              | (Year, Month, Day) |
              +--------------------+
                        |
+-----------+     +-----------+     +-----------+
|  Product  |     |   Fact    |     | Location  |
| Dimension |-----|   Table   |-----| Dimension |
|  (Phone,  |     |  (Sales)  |     |  (City,   |
|  Laptop)  |     |           |     | Country)  |
+-----------+     +-----------+     +-----------+

6) What are facts, dimensions and schemas?

| Term | Definition | Example |
|---|---|---|
| Fact | Numeric, measurable data | Sales, Revenue, Quantity |
| Dimension | Descriptive data giving context to facts | Time, Product, Location |
| Schema | Structure of fact and dimension tables | Star, Snowflake, Galaxy schemas |

Different multidimensional model schemas along with appropriate figures

1. Star Schema
Definition:

●​ Simplest and most common schema.​

●​ A central fact table is directly connected to dimension tables.​


●​ Dimension tables are denormalized (flattened).​

Diagram:
                +-----------+
                |  Product  |
                +-----------+
                      |
+--------+      +-----------+      +---------+
|  Time  |------|   Fact    |------|  Store  |
+--------+      +-----------+      +---------+
                      |
                +-----------+
                | Customer  |
                +-----------+

Fact measures: (Sales, Quantity)

The center table is the Fact Table — it holds the numbers you want to analyze.​

The tables around it are Dimension Tables — they tell you more about who, what, where, and
when.

| Component | Meaning |
|---|---|
| Fact Table | Stores data to analyze, like Sales and Quantity. |
| Product | What product was sold? (e.g., Laptop, Phone) |
| Time | When was it sold? (e.g., Date, Month, Year) |
| Store | Where was it sold? (e.g., Location, Branch) |
| Customer | Who bought it? (e.g., Name, Age, Gender) |
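
The star schema above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (dim_product, dim_store, fact_sales), the column names, and the sample rows are hypothetical and not taken from a real warehouse. It shows the typical query shape, where the fact table is joined to a dimension and a measure is aggregated.

```python
import sqlite3

# Hypothetical table names, column names, and rows, chosen to mirror the diagram.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, location TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    sales      REAL,
    quantity   INTEGER
);
INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
INSERT INTO dim_store   VALUES (1, 'Dhaka'), (2, 'Delhi');
INSERT INTO fact_sales  VALUES (1, 1, 1000.0, 1), (2, 1, 500.0, 2), (2, 2, 250.0, 1);
""")


def sales_by_product(connection):
    """Join the fact table to the Product dimension and total the Sales measure."""
    return connection.execute("""
        SELECT p.name, SUM(f.sales)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        GROUP BY p.name
        ORDER BY p.name
    """).fetchall()


print(sales_by_product(conn))  # [('Laptop', 1000.0), ('Phone', 750.0)]
```

Because each dimension is a single flat table, every analytical query needs only one join per dimension, which is the main reason the star schema is simple to query.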

2. Snowflake Schema
Definition:

●​ An extension of Star Schema.​


●​ Dimension tables are normalized (split into sub-tables).​

●​ Reduces data redundancy but is slightly more complex.​

Diagram:
                +--------------+
                | Sub-Category |
                +------+-------+
                       |
                  +----v----+
                  | Product |
                  +----+----+
                       |
+--------+       +-----v----+       +--------+
|  Time  |-------|   Fact   |-------| Store  |
+--------+       +----------+       +--------+
                 (Sales, Profit)

At the center is the Fact Table, where we store the values we want to measure — like Sales
and Profit.​

The tables around it are Dimension Tables that describe details like time, product, and store.​

One of the dimension tables (Product) is linked to another table (Sub-Category) — this is what
makes it a Snowflake schema.

| Component | Meaning |
|---|---|
| Fact Table | Stores measurable data like Sales and Profit. |
| Product | What was sold? (e.g., iPhone, Laptop) |
| Sub-Category | Group of products (e.g., Mobiles, Laptops under Electronics) |
| Time | When was it sold? (Date, Month, Year) |
| Store | Where was it sold? (e.g., Dhaka Branch) |

3. Fact Constellation Schema (Galaxy Schema)
Definition:

●​ Multiple fact tables share common dimension tables.​

●​ Suitable for complex data warehouses.

Diagram:
              +--------+
              |  Time  |
              +---+----+
                  |
        +---------+---------+
        |                   |
  +-----v-----+       +-----v-----+
  |   Sales   |       | Inventory |
  +-----+-----+       +-----+-----+
        |                   |
        +---------+---------+
                  |
             +----v----+
             | Product |
             +---------+

●​ There are two fact tables here:​


➤ Sales — stores data like units sold, revenue​
➤ Inventory — stores data like stock quantity, restock info​

●​ Both fact tables share common dimensions:​

○​ Time — when the event happened (e.g., sale or restock)​

○​ Product — what item was involved (e.g., Mobile, Laptop)

7) What is OLAP? Describe the different OLAP operations.

OLAP stands for Online Analytical Processing. It is a software technology that allows
users to analyze information from multiple database systems at the same time. It is based on
the multidimensional data model and allows the user to query multi-dimensional data (e.g.,
Delhi -> 2018 -> Sales data).
OLAP operations:

There are five basic analytical operations that can be performed on an OLAP cube:
1.​ Drill down: In drill-down operation, the less detailed data is converted into highly

detailed data. It can be done by:

●​ Moving down in the concept hierarchy

●​ Adding a new dimension

In the cube given in overview section, the drill down operation is performed by

moving down in the concept hierarchy of Time dimension (Quarter -> Month).

2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
●​ Climbing up in the concept hierarchy

●​ Reducing the dimensions

In the cube given in the overview section, the roll-up operation is performed by climbing up in
the concept hierarchy of Location dimension (City -> Country).
3. Dice: It selects a sub-cube from the OLAP cube by applying criteria on two or more dimensions. In the
cube given in the overview section, a sub-cube is selected using the following dimensions
with criteria:
● Location = "Delhi" or "Kolkata"

● Time = "Q1" or "Q2"

● Item = "Car" or "Bus"

4. Slice: It fixes a single value on one dimension of the OLAP cube, which results in a new sub-cube.
In the cube given in the overview section, Slice is performed on the dimension Time =
"Q1".
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view
of the representation. In the sub-cube obtained after the slice operation, performing the pivot
operation gives a new view of it.
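
The operations above can be sketched on a tiny in-memory cube. This is a minimal illustration, not an OLAP server: the cell values, the extra city "Mumbai", and the function names are assumptions made for the example. Drill-down is not shown because it is the inverse of roll-up and needs finer-grained data (for example, months inside each quarter).

```python
from collections import defaultdict

# Hypothetical cube cells: (location, quarter, item) -> units sold.
cube = {
    ("Delhi", "Q1", "Car"): 10,
    ("Delhi", "Q2", "Car"): 20,
    ("Delhi", "Q1", "Bus"): 5,
    ("Kolkata", "Q1", "Car"): 7,
    ("Mumbai", "Q3", "Bus"): 4,
}
CITY_TO_COUNTRY = {"Delhi": "India", "Kolkata": "India", "Mumbai": "India"}


def roll_up_location(cells):
    """Roll up: climb the Location hierarchy from City to Country and aggregate."""
    out = defaultdict(int)
    for (city, quarter, item), value in cells.items():
        out[(CITY_TO_COUNTRY[city], quarter, item)] += value
    return dict(out)


def slice_time(cells, quarter):
    """Slice: fix one value on the Time dimension, leaving a 2-D view."""
    return {(loc, item): v for (loc, q, item), v in cells.items() if q == quarter}


def dice(cells, locations, quarters, items):
    """Dice: keep only cells that match criteria on several dimensions."""
    return {k: v for k, v in cells.items()
            if k[0] in locations and k[1] in quarters and k[2] in items}


def pivot(view):
    """Pivot: swap the two axes of a 2-D view."""
    return {(b, a): v for (a, b), v in view.items()}


q1 = slice_time(cube, "Q1")
print(roll_up_location(cube)[("India", "Q1", "Car")])  # 17
print(pivot(q1)[("Car", "Delhi")])                      # 10
```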

8) What is the relation between data warehousing and data mining?

Data warehousing and data mining are closely connected, but they serve different purposes
in data analysis.

Relationship Summary:

●​ Data Warehouse is like a library where all your data is stored neatly.​

●​ Data Mining is like reading and studying that library to discover new knowledge.

| Aspect | Data Warehousing | Data Mining |
|---|---|---|
| Definition | A system that stores large amounts of organized data for analysis. | A technique to discover patterns, trends, or insights from data. |
| Purpose | To store, clean, and organize data from multiple sources. | To analyze that data and find useful knowledge. |
| Dependency | Provides the source data for mining. | Needs data from warehouse or other structured systems. |
| Process Type | Mainly storage and management of data. | Mainly analysis and discovery from data. |
| Example | A company stores 5 years of customer data. | They use that data to find which customers may leave. |

[Figure: Working of the Top-Down Approach]

[Figure: Bottom-Up Approach]

9) How is a data warehouse different from a database? How are they similar?

| Feature | Database | Data Warehouse |
|---|---|---|
| Purpose | Stores current data for daily operations | Stores historical data for analysis and reporting |
| Data Type | Real-time, transactional data | Historical, analytical data |
| Usage | Used in apps like banking, e-commerce | Used in BI tools, reporting, decision making |
| Query Type | Simple insert, update, delete | Complex queries for trends, summaries |
| Data Structure | Usually normalized | Usually de-normalized |
| Update Frequency | Frequently updated | Periodically updated (e.g., daily, weekly) |
| Speed Focus | Optimized for fast write operations | Optimized for fast read operations |

Similarities

| Similarity Point | Description |
|---|---|
| Store Data | Both are used to store and manage structured data |
| Use SQL | Both can use SQL to query and manage data |
| Have Tables | Both organize data in tables with rows and columns |
| Need Backup & Security | Both require data protection and backup systems |

10) Describe the data warehouse implementation process in detail.

| Step | Name | What Happens | Example |
|---|---|---|---|
| 1 | Requirement Gathering | Understand business goals and data needs | Sales team wants monthly and yearly sales reports |
| 2 | Data Modeling | Design warehouse structure (schemas, tables) | Create a Star Schema with Fact_Sales and Dim_Product tables |
| 3 | Data Integration & ETL | Extract, clean, and load data from sources | Extract data from ERP and CRM systems, clean, and load into warehouse |
| 4 | Data Cleansing & Validation | Fix and verify data accuracy | Remove duplicate customer entries, correct product codes |
| 5 | Building Data Marts | Create department-focused mini warehouses | Build separate data marts for Sales, HR, and Finance departments |
| 6 | Data Security & Governance | Control access and ensure data privacy | Give HR team access only to employee data, not financial data |
| 7 | Testing & QA | Test data quality, performance, and reliability | Verify that sales totals match source data, and queries run efficiently |
| 8 | Deployment & Maintenance | Go live and maintain system health | Launch the system and schedule regular data refreshes |
| 9 | User Training & Adoption | Train staff to use the warehouse effectively | Conduct sessions for managers on using dashboards and reports |
| 10 | Ongoing Optimization | Improve and expand over time | Add new data sources like mobile app usage or customer feedback |

11) In the context of data warehousing, what is data transformation?

Data Transformation is the process of converting raw data from different sources into a clean,
consistent, and usable format before loading it into the data warehouse.

It is part of the ETL process (Extract, Transform, Load).

| Task | What it does | Example |
|---|---|---|
| Data Cleaning | Removes errors or inconsistencies | Fixing missing or incorrect customer names |
| Data Standardization | Converts data into a standard format | Date: “12/06/25” → “2025-06-12” |
| Data Mapping | Matches fields from source to target | “Cust_ID” in source → “Customer_ID” in DW |
| Data Aggregation | Summarizes data for analysis | Daily sales → Monthly sales |
| Data Encoding | Converts categories into numbers or formats | “Male/Female” → 1/0 |
| Data Merging | Combines data from multiple sources | Merge customer data from two different systems |
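
Three of the tasks above (mapping, standardization, and encoding) can be sketched on a single record. This is a minimal illustration under assumptions: the source row, the Order_Date and Gender field names, and the transform function are hypothetical, and the source date is assumed to be in DD/MM/YY order, which is what the “12/06/25” → “2025-06-12” example implies.

```python
from datetime import datetime

# Hypothetical source record; "Cust_ID" follows the mapping example in the table.
source_row = {"Cust_ID": "42", "Order_Date": "12/06/25", "Gender": "Male"}

# Data mapping: source field name -> warehouse field name.
FIELD_MAP = {"Cust_ID": "Customer_ID", "Order_Date": "Order_Date", "Gender": "Gender"}
# Data encoding: category label -> numeric code.
GENDER_CODES = {"Male": 1, "Female": 0}


def transform(row):
    """Apply mapping, standardization, and encoding to one source row."""
    out = {FIELD_MAP[key]: value for key, value in row.items()}
    # Standardization: parse DD/MM/YY (an assumption) and emit ISO YYYY-MM-DD.
    parsed = datetime.strptime(out["Order_Date"], "%d/%m/%y")
    out["Order_Date"] = parsed.strftime("%Y-%m-%d")
    out["Gender"] = GENDER_CODES[out["Gender"]]
    return out


print(transform(source_row))
# {'Customer_ID': '42', 'Order_Date': '2025-06-12', 'Gender': 1}
```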

12) What do you mean by mining frequent patterns?

Frequent pattern mining is the process of discovering repeated combinations or common
patterns in a large dataset.

Example (Market Basket Analysis):

Suppose you have many customer transactions at a grocery store.

One frequent pattern could be:


{Bread, Butter, Jam} → These items are often bought together.

So, if a customer buys Bread and Butter, there's a high chance they also buy Jam.

Popular Algorithms:

●​ Apriori Algorithm​

●​ FP-Growth (Frequent Pattern Growth)

Summary:

Mining frequent patterns helps in finding valuable relationships or associations in


big data, which supports better decision-making and predictions.

13) Discuss support, confidence and lift.

| Concept | Meaning | Formula | Example Result |
|---|---|---|---|
| Support | How often an itemset appears in all data | Support(A) = Transactions containing A / Total transactions | If {bread, eggs} appears in 5 out of 10 transactions → 5/10 = 0.5 or 50% |
| Confidence | How likely B is bought when A is bought | Confidence(A⇒B) = Support(A∪B) / Support(A) | If {bread, milk} appears in 5 transactions and {bread, milk, eggs} in 4 → 4/5 = 0.8 or 80% |
| Lift | How much more likely A and B occur together than independently | Lift(A⇒B) = Support(A∪B) / (Support(A) × Support(B)) | If Support({bread, milk, eggs}) = 0.4, Support({bread, milk}) = 0.5, Support({eggs}) = 0.6 → 0.4 / (0.5 × 0.6) = 1.33 |

14) Describe association rule mining with a proper example.

Association Rule Mining is a key technique in data mining used to discover interesting
relationships, patterns, or associations among a set of items in large datasets.

Definition:

Association Rule Mining finds rules like:

If item A is bought, item B is likely to be bought.​


This is written as:​
A⇒B

Key Terms:

1.​ Support:​

○​ How frequently the itemset appears in the dataset.​

○​ Example: 20% of all transactions contain both A and B → support = 0.20​

2.​ Confidence:​

○​ How often B is bought when A is bought.​

○ Formula: Confidence(A ⇒ B) = Support(A ∪ B) / Support(A)

3. Lift:

○ How much more likely B is bought when A is bought, compared with how often B is
bought overall.

○ Formula: Lift(A ⇒ B) = Confidence(A ⇒ B) / Support(B)

○ Lift > 1 means positive correlation

Example (Market Basket Analysis):


Let’s say we have the following transactions:

| Transaction ID | Items Bought |
|---|---|
| 1 | Bread, Milk |
| 2 | Bread, Diaper, Beer, Eggs |
| 3 | Milk, Diaper, Beer, Coke |
| 4 | Bread, Milk, Diaper, Beer |
| 5 | Bread, Milk, Diaper, Coke |

We can find the rule:

{Diaper} ⇒ {Beer}

● Support = 3/5 = 0.6 → Diaper and Beer occur together in 3 out of 5 transactions.

● Confidence = 3/4 = 0.75 → Diaper appears in 4 transactions, and in 3 of them Beer is
also there.

● Lift = 0.75 / (3/5) = 0.75 / 0.6 = 1.25 → Beer appears in 3 out of 5 transactions, so Support(Beer) = 0.6. The lift is greater than 1, so Diaper and Beer are positively correlated.
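
The three measures for {Diaper} ⇒ {Beer} can be checked with a few lines of Python over the five transactions listed above. The function names are illustrative, not part of any library.

```python
# The five transactions from the example table.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, db):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db)


def support(itemset, db):
    return count(itemset, db) / len(db)


def confidence(a, b, db):
    """Confidence(A => B) = count(A and B together) / count(A)."""
    return count(set(a) | set(b), db) / count(a, db)


def lift(a, b, db):
    """Lift(A => B) = Confidence(A => B) / Support(B)."""
    return confidence(a, b, db) / support(b, db)


print(support({"Diaper", "Beer"}, transactions))                 # 0.6
print(confidence({"Diaper"}, {"Beer"}, transactions))            # 0.75
print(round(lift({"Diaper"}, {"Beer"}, transactions), 2))        # 1.25
```

Beer appears in transactions 2, 3, and 4, so Support(Beer) = 3/5, which is the denominator used for the lift.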

15) What are candidate generation and pruning?

Candidate Generation

● It means making possible combinations of items that might appear frequently in
transactions.

●​ These combinations are called candidates.​

●​ We create bigger itemsets from smaller frequent ones.​

Example:
If {Milk}, {Bread}, and {Butter} are frequent 1-itemsets, we combine them to form:

●​ {Milk, Bread}, {Milk, Butter}, {Bread, Butter} → These are 2-item candidates

Pruning

●​ It means removing the candidates that cannot be frequent.​

●​ We use the rule:​


➤ If any part (subset) of a candidate is not frequent, remove it.​

Example:

If we have a 3-item candidate {Milk, Bread, Diaper}, but {Milk, Diaper} is not frequent, then we
prune (remove) {Milk, Bread, Diaper}.
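
Both steps can be sketched in Python. This is a simplified illustration, assuming itemsets are stored as frozensets; the function names are hypothetical, and the join here unions any two frequent (k-1)-itemsets rather than using the prefix-based join of textbook Apriori.

```python
from itertools import combinations


def generate_candidates(frequent, k):
    """Join step: union pairs of frequent (k-1)-itemsets that form a k-itemset."""
    frequent = set(frequent)
    return {a | b for a in frequent for b in frequent if len(a | b) == k}


def prune(candidates, frequent, k):
    """Prune step: drop any candidate that has an infrequent (k-1)-subset."""
    frequent = set(frequent)
    return {c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))}


# Frequent 1-itemsets from the first example.
f1 = {frozenset({"Milk"}), frozenset({"Bread"}), frozenset({"Butter"})}
print(len(generate_candidates(f1, 2)))  # 3 two-item candidates

# Second example: {Milk, Diaper} is not frequent, so the 3-item candidate is pruned.
f2 = {frozenset({"Milk", "Bread"}), frozenset({"Bread", "Diaper"})}
c3 = generate_candidates(f2, 3)
print(prune(c3, f2, 3))  # set()
```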

16) Discuss the frequent pattern growth algorithm with the FP-tree.

● FP-Growth is a fast method to find frequent itemsets. Unlike Apriori, it does not generate
candidate itemsets.

●​ It uses a special tree structure called an FP-Tree to store the transaction data in a
compressed form.

Why FP-Growth?

●​ Apriori generates many candidate itemsets and takes more time.​

●​ FP-Growth avoids this by using a tree and is faster and more efficient.

Steps of FP-Growth Algorithm:

Step 1: Build the FP-Tree

1.​ Scan the data to find the frequent items and their counts (support).​
2.​ Sort the items in each transaction by frequency (most frequent first).​

3.​ Build the FP-Tree:​

○​ Start with a root node (null)​

○​ For each transaction, add items as a path in the tree.​

○​ If a path already exists, increase the count.​

Example:

Let’s say we have 3 transactions:

T1: a, b, d
T2: b, c, d
T3: a, b, c

Step 1: Count item frequency:

●​ a:2, b:3, c:2, d:2​

Step 2: Sort the items in each transaction by frequency (a, c, and d are tied at 2, so the tie is broken in the fixed order a, d, c):

●​ T1: b, a, d​

●​ T2: b, d, c​

●​ T3: b, a, c​

Step 3: Build FP-tree:

              (null)
                |
              b(3)
             /    \
         a(2)      d(1)
        /    \        \
    d(1)    c(1)      c(1)
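
The tree construction can be sketched in Python as follows. This is a minimal illustration: the Node class and the helper names are hypothetical, no minimum-support filter is applied, and the tie among a, c, and d is broken as b, a, d, c, which is the order the sorted transactions in the example use. Header tables and node links, which a full FP-Growth implementation needs for mining, are left out.

```python
from collections import Counter


class Node:
    """One FP-tree node: an item, its count, and its children keyed by item."""

    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}


def build_fp_tree(transactions, order):
    """Insert each transaction as a path, with items sorted by the given order."""
    rank = {item: i for i, item in enumerate(order)}
    root = Node(None)
    for transaction in transactions:
        node = root
        for item in sorted(transaction, key=rank.__getitem__):
            if item not in node.children:
                node.children[item] = Node(item)
            node = node.children[item]
            node.count += 1  # shared prefixes just increase the count
    return root


def show(node, depth=0):
    """Return the tree as indented lines such as 'b(3)'."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}({child.count})")
        lines.extend(show(child, depth + 1))
    return lines


transactions = [["a", "b", "d"], ["b", "c", "d"], ["a", "b", "c"]]
counts = Counter(item for t in transactions for item in t)
order = ["b", "a", "d", "c"]  # assumed tie-break among a, c, d
tree = build_fp_tree(transactions, order)
print("\n".join(show(tree)))
```

The printed tree has b(3) at the top, with a(2) and d(1) below it, matching the counts in the example.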

Step 2: Mine Frequent Patterns from the Tree

●​ Start from the bottom of the FP-Tree, and create conditional patterns.​

●​ Build smaller FP-trees for each item and extract frequent itemsets.

Summary Table:

| Term | Explanation |
|---|---|
| FP-Growth | Algorithm to find frequent patterns fast |
| FP-Tree | A tree structure to store itemsets |
| Step 1 | Build the FP-tree from transactions |
| Step 2 | Extract frequent itemsets from the tree |

17) Compare the Apriori and FP-Growth algorithms.

| Feature | Apriori Algorithm | FP-Growth Algorithm |
|---|---|---|
| Method | Uses candidate generation | Uses a tree (FP-Tree), no candidates |
| Speed | Slower for large data | Faster and more efficient |
| Memory Use | Uses more memory due to many candidates | Uses less memory (compressed tree) |
| Working Steps | Repeatedly scans database | Scans database only twice |
| Main Structure | Uses itemset list | Uses FP-Tree |
| Drawback | Time-consuming for big data | Tree can be complex if data has low overlap |
| Best For | Small datasets | Large datasets with many frequent patterns |

18) What is the importance of association rule mining?

Association rule mining helps us find relationships between items in a dataset.​


It shows what items are often bought or used together.

Why it's important (with rules and examples):

1.​ Market Basket Analysis​

○​ Finds what people buy together.


○​ Rule: If a customer buys bread, they also buy butter.​
→ Bread ⇒ Butter​

2.​ Product Recommendation


○​ Helps online stores suggest items.
○​ Rule: If someone buys a phone, they also buy a phone cover.​
→ Phone ⇒ Phone Cover​

3.​ Sales Improvement​


○​ Stores can place related items together to boost sales.
○​ Rule: If people buy shampoo, they also buy conditioner.​
→ Shampoo ⇒ Conditioner​

4.​ Detecting Patterns


○​ Useful in banking or healthcare to find hidden patterns.
○​ Rule: If a patient has symptom A and B, they might have disease C.​
→ A, B ⇒ C​

5.​ Better Decisions


○​ Helps businesses make smart choices using data.
○​ Rule: If a user visits page A, they often go to page B.​
→ Page A ⇒ Page B

19) Define some correlation measures: chi-square, all_conf, max_conf, kulc and
cosine.

1. Chi-Square (Chi²)

●​ Checks whether two items are independent or related.


●​ A high Chi² value means the items are strongly related.
●​ It compares the expected frequency with the actual frequency.
●​ Use: To test how strong the relationship is between two items.

2. All_Conf (All Confidence)

● Measures how strong the relationship is in its weaker direction: it equals the smaller of Confidence(X ⇒ Y) and Confidence(Y ⇒ X).
●​ Formula:​
All_Conf = Support(X and Y) / max(Support(X), Support(Y))​

●​ Value is between 0 and 1.


●​ Higher value = stronger relationship
3. Max_Conf (Maximum Confidence)

●​ Takes the maximum confidence from both directions.


●​ Formula:​
Max_Conf = max(Confidence(X ⇒ Y), Confidence(Y ⇒ X))
●​ Helps to find the strongest direction of the rule.

4. Kulc (Kulczynski Measure)

●​ It is the average of confidence in both directions.


●​ Formula:​
Kulc = (Confidence(X ⇒ Y) + Confidence(Y ⇒ X)) / 2
●​ It gives a balanced view of how strong the relationship is both ways.

5. Cosine

●​ Measures similarity between two items.


●​ Formula:​
Cosine = Support(X and Y) / √(Support(X) × Support(Y))
●​ Value is between 0 and 1.
●​ Closer to 1 = stronger similarity

Lift

Lift is a measure used to check how strong or useful an association rule is compared to
random chance.

Lift(X ⇒ Y) = Support(X and Y) / (Support(X) × Support(Y))

How to Understand Lift:

●​ Lift = 1 → X and Y are independent (no effect on each other)​

●​ Lift > 1 → X and Y are positively related (X increases chance of Y)​

●​ Lift < 1 → X and Y are negatively related (X reduces chance of Y)
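
The support-based measures above (all_conf, max_conf, Kulc, cosine, and lift) can be computed together from three supports. This is a small sketch with a hypothetical function name and made-up support values; chi-square is left out because it needs the full contingency table of observed and expected counts.

```python
from math import sqrt


def measures(sup_x, sup_y, sup_xy):
    """Compute the measures from Support(X), Support(Y), and Support(X and Y)."""
    conf_xy = sup_xy / sup_x  # Confidence(X => Y)
    conf_yx = sup_xy / sup_y  # Confidence(Y => X)
    return {
        "all_conf": sup_xy / max(sup_x, sup_y),
        "max_conf": max(conf_xy, conf_yx),
        "kulc": (conf_xy + conf_yx) / 2,
        "cosine": sup_xy / sqrt(sup_x * sup_y),
        "lift": sup_xy / (sup_x * sup_y),
    }


# Hypothetical supports: X in 50% of transactions, Y in 25%, both in 25%.
result = measures(0.5, 0.25, 0.25)
print(result["all_conf"], result["max_conf"], result["kulc"], result["lift"])
# 0.5 1.0 0.75 2.0
print(round(result["cosine"], 4))  # 0.7071
```

Note that all_conf comes out equal to min(Confidence(X ⇒ Y), Confidence(Y ⇒ X)), here min(0.5, 1.0) = 0.5.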


20) How does the Apriori algorithm work?

What is the Apriori Algorithm?

Apriori is a data mining algorithm used to find frequent itemsets and create association rules
(like "if a customer buys X, they also buy Y").

How It Works (Step-by-step):

Step 1: Count single items (1-itemsets)

●​ Scan the whole dataset.


●​ Count how many times each item appears.
●​ Keep only those that meet minimum support.

Example:
If "Milk" appears in 3 out of 5 transactions, its support count = 3.
If the minimum support count = 2, keep it.

Step 2: Generate 2-itemsets

●​ Combine frequent 1-itemsets to form 2-item combinations.


●​ Count how many times each 2-itemset appears.
●​ Keep only those with enough support.

Example:​
{Milk, Bread} appears in 2 transactions → keep it

Step 3: Generate 3-itemsets (and so on)

●​ Combine frequent 2-itemsets to make 3-itemsets.


●​ Again, count and keep those that meet minimum support.
●​ Continue until no more frequent itemsets are found.

Step 4: Generate Association Rules

●​ From the frequent itemsets, generate rules like:​


If X, then Y
●​ Calculate confidence = Support(X and Y) / Support(X)
●​ Keep rules that meet the minimum confidence threshold.

Example Rule:
Milk ⇒ Bread (if a customer buys Milk, they also buy Bread)
Confidence = 2/3 ≈ 67%
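
The four steps can be sketched as a short level-wise loop. This is an illustrative, unoptimized version: the function names are hypothetical, and the five transactions are made up so that Milk appears in 3 of them and {Milk, Bread} in 2, matching the numbers used in the examples above.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([i]) for t in transactions for i in t}  # 1-item candidates
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those meeting the minimum support count.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the subset rule.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return frequent


def rules(frequent, min_conf):
    """Return (antecedent, consequent, confidence) for rules meeting min_conf."""
    out = []
    for itemset, n in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), r):
                lhs = frozenset(lhs)
                conf = n / frequent[lhs]  # count(X and Y) / count(X)
                if conf >= min_conf:
                    out.append((lhs, itemset - lhs, conf))
    return out


# Hypothetical transactions; minimum support count 2, minimum confidence 0.6.
db = [{"Milk", "Bread"}, {"Milk", "Bread", "Butter"}, {"Milk"},
      {"Bread", "Butter"}, {"Butter"}]
found = apriori(db, min_count=2)
print(found[frozenset({"Milk"})], found[frozenset({"Milk", "Bread"})])  # 3 2
```

With these transactions, {Milk, Bread, Butter} is never counted: it is pruned because its subset {Milk, Butter} appears only once.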

21) How does knowledge discovery in databases (KDD) work?

Steps of KDD Process:

1. Data Selection

●​ Choose the relevant data from a large database.


●​ Example: From a hospital database, select only patient age, symptoms, and test
results.

2. Data Preprocessing / Cleaning

● Handle missing, duplicate, or noisy data.


●​ Example: Fill missing values, remove rows with errors.

3. Data Transformation

●​ Convert data into a suitable format for mining.


●​ Example: Normalize values, convert categories to numbers.

4. Data Mining

●​ Apply algorithms to find patterns and rules.


●​ Example: Use Apriori to find which items are often bought together.

5. Pattern Evaluation

●​ Check if the patterns are interesting, useful, and valid.


●​ Example: Ignore patterns that are already obvious or not meaningful.

6. Knowledge Presentation
●​ Show the discovered knowledge using graphs, charts, or reports.
●​ Example: Show association rules in an easy-to-read format.

22) Briefly describe sequential pattern mining.

What is Sequential Pattern Mining?

Sequential Pattern Mining is a type of data mining that finds frequent sequences or patterns
in ordered data.

Why it's useful?

It helps businesses or systems to understand sequences of actions, like:

●​ What customers usually do step by step


●​ What products are bought in sequence
●​ What events happen one after another

Example:

Let’s say we have this data from three customers:

| Customer | Transactions (by time) |
|---|---|
| C1 | A→B→C |
| C2 | A→D |
| C3 | A→B→D |

From this, a frequent sequence could be:​


A → B (because it appears in C1 and C3)

Steps of Sequential Pattern Mining:

1.​ Scan database to find frequent individual items.


2.​ Generate sequences from frequent items.
3.​ Count how often each sequence appears.
4.​ Keep only those with support ≥ minimum threshold.​
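
Counting how often a sequence appears (step 3) comes down to a subsequence check: the pattern's items must occur in the customer's sequence in the same order, with gaps allowed. A minimal sketch using the three customers from the example, with illustrative function names:

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # "item in remaining" advances the iterator, so order is enforced.
    return all(item in remaining for item in pattern)


def sequence_support(pattern, database):
    """Number of customers whose sequence contains the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database.values())


database = {
    "C1": ["A", "B", "C"],
    "C2": ["A", "D"],
    "C3": ["A", "B", "D"],
}

print(sequence_support(["A", "B"], database))  # 2 (C1 and C3)
print(sequence_support(["B", "A"], database))  # 0 (order matters)
```

Because gaps are allowed, A → D is also supported by two customers (C2 and C3).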

23) Discuss the breadth-first-search-based generalized sequential pattern (GSP)
algorithm for the

GSP is an algorithm used in sequential pattern mining to discover frequent sequences in
ordered transaction data (like shopping history or time-series data).

It works using a breadth-first search (BFS) strategy — meaning it explores all short patterns
before moving to longer ones.

How GSP Algorithm Works (Step-by-step):

Step 1: Find Frequent 1-Sequences

●​ Scan the database and find all frequent individual items (called 1-sequences) that
meet the minimum support.

Step 2: Candidate Generation (k-sequences)

●​ Use the frequent (k-1)-sequences to create candidate k-sequences (just like Apriori).
●​ For example, combine frequent 1-sequences to make 2-sequences.

Step 3: Support Counting

●​ Scan the database again and count how many times each candidate sequence appears.
●​ Only keep those that meet the minimum support threshold.

Step 4: Repeat

●​ Repeat Steps 2 and 3, increasing sequence length each time (k = 2, 3, 4...), until no
more frequent sequences are found.

Example:

Let’s say we have a database of 3 customers:


| Customer | Sequence |
|---|---|
| C1 | A→B→C |
| C2 | A→C |
| C3 | A→B |

● Frequent 1-sequences: A, B, C
● Generate 2-sequences: A→B, A→C, B→C
● Check support, keep frequent ones
● Then try 3-sequences: A→B→C, etc.
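
The level-wise loop can be sketched in Python for the three customers above. This is a simplified illustration, not the full GSP algorithm: each event is a single item, the explicit candidate-pruning step and GSP's time constraints are omitted, the function names are hypothetical, and a minimum support count of 2 is assumed because the example does not state one.

```python
def is_subsequence(pattern, sequence):
    """True if pattern's items appear in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def gsp(database, min_count):
    """Breadth-first mining: all frequent k-sequences are found before length k+1."""
    items = sorted({item for seq in database for item in seq})
    candidates = [(item,) for item in items]
    frequent = {}
    while candidates:
        # Support counting: one pass over the database per level.
        level = {}
        for cand in candidates:
            n = sum(is_subsequence(cand, seq) for seq in database)
            if n >= min_count:
                level[cand] = n
        frequent.update(level)
        # Candidate generation: join s1 and s2 when s1 without its first item
        # equals s2 without its last item (for 1-sequences this gives all pairs).
        candidates = [a + b[-1:] for a in level for b in level if a[1:] == b[:-1]]
    return frequent


database = [["A", "B", "C"], ["A", "C"], ["A", "B"]]
for pattern, n in gsp(database, min_count=2).items():
    print("→".join(pattern), n)
```

Under the assumed threshold of 2, A→B and A→C are frequent, B→C is not (it appears only in C1), and so no 3-sequence candidate such as A→B→C is generated. With a threshold of 1, A→B→C would be generated and kept.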
