0% found this document useful (0 votes)
3 views96 pages

Data Mining Techniques and Applications

The document provides an overview of data mining concepts, focusing on prediction and classification techniques, including regression analysis and its various forms such as linear and nonlinear regression. It also discusses association analysis, particularly market basket analysis, which identifies patterns of co-occurrence among items in transactions. The content is sourced from materials provided by publishers of prescribed books and is associated with BITS Pilani, a deemed university.

Uploaded by

subhajitghatak26
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
3 views96 pages

Data Mining Techniques and Applications

The document provides an overview of data mining concepts, focusing on prediction and classification techniques, including regression analysis and its various forms such as linear and nonlinear regression. It also discusses association analysis, particularly market basket analysis, which identifies patterns of co-occurrence among items in transactions. The content is sourced from materials provided by publishers of prescribed books and is associated with BITS Pilani, a deemed university.

Uploaded by

subhajitghatak26
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Data Mining

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

BITS Pilani Recap


Pilani|Dubai|Goa|Hyderabad

CS1 CS4
Why Data Mining?
Introduction to Classification
What Is Data Mining? Introduction to Classification
Applications of Data Mining Decision Tree
Example: Classification, Clustering, Association mining Rule Based Classifier
Bayes Classifier, ANN, SVM, Ensemble Methods
CS2
Data Preprocessing Concepts
Data Cleaning, Integration, reduction, transformation and discretization
Data preprocessing techniques

5 Prediction CS3
Data Description
Data Similarity/Dissimilarity
Data Visualization

Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction

Agenda Prediction vs. Classification


• How is (Numerical) prediction similar to classification?
• Prediction Vs. Classification • construct a model
• use model to predict continuous or ordered value for a given input
• Regression for Prediction
• Difference between Prediction and classification
• Simple Linear Regression • Classification refers to predict categorical class label
• Parameter Estimation, Examples • Prediction models continuous-valued functions

• Non Linear Regression • Major method for prediction: Regression


• model the relationship between one or more independent or predictor variables and a
• Multiple Linear Regression dependent or response variable
• Profit, sales, mortgage rates, house values, square footage, temperature, or distance could all
be predicted using regression techniques. For example, a regression model could be used to
predict the value of a house based on location, number of rooms, lot size, and other factors.

Data Mining November 11, 2025 Data Mining


6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Applications

Prediction vs. Classification Regression for Prediction


• A regression task begins with a data set in which the target values are
known, e.g., a regression model that predicts house values could be
developed based on observed data for many houses over a period of time.
The data might track the age of the house, square footage, number of
rooms, taxes, school district, proximity to shopping centers, and so on.
House value would be the target, the other attributes would be the
predictors, and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data. These relationships between predictors and target are summarized in
a model, which can then be applied to a different data set in which the
target values are unknown

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression in general Linear regression in Economics

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Prediction Techniques Linear Regression


• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Generalized linear model
• Poisson regression
• Log-linear models
• Regression trees

November 11, 2025 Data Mining Data Mining


11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is Linear Regression analysis

• Regression analysis seeks to determine the values of parameters for


a function that cause the function to best fit a set of data
observations that you provide.

• Regression is the process of estimating the value of a continuous


target (y) as a function (F) of one or more predictors (x1 , x2 , ..., xn),
a set of parameters (w1 , w2 , ..., wn), and a measure of error (e).
y = F(x,w) + e

• The predictors can be understood as independent variables (x1 , x2 ,


..., xn) and the target as a dependent variable (Y).
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Regression analysis Simple Linear Regression

• The error, also called the residual, is the difference between


the expected and predicted value of the dependent variable.

• The process of training a regression model involves finding the


parameter values that minimize a measure of the error, the
mean of squared errors (MSE)or mean of absolute errors
(MAE).

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Which line “fits best” Linear Regression Model

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Least Squares Estimating Model Parameter

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Coefficients Coefficient Equations

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example -1 Solution
Example -1

Parameter Estimation Parameter Estimation

(10, 0.5) and (25, 1.20) are actual purchase amount and tax
amount pairs. Predict the amount of the tax on the purchase
of amount 55 using linear regression. (10, 0.5) and (25, 1.20) are actual purchase
amount and tax amount pairs. Predict the
amount of the tax on the purchase of amount
55 using linear regression.

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -2 Example -2

Parameter Estimation Parameter Estimation


X - Hours Spent Studying
Hours Spent Exam Y – Exam Score
Studying Scores
4 60
1 10
10 64
14 75
4 50
7 70
22 95
3 27
8 49

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example -2 Example -2

Prediction Error Parameter Estimation

Slope (𝜷𝟏)
• Exam score (Y) is expected to increases by 3.3 units for each 1 hour increase of study hours (X)

Intercept (𝜷𝟎)
• Exam score (Y) is 28.6 when you study 0 hours

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example -3 Example -3

Parameter Estimation – Self Practice Parameter Estimation – Self Practice (Solution)


x 2 4 6 8
y 3 7 5 10

Construct the table and find the value

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Nonlinear Regression Nonlinear Regression


• Often the relationship between x and y • Some nonlinear models can be modeled by a polynomial function
cannot be approximated with a straight
line. In this case, a nonlinear regression
• A polynomial regression model can be transformed into linear regression model.
technique may be used. Alternatively, For example,
the data could be preprocessed to y = w0 + w1 x + w2 x2 + w3 x3
make the relationship linear. convertible to linear with new variables: x2 = x2, x3= x3
y = w0 + w1 x + w2 x2 + w3 x3
• Nonlinear regression models define y
as a function of x using an equation • Other functions, such as power function, can also be transformed to linear model
that is more complicated than the
• Some models are intractable nonlinear (e.g., sum of exponential terms)
linear regression equation
• possible to obtain least square estimates through extensive calculation on
more complex formulae

Data Mining November 11, 2025 Data Mining


33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Trees and Model Trees Multiple Linear Regression
• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that reach the leaf

• Model tree: proposed by Quinlan (1992)


• Each leaf holds a regression model—a multivariate linear equation for the predicted attribute
• A more general case than regression tree

• Regression and model trees tend to be more accurate than linear regression when the data are
not represented well by a simple linear model

November 11, 2025 Data Mining Data Mining


34
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Multiple Linear Regression Multiple Linear Regression - Example

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Regression Models

THANK YOU!

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Data Mining
Association Analysis Concepts
BITS Pilani and Apriori Algorithm
Pilani|Dubai|Goa|Hyderabad M6: Association Analysis and Apriori Algorithm

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Association Analysis Association Analysis Contd.

• Association analysis measures the strength of co- • Association algorithms are widely used in retail analysis
occurrence between one item and another. of transactions, recommendation engines, and online
clickstream analysis across web pages.
• The objective of this class of data mining algorithms is not
to predict an occurrence of an item, like classification or • One of the popular applications of this technique is called
regression do, but to find usable patterns in the co- market basket analysis, which finds co-occurrences of one
occurrences of the items. retail item with another item within the same retail
purchase transaction.
• Association rules learning is a branch of an unsupervised
learning process that discovers hidden patterns in data, in • retailer can take advantage of this association for bundle
the form of easily recognizable rules. pricing, product placement, and even shelf space
optimization within the store layout.

Data Mining Data Mining


November 11, 2025 November 11, 2025
3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Market Basket Analysis Association Rule Mining

• Given a set of transactions, find rules that will predict the


occurrence of an item based on the occurrences of other
items in the transaction
Market-Basket transactions
Example of Association Rules
TID Items
{Diaper} → {Butter},
1 Bread, Milk {Milk, Bread} → {Eggs, Coke},
2 Bread, Diaper, Beer, Eggs {Butter, Bread} → {Milk},
3 Milk, Diaper, Beer, Coke
4 Bread, Milk, Diaper, Beer Implication means co-occurrence,
5 Bread, Milk, Diaper, Coke not causality!

Data Mining Data Mining


November 11, 2025
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Frequent Itemset Definition: Association Rule
TID Items
• Itemset Association Rule 1 Bread, Milk
• A collection of one or more items – An implication expression of the 2 Bread, Diaper, Beer, Eggs
TID Items
• Example: {Milk, Bread, Diaper} form X → Y, where X and Y are item 3 Milk, Diaper, Beer, Coke
• k-itemset 1 Bread, Milk sets 4 Bread, Milk, Diaper, Beer
• An itemset that contains k items 2 Bread, Diaper, Beer, Eggs
– Example: 5 Bread, Milk, Diaper, Coke
3 Milk, Diaper, Beer, Coke
• Support count () {Milk, Diaper} → {Butter}
4 Bread, Milk, Diaper, Beer
• Frequency of occurrence of an itemset Rule Evaluation Metrics
5 Bread, Milk, Diaper, Coke Example:
• E.g. ({Milk, Bread, Diaper}) = 2 – Support (s)
• Support {Milk, Diaper}  Butter
◆ Fraction oftransactions that
• Fraction of transactions that contain an itemset contain both X and Y  (Milk, Diaper, Butter ) 2
s= = = 0.4
• E.g. s({Milk, Bread, Diaper}) = 2/5 |T| 5
– Confidence (c)
• Frequent Itemset
◆ Measures how often items in Y  (Milk, Diaper, Butter ) 2
• An itemset whose support is greater than or equal to a minsup c= = = 0.67
appear in transactions that  (Milk, Diaper ) 3
threshold
contain X
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Association Rule Mining Task Mining Association Rules


TID Items
Example of Rules:
✓Given a set of transactions T, the goal of association rule 1 Bread, Milk
mining is to find all rules having 2 Bread, Diaper, Beer, Eggs {Milk,Diaper} → {Butter} (s=0.4, c=0.67)
• support ≥ minsup threshold 3 Milk, Diaper, Beer, Coke {Milk,Butter} → {Diaper} (s=0.4, c=1.0)
{Diaper,Butter} → {Milk} (s=0.4, c=0.67)
• confidence ≥ minconf threshold 4 Bread, Milk, Diaper, Beer
{Butter} → {Milk,Diaper} (s=0.4, c=0.67)
5 Bread, Milk, Diaper, Coke
{Diaper} → {Milk,Butter} (s=0.4, c=0.5)
✓Brute-force approach: Observations: {Milk} → {Diaper,Butter} (s=0.4, c=0.5)
• List all possible association rules
• All the above rules are binary partitions of the same itemset:
• Compute the support and confidence for each rule {Milk, Diaper, Butter}
• Prune rules that do not satisfy the minsup and minconf
• Rules originating from the same itemset have identical support but
thresholds can have different confidence
 Computationally prohibitive!
• Thus, we may decouple the support and confidence requirements
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules Frequent Itemset Generation
null

• Two-step approach:
A B C D E
1. Frequent Itemset Generation
– Generate all itemsets whose support  minsup
AB AC AD AE BC BD BE CD CE DE

2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

partitioning of a frequent itemset


ABCD ABCE ABDE ACDE BCDE
Frequent itemset generation is still computationally Given d items, there
expensive are 2d possible
ABCDE candidate itemsets
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Frequent Itemset Generation Frequent Itemset Generation Strategies


• Brute-force approach:
• Each itemset in the lattice is a candidate frequent itemset ✓Reduce the number of candidates (M)
• Count the support of each candidate by scanning the database • Complete search: M=2d
• Match each transaction against every candidate • Use pruning techniques to reduce M
• Complexity ~ O(NMw) => Expensive since M = 2d !!!

Transactions List of ✓Reduce the number of transactions (N)


Candidates • Reduce size of N as the size of itemset increases
TID Items
• Used by DHP ( Direct Hashing and Pruning)and vertical-based
1 Bread, Milk
mining algorithms
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke M
4 Bread, Milk, Diaper, Beer ✓Reduce the number of comparisons (NM)
5 Bread, Milk, Diaper, Coke • Use efficient data structures to store the candidates or
w transactions
• No need to match every candidate against every transaction

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reducing Number of Candidates Illustrating Apriori Principle null

✓Apriori principle: A B C D E

• If an itemset is frequent, then all of its subsets must also be


AB AC AD AE BC BD BE CD CE DE
frequent
Found to be
✓Apriori principle holds due to the following property of the Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

support measure:
• Support of an itemset never exceeds the support of its ABCD ABCE ABDE ACDE BCDE

subsets
Pruned
• This is known as the anti-monotone property of support
ABCDE
supersets
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Illustrating Apriori Principle


Apriori Algorithm
Item Count Items (1-itemsets)
Bread 4 Method:
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Beer 3 {Bread,Milk} 3 1. Let k=1
Diaper 4
{Bread,Beer} 2 (No need to generate 2. Generate frequent itemsets of length 1
Eggs 1
{Bread,Diaper} 3 candidates involving Coke 3. Repeat until no new frequent itemsets are identified
{Milk,Beer} 2 or Eggs)
{Milk,Diaper} 3 i. Generate length (k+1) candidate itemsets from length k
{Beer,Diaper} 3 frequent itemsets
Minimum Support = 3 ii. Prune candidate itemsets containing subsets of length k that
Triplets (3-itemsets)
are infrequent
If every subset is considered, No 3-itemsets are frequent iii. Count the support of each candidate by scanning the DB
6C + 6C + 6C = 41
1 2 3 iv. Eliminate candidates that are infrequent, leaving only those
With support-based pruning, that are frequent
6 + 6 + 1 = 13

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Support Counting
Reducing Number of Comparisons
✓Candidate counting:
• Scan the database of transactions to determine the support of each candidate
itemset
• To reduce the number of comparisons, store the candidates in a hash
structure
• Instead of matching each transaction against every candidate, match it
against candidates contained in the hashed buckets
Transactions Hash Structure
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke k
4 Bread, Milk, Diaper, Beer
✓ Given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3.
5 Bread, Milk, Diaper, Coke ✓ It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5.
Buckets

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Support Counting Using a Hash Tree Support Counting Using a Hash Tree

✓In the Apriori, algorithm, candidate itemsets are partitioned


into different buckets and stored in a hash tree.

✓During support counting, itemsets contained in each


transaction are also hashed into their appropriate buckets.

✓That way, instead of comparing each itemset in the transaction


with every candidate itemset, it is matched only against
candidate itemsets that belong to the same bucket,

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hashing a transaction at the root node of a hash Generate Hash Tree
tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5
6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
In this example, 5
out of the 9 leaf nodes are • Hash function . Let h(p)= (p-1) mod 3
visited and 9 out of the 15
itemsets are compared • Max leaf size: max number of itemsets stored in a leaf node (if number of
against the transaction. candidate itemsets exceeds max leaf size, split the node)

Hash function 234


3,6,9 567
1,4,7 367
145 345 356
2,5,8 136 368
357
124 689
457 125 159
458
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Association Rule Discovery: Hash tree Association Rule Discovery: Hash tree Contd.

Hash Function Candidate Hash Tree Hash Function


Candidate Hash Tree

1,4,7 3,6,9 1,4,7 3,6,9

2,5,8 2,5,8

234
234
567
567
145 136
345 356 367 145 136
Hash on Hash on 345 356 367
357 368
1, 4 or 7 2, 5 or 8 357 368
124 159 689
125 124 159 689
457 125
458 457 458
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Discovery: Hash tree Contd. Subset Operation

Hash Function Candidate Hash Tree Given a transaction t, what are Transaction, t
the possible subsets of size 3?
1 2 3 5 6
1,4,7 3,6,9
Level 1
2,5,8
1 2 3 5 6 2 3 5 6 3 5 6
234
567 Level 2

145 136 12 3 5 6 13 5 6 15 6 23 5 6 25 6 35 6
345 356 367
Hash on
357 368
3, 6 or 9
124 159 689
125 123
135 235
457 125 156 256 356
458 126
136 236

Data Mining Level 3 Subsets


Data Mining of 3 items
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Subset Operation Using Hash Tree


Subset Operation Using Hash Tree
Hash Function Hash Function
1 2 3 5 6 transaction 1 2 3 5 6 transaction

1+ 2356 1+ 2356
2+ 356 1,4,7 3,6,9 2+ 356 1,4,7 3,6,9
12+ 356 2,5,8
2,5,8
3+ 56 3+ 56
13+ 56
234 234
15+ 6 567
567

145 136 145 136


345 356 367 345 356 367
357 368 357 368
689 124 159 689
124 125 159 125
457 457 458
458
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Subset Operation Using Hash Tree Subset Operation Using Hash Tree
Hash Function Hash Function
1 2 3 5 6 transaction 1 2 3 5 6 transaction

1+ 2356 1+ 2356
2+ 356 1,4,7 3,6,9 2+ 356 1,4,7 3,6,9
12+ 356 2,5,8
12+ 356 2,5,8
3+ 56 3+ 56
13+ 56 13+ 56
234 234
15+ 6 567 15+ 6 567

145 136 145 136


345 356 367 345 356 367
357 368 357 368
124 159 689 124 159 689
125 125
457 458 457 458
Match transaction against 11 out of 15 candidates

Factors Affecting Complexity Rule Generation

✓Choice of minimum support threshold


• lowering support threshold results in more frequent itemsets Procedure:
• this may increase number of candidates and max length of frequent
itemsets • For each frequent itemset “ l ”, generate all nonempty
✓Dimensionality (number of items) of the data set
• more space is needed to store support count of each item subsets of l.
• if number of frequent items also increases, both computation and I/O
costs may also increase • For every nonempty subset s of l, output the rule
✓Size of database
• since Apriori makes multiple passes, run time of algorithm may increase “s ->(l-s)” if
with number of transactions
✓Average transaction width supportcount(l) / supportcount(s) >= minconf where
• transaction width increases with denser data sets
• This may increase max length of frequent itemsets and traversals of minconf is minimum confidence threshold.
hash tree (number of subsets in a transaction increases with its width)

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation Contd. Rule Generation Contd.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
For the itemset { Bread, Milk} 4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Some Association Rules are:
i) Bread → Milk, conf= Supcount({Bread,

Milk})/Supcount{Bread}=(3/ 4)*100= 75%

ii) Milk → Bread, conf=Supcount({Bread,


Milk})/Supcount{Milk}= (3/ 4)*100= 75%

If the minconf threshold is 70%, then both of these two rules


are strong association rules.
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example-1

Examples
On
Apriori Algorithm

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-2 Example-2 Contd..
Step 1:
i) Create 1-Itemset candidates and calculate support for all the items.
ii) Perform pruning to create L1 Frequent Itemset. In pruning, we will filter
out all items with Support less than the min_supp value(30%=3).

Given Support Threshold (or min_supp) = 30% and Confidence


Threshold ( min_conf)= 70%, Drtermine the frequent itemsets and the
strong association rules using Apriori Algorithm.

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Step 3:
i) Create 3-Itemset candidates from L2 Frequent Itemset and calculate
support for all of them.
ii) Perform pruning to create L3 Frequent Itemset. But here if you see,
Support is less than min_supp value(30) for all the 3-itemsets so we cannot
go any forward and need to find Association Rules from L2 Frequent
Step 2: Itemset only.
i) Create 2-Itemset candidates from L1 Frequent
Itemset and calculate support for all of them.
ii) Perform Pruning to create L2 Frequent Itemset. As
before, we will again filter out all the Itemsets with
Support less than min_supp value(30%=3).

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-2 Contd.. Example-2 Contd..

We need to calculate the Confidence for all combinations of items in the L2


Perform pruning for again, this time to filter out all those association rules
Frequent Itemset.
that have Confidence(%) less than the min_conf(70%).

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example-3

Find the frequent


item sets on this.
Assume that
minimum support
(s = 3)
Thank You

Frequent Itemset (I) = {Coke, Chips}


Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Data Mining
M7: Association Analysis and Apriori Algorithm – Part 2

Dr. Rama Satish K V CS-07 : FP-Tree Techniques and Association


Guest Faculty, WILP- BITS, Rule Formation
BITS Pilani Pilani
Pilani|Dubai|Goa|Hyderabad

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books

Important Note to Students

➢ It is important to know that just login to the session does not


guarantee the attendance.

➢ Once you join the session, continue till the end to consider you as
present in the class.

➢ IMPORTANTLY, you need to make the class more interactive by


responding to Professors queries in the session.

➢ Answering to Polls / Quiz questions during the class are mandatory to


get attendance

➢ Attend the Class Quiz at the end of the session

➢ Whenever Professor calls your number / name, you need to


respond, otherwise it will be considered as ABSENT

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus


Session Agenda Mining Association Rules
• Recap of CS06 • Two-step approach:
• Mining Association Rules 1. Frequent Itemset Generation
• Bottleneck of Frequent-pattern Mining – Generate all itemsets whose support  minsup
• Mining Frequent Patterns Without Candidate Generation
• Frequent Patterns(FP)-growth Algorithm 2. Rule Generation
• From Conditional Pattern-bases to Conditional FP-trees – Generate high confidence rules from each
• Association Rule Mining frequent itemset, where each rule is a binary
• Rule Generation for Apriori Algorithm partitioning of a frequent itemset
• Effect of Support Distribution
• Multiple Minimum Support, Pattern Evaluation • Frequent item set generation is computationally
expensive

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Mining Frequent Patterns


Bottleneck of Frequent-pattern Mining Without Candidate Generation
• Multiple database scans are costly • Grow long patterns from short ones using local
• Mining long patterns needs many passes of scanning frequent items
and generates lots of candidates
• To find frequent itemset i1i2…i100 • “abc” is a frequent pattern
• # of scans: 100 • Get all transactions having “abc”: DB|abc
• # of Candidates: (1001) + (1002) + … + (110000) = 2100-1
• “d” is a local frequent item in DB|abc → abcd is a
= 1.27*1030 !
frequent pattern
• Bottleneck: candidate-generation-and-test
• Can we avoid candidate generation?

Data Mining Data Mining


November 11, 2025 November 11, 2025
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Patterns(FP)-growth Algorithm Construct FP-tree from a Transaction Database

• Use a compressed representation of the database using min_support = 3 F-list=f-c-a-b-m-p


an FP-tree
• Once an FP-tree has been constructed, it uses a TID Items bought (ordered) frequent items
recursive divide-and-conquer approach to mine the 100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
frequent itemsets 200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
[Link] DB once, find frequent 1-itemset (single item 400 {b, c, k, s, p} {c, b, p}
pattern)
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
[Link] frequent items in frequency descending order,
f-list
[Link] DB again, construct FP-tree f=4 a=3 c=4 d=1
e=1 g=1 h=1 i=1
j=1 k=1 l=1 m=3
n=1 o=2 p=3 s=1
Data Mining November 11, 2025 Data Mining 10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Construct FP-tree from a Transaction Database Find Patterns Having P From P-conditional Database
{}
Header Table • Starting at the frequent item header table in the FP-tree
min_support = 3 • Traverse the FP-tree by following the link of each frequent item p
Item frequency head f:4 c:1
F-list=f-c-a-b-m-p f 4 • Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
c 4 c:3 b:1 b:1
a 3 {}
b 3 a:3 p:1 Header Table
m 3 Conditional pattern bases
p 3 Item frequency head f:4 c:1
m:2 b:1 item cond. pattern base
f 4
c 4 c:3 b:1 b:1 c f:3
TID Items bought (ordered) frequent items p:2 m:1 a 3 a fc:3
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} b 3 a:3 p:1
200 {a, b, c, f, l, m, o} {f, c, a, b, m} m 3 b fca:1, f:1, c:1
300 {b, f, h, j, o, w} {f, b} p 3 m:2 b:1 m fca:2, fcab:1
400 {b, c, k, s, p} {c, b, p}
p fcam:2, cb:1
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} p:2 m:1
12
Data Mining Data Mining
November 11, 2025
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 November 11, 2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Benefits of the FP-tree Structure
Partition Patterns and Databases
• Completeness • Frequent patterns can be partitioned into subsets
• Preserve complete information for frequent pattern according to f-list
mining • F-list=f-c-a-b-m-p
• Never break a long pattern of any transaction • Patterns containing p
• Compactness • Patterns having m but no p
• Reduce irrelevant info—infrequent items are gone
•…
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared • Patterns having c but no a nor b, m, p
• Never be larger than the original database (not • Pattern f
count node-links and the count field)
• Completeness and non-redundency

Data Mining 13 Data Mining


November 11, 2025
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

From Conditional Pattern-bases to Conditional FP-trees Recursion: Mining Each Conditional FP-tree
• For each pattern-base {}
• Accumulate the count for each item in the base Cond. pattern base of “am”: (fc:3) f:3
{}
• Construct the FP-tree for the frequent items of the
pattern base c:3
f:3
am-conditional FP-tree
m-conditional pattern base:
{} c:3 {}
fca:2, fcab:1
Header Table
Item frequency head All frequent a:3 Cond. pattern base of “cm”: (f:3)
f:4 c:1 patterns relate to m f:3
f 4 {} m-conditional FP-tree
c 4 c:3 b:1 b:1 m, cm-conditional FP-tree

a 3  fm, cm, am,
f:3
b 3 a:3 p:1 fcm, fam, cam, {}
m 3 c:3 fcam
p 3 m:2 b:1 Cond. pattern base of “cam”: (f:3)
f:3
p:2 m:1 a:3
m-conditional FP-tree cam-conditional FP-tree
15
Data Mining Data Mining
November 11, 2025
November 11, 2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Frequent Patterns With FP-trees
A Special Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single
• Idea: Frequent pattern growth
prefix-path P
• Recursively grow frequent patterns by pattern and
• Mining can be decomposed into two parts database partition
{} • Reduction of the single prefix path into one node • Method
• Concatenation of the mining results of the two parts • For each frequent item, construct its conditional
a1:n1
pattern-base, and then its conditional FP-tree
a2:n2 • Repeat the process on each newly created conditional
a3:n3 {} r1 FP-tree
• Until the resulting FP-tree is empty, or it contains
a1:n1 only one path—single path will generate all the
b1:m1 C1:k1  r1 =
a2:n2
+ b1:m1 C1:k1
combinations of its sub-paths, each of which is a
frequent pattern
C2:k2 C3:k3 a3:n3 C2:k2 C3:k3
17
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Why Is FP-Growth the Winner? Association Rule Mining

• Divide-and-conquer: • Given a set of transactions, find rules that will predict


• decompose both the mining task and DB according the occurrence of an item based on the occurrences of
to the frequent patterns obtained so far other items in the transaction
• leads to focused search of smaller databases
Market-Basket transactions Example of Association Rules
• Other factors
TID Items {Diaper} → {Butter},
• no candidate generation, no candidate test 1 Bread, Milk {Milk, Bread} → {Eggs, Coke},
• compressed database: FP-tree structure 2 Bread, Diaper, Butter, Eggs {Butter, Bread} → {Milk},
3 Milk, Diaper, Butter, Coke
• no repeated scan of entire database 4 Bread, Milk, Diaper, Butter
• basic ops—counting local freq items and building 5 Bread, Milk, Diaper, Coke
sub FP-tree, no pattern search and matching

Data Mining Data Mining

November 11, 2025 19


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Association Rule Mining Association Rules
TID Items
Association Rule 1 Bread, Milk
– An implication expression of the form 2 Bread, Diaper, Butter, Eggs
• Two-step approach:
X → Y, where X and Y are itemsets 3 Milk, Diaper, Butter, Coke 1. Frequent Itemset Generation
– Example: 4 Bread, Milk, Diaper, Butter – Generate all itemsets whose support 
{Milk, Diaper} → {Bread} 5 Bread, Milk, Diaper, Coke
minsup
Rule Evaluation Metrics
Example: 2. Rule Generation
– Support (s) BREAD – Generate high confidence rules from each
◆ Fraction of transactions that contain {Milk, Diaper}  Butter
both X and Y BREAD
frequent itemset, where each rule is a
– Confidence (c)  (Milk, Diaper, Butter ) 2 binary partitioning of a frequent itemset
s= = = 0.4
|T| 5
◆ Measures how often items in Y
appear in transactions that BREAD
 (Milk, Diaper, Butter ) 2 • Frequent itemset generation is still
contain X c= = = 0.67
 (Milk, Diaper ) 3
computationally expensive
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Rule Generation Rule Generation

• Given a frequent itemset L, find all non-empty subsets • How to efficiently generate rules from frequent
f  L such that f → L – f satisfies the minimum itemsets?
confidence requirement • In general, confidence does not have an anti-
• If {A,B,C,D} is a frequent itemset, candidate rules: monotone property
ABC →D, ABD →C, ACD →B, BCD →A, c(ABC →D) can be larger or smaller than c(AB →D
A →BCD, B →ACD, C →ABD, D →ABC • But confidence of rules generated from the same
AB →CD, AC → BD, AD → BC, BC →AD, itemset has an anti-monotone property
BD →AC, CD →AB,
• e.g., L = {A,B,C,D}:
c(ABC → D)  c(AB → CD)  c(A → BCD)
• If |L| = k, then there are 2k – 2 candidate association
rules (ignoring L →  and  → L)
• Confidence is anti-monotone w.r.t. number of items on
the RHS of the rule
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation for Apriori Algorithm Rule Generation with Apriori Algorithm
Lattice of rules
Low
ABCD=>{ }
• Candidate rule is generated by merging two rules that
Confidence share the same prefix
Rule
BCD=>A ACD=>B ABD=>C ABC=>D
in the rule consequent CD=>AB BD=>AC

• join(CD=>AB,BD=>AC)
would produce the candidate
CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD rule D => ABC

• Prune rule D=>ABC if its D=>ABC


subset AD=>BC does not have
D=>ABC C=>ABD B=>ACD A=>BCD high confidence
Pruned
Rules
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Effect of Support Distribution Effect of Support Distribution

• Many real data sets have skewed support distribution


• How to set the appropriate minsup threshold?
• If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)

Support • If minsup is set too low, it is computationally


distribution of a expensive and the number of itemsets is very large
retail data set

• Using a single minimum support threshold may not be


effective

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Minimum Support Pattern Evaluation
• How to apply multiple minimum supports? • Association rule algorithms tend to produce too many
• MS(i): minimum support for item i rules
• e.g.: MS(Milk)=5%, MS(Coke) = 3%, • many of them are uninteresting or redundant
MS(Broccoli)=0.1%, MS(Salmon)=0.5% • Redundant if {A,B,C} → {D} and {A,B} → {D}
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) have same support & confidence
= 0.1%
• Interestingness measures can be used to prune/rank
• Challenge: Support is no longer anti-monotone the derived patterns
• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%
• In the original formulation of association rules,
support & confidence are the only measures used
• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli}
is frequent
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Computing Interestingness Measure Drawback of Confidence

• Given a rule X → Y, information needed to compute rule


interestingness can be obtained from a contingency table Coffee Coffee
Contingency table for X → Y Tea 15 5 20
Y Y f11: support of X and Y Tea 75 5 80
X f11 f10 f1+ f10: support of X and Y 90 10 100
X f01 f00 fo+ f01: support of X and Y
f+1 f+0 |T| f00: support of X and Y Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 0.75


Used to define various measures
but P(Coffee) = 0.9
support, confidence, lift, Gini,  Although confidence is high, rule is misleading
J-measure, etc.
 P(Coffee|Tea) = 0.9375
Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Statistical Independence Statistical-based Measures

• Population of 1000 students • Measures that take into account statistical


• 600 students know how to swim (S) dependence
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)
P (Y | X )
Lift =
• P(SB) = 420/1000 = 0.42 P (Y )
• P(S)  P(B) = 0.6  0.7 = 0.42
P( X , Y )
• P(SB) = P(S)  P(B) => Statistical independence Interest =
• P(SB) > P(S)  P(B) => Positively correlated P ( X ) P (Y )
• P(SB) < P(S)  P(B) => Negatively correlated

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Drawback of Lift & Interest


Example: Lift/Interest

Y Y Y Y
Coffee Coffee X 10 0 10 X 90 0 90
Tea 15 5 20 X 0 90 90 X 0 10 10
Tea 75 5 80 10 90 100 90 10 100
90 10 100

0.1 0.9
Association Rule: Tea → Coffee Lift = = 10 Lift = = 1.11
(0.1)(0.1) (0.9)(0.9)
Confidence= P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9 Statistical independence:

 Lift = 0.75/0.9= 0.8333 (< 1, therefore is negatively associated) If P(X,Y)=P(X)P(Y) => Lift = 1

Data Mining Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Subjective Interestingness Measure Interestingness via Unexpectedness

• Objective measure: • Need to model expectation of users (domain knowledge)


• Rank patterns based on statistics computed from data
• e.g., 21 measures of association (support, confidence, + Pattern expected to be frequent

Laplace, Gini, mutual information, Jaccard, etc). - Pattern expected to be infrequent


Pattern found to be frequent

• Subjective measure: Pattern found to be infrequent

• Rank patterns according to user’s interpretation


• A pattern is subjectively interesting if it contradicts + - Expected Patterns

the - + Unexpected Patterns


expectation of a user
• A pattern is subjectively interesting if it is
actionable • Need to combine expectation of users with evidence
Data Mining
from data (i.e., extracted patterns)
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Important note to self


Class Summary

• Recap of CS06 STOP RECORDING


• Mining Association Rules
• Bottleneck of Frequent-pattern Mining
• Mining Frequent Patterns Without Candidate Generation
• Frequent Patterns(FP)-growth Algorithm
• From Conditional Pattern-bases to Conditional FP-trees
• Association Rule Mining
• Rule Generation for Apriori Algorithm
• Effect of Support Distribution Thank You!
• Multiple Minimum Support, Pattern Evaluation
satishkvr@[Link]
Next Class
Revision CS01 – CS07
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Data Mining
SSWT ZC 425

Data Mining
SSWT ZC 425
BITS Pilani M1: Introduction to Data Mining ◼ M1: Introduction to Data Mining
Pilani|Dubai|Goa|Hyderabad

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books

Text Book(s)
◼ T1 :Tan P. N., Steinbach M & Kumar V. “Introduction to
Data Mining” Pearson Education, 2014

Data Mining: [Link]


Ning_Tan_Michael_Steinbach_Vipin_Kumar_-
Concepts and Techniques _Introduction_to_Data_Mining-Pe_NRDK4fi.pdf

◼ T2: Data Mining: Concepts and Techniques, by Jiawei


Han and Micheline Kamber Morgan Kaufmann
Publishers, 2012
[Link]
3 November 11, 2025 Data Mining: Concepts and Techniques 4
Reference Book(s) & other
resources SSWT ZC 425 Data Mining
◼ R1: Predictive Analytics and Data Mining: Concepts
and Practice with RapidMiner by Vijay Kotu and Bala
Deshpande Morgan Kaufmann Publishers © 2015

◼ R2: Practical Text Mining and Statistical Analysis for WASE 2022_SSWT ZC 425_ Data Mining
Non-structured Text Data Applications by Gary Miner Handout
et al. Academic Press © 2012

◼ R3:Recommender Systems for Learning by Nikos


Manouselis, Hendrik Drachsler, KatrienVerbert and Erik
Duval Springer © 2013
November 11, 2025 Data Mining: Concepts and Techniques 5 November 11, 2025 Data Mining: Concepts and Techniques 6

Data Mining Tools Chapter 1. Introduction


◼ MonkeyLearn | No-code text mining tools ◼ Why Data Mining?
◼ RapidMiner | Drag and drop workflows or data mining in
Python ◼ What Is Data Mining?
◼ Oracle Data Mining | Predictive data mining models ◼ A Multi-Dimensional View of Data Mining
◼ IBM SPSS Modeler | A predictive analytics platform for data
scientists ◼ What Kind of Data Can Be Mined?
◼ Weka | Open-source software for data mining ◼ What Kinds of Patterns Can Be Mined?
◼ Knime | Pre-built components for data mining projects
◼ H2O | Open-source library offering data mining in Python ◼ What Technology Are Used?
◼ Orange | Open-source data mining toolbox ◼ What Kind of Applications Are Targeted?
◼ Apache Mahout | Ideal for complex and large-scale data
mining ◼ Major Issues in Data Mining
◼ SAS Enterprise Miner | Solve business problems with data ◼ A Brief History of Data Mining and Data Mining Society
mining
◼ Summary
November 11, 2025 Data Mining: Concepts and Techniques 7 8
Why Data Mining? Evolution of Sciences
◼ Before 1600, empirical science
◼ The Explosive Growth of Data: from terabytes to petabytes
◼ 1600-1950s, theoretical science
◼ Data collection and data availability ◼ Each discipline has grown a theoretical component. Theoretical models often
motivate experiments and generalize our understanding.
◼ Automated data collection tools, database systems, Web,
◼ 1950s-1990s, computational science
computerized society ◼ Over the last 50 years, most disciplines have grown a third, computational branch
(e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
◼ Major sources of abundant data
◼ Computational Science traditionally meant simulation. It grew out of our inability to
◼ Business: Web, e-commerce, transactions, stocks, … find closed-form solutions for complex mathematical models.
1990-now, data science
Science: Remote sensing, bioinformatics, scientific simulation, …


◼ The flood of data from new scientific instruments and simulations
◼ Society and everyone: news, digital cameras, YouTube ◼ The ability to economically store and manage petabytes of data online
◼ The Internet and computing Grid that makes all these archives universally accessible
◼ We are drowning in data, but starving for knowledge!
◼ Scientific info. management, acquisition, organization, query, and visualization tasks
◼ “Necessity is the mother of invention”—Data mining—Automated scale almost linearly with data volumes. Data mining is a major new challenge!
analysis of massive data sets
◼ Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science,
Comm. ACM, 45(11): 50-54, Nov. 2002
9 10

Evolution of Database Technology Chapter 1. Introduction


◼ 1960s:
◼ Why Data Mining?
◼ Data collection, database creation, IMS and network DBMS
◼ 1970s: ◼ What Is Data Mining?
◼ Relational data model, relational DBMS implementation ◼ A Multi-Dimensional View of Data Mining
◼ 1980s:
◼ What Kind of Data Can Be Mined?
◼ RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
◼ Application-oriented DBMS (spatial, scientific, engineering, etc.) ◼ What Kinds of Patterns Can Be Mined?
◼ 1990s:
◼ What Technology Are Used?
◼ Data mining, data warehousing, multimedia databases, and Web
databases ◼ What Kind of Applications Are Targeted?
◼ 2000s ◼ Major Issues in Data Mining
◼ Stream data management and mining
◼ Data mining and its applications ◼ A Brief History of Data Mining and Data Mining Society
◼ Web technology (XML, data integration) and global information systems ◼ Summary
11 12
What Is Data Mining? Knowledge Discovery (KDD) Process
◼ This is a view from typical
◼ Data mining (knowledge discovery from data) database systems and data
Pattern Evaluation
◼ Extraction of interesting (non-trivial, implicit, previously warehousing communities
unknown and potentially useful) patterns or knowledge from ◼ Data mining plays an essential
role in the knowledge discovery
huge amount of data process Data Mining
◼ Data mining: a misnomer?
Task-relevant Data
◼ Alternative names
◼ Knowledge discovery (mining) in databases (KDD), knowledge
Data Warehouse Selection
extraction, data/pattern analysis, data archeology, data
dredging, information harvesting, business intelligence, etc.
Data Cleaning
◼ Watch out: Is everything “data mining”?
◼ Simple search and query processing Data Integration
◼ (Deductive) expert systems
Databases
13 14

Example: A Web Mining Framework Data Mining in Business Intelligence

◼ Web mining usually involves Increasing potential


to support
business decisions End User
◼ Data cleaning Decision
◼ Data integration from multiple sources Making

◼ Warehousing the data Data Presentation Business


Analyst
Visualization Techniques
◼ Data cube construction
Data Mining Data
◼ Data selection for data mining Information Discovery Analyst

◼ Data mining Data Exploration


◼ Presentation of the mining results Statistical Summary, Querying, and Reporting

◼ Patterns and knowledge to be used or stored into Data Preprocessing/Integration, Data Warehouses
DBA
knowledge-base Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
15 16
KDD Process: A Typical View from ML and
Example: Mining vs. Data Exploration
Statistics
◼ Business intelligence view
◼ Warehouse, data cube, reporting but not much mining Input Data Data Pre- Data Post-
Processing Mining Processing
◼ Business objects vs. data mining tools
◼ Supply chain example: tools
◼ Data presentation
Data integration Pattern discovery Pattern evaluation
◼ Exploration Normalization Association & correlation Pattern selection
Feature selection Classification Pattern interpretation
Clustering
Dimension reduction Pattern visualization
Outlier analysis
…………

◼ This is a view from typical machine learning and statistics communities

17 18

Example: Medical Data Mining Chapter 1. Introduction


Why Data Mining?
Health care & medical data mining – often


◼ What Is Data Mining?
adopted such a view in statistics and machine
◼ A Multi-Dimensional View of Data Mining
learning
◼ What Kind of Data Can Be Mined?
◼ Preprocessing of the data (including feature
◼ What Kinds of Patterns Can Be Mined?
extraction and dimension reduction)
◼ What Technology Are Used?
◼ Classification or/and clustering processes
◼ What Kind of Applications Are Targeted?
◼ Post-processing for presentation ◼ Major Issues in Data Mining

◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
19 20
Multi-Dimensional View of Data Mining Chapter 1. Introduction
◼ Data to be mined
◼ Why Data Mining?
◼ Database data (extended-relational, object-oriented, heterogeneous,

legacy), data warehouse, transactional data, stream, spatiotemporal, ◼ What Is Data Mining?
time-series, sequence, text and web, multi-media, graphs & social
and information networks ◼ A Multi-Dimensional View of Data Mining
◼ Knowledge to be mined (or: Data mining functions) ◼ What Kind of Data Can Be Mined?
◼ Characterization, discrimination, association, classification,

clustering, trend/deviation, outlier analysis, etc. ◼ What Kinds of Patterns Can Be Mined?
◼ Descriptive vs. predictive data mining
◼ What Technology Are Used?
◼ Multiple/integrated functions and mining at multiple levels
◼ What Kind of Applications Are Targeted?
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, ◼ Major Issues in Data Mining
pattern recognition, visualization, high-performance, etc.
◼ A Brief History of Data Mining and Data Mining Society
◼ Adopted by Applications
◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, ◼ Summary
stock market analysis, text mining, Web mining, etc.
21 22

Data Mining: On What Kinds of Data? Chapter 1. Introduction


◼ Database-oriented data sets and applications ◼ Why Data Mining?
◼ Relational database, data warehouse, transactional database
◼ What Is Data Mining?
◼ Advanced data sets and advanced applications
◼ Data streams and sensor data
◼ A Multi-Dimensional View of Data Mining

◼ Time-series data, temporal data, sequence data (incl. bio-sequences) ◼ What Kind of Data Can Be Mined?
◼ Structure data, graphs, social networks and multi-linked data ◼ What Kinds of Patterns Can Be Mined?
◼ Object-relational databases
◼ What Technology Are Used?
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data ◼ What Kind of Applications Are Targeted?

◼ Multimedia database ◼ Major Issues in Data Mining


◼ Text databases ◼ A Brief History of Data Mining and Data Mining Society
◼ The World-Wide Web
◼ Summary
23 24
Data Mining Function: (2) Association and
Data Mining Function: (1) Generalization
Correlation Analysis
◼ Information integration and data warehouse construction ◼ Frequent patterns (or frequent itemsets)
◼ Data cleaning, transformation, integration, and ◼ What items are frequently purchased together in your
multidimensional data model Walmart?
◼ Data cube technology ◼ Association, correlation vs. causality
◼ Scalable methods for computing (i.e., materializing) ◼ A typical association rule
multidimensional aggregates ◼ Diaper → Beer [0.5%, 75%] (support, confidence)
◼ OLAP (online analytical processing) ◼ Are strongly associated items also strongly correlated?
◼ Multidimensional concept description: Characterization ◼ How to mine such patterns and rules efficiently in large
and discrimination datasets?
◼ Generalize, summarize, and contrast data ◼ How to use such patterns for classification, clustering,
characteristics, e.g., dry vs. wet region and other applications?
25 26

Data Mining Function: (3) Classification Data Mining Function: (4) Cluster Analysis

◼ Classification and label prediction ◼ Unsupervised learning (i.e., Class label is unknown)
◼ Construct models (functions) based on some training examples ◼ Group data to form new categories (i.e., clusters), e.g.,
◼ Describe and distinguish classes or concepts for future prediction cluster houses to find distribution patterns
◼ E.g., classify countries based on (climate), or classify cars
◼ Principle: Maximizing intra-class similarity & minimizing
based on (gas mileage)
interclass similarity
◼ Predict some unknown class labels
◼ Typical methods ◼ Many methods and applications
◼ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
◼ Typical applications:
◼ Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …

27 28
Time and Ordering: Sequential Pattern,
Data Mining Function: (5) Outlier Analysis
Trend and Evolution Analysis
◼ Outlier analysis
◼ Sequence, trend and evolution analysis
◼ Outlier: A data object that does not comply with the general
◼ Trend, time-series, and deviation analysis: e.g.,
behavior of the data
regression and value prediction
◼ Noise or exception? ― One person’s garbage could be another
◼ Sequential pattern mining
person’s treasure
◼ Methods: by product of clustering or regression analysis, … ◼ e.g., first buy digital camera, then buy large SD

◼ Useful in fraud detection, rare events analysis memory cards


◼ Periodicity analysis

◼ Motifs and biological sequence analysis

◼ Approximate and consecutive motifs

◼ Similarity-based analysis

◼ Mining data streams


◼ Ordered, time-varying, potentially infinite, data streams

29 30

Structure and Network Analysis Evaluation of Knowledge


◼ Graph mining ◼ Are all mined knowledge interesting?
◼ Finding frequent subgraphs (e.g., chemical compounds), trees
◼ One can mine tremendous amount of “patterns” and knowledge
(XML), substructures (web fragments)
◼ Information network analysis ◼ Some may fit only certain dimension space (time, location, …)
◼ Social networks: actors (objects, nodes) and relationships (edges) ◼ Some may not be representative, may be transient, …
◼ e.g., author networks in CS, terrorist networks
◼ Evaluation of mined knowledge → directly mine only
◼ Multiple heterogeneous networks
interesting knowledge?
◼ A person could be multiple information networks: friends,

family, classmates, … ◼ Descriptive vs. predictive


◼ Links carry a lot of semantic information: Link mining ◼ Coverage
◼ Web mining ◼ Typicality vs. novelty
◼ Web is a big information network: from PageRank to Google
◼ Accuracy
◼ Analysis of Web information networks
◼ Timeliness
◼ Web community discovery, opinion mining, usage mining, …
◼ …
31 32
Chapter 1. Introduction Data Mining: Confluence of Multiple Disciplines

◼ Why Data Mining?


Machine Pattern Statistics
◼ What Is Data Mining? Learning Recognition
◼ A Multi-Dimensional View of Data Mining

◼ What Kind of Data Can Be Mined?

◼ What Kinds of Patterns Can Be Mined? Applications Data Mining Visualization


◼ What Technology Are Used?

◼ What Kind of Applications Are Targeted?


Algorithm Database High-Performance
◼ Major Issues in Data Mining
Technology Computing
◼ A Brief History of Data Mining and Data Mining Society

◼ Summary
33 34

Why Confluence of Multiple Disciplines? Chapter 1. Introduction


◼ Tremendous amount of data ◼ Why Data Mining?
◼ Algorithms must be highly scalable to handle such as tera-bytes of
data ◼ What Is Data Mining?

◼ High-dimensionality of data ◼ A Multi-Dimensional View of Data Mining


◼ Micro-array may have tens of thousands of dimensions ◼ What Kind of Data Can Be Mined?
◼ High complexity of data
◼ What Kinds of Patterns Can Be Mined?
◼ Data streams and sensor data
◼ Time-series data, temporal data, sequence data ◼ What Technology Are Used?
◼ Structure data, graphs, social networks and multi-linked data ◼ What Kind of Applications Are Targeted?
◼ Heterogeneous databases and legacy databases
◼ Spatial, spatiotemporal, multimedia, text and Web data ◼ Major Issues in Data Mining
◼ Software programs, scientific simulations ◼ A Brief History of Data Mining and Data Mining Society
◼ New and sophisticated applications
◼ Summary
35 36
Applications of Data Mining Chapter 1. Introduction
◼ Web page analysis: from web page classification, clustering to ◼ Why Data Mining?
PageRank & HITS algorithms
◼ What Is Data Mining?
◼ Collaborative analysis & recommender systems
◼ A Multi-Dimensional View of Data Mining
◼ Basket data analysis to targeted marketing
◼ Biological and medical data analysis: classification, cluster analysis ◼ What Kind of Data Can Be Mined?
(microarray data analysis), biological sequence analysis, biological ◼ What Kinds of Patterns Can Be Mined?
network analysis
◼ What Technology Are Used?
◼ Data mining and software engineering (e.g., IEEE Computer, Aug.
2009 issue) ◼ What Kind of Applications Are Targeted?
◼ From major dedicated data mining systems/tools (e.g., SAS, MS SQL- ◼ Major Issues in Data Mining
Server Analysis Manager, Oracle Data Mining Tools) to invisible data
◼ A Brief History of Data Mining and Data Mining Society
mining
◼ Summary
37 38

Major Issues in Data Mining (1) Major Issues in Data Mining (2)

◼ Mining Methodology ◼ Efficiency and Scalability


◼ Mining various and new kinds of knowledge ◼ Efficiency and scalability of data mining algorithms
◼ Mining knowledge in multi-dimensional space ◼ Parallel, distributed, stream, and incremental mining methods
◼ Data mining: An interdisciplinary effort ◼ Diversity of data types
◼ Boosting the power of discovery in a networked environment ◼ Handling complex types of data
◼ Handling noise, uncertainty, and incompleteness of data ◼ Mining dynamic, networked, and global data repositories
◼ Pattern evaluation and pattern- or constraint-guided mining ◼ Data mining and society
◼ User Interaction ◼ Social impacts of data mining
◼ Interactive mining ◼ Privacy-preserving data mining
◼ Incorporation of background knowledge ◼ Invisible data mining
◼ Presentation and visualization of data mining results

39 40
Chapter 1. Introduction A Brief History of Data Mining Society

◼ Why Data Mining? ◼ 1989 IJCAI Workshop on Knowledge Discovery in Databases


◼ Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley,
◼ What Is Data Mining? 1991)
◼ A Multi-Dimensional View of Data Mining ◼ 1991-1994 Workshops on Knowledge Discovery in Databases
◼ Advances in Knowledge Discovery and Data Mining (U. Fayyad, G.
◼ What Kind of Data Can Be Mined?
Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
◼ What Kinds of Patterns Can Be Mined? ◼ 1995-1998 International Conferences on Knowledge Discovery in Databases
and Data Mining (KDD’95-98)
◼ What Technology Are Used?
◼ Journal of Data Mining and Knowledge Discovery (1997)
◼ What Kind of Applications Are Targeted? ◼ ACM SIGKDD conferences since 1998 and SIGKDD Explorations
◼ Major Issues in Data Mining ◼ More conferences on data mining
◼ PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM
◼ A Brief History of Data Mining and Data Mining Society
(2001), etc.
◼ Summary ◼ ACM Transactions on KDD starting in 2007
41 42

Conferences and Journals on Data Mining Where to Find References? DBLP, CiteSeer, Google

◼ KDD Conferences ◼ Data mining and KDD (SIGKDD: CDROM)


◼ Other related conferences ◼ Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
◼ ACM SIGKDD Int. Conf. on DB conferences: ACM SIGMOD,
◼ ◼ Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
Knowledge Discovery in
VLDB, ICDE, EDBT, ICDT, … ◼ Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
Databases and Data Mining (KDD)
◼ Web and IR conferences: WWW, ◼ Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
◼ SIAM Data Mining Conf. (SDM) ◼ Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
SIGIR, WSDM
◼ (IEEE) Int. Conf. on Data Mining ◼ AI & Machine Learning
(ICDM) ◼ ML conferences: ICML, NIPS
◼ Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
◼ European Conf. on Machine ◼ PR conferences: CVPR, ◼ Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems,
Learning and Principles and IEEE-PAMI, etc.
◼ Journals
practices of Knowledge Discovery ◼ Web and IR
◼ Data Mining and Knowledge
and Data Mining (ECML-PKDD) ◼ Conferences: SIGIR, WWW, CIKM, etc.
Discovery (DAMI or DMKD) ◼ Journals: WWW: Internet and Web Information Systems,
◼ Pacific-Asia Conf. on Knowledge
Discovery and Data Mining ◼ IEEE Trans. On Knowledge and ◼ Statistics
(PAKDD) Data Eng. (TKDE) ◼ Conferences: Joint Stat. Meeting, etc.
◼ Journals: Annals of statistics, etc.
◼ Int. Conf. on Web Search and ◼ KDD Explorations
Data Mining (WSDM) ◼ Visualization
◼ ACM Trans. on KDD
◼ Conference proceedings: CHI, ACM-SIGGraph, etc.
◼ Journals: IEEE Trans. visualization and computer graphics, etc.
43 44
Chapter 1. Introduction Summary

◼ Why Data Mining? ◼ Data mining: Discovering interesting patterns and knowledge from
massive amount of data
◼ What Is Data Mining?
◼ A natural evolution of database technology, in great demand, with
◼ A Multi-Dimensional View of Data Mining wide applications
◼ What Kind of Data Can Be Mined? ◼ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
◼ What Kinds of Patterns Can Be Mined?
knowledge presentation
◼ What Technology Are Used? ◼ Mining can be performed in a variety of data
◼ What Kind of Applications Are Targeted? ◼ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼ Major Issues in Data Mining
◼ Data mining technologies and applications
◼ A Brief History of Data Mining and Data Mining Society
◼ Major issues in data mining
◼ Summary
45 46

Recommended Reference Books


◼ S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan
Kaufmann, 2002
◼ R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
◼ T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
◼ U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and
Data Mining. AAAI/MIT Press, 1996
◼ U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge


Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
Queries ??
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005

47 November 11, 2025 Data Mining: Concepts and Techniques 48


THANK YOU
Data Mining
BITS Pilani M4: Classification
Pilani|Dubai|Goa|Hyderabad

November 11, 2025 Data Mining: Concepts and Techniques 49

Classification: Definition
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Given a collection of records (training set )


Each record is by characterized by a tuple (x,y), where x is the attribute set and y is the
class label
x: attribute, predictor, independent variable, input
y: class, response, dependent variable, output

Task:
Learn a model that maps each attribute set x into one of the predefined class labels y
Classification Concepts

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Pilani Campus
Classification Classification vs. Prediction
• Classification involves dividing up objects so that each is assigned to one of a number of mutually • Classification
exhaustive and exclusive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
• predicts categorical class labels (discrete or nominal)
‒ customers who are likely to buy or not buy a particular product in a supermarket • classifies data (constructs a model) based on the training set and
‒ people who are at high, medium or low risk of acquiring a certain illness the values (class labels) in a classifying attribute and uses it in
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees classifying new data
‒ people who closely resemble, slightly resemble or do not resemble someone seen committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12 months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months • Prediction

• models continuous-valued functions, i.e., predicts unknown or
people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very unlikely).
missing values

November 11, 2025 Data Mining November 11,


Data
2025
Mining: Concepts and Techniques Data Mining
4 5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Supervised vs. Unsupervised Learning C l a s s i f i c a t i o n — A Tw o - S t e p P ro c e s s

• Supervised learning (classification) • Model construction: describing a set of predetermined classes


• Each tuple/sample is assumed to belong to a predefined class, as determined by the class
• Supervision: The training data (observations, measurements, etc.) are label attribute
accompanied by labels indicating the class of the observations • The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or mathematical formulae
• New data is classified based on the training set • Model usage: for classifying future or unknown objects
• Unsupervised learning (clustering) • Estimate accuracy of the model
• The known label of test sample is compared with the classified result from the model
• The class labels of training data is unknown • Accuracy rate is the percentage of test set samples that are correctly classified by the
model
• Given a set of measurements, observations, etc. with the aim of establishing
• Test set is independent of training set, otherwise over-fitting will occur
the existence of classes or clusters in the data • If the accuracy is acceptable, use the model to classify data tuples whose class labels are not
known

November 11, 2025 Data Mining November 11, 2025 Data Mining
6 7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Approach for Building
Classification Model Lazy vs. Eager Learning
• Lazy vs. eager learning
Tid
1
Attrib1
Yes
Attrib2
Large
Attrib3
125K
Class
No
Learning
algorithm
• Lazy learning (e.g., instance-based learning): Simply stores training data (or only
2

3
No

No
Medium

Small
100K

70K
No
No
minor processing) and waits until it is given a test tuple
4
5
Yes
No
Medium
Large
120K
95K
No
Yes
Induction • Eager learning (the methods mentioned in next slide): Given a set of training set,
6 No Medium 60K No constructs a classification model before receiving new (e.g., test) data to classify
7 Yes Large 220K No Learn
8
9
No
No
Small
Medium
85K
75K
Yes
No
Model • Lazy: less time in training but more time in predicting
10
10 No Small 90K Yes
Model • Accuracy
Training Set
Apply • Lazy method effectively uses a richer hypothesis space since it uses many local linear
Tid Attrib1 Attrib2 Attrib3 Class Model functions to form its implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the entire instance space
11 No Small 55K ?

12 Yes Medium 80K ?

13 Yes Large 110K ? Deduction


14 No Small 95K ?

15 No Large 67K ?
10

Test Set

Data Mining November 11, 2025 Data Mining 9


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

C l a s s i f i c a t i o n Te c h n i q u e s
Instance-Based Methods
• Typical approaches • Decision Tree based Methods
• k-nearest neighbor approach
• Rule-based Methods
• Instances represented as points in a Euclidean space.
• Locally weighted regression • Neural Networks
• Constructs local approximation - computational networks that simulate the decision process in neurons (networks of nerve cell)
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference • Naïve Bayes and Bayesian Belief Networks
- uses the probability theory to find the most likely of the possible classifications

• Support Vector Machines


- fits a boundary to a region of points that are all alike; uses the boundary to classify a new point

November 11, 2025 Data Mining 10 Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Representation Decision Tree –Vertebrate Data Set

• Flowchart-like tree structure.


• The topmost node in a tree is the root
node.
• Each internal node denotes a test on an
attribute.
• Each branch represents an outcome of
the test.
• Each leaf node holds a class label.

• Given a tuple for which the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple.

12 / 67 13 / 67

D ECISION T REE E XAMPLE D ECISION T REE E XAMPLE


L O A N B O R R O W E R C L A S S I F I C AT I O N P R O B L E M L O A N B O R R O W E R C L A S S I F I C AT I O N P R O B L E M
Start from the root of tree. Test Data
Splitting Attributes
Home Marital Annual Defaulted Home Marital Annual Defaulted
ID
Owner Status Income Borrower Owner Status Income Borrower
Home
No Married 80K ? 1 Yes Single 125K No Home
Yes Owner No Owner
10

2 No Married 100K No
Yes No
3 No Single 70K No
NO MarSt
4 Yes Married 120K No NO MarSt
Single, Divorced Married
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO Income NO
7 Yes Divorced 220K No
< 80K > 80K < 80K > 80K
8 No Single 85K Yes
9 No Married 75K No NO YES
NO YES
10 No Single 90K Yes
10

Training Data Model: Decision Tree


14 15
BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus
D ECISION T REE- MORE THAN ONE TREE How to Build a Decision Tree
Single, Divorced
MarSt Many Algorithms:
Home Marital Annual Defaulted
Married
• Hunt’s Algorithm (one of the earliest)
• CART
ID
Owner Status Income Borrower NO Home
1 Yes Single 125K No Yes Owner No
• ID3, C4.5
2 No Married 100K No NO Income • SLIQ,SPRINT
3 No Single 70K No < 80K > 80K
4 Yes Married 120K No
NO YES
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
There could be more than one tree that
9 No Married 75K No fits the same data!
10 No Single 90K Yes 16
10

BITS Pilani, Pilani Campus 17 / 67

General Structure of Hunt’s Algorithm D ECISION T REE C ONSTRUCTION - H U N T ’ S A LG


Home Marital Annual Defaulted
ID Home
Owner Status Income Borrower
Let Dt be the set of training records that are 1 Yes Single 125K No
Yes
Owner
No Home Marital Annual Defaulted
ID
2 No Married 100K No Owner Status Income Borrower
associated with a node t 3 No Single 70K No
Defaulted = No
Defaulted = No Defaulted = No 1 Yes Single 125K No
General Procedure: 4 Yes Married 120K No 2 No Married 100K No
5 No Divorced 95K Yes
If Dt contains records that belong the same 6 No Married 60K No
(a) (b) 3 No Single 70K No
4 Yes Married 120K No
class yt, 7 Yes Divorced 220K No
5 No Divorced 95K Yes

• then t is a leaf node labeled as yt


8 No Single 85K Yes
Home 6 No Married 60K No
9 No Married 75K No Owner
7 Yes Divorced 220K No
If Dt contains records that belong to more 10
10 No Single 90K Yes Home
Owner
Yes No
8 No Single 85K Yes
Defaulted = No
than one class, Yes No
Marital
Status 9 No Married 75K No

• use an attribute test to split the data


Single,
Marital Married 10 No Single 90K Yes
Defaulted = No Divorced 10

Status
Defaulted = No
into smaller subsets. Single,
Divorced
Married
Annual
Income

Recursively apply the procedure to each Defaulted = Yes Defaulted = No < 80K >= 80K

subset. Defaulted = No Defaulted = Yes

(c) (d)
18 / 67 BITS Pilani, Pilani Campus
D ESIGN I SSUES OF D ECISION T REE I NDUCTION T EST C ONDITION FOR N OMINAL ATTRIBUTES

Marital
• Splitting Criterion Status

Method for specifying test condition • Multi-way split:


Depends on attribute types-Binary, Nominal, Ordinal, Continuous • Use as many partitions as distinct values.
Measure for evaluating the goodness of a test condition • Binary split: Single Divorced Married
• Divides values into two subsets
• Stopping Criterion • Preserve order property among attribute
Stop splitting if all the records belong to the same class or have identical values Marital Marital Marital
Status Status Status

attribute values Shirt


OR OR

Size This grouping


Early termination violates order
{Married} {Single,
Divorced}
{Single} {Married,
Divorced}
{Single,
Married}
{Divorced}

property
{Small, {Medium,
Large} Extra Large}

20 / 67 21 / 67

T EST C ONDITION FOR C ONTINUOUS ATTRIBUTES H OW TO DETERMINE THE B EST S PLIT


Before Splitting:
10 records of class 0
10 records of class 1

Binary Decision :
Discretization :
• consider all possible splits and
• to form an ordinal categorical attribute Which test condition is the best?
finds the best cut
• can be more compute intensive
22 / 67 23 / 67
S ELECTING AN ATTRIBUTE T EST C ONDITION :
H OW TO DETERMINE THE B EST S PLIT I MPURITY M EASURE FOR A S INGLE N ODE
𝑐−1
Impurity is a measure of homogeneity of the labels at the node at hand
𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 = 1 − ෍ 𝑝𝑖 𝑡 2
Greedy approach: Where 𝒑𝒊 𝒕 is the frequency of
Nodes with purer class distribution are preferred 𝑖=0 class 𝒊 at node t, and 𝒄 is the total
Need a measure of node impurity: 𝑐−1
number of classes
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = − ෍ 𝑝𝑖 𝑡 𝑙𝑜𝑔2 𝑝𝑖 (𝑡)
C0: 5 C0: 9 𝑖=0
C1: 5 C1: 1
Measures seek to determine which variable would split the data to lead to the
High degree of impurity/ Relating it to Low degree of impurity / underlying child nodes being most homogenous or pure.
information theory: Relating it to information theory:
Low information content High information content In other words splitting on attribute should reduce impurity
e.g It rained heavily in Shillong yesterday e.g There was a heavy rainfall in Rajasthan last night.

24 / 67 25 / 67

S ELECTING AN ATTRIBUTE T EST C ONDITION : S ELECTING AN ATTRIBUTE T EST C ONDITION :


I MPURITY M EASURE FOR A S INGLE N ODE C OLLECTIVE I MPURITY OF C HILD N ODES
p(C1) = 0/6 = 0 and p(C2) = 6/6 = 1 Low entropy and When a node 𝑝 is split into 𝑘 partitions (children)
GINI index
Gini = 1 − (0)2 − (1)2 = 0 preferred
𝑘
Entropy = −(0) log2(0) − (1) log2(1) = 0 𝑛𝑖
:– low impurity 𝐺𝐼𝑁𝐼𝑠𝑝𝑙𝑖𝑡 = ෍ 𝐺𝐼𝑁𝐼(𝑖)
𝑛
𝑖=1
p(C1) = 1/6 and p(C2) = 5/6
Gini = 1 − (1/6)2 − (5/6)2 = 0.278 𝑘
𝑛𝑖
Entropy = −(1/6) log2(1/6) − (5/6) log2(5/6) = 0.650 𝐸𝑛𝑡𝑟𝑜𝑝𝑦𝑠𝑝𝑙𝑖𝑡 = ෍ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑖)
𝑛
𝑖=1
where,
p(C1) = 3/6 p(C2) = 3/6
Gini = 1 − (3/6)2 − (3/6)2 = 0.5 𝑛𝑖 = number of records at child 𝑖,
Entropy = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1 𝑛 = number of records at parent node 𝑝

26 / 67 27 / 67
S ELECTING AN ATTRIBUTE T EST C ONDITION :
F INDING THE B EST S PLIT D ECISION T REE - ENTROPY
Construct a decision tree for the given dataset using Information gain.
• Compute impurity measure of the parent before splitting (P) Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
• Compute the collective impurity measure of children after splitting(M) Rainy Hot High True No
• Choose the attribute test condition that produces the highest gain Overcast Hot High False Yes
Gain = P – M Sunny Mild High False Yes
• or equivalently, lowest impurity measure after splitting (M) Sunny Cool Normal False Yes
• Information gain is the expected reduction in entropy caused by partitioning Sunny Cool Normal True No
the examples on an attribute. Overcast Cool Normal True Yes

• Maximizing the gain == minimizing the weighted average impurity measure of Rainy Mild High False No
Rainy Cool Normal False Yes
children nodes Sunny Mild Normal False Yes
Choose the split that achieves most Rainy Mild Normal True Yes
𝑘
𝑛𝑖 reduction (maximizes GAIN) Overcast Mild High True Yes
𝐺𝑎𝑖𝑛𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 𝑝 − ෍ 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑖) Used in the ID3 and C4.5 decision Overcast Hot Normal False Yes
𝑛
𝑖=1 tree algorithms Sunny Mild High True No
28 / 67 29 / 48

D ECISION T REE - ENTROPY D ECISION T REE - ENTROPY

30 / 48 31 / 48
D ECISION T REE - ENTROPY D ECISION T REE - ENTROPY
Step 3:
Choose attribute with the largest information gain as the decision node, divide the
dataset by its branches and repeat the same process on every branch.
Outlook has the max gain. So choose Outlook as the root decision point.
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further splitting.

32 / 48 33 / 48

D ECISION T REE - ENTROPY D ECISION T REE - ENTROPY


Step 4a: Consider only rows which have Outlook = Step 4b: Consider only rows which have Outlook = Step 5: Draw the final decision tree.
Step 4: Repeat Rainy.
Sunny

Humidity has the max gain. Choose Humidity


Windy has max the max gain. Choose Windy as the next as the next decision point.
decision point.

34 / 48 35 / 48
Splitting Criteria based on Classification
Comparison among Splitting Criteria
Error
• Classification error at a node t : For a 2-class problem:

Error (t ) = 1 − max P(i | t )


i
C1
C2
0
6
P(C1) = 0/6 = 0 P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0

• Measures misclassification error made by a node.


C1 1 P(C1) = 1/6 P(C2) = 5/6
• Maximum (1 - 1/nc) when records are equally
distributed among all classes, implying least C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6
interesting information
• Minimum (0.0) when all records belong to one C1 2 P(C1) = 2/6 P(C2) = 4/6
class, implying most interesting information C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

36 37
11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Dealing with continuous-valued attributes Dealing with continuous-valued attributes


• Given a continuous-valued attribute A, • How to select the best value for the threshold c?
dynamically define new discrete valued
attribute Ac that partition the continuous • Example. Temperature in the example
attribute value into a discrete set of intervals Sort the examples according to Temperature
• Ac = True if A < c, False otherwise Temperature 60 65 67 68 70 71 72 72 75 80 81 83 85 85
• How to select the best value for the
PlayTennis Y N Y Y Y N Y Y N N Y Y N Y
threshold c?

Determine candidate thresholds by averaging


consecutive values where there is a change in
classification:

38 / 67 39 / 67
Dealing with continuous-valued attributes Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
60 65 67 68 70 71 72 72 75 80 81 83 85 85
The candidate thresholds identified are:
• 62.5 ,66,70.5,71.5,73.5,80.5,84 Y N Y Y Y N Y Y N N Y Y N Y

Evaluate Candidate Thresholds


• Threshold = 62.5
• Group 1 (Temperature ≤ 62.5): [60, Yes]
• Group 2 (Temperature > 62.5): [65, No; 67, Yes; 68, Yes; 70, Yes; 71, No; 72, Yes; 75,
No; 80, No; 81, Yes; 83, Yes; 85, No]
• Calculate Entropy for Group 1 and Group 2 and their weighted Entropy, then find the
information gain.
Calculate Information gain
We will calculate the Information gain for each of these candidate thresholds.
Pick the one with the maximum gain.

40 / 67 41 / 67

S t o p p i n g C r i t e r i a f o r Tr e e I n d u c t i o n D e c i s i o n Tr e e B a s e d C l a s s i f i c a t i o n
• Stop expanding a node when all the records belong to the same class
• Advantages:
• Inexpensive to construct
• Stop expanding a node when all the records have similar attribute values • Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Early termination • Accuracy is comparable to other classification techniques for many simple data sets

42
11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree:Overfitting
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification Training errors (apparent errors)--Errors committed on the training set

Underfitting vs. Overfitting Test errors -- Errors committed on the test set
• Underfitting results in decision trees that are too simple to solve the problem. They may offer superior interpretability.
Generalization errors -- Expected error of a model over random selection of
• Overfitting results in decision trees that are more complex than necessary records from same distribution
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 45 / 67

Decision Tree:Overfitting Decision Tree :Overfitting

Two class problem:


+ : 5200 instances
• 5000 instances generated from
a Gaussian centered at (10,10)
• 200 noisy instances added
o : 5200 instances
• Generated from a uniform
distribution
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is large

46 / 67
Decision Tree :Handling Overfitting Decision Tree :Handling Overfitting
Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree Post Pruning :Idea

• Typical stopping conditions for a node: • Post construction, scan the tree bottom-up

• Stop if all instances belong to the same class • At every decision node

• Stop if all the attribute values are the same • Retain the attribute node & evaluate it against the prune set (validation set)

More restrictive conditions: • Remove the attribute node & reevaluate it with the same prune set

• Stop if number of instances is less than some user-specified threshold • If there is a reduction in error, prune the node else retain the node

• Stop if expanding the current node does not improve impurity measures (e.g., Gini • Repeat this in other branches of the tree
or information gain).
• Stop if estimated generalization error falls below certain threshold

C l a s s i f i c a t i o n Te c h n i q u e : R u l e - B a s e d
Rule-based Classifier (Example)
Classifier
Name Blood Type Give Birth Can Fly Live in Water Class

• Classify records by using a collection of “if…then…” rules


human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals

• Rule: (Condition) → y
frog cold no no sometimes amphibians
where komodo
bat
cold
warm
no
yes
no
yes
no
no
reptiles
mammals

• Condition is a conjunctions of attributes pigeon


cat
warm
warm
no
yes
yes
no
no
no
birds
mammals


leopard shark cold yes no yes fishes
y is the class label turtle
penguin
cold
warm
no
no
no
no
sometimes
sometimes
reptiles
birds
• LHS: rule antecedent or condition porcupine
eel
warm
cold
yes
no
no
no
no
yes
mammals
fishes


salamander cold no no sometimes amphibians
RHS: rule consequent gila monster cold no no no reptiles
platypus warm no no no mammals
• Examples of classification rules: owl
dolphin
warm
warm
no
yes
yes
no
no
yes
birds
mammals
• (Blood Type=Warm)  (Lay Eggs=Yes) → Birds eagle warm no yes no birds

R1: (Give Birth = no)  (Can Fly = yes) → Birds


• (Taxable Income < 50K)  (Refund=Yes) → Evade=No
R2: (Give Birth = no)  (Live in Water = yes) → Fishes
R3: (Give Birth = yes)  (Blood Type = warm) → Mammals
R4: (Give Birth = no)  (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Application of Rule-Based Classifier Rule Coverage and Accuracy
Tid Refund Marital Taxable
• A rule r covers an instance x if the attributes of the instance satisfy the Status Income Class

condition of the rule • Coverage of a rule: 1 Yes Single 125K No


• Fraction of records that satisfy the 2 No Married 100K No
R1: (Give Birth = no)  (Can Fly = yes) → Birds antecedent of a rule 3 No Single 70K No
R2: (Give Birth = no)  (Live in Water = yes) → Fishes
R3: (Give Birth = yes)  (Blood Type = warm) → Mammals • Accuracy of a rule: 4 Yes Married 120K No

R4: (Give Birth = no)  (Can Fly = no) → Reptiles • Fraction of records that satisfy both the 5 No Divorced 95K Yes

antecedent and consequent of a rule 6 No Married 60K No


R5: (Live in Water = sometimes) → Amphibians
7 Yes Divorced 220K No
8 No Single 85K Yes
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ? 9 No Married 75K No
grizzly bear warm yes no no ?
10 No Single 90K Yes
10

The rule R1 covers a hawk => Bird (Status=Single) → No


The rule R3 covers the grizzly bear => Mammal
Coverage = 40%, Accuracy = 50%
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

H o w d o e s R u l e - b a s e d C l a s s i f i e r Wo r k ? Characteristics of Rule -Based Classifier


R1: (Give Birth = no)  (Can Fly = yes) → Birds
• Mutually exclusive rules
R2: (Give Birth = no)  (Live in Water = yes) → Fishes
• Classifier contains mutually exclusive rules if the rules are
R3: (Give Birth = yes)  (Blood Type = warm) → Mammals independent of each other
R4: (Give Birth = no)  (Can Fly = no) → Reptiles • Every record is covered by at most one rule
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
• Exhaustive rules
lemur warm yes no no ? • Classifier has exhaustive coverage if it accounts for every possible
turtle cold no no sometimes ?
dogfish shark cold yes no yes ? combination of attribute values
• Each record is covered by at least one rule
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
F r o m D e c i s i o n Tr e e s To R u l e s Rules Can Be Simplified
Tid Refund Marital Taxable
Classification Rules Refund Status Income Cheat
Yes No
Refund (Refund=Yes) ==> No 1 Yes Single 125K No
Yes No NO Marita l 2 No Married 100K No
(Refund=No, Marital Status={Single,Divorced}, Status
Taxable Income<80K) ==> No {Single,
NO Marita l {Married} 3 No Single 70K No
Divorced}
{Single, Status 4 Yes Married 120K No
(Refund=No, Marital Status={Single,Divorced},
{Married}
Divorced} Taxable Income>80K) ==> Yes Taxable NO 5 No Divorced 95K Yes
Income
Taxable NO (Refund=No, Marital Status={Married}) ==> No 6 No Married 60K No
< 80K > 80K
Income 7 Yes Divorced 220K No
< 80K > 80K NO YES
8 No Single 85K Yes
NO YES 9 No Married 75K No
Rules are mutually exclusive and exhaustive
10 No Single 90K Yes
Rule set contains as much information as the tree 10

Initial Rule: (Refund=No)  (Status=Married) → No


Simplified Rule: (Status=Married) → No
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Effect of Rule Simplification Ordered Rule Set

• Rules are no longer mutually exclusive • Rules are rank ordered according to their priority
• A record may trigger more than one rule • An ordered rule set is known as a decision list
• Solution?
• Ordered rule set • When a test record is presented to the classifier
• Unordered rule set – use voting schemes • It is assigned to the class label of the highest ranked rule it has triggered
• If none of the rules fired, it is assigned to the default class
• Rules are no longer exhaustive
R1: (Give Birth = no)  (Can Fly = yes) → Birds
• A record may not trigger any rules
R2: (Give Birth = no)  (Live in Water = yes) → Fishes
• Solution?
• Use a default class
R3: (Give Birth = yes)  (Blood Type = warm) → Mammals
R4: (Give Birth = no)  (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Ordering Schemes Building Classification Rules

• Rule-based ordering • Direct Method:


• Individual rules are ranked based on their quality • Extract rules directly from data
• Class-based ordering • e.g.: RIPPER, CN2, Holte’s 1R

• Rules that belong to the same class appear together


Rule-based Ordering Class-based Ordering • Indirect Method:
(Refund=Yes) ==> No (Refund=Yes) ==> No • Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
(Refund=No, Marital Status={Single,Divorced}, (Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No Taxable Income<80K) ==> No • e.g: C4.5rules
(Refund=No, Marital Status={Single,Divorced}, (Refund=No, Marital Status={Married}) ==> No
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Single,Divorced},
(Refund=No, Marital Status={Married}) ==> No Taxable Income>80K) ==> Yes

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Direct Method: Sequential Covering Indirect Methods

1. Start from an empty rule


2. Grow a rule using the Learn-One-Rule function P
3. Remove training records covered by the rule No Yes

4. Repeat Step (2) and (3) until stopping criterion is met Q R Rule Set

No Yes No Yes r1: (P=No,Q=No) ==> -


r2: (P=No,Q=Yes) ==> +
- + + Q r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> -
No Yes
r5: (P=Yes,R=Yes,Q=Yes) ==> +
- +

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bayes Classifier Example of Bayes Theorem

• A probabilistic framework for solving classification problems • Given:


• Conditional Probability: • A doctor knows that meningitis causes stiff neck 50% of the time
P ( A, C ) • Prior probability of any patient having meningitis is 1/50,000
P (C | A) = • Prior probability of any patient having stiff neck is 1/20
P ( A)
P ( A, C )
P( A | C ) = • If a patient has stiff neck, what’s the probability he/she has
P (C ) meningitis?

• Bayes theorem:
P( A | C ) P(C ) P( S | M ) P( M ) 0.5 1 / 50000
P(C | A) = P( M | S ) = = = 0.0002
P( A) P( S ) 1 / 20

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Bayesian Classifiers Bayesian Classifiers

• Consider each attribute and class label as random variables • Approach:


• compute the posterior probability P(C | A1, A2, …, An) for all values of C using the
Bayes theorem
• Given a record with attributes (A1, A2,…,An) P( A A  A | C ) P(C )
P(C | A A  A ) = 1 2 n

• Goal is to predict class C 1 2 n


P( A A  A ) 1 2 n

• Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) • Choose value of C that maximizes
P(C | A1, A2, …, An)
• Can we estimate P(C| A1, A2,…,An ) directly from data? • Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)

• How to estimate P(A1, A2, …, An | C )?

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Naive Bayes Classifier How to Estimate Probabilities from Data?

• Assume independence among attributes Ai when class is given:


• P(A1, A2, …, An |C) = P(A1| Cj) P(A2| Cj)… P(An| Cj) Tid Refund Marital Taxable Evade • Class: P(C) = Nc/N
Status Income • e.g., P(No) = 7/10,
1 Yes Single 125K No P(Yes) = 3/10
• Can estimate P(Ai| Cj) for all Ai and Cj. 2 No Married 100K No
3 No Single 70K No • For discrete attributes:
• New point is classified to Cj if P(Cj)  P(Ai| Cj) is maximal. 4 Yes Married 120K No P(Ai | Ck) = |Aik|/ Nck
5 No Divorced 95K Yes • where |Aik| is number of instances
6 No Married 60K No having attribute Ai and belongs to
7 Yes Divorced 220K No class Ck
8 No Single 85K Yes
• Examples:
9 No Married 75K No P(Status=Married|No) = 4/7
P(Refund=Yes|Yes)=0
10 No Single 90K Yes

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

How to Estimate Probabilities from


Data? How to Estimate Probabilities from Data?

• For continuous attributes:


Tid Refund Marital Taxable Evade • Normal distribution:
• Discretize the range into bins Status Income ( Ai −  ij ) 2

• one ordinal attribute per bin P ( Ai | c j ) =
1
e
2 ij2
1 Yes Single 125K No 2 ij2
• Too many bins – training records are too few to provide reliable probability for each interval
2 No Married 100K No
• Too few bins – some intervals may aggregate from different classes & we may miss the correct decision
boundary 3 No Single 70K No • One for each (Ai,ci) pair
• Two-way split: (A < v) or (A > v) 4 Yes Married 120K No
• choose only one of the two splits as new attribute
5 No Divorced 95K Yes • For (Income, Class=No):
• Probability density estimation: • If Class=No
6 No Married 60K No
• Assume attribute follows a normal distribution • sample mean = 110
7 Yes Divorced 220K No
• Use data to estimate parameters of distribution • sample variance = 3179.16
(e.g., mean and standard deviation) 8 No Single 85K Yes
• Once probability distribution is known, can use it to estimate the conditional probability P(Ai|Ck) 9 No Married 75K No (120 −110 ) 2
1 −
P ( Income = 120 | No) = e 2 (3179 )
= 0.0069
10 No Single 90K Yes 2 (56.38)

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of Naïve Bayes Classifier Naive Bayes Classifier
Given a Test Record:
Tid Refund Marital Taxable Evade X = (Refund = No, Married, Income = 120K) • If one of the conditional probability is zero, then the entire
Status Income expression becomes zero
1 Yes Single 125K No P(X|Class=No) = P(Refund=No|Class=No)
 P(Married| Class=No) • Probability estimation is done with Laplacian correction:
2 No Married 100K No  P(Income=120K| Class=No)
3 No Single 70K No = 4/7  4/7  0.0069 = 0.0023

4 Yes Married 120K No P(X|Class=Yes) = P(Refund=No| Class=Yes) N ic


5 No Divorced 95K Yes  P(Married| Class=Yes) Original : P ( Ai | C ) =
 P(Income=120K| Class=Yes) Nc c: number of classes
6 No Married 60K No = 1  0  1.2  10-9 = 0 N +1
7 Yes Divorced 220K No Laplace : P ( Ai | C ) = ic
Since P(X|No)P(No) > P(X|Yes)P(Yes) Nc + c
8 No Single 85K Yes
Therefore P(No|X) > P(Yes|X)
9 No Married 75K No
=> Class = No
10 No Single 90K Yes

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example of Naïve Bayes Classifier Naïve Bayes (Summary)


Name
human
Give Birth
yes no
Can Fly Live in Water Have Legs
no yes
Class
mammals A: attributes • Robust to isolated noise points
python no no no no non-mammals
salmon no no yes no non-mammals M: mammals
whale yes no yes no mammals
frog no no sometimes yes non-mammals N: non-mammals
komodo
bat
no
yes
no
yes
no
no
yes
yes
non-mammals
mammals
6 6 2 2
P ( A | M ) =    = 0.06
• Handle missing values by ignoring the instance during probability
pigeon
cat
no
yes
yes
no
no
no
yes
yes
non-mammals
mammals
7 7 7 7 estimate calculations
leopard shark yes no yes no non-mammals 1 10 3 4
turtle no no sometimes yes non-mammals P ( A | N ) =    = 0.0042
penguin no no sometimes yes non-mammals 13 13 13 13
porcupine yes no no yes mammals
eel
salamander
no
no
no
no
yes
sometimes
no
yes
non-mammals
non-mammals
7
P ( A | M ) P ( M ) = 0.06  = 0.021 • Robust to irrelevant attributes
gila monster no no no yes non-mammals 20
platypus no no no yes mammals
13
owl no yes no yes non-mammals
P ( A | N ) P ( N ) = 0.004  = 0.0027
• Independence assumption may not hold for some attributes
dolphin yes no yes no mammals
eagle no yes no yes non-mammals 20

Give Birth Can Fly Live in Water Have Legs Class P(A|M)P(M) > P(A|N)P(N) • Use other techniques such as Bayesian Belief Networks (BBN)
yes no yes no ?
=> Mammals
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN) Artificial Neural Networks (ANN)

• Artificial Neural Network (ANN) is a computational model inspired by


the way the human brain processes information.
• It is made up of interconnected nodes (neurons) organized in layers: X1 X2 X3 Y Input Black box
• Input Layer: Takes in raw data. 1 0 0 0
X1
1 0 1 1
• Hidden Layers: Perform feature extraction. 1 1 0 1 Output
• Output Layer: Produces the final prediction or result. 1 1 1 1 X2 Y
0 0 1 0
0 1 0 0
0 1 1 1 X3
0 0 0 0

Output Y is 1 if at least two of the three inputs are equal to 1.

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

H o w D o e s A N N Wo r k ? Artificial Neural Networks (ANN)

• Each neuron receives inputs, multiplies them with weights, adds Input
a bias, and passes through an activation function. nodes Black box
X1 X2 X3 Y
• The network learns by adjusting weights using algorithms 1
1
0
0
0
1
0
1
X1 0.3
Output
node
like backpropagation and gradient descent. 1 1 0 1
1 1 1 1 X2 0.3
0 0 1 0
 Y
• Basic Structure of ANN 0 1 0 0
0 1 1 1 X3 0.3
Input Layer → Hidden Layer(s) → Output Layer 0 0 0 0
t=0.4
(x1, x2, ...) (h1, h2, ...) (y)
Y = I (0.3 X 1 + 0.3 X 2 + 0.3 X 3 − 0.4  0)
1 if z is true
where I ( z ) = 
0 otherwise
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN) General Structure of ANN
Input x1 x2 x3 x4 x5
• Model is an assembly of inter-connected nodes
nodes and weighted links Black box Input Input Neuron i Output
Output Layer
• Output node sums up each of its input X1 w1 node I1 wi1
value according to the weights of its links Activation
w2 wi2
• Compare output node against some X2  Y I2
wi3
Si function Oi Oi
w3 g(Si )
threshold t Hidden I3
X3 Layer
t
threshold, t

Perceptron Model
Y = I (  wi X i − t )
Output Training ANN means learning the
or Layer weights of the neurons
i
Y = sign (  wi X i − t ) y
i
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s

B1

• Find a linear hyperplane (decision boundary) that will separate the data • One Possible Solution
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s

B2

B2

• Another possible solution • Other possible solutions


Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s
B1
B1

B2
B2

b21
b22

margin
b11

• Which one is better? B1 or B2? b12

• How do you define better? • Find hyperplane maximizes the margin => B1 is better than B2
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
N o n l i n e a r S u p p o r t Ve c t o r M a c h i n e s
S u p p o r t Ve c t o r M a c h i n e s
B1
• What if decision boundary is not linear?

 
w• x + b = 0
 
 
w • x + b = −1 w • x + b = +1

b11

 
 1 if w • x + b  1 b12 2
Margin =  2
f ( x) =    || w ||
− 1 if w • x + b  −1 Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

N o n l i n e a r S u p p o r t Ve c t o r M a c h i n e s Ensemble Methods

• Transform data into higher dimensional space • Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating


predictions made by multiple classifiers

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Idea Why does it work?

D
Original
Training data • Suppose there are 25 base classifiers
• Each classifier has error rate,  = 0.35
• Assume classifiers are independent
Step 1:
Create Multiple D1 D2 .... Dt-1 Dt • Probability that the ensemble classifier makes a wrong
Data Sets
prediction:
  i
25 25
Step 2:
Build Multiple C1 C2 Ct -1 Ct
   (1 −  )25−i = 0.06
i =13  i 
Classifiers

Step 3:
Combine C*
Classifiers

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Data Mining
M2: Data Preprocessing 2.1 Data Preprocessing Techniques
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Preprocessing Data preprocessing

Data Preprocessing 4
Data Preprocessing 3

Data
Data Pre-processing
Pre-processing Data Data Mining
DataMining

• D ata P re - p ro c e s s i n g : A n O ve r v i e w, D ata C l e a n i n g ,
• W hat is D ata M ining ?
• D ata Re d u c t i o n - O ve r v i e w o f D ata Re d u c t i o n S t rate g i e s , P C A ,
• M any people treat data mining as a synonym for
Att r i b u te S u b s e t S e l e c t i o n , H i sto g ra m s , S a m p l i n g ;
another popularly us ed te rm, Knowledge D is cover y
• D ata Tra n sfo r m at i o n a n d D ata D i s c ret i zat i o n - D ata from D ata , or KD D,
Tra n sfo r m at i o n b y N o r m a l i zat i o n ,
• W hile ot he rs view data mining as mere ly an
• D i s c ret i zat i o n b y B i n n i n g , D i s c ret i zat i o n b y H i sto g ra m es s ential step in the proc es s of knowledge
A n a l ys i s , D i s c ret i zat i o n b y C l u ste r, D e c i s i o n Tre e , a n d discove r y.
C o r re l at i o n A n a l ys e s . Data Preprocessing Data Preprocessing
5 6
Knowledge Discovery Process Major Tasks in Data Preprocessing

• The knowledge discovery process is shown in Figure below as an iterative sequence


Data cleaning Data integration

Fill in missing values, smooth Integration of multiple


noisy data, identify or remove databases, data cubes, or
outliers, and resolve files
inconsistencies
Data
Preprocessing
Data transformation and data
Data reduction discretization

Dimensionality reduction Normalization

Data Preprocessing 7 Data Preprocessing 8

Knowledge Discovery Process


Knowledge Discovery Process

The knowledge discovery process has the following steps: ➢Steps 1 through 4 are different forms of data preprocessing, where data are
1. Data cleaning (to remove noise and inconsistent data) prepared for mining.
2. Data integration (where multiple data sources may be combined)
➢The data mining step may interact with the user or a knowledge base.
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms ➢The interesting patterns are presented to the user and may be stored as new
appropriate for mining by performing summary or aggregation operations) knowledge in the knowledge base.
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns) ➢“Data mining is the process of discovering interesting patterns and knowledge from
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge large amounts of data. The data sources can include databases, data warehouses, the
based on interestingness measures)
Web, other information repositories, or data that are streamed into the system
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users) dynamically” .
Data Preprocessing 9 Data Preprocessing 10
Data Preprocessing

Data Preprocessing Data Preprocessing


Data Preprocessing

➢Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent ➢Data integration merges data from multiple sources into a coherent data
data due to their typically huge size and their origin from multiple, heterogenous
sources. store such as a data warehouse.
➢Low-quality data will lead to low-quality mining results.
➢“How can the data be preprocessed in order to help improve the quality of the data ➢Data reduction can reduce data size by, aggregating, eliminating redundant
and, consequently, of the mining results?
features, or clustering.
➢How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?”
➢ Data transformations (e.g., normalization) may be applied, where data are
➢There are several data preprocessing techniques.
➢Data cleaning can be applied to remove noise and correct inconsistencies in data. scaled to fall within a smaller range like 0.0 to 1.0.

➢This can improve the accuracy and efficiency of mining algorithms.


Data Preprocessing 11 Data Preprocessing 12

Data Quality: Why Preprocess the Data?


Data Quality: Why Preprocess the Data? Data Preprocessing stepssteps
➢Data have quality if they satisfy the requirements of the intended use.
• Figure below summarizes the data preprocessing steps described here

➢There are many factors comprising data quality, including accuracy, completeness, consistency,
timeliness, believability, and interpretability.

➢Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s
data with respect to your branch’s sales.

➢You carefully inspect the company’s database and data warehouse, identifying and selecting the
attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.

➢ You notice that several of the attributes for various tuples have no recorded value.

Data Preprocessing 13 Data Preprocessing 14


Data Cleaning Missing Values

➢ Imagine that you need to analyze AllElectronics sales and customer data.
➢Real-world data tend to be incomplete, noisy, and inconsistent.
➢ You note that many tuples have no recorded value for several attributes such as customer income. How

➢Data cleaning routines attempt to fill in missing values, smooth can you go about filling in the missing values for this attribute?

➢ Let’s look at the following methods.


out noise while identifying outliers, and correct inconsistencies in
1. Ignore the tuple: This is usually done when the class label is missing. This method is not very effective,
the data. unless the tuple contains several attributes with missing values. It is especially poor when the percentage
of missing values per attribute varies considerably. By ignoring the tuple, we do not make use of the
remaining attributes’ values in the tuple. Such data could have been useful to the task at hand.

2. Fill in the missing value manually:

Data Preprocessing 15 Data Preprocessing 16

Missing Values Missing Values

3. Use a global constant to fill in the missing value: Replace all missing attribute values
5. Use the attribute mean or median for all samples belonging to the same class as the
by the same constant such as a label like “Unknown” or −∞. If missing values are
given tuple: For example, if classifying customers according to credit risk, we may replace
replaced by, say, “Unknown,” then the mining program may mistakenly think that
the missing value with the mean income value for customers in the same credit risk
they form an interesting concept, since they all have a value in common—that of
category as that of the given tuple. If the data distribution for a given class is skewed, the
“Unknown.” Hence, although this method is simple, it is not foolproof. median value is a better choice.

4. Use a measure of central tendency for the attribute (e.g., the mean or median) to 6. Use the most probable value to fill in the missing value: This may be determined with
fill in the missing value: measures of central tendency, which indicate the “middle” regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
value of a data distribution. For normal (symmetric) data distributions, the mean
decision tree to predict the missing values for income.
can be used, while skewed data distribution should employ the median.

Data Preprocessing 17 Data Preprocessing 18


Data Quality: Why Preprocess the Data? Major Tasks in Data Preprocessing

Data cleaning Data integration


Measures for data quality
Fill in missing values, smooth Integration of multiple
noisy data, identify or remove databases, data cubes, or
outliers, and resolve files
inconsistencies
Accuracy: correct or wrong , accurate or not Data
Preprocessing
Data transformation and data
Completeness: not recorded, unavailable, … Data reduction discretization

Consistency: some modified but some not, dangling ,


Dimensionality reduction Normalization

Data Preprocessing 19 Data Preprocessing 20

Data Reduction Data Reduction


Data reduction: Data reduction strategies

Dimensionality reduction, Ex: remove unimportant attributes

D ata re d u c t i o n : O b ta i n a re d u c e d re p re s e ntat i o n o f t h e • Wavelet transforms

d ata s et t h at i s m u c h s m a l l e r i n vo l u m e b u t yet p ro d u c e s • Principal Components Analysis (PCA)

• Feature subset selection, feature creation


t h e s a m e ( o r a l m o st t h e s a m e ) a n a l y t i c a l re s u l t s .
Numerosity reduction (some simply call it: Data Reduction)
W hy d ata re d u c t i o n ? A d ata b a s e / d ata wa re h o u s e m ay sto re • Regression and Log-Linear Models

te ra b y te s o f d ata . C o m p l ex d ata a n a l ys i s m ay ta ke a ve r y • Histograms, clustering, sampling

l o n g t i m e to r u n o n t h e co m p l e te d ata s e t . • Data cube aggregation

Data compression
Data Preprocessing 21 Data Preprocessing 22
Data Reduction Data Reduction
Dimensionality Reduction Dimensionality Reduction

C u rs e o f d i m e n s i o n a l i t y D i m e n s i o n a l i t y re d u c t i o n

• W h e n d i m e n s i o n a l i t y i n c re a s e s , d ata b e c o m e s i n c re a s i n g l y • Avo i d t h e c u rs e o f d i m e n s i o n a l i t y

s p a rs e • H e l p e l i m i n ate i r re l e va nt fe at u re s a n d re d u c e n o i s e

• D e n s i t y a n d d i sta n c e b e t w e e n p o i nt s , w h i c h i s c r i t i c a l to • Re d u c e t i m e a n d s p a c e re q u i re d i n d ata m i n i n g
c l u ste r i n g , o u t l i e r a n a l ys i s , b e co m e s l e s s m e a n i n g f u l • A l l o w e a s i e r v i s u a l i zat i o n

Data Preprocessing 23 Data Preprocessing 24

Data Reduction Regression and Log-Linear Models


Dimensionality Reduction

Linear regression
Dimensionality reduction techniques
• D ata m o d e l e d to f i t a st ra i g ht l i n e
• Wavelet transforms
• O f te n u s e s t h e l e a st - s q u a re m et h o d to f i t t h e l i n e
• Principal Component Analysis
Multiple regression
• Super vised and nonlinear techniques • Allows a response variable Y to be modeled as a linear function

(Ex: feature selection) of multidimensional feature vector

Data Preprocessing 25 Data Preprocessing 26


y
Regression Analysis Regression and Log-Linear Models
• Regression analysis: A techniques for the modeling
and analysis of numerical data consisting of values of Y1
Linear regression: Y = w X + b
a dependent variable (response variable or
Y1’ Tw o r e g r e s s i o n c o e f f i c i e n t s , w a n d b , s p e c i f y t h e l i n e a n d a r e t o b e
measurement) and of one or more independent y=x+1
variables (explanatory variables or predictors) estimated by using the data at hand

• The parameters are estimated so as to give a "best X1 x Using the least squares criterion to the known values of Y1, Y2, …,
fit" of the data X1, X2, ….
• Used for prediction
• Most commonly the best fit is evaluated by using the (including forecasting of Multiple regression: Y = b0 + b1 X1 + b2 X2
least squares method time-series data),
inference, hypothesis Many nonlinear functions can be transformed into the above
testing, and modeling of
Data Preprocessing
causal relationships 27
Data Preprocessing 28

Sampling: With or without Replacement Sampling: Cluster or Stratified Sampling

Raw Data Cluster/Stratified Sample

Raw Data
Data Preprocessing 29 Data Preprocessing 30
Binning Methods for Data Smoothing Regression and Log-Linear Models
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins: Linear regression: Y = w X + b
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25 Tw o r e g r e s s i o n c o e f f i c i e n t s , w a n d b , s p e c i f y t h e l i n e a n d a r e t o b e
- Bin 3: 26, 28, 29, 34
estimated by using the data at hand
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 Using the least squares criterion to the known values of Y1, Y2, …,
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
X1, X2, ….
* Smoothing by bin boundaries:
Multiple regression: Y = b0 + b1 X1 + b2 X2
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25 Many nonlinear functions can be transformed into the above
- Bin 3: 26, 26, 26, 34

Data Preprocessing 31 Data Preprocessing 32

Correlation Analysis (Nominal Data) Chi-Square Calculation: An Example


Play chess Not play chess Sum (row)

• Χ2 (chi-square) test
Like science fiction 250(90) 200(360) 450
(Observed − Expected) 2
2 =  Not like science fiction 50(210) 1000(840) 1050
Expected
Sum(col.) 300 1200 1500

• The larger the Χ2 value, the more likely the variables are related
• Χ2 (chi-square) calculation (numbers in parenthesis are expected counts
• The cells that contribute the most to the Χ2 value are those whose actual count is very calculated based on the data distribution in the two categories)
different from the expected count
(250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2
• Correlation does not imply causality 2 = + + + = 507.93
90 210 360 840
• # of hospitals and # of car-theft in a city are correlated • It shows that like_science_fiction and play_chess are correlated in the group

• Both are causally linked to the third variable: population


34
Data Preprocessing 33
Co-Variance: An Example Discretization
• Three types of attributes
• It can be simplified in computation as • Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8),
(5, 10), (4, 11), (6, 14).
• Discretization: Divide the range of a continuous
attribute into intervals
• Question: If the stocks are affected by the same industry trends, will their prices
• Interval labels can then be used to replace actual data values
rise or fall together?
• Reduce data size by discretization
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• Supervised vs. unsupervised
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Split (top-down) vs. merge (bottom-up)
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
• Discretization can be performed recursively on an attribute
• Thus, A and B rise together since Cov(A, B) > 0. • Prepare for further analysis, e.g., classification
35
Data Preprocessing 36

Data Discretization Methods Simple Discretization: Binning

• Typical methods: All the methods can be applied recursively • Equal-width (distance) partitioning
• Binning • Divides the range into N intervals of equal size: uniform grid
• Top-down split, unsupervised • if A and B are the lowest and highest values of the attribute, the width of

• Histogram analysis intervals will be: W = (B –A)/N.


• The most straightforward, but outliers may dominate presentation
• Top-down split, unsupervised
• Skewed data is not handled well
• Clustering analysis (unsupervised, top-down split or bottom-
up merge) • Equal-depth (frequency) partitioning
• Decision-tree analysis (supervised, top-down split) • Divides the range into N intervals, each containing approximately same
number of samples
• Correlation (e.g., 2) analysis (unsupervised, bottom-up
• Good data scaling
merge)
• Managing categorical attributes can be tricky

Data Preprocessing 37 37 Data Preprocessing 38


Discretization by Classification & Correlation Analysis
Data Reduction : Dimensionality Reduction
• Classification (e.g., decision tree analysis)
• Curse of dimensionality
• Supervised: Given class labels, e.g., cancerous vs. benign • When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier analysis, becomes less
• Using entropy to determine split point (discretization point) meaningful
• Top-down, recursive split • The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Details to be covered in Chapter 7 • Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Correlation analysis (e.g., Chi-merge: χ2-based discretization) • Reduce time and space required in data mining
• Allow easier visualization
• Supervised: use class information
• Dimensionality reduction techniques
• Bottom-up merge: find the best neighboring intervals (those having similar • Wavelet transforms
• Principal Component Analysis
distributions of classes, i.e., low χ2 values) to merge
• Supervised and nonlinear techniques (e.g., feature selection)
• Merge performed recursively, until a predefined stopping condition
Data Mining
Data Preprocessing 39 40
39 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Principal Component Analysis (PCA) Principal Component Analysis (Steps)


• Find a projection that captures the largest amount of variation in data • Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal components)
that can be best used to represent data
• The original data are projected onto a much smaller space, resulting in dimensionality
reduction. • Normalize input data: Each attribute falls within the same range

x2 • Compute k orthonormal (unit) vectors, i.e., principal components


• Each input data (vector) is a linear combination of the k principal component vectors
• The principal components are sorted in order of decreasing “significance” or strength
e • Since the components are sorted, the size of the data can be reduced by eliminating the
weak components, i.e., those with low variance (i.e., using the strongest principal
components, it is possible to reconstruct a good approximation of the original data)
• Works for numeric data only

Data Mining x1 Data Mining


41 42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Subset Selection Heuristic Search in Attribute Selection

• Another way to reduce dimensionality of data • There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Redundant attributes
• Best single attribute under the attribute independence assumption: choose
• Duplicate much or all of the information contained in one or more other by significance tests
attributes • Best step-wise feature selection:
• The best single-attribute is picked first
• E.g., purchase price of a product and the amount of sales tax paid • Then next best attribute condition to the first, ...
• Irrelevant attributes • Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Contain no information that is useful for the data mining task at hand • Best combined attribute selection and elimination
• E.g., students' ID is often irrelevant to the task of predicting students' GPA • Optimal branch and bound:
• Use attribute elimination and backtracking

Data Mining Data Mining


43 44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Attribute Creation (Feature Generation) Data Reduction: Numerosity Reduction


• Create new attributes (features) that can capture the important information in a
data set more effectively than the original ones • Reduce data volume by choosing alternative, smaller forms of data
representation
• Three general methodologies
• Parametric methods (e.g., regression)
• Attribute extraction
• Assume the data fits some model, estimate model parameters, store only the
• Domain-specific
parameters, and discard the data (except possible outliers)
• Mapping data to new space (see: data reduction)
• Ex.: Log-linear models—obtain value at a point in m-D space as the product
• E.g., Fourier transformation, wavelet transformation
on appropriate marginal subspaces
• Attribute construction
• Combining features • Non-parametric methods
• Data discretization • Do not assume models
• Major families: histograms, clustering, sampling, …

Data Mining Data Mining


45 45 46
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers

Thank you
Data Mining Data Mining
47 48
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Note to Self: (Start Recording ! )


IMPORTANT Note to Students

▪ It is important to know that just login to the session does not guarantee the
attendance.
▪ Once you join the session, continue till the end to consider you as present in the
class.
▪ IMPORTANTLY, you need to make the class more interactive by responding to
Data Mining Professors queries in the session.
▪ Whenever Professor calls your number / name, you need to respond,
BITS Pilani
Pilani|Dubai|Goa|Hyderabad M3: Data Exploration
otherwise it will be considered as ABSENT

The instructor is gratefully acknowledging the authors who made their course materials freely available online. BITS Pilani, Pilani Campus
BITS Pilani
Recap
Pilani|Dubai|Goa|Hyderabad

• What is DM and what is not DM

• Prediction methods and description methods with examples

• Data Preprocessing concepts

• Data preprocessing techniques


3.1 Data Descriptions

Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Recap Recap : Exercise


Equal Depth (Frequency) Binning: Bins have an equal frequency. Consider that following scholar data set made available to you for carrying out some
mining activity. List 4 potential issues with this dataset identified while preprocessing?
Input:[5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
Output: [5, 10, 11, 13] [15, 35, 50, 55] [72, 92, 204, 215]
Sno Name Age DateOfJoining Branch DateOfBirth

Equal Width (Distance) Binning : bins have equal width with a range of each bin are defined as 1 Anu 34 15-Jan-2015 CSE Feb 24, 1997
[min + w], [min + 2w] …. [min + nw] 2 Banu 33 27-Jan-2015 Mech Mar 27, 1998
w = (max – min) / (no of bins) 3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997
No. of bins = 3
4 Chiru 32 30-Jan-2015 ECE Nov 25,1998
5 Amit 34 15-Jan-2015 NULL Feb 24, 1998
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
min = 5 w = (215-5)/3 = 70 6 Arish 38 15/Jan/2017 ECE 24 Dec 1999
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215] 7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997
8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999
9 Arush 28 15-Jan-2027 ECE 24 Dec 2019
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Solution Types of Data Sets
• Record
Consider that following scholar data set made available to you for carrying out some • Relational records
mining activity. List 4 potential issues with this dataset identified while preprocessing?

timeout

season
coach
Data matrix, e.g., numerical matrix, crosstabs

game
score
team

ball

lost
pla

wi
n
y
• Document data: text documents: term-frequency vector
• Transaction data
Sno Name Age DateOfJoining Branch DateOfBirth
• Graph and network Document 1 3 0 5 0 2 6 0 2 0 2
• World Wide Web
1 Anu 34 15-Jan-2015 CSE Feb 24, 1997
Document 2 0 7 0 2 1 0 0 3 0 0
• Social or information networks
2 Banu 33 27-Jan-2015 Mech Mar 27, 1998
• Molecular Structures Document 3 0 1 0 0 1 2 2 0 3 0
3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997
• Ordered
4 Chiru 32 30-Jan-2015 ECE Nov 25,1998 • Video data: sequence of images TID Items
5 Amit 34 15-Jan-2015 NULL Feb 24, 1998 • Temporal data: time-series 1 Bread, Coke, Milk
6 Arish 38 15/Jan/2017 ECE 24 Dec 1999 • Sequential Data: transaction sequences 2 Beer, Bread
• Genetic sequence data 3 Beer, Coke, Diaper, Milk
7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997
• Spatial, image and multimedia: 4 Beer, Bread, Diaper, Milk
8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999
• Spatial data: maps 5 Coke, Diaper, Milk
9 Arush 28 15-Jan-2027 ECE 24 Dec 2019 • Image data: MRI scans
Data Mining
• Video data: Movies
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Important Characteristics of Structured Data Data Objects


• Dimensionality • Data sets are made up of data objects.
• Curse of dimensionality • A data object represents an entity.
• Examples:
• Sparsity • sales database: customers, store items, sales
• Only presence counts • medical database: patients, treatments
• Resolution • university database: students, professors, courses
• Also called samples , examples, instances, data points, objects, tuples.
• Patterns depend on the scale
• Data objects are described by attributes.
• Distribution • Database rows -> data objects
• Centrality and dispersion • columns -> attributes.

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes and their types-By set of possible values.
Attribute Types
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature
of a data object.
– E.g., customer _ID, name, address Binary
Types of attributes Nominal attribute with only 2 states (0 and 1)
– Nominal :categories, states, or “names of things” Symmetric binary: both outcomes equally important
• Examples: ID numbers, eye color, zip codes, Hair_color = {auburn, black, blond, brown, grey, e.g., gender
red, white}
Asymmetric binary: outcomes not equally important.
– Ordinal :Values have a meaningful order (ranking) but magnitude between successive values is
not known. e.g., medical test (positive vs. negative)
• Examples: rankings (e.g., taste of soft drink on a scale from 1-10), grades{A,A-,B,B-,C,C- Convention: assign 1 to most important outcome (e.g., HIV positive)
,D,E}
– Interval-scaled:
• No true zero-point
• Examples: calendar dates, temperatures in Celsius or Fahrenheit.
– Ratio-scaled
• Inherent zero-point
• Examples: temperature in Kelvin, length, weight, elapsed time(e.g., time to run a race)
Data Mining
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Discrete vs. Continuous Attributes :


by the Number of Values Exercise
• Discrete Attribute
• Has only a finite or countably infinite set of values
Classify the following attributes as binary, discrete, or continuous.
• E.g., zip codes, profession, or the set of words in a collection of documents
Also classify them as qualitative (nominal or ordinal) or quantitative
• Sometimes, represented as integer variables (interval or ratio).
• Note: Binary attributes are a special case of discrete attributes • Time in terms of AM or PM.
Binary, qualitative, ordinal
• Continuous Attribute
• Angles as measured in degrees between 0◦ and 360◦.
• Has real numbers as attribute values
• E.g., temperature, height, or weight
Continuous, quantitative, ratio
• Practically, real values can only be measured and represented using a • Bronze, Silver, and Gold medals as awarded at the Olympics.
finite number of digits Discrete, qualitative, ordinal
• Continuous attributes are typically represented as floating-point variables • Colors of cars Discrete,
qualitative, nominal

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise Solution
Identify whether the attributes are categorical or numerical Identify whether the attributes are categorical or numerical
1) Aircraft manufacturers 1) Aircraft manufacturers :Categorical(nominal)
2) Enrolled in a university? 2) Enrolled in a university? :Categorical(binary)
3) Number of students 3) Number of students :Numerical(Discrete)
4) Weight 4) Weight :Numerical(Continuous)
5) Customer Satisfaction 5) Customer Satisfaction :Categorical(ordinal)
6) Monthly Income (in Rs) 6) Monthly Income (in Rs) :Numerical (Continuous)
7) Passed Entrance Exam? (Yes/No) 7) Passed Entrance Exam? (Yes/No): Categorical (Binary)
8) GPA (Grade Point Average) 8) GPA (Grade Point Average) : Numerical (Continuous)

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

Basic Statistical Descriptions of Data Measuring the Central Tendency


• Mean (algebraic measure) (sample vs. population): n
1 n w x
Note: n is sample size and N is population size.
x=  xi
n i =1 x= i =1
i i

x

n
Motivation w i =
• Weighted arithmetic mean: i =1
N
• To better understand the data: central tendency, variation and spread
• Trimmed mean: chopping extreme values
• Data dispersion characteristics • Median:
• median, max, min, quantiles, outliers, variance, etc.
• Middle value if odd number of values, or average of the
• Numerical dimensions correspond to sorted intervals middle two values otherwise
• Data dispersion: analyzed with multiple granularities of precision • Estimated by interpolation (for grouped data):

• Boxplot or quantile analysis on sorted intervals n / 2 − ( freq) l


Median

median = L1 + (
interval
) width
• Dispersion analysis on computed measures • Mode freqmedian

• Folding measures into numerical dimensions • Value that occurs most frequently in the data
• Boxplot or quantile analysis on the transformed cube • Unimodal, bimodal, trimodal
• Empirical formula: mean − mode = 3  (mean − median)
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data Measuring the Dispersion of Data
• Median, mean and mode of symmetric, • Quartiles, outliers and boxplots
positively and negatively skewed data • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
positively skewed symmetric negatively skewed
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
1 n 1 n 2 1 n n n
s2 =  ( xi − x ) 2 = n − 1 [
n − 1 i =1 i =1
xi − ( xi ) 2 ]
n i =1
2 =
1
 (x −  )
i
2
=
1
x i
2
− 2
N i =1 N i =1

Skewness > Skewness = Skewness < • Standard deviation s (or σ) is the square root of variance s2 (or σ2)
0 0 Data Mining 0 Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Boxplot Analysis Graphic Displays of Basic Statistical Descriptions

• Five-number summary of a distribution • Boxplot: graphic display of five-number summary


• Minimum, Q1, Median, Q3, Maximum
• Histogram: x-axis are values, y-axis repres. frequencies
• Boxplot
• Quantile plot: each value xi is paired with fi indicating that approximately 100 fi %
• Data is represented with a box
of data are  xi
• The ends of the box are at the first and third quartiles, i.e., the height of the box is
IQR • Quantile-quantile (q-q) plot: graphs the quantiles of one univariant distribution
• The median is marked by a line within the box against the corresponding quantiles of another
• Whiskers: two lines outside the box extended to Minimum and Maximum • Scatter plot: each pair of values is a pair of coordinates and plotted as points in the
• Outliers: points beyond a specified outlier threshold, plotted individually plane

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise Solution
Suppose that the data for analysis includes the attribute age. The Suppose that the data for analysis includes the attribute age. The
age values for the data tuples are (in increasing order) 13, 15, 16, age values for the data tuples are (in increasing order) 13, 15, 16,
16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36,
40, 45, 46, 52, 70, 87 40, 45, 46, 52, 70, 87
a) Calculate mean, and mode a) Calculate mean, and mode
b) Calculate the 5-number summary b) Calculate the 5-number summary
Mean = 32, Mode = 25, 35 (bimodal) Median is 25+30/2 = 27.5
Q1 = 20, Q3 = 35 Min = 13, Max = 87
5-number summary : (13,20,27.5,35,87)
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

BITS Pilani Similarity and Dissimilarity


Pilani|Dubai|Goa|Hyderabad

• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
3.2 Data Similarity/Dissimilarity • Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Matrix and Dissimilarity Matrix Proximity Measure for Nominal Attributes

• Data matrix  x11 ... x1f ... x1p  • Can take 2 or more states, e.g., red, yellow, blue, green
 
• n data points with p dimensions  ... ... ... ... ... 
(generalization of a binary attribute)
x x ip 
• Two modes  i1
... x if ...

 ... ... ... ... ...  • Method 1: Simple matching
x ... x nf ... x np 
 n1  • m: # of matches, p: total # of variables
• Dissimilarity matrix d (i, j) = p −
p
m
 0 
• n data points, but registers only the distance  d(2,1) 0 
• A triangular matrix
  • Method 2: Use a large number of binary attributes
 d(3,1) d ( 3,2) 0 
• Single mode   • creating a new binary attribute for each of the M nominal states
 : : : 
d ( n,1) d ( n,2) ... ... 0

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Proximity Measure for Binary Attributes Dissimilarity between Binary Variables


• A contingency table for binary data Object j

Example
Object i Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
• Distance measure for symmetric Jack M Y N P N N N
binary variables: Mary F Y N P N P N
Jim M Y P N N N N
• Distance measure for asymmetric
binary variables:
• Gender is a symmetric attribute
• Jaccard coefficient (similarity measure • The remaining attributes are asymmetric binary
for asymmetric binary variables): • Let the values Y and P be 1, and the value N 0
0+1
d ( jack , mary ) = = 0.33
2+ 0+1
◼ Note: Jaccard coefficient is the same as “coherence”: d ( jack , jim ) =
1+1
= 0.67
1+1+1
1+ 2
d ( jim , mary ) = = 0.75
1+1+ 2
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Standardizing Numeric Data Normalisation-Exercise
• Z-score: −
z = x Suppose that the minimum and maximum values for the attribute income are $12,000
• X: raw score to be standardized, μ: mean of the population, σ: and $98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to
standard deviation
value of $73,600.
• the distance between the raw score and the population mean in units
of the standard deviation
• negative when the raw score is below the mean, “+” when above Solution v − minA
v' = (new _ maxA − new _ minA) + new _ minA
• An alternative way: Calculate the mean absolute deviation maxA − minA

where
sf = 1
n (| x1 f − m f | + | x2 f − m f | +...+ | xnf − m f |)
73,600 − 12,000
mf = 1 n (x1 f + x2 f + ... + xnf )
. (1.0 − 0) + 0 = 0.716
• standardized measure (z-score): xif − m f 98,000 − 12,000
z =
• Using mean absolute deviation is more robust than using sf
if standard

deviation
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus

Normalisation-Exercise
Example: Data Matrix and Dissimilarity Matrix
Suppose that the mean and standard deviation of the values for the attribute
Data Matrix
income are $54,000 and $16,000, respectively. Apply z-score normalization
to the salary value of $73,600 point attribute1 attribute2
x1 1 2

v − A x2 3 5
v' = x3 2 0

 A
x4 4 5

Dissimilarity Matrix
73,600 − 54,000
= 1.225 (with Euclidean Distance)
16,000 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

Data Mining
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different
• Minkowski distance: A popular distance measure
between two binary vectors
d (i, j) =| x − x | + | x − x | +...+ | x − x |
i1 j1 i2 j 2 ip jp

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, • h = 2: (L2 norm) Euclidean distance
and h is the order (the distance so defined is also called L-h norm) d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j2 ip jp
• Properties
• h → . “supremum” (Lmax norm, L norm) distance.
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• This is the maximum difference between any component
• d(i, j) = d(j, i) (Symmetry) (attribute) of the vectors
• d(i, j)  d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example: Minkowski Distance (Dissimilarity Matrices) Ordinal Variables


point attribute 1 attribute 2
Manhattan (L1) L x1 x2 x3 x4
x1 1 2
x2 3 5
x1
x2
0
5 0
• An ordinal variable can be discrete or continuous
x3 2 0
• Order is important, e.g., rank
x3 3 6 0
x4 4 5 x4 6 1 7 0

• Can be treated like interval-scaled


Euclidean (L2) L2
x1
x1
0
x2 x3 x4
• replace xif by their rank rif {1,...,M f }
3.61 0
• map the range of each variable onto [0, 1] by replacing i-th object in the f-
x2
x3 2.24 5.1 0
x4 4.24 1 5.39 0
th variable by r −1
zif = if
M f −1
Supremum L x1 x2 x3 x4
x1 0
x2 3 0
x3 2 5 0 • compute the dissimilarity using methods for interval-scaled variables
x4 3 1 5 0
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric,
ordinal
• One may use a weighted formula to combine their effects
 pf = 1 ij( f ) dij( f )
d (i, j) =
 pf = 1 ij( f )

• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
Calculate the dissimilarity matrix and similarity matrix for the ordinal attributes
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and r −1
• Treat zif as interval-scaled
zif =
if

M −1 f
Data Mining
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Cosine Similarity Example: Cosine Similarity


• A document can be represented by thousands of attributes, each recording the frequency of a
particular word (such as keywords) or phrase in the document.
• cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d|: the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
• Other vector objects: gene features in micro-arrays, … d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ... d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| , ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
where • indicates vector dot product, ||d||: the length of vector d cos(d1, d2 ) = 0.94

Data Mining Data Mining


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Exercise Exercise

For the following vectors x and y, calculate the indicated similarity or distance For the following vectors x and y, calculate the indicated similarity or distance
measures: measures:
x =(1,1,1,1), y=(2,2,2,2) cosine, Euclidian x =(0,1,0,1), y=(1,0,1,0) Cosine, Correlation ,Euclidian,Jacquard
Solution: cos(x, y) = 1, Euclidean(x, y) = 2 Solution: cos(x, y) = 0,Corr(x,y)=-3 Euclidean(x, y) = 2,Jacquard(x,y)=0

BITS Pilani, Pilani Campus BITS Pilani, Pilani Campus

BITS Pilani Data Visualization


Pilani|Dubai|Goa|Hyderabad

Data visualization is the art and practice of gathering, analyzing, and graphically representing
empirical information.

• Why data visualization?


• Gain insight into an information space by mapping data onto graphical primitives
• Provide qualitative overview of large data sets
• Search for patterns, trends, structure, irregularities, relationships among data
• Help find interesting regions and suitable parameters for further quantitative analysis
• Provide a visual proof of computer representations derived
3.3 Data Visualization

Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualization
Visualization techniques
Another Defination: Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the data and the
Based on
relationships among data items or attributes can be analyzed or
• Visualization of a small number of attributes,
reported. • Stem and Leaf Plots
• Histograms
• Box Plots
Visualization of data is one of the most powerful and appealing • Pie charts
• Scatter Plots etc
techniques for data exploration.
• Visualization of data with spatial and/or temporal attributes,
– Humans have a well developed ability to analyze large amounts of information • Contour Plots
that is presented visually • Surface Plots
– Can detect general patterns and trends • Visualization of data with many attributes
• Data Matrix
– Can detect outliers and unusual patterns • Chernoff Faces
• Stick Figures

Data Mining
02/03/2018 Introduction to Data Mining 49
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Iris Sample Data Set Visualization Techniques: Histograms

Many of the exploratory data techniques are illustrated Histogram


with the Iris Plant data set. – Usually shows the distribution of values of a single variable
– Can be obtained from the UCI Machine Learning Repository – Divide the values into bins and show a bar plot of the number of
[Link] objects in each bin.
– From the statistician Douglas Fisher – The height of each bar indicates the number of objects
– Three flower types (classes): – Shape of histogram depends on the number of bins
◆Setosa Example: Petal Width (10 and 20 bins, respectively)
◆ Virginica
◆ Versicolour

– Four (non-class) attributes


◆ Sepal width and length
◆ Petal width and length Virginica. Robert H. Mohlenbrock. USDA
NRCS. 1995. Northeast wetland flora: Field
office guide to plant species. Northeast National
Technical Center, Chester, PA. Courtesy of
USDA NRCS Wetland Science Institute.
02/03/2018 Introduction to Data Mining 51 02/03/2018 Introduction to Data Mining 52
Visualization Techniques: Box Plots Example of Box Plots
outlier
Box Plots Box plots can be used to compare attributes
– Invented by J.
90th percentile
Tukey
– Another way of
displaying the
distribution of 75th percentile
data
50th percentile
– Following figure 25th percentile
shows the basic
part of a box
plot
10th percentile

02/03/2018 Introduction to Data Mining 53 02/03/2018 Introduction to Data Mining 54

Scatter Plot Array of Iris Attributes


Scatter Plots

A scatter plot displays 2-D data points


using Cartesian coordinates.
A third dimension can be added using
different colors or shapes to represent
different data points
Through this visualization, in the
adjacent figure, we can see that points
of types "+" and "×" tend to be
colocated

Data Mining
02/03/2018 Introduction to Data Mining 56
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example : Stem and Leaf Plot Stem and Leaf Plot
4 : 34444566667788888999999
The set of integers shown below is the sepal length in centimeters 5 : 0000000000111111111222234444445555555666666777777778888888999
(multiplied by 10 to make the values integers) taken from the Iris data set.
6 : 000000111111222233333333344444445555566777777778889999
The values have also been sorted.
7 : 0122234677779
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53
4 : 3444
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57
4 : 566667788888999999
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61
5 : 000000000011111111122223444444
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64
5 : 5555555666666777777778888888999
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70
6 : 00000011111122223333333334444444
71 72 72 72 73 74 76 77 77 77 77 79
6 : 5555566777777778889999
7 : 0122234
Data Mining 7 : 677779 Data Mining
Source : Figure 3.4. Sepal length data from the Iris data set[ Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson Education ] BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Source : Figure 3.5., Stem and leaf plot for the sepal length from the Iris data set. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Example: Pie Chart Scatterplot Matrices


A pie chart that shows the distribution of Iris species in the Iris data set. In this case,
all three flower types have the same frequency.
The scatter-plot matrix is an
extension to the scatter plot.

For k-dimensional data a


minimum of (k2-k)/2 scatterplots
of 2D will be required.

There can be maximum of k2


plots of 2D

In the adjoining figure , there are


• Used less frequently in technical publications because the size of relative areas can be hard to k2 plots. Out of these, k are X-X
judge. plots, and all X-Y plots (where X,
Y are distinct dimensions) are
given in 2 orientations (X vs Y and
Y vs, X)
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Other Visualization Techniques
Example
Star Plots
The correlation matrix for the Iris data set.
– Similar approach to parallel coordinates, but axes
radiate from a central point
The rows and columns are organized so that all the – The line connecting the values of an object is a
polygon
flowers of a particular species are together
Chernoff Faces
– Approach created by Herman Chernoff
– This approach associates each attribute with a
characteristic of a face
– The values of each attribute determine the appearance
of the corresponding facial characteristic
– Each object becomes a separate face
– Relies on human’s ability to distinguish faces
Data Mining
02/03/2018 Introduction to Data Mining 62
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Chernoff Faces
Stick Figure
• A way to display variables on a two- A census data figure showing age,
dimensional surface, e.g., let x be income, gender, education
eyebrow slant, y be eye size, z be nose
length, etc.
A 5-piece stick figure (1 body and
4 limbs w. different angle/length)
• The figure shows faces produced
Age, income are indicated by
using 10 characteristics--head
position of the figure.
eccentricity, eye size, eye spacing, eye
eccentricity, pupil size, eyebrow slant, Gender, education are indicated
nose size, mouth shape, mouth size, by angle/length.
and mouth opening): Each assigned
Visualization can show a texture
one of 10 possible values.
pattern
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualizing Complex Data and Relations
Exercise
You are analysing the behaviour of students across 10 variables like Most visualization techniques were
mainly for numeric data. Recently,
attendance, assignment scores, participation, etc. more and more non-numeric data,
such as text and social networks,
Which visualization technique would you choose: Bar chart, Chernoff have become available.
faces, or Pie chart, Scatter Plot? Justify your answer. Many people on the Web tag various
objects such as pictures, blog entries,
and product reviews.
A tag cloud is a visualization of
statistics of user-generated tags.
Often, in a tag cloud, tags are listed
alphabetically or in a user-preferred
order.
The importance of a tag is indicated
by font size or color
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Visualization do’s and don’ts.


• ACCENT Principles The following are the ACCENT principles for effective graphical display put forth by D. A.
Burn (as adapted by Michael Friendly): Tufte’s Guidelines Edward R. Tufte has also enumerated the following
• Apprehension Ability to correctly perceive relations among variables. Does the graph maximize principles for graphical excellence:
apprehension of the relations among variables? ▪ Graphical excellence is the well-designed presentation of interesting data—a
• Clarity Ability to visually distinguish all the elements of a graph. Are the most important elements or matter of substance, of statistics, and of design.
relations visually most prominent?
▪ Graphical excellence consists of complex ideas communicated with clarity,
• Consistency Ability to interpret a graph based on similarity to previous graphs. Are the elements, symbol precision, and efficiency.
shapes, and colors consistent with their use in previous graphs?
▪ Graphical excellence is that which gives to the viewer the greatest number of
• Efficiency Ability to portray a possibly complex relation in as simple a way as possible. Are the elements of ideas in the shortest time with the least ink in the smallest space.
the graph economically used? Is the graph easy to interpret?
▪ Graphical excellence is nearly always multivariate.
• Necessity The need for the graph, and the graphical elements. Is the graph a more useful way to represent
the data than alternatives (table, text)? Are all the graph elements necessary to convey the relations? ▪ And graphical excellence requires telling the truth about the data
• Truthfulness Ability to determine the true value represented by any graphical element by its magnitude
relative to the implicit or explicit scale.
• Are the graph elements accurately positioned andData
scaled?
Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
References and Textbooks

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers

BITS Pilani, Pilani Campus

You might also like