Data Mining Techniques and Applications
Data Mining Techniques and Applications
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
CS1 CS4
Why Data Mining?
Introduction to Classification
What Is Data Mining? Introduction to Classification
Applications of Data Mining Decision Tree
Example: Classification, Clustering, Association mining Rule Based Classifier
Bayes Classifier, ANN, SVM, Ensemble Methods
CS2
Data Preprocessing Concepts
Data Cleaning, Integration, reduction, transformation and discretization
Data preprocessing techniques
5 Prediction CS3
Data Description
Data Similarity/Dissimilarity
Data Visualization
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Introduction
Applications
Example -1 Solution
Example -1
(10, 0.5) and (25, 1.20) are actual purchase amount and tax
amount pairs. Predict the amount of the tax on the purchase
of amount 55 using linear regression. (10, 0.5) and (25, 1.20) are actual purchase
amount and tax amount pairs. Predict the
amount of the tax on the purchase of amount
55 using linear regression.
Example -2 Example -2
Slope (𝜷𝟏)
• Exam score (Y) is expected to increases by 3.3 units for each 1 hour increase of study hours (X)
Intercept (𝜷𝟎)
• Exam score (Y) is 28.6 when you study 0 hours
• Regression and model trees tend to be more accurate than linear regression when the data are
not represented well by a simple linear model
THANK YOU!
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Mining
Association Analysis Concepts
BITS Pilani and Apriori Algorithm
Pilani|Dubai|Goa|Hyderabad M6: Association Analysis and Apriori Algorithm
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Association Analysis Association Analysis Contd.
• Association analysis measures the strength of co- • Association algorithms are widely used in retail analysis
occurrence between one item and another. of transactions, recommendation engines, and online
clickstream analysis across web pages.
• The objective of this class of data mining algorithms is not
to predict an occurrence of an item, like classification or • One of the popular applications of this technique is called
regression do, but to find usable patterns in the co- market basket analysis, which finds co-occurrences of one
occurrences of the items. retail item with another item within the same retail
purchase transaction.
• Association rules learning is a branch of an unsupervised
learning process that discovers hidden patterns in data, in • retailer can take advantage of this association for bundle
the form of easily recognizable rules. pricing, product placement, and even shelf space
optimization within the store layout.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules Frequent Itemset Generation
null
• Two-step approach:
A B C D E
1. Frequent Itemset Generation
– Generate all itemsets whose support minsup
AB AC AD AE BC BD BE CD CE DE
2. Rule Generation
– Generate high confidence rules from each
frequent itemset, where each rule is a binary ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
•
ABCD ABCE ABDE ACDE BCDE
Frequent itemset generation is still computationally Given d items, there
expensive are 2d possible
ABCDE candidate itemsets
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reducing Number of Candidates Illustrating Apriori Principle null
✓Apriori principle: A B C D E
support measure:
• Support of an itemset never exceeds the support of its ABCD ABCE ABDE ACDE BCDE
subsets
Pruned
• This is known as the anti-monotone property of support
ABCDE
supersets
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Support Counting
Reducing Number of Comparisons
✓Candidate counting:
• Scan the database of transactions to determine the support of each candidate
itemset
• To reduce the number of comparisons, store the candidates in a hash
structure
• Instead of matching each transaction against every candidate, match it
against candidates contained in the hashed buckets
Transactions Hash Structure
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
N 3 Milk, Diaper, Beer, Coke k
4 Bread, Milk, Diaper, Beer
✓ Given t = {1,2,3,5,6}, all the 3-itemsets contained in t must begin with item 1, 2, or 3.
5 Bread, Milk, Diaper, Coke ✓ It is not possible to construct a 3-itemset that begins with items 5 or 6 because there
are only two items in t whose labels are greater than or equal to 5.
Buckets
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Support Counting Using a Hash Tree Support Counting Using a Hash Tree
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hashing a transaction at the root node of a hash Generate Hash Tree
tree
Suppose you have 15 candidate itemsets of length 3:
{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5
6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}
You need:
In this example, 5
out of the 9 leaf nodes are • Hash function . Let h(p)= (p-1) mod 3
visited and 9 out of the 15
itemsets are compared • Max leaf size: max number of itemsets stored in a leaf node (if number of
against the transaction. candidate itemsets exceeds max leaf size, split the node)
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Discovery: Hash tree Association Rule Discovery: Hash tree Contd.
2,5,8 2,5,8
234
234
567
567
145 136
345 356 367 145 136
Hash on Hash on 345 356 367
357 368
1, 4 or 7 2, 5 or 8 357 368
124 159 689
125 124 159 689
457 125
458 457 458
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Discovery: Hash tree Contd. Subset Operation
Hash Function Candidate Hash Tree Given a transaction t, what are Transaction, t
the possible subsets of size 3?
1 2 3 5 6
1,4,7 3,6,9
Level 1
2,5,8
1 2 3 5 6 2 3 5 6 3 5 6
234
567 Level 2
145 136 12 3 5 6 13 5 6 15 6 23 5 6 25 6 35 6
345 356 367
Hash on
357 368
3, 6 or 9
124 159 689
125 123
135 235
457 125 156 256 356
458 126
136 236
1+ 2356 1+ 2356
2+ 356 1,4,7 3,6,9 2+ 356 1,4,7 3,6,9
12+ 356 2,5,8
2,5,8
3+ 56 3+ 56
13+ 56
234 234
15+ 6 567
567
1+ 2356 1+ 2356
2+ 356 1,4,7 3,6,9 2+ 356 1,4,7 3,6,9
12+ 356 2,5,8
12+ 356 2,5,8
3+ 56 3+ 56
13+ 56 13+ 56
234 234
15+ 6 567 15+ 6 567
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation Contd. Rule Generation Contd.
TID Items
1 Bread, Milk
2 Bread, Diaper, Beer, Eggs
3 Milk, Diaper, Beer, Coke
For the itemset { Bread, Milk} 4 Bread, Milk, Diaper, Beer
5 Bread, Milk, Diaper, Coke
Some Association Rules are:
i) Bread → Milk, conf= Supcount({Bread,
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-1
Examples
On
Apriori Algorithm
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-2 Example-2 Contd..
Step 1:
i) Create 1-Itemset candidates and calculate support for all the items.
ii) Perform pruning to create L1 Frequent Itemset. In pruning, we will filter
out all items with Support less than the min_supp value(30%=3).
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Step 3:
i) Create 3-Itemset candidates from L2 Frequent Itemset and calculate
support for all of them.
ii) Perform pruning to create L3 Frequent Itemset. But here if you see,
Support is less than min_supp value(30) for all the 3-itemsets so we cannot
go any forward and need to find Association Rules from L2 Frequent
Step 2: Itemset only.
i) Create 2-Itemset candidates from L1 Frequent
Itemset and calculate support for all of them.
ii) Perform Pruning to create L2 Frequent Itemset. As
before, we will again filter out all the Itemsets with
Support less than min_supp value(30%=3).
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-2 Contd.. Example-2 Contd..
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example-3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Mining
M7: Association Analysis and Apriori Algorithm – Part 2
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
➢ Once you join the session, continue till the end to consider you as
present in the class.
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Construct FP-tree from a Transaction Database Find Patterns Having P From P-conditional Database
{}
Header Table • Starting at the frequent item header table in the FP-tree
min_support = 3 • Traverse the FP-tree by following the link of each frequent item p
Item frequency head f:4 c:1
F-list=f-c-a-b-m-p f 4 • Accumulate all of transformed prefix paths of item p to form p’s
conditional pattern base
c 4 c:3 b:1 b:1
a 3 {}
b 3 a:3 p:1 Header Table
m 3 Conditional pattern bases
p 3 Item frequency head f:4 c:1
m:2 b:1 item cond. pattern base
f 4
c 4 c:3 b:1 b:1 c f:3
TID Items bought (ordered) frequent items p:2 m:1 a 3 a fc:3
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} b 3 a:3 p:1
200 {a, b, c, f, l, m, o} {f, c, a, b, m} m 3 b fca:1, f:1, c:1
300 {b, f, h, j, o, w} {f, b} p 3 m:2 b:1 m fca:2, fcab:1
400 {b, c, k, s, p} {c, b, p}
p fcam:2, cb:1
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p} p:2 m:1
12
Data Mining Data Mining
November 11, 2025
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 November 11, 2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Benefits of the FP-tree Structure
Partition Patterns and Databases
• Completeness • Frequent patterns can be partitioned into subsets
• Preserve complete information for frequent pattern according to f-list
mining • F-list=f-c-a-b-m-p
• Never break a long pattern of any transaction • Patterns containing p
• Compactness • Patterns having m but no p
• Reduce irrelevant info—infrequent items are gone
•…
• Items in frequency descending order: the more
frequently occurring, the more likely to be shared • Patterns having c but no a nor b, m, p
• Never be larger than the original database (not • Pattern f
count node-links and the count field)
• Completeness and non-redundency
From Conditional Pattern-bases to Conditional FP-trees Recursion: Mining Each Conditional FP-tree
• For each pattern-base {}
• Accumulate the count for each item in the base Cond. pattern base of “am”: (fc:3) f:3
{}
• Construct the FP-tree for the frequent items of the
pattern base c:3
f:3
am-conditional FP-tree
m-conditional pattern base:
{} c:3 {}
fca:2, fcab:1
Header Table
Item frequency head All frequent a:3 Cond. pattern base of “cm”: (f:3)
f:4 c:1 patterns relate to m f:3
f 4 {} m-conditional FP-tree
c 4 c:3 b:1 b:1 m, cm-conditional FP-tree
a 3 fm, cm, am,
f:3
b 3 a:3 p:1 fcm, fam, cam, {}
m 3 c:3 fcam
p 3 m:2 b:1 Cond. pattern base of “cam”: (f:3)
f:3
p:2 m:1 a:3
m-conditional FP-tree cam-conditional FP-tree
15
Data Mining Data Mining
November 11, 2025
November 11, 2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Frequent Patterns With FP-trees
A Special Case: Single Prefix Path in FP-tree
• Suppose a (conditional) FP-tree T has a shared single
• Idea: Frequent pattern growth
prefix-path P
• Recursively grow frequent patterns by pattern and
• Mining can be decomposed into two parts database partition
{} • Reduction of the single prefix path into one node • Method
• Concatenation of the mining results of the two parts • For each frequent item, construct its conditional
a1:n1
pattern-base, and then its conditional FP-tree
a2:n2 • Repeat the process on each newly created conditional
a3:n3 {} r1 FP-tree
• Until the resulting FP-tree is empty, or it contains
a1:n1 only one path—single path will generate all the
b1:m1 C1:k1 r1 =
a2:n2
+ b1:m1 C1:k1
combinations of its sub-paths, each of which is a
frequent pattern
C2:k2 C3:k3 a3:n3 C2:k2 C3:k3
17
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• Given a frequent itemset L, find all non-empty subsets • How to efficiently generate rules from frequent
f L such that f → L – f satisfies the minimum itemsets?
confidence requirement • In general, confidence does not have an anti-
• If {A,B,C,D} is a frequent itemset, candidate rules: monotone property
ABC →D, ABD →C, ACD →B, BCD →A, c(ABC →D) can be larger or smaller than c(AB →D
A →BCD, B →ACD, C →ABD, D →ABC • But confidence of rules generated from the same
AB →CD, AC → BD, AD → BC, BC →AD, itemset has an anti-monotone property
BD →AC, CD →AB,
• e.g., L = {A,B,C,D}:
c(ABC → D) c(AB → CD) c(A → BCD)
• If |L| = k, then there are 2k – 2 candidate association
rules (ignoring L → and → L)
• Confidence is anti-monotone w.r.t. number of items on
the RHS of the rule
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation for Apriori Algorithm Rule Generation with Apriori Algorithm
Lattice of rules
Low
ABCD=>{ }
• Candidate rule is generated by merging two rules that
Confidence share the same prefix
Rule
BCD=>A ACD=>B ABD=>C ABC=>D
in the rule consequent CD=>AB BD=>AC
• join(CD=>AB,BD=>AC)
would produce the candidate
CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD rule D => ABC
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Minimum Support Pattern Evaluation
• How to apply multiple minimum supports? • Association rule algorithms tend to produce too many
• MS(i): minimum support for item i rules
• e.g.: MS(Milk)=5%, MS(Coke) = 3%, • many of them are uninteresting or redundant
MS(Broccoli)=0.1%, MS(Salmon)=0.5% • Redundant if {A,B,C} → {D} and {A,B} → {D}
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli)) have same support & confidence
= 0.1%
• Interestingness measures can be used to prune/rank
• Challenge: Support is no longer anti-monotone the derived patterns
• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%
• In the original formulation of association rules,
support & confidence are the only measures used
• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli}
is frequent
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Statistical Independence Statistical-based Measures
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Y Y Y Y
Coffee Coffee X 10 0 10 X 90 0 90
Tea 15 5 20 X 0 90 90 X 0 10 10
Tea 75 5 80 10 90 100 90 10 100
90 10 100
0.1 0.9
Association Rule: Tea → Coffee Lift = = 10 Lift = = 1.11
(0.1)(0.1) (0.9)(0.9)
Confidence= P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9 Statistical independence:
Lift = 0.75/0.9= 0.8333 (< 1, therefore is negatively associated) If P(X,Y)=P(X)P(Y) => Lift = 1
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Subjective Interestingness Measure Interestingness via Unexpectedness
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
SSWT ZC 425
Data Mining
SSWT ZC 425
BITS Pilani M1: Introduction to Data Mining ◼ M1: Introduction to Data Mining
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Text Book(s)
◼ T1 :Tan P. N., Steinbach M & Kumar V. “Introduction to
Data Mining” Pearson Education, 2014
◼ R2: Practical Text Mining and Statistical Analysis for WASE 2022_SSWT ZC 425_ Data Mining
Non-structured Text Data Applications by Gary Miner Handout
et al. Academic Press © 2012
◼ Patterns and knowledge to be used or stored into Data Preprocessing/Integration, Data Warehouses
DBA
knowledge-base Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems
15 16
KDD Process: A Typical View from ML and
Example: Mining vs. Data Exploration
Statistics
◼ Business intelligence view
◼ Warehouse, data cube, reporting but not much mining Input Data Data Pre- Data Post-
Processing Mining Processing
◼ Business objects vs. data mining tools
◼ Supply chain example: tools
◼ Data presentation
Data integration Pattern discovery Pattern evaluation
◼ Exploration Normalization Association & correlation Pattern selection
Feature selection Classification Pattern interpretation
Clustering
Dimension reduction Pattern visualization
Outlier analysis
…………
17 18
◼ Summary
19 20
Multi-Dimensional View of Data Mining Chapter 1. Introduction
◼ Data to be mined
◼ Why Data Mining?
◼ Database data (extended-relational, object-oriented, heterogeneous,
legacy), data warehouse, transactional data, stream, spatiotemporal, ◼ What Is Data Mining?
time-series, sequence, text and web, multi-media, graphs & social
and information networks ◼ A Multi-Dimensional View of Data Mining
◼ Knowledge to be mined (or: Data mining functions) ◼ What Kind of Data Can Be Mined?
◼ Characterization, discrimination, association, classification,
clustering, trend/deviation, outlier analysis, etc. ◼ What Kinds of Patterns Can Be Mined?
◼ Descriptive vs. predictive data mining
◼ What Technology Are Used?
◼ Multiple/integrated functions and mining at multiple levels
◼ What Kind of Applications Are Targeted?
◼ Techniques utilized
◼ Data-intensive, data warehouse (OLAP), machine learning, statistics, ◼ Major Issues in Data Mining
pattern recognition, visualization, high-performance, etc.
◼ A Brief History of Data Mining and Data Mining Society
◼ Adopted by Applications
◼ Retail, telecommunication, banking, fraud analysis, bio-data mining, ◼ Summary
stock market analysis, text mining, Web mining, etc.
21 22
◼ Time-series data, temporal data, sequence data (incl. bio-sequences) ◼ What Kind of Data Can Be Mined?
◼ Structure data, graphs, social networks and multi-linked data ◼ What Kinds of Patterns Can Be Mined?
◼ Object-relational databases
◼ What Technology Are Used?
◼ Heterogeneous databases and legacy databases
◼ Spatial data and spatiotemporal data ◼ What Kind of Applications Are Targeted?
Data Mining Function: (3) Classification Data Mining Function: (4) Cluster Analysis
◼ Classification and label prediction ◼ Unsupervised learning (i.e., Class label is unknown)
◼ Construct models (functions) based on some training examples ◼ Group data to form new categories (i.e., clusters), e.g.,
◼ Describe and distinguish classes or concepts for future prediction cluster houses to find distribution patterns
◼ E.g., classify countries based on (climate), or classify cars
◼ Principle: Maximizing intra-class similarity & minimizing
based on (gas mileage)
interclass similarity
◼ Predict some unknown class labels
◼ Typical methods ◼ Many methods and applications
◼ Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
◼ Typical applications:
◼ Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
27 28
Time and Ordering: Sequential Pattern,
Data Mining Function: (5) Outlier Analysis
Trend and Evolution Analysis
◼ Outlier analysis
◼ Sequence, trend and evolution analysis
◼ Outlier: A data object that does not comply with the general
◼ Trend, time-series, and deviation analysis: e.g.,
behavior of the data
regression and value prediction
◼ Noise or exception? ― One person’s garbage could be another
◼ Sequential pattern mining
person’s treasure
◼ Methods: by product of clustering or regression analysis, … ◼ e.g., first buy digital camera, then buy large SD
◼ Similarity-based analysis
29 30
◼ Summary
33 34
Major Issues in Data Mining (1) Major Issues in Data Mining (2)
39 40
Chapter 1. Introduction A Brief History of Data Mining Society
Conferences and Journals on Data Mining Where to Find References? DBLP, CiteSeer, Google
◼ Why Data Mining? ◼ Data mining: Discovering interesting patterns and knowledge from
massive amount of data
◼ What Is Data Mining?
◼ A natural evolution of database technology, in great demand, with
◼ A Multi-Dimensional View of Data Mining wide applications
◼ What Kind of Data Can Be Mined? ◼ A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
◼ What Kinds of Patterns Can Be Mined?
knowledge presentation
◼ What Technology Are Used? ◼ Mining can be performed in a variety of data
◼ What Kind of Applications Are Targeted? ◼ Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis, etc.
◼ Major Issues in Data Mining
◼ Data mining technologies and applications
◼ A Brief History of Data Mining and Data Mining Society
◼ Major issues in data mining
◼ Summary
45 46
◼
Discovery, Morgan Kaufmann, 2001
J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd ed., 2011
D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001
Queries ??
◼ T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference,
and Prediction, 2nd ed., Springer-Verlag, 2009
◼ B. Liu, Web Data Mining, Springer 2006.
◼ T. M. Mitchell, Machine Learning, McGraw Hill, 1997
◼ G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
◼ P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
◼ S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998
◼ I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java
Implementations, Morgan Kaufmann, 2nd ed. 2005
Classification: Definition
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Task:
Learn a model that maps each attribute set x into one of the predefined class labels y
Classification Concepts
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Pilani Campus
Classification Classification vs. Prediction
• Classification involves dividing up objects so that each is assigned to one of a number of mutually • Classification
exhaustive and exclusive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
• predicts categorical class labels (discrete or nominal)
‒ customers who are likely to buy or not buy a particular product in a supermarket • classifies data (constructs a model) based on the training set and
‒ people who are at high, medium or low risk of acquiring a certain illness the values (class labels) in a classifying attribute and uses it in
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees classifying new data
‒ people who closely resemble, slightly resemble or do not resemble someone seen committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12 months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months • Prediction
‒
• models continuous-valued functions, i.e., predicts unknown or
people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very unlikely).
missing values
November 11, 2025 Data Mining November 11, 2025 Data Mining
6 7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Approach for Building
Classification Model Lazy vs. Eager Learning
• Lazy vs. eager learning
Tid
1
Attrib1
Yes
Attrib2
Large
Attrib3
125K
Class
No
Learning
algorithm
• Lazy learning (e.g., instance-based learning): Simply stores training data (or only
2
3
No
No
Medium
Small
100K
70K
No
No
minor processing) and waits until it is given a test tuple
4
5
Yes
No
Medium
Large
120K
95K
No
Yes
Induction • Eager learning (the methods mentioned in next slide): Given a set of training set,
6 No Medium 60K No constructs a classification model before receiving new (e.g., test) data to classify
7 Yes Large 220K No Learn
8
9
No
No
Small
Medium
85K
75K
Yes
No
Model • Lazy: less time in training but more time in predicting
10
10 No Small 90K Yes
Model • Accuracy
Training Set
Apply • Lazy method effectively uses a richer hypothesis space since it uses many local linear
Tid Attrib1 Attrib2 Attrib3 Class Model functions to form its implicit global approximation to the target function
• Eager: must commit to a single hypothesis that covers the entire instance space
11 No Small 55K ?
15 No Large 67K ?
10
Test Set
C l a s s i f i c a t i o n Te c h n i q u e s
Instance-Based Methods
• Typical approaches • Decision Tree based Methods
• k-nearest neighbor approach
• Rule-based Methods
• Instances represented as points in a Euclidean space.
• Locally weighted regression • Neural Networks
• Constructs local approximation - computational networks that simulate the decision process in neurons (networks of nerve cell)
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference • Naïve Bayes and Bayesian Belief Networks
- uses the probability theory to find the most likely of the possible classifications
• Given a tuple for which the associated class label is unknown, the attribute values of the tuple are
tested against the decision tree.
• A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
12 / 67 13 / 67
2 No Married 100K No
Yes No
3 No Single 70K No
NO MarSt
4 Yes Married 120K No NO MarSt
Single, Divorced Married
5 No Divorced 95K Yes Single, Divorced Married
6 No Married 60K No
Income NO Income NO
7 Yes Divorced 220K No
< 80K > 80K < 80K > 80K
8 No Single 85K Yes
9 No Married 75K No NO YES
NO YES
10 No Single 90K Yes
10
Status
Defaulted = No
into smaller subsets. Single,
Divorced
Married
Annual
Income
Recursively apply the procedure to each Defaulted = Yes Defaulted = No < 80K >= 80K
(c) (d)
18 / 67 BITS Pilani, Pilani Campus
D ESIGN I SSUES OF D ECISION T REE I NDUCTION T EST C ONDITION FOR N OMINAL ATTRIBUTES
Marital
• Splitting Criterion Status
property
{Small, {Medium,
Large} Extra Large}
20 / 67 21 / 67
Binary Decision :
Discretization :
• consider all possible splits and
• to form an ordinal categorical attribute Which test condition is the best?
finds the best cut
• can be more compute intensive
22 / 67 23 / 67
S ELECTING AN ATTRIBUTE T EST C ONDITION :
H OW TO DETERMINE THE B EST S PLIT I MPURITY M EASURE FOR A S INGLE N ODE
𝑐−1
Impurity is a measure of homogeneity of the labels at the node at hand
𝐺𝑖𝑛𝑖 𝐼𝑛𝑑𝑒𝑥 = 1 − 𝑝𝑖 𝑡 2
Greedy approach: Where 𝒑𝒊 𝒕 is the frequency of
Nodes with purer class distribution are preferred 𝑖=0 class 𝒊 at node t, and 𝒄 is the total
Need a measure of node impurity: 𝑐−1
number of classes
𝐸𝑛𝑡𝑟𝑜𝑝𝑦 = − 𝑝𝑖 𝑡 𝑙𝑜𝑔2 𝑝𝑖 (𝑡)
C0: 5 C0: 9 𝑖=0
C1: 5 C1: 1
Measures seek to determine which variable would split the data to lead to the
High degree of impurity/ Relating it to Low degree of impurity / underlying child nodes being most homogenous or pure.
information theory: Relating it to information theory:
Low information content High information content In other words splitting on attribute should reduce impurity
e.g It rained heavily in Shillong yesterday e.g There was a heavy rainfall in Rajasthan last night.
24 / 67 25 / 67
26 / 67 27 / 67
S ELECTING AN ATTRIBUTE T EST C ONDITION :
F INDING THE B EST S PLIT D ECISION T REE - ENTROPY
Construct a decision tree for the given dataset using Information gain.
• Compute impurity measure of the parent before splitting (P) Outlook Temp Humidity Windy Play Golf
Rainy Hot High False No
• Compute the collective impurity measure of children after splitting(M) Rainy Hot High True No
• Choose the attribute test condition that produces the highest gain Overcast Hot High False Yes
Gain = P – M Sunny Mild High False Yes
• or equivalently, lowest impurity measure after splitting (M) Sunny Cool Normal False Yes
• Information gain is the expected reduction in entropy caused by partitioning Sunny Cool Normal True No
the examples on an attribute. Overcast Cool Normal True Yes
• Maximizing the gain == minimizing the weighted average impurity measure of Rainy Mild High False No
Rainy Cool Normal False Yes
children nodes Sunny Mild Normal False Yes
Choose the split that achieves most Rainy Mild Normal True Yes
𝑘
𝑛𝑖 reduction (maximizes GAIN) Overcast Mild High True Yes
𝐺𝑎𝑖𝑛𝑠𝑝𝑙𝑖𝑡 = 𝐸𝑛𝑡𝑟𝑜𝑝𝑦 𝑝 − 𝐸𝑛𝑡𝑟𝑜𝑝𝑦(𝑖) Used in the ID3 and C4.5 decision Overcast Hot Normal False Yes
𝑛
𝑖=1 tree algorithms Sunny Mild High True No
28 / 67 29 / 48
30 / 48 31 / 48
D ECISION T REE - ENTROPY D ECISION T REE - ENTROPY
Step 3:
Choose attribute with the largest information gain as the decision node, divide the
dataset by its branches and repeat the same process on every branch.
Outlook has the max gain. So choose Outlook as the root decision point.
A branch with entropy of 0 is a leaf node.
A branch with entropy more than 0 needs further splitting.
32 / 48 33 / 48
34 / 48 35 / 48
Splitting Criteria based on Classification
Comparison among Splitting Criteria
Error
• Classification error at a node t : For a 2-class problem:
36 37
11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
38 / 67 39 / 67
Dealing with continuous-valued attributes Dealing with continuous-valued attributes
• How to select the best value for the threshold c?
60 65 67 68 70 71 72 72 75 80 81 83 85 85
The candidate thresholds identified are:
• 62.5 ,66,70.5,71.5,73.5,80.5,84 Y N Y Y Y N Y Y N N Y Y N Y
40 / 67 41 / 67
S t o p p i n g C r i t e r i a f o r Tr e e I n d u c t i o n D e c i s i o n Tr e e B a s e d C l a s s i f i c a t i o n
• Stop expanding a node when all the records belong to the same class
• Advantages:
• Inexpensive to construct
• Stop expanding a node when all the records have similar attribute values • Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Early termination • Accuracy is comparable to other classification techniques for many simple data sets
42
11/11/2025 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree:Overfitting
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification Training errors (apparent errors)--Errors committed on the training set
Underfitting vs. Overfitting Test errors -- Errors committed on the test set
• Underfitting results in decision trees that are too simple to solve the problem. They may offer superior interpretability.
Generalization errors -- Expected error of a model over random selection of
• Overfitting results in decision trees that are more complex than necessary records from same distribution
• Training error no longer provides a good estimate of how well the tree will perform on previously unseen records
• Need new ways for estimating errors
46 / 67
Decision Tree :Handling Overfitting Decision Tree :Handling Overfitting
Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree Post Pruning :Idea
• Typical stopping conditions for a node: • Post construction, scan the tree bottom-up
• Stop if all instances belong to the same class • At every decision node
• Stop if all the attribute values are the same • Retain the attribute node & evaluate it against the prune set (validation set)
More restrictive conditions: • Remove the attribute node & reevaluate it with the same prune set
• Stop if number of instances is less than some user-specified threshold • If there is a reduction in error, prune the node else retain the node
• Stop if expanding the current node does not improve impurity measures (e.g., Gini • Repeat this in other branches of the tree
or information gain).
• Stop if estimated generalization error falls below certain threshold
C l a s s i f i c a t i o n Te c h n i q u e : R u l e - B a s e d
Rule-based Classifier (Example)
Classifier
Name Blood Type Give Birth Can Fly Live in Water Class
• Rule: (Condition) → y
frog cold no no sometimes amphibians
where komodo
bat
cold
warm
no
yes
no
yes
no
no
reptiles
mammals
•
leopard shark cold yes no yes fishes
y is the class label turtle
penguin
cold
warm
no
no
no
no
sometimes
sometimes
reptiles
birds
• LHS: rule antecedent or condition porcupine
eel
warm
cold
yes
no
no
no
no
yes
mammals
fishes
•
salamander cold no no sometimes amphibians
RHS: rule consequent gila monster cold no no no reptiles
platypus warm no no no mammals
• Examples of classification rules: owl
dolphin
warm
warm
no
yes
yes
no
no
yes
birds
mammals
• (Blood Type=Warm) (Lay Eggs=Yes) → Birds eagle warm no yes no birds
R4: (Give Birth = no) (Can Fly = no) → Reptiles • Fraction of records that satisfy both the 5 No Divorced 95K Yes
Name Blood Type Give Birth Can Fly Live in Water Class
• Exhaustive rules
lemur warm yes no no ? • Classifier has exhaustive coverage if it accounts for every possible
turtle cold no no sometimes ?
dogfish shark cold yes no yes ? combination of attribute values
• Each record is covered by at least one rule
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
• Rules are no longer mutually exclusive • Rules are rank ordered according to their priority
• A record may trigger more than one rule • An ordered rule set is known as a decision list
• Solution?
• Ordered rule set • When a test record is presented to the classifier
• Unordered rule set – use voting schemes • It is assigned to the class label of the highest ranked rule it has triggered
• If none of the rules fired, it is assigned to the default class
• Rules are no longer exhaustive
R1: (Give Birth = no) (Can Fly = yes) → Birds
• A record may not trigger any rules
R2: (Give Birth = no) (Live in Water = yes) → Fishes
• Solution?
• Use a default class
R3: (Give Birth = yes) (Blood Type = warm) → Mammals
R4: (Give Birth = no) (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Ordering Schemes Building Classification Rules
4. Repeat Step (2) and (3) until stopping criterion is met Q R Rule Set
• Bayes theorem:
P( A | C ) P(C ) P( S | M ) P( M ) 0.5 1 / 50000
P(C | A) = P( M | S ) = = = 0.0002
P( A) P( S ) 1 / 20
• Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) • Choose value of C that maximizes
P(C | A1, A2, …, An)
• Can we estimate P(C| A1, A2,…,An ) directly from data? • Equivalent to choosing value of C that maximizes
P(A1, A2, …, An|C) P(C)
Give Birth Can Fly Live in Water Have Legs Class P(A|M)P(M) > P(A|N)P(N) • Use other techniques such as Bayesian Belief Networks (BBN)
yes no yes no ?
=> Mammals
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN) Artificial Neural Networks (ANN)
• Each neuron receives inputs, multiplies them with weights, adds Input
a bias, and passes through an activation function. nodes Black box
X1 X2 X3 Y
• The network learns by adjusting weights using algorithms 1
1
0
0
0
1
0
1
X1 0.3
Output
node
like backpropagation and gradient descent. 1 1 0 1
1 1 1 1 X2 0.3
0 0 1 0
Y
• Basic Structure of ANN 0 1 0 0
0 1 1 1 X3 0.3
Input Layer → Hidden Layer(s) → Output Layer 0 0 0 0
t=0.4
(x1, x2, ...) (h1, h2, ...) (y)
Y = I (0.3 X 1 + 0.3 X 2 + 0.3 X 3 − 0.4 0)
1 if z is true
where I ( z ) =
0 otherwise
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Artificial Neural Networks (ANN) General Structure of ANN
Input x1 x2 x3 x4 x5
• Model is an assembly of inter-connected nodes
nodes and weighted links Black box Input Input Neuron i Output
Output Layer
• Output node sums up each of its input X1 w1 node I1 wi1
value according to the weights of its links Activation
w2 wi2
• Compare output node against some X2 Y I2
wi3
Si function Oi Oi
w3 g(Si )
threshold t Hidden I3
X3 Layer
t
threshold, t
Perceptron Model
Y = I ( wi X i − t )
Output Training ANN means learning the
or Layer weights of the neurons
i
Y = sign ( wi X i − t ) y
i
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s
B1
• Find a linear hyperplane (decision boundary) that will separate the data • One Possible Solution
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s
B2
B2
S u p p o r t Ve c t o r M a c h i n e s S u p p o r t Ve c t o r M a c h i n e s
B1
B1
B2
B2
b21
b22
margin
b11
• How do you define better? • Find hyperplane maximizes the margin => B1 is better than B2
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
N o n l i n e a r S u p p o r t Ve c t o r M a c h i n e s
S u p p o r t Ve c t o r M a c h i n e s
B1
• What if decision boundary is not linear?
w• x + b = 0
w • x + b = −1 w • x + b = +1
b11
1 if w • x + b 1 b12 2
Margin = 2
f ( x) = || w ||
− 1 if w • x + b −1 Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
N o n l i n e a r S u p p o r t Ve c t o r M a c h i n e s Ensemble Methods
• Transform data into higher dimensional space • Construct a set of classifiers from the training data
D
Original
Training data • Suppose there are 25 base classifiers
• Each classifier has error rate, = 0.35
• Assume classifiers are independent
Step 1:
Create Multiple D1 D2 .... Dt-1 Dt • Probability that the ensemble classifier makes a wrong
Data Sets
prediction:
i
25 25
Step 2:
Build Multiple C1 C2 Ct -1 Ct
(1 − )25−i = 0.06
i =13 i
Classifiers
Step 3:
Combine C*
Classifiers
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Data Mining
M2: Data Preprocessing 2.1 Data Preprocessing Techniques
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Preprocessing Data preprocessing
Data Preprocessing 4
Data Preprocessing 3
Data
Data Pre-processing
Pre-processing Data Data Mining
DataMining
• D ata P re - p ro c e s s i n g : A n O ve r v i e w, D ata C l e a n i n g ,
• W hat is D ata M ining ?
• D ata Re d u c t i o n - O ve r v i e w o f D ata Re d u c t i o n S t rate g i e s , P C A ,
• M any people treat data mining as a synonym for
Att r i b u te S u b s e t S e l e c t i o n , H i sto g ra m s , S a m p l i n g ;
another popularly us ed te rm, Knowledge D is cover y
• D ata Tra n sfo r m at i o n a n d D ata D i s c ret i zat i o n - D ata from D ata , or KD D,
Tra n sfo r m at i o n b y N o r m a l i zat i o n ,
• W hile ot he rs view data mining as mere ly an
• D i s c ret i zat i o n b y B i n n i n g , D i s c ret i zat i o n b y H i sto g ra m es s ential step in the proc es s of knowledge
A n a l ys i s , D i s c ret i zat i o n b y C l u ste r, D e c i s i o n Tre e , a n d discove r y.
C o r re l at i o n A n a l ys e s . Data Preprocessing Data Preprocessing
5 6
Knowledge Discovery Process Major Tasks in Data Preprocessing
The knowledge discovery process has the following steps: ➢Steps 1 through 4 are different forms of data preprocessing, where data are
1. Data cleaning (to remove noise and inconsistent data) prepared for mining.
2. Data integration (where multiple data sources may be combined)
➢The data mining step may interact with the user or a knowledge base.
3. Data selection (where data relevant to the analysis task are retrieved from the database)
4. Data transformation (where data are transformed and consolidated into forms ➢The interesting patterns are presented to the user and may be stored as new
appropriate for mining by performing summary or aggregation operations) knowledge in the knowledge base.
5. Data mining (an essential process where intelligent methods are applied to extract data
patterns) ➢“Data mining is the process of discovering interesting patterns and knowledge from
6. Pattern evaluation (to identify the truly interesting patterns representing knowledge large amounts of data. The data sources can include databases, data warehouses, the
based on interestingness measures)
Web, other information repositories, or data that are streamed into the system
7. Knowledge presentation (where visualization and knowledge representation techniques
are used to present mined knowledge to users) dynamically” .
Data Preprocessing 9 Data Preprocessing 10
Data Preprocessing
➢Today’s real-world databases are highly susceptible to noisy, missing, and inconsistent ➢Data integration merges data from multiple sources into a coherent data
data due to their typically huge size and their origin from multiple, heterogenous
sources. store such as a data warehouse.
➢Low-quality data will lead to low-quality mining results.
➢“How can the data be preprocessed in order to help improve the quality of the data ➢Data reduction can reduce data size by, aggregating, eliminating redundant
and, consequently, of the mining results?
features, or clustering.
➢How can the data be preprocessed so as to improve the efficiency and ease of the
mining process?”
➢ Data transformations (e.g., normalization) may be applied, where data are
➢There are several data preprocessing techniques.
➢Data cleaning can be applied to remove noise and correct inconsistencies in data. scaled to fall within a smaller range like 0.0 to 1.0.
➢There are many factors comprising data quality, including accuracy, completeness, consistency,
timeliness, believability, and interpretability.
➢Imagine that you are a manager at AllElectronics and have been charged with analyzing the company’s
data with respect to your branch’s sales.
➢You carefully inspect the company’s database and data warehouse, identifying and selecting the
attributes or dimensions (e.g., item, price, and units sold) to be included in your analysis.
➢ You notice that several of the attributes for various tuples have no recorded value.
➢ Imagine that you need to analyze AllElectronics sales and customer data.
➢Real-world data tend to be incomplete, noisy, and inconsistent.
➢ You note that many tuples have no recorded value for several attributes such as customer income. How
➢Data cleaning routines attempt to fill in missing values, smooth can you go about filling in the missing values for this attribute?
3. Use a global constant to fill in the missing value: Replace all missing attribute values
5. Use the attribute mean or median for all samples belonging to the same class as the
by the same constant such as a label like “Unknown” or −∞. If missing values are
given tuple: For example, if classifying customers according to credit risk, we may replace
replaced by, say, “Unknown,” then the mining program may mistakenly think that
the missing value with the mean income value for customers in the same credit risk
they form an interesting concept, since they all have a value in common—that of
category as that of the given tuple. If the data distribution for a given class is skewed, the
“Unknown.” Hence, although this method is simple, it is not foolproof. median value is a better choice.
4. Use a measure of central tendency for the attribute (e.g., the mean or median) to 6. Use the most probable value to fill in the missing value: This may be determined with
fill in the missing value: measures of central tendency, which indicate the “middle” regression, inference-based tools using a Bayesian formalism, or decision tree induction.
For example, using the other customer attributes in your data set, you may construct a
value of a data distribution. For normal (symmetric) data distributions, the mean
decision tree to predict the missing values for income.
can be used, while skewed data distribution should employ the median.
Data compression
Data Preprocessing 21 Data Preprocessing 22
Data Reduction Data Reduction
Dimensionality Reduction Dimensionality Reduction
C u rs e o f d i m e n s i o n a l i t y D i m e n s i o n a l i t y re d u c t i o n
• W h e n d i m e n s i o n a l i t y i n c re a s e s , d ata b e c o m e s i n c re a s i n g l y • Avo i d t h e c u rs e o f d i m e n s i o n a l i t y
s p a rs e • H e l p e l i m i n ate i r re l e va nt fe at u re s a n d re d u c e n o i s e
• D e n s i t y a n d d i sta n c e b e t w e e n p o i nt s , w h i c h i s c r i t i c a l to • Re d u c e t i m e a n d s p a c e re q u i re d i n d ata m i n i n g
c l u ste r i n g , o u t l i e r a n a l ys i s , b e co m e s l e s s m e a n i n g f u l • A l l o w e a s i e r v i s u a l i zat i o n
Linear regression
Dimensionality reduction techniques
• D ata m o d e l e d to f i t a st ra i g ht l i n e
• Wavelet transforms
• O f te n u s e s t h e l e a st - s q u a re m et h o d to f i t t h e l i n e
• Principal Component Analysis
Multiple regression
• Super vised and nonlinear techniques • Allows a response variable Y to be modeled as a linear function
• The parameters are estimated so as to give a "best X1 x Using the least squares criterion to the known values of Y1, Y2, …,
fit" of the data X1, X2, ….
• Used for prediction
• Most commonly the best fit is evaluated by using the (including forecasting of Multiple regression: Y = b0 + b1 X1 + b2 X2
least squares method time-series data),
inference, hypothesis Many nonlinear functions can be transformed into the above
testing, and modeling of
Data Preprocessing
causal relationships 27
Data Preprocessing 28
Raw Data
Data Preprocessing 29 Data Preprocessing 30
Binning Methods for Data Smoothing Regression and Log-Linear Models
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins: Linear regression: Y = w X + b
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25 Tw o r e g r e s s i o n c o e f f i c i e n t s , w a n d b , s p e c i f y t h e l i n e a n d a r e t o b e
- Bin 3: 26, 28, 29, 34
estimated by using the data at hand
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9 Using the least squares criterion to the known values of Y1, Y2, …,
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
X1, X2, ….
* Smoothing by bin boundaries:
Multiple regression: Y = b0 + b1 X1 + b2 X2
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25 Many nonlinear functions can be transformed into the above
- Bin 3: 26, 26, 26, 34
• Χ2 (chi-square) test
Like science fiction 250(90) 200(360) 450
(Observed − Expected) 2
2 = Not like science fiction 50(210) 1000(840) 1050
Expected
Sum(col.) 300 1200 1500
• The larger the Χ2 value, the more likely the variables are related
• Χ2 (chi-square) calculation (numbers in parenthesis are expected counts
• The cells that contribute the most to the Χ2 value are those whose actual count is very calculated based on the data distribution in the two categories)
different from the expected count
(250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2
• Correlation does not imply causality 2 = + + + = 507.93
90 210 360 840
• # of hospitals and # of car-theft in a city are correlated • It shows that like_science_fiction and play_chess are correlated in the group
• Typical methods: All the methods can be applied recursively • Equal-width (distance) partitioning
• Binning • Divides the range into N intervals of equal size: uniform grid
• Top-down split, unsupervised • if A and B are the lowest and highest values of the attribute, the width of
• Another way to reduce dimensionality of data • There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
• Redundant attributes
• Best single attribute under the attribute independence assumption: choose
• Duplicate much or all of the information contained in one or more other by significance tests
attributes • Best step-wise feature selection:
• The best single-attribute is picked first
• E.g., purchase price of a product and the amount of sales tax paid • Then next best attribute condition to the first, ...
• Irrelevant attributes • Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
• Contain no information that is useful for the data mining task at hand • Best combined attribute selection and elimination
• E.g., students' ID is often irrelevant to the task of predicting students' GPA • Optimal branch and bound:
• Use attribute elimination and backtracking
Thank you
Data Mining Data Mining
47 48
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
▪ It is important to know that just login to the session does not guarantee the
attendance.
▪ Once you join the session, continue till the end to consider you as present in the
class.
▪ IMPORTANTLY, you need to make the class more interactive by responding to
Data Mining Professors queries in the session.
▪ Whenever Professor calls your number / name, you need to respond,
BITS Pilani
Pilani|Dubai|Goa|Hyderabad M3: Data Exploration
otherwise it will be considered as ABSENT
The instructor is gratefully acknowledging the authors who made their course materials freely available online. BITS Pilani, Pilani Campus
BITS Pilani
Recap
Pilani|Dubai|Goa|Hyderabad
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Equal Width (Distance) Binning : bins have equal width with a range of each bin are defined as 1 Anu 34 15-Jan-2015 CSE Feb 24, 1997
[min + w], [min + 2w] …. [min + nw] 2 Banu 33 27-Jan-2015 Mech Mar 27, 1998
w = (max – min) / (no of bins) 3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997
No. of bins = 3
4 Chiru 32 30-Jan-2015 ECE Nov 25,1998
5 Amit 34 15-Jan-2015 NULL Feb 24, 1998
Input: [5, 10, 11, 13, 15, 35, 50, 55, 72, 92, 204, 215]
min = 5 w = (215-5)/3 = 70 6 Arish 38 15/Jan/2017 ECE 24 Dec 1999
Output: [5, 10, 11, 13, 15, 35, 50, 55, 72] [92] [204, 215] 7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997
8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999
9 Arush 28 15-Jan-2027 ECE 24 Dec 2019
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Solution Types of Data Sets
• Record
Consider that following scholar data set made available to you for carrying out some • Relational records
mining activity. List 4 potential issues with this dataset identified while preprocessing?
timeout
•
season
coach
Data matrix, e.g., numerical matrix, crosstabs
game
score
team
ball
lost
pla
wi
n
y
• Document data: text documents: term-frequency vector
• Transaction data
Sno Name Age DateOfJoining Branch DateOfBirth
• Graph and network Document 1 3 0 5 0 2 6 0 2 0 2
• World Wide Web
1 Anu 34 15-Jan-2015 CSE Feb 24, 1997
Document 2 0 7 0 2 1 0 0 3 0 0
• Social or information networks
2 Banu 33 27-Jan-2015 Mech Mar 27, 1998
• Molecular Structures Document 3 0 1 0 0 1 2 2 0 3 0
3 Anu C 34 15-Jan-2015 CSE Feb 24, 1997
• Ordered
4 Chiru 32 30-Jan-2015 ECE Nov 25,1998 • Video data: sequence of images TID Items
5 Amit 34 15-Jan-2015 NULL Feb 24, 1998 • Temporal data: time-series 1 Bread, Coke, Milk
6 Arish 38 15/Jan/2017 ECE 24 Dec 1999 • Sequential Data: transaction sequences 2 Beer, Bread
• Genetic sequence data 3 Beer, Coke, Diaper, Milk
7 Sanu 34 15-Jan-2015 Civil Jan 1st, 1997
• Spatial, image and multimedia: 4 Beer, Bread, Diaper, Milk
8 Naresh 98 15-Dec-2017 ECE 24 Dec 1999
• Spatial data: maps 5 Coke, Diaper, Milk
9 Arush 28 15-Jan-2027 ECE 24 Dec 2019 • Image data: MRI scans
Data Mining
• Video data: Movies
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
x
•
n
Motivation w i =
• Weighted arithmetic mean: i =1
N
• To better understand the data: central tendency, variation and spread
• Trimmed mean: chopping extreme values
• Data dispersion characteristics • Median:
• median, max, min, quantiles, outliers, variance, etc.
• Middle value if odd number of values, or average of the
• Numerical dimensions correspond to sorted intervals middle two values otherwise
• Data dispersion: analyzed with multiple granularities of precision • Estimated by interpolation (for grouped data):
median = L1 + (
interval
) width
• Dispersion analysis on computed measures • Mode freqmedian
• Folding measures into numerical dimensions • Value that occurs most frequently in the data
• Boxplot or quantile analysis on the transformed cube • Unimodal, bimodal, trimodal
• Empirical formula: mean − mode = 3 (mean − median)
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data Measuring the Dispersion of Data
• Median, mean and mode of symmetric, • Quartiles, outliers and boxplots
positively and negatively skewed data • Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
positively skewed symmetric negatively skewed
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add
whiskers, and plot outliers individually
• Outlier: usually, a value higher/lower than 1.5 x IQR
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
1 n 1 n 2 1 n n n
s2 = ( xi − x ) 2 = n − 1 [
n − 1 i =1 i =1
xi − ( xi ) 2 ]
n i =1
2 =
1
(x − )
i
2
=
1
x i
2
− 2
N i =1 N i =1
Skewness > Skewness = Skewness < • Standard deviation s (or σ) is the square root of variance s2 (or σ2)
0 0 Data Mining 0 Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
3.2 Data Similarity/Dissimilarity • Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Matrix and Dissimilarity Matrix Proximity Measure for Nominal Attributes
• Data matrix x11 ... x1f ... x1p • Can take 2 or more states, e.g., red, yellow, blue, green
• n data points with p dimensions ... ... ... ... ...
(generalization of a binary attribute)
x x ip
• Two modes i1
... x if ...
... ... ... ... ... • Method 1: Simple matching
x ... x nf ... x np
n1 • m: # of matches, p: total # of variables
• Dissimilarity matrix d (i, j) = p −
p
m
0
• n data points, but registers only the distance d(2,1) 0
• A triangular matrix
• Method 2: Use a large number of binary attributes
d(3,1) d ( 3,2) 0
• Single mode • creating a new binary attribute for each of the M nominal states
: : :
d ( n,1) d ( n,2) ... ... 0
Example
Object i Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
• Distance measure for symmetric Jack M Y N P N N N
binary variables: Mary F Y N P N P N
Jim M Y P N N N N
• Distance measure for asymmetric
binary variables:
• Gender is a symmetric attribute
• Jaccard coefficient (similarity measure • The remaining attributes are asymmetric binary
for asymmetric binary variables): • Let the values Y and P be 1, and the value N 0
0+1
d ( jack , mary ) = = 0.33
2+ 0+1
◼ Note: Jaccard coefficient is the same as “coherence”: d ( jack , jim ) =
1+1
= 0.67
1+1+1
1+ 2
d ( jim , mary ) = = 0.75
1+1+ 2
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Standardizing Numeric Data Normalisation-Exercise
• Z-score: −
z = x Suppose that the minimum and maximum values for the attribute income are $12,000
• X: raw score to be standardized, μ: mean of the population, σ: and $98,000, respectively. The new range is [0.0,1.0]. Apply min-max normalization to
standard deviation
value of $73,600.
• the distance between the raw score and the population mean in units
of the standard deviation
• negative when the raw score is below the mean, “+” when above Solution v − minA
v' = (new _ maxA − new _ minA) + new _ minA
• An alternative way: Calculate the mean absolute deviation maxA − minA
where
sf = 1
n (| x1 f − m f | + | x2 f − m f | +...+ | xnf − m f |)
73,600 − 12,000
mf = 1 n (x1 f + x2 f + ... + xnf )
. (1.0 − 0) + 0 = 0.716
• standardized measure (z-score): xif − m f 98,000 − 12,000
z =
• Using mean absolute deviation is more robust than using sf
if standard
deviation
Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Pilani Campus
Normalisation-Exercise
Example: Data Matrix and Dissimilarity Matrix
Suppose that the mean and standard deviation of the values for the attribute
Data Matrix
income are $54,000 and $16,000, respectively. Apply z-score normalization
to the salary value of $73,600 point attribute1 attribute2
x1 1 2
v − A x2 3 5
v' = x3 2 0
A
x4 4 5
Dissimilarity Matrix
73,600 − 54,000
= 1.225 (with Euclidean Distance)
16,000 x1 x2 x3 x4
x1 0
x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Data Mining
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different
• Minkowski distance: A popular distance measure
between two binary vectors
d (i, j) =| x − x | + | x − x | +...+ | x − x |
i1 j1 i2 j 2 ip jp
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, • h = 2: (L2 norm) Euclidean distance
and h is the order (the distance so defined is also called L-h norm) d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j2 ip jp
• Properties
• h → . “supremum” (Lmax norm, L norm) distance.
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• This is the maximum difference between any component
• d(i, j) = d(j, i) (Symmetry) (attribute) of the vectors
• d(i, j) d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
Calculate the dissimilarity matrix and similarity matrix for the ordinal attributes
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and r −1
• Treat zif as interval-scaled
zif =
if
M −1 f
Data Mining
BITS Pilani, Pilani Campus BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
• Other vector objects: gene features in micro-arrays, … d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
• Applications: information retrieval, biologic taxonomy, gene feature mapping, ... d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then ||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| , ||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
where • indicates vector dot product, ||d||: the length of vector d cos(d1, d2 ) = 0.94
For the following vectors x and y, calculate the indicated similarity or distance For the following vectors x and y, calculate the indicated similarity or distance
measures: measures:
x =(1,1,1,1), y=(2,2,2,2) cosine, Euclidian x =(0,1,0,1), y=(1,0,1,0) Cosine, Correlation ,Euclidian,Jacquard
Solution: cos(x, y) = 1, Euclidean(x, y) = 2 Solution: cos(x, y) = 0,Corr(x,y)=-3 Euclidean(x, y) = 2,Jacquard(x,y)=0
Data visualization is the art and practice of gathering, analyzing, and graphically representing
empirical information.
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualization
Visualization techniques
Another Defination: Visualization is the conversion of data into a visual
or tabular format so that the characteristics of the data and the
Based on
relationships among data items or attributes can be analyzed or
• Visualization of a small number of attributes,
reported. • Stem and Leaf Plots
• Histograms
• Box Plots
Visualization of data is one of the most powerful and appealing • Pie charts
• Scatter Plots etc
techniques for data exploration.
• Visualization of data with spatial and/or temporal attributes,
– Humans have a well developed ability to analyze large amounts of information • Contour Plots
that is presented visually • Surface Plots
– Can detect general patterns and trends • Visualization of data with many attributes
• Data Matrix
– Can detect outliers and unusual patterns • Chernoff Faces
• Stick Figures
Data Mining
02/03/2018 Introduction to Data Mining 49
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
02/03/2018 Introduction to Data Mining 56
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example : Stem and Leaf Plot Stem and Leaf Plot
4 : 34444566667788888999999
The set of integers shown below is the sepal length in centimeters 5 : 0000000000111111111222234444445555555666666777777778888888999
(multiplied by 10 to make the values integers) taken from the Iris data set.
6 : 000000111111222233333333344444445555566777777778889999
The values have also been sorted.
7 : 0122234677779
43 44 44 44 45 46 46 46 46 47 47 48 48 48 48 48 49 49 49 49 49 49 50
50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 51 51 51 52 52 52 52 53
4 : 3444
54 54 54 54 54 54 55 55 55 55 55 55 55 56 56 56 56 56 56 57 57 57 57
4 : 566667788888999999
57 57 57 57 58 58 58 58 58 58 58 59 59 59 60 60 60 60 60 60 61 61 61
5 : 000000000011111111122223444444
61 61 61 62 62 62 62 63 63 63 63 63 63 63 63 63 64 64 64 64 64 64 64
5 : 5555555666666777777778888888999
65 65 65 65 65 66 66 67 67 67 67 67 67 67 67 68 68 68 69 69 69 69 70
6 : 00000011111122223333333334444444
71 72 72 72 73 74 76 77 77 77 77 79
6 : 5555566777777778889999
7 : 0122234
Data Mining 7 : 677779 Data Mining
Source : Figure 3.4. Sepal length data from the Iris data set[ Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson Education ] BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 Source : Figure 3.5., Stem and leaf plot for the sepal length from the Iris data set. BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Chernoff Faces
Stick Figure
• A way to display variables on a two- A census data figure showing age,
dimensional surface, e.g., let x be income, gender, education
eyebrow slant, y be eye size, z be nose
length, etc.
A 5-piece stick figure (1 body and
4 limbs w. different angle/length)
• The figure shows faces produced
Age, income are indicated by
using 10 characteristics--head
position of the figure.
eccentricity, eye size, eye spacing, eye
eccentricity, pupil size, eyebrow slant, Gender, education are indicated
nose size, mouth shape, mouth size, by angle/length.
and mouth opening): Each assigned
Visualization can show a texture
one of 10 possible values.
pattern
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Visualizing Complex Data and Relations
Exercise
You are analysing the behaviour of students across 10 variables like Most visualization techniques were
mainly for numeric data. Recently,
attendance, assignment scores, participation, etc. more and more non-numeric data,
such as text and social networks,
Which visualization technique would you choose: Bar chart, Chernoff have become available.
faces, or Pie chart, Scatter Plot? Justify your answer. Many people on the Web tag various
objects such as pictures, blog entries,
and product reviews.
A tag cloud is a visualization of
statistics of user-generated tags.
Often, in a tag cloud, tags are listed
alphabetically or in a user-preferred
order.
The importance of a tag is indicated
by font size or color
Data Mining Data Mining
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956