Introduction to Data Mining Course
Introduction to Data Mining Course
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evolution of Database Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems
11
12
Increasing potential
to support
business decisions End User
Decision
Making
Data Exploration
Statistical Summary, Querying, and Reporting
13
14
18
Description Methods
– Find human-interpretable patterns that describe the data.
From [Fayyad, [Link].] Advances in Knowledge Discovery and Data Mining, 1996
20
21
Set
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
Training
Learn
10 No Single 90K Yes Model
10
Set Classifier
22
23
24
26
27
28
29
30
31
32
(A B) (C) (D E)
(A B) (C) (D E)
Timing constraints include maxgap
<= xg >ng <= ws (xg), mingap (ng), windowsize (ws),
maxspan (ms)
<= ms
33
34
35
36
37
38
39
40
41
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
CRISP data mining framework
CRISP is the most popular
methodology for analytics,
data mining, and data
science projects, with 43%
share as per 2014
KDnuggets Poll.
CRISP-DM was conceived
in 1996. In 1997 it got
underway as a European
Union project, led by SPSS,
Teradata, Daimler AG, NCR
Corporation and OHRA.
45
46
DM Issues/Challenges - Society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
48
49
50
51
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Preprocessing Objectives
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
5
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality: Multidimensional View
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality
– missing values
– duplicate data
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class:
smarter
– the most probable value: inference-based such as Bayesian formula or
decision tree
11
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Noise
Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
– Binning methods smooth a sorted data value by consulting its
"neighborhood," that is, the values around it, i.e. they perform local
smoothing.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible
outliers) 14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates
of one another
– Major issue when merging data from heterogenous sources
• Examples:
– Same person with multiple email addresses
• Data cleaning
– Process of dealing with duplicate data issues
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outliers
Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set
16
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheels)
17
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Preprocessing Techniques
18
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
19
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Integration
• Data integration: Combines data from multiple sources into a coherent
store
• Schema integration: e.g., [Link]-id ≡ [Link]-#
o Integrate metadata from different sources
• Entity identification problem:
o Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are
different
o Possible reasons: different representations, different scales, e.g., metric
vs. British units
20
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Any problems with the Data?
Name Age DateOfFirstBuy Profession DateOfBirth
Bill Gates 34 15-Jan-2015 MGR Feb 24, 1981
John 38 27-Jan-2015 Mar 11, 1982
William 34 15-Jan-2015 MGR Feb 24, 1981
Gates
Kennedy 37 30-Jan-2015 DOC Nov 03,1982
21
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Any problems with the Data?
Name Age DateOfFirstBuy Profession DateOfBirth
Bill Gates 34 15-Jan-2015 MGR Feb 24, 1981
John 38 27-Jan-2015 Mar 11, 1982
William Gates 34 15-Jan-2015 MGR Feb 24, 1981
Kennedy 37 30-Jan-2015 DOC Nov 03,1982
22
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Handling Redundancy in Data Integration
• Redundant data occur often when integration of multiple databases
– Object identification: The same attribute or object may have different
names in different databases
– Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis
and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality
23
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Correlation Analysis (Nominal Data)
• χ2 (chi-square) test
(Observed − Expected ) 2
χ =∑
2
Expected
• The larger the χ2 (chi-square) value, the more likely the variables are related
• The cells that contribute the most to the χ2 value are those whose actual count
is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population
Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Chi-Square Calculation: An Example
Play chess Not play chess Sum (row)
Like science fiction 250(90) 200(360) 450
Not like science fiction 50(210) 1000(840) 1050
Sum(col.) 300 1200 1500
25
Data Mining
Data Mining 26
Data Mining
27
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Covariance (Numeric Data)
• Covariance is similar to correlation
Correlation coefficient:
• Independence: CovA,B = 0
Data Mining
• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).
• Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4
Data Mining
73,600 − 54,000
• Ex. Let μ = 54,000, σ = 16,000. Then = 1.225
16,000
• Normalization by decimal scaling
v
v' = j Where j is the smallest integer such that Max(|ν’|) < 1
10
Data Mining
30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification
Data Mining 31
32
Data Mining
Data Mining
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binning Methods for Data Smoothing
34
Data Mining
Data Mining 35
Data Mining 37
Curse of Dimensionality
m
g
rlu
,h
p
w
b
c
a
y
d
s
to
fin
e
D
x2
x1
39
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal component
vectors
– The principal components are sorted in order of decreasing “significance” or
strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
40
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data mining task at hand
– E.g., students' ID is often irrelevant to the task of predicting students'
GPA
41
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence assumption:
choose by significance tests
– Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking
42
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
– Attribute construction
• Combining features
• Data discretization
43
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Reduction: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-D space as the
product on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …
44
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sampling
45
46
Raw Data 47
48
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
49
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Data Sets
• Record
• Relational records
• Data matrix, e.g., numerical matrix, crosstabs
• Document data: text documents: term-frequency
timeout
season
coach
game
score
team
ball
lost
pla
wi
n
y
vector
• Transaction data
Document 1 3 0 5 0 2 6 0 2 0 2
• Graph and network
• World Wide Web Document 2 0 7 0 2 1 0 0 3 0 0
Data Mining
6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns ->attributes.
Data Mining
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes
• Attribute (or dimensions, features, variables): a data field, representing
a characteristic or feature of a data object.
• Types:
• Nominal
• Binary
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Data Mining
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive
values is not known.
• Size = {small, medium, large}, grades, army rankings
Data Mining
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚or F˚, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K˚ is twice as high as 5 K˚).
• e.g., temperature in Kelvin, length, counts, monetary quantities
Data Mining
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite
number of digits
• Continuous attributes are typically represented as floating-point variables
Data Mining
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Data Mining
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
1 n µ=∑
x
x = ∑ xi
Note: n is sample size and N is population size. n n i =1
N
∑w x i i
∑w i
• Median:
• Middle value if odd number of values, or average of the middle
two values otherwise
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: mean − mode = 3× (mean − median) 13
Data Mining BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data
symmetric
negatively skewed
positively skewed
Data Mining
July 5, 2021 14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
• Outlier: usually, a value higher/lower than 1.5 x IQR (on both sides of box from Q1 to Q3)
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
1 n
1 n
1 n 1 n 2 1 n ∑ (x − µ ) ∑x
2
σ2 = 2
= − µ2
2
s = ∑
n − 1 i =1
( xi − x ) 2 = [∑ xi − (∑ xi ) 2 ]
n − 1 i =1 n i =1 N i =1
i
N i =1
i
Data Mining
15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually
Data Mining
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example
Following is an ordered list of observations of a variable. Compute 5 point
summary.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70
Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70
Data Mining
Data Mining
7/5/2021
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Histogram Analysis
Data Mining
7/5/2021
20
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For a data xi data sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi
Data Mining
Data Mining
22
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Scatter plot
Data Mining
23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Positively and Negatively Correlated Data
Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Uncorrelated Data
Data Mining
25
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Similarity/Dissimilarity
26
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Data Mining
27
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Matrix and Dissimilarity Matrix
x11 ... x1f ... x1p
• Data matrix
... ... ... ... ...
• n data points with p dimensions x ... xif ... xip
i1
• Two modes ... ... ... ... ...
x
n1 ... xnf ... xnp
• Dissimilarity matrix
• n data points, but registers only the
distance 0
• A triangular matrix d(2,1) 0
• Single mode d(3,1) d ( 3,2) 0
: : :
d ( n,1) d ( n,2) ... ... 0
Data Mining
28
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Nominal Attributes
• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
• Method 1: Simple matching d (i, j) = p −
p
m
• m: # of matches, p: total # of variables
Data Mining
29
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Binary Attributes
Object j
• A contingency table for binary data
Object i
Data Mining
30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Dissimilarity between Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N
Data Mining
31
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Standardizing Numeric Data
• Z-score: −µ
z = xσ
• X: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, “+” when above
• Using mean absolute deviation is more robust than using standard deviation
Data Mining
32
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Data Matrix and Dissimilarity Matrix
Data Matrix
x x point attribute1 attribute2
2 4
x1 1 2
4 x2 3 5
x3 2 0
x4 4 5
2 x
1
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x x1 0
3
0 2 4 x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Data Mining
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric
Data Mining
34
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different between two binary vectors
d (i, j) =| x − x | + | x − x | +...+ | x − x |
i1 j1 i2 j 2 ip jp
• h = 2: (L2 norm) Euclidean distance
d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j 2 ip jp
• h → ∞. “supremum” (Lmax norm, L∞ norm) distance.
• This is the maximum difference between any component (attribute) of the vectors
Data Mining
35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Minkowski Distance
point attribute 1 attribute 2 (Dissimilarity Matrices)
x1 1 2
L1 x1 x2 x3 x4
x2 3 5
Manhattan (L1) x1 0
x3 2 0
x2 5 0
x4 4 5
x3 3 6 0
x4 6 1 7 0
x x Euclidean (L2) L2 x1 x2 x3 x4
2 4 x1 0
x2 3.61 0
4
x3 2.24 5.1 0
x4 4.24 1 5.39 0
Supremum L∞ x1 x2 x3 x4
2 x x1 0
1
x2 3 0
x3 2 5 0
x4 3 1 5 0
x
3
0 2 4
Data Mining
36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace xif by their rank rif ∈{1,..., M f }
• map the range of each variable onto [0, 1] by replacing i-th object in the f-th
variable by r −1
zif = if
M f −1
Data Mining
37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects
Σ pf = 1δ ij( f ) dij( f )
d (i, j) =
Σ pf = 1δ ij( f )
• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal r −1
• Compute ranks rif and z =
if
if M f −1
• Treat zif as interval-scaled
Data Mining
38
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example
Based on the information given in the table below, find most similar and
most dissimilar persons among them. Apply min-max normalization on
income to obtain [0,1] range. Consider profession and mother tongue as
nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each
attribute.
Data Mining
Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan
Data Mining
Data Mining
41
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Cosine Similarity
d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)
d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94
Data Mining
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
43
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
• Classification involves dividing up objects so that each is assigned to one of a number of
mutually exhaustive and exclusive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
‒ customers who are likely to buy or not buy a particular product in a supermarket
‒ people who are at high, medium or low risk of acquiring a certain illness
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees
‒ people who closely resemble, slightly resemble or do not resemble someone seen
committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12
months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months
‒ people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very
unlikely).
Data Mining
7/5/2021 4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data
• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values
Data Mining
7/5/2021
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set
Data Mining
7/5/2021 6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
Data Mining
7/5/2021 7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Classification Task
Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ? Deduction
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
- computational networks that simulate the decision process in neurons
(networks of nerve cell)
• Naïve Bayes and Bayesian Belief Networks
- uses the probability theory to find the most likely of the possible
classifications
• Support Vector Machines
- fits a boundary to a region of points that are all alike; uses the boundary to
classify a new point
Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or
only minor processing) and waits until it is given a test tuple
• Eager learning (the above discussed methods): Given a set of training set,
constructs a classification model before receiving new (e.g., test) data to
classify
Data Mining
7/5/2021 10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lazy Learner: Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing (“lazy evaluation”)
until a new instance must be classified
• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference
Data Mining
7/5/2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of a Decision Tree
Splitting Attributes
Tid Refund Marital Taxable
Status Income Cheat
Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Another Example of Decision Tree
MarSt Single,
Married Divorced
Tid Refund Marital Taxable
Status Income Cheat NO Refund
Yes No
1 Yes Single 125K No
2 No Married 100K No NO TaxInc
3 No Single 70K No < 80K > 80K
4 Yes Married 120K No
NO YES
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No There could be more than one tree that fits
8 No Single 85K Yes the same data!
9 No Married 75K No
10 No Single 90K Yes
10
Data Mining
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apply Model to Test Data
Test Data
Refund Marital Taxable
Status Income Cheat
No Married 80K ?
Refund 10
Yes No
NO MarSt
Single, Divorced Married Assign Cheat to “No”
TaxInc NO
< 80K > 80K
NO YES
Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Classification Task
Test Set
Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Issues: Evaluating Classification Methods
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules
Data Mining
7/5/2021 16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting and Overfitting (Example)
500 circular and 500
triangular data
points.
Circular points:
X2
0.5 ≤ sqrt(x12+x22) ≤ 1
Triangular points:
sqrt(x12+x22) > 0.5 or
sqrt(x12+x22) < 1
X1
Data Mining
17
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting and Overfitting
Overfitting
Underfitting: when model is too simple, both training and test errors are large
Data Mining
18
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Overfitting due to Noise
Data Mining
19
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Based Classification
• Decision trees are intuitive and frequently used data mining technique
for Classification
• For an analyst, they are easy to set up and for a business user they are
easy to interpret.
• A decision tree model is a decision flowchart where an attribute is
tested in each node and ends in a leaf node where a prediction is made.
• There are many algorithms for decision tree induction such as Hunt’s
Algorithm, CART, ID3, C4.5, SLIQ,SPRINT
Data Mining
20
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hunt’s Algorithm - Structure
• Hunt's algorithm is among the earliest. More
complex algorithms were built upon it. Tid Refund Marital
Status
Taxable
Income Cheat
• It grows a decision tree in a recursive fashion 1 Yes Single 125K No
• General Procedure: 8
9
No
No
Single
Married
85K
75K
Yes
No
Dt
• If Dt is an empty set, then t is a leaf node
labeled by the default class, yd ?
• If Dt contains records that belong to more
than one class, use an attribute test to split
the data into smaller subsets. Recursively
apply the procedure to each subset.
Data Mining
21
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hunt’s Algorithm - Example
Don’t Refund
Cheat Yes No Tid Refund Marital Taxable
Status Income Chea
Don’t Don’t
Cheat Cheat 1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
Refund
Refund
4 Yes Married 120K No
Yes No
Yes No 5 No Divorced 95K Yes
Don’t Marital
Don’t 6 No Married 60K No
Marital Cheat Status
Cheat Status Single, 7 Yes Divorced 220K No
Single, Married
Married Divorced 8 No Single 85K Yes
Divorced Don’t
Taxable 9 No Married 75K No
Cheat Don’t Cheat
Income
Cheat 10 No Single 90K Yes
< 80K >= 80K 10
Don’t Cheat
Cheat
Data Mining
22
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Tree Induction
• Greedy strategy.
• Split the records based on an attribute test that optimizes certain
criterion.
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Data Mining
23
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Specify Test Condition?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous
Data Mining
24
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Nominal Attributes
CarType
Family Luxury
Sports
CarType CarType
{Sports,
Luxury} {Family} OR {Family,
Luxury} {Sports}
Data Mining
25
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
Size
Small Large
Medium
Size
• What about this split? {Small,
Large} {Medium}
Data Mining
26
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal
frequency bucketing
(percentiles), or clustering.
Data Mining
27
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes
Data Mining
28
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to determine the Best Split
Data Mining
29
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to determine the Best Split
• Greedy approach:
• Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:
C0: 5 C0: 9
C1: 5 C1: 1
Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity
Data Mining
30
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measures of Node Impurity
• Gini Index
• Entropy
• Misclassification error
Data Mining
31
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Find the Best Split
C0 N00
Before Splitting: M0
C1 N01
A? B?
Yes No Yes No
M1 M2 M3 M4
Data Mining
32
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measure of Impurity: GINI
• Maximum (1 - 1/nc) when records are equally distributed among all classes,
implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information
C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500
Data Mining
33
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for computing GINI
GINI (t ) = 1 − ∑ [ p ( j | t )]2
j
Data Mining
34
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of split is
computed as,
k
ni
GINI split = ∑ GINI (i )
i =1 n
Data Mining
35
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binary Attributes: Computing GINI Index
Splits into two partitions
Effect of Weighing partitions:
– Larger and Purer Partitions are sought for.
Parent
B? C1 6
Yes No C2 6
Gini = 0.500
Node N1 Node N2
Gini(N1)
= 1 – (5/7)2 – (2/7)2 N1 N2 Gini(Children)
= 0.408 C1 5 1 = 7/12 * 0.408 +
Gini(N2) C2 2 4 5/12 * 0.32
= 1 – (1/5)2 – (4/5)2 Gini=0.37 = 0.37
= 0.32
Data Mining
36
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
Data Mining
37
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Continuous Attributes: Computing Gini Index
• Use Binary Decisions based on one Tid Refund Marital Taxable
value Status Income Cheat
Repetition of work.
Data Mining
38
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
• Sort the attribute on values
• Linearly scan these values, each time updating the count matrix and
computing gini index
• Choose the split position that has the least gini index
No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0
Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Data Mining
39
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Alternative Splitting Criteria
• Entropy at a given node t:
Entropy (t ) = − ∑ p ( j | t ) log p ( j | t )
j
Data Mining
40
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for computing Entropy
Entropy (t ) = − ∑ p ( j | t ) log p ( j | t )
j 2
Data Mining
41
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Criteria based on Classification Error
• Classification error at a node t :
Error (t ) = 1 − max P (i | t )
i
Data Mining
42
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for Computing Error
Error (t ) = 1 − max P (i | t ) i
Data Mining
43
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Comparison among Splitting Criteria
For a 2-class problem:
Data Mining
44
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gain Ratio
Data Mining
9 9 5 5
Info( D ) = I (9,5) = − log 2 ( ) − log 2 ( ) =0.940
14 14 14 14
Data Mining
47
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Selection: Information Gain
𝐼𝐼𝐼𝐼𝐼𝐼𝑜𝑜𝑎𝑎𝑎𝑎𝑎𝑎 𝐷𝐷 =
5
𝐼𝐼 2,3 +
4
𝐼𝐼 4,0 +
5
𝐼𝐼(3,2) age pi ni I(pi, ni)
14 14 14 <=30 2 3 0.971
= 0.694
31…40 4 0 0
>40 3 2 0.971
5 means “age <=30” has 5 out of 14 samples, with 2
I (2,3)
14 yes’es and 3 no’s. Hence
Similarly,
Gain(income) = 0.029
Gain( student ) = 0.151
Gain(credit _ rating ) = 0.048
Data Mining
48
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large
number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain) SplitInfo ( D) = −∑
A
|D | v
× log (
|D | j
) 2
j
|D| |D|
• GainRatio(A) = Gain(A)/SplitInfo(A) j =1
• Ex.
• The attribute with the maximum gain ratio is selected as the splitting
attribute
Data Mining
49
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Refining Decision Tree Model
Data Mining
• Stop expanding a node when all the records have similar attribute values
• Early termination
Data Mining
51
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many
simple data sets
Data Mining
52
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Data Mining
53
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting vs. Overfitting
• Underfitting results in decision trees that are too simple to solve the
problem. They may offer superior interpretability.
Data Mining
54
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Overfitting
Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is
large
Data Mining
Data Mining
56
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Address Overfitting…
• Post-pruning
• Grow decision tree to its entirety
• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error(i.e. expected error of the model on previously
unseen records) improves after trimming, replace sub-tree by a leaf
node.
• Class label of leaf node is determined from majority class of instances
in the sub-tree
Data Mining
57
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
Data Mining
58
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
59
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-based Classification
Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
• computational networks that simulate the decision process in
neurons (networks of nerve cell)
• Naïve Bayes and Bayesian Belief Networks
• uses the probability theory to find the most likely of the possible
classifications
• Support Vector Machines
• fits a boundary to a region of points that are all alike; uses the
boundary to classify a new point
Data Mining
5
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-Based Classifier
• Classify records by using a collection of “if…then…” rules
Data Mining
6
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Coverage and Accuracy
Tid Refund Marital Taxable
• Coverage of a rule: Status Income Class
• Fraction of records that satisfy
1 Yes Single 125K No
the antecedent of a rule
2 No Married 100K No
• Accuracy of a rule:
3 No Single 70K No
• Fraction of records that satisfy
both the antecedent and 4 Yes Married 120K No
consequent of a rule 5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
(Status=Single) → No 10
10 No Single 90K Yes
Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Data Mining
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
• Classifier contains mutually exclusive rules if the rules are
independent of each other
• Every record is covered by at most one rule
• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule
Data Mining
11
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
From Decision Trees To Rules
Refund Classification Rules
Yes No
(Refund=Yes) ==> No
NO Marital
(Refund=No, Marital Status={Single,Divorced},
{Single, Status
{Married} Taxable Income<80K) ==> No
Divorced}
(Refund=No, Marital Status={Single,Divorced},
Taxable NO Taxable Income>80K) ==> Yes
Income
(Refund=No, Marital Status={Married}) ==> No
< 80K > 80K
NO YES
Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rules Can Be Simplified
Tid Refund Marital Taxable
Refund Status Income Cheat
Yes No
1 Yes Single 125K No
NO Marital 2 No Married 100K No
{Single, Status
{Married} 3 No Single 70K No
Divorced}
4 Yes Married 120K No
Taxable NO 5 No Divorced 95K Yes
Income
6 No Married 60K No
< 80K > 80K
7 Yes Divorced 220K No
NO YES 8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordered Rule Set
• Rules are rank ordered according to their priority
• An ordered rule set is known as a decision list
• When a test record is presented to the classifier
• It is assigned to the class label of the highest ranked rule it has
triggered
• If none of the rules fired, it is assigned to the default class
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Ordering Schemes
• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together
Data Mining
16
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Evaluate Learnt Rule?
• Start with the most general rule possible: condition = empty
• Adding new attributes by adopting a greedy depth-first strategy
• Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and accuracy
• Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
• favors rules that have high accuracy and cover many positive tuples
Data Mining
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Evaluate Learnt Rule?
• We can use Likelihood Ratio Statistic, which confirms that effect of rule is
not attributed to chance, but represents correlation between attribute
value and classes.
𝑓𝑓
Likelihood_Ratio = 2 ∗ ∑𝑚𝑚 𝑓𝑓
𝑗𝑗=1 𝑖𝑖 log 2 ( 𝑖𝑖
)
𝑒𝑒𝑖𝑖
• m is the number of classes
• For tuples satisfying the rule, fi is the observed frequency of each
class among tuples, ei is the expected frequency if the rule made
random predictions
• Higher the Likelihood Ratio, better the rule is.
• Used by CN2
Data Mining
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Building Classification Rules
• Direct Method:
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R
• Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• e.g: C4.5rules
Data Mining
19
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Direct Method: Sequential Covering
Data Mining
20
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Induction: Sequential Covering Method
• Start with an empty rule set
• Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the
rules are removed
• Repeat the process(above steps) on the remaining tuples
• until termination condition, e.g., when no more training
examples or when the quality of a rule returned is below a
user-specified threshold
Data Mining
21
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sequential Covering Algorithm
Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3
Positive
examples
Data Mining
22
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation
• To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
Data Mining
23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation
A3=1&&A1=2
&&A8=5
A3=1&&A1=2
A3=1
Positive Negative
examples examples
Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Indirect Methods
P
No Yes
Q R Rule Set
Data Mining
25
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ensemble Methods
Data Mining
26
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Idea
Original
D Training data
Step 1:
Create Multiple D1 D2 .... Dt-1 Dt
Data Sets
Step 2:
Build Multiple C1 C2 Ct -1 Ct
Classifiers
Step 3:
Combine C*
Classifiers
Data Mining
27
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why does it work?
25
25 i
∑
i
i =13
ε (1 − ε ) 25−i
= 0.06
Data Mining
28
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
When error rate differs…
Data Mining
29
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
Data Mining
30
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
31
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics: Confusion Matrix
Example of Confusion Matrix:
Predicted class -> buy_computer = buy_computer = Total
yes no
Actual class ⇓
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000
6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
Classifier Accuracy, or recognition rate: Class Imbalance Problem:
percentage of test set tuples that are One class may be rare, e.g. fraud, or
correctly classified HIV-positive
Accuracy = (TP + TN)/All Significant majority of the negative class
Error rate: 1 – accuracy, or and minority of the positive class
Error rate = (FP + FN)/All
Sensitivity: True Positive recognition
rate
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Harmonic mean can be recomputed by applying weights to
precision and recall
By substituting β2 = (1 – α)/ α ,
We get
Fß: weighted measure of precision and recall
– assigns β2 times as much weight to recall as to precision
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics: Example
Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy: Holdout Method
Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy:
Cross-Validation Methods
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and
re-added to the training set
Several bootstrap methods, and a common one is .632 boostrap
– A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the training
set end up forming the test set. About 63.2% of the original data end up in
the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)d ≈ e-1
= 0.368)
– Repeat the sampling procedure k times, overall accuracy of the model:
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Selection: ROC Curves
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification
models
Originated from signal detection theory
Shows the trade-off between the true
positive rate and the false positive
rate
The area under the ROC curve is a
measure of the accuracy of the
model Vertical axis represents the
Rank the test tuples in decreasing true positive rate
order: the one that is most likely to Horizontal axis rep. the false
belong to the positive class positive rate
appears at the top of the list The plot also shows a
The closer to the diagonal line (i.e., the diagonal line
closer the area is to 0.5), the less A model with perfect
accurate is the model
accuracy will have an area of
1.0
15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
16
7/5/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Prediction
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining
Data Mining
4
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression for Prediction
• A regression task begins with a data set in which the target values are
known, e.g.
• A regression model that predicts house values could be developed based on
observed data for many houses over a period of time.
• The data might track the age of the house, square footage, number of rooms,
taxes, school district, proximity to shopping centers, and so on.
• House value would be the target, the other attributes would be the predictors,
and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data.
• These relationships between predictors and target are summarized in a model,
which can then be applied to a different data set in which the target values are
unknown
Data Mining
5
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prediction Techniques
• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Log-linear models,
• Regression trees
• etc.
Data Mining
6
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Analysis
• Regression analysis seeks to determine the values of parameters for a
function that cause the function to best fit a set of data observations
that you provide.
• The following equation expresses these relationships in symbols.
y = F(x,w) + e
Data Mining
7
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Analysis
• In the equation
y = F(x,w) + e
• The predictors(x1 , x2 , ..., xn) can be understood as independent
variables
• The target (y) is the dependent variable.
• The error (e), also called the residual, is the difference between the
expected and predicted value of the dependent variable.
• The regression parameters are also known as regression coefficients.
• The process of training a regression model involves finding the
parameter values that minimize a measure of the error, for example, the
sum of squared errors.
Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Simple Linear Regression
• Simple Linear regression: involves a response variable y and a single predictor
variable x
y = w 0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line
| D|
∑ ( x − x )( y − y)
w= 1
i =1
i
| D|
i
w = y −wx
0 1
∑ i
( x
i =1
− x ) 2
Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear Regression With a Single Predictor
Data Mining
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Linear Regression
• Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)
y = w0 + w1 x1+ w2 x2
• Solvable by extension of least square method or using SAS, S-Plus
Data Mining
11
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression
Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into linear regression
model. For example,
• y = w0 + w1 x + w2 x2 + w3 x3
• convertible to linear with new variables: x2 = x2, x3= x3
• y = w0 + w1 x + w2 x2 + w3 x3
• Other functions, such as power function, can also be transformed to
linear model
• Some models are intractable nonlinear (e.g., sum of exponential terms)
• possible to obtain least square estimates through extensive
calculation on more complex formulae
Data Mining
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Trees and Model Trees
• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that
reach the leaf
• Model tree: proposed by Quinlan (1992)
• Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
• A more general case than regression tree
• Regression and model trees tend to be more accurate than linear regression
when the data are not represented well by a simple linear model
Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Trees
Humidit Golf
Day Outlook Temp. Wind
y Players
1 Sunny Hot High Weak 25
2 Sunny Hot High Strong 30
3 Overcast Hot High Weak 46
4 Rain Mild High Weak 45
5 Rain Cool Normal Weak 52
6 Rain Cool Normal Strong 23
7 Overcast Cool Normal Strong 43
8 Sunny Mild High Weak 35
9 Sunny Cool Normal Weak 38
10 Rain Mild Normal Weak 46
11 Sunny Mild Normal Strong 48
12 Overcast Mild High Strong 52
13 Overcast Hot Normal Weak 44
14 Rain Mild High Strong 30
[Link]
Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Tree Sample
[Link]
Data Mining
16
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Simple Linear Regression Example
x y
Area (in sq. m) Rent (in 000s of Rupees)
172 42
150 35
181 46
174 40
194 50
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
18
7/5/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Association Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
1
•The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their material
and inputs.
•I have added and modified a few slides to suit the requirements of the course.
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Analysis
Association analysis measures the strength of co-occurrence between one item
and another.
– The objective of this class of data mining algorithms is not to predict an occurrence of an
item, like classification or regression do, but to find usable patterns in the co-occurrences of
the items.
– Association rules learning is a branch of an unsupervised learning process that discovers
hidden patterns in data, in the form of easily recognizable rules
Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Frequent Itemset
Itemset TID Items
– A collection of one or more items 1 Bread, Milk
• Example: {Milk, Bread, Diaper} 2 Bread, Diaper, Butter, Beans
– k-itemset 3 Milk, Diaper, Butter, Coke
• An itemset that contains k items 4 Bread, Milk, Diaper, Butter
Support count (σ) 5 Bread, Milk, Diaper, Coke
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Association Rule
TID Items
Association Rule 1 Bread, Milk
– An implication expression of the form X → Y, 2 Bread, Diaper, Butter, Beans
where X and Y are itemsets 3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
– Example:
{Milk, Diaper} → {Butter} 5 Bread, Milk, Diaper, Coke
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all
rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold
Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
TID Items Example of Rules:
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
{Milk,Diaper} → {Butter} (s=0.4, c=0.67)
{Milk,Butter} → {Diaper} (s=0.4, c=1.0)
3 Milk, Diaper, Butter, Coke
{Diaper,Butter} → {Milk} (s=0.4, c=0.67)
4 Bread, Milk, Diaper, Butter
{Butter} → {Milk,Diaper} (s=0.4, c=0.67)
5 Bread, Milk, Diaper, Coke
{Diaper} → {Milk,Butter} (s=0.4, c=0.5)
{Milk} → {Diaper,Butter} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Butter}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
9
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
10
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation
null
A B C D E
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M=2d
– Use pruning techniques to reduce M
13
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori Algorithm
14
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset
15
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent
A B C D E
AB AC AD AE BC BD BE CD CE DE
Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
Pruned
ABCDE
supersets 17
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
Initially, scan DB once to get frequent
1-itemset TID Items
1 Bread, Milk
Generate length (k+1) candidate 2 Bread, Diaper, Butter, Beans
itemsets from length k frequent
3 Milk, Diaper, Butter, Coke
itemsets
4 Bread, Milk, Diaper, Butter
Test the candidates against DB 5 Bread, Milk, Diaper, Coke
18
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori Algorithm
Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent
19
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Important Details of Apriori
How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
How to count supports of candidates?
20
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle
Minimum Support = 3
TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
21
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle
If every subset is considered,
6C + 6C + 6C = 41
1 2 3
Item Count Items (1-itemsets) With support-based pruning,
Bread 4 6 + 6 + 1 = 13
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Butter 3 {Bread,Milk} 3
Diaper 4 {Bread,Butter} 2 (No need to generate
Beans 1 {Bread,Diaper} 3
candidates involving Coke
{Milk,Butter} 2
Minimum Support = 3 {Milk,Diaper} 3 or Beans)
{Butter,Diaper} 3
TID Items
Triplets (3-itemsets)
1 Bread, Milk
2 Bread, Diaper, Butter, Beans Itemset Count
3 Milk, Diaper, Butter, Coke {Bread,Milk,Diaper} 2
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
23
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Factors Affecting Complexity
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
– since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)
24
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Can we improve Apriori Efficiency?
Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting
25
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hash-based technique
27
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Efficiency Techniques - Dynamic Itemset Counting
ABCD
Once both A and D are determined frequent, the
counting of AD begins
ABC ABD ACD BCD Once all length-2 subsets of BCD are determined
frequent, the counting of BCD begins, thus
reducing effective number of scans
AB AC BC AD BD CD
Transactions
1-itemsets
A B C D
Apriori 2-itemsets
…
{}
Itemset lattice 1-itemsets
2-items
S. Brin R. Motwani, J. Ullman, and S.
Tsur. Dynamic itemset counting and DIC 3-items
implication rules for market basket 28
data. SIGMOD’97
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Efficiency Techniques
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining with Vertical format
30
Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one
transaction, they do not belong to the set of frequent 2-itemsets.
31
32
33
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Closed Patterns and Max-Patterns
Example
– DB = {<a1 …, a100>, < a1 …, a100>, < a1, …, a50>}
– Min_sup = 2
What is the set of closed itemset?
– <a1, …, a100>: 2
– < a1, …, a50>: 3
What is the set of maximal pattern?
– <a1, …, a100>: 2
What is the set of all patterns?
– 1.27*1030
35
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Itemsets
Transaction
null Ids
TID Items
124 123 1234
1 ABC 245 345
A B C D E
2 ABCD
3 BCE
12 124 24 4 123 2 3 24
4 ACDE AB AC AD AE BC BD BE CD
34
CE
45
DE
5 DE
12 2 24 4 4 2 3 4
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
2 4
ABCD ABCE ABDE ACDE BCDE
Not supported by
any transactions ABCDE 36
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Frequent Itemsets
TID Items
Minimum support = 2 null Closed but
1 ABC not maximal
2 ABCD 124 123 1234 245 345
A B C D E
3 BCE
Closed &
4 ACDE
maximal
5 DE
12 124 24 4 123 2 3 24 34 45
AB AC AD AE BC BD BE CD CE DE
12 2 24 4 4 2 3 4
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
2 4
ABCD ABCE ABDE ACDE BCDE
# Closed = 9
# Maximal = 4
ABCDE
37
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Itemsets
Frequent
Itemsets
Closed
Frequent
Itemsets
Maximal
Frequent
Itemsets
38
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
39
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
40
FP-growth Algorithm
Association Analysis (Review)
Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.
4
July 30, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining (Review)
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction
2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a
binary partitioning of a frequent itemset
Bottleneck: candidate-generation-and-test
– Can we avoid candidate generation?
10
11
12
13
p:2 m:1
15
18
19
{}
Cond. pattern base of “am”: (fc:3) f:3
c:3
f:3
am-conditional FP-tree
c:3 {}
{}
Cond. pattern base of “cam”: (f:3)
f:3
cam-conditional FP-tree
20
a1:n1
b1:m1 C1:k1
r1 =
a2:n2
+ b1:m1 C1:k1
Divide-and-conquer:
– decompose both the mining task and DB according to the
frequent patterns obtained so far
– leads to focused search of smaller databases
Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops—counting local freq items and building sub FP-
tree, no pattern search and matching
23
am -proj DB cm -proj DB
fc f …
fc f
fc f
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
• But confidence of rules generated from the same itemset has an anti-
monotone property
• e.g., L = {A,B,C,D}:
CD=>AB BD=>AC
• join(CD=>AB,BD=>AC)
would produce the candidate
rule D => ABC
Support
distribution of
a retail data set
Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100
P(Y | X )
Lift =
P(Y )
P( X , Y )
Interest =
P( X ) P(Y )
Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100
Y Y Y Y
X 10 0 10 X 90 0 90
X 0 90 90 X 0 10 10
10 90 100 90 10 100
0.1 0.9
Lift = = 10 Lift = = 1.11
(0.1)(0.1) (0.9)(0.9)
Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1
• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user
• A pattern is subjectively interesting if it is actionable
+ - Expected Patterns
- + Unexpected Patterns
• Need to combine expectation of users with evidence from data (i.e., extracted
patterns)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.
Clustering Concepts
What is Cluster Analysis?
August 7, 2021
6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is not Cluster Analysis?
• Supervised classification
• Have class label information
• Simple segmentation
• Dividing students into different registration groups alphabetically,
by last name
• Results of a query
• Groupings are a result of an external specification
August 7, 2021 8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measure the Quality of Clustering
August 7, 2021
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Requirements of Clustering in Data Mining
• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability
August 7, 2021
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Type of data in clustering analysis
• Interval-scaled variables
• Binary variables
August 7, 2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Interval-valued variables(standardization)
Z-score: x
z= σ − µ
X: raw score to be standardized,
μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of
the standard deviation
• negative when the raw score is below the mean, “+” when above
• Alternately
• Calculate the mean absolute deviation:
s f = 1n (| x1 f − m f | + | x2 f − m f | +...+ | xnf − m f |)
where m f = 1n (x1 f + x2 f + ... + xnf )
.
August 7, 2021
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity Between Objects
August 7, 2021
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity Between Objects (Cont.)
• If q = 2, d is Euclidean distance:
d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j2 ip jp
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
August 7, 2021
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binary Variables
• A contingency table for binary Object j
1 0 sum
data 1 a b a +b
Object i 0 c d c+d
sum a+c b+d p
d (i, j) = p −
p
m
August 7, 2021
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordinal Variables
August 7, 2021
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
• One may use a weighted formula to combine their effects
Σ pf = 1δ ij( f ) dij( f )
d (i, j) =
Σ pf = 1δ ij( f )
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Notion of a Cluster can be Ambiguous
• Partitional Clustering
• A division data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree
p1
p3 p4
p2
p1 p2 p3 p4
Traditional Hierarchical Clustering
Traditional Dendrogram
p1
p3 p4
p2
p1 p2 p3 p4
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram
• Well-separated clusters
• Center-based clusters
• Contiguous clusters
• Density-based clusters
• Property or Conceptual
• Described by an Objective Function
• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.
3 well-separated clusters
• Center-based
• A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster
4 center-based clusters
8 contiguous clusters
• Density-based
• A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
• Used when the clusters are irregular or intertwined, and when
noise and outliers are present.
6 density-based clusters
2 Overlapping Circles
• Want to minimize the edge weight between clusters and maximize the edge
weight within clusters
Partitioning Methods
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
E = Σ ik=1Σ p∈Ci ( p − ci ) 2
2.5
2 Original Points
1.5
y
1
0.5
3 3
2.5 2.5
2 2
1.5 1.5
y
y
1 1
0.5 0.5
0 0
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
Iteration 6
1
2
3
4
5
3
2.5
1.5
y
0.5
2.5 2.5
2 2
1.5 1.5
y
y
1 1
0.5 0.5
0 0
2 2 2
y
1 1 1
0 0 0
-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x
2.5
1.5
y
0.5
6 6
4 4
2 2
0 0
y
y
-2 -2
-4 -4
-6 -6
0 5 10 15 20 0 5 10 15 20
x
Iteration 3 x
Iteration 4
8 8
6 6
4 4
2 2
0 0
y
y
-2 -2
-4 -4
-6 -6
0 5 10 15 20 0 5 10 15 20
x x
Starting with two initial centroids in one cluster of each pair of clusters
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10 Clusters Example
Iteration 1 Iteration 2
8 8
6 6
4 4
2 2
0 0
y
y
-2 -2
-4 -4
-6 -6
0 5 10 15 20 0 5 10 15 20
Iteration 3 Iteration 4
8 x 8 x
6 6
4 4
2 2
0 0
y
-2 y -2
-4 -4
-6 -6
0 5 10 15 20 0 5 10 15 20
x x
Starting with some pairs of clusters having three initial centroids, while other have only one.
• Multiple runs
• Helps, but probability is not favorable
• Sample and use hierarchical clustering to determine
initial centroids
• Select more than k initial centroids and then select
among these initial centroids
• Select most widely separated
• Postprocessing
• Post-processing
• Eliminate small clusters that may represent outliers
• Split ‘loose’ clusters, i.e., clusters with relatively high SSE
• Merge clusters that are ‘close’ and that have relatively low SSE
48
56
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids)
• Starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA : PAM on samples
• CLARANS : Randomized re-sampling
57
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
4
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns
5
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Clusterings
A clustering is a set of clusters
Partitional Clustering
– A division data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4
x
5 6 7 8 9
8
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
0 1 2 3 4 x 5 6 7 8 9
August 14, 2021 9
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
10
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
11
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
12
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
13
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
14
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
15
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
16
K-means Example
12
x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8
O5 7 5
O6 6 4
6
O7 1 2
y
O8 4 9
4
Clusters
Stabilized 2
0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
17
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Hierarchical Methods
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits
6 5
0.2
4
3 4
0.15 2
5
2
0.1
1
0.05
3 1
0
1 3 2 5 4 6
19
vertebrate invertebrate
20
– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there
are k clusters)
21
22
...
p1 p2 p3 p4 p9 p10 p11 p12
23
C2 C5
...
p1 p2 p3 p4 p9 p10 p11 p12 24
C2 C5
... 25
p1 p2 p3 p4 p9 p10 p11 p12
p1 p2 p3 p4 p5 ...
Similarity?
p1
p2
p3
p4
MIN p5
.
MAX
.
Group Average Proximity Matrix
.
Distance Between Centroids
Other methods driven by an objective
function
26
p2
p3
p4
p5
MIN
.
MAX .
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
27
p2
p3
p4
p5
MIN
.
MAX .
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
28
p1 p2 p3 p4 p5 ...
p1
p2
p3
p4
p5
MIN
.
MAX .
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
29
30
p4
p5
MIN
.
MAX .
Group Average .
Proximity Matrix
Distance Between Centroids
Other methods driven by an objective
function
31
P1 P2,P5, P4
P1 P2,P5 P3,P6 P4
P3,P6
P1 0 .24 .22 .37
P1 0 .22 .37
P2,P5 0 .15 .2 P2,P5, 0 .155
P3,P6 0 .155 P3,P6
P4 0 P4 0
P1 P2,P5,P3,
P6,P4
Hierarchical Clustering
example - min P1 0 0.22
P2,P5,P3, 0
8/14/2021
P6,P4 32
Hierarchical Clustering: MIN
Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
5 Determined by one pair of points, i.e., by one link
1 in the proximity graph.
3
5 0.2
2 1
0.15
2 3 6
0.1
0.05
4
4 0
3 6 2 5 4 1
34
35
P1 P2,P5 P3,P6,
P1 P2,P5 P3,P6 P4 P4
P1 0 .34 .23 .37 P1 0 .34 .37
P2,P5 0 .39 .29 P2,P5 0 .39
P3,P6 0 .22
P3,P6, 0
P4 0 P4
P1, P3,
P2,P5 P6,P4
Hierarchical Clustering
example - max P1, P2,P5 0 0.39
P3, 0
8/14/2021
P6,P4 36
Hierarchical Clustering: MAX
Similarity of two clusters is based on the two least similar (most distant)
points in the different clusters
Determined by all pairs of points in the two clusters
4 1
2 5 0.4
0.35
5
2 0.3
0.25
3 6 0.2
3 0.15
1 0.1
4 0.05
0
3 6 4 1 2 5
Dendrogram
Nested Clusters
37
5 4 1
2 0.25
5 0.2
2
0.15
3 6 0.1
1 0.05
4 0
3 3 6 4 1 2 5
5 4 1
2
5
2
Group Average
3 6
1
4
3
41
42
43
44
45
Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition
47
Density-connected
– A point p is density-connected to a point p q
q w.r.t. Eps, MinPts if there is a point o
such that both, p and q are density-
reachable from o w.r.t. Eps and MinPts o
49
– A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point
– A noise point is any point that is not a core point or a border point.
50
51
8/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points
52
Original Points
Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
54
55
Idea is that for points in a cluster, their kth nearest neighbors are
at roughly the same distance
Noise points have the kth nearest neighbor at farther distance
So, plot sorted distance of every point to its kth nearest neighbor
56
57
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415: Data Mining
(Lecture #13 – Cluster Analysis)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2
Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
– Divisions can be
• Distance based
• Density based
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree
OPTICS
Challenges with DBSCAN
• DBSCAN requires users with the responsibility of selecting
parameter values for the discovery of acceptable clusters.
• Eps (the maximum radius of a neighborhood) and
• MinPts (the minimum number of points required in the
neighborhood of a core object)
• The parameters are empirically set and difficult to determine
especially for real-world, high dimensional data sets.
• Sensitive to these parameter values: Slightly different settings may
lead to very different clustering of the data.
• The real-world, high-dimensional data can have skewed
distributions such that their intrinsic clustering structure may not be
well characterized by a single set of global density parameters.
8
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database wrt its density-
based clustering structure
– This cluster-ordering contains info equivalent to the
density-based clusterings corresponding to a broad range
of parameter settings
– Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
– Can be represented graphically or using visualization
techniques
9
10
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OPTICS algorithm
o
p2 MinPts = 5
ε = 3 cm
12
13
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reachability-distance
undefined
ε
ε
ε ‘
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: Number of data points
N
LS: linear sum of N points: ∑ X i
i =1
CF = (5, (16,30),(54,190))
SS: square sum of N points
N 2 10
(3,4)
∑ Xi
9
(2,6)
8
i =1
7
4 (4,5)
(4,7)
3
(3,8)
0
0 1 2 3 4 5 6 7 8 9 10
18
20
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The CF Tree Structure
Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5
22
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The Birch Algorithm
Cluster Diameter 1 2
∑ (x − x )
n( n − 1) i j
23
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Cluster Validation
Cluster Validity
26
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clusters found in Random Data
1 1
0.9 0.9
Random 0.8
0.7
0.8
0.7
Points 0.6 0.6 DBSCAN
0.5 0.5
y
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
10 0.2 0.4 0.6 0.8 1 10 0.2 0.4 0.6 0.8 1
0.9
x 0.9
x
K-means 0.8
0.7
0.8
0.7
Complete
0.6 0.6
Link
0.5 0.5
y
0.4 0.4
0.3 0.3
0.2 0.2
0.1 0.1
0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x 27
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.
28
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering Tendency
All clustering algorithms find some clusters, whether data has natural clusters
or not.
We need a mechanism to check if at least some clusters are of good quality
Alternatively, we can directly check the data for clustering tendency. A
common approach is to use statistical tests for spatial randomness (in
Euclidean space) among data points.
Hopkins Statistic : Use p points from data and generate a set of p random
points in data space. Let ui be nearest neighbor distances in generated data
and wi be nearest neighbor distances in supplied data. Hopkin statistic H is
defined as
30
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: SSE
Clusters in more complicated figures aren’t well separated
Internal Index: Used to measure the goodness of a clustering structure
without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters
10
6 9
8
4
7
2
6
SSE
0 5
4
-2
3
-4 2
1
-6
0
5 10 15 2 5 10 15 20 25 30
K 31
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: Cohesion and Separation
Cluster Cohesion: Measures how closely related are objects in a cluster
– Example: SSE
Cluster Separation: Measure how distinct or well-separated a cluster is
from other clusters
– Example: Squared Error
– Cohesion is measured by the within cluster sum of squares (SSE)
32
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: Cohesion and Separation
• Example: SSE
• BSS + WSS = constant
m
× × ×
1 m1 2 3 4 m2 5
35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification-based Measures of Cluster Validity
Precision: The fraction of a cluster that consists of objects of a
specified class. The precision of cluster i with respect to a class j is
precision(i,j) = pij
Two matrices
– Proximity Matrix
– Ideal Similarity Matrix
One row and one column for each data point
An entry is 1 if the associated pair of points belong to the same cluster
An entry is 0 if the associated pair of points belongs to different clusters
Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
High correlation indicates that points that belong to the
same cluster are close to each other.
Not a good measure for some density or contiguity based
clusters.
38
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Determine the Number of Clusters
Empirical method
– # of clusters ≈√(n/2) for a dataset of n points
Elbow method
– Use the turning point in the curve of sum of within cluster variance w.r.t
the # of clusters
Cross validation method
– Divide a given data set into m parts
– Use m – 1 parts to obtain a clustering model
– Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and
the closest centroids to measure how well the model fits the test set
– For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best
39
40
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
41
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415: Data Mining
(Lecture #14 – Outlier Analysis)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2
Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
4
6
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
More on Outlier/Anomaly Detection
Challenges
– How many outliers are there in the data?
– Method is unsupervised (sometimes supervised methods are used)
• Validation can be quite challenging (just like for clustering)
– Finding needle in a haystack
Working assumption:
– There are considerably more “normal” observations than “abnormal”
observations (outliers/anomalies) in the data
Outlier detection vs. novelty detection (identify new topics and trends
in a timely manner in social media): early stage, outlier; but later
merged into the model
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Outliers (Contd.)
Collective Outliers
– A subset of data objects collectively deviate significantly from the whole
data set, even if the individual data objects may not be outliers
– Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service
packages to each other
• Detection of collective outliers
• Consider not only behavior of individual objects, but also that of
groups of objects
• Need to have the background knowledge on the relationship among
data objects, such as a distance or similarity measure on objects.
• A data set may have multiple types of outlier
• One object may belong to more than one type of outlier
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outlier Detection I: Supervised Methods
Outlier Detection I: Supervised Methods
– Modeling outlier detection as a classification problem
• Samples examined by domain experts used for training & testing
– Methods for Learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as
outliers, or
• Model outliers and treat those not matching the model as normal
– Challenges
• Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
• Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outlier Detection II: Unsupervised Methods
Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each having
some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
– Normal objects may not share any strong patterns, but the collective outliers may
share high similarity in a small area
Ex. In some intrusion or virus detection, normal activities are diverse
– Unsupervised methods may have a high false positive rate but still miss many real
outliers.
– Supervised methods can be more effective, e.g., identify attacking some key resources
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Contextual Outliers:
Transform into Conventional Outlier Detection
If the contexts can be clearly identified, transform it to conventional outlier
detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
Ex. Detect outlier customers in the context of customer groups
– Contextual attributes: age group, postal code
– Behavioral attributes: # of trans/yr, annual total trans. amount
Steps:
(1) locate c’s context,
(2) compare c with the other customers in the same group, and
(3) use a conventional outlier detection method
15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Contextual Outliers:
Modeling Normal Behavior with Respect to Contexts
In some applications, one cannot clearly partition the data into contexts
– Ex. if a customer suddenly purchased a product that is unrelated to those
she recently browsed, it is unclear how many products browsed earlier
should be considered as the context
Model the “normal” behavior with respect to contexts
– Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute values
– An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
Using a prediction model that links the contexts and behavior, these methods
avoid the explicit identification of specific contexts
Methods: A number of classification and prediction techniques can be used to
build such models, such as regression, Markov Models, and Finite State
Automaton
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers :
On the Set of “Structured Objects”
Collective outlier - objects as a group deviate from the
entire data
Need to examine the structure of the data set,
i.e, the relationships between multiple data objects
Each of these structures is inherent to its respective type
of data
For temporal data (such as time series and
sequences)
explore the structures formed by time, which occur in
segments of the time series or subsequences
For spatial data, explore local areas
For graph and network data, we explore subgraphs
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers :
On the Set of “Structured Objects”
Difference from the contextual outlier detection: the structures are often
not explicitly defined, and have to be discovered as part of the outlier
detection process.
Collective outlier detection methods: two categories
Reduce the problem to conventional outlier detection
Identify structure units, treat each structure unit (e.g.,
subsequence, time series segment, local area, or subgraph) as a
data object, and extract features
Then outlier detection on the set of “structured objects”
constructed as such using the extracted features
e.g. Detect collective outliers in online social network of customers
Treat each possible subgraph of the network as a structure unit
Collective outlier: An outlier subgraph in the social network
Small subgraphs that are of very low frequency
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers II:
Direct Modeling of the Expected Behavior of Structure Units
Model the expected behavior of structure units directly
– e.g. Detect collective outliers in temporal sequences
• Learn a Markov model from the sequences
• A subsequence can then be declared as a collective outlier if it significantly deviates
from the model
19
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Challenges of Outlier Detection
• Modeling normal objects and outliers properly
• Hard to enumerate all possible normal behaviors in an application
• The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
• Choice of distance measure among objects and the model of relationship
among objects are often application-dependent
• E.g., clinic data: a small deviation could be an outlier; while in marketing
analysis, larger fluctuations
• Handling noise in outlier detection
• Noise may distort the normal objects and blur the distinction between
normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
• Understandability
• Understand why these are outliers: Justification of the detection
• Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
20
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Statistical Outliers
Statistical Approaches
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
– The effectiveness of statistical methods highly depends on whether the
assumptions made for the statistical model hold true for the given data
Statistic models used in the methods may be parametric or
nonparametric.
– A parametric method assumes that the normal data objects are
generated by a parametric distribution
– A nonparametric method does not assume an a priori statistical
model. Instead, a nonparametric method tries to determine the
model from the input data.
22
Ha : oi ϵ G, where i=1,2,….n
Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Parametric Methods:
Using Mixture of Parametric Distributions
• To overcome this problem, assume the normal data is generated by two normal
distributions. For any object o in the data set, the probability that o is generated
by the mixture of the two distributions is given by
• where fθ1 and fθ2 are the probability density functions of θ1 and θ2
• Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
• An object o is an outlier if it does not belong to any cluster
31
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Detecting outliers
There are two basic types of procedures for detecting outliers:
Block procedures: In this case, either all of the suspect objects are treated
as outliers or all of them are accepted as consistent.
Intuition: Objects that are far away from the others are outliers
Assumption of proximity-based approach: The proximity of an
outlier deviates significantly from that of most of the others in
the data set
Two types of proximity-based outlier detection methods
– Distance-based outlier detection: An object o is an outlier if
its neighborhood does not have enough other points
– Density-based outlier detection: An object o is an outlier if
its density is relatively much lower than that of its
neighbors
35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection
36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection
Efficient computation: Nested loop
algorithm
– For any object oi, calculate its
distance from other objects, and
count the # of other objects in
the r-neighborhood.
– If π∙n other objects are within r
distance, terminate the inner
loop
– Otherwise, oi is a DB(r, π)
outlier
Efficiency: Actually CPU time is not O(n2)
but linear to the data set size since for
most non-outlier objects, the inner loop
terminates early
37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: Improving Algorithm
Why efficiency is still a concern? When the complete set of objects
cannot be held into main memory, cost of I/O swapping will be high
The major cost:
(1) each object tests against the whole data set, why not only its
close neighbor?
(2) instead of checking objects one by one, why not group by
group?
Grid-based method (CELL): Data space is partitioned into a multi-D
grid. Only adjoining cells are checked for determining if object is an
outlier
38
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: A Grid-Based Method
• Thus we only need to check the objects that cannot be pruned, and even for
such an object o, only need to compute the distance between o and the
objects in the level-2 cells (since beyond level-2, the distance from o is more
than r)
39
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: Limitations
Density-Based Outlier
Density-Based Outlier Detection
Local outliers: Outliers comparing to their local neighborhoods, instead of the
global data distribution
In Fig., o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an
outlier. However, proximity-based clustering cannot find o1 and o2 are outlier
(e.g., comparing with O4).
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Density-Based Outlier Detection
Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data
distribution
In Fig., o1 and o2 are local outliers to C1, o3 is a global
outlier, but o4 is not an outlier. However, proximity-
based clustering cannot find o1 and o2 are outlier
(e.g., comparing with O4).
43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Local Reachability Density
k-distance of an object o, distk(o): distance between o and its k-th NN
k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
– Nk(o) could be bigger than k since multiple objects may have identical
distance to o
Reachability distance from o’ to o:
• LOF (Local outlier factor) of an object o is the average of the ratio of local
reachability of o and those of o’s k-nearest neighbors
• the local outlier factor is the average of the ratio of the local reachability
density of o and those of o's k-nearest neighbors
• The lower the local reachability density of o, and the higher the local
reachability density of the k-NN of o, the higher LOF
• This captures a local outlier whose local density is relatively low comparing to
the local densities of its k-NN
45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering-Based
Basic idea:
– Cluster the data into groups of
different density
– Choose points in small cluster as
candidate outliers
– Compute the distance between
candidate points and non-candidate
clusters.
• If candidate points are far from all
other non-candidate points, they
are outliers
• Bayes theorem:
• More generally:
Now what is the probability that the patient has the disease?
The Base-Rate Fallacy and the Difficulty of Intrusion Detection by Stefan Axelsson ACM Transactions on Information and System Security
54
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S1-20_DSECLZC415 : Data Mining
(Lecture #15 - Mining Unstructured Data)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2
Structured data
comprised of clearly defined data types whose pattern makes
them easily searchable
4
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Unstructured Data
Can be
– Text
– www
– Multimedia
– Graph
– Spatial data
……
5
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
9
Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining ChengXiang Zhai, Sean Massung, ACM Books 2016
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General NLP—Too Difficult!
Word-level ambiguity
– “design” can be a noun or a verb (Ambiguous POS)
– “root” has multiple meanings (Ambiguous sense)
Syntactic ambiguity
– “natural language processing” (Modification)
– “A man saw a boy with a telescope.” (PP Attachment)
Anaphora resolution
– “John persuaded Bill to buy a TV for himself.”
(himself = John or Bill?)
Presupposition
– “He has quit smoking.” implies that he smoked before.
Information Retrieval
Text Databases and IR
Text databases (document databases)
– Large collections of documents from various sources: news
articles, research papers, books, digital libraries, e-mail
messages, and Web pages, library database, etc.
– Data stored is usually semi-structured
– Traditional information retrieval techniques become inadequate
for the increasingly vast amounts of text data
Information retrieval
– A field developed in parallel with database systems
– Information is organized into (a large number of) documents
– Information retrieval problem: locating relevant documents
based on user input, such as keywords or example documents
12
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Information Retrieval
Typical IR systems
– Online library catalogs
– Online document management systems
Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
– Some IR problems are not addressed well in DBMS, e.g.,
unstructured documents, approximate search using
keywords and relevance
13
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to
the query (i.e., “correct” responses)
Recall: the percentage of documents that are relevant to the query and were,
in fact, retrieved
All Documents
15
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Indexing Process
Text + Meta data
(Doc type, structure,
features, size, etc.)
Takes index terms and
Identifies and stores creates data structures
documents for (indexes) to support
indexing fast searching
16
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Zipf’s Law
Distribution of word frequencies is very skewed
Few words occur very often, many hardly ever occur
e.g., “the” and “of”, two common words, make up about 10%
of all word occurrences in text documents
Zipf’s law:
The frequency f of a word in a corpus is inversely proportional
to its rank r (assuming words are ranked in order of
decreasing frequency)
k
f = r f×r=k
20
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heaps’ Law on the Web
Heaps’ Law works with very large corpora
New words occurring even after seeing 30 million!
23
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Boolean Model
24
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Keyword-Based Retrieval
A document is represented by a string, which can be identified
by a set of keywords
Queries may use expressions of keywords
– E.g., car and repair shop, tea or coffee, DBMS but not
Oracle
– Queries and retrieval should consider synonyms, e.g.,
repair and maintenance
Major difficulties of the model
– Synonymy: A keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining
– Polysemy: The same keyword may mean different things in
different contexts, e.g., mining
25
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity-Based Retrieval in Text Data
26
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity-Based Retrieval in Text Data (contd)
– Word stem
• Several words are small syntactic variants of each other since they
share a common word stem
• E.g., drug, drugs, drugged
– A term frequency table
• Each entry frequent_table(i, j) = # of occurrences of the word ti in
document di
• Usually, the ratio instead of the absolute number of occurrences is
used
– Similarity metrics: measure the closeness of a document to
a query (a set of keywords)
• Relative term occurrences v1 ⋅ v2
sim(v1 , v2 ) =
• Cosine distance: | v1 || v2 |
27
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
29
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What VS Model Does Not Specify
How to select terms to capture “basic concepts”
– Word stopping
• e.g. “a”, “the”, “always”, “along”
– Word stemming
• e.g. “computer”, “computing”, “computerize” => “compute”
– Latent semantic indexing
How to assign weights
– Not all words are equally important: Some are more
indicative than others
• e.g. “algebra” vs. “science”
How to measure the similarity
30
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Assign Weights
31
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Cornell SMART TF-IDF Model
TF is computed using the equation:
32
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Alternate TF-IDF Weighting
TF Weighting: IDF Idea:
– More frequent => more – Less frequent among
documents more
relevant to topic discriminative
• Raw TF= f(t,d): how
many times term t Formula:
appears in doc d
Normalization:
– Document length varies n — total number of docs
=> relative frequency k — # docs with term t
preferred appearing
• e.g., Maximum
frequency normalization (IDF – inverse document
frequency)
33
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
TF-IDF Weighting
TF-IDF weighting : weight(t, d) = TF(t, d) * IDF(t)
– Freqent within doc high tf high weight
– Selective among docs high idf high weight
Recall VS model
– Each selected term represents one dimension
– Each doc is represented by a feature vector
– Its t-term coordinate of document d is the TF-IDF weight
– This is more reasonable
Just for illustration …
– Many complex and more effective weighting variants exist in
practice
34
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Measure Similarity?
Given two document
Similarity definition
– dot product
35
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrative Example
text
doc1 mining
Sim(newdoc,doc1)=(4.8*2.4+4.5*4.5)/||v1||*||vn||
search
engine
Sim(newdoc,doc2)=2.4*2.4/||vn||*||v2||
text
To whom is newdoc
more similar?
travel
text
Sim(newdoc,doc3)=0
doc2 map
travel
37
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Intuitively, if a user likes document d, how likely would the user enter
query q in order to retrieve document d?
39
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Probabilistic Retrieval Models
40
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
9/14/2021 42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Keyword-Based Association Analysis
• Motivation
• Collect sets of keywords or terms that occur frequently together and
then find the association or correlation relationships among them
• Association Analysis Process
• Preprocess the text data by parsing, stemming, removing stop words,
etc.
• Evoke association mining algorithms
• Consider each document as a transaction
• View a set of keywords in the document as a set of items in the transaction
• Term level association mining
• No need for human effort in tagging documents
• The number of meaningless results and the execution time is greatly
reduced
9/14/2021 43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Text Classification
• Motivation
• Automatic classification for the large number of on-line text documents
(Web pages, e-mails, corporate intranets, etc.)
• Classification Process
• Data preprocessing
• Definition of training set and test sets
• Creation of the classification model using the selected classification
algorithm
• Classification model validation
• Classification of new/unknown text documents
• Text document classification differs from the classification of relational data
• Document databases are not structured according to attribute-value pairs
9/14/2021 44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Document Clustering
• Motivation
• Automatically group related documents based on their
contents
• No predetermined training sets or taxonomies
• Generate a taxonomy at runtime
• Clustering Process
• Data preprocessing: remove stop words, stem, feature
extraction, lexical analysis, etc.
• Hierarchical clustering: compute similarities applying clustering
algorithms.
• Model-Based clustering: clusters are represented by
“exemplars”. (e.g.: Self-Organizing Maps)
9/14/2021 45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
VS Model-Based Classifiers
9/14/2021 46
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Mining WWW
Challenges in Mining WWW
• The Web is too huge for effective data warehousing and data mining.
It is barely possible to set up a data warehouse to replicate, store, or
integrate all of the data on the Web
• The complexity of Web pages is far greater than that of any
traditional text document collection. There is no index by category,
nor by title, author, cover page, table of contents, etc.
• The Web is a highly dynamic information source. Not only does the
Web grow rapidly, but is also constantly updated. Linkage
information and access records are also updated frequently
• The Web serves a broad diversity of user communities. Users may
have very different backgrounds, interests, and usage purposes.
• Only a small portion of the information on the Web is truly relevant
or useful. It is said that 99% of the Web information is useless to
99% of Web users. How can the portion of the Web that is truly
relevant to your interest be determined? How can we find high
quality Web pages on a specified topic?
48
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Search Engine – Two Rank Functions
Ranking based on link
Search structure analysis
Importance Ranking
Rank Functions (Link Analysis)
Similarity
based on Relevance Ranking
content or text Backward Link Web Topology
(Anchor Text) Graph
Inverted Indexer
Index
Anchor Text Web Graph
Generator Constructor
Web Pages
49
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- like index in a book
aalborg 3452, 11437, …..
.
.
.
indexing .
.
arm 4, 19, 29, 98, 143, ...
disks with armada 145, 457, 789, ...
documents armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
.
.
.
.
.
zz 602, 1189, 3209, ...
50
inverted index
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
PageRank
Introduction on PageRank
53
54
55
A B
C D
• R(.) = PageRank of a Webpage
1. R(A) = 100%R(B) + 50%R(C)
2. R(B) = 50%R(C) + 100%R(D)
3. R(C) = 100%R(A)
57
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Signals for Google Search Engine
• Google apparently deals with the 15 percent of queries a day
it gets which its systems have never seen before.
• For last 5 years, Google has been using RankBrain, which uses
AI/ML to make a guess as to what words or phrases might
have a similar meaning
• Google Executives state that three major search ranking
factors are
• Content (of query)
• Links
• RankBrain
[Link]
Clark, Jack. "Google Turning Its Lucrative Web Search Over to AI Machines". Bloomberg Business.
2015 58
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
59
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
60
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415 : Data Mining
(Lecture #17 - Data Mining Applications)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2
9/22/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
Data Mining Applications
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recommendation System
• System that produces individualized recommendations as output or has the effect of
guiding the user in a personalized way to interesting or useful objects in a large
space of possible options
Burke (2002)
• Recommender systems suggest items of interest to users based on their explicit and
implicit preferences, the preferences of other users, and user and item attributes
Schein et al. (2005)
Popularity
what is popular,
while on-line institutions can
make everything available
Product
Some popular recommendation engines which have been proved as highly profitable:
the [Link],
the [Link] and
the [Link] recommendation engines
Recommender Systems for Location-based Social Networks by Panagiotis Symeonidis, Dimitrios Ntempos and Yannis
Manolopoulos Springer © 2014
8
An extreme example of how the long tail, together with a well designed
recommendation system can influence events is the story told by Chris
Anderson about a book called Touching the Void. This mountain-climbing
book was not a big seller in its day, but many years after it was published,
another book on the same topic, called Into Thin Air was published.
Amazon’s recommendation system noticed a few people who
bought both books, and started recommending Touching the Void to
people who bought, or were considering, Into Thin Air. Had there been no
on-line bookseller, Touching the Void might never have been seen by
potential buyers, but in the on-line world, Touching the Void eventually
became very popular in its own right, in fact, more so than Into Thin Air.
10
11
12
13
14
3. Stereotypes Users with similar attributes are – No cold-start problem – Obtaining information –Allocates learners to
or matched, then recommends – Domain-independent –Insufficient groups
demographics items that are preferred by information –Benefits from
CF similar users (based on user data –Only popular taste experience
instead of ratings). –Obtaining metadata –Recommendation from
information the beginning of the RS
–Maintenance ontology
TEL : Technology Enhanced Learning
15
18
20
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Why study Fraud?
[Link]
22
23
• Often, the extra accuracy is associated with higher cost, but the cost
of not doing so may be much higher
24
25
26
27
28
29
• Profiles of the past behavior and actions of both the fraudsters and the
nonfraudsters must be built and employed in the modeling methodology
30
31
[Link]
32
[Link]
33
• In the example,
Suzie Smith has a
direct relationship
with one closed
business and a
second degree
relationship with
another closed
business.
• The state can deny
the new
registration or
conduct additional
investigations in a
cost effective
manner.
[Link] 34
[Link]
36
37
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Sentiment Analysis
• Monitoring what consumers are saying about a company's brands and products
and how they are expressing their opinions and sentiments to others has
always been important to businesses.
• Until the last century, businesses typically used surveys and focus groups from
time to time to gauge and track consumer sentiments.
• With the widespread adoption of the Internet, the proliferation of social media
channels (such as Twitter, Facebook, and others), and the abundant opportunity
for consumers to express their opinions and sentiments, monitoring sentiment
continuously has become more critical
• "Conventional marketing wisdom long held that a dissatisfied customer tells ten
people; but in the age of new social media, he or she has the tools to tell
millions,"
- Paul Gillin, author of The New Influencers: A Marketer's Guide to the New Social Media
39
40
41
42
43
44
• But, there are also context-dependent lexicons in which the polarity of the
word depends on the domain or context, e.g. the word "small" can be
positive or negative depending on the context. The sentence, "The size
seems small." can be positive for a USB flash drive with 1 TB capacity. But,
the same sentence can be interpreted as negative if the context is an LED
big-screen TV.
46
47
48
49
50
51
52
9/22/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Mid-Semester Review
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
1
•The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their material
and inputs.
•I have added and modified a few slides to suit the requirements of the course.
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is Data Mining?
Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes
• Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
• E.g., customer _ID, name, address
• Types:
• Nominal
• Binary
• Symmetric
• Asymmetric
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled
Data Mining
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube
Data Mining
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
Classification • Decision Tree based Methods
– predicts categorical class • Rule-based Methods
labels (discrete or nominal)
– classifies data (constructs a • Neural Networks
model) based on the training - computational networks that
simulate the decision process in
set and the values (class neurons (networks of nerve cell)
labels) in a classifying
attribute and uses it in • Naïve Bayes and Bayesian Belief
classifying new data Networks
- uses the probability theory to find
the most likely of the possible
Prediction classifications
– models continuous-valued • Support Vector Machines
functions, i.e., predicts - fits a boundary to a region of points
unknown or missing values that are all alike; uses the boundary
to classify a new point
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent-pattern Mining
Bottleneck: candidate-generation-and-test
Can we optimize scan, candidate generation, test cycle?
11
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
12
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You
13
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Review
Data Mining
Data Mining
9/21/2021 3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining in Business Intelligence
Increasing potential
to support
business decisions End User
Decision
Making
Data Exploration
Statistical Summary, Querying, and Reporting
Data Mining
9/21/2021 4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
• Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.
Data Mining
September 21, 2021
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Good Clustering
• A good clustering method will produce • Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function,
high quality clusters with typically metric: d(i, j)
• high intra-class similarity • There is a separate “quality” function that
measures the “goodness” of a cluster.
• low inter-class similarity
• The definitions of distance functions are
• The quality of a clustering result usually very different for interval-scaled,
boolean, categorical, ordinal ratio, and
depends on both the similarity
vector variables.
measure used by the method and its • Weights should be associated with different
implementation variables based on applications and data
semantics.
• The quality of a clustering method is
• It is hard to define “similar enough” or
also measured by its ability to discover “good enough”
some or all of the hidden patterns • the answer is typically highly
subjective.
Data Mining
September 21, 2021 10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring Cluster Validity
• For graph-based analysis, the cohesion
of a cluster can be defined as the sum
of weights of links in the proximity
graph that connects points within the
cluster.
• Separation between two clusters can
be measured by the sum of weights of
the links from points in one cluster to
points in the other cluster.
Data Mining
9/21/2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Outliers
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• Object is Og if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• Object is Oc if it deviates significantly based on a selected context
• Ex. 80o F in Urbana: outlier? (depending on summer or winter?)
• Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers—whose density significantly deviates from its
local area
• Issue: How to define or formulate meaningful context?
• Collective Outliers
• A subset of data objects collectively deviate significantly from the whole data set, even if the
individual data objects may not be outliers
• Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service packages to each other
Data Mining
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books
Data Mining
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956