Introduction to Data Mining Course

Uploaded by

Sridhar Eswaran

S2-20_DSECLZC415

Introduction to Data Mining

BITS Pilani
Pilani|Dubai|Goa|Hyderabad

• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

Textbooks/Reference Books
Text Books
T1 Tan P. N., Steinbach M. & Kumar V., “Introduction to Data Mining”, Pearson
Education, 2006
T2 Jiawei Han, Micheline Kamber and Jian Pei, “Data Mining: Concepts and Techniques”,
Third Edition, Morgan Kaufmann Publishers, 2011

Reference Book(s) & other resources

R1 Predictive Analytics and Data Mining: Concepts and Practice with
RapidMiner by Vijay Kotu and Bala Deshpande Morgan Kaufmann
Publishers © 2015
Additional references may be given during lectures


Modular Structure
No Title of the Module

M1 Introduction to Data Mining

M2 Data Preprocessing
M3 Data Exploration
M4 Classification and Prediction
M5 Clustering
M6 Association Analysis
M7 Anomaly Detection
M8 Data mining on unstructured (Big) data
M9 Data Mining Applications


Evaluation Scheme
No Name Type Weight
1. Quiz-I Online 5%
Quiz-II Online 5%
Assignment Group 10%
2. Mid-Semester Test 30%
3. Comprehensive Exam 50%


Data Mining Defined


What Is Data Mining?
Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown,
and potentially useful) patterns or knowledge from huge amounts
of data
Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems


What is (not) Data Mining?
What is not Data Mining?
– Look up a phone number in a phone directory
– Query a Web search engine for information about “Amazon”
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien,
O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by a search engine according
to their context (e.g. Amazon rainforest, [Link],)

Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every
day. Each query can be viewed as a transaction where the user describes her
or his information need.
What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time? Some patterns found in
user search queries can disclose invaluable knowledge that cannot be obtained
by reading individual data items alone.
For example, Google's Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who
search for flu-related information and the number of people who actually have
flu symptoms. A pattern emerges when all of the search queries related to flu
are aggregated. Using aggregated Google search data, Flu Trends can
estimate flu activity up to two weeks faster than traditional systems can.
This example shows how data mining can turn a large collection of data into
knowledge that can help meet a challenge.

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evolution of Database Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems


Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Statistics, Machine Learning/AI/Pattern Recognition, and Database systems]


Data Mining in Business Intelligence

[Figure: layered pyramid; the potential to support business decisions increases from the bottom layer to the top]
Layers, top to bottom:
– Decision Making
– Data Presentation: Visualization Techniques
– Data Mining: Information Discovery
– Data Exploration: Statistical Summary, Querying, and Reporting
– Data Preprocessing/Integration, Data Warehouses
– Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA


Data Mining/KDD Process

Input Data → Data Pre-Processing → Data Mining → Post-Processing

Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
KDD – Knowledge Discovery in Databases


Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at
Carnegie Mellon University and author of the book Machine
Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with the experience E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to
performance increases (to satisfy the definition)

Many data mining tasks are executed successfully with the help of machine learning
Machine Learning: Hands-on for Developers and Technical Professionals by Jason Bell John Wiley & Sons


Multi-Dimensional View of Data Mining
Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and web, multi-media, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.

Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems),
there are many other kinds of data that have diverse forms and structures
and different semantic meanings.
Examples of data can be :
time-related or sequence data (e.g., historical records, stock exchange data, and
time-series and biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously
transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or
integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures
(e.g., sequences, trees, graphs, and networks) and specific semantics (such as
ordering, image, audio and video contents, and connectivity)

Data Mining Activities


Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or future values of other variables.

Description Methods
– Find human-interpretable patterns that describe the data.

From [Fayyad, [Link].] Advances in Knowledge Discovery and Data Mining, 1996

Experts have more terms:

Gartner Analyst View:

[Link]

SCM Expert View:

[Link]

Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]


Classification: Definition
Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of
other attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build the
model and test set used to validate it.


Classification Example

Training Set (used to learn the classifier model)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set (class unknown; the learned model is applied to these records)

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
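The split between building and validating a model can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the course: the function names `holdout_split`, `fit_majority_class`, and `accuracy` are hypothetical, the 30% test fraction is an assumption, and the "model" is only a majority-class baseline standing in for a real classifier. The records are the ten labelled rows of the training set in the classification example.

```python
import random
from collections import Counter

# Labelled records: (refund, marital_status, taxable_income_in_thousands, cheat)
records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows and split them into (training set, test set)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def fit_majority_class(train_rows):
    """Baseline 'model': always predict the most frequent class seen in training."""
    counts = Counter(row[-1] for row in train_rows)
    majority = counts.most_common(1)[0][0]
    return lambda row: majority


def accuracy(model, rows):
    """Fraction of rows whose predicted class equals the true class."""
    correct = sum(1 for row in rows if model(row) == row[-1])
    return correct / len(rows)


train, test = holdout_split(records)
model = fit_majority_class(train)
print(len(train), len(test), accuracy(model, test))
```

A real classifier (decision tree, nearest neighbour, and so on) would replace `fit_majority_class`; the split and the accuracy measurement stay the same.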


Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997


Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its account-
holder as attributes.
• When does a customer buy, what does he buy, how often he pays on
time, etc
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.


Classification: Application 3
Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time-of-the day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


Clustering Definition
Given a set of data points, each having a set of attributes, and
a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.
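As a small illustration of the Euclidean similarity measure, the sketch below computes distances between points and assigns each point to its nearest cluster centre. The function names, the sample points, and the two centroids are made up for the example; choosing the centroids is the job of a clustering algorithm and is not shown here.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points with continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def assign_to_nearest(points, centroids):
    """Return, for each point, the index of the closest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: euclidean(p, centroids[i]))
        for p in points
    ]


# Hypothetical 2-D data: two points near (1, 1.5) and two near (8.5, 9).
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centroids = [(1.0, 1.5), (8.5, 9.0)]

print(euclidean((0.0, 0.0), (3.0, 4.0)))    # 5.0
print(assign_to_nearest(points, centroids))  # [0, 0, 1, 1]
```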


Illustrating Clustering

[Figure: Euclidean distance based clustering in 3-D space; intracluster distances are minimized and intercluster distances are maximized]


Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any
subset may conceivably be selected as a market target to be reached
with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.


Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other based
on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different terms.
Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.


Association Rule Discovery: Definition
Given a set of records, each of which contains some number of
items from a given collection:
– Produce dependency rules that predict the occurrence
of an item based on occurrences of other items.
Example transactions
TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Butter}
{Milk, Bread} → {Beans, Coke}
{Butter, Bread} → {Milk}
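One common way to judge a candidate rule is by its support and confidence. These two measures are not defined on this slide, so treat them as an assumption brought in for illustration: support is the fraction of transactions containing every item of the rule, and confidence is the fraction of transactions containing the antecedent that also contain the consequent. The helper names below are hypothetical; the transactions are the five listed in the example.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Beans"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent (0.0 if the antecedent never occurs)."""
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        return 0.0
    return count(set(antecedent) | set(consequent), transactions) / n_antecedent


# Rule {Diaper} -> {Butter}: Diaper occurs in 4 transactions, 3 of them with Butter.
print(support({"Diaper", "Butter"}, transactions))       # 0.6
print(confidence({"Diaper"}, {"Butter"}, transactions))  # 0.75
```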


Association Rule Discovery: Application 1
Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should
be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be
affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to
see what products should be sold with Bagels to promote sale of Potato
chips!


Association Rule Discovery: Application
Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped with the
right parts to reduce the number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.


Sequential Pattern Discovery: Definition
Given a set of objects, each associated with its own timeline of
events, find rules that predict strong sequential dependencies among
different events.

(A B) (C) (D E)

Rules are formed by first discovering patterns. Event occurrences in the
patterns are governed by timing constraints.

Timing constraints include maxgap (xg), mingap (ng), windowsize (ws), and
maxspan (ms).

[Figure: the pattern (A B) (C) (D E) annotated with the constraints <= xg, > ng, <= ws, and <= ms]


Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
– (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)

In point-of-sale transaction sequences,

– Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
– Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)


Prediction/Regression
Predict a value of a given continuous valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
Extensively studied in statistics and in the neural network field.
Examples:
• Predicting sales amounts of new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Time series prediction of stock market indices.
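For the linear case, a least-squares line through one predictor can be sketched as follows. The function name `fit_line` and the advertising/sales figures are invented for illustration (the data are constructed to lie exactly on sales = 2 × spend + 1); real problems usually involve several predictors and noisy data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales amount.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(ad_spend, sales)
predicted = slope * 6.0 + intercept  # predicted sales for a spend of 6.0
print(slope, intercept, predicted)   # 2.0 1.0 13.0
```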


Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection


Gartner’s Magic Quadrant


DM Process & Challenges


DM Process
The standard data mining process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in
real world, and
5. production deployment.
A popular data mining process framework is CRISP-DM (Cross
Industry Standard Process for Data Mining). This framework
was developed by a consortium of companies involved in data
mining.


Generic Data Mining Process


Prior Knowledge

Data Mining tools/solutions identify hidden patterns.

– Generally we get many patterns
– Out of them many could be false or trivial.
– Filtering false patterns requires domain understanding.
Understanding how the data is collected, stored, transformed,
reported, and used is essential.


Data Preparation
Data needs to be understood. It requires descriptive statistics such as mean, median, mode,
standard deviation, and range for each attribute
Data quality is an ongoing concern wherever data is collected, processed, and stored.
– The data cleansing practices include elimination of duplicate records, quarantining outlier records
that exceed the bounds, standardization of attribute values, substitution of missing values, etc.
– It is critical to check the data using data exploration techniques, in addition to using prior knowledge
of the data and business, before building models to ensure a certain degree of data quality
Missing Values
– Need to track the data lineage of the data source to find right solution
Data Types and Conversion
– The attributes in a data set can be of different types, such as continuous numeric (interest rate),
integer numeric (credit score), or categorical
– Data mining algorithms impose different restrictions on what data types they accept as inputs
Transformation
– Can go beyond type conversion, may include dimensionality reduction or numerosity reduction
Outliers are anomalies in the data set
– May occur legitimately or erroneously.
Feature Selection
– Many data mining problems involve a data set with hundreds to thousands of attributes, most of
which may not be helpful. Some attributes may be correlated, e.g. sales amount and tax.
Data Sampling may be adequate in many cases
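The descriptive statistics mentioned at the start of this slide can be computed per attribute with Python's standard `statistics` module. The helper name `describe` and the credit-score values are made up for the example.

```python
import statistics


def describe(values):
    """Descriptive statistics for one numeric attribute."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "range": max(values) - min(values),
    }


# Hypothetical values of an integer attribute such as credit score.
credit_scores = [500, 600, 600, 700, 800]
summary = describe(credit_scores)
for name, value in summary.items():
    print(name, value)
```

Running such a summary for every attribute before modeling is a quick way to spot suspicious ranges, constant columns, or heavily skewed values.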

Modeling & Evaluation
A model is the abstract representation of the data and its relationships in a given data set.
Data mining models can be classified into the following categories: classification, regression,
association analysis, clustering, and outlier or anomaly detection.
Each category has a few dozen different algorithms; each takes a slightly different approach
to solve the problem at hand.

Application
The model deployment stage considerations:
– assessing model readiness, technical integration, response time, model maintenance, and
assimilation
Production Readiness
– Real-time response capabilities, and other business requirements
Technical Integration
– Use of modeling tools (e.g. RapidMiner), Use of PMML for portable and consistent format
of model description, integration with other tools
Timeliness
– The trade-offs between production responsiveness and build time need to be considered
Remodeling
– The conditions in which the model is built may change after deployment
Assimilation
– The challenge is to assimilate the knowledge gained from data mining in the organization.
For example, the objective may be finding logical clusters in the customer database so that
separate treatment can be provided to each customer cluster.

CRISP data mining framework
CRISP is the most popular methodology for analytics, data mining, and data science projects,
with 43% share as per the 2014 KDnuggets Poll.
CRISP-DM was conceived in 1996. In 1997 it got underway as a European Union project, led by
SPSS, Teradata, Daimler AG, NCR Corporation and OHRA.


DM Issues/Challenges
DM Issues/Challenges – Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining—an interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining


DM Issues/Challenges

DM Issues/Challenges – User Interaction

Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results

DM Issues/Challenges - Efficiency and Scalability

Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Cloud computing and cluster computing

DM Issues/Challenges

DM Issues/Challenges - Diversity of Database Types

Handling complex types of data

Mining dynamic, networked, and global data repositories

DM Issues/Challenges - Society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining


Text Books

Author(s), Title, Edition, Publishing House

T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers


Thank You


Source: [Link]

                 Database       Data Mart          Data Warehouse         Data Lake
Source           Single         Single             Multiple               Multiple
Structure        Structured     Structured         Structured             Unstructured
Purpose          Determined     Determined         Determined             Undetermined
Storage          Centralized    Decentralized      Centralized            Centralized
Granularity      Detailed       Summarized         Detailed & Summary     All
Flexibility      Low            Medium             Medium                 High
Primary Use      Transactional  Reporting          Analytics & Reporting  Analytics
Data Volume      Low            Low                Medium                 High
Development      Top-down       Bottom-up          Top-down               All
Design Time      Medium         Medium             High                   Low
Volatility       Medium         Low                None                   None
Data Operations  CRUD           CR                 CRU                    CR
Subject Area     Single         Single             Multiple               Multiple
Design Schema    Relational     Multi-dimensional  Multi-dimensional      No Schema


S2-20_DSECLZC415
Data Pre-processing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad


Data Preprocessing Concepts

Preprocessing Objectives

• To improve data quality

• To modify data to better fit specific data mining technique

Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
Data Quality: Multidimensional View

• Measures for data quality: A multidimensional view

– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how much are the data trusted to be correct?
– Interpretability: how easily the data can be understood?

Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:

– Noise and outliers

– missing values

– duplicate data

Data Cleaning

Data Cleaning
• Data in the real world is dirty: lots of potentially incorrect data, e.g.,
faulty instruments, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– data that was inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– data not considered important at the time of entry
– failure to register history or changes of the data
• Missing data may need to be inferred

How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
• Fill in the missing value manually: tedious and possibly infeasible
• Fill it in automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class:
smarter
– the most probable value: inference-based such as Bayesian formula or
decision tree
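Two of the automatic strategies above, the attribute mean and the per-class attribute mean, can be sketched like this. The function names, the `income` and `cheat` attribute names, and the sample rows are hypothetical; missing values are represented as `None`.

```python
from collections import defaultdict


def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean of the known values."""
    known = [r[attr] for r in rows if r[attr] is not None]
    mean = sum(known) / len(known)
    return [dict(r, **{attr: mean}) if r[attr] is None else dict(r) for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values of attr with the mean over rows of the same class.

    Assumes every class has at least one known value of attr.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    means = {c: sum(v) / len(v) for c, v in by_class.items()}
    filled = []
    for r in rows:
        r = dict(r)  # copy so the input rows are not modified
        if r[attr] is None:
            r[attr] = means[r[class_attr]]
        filled.append(r)
    return filled


rows = [
    {"income": 40.0, "cheat": "No"},
    {"income": 60.0, "cheat": "No"},
    {"income": None, "cheat": "No"},
    {"income": 90.0, "cheat": "Yes"},
    {"income": None, "cheat": "Yes"},
]

print(fill_with_mean(rows, "income")[2]["income"])                 # overall mean of 40, 60, 90
print(fill_with_class_mean(rows, "income", "cheat")[2]["income"])  # 50.0
print(fill_with_class_mean(rows, "income", "cheat")[4]["income"])  # 90.0
```

The per-class version is "smarter" in the slide's sense because the filled value reflects only records that share the class label.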

Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
Noise
Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise

How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort the data and partition it into (equal-frequency) bins
– then smooth by bin means, by bin medians, or by bin boundaries, etc.
– Binning methods perform local smoothing: each sorted value is smoothed by
consulting its "neighborhood," that is, the values around it.
• Regression
– smooth by fitting the data to regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values automatically and have a human check them (e.g.,
to deal with possible outliers)
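
Smoothing by binning can be sketched as below. The `prices` list is illustrative data, the function names are hypothetical, and the sketch assumes there are at least as many values as bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    data = sorted(values)
    size, extra = divmod(len(data), n_bins)
    bins, start = [], 0
    for b in range(n_bins):
        end = start + size + (1 if b < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace each value with the closer of its bin's min and max (ties go to min)."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # illustrative data
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```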

Duplicate Data
• A data set may include data objects that are duplicates, or almost duplicates,
of one another
– Major issue when merging data from heterogeneous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

Outliers
Outliers are data objects with characteristics that are
considerably different from those of most other data objects in the
data set

Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
• Data auditing: analyze the data to discover rules and relationships, then
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheel)

Data Preprocessing Techniques

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

Data Integration
• Data integration: Combines data from multiple sources into a coherent
store
• Schema integration: e.g., A.cust-id ≡ B.cust-#
o Integrate metadata from different sources
• Entity identification problem:
o Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are
different
o Possible reasons: different representations, different scales, e.g., metric
vs. British units

Any problems with the Data?
Name           Age  DateOfFirstBuy  Profession  DateOfBirth
Bill Gates     34   15-Jan-2015     MGR         Feb 24, 1981
John           38   27-Jan-2015     (blank)     Mar 11, 1982
William Gates  34   15-Jan-2015     MGR         Feb 24, 1981
Kennedy        37   30-Jan-2015     DOC         Nov 03,1982


1) Missing values in the Profession column
2) The formats of DateOfFirstBuy and DateOfBirth differ and need
standardization
3) Row 1 and Row 3 are potentially duplicate records.
4) Both Age and DateOfBirth are stored. Age is a derived attribute.
5) Inconsistent format for names; first or last names are missing
6) Entity identification issues

Handling Redundancy in Data Integration
• Redundant data often occur when integrating multiple databases
– Object identification: The same attribute or object may have different
names in different databases
– Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes can often be detected by correlation analysis
and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality

Correlation Analysis (Nominal Data)
• χ² (chi-square) test

\[ \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}} \]

• The larger the χ² (chi-square) value, the more likely the variables are related
• The cells that contribute the most to the χ² value are those whose actual count
is very different from the expected count
• Correlation does not imply causality
• The number of hospitals and the number of car thefts in a city are correlated
• Both are causally linked to a third variable: population

Chi-Square Calculation: An Example
                          Play chess   Not play chess   Sum (row)
Like science fiction      250 (90)     200 (360)        450
Not like science fiction  50 (210)     1000 (840)       1050
Sum (col.)                300          1200             1500

• χ² (chi-square) calculation (numbers in parentheses are expected counts calculated
based on the data distribution in the two categories)

\[ \chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} \approx 507.94 \]

• It shows that like_science_fiction and play_chess are correlated in the group
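
The calculation can be reproduced with a short sketch. The function name `chi_square` is hypothetical; expected counts are derived from the row and column totals, as in the table.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in observed]
    col_sums = [sum(col) for col in zip(*observed)]
    total = sum(row_sums)
    stat = 0.0
    expected = []
    for i, row in enumerate(observed):
        exp_row = []
        for j, o in enumerate(row):
            e = row_sums[i] * col_sums[j] / total  # expected count under independence
            exp_row.append(e)
            stat += (o - e) ** 2 / e
        expected.append(exp_row)
    return stat, expected

# Rows: like / not like science fiction; columns: play / not play chess.
observed = [[250, 200], [50, 1000]]
stat, expected = chi_square(observed)
print(expected)        # [[90.0, 360.0], [210.0, 840.0]]
print(round(stat, 2))  # 507.94
```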


Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment coefficient)

\[ r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B} \]

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A
and B, σA and σB are the respective standard deviations of A and B, and
Σ(a_i b_i) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s do). The
higher the value, the stronger the correlation.
• rA,B = 0: no linear correlation (this alone does not imply independence); rA,B < 0: negatively correlated


Correlation (viewed as linear relationship)
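• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then
take their dot product, averaged over the n components

\[ a'_k = \frac{a_k - \mathrm{mean}(A)}{\mathrm{std}(A)}, \qquad b'_k = \frac{b_k - \mathrm{mean}(B)}{\mathrm{std}(B)} \]

\[ \mathrm{correlation}(A, B) = \frac{A' \cdot B'}{n} \]

(here std is the population standard deviation; divide by n − 1 instead if std is the sample standard deviation)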
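
A minimal sketch of this view of correlation, assuming the population standard deviation (division by n) so that the averaged dot product equals Pearson's r. The function names and sample vectors are illustrative.

```python
from math import sqrt

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(values)
    m = sum(values) / n
    std = sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / std for v in values]

def correlation(a, b):
    """Pearson correlation as the averaged dot product of standardized vectors."""
    a_std, b_std = standardize(a), standardize(b)
    return sum(x * y for x, y in zip(a_std, b_std)) / len(a)

a = [1, 2, 3, 4, 5]            # illustrative values
b = [2, 4, 6, 8, 10]           # exact linear function of a
print(correlation(a, b))       # close to 1.0
print(correlation(a, b[::-1])) # close to -1.0
```

The sketch fails with a division by zero for a constant vector, for which the correlation is undefined.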

Covariance (Numeric Data)
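• Covariance is similar to correlation

\[ \mathrm{Cov}(A, B) = E\big((A - \bar{A})(B - \bar{B})\big) = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n} \]

Correlation coefficient:

\[ r_{A,B} = \frac{\mathrm{Cov}(A, B)}{\sigma_A \sigma_B} \]

where n is the number of tuples, \bar{A} and \bar{B} are the respective means or
expected values of A and B, and σA and σB are the respective standard deviations of
A and B
• Positive covariance: If CovA,B > 0, then when A is larger than its expected value, B
tends to be larger than its expected value too
• Negative covariance: If CovA,B < 0, then if A is larger than its expected value, B is
likely to be smaller than its expected value
• Independence: if A and B are independent, CovA,B = 0; the converse does not hold in general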

Covariance: An Example

• Covariance can be simplified in computation as

\[ \mathrm{Cov}(A, B) = E(A \cdot B) - \bar{A} \bar{B} \]

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.
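
The stock example can be checked with a short sketch that computes the covariance both by the definition and by the computational shortcut E(A·B) − E(A)·E(B). The function and variable names are illustrative.

```python
def covariance(a, b):
    """Return population covariance computed two ways: definition and shortcut."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    by_definition = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    shortcut = sum(x * y for x, y in zip(a, b)) / n - mean_a * mean_b
    return by_definition, shortcut

stock_a = [2, 3, 5, 4, 6]
stock_b = [5, 8, 10, 11, 14]
by_definition, shortcut = covariance(stock_a, stock_b)
print(by_definition, shortcut)  # both approximately 4, up to floating-point rounding
```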


Normalization
• Min-max normalization: to [new_minA, new_maxA]

\[ v' = \frac{v - min_A}{max_A - min_A} (new\_max_A - new\_min_A) + new\_min_A \]

• Ex. Let income range from $12,000 to $98,000, normalized to [0.0, 1.0]. Then
$73,600 is mapped to

\[ \frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000} (1.0 - 0) + 0 = 0.716 \]

• Z-score normalization (μ: mean, σ: standard deviation):

\[ v' = \frac{v - \mu_A}{\sigma_A} \]

• Ex. Let μ = 54,000, σ = 16,000. Then $73,600 is mapped to

\[ \frac{73{,}600 - 54{,}000}{16{,}000} = 1.225 \]

• Normalization by decimal scaling

\[ v' = \frac{v}{10^j} \]

where j is the smallest integer such that max(|v'|) < 1
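
The three normalization schemes can be sketched as below. The function names are hypothetical, and the decimal-scaling helper assumes j is non-negative (a simplification of "smallest integer"); the values −986 and 917 are illustrative.

```python
def min_max(v, min_a, max_a, new_min=0.0, new_max=1.0):
    """Map v from [min_a, max_a] linearly onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(values):
    """Divide by 10**j, with j the smallest non-negative integer giving max |v'| < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

print(round(min_max(73600, 12000, 98000), 3))  # 0.716
print(round(z_score(73600, 54000, 16000), 3))  # 1.225
print(decimal_scaling([-986, 917]))            # ([-0.986, 0.917], 3)
```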
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—quantitative values, e.g., integers or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification


Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up
merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ²) analysis (supervised, bottom-up merge)


Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately
same number of samples
• Good data scaling
• Managing categorical attributes can be tricky
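
The contrast between the two partitioning schemes shows up clearly on data with an outlier. The sketch below uses hypothetical function names and illustrative data, and assumes the attribute has at least two distinct values.

```python
def equal_width_bins(values, n):
    """Partition into n intervals of equal width W = (B - A) / N."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        k = min(int((v - lo) / width), n - 1)  # the maximum goes into the last bin
        bins[k].append(v)
    return bins

def equal_depth_bins(values, n):
    """Partition into n intervals holding approximately the same number of values."""
    data = sorted(values)
    size, extra = divmod(len(data), n)
    bins, start = [], 0
    for b in range(n):
        end = start + size + (1 if b < extra else 0)
        bins.append(data[start:end])
        start = end
    return bins

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]  # illustrative data with one outlier
print(equal_width_bins(data, 3))  # [[1, 2, 3, 4, 5, 6, 7, 8], [], [100]]
print(equal_depth_bins(data, 3))  # [[1, 2, 3], [4, 5, 6], [7, 8, 100]]
```

With equal width, the single outlier stretches the range so that almost all values land in one bin; equal depth keeps the bins balanced.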

Binning Methods for Data Smoothing


Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter “Classification”

• Correlation analysis (e.g., Chi-merge: χ2-based discretization)

• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition is met


Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression

Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which are critical to clustering and outlier
analysis, become less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)


Curse of Dimensionality

• When dimensionality increases, data becomes increasingly sparse in the
space that it occupies
• Definitions of density and distance between points, which are critical for
clustering and outlier detection, become less meaningful
• Randomly generate 500 points
• Compute difference between max and min distance between any pair of points
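
The experiment can be sketched as follows. Two choices here are assumptions, not taken from the slide: points are drawn uniformly from the unit hypercube, and the difference between the maximum and minimum distance is divided by the minimum distance to give a relative contrast.

```python
import math
import random

def distance_contrast(n_points, n_dims, rng):
    """(max - min) / min over all pairwise Euclidean distances of uniform random points."""
    points = [[rng.random() for _ in range(n_dims)] for _ in range(n_points)]
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

rng = random.Random(0)  # fixed seed so the run is repeatable
for dims in (2, 10, 50):
    print(dims, round(distance_contrast(500, dims, rng), 2))
```

The contrast should shrink sharply as the number of dimensions grows: in high dimensions the nearest and farthest pairs end up at nearly the same distance, which is why distance-based notions lose meaning.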


Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction.

Principal Component Analysis (Steps)
• Given N data vectors in n dimensions, find k ≤ n orthogonal vectors (principal
components) that can best be used to represent the data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data vector is a linear combination of the k principal component
vectors
– The principal components are sorted in order of decreasing “significance” or
strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only
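
As a rough sketch of these steps, the strongest principal component can be approximated by power iteration on the covariance matrix. This is one possible technique, chosen here to keep the example dependency-free; it is not claimed to be the method used in the course. The sketch only centers the data (no range normalization), and it fails if the starting vector happens to be orthogonal to the strongest component. All names are illustrative.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Approximate the top eigenvector of the covariance matrix by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[j] * c[k] for c in centered) / n for k in range(d)]
           for j in range(d)]
    v = [1.0] * d  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [sum((r[j] - means[j]) * component[j] for j in range(len(means)))
            for r in rows]

rows = [(1, 2), (2, 4), (3, 6), (4, 8)]  # illustrative points on the line y = 2x
means, pc1 = first_principal_component(rows)
print([round(x, 3) for x in pc1])  # [0.447, 0.894]
print(project(rows, means, pc1))   # one coordinate per point instead of two
```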

Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data mining task at hand
– E.g., students' ID is often irrelevant to the task of predicting students'
GPA

Heuristic Search in Attribute Selection
• There are 2^d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence assumption:
choose by significance tests
– Best step-wise feature selection:
• The best single attribute is picked first
• Then the next best attribute conditioned on the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking

Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
– Attribute construction
• Combining features
• Data discretization

Data Reduction: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain the value at a point in m-D space as the
product of values on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …

Sampling

Sampling: obtaining a small sample s to represent the whole data set N
Allow a mining algorithm to run in complexity that is potentially
sub-linear in the size of the data
Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling
Note: Sampling may not reduce database I/Os (page at a time)


Types of Sampling
Simple random sampling
– There is an equal probability of selecting any particular item
Sampling without replacement
– Once an object is selected, it is removed from the population
Sampling with replacement
– A selected object is not removed from the population
Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
– Used in conjunction with skewed data
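
These sampling schemes can be sketched with the standard `random` module. The customer data, the stratum key, and the rule of drawing at least one item per stratum are assumptions made for illustration.

```python
import random
from collections import defaultdict

def srswor(data, k, rng):
    """Simple random sample without replacement: no item is drawn twice."""
    return rng.sample(data, k)

def srswr(data, k, rng):
    """Simple random sample with replacement: items may repeat."""
    return [rng.choice(data) for _ in range(k)]

def stratified(data, key, fraction, rng):
    """Draw about the same fraction from every stratum (at least one item each)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical skewed data: 95 "young" customers and 5 "senior" ones.
customers = [("young", i) for i in range(95)] + [("senior", i) for i in range(5)]
rng = random.Random(42)
sample = stratified(customers, key=lambda c: c[0], fraction=0.2, rng=rng)
print(len(sample), sum(1 for c in sample if c[0] == "senior"))  # 20 1
```

A simple random sample of 20 customers could easily contain no seniors at all; the stratified sample always represents the small group.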


Sampling: With or without Replacement

[Figure: raw data, with samples drawn with and without replacement]


S2-20_DSECLZC415
Data Exploration
BITS Pilani
Pilani|Dubai|Goa|Hyderabad


Agenda

• Data objects and attribute types

• Basic Statistical Descriptions of Data
• Measuring Data Similarity and Dissimilarity


Data Description

Types of Data Sets
• Record
– Relational records
– Data matrix, e.g., numerical matrix, crosstabs
– Document data: text documents: term-frequency vector
– Transaction data
• Graph and network
– World Wide Web
– Social or information networks
– Molecular Structures
• Ordered
– Video data: sequence of images
– Temporal data: time-series
– Sequential Data: transaction sequences
– Genetic sequence data
• Spatial, image and multimedia:
– Spatial data: maps
– Image data
– Video data

Document data example (term-frequency vectors):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1  3     0      5     0     2      6     0    2     0        2
Document 2  0     7      0     2     1      0     0    3     0        0
Document 3  0     1      0     0     1      2     2    0     3        0

Transaction data example:
TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Important Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion

Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples, examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns -> attributes.

Attributes
• Attribute (or dimension, feature, variable): a data field, representing
a characteristic or feature of a data object.
• E.g., customer_ID, name, address
• Types:
• Nominal
• Binary
• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled

Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive
values is not known.
• Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in °C or °F, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of a value as being a multiple (or ratio) of another value
(10 K is twice as high as 5 K).
• e.g., temperature in Kelvin, length, counts, monetary quantities

Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite
number of digits
• Continuous attributes are typically represented as floating-point variables

Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N} \]

Note: n is sample size and N is population size.
• Weighted arithmetic mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i} \]

• Trimmed mean: chopping extreme values
• Median:
• Middle value if odd number of values, or average of the middle
two values otherwise
• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula (for moderately skewed unimodal data): mean − mode = 3 × (mean − median)
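
A minimal sketch of these measures using Python's standard `statistics` module. The `salaries` values (in thousands) are illustrative, and `weighted_mean` and `trimmed_mean` are hypothetical helper names.

```python
from statistics import mean, median, multimode

def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

def trimmed_mean(values, fraction):
    """Drop the given fraction of values from each end before averaging."""
    data = sorted(values)
    k = int(len(data) * fraction)
    return mean(data[k:len(data) - k])

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]  # illustrative
print(mean(salaries))               # 58
print(median(salaries))             # 54.0
print(multimode(salaries))          # [52, 70], so the data are bimodal
print(trimmed_mean(salaries, 0.1))  # 55.6, after dropping 30 and 110
print(weighted_mean([1, 2, 3], [1, 1, 2]))  # 2.25
```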

Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

[Figure: three distribution curves: symmetric, positively skewed, and negatively skewed]

Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
• Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{n-1} \left[ \sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left( \sum_{i=1}^{n} x_i \right)^2 \right] \]

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 = \frac{1}{N} \sum_{i=1}^{N} x_i^2 - \mu^2 \]

• Standard deviation s (or σ) is the square root of variance s² (or σ²)
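
The "scalable computation" form of the variance needs only running sums of x and x², so it can be computed in a single pass over the data. The sketch below uses illustrative data and hypothetical function names; note that this form can lose precision when the mean is large relative to the spread.

```python
from math import sqrt

def sample_variance(xs):
    """s^2 from the running sums of x and x^2 (divides by n - 1)."""
    n = len(xs)
    total, total_sq = sum(xs), sum(x * x for x in xs)
    return (total_sq - total ** 2 / n) / (n - 1)

def population_variance(xs):
    """sigma^2 as the mean of squares minus the squared mean (divides by N)."""
    n = len(xs)
    mu = sum(xs) / n
    return sum(x * x for x in xs) / n - mu ** 2

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data with mean 5
print(population_variance(data))        # 4.0
print(sqrt(population_variance(data)))  # 2.0
print(round(sample_variance(data), 3))  # 4.571
```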

Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

Example
The following is an ordered list of observations of a variable. Compute the
five-number summary.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70
Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70
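
The solution can be reproduced with a short sketch. It assumes one particular quartile convention: Q1 and Q3 are the medians of the lower and upper halves, with the overall median excluded when the number of values is odd. Other conventions (for example, linear interpolation) can give slightly different quartiles. The function names are illustrative.

```python
def median(sorted_values):
    """Median of an already sorted list."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # values below the median position
    upper = data[(n + 1) // 2:]   # values above the median position
    return data[0], median(lower), median(data), median(upper), data[-1]

observations = [13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30,
                33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70]
low, q1, med, q3, high = five_number_summary(observations)
print(low, q1, med, q3, high)  # 13 20 25 35 70

# Values more than 1.5 x IQR beyond the quartiles are flagged as outliers.
iqr = q3 - q1
outliers = [v for v in observations if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]
print(outliers)  # [70]
```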


Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary
• Histogram: x-axis shows values, y-axis represents frequencies
• Quantile plot: each value xi is paired with fi, indicating that
approximately 100 fi % of the data are ≤ xi
• Quantile-quantile (q-q) plot: graphs the quantiles of one univariate
distribution against the corresponding quantiles of another
• Scatter plot: each pair of values is a pair of coordinates and plotted as
points in the plane

Histogram Analysis

• Histogram: Graph display of tabulated frequencies, shown as bars

• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
Histograms Often Tell More than Boxplots

• The two histograms shown on the left may have the same boxplot representation
– The same values for: min, Q1, median, Q3, max
• But they have rather different data distributions

Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For data xi sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi

Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

Scatter plot

• Provides a first look at bivariate data to see clusters of points, outliers, etc.
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negatively correlated

Uncorrelated Data

Data Similarity/Dissimilarity

Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Matrix and Dissimilarity Matrix
• Data matrix
• n data points with p dimensions
• Two modes

\[ \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix} \]

• Dissimilarity matrix
• n data points, but registers only the distance
• A triangular matrix
• Single mode

\[ \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix} \]

Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
• Method 1: Simple matching

\[ d(i, j) = \frac{p - m}{p} \]

• m: # of matches, p: total # of variables
• Method 2: Use a large number of binary attributes
• creating a new binary attribute for each of the M nominal states
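
Both methods can be sketched in a few lines. The objects and attribute values below are hypothetical, as are the function names.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Simple matching: d(i, j) = (p - m) / p."""
    p = len(obj_i)                                      # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)  # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary attribute per nominal state."""
    return [1 if value == s else 0 for s in states]

# Hypothetical objects described by (color, shape, size).
print(nominal_dissimilarity(("red", "round", "small"),
                            ("red", "square", "small")))  # one mismatch in three: 1/3
print(one_hot("blue", ["red", "yellow", "blue", "green"]))  # [0, 0, 1, 0]
```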

Proximity Measure for Binary Attributes
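• A contingency table for binary data (q, r, s and t count the attributes falling in
each cell; p = q + r + s + t)

                Object j
                1       0       sum
Object i   1    q       r       q + r
           0    s       t       s + t
           sum  q + s   r + t   p

• Distance measure for symmetric binary variables:

\[ d(i, j) = \frac{r + s}{q + r + s + t} \]

• Distance measure for asymmetric binary variables:

\[ d(i, j) = \frac{r + s}{q + r + s} \]

• Jaccard coefficient (similarity measure for asymmetric binary variables):

\[ sim_{Jaccard}(i, j) = \frac{q}{q + r + s} \]

• Note: Jaccard coefficient is the same as “coherence”:

\[ coherence(i, j) = \frac{sup(i, j)}{sup(i) + sup(j) - sup(i, j)} = \frac{q}{(q + r) + (q + s) - q} \]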

Dissimilarity between Binary Variables
• Example
Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
Jack  M       Y      N      P       N       N       N
Mary  F       Y      N      P       N       P       N
Jim   M       Y      P      N       N       N       N

• Gender is a symmetric attribute
• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0 (to match the contingency
table of the previous slide)
• Following are distances based on asymmetric binary variables:

\[ d(jack, mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33 \]

\[ d(jack, jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67 \]

\[ d(jim, mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75 \]
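
The three distances can be verified with a short sketch. The names q, r, s and t for the contingency counts are a notational assumption here: q counts attributes that are 1 for both objects, r those that are 1 only for the first, s those that are 1 only for the second, and t those that are 0 for both. Only the six asymmetric attributes (Fever through Test-4) are encoded.

```python
def binary_counts(i, j):
    """Contingency counts (q, r, s, t) for two 0/1 vectors."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def asymmetric_distance(i, j):
    """Ignore the 0/0 matches: d = (r + s) / (q + r + s)."""
    q, r, s, _ = binary_counts(i, j)
    return (r + s) / (q + r + s)

def jaccard_similarity(i, j):
    """Jaccard coefficient: q / (q + r + s), i.e. 1 - asymmetric distance."""
    q, r, s, _ = binary_counts(i, j)
    return q / (q + r + s)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1 and N -> 0
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim = [1, 1, 0, 0, 0, 0]
print(round(asymmetric_distance(jack, mary), 2))  # 0.33
print(round(asymmetric_distance(jack, jim), 2))   # 0.67
print(round(asymmetric_distance(jim, mary), 2))   # 0.75
```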

Standardizing Numeric Data
• Z-score:

\[ z = \frac{x - \mu}{\sigma} \]

• x: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, positive when above
• An alternative way: Calculate the mean absolute deviation

\[ s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right) \]

where

\[ m_f = \frac{1}{n} \left( x_{1f} + x_{2f} + \cdots + x_{nf} \right) \]

• standardized measure (z-score):

\[ z_{if} = \frac{x_{if} - m_f}{s_f} \]

• Using mean absolute deviation is more robust than using standard deviation
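
A small sketch comparing the two standardizations on illustrative data with one outlier. The function names are hypothetical. Because deviations are not squared, the outlier inflates the mean absolute deviation less than it inflates the standard deviation, so its standardized score stays larger.

```python
from math import sqrt

def mad_standardize(values):
    """z_if = (x_if - m_f) / s_f with s_f the mean absolute deviation."""
    n = len(values)
    m = sum(values) / n
    s = sum(abs(x - m) for x in values) / n
    return [(x - m) / s for x in values]

def std_standardize(values):
    """Classic z-score using the population standard deviation."""
    n = len(values)
    m = sum(values) / n
    sd = sqrt(sum((x - m) ** 2 for x in values) / n)
    return [(x - m) / sd for x in values]

data = [1, 2, 3, 4, 100]  # illustrative data with one outlier
print(round(mad_standardize(data)[-1], 3))  # 2.5
print(round(std_standardize(data)[-1], 3))  # about 2.0, a smaller score for the outlier
```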

Example: Data Matrix and Dissimilarity Matrix

Data Matrix
point  attribute1  attribute2
x1     1           2
x2     3           5
x3     2           0
x4     4           5

[Figure: the four points x1 to x4 plotted in the plane]

Dissimilarity Matrix
(with Euclidean Distance)
     x1    x2    x3    x4
x1   0
x2   3.61  0
x3   2.24  5.1   0
x4   4.24  1     5.39  0

Data Mining
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance
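• Minkowski distance: A popular distance measure
    $d(i, j) = \left(|x_{i1} - x_{j1}|^h + |x_{i2} - x_{j2}|^h + \cdots + |x_{ip} - x_{jp}|^h\right)^{1/h}$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called the L-h norm)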
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric

Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different between two binary vectors
    $d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$
• h = 2: Euclidean (L2 norm) distance
    $d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$
• h → ∞: “supremum” (Lmax norm, L∞ norm) distance
• This is the maximum difference between any component (attribute) of the vectors
    $d(i, j) = \max_{f} |x_{if} - x_{jf}|$

Example: Minkowski Distance
    point   attribute 1   attribute 2
    x1      1             2
    x2      3             5
    x3      2             0
    x4      4             5

[Figure: scatter plot of the points x1, x2, x3 and x4; both axes run from 0 to 4]

Dissimilarity Matrices

Manhattan (L1)

          x1    x2    x3    x4
    x1    0
    x2    5     0
    x3    3     6     0
    x4    6     1     7     0

Euclidean (L2)

          x1      x2      x3      x4
    x1    0
    x2    3.61    0
    x3    2.24    5.1     0
    x4    4.24    1       5.39    0

Supremum (L∞)

          x1    x2    x3    x4
    x1    0
    x2    3     0
    x3    2     5     0
    x4    3     1     5     0

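
The three matrices can be reproduced with one function that takes the order h as a parameter. This is a sketch; the name `minkowski` and the use of `math.inf` to request the supremum distance are choices made here for illustration.

```python
import math

def minkowski(a, b, h):
    """Minkowski distance of order h; h = math.inf gives the supremum distance."""
    diffs = [abs(x - y) for x, y in zip(a, b)]
    if math.isinf(h):
        return max(diffs)
    return sum(d ** h for d in diffs) ** (1.0 / h)

points = {"x1": (1, 2), "x2": (3, 5), "x3": (2, 0), "x4": (4, 5)}

for label, h in (("Manhattan", 1), ("Euclidean", 2), ("Supremum", math.inf)):
    print(label)
    names = list(points)
    for i, p in enumerate(names):
        # Lower triangle only, matching the layout of the slide.
        row = [round(minkowski(points[p], points[q], h), 2) for q in names[: i + 1]]
        print(" ", p, row)
```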
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace xif by their rank $r_{if} \in \{1, \ldots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
    $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• compute the dissimilarity using methods for interval-scaled variables

Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects

    $d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$

• f is binary or nominal:
    dij(f) = 0 if xif = xjf, or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal
• Compute ranks rif and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
• Treat zif as interval-scaled

Example
Based on the information given in the table below, find most similar and
most dissimilar persons among them. Apply min-max normalization on
income to obtain [0,1] range. Consider profession and mother tongue as
nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each
attribute.

    Name     Income   Profession       Mother tongue   Native Place
    Ram      70000    Doctor           Bengali         Village
    Balram   50000    Data Scientist   Hindi           Small Town
    Bharat   60000    Carpenter        Hindi           Suburban
    Kishan   80000    Doctor           Bhojpuri        Metropolitan

Solution
After normalizing income and quantifying native place, we get
    Name     Income   Profession       Mother tongue   Native Place
    Ram      0.67     Doctor           Bengali         1
    Balram   0        Data Scientist   Hindi           2
    Bharat   0.33     Carpenter        Hindi           3
    Kishan   1        Doctor           Bhojpuri        4

Each distance below is the sum of the four per-attribute distances. With equal weights, dividing every sum by 4 gives the weighted-formula value and leaves the ranking unchanged.

d(Ram, Balram) = 0.67 + 1 + 1 + (2-1)/(4-1) = 3
d(Ram, Bharat) = 0.33 + 1 + 1 + (3-1)/(4-1) = 3
d(Ram, Kishan) = 0.33 + 0 + 1 + (4-1)/(4-1) = 2.33
d(Balram, Bharat) = 0.33 + 1 + 0 + (3-2)/(4-1) = 1.67
d(Balram, Kishan) = 1 + 1 + 1 + (4-2)/(4-1) = 3.67
d(Bharat, Kishan) = 0.67 + 1 + 1 + (4-3)/(4-1) = 3

Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan

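
The worked solution can be reproduced with a small sketch of the mixed-type formula. The function `mixed_distance` and the dictionary names are illustrative. The function returns the equal-weight average, so multiplying by 4 gives the sums used in the solution.

```python
from itertools import combinations

# (income, profession, mother tongue, native place) as given in the example table.
PEOPLE = {
    "Ram": (70000, "Doctor", "Bengali", "Village"),
    "Balram": (50000, "Data Scientist", "Hindi", "Small Town"),
    "Bharat": (60000, "Carpenter", "Hindi", "Suburban"),
    "Kishan": (80000, "Doctor", "Bhojpuri", "Metropolitan"),
}
RANK = {"Village": 1, "Small Town": 2, "Suburban": 3, "Metropolitan": 4}

def mixed_distance(a, b, income_range):
    """Equal-weight average of the per-attribute distances."""
    parts = [
        abs(a[0] - b[0]) / income_range,                 # numeric, min-max normalized
        0.0 if a[1] == b[1] else 1.0,                    # nominal: profession
        0.0 if a[2] == b[2] else 1.0,                    # nominal: mother tongue
        abs(RANK[a[3]] - RANK[b[3]]) / (len(RANK) - 1),  # ordinal: |z_i - z_j|
    ]
    return sum(parts) / len(parts)

incomes = [p[0] for p in PEOPLE.values()]
income_range = max(incomes) - min(incomes)
distances = {
    (m, n): mixed_distance(PEOPLE[m], PEOPLE[n], income_range)
    for m, n in combinations(PEOPLE, 2)
}
closest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)
print("most similar:", closest, "most dissimilar:", farthest)
```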

Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …

• Applications: information retrieval, biological taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) / (||d1|| ||d2||),
where • indicates vector dot product, ||d||: the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*0+2*1+0*0+0*1 = 25
||d1|| = (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
||d2|| = (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)^0.5 = (17)^0.5 = 4.12
cos(d1, d2) = 25 / (6.481 * 4.12) = 0.94

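
The same computation as a short sketch. `cosine_similarity` is an illustrative name, and the vectors are the two term-frequency vectors from the example.

```python
import math

def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (||a|| * ||b||) for equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

print(round(cosine_similarity(d1, d2), 2))
```

The sketch assumes neither vector is all zeros; an empty document would need separate handling.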
Thank You


S2-20_DSECFZC415
Classification and Prediction
BITS Pilani
Pilani|Dubai|Goa|Hyderabad


Classification

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
• Classification involves dividing up objects so that each is assigned to one of a number of
mutually exclusive and collectively exhaustive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
‒ customers who are likely to buy or not buy a particular product in a supermarket
‒ people who are at high, medium or low risk of acquiring a certain illness
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees
‒ people who closely resemble, slightly resemble or do not resemble someone seen
committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12
months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months
‒ people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very
unlikely).

Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data

• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values

Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set

• Unsupervised learning (clustering)

• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
People also talk about more forms of machine learning
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: for classifying future or unknown objects

• Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• The test set must be independent of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected
• If the accuracy is acceptable, use the model to classify data tuples whose
class labels are not known

Illustrating Classification Task

Training Set

    Tid   Attrib1   Attrib2   Attrib3   Class
    1     Yes       Large     125K      No
    2     No        Medium    100K      No
    3     No        Small     70K       No
    4     Yes       Medium    120K      No
    5     No        Large     95K       Yes
    6     No        Medium    60K       No
    7     Yes       Large     220K      No
    8     No        Small     85K       Yes
    9     No        Medium    75K       No
    10    No        Small     90K       Yes

Test Set

    Tid   Attrib1   Attrib2   Attrib3   Class
    11    No        Small     55K       ?
    12    Yes       Medium    80K       ?
    13    Yes       Large     110K      ?
    14    No        Small     95K       ?
    15    No        Large     67K       ?

[Figure: Induction: a learning algorithm learns a model from the training set. Deduction: the model is applied to the test set.]

Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
- computational networks that simulate the decision process in neurons
(networks of nerve cells)
• Naïve Bayes and Bayesian Belief Networks
- use probability theory to find the most likely of the possible
classifications
• Support Vector Machines
- fit a boundary to a region of points that are all alike; use the boundary to
classify a new point

Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or
only minor processing) and waits until it is given a test tuple
• Eager learning (the above discussed methods): Given a training set,
constructs a classification model before receiving new (e.g., test) data to
classify

• Lazy: less time in training but more time in predicting

• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many
local linear functions to form its implicit global approximation to the target
function
• Eager: must commit to a single hypothesis that covers the entire instance
space

Lazy Learner: Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing (“lazy evaluation”)
until a new instance must be classified

• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference

Example of a Decision Tree

Training Data

    Tid   Refund   Marital Status   Taxable Income   Cheat
    1     Yes      Single           125K             No
    2     No       Married          100K             No
    3     No       Single           70K              No
    4     Yes      Married          120K             No
    5     No       Divorced         95K              Yes
    6     No       Married          60K              No
    7     Yes      Divorced         220K             No
    8     No       Single           85K              Yes
    9     No       Married          75K              No
    10    No       Single           90K              Yes

Model: Decision Tree (splitting attributes: Refund, MarSt, TaxInc)

    Refund
      Yes: NO
      No: MarSt
        Single, Divorced: TaxInc
          < 80K: NO
          > 80K: YES
        Married: NO

Another Example of Decision Tree

Training data: the same ten records (Tid 1 to 10) with Refund, Marital Status, Taxable Income and Cheat as in the previous example.

Model: Decision Tree

    MarSt
      Married: NO
      Single, Divorced: Refund
        Yes: NO
        No: TaxInc
          < 80K: NO
          > 80K: YES

There could be more than one tree that fits the same data!

Apply Model to Test Data

Test Data

    Refund   Marital Status   Taxable Income   Cheat
    No       Married          80K              ?

[Figure: the test record is passed down the tree rooted at Refund: Refund = No leads to MarSt, and MarSt = Married leads to a leaf labeled NO]

Assign Cheat to “No”

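
Applying the tree is a sequence of attribute tests from the root to a leaf. The sketch below hard-codes the tree rooted at Refund. The slide leaves an income of exactly 80K unspecified ("< 80K" versus "> 80K"), so treating 80K and above as one branch is an assumption made here; the function name `predict_cheat` is illustrative.

```python
def predict_cheat(record):
    """Walk the Refund -> MarSt -> TaxInc tree; income is given in thousands."""
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: decide on taxable income (assumed: >= 80K means Yes).
    return "Yes" if record["Taxable Income"] >= 80 else "No"

# Training records from the slides: (Refund, Marital Status, Taxable Income, Cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

test_record = {"Refund": "No", "Marital Status": "Married", "Taxable Income": 80}
print(predict_cheat(test_record))
```

For the test record the income threshold never comes into play, because the Married branch ends in a leaf before income is tested.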
Decision Tree Classification Task

[Figure: the classification workflow specialized to decision trees. Induction: a tree induction algorithm learns a model (a decision tree) from the training set (Tid 1 to 10 with Attrib1, Attrib2, Attrib3 and Class). Deduction: the model is applied to the test set (Tid 11 to 15, Class unknown).]

Issues: Evaluating Classification Methods
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules

Underfitting and Overfitting (Example)
500 circular and 500 triangular data points.

Circular points:
0.5 ≤ sqrt(x1^2 + x2^2) ≤ 1

Triangular points:
sqrt(x1^2 + x2^2) < 0.5 or
sqrt(x1^2 + x2^2) > 1

[Figure: scatter plot of the two classes, axes X1 and X2]

Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting due to Noise

Decision boundary is distorted by noise point

Decision Tree Based Classification
• Decision trees are an intuitive and frequently used data mining technique
for classification
• For an analyst, they are easy to set up and for a business user they are
easy to interpret.
• A decision tree model is a decision flowchart in which an attribute is
tested at each internal node and each path ends in a leaf node where a prediction is made.
• There are many algorithms for decision tree induction such as Hunt’s
Algorithm, CART, ID3, C4.5, SLIQ, SPRINT

Hunt’s Algorithm - Structure
• Hunt's algorithm is among the earliest. More complex algorithms were built upon it.
• It grows a decision tree in a recursive fashion by partitioning the training records into successively purer subsets
• Let Dt be the set of training records that reach a node t
• General Procedure:
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
• If Dt is an empty set, then t is a leaf node labeled by the default class, yd
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset.

[Table: the training data (Tid 1 to 10) with Refund, Marital Status, Taxable Income and Cheat, as in the decision tree example]

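
The general procedure can be sketched as a short recursive function. Several choices here are assumptions, not part of the slides: only categorical attributes are handled (Taxable Income is left out), each attribute is used at most once per path, the split is chosen by lowest weighted Gini, and a node with no attributes left becomes a majority-class leaf.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def hunt(records, attributes, default):
    """records: list of (attribute_dict, label). Returns a label or (attr, children)."""
    if not records:                       # empty D_t: leaf with the default class
        return default
    labels = [y for _, y in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                   # pure node, or nothing left to test

    def weighted_gini(attr):
        groups = {}
        for x, y in records:
            groups.setdefault(x[attr], []).append(y)
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=weighted_gini)
    rest = [a for a in attributes if a != best]
    children = {}
    for value in {x[best] for x, _ in records}:
        subset = [(x, y) for x, y in records if x[best] == value]
        children[value] = hunt(subset, rest, majority)
    return (best, children)

ROWS = [("Yes", "Single", "No"), ("No", "Married", "No"), ("No", "Single", "No"),
        ("Yes", "Married", "No"), ("No", "Divorced", "Yes"), ("No", "Married", "No"),
        ("Yes", "Divorced", "No"), ("No", "Single", "Yes"), ("No", "Married", "No"),
        ("No", "Single", "Yes")]
data = [({"Refund": r, "Marital Status": m}, y) for r, m, y in ROWS]
tree = hunt(data, ["Refund", "Marital Status"], default="No")
print(tree)
```

Because this sketch only branches on values that actually occur in Dt, the empty-set case is never reached here; it is kept to mirror the procedure as stated.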
Hunt’s Algorithm - Example

[Figure: Hunt's algorithm applied step by step to the training data (Tid 1 to 10) with Refund, Marital Status, Taxable Income and Cheat]

Step 1: a single leaf labeled Don’t Cheat

Step 2: split on Refund
    Yes: Don’t Cheat
    No: Don’t Cheat

Step 3: under Refund = No, split on Marital Status
    Single, Divorced: Cheat
    Married: Don’t Cheat

Step 4: under Marital Status = Single, Divorced, split on Taxable Income
    < 80K: Don’t Cheat
    >= 80K: Cheat

Tree Induction
• Greedy strategy.
• Split the records based on an attribute test that optimizes a certain
criterion.

• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting

How to Specify Test Condition?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous

• Depends on number of ways to split

• 2-way split
• Multi-way split

Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.
    CarType: Family | Sports | Luxury

• Binary split: Divides values into two subsets. Need to find optimal partitioning.
    CarType: {Sports, Luxury} | {Family}
    OR
    CarType: {Family, Luxury} | {Sports}

Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
    Size: Small | Medium | Large

• Binary split: Divides values into two subsets. Need to find optimal partitioning.
    Size: {Small, Medium} | {Large}
    OR
    Size: {Medium, Large} | {Small}

• What about this split?
    Size: {Small, Large} | {Medium}

Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering.
• Binary Decision: (A < v) or (A ≥ v)
• considers all possible splits and finds the best cut
• can be more compute intensive

Splitting Based on Continuous Attributes

How to determine the Best Split

Before Splitting: 10 records of class 0, 10 records of class 1

Own Car?
    Yes: C0: 6, C1: 4
    No: C0: 4, C1: 6

Car Type?
    Family: C0: 1, C1: 3
    Sports: C0: 8, C1: 0
    Luxury: C0: 1, C1: 7

Student ID?
    c1: C0: 1, C1: 0
    ...
    c10: C0: 1, C1: 0
    c11: C0: 0, C1: 1
    ...
    c20: C0: 0, C1: 1

Which test condition is the best?

How to determine the Best Split
• Greedy approach:
• Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:

    C0: 5, C1: 5    Non-homogeneous, high degree of impurity
    C0: 9, C1: 1    Homogeneous, low degree of impurity

Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error

How to Find the Best Split
Before Splitting: class counts C0: N00, C1: N01, impurity M0

Split on A?
    Yes: Node N1 with C0: N10, C1: N11, impurity M1
    No: Node N2 with C0: N20, C1: N21, impurity M2
    Impurity of the children after the split: M12

Split on B?
    Yes: Node N3 with C0: N30, C1: N31, impurity M3
    No: Node N4 with C0: N40, C1: N41, impurity M4
    Impurity of the children after the split: M34

Gain = M0 – M12 vs M0 – M34

Measure of Impurity: GINI

• Gini Index for a given node t :

    $GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$

(NOTE: p( j | t) is the relative frequency of class j at node t).

• Maximum (1 - 1/nc), where nc is the number of classes, when records are equally distributed among all classes,
implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information

    C1: 0, C2: 6    Gini = 0.000
    C1: 1, C2: 5    Gini = 0.278
    C1: 2, C2: 4    Gini = 0.444
    C1: 3, C2: 3    Gini = 0.500

Examples for computing GINI
    $GINI(t) = 1 - \sum_{j} [p(j \mid t)]^2$

C1: 0, C2: 6
    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Gini = 1 – P(C1)^2 – P(C2)^2 = 1 – 0 – 1 = 0

C1: 1, C2: 5
    P(C1) = 1/6, P(C2) = 5/6
    Gini = 1 – (1/6)^2 – (5/6)^2 = 0.278

C1: 2, C2: 4
    P(C1) = 2/6, P(C2) = 4/6
    Gini = 1 – (2/6)^2 – (4/6)^2 = 0.444

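
The node-level Gini computation is a one-liner over the class counts. The function name `gini` and the count-list input are illustrative choices.

```python
def gini(counts):
    """Gini index of a node, given the number of records in each class."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The class-count pairs (C1, C2) used in the examples.
for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(gini(counts), 3))
```

The sketch assumes the node is non-empty; an empty node would need a guard against division by zero.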
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of split is
computed as,

    $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n} GINI(i)$

where, ni = number of records at child i,
n = number of records at node p.

Binary Attributes: Computing GINI Index
• Splits into two partitions
• Effect of weighting partitions:
– Larger and purer partitions are sought.

Parent: C1: 6, C2: 6, Gini = 0.500

Split on B?: Yes leads to Node N1, No leads to Node N2

          N1    N2
    C1    5     1
    C2    2     4
    Gini = 0.37

Gini(N1) = 1 – (5/7)^2 – (2/7)^2 = 0.408
Gini(N2) = 1 – (1/5)^2 – (4/5)^2 = 0.32
Gini(Children) = 7/12 * 0.408 + 5/12 * 0.32 = 0.37

Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split

    CarType   Family   Sports   Luxury
    C1        1        2        1
    C2        4        1        1
    Gini = 0.393

Two-way split (find best partition of values)

    CarType   {Sports, Luxury}   {Family}
    C1        3                  1
    C2        2                  4
    Gini = 0.400

    CarType   {Sports}   {Family, Luxury}
    C1        2          2
    C2        1          5
    Gini = 0.419

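
The weighted Gini of a split can be checked against the binary and categorical count matrices above. In this sketch each child node is given as a list of class counts; `gini_split` is an illustrative name.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children is a list of per-child class counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

splits = {
    "B? (N1, N2)": [[5, 2], [1, 4]],
    "CarType multi-way": [[1, 4], [2, 1], [1, 1]],
    "{Sports, Luxury} | {Family}": [[3, 2], [1, 4]],
    "{Sports} | {Family, Luxury}": [[2, 1], [2, 5]],
}
for name, children in splits.items():
    print(name, round(gini_split(children), 3))
```

Among the CarType candidates the multi-way split has the lowest weighted Gini, which matches the values on the slide.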
Continuous Attributes: Computing Gini Index
• Use Binary Decisions based on one value
• Several Choices for the splitting value
• Number of possible splitting values = Number of distinct values
• Each splitting value has a count matrix associated with it
• Class counts in each of the partitions, A < v and A ≥ v
• Simple method to choose best v
• For each v, scan the database to gather count matrix and compute its Gini index
• Computationally Inefficient! Repetition of work.

[Table: the training data (Tid 1 to 10) with Refund, Marital Status, Taxable Income and Cheat]
[Figure: example test condition Taxable Income > 80K? with branches Yes and No]

Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
• Sort the attribute on values
• Linearly scan these values, each time updating the count matrix and
computing gini index
• Choose the split position that has the least gini index

    Sorted values (Taxable Income):  60   70   75   85   90   95   100   120   125   220
    Cheat:                           No   No   No   Yes  Yes  Yes  No    No    No    No

    Split position   Yes (<=, >)   No (<=, >)   Gini
    55               0, 3          0, 7         0.420
    65               0, 3          1, 6         0.400
    72               0, 3          2, 5         0.375
    80               0, 3          3, 4         0.343
    87               1, 2          3, 4         0.417
    92               2, 1          3, 4         0.400
    97               3, 0          3, 4         0.300
    110              3, 0          4, 3         0.343
    122              3, 0          5, 2         0.375
    172              3, 0          6, 1         0.400
    230              3, 0          7, 0         0.420

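
The sort-then-scan idea can be sketched as follows. Candidate cuts are taken as midpoints between consecutive distinct values, so the best cut comes out as 97.5 where the slide shows the rounded position 97. The function name `best_numeric_split` is illustrative.

```python
def best_numeric_split(values, labels):
    """Return (weighted Gini, cut) for the best binary split A <= cut vs A > cut."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    total = {c: labels.count(c) for c in classes}
    left = {c: 0 for c in classes}          # running count matrix for A <= cut
    n = len(pairs)

    def gini(counts, size):
        return 1.0 - sum((k / size) ** 2 for k in counts.values()) if size else 0.0

    best = (float("inf"), None)
    for i in range(n - 1):
        value, label = pairs[i]
        left[label] += 1                    # one record moves from right to left
        nxt = pairs[i + 1][0]
        if nxt == value:
            continue                        # no cut between equal values
        right = {c: total[c] - left[c] for c in classes}
        n_left = i + 1
        score = (n_left / n) * gini(left, n_left) + ((n - n_left) / n) * gini(right, n - n_left)
        if score < best[0]:
            best = (score, (value + nxt) / 2)
    return best

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
score, cut = best_numeric_split(income, cheat)
print(round(score, 3), cut)
```

Each step only updates the running counts, so after sorting the scan is linear in the number of records instead of rescanning the data for every candidate value.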
Alternative Splitting Criteria
• Entropy at a given node t:
    $Entropy(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$

(NOTE: p( j | t) is the relative frequency of class j at node t).

• Measures homogeneity of a node.
• Maximum (log2 nc) when records are equally distributed among all
classes implying least information
• Minimum (0.0) when all records belong to one class, implying
most information
• Entropy based computations are similar to the GINI index
computations

Examples for computing Entropy

    $Entropy(t) = -\sum_{j} p(j \mid t) \log_2 p(j \mid t)$

C1: 0, C2: 6
    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1: 1, C2: 5
    P(C1) = 1/6, P(C2) = 5/6
    Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4
    P(C1) = 2/6, P(C2) = 4/6
    Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

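
A sketch of the entropy computation, using the usual convention that a class with zero records contributes nothing to the sum (0 log 0 is taken as 0). The function name `entropy` is illustrative.

```python
import math

def entropy(counts):
    """Entropy (base 2) of a node, given the number of records in each class."""
    n = sum(counts)
    # Skip empty classes: the term 0 * log2(0) is treated as 0.
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

for counts in ([0, 6], [1, 5], [2, 4], [3, 3]):
    print(counts, round(entropy(counts), 2))
```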
Splitting Criteria based on Classification Error
• Classification error at a node t :

    $Error(t) = 1 - \max_{i} P(i \mid t)$

• Measures misclassification error made by a node.
• Maximum (1 - 1/nc) when records are equally distributed among all
classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information

Examples for Computing Error
    $Error(t) = 1 - \max_{i} P(i \mid t)$

C1: 0, C2: 6
    P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
    Error = 1 – max (0, 1) = 1 – 1 = 0

C1: 1, C2: 5
    P(C1) = 1/6, P(C2) = 5/6
    Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4
    P(C1) = 2/6, P(C2) = 4/6
    Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Comparison among Splitting Criteria
For a 2-class problem:

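
For a 2-class problem the three measures can be tabulated as a function of p, the fraction of records in one class. The sketch below is illustrative; the function name `impurities` is not from the slides.

```python
import math

def impurities(p):
    """Return (Gini, entropy, misclassification error) for class fractions p and 1 - p."""
    gini = 1.0 - p ** 2 - (1.0 - p) ** 2
    entropy = -sum(q * math.log2(q) for q in (p, 1.0 - p) if q > 0)
    error = 1.0 - max(p, 1.0 - p)
    return gini, entropy, error

for p in (0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0):
    g, e, err = impurities(p)
    print(f"p={p:.2f}  gini={g:.3f}  entropy={e:.3f}  error={err:.3f}")
```

All three are zero for a pure node and peak at p = 0.5, where entropy reaches 1 while Gini and error reach 0.5.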
Gain Ratio


Attribute Selection Measure: Information Gain (ID3/C4.5)

• Select the attribute with the highest information gain
• Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci,D|/|D|
• Expected information (entropy) needed to classify a tuple in D:
    $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
• Information needed (after using A to split D into v partitions) to
classify D:
    $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
• Information gained by branching on attribute A:
    $Gain(A) = Info(D) - Info_A(D)$

Attribute Selection: Information Gain
    age      income   student   credit_rating   buys_computer
    <=30     high     no        fair            no
    <=30     high     no        excellent       no
    31…40    high     no        fair            yes
    >40      medium   no        fair            yes
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31…40    low      yes       excellent       yes
    <=30     medium   no        fair            no
    <=30     low      yes       fair            yes
    >40      medium   yes       fair            yes
    <=30     medium   yes       excellent       yes
    31…40    medium   no        excellent       yes
    31…40    high     yes       fair            yes
    >40      medium   no        excellent       no

• Class P: buys_computer = “yes”
• Class N: buys_computer = “no”

    $Info(D) = I(9, 5) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940$

Attribute Selection: Information Gain

    $Info_{age}(D) = \frac{5}{14} I(2, 3) + \frac{4}{14} I(4, 0) + \frac{5}{14} I(3, 2) = 0.694$

    age      pi   ni   I(pi, ni)
    <=30     2    3    0.971
    31…40    4    0    0
    >40      3    2    0.971

$\frac{5}{14} I(2, 3)$ means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Hence

    Gain(age) = Info(D) − Info_age(D) = 0.246

Similarly,

    Gain(income) = 0.029
    Gain(student) = 0.151
    Gain(credit_rating) = 0.048

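
The gains can be recomputed from the 14-tuple table with a short sketch. The names `info`, `gain`, `ROWS` and `COLUMNS` are illustrative. The slide values come from rounded intermediate results (0.940 − 0.694), so a full-precision run may differ in the third decimal place.

```python
import math
from collections import Counter

COLUMNS = ("age", "income", "student", "credit_rating")
ROWS = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31…40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31…40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31…40", "medium", "no", "excellent", "yes"),
    ("31…40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def info(labels):
    """Expected information (entropy, base 2) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, col):
    """Information gain of splitting rows on the attribute in column col."""
    parts = {}
    for r in rows:
        parts.setdefault(r[col], []).append(r[-1])
    info_a = sum(len(p) / len(rows) * info(p) for p in parts.values())
    return info([r[-1] for r in rows]) - info_a

gains = {name: gain(ROWS, i) for i, name in enumerate(COLUMNS)}
for name, g in gains.items():
    print(name, round(g, 3))
print("split on:", max(gains, key=gains.get))
```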
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large
number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain)
    $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right)$
• GainRatio(A) = Gain(A)/SplitInfo(A)
• Ex. income splits D into partitions of 4 (high), 6 (medium) and 4 (low) tuples:
    SplitInfo_income(D) = − (4/14) log2 (4/14) − (6/14) log2 (6/14) − (4/14) log2 (4/14) = 1.557
• gain_ratio(income) = 0.029/1.557 = 0.019
• The attribute with the maximum gain ratio is selected as the splitting
attribute

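
The income example can be verified with a few lines. `split_info` is an illustrative helper that takes the partition sizes; the sizes 4, 6 and 4 are the counts of high, medium and low income in the 14-tuple table, and 0.029 is the Gain(income) value quoted above.

```python
import math

def split_info(sizes):
    """SplitInfo of a partition, given the number of tuples in each part."""
    n = sum(sizes)
    return -sum(s / n * math.log2(s / n) for s in sizes if s > 0)

income_sizes = [4, 6, 4]      # high, medium, low
gain_income = 0.029           # value taken from the slides

si = split_info(income_sizes)
print(round(si, 3), round(gain_income / si, 3))
```

An attribute that splits the data into many small partitions has a large SplitInfo, so dividing by it penalizes exactly the attributes that plain information gain favors.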
Refining Decision Tree Model


Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination

Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many
simple data sets

Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification

Underfitting vs. Overfitting
• Underfitting results in decision trees that are too simple to solve the
problem. They may offer superior interpretability.

• Overfitting results in decision trees that are more complex than necessary
• Training error no longer provides a good estimate of how well the
tree will perform on previously unseen records
• Need new ways for estimating errors

Model Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is
large

How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree
• General stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
• More restrictive conditions (for pre-pruning) :
• Stop if number of instances is less than some user-specified
threshold
• Stop if the class distribution of instances is independent of the
available features (e.g., using the χ² test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

How to Address Overfitting…
• Post-pruning
• Grow decision tree to its entirety
• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error (i.e., the expected error of the model on previously
unseen records) improves after trimming, replace the sub-tree by a leaf
node.
• Class label of leaf node is determined from majority class of instances
in the sub-tree

Prescribed Text Books

Author(s), Title, Edition, Publishing House

R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

Thank You


Rule-based Classification

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
• computational networks that simulate the decision process in
networks of neurons (nerve cells)
• Naïve Bayes and Bayesian Belief Networks
• use probability theory to find the most likely of the possible
classifications
• Support Vector Machines
• fit a boundary that separates regions of like points; use the
boundary to classify a new point

Rule-Based Classifier
• Classify records by using a collection of “if…then…” rules

• Rule: (Condition) → y where

• Condition is a conjunction of attribute tests
• y is the class label
• LHS: rule antecedent or condition
• RHS: rule consequent
• Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the
condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird

The rule R3 covers the grizzly bear => Mammal

Rule Coverage and Accuracy
• Coverage of a rule:
• Fraction of all records that satisfy the antecedent of the rule
• Accuracy of a rule:
• Fraction of the records satisfying the antecedent that also satisfy
the consequent of the rule

Tid  Refund  Marital Status  Taxable Income  Class
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

(Status=Single) → No

Coverage = 40%, Accuracy = 50%
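The two figures can be checked with a short sketch. Here accuracy is taken as the share of covered records (those satisfying the antecedent) whose class matches the consequent. The record layout and the function name `coverage_and_accuracy` are assumptions for this illustration.

```python
# The ten records from the table: (Tid, Refund, Marital Status, Income in K, Class).
RECORDS = [
    (1, "Yes", "Single", 125, "No"),   (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),     (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),  (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"), (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),    (10, "No", "Single", 90, "Yes"),
]

def coverage_and_accuracy(records, antecedent, consequent):
    """Coverage = |covered| / |all records|; accuracy = |covered and correct| / |covered|."""
    covered = [r for r in records if antecedent(r)]
    correct = [r for r in covered if consequent(r)]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy

# Rule: (Status=Single) → No
cov, acc = coverage_and_accuracy(
    RECORDS,
    antecedent=lambda r: r[2] == "Single",
    consequent=lambda r: r[4] == "No",
)
print(f"Coverage = {cov:.0%}, Accuracy = {acc:.0%}")  # Coverage = 40%, Accuracy = 50%
```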

How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal

A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

Characteristics of Rule-Based Classifier
• Mutually exclusive rules
• Classifier contains mutually exclusive rules if the rules are
independent of each other
• Every record is covered by at most one rule

• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule

From Decision Trees To Rules
Decision tree:
Refund = Yes → NO
Refund = No → Marital Status
    Marital Status = {Single, Divorced} → Taxable Income
        Taxable Income < 80K → NO
        Taxable Income > 80K → YES
    Marital Status = {Married} → NO

Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Rules are mutually exclusive and exhaustive

Rule set contains as much information as the tree

Rules Can Be Simplified
[Figure: the decision tree from the previous slide, shown next to the training data]

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Initial Rule: (Refund=No) ∧ (Status=Married) → No

Simplified Rule: (Status=Married) → No
Further Characterizing Rules

• Rules are not mutually exclusive

• A record may trigger more than one rule
• Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

• Rules are not exhaustive

• A record may not trigger any rules
• Solution?
• Use a default class

Ordered Rule Set
• Rules are rank ordered according to their priority
• An ordered rule set is known as a decision list
• When a test record is presented to the classifier
• It is assigned to the class label of the highest ranked rule it has
triggered
• If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
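A decision list can be sketched as a loop that returns the class of the first rule a record triggers and falls back to a default class otherwise. The sketch assumes the rules are ranked in the order R1 to R5, and `DEFAULT_CLASS = "Unknown"` is a hypothetical placeholder, since the slides do not name a default class.

```python
# Assumed ranking: R1 (highest priority) down to R5.
DECISION_LIST = [
    ("R1", {"Give Birth": "no", "Can Fly": "yes"}, "Birds"),
    ("R2", {"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),
    ("R3", {"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),
    ("R4", {"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),
    ("R5", {"Live in Water": "sometimes"}, "Amphibians"),
]
DEFAULT_CLASS = "Unknown"  # hypothetical default class for uncovered records

def classify(record, rules=DECISION_LIST, default=DEFAULT_CLASS):
    """Return (rule name, class) of the highest-ranked triggered rule, else the default."""
    for name, conditions, label in rules:
        if all(record.get(attr) == value for attr, value in conditions.items()):
            return name, label
    return None, default

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
dogfish_shark = {"Blood Type": "cold", "Give Birth": "yes",
                 "Can Fly": "no", "Live in Water": "yes"}

print(classify(turtle))         # ('R4', 'Reptiles'): R4 outranks R5
print(classify(dogfish_shark))  # (None, 'Unknown'): no rule fires
```

With this ordering the turtle, which triggers both R4 and R5, is resolved by rank alone, so no voting scheme is needed.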

Rule Ordering Schemes
• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together

Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No

Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced}, Taxable Income>80K) ==> Yes

How to Evaluate Learnt Rule?
• Start with the most general rule possible: condition = empty
• Adding new attributes by adopting a greedy depth-first strategy
• Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and accuracy
• Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
• favors rules that have high accuracy and cover many positive tuples

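$$\mathrm{FOIL\_Gain} = pos' \times \left( \log_2 \frac{pos'}{pos' + neg'} - \log_2 \frac{pos}{pos + neg} \right)$$
• pos/neg are the # of positive/negative tuples covered by the rule before the
condition is extended; pos'/neg' are the counts after it is extended

• Rule pruning based on an independent set of test tuples

• pos/neg are the # of positive/negative tuples covered by R.
• If FOIL_Prune is higher for the pruned version of R, prune R
$$\mathrm{FOIL\_Prune}(R) = \frac{pos - neg}{pos + neg}$$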
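Both measures are simple enough to compute directly. The sketch below uses FOIL_Gain = pos' × (log2(pos'/(pos' + neg')) − log2(pos/(pos + neg))) and FOIL_Prune = (pos − neg)/(pos + neg). The counts (a rule covering 60 positive and 100 negative tuples, narrowed to 50 positive and 5 negative) are made-up numbers for illustration only, and treating the gain as 0 when no positives remain covered is an assumption of this sketch.

```python
import math

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain of extending a rule from (pos, neg) to (pos_new, neg_new) covered tuples."""
    if pos_new == 0:
        return 0.0  # assumption: no positives covered means nothing gained
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

def foil_prune(pos, neg):
    """FOIL_Prune value of a rule, evaluated on an independent pruning set."""
    return (pos - neg) / (pos + neg)

# Illustrative counts (not from the slides).
gain = foil_gain(pos=60, neg=100, pos_new=50, neg_new=5)
print(f"FOIL_Gain  = {gain:.2f}")               # about 63.88
print(f"FOIL_Prune = {foil_prune(50, 5):.3f}")  # 0.818
```

The gain is large here because the extended rule keeps most of the positives while shedding almost all negatives, which is exactly the behavior the measure rewards.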

How to Evaluate Learnt Rule?

• We can use the Likelihood Ratio Statistic, which tests whether the effect of a
rule is due to chance or instead reflects a real correlation between attribute
values and classes.
$$\mathrm{Likelihood\_Ratio} = 2 \sum_{i=1}^{m} f_i \log_2 \frac{f_i}{e_i}$$
• m is the number of classes
• For tuples satisfying the rule, f_i is the observed frequency of each
class among tuples, e_i is the expected frequency if the rule made
random predictions
• The higher the Likelihood Ratio, the better the rule.
• Used by CN2
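A minimal sketch of the statistic, 2 × Σ f_i log2(f_i / e_i), follows. The example counts (a training set with 60 positive and 100 negative tuples, and a rule covering 50 positive and 5 negative) are illustrative assumptions, and so is deriving the expected frequencies from the class proportions in the training set.

```python
import math

def expected_frequencies(n_covered, class_counts):
    """Expected per-class counts if the rule's covered tuples followed the class priors."""
    total = sum(class_counts)
    return [n_covered * count / total for count in class_counts]

def likelihood_ratio(observed, expected):
    """2 * sum(f_i * log2(f_i / e_i)); classes with f_i = 0 contribute nothing."""
    return 2 * sum(f * math.log2(f / e) for f, e in zip(observed, expected) if f > 0)

observed = [50, 5]                                           # covered positives, negatives
expected = expected_frequencies(sum(observed), [60, 100])    # [20.625, 34.375]
ratio = likelihood_ratio(observed, expected)
print(f"Likelihood ratio = {ratio:.2f}")  # about 99.94
```

A rule whose covered tuples simply mirror the class priors scores 0, which is why a higher value indicates a rule that is less likely to be a chance effect.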

Building Classification Rules

• Direct Method:
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R

• Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• e.g: C4.5rules

Direct Method: Sequential Covering

• Sequential covering algorithm: Extracts rules directly from
training data
• Typical sequential covering algorithms: FOIL, AQ, CN2,
RIPPER
• Rules are learned sequentially; each rule for a given class Ci should
cover many tuples of Ci but none (or few) of the tuples of
other classes

Rule Induction: Sequential Covering Method
• Start with an empty rule set
• Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the
rule are removed
• Repeat the process (above steps) on the remaining tuples
• until a termination condition is met, e.g., when no training
examples remain or when the quality of the rule returned is below a
user-specified threshold

• Compare with decision-tree induction, which learns a set of
rules simultaneously

Sequential Covering Algorithm

while (enough target tuples left)
    generate a rule
    remove positive target tuples satisfying this rule

[Figure: positive examples, with the subsets covered by Rule 1, Rule 2, and Rule 3 marked]

Rule Generation

• To generate a rule
while (true)
    find the best predicate p
    if foil-gain(p) > threshold then add p to current rule
    else break

Predicates considered may be independent of each other
(as in the previous slide) or progressively restrictive (as in the
next slide)
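The covering loop and the rule-growing loop can be sketched together. This is a simplified illustration, not FOIL or RIPPER themselves: predicates are `attribute = value` tests, the best predicate is the one with the highest FOIL gain, the threshold defaults to 0, and only covered positive tuples are removed after each rule. The nine-animal data set and all function names are assumptions chosen for the example.

```python
import math

def covers(cond, record):
    """True if the record satisfies every attribute test in the condition."""
    return all(record.get(a) == v for a, v in cond.items())

def foil_gain(pos, neg, pos_new, neg_new):
    """FOIL gain of extending a rule; taken as 0 when no positives stay covered."""
    if pos_new == 0:
        return 0.0
    return pos_new * (math.log2(pos_new / (pos_new + neg_new))
                      - math.log2(pos / (pos + neg)))

def learn_one_rule(data, target, threshold=0.0):
    """Grow one rule greedily: keep adding the predicate with the best FOIL gain."""
    cond, covered = {}, list(data)
    while True:
        pos = sum(1 for _, y in covered if y == target)
        neg = len(covered) - pos
        if pos == 0 or neg == 0:
            break  # nothing to cover, or the rule is already pure
        best = None
        tests = sorted({(a, v) for r, _ in covered for a, v in r.items() if a not in cond})
        for a, v in tests:
            subset = [(r, y) for r, y in covered if r[a] == v]
            p2 = sum(1 for _, y in subset if y == target)
            gain = foil_gain(pos, neg, p2, len(subset) - p2)
            if best is None or gain > best[0]:
                best = (gain, a, v, subset)
        if best is None or best[0] <= threshold:
            break
        _, a, v, covered = best
        cond[a] = v
    return cond

def sequential_covering(data, target):
    """Learn rules one at a time, removing the positive tuples each new rule covers."""
    rules, remaining = [], list(data)
    while any(y == target for _, y in remaining):
        cond = learn_one_rule(remaining, target)
        if not cond:
            break  # no predicate improves on the empty rule
        rules.append(cond)
        remaining = [(r, y) for r, y in remaining if not (covers(cond, r) and y == target)]
    return rules

def animal(give_birth, can_fly, water):
    return {"Give Birth": give_birth, "Can Fly": can_fly, "Live in Water": water}

DATA = [  # illustrative subset of the vertebrate table
    (animal("yes", "no", "no"), "mammals"), (animal("no", "no", "no"), "reptiles"),
    (animal("no", "no", "yes"), "fishes"), (animal("yes", "yes", "no"), "mammals"),
    (animal("no", "yes", "no"), "birds"), (animal("no", "no", "sometimes"), "birds"),
    (animal("no", "yes", "no"), "birds"), (animal("no", "yes", "no"), "birds"),
    (animal("no", "no", "sometimes"), "amphibians"),
]

rules = sequential_covering(DATA, "birds")
for cond in rules:
    print(" AND ".join(f"({a} = {v})" for a, v in cond.items()), "-> birds")
```

On this toy data the first rule learned matches R1 from the earlier slides; the second rule is needed only because the penguin is not covered by the first.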

Rule Generation

[Figure: nested regions for the progressively restrictive rules A3=1; A3=1 && A1=2;
A3=1 && A1=2 && A8=5, drawn over positive and negative examples]

Indirect Methods

Decision tree:
P = No → Q
    Q = No → -
    Q = Yes → +
P = Yes → R
    R = No → +
    R = Yes → Q
        Q = No → -
        Q = Yes → +

Rule Set
r1: (P=No,Q=No) ==> -
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> -
r5: (P=Yes,R=Yes,Q=Yes) ==> +

Ensemble Methods

• Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating
predictions made by multiple classifiers

General Idea

D: original training data

Step 1: Create multiple data sets D1, D2, ..., Dt-1, Dt from D
Step 2: Build multiple classifiers C1, C2, ..., Ct-1, Ct, one per data set
Step 3: Combine the classifiers into C*

Why does it work?

• Suppose there are 25 base classifiers

• Each classifier has error rate, ε = 0.35
• Assume classifiers are independent
• Probability that the ensemble classifier makes a wrong
prediction:

$$\sum_{i=13}^{25} \binom{25}{i} \varepsilon^{i} (1 - \varepsilon)^{25 - i} = 0.06$$
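The sum is a binomial tail: the ensemble errs only if at least 13 of the 25 independent classifiers err. A minimal sketch, with the function name `ensemble_error` chosen for this illustration:

```python
from math import comb

def ensemble_error(n, eps):
    """P(a majority of n independent base classifiers is wrong), each with error rate eps."""
    majority = n // 2 + 1
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i) for i in range(majority, n + 1))

error = ensemble_error(25, 0.35)
print(f"Ensemble error: {error:.3f}")  # about 0.06, well below the base rate of 0.35
```

The improvement depends on both assumptions stated above: with eps above 0.5, or with strongly correlated classifiers, majority voting does not help.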


When error rate differs…

• Suppose there are k base classifiers

• Each classifier has different error rate, εi
• Again, assume classifiers are independent
• Probability that the ensemble classifier makes a wrong
prediction:
• Majority of classifiers have to make wrong prediction
• Compute the probability for each combination that can make
wrong prediction (brute force method)
• Sum up for all possible combinations
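The brute-force computation can be sketched by enumerating every right/wrong outcome of the k classifiers and summing the probabilities of those in which a strict majority is wrong. The error rates below are hypothetical, and the enumeration costs 2^k steps, so it is only practical for small k.

```python
from itertools import product

def ensemble_error_brute_force(error_rates):
    """Sum P(outcome) over all outcomes where a strict majority of classifiers is wrong."""
    k = len(error_rates)
    total = 0.0
    for outcome in product([0, 1], repeat=k):   # 1 means that classifier is wrong
        if sum(outcome) > k / 2:
            probability = 1.0
            for wrong, eps in zip(outcome, error_rates):
                probability *= eps if wrong else (1 - eps)
            total += probability
    return total

rates = [0.20, 0.30, 0.35, 0.40, 0.25]  # hypothetical per-classifier error rates
print(f"Ensemble error: {ensemble_error_brute_force(rates):.4f}")
```

When all rates are equal, the result reduces to the binomial sum from the previous slide.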


S2-20_DSECFZC415
Classification Model Evaluation

Model Evaluation and Selection


Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
Comparing classifiers:
– Cost-benefit analysis and ROC Curves


Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Given m classes, an entry CM_{i,j} in a confusion matrix indicates the # of tuples in
class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals

Actual class \ Predicted class   C1                     ¬C1
C1                               True Positives (TP)    False Negatives (FN)
¬C1                              False Positives (FP)   True Negatives (TN)

Classifier Evaluation Metrics: Confusion Matrix
Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000

Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
Classifier Accuracy, or recognition rate: percentage of test set tuples that are
correctly classified
Accuracy = (TP + TN)/All
Error rate: 1 – accuracy, or
Error rate = (FP + FN)/All

A\P   C    ¬C
C     TP   FN   P
¬C    FP   TN   N
      P’   N’   All

Class Imbalance Problem:
• One class may be rare, e.g. fraud, or HIV-positive
• Significant majority of the negative class and minority of the positive class
• Sensitivity: True Positive recognition rate
• Sensitivity = TP/P
• Specificity: True Negative recognition rate
• Specificity = TN/N

Classifier Evaluation Metrics:
Precision and Recall, and F-measures
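Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive?
$$\mathrm{precision} = \frac{TP}{TP + FP}$$
Recall: completeness – what % of positive tuples did the classifier
label as positive?
$$\mathrm{recall} = \frac{TP}{TP + FN} = \frac{TP}{P}$$
Perfect score is 1.0

Inverse relationship between precision & recall

F measure (F1 or F-score): harmonic mean of precision and recall
$$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Why a harmonic mean, but not arithmetic or geometric mean?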

Classifier Evaluation Metrics:
Precision and Recall, and F-measures

[Figure: plot against precision, with recall fixed at 70%]

Classifier Evaluation Metrics:
Precision and Recall, and F-measures
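The harmonic mean can be generalized by applying weights α and (1 – α) to
precision and recall:
$$F = \frac{1}{\alpha \frac{1}{\mathrm{precision}} + (1 - \alpha) \frac{1}{\mathrm{recall}}}$$
By substituting β² = (1 – α)/α, we get
$$F_\beta = \frac{(1 + \beta^2) \times \mathrm{precision} \times \mathrm{recall}}{\beta^2 \times \mathrm{precision} + \mathrm{recall}}$$
Fβ: weighted measure of precision and recall
– assigns β² times as much weight to recall as to precision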
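The sketch below uses the standard form Fβ = (1 + β²) × precision × recall / (β² × precision + recall), where β = 1 gives the plain harmonic mean F1. It also prints the arithmetic and geometric means with recall held at 0.70, to show how much more sharply the harmonic mean drops when precision is poor. The precision values are arbitrary illustration points.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall; beta = 1 gives F1."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

recall = 0.70
print("precision  arithmetic  geometric  harmonic(F1)  F2")
for precision in (0.10, 0.40, 0.70, 1.00):
    arithmetic = (precision + recall) / 2
    geometric = (precision * recall) ** 0.5
    print(f"{precision:9.2f}  {arithmetic:10.3f}  {geometric:9.3f}  "
          f"{f_beta(precision, recall):12.3f}  {f_beta(precision, recall, beta=2):.3f}")
```

With precision at 0.10 the arithmetic mean still reports 0.40, while F1 falls to about 0.175, so a classifier cannot score well by excelling on only one of the two measures.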

Classifier Evaluation Metrics: Example
Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
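The figures in this example follow directly from the four confusion-matrix cells. A minimal sketch, with the function name `evaluate` assumed for illustration; note that (90 + 9560)/10000 works out to an accuracy of 96.50%.

```python
def evaluate(tp, fn, fp, tn):
    """Standard metrics from the four cells of a two-class confusion matrix."""
    p, n = tp + fn, fp + tn          # actual positives and negatives
    total = p + n
    return {
        "accuracy": (tp + tn) / total,
        "error rate": (fp + fn) / total,
        "sensitivity (recall)": tp / p,
        "specificity": tn / n,
        "precision": tp / (tp + fp),
    }

# Cancer example: 90 true positives, 210 false negatives,
# 140 false positives, 9560 true negatives.
metrics = evaluate(tp=90, fn=210, fp=140, tn=9560)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

The high accuracy alongside 30% sensitivity illustrates the class imbalance problem: the rare positive class is mostly missed, yet accuracy barely notices.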

Evaluating Classifier Accuracy: Holdout Method

Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random subsampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained

Evaluating Classifier Accuracy:
Cross-Validation Methods

Cross-validation (k-fold, where k = 10 is most popular)
– Randomly partition the data into k mutually exclusive subsets D1, ..., Dk, each of
approximately equal size
– At the i-th iteration, use Di as the test set and the others as the training set
– Leave-one-out: k folds where k = # of tuples, for small-sized data
– Stratified cross-validation: folds are stratified so that the class distribution in
each fold is approx. the same as that in the initial data
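The partitioning step of (unstratified) k-fold cross-validation can be sketched as below. The function name `k_fold_splits` and the fixed seed are assumptions of this illustration; stratification is not implemented here.

```python
import random

def k_fold_splits(data, k, seed=0):
    """Yield (train, test) pairs: fold i is the test set in iteration i."""
    items = list(data)
    random.Random(seed).shuffle(items)        # random partition, reproducible via seed
    folds = [items[i::k] for i in range(k)]   # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

for train, test in k_fold_splits(range(10), k=5):
    print("test:", sorted(test), "train size:", len(train))
```

Setting k equal to the number of tuples turns the same function into leave-one-out, since each fold then holds a single tuple.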

Evaluating Classifier Accuracy: Bootstrap
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and
re-added to the training set
Several bootstrap methods exist; a common one is the .632 bootstrap
– A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the training
set end up forming the test set. About 63.2% of the original data end up in
the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)^d ≈ e^(-1)
≈ 0.368)
– Repeat the sampling procedure k times; the overall accuracy of the model is estimated as:
$$Acc(M) = \frac{1}{k} \sum_{i=1}^{k} \left( 0.632 \times Acc(M_i)_{test\_set} + 0.368 \times Acc(M_i)_{train\_set} \right)$$
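The 63.2% / 36.8% split can be checked empirically. The sketch draws d tuples with replacement and treats the never-drawn tuples as the test set; the data set of 10,000 integers and the seed are arbitrary choices for illustration.

```python
import math
import random

def bootstrap_split(data, rng):
    """Sample len(data) tuples with replacement; tuples never drawn form the test set."""
    d = len(data)
    picks = [rng.randrange(d) for _ in range(d)]
    chosen = set(picks)
    train = [data[i] for i in picks]                       # d samples, duplicates allowed
    test = [data[i] for i in range(d) if i not in chosen]
    return train, test

rng = random.Random(42)
data = list(range(10_000))
train, test = bootstrap_split(data, rng)

d = len(data)
print(f"distinct tuples in training set: {len(set(train)) / d:.3f}")  # close to 0.632
print(f"tuples left for the test set:    {len(test) / d:.3f}")        # close to 0.368
print(f"(1 - 1/d)^d = {(1 - 1 / d) ** d:.4f}, e^-1 = {math.exp(-1):.4f}")
```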

Model Selection: ROC Curves
ROC (Receiver Operating Characteristics) curves: for visual comparison of
classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to
belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less
accurate is the model
• Vertical axis represents the true positive rate
• Horizontal axis represents the false positive rate
• The plot also shows a diagonal line
• A model with perfect accuracy will have an area of 1.0
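A minimal sketch of how the curve is traced: rank the test tuples by decreasing score, then step up for each positive and right for each negative. The scores and labels are hypothetical, tied scores are not treated specially (an assumption of this sketch), and the area is computed with the trapezoid rule.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points; labels are 1 for positive and 0 for negative."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1          # a positive moves the curve up
        else:
            fp += 1          # a negative moves the curve right
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoid-rule area under a piecewise-linear curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores = [0.9, 0.8, 0.7, 0.6]   # hypothetical classifier scores
labels = [1, 0, 1, 0]           # hypothetical true classes
points = roc_points(scores, labels)
print(points)
print("AUC =", area_under_curve(points))  # 0.75
```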

S2-20_DSECFZC415
Prediction

Prediction


Prediction vs. Classification
• How is (Numerical) prediction similar to classification?
• construct a model
• use model to predict continuous or ordered value for a given input
• Difference between Prediction and classification
• Classification predicts categorical class labels
• Prediction models continuous-valued functions
• Major method for prediction: regression
• model the relationship between one or more independent or predictor variables
and a dependent or response variable
• Profit, sales, mortgage rates, house values, square footage, temperature, or distance
could all be predicted using regression techniques. For example, a regression model
could be used to predict the value of a house based on location, number of rooms,
lot size, and other factors.

Regression for Prediction
• A regression task begins with a data set in which the target values are
known, e.g.
• A regression model that predicts house values could be developed based on
observed data for many houses over a period of time.
• The data might track the age of the house, square footage, number of rooms,
taxes, school district, proximity to shopping centers, and so on.
• House value would be the target, the other attributes would be the predictors,
and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data.
• These relationships between predictors and target are summarized in a model,
which can then be applied to a different data set in which the target values are
unknown

Prediction Techniques

• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Log-linear models,
• Regression trees
• etc.

Regression Analysis
• Regression analysis seeks to determine the values of parameters for a
function that cause the function to best fit a set of data observations
that you provide.
• The following equation expresses these relationships in symbols.
y = F(x,w) + e

• Regression is the process of estimating the value of a continuous target
(y) as a function (F) of one or more predictors (x1, x2, ..., xn), a set of
parameters (w0, w1, w2, ..., wn), and a measure of error (e).

Regression Analysis
• In the equation
y = F(x,w) + e
• The predictors(x1 , x2 , ..., xn) can be understood as independent
variables
• The target (y) is the dependent variable.
• The error (e), also called the residual, is the difference between the
observed (actual) and predicted value of the dependent variable.
• The regression parameters are also known as regression coefficients.
• The process of training a regression model involves finding the
parameter values that minimize a measure of the error, for example, the
sum of squared errors.

Simple Linear Regression
• Simple Linear regression: involves a response variable y and a single predictor
variable x
$$y = w_0 + w_1 x$$
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line
$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x}$$

Linear Regression With a Single Predictor

Multiple Linear Regression

Multiple linear regression: involves more than one predictor variable
• Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
• e.g., for 2-D data, we may have:
$$y = w_0 + w_1 x_1 + w_2 x_2$$
• Solvable by an extension of the least squares method or using software such as
SAS or S-Plus
• Many nonlinear functions can be transformed into the above

Nonlinear Regression

• Often the relationship between x and y
cannot be approximated with a straight
line. In this case, a nonlinear regression
technique may be used. Alternatively, the
data could be preprocessed to make the
relationship linear.
• Nonlinear regression models define y as a
function of x using an equation that is
more complicated than the linear
regression equation

Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into linear regression
model. For example,
• y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
• convertible to linear with new variables: x_2 = x^2, x_3 = x^3
• y = w_0 + w_1 x + w_2 x_2 + w_3 x_3
• Other functions, such as the power function, can also be transformed to a
linear model
• Some models are intractably nonlinear (e.g., sum of exponential terms)
• possible to obtain least squares estimates through extensive
calculation on more complex formulae

Regression Trees and Model Trees
• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that
reach the leaf
• Model tree: proposed by Quinlan (1992)
• Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
• A more general case than regression tree
• Regression and model trees tend to be more accurate than linear regression
when the data are not represented well by a simple linear model

Regression Trees
Day  Outlook   Temp.  Humidity  Wind    Golf Players
1    Sunny     Hot    High      Weak    25
2    Sunny     Hot    High      Strong  30
3    Overcast  Hot    High      Weak    46
4    Rain      Mild   High      Weak    45
5    Rain      Cool   Normal    Weak    52
6    Rain      Cool   Normal    Strong  23
7    Overcast  Cool   Normal    Strong  43
8    Sunny     Mild   High      Weak    35
9    Sunny     Cool   Normal    Weak    38
10   Rain      Mild   Normal    Weak    46
11   Sunny     Mild   Normal    Strong  48
12   Overcast  Mild   High      Strong  52
13   Overcast  Hot    Normal    Weak    44
14   Rain      Mild   High      Strong  30
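In a regression tree, each leaf predicts the average target value of the training tuples that reach it. As a minimal sketch, the code below builds a one-level tree on this data by splitting on Outlook; the choice of Outlook as the split attribute is an assumption made only for illustration, not a claim about which split a tree learner would pick.

```python
from collections import defaultdict

# (Outlook, Golf Players) for days 1 to 14 of the table.
ROWS = [
    ("Sunny", 25), ("Sunny", 30), ("Overcast", 46), ("Rain", 45), ("Rain", 52),
    ("Rain", 23), ("Overcast", 43), ("Sunny", 35), ("Sunny", 38), ("Rain", 46),
    ("Sunny", 48), ("Overcast", 52), ("Overcast", 44), ("Rain", 30),
]

def leaf_means(rows):
    """Group target values by attribute value; each leaf predicts its group's mean."""
    groups = defaultdict(list)
    for value, target in rows:
        groups[value].append(target)
    return {value: sum(targets) / len(targets) for value, targets in groups.items()}

overall = sum(target for _, target in ROWS) / len(ROWS)
print(f"Prediction without any split: {overall:.2f}")
for outlook, mean in leaf_means(ROWS).items():
    print(f"Outlook = {outlook}: predict {mean:.2f}")
```

A model tree would replace each of these constant leaf values with a linear equation fitted to the tuples in that leaf.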

Model Tree Sample

LM1, LM2, …., LM6 are distinct linear models


Simple Linear Regression Example

x: Area (in sq. m)   y: Rent (in 000s of Rupees)
172                  42
150                  35
181                  46
174                  40
194                  50

Can we predict rent for a house of 160 sq. m. in the locality?
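One way to answer the question is to fit a least-squares line to the five observations, using w1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and w0 = ȳ − w1·x̄, and then evaluate it at 160 sq. m. The function name `fit_line` is an assumption of this sketch.

```python
def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for the line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_bar - w1 * x_bar
    return w0, w1

area = [172, 150, 181, 174, 194]   # sq. m
rent = [42, 35, 46, 40, 50]        # thousands of Rupees

w0, w1 = fit_line(area, rent)
predicted = w0 + w1 * 160
print(f"w0 = {w0:.3f}, w1 = {w1:.4f}")
print(f"Predicted rent for 160 sq. m: {predicted:.2f} thousand Rupees")
```

With these data the slope is about 0.345 and the intercept about −17.58, so the predicted rent is roughly 37.7 thousand Rupees. Since 160 sq. m lies inside the observed range of areas (150 to 194), this is an interpolation rather than an extrapolation.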

S2-20_DSECFZC415
Association Analysis

Association Analysis Basics

Association Analysis
Association analysis measures the strength of co-occurrence between one item
and another.
– The objective of this class of data mining algorithms is not to predict an occurrence of an
item, like classification or regression do, but to find usable patterns in the co-occurrences of
the items.
– Association rules learning is a branch of an unsupervised learning process that discovers
hidden patterns in data, in the form of easily recognizable rules

Association algorithms are widely used in retail analysis of transactions,
recommendation engines, and online clickstream analysis across web pages.
– One of the popular applications of this technique is called market basket analysis, which
finds co-occurrences of one retail item with another item within the same retail purchase
transaction

Retailers can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.

Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket transactions

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Example of Association Rules
{Diaper} → {Butter},
{Milk, Bread} → {Beans, Coke},
{Butter, Bread} → {Milk},

Implication means co-occurrence, not causality!

Definition: Frequent Itemset
TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule
TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

• Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Butter}

• Rule Evaluation Metrics
– Support (s)
• Fraction of transactions that contain both X and Y
– Confidence (c)
• Measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} ⇒ {Butter}
$$s = \frac{\sigma(\text{Milk, Diaper, Butter})}{|T|} = \frac{2}{5} = 0.4$$
$$c = \frac{\sigma(\text{Milk, Diaper, Butter})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} = 0.67$$
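Support and confidence for the example rule can be computed directly from the five transactions. In this sketch each transaction is a Python set, and the function names are assumptions for illustration.

```python
TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Beans"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions=TRANSACTIONS):
    """sigma(itemset): number of transactions containing every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(x, y, transactions=TRANSACTIONS):
    """s(X -> Y): fraction of transactions containing both X and Y."""
    return support_count(x | y, transactions) / len(transactions)

def confidence(x, y, transactions=TRANSACTIONS):
    """c(X -> Y): how often Y appears in the transactions that contain X."""
    return support_count(x | y, transactions) / support_count(x, transactions)

x, y = {"Milk", "Diaper"}, {"Butter"}
print(f"s = {support(x, y):.2f}")     # s = 0.40
print(f"c = {confidence(x, y):.2f}")  # c = 0.67
```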

Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all
rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules
TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Example of Rules:
{Milk,Diaper} → {Butter} (s=0.4, c=0.67)
{Milk,Butter} → {Diaper} (s=0.4, c=1.0)
{Diaper,Butter} → {Milk} (s=0.4, c=0.67)
{Butter} → {Milk,Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk,Butter} (s=0.4, c=0.5)
{Milk} → {Diaper,Butter} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Butter}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
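The six rules can be regenerated by enumerating every binary partition of {Milk, Diaper, Butter}. The sketch shows why support is shared while confidence varies: the numerator is always σ of the full itemset, and only the denominator σ(X) changes. Function names are assumptions for illustration.

```python
from itertools import combinations

TRANSACTIONS = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Beans"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def rules_from_itemset(itemset, transactions):
    """Yield (X, Y, support, confidence) for every binary partition X -> Y of the itemset."""
    items = sorted(itemset)
    full_count = support_count(set(items), transactions)
    for size in range(1, len(items)):
        for lhs in combinations(items, size):
            x = set(lhs)
            y = set(items) - x
            yield (x, y,
                   full_count / len(transactions),
                   full_count / support_count(x, transactions))

rules = list(rules_from_itemset({"Milk", "Diaper", "Butter"}, TRANSACTIONS))
for x, y, s, c in rules:
    print(f"{sorted(x)} -> {sorted(y)} (s={s:.1f}, c={c:.2f})")
```

An itemset with k items yields 2^k − 2 such rules, which is six for k = 3.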

Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

Frequent itemset generation is still computationally expensive

Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

ABCDE

Given d items, there are 2^d possible candidate itemsets
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke

– Match each transaction against every candidate
– Complexity ~ O(NMw) for N transactions, M candidates and maximum transaction width w ⇒ expensive, since M = 2^d
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce size of N as the size of itemset increases
– Used by DHP (Direct Hashing & Pruning) and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction
Apriori Algorithm

Frequent itemset generation is computationally expensive
– The Apriori principle can be used to reduce the computation
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

Apriori principle holds due to the following property of the support measure:

$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \ge s(Y)$

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support

The Apriori algorithm was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994
Illustrating Apriori Principle
[Figure: itemset lattice over {A, B, C, D, E}. Once an itemset is found to be infrequent, all of its supersets are pruned.]
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If any itemset is infrequent, its supersets should not be generated or tested!
Method:
– Initially, scan DB once to get the frequent 1-itemsets
– Generate length (k+1) candidate itemsets from length k frequent itemsets
– Test the candidates against DB
– Terminate when no frequent or candidate set can be generated

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke
Apriori Algorithm
Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent
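
The loop above can be sketched in Python as follows. This is an illustrative, unoptimized version: candidates are formed by taking unions of frequent k-itemsets (a simplification of the prefix-based self-join described on the next slide), and `apriori` and `minsup_count` are names chosen here.

```python
from itertools import combinations

def apriori(transactions, minsup_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    def count(candidates):
        # One pass over the DB for the whole candidate set.
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in count(items).items() if n >= minsup_count}
    result = dict(frequent)
    k = 1
    while frequent:
        prev = set(frequent)
        # Generate length (k+1) candidates from length k frequent itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        # Prune candidates that contain an infrequent subset of length k.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        # Count support and keep only the frequent candidates.
        frequent = {c: n for c, n in count(candidates).items()
                    if n >= minsup_count}
        result.update(frequent)
        k += 1
    return result

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Butter", "Beans"},
    {"Milk", "Diaper", "Butter", "Coke"},
    {"Bread", "Milk", "Diaper", "Butter"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
freq = apriori(transactions, minsup_count=3)
for itemset, n in sorted(freq.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 3 this yields four frequent items and four frequent pairs; the only surviving 3-itemset candidate, {Bread, Milk, Diaper}, has count 2 and is discarded.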

Important Details of Apriori
How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
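
The self-join and pruning steps for this example can be sketched as below. The function name `apriori_gen` is an assumption made for illustration; itemsets are represented as sorted tuples so that two k-itemsets are joined only when their first k-1 items agree.

```python
from itertools import combinations

def apriori_gen(L_k):
    """Self-join L_k on the first k-1 items, then prune by the Apriori principle."""
    L_k = sorted(tuple(sorted(x)) for x in L_k)
    k = len(L_k[0])
    frequent = set(L_k)
    # Step 1: self-join. Merge two k-itemsets that differ only in the last item.
    joined = [a + (b[-1],) for a in L_k for b in L_k
              if a[:-1] == b[:-1] and a[-1] < b[-1]]
    # Step 2: prune. Drop candidates that have a k-subset outside L_k.
    return [c for c in joined
            if all(s in frequent for s in combinations(c, k))]

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])  # ['abcd']
```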
How to count supports of candidates?
Illustrating Apriori Principle
Minimum Support = 3

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke

Items (1-itemsets)
Item    Count
Bread   4
Coke    2
Milk    4
Butter  3
Diaper  4
Beans   1

Pairs (2-itemsets)
(No need to generate candidates involving Coke or Beans)
Itemset          Count
{Bread,Milk}     3
{Bread,Butter}   2
{Bread,Diaper}   3
{Milk,Butter}    2
{Milk,Diaper}    3
{Butter,Diaper}  3

Triplets (3-itemsets)
Itemset              Count
{Bread,Milk,Diaper}  2

If every subset is considered, $\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 6 + 15 + 20 = 41$ candidates
With support-based pruning, 6 + 6 + 1 = 13 candidates
Factors Affecting Complexity
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
– since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)

Can we improve Apriori Efficiency?

Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

Hash-based technique

Create hash table using hash function

h(x, y) = ((order of x) × 10 + (order of y)) mod 7

A 2-itemset with a corresponding bucket count in the hash table that is below the support threshold cannot be frequent
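A small sketch of the bucket counting, applied to the market-basket transactions from the earlier slides. The item order (Bread=1, Milk=2, Diaper=3, Butter=4, Coke=5, Beans=6) and the minimum support count of 3 are assumptions made here for illustration; the slide only gives the hash function.

```python
from itertools import combinations

transactions = [
    ["Bread", "Milk"],
    ["Bread", "Diaper", "Butter", "Beans"],
    ["Milk", "Diaper", "Butter", "Coke"],
    ["Bread", "Milk", "Diaper", "Butter"],
    ["Bread", "Milk", "Diaper", "Coke"],
]
# Assumed item order; any fixed numbering of the items would do.
order = {"Bread": 1, "Milk": 2, "Diaper": 3, "Butter": 4, "Coke": 5, "Beans": 6}
minsup_count = 3

def h(x, y):
    """Hash function from the slide: ((order of x)*10 + (order of y)) mod 7."""
    return (order[x] * 10 + order[y]) % 7

# While scanning for 1-itemsets, hash every 2-itemset of each transaction.
buckets = [0] * 7
for t in transactions:
    for x, y in combinations(sorted(t, key=order.get), 2):
        buckets[h(x, y)] += 1

def may_be_frequent(x, y):
    """A pair in a bucket whose count is below minsup cannot be frequent."""
    x, y = sorted((x, y), key=order.get)
    return buckets[h(x, y)] >= minsup_count

print(buckets)
print(may_be_frequent("Bread", "Coke"), may_be_frequent("Milk", "Diaper"))
```

Several pairs can share a bucket, so a bucket count at or above the threshold does not prove that a pair is frequent; it only means the pair must still be counted as a candidate.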
Efficiency Techniques
Transaction reduction
– A transaction that does not contain any frequent k-itemsets
cannot contain any frequent (k + 1)-itemsets. Such a
transaction can be removed from further consideration

Dynamic Itemset Counting

– Instead of counting over the entire database, promote a candidate to a frequent itemset if it passes a (lower) threshold after partial counting. Afterwards, generate larger patterns using the promoted itemset.
Efficiency Techniques - Dynamic Itemset Counting

– Once both A and D are determined frequent, the counting of AD begins
– Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins, thus reducing the effective number of scans

[Figure: itemset lattice from {} up to ABCD, with timelines over the transactions comparing when Apriori and DIC start counting 1-itemsets, 2-itemsets and 3-itemsets]

S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
Efficiency Techniques

Mining by partitioning the data

– It has two phases. In phase I, divide the transactions of D into n non-overlapping partitions. Each partition uses a proportionally lower support count threshold. For each partition, all the local frequent itemsets are found.
– Any itemset that is potentially frequent with respect to D must be a frequent itemset in at least one of the partitions. Therefore, all local frequent itemsets are candidate itemsets with respect to D.
– In phase II, a second scan of D counts the actual support of each candidate to determine the globally frequent itemsets.
Mining with Vertical format


Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one
transaction, they do not belong to the set of frequent 2-itemsets.


Vertical format: t(AB) = {T11, T25, …}

– tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
– t(X) = t(Y): X and Y always happen together
– t(X) ⊂ t(Y): transaction having X always has Y
Using diffset to accelerate mining
– Only keep track of differences of tids
– t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
– Diffset (XY, X) = {T2}
• Diffset can reduce space complexity
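
The tid-list and diffset ideas can be sketched on the market-basket transactions from the earlier slides. The variable names are illustrative only.

```python
from collections import defaultdict

# Horizontal format: TID -> items.
transactions = {
    1: {"Bread", "Milk"},
    2: {"Bread", "Diaper", "Butter", "Beans"},
    3: {"Milk", "Diaper", "Butter", "Coke"},
    4: {"Bread", "Milk", "Diaper", "Butter"},
    5: {"Bread", "Milk", "Diaper", "Coke"},
}

# Vertical format: item -> tid-list.
tidlists = defaultdict(set)
for tid, items in transactions.items():
    for item in items:
        tidlists[item].add(tid)

# Support of a 2-itemset is the size of the intersection of the tid-lists.
t_milk = tidlists["Milk"]
t_milk_diaper = tidlists["Milk"] & tidlists["Diaper"]

# Diffset(XY, X) = t(X) - t(XY): only the difference is stored.
diffset = t_milk - t_milk_diaper
support_from_diffset = len(t_milk) - len(diffset)

print(sorted(t_milk), sorted(t_milk_diaper), sorted(diffset))
```

Here t(Milk) has four tids and the diffset only one, so storing the diffset is cheaper than storing t(Milk, Diaper) while still giving the support count.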


Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g., {a1, …, a100} contains
$\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ sub-patterns!

Solution: Mine closed frequent patterns and maximal frequent patterns instead
– An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X
– An itemset X is a maximal pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X
Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules
Closed Patterns and Max-Patterns
Example
– DB = {<a1, …, a100>, <a1, …, a100>, <a1, …, a50>}
– Min_sup = 2
What is the set of closed itemsets?
– <a1, …, a100>: 2
– <a1, …, a50>: 3
What is the set of maximal patterns?
– <a1, …, a100>: 2
What is the set of all patterns?
– All $2^{100} - 1 \approx 1.27 \times 10^{30}$ non-empty subsets of {a1, …, a100}
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

Itemset lattice annotated with transaction IDs (itemset: IDs of the transactions that contain it)
null
A: 124   B: 123   C: 1234   D: 245   E: 345
AB: 12   AC: 124   AD: 24   AE: 4   BC: 123   BD: 2   BE: 3   CD: 24   CE: 34   DE: 45
ABC: 12   ABD: 2   ACD: 24   ACE: 4   ADE: 4   BCD: 2   BCE: 3   CDE: 4
ABCD: 2   ACDE: 4
Not supported by any transactions: ABE, BDE, ABCE, ABDE, BCDE, ABCDE
Maximal vs Closed Frequent Itemsets
Minimum support = 2 (same transactions and lattice as on the previous slide)
Closed but not maximal: C, D, E, AC, BC
Closed & maximal: CE, DE, ABC, ACD
# Closed = 9
# Maximal = 4
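The counts of closed and maximal frequent itemsets can be checked by brute force on this five-transaction example. This is only a sketch for tiny data; it enumerates every subset of the items, which is exactly what practical algorithms avoid.

```python
from itertools import combinations

transactions = ["ABC", "ABCD", "BCE", "ACDE", "DE"]
minsup = 2
items = sorted(set("".join(transactions)))

def sup(itemset):
    """Support count of an itemset given as an iterable of items."""
    return sum(1 for t in transactions if set(itemset) <= set(t))

# All frequent itemsets with their support counts.
frequent = {frozenset(c): sup(c)
            for k in range(1, len(items) + 1)
            for c in combinations(items, k)
            if sup(c) >= minsup}

# Closed: no proper superset has the same support.
closed = [x for x in frequent
          if not any(x < y and frequent[y] == frequent[x] for y in frequent)]
# Maximal: no proper superset is frequent.
maximal = [x for x in frequent if not any(x < y for y in frequent)]

print(len(frequent), len(closed), len(maximal))
print(sorted("".join(sorted(x)) for x in maximal))
```

Checking only frequent supersets is enough for closedness, because a superset with the same support as a frequent itemset is itself frequent.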
Maximal vs Closed Itemsets

[Figure: nested sets] Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets
Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers


S2-20_DSECFZC415: Data Mining
(Lecture #10 – Association Analysis)

FP-growth Algorithm
Association Analysis (Review)

Association analysis measures the strength of co-occurrence between one item and another.
– The objective of this class of data mining algorithms is not to predict the occurrence of an item, as classification or regression do, but to find usable patterns in the co-occurrences of the items.
– Association rule learning is a branch of unsupervised learning that discovers hidden patterns in data, in the form of easily recognizable rules.

Association algorithms are widely used in retail analysis of transactions. Retailers can take advantage of these associations for bundle pricing, product placement, and even shelf space optimization within the store layout.

July 30, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining (Review)
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket transactions Example of Association Rules

TID Items {Diaper} → {Butter},

Definition: Frequent Itemset (Review)
Itemset
– A collection of one or more items
• Example: {Milk, Bread, Diaper}
– k-itemset
• An itemset that contains k items
Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke
Definition: Association Rule (Review)
Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Butter}

Rule Evaluation Metrics

TID  Items
1    Bread, Milk
2    Bread, Diaper, Butter, Beans
3    Milk, Diaper, Butter, Coke
4    Bread, Milk, Diaper, Butter
5    Bread, Milk, Diaper, Coke
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a
binary partitioning of a frequent itemset

Frequent itemset generation is computationally expensive

– Can we avoid it?

Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of candidates
– To find frequent itemset i1i2…i100
• # of scans: 100
• # of candidates: $\binom{100}{1} + \binom{100}{2} + \dots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$!

Bottleneck: candidate-generation-and-test
– Can we avoid candidate generation?
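The candidate count quoted above can be confirmed with exact integer arithmetic. This is a small check, not part of any mining algorithm.

```python
import math

n = 100
# Number of non-empty subsets of a 100-item pattern, summed level by level.
total = sum(math.comb(n, k) for k in range(1, n + 1))

print(total == 2**n - 1)   # True
print(f"{total:.3e}")      # 1.268e+30
```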
Mining Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
– Breadth-first (i.e., level-wise) search
– Candidate generation and test
– Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’00)
– Depth-first search
– Avoid explicit candidate generation
Grow long patterns from short ones using local frequent items (Major philosophy behind FPGrowth)
– “abc” is a frequent pattern
– Get all transactions having “abc”: DB|abc
– “d” is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
FP-growth Algorithm
Use a compressed representation of the database using an FP-tree
Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order to obtain the f-list
3. Scan DB again, construct FP-tree
Construct FP-tree from a Transaction Database
min_support = 3
F-list = f-c-a-b-m-p

TID  Items bought              (ordered) frequent items
100  {f, a, c, d, g, i, m, p}  {f, c, a, m, p}
200  {a, b, c, f, l, m, o}     {f, c, a, b, m}
300  {b, f, h, j, o, w}        {f, b}
400  {b, c, k, s, p}           {c, b, p}
500  {a, f, c, e, l, p, m, n}  {f, c, a, m, p}
Construct FP-tree from a Transaction Database
min_support = 3
F-list = f-c-a-b-m-p

Header Table
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3
(the head pointer of each entry links to the nodes of that item in the tree)

FP-tree (child nodes indented under their parent)
{}
  f:4
    c:3
      a:3
        m:2
          p:2
        b:1
          m:1
    b:1
  c:1
    b:1
      p:1
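A compact sketch of the two-scan construction, using the five transactions from this example. The class `Node` and the functions `build_fp_tree` and `dump` are illustrative names; the f-list is passed in as given on the slide because ties between equally frequent items (f and c, or a, b, m and p) can be broken in more than one way. Node-links and the header table are left out to keep the sketch short.

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, a count and child nodes keyed by item."""
    def __init__(self, item, parent=None):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, flist):
    rank = {item: r for r, item in enumerate(flist)}
    root = Node(None)
    for t in transactions:
        # Keep only frequent items and sort them in f-list order.
        ordered = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = Node(item, node)
            node = node.children[item]
            node.count += 1   # shared prefixes accumulate counts
    return root

def dump(node, depth=0):
    """Return the tree as indented 'item:count' lines."""
    lines = []
    for child in node.children.values():
        lines.append("  " * depth + f"{child.item}:{child.count}")
        lines.extend(dump(child, depth + 1))
    return lines

transactions = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
min_support = 3

# Scan 1: item frequencies.
counts = Counter(i for t in transactions for i in set(t))
flist = list("fcabmp")  # frequency-descending order as given on the slide
# Scan 2: insert each ordered transaction into the tree.
root = build_fp_tree(transactions, flist)
print("\n".join(dump(root)))
```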
Find Patterns Having P From P-conditional Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the node-links of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p’s conditional pattern base

Conditional pattern base of p: fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

For each pattern-base
– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1

m-conditional FP-tree
{}
  f:3
    c:3
      a:3

All frequent patterns relate to m:
m, fm, cm, am, fcm, fam, cam, fcam
Find Patterns Having x From x-conditional Database
Item  Conditional pattern base  Conditional FP-tree  Frequent patterns
p     fcam:2, cb:1              c:3                  cp:3
m     fca:2, fcab:1             fca:3                fcam:3 (and all its subsets)
b     fca:1, f:1, c:1           None
a     fc:3                      fc:3
c     f:3                       f:3
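The conditional pattern bases in this table can be reproduced directly from the f-list-ordered transactions. Note that this sketch takes a shortcut: it collects prefixes from the ordered transactions instead of following node-links in the FP-tree, which gives the same bases on this example. The function name is an assumption.

```python
from collections import Counter

# Transactions restricted to frequent items, in f-list order f-c-a-b-m-p.
ordered = ["fcamp", "fcabm", "fb", "cbp", "fcamp"]

def conditional_pattern_base(item, ordered_transactions):
    """Prefix paths that precede `item`, with their counts."""
    base = Counter()
    for t in ordered_transactions:
        if item in t:
            prefix = t[:t.index(item)]
            if prefix:            # an empty prefix contributes nothing
                base[prefix] += 1
    return dict(base)

for item in "pmbac":
    print(item, conditional_pattern_base(item, ordered))
```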
Benefits of the FP-tree Structure
Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction
Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never be larger than the original database (not count
node-links and the count field)

Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to the f-list
– F-list=f-c-a-b-m-p
– Patterns containing p
– Patterns having m but no p
– …
– Patterns having c but no a nor b, m, p
– Pattern f
Completeness and non-redundancy
Recursion: Mining Each Conditional FP-tree
m-conditional FP-tree: {} → f:3 → c:3 → a:3

Cond. pattern base of “am”: (fc:3)
am-conditional FP-tree: {} → f:3 → c:3

Cond. pattern base of “cm”: (f:3)
cm-conditional FP-tree: {} → f:3

Cond. pattern base of “cam”: (f:3)
cam-conditional FP-tree: {} → f:3
A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-path P
Mining can be decomposed into two parts
– Reduction of the single prefix path into one node
– Concatenation of the mining results of the two parts

[Figure: a tree with single prefix path {} → a1:n1 → a2:n2 → a3:n3 followed by a branching part (b1:m1, C1:k1, C2:k2, C3:k3) is split into the prefix path {} → a1:n1 → a2:n2 → a3:n3 plus a tree rooted at r1 that holds the branching part]
Mining Frequent Patterns With FP-trees
Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and
database partition
Method
– For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-tree
– Until the resulting FP-tree is empty, or it contains only one path; a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
Why Is FP-Growth the Winner?

Divide-and-conquer:
– decompose both the mining task and DB according to the
frequent patterns obtained so far
– leads to focused search of smaller databases
Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops: counting local freq items and building sub FP-tree, no pattern search and matching
Scaling FP-growth by Database Projection

• What if the FP-tree cannot fit in memory?
• DB projection
• First partition a database into a set of projected DBs
• Then construct and mine FP-tree for each projected DB

Tran. DB: fcamp, fcabm, fb, cbp, fcamp

p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …

am-proj DB: fc, fc, fc
cm-proj DB: f, f, f
…

Mining association rules

Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB,
A → BCD, B → ACD, C → ABD, D → ABC
• If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)
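The candidate rules for a frequent itemset can be enumerated as every non-empty proper subset paired with its complement. A minimal sketch, with `candidate_rules` as an illustrative name:

```python
from itertools import combinations

def candidate_rules(L):
    """All rules f -> L - f for non-empty proper subsets f of L."""
    L = frozenset(L)
    return [(frozenset(f), L - frozenset(f))
            for r in range(1, len(L))
            for f in combinations(sorted(L), r)]

rules = candidate_rules("ABCD")
for lhs, rhs in rules:
    print("".join(sorted(lhs)), "->", "".join(sorted(rhs)))
print(len(rules))  # 14, i.e. 2**4 - 2
```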
Rule Generation

• How to efficiently generate rules from frequent itemsets?

• In general, confidence does not have an anti-monotone property
c(ABC →D) can be larger or smaller than c(AB →D)

• But confidence of rules generated from the same itemset has an anti-
monotone property
• e.g., L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

• Confidence is anti-monotone in this case
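
The inequality follows directly from the definition of confidence and the anti-monotone property of support. All three rules come from the same itemset, so they share the numerator, and only the denominator changes:

```latex
c(ABC \to D) = \frac{\sigma(ABCD)}{\sigma(ABC)}, \qquad
c(AB \to CD) = \frac{\sigma(ABCD)}{\sigma(AB)}, \qquad
c(A \to BCD) = \frac{\sigma(ABCD)}{\sigma(A)}

\sigma(A) \ge \sigma(AB) \ge \sigma(ABC)
\;\Longrightarrow\;
c(ABC \to D) \ge c(AB \to CD) \ge c(A \to BCD)
```

Moving items from the antecedent to the consequent can therefore only keep the confidence the same or lower it.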

Rule Generation for Apriori Algorithm
Lattice of rules
ABCD=>{ }
BCD=>A ACD=>B ABD=>C ABC=>D
CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD
D=>ABC C=>ABD B=>ACD A=>BCD

[Figure: when a rule in the lattice is found to have low confidence, the rules below it whose antecedents are subsets of its antecedent are pruned]
Rule Generation with Apriori Algorithm
• Candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC
• Prune rule D=>ABC if its subset AD=>BC does not have high confidence
Effect of Support Distribution
• Many real data sets have skewed support distribution

[Figure: support distribution of a retail data set]
Effect of Support Distribution

• How to set the appropriate minsup threshold?

• If minsup is set too high, we could miss itemsets involving interesting rare
items (e.g., expensive products)
• If minsup is set too low, it is computationally expensive and the number of
itemsets is very large

• Using a single minimum support threshold may not be effective

Multiple Minimum Support

• How to apply multiple minimum supports?

• MS(i): minimum support for item i
• e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%

• Challenge: Support is no longer anti-monotone

• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%

• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent
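
The example can be checked in a few lines. The support values are the ones supposed on the slide; `min_support` and `is_frequent` are illustrative names.

```python
# Per-item minimum supports and the supposed itemset supports from the slide.
MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}
support = {
    frozenset({"Milk", "Coke"}): 0.015,
    frozenset({"Milk", "Coke", "Broccoli"}): 0.005,
}

def min_support(itemset):
    """MS(itemset) = lowest minimum support among its items."""
    return min(MS[i] for i in itemset)

def is_frequent(itemset):
    return support[frozenset(itemset)] >= min_support(itemset)

print(min_support({"Milk", "Coke"}), is_frequent({"Milk", "Coke"}))
print(min_support({"Milk", "Coke", "Broccoli"}),
      is_frequent({"Milk", "Coke", "Broccoli"}))
```

The superset is frequent although one of its subsets is not, so Apriori-style pruning of supersets is no longer safe without modification.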

Pattern Evaluation

• Association rule algorithms tend to produce too many rules
• many of them are uninteresting or redundant
• Redundant if {A,B,C} → {D} and {A,B} → {D} have same support & confidence
• Interestingness measures can be used to prune/rank the derived patterns
• In the original formulation of association rules, support & confidence are the only measures used
Computing Interestingness Measure
• Given a rule X → Y, information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y
       Y     ¬Y
X      f11   f10   f1+
¬X     f01   f00   f0+
       f+1   f+0   |T|

f11: support of X and Y
f10: support of X and ¬Y
f01: support of ¬X and Y
f00: support of ¬X and ¬Y

Used to define various measures
– support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 15/20 = 0.75

but P(Coffee) = 0.9
⇒ Although confidence is high, rule is misleading
⇒ P(Coffee|¬Tea) = 75/80 = 0.9375
Statistical Independence

• Population of 1000 students

• 600 students know how to swim (S)
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)

• P(S∧B) = 420/1000 = 0.42

• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) = P(S) × P(B) => Statistical independence

• P(S∧B) > P(S) × P(B) => Positively correlated
• P(S∧B) < P(S) × P(B) => Negatively correlated
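
The swim/bike example can be checked with exact fractions, which avoids floating-point rounding when testing for equality. The function name `relation` is illustrative.

```python
from fractions import Fraction

n = 1000
p_s = Fraction(600, n)    # P(S): knows how to swim
p_b = Fraction(700, n)    # P(B): knows how to bike
p_sb = Fraction(420, n)   # P(S and B)

def relation(p_xy, p_x, p_y):
    """Compare the joint probability with the product of the marginals."""
    if p_xy == p_x * p_y:
        return "statistically independent"
    return "positively correlated" if p_xy > p_x * p_y else "negatively correlated"

print(relation(p_sb, p_s, p_b))
```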

Statistical-based Measures

• Measures that take into account statistical dependence

$\text{Lift} = \frac{P(Y \mid X)}{P(Y)}$

$\text{Interest} = \frac{P(X, Y)}{P(X)\,P(Y)}$
Example: Lift/Interest

        Coffee  ¬Coffee
Tea     15      5        20
¬Tea    75      5        80
        90      10       100

Association Rule: Tea → Coffee

Confidence = P(Coffee|Tea) = 0.75

but P(Coffee) = 0.9
⇒ Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
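Confidence, lift and interest can all be computed from the four cells of the contingency table. The sketch below uses the Tea → Coffee counts; the function name `measures` and its argument order (f11, f10, f01, f00) are choices made here.

```python
def measures(f11, f10, f01, f00):
    """Confidence, lift and interest of X -> Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00
    p_x = (f11 + f10) / n     # P(X)
    p_y = (f11 + f01) / n     # P(Y)
    p_xy = f11 / n            # P(X, Y)
    confidence = p_xy / p_x   # P(Y | X)
    lift = confidence / p_y
    interest = p_xy / (p_x * p_y)
    return confidence, lift, interest

# Tea -> Coffee: 15 drink both, 5 tea only, 75 coffee only, 5 neither.
conf, lift, interest = measures(15, 5, 75, 5)
print(f"confidence = {conf:.2f}, lift = {lift:.4f}, interest = {interest:.4f}")
```

Lift and interest are the same quantity written two ways, so the two values agree up to floating-point rounding.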
Drawback of Lift & Interest

Table 1
      Y    ¬Y
X     10   0    10
¬X    0    90   90
      10   90   100

Lift = 0.1 / ((0.1)(0.1)) = 10

Table 2
      Y    ¬Y
X     90   0    90
¬X    0    10   10
      90   10   100

Lift = 0.9 / ((0.9)(0.9)) = 1.11

In both tables X and Y always occur together, yet lift is far higher when the pair is rare.

Statistical independence:
If P(X,Y) = P(X)P(Y) ⇒ Lift = 1
Subjective Interestingness Measure
• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., many measures of association (support, confidence,
Laplace, Gini, mutual information, Jaccard, etc).

• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user
• A pattern is subjectively interesting if it is actionable

Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)
  + Pattern expected to be frequent
  - Pattern expected to be infrequent
• Expected patterns: expected to be frequent and found to be frequent, or expected to be infrequent and found to be infrequent
• Unexpected patterns: expected to be frequent but found to be infrequent, or expected to be infrequent but found to be frequent
• Need to combine expectation of users with evidence from data (i.e., extracted patterns)
Prescribed Text Books

Author(s), Title, Edition, Publishing House

T1 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining”
Pearson Education
R2 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner by
Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers


S2-20_DSECFZC415: Data Mining
(Lecture #11 – Cluster Analysis)


Clustering Concepts
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation database
• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
• City-planning: Identifying groups of houses according to their house type, value, and geographical location
• Earthquake studies: Observed earthquake epicenters should be clustered along continental faults

August 7, 2021
What is not Cluster Analysis?

• Supervised classification
• Have class label information

• Simple segmentation
• Dividing students into different registration groups alphabetically,
by last name

• Results of a query
• Groupings are a result of an external specification


Quality: What Is Good Clustering?

• A good clustering method will produce high-quality clusters with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

Measure the Quality of Clustering

• Dissimilarity/similarity metric: similarity is expressed in terms of a
distance function, typically a metric: d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal, ratio, and vector
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
• The answer is typically highly subjective.

Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Type of data in clustering analysis

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types

Interval-valued variables (standardization)
• Z-score:
$$z = \frac{x - \mu}{\sigma}$$
• x: raw score to be standardized, μ: mean of the population, σ: standard deviation of the population
• z is the distance between the raw score and the population mean in units of
the standard deviation
• z is negative when the raw score is below the mean, positive when above
• Alternatively
• Calculate the mean absolute deviation:
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$
• Calculate the standardized measurement (z-score):
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$
• Using mean absolute deviation is more robust to outliers than using standard deviation, because the deviations are not squared
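The standardization above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `standardize` and the sample values are assumptions made for the example.

```python
def standardize(values):
    """Standardize one variable f using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                        # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    if s_f == 0:
        raise ValueError("all values are identical, so s_f is zero")
    return [(x - m_f) / s_f for x in values]


# Hypothetical raw scores: m_f = 5.0 and s_f = 2.0
scores = [2.0, 4.0, 6.0, 8.0]
z_scores = standardize(scores)
print(z_scores)  # [-1.5, -0.5, 0.5, 1.5]
```

Values below the mean map to negative scores and values above it to positive scores, as stated on the slide.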

Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or
dissimilarity between two data objects
• Some popular ones include the Minkowski distance:
$$d(i, j) = \sqrt[q]{|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q}$$
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two
p-dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance:
$$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is Euclidean distance:
$$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)
• Also, one can use weighted distance, parametric Pearson
product moment correlation, or other dissimilarity measures
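As a small sketch of the Minkowski family, the function below computes d(i, j) for a given q and recovers the Manhattan (q = 1) and Euclidean (q = 2) cases. The function name `minkowski` and the sample points are illustrative assumptions.

```python
def minkowski(i, j, q=2):
    """Minkowski distance between two p-dimensional objects i and j."""
    if len(i) != len(j):
        raise ValueError("objects must have the same number of attributes")
    if q < 1:
        raise ValueError("q must be a positive integer")
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1.0 / q)


# Hypothetical 2-dimensional objects
obj_a = (1.0, 2.0)
obj_b = (4.0, 6.0)
obj_c = (4.0, 2.0)

manhattan = minkowski(obj_a, obj_b, q=1)  # |1-4| + |2-6| = 7.0
euclidean = minkowski(obj_a, obj_b, q=2)  # sqrt(9 + 16) = 5.0
print(manhattan, euclidean)
```

The same function can be used to spot-check the listed properties, for example that d(a, b) is no larger than d(a, c) + d(c, b).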

Binary Variables
• A contingency table for binary data

                   Object j
                   1      0      sum
Object i     1     a      b      a+b
             0     c      d      c+d
             sum   a+c    b+d    p

• Distance measure for symmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c + d}$$
• Distance measure for asymmetric binary variables:
$$d(i, j) = \frac{b + c}{a + b + c}$$
• Jaccard coefficient (similarity measure for asymmetric binary variables):
$$sim_{Jaccard}(i, j) = \frac{a}{a + b + c}$$
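The contingency counts and the three measures can be illustrated as follows. This is a sketch only: the function names and the two sample objects are assumptions, and returning 0.0 when a + b + c = 0 is a convention chosen here, not something the slide specifies.

```python
def contingency(i, j):
    """Return (a, b, c, d) for two objects described by 0/1 attributes."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(i, j) if x == 0 and y == 0)
    return a, b, c, d


def symmetric_distance(i, j):
    a, b, c, d = contingency(i, j)
    return (b + c) / (a + b + c + d)


def asymmetric_distance(i, j):
    a, b, c, _ = contingency(i, j)
    total = a + b + c
    return (b + c) / total if total else 0.0  # assumed convention when no 1s occur


def jaccard_similarity(i, j):
    a, b, c, _ = contingency(i, j)
    total = a + b + c
    return a / total if total else 0.0  # assumed convention when no 1s occur


# Hypothetical objects with six binary attributes: a=2, b=0, c=1, d=3
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
counts = contingency(obj_i, obj_j)
print(counts, symmetric_distance(obj_i, obj_j),
      asymmetric_distance(obj_i, obj_j), jaccard_similarity(obj_i, obj_j))
```

Note that the asymmetric distance and the Jaccard coefficient sum to 1 whenever a + b + c > 0, since both ignore the negative matches d.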
Nominal Variables

• A generalization of the binary variable in that it can take
more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
• m: # of matches, p: total # of variables
$$d(i, j) = \frac{p - m}{p}$$
• Method 2: use a large number of binary variables
• creating a new binary variable for each of the M nominal
states
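Both methods can be sketched briefly. The function names, the attribute values and the state list below are illustrative assumptions.

```python
def simple_matching_distance(i, j):
    """Method 1: d(i, j) = (p - m) / p for objects with nominal attributes."""
    p = len(i)                                    # total number of variables
    m = sum(1 for a, b in zip(i, j) if a == b)    # number of matches
    return (p - m) / p


def to_binary_variables(value, states):
    """Method 2: one new binary variable per nominal state."""
    return [1 if value == state else 0 for state in states]


# Hypothetical objects described by three nominal variables
obj_i = ("red", "small", "round")
obj_j = ("red", "large", "round")
d_nominal = simple_matching_distance(obj_i, obj_j)  # 2 matches out of 3

colours = ["red", "yellow", "blue", "green"]
encoded = to_binary_variables("blue", colours)      # [0, 0, 1, 0]
print(d_nominal, encoded)
```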

Ordinal Variables

• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace x_if by its rank $r_{if} \in \{1, \dots, M_f\}$
• map the range of each variable onto [0, 1] by replacing the
i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$
• compute the dissimilarity using methods for
interval-scaled variables
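The rank mapping can be sketched as below. The function name and the three-state example are assumptions made for illustration.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Map ordinal values onto [0, 1] using z = (r - 1) / (M - 1)."""
    M = len(ordered_states)
    if M < 2:
        raise ValueError("need at least two ordered states")
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    return [(rank[v] - 1) / (M - 1) for v in values]


# Hypothetical ordinal variable with M_f = 3 states, lowest to highest
states = ["bronze", "silver", "gold"]
observed = ["bronze", "gold", "silver"]
z_ordinal = ordinal_to_unit_interval(observed, states)
print(z_ordinal)  # [0.0, 1.0, 0.5]
```

The resulting values can then be passed to any interval-scaled distance, such as Manhattan or Euclidean distance.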

Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
• One may use a weighted formula to combine their effects

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$
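A minimal sketch of the weighted formula follows. The slide does not define the indicator δ or the per-attribute dissimilarities, so the rules used here are assumptions in the spirit of the textbook treatment: δ is 0 when a value is missing or when both values are 0 on an asymmetric binary attribute; nominal and binary attributes contribute 0 or 1; numeric attributes contribute the absolute difference divided by the attribute's range. All names are hypothetical.

```python
def mixed_distance(i, j, kinds, ranges):
    """Combine per-attribute dissimilarities d_ij^(f) weighted by delta_ij^(f).

    kinds[f]  : "numeric", "nominal" or "asym_binary" (assumed labels)
    ranges[f] : max - min of attribute f, used only for numeric attributes
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = i[f], j[f]
        if a is None or b is None:
            continue                      # delta = 0: missing value
        if kind == "asym_binary" and a == 0 and b == 0:
            continue                      # delta = 0: negative match is ignored
        if kind == "numeric":
            d_f = abs(a - b) / ranges[f]  # normalized to [0, 1]
        else:
            d_f = 0.0 if a == b else 1.0  # nominal or binary mismatch
        numerator += d_f
        denominator += 1.0
    if denominator == 0:
        raise ValueError("no comparable attributes")
    return numerator / denominator


# Hypothetical objects: one numeric, one nominal, one asymmetric binary attribute
kinds = ["numeric", "nominal", "asym_binary"]
ranges = [10.0, None, None]
obj_i = (2.0, "red", 0)
obj_j = (7.0, "blue", 0)
d_mixed = mixed_distance(obj_i, obj_j, kinds, ranges)  # (0.5 + 1.0) / 2
print(d_mixed)  # 0.75
```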

Notion of a Cluster can be Ambiguous

[Figure: How many clusters? The same set of points grouped as two clusters, four clusters, and six clusters]


Types of Clusterings
• A clustering is a set of clusters
• An important distinction among types of clusterings:
hierarchical and partitional sets of clusters
• Partitional Clustering
• A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree


Partitional Clustering

[Figure: original points and a partitional clustering of them]


Hierarchical Clustering

[Figure: points p1–p4 shown as a traditional hierarchical clustering with its traditional dendrogram, and as a non-traditional hierarchical clustering with its non-traditional dendrogram]


Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
• In non-exclusive clustering, points may belong to multiple clusters.
• Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
• In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar characteristics
• Partial versus complete
• In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
• Clusters of widely different sizes, shapes, and densities


Types of Clusters

Clusters can be of many types:

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual
• Described by an Objective Function


Types of Clusters: Well-Separated

• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.

3 well-separated clusters


Types of Clusters: Center-Based

• Center-based
• A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster

4 center-based clusters


Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)

• A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters


Types of Clusters: Density-Based

• Density-based
• A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
• Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters


Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters

• Finds clusters that share some common property or represent a
particular concept.

2 Overlapping Circles


Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using the
given objective function. (NP Hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data
to a parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.


Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related
problem in that domain
• Proximity matrix defines a weighted graph, where the nodes are the points
being clustered, and the weighted edges represent the proximities between
points
• Clustering is equivalent to breaking the graph into connected components,
one for each cluster.
• Want to minimize the edge weight between clusters and maximize the edge
weight within clusters


Important Characteristics of the Input Data
• Type of proximity or density measure
• This is a derived measure, but central to clustering
• Sparseness
• Dictates type of similarity
• Adds to efficiency
• Attribute type
• Dictates type of similarity
• Type of Data
• Dictates type of similarity
• Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution


Partitioning Methods
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$$
• Given k, find a partition of k clusters that optimizes the chosen partitioning
criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means : Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids): Each cluster is represented
by one of the objects in the cluster

K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple


K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes
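The basic algorithm described above (pick K initial centroids, assign each point to its closest centroid, recompute each centroid as the mean of its points, repeat until nothing changes) can be sketched in plain Python. This is an illustrative sketch, not the course's implementation: the function names, the seeded random initialization, the use of squared Euclidean distance, and the choice to keep the old centroid when a cluster becomes empty are all assumptions.

```python
import random


def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)      # random initial centroids
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the closest centroid
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no point changed cluster
            break
        assignment = new_assignment
        # Update step: each centroid becomes the mean of its points
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # assumed: keep old centroid if empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of three points each
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
        (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(data, k=2)
print(centroids, labels)
```

Each iteration compares n points with K centroids over d attributes, which is where the O(n * K * I * d) cost comes from.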


Two different K-means Clusterings
[Figure: x–y scatter plots of the original points, an optimal clustering, and a sub-optimal clustering of the same points]

Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster centroid
• To get SSE, we square these errors and sum them.
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} dist^2(m_i, x)$$
• x is a data point in cluster Ci and mi is the representative point for
cluster Ci
• One can show that mi corresponds to the center (mean) of the cluster
• Given two clusterings, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor clustering
with higher K
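The SSE computation, and the last point about K, can be illustrated with a small sketch. The function name `sse` and the one-dimensional data are assumptions chosen so the arithmetic is easy to follow; each cluster's representative m_i is taken to be its mean.

```python
def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        mean = tuple(sum(col) / len(members) for col in zip(*members))
        total += sum(sum((x - m) ** 2 for x, m in zip(p, mean))
                     for p in members)
    return total


# Hypothetical 1-D points forming two natural groups
pts = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]

good_k2 = [pts[:3], pts[3:]]                    # K = 2, natural groups
poor_k3 = [[(1.0,), (12.0,)],                   # K = 3, badly mixed groups
           [(2.0,), (11.0,)],
           [(3.0,), (10.0,)]]

sse_good = sse(good_k2)   # (1 + 0 + 1) + (1 + 0 + 1) = 4.0
sse_poor = sse(poor_k3)   # 60.5 + 40.5 + 24.5 = 125.5
print(sse_good, sse_poor)
```

Here the good clustering with K = 2 has a far lower SSE than the poor clustering with K = 3, so a larger K alone does not guarantee a smaller error.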


Importance of Choosing Initial Centroids
[Figure: x–y scatter plots of the clustering at Iterations 1 to 6]


Importance of Choosing Initial Centroids

[Figure: x–y scatter plot of the clustering at Iteration 6]


Importance of Choosing Initial Centroids …
[Figure: x–y scatter plots of the clustering at Iterations 1 to 5]


Importance of Choosing Initial Centroids …
[Figure: x–y scatter plot of the clustering at Iteration 5]


Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting one
centroid from each cluster is small.
• Chance is relatively small when K is large
• If clusters are the same size, n, then
$$P = \frac{\text{number of ways to select one centroid from each cluster}}{\text{number of ways to select } K \text{ centroids}} = \frac{K!\,n^K}{(Kn)^K} = \frac{K!}{K^K}$$
• For example, if K = 10, then probability = 10!/10^10 = 0.00036
• Sometimes the initial centroids will readjust themselves in the ‘right’
way, and sometimes they don’t
• Consider an example of five pairs of clusters
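The probability quoted for K = 10 can be checked numerically. The sketch below assumes K equally sized clusters and K initial centroids drawn uniformly at random, in which case the cluster size n cancels and the probability is K!/K^K; the function name is an assumption.

```python
from math import factorial


def prob_one_centroid_per_cluster(k):
    """Probability that k random centroids land in k distinct equal-size clusters."""
    return factorial(k) / k ** k


p_10 = prob_one_centroid_per_cluster(10)
print(round(p_10, 5))  # 0.00036

# The chance shrinks quickly as K grows
for k in (2, 3, 5, 10):
    print(k, prob_one_centroid_per_cluster(k))
```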


10 Clusters Example
[Figure: x–y scatter plots of the clustering at Iterations 1 to 4]
Starting with two initial centroids in one cluster of each pair of clusters
10 Clusters Example
[Figure: x–y scatter plots of the clustering at Iterations 1 to 4]
Starting with some pairs of clusters having three initial centroids, while others have only one.


Solutions to Initial Centroids Problem

• Multiple runs
• Helps, but probability is not favorable
• Sample and use hierarchical clustering to determine
initial centroids
• Select more than k initial centroids and then select
among these initial centroids
• Select most widely separated
• Postprocessing
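One way to read "select more than k initial centroids, then select the most widely separated" is a greedy farthest-first pass over the candidates. The sketch below is one possible interpretation under that assumption, not the method prescribed by the slides; the function names and the candidate points are hypothetical.

```python
def sq_dist(p, c):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, c))


def widely_separated(candidates, k):
    """Greedily pick k candidates that are far from those already chosen."""
    chosen = [candidates[0]]               # assumed: start from the first candidate
    while len(chosen) < k:
        remaining = [c for c in candidates if c not in chosen]
        # Take the candidate whose nearest chosen centroid is farthest away
        best = max(remaining,
                   key=lambda c: min(sq_dist(c, s) for s in chosen))
        chosen.append(best)
    return chosen


# Hypothetical pool of five candidate centroids, from which k = 3 are kept
pool = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
initial = widely_separated(pool, k=3)
print(initial)  # [(0.0, 0.0), (11.0, 0.0), (5.0, 8.0)]
```

A known caveat of this kind of selection is that it tends to favour outliers, which is one reason to eliminate them during pre-processing.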


Pre-processing and Post-processing
• Pre-processing
• Normalize the data
• Eliminate outliers

• Post-processing
• Eliminate small clusters that may represent outliers
• Split ‘loose’ clusters, i.e., clusters with relatively high SSE
• Merge clusters that are ‘close’ and that have relatively low SSE


Variations of the K-Means Method
• Most variants of k-means differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means
• Handling categorical data: k-modes
• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical objects
• Using a frequency-based method to update modes of clusters


Limitations of K-means

• K-means has problems when clusters are of differing

• Sizes
• Densities
• Non-globular shapes

• K-means has problems when the data contains outliers.


Limitations of K-means: Differing Sizes

[Figure: original points and the K-means result (3 clusters)]


Limitations of K-means: Differing Density

[Figure: original points and the K-means result (3 clusters)]


Limitations of K-means: Non-globular Shapes

[Figure: original points and the K-means result (2 clusters)]


Overcoming K-means Limitations

[Figure: original points and the K-means clusters]

• One solution is to use many clusters.
• K-means then finds parts of the natural clusters, which need to be put together afterwards.


Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k)), where s is the sample size
• Comment: Often terminates at a local optimum.
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Use the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes

The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids)
• Starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA : PAM on samples
• CLARANS : Randomized re-sampling
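The swap idea behind PAM can be sketched as follows. This is a simplified variant written for illustration: it accepts the first improving swap rather than searching for the best swap in each pass, and it starts from the first k points, both of which are assumptions rather than part of the method as stated on the slide. All names are hypothetical.

```python
def total_cost(points, medoids, dist):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def pam(points, k, dist):
    medoids = list(points[:k])             # assumed naive initial medoids
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for idx in range(k):
            for candidate in points:
                if candidate in medoids:
                    continue
                # Try replacing one medoid by a non-medoid
                trial = medoids[:idx] + [candidate] + medoids[idx + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:      # keep the swap only if it helps
                    medoids, cost = trial, trial_cost
                    improved = True
    return medoids, cost


def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 1-D data with two natural groups
pts = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
medoids, cost = pam(pts, k=2, dist=manhattan)
print(sorted(medoids), cost)  # [(2.0,), (11.0,)] 4.0
```

Every trial swap re-evaluates the cost over all points, which makes the cost of this approach grow quickly with the data size and motivates sampling-based variants such as CLARA and CLARANS.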


S2-20_DSECLZC415 : Data Mining
Lecture #12 – Cluster Analysis

Data Mining
Cluster Analysis
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Intra-cluster distances are minimized
Inter-cluster distances are maximized

August 14, 2021
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns

Types of Clusterings
A clustering is a set of clusters

An important distinction among types of clusterings:
hierarchical and partitional sets of clusters

Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree


Partitional Clustering Method K-Means
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
• Compare to PAM (Partitioning Around Medoids): O(k(n-k)^2), CLARA
(Clustering LARge Applications): O(ks^2 + k(n-k)), where s is the sample size
Comment: Often terminates at a local optimum.
Weakness
– Applicable only to objects in a continuous n-dimensional space
• Use the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
– Need to specify k, the number of clusters, in advance
– Sensitive to noisy data and outliers
– Not suitable to discover clusters with non-convex shapes