0% found this document useful (0 votes)
12 views710 pages

Introduction to Data Mining Course

Uploaded by

Sridhar Eswaran
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views710 pages

Introduction to Data Mining Course

Uploaded by

Sridhar Eswaran
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

S2-20_DSECLZC415

Introduction to Data Mining


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Textbooks/Reference Books
Text Books
T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education, 2006
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han and
Micheline Kamber Morgan Kaufmann Publishers, 2011

Reference Book(s) & other resources


R1 Predictive Analytics and Data Mining: Concepts and Practice with
RapidMiner by Vijay Kotu and Bala Deshpande Morgan Kaufmann
Publishers © 2015
Additional references may be given during lectures

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Modular Structure
No Title of the Module

M1 Introduction to Data Mining


M2 Data Preprocessing
M3 Data Exploration
M4 Classification and Prediction
M5 Clustering
M6 Association Analysis
M7 Anomaly Detection
M8 Data mining on unstructured (Big) data
M9 Data Mining Applications

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Evaluation Scheme
No Name Type Weight
1. Quiz-I Online 5%
Quiz-II Online 5%
Assignment Group 10%
2. Mid-Semester Test 30%
3. Comprehensive Exam 50%

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining Defined

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What Is Data Mining?
Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown
and potentially useful) patterns or knowledge from huge amount
of data
Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge
extraction, data/pattern analysis, data archeology, data dredging,
information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


What is (not) Data Mining?
What is not Data  What is Data Mining?
Mining?
– Certain names are more
– Look up prevalent in certain US
phone number locations (O’Brien,
in phone O’Rurke, O’Reilly… in
directory Boston area)
– Query a Web – Group together similar
search engine documents returned by
for information search engine according
about to their context (e.g.
“Amazon” Amazon rainforest,
[Link],)
8

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Why Data Mining?
The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web,
computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
We are drowning in data, but starving for knowledge!

“Necessity is the mother of invention”—Data mining—


Automated analysis of massive data sets
9

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Why Data Mining
A search engine (e.g., Google) receives hundreds of millions of queries every
day. Each query can be viewed as a transaction where the user describes her
or his information need.
What novel and useful knowledge can a search engine learn from such a huge
collection of queries collected from users over time? Some patterns found in
user search queries can disclose invaluable knowledge that cannot be obtained
by reading individual data items alone.
For example, Google's Flu Trends uses specific search terms as indicators of
flu activity. It found a close relationship between the number of people who
search for flu-related information and the number of people who actually have
flu symptoms. A pattern emerges when all of the search queries related to flu
are aggregated. Using aggregated Google search data, Flu Trends can
estimate flu activity up to two weeks faster than traditional systems can.
This example shows how data mining can turn a large collection of data into
knowledge that can help meet a challenge.
10

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evolution of Database Technology
1960s:
– Data collection, database creation, IMS and network DBMS
1970s:
– Relational data model, relational DBMS implementation
1980s:
– RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
– Application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
2000s
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

11

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
Traditional Techniques
may be unsuitable due to
Statistics/ Machine Learning/
– Enormity of data AI Pattern
– High dimensionality Recognition

of data Data Mining


– Heterogeneous,
distributed nature
Database
of data systems

12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining in Business Intelligence

Increasing potential
to support
business decisions End User
Decision
Making

Data Presentation Business


Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst

Data Exploration
Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses


DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems

13

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining/KDD Process

Input Data Data Pre- Data Post-


Processing Mining Processing

Data integration Pattern discovery Pattern evaluation


Normalization Association & correlation Pattern selection
Feature selection Classification Pattern interpretation
Clustering
Dimension reduction Pattern visualization
Outlier analysis
…………
KDD – Knowledge Discovery in Databases

14

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining & Machine Learning
According to Tom M. Mitchell, Chair of Machine Learning at
Carnegie Mellon University and author of the book Machine
Learning (McGraw-Hill),
A computer program is said to learn from experience E with respect to some class
of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with the experience E.
We now have a set of objects to define machine learning:
Task (T), Experience (E), and Performance (P)
With a computer running a set of tasks, the experience should be leading to
performance increases (to satisfy the definition)

Many data mining tasks are executed successfully with help of


machine learning
15
Machine Learning: Hands-on for Developers and Technical Professionals by Jason Bell John Wiley & Sons

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Multi-Dimensional View of Data Mining
Data to be mined
– Database data (extended-relational, object-oriented, heterogeneous, legacy),
data warehouse, transactional data, stream, spatiotemporal, time-series,
sequence, text and web, multi-media, graphs & social and information networks
Knowledge to be mined (or: Data mining functions)
– Characterization, discrimination, association, classification, clustering,
trend/deviation, outlier analysis, etc.
– Descriptive vs. predictive data mining
– Multiple/integrated functions and mining at multiple levels
Techniques utilized
– Data-intensive, data warehouse (OLAP), machine learning, statistics, pattern
recognition, visualization, high-performance, etc.
Applications adapted
– Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market
analysis, text mining, Web mining, etc.
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems),
there are many other kinds of data that have diverse forms and structures
and different semantic meanings.
Examples of data can be :
time-related or sequence data (e.g., historical records, stock exchange data, and
time-series and biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously
transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or
integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures
(e.g., sequences, trees, graphs, and networks) and specific semantics (such as
ordering, image, audio and video contents, and connectivity)
17

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining Activities

18

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining Tasks
Prediction Methods
– Use some variables to predict unknown or future values of other variables.

Description Methods
– Find human-interpretable patterns that describe the data.

From [Fayyad, [Link].] Advances in Knowledge Discovery and Data Mining, 1996

Experts have more terms:

Gartner Analyst View:


[Link]

SCM Expert View:


[Link]
19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]

20

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Definition
Given a collection of records (training set )
– Each record contains a set of attributes, one of the attributes is the class.
Find a model for class attribute as a function of the values of
other attributes.
Goal: previously unseen records should be assigned a class as
accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build the
model and test set used to validate it.

21

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification Example

Tid Refund Marital Taxable Refund Marital Taxable


Status Income Cheat Status Income Cheat

1 Yes Single 125K No No Single 75K ?


2 No Married 100K No Yes Married 50K ?
3 No Single 70K No No Married 150K ?
4 Yes Married 120K No Yes Divorced 90K ?
5 No Divorced 95K Yes No Single 40K ?
6 No Married 60K No No Married 80K ? Test
10

Set
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
Training
Learn
10 No Single 90K Yes Model
10

Set Classifier
22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Application 1
Direct Marketing
– Goal: Reduce cost of mailing by targeting a set of consumers likely to
buy a new cell-phone product.
– Approach:
• Use the data for a similar product introduced before.
• We know which customers decided to buy and which decided
otherwise. This {buy, don’t buy} decision forms the class attribute.
• Collect various demographic, lifestyle, and company-interaction
related information about all such customers.
• Type of business, where they stay, how much they earn, etc.
• Use this information as input attributes to learn a classifier model.

From [Berry & Linoff] Data Mining Techniques, 1997

23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Application 2
Fraud Detection
– Goal: Predict fraudulent cases in credit card transactions.
– Approach:
• Use credit card transactions and the information on its account-
holder as attributes.
• When does a customer buy, what does he buy, how often he pays on
time, etc
• Label past transactions as fraud or fair transactions. This forms the
class attribute.
• Learn a model for the class of the transactions.
• Use this model to detect fraud by observing credit card
transactions on an account.

24

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification: Application 3
Customer Attrition/Churn:
– Goal: To predict whether a customer is likely to be lost to a competitor.
– Approach:
• Use detailed record of transactions with each of the past and
present customers, to find attributes.
• How often the customer calls, where he calls, what time-of-the day
he calls most, his financial status, marital status, etc.
• Label the customers as loyal or disloyal.
• Find a model for loyalty.

From [Berry & Linoff] Data Mining Techniques, 1997


25

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering Definition
Given a set of data points, each having a set of attributes, and
a similarity measure among them, find clusters such that
– Data points in one cluster are more similar to one another.
– Data points in separate clusters are less similar to one another.
Similarity Measures:
– Euclidean Distance if attributes are continuous.
– Other Problem-specific Measures.

26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Illustrating Clustering

Intracluster distances Intercluster distances


are minimized are maximized

Euclidean Distance Based


Clustering in 3-D space

27

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering: Application 1
Market Segmentation:
– Goal: subdivide a market into distinct subsets of customers where any
subset may conceivably be selected as a market target to be reached
with a distinct marketing mix.
– Approach:
• Collect different attributes of customers based on their
geographical and lifestyle related information.
• Find clusters of similar customers.
• Measure the clustering quality by observing buying patterns of
customers in same cluster vs. those from different clusters.

28

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering: Application 2
Document Clustering:
– Goal: To find groups of documents that are similar to each other based
on the important terms appearing in them.
– Approach: To identify frequently occurring terms in each document.
Form a similarity measure based on the frequencies of different terms.
Use it to cluster.
– Gain: Information Retrieval can utilize the clusters to relate a new
document or search term to clustered documents.

29

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Association Rule Discovery: Definition
Given a set of records each of which contain some number of
items from a given collection;
– Produce dependency rules which will predict occurrence
of an item based on occurrences of other items.
Example of Association Rules
TID Items
1 Bread, Milk {Diaper} → {Butter},
2 Bread, Diaper, Butter, Beans {Milk, Bread} → {Beans, Coke},
3 Milk, Diaper, Butter, Coke {Butter, Bread} → {Milk},
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Association Rule Discovery: Application 1
Marketing and Sales Promotion:
– Let the rule discovered be
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent => Can be used to determine what should
be done to boost its sales.
– Bagels in the antecedent => Can be used to see which products would be
affected if the store discontinues selling bagels.
– Bagels in antecedent and Potato chips in consequent => Can be used to
see what products should be sold with Bagels to promote sale of Potato
chips!

31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Association Rule Discovery: Application
Inventory Management:
– Goal: A consumer appliance repair company wants to anticipate the nature of
repairs on its consumer products and keep the service vehicles equipped with
right parts to reduce on number of visits to consumer households.
– Approach: Process the data on tools and parts required in previous repairs at
different consumer locations and discover the co-occurrence patterns.

32

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Sequential Pattern Discovery: Definition
Given is a set of objects, with each object associated with its own timeline of
events, find rules that predict strong sequential dependencies among
different events.

(A B) (C) (D E)

Rules are formed by first discovering patterns. Event occurrences in the


patterns are governed by timing constraints.

(A B) (C) (D E)
Timing constraints include maxgap
<= xg >ng <= ws (xg), mingap (ng), windowsize (ws),
maxspan (ms)

<= ms

33

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Sequential Pattern Discovery: Examples
In telecommunications alarm logs,
– (Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) --> (Fire_Alarm)

In point-of-sale transaction sequences,


– Computer Bookstore:
(Intro_To_Visual_C) (C++_Primer) -->
(Perl_for_dummies,Tcl_Tk)
– Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports_Jacket)

34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prediction/Regression
Predict a value of a given continuous valued variable based on
the values of other variables, assuming a linear or nonlinear
model of dependency.
Greatly studied in statistics, neural network fields.
Examples:
• Predicting sales amounts of new product based on advertising
expenditure.
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Time series prediction of stock market indices.

35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Deviation/Anomaly Detection
Detect significant deviations from normal behavior
Applications:
– Credit Card Fraud Detection
– Network Intrusion Detection

36

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Gartner’s Magic Quadrant

37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DM Process & Challenges

38

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DM Process
The standard data mining process involves
1. understanding the problem,
2. preparing the data (samples),
3. developing the model,
4. applying the model on a data set to see how the model may work in
real world, and
5. production deployment.
A popular data mining process frameworks is CRISP-DM (Cross
Industry Standard Process for Data Mining). This framework
was developed by a consortium of companies involved in data
mining

39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Generic Data Mining Process

40

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prior Knowledge

Data Mining tools/solutions identify hidden patterns.


– Generally we get many patterns
– Out of them many could be false or trivial.
– Filtering false patterns requires domain understanding.
Understanding how the data is collected, stored, transformed,
reported, and used is essential.

41

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Preparation
Data needs to be understood. It requires descriptive statistics such as mean, median, mode,
standard deviation, and range for each attribute
Data quality is an ongoing concern wherever data is collected, processed, and stored.
– The data cleansing practices include elimination of duplicate records, quarantining outlier records
that exceed the bounds, standardization of attribute values, substitution of missing values, etc.
– it is critical to check the data using data exploration techniques in addition to using prior knowledge
of the data and business before building models to ensure a certain degree of data quality
Missing Values
– Need to track the data lineage of the data source to find right solution
Data Types and Conversion
– The attributes in a data set can be of different types, such as continuous numeric (interest rate),
integer numeric (credit score), or categorical
– data mining algorithms impose different restrictions on what data types they accept as inputs
Transformation
– Can go beyond type conversion, may include dimensionality reduction or numerosity reduction
Outliers are anomalies in the data set
– May occur legitimately or erroneously.
Feature Selection
– Many data mining problems involve a data set with hundreds to thousands of attributes, most of
which may not be helpful. Some attributes may be correlated, e.g. sales amount and tax.
Data Sampling may be adequate in many cases
42

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Modeling & Evaluation
A model is the abstract
representation of the data
and its relationships in a
given data set.
Data mining models can
be classified into the
following categories:
classification, regression,
association analysis,
clustering, and outlier or
anomaly detection.
Each category has a few
dozen different
algorithms; each takes a
slightly different approach
to solve the problem at
hand
43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Application
The model deployment stage considerations:
– assessing model readiness, technical integration, response time, model maintenance, and
assimilation
Production Readiness
– Real-time response capabilities, and other business requirements
Technical Integration
– Use of modeling tools (e.g. RapidMiner), Use of PMML for portable and consistent format
of model description, integration with other tools
Timeliness
– The trade-offs between production responsiveness and build time need to be considered
Remodeling
– The conditions in which the model is built may change after deployment
Assimilation
– The challenge is to assimilate the knowledge gained from data mining in the organization.
For example, the objective may be finding logical clusters in the customer database so that
separate treatment can be provided to each customer cluster. 44

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
CRISP data mining framework
CRISP is the most popular
methodology for analytics,
data mining, and data
science projects, with 43%
share as per 2014
KDnuggets Poll.
CRISP-DM was conceived
in 1996. In 1997 it got
underway as a European
Union project, led by SPSS,
Teradata, Daimler AG, NCR
Corporation and OHRA.

45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DM Issues/Challenges
DM Issues/Challenges – Mining Methodology
Mining various and new kinds of knowledge
Mining knowledge in multidimensional space
Data mining—an interdisciplinary effort
Boosting the power of discovery in a networked environment
Handling uncertainty, noise, or incompleteness of data
Pattern evaluation and pattern- or constraint-guided mining

46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DM Issues/Challenges

DM Issues/Challenges – User Interaction


Interactive mining
Incorporation of background knowledge
Ad hoc data mining and data mining query languages
Presentation and visualization of data mining results

DM Issues/Challenges - Efficiency and Scalability


Efficiency and scalability of data mining algorithms
Parallel, distributed, and incremental mining algorithms
Cloud computing and cluster computing
47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DM Issues/Challenges

DM Issues/Challenges - Diversity of Database Types


Handling complex types of data

Mining dynamic, networked, and global data repositories

DM Issues/Challenges - Society
Social impacts of data mining
Privacy-preserving data mining
Invisible data mining
48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining on Diverse kinds of Data
Besides relational database data (from operational or analytical systems), there are many
other kinds of data that have diverse forms and structures and different semantic
meanings.
Examples of data can be :
time-related or sequence data (e.g., historical records, stock exchange data, and time-series and
biological sequence data),
data streams (e.g., video surveillance and sensor data, which are continuously transmitted),
spatial data (e.g., maps),
engineering design data (e.g., the design of buildings, system components, or integrated circuits),
hypertext and multimedia data (including text, image, video, and audio data),
graph and networked data (e.g., social and information networks), and
the Web (a widely distributed information repository).
Diversity of data brings in new challenges such as handling special structures (e.g.,
sequences, trees, graphs, and networks) and specific semantics (such as ordering, image,
audio and video contents, and connectivity)

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Thank You

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Source: [Link]

Data Data Lake


Database Data Mart
Warehouse
Source Single Single Multiple Multiple
Structure Structured Structured Structured Unstructured
Purpose Determined Determined Determined Undetermined
Storage Centralized Decentralized Centralized Centralized
Detailed &
Granularity Detailed Summarized All
Summary
Flexibility Low Medium Medium High
Analytics &
Primary Use Transactional Reporting Analytics
Reporting
Data Volume Low Low Medium High
Development Top-down Bottom-up Top-down All
Design Time Medium Medium High Low
Volatility Medium Low None None
Data Operations CRUD CR CRU CR
Subject Area Single Single Multiple Multiple
Multi- Multi-
Design Schema Relational No Schema
dimensional dimensional 52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECLZC415
Data Pre-processing
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Preprocessing Concepts

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Preprocessing Objectives

• To improve data quality

• To modify data to better fit specific data mining technique

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation
5

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality: Multidimensional View

• Measures for data quality: A multidimensional view


– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable the data are correct?
– Interpretability: how easily the data can be understood?

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Quality

• What kinds of data quality problems?

• How can we detect problems with the data?

• What can we do about these problems?

• Examples of data quality problems:

– Noise and outliers

– missing values

– duplicate data

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning
• Data in the Real World Is Dirty: Lots of potentially incorrect data, e.g.,
instrument faulty, human or computer error, transmission error
– incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
• e.g., Occupation = “ ” (missing data)
– noisy: containing noise, errors, or outliers
• e.g., Salary = “−10” (an error)
– inconsistent: containing discrepancies in codes or names, e.g.,
• Age = “42”, Birthday = “03/07/2010”
• Was rating “1, 2, 3”, now rating “A, B, C”
• discrepancy between duplicate records
– Intentional (e.g., disguised missing data)
• Jan. 1 as everyone’s birthday?
9

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Incomplete (Missing) Data
• Data is not always available
– E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
• Missing data may be due to
– equipment malfunction
– inconsistent with other recorded data and thus deleted
– data not entered due to misunderstanding
– certain data may not be considered important at the time of entry
– not register history or changes of the data
• Missing data may need to be inferred

10

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Handle Missing Data?
• Ignore the tuple: usually done when class label is missing (when doing
classification)—not effective when the % of missing values per attribute
varies considerably
• Fill in the missing value manually: tedious + infeasible?
• Fill in it automatically with
– a global constant : e.g., “unknown”, a new class?!
– the attribute mean
– the attribute mean for all samples belonging to the same class:
smarter
– the most probable value: inference-based such as Bayesian formula or
decision tree

11

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Noisy Data
• Noise: random error or variance in a measured variable
• Incorrect attribute values may be due to
– faulty data collection instruments
– data entry problems
– data transmission problems
– technology limitation
– inconsistency in naming convention
• Other data problems which require data cleaning
– duplicate records
– incomplete data
– inconsistent data
12

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Noise
Noise refers to modification of original values
– Examples: distortion of a person’s voice when talking on a poor phone
and “snow” on television screen

Two Sine Waves Two Sine Waves + Noise

13

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Handle Noisy Data?
• Binning (also used for discretization)
– first sort data and partition into (equal-frequency) bins
– then one can smooth by bin means, smooth by bin median, smooth by
bin boundaries, etc.
– Binning methods smooth a sorted data value by consulting its
"neighborhood," that is, the values around it, i.e. they perform local
smoothing.
• Regression
– smooth by fitting the data into regression functions
• Clustering
– detect and remove outliers
• Combined computer and human inspection
– detect suspicious values and check by human (e.g., deal with possible
outliers) 14

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Duplicate Data
• Data set may include data objects that are duplicates, or almost duplicates
of one another
– Major issue when merging data from heterogenous sources

• Examples:
– Same person with multiple email addresses

• Data cleaning
– Process of dealing with duplicate data issues

15

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outliers
Outliers are data objects with characteristics that are
considerably different than most of the other data objects in the
data set

16

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Cleaning as a Process
• Data discrepancy detection
– Use metadata (e.g., domain, range, dependency, distribution)
– Check field overloading
– Check uniqueness rule, consecutive rule and null rule
– Use commercial tools
• Data scrubbing: use simple domain knowledge (e.g., postal code, spell-
check) to detect errors and make corrections
• Data auditing: by analyzing data to discover rules and relationship to
detect violators (e.g., correlation and clustering to find outliers)
• Data migration and integration
– Data migration tools: allow transformations to be specified
– ETL (Extraction/Transformation/Loading) tools: allow users to specify
transformations through a graphical user interface
• Integration of the two processes
– Iterative and interactive (e.g., Potter’s Wheels)
17

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Preprocessing Techniques

18

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

19

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Integration
• Data integration: Combines data from multiple sources into a coherent
store
• Schema integration: e.g., [Link]-id ≡ [Link]-#
o Integrate metadata from different sources
• Entity identification problem:
o Identify real world entities from multiple data sources, e.g., Bill Clinton =
William Clinton
• Detecting and resolving data value conflicts
o For the same real world entity, attribute values from different sources are
different
o Possible reasons: different representations, different scales, e.g., metric
vs. British units

20

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Any problems with the Data?
Name Age DateOfFirstBuy Profession DateOfBirth
Bill Gates 34 15-Jan-2015 MGR Feb 24, 1981
John 38 27-Jan-2015 Mar 11, 1982
William 34 15-Jan-2015 MGR Feb 24, 1981
Gates
Kennedy 37 30-Jan-2015 DOC Nov 03,1982

21

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Any problems with the Data?
Name Age DateOfFirstBuy Profession DateOfBirth
Bill Gates 34 15-Jan-2015 MGR Feb 24, 1981
John 38 27-Jan-2015 Mar 11, 1982
William Gates 34 15-Jan-2015 MGR Feb 24, 1981
Kennedy 37 30-Jan-2015 DOC Nov 03,1982

1) Missing values in Profession column


2) Format of DateOfFirstBuy and DateOfBirth are different, needs
standardization
3) Row 1 and Row 3 are potentially duplicate data.
4) Both Age and DateOfBirth are stored. Age is derived attribute.
5) Inconsistent format for name, missing first or last names
6) Entity identification issues

22

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Handling Redundancy in Data Integration
• Redundant data occur often when integration of multiple databases
– Object identification: The same attribute or object may have different
names in different databases
– Derivable data: One attribute may be a “derived” attribute in another
table, e.g., annual revenue
• Redundant attributes may be able to be detected by correlation analysis
and covariance analysis
• Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve mining speed
and quality

23

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Correlation Analysis (Nominal Data)
• χ2 (chi-square) test
(Observed − Expected ) 2
χ =∑
2

Expected

• The larger the χ2 (chi-square) value, the more likely the variables are related
• The cells that contribute the most to the χ2 value are those whose actual count
is very different from the expected count
• Correlation does not imply causality
• # of hospitals and # of car-theft in a city are correlated
• Both are causally linked to the third variable: population

Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Chi-Square Calculation: An Example
Play chess Not play chess Sum (row)
Like science fiction 250(90) 200(360) 450
Not like science fiction 50(210) 1000(840) 1050
Sum(col.) 300 1200 1500

• Χ2 (chi-square) calculation (numbers in parenthesis are expected counts calculated


based on the data distribution in the two categories)
(250 − 90) 2 (50 − 210) 2 (200 − 360) 2 (1000 − 840) 2
χ =
2
+ + + = 507.93
90 210 360 840
• It shows that like_science_fiction and play_chess are correlated in the group

25
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Correlation Analysis (Numeric Data)
• Correlation coefficient (also called Pearson’s product moment coefficient)

∑𝑛𝑛𝑖𝑖=1(𝑎𝑎𝑖𝑖 − 𝐴𝐴)(𝑏𝑏𝑖𝑖 − 𝐵𝐵) ∑𝑛𝑛𝑖𝑖=1(𝑎𝑎𝑖𝑖 𝑏𝑏𝑖𝑖 ) − 𝑛𝑛𝐴𝐴𝐵𝐵


𝑟𝑟𝐴𝐴,𝐵𝐵 = =
𝑛𝑛𝜎𝜎𝐴𝐴 𝜎𝜎𝐵𝐵 𝑛𝑛𝜎𝜎𝐴𝐴 𝜎𝜎𝐵𝐵

where n is the number of tuples, A and B are the respective means of A


and B, σA and σB are the respective standard deviation of A and B, and
Σ(aibi) is the sum of the AB cross-product.
• If rA,B > 0, A and B are positively correlated (A’s values increase as B’s). The
higher, the stronger correlation.
• rA,B = 0: independent; rAB < 0: negatively correlated

Data Mining 26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Correlation (viewed as linear relationship)
• Correlation measures the linear relationship between objects
• To compute correlation, we standardize data objects, A and B, and then
take their dot product

a 'k = (ak − mean( A)) / std ( A)


b'k = (bk − mean( B)) / std ( B)

correlation( A, B) = A'• B'

Data Mining
27
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Covariance (Numeric Data)
• Covariance is similar to correlation

Correlation coefficient:

where n is the number of tuples, A and B are the respective mean or


expected values of A and B, σA and σB are the respective standard deviation of
A and B
• Positive covariance: If CovA,B > 0, then A and B both tend to be larger than their
expected values
• Negative covariance: If CovA,B < 0 then if A is larger than its expected value, B is
likely to be smaller than its expected value 28

• Independence: CovA,B = 0
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Co-Variance: An Example

• It can be simplified in computation as

• Suppose two stocks A and B have the following values in one week: (2, 5), (3, 8), (5,
10), (4, 11), (6, 14).

• Question: If the stocks are affected by the same industry trends, will their prices rise
or fall together?
• E(A) = (2 + 3 + 5 + 4 + 6)/ 5 = 20/5 = 4
• E(B) = (5 + 8 + 10 + 11 + 14) /5 = 48/5 = 9.6
• Cov(A,B) = (2×5+3×8+5×10+4×11+6×14)/5 − 4 × 9.6 = 4

• Thus, A and B rise together since Cov(A, B) > 0.

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Normalization
• Min-max normalization: to [new_minA, new_maxA]
v − minA
v' = (new _ maxA − new _ minA) + new _ minA
maxA − minA
• Ex. Let income range $12,000 to $98,000 normalized to [0.0, 1.0]. Then
73,600 − 12,000
$73,000 is mapped to 98,000 − 12,000
(1.0 − 0) + 0 = 0.716

• Z-score normalization (μ: mean, σ: standard deviation):


v − µA
v' =
σ A

73,600 − 54,000
• Ex. Let μ = 54,000, σ = 16,000. Then = 1.225
16,000
• Normalization by decimal scaling
v
v' = j Where j is the smallest integer such that Max(|ν’|) < 1
10
Data Mining

30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discretization
• Three types of attributes
• Nominal—values from an unordered set, e.g., color, profession
• Ordinal—values from an ordered set, e.g., military or academic rank
• Numeric—real numbers, e.g., integer or real numbers
• Discretization: Divide the range of a continuous attribute into intervals
• Interval labels can then be used to replace actual data values
• Reduce data size by discretization
• Supervised vs. unsupervised
• Split (top-down) vs. merge (bottom-up)
• Discretization can be performed recursively on an attribute
• Prepare for further analysis, e.g., classification

Data Mining 31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Discretization Methods
• Typical methods: All the methods can be applied recursively
• Binning
• Top-down split, unsupervised
• Histogram analysis
• Top-down split, unsupervised
• Clustering analysis (unsupervised, top-down split or bottom-up
merge)
• Decision-tree analysis (supervised, top-down split)
• Correlation (e.g., χ2) analysis (unsupervised, bottom-up merge)

32

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Simple Discretization: Binning
• Equal-width (distance) partitioning
• Divides the range into N intervals of equal size: uniform grid
• if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
• The most straightforward, but outliers may dominate presentation
• Skewed data is not handled well
• Equal-depth (frequency) partitioning
• Divides the range into N intervals, each containing approximately
same number of samples
• Good data scaling
• Managing categorical attributes can be tricky

Data Mining
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binning Methods for Data Smoothing

34

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Discretization by Classification & Correlation Analysis
• Classification (e.g., decision tree analysis)
• Supervised: Given class labels, e.g., cancerous vs. benign
• Using entropy to determine split point (discretization point)
• Top-down, recursive split
• Details to be covered in Chapter “Classification”

• Correlation analysis (e.g., Chi-merge: χ2-based discretization)


• Supervised: use class information
• Bottom-up merge: find the best neighboring intervals (those having similar
distributions of classes, i.e., low χ2 values) to merge
• Merge performed recursively, until a predefined stopping condition

Data Mining 35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Reduction Strategies
• Data reduction: Obtain a reduced representation of the data set that is much smaller
in volume but yet produces the same (or almost the same) analytical results
• Why data reduction? — A database/data warehouse may store terabytes of data.
Complex data analysis may take a very long time to run on the complete data set.
• Data reduction strategies
• Dimensionality reduction, e.g., remove unimportant attributes
• Wavelet transforms
• Principal Components Analysis (PCA)
• Feature subset selection, feature creation
• Numerosity reduction (some simply call it: Data Reduction)
• Regression and Log-Linear Models
• Histograms, clustering, sampling
• Data cube aggregation
• Data compression 36
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Reduction : Dimensionality Reduction
• Curse of dimensionality
• When dimensionality increases, data becomes increasingly sparse
• Density and distance between points, which is critical to clustering, outlier
analysis, becomes less meaningful
• The possible combinations of subspaces will grow exponentially
• Dimensionality reduction
• Avoid the curse of dimensionality
• Help eliminate irrelevant features and reduce noise
• Reduce time and space required in data mining
• Allow easier visualization
• Dimensionality reduction techniques
• Wavelet transforms
• Principal Component Analysis
• Supervised and nonlinear techniques (e.g., feature selection)

Data Mining 37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


u
p
g
r,b
c
lty
a
o
s
im
d
n
e
h
W

Curse of Dimensionality
m
g
rlu
,h
p
w
b
c
a
y
d
s
to
fin
e
D

• When dimensionality increases, data


becomes increasingly sparse in the
space that it occupies

• Definitions of density and distance


between points, which are critical for
clustering and outlier detection,
become less meaningful

•Randomly generate 500 points


•Compute difference between max and
min distance between any pair of points

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting in
dimensionality reduction.

x2

x1

39

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Principal Component Analysis (Steps)
• Given N data vectors from n-dimensions, find k ≤ n orthogonal vectors (principal
components) that can be best used to represent data
– Normalize input data: Each attribute falls within the same range
– Compute k orthonormal (unit) vectors, i.e., principal components
– Each input data (vector) is a linear combination of the k principal component
vectors
– The principal components are sorted in order of decreasing “significance” or
strength
– Since the components are sorted, the size of the data can be reduced by
eliminating the weak components, i.e., those with low variance (i.e., using the
strongest principal components, it is possible to reconstruct a good
approximation of the original data)
• Works for numeric data only

40

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Subset Selection
• Another way to reduce dimensionality of data
• Redundant attributes
– Duplicate much or all of the information contained in one or more
other attributes
– E.g., purchase price of a product and the amount of sales tax paid
• Irrelevant attributes
– Contain no information that is useful for the data mining task at hand
– E.g., students' ID is often irrelevant to the task of predicting students'
GPA

41

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heuristic Search in Attribute Selection
• There are 2d possible attribute combinations of d attributes
• Typical heuristic attribute selection methods:
– Best single attribute under the attribute independence assumption:
choose by significance tests
– Best step-wise feature selection:
• The best single-attribute is picked first
• Then next best attribute condition to the first, ...
– Step-wise attribute elimination:
• Repeatedly eliminate the worst attribute
– Best combined attribute selection and elimination
– Optimal branch and bound:
• Use attribute elimination and backtracking

42

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Creation (Feature Generation)
• Create new attributes (features) that can capture the important
information in a data set more effectively than the original ones
• Three general methodologies
– Attribute extraction
• Domain-specific
– Mapping data to new space (see: data reduction)
• E.g., Fourier transformation, wavelet transformation
– Attribute construction
• Combining features
• Data discretization

43

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Reduction: Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data
representation
• Parametric methods (e.g., regression)
– Assume the data fits some model, estimate model parameters, store
only the parameters, and discard the data (except possible outliers)
– Ex.: Log-linear models—obtain value at a point in m-D space as the
product on appropriate marginal subspaces
• Non-parametric methods
– Do not assume models
– Major families: histograms, clustering, sampling, …

44

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sampling

Sampling: obtaining a small sample s to represent the whole data


set N
Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
Key principle: Choose a representative subset of the data
– Simple random sampling may have very poor performance in
the presence of skew
– Develop adaptive sampling methods, e.g., stratified sampling:
Note: Sampling may not reduce database I/Os (page at a time)

45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Sampling
Simple random sampling
– There is an equal probability of selecting any particular item
Sampling without replacement
– Once an object is selected, it is removed from the population
Sampling with replacement
– A selected object is not removed from the population
Stratified sampling:
– Partition the data set, and draw samples from each partition
(proportionally, i.e., approximately the same percentage of the
data)
– Used in conjunction with skewed data

46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Sampling: With or without Replacement

Raw Data 47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

48

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECLZC415
Data Exploration
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Agenda

• Data objects and Attributes types


• Basic Statistical Descriptions of Data
• Measuring Data Similarity and Dissimilarity

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Description

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Data Sets
• Record
• Relational records
• Data matrix, e.g., numerical matrix, crosstabs
• Document data: text documents: term-frequency

timeout

season
coach

game
score
team

ball

lost
pla

wi
n
y
vector
• Transaction data
Document 1 3 0 5 0 2 6 0 2 0 2
• Graph and network
• World Wide Web Document 2 0 7 0 2 1 0 0 3 0 0

• Social or information networks Document 3 0 1 0 0 1 2 2 0 3 0


• Molecular Structures
TID Items
• Ordered
1 Bread, Coke, Milk
• Video data: sequence of images
2 Beer, Bread
• Temporal data: time-series
3 Beer, Coke, Diaper, Milk
• Sequential Data: transaction sequences 4 Beer, Bread, Diaper, Milk
• Genetic sequence data 5 Coke, Diaper, Milk
• Spatial, image and multimedia:
• Spatial data: maps
• Image data:
• Video data: Data Mining
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Important Characteristics of Structured Data
• Dimensionality
• Curse of dimensionality
• Sparsity
• Only presence counts
• Resolution
• Patterns depend on the scale
• Distribution
• Centrality and dispersion

Data Mining
6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Objects
• Data sets are made up of data objects.
• A data object represents an entity.
• Examples:
• sales database: customers, store items, sales
• medical database: patients, treatments
• university database: students, professors, courses
• Also called samples , examples, instances, data points, objects, tuples.
• Data objects are described by attributes.
• Database rows -> data objects; columns ->attributes.

Data Mining
7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes
• Attribute (or dimensions, features, variables): a data field, representing
a characteristic or feature of a data object.

• E.g., customer _ID, name, address

• Types:

• Nominal

• Binary

• Numeric: quantitative

• Interval-scaled
• Ratio-scaled

Data Mining
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Types
• Nominal: categories, states, or “names of things”
• Hair_color = {auburn, black, blond, brown, grey, red, white}
• marital status, occupation, ID numbers, zip codes
• Binary
• Nominal attribute with only 2 states (0 and 1)
• Symmetric binary: both outcomes equally important
• e.g., gender
• Asymmetric binary: outcomes not equally important.
• e.g., medical test (positive vs. negative)
• Convention: assign 1 to most important outcome (e.g., HIV positive)
• Ordinal
• Values have a meaningful order (ranking) but magnitude between successive
values is not known.
• Size = {small, medium, large}, grades, army rankings
Data Mining
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Numeric Attribute Types
• Quantity (integer or real-valued)
• Interval
• Measured on a scale of equal-sized units
• Values have order
• E.g., temperature in C˚or F˚, calendar dates
• No true zero-point
• Ratio
• Inherent zero-point
• We can speak of values as being an order of magnitude larger than the
unit of measurement (10 K˚ is twice as high as 5 K˚).
• e.g., temperature in Kelvin, length, counts, monetary quantities

Data Mining
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Discrete vs. Continuous Attributes
• Discrete Attribute
• Has only a finite or countably infinite set of values
• E.g., zip codes, profession, or the set of words in a collection of
documents
• Sometimes, represented as integer variables
• Note: Binary attributes are a special case of discrete attributes
• Continuous Attribute
• Has real numbers as attribute values
• E.g., temperature, height, or weight
• Practically, real values can only be measured and represented using a finite
number of digits
• Continuous attributes are typically represented as floating-point variables

Data Mining
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

Data Mining
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Central Tendency
• Mean (algebraic measure) (sample vs. population):
1 n µ=∑
x
x = ∑ xi
Note: n is sample size and N is population size. n n i =1
N

∑w x i i

• Weighted arithmetic mean: x= i =1


n

∑w i

Trimmed mean: chopping extreme values


i =1
• Median
interval

• Median:
• Middle value if odd number of values, or average of the middle
two values otherwise

• Mode
• Value that occurs most frequently in the data
• Unimodal, bimodal, trimodal
• Empirical formula: mean − mode = 3× (mean − median) 13

Data Mining BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Symmetric vs. Skewed Data
• Median, mean and mode of symmetric, positively and negatively skewed data

symmetric

negatively skewed
positively skewed

Data Mining
July 5, 2021 14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring the Dispersion of Data
• Quartiles, outliers and boxplots
• Quartiles: Q1 (25th percentile), Q3 (75th percentile)
• Inter-quartile range: IQR = Q3 – Q1
• Five number summary: min, Q1, median, Q3, max
• Boxplot: ends of the box are the quartiles; median is marked; add whiskers, and plot outliers
individually
• Outlier: usually, a value higher/lower than 1.5 x IQR (on both sides of box from Q1 to Q3)
• Variance and standard deviation (sample: s, population: σ)
• Variance: (algebraic, scalable computation)
1 n
1 n
1 n 1 n 2 1 n ∑ (x − µ ) ∑x
2
σ2 = 2
= − µ2
2
s = ∑
n − 1 i =1
( xi − x ) 2 = [∑ xi − (∑ xi ) 2 ]
n − 1 i =1 n i =1 N i =1
i
N i =1
i

• Standard deviation s (or σ) is the square root of variance s2 (or σ2)

Data Mining
15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Boxplot Analysis
• Five-number summary of a distribution
• Minimum, Q1, Median, Q3, Maximum
• Boxplot
• Data is represented with a box
• The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
• The median is marked by a line within the box
• Whiskers: two lines outside the box extended to Minimum and Maximum
• Outliers: points beyond a specified outlier threshold, plotted individually

Data Mining
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example
Following is an ordered list of observations of a variable. Compute 5 point
summary.
13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35,
36, 40, 45, 46, 52, 70
Solution:
Min: 13
Q1: 20
Median: 25
Q3: 35
Max: 70

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Graphic Displays of Basic Statistical Descriptions

• Boxplot: graphic display of five-number summary

• Histogram: x-axis are values, y-axis repres. frequencies

• Quantile plot: each value xi is paired with fi indicating that


approximately 100 fi % of data are ≤ xi

• Quantile-quantile (q-q) plot: graphs the quantiles of one univariant


distribution against the corresponding quantiles of another

• Scatter plot: each pair of values is a pair of coordinates and plotted as


points in the plane

Data Mining
7/5/2021
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Histogram Analysis

• Histogram: Graph display of tabulated frequencies, shown as bars


• It shows what proportion of cases fall into each of several categories
• Differs from a bar chart in that it is the area of the bar that denotes the value,
not the height as in bar charts, a crucial distinction when the categories are not
of uniform width
• The categories are usually specified as non-overlapping intervals of some
variable. The categories (bars) must be adjacent
Data Mining
7/5/2021
19
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Histograms Often Tell More than Boxplots

 The two histograms shown in


the left may have the same
boxplot representation
 The same values for: min,
Q1, median, Q3, max
 But they have rather different
data distributions

Data Mining
7/5/2021
20
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Quantile Plot
• Displays all of the data (allowing the user to assess both the overall
behavior and unusual occurrences)
• Plots quantile information
• For a data xi data sorted in increasing order, fi indicates that
approximately 100 fi% of the data are below or equal to the value xi

Data Mining

Data Mining: Concepts and Techniques 21


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Quantile-Quantile (Q-Q) Plot
• Graphs the quantiles of one univariate distribution against the corresponding
quantiles of another
• View: Is there a shift in going from one distribution to another?
• Example shows unit price of items sold at Branch 1 vs. Branch 2 for each
quantile. Unit prices of items sold at Branch 1 tend to be lower than those at
Branch 2.

Data Mining
22
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Scatter plot

• Provides a first look at bivariate data to see clusters of


points, outliers, etc
• Each pair of values is treated as a pair of coordinates and
plotted as points in the plane

Data Mining
23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Positively and Negatively Correlated Data

• The left half fragment is positively correlated

• The right half is negative correlated

Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Uncorrelated Data

Data Mining
25
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Similarity/Dissimilarity

26

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are
• Value is higher when objects are more alike
• Often falls in the range [0,1]
• Dissimilarity (e.g., distance)
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity

Data Mining
27
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Matrix and Dissimilarity Matrix
 x11 ... x1f ... x1p 
• Data matrix  
 ... ... ... ... ... 
• n data points with p dimensions x ... xif ... xip 
 i1 
• Two modes  ... ... ... ... ... 
x
 n1 ... xnf ... xnp 

• Dissimilarity matrix
• n data points, but registers only the
distance  0 
• A triangular matrix  d(2,1) 0 
 
• Single mode  d(3,1) d ( 3,2) 0 
 
 : : : 
d ( n,1) d ( n,2) ... ... 0

Data Mining
28
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Nominal Attributes

• Can take 2 or more states, e.g., red, yellow, blue, green (generalization of a binary
attribute)
• Method 1: Simple matching d (i, j) = p −
p
m
• m: # of matches, p: total # of variables

• Method 2: Use a large number of binary attributes


• creating a new binary attribute for each of the M nominal states

Data Mining
29
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Proximity Measure for Binary Attributes
Object j
• A contingency table for binary data
Object i

• Distance measure for symmetric binary


variables:

• Distance measure for asymmetric binary


variables:

• Jaccard coefficient (similarity measure for


asymmetric binary variables):

 Note: Jaccard coefficient is the same as “coherence”:

Data Mining
30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Dissimilarity between Binary Variables
• Example
Name Gender Fever Cough Test-1 Test-2 Test-3 Test-4
Jack M Y N P N N N
Mary F Y N P N P N
Jim M Y P N N N N

• Gender is a symmetric attribute


• The remaining attributes are asymmetric binary
• Let the values Y and P be 1, and the value N be 0 (to match contingency
table of prev slide)
• Following are distances based on asymmetric binary variables:
0+1
d ( jack , mary ) = = 0.33
2+ 0+1
1+1
d ( jack , jim ) = = 0.67
1+1+1
1+ 2
d ( jim , mary ) = = 0.75
1+1+ 2

Data Mining
31
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Standardizing Numeric Data
• Z-score: −µ
z = xσ
• X: raw score to be standardized, μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of the standard deviation
• negative when the raw score is below the mean, “+” when above

• An alternative way: Calculate the mean absolute deviation


s f = 1n (| x1 f − m f | + | x2 f − m f | +...+ | xnf − m f |)
where xif − m f
m f = 1n (x1 f + x2 f + ... + xnf ) zif = sf
.

• standardized measure (z-score):

• Using mean absolute deviation is more robust than using standard deviation

Data Mining
32
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Data Matrix and Dissimilarity Matrix

Data Matrix
x x point attribute1 attribute2
2 4
x1 1 2
4 x2 3 5
x3 2 0
x4 4 5

2 x
1
Dissimilarity Matrix
(with Euclidean Distance)
x1 x2 x3 x4
x x1 0
3
0 2 4 x2 3.61 0
x3 2.24 5.1 0
x4 4.24 1 5.39 0

Data Mining
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance on Numeric Data: Minkowski Distance
• Minkowski distance: A popular distance measure

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data
objects, and h is the order (the distance so defined is also called L-h norm)
• Properties
• d(i, j) > 0 if i ≠ j, and d(i, i) = 0 (Positive definiteness)
• d(i, j) = d(j, i) (Symmetry)
• d(i, j) ≤ d(i, k) + d(k, j) (Triangle Inequality)
• A distance that satisfies these properties is a metric

Data Mining
34
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Special Cases of Minkowski Distance
• h = 1: Manhattan (city block, L1 norm) distance
• E.g., the Hamming distance: the number of bits that are different between two binary vectors

d (i, j) =| x − x | + | x − x | +...+ | x − x |
i1 j1 i2 j 2 ip jp
• h = 2: (L2 norm) Euclidean distance

d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j 2 ip jp
• h → ∞. “supremum” (Lmax norm, L∞ norm) distance.
• This is the maximum difference between any component (attribute) of the vectors

Data Mining
35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Minkowski Distance
point attribute 1 attribute 2 (Dissimilarity Matrices)
x1 1 2
L1 x1 x2 x3 x4
x2 3 5
Manhattan (L1) x1 0
x3 2 0
x2 5 0
x4 4 5
x3 3 6 0
x4 6 1 7 0

x x Euclidean (L2) L2 x1 x2 x3 x4
2 4 x1 0
x2 3.61 0
4
x3 2.24 5.1 0
x4 4.24 1 5.39 0

Supremum L∞ x1 x2 x3 x4
2 x x1 0
1
x2 3 0
x3 2 5 0
x4 3 1 5 0
x
3
0 2 4

Data Mining
36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace xif by their rank rif ∈{1,..., M f }
• map the range of each variable onto [0, 1] by replacing i-th object in the f-th
variable by r −1
zif = if
M f −1

• compute the dissimilarity using methods for interval-scaled variables

Data Mining
37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary, numeric, ordinal
• One may use a weighted formula to combine their effects

Σ pf = 1δ ij( f ) dij( f )
d (i, j) =
Σ pf = 1δ ij( f )

• f is binary or nominal:
dij(f) = 0 if xif = xjf , or dij(f) = 1 otherwise
• f is numeric: use the normalized distance
• f is ordinal r −1
• Compute ranks rif and z =
if
if M f −1
• Treat zif as interval-scaled

Data Mining
38
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example
Based on the information given in the table below, find most similar and
most dissimilar persons among them. Apply min-max normalization on
income to obtain [0,1] range. Consider profession and mother tongue as
nominal. Consider native place as ordinal variable with ranking order of
[Village, Small Town, Suburban, Metropolitan]. Give equal weight to each
attribute.

Name Income Profession Mother Native Place


tongue
Ram 70000 Doctor Bengali Village
Balram 50000 Data Scientist Hindi Small Town
Bharat 60000 Carpenter Hindi Suburban
Kishan 80000 Doctor Bhojpuri Metropolitan

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Solution
After normalizing income and quantifying native place, we get
Name Income Profession Mother tongue Native Place
Ram 0.67 Doctor Bengali 1
Balram 0 Data Scientist Hindi 2
Bharat 0.33 Carpenter Hindi 3
Kishan 1 Doctor Bhojpuri 4

d(Ram, Balram) = 0.67+1+1+(2-1)/(4-1)=3 d(Ram, Bharat) = 0.33+1+1+(3-1)/(4-1)=3


d(Ram, Kishan) = 0.33+0+1+(4-1)/(4-1) = 2.33 d(Balram, Bharat) = 0.33+1+0+(3-2)/(4-1)=1.67
d(Balram, Kishan) = 1+1+1+(4-2)/(4-1) = 3.67 d(Bharat, Kishan) = 0.67+1+1+(4-3)/(4-1) = 3

Most similar – Balram and Bharat; Most dissimilar – Balram and Kishan

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Cosine Similarity
• A document can be represented by thousands of attributes, each recording the
frequency of a particular word (such as keywords) or phrase in the document.

• Other vector objects: gene features in micro-arrays, …


• Applications: information retrieval, biologic taxonomy, gene feature mapping, ...
• Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then
cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,
where • indicates vector dot product, ||d||: the length of vector d

Data Mining
41
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Cosine Similarity

• cos(d1, d2) = (d1 • d2) /||d1|| ||d2|| ,


where • indicates vector dot product, ||d|: the length of vector d

• Ex: Find the similarity between documents 1 and 2.

d1 = (5, 0, 3, 0, 2, 0, 0, 2, 0, 0)
d2 = (3, 0, 2, 0, 1, 1, 0, 1, 0, 1)

d1•d2 = 5*3+0*0+3*2+0*0+2*1+0*1+0*1+2*1+0*0+0*1 = 25
||d1||= (5*5+0*0+3*3+0*0+2*2+0*0+0*0+2*2+0*0+0*0)0.5=(42)0.5 = 6.481
||d2||= (3*3+0*0+2*2+0*0+1*1+1*1+0*0+1*1+0*0+1*1)0.5=(17)0.5 = 4.12
cos(d1, d2 ) = 0.94

Data Mining
42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECFZC415
Classification and Prediction
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
• Classification involves dividing up objects so that each is assigned to one of a number of
mutually exhaustive and exclusive categories known as classes
• Many practical decision-making tasks can be formulated as classification problems
‒ customers who are likely to buy or not buy a particular product in a supermarket
‒ people who are at high, medium or low risk of acquiring a certain illness
‒ student projects worthy of a distinction, merit, pass or fail grade
‒ objects on a radar display which correspond to vehicles, people, buildings or trees
‒ people who closely resemble, slightly resemble or do not resemble someone seen
committing a crime
‒ houses that are likely to rise in value, fall in value or have an unchanged value in 12
months' time
‒ people who are at high, medium or low risk of a car accident in the next 12 months
‒ people who are likely to vote for each of a number of political parties (or none)
‒ the likelihood of rain the next day for a weather forecast (very likely, likely, unlikely, very
unlikely).

Data Mining
7/5/2021 4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification vs. Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the
values (class labels) in a classifying attribute and uses it in classifying
new data

• Prediction
• models continuous-valued functions, i.e., predicts unknown or
missing values

Data Mining
7/5/2021
5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations, measurements, etc.) are
accompanied by labels indicating the class of the observations
• New data is classified based on the training set

• Unsupervised learning (clustering)


• The class labels of training data is unknown
• Given a set of measurements, observations, etc. with the aim of
establishing the existence of classes or clusters in the data
People also talk about more forms of machine learning
[Link]
[Link]

Data Mining
7/5/2021 6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as determined
by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae

• Model usage: for classifying future or unknown objects


• Estimate accuracy of the model
• The known label of test sample is compared with the classified result
from the model
• Accuracy rate is the percentage of test set samples that are correctly
classified by the model
• Test set is independent of training set, otherwise over-fitting will occur
• If the accuracy is acceptable, use the model to classify data tuples whose
class labels are not known

Data Mining
7/5/2021 7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Classification Task

Tid Attrib1 Attrib2 Attrib3 Class Learning


1 Yes Large 125K No
algorithm
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set
Apply
Tid Attrib1 Attrib2 Attrib3 Class Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ? Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set

Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
- computational networks that simulate the decision process in neurons
(networks of nerve cell)
• Naïve Bayes and Bayesian Belief Networks
- uses the probability theory to find the most likely of the possible
classifications
• Support Vector Machines
- fits a boundary to a region of points that are all alike; uses the boundary to
classify a new point

Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lazy vs. Eager Learning
• Lazy vs. eager learning
• Lazy learning (e.g., instance-based learning): Simply stores training data (or
only minor processing) and waits until it is given a test tuple
• Eager learning (the above discussed methods): Given a set of training set,
constructs a classification model before receiving new (e.g., test) data to
classify

• Lazy: less time in training but more time in predicting


• Accuracy
• Lazy method effectively uses a richer hypothesis space since it uses many
local linear functions to form its implicit global approximation to the target
function
• Eager: must commit to a single hypothesis that covers the entire instance
space

Data Mining
7/5/2021 10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Lazy Learner: Instance-Based Methods
• Instance-based learning:
• Store training examples and delay the processing (“lazy evaluation”)
until a new instance must be classified

• Typical approaches
• k-nearest neighbor approach
• Instances represented as points in a Euclidean space.
• Locally weighted regression
• Constructs local approximation
• Case-based reasoning
• Uses symbolic representations and knowledge-based inference

Data Mining
7/5/2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example of a Decision Tree

Splitting Attributes
Tid Refund Marital Taxable
Status Income Cheat

1 Yes Single 125K No Refund


2 No Married 100K No Yes No
3 No Single 70K No
NO MarSt
4 Yes Married 120K No
Single, Divorced Married
5 No Divorced 95K Yes
6 No Married 60K No TaxInc NO
7 Yes Divorced 220K No < 80K > 80K
8 No Single 85K Yes
9 No Married 75K No
NO YES
10 No Single 90K Yes
10

Training Data Model: Decision Tree

Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Another Example of Decision Tree

MarSt Single,
Married Divorced
Tid Refund Marital Taxable
Status Income Cheat NO Refund
Yes No
1 Yes Single 125K No
2 No Married 100K No NO TaxInc
3 No Single 70K No < 80K > 80K
4 Yes Married 120K No
NO YES
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No There could be more than one tree that fits
8 No Single 85K Yes the same data!
9 No Married 75K No
10 No Single 90K Yes
10

Data Mining
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apply Model to Test Data

Test Data
Refund Marital Taxable
Status Income Cheat

No Married 80K ?
Refund 10

Yes No

NO MarSt
Single, Divorced Married Assign Cheat to “No”

TaxInc NO
< 80K > 80K

NO YES

Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Classification Task

Tid Attrib1 Attrib2 Attrib3 Class


Tree
1 Yes Large 125K No Induction
2 No Medium 100K No algorithm
3 No Small 70K No
4 Yes Medium 120K No
Induction
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No Learn
8 No Small 85K Yes Model
9 No Medium 75K No
10 No Small 90K Yes
Model
10

Training Set (Decision Tree)


Apply
Tid Attrib1 Attrib2 Attrib3 Class
Model
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
Deduction
14 No Small 95K ?
15 No Large 67K ?
10

Test Set

Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Issues: Evaluating Classification Methods
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree size or
compactness of classification rules

Data Mining
7/5/2021 16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting and Overfitting (Example)
500 circular and 500
triangular data
points.

Circular points:
X2 

0.5 ≤ sqrt(x12+x22) ≤ 1

Triangular points:
sqrt(x12+x22) > 0.5 or
sqrt(x12+x22) < 1
X1 

Data Mining
17
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting and Overfitting

Overfitting

Underfitting: when model is too simple, both training and test errors are large
Data Mining
18
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Overfitting due to Noise

Decision boundary is distorted by noise point

Data Mining
19
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Based Classification
• Decision trees are intuitive and frequently used data mining technique
for Classification
• For an analyst, they are easy to set up and for a business user they are
easy to interpret.
• A decision tree model is a decision flowchart where an attribute is
tested in each node and ends in a leaf node where a prediction is made.
• There are many algorithms for decision tree induction such as Hunt’s
Algorithm, CART, ID3, C4.5, SLIQ,SPRINT

Data Mining
20
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hunt’s Algorithm - Structure
• Hunt's algorithm is among the earliest. More
complex algorithms were built upon it. Tid Refund Marital
Status
Taxable
Income Cheat
• It grows a decision tree in a recursive fashion 1 Yes Single 125K No

by partitioning the training records into 2 No Married 100K No


3 No Single 70K No
successively purer subsets 4 Yes Married 120K No

• Let Dt be the set of training records that reach 5


6
No
No
Divorced 95K
Married 60K
Yes
No
a node t 7 Yes Divorced 220K No

• General Procedure: 8
9
No
No
Single
Married
85K
75K
Yes
No

• If Dt contains records that belong the same 10 No Single 90K Yes

class yt, then t is a leaf node labeled as yt


10

Dt
• If Dt is an empty set, then t is a leaf node
labeled by the default class, yd ?
• If Dt contains records that belong to more
than one class, use an attribute test to split
the data into smaller subsets. Recursively
apply the procedure to each subset.
Data Mining
21
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hunt’s Algorithm - Example

Don’t Refund
Cheat Yes No Tid Refund Marital Taxable
Status Income Chea
Don’t Don’t
Cheat Cheat 1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
Refund
Refund
4 Yes Married 120K No
Yes No
Yes No 5 No Divorced 95K Yes
Don’t Marital
Don’t 6 No Married 60K No
Marital Cheat Status
Cheat Status Single, 7 Yes Divorced 220K No
Single, Married
Married Divorced 8 No Single 85K Yes
Divorced Don’t
Taxable 9 No Married 75K No
Cheat Don’t Cheat
Income
Cheat 10 No Single 90K Yes
< 80K >= 80K 10

Don’t Cheat
Cheat

Data Mining
22
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Tree Induction
• Greedy strategy.
• Split the records based on an attribute test that optimizes certain
criterion.

• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting

Data Mining
23
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Specify Test Condition?
• Depends on attribute types
• Nominal
• Ordinal
• Continuous

• Depends on number of ways to split


• 2-way split
• Multi-way split

Data Mining
24
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Nominal Attributes

• Multi-way split: Use as many partitions as distinct values.

CarType
Family Luxury
Sports

• Binary split: Divides values into two subsets.


Need to find optimal partitioning.

CarType CarType
{Sports,
Luxury} {Family} OR {Family,
Luxury} {Sports}

Data Mining
25
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Ordinal Attributes
• Multi-way split: Use as many partitions as distinct values.
Size
Small Large
Medium

• Binary split: Divides values into two subsets.


Need to find optimal partitioning.

{Small, Size Size


Medium {Large}
OR {Medium,
{Small}
Large}
}

Size
• What about this split? {Small,
Large} {Medium}

Data Mining
26
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes
• Different ways of handling
• Discretization to form an ordinal categorical attribute
• Static – discretize once at the beginning
• Dynamic – ranges can be found by equal interval bucketing, equal
frequency bucketing
(percentiles), or clustering.

• Binary Decision: (A < v) or (A ≥ v)


• consider all possible splits and finds the best cut
• can be more compute intensive

Data Mining
27
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on Continuous Attributes

Data Mining
28
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to determine the Best Split

Before Splitting: 10 records of class 0,


10 records of class 1

Own Car Student


Car? Type? ID?

Yes No Family Luxury c1 c20


c10 c11
Sports
C0: 6
C1: 4
C0: 4
C1: 6
C0: 1
C1: 3
C0: 8
C1: 0
C0: 1
C1: 7
C0: 1
C1: 0
... C0: 1
C1: 0
C0: 0
C1: 1
... C0: 0
C1: 1

Which test condition is the best?

Data Mining
29
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to determine the Best Split
• Greedy approach:
• Nodes with homogeneous class distribution are preferred
• Need a measure of node impurity:

C0: 5 C0: 9
C1: 5 C1: 1

Non-homogeneous, Homogeneous,
High degree of impurity Low degree of impurity

Data Mining
30
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measures of Node Impurity
• Gini Index

• Entropy

• Misclassification error

Data Mining
31
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Find the Best Split
C0 N00
Before Splitting: M0
C1 N01

A? B?
Yes No Yes No

Node N1 Node N2 Node N3 Node N4

C0 N10 C0 N20 C0 N30 C0 N40


C1 N11 C1 N21 C1 N31 C1 N41

M1 M2 M3 M4

M12 Gain = M0 – M12 vs M0 – M34 M34

Data Mining
32
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measure of Impurity: GINI

• Gini Index for a given node t :


GINI (t ) = 1 − ∑ [ p ( j | t )]2
j

(NOTE: p( j | t) is the relative frequency of class j at node t).

• Maximum (1 - 1/nc) when records are equally distributed among all classes,
implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information

C1 0 C1 1 C1 2 C1 3
C2 6 C2 5 C2 4 C2 3
Gini=0.000 Gini=0.278 Gini=0.444 Gini=0.500

Data Mining
33
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for computing GINI
GINI (t ) = 1 − ∑ [ p ( j | t )]2
j

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6
Gini = 1 – P(C1)2 – P(C2)2 = 1 – 0 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5
Gini = 1 – (1/6)2 – (5/6)2 = 0.278

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4
Gini = 1 – (2/6)2 – (4/6)2 = 0.444

Data Mining
34
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Based on GINI
• Used in CART, SLIQ, SPRINT.
• When a node p is split into k partitions (children), the quality of split is
computed as,
k
ni
GINI split = ∑ GINI (i )
i =1 n

where, ni = number of records at child i,


n = number of records at node p.

Data Mining
35
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binary Attributes: Computing GINI Index
 Splits into two partitions
 Effect of Weighing partitions:
– Larger and Purer Partitions are sought for.

Parent
B? C1 6
Yes No C2 6
Gini = 0.500
Node N1 Node N2
Gini(N1)
= 1 – (5/7)2 – (2/7)2 N1 N2 Gini(Children)
= 0.408 C1 5 1 = 7/12 * 0.408 +
Gini(N2) C2 2 4 5/12 * 0.32
= 1 – (1/5)2 – (4/5)2 Gini=0.37 = 0.37
= 0.32

Data Mining
36
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Categorical Attributes: Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split Two-way split


(find best partition of values)

CarType CarType CarType


Family Sports Luxury {Sports, {Family,
{Family} {Sports}
Luxury} Luxury}
C1 1 2 1 C1 C1
3 1 2 2
C2 4 1 1 C2 2 4 C2 1 5
Gini 0.393 Gini 0.400 Gini 0.419

Data Mining
37
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Continuous Attributes: Computing Gini Index
• Use Binary Decisions based on one Tid Refund Marital Taxable
value Status Income Cheat

• Several Choices for the splitting value 1 Yes Single 125K No


• Number of possible splitting values 2 No Married 100K No
= Number of distinct values
3 No Single 70K No
• Each splitting value has a count matrix 4 Yes Married 120K No
associated with it
5 No Divorced 95K Yes
• Class counts in each of the Taxable
Income
partitions, A < v and A ≥ v 6 No Married 60K No
> 80K?
• Simple method to choose best v 7 Yes Divorced 220K No
• For each v, scan the database to 8 No Single 85K Yes
Yes No

gather count matrix and compute 9 No Married 75K No


its Gini index
10 No Single 90K Yes
• Computationally Inefficient! 10

Repetition of work.

Data Mining
38
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Continuous Attributes: Computing Gini Index...
• For efficient computation: for each attribute,
• Sort the attribute on values
• Linearly scan these values, each time updating the count matrix and
computing gini index
• Choose the split position that has the least gini index

Cheat No No No Yes Yes Yes No No No No


Taxable Income
60 70 75 85 90 95 100 120 125 220
Sorted Values
55 65 72 80 87 92 97 110 122 172 230
Split Positions <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= >
Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0

No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0

Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

Data Mining
39
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Alternative Splitting Criteria
• Entropy at a given node t:
Entropy (t ) = − ∑ p ( j | t ) log p ( j | t )
j

(NOTE: p( j | t) is the relative frequency of class j at node t).


• Measures homogeneity of a node.
• Maximum (log nc) when records are equally distributed among all
classes implying least information
• Minimum (0.0) when all records belong to one class, implying
most information
• Entropy based computations are similar to the GINI index
computations

Data Mining
40
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for computing Entropy

Entropy (t ) = − ∑ p ( j | t ) log p ( j | t )
j 2

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Entropy = – 0 log 0 – 1 log 1 = – 0 – 0 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Entropy = – (1/6) log2 (1/6) – (5/6) log2 (1/6) = 0.65

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92

Data Mining
41
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Splitting Criteria based on Classification Error
• Classification error at a node t :

Error (t ) = 1 − max P (i | t )
i

• Measures misclassification error made by a node.


• Maximum (1 - 1/nc) when records are equally distributed among all
classes, implying least interesting information
• Minimum (0.0) when all records belong to one class, implying most
interesting information

Data Mining
42
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Examples for Computing Error
Error (t ) = 1 − max P (i | t ) i

C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1


C2 6 Error = 1 – max (0, 1) = 1 – 1 = 0

C1 1 P(C1) = 1/6 P(C2) = 5/6


C2 5 Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1 2 P(C1) = 2/6 P(C2) = 4/6


C2 4 Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3

Data Mining
43
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Comparison among Splitting Criteria
For a 2-class problem:

Data Mining
44
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gain Ratio

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain


 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
 Expected information (entropy) needed to classify a tuple in D:
m
Info( D) = −∑ pi log 2 ( pi )
i =1
 Information needed (after using A to split D into v partitions) to
classify D: v | D |
Info A ( D ) = ∑
j
× Info( D j )
j =1 | D |
 Information gained by branching on attribute A

Gain(A) = Info(D) − Info A(D)


Data Mining
46
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Selection: Information Gain
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
• Class P: buys_computer = “yes” 31…40 high yes fair yes
>40 medium no excellent no
• Class N: buys_computer = “no”

9 9 5 5
Info( D ) = I (9,5) = − log 2 ( ) − log 2 ( ) =0.940
14 14 14 14

Data Mining
47
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attribute Selection: Information Gain

𝐼𝐼𝐼𝐼𝐼𝐼𝑜𝑜𝑎𝑎𝑎𝑎𝑎𝑎 𝐷𝐷 =
5
𝐼𝐼 2,3 +
4
𝐼𝐼 4,0 +
5
𝐼𝐼(3,2) age pi ni I(pi, ni)
14 14 14 <=30 2 3 0.971
= 0.694
31…40 4 0 0
>40 3 2 0.971
5 means “age <=30” has 5 out of 14 samples, with 2
I (2,3)
14 yes’es and 3 no’s. Hence

Gain(age) = Info( D) − Infoage ( D) = 0.246

Similarly,

Gain(income) = 0.029
Gain( student ) = 0.151
Gain(credit _ rating ) = 0.048
Data Mining
48
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Gain Ratio for Attribute Selection (C4.5)
• Information gain measure is biased towards attributes with a large
number of values
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem
(normalization to information gain) SplitInfo ( D) = −∑
A
|D | v
× log (
|D | j
) 2
j

|D| |D|
• GainRatio(A) = Gain(A)/SplitInfo(A) j =1

• Ex.

• gain_ratio(income) = 0.029/1.557 = 0.019

• The attribute with the maximum gain ratio is selected as the splitting
attribute
Data Mining
49
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Refining Decision Tree Model

Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Stopping Criteria for Tree Induction
• Stop expanding a node when all the records belong to the same class

• Stop expanding a node when all the records have similar attribute values

• Early termination

Data Mining
51
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Decision Tree Based Classification
• Advantages:
• Inexpensive to construct
• Extremely fast at classifying unknown records
• Easy to interpret for small-sized trees
• Accuracy is comparable to other classification techniques for many
simple data sets

Data Mining
52
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Practical Issues of Classification
• Underfitting and Overfitting

• Missing Values

• Costs of Classification

Data Mining
53
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Underfitting vs. Overfitting
• Underfitting results in decision trees that are too simple to solve the
problem. They may offer superior interpretability.

• Overfitting results in decision trees that are more complex than


necessary
• Training error no longer provides a good estimate of how well the
tree will perform on previously unseen records
• Need new ways for estimating errors

Data Mining
54
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Overfitting

Underfitting: when model is too simple, both training and test errors are large
Overfitting: when model is too complex, training error is small but test error is
large
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Address Overfitting
• Pre-Pruning (Early Stopping Rule)
• Stop the algorithm before it becomes a fully-grown tree
• General stopping conditions for a node:
• Stop if all instances belong to the same class
• Stop if all the attribute values are the same
• More restrictive conditions (for pre-pruning) :
• Stop if number of instances is less than some user-specified
threshold
• Stop if class distribution of instances are independent of the
available features (e.g., using χ 2 test)
• Stop if expanding the current node does not improve impurity
measures (e.g., Gini or information gain).

Data Mining
56
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Address Overfitting…
• Post-pruning
• Grow decision tree to its entirety
• Trim the nodes of the decision tree in a bottom-up fashion
• If generalization error(i.e. expected error of the model on previously
unseen records) improves after trimming, replace sub-tree by a leaf
node.
• Class label of leaf node is determined from majority class of instances
in the sub-tree

Data Mining
57
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R2 Principles of Data Mining, Second Edition by Max Bramer Springer © 2013

R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

Data Mining
58
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

59

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECFZC415
Classification and Prediction
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classification

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-based Classification

Data Mining
Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Neural Networks
• computational networks that simulate the decision process in
neurons (networks of nerve cell)
• Naïve Bayes and Bayesian Belief Networks
• uses the probability theory to find the most likely of the possible
classifications
• Support Vector Machines
• fits a boundary to a region of points that are all alike; uses the
boundary to classify a new point

Data Mining
5
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-Based Classifier
• Classify records by using a collection of “if…then…” rules

• Rule: (Condition) → y where


• Condition is a conjunctions of attributes
• y is the class label
• LHS: rule antecedent or condition
• RHS: rule consequent
• Examples of classification rules:
• (Blood Type=Warm) ∧ (Lay Eggs=Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund=Yes) → Evade=No

Data Mining
6
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Data Mining
7
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the instance satisfy the
condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?

The rule R1 covers a hawk => Bird


The rule R3 covers the grizzly bear => Mammal

Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Coverage and Accuracy
Tid Refund Marital Taxable
• Coverage of a rule: Status Income Class
• Fraction of records that satisfy
1 Yes Single 125K No
the antecedent of a rule
2 No Married 100K No
• Accuracy of a rule:
3 No Single 70K No
• Fraction of records that satisfy
both the antecedent and 4 Yes Married 120K No
consequent of a rule 5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
(Status=Single) → No 10
10 No Single 90K Yes

Coverage = 40%, Accuracy = 50%

Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How does Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?

A lemur triggers rule R3, so it is classified as a mammal


A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules

Data Mining
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Characteristics of Rule-Based Classifier
• Mutually exclusive rules
• Classifier contains mutually exclusive rules if the rules are
independent of each other
• Every record is covered by at most one rule

• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule

Data Mining
11
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
From Decision Trees To Rules
Refund Classification Rules
Yes No
(Refund=Yes) ==> No
NO Marital
(Refund=No, Marital Status={Single,Divorced},
{Single, Status
{Married} Taxable Income<80K) ==> No
Divorced}
(Refund=No, Marital Status={Single,Divorced},
Taxable NO Taxable Income>80K) ==> Yes
Income
(Refund=No, Marital Status={Married}) ==> No
< 80K > 80K

NO YES

Rules are mutually exclusive and exhaustive


Rule set contains as much information as the tree

Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rules Can Be Simplified
Tid Refund Marital Taxable
Refund Status Income Cheat
Yes No
1 Yes Single 125K No
NO Marital 2 No Married 100K No
{Single, Status
{Married} 3 No Single 70K No
Divorced}
4 Yes Married 120K No
Taxable NO 5 No Divorced 95K Yes
Income
6 No Married 60K No
< 80K > 80K
7 Yes Divorced 220K No
NO YES 8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10

Initial Rule: (Refund=No) ∧ (Status=Married) → No


Simplified Rule: (Status=Married) → No
Data Mining
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Further Characterizing Rules

• Rules are not mutually exclusive


• A record may trigger more than one rule
• Solution?
• Ordered rule set
• Unordered rule set – use voting schemes

• Rules are not exhaustive


• A record may not trigger any rules
• Solution?
• Use a default class

Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordered Rule Set
• Rules are rank ordered according to their priority
• An ordered rule set is known as a decision list
• When a test record is presented to the classifier
• It is assigned to the class label of the highest ranked rule it has
triggered
• If none of the rules fired, it is assigned to the default class

R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds


R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians

Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?

Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Ordering Schemes
• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together

Rule-based Ordering Class-based Ordering


(Refund=Yes) ==> No (Refund=Yes) ==> No

(Refund=No, Marital Status={Single,Divorced}, (Refund=No, Marital Status={Single,Divorced},


Taxable Income<80K) ==> No Taxable Income<80K) ==> No

(Refund=No, Marital Status={Single,Divorced}, (Refund=No, Marital Status={Married}) ==> No


Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Single,Divorced},
(Refund=No, Marital Status={Married}) ==> No Taxable Income>80K) ==> Yes

Data Mining
16
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Evaluate Learnt Rule?
• Start with the most general rule possible: condition = empty
• Adding new attributes by adopting a greedy depth-first strategy
• Picks the one that most improves the rule quality
• Rule-Quality measures: consider both coverage and accuracy
• Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
• favors rules that have high accuracy and cover many positive tuples

pos ' pos


FOIL _ Gain = pos '×(log 2 − log 2 )
pos '+ neg ' pos + neg

• Rule pruning based on an independent set of test tuples


• Pos/neg are # of positive/negative tuples covered by R.
• If FOIL_Prune is higher for the pruned version of R, prune R
pos − neg
FOIL _ Prune( R) =
pos + neg

Data Mining
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Evaluate Learnt Rule?

• We can use Likelihood Ratio Statistic, which confirms that effect of rule is
not attributed to chance, but represents correlation between attribute
value and classes.
𝑓𝑓
Likelihood_Ratio = 2 ∗ ∑𝑚𝑚 𝑓𝑓
𝑗𝑗=1 𝑖𝑖 log 2 ( 𝑖𝑖
)
𝑒𝑒𝑖𝑖
• m is the number of classes
• For tuples satisfying the rule, fi is the observed frequency of each
class among tuples, ei is the expected frequency if the rule made
random predictions
• Higher the Likelihood Ratio, better the rule is.
• Used by CN2

Data Mining
18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Building Classification Rules

• Direct Method:
• Extract rules directly from data
• e.g.: RIPPER, CN2, Holte’s 1R

• Indirect Method:
• Extract rules from other classification models (e.g.
decision trees, neural networks, etc).
• e.g: C4.5rules

Data Mining
19
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Direct Method: Sequential Covering

• Sequential covering algorithm: Extracts rules directly from


training data
• Typical sequential covering algorithms: FOIL, AQ, CN2,
RIPPER
• Rules are learned sequentially, each for a given class Ci will
cover many tuples of Ci but none (or few) of the tuples of
other classes

Data Mining
20
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Induction: Sequential Covering Method
• Start with an empty rule set
• Steps:
• Rules are learned one at a time
• Each time a rule is learned, the tuples covered by the
rules are removed
• Repeat the process(above steps) on the remaining tuples
• until termination condition, e.g., when no more training
examples or when the quality of a rule returned is below a
user-specified threshold

• Comparison with decision-tree induction: learning a set of


rules simultaneously

Data Mining
21
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Sequential Covering Algorithm

while (enough target tuples left)


generate a rule
remove positive target tuples satisfying this rule

Examples covered
Examples covered by Rule 2
by Rule 1 Examples covered
by Rule 3

Positive
examples

Data Mining
22
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation

• To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break

Predicates considered may be independent of each other


(as in previous slide) or progressively restrictive (as in the
next slide)

Data Mining
23
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation

A3=1&&A1=2
&&A8=5

A3=1&&A1=2

A3=1

Positive Negative
examples examples

Data Mining
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Indirect Methods

P
No Yes

Q R Rule Set

No Yes No Yes r1: (P=No,Q=No) ==> -


r2: (P=No,Q=Yes) ==> +
- + + Q r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> -
No Yes
r5: (P=Yes,R=Yes,Q=Yes) ==> +
- +

Data Mining
25
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ensemble Methods

• Construct a set of classifiers from the training data

• Predict class label of previously unseen records by aggregating


predictions made by multiple classifiers

Data Mining
26
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General Idea

Original
D Training data

Step 1:
Create Multiple D1 D2 .... Dt-1 Dt
Data Sets

Step 2:
Build Multiple C1 C2 Ct -1 Ct
Classifiers

Step 3:
Combine C*
Classifiers

Data Mining
27
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why does it work?

• Suppose there are 25 base classifiers


• Each classifier has error rate, ε = 0.35
• Assume classifiers are independent
• Probability that the ensemble classifier makes a wrong
prediction:

25
 25  i
∑ 
 i 
i =13 
ε (1 − ε ) 25−i
= 0.06

Data Mining
28
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
When error rate differs…

• Suppose there are k base classifiers


• Each classifier has different error rate, εi
• Again, assume classifiers are independent
• Probability that the ensemble classifier makes a wrong
prediction:
• Majority of classifiers have to make wrong prediction
• Compute the probability for each combination that can make
wrong prediction (brute force method)
• Sum up for all possible combinations

Data Mining
29
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining”
Pearson Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei
Han, Micheline Kamber and Jian Pei Morgan Kaufmann Publishers

Data Mining
30
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECFZC415
Classification Model Evaluation
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Model Evaluation and Selection

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other metrics to
consider?
Use validation test set of class-labeled tuples instead of training set when
assessing accuracy
Methods for estimating a classifier’s accuracy:
– Holdout method, random subsampling
– Cross-validation
– Bootstrap
Comparing classifiers:
– Cost-benefit analysis and ROC Curves

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Classifier Evaluation Metrics: Confusion Matrix
Confusion Matrix:
Given m classes, an entry, CMi,j in a confusion matrix indicates # of tuples in
class i that were labeled by the classifier as class j
May have extra rows/columns to provide totals

Predicted class -> C1 ¬ C1


Actual class⇓
C1 True Positives False Negatives
(TP) (FN)
¬ C1 False Positives True Negatives
(FP) (TN)

5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics: Confusion Matrix
Example of Confusion Matrix:
Predicted class -> buy_computer = buy_computer = Total
yes no
Actual class ⇓
buy_computer = yes 6954 46 7000
buy_computer = no 412 2588 3000
Total 7366 2634 10000

6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
Classifier Accuracy, or recognition rate:  Class Imbalance Problem:
percentage of test set tuples that are  One class may be rare, e.g. fraud, or
correctly classified HIV-positive
Accuracy = (TP + TN)/All  Significant majority of the negative class
Error rate: 1 – accuracy, or and minority of the positive class
Error rate = (FP + FN)/All
 Sensitivity: True Positive recognition
rate

A\P C ¬C  Sensitivity = TP/P


C TP FN P  Specificity: True Negative recognition
rate
¬C FP TN N
P’ N’ All  Specificity = TN/N

7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Precision: exactness – what % of tuples that the classifier labeled as
positive are actually positive

Recall: completeness – what % of positive tuples did the classifier


label as positive?

Perfect score is 1.0

Inverse relationship between precision & recall


F measure (F1 or F-score): harmonic mean of precision and recall,

Why a harmonic mean, but not arithmetic or geometric mean?

8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures

Precision ( Recall fixed at 70 % )

9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
Harmonic mean can be recomputed by applying weights to
precision and recall
By substituting β2 = (1 – α)/ α ,

We get
Fß: weighted measure of precision and recall
– assigns β2 times as much weight to recall as to precision

10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classifier Evaluation Metrics: Example
Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%

Actual Class\Predicted class cancer = yes cancer = Total Recognition(%)


no
cancer = yes 90 210 300 30.00
(sensitivity
cancer = no 140 9560 9700 98.56
(specificity)
Total 230 9770 10000 96.40
(accuracy)

11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy: Holdout Method

Holdout method
– Given data is randomly partitioned into two independent sets
• Training set (e.g., 2/3) for model construction
• Test set (e.g., 1/3) for accuracy estimation
– Random sampling: a variation of holdout
• Repeat holdout k times, accuracy = avg. of the accuracies obtained

12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy:
Cross-Validation Methods

Cross-validation (k-fold, where k = 10 is most popular)


– Randomly partition the data into k mutually exclusive subsets, each
approximately equal size
– At i-th iteration, use Di as test set and others as training set

– Leave-one-out: k folds where k = # of tuples, for small sized data

– *Stratified cross-validation*: folds are stratified so that class dist. in


each fold is approx. the same as that in the initial data

13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating Classifier Accuracy: Bootstrap
Bootstrap
– Works well with small data sets
– Samples the given training tuples uniformly with replacement
• i.e., each time a tuple is selected, it is equally likely to be selected again and
re-added to the training set
Several bootstrap methods, and a common one is .632 boostrap
– A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the training
set end up forming the test set. About 63.2% of the original data end up in
the bootstrap, and the remaining 36.8% form the test set (since (1 – 1/d)d ≈ e-1
= 0.368)
– Repeat the sampling procedure k times, overall accuracy of the model:

14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Selection: ROC Curves
ROC (Receiver Operating
Characteristics) curves: for visual
comparison of classification
models
Originated from signal detection theory
Shows the trade-off between the true
positive rate and the false positive
rate
The area under the ROC curve is a
measure of the accuracy of the
model  Vertical axis represents the
Rank the test tuples in decreasing true positive rate
order: the one that is most likely to  Horizontal axis rep. the false
belong to the positive class positive rate
appears at the top of the list  The plot also shows a
The closer to the diagonal line (i.e., the diagonal line
closer the area is to 0.5), the less A model with perfect
accurate is the model 
accuracy will have an area of
1.0

15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R2 Principles of Data Mining, Second Edition by Max Bramer Springer © 2013

R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

16

7/5/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Prediction
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
• The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby
acknowledge all the contributors for their material and inputs.
• I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prediction

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Data Mining

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prediction vs. Classification
• How is (Numerical) prediction similar to classification?
• construct a model
• use model to predict continuous or ordered value for a given input
• Difference between Prediction and classification
• Classification refers to predict categorical class label
• Prediction models continuous-valued functions
• Major method for prediction: regression
• model the relationship between one or more independent or predictor variables
and a dependent or response variable
• Profit, sales, mortgage rates, house values, square footage, temperature, or distance
could all be predicted using regression techniques. For example, a regression model
could be used to predict the value of a house based on location, number of rooms,
lot size, and other factors.

Data Mining
4
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression for Prediction
• A regression task begins with a data set in which the target values are
known, e.g.
• A regression model that predicts house values could be developed based on
observed data for many houses over a period of time.
• The data might track the age of the house, square footage, number of rooms,
taxes, school district, proximity to shopping centers, and so on.
• House value would be the target, the other attributes would be the predictors,
and the data for each house would constitute a case.
• In the model build (training) process, a regression algorithm estimates the
value of the target as a function of the predictors for each case in the build
data.
• These relationships between predictors and target are summarized in a model,
which can then be applied to a different data set in which the target values are
unknown

Data Mining
5
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prediction Techniques

• Regression analysis
• Linear and multiple regression
• Non-linear regression
• Other regression methods:
• Log-linear models,
• Regression trees
• etc.

Data Mining
6
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Analysis
• Regression analysis seeks to determine the values of parameters for a
function that cause the function to best fit a set of data observations
that you provide.
• The following equation expresses these relationships in symbols.
y = F(x,w) + e

• Regression is the process of estimating the value of a continuous target


(y) as a function (F) of one or more predictors (x1 , x2 , ..., xn), a set of
parameters (w0, w1 , w2 , ..., wn), and a measure of error (e).

Data Mining
7
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Analysis
• In the equation
y = F(x,w) + e
• The predictors(x1 , x2 , ..., xn) can be understood as independent
variables
• The target (y) is the dependent variable.
• The error (e), also called the residual, is the difference between the
expected and predicted value of the dependent variable.
• The regression parameters are also known as regression coefficients.
• The process of training a regression model involves finding the
parameter values that minimize a measure of the error, for example, the
sum of squared errors.

Data Mining
8
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Simple Linear Regression
• Simple Linear regression: involves a response variable y and a single predictor
variable x
y = w 0 + w1 x
where w0 (y-intercept) and w1 (slope) are regression coefficients
• Method of least squares: estimates the best-fitting straight line

| D|

∑ ( x − x )( y − y)
w= 1
i =1
i

| D|
i
w = y −wx
0 1

∑ i
( x
i =1
− x ) 2

Data Mining
9
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Linear Regression With a Single Predictor

Data Mining
10
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Linear Regression

Multiple linear regression: involves more than one predictor variable

• Training data is of the form (X1, y1), (X2, y2),…, (X|D|, y|D|)

• e.g. For 2-D data, we may have:

y = w0 + w1 x1+ w2 x2
• Solvable by extension of least square method or using SAS, S-Plus

• Many nonlinear functions can be transformed into the above

Data Mining
11
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression

• Often the relationship between x and y


cannot be approximated with a straight
line. In this case, a nonlinear regression
technique may be used. Alternatively, the
data could be preprocessed to make the
relationship linear.
• Nonlinear regression models define y as a
function of x using an equation that is
more complicated than the linear
regression equation

Data Mining
12
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nonlinear Regression
• Some nonlinear models can be modeled by a polynomial function
• A polynomial regression model can be transformed into linear regression
model. For example,
• y = w0 + w1 x + w2 x2 + w3 x3
• convertible to linear with new variables: x2 = x2, x3= x3
• y = w0 + w1 x + w2 x2 + w3 x3
• Other functions, such as power function, can also be transformed to
linear model
• Some models are intractable nonlinear (e.g., sum of exponential terms)
• possible to obtain least square estimates through extensive
calculation on more complex formulae

Data Mining
13
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Trees and Model Trees
• Regression tree: proposed in CART system (Breiman et al. 1984)
• CART: Classification And Regression Trees
• Each leaf stores a continuous-valued prediction
• It is the average value of the predicted attribute for the training tuples that
reach the leaf
• Model tree: proposed by Quinlan (1992)
• Each leaf holds a regression model—a multivariate linear equation for the
predicted attribute
• A more general case than regression tree
• Regression and model trees tend to be more accurate than linear regression
when the data are not represented well by a simple linear model

Data Mining
14
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Regression Trees
Humidit Golf
Day Outlook Temp. Wind
y Players
1 Sunny Hot High Weak 25
2 Sunny Hot High Strong 30
3 Overcast Hot High Weak 46
4 Rain Mild High Weak 45
5 Rain Cool Normal Weak 52
6 Rain Cool Normal Strong 23
7 Overcast Cool Normal Strong 43
8 Sunny Mild High Weak 35
9 Sunny Cool Normal Weak 38
10 Rain Mild Normal Weak 46
11 Sunny Mild Normal Strong 48
12 Overcast Mild High Strong 52
13 Overcast Hot Normal Weak 44
14 Rain Mild High Strong 30

[Link]
Data Mining
15
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Model Tree Sample

LM1, LM2, …., LM6 are distinct linear models

[Link]

Data Mining
16
7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Simple Linear Regression Example

x y
Area (in sq. m) Rent (in 000s of Rupees)
172 42
150 35
181 46
174 40
194 50

Can we predict rent for a house of 160 sq. m. in the locality?


17

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R2 Principles of Data Mining, Second Edition by Max Bramer Springer © 2013

R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

18

7/5/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Association Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
•The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their material
and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Association Analysis Basics

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Analysis
Association analysis measures the strength of co-occurrence between one item
and another.
– The objective of this class of data mining algorithms is not to predict an occurrence of an
item, like classification or regression do, but to find usable patterns in the co-occurrences of
the items.
– Association rules learning is a branch of an unsupervised learning process that discovers
hidden patterns in data, in the form of easily recognizable rules

Association algorithms are widely used in retail analysis of transactions,


recommendation engines, and online clickstream analysis across web pages.
– One of the popular applications of this technique is called market basket analysis, which
finds co-occurrences of one retail item with another item within the same retail purchase
transaction

Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket transactions Example of Association Rules

TID Items {Diaper} → {Butter},


1 Bread, Milk {Milk, Bread} → {Beans, Coke},
2 Bread, Diaper, Butter, Beans {Butter, Bread} → {Milk},
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
Implication means co-occurrence,
5 Bread, Milk, Diaper, Coke
not causality!

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Frequent Itemset
Itemset TID Items
– A collection of one or more items 1 Bread, Milk
• Example: {Milk, Bread, Diaper} 2 Bread, Diaper, Butter, Beans
– k-itemset 3 Milk, Diaper, Butter, Coke
• An itemset that contains k items 4 Bread, Milk, Diaper, Butter
Support count (σ) 5 Bread, Milk, Diaper, Coke
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Association Rule
TID Items
 Association Rule 1 Bread, Milk
– An implication expression of the form X → Y, 2 Bread, Diaper, Butter, Beans
where X and Y are itemsets 3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
– Example:
{Milk, Diaper} → {Butter} 5 Bread, Milk, Diaper, Coke

 Rule Evaluation Metrics


Example:
– Support (s)
 Fraction of transactions that contain both X {Milk, Diaper} ⇒ Butter
and Y
σ (Milk, Diaper, Butter) 2
– Confidence (c) s= = = 0.4
|T| 5
 Measures how often items in Y
appear in transactions that σ (Milk, Diaper, Butter ) 2
contain X c= = = 0.67
σ (Milk, Diaper ) 3
7

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining Task
Given a set of transactions T, the goal of association rule mining is to find all
rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
TID Items Example of Rules:
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
{Milk,Diaper} → {Butter} (s=0.4, c=0.67)
{Milk,Butter} → {Diaper} (s=0.4, c=1.0)
3 Milk, Diaper, Butter, Coke
{Diaper,Butter} → {Milk} (s=0.4, c=0.67)
4 Bread, Milk, Diaper, Butter
{Butter} → {Milk,Diaper} (s=0.4, c=0.67)
5 Bread, Milk, Diaper, Coke
{Diaper} → {Milk,Butter} (s=0.4, c=0.5)
{Milk} → {Diaper,Butter} (s=0.4, c=0.5)
Observations:
• All the above rules are binary partitions of the same itemset:
{Milk, Diaper, Butter}
• Rules originating from the same itemset have identical support but
can have different confidence
• Thus, we may decouple the support and confidence requirements
9

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

Frequent itemset generation is still computationally expensive

10

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE Given d items, there are


2d possible candidate
itemsets
ABCDE 11

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation
Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database

TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke

– Match each transaction against every candidate


– Complexity ~ O(NMw) => Expensive since M = 2d !!!
12

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent Itemset Generation Strategies
Reduce the number of candidates (M)
– Complete search: M=2d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)


– Reduce size of N as the size of itemset increases
– Used by DHP(Direct Hashing & Pruning) and vertical-based mining
algorithms

Reduce the number of comparisons (NM)


– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

13

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori Algorithm

14

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent
itemset, where each rule is a binary partitioning of a
frequent itemset

Frequent itemset generation is still computationally expensive


– Apriori principle can be used to reduce computations

15

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reducing Number of Candidates
Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

Apriori principle holds due to the following property of


the support measure:
∀X , Y : ( X ⊆ Y ) ⇒ s( X ) ≥ s(Y )
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
16
The Apriori algorithm was proposed by Rakesh Agrawal and Ramakrishnan Srikant in 1994
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle
null

A B C D E

AB AC AD AE BC BD BE CD CE DE

Found to be
Infrequent
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

ABCD ABCE ABDE ACDE BCDE

Pruned
ABCDE
supersets 17

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
Method:
Initially, scan DB once to get frequent
1-itemset TID Items
1 Bread, Milk
Generate length (k+1) candidate 2 Bread, Diaper, Butter, Beans
itemsets from length k frequent
3 Milk, Diaper, Butter, Coke
itemsets
4 Bread, Milk, Diaper, Butter
Test the candidates against DB 5 Bread, Milk, Diaper, Coke

Terminate when no frequent or


candidate set can be generated

18

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Apriori Algorithm
Method:
– Let k=1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified
• Generate length (k+1) candidate itemsets from length k frequent
itemsets
• Prune candidate itemsets containing subsets of length k that are
infrequent
• Count the support of each candidate by scanning the DB
• Eliminate candidates that are infrequent, leaving only those that
are frequent

19

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Important Details of Apriori
How to generate candidates?
– Step 1: self-joining Lk
– Step 2: pruning
Example of Candidate-generation
– L3={abc, abd, acd, ace, bcd}
– Self-joining: L3*L3
• abcd from abc and abd
• acde from acd and ace
– Pruning:
• acde is removed because ade is not in L3
– C4={abcd}
How to count supports of candidates?
20

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4
Butter 3
Diaper 4
Beans 1

Minimum Support = 3

TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
21

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle

Item Count Items (1-itemsets)


Bread 4
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Butter 3 {Bread,Milk} 3
Diaper 4 {Bread,Butter} 2 (No need to generate
Beans 1 {Bread,Diaper} 3
candidates involving Coke
{Milk,Butter} 2
Minimum Support = 3 {Milk,Diaper} 3 or Beans)
{Butter,Diaper} 3
TID Items
1 Bread, Milk
2 Bread, Diaper, Butter, Beans
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
22

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrating Apriori Principle
If every subset is considered,
6C + 6C + 6C = 41
1 2 3
Item Count Items (1-itemsets) With support-based pruning,
Bread 4 6 + 6 + 1 = 13
Coke 2
Milk 4 Itemset Count Pairs (2-itemsets)
Butter 3 {Bread,Milk} 3
Diaper 4 {Bread,Butter} 2 (No need to generate
Beans 1 {Bread,Diaper} 3
candidates involving Coke
{Milk,Butter} 2
Minimum Support = 3 {Milk,Diaper} 3 or Beans)
{Butter,Diaper} 3
TID Items
Triplets (3-itemsets)
1 Bread, Milk
2 Bread, Diaper, Butter, Beans Itemset Count
3 Milk, Diaper, Butter, Coke {Bread,Milk,Diaper} 2
4 Bread, Milk, Diaper, Butter
5 Bread, Milk, Diaper, Coke
23

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Factors Affecting Complexity
Choice of minimum support threshold
– lowering support threshold results in more frequent itemsets
– this may increase number of candidates and max length of frequent
itemsets
Dimensionality (number of items) of the data set
– more space is needed to store support count of each item
– if number of frequent items also increases, both computation and I/O
costs may also increase
Size of database
– since Apriori makes multiple passes, run time of algorithm may increase
with number of transactions
Average transaction width
– transaction width increases with denser data sets
– This may increase max length of frequent itemsets and traversals of hash
tree (number of subsets in a transaction increases with its width)

24

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Can we improve Apriori Efficiency?

Hash-based technique
Transaction reduction
Partitioning
Sampling
Dynamic itemset counting

25

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Hash-based technique

Create hash table using hash function


h(x, y)=((order of x)*10 + (order of y)) mod 7

A 2-itemset with a corresponding bucket count in the hash table


26
that is below the support threshold cannot be frequent
July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Efficiency Techniques
Transaction reduction
– A transaction that does not contain any frequent k-itemsets
cannot contain any frequent (k + 1)-itemsets. Such a
transaction can be removed from further consideration

Dynamic Itemset Counting


– Instead of counting for the entire database, promote a
candidate to frequent itemset if it passes a (lower)
threshold after partial counting. Afterwards, generate larger
patterns using the itemset, so promoted.

27

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Efficiency Techniques - Dynamic Itemset Counting

ABCD
Once both A and D are determined frequent, the
counting of AD begins
ABC ABD ACD BCD Once all length-2 subsets of BCD are determined
frequent, the counting of BCD begins, thus
reducing effective number of scans
AB AC BC AD BD CD
Transactions
1-itemsets
A B C D
Apriori 2-itemsets

{}
Itemset lattice 1-itemsets
2-items
S. Brin R. Motwani, J. Ullman, and S.
Tsur. Dynamic itemset counting and DIC 3-items
implication rules for market basket 28
data. SIGMOD’97
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Efficiency Techniques

Mining by partitioning the data


– It has two phases. In phase I, divide the transactions of D into n partitions. Each
partition has proportionally lower threshold. For each partition, all the local
frequent itemsets are found.
– Any itemset that is potentially frequent with respect to D must be a frequent
itemset in at least one of the partitions. Therefore, all local frequent itemsets
are candidate itemsets with respect to D 29

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining with Vertical format

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Mining with Vertical format

Notice that because the itemsets {I1, I4} and {I3, I5} each contain only one
transaction, they do not belong to the set of frequent 2-itemsets.

31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Mining with Vertical format

32

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Mining with Vertical format

Vertical format: t(AB) = {T11, T25, …}


– tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
– t(X) = t(Y): X and Y always happen together
– t(X) ⊂ t(Y): transaction having X always has Y
Using diffset to accelerate mining
– Only keep track of differences of tids
– t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
– Diffset (XY, X) = {T2}
• Diffset can reduce space complexity

33

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns,
e.g., {a1, …, a100} contains
100C + 100C +…+ 100C = 2 100 – 1 = 1.27*1030 sub-patterns!
1 2 100

Solution: Mine closed frequent patterns and maximal frequent


patterns instead
– An itemset X is closed if X is frequent and there exists no
super-pattern Y ‫ כ‬X, with the same support as X
– An itemset X is a maximal pattern if X is frequent and there
exists no frequent super-pattern Y ‫ כ‬X
Closed pattern is a lossless compression of freq. patterns
– Reducing the # of patterns and rules
34

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Closed Patterns and Max-Patterns
Example
– DB = {<a1 …, a100>, < a1 …, a100>, < a1, …, a50>}
– Min_sup = 2
What is the set of closed itemset?
– <a1, …, a100>: 2
– < a1, …, a50>: 3
What is the set of maximal pattern?
– <a1, …, a100>: 2
What is the set of all patterns?
– 1.27*1030
35

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Itemsets
Transaction
null Ids
TID Items
124 123 1234
1 ABC 245 345
A B C D E
2 ABCD
3 BCE
12 124 24 4 123 2 3 24
4 ACDE AB AC AD AE BC BD BE CD
34
CE
45
DE

5 DE

12 2 24 4 4 2 3 4
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

2 4
ABCD ABCE ABDE ACDE BCDE

Not supported by
any transactions ABCDE 36

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Frequent Itemsets
TID Items
Minimum support = 2 null Closed but
1 ABC not maximal
2 ABCD 124 123 1234 245 345
A B C D E
3 BCE
Closed &
4 ACDE
maximal
5 DE
12 124 24 4 123 2 3 24 34 45
AB AC AD AE BC BD BE CD CE DE

12 2 24 4 4 2 3 4
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE

2 4
ABCD ABCE ABDE ACDE BCDE
# Closed = 9
# Maximal = 4
ABCDE
37

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Maximal vs Closed Itemsets

Frequent
Itemsets

Closed
Frequent
Itemsets

Maximal
Frequent
Itemsets

38

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

39

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

40

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECFZC415: Data Mining
(Lecture #10 – Association Analysis)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

FP-growth Algorithm
Association Analysis (Review)

Association analysis measures the strength of co-occurrence between one item


and another.
– The objective of this class of data mining algorithms is not to predict an occurrence of an
item, like classification or regression do, but to find usable patterns in the co-occurrences of
the items.
– Association rules learning is a branch of an unsupervised learning process that discovers
hidden patterns in data, in the form of easily recognizable rules

Association algorithms are widely used in retail analysis of transactions,


recommendation engines, and online clickstream analysis across web pages.
– One of the popular applications of this technique is called market basket analysis, which
finds co-occurrences of one retail item with another item within the same retail purchase
transaction

Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.

4
July 30, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Rule Mining (Review)
Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other items
in the transaction

Market-Basket transactions Example of Association Rules

TID Items {Diaper} → {Butter},


1 Bread, Milk {Milk, Bread} → {Beans, Coke},
2 Bread, Diaper, Butter, Beans {Butter, Bread} → {Milk},
3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
Implication means co-occurrence,
5 Bread, Milk, Diaper, Coke
not causality!

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Frequent Itemset (Review)
Itemset TID Items
– A collection of one or more items 1 Bread, Milk
• Example: {Milk, Bread, Diaper} 2 Bread, Diaper, Butter, Beans
– k-itemset 3 Milk, Diaper, Butter, Coke
• An itemset that contains k items 4 Bread, Milk, Diaper, Butter
Support count (σ) 5 Bread, Milk, Diaper, Coke
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread,Diaper}) = 2
Support
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5
Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Definition: Association Rule (Review)
TID Items
 Association Rule 1 Bread, Milk
– An implication expression of the form X → Y, 2 Bread, Diaper, Butter, Beans
where X and Y are itemsets 3 Milk, Diaper, Butter, Coke
4 Bread, Milk, Diaper, Butter
– Example:
{Milk, Diaper} → {Butter} 5 Bread, Milk, Diaper, Coke

 Rule Evaluation Metrics


Example:
– Support (s)
 Fraction of transactions that contain both X {Milk, Diaper} ⇒ Butter
and Y
σ (Milk, Diaper, Butter) 2
– Confidence (c) s= = = 0.4
|T| 5
 Measures how often items in Y
appear in transactions that σ (Milk, Diaper, Butter ) 2
contain X c= = = 0.67
σ (Milk, Diaper ) 3
7

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Association Rules
Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup

2. Rule Generation
– Generate high confidence rules from each frequent itemset, where each rule is a
binary partitioning of a frequent itemset

Frequent itemset generation is computationally expensive


– Can we avoid it?

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bottleneck of Frequent-pattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of scanning and generates lots of
candidates
– To find frequent itemset i1i2…i100
• # of scans: 100
• # of Candidates: 100C1 + 100C2 +…+ 100C100 = 2100 – 1 = 1.27*1030 !

Bottleneck: candidate-generation-and-test
– Can we avoid candidate generation?

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Frequent Patterns
Without Candidate Generation
Bottlenecks of the Apriori approach
– Breadth-first (i.e., level-wise) search
– Candidate generation and test
– Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
– Depth-first search
– Avoid explicit candidate generation
Grow long patterns from short ones using local frequent items (Major
philosophy behind FPGrowth)
– “abc” is a frequent pattern
– Get all transactions having “abc”: DB|abc
– “d” is a local frequent item in DB|abc  abcd is a frequent pattern

10

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
FP-growth Algorithm
Use a compressed representation of the database using an FP-tree
Once an FP-tree has been constructed, it uses a recursive divide-and-conquer
approach to mine the frequent itemsets

1. Scan DB once, find frequent 1-itemset (single


item pattern)
2. Sort frequent items in frequency descending
order, f-list
3. Scan DB again, construct FP-tree

11

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Construct FP-tree from a Transaction Database
min_support = 3 F-list=f-c-a-b-m-p

TID Items bought (ordered) frequent items


100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}

12

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Construct FP-tree from a Transaction Database
min_support = 3 F-list=f-c-a-b-m-p

TID Items bought (ordered) frequent items


100 {f, a, c, d, g, i, m, p} {f, c, a, m, p}
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}

13

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Construct FP-tree from a Transaction Database
{}
min_support = 3 Header Table
F-list=f-c-a-b-m-p
Item frequency head f:4 c:1
f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
p 3
m:2 b:1
TID Items bought (ordered) frequent items
100 {f, a, c, d, g, i, m, p} {f, c, a, m, p} p:2 m:1
200 {a, b, c, f, l, m, o} {f, c, a, b, m}
300 {b, f, h, j, o, w} {f, b}
400 {b, c, k, s, p} {c, b, p}
500 {a, f, c, e, l, p, m, n} {f, c, a, m, p}
14

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Find Patterns Having P From P-conditional Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of transformed prefix paths of item p to form p’s conditional
pattern base
{}
Header Table

Item frequency head f:4 c:1


f 4
c 4 c:3 b:1 b:1 Conditional pattern base of p:
a 3 fcam:2, cb:1
b 3 a:3 p:1
m 3
p 3 m:2 b:1

p:2 m:1
15

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
From Conditional Pattern-bases to Conditional FP-trees

For each pattern-base


– Accumulate the count for each item in the base
– Construct the FP-tree for the frequent items of the pattern
base
m-conditional pattern base:
{} fca:2, fcab:1
Header Table
Item frequency head All frequent
f:4 c:1 patterns relate to m
f 4 {}
c 4 c:3 b:1 b:1 m,

a 3 f:3
 fm, cm, am,
b 3 a:3 p:1 fcm, fam, cam,
m 3 c:3 fcam
p 3 m:2 b:1
p:2 m:1 a:3
m-conditional FP-tree 16

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Find Patterns Having x From x-conditional Database
{}
Header Table

Item frequency head f:4 c:1


f 4
c 4 c:3 b:1 b:1
a 3
b 3 a:3 p:1
m 3
Item Conditional Conditional Frequent
p 3 m:2 b:1 pattern base fp-tree patterns
p fcam: 2, c:3 cp:3
p:2 m:1 cb:1
m fca:2, fcab:1 fca:3 fcam:3 (and all
its subsets)
b fca:1, f:1, c:1 None
a fc:3 fc:3
c f:3 f:3
17

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Benefits of the FP-tree Structure
Completeness
– Preserve complete information for frequent pattern
mining
– Never break a long pattern of any transaction
Compactness
– Reduce irrelevant info—infrequent items are gone
– Items in frequency descending order: the more frequently
occurring, the more likely to be shared
– Never be larger than the original database (not count
node-links and the count field)

18

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to f-
list
– F-list=f-c-a-b-m-p
– Patterns containing p
– Patterns having m but no p
– …
– Patterns having c but no a nor b, m, p
– Pattern f
Completeness and non-redundency

19

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Recursion: Mining Each Conditional FP-tree
{}

{}
Cond. pattern base of “am”: (fc:3) f:3

c:3
f:3
am-conditional FP-tree
c:3 {}

a:3 Cond. pattern base of “cm”: (f:3)


f:3
m-conditional FP-tree
cm-conditional FP-tree

{}
Cond. pattern base of “cam”: (f:3)
f:3
cam-conditional FP-tree
20

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
A Special Case: Single Prefix Path in FP-tree
Suppose a (conditional) FP-tree T has a shared single prefix-
path P
Mining can be decomposed into two parts
{}
– Reduction of the single prefix path into one node
a1:n1
– Concatenation of the mining results of the two parts
a2:n2
{} r1
a3:n3

a1:n1
b1:m1 C1:k1
 r1 =
a2:n2
+ b1:m1 C1:k1

a3:n3 C2:k2 C3:k3


C2:k2 C3:k3 21

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Frequent Patterns With FP-trees
Idea: Frequent pattern growth
– Recursively grow frequent patterns by pattern and
database partition
Method
– For each frequent item, construct its conditional pattern-
base, and then its conditional FP-tree
– Repeat the process on each newly created conditional FP-
tree
– Until the resulting FP-tree is empty, or it contains only one
path—single path will generate all the combinations of its
sub-paths, each of which is a frequent pattern
22

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Why Is FP-Growth the Winner?

Divide-and-conquer:
– decompose both the mining task and DB according to the
frequent patterns obtained so far
– leads to focused search of smaller databases
Other factors
– no candidate generation, no candidate test
– compressed database: FP-tree structure
– no repeated scan of entire database
– basic ops—counting local freq items and building sub FP-
tree, no pattern search and matching

23

July 30, 2021


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Scaling FP-growth by Database Projection

• What about if FP-tree cannot fit in


memory?
• DB projection Tran. DB
• First partition a database into a set of fcamp
projected DBs fcabm
fb
• Then construct and mine FP-tree for cbp
each projected DB fcamp

p-proj DB m -proj DB b-proj DB a-proj DB c-proj DB f-proj DB


fcam fcab f fc f …
cb fca cb … …
fcam fca …

am -proj DB cm -proj DB
fc f …
fc f
fc f
24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Mining association rules


Rule Generation

• Given a frequent itemset L, find all non-empty


subsets f ⊂ L such that f → L – f satisfies the
minimum confidence requirement
• If {A,B,C,D} is a frequent itemset, candidate rules:
ABC →D, ABD →C, ACD →B, BCD →A,
AB →CD, AC → BD, AD → BC, BC →AD,
BD →AC, CD →AB,
A →BCD, B →ACD, C →ABD, D →ABC

• If |L| = k, then there are 2k – 2 candidate


association rules (ignoring L → ∅ and ∅ → L)

July 30, 2021 26


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation

• How to efficiently generate rules from frequent itemsets?


• In general, confidence does not have an anti-monotone property
c(ABC →D) can be larger or smaller than c(AB →D)

• But confidence of rules generated from the same itemset has an anti-
monotone property
• e.g., L = {A,B,C,D}:

c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

• Confidence is anti-monotone in this case

July 30, 2021 27


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation for Apriori Algorithm
Lattice of rules
ABCD=>{ }
Low
Confidence
Rule
BCD=>A ACD=>B ABD=>C ABC=>D

CD=>AB BD=>AC BC=>AD AD=>BC AC=>BD AB=>CD

D=>ABC C=>ABD B=>ACD A=>BCD


Pruned
Rules
July 30, 2021 28
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Rule Generation with Apriori Algorithm
• Candidate rule is generated by merging two rules that share the same
prefix in the rule consequent

CD=>AB BD=>AC
• join(CD=>AB,BD=>AC)
would produce the candidate
rule D => ABC

• Prune rule D=>ABC if its


subset AD=>BC does not have
high confidence D=>ABC

July 30, 2021 29


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Effect of Support Distribution
• Many real data sets have skewed support distribution

Support
distribution of
a retail data set

July 30, 2021 30


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Effect of Support Distribution

• How to set the appropriate minsup threshold?


• If minsup is set too high, we could miss itemsets involving interesting rare
items (e.g., expensive products)
• If minsup is set too low, it is computationally expensive and the number of
itemsets is very large

• Using a single minimum support threshold may not be effective

July 30, 2021 31


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Multiple Minimum Support

• How to apply multiple minimum supports?


• MS(i): minimum support for item i
• e.g.: MS(Milk)=5%, MS(Coke) = 3%,
MS(Broccoli)=0.1%, MS(Salmon)=0.5%
• MS({Milk, Broccoli}) = min (MS(Milk), MS(Broccoli))
= 0.1%

• Challenge: Support is no longer anti-monotone


• Suppose: Support(Milk, Coke) = 1.5% and
Support(Milk, Coke, Broccoli) = 0.5%

• {Milk,Coke} is infrequent but {Milk,Coke,Broccoli} is frequent

July 30, 2021 32


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Pattern Evaluation

• Association rule algorithms tend to produce too many rules


• many of them are uninteresting or redundant
• Redundant if {A,B,C} → {D} and {A,B} → {D}
have same support & confidence

• Interestingness measures can be used to prune/rank the derived


patterns

• In the original formulation of association rules, support & confidence


are the only measures used

July 30, 2021 33


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Computing Interestingness Measure
• Given a rule X → Y, information needed to compute rule
interestingness can be obtained from a contingency table
Contingency table for X → Y
Y Y f11: support of X and Y
X f11 f10 f1+
f10: support of X and Y
f01: support of X and Y
X f01 f00 fo+
f00: support of X and Y
f+1 f+0 |T|

Used to define various measures


 support, confidence, lift, Gini,
J-measure, etc.

July 30, 2021 34


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Drawback of Confidence

Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100

Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 0.75


but P(Coffee) = 0.9
⇒ Although confidence is high, rule is misleading
⇒ P(Coffee|Tea) = 0.9375
July 30, 2021 35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Statistical Independence

• Population of 1000 students


• 600 students know how to swim (S)
• 700 students know how to bike (B)
• 420 students know how to swim and bike (S,B)

• P(S∧B) = 420/1000 = 0.42


• P(S) × P(B) = 0.6 × 0.7 = 0.42

• P(S∧B) = P(S) × P(B) => Statistical independence


• P(S∧B) > P(S) × P(B) => Positively correlated
• P(S∧B) < P(S) × P(B) => Negatively correlated

July 30, 2021 36


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Statistical-based Measures

• Measures that take into account statistical dependence

P(Y | X )
Lift =
P(Y )
P( X , Y )
Interest =
P( X ) P(Y )

July 30, 2021 37


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Example: Lift/Interest

Coffee Coffee
Tea 15 5 20
Tea 75 5 80
90 10 100

Association Rule: Tea → Coffee

Confidence= P(Coffee|Tea) = 0.75


but P(Coffee) = 0.9
⇒ Lift = 0.75/0.9= 0.8333 (< 1, therefore is negatively associated)

July 30, 2021 38


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Drawback of Lift & Interest

Y Y Y Y
X 10 0 10 X 90 0 90
X 0 90 90 X 0 10 10
10 90 100 90 10 100

0.1 0.9
Lift = = 10 Lift = = 1.11
(0.1)(0.1) (0.9)(0.9)

Statistical independence:
If P(X,Y)=P(X)P(Y) => Lift = 1

July 30, 2021 39


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Subjective Interestingness Measure
• Objective measure:
• Rank patterns based on statistics computed from data
• e.g., many measures of association (support, confidence,
Laplace, Gini, mutual information, Jaccard, etc).

• Subjective measure:
• Rank patterns according to user’s interpretation
• A pattern is subjectively interesting if it contradicts the
expectation of a user
• A pattern is subjectively interesting if it is actionable

July 30, 2021 40


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Interestingness via Unexpectedness
• Need to model expectation of users (domain knowledge)

+ Pattern expected to be frequent

- Pattern expected to be infrequent


Pattern found to be frequent

Pattern found to be infrequent

+ - Expected Patterns

- + Unexpected Patterns

• Need to combine expectation of users with evidence from data (i.e., extracted
patterns)

July 30, 2021 41


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining”
Pearson Education
R2 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner by
Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECFZC415: Data Mining
(Lecture #11 – Cluster Analysis)

BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the books and from various other contributors. I
hereby acknowledge all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
Cluster Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Clustering Concepts
What is Cluster Analysis?

• Finding groups of objects such that the objects in a group


will be similar (or related) to one another and different from
(or unrelated to) the objects in other groups
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases,


and then use this knowledge to develop targeted marketing programs
• Land use: Identification of areas of similar land use in an earth observation
database
• Insurance: Identifying groups of motor insurance policy holders with a high
average claim cost
• City-planning: Identifying groups of houses according to their house type,
value, and geographical location
• Earth-quake studies: Observed earth quake epicenters should be clustered
along continent faults

August 7, 2021
6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is not Cluster Analysis?

• Supervised classification
• Have class label information

• Simple segmentation
• Dividing students into different registration groups alphabetically,
by last name

• Results of a query
• Groupings are a result of an external specification

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters


with
• high intra-class similarity
• low inter-class similarity
• The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
• The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns

August 7, 2021 8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measure the Quality of Clustering

• Dissimilarity/Similarity metric: Similarity is expressed in terms of a


distance function, typically metric: d(i, j)
• There is a separate “quality” function that measures the
“goodness” of a cluster.
• The definitions of distance functions are usually very different for
interval-scaled, boolean, categorical, ordinal ratio, and vector
variables.
• Weights should be associated with different variables based on
applications and data semantics.
• It is hard to define “similar enough” or “good enough”
• the answer is typically highly subjective.

August 7, 2021
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Requirements of Clustering in Data Mining

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input
parameters
• Able to deal with noise and outliers
• Insensitive to order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

August 7, 2021
10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Type of data in clustering analysis

• Interval-scaled variables

• Binary variables

• Nominal, ordinal, and ratio variables

• Variables of mixed types

August 7, 2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Interval-valued variables(standardization)
Z-score: x
z= σ − µ
X: raw score to be standardized,
μ: mean of the population, σ: standard deviation
• the distance between the raw score and the population mean in units of
the standard deviation
• negative when the raw score is below the mean, “+” when above
• Alternately
• Calculate the mean absolute deviation:
s f = 1n (| x1 f − m f | + | x2 f − m f | +...+ | xnf − m f |)
where m f = 1n (x1 f + x2 f + ... + xnf )
.

• Calculate the standardized measurement (z-score)


xif − m f
zif = sf
• Using mean absolute deviation is more robust than using standard deviation

August 7, 2021
12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity Between Objects

• Distances are normally used to measure the similarity or


dissimilarity between two data objects
• Some popular ones include: Minkowski distance:
d (i, j) = q (| x − x |q + | x − x |q +...+ | x − x |q )
i1 j1 i2 j2 ip jp
where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-
dimensional data objects, and q is a positive integer
• If q = 1, d is Manhattan distance
d (i, j) =| x − x | + | x − x | +...+ | x − x |
i1 j1 i2 j 2 ip jp

August 7, 2021
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity and Dissimilarity Between Objects (Cont.)

• If q = 2, d is Euclidean distance:
d (i, j) = (| x − x |2 + | x − x |2 +...+ | x − x |2 )
i1 j1 i2 j2 ip jp
• Properties
• d(i,j) ≥ 0
• d(i,i) = 0
• d(i,j) = d(j,i)
• d(i,j) ≤ d(i,k) + d(k,j)

• Also, one can use weighted distance, parametric Pearson


product moment correlation, or other dissimilarity measures

August 7, 2021
14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Binary Variables
• A contingency table for binary Object j
1 0 sum
data 1 a b a +b
Object i 0 c d c+d
sum a+c b+d p

• Distance measure for symmetric


d (i, j) = b+c
binary variables: a +b+c + d
• Distance measure for asymmetric b+c
d (i, j) =
binary variables: a +b+c
• Jaccard coefficient (similarity a
simJaccard (i, j) =
measure for asymmetric binary a +b+c
variables):
15
August 7, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Nominal Variables

• A generalization of the binary variable in that it can take


more than 2 states, e.g., red, yellow, blue, green
• Method 1: Simple matching
• m: # of matches, p: total # of variables

d (i, j) = p −
p
m

• Method 2: use a large number of binary variables


• creating a new binary variable for each of the M nominal
states

August 7, 2021
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Ordinal Variables

• An ordinal variable can be discrete or continuous


• Order is important, e.g., rank
• Can be treated like interval-scaled
• replace xif by their rank rif ∈{1,..., M f }
• map the range of each variable onto [0, 1] by replacing i-
th object in the f-th variable by
rif −1
zif =
M f −1
• compute the dissimilarity using methods for interval-
scaled variables

August 7, 2021
17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes of Mixed Type
• A database may contain all attribute types
• Nominal, symmetric binary, asymmetric binary,
numeric, ordinal
• One may use a weighted formula to combine their effects

Σ pf = 1δ ij( f ) dij( f )
d (i, j) =
Σ pf = 1δ ij( f )

18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Notion of a Cluster can be Ambiguous

How many clusters? Six Clusters

Two Clusters Four Clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusterings
• A clustering is a set of clusters

• An important distinction among types of clustering :


hierarchical and partitional sets of clusters

• Partitional Clustering
• A division data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
• Hierarchical clustering
• A set of nested clusters organized as a hierarchical tree

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Partitional Clustering

Original Points A Partitional Clustering

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering

p1
p3 p4
p2

p1 p2 p3 p4
Traditional Hierarchical Clustering
Traditional Dendrogram

p1
p3 p4
p2

p1 p2 p3 p4
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Other Distinctions Between Sets of Clusters
• Exclusive versus non-exclusive
• In non-exclusive clustering, points may belong to multiple clusters.
• Can represent multiple classes or ‘border’ points
• Fuzzy versus non-fuzzy
• In fuzzy clustering, a point belongs to every cluster with some weight
between 0 and 1
• Weights must sum to 1
• Probabilistic clustering has similar characteristics
• Partial versus complete
• In some cases, we only want to cluster some of the data
• Heterogeneous versus homogeneous
• Cluster of widely different sizes, shapes, and densities

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters

Clusters can be of many types:

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual
• Described by an Objective Function

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Well-Separated

• Well-Separated Clusters:
• A cluster is a set of points such that any point in a cluster is closer
(or more similar) to every other point in the cluster than to any
point not in the cluster.

3 well-separated clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Center-Based

• Center-based
• A cluster is a set of objects such that an object in a cluster is closer
(more similar) to the “center” of a cluster, than to the center of any
other cluster
• The center of a cluster is often a centroid, the average of all the
points in the cluster, or a medoid, the most “representative” point
of a cluster

4 center-based clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive)


• A cluster is a set of points such that a point in a cluster is closer (or
more similar) to one or more other points in the cluster than to any
point not in the cluster.

8 contiguous clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Density-Based

• Density-based
• A cluster is a dense region of points, which is separated by low-
density regions, from other regions of high density.
• Used when the clusters are irregular or intertwined, and when
noise and outliers are present.

6 density-based clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters


• Finds clusters that share some common property or represent a
particular concept.
.

2 Overlapping Circles

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Objective Function
• Clusters Defined by an Objective Function
• Finds clusters that minimize or maximize an objective function.
• Enumerate all possible ways of dividing the points into clusters and
evaluate the `goodness' of each potential set of clusters by using the
given objective function. (NP Hard)
• Can have global or local objectives.
• Hierarchical clustering algorithms typically have local objectives
• Partitional algorithms typically have global objectives
• A variation of the global objective function approach is to fit the data
to a parameterized model.
• Parameters for the model are determined from the data.
• Mixture models assume that the data is a ‘mixture' of a number of
statistical distributions.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Clusters: Objective Function …

• Map the clustering problem to a different domain and solve a related


problem in that domain
• Proximity matrix defines a weighted graph, where the nodes are the points
being clustered, and the weighted edges represent the proximities between
points

• Clustering is equivalent to breaking the graph into connected components,


one for each cluster.

• Want to minimize the edge weight between clusters and maximize the edge
weight within clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Important Characteristics of the Input Data
• Type of proximity or density measure
• This is a derived measure, but central to clustering
• Sparseness
• Dictates type of similarity
• Adds to efficiency
• Attribute type
• Dictates type of similarity
• Type of Data
• Dictates type of similarity
• Other characteristics, e.g., autocorrelation
• Dimensionality
• Noise and Outliers
• Type of Distribution

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Partitioning Methods
Partitioning Algorithms: Basic Concept
• Partitioning method: Partitioning a database D of n objects into a set of k
clusters, such that the sum of squared distances is minimized (where ci is the
centroid or medoid of cluster Ci)

E = Σ ik=1Σ p∈Ci ( p − ci ) 2

• Given k, find a partition of k clusters that optimizes the chosen partitioning


criterion
• Global optimal: exhaustively enumerate all partitions
• Heuristic methods: k-means and k-medoids algorithms
• k-means : Each cluster is represented by the center of the cluster
• k-medoids or PAM (Partition around medoids): Each cluster is represented
by one of the objects in the cluster
34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center point)
• Each point is assigned to the cluster with the closest centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


K-means Clustering – Details
• Initial centroids are often chosen randomly.
• Clusters produced vary from one run to another.
• The centroid is (typically) the mean of the points in the cluster.
• ‘Closeness’ is measured by Euclidean distance, cosine
similarity, etc.
• K-means will converge for common similarity measures
mentioned above.
• Most of the convergence happens in the first few iterations.
• Often the stopping condition is changed to ‘Until relatively few points
change clusters’
• Complexity is O( n * K * I * d )
• n = number of points, K = number of clusters,
I = number of iterations, d = number of attributes

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Two different K-means Clusterings
3

2.5

2 Original Points
1.5

y
1

0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x

3 3

2.5 2.5

2 2

1.5 1.5
y

y
1 1

0.5 0.5

0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x x

Optimal Clustering Sub-optimal Clustering


BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluating K-means Clusters
• Most common measure is Sum of Squared Error (SSE)
• For each point, the error is the distance to the nearest cluster
• To get SSE, we square these errors and sum them.
K
SSE = ∑ ∑ dist 2 ( mi , x )
i =1 x∈Ci

• x is a data point in cluster Ci and mi is the representative point for


cluster Ci
• can show that mi corresponds to the center (mean) of the cluster
• Given two clusters, we can choose the one with the smallest error
• One easy way to reduce SSE is to increase K, the number of clusters
• A good clustering with smaller K can have a lower SSE than a poor clustering
with higher K

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids
Iteration 1 Iteration 2 Iteration 3
3 3 3

2.5 2.5 2.5

2 2 2

1.5 1.5 1.5


y

y
1 1 1

0.5 0.5 0.5

0 0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x

Iteration 4 Iteration 5 Iteration 6


3 3 3

2.5 2.5 2.5

2 2 2

1.5 1.5 1.5


y

y
1 1 1

0.5 0.5 0.5

0 0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids

Iteration 6
1
2
3
4
5
3

2.5

1.5
y

0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
Iteration 1 Iteration 2
3 3

2.5 2.5

2 2

1.5 1.5
y

y
1 1

0.5 0.5

0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x x

Iteration 3 Iteration 4 Iteration 5


3 3 3

2.5 2.5 2.5

2 2 2

1.5 1.5 1.5


y

y
1 1 1

0.5 0.5 0.5

0 0 0

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 -2 -1.5 -1 -0.5 0 0.5 1 1.5 2
x x x

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Choosing Initial Centroids …
Iteration 5
1
2
3
4
3

2.5

1.5
y

0.5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2


x

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting one


centroid from each cluster is small.
• Chance is relatively small when K is large
• If clusters are the same size, n, then

• For example, if K = 10, then probability = 10!/1010 = 0.00036


• Sometimes the initial centroids will readjust themselves in ‘right’
way, and sometimes they don’t
• Consider an example of five pairs of clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


10 Clusters Example
Iteration 1 Iteration 2
8 8

6 6

4 4

2 2

0 0
y

y
-2 -2

-4 -4

-6 -6

0 5 10 15 20 0 5 10 15 20
x
Iteration 3 x
Iteration 4
8 8

6 6

4 4

2 2

0 0
y

y
-2 -2

-4 -4

-6 -6

0 5 10 15 20 0 5 10 15 20
x x
Starting with two initial centroids in one cluster of each pair of clusters
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
10 Clusters Example
Iteration 1 Iteration 2
8 8

6 6

4 4

2 2

0 0
y

y
-2 -2

-4 -4

-6 -6

0 5 10 15 20 0 5 10 15 20
Iteration 3 Iteration 4
8 x 8 x

6 6

4 4

2 2

0 0
y

-2 y -2

-4 -4

-6 -6

0 5 10 15 20 0 5 10 15 20
x x

Starting with some pairs of clusters having three initial centroids, while other have only one.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Solutions to Initial Centroids Problem

• Multiple runs
• Helps, but probability is not favorable
• Sample and use hierarchical clustering to determine
initial centroids
• Select more than k initial centroids and then select
among these initial centroids
• Select most widely separated
• Postprocessing

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Pre-processing and Post-processing
• Pre-processing
• Normalize the data
• Eliminate outliers

• Post-processing
• Eliminate small clusters that may represent outliers
• Split ‘loose’ clusters, i.e., clusters with relatively high SSE
• Merge clusters that are ‘close’ and that have relatively low SSE

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Variations of the K-Means Method
• Most of the variants of the k-means which differ in
• Selection of the initial k means
• Dissimilarity calculations
• Strategies to calculate cluster means

• Handling categorical data: k-modes


• Replacing means of clusters with modes
• Using new dissimilarity measures to deal with categorical objects
• Using a frequency-based method to update modes of clusters

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means

• K-means has problems when clusters are of differing


• Sizes
• Densities
• Non-globular shapes

• K-means has problems when the data contains outliers.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.


Find parts of clusters, but need to put together.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Overcoming K-means Limitations

Original Points K-means Clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Overcoming K-means Limitations

Original Points K-means Clusters

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Comments on the K-Means Method
• Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is #
iterations. Normally, k, t << n.
• Comparing: PAM: O(k(n-k)2 ), CLARA: O(ks2 + k(n-k))
• Comment: Often terminates at a local optimal.
• Weakness
• Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
• Need to specify k, the number of clusters, in advance
• Sensitive to noisy data and outliers
• Not suitable to discover clusters with non-convex shapes

56
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The K-Medoid Clustering Method
• K-Medoids Clustering: Find representative objects (medoids) in clusters
• PAM (Partitioning Around Medoids)
• Starts from an initial set of medoids and iteratively replaces one of
the medoids by one of the non-medoids if it improves the total
distance of the resulting clustering
• PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)
• Efficiency improvement on PAM
• CLARA : PAM on samples
• CLARANS : Randomized re-sampling

57
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
T2 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining”
Pearson Education

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


S2-20_DSECLZC415 : Data Mining
Lecture #12 – Cluster Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
Cluster Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
What is Cluster Analysis?
Finding groups of objects such that the objects in a group will be
similar (or related) to one another and different from (or
unrelated to) the objects in other groups
Inter-cluster
Intra-cluster distances are
distances are maximized
minimized

4
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters with
– high intra-class similarity
– low inter-class similarity
The quality of a clustering result depends on both the similarity
measure used by the method and its implementation
The quality of a clustering method is also measured by its ability
to discover some or all of the hidden patterns

5
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Clusterings
A clustering is a set of clusters

An important distinction among types of clusterings :


hierarchical and partitional sets of clusters

Partitional Clustering
– A division data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one subset
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Partitional Clustering Method K-Means
Strength: Efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations.
Normally, k, t << n.
• Compare to PAM (Partitioning around Medoids) : O(k(n-k)2 ), CLARA
(Clustering LARge Applications) : O(ks2 + k(n-k))
Comment: Often terminates at a local optimal.
Weakness
– Applicable only to objects in a continuous n-dimensional space
• Using the k-modes method for categorical data
• In comparison, k-medoids can be applied to a wide range of data
– Need to specify k, the number of clusters, in advance
– Sensitive to noisy data and outliers
– Not suitable to discover clusters with non-convex shapes

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4
x
5 6 7 8 9
8
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
0 1 2 3 4 x 5 6 7 8 9
August 14, 2021 9
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
10
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
11
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
12
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
13
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
14
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
15
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
16
K-means Example
12

x y
O1 2 10
10
O2 2 5
O3 8 4
O4 5 8 8

O5 7 5
O6 6 4
6
O7 1 2
y

O8 4 9
4

Clusters
Stabilized 2

0
August 14, 2021 0 1 2 3 4 x 5 6 7 8 9
17
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Hierarchical Methods
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree
Can be visualized as a dendrogram
– A tree like diagram that records the sequences of merges or splits

6 5
0.2
4
3 4
0.15 2
5
2
0.1

1
0.05
3 1

0
1 3 2 5 4 6

19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Strengths of Hierarchical Clustering
Do not have to assume any particular number of clusters
– Any desired number of clusters can be obtained by ‘cutting’ the
dendogram at the proper level

They may correspond to meaningful taxonomies


– Example in biological sciences (e.g., animal kingdom, phylogeny
reconstruction, …)
animal

vertebrate invertebrate

fish reptile amphib. mammal worm insect crustacean

20

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering
Two main types of hierarchical clustering
– Agglomerative:
• Start with the points as individual clusters
• At each step, merge the closest pair of clusters until only one cluster (or k
clusters) left

– Divisive:
• Start with one, all-inclusive cluster
• At each step, split a cluster until each cluster contains a point (or there
are k clusters)

Traditional hierarchical algorithms use a similarity or distance


matrix
– Merge or split one cluster at a time

21

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Agglomerative Clustering Algorithm
More popular hierarchical clustering technique

Basic algorithm is straightforward


1. Compute the proximity matrix
2. Let each data point be a cluster
3. Repeat
4. Merge the two closest clusters
5. Update the proximity matrix
6. Until only a single cluster remains

Key operation is the computation of the proximity of two clusters


– Different approaches to defining the distance between clusters
distinguish the different algorithms

22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Starting Situation
p1 p2 p3 p4 p5
• Start with clusters of individual p1
...

points and a proximity matrix p2


p3
p4
p5
.
.
. Proximity Matrix

...
p1 p2 p3 p4 p9 p10 p11 p12
23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Intermediate Situation
• After some merging steps, we have some
clusters C1 C2 C3 C4 C5
C1
C2
C3
C3
C4
C4
C5
Proximity Matrix
C1

C2 C5

...
p1 p2 p3 p4 p9 p10 p11 p12 24

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Intermediate Situation
• We want to merge the two closest
clusters (C2 and C5) and update the C1 C2 C3 C4 C5
proximity matrix. C1
C2
C3
C3 C4
C4 C5
Proximity Matrix
C1

C2 C5

... 25
p1 p2 p3 p4 p9 p10 p11 p12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Define Inter-Cluster Similarity

p1 p2 p3 p4 p5 ...
Similarity?
p1

p2
p3

p4

 MIN p5
.
 MAX
.
 Group Average Proximity Matrix
.
 Distance Between Centroids
 Other methods driven by an objective
function
26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1

p2
p3

p4
p5
 MIN
.
 MAX .
 Group Average .
Proximity Matrix
 Distance Between Centroids
 Other methods driven by an objective
function
27

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1

p2
p3

p4
p5
 MIN
.
 MAX .
 Group Average .
Proximity Matrix
 Distance Between Centroids
 Other methods driven by an objective
function
28

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Define Inter-Cluster Similarity

p1 p2 p3 p4 p5 ...
p1

p2
p3

p4
p5
 MIN
.
 MAX .
 Group Average .
Proximity Matrix
 Distance Between Centroids
 Other methods driven by an objective
function
29

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Clustering Example

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


How to Define Inter-Cluster Similarity
p1 p2 p3 p4 p5 ...
p1
× × p2
p3

p4
p5
 MIN
.
 MAX .
 Group Average .
Proximity Matrix
 Distance Between Centroids
 Other methods driven by an objective
function
31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


P1 P2 P3, P6 P4 P5
P1 0 .24 .22 .37 .34
P2 0 .15 .2 .14
P3, P6 0 .155 .28
P4 0 .29
P5 0

P1 P2,P5, P4
P1 P2,P5 P3,P6 P4
P3,P6
P1 0 .24 .22 .37
P1 0 .22 .37
P2,P5 0 .15 .2 P2,P5, 0 .155
P3,P6 0 .155 P3,P6
P4 0 P4 0

P1 P2,P5,P3,
P6,P4
Hierarchical Clustering
example - min P1 0 0.22
P2,P5,P3, 0
8/14/2021
P6,P4 32
Hierarchical Clustering: MIN
Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
5 Determined by one pair of points, i.e., by one link
1 in the proximity graph.
3
5 0.2
2 1
0.15
2 3 6
0.1

0.05
4
4 0
3 6 2 5 4 1

Nested Clusters Dendrogram


33

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Strength of MIN

Original Points Two Clusters

• Can handle non-elliptical shapes

34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of MIN

Original Points Two Clusters

• Sensitive to noise and outliers

35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


P1 P2 P3, P6 P4 P5
P1 0 .24 .23 .37 .34
P2 0 .25 .2 .14
P3, P6 0 .22 .39
P4 0 .29
P5 0

P1 P2,P5 P3,P6,
P1 P2,P5 P3,P6 P4 P4
P1 0 .34 .23 .37 P1 0 .34 .37
P2,P5 0 .39 .29 P2,P5 0 .39
P3,P6 0 .22
P3,P6, 0
P4 0 P4

P1, P3,
P2,P5 P6,P4
Hierarchical Clustering
example - max P1, P2,P5 0 0.39
P3, 0
8/14/2021
P6,P4 36
Hierarchical Clustering: MAX
Similarity of two clusters is based on the two least similar (most distant)
points in the different clusters
Determined by all pairs of points in the two clusters

4 1
2 5 0.4

0.35
5
2 0.3

0.25

3 6 0.2

3 0.15
1 0.1

4 0.05

0
3 6 4 1 2 5

Dendrogram
Nested Clusters
37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Strength of MAX

Original Points Two Clusters

• Less susceptible to noise and outliers


38

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Limitations of MAX

Original Points Two Clusters

•Tends to break large clusters


•Biased towards globular clusters
39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering: Group Average
Proximity of two clusters is the average of pairwise proximity between
points in the two clusters.

5 4 1
2 0.25

5 0.2
2
0.15

3 6 0.1

1 0.05

4 0
3 3 6 4 1 2 5

Nested Clusters Dendrogram


40

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering: Comparison
5
1 4 1
3
2 5
5 5
2 1 2
MIN MAX
2 3 6 3 6
3
1
4 4
4

5 4 1
2
5
2
Group Average
3 6
1
4
3
41

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MST: Divisive Hierarchical Clustering
• Build MST (Minimum Spanning Tree)
• Start with a tree that consists of any point
• In successive steps, look for the closest pair of points
(p, q) such that one point (p) is in the current tree but
the other (q) is not
• Add q to the tree and put an edge between p and q

42

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


MST: Divisive Hierarchical Clustering
• Use MST for constructing hierarchy of clusters

43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering: Time and Space requirements

O(N2) space since it uses the proximity matrix.


– N is the number of points.

O(N3) time in many cases


– There are N steps and at each step the size, N2, proximity matrix must be updated and
searched
– Complexity can be reduced to O(N2 log(N) ) time for some approaches

44

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Hierarchical Clustering: Problems and Limitations

Once a decision is made to combine two clusters, it cannot be


undone

No objective function is directly minimized

Different schemes have problems with one or more of the


following:
– Sensitivity to noise and outliers
– Difficulty handling different sized clusters and convex shapes
– Breaking large clusters

45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Density based Cluster Analysis


Density-Based Clustering Methods
Clustering based on density (local cluster criterion), such as density-connected points

Major features:
– Discover clusters of arbitrary shape
– Handle noise
– One scan
– Need density parameters as termination condition

Several interesting studies:


– DBSCAN: Ester, et al. (KDD’96)
– OPTICS: Ankerst, et al (SIGMOD’99).
– DENCLUE: Hinneburg & D. Keim (KDD’98)
– CLIQUE: Agrawal, et al. (SIGMOD’98) (more grid-based)

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Density-Based Clustering: Basic Concepts
Two parameters:
– Eps: Maximum radius of the neighbourhood
– MinPts: Minimum number of points in an Eps-neighbourhood of
that point
NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
Directly density-reachable: A point p is directly density-reachable from
a point q w.r.t. Eps, MinPts if
– p belongs to NEps(q)
– core point condition: p MinPts = 5
|NEps (q)| ≥ MinPts Eps = 1 cm
q
48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Density-Reachable and Density-Connected
Density-reachable:
– A point p is density-reachable from a
p
point q w.r.t. Eps, MinPts if there is a
chain of points p1, …, pn, p1 = q, pn = p p1
such that pi+1 is directly density- q
reachable from pi

Density-connected
– A point p is density-connected to a point p q
q w.r.t. Eps, MinPts if there is a point o
such that both, p and q are density-
reachable from o w.r.t. Eps and MinPts o

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DBSCAN
DBSCAN is a density-based algorithm.
– Density = number of points within a specified radius (Eps)

– A point is a core point if it has more than a specified number of


points (MinPts) within Eps
• These are points that are at the interior of a cluster

– A border point has fewer than MinPts within Eps, but is in the
neighborhood of a core point

– A noise point is any point that is not a core point or a border point.

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DBSCAN: Core, Border, and Noise Points
MinPts = 7 (including self)

51
8/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
DBSCAN Algorithm
• Eliminate noise points
• Perform clustering on the remaining points

52

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DBSCAN: Core, Border and Noise Points

Original Points Point types: core, border


and noise

Eps = 10, MinPts = 4


53

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


When DBSCAN Works Well

Original Points
Clusters
• Resistant to Noise
• Can handle clusters of different shapes and sizes
54

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DBSCAN: Sensitive to Parameters

DBSCAN does not work well for


Varying densities
High-dimensional data

55

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


DBSCAN: Determining EPS and MinPts

Idea is that for points in a cluster, their kth nearest neighbors are
at roughly the same distance
Noise points have the kth nearest neighbor at farther distance
So, plot sorted distance of every point to its kth nearest neighbor

56

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

57
August 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415: Data Mining
(Lecture #13 – Cluster Analysis)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
Cluster Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Types of Clusterings

Partitional Clustering
– A division of data objects into non-overlapping subsets
(clusters) such that each data object is in exactly one
subset
– Divisions can be
• Distance based
• Density based
Hierarchical clustering
– A set of nested clusters organized as a hierarchical tree

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Comparison of DBSCAN and K-means

Both are partitional.


K-means is complete; DBSCAN is not.
K-means has a prototype-based notion of a cluster; DBSCAN
uses a density-based notion.
K-means can find clusters that are not well-separated. DBSCAN
will merge clusters that touch.
DBSCAN handles clusters of different shapes and sizes; K-means
prefers globular clusters.
DBSCAN can handle noise and outliers; K-means performs
poorly in the presence of outliers

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Comparison of DBSCAN and K-means

K-means can only be applied to data for which a centroid is


meaningful; DBSCAN requires a meaningful definition of density
Both techniques were designed for Euclidean data, but extended
to other types of data
K-means has an O(n) time complexity; DBSCAN is O(n^2)
Because of random initialization, the clusters found by K-means
can vary from one run to another; DBSCAN always produces the
same clusters
DBSCAN automatically determines the number of clusters; K-
means does not
K-means has only one parameter, DBSCAN has two.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

OPTICS
Challenges with DBSCAN
• DBSCAN requires users with the responsibility of selecting
parameter values for the discovery of acceptable clusters.
• Eps (the maximum radius of a neighborhood) and
• MinPts (the minimum number of points required in the
neighborhood of a core object)
• The parameters are empirically set and difficult to determine
especially for real-world, high dimensional data sets.
• Sensitive to these parameter values: Slightly different settings may
lead to very different clustering of the data.
• The real-world, high-dimensional data can have skewed
distributions such that their intrinsic clustering structure may not be
well characterized by a single set of global density parameters.
8
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OPTICS: A Cluster-Ordering Method (1999)
OPTICS: Ordering Points To Identify the Clustering Structure
– Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
– Produces a special order of the database wrt its density-
based clustering structure
– This cluster-ordering contains info equivalent to the
density-based clusterings corresponding to a broad range
of parameter settings
– Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
– Can be represented graphically or using visualization
techniques
9

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


OPTICS algorithm

Ordering points to identify the clustering structure (OPTICS) is


an algorithm for finding density-based clusters
OPTICS overcomes the (DBSCAN) problem of detecting
meaningful clusters in data of varying density
Points of the dataset are (linearly) ordered such that spatially
closest points become neighbors. A special distance is plotted
for each point

10
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OPTICS algorithm

• The core-distance of an object p is the smallest value Eps’ such that


the
– Eps’-neighborhood of p has at least MinPts objects.
• That is, Eps’ is the minimum distance threshold that makes p a core
object.
• If p is not a core object with respect to Eps and MinPts, the core-
distance of p is undefined.
• The reachability-distance to object p from q is the minimum radius
value that makes p density-reachable from q.
• According to the definition of density-reachability, q has to be a core object
and p must be in the neighborhood of q.
• Therefore, the reachability-distance from q to p is max(core-distance(q), dist(p,
q). If q is not a core object with respect to Eps and MinPts, the reachability-
distance to p from q is undefined.
11
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
OPTICS: Some Extension from DBSCAN
Complexity: O(NlogN)
Core Distance:
– min eps s.t. point is core
Reachability Distance Eps

Max (core-distance (o), d(o, p))


r(p1, o) = 3cm. r(p2,o) = 4cm
p1

o
p2 MinPts = 5
ε = 3 cm
12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


OPTICS Core & Reachability Distance

13
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Reachability-distance

undefined
ε
ε
ε ‘

Cluster-order of the objects


14

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Extensions to Hierarchical Clustering


Limitations to Hierarchical Clustering

• Major weakness of agglomerative clustering methods


• Can never undo what was done previously
• Do not scale well: time complexity of at least O(n2), where
n is the number of total objects

• Integration of hierarchical & distance-based clustering


• BIRCH : uses CF-tree and incrementally adjusts the quality
of sub-clusters

16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BIRCH (Balanced Iterative Reducing
and Clustering Using Hierarchies)
• Zhang, Ramakrishnan & Livny, SIGMOD’96
• Incrementally construct a CF (Clustering Feature) tree, a hierarchical data
structure for multiphase clustering
• Phase 1: scan DB to build an initial in-memory CF tree (a multi-level
compression of the data that tries to preserve the inherent clustering
structure of the data)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of
the CF-tree
• Scales linearly: finds a good clustering with a single scan and improves the
quality with a few additional scans
• Weakness: handles only numeric data, and sensitive to the order of the data
record

17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering Feature Vector in BIRCH
Clustering Feature (CF): CF = (N, LS, SS)
N: Number of data points
N
LS: linear sum of N points: ∑ X i
i =1

CF = (5, (16,30),(54,190))
SS: square sum of N points
N 2 10
(3,4)
∑ Xi
9

(2,6)
8

i =1
7

4 (4,5)
(4,7)
3

(3,8)
0
0 1 2 3 4 5 6 7 8 9 10

18

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


CF-Tree in BIRCH
Clustering feature:
– Summary of the statistics for a given subcluster: the 0-th, 1st,
and 2nd moments of the subcluster from the statistical point of
view
– Registers crucial measurements for computing cluster and
utilizes storage efficiently
A CF tree is a height-balanced tree that stores the clustering features
for a hierarchical clustering
– A nonleaf node in a tree has descendants or “children”
– The nonleaf nodes store sums of the CFs of their children
A CF tree has two parameters
– Branching factor: max # of children
– Threshold: max diameter of sub-clusters stored at the leaf
nodes
19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


The CF Tree Structure

• Each non-leaf node has at


most B entries

• Each leaf node has at most


L CF entries which each
satisfy threshold T

• Node size is determined by


dimensionality of data
space and input parameter
P (page size)

20
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The CF Tree Structure

Root Each non-leaf node contains


at most B entries. A leaf node
contains at most L entries

B=7 CF1 CF2 CF3 CF6


child1 child2 child3 child6
L=6

Non-leaf node
CF1 CF2 CF3 CF5
child1 child2 child3 child5

Leaf node Leaf node


prev CF1 CF2 CF6 next prev CF1 CF2 CF4 next
21

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BIRCH Steps
The first phase builds a CF tree out of the data points, a height-
balanced tree data structure
In the second step, the algorithm scans all the leaf entries in the initial
CF tree to rebuild a smaller CF tree, while removing outliers and
grouping crowded subclusters into larger ones.
In step three an existing clustering algorithm is used to cluster all leaf
entries. Here algorithm is applied directly to the subclusters
represented by their CF vectors.
In (optional) step 4 the centroids of the clusters produced in step 3 are
used as seeds and redistribute the data points to its closest seeds to
obtain a new set of clusters

22
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
The Birch Algorithm
Cluster Diameter 1 2
∑ (x − x )
n( n − 1) i j

In first phase, For each point in the input


– Find closest leaf entry
– Add point to leaf entry and update CF
– If entry diameter > max_diameter, then split leaf, and possibly parents
Algorithm is O(n)
Concerns
– Sensitive to insertion order of data points
– Since we fix the size of leaf nodes, so clusters may not be so natural
– Clusters tend to be spherical given the radius and diameter measures

23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Centroid, Radius and Diameter of a Cluster
• Centroid: the “middle” of a cluster

• Radius: square root of average distance from any point of the


cluster to its centroid

• Diameter: square root of average mean squared distance


between all pairs of points in the cluster

24
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Cluster Validation
Cluster Validity

For supervised classification we have a variety of measures to evaluate


how good our model is
– Accuracy, precision, recall

For cluster analysis, the analogous question is how to evaluate the


“goodness” of the resulting clusters?

But “clusters are in the eye of the beholder”!

Then why do we want to evaluate them?


– To avoid finding patterns in noise
– To compare clustering algorithms
– To compare two sets of clusters
– To compare two clusters

26
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clusters found in Random Data
1 1

0.9 0.9

Random 0.8

0.7
0.8

0.7
Points 0.6 0.6 DBSCAN
0.5 0.5
y

y
0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
10 0.2 0.4 0.6 0.8 1 10 0.2 0.4 0.6 0.8 1

0.9
x 0.9
x

K-means 0.8

0.7
0.8

0.7
Complete
0.6 0.6
Link
0.5 0.5
y

0.4 0.4

0.3 0.3

0.2 0.2

0.1 0.1

0 0
0 0.2 0.4 0.6 0.8 1 0 0.2 0.4 0.6 0.8 1
x x 27
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Different Aspects of Cluster Validation
1. Determining the clustering tendency of a set of data, i.e., distinguishing
whether non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g.,
to externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without
reference to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.

For 2, 3, and 4, we can further distinguish whether we want to evaluate the


entire clustering or just individual clusters.

28
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering Tendency
All clustering algorithms find some clusters, whether data has natural clusters
or not.
We need a mechanism to check if at least some clusters are of good quality
Alternatively, we can directly check the data for clustering tendency. A
common approach is to use statistical tests for spatial randomness (in
Euclidean space) among data points.
Hopkins Statistic : Use p points from data and generate a set of p random
points in data space. Let ui be nearest neighbor distances in generated data
and wi be nearest neighbor distances in supplied data. Hopkin statistic H is
defined as

H is closer to 0 => data is highly clustered;


H is closer to 0.5 => data is uniformly distributed in data space
29
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Evaluation measures for Cluster Validity
Unsupervised Measures without resort to external information, also referred
internal indices e.g. SSE.
Measures can be
– Cluster Cohesion (compactness, tightness) - how closely related the objects in a
cluster are?
– Cluster Separation (isolation) - how distinct or well-separated a cluster is from
other clusters
Supervised Measures match the clustering output to some external structure,
also referred external indices e.g. entropy
– Measures how well cluster labels match externally supplied class labels.
Relative - compare different clustering outcomes. can be based on supervised
or unsupervised measures, e.g. two K-means clustering outputs can be
compared using either SSE or entropy.

30
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: SSE
Clusters in more complicated figures aren’t well separated
Internal Index: Used to measure the goodness of a clustering structure
without respect to external information
– SSE
SSE is good for comparing two clusterings or two clusters (average SSE).
Can also be used to estimate the number of clusters
10
6 9

8
4
7
2
6

SSE
0 5

4
-2
3

-4 2

1
-6
0
5 10 15 2 5 10 15 20 25 30
K 31
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: Cohesion and Separation
Cluster Cohesion: Measures how closely related are objects in a cluster
– Example: SSE
Cluster Separation: Measure how distinct or well-separated a cluster is
from other clusters
– Example: Squared Error
– Cohesion is measured by the within cluster sum of squares (SSE)

WSS = ∑∑i x∈Ci


( x − ci ) 2

– Separation is measured by the between cluster sum of squares


BSS = ∑ i
C i ( c − ci ) 2
• Where |Ci| is the size of cluster i

32
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Internal Measures: Cohesion and Separation
• Example: SSE
• BSS + WSS = constant
m
× × ×
1 m1 2 3 4 m2 5

K=1 cluster: SSE = WSS= (1 − 3) 2 + ( 2 − 3) 2 + ( 4 − 3) 2 + (5 − 3) 2 = 10


BSS= 4 × (3 − 3) 2 = 0
Total = 10 + 0 = 10

K=2 clusters: SSE = WSS= (1 − 1.5)2 + (2 − 1.5)2 + (4 − 4.5)2 + (5 − 4.5)2 = 1


BSS= 2 × (3 − 1.5)2 + 2 × (4.5 − 3)2 = 9
Total = 1 + 9 = 10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Measuring Clustering Quality: Extrinsic Methods
Clustering quality measure: Q(C, Cg), for a clustering C given the ground
truth Cg.
Q is good if it satisfies the following 4 essential criteria
– Cluster homogeneity: the purer, the better
– Cluster completeness: should assign objects belong to the same
category in the ground truth to the same cluster
– Rag bag: putting a heterogeneous object into a pure cluster
should be penalized more than putting it into a rag bag (i.e.,
“miscellaneous” or “other” category)
– Small cluster preservation: splitting a small category into pieces
is more harmful than splitting a large category into pieces

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


34
Measuring Clustering Quality
Two methods: extrinsic vs. intrinsic
Extrinsic: supervised, i.e., the ground truth is available
– Compare a clustering against the ground truth using certain
clustering quality measure
– Ex. precision and recall metrics (averaged over all classes)
Intrinsic: unsupervised, i.e., the ground truth is unavailable
– Evaluate the goodness of a clustering by considering how
well the clusters are separated, and how compact the
clusters are
– Ex. Silhouette coefficient

35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification-based Measures of Cluster Validity
Precision: The fraction of a cluster that consists of objects of a
specified class. The precision of cluster i with respect to a class j is
precision(i,j) = pij

Recall: The extent to which a cluster contains all objects of a specified


class. The recall of cluster i with respect to class j is recall(i, j) = mij/mj
where mj is the number of objects in class j.

F-measure: A combination of both precision and recall that measures


the extent to which a cluster contains only objects of a particular class
and all objects of that class. The F-measure of cluster i with respect to
class j is
F(i,j) = (2*precision(i,j)*recall(i,j))/(precision(i,j)+recall(i,j)) 36
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
External Measures of Cluster Validity: Precision/Recall

K-means clustering result for a newspaper articles document data set


For Cluster 1 and Metro class, For Cluster 3 and Sports class,
Precision = 506/677 = 0.75 Precision = 671/685 = 0.98
Recall = 506/943 = 0.54 Recall = 671/738 = 0.91
F-value = 0.63 F-value = 0.94 37
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring Cluster Validity Via Correlation

 Two matrices
– Proximity Matrix
– Ideal Similarity Matrix
 One row and one column for each data point
 An entry is 1 if the associated pair of points belong to the same cluster
 An entry is 0 if the associated pair of points belongs to different clusters
 Compute the correlation between the two matrices
– Since the matrices are symmetric, only the correlation between
n(n-1) / 2 entries needs to be calculated.
 High correlation indicates that points that belong to the
same cluster are close to each other.
 Not a good measure for some density or contiguity based
clusters.
38
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Determine the Number of Clusters
Empirical method
– # of clusters ≈√(n/2) for a dataset of n points
Elbow method
– Use the turning point in the curve of sum of within cluster variance w.r.t
the # of clusters
Cross validation method
– Divide a given data set into m parts
– Use m – 1 parts to obtain a clustering model
– Use the remaining part to test the quality of the clustering
• E.g., For each point in the test set, find the closest centroid, and use
the sum of squared distance between all points in the test set and
the closest centroids to measure how well the model fits the test set
– For any k > 0, repeat it m times, compare the overall quality measure
w.r.t. different k’s, and find # of clusters that fits the data the best

39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Final Comment on Cluster Validity
“The validation of clustering structures is the most difficult and
frustrating part of cluster analysis.

Without a strong effort in this direction, cluster analysis will


remain a black art accessible only to those true believers who
have experience and great courage.”

Algorithms for Clustering Data, Jain and Dubes

40
8/28/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers

41
August 28, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415: Data Mining
(Lecture #14 – Outlier Analysis)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
Outlier Analysis
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
What Are Outliers/Anomalies?
Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
Outliers are different from the noise data
– Noise is random error or variance in a measured variable
– Noise should be removed before outlier detection
Outliers are interesting: It violates the mechanism that generates the normal
data

Applications:
– Credit card fraud detection
– Telecom fraud detection
– Customer segmentation
– Medical analysis
4

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Importance of Anomaly Detection
Ozone Depletion History
In 1977 three researchers (Farman, Gardinar
and Shanklin) were puzzled by data
gathered by the British Antarctic Survey
showing that ozone levels for Antarctica
had dropped 10% below normal levels
Why did the Nimbus 7 satellite, which had
instruments aboard for recording ozone
levels, not record similarly low ozone
concentrations? The researchers held back
publishing their work for nearly a decade.
The ozone concentrations recorded by the
satellite were so low they were being
Sources:
treated as outliers by a computer program “Cosmic Imagery: Key Images in the History of Science” By
and discarded! John D. Barrow
[Link]
[Link]

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Key components
associated with an
anomaly detection
technique

Anomaly Detection : A Survey by


Varun Chandola, Arindam Banerjee
And Vipin Kumar University of Minnesota
ACM Computing Surveys, September 2009

6
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
More on Outlier/Anomaly Detection
Challenges
– How many outliers are there in the data?
– Method is unsupervised (sometimes supervised methods are used)
• Validation can be quite challenging (just like for clustering)
– Finding needle in a haystack

Working assumption:
– There are considerably more “normal” observations than “abnormal”
observations (outliers/anomalies) in the data

Outlier detection vs. novelty detection (identify new topics and trends
in a timely manner in social media): early stage, outlier; but later
merged into the model

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Outliers
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
– Object is Og if it significantly deviates from the rest of the data set
– Ex. Intrusion detection in computer networks
– Issue: Find an appropriate measurement of deviation
Contextual outlier (or conditional outlier)
– Object is Oc if it deviates significantly based on a selected context
– Ex. 80o F in Urbana: outlier? (depending on summer or winter?)
– Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
– Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
– Issue: How to define or formulate meaningful context?

8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Outliers (Contd.)
Collective Outliers
– A subset of data objects collectively deviate significantly from the whole
data set, even if the individual data objects may not be outliers
– Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service
packages to each other
• Detection of collective outliers
• Consider not only behavior of individual objects, but also that of
groups of objects
• Need to have the background knowledge on the relationship among
data objects, such as a distance or similarity measure on objects.
• A data set may have multiple types of outlier
• One object may belong to more than one type of outlier

9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Outlier Detection Approaches


Outlier Detection Methods
Two ways to categorize outlier detection methods:
– Based on whether user-labeled examples of outliers
can be obtained:
• Supervised, semi-supervised vs. unsupervised methods

– Based on assumptions about normal data and


outliers:
• Statistical, proximity-based, and clustering-based methods

11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outlier Detection I: Supervised Methods
Outlier Detection I: Supervised Methods
– Modeling outlier detection as a classification problem
• Samples examined by domain experts used for training & testing
– Methods for Learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as
outliers, or
• Model outliers and treat those not matching the model as normal
– Challenges
• Imbalanced classes, i.e., outliers are rare: Boost the outlier class
and make up some artificial outliers
• Catch as many outliers as possible, i.e., recall is more important
than accuracy (i.e., not mislabeling normal objects as outliers)

12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outlier Detection II: Unsupervised Methods
Assume the normal objects are somewhat ``clustered'‘ into multiple groups, each having
some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
– Normal objects may not share any strong patterns, but the collective outliers may
share high similarity in a small area
Ex. In some intrusion or virus detection, normal activities are diverse
– Unsupervised methods may have a high false positive rate but still miss many real
outliers.
– Supervised methods can be more effective, e.g., identify attacking some key resources

Many clustering methods can be adapted for unsupervised methods


– Find clusters, then outliers: not belonging to any cluster
– Problem 1: Hard to distinguish noise from outliers
– Problem 2: Costly since first clustering: but far less outliers than normal objects
• Newer methods: tackle outliers directly
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Outlier Detection III: Semi-Supervised Methods
Situation: In many applications, the number of labeled data is often small:
Labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
If some labeled normal objects are available
– Use the labeled examples and the proximate unlabeled objects to train a
model for normal objects
– Those not fitting the model of normal objects are detected as outliers
If only some labeled outliers are available, a small number of labeled outliers
many not cover the possible outliers well
– To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods

14
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Contextual Outliers:
Transform into Conventional Outlier Detection
If the contexts can be clearly identified, transform it to conventional outlier
detection
1. Identify the context of the object using the contextual attributes
2. Calculate the outlier score for the object in the context using a
conventional outlier detection method
Ex. Detect outlier customers in the context of customer groups
– Contextual attributes: age group, postal code
– Behavioral attributes: # of trans/yr, annual total trans. amount
Steps:
(1) locate c’s context,
(2) compare c with the other customers in the same group, and
(3) use a conventional outlier detection method

15
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Contextual Outliers:
Modeling Normal Behavior with Respect to Contexts
In some applications, one cannot clearly partition the data into contexts
– Ex. if a customer suddenly purchased a product that is unrelated to those
she recently browsed, it is unclear how many products browsed earlier
should be considered as the context
Model the “normal” behavior with respect to contexts
– Using a training data set, train a model that predicts the expected
behavior attribute values with respect to the contextual attribute values
– An object is a contextual outlier if its behavior attribute values
significantly deviate from the values predicted by the model
Using a prediction model that links the contexts and behavior, these methods
avoid the explicit identification of specific contexts
Methods: A number of classification and prediction techniques can be used to
build such models, such as regression, Markov Models, and Finite State
Automaton
16
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers :
On the Set of “Structured Objects”
 Collective outlier - objects as a group deviate from the
entire data
 Need to examine the structure of the data set,
 i.e, the relationships between multiple data objects
 Each of these structures is inherent to its respective type
of data
 For temporal data (such as time series and
sequences)
 explore the structures formed by time, which occur in
segments of the time series or subsequences
 For spatial data, explore local areas
 For graph and network data, we explore subgraphs

17
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers :
On the Set of “Structured Objects”
 Difference from the contextual outlier detection: the structures are often
not explicitly defined, and have to be discovered as part of the outlier
detection process.
 Collective outlier detection methods: two categories
 Reduce the problem to conventional outlier detection
 Identify structure units, treat each structure unit (e.g.,
subsequence, time series segment, local area, or subgraph) as a
data object, and extract features
 Then outlier detection on the set of “structured objects”
constructed as such using the extracted features
 e.g. Detect collective outliers in online social network of customers
 Treat each possible subgraph of the network as a structure unit
 Collective outlier: An outlier subgraph in the social network
 Small subgraphs that are of very low frequency

 Large subgraphs that are surprisingly frequent

18
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Mining Collective Outliers II:
Direct Modeling of the Expected Behavior of Structure Units
Model the expected behavior of structure units directly
– e.g. Detect collective outliers in temporal sequences
• Learn a Markov model from the sequences
• A subsequence can then be declared as a collective outlier if it significantly deviates
from the model

Collective outlier detection is subtle due to the challenge of exploring the


structures in data
– The exploration typically uses heuristics, and thus may be application
dependent
– The computational cost is often high due to the sophisticated mining
process

19
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Challenges of Outlier Detection
• Modeling normal objects and outliers properly
• Hard to enumerate all possible normal behaviors in an application
• The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
• Choice of distance measure among objects and the model of relationship
among objects are often application-dependent
• E.g., clinic data: a small deviation could be an outlier; while in marketing
analysis, larger fluctuations
• Handling noise in outlier detection
• Noise may distort the normal objects and blur the distinction between
normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
• Understandability
• Understand why these are outliers: Justification of the detection
• Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism

20
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Statistical Outliers
Statistical Approaches
Statistical approaches assume that the objects in a data set are
generated by a stochastic process (a generative model)
– The effectiveness of statistical methods highly depends on whether the
assumptions made for the statistical model hold true for the given data
Statistic models used in the methods may be parametric or
nonparametric.
– A parametric method assumes that the normal data objects are
generated by a parametric distribution
– A nonparametric method does not assume an a priori statistical
model. Instead, a nonparametric method tries to determine the
model from the input data.

22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Discordancy test
The statistical distribution-based approach identifies outliers with respect to
the model using a discordancy test.
A statistical discordancy test examines first a working hypothesis. A working
hypothesis, H, is a statement that the entire data set of n objects comes from
an initial distribution model, F, that is,
H : oi ϵ F, where i=1,2,….n
The hypothesis is retained if there is no statistically significant evidence
supporting its rejection
A discordancy test verifies whether an object, oi, is significantly large (or
small) in relation to the distribution F
The result is very much dependent on which model F is chosen because oi
may be an outlier under one model and a perfectly valid value under another.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Alternative distributions
Inherent alternative distribution: In this case, the working
hypothesis that all of the objects come from distribution F is
rejected in favor of the alternative hypothesis that all of the
objects arise from another distribution, G:

Ha : oi ϵ G, where i=1,2,….n

F and G may be different distributions or differ only in


parameters of the same distribution. For example, it may have a
different mean or dispersion, or a longer tail.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Alternative distributions
Mixture alternative distribution: The mixture alternative states
that discordant values are not outliers in the F population, but
contaminants from some other population, G. In this case, the
alternative hypothesis is

Ha : oi ϵ (1-λ)F + λG , where i=1,2,….n


Slippage alternative distribution: This alternative states that all
of the objects (apart from some prescribed small number) arise
independently from the initial model, F, with its given
parameters, whereas the remaining objects are independent
observations from a modified version of F in which the
parameters have been shifted.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Statistical Approaches – Parametric Methods
Assumes that the normal data is generated by a parametric
distribution with parameter θ
The probability density function of the parametric distribution f(x, θ)
gives the probability that object x is generated by the distribution
The smaller this value, the more likely x is an outlier
The parametric distribution can be normal distribution with a mean
and variance.

Outliers are points where


probability of occurrence is
below a threshold.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Parametric Methods: Univariate Outliers
• Univariate data: A data set involving only one attribute or variable
• Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability
as outliers
• Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
• Use the maximum likelihood method to estimate μ and σ

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Parametric Methods: Univariate Outliers

Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}

 For the above data with n = 10, we have

 Then (24 – 28.61) /1.51 = – 3.04 < –3, 24 is an outlier since

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Visual Approach
A straightforward method for statistical outlier detection can also
be used in visualization, e.g., the boxplot method plots the
univariate input data using a five-number summary
the smallest nonoutlier value (Min),
the lower quartile (Q1),
the median (Q2),
the upper quartile (Q3), and
the largest nonoutlier value (Max).
The interquantile range (IQR) is defined as Q3 − Q1. Any object
that is more than 1.5 × IQR smaller than Q1 or 1.5 × IQR larger
than Q3 is treated as an outlier because the region between Q1 −
1.5 × IQR and Q3 + 1.5 × IQR contains 99.3% of the objects. The
rationale is similar to using 3σ as the threshold for normal
distribution

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Parametric Methods:
Detection of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or variables
Transform the multivariate outlier detection task into a univariate outlier detection
problem
Method 1. Compute Mahalanobis distance
– Mahalanobis distance is a measure of the distance between a point P and a
distribution D.
– This distance is zero if P is at the mean of D, and grows as P moves away from
the mean: along each principal component axis, it measures the number of
standard deviations from P to the mean of D
Method 2. Use χ2 –statistic:
– where Ei is the mean of the i-dimension among all objects, and n is the
dimensionality
– If χ2 –statistic is large, then object oi is an outlier

30
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Parametric Methods:
Using Mixture of Parametric Distributions

Assuming data generated by a normal distribution could


be sometimes overly simplified
Example (right figure): The objects between the two
clusters cannot be captured as outliers since they are
close to the estimated mean

• To overcome this problem, assume the normal data is generated by two normal
distributions. For any object o in the data set, the probability that o is generated
by the mixture of the two distributions is given by

• where fθ1 and fθ2 are the probability density functions of θ1 and θ2
• Then use EM algorithm to learn the parameters μ1, σ1, μ2, σ2 from data
• An object o is an outlier if it does not belong to any cluster

31
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Detecting outliers
There are two basic types of procedures for detecting outliers:

Block procedures: In this case, either all of the suspect objects are treated
as outliers or all of them are accepted as consistent.

Consecutive (or sequential) procedures: e.g. inside-out procedure. The


idea is that the object that is least "likely" to be an outlier is tested first. If
it is found to be an outlier, then all of the more extreme values are also
considered outliers; otherwise, the next most extreme object is tested,
and so on.
(This procedure tends to be more effective than block procedures.)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Non-Parametric Methods: Detection Using Histogram

The model of normal data is learned from the input


data without any a priori structure.
Often makes fewer assumptions about the data,
and thus can be applicable in more scenarios
Outlier detection using histogram:

• Figure shows the histogram of purchase amounts in transactions


• A transaction in the amount of $7,500 is an outlier, since only 0.2%
transactions have an amount higher than $5,000
• Problem: Hard to choose an appropriate bin size for histogram
• Too small bin size → normal objects in empty/rare bins, false positive
• Too big bin size → outliers in some frequent bins, false negative
33
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Proximity Based Outliers


Proximity-Based Approaches:
Distance-Based vs. Density-Based Outlier Detection

Intuition: Objects that are far away from the others are outliers
Assumption of proximity-based approach: The proximity of an
outlier deviates significantly from that of most of the others in
the data set
Two types of proximity-based outlier detection methods
– Distance-based outlier detection: An object o is an outlier if
its neighborhood does not have enough other points
– Density-based outlier detection: An object o is an outlier if
its density is relatively much lower than that of its
neighbors

35
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection

For each object o, examine the # of other objects in the r-


neighborhood of o, where r is a user-specified distance
threshold
An object o is an outlier if most (taking π as a fraction threshold)
of the objects in D are far away from o, i.e., not in the r-
neighborhood of o
An object o is a DB(r, π) outlier if
Equivalently, one can check the distance between o and its k-th
nearest neighbor ok, where .
o is an outlier if dist(o, ok) > r

36
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection
Efficient computation: Nested loop
algorithm
– For any object oi, calculate its
distance from other objects, and
count the # of other objects in
the r-neighborhood.
– If π∙n other objects are within r
distance, terminate the inner
loop
– Otherwise, oi is a DB(r, π)
outlier
Efficiency: Actually CPU time is not O(n2)
but linear to the data set size since for
most non-outlier objects, the inner loop
terminates early

37
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: Improving Algorithm
Why efficiency is still a concern? When the complete set of objects
cannot be held into main memory, cost of I/O swapping will be high
The major cost:
(1) each object tests against the whole data set, why not only its
close neighbor?
(2) instead of checking objects one by one, why not group by
group?
Grid-based method (CELL): Data space is partitioned into a multi-D
grid. Only adjoining cells are checked for determining if object is an
outlier

38
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: A Grid-Based Method

Grid-based method (CELL): Data space is partitioned into


a multi-D grid. Each cell is a hyper cube with diagonal
length r/2

• Pruning using the level-1 & level 2 cell properties:


• For any possible point x in cell C and any possible
point y in a level-1 cell, dist(x,y) ≤ r
• For any possible point x in cell C and any point y
such that dist(x,y) ≥ r, y is in a level-2 cell

• Thus we only need to check the objects that cannot be pruned, and even for
such an object o, only need to compute the distance between o and the
objects in the level-2 cells (since beyond level-2, the distance from o is more
than r)

39
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Distance-Based Outlier Detection: Limitations

Distance-based outliers, such as -outliers, are just one type of


outlier
Distance-based outlier detection takes a global view of the data set
-outlier, for example, is far (as quantified by parameter r) from at
least of the objects in the data set. In other words, an
outlier as such is remote from the majority of the data.
To detect distance-based outliers, we need two global parameters, r and π,
which are applied to every outlier object.
Many real-world data sets demonstrate a more complex structure, where
objects may be considered outliers with respect to their local neighborhoods,
rather than with respect to the global data distribution.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Density-Based Outlier
Density-Based Outlier Detection
Local outliers: Outliers comparing to their local neighborhoods, instead of the
global data distribution
In Fig., o1 and o2 are local outliers to C1, o3 is a global outlier, but o4 is not an
outlier. However, proximity-based clustering cannot find o1 and o2 are outlier
(e.g., comparing with O4).

• Intuition (density-based outlier detection):


The density around an outlier object is
significantly different from the density around
its neighbors
• Method: Use the relative density of an object
against its neighbors as the indicator of the
degree of the object being outliers

42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Density-Based Outlier Detection
Local outliers: Outliers comparing to their local
neighborhoods, instead of the global data
distribution
In Fig., o1 and o2 are local outliers to C1, o3 is a global
outlier, but o4 is not an outlier. However, proximity-
based clustering cannot find o1 and o2 are outlier
(e.g., comparing with O4).

• Intuition (density-based outlier detection): The density around an outlier object


is significantly different from the density around its neighbors
• Method: Use the relative density of an object against its neighbors as the
indicator of the degree of the object being outliers

43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Local Reachability Density
k-distance of an object o, distk(o): distance between o and its k-th NN
k-distance neighborhood of o, Nk(o) = {o’| o’ in D, dist(o, o’) ≤ distk(o)}
– Nk(o) could be bigger than k since multiple objects may have identical
distance to o
Reachability distance from o’ to o:

where k is a user-specified parameter


that controls the smoothing effect

Local reachability density of o:

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Local Outlier Factor: LOF

• LOF (Local outlier factor) of an object o is the average of the ratio of local
reachability of o and those of o’s k-nearest neighbors

• the local outlier factor is the average of the ratio of the local reachability
density of o and those of o's k-nearest neighbors
• The lower the local reachability density of o, and the higher the local
reachability density of the k-NN of o, the higher LOF
• This captures a local outlier whose local density is relatively low comparing to
the local densities of its k-NN

45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Clustering-Based

Basic idea:
– Cluster the data into groups of
different density
– Choose points in small cluster as
candidate outliers
– Compute the distance between
candidate points and non-candidate
clusters.
• If candidate points are far from all
other non-candidate points, they
are outliers

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Base Rate Fallacy – An outlier challenge


Base Rate Fallacy

• Bayes theorem:

• More generally:

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Base Rate Fallacy (Axelsson, 1999)
The base-rate fallacy is best described through example:
Suppose a doctor performs on a patient a diagnostic test that is
99% accurate symmetrically (i.e. both for presence and absence
of disease).
The doctor may report the test result as bad news + good news
• Bad News – the patient is tested positive
• Good News – Out of entire population, rate of incidence is
only 1 in 10000.

Now what is the probability that the patient has the disease?

The Base-Rate Fallacy and the Difficulty of Intrusion Detection by Stefan Axelsson ACM Transactions on Information and System Security

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Base Rate Fallacy

• Even though the test is 99% certain, your chance of


having the disease is 1/100, because the population of
healthy people is much larger than sick people

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Base Rate Fallacy in Intrusion Detection
I: intrusive behavior,
¬I: non-intrusive behavior
A: alarm
¬A: no alarm

Detection rate (true positive rate): P(A|I)


False alarm rate: P(A|¬I)

Goal is to maximize both


– Bayesian detection rate, P(I|A)
– P(¬I|¬A)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Detection Rate vs False Alarm Rate

Suppose there are about 20 intrusions Per million

• False alarm rate becomes more dominant if P(I) is very low

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Detection Rate vs False Alarm Rate

• Axelsson: We need an extremely low false alarm rate to


achieve a reasonable Bayesian detection rate

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers
The Base-Rate Fallacy and the Difficulty of Intrusion Detection by Stefan
Axelsson ACM Transactions on Information and System Security, Vol. 3, No.
3, August 2000, Pages 186–205

54
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S1-20_DSECLZC415 : Data Mining
(Lecture #15 - Mining Unstructured Data)

BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
Mining Unstructured Data
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
Structured data vs. Unstructured data

Structured data
comprised of clearly defined data types whose pattern makes
them easily searchable

Unstructured data – “everything else”


comprised of data that is usually not as easily searchable,
including formats like audio, video, and social media postings

4
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Unstructured Data

Can be
– Text
– www
– Multimedia
– Graph
– Spatial data

……

5
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Mining Text Data – NLP Challenges


Mining Text Data: An Introduction

Data Mining / Knowledge


Discovery

Structured Data Multimedia Free Text Hypertext


HomeLoan ( Frank Rizzo bought <a href>Frank Rizzo
Loanee: Frank Rizzo his home from Lake </a> Bought
Lender: MWF View Real Estate in <a hef>this home</a>
Agency: Lake View 1992. from <a href>Lake
Amount: $200,000 He paid $200,000 View Real Estate</a>
Term: 15 years under a15-year loan In <b>1992</b>.
) from MW Financial. <p>...
Loans($200K,[map],...)
7
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Bag-of-Tokens Approaches
Documents Token Sets

Four score and seven nation – 5


years ago our fathers brought civil - 1
forth on this continent, a new war – 2
nation, conceived in Liberty, Feature men – 2
and dedicated to the Extraction died – 4
proposition that all men are people – 5
created equal. Liberty – 1
Now we are engaged in a God – 1
great civil war, testing …
whether that nation, or …

Loses all order-specific information!


Severely limits context! 8
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Natural Language Processing

9
Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining ChengXiang Zhai, Sean Massung, ACM Books 2016
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
General NLP—Too Difficult!
Word-level ambiguity
– “design” can be a noun or a verb (Ambiguous POS)
– “root” has multiple meanings (Ambiguous sense)
Syntactic ambiguity
– “natural language processing” (Modification)
– “A man saw a boy with a telescope.” (PP Attachment)
Anaphora resolution
– “John persuaded Bill to buy a TV for himself.”
(himself = John or Bill?)
Presupposition
– “He has quit smoking.” implies that he smoked before.

Humans rely on context to interpret (when possible).


This context may extend beyond a given document!
10
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Information Retrieval
Text Databases and IR
Text databases (document databases)
– Large collections of documents from various sources: news
articles, research papers, books, digital libraries, e-mail
messages, and Web pages, library database, etc.
– Data stored is usually semi-structured
– Traditional information retrieval techniques become inadequate
for the increasingly vast amounts of text data
Information retrieval
– A field developed in parallel with database systems
– Information is organized into (a large number of) documents
– Information retrieval problem: locating relevant documents
based on user input, such as keywords or example documents
12
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Information Retrieval
Typical IR systems
– Online library catalogs
– Online document management systems
Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
– Some IR problems are not addressed well in DBMS, e.g.,
unstructured documents, approximate search using
keywords and relevance

13
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Measures for Text Retrieval
Precision: the percentage of retrieved documents that are in fact relevant to
the query (i.e., “correct” responses)

Recall: the percentage of documents that are relevant to the query and were,
in fact, retrieved

Relevant Relevant &


Retrieved Retrieved

All Documents

| {Relevant} ∩ {Retrieved} | | {Relevant} ∩ {Retrieved} |


precision = recall =
| {Retrieved} | | {Relevant} |
14
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Information Retrieval Techniques
Basic Concepts
– A document can be described by a set of representative
keywords called index terms.
– Different index terms have varying relevance when used
to describe document contents.
– This effect is captured through the assignment of
numerical weights to each index term of a document.
(e.g.: frequency, tf-idf)
DBMS Analogy
– Index Terms  Attributes
– Weights  Attribute Values

15
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Indexing Process
Text + Meta data
(Doc type, structure,
features, size, etc.)
Takes index terms and
Identifies and stores creates data structures
documents for (indexes) to support
indexing fast searching

Transforms documents into


or Text Processing index terms or features

16
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Zipf’s Law
Distribution of word frequencies is very skewed
 Few words occur very often, many hardly ever occur

 e.g., “the” and “of”, two common words, make up about 10%
of all word occurrences in text documents

Zipf’s law:
 The frequency f of a word in a corpus is inversely proportional
to its rank r (assuming words are ranked in order of
decreasing frequency)
k
f = r f×r=k

where k is a constant for the corpus 17


9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Top 50 Words from AP89

Associated Press collection of news stories from 1989 (called AP89) 18


9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Vocabulary Growth
Heaps’ Law, another prediction of word occurrence
As corpus grows, so does vocabulary size. However, fewer new
words when corpus is already large
Observed relationship (Heaps’ Law):
v = k × nβ
where
v is the vocabulary size (number of unique words)
n is the total number of words in corpus
k, β are parameters that vary for each corpus
(typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)
 Predicting that the number of new words increases very
rapidly when the corpus is small
19
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heaps’ Law Predictions
Number of new words increases very rapidly when the
corpus is small, and continue to increase indefinitely

Predictions for TREC collections are accurate for large


numbers of words, e.g.,
 First 10,879,522 words of the AP89 collection scanned
 Prediction is 100,151 unique words
 Actual number is 100,024

Predictions for small numbers of words (i.e., < 1000) are


much worse

20
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Heaps’ Law on the Web
Heaps’ Law works with very large corpora
 New words occurring even after seeing 30 million!

 Parameter values different than typical TREC values

New words come from a variety of sources


 Spelling errors, invented words (e.g., product, company
names), code, other languages, email addresses, etc.

Search engines must deal with these large and


growing vocabularies

Text REtrieval Conference - TREC 21


9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Information Retrieval Techniques


Information Retrieval Techniques
Index Terms (Attribute) Selection:
– Stop list
– Word stem
– Index terms weighting methods
Terms  Documents Frequency Matrices
Information Retrieval Models:
– Boolean Model (simplistic)
– Vector Space Model (Document Ranking considered)
– Probabilistic Retrieval Models (Ranked as per probability of relevance)

23
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Boolean Model

Consider that index terms are either present or absent in a


document
As a result, the index term weights are assumed to be all binaries
A query is composed of index terms linked by three connectives:
not, and, and or
– e.g.: car and repair, plane or airplane
The Boolean model predicts that each document is either
relevant or non-relevant based on the match of a document to
the query

24
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Keyword-Based Retrieval
A document is represented by a string, which can be identified
by a set of keywords
Queries may use expressions of keywords
– E.g., car and repair shop, tea or coffee, DBMS but not
Oracle
– Queries and retrieval should consider synonyms, e.g.,
repair and maintenance
Major difficulties of the model
– Synonymy: A keyword T does not appear anywhere in the
document, even though the document is closely related to
T, e.g., data mining
– Polysemy: The same keyword may mean different things in
different contexts, e.g., mining
25
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity-Based Retrieval in Text Data

Finds similar documents based on a set of common keywords


– Answer should be based on the degree of relevance based on the
nearness of the keywords, relative frequency of the keywords, etc.
Basic techniques
-Stop list
• Set of words that are deemed “irrelevant”, even though they
may appear frequently
• E.g., a, the, of, for, to, with, etc.
• Stop lists may vary when document set varies

26
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Similarity-Based Retrieval in Text Data (contd)
– Word stem
• Several words are small syntactic variants of each other since they
share a common word stem
• E.g., drug, drugs, drugged
– A term frequency table
• Each entry frequent_table(i, j) = # of occurrences of the word ti in
document di
• Usually, the ratio instead of the absolute number of occurrences is
used
– Similarity metrics: measure the closeness of a document to
a query (a set of keywords)
• Relative term occurrences v1 ⋅ v2
sim(v1 , v2 ) =
• Cosine distance: | v1 || v2 |
27
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Vector Space Model


Vector Space Model
Represent a doc by a term vector
– Term: basic concept, e.g., word or phrase
– Each term defines one dimension
– N terms define a N-dimensional space
– Element of vector corresponds to term weight
– E.g., d = (x1,…,xN), xi is “importance” of term i
New document is assigned to the most likely category based on
vector similarity.

29
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What VS Model Does Not Specify
How to select terms to capture “basic concepts”
– Word stopping
• e.g. “a”, “the”, “always”, “along”
– Word stemming
• e.g. “computer”, “computing”, “computerize” => “compute”
– Latent semantic indexing
How to assign weights
– Not all words are equally important: Some are more
indicative than others
• e.g. “algebra” vs. “science”
How to measure the similarity
30
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Assign Weights

Two-fold heuristics based on frequency


– TF (Term frequency)
• More frequent within a document  more relevant to semantics

– IDF (Inverse document frequency)


• Less frequent among documents  more discriminative

31
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Cornell SMART TF-IDF Model
TF is computed using the equation:

IDF is computed using the equation:

where d is the document collection, and dt is the set of documents containing


term t. If |dt| << |d|, the term t will have a large IDF scaling factor

32
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Alternate TF-IDF Weighting
TF Weighting: IDF Idea:
– More frequent => more – Less frequent among
documents  more
relevant to topic discriminative
• Raw TF= f(t,d): how
many times term t Formula:
appears in doc d
Normalization:
– Document length varies n — total number of docs
=> relative frequency k — # docs with term t
preferred appearing
• e.g., Maximum
frequency normalization (IDF – inverse document
frequency)
33

9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
TF-IDF Weighting
TF-IDF weighting : weight(t, d) = TF(t, d) * IDF(t)
– Freqent within doc  high tf  high weight
– Selective among docs  high idf  high weight
Recall VS model
– Each selected term represents one dimension
– Each doc is represented by a feature vector
– Its t-term coordinate of document d is the TF-IDF weight
– This is more reasonable
Just for illustration …
– Many complex and more effective weighting variants exist in
practice

34
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
How to Measure Similarity?
Given two document

Similarity definition
– dot product

– normalized dot product (or cosine)

35
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Illustrative Example
text
doc1 mining
Sim(newdoc,doc1)=(4.8*2.4+4.5*4.5)/||v1||*||vn||
search
engine
Sim(newdoc,doc2)=2.4*2.4/||vn||*||v2||
text
To whom is newdoc
more similar?
travel
text
Sim(newdoc,doc3)=0
doc2 map
travel

text mining travel map search engine govern president congress


IDF(faked) 2.4 4.5 2.8 3.3 2.1 5.4 2.2 3.2 4.3

government doc1 2(4.8) 1(4.5) 1(2.1) 1(5.4)


president doc2 1(2.4 ) 2 (5.6) 1(3.3)
doc3 congress doc3 1 (2.2) 1(3.2) 1(4.3)

newdoc 1(2.4) 1(4.5)


…… 36
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
VS Model-Based Classifiers
What do we have so far?
– A feature space with similarity measure
– This is a classic supervised learning problem
• Search for an approximation to classification hyper plane

VS model based classifiers


– K-NN
– Decision tree based
– Neural networks
– Support vector machine

37
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Probabilistic Retrieval Models


Probabilistic Retrieval Models
The ranking function based on the probability that a given document d
is relevant to a query q,
– or p(R = 1| d, q) where R ∈ {0, 1} is a binary random variable denoting
relevance.
In query likelihood retrieval model (one of the probabilistic models),
we assume that this probability of relevance can be approximated by
the probability of a query given a document and relevance,
– p(q | d, R = 1).

Intuitively, if a user likes document d, how likely would the user enter
query q in order to retrieve document d?

39
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Probabilistic Retrieval Models

Clearly, p(R = 1 | d, q) + p(R = 0 | d, q) = 1

40
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Text Data Mining


Types of Text Data Mining

• Keyword-based association analysis


• Automatic document classification
• Similarity detection
• Cluster documents by a common author
• Cluster documents containing information from a common source
• Link analysis: unusual correlation between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: find information that violates usual patterns
• Hypertext analysis
• Patterns in anchors/links
• Anchor text correlations with linked objects

9/14/2021 42
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Keyword-Based Association Analysis

• Motivation
• Collect sets of keywords or terms that occur frequently together and
then find the association or correlation relationships among them
• Association Analysis Process
• Preprocess the text data by parsing, stemming, removing stop words,
etc.
• Evoke association mining algorithms
• Consider each document as a transaction
• View a set of keywords in the document as a set of items in the transaction
• Term level association mining
• No need for human effort in tagging documents
• The number of meaningless results and the execution time is greatly
reduced

9/14/2021 43
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Text Classification

• Motivation
• Automatic classification for the large number of on-line text documents
(Web pages, e-mails, corporate intranets, etc.)
• Classification Process
• Data preprocessing
• Definition of training set and test sets
• Creation of the classification model using the selected classification
algorithm
• Classification model validation
• Classification of new/unknown text documents
• Text document classification differs from the classification of relational data
• Document databases are not structured according to attribute-value pairs

9/14/2021 44
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Document Clustering

• Motivation
• Automatically group related documents based on their
contents
• No predetermined training sets or taxonomies
• Generate a taxonomy at runtime
• Clustering Process
• Data preprocessing: remove stop words, stem, feature
extraction, lexical analysis, etc.
• Hierarchical clustering: compute similarities applying clustering
algorithms.
• Model-Based clustering: clusters are represented by
“exemplars”. (e.g.: Self-Organizing Maps)

9/14/2021 45
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
VS Model-Based Classifiers

• What does VS model offer?


• A feature space with similarity measure
• This is a classic supervised learning problem
• Search for an approximation to classification hyper plane

• VS model based classifiers


• K-NN
• Decision tree based (dimensionality/interpretability challenges)
• Neural networks
• Support vector machine

9/14/2021 46
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Mining WWW
Challenges in Mining WWW
• The Web is too huge for effective data warehousing and data mining.
It is barely possible to set up a data warehouse to replicate, store, or
integrate all of the data on the Web
• The complexity of Web pages is far greater than that of any
traditional text document collection. There is no index by category,
nor by title, author, cover page, table of contents, etc.
• The Web is a highly dynamic information source. Not only does the
Web grow rapidly, but is also constantly updated. Linkage
information and access records are also updated frequently
• The Web serves a broad diversity of user communities. Users may
have very different backgrounds, interests, and usage purposes.
• Only a small portion of the information on the Web is truly relevant
or useful. It is said that 99% of the Web information is useless to
99% of Web users. How can the portion of the Web that is truly
relevant to your interest be determined? How can we find high
quality Web pages on a specified topic?
48
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Search Engine – Two Rank Functions
Ranking based on link
Search structure analysis

Importance Ranking
Rank Functions (Link Analysis)
Similarity
based on Relevance Ranking
content or text Backward Link Web Topology
(Anchor Text) Graph
Inverted Indexer
Index
Anchor Text Web Graph
Generator Constructor

Term Dictionary Forward Forward URL


Meta Data
(Lexicon) Index Link Dictioanry

Web Page Parser

Web Pages
49
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Relevance Ranking
• Inverted index
- A data structure for supporting text queries
- like index in a book
aalborg 3452, 11437, …..
.
.
.
indexing .
.
arm 4, 19, 29, 98, 143, ...
disks with armada 145, 457, 789, ...
documents armadillo 678, 2134, 3970, ...
armani 90, 256, 372, 511, ...
.
.
.
.
.
zz 602, 1189, 3209, ...
50
inverted index
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

PageRank
Introduction on PageRank

PageRank is a link analysis algorithm … with the


purpose of "measuring" its (Webpage) relative
importance within the set.
– From Wikipedia, the free encyclopedia

Developed by Larry Page as his PhD research topic


3 years later, he quit Stanford and founded Google with
Brin
Apparently Larry Page had lost his PhD qualification.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


PageRank: the intuitive idea
PageRank relies on the democratic nature of the Web by
using its vast link structure as an indicator of an individual
page's value or quality.
PageRank interprets a hyperlink from page x to page y as a
vote, by page x, for page y.
However, PageRank looks at more than the sheer number
of votes; it also analyzes the page that casts the vote.
– Votes casted by “important” pages weigh more heavily and help
to make other pages more "important."
This is exactly the idea of rank prestige in social network.

53

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


More specifically …
A hyperlink from a page to another page is an implicit
conveyance of authority to the target page.
– The more in-links that a page i receives, the more prestige the
page i has.
Pages that point to page i also have their own prestige
scores.
– A page of a higher prestige pointing to i is more important than
a page of a lower prestige pointing to i.
– In other words, a page is important if it is pointed to by other
important pages.

54

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


PageRank algorithm
According to rank prestige, the importance of page i (i’s
PageRank score) is the sum of the PageRank scores of
all pages that point to i.
Since a page may point to many other pages, its
prestige score should be shared.
The Web as a directed graph G = (V, E). Let the total
number of pages be n. The PageRank score of the page
i (denoted by P(i)) is defined by: Oj is the number
P( j )
P (i ) = ∑
( j , i )∈E O j
, of out-link of j

55

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Matrix notation
We have a system of n linear equations with n unknowns.
We can use a matrix to represent them.
Let P be a n-dimensional column vector of PageRank values,
i.e., P = (P(1), P(2), …, P(n))T.
Let A be the adjacency matrix of our graph with
1
 if (i, j ) ∈ E
Aij =  Oi
 0 otherwise

We can write the n equations with (PageRank)


T
P=A P
56
CS583, Bing Liu, UIC
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Calculation of PageRank on Webpage

A B

C D
• R(.) = PageRank of a Webpage
1. R(A) = 100%R(B) + 50%R(C)
2. R(B) = 50%R(C) + 100%R(D)
3. R(C) = 100%R(A)
57
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Signals for Google Search Engine
• Google apparently deals with the 15 percent of queries a day
it gets which its systems have never seen before.
• For last 5 years, Google has been using RankBrain, which uses
AI/ML to make a guess as to what words or phrases might
have a similar meaning
• Google Executives state that three major search ranking
factors are
• Content (of query)
• Links
• RankBrain

[Link]
Clark, Jack. "Google Turning Its Lucrative Web Search Over to AI Machines". Bloomberg Business.
2015 58
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

59
9/14/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
Data Mining: Concepts and Techniques, Second Edition by Jiawei Han,
Micheline Kamber Morgan Kaufmann Publishers
Text Data Management and Analysis: A Practical Introduction to Information
Retrieval and Text Mining ChengXiang Zhai, Sean Massung, ACM Books 2016
Web data Mining - Exploring Hyperlinks, Contents and Usage Data, By Bing
Liu, Second Edition, Springer, July 2011
Search Engines: Information Retrieval in Practice, Croft, Metzler, and
Strohman, 2010

60
September 14, 2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECLZC415 : Data Mining
(Lecture #17 - Data Mining Applications)
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
•The slides presented here are obtained from the authors of the
books and from various other contributors. I hereby acknowledge
all the contributors for their material and inputs.
•I have added and modified a few slides to suit the requirements
of the course.
2
9/22/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining
Data Mining Applications
BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

9.1 Recommendation Systems

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Recommendation System
• System that produces individualized recommendations as output or has the effect of
guiding the user in a personalized way to interesting or useful objects in a large
space of possible options
Burke (2002)

• A personalized information filtering technology used to either predict whether a


particular user will like a particular item (prediction problem) or to identify a set of
N items that will be of interest to a certain user (top-N recommendation problem).
Deshpande and Karypis (2004)

• Recommender systems suggest items of interest to users based on their explicit and
implicit preferences, the preferences of other users, and user and item attributes
Schein et al. (2005)

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Why Recommender Systems?
• Physical delivery systems are characterized by a scarcity of resources.
Brick-and-mortar stores have limited shelf space, and can show the
customer only a small fraction of all the choices that exist.
• A physical bookstore may have several thousand books on its shelves
• It is not possible to tailor the store to each individual customer
• The choice of what is made available is governed only by the aggregate
numbers
• Typically, a bookstore will display only the books that are most popular, and a
newspaper will print only the articles it believes the most people will be
interested in.

• On-line stores can make anything that exists available to the


customer.
• Largest ecommerce stores offer millions of books.
• It is not possible to present all available items to the user, the way physical
institution can. Neither can we expect users to have heard of each of the
items they might like.
6

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Why Recommender Systems?

The long tail: physical


institutions can only provide

Popularity 
what is popular,
while on-line institutions can
make everything available

The long-tail phenomenon


forces on-line institutions to
recommend items to
individual users.

Product 

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Popular Recommendation Engines

Some popular recommendation engines which have been proved as highly profitable:
the [Link],
the [Link] and
the [Link] recommendation engines

[Link] claims that 35 % of products sales result from recommendations.


About 66 % of movies rented in [Link] are recommended.
Google News Recommendations generate 38 % more click-throughs .

Recommender Systems for Location-based Social Networks by Panagiotis Symeonidis, Dimitrios Ntempos and Yannis
Manolopoulos Springer © 2014
8

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Into Thin Air and Touching the Void

An extreme example of how the long tail, together with a well designed
recommendation system can influence events is the story told by Chris
Anderson about a book called Touching the Void. This mountain-climbing
book was not a big seller in its day, but many years after it was published,
another book on the same topic, called Into Thin Air was published.
Amazon’s recommendation system noticed a few people who
bought both books, and started recommending Touching the Void to
people who bought, or were considering, Into Thin Air. Had there been no
on-line bookseller, Touching the Void might never have been seen by
potential buyers, but in the on-line world, Touching the Void eventually
became very popular in its own right, in fact, more so than Into Thin Air.

Mining of Massive Datasets


By Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation Approaches
The approaches of recommender systems: collaborative filtering (CF), content-based Filtering
(CB) and hybrid methods:
• Collaborative filtering algorithms recommend those items to the target user, that have
been rated highly by other users with similar preferences and tastes.
• "Customers who bought item i also bought item y”
• Content-based filtering uses the information derived from documents or item features (eg.
terms or attributes). It uses a set of attributes, which describes the items and recommends
other items similar to those that exist in the user's profile.
• The cold start problem for new items and new users are alleviated, provided that
features of users and items are known.
• The pitfall is that there is no diversity in the recommendations. That is, the user gets
recommendations that are very familiar to her, since the recommended items are
similar to those in her item profile
• TF-IDF is widely used
• Hybrid algorithms attempt to combine Collaborative filtering with Content-based filtering.
The combination of content with rating data helps capture more effective correlations
between users or items, which yields more accurate recommendations.

10

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation System Types
Other types of recommender systems proposed (by Burke) in the
literature :
• Demographic recommendation, which classifies the users according
to the attributes of their personal profile, and makes
recommendations based on demographic classes
• Utility-based recommendation, which makes suggestions based on a
computation of the utility of each item for a user, for whom a utility
function has to be stored
• Knowledge-based recommendation, which suggests items based on
logical inferences about user preferences. A knowledge
representation (e.g. rules) about how an item meets a particular user
need is necessary.

11

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Utility matrix sample
The movie names are
HP1, HP2, and HP3 for
Harry Potter I, II, and
III, TW for Twilight, and
SW1,
SW2, and SW3 for Star
Wars episodes 1, 2,
and 3. The users are
represented
by capital letters A
through D

A utility matrix representing ratings of movies on a 1–5 scale


Blanks represent the situation where the user has not rated the movie. In practice, the
matrix would be even sparser

12

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation Problem
• Object of the decision. That is, defining the object upon which the
decision has to be made and the rationale of the recommendation
decision
• Family of criteria. That is, the identification and modelling of a set of
criteria that affect the recommendation decision, and which are
exhaustive and non-redundant
• Global preference model. That is, the definition of the function that
aggregates the marginal preferences upon each criterion into the
global preference of the decision maker about each item
• Decision support process. That is, the study of the various categories
and types of recommender systems that may be used to support the
recommendation decision maker, in accordance to the results of the
previous steps

13

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation Capabilities
• Choice, which involves choosing one item from a set of
candidates
• Sorting, which involves classifying items into pre-defined
categories
• Ranking, which involves ranking items from the best one to
the worst one
• Description, which involves describing all the items in terms
of performance upon each criterion

14

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation technique used in TEL
Collaborative filtering (CF) techniques
Name Short description Advantages Disadvantages Usefulness for TEL
1. User-based Users that rated the same item – No content analysis – New user problem – Benefits from
CF similarly probably have the same – Domain-independent – New item problem experience
taste. Based on this assumption, – Quality improves over – Popular taste –Allocates learners to
this technique recommends time – Scalability groups (based on
unseen items already rated by – Bottom-up approach –Sparsity similar ratings)
similar users. – Cold-start problem
2. Item-based Focus on items, assuming that – No content analysis – New item problem – Benefits from
CF items rated similarly are probably – Domain-independent – Popular taste experience
similar. It recommends items with – Quality improves over – Sparsity
highest correlation (based on time – Cold-start problem
ratings to the items). – Bottom-up approach

3. Stereotypes Users with similar attributes are – No cold-start problem – Obtaining information –Allocates learners to
or matched, then recommends – Domain-independent –Insufficient groups
demographics items that are preferred by information –Benefits from
CF similar users (based on user data –Only popular taste experience
instead of ratings). –Obtaining metadata –Recommendation from
information the beginning of the RS
–Maintenance ontology
TEL : Technology Enhanced Learning
15

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation technique used in TEL
Content-based (CB) techniques
Name Short description Advantages Disadvantages Usefulness for
TEL
4. Case- Assumes that if a user – No content analysis – New user – Keeps learner
based likes a certain item, (s)he – Domain-independent problem informed about
reasoning will probably also like – Quality improves over – Overspecialization learning goal
similar items. time – Sparsity – Useful for
Recommends new but – Cold-start hybrid RS
similar items. problem
5. Attribute- Recommends items –No cold-start problem –Does not learn – Useful for
based based on the matching –No new user I new item –Only works with hybrid RS
techniques of their attributes to the problem categories –
user profile. Attributes –Sensitive to changes of –Ontology Recommendation
could be weighted for preferences modeling and from the
their importance to the –Can include non-item maintenance is beginning
user. related features required
–Can map from user – Overspecialization
needs to items
TEL : Technology Enhanced Learning
16

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Recommendation technique used in TEL
Data-Mining (DM) techniques
6. Decision A decision tree represents a set –Easy to understand –Overspecialisation in –Visualize differences of
Trees (C4.5,ID3) of classifications created from a –High representation small datasets learners from the data
set of rules. They start form a power –Can become very –Alternative approach to
single classification and branch broad expert driven ontologies
out based on classification rules
mined from the data.
7.K-Nearest Does not build an explicit model –Simple approach only –Difficult to select –Recommend similar
Neighbor instead exams the categories of two parameters to distance function d peers, or contents to
(Isodata, Forgy) the K-most similar data points. select –Irrelevant data learners
K-means is often used in TEL –Robust to noise needs to be removed –Cluster learners in
recommenders to compute –High representation –Slower than model- groups
similarity of vector-based power based
approaches. recommendations
8. Vector-based Vector-based approaches –Suitable for sparse –Content depended –Useful to monitor and
models (TF-IDF, characterise items and users as datasets (Items with same predict learner
Singular value vectors of factors in a 3D space. –Can take temporal context but different performance
decomposition, A high correlation between an differences into account terms are not –Can adapt to increased
Matrix item and a user can be used as –Can take various matched) knowledge level of
Factorisation) recommendation but also implicit information –User keywords have learners
predictions can be created. into account does not to match semantic –Can mark learning
need explicit ratings space resources that are not
popular anymore
17

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


TEL variables

- K. Verbert, N. Manouselis, H. Drachsler, E. Duval

18

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Framework for the analysis of Recommender Systems

-Manouselis and Costopoulou (2007)


19

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prescribed Text Books

Author(s), Title, Edition, Publishing House


Recommender Systems for Learning by Nikos Manouselis, Hendrik Drachsler,
Katrien Verbert and Erik Duval Springer © 2013
Recommender Systems for Location-based Social Networks by Panagiotis
Symeonidis, Dimitrios Ntempos and Yannis Manolopoulos Springer © 2014

20

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

9.2 Fraud Detection

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Why study Fraud?

• Fraud can be defined as a criminal activity, involving false


representations to gain an unjust advantage (Concise Oxford
Dictionary)

• The Association of Certified Fraud Examiners estimates that globally


organizations lose about 5% of their revenues to fraud. If this were to
hold true for all organizations contributing to the Gross World
Product of about $90.52 trillion for 2019 (IMF estimate), fraud losses
could be as high as $4.5 trillion

[Link]
22

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Fraud in Aspects of Business
• Fraud can occur in many aspects of business, for e.g.:
• Credit card fraud: Stealing or counterfeiting credit card numbers, or
nonpayment of accounts
• Application fraud: Untrue statements on a credit application, leading
to assignment of an artificially low credit risk
• Claim fraud: Submitting inflated or false claims
• Life insurance: False or "engineered" death claims
• Health care fraud: False billings by health care providers
etc.

23

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issues with Fraud Detection

• Fraud is usually a rare event. Identifying fraud is difficult because of


its rarity and because its very nature is stealthy.
• We need accurate models to make effective detection.
• The vast majority of the records (i.e., 99.9%) may be legitimate. Only 0.1% of
the records may be fraudulent. Here a 99% accurate model will lead to too
many false alarms.
• Say we have million transactions. As per above, 1000 are fraudulent. With
99% accuracy, i.e. 1% inaccuracy (false positives, false negatives), total alarms
will be 1% of 0.999 million false alarms and 990 (99% of 1000) true alarms.
Total alarms = 9990 + 990 = 10980, out of which more than 90% are false alarms.

• Often, the extra accuracy is associated with higher cost, but the cost
of not doing so may be much higher

24

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issues with Fraud Detection
• Fraud is Evolving
• Fraudsters may adapt quickly to many fraud detection methods, by devising novel
and increasingly subtle ways to get away with it. Also, fraud detection schemes
must evolve also to try to keep up with (and get ahead of) fraudsters
• Large Data Set Processing Needed
• Large credit card issuers like Capital One may process billions of transactions per
year. Even a very small percentage of fraud among these billions of transactions can
result in proportionately large losses.
• Telecom companies handle billions of calls in a month
• The Fact of Fraud is Not Always Known during Modeling
• We need to use both supervised and unsupervised methods to detect fraud.
• Fraud is Very Complex
• The complexity is partly due to the fraudster's need for stealth and secrecy, and
partly due to the intentional obfuscation of the trail of evidence indicating fraud

25

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Issues with Fraud Detection
• Fraud Detection May Require the Formulation of Rules Based on
General Principles, "Red Flags," Alerts, and Profiles
• General principle: The incidence of fraud is more likely when the opportunity is
high and the potential gains are large.
• A "red flag": A large number of accidents or claims is made by one individual
• An alert: A new product is introduced before fraud management systems are put
in place

• Fraud Detection Requires Both Internal and External Business Data


• Internal data describing their business events (selling things or providing services)
• External data such as demographic data, firmographic data (profile of businesses),
psychographic data (people with various attitudinal and philosophical views)

• Very Few Data Sets and Modeling Details are Available


• Fraud data sets and modeling methodologies are tightly kept secrets. Companies
do not share with anyone.

26

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of fraud models

27

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Fraud Models
• Early fraud models employed expert systems to detect fraudulent
events. An expert system is a collection of expert opinions on a
number of decision criteria. These systems induced rules from the
responses of a group of experts in the field. These rules can be
coordinated into a flow chart leading to a decision.
• The problem with expert systems is that they are based on subjective inputs that
may be contradictory

• Subsequent fraud detection systems used automated rule induction


engines, based decision tree technology, and fuzzy logic. Some of
these fraud detection systems are still marketed today (iPrevent by
Brighterion).

28

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Types of Fraud Models

• The Fair Isaac fraud detection systems Falcon Fraud Manager,


eFalcon, and LiquidCredit Fraud Solution are built around a
sophisticated system of predictive variables derived from extensive
historical customer data.
• These predictors have been selected by many years of modeling fraud in
many companies. The variables are submitted to a powerful backpropagation
neural net.

29

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Supervised Methods for Fraud Detection
Several elements are crucial to the successful supervised
fraud model
• The fraud event and the relationship of that event to specific transactions
or responses of the fraudster must be accurately identified

• Historical data of past transactions or responses must be available to


derive powerfully predictive variables

• Profiles of the past behavior and actions of both the fraudsters and the
nonfraudsters must be built and employed in the modeling methodology

• Predictive variables need to be identified for each type of


fraud.

30

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Detection of money laundering and other financial crimes

To detect money laundering and other financial crimes, it is necessary to


integrate information from multiple databases such as bank transaction
databases, and federal or state crime history
Multiple data analysis tools can then be used :
• Data visualization tools to display transaction activities using graphs by time
and by groups of customers
• Linkage analysis tools to identify links among different customers and
activities
• Classification tools to filter unrelated attributes and rank the related ones
• Clustering tools to group different cases
• Outlier analysis tools to detect unusual amounts of fund transfers or other
activities, and
• Sequential pattern analysis tools to characterize unusual access sequences

31

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Identifying Tax Fraud through Social Network Analysis

• SNA(Social Network Analysis) is an analytic approach of correlating


people, entities and relationships to determine how tightly an
individual or business is related to others who have known
compliance issues
• These relationships can be from shared
• phone numbers,
• physical addresses,
• bank accounts,
• credit cards, or
• any other connection

• Graph analytics can give insights into members of the network

[Link]

32

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Opportunities for Tax Agencies in SNA

• Tax and revenue agencies to take advantage of SNA tools, across


registration, audit and collections business areas

• Improper registrations. A tax evader closes a business and the


business re-opens (typically owned by a relative of the original
owner) at the same or a nearby location. i.e. the owner stays in
business by opening a similar (or identical) business with a different
legal name and a different legal owner (e.g., a spouse, parent or
another relative).
• Very difficult to catch by manual efforts. The challenge for the tax agency is
the owner will have changed and won’t be an exact match to the previous
business.

[Link]

33

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Opportunities for Tax Agencies in SNA

• In the example,
Suzie Smith has a
direct relationship
with one closed
business and a
second degree
relationship with
another closed
business.
• The state can deny
the new
registration or
conduct additional
investigations in a
cost effective
manner.
[Link] 34

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Opportunities for Tax Agencies in SNA
• Identify Fraud Rings. The SNA can identify related businesses. i.e.
businesses with related addresses, bank accounts, phone numbers,
email addresses, or other identifying characteristics.
• SNA can help a revenue agent can identify additional individuals and/or
businesses that may be related to the same fraud ring, saving the investigator
time and effort.
• Assisting with Locating a Delinquent Debtor. SNA’s are frequently
used during the debt collection process to identify related
individuals. SNA can greatly enhance and automate this effort, by
finding people who share the same physical address, phone number,
email, etc
• Earlier collectors have utilized manual tools along these lines for years,
contacting next-door neighbors (for example) in an attempt to locate a
debtor.
[Link]
35

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Opportunities for Tax Agencies in SNA
• Finding Successor Businesses. When a business ceases operation,
the business can re-open in a new location or under new
ownership. If the original business owes money, the government
can in many cases pursue that debt if there is a successor business
• This can save the collector a significant amount of time for what
otherwise would require significant manual research.

• Tax agencies are data rich organizations, and analytics solutions


like Social Network Analysis will allow them to identify more fraud
and potential non-compliance situations well before a liability
occurs

[Link]
36

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Prescribed Text Books

Author(s), Title, Edition, Publishing House


Fraud Detection; Handbook of Statistical Analysis and Data Mining Applications
by Robert Nisbet, John Elder and Gary Miner Academic Press 2009
Data Mining: Concepts and Techniques, Second Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
[Link] (Fair Issac Corporation)

37

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


BITS Pilani
Pilani|Dubai|Goa|Hyderabad

9.3 Sentiment Analysis

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
Sentiment Analysis
• Monitoring what consumers are saying about a company's brands and products
and how they are expressing their opinions and sentiments to others has
always been important to businesses.
• Until the last century, businesses typically used surveys and focus groups from
time to time to gauge and track consumer sentiments.
• With the widespread adoption of the Internet, the proliferation of social media
channels (such as Twitter, Facebook, and others), and the abundant opportunity
for consumers to express their opinions and sentiments, monitoring sentiment
continuously has become more critical

• "Conventional marketing wisdom long held that a dissatisfied customer tells ten
people; but in the age of new social media, he or she has the tools to tell
millions,"
- Paul Gillin, author of The New Influencers: A Marketer's Guide to the New Social Media

39

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Sentiment Analysis

• The basic task involved in sentiment analysis is


identifying and quantifying the polarity or valence of
sentiments (such as positive, negative, neutral, or
mixed) expressed typically in written opinions,
expressions, reviews, comments, and so on
• It involves many of the text analytics steps such as
• tokenization,
• sentence identification,
• part-of-speech tagging,
• and so on

40

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Sentiment Analysis
• Need to identify statements that convey sentiment
- "It is an amazing TV“ – conveys sentiment
- "Do not buy this TV" – conveys sentiment
- "Which is the best TV?" – conveys no sentiment

• Depending on the context, sentences can be non-comparative (where


opinion is restricted to one thing) or comparative (where multiple
things might be compared).

41

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Example of a Review

Consider the following example of a review by a customer for a TV:


The TV is wonderful. Great size, great picture, easy
interface It makes a cute little song when you boot it up
and when you shut it off I just want to point out that the
43" does not in fact play videos from the USB This is really
annoying because that was one of the major perks I
wanted from a new TV Looking at the product description
now. I realize that the feature list applies to the X758
series as a whole, and that each model's capabilities are
listed below. Kind of a dumb oversight on my part, but it's
equally stupid to put a description that does not apply on
the listing for a very specific model

42

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Granularity of Sentiment
Sentiment analysis starts with determining whether a text contains an
opinion (sentiment). If it does contain sentiment, at what granularity
level does the sentiment exist?
• Document level: At this level, the task is to figure out whether the entire
document can be classified as positive or negative
• This is possible only if the document involves a single entity (such as the TV in
the previous example).

• Sentence level: At this level, the task is to classify each sentence in a


document as a positive, negative, or mixed sentiment sentence.
• In the previous example, the first sentence, expresses positive sentiment. The
third statement, expresses negative sentiment

43

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Granularity of Sentiment

Sentiment granularity can also be looked at from the object side:

• Entity (or Object) and Attribute (or Aspect or Feature) level: An


entity is typically the target of the opinion. However, in many
sentences, the sentiments reflect the reviewer's opinions about
attributes (or aspects or features) of the entity
• "Great size, great picture, easy interface", express positive sentiment for
three specific attributes “size, picture, and interface” of the entity, the TV

44

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Challenges in Sentiment Analysis - NLP
Sentiment analysis starts with text data. So it has all of the typical
natural language processing (NLP) problems associated with text
analytics, viz.
• identifying part-of-speech tags,
• disambiguating terms and lexicons,
• correcting spelling errors, etc.

In addition to words, there are idiom lexicons— e.g. "costs an arm


and a leg" that embody sentiments.
In general, the difficulty with correctly identifying sentiments
increases as you move from general to context-dependent to
idiom lexicons in texts
45

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Challenges in Sentiment Analysis – Opinion Words

Sentiment analysis needs to correctly identifying opinion words


that express positive or negative sentiments.
• There are opinion words whose polarity is always the same, e.g. the word
"beautiful," always expresses a positive sentiment.

• But, there are also context-dependent lexicons in which the polarity of the
word depends on the domain or context, e.g. the word "small" can be
positive or negative depending on the context. The sentence, "The size
seems small." can be positive for a USB flash drive with 1 TB capacity. But,
the same sentence can be interpreted as negative if the context is an LED
big-screen TV.

46

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Challenges in Sentiment Analysis – Type of Text

Further challenges in conducting sentiment analysis come from


the nature of the text.
• For example, tweets are short, and they are typically focused on one topic
only. In that sense, they are easier to analyze. But, tweets often contain a
lot of special meaning characters, such as RT (retweets), hashtags (#),
emoticons (such as smiley faces), that need to be handled carefully.

• Customer reviews are typically on one entity or object. Therefore, there is


less ambiguity in the entity detection task when analyzing reviews.

• Analysis of discussions, free-flow comments, and blog postings is often


the hardest because they typically cover multiple entities, make
comparisons instead of expressing direct opinions, use a lot of sarcasm,
etc.

47

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unsupervised versus Supervised Sentiment Analysis
• The sentiment analysis can be formulated as a supervised or an
unsupervised mining problem, depending on whether there are known
examples of documents belonging to positive or negative sentiments.
• Unsupervised sentiment analysis involves the application of a sentiment
lexicon of opinion-related positive or negative terms to evaluate text in the
document.
• Supervised approach involves machine-learning algorithms (such as
support vector machines (SVMs) and neural networks) to textual feature
representations to derive the relationships between features of the text
segment and the opinions expressed in the document.
• In many situations, known class examples are created by experts who read the
documents or use rules, e.g. if a text review's numeric rating is four or more stars,
then the review is positive.
• If no known class examples are possible, then analysts have to use an unsupervised
classification of sentiments.

48

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unsupervised versus Supervised Sentiment Analysis
Supervised classification is typically performed at the document level.
• If enough labeled examples are available, commonly used classification
models can be trained, validated, and tested to check their performances.
• A good candidate is product review data, which typically has a text review and
an overall numeric rating on a scale of one to five stars. Often, a review rating
of four to five stars is considered a positive rating, and a review rating of one
to two stars is considered a negative rating.
• The main challenge for modelers is to select the inputs from text features
such as terms and their frequencies (often weighted or normalized), part-of-
speech tags, opinion lexicons (general, context-specific, and idiom), syntactic
dependency (from parsing trees), and the handling of negation words (such as
"not").

49

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Unsupervised versus Supervised Sentiment Analysis
Unsupervised method is typically applied at the sentence level. There are two types of unsupervised
methods: lexicon-based and syntactic-pattern based.
• The lexicon-based approach can be used for sentence- and aspect-level sentiment classification. The
relationships between opinion words and attributes are identified via dependency relationships
obtained through parsing.
• For example, in the sentence, "The picture quality is outstanding," the opinion word "outstanding" and the attribute "picture
quality" share the same dependency relationship with the verb "is."
• If a clear dependency is not observed between an opinion word and an attribute, then how close an opinion
word is to an attribute in a sentence can be used to judge the polarity of the attribute.
• This process can get very complex, depending on how long the sentence is, how many attributes are being mentioned in the
same sentence, whether both positive and negative polarity words are used in the same sentence, whether negation is used,
and so on.
• Once sentiment values are computed for each word-attribute combination, they are typically combined using
appropriate normalization or weights to come up with an overall sentiment score.
• The syntactic pattern-based approach involves defining part-of-speech tags and the keywords AND,
NOR, OR, NOT, BUT, etc.
• Primarily useful in contextual analysis when performing phrase -level analysis, this method can be used to develop a variety of
rules for better accuracy. For example, a simple pattern such as <subject> <NOT> <verb> can be used to extract negative
phrases like, "This <feature> does <not> < work> as advertised."

50

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


3

Prescribed Text Books

Author(s), Title, Edition, Publishing House


Tutorial BB - Mining Twitter for Airline Consumer Sentiment
Practical Text Mining and Statistical Analysis for Non-structured Text Data
Applications by Gary Miner et al. Academic Press © 2012
Text Mining and Analysis: Practical Methods, Examples, and Case Studies Using
SAS SAS Institute © 201

51

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Thank You

52
9/22/2021
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
S2-20_DSECFZC415
Mid-Semester Review
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

1
•The slides presented here are obtained from the authors of the books and from
various other contributors. I hereby acknowledge all the contributors for their material
and inputs.
•I have added and modified a few slides to suit the requirements of the course.

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Review

Source Courtesy: Some of the contents of this PPT are sourced from materials provided by publishers of prescribed books
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
What is Data Mining?
Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data

 What is not Data  What is Data Mining?


Mining? – Certain names are more
– Look up phone prevalent in certain US
number in phone locations (O’Brien, O’Rurke,
directory O’Reilly… in Boston area)
– Query a Web – Group together similar
search engine for documents returned by search
information about engine according to their
“Amazon” context (e.g. Amazon
rainforest, [Link],)
4

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition,
statistics, and database systems
Traditional Techniques
may be unsuitable due to
Statistics/ Machine Learning/
– Enormity of data AI Pattern
– High dimensionality Recognition

of data Data Mining


– Heterogeneous,
distributed nature
Database
of data systems

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Quality: Multidimensional View
• Measures for data quality: A multidimensional view
– Accuracy: correct or wrong, accurate or not
– Completeness: not recorded, unavailable, …
– Consistency: some modified but some not, dangling, …
– Timeliness: timely update?
– Believability: how trustable the data are correct?
– Interpretability: how easily the data can be understood?

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers,
and resolve inconsistencies
• Data integration
– Integration of multiple databases, data cubes, or files
• Data reduction
– Dimensionality reduction
– Numerosity reduction
– Data compression
• Data transformation and data discretization
– Normalization
– Concept hierarchy generation

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Attributes
• Attribute (or dimensions, features, variables): a data field, representing a
characteristic or feature of a data object.
• E.g., customer _ID, name, address

• Types:
• Nominal
• Binary
• Symmetric
• Asymmetric

• Ordinal
• Numeric: quantitative
• Interval-scaled
• Ratio-scaled

Data Mining
8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

Data Mining
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
Classification • Decision Tree based Methods
– predicts categorical class • Rule-based Methods
labels (discrete or nominal)
– classifies data (constructs a • Neural Networks
model) based on the training - computational networks that
simulate the decision process in
set and the values (class neurons (networks of nerve cell)
labels) in a classifying
attribute and uses it in • Naïve Bayes and Bayesian Belief
classifying new data Networks
- uses the probability theory to find
the most likely of the possible
Prediction classifications
– models continuous-valued • Support Vector Machines
functions, i.e., predicts - fits a boundary to a region of points
unknown or missing values that are all alike; uses the boundary
to classify a new point

10

7/5/2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Frequent-pattern Mining

Multiple database scans are costly


Mining long patterns needs many passes of scanning and
generates lots of candidates
– To find frequent itemset i1i2…i100
• # of scans: 100
• # of Candidates: (1001) + (1002) + … + (110000) = 2100-1 = 1.27*1030 !

Bottleneck: candidate-generation-and-test
Can we optimize scan, candidate generation, test cycle?

11

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
T2 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han,
Micheline Kamber and Jian Pei Morgan Kaufmann Publishers
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner
by Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

12

July 5, 2021 BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Thank You

13

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Data Mining
(Lecture #18 – Subject Review)

BITS Pilani
Pilani|Dubai|Goa|Hyderabad
BITS Pilani
Pilani|Dubai|Goa|Hyderabad

Review
Data Mining

• Draws ideas from machine learning/AI, pattern recognition,


statistics, and database systems
• Traditional Techniques
may be unsuitable due to
• Enormity of data Statistics/ Machine Learning/
• High dimensionality AI Pattern
of data Recognition

• Heterogeneous, Data Mining


distributed nature
of data
Database
systems

Data Mining
9/21/2021 3
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Data Mining in Business Intelligence

Increasing potential
to support
business decisions End User
Decision
Making

Data Presentation Business


Analyst
Visualization Techniques
Data Mining Data
Information Discovery Analyst

Data Exploration
Statistical Summary, Querying, and Reporting

Data Preprocessing/Integration, Data Warehouses


DBA
Data Sources
Paper, Files, Web documents, Scientific experiments, Database Systems

Data Mining

9/21/2021 4
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Major Tasks in Data Preprocessing
• Data cleaning
• Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies
• Data integration
• Integration of multiple databases, data cubes, or files
• Data reduction
• Dimensionality reduction
• Numerosity reduction
• Data compression
• Data transformation and data discretization
• Normalization
• Concept hierarchy generation

9/21/2021 Data Mining


5
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Basic Statistical Descriptions of Data
• Motivation
• To better understand the data: central tendency, variation and spread
• Data dispersion characteristics
• median, max, min, quantiles, outliers, variance, etc.
• Numerical dimensions correspond to sorted intervals
• Data dispersion: analyzed with multiple granularities of precision
• Boxplot or quantile analysis on sorted intervals
• Dispersion analysis on computed measures
• Folding measures into numerical dimensions
• Boxplot or quantile analysis on the transformed cube

9/21/2021 Data Mining


6
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Classification
• Classification • Decision Tree based Methods
• predicts categorical class • Rule-based Methods
labels (discrete or nominal)
• Neural Networks
• classifies data (constructs a - computational networks that
model) based on the simulate the decision process in
training set and the values neurons (networks of nerve cell)
(class labels) in a classifying
• Naïve Bayes and Bayesian Belief
attribute and uses it in Networks
classifying new data
- uses the probability theory to find
the most likely of the possible
• Prediction classifications
• models continuous-valued • Support Vector Machines
functions, i.e., predicts - fits a boundary to a region of points
unknown or missing values that are all alike; uses the boundary
to classify a new point

9/21/2021 Data Mining


7
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Association Analysis
• Association analysis measures the strength of co-occurrence between one
item and another.
• The objective of this class of data mining algorithms is not to predict an occurrence of an
item, like classification or regression do, but to find usable patterns in the co-occurrences of
the items.
• Association rules learning is a branch of an unsupervised learning process that discovers
hidden patterns in data, in the form of easily recognizable rules

• Association algorithms are widely used in retail analysis of transactions,


recommendation engines, and online clickstream analysis across web pages.
• One of the popular applications of this technique is called market basket analysis, which
finds co-occurrences of one retail item with another item within the same retail purchase
transaction

• Retailer can take advantage of this association for bundle pricing, product
placement, and even shelf space optimization within the store layout.

September 21, 2021 Data Mining


8
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Better Frequent-pattern Mining

• Multiple database scans are costly • Use a compressed


representation of the
• Mining long patterns needs many database using an FP-tree
passes of scanning and generates • Once an FP-tree has been
constructed, it uses a
lots of candidates recursive divide-and-conquer
• To find frequent itemset i1i2…i100 approach to mine the
• # of scans: 100
frequent itemsets
• # of Candidates: (1001) + (1002) + … + (110000)
= 2100-1 = 1.27*1030 ! 1. Scan DB once, find frequent 1-
itemset (single item pattern)
• Bottleneck: candidate-generation- 2. Sort frequent items in
and-test frequency descending order, f-
list
• Can we avoid candidate 3. Scan DB again, construct FP-
tree
generation?

Data Mining
September 21, 2021
9
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Good Clustering
• A good clustering method will produce • Dissimilarity/Similarity metric: Similarity is
expressed in terms of a distance function,
high quality clusters with typically metric: d(i, j)
• high intra-class similarity • There is a separate “quality” function that
measures the “goodness” of a cluster.
• low inter-class similarity
• The definitions of distance functions are
• The quality of a clustering result usually very different for interval-scaled,
boolean, categorical, ordinal ratio, and
depends on both the similarity
vector variables.
measure used by the method and its • Weights should be associated with different
implementation variables based on applications and data
semantics.
• The quality of a clustering method is
• It is hard to define “similar enough” or
also measured by its ability to discover “good enough”
some or all of the hidden patterns • the answer is typically highly
subjective.

Data Mining
September 21, 2021 10
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Measuring Cluster Validity
• For graph-based analysis, the cohesion
of a cluster can be defined as the sum
of weights of links in the proximity
graph that connects points within the
cluster.
• Separation between two clusters can
be measured by the sum of weights of
the links from points in one cluster to
points in the other cluster.

• For prototype-based analysis, the


cohesion of a cluster can be defined
as the sum of the proximities w.r.t.
the prototype (centroid or medoid)
of the cluster.
• Separation between two clusters
can be measured by the proximity
of two cluster prototypes.
Sometimes we may also define
centroid C for entire space

Data Mining
9/21/2021
11
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Types of Outliers
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• Object is Og if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• Object is Oc if it deviates significantly based on a selected context
• Ex. 80o F in Urbana: outlier? (depending on summer or winter?)
• Attributes of data objects should be divided into two groups
• Contextual attributes: defines the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers—whose density significantly deviates from its
local area
• Issue: How to define or formulate meaningful context?
• Collective Outliers
• A subset of data objects collectively deviate significantly from the whole data set, even if the
individual data objects may not be outliers
• Applications: E.g., intrusion detection:
• When a number of computers keep sending denial-of-service packages to each other
Data Mining

12
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956
Prescribed Text Books

Author(s), Title, Edition, Publishing House


T1 Data Mining: Concepts and Techniques, Third Edition by Jiawei Han, Micheline
Kamber and Jian Pei Morgan Kaufmann Publishers
T2 Tan P. N., Steinbach M & Kumar V. “Introduction to Data Mining” Pearson
Education
R1 Predictive Analytics and Data Mining: Concepts and Practice with RapidMiner by
Vijay Kotu and Bala Deshpande Morgan Kaufmann Publishers

Data Mining
13
BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956

You might also like