Data Mining Primitives and Languages

Q: What role do concept hierarchies play in background knowledge for data mining, and how do they enhance the mining process?

Concept hierarchies in data mining map low-level concepts to higher-level, more general ones, enabling operations like 'drilling down' for detail or 'rolling up' for summary. They provide domain-specific or general knowledge that guides the mining process and helps interpret results, making patterns more meaningful and concise. This enhances the mining process by allowing analysis at different abstraction levels, increasing comprehensibility and potentially uncovering more significant insights.

Q: Compare the purposes and outputs of SQL and DMQL in data management and mining.

SQL is primarily for data manipulation and retrieval, focusing on accessing, managing, and manipulating existing data in relational databases, producing structured tables and datasets as output. Conversely, DMQL is designed for knowledge discovery, focusing on uncovering hidden patterns and insights within large datasets. While SQL retrieves or modifies data, DMQL discovers patterns such as rules or models and outputs these as synthesized representations like visualizations or models.

Q: In what way do interestingness measures and thresholds impact the output of data mining tasks?

Interestingness measures and thresholds are criteria used to guide the search for meaningful patterns in data mining, filtering out less significant findings. For instance, in association rule mining, measures such as minimum support and confidence determine which rules are considered interesting and hence prioritized. These measures ensure that the output is both actionable and relevant, preventing the user from being overwhelmed by trivial patterns.

Q: Explain the role of OLAP in data characterization and how it facilitates deeper data insights.

OLAP facilitates data characterization by allowing operations such as roll-up and drill-down on predefined data cubes, enabling generalization to higher levels of hierarchy and obtaining detailed information from lower levels. This ability to navigate different abstraction levels provides deeper insights into the data, allowing users to explore and summarize data dimensions effectively and tailor patterns to specific analytical needs.

Q: Describe the differences between simple summarization and analytical characterization in data mining.

Simple summarization condenses data by replacing low-level concepts with high-level ones, often using attribute-oriented induction and basic aggregation techniques. It provides a general overview with minimal detail. Analytical characterization, however, focuses on discovering significant, discriminating characteristics by analyzing attribute relevance and dispersion using statistical tests and dimensionality reduction. It delivers a more in-depth analytical understanding with detailed statistical measures, thereby offering a refined, insightful description of the data.

Q: How does the Data Mining Query Language (DMQL) improve the accessibility of data mining for non-expert users?

DMQL improves accessibility by providing a high-level, declarative syntax for specifying data mining tasks, abstracting the complexities of underlying algorithms. This enables users, including business analysts who are not expert programmers, to focus on the knowledge they want to discover rather than the technical implementation details, thus broadening the usability of data mining technologies.

Q: How does mining class comparison contribute to fields like fraud detection and market research?

Mining class comparison highlights distinguishing characteristics between a target class and contrasting classes, thus identifying features that make one group unique. In fraud detection, this can identify suspicious patterns by comparing fraudulent transactions to legitimate ones, enabling more targeted detection strategies. In market research, it can differentiate between successful and unsuccessful products, providing insights for strategic planning and product development.

Q: How does SQL support the data preprocessing stage in data mining, and why is this important?

SQL is crucial for the data preprocessing stage in data mining as it allows for selecting, filtering, aggregating, and joining data in relational databases and data warehouses, thus preparing it for mining algorithms. This preprocessing step is important because it ensures that the data fed into mining algorithms is clean and relevant, facilitating accurate and effective pattern discovery.

Q: What is the significance of the data sources/information repository component in a data mining system's architecture, and how is it utilized?

The data sources/information repository is crucial as it provides the foundational, raw data needed for analysis. This component accommodates diverse types of data from databases, data warehouses, data lakes, and more, all of which require initial cleaning, integration, and selection to ensure that they are prepared for mining. These preprocessing steps are essential for effective mining, as they enhance data quality and relevancy, ultimately influencing the reliability of the mining results.

Q: What are the key statistical measures used in analytical characterization and how do they enhance the data mining process?

Analytical characterization employs statistical measures such as correlation analysis, ANOVA, and information gain to determine attribute relevance and data dispersion. These measures enhance the data mining process by ensuring that only significant and relevant attributes are included, reducing complexity and improving the clarity and usefulness of the mined data. This refined approach provides a deeper, more statistical understanding of the data's descriptive qualities.
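Information gain, one of the relevance measures named above, can be illustrated with a small sketch. The toy records and the attribute names (income, city, spender) are hypothetical and only serve to show how a relevant attribute scores higher than an irrelevant one.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Reduction in entropy of `target` after splitting `rows` on `attribute`."""
    base = entropy([r[target] for r in rows])
    partitions = {}
    for r in rows:
        partitions.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical toy data: does income or city help separate the two classes?
customers = [
    {"income": "high", "city": "A", "spender": "yes"},
    {"income": "high", "city": "B", "spender": "yes"},
    {"income": "low", "city": "A", "spender": "no"},
    {"income": "low", "city": "B", "spender": "no"},
]

gain_income = information_gain(customers, "income", "spender")
gain_city = information_gain(customers, "city", "spender")
print(f"gain(income)={gain_income:.2f}, gain(city)={gain_city:.2f}")
```

In this toy data income separates the classes completely while city carries no information, so an analytical characterization would keep income and could drop city.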

Unit II of the Data Mining course covers data mining primitives, including languages and system architecture, focusing on Data Mining Query Languages (DMQLs) and general-purpose programming languages like Python and R. It details the five key primitives for data mining tasks: task-relevant data, kind of knowledge to be mined, background knowledge, interestingness measures, and expected output format. The document also discusses the architecture of data mining systems, emphasizing the importance of a structured organization for efficient knowledge extraction.

Uploaded by

priya


DATA MINING-unit II

Unit-II: Data Mining Primitives

Languages and System Architecture: Data Mining–Primitives–Data Mining Query
Language, Architecture of Data mining Systems. Concept Description, Characterization and
Comparison: Concept Description, Data Generalization and Summarization, Analytical
Characterization, Mining Class Comparison–Statistical Measures.

Languages for Data Mining

a. Data Mining Query Languages (DMQLs):

 Concept: Historically, there have been proposals for dedicated Data Mining Query
Languages (DMQLs), inspired by SQL. The idea was to have a declarative language
where users could specify what they want to mine, rather than how to mine it.
 Han, Kamber, and Pei's DMQL: A prominent conceptual DMQL, defined by Han,
Kamber, and Pei in their widely used textbook, specifies data mining tasks
using "primitives". These primitives cover:
o Task-relevant data: Specifying the data source and subset.
o Kind of knowledge to be mined: The data mining functionality (e.g.,
classification, association).
o Background knowledge: Concept hierarchies, domain knowledge.
o Interestingness measures: Thresholds (e.g., minimum support/confidence for
association rules).
o Expected output format: How results should be presented (e.g., rules, tables,
visualizations).
 Status: While DMQLs were proposed, data mining has no single, universally adopted
standard comparable to SQL for relational databases. However, the principles of DMQL
(specifying tasks using primitives) still underlie how most data mining tools are used.

b. General-Purpose Programming Languages with Libraries:

 In practice, much of modern data mining is done using general-purpose programming
languages coupled with powerful libraries. These offer flexibility, extensibility, and the
ability to integrate with other data science tasks.
 Python:
o Strengths: Highly popular, easy to learn, vast ecosystem of libraries.
o Key Libraries:
 Pandas: For data manipulation and analysis (data cleaning, integration,
transformation).
 NumPy: For numerical computing.
 Scikit-learn: Comprehensive machine learning library for classification,
regression, clustering, dimensionality reduction, etc.
 Matplotlib, Seaborn: For data visualization.

[Link] [Link]., [Link], [Link]., BHARATHI WOMEN’S COLLEGE-KALLAKURICHI Page 1

 TensorFlow, PyTorch, Keras: For deep learning tasks, which are
increasingly relevant in advanced data mining.
 NLTK, spaCy: For natural language processing (text mining).
 R:
o Strengths: Excellent for statistical analysis, data visualization, and specialized
machine learning tasks. Strong community and a rich collection of statistical
packages.
o Key Packages: dplyr, ggplot2, caret, e1071, randomForest, arules.
 Scala (with Apache Spark):
o Strengths: Designed for big data processing, integrates well with Spark's
distributed computing capabilities.
o Key Libraries: Spark MLlib (Spark's machine learning library).
 Java:
o Strengths: Mature, robust, widely used in enterprise systems. Many early data
mining tools were built in Java (e.g., Weka, Apache Mahout).
 SQL (for Data Preparation):
o While not a data mining language itself, SQL is crucial for the data preprocessing
stage, especially for selecting, filtering, aggregating, and joining data in relational
databases and data warehouses before it's fed into mining algorithms.
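The preprocessing role of SQL described above can be sketched with Python's built-in sqlite3 module. The table names, columns, and rows below are hypothetical; the point is only that joining, filtering, and aggregating happen in the database before any mining library sees the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE sales (customer_id INTEGER, category TEXT, amount REAL);
    INSERT INTO customers VALUES (1, 'North'), (2, 'South'), (3, 'North');
    INSERT INTO sales VALUES
        (1, 'electronics', 200.0),
        (1, 'electronics', 50.0),
        (2, 'electronics', 75.0),
        (3, 'grocery', 20.0);
""")

# Join, filter, and aggregate inside the database; only the prepared
# rows are handed on to whatever mining library is used afterwards.
prepared = conn.execute("""
    SELECT c.region, s.category, SUM(s.amount) AS total, COUNT(*) AS n_sales
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    WHERE s.category = 'electronics'
    GROUP BY c.region, s.category
    ORDER BY c.region
""").fetchall()
conn.close()

for row in prepared:
    print(row)
```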

c. Domain-Specific Languages/Tools:

 Many commercial and open-source data mining tools provide their own visual interfaces
or scripting languages that abstract away the underlying code, making it accessible to
business users.
 Examples:
o KNIME: Node-based visual workflow editor.
o Orange: Visual programming tool for data analysis and machine learning.
o RapidMiner: Integrated environment for predictive analytics, text mining, and
machine learning.
o IBM SPSS Modeler: Drag-and-drop interface for building predictive models.
o SAS Enterprise Miner: Comprehensive suite for statistical analysis and data
mining.
o Weka: Collection of machine learning algorithms for data mining tasks, with a
GUI.

PRIMITIVES:

A data mining query (or task specification) is defined by five key primitives:

1. Task-Relevant Data (or Data Cube where mining is to be performed):

o What it is: This primitive defines the specific subset of data from the entire
database, data warehouse, or information repository that is relevant to the mining
task. It's the scope of your analysis.


o Components:
 Database/Data Warehouse name(s): The source system.
 Tables/Relations or Data Cube(s): The specific tables, views, or data
cubes to be used.
 Attributes/Dimensions of Interest: The specific columns or dimensions
you want to consider.
 Selection Conditions: Filters or criteria to narrow down the records/rows
(e.g., age > 30, city = 'Bangalore').
o Example: "All sales transactions for 'electronics' products in 'South India' during
'2024'."
2. Kind of Knowledge to be Mined (Data Mining Functionality):
o What it is: This primitive specifies the type of data mining task or functionality
you want to perform. It tells the system what kind of pattern you are looking to
discover.
o Examples:
 characterization (e.g., describing customer demographics)
 discrimination (e.g., comparing loyal vs. churned customers)
 association (e.g., finding frequently bought together items)
 classification (e.g., predicting if a loan applicant is high-risk)
 clustering (e.g., segmenting customers into groups)
 prediction (e.g., forecasting next month's sales)
 outlier analysis (e.g., detecting fraudulent transactions)
 trend/evolution analysis (e.g., identifying seasonal sales patterns)
o Example: "Perform classification."
3. Background Knowledge (Concept Hierarchies):
o What it is: This primitive provides domain-specific or general knowledge that
helps in guiding the mining process and interpreting the results. Concept
hierarchies are a primary form of background knowledge.
o Concept Hierarchies: These define a mapping from a set of low-level concepts
to higher-level, more general concepts. They allow mining to be performed at
different levels of abstraction.
 Example: A geographic hierarchy: street → city → state →
country.
 Example: An age hierarchy: individual age → age group (youth,
middle-aged, senior).
o Role: Enables "drilling down" for more detail or "rolling up" for higher-level
summaries, making patterns more meaningful and concise.
o Example: "Use an age_group_hierarchy for the age attribute."
4. Interestingness Measures and Thresholds:
o What it is: These are criteria used to guide the search for patterns and to filter out
patterns that are not considered "interesting" or significant to the user. Without
these, a data mining system might find an overwhelming number of trivial
patterns.
o Examples:
 For Association Rules: min_support (minimum frequency of itemsets),
min_confidence (minimum strength of the rule).

 For Classification: min_accuracy, max_error_rate.
 For Clusters: Measures of compactness or separation.
o Role: Helps the system focus on generating only the most relevant and actionable
patterns.
o Example: "Find association rules with min_support = 0.02 and
min_confidence = 0.6."
5. Expected Output Format:
o What it is: This primitive specifies how the discovered patterns should be
presented to the user. Clear and appropriate presentation is essential for the results
to be comprehensible and actionable.
o Examples:
 Rules: IF-THEN statements (for association or classification).
 Tables: Summarized data tables, pivot tables.
 Graphs/Visualizations: Decision trees, cluster plots, bar charts, line
graphs.
 Reports: Textual summaries or detailed reports.
o Role: Ensures the results are delivered in a way that is easy for the end-user to
understand, interpret, and utilize.
o Example: "Display the results as rules and graphs."
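One way to picture the five primitives is as the fields of a single task specification. The sketch below is only illustrative: the class name MiningTask, its field names, and the helper method are assumptions made for this example and do not belong to any real data mining library.

```python
from dataclasses import dataclass, field


@dataclass
class MiningTask:
    """Hypothetical container for the five primitives of a mining request."""
    task_relevant_data: str                 # 1. which data to mine
    knowledge_kind: str                     # 2. functionality, e.g. "association"
    background_knowledge: dict = field(default_factory=dict)  # 3. hierarchies
    thresholds: dict = field(default_factory=dict)            # 4. interestingness
    output_format: tuple = ("rules",)       # 5. presentation

    def missing_primitives(self):
        """Names of primitives that were left empty."""
        return [name for name, value in vars(self).items() if not value]


task = MiningTask(
    task_relevant_data="sales WHERE category = 'electronics' AND year = 2024",
    knowledge_kind="association",
    background_knowledge={"age": "age_group_hierarchy"},
    thresholds={"min_support": 0.02, "min_confidence": 0.6},
    output_format=("rules", "graphs"),
)
print(task.missing_primitives())
```

A specification that omits background knowledge or thresholds is still valid in this sketch; the helper simply reports which primitives the user left to system defaults.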

DATA MINING QUERY LANGUAGE

The Data Mining Query Language (DMQL) is a conceptual language that was proposed to
allow users to formally specify data mining tasks. It's designed to be a high-level, declarative
language, much like SQL for traditional databases, but tailored specifically for the complexities
of knowledge discovery.

Purpose and Motivation for DMQL:

The primary motivations behind the proposal of a DMQL were:

1. Standardization: To provide a common framework for specifying data mining tasks
across different systems, promoting interoperability and easier learning.
2. Ad-hoc and Interactive Mining: To enable users to perform flexible, on-the-fly data
mining queries, rather than relying solely on pre-defined algorithms or complex
programming.
3. Integration with Database Systems: To seamlessly integrate data mining capabilities
with existing database management systems (DBMS) or data warehouses, allowing users
to leverage familiar SQL-like syntax.
4. Abstraction: To abstract away the complexities of underlying algorithms, allowing users
to focus on what knowledge they want to find, rather than how the algorithms work.
5. User-Friendliness: To make data mining more accessible to a wider range of users,
including business analysts, who might not be expert programmers.


DMQL and Data Mining Primitives:

As described in the Primitives section above, DMQL is built upon the concept of data mining
primitives. It provides the syntax to define each of these five essential components of a data mining task:

1. Task-Relevant Data Specification:

o Purpose: To define the specific data subset to be mined.
o DMQL Syntax (Conceptual):

Code snippet

USE DATABASE <database_name> OR USE DATA WAREHOUSE <data_warehouse_name>
FROM <table(s)/relation(s)/cube(s)>
MINE <attribute(s)/dimension(s)_list>
[WHERE <condition(s)>]
[ORDER BY <order_list>]
[GROUP BY <grouping_list>];

o Example: FROM Sales_Data MINE product_category, amount WHERE region = 'North';
2. Kind of Knowledge to be Mined (Functionality) Specification:
o Purpose: To specify the data mining task or type of pattern to be discovered.
o DMQL Syntax (Conceptual):

Code snippet

MINE AS <functionality_type> [AS <pattern_name>]

o Examples:
 MINE AS CHARACTERIZATION
 MINE AS DISCRIMINATION
 MINE AS ASSOCIATION RULES
 MINE AS CLASSIFICATION
 MINE AS CLUSTERING
 MINE AS PREDICTION
3. Background Knowledge (Concept Hierarchy) Specification:
o Purpose: To provide concept hierarchies or other domain knowledge to guide the
mining process and allow for mining at different levels of abstraction.
o DMQL Syntax (Conceptual):

Code snippet

DEFINE HIERARCHY <hierarchy_name> FOR <attribute_name> ON <table_name> AS <hierarchy_definition>;

o Example (Schema Hierarchy): DEFINE HIERARCHY Geo_Hierarchy FOR city
ON Customer_Data AS [city, province, country];


o Example (Set-Grouping Hierarchy): DEFINE HIERARCHY Age_Group_Hierarchy
FOR age ON Customer_Data AS level1: {young, middle_aged, senior} < level0: all;
4. Interestingness Measures and Thresholds Specification:
o Purpose: To specify criteria for evaluating the "interestingness" or significance of
discovered patterns, helping to filter out trivial results.
o DMQL Syntax (Conceptual):

Code snippet

WITH THRESHOLD FOR <measure_name> IS <value>;

o Examples:
 For association rules: WITH THRESHOLD FOR min_support IS 0.01;
 WITH THRESHOLD FOR min_confidence IS 0.5;
 For classification: WITH THRESHOLD FOR min_accuracy IS 0.85;
5. Expected Output Format Specification:
o Purpose: To specify how the discovered patterns should be presented to the user
(e.g., rules, tables, graphs).
o DMQL Syntax (Conceptual):

Code snippet

DISPLAY AS <result_form>;

o Examples: DISPLAY AS RULES;, DISPLAY AS TABLE;, DISPLAY AS TREE;, DISPLAY AS GRAPHS;
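The two hierarchy kinds from item 3 above (a schema hierarchy over city, province, country and a set-grouping hierarchy over age) can be sketched as plain Python mappings. The city names, the age cut-offs, and the function names are illustrative assumptions, not part of DMQL.

```python
from collections import Counter

# Hypothetical schema hierarchy: city -> province -> country.
CITY_TO_PROVINCE = {"Vancouver": "British Columbia", "Victoria": "British Columbia",
                    "Toronto": "Ontario"}
PROVINCE_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada"}


def age_group(age):
    """Set-grouping hierarchy for age; the cut-offs are illustrative assumptions."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle_aged"
    return "senior"


def roll_up(city, level):
    """Generalize a city to the requested level of the geographic hierarchy."""
    if level == "city":
        return city
    province = CITY_TO_PROVINCE[city]
    if level == "province":
        return province
    if level == "country":
        return PROVINCE_TO_COUNTRY[province]
    raise ValueError(f"unknown level: {level}")


customers = [("Vancouver", 24), ("Victoria", 41), ("Toronto", 67), ("Toronto", 35)]

# Counting at a higher level of abstraction is the "roll-up" operation.
by_province = Counter(roll_up(city, "province") for city, _ in customers)
by_age_group = Counter(age_group(age) for _, age in customers)
print(dict(by_province))
print(dict(by_age_group))
```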

Example of a Complete DMQL Query (Conceptual):

Let's imagine we want to find association rules for items frequently purchased together by young
customers (under 30) in electronics stores, with a minimum support of 1% and a minimum
confidence of 50%, displaying the results as rules and a bar chart.

Code snippet
USE DATABASE AllElectronics_DB;

// Task-Relevant Data
FROM Transactions, Customers
MINE item_name, age, customer_id
WHERE Transactions.customer_id = Customers.customer_id
AND age < 30
AND store_type = 'Electronics';

// Kind of Knowledge to be Mined

MINE AS ASSOCIATION RULES
FOR item_name;

// Background Knowledge (Implicit or Explicit Hierarchy)
// (Assuming item_name is at a sufficiently low level; otherwise define a hierarchy)
// WITH CONCEPT HIERARCHY FOR item_name USE Product_Category_Hierarchy;

// Interestingness Measures and Thresholds

WITH THRESHOLD FOR min_support IS 0.01;
WITH THRESHOLD FOR min_confidence IS 0.5;

// Expected Output Format

DISPLAY AS RULES, GRAPH BAR;
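Since DMQL is conceptual, the query above cannot be run as written. The following Python sketch shows roughly what such a request asks for, under stated assumptions: the baskets are made-up data, only rules between single items are considered, and a brute-force pair search stands in for a real algorithm such as Apriori or FP-Growth.

```python
from itertools import combinations

# Hypothetical baskets: (customer age, set of items bought together).
transactions = [
    (24, {"laptop", "mouse"}),
    (27, {"laptop", "mouse", "bag"}),
    (22, {"laptop", "bag"}),
    (29, {"mouse"}),
    (45, {"laptop", "printer"}),   # excluded by the age < 30 condition
]

MIN_SUPPORT = 0.01
MIN_CONFIDENCE = 0.5

# Task-relevant data: keep only baskets of customers under 30.
baskets = [items for age, items in transactions if age < 30]


def support(itemset):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)


rules = []
items = sorted(set().union(*baskets))
for a, b in combinations(items, 2):
    pair_support = support({a, b})
    if pair_support < MIN_SUPPORT:
        continue
    # Each frequent pair yields two candidate rules: a => b and b => a.
    for lhs, rhs in ((a, b), (b, a)):
        confidence = pair_support / support({lhs})
        if confidence >= MIN_CONFIDENCE:
            rules.append((lhs, rhs, round(pair_support, 2), round(confidence, 2)))

for lhs, rhs, sup, conf in rules:
    print(f"{lhs} => {rhs}  (support={sup}, confidence={conf})")
```

Support is the share of baskets containing both items, and confidence is that share divided by the support of the left-hand item, which is why a rule and its reverse can pass or fail the threshold independently.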

DMQL vs. SQL:

While DMQL borrows syntax from SQL, their fundamental purposes are different:

Feature | SQL (Structured Query Language) | DMQL (Data Mining Query Language)
Purpose | Data Manipulation and Retrieval: Primarily for accessing, managing, and manipulating existing data in relational databases. | Knowledge Discovery: Primarily for discovering hidden patterns, relationships, and insights from large datasets.
Operations | SELECT, INSERT, UPDATE, DELETE, JOIN, GROUP BY, ORDER BY, CREATE TABLE, etc. | MINE AS (classification, association, clustering, etc.), WITH THRESHOLD, DEFINE HIERARCHY, DISPLAY AS.
Output | Structured tables, rows, columns of existing data. | Discovered patterns (rules, models, clusters), summaries, visualizations.
What it Does | Declarative: Specifies what data to retrieve or modify. | Declarative: Specifies what knowledge to discover.
Underlying Logic | Relational algebra, set theory. | Machine learning algorithms, statistical models, pattern recognition techniques.
Standardization | Highly standardized (SQL-92, SQL:1999, etc.). | No universally adopted standard; primarily a conceptual proposal.
Focus | Data storage, integrity, efficient retrieval. | Pattern identification, predictive modeling, anomaly detection.

ARCHITECTURE OF DATA MINING SYSTEMS

The architecture of data mining systems refers to the structural organization of various
components that work together to perform the process of extracting knowledge and patterns from
data. A robust architecture is crucial for efficient, scalable, and effective data mining, especially
when dealing with large volumes of diverse data.


Core Components of a Data Mining System

A typical data mining system, regardless of its specific implementation (e.g., a commercial tool,
an open-source framework, or a custom-built solution), generally comprises the following key
components:

1. Data Sources / Information Repository:

o Function: This is the foundation, providing the raw data for analysis. Data
mining requires large volumes of historical and current data.
o Types: Can include a wide variety of sources:
 Databases: Relational, object-oriented, hierarchical, NoSQL databases.
 Data Warehouses: Integrated, subject-oriented, time-variant, non-volatile
data collections specifically designed for analytical processing (OLAP).
 Data Lakes: Store vast amounts of raw data in its native format.
 Flat Files/Spreadsheets: Simple text files, CSVs, Excel files.
 Web Data: Web logs, social media data, web page content.
 Specialized Repositories: Sensor data, image data, video, audio, text
documents, etc.
o Preprocessing: Data from these sources often undergoes initial cleaning,
integration, and selection before being loaded into a more structured environment
for mining.
2. Database/Data Warehouse Server:
o Function: Manages the data in the information repository and efficiently retrieves
relevant data based on the data mining requests.
o Role: Handles data storage, indexing, querying (e.g., SQL queries), and
sometimes basic aggregations. It's the gateway for the data mining engine to
access the required data subset.
o Optimization: Often leverages database optimization techniques like indexing
and query planners for faster data access.
3. Data Mining Engine:
o Function: This is the core of the data mining system, containing the actual
algorithms and computational logic for discovering patterns.
o Modules: Comprises a set of functional modules, each dedicated to a specific
data mining task:
 Classification Module: For building predictive models for categorical
labels (e.g., decision trees, neural networks, support vector machines).
 Regression/Prediction Module: For building models to predict
continuous numerical values.
 Clustering Module: For grouping similar data points (e.g., K-Means,
hierarchical clustering).
 Association Rule Mining Module: For finding relationships between
items (e.g., Apriori, FP-Growth).
 Characterization & Discrimination Module: For summarizing and
comparing data classes.
 Outlier Analysis Module: For detecting anomalies.

 Trend and Evolution Analysis Module: For analyzing time-series data
and sequential patterns.
o Interaction: Interacts with the Knowledge Base for guidance and with the Pattern
Evaluation Module for feedback.
4. Knowledge Base:
o Function: Stores and provides domain-specific knowledge and general
background information that helps guide the data mining process and evaluate the
interestingness of discovered patterns.
o Contents:
 Concept Hierarchies: Tree-like structures that define relationships
between concepts at different levels of abstraction (e.g., city → state →
country).
 User Beliefs/Experience: Can incorporate explicit rules or thresholds
defined by domain experts.
 Interestingness Thresholds: Predefined or user-specified thresholds for
measures like support, confidence, accuracy, novelty.
 Metadata: Information about the data structure, types, and constraints.
o Role: Makes the mining process more intelligent and the results more relevant by
incorporating prior knowledge.
5. Pattern Evaluation Module:
o Function: Assesses the "interestingness" and validity of the patterns discovered
by the Data Mining Engine. It filters out trivial or uninteresting patterns.
o Methods: Uses various interestingness measures (often from the Knowledge
Base) to evaluate patterns based on their statistical significance, novelty,
comprehensibility, or utility.
o Interaction: Can interact with the Data Mining Engine to refine the search
process, for instance, by adjusting thresholds or focusing on specific areas.
6. Graphical User Interface (GUI) / User Interaction Module:
o Function: Provides the communication bridge between the user and the data
mining system. It allows users to specify data mining tasks and visualize the
results.
o Features:
 Query Specification: Allows users to define data mining primitives (data
selection, task type, interestingness measures) through intuitive forms,
visual builders, or query language interfaces (like a conceptual DMQL).
 Result Presentation: Displays the discovered patterns in various human-
understandable formats (e.g., rules, tables, decision trees, graphs, charts,
3D visualizations).
 Interactive Exploration: Supports drill-down, roll-up, slice, and dice
operations, enabling users to explore patterns at different levels of detail.

Types of Data Mining System Architectures (Coupling Levels)

The integration level between the data mining engine and the database/data warehouse system is
a crucial architectural decision, impacting performance, scalability, and flexibility:


1. No Coupling:
o Description: The data mining system operates entirely independently. It extracts
data from raw files or other sources and performs all computations internally
without leveraging any database features.
o Pros: Simple to implement for very small datasets.
o Cons: Highly inefficient, lacks scalability, cannot handle large datasets, misses
out on powerful database management capabilities (indexing, query optimization,
security). Generally considered a poor architecture for practical data mining.
2. Loose Coupling:
o Description: The data mining system retrieves data from a database or data
warehouse (e.g., via SQL queries) but then loads this data into its own memory or
file system for processing. All data mining algorithms run within the separate
mining system.
o Pros: More flexible than no coupling, can access structured data.
o Cons: Significant data transfer overhead (extracting and loading), performance
bottlenecks for very large datasets (memory limitations), duplicate data, lacks
tight integration with database features. Often found in older or simpler desktop-
based tools.
3. Semi-Tight Coupling:
o Description: The data mining system leverages some advanced functionalities of
the database or data warehouse. This might include using database features for
data preprocessing steps like sorting, indexing, aggregation, or complex joins.
Intermediate results might be stored back in the database for better performance.
o Pros: Improved performance and efficiency by offloading some tasks to the
optimized database system.
o Cons: Still some data movement overhead, not fully seamless integration, limited
to the database's built-in functionalities.
4. Tight Coupling:
o Description: The data mining system is deeply integrated or even embedded
within the database management system or data warehouse itself. Data mining
algorithms run directly on the data within the database environment, leveraging its
native capabilities for data management, query processing, and optimization. Data
mining results can be stored directly back into the database.
o Pros:
 High Performance: Minimizes data movement.
 Scalability: Leverages the database's ability to handle large datasets.
 Integrated Information Management: Seamlessly uses database features
like security, concurrency control, and backup/recovery.
 Efficiency: Optimizes query execution and data access.
o Cons: Requires the database system to have native data mining capabilities or
robust extensions, can be less flexible for custom or cutting-edge algorithms not
supported by the database.
o Three Tiers (often for Tight Coupling):
 Data Layer: The database or data warehouse system, acting as the
interface to all data sources and storing both raw and mined data.

 Data Mining Application Layer: Retrieves data, performs necessary
transformations, and applies various data mining algorithms.
 Front-End Layer: The user interface responsible for presenting results in
a user-friendly, visualized format.
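The difference between loose and semi-tight coupling can be sketched with Python's built-in sqlite3 module. The purchases table and its rows are hypothetical; the sketch only contrasts where the counting work happens and how much data crosses the database boundary.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, item TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("ann", "laptop"), ("ann", "mouse"), ("bob", "laptop"), ("cy", "mouse"),
     ("cy", "laptop")],
)

# Loose coupling: fetch every row, then count item frequencies in the
# mining program's own memory.
rows = conn.execute("SELECT item FROM purchases").fetchall()
loose_counts = Counter(item for (item,) in rows)

# Semi-tight coupling: push the aggregation down to the database and
# transfer only one row per item.
semi_tight_counts = dict(
    conn.execute("SELECT item, COUNT(*) FROM purchases GROUP BY item").fetchall()
)
conn.close()

print(dict(loose_counts), semi_tight_counts)
```

Both routes give the same counts; the second moves fewer rows out of the database, which is the performance argument made for tighter coupling above.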

CONCEPT DESCRIPTION

"Concept Description" in data mining refers to the task of summarizing and
characterizing a given set of data. It's about providing a concise, high-level, and understandable
description of the attributes and properties of a specific group or class of data objects.

Concept description is often one of the first analytical steps in exploring a new dataset,
providing a foundational understanding before delving into more complex predictive or
prescriptive analytics.

Two Main Types of Concept Description:

Concept Description is primarily divided into two functionalities:

1. Data Characterization: Summarizes the general features of a target class of data.

2. Data Discrimination (or Class Comparison): Compares the general features of a target
class with those of one or a set of contrasting classes.

1. Data Characterization

Goal: To provide a concise and precise summary of the characteristics of a user-specified target
class. It essentially answers "What are the common properties of this group?"

Process: Data characterization often involves a process of data generalization and
summarization. The goal is to move from low-level, detailed data to higher-level, more abstract
concepts. This can be achieved through:

 Attribute-Oriented Induction (AOI): This is a classic method for data generalization in
relational databases. It works by:
1. Collecting Task-Relevant Data: Selecting the data tuples that belong to the
target class.
2. Attribute Removal: Eliminating attributes that are not relevant to the
characterization.
3. Generalization by Attribute Removal: If an attribute has a large number of
distinct values and either no concept hierarchy exists for it or its higher-level
concepts are already expressed by other attributes, it is removed.
4. Generalization by Climbing Concept Hierarchies: For attributes with concept
hierarchies, values are replaced by their higher-level concepts. This is a core step
for summarization (e.g., specific dates → months → quarters).

5. Aggregation of Generalized Data: After generalization, many tuples might
become identical. These identical tuples are merged, and their counts are
recorded.
6. Result Presentation: The generalized, summarized data is presented.
 Data Cube Approach (for Data Warehouses): If the data resides in a data warehouse,
characterization can be highly efficient by performing OLAP (Online Analytical
Processing) operations like roll-up (generalizing to higher levels of hierarchy) and
drill-down (moving to lower levels for more detail) on predefined data cubes.
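
The AOI steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the customer records, the `CITY_TO_COUNTRY` mapping, and the age thresholds in `age_to_group` are hypothetical and are not taken from any real dataset or library.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> country (assumption for illustration).
CITY_TO_COUNTRY = {"Chennai": "India", "Mumbai": "India", "Vancouver": "Canada"}


def age_to_group(age):
    """Generalize a numeric age to a coarse age group (thresholds are assumed)."""
    if age < 30:
        return "young"
    if age < 50:
        return "middle_aged"
    return "senior"


def attribute_oriented_induction(rows):
    """Generalize task-relevant tuples and merge identical ones with counts."""
    generalized = []
    for row in rows:
        # Attribute removal: 'name' has many distinct values and no hierarchy,
        # so it is simply not carried into the generalized tuple.
        # Attribute generalization: climb the hierarchy for 'age' and 'city'.
        generalized.append((age_to_group(row["age"]), CITY_TO_COUNTRY[row["city"]]))
    # Aggregation: identical generalized tuples are merged and counted.
    return Counter(generalized)


customers = [
    {"name": "Asha", "age": 42, "city": "Chennai"},
    {"name": "Ben", "age": 45, "city": "Mumbai"},
    {"name": "Chen", "age": 27, "city": "Vancouver"},
    {"name": "Divya", "age": 38, "city": "Chennai"},
]

generalized_relation = attribute_oriented_induction(customers)
for (age_group, country), count in generalized_relation.most_common():
    print(age_group, country, count)
```

The resulting counter plays the role of the generalized relation: each key is a generalized tuple and each value is the number of original tuples it represents.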

Output (Forms of Characterization):

 Characteristic Rules: These are IF-THEN rules that describe the general properties of
the class.
o Example: customer_group(X) = 'high_spender' $\implies$ age(X) =
'middle_aged' $\land$ income(X) = 'high' $\land$ profession(X) =
'executive'
 Generalized Relation/Table: A table summarizing the counts or percentages of
generalized attributes.
 Charts/Graphs: Visualizations (e.g., bar charts showing percentage distribution of age
groups, pie charts for education levels).

Example: Characterizing "customers who frequently buy organic products."

 Result might be: "Most frequent organic buyers are 'middle-aged' (35-50 years), have
'high' income, and live in 'suburban' areas. They tend to buy 'fresh produce' and 'dairy'
items."

2. Data Discrimination (or Class Comparison)

Goal: To compare the general characteristics of a target class with one or a set of contrasting
classes. It highlights the features that distinguish the target class from the contrasting class(es). It
answers "What makes X different from Y (and Z)?"

Process:

 Input: Requires at least two classes of data: a target class and one or more contrasting
classes.
 Data Collection: Gather task-relevant data for all classes involved in the comparison.
 Generalization to a Comparable Level: Similar to characterization, data from all
classes is generalized to a level of abstraction that allows for meaningful comparison.
This ensures that you're comparing "apples to apples."
 Measures for Comparison: Statistical measures are used to quantify the differences
between the distributions of attributes across the classes.

Statistical Measures for Class Comparison:

 For Numerical Attributes (e.g., income, age):

o Measures of Central Tendency: Compare means, medians, or modes.
 Example: Mean income of target class vs. mean income of contrasting
class.
o Measures of Dispersion: Compare standard deviations, variances, or interquartile
ranges (IQR).
 Example: The spread of ages among target vs. contrasting groups.
o Quantile Analysis: Compare the distribution across different percentiles (e.g., the
25th, 50th, 75th percentile of purchase amount).
o Statistical Hypothesis Tests:
 t-test: To compare the means of two groups.
 ANOVA (Analysis of Variance): To compare the means of three or more
groups.
o Visualization: Box plots, overlapping histograms.
 For Categorical Attributes (e.g., education, profession, product category):
o Frequency/Percentage Distribution: Compare the counts or percentages of each
category within the attribute for the target vs. contrasting classes.
o Contingency Tables (Cross-tabulations): Show the joint distribution of two or
more categorical variables.
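o Chi-Square Test (χ² test): A statistical test to determine if there's a significant
association between two categorical variables (here, an attribute and the class label),
indicating if their distributions are dependent. A larger χ² value, relative to the
degrees of freedom, gives stronger evidence that the classes differ on that attribute.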
o Visualization: Bar charts (side-by-side or stacked).
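
As a rough illustration of the chi-square test for a categorical attribute, the sketch below computes the Pearson χ² statistic for a small contingency table. The counts are invented for the example, and the function name `chi_square_statistic` is our own; turning the statistic into a p-value would additionally need the χ² distribution (available in statistical libraries).

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows; each row holds the observed counts for one class.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if class and attribute were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts. Rows: target class, contrasting class.
# Columns: education = 'graduate', education = 'non-graduate'.
observed = [[30, 10], [20, 40]]
chi2, dof = chi_square_statistic(observed)
print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}")
```

With one degree of freedom the usual 5% critical value is about 3.84, so a statistic well above that suggests the attribute distinguishes the two classes.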

Output (Forms of Discrimination):

 Discriminant Rules: IF-THEN rules that highlight the distinguishing features.

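o Example: service_complaints(X) = 'high' $\land$ call_duration(X) = 'low'
$\implies$ customer_churned(X) = 'yes' (distinguishing churned from
non-churned customers; the condition is written on the left because a
discriminant rule states a sufficient condition for belonging to the target class).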
 Comparison Tables/Reports: Tables summarizing the statistical differences for each
attribute across the classes.
 Visualizations: Side-by-side charts or interactive dashboards that visually compare the
characteristics.

Example: Discriminating between "customers who churned" and "customers who did not
churn."

 Result might be: "Churned customers, compared to non-churned customers, show a
significantly higher average number of service calls, lower average monthly data usage,
and a preference for older mobile phone models."

Importance of Concept Description:

 Business Intelligence: Helps managers and analysts understand their customer base,
product performance, market segments, and operational efficiency.
 Decision Making: Provides insights that can directly inform strategic decisions (e.g.,
targeted marketing campaigns, product development, service improvements).
 Foundation for Further Mining: The generalized and summarized descriptions can
often serve as inputs for more complex data mining tasks like classification or prediction,
as they reduce noise and irrelevant details.
 Explainability: Provides human-readable summaries, making the knowledge discovered
transparent and actionable.

DATA GENERALIZATION AND SUMMARIZATION

Data generalization and summarization are fundamental techniques in data mining,
particularly within the realm of descriptive data mining. They aim to present a concise,
understandable, and high-level overview of a large dataset, moving from specific details to
broader concepts.

While often used interchangeably or together, there's a subtle distinction:

 Data Generalization: Focuses on replacing low-level, primitive data values with higher-
level, more abstract concepts, typically by using concept hierarchies. It's about reducing
the granularity or level of detail of individual data points.
 Data Summarization: Focuses on aggregating and presenting the overall characteristics
of a dataset using various descriptive measures and visualizations. It's about getting a
compact description of the data's properties.

1. Data Generalization

The process of abstracting or "rolling up" detailed data into higher-level concepts. It reduces
the number of distinct values for attributes, making the data less granular and easier to
comprehend at a broader level.

 Reduces Data Volume: By replacing many specific values with fewer general concepts,
the dataset size can be reduced.
 Simplifies Data Understanding: Makes it easier for humans to grasp broad patterns and
trends without getting overwhelmed by details.
 Hides Sensitive Information: Can be used as a privacy-preserving technique by blurring
specific identifiers (e.g., exact birth date to birth year or age range).

 Prepares Data for Analysis: Many data mining algorithms perform better or become
more efficient on generalized data.

Key Mechanism: Concept Hierarchies

Data generalization relies heavily on concept hierarchies. A concept hierarchy defines a
sequence of mappings from a set of low-level concepts to higher-level, more general concepts.

 Example 1 (Geographic): Street → City → State → Country

 Example 2 (Time): Day → Month → Quarter → Year
 Example 3 (Numerical/Age): Specific Age (e.g., 25) → Age Group (e.g., "Youth") →
Life Stage (e.g., "Adult")
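
A concept hierarchy can be represented as a function that maps a value to its ancestor at a chosen level. The sketch below does this for the time hierarchy of Example 2; the function name `generalize_date` and the output label formats are assumptions made for illustration.

```python
from datetime import date


def generalize_date(d, level):
    """Climb the time hierarchy Day -> Month -> Quarter -> Year."""
    if level == "day":
        return d.isoformat()
    if level == "month":
        return f"{d.year}-{d.month:02d}"
    if level == "quarter":
        # Months 1-3 map to Q1, 4-6 to Q2, and so on.
        return f"{d.year}-Q{(d.month - 1) // 3 + 1}"
    if level == "year":
        return str(d.year)
    raise ValueError(f"unknown level: {level}")


sale_date = date(2024, 8, 15)
for level in ("day", "month", "quarter", "year"):
    print(level, "->", generalize_date(sale_date, level))
```

Each step up the hierarchy reduces the number of distinct values the attribute can take, which is what makes tuple merging possible later.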

Techniques for Data Generalization:

 Attribute-Oriented Induction (AOI): A classic data generalization method primarily
for relational databases. It works by:
1. Task-Relevant Data Collection: Select the subset of data relevant to the concept
being described.
2. Attribute Removal: Eliminate irrelevant attributes or those with too many
distinct values to generalize meaningfully.
3. Attribute Generalization (Concept Hierarchy Ascension): For each remaining
attribute, replace its values with higher-level concepts by climbing up its concept
hierarchy. This is the core "generalization" step. If no hierarchy exists, one might
be automatically or manually generated (e.g., for numerical data, by binning).
4. Tuple Merging and Count Aggregation: After generalization, identical tuples
(rows) might appear. These are merged into a single tuple, and a count or
frequency is associated with it, representing the number of original tuples it
represents.
5. Result Presentation: The generalized relation is then presented as a characteristic
rule, table, or visualization.
 Data Cube Approach (OLAP Roll-up): When data is organized in a data cube (as in
data warehouses), generalization is efficiently performed using OLAP (Online
Analytical Processing) roll-up operations.
o Roll-up either climbs up a concept hierarchy within a data cube or reduces the
cube's dimensionality by aggregating a dimension away. For instance, rolling up
sales data from the city level to the country level, or from the day level to the month level.
o This approach pre-computes and stores various aggregations, allowing for very
fast generalization and summarization.
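
The two flavours of roll-up can be mimicked on a tiny in-memory "cube" stored as a dictionary. This is a minimal sketch, not how an OLAP engine is implemented; the sales figures, the `CITY_TO_COUNTRY` mapping, and both function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical location hierarchy: city -> country (assumption for this sketch).
CITY_TO_COUNTRY = {"Chennai": "India", "Mumbai": "India", "Vancouver": "Canada"}


def roll_up_location(cells):
    """Climb the location hierarchy: (city, month) cells become (country, month)."""
    totals = defaultdict(float)
    for (city, month), amount in cells.items():
        totals[(CITY_TO_COUNTRY[city], month)] += amount
    return dict(totals)


def drop_month(cells):
    """Roll up by dimension reduction: aggregate the month dimension away."""
    totals = defaultdict(float)
    for (location, _month), amount in cells.items():
        totals[location] += amount
    return dict(totals)


# Hypothetical base cells keyed by (city, month).
sales = {
    ("Chennai", "2024-01"): 100.0,
    ("Mumbai", "2024-01"): 150.0,
    ("Vancouver", "2024-01"): 80.0,
    ("Chennai", "2024-02"): 120.0,
}

by_country_month = roll_up_location(sales)
by_country = drop_month(by_country_month)
print(by_country_month)
print(by_country)
```

A data warehouse would typically precompute such aggregates so that a roll-up becomes a lookup rather than a scan.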

2. Data Summarization

The process of providing a compact and informative description of a dataset, highlighting its
key features and trends. It involves computing aggregated values and descriptive statistics. While
generalization is about changing the level of detail of individual data points, summarization is
about deriving overview statistics from the data.

 Quick Insights: Provides a fast way to understand the main characteristics of data (e.g.,
what's the average, what's the typical range?).
 Exploratory Data Analysis (EDA): A crucial step in EDA to get a feel for the data's
distribution, central tendencies, and variability.
 Report Generation: Forms the basis for management reports and dashboards.
 Identifies Normality/Anomalies: Helps in spotting what's "normal" in the data, which in
turn can highlight outliers or unusual patterns.

Techniques for Data Summarization:

 Descriptive Statistical Measures:

o Measures of Central Tendency:
 Mean: Average value (for numerical data).
 Median: Middle value when data is sorted (less sensitive to outliers).
 Mode: Most frequent value (for numerical or categorical data).
o Measures of Dispersion/Spread:
 Range: Difference between max and min.
 Variance & Standard Deviation: How spread out the data is around the
mean.
 Interquartile Range (IQR): Range of the middle 50% of the data.
o Measures of Shape:
 Skewness: Degree of asymmetry of the distribution.
 Kurtosis: Heaviness of the tails of the distribution (often described informally
as its peakedness or flatness).
o Frequency Distributions: Counts or percentages of occurrences for each value or
bin (for categorical or binned numerical data).
 Aggregation Functions:
o COUNT(): Number of occurrences.
o SUM(): Total of numerical values.
o AVG(): Average of numerical values.
o MIN()/MAX(): Smallest/Largest value.
o Other Aggregate Functions: percentile, stddev, etc.
 Data Visualization:
o Histograms: Show the distribution of a single numerical variable.
o Bar Charts: Show frequencies or comparisons for categorical variables.
o Pie Charts: Show proportions of categories.
o Box Plots: Display summary statistics (median, quartiles, outliers) for numerical
data, useful for comparing distributions across groups.
o Scatter Plots: Show relationships between two numerical variables.
o Line Graphs: Show trends over time.
 Dimensionality Reduction Techniques (often for more advanced summarization):
While primarily used for data reduction, some dimensionality reduction techniques can
also be viewed as creating summarized representations of the data by capturing its most
important variance in fewer dimensions.
o Principal Component Analysis (PCA): Transforms data into a new set of
uncorrelated variables (principal components) that capture most of the original
data's variance.

o Singular Value Decomposition (SVD): Similar to PCA, often used for matrix
factorization and dimensionality reduction.
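
The aggregation functions listed above (COUNT, SUM, AVG, MIN, MAX) can be illustrated with a small group-by in plain Python. The records and the helper name `summarize_by_group` are hypothetical; in practice the same summary would usually come from an SQL `GROUP BY` query or a data-frame library.

```python
from collections import defaultdict


def summarize_by_group(records, group_key, value_key):
    """Compute COUNT, SUM, AVG, MIN and MAX of `value_key` per `group_key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[group_key]].append(record[value_key])
    return {
        group: {
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "min": min(values),
            "max": max(values),
        }
        for group, values in groups.items()
    }


# Hypothetical sales records.
sales = [
    {"category": "dairy", "amount": 40.0},
    {"category": "dairy", "amount": 60.0},
    {"category": "produce", "amount": 25.0},
]

summary = summarize_by_group(sales, "category", "amount")
for category, stats in summary.items():
    print(category, stats)
```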

Relationship between Generalization and Summarization:

 Generalization often precedes or enables summarization: You might first generalize
raw transactional data (e.g., individual sales) to a higher level (e.g., monthly sales by
product category) using concept hierarchies. Then, you can summarize these generalized
data points (e.g., calculate the average monthly sales for a product category).
 Summarization can be applied to both raw and generalized data: You can calculate
the mean age of individual customers (raw data) or the average income of customers
within a generalized "young" age group.
 Both are key for "Concept Description": They are the primary tools used in data
mining to characterize (describe a single class) and discriminate (compare multiple
classes) by providing a high-level, aggregate view.

ANALYTICAL CHARACTERIZATION

Analytical characterization in data mining is a sophisticated form of data
characterization that goes beyond simple summarization. While basic characterization focuses
on merely generalizing and aggregating data to provide a high-level description, analytical
characterization emphasizes the analysis of attribute relevance and the dispersion of data
within the target class. It aims to identify which attributes are most significant in defining the
concept and how those attributes are distributed.

Essentially, analytical characterization incorporates statistical analysis and sometimes
dimensionality reduction techniques into the characterization process to provide a more
insightful and less verbose description.

Key Aspects of Analytical Characterization:

1. Analysis of Attribute Relevance:

o This is a core component that distinguishes analytical characterization. It's about
determining which attributes (features or dimensions) are truly important and
discriminative for describing the target class, and which can be ignored or are
redundant.
 Reduces Complexity: Data often has many attributes, some of which are
irrelevant or weakly relevant. Identifying and removing them simplifies
the description.
 Improves Readability: A concise description focuses on the most
impactful characteristics.
 Guides Further Analysis: Highlights the key dimensions for more in-
depth study.
o Methods for Attribute Relevance Analysis:

 Statistical Measures:
 Correlation Analysis: Measures the linear relationship between
attributes and the target concept.
 Chi-Square Test (χ²): For categorical attributes, assesses if there's
a significant association between the attribute and the class.
 Information Gain / Gini Index: Used to evaluate how much an
attribute reduces uncertainty (entropy) or impurity regarding the
class label. (Commonly used in decision tree algorithms, but
applicable here).
 ANOVA (Analysis of Variance): For numerical attributes, to see
if there's a statistically significant difference in the mean of the
attribute across different values of a categorical class.
 Filter Methods: These evaluate attribute relevance based on intrinsic
properties of the data, independent of any specific mining algorithm.
 Wrapper Methods: These evaluate subsets of attributes by training and
testing a data mining model with them, essentially using the model's
performance as a relevance metric.
 Dimensionality Reduction Techniques: While primarily for data
reduction, methods like Principal Component Analysis (PCA) can
indirectly indicate attribute relevance by showing which original attributes
contribute most to the principal components that capture significant
variance.
2. Incorporation of Statistical Measures (Dispersion Analysis):
o Beyond simple counts and averages (which are part of basic summarization),
analytical characterization delves deeper into the statistical properties of the
attributes within the target class.
o Measures Used:
 Measures of Central Tendency: Mean, Median, Mode (to describe
typical values).
 Measures of Dispersion: Standard Deviation, Variance, Interquartile
Range (IQR) (to describe the spread or variability of data for numerical
attributes).
 Quantile Analysis: Examining percentiles (e.g., 25th, 50th, 75th) to
understand the distribution of values.
 Frequency Distributions: For categorical data, detailed frequency counts
or percentages for each category.
o Example: Instead of just saying "most customers are middle-aged," analytical
characterization might say "the average age of customers is 42.5 years, with a
standard deviation of 7.2 years, indicating a relatively tight clustering around the
mean. The median income is 75,000 INR, and 80% of customers fall within the
50,000 to 100,000 INR income bracket."
3. Automated Generalization and Refinement:
o Analytical characterization often aims for a more automated process to determine
the "right" level of generalization (using concept hierarchies) for different
attributes, ensuring the description is neither too detailed nor too vague.

o It might employ heuristics or algorithms to suggest optimal generalization levels
based on the relevance analysis.
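
One of the attribute relevance measures named above, information gain, can be computed directly from its entropy-based definition. The sketch below is illustrative only: the four records and the attribute names (`income`, `gender`, `cls`) are made up, and real relevance analysis would use far more data.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, attribute, label):
    """Reduction in class entropy obtained by partitioning rows on `attribute`."""
    base = entropy([row[label] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[label])
    # Weighted average entropy of the partitions.
    remainder = sum(len(p) / len(rows) * entropy(p) for p in partitions.values())
    return base - remainder


# Hypothetical generalized tuples: 'cls' marks target vs. contrasting class.
rows = [
    {"income": "high", "gender": "M", "cls": "target"},
    {"income": "high", "gender": "F", "cls": "target"},
    {"income": "low", "gender": "M", "cls": "other"},
    {"income": "low", "gender": "F", "cls": "other"},
]

gain_income = information_gain(rows, "income", "cls")
gain_gender = information_gain(rows, "gender", "cls")
print("income:", gain_income, "gender:", gain_gender)
```

Here `income` separates the classes perfectly while `gender` carries no information about the class, so a relevance analysis would keep the former and could drop the latter from the description.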

How it differs from Simple Summarization / Data Generalization:

| Feature | Simple Summarization / Data Generalization | Analytical Characterization |
|---|---|---|
| Primary Goal | Condense data, replace low-level concepts with high-level ones. | Discover significant and discriminating characteristics; analyze attribute relevance and data dispersion. |
| Focus | What are the aggregated values/generalized concepts? | Which attributes are important? How are they distributed within the class? Why are they descriptive? |
| Techniques Used | Attribute-oriented induction (climbing hierarchies), OLAP roll-up, basic aggregation (COUNT, SUM, AVG). | AOI/OLAP plus sophisticated statistical tests, attribute relevance measures (Information Gain, Chi-Square), dispersion analysis. |
| Output Detail | Generalized tables, characteristic rules with simple summaries. | More refined characteristic rules focusing on relevant attributes; includes statistical measures (SD, IQR, percentiles) and potentially visual dispersion analysis. |
| Depth of Analysis | Descriptive, high-level overview. | More in-depth, analytical understanding of the attribute's role and distribution. |
| Automation Level | Often relies on pre-defined hierarchies or user-driven OLAP. | Often involves automated analysis of attribute relevance and optimal generalization levels. |

MINING CLASS COMPARISON

Mining Class Comparison, also known as Data Discrimination, is a data mining task
that aims to identify the distinguishing characteristics between a target class and one or more
contrasting classes. Unlike data characterization, which describes a single class, class
comparison explicitly highlights what makes one group different from others.

The core idea is to find features or patterns that are prominent in the target class but significantly
less so (or absent) in the contrasting classes, or vice-versa.

Importance of Class Comparison:

 Targeted Marketing: Understand what distinguishes loyal customers from churned
customers to develop retention strategies.

 Fraud Detection: Compare fraudulent transactions with legitimate ones to identify
suspicious patterns.
 Disease Diagnosis: Discriminate between patients with a specific disease and healthy
individuals to find symptoms or risk factors.
 Market Research: Differentiate between successful and unsuccessful products to learn
from past experiences.
 Process Improvement: Compare efficient and inefficient operational processes to
identify bottlenecks.

The Process of Mining Class Comparison

The process generally involves the following steps:

1. Data Collection and Class Partitioning:

o Identify Target Class: The specific group you want to understand and
differentiate (e.g., "customers who bought product X").
o Identify Contrasting Class(es): One or more groups you want to compare
against the target class (e.g., "customers who did NOT buy product X," or
"customers who bought product Y").
o Comparability: It's crucial that the target and contrasting classes are comparable.
They should share similar attributes or dimensions, even if their values differ. For
example, comparing "sales in Mumbai" with "sales in Delhi" is comparable, but
comparing "sales data" with "employee data" is not, as their underlying
dimensions are different.
o Data Retrieval: Collect all relevant data from the database or data warehouse for
both the target and contrasting classes.
2. Dimension/Attribute Relevance Analysis (Optional but Recommended):
o If there are many attributes, some may not be relevant for distinguishing the
classes. This step identifies the most salient attributes.
o Methods: Statistical tests (like Chi-square for categorical attributes, t-tests or
ANOVA for numerical attributes), information gain, Gini index, or other feature
selection techniques can be used to rank or filter attributes based on their
discriminatory power. This helps in focusing the comparison on truly
differentiating factors.
3. Synchronous Generalization:
o This is a critical step. To ensure a fair comparison, the data for all classes (target
and contrasting) must be generalized to the same level of abstraction.
o Mechanism: Concept hierarchies are used to "roll up" detailed data into higher-
level concepts. For example, if comparing sales data from two different years,
both datasets should be generalized to the month level, or quarter level, rather
than comparing individual days from one year to months from another.
o Why synchronous? Comparing "sales in Vancouver" (city level) with "sales in
Canada" (country level) for the same period would be misleading. Both should be
at the city, state, or country level for a meaningful comparison.
4. Comparison of Generalized Data:

o Once the data for all classes is generalized to comparable levels, the next step is to
statistically compare the distributions of attributes across the classes.
o Statistical Measures Used:
 For Numerical Attributes (e.g., age, income, purchase amount):
 Measures of Central Tendency: Compare means, medians, or
modes. Are the average ages significantly different?
 Measures of Dispersion: Compare standard deviations, variances,
or interquartile ranges (IQR). Is one group's income much more
spread out than another's?
 Quantile Analysis: Compare distributions across quartiles or
percentiles.
 Hypothesis Testing: t-tests (for two groups) or ANOVA (for three
or more groups) to determine if observed differences in means are
statistically significant.
 For Categorical Attributes (e.g., profession, education level, product
category):
 Frequency/Percentage Distribution: Compare the counts or
percentages of each category within an attribute.
 Contingency Tables (Cross-tabulations): Show joint
distributions.
 Chi-Square Test (χ² test): To determine if there's a statistically
significant association between the attribute and the class. A larger
χ² value (for the same degrees of freedom) indicates the attribute is more discriminatory.
 Count% or Disparity Ratio: A common "contrasting measure"
used in output, showing the percentage of a generalized tuple in the
target class versus its percentage in the contrasting class(es).
5. Presentation of the Derived Comparison:
o The discovered discriminant features are presented in a clear and intuitive
manner.
o Forms of Output:
 Discriminant Rules: IF-THEN rules that highlight the differences.
 Example: service_complaints(X) = 'High' $\land$
contract_type(X) = 'Monthly' $\implies$ Customer(X) = 'Churned'
(in contrast to non-churned customers, who tend to have
service_complaints(X) = 'Low' $\land$
contract_type(X) = 'Annual').
 These rules often include a "d-weight" or "contrast measure" to
quantify the discriminatory power of the rule.
 Comparison Tables: Summarize the statistical differences for attributes
across classes.
 Graphs and Charts: Visualizations are highly effective for showing
comparisons:
 Side-by-side bar charts for categorical attribute distributions.
 Box plots for numerical attribute distributions.
 Histograms, overlaid or juxtaposed.

 Interactive Tools: Allow users to drill-down, roll-up, or slice and dice the
comparison to explore details or higher-level summaries.
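
The "d-weight" mentioned in step 5 can be sketched as follows, assuming the common textbook definition: for a generalized tuple, the d-weight is its count in the target class divided by its total count across the target and contrasting classes. The tuple counts and the function name `d_weights` are hypothetical.

```python
def d_weights(target_counts, contrast_counts):
    """d-weight of each generalized tuple: target count / (target + contrast count).

    Assumes the textbook definition of d-weight; both arguments map a
    generalized tuple to the number of original tuples it covers.
    """
    weights = {}
    for tup in set(target_counts) | set(contrast_counts):
        in_target = target_counts.get(tup, 0)
        in_contrast = contrast_counts.get(tup, 0)
        weights[tup] = in_target / (in_target + in_contrast)
    return weights


# Hypothetical generalized tuples: (service_complaints, contract_type).
churned = {("High", "Monthly"): 90, ("Low", "Annual"): 10}
not_churned = {("High", "Monthly"): 30, ("Low", "Annual"): 210}

weights = d_weights(churned, not_churned)
for tup, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(tup, round(weight, 3))
```

A d-weight close to 1 means the tuple occurs almost only in the target class, so it makes a strong discriminant rule; a d-weight close to 0 means it mainly describes the contrasting class.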

Example: Comparing High-Value Customers vs. Low-Value Customers

Target Class: High_Value_Customers (e.g., those with annual spending > ₹100,000)
Contrasting Class: Low_Value_Customers (e.g., those with annual spending < ₹20,000)

Possible Discriminant Findings (Analytical):

 Age: High_Value_Customers have a median age of 48, with most falling in the 40-55
age group, whereas Low_Value_Customers have a median age of 28, predominantly in
the 20-30 age group.
 Occupation: A significantly higher percentage of High_Value_Customers are Business
Owners or Executives (e.g., 60% vs 15%), while Low_Value_Customers are primarily
Students or Entry-level Employees (e.g., 70% vs 5%).
 Product Categories Purchased: High_Value_Customers show strong preference for
Luxury Goods and Electronics, while Low_Value_Customers primarily purchase
Groceries and Apparel (with differing brand tiers).
 Online vs. In-Store Shopping: High_Value_Customers conduct 80% of their purchases
Online (with free shipping often utilized), whereas Low_Value_Customers make 60% of
their purchases In-Store.

STATISTICAL MEASURES

Statistical measures are quantitative values used to summarize, describe, and analyze data. They
are fundamental to data mining, providing insights into data characteristics, distributions,
relationships, and significance. These measures help in understanding the raw data,
preprocessing it, evaluating patterns, and making informed decisions.

Here's a breakdown of common statistical measures, categorized by what they describe:

1. Measures of Central Tendency (Location)

These measures aim to describe the "center" or typical value of a dataset.

 Mean (Arithmetic Mean): The most common average, calculated by summing all values
and dividing by the number of values.
o Formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
o Use Case: Good for normally distributed numerical data. Sensitive to outliers.
o Example: Average salary of employees.
 Median: The middle value in a dataset when all values are ordered from smallest to
largest. If there's an even number of values, it's the average of the two middle values.

o Use Case: Less affected by outliers, suitable for skewed distributions or ordinal
data.
o Example: The median house price in a city.
 Mode: The value that appears most frequently in a dataset. A dataset can have one mode
(unimodal), two modes (bimodal), or more (multimodal), or no mode if all values are
unique.
o Use Case: Best for categorical or discrete data; indicates the most common
category or value.
o Example: The most popular car color.
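
A short sketch with Python's standard `statistics` module shows how the three measures react to an outlier. The salary figures (in thousands) are invented for illustration.

```python
import statistics

# Hypothetical salaries in thousands; the last value is a deliberate outlier.
salaries = [30, 32, 35, 35, 38, 40, 250]

mean_salary = statistics.mean(salaries)      # pulled upward by the outlier
median_salary = statistics.median(salaries)  # middle value of the sorted data
mode_salary = statistics.mode(salaries)      # most frequent value

print(f"mean={mean_salary:.2f} median={median_salary} mode={mode_salary}")
```

The mean lands far above what most employees earn, while the median and mode stay at 35, which is why the median is preferred for skewed data.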

2. Measures of Dispersion (Spread or Variability)

These measures describe how spread out or varied the data points are.

 Range: The difference between the highest and lowest values in a dataset.
o Formula: $\text{Range} = x_{\max} - x_{\min}$
o Use Case: Simple to calculate, but highly sensitive to outliers and only considers
two data points.
o Example: The range of test scores in a class.
 Variance (σ² or s²): The average of the squared differences from the mean. It quantifies
how much the data points deviate from the mean.
o Formula (Population): $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
o Formula (Sample): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
o Use Case: Provides a numerical measure of data spread. Units are squared,
making it less intuitive than standard deviation.
 Standard Deviation (σ or s): The square root of the variance. It's the most widely used
measure of spread because it's in the same units as the original data.
o Formula (Population): $\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
o Formula (Sample): $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$
o Use Case: Essential for understanding the typical distance of data points from the
mean. Crucial for hypothesis testing and confidence intervals.
o Example: A higher standard deviation in product delivery times means more
variability in delivery speed.
 Interquartile Range (IQR): The range of the middle 50% of the data. It's the difference
between the third quartile (Q3) and the first quartile (Q1).
o Formula: $\text{IQR} = Q_3 - Q_1$
o Use Case: Robust to outliers; indicates the spread of the central portion of the
data. Used in box plots to identify potential outliers.
 Quantiles/Percentiles: Values that divide a dataset into equal parts.
o Quartiles: Divide data into four equal parts (Q1, Q2=Median, Q3).
o Percentiles: Divide data into 100 equal parts. The P-th percentile is the value
below which P% of the observations fall.
o Use Case: Provides insights into data distribution and relative standing (e.g., a
student's score in the 90th percentile).
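
The dispersion measures above can be computed with the standard `statistics` module, as in the sketch below. The delivery times are hypothetical. Note that several conventions exist for computing quartiles; `statistics.quantiles` uses its default "exclusive" method here, so other tools may report slightly different Q1 and Q3 values for the same data.

```python
import statistics

# Hypothetical delivery times in days.
delivery_days = [2, 4, 4, 4, 5, 5, 7, 9]

value_range = max(delivery_days) - min(delivery_days)

population_variance = statistics.pvariance(delivery_days)  # divides by N
sample_variance = statistics.variance(delivery_days)       # divides by n - 1
population_std = statistics.pstdev(delivery_days)
sample_std = statistics.stdev(delivery_days)

# Quartiles: three cut points that split the sorted data into four parts.
q1, q2, q3 = statistics.quantiles(delivery_days, n=4)
iqr = q3 - q1

print("range:", value_range)
print("variance (population, sample):", population_variance, sample_variance)
print("std dev (population, sample):", population_std, sample_std)
print("Q1, Q2, Q3, IQR:", q1, q2, q3, iqr)
```

The sample variance is larger than the population variance because it divides by n - 1 rather than N, which corrects for estimating the mean from the same sample.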

3. Measures of Distribution Shape

These measures describe the symmetry and "peakedness" of the data distribution.

 Skewness: Measures the asymmetry of the probability distribution of a real-valued
random variable about its mean.
o Positive Skew (Right-skewed): Tail is longer on the right side; typically mode <
median < mean.
o Negative Skew (Left-skewed): Tail is longer on the left side; typically mean <
median < mode.
o Zero Skew: Characteristic of a symmetrical distribution (e.g., normal distribution);
for a symmetrical, unimodal distribution, mean = median = mode.
o Use Case: Important for understanding the nature of the data and choosing
appropriate statistical models.
 Kurtosis: Measures the "tailedness" of the probability distribution of a real-valued
random variable. It describes the shape of the distribution's peak and tails relative to a
normal distribution.
o Leptokurtic: High kurtosis, sharp peak, heavy tails (more outliers).
o Mesokurtic: Medium kurtosis (like a normal distribution).
o Platykurtic: Low kurtosis, flat peak, light tails (fewer outliers).
o Use Case: Helps in assessing the risk of extreme events (outliers).
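
Skewness and kurtosis can be computed from central moments. The sketch below uses the simple population (moment-based) definitions and reports excess kurtosis, i.e. kurtosis minus 3, so that a normal distribution scores 0; statistical packages often apply additional small-sample corrections, so their results may differ slightly. The function names and data are our own.

```python
def central_moment(values, k):
    """k-th central moment: the mean of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2 ** 2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]  # one large value stretches the right tail

print("symmetric:", skewness(symmetric), excess_kurtosis(symmetric))
print("right-skewed:", skewness(right_skewed), excess_kurtosis(right_skewed))
```

The evenly spaced data has zero skewness and negative excess kurtosis (platykurtic), while the second list has positive skewness because of its long right tail.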

4. Measures of Relationship/Association

These measures describe the strength and direction of the relationship between two or more
variables.

 Covariance: Measures the degree to which two variables change together. A positive
covariance indicates that they tend to move in the same direction; a negative covariance
indicates they tend to move in opposite directions.
o Use Case: Provides an initial idea of relationship, but its magnitude is not
standardized, making comparisons difficult.
 Correlation Coefficient (e.g., Pearson's r, Spearman's ρ): A standardized measure of
the strength and direction of the relationship between two variables.
o Pearson's r: Measures linear correlation between two numerical variables
(significance tests on r usually assume approximately normal data). Ranges
from -1 (perfect negative linear correlation) to +1 (perfect positive linear
correlation), with 0 indicating no linear correlation.
o Spearman's ρ: Measures monotonic correlation (linear or non-linear, as long as
consistent direction) for ordinal data or non-normally distributed numerical data.
Based on ranks.
o Use Case: Crucial for understanding dependencies, feature selection, and building
predictive models.
o Example: Correlation between advertising spend and sales.
 Chi-Square Test (χ² test): Used to determine if there is a statistically significant
association between two categorical variables.
o Use Case: Important in class comparison for identifying distinguishing
categorical features between groups.

o Example: Is there a relationship between customer_segment (categorical) and
preferred_payment_method (categorical)?
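
The difference between Pearson's r and Spearman's ρ shows up clearly on data that is monotonic but not linear. The sketch below implements both from their definitions (Spearman as Pearson applied to average ranks); the advertising and sales numbers are invented, and the helper names are our own.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# Hypothetical data: sales grow monotonically but not linearly with ad spend.
ad_spend = [1, 2, 3, 4, 5]
sales = [1, 8, 27, 64, 125]

r = pearson_r(ad_spend, sales)
rho = spearman_rho(ad_spend, sales)
print(f"Pearson r = {r:.4f}, Spearman rho = {rho:.4f}")
```

Spearman's ρ reaches 1 because the ordering of the two variables agrees perfectly, whereas Pearson's r stays below 1 because the relationship is curved.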

5. Other Important Statistical Concepts in Data Mining

 Hypothesis Testing (e.g., t-test, ANOVA): Used to make inferences about a population
based on a sample.
o t-test: Compares the means of two groups.
o ANOVA (Analysis of Variance): Compares the means of three or more groups.
o Use Case: Essential for validating findings, comparing different segments, or
determining if differences are statistically significant or due to chance.
 Regression Analysis (Linear, Logistic, etc.): Statistical modeling technique to analyze
the relationship between a dependent variable and one or more independent variables.
o Use Case: Prediction, forecasting, and quantifying how variables are associated
(a fitted regression, like a correlation, does not by itself prove causation).
 Confidence Intervals: A range of values, derived from a sample, that is likely to contain
the true population parameter with a certain level of confidence.
o Use Case: Provides a measure of the precision and reliability of estimates.
 Probability Distributions (e.g., Normal, Poisson, Binomial): Mathematical functions
that describe the likelihood of different outcomes for a random variable.
o Use Case: Fundamental to understanding data generation processes, statistical
modeling, and hypothesis testing.
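
As a minimal sketch of a two-group comparison, the code below computes Welch's t statistic (a t-test variant that does not assume equal variances) and its approximate degrees of freedom. The choice of Welch's variant, the function name, and the spending figures are assumptions made for illustration; converting the statistic into a p-value requires the t distribution, which statistical libraries provide.

```python
import statistics
from math import sqrt


def welch_t(group_a, group_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(group_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(group_b)
    se_a = var_a / len(group_a)
    se_b = var_b / len(group_b)
    t = (statistics.mean(group_a) - statistics.mean(group_b)) / sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, dof


# Hypothetical monthly spending for a target and a contrasting class.
target_spend = [10, 12, 14, 16, 18]
contrast_spend = [8, 9, 10, 11, 12]

t_stat, dof = welch_t(target_spend, contrast_spend)
print(f"t = {t_stat:.3f}, degrees of freedom = {dof:.2f}")
```

The larger the absolute t statistic relative to the critical value for the given degrees of freedom, the stronger the evidence that the two class means differ.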
