DATA MINING AND WAREHOUSING
UNIT 1
1. Functionalities
Introduction
Data mining and data warehousing are closely related concepts in data analysis. A data
warehouse stores large volumes of integrated data, while data mining extracts
useful patterns and knowledge from that data.
1. Data Mining Functionalities
Data mining functionalities refer to the different types of tasks that can be performed to
discover patterns from data.
1. Concept/Class Description
• Describes general characteristics of data
• Produces summaries of data
Types:
• Characterization
• Discrimination
2. Association Rule Mining
• Finds relationships between variables
• Shows how items are related
Example: Market basket analysis (if A is bought, B is likely to be bought as well)
3. Classification
• Assigns data into predefined categories
• Uses training data
Example: Spam or not spam email
4. Prediction
• Predicts future values based on existing data
• Uses regression techniques
5. Clustering
• Groups similar data objects together
• No predefined classes
6. Outlier Analysis
• Identifies unusual or abnormal data points
• Useful in fraud detection
7. Evolution Analysis
• Studies changes in data over time
• Includes trends, patterns, and sequences
2. Data Warehousing Functionalities
Data warehousing focuses on storing and managing large amounts of data for analysis.
1. Data Extraction
• Collects data from multiple sources
• Sources include databases, files, etc.
2. Data Cleaning
• Removes errors and inconsistencies
• Ensures data quality
3. Data Transformation
• Converts data into a suitable format
• Normalization and aggregation
4. Data Loading
• Loads processed data into the warehouse
5. Data Storage
• Stores integrated and historical data
• Organized for efficient retrieval
6. Query and Analysis
• Allows users to retrieve and analyze data
• Supports decision-making
7. OLAP (Online Analytical Processing)
• Performs multidimensional analysis
• Operations include:
o Roll-up
o Drill-down
o Slice and dice
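The OLAP operations above can be sketched on a tiny in-memory cube. This is only an illustration: the sales records and the helper names roll_up and slice_cube are hypothetical and not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact records: (year, quarter, region, sales amount)
sales = [
    (2023, "Q1", "North", 100),
    (2023, "Q1", "South", 80),
    (2023, "Q2", "North", 120),
    (2023, "Q2", "South", 90),
]

def roll_up(records):
    """Roll-up: aggregate quarterly figures up to the year level, per region."""
    totals = defaultdict(int)
    for year, _quarter, region, amount in records:
        totals[(year, region)] += amount
    return dict(totals)

def slice_cube(records, quarter):
    """Slice: fix one dimension (quarter) to a single value."""
    return [r for r in records if r[1] == quarter]

yearly = roll_up(sales)              # coarser, summarized view
q1_only = slice_cube(sales, "Q1")    # sub-cube containing only Q1
```

Drill-down is the reverse of roll-up: moving from the yearly totals back to the quarterly records.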
Advantages
• Helps in decision making
• Improves data analysis
• Identifies patterns and trends
• Efficient data management
Conclusion
Data mining and data warehousing functionalities work together to transform raw data into
meaningful information. While data warehousing focuses on storing and organizing data,
data mining extracts valuable insights, making them essential for business intelligence and
analytics.
2. Data Processing
Introduction
Data processing is the procedure of collecting, transforming, and organizing raw data into
meaningful information. It is a fundamental activity that is widely used in
computing, business, and data analysis.
Definition
Data processing is the series of operations performed on raw data to convert it into useful
and understandable information.
Steps in Data Processing Cycle
1. Data Collection
• Gathering raw data from various sources
• Sources: databases, sensors, surveys, files
2. Data Preparation
• Cleaning and organizing data
• Removing errors and duplicates
3. Data Input
• Entering data into a system
• Example: keyboard, scanners, files
4. Processing
• Applying operations like sorting, calculating, classifying
• Converts raw data into meaningful output
5. Output
• Displaying processed data
• Example: reports, charts, tables
6. Storage
• Saving processed data for future use
• Stored in databases or data warehouses
Types of Data Processing
1. Manual Processing
• Done by humans without machines
• Time-consuming and error-prone
2. Mechanical Processing
• Uses simple machines
• Faster than manual processing
3. Electronic Data Processing (EDP)
• Uses computers
• Fast, accurate, and efficient
4. Batch Processing
• Data is processed in groups
• Example: payroll systems
5. Real-Time Processing
• Data is processed instantly
• Example: online transactions
Applications
• Business and finance
• Banking systems
• Scientific research
• Data analytics
Advantages
• Converts raw data into useful information
• Improves decision making
• Saves time and effort
• Increases accuracy
Disadvantages
• Requires proper system setup
• Risk of data loss
• Security concerns
Conclusion
Data processing is an essential activity in modern computing that transforms raw data into
meaningful information. It plays a key role in decision-making, business operations, and data
analysis across various fields.
3. Data Cleaning
Introduction
Data cleaning is the process of detecting and correcting errors, inconsistencies, and missing
values in a dataset to improve its quality. It is an important step in Data Mining and data
analysis.
Definition
Data cleaning is the process of preparing raw data by removing noise and errors so that it
can be used for accurate analysis.
Need for Data Cleaning
• Raw data may contain errors or missing values
• Inconsistent formats can affect analysis
• Improves accuracy and reliability of results
Common Data Quality Issues
• Missing values (NULL, NA)
• Duplicate records
• Incorrect or invalid data
• Inconsistent formats
• Outliers
Data Cleaning Techniques
1. Handling Missing Values
• Remove records with missing data
• Replace with mean, median, or mode
• Use interpolation methods
2. Removing Duplicates
• Identify and delete repeated records
3. Correcting Errors
• Fix incorrect entries
• Validate data using rules
4. Data Transformation
• Convert data into a standard format
• Example: date formats, units
5. Handling Outliers
• Detect abnormal values
• Remove or adjust them
6. Data Normalization
• Scale data into a common range
• Helps in analysis and comparison
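Three of the techniques above (removing duplicates, filling missing values with the mean, and min-max normalization) can be sketched as follows. The records and the function name clean are hypothetical, and the sketch assumes at least two distinct known ages so that the normalization range is not zero.

```python
from statistics import mean

# Hypothetical raw records: (id, age); None marks a missing value
raw = [(1, 25), (2, None), (3, 35), (3, 35), (4, 60)]

def clean(records):
    # 1. Remove exact duplicates while keeping the original order
    seen, unique = set(), []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    # 2. Replace missing ages with the mean of the known ages
    known = [age for _, age in unique if age is not None]
    fill = mean(known)
    filled = [(i, age if age is not None else fill) for i, age in unique]
    # 3. Min-max normalization of age into the range [0, 1]
    lo = min(age for _, age in filled)
    hi = max(age for _, age in filled)
    return [(i, (age - lo) / (hi - lo)) for i, age in filled]

cleaned = clean(raw)
```

Whether to drop, fill, or interpolate missing values depends on the dataset; mean filling is used here only because it is the simplest option listed above.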
Tools for Data Cleaning
• Spreadsheet tools (Excel)
• Programming languages (R, Python)
• Data cleaning software
Advantages
• Improves data quality
• Increases accuracy of analysis
• Reduces errors
• Enhances decision-making
Disadvantages
• Time-consuming
• Requires domain knowledge
• Risk of data loss if not done carefully
Applications
• Data analysis and mining
• Machine learning
• Business intelligence
• Research studies
Conclusion
Data cleaning is a crucial step in data processing that ensures the dataset is accurate,
consistent, and reliable. Proper data cleaning leads to better analysis and more meaningful
results.
4. Data Reduction
Introduction
Data reduction is the process of reducing the volume of data while maintaining its integrity
and usefulness. It is an important step in Data Mining and data preprocessing.
Definition
Data reduction refers to techniques used to obtain a reduced representation of a dataset
that is much smaller in size but still produces the same, or nearly the same, analytical results.
Need for Data Reduction
• Large datasets require more storage
• Reduces processing time
• Improves efficiency of data analysis
• Eliminates unnecessary data
Data Reduction Techniques
1. Data Cube Aggregation
• Summarizes data by grouping
• Example: daily sales → monthly sales
2. Dimensionality Reduction
• Reduces number of attributes (features)
• Removes irrelevant or redundant data
Methods:
• Feature selection
• Feature extraction
3. Data Compression
• Compresses data to reduce size
• Types:
o Lossless (no data loss)
o Lossy (some data loss)
4. Numerosity Reduction
• Replaces data with models or summaries
Methods:
• Sampling
• Regression
• Clustering
5. Discretization
• Converts continuous data into intervals
• Example: age → young, adult, old
6. Concept Hierarchy Generation
• Organizes data into different levels
• Example: city → state → country
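Two of the examples above, aggregating daily sales into monthly sales and discretizing age, can be sketched like this. The data, the function names, and the age cut points (18 and 60) are assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical daily sales: (ISO date string, amount)
daily_sales = [("2024-01-05", 200), ("2024-01-20", 300), ("2024-02-03", 150)]

def monthly_totals(rows):
    """Aggregation: daily sales -> monthly sales."""
    totals = defaultdict(int)
    for date, amount in rows:
        totals[date[:7]] += amount   # the "YYYY-MM" prefix identifies the month
    return dict(totals)

def discretize_age(age):
    """Discretization: continuous age -> interval label (cut points are assumed)."""
    if age < 18:
        return "young"
    if age < 60:
        return "adult"
    return "old"

monthly = monthly_totals(daily_sales)
labels = [discretize_age(a) for a in (12, 35, 70)]
```

Both steps lose detail on purpose: the monthly view no longer shows individual days, and the labels no longer show exact ages.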
Advantages
• Reduces storage space
• Faster data processing
• Improves performance of algorithms
• Simplifies data analysis
Disadvantages
• Possible loss of information
• May reduce accuracy if not done properly
• Requires careful selection of techniques
Applications
• Big data analytics
• Machine learning
• Data warehousing
• Business intelligence
Conclusion
Data reduction is a crucial preprocessing step that minimizes data size while preserving
important information. It enhances efficiency, reduces complexity, and supports faster and
more effective data analysis.
UNIT 2
1. Data Mining Primitives
Introduction
Data mining primitives define the basic elements required to perform a data mining task.
They help users specify what kind of knowledge to discover from the data. These are
essential in Data Mining for guiding the mining process.
Definition
Data mining primitives are the set of specifications that describe the data, type of
knowledge, and conditions used in a data mining process.
Types of Data Mining Primitives
1. Task-Relevant Data
• Specifies the data to be used for mining
• Includes database, tables, and attributes
Example: Sales data of a company
2. Kind of Knowledge to be Mined
• Defines what type of patterns to extract
Types:
• Characterization
• Association rules
• Classification
• Clustering
3. Background Knowledge
• Additional information used to guide mining
• Includes domain knowledge, hierarchies
Example: Product categories
4. Interestingness Measures
• Measures to evaluate usefulness of patterns
Examples:
• Support
• Confidence
5. Presentation of Discovered Patterns
• Specifies how results should be displayed
Formats:
• Tables
• Graphs
• Rules
Importance of Data Mining Primitives
• Guides the data mining process
• Helps in selecting relevant data
• Improves quality of results
• Makes analysis more efficient
Applications
• Business analytics
• Market analysis
• Fraud detection
• Decision support systems
Conclusion
Data mining primitives are fundamental components that define and control the data mining
process. By specifying data, tasks, and output formats, they ensure efficient and meaningful
knowledge discovery from large datasets.
2. Data Mining Query Language (DMQL)
Introduction
A Data Mining Query Language (DMQL) is a specialized language used to define and
perform data mining tasks. It allows users to specify what kind of patterns or knowledge
they want to extract from large datasets. It is widely used in Data Mining.
Definition
DMQL is a high-level query language designed to support data mining operations such as
classification, association, clustering, and prediction.
Features of DMQL
• User-friendly and declarative
• Supports multiple data mining tasks
• Allows specification of constraints
• Integrates with databases and data warehouses
Basic Components of DMQL
1. Data Specification
• Defines the dataset to be used
Example:
USE database_name
2. Task-Relevant Data
• Specifies tables and attributes
Example:
FROM sales_data
3. Kind of Knowledge to be Mined
• Defines type of mining task
Example:
MINE ASSOCIATION RULES
4. Background Knowledge
• Provides domain knowledge or hierarchies
5. Interestingness Measures
• Defines criteria for useful patterns
Example:
WITH SUPPORT = 30%, CONFIDENCE = 70%
6. Presentation of Results
• Specifies output format
Example:
DISPLAY AS RULES
Example of DMQL Query (simplified, illustrative syntax)
USE sales_db
MINE ASSOCIATION RULES
FROM transactions
WHERE age > 25
WITH SUPPORT = 50%, CONFIDENCE = 60%
DISPLAY AS TABLE
Applications
• Market basket analysis
• Customer behavior analysis
• Fraud detection
• Business intelligence
Advantages
• Simplifies data mining tasks
• Reduces complexity
• Flexible and powerful
• Supports multiple mining techniques
Disadvantages
• Requires understanding of syntax
• Not widely standardized
• Limited support in some tools
Conclusion
Data Mining Query Language is an important tool for extracting useful knowledge from large
datasets. By allowing users to define mining tasks clearly, DMQL improves efficiency and
supports effective data analysis and decision-making.
3. Architectures of Data Mining Systems
Introduction
The architecture of a data mining system describes how different components work
together to extract useful knowledge from large datasets. It integrates databases, data
warehouses, and data mining techniques to support decision-making. This is a core concept
in Data Mining.
Definition
Data mining system architecture refers to the overall structure that includes data sources,
processing modules, mining engines, and user interfaces used for knowledge discovery.
Components of Data Mining System Architecture
1. Data Sources
• Includes databases, data warehouses, files, and external data
• Provides raw data for mining
2. Data Cleaning and Integration Module
• Removes noise and inconsistencies
• Combines data from multiple sources
3. Data Warehouse / Database Server
• Stores processed and integrated data
• Manages data efficiently for retrieval
4. Data Mining Engine
• Core component of the system
• Performs mining tasks such as:
o Classification
o Clustering
o Association
o Prediction
5. Pattern Evaluation Module
• Identifies interesting and useful patterns
• Uses measures like support and confidence
6. Knowledge Base
• Stores domain knowledge and metadata
• Guides the mining process
7. Graphical User Interface (GUI)
• Allows users to interact with the system
• Displays results in visual formats
Types of Data Mining Architectures
1. No Coupling Architecture
• Data mining system does not use database or data warehouse
• Works independently
2. Loose Coupling Architecture
• Data mining system uses database but operates separately
• Limited integration
3. Semi-Tight Coupling Architecture
• Some integration between database and mining system
• Improves performance
4. Tight Coupling Architecture
• Fully integrated with database or data warehouse
• High efficiency and performance
Advantages
• Efficient handling of large data
• Supports complex data analysis
• Improves decision-making
• Scalable and flexible
Disadvantages
• Complex design
• High implementation cost
• Requires skilled management
Applications
• Business intelligence
• Healthcare systems
• Banking and finance
• E-commerce
Conclusion
The architecture of data mining systems defines how data is collected, processed, and
analyzed to extract meaningful patterns. A well-designed architecture ensures efficient data
handling, better performance, and accurate knowledge discovery.
4. Data Generalization and Summarization
Introduction
Data generalization and summarization are techniques used in Data Mining to simplify large
datasets and make them easier to understand. They help in transforming detailed data into
higher-level, meaningful information.
1. Data Generalization
Definition
Data generalization is the process of replacing low-level detailed data with higher-level
abstract concepts using concept hierarchies.
Concept
• Converts specific data into general forms
• Reduces complexity of data
• Uses concept hierarchy
Example
Detailed Data | Generalized Data
Chennai | Tamil Nadu
Tamil Nadu | India
Techniques of Data Generalization
1. Attribute Removal
• Removes less important attributes
2. Attribute Generalization
• Replaces detailed values with higher-level concepts
3. Concept Hierarchy
• Organizes data into levels (city → state → country)
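Attribute generalization with a concept hierarchy can be sketched as a lookup that climbs from a value to its parent concept. The hierarchy below (including the extra cities) and the function name generalize are hypothetical.

```python
# Hypothetical concept hierarchy: each value maps to its parent concept
hierarchy = {
    "Chennai": "Tamil Nadu",
    "Coimbatore": "Tamil Nadu",
    "Mumbai": "Maharashtra",
    "Tamil Nadu": "India",
    "Maharashtra": "India",
}

def generalize(value, levels=1):
    """Climb the hierarchy by the given number of levels."""
    for _ in range(levels):
        value = hierarchy.get(value, value)  # values at the top level stay unchanged
    return value

cities = ["Chennai", "Coimbatore", "Mumbai"]
states = [generalize(c) for c in cities]               # city -> state
countries = [generalize(c, levels=2) for c in cities]  # city -> country
```

After generalization, many detailed values collapse into a few higher-level ones, which is why the technique reduces data size.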
Advantages
• Reduces data size
• Simplifies analysis
• Improves understanding
2. Data Summarization
Definition
Data summarization is the process of generating a compact representation of data that
highlights key features.
Concept
• Produces summary statistics
• Provides overview of data
Techniques of Data Summarization
1. Aggregation
• Combines data values
Example: Total sales, average marks
2. Descriptive Statistics
• Mean, median, mode
• Standard deviation
3. Data Cube (OLAP)
• Multidimensional summarization
• Supports operations like roll-up and drill-down
Advantages
• Reduces data complexity
• Helps in decision making
• Quick understanding of large datasets
Difference Between Generalization and Summarization
Feature | Generalization | Summarization
Purpose | Replace detailed data | Provide overview
Method | Concept hierarchy | Aggregation/statistics
Output | Higher-level data | Summary information
Applications
• Business reporting
• Data analysis
• Decision support systems
• Data warehousing
Conclusion
Data generalization and summarization are important techniques that help in simplifying
and analyzing large datasets. While generalization abstracts data to higher levels,
summarization provides concise insights, making both essential for effective data mining.
5. Statistical Measures
Introduction
Statistical measures are numerical values used to summarize, describe, and analyze data.
They are essential in Statistics and data analysis for understanding the distribution and
characteristics of a dataset.
Types of Statistical Measures
1. Measures of Central Tendency
These measures represent the central or average value of a dataset.
Mean (Average)
• Sum of all values divided by number of values
Median
• Middle value in sorted data
Mode
• Most frequently occurring value
2. Measures of Dispersion
These measures show how data is spread out.
Range
• Difference between maximum and minimum values
Variance
• Average of squared differences from mean
Standard Deviation
• Square root of variance
3. Measures of Position
Quartiles
• Divide data into four equal parts
Percentiles
• Divide data into 100 equal parts
4. Measures of Shape
Skewness
• Indicates asymmetry of data distribution
Kurtosis
• Indicates how heavy the tails of a distribution are (often described as peakedness or flatness)
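Most of the measures above are available in Python's standard statistics module. The sample below is hypothetical, and population variance is used because the definition above divides by the number of values. Skewness and kurtosis are not part of that module, so they are left out of this sketch.

```python
import statistics

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

mean_value = statistics.mean(data)           # sum of values / number of values
median_value = statistics.median(data)       # middle value of the sorted data
mode_value = statistics.mode(data)           # most frequently occurring value
value_range = max(data) - min(data)          # maximum - minimum
variance = statistics.pvariance(data)        # average squared difference from the mean
std_dev = statistics.pstdev(data)            # square root of the variance
quartiles = statistics.quantiles(data, n=4)  # three cut points: Q1, Q2, Q3
```

For this sample the mean is 6, the median is 5, the mode is 4, and the range is 9.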
Importance of Statistical Measures
• Summarizes large datasets
• Helps in comparison
• Supports decision making
• Identifies trends and patterns
Applications
• Data analysis
• Business intelligence
• Research and surveys
• Machine learning
Advantages
• Simple and easy to understand
• Provides meaningful insights
• Useful for prediction and analysis
Disadvantages
• May not represent all data accurately
• Sensitive to outliers (mean)
• Requires proper interpretation
Conclusion
Statistical measures are fundamental tools for analyzing data. By using measures of central
tendency, dispersion, position, and shape, they help in understanding data patterns and
making informed decisions.
UNIT 3
1. Single-Dimension Boolean Association Rule
Introduction
In Data Mining, association rules are used to discover relationships between items in a
dataset. A single-dimension boolean association rule is a simple and commonly used type
of association rule.
Definition
A single-dimension boolean association rule is a rule that involves only one attribute
(dimension) and deals with binary (true/false or presence/absence) values.
Key Concepts
1. Single-Dimension
• Only one type of attribute is considered
• Typically involves a single predicate like “buys”
2. Boolean Values
• Data is represented as:
o True (1) → Item is present
o False (0) → Item is absent
Example
Rule:
• If a customer buys bread → they also buy butter
This can be written as:
• buys(bread) → buys(butter)
Here:
• Only one dimension: buys
• Boolean values: bought or not bought
Measures Used
1. Support
• Fraction of all transactions that contain both items
Formula:
Support(A → B) = (Transactions containing A and B) / Total transactions
2. Confidence
• Strength of the rule
Formula:
Confidence(A → B) = (Transactions containing A and B) / (Transactions containing A)
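The two formulas can be checked on a small set of transactions. The baskets below are hypothetical, and the function names support and confidence are chosen only for this sketch.

```python
# Hypothetical market-basket transactions
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(antecedent, consequent, txns):
    """Transactions containing A and B, divided by all transactions."""
    both = sum(1 for t in txns if antecedent <= t and consequent <= t)
    return both / len(txns)

def confidence(antecedent, consequent, txns):
    """Transactions containing A and B, divided by transactions containing A."""
    has_a = sum(1 for t in txns if antecedent <= t)
    both = sum(1 for t in txns if antecedent <= t and consequent <= t)
    return both / has_a

s = support({"bread"}, {"butter"}, transactions)     # 3 of 5 transactions
c = confidence({"bread"}, {"butter"}, transactions)  # 3 of the 4 bread transactions
```

Here buys(bread) → buys(butter) has support 0.6 and confidence 0.75.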
Characteristics
• Simple and easy to understand
• Focuses on presence or absence of items
• Widely used in market basket analysis
Applications
• Retail analysis
• Recommendation systems
• Customer behavior analysis
Advantages
• Easy to implement
• Clear interpretation
• Efficient for large datasets
Disadvantages
• Limited to one dimension
• Cannot handle complex relationships
• Ignores quantitative values
Conclusion
Single-dimension boolean association rules are basic yet powerful tools in data mining. They
help identify simple relationships between items using binary data, making them useful for
applications like market basket analysis and recommendation systems.
2. Multi-Dimensional Association Rule
Introduction
In Data Mining, association rules are used to find relationships between data items. A multi-
dimensional association rule involves more than one attribute (dimension), making it more
powerful and realistic than single-dimensional rules.
Definition
A multi-dimensional association rule is a rule that involves two or more attributes
(dimensions) instead of just one.
Key Concept
• Uses multiple predicates (attributes)
• Captures complex relationships between different data fields
• Not limited to the single “buys” predicate
Example
Rule:
• Age(20–30) AND Income(High) → Buys(Laptop)
Here:
• Dimensions involved:
o Age
o Income
o Purchase (Buys)
Types of Multi-Dimensional Association Rules
1. Inter-Dimension Association Rule
• No repeated predicates
• Each dimension appears only once
Example:
• Age(20–30) → Buys(Laptop)
2. Hybrid-Dimension Association Rule
• At least one predicate is repeated
Example:
• Buys(Laptop) AND Buys(Mouse) → Buys(Keyboard)
Measures Used
1. Support
• Frequency of rule occurrence in dataset
2. Confidence
• Strength or reliability of the rule
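Support and confidence are computed the same way as for single-dimension rules, except that the antecedent tests several attributes. The sketch below evaluates the rule Age(20–30) AND Income(High) → Buys(Laptop) on hypothetical customer records.

```python
# Hypothetical customer records with three dimensions
customers = [
    {"age": 24, "income": "High", "buys": "Laptop"},
    {"age": 28, "income": "High", "buys": "Laptop"},
    {"age": 27, "income": "High", "buys": "Phone"},
    {"age": 45, "income": "High", "buys": "Laptop"},
    {"age": 22, "income": "Low", "buys": "Phone"},
]

def antecedent(record):
    """Age(20-30) AND Income(High)."""
    return 20 <= record["age"] <= 30 and record["income"] == "High"

def consequent(record):
    """Buys(Laptop)."""
    return record["buys"] == "Laptop"

matches_a = [r for r in customers if antecedent(r)]
matches_both = [r for r in matches_a if consequent(r)]

rule_support = len(matches_both) / len(customers)     # 2 of 5 records
rule_confidence = len(matches_both) / len(matches_a)  # 2 of 3 matching records
```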
Characteristics
• Handles multiple attributes
• Provides deeper insights
• More complex than single-dimensional rules
Applications
• Customer segmentation
• Market analysis
• Recommendation systems
• Business intelligence
Advantages
• Captures complex patterns
• More realistic analysis
• Useful for decision making
Disadvantages
• More computationally expensive
• Complex to interpret
• Requires more data
Conclusion
Multi-dimensional association rules extend basic association rule mining by involving
multiple attributes. They provide deeper insights into data relationships and are widely used
in advanced data analysis and business applications.
UNIT 4
1. Bayesian Classification
Introduction
Bayesian classification is a statistical classification technique based on probability theory. It
is widely used in Machine Learning and data mining for predicting class labels of data.
Definition
Bayesian classification is a method that uses Bayes’ Theorem to classify data based on the
probability of features belonging to a particular class.
Bayes’ Theorem
𝑃(𝐴 ∣ 𝐵) = 𝑃(𝐵 ∣ 𝐴) 𝑃(𝐴) / 𝑃(𝐵)
Figure (illustration): with 𝑃(𝐵 ∣ 𝐴) 𝑃(𝐴) = 0.17 and 𝑃(𝐵) ≈ 0.25, the posterior is 𝑃(𝐴 ∣ 𝐵) ≈ 0.68 (posterior = useful evidence / total evidence).
Where:
• 𝑃(𝐴 ∣ 𝐵): Posterior probability (probability of class given data)
• 𝑃(𝐵 ∣ 𝐴): Likelihood
• 𝑃(𝐴): Prior probability
• 𝑃(𝐵): Evidence
Working of Bayesian Classification
1. Calculate prior probabilities of each class
2. Compute likelihood of data for each class
3. Apply Bayes’ theorem
4. Assign the class with highest probability
Naïve Bayes Classifier
Definition
A simplified Bayesian classifier that assumes all attributes are independent of each other, given the class.
Types of Naïve Bayes
• Gaussian Naïve Bayes (for continuous data)
• Multinomial Naïve Bayes (for text data)
• Bernoulli Naïve Bayes (for binary data)
Example
Classifying email as Spam or Not Spam:
• Words like “offer”, “free” increase probability of spam
• Based on probabilities, email is classified
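The spam example can be sketched as a small multinomial Naïve Bayes classifier. This is a minimal illustration, not a production filter: the training messages are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen words do not give a zero probability.

```python
import math
from collections import Counter

# Hypothetical labelled training messages
training = [
    ("free offer now", "spam"),
    ("free prize offer", "spam"),
    ("meeting at noon", "not spam"),
    ("project meeting notes", "not spam"),
]

def train(examples):
    """Count classes and per-class word frequencies."""
    class_counts = Counter(label for _, label in examples)
    word_counts = {label: Counter() for label in class_counts}
    for text, label in examples:
        word_counts[label].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def classify(text, model):
    """Pick the class with the highest log prior + summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = math.log(count / total)                          # prior
        denom = sum(word_counts[label].values()) + len(vocab)    # add-one smoothing
        for word in text.split():
            score += math.log((word_counts[label][word] + 1) / denom)  # likelihood
        scores[label] = score
    return max(scores, key=scores.get)

model = train(training)
prediction = classify("free offer", model)
```

The evidence term P(B) is the same for every class, so it can be skipped when only the most probable class is needed.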
Advantages
• Simple and fast
• Works well with large datasets
• Effective for text classification
• Requires less training data
Disadvantages
• Assumes independence (not always realistic)
• Less accurate for complex relationships
Applications
• Spam detection
• Sentiment analysis
• Medical diagnosis
• Document classification
Conclusion
Bayesian classification is a powerful probabilistic method for classification tasks. Despite its
simplicity, it performs efficiently in many real-world applications, especially when using the
Naïve Bayes approach.
2. Back Propagation Classification
Introduction
Back propagation is a supervised learning algorithm used in artificial neural networks
(ANNs) for classification and prediction tasks. It is widely applied in Machine Learning and
data mining.
Definition
Back propagation is a learning algorithm that adjusts the weights of a neural network by
minimizing the error between predicted and actual outputs using a backward pass.
Concept of Back Propagation
• Works with multi-layer neural networks
• Uses forward propagation to compute output
• Uses backward propagation to update weights
• Based on gradient descent optimization
Working of Back Propagation
1. Forward Pass
• Input data is passed through input, hidden, and output layers
• Output is generated
2. Error Calculation
• Difference between actual output and predicted output is calculated
Error = Actual Output – Predicted Output
3. Backward Pass
• Error is propagated back through the network
• Weights are adjusted to reduce error
4. Weight Update
• Weights are updated using learning rate
Steps in Algorithm
1. Initialize weights randomly
2. Perform forward propagation
3. Compute error
4. Backpropagate error
5. Update weights
6. Repeat until error is minimized
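The six steps can be sketched for a network with one hidden layer. Everything specific here is an assumption for illustration: sigmoid activations, squared error, three hidden units, a learning rate of 0.5, per-example updates, and a hypothetical OR-gate training set.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_h, b_h, w_o, b_o):
    """Forward pass: returns the hidden activations and the network output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(w_h, b_h)]
    return hidden, sigmoid(sum(w * h for w, h in zip(w_o, hidden)) + b_o)

def mean_squared_error(data, w_h, b_h, w_o, b_o):
    return sum((t - forward(x, w_h, b_h, w_o, b_o)[1]) ** 2
               for x, t in data) / len(data)

def train(data, hidden_units=3, rate=0.5, epochs=2000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # Step 1: initialize weights randomly
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(hidden_units)]
    b_h = [0.0] * hidden_units
    w_o = [rng.uniform(-1, 1) for _ in range(hidden_units)]
    b_o = 0.0
    start = mean_squared_error(data, w_h, b_h, w_o, b_o)
    for _ in range(epochs):                                   # Step 6: repeat
        for x, target in data:
            hidden, out = forward(x, w_h, b_h, w_o, b_o)      # Step 2: forward pass
            error = target - out                              # Step 3: compute error
            delta_o = error * out * (1 - out)                 # Step 4: output delta
            delta_h = [delta_o * w_o[j] * hidden[j] * (1 - hidden[j])
                       for j in range(hidden_units)]          # Step 4: hidden deltas
            for j in range(hidden_units):                     # Step 5: update weights
                w_o[j] += rate * delta_o * hidden[j]
                b_h[j] += rate * delta_h[j]
                for i in range(n_in):
                    w_h[j][i] += rate * delta_h[j] * x[i]
            b_o += rate * delta_o
    return start, mean_squared_error(data, w_h, b_h, w_o, b_o)

# Hypothetical training set: the logical OR function
or_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
start_loss, final_loss = train(or_data)
```

The hidden deltas are computed before any weight is changed, so the backward pass uses the same weights as the forward pass that produced the error.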
Features
• Learns from labeled data
• Improves accuracy over iterations
• Handles complex non-linear relationships
Advantages
• High accuracy for classification tasks
• Can model complex patterns
• Widely used in deep learning
Disadvantages
• Requires large training data
• Time-consuming
• Can get stuck in local minima
Applications
• Image recognition
• Speech recognition
• Medical diagnosis
• Pattern classification
Conclusion
Back propagation is a powerful algorithm used for classification in neural networks. By
continuously adjusting weights to minimize error, it enables machines to learn complex
patterns and make accurate predictions.
3. Prediction in Data Mining
Introduction
Prediction is a data mining task used to forecast future values or trends based on existing
data. It is an important concept in Data Mining and is widely used for decision-making and
planning.
Definition
Prediction is the process of estimating unknown or future values of a variable using known
data and statistical or machine learning techniques.
Types of Prediction
1. Numeric Prediction
• Predicts continuous values
• Uses regression techniques
Example: Predicting sales amount
2. Categorical Prediction
• Predicts class labels
• Uses classification techniques
Example: Predicting whether an email is spam or not
Techniques Used in Prediction
1. Regression Analysis
• Finds relationship between variables
• Types: Linear, Multiple regression
2. Time Series Analysis
• Predicts future values based on past trends
• Used for stock prices, weather forecasting
3. Machine Learning Methods
• Decision trees
• Neural networks
• Bayesian methods
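As an illustration of numeric prediction, simple linear regression can be fitted with the least-squares formulas. The data below is hypothetical (it lies exactly on the line y = 2x + 1), and fit_line is a name chosen for this sketch.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: advertising spend vs. sales
ad_spend = [1, 2, 3, 4, 5]
sales = [3, 5, 7, 9, 11]

slope, intercept = fit_line(ad_spend, sales)
forecast = slope * 6 + intercept   # predicted sales for a spend of 6
```

Real data rarely lies on a perfect line, so in practice the fitted model should be checked on held-out data before its forecasts are trusted.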
Steps in Prediction Process
1. Data collection
2. Data preprocessing (cleaning and transformation)
3. Model selection
4. Training the model
5. Testing and evaluation
6. Prediction of new data
Applications
• Business forecasting
• Weather prediction
• Stock market analysis
• Medical diagnosis
• Demand forecasting
Advantages
• Helps in future planning
• Improves decision making
• Identifies trends and patterns
Disadvantages
• Depends on data quality
• May produce inaccurate results
• Requires proper model selection
Conclusion
Prediction is a powerful data mining technique that helps estimate future outcomes based
on historical data. By using various statistical and machine learning methods, it plays a key
role in planning, analysis, and decision-making.
4. Classifier Accuracy
Introduction
Classifier accuracy is a measure used to evaluate the performance of a classification model.
It indicates how correctly a model predicts class labels. It is an important concept in Machine
Learning.
Definition
Classifier accuracy is the ratio of correctly predicted instances to the total number of
instances.
Formula for Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Where:
• TP (True Positive) → Correctly predicted positive cases
• TN (True Negative) → Correctly predicted negative cases
• FP (False Positive) → Incorrectly predicted positive
• FN (False Negative) → Incorrectly predicted negative
Confusion Matrix
(rows: actual class, columns: predicted class)
 | Predicted Positive | Predicted Negative
Actual Positive | TP | FN
Actual Negative | FP | TN
Interpretation
• Accuracy ranges from 0 to 1 (or 0% to 100%)
• Higher accuracy → Better model performance
Example
If a model correctly predicts 90 out of 100 instances:
• Accuracy = 90 / 100 = 0.9 (90%)
Advantages
• Simple and easy to understand
• Useful for balanced datasets
• Quick evaluation metric
Disadvantages
• Misleading for imbalanced datasets
• Does not consider type of errors
• May hide poor performance in certain classes
Related Measures
• Precision
• Recall
• F1-score
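The sketch below computes accuracy together with the related measures from the four confusion-matrix counts. The precision, recall, and F1 formulas are the standard definitions and are not spelled out in these notes; the counts describe a hypothetical imbalanced test set with 10 positive and 90 negative cases.

```python
def evaluate(tp, tn, fp, fn):
    """Return accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts: 10 actual positives, 90 actual negatives
accuracy, precision, recall, f1 = evaluate(tp=2, tn=88, fp=2, fn=8)
```

Accuracy is 0.9 here, yet recall is only 0.2: the model misses most positive cases. This is how accuracy can be misleading on imbalanced datasets.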
Conclusion
Classifier accuracy is a basic and widely used metric for evaluating classification models.
However, it should be used along with other measures like precision and recall for better
assessment, especially in real-world scenarios.
UNIT 5
1. Types of Data in Cluster Analysis
Introduction
In Data Mining, cluster analysis groups similar data objects based on their characteristics.
The type of data used plays a crucial role in determining the clustering method and similarity
measures.
Types of Data in Cluster Analysis
1. Interval-Scaled Data
• Continuous numeric data measured on a scale
• Differences between values are meaningful
Example: Temperature, height, weight
2. Binary Data
• Data with only two possible values
Values:
• 0 and 1 (False/True, Yes/No)
Types:
• Symmetric binary (equal importance)
• Asymmetric binary (one value more important)
3. Nominal Data
• Categorical data without any order
Example: Color (red, blue, green), gender
4. Ordinal Data
• Categorical data with a meaningful order
Example: Rank (1st, 2nd, 3rd), ratings (low, medium, high)
5. Ratio-Scaled Data
• Similar to interval data but has a true zero point
Example: Age, income, distance
6. Mixed-Type Data
• Combination of different data types
• Requires special handling
Importance
• Determines similarity/distance measures
• Affects clustering results
• Helps in choosing appropriate algorithms
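To illustrate how the data type determines the distance measure, the sketch below pairs three data types with commonly used measures: Euclidean distance for interval-scaled data, simple matching for symmetric binary or nominal data, and Jaccard distance for asymmetric binary data. The pairing reflects common practice rather than anything prescribed in these notes, and the function names are chosen for this example.

```python
import math

def euclidean(p, q):
    """Distance for interval-scaled (numeric) attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def simple_matching_distance(p, q):
    """Symmetric binary or nominal attributes: share of mismatching positions."""
    return sum(a != b for a, b in zip(p, q)) / len(p)

def jaccard_distance(p, q):
    """Asymmetric binary attributes: positions where both are 0 are ignored."""
    both = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(p, q) if a == 1 or b == 1)
    return 1 - both / either if either else 0.0

d_numeric = euclidean([1.0, 2.0], [4.0, 6.0])
d_symmetric = simple_matching_distance([1, 0, 0, 1], [1, 1, 0, 0])
d_asymmetric = jaccard_distance([1, 0, 0, 1], [1, 1, 0, 0])
```

The same pair of binary vectors gives 0.5 under simple matching but about 0.67 under Jaccard, because Jaccard does not count the shared 0 as agreement.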
Applications
• Market segmentation
• Image processing
• Pattern recognition
• Customer analysis
Conclusion
Different types of data such as interval, binary, nominal, ordinal, ratio, and mixed data are
used in cluster analysis. Understanding these data types is essential for selecting appropriate
clustering techniques and achieving accurate results.
2. Grid-Based Method in Clustering
Introduction
The grid-based method is a clustering technique used in Data Mining. It divides the data
space into a finite number of cells (grids) and performs clustering on these grids instead of
individual data points.
Definition
A grid-based method partitions the data space into a grid structure and groups dense cells to
form clusters.
Concept
• The data space is divided into rectangular cells
• Each cell contains a number of data points
• Cells are classified as dense or sparse
• Clusters are formed from connected dense cells
Working of Grid-Based Method
1. Divide the data space into grid cells
2. Count the number of data points in each cell
3. Identify dense cells based on a threshold
4. Merge adjacent dense cells to form clusters
5. Ignore sparse cells
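The five steps can be sketched for two-dimensional points. This is a simplified illustration, not STING or CLIQUE: the cell size, the density threshold, the sample points, and the rule that only cells sharing an edge are adjacent are all assumptions.

```python
from collections import Counter

def grid_cluster(points, cell_size, density_threshold):
    # Steps 1-2: map each point to a grid cell and count the points per cell
    counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
    # Step 3: keep the cells whose count reaches the threshold
    dense = {cell for cell, n in counts.items() if n >= density_threshold}
    # Step 4: merge dense cells that share an edge (flood fill)
    clusters, unvisited = [], set(dense)
    while unvisited:
        stack = [unvisited.pop()]
        cluster = set(stack)
        while stack:
            cx, cy = stack.pop()
            for nb in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                if nb in unvisited:
                    unvisited.remove(nb)
                    cluster.add(nb)
                    stack.append(nb)
        clusters.append(cluster)
    # Step 5: sparse cells were never added to `dense`, so they are ignored
    return clusters

points = [(0.1, 0.2), (0.4, 0.3), (1.2, 0.5), (1.6, 0.1),   # dense region A
          (5.1, 5.2), (5.5, 5.9),                           # dense region B
          (9.5, 0.5)]                                       # isolated point
clusters = grid_cluster(points, cell_size=1.0, density_threshold=2)
```

Each cluster is returned as a set of cell coordinates. After the single counting pass, the merging work depends only on the number of dense cells.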
Characteristics
• Clustering cost depends mainly on the number of grid cells
• Largely independent of the number of data points once the cell counts are computed
• Fast processing
Advantages
• High efficiency (fast computation)
• Suitable for large datasets
• Simple implementation
Disadvantages
• Accuracy depends on grid size
• Cluster boundaries follow cell edges, so irregular cluster shapes are captured only approximately
• Loss of detailed information
Examples of Grid-Based Algorithms
• STING (Statistical Information Grid)
• CLIQUE (Clustering In QUEst)
• WaveCluster
Applications
• Spatial data analysis
• Image processing
• Geographic information systems (GIS)
Conclusion
Grid-based clustering methods are efficient techniques that divide the data space into grids
and form clusters based on density. They are especially useful for large datasets but require
careful selection of grid size for accurate results.
3. Model-Based Clustering Method
Introduction
Model-based clustering is a statistical approach in Data Mining used to identify clusters in
data by assuming that the data is generated from a mixture of underlying probability
distributions (models). Each cluster corresponds to a different statistical model.
Definition
Model-based clustering assumes that the dataset is generated from a mixture of underlying
probability distributions, typically Gaussian, and uses statistical methods to estimate
parameters and assign data points to clusters.
Concept
• Each cluster is represented by a probability density function
• Model parameters are estimated by maximum likelihood, and each data point is assigned to the cluster with the highest probability
• Flexible in modeling clusters of different shapes and sizes
Working of Model-Based Clustering
1. Assume a Model
o Choose a probabilistic model (e.g., Gaussian mixture)
2. Estimate Parameters
o Use methods like Expectation-Maximization (EM) to estimate parameters of
the distributions
3. Assign Data Points
o Calculate probability that each point belongs to each cluster
o Assign points to cluster with highest probability
4. Iterate
o Repeat parameter estimation and assignment until convergence
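The four steps can be sketched with a one-dimensional Gaussian mixture fitted by EM. This is a minimal illustration under stated assumptions: the data and the starting means are hypothetical, a fixed number of iterations stands in for a convergence check, and a small variance floor is added to keep the arithmetic stable.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, iterations=50):
    k = len(means)
    means = list(means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: probability that each point belongs to each cluster
        resp = []
        for x in data:
            p = [weights[j] * normal_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(spread, 1e-6)  # floor avoids a zero variance
    # Assign each point to the cluster with the highest probability
    # (taken from the last E-step)
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return means, labels

# Hypothetical data with two well-separated groups
data = [1.0, 1.2, 0.8, 1.1, 9.0, 9.3, 8.7, 9.1]
means, labels = em_gmm_1d(data, means=[0.0, 5.0])
```

The list resp holds the probabilistic membership of every point, which is what distinguishes this approach from hard-assignment methods. Results depend on the starting means.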
Characteristics
• Statistical approach
• Can handle overlapping clusters
• Can help choose the number of clusters through model selection criteria such as BIC
Advantages
• Can model complex cluster shapes
• Can account for noise and outliers when the model includes a suitable noise component
• Provides probabilistic membership for points
Disadvantages
• Computationally intensive
• Requires assumption about underlying distribution
• Sensitive to initial parameter settings
Examples of Model-Based Clustering
• Gaussian Mixture Models (GMM)
• Expectation-Maximization (EM) Algorithm
• Bayesian Clustering
Applications
• Image segmentation
• Market segmentation
• Bioinformatics (gene expression clustering)
• Pattern recognition
Conclusion
Model-based clustering is a robust and flexible method that assumes a statistical model for
each cluster. It provides probabilistic assignment of data points and can handle complex
structures, making it suitable for advanced clustering applications.