UNIT-3
Classification: Basic Concepts, General Approach to solving a classification
problem, Decision Tree Induction: Attribute Selection Measures, Tree Pruning,
Scalability and Decision Tree Induction, Visual Mining for Decision Tree Induction,
Bayesian Classification Methods: Bayes Theorem, Naïve Bayes Classification,
Rule-Based Classification, Model Evaluation and Selection.
General Approach to Solving a Classification Problem
Classification is a supervised machine learning task where the goal is to assign input data to
predefined categories or classes (e.g., spam/not spam, disease/healthy).
1. Problem Definition
● Clearly define the objective of the classification task.
● Identify:
o Input features (independent variables)
o Output classes (labels)
Example:
Predict whether an email is Spam or Not Spam.
2. Data Collection
● Gather relevant data from reliable sources such as:
o Databases
o Sensors
o Surveys
o Public datasets
● Ensure data is sufficient and representative of all classes.
3. Data Preprocessing
Raw data is usually noisy and incomplete. Preprocessing improves data quality.
Includes:
● Handling missing values
● Removing duplicates
● Noise reduction
● Encoding categorical variables
● Feature scaling (normalization/standardization)
Diagram: Data Preprocessing
Raw Data → Cleaning → Encoding → Scaling → Processed Data
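The preprocessing steps listed above can be sketched in plain Python. This is only a minimal illustration on made-up records: the field names (age, city), the values, and the helper function names are all hypothetical, and mean imputation is just one of several ways to handle missing values.

```python
# Toy records; None marks a missing value. All names and values are made up.
rows = [
    {"age": 20, "city": "Pune"},
    {"age": None, "city": "Delhi"},
    {"age": 40, "city": "Pune"},
    {"age": 20, "city": "Pune"},   # exact duplicate of the first record
]

def drop_duplicates(records):
    """Keep the first occurrence of each record."""
    seen, unique = set(), []
    for r in records:
        key = tuple(sorted(r.items(), key=lambda kv: kv[0]))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def impute_mean(records, field):
    """Replace missing values of a numeric field with the mean of the known values."""
    known = [r[field] for r in records if r[field] is not None]
    mean = sum(known) / len(known)
    return [{**r, field: mean if r[field] is None else r[field]} for r in records]

def one_hot(records, field):
    """Encode a categorical field as 0/1 indicator columns."""
    categories = sorted({r[field] for r in records})
    out = []
    for r in records:
        encoded = {k: v for k, v in r.items() if k != field}
        for c in categories:
            encoded[f"{field}={c}"] = 1 if r[field] == c else 0
        out.append(encoded)
    return out

def min_max_scale(records, field):
    """Rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for a constant column
    return [{**r, field: (r[field] - lo) / span} for r in records]

# Cleaning -> Encoding -> Scaling, in the order of the diagram
clean = drop_duplicates(rows)
clean = impute_mean(clean, "age")
clean = one_hot(clean, "city")
clean = min_max_scale(clean, "age")
print(clean)
```

In practice the mean and the min/max used for scaling should be computed on the training set only and then reused for the test set, so that no information leaks from test data into training.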
4. Exploratory Data Analysis (EDA)
● Understand patterns, trends, and relationships in data.
● Use:
o Statistical summaries
o Visualizations (histograms, box plots, scatter plots)
● Detect class imbalance and outliers.
5. Feature Selection / Feature Engineering
● Select the most relevant features that improve model performance.
● Create new features if required.
● Helps reduce overfitting and computation time.
6. Splitting the Dataset
● Divide data into:
o Training set
o Testing set (and sometimes validation set)
Common split: 70% training, 30% testing
Diagram: Dataset Split
Dataset
├── Training Data (70%)
└── Testing Data (30%)
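A 70/30 split like the one in the diagram can be sketched as follows. The function name, the seed value, and the toy data are assumptions made for illustration; the records are shuffled first so that the split does not depend on the original ordering.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=42):
    """Shuffle a copy of the records and cut it into train and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10))            # stand-in for 10 labelled records
train, test = train_test_split(data)
print(len(train), len(test))      # 7 3
```

When classes are imbalanced, a stratified split (splitting each class separately in the same proportion) is usually preferable to this plain random split.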
7. Model Selection
Choose an appropriate classification algorithm based on the problem.
Common classifiers:
● Logistic Regression
● Decision Tree
● Naïve Bayes
● Support Vector Machine (SVM)
● k-Nearest Neighbors (k-NN)
● Neural Networks
8. Model Training
● Train the chosen model using the training dataset.
● The model learns the relationship between features and class labels.
Diagram: Training Phase
Training Data → Classification Algorithm → Trained Model
9. Model Evaluation
● Evaluate model performance using test data.
● Common evaluation metrics:
o Accuracy
o Precision
o Recall
o F1-Score
o Confusion Matrix
Diagram: Evaluation
Test Data → Trained Model → Predicted Output → Performance Metrics
10. Model Optimization and Deployment
● Improve performance using:
o Hyperparameter tuning
o Cross-validation
● Deploy the model for real-world use once satisfactory performance is achieved.
Overall Flow Diagram
Problem Definition → Data Collection → Data Preprocessing → Feature Selection →
Train–Test Split → Model Selection → Model Training → Model Evaluation → Deployment
Decision Tree Induction
Decision Tree Induction is a supervised machine learning technique used for classification and
regression, where a tree-like model is constructed from training data by recursively partitioning
the dataset based on feature values.
1. Concept of Decision Tree
A decision tree consists of:
● Root node – represents the entire dataset
● Internal nodes – represent decision tests on attributes
● Branches – outcomes of tests
● Leaf nodes – represent class labels or output values
Diagram: Structure of a Decision Tree
          [Attribute A]
            /       \
          Yes        No
          /           \
   [Attribute B]    Class 2
      /     \
  Class 1   Class 3
2. What is Decision Tree Induction?
Decision tree induction is the process of automatically building a decision tree from a given
set of training data using a top-down, greedy approach.
● Starts with all data at the root
● Selects the best attribute to split the data
● Repeats recursively until stopping conditions are met
3. Steps in Decision Tree Induction
Step 1: Select the Best Attribute
● Choose an attribute that best separates the data into distinct classes.
● This is done using attribute selection measures such as:
o Information Gain
o Gain Ratio
o Gini Index
4. Attribute Selection Measures
a) Information Gain (ID3)
● Based on entropy, which measures impurity.
● Attribute with maximum information gain is selected.
Entropy Formula:
Entropy(S) = − Σᵢ pᵢ log₂ pᵢ   (pᵢ = proportion of class i in S)
b) Gain Ratio (C4.5)
● Overcomes bias of information gain.
● Normalizes information gain using split information.
c) Gini Index (CART)
● Measures probability of incorrect classification.
● Lower Gini index indicates a better split.
5. Recursive Tree Construction
● After selecting the best attribute:
o Data is split into subsets
o The same process is applied to each subset
● This continues until:
o All instances belong to the same class, or
o No attributes remain
Diagram: Recursive Partitioning
              Dataset
                 ↓
        Best Attribute Split
          ↓              ↓
      Subset 1        Subset 2
          ↓              ↓
   Further Splits  Further Splits
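The recursive procedure described above can be sketched as a small ID3-style routine. This is a minimal sketch under several assumptions: attributes are categorical, information gain is the selection measure, and the toy weather-style data set and all function names are hypothetical. It is not a full implementation (no pruning, no handling of attribute values unseen during training).

```python
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Entropy of the node minus the weighted entropy of its partitions."""
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [l for r, l in zip(rows, labels) if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attrs):
    # Stopping condition 1: all instances belong to the same class -> leaf
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no attributes remain -> majority-class leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: pick the attribute with the highest information gain
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attrs if a != best]
    node = {"attr": best, "children": {}}
    for value in sorted({r[best] for r in rows}):
        pairs = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        sub_rows = [r for r, _ in pairs]
        sub_labels = [l for _, l in pairs]
        node["children"][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

def predict(tree, row):
    # Follow branches until a leaf (a plain class label) is reached.
    # Raises KeyError for an attribute value not seen during training.
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attr"]]]
    return tree

rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "sunny", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
print(predict(tree, {"outlook": "rainy", "windy": "yes"}))   # stay
```

On this data the root test is outlook (the higher information gain); the sunny branch is already pure, so only the rainy branch is split again on windy.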
6. Stopping Conditions
Tree growth stops when:
● All records in a node belong to the same class
● No attributes are left for further splitting
● Node contains very few instances
7. Tree Pruning
Pruning is used to avoid overfitting.
● Pre-pruning (Early stopping):
o Stops tree growth early
● Post-pruning:
o Removes branches after the tree is fully grown
Diagram: Pruning Concept
Fully Grown Tree → Remove Weak Branches → Pruned Tree
8. Advantages of Decision Tree Induction
● Easy to understand and interpret
● Handles both numerical and categorical data
● Requires little data preprocessing
● Fast prediction
9. Limitations of Decision Tree Induction
● Prone to overfitting
● Sensitive to noisy data
● Small changes in data can lead to different trees
10. Applications of Decision Trees
● Medical diagnosis
● Credit risk analysis
● Customer segmentation
● Fault detection systems
Attribute Selection Measures
Attribute Selection Measures are techniques used in decision tree induction to choose the best
attribute for splitting the dataset at each node. The objective is to maximize class purity in the
resulting subsets and improve classification accuracy.
1. Need for Attribute Selection Measures
● Different attributes divide data differently.
● The best attribute produces homogeneous subsets.
● Proper selection reduces tree size and overfitting.
2. Common Attribute Selection Measures
The most widely used attribute selection measures are:
1. Information Gain
2. Gain Ratio
3. Gini Index
3. Information Gain
Information Gain measures the reduction in entropy after splitting the dataset based on an
attribute.
a) Entropy
Entropy measures the impurity or randomness in a dataset.
Formula:
Entropy(S) = − Σᵢ pᵢ log₂ pᵢ
where
pᵢ = probability of class i in dataset S
b) Information Gain Formula
IG(S, A) = Entropy(S) − Σᵥ ( |Sᵥ| / |S| ) × Entropy(Sᵥ)
where
Sᵥ = subset of S in which attribute A takes the value v
● Attribute with maximum information gain is selected.
● Used in ID3 algorithm.
Diagram: Information Gain Split
Dataset (High Entropy)
Split on Attribute A
/ \
Low Entropy Low Entropy
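A short worked example may help make entropy and information gain concrete. The class counts below are hypothetical (a 14-record data set with 9 positive and 5 negative records, split three ways by some attribute), and the function names are illustrative.

```python
from math import log2

def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

def information_gain(parent_counts, partitions):
    """Parent entropy minus the size-weighted entropy of the partitions."""
    total = sum(parent_counts)
    weighted = sum(sum(p) / total * entropy(p) for p in partitions)
    return entropy(parent_counts) - weighted

parent = [9, 5]                          # 9 records of one class, 5 of the other
partitions = [[2, 3], [4, 0], [3, 2]]    # class counts in each subset after the split

gain = information_gain(parent, partitions)
print(round(entropy(parent), 3))   # 0.94
print(round(gain, 3))              # 0.247
```

The middle partition [4, 0] is pure and contributes zero entropy, which is where most of the gain comes from.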
4. Limitations of Information Gain
● Biased toward attributes with many distinct values
● May lead to overfitting
5. Gain Ratio
Gain Ratio overcomes the bias of information gain by normalizing it using split information.
a) Split Information
Measures how broadly the data is split.
Formula:
SplitInfo(A) = − Σᵥ ( |Sᵥ| / |S| ) log₂ ( |Sᵥ| / |S| )   (Sᵥ = subset of S where A = v)
b) Gain Ratio Formula
GainRatio(A) = Information Gain / SplitInfo(A)
● Attribute with highest gain ratio is selected
● Used in C4.5 algorithm
Diagram: Gain Ratio Concept
Attribute with Many Values → Penalized
Attribute with Meaningful Split → Selected
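The effect of the split-information penalty can be checked numerically. The sketch below uses hypothetical class counts for a 14-record data set and compares a three-way split with a split on a unique-ID-like attribute that puts every record in its own partition; all names are illustrative.

```python
from math import log2

def entropy(counts):
    total = sum(counts)
    return -sum((n / total) * log2(n / total) for n in counts if n > 0)

def information_gain(parent, partitions):
    total = sum(parent)
    return entropy(parent) - sum(sum(p) / total * entropy(p) for p in partitions)

def split_info(partitions):
    # Entropy of the partition sizes themselves, ignoring class labels
    return entropy([sum(p) for p in partitions])

def gain_ratio(parent, partitions):
    return information_gain(parent, partitions) / split_info(partitions)

parent = [9, 5]
three_way = [[2, 3], [4, 0], [3, 2]]        # a split into three subsets
unique_id = [[1, 0]] * 9 + [[0, 1]] * 5     # one record per partition

print(round(information_gain(parent, three_way), 3))   # 0.247
print(round(gain_ratio(parent, three_way), 3))         # 0.156
print(round(information_gain(parent, unique_id), 3))   # 0.94
print(round(gain_ratio(parent, unique_id), 3))         # 0.247
```

The unique-ID attribute has the maximum possible information gain (0.940) even though it is useless for prediction; dividing by its large split information (log₂ 14 ≈ 3.807) cuts its score to about 0.247.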
6. Gini Index
Gini Index measures the probability of incorrect classification of a randomly chosen instance.
Formula:
Gini(S) = 1 − Σᵢ (pᵢ)²
● Lower Gini value → Better split
● Used in CART algorithm
Diagram: Gini Index
Pure Node (Low Gini)
Mixed Node (High Gini)
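The Gini index of a node, and the weighted Gini of a candidate split, can be computed as in this sketch. The class counts and function names are hypothetical.

```python
def gini(counts):
    """Gini index of a node given its per-class record counts."""
    total = sum(counts)
    return 1.0 - sum((n / total) ** 2 for n in counts)

def gini_split(partitions):
    """Size-weighted Gini index of the subsets produced by a split."""
    total = sum(sum(p) for p in partitions)
    return sum(sum(p) / total * gini(p) for p in partitions)

print(gini([10, 0]))                              # pure node: 0.0
print(gini([5, 5]))                               # evenly mixed two-class node: 0.5
print(round(gini_split([[4, 1], [1, 4]]), 2))     # 0.32, better than the unsplit 0.5
```

CART evaluates candidate binary splits this way and keeps the one with the lowest weighted Gini.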
7. Comparison of Measures
Measure            Algorithm   Criteria
Information Gain   ID3         Maximize Gain
Gain Ratio         C4.5        Maximize Ratio
Gini Index         CART        Minimize Gini
8. Advantages of Attribute Selection Measures
● Improve decision tree accuracy
● Reduce tree complexity
● Ensure better generalization
9. Role in Decision Tree Induction
● Used at every internal node
● Determines tree structure
● Affects performance and interpretability
Tree Pruning
Tree Pruning is a technique used in decision tree learning to reduce the size of a tree by
removing unnecessary branches. It helps improve generalization, reduce overfitting, and
enhance model performance on unseen data.
1. Need for Tree Pruning
● Decision trees may grow very deep
● Overfitting occurs when the tree memorizes training data
● Noisy and irrelevant data increase complexity
● Pruning improves accuracy on test data
2. Overfitting in Decision Trees
● Large trees fit training data well
● Perform poorly on new data
● Pruning removes branches with little predictive power
Diagram: Overfitting vs Pruning
Overfitted Tree 🌳 (many branches) → Pruned Tree 🌲 (fewer branches)
3. Types of Tree Pruning
Tree pruning techniques are mainly classified into:
1. Pre-Pruning (Early Stopping)
2. Post-Pruning (Late Pruning)
4. Pre-Pruning (Early Stopping)
Pre-pruning stops tree growth before it becomes too complex.
Conditions for Pre-Pruning:
● Minimum number of samples at a node
● Maximum tree depth reached
● Information gain below threshold
● Node purity achieved
Diagram: Pre-Pruning
Dataset → Split → Stop Early → Leaf Node
Advantages:
● Faster tree construction
● Less computation
Limitations:
● May stop too early
● Can miss important patterns
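The pre-pruning conditions listed above amount to a check that runs before each split. A minimal sketch follows; the threshold values (5 samples, depth 3, gain 0.01) and the function name are assumptions chosen for illustration, not recommended settings.

```python
def should_stop(n_samples, depth, best_gain, n_classes,
                min_samples=5, max_depth=3, min_gain=0.01):
    """Return the reason for stopping, or None if the node may be split further."""
    if n_classes == 1:
        return "node is pure"
    if n_samples < min_samples:
        return "too few samples"
    if depth >= max_depth:
        return "maximum depth reached"
    if best_gain < min_gain:
        return "information gain below threshold"
    return None

print(should_stop(n_samples=40, depth=1, best_gain=0.20, n_classes=2))   # None
print(should_stop(n_samples=3, depth=1, best_gain=0.20, n_classes=2))    # too few samples
print(should_stop(n_samples=40, depth=1, best_gain=0.001, n_classes=2))  # information gain below threshold
```

When the function returns a reason, the node becomes a leaf labelled with its majority class. Setting the thresholds too strictly is what causes the "may stop too early" limitation noted above.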
5. Post-Pruning (Late Pruning)
Post-pruning allows the tree to grow fully and then removes unnecessary branches.
Steps:
1. Build a complete decision tree
2. Evaluate subtrees using validation data
3. Remove branches that do not improve accuracy
Diagram: Post-Pruning
Fully Grown Tree → Evaluate Branches → Remove Weak Branches → Pruned Tree
Common Post-Pruning Methods
● Reduced Error Pruning
● Cost Complexity Pruning
● Rule Post-Pruning
6. Reduced Error Pruning
● Uses a validation dataset
● Replaces a subtree with a leaf if doing so does not reduce accuracy on the validation set
● Simple and effective
7. Cost Complexity Pruning
● Balances tree accuracy and size
● Introduces a penalty for complex trees
● Used in CART algorithm
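The trade-off behind cost complexity pruning can be illustrated with the usual penalized score: cost = error + α × (number of leaves). The candidate subtrees and their error rates below are hypothetical numbers, and the sketch only shows how the penalty α changes which subtree wins; it does not reproduce CART's full procedure for generating the candidate subtrees.

```python
# Hypothetical candidate subtrees: (number of leaves, error rate)
candidates = [(8, 0.10), (4, 0.12), (2, 0.20), (1, 0.40)]

def cost(leaves, error, alpha):
    """Penalized cost: error plus alpha for every leaf in the tree."""
    return error + alpha * leaves

def best_subtree(candidates, alpha):
    return min(candidates, key=lambda c: cost(c[0], c[1], alpha))

print(best_subtree(candidates, alpha=0.0))    # (8, 0.1): no penalty, largest tree wins
print(best_subtree(candidates, alpha=0.01))   # (4, 0.12)
print(best_subtree(candidates, alpha=0.05))   # (2, 0.2)
```

A larger α favours smaller trees. In practice α is itself chosen using validation data or cross-validation.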
8. Advantages of Tree Pruning
● Reduces overfitting
● Improves prediction accuracy
● Produces simpler and more interpretable trees
● Reduces computation time
9. Disadvantages of Tree Pruning
● Choosing pruning level is difficult
● Requires additional validation data
● Risk of underfitting if over-pruned
Scalability and Decision Tree Induction
Scalability in decision tree induction refers to the ability of a decision tree algorithm to
efficiently handle very large datasets in terms of number of records and attributes, without
significant loss of performance or accuracy.
1. Need for Scalability in Decision Tree Induction
● Modern datasets are very large (Big Data)
● High dimensional data increases computation
● Memory and time constraints must be managed
● Efficient tree construction is required
2. Challenges to Scalability
● Large number of training instances
● Large number of attributes
● Repeated scanning of datasets
● Complex attribute selection calculations
Diagram: Scalability Challenge
Large Dataset → High Computation & Memory Usage
3. Time Complexity in Decision Tree Induction
● Tree construction requires:
o Repeated dataset scans
o Sorting of attribute values
● Time complexity increases with:
o Dataset size
o Tree depth
4. Memory Constraints
● Entire dataset may not fit in memory
● Intermediate node data storage is expensive
● Scalability requires memory-efficient strategies
5. Techniques to Improve Scalability
a) Attribute Selection Optimization
● Use efficient measures like Gini Index
● Reduce number of attributes evaluated
● Pre-sort continuous attributes
b) Data Partitioning
● Split dataset into smaller subsets
● Process each subset independently
Diagram: Data Partitioning
Dataset
├── Partition 1
├── Partition 2
└── Partition 3
6. Sampling Techniques
● Use representative samples instead of full data
● Reduces computation cost
● Maintains acceptable accuracy
7. Incremental Tree Induction
● Tree is built incrementally as data arrives
● Useful for streaming data
● Avoids rebuilding tree from scratch
Diagram: Incremental Learning
Old Data → Existing Tree
New Data → Update Tree
8. Parallel and Distributed Processing
● Use parallel algorithms to split computation
● Each processor handles a subset of data
● Improves speed and scalability
Diagram: Parallel Processing
Processor 1 → Subtree
Processor 2 → Subtree
Processor 3 → Subtree
Subtrees combined → Final Tree
9. Scalable Decision Tree Algorithms
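● CART – Efficient binary splits, but assumes the training data fit in memory
● C4.5 – Handles continuous and categorical attributes; also designed for memory-resident data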
● SPRINT – Designed for large, disk-resident data
● RainForest – Uses compact data structures
Visual Mining for Decision Tree Induction
Visual Mining is an approach that integrates data visualization techniques with decision tree
induction to support interactive, user-guided discovery of patterns. It allows users to visually
analyze data distributions, splits, and tree structures to build more interpretable and accurate
decision trees.
1. Concept of Visual Mining
● Combines human visual perception with automated data mining
● Uses graphical displays to reveal hidden patterns
● Enhances understanding of complex datasets
2. Need for Visual Mining in Decision Tree Induction
● Automated decision trees may become complex and opaque
● Large datasets make split selection difficult
● Visualization helps users understand:
o Attribute relevance
o Class distribution
o Split effectiveness
3. Visual Mining Process
Visual mining follows an interactive loop involving data, visualization, and user feedback.
Diagram: Visual Mining Process
Data → Visualization → User Interaction → Decision Tree Induction → Model
4. Visualization of Attribute Distributions
● Graphical tools like:
o Histograms
o Box plots
o Scatter plots
● Help identify how attributes separate classes
Diagram: Attribute Distribution
Class A ███████
Class B ████
Attribute Value
5. Visual Support for Attribute Selection
● Visual plots reveal:
o Class overlap
o Good splitting points
● Helps choose attributes that maximize class separation
Diagram: Visual Attribute Selection
Attribute X
|---Class 1---|---Class 2---|
Best Split Point
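The "best split point" that a plot suggests visually can also be found numerically. The sketch below sorts a continuous attribute, tries the midpoint between each pair of adjacent distinct values, and keeps the threshold with the lowest weighted Gini index. The ages, class labels, and function names are made up for illustration, and Gini is only one possible scoring measure.

```python
def gini(labels):
    total = len(labels)
    return 1.0 - sum((labels.count(c) / total) ** 2 for c in set(labels))

def best_split_point(values, labels):
    """Return (threshold, weighted Gini) for the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for v, l in pairs if v <= threshold]
        right = [l for v, l in pairs if v > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

ages = [22, 25, 30, 35, 40, 45]
classes = ["Class 1", "Class 1", "Class 1", "Class 2", "Class 2", "Class 2"]

threshold, score = best_split_point(ages, classes)
print(threshold, score)   # 32.5 0.0
```

Here the two classes do not overlap, so the midpoint between 30 and 35 separates them perfectly; with overlapping classes the best score would be greater than zero.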
6. Visual Construction of Decision Trees
● Decision trees are displayed graphically:
o Nodes represent attributes
o Edges represent conditions
o Leaves represent class labels
Diagram: Visual Decision Tree
        [Age]
        /   \
    Young    Old
      |       |
  [Income]   Yes
    /  \
  No    Yes
7. Interactive Tree Modification
● Users can:
o Adjust split points
o Expand or collapse nodes
o Explore subtrees
● Encourages human-in-the-loop learning
8. Visual Mining for Tree Pruning
● Visualization highlights:
o Weak or noisy branches
o Misclassified instances
● Helps in deciding which branches to prune
Diagram: Visual Pruning
Large Tree → Highlight Weak Branches → Pruned Tree
9. Advantages of Visual Mining
● Improves interpretability
● Enhances decision-making
● Supports exploratory analysis
● Reduces overfitting through visual inspection
10. Limitations of Visual Mining
● Difficult to scale for extremely large trees
● Requires user expertise
● Subjective decisions may affect consistency
Bayesian Classification Method – Bayes Theorem
Bayesian classification is a probabilistic supervised learning method based on Bayes
Theorem. It predicts the class of a data instance by computing the probability that the instance
belongs to each class and selecting the class with the highest posterior probability.
1. Concept of Bayesian Classification
● Uses probability theory for classification
● Assumes uncertainty in data
● Classifies instances based on maximum posterior probability
2. Bayes Theorem
Bayes Theorem describes the relationship between:
● Prior probability
● Likelihood
● Posterior probability
Formula:
P(C | X) = [ P(X | C) × P(C) ] / P(X)
Where:
● P(C | X) = Posterior probability of class C given X
● P(X | C) = Likelihood of data X given class C
● P(C) = Prior probability of class C
● P(X) = Evidence (constant for all classes)
3. Interpretation of Bayes Theorem
● Combines prior knowledge with observed data
● Updates belief after seeing new evidence
Diagram: Bayes Components
Prior (P(C)) + Evidence (X) → Bayes Theorem → Posterior (P(C | X))
4. Bayesian Classification Process
1. Identify possible classes
2. Calculate prior probability for each class
3. Compute likelihood for given data
4. Apply Bayes Theorem
5. Choose class with maximum posterior probability
5. Bayesian Decision Rule
A tuple X is assigned to class Cᵢ if:
P(Cᵢ | X) > P(Cⱼ | X) for all j ≠ i
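A small numeric example shows how the decision rule is applied. The priors and likelihoods below are hypothetical values for a spam filter where X is "the email contains the word offer"; the function name is illustrative.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes Theorem: P(C | X) = P(X | C) * P(C) / P(X) for every class C."""
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())          # P(X), the same for every class
    return {c: joint[c] / evidence for c in joint}

priors = {"spam": 0.4, "not spam": 0.6}         # P(C)
likelihoods = {"spam": 0.5, "not spam": 0.1}    # P(X | C)

post = posteriors(priors, likelihoods)
predicted = max(post, key=post.get)             # class with the maximum posterior
print({c: round(p, 3) for c, p in post.items()})   # {'spam': 0.769, 'not spam': 0.231}
print(predicted)                                    # spam
```

Although "not spam" has the larger prior, the observed evidence is five times more likely under "spam", so the posterior favours "spam".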
6. Bayesian Classification Model
● Requires estimation of:
o Prior probabilities
o Conditional probabilities
● Model is built from training data
Diagram: Bayesian Classifier
Training Data → Probability Estimation → Bayesian Model
New Data → Bayesian Model → Class Prediction
7. Advantages of Bayesian Classification
● Simple and mathematically sound
● Works well with small datasets
● Handles missing data efficiently
● Robust to noise
8. Limitations of Bayesian Classification
● Requires accurate probability estimates
● Computationally expensive for many features
● Assumptions may not always hold
9. Applications of Bayesian Classification
● Email spam filtering
● Medical diagnosis
● Document classification
● Credit risk analysis
Naïve Bayes Classification
Naïve Bayes Classification is a probabilistic supervised learning method based on Bayes
Theorem, with the naïve assumption that all features are conditionally independent given the
class label. Despite this simplification, it performs well in many real-world applications.
1. Concept of Naïve Bayes Classifier
● Uses Bayes Theorem for prediction
● Assumes features are conditionally independent of each other given the class
● Chooses the class with maximum posterior probability
2. Bayes Theorem Used in Naïve Bayes
For a data instance X = (x₁, x₂, …, xₙ) and a class C:
P(C | X) = [ P(X | C) × P(C) ] / P(X)
Since P(X) is the same for every class, it can be dropped when comparing classes:
P(C | X) ∝ P(C) × P(X | C)
3. Naïve Independence Assumption
Naïve Bayes assumes:
P(X | C) = P(x₁ | C) × P(x₂ | C) × … × P(xₙ | C)
This simplifies computation significantly.
Diagram: Independence Assumption
        Class C
       /   |   \
     x1    x2   x3
(Features conditionally independent given C)
4. Types of Naïve Bayes Classifiers
1. Gaussian Naïve Bayes – continuous data
2. Multinomial Naïve Bayes – text and document classification
3. Bernoulli Naïve Bayes – binary features
5. Naïve Bayes Classification Process
1. Calculate prior probabilities for each class
2. Estimate conditional probabilities of features
3. Compute posterior probability for each class
4. Assign instance to class with highest probability
6. Naïve Bayes Model Construction
● Probabilities are estimated from training data
● Model stores:
o Class priors
o Feature likelihoods
Diagram: Naïve Bayes Model
Training Data → Probability Estimation → Naïve Bayes Model
New Data → Naïve Bayes Model → Class Prediction
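The model construction and prediction steps above can be sketched for categorical features as follows. This is a minimal illustration, not a reference implementation: the toy spam data, the feature names (offer, link), and the function names are hypothetical, and Laplace (add-one) smoothing is applied so that an unseen feature value does not force a zero probability.

```python
from collections import Counter, defaultdict

def train(rows, labels, alpha=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (class, feature) -> Counter of values
    feature_values = defaultdict(set)     # feature -> set of observed values
    for row, label in zip(rows, labels):
        for feature, value in row.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return class_counts, value_counts, feature_values, alpha

def predict(model, row):
    """Return the class maximizing P(C) * product of smoothed P(x_i | C)."""
    class_counts, value_counts, feature_values, alpha = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / total                                   # prior P(C)
        for feature, value in row.items():
            count = value_counts[(c, feature)][value]
            k = len(feature_values[feature])                  # distinct values of the feature
            score *= (count + alpha) / (n_c + alpha * k)      # Laplace-smoothed P(x_i | C)
        scores[c] = score
    return max(scores, key=scores.get), scores

rows = [
    {"offer": "yes", "link": "yes"},
    {"offer": "yes", "link": "no"},
    {"offer": "no", "link": "yes"},
    {"offer": "no", "link": "no"},
    {"offer": "no", "link": "no"},
]
labels = ["spam", "spam", "not spam", "not spam", "not spam"]

model = train(rows, labels)
label, scores = predict(model, {"offer": "yes", "link": "yes"})
print(label)     # spam
print(scores)    # unnormalized scores, roughly 0.15 for spam and 0.048 for not spam
```

Without smoothing, the score for "not spam" would be exactly zero here, because no "not spam" training email contains offer = yes. With many features, summing log-probabilities instead of multiplying probabilities avoids numerical underflow.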
7. Advantages of Naïve Bayes
● Simple and fast
● Works well with high-dimensional data
● Requires relatively little training data
● Handles missing values efficiently
8. Limitations of Naïve Bayes
● Independence assumption is often unrealistic
● Poor performance when features are highly correlated
● Zero probability problem (solved using Laplace smoothing)
9. Applications of Naïve Bayes
● Email spam detection
● Sentiment analysis
● Document and text classification
● Medical diagnosis
Rule-Based Classification
Rule-Based Classification is a supervised learning technique that classifies data using a set of
IF–THEN rules. Each rule represents a relationship between attribute conditions and a class
label, making the model easy to understand and interpret.
1. Concept of Rule-Based Classification
● Knowledge is represented in the form of rules
● Each rule predicts a class when conditions are satisfied
● Classification is done by matching rules to data instances
Example Rule:
IF Age > 30 AND Income = High
THEN Class = Approved
2. Structure of a Classification Rule
A rule consists of two parts:
● Antecedent (IF part) – conditions on attributes
● Consequent (THEN part) – predicted class
Diagram: Rule Structure
IF (Condition1 AND Condition2)
THEN Class
3. Rule-Based Classifier Model
● A classifier contains a set of rules
● Rules are usually evaluated in sequence
● The first matching rule assigns the class
Diagram: Rule Set
Rule 1 → Rule 2 → Rule 3 → Default Rule
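An ordered rule set with a default rule can be sketched as a list that is scanned from top to bottom. The first rule is the example rule given earlier; the second rule, the default class "Review", and the function name are hypothetical additions for illustration.

```python
# Each rule is (condition, class). Rules are checked in order; the first match wins.
rules = [
    (lambda r: r["age"] > 30 and r["income"] == "High", "Approved"),
    (lambda r: r["income"] == "Low", "Rejected"),          # hypothetical second rule
]
DEFAULT_CLASS = "Review"                                   # hypothetical default rule

def classify(record):
    for condition, label in rules:
        if condition(record):
            return label
    return DEFAULT_CLASS   # no rule matched

print(classify({"age": 45, "income": "High"}))     # Approved
print(classify({"age": 25, "income": "Low"}))      # Rejected
print(classify({"age": 25, "income": "High"}))     # Review
```

Because evaluation stops at the first match, changing the order of the rules can change the predicted class when more than one rule covers a record.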
4. Rule Extraction Methods
Rules can be generated using:
● Direct rule learning algorithms
● Conversion from decision trees
● Association rule mining
5. Rule Generation from Decision Trees
● Each root-to-leaf path forms a rule
● Improves interpretability of trees
Diagram: Tree to Rules
Decision Tree
Root Node → Leaf (one path)
↓
IF conditions THEN Class
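Converting a tree into rules is a walk over every root-to-leaf path. In the sketch below the tree is stored as nested dictionaries; this representation, the attribute values, and the function name are assumptions made for illustration.

```python
def tree_to_rules(tree, conditions=()):
    """Collect one IF-THEN rule per root-to-leaf path."""
    if not isinstance(tree, dict):                     # leaf: emit a rule
        antecedent = " AND ".join(conditions) or "TRUE"
        return [f"IF {antecedent} THEN Class = {tree}"]
    rules = []
    for value, child in tree["children"].items():     # internal node: extend the path
        rules += tree_to_rules(child, conditions + (f"{tree['attr']} = {value}",))
    return rules

# Hypothetical tree: test Age at the root, then Income for young applicants
tree = {"attr": "Age", "children": {
    "Young": {"attr": "Income", "children": {"Low": "No", "High": "Yes"}},
    "Old": "Yes",
}}

for rule in tree_to_rules(tree):
    print(rule)
# IF Age = Young AND Income = Low THEN Class = No
# IF Age = Young AND Income = High THEN Class = Yes
# IF Age = Old THEN Class = Yes
```

Rules extracted this way are mutually exclusive and exhaustive, since every record follows exactly one path through the tree.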
6. Rule Evaluation Measures
Rules are evaluated using:
● Coverage – number of instances covered by a rule
● Accuracy – correctness of predictions
● Support and Confidence
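Coverage and accuracy of a single rule can be computed directly from labelled data. The records below are made up, and coverage is reported both as a count and as a fraction of the data set; the function name is illustrative.

```python
def rule_quality(records, condition, predicted_class):
    """Return (number covered, coverage fraction, accuracy among covered records)."""
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if r["label"] == predicted_class]
    coverage = len(covered) / len(records)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return len(covered), coverage, accuracy

records = [
    {"age": 45, "income": "High", "label": "Approved"},
    {"age": 35, "income": "High", "label": "Approved"},
    {"age": 50, "income": "High", "label": "Rejected"},
    {"age": 25, "income": "High", "label": "Rejected"},
    {"age": 40, "income": "Low", "label": "Rejected"},
]

# Rule: IF Age > 30 AND Income = High THEN Class = Approved
n_covered, coverage, accuracy = rule_quality(
    records, lambda r: r["age"] > 30 and r["income"] == "High", "Approved")
print(n_covered, coverage, round(accuracy, 3))   # 3 0.6 0.667
```

The rule covers three of the five records and is right for two of those three.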
7. Conflict Resolution in Rules
When multiple rules apply:
● Use rule ordering
● Choose rule with highest confidence
● Apply default rule if none match
8. Advantages of Rule-Based Classification
● Easy to understand and interpret
● Flexible and modular
● Suitable for knowledge-based systems
● Can incorporate domain knowledge
9. Limitations of Rule-Based Classification
● Large number of rules may be required
● Rule conflicts can occur
● Performance depends on rule quality
10. Applications of Rule-Based Classification
● Expert systems
● Medical diagnosis
● Credit approval systems
● Fault detection
Model Evaluation and Selection
Model Evaluation and Selection is the process of assessing, comparing, and choosing the
best machine learning model that performs well on unseen data. It ensures that the selected
model is accurate, reliable, and generalizes well.
1. Need for Model Evaluation and Selection
● Different models perform differently on the same data
● Prevents overfitting and underfitting
● Ensures good generalization to new data
● Helps choose the most suitable model
2. Model Evaluation Process
Evaluation measures how well a trained model predicts unseen data.
Diagram: Evaluation Process
Training Data → Model Training → Trained Model
Test Data → Trained Model → Performance Metrics
3. Data Splitting Methods
To evaluate models, data is divided into:
● Training set
● Validation set
● Test set
Diagram: Dataset Split
Dataset
├── Training Set
├── Validation Set
└── Test Set
4. Performance Evaluation Metrics
Common metrics used for classification:
● Accuracy
● Precision
● Recall
● F1-Score
● Confusion Matrix
Diagram: Confusion Matrix
                Predicted
                Yes    No
Actual Yes      TP     FN
Actual No       FP     TN
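The metrics listed above all follow from the four cells of the confusion matrix. A minimal sketch with made-up actual and predicted labels (function names are illustrative):

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FP, FN, TN, treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a != positive and p == positive:
            fp += 1
        elif a == positive and p != positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted Yes, how many were Yes
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual Yes, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

actual    = ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "Yes", "No", "Yes", "No", "No", "No", "No", "No"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)               # 3 1 1 5
print(metrics(tp, fp, fn, tn))      # (0.8, 0.75, 0.75, 0.75)
```

Accuracy alone can be misleading on imbalanced data, which is why precision, recall, and F1-score are reported alongside it.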
5. Cross-Validation
Cross-validation improves reliability of evaluation.
● Data is divided into k folds
● Model is trained and tested k times
● Average performance is calculated
Diagram: k-Fold Cross-Validation
Fold 1 | Fold 2 | Fold 3 | Fold 4 | Fold 5
Train + Test rotated each time
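The rotation of folds can be sketched as below. The `evaluate` callback stands in for "train a model on these indices and score it on those"; here it is a hypothetical majority-class baseline on made-up labels, used only so that the example runs end to end.

```python
from collections import Counter

def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k consecutive folds of near-equal size."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(n_samples, k, evaluate):
    """Use each fold once as the test set and average the k scores."""
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

labels = ["a", "a", "b", "a", "a", "b", "a", "a", "b", "a"]

def majority_baseline(train_idx, test_idx):
    # "Training" = find the most common class; "testing" = accuracy of always predicting it
    majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
    return sum(labels[i] == majority for i in test_idx) / len(test_idx)

print(k_fold_indices(10, 5))   # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
print(round(cross_validate(10, 5, majority_baseline), 2))   # 0.7
```

In practice the data are usually shuffled (and often stratified by class) before the folds are formed.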
6. Bias–Variance Tradeoff
● High bias → Underfitting
● High variance → Overfitting
● Good model balances both
Diagram: Bias–Variance
Underfitting (high bias) ← Optimal Model → Overfitting (high variance)
7. Model Selection Criteria
Models are selected based on:
● Accuracy and error rate
● Complexity of the model
● Training and prediction time
● Interpretability
8. Hyperparameter Tuning
● Adjusts hyperparameters (settings fixed before training) to improve performance
● Uses validation data or cross-validation
● Examples: tree depth, k in k-NN, learning rate
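As an illustration of tuning with validation data, the sketch below selects k for a one-dimensional k-NN classifier. The training and validation points, the candidate values of k, and the function names are all made up for this example; ties in validation accuracy go to the smaller k.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Majority class among the k training points closest to x."""
    neighbours = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(train, validation, k):
    hits = sum(knn_predict(train, x, k) == label for x, label in validation)
    return hits / len(validation)

train = [(1.0, "A"), (2.0, "A"), (3.0, "A"), (2.5, "B"),   # (2.5, "B") is a noisy point
         (6.0, "B"), (7.0, "B"), (8.0, "B")]
validation = [(2.4, "A"), (6.5, "B"), (1.5, "A")]

best_k, best_acc = None, -1.0
for k in (1, 3, 5, 7):                       # candidate hyperparameter values
    acc = accuracy(train, validation, k)
    if acc > best_acc:                       # strict ">" keeps the smaller k on ties
        best_k, best_acc = k, acc

print(best_k, best_acc)   # 3 1.0
```

With k = 1 the noisy point misleads the prediction for 2.4, and with k = 7 every prediction is simply the overall majority class; k = 3 balances the two. The test set is not used here, so it remains available for an unbiased final evaluation.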
9. Comparing Multiple Models
● Train different models on same dataset
● Evaluate using same metrics
● Select best-performing model
Diagram: Model Comparison
Model A → Accuracy
Model B → Accuracy
Model C → Accuracy
Compare → Select Best Model