Comprehensive Guide to Machine Learning
Comprehensive Guide to Machine Learning
updated: [Link]
1140e70e8a0f80e1b059d08bf77fd68d?pvs=4
Unit - 1
Machine Learning
Types of Machine Learning
Key Concepts in Machine Learning
Common Machine Learning Algorithms
Applications of Machine Learning
Perspectives and Issues in Machine Learning
Perspectives in Machine Learning
Issues in Machine Learning
Review of Probability in Machine Learning
Key Concepts in Probability
Basic Linear Algebra in Machine Learning Techniques
Linear Algebra in Machine Learning Techniques
Conclusion
Dataset and Its Types
Components of a Dataset
Types of Datasets
1. Based on Machine Learning Task
2. Based on Data Structure
3. Based on Label Availability
4. Based on Data Type
Data Preprocessing in Machine Learning
Key Steps in Data Preprocessing
Conclusion
Machine Learning: Bias, Variance, Function Approximation, and Overfitting
1. Function Approximation
Mathematical Formulation:
Common Approaches:
2. Bias and Variance
Bias
Variance
Bias-Variance Decomposition
3. Overfitting
Characteristics:
Causes:
Detection Methods:
Prevention Techniques:
Relationships and Trade-offs
Machine Learning 1
Unit - 2
Regression Analysis in Machine Learning: Introduction and Terminologies
Key Terminologies in Regression
Types of Regression in Machine Learning
7. Stepwise Regression
8. Quantile Regression
Logistic Regression
Assumptions of Logistic Regression
Advantages of Logistic Regression
Disadvantages of Logistic Regression
Extensions of Logistic Regression
Applications of Logistic Regression
Conclusion
Simple Linear Regression: Introduction, Assumptions, and Model Building
1. Introduction to Simple Linear Regression
Mathematical Representation
2. Assumptions of Simple Linear Regression
3. Simple Linear Regression Model Building
Step 1: Data Collection and Preparation
Step 2: Estimating Model Parameters
Step 3: Assessing Model Fit
Step 4: Hypothesis Testing
Step 5: Model Diagnostics
Step 6: Model Interpretation and Use
Step 7: Model Validation
Ordinary Least Squares Estimation, Properties of Estimators, Interval Estimation, and Residuals in Linear
Regression
1. Ordinary Least Squares (OLS) Estimation
Mathematical Formulation
Derivation of OLS Estimators
2. Properties of the Least-Squares Estimators
Sampling Distributions of the Estimators
3. Properties of the Fitted Regression Model
4. Interval Estimation in Simple Linear Regression
5. Residuals
Properties of Residuals
Standardized Residuals
Residual Analysis
Measures of Influence
Multiple Linear Regression: Model, Assumptions, Output Interpretation, and Model Fit Assessment
1. Multiple Linear Regression Model
Mathematical Representation
2. Assumptions of Multiple Linear Regression
3. Interpreting Multiple Linear Regression Output
Machine Learning 2
R-Square (R²)
Adjusted R-Square
Standard Error of the Regression (S)
F-statistic and Significance F
Coefficient P-values
Coefficients
4. Assessing the Fit of Multiple Linear Regression Model
R-squared (R²)
Adjusted R-squared
Standard Error of the Regression (S)
Additional Methods for Assessing Model Fit
Feature Selection and Dimensionality Reduction: PCA, LDA, ICA
Feature Selection
Dimensionality Reduction
Principal Component Analysis (PCA).
How Principle Component Analysis (PCA) work?
Advantages of PCA:
Limitations of PCA:
Linear Discriminant Analysis (LDA)
How the Linear Discriminant Analysis (LDA) work?
Advantages of LDA:
Limitations of LDA:
Independent Component Analysis (ICA)
Statistical Independence Concept:
Assumptions in ICA
Mathematical Representation of Independent Component Analysis
Linear Static Transformation
Advantages of Independent Component Analysis (ICA):
Disadvantages of Independent Component Analysis (ICA):
Cocktail Party Problem
Limitations of ICA:
Comparison of PCA, LDA, and ICA
Unit 3
Introduction to Classification and Classification Algorithms
What is Classification?
Examples of Classification Problems
General Approach to Classification
Common Classification Algorithms
Challenges in Classification
k-Nearest Neighbor (k-NN) Algorithm
Introduction
How k-NN Works
Key Components of k-NN
Algorithm Steps
Advantages of k-NN
Machine Learning 3
Disadvantages of k-NN
Selecting the Optimal Value of k
Practical Considerations
Example
Applications of k-NN
Conclusion
Random Forests
Introduction
How Random Forest Works
Steps Involved in Random Forest Algorithm
Key Features of Random Forest
Advantages of Random Forest
Disadvantages of Random Forest
Hyperparameters in Random Forest
Feature Importance in Random Forest
Example of Random Forest in Action
Applications of Random Forest
Conclusion
Fuzzy Set Approaches
Introduction to Fuzzy Sets
Characteristics of Fuzzy Sets
Membership Functions
Operations on Fuzzy Sets
Fuzzy Set vs. Probability
Applications of Fuzzy Sets
Fuzzy Inference Systems (FIS)
Example of Fuzzy Set
Conclusion
Support Vector Machine (SVM)
Introduction
Key Concepts in SVM
How SVM Works
Kernel Trick
Mathematical Representation
Soft Margin and Regularization
Advantages of SVM
Disadvantages of SVM
Applications of SVM
Example of SVM
Types of Support Vector Machine (SVM) Kernels
1. Linear Kernel
2. Polynomial Kernel
3. Gaussian Kernel (Radial Basis Function - RBF Kernel)
Choosing the Right Kernel
Summary of Kernels
Machine Learning 4
Examples of Kernel Applications
Conclusion
Hyperplane, Properties of SVM, and Issues in SVM
1. Hyperplane (Decision Surface)
2. Properties of SVM
3. Issues in SVM
Summary
Decision Trees: Decision Tree Learning Algorithm
Introduction
Structure of a Decision Tree
Decision Tree Learning Algorithm
Common Splitting Criteria
Algorithm Steps of Decision Tree Learning
Hyperparameters for Decision Trees
Advantages of Decision Trees
Disadvantages of Decision Trees
Regularization Techniques in Decision Trees
Applications of Decision Trees
Conclusion
Decision Tree Learning: ID3 Algorithm, Inductive Bias, Entropy, Information Theory, Information Gain,
and Issues in Decision Tree Learning
1. ID3 Algorithm (Iterative Dichotomiser 3)
Working Principle of ID3 Algorithm
Steps in ID3 Algorithm
Features of ID3
Disadvantages of ID3
2. Inductive Bias
Inductive Bias in Decision Trees:
3. Entropy and Information Theory
Entropy in Decision Trees:
Interpretation:
4. Information Gain
How Information Gain is Calculated:
Significance:
5. Issues in Decision Tree Learning
Summary
Bayesian Learning
1. Bayes' Theorem
Bayes' Theorem Formula
Illustrative Example Using Bayes' Theorem
2. Concept Learning in Bayesian Framework
3. Bayes Optimal Classifier
Bayes Optimal Prediction
Benefits of Bayes Optimal Classifier
Limitations
Machine Learning 5
Example of Bayesian Learning and Bayes Optimal Classifier
Summary
What is Bayes’ theorem?
What is Naive Bayes algorithm?
Advantages of using Naive Bayes
Disadvantages of using Naive Bayes
Naive Bayes Types
Bayesian Belief Networks and EM Algorithm
1. Bayesian Belief Networks (BBNs)
Structure of a Bayesian Belief Network
Properties of Bayesian Belief Networks
Joint Probability Distribution
Inference in Bayesian Networks
Applications of Bayesian Belief Networks
Example of a Bayesian Network
2. Expectation-Maximization (EM) Algorithm
Objective of the EM Algorithm
Steps of the EM Algorithm
Mathematical Representation of EM Algorithm
Applications of EM Algorithm
Example of EM Algorithm for Gaussian Mixture Model (GMM)
Summary
Ensemble Learning
Types of Ensemble Learning
Bagging: Bootstrap Aggregating
Random forest
Boosting: Iterative Learning
Gradient boosting
Voting
Benefits of Ensemble Learning
Improved Accuracy and Stability
Robustness
Reducing Overfitting
Challenges and Considerations in Ensemble Learning
Model Selection and Weighting
Computational Complexity
Diversity and Overfitting
Interpretability
AdaBoost and XGBoost
1. AdaBoost (Adaptive Boosting)
Introduction to AdaBoost
How AdaBoost Works
Advantages of AdaBoost
Disadvantages of AdaBoost
Applications of AdaBoost
Machine Learning 6
2. XGBoost (Extreme Gradient Boosting)
Introduction to XGBoost
How XGBoost Works
Advantages of XGBoost
Disadvantages of XGBoost
Applications of XGBoost
Summary: AdaBoost vs. XGBoost
Classification Metrics in Machine Learning
The Limitations of Accuracy
Example of Limitation of Accuracy
What is Confusion Matrix?
Precision
Recall (Sensitivity)
F1 Score
AUC-ROC
Working of AUC
Log Loss
Conclusion
Classification Model Evaluation and Selection: Real Example
Step 1: Data Preparation
Step 2: Model Training
Step 3: Model Evaluation
Example Confusion Matrix Output
Step 5: Model Selection
Conclusion
1. Sensitivity (Recall or True Positive Rate)
2. Specificity (True Negative Rate)
3. Positive Predictive Value (PPV or Precision)
4. Negative Predictive Value (NPV)
Summary
Gains chart
Confusion matrix
ROC curve
Usage
So, what is the difference?
Lift chart
Misclassification Cost Adjustment and Decision Cost/Benefit Analysis
1. Misclassification Cost Adjustment
Techniques for Misclassification Cost Adjustment
Example:
2. Decision Cost/Benefit Analysis
Steps for Decision Cost/Benefit Analysis
Example: Bank Loan Approval
Optimizing Based on Cost/Benefit:
Summary
Machine Learning 7
Unit 4
Cluster Analysis and Clustering Methods
Introduction to Cluster Analysis
Applications of Cluster Analysis
Steps in Cluster Analysis
Types of Clustering
Clustering Methods
1. Partitioning Methods
2. Hierarchical Methods
3. Density-Based Methods
4. Grid-Based Methods
5. Model-Based Methods
Evaluation of Clustering
Challenges in Clustering
Conclusion
The Clustering Task and Requirements for Cluster Analysis
The Clustering Task
Requirements for Cluster Analysis
Overview of Basic Clustering Methods
1. k-Means Clustering
2. k-Medoids Clustering
Key Differences Between k-Means and k-Medoids:
K-Medoids clustering-Theoretical Explanation
Here is a small recap on K-Means clustering:
K-Medoids:
Algorithm:
Limitation of PAM:
Time complexity: O(k * (n - k)2)
CLARA:
CLARANS:
Advantages of using K-Medoids:
Disadvantages:
K-Means and K-Medoids:
Density-Based Clustering: DBSCAN
Density-Based Clustering Algorithms
Parameter Estimation
Gaussian Mixture Model algorithm
Gaussian Mixture Model Defined
What Is a Gaussian Mixture Model?
Gaussian Mixture Model (GMM) Algorithm
Overview
Key Concepts
Algorithm Steps
Advantages
Disadvantages
Machine Learning 8
Applications
Example Illustration
BIRCH
Cluster Features
CF Tree
Clustering the Sub-Clusters
Parameters of BIRCH
Conclusion
Affinity Propagation Clustering Algorithm
Advantages:
Limitations:
Mean-Shift Clustering Algorithm
Key Concepts:
Algorithm Steps:
Advantages:
Limitations:
Key Differences Between AP and Mean-Shift:
Ordering Points to Identify the Clustering Structure (OPTICS) Algorithm
Advantages:
Limitations:
What is Hierarchical Clustering
Why Hierarchical Clustering
Types of Hierarchical Clustering
Agglomerative Clustering
How does Agglomerative Hierarchical Clustering work
Step 1
Step 2
Step 3
Step 4
Difference ways to measure the distance between two clusters
Simple Linkage
Pros and Cons of Simple Linkage method
Pros and Cons of Complete Linkage method
Pros and Cons of the Average Linkage method
Pros and Cons of Centroid Linkage method
Pros and Cons of Ward’s Linkage method
What is Dendrogram
Divisive Hierarchical Clustering
Steps to perform Divisive Clustering
Strengths and Limitations of Hierarchical Clustering Algorithm
Strengths of Hierarchical Clustering
Limitations of Hierarchical Clustering
Conclusion
Divisive Hierarchical Clustering Algorithm
Key Concepts:
Machine Learning 9
Algorithm Steps:
Advantages:
Limitations:
Measuring Clustering Goodness
Choosing a Measure
Unit - 1
Machine Learning
Machine Learning (ML) is a branch of artificial intelligence (AI) that enables computers to learn
from data and improve their performance on tasks over time, without being explicitly
programmed. The core idea is to make data-driven predictions or decisions by building
mathematical models based on sample data, known as "training data."
ML involves several important concepts, techniques, and classifications. Below are detailed
notes covering key concepts.
Supervised learning is a type of ML where the model is trained on labeled data. The
algorithm learns the mapping function from the input (features) to the output (target labels).
Output: A function that can predict the label for new, unseen data.
Examples:
Popular Algorithms:
Linear Regression
Logistic Regression
Decision Trees
Random Forests
2. Unsupervised Learning
In unsupervised learning, the model is trained on data without any labels. The goal is to
identify underlying structures in the data. This type of learning is used for clustering,
Machine Learning 10
association, and dimensionality reduction.
Examples:
Dimensionality Reduction: Reducing the complexity of data while preserving its key
characteristics (Principal Component Analysis - PCA).
Popular Algorithms:
K-Means Clustering
3. Semi-Supervised Learning
Popular Algorithms:
Self-training
Co-training
4. Reinforcement Learning
In reinforcement learning, an agent learns by interacting with its environment and receiving
rewards or penalties based on its actions. The goal is to maximize the cumulative reward
over time.
Examples:
Game-playing AI (AlphaGo)
Autonomous driving.
Popular Algorithms:
Machine Learning 11
Q-Learning
Splitting the dataset into training and testing sets (e.g., 80% training, 20% testing).
Testing and evaluating performance using metrics like accuracy, precision, recall, etc.
Overfitting: When a model performs well on the training data but poorly on new,
unseen data. This happens when the model learns the noise in the training data.
Underfitting: When a model is too simple and cannot capture the underlying trends in
the data, leading to poor performance on both the training and testing sets.
Preventive Measures:
Cross-validation: A technique where the dataset is split into multiple parts, and the
model is trained and tested on different splits to avoid overfitting.
3. Bias-Variance Tradeoff
Bias: Error due to overly simplistic models that fail to capture complex patterns (leading
to underfitting).
Variance: Error due to a model that is too complex and captures noise in the training
data (leading to overfitting).
The goal is to find a balance between bias and variance.
Precision: The proportion of true positive predictions out of all positive predictions.
Machine Learning 12
Recall (Sensitivity): The proportion of true positive predictions out of all actual
positives.
F1 Score: The harmonic mean of precision and recall, useful when you need to balance
them.
Confusion Matrix: A matrix that shows true positives, true negatives, false positives,
and false negatives.
Feature Selection: The process of selecting the most important features from the
dataset to improve model performance and reduce overfitting.
Use Cases:
Spam detection.
3. Decision Trees
Machine Learning 13
Decision trees are non-parametric models used for both classification and regression
tasks. They recursively split the data based on feature values to form a tree-like structure,
where each internal node represents a decision based on a feature, and each leaf node
represents an outcome.
Key Concepts:
Advantages:
Easy to interpret.
Key Concepts:
Margin: The distance between the hyperplane and the nearest data point from each
class.
Kernel Trick: Transforms data into higher dimensions to make it linearly separable.
6. Neural Networks
Neural networks are inspired by the structure of the human brain and are used for both
classification and regression tasks. They consist of layers of neurons, where each neuron
performs a weighted sum of its inputs and passes the result through a non-linear activation
function.
Key Concepts:
Machine Learning 14
Backpropagation: The process of updating weights in a neural network by calculating
the error and propagating it back through the layers.
2. Finance
Machine learning is used for stock price prediction, fraud detection, and algorithmic
trading.
3. E-commerce
Recommendation engines use ML to suggest products to users based on their browsing
and purchasing history.
4. Autonomous Vehicles
Self-driving cars use ML algorithms for tasks like object detection, path planning, and
decision-making.
Machine Learning 15
Current Trends:
Deep Learning: A subset of ML that uses artificial neural networks for complex tasks
like image and speech recognition.
Edge Computing in ML: Running ML models locally on devices, reducing latency and
dependence on cloud computing.
2. Societal Perspective
Machine learning has significant societal implications. It has the potential to improve the
quality of life, automate routine tasks, and offer personalized services. However, it also
raises concerns about job displacement, privacy, and decision-making fairness.
Key Aspects:
Key Benefits:
Cost Reduction: By automating processes and reducing human error, businesses can
lower costs.
Increased Productivity: Machine learning models can process large datasets faster
than humans, improving decision-making speed.
Algorithmic Bias: Algorithms may perpetuate or even exacerbate biases if the training
data used is biased.
Machine Learning 16
Transparency: Many ML models, especially deep learning models, are seen as "black
boxes" due to their complexity, making it difficult to explain their decisions.
Data Scarcity: Many domains lack sufficient labeled data, which hinders model training.
Noisy Data: Data can be incomplete, inconsistent, or contain errors, leading to poor
model performance.
Bias in Data: If the data used for training is biased or unbalanced, the model may
produce biased outcomes, particularly in sensitive applications like hiring, lending, or
criminal justice.
Potential Solutions:
Data Augmentation: Generating synthetic data to improve the dataset size and variety.
Transfer Learning: Using pre-trained models on related tasks to address data scarcity.
Active Learning: Involving human experts to label the most informative samples
selectively.
2. Model Interpretability
One of the biggest challenges in machine learning, especially with complex models like
deep learning, is interpretability.
Black Box Problem: Many machine learning models, particularly deep neural networks,
are not interpretable by humans. This makes it difficult to understand how a decision is
made, which can be problematic in areas like healthcare or finance where transparency
is essential.
Potential Solutions:
Machine Learning 17
Model Simplification: Choosing simpler models like decision trees or linear models
when interpretability is more important than performance.
Overfitting occurs when a model performs well on training data but fails to generalize to
unseen data. This is one of the most common issues in machine learning.
Overfitting: This happens when the model learns noise or irrelevant patterns in the
training data, resulting in poor performance on new data.
Underfitting: On the contrary, underfitting occurs when the model is too simple to
capture the underlying structure of the data, leading to both poor training and test
performance.
Potential Solutions:
Data Augmentation: Increasing the size of the training data to expose the model to
more variation.
Machine learning models can sometimes make unfair decisions due to biases present in
the data. This can result in discrimination, particularly in applications like hiring, lending,
law enforcement, and healthcare.
Bias Amplification: If a training dataset contains biased data, the model can perpetuate
or even amplify these biases.
Potential Solutions:
Training large-scale machine learning models, especially deep neural networks, requires
enormous computational resources, both in terms of processing power and storage.
Machine Learning 18
High Computational Cost: Training large models, like GPT (generative pre-trained
transformers), requires massive amounts of GPU/TPU resources, which is a challenge
for many organizations.
Energy Consumption: As the size of machine learning models grows, so does their
energy consumption, raising concerns about sustainability and environmental impact.
Potential Solutions:
Adversarial Attacks: Small, imperceptible changes to input data can cause machine
learning models (especially deep neural networks) to make incorrect predictions. This
can be problematic in applications like facial recognition or autonomous driving.
Data Privacy: Many ML models require large amounts of personal data, raising
concerns about user privacy. In particular, the use of data in healthcare or finance
needs to comply with strict privacy regulations like GDPR.
Potential Solutions:
Machine learning raises ethical concerns, particularly when deployed in sensitive areas
such as healthcare, law enforcement, or autonomous decision-making.
Ethics in AI: Ensuring that machine learning models are built and deployed in a way that
respects human rights, dignity, and fairness.
Potential Solutions:
Machine Learning 19
Human-in-the-loop Systems: Keeping a human supervisor in critical decision-making
processes to ensure accountability.
Conclusion
The perspectives and issues in machine learning are diverse and complex. While ML offers
great potential in improving technology and society, it also presents challenges related to data
quality, interpretability, bias, security, and ethics. It is crucial to develop machine learning
systems responsibly, with a strong emphasis on fairness, transparency, and accountability to
ensure that these systems benefit society equitably and without harm.
Machine Learning 20
Machine Learning 21
Machine Learning 22
3. Hidden Markov Models (HMMs)
HMMs are used in sequential data, where the system being modeled is assumed to be a
Markov process with hidden states. Probabilistic transitions between states are modeled,
and observations depend on the underlying states.
Machine Learning 23
Basic Linear Algebra in Machine Learning Techniques
Linear algebra is the backbone of many machine learning algorithms, especially those dealing
with large datasets and complex models. It is essential for understanding how data is
represented and manipulated in high-dimensional spaces.
Machine Learning 24
7. Matrix Inversion and Pseudo-Inverse
In machine learning, finding the inverse of a matrix can be essential for solving systems of
linear equations (as in linear regression). However, not all matrices are invertible, so we
often compute the Moore-Penrose pseudo-inverse.
w = (XT X)−1 XT y
In SVMs, the goal is to find a hyperplane that maximizes the margin between different
classes. The problem is formulated as a convex optimization problem, often solved using
quadratic programming. SVMs make extensive use of matrix operations to compute
distances and margins.
4. Neural Networks
Conclusion
Probability helps to model uncertainty, make predictions, and estimate parameters in
various machine learning techniques like Naive Bayes, Hidden Markov Models, and
probabilistic models.
Linear Algebra provides the computational foundation for most machine learning
algorithms, enabling data representation, optimization, dimensionality reduction, and neural
network computations.
Understanding these two areas is critical for both theoretical and practical machine learning
applications.
Machine Learning 25
Dataset and Its Types
In machine learning, a dataset refers to a structured collection of data that is used to train and
evaluate models. A dataset typically consists of input data (features) and corresponding
outputs (labels or target values) that the model learns to predict. The quality and structure of
the dataset significantly impact the performance of machine learning algorithms.
Components of a Dataset
1. Features (Input Variables): The independent variables used as input to a machine learning
model. In a dataset, features are the attributes or columns that describe the data points
(observations).
3. Samples (Data Points): Each row in the dataset corresponds to a sample, which represents
a single observation or instance in the dataset.
Example: Each house in the housing dataset is a sample with specific features.
4. Metadata: Information about the dataset, such as feature descriptions, units, or additional
details about how the data was collected.
Types of Datasets
There are several types of datasets based on how they are used in machine learning, the type
of data they contain, and the tasks they address.
The dataset used to train the machine learning model. The model learns the
relationships between input features and target labels from the training data.
Typically, this dataset contains both input features and the corresponding labels
(supervised learning).
Example: A dataset of house features and prices used to train a regression model.
2. Validation Dataset
Used to tune model parameters and evaluate model performance during training. This
dataset helps in selecting the best model and hyperparameters.
Machine Learning 26
The validation set helps prevent overfitting by evaluating the model on data it hasn't
seen during training.
Example: After training a model on the training dataset, the validation set is used to
measure how well the model generalizes.
3. Test Dataset
The dataset used to assess the model’s final performance. It contains data points that
the model has never seen before, ensuring that the model generalizes well to unseen
data.
Example: After training and tuning, the test dataset evaluates how the model performs
on real-world or unseen data.
Data that is organized into tables (rows and columns), often found in databases and
spreadsheets. Each row represents a data point (sample), and each column represents
a feature or attribute.
Example: Financial records, customer data (with attributes like name, age, income), and
sensor readings.
2. Unstructured Data
Data that does not follow a predefined format or structure. This type of data is typically
text, images, audio, or video.
Example: Social media posts, emails, images (e.g., in facial recognition), or videos (e.g.,
for object detection).
3. Semi-Structured Data
Data that does not fit neatly into structured formats but contains some organizational
properties (tags, labels, etc.). Examples include data in JSON or XML formats.
Example: Web data, log files, and emails with specific headers.
In a labeled dataset, each sample has a corresponding output label. This is common in
supervised learning tasks, where the model is trained to predict the label.
Example: A dataset of images with labels indicating whether they contain a cat or a
dog.
2. Unlabeled Data
Machine Learning 27
In an unlabeled dataset, only the input features are provided, and no labels are
available. Unsupervised learning techniques, such as clustering, are used to extract
patterns or structures from such data.
Data that represents numbers and can be continuous (e.g., weight, height, price) or
discrete (e.g., count of items).
2. Categorical Data
Data that represents categories or discrete groups. Categorical features can be nominal
(no inherent order, like color ) or ordinal (with order, like education level ).
3. Time-Series Data
Data points that are ordered and collected over time intervals, commonly used in
forecasting and temporal analysis.
4. Text Data
Data in the form of words and sentences. Natural Language Processing (NLP)
techniques are used to process this type of data.
Removal: Remove rows or columns with missing data. This is feasible if the proportion
of missing data is small.
Machine Learning 28
Imputation: Replace missing values with a statistic (mean, median, or mode) or use
algorithms like K-Nearest Neighbors (KNN) to impute missing values.
Label Encoding: Assigns a unique integer to each category. This works well when the
categories have an ordinal relationship.
One-Hot Encoding: Creates binary columns for each category. Each column represents
whether a category is present (1) or not (0). This is commonly used for nominal
categorical features.
4. Feature Scaling
Feature scaling ensures that all input features contribute equally to the model’s
performance. This step is important for algorithms that are sensitive to feature magnitude,
such as:
5. Feature Engineering
Machine Learning 29
Feature Extraction: Creating new features from existing ones that better represent the
underlying patterns in the data.
Example: From date-time data, extract day , month , year , weekday , etc.
Resampling:
7. Dimensionality Reduction
Reducing the number of features in the dataset can help reduce overfitting and improve
computation time.
Principal Component Analysis (PCA): A linear technique that projects data onto fewer
dimensions while preserving variance.
IQR Method: Values beyond 1.5 times the interquartile range (IQR) are considered
outliers.
Conclusion
Datasets form the
Machine Learning 30
backbone of machine learning, providing the data required for model training and evaluation.
Understanding the types of datasets and their structure is crucial for selecting the right
algorithms and methods.
Data preprocessing is an essential step to prepare raw data for analysis. Proper handling
of missing data, scaling, encoding categorical variables, and dealing with imbalances and
outliers can significantly improve the performance and generalizability of machine learning
models.
By investing time in preprocessing, you ensure that your models are trained on clean, well-
structured data, leading to better results.
1. Function Approximation
In machine learning, function approximation is the task of learning or constructing a function
that generates appropriate outputs given inputs, based on a set of training examples.
Mathematical Formulation:
Given: A set of training examples {(x₁, y₁), (x₂, y₂), ..., (xₙ, yₙ)}
Goal: Find a function f(x) that approximates the true underlying function f*(x)
Common Approaches:
1. Parametric: Assume a specific form of f(x) with parameters θ, e.g., f(x) = θ₀ + θ₁x for linear
regression
Machine Learning 31
Bias
Definition: The error introduced by approximating a real-world problem with a simplified
model
Indicators: High training error, similar performance on training and validation sets
Variance
Definition: The error introduced due to the model's sensitivity to small fluctuations in the
training set
Bias-Variance Decomposition
For a given point x, the expected mean squared error can be decomposed as:
E[(y - f̂ (x))²] = (Bias[f̂ (x)])² + Var[f̂ (x)] + σ²
Where:
f̂ (x) is the learned function
Bias[f̂ (x)] = E[f̂ (x)] - f(x), the expected difference between the learned function and the true
function
3. Overfitting
Overfitting occurs when a model learns the training data too well, capturing noise and random
fluctuations as if they were part of the underlying pattern.
Characteristics:
Machine Learning 32
Low training error
Causes:
1. High model complexity relative to the amount of training data
2. Insufficient regularization
Detection Methods:
1. Learning curves: Plot training and validation errors against the training set size
Prevention Techniques:
1. Regularization (e.g., L1, L2 norms)
2. Early stopping
4. Data augmentation
3. Impact on Learning:
High Variance: Simplify model, add more training data, apply regularization
Machine Learning 33
Unit - 2
Regression Analysis in Machine Learning: Introduction and
Terminologies
Regression analysis is a fundamental technique in machine learning and statistics used to
model the relationship between a dependent variable (also called the target or output) and one
or more independent variables (also called features or predictors). The goal of regression is to
predict a continuous value, making it one of the most widely used approaches for supervised
learning tasks like predicting prices, demand, temperature, etc.
Example: In house price prediction, the house price is the dependent variable.
The input variables used to predict the dependent variable. They can be continuous or
categorical.
Example: The size of the house, number of rooms, location, etc., are independent
variables when predicting house prices.
3. Regression Coefficients
These are the values that represent the relationship between each independent
variable and the dependent variable. In a simple linear regression model, these are the
slope values that define how much the dependent variable changes for each unit
change in the independent variable.
4. Intercept (Constant)
The value of the dependent variable when all the independent variables are zero. It is
the point at which the regression line crosses the y-axis.
The difference between the observed actual values and the values predicted by the
regression model. It represents the unexplained variation in the dependent variable.
Machine Learning 34
6. Line of Best Fit (Regression Line)
In linear regression, the regression line represents the predicted values for the
dependent variable based on the independent variables. The goal is to find the line that
minimizes the residuals.
9. Multicollinearity
A situation where two or more independent variables are highly correlated, making it
difficult for the regression model to separate their individual effects on the dependent
variable.
Overfitting: When the model learns not only the underlying patterns in the data but also
the noise, leading to poor generalization to new data.
Underfitting: When the model is too simple to capture the underlying patterns, resulting
in poor performance on both training and test data.
Machine Learning 35
Machine Learning 36
Machine Learning 37
7. Stepwise Regression
Stepwise regression is a technique that involves adding or removing predictors based on their
statistical significance. The process can either:
Forward Selection: Start with no variables and add them one by one based on their
contribution to the model.
Backward Elimination: Start with all variables and remove the least significant ones step by
step.
Machine Learning 38
This technique is particularly useful when trying to build a simpler, more interpretable model by
selecting only significant features.
Applications: Useful in feature selection when you have a large number of variables, like in
econometrics or healthcare.
8. Quantile Regression
Unlike ordinary regression, which estimates the conditional mean of the dependent variable,
quantile regression estimates the conditional median (or other quantiles) of the dependent
variable. This allows us to model different quantiles (such as the 25th, 50th, or 90th percentile)
of the target variable, providing a more comprehensive understanding of the relationship
between variables.
Logistic Regression
Logistic regression is a statistical method used for binary classification problems where the
output variable is categorical and takes one of two possible values. Unlike linear regression,
which predicts a continuous output, logistic regression predicts the probability that a given
input point belongs to a particular class. The model is based on the logistic function, also
known as the sigmoid function, which maps any real-valued number into the range of 0 to 1.
1. Binary Classification
Logistic regression is primarily used for binary classification tasks, where the output
can be either 0 or 1, such as spam vs. non-spam or disease vs. no disease.
Machine Learning 39
5. Decision Boundary
The decision boundary is the threshold that determines the classification of the data
points. For binary classification, if the predicted probability \( P(y=1|x) \) is greater than
0.5, the output is classified as 1; otherwise, it is classified as 0.
Machine Learning 40
Assumptions of Logistic Regression
Machine Learning 41
1. Binary Outcome
2. Independence of Observations
The observations must be independent of each other. Logistic regression does not handle
correlated observations well.
3. Linearity of Logits
It assumes a linear relationship between the independent variables and the log-odds of the
dependent variable. While the actual relationship may not be linear, the logit transformation
allows modeling this relationship linearly.
4. No Multicollinearity
Logistic regression assumes that the independent variables are not highly correlated with
each other. Multicollinearity can distort the estimates of the coefficients.
2. Less Complex: It is computationally less intensive compared to more complex models like
neural networks, making it efficient for large datasets.
3. Good Performance: Performs well for linearly separable data and can be extended to
handle non-linear relationships through feature transformations.
2. Sensitivity to Outliers: Outliers can significantly affect the coefficients and predictions in
logistic regression, leading to biased results.
Machine Learning 42
4. Binary Classification: While extensions like multinomial logistic regression exist for multi-
class classification, logistic regression is primarily designed for binary outcomes.
2. Ordinal Logistic Regression: Used when the dependent variable is ordinal (i.e., has a
meaningful order but no consistent interval). It models the probability of being in a
particular category or below.
2. Finance: Credit scoring and risk assessment to determine the likelihood of loan default.
4. Social Sciences: Analyzing survey data to understand factors influencing voter behavior or
public opinion.
Conclusion
Logistic regression is a powerful and widely used classification algorithm in machine learning.
Its simplicity, interpretability, and effectiveness make it a go-to choice for binary classification
problems. Understanding its underlying mechanics, assumptions, and extensions is crucial for
applying it effectively in various applications. While it has limitations, logistic regression serves
as a solid foundation for many more complex classification techniques in machine learning.
Machine Learning 43
Simple Linear Regression is a statistical method used to model the linear relationship between
two variables:
The goal is to find the best-fitting straight line through the data points, known as the regression
line.
Mathematical Representation
The simple linear regression model is represented as:
Y = β₀ + β₁X + ε
Where:
ε is the error term (the difference between the predicted and actual Y values)
Machine Learning 44
5. No or little multicollinearity: Not applicable in simple linear regression, but important in
multiple regression.
R² = 1 - (SSR / SST)
Where SSR is the sum of squared residuals and SST is the total sum of squares
3. Standard Error of the Regression (S): Measures the average distance between the
observed values and the regression line
S = √(SSR / (n - 2))
Machine Learning 45
Step 4: Hypothesis Testing
1. t-test for individual coefficients:
Where MSR is mean square regression and MSE is mean square error
2. Influence diagnostics:
2. Prediction:
Machine Learning 46
2. Test on a separate dataset if available
Remember that while simple linear regression is a powerful tool, it's limited to modeling linear
relationships between two variables. For more complex relationships or multiple predictors,
consider polynomial regression or multiple linear regression.
Mathematical Formulation
For the simple linear regression model Y = β₀ + β₁X + ε, OLS minimizes:
S(β₀, β₁) = Σ(yᵢ - (β₀ + β₁xᵢ))²
where (xᵢ, yᵢ) are the observed data points.
The expected values of the estimators equal the true parameter values.
Among all linear unbiased estimators, OLS estimators have the smallest variance.
3. Consistency: As sample size increases, β̂₀ and β̂₁ converge in probability to β₀ and β₁.
Machine Learning 47
4. Efficiency: OLS estimators achieve the Cramér-Rao lower bound.
3. The sum of the observed y values equals the sum of the fitted y values: Σyᵢ = Σŷᵢ
4. The sum of the cross-products of the x values and the residuals is zero: Σ(xᵢ(yᵢ - ŷᵢ)) = 0
where:
Machine Learning 48
5. Residuals
Residuals are the differences between the observed values of the dependent variable and the
predicted values:
eᵢ = yᵢ - ŷᵢ
Properties of Residuals
1. The sum of residuals is zero: Σeᵢ = 0
Standardized Residuals
Standardized residuals are useful for identifying outliers:
rᵢ = eᵢ / (s * √(1 - hᵢᵢ))
where hᵢᵢ is the i-th diagonal element of the hat matrix H = X(X'X)⁻¹X'.
Residual Analysis
Residual analysis is crucial for checking model assumptions:
Measures of Influence
1. Leverage: hᵢᵢ (diagonal elements of hat matrix)
2. Cook's Distance: Measures the influence of each observation on the fitted values
Machine Learning 49
Multiple Linear Regression (MLR) is an extension of simple linear regression. It models the
linear relationship between a dependent variable and two or more independent variables.
Mathematical Representation
The general form of the multiple linear regression model is:
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₖXₖ + ε
Where:
In matrix notation:
Y = Xβ + ε
Where:
X is an n×(k+1) matrix of independent variables (including a column of 1's for the intercept)
1. Linearity: The relationship between the dependent variable and each independent variable
should be linear.
3. Homoscedasticity: The variance of residuals should be constant across all levels of the
independent variables.
Machine Learning 50
Visualization: Residual plot against fitted values
5. No Multicollinearity: The independent variables should not be highly correlated with each
other.
Caution: R² always increases when adding more variables, even if they're not meaningful.
Adjusted R-Square
Definition: A modified version of R² that adjusts for the number of predictors in the model.
Machine Learning 51
Interpretation: Smaller values indicate better fit. Used in calculating confidence intervals
and prediction intervals.
Interpretation: A small p-value (typically < 0.05) indicates that the model is statistically
significant.
Coefficient P-values
Definition: The p-value for each coefficient tests the null hypothesis that the coefficient is
equal to zero.
Interpretation: A small p-value (typically < 0.05) indicates that the variable is statistically
significant in the model.
Coefficients
Interpretation: Each coefficient represents the change in Y for a one-unit change in X,
holding all other variables constant.
Easy to interpret
Limitations:
Adjusted R-squared
Strengths:
Machine Learning 52
Useful for comparing models with different numbers of predictors
Limitations:
Limitations:
Affected by outliers
2. Cross-Validation
3. Residual Analysis
4. Prediction Error
5. Multicollinearity Diagnostics
Condition number
6. Influence Diagnostics
Cook's distance
DFBETAS
7. Partial F-tests
Machine Learning 53
Compare nested models to assess the contribution of a subset of predictors
Remember that assessing model fit is not just about maximizing R² or minimizing error. It's
about finding a balance between model complexity and predictive power, while ensuring that
the model meets the necessary assumptions and is interpretable in the context of your
research question or business problem.
In practice, it's often beneficial to use a combination of these methods to get a comprehensive
understanding of your model's performance and limitations.
Feature Selection
Feature selection is the process of selecting a subset of relevant features for use in model
construction.
Types of Feature Selection:
3. Embedded Methods: Perform feature selection as part of the model construction process
Dimensionality Reduction
Dimensionality reduction transforms the data from a high-dimensional space to a lower-
dimensional space, retaining most of the relevant information.
Common Techniques:
In Machine Learning and Statistic, Dimensionality Reduction the process of reducing the
number of random variables under consideration via obtaining a set of principal variables. It
can be divided into feature selection and feature extraction.
Machine Learning 54
are in danger of overwriting your model to your data or that you might be violating the
assumptions of whichever modeling tactic you’re using?
You might ask the question “how do I take all of the variables. I’ve collected and focused on
only a few of them? In technical terms, you want to “reduce the dimension of your feature
space. By reducing the dimension of your feature space, you have fewer relationships between
variables to consider and less likely to overheat your model.
· Feature Elimination
· Feature extraction
Feature Elimination: we reduce the feature space by elimination feature. The advantages of
the feature elimination method include simplicity and maintainability features. We’ve also
eliminated any benefits those dropped variables would bring.
Feature Extraction: PCA is a technique for feature extraction. So it combines our input
variables in a specific way, then we can drop the “least important” variables while still retaining
the most valuable parts of all the variables.
When should I use PCA?
1. Do you want to reduce the no. of variables, but are not able to identify variables to
completely remove from consideration?
Machine Learning 55
3. Are you comfortable making your independent variable less interpretable?
The above picture displays the two main directions in this data: the back direction and the blue
direction. In red direction is the most important one. We’ll get into why this is the case later, but
given how the case later, but given how the dots are arranged can you see why the red
direction looks more important than the green direction (What would be fitting a line of best fit
to the data look like?
Machine Learning 56
In the above pic, we will transform our original data into aligning with these important
directions. The fig. The show is the same extract data as above but transformed. So that the X-
& Y axis are now direction.
Advantages of PCA:
Reduces overfitting by reducing the number of features
Limitations of PCA:
Assumes linearity
May not be suitable when the variance in noise is larger than the variance in signal
[Link]
LDA is a type of Linear combination, a mathematical process using various data items and
applying a function to that site to separately analyze multiple classes of objects or items.
Following Fisher’s Linear discriminant, linear discriminant analysis can be useful in areas like
image recognition and predictive analysis in marketing.
The fundamental idea of linear combinations goes back as far as the 1960s with the Altman Z-
scores for bankruptcy and other predictive constructs. Now LDA helps in preventative data for
more than two classes, when Logistics Regression is not sufficient. The linear Discriminant
analysis takes the mean value for each class and considers variants to make predictions
assuming a Gaussian distribution.
Machine Learning 57
How the Linear Discriminant Analysis (LDA) work?
First general steps for performing a Linear Discriminant Analysis
1. Compute the d-dimensional mean vector for the different classes from the dataset.
2. Compute the Scatter matrix (in between class and within the class scatter matrix)
3. Sort the Eigen Vector by decrease Eigen Value and choose k eigenvector with the largest
eigenvalue to from a d x k dimensional matrix w (where every column represent an
eigenvector)
4. Used d * k eigenvector matrix to transform the sample onto the new subspace.
Advantages of LDA:
Works well for multi-class classification problems
Limitations of LDA:
Assumes normal distribution of data with equal covariance for each class
Machine Learning 58
Independent Component Analysis (ICA)
ICA is a computational method for separating a multivariate signal into additive
subcomponents, assuming the subcomponents are non-Gaussian signals and statistically
independent from each other.
The heart of ICA lies in the principle of statistical independence. ICA identify components
within mixed signals that are statistically independent of each other.
or
Assumptions in ICA
1. The first assumption asserts that the source signals (original signals) are statistically
independent of each other.
2. The second assumption is that each source signal exhibits non-Gaussian distributions.
, representing the observed data with m components. The hidden components are represented
by the random vector
Machine Learning 59
, where n is the number of hidden sources.
The goal is to transform the observed data x in a way that the resulting hidden components are
independent. The independence is measured by some function
. The task is to find the optimal transformation matrix W that maximizes the independence of
the hidden components.
ICA is a non-parametric approach, which means that it does not require assumptions
about the underlying probability distribution of the data.
ICA is an unsupervised learning technique, which means that it can be applied to data
without the need for labeled examples. This makes it useful in situations where labeled data
is not available.
ICA can be used for feature extraction, which means that it can identify important features
in the data that can be used for other tasks, such as classification.
ICA assumes that the sources are mixed linearly, which may not always be the case. If the
sources are mixed nonlinearly, ICA may not be effective.
ICA can be computationally expensive, especially for large datasets. This can make it
difficult to apply ICA to real-world problems.
ICA can suffer from convergence issues, which means that it may not always be able to
find a solution. This can be a problem for complex datasets with many sources.
Machine Learning 60
Cocktail Party Problem
Consider Cocktail Party Problem or Blind Source Separation problem to understand the
problem which is solved by independent component analysis.
Problem: To extract independent sources’ signals from a mixed signal composed of the signals
from those sources.
Source 1
Source 2
Source 3
Source 4
Source 5
Here, there is a party going into a room full of people. There is ‘n’ number of speakers in that
room, and they are speaking simultaneously at the party. In the same room, there are also ‘n’
microphones placed at different distances from the speakers, which are recording ‘n’ speakers’
voice signals. Hence, the number of speakers is equal to the number of microphones in the
room.
Now, using these microphones’ recordings, we want to separate all the ‘n’ speakers’ voice
signals in the room, given that each microphone recorded the voice signals coming from each
Machine Learning 61
speaker of different intensity due to the difference in distances between them.
Decomposing the mixed signal of each microphone’s recording into an independent source’s
speech signal can be done by using the machine learning technique, independent component
analysis.
where, X1, X2, …, Xn are the original signals present in the mixed signal and Y1, Y2, …, Yn are
the new features and are independent components that are independent of each other.
Limitations of ICA:
Assumes statistical independence of sources, which may not always hold
PCA: Unsupervised
LDA: Supervised
ICA: Unsupervised
2. Optimization Criterion:
3. Assumptions:
LDA: Assumes normal distribution with equal covariance for each class
4. Applications:
Machine Learning 62
When choosing between these methods, consider the nature of your data, the problem you're
trying to solve, and the assumptions of each technique.
Unit 3
Introduction to Classification and Classification Algorithms
What is Classification?
Classification is a supervised learning technique in machine learning that involves predicting
the category or class of a given data point based on its features. The goal is to map input
variables (features) to a discrete output variable (class label).
Definition: Classification is the process of identifying the category to which a new data
point belongs, based on a training dataset containing observations with known class labels.
Key Features:
3. Real-World Applications:
Machine Learning 63
Clearly define the problem.
2. Data Preprocessing:
4. Splitting Data:
Divide the data into training and test sets (and sometimes a validation set).
Select an algorithm based on the problem and dataset characteristics (e.g., Logistic
Regression, Decision Trees).
8. Hyperparameter Tuning:
Optimize model parameters using methods like Grid Search or Random Search.
Machine Learning 64
Common Classification Algorithms
1. Linear Classifiers:
Logistic Regression
2. Non-linear Classifiers:
3. Tree-based Methods:
Decision Trees
Random Forest
4. Neural Networks:
5. Bayesian Models:
Naive Bayes
Challenges in Classification
Imbalanced Data: Some classes may have significantly more samples than others.
Overfitting: The model performs well on training data but poorly on test data.
Introduction
The k-Nearest Neighbors (k-NN) algorithm is a simple, yet powerful, instance-based, and non-
parametric classification technique in machine learning. It is widely used due to its ease of
implementation and interpretability. The key idea is that a data point’s class label can be
determined based on the majority class of its k closest neighbors in the feature space.
Machine Learning 65
In k-NN, there is no explicit training phase. The algorithm simply stores the training
dataset.
2. Prediction Phase:
Given a new input (test point), the algorithm calculates the distance between this point
and all points in the training set.
The class label of the test point is assigned based on the majority class among these k
neighbors.
To determine which points are the closest, a distance metric must be chosen. Common
metrics include:
Where
x_i and y_i are the coordinates of the two points.
2. Choosing k:
Small values of k (e.g., k=1) make the model highly sensitive to noise.
Large values of k make the algorithm more robust but can lead to underfitting.
3. Class Assignment:
The class label of the new data point is assigned based on the majority vote of the k
nearest neighbors.
In case of a tie, various tie-breaking strategies can be used (e.g., choosing the class
with the smallest distance sum).
4. Weighted Voting:
Instead of simple majority voting, weighted voting can be used where closer neighbors
have a higher influence on the classification.
Algorithm Steps
1. Input: A dataset of labeled instances, the number of neighbors (k), and the test instance.
Machine Learning 66
2. For each test point:
Calculate the distance between the test point and all training data points using the
chosen distance metric.
Perform a majority vote or weighted vote to assign the test point’s class.
Advantages of k-NN
1. Simplicity: k-NN is very simple to understand and implement.
3. Versatility: k-NN can be used for both classification and regression problems.
4. Adaptability: The model can naturally adapt to changes in the data as no explicit training is
required.
Disadvantages of k-NN
1. Computational Cost:
As k-NN needs to compute distances for every test point to all training samples, it can
be computationally expensive, especially with large datasets.
2. Memory-Intensive:
Since the algorithm stores the entire training dataset, it requires significant memory.
If the dataset has irrelevant or noisy features, the performance of k-NN can degrade.
4. Curse of Dimensionality:
As the number of features increases, the distance between data points increases,
which can make it harder for the algorithm to identify meaningful neighbors. This issue
can be mitigated by dimensionality reduction techniques such as PCA.
5. Choice of k:
Machine Learning 67
Selecting the Optimal Value of k
Cross-validation is typically used to choose the best value for k.
A common approach is to try different values of k and select the one that minimizes the
classification error.
Odd values for k are preferred when dealing with binary classification problems to avoid
ties.
Practical Considerations
1. Feature Scaling:
2. Curse of Dimensionality:
k-NN can be sensitive to missing data. Simple approaches like imputation or removing
data points with missing values can be applied.
Example
Consider a dataset with two classes: A (label = 0) and B (label = 1). You are given a new data
point (test point) and need to classify it using k-NN.
Calculate the distance between the test point and every point in the training set.
Find the 3 closest points to the test point. If two of them belong to class A and one belongs
to class B , the test point is classified as A based on majority voting.
Applications of k-NN
1. Pattern Recognition: Classifying images based on pixel data.
3. Medical Diagnosis: Classifying patients based on medical data (e.g., disease presence or
absence).
Machine Learning 68
Conclusion
k-NN is a powerful and intuitive algorithm widely used in classification tasks. However, it
comes with challenges such as computational inefficiency and sensitivity to the curse of
dimensionality. By understanding these trade-offs, k-NN can be effectively applied in various
domains with careful pre-processing and parameter tuning.
Random Forests
Introduction
Random Forest is an ensemble learning technique used for both classification and regression
tasks. It combines multiple Decision Trees to improve performance, overcome overfitting, and
provide more robust predictions. The concept behind Random Forests is to create a "forest" of
decision trees where each tree votes for the predicted class, and the majority vote is taken as
the final output.
Key Idea: Reduce overfitting and increase accuracy by averaging multiple decision trees.
Machine Learning 69
How Random Forest Works
1. Bootstrap Aggregation (Bagging):
Random Forest uses a technique called bagging, where multiple subsets of the original
training data are created with replacement. This means that some data points may be
selected multiple times, while others may be left out.
When building each tree, Random Forest does not consider all features for splitting;
instead, it selects a random subset of features. This randomness helps to reduce
correlation among the trees and makes the model more robust.
Since each tree uses a random subset of features, it is slightly different from others.
Machine Learning 70
For Classification: Each tree in the Random Forest predicts a class label, and the final
prediction is based on the majority vote across all trees.
For Regression: Each tree predicts a numeric value, and the final output is the average
of all predictions.
4. Prediction:
By using multiple trees, Random Forest avoids the overfitting that can happen with
individual decision trees.
Bagging helps reduce variance, while random feature selection adds diversity to the
individual trees, preventing them from becoming too correlated.
3. Feature Importance:
The ensemble approach of Random Forest often provides higher accuracy than a single
decision tree.
2. Robustness:
Machine Learning 71
Random Forests are less prone to overfitting due to the randomness introduced by
bagging and random feature selection.
4. Feature Importance:
It can rank features based on their importance to the prediction, which is helpful in
understanding the data.
5. No Pruning Required:
Unlike decision trees, Random Forest does not require explicit pruning, as the ensemble
approach balances complexity and overfitting.
Random Forests are much more complex compared to a single decision tree. This
complexity makes them harder to interpret.
2. Computational Cost:
3. Black-Box Model:
The number of decision trees to be built in the forest. Increasing the number of trees
usually leads to higher accuracy but also increases training time.
The number of features to consider when splitting each node. Options include:
Limits how deep each tree can grow. A shallow tree will help prevent overfitting.
Machine Learning 72
Minimum Samples for Splitting ( min_samples_split ):
The minimum number of samples required to split a node. This helps control the growth
of the tree and avoid overfitting.
1. Training Phase:
2. Prediction Phase:
The Random Forest takes the majority vote as the final prediction.
Conclusion
Random Forest is a powerful and widely used machine learning algorithm for both classification
and regression tasks. It combines the strengths of multiple decision trees to reduce overfitting
and increase accuracy. Despite being more complex and computationally expensive compared
to individual decision trees, Random Forest's robustness, versatility, and feature importance
evaluation make it a popular choice for many practical applications.
Machine Learning 73
Fuzzy Set Approaches
Classical Set: In a classical set, an element can either fully belong (membership value = 1)
or not belong at all (membership value = 0).
Fuzzy Set: In a fuzzy set, each element has a degree of membership ranging between 0
and 1, representing the grade of membership.
Fuzzy sets were introduced by Lotfi A. Zadeh in 1965 to handle uncertainties and to model
problems that have vagueness or imprecision.
The membership function defines how each element in the domain is mapped to its
corresponding degree of belonging to a fuzzy set.
2. Fuzzy Membership:
The membership value can be interpreted as the degree of truth or the degree to
which an element belongs to a particular set.
For example, a fuzzy set of "tall people" may assign a membership of 0.7 to a person
who is 180 cm tall and 0.2 to a person who is 160 cm tall.
Machine Learning 74
Membership Functions
Membership functions can have different shapes, depending on the characteristics of the
fuzzy set:
Formula:
Machine Learning 75
Defined by four parameters a , b , c , and d that form a trapezoid.
It has two flat regions, indicating higher certainty over a range of values.
The union of two fuzzy sets A and B is given by the maximum of the membership
values.
Formula:
The intersection of two fuzzy sets A and B is given by the minimum of the
membership values.
Formula:
The complement of a fuzzy set A is given by subtracting the membership value from 1.
Formula:
μ_(¬A)(x) = 1 - μ_A(x)
Probability Theory: Deals with the likelihood of an event occurring. It is used to model
randomness and uncertainty.
Machine Learning 76
For example, the statement "it is cloudy" can be modeled using fuzzy sets by assigning a
membership value to describe the "degree of cloudiness." On the other hand, probability
theory could be used to predict the chance of rain given the cloudiness.
Fuzzy set theory is widely used in fuzzy logic controllers. It is particularly useful in
systems where human-like reasoning is required. For example, air conditioners,
washing machines, and other appliances use fuzzy logic to adjust settings like
temperature, washing time, etc.
2. Decision Making:
Fuzzy sets are used in multi-criteria decision-making where options are evaluated
based on multiple attributes with varying degrees of importance.
3. Pattern Recognition:
Fuzzy sets are applied to classification problems where the boundaries between
classes are not sharply defined (e.g., determining the membership of an image to
different object categories).
4. Medical Diagnosis:
1. Fuzzification:
2. Rule Evaluation:
Apply fuzzy rules using a series of if-then conditions. These rules help in describing
the relationship between input and output variables in a human-understandable way.
3. Defuzzification:
Convert the fuzzy output back into a crisp value. Popular defuzzification methods
include centroid and max-membership methods.
Machine Learning 77
Example of Fuzzy Set
Suppose we want to create a fuzzy set to represent the concept of "hot temperature."
The fuzzy set "Hot Temperature" could be represented with a membership function
that assigns a degree of membership to temperatures. For example:
2. Rule-Based System:
A fuzzy rule might be: "If the temperature is hot, turn on the fan at high speed."
The fuzzy inference system would calculate the degree to which the current
temperature is "hot" and adjust the fan accordingly.
Conclusion
Fuzzy Set Theory provides a powerful framework for modeling uncertainty, vagueness, and
imprecision, making it ideal for applications that involve subjective, ambiguous, or linguistic
information. By allowing partial membership, fuzzy sets represent real-world concepts more
effectively compared to traditional binary sets.
Introduction
Support Vector Machine (SVM) is a powerful supervised learning algorithm used for
classification and regression tasks. It is especially effective in high-dimensional spaces and
for problems where the data is not linearly separable. The main idea of SVM is to find a
decision boundary (or hyperplane) that maximizes the margin between two classes of data
points.
Goal: Maximize the margin between the decision boundary and the closest data points
from either class.
Machine Learning 78
A hyperplane is a decision boundary that separates the data into different classes. For
a 2D space, it is simply a line, while in a 3D space, it is a plane. For higher dimensions,
it is called a hyperplane.
The goal of SVM is to find the optimal hyperplane that maximizes the separation
between classes.
2. Margin:
The margin is the distance between the hyperplane and the closest data points from
either class. SVM aims to maximize this margin.
The data points that are closest to the hyperplane are called support vectors, and they
are critical for defining the position of the hyperplane.
3. Support Vectors:
Support vectors are the data points that lie closest to the decision boundary. These
points influence the position and orientation of the hyperplane.
The hyperplane is uniquely defined by these support vectors, hence the name Support
Vector Machine.
Machine Learning 79
In the simplest case, SVM finds a straight line (or hyperplane) that can completely
separate data points of two different classes.
If the data is linearly separable, SVM constructs the optimal hyperplane that maximizes
the margin.
Often, real-world data is not linearly separable. To handle this, SVM uses a kernel trick
to transform the data into a higher-dimensional space, where it becomes linearly
separable.
Kernel functions help create the separation by projecting the data to a new dimension.
Kernel Trick
The kernel trick is used to transform non-linearly separable data into a higher dimension
where a hyperplane can separate the classes. This allows SVM to create complex decision
boundaries without explicitly computing the transformation. Common kernel functions include:
1. Linear Kernel:
Suitable when data is linearly separable. In this case, the SVM simply finds a straight
line to separate the data.
2. Polynomial Kernel:
Suitable when the relationship between the features is more complex and requires
curved decision boundaries.
4. Sigmoid Kernel:
This kernel behaves like a neural network activation function and can be used in
specific applications.
Mathematical Representation
The goal of SVM is to find the hyperplane that maximizes the margin between the classes.
Suppose you have data points (x_1, y_1), (x_2, y_2), ..., (x_n, y_n) where y_i is the class
label ( +1 or 1 ).
Machine Learning 80
The hyperplane can be represented as:
w * x + b = 0
The objective of SVM is to find w and b that maximize the margin, subject to the constraint
that:
This ensures that each data point is correctly classified and lies on the correct side of the
margin.
C is a hyperparameter that controls the trade-off between maximizing the margin and
High C: The model attempts to classify all training data points correctly, which may lead
to overfitting.
Low C: The model allows some misclassification to achieve a larger margin, which
helps in generalization.
Advantages of SVM
1. Effective in High Dimensions:
SVM is effective when the number of features is large compared to the number of
samples.
With the kernel trick, SVM can efficiently handle complex and non-linear data.
3. Memory Efficiency:
SVM only uses the support vectors for classification, making it memory-efficient as it
does not need to store the entire dataset.
4. Versatility:
Machine Learning 81
Can be used for both classification and regression (referred to as Support Vector
Regression or SVR).
Disadvantages of SVM
1. Computational Complexity:
Training SVMs can be computationally intensive, especially for large datasets with
many features. Complexity grows quadratically with the number of samples.
SVM can be inefficient when the number of samples is very large due to high training
times.
4. Interpretability:
The results of SVM are often harder to interpret compared to other algorithms like
Decision Trees.
Applications of SVM
1. Text Classification:
SVM is used for spam detection and sentiment analysis due to its ability to handle high-
dimensional data (like word counts in text).
2. Face Detection:
3. Bioinformatics:
SVMs are used for classifying genes, identifying proteins, and analyzing medical data.
4. Handwriting Recognition:
SVM is used to classify handwritten digits, often applied in digit recognition systems
like postal code reading.
Example of SVM
Consider a dataset with two classes: +1 and -1 , with features such as height and weight of
individuals.
Machine Learning 82
If the data is linearly separable, SVM will find the line (hyperplane) that divides the two
classes with the maximum margin.
If the data is not linearly separable, an RBF kernel can be used to transform the data into a
higher-dimensional space, allowing SVM to find a non-linear decision boundary.
1. Linear Kernel
Definition: The linear kernel is the simplest kernel type used in SVM. It is primarily used
when the data is linearly separable, meaning it can be separated by a straight line (or
hyperplane).
K(x, y) = x · y
Use Cases:
It is effective when the number of features is significantly greater than the number of
training samples.
Commonly used for text classification problems like sentiment analysis or document
classification, where the features are word counts or term frequency vectors.
Advantages:
Limitation:
Machine Learning 83
2. Polynomial Kernel
Definition: The polynomial kernel is used to handle non-linear relationships in the data by
allowing more complex decision boundaries. It introduces polynomial terms, effectively
enabling SVM to create curved separation boundaries.
K(x, y) = (x · y + c)^d
dis the degree of the polynomial, which determines the complexity of the decision
surface.
Use Cases:
Suitable when the data has complex non-linear relationships between features.
Works well for applications where the relationships between input features and output
are better captured by polynomials, such as image recognition tasks.
Advantages:
Limitations:
Machine Learning 84
||x - y||^2 represents the squared Euclidean distance between x and y .
High Gamma: The decision boundary is highly influenced by individual data points,
which can lead to overfitting.
Low Gamma: The model captures broader trends, which might lead to underfitting.
Use Cases:
Advantages:
Flexibility: The RBF kernel is very flexible and can model a wide range of decision
surfaces by controlling the γ parameter.
Effective for Complex Boundaries: It is effective when the decision boundary is curved
or when the data is not linearly separable.
Limitations:
Parameter Tuning: The value of γ needs careful tuning, typically done using cross-
validation.
Linear Kernel: Choose this if the data is linearly separable or if interpretability and
speed are priorities.
Polynomial Kernel: Use when you suspect non-linear relationships but want to model
these relationships with polynomial decision boundaries.
Gaussian/RBF Kernel: Choose this if the data is complex and not linearly separable,
and you want the SVM to create a sophisticated decision boundary.
Summary of Kernels
1. Linear Kernel:
Equation: K(x, y) = x · y
Machine Learning 85
2. Polynomial Kernel:
Used for text categorization (e.g., spam detection) where the input features are often
sparse and high-dimensional.
Polynomial Kernel:
Gaussian Kernel:
Conclusion
The different SVM kernels make it possible to adapt SVM for a wide range of problems, from
simple linear separations to very complex non-linear relationships. By transforming data into
different feature spaces using these kernels, SVM is able to effectively handle many practical
machine learning tasks. The selection of an appropriate kernel, along with its corresponding
parameters, plays a crucial role in the performance and success of the SVM model.
Machine Learning 86
A hyperplane is a decision surface that separates the data into different classes in a Support
Vector Machine (SVM). The hyperplane is essentially the boundary that best divides the
dataset into classes.
Definition:
A hyperplane is a linear decision boundary that divides the input space into two halves,
representing the different classes.
Equation of a Hyperplane:
w · x + b = 0
Where:
b is the bias term that helps adjust the position of the hyperplane.
Goal of SVM:
The main goal of SVM is to find the optimal hyperplane that maximizes the margin
between the two classes. The margin is defined as the distance between the
hyperplane and the closest data points from each class, which are called support
vectors.
The larger the margin, the better the generalization ability of the classifier.
Support Vectors:
Support vectors are the data points that lie closest to the hyperplane. These points are
critical as they determine the exact position and orientation of the hyperplane.
Only support vectors influence the decision boundary, making SVM computationally
efficient in terms of memory, as it does not need to store all data points.
For non-linearly separable data, a hyperplane might not be able to divide the classes
in the original feature space. In such cases, the kernel trick is used to transform the
data into a higher-dimensional space, where a hyperplane can effectively separate the
classes.
Machine Learning 87
2. Properties of SVM
Support Vector Machines have several properties that make them effective for both
classification and regression tasks:
1. Margin Maximization:
SVM aims to maximize the margin between the hyperplane and the support vectors.
Maximizing the margin enhances the generalization of the classifier, making it robust to
noise and overfitting.
The decision boundary is determined entirely by the support vectors. This allows SVM
to be memory-efficient because it only needs to keep track of these critical data points
rather than all training samples.
4. Kernel Trick:
The kernel trick allows SVM to create non-linear decision boundaries by mapping input
features into a higher-dimensional space without explicitly computing the
transformation. This makes SVM versatile in handling non-linearly separable data.
5. Regularization:
SVM uses a regularization parameter (C) to control the trade-off between achieving a
larger margin and minimizing classification error. A larger value of C places more
emphasis on classifying all training points correctly, potentially leading to overfitting. A
smaller value allows for more misclassification but ensures a wider margin.
6. Robustness to Outliers:
Although SVM can handle outliers with the soft margin approach, it is still sensitive to
noisy data. However, the impact of noisy points is minimized if they are not selected as
support vectors.
3. Issues in SVM
Despite its advantages, SVM has some issues and challenges that users should be aware of:
1. Computational Complexity:
Machine Learning 88
For very large datasets, SVM may become infeasible in terms of memory and
computation.
Choosing the right kernel function and tuning the associated hyperparameters (such
as C and γ for RBF kernel) is crucial for good performance.
3. Interpretability:
Unlike simple linear models, SVM models are less interpretable, especially when using
non-linear kernels. The decision boundary formed by the hyperplane in a higher-
dimensional space is difficult to visualize and explain.
When feature interpretability is needed, SVM may not be the best choice compared to
models like Decision Trees.
SVM can be sensitive to outliers, especially when using a small value for the
regularization parameter C . The presence of noisy data points can significantly affect
the decision boundary if they lie close to the hyperplane.
5. Memory Usage:
In scenarios where a large number of support vectors are required, SVM can consume
a significant amount of memory. This can be problematic for very large datasets where
the number of support vectors is a large percentage of the total training set.
6. Class Imbalance:
SVM may perform poorly with imbalanced datasets where one class has significantly
more samples than the other. This is because the decision boundary is influenced by
the number of samples, and a small class may not have enough representation to shape
an appropriate boundary.
SVM does not inherently provide probabilistic outputs. Extensions such as Platt
Scaling are used to estimate probabilities, but these tend to add complexity and are not
as reliable as models specifically designed for probabilistic interpretation.
Summary
Machine Learning 89
Hyperplane: The decision surface that separates data into different classes. The goal of
SVM is to find the optimal hyperplane that maximizes the margin.
Properties of SVM: SVM is effective in high dimensions, uses support vectors to define the
decision boundary, applies kernel tricks for non-linear data, and maximizes the margin to
achieve good generalization.
Issues in SVM:
Kernel Selection: Choosing the right kernel and tuning hyperparameters can be
challenging.
Sensitivity to Noise and Outliers: SVM may be affected by outliers or noisy data points.
Memory Usage: SVM can be memory-intensive, especially if many support vectors are
needed.
Class Imbalance: SVM might struggle with imbalanced datasets, requiring additional
techniques to adjust.
SVM is a powerful algorithm that works well on a wide range of classification and regression
tasks, but it requires careful tuning and can be computationally expensive. Understanding
these properties and issues helps in deciding when SVM is the right choice for a given
problem.
Introduction
A Decision Tree is a popular supervised learning algorithm used for classification and
regression tasks. It is a tree-like structure where each internal node represents a decision
based on a feature, each branch represents the outcome of the decision, and each leaf node
represents a final output or class label. Decision trees are easy to understand and interpret,
making them very useful for a wide range of applications.
Goal: Create a model that predicts the value of a target variable based on input features.
Machine Learning 90
1. Root Node: The topmost node in a decision tree, representing the entire dataset. The root
node is split based on the feature that provides the highest information gain.
2. Internal Nodes: Nodes that represent decisions on features. Each internal node splits into
two or more branches based on feature values.
3. Leaf Nodes (Terminal Nodes): Nodes that represent the final output value or class label.
Each leaf node contains a prediction that applies to the data reaching that point.
4. Branches: These are the connections between nodes, representing the outcome of
decisions.
The algorithm starts by selecting the best feature that splits the data into subsets. The
quality of a split is evaluated using metrics such as:
Gini Impurity
The data is split into subsets based on the selected feature. This process continues
recursively for each subset.
3. Stopping Criteria:
Machine Learning 91
p_i is the proportion of instances belonging to class i in dataset S .
Information Gain (IG) measures the reduction in entropy by splitting the dataset on a
particular feature.
A feature with the highest Information Gain is selected to split the data.
Gini Impurity is another metric to evaluate the quality of a split. It measures how often a
randomly chosen element would be incorrectly labeled if it was randomly classified
based on the distribution of class labels.
Gini(S) = 1 - ∑ (p_i^2)
A feature that provides the lowest Gini Impurity is selected for the split.
When using decision trees for regression, Variance Reduction is used to measure the
effectiveness of a split.
The feature that results in the greatest reduction in variance is selected for splitting the
data.
Calculate the splitting criteria (e.g., Information Gain, Gini Impurity) for all features.
Choose the feature with the highest information gain (or lowest Gini Impurity).
4. Repeat:
Machine Learning 92
Apply the same process to each subset (node) until one of the stopping criteria is met
(e.g., pure nodes, max depth reached).
5. Output: A decision tree where each leaf node provides the final classification or prediction.
Controls the maximum number of levels in the tree. Limiting tree depth helps prevent
overfitting by reducing model complexity.
The minimum number of samples required to split an internal node. Higher values
prevent overfitting by reducing splits.
The minimum number of samples that must be present in a leaf node. This helps ensure
that leaves do not end up with only a few samples.
Controls the number of features to consider when looking for the best split. Helps in
controlling overfitting.
5. Criterion ( criterion ):
The function to measure the quality of a split. Common criteria are gini for the Gini
Impurity and entropy for Information Gain.
Decision trees are intuitive and easily visualized, making it easier for non-experts to
understand the model.
4. Feature Importance:
Machine Learning 93
Disadvantages of Decision Trees
1. Overfitting:
Decision trees are prone to overfitting, especially when the tree depth is not limited,
and the model becomes too complex. Pruning methods and limiting tree depth help
mitigate this issue.
2. Instability:
Decision trees are sensitive to small changes in the data. Small variations in the data
can lead to a completely different tree structure.
Decision trees can be biased towards classes with more samples, especially when the
data is imbalanced.
4. Greedy Approach:
Decision trees use a greedy algorithm for splitting, which may not always lead to the
global optimal solution.
Pruning reduces the size of the tree by removing branches that have little importance.
It helps in simplifying the model and improving generalization.
Pre-pruning: Limit tree growth by setting constraints (e.g., max depth, minimum
samples per leaf).
Post-pruning: The tree is first grown fully, then pruned back based on error rates on
validation data.
By setting min_samples_split and min_samples_leaf , you control the size of nodes, reducing
the chances of forming too specific (overfitted) splits.
Decision trees are used in medical diagnostics to classify patients based on symptoms
and determine possible diseases.
2. Credit Scoring:
Machine Learning 94
Banks use decision trees to assess whether a loan applicant is likely to default, based
on various financial parameters.
3. Customer Churn:
4. Fraud Detection:
Conclusion
Decision Trees are intuitive, easy to interpret, and capable of handling both numerical and
categorical data, making them useful for many practical machine learning tasks. However, they
tend to overfit on training data if not properly regularized, which is why pruning techniques and
limiting hyperparameters are crucial to improve their generalization capability. Despite their
disadvantages, decision trees are often used as the base learners for more advanced
ensemble models like Random Forest and Gradient Boosting Machines.
[Link]
Introduction Decision Trees are a type of Supervised Machine Learning (that is you explain
what the input is and what the corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be explained by two entities,
namely decision nodes and leaves. The leaves are the decisions or the final outcomes. And the
decision nodes are where the data is split.
An example of a decision tree can be explained using above binary tree. Let’s say you want to
predict whether a person is fit given their information like age, eating habit, and physical
activity, etc. The decision nodes here are questions like ‘What’s the age?’, ‘Does he exercise?’,
‘Does he eat a lot of pizzas’? And the leaves, which are outcomes like either ‘fit’, or ‘unfit’. In
Machine Learning 95
this case this was a binary classification problem (a yes no type problem). There are two main
types of Decision Trees:
What we’ve seen above is an example of classification tree, where the outcome was a variable
like ‘fit’ or ‘unfit’. Here the decision variable is Categorical.
Here the decision or the outcome variable is Continuous, e.g. a number like 123. Working Now
that we know what a Decision Tree is, we’ll see how it works internally. There are many
algorithms out there which construct Decision Trees, but one of the best is called as ID3
Algorithm. ID3 Stands for Iterative Dichotomiser 3. Before discussing the ID3 algorithm, we’ll
go through few definitions.
Entropy:
Entropy, also called as Shannon Entropy is denoted by H(S) for a finite set S, is the measure of
the amount of uncertainty or randomness in data. Intuitively, it tells us about the predictability
of a certain event. Example, consider a coin toss whose probability of heads is 0.5 and
probability of tails is 0.5. Here the entropy is the highest possible, since there’s no way of
determining what the outcome might be. Alternatively, consider a coin which has heads on
both the sides, the entropy of such an event can be predicted perfectly since we know
beforehand that it’ll always be heads. In other words, this event has no randomness hence it’s
entropy is zero. In particular, lower values imply less uncertainty while higher values imply high
uncertainty.
Information Gain:
nformation gain is also called as Kullback-Leibler divergence denoted by IG(S,A) for a set S is
the effective change in entropy after deciding on a particular attribute A. It measures the
relative change in entropy with respect to the independent variables. Alternatively, where
IG(S, A) is the information gain by applying feature A. H(S) is the Entropy of the entire set, while
the second term calculates the Entropy after applying the feature A, where P(x) is the
probability of event x.
Machine Learning 96
Let’s understand this with the help of an example. Consider a piece of data collected over the
course of 14 days where the features are Outlook, Temperature, Humidity, Wind and the
outcome variable is whether Golf was played on the day. Now, our job is to build a predictive
model which takes in above 4 parameters and predicts whether Golf will be played on the day.
We’ll build a decision tree to do that using ID3 algorithm.
5. For each attribute, calculate the entropy with respect to the attribute ‘x’ denoted by H(S, x)
7. Remove the attribute that offers highest IG from the set of attributes
8. Repeat until we run out of all attributes, or the decision tree has all leaf nodes.
Now, let's go ahead and grow the decision tree. The initial step is to calculate H(S), the Entropy
of the current state. In the above example, we can see in total there are 5 No’s and 9 Yes’s.
Yes No Total
Machine Learning 97
9 5 14
Remember that the Entropy is 0 if all members belong to the same class, and 1 when half of
them belong to one class and other half belong to other class that is perfect randomness. Here
it’s 0.94 which means the distribution is fairly random. Now, the next step is to choose the
attribute that gives us highest possible Information Gain which we’ll choose as the root node.
Let’s start with ‘Wind’
where ‘x’ are the possible values for an attribute. Here, attribute ‘Wind’ takes two possible
values in the sample data, hence x = {Weak, Strong} We’ll have to calculate: Amongst all the 14
examples we have 8 places where the wind is weak and 6 where the wind is Strong.
8 6 14
Now, out of the 8 Weak examples, 6 of them were ‘Yes’ for Play Golf and 2 of them were ‘No’
for ‘Play Golf’. So, we have, Similarly, out of 6 Strong examples, we have 3 examples where
Machine Learning 98
the outcome was ‘Yes’ for Play Golf and 3 where we had ‘No’ for Play Golf. Remember, here
half items belong to one class while other half belong to other. Hence we have perfect
randomness. Now we have all the pieces required to calculate the Information Gain, Which
tells us the Information Gain by considering ‘Wind’ as the feature and give us information gain
of 0.048. Now we must similarly calculate the Information Gain for all the features. We can
clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we chose
Outlook attribute as the root node. At this point, the decision tree looks like.
Here we observe that whenever the outlook is Overcast, Play Golf is always ‘Yes’, it’s no
coincidence by any chance, the simple tree resulted because of the highest information gain
is given by the attribute Outlook. Now how do we proceed from this point? We can simply
apply recursion, you might want to look at the algorithm steps described earlier. Now that
Machine Learning 99
we’ve used Outlook, we’ve got three of them remaining Humidity, Temperature, and Wind. And,
we had three possible values of Outlook: Sunny, Overcast, Rain. Where the Overcast node
already ended up having leaf node ‘Yes’, so we’re left with two subtrees to compute: Sunny and
Rain. Table where the value of Outlook is Sunny looks like:
In the similar fashion, we compute the following values As we can see the highest
Information Gain is given by Humidity. Proceeding in the same way with will give us Wind as
the one with highest information gain. The final Decision Tree looks something like this.
Criterion: The ID3 algorithm uses Information Gain based on Entropy as the splitting
criterion to determine the best feature at each node of the tree.
The feature with the highest Information Gain is selected to split the data.
4. Split Data:
Create branches for each possible value of the feature, and assign subsets of the
training data to these branches.
5. Repeat Recursively:
Continue splitting each subset based on the next feature with the highest information
gain until all data points are classified or other stopping criteria are met (e.g., all
samples belong to the same class).
When all data points are classified, the nodes are turned into leaf nodes, which
represent the final decision.
Features of ID3
Attribute Selection: ID3 uses Information Gain to determine the most informative attribute
at each level.
Works Well for Categorical Data: ID3 is suited for classification problems, particularly with
categorical data.
Only for Categorical Attributes: The ID3 algorithm works primarily with categorical
features and requires discretization for numerical data.
Greedy Approach: The selection of attributes is done using a greedy approach, which
does not guarantee the global optimum solution.
2. Inductive Bias
Inductive Bias refers to the set of assumptions a machine learning algorithm makes to
generalize from the training data to unseen data.
Shorter Trees Are Preferred: Decision trees aim to create the shortest possible tree
that fits the data.
Preference for Features with High Information Gain: The ID3 algorithm selects
features based on their information gain, assuming that features with higher
information gain lead to better classification results.
Inductive bias helps decision trees generalize well to unseen data by preventing them from
creating unnecessarily complex models that overfit the training data.
Mathematical Formula:
Where:
S is the dataset.
Interpretation:
Low Entropy: When entropy is low (close to 0), the dataset is pure, meaning that most of
the instances belong to the same class.
High Entropy: When entropy is high (close to 1), the dataset is more mixed, with an almost
equal distribution of classes.
4. Information Gain
Information Gain is a measure used to evaluate the effectiveness of an attribute in classifying
the dataset. It is calculated as the reduction in entropy after splitting the dataset based on a
particular feature.
The feature with the highest information gain is selected as the split attribute for the
current node.
Significance:
Higher Information Gain indicates that the feature is more informative and results in a
better split of the dataset.
The goal is to maximize information gain at every node, which helps in making the decision
tree more efficient in classification.
Decision trees tend to overfit the training data, especially when they grow too deep
and create too many branches. This results in poor generalization to unseen data.
Solution: Use pruning techniques, limit tree depth, or set constraints on the number of
samples per leaf node.
When selecting attributes, decision trees may favor features with many distinct values
(e.g., ID number). This could lead to splits that do not actually provide meaningful
information.
Solution: Use different metrics like Gain Ratio that penalize features with a large
number of values.
3. Imbalanced Datasets:
Decision trees can be biased towards the majority class in imbalanced datasets.
4. High Variance:
Decision trees are susceptible to high variance; small changes in data can lead to a
significantly different tree.
5. Greedy Nature:
Decision trees use a greedy algorithm for attribute selection, which may not lead to the
optimal solution.
Decision trees can handle numeric features, but they require special handling to
determine optimal split points.
7. Data Fragmentation:
As the tree grows deeper, the dataset gets split into smaller fragments, leading to
insufficient data at some nodes. This is known as the fragmentation problem and can
result in unreliable splits.
Solution: Limit the depth of the tree or prune nodes with insufficient data.
Inductive Bias: Represents the set of assumptions made by the decision tree to generalize
from the training data to unseen data. This includes a preference for smaller trees and
attributes with high information gain.
Entropy and Information Theory: Entropy measures the uncertainty or impurity in the data,
while Information Gain measures the reduction in entropy after splitting the dataset based
on a particular attribute.
Information Gain: Used as the criterion in ID3 to determine the best feature for splitting,
with higher information gain representing a better split.
Issues in Decision Tree Learning: Challenges include overfitting, high variance, bias
towards features with many values, difficulty in handling imbalanced data, and greedy
nature of the algorithm.
Understanding these concepts provides the foundation for effectively utilizing decision trees,
addressing their limitations, and applying more advanced algorithms such as Random Forest
and Gradient Boosting Machines for improved performance.
Bayesian Learning
Bayesian learning is a probabilistic framework for machine learning that leverages Bayes'
theorem to update the probability of a hypothesis based on observed evidence or data.
Bayesian learning methods are valuable for dealing with uncertainty and making predictions
that incorporate prior knowledge.
1. Bayes' Theorem
Bayes' theorem forms the backbone of Bayesian learning by providing a principled way to
revise probabilities in light of new data. It connects the prior probability of a hypothesis, the
likelihood of observing data given that hypothesis, and the posterior probability—which is the
updated belief after seeing the data.
P (E∣H) ⋅ P (H)
P (H∣E) =
P (E)
Where:
P(H) : Prior Probability of hypothesis H before considering the evidence. It is our initial
belief about H .
P(E|H): Likelihood of the evidence E given hypothesis H . It measures how probable the
observed data is under the hypothesis.
P(E|H): The probability of getting a positive result given that the person has the disease
(i.e., the sensitivity of the test).
P(E): The probability of a positive result occurring overall (which depends on the
prevalence of the disease and the accuracy of the test).
P(H|E): The updated probability that the person has the disease after observing the
positive test result.
By combining these values using Bayes’ theorem, we can get a more accurate estimate of the
likelihood that the person actually has the disease.
Limitations
Computational Complexity: Evaluating every hypothesis in the hypothesis space can be
computationally challenging, especially for large hypothesis spaces.
Prior Probability (\( P(H) \)): Prior belief about how likely each model is to be correct.
Likelihood (\( P(D|H) \)): Probability of observing certain features in the email (e.g., certain
keywords) given the hypothesis.
Posterior Probability: Updated belief about the hypothesis after seeing the email.
Bayes Optimal Classifier: Instead of selecting a single model, the classifier considers all
possible models and weighs their predictions by their posterior probabilities.
Summary
Bayes' Theorem: Provides a method for updating the probability of a hypothesis based on
observed evidence.
Bayesian learning is a powerful framework that naturally integrates uncertainty and prior
knowledge into the learning process, making it useful for a wide range of applications such as
medical diagnosis, spam detection, and risk assessment.
[Link]
Essentially, the Bayes’ theorem describes the probability of an event based on prior
knowledge of the conditions that might be relevant to the event.
The theorem is named after English statistician, Thomas Bayes, who discovered the
formula in 1763. It is considered the foundation of the special statistical inference approach
called the Bayes’ inference.
To use the algorithm: 1-We must convert the presented data set into frequency tables. 2-
Then create a probability table by finding the probabilities of certain features. 3- Then use
Bayes’ theorem in order to calculate the posterior probability.
For example, let’s solve the following problem: If the weather is sunny, then the Player
should play or not?
Likelihood/probability table
𝑃(𝑌𝑒𝑠│𝑆𝑢𝑛𝑛𝑦) > 𝑃(𝑁𝑜│𝑆𝑢𝑛𝑛𝑦) ⇒ So on a sunny day, the player can play the game.
When the assumption of feature independence is correct, the algorithm can perform better
than other models, it also requires much less training data.
It is suitable for classification with discrete features that are categorically distributed.
This algorithm has a “zero-frequency problem” that assigns null probabilities to the
categorical variables whose classes in the test dataset are not available in the training
dataset, a smoothing method can be used in order to overcome this problem.
Its estimates can be wrong in some cases, so you shouldn’t take its potential outcomes
very seriously.
Complement: Used for imbalanced data, as it measures the probability of each sample
belonging to all other classes not its class.
Bernoulli: Used when features fellow Bernoulli distribution, it is suitable for discrete data,
where it is designed for binary/boolean features.
Multimodal: Unlike Bernoulli it works with occurrence counts, not only binary features.
Nodes: Each node represents a random variable (e.g., symptoms, test results, or
conditions).
Edges: Directed edges represent dependencies between variables. An edge from node
A to node B means that A has a direct influence on B.
For a variable X with parents P1, P2, ..., Pn , the CPT defines:
P (X∣P 1, P 2, ..., P n)
Global Representation: The joint probability distribution of all nodes can be determined
using local CPTs.
n
P (X1, X2, ..., Xn) = ∏ P (Xi ∣Parents(Xi ))
i=1
1. Exact Inference:
Methods like Variable Elimination or Belief Propagation can be used to derive exact
probabilities in small networks.
2. Approximate Inference:
BBNs are used in medical diagnosis to model the relationships between symptoms,
test results, and diseases. They allow for reasoning about the likelihood of different
diseases given observed symptoms.
2. Risk Assessment:
Nodes:
D : Disease (Yes/No)
S : Symptom (Present/Absent)
Edges:
The network structure and conditional probability tables can help calculate the probability of
having the disease given that the symptom is present and the test is positive.
In the E-step, the algorithm calculates the expected value of the log-likelihood
function, considering the current estimate of the parameters.
In other words, it calculates the probability distribution over the possible values of the
hidden variables, using the current parameters to fill in the missing values.
In the M-step, the algorithm maximizes the expected log-likelihood computed in the E-
step with respect to the model parameters.
The M-step updates the parameters to values that maximize the likelihood function
based on the distribution calculated in the E-step.
These two steps are repeated until convergence, meaning that the parameters no longer
change significantly.
The EM algorithm iteratively updates the parameters by alternating between the two steps:
Where \( Z \) represents the latent variables and \( \theta^{(t)} \) represents the current
parameter estimates.
Applications of EM Algorithm
1. Gaussian Mixture Models (GMMs):
The EM algorithm is used to train Hidden Markov Models by estimating the transition
and emission probabilities in the presence of hidden states.
3. Image Reconstruction:
1. E-Step:
Calculate the responsibilities for each data point, which represents the probability that
each point belongs to each Gaussian component, using the current parameter
estimates (mean, variance, and mixing coefficients).
2. M-Step:
Update the parameters of each Gaussian component (mean, variance, and mixing
coefficients) by maximizing the expected complete log-likelihood using the
responsibilities from the E-step.
This process is repeated until the parameters converge, leading to the best-fit Gaussian
components for the data.
Summary
Bayesian Belief Networks are graphical models that represent probabilistic relationships
between variables using a Directed Acyclic Graph and Conditional Probability Tables.
They are highly useful for reasoning under uncertainty in domains like medical diagnosis
and risk assessment.
Both Bayesian Belief Networks and the EM Algorithm are important tools in machine learning
and statistical modeling, enabling robust reasoning and parameter estimation even when
dealing with complex dependencies and incomplete information.
[Link]
Ensemble Learning
Imagine you are watching a football match. The sports analysts provide you with detailed
statistics and expert opinions. At the same time, you also take into account the opinions of
fellow enthusiasts who may have witnessed previous matches. This approach helps overcome
the limitations of relying solely on one model and increases overall accuracy. Similarly, in
ensemble learning, combining multiple models or algorithms can improve prediction accuracy.
In both cases, the power of collective knowledge and multiple viewpoints is harnessed to make
more informed and reliable predictions, overcoming the limitations of relying solely on one
model. Let us take a deeper dive into what Ensemble Learning actually is.
Ensemble learning is a machine learning technique that improves the performance of machine
learning models by combining predictions from multiple models. By leveraging the strengths of
diverse algorithms, ensemble methods aim to reduce both bias and variance, resulting in more
reliable predictions. It also increases the model’s robustness to errors and uncertainties,
especially in critical applications like healthcare or finance.
Ensemble learning techniques like bagging, boosting, and stacking enhance performance and
reliability, making them valuable for teams that want to build reliable ML systems.
Brief overview
These techniques aim to reduce bias and variance in individual models, and improve prediction
accuracy by learning previous errors, ultimately leading to a consensus prediction that is often
more reliable than any single model.
Random forest
The Random Forest algorithm is a prime example of bagging. It creates an ensemble of
decision trees trained on samples of datasets. Ensemble learning effectively handles complex
features and captures nuanced patterns, resulting in more reliable predictions. However, it is
also true that the interpretability of ensemble models may be compromised due to the
combination of multiple decision trees. Ensemble models can provide more accurate
predictions than individual decision trees, but understanding the reasoning behind each
prediction becomes challenging. Bagging helps reduce overfitting by generating multiple
subsets of the training data and training individual decision trees on each subset. It also helps
reduce the impact of outliers or noisy data points by averaging the predictions of multiple
decision trees.
Gradient boosting
Gradient Boosting (GB) trains each model to minimize the errors of previous models by training
each new model on the remaining errors. This iterative process effectively handles numerical
and categorical data and can outperform other machine learning algorithms, making it
versatile for various applications.
For example, you can apply Gradient Boosting in healthcare to predict disease likelihood
accurately. Iteratively combining weak learners to build a strong learner can improve prediction
accuracy, which could be valuable in providing insights for early intervention and personalized
treatment plans based on demographic and medical factors such as age, gender, family
history, and biomarkers.
One potential challenge of gradient boosting in healthcare is its lack of interpretability. While it
excels at accurately predicting disease likelihood, the complex nature of the algorithm makes it
difficult to understand and interpret the underlying factors driving those predictions.
This can pose challenges for healthcare professionals who must explain the reasoning behind
a particular prediction or treatment recommendation to patients. However, efforts are being
made to develop techniques that enhance the interpretability of GB models in healthcare,
ensuring transparency and trust in their use for decision-making.
Boosting is an ensemble method that seeks to change the training data to focus attention on
examples that previous fit models on the training dataset have gotten wrong.
Stacking: Meta-learning
Stacking, or stacked generalization, is a model-ensembling technique that improves predictive
performance by combining predictions from multiple models. It involves training a meta-model
that uses the output of base-level models to make a final prediction. The meta-model, a linear
regression, a neural network, or any other algorithm makes the final prediction.
This technique leverages the collective knowledge of different models to generate more
accurate and robust predictions. The meta-model can be trained using ensemble algorithms
like linear regression, neural networks, or support vector machines. The final prediction is
based on the meta-model's output. Overfitting occurs when a model becomes too closely fitted
to the training data and performs poorly on new, unseen data. Stacking helps mitigate
overfitting by combining multiple models with different strengths and weaknesses, thereby
reducing the risk of relying too heavily on a single model’s biases or idiosyncrasies.
For example, in financial forecasting, stacking combines models like regression, random forest,
and gradient boosting to improve stock market predictions. This ensemble approach mitigates
the individual biases in the model and allows easy incorporation of new models or the removal
of underperforming ones, enhancing prediction performance over time.
Voting
The application of either bagging or boosting requires the selection of a base learner
algorithm first. For example, if one chooses a classification tree, then boosting and bagging
would be a pool of trees with a size equal to the user’s preference.
Robustness
Ensemble learning enhances robustness by considering multiple models' opinions and making
consensus-based predictions. This mitigates the impact of outliers or errors in a single model,
ensuring more accurate results. Combining diverse models reduces the risk of biases or
inaccuracies from individual models, enhancing the overall reliability and performance of the
ensemble learning approach. However, combining multiple models can increase the
computational complexity compared to using a single model. Furthermore, as ensemble models
Reducing Overfitting
Ensemble learning reduces overfitting by using random data subsets for training each model.
Bagging introduces randomness and diversity, improving generalization performance. Boosting
assigns higher weights to difficult-to-classify instances, focusing on challenging cases and
improving accuracy. Iteratively adjusting weights allows boosting to learn from mistakes and
build models sequentially, resulting in a strong ensemble capable of handling complex data
patterns. Both approaches help improve generalization performance and accuracy in ensemble
learning.
Computational Complexity
Ensemble learning, involving multiple algorithms and feature sets, requires more computational
resources than individual models. While parallel processing offers a solution, orchestrating an
ensemble of models across multiple processors can introduce complexity in both
implementation and maintenance. Also, more computation might not always lead to better
performance, especially if the ensemble is not set up correctly or if the models amplify each
other's errors in noisy datasets.
Interpretability
Ensemble learning models prioritize accuracy over interpretability, resulting in highly accurate
predictions. However, this trade-off makes the ensemble model more challenging to interpret.
Techniques like feature importance analysis and model introspection can help provide insights
but may not fully demystify the predictions of complex ensembles. the factors contributing to
ensemble models' decision-making, reducing the interpretability challenge.
Introduction to AdaBoost
AdaBoost stands for Adaptive Boosting. It is an ensemble method that combines multiple
"weak" learners (often simple decision trees or stumps) into a single "strong" learner in an
Key Idea: Create a strong classifier by combining multiple weak classifiers in sequence.
Weak Learners: Typically uses decision stumps (a decision tree with only one split).
Samples that are misclassified by the weak classifier are given higher weights.
3. Update Weights:
Misclassified samples get higher weights, which makes them more important in the
next round.
After each iteration, assign a weight to the weak classifier based on its accuracy. This
weight indicates the importance of the classifier in the final ensemble.
Repeat the process to train multiple classifiers and combine their predictions to form
the final, strong classifier.
2. Versatile:
Works well with various types of weak learners (commonly decision stumps).
Disadvantages of AdaBoost
1. Sensitive to Noisy Data:
If there are outliers or noisy data, AdaBoost can focus too much on these instances,
leading to overfitting.
Performance depends on the quality of weak learners. If weak learners are very poor,
the ensemble won't perform well.
Applications of AdaBoost
Face Recognition: Used in computer vision tasks, such as face detection (e.g., Haar
cascades).
Introduction to XGBoost
XGBoost stands for Extreme Gradient Boosting. It is an optimized implementation of the
Gradient Boosting Machine (GBM), specifically designed for speed and performance. XGBoost
is widely recognized for its high efficiency, scalability, and accuracy.
Key Idea: Uses advanced regularization and gradient-based optimization to build strong
models from weak learners.
The idea of Gradient Boosting is to build new learners to correct the errors made by
previous learners.
Each learner in XGBoost is trained to minimize the residuals (errors) of the previous
learners, by fitting a decision tree to the gradient of the loss function with respect to the
predictions.
Pruning: XGBoost uses depth-first pruning to grow trees only until no further
improvement can be made.
The trees are constructed iteratively, and the final prediction is obtained by summing
the output from all the trees.
A learning rate parameter helps to control the contribution of each new tree added to
the model, allowing the model to make smaller, more controlled updates.
Advantages of XGBoost
1. High Performance:
XGBoost is highly efficient and fast, making it suitable for large datasets and real-time
applications.
2. Regularization:
4. Flexibility:
Disadvantages of XGBoost
1. Complexity:
2. Computationally Intensive:
Applications of XGBoost
1. Kaggle Competitions:
XGBoost is a favorite among data scientists in Kaggle competitions due to its high
accuracy and performance.
2. Credit Scoring:
Widely used in the finance industry for predicting credit risk and assessing loan
eligibility.
Applications Spam detection, Face detection Kaggle, Credit scoring, Customer churn
AdaBoost is simpler and often used with decision stumps to create sequential weak
learners, emphasizing the misclassified data.
XGBoost uses gradient boosting to iteratively improve the model using decision trees, with
an emphasis on regularization and gradient optimization to reduce overfitting and increase
generalization performance.
Both algorithms are highly effective for classification and regression tasks, and their choice
depends on the problem, computational resources, and data characteristics.
There are many ways for measuring classification performance. Accuracy, confusion matrix,
log-loss, and AUC-ROC are some of the most popular metrics. Precision-recall is a widely
used metrics for classification problems.
When any model gives an accuracy rate of 99%, you might think that model is performing very
good but this is not always true and can be misleading in some situations. I am going to explain
this with the help of an example.
We repeat this process for all images in X_test data. Eventually, we’ll have a count of correct
and incorrect matches. But in reality, it is very rare that all incorrect or correct matches
hold equal value. Therefore one metric won’t tell the entire story.
Accuracy is useful when the target class is well balanced but is not a good choice for the
unbalanced classes. Imagine the scenario where we had 99 images of the dog and only 1
image of a cat present in our training data. Then our model would always predict the dog, and
therefore we got 99% accuracy. In reality, Data is always imbalanced for example Spam email,
credit card fraud, and medical diagnosis. Hence, if we want to do a better model evaluation and
have a full picture of the model evaluation, other metrics such as recall and precision should
also be considered.
True Negative: We predicted negative and it’s true. In the image, we predicted that a man is
not pregnant and he actually is not.
False Positive (Type 1 Error): We predicted positive and it’s false. In the image, we
predicted that a man is pregnant but he actually is not.
False Negative (Type 2 Error): We predicted negative and it’s false. In the image, we
predicted that a woman is not pregnant but she actually is.
We discussed Accuracy, now let’s discuss some other metrics of the confusion matrix!
Precision
It explains how many of the correctly predicted cases actually turned out to be positive.
Precision is useful in the cases where False Positive is a higher concern than False Negatives.
The importance of Precision is in music or video recommendation systems, e-commerce
websites, etc. where wrong results could lead to customer churn and this could be harmful to
the business.
Recall (Sensitivity)
It explains how many of the actual positive cases we were able to predict correctly with our
model. Recall is a useful metric in cases where False Negative is of higher concern than False
Positive. It is important in medical cases where it doesn’t matter whether we raise a false
alarm but the actual positive cases should not go undetected!
Recall for a label is defined as the number of true positives divided by the
total number of actual positives.
F1 Score
It gives a combined idea about Precision and Recall metrics. It is maximum when Precision is
equal to Recall.
The F1 score punishes extreme values more. F1 Score could be an effective evaluation metric in
the following cases:
AUC-ROC
The Receiver Operator Characteristic (ROC) is a probability curve that plots the TPR(True
Positive Rate) against the FPR(False Positive Rate) at various threshold values and separates
the ‘signal’ from the ‘noise’.
The Area Under the Curve (AUC) is the measure of the ability of a classifier to distinguish
between classes. From the graph, we simply say the area of the curve ABDE and the X and Y-
axis.
From the graph shown below, the greater the AUC, the better is the performance of the model
at different threshold points between positive and negative classes. This simply means that
When AUC is equal to 1, the classifier is able to perfectly distinguish between all Positive and
Negative class points. When AUC is equal to 0, the classifier would be predicting all Negatives
as Positives and vice versa. When AUC is 0.5, the classifier is not able to distinguish between
the Positive and Negative classes.
To know more, read our Guide to AUC ROC Curve in Machine Learning.
Log Loss
Log loss (Logistic loss) or Cross-Entropy Loss is one of the major metrics to assess the
performance of a classification problem.
For a single sample with true label y ∈{0,1} and a probability estimate p=Pr(y=1), the log loss is:
Logistic Regression: Good for binary classification with a linear decision boundary.
Decision Tree: Offers a more flexible approach by segmenting the space into smaller
regions.
SVM: Works well for higher-dimensional spaces or when there is a clear margin of
separation.
Decision Tree:
TP: 81
TN: 80
FP: 20
FN: 19
SVM:
TP: 75
TN: 90
FP: 10
FN: 25
Logistic Regression offers a balance with a good F1-Score and decent performance across
all metrics.
Conclusion
For this task:
If prioritizing the correctness of the positive classifications (ensuring that those predicted
as likely to subscribe are very likely to do so), SVM might be best due to its high precision.
For a balance between precision and recall, Logistic Regression presents a viable option
with the highest F1-Score.
Choosing the right model depends on the business goals and the cost of false positives vs.
false negatives. For instance, if missing out on potential subscribers is more costly than
mistakenly identifying non-subscribers as potential subscribers, a model with higher recall may
be preferable.
Use: Indicates how well the model can detect positive cases (e.g., detecting actual
churners in a dataset).
Use: Indicates how reliable a positive prediction is (e.g., if the model says someone will
churn, what is the chance they actually will?).
Use: Indicates how reliable a negative prediction is (e.g., if the model says someone will not
churn, what is the chance they really won't?).
Summary
Sensitivity: How good is the model at finding actual positives?
PPV (Precision): When the model predicts positive, how often is it correct?
These metrics help you understand the trade-offs in your model's performance and can be
chosen based on what is most important in your specific problem (e.g., minimizing false
positives vs. maximizing detection of positives).
Gains chart
[Link]
understand-the-difference/
Typically called a Cumulative Gains Chart it can be simply explained by the following example:
For simplicity let's assume we have 1000 customers. If we run an advertising campaign for all
our customers, we might find that 30% (300 out of 1000) will respond and buy our new
product.
The Gains chart is the visualization of that principle. On the X-axis we have the percentage of
the customer base we want to target with the campaign. The Y-axis gives us the answer to the
percentage of all positive responses customers have found in the targeted sample. In the
picture below you can see an example of the Gains chart. (The gains chart associated with the
model is the red curve).
Confusion matrix
To test our strategy (defined by the model and the targeted percentage or equivalently the cut-
off value) we need to compare the output of the model to the actual results in the real world.
This is done by comparing the results and creating a contingency table of misclassification
errors (terminology as used in hypothesis testing - TP means true positive, FN false negative,
FP false positive, and TN true negative):
Observed YES Count TP (right decision) Count FN (error of the second kind)
Ideally, we want to have the right decisions made with high frequency. Such a table (usually
called a confusion matrix) is a very important decision tool when we evaluate the quality of the
model.
For better orientation, it is common practice to display the confusion matrix in the form of the
following graph. From this graph, we see, how many times the model predicts correctly (true
negatives and true positives) and how many times we have an incorrect prediction (false
positives and false negatives). The better the model, the larger the bars TP and TN in
comparison to FN, and FP.
Discussed curves (ROC, Gains, and Lift) are computed based on information from confusion
matrices. It is important to realize that curves are created according to a larger number of
these confusion matrices for various targeted percentages/cut-off values.
ROC curve
Other terms connected with a confusion matrix are Sensitivity and Specificity. They are
computed in the following way:
The Y-axis measures the rate (as a percentage) of correctly predicted customers with a
positive response. The X-axis measures the rate of incorrectly predicted customers with a
negative response.
The optimal model could be the following: Sensitivity will rise to a maximum and specificity will
stay the whole time at 1 (the optimal model is in green color). The task is to have the ROC
curve of the developed model as close as possible to the optimal model.
The Graphical representation of the results as a confusion matrix is below - colors on the graph
represent the same as the color markings in the table above:
False Positive (FP): When the model incorrectly predicts a positive outcome.
Example: Predicting a customer will churn when they won't. This might lead to
unnecessary retention efforts and increased costs.
False Negative (FN): When the model incorrectly predicts a negative outcome.
The aim is to adjust the model’s focus to minimize high-cost errors based on the specific use
case.
For example, if a false negative is more costly than a false positive, assign a higher
penalty for false negatives in the loss function.
Resampling: Oversample the minority class (e.g., fraud cases) or undersample the
majority class.
3. Custom Thresholds:
Adjust the classification threshold to change the balance between precision and
recall.
For example, in a fraud detection system, you may want to reduce false negatives
(missing fraud), even if it means increasing false positives, by setting a lower threshold
for classifying something as "fraud".
False Positive Cost (FP): Annoying a customer by incorrectly flagging their legitimate
transaction as fraud.
False Negative Cost (FN): Failing to identify an actual fraud, leading to financial loss.
In this case, False Negative Cost is much higher, and hence, the model should be adjusted to
prioritize reducing false negatives, even at the cost of more false positives.
Costs: Define the financial or operational cost of a False Positive and False Negative.
Benefits: Define the value associated with True Positive and True Negative
predictions.
This is similar to a confusion matrix but includes the estimated costs and benefits for
each type of outcome.
Use the number of True Positives (TP), True Negatives (TN), False Positives (FP), and
False Negatives (FN) to calculate the total cost or benefit.
Net Gain = (Benefit from TP * Number of TP) + (Benefit from TN * Number of TN) -
(Cost of FP * Number of FP) - (Cost of FN * Number of FN)
True Positive (TP): Correctly identifying someone who would default—Benefit is avoiding a
loan loss.
True Negative (TN): Correctly approving a customer who will not default—Benefit is
gaining a profitable customer.
False Positive (FP): Incorrectly rejecting someone who wouldn’t default—Cost is the
opportunity loss of rejecting a good customer.
False Negative (FN): Incorrectly approving someone who ends up defaulting—Cost is the
financial loss from loan default.
Summary
1. Misclassification Cost Adjustment is about adjusting your model to minimize errors with
high real-world costs. Techniques include custom loss functions, resampling to address
imbalances, and threshold tuning to minimize costly errors.
Both techniques are crucial for deploying machine learning models that have real-world
implications, ensuring that the model not only performs well on accuracy metrics but also
aligns with the financial and operational priorities of the business or use case.
Unit 4
Cluster Analysis and Clustering Methods
Key Idea: Minimize intra-cluster distance (distance between objects in the same cluster)
and maximize inter-cluster distance (distance between objects in different clusters).
2. Feature Selection:
4. Evaluation of Clusters:
Using metrics like silhouette score, Davies-Bouldin index, or purity to assess the quality
of clusters.
5. Interpreting Results:
Types of Clustering
1. Hard Clustering: Each object is assigned to a single cluster.
Example: K-Means.
2. Soft Clustering (Fuzzy Clustering): Each object has a degree of belonging to multiple
clusters.
Clustering Methods
1. Partitioning Methods
Objective: Partition the data into \( k \) clusters where each cluster is represented by a
centroid.
Example Algorithms:
K-Means: Minimizes the sum of squared distances between points and the cluster
centroid.
K-Medoids: Similar to K-Means but uses actual data points as centroids (more robust to
noise).
Advantages:
Disadvantages:
Sensitive to outliers.
2. Hierarchical Methods
Objective: Build a hierarchy of clusters using a tree-like structure called a dendrogram.
Types:
Agglomerative (Bottom-Up): Start with each data point as a cluster and merge them
iteratively.
Advantages:
Disadvantages:
3. Density-Based Methods
Objective: Identify clusters based on regions of high data density.
Example Algorithms:
Groups points closely packed together and marks points in sparse regions as noise.
Advantages:
Disadvantages:
4. Grid-Based Methods
Objective: Divide the data space into a grid and perform clustering on the grid cells.
Example Algorithm:
Disadvantages:
5. Model-Based Methods
Objective: Assume a model for each cluster and find the best fit for the data.
Example Algorithms:
Advantages:
Disadvantages:
Computationally expensive.
Evaluation of Clustering
1. Internal Metrics:
Silhouette Score: Measures how similar an object is to its cluster compared to other
clusters.
Adjusted Rand Index (ARI): Compares the agreement between predicted clusters and
ground truth.
Mutual Information: Measures the shared information between predicted clusters and
ground truth.
Challenges in Clustering
1. Choosing the right number of clusters (\( k \)).
Conclusion
Cluster analysis is a fundamental unsupervised learning technique used to uncover hidden
patterns in data. Choosing the appropriate clustering method depends on the dataset's nature
and the problem's objectives.
Key Objectives:
2. High-Dimensional Data:
The algorithm should require minimal input, such as the number of clusters (\( k \)).
6. Interpretability:
7. Distance Metrics:
9. Evaluation Metrics:
Provide methods to validate and assess clustering quality, such as silhouette score or
purity.
These requirements ensure the effectiveness and usability of clustering algorithms in real-
world applications.
1. k-Means Clustering
Overview:
Each cluster is represented by its centroid (mean of the points in the cluster).
Iterative algorithm that minimizes the sum of squared distances (intra-cluster variance)
between points and their corresponding cluster centroid.
Algorithm:
2. Assign each data point to the nearest centroid (based on Euclidean distance).
3. Update the centroids by calculating the mean of the points in each cluster.
4. Repeat steps 2 and 3 until centroids no longer change significantly or a stopping criterion is
met.
Advantages:
Disadvantages:
Applications:
2. k-Medoids Clustering
Overview:
Similar to k-Means but instead of centroids, it uses actual data points (medoids) as cluster
representatives.
Algorithm:
3. Update the medoids by choosing the point within each cluster that minimizes the sum of
distances to all other points in the cluster.
4. Repeat steps 2 and 3 until medoids no longer change or a stopping criterion is met.
Advantages:
Disadvantages:
Applications:
1. Choose k number of random points (Data point from the data set or some other points).
These points are also called "Centroids" or "Means".
2. Assign all the data points in the data set to the closest centroid by applying any distance
formula like Euclidian distance, Manhattan distance, etc.
3. Now, choose new centroids by calculating the mean of all the data points in the clusters
and goto step 2
4. Continue step 3 until no data point changes classification between two iterations.
The problem with the K-Means algorithm is that the algorithm needs to handle outlier data. An
outlier is a point different from the rest of the points. All the outlier data points show up in a
different cluster and will attract other clusters to merge with it. Outlier data increases the mean
of a cluster by up to 10 units. Hence, K-Means clustering is highly affected by outlier data.
K-Medoids:
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points
is minimal.
(or)
A Medoid is a point in the cluster from which dissimilarities with all the other points in the
clusters are minimal.
Instead of centroids as reference points in K-Means algorithms, the K-Medoids algorithm takes
a Medoid as a reference point.
There are three types of algorithms for K-Medoids Clustering:
PAM is the most powerful algorithm of the three algorithms but has the disadvantage of time
complexity. The following K-Medoids are performed using PAM. In the further parts, we'll see
what CLARA and CLARANS are.
Algorithm:
Given the value of k and unlabelled data:
1. Choose k number of random points from the data and assign these k points to k number of
clusters. These are the initial medoids.
2. For all the remaining data points, calculate the distance from each medoid and assign it to
the cluster with the nearest medoid.
3. Calculate the total cost (Sum of all the distances from all the data points to the medoids)
4. Select a random point as the new medoid and swap it with the previous medoid. Repeat 2
and 3 steps.
5. If the total cost of the new medoid is less than that of the previous medoid, make the new
medoid permanent and repeat step 4.
6. If the total cost of the new medoid is greater than the cost of the previous medoid, undo the
swap and repeat step 4.
7. The Repetitions have to continue until no change is encountered with new medoids to
classify data points.
Data set:
x y
0 5 4
1 7 7
2 1 3
3 8 6
4 4 9
Scatter plot:
2. Calculation of distances
0 5 4 5 6
1 7 7 10 5
2 1 3 - -
3 8 6 10 7
4 4 9 - -
Cluster 1: 0
Cluster 2: 1, 3
0 5 4 - -
1 7 7 5 5
2 1 3 5 9
3 8 6 5 7
4 4 9 - -
Cluster 1: 2, 3
Cluster 2: 1
1. Calculation of total cost:(5 + 5) + 5 = 15Less than the previous costNew medoid: (5, 4).
0 5 4 - -
1 7 7 - -
2 1 3 5 10
3 8 6 5 2
4 4 9 6 5
Cluster 1: 2
Cluster 2: 3, 4
1. Calculation of total cost:(5) + (2 + 5) = 12Less than the previous costNew medoid: (7, 7).
0 5 4 5 5
1 7 7 - -
2 1 3 10 10
3 8 6 - -
4 4 9 5 7
Cluster 1: 4
Cluster 2: 0, 2
Limitation of PAM:
CLARA:
It is an extension to PAM to support Medoid clustering for large data sets. This algorithm
selects data samples from the data set, applies Pam on each sample, and outputs the best
Clustering out of these samples. This is more effective than PAM. We should ensure that the
selected samples aren't biased as they affect the Clustering of the whole data.
CLARANS:
This algorithm selects a sample of neighbors to examine instead of selecting samples from the
data set. In every step, it examines the neighbors of every node. The time complexity of this
Disadvantages:
1. Not suitable for Clustering arbitrarily shaped groups of data points.
2. As the initial medoids are chosen randomly, the results might vary based on the choice in
different runs.
A centroid can be a data point or some other point in the A medoid is always a data point in the
cluster cluster.
Can't cope with outlier data Can manage outlier data too
It comprises of many different methods based on different distance measures. E.g. K-Means
(distance between points), Affinity propagation (graph distance), Mean-shift (distance between
Another challenge with k-means is that you need to specify the number of clusters (“k”) in
order to use it. Much of the time, we won’t know what a reasonable k value is a priori.
What’s nice about DBSCAN is that you don’t have to specify the number of clusters to use it. All
you need is a function to calculate the distance between values and some guidance for what
amount of distance is considered “close”. DBSCAN also produces more reasonable results
than k-means across a variety of different distributions. Below figure illustrates the fact:
Credits
minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
eps (ε): A distance measure that will be used to locate the points in the neighborhood of
any point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it.
Core — This is a point that has at least m points within distance n from itself.
Border — This is a point that has at least one Core point at a distance n.
Noise — This is a point that is neither a Core nor a Border. And it has less than m points
within distance n from itself.
The algorithm proceeds by arbitrarily picking up a point in the dataset (until all points have
been visited).
The clusters are then expanded by recursively repeating the neighborhood calculation for
each neighboring point
DBSCAN in action
[Link]
Parameter Estimation
Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1 . The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2 , the result
will be the same as of hierarchical clustering with the single link metric, with the
dendrogram cut at height ε. Therefore, minPts must be chosen at least 3. However, larger
values are usually better for data sets with noise and will yield more significant clusters. As
a rule of thumb, minPts = 2· dim can be used, but it may be necessary to choose larger
values for very large data, for noisy data or for data that contains many duplicates.
ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to
the k = minPts 1 nearest neighbor ordered from the largest to the smallest value. Good
values of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part
of the data will not be clustered; whereas for a too high value of ε, clusters will merge and
the majority of objects will be in the same cluster. In general, small values of ε are
preferable, and as a rule of thumb, only a small fraction of points should be within this
distance of each other.
Distance function: The choice of distance function is tightly linked to the choice of ε, and
has a major impact on the outcomes. In general, it will be necessary to first identify a
reasonable measure of similarity for the data set, before the parameter ε can be chosen.
Two data clusters with centroids defined. | Image: Oscar Contreras Carrasco
We are now introducing some additional notation. Here, μ1 and μ2 are the centroids of each
cluster and are parameters that identify each of these. Let’s look at how a Gaussian mixture
model can help us analyze these clusters.
A covariance Σ that defines its width. This would be equivalent to the dimensions of an
ellipsoid in a multivariate scenario.
A mixing probability π that defines how big or small the Gaussian function will be.
Equation that mixing coefficients must meet. | Image: Oscar Contreras Carrasco
How do we determine the optimal values for these parameters? To achieve this we must ensure
that each Gaussian fits the data points belonging to each cluster. This is exactly what maximum
likelihood does.
In general, the Gaussian density function is given by:
Overview
The Gaussian Mixture Model (GMM) is a probabilistic clustering method that assumes the data
is generated from a mixture of multiple Gaussian distributions (normal distributions). It is a soft
clustering algorithm, meaning a data point can belong to multiple clusters with a certain
probability.
Key Concepts
1. Gaussian Distribution:
2. Mixture Model:
3. Soft Clustering:
Unlike k-Means, GMM assigns probabilities to each data point for belonging to each
cluster.
Algorithm Steps
The GMM algorithm uses the Expectation-Maximization (EM) approach to optimize the
parameters. The steps are:
Repeat the E-step and M-step until convergence (i.e., when parameter changes
become negligible or a maximum number of iterations is reached).
Disadvantages
1. Computationally expensive for large datasets.
3. Assumes Gaussian distributions, which may not hold for all datasets.
Applications
1. Image segmentation.
2. Speech recognition.
3. Anomaly detection.
4. Customer segmentation.
Example Illustration
Assume we want to cluster data into two groups:
3. Assign data points based on the highest probability or interpret the soft assignments.
The GMM provides a probabilistic and flexible approach for clustering, especially when the
data contains overlapping or complex patterns.
BIRCH
[Link]
hierarchies-birch-1428bb06bb38
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is a clustering algorithm
that can cluster large datasets by first generating a small and compact summary of the the
large dataset that retains as much information as possible. This smaller summary is then
clustered instead of clustering the larger dataset. BIRCH is often used to complement other
clustering algorithms by creating a summary of the dataset that the other clustering algorithm
can now use. However, BIRCH has one major drawback — it can only process metric
1. Building the CF Tree: BIRCH summarizes large datasets into smaller, dense regions called
Clustering Feature (CF) entries. Formally, a Clustering Feature entry is defined as an
ordered triple, (N, LS, SS) where ’N’ is the number of data points in the cluster, ‘LS’ is the
linear sum of the data points and ‘SS’ is the squared sum of the data points in the cluster. It
is possible for a CF entry to be composed of other CF entries. Optionally, we can condense
this initial CF tree into a smaller CF.
2. Global Clustering: Applies an existing clustering algorithm on the leaves of the CF tree. A
CF tree is a tree where each leaf node contains a sub-cluster. Every entry in a CF tree
contains a pointer to a child node and a CF entry made up of the sum of CF entries in the
child nodes. Optionally, we can refine these clusters.
Due to this two step process, BIRCH is also called Two Step Clustering. Let us now deep dive
into these steps to get a holistic understanding.
Cluster Features
A CF is a set of three summary statistics that represent a set of data points in a single cluster.
These statistics are as follows:
★ Count [The number of data values in the cluster.]
★ Linear Sum [The sum of the individual coordinates. This is a measure of the location of the
cluster.]
★ Squared Sum [The sum of the squared coordinates. This is a measure of the spread of the
cluster.]
Together, the linear sum and the squared sum are equivalent to the mean and variance of
the data point.
CF Tree
The building process of the CF Tree can be summarized as:
1. For each given record, BIRCH compares the location of that record with the location of
each CF in the root node, using either the linear sum or the mean of the CF. BIRCH passes
the incoming record to the root node CF closest to the incoming record.
2. The record then descends down to the non-leaf child nodes of the root node CF selected in
step 1. BIRCH compares the location of the record with the location of each non-leaf CF.
BIRCH passes the incoming record to the non-leaf node CF closest to the incoming record.
3. The record then descends down to the leaf child nodes of the non-leaf node CF selected in
step 2. BIRCH compares the location of the record with the location of each leaf. BIRCH
tentatively passes the incoming record to the leaf closest to the incoming record.
(a) If the radius of the chosen leaf including the new record does not exceed the Threshold T,
then the incoming record is assigned to that leaf. The leaf and all of its parent CF’s are updated
to account for the new data point.
(b) If the radius of the chosen leaf including the new record does exceed the Threshold T, then
a new leaf is formed, consisting of the incoming record only. The parent CFs are updated to
account for the new data point.
If step 4(b) is executed, and there are already the maximum of L leaves in the leaf node, then
the leaf node is split into two leaf nodes. The most distant leaf node CFs are used as leaf node
seeds, with the remaining CFs being assigned to whichever leaf node is closer. If the parent
node is full, split the parent node, and so on. Note that the radius of a cluster may be calculated
even without knowing the data points, as long as we have the count n, the linear sum LS, and
In the original paper, the authors have used agglomerative hierarchical clustering.
Parameters of BIRCH
There are three parameters in this algorithm, which needs to be tuned. Unlike K-means, here
the optimal number of clusters (k) need not be input by the user as they are determined by the
algorithm.
✓ threshold : threshold is the maximum number of data points a sub-cluster in the leaf node of
the CF tree can hold.
✓ branching_factor : This parameter specifies the maximum number of CF sub-clusters in
each node (internal node).
✓ n_clusters : The number of clusters to be returned after the entire BIRCH algorithm is
complete i.e., number of clusters after the final clustering step. If set to None, the final
clustering step is not performed and intermediate clusters are returned.
Scikit-learn offers a robust implementation for Python.
Conclusion
Limitations:
Computationally intensive for large datasets due to the similarity matrix.
Key Concepts:
Kernel Density Estimation (KDE): Used to estimate the data distribution. A common kernel
is the Gaussian kernel.
Bandwidth (h): Defines the radius of the kernel, controlling the range of influence for
density estimation.
Algorithm Steps:
1. Initialization:
2. Mean Shift:
For each data point, compute the weighted mean of points within the bandwidth h
3. Cluster Assignment:
After convergence, data points are assigned to clusters based on their proximity to the
converged centroids.
Advantages:
Automatically determines the number of clusters.
Limitations:
Computationally expensive for large datasets due to density estimation.
May merge nearby clusters if h is too large or produce fragmented clusters if h is too small.
Both methods are powerful but suited to different kinds of clustering problems. Affinity
Propagation excels in scenarios where a good similarity measure is available, while Mean-Shift
is ideal for density-based cluster identification.
Limitations:
Computationally expensive for large datasets.
well-defined hierarchy
In the next section of this article, let’s learn about these two ways in detail. For now, the above
image gives you a high level of understanding.
In the early sections of this article, we were given various algorithms to perform the clustering.
But how is this hierarchical clustering different from other techniques?
Let’s discuss that.
Still, in hierarchical clustering no need to pre-specify the number of clusters as we did in the K-
Means Clustering; one can stop at any number of clusters.
Furthermore, Hierarchical Clustering has an advantage over K-Means Clustering. i.e., it results
in an attractive tree-based representation of the observations, called a Dendrogram.
At each step, it merges the closest pair of clusters until only one cluster ( or K clusters
left).
At each step, it splits a cluster until each cluster contains a point ( or there are clusters).
Agglomerative Clustering
It is also known as AGNES ( Agglomerative Nesting) and follows the bottom-up approach.
Each observation starts with its own cluster, and pairs of clusters are merged as one moves up
the hierarchy.
That means the algorithm considers each data point as a single cluster initially and then starts
combining the closest pair of clusters together.
It does the same process until all the clusters are merged into a single cluster that contains all
the datasets.
Step 1
Step 2
Take the next two closest data points and make them one cluster; now, it forms N-1 clusters.
Once all the clusters are combined into a big cluster. We develop the Dendrogram to divide the
clusters.
For the divisive hierarchical clustering, it treats all the data points as one cluster and splits the
clustering until it creates meaningful clusters.
Simple Linkage
Complete Linkage
Centroid Linkage
Ward’s Linkage
Simple Linkage
Single Linkage algorithms are the best for capturing clusters of different sizes.
That means Simple Linkage methods can not group clusters properly if there is any noise
between the clusters.
Complete Linkage
That means the Complete Linkage method also does well in separating clusters if there is
any noise between the clusters.
Average Linkage
The average Linkage method also does well in separating clusters if there is any noise
between the clusters.
Centroid Linkage
The Centroid Linkage method also does well in separating clusters if there is any noise
between the clusters.
Similar to Complete Linkage and Average Linkage methods, the Centroid Linkage method is
also biased towards globular clusters.
Ward’s Linkage
Ward's Linkage method is the similarity of two clusters. Which is based on the increase in
squared error when two clusters are merged, and it is similar to the group average if the
distance between points is distance squared.
In many cases, Ward’s Linkage is preferred as it usually produces better cluster hierarchies
Strong Linkage
Flexible linkage
Simple Average
The Linkage method’s choice depends on you, and you can apply any of them according to the
type of problem, and different linkage methods lead to different clusters.
Below is the comparison image, which shows all the linkage methods. We took this reference
image from greatlearning platform blog.
These hierarchical structures can be visualized using a tree-like diagram called Dendrogram.
Now let us discuss Dendrogram.
What is Dendrogram
A Dendrogram is a diagram that represents the hierarchical relationship between objects. The
Dendrogram is used to display the distance between each pair of sequentially merged objects.
These are commonly used in studying hierarchical clusters before deciding the number of
clusters significant to the dataset.
The distance at which the two clusters combine is referred to as the dendrogram distance.
The primary use of a dendrogram is to work out the best way to allocate objects to clusters.
The key point to interpreting or implementing a dendrogram is to focus on the closest objects
in the dataset.
Hence from the above figure, we can observe that the objects P6 and P5 are very close to
each other, merging them into one cluster named C1, and followed by the object P4 is closed to
the cluster C1, so combine these into a cluster (C2).
And the objects P1 and P2 are close to each other so merge them into one cluster (C3), now
cluster C3 is merged with the following object P0 and forms a cluster (C4), the object P3 is
merged with the cluster C2, and finally the cluster C2 and C4 and merged into a single cluster
(C6).
Till now, we have a clear idea of the Agglomerative Hierarchical Clustering and Dendrograms.
Here, the divisive approach method is known as rigid, i.e., once a splitting is done on clusters,
we can't revert it.
And continue this process to form the new clusters until the desired number of clusters
means one cluster for each observation.
Can obtain any desired number of clusters by cutting the Dendrogram at the proper
level.
All the approaches to calculate the similarity between clusters have their own
disadvantages.
In hierarchical Clustering, once a decision is made to combine two clusters, it can not be
undone.
In this technique, the order of the data has an impact on the final results.
Conclusion
In this article, we discussed the hierarchical cluster algorithm’s in-depth intuition and
approaches, such as the Agglomerative Clustering and Divisive Clustering approach.
Hierarchical Clustering is often used in the form of descriptive rather than predictive modeling.
Mostly we use Hierarchical Clustering algorithm when the application requires a hierarchy. The
advantage of Hierarchical Clustering is we don’t have to pre-specify the clusters.
However, it doesn’t work very well on vast amounts of data or huge datasets. And there are
some disadvantages of the Hierarchical Clustering algorithm that it is not suitable for large
datasets. And it gives the best results in some cases only.
Key Concepts:
1. Splitting Criterion:
Determines how to divide a cluster into smaller clusters. This may involve techniques
like:
2. Dendrogram Representation:
Algorithm Steps:
1. Start with All Points:
Select a cluster to divide (e.g., the largest or the one with the highest variance).
Use a clustering method (e.g., k-means) to divide the cluster into two or more smaller
clusters.
4. Repeat:
Recursively split the resulting clusters until a stopping criterion is met, such as:
Advantages:
Captures hierarchical relationships in data.
Limitations:
Computationally expensive, especially for large datasets.
1. Elbow Method:
Plot the within-cluster sum of squares (WCSS) against the number of clusters .
kk
2. Gap Statistic:
3. Silhouette Analysis:
Examine silhouette scores across a range of cluster numbers to identify the best .
kk
Choosing a Measure
Use internal measures when no ground truth labels are available.
By leveraging these methods, one can assess the quality and reliability of clustering outcomes
for different algorithms and datasets.