Introduction to Machine Learning
Last Updated : 29 Jul, 2025
Machine learning (ML) allows computers to learn and make
decisions without being explicitly programmed. It involves feeding
data into algorithms to identify patterns and make predictions on
new data. It is used in various applications like image recognition,
speech processing, language translation, recommender systems,
etc. In this article, we will see more about ML and its core
concepts.
Why do we need Machine Learning?
Traditional programming requires exact instructions and doesn’t
handle complex tasks like understanding images or language
well. It can’t efficiently process large amounts of data. Machine
Learning solves these problems by learning from examples and
making predictions without fixed rules. Let's see various reasons
why it is important:
1. Solving Complex Business Problems
Traditional programming struggles with tasks like language
understanding and medical diagnosis. ML learns from data and
predicts outcomes without hand-written rules for every case.
Examples:
Image and speech recognition in healthcare.
Language translation and sentiment analysis.
2. Handling Large Volumes of Data
The internet generates huge amounts of data every day. Machine
Learning processes and analyzes this data quickly, providing
valuable insights and real-time predictions.
Examples:
Fraud detection in financial transactions.
Personalized feed recommendations on Facebook and
Instagram from billions of interactions.
3. Automate Repetitive Tasks
ML automates time-consuming, repetitive tasks with high
accuracy, reducing manual work and errors.
Examples:
Gmail filtering spam emails automatically.
Chatbots handling order tracking and password resets.
Automating large-scale invoice analysis for key insights.
4. Personalized User Experience
ML enhances user experience by tailoring recommendations to
individual preferences. It analyzes user behavior to deliver highly
relevant content.
Examples:
Netflix suggesting movies and TV shows based on our
viewing history.
E-commerce sites recommending products we're likely to
buy.
5. Self Improvement in Performance
ML models evolve and improve as they see more data, becoming
more accurate over time. They adapt to user behavior and
increase their performance.
Examples:
Voice assistants like Siri and Alexa learning our preferences
and accents.
Search engines refining results based on user interaction.
Self-driving cars improving decisions using millions of miles
of driving data.
What Makes a Machine "Learn"?
A machine "learns" by identifying patterns in data and improving
its ability to perform specific tasks without being explicitly
programmed for every scenario. This learning process helps
machines to make accurate predictions or decisions based on the
information they receive. Unlike traditional programming where
instructions are fixed, ML allows models to adapt and improve
through experience.
Here is how the learning process works:
1. Data Input: The machine needs data such as text, images or
numbers to analyze. Both the quality and the quantity of the
data matter for effective learning.
2. Algorithms: Algorithms are mathematical methods that
help the machine find patterns in data. Different algorithms
suit different tasks such as classification or regression.
3. Model Training: During training, the machine adjusts its
internal settings (parameters) to better predict outcomes. It
learns by reducing the difference between its predictions and
the actual results.
4. Feedback Loop: The machine compares its predictions with
the true outcomes and uses this feedback to correct errors.
Techniques like gradient descent help it update and improve.
5. Experience and Iteration: The machine repeats training many
times over the data, refining its predictions with each pass.
More data and more iterations generally improve accuracy.
6. Evaluation and Generalization: The model is tested on unseen
data to check that it performs well on real-world tasks.
Machines "learn" by steadily improving through data-driven
iterations, much as humans learn from experience.
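The training, feedback and iteration steps above can be sketched with plain gradient descent on a one-feature linear model. This is a minimal illustration, not a production training loop: the toy data (generated from y = 2x + 1), the learning rate and the epoch count are all assumptions chosen for the example.

```python
# Toy data generated from y = 2x + 1 (an assumption made for this sketch).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

def mean_squared_error(w, b):
    """Average squared gap between predictions w*x + b and the true ys."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def train(learning_rate=0.05, epochs=2000):
    w, b = 0.0, 0.0  # internal settings (parameters) start at arbitrary values
    n = len(xs)
    for _ in range(epochs):
        # Feedback: gradient of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Update: step against the gradient to reduce the error.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

w, b = train()
print(f"w={w:.3f}, b={b:.3f}, mse={mean_squared_error(w, b):.6f}")
# Generalization check on an input that was not in the training data.
print(f"prediction for x=10: {w * 10 + b:.2f}")
```

Each pass through the loop is one iteration of the feedback loop: predict, measure the error, adjust the parameters. The last line plays the role of evaluation on unseen input.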
Importance of Data in Machine Learning
Data is the foundation of machine learning (ML). Without quality
data, ML models cannot learn, perform well or make accurate
predictions.
Data provides the examples from which models learn
patterns and relationships.
High-quality and diverse data improves how well models
perform and generalize to new situations.
It helps models to understand real-world scenarios and
adapt to practical uses.
Features extracted from data are important for effective
training.
Separate datasets for validation and testing measure how
well the model works on unseen data.
Data drives continuous improvements in models through
feedback loops.
Types of Machine Learning
There are three main types of machine learning which are as
follows:
1. Supervised learning
Supervised learning trains a model using labeled data where
each input has a known correct output. The model learns by
comparing its predictions with these correct answers and
improves over time. It is used for
both classification and regression problems.
Example: Consider the following data regarding patients entering
a clinic. The data consists of the gender and age of the patients
and each patient is labeled as "healthy" or "sick".
Gender   Age   Label
M        48    sick
M        67    sick
F        53    healthy
M        49    sick
F        32    healthy
M        34    healthy
M        21    healthy
In this example, supervised learning uses the labeled data to
train a model that can predict the label ("healthy" or "sick") for
new patients based on their gender and age. For example, if a
new patient, a 50-year-old male, visits the clinic, the model can
classify the patient as "healthy" or "sick" based on the
patterns it learned during training.
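As a rough sketch of how such a prediction could be made, the snippet below applies a k-nearest-neighbours vote to the seven labeled patients. The choice of k-NN, the value k=3 and the `GENDER_PENALTY` distance term are assumptions made for illustration; the article does not prescribe a particular algorithm here.

```python
from collections import Counter

# Labeled training data copied from the patient table: (gender, age, label).
patients = [
    ("M", 48, "sick"), ("M", 67, "sick"), ("F", 53, "healthy"),
    ("M", 49, "sick"), ("F", 32, "healthy"), ("M", 34, "healthy"),
    ("M", 21, "healthy"),
]

GENDER_PENALTY = 10  # assumption: a gender mismatch counts like 10 years of age

def distance(a, b):
    """Age gap plus a fixed penalty when the genders differ."""
    (gender_a, age_a), (gender_b, age_b) = a, b
    return abs(age_a - age_b) + (GENDER_PENALTY if gender_a != gender_b else 0)

def predict(gender, age, k=3):
    """Label a new patient by majority vote among the k closest known patients."""
    nearest = sorted(
        patients, key=lambda p: distance((gender, age), (p[0], p[1]))
    )[:k]
    votes = Counter(label for _, _, label in nearest)
    return votes.most_common(1)[0][0]

print(predict("M", 50))
```

For the 50-year-old male, the three closest patients are the males aged 49 and 48 (both sick) and the female aged 53 (healthy), so the vote comes out as "sick". With only seven rows this is a teaching example, not a usable diagnostic model.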
2. Unsupervised learning
Unsupervised learning works with unlabeled data where no
correct answers or categories are provided. The model's job is to
find hidden patterns, similarities or groups in the data on its own.
This is useful in scenarios where labeling data is difficult or
impossible. Common applications are clustering and association.
Example: Consider the following data regarding patients. The
dataset is unlabeled: only the gender and age of the patients
are available, with no health status labels.
Gender   Age
M        48
M        67
F        53
M        49
F        34
M        21
Here unsupervised learning looks for patterns or groups within
the data on its own. For example, it might cluster patients by age
or gender, grouping them into categories like "younger patients"
or "older patients" without knowing their health status.
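A minimal sketch of this kind of grouping is one-dimensional k-means on the ages alone. The number of clusters (two) and the initialisation at the smallest and largest age are assumptions made for the example; gender is ignored to keep it short.

```python
ages = [48, 67, 53, 49, 34, 21]  # ages from the unlabeled patient table

def kmeans_1d(values, iterations=10):
    """Split numbers into two groups by repeatedly re-centering two centroids."""
    # Assumption: two clusters, initialised at the smallest and largest value.
    centroids = [float(min(values)), float(max(values))]
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min((0, 1), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of the values assigned to it.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters

centroids, clusters = kmeans_1d(ages)
print(centroids, clusters)
```

No labels are used anywhere: the algorithm only sees the ages and ends up separating the two youngest patients from the four older ones.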
3. Reinforcement Learning
Reinforcement Learning (RL) trains an agent to make decisions
by interacting with an environment. Instead of being told the
correct answers, the agent learns by trial and error, receiving
rewards for good actions and penalties for bad ones. Over time it
develops a strategy to maximize rewards and achieve goals. This
approach suits problems that involve sequential decision making
such as robotics, gaming and autonomous systems.
Example: When identifying a fruit, the system receives an input,
for example an apple, and initially makes an incorrect prediction
like "It's a mango". Feedback is provided to correct the error
("Wrong! It's an apple") and the system updates its model based
on this feedback.
Over time it learns to respond correctly with "It's an apple" when
given similar inputs, and its accuracy improves.
Besides these three main types, modern machine learning also
includes two other important approaches: Self-Supervised
Learning and Semi-Supervised Learning.
To learn more, refer to the article: Types of Machine Learning
Benefits of Machine Learning
1. Enhanced Efficiency and Automation: ML automates
repetitive tasks, freeing up human resources for more complex
work. This leads to faster, smoother processes and higher
productivity.
2. Data-Driven Insights: It can analyze large amounts of data
to identify patterns and trends that might be missed by people
and help businesses make better decisions.
3. Improved Personalization: It customizes user experiences
by tailoring recommendations and ads based on individual
preferences.
4. Advanced Automation and Robotics: It helps robots and
machines to perform complex tasks with greater accuracy and
adaptability. This is transforming industries like manufacturing
and logistics.
Challenges of Machine Learning
1. Data Bias and Fairness: ML models learn from training
data and if the data is biased, model’s decisions can be unfair
so it’s important to select and monitor data carefully.
2. Security and Privacy Concerns: Since it depends on large
amounts of data, there is a risk of sensitive information being
exposed so protecting privacy is important.
3. Interpretability and Explainability: Complex ML models
can be difficult to understand which makes it difficult to explain
why they make certain decisions. This can affect trust and
accountability.
4. Job Displacement and Automation: Automation may
replace some jobs so retraining and helping workers learn
new skills is important to adapt to these changes.
Applications of Machine Learning
Machine Learning is used in many industries to solve problems
and improve services. Here are some common real-world
applications:
1. Healthcare: It helps doctors to diagnose diseases from
medical images like X-rays and MRIs. It also predicts patient
outcomes and personalizes treatments which improves
healthcare quality.
2. Finance: In finance it detects fraudulent transactions in real
time and supports algorithmic trading. It also helps to assess
credit risk, making lending safer and faster.
3. Retail and E-Commerce: It powers personalized product
recommendations, forecasts demand to optimize inventory
and analyzes customer sentiment to improve shopping
experiences.
4. Transportation and Automotive: Self-driving cars rely on
ML to navigate and make decisions. It optimizes delivery
routes and predicts vehicle maintenance needs which reduces
downtime.
5. Social Media and Entertainment: Platforms like Netflix and
YouTube use ML to recommend content we'll enjoy. It enables
image and speech recognition for better user interaction.
6. Manufacturing: It improves quality control by detecting
defects in products automatically, predicts machine
failures in advance and helps optimize production processes.
Machine learning continues to evolve, opening new
possibilities and transforming industries by enabling smarter,
data-driven decisions and automation that were not possible earlier.
Types of Machine Learning
Last Updated : 04 Nov, 2025
Machine Learning (ML) is a subfield of Artificial Intelligence (AI) that focuses on
building algorithms and models that enable computers to learn from data and
improve with experience without explicit programming for every task. In simple
words, Machine Learning teaches systems to learn patterns and make decisions
like humans by analyzing and learning from data.
There are several types of machine learning, each with special characteristics and
applications. Some of the main types of machine learning algorithms are as
follows:
1. Supervised Machine Learning
2. Unsupervised Machine Learning
3. Reinforcement Learning
Additionally, there are two more specific categories, Semi-Supervised Learning
and Self-Supervised Learning, which combine elements of both supervised and
unsupervised learning.
Types of Machine Learning
1. Supervised Machine Learning
Supervised learning trains a model on a "Labeled Dataset". Labeled datasets
have both input and output parameters. In supervised learning, algorithms learn
a mapping from inputs to correct outputs. Both the training and validation
datasets are labeled.
Supervised Learning
Example: If you train a model using labeled images of cats and dogs, it learns the
features of each. When shown a new image, it predicts whether it’s a cat or a dog.
There are two main categories of supervised learning that are mentioned below:
1. Classification
Classification predicts categorical outputs, meaning it assigns data into predefined
classes like spam/non-spam emails or disease risk categories. These algorithms
learn to map input features to discrete labels. Here are some classification
algorithms:
Logistic Regression
Decision Tree
Random Forest
K-Nearest Neighbors (KNN)
Naive Bayes
Support Vector Machine
2. Regression
Regression predicts continuous values, such as house prices or product sales. It
learns the relationship between input features and a numerical target variable.
Here are some regression algorithms:
Linear Regression
Polynomial Regression
Ridge Regression
Lasso Regression
Decision Tree
Random Forest
Where to Use Supervised Learning
When you have labeled data and want to predict outcomes.
Ideal for classification (like spam detection) or regression tasks (like price
forecasting).
Best used in domains where historical data with outcomes is already
available.
Applications
Supervised learning is used in a wide variety of applications, including:
Image, speech and text processing: For tasks like image classification,
speech recognition and sentiment analysis.
Predictive analytics: To forecast sales, customer churn, stock prices and
weather conditions.
Recommendation and personalization: Powering systems that suggest
products, movies or content.
Healthcare and finance: Used for medical diagnosis, fraud detection and
credit scoring.
Automation and control: In autonomous vehicles, manufacturing quality
checks and gaming AI.
2. Unsupervised Machine Learning
Unsupervised Learning works with unlabeled data, meaning there are no
predefined outputs. The algorithm finds hidden patterns, groups or relationships
within the data on its own. It’s mainly used for clustering, dimensionality reduction
and data visualization.
Unsupervised Learning
Example: If you have customer data without labels, the algorithm can group
similar customers based on purchase behavior, which is useful for segmentation
and marketing.
There are three main categories of unsupervised learning that are mentioned below:
1. Clustering
Clustering is the process of grouping data points into clusters based on their
similarity. This technique is useful for identifying patterns and relationships in data
without the need for labeled examples. Common techniques include:
K-Means
DBSCAN
Mean-shift
2. Dimensionality Reduction Techniques
Dimensionality reduction helps reduce the number of features while preserving
important information. Common techniques include:
Principal Component Analysis
Independent Component Analysis
3. Association Rule Learning
Association rule learning is a technique for discovering relationships between
items in a dataset. It identifies rules stating that the presence of one item implies
the presence of another item with a specific probability. Common techniques
include:
Apriori
FP-growth
Eclat
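To make "one item implies another with a specific probability" concrete, the sketch below computes the two standard measures behind such rules, support and confidence, by brute force. The shopping baskets are invented for illustration, and this is not an implementation of Apriori, FP-growth or Eclat, which exist to find such rules efficiently on large datasets.

```python
# Hypothetical shopping baskets, invented for illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
    {"bread", "butter", "jam"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"bread", "butter"})
rule_confidence = confidence({"bread"}, {"butter"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here bread and butter appear together in 3 of 5 baskets (support 0.60), and 3 of the 4 baskets containing bread also contain butter (confidence 0.75), which is the "specific probability" attached to the rule bread -> butter.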
Where to Use Unsupervised Learning
When data is unlabeled or unstructured.
Useful for exploratory analysis, clustering or feature extraction.
Common in marketing, recommendation systems and fraud detection where
patterns matter more than labels.
Applications of Unsupervised Learning
Here are some common applications of unsupervised learning:
Clustering and segmentation: Group similar data points, customers or
images.
Anomaly detection: Spot unusual patterns or outliers in data.
Dimensionality reduction: Simplify large datasets while retaining key
information.
Recommendation and marketing: Identify user preferences and improve
product suggestions.
Data preprocessing and analysis: Clean data, detect patterns and support
exploratory data analysis (EDA).
3. Reinforcement Learning
Reinforcement learning trains an agent to make a sequence of decisions through
trial and error. The agent interacts with the environment, receives feedback in the
form of rewards or penalties and learns optimal actions over time.
Reinforcement Machine Learning
Example: An AI agent learning to play chess gets positive feedback for good
moves and negative for poor ones. Over time, it learns strategies to win more
often.
Here are some of the most common reinforcement learning algorithms:
Q-learning: Learns the best action for each state based on expected
rewards.
SARSA (State-Action-Reward-State-Action): Similar to Q-learning but
updates values for the action actually taken.
Deep Q-learning: Uses neural networks to handle complex state-action
relationships.
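The following is a small sketch of tabular Q-learning, assuming a made-up environment: a corridor of five cells where the agent starts at cell 0 and is rewarded only on reaching cell 4. The environment, the hyperparameters (`ALPHA`, `GAMMA`, `EPSILON`) and the episode counts are all illustrative assumptions.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.2  # illustrative hyperparameters

def step(state, action):
    """Move one cell; reward 1 only when the goal cell is reached."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    return next_state, reward

def greedy(q, state, rng):
    """Best known action for a state, breaking ties at random."""
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def train(episodes=500, max_steps=100, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q-table: one row per state
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            # Trial and error: explore occasionally, otherwise exploit.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                action = greedy(q, state, rng)
            next_state, reward = step(state, action)
            # Q-learning update: bootstrap from the best action in next_state.
            target = reward + GAMMA * max(q[next_state])
            q[state][action] += ALPHA * (target - q[state][action])
            state = next_state
            if state == N_STATES - 1:
                break
    return q

q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

After training, the greedy policy chooses "move right" (action 1) in every non-goal cell. SARSA would differ only in the update line, using the value of the action actually taken in `next_state` instead of the maximum.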
Types of Reinforcement Learning
Positive Reinforcement: Rewards desired behavior (e.g., giving points for
correct answers).
Negative Reinforcement: Removes negative outcomes to encourage good
actions (e.g., turning off a buzzer after the right move).
Where to Use Reinforcement Learning
When you need an agent to learn by interacting with an environment.
Best for decision-making or optimization tasks involving trial and feedback
loops.
Used when long-term performance or adaptive behavior is more important
than immediate accuracy.
Applications of Reinforcement Learning
Here are some applications of reinforcement learning:
Gaming and simulation: Teaching agents or NPCs to play and adapt
intelligently.
Robotics and automation: Enabling robots to perform tasks autonomously.
Autonomous vehicles: Helping self-driving cars make real-time decisions.
Healthcare and finance: Optimizing treatment plans, trading and resource
allocation.
Recommendation and personalization: Improving user experience
through adaptive suggestions.
Industrial and energy management: Optimizing control systems and
energy use.
Semi-Supervised Learning: Supervised + Unsupervised
Learning
Semi-Supervised Learning combines both Supervised
and Unsupervised approaches. It uses a small set of labeled data and a large set
of unlabeled data for training, which is useful when labeling is costly or
time-consuming.
Semi-Supervised Learning
Example: Consider building a language translation model. Obtaining labeled
translations for every sentence pair can be resource-intensive. Semi-supervised
learning allows the model to learn from both labeled and unlabeled sentence pairs,
making it more accurate. This technique has led to significant improvements in the
quality of machine translation services.
Popular Techniques
Graph-based Learning: Spreads label information through data
relationships.
Label Propagation: Iteratively assigns labels to unlabeled data.
Co-training: Uses two models to train and label each other’s data.
Self-training: Uses model predictions as pseudo-labels.
Generative Adversarial Networks (GANs): Generates synthetic data to
improve learning.
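Of the techniques above, self-training is the simplest to sketch. The example below assumes one-dimensional data, two classes and a nearest-centroid classifier; the data points and the `min_margin` confidence threshold are hypothetical values chosen for illustration.

```python
def centroids(labeled):
    """Mean feature value per class from (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def self_train(labeled, unlabeled, min_margin=3.0):
    """Repeatedly add confident predictions to the labeled set as pseudo-labels."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    while unlabeled:
        cents = centroids(labeled)
        confident, uncertain = [], []
        for x in unlabeled:
            # Rank classes by distance to their centroid (two classes assumed).
            (d_best, best), (d_other, _) = sorted(
                (abs(x - c), lbl) for lbl, c in cents.items()
            )
            if d_other - d_best >= min_margin:
                confident.append((x, best))   # accept as a pseudo-label
            else:
                uncertain.append(x)           # too close to call; keep unlabeled
        if not confident:
            break
        labeled += confident
        unlabeled = uncertain
    return labeled, unlabeled

labeled_seed = [(1.0, 0), (9.0, 1)]          # small labeled set (hypothetical)
unlabeled_pool = [2.0, 3.0, 7.0, 8.0, 5.2]   # larger unlabeled set (hypothetical)
final_labeled, still_unlabeled = self_train(labeled_seed, unlabeled_pool)
print(final_labeled, still_unlabeled)
```

Four of the five unlabeled points are far enough from the decision boundary to receive pseudo-labels. The point 5.2 sits almost midway between the classes, so it stays unlabeled instead of risking a wrong label being fed back into training.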
Where to Use Semi-Supervised Learning
When you have limited labeled data but plenty of unlabeled data.
Useful for domains with high labeling costs, such as medical, NLP or image
datasets.
Ideal when unlabeled data still holds valuable information that can improve
learning performance.
Applications
Image Classification: Combine small labeled and large unlabeled image
datasets to improve accuracy.
Natural Language Processing (NLP): Enhance language models by using
a mix of labeled and vast unlabeled text data.
Speech Recognition: Boost accuracy by leveraging limited transcribed
audio and more unlabeled speech data.
Recommendation Systems: Improve recommendations using sparse
labeled data and abundant unlabeled user behavior.
Healthcare & Medical Imaging: Improve medical image analysis with a mix
of labeled and unlabeled images.
Self-Supervised Learning
Self-Supervised Learning (SSL) is a modern approach where models generate
their own labels from raw data. It doesn’t rely on manual annotation. Instead, the
model learns by predicting parts of data from other parts.
Example: In NLP, BERT learns by predicting masked words in a sentence from the
surrounding context, while GPT learns by predicting the next token. This helps
them learn language understanding without human labeling.
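The idea that the labels come from the data itself can be shown in a few lines. This sketch masks each word of a sentence in turn to produce (input, target) pairs; the function name and the `[MASK]` token string are illustrative, and real masked-language-model pre-training masks a random subset of sub-word tokens, not every whole word.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Turn one raw sentence into (masked input, target word) training pairs."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        # Hide word i; the hidden word itself becomes the label.
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples

for masked_input, target in masked_examples("the cat sat"):
    print(masked_input, "->", target)
```

No human annotation is involved: every training pair is derived mechanically from unlabeled text.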
Popular Techniques
Masked Modeling (BERT)
Contrastive Learning (SimCLR, MoCo)
Autoencoders
Predictive Coding
Applications
Natural Language Processing
Computer Vision and Speech Recognition
Video understanding
Pre-training for large AI models
Where to Use Self-Supervised Learning
When manual labeling is impossible or expensive.
Suitable for large-scale datasets like text, audio and images.
Best for pre-training models that can later be fine-tuned for specific
supervised tasks.
Comparison
Type             Data Requirement   Label Availability      Learning Goal                 Common Use Case
Supervised       High               Labeled                 Predict outputs               Spam detection, Price prediction
Unsupervised     Medium             Unlabeled               Find hidden patterns          Customer segmentation
Reinforcement    High               Reward feedback         Learn best actions            Robotics, Games
Semi-Supervised  Medium             Partial labels          Combine both learning types   NLP, Image recognition
Self-Supervised  High               Self-generated labels   Learn data representations    BERT, GPT, CLIP
What is Machine Learning Pipeline?
Last Updated : 03 Nov, 2025
A Machine Learning Pipeline is a systematic workflow designed to
automate the process of building, training, and deploying ML
models. It includes several steps, such as:
Data Collection
Preprocessing
Feature Engineering
Model Training
Evaluation
Deployment
Rather than managing each step individually, pipelines help
simplify and standardize the workflow, making machine learning
development faster, more efficient and scalable. They also
enhance data management by enabling the extraction,
transformation, and loading of data from various sources.
Steps to build Machine Learning Pipeline
A machine learning pipeline is a step-by-step process that
automates data preparation, model training and deployment. Here,
we will discuss the key steps:
Step 1: Data Collection and Preprocessing
Gather data from sources like databases, APIs or CSV files.
Clean the data by handling missing values, duplicates and
errors.
Normalize and standardize numerical values.
Convert categorical variables into a machine readable format.
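Two of these preprocessing steps, standardizing numbers and encoding categories, can be sketched without any library. The helper names and the sample columns below are hypothetical; in practice a library such as scikit-learn would normally handle this.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale numbers to zero mean and unit variance (z-scores)."""
    mu, sigma = mean(values), pstdev(values)
    # Note: a constant column has sigma == 0 and would need special handling.
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Map each category to a 0/1 vector with one slot per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

ages = [20, 30, 40]        # hypothetical numerical column
ports = ["S", "C", "S"]    # hypothetical categorical column
scaled_ages = standardize(ages)
encoded_ports = one_hot(ports)
print(scaled_ages)
print(encoded_ports)
```

The slots in each one-hot vector follow the sorted category order, so here the first position means "C" and the second means "S".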
Step 2: Feature Engineering
Select the most important features for better model
performance.
Create new features through feature extraction or transformation.
Step 3: Data splitting
Divide the dataset into training, validation and testing sets.
When dealing with imbalanced datasets, use stratified
sampling so that each split keeps the class proportions.
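With imbalanced data, a plain random split can leave very few minority-class examples in one of the sets. A common remedy is a stratified split, which samples each class separately. The sketch below is a minimal pure-Python version; the function name, the 90/10 label distribution and the seed are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) keeping each class's share in both sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # randomise within the class only
        n_test = round(len(indices) * test_size)
        test_idx += indices[:n_test]
        train_idx += indices[n_test:]
    return sorted(train_idx), sorted(test_idx)

labels = [0] * 90 + [1] * 10   # hypothetical imbalanced labels: 90% vs 10%
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

The test set ends up with 18 majority and 2 minority examples, the same 90:10 ratio as the full dataset. In scikit-learn the equivalent is passing `stratify=y` to `train_test_split`.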
Step 4: Model Selection & Training
Choose the best algorithm based on the problem type, such as
classification, regression or clustering.
Train the model using the training dataset.
Step 5: Model evaluation & Optimization
Test the model's performance using accuracy, precision,
recall and other metrics.
Tune hyperparameters using Grid Search or Random Search
and guard against overfitting using techniques like cross-validation.
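The metrics named above can be computed directly from the counts of true positives, false positives and false negatives. This is a minimal sketch for binary labels; the function name and the sample predictions are hypothetical.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision and recall for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Of everything predicted positive, how much was really positive?
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Of everything really positive, how much did the model find?
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical ground truth
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]   # hypothetical model predictions
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

In this example the model is right on 5 of 8 cases (accuracy 0.625), 3 of its 5 positive predictions are correct (precision 0.6) and it finds 3 of the 4 actual positives (recall 0.75). Looking at all three matters because accuracy alone can look good on imbalanced data.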
Step 6: Model Deployment
Deploy the trained model
using Flask, FastAPI, TensorFlow and cloud services.
Save the trained model for real-world applications.
Step 7: Continuous learning & Monitoring
Automate the pipeline using MLOps tools like MLflow or
Kubeflow.
Update the model with new data to maintain accuracy.
Implementation for model Training
1. Import Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
2. Load and Prepare the data
# Load dataset
df =
pd.read_csv("[Link]
atasets/master/[Link]")
# Select relevant features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
df = df[features + ['Survived']].dropna()  # Drop rows with missing values
# Display the first few rows
print(df.head())
Output:
3. Define Preprocessing Steps
# Define numerical and categorical features
num_features = ['Age', 'SibSp', 'Parch', 'Fare']
cat_features = ['Pclass', 'Sex', 'Embarked']
# Define transformers
num_transformer = StandardScaler()  # Standardization for numerical features
cat_transformer = OneHotEncoder(handle_unknown='ignore')  # One-hot encoding for categorical features
# Combine transformers into a preprocessor
preprocessor = ColumnTransformer([
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
])
4. Split the data for training and Testing
# Define target and features
X = df[features]
y = df['Survived']
# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
# Display the shape of the data
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
Output:
Training set shape: (567, 7)
Testing set shape: (143, 7)
5. Build and Train model
# Define the pipeline
pipeline = Pipeline([
('preprocessor', preprocessor), # Data transformation
('classifier', RandomForestClassifier(n_estimators=100,
random_state=42)) # ML model
])
# Train the model
pipeline.fit(X_train, y_train)
print("Model training complete!")
Output:
Model training complete!
6. Evaluate the Model
# Make predictions
y_pred = pipeline.predict(X_test)
# Compute accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
Output:
Model Accuracy: 0.76
7. Save and Load the Model
import joblib
# Save the trained pipeline
joblib.dump(pipeline, 'ml_pipeline.pkl')
# Load the model
loaded_pipeline = joblib.load('ml_pipeline.pkl')
# Predict using the loaded model
sample_data = pd.DataFrame([{'Pclass': 3, 'Sex': 'male', 'Age': 25,
'SibSp': 0, 'Parch': 0, 'Fare': 7.5, 'Embarked': 'S'}])
prediction = loaded_pipeline.predict(sample_data)
outcome = 'Survived' if prediction[0] == 1 else 'Did not Survive'
print(f"Prediction: {outcome}")
Output:
Prediction: Did not Survive
Implementation code
# Step 1: Import Required Libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import joblib # For saving and loading models
# Step 2: Load and Prepare the Data
# Load dataset (Titanic dataset as an example)
df =
pd.read_csv("[Link]
atasets/master/[Link]")
# Select relevant features
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
df = df[features + ['Survived']].dropna()  # Drop rows with missing values
# Display the first few rows of the dataset
print("Data Sample:\n", df.head())
# Step 3: Define Preprocessing Steps
# Define numerical and categorical features
num_features = ['Age', 'SibSp', 'Parch', 'Fare']
cat_features = ['Pclass', 'Sex', 'Embarked']
# Define transformers for preprocessing
num_transformer = StandardScaler()  # Standardize numerical features
cat_transformer = OneHotEncoder(handle_unknown='ignore')  # One-hot encode categorical features
# Combine transformers into a single preprocessor
preprocessor = ColumnTransformer([
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
])
# Step 4: Split Data into Training and Testing Sets
# Define target and features
X = df[features]
y = df['Survived']
# Split into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
# Step 5: Build the Machine Learning Pipeline
# Define the pipeline (includes preprocessing + RandomForest classifier)
pipeline = Pipeline([
('preprocessor', preprocessor), # Apply preprocessing steps
('classifier', RandomForestClassifier(n_estimators=100,
random_state=42)) # ML model (RandomForest)
])
# Step 6: Train the Model
# Train the model using the pipeline
pipeline.fit(X_train, y_train)
print("Model training complete!")
# Step 7: Evaluate the Model
# Make predictions on the test data
y_pred = pipeline.predict(X_test)
# Compute accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")
# Step 8: Save and Load the Model
# Save the trained pipeline (preprocessing + model)
joblib.dump(pipeline, 'ml_pipeline.pkl')
# Load the model back
loaded_pipeline = joblib.load('ml_pipeline.pkl')
# Predict using the loaded model
sample_data = pd.DataFrame([{'Pclass': 3, 'Sex': 'male', 'Age': 25,
'SibSp': 0, 'Parch': 0, 'Fare': 7.5, 'Embarked': 'S'}])
prediction = loaded_pipeline.predict(sample_data)
# Output prediction for a sample input
outcome = 'Survived' if prediction[0] == 1 else 'Did not Survive'
print(f"Prediction for Sample Data: {outcome}")
Output:
Benefits of Machine Learning pipeline
A Machine Learning Pipeline offers several advantages by
automating and streamlining the process of developing, training
and deploying machine learning models. Here are the key
benefits:
1. Automation and Efficiency: It automates repetitive tasks
such as data cleaning, model training and testing. This saves time,
speeds up the development process and allows data
scientists to focus on more strategic tasks.
2. Faster Model Deployment: It helps in quickly moving a trained
model into real-world use. It is useful for AI applications like stock
trading, fraud detection and healthcare.
3. Improved Accuracy & Consistency: It ensures that data is
processed the same way every time, reducing human error and
making predictions more reliable.
4. Handles Large Data Easily: An ML pipeline works efficiently with
big datasets and can run on powerful cloud platforms for better
performance.
5. Cost-Effective: A Machine Learning Pipeline saves time and
money by automating tasks that would normally require manual
work. This means fewer mistakes and less manual effort,
making the process more efficient and cost-effective.
Supervised Machine Learning
Last Updated : 12 Sep, 2025
Supervised learning is a type of machine learning where a model
learns from labelled data—meaning every input has a
corresponding correct output. The model makes predictions and
compares them with the true outputs, adjusting itself to reduce
errors and improve accuracy over time. The goal is to make
accurate predictions on new, unseen data. For example, a model
trained on images of handwritten digits can recognise new digits it
has never seen before.
Supervised Machine Learning
Types of Supervised Learning in Machine
Learning
Supervised learning can be applied to two main types of
problems:
Classification: Where the output is a categorical variable
(e.g., spam vs. non-spam emails, yes vs. no).
Regression: Where the output is a continuous variable (e.g.,
predicting house prices, stock prices).
Types of Supervised Learning
While training the model, data is usually split in the ratio of 80:20,
i.e. 80% as training data and the rest as testing data. For the
training data, we feed the model both the inputs and the outputs. The
model learns from training data only. We use different supervised
learning algorithms (which we will discuss in detail in a later
section) to build our model. Let's first understand the classification
and regression data through the two sample datasets below:
Sample
Both of the above figures show labelled datasets, as follows:
Figure A: It is a dataset of a shopping store that is useful in
predicting whether a customer will purchase a particular product
under consideration or not based on his/her gender, age and
salary.
Input: Gender, Age, Salary
Output: Purchased i.e. 0 or 1; 1 means yes the customer will
purchase and 0 means that the customer won't purchase it.
Figure B: It is a Meteorological dataset that serves the purpose of
predicting wind speed based on different parameters.
Input: Dew Point, Temperature, Pressure, Relative Humidity,
Wind Direction
Output: Wind Speed
Working of Supervised Machine Learning
The working of supervised machine learning follows these key
steps:
1. Collect Labeled Data
Gather a dataset where each input has a known correct
output (label).
Example: Images of handwritten digits with their actual
numbers as labels.
2. Split the Dataset
Divide the data into training data (about 80%) and testing
data (about 20%).
The model will learn from the training data and be evaluated
on the testing data.
3. Train the Model
Feed the training data (inputs and their labels) to a suitable
supervised learning algorithm (like Decision Trees, SVM or
Linear Regression).
The model tries to find patterns that map inputs to correct
outputs.
4. Validate and Test the Model
Evaluate the model using testing data it has never seen
before.
The model predicts outputs and these predictions are
compared with the actual labels to calculate accuracy or error.
5. Deploy and Predict on New Data
Once the model performs well, it can be used to predict
outputs for completely new, unseen data.
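The steps above can be sketched in a few lines of plain Python. This
is only an illustrative sketch: the toy dataset, the helper names
(train_test_split, fit_threshold, accuracy) and the one-parameter
"threshold" model are assumptions made for the example, not part of
any particular library.

```python
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def fit_threshold(rows):
    """'Train' a one-parameter model: pick the cut-off on the feature
    that classifies the training rows most accurately."""
    labels = [y for _, y in rows]
    candidates = sorted(x for x, _ in rows)
    return max(
        candidates,
        key=lambda t: accuracy(labels, [int(x >= t) for x, _ in rows]),
    )

# 1. Collect labeled data (hypothetical: label is 1 when feature >= 50).
data = [(x, int(x >= 50)) for x in range(100)]

# 2. Split into about 80% training and 20% testing data.
train, test = train_test_split(data)

# 3. Train on the training data only.
threshold = fit_threshold(train)

# 4. Evaluate on testing data the model has never seen.
test_acc = accuracy([y for _, y in test], [int(x >= threshold) for x, _ in test])

# 5. Predict on a completely new input.
new_prediction = int(73 >= threshold)
print(len(train), len(test), threshold, test_acc, new_prediction)
```

The key design point is that the threshold is chosen from the
training rows alone, so the test accuracy is an honest estimate of
performance on unseen data.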
Supervised Machine Learning Algorithms
Supervised learning can be further divided into several different
types, each with its own unique characteristics and applications.
Here are some of the most common types of supervised learning
algorithms:
Linear Regression: Linear regression is a type of supervised
learning regression algorithm that is used to predict a
continuous output value. It is one of the simplest and most
widely used algorithms in supervised learning.
Logistic Regression: Logistic regression is a type of
supervised learning classification algorithm that is used to
predict a binary output variable.
Decision Trees : Decision tree is a tree-like structure that is
used to model decisions and their possible consequences.
Each internal node in the tree represents a decision, while each
leaf node represents a possible outcome.
Random Forests: Random forests again are made up of
multiple decision trees that work together to make predictions.
Each tree in the forest is trained on a different subset of the
input features and data. The final prediction is made by
aggregating the predictions of all the trees in the forest.
Support Vector Machine(SVM): The SVM algorithm creates
a hyperplane to segregate n-dimensional space into classes
and identify the correct category of new data points. The
extreme cases that help create the hyperplane are called
support vectors, hence the name Support Vector Machine.
K-Nearest Neighbors: KNN works by finding k training
examples closest to a given input and then predicts the class or
value based on the majority class or average value of these
neighbors. The performance of KNN can be influenced by the
choice of k and the distance metric used to measure proximity.
Gradient Boosting: Gradient Boosting combines weak
learners, like decision trees, to create a strong model. It
iteratively builds new models that correct errors made by
previous ones.
Naive Bayes Algorithm: The Naive Bayes algorithm is a
supervised machine learning algorithm based on applying
Bayes' Theorem with the “naive” assumption that features are
independent of each other given the class label.
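As one illustration of the algorithms listed above, here is a
minimal sketch of K-Nearest Neighbors classification in plain
Python. The function name knn_predict and the toy points are
assumptions made for this example; it uses Euclidean distance and a
majority vote, which is only one of several possible choices.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, query, k=3):
    """Classify query by majority vote among its k nearest training points."""
    # Pair every training point's distance to the query with its label.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two small clusters in 2-D.
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]
train_y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_X, train_y, (2.0, 2.0)))  # query near the first cluster
print(knn_predict(train_X, train_y, (8.0, 9.0)))  # query near the second cluster
```

Changing k or swapping math.dist for another distance function
changes the predictions, which is why both are treated as tuning
choices.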
Let's summarize the supervised machine learning algorithms in a
table:
Algorithm | Regression / Classification | Purpose | Method | Use Cases
Linear Regression | Regression | Predict continuous output values | Linear equation minimizing sum of squares of residuals | Predicting continuous values
Logistic Regression | Classification | Predict binary output variable | Logistic function transforming linear relationship | Binary classification tasks
Decision Trees | Both | Model decisions and outcomes | Tree-like structure with decisions and outcomes | Classification and Regression tasks
Random Forests | Both | Improve classification and regression accuracy | Combining multiple decision trees | Reducing overfitting, improving prediction accuracy
SVM | Both | Create hyperplane for classification or predict continuous values | Maximizing margin between classes or predicting continuous values | Classification and Regression tasks
KNN | Both | Predict class or value based on k closest neighbors | Finding k closest neighbors and predicting based on majority or average | Classification and Regression tasks, sensitive to noisy data
Gradient Boosting | Both | Combine weak learners to create strong model | Iteratively correcting errors with new models | Classification and Regression tasks to improve prediction accuracy
Naive Bayes | Classification | Predict class based on feature independence assumption | Bayes' theorem with feature independence assumption | Text classification, spam filtering, sentiment analysis, medical
These types of supervised learning in machine learning vary
based on the problem we're trying to solve and the dataset we're
working with. In classification problems, the task is to assign inputs
to predefined classes, while regression problems involve
predicting numerical outcomes.
Practical Examples of Supervised learning
A few practical examples of supervised machine learning across
various industries:
Fraud Detection in Banking: Utilizes supervised learning
algorithms on historical transaction data, training models with
labeled datasets of legitimate and fraudulent transactions to
accurately predict fraud patterns.
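Parkinson Disease Prediction: Parkinson’s disease is a
progressive disorder that affects the nervous system and the
parts of the body controlled by the nerves. Supervised models
trained on labeled patient data can be used to predict whether
the disease is present.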
Customer Churn Prediction: Uses supervised learning
techniques to analyze historical customer data, identifying
features associated with churn rates to predict customer
retention effectively.
Cancer cell classification: Uses supervised learning to
classify cancer cells as ‘malignant’ or ‘benign’ based on their
features.
Stock Price Prediction: Applies supervised learning to
predict a signal that indicates whether buying a particular stock
will be helpful or not.
Advantages
Here are some advantages of supervised learning listed below:
Simplicity & clarity: Easy to understand and implement
since it learns from labeled examples.
High accuracy: When sufficient labeled data is available,
models achieve strong predictive performance.
Versatility: Works for both classification like spam detection,
disease prediction and regression like price forecasting.
Generalization: With enough diverse data and proper
training, models can generalize well to unseen inputs.
Wide application: Used in speech recognition, medical
diagnosis, sentiment analysis, fraud detection and more.
Disadvantages
Requires labeled data: Large amounts of labeled datasets
are expensive and time-consuming to prepare.
Bias from data: If training data is biased or unbalanced, the
model may learn and amplify those biases.
Overfitting risk: Model may memorize training data instead
of learning general patterns, especially with small datasets.
Limited adaptability: Performance drops significantly when
applied to data distributions very different from training data.
Not scalable for some problems: In tasks with millions of
possible labels like natural language, supervised labeling
becomes impractical.
Linear Regression in Machine learning
Linear regression is a type of supervised machine-learning
algorithm that learns from labelled datasets and fits the best
linear function to the data points, which can then be used for
prediction on new datasets. It assumes that there is a linear
relationship between the input and output, meaning the output
changes at a constant rate as the input changes. This
relationship is represented by a straight line.
For example, suppose we want to predict a student's exam score
based on how many hours they studied. We observe that as
students study more hours, their scores go up. In this example:
Independent variable (input): Hours studied because it's
the factor we control or observe.
Dependent variable (output): Exam score because it
depends on how many hours were studied.
We use the independent variable to predict the dependent
variable.
Best Fit Line in Linear Regression
In linear regression, the best-fit line is the straight line that most
accurately represents the relationship between the independent
variable (input) and the dependent variable (output). It is the line
that minimizes the difference between the actual data points and
the predicted values from the model.
1. Goal of the Best-Fit Line
The goal of linear regression is to find a straight line that
minimizes the error (the difference) between the observed data
points and the predicted values. This line helps us predict the
dependent variable for new, unseen data.
Linear Regression
Here Y is called a dependent or target variable and X is called an
independent variable, also known as the predictor of Y. There are
many types of functions or models that can be used for
regression. A linear function is the simplest type of function.
Here, X may be a single feature or multiple features representing
the problem.
2. Equation of the Best-Fit Line
For simple linear regression (with one independent variable), the
best-fit line is represented by the equation
$$y = mx + b$$
Where:
y is the predicted value (dependent variable)
x is the input (independent variable)
m is the slope of the line (how much y changes when x
changes)
b is the intercept (the value of y when x = 0)
The best-fit line will be the one that optimizes the values of m
(slope) and b (intercept) so that the predicted y values are as
close as possible to the actual data points.
3. Minimizing the Error: The Least Squares Method
To find the best-fit line, we use a method called Least Squares.
The idea behind this method is to minimize the sum of squared
differences between the actual values (data points) and the
predicted values from the line. These differences are called
residuals.
The formula for residuals is:
$$\text{Residual} = y_i - \hat{y}_i$$
Where:
$y_i$ is the actual observed value
$\hat{y}_i$ is the predicted value from the line for that $x_i$
The least squares method minimizes the sum of the squared
residuals:
$$\text{Sum of squared errors (SSE)} = \sum (y_i - \hat{y}_i)^2$$
This method ensures that the line best represents the data where
the sum of the squared differences between the predicted values
and actual values is as small as possible.
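For a single input variable, the least squares slope and intercept
have a well-known closed form, sketched below in plain Python. The
function names fit_line and sse and the hours/scores numbers are
assumptions made for illustration only.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared residuals for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # The best-fit line always passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b

def sse(xs, ys, m, b):
    """Sum of squared residuals between actual and predicted values."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data: hours studied and exam scores.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

m, b = fit_line(hours, scores)
best_sse = sse(hours, scores, m, b)
print(m, b, best_sse)
```

Any other line, for example one with a slightly different slope,
gives a larger SSE on the same data, which is what "best fit" means
here.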
4. Interpretation of the Best-Fit Line
Slope (m): The slope of the best-fit line indicates how much
the dependent variable (y) changes with each unit change in
the independent variable (x). For example if the slope is 5, it
means that for every 1-unit increase in x, the value of y
increases by 5 units.
Intercept (b): The intercept represents the predicted value
of y when x = 0. It’s the point where the line crosses the y-
axis.
In linear regression, some assumptions are made to ensure the
reliability of the model's results.
Limitations
Assumes Linearity: The method assumes the relationship
between the variables is linear. If the relationship is non-linear,
linear regression might not work well.
Sensitivity to Outliers: Outliers can significantly affect the
slope and intercept, skewing the best-fit line.
Hypothesis function in Linear Regression
In linear regression, the hypothesis function is the equation used
to make predictions about the dependent variable based on the
independent variables. It represents the relationship between the
input features and the target output.
For a simple case with one independent variable, the hypothesis
function is:
$$h(x) = \beta_0 + \beta_1 x$$
Where:
$h(x)$ (or $\hat{y}$) is the predicted value of the
dependent variable (y).
$x$ is the independent variable.
$\beta_0$ is the intercept, representing the value of y when x is
0.
$\beta_1$ is the slope, indicating how much y changes for each
unit change in x.
For multiple linear regression (with more than one independent
variable), the hypothesis function expands to:
$$h(x_1, x_2, ..., x_k) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k$$
Where:
$x_1, x_2, ..., x_k$ are the independent variables.
$\beta_0$ is the intercept.
$\beta_1, \beta_2, ..., \beta_k$ are the coefficients, representing the
influence of each respective independent variable on the
predicted output.
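The hypothesis function is just an intercept plus a weighted sum of
the inputs. A minimal sketch, where the function name hypothesis and
the coefficient values are hypothetical and chosen only for
illustration:

```python
def hypothesis(features, coefficients, intercept):
    """Return intercept + sum(coefficient_j * feature_j)."""
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical model with two features: 10 + 2.0 * x1 + 0.5 * x2.
prediction = hypothesis(features=[3.0, 4.0], coefficients=[2.0, 0.5], intercept=10.0)
print(prediction)
```

With a single feature and a single coefficient the same function
reduces to the simple linear regression case.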
Assumptions of the Linear Regression
1. Linearity: The relationship between inputs (X) and the output
(Y) is a straight line.
Linearity
2. Independence of Errors: The errors in predictions should not
affect each other.
3. Constant Variance (Homoscedasticity): The errors should
have equal spread across all values of the input. If the spread
changes (like fans out or shrinks), it's called heteroscedasticity
and it's a problem for the model.
Homoscedasticity
4. Normality of Errors: The errors should follow a normal (bell-
shaped) distribution.
5. No Multicollinearity (for multiple regression): Input variables
shouldn’t be too closely related to each other.
6. No Autocorrelation: Errors shouldn't show repeating patterns,
especially in time-based data.
7. Additivity: The total effect on Y is just the sum of effects from
each X, with no mixing or interaction between them.
To understand multicollinearity in detail, refer to the article:
Multicollinearity.
Types of Linear Regression
When there is only one independent feature it is known as
Simple Linear Regression or Univariate Linear Regression and
when there is more than one feature it is known as Multiple
Linear Regression or Multivariate Regression.
1. Simple Linear Regression
Simple linear regression is used when we want to predict a target
value (dependent variable) using only one input feature
(independent variable). It assumes a straight-line relationship
between the two.
Formula
$$\hat{y} = \theta_0 + \theta_1 x$$
Where:
$\hat{y}$ is the predicted value
$x$ is the input (independent variable)
$\theta_0$ is the intercept (value of $\hat{y}$ when $x = 0$)
$\theta_1$ is the slope or coefficient (how much $\hat{y}$ changes
with one unit of $x$)
Example:
Predicting a person’s salary (y) based on their years of
experience (x).
2. Multiple Linear Regression
Multiple linear regression involves more than one independent
variable and one dependent variable. The equation for multiple
linear regression is:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n$$
where:
$\hat{y}$ is the predicted value
$x_1, x_2, \dots, x_n$ are the independent variables
$\theta_1, \theta_2, \dots, \theta_n$ are the coefficients (weights)
corresponding to each predictor.
$\theta_0$ is the intercept.
The goal is to find the best-fit line that predicts Y accurately for
given inputs X.
Use Cases
Real Estate: Predict property prices using location, size and
other factors.
Finance: Forecast stock prices using interest rates and
inflation data.
Agriculture: Estimate crop yield from rainfall, temperature
and soil quality.
E-commerce: Analyze how price, promotions and seasons
affect sales.
Once you understand linear regression and its types, the next
step is building the model in practice.
Cost function for Linear Regression
In Linear Regression, the cost function measures how far the
predicted values ($\hat{Y}$) are from the actual values (Y). It helps
identify and reduce errors to find the best-fit line. The most
common cost function used is Mean Squared Error (MSE), which
calculates the average of squared differences between actual
and predicted values:
$$\text{Cost function}(J) = \frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2$$
Here, $\hat{y}_i = \theta_1 + \theta_2 x_i$
To minimize this cost, we use Gradient Descent, which iteratively
updates θ1 and θ2 until the MSE reaches its lowest value. This
ensures the line fits the data as accurately as possible.
Gradient Descent for Linear Regression
Gradient descent is an optimization technique used to train a
linear regression model by minimizing the prediction error. It
works by starting with random model parameters and repeatedly
adjusting them to reduce the difference between predicted and
actual values.
Gradient Descent
How it works:
Start with random values for slope and intercept.
Calculate the error between predicted and actual values.
Find how much each parameter contributes to the error
(gradient).
Update the parameters in the direction that reduces the
error.
Repeat until the error is as small as possible.
This helps the model find the best-fit line for the data.
For more details you can refer to: Gradient Descent in Linear
Regression
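The update loop described above can be sketched in plain Python for
the one-feature case. This is a minimal sketch under stated
assumptions: the parameters start at zero rather than at random
values so the run is reproducible, and the learning rate, epoch
count and toy data are illustrative choices, not recommended
settings.

```python
def gradient_descent(xs, ys, learning_rate=0.05, epochs=5000):
    """Fit y_hat = intercept + slope * x by minimizing mean squared error."""
    n = len(xs)
    intercept, slope = 0.0, 0.0  # fixed start (assumption) instead of random
    for _ in range(epochs):
        # Error between predicted and actual values for every point.
        errors = [(intercept + slope * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to each parameter.
        grad_intercept = (2 / n) * sum(errors)
        grad_slope = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step in the direction that reduces the error.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = gradient_descent(xs, ys)
print(intercept, slope)
```

If the learning rate is set too high the updates overshoot and the
error grows instead of shrinking, so in practice it is tuned
together with the number of iterations.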
Evaluation Metrics for Linear Regression
A variety of evaluation measures can be used to determine the
strength of any linear regression model. These assessment
metrics often give an indication of how well the model is
producing the observed outputs.
The most common measurements are:
1. Mean Square Error (MSE)
Mean Squared Error (MSE) is an evaluation metric that
calculates the average of the squared differences between the
actual and predicted values for all the data points. The difference
is squared to ensure that negative and positive differences don't
cancel each other out.
$$MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Here,
$n$ is the number of data points.
$y_i$ is the actual or observed value for the $i^{th}$ data point.
$\hat{y}_i$ is the predicted value for the $i^{th}$ data point.
MSE is a way to quantify the accuracy of a model's predictions.
MSE is sensitive to outliers as large errors contribute significantly
to the overall score.
2. Mean Absolute Error (MAE)
Mean Absolute Error is an evaluation metric used to calculate the
accuracy of a regression model. MAE measures the average
absolute difference between the predicted values and actual
values.
Mathematically MAE is expressed as:
$$MAE = \frac{1}{n}\sum_{i=1}^{n}|Y_i - \hat{Y}_i|$$
Here,
n is the number of observations
$Y_i$ represents the actual values.
$\hat{Y}_i$ represents the predicted values
A lower MAE value indicates better model performance. It is less
sensitive to outliers than MSE because it uses absolute rather
than squared differences.
3. Root Mean Squared Error (RMSE)
Root Mean Squared Error is the square root of the mean of the
squared residuals. It describes how well the observed data points
match the predicted values, i.e. the model's absolute fit to the data.
In mathematical notation, it can be expressed as:
$$RMSE = \sqrt{\frac{RSS}{n}} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}}$$
Where:
$n$ : Number of observations
$y_i$ : Actual value
$\hat{y}_i$ : Predicted value
RMSE is in the same unit as the target variable and highlights
larger errors more clearly.
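The three error metrics above differ only in how each residual is
weighted, which a short sketch makes concrete. The function names
and the sample values are assumptions made for illustration.

```python
import math

def mse(y_true, y_pred):
    """Mean of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean of absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Square root of MSE, in the same unit as the target."""
    return math.sqrt(mse(y_true, y_pred))

# Hypothetical values; the last prediction is off by 3 units.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 12.0]

print(mse(actual, predicted), mae(actual, predicted), rmse(actual, predicted))
```

The single 3-unit miss contributes 9 of the 10.25 total squared
error, but only 3 of the 4.5 total absolute error, which illustrates
why MSE and RMSE react more strongly to large errors than MAE.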
4. Coefficient of Determination (R-squared)
R-squared is a statistic that indicates how much of the variation
in the target the model can explain or capture. For a least squares
fit with an intercept, evaluated on the training data, it lies
between 0 and 1; on new data it can be negative when the model
predicts worse than the mean. In general, the better the model
matches the data, the greater the R-squared number.
In mathematical notation, it can be expressed as:
$$R^2 = 1 - \frac{RSS}{TSS}$$
Residual Sum of Squares (RSS): The sum of the squared
residuals over all data points is known as the residual sum of
squares or RSS. It measures the difference between the output
that was observed and the output that was predicted.
$$RSS = \sum_{i=1}^{n}(y_i - b_0 - b_1 x_i)^2$$
Total Sum of Squares (TSS): The sum of the squared
deviations of the data points from the mean of the response
variable is known as the total sum of squares or TSS.
$$TSS = \sum_{i=1}^{n}(y_i - \bar{y})^2$$
The R-squared metric is a measure of the proportion of variance in
the dependent variable that is explained by the independent
variables in the model.
5. Adjusted R-Squared Error
Adjusted $R^2$ measures the proportion of variance in the
dependent variable that is explained by the independent variables
in a regression model. It accounts for the number of predictors in
the model and penalizes the model for including irrelevant
predictors that don't contribute significantly to explaining the
variance in the dependent variable.
Mathematically, adjusted $R^2$ is expressed as:
$$\text{Adjusted } R^2 = 1 - \frac{(1 - R^2)(n - 1)}{n - k - 1}$$
Here,
n is the number of observations
k is the number of predictors in the model
$R^2$ is the coefficient of determination
It penalizes the inclusion of unnecessary predictors, helping to
prevent overfitting.
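Both R-squared and its adjusted form follow directly from RSS and
TSS. The sketch below uses hypothetical function names and sample
values, and assumes a model with a single predictor (k = 1).

```python
def r_squared(y_true, y_pred):
    """1 - RSS/TSS, the share of variance explained by the model."""
    mean_y = sum(y_true) / len(y_true)
    rss = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    tss = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - rss / tss

def adjusted_r_squared(r2, n, k):
    """Penalize R-squared for the number of predictors k given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical actual values and predictions from a one-predictor model.
actual = [3.0, 5.0, 7.0, 9.0, 11.0]
predicted = [2.8, 5.3, 6.9, 9.4, 10.6]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n=len(actual), k=1)
print(r2, adj_r2)
```

Holding R-squared fixed and increasing k lowers the adjusted value,
which is the penalty for extra predictors described above.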
Regularization Techniques for Linear Models
1. Lasso Regression (L1 Regularization)
Lasso Regression is a technique used for regularizing a linear
regression model. It adds a penalty term to the linear regression
objective function to prevent overfitting.
The objective function after applying lasso regression is:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2 + \lambda\sum_{j=1}^{n}|\theta_j|$$
the first term is the least squares loss, representing the
squared difference between predicted and actual values.
the second term is the L1 regularization term, which penalizes
the sum of absolute values of the regression coefficients $\theta_j$.
2. Ridge Regression (L2 Regularization)
Ridge regression is a linear regression technique that adds a
regularization term to the standard linear objective. Again, the
goal is to prevent overfitting by penalizing large coefficients in
the linear regression equation. It is useful when the dataset has
multicollinearity, where predictor variables are highly correlated.
The objective function after applying ridge regression is:
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2 + \lambda\sum_{j=1}^{n}\theta_j^2$$
the first term is the least squares loss, representing the
squared difference between predicted and actual values.
the second term is the L2 regularization term, which penalizes
the sum of squares of the regression coefficients $\theta_j$.
3. Elastic Net Regression
Elastic Net Regression is a hybrid regularization technique that
combines the power of both L1 and L2 regularization in the linear
regression objective.
$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}_i - y_i)^2 + \alpha\lambda\sum_{j=1}^{n}|\theta_j| + \frac{1}{2}(1-\alpha)\lambda\sum_{j=1}^{n}\theta_j^2$$
the first term is the least squares loss.
the second term is the L1 regularization and the third is the L2
(ridge) regularization.
$\lambda$ is the overall regularization strength.
$\alpha$ controls the mix between L1 and L2 regularization.
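The three objectives above share the same least squares term and
differ only in the penalty. A minimal sketch that evaluates each one
for given predictions and coefficients; the function names and
numbers are hypothetical, and theta is assumed to hold only the
penalized coefficients (the sums above start at j = 1, so the
intercept is left out).

```python
def least_squares_loss(y_true, y_pred):
    """(1 / 2m) * sum of squared differences."""
    m = len(y_true)
    return sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / (2 * m)

def lasso_objective(y_true, y_pred, theta, lam):
    """Least squares loss plus the L1 penalty."""
    return least_squares_loss(y_true, y_pred) + lam * sum(abs(t) for t in theta)

def ridge_objective(y_true, y_pred, theta, lam):
    """Least squares loss plus the L2 penalty."""
    return least_squares_loss(y_true, y_pred) + lam * sum(t ** 2 for t in theta)

def elastic_net_objective(y_true, y_pred, theta, lam, alpha):
    """Least squares loss plus a mix of L1 and L2 penalties."""
    l1 = sum(abs(t) for t in theta)
    l2 = sum(t ** 2 for t in theta)
    return (least_squares_loss(y_true, y_pred)
            + alpha * lam * l1
            + 0.5 * (1 - alpha) * lam * l2)

# Hypothetical values for illustration.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.5]
theta = [2.0, -1.0]

print(lasso_objective(y_true, y_pred, theta, lam=0.1))
print(ridge_objective(y_true, y_pred, theta, lam=0.1))
print(elastic_net_objective(y_true, y_pred, theta, lam=0.1, alpha=0.5))
```

Setting alpha to 1 in the elastic net function recovers the lasso
objective exactly, which is a quick way to check the mixing term.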
Now that we have learned how a linear regression model works,
we will implement it.
Python Implementation of Linear Regression
1. Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
2. Generating Random Dataset
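Generates 50 random input values (X) between 0 and 100 and a
target (Y) that follows a linear trend with added random noise.
np.random.seed(42)  # fix the seed so the data is reproducible
X = np.random.rand(50, 1) * 100
Y = 3.5 * X + np.random.randn(50, 1) * 20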
3. Creating and Training Linear Regression Model
model = LinearRegression()
model.fit(X, Y)
4. Predicting Y Values
Y_pred = model.predict(X)
5. Visualizing the Regression Line
plt.figure(figsize=(8,6))
plt.scatter(X, Y, color='blue', label='Data Points')
plt.plot(X, Y_pred, color='red', linewidth=2, label='Regression Line')
plt.title('Linear Regression on Random Dataset')
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.grid(True)
plt.show()
Output:
Regression Line
6. Slope and Intercept
print("Slope (Coefficient):", model.coef_[0][0])
print("Intercept:", model.intercept_[0])
Output:
Slope (Coefficient): 3.4553132007706204
Intercept: 1.9337854893777546
Why Linear Regression is Important
Here’s why linear regression is important:
Simplicity and Interpretability: It’s easy to understand and
interpret, making it a starting point for learning about machine
learning.
Predictive Ability: Helps predict future outcomes based on
past data, making it useful in various fields like finance,
healthcare and marketing.
Basis for Other Models: Many advanced algorithms, like
logistic regression or neural networks, build on the concepts of
linear regression.
Efficiency: It’s computationally efficient and works well for
problems with a linear relationship.
Widely Used: It’s one of the most widely used techniques in
both statistics and machine learning for regression tasks.
Analysis: It provides insights into relationships between
variables (e.g., how much one variable influences another).
Advantages
Linear regression is a relatively simple algorithm, making it
easy to understand and implement. The coefficients of the
linear regression model can be interpreted as the change in
the dependent variable for a one-unit change in the
independent variable, providing insights into the relationships
between variables.
Linear regression is computationally efficient and can
handle large datasets effectively. It can be trained quickly on
large datasets, making it suitable for real-time applications.
Linear regression gives stable results when the data contains
few extreme values. Ordinary least squares itself is sensitive
to outliers, so they should be checked for before fitting.
Linear regression often serves as a good baseline model for
comparison with more complex machine learning algorithms.
Linear regression is a well-established algorithm with a rich
history and is widely available in various machine learning
libraries and software packages.
Limitations
Linear regression assumes a linear relationship between the
dependent and independent variables. If the relationship is not
linear, the model may not perform well.
Linear regression is sensitive to multicollinearity, which
occurs when there is a high correlation between independent
variables. Multicollinearity can inflate the variance of the
coefficients and lead to unstable model predictions.
Linear regression assumes that the features are already in a
suitable form for the model. Feature engineering may be
required to transform features into a format that can be
effectively used by the model.
Linear regression is susceptible to both overfitting and
underfitting. Overfitting occurs when the model learns the
training data too well and fails to generalize to unseen data.
Underfitting occurs when the model is too simple to capture
the underlying relationships in the data.
Linear regression provides limited explanatory power for
complex relationships between variables. More advanced
machine learning techniques may be necessary for deeper
insights.
Logistic Regression in Machine Learning
Last Updated : 18 Nov, 2025
Logistic Regression is a supervised machine learning algorithm
used for classification problems. Unlike linear regression which
predicts continuous values it predicts the probability that an input
belongs to a specific class.
It is used for binary classification where the output can be
one of two possible categories such as Yes/No, True/False or
0/1.
It uses sigmoid function to convert inputs into a probability
value between 0 and 1.
Types of Logistic Regression
Logistic regression can be classified into three main types based
on the nature of the dependent variable:
1. Binomial Logistic Regression: This type is used when the
dependent variable has only two possible categories. Examples
include Yes/No, Pass/Fail or 0/1. It is the most common form of
logistic regression and is used for binary classification
problems.
2. Multinomial Logistic Regression: This is used when the
dependent variable has three or more possible categories that
are not ordered. For example, classifying animals into
categories like "cat," "dog" or "sheep." It extends the binary
logistic regression to handle multiple classes.
3. Ordinal Logistic Regression: This type applies when the
dependent variable has three or more categories with a natural
order or ranking. Examples include ratings like "low," "medium"
and "high." It takes the order of the categories into account
when modeling.
Assumptions of Logistic Regression
Understanding the assumptions behind logistic regression is
important to ensure the model is applied correctly. The main
assumptions are:
1. Independent observations: Each data point is assumed to
be independent of the others, meaning there should be no
correlation or dependence between the input samples.
2. Binary dependent variables: The dependent variable is
assumed to be binary, meaning it can take only two values. For
more than two categories, the SoftMax function is used.
3. Linearity relationship between independent variables and
log odds: The model assumes a linear relationship between
the independent variables and the log odds of the dependent
variable which means the predictors affect the log odds in a
linear way.
4. No outliers: The dataset should not contain extreme outliers
as they can distort the estimation of the logistic regression
coefficients.
5. Large sample size: It requires a sufficiently large sample
size to produce reliable and stable results.
Understanding Sigmoid Function
1. The sigmoid function is an important part of logistic regression
which is used to convert the raw output of the model into a
probability value between 0 and 1.
2. This function takes any real number and maps it into the range
0 to 1 forming an "S" shaped curve called the sigmoid curve or
logistic curve. Because probabilities must lie between 0 and 1, the
sigmoid function is perfect for this purpose.
3. In logistic regression, we use a threshold value usually 0.5 to
decide the class label.
If the sigmoid output is equal to or above the threshold, the
input is classified as Class 1.
If it is below the threshold, the input is classified as Class 0.
This approach helps to transform continuous input values into
meaningful class predictions.
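The sigmoid and the threshold rule can be sketched as follows. The
function names are assumptions made for this example, and the
two-branch form of the sigmoid is one common way to avoid overflow
for large negative inputs, not the only possible implementation.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, rewrite as e^z / (1 + e^z) so exp() never overflows.
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def classify(z, threshold=0.5):
    """Return class 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

for z in (-2.0, 0.0, 3.0):
    print(z, sigmoid(z), classify(z))
```

With the default threshold of 0.5 the rule is equivalent to checking
whether z is non-negative, since sigmoid(0) is exactly 0.5.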
How does Logistic Regression work?
Logistic regression transforms the continuous output of the linear
regression function into a probability using the sigmoid function,
which maps any real-valued input to a value between 0 and 1. This
function is known as the logistic function.
Suppose we have input features represented as a matrix:
$$X = \begin{bmatrix} x_{11} & \dots & x_{1m} \\ x_{21} & \dots & x_{2m} \\ \vdots & \ddots & \vdots \\ x_{n1} & \dots & x_{nm} \end{bmatrix}$$
and the dependent variable is $Y$, having only binary values, i.e. 0 or
1.
$$Y = \begin{cases} 0 & \text{if Class 1} \\ 1 & \text{if Class 2} \end{cases}$$
Then, apply the multi-linear function to the input variables X.
$$z = \left(\sum_{i=1}^{m} w_i x_i\right) + b$$
Here $x_i$ is the $i^{th}$ feature of an observation in X,
$w = [w_1, w_2, w_3, \cdots, w_m]$ is the vector of weights or
coefficients and $b$ is the bias term, also known as the intercept.
This can be written more simply as the dot product of the weights
and the inputs, plus the bias:
$$z = w \cdot X + b$$
At this stage, $z$ is a continuous value from the linear function.
Logistic regression then applies the sigmoid function to $z$ to
convert it into a probability between 0 and 1, i.e. the predicted y,
which can be used to predict the class.
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Sigmoid function
As shown above, the sigmoid function converts the continuous
variable into a probability, i.e. a value between 0 and 1.
$\sigma(z)$ tends towards 1 as $z \to \infty$
$\sigma(z)$ tends towards 0 as $z \to -\infty$
$\sigma(z)$ is always bounded between 0 and 1
where the probability of belonging to each class can be measured as:
$$P(y=1) = \sigma(z), \quad P(y=0) = 1 - \sigma(z)$$
Logistic Regression Equation and Odds:
It models the odds of the dependent event occurring, which is the
ratio of the probability of the event to the probability of it not
occurring:
$$\frac{p(x)}{1 - p(x)} = e^{z}$$
Taking the natural logarithm of the odds gives the log-odds or logit:
$$\begin{aligned}
\log\left[\frac{p(x)}{1 - p(x)}\right] &= z \\
\log\left[\frac{p(x)}{1 - p(x)}\right] &= w \cdot X + b \\
\frac{p(x)}{1 - p(x)} &= e^{w \cdot X + b} \quad \cdots \text{Exponentiate both sides} \\
p(x) &= e^{w \cdot X + b} \cdot (1 - p(x)) \\
p(x) &= e^{w \cdot X + b} - e^{w \cdot X + b} \cdot p(x) \\
p(x) + e^{w \cdot X + b} \cdot p(x) &= e^{w \cdot X + b} \\
p(x)\left(1 + e^{w \cdot X + b}\right) &= e^{w \cdot X + b} \\
p(x) &= \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}}
\end{aligned}$$
then the final logistic regression equation will be:
$$p(X; b, w) = \frac{e^{w \cdot X + b}}{1 + e^{w \cdot X + b}} = \frac{1}{1 + e^{-(w \cdot X + b)}}$$
This formula represents the probability of the input belonging to
Class 1.
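The relationship between probability, odds and log-odds can be
checked numerically. In this sketch the function names and the
value of z (standing in for w·X + b) are assumptions chosen for
illustration.

```python
import math

def probability(z):
    """Logistic function: probability of Class 1 for linear score z."""
    return 1.0 / (1.0 + math.exp(-z))

def odds(p):
    """Ratio of the probability of the event to that of it not occurring."""
    return p / (1.0 - p)

def log_odds(p):
    """Natural logarithm of the odds (the logit)."""
    return math.log(odds(p))

z = 0.8  # hypothetical value of w.X + b
p = probability(z)

print(p)            # probability of Class 1
print(odds(p))      # equals e^z up to rounding
print(log_odds(p))  # recovers z up to rounding
```

Taking the logit of the predicted probability returns the original
linear score, which is why the model is linear in the log-odds.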
Likelihood Function for Logistic Regression
The goal is to find weights $w$ and bias $b$ that maximize the
likelihood of observing the data.
For each data point $i$:
for $y = 1$, the predicted probability is $p(X; b, w) = p(x)$
for $y = 0$, the predicted probability is $1 - p(X; b, w) = 1 - p(x)$
$$L(b, w) = \prod_{i=1}^{n} p(x_i)^{y_i} (1 - p(x_i))^{1 - y_i}$$
Taking natural logs on both sides:
$$\begin{aligned}
\log(L(b, w)) &= \sum_{i=1}^{n} y_i \log p(x_i) + (1 - y_i) \log(1 - p(x_i)) \\
&= \sum_{i=1}^{n} y_i \log p(x_i) + \log(1 - p(x_i)) - y_i \log(1 - p(x_i)) \\
&= \sum_{i=1}^{n} \log(1 - p(x_i)) + \sum_{i=1}^{n} y_i \log\frac{p(x_i)}{1 - p(x_i)} \\
&= \sum_{i=1}^{n} -\log\left(1 + e^{w \cdot x_i + b}\right) + \sum_{i=1}^{n} y_i (w \cdot x_i + b)
\end{aligned}$$
This is known as the log-likelihood function.
Gradient of the log-likelihood function
To find the best $w$ and $b$ we use gradient ascent on the
log-likelihood function $\ell(b, w) = \log L(b, w)$. The gradient with
respect to each weight $w_j$ is:
$$\begin{aligned}
\frac{\partial \ell(b, w)}{\partial w_j} &= -\sum_{i=1}^{n} \frac{e^{w \cdot x_i + b}}{1 + e^{w \cdot x_i + b}} \, x_{ij} + \sum_{i=1}^{n} y_i x_{ij} \\
&= -\sum_{i=1}^{n} p(x_i; b, w) \, x_{ij} + \sum_{i=1}^{n} y_i x_{ij} \\
&= \sum_{i=1}^{n} \left( y_i - p(x_i; b, w) \right) x_{ij}
\end{aligned}$$
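The gradient above, the sum over samples of (y_i − p(x_i; b, w)) x_ij, can be turned into a small training loop. The following is a minimal sketch, assuming NumPy is available; the function names `sigmoid` and `fit_logistic`, the learning rate, the iteration count and the toy data are all illustrative choices, not part of the original article.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Gradient ascent on the mean log-likelihood; returns (w, b)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)               # p(x_i; b, w) for every sample
        error = y - p                        # (y_i - p(x_i; b, w))
        w += lr * (X.T @ error) / n_samples  # ascent step for each w_j
        b += lr * error.mean()               # ascent step for the bias
    return w, b

# Toy data (made up for illustration): larger x tends to mean class 1.
X = np.array([[0.5], [1.0], [1.5], [2.0], [3.0], [3.5], [4.0], [4.5]])
y = np.array([0, 0, 0, 1, 0, 1, 1, 1])

w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
print("w =", w, "b =", b)
print("predicted probabilities:", np.round(probs, 3))
```

Dividing the gradient by the number of samples only rescales the step size; it does not change the direction of ascent. Library solvers such as the one in scikit-learn use more robust optimizers and add regularization by default, so their coefficients will generally differ from this sketch.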
Terminologies involved in Logistic Regression
Here are some common terms involved in logistic regression:
1. Independent Variables: These are the input features or
predictor variables used to make predictions about the
dependent variable.
2. Dependent Variable: This is the target variable that we aim
to predict. In logistic regression, the dependent variable is
categorical.
3. Logistic Function: This function transforms a linear
combination of the independent variables into a probability
between 0 and 1, which represents the likelihood that the
dependent variable equals 1.
4. Odds: This is the ratio of the probability of an event
happening to the probability of it not happening. It differs from
probability because probability is the ratio of occurrences to
total possibilities.
5. Log-Odds (Logit): The natural logarithm of the odds. In
logistic regression, the log-odds are modeled as a linear
combination of the independent variables and the intercept.
6. Coefficients: These are the parameters estimated by the
logistic regression model, which show how strongly each
independent variable affects the log-odds of the dependent variable.
7. Intercept: The constant term in the logistic regression model
which represents the log-odds when all independent variables
are equal to zero.
8. Maximum Likelihood Estimation (MLE): This method is
used to estimate the coefficients of the logistic regression
model by maximizing the likelihood of observing the given data.
Implementation for Logistic Regression
Now, let's see the implementation of logistic regression in Python.
Here we will be implementing two main types of Logistic
Regression:
1. Binomial Logistic regression:
In binomial logistic regression, the target variable can only have
two possible values such as "0" or "1", "pass" or "fail". The sigmoid
function is used for prediction.
We will use the scikit-learn library and its breast cancer
dataset to implement a Logistic Regression model for
classification.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the features and the binary labels
X, y = load_breast_cancer(return_X_y=True)
# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=23)
# A large max_iter gives the solver room to converge on these unscaled features
clf = LogisticRegression(max_iter=10000, random_state=0)
clf.fit(X_train, y_train)
acc = accuracy_score(y_test, clf.predict(X_test)) * 100
print(f"Logistic Regression model accuracy: {acc:.2f}%")
Output:
Logistic Regression model accuracy: 96.49%
This code uses logistic regression to classify whether a sample
from the breast cancer dataset is malignant or benign.
2. Multinomial Logistic Regression:
Target variable can have 3 or more possible types which are not
ordered i.e types have no quantitative significance like “disease A”
vs “disease B” vs “disease C”.
In this case, the softmax function is used in place of the sigmoid
function. Softmax function for K classes will be:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$$
Here $K$ represents the number of elements in the
vector $z$ and $i, j$ iterate over all the elements in the vector.
Then the probability for class $c$ will be:
$$P(Y = c \mid \vec{X} = x) = \frac{e^{w_c \cdot x + b_c}}{\sum_{k=1}^{K} e^{w_k \cdot x + b_k}}$$
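To make the softmax formula concrete, here is a minimal sketch assuming NumPy. The helper name `softmax` and the example scores are illustrative. Subtracting the largest score before exponentiating is a common numerical-stability step and does not change the result, because the same factor cancels in the numerator and denominator.

```python
import numpy as np

def softmax(z):
    """Convert a vector of K scores into K probabilities that sum to 1."""
    z = np.asarray(z, dtype=float)
    shifted = z - z.max()        # stability: avoids overflow in exp
    exp_z = np.exp(shifted)
    return exp_z / exp_z.sum()

scores = [2.0, 1.0, 0.1]         # made-up class scores w_k . x + b_k
probs = softmax(scores)
print(probs, probs.sum())

# With two classes and scores (z, 0), softmax reduces to the sigmoid of z.
z = 1.3
two_class = softmax([z, 0.0])[0]
sigmoid_z = 1.0 / (1.0 + np.exp(-z))
print(two_class, sigmoid_z)
```

The last two printed values agree, which is why the sigmoid is used for the binomial case and softmax for the multinomial case.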
Below is an example of implementing multinomial logistic
regression using the Digits dataset from scikit-learn:
from sklearn.model_selection import train_test_split
from sklearn import datasets, linear_model, metrics

# Load the 8x8 digit images (flattened to 64 features) and their labels
digits = datasets.load_digits()
X = digits.data
y = digits.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=1)
reg = linear_model.LogisticRegression(max_iter=10000, random_state=0)
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
print(f"Logistic Regression model accuracy: {metrics.accuracy_score(y_test, y_pred) * 100:.2f}%")
Output:
Logistic Regression model accuracy: 96.66%
This model is used to predict one of 10 digits (0-9) based on the
image features.
How to Evaluate Logistic Regression Model?
Evaluating the logistic regression model helps assess its
performance and ensure it generalizes well to new, unseen data.
The following metrics are commonly used:
1. Accuracy: Accuracy provides the proportion of correctly
classified instances.
$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total}}$$
2. Precision: Precision focuses on the accuracy of positive
predictions.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}$$
3. Recall (Sensitivity or True Positive Rate): Recall measures
the proportion of correctly predicted positive instances among all
actual positive instances.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}$$
4. F1 Score: F1 score is the harmonic mean of precision and
recall.
$$\text{F1 Score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
5. Area Under the Receiver Operating Characteristic Curve
(AUC-ROC): The ROC curve plots the true positive rate against
the false positive rate at various thresholds. AUC-ROC measures
the area under this curve which provides an aggregate measure of
a model's performance across different classification thresholds.
6. Area Under the Precision-Recall Curve (AUC-PR): Similar to
AUC-ROC, AUC-PR measures the area under the precision-recall
curve, which summarizes a model's performance
across different precision-recall trade-offs.
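The first four metrics can be computed directly from the confusion counts. The sketch below uses only plain Python; the function name `classification_metrics` and the label lists are made up for illustration, and class 1 is assumed to be the positive class.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = classification_metrics(y_true, y_pred)
print(scores)
```

In practice scikit-learn's `accuracy_score`, `precision_score`, `recall_score` and `f1_score` in `sklearn.metrics` compute the same quantities; the manual version is shown only to tie the numbers back to the formulas.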
Differences Between Linear and Logistic
Regression
Logistic regression and linear regression differ in their application
and output. Here's a comparison:
| Aspect | Linear Regression | Logistic Regression |
|---|---|---|
| Definition | Used to predict a continuous dependent variable from a given set of independent variables. | Used to predict a categorical dependent variable from a given set of independent variables. |
| Problem Type | Used for solving regression problems. | Used for solving classification problems. |
| Output Type | Predicts the value of continuous variables. | Predicts values of categorical variables. |
| Curve/Model Fitting | We find the best-fit line. | We find an S-curve. |
| Estimation Method | Least squares estimation is used to estimate the parameters. | Maximum likelihood estimation is used to estimate the parameters. |
| Output Example | The output must be a continuous value such as price, age etc. | The output must be a categorical value such as 0 or 1, Yes or No, etc. |
| Relationship Requirement | Requires a linear relationship between dependent and independent variables. | Does not require a linear relationship between dependent and independent variables. |
| Collinearity | There may be collinearity between the independent variables, although strong collinearity makes the coefficient estimates unstable. | There should be little to no collinearity between independent variables. |
Decision Tree
Last Updated : 30 Jun, 2025
A Decision Tree helps us to make decisions by mapping out different choices and
their possible outcomes. It’s used in machine learning for tasks like classification
and prediction. In this article, we’ll see more about Decision Trees, their types and
other core concepts.
A Decision Tree helps us make decisions by showing different options and how
they are related. It has a tree-like structure that starts with one main question
called the root node which represents the entire dataset. From there, the tree
branches out into different possibilities based on features in the data.
Root Node: Starting point representing the whole dataset.
Branches: Lines connecting nodes showing the flow from one decision to
another.
Internal Nodes: Points where decisions are made based on data features.
Leaf Nodes: End points of the tree where the final decision or prediction is
made.
Decision Tree
A Decision Tree also helps with decision-making by showing possible outcomes
clearly. By looking at the "branches" we can quickly compare options and figure
out the best choice.
There are mainly two types of Decision Trees based on the target variable:
1. Classification Trees: Used for predicting categorical outcomes like spam
or not spam. These trees split the data based on features to classify data into
predefined categories.
2. Regression Trees: Used for predicting continuous outcomes like predicting
house prices. Instead of assigning categories, it provides numerical predictions
based on the input features.
How Decision Trees Work?
1. Start with the Root Node: It begins with a main question at the root node which
is derived from the dataset’s features.
2. Ask Yes/No Questions: From the root, the tree asks a series of yes/no
questions to split the data into subsets based on specific attributes.
3. Branching Based on Answers: Each question leads to different branches:
If the answer is yes, the tree follows one path.
If the answer is no, the tree follows another path.
4. Continue Splitting: This branching continues through further decisions,
narrowing the data down step by step.
5. Reach the Leaf Node: The process ends when there are no more useful
questions to ask leading to the leaf node where the final decision or prediction is
made.
Let’s look at a simple example to understand how it works. Imagine we need to
decide whether to drink coffee based on the time of day and how tired we feel. The
tree first checks the time:
1. In the morning: It asks “Tired?”
If yes, the tree suggests drinking coffee.
If no, it says no coffee is needed.
2. In the afternoon: It asks again “Tired?”
If yes, it suggests drinking coffee.
If no, no coffee is needed.
Example
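The coffee example can be written as nested conditionals, which is essentially what a trained decision tree encodes. This is a hand-written sketch in plain Python, not a learned model; the function name `should_drink_coffee` and the string values for the time of day are illustrative.

```python
def should_drink_coffee(time_of_day: str, tired: bool) -> bool:
    """Walk the example tree: root node checks the time, next node checks tiredness."""
    if time_of_day == "morning":        # root node: first branch
        if tired:                       # internal node: "Tired?"
            return True                 # leaf: drink coffee
        return False                    # leaf: no coffee needed
    if time_of_day == "afternoon":      # root node: second branch
        if tired:
            return True
        return False
    raise ValueError(f"unknown time of day: {time_of_day!r}")

for time_of_day in ("morning", "afternoon"):
    for tired in (True, False):
        print(time_of_day, tired, "->", should_drink_coffee(time_of_day, tired))
```

Both branches give the same answer here, so in this toy tree the time of day adds no information and only the "Tired?" question matters.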
Splitting Criteria in Decision Trees
In a Decision Tree, the process of splitting data at each node is important. The
splitting criteria finds the best feature to split the data on. Common splitting criteria
include Gini Impurity and Entropy.
Gini Impurity: This criterion measures how "impure" a node is. The lower
the Gini Impurity the better the feature splits the data into distinct categories.
Entropy: This measures the amount of uncertainty or disorder in the data.
The tree tries to reduce the entropy by splitting the data on features that provide
the most information about the target variable.
These criteria help decide which features are useful for making the best split at
each decision point in the tree.
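Both criteria can be computed from the class proportions in a node. The sketch below uses the standard definitions, Gini = 1 − Σ p_k² and Entropy = −Σ p_k log2(p_k), where p_k is the fraction of samples in class k. The function names and the tiny label lists are illustrative.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity of a node: 0 for a pure node, higher when classes are mixed."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def entropy(labels):
    """Entropy in bits: 0 for a pure node, 1 for a 50/50 two-class node."""
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def weighted_impurity(groups, impurity):
    """Impurity of a split: child impurities weighted by child size."""
    total = sum(len(group) for group in groups)
    return sum(len(group) / total * impurity(group) for group in groups)

parent = ["spam", "spam", "ham", "ham"]
good_split = [["spam", "spam"], ["ham", "ham"]]   # each child is pure
poor_split = [["spam", "ham"], ["spam", "ham"]]   # children as mixed as the parent

print("parent gini:", gini(parent), "parent entropy:", entropy(parent))
print("good split gini:", weighted_impurity(good_split, gini))
print("poor split gini:", weighted_impurity(poor_split, gini))
```

A tree-building algorithm would evaluate candidate splits this way and keep the one with the lowest weighted impurity, which for entropy is the same as the highest information gain.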
Pruning in Decision Trees
Pruning is an important technique used to prevent overfitting in Decision
Trees. Overfitting occurs when a tree becomes too deep and starts to memorize
the training data rather than learning general patterns. This leads to poor
performance on new, unseen data.
This technique reduces the complexity of the tree by removing branches
that have little predictive power. It improves model performance by helping the
tree generalize better to new data. It also makes the model simpler and faster to
deploy.
It is useful when a Decision Tree is too deep and starts to capture noise in
the data.
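One way to prune in scikit-learn is cost-complexity pruning through the `ccp_alpha` parameter of `DecisionTreeClassifier`. The sketch below compares an unpruned tree with a pruned one on the breast cancer dataset; the value `ccp_alpha=0.01` and the split settings are arbitrary choices for illustration, and a real project would tune them, for example with cross-validation.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Same data and seed, so the pruned tree is a sub-tree of the full one.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

for name, tree in [("unpruned", full_tree), ("pruned", pruned_tree)]:
    print(f"{name}: nodes={tree.tree_.node_count}, depth={tree.get_depth()}, "
          f"test accuracy={tree.score(X_test, y_test):.3f}")
```

The pruned tree never has more nodes than the full one. Whether its test accuracy improves depends on the data and on the chosen `ccp_alpha`.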
Advantages of Decision Trees
Easy to Understand: Decision Trees are visual which makes it easy to
follow the decision-making process.
Versatility: Can be used for both classification and regression problems.
No Need for Feature Scaling: Unlike many machine learning models, they
don’t require us to scale or normalize our data.
Handles Non-linear Relationships: They capture complex, non-linear
relationships between features and outcomes effectively.
Interpretability: The tree structure is easy to interpret, allowing
users to understand the reasoning behind each decision.
Handles Missing Data: It can handle missing values by using strategies
like assigning the most common value or ignoring missing data during splits.
Disadvantages of Decision Trees
Overfitting: They can overfit the training data if they are too deep which
means they memorize the data instead of learning general patterns. This leads
to poor performance on unseen data.
Instability: It can be unstable which means that small changes in the data
may lead to significant differences in the tree structure and predictions.
Bias towards Features with Many Categories: They can become biased
toward features with many distinct values, focusing too much on them and
potentially missing other important features, which can reduce prediction
accuracy.
Difficulty in Capturing Complex Interactions: Decision Trees may
struggle to capture complex interactions between features, which makes
them less effective for certain types of data.
Computationally Expensive for Large Datasets: For large datasets,
building and pruning a Decision Tree can be computationally intensive,
especially as the tree depth increases.
Applications of Decision Trees
Decision Trees are used across various fields due to their simplicity, interpretability
and versatility. Let's see some key applications:
1. Loan Approval in Banking: Banks use Decision Trees to assess whether a
loan application should be approved. The decision is based on factors like
credit score, income, employment status and loan history. This helps predict
approval or rejection, enabling quick and reliable decisions.
2. Medical Diagnosis: In healthcare they assist in diagnosing diseases. For
example, they can predict whether a patient has diabetes based on clinical data
like glucose levels, BMI and blood pressure. This helps classify patients into
diabetic or non-diabetic categories, supporting early diagnosis and treatment.
3. Predicting Exam Results in Education: Educational institutions use Decision Trees to
predict whether a student will pass or fail based on factors like attendance,
study time and past grades. This helps teachers identify at-risk students and
offer targeted support.
4. Customer Churn Prediction: Companies use Decision Trees to predict
whether a customer will leave or stay based on behavior patterns, purchase
history, and interactions. This allows businesses to take proactive steps to
retain customers.
5. Fraud Detection: In finance, Decision Trees are used to detect fraudulent
activities, such as credit card fraud. By analyzing past transaction data and
patterns, Decision Trees can identify suspicious activities and flag them for
further investigation.
A decision tree can also be used to help build automated predictive models which
have applications in machine learning, data mining and statistics. By mastering
Decision Trees, we can gain a deeper understanding of data and make more
informed decisions across different fields.
Random Forest Regression in Python
Last Updated : 13 Nov, 2025
A random forest is an ensemble learning method that combines
the predictions from multiple decision trees to produce a more
accurate and stable prediction. It can be used for both
classification and regression tasks. In a regression task, we can
use the Random Forest Regression technique for predicting
numerical values. It predicts continuous values by averaging the
results of multiple decision trees.
Working of Random Forest Regression
Random Forest Regression works by creating multiple decision
trees, each trained on a random subset of the data. The process
begins with Bootstrap sampling where random rows of data are
selected with replacement to form different training datasets for
each tree. After this we do feature sampling where only a random
subset of features is considered while building each tree, ensuring
diversity in the models.
After the trees are trained, each tree makes a prediction and the
final prediction for regression tasks is the average of all the
individual tree predictions. This process is called Aggregation.
Random Forest Regression Model Working
This approach is beneficial because individual decision trees may
have high variance and are prone to overfitting, especially with
complex data. By averaging the predictions from multiple
decision trees, Random Forest reduces this variance, leading to
more accurate and stable predictions and better generalization
of the model.
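The bootstrap-then-average idea can be sketched by hand with individual trees. This is an illustration of the mechanism, not how `RandomForestRegressor` is implemented internally. It assumes NumPy and scikit-learn's `DecisionTreeRegressor`, and the noisy sine data, the tree count and the variable names are made up. With a single input feature there is nothing to sample at the feature level, so only the row sampling is shown.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data (made up): a noisy sine curve with one feature.
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=200)

n_trees = 25
trees = []
for _ in range(n_trees):
    # Bootstrap sampling: draw row indices with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeRegressor(random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Aggregation: average the predictions of all trees.
X_new = np.array([[2.5], [7.0]])
per_tree = np.array([tree.predict(X_new) for tree in trees])  # shape (n_trees, 2)
forest_pred = per_tree.mean(axis=0)

print("spread of single-tree predictions:", per_tree.std(axis=0))
print("averaged prediction:", forest_pred)
```

The printed spread shows how much individual trees disagree; the averaged prediction is what the ensemble would report.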
Implementing Random Forest Regression in
Python
We will be implementing random forest regression on salaries
data.
1. Importing Libraries
Here we are importing numpy, pandas, matplotlib, seaborn and scikit-learn.
RandomForestRegressor: This class is used to train a
random forest regression model.
LabelEncoder: This class is used to encode categorical data
into numerical values.
KNNImputer: This class is used to impute missing values in
a dataset using a k-nearest neighbors approach.
train_test_split: This function is used to split a dataset into
training and testing sets.
StandardScaler: This class is used to standardize features
by removing the mean and scaling to unit variance.
f1_score: This function is used to evaluate the performance
of a classification model using the F1 score.
cross_val_score: This function is used to perform k-fold
cross-validation to evaluate the performance of a model.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import warnings
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Silence library warnings to keep the output readable
warnings.filterwarnings('ignore')
2. Importing Dataset
Now let's load the dataset into a pandas DataFrame, which makes
the data easier to handle and provides handy functions for
performing complex tasks in one go.
You can download the dataset from here.
df = pd.read_csv('/content/Position_Salaries.csv')
print(df)
Output:
Dataset
df.info()
Output:
Info of the dataset
3. Data Preparation
Here the code extracts two subsets of data from the dataset
and stores them in separate variables.
Extracting Features: It extracts the features from the
DataFrame and stores them in a variable named X.
Extracting Target Variable: It extracts the target variable
from the DataFrame and stores it in a variable named y.
X = df.iloc[:, 1:2].values
y = df.iloc[:, 2].values
4. Random Forest Regressor Model
The code processes categorical data by encoding it numerically,
combines the processed data with numerical data and trains a
Random Forest Regression model using the prepared data.
RandomForestRegressor: It builds multiple decision trees
and combines their predictions.
n_estimators=10: Defines the number of decision trees in
the Random Forest.
random_state=0: Ensures the randomness in model training
is controlled for reproducibility.
oob_score=True: Enables out-of-bag scoring which
evaluates the model's performance using data not seen by
individual trees during training.
LabelEncoder(): Converts categorical variables (object type)
into numerical values, making them suitable for machine
learning models.
apply(label_encoder.fit_transform): Applies the
LabelEncoder transformation to each categorical column,
converting string labels into numbers.
concat(): Combines the numerical and encoded categorical
features horizontally into one dataset which is then used as
input for the model.
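import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
# Encode every object-typed (categorical) column as integers
x_categorical = df.select_dtypes(include=['object']).apply(label_encoder.fit_transform)
# Note: this keeps every numeric column of df, which here also includes the column used as y
x_numerical = df.select_dtypes(exclude=['object']).values
# Join numeric and encoded categorical columns side by side
x = pd.concat([pd.DataFrame(x_numerical), x_categorical], axis=1).values
regressor = RandomForestRegressor(n_estimators=10, random_state=0, oob_score=True)
regressor.fit(x, y)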
5. Making predictions and Evaluating
The code evaluates the trained Random Forest Regression model:
oob_score_: Retrieves the out-of-bag (OOB) score which
estimates the model's generalization performance.
Makes predictions using the trained model and stores them in
the 'predictions' array.
Evaluates the model's performance using the Mean Squared
Error (MSE) and R-squared (R2) metrics.
from sklearn.metrics import mean_squared_error, r2_score

oob_score = regressor.oob_score_
print(f'Out-of-Bag Score: {oob_score}')
# Predictions on the same data the model was trained on
predictions = regressor.predict(x)
mse = mean_squared_error(y, predictions)
print(f'Mean Squared Error: {mse}')
r2 = r2_score(y, predictions)
print(f'R-squared: {r2}')
Output:
Out-of-Bag Score: 0.644879832593859
Mean Squared Error: 2647325000.0
R-squared: 0.9671801245316117
6. Visualizing
Now let's visualize the results obtained by using the
RandomForest Regression model on our salaries dataset.
Creates a grid of prediction points covering the range of the
feature values.
Plots the real data points as blue scatter points.
Plots the predicted values for the prediction grid as a green
line.
Adds labels and a title to the plot for better understanding.
import numpy as np

# Fine grid over the range of the first feature only
X_grid = np.arange(min(X[:, 0]), max(X[:, 0]), 0.01)
X_grid = X_grid.reshape(-1, 1)
# Pad with zeros for the two remaining model inputs
X_grid = np.hstack((X_grid, np.zeros((X_grid.shape[0], 2))))
plt.scatter(X[:, 0], y, color='blue', label="Actual Data")
plt.plot(X_grid[:, 0], regressor.predict(X_grid), color='green', label="Random Forest Prediction")
plt.title("Random Forest Regression Results")
plt.xlabel('Position Level')
plt.ylabel('Salary')
plt.legend()
plt.show()
Output:
7. Visualizing a Single Decision Tree from the Random
Forest Model
The code visualizes one of the decision trees from the trained
Random Forest model. It plots the selected decision tree, displaying
the decision-making process of a single tree within the ensemble.
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt

# Take the first fitted tree of the ensemble
tree_to_plot = regressor.estimators_[0]
plt.figure(figsize=(20, 10))
plot_tree(tree_to_plot, feature_names=df.columns.tolist(), filled=True, rounded=True, fontsize=10)
plt.title("Decision Tree from Random Forest")
plt.show()
Output:
Applications of Random Forest Regression
Random Forest Regression applies to a wide range of real-world
problems including:
Predicting continuous numerical values: Predicting house
prices, stock prices or customer lifetime value.
Identifying risk factors: Detecting risk factors for diseases,
financial crises or other negative events.
Handling high-dimensional data: Analyzing datasets with a
large number of input features.
Capturing complex relationships: Modeling complex
relationships between input features and the target variable.
Advantages of Random Forest Regression
Handles Non-Linearity: It can capture complex, non-linear
relationships in the data that other models might miss.
Reduces Overfitting: By combining multiple decision trees
and averaging predictions it reduces the risk of overfitting
compared to a single decision tree.
Robust to Outliers: Random Forest is less sensitive to
outliers as it aggregates the predictions from multiple trees.
Works Well with Large Datasets: It can efficiently handle
large datasets and high-dimensional data without a significant
loss in performance.
Handles Missing Data: Some Random Forest implementations can
handle missing values, for example by using surrogate splits,
and maintain good accuracy even with incomplete data.
No Need for Feature Scaling: Unlike many other algorithms
Random Forest does not require normalization or scaling of the
data.
Disadvantages of Random Forest Regression
Complexity: It can be computationally expensive and slow to
train, especially with a large number of trees and
high-dimensional data. Due to this it may not be suitable for
real-time predictions.
Less Interpretability: Since it uses many trees it can be
harder to interpret compared to simpler models like linear
regression or decision trees.
Memory Intensive: Storing multiple decision trees for large
datasets requires significant memory resources.
Overfitting on Noisy Data: While Random Forest reduces
overfitting, it can still overfit if the data is highly noisy, especially
when the individual trees are grown very deep.
Sensitive to Imbalanced Data: It may perform poorly if the
dataset is highly imbalanced like one class is significantly more
frequent than another.
Random Forest Regression has become an important tool for
continuous prediction tasks with advantages over traditional
decision trees. Its capability to handle high-dimensional data,
capture complex relationships and reduce overfitting has made it
useful.
Support Vector Machine (SVM) Algorithm
Last Updated : 13 Nov, 2025
Support Vector Machine (SVM) is a supervised machine learning
algorithm used for classification and regression tasks. It tries to
find the best boundary known as hyperplane that separates
different classes in the data. It is useful when you want to do
binary classification like spam vs. not spam or cat vs. dog.
The main goal of SVM is to maximize the margin between the
two classes. The larger the margin the better the model performs
on new and unseen data.
Key Concepts of Support Vector Machine
Hyperplane: A decision boundary separating different
classes in feature space and is represented by the equation
wx + b = 0 in linear classification.
Support Vectors: The closest data points to the
hyperplane, crucial for determining the hyperplane and margin
in SVM.
Margin: The distance between the hyperplane and the
support vectors. SVM aims to maximize this margin for better
classification performance.
Kernel: A function that maps data to a higher-dimensional
space enabling SVM to handle non-linearly separable data.
Hard Margin: A maximum-margin hyperplane that perfectly
separates the data without misclassifications.
Soft Margin: Allows some misclassifications by introducing
slack variables, balancing margin maximization and
misclassification penalties when data is not perfectly
separable.
C: A regularization term balancing margin maximization and
misclassification penalties. A higher C value forces stricter
penalty for misclassifications.
Hinge Loss: A loss function penalizing misclassified points
or margin violations and is combined with regularization in
SVM.
Dual Problem: Involves solving for Lagrange multipliers
associated with support vectors, facilitating the kernel trick
and efficient computation.
How does Support Vector Machine Algorithm
Work?
The key idea behind the SVM algorithm is to find the hyperplane
that best separates two classes by maximizing the margin
between them. This margin is the distance from the hyperplane
to the nearest data points (support vectors) on each side.
Multiple hyperplanes separate the data from two classes
The best hyperplane, also known as the maximum-margin or "hard
margin" hyperplane, is the one that maximizes the distance between
the hyperplane and the nearest data points from both classes. This
ensures a clear separation between the classes. So from the above
figure, we choose L2 as the hard margin. Let's consider a scenario
like the one shown below:
Selecting hyperplane for data with outlier
Here, we have one blue ball in the boundary of the red ball.
How does SVM classify the data?
The blue ball in the boundary of red ones is an outlier of blue
balls. The SVM algorithm can ignore the outlier and find the
best hyperplane that maximizes the margin, which makes
SVM robust to outliers.
Hyperplane which is the most optimized one
A soft margin allows for some misclassifications or violations of
the margin to improve generalization. The SVM optimizes the
following equation to balance margin maximization and penalty
minimization:
$$\text{Objective Function} = \left(\frac{1}{\text{margin}}\right) + \lambda \sum \text{penalty}$$
The penalty used for violations is often hinge loss which has the
following behavior:
If a data point is correctly classified and lies outside the margin
there is no penalty (loss = 0).
If a point is incorrectly classified or violates the margin the
hinge loss increases proportionally to the distance of the
violation.
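The hinge loss has the standard form max(0, 1 − y · f(x)), where y is the true label (+1 or −1) and f(x) is the raw score of the classifier. A minimal plain-Python sketch follows; the function name `hinge_loss` and the sample scores are illustrative.

```python
def hinge_loss(y, score):
    """Hinge loss for one sample: y is +1 or -1, score is the raw output f(x)."""
    return max(0.0, 1.0 - y * score)

# (label, score) pairs chosen to show each regime.
examples = [
    (1, 2.0),    # correct and outside the margin
    (1, 0.5),    # correct but inside the margin
    (1, -1.0),   # misclassified
    (-1, -3.0),  # correct and outside the margin (negative class)
]
for y, score in examples:
    print(f"y={y:+d}, score={score:+.1f} -> loss={hinge_loss(y, score):.1f}")
```

A point pays nothing once y · f(x) reaches 1, and the loss grows linearly as the point moves further onto the wrong side.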
Till now we were talking about linearly separable data, where a
straight line separates the group of blue balls from the red
balls.
What if data is not linearly separable?
When data is not linearly separable i.e it can't be divided by a
straight line, SVM uses a technique called kernels to map the
data into a higher-dimensional space where it becomes
separable. This transformation helps SVM find a decision
boundary even for non-linear data.
Original 1D dataset for classification
A kernel is a function that maps data points into a higher-
dimensional space without explicitly computing the coordinates in
that space. This allows SVM to work efficiently with non-linear
data by implicitly performing the mapping. For example consider
data points that are not linearly separable. By applying a kernel
function SVM transforms the data points into a higher-
dimensional space where they become linearly separable.
Linear Kernel: For linear separability.
Polynomial Kernel: Maps data into a polynomial space.
Radial Basis Function (RBF) Kernel: Transforms data into
a space based on distances between data points.
Mapping 1D data to 2D to become able to separate the two
classes
In this case the new variable y is created as a function of
distance from the origin.
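The 1D-to-2D idea can be checked numerically. In this sketch, which assumes NumPy, the new variable is taken to be the squared distance from the origin, y = x²; that specific choice, the data points and the helper name `separable_by_threshold` are assumptions made for illustration.

```python
import numpy as np

def separable_by_threshold(values, labels):
    """True if one threshold on `values` splits the two classes perfectly."""
    order = np.argsort(values)
    sorted_labels = labels[order]
    # A perfect split means the sorted labels switch class at most once.
    return np.count_nonzero(np.diff(sorted_labels)) <= 1

# Made-up 1D data: class -1 near the origin, class +1 further away on both sides.
x = np.array([-4.0, -3.0, -1.0, 0.0, 1.0, 3.0, 4.0])
labels = np.array([1, 1, -1, -1, -1, 1, 1])

print("separable in 1D:", separable_by_threshold(x, labels))

# Map each point to 2D: (x, x^2). The second coordinate grows with distance from 0.
features = np.column_stack([x, x ** 2])
print("separable on the new axis:", separable_by_threshold(features[:, 1], labels))

# In 2D the horizontal line x^2 = 5 is a separating hyperplane: w = (0, 1), b = -5.
w, b = np.array([0.0, 1.0]), -5.0
pred = np.where(features @ w + b >= 0, 1, -1)
print("predictions:", pred)
```

A kernel achieves the same effect without building the extra coordinate explicitly, by computing inner products in the mapped space directly.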
Mathematical Computation of SVM
Consider a binary classification problem with two classes, labeled
as +1 and -1. We have a training dataset consisting of input
feature vectors X and their corresponding class labels Y. The
equation for the linear hyperplane can be written as:
$$w^T x + b = 0$$
Where:
$w$ is the normal vector to the hyperplane (the direction
perpendicular to it).
$b$ is the offset or bias term representing the distance of the
hyperplane from the origin along the normal vector $w$.
Distance from a Data Point to the Hyperplane
The distance between a data point $x_i$ and the decision
boundary can be calculated as:
$$d_i = \frac{w^T x_i + b}{\lVert w \rVert}$$
where $\lVert w \rVert$ represents the Euclidean norm of the weight vector $w$.
Linear SVM Classifier
The predicted label follows from the sign of $w^T x + b$:
$$\hat{y} = \begin{cases} 1 & : \; w^T x + b \ge 0 \\ -1 & : \; w^T x + b < 0 \end{cases}$$
Where $\hat{y}$ is the predicted label of a data point.
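The distance formula and the prediction rule can be evaluated for a concrete hyperplane. This sketch assumes NumPy; the weight vector w = (3, 4), bias b = −5 and the test points are made-up values chosen so that the arithmetic is easy to follow (‖w‖ = 5).

```python
import numpy as np

def signed_distance(x, w, b):
    """Signed distance (w^T x + b) / ||w|| from point x to the hyperplane."""
    return (w @ x + b) / np.linalg.norm(w)

def predict(x, w, b):
    """Label +1 if w^T x + b >= 0, otherwise -1."""
    return 1 if w @ x + b >= 0 else -1

w = np.array([3.0, 4.0])   # illustrative normal vector, norm = 5
b = -5.0                   # illustrative bias

points = {
    "far positive side": np.array([3.0, 4.0]),   # w.x + b = 20 -> distance 4
    "negative side": np.array([0.0, 0.0]),       # w.x + b = -5 -> distance -1
    "on the boundary": np.array([1.0, 0.5]),     # w.x + b = 0  -> distance 0
}
for name, x in points.items():
    print(name, signed_distance(x, w, b), predict(x, w, b))
```

The sign of the distance tells which side of the hyperplane the point is on, and its magnitude tells how far away it is.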
Optimization Problem for SVM
For a linearly separable dataset the goal is to find the hyperplane
that maximizes the margin between the two classes while
ensuring that all data points are correctly classified. This leads to
the following optimization problem:
$$\underset{w,\,b}{\text{minimize}} \;\; \frac{1}{2} \lVert w \rVert^2$$
Subject to the constraint:
$$y_i (w^T x_i + b) \ge 1 \quad \text{for } i = 1, 2, 3, \cdots, m$$
Where:
$y_i$ is the class label (+1 or -1) for each training instance.
$x_i$ is the feature vector for the $i$-th training instance.
$m$ is the total number of training instances.
The condition $y_i (w^T x_i + b) \ge 1$ ensures that each
data point is correctly classified and lies on or outside the margin.
Soft Margin in Linear SVM Classifier
In the presence of outliers or non-separable data the SVM allows
some misclassification by introducing slack variables $\zeta_i$. The
optimization problem is modified as:
$$\underset{w,\,b}{\text{minimize}} \;\; \frac{1}{2} \lVert w \rVert^2 + C \sum_{i=1}^{m} \zeta_i$$
Subject to the constraints:
$$y_i (w^T x_i + b) \ge 1 - \zeta_i \quad \text{and} \quad \zeta_i \ge 0 \quad \text{for } i = 1, 2, \ldots, m$$
Where:
$C$ is a regularization parameter that controls the trade-off
between margin maximization and penalty for
misclassifications.
$\zeta_i$ are slack variables that represent the degree of violation
of the margin by each data point.
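For a fixed hyperplane, the smallest slack that satisfies both constraints is ζ_i = max(0, 1 − y_i(wᵀx_i + b)), so the soft-margin objective can be evaluated directly. The sketch below assumes NumPy; the function name `soft_margin_objective`, the hyperplane and the four points are made up to show how C changes the cost of the same violations.

```python
import numpy as np

def soft_margin_objective(w, b, X, y, C):
    """Return (objective, slack) for 1/2 ||w||^2 + C * sum(slack)."""
    margins = y * (X @ w + b)                 # y_i (w^T x_i + b)
    slack = np.maximum(0.0, 1.0 - margins)    # smallest feasible zeta_i
    objective = 0.5 * (w @ w) + C * slack.sum()
    return objective, slack

# Illustrative hyperplane x1 = 0 and four labelled points.
w = np.array([1.0, 0.0])
b = 0.0
X = np.array([[2.0, 0.0],    # outside the margin        -> slack 0
              [0.5, 0.0],    # inside the margin         -> slack 0.5
              [-0.5, 0.0],   # misclassified positive    -> slack 1.5
              [-2.0, 0.0]])  # negative point, correct   -> slack 0
y = np.array([1, 1, 1, -1])

for C in (1.0, 10.0):
    objective, slack = soft_margin_objective(w, b, X, y, C)
    print(f"C={C}: slack={slack}, objective={objective}")
```

With a larger C the same violations cost more, so an optimizer would accept a narrower margin to reduce them.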
Dual Problem for SVM
The dual problem involves optimizing the Lagrange multipliers
associated with the training samples. This transformation allows
solving the SVM optimization using kernel functions for non-linear
classification.
The dual objective function is given by:
$$\underset{\alpha}{\text{maximize}} \;\; \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j t_i t_j K(x_i, x_j)$$
subject to $\alpha_i \ge 0$ (with $\alpha_i \le C$ in the soft-margin case) and
$\sum_{i=1}^{m} \alpha_i t_i = 0$. This is equivalent to minimizing
$\frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j t_i t_j K(x_i, x_j) - \sum_{i} \alpha_i$.
Where:
$\alpha_i$ are the Lagrange multipliers associated with
the $i$-th training sample.
$t_i$ is the class label for the $i$-th training sample.
$K(x_i, x_j)$ is the kernel function that computes the
similarity between data points $x_i$ and $x_j$. The kernel allows
SVM to handle non-linear classification problems by mapping
data into a higher-dimensional space.
The dual formulation optimizes the Lagrange multipliers $\alpha_i$ and
the support vectors are those training samples where $\alpha_i > 0$.
SVM Decision Boundary
Once the dual problem is solved, the decision function is given
by:
$$f(x) = \sum_{i=1}^{m} \alpha_i t_i K(x_i, x) + b$$
Where $x$ is the test data point and $b$ is the bias term; the
predicted class is the sign of $f(x)$. For a linear kernel the weight
vector is $w = \sum_{i=1}^{m} \alpha_i t_i x_i$. Finally the bias term $b$ is determined
by the support vectors on the margin, which satisfy:
$$t_i(w^T x_i + b) = 1 \Rightarrow b = t_i - w^T x_i$$
Where $x_i$ is any support vector lying exactly on the margin.
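One way to see the kernel form of the decision function in practice is to rebuild it from a fitted scikit-learn `SVC`. This is a sketch under stated assumptions: the blob dataset and the fixed `gamma=0.5` are arbitrary choices, and it relies on `dual_coef_` holding the products $\alpha_i t_i$ for the support vectors only.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic two-class data (illustrative only)
X, y = make_blobs(n_samples=100, centers=2, random_state=0)

gamma = 0.5  # fixed so the same kernel can be recomputed by hand
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# Kernel between every support vector and every query point: shape (n_SV, n_samples)
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# Sum over support vectors of (alpha_i * t_i) * K(x_i, x), plus the bias term
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

matches = np.allclose(manual, clf.decision_function(X))
print(matches)
```

Only the support vectors appear in the sum, because every other training sample has a multiplier of zero.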
This completes the mathematical framework of the Support
Vector Machine algorithm which allows for both linear and non-
linear classification using the dual problem and kernel trick.
Types of Support Vector Machine
Based on the nature of the decision boundary, Support Vector
Machines (SVM) can be divided into two main types:
Linear SVM: Linear SVMs use a linear decision boundary to
separate the data points of different classes. When the data
can be precisely linearly separated, linear SVMs are very
suitable. This means that a single straight line (in 2D) or a
hyperplane (in higher dimensions) can entirely divide the data
points into their respective classes. A hyperplane that
maximizes the margin between the classes is the decision
boundary.
Non-Linear SVM: Non-Linear SVM can be used to classify
data when it cannot be separated into two classes by a
straight line (in the case of 2D). By using kernel functions,
nonlinear SVMs can handle nonlinearly separable data. The
original input data is transformed by these kernel functions
into a higher-dimensional feature space where the data points
can be linearly separated. A linear boundary found in this
transformed space corresponds to a nonlinear decision boundary in
the original input space.
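The difference between the two types can be illustrated on data that no straight line can separate. The sketch below uses scikit-learn's `make_circles` (two concentric rings); the dataset parameters and the train/test split are arbitrary choices made for this illustration.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original 2D space
X, y = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Same data, two kernels
linear_acc = SVC(kernel="linear", C=1).fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", C=1).fit(X_train, y_train).score(X_test, y_test)

print(f"linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```

On data like this the RBF kernel should score far higher than the linear kernel, since the rings only become separable after the implicit mapping to a higher-dimensional space.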
Implementing SVM Algorithm Using Scikit-Learn
We will predict whether cancer is Benign or Malignant using
historical data about patients diagnosed with cancer. This data
includes independent attributes such as tumor size, texture, and
others. To perform this classification, we will use an SVM
(Support Vector Machine) classifier to differentiate between
benign and malignant cases effectively.
load_breast_cancer(): Loads the breast cancer dataset
(features and target labels).
SVC(kernel="linear", C=1): Creates a Support Vector
Classifier with a linear kernel and regularization parameter
C=1.
svm.fit(X, y): Trains the SVM model on the feature matrix X
and target labels y.
DecisionBoundaryDisplay.from_estimator(): Visualizes
the decision boundary of the trained model with a specified
color map.
plt.scatter(): Creates a scatter plot of the data points,
colored by their labels.
plt.show(): Displays the plot to the screen.
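from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the data and keep only the first two features so the boundary can be drawn in 2D
cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

# Fit a linear-kernel SVM
svm = SVC(kernel="linear", C=1)
svm.fit(X, y)

# Plot the predicted regions, then overlay the labelled data points
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    alpha=0.8,
    cmap="Pastel1",
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)
plt.scatter(X[:, 0], X[:, 1],
            c=y,
            s=20, edgecolors="k")
plt.show()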
Output:
SVM
Advantages of Support Vector Machine (SVM)
1. High-Dimensional Performance: SVM excels in high-
dimensional spaces, making it suitable for image classification
and gene expression analysis.
2. Nonlinear Capability: By using kernel functions such as RBF
and polynomial, SVM handles nonlinear relationships
effectively.
3. Outlier Resilience: The soft margin allows SVM to
tolerate some outliers, enhancing robustness in spam detection and
anomaly detection.
4. Binary and Multiclass Support: SVM is effective for both
binary classification and multiclass classification suitable for
applications in text classification.
5. Memory Efficiency: It focuses on support vectors making it
memory efficient compared to other algorithms.
Disadvantages of Support Vector Machine (SVM)
1. Slow Training: SVM can be slow to train on large datasets,
which limits its use in large-scale data mining tasks.
2. Parameter Tuning Difficulty: Selecting the right kernel and
adjusting parameters like C requires careful tuning, since poor
choices can noticeably degrade accuracy.
3. Noise Sensitivity: SVM struggles with noisy datasets and
overlapping classes, limiting effectiveness in real-world
scenarios.
4. Limited Interpretability: The complexity of the hyperplane
in higher dimensions makes SVM less interpretable than other
models.
5. Feature Scaling Sensitivity: Proper feature scaling is
essential, otherwise SVM models may perform poorly.
K-Nearest Neighbor(KNN) Algorithm
Last Updated : 23 Aug, 2025
K-Nearest Neighbors (KNN) is a supervised machine learning
algorithm generally used for classification but can also be used for
regression tasks. It works by finding the "k" closest data points
(neighbors) to a given input and makes a prediction based on the
majority class (for classification) or the average value (for
regression). Since KNN makes no assumptions about the
underlying data distribution, it is a non-parametric and
instance-based learning method.
K-Nearest Neighbors is also called a lazy learner
algorithm because it does not build a model from the training set
in advance; instead it stores the entire dataset and performs
computations only at prediction time.
For example, consider the following table of data points containing
two features:
KNN Algorithm working visualization
The new point is classified as Category 2 because most of its
closest neighbors are blue squares. KNN assigns the category
based on the majority of nearby points. The image shows how
KNN predicts the category of a new data point based on its closest
neighbours.
The red diamonds represent Category 1 and the blue
squares represent Category 2.
The new data point checks its closest neighbors (circled
points).
Since the majority of its closest neighbors are blue squares
(Category 2) KNN predicts the new data point belongs to
Category 2.
KNN works by using proximity and majority voting to make
predictions.
What is 'K' in K Nearest Neighbour?
In the k-Nearest Neighbours algorithm k is just a number that tells
the algorithm how many nearby points or neighbors to look at
when it makes a decision.
Example: Imagine you're deciding which fruit it is based on its
shape and size. You compare it to fruits you already know.
If k = 3, the algorithm looks at the 3 closest fruits to the new
one.
If 2 of those 3 fruits are apples and 1 is a banana, the
algorithm says the new fruit is an apple because most of its
neighbors are apples.
How to choose the value of k for KNN Algorithm?
The value of k in KNN decides how many neighbors the
algorithm looks at when making a prediction.
Choosing the right k is important for good results.
If the data has lots of noise or outliers, using a larger k can
make the predictions more stable.
But if k is too large the model may become too simple and
miss important patterns and this is called underfitting.
So k should be picked carefully based on the data.
Statistical Methods for Selecting k
Cross-Validation: A good way to find the best value of k is
n-fold cross-validation (the number of folds n is a separate choice
from KNN's k). The dataset is divided into n parts. The model is
trained on all but one of these parts and tested on the remaining
one, and this is repeated so that each part serves once as the test
set. The k value that gives the highest average accuracy across
these tests is usually the best one to use.
Elbow Method: In Elbow Method we draw a graph showing
the error rate or accuracy for different k values. As k increases
the error usually drops at first. But after a certain point error
stops decreasing quickly. The point where the curve changes
direction and looks like an "elbow" is usually the best choice for
k.
Odd Values for k: It’s a good idea to use an odd number for
k especially in classification problems. This helps avoid ties
when deciding which class is the most common among the
neighbors.
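The cross-validation approach can be sketched with scikit-learn's `cross_val_score`. The Iris dataset, the candidate list of odd k values and the choice of 5 folds are assumptions made for this illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

candidate_ks = [1, 3, 5, 7, 9, 11]   # odd values to reduce the chance of tied votes
mean_scores = {}
for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # 5-fold cross-validation: mean accuracy over the five held-out folds
    mean_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
for k, score in mean_scores.items():
    print(f"k={k:2d}  mean accuracy={score:.3f}")
print("best k:", best_k)
```

Plotting `1 - mean_scores[k]` against k gives the error curve that the Elbow Method inspects.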
Distance Metrics Used in KNN Algorithm
KNN uses a distance metric to identify the nearest neighbors, which
are then used for classification or regression. The most common
distance metrics are:
1. Euclidean Distance
Euclidean distance is defined as the straight-line distance between
two points in a plane or space. You can think of it like the shortest
path you would walk if you were to go directly from one point to
another.
$$\text{distance}(x, X_i) = \sqrt{\sum_{j=1}^{d} (x_j - X_{ij})^2}$$
2. Manhattan Distance
This is the total distance you would travel if you could only move
along horizontal and vertical lines like a grid or city streets. It’s also
called "taxicab distance" because a taxi can only drive along the
grid-like streets of a city.
$$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$$
3. Minkowski Distance
Minkowski distance is like a family of distances, which includes
both Euclidean and Manhattan distances as special cases.
$$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}}$$
From the formula above, when p=2, it becomes the same as the
Euclidean distance formula and when p=1, it turns into the
Manhattan distance formula. Minkowski distance is essentially a
flexible formula that can represent either Euclidean or Manhattan
distance depending on the value of p.
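The relationship between the three metrics can be checked numerically. In this small sketch `minkowski_distance` is a helper written for illustration, and the two points are arbitrary.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

a, b = [1, 2], [4, 6]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16) = 5

print(manhattan)  # 7.0
print(euclidean)  # 5.0
```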
Working of KNN algorithm
The K-Nearest Neighbors (KNN) algorithm operates on the
principle of similarity where it predicts the label or value of a new
data point by considering the labels or values of its K nearest
neighbors in the training dataset.
Step 1: Selecting the optimal value of K
K represents the number of nearest neighbors that needs to
be considered while making prediction.
Step 2: Calculating distance
To measure the similarity between target and training data
points Euclidean distance is widely used. Distance is calculated
between data points in the dataset and target point.
Step 3: Finding Nearest Neighbors
The k data points with the smallest distances to the target
point are nearest neighbors.
Step 4: Voting for Classification or Taking Average for
Regression
When you want to classify a data point into a category like
spam or not spam, the KNN algorithm looks at the K closest
points in the dataset. These closest points are called neighbors.
The algorithm then looks at which category the neighbors
belong to and picks the one that appears the most. This is
called majority voting.
In regression, the algorithm still looks for the K closest points.
But instead of voting for a class, it takes the
average of the values of those K neighbors. This average is the
predicted value for the new point.
The accompanying illustration shows how a test point is classified
based on its nearest neighbors. As the test point moves, the algorithm
identifies the closest 'k' data points (5 in this case) and assigns the
test point the majority class label, which is the grey class here.
Implementing KNN from Scratch in Python
1. Importing Libraries
Counter is used to count the occurrences of elements in a list or
iterable. In KNN after finding the k nearest neighbor labels Counter
helps count how many times each label appears.
import numpy as np
from collections import Counter
2. Defining the Euclidean Distance Function
euclidean_distance calculates the Euclidean distance between
two points.
def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((np.array(point1) - np.array(point2))**2))
3. KNN Prediction Function
distances.append saves how far each training point is from
the test point, along with its label.
distances.sort sorts the list so the nearest points
come first.
k_nearest_labels picks the labels of the k closest points.
Counter finds which label appears most often among those
k labels, and that label becomes the prediction.
def knn_predict(training_data, training_labels, test_point, k):
    distances = []
    for i in range(len(training_data)):
        dist = euclidean_distance(test_point, training_data[i])
        distances.append((dist, training_labels[i]))
    distances.sort(key=lambda x: x[0])
    k_nearest_labels = [label for _, label in distances[:k]]
    return Counter(k_nearest_labels).most_common(1)[0][0]
4. Training Data, Labels and Test Point
training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']
test_point = [4, 5]
k=3
5. Prediction
prediction = knn_predict(training_data, training_labels, test_point, k)
print(prediction)
Output:
A
The algorithm calculates the distances from the test point [4, 5] to all
training points, selects the 3 closest points (since k = 3) and looks at
their labels. Since the majority of the closest points are
labelled 'A', the test point is classified as 'A'.
The Scikit-learn Python library also provides a built-in KNN
implementation; for that, refer to Implementation of KNN classifier
using Sklearn.
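For comparison, the same toy problem can be run through scikit-learn's `KNeighborsClassifier`. This is a minimal sketch that reuses the training data from the from-scratch example and relies on the library's default Euclidean metric.

```python
from sklearn.neighbors import KNeighborsClassifier

training_data = [[1, 2], [2, 3], [3, 4], [6, 7], [7, 8]]
training_labels = ['A', 'A', 'A', 'B', 'B']

# n_neighbors plays the role of k
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(training_data, training_labels)

sk_prediction = knn.predict([[4, 5]])[0]
print(sk_prediction)  # A
```

The three nearest points to [4, 5] are [3, 4], [2, 3] and [6, 7], so two of the three votes go to 'A'.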
Applications of KNN
Recommendation Systems: Suggests items like movies or
products by finding users with similar preferences.
Spam Detection: Identifies spam emails by comparing new
emails to known spam and non-spam examples.
Customer Segmentation: Groups customers by comparing
their shopping behavior to others.
Speech Recognition: Matches spoken words to known
patterns to convert them into text.
Advantages of KNN
Simple to use: Easy to understand and implement.
No training step: No need to train as it just stores the data
and uses it during prediction.
Few parameters: Only needs to set the number of neighbors
(k) and a distance method.
Versatile: Works for both classification and regression
problems.
Disadvantages of KNN
Slow with large data: Needs to compare every point during
prediction.
Struggles with many features: Accuracy drops when data
has too many features.
Can Overfit: It can overfit especially when the data is high-
dimensional or not clean.
Gradient Boosting in ML
Last Updated : 03 Sep, 2025
Gradient Boosting is a boosting algorithm in which each new
model is trained to reduce the loss (such as mean
squared error or cross-entropy) of the ensemble built so far,
following the idea of gradient descent. In each iteration the algorithm computes
the gradient of the loss function with respect to the current predictions and
then trains a new weak model to fit the negative of this gradient.
Predictions of the new model are then added to the ensemble (the sum of all
models' predictions) and the process is repeated until a stopping
criterion is met.
Shrinkage and Model Complexity
A key feature of Gradient Boosting is shrinkage, which scales the
contribution of each new model using a learning rate (denoted
as $\eta$).
Smaller learning rates: the contribution of each tree is
smaller, which reduces the risk of overfitting but requires more
trees to achieve the same performance.
Larger learning rates: each tree has a more
significant impact, but this can lead to overfitting.
There is a trade-off between the learning rate and the number of
estimators (trees): a smaller learning rate usually means more
trees are required to achieve optimal performance.
Working of Gradient Boosting
1. Sequential Learning Process
The ensemble consists of multiple trees each trained to correct the
errors of the previous one. In the first iteration Tree 1 is trained on
the original data $x$ and the true labels $y$. It makes predictions
which are used to compute the errors.
2. Residuals Calculation
In the second iteration Tree 2 is trained using the feature
matrix $x$ and the errors from Tree 1 as labels. This means Tree 2
is trained to predict the errors of Tree 1. This process continues for
all the trees in the ensemble. Each subsequent tree is trained to
predict the errors of the previous tree.
Gradient Boosted Trees
3. Shrinkage
After each tree is trained its predictions are shrunk by multiplying
them with the learning rate η which ranges from 0 to 1. This
prevents overfitting by ensuring each tree has a smaller impact on
the final model.
Once all trees are trained, predictions are made by summing the
contributions of all the trees. The final prediction is given by the
formula:
$$y_{\text{pred}} = y_1 + \eta \cdot r_1 + \eta \cdot r_2 + \cdots + \eta \cdot r_N$$
Where $r_1, r_2, \dots, r_N$ are the errors predicted by each
tree.
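The residual-fitting loop can be sketched from scratch for squared-error loss, where the negative gradient is simply the residual. This is an illustrative sketch, not scikit-learn's internal implementation: it starts from the mean of the targets rather than from a first tree, and the function names, the synthetic sine data and the hyperparameter values are all assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosting(X, y, n_trees=50, eta=0.1, max_depth=2):
    """Fit shallow trees to residuals, one after another."""
    base = float(np.mean(y))                 # initial constant prediction (assumption)
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred                 # negative gradient of 1/2 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + eta * tree.predict(X)  # shrunken update
        trees.append(tree)
    return base, trees

def predict_gradient_boosting(X, base, trees, eta=0.1):
    """Sum the base prediction and every tree's shrunken contribution."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + eta * tree.predict(X)
    return pred

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_gradient_boosting(X, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - predict_gradient_boosting(X, base, trees)) ** 2)
print(f"training MSE before boosting: {mse_start:.4f}")
print(f"training MSE after boosting:  {mse_end:.4f}")
```

The same `eta` must be used at prediction time as during fitting, because each tree was trained against residuals computed with that step size.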
Difference between Adaboost and Gradient
Boosting
Let's see the differences between AdaBoost and Gradient Boosting,
which are as follows:
Features | AdaBoost | Gradient Boosting
Weight Update Strategy | Increases the weights of misclassified samples so that the next learner focuses more on them. | Updates predictions by minimizing a loss function using the negative gradient.
Base learners | AdaBoost uses simple decision trees with one split, known as decision stumps, as weak learners. | Gradient Boosting can use a wide range of base learners such as decision trees and linear models.
Sensitivity to Noise | AdaBoost is more sensitive to noisy data and outliers due to aggressive weighting. | Gradient Boosting is less sensitive as it smooths updates using gradients.
Optimization Technique | No explicit loss function, i.e. it focuses on classification error. | Explicitly minimizes a differentiable loss function.
Boosting Mechanism | Learners are trained sequentially with sample reweighting. | Learners are trained sequentially with residual fitting (gradient descent).
Interpretability | Easier to interpret due to simple weak learners. | Harder to interpret if complex models are used.
Use case | Suitable for clean datasets with fewer outliers. | Suitable for complex problems with varying loss functions.
Implementing Gradient Boosting for Classification
and Regression
Here are two examples to demonstrate how Gradient Boosting
works for both classification and regression. But before that let's
understand gradient boosting parameters.
n_estimators: This specifies the number of trees
(estimators) to be built. A higher value typically improves model
performance but increases computation time.
learning_rate: This is the shrinkage parameter. It scales the
contribution of each tree.
random_state: It ensures reproducibility of results. Setting a
fixed value for random_state ensures that you get the same
results every time you run the model.
max_features: This parameter limits the number of features
each tree can use for splitting. It helps prevent overfitting by
limiting the complexity of each tree and promoting diversity in
the model.
Now we start building our models with Gradient Boosting.
Example 1: Classification
We use Gradient Boosting Classifier to predict digits from Digits
dataset.
Import the necessary libraries
Setting SEED for reproducibility
Load the digit dataset and split it into train and test.
Instantiate Gradient Boosting classifier and fit the model.
Predict the test set and compute the accuracy score.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_digits

SEED = 23

X, y = load_digits(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=SEED)

gbc = GradientBoostingClassifier(n_estimators=300,
                                 learning_rate=0.05,
                                 random_state=100,
                                 max_features=5)
gbc.fit(train_X, train_y)

pred_y = gbc.predict(test_X)
acc = accuracy_score(test_y, pred_y)
print("Gradient Boosting Classifier accuracy is : {:.2f}".format(acc))
Output:
Gradient Boosting Classifier accuracy is : 0.98
Example 2: Regression
We use Gradient Boosting Regressor on the Diabetes dataset to
predict continuous values:
Import the necessary libraries
Setting SEED for reproducibility
Load the diabetes dataset and split it into train and test.
Instantiate Gradient Boosting Regressor and fit the model.
Predict on the test set and compute RMSE.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes

SEED = 23

X, y = load_diabetes(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(X, y,
                                                    test_size=0.25,
                                                    random_state=SEED)

gbr = GradientBoostingRegressor(loss='absolute_error',
                                learning_rate=0.1,
                                n_estimators=300,
                                max_depth=1,
                                random_state=SEED,
                                max_features=5)
gbr.fit(train_X, train_y)

pred_y = gbr.predict(test_X)
test_rmse = mean_squared_error(test_y, pred_y) ** (1 / 2)
print('Root mean Square error: {:.2f}'.format(test_rmse))
Output:
Root mean Square error: 56.39
Gradient Boosting is an effective and widely-used machine
learning technique for both classification and regression problems.
It builds models sequentially focusing on correcting errors made by
previous models which leads to improved performance.
Naive Bayes Classifiers
Last Updated : 18 Nov, 2025
Naive Bayes is a machine learning classification algorithm that
predicts the category of a data point using probability. It assumes
that all features are independent of each other given the class. Naive Bayes
performs well in many real-world applications such as spam
filtering, document categorization and sentiment analysis.
Here:
Original data has two classes: green circles (y=1) and red
squares (y=2).
Estimate probability distribution along the first dimension,
i.e. $P(x_1 \mid y=1)$, $P(x_1 \mid y=2)$
Estimate probability distribution along the second dimension,
i.e. $P(x_2 \mid y=1)$, $P(x_2 \mid y=2)$
Combine both dimensions using conditional independence,
i.e. $P(x \mid y) = \prod_{\alpha} P(x_\alpha \mid y)$
Key Features of Naive Bayes Classifiers
The main idea behind the Naive Bayes classifier is to use Bayes'
Theorem to classify data based on the probabilities of different
classes given the features of the data. It is used mostly in high-
dimensional text classification.
The Naive Bayes Classifier is a simple probabilistic
classifier with very few parameters, so models built with it
can train and predict faster
than many other classification algorithms.
It is a probabilistic classifier: it predicts the class with the
highest probability, and it assumes that one
feature in the model is independent of the existence of another
feature given the class. In other words, each feature contributes to the
prediction independently of the others.
Naive Bayes is used in spam filtering,
sentiment analysis, classifying articles and many more.
Why it is Called Naive Bayes?
It is named as "Naive" because it assumes the presence of one
feature does not affect other features. The "Bayes" part of the
name refers to its basis in Bayes’ Theorem.
Consider a fictional dataset that describes the weather conditions
for playing a game of golf. Given the weather conditions, each
tuple classifies the conditions as fit(“Yes”) or unfit(“No”) for
playing golf. Here is a tabular representation of our dataset.
Outlook Temperature Humidity Windy Play Golf
0 Rainy Hot High False No
1 Rainy Hot High True No
2 Overcast Hot High False Yes
3 Sunny Mild High False Yes
4 Sunny Cool Normal False Yes
5 Sunny Cool Normal True No
6 Overcast Cool Normal True Yes
7 Rainy Mild High False No
8 Rainy Cool Normal False Yes
9 Sunny Mild Normal False Yes
10 Rainy Mild Normal True Yes
11 Overcast Mild High True Yes
12 Overcast Hot Normal False Yes
13 Sunny Mild High True No
The dataset is divided into two parts i.e feature matrix and
the response vector.
Feature matrix contains all the vectors (rows) of the dataset, in
which each vector consists of the values of the features.
In above dataset, features are ‘Outlook’, ‘Temperature’,
‘Humidity’ and ‘Windy’.
Response vector contains the value of class variable
(prediction or output) for each row of feature matrix. In above
dataset, the class variable name is ‘Play golf’.
Assumption of Naive Bayes
Naive Bayes relies on the following assumptions:
Feature independence: This means that when we are
trying to classify something, we assume that each feature (or
piece of information) in the data does not affect any other
feature.
Continuous features are normally distributed: If a
feature is continuous, then it is assumed to be normally
distributed within each class.
Discrete features have multinomial distributions: If a
feature is discrete, then it is assumed to have a multinomial
distribution within each class.
Features are equally important: All features are assumed
to contribute equally to the prediction of the class label.
No missing data: The data should not contain any missing
values.
Introduction to Bayes' Theorem
Bayes’ Theorem provides a principled way to reverse conditional
probabilities. It is defined as:
$$P(y \mid X) = \frac{P(X \mid y) \cdot P(y)}{P(X)}$$
Where:
$P(y \mid X)$: Posterior probability, probability of
class $y$ given features $X$
$P(X \mid y)$: Likelihood, probability of features $X$ given
class $y$
$P(y)$: Prior probability of class $y$
$P(X)$: Marginal likelihood or evidence
Naive Bayes Working
1. Terminology
Consider a classification problem (like predicting if someone
plays golf based on weather). Then:
$y$ is the class label (e.g. "Yes" or "No" for playing golf)
$X = (x_1, x_2, \dots, x_n)$ is the feature vector (e.g.
Outlook, Temperature, Humidity, Wind)
A sample row from the dataset:
$X = (\text{Rainy, Hot, High, False}), \quad y = \text{No}$
This represents:
What is the probability that someone will not play golf given that
the weather is Rainy, Hot, High humidity and No wind?
2. The Naive Assumption
The "naive" in Naive Bayes comes from the assumption that all
features are independent given the class. That is:
$$P(x_1, x_2, \dots, x_n \mid y) = P(x_1 \mid y) \cdot P(x_2 \mid y) \cdots P(x_n \mid y)$$
Thus, Bayes' theorem becomes:
$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$
Since the denominator is constant for a given input, we can write:
$$P(y \mid x_1, \dots, x_n) \propto P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)$$
3. Constructing the Naive Bayes Classifier
We compute the posterior for each class $y$ and choose the
class with the highest probability:
$$\hat{y} = \arg\max_{y} \ P(y) \cdot \prod_{i=1}^{n} P(x_i \mid y)$$
This becomes our Naive Bayes classifier.
4. Example: Weather Dataset
Let’s take a dataset used for predicting if golf is played based on:
Outlook: Sunny, Rainy, Overcast
Temperature: Hot, Mild, Cool
Humidity: High, Normal
Wind: True, False
Example
Input: $X = (\text{Sunny, Hot, Normal, False})$
Goal: Predict if golf will be played ( Yes or No).
5. Pre-computation from Dataset
Class Probabilities:
From dataset of 14 rows:
$P(\text{Yes}) = \frac{9}{14}$
$P(\text{No}) = \frac{5}{14}$
Conditional Probabilities (Tables 1–4):
Feature      Value   P(Value | Yes)  P(Value | No)
Outlook      Sunny   3/9             2/5
Temperature  Hot     2/9             2/5
Humidity     Normal  6/9             1/5
Wind         False   6/9             2/5
Each entry is a count from the dataset: for example, Sunny appears
in 3 of the 9 "Yes" rows and in 2 of the 5 "No" rows.
6. Calculate Posterior Probabilities
For Class = Yes:
$$P(\text{Yes} \mid \text{today}) \propto \frac{3}{9} \cdot \frac{2}{9} \cdot \frac{6}{9} \cdot \frac{6}{9} \cdot \frac{9}{14} \approx 0.02116$$
For Class = No:
$$P(\text{No} \mid \text{today}) \propto \frac{2}{5} \cdot \frac{2}{5} \cdot \frac{1}{5} \cdot \frac{2}{5} \cdot \frac{5}{14} \approx 0.00457$$
7. Normalize Probabilities
To compare:
$$P(\text{Yes} \mid \text{today}) = \frac{0.02116}{0.02116 + 0.00457} \approx 0.822$$
$$P(\text{No} \mid \text{today}) = \frac{0.00457}{0.02116 + 0.00457} \approx 0.178$$
8. Final Prediction
Since:
$P(\text{Yes} \mid \text{today}) > P(\text{No} \mid \text{today})$
The model predicts: Yes (Play Golf)
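The counting behind this example can be reproduced in a few lines, which is a useful check on any hand calculation. The sketch below transcribes the 14 rows of the golf table, taking the first row (Rainy, Hot, High, False) as "No" so that it matches the sample row in the Terminology step and the 9/14 prior. The helper name `naive_bayes_scores` is hypothetical, and no smoothing is applied.

```python
from collections import Counter

# Rows: (Outlook, Temperature, Humidity, Windy, Play Golf)
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
    ("Rainy", "Mild", "Normal", True, "Yes"),
    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"),
    ("Sunny", "Mild", "High", True, "No"),
]

def naive_bayes_scores(rows, query):
    """Return the unnormalised score P(y) * prod_j P(x_j | y) for each class."""
    class_counts = Counter(row[-1] for row in rows)
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / len(rows)                   # prior P(y)
        for j, value in enumerate(query):
            matches = sum(1 for r in rows if r[-1] == label and r[j] == value)
            score *= matches / n_label                # conditional P(x_j | y)
        scores[label] = score
    return scores

scores = naive_bayes_scores(rows, ("Sunny", "Hot", "Normal", False))
total = sum(scores.values())
posteriors = {label: s / total for label, s in scores.items()}
prediction = max(posteriors, key=posteriors.get)

print(scores)
print(posteriors)
print(prediction)
```

Because the scores are divided by their sum, the posteriors add up to 1 without ever computing the evidence term explicitly.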
Naive Bayes for Continuous Features
For continuous features, we assume a Gaussian distribution:
$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left( -\frac{(x_i - \mu_y)^2}{2\sigma_y^2} \right)$$
Where:
$\mu_y$ is the mean of feature $x_i$ for class $y$
$\sigma_y^2$ is the variance of feature $x_i$ for class $y$
This leads to what is called Gaussian Naive Bayes.
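As a small sketch of how this likelihood is evaluated, the function below implements the Gaussian density and estimates the mean and variance from a handful of values for one class. The function name `gaussian_likelihood` and the sample values are hypothetical.

```python
import numpy as np

def gaussian_likelihood(x, mean, var):
    """Gaussian density of x for a feature with the given class mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# Hypothetical values of one continuous feature within a single class
values = np.array([4.9, 5.1, 5.0, 5.2, 4.8])
mu, var = values.mean(), values.var()

likelihood_at_mean = gaussian_likelihood(mu, mu, var)
likelihood_far = gaussian_likelihood(mu + 1.0, mu, var)

print(likelihood_at_mean)  # density peaks at the class mean
print(likelihood_far)      # much smaller one unit away from the mean
```

In a full classifier this value replaces the count-based $P(x_i \mid y)$ in the product for each continuous feature.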
Types of Naive Bayes Model
There are three types of Naive Bayes Model :
1. Gaussian Naive Bayes
In Gaussian Naive Bayes, continuous values associated with
each feature are assumed to be distributed according to a
Gaussian distribution. A Gaussian distribution is also
called a Normal distribution. When plotted, it gives a bell-shaped
curve which is symmetric about the mean of the feature values.
2. Multinomial Naive Bayes
Multinomial Naive Bayes is used when features represent the
frequency of terms (such as word counts) in a document. It is
commonly applied in text classification, where term frequencies
are important.
3. Bernoulli Naive Bayes
Bernoulli Naive Bayes deals with binary features, where each
feature indicates whether a word appears or not in a document. It
is suited for scenarios where the presence or absence of terms is
more relevant than their frequency. Both the Multinomial and
Bernoulli models are widely used in document classification tasks.
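A Multinomial Naive Bayes text classifier can be sketched with scikit-learn's `CountVectorizer` and `MultinomialNB`. The four-document corpus and its labels are made up for illustration; a real spam filter would need far more data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up corpus (illustrative only)
docs = ["win money now", "free prize money", "meeting at noon", "lunch with team"]
labels = ["spam", "spam", "ham", "ham"]

# Turn each document into a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# alpha=1.0 (the default) applies Laplace smoothing, which avoids zero probabilities
model = MultinomialNB()
model.fit(counts, labels)

nb_prediction = model.predict(vectorizer.transform(["free money"]))[0]
print(nb_prediction)  # spam
```

Swapping `MultinomialNB` for `BernoulliNB` would make the model use only the presence or absence of each word rather than its count.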
Advantages of Naive Bayes Classifier
Easy to implement and computationally efficient.
Effective in cases with a large number of features.
Performs well even with limited training data.
It performs well in the presence of categorical features.
For numerical features data is assumed to come from
normal distributions
Disadvantages of Naive Bayes Classifier
Assumes that features are independent, which may not
always hold in real-world data.
Can be influenced by irrelevant attributes.
May assign zero probability to unseen events, leading to
poor generalization.
Applications of Naive Bayes Classifier
Spam Email Filtering: Classifies emails as spam or non-
spam based on features.
Text Classification: Used in sentiment analysis, document
categorization and topic classification.
Medical Diagnosis: Helps in predicting the likelihood of a
disease based on symptoms.
Credit Scoring: Evaluates creditworthiness of individuals
for loan approval.
Weather Prediction: Classifies weather conditions based
on various factors.
K means Clustering – Introduction
Last Updated : 10 Nov, 2025
K-Means Clustering groups similar data points into clusters
without needing labeled data. It is used to uncover hidden
patterns when the goal is to organize data based on similarity.
Helps identify natural groupings in unlabeled datasets
Works by grouping points based on distance to cluster
centers
Commonly used in customer segmentation, image
compression, and pattern discovery
Useful when you need structure from raw, unorganized data
Working of K-Means Clustering
Suppose we are given a data set of items with certain features
and values for these features like a vector. The task is to
categorize those items into groups. To achieve this we will use
the K-means algorithm. $k$ represents the number of groups or
clusters we want to classify our items into.
The algorithm will categorize the items into $k$ groups or
clusters of similarity. To calculate that similarity we will use
the Euclidean distance as a measurement. The algorithm works
as follows:
1. Initialization: We begin by randomly selecting k cluster
centroids.
2. Assignment Step: Each data point is assigned to the
nearest centroid, forming clusters.
3. Update Step: After the assignment, we recalculate the
centroid of each cluster by averaging the points within it.
4. Repeat: This process repeats until the centroids no longer
change or the maximum number of iterations is reached.
The goal is to partition the dataset into $k$ clusters such that data
points within each cluster are more similar to each other than to
those in other clusters.
Selecting the right number of clusters is important for meaningful
segmentation. A common choice is the Elbow Method, a graphical tool
used to determine the optimal number of clusters (k) in K-means.
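The Elbow Method can be sketched with scikit-learn's `KMeans`, using its `inertia_` attribute (the sum of squared distances from each point to its assigned centroid). The blob dataset, the range of k values and `n_init=10` are assumptions made for this illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data generated around 3 centers
X, _ = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # within-cluster sum of squared distances

for k, inertia in zip(range(1, 7), inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against k should show a sharp drop up to the true number of clusters and a much flatter curve afterwards; the bend is the "elbow".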
Why Use K-Means Clustering?
K-Means is popular in a wide variety of applications due to its
simplicity, efficiency and effectiveness. Here’s why it is widely
used:
1. Data Segmentation: One of the most common uses of K-
Means is segmenting data into distinct groups. For example,
businesses use K-Means to group customers based on
behavior, such as purchasing patterns or website interaction.
2. Image Compression: K-Means can be used to reduce the
complexity of images by grouping similar pixels into clusters,
effectively compressing the image. This is useful for image
storage and processing.
3. Anomaly Detection: K-Means can be applied to detect
anomalies or outliers by identifying data points that lie
unusually far from the centroid of their assigned cluster.
4. Document Clustering: In natural language processing
(NLP), K-Means is used to group similar documents or articles
together. It’s often used in applications like recommendation
systems or news categorization.
5. Organizing Large Datasets: When dealing with large
datasets, K-Means can help in organizing the data into
smaller, more manageable chunks based on similarities,
improving the efficiency of data analysis.
Implementation of K-Means Clustering
We will use a synthetic blobs dataset and show how clusters are
formed using the Python programming language.
Step 1: Importing the necessary libraries
We will be importing the following libraries.
Numpy: for numerical operations (e.g., distance calculation).
Matplotlib: for plotting data and results.
Scikit learn: to create a synthetic dataset
using make_blobs
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
Step 2: Creating Custom Dataset
We will generate a synthetic dataset with make_blobs.
make_blobs(n_samples=500, n_features=2,
centers=3): Generates 500 data points in a 2D space,
grouped into 3 clusters.
plt.scatter(X[:, 0], X[:, 1]): Plots the dataset in 2D, showing
all the points.
plt.show(): Displays the plot
X, y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)
fig = plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1])
plt.show()
Output:
Clustering dataset
Step 3: Initializing Random Centroids
We will randomly initialize the centroids for K-Means clustering
np.random.seed(23): Ensures reproducibility by fixing the
random seed.
The for loop initializes k random centroids, with values
between -2 and 2, for a 2D dataset.
k = 3
clusters = {}
np.random.seed(23)
for idx in range(k):
    # Each coordinate of the center is drawn uniformly from [-2, 2)
    center = 2 * (2 * np.random.random((X.shape[1],)) - 1)
    cluster = {
        'center': center,
        'points': []
    }
    clusters[idx] = cluster
clusters
Output:
Random Centroids
Step 4: Plotting Random Initialized Center with Data
Points
We will now plot the data points and the initial centroids.
plt.grid(): Plots a grid.
plt.scatter(center[0], center[1], marker='*', c='red'): Plots
the cluster center as a red star (* marker).
plt.scatter(X[:, 0], X[:, 1])
plt.grid(True)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='*', c='red')
plt.show()
Output:
Data points with random center
Step 5: Defining Euclidean Distance
To assign data points to the nearest centroid, we define a
distance function:
np.sqrt(): Computes the square root of a number or array
element-wise.
np.sum(): Sums all elements in an array or along a
specified axis
def distance(p1, p2):
    return np.sqrt(np.sum((p1 - p2)**2))
Step 6: Creating Assign and Update Functions
Next, we define functions to assign points to the nearest centroid
and update the centroids based on the average of the points
assigned to each cluster.
dist.append(dis): Appends the calculated distance to the
list dist.
curr_cluster = np.argmin(dist): Finds the index of the
closest cluster by selecting the minimum distance.
new_center = points.mean(axis=0): Calculates the new
centroid by taking the mean of the points in the cluster.
def assign_clusters(X, clusters):
    for idx in range(X.shape[0]):
        dist = []
        curr_x = X[idx]
        for i in range(k):
            dis = distance(curr_x, clusters[i]['center'])
            dist.append(dis)
        curr_cluster = np.argmin(dist)
        clusters[curr_cluster]['points'].append(curr_x)
    return clusters
def update_clusters(X, clusters):
    for i in range(k):
        points = np.array(clusters[i]['points'])
        if points.shape[0] > 0:
            new_center = points.mean(axis=0)
            clusters[i]['center'] = new_center
            clusters[i]['points'] = []
    return clusters
Step 7: Predicting the Cluster for the Data Points
We create a function to predict the cluster for each data point
based on the final centroids.
pred.append(np.argmin(dist)): Appends the index of the
closest cluster (the one with the minimum distance) to pred.
def pred_cluster(X, clusters):
    pred = []
    for i in range(X.shape[0]):
        dist = []
        for j in range(k):
            dist.append(distance(X[i], clusters[j]['center']))
        pred.append(np.argmin(dist))
    return pred
Step 8: Assigning, Updating and Predicting the Cluster
Centers
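We assign points to clusters, update the centroids and predict the
cluster labels. Note that the code below runs a single assignment
and update pass; for full convergence the two calls would be
repeated until the centroids stop changing.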
assign_clusters(X, clusters): Assigns data points to the
nearest centroids.
update_clusters(X, clusters): Recalculates the centroids.
pred_cluster(X, clusters): Predicts the final clusters for all
data points.
clusters = assign_clusters(X,clusters)
clusters = update_clusters(X,clusters)
pred = pred_cluster(X,clusters)
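Step 8 above runs one assignment pass and one update pass. The "Repeat" step of the algorithm can be sketched as a loop that stops once the centroids stop moving. The version below is a self-contained, vectorized illustration rather than the article's code: the function name `run_kmeans`, the parameters `max_iter` and `tol`, the demo array `X_demo` and the choice to initialize from randomly picked data points are all assumptions made for this sketch.

```python
import numpy as np

def run_kmeans(X, k, max_iter=100, tol=1e-6, seed=23):
    """Repeat assignment and update steps until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct data points instead of random coordinates.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of each cluster; an empty cluster keeps its old center.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:
            break
    return centers, labels

# Two well-separated groups of three points each (hypothetical demo data).
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = run_kmeans(X_demo, k=2)
print(labels)
```

The same loop structure could wrap `assign_clusters` and `update_clusters` from Step 6, comparing the old and new centers after each pass.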
Step 9: Plotting Data Points with Predicted Cluster Centers
Finally, we plot the data points, colored by their predicted
clusters, along with the updated centroids.
center = clusters[i]['center']: Retrieves the center
(centroid) of the current cluster.
plt.scatter(center[0], center[1], marker='^', c='red'): Plots
the cluster center as a red triangle (^ marker).
plt.scatter(X[:, 0], X[:, 1], c=pred)
for i in clusters:
    center = clusters[i]['center']
    plt.scatter(center[0], center[1], marker='^', c='red')
plt.show()
Output:
K-means Clustering
Challenges with K-Means Clustering
K-Means algorithm has the following limitations:
Choosing the Right Number of Clusters (k): One of the
biggest challenges is deciding how many clusters to use.
Sensitive to Initial Centroids: The final clusters can vary
depending on the initial random placement of centroids.
Non-Spherical Clusters: K-Means assumes that the
clusters are spherical and equally sized. This can be a
problem when the actual clusters in the data are of different
shapes or densities.
Outliers: K-Means is sensitive to outliers, which can distort
the centroid and, ultimately, the clusters.
Hierarchical Clustering in Machine Learning
Last Updated : 08 Nov, 2025
Hierarchical Clustering is an unsupervised learning method used
to group similar data points into clusters based on their distance or
similarity. Instead of choosing the number of clusters in advance, it
builds a tree-like structure called a dendrogram that shows how
clusters merge or split at different levels. It helps identify natural
groupings in data and is commonly used in pattern recognition,
customer segmentation, gene analysis and image grouping.
Imagine we have four fruits with different weights: an apple (100g),
a banana (120g), a cherry (50g) and a grape (30g). Hierarchical
clustering starts by treating each fruit as its own group.
Start with each fruit as its own cluster.
Merge the closest items: grape (30g) and cherry (50g) are
grouped first.
Next, apple (100g) and banana (120g) are grouped.
Finally, these two clusters merge into one.
All the fruits end up in one large group, showing how
hierarchical clustering progressively combines the most similar
data points.
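The fruit example can be traced with a few lines of plain Python. This is a minimal sketch, assuming single linkage (distance between the closest members of two clusters) on the weights alone; the function name `agglomerate` is illustrative. Note that the grape/cherry gap and the apple/banana gap are both 20 g, so the first merge is a tie; the sketch orders the items by weight so that the lighter pair is merged first, matching the description above.

```python
def agglomerate(weights):
    """Single-linkage agglomerative clustering on 1-D values; returns the merge history."""
    # Every item starts as its own cluster, lightest first so ties favour lighter pairs.
    clusters = [[name] for name in sorted(weights, key=weights.get)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: gap between the closest members of the two clusters.
                d = min(abs(weights[a] - weights[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        merges.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

fruits = {"apple": 100, "banana": 120, "cherry": 50, "grape": 30}
history = agglomerate(fruits)
for members, gap in history:
    print(members, gap)
```

Each entry in `history` records which items were joined and at what distance, which is the information a dendrogram draws.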
Dendrogram
A dendrogram is like a family tree for clusters. It shows how
individual data points or groups of data merge together. The
bottom shows each data point as its own group and as we move
up, similar groups are combined. The lower the merge point, the
more similar the groups are. It helps us see how things are
grouped step by step.
Dendrogram
At the bottom of the dendrogram the points P, Q, R, S and T
are all separate.
As we move up, the closest points are merged into a single
group.
The lines connecting the points show how they are
progressively merged based on similarity.
The height at which they are connected shows how similar
the points are to each other: the lower the connection, the
more similar they are.
Types of Hierarchical Clustering
Now we understand the basics of hierarchical clustering. There are
two main types of hierarchical clustering.
1. Agglomerative Clustering
2. Divisive clustering
1. Hierarchical Agglomerative Clustering
Agglomerative clustering is also known as the bottom-up approach or
hierarchical agglomerative clustering (HAC). Bottom-up algorithms
treat each data point as a singleton cluster at the outset and then
successively merge pairs of clusters until all clusters have
been merged into a single cluster that contains all the data.
Hierarchical Agglomerative Clustering
Workflow for Hierarchical Agglomerative clustering
1. Start with individual points: Each data point is its own
cluster. For example if we have 5 data points we start with 5
clusters each containing just one data point.
2. Calculate distances between clusters: Calculate the
distance between every pair of clusters. Initially since each
cluster has one point this is the distance between the two data
points.
3. Merge the closest clusters: Identify the two clusters with the
smallest distance and merge them into a single cluster.
4. Update distance matrix: After merging we now have one
less cluster. Recalculate the distances between the new cluster
and the remaining clusters.
5. Repeat steps 3 and 4: Keep merging the closest clusters
and updating the distance matrix until we have only one cluster
left.
6. Create a dendrogram: As the process continues we can
visualize the merging of clusters using a tree-like diagram
called a dendrogram. It shows the hierarchy of how clusters are
merged.
Implementation
Let's see the implementation of Agglomerative Clustering,
Start with each data point as its own cluster.
Compute distances between all clusters.
Merge the two closest clusters based on a linkage method.
Update the distances to reflect the new cluster.
Repeat merging until the desired number of clusters or one
cluster remains.
The dendrogram visualizes these merges as a tree, showing
cluster relationships and distances.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram
from sklearn.datasets import make_blobs
X, _ = make_blobs(n_samples=30, centers=3, cluster_std=10, random_state=42)
# Flat clustering into 3 groups, used to color the scatter plot
clustering = AgglomerativeClustering(n_clusters=3)
labels = clustering.fit_predict(X)
# Full merge tree (no cut), used to draw the dendrogram
agg = AgglomerativeClustering(distance_threshold=0, n_clusters=None)
agg.fit(X)
def plot_dendrogram(model, **kwargs):
    # Count the samples under each merge node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count
    linkage_matrix = np.column_stack(
        [model.children_, model.distances_, counts]).astype(float)
    dendrogram(linkage_matrix, **kwargs)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
ax1.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=70)
ax1.set_title("Agglomerative Clustering")
ax1.set_xlabel("Feature 1")
ax1.set_ylabel("Feature 2")
plt.sca(ax2)
plot_dendrogram(agg, truncate_mode='level', p=5)
plt.title("Hierarchical Clustering Dendrogram")
plt.xlabel("Sample index")
plt.ylabel("Distance")
plt.tight_layout()
plt.show()
Output :
Agglomerative Clustering
2. Hierarchical Divisive clustering
Divisive clustering is also known as a top-down approach. Top-down
clustering requires a method for splitting a cluster. It starts
from a single cluster that contains the whole dataset and proceeds
by splitting clusters recursively until every data point is in its
own singleton cluster.
Workflow for Hierarchical Divisive clustering :
1. Start with all data points in one cluster: Treat the entire
dataset as a single large cluster.
2. Split the cluster: Divide the cluster into two smaller clusters.
The division is typically done by finding the two most dissimilar
points in the cluster and using them to separate the data into
two parts.
3. Repeat the process: For each of the new clusters, repeat
the splitting process: Choose the cluster with the most
dissimilar points and split it again into two smaller clusters.
4. Stop when each data point is in its own cluster: Continue
this process until every data point is its own cluster or the
stopping condition (such as a predefined number of clusters) is
met.
Divisive Clustering
Implementation
Let's see the implementation of Divisive Clustering,
Starts with all data points as one big cluster.
Finds the largest cluster and splits it into two using KMeans.
Repeats splitting the largest cluster until reaching the desired
number of clusters.
Returns the list of final clusters.
Visualizes data points colored by their final cluster, next to a
Ward-linkage dendrogram of the same data for reference.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import dendrogram, linkage
X, _ = make_blobs(n_samples=30, centers=5, cluster_std=10, random_state=42)
def divisive_clustering(data, max_clusters=3):
    # Start with all points in a single cluster
    clusters = [data]
    while len(clusters) < max_clusters:
        # Select the largest cluster by index; list.remove() would compare
        # NumPy arrays element-wise and can raise an error
        largest = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        cluster_to_split = clusters.pop(largest)
        # Split the selected cluster in two with KMeans
        kmeans = KMeans(n_clusters=2, random_state=42).fit(cluster_to_split)
        cluster1 = cluster_to_split[kmeans.labels_ == 0]
        cluster2 = cluster_to_split[kmeans.labels_ == 1]
        clusters.extend([cluster1, cluster2])
    return clusters
clusters = divisive_clustering(X, max_clusters=3)
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
colors = ['r', 'g', 'b', 'c', 'm', 'y']
for i, cluster in enumerate(clusters):
    plt.scatter(cluster[:, 0], cluster[:, 1], s=50,
                c=colors[i], label=f'Cluster {i+1}')
plt.title('Divisive Clustering Result')
plt.legend()
# The dendrogram is built with Ward linkage on the full data for comparison
linked = linkage(X, method='ward')
plt.subplot(1, 2, 2)
dendrogram(linked, orientation='top',
           distance_sort='descending', show_leaf_counts=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.tight_layout()
plt.show()
Output:
Divisive Clustering
Computing Distance Matrix
While merging clusters we check the distance between every pair
of clusters and merge the pair with the least distance (most
similarity). But the question is how that distance is
determined. There are different ways of defining inter-cluster
distance/similarity. Some of them are:
Min Distance: Find the minimum distance between any two
points, one taken from each cluster.
Max Distance: Find the maximum distance between any two
points, one taken from each cluster.
Group Average: Find the average distance over every pair of
points, one taken from each cluster.
Ward's Method: The similarity of two clusters is based on
the increase in squared error when two clusters are merged.
Distance Matrix
The image compares cluster distance methods:
Min uses the shortest distance between clusters
Max uses the longest
Group Average computes the mean of all pairwise distances
Ward’s method minimizes the increase in within-cluster
variance during merging
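The four definitions above can be compared on a tiny example. This is a minimal sketch using only the Python standard library; the two clusters `cluster_a` and `cluster_b` and the helper `sse` are hypothetical and chosen so the arithmetic is easy to check by hand.

```python
from itertools import product
from math import dist

def sse(points):
    """Sum of squared distances from each point to the centroid of the points."""
    centroid = tuple(sum(coord) / len(points) for coord in zip(*points))
    return sum(dist(p, centroid) ** 2 for p in points)

# Two small hypothetical clusters on a line.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

# All distances between a point of A and a point of B.
pairwise = [dist(p, q) for p, q in product(cluster_a, cluster_b)]

min_distance = min(pairwise)                      # single linkage
max_distance = max(pairwise)                      # complete linkage
group_average = sum(pairwise) / len(pairwise)     # average linkage
# Ward: how much the total squared error grows if A and B are merged.
ward_increase = sse(cluster_a + cluster_b) - sse(cluster_a) - sse(cluster_b)

print(min_distance, max_distance, group_average, ward_increase)
```

For these points the pairwise distances are 4, 6, 3 and 5, so the minimum is 3, the maximum is 6 and the group average is 4.5, while merging raises the squared error by 20.25.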
DBSCAN Clustering in ML - Density based clustering
Last Updated : 30 Oct, 2025
DBSCAN is a density-based clustering algorithm that groups data
points that are closely packed together and marks outliers as noise
based on their density in the feature space. It identifies clusters as
dense regions in the data space separated by areas of lower
density. Unlike K-Means or hierarchical clustering, which assume
clusters are compact and spherical, DBSCAN performs well in
handling real-world data irregularities such as:
Arbitrary-Shaped Clusters: Clusters can take any shape not
just circular or convex.
Noise and Outliers: It effectively identifies and handles noise
points without assigning them to any cluster.
Data Sets with clustering algorithms
The figure above compares clustering algorithms on the same data
sets: K-Means and hierarchical clustering handle compact, spherical
clusters with varying noise tolerance, while DBSCAN handles
arbitrary-shaped clusters and noise.
Key Parameters in DBSCAN
1. eps: This defines the radius of the neighborhood around a data
point. If the distance between two points is less than or equal
to eps they are considered neighbors. A common method to
determine eps is by analyzing the k-distance graph. Choosing the
right eps is important:
If eps is too small most points will be classified as noise.
If eps is too large clusters may merge and the algorithm may
fail to distinguish between them.
2. MinPts: This is the minimum number of points required within
the eps radius to form a dense region. A general rule of thumb is
to set MinPts >= D+1 where D is the number of dimensions in the
dataset.
For most cases a minimum value of MinPts = 3 is recommended.
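The k-distance graph mentioned for choosing eps can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions: the helper name `k_distances`, the synthetic two-group data and the choice k = 4 are hypothetical and not taken from the article.

```python
import numpy as np

def k_distances(X, k):
    """Return each point's distance to its k-th nearest neighbour, sorted ascending."""
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k is the k-th nearest other point.
    kth = np.sort(dists, axis=1)[:, k]
    return np.sort(kth)

# Hypothetical data: two compact groups of 50 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

curve = k_distances(X, k=4)
print(curve[:3], curve[-3:])
```

Plotting `curve` usually shows a long flat stretch followed by a sharp rise; a value of eps near the start of that rise is a common starting point.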
How Does DBSCAN Work?
DBSCAN works by categorizing data points into three types:
1. Core points which have a sufficient number of neighbors
within a specified radius (epsilon)
2. Border points which are near core points but lack enough
neighbors to be core points themselves
3. Noise points which do not belong to any cluster.
By iteratively expanding clusters from core points and connecting
density-reachable points, DBSCAN forms clusters without relying
on rigid assumptions about their shape or size.
DBSCAN Algorithm
Steps in the DBSCAN Algorithm
1. Identify Core Points: For each point in the dataset count the
number of points within its eps neighborhood. If the count
meets or exceeds MinPts mark the point as a core point.
2. Form Clusters: For each core point that is not already
assigned to a cluster create a new cluster. Recursively find
all density-connected points, i.e. points within the eps radius of
the core point and add them to the cluster.
3. Density Connectivity: Two points a and b are density-
connected if there exists a chain of points where each point is
within the eps radius of the next and at least one point in the
chain is a core point. This chaining process ensures that all
points in a cluster are connected through a series of dense
regions.
4. Label Noise Points: After processing all points any point that
does not belong to a cluster is labeled as noise.
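The core, border and noise definitions used in these steps can be checked on a handful of points. This is a minimal standard-library sketch, not the scikit-learn implementation; the function name `classify_points` and the sample points are hypothetical. It counts a point as part of its own neighborhood, which matches how scikit-learn documents `min_samples`.

```python
from math import dist

def classify_points(points, eps, min_pts):
    """Label each point 'core', 'border' or 'noise' using the DBSCAN definitions."""
    # Indices of all points within eps of each point (the point itself included).
    neighbours = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    core = {i for i, nb in enumerate(neighbours) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbours):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not dense enough itself, but within eps of a core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = classify_points(points, eps=1.0, min_pts=3)
print(labels)
```

Here the four corner points of the unit square are core points, (2, 1) is a border point because its only neighbor (1, 1) is a core point, and (8, 8) is noise.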
Implementation of DBSCAN Algorithm In Python
Here we'll use the Python library sklearn to compute DBSCAN and
the matplotlib.pyplot library for visualizing clusters.
Step 1: Importing Libraries
We import all the necessary libraries
like numpy, matplotlib and scikit-learn.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn import datasets
Step 2: Preparing Dataset
We will create a dataset of 4 clusters using make_blobs. The
dataset has 300 points that are grouped into 4 visible clusters.
X, y_true = make_blobs(n_samples=300, centers=4,
cluster_std=0.50, random_state=0)
Step 3: Applying DBSCAN Clustering
Now we apply DBSCAN clustering on our data, count it and
visualize it using the matplotlib library.
eps=0.3: The radius to look for neighboring points.
min_samples: Minimum number of points required to form a
dense region (a cluster).
labels: Cluster numbers for each point. -1 means the point is
considered noise.
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
# Boolean mask marking which samples are core points
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters, ignoring the noise label (-1) if present
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
unique_labels = sorted(set(labels))
colors = ['y', 'b', 'g', 'r']
for k in unique_labels:
    # Noise is drawn in black; colors repeat if there are more than four clusters
    col = 'k' if k == -1 else colors[k % len(colors)]
    class_member_mask = (labels == k)
    # Core points of this cluster
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k',
             markersize=6)
    # Non-core (border or noise) points with this label
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k',
             markersize=6)
plt.title('number of clusters: %d' % n_clusters_)
plt.show()
Output:
Cluster of dataset
As shown in the output image above, clusters are shown in different
colours like yellow, blue, green and red.
Step 4: Evaluation Metrics For DBSCAN Algorithm In
Machine Learning
We will use the Silhouette score and the Adjusted Rand score for
evaluating clustering algorithms.
The Silhouette score is in the range of -1 to 1. A score near 1
is best, meaning that data points are compact within the cluster
to which they belong and far away from the other clusters. The
worst value is -1. Values near 0 denote overlapping clusters.
The Adjusted Rand Index compares the predicted labels with the
true labels. Its maximum is 1 for a perfect match, values near 0
indicate a labeling no better than random, and it can be slightly
negative. More than 0.9 denotes excellent cluster recovery and
above 0.8 is a good recovery. Less than 0.5 is considered to be
poor recovery.
sc = metrics.silhouette_score(X, labels)
print("Silhouette Coefficient:%0.2f" % sc)
ari = metrics.adjusted_rand_score(y_true, labels)
print("Adjusted Rand Index: %0.2f" % ari)
Output:
Silhouette Coefficient:0.13
Adjusted Rand Index: 0.31
Black points represent outliers. By changing the eps and the
MinPts we can change the cluster configuration.
When Should We Use DBSCAN Over K-Means
Clustering?
DBSCAN and K-Means are both clustering algorithms that group
together data points with similar characteristics. However they
work on different principles and are suitable for different types of
data. We prefer to use DBSCAN when the data is not spherical in
shape or the number of clusters is not known beforehand.
| DBSCAN | K-Means |
|---|---|
| In DBSCAN we need not specify the number of clusters. | It is very sensitive to the number of clusters (k), which must be specified in advance. |
| Clusters formed in DBSCAN can be of any arbitrary shape. | Clusters formed are spherical or convex in shape. |
| It can work well with datasets having noise and outliers. | It does not work well with outliers. Outliers can skew the clusters in K-Means to a very large extent. |
| In DBSCAN two parameters are required for training the model. | In K-Means only one parameter is required for training the model. |
DBSCAN Vs K-Means
DBSCAN is the better choice when clusters have arbitrary shapes or
the data contains noise, as it can identify such clusters and
handle noise effectively. K-Means on the other hand is better
suited for data with well-defined, spherical clusters and is less
effective with noise or complex cluster structures.
Apriori Algorithm
Last Updated : 21 Nov, 2025
Apriori Algorithm is a basic method used in data analysis to find
groups of items that often appear together in large sets of data. It
helps to discover useful patterns or rules about how items are
related which is particularly valuable in market basket analysis.
For example, if many customers in a grocery store buy bread and
butter together, the store can use this information to place these
items closer together or create special offers. This helps the
store sell more and keep customers happy.
How the Apriori Algorithm Works?
The Apriori Algorithm operates through a systematic process that
involves several key steps:
1. Identifying Frequent Item-Sets
The Apriori algorithm starts by looking through all the data to
count how many times each single item appears. These single
items are called 1-Item-Sets.
Next it uses a threshold called minimum support. This is a
number that tells us how often an item or group of items needs
to appear to be considered important. If an item appears often
enough, meaning its support is at or above this minimum, it is
called a frequent Item-Set.
2. Creating Possible Item Group
After finding the single items that appear often enough
(frequent 1-item groups) the algorithm combines them to create
pairs of items (2-item groups). Then it checks which pairs are
frequent by seeing if they appear enough times in the data.
This process keeps going step by step making groups of 3
items, then 4 items and so on. The algorithm stops when it can’t
find any bigger groups that happen often enough.
3. Removing Infrequent Item Groups
The Apriori algorithm uses a helpful rule to save time. This
rule says: if a group of items does not appear often enough
then any larger group that includes these items will also not
appear often enough.
Because of this, the algorithm does not check those larger
groups. This way it avoids wasting time looking at groups that
cannot be frequent, which makes the whole process faster.
4. Generating Association Rules
The algorithm makes rules to show how items are related.
It checks these rules using support, confidence and lift to find
the strongest ones.
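The level-wise search and pruning described in steps 1 to 3 can be sketched in plain Python. This is a minimal illustration, not an optimized implementation; the function name `frequent_itemsets` is illustrative, and the five transactions are hypothetical data chosen to be consistent with the counts quoted in the worked example later in this article, not the article's actual table.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Level-wise Apriori search; returns {itemset: support} for all frequent itemsets."""
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]
    current = [s for s in current if support(s) >= min_support]

    frequent = {}
    size = 1
    while current:
        frequent.update((s, support(s)) for s in current)
        size += 1
        # Join: combine frequent itemsets into candidates one item larger.
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Prune: a candidate survives only if all of its smaller subsets are frequent.
        candidates = [c for c in candidates
                      if all(frozenset(sub) in frequent
                             for sub in combinations(c, size - 1))]
        current = [c for c in candidates if support(c) >= min_support]
    return frequent

# Hypothetical transactions (assumed for illustration only).
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Milk"},
]
result = frequent_itemsets(transactions, min_support=0.5)
for itemset, value in result.items():
    print(sorted(itemset), value)
```

With a 50% minimum support, the three single items and the pair {Bread, Milk} are frequent, and no 3-item candidate is generated because only one frequent pair remains.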
Key Metrics of Apriori Algorithm
Support: This metric measures how frequently an item or
combination of items appears in the dataset relative to the total
number of transactions. A higher support indicates a more
significant presence of the Item-Set in the dataset. For example,
bread is bought in 20% of all transactions.
Confidence: Confidence assesses the likelihood that an item
Y is purchased when item X is purchased. It provides insight
into the strength of the association between two items. For
example, if bread is bought, butter is bought 75% of the time.
Lift: Lift evaluates how much more likely two items are to be
purchased together compared to being purchased
independently. A lift greater than 1 suggests a positive
association. For example, bread and butter are much more likely
to be bought together than by chance.
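The three metrics can be written as short functions. This is a minimal sketch: the function names and the five transactions are hypothetical (the data is chosen to be consistent with the counts in the worked example that follows, not copied from the article's table), and lift is computed with the standard definition, confidence divided by the support of the consequent.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def confidence(x, y, transactions):
    """How often y appears in transactions that contain x."""
    return support(set(x) | set(y), transactions) / support(x, transactions)

def lift(x, y, transactions):
    """Confidence of x -> y relative to how often y appears on its own."""
    return confidence(x, y, transactions) / support(y, transactions)

# Hypothetical transactions (assumed for illustration only).
transactions = [
    {"Bread", "Butter", "Milk"},
    {"Bread", "Butter"},
    {"Bread", "Milk"},
    {"Butter", "Milk"},
    {"Bread", "Milk"},
]

bread_milk_support = support({"Bread", "Milk"}, transactions)
bread_to_milk_conf = confidence({"Bread"}, {"Milk"}, transactions)
bread_to_milk_lift = lift({"Bread"}, {"Milk"}, transactions)
print(round(bread_milk_support, 2), round(bread_to_milk_conf, 2), round(bread_to_milk_lift, 4))
```

In this particular toy data the lift of Bread -> Milk comes out slightly below 1, which shows that a rule can have high confidence simply because the consequent is common.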
Example
Let's understand the concept of the Apriori Algorithm with the help
of an example. Consider the following dataset, for which we will
find frequent Item-Sets and generate association rules:
Transactions of a Grocery Shop
Step 1 : Setting the parameters
Minimum Support Threshold: 50% (item must appear in at
least 3/5 transactions). Support is computed as:
$$Support(A) = \frac{\text{Number of transactions containing itemset } A}{\text{Total number of transactions}}$$
Minimum Confidence Threshold: 70% (you can change
the value of parameters as per the use case and problem
statement). Confidence is computed as:
$$Confidence(X \rightarrow Y) = \frac{Support(X \cup Y)}{Support(X)}$$
Step 2: Find Frequent 1-Item-Sets
Let's count how many transactions include each item in the dataset
(calculating the frequency of each item).
Frequent 1-Itemsets
All items have support ≥ 50%, so they qualify as frequent
1-Item-Sets. If any item had support < 50%, it would be omitted
from the frequent 1-Item-Sets.
Step 3: Generate Candidate 2-Item-Sets
Combine the frequent 1-Item-Sets into pairs and calculate their
support. For this use case we get 3 item pairs, (Bread, Butter),
(Bread, Milk) and (Butter, Milk), and calculate their support as in
step 2.
Candidate 2-Itemsets
Frequent 2-Item-Sets: {Bread, Milk} meets the 50% threshold but
{Butter, Milk} and {Bread, Butter} do not meet the threshold, so
they are omitted.
Step 4: Generate Candidate 3-Item-Sets
Combine the 2-Item-Sets into groups of 3 and calculate
their support. For the triplet there is only one case, i.e.
{Bread, Butter, Milk}, and we calculate its support.
Candidate 3-Itemsets
Since this does not meet the 50% threshold, there are no frequent
3-Item-Sets. Strictly, the Apriori property already rules this
candidate out, because its subsets {Bread, Butter} and
{Butter, Milk} are not frequent.
Step 5: Generate Association Rules
Now we generate rules from the frequent Item-Sets and calculate
confidence.
Rule 1: If Bread -> Butter (if customer buys bread, the customer will
buy butter also)
Support of {Bread, Butter} = 2.
Support of {Bread} = 4.
Confidence = 2/4 = 50% (Failed threshold).
Rule 2: If Butter -> Bread (if customer buys butter, the customer will
buy bread also)
Support of {Bread, Butter} = 2.
Support of {Butter} = 3.
Confidence = 2/3 = 66.67% (Failed threshold, since 66.67% < 70%).
Rule 3: If Bread -> Milk (if customer buys bread, the customer will buy
milk also)
Support of {Bread, Milk} = 3.
Support of {Bread} = 4.
Confidence = 3/4 = 75% (Passes threshold).
The Apriori Algorithm, as demonstrated in the bread-butter
example, is widely used in modern startups like Zomato, Swiggy
and other food delivery platforms. These companies use it to
perform market basket analysis which helps them identify
customer behaviour patterns and optimise recommendations.
Applications of Apriori Algorithm
Below are some applications of Apriori algorithm used in today's
companies and startups
1. E-commerce: Used to recommend products that are often
bought together like laptop + laptop bag, increasing sales.
2. Food Delivery Services: Identifies popular combos such as
burger + fries to offer combo deals to customers.
3. Streaming Services: Recommends related movies or shows
based on what users often watch together like action +
superhero movies.
4. Financial Services: Analyzes spending habits to suggest
personalised offers such as credit card deals based on frequent
purchases.
5. Travel & Hospitality: Creates travel packages like flight +
hotel by finding commonly purchased services together.
6. Health & Fitness: Suggests workout plans or supplements
based on users past activities like protein shakes + workouts.
Principal Component Analysis (PCA)
Last Updated : 13 Nov, 2025
PCA (Principal Component Analysis) is a dimensionality reduction
technique that helps us reduce the number of features in a
dataset while keeping the most important information. It simplifies
complex datasets by transforming correlated features into a
smaller set of uncorrelated components.
Principal Component Analysis (PCA)
It helps us to remove redundancy, improve computational
efficiency and make data easier to visualize and analyze.
How Principal Component Analysis Works
PCA uses linear algebra to transform data into new features called
principal components. It finds these by calculating eigenvectors
(directions) and eigenvalues (importance) from the covariance
matrix. PCA selects the top components with the highest
eigenvalues and projects the data onto them to simplify the dataset.
Note: It prioritizes the directions where the data varies
the most because more variation = more useful information.
Imagine you’re looking at a messy cloud of data points like stars in
the sky and want to simplify it. PCA helps you find the "most
important angles" to view this cloud so you don’t miss the big
patterns. Here’s how it works step by step:
Step 1: Standardize the Data
Different features may have different units and scales like salary
vs. age. To compare them fairly PCA first standardizes the data by
making each feature have:
A mean of 0
A standard deviation of 1
$$Z = \frac{X - \mu}{\sigma}$$
where:
$\mu$ is the mean of independent features, $\mu = \{\mu_1, \mu_2, \cdots, \mu_m\}$
$\sigma$ is the standard deviation of independent features, $\sigma = \{\sigma_1, \sigma_2, \cdots, \sigma_m\}$
Step 2: Calculate Covariance Matrix
Next PCA calculates the covariance matrix to see how features
relate to each other whether they increase or decrease together.
The covariance between two features $x_1$ and $x_2$ is:
$$cov(x_1, x_2) = \frac{\sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)}{n - 1}$$
Where:
$\bar{x}_1$ and $\bar{x}_2$ are the mean values of
features $x_1$ and $x_2$
$n$ is the number of data points
The value of covariance can be positive, negative or zero.
Step 3: Find the Principal Components
PCA identifies new axes where the data spreads out the most:
1st Principal Component (PC1): The direction of maximum
variance (most spread).
2nd Principal Component (PC2): The next best
direction, perpendicular to PC1 and so on.
These directions come from the eigenvectors of the covariance
matrix and their importance is measured by eigenvalues. For a
square matrix A an eigenvector X (a non-zero vector) and its
corresponding eigenvalue λ satisfy:
$$AX = \lambda X$$
This means:
When A acts on X it only stretches or shrinks X by the
scalar λ.
The direction of X remains unchanged hence eigenvectors
define "stable directions" of A.
Eigenvalues help rank these directions by importance.
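The relation AX = λX can be verified numerically. This is a small sketch using NumPy's `np.linalg.eigh`, which is intended for symmetric matrices such as covariance matrices; the 2x2 matrix `A` is a hypothetical example.

```python
import numpy as np

# A small symmetric matrix (hypothetical example).
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh returns eigenvalues in ascending order and eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(A)

for lam, x in zip(eigvals, eigvecs.T):
    # Multiplying by A only rescales x by lam; the direction is unchanged.
    print(lam, np.allclose(A @ x, lam * x))
```

For this matrix the eigenvalues are 1 and 3, so the direction belonging to the eigenvalue 3 is the one that would be ranked first.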
Step 4: Pick the Top Directions & Transform Data
After calculating the eigenvalues and eigenvectors PCA ranks
them by the amount of information they capture. We then:
1. Select the top k components that capture most of the
variance like 95%.
2. Transform the original dataset by projecting it onto these top
components.
This means we reduce the number of features (dimensions) while
keeping the important patterns in the data.
Transform this 2D dataset into a 1D representation while
preserving as much variance as possible.
In the above image the original dataset has two features "Radius"
and "Area" represented by the black axes. PCA identifies two new
directions: PC₁ and PC₂ which are the principal components.
These new axes are rotated versions of the original ones.
PC₁ captures the maximum variance in the data meaning it
holds the most information while PC₂ captures the remaining
variance and is perpendicular to PC₁.
The spread of data is much wider along PC₁ than along PC₂.
This is why PC₁ is chosen for dimensionality reduction. By
projecting the data points (blue crosses) onto PC₁ we
effectively transform the 2D data into 1D and retain most of the
important structure and patterns.
Implementation of Principal Component Analysis in Python
PCA uses a linear transformation that preserves as much of the
variance in the data as possible using the fewest dimensions.
The implementation involves the following steps:
Step 1: Importing Required Libraries
We import the necessary libraries: pandas, numpy and scikit-learn,
plus seaborn and matplotlib to visualize results.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
Step 2: Creating Sample Dataset
We make a small dataset with three features (Height, Weight and
Age) and a target label, Gender.
data = {
    'Height': [170, 165, 180, 175, 160, 172, 168, 177, 162, 158],
    'Weight': [65, 59, 75, 68, 55, 70, 62, 74, 58, 54],
    'Age': [30, 25, 35, 28, 22, 32, 27, 33, 24, 21],
    'Gender': [1, 0, 1, 1, 0, 1, 0, 1, 0, 0]  # 1 = Male, 0 = Female
}
df = pd.DataFrame(data)
print(df)
Output:
Dataset
Step 3: Standardizing the Data
Since the features have different scales (for example, Height vs
Age), we standardize the data. This gives every feature mean = 0 and
standard deviation = 1 so that no feature dominates just because
of its units.
X = df.drop('Gender', axis=1)  # features: Height, Weight, Age
y = df['Gender']               # target label
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
Step 4: Applying PCA algorithm
We reduce the data from 3 features to 2 new features called
principal components. These components capture most of the
original information but in fewer dimensions.
We split the data into 70% training and 30% testing sets.
We train a logistic regression model on the reduced training
data and predict gender labels on the test set.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y, test_size=0.3, random_state=42)
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
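To see how much of the original information the two components actually keep, one option is to inspect scikit-learn's `explained_variance_ratio_` attribute on the fitted PCA object. The sketch below rebuilds the same three feature columns as a plain array so that it runs on its own; it is an optional check, not part of the original walkthrough.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Same Height, Weight, Age values as the sample dataset, as a plain array
X = np.array([
    [170, 65, 30], [165, 59, 25], [180, 75, 35], [175, 68, 28], [160, 55, 22],
    [172, 70, 32], [168, 62, 27], [177, 74, 33], [162, 58, 24], [158, 54, 21],
], dtype=float)

X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_scaled)

# Fraction of total variance captured by each kept component, largest first
ratio = pca.explained_variance_ratio_
print(ratio)        # per-component share of variance
print(ratio.sum())  # total share kept by the 2 components
```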
Step 5: Evaluating with Confusion Matrix
The confusion matrix compares actual vs predicted labels. This
makes it easy to see where predictions were correct or wrong.
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(5, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Female', 'Male'], yticklabels=['Female', 'Male'])
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title('Confusion Matrix')
plt.show()
Output:
Confusion matrix
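For readers who want the numbers behind the heatmap, here is a minimal sketch of how the cells of a binary confusion matrix are read. The labels `y_true` and `y_pred` are hypothetical and are not the output of the model above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (0 = Female, 1 = Male)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

# For a binary problem the cells unpack as TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
accuracy = (tn + tp) / cm.sum()

print(cm)        # [[2 0]
                 #  [1 2]]
print(accuracy)  # 0.8
```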
Step 6: Visualizing PCA Result
y_numeric = pd.factorize(y)[0]
plt.figure(figsize=(12, 5))

# Left panel: first two standardized features
plt.subplot(1, 2, 1)
plt.scatter(X_scaled[:, 0], X_scaled[:, 1], c=y_numeric,
            cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Original Feature 1')
plt.ylabel('Original Feature 2')
plt.title('Before PCA: Using First 2 Standardized Features')
plt.colorbar(label='Target classes')

# Right panel: data projected onto the two principal components
plt.subplot(1, 2, 2)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_numeric,
            cmap='coolwarm', edgecolor='k', s=80)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('After PCA: Projected onto 2 Principal Components')
plt.colorbar(label='Target classes')

plt.tight_layout()
plt.show()
Output:
PCA Algorithm
Left Plot Before PCA: This shows the original
standardized data plotted using the first two features. There
is no guarantee of clear separation between classes as these
are raw input dimensions.
Right Plot After PCA: This displays the transformed
data using the top 2 principal components. These new
components capture the maximum variance, which often makes
class structure easier to see and model. PCA does not use the
labels, so better class separation is not guaranteed.
Advantages of Principal Component Analysis
1. Multicollinearity Handling: Creates new, uncorrelated
variables to address issues when original features are highly
correlated.
2. Noise Reduction: Eliminating components with low variance
can remove noise and enhance data clarity.
3. Data Compression: Represents data with fewer components,
reducing storage needs and speeding up processing.
4. Outlier Detection: Identifies unusual data points by showing
which ones deviate significantly in the reduced space.
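The first advantage can be checked directly. The sketch below builds hypothetical synthetic data in which two features are almost perfectly correlated, then shows that the principal component scores are uncorrelated with each other.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)

# Hypothetical data: feature 2 is roughly 2x feature 1, feature 3 is independent
a = rng.normal(size=200)
X = np.column_stack([a,
                     2 * a + 0.1 * rng.normal(size=200),
                     rng.normal(size=200)])

# Correlation between the original features
corr_before = np.corrcoef(X, rowvar=False)

# Correlation between the principal component scores
X_pca = PCA(n_components=3).fit_transform(X)
corr_after = np.corrcoef(X_pca, rowvar=False)

# Largest off-diagonal correlation after PCA (numerically zero)
max_off_diag = np.abs(corr_after - np.eye(3)).max()

print(corr_before[0, 1])  # close to 1: strong multicollinearity
print(max_off_diag)       # close to 0: components are uncorrelated
```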
Disadvantages of Principal Component Analysis
1. Interpretation Challenges: The new components are
combinations of original variables which can be hard to explain.
2. Data Scaling Sensitivity: Requires proper scaling of data
before application or results may be misleading.
3. Information Loss: Reducing dimensions may lose some
important information if too few components are kept.
4. Assumption of Linearity: Works best when relationships
between variables are linear and may struggle with non-linear
data.
5. Computational Complexity: Can be slow and resource-
intensive on very large datasets.
6. Risk of Overfitting: Using too many components or working
with a small dataset might lead to models that don't generalize
well.
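The scaling sensitivity listed above is easy to demonstrate. This is a sketch on hypothetical data with two independent features, one measured in much larger units than the other; without standardization the first component simply follows the large-unit feature.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features on very different scales
X = np.column_stack([rng.normal(0, 1000, 300),   # large units
                     rng.normal(0, 1, 300)])     # small units

# PCA on raw data: PC1 is dominated by the large-scale feature
raw = PCA(n_components=2).fit(X)

# PCA after standardization: variance is shared far more evenly
scaled = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

print(raw.explained_variance_ratio_)     # first entry very close to 1
print(scaled.explained_variance_ratio_)  # roughly balanced
```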