CS3491
ARTIFICIAL INTELLIGENCE
AND
MACHINE LEARNING
SUPERVISED LEARNING Unit - 3
SUPERVISED LEARNING
Introduction to machine learning – Linear Regression Models:
Least squares, single & multiple variables, Bayesian linear
regression, gradient descent, Linear Classification Models:
Discriminant function – Probabilistic discriminative model -
Logistic regression, Probabilistic generative model – Naive
Bayes, Maximum margin classifier – Support vector machine,
Decision Tree, Random forests
Machine Learning
Machine learning is the process of training a model to make useful
predictions or generate content from data.
ML - core concepts
● Data
● Model
● Training
● Evaluating
● Inference
● Data
Store related data in datasets
Datasets are made up of individual examples that contain
features and a label.
A dataset is characterized by its size and diversity.
● Model
A model is the complex collection of numbers that define the
mathematical relationship from specific input feature patterns to
specific output label values. The model discovers these patterns
through training.
● Training
Before a supervised model can make predictions, it must be trained.
To train a model, we give the model a dataset with labeled
examples.
Figure: An ML model making a prediction from a labeled example.
Figure: An ML model updating its predicted value.
Figure: An ML model updating its predictions for each labeled example in the training
dataset.
● Evaluating
Figure: Evaluating an ML model by comparing its predictions to the actual values.
Supervised Learning
Supervised learning is a type of machine learning where a model learns from labeled
data, meaning the input data comes with corresponding correct output or "label" values.
Regression
A regression model predicts a numeric value.
Classification
A classification model predicts which category (class) an example belongs to.
Linear Regression
Linear regression is a statistical technique used to find the relationship
between variables. In an ML context, linear regression finds the relationship
between features and a label.
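Figure: Creating a model by drawing a best-fit line through the data points.
Linear regression equation
$y' = b + w_1 x_1$
where y' is the predicted label, b is the bias (y-intercept), w_1 is the weight
(slope), and x_1 is the feature.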
❏ In this example, calculate the weight and bias from the line drawn.
❏ The bias is 30 (where the line intersects the y-axis), and the weight is -3.6
(the slope of the line). The model is defined as 𝑦′=30+(−3.6)(𝑥1), where 𝑥1 is
the car's weight in thousands of pounds.
❏ Use the model to make predictions. For instance, a 4,000-pound car would
have a predicted fuel efficiency of 30+(−3.6)(4) = 15.6 miles per gallon.
Models with multiple features
For example, a model that predicts gas mileage could additionally use features
such as the following:
● Engine displacement
● Acceleration
● Number of cylinders
● Horsepower
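The single-feature and multi-feature models can be sketched in a few lines of Python. This is only an illustration: the bias and weight of the single-feature model come from the example above, while the multi-feature weights, bias, and car values are hypothetical numbers chosen for the sketch, not fitted values.

```python
import numpy as np

def predict_single(x1, bias=30.0, weight=-3.6):
    """Single-feature model y' = b + w1*x1 (x1 in thousands of pounds)."""
    return bias + weight * x1

def predict_multi(features, weights, bias):
    """Multi-feature model y' = b + w1*x1 + ... + wn*xn."""
    return bias + float(np.dot(weights, features))

# A 4,000-pound car with the model from the slide
mpg_single = predict_single(4.0)

# Hypothetical weights for: weight (1000 lb), displacement, acceleration,
# number of cylinders, horsepower. These are NOT fitted values.
weights = np.array([-3.6, -0.01, 0.1, -0.2, -0.02])
car = np.array([4.0, 300.0, 12.0, 8.0, 150.0])
mpg_multi = predict_multi(car, weights, bias=35.0)

print(round(mpg_single, 1), round(mpg_multi, 1))
```

Each extra feature simply adds one more weight-times-feature term to the sum, which is why the dot product is a natural way to write the multi-feature model.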
Mean Squared Error (MSE):
$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
● N: total number of data points
● y_i: the actual (true) value for the ith data point
● ŷ_i: the value predicted by the model for the ith data point
● (y_i − ŷ_i): the error (residual), how far off the prediction is
● (y_i − ŷ_i)²: the squared error; squaring ensures all errors are non-negative
● Σ: summing up all the squared errors
Example
Actual: y = [3, 5, 2]
Predicted: ŷ = [2.5, 5.3, 1.7]
1. (3 − 2.5)² = 0.25
2. (5 − 5.3)² = 0.09
3. (2 − 1.7)² = 0.09
MSE = ⅓ (0.25 + 0.09 + 0.09) = 0.43 / 3 ≈ 0.143
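The same calculation can be checked in code. A minimal sketch, assuming NumPy is available; the function name `mean_squared_error` is chosen here for illustration.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Values from the worked example
mse = mean_squared_error([3, 5, 2], [2.5, 5.3, 1.7])
print(round(mse, 3))  # 0.143
```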
Least Square Method
The least squares method is a statistical technique used to find the
equation of the best-fitting curve or line for a set of data points by minimizing
the sum of the squared differences between the observed values and the
values predicted by the model.
Regression line / line of best fit
$y = mx + c$
Slope
$m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2}$
Intercept
$c = \frac{\sum y - m \sum x}{n}$
Example
The table below provides the monthly average petrol prices from April (Month 4) to
September (Month 9).
Month (x)    4    5    6    7    8    9
Price (y)   77   78   81   80   82   83
a. Using linear regression, calculate the best-fit line for the given data.
b. Predict the petrol price in December (Month 12).
c. Interpret the goodness of the regression line using R².
Solution to (c)
R² measures how well the linear regression line fits the data by quantifying the
proportion of variance in the dependent variable (petrol price) explained by the
independent variable (month).
Best-fit line: ŷ ≈ 72.55 + 1.17x
SST (total sum of squares) ≈ 26.83
SSR (sum of squared residuals) ≈ 2.82
R² = 1 - (2.82/26.83) ≈ 0.895
About 89.5% of the variance in petrol price is explained by the month, indicating a strong fit.
Reference table for R² calculation
Month (x)   Actual (y)   Predicted (ŷ)   Residual (y - ŷ)   Squared Residual
4           77           77.24           -0.24              0.06
5           78           78.41           -0.41              0.17
6           81           79.58            1.42              2.01
7           80           80.75           -0.75              0.57
8           82           81.92            0.08              0.01
9           83           83.10           -0.10              0.01
Total SSR ≈ 2.82
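The whole petrol-price example can be reproduced with a short script. The sketch below assumes NumPy is available and takes the month and price values from the reference table; the variable names are chosen for illustration. It applies the standard least-squares formulas for slope and intercept, predicts Month 12, and computes R² as 1 - SSR/SST.

```python
import numpy as np

months = np.array([4, 5, 6, 7, 8, 9], dtype=float)
prices = np.array([77, 78, 81, 80, 82, 83], dtype=float)
n = len(months)

# (a) Least-squares slope and intercept
slope = (n * np.sum(months * prices) - np.sum(months) * np.sum(prices)) / (
    n * np.sum(months ** 2) - np.sum(months) ** 2
)
intercept = (np.sum(prices) - slope * np.sum(months)) / n

# (b) Prediction for December (Month 12)
december = intercept + slope * 12

# (c) Goodness of fit
predicted = intercept + slope * months
ss_res = np.sum((prices - predicted) ** 2)       # sum of squared residuals
ss_tot = np.sum((prices - prices.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.4f} intercept={intercept:.4f}")
print(f"December prediction={december:.2f}")
print(f"SSR={ss_res:.4f} SST={ss_tot:.4f} R^2={r_squared:.3f}")
```

Extrapolating to Month 12 assumes the linear trend continues beyond the observed range, which the data alone cannot confirm.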
Bayesian linear regression
Bayesian regression is a probabilistic approach to regression where we treat model
parameters (like weights) as random variables, not fixed values.
We apply Bayes’ theorem to update our beliefs about the parameters after seeing the
data.
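Model Assumption (Like Ordinary Linear Regression)
$y = \mathbf{w}^\top \mathbf{x} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$
Prior over Weights w
$p(\mathbf{w})$, commonly a Gaussian such as $\mathcal{N}(\mathbf{0}, \alpha^{-1} I)$
Likelihood Function
$p(\mathbf{y} \mid X, \mathbf{w}) = \prod_{i=1}^{N} \mathcal{N}(y_i \mid \mathbf{w}^\top \mathbf{x}_i, \sigma^2)$
Posterior over Weights
$p(\mathbf{w} \mid X, \mathbf{y}) = \frac{p(\mathbf{y} \mid X, \mathbf{w}) \, p(\mathbf{w})}{p(\mathbf{y} \mid X)}$
Making Predictions
$p(y_* \mid \mathbf{x}_*, X, \mathbf{y}) = \int p(y_* \mid \mathbf{x}_*, \mathbf{w}) \, p(\mathbf{w} \mid X, \mathbf{y}) \, d\mathbf{w}$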
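These steps have a closed-form solution in one common special case. The sketch below assumes a zero-mean Gaussian prior with precision `alpha` and Gaussian noise with known precision `beta` (both are assumptions made for this illustration, as are the toy data and function names). Under those assumptions the posterior over the weights is Gaussian with covariance S_N = (alpha·I + beta·XᵀX)⁻¹ and mean m_N = beta·S_N·Xᵀy.

```python
import numpy as np

def bayesian_linear_regression(X, y, alpha, beta):
    """Posterior N(m_N, S_N) over weights for prior N(0, I/alpha), noise precision beta."""
    d = X.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    m_N = beta * S_N @ X.T @ y
    return m_N, S_N

def predict(x_new, m_N, S_N, beta):
    """Predictive mean and variance for a new input (including noise variance 1/beta)."""
    mean = float(m_N @ x_new)
    var = float(1.0 / beta + x_new @ S_N @ x_new)
    return mean, var

# Toy data lying exactly on y = 1 + 2x; the first column of X is the bias term
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x

# A very small alpha means a weak prior, so the posterior mean is close to least squares
m_N, S_N = bayesian_linear_regression(X, y, alpha=1e-6, beta=25.0)
mean, var = predict(np.array([1.0, 4.0]), m_N, S_N, beta=25.0)
print(m_N, mean, var)
```

Unlike ordinary least squares, the prediction comes with a variance, which grows for inputs far from the training data.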
Gradient Based Optimization
The goal of optimization in machine learning is to find the parameters of the
model that minimize (or maximize) a loss function. The loss function quantifies
how well the model's predictions match the actual data.
Gradient descent is a widely used optimization algorithm for minimizing the
loss function. It involves the following steps:
1. Compute Loss: At each step, calculate the loss.
2. Compute Gradient: Find the gradient (slope) of the loss function with
respect to model parameters.
3. Update Parameters: Adjust parameters in the opposite direction of the
gradient.
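The three steps can be sketched for a one-feature linear model with MSE as the loss. This is a minimal illustration, assuming NumPy; the learning rate, step count, and toy data are hypothetical choices, not values from the slides.

```python
import numpy as np

def gradient_descent(x, y, learning_rate=0.05, steps=2000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(x)
    losses = []
    for _ in range(steps):
        error = (w * x + b) - y
        # 1. Compute loss
        losses.append(float(np.mean(error ** 2)))
        # 2. Compute gradient of the loss with respect to w and b
        grad_w = (2.0 / n) * np.sum(error * x)
        grad_b = (2.0 / n) * np.sum(error)
        # 3. Update parameters in the opposite direction of the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b, losses

# Toy data generated from y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
w, b, losses = gradient_descent(x, y)
print(round(w, 4), round(b, 4), losses[0], losses[-1])
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow.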
SIMPLE LINEAR REGRESSION
Example 1
Example 2
Linear Classification Model
Classification
Classification is a fundamental machine learning technique aimed at organizing input data
into distinct classes.
Types of Classification
❖ Binary Classification
Two distinct categories
❖ Multiclass Classification
More than two categories
Characteristics of Classification Models
❖ Class Separation
❖ Decision Boundaries
❖ Sensitivity to Data Quality
❖ Handling Imbalanced Data
Classification Algorithms
Linear Classifiers
Non - Linear Classifiers
Figure: Data points of two classes
Figure: A straight line separating the two classes
How do mean and variance play a role?
Mean
The mean of a feature within a class locates the center of that class.
Variance
Variance measures how much each feature spreads.
Correlation measures how features move together.
Predict the Class with argmax function
The argmax function is used to select the class with the highest score or
probability.
Example
import numpy as np
# Example output probabilities from a model (one entry per class)
probabilities = np.array([0.1, 0.8, 0.1])
# Find the index of the maximum probability
predicted_class = np.argmax(probabilities)
print(predicted_class)  # Output: 1
Discriminant Function
A discriminant function is a function used in pattern classifiers
to partition the feature space based on probabilities or
equivalent functions, helping to determine the class to which a
given input belongs.
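One simple form is a linear discriminant function per class, g_k(x) = w_k · x + b_k, with the input assigned to the class whose function gives the highest score. The sketch below uses hypothetical weights and biases for three classes purely for illustration; in practice they would be learned from data.

```python
import numpy as np

# Hypothetical parameters: row k of W and entry k of b define g_k(x) = W[k] . x + b[k]
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [-1.0, -1.0]])
b = np.array([0.0, 0.0, 0.5])

def classify(x):
    """Evaluate every discriminant function and return the winning class and all scores."""
    scores = W @ x + b
    return int(np.argmax(scores)), scores

label, scores = classify(np.array([2.0, 0.5]))
print(label, scores)
```

The regions where each g_k is largest partition the feature space, and the boundaries between regions are where two discriminant functions are equal.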
Linear Discriminant Analysis (LDA)
LDA models the distinctions between groups, effectively separating two
or more classes.
LDA operates by projecting features from a higher-dimensional
space into a lower-dimensional one.
In machine learning, LDA serves as a supervised learning algorithm
specifically designed for classification tasks, aiming to identify a
linear combination of features that optimally segregates classes
within a dataset.
LDA assumes that the data has a Gaussian distribution and that the
covariance matrices of the different classes are equal.
Under these assumptions the decision boundary is linear, so LDA works
best when a linear boundary can separate the classes accurately.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two
classes.
2. Minimize the variation within each class.
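Mathematical Intuition Behind LDA
1: Maximize the distance between class means
2: Minimize the variation within each class
Step - by - step
1. Project data onto a line
$y = \mathbf{w}^\top \mathbf{x}$
2. Class means after projection
$\tilde{m}_k = \mathbf{w}^\top \mathbf{m}_k$, where $\mathbf{m}_k$ is the mean of class k
3. Variance (scatter) within each class after projection
$\tilde{s}_k^2 = \sum_{i \in C_k} (y_i - \tilde{m}_k)^2$
4. Fisher’s Criterion (Objective Function)
$J(\mathbf{w}) = \frac{(\tilde{m}_1 - \tilde{m}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$
5. Optimal w
$\mathbf{w} \propto S_W^{-1} (\mathbf{m}_1 - \mathbf{m}_2)$, where $S_W$ is the within-class scatter matrix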
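For two classes, the direction that maximizes Fisher's criterion is proportional to S_W⁻¹(m₁ − m₂), where S_W is the within-class scatter matrix and m₁, m₂ are the class means. The sketch below, assuming NumPy and a small made-up dataset, computes that direction and compares the criterion J(w) against projections onto the raw feature axes.

```python
import numpy as np

def fisher_direction(X1, X2):
    """Unit vector proportional to S_W^{-1} (m1 - m2) for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

def fisher_criterion(w, X1, X2):
    """J(w): squared distance between projected means over total projected scatter."""
    p1, p2 = X1 @ w, X2 @ w
    between = (p1.mean() - p2.mean()) ** 2
    within = np.sum((p1 - p1.mean()) ** 2) + np.sum((p2 - p2.mean()) ** 2)
    return between / within

# Made-up two-class data (illustrative only)
X1 = np.array([[4.0, 2.0], [2.0, 4.0], [2.0, 3.0], [3.0, 6.0], [4.0, 4.0]])
X2 = np.array([[9.0, 10.0], [6.0, 8.0], [9.0, 5.0], [8.0, 7.0], [10.0, 8.0]])

w = fisher_direction(X1, X2)
j_lda = fisher_criterion(w, X1, X2)
j_axis1 = fisher_criterion(np.array([1.0, 0.0]), X1, X2)
j_axis2 = fisher_criterion(np.array([0.0, 1.0]), X1, X2)
print(w, j_lda, j_axis1, j_axis2)
```

The LDA direction should score at least as high as either feature axis, since it is chosen to maximize J(w) over all directions.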
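Probabilistic generative model vs Probabilistic discriminative model
A generative model learns how the data in each class is distributed, p(x | y) and p(y),
and classifies with Bayes' theorem (for example, Naive Bayes). A discriminative model
learns p(y | x) directly (for example, logistic regression).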
Example 2
Problem in Linear Classification
When the classes cannot be completely separated by a linear equation, a hard
linear classifier is inadequate.
Logistic regression is used instead to predict the probability that the output
belongs to a specific category, based on the input features.
Logistic Regression
❏ Logistic regression is a supervised machine learning algorithm used for classification
tasks where the goal is to predict the probability that an instance belongs to a given
class.
❏ Logistic regression is used for binary classification. It applies the sigmoid function, which
takes a linear combination of the independent variables and produces a probability value between 0 and 1.
❏ For example, with two classes, Class 0 and Class 1, an input belongs to Class 1 if the value
of the logistic function for that input is greater than 0.5 (the threshold value); otherwise it
belongs to Class 0.
❏ In logistic regression, instead of fitting a regression line, we fit an “S”-shaped logistic
function whose output is bounded between 0 and 1.
Equation of Logistic Regression:
$P(y = 1 \mid \mathbf{x}) = \sigma(z) = \frac{1}{1 + e^{-z}}, \quad z = \mathbf{w}^\top \mathbf{x} + b$
❏ Logistic regression is a supervised learning algorithm used to estimate the
probability that a given instance belongs to a particular class.
❏ Logistic regression outputs probabilities between 0 and 1. It achieves this by passing
the linear combination of input features through a sigmoid function, ensuring that
predictions remain within a valid probability range.
Types of Logistic Regression
❏ Binomial Logistic Regression: Applied when the dependent variable has
two outcomes (e.g., spam vs. non-spam).
❏ Multinomial Logistic Regression: Used when the target variable has more
than two categories without a natural order (e.g., product categories).
❏ Ordinal Logistic Regression: Applied when the dependent variable has
ordered categories (e.g., satisfaction levels: low, medium, high).
Steps in Logistic Regression
1. Sigmoid Function and Probability Estimation
The sigmoid function transforms the linear model's output into a probability value between 0 and 1.
2. Logistic Regression Equation
3. Cost Function
4. Gradient descent to optimize parameters
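The first three steps can be written as two small functions. This is a sketch assuming NumPy; the inputs `z` and labels `y_true` are made-up values, and the cost shown is the standard log loss (binary cross-entropy).

```python
import numpy as np

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p, eps=1e-12):
    """Average binary cross-entropy; probabilities are clipped to avoid log(0)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

z = np.array([-2.0, 0.0, 2.0])        # outputs of the linear part, z = w.x + b
p = sigmoid(z)                        # predicted probabilities of class 1
y_true = np.array([0.0, 1.0, 1.0])    # actual labels
loss = log_loss(y_true, p)
print(p, loss)
```

The loss is small when confident predictions are correct and grows quickly when a confident prediction is wrong, which is what gradient descent then minimizes.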
Linear vs Logistic Regression
Example
Classify whether a student gets admitted (1) or not admitted (0) to a college based on
their exam score.
Step 1: Model Assumption
Step 2: Initialize Parameters
Step 3: Define Cost Function (Log Loss)
Step 4: Gradient Descent Optimization
Step 5: After Training
Step 6: Make Predictions
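The six steps can be traced in a short script. The exam scores and admission labels below are hypothetical, as are the learning rate and number of iterations; the score is standardized before training, which is an added convenience to help gradient descent converge and not something the slides specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical training data: exam score and admission outcome (1 = admitted)
scores = np.array([35, 45, 50, 55, 60, 65, 70, 80], dtype=float)
admitted = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

# Standardize the feature (assumption made for this sketch)
mu, sigma = scores.mean(), scores.std()
x = (scores - mu) / sigma

# Step 1: model p = sigmoid(w*x + b).  Step 2: initialize parameters.
w, b = 0.0, 0.0
learning_rate = 0.1

# Steps 3 and 4: minimize log loss with gradient descent
for _ in range(2000):
    p = sigmoid(w * x + b)
    grad_w = np.mean((p - admitted) * x)   # derivative of log loss w.r.t. w
    grad_b = np.mean(p - admitted)         # derivative of log loss w.r.t. b
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

# Steps 5 and 6: use the trained parameters to make predictions
def predict(score, threshold=0.5):
    prob = float(sigmoid(w * (score - mu) / sigma + b))
    return int(prob >= threshold), prob

print(predict(40), predict(75))
```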
Support Vector Machine (SVM) Algorithm
Support Vector Machine (SVM) is a supervised machine learning algorithm used for
classification and regression tasks.
SVM aims to find the optimal hyperplane in an N-dimensional space to separate
data points into different classes.
The algorithm maximizes the margin between the closest points of different classes.
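Hyperplane: the decision boundary that separates the classes (a line in two dimensions).
Support Vectors: the data points closest to the hyperplane; they determine its position.
Margin: the gap between the hyperplane and the closest data points of each class.
Linear SVM: used when the classes can be separated by a straight line or flat hyperplane.
Non-Linear SVM: used when they cannot; a kernel function maps the data to a space where a separating hyperplane exists.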
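The geometry behind these terms can be checked numerically. The sketch below does not train an SVM: it assumes a separating hyperplane w · x + b = 0 is already given (here the line x1 = 2, chosen by hand for a made-up dataset) and computes each point's distance to it, the margin, and the support vectors.

```python
import numpy as np

# Made-up, linearly separable data with labels +1 and -1
X = np.array([[3.0, 0.0], [4.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 1, -1, -1])

# Assumed hyperplane w . x + b = 0 (the vertical line x1 = 2), not a fitted one
w = np.array([1.0, 0.0])
b = -2.0

# Every point is on the correct side when y * (w . x + b) > 0
all_correct = bool(np.all(y * (X @ w + b) > 0))

# Perpendicular distance of each point to the hyperplane
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# Margin: distance from the hyperplane to the nearest point; the band between
# the two classes is twice as wide
margin = float(distances.min())
band_width = 2.0 * margin

# Support vectors: the points that lie exactly at the margin
support_vectors = X[np.isclose(distances, margin)]
print(all_correct, margin, band_width, support_vectors)
```

Training an SVM amounts to searching over w and b for the hyperplane that makes this margin as large as possible while keeping every point on its correct side.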
Decision Tree Classification Algorithm
❏ Internal nodes represent the features of a dataset.
❏ Branches represent the decision rules.
❏ Each leaf node represents the outcome.
A decision tree therefore has two kinds of nodes: decision nodes and leaf nodes.
Decision Tree Classification Algorithm
○ Step-1: Begin the tree with the root node, say S, which contains the
complete dataset.
○ Step-2: Find the best attribute in the dataset using an Attribute Selection
Measure (ASM).
○ Step-3: Divide S into subsets that contain the possible values of the best
attribute.
○ Step-4: Generate the decision tree node, which contains the best attribute.
○ Step-5: Recursively make new decision trees using the subsets of the
dataset created in Step-3. Continue this process until a stage is reached
where the nodes cannot be classified further; call each such final node a
leaf node.
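The five steps map naturally onto a recursive function. The sketch below uses information gain as the Attribute Selection Measure (one common choice, in the style of ID3) on a tiny made-up dataset with categorical attributes; the data, attribute names, and dictionary layout of the tree are all hypothetical choices for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    # Stop when the node is pure or no attributes remain: create a leaf node
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Step 2: pick the best attribute; Step 4: make it the decision node
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {"attribute": best, "children": {}}
    remaining = [a for a in attributes if a != best]
    # Steps 3 and 5: split into subsets and recurse on each one
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining)
    return node

def classify(tree, row):
    while isinstance(tree, dict):
        tree = tree["children"][row[tree["attribute"]]]
    return tree

# Made-up dataset (Step 1: the root holds the complete dataset)
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(tree)
```

This sketch has no pruning and cannot classify attribute values it never saw during training, both of which a practical implementation would need to handle.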
Example
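Random Forest Algorithm
A random forest is an ensemble of decision trees. Each tree is trained on a random
sample of the data, and the forest predicts by majority vote (classification) or by
averaging (regression).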
Evaluation Metrics
○ Accuracy
○ Logarithmic Loss
○ Area Under Curve (AUC)
○ Precision
○ Recall
○ F1 Score
○ Confusion Matrix
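Accuracy
Accuracy is the proportion of correct predictions made by the model, out of all
predictions:
$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$
True Positive Rate
$\text{TPR} = \frac{TP}{TP + FN}$
True Negative Rate
$\text{TNR} = \frac{TN}{TN + FP}$
False Positive Rate
$\text{FPR} = \frac{FP}{FP + TN}$
Precision
$\text{Precision} = \frac{TP}{TP + FP}$
Recall
$\text{Recall} = \frac{TP}{TP + FN}$
Here TP, TN, FP, and FN are the counts of true positives, true negatives, false
positives, and false negatives.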
Confusion Matrix
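All of these metrics follow from the four cells of the binary confusion matrix. The sketch below counts the cells from made-up actual and predicted labels and derives the metrics from them; the label lists and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def metrics(tp, tn, fp, fn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # same as the true positive rate
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "true_negative_rate": tn / (tn + fp),
        "false_positive_rate": fp / (fp + tn),
    }

# Made-up labels: 4 actual positives, 6 actual negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
results = metrics(tp, tn, fp, fn)
print((tp, tn, fp, fn), results)
```

The sketch assumes at least one predicted positive and one actual positive; otherwise precision or recall would divide by zero and need special handling.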
Decision Tree problems
❏ Selecting the Best Attribute
❏ Creating Tree Nodes
❏ Stopping Criteria
❏ Handling Missing Values
❏ Tree Pruning
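Entropy
$H(S) = -\sum_{i} p_i \log_2 p_i$, where $p_i$ is the proportion of examples in S that belong to class i
Information Gain
$\text{Gain}(S, A) = H(S) - \sum_{v \in \text{Values}(A)} \frac{|S_v|}{|S|} H(S_v)$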
Thank You