Fundamentals of Machine
Learning
Chapter 2: Supervised Learning
Target Group: G3SE A & B
Instructor: Melaku M.
What is Supervised learning?
❖ Approach:
• Given a collection of records (training set)
• each record contains a set of attributes, one of the attributes is the
class attribute (label).
• Dataset D={(x1,y1),(x2,y2),..., (xn,yn)}
❖ Task: Find a model for class attribute as a function of the values of other
attributes.
❖ Goal: Previously unseen records should be assigned a class as accurately as
possible.
Examples of supervised learning Tasks
❖ Classifying credit card transactions as legitimate or fraudulent
• Attributes: your age, income, debts, …
• Class: are you getting credit by your bank?
❖ Marketing
• Attributes: previously bought products, browsing behavior
• Class: are you a target customer for a new product?
❖ SPAM Detection: Categorizing email messages
• Attributes: words and header fields of an e-mail
• Class: regular e-mail or spam e-mail?
❖ Identifying tumor cells: Predicting tumor cells as benign or malignant
• Attributes: features extracted from x-rays or MRI scans
• Class: malignant or benign cells
❖ Categorizing news stories as finance, weather, entertainment, sports, etc.
Supervised learning
There are two major types of supervised machine learning problems.
• Classification
• Regression
1. Regression
❖Regression Analysis is a statistical approach for modelling the relationships
between the dependent variables and one or more independent variables.
❖In Regression, we plot a graph between the variables, which best fit the given
data points. In naïve words, “Regression shows a line or curve that passes
through all the data points on a target-predictor graph in such a way that the
vertical distance between the data points and the regression line is minimum”.
❖Regression is principally used for prediction, forecasting, time series modeling,
and determining the causal-effect relationship between variables.
Predicting House Prices
Problem: Given features like the size of a house, and the number of
bedrooms of the house, predict its price.
Independent variable Dependent variable
Size (sq ft) Bedrooms Price ($)
1914 4 245679.32
2859 3 338129.41
1465 2 192672.58
1297 5 194902.16
2433 1 251292.31
Types of Regression models
❖There are several types of regression techniques, each suited for
different types of data and different types of relationships.
❖The main types of regression techniques are:
➢Linear Regression
➢Polynomial Regression
Linear Regression
❖This is the simplest and most common type of regression algorithm.
❖It models the relationship/connection between a dependent variable (what
you are trying to predict) and one or more independent variables (the
features that influence the dependent variable) using a linear equation.
❖Linear regression is a regression model that uses a straight line to
describe the relationship between variables.
❖The goal of the algorithm is to find the best Fit Line that passes through
most of data points.
Linear Regression
In the figure given above, on X-axis is the independent variable and on Y-axis is
the dependent variable(target). The regression line (a sloped straight line) is
referred to as the best-fit line for a model. And our main objective in this
algorithm is to find this best fit line based on the given data points.
Types of Linear Regression
1) Simple Linear Regression
❖In a simple linear regression, there is one independent variable(x) and one
dependent variable(y).
❖The model estimates the regression coefficients(slope and intercept) of
the line of best fit, which represents the relationship between the
variables.
❖The goal of the linear regression algorithm is to get the best values
for B0 and B1 to find the best fit line.
1) Simple Linear Regression
❖The equation for simple linear regression
y=mx+b => ŷ = B0+B1x
❖where,
y = predicted value. It represents the value that the model tries to predict.
x = feature value
B0 = bias term or intercept of the line.
B1 = the feature weights or regression coefficient. Slope of best-fit line.
❖B0, B1 are known as Model Parameters.
2) Multiple Linear Regression
❖Multiple linear regression is a technique to model the relationship between
a single dependent variable and multiple independent variables.
❖The equation for multiple linear regression is:
ŷ=𝛽0+𝛽1x1+ 𝛽2x2 +……… 𝛽nxn
where:
❖ŷ is the predicted value
❖n is the number of features
❖xi ith feature value
❖𝛽0 is the intercept/bias term
❖𝛽i, ith feature weight value
Hypothesis function in Linear Regression
❖Let us assume that our independent feature is the experience i.e. X and the
respective salary Y is the dependent variable. Let’s assume there is a linear
relationship between X and Y then the salary can be predicted using:
Ŷ= B0 + B1 X Or ŷi= B0 + B1 xi
𝑦𝑖 𝜖 𝑌(𝑖=1,2,⋯,𝑛) are labels to data
𝑥𝑖 𝜖 𝑋(𝑖=1,2,⋯,𝑛) are the input independent training data
ŷi 𝜖 Ŷ(𝑖=1,2,⋯,𝑛) are the predicted values.
❖Once we find the best B0 and B1 values, we get the best-fit line.
This algorithm explains the linear relationship between the dependent variable y
and the independent(predictor) variable X using a straight-line Y= B0 + B1 X.
▪ Red line is the best fit line predicted
by the model.
▪ Blue dots are the data points
▪ b1 is the slope of the x variable.
Random error
The best fit line should have the least error means the error between predicted values and actual
values should be minimized.
Cont’d
❖The vertical distance between the data point and the regression line is
known as error or residual. Each data point has one residual and the sum of
all the differences is known as the Sum of Residuals/Errors.
❖Mathematical Approach:
❖Residual/Error = Actual values – Predicted Values
❖Sum of Residuals/Errors = Sum(Actual- Predicted Values)
❖Square of Sum of Residuals/Errors = (Sum(Actual- Predicted Values))2
Cost Function
❖The cost function is nothing but the error or difference between the actual value yi and
the predicted value Y.
❖In Linear Regression, the Mean Squared Error (MSE) cost function is employed, which
calculates the average of the squared errors between the actual values ŷ and the
predicted values yi.
❖It helps us to figure out optimal/best values for B0 and B1, which provides the best fit line
for the given data points.
• Our MSE tells us how well (or rather how poorly) our function fits our data. This means
that we want to make our MSE as low as possible.
• Our MSE is just a function, we can compute the minimum of it directly by taking the first
derivative of the MSE.
Cost Function
➢Utilizing the MSE function, the iterative process of gradient descent is applied to
update/adjust the values of B0 and B1 .This ensures that the MSE value converges
to the global minima, signifying the most accurate fit of the linear regression line
to the dataset.
➢This process involves continuously adjusting the parameters based on the
gradients calculated from the MSE. The final result is a linear regression line that
minimizes the overall squared differences between the predicted and actual
values, providing an optimal representation of the underlying relationship in the
data.
Gradient Descent
❖Gradient Descent is a generic optimization algorithm capable of
finding optimal solutions to a wide range of problems. The general idea
of Gradient Descent is to tweak model parameters iteratively in order to
minimize a cost function, to reach the optimal solution.
❖A linear regression model can be trained using the optimization
algorithm gradient descent by iteratively modifying the model’s
parameters to reduce the mean squared error (MSE) of the model on a
training dataset.
Gradient Descent
❖To update B0 and B1, we take gradients from the MSE cost function. To find these
gradient, we compute the partial derivatives of the cost function (MSE) with respect
to both m and c.
❖These gradients tell us how much we should adjust B0 and B1 to reduce the error.
❖Initially, it starts with random B0 and B1 values and then iteratively update the
values, reaching minimum cost.
❖A gradient is nothing but a derivative that defines the effects on outputs of the
function with a little bit of variation in inputs.
How Linear regression works
1. Initialize parameters: Start with initial values for the slope (m) and y-intercept (c).
2. Calculate predictions: Use your current parameter values (m and c) to predict the
dependent variable (ŷ) for each data point.
3. Calculate gradients: Compute the partial derivatives of the cost function (MSE) with
respect to both m and c. These gradients indicate the direction of the steepest descent in
the cost function landscape.
4. Update parameters: Use the learning rate (α) to determine the step size for the update.
Subtract the learning rate multiplied by the gradients from the current values of m and c.
This moves the parameters in the direction that minimizes the cost function.
5. Repeat steps 2-4: Continue iterating until the cost function converges (stops decreasing
significantly) or reaches a certain threshold.
Impact of different values for learning rate
Polynomial regression
❖Polynomial regression helps to build a model over non-linear data.
❖A simple linear regression algorithm only works when the relationship
between the data is linear. But suppose we have non-linear data, then
linear regression will not be able to draw a best-fit line.
Consider the diagram, which has a
non-linear relationship, and you can
see the linear regression results on it,
which does not perform well,
meaning it does not come close to
reality.
Polynomial regression
❖Polynomial Regression is a regression algorithm that models the
relationship between a dependent variable(y) and independent
variable(x) as nth degree polynomial.
❖The Polynomial Regression equation is given below:
❖Polynomial regression is a form of Linear regression where only
due to the Non-linear relationship between dependent and
independent variables, we add some polynomial terms to linear
regression to convert it into Polynomial regression.
Types of Polynomial Regression
Linear Regression Vs. Polynomial Regression
• We build our model and realize that it performs badly. We examine the
difference between the actual value and the best fit line we predicted,
and it appears that the true value has a curve on the graph, but our line is
nowhere near cutting the mean of the points. This is where polynomial
regression comes into play; it predicts the best-fit line that matches the
pattern of the data (curve).
Math Behind Polynomial Regression!
Cost Function
❖A common cost function employed in polynomial regression is the
mean squared error (MSE). It calculates the average of the squared
differences between the predicted values and the actual values (y).
❖Cost = (1 / n) * sum((yi - y_predi) ^ 2)
❖Where:
❖n represents the total number of training examples.
❖The summation calculates the squared error for each data point.
❖So, the equation can also be written as:
Cost = 1/n*sum(square(yi - (b0 + b1x + b2x2 + b3x3.....))
Gradient Descent for Polynomial Regression
• Gradient descent is an optimization algorithm used to find the values of
parameters (coefficients) of a function that minimizes a cost function.
The values of slope (m) and slope-intercept (b) will be set to 0 at the start of
the function, and the learning rate (α) will be introduced. The learning rate
(α) is set to an extremely low number, perhaps between 0.01 and 0.0001.
The learning rate is a tuning parameter in an optimization algorithm that
sets the step size at each iteration as it moves toward the cost function’s
minimum.
Overfitting Vs Under-fitting
❖In linear regression, when analyzing a dataset linearly, we encounter an under-
fitting problem.
• Model is unable to capture data patterns due to the model’s simple assumptions in
fitting the data.
❖In polynomial regression one thing that we face is the problem of overfitting
this happens because while we increase the order/degree of the polynomial
regression to achieve better and better performance, model gets overfit on the
data and does not perform on the new data points.
• Model memorizes the training data too well and loses generalizability.
❖Due to this reason only while using the polynomial regression, do we try to
penalize the weights of the model to regularize the effect of the overfitting
problem.
Overfitting Vs Under-fitting
Regularization Technique
• While developing machine learning models you must have encountered a
situation in which the training accuracy of the model is high but the
validation accuracy is low. This is the case which is popularly known as
overfitting.
• Regularization in ML which helps us to solve the problem of overfitting.
• Regularization is implemented by adding a “penalty” term to the best fit
derived from the trained data, in order to achieve a lesser variance with the
tested data and also restricts the influence of feature variables over the
output variable by compressing their coefficients.
Regularization in Machine Learning
❖Regularization is a technique used to prevent overfitting by adding a
penalty term to the loss function, discouraging the model from assigning
too much importance to individual features or coefficients.
❖ The commonly used regularization techniques are:
▪ Lasso Regularization – L1 Regularization
▪ Ridge Regularization – L2 Regularization
Evaluation metrics for regression models
• To build and deploy a generalized model we need to evaluate the model
on different metrics which helps us to better optimize the performance,
fine-tune it, and obtain a better result.
• Some of the Evaluation metrics used for Regression analysis are:
1. Mean Absolute Error (MAE)
• MAE is simplest and easily comprehensible loss
functions, which calculates the absolute difference
between actual and predicted values. We aim to get a
minimum MAE because this is a loss.
Evaluation metrics for regression models
2. Mean Squared Error (MSE) :
MSE is a most used and very simple metric
MSE represents the squared distance between actual and
predicted values.
3. Root Mean Squared Error (RMSE) :
• As RMSE is clear by the name itself, that it is a
simple square root of mean squared error.
R2 score is a metric that tells the
performance/accuracy of your model. R² Score usually
ranges from 0 to 1. SST is the total sum of squares
4. R-squared (R²) Score: and SSR is the total sum of squares of residuals.
Classification
✓Given: Dtrain= (x1, y1), . . . , (xn, yn) / xi ∈ Rd and yi is the label
➢Goal:
✓Learn a function to predict y given x
– y is categorical/discrete == classification
✓In other word, Build a model that can accurately predict the
class labels of new, unseen instances.
Types of classification tasks
• Types of classification tasks based on the nature of the class labels.
• Binary Classification: The simplest case, where there are only two
classes (e.g., classifying an email as spam/not spam(ham)).
• Multi-Class Classification: More than two classes (e.g., classifying
images as cat, dog, or horse. Predicting handwritten digits[0-9]
where there are 10 classes( response variable can have any value
ranging from 0 to 9).
• Multi-Label Classification: A data point can belong to multiple
classes simultaneously. For example, an image can contain a cat, a
dog, and a person. predicting the presence of multiple diseases in a
patient based on medical records.
Classification problem 1: Predicting the Weather
• Suppose we take a real-world example of predicting the weather.
• Let’s keep it simple and say we are trying to predict if the weather is sunny
or rainy based on multiple input data samples consisting of attributes or
features like humidity, temperature, pressure, and precipitation.
• Input features = { humidity, temperature, pressure, and precipitation}
• Output labels = {Sunny, runny}
• Since the prediction can be either sunny or rainy, there are a total of two
distinct classes in total; hence this problem can also be termed as a binary
classification problem.
Figure. Supervised learning: binary classification for weather prediction
Classification problem 2: Iris plant classification
Cont.… the data:
Cont.…
Cont.…
Classification has a vast array of real-world applications
1) Fraud Detection: Classifying financial transactions as fraudulent or legitimate based on
patterns, anomalies, or historical data.
2) Sentiment analysis: Categorizing customer product reviews as positive, negative, or
neutral.
3) Image recognition: Classifying images into various categories like cat, dog, horse etc.
4) Object recognition: Identifying objects in an image, like a picture containing a cat, a
person, and a tree.
5) News article/document classification: Categorizing a news article by topic, such as
politics, sports, and technology (based on their content).
6) Cybersecurity threat detection: Identifying malicious network activity from a vast
amount of regular network traffic.
7) Credit Risk Assessment: Classifying loan applicants as low risk or high risk based on their
credit history, income, and other relevant factors.
8) Customer Churn Prediction: Classifying customers as likely to churn(cancel service) or
not churn from a subscription or service based on their behavior, usage patterns, or
demographics.
Typical classification models
Typical classification models include the following major types of
methods; however, the list is not exhaustive.
• Linear models like logistic regression, Naïve Bayes, and support vector
machines
• Non-parametric models like K-nearest neighbors(KNN)
• Tree based methods like decision trees
• Ensemble methods like random forests (bagging) and gradient boosted
machines (boosting)
• Neural networks (MLPs)
DecisionTree
• Decision Trees are versatile supervised machine Learning algorithms that
can perform both classification and regression tasks. They are powerful
algorithms, capable of fitting complex datasets.
Sample Dataset
• Columns denote features
X i, (xi, y i )
• Rows denote labeled
instances (xi, y i )
• Class label denotes
whether customer buys a
computer.
Decision Tree Induction
• Decision tree induction is the
learning of decision trees from
class-labeled training tuples.
• To build optimal decision tree
select an attribute that produces
purest subset of records.
select best attribute
that splits this data set.
Examples of Decision Tree Induction:
• A decision tree is grown by recursively partitioning the training records
successively into purer subsets.
Cont’d
Purest record because all instances
belongs to the same class
yes
all examples are yes
Cont’d
select best attribute
that splits this data set.
Yes
all examples are yes,
No further split is required
Decision Tree Induction Algorithm: Example
yes
no yes
all examples are no all examples are yes
Decision Tree Induction Algorithm: Example
select best attribute that splits this data set.
Decision Tree Induction Algorithm: Example
Training and Testing Decision Tree Model
Training Data Model: Decision Tree
Test Data
Apply Model
Tree-based Models is Easy to Interpret
• A path from root to a leaf node Set of rules can be
corresponds to a rule visualized as a tree.
• E.g., if age<=30 and student=no then
target value, buys=no
✓ Each path from the tree root to a
leaf corresponds to a conjunction
of attribute tests.
✓ The instance (age <=30, income =
low, student = no, credit_rating =
fair) is classified as a negative
instance.
Decision Tree Induction Algorithm: Example 2
Features Class/Label
.
Build a decision tree model for the given
dataset
Given : <outlook=sunny,
temperature=hot, humidity=high,
wind=weak> ?
Predict, if there will be a match?
Many possible Decision Trees
There could be more than one tree that fits the samedata!
Which Decision Tree is the Best?
Decision Tree Induction Algorithms
❖ How to learn a decision tree from training data?
❖ Which feature should be used to break the dataset?
✓ finding an optimal decision tree is NP-hard
✓ tree building algorithms thus use a greedy, top-down, recursive partitioning
strategy to induce a reasonable solution
Number of Algorithms:
• Hunt’s: Hunt's Algorithm (1966)
• Quinlan’s: Iterative Dichotomizer3 (ID3) (1975) uses Entropy
– C4.5 / 4.8 / 5.0 (1993)(Successor of ID3) uses Entropy
• Brieman’s: CART: Classification and Regression Trees (1984) uses Gini
Design Issues for Decision Tree induction
1. How should training records be split?
❖How to specify the attribute test condition (Splitting Criterion)?
◆ Depending on attribute types(Nominal, Ordinal, Continuous)
❖How to determine the best split?
• Measure for evaluating the goodness of a test condition
• Different purity measures can be used:
• information gain, , Gini index, gain ratio misclassification error, …
2. When should the splitting procedure stop?
–Stop splitting if all the records belong to the same class.
• Early termination depending on the results of a statistical test, b/c fully
grown trees might overfit training data
How to Determine the Best Split?
Before splitting the dataset contains:
❖ 10 records of class C0 and
❖ 10 records of class C1
Which attribute test is the best?
How to determine the Best Split?
❖ Key problem: choosing which attribute to split a given set of training records.
❖ Greedy approach:
▪ Nodes with purer/homogeneous class distribution are preferred
▪ Need a measure of node impurity:
Non-homogeneous Homogeneous
High degree of node impurity Low degree of node impurity
How to determine the Best Split?
Node Impurity
Very impure group Less impure Minimum impurity
❖ Common measures of node impurity
1. GINI Index
2. Information gain:-follows max-gain
• The ID3 algorithm uses the Max-Gain method of selecting the
best attribute
Splitting Based on Information Gain
Weighted impurity measure of children
❖ Information Gain:
Parent Node p is split into k partitions; ni is
number of records in partition i
❖ Information gain measures the entropy reduction of a split
❖ We choose the split with the largest reduction (maximal GAIN)
❖ Disadvantage: Tends to prefer splits that result in large number of
partitions, each being small but pure (split by ID attribute?)
How to determine the Best Split?
Splitting Based on Information gain
❖ Information gain relies on the entropy of each node
❖ Entropy of a given node t:
p( j | t) is the relative frequency of class j at node t
❖ Entropy measures homogeneity of a node(impurity of a node)
Splitting Based on Information Gain
Calculating Information Gain
Parametric and non-parametric methods
These methods approach learning from data in fundamentally different ways:
Parametric methods:
❖Parametric methods make assumptions about the underlying statistical
distribution of your data and functional forms(eg. Y=mx+b).
❖Parametric method assumes/follows a specific predefined form for model,
and summarizes the given data with a fixed number of parameters (independent
of the number of training examples).
❖Imagine a straight line for linear regression - that's the form, and the parameters
are the slope and intercept.
Parametric and non-parametric methods
Parametric methods:
❖The algorithms involve two steps:
• Select a form for the function.
• Learn the coefficients for the function from the training data.
A. Linear Regression: It assumes a linear relationship between the input
variables and the target variable and estimates the coefficients of the
linear equation.
B. Naive Bayes: It assumes that the features are conditionally
independent given the class label and estimates the class probabilities
using Bayes' theorem.
Non-parametric Methods
❖These methods make minimal assumptions about the data's underlying
structure. They learn the model(functional form) directly from the data,
allowing for more complex relationships.
❖Non-parametric methods don't rely on predefined forms, and often have a
flexible number of parameters that grow with the amount of training data.
The model complexity can grow with more data, making them adaptable.
❖If your data is complex or you're unsure about its distribution(no prior
knowledge), non-parametric methods offer more flexibility.
Choosing the right method:
The best method depends on your specific problem and data:
❖If you have a strong understanding of your data and can make
reasonable assumptions about its distribution, parametric methods
can be a good choice, especially when data is limited.