Chapter 02
Regression Models
Introduction to Regression
• Regression is a fundamental statistical and machine learning
technique used to model the relationship between a
dependent variable (target) and one or more independent
variables (predictors).
• Regression is a statistical and machine learning technique
used to model and predict continuous outcomes by analyzing
the relationship between dependent and independent
variables.
• Its primary purpose is prediction, estimation, and
understanding how variables influence each other.
• It is widely applied in forecasting, risk analysis, and
decision-making across fields like economics, healthcare,
and engineering.
What is Regression?
• Definition: Regression is a supervised learning method
that predicts a continuous dependent variable
(outcome) based on one or more independent
variables (predictors).
• Purpose: To understand how changes in predictors
influence the outcome and to make predictions.
• Applications: Price prediction, demand forecasting, risk
scoring, medical prognosis, and trend estimation.
Core Concepts
•Dependent Variable (Target/Outcome): The variable you
want to predict or explain (e.g., house price).
•Independent Variables (Predictors/Features): The
variables used to make the prediction (e.g., house size,
location).
•Relationship: How changes in independent variables
correspond to changes in the dependent variable.
•Model: The mathematical function (line, curve) that
represents this relationship, fitted to data points.
What is a Regression Line?
• A regression line describes the relationship between two
or more variables.
• A regression line is a straight line that reflects the
best-fit connection in a dataset between independent
and dependent variables.
• The main purpose of developing a regression line is to
predict or estimate the value of the dependent variable
based on the values of one or more independent
variables.
• Equation of Regression Line
• The equation of a simple linear regression line is given
by:
Y = a + bX + ε
• Y is the dependent variable
• X is the independent variable
• a is the y-intercept, which represents the value of Y
when X is 0.
• b is the slope, which represents the change in Y for a
unit change in X
• ε is the residual error (the difference between the observed
and the predicted value of Y).
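The equation above can be illustrated with a minimal sketch that estimates a and b by least squares. The function name and the study-hours data are hypothetical, chosen only for illustration; the slope is computed as the covariance of X and Y divided by the variance of X, and the intercept follows from the fact that the fitted line passes through the mean point.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope b: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept a: the fitted line passes through the mean point.
    a = y_mean - b * x_mean
    return a, b


# Hypothetical data: hours studied (X) and exam score (Y).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

a, b = fit_simple_linear_regression(hours, scores)
# Residuals are the observed values minus the fitted values.
residuals = [y - (a + b * x) for x, y in zip(hours, scores)]
print(f"a = {a:.2f}, b = {b:.2f}")
```

For a least-squares fit with an intercept, the residuals sum to zero (up to rounding), which is a quick sanity check on the estimates.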
Types of Regression
| Type | Description | Example Use Case |
| --- | --- | --- |
| Linear Regression | If the regression curve is a straight line, the regression is said to be linear. | Predicting a student's final exam score from study time or class attendance. |
| Simple Linear Regression | Models the relationship between one independent variable and one dependent variable using a straight line. | Predicting house price based on square footage. |
| Multiple Linear Regression | Uses two or more independent variables to predict a dependent variable. | Predicting salary based on education, experience, and age. |
| Non-linear Regression | Models complex relationships beyond linear assumptions. | Chemical reaction rates. |
| Logistic Regression | Used when the dependent variable is categorical (binary outcomes). | Predicting whether a patient has a disease (Yes/No). |
Linear Regression
• Linear regression is a type of supervised
machine-learning algorithm.
• It learns from labelled datasets and fits the linear function
that best maps the inputs to the outputs, which can then be
used for prediction on new datasets.
• It assumes that there is a linear relationship between
the input and output, meaning the output changes at a
constant rate as the input changes. This relationship is
represented by a straight line.
Best Fit Line in Linear Regression
• In linear regression, the best-fit line is the straight line that
most accurately represents the relationship between the
independent variable (input) and the dependent variable
(output).
• It is the line that minimizes the difference between the
actual data points and the predicted values from the
model.
3. Goal of the Best-Fit Line
• The goal of linear regression is to find a straight line that
minimizes the error (the difference) between the observed
data points and the predicted values. This line helps us
predict the dependent variable for new, unseen data.
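As a rough illustration of this goal, the sketch below compares the sum of squared errors of the least-squares line with that of two arbitrary alternative lines. The data and the helper name `sum_squared_error` are hypothetical, and squared error is assumed as the measure of "difference".

```python
def sum_squared_error(xs, ys, intercept, slope):
    """Sum of squared differences between observed and predicted y values."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical observations, roughly following y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Closed-form least-squares estimates for a single predictor.
n = len(xs)
x_mean = sum(xs) / n
y_mean = sum(ys) / n
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

best_error = sum_squared_error(xs, ys, intercept, slope)
# Two arbitrary alternative lines for comparison.
steeper_error = sum_squared_error(xs, ys, intercept, slope + 0.3)
shifted_error = sum_squared_error(xs, ys, intercept + 0.5, slope)

print(best_error, steeper_error, shifted_error)
```

Any change to the slope or intercept of the least-squares line increases the squared error, which is what "best fit" means in this setting.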
4. Interpretation of the Best-Fit Line
• Slope (m): With the line written as y = mx + b, the slope
indicates how much the dependent variable (y) changes with
each unit change in the independent variable (x).
• For example, if the slope is 5, then for every 1-unit
increase in x, the value of y increases by 5 units.
• Intercept (b): The intercept represents the predicted value
of y when x = 0.
• It is the point where the line crosses the y-axis. (In this
form b denotes the intercept, whereas in Y = a + bX above
a is the intercept and b is the slope.)
Lines of regression
• Two lines of regression exist for two variables (X and Y)
because the relationship can be modeled in two ways:
• predicting Y from X (Y on X) or predicting X from Y (X on
Y).
• Each line minimizes a different error (vertical vs. horizontal
distances) using the least squares method.
• This results in two distinct lines that intersect at the mean
point (x̄, ȳ); they coincide only when the correlation is
perfect (r = ±1).
• Different Predictions:
• If you're interested in predicting height (Y) from weight
(X), you use Y on X; if predicting weight (X) from height
(Y), you use X on Y.
• The Two Regression Lines:
• Regression Line of Y on X (y = a + bx):
• Predicts the dependent variable Y for a given value of
independent variable X. It minimizes the sum of squared
vertical errors (residuals).
• Regression Line of X on Y (x = c + dy):
• Predicts the dependent variable X for a given value of
independent variable Y. It minimizes the sum of squared
horizontal errors.
Calculation and Properties
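•Formula: For simple linear regression of Y on X, the coefficient
b_yx is calculated as:
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$
•Formula: For simple linear regression of X on Y, the coefficient
b_xy is calculated as:
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$$
Alternatively, each can be expressed using the correlation coefficient
(r) and the standard deviations σ_x and σ_y:
$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$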
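A small sketch of these calculations follows, assuming the standard least-squares definitions of the two regression coefficients and the Pearson correlation coefficient. The weight and height values are hypothetical. It also checks a well-known property: the product of the two regression coefficients equals r².

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) for paired observations."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    b_yx = sxy / sxx                 # slope of the Y-on-X line
    b_xy = sxy / syy                 # slope of the X-on-Y line
    r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
    return b_yx, b_xy, r


# Hypothetical data: weight in kg (X) and height in cm (Y).
weights = [50, 60, 70, 80, 90]
heights = [150, 160, 165, 175, 180]

b_yx, b_xy, r = regression_coefficients(weights, heights)
print(f"b_yx = {b_yx:.3f}, b_xy = {b_xy:.3f}, r = {r:.3f}")
print(f"b_yx * b_xy = {b_yx * b_xy:.3f}, r^2 = {r ** 2:.3f}")
```

Because both coefficients share the same numerator, they always have the same sign as r.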
Advantages of Linear Regression
• Simplicity and Interpretability: It provides a straightforward
mathematical equation where coefficients represent the impact
of each independent variable on the target variable.
• Efficiency: It is computationally fast to train and, therefore,
suitable for large datasets.
• Optimal for Linear Trends: It works exceptionally well when
the relationship between variables is truly linear.
• Regularization: Overfitting can be reduced using techniques
like L1 (Lasso) and L2 (Ridge) regularization.
• Provides Baseline: It is a commonly used, solid baseline
model for comparing more complex algorithms.
Disadvantages of Linear Regression
• Assumption of Linearity: It cannot capture non-linear
relationships, potentially leading to underfitting.
• Sensitivity to Outliers: Anomalies can have a disproportionate,
large influence on the regression coefficients, distorting the
model.
• Multicollinearity: It assumes predictors are not highly correlated
with each other; when they are, the coefficient estimates become
unstable and hard to interpret.
• Assumption of Independence: It assumes that the
observations are independent of each other.
• Poor on Complex Data: It may not represent real-world data
well, which is rarely perfectly linear.
Assumptions of Linear Regression
• Linearity
• No multicollinearity
• Homoskedasticity
• No autocorrelation
Limitations of Linear Regression
• Outliers: This can significantly impact the slope and intercept
of the regression line.
• Non-linearity: Linear regression assumes a linear relationship,
but this assumption may not hold in some cases.
• Correlation ≠ Causation: Just because two variables have a
linear relationship doesn’t mean changes in one cause
changes in the other.
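The effect of outliers can be seen in a minimal sketch. The data is hypothetical: six points lying exactly on y = 2x, and the same points with the last value replaced by an extreme one. The helper `fit_line` is an illustrative least-squares fit for a single predictor.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


xs = [1, 2, 3, 4, 5, 6]
clean_ys = [2, 4, 6, 8, 10, 12]     # exactly y = 2x
outlier_ys = [2, 4, 6, 8, 10, 40]   # last point replaced by an outlier

_, clean_slope = fit_line(xs, clean_ys)
_, outlier_slope = fit_line(xs, outlier_ys)
print(f"slope without outlier: {clean_slope:.2f}")
print(f"slope with outlier:    {outlier_slope:.2f}")
```

In this example a single extreme point changes the fitted slope from 2 to 6, even though the other five points are unchanged.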
Cross validation in machine learning
• Cross-validation techniques are statistical methods
used in machine learning to assess model performance
and detect overfitting by training and testing models on
multiple, different subsets of the data.
• By partitioning data, training on some subsets, and
testing on others, these methods ensure robust
evaluation on unseen data.
• Cross-validation is a technique used to check how well
a machine learning model performs on unseen data
while preventing overfitting.
• What is Cross Validation
• Cross-validation serves multiple purposes:
• Detects Overfitting: Checks that the model does not
perform well only on the training data but generalizes to
unseen data.
• Provides Robust Evaluation: Averages results over
multiple iterations, reducing the variance of the
performance estimate.
• Efficient Use of Data: Maximizes the utilization of the
dataset, especially when the data size is limited.
• It works by:
• Splitting the dataset into several parts.
• Training the model on some parts and testing it on the
remaining part.
• Repeating this resampling process multiple times by
choosing different parts of the dataset.
• Averaging the results from each validation step to get
the final performance.
Why Use Cross-Validation?
• Reduces Overfitting: By rotating validation sets, it ensures
the model isn't just memorizing training data.
• Better Data Utilization: It uses the entire dataset for both
training and validation across different iterations.
• Hyperparameter Tuning: It helps identify the optimal
parameters that yield the lowest validation error.
Key Concepts and Types
• Purpose: Accurately assesses how a model generalizes to
new data, rather than just how it performs on the training set.
• Stratified K-Fold: Ensures each fold has the same percentage
of samples of each target class, which is vital for imbalanced
datasets.
• The dataset is divided into k folds, keeping class proportions
consistent in each fold.
• In each iteration, one fold is used for testing and the remaining
folds for training.
• This process is repeated k times so that each fold is used once
as the test set.
• It helps classification models generalize better by maintaining
balanced class representation.
• Holdout Method: The simplest form, splitting data once into
a fixed training set and a testing set (often 70/30 or
80/20; a 50/50 split is also used).
• Because the data is split only once, holdout validation is
simple and quick to apply.
• The major drawback is that part of the data is never used
for training. With a 50/50 split, for example, the model
may miss important patterns in the other half, which leads
to high bias.
• Monte Carlo (Shuffle Split): Randomly splits the data
into training and validation sets multiple times.
• LOOCV (Leave-One-Out Cross-Validation):
• In this method the model is trained on the entire dataset
except for one data point which is used for testing.
• This process is repeated for each data point in the
dataset.
• All but one data point are used for training in each
iteration, resulting in low bias.
• Testing on a single data point can cause high variance,
especially if the point is an outlier.
• It can be very time-consuming for large datasets as it
requires one iteration per data point.
• K-Fold Cross-Validation: The most common method,
where data is split into 𝐾 equal-sized folds.
• K-Fold Cross Validation splits the dataset
into k equal-sized folds. The model is trained
on k-1 folds and tested on the remaining fold.
• This process is repeated k times each time using a
different fold for testing.
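The splitting strategies described above differ mainly in how indices are assigned to the training and testing sets. The sketch below shows one possible way to generate the splits for the holdout, Monte Carlo, and LOOCV methods. The function names, the default fractions, and the fixed random seed are assumptions made for illustration; in practice the data is usually shuffled before a holdout split.

```python
import random


def holdout_split(n_samples, train_fraction=0.8):
    """One fixed split into training and testing indices."""
    cut = int(n_samples * train_fraction)
    indices = list(range(n_samples))
    return indices[:cut], indices[cut:]


def shuffle_splits(n_samples, n_splits, test_fraction=0.2, seed=0):
    """Monte Carlo: repeated random train/validation splits."""
    rng = random.Random(seed)
    n_test = int(n_samples * test_fraction)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:], indices[:n_test]


def leave_one_out_splits(n_samples):
    """LOOCV: every sample serves as the test set exactly once."""
    for i in range(n_samples):
        train = [j for j in range(n_samples) if j != i]
        yield train, [i]


train, test = holdout_split(10)
monte_carlo = list(shuffle_splits(10, n_splits=3))
loocv = list(leave_one_out_splits(10))
print(len(train), len(test), len(monte_carlo), len(loocv))
```

Note that LOOCV produces as many splits as there are samples, which is why it becomes expensive on large datasets.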
Life cycle of K-fold Cross-Validation
Steps in K-Fold Cross-Validation:
1. Split the dataset into K equal-sized folds.
2. Train the model on K−1 folds.
3. Test the model on the remaining fold and record the
error/accuracy.
4. Repeat this process K times, using a different fold as the
validation set each time.
5. Average the K recorded performances to get a final, more
accurate estimation.
• Here we take k as 5, with a dataset of 25 samples numbered 1–25.
• 1st iteration: The first 20% of data [1–5] is used for testing
and the remaining 80% [6–25] is used for training.
• 2nd iteration: The second 20% [6–10] is used for testing and
the remaining data [1–5] and [11–25] are used for training.
• This process continues until each fold has been used once as
the test set.
• Each iteration uses different subsets for testing and
training, ensuring that all data points are used for both
training and testing.
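The iterations above can be reproduced with a short sketch. It assumes 25 samples numbered 1 to 25, contiguous (unshuffled) folds, and a sample count divisible by k; the function name `k_fold_splits` is hypothetical.

```python
def k_fold_splits(n_samples, k):
    """Yield (train, test) lists of 1-based sample numbers for each fold."""
    samples = list(range(1, n_samples + 1))
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    for fold in range(k):
        start = fold * fold_size
        test = samples[start:start + fold_size]
        train = samples[:start] + samples[start + fold_size:]
        yield train, test


splits = list(k_fold_splits(25, 5))
for i, (train, test) in enumerate(splits, start=1):
    print(f"Iteration {i}: test {test[0]}-{test[-1]}, "
          f"{len(train)} training samples")
```

Each sample appears in exactly one test fold and in the training set of the other four iterations.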
Performance Aggregation:
• Calculate Scores: Collect the evaluation metrics (e.g.,
accuracy, 𝑅2, RMSE) from each of the 𝐾 iterations.
• Average Performance: Compute the mean (and often
standard deviation) of the 𝐾 validation scores to
determine the final, generalized performance metric of
the model.
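The aggregation step might look like the following sketch, which records one RMSE per fold for a simple linear regression and then reports the mean and standard deviation. The data, the helper names, and the use of contiguous folds are assumptions for illustration only.

```python
import math
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    return y_mean - slope * x_mean, slope


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def k_fold_rmse_scores(xs, ys, k):
    """Return one RMSE per fold (assumes len(xs) is divisible by k)."""
    fold_size = len(xs) // k
    scores = []
    for fold in range(k):
        test_idx = range(fold * fold_size, (fold + 1) * fold_size)
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        intercept, slope = fit_line(train_x, train_y)
        predicted = [intercept + slope * xs[i] for i in test_idx]
        scores.append(rmse([ys[i] for i in test_idx], predicted))
    return scores


# Hypothetical data: roughly y = 2x + 1 with small deviations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9]

scores = k_fold_rmse_scores(xs, ys, k=5)
mean_rmse = statistics.mean(scores)
std_rmse = statistics.stdev(scores)
print(f"RMSE per fold: {[round(s, 3) for s in scores]}")
print(f"Mean RMSE = {mean_rmse:.3f}, standard deviation = {std_rmse:.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score depends on which fold was held out.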
Key Considerations
• Choosing 𝐾: A value of 𝐾=10 is common as it provides
a good balance between low bias and moderate
variance.
• Computational Cost: High K values mean longer
training times, as the model must be trained K times.
• Final Model: After finding the best parameters via CV,
the final model is typically trained on the entire dataset.
Common Applications
• Model Selection: Choosing between a simple linear
model or a polynomial model.
• Hyperparameter Tuning: Finding the best regularization
parameter (𝜆) for Ridge or Lasso regression.
• Performance Assessment: Ensuring the model
generalizes well rather than just memorizing training data
Model Selection for Machine Learning
• Model selection in machine learning is the process of
choosing the best-performing algorithm, such as linear
regression or random forests, for a specific dataset to
maximize predictive accuracy, efficiency, and
generalization.
• The choice of model significantly affects the accuracy,
efficiency and reliability of predictions. A bad model can
cause overfitting or underfitting and sometimes even
lead to increased computational costs.
Key Aspects of Model Selection
• Purpose: To find the optimal model that generalizes well to
unseen data, not just the training set.
• Key Factors:
• Performance Metrics: Accuracy, precision, recall, and mean
squared error.
• Data Type/Size: Linear models for simple data, deep
learning for complex, high-dimensional data.
• Interpretability vs. Accuracy: Simpler models (e.g.,
Decision Trees) are chosen for transparency, while complex
models (e.g., Neural Networks) are used for higher accuracy.
• Computational Efficiency: Training speed and prediction
time.
• Techniques:
• Cross-Validation: Dividing data into multiple, exclusive
sets to train and test models iteratively.
• Train-Test Split: Evaluating the model's performance
on a separate test set.
Model Selection Steps
1. Define the Problem: Determine if it is a classification,
regression, or clustering task.
• Another important point is about the nature of the dataset
itself. One has to check for missing values, the number of
numerical and categorical variables and the distribution of
data.
• Understanding the type of problem and the dataset helps in
choosing the most suitable machine learning model
2. Select Candidate Models: After understanding the problem,
we shortlist several models that could solve it.
• Choose potential algorithms suitable for the data type.
3. Evaluate Performance: Once we have identified the candidate
models, we must rank each one according to how well it does the
job.
• The most common method is to split the dataset into two parts.
• Training Set: The data used to train a machine learning model
by learning patterns and relationships.
• Testing Set: This checks how well a model performs over new,
unseen data.
• Different machine learning problems require different evaluation
metrics.
• For Regression Problems: We make use of Mean Squared
Error (MSE), Mean Absolute Error (MAE) and R-squared.
• For Classification Problems: We make use of Accuracy,
Precision, Recall and F1-score.
4. Select the Best Model: Choose the model with the best
balance of accuracy and complexity.
• Cross-Validation Based Selection: This method involves
using cross-validation to evaluate multiple models and
selecting the one with the best average performance.
• Instead of relying on a single train-test
split, cross-validation divides the dataset into multiple parts
and trains the model on different subsets.
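Cross-validation based selection can be sketched as follows. Two hypothetical candidates are compared: a baseline that always predicts the training mean, and a simple linear regression. The data, the function names, and the choice of MSE as the metric are assumptions for illustration; folds are formed by taking every k-th sample.

```python
def fit_mean_model(xs, ys):
    """Baseline candidate: always predict the training mean."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear_model(xs, ys):
    """Candidate: least-squares line for a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
    intercept = y_mean - slope * x_mean
    return lambda x: intercept + slope * x


def mse(actual, predicted):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def cv_mse(fit, xs, ys, k):
    """Average test MSE over k folds (fold = sample index modulo k)."""
    fold_errors = []
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in pairs if i % k != fold]
        test = [xy for i, xy in pairs if i % k == fold]
        model = fit([x for x, _ in train], [y for _, y in train])
        fold_errors.append(
            mse([y for _, y in test], [model(x) for x, _ in test]))
    return sum(fold_errors) / k


# Hypothetical data: roughly y = 2x + 1 with small deviations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9]

candidates = {
    "mean baseline": fit_mean_model,
    "linear regression": fit_linear_model,
}
scores = {name: cv_mse(fit, xs, ys, k=5) for name, fit in candidates.items()}
best_name = min(scores, key=scores.get)
print(scores)
print("Selected:", best_name)
```

The same loop works for any number of candidates, as long as each `fit` function returns a callable that predicts y from x.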
Challenges of Model Selection:
• Overfitting: A model that is too complex may fit the noise in
the training data, leading to poor generalization.
• Underfitting: A model that is too simple may fail to capture
the underlying patterns.
• Trade-offs: Balancing interpretability with accuracy.
Logistic Regression in Machine Learning
• Like all regression analyses, logistic regression is a predictive analysis.
• It is used to describe data and to explain the relationship
between one dependent binary variable and one or more
nominal, ordinal, interval or ratio-level independent variables.
• Logistic regression is a fundamental supervised machine
learning algorithm used primarily for binary classification
(predicting one of two outcomes, e.g., Yes/No, 0/1) by
calculating probabilities using the sigmoid function.
• It models the relationship between independent variables and
a categorical dependent variable, mapping outputs to a 0-1
range.
• Unlike linear regression which predicts continuous
values it predicts the probability that an input belongs to
a specific class.
• It is used for binary classification where the output can
be one of two possible categories such as Yes/No,
True/False or 0/1.
• It can be used to classify data into categories, or
classes, by predicting the probability that an observation
falls into a particular class based on its features.
• It uses sigmoid function to convert inputs into a
probability value between 0 and 1.
Types of Logistic Regression
• Logistic regression can be classified into three main types
based on the nature of the dependent variable:
• Binomial Logistic Regression: This type is used when the
dependent variable has only two possible categories.
• Examples include Yes/No, Pass/Fail or 0/1. It is the most
common form of logistic regression and is used for binary
classification problems.
• Multinomial Logistic Regression: This is used when the
dependent variable has three or more possible categories that
are not ordered.
• For example, classifying animals into categories like "cat,"
"dog" or "sheep." It extends the binary logistic regression to
handle multiple classes.
• Ordinal Logistic Regression: This type applies when
the dependent variable has three or more categories
with a natural order or ranking.
• Examples include ratings like "low," "medium" and
"high." It takes the order of the categories into account
when modeling.
• Logistic regression
• Logistic regression estimates the probability of an event
occurring, such as voted or didn't vote, based on a given
dataset of independent variables.
• Since the outcome is a probability, the dependent variable is
bounded between 0 and 1.
• Logistic regression can be viewed as a linear model whose
output is passed through the sigmoid function (also known as
the logistic function) instead of being used directly. The
sigmoid is a transformation of the linear output, not the cost
function; the cost function is log loss.
How does Logistic Regression work?
• Prepare the data: The data should be in a format where each
row represents a single observation and each column
represents a different variable. The target variable (the
variable you want to predict) should be binary (yes/no,
true/false, 0/1).
• Train the model: We teach the model by showing it the
training data. This involves finding the values of the model
parameters that minimize the error in the training data.
• Evaluate the model: The model is evaluated on the held-out
test data to assess its performance on unseen data.
• Use the model to make predictions: After the model has
been trained and assessed, it can be used to forecast
outcomes on new data.
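The four steps above can be sketched end to end with a single feature. This is a minimal illustration that assumes batch gradient descent on log loss as the training procedure; the learning rate, the number of epochs, and the study-hours data are hypothetical choices, not values from the text.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, ys, learning_rate=0.1, epochs=5000):
    """Fit p = sigmoid(b0 + b1 * x) by batch gradient descent on log loss."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # predicted probability minus label
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


def predict(b0, b1, x, threshold=0.5):
    return 1 if sigmoid(b0 + b1 * x) >= threshold else 0


# 1. Prepare the data: one row per observation, binary target (0/1).
train_x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]  # hours studied
train_y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]                      # passed exam?
test_x = [0.75, 4.75]
test_y = [0, 1]

# 2. Train the model.
b0, b1 = train_logistic_regression(train_x, train_y)

# 3. Evaluate on held-out data.
test_predictions = [predict(b0, b1, x) for x in test_x]
accuracy = sum(p == y for p, y in zip(test_predictions, test_y)) / len(test_y)

# 4. Use the model on a new observation.
new_probability = sigmoid(b0 + b1 * 3.25)
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, test accuracy = {accuracy:.2f}")
print(f"P(pass | 3.25 hours) = {new_probability:.2f}")
```

A two-point test set is far too small for a real evaluation; it is used here only to keep the example short.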
What is the Sigmoid Function?
• In order to map predicted values to probabilities, we
use the Sigmoid function. The function maps any real
value into another value between 0 and 1. In machine
learning, we use sigmoid to map predictions to
probabilities.
• The sigmoid function is an important part of logistic regression
which is used to convert the raw output of the model into a
probability value between 0 and 1.
• This function takes any real number and maps it into the range
0 to 1 forming an "S" shaped curve called the sigmoid curve or
logistic curve. Because probabilities must lie between 0 and 1,
the sigmoid function is perfect for this purpose.
• In logistic regression, we use a threshold value usually 0.5 to
decide the class label.
• If the sigmoid output is equal to or above the threshold, the input
is classified as Class 1.
• If it is below the threshold, the input is classified as Class 0.
• This approach helps to transform continuous input values into
meaningful class predictions.
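The mapping and the threshold rule can be written in a few lines. This sketch assumes the standard definition of the sigmoid, 1 / (1 + e^(−z)), and the default threshold of 0.5 mentioned above; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return Class 1 if the sigmoid output reaches the threshold, else Class 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in [-4, -1, 0, 1, 4]:
    print(f"z = {z:>2}: sigmoid = {sigmoid(z):.3f}, class = {classify(z)}")
```

With a threshold of 0.5, the class boundary sits exactly at z = 0, since the sigmoid of 0 is 0.5.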
• Hypothesis Representation
• For linear regression, the hypothesis is:
• hΘ(x) = β₀ + β₁X
• For logistic regression, we pass the same linear expression
through the sigmoid function σ:
• Z = β₀ + β₁X
• hΘ(x) = σ(Z) = 1 / (1 + e^(−Z))
• As a result, the hypothesis always gives values
between 0 and 1.
• What does logit mean?
• The logit is the log-odds function.
• A logit function, also known as the log-odds function, maps
probability values in the range (0, 1) to real values in the
range (−∞, +∞): logit(p) = log(p / (1 − p)).
• It is the inverse of the sigmoid function: the sigmoid has
outputs between 0 and 1 (on the Y-axis), whereas the logit takes
inputs between 0 and 1 (on the X-axis).
• Odds are the ratio of the probability of success to the
probability of failure: odds = p / (1 − p).
• Odds are never negative; for probabilities strictly between
0 and 1, their range is (0, +∞).
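A short sketch of these relationships, assuming the standard definitions odds = p / (1 − p) and logit(p) = log(odds); the function names are illustrative.

```python
import math


def odds(p):
    """Ratio of the probability of success to the probability of failure."""
    return p / (1 - p)


def logit(p):
    """Log-odds: maps a probability in (0, 1) to a real number."""
    return math.log(odds(p))


def sigmoid(z):
    """Inverse of the logit: maps a real number back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


for p in [0.1, 0.5, 0.8]:
    print(f"p = {p}: odds = {odds(p):.3f}, logit = {logit(p):.3f}, "
          f"sigmoid(logit) = {sigmoid(logit(p)):.3f}")
```

Applying the sigmoid to the logit of a probability returns the original probability, which is what "inverse" means here.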
Cost Function in Logistic Regression
• To measure how well the model is performing, we use a cost
function, which tells us how far the predicted values are from
the actual ones.
• In Logistic Regression, the cost function is based on log loss
(cross-entropy loss) instead of mean squared error.
• It measures the error between the predicted probability and the
actual class label (0 or 1).
• Instead of a straight line (like Linear Regression), Logistic
Regression works with probabilities between 0 and 1 using
the sigmoid function.
• The cost function penalizes wrong predictions more heavily
when the model is confident but wrong.
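The penalty behaviour of log loss can be checked with a small sketch. It assumes the usual binary cross-entropy definition, averaged over samples; the clipping constant `eps` is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# The true label is 1 in every case; only the predicted probability changes.
confident_right = log_loss([1], [0.9])
unsure = log_loss([1], [0.6])
confident_wrong = log_loss([1], [0.01])
print(f"confident and right: {confident_right:.3f}")
print(f"unsure:              {unsure:.3f}")
print(f"confident and wrong: {confident_wrong:.3f}")
```

The loss grows without bound as the predicted probability of the true class approaches zero, which is why confident mistakes dominate the cost.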
Assumptions of Logistic Regression
• Understanding the assumptions behind logistic regression
is important to ensure the model is applied correctly. The main
assumptions are:
• Independent observations: Each data point is assumed to
be independent of the others, meaning there should be no
correlation or dependence between the input samples.
• Binary dependent variable: The dependent variable is
assumed to be binary, meaning it can take
only two values.
• Linear relationship between independent variables
and log odds: The model assumes a linear relationship
between the independent variables and the log odds of the
dependent variable, which means the predictors affect the
log odds in a linear way.
• No outliers: The dataset should not contain extreme
outliers as they can distort the estimation of the logistic
regression coefficients.
• Large sample size: It requires a sufficiently large sample
size to produce reliable and stable results.
Key Advantages:
• High Interpretability: Coefficients indicate the direction
and strength of the relationship between features and the
target.
• Efficiency: Fast to train and requires less computational
power, making it suitable for large datasets.
• Probability Output: Provides probability estimates
(often reasonably well calibrated) rather than just class labels.
• Low Overfitting Potential: Less prone to overfitting in
low-dimensional datasets, though it can occur if the
number of features is large relative to observations.
Key Disadvantages:
• Linearity Assumption: Assumes a linear relationship
between the independent variables and the log-odds of
the dependent variable, failing on complex, non-linear
data.
• Sensitive to Outliers: Extreme data points can
significantly skew results.
• Feature Limitation: May require preprocessing, such as
handling multicollinearity, and performs poorly if the
independent variables have little relationship with the
target.
• Binary Restriction: Primarily designed for binary
classification, requiring modifications for multiclass
problems.
Common Applications
• Spam detection (spam/not spam).
• Medical diagnosis (malignant/benign tumor).
• Customer churn prediction (churn/not churn).
• Text classification (sentiment analysis).
An odds ratio (OR) is a statistic measuring the strength of
association between an exposure and an outcome, calculated
as the ratio of the odds of an event occurring in an exposed
group to the odds in an unexposed group.
Interpretation:
• 𝑂𝑅=1: No association between exposure and outcome.
• 𝑂𝑅>1: Exposure is associated with higher odds of the
outcome (risk factor).
• 𝑂𝑅<1: Exposure is associated with lower odds of the outcome
(protective factor)
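The definition can be applied to a 2×2 table of counts. The sketch below uses hypothetical numbers (30 of 100 exposed and 10 of 100 unexposed individuals experience the event), and the function and parameter names are illustrative.

```python
def odds_ratio(exposed_events, exposed_non_events,
               unexposed_events, unexposed_non_events):
    """Odds of the event in the exposed group divided by the odds in the unexposed group."""
    odds_exposed = exposed_events / exposed_non_events
    odds_unexposed = unexposed_events / unexposed_non_events
    return odds_exposed / odds_unexposed


# Hypothetical counts: 30 of 100 exposed and 10 of 100 unexposed have the outcome.
ratio = odds_ratio(30, 70, 10, 90)
print(f"Odds ratio = {ratio:.2f}")

if ratio > 1:
    print("Exposure is associated with higher odds of the outcome.")
elif ratio < 1:
    print("Exposure is associated with lower odds of the outcome.")
else:
    print("No association between exposure and outcome.")
```

Here the odds are 30/70 in the exposed group and 10/90 in the unexposed group, so the ratio is 27/7, well above 1.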
Generalized Linear Model
• In a generalized linear model (GLM), each outcome Y of the
dependent variable is assumed to be generated from a distribution in the
exponential family of distributions (which includes
distributions such as the normal, binomial, Poisson and
gamma distributions, among others).
• GLM thus expands the scenarios in which linear regression
can apply by expanding the possibilities of the outcome
variable.
• GLM parameters are estimated by maximum likelihood for the
exponential family; for normal linear models this reduces to
least squares.
• Logistic regression measures the relationship between the
dependent variable and one or more independent
variables (features) by estimating probabilities with the
logistic (sigmoid) function, whose inverse, the logit, acts as
the GLM link function.