
CH 2

Chapter 02 discusses regression models, a statistical and machine learning technique for modeling relationships between dependent and independent variables to predict continuous outcomes. It covers various types of regression, including linear, multiple, and logistic regression, as well as the importance of the best-fit line and cross-validation methods for model evaluation. The chapter emphasizes the significance of model selection in achieving optimal predictive accuracy and generalization to unseen data.

Uploaded by

vedantkharade05
Copyright
© All Rights Reserved

Chapter 02

Regression Models
Introduction to Regression
• Regression is a fundamental statistical and machine learning technique used to model the relationship between a dependent variable (target) and one or more independent variables (predictors).

• It predicts continuous outcomes by analyzing the relationship between the dependent and independent variables.

• Its primary purpose is prediction, estimation, and understanding how variables influence each other.

• It is widely applied in forecasting, risk analysis, and decision-making across fields like economics, healthcare, and engineering.
What is Regression?

• Definition: Regression is a supervised learning method that predicts a continuous dependent variable (outcome) based on one or more independent variables (predictors).

• Purpose: To understand how changes in predictors influence the outcome and to make predictions.

• Applications: Price prediction, demand forecasting, risk scoring, medical prognosis, and trend estimation.
Core Concepts

• Dependent Variable (Target/Outcome): The variable you want to predict or explain (e.g., house price).

• Independent Variables (Predictors/Features): The variables used to make the prediction (e.g., house size, location).

• Relationship: How changes in independent variables correspond to changes in the dependent variable.

• Model: The mathematical function (line, curve) that represents this relationship, fitted to data points.
What is a Regression Line?
• A regression line describes the relationship between two or more variables.

• It is a straight line that reflects the best-fit connection in a dataset between independent and dependent variables.

• The main purpose of developing a regression line is to predict or estimate the value of the dependent variable based on the values of one or more independent variables.

• Equation of Regression Line
• The equation of a simple linear regression line is given by:
Y = a + bX + ε
• Y is the dependent variable.
• X is the independent variable.
• a is the y-intercept, which represents the value of Y when X is 0.
• b is the slope, which represents the change in Y for a unit change in X.
• ε is the residual error, the part of Y that the line does not explain.
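The intercept a and slope b can be estimated from data by least squares. The following is a minimal sketch, not part of the original slides: the function name `fit_simple_linear` and the house-size data are hypothetical, chosen so that the points lie exactly on Y = 5 + 2X and every residual ε is zero.

```python
def fit_simple_linear(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through (x_mean, y_mean).
    a = y_mean - b * x_mean
    return a, b


# Hypothetical data: house size vs. price, constructed as price = 5 + 2 * size.
sizes = [10, 15, 20, 25, 30]
prices = [25, 35, 45, 55, 65]

a, b = fit_simple_linear(sizes, prices)
# Residuals are the differences between observed and predicted Y.
residuals = [y - (a + b * x) for x, y in zip(sizes, prices)]
print(a, b, residuals)
```

With noisy real data the residuals would not be zero; the fitted line is then the one with the smallest sum of squared residuals.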
Types of Regression
Type | Description | Example Use Case
Linear Regression | If the regression curve is a straight line, the regression is said to be linear. | Predicting a student's final exam score from study time or class attendance.
Simple Linear Regression | Models the relationship between one independent variable and one dependent variable using a straight line. | Predicting house price based on square footage.
Multiple Linear Regression | Uses two or more independent variables to predict a dependent variable. | Predicting salary based on education, experience, and age.
Non-linear Regression | Models complex relationships beyond linear assumptions. | Chemical reaction rates.
Logistic Regression | Used when the dependent variable is categorical (binary outcomes). | Predicting whether a patient has a disease (Yes/No).
Linear Regression

• Linear regression is a type of supervised machine-learning algorithm.

• It learns from labelled datasets and fits the linear function that best maps the inputs to the outputs, which can then be used for prediction on new datasets.

• It assumes that there is a linear relationship between the input and output, meaning the output changes at a constant rate as the input changes. This relationship is represented by a straight line.
Best Fit Line in Linear Regression
• In linear regression, the best-fit line is the straight line that most accurately represents the relationship between the independent variable (input) and the dependent variable (output).

• It is the line that minimizes the difference between the actual data points and the predicted values from the model.

3. Goal of the Best-Fit Line
• The goal of linear regression is to find a straight line that minimizes the error (the difference) between the observed data points and the predicted values. This line helps us predict the dependent variable for new, unseen data.

4. Interpretation of the Best-Fit Line
• Slope (m): In the notation y = mx + b, the slope of the best-fit line indicates how much the dependent variable (y) changes with each unit change in the independent variable (x).

• For example, if the slope is 5, then for every 1-unit increase in x, the value of y increases by 5 units.

• Intercept (b): The intercept represents the predicted value of y when x = 0.

• It is the point where the line crosses the y-axis.




Lines of regression
• Two lines of regression exist for two variables (X and Y) because you can model the relationship in two ways: predicting Y from X (Y on X) or predicting X from Y (X on Y).
• Each line minimizes a different error (vertical vs. horizontal distances) using the least squares method.
• This results in two distinct lines that intersect at the mean point (x̄, ȳ); they coincide only when the correlation is perfect (r = ±1).

• Different Predictions:
• If you're interested in predicting height (Y) from weight (X), you use Y on X; if predicting weight (X) from height (Y), you use X on Y.
• The Two Regression Lines:
• Regression Line of Y on X (y = a + bx):
• Predicts the dependent variable Y for a given value of the independent variable X. It minimizes the sum of squared vertical errors (residuals).

• Regression Line of X on Y (x = c + dy):
• Predicts X, now treated as the dependent variable, for a given value of Y. It minimizes the sum of squared horizontal errors.
Calculation and Properties
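• Formula: For simple linear regression of Y on X, the coefficient b_yx is calculated as:
$$b_{yx} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}$$

• Formula: For simple linear regression of X on Y, the coefficient b_xy is calculated as:
$$b_{xy} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (y - \bar{y})^2}$$

• Alternatively, each coefficient can be expressed using the correlation coefficient r and the standard deviations σ_x and σ_y:
$$b_{yx} = r \cdot \frac{\sigma_y}{\sigma_x}, \qquad b_{xy} = r \cdot \frac{\sigma_x}{\sigma_y}$$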
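As an illustration of the two coefficients, the sketch below computes b_yx, b_xy and r for a small made-up weight/height sample. The function name and the data are hypothetical. It also checks a standard property that follows from the formulas: the product of the two regression coefficients equals r².

```python
import math


def regression_coefficients(xs, ys):
    """Return (b_yx, b_xy, r) for paired samples xs and ys."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    b_yx = sxy / sxx                  # slope of the Y-on-X line
    b_xy = sxy / syy                  # slope of the X-on-Y line
    r = sxy / math.sqrt(sxx * syy)    # correlation coefficient
    return b_yx, b_xy, r


# Hypothetical sample: weight in kg (X) and height in cm (Y).
weights = [50, 60, 70, 80, 90]
heights = [150, 160, 165, 175, 180]

b_yx, b_xy, r = regression_coefficients(weights, heights)
print(b_yx, b_xy, r, b_yx * b_xy)
```

Because b_yx · b_xy = r², the two coefficients always share the sign of r.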
Advantages of Linear Regression
• Simplicity and Interpretability: It provides a straightforward mathematical equation where coefficients represent the impact of each independent variable on the target variable.

• Efficiency: It is computationally fast to train and, therefore, suitable for large datasets.

• Optimal for Linear Trends: It works exceptionally well when the relationship between variables is truly linear.

• Regularization: Overfitting can be reduced using techniques like L1 (Lasso) and L2 (Ridge) regularization.

• Provides Baseline: It is a commonly used, solid baseline model for comparing more complex algorithms.
Disadvantages of Linear Regression
• Assumption of Linearity: It cannot capture non-linear relationships, potentially leading to underfitting.

• Sensitivity to Outliers: Anomalies can have a disproportionately large influence on the regression coefficients, distorting the model.

• Multicollinearity: It assumes predictors are not highly correlated with one another; when they are, coefficient estimates become unstable and hard to interpret.

• Assumption of Independence: It assumes that the observations are independent of each other.

• Poor on Complex Data: It may not represent real-world data well, since real relationships are rarely perfectly linear.
Assumptions of Linear Regression
• Linearity
• No multicollinearity
• Homoskedasticity
• No autocorrelation

Limitations of Linear Regression


• Outliers: Outliers can significantly impact the slope and intercept of the regression line.
• Non-linearity: Linear regression assumes a linear relationship,
but this assumption may not hold in some cases.
• Correlation ≠ Causation: Just because two variables have a
linear relationship doesn’t mean changes in one cause
changes in the other.
Cross validation in machine learning

• Cross-validation techniques are statistical methods used in machine learning to assess model performance and detect overfitting by training and testing models on multiple, different subsets of the data.

• By partitioning data, training on some subsets, and testing on others, these methods give a robust estimate of performance on unseen data.

• In short, cross-validation checks how well a machine learning model performs on unseen data, so that overfitting can be detected.
• What is Cross Validation
• Cross-validation serves multiple purposes:
• Avoids Overfitting: Ensures that the model does not perform well only on the training data but generalizes to unseen data.

• Provides Robust Evaluation: Averages results over multiple iterations, reducing bias and variance in the performance metrics.

• Efficient Use of Data: Maximizes the utilization of the dataset, especially when the data size is limited.
• It works by:
• Splitting the dataset into several parts.

• Training the model on some parts and testing it on the remaining part.

• Repeating this resampling process multiple times by choosing different parts of the dataset.

• Averaging the results from each validation step to get the final performance.
Why Use Cross-Validation?

• Reduces Overfitting: By rotating validation sets, it ensures the model isn't just memorizing training data.

• Better Data Utilization: It uses the entire dataset for both training and validation across different iterations.

• Hyperparameter Tuning: It helps identify the optimal parameters that yield the lowest validation error.
Key Concepts and Types
• Purpose: Accurately assesses how a model generalizes to new data, rather than just how it performs on the training set.

• Stratified K-Fold: Ensures each fold has the same percentage of samples of each target class, which is vital for imbalanced datasets.

• The dataset is divided into k folds, keeping class proportions consistent in each fold.
• In each iteration, one fold is used for testing and the remaining folds for training.
• This process is repeated k times so that each fold is used once as the test set.
• It gives a more reliable performance estimate for classification models by maintaining balanced class representation.
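One simple way to build stratified folds is to deal the samples of each class across the folds in turn. The sketch below is illustrative only: the function name `stratified_folds` and the labels are hypothetical, and no shuffling is done, which a practical implementation would normally add.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, keeping class proportions equal."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced labels: 8 negatives and 4 positives.
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)

for number, fold in enumerate(folds, start=1):
    positives = sum(labels[i] for i in fold)
    print(f"fold {number}: size {len(fold)}, positives {positives}")
```

Each of the four folds ends up with two negatives and one positive, matching the 2:1 class ratio of the full dataset.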
• Holdout Method: The simplest form, splitting the data once into a fixed training set and a testing set (often 70/30 or 80/20).

• In the most basic holdout validation, 50% of the data is used for training and 50% for testing, which makes it simple and quick to apply.

• The major drawback of a 50/50 split is that only half of the data is used for training, so the model may miss important patterns in the other half, which leads to high bias.

• Monte Carlo (Shuffle Split): Randomly splits the data into training and validation sets multiple times.
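The holdout and Monte Carlo methods differ only in how many random splits are drawn. A minimal sketch, assuming a hypothetical helper `shuffle_split` that returns index lists (with `n_splits=1` it behaves like a single holdout split):

```python
import random


def shuffle_split(n_samples, test_fraction, n_splits, seed=0):
    """Return a list of (train_indices, test_indices) random splits."""
    rng = random.Random(seed)  # fixed seed so the splits are reproducible
    n_test = int(round(n_samples * test_fraction))
    splits = []
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        # The first n_test shuffled indices form the test set.
        splits.append((indices[n_test:], indices[:n_test]))
    return splits


# One 80/20 holdout split, then three Monte Carlo splits of 10 samples.
holdout = shuffle_split(10, test_fraction=0.2, n_splits=1)
monte_carlo = shuffle_split(10, test_fraction=0.2, n_splits=3)

for train, test in monte_carlo:
    print("train:", sorted(train), "test:", sorted(test))
```

Unlike K-fold, the Monte Carlo test sets may overlap between repetitions, and some samples may never be selected for testing.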
• LOOCV (Leave-One-Out Cross-Validation)
• In this method the model is trained on the entire dataset except for one data point, which is used for testing.

• This process is repeated for each data point in the dataset.

• All but one of the data points are used for training in every iteration, resulting in low bias.

• Testing on a single data point can cause high variance, especially if the point is an outlier.

• It can be very time-consuming for large datasets, as it requires one iteration per data point.
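A LOOCV loop can be sketched as follows. This assumes a simple linear model fitted by least squares; the function names and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    return y_mean - b * x_mean, b


def loocv_mse(xs, ys):
    """Mean squared error over n leave-one-out iterations."""
    squared_errors = []
    for i in range(len(xs)):
        # Train on every point except point i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        a, b = fit_line(train_x, train_y)
        # Test on the single held-out point.
        prediction = a + b * xs[i]
        squared_errors.append((ys[i] - prediction) ** 2)
    return sum(squared_errors) / len(squared_errors)


# Hypothetical data lying close to y = 2x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
print(loocv_mse(xs, ys))
```

The model is refitted once per data point, which is why the cost grows quickly with dataset size.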
• K-Fold Cross-Validation: The most common method, where the data is split into K equal-sized folds.

• The model is trained on K−1 folds and tested on the remaining fold.
• This process is repeated K times, each time using a different fold for testing.
Life cycle of K-fold Cross-Validation
Steps in K-Fold Cross-Validation:
1. Split the dataset into K equal-sized folds.

2. Train the model on K−1 folds.

3. Test the model on the remaining fold and record the error/accuracy.

4. Repeat this process K times, using a different fold as the validation set each time.

5. Average the K recorded performances to get a final, more accurate estimate.
• Here we will take k as 5, with the data points numbered 1–25.
• 1st iteration: The first 20% of the data [1–5] is used for testing and the remaining 80% [6–25] is used for training.

• 2nd iteration: The second 20% [6–10] is used for testing and the remaining data [1–5] and [11–25] is used for training.

• This process continues until each fold has been used once as the test set.
• Each iteration uses different subsets for testing and training, ensuring that all data points are used for both training and testing.
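The k = 5 example above can be reproduced with a few lines of code. This is a sketch under the assumption that the 25 data points are numbered 1 to 25 and split in order without shuffling; `kfold_indices` is a hypothetical helper name.

```python
def kfold_indices(n_samples, k):
    """Return (test, train) lists of 1-based sample numbers for each fold."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    samples = list(range(1, n_samples + 1))
    splits = []
    for fold in range(k):
        start = fold * fold_size
        stop = start + fold_size
        test = samples[start:stop]
        train = samples[:start] + samples[stop:]
        splits.append((test, train))
    return splits


splits = kfold_indices(25, 5)
for iteration, (test, train) in enumerate(splits, start=1):
    print(f"iteration {iteration}: test {test[0]}-{test[-1]}, "
          f"train size {len(train)}")
```

Every data point appears in exactly one test fold and in four training sets.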
Performance Aggregation:
• Calculate Scores: Collect the evaluation metrics (e.g., accuracy, R², RMSE) from each of the K iterations.
• Average Performance: Compute the mean (and often the standard deviation) of the K validation scores to determine the final, generalized performance metric of the model.
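Putting the steps and the aggregation together, a K-fold run for a simple linear model might look like the sketch below. The helper names, the synthetic data (y = 3 + 2x plus uniform noise) and the seeds are all assumptions made for illustration.

```python
import math
import random
import statistics


def fit_line(xs, ys):
    """Least-squares intercept and slope for y = a + b * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    return y_mean - b * x_mean, b


def kfold_rmse(xs, ys, k, seed=0):
    """Return the list of K validation RMSE scores, one per fold."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # K roughly equal folds
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_x = [xs[i] for i in indices if i not in held_out]
        train_y = [ys[i] for i in indices if i not in held_out]
        a, b = fit_line(train_x, train_y)
        squared = [(ys[i] - (a + b * xs[i])) ** 2 for i in test_idx]
        scores.append(math.sqrt(sum(squared) / len(squared)))
    return scores


# Synthetic data: y = 3 + 2x with small uniform noise.
noise = random.Random(1)
xs = list(range(20))
ys = [3 + 2 * x + noise.uniform(-1, 1) for x in xs]

scores = kfold_rmse(xs, ys, k=5)
mean_rmse = statistics.mean(scores)
std_rmse = statistics.stdev(scores)
print(f"RMSE per fold: {scores}")
print(f"mean = {mean_rmse:.3f}, std = {std_rmse:.3f}")
```

Reporting the standard deviation alongside the mean shows how much the score depends on which fold was held out.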
Key Considerations
• Choosing K: A value of K = 10 is common as it provides a good balance between low bias and moderate variance.

• Computational Cost: High K values mean longer training times, as the model must be trained K times.

• Final Model: After finding the best parameters via CV, the final model is typically trained on the entire dataset.
Common Applications

• Model Selection: Choosing between a simple linear model and a polynomial model.

• Hyperparameter Tuning: Finding the best regularization parameter (λ) for Ridge or Lasso regression.

• Performance Assessment: Ensuring the model generalizes well rather than just memorizing training data.
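As a sketch of hyperparameter tuning, the code below picks λ for a one-feature ridge regression by 5-fold cross-validation. It assumes the common convention of penalizing only the slope, which gives the closed form b = S_xy / (S_xx + λ). The data, the candidate λ values and the function names are hypothetical.

```python
def fit_ridge_line(xs, ys, lam):
    """Ridge fit of y = a + b * x with an L2 penalty lam * b**2 on the slope."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / (sxx + lam)  # lam = 0 recovers ordinary least squares
    return y_mean - b * x_mean, b


def cv_mse(xs, ys, lam, k):
    """K-fold cross-validated mean squared error for one value of lam."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))  # every k-th sample forms a fold
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        a, b = fit_ridge_line(train_x, train_y, lam)
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in test_idx)
    return total / n


# Hypothetical data lying close to y = 2x.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.1, 18.2, 19.8]

candidates = [0.0, 0.1, 1.0, 10.0, 100.0]
errors = {lam: cv_mse(xs, ys, lam, k=5) for lam in candidates}
best_lam = min(errors, key=errors.get)
print(errors)
print("best lambda:", best_lam)
```

On data this clean a heavy penalty over-shrinks the slope, so the largest λ gives a clearly worse validation error than no penalty at all.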
Model Selection for Machine Learning

• Model selection in machine learning is the process of choosing the best-performing algorithm, such as linear regression or random forests, for a specific dataset to maximize predictive accuracy, efficiency, and generalization.

• The choice of model significantly affects the accuracy, efficiency and reliability of predictions. A bad model can cause overfitting or underfitting and sometimes even lead to increased computational costs.
Key Aspects of Model Selection

• Purpose: To find the optimal model that generalizes well to unseen data, not just the training set.

• Key Factors:
• Performance Metrics: Accuracy, precision, recall, and mean squared error.

• Data Type/Size: Linear models for simple data, deep learning for complex, high-dimensional data.

• Interpretability vs. Accuracy: Simpler models (e.g., Decision Trees) are chosen for transparency, while complex models (e.g., Neural Networks) are used for higher accuracy.

• Computational Efficiency: Training speed and prediction time.
• Techniques:
• Cross-Validation: Dividing data into multiple, exclusive sets to train and test models iteratively.

• Train-Test Split: Evaluating the model's performance on a separate test set.
Model Selection Steps

1. Define the Problem: Determine if it is a classification, regression, or clustering task.

• Another important point is the nature of the dataset itself. One has to check for missing values, the number of numerical and categorical variables, and the distribution of the data.

• Understanding the type of problem and the dataset helps in choosing the most suitable machine learning model.

2. Select Candidate Models: After understanding the problem, we choose a set of candidate models that could solve it.
• Choose potential algorithms suitable for the data type.
3. Evaluate Performance: Once we have identified the candidate models, we must rank each one according to how well it does the job.
• The most common method is to split the dataset into two parts.
• Training Set: The data used to train a machine learning model by learning patterns and relationships.
• Testing Set: The data used to check how well the model performs on new, unseen data.

• Different machine learning problems require different evaluation metrics.
• For Regression Problems: We make use of Mean Squared Error (MSE), Mean Absolute Error (MAE) and R-squared.

• For Classification Problems: We make use of Accuracy, Precision, Recall and F1-score.
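The metrics named above follow directly from their definitions. The sketch below is illustrative: the function names are hypothetical and the binary case assumes that label 1 is the positive class.

```python
def regression_metrics(y_true, y_pred):
    """Return (MSE, MAE, R-squared) for paired true and predicted values."""
    n = len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return ss_res / n, mae, 1 - ss_res / ss_tot


def classification_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, F1) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


print(regression_metrics([3.0, 5.0, 7.0], [2.5, 5.0, 7.5]))
print(classification_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

Lower MSE and MAE are better, while R-squared and the four classification scores are better when closer to 1.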
4. Select the Best Model: Choose the model with the best balance of accuracy and complexity.
• Cross-Validation Based Selection: This method involves using cross-validation to evaluate multiple models and selecting the one with the best average performance.
• Instead of relying on a single train-test split, cross-validation divides the dataset into multiple parts and trains and tests the model on different subsets.
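Cross-validation based selection can be sketched by scoring each candidate with the same folds and keeping the one with the lowest average error. The two candidates here (a constant model that always predicts the training mean, and a simple linear model) and the data are hypothetical stand-ins for whatever models are under consideration.

```python
def fit_constant(xs, ys):
    """Baseline candidate: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_linear(xs, ys):
    """Simple linear regression candidate fitted by least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return lambda x: a + b * x


def cv_mse(fit, xs, ys, k):
    """K-fold cross-validated mean squared error for one candidate."""
    n = len(xs)
    total = 0.0
    for fold in range(k):
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        predict = fit(train_x, train_y)
        total += sum((ys[i] - predict(xs[i])) ** 2 for i in test_idx)
    return total / n


xs = list(range(1, 11))
ys = [2.3, 3.8, 6.1, 8.4, 9.7, 12.2, 13.9, 16.1, 18.2, 19.8]

candidates = {"constant": fit_constant, "linear": fit_linear}
scores = {name: cv_mse(fit, xs, ys, k=5) for name, fit in candidates.items()}
best_model = min(scores, key=scores.get)
print(scores, "->", best_model)
```

The same loop extends to any number of candidates as long as each one exposes a fit function that returns a predictor.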

Challenges of Model Selection:


• Overfitting: A model that is too complex may fit the noise in the training data, leading to poor generalization.

• Underfitting: A model that is too simple may fail to capture the underlying patterns.

• Trade-offs: Balancing interpretability with accuracy.


Logistic Regression in Machine Learning
• Like all regression analyses, logistic regression is a predictive analysis.

• It is used to describe data and to explain the relationship between one binary dependent variable and one or more nominal, ordinal, interval or ratio-level independent variables.

• Logistic regression is a fundamental supervised machine learning algorithm used primarily for binary classification (predicting one of two outcomes, e.g., Yes/No, True/False or 0/1) by calculating probabilities using the sigmoid function.

• It models the relationship between independent variables and a categorical dependent variable, mapping outputs to a 0–1 range.

• Unlike linear regression, which predicts continuous values, it predicts the probability that an input belongs to a specific class.

• It can be used to classify data into categories, or classes, by predicting the probability that an observation falls into a particular class based on its features.
Types of Logistic Regression
• Logistic regression can be classified into three main types
based on the nature of the dependent variable:
• Binomial Logistic Regression: This type is used when the
dependent variable has only two possible categories.
• Examples include Yes/No, Pass/Fail or 0/1. It is the most
common form of logistic regression and is used for binary
classification problems.

• Multinomial Logistic Regression: This is used when the dependent variable has three or more possible categories that are not ordered.
• For example, classifying animals into categories like "cat,"
"dog" or "sheep." It extends the binary logistic regression to
handle multiple classes.
• Ordinal Logistic Regression: This type applies when
the dependent variable has three or more categories
with a natural order or ranking.
• Examples include ratings like "low," "medium" and
"high." It takes the order of the categories into account
when modeling.
• Logistic regression
• Logistic regression estimates the probability of an event occurring, such as voted or didn't vote, based on a given dataset of independent variables.

• Since the outcome is a probability, the model's output is bounded between 0 and 1.

• Logistic regression can be viewed as a linear regression model whose output is passed through the sigmoid function (also known as the logistic function) instead of being used directly. It is also trained with a different cost function, log loss, rather than mean squared error.

How does Logistic Regression work?
• Prepare the data: The data should be in a format where each row represents a single observation and each column represents a different variable. The target variable (the variable you want to predict) should be binary (yes/no, true/false, 0/1).

• Train the model: We teach the model by showing it the training data. This involves finding the values of the model parameters that minimize the error on the training data.

• Evaluate the model: The model is evaluated on the held-out test data to assess its performance on unseen data.

• Use the model to make predictions: After the model has been trained and assessed, it can be used to forecast outcomes on new data.
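The four steps above can be sketched end to end for a single feature. This is an illustrative implementation that trains by batch gradient descent on the log loss; the learning rate, epoch count, function names and the hours-studied data are all assumptions, and library implementations typically use more sophisticated solvers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, learning_rate=0.2, epochs=10000):
    """Fit b0 and b1 of p = sigmoid(b0 + b1 * x) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad0 = grad1 = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(b0 + b1 * x) - y  # gradient of the log loss
            grad0 += error
            grad1 += error * x
        b0 -= learning_rate * grad0 / n
        b1 -= learning_rate * grad1 / n
    return b0, b1


# 1. Prepare the data: hypothetical hours studied -> pass (1) / fail (0).
train_x = [0.5, 1.0, 1.5, 2.0, 3.5, 4.0, 4.5, 5.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [1.2, 4.2]
test_y = [0, 1]

# 2. Train the model.
b0, b1 = train_logistic(train_x, train_y)

# 3. Evaluate on the held-out test data.
predictions = [1 if sigmoid(b0 + b1 * x) >= 0.5 else 0 for x in test_x]
accuracy = sum(p == y for p, y in zip(predictions, test_y)) / len(test_y)

# 4. Use the model on a new observation.
print("accuracy:", accuracy, "P(pass | 3 hours) =", sigmoid(b0 + b1 * 3.0))
```

The learned decision boundary is the value of x where b0 + b1·x = 0, that is, where the predicted probability equals 0.5.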
What is the Sigmoid Function?
• To map predicted values to probabilities, we use the sigmoid function, which converts the raw output of the model into a probability value between 0 and 1.

• This function takes any real number and maps it into the range 0 to 1, forming an "S"-shaped curve called the sigmoid curve or logistic curve. Because probabilities must lie between 0 and 1, the sigmoid function is well suited to this purpose.

• In logistic regression, we use a threshold value, usually 0.5, to decide the class label.

• If the sigmoid output is equal to or above the threshold, the input is classified as Class 1.
• If it is below the threshold, the input is classified as Class 0.
• This approach helps to transform continuous input values into meaningful class predictions.
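A minimal sketch of the sigmoid and the thresholding rule follows. The function names are hypothetical, and the two-branch form is one common way to avoid overflow for large negative inputs.

```python
import math


def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, this equivalent form avoids overflow in exp().
    ez = math.exp(z)
    return ez / (1.0 + ez)


def classify(z, threshold=0.5):
    """Return Class 1 if the sigmoid output reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(z, round(sigmoid(z), 4), classify(z))
```

With the default threshold of 0.5, the predicted class is 1 exactly when the raw model output z is zero or positive.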
• Hypothesis Representation
• For linear regression we used the hypothesis:
• hΘ(x) = β₀ + β₁X
• For logistic regression we modify it slightly by passing the linear term through the sigmoid function σ:
• Z = β₀ + β₁X
• hΘ(x) = σ(Z) = σ(β₀ + β₁X)
• As a result, the hypothesis gives values between 0 and 1.


• What does logit mean?
• The logit function, also known as the log-odds function, maps probability values between 0 and 1 to values between negative infinity and infinity: logit(p) = log(p / (1 − p)).

• The logit is the inverse of the sigmoid function: the sigmoid limits values to between 0 and 1 on the Y-axis, whereas the logit takes values between 0 and 1 on the X-axis.

• Odds are the ratio of the probability of success to the probability of failure: odds = p / (1 − p).

• Odds are always positive for 0 < p < 1, so their range is (0, +∞).
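The relationship between probability, odds, log-odds and the sigmoid can be checked numerically. This is a small sketch with hypothetical function names.

```python
import math


def odds(p):
    """Ratio of the probability of success to the probability of failure."""
    return p / (1.0 - p)


def logit(p):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return math.log(odds(p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


for p in (0.1, 0.5, 0.8):
    print(f"p={p}: odds={odds(p):.3f}, logit={logit(p):.3f}")

# The logit undoes the sigmoid, up to floating-point rounding.
z = 1.5
print(z, logit(sigmoid(z)))
```

A probability of 0.5 corresponds to odds of 1 and log-odds of 0, which is why 0.5 is the natural default threshold.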
Cost Function in Logistic Regression
• To measure how well the model is performing, we use a cost function, which tells us how far the predicted values are from the actual ones.
• In logistic regression, the cost function is based on log loss (cross-entropy loss) instead of mean squared error.

• It measures the error between the predicted probability and the actual class label (0 or 1).

• Instead of a straight line (as in linear regression), logistic regression works with probabilities between 0 and 1 using the sigmoid function.
• The cost function penalizes wrong predictions more heavily when the model is confident but wrong.
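The standard binary log loss is the average of −[y·log(p) + (1 − y)·log(1 − p)] over all samples. The sketch below uses a hypothetical function name, and the clipping constant `eps` is an assumption added to avoid log(0). It shows how the penalty grows when a prediction is confident but wrong.

```python
import math


def log_loss(y_true, probabilities, eps=1e-15):
    """Average binary cross-entropy between labels and predicted P(y = 1)."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep p away from exactly 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# The true label is 1 in every case; only the predicted probability changes.
for p in (0.9, 0.6, 0.1, 0.01):
    print(f"predicted {p}: loss = {log_loss([1], [p]):.3f}")
```

Predicting 0.01 for a sample whose true label is 1 costs far more than predicting 0.6, even though both differ from the ideal value of 1.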
Assumptions of Logistic Regression
• Understanding the assumptions behind logistic regression is important to ensure the model is applied correctly. The main assumptions are:
• Independent observations: Each data point is assumed to be independent of the others, meaning there should be no correlation or dependence between the input samples.

• Binary dependent variable: The dependent variable is assumed to be binary, meaning it can take only two values.
• Linear relationship between independent variables and log odds: The model assumes a linear relationship between the independent variables and the log odds of the dependent variable, which means the predictors affect the log odds in a linear way.

• No outliers: The dataset should not contain extreme outliers, as they can distort the estimation of the logistic regression coefficients.

• Large sample size: It requires a sufficiently large sample size to produce reliable and stable results.
Key Advantages:
• High Interpretability: Coefficients indicate the direction and strength of the relationship between features and the target.

• Efficiency: Fast to train and requires less computational power, making it suitable for large datasets.

• Probability Output: Provides well-calibrated probabilities rather than just class labels.

• Low Overfitting Potential: Less prone to overfitting in low-dimensional datasets, though it can occur if the number of features is large relative to observations.
Key Disadvantages:
• Linearity Assumption: Assumes a linear relationship between the independent variables and the log-odds of the dependent variable, so it performs poorly on complex, non-linear data.

• Sensitive to Outliers: Extreme data points can significantly skew results.

• Feature Limitation: Requires careful preprocessing, such as handling multicollinearity, and performs poorly if the independent variables carry little information about the target.

• Binary Restriction: Primarily designed for binary classification, requiring modifications for multiclass problems.
Common Applications
• Spam detection (spam/not spam).

• Medical diagnosis (malignant/benign tumor).

• Customer churn prediction (churn/not churn).

• Text classification (sentiment analysis).


An odds ratio (OR) is a statistic measuring the strength of association between an exposure and an outcome, calculated as the ratio of the odds of an event occurring in an exposed group to the odds in an unexposed group.

Interpretation:
• OR = 1: No association between exposure and outcome.

• OR > 1: Exposure is associated with higher odds of the outcome (risk factor).

• OR < 1: Exposure is associated with lower odds of the outcome (protective factor).
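An odds ratio can be computed directly from a 2×2 table of counts. The counts and the function name below are hypothetical, chosen only to illustrate the calculation.

```python
def odds_ratio(exposed_cases, exposed_noncases,
               unexposed_cases, unexposed_noncases):
    """Odds of the outcome in the exposed group over the unexposed group."""
    odds_exposed = exposed_cases / exposed_noncases
    odds_unexposed = unexposed_cases / unexposed_noncases
    return odds_exposed / odds_unexposed


# Hypothetical study: 30 of 100 exposed and 10 of 100 unexposed people
# develop the outcome.
result = odds_ratio(30, 70, 10, 90)
print(f"OR = {result:.2f}")  # above 1, so exposure looks like a risk factor
```

In a logistic regression with a single binary exposure variable, the exponential of its coefficient, exp(β₁), equals this odds ratio.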
Generalized Linear Model
• In a generalised linear model (GLM), each outcome Y of the dependent variable is assumed to be generated from a distribution in the exponential family (which includes the normal, binomial, Poisson and gamma distributions, among others).
• GLM thus expands the scenarios in which linear regression can apply by expanding the possibilities of the outcome variable.
• GLM parameters are estimated by maximum likelihood; for normal linear models this coincides with least squares.
• Logistic regression is a GLM that measures the relationship between the dependent variable and one or more independent variables (features) by estimating probabilities using the underlying logit function.
