Understanding Linear Regression Basics

Uploaded by

Apu

Linear Regression in Machine Learning

Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such
as sales, salary, age, product price, etc.

The linear regression algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name linear
regression. Because the relationship is linear, the model describes how the value
of the dependent variable changes with the value of the independent variable.

The linear regression model provides a sloped straight line representing the
relationship between the variables. Consider the below image:

Mathematically, we can represent a linear regression as:

$$y = a_0 + a_1 x + \varepsilon$$

Here,

y = Dependent variable (target variable)
x = Independent variable (predictor variable)
a0 = Intercept of the line (gives an additional degree of freedom)
a1 = Linear regression coefficient (scale factor applied to each input value)
ε = Random error

The observed values of x and y form the training dataset from which the linear
regression model is estimated.
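
As a minimal sketch of how a0 and a1 can be estimated from training data, the following uses the standard closed-form least-squares formulas for a single predictor. The function name `fit_simple_linear_regression` and the sample data are illustrative assumptions, not part of the original notes.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a0, a1) minimizing the squared error of y ~ a0 + a1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a1 = sxy / sxx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    a0 = mean_y - a1 * mean_x
    return a0, a1


# Hypothetical training data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a0, a1 = fit_simple_linear_regression(xs, ys)
print(a0, a1)
```

With noisy data the recovered coefficients would only approximate the underlying values, since the error term ε is not observed.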

Types of Linear Regression

Linear regression algorithms can be divided into two types:

o Simple Linear Regression:
If a single independent variable is used to predict the value of a numerical
dependent variable, the algorithm is called Simple Linear Regression.
o Multiple Linear Regression:
If more than one independent variable is used to predict the value of a
numerical dependent variable, the algorithm is called Multiple Linear
Regression.

Linear Regression Line

A straight line showing the relationship between the dependent and independent
variables is called a regression line. A regression line can show two types of
relationship:

o Positive Linear Relationship:
If the dependent variable (Y-axis) increases as the independent variable
(X-axis) increases, the relationship is called a positive linear relationship.
o Negative Linear Relationship:
If the dependent variable (Y-axis) decreases as the independent variable
(X-axis) increases, the relationship is called a negative linear relationship.
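
One rough way to tell the two cases apart in data is the sign of the sample covariance between x and y, which matches the sign of the least-squares slope. The helper name `relationship_direction` is a hypothetical choice for this sketch.

```python
def relationship_direction(xs, ys):
    """Classify the linear trend by the sign of the sample covariance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "none"


# Hypothetical data: y rises with x in the first call, falls in the second.
print(relationship_direction([1, 2, 3], [2, 4, 6]))
print(relationship_direction([1, 2, 3], [6, 4, 2]))
```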

Finding the best fit line:

When working with linear regression, our main goal is to find the best fit line,
meaning the line for which the error between predicted values and actual values
is minimized. The best fit line has the least error.

Different values of the weights or coefficients of the line (a0, a1) give
different regression lines, so we need to calculate the best values of a0 and a1
to find the best fit line. To do this we use a cost function.
Cost function:
o Different values of the coefficients (a0, a1) give different regression
lines; the cost function is used to estimate the coefficient values for the
best fit line.
o Minimizing the cost function optimizes the regression coefficients or
weights. The cost function measures how well a linear regression model is
performing.
o We can use the cost function to assess the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function
is also known as the hypothesis function.

For Linear Regression, we use the Mean Squared Error (MSE) cost function,
which is the average of the squared errors between the predicted values and
the actual values. For the above linear equation, MSE can be calculated as:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \bigl(y_i - (a_1 x_i + a_0)\bigr)^2$$

Where,

N = Total number of observations
yi = Actual value
(a1xi + a0) = Predicted value

Residuals: The difference between an actual value and the corresponding
predicted value is called a residual. If the observed points are far from the
regression line, the residuals are large and so the cost function is high. If
the points are close to the regression line, the residuals are small and so is
the cost function.
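
The residuals and the MSE (the average of the squared differences between each actual value and its prediction a1*x + a0) can be computed directly. This is a small sketch; the function names and the sample numbers are assumptions made for illustration.

```python
def residuals(xs, ys, a0, a1):
    """Actual value minus predicted value for every observation."""
    return [y - (a1 * x + a0) for x, y in zip(xs, ys)]


def mean_squared_error(xs, ys, a0, a1):
    """Average of the squared residuals."""
    res = residuals(xs, ys, a0, a1)
    return sum(r * r for r in res) / len(res)


# Hypothetical observations and a candidate line y = 0 + 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 7.0]
print(residuals(xs, ys, a0=0.0, a1=2.0))
print(mean_squared_error(xs, ys, a0=0.0, a1=2.0))
```

Only the third point is off the candidate line, so it alone contributes to the cost.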

Gradient Descent:
o Gradient descent minimizes the MSE by calculating the gradient of the cost
function.
o A regression model uses gradient descent to update the coefficients of the
line so that the cost function decreases.
o It starts from randomly selected coefficient values and then iteratively
updates them to reach the minimum of the cost function.
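
The update loop can be sketched as follows for the single-predictor MSE. This is an illustrative sketch, not a tuned implementation: the learning rate and iteration count are assumed values, and the coefficients start at zero instead of random values so that the run is reproducible.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iters=5000):
    """Minimize MSE for y ~ a0 + a1 * x by batch gradient descent."""
    n = len(xs)
    a0, a1 = 0.0, 0.0  # assumption: fixed start instead of a random one
    for _ in range(n_iters):
        # Prediction error for every observation under current coefficients.
        errors = [(a1 * x + a0) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE with respect to a0 and a1.
        grad_a0 = (2.0 / n) * sum(errors)
        grad_a1 = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient to reduce the cost.
        a0 -= learning_rate * grad_a0
        a1 -= learning_rate * grad_a1
    return a0, a1


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a0, a1 = gradient_descent(xs, ys)
print(a0, a1)
```

A learning rate that is too large makes the updates diverge, while one that is too small slows convergence, so in practice the value is chosen per dataset.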

Model Performance:
The goodness of fit describes how well the regression line fits the set of
observations. The process of finding the best model out of various models is
called optimization. It can be assessed with the method below:

1. R-squared method:

o R-squared is a statistical measure of the goodness of fit.
o It measures the strength of the relationship between the dependent and
independent variables on a scale of 0-100%.
o A high value of R-squared indicates a small difference between the predicted
values and actual values and hence represents a good model.
o It is also called the coefficient of determination, or the coefficient of
multiple determination for multiple regression.
o It can be calculated from the formula below:

$$R^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$
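
A short sketch of the computation, using the common equivalent form R-squared = 1 - SS_res / SS_tot (residual sum of squares over total sum of squares), which agrees with explained variation over total variation for a least-squares fit with an intercept. The function name and the sample values are assumptions.

```python
def r_squared(ys, predictions):
    """Coefficient of determination computed as 1 - SS_res / SS_tot."""
    mean_y = sum(ys) / len(ys)
    # Total variation of the actual values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    # Variation left unexplained by the predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return 1.0 - ss_res / ss_tot


# Hypothetical actual values and model predictions.
ys = [1.0, 2.0, 3.0, 4.0]
predictions = [1.1, 1.9, 3.2, 3.8]
score = r_squared(ys, predictions)
print(score)
```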

Assumptions of Linear Regression

Below are some important assumptions of Linear Regression. These are formal
checks to make while building a Linear Regression model, and they help ensure
the best possible result from the given dataset.

o Linear relationship between the features and target:
Linear regression assumes a linear relationship between the dependent and
independent variables.
o Small or no multicollinearity between the features:
Multicollinearity means high correlation between the independent variables.
With multicollinearity, it may be difficult to find the true relationship
between the predictors and the target variable; that is, it is difficult to
determine which predictor variable is affecting the target variable and which
is not. So, the model assumes either little or no multicollinearity between
the features or independent variables.
o Homoscedasticity assumption:
Homoscedasticity is the situation in which the variance of the error term is
the same for all values of the independent variables. With homoscedasticity,
a scatter plot of the residuals should show no clear pattern.
o Normal distribution of error terms:
Linear regression assumes that the error term follows a normal distribution.
If the error terms are not normally distributed, confidence intervals become
either too wide or too narrow, which makes inferences about the coefficients
unreliable.
Normality can be checked using a Q-Q plot. If the points fall along a straight
line without notable deviation, the errors are normally distributed.
o No autocorrelation:
The linear regression model assumes no autocorrelation in the error terms. Any
correlation in the error terms can drastically reduce the accuracy of the
model. Autocorrelation usually occurs when there is a dependency between
residual errors.
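
Two of these assumptions can be probed numerically. The sketch below computes the Pearson correlation between two features (values near +1 or -1 hint at multicollinearity) and the Durbin-Watson statistic of a residual sequence (values near 2 suggest no first-order autocorrelation). Neither statistic is named in the notes above; they are swapped in here as commonly used diagnostics, and the function names and inputs are hypothetical.

```python
import math


def pearson_correlation(u, v):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def durbin_watson(residuals):
    """Sum of squared successive differences over sum of squared residuals."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    den = sum(e * e for e in residuals)
    return num / den


# Hypothetical features: the second is exactly twice the first.
print(pearson_correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
# Hypothetical residuals that alternate in sign (negative autocorrelation).
print(durbin_watson([1.0, -1.0, 1.0, -1.0]))
```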

Linear Regression:
It is the basic and most commonly used type of predictive analysis. It is a
statistical approach to modeling the relationship between a dependent
variable and a given set of independent variables.

These are of two types:

1. Simple Linear Regression
2. Multiple Linear Regression

Let's discuss Multiple Linear Regression using Python.
Multiple Linear Regression attempts to model the relationship between
two or more features and a response by fitting a linear equation to
observed data. The steps to perform multiple linear regression are
almost the same as those for simple linear regression; the difference lies
in the evaluation. We can use it to find out which factor has the highest
impact on the predicted output and how different variables relate to
each other.

Here:
$$Y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n$$
Y = dependent variable and x1, x2, x3, ..., xn = multiple independent
variables
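
As a rough sketch of fitting the coefficients b0, ..., bn, the code below solves the normal equations (X^T X) b = X^T y with a small Gaussian elimination routine in plain Python. The function names and the data are assumptions for illustration, and the solver assumes the features are not perfectly collinear.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_multiple_linear_regression(rows, ys):
    """Return [b0, b1, ..., bn] from the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1.0 carries b0
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from Y = 1 + 2*x1 + 3*x2.
rows = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
ys = [1.0, 3.0, 4.0, 6.0, 8.0]
coefficients = fit_multiple_linear_regression(rows, ys)
print(coefficients)
```

Comparing the magnitudes of the fitted coefficients only indicates relative impact when the features are on comparable scales.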
Assumptions of the Regression Model:
o Linearity: The relationship between dependent and independent
variables should be linear.
o Homoscedasticity: The variance of the errors should be constant.
o Multivariate normality: Multiple regression assumes that the
residuals are normally distributed.
o Lack of multicollinearity: It is assumed that there is little or no
multicollinearity in the data.

Logistic regression:

Logistic Regression in Machine Learning

o Logistic regression is one of the most popular Machine Learning algorithms and
comes under the Supervised Learning technique. It is used for predicting a categorical
dependent variable from a given set of independent variables.
o Logistic regression predicts the output of a categorical dependent variable, so the
outcome must be a categorical or discrete value. It can be Yes or No, 0 or 1, True or
False, etc., but instead of giving the exact value 0 or 1, it gives probabilistic
values that lie between 0 and 1.
o Logistic Regression is very similar to Linear Regression except in how it is
used. Linear Regression is used for solving regression problems, whereas Logistic
Regression is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic
function, whose output is bounded by the two limiting values 0 and 1.
o The curve from the logistic function indicates the likelihood of something, such as
whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
o Logistic Regression is a significant machine learning algorithm because it can
provide probabilities and classify new data using continuous and discrete datasets.
o Logistic Regression can be used to classify observations using different types of data
and can help determine the most effective variables for the classification. The
image below shows the logistic function:

Note: Logistic regression uses the same predictive modeling concept as regression, which is
why it is called logistic regression, but it is used to classify samples, so it falls under
classification algorithms.

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map predicted values to
probabilities.
o It maps any real value to a value between 0 and 1.
o The output of logistic regression must lie between 0 and 1 and cannot go beyond
these limits, so it forms an "S"-shaped curve. This S-shaped curve is called the sigmoid
function or the logistic function.
o In logistic regression, we use a threshold value to decide between the classes
0 and 1: values above the threshold are assigned to 1, and values below the
threshold are assigned to 0.
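
The mapping and the threshold rule can be sketched in a few lines, using the standard definition sigmoid(z) = 1 / (1 + e^(-z)). The default threshold of 0.5 and the choice to assign a probability exactly at the threshold to class 1 are assumptions of this sketch.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(probability, threshold=0.5):
    """Turn a probability into a class label using a threshold."""
    return 1 if probability >= threshold else 0


for z in (-5.0, 0.0, 5.0):
    p = sigmoid(z)
    print(z, p, classify(p))
```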

Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not exhibit multicollinearity.

Logistic Regression Equation:

The Logistic regression equation can be obtained from the Linear Regression
equation. The mathematical steps to get the Logistic Regression equation are given
below:

o We know the equation of the straight line can be written as:

$$y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

o In Logistic Regression y can be between 0 and 1 only, so let's divide y by (1-y),
which gives 0 for y = 0 and infinity for y = 1:

$$\frac{y}{1-y}$$

o But we need a range from -[infinity] to +[infinity], so we take the logarithm, and the
equation becomes:

$$\log\left[\frac{y}{1-y}\right] = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$$

The above equation is the final equation for Logistic Regression.
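
To make the transformation concrete, the sketch below computes the odds y / (1 - y) and their logarithm (the log-odds), then inverts the log-odds to recover the original probability. The function names are hypothetical, and y is assumed to lie strictly between 0 and 1.

```python
import math


def odds(y):
    """Odds of the event: ranges from 0 to +infinity for y in (0, 1)."""
    return y / (1.0 - y)


def log_odds(y):
    """Logarithm of the odds: ranges from -infinity to +infinity."""
    return math.log(odds(y))


def probability_from_log_odds(z):
    """Invert the log-odds to get the probability back."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


y = 0.8
z = log_odds(y)
print(odds(y), z, probability_from_log_odds(z))
```

In a fitted model, z is the linear combination of the inputs, so the inversion step is what turns the linear predictor into a probability.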

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three
types:

o Binomial: In binomial Logistic regression, the dependent variable can take only two
possible values, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial Logistic regression, the dependent variable can take 3 or more
unordered values, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal Logistic regression, the dependent variable can take 3 or more
ordered values, such as "Low", "Medium", or "High".

Steps in Logistic Regression: To implement Logistic Regression using
Python, we will use the same steps as in the previous Regression topics.
Below are the steps:

o Data pre-processing step
o Fitting Logistic Regression to the training set
o Predicting the test result
o Testing the accuracy of the result (creation of a confusion matrix)
o Visualizing the test set result
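
The first four steps can be sketched end to end in plain Python for a single feature. This is an illustrative sketch under stated assumptions: the tiny dataset is made up and already split into training and test sets, the model is fit by gradient descent on the average log-loss with assumed hyperparameters, and the visualization step is omitted.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(xs, labels, learning_rate=0.1, n_iters=1000):
    """Fit weight w and bias b by gradient descent on the average log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_iters):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, labels)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


def predict(xs, w, b, threshold=0.5):
    """Class labels obtained by thresholding the predicted probabilities."""
    return [1 if sigmoid(w * x + b) >= threshold else 0 for x in xs]


def confusion_matrix(actual, predicted):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    matrix = [[0, 0], [0, 0]]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix


# Step 1: pre-processed data (hypothetical, already split into train/test).
train_x = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
test_x = [-1.2, -0.3, 0.4, 1.7]
test_y = [0, 0, 1, 1]

# Step 2: fit the model to the training set.
w, b = fit_logistic_regression(train_x, train_y)
# Step 3: predict the test set.
test_pred = predict(test_x, w, b)
# Step 4: evaluate with a confusion matrix and accuracy.
cm = confusion_matrix(test_y, test_pred)
accuracy = (cm[0][0] + cm[1][1]) / len(test_y)
print(cm, accuracy)
```

The rows of the confusion matrix are the actual classes and the columns are the predicted classes, so the diagonal holds the correct predictions.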
Performance Metrics for Regression
Confusion Matrix

Common questions

Why is the concept of multicollinearity significant in both linear and logistic regression models?
Multicollinearity refers to high correlation between independent variables, which complicates the determination of the true relationship between predictors and the target variable. In linear regression, multicollinearity can make it difficult to ascertain individual impacts of predictors on the dependent variable. In logistic regression, multicollinearity violates assumptions and can lead to inaccurate model estimates, making it challenging to understand which variables significantly contribute to the outcome.

What is the primary difference in the purpose of linear regression and logistic regression in machine learning?
Linear regression is used for predictive analysis with continuous numeric variables, aiming to model the relationship between a dependent and independent variable using a linear approach. In contrast, logistic regression is utilized for classification tasks, predicting the probability of a categorical outcome and fitting the data with a logistic, "S"-shaped curve.

Why is it said that logistic regression, although regression in nature, is applied to classification problems?
Logistic regression, despite its name, is used in classification problems because it models the probability that a given input point belongs to a particular category. It passes the linear equation through a logistic function to handle binary outcomes, allowing it to predict categories rather than continuous variables. The term 'regression' arises from its use of a regression-like approach in estimating probabilities.

How does the R-squared metric quantify the goodness of fit in a linear regression model, and what does a high R-squared value indicate?
R-squared is a statistical metric that quantifies the strength of the relationship between the dependent and independent variables as a percentage. It evaluates the proportion of variance in the dependent variable that is predictable from the independent variable. A high R-squared value, close to 1 or 100%, indicates a good model fit, signifying that the model explains most of the variability in the response data.

How does logistic regression determine the threshold for classification, and what role does this threshold play in prediction?
In logistic regression, a threshold value is set to determine classification outcomes, typically a probability of 0.5. Predictions above this threshold are classified as one category (e.g., 1), while those below are classified as the other category (e.g., 0). This threshold converts continuous predicted probabilities into categorical predictions, thus facilitating classification.

What mathematical transformation distinguishes the logistic regression equation from linear regression?
Logistic regression transforms the linear regression equation so that its output can be read as a probability. It forms the odds y / (1 - y) and applies a logarithm, so that the log-odds equal the linear equation. Inverting this transformation keeps predictions between 0 and 1, allowing the output to represent probabilities for classification tasks.

How does the cost function in linear regression guide the optimization process of finding the best-fit line?
In linear regression, the cost function typically used is the Mean Squared Error (MSE), which calculates the average of the squared differences between the predicted and actual values. The purpose of the cost function is to optimize the linear regression coefficients by minimizing the error terms. Gradient descent is then employed to iteratively update the model coefficients to achieve the lowest possible MSE, guiding the system toward the best-fit line.

What are the implications of residuals on the performance of a linear regression model?
Residuals represent the differences between observed and predicted values. High residuals indicate a poor model fit as the data points are far from the regression line, leading to a high cost function value. Conversely, small residuals suggest the data points closely follow the regression line, indicating an accurate model with minimal errors and thus a good fit.

Discuss the impact of assumptions violations, such as homoscedasticity, on linear regression model's accuracy and reliability.
Homoscedasticity assumes constant variance of the error terms across all levels of the independent variables. Violation of this assumption is called heteroscedasticity, and it leads to inefficient estimates and unreliable hypothesis testing. This violation impairs model reliability, as the standard errors of the coefficients are underestimated or overestimated, which can lead to incorrect interpretations and predictions.

Explain how gradient descent contributes to the optimization process in linear regression models.
Gradient descent is an optimization technique used in linear regression to find the minimum of the cost function by iteratively adjusting the model coefficients. By computing the gradient (or slope) of the cost function, it updates the coefficients in the direction that reduces the error, eventually converging to the minimum cost function value, which corresponds to the best-fit line.
