Regression Models
2.1 Regression Analysis
2.2 Linear Regression
2.3 Simple Linear Regression
2.4 Multiple Linear Regression
2.5 Polynomial Regression
2.6 Backward Elimination
2.7 Evaluating Regression Models
Regression Analysis-
Regression analysis is a statistical technique used to model the relationship
between a dependent (target) variable and one or more independent
(predictor) variables. Essentially, it helps us understand how changes in the
independent variable(s) affect the dependent variable when other factors
remain constant. This method is useful for predicting continuous values such as
temperature, age, salary, and price.
To better understand regression analysis, let’s consider an example with a
farmer who plants different amounts of seeds each year and records the
corresponding crop yield. The table below shows the amount of seeds planted
by the farmer over the past five years and the resulting yield:
Seeds Planted (in kg) | Crop Yield (in kg)
50                    | 200
60                    | 250
70                    | 300
80                    | 340
90                    | 400
100                   | ??
In this scenario, the farmer wants to know the predicted yield if they plant 100
kg of seeds. By using regression analysis on the data, the farmer can estimate
the expected yield based on the pattern observed from past years.
Regression is a supervised learning method that helps identify the relationship
between variables, enabling us to predict a continuous outcome based on one
or more predictor variables. It's commonly used for forecasting, time series
analysis, prediction, and understanding cause-and-effect relationships
between variables.
In regression, a graph is plotted to fit the best line or curve through the data
points, allowing the model to make predictions. Simply put, "Regression draws
a line or curve that best fits the target-predictor data points so that the vertical
distances between the points and the line are minimized." The closeness of the
data points to the regression line indicates the model's accuracy in capturing
the relationship.
Here are some practical applications of regression:
Predicting rainfall based on temperature and other weather conditions
Analyzing market trends
Estimating the likelihood of road accidents due to reckless driving
Why Use Regression Analysis?
Regression analysis is valuable for predicting continuous variables, such as
weather conditions, sales forecasts, and market trends, making it essential in
various fields. Here are a few reasons for its use:
Regression helps estimate the relationship between the target and
independent variables.
It allows us to identify trends within data.
It aids in predicting real/continuous values.
By conducting regression analysis, we can identify which factors are
most and least significant and how they relate to one another.
Key Terminologies in Regression Analysis:
Dependent Variable: This is the main variable we want to predict or
understand, also known as the target variable.
Independent Variable: These are the factors that influence or predict
the dependent variable, also called predictor variables.
Outliers: These are data points with unusually high or low values
compared to others. Outliers can distort the results, so they are often
excluded.
Multicollinearity: This occurs when independent variables are highly
correlated with each other, making it difficult to determine which
variable has the most impact. Multicollinearity should be minimized in a
model. For example, in most cases the house size is likely to be highly
correlated with the number of bedrooms and the number of bathrooms.
Underfitting and Overfitting: Overfitting happens when a model
performs well on the training data but poorly on new data. Underfitting
occurs when the model doesn’t even perform well on the training data.
Types of Regression
Various types of regression are used in data science and machine learning.
Each type suits different scenarios, but at their core all regression methods
analyze the effect of the independent variables on the dependent variable.
Some important types of regression are listed below:
Linear Regression
Logistic Regression
Polynomial Regression
Support Vector Regression
Decision Tree Regression
Random Forest Regression
Ridge Regression
Lasso Regression
[Figure: Regression models. Linear regression models: Simple, Multiple,
Polynomial. Non-linear regression models: Decision Tree, Support Vector,
Random Forest.]
Linear Regression-
Linear regression is a simple yet widely used machine learning
algorithm.
It is a statistical method applied in predictive analysis.
Linear regression makes predictions for continuous numeric
variables, such as sales, salary, age, and product price.
This algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), which is why
it is called "linear" regression.
The model identifies how changes in the independent variable (x)
affect the dependent variable (y).
A linear regression model generates a sloped straight line that
represents the relationship between the variables.
Linear Regression Line
A straight line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can show
two types of relationship:
Positive Linear Relationship:
If the dependent variable (Y-axis) increases as the independent variable
(X-axis) increases, the relationship is called a positive linear
relationship.
Negative Linear Relationship:
If the dependent variable (Y-axis) decreases as the independent variable
(X-axis) increases, the relationship is called a negative linear
relationship.
The following forms of linear regression are covered in these notes:
1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Linear Regression
Simple Linear Regression-
Simple Linear Regression is a type of Regression algorithm that
models the relationship between a dependent variable and a single
independent variable.
The relationship shown by a Simple Linear Regression model is a
sloped straight line, which is why it is called Simple Linear Regression.
The key point in Simple Linear Regression is that the dependent
variable must be a continuous/real value. The independent variable,
however, can be continuous or categorical.
The relationship between the independent and dependent variable
is linear: the line of best fit through the data points is a straight line.
A Simple Linear Regression algorithm has two main objectives:
Model the relationship between the two variables, such as the
relationship between income and expenditure, or experience and
salary.
Forecast new observations, such as the weather according to
temperature, or the revenue of a company according to its investments
in a year.
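Mathematically, we can represent a simple linear regression as-
$$y = b_0 + b_1 x_1$$
Simple linear regression has only one independent variable (x1).
y: the dependent variable.
Intercept (b0): the predicted value of y when x1 is 0.
Coefficient (b1): the slope, i.e. how much y changes for a unit change in x1.
The values of the x1 and y variables come from the training dataset used
to fit the Linear Regression model.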
Example-
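As one possible illustration, the sketch below fits a simple linear regression to the farmer's seed and yield data from the Regression Analysis section, using the standard least-squares formulas for the slope and intercept. The function name `fit_simple_linear_regression` and the variable names are assumptions made for this example; they do not come from any library.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b1 = s_xy / s_xx
    # Intercept: the fitted line passes through (mean_x, mean_y).
    b0 = mean_y - b1 * mean_x
    return b0, b1


seeds_kg = [50, 60, 70, 80, 90]
yield_kg = [200, 250, 300, 340, 400]

b0, b1 = fit_simple_linear_regression(seeds_kg, yield_kg)
predicted_yield = b0 + b1 * 100  # estimate for 100 kg of seeds
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}, predicted yield = {predicted_yield:.1f} kg")
```

For this data the fitted line is approximately y = -45 + 4.9x, so the estimate for 100 kg of seeds is about 445 kg.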
Finding the best fit line:
In linear regression, our primary objective is to find the best-fit line, minimizing
the error between predicted and actual values. The best-fit line is the one with
the smallest error.
Different values for the weights or coefficients (b0 and b1) produce different
regression lines, so we need to determine the optimal values of b0 and b1 to
achieve the best fit.
Residuals: The residual is the difference between the actual value and the
predicted value. If the observed data points are far from the regression line,
the residuals will be large. Conversely, if the points are close to the regression
line, the residuals will be small.
Cost function-
The cost function is a way to measure how well our model's predictions match
the actual data. In simple terms, it tells us how "wrong" the model is.
The cost function calculates the error between the actual values and the
predicted values. In linear regression, we often use the Mean Squared Error
(MSE) as the cost function.
Cost Function Formula (Mean Squared Error)
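The formula for the Mean Squared Error (MSE) cost function is:
$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
where n is the number of data points, y_i is the actual value, and
\hat{y}_i is the predicted value for the i-th data point.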
The cost function (MSE) calculates the average squared error for all data
points. A lower MSE value means that the predictions are closer to the actual
values, indicating a better-fitting model.
Example: -
Imagine you are trying to predict how much a taxi ride will cost based on the
distance travelled. You have data that shows the actual taxi fares for different
distances, and you want to create a line (a model) that predicts the fare based
on the distance. The closer your line's predictions are to the actual fares, the
better your model is.
Suppose we have two data points:
For 2 miles, the actual fare was $5.
For 4 miles, the actual fare was $9.
And our model predicts:
For 2 miles, it predicts $6.
For 4 miles, it predicts $8.
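To finish the taxi example, the short sketch below computes the MSE for these two data points. The variable names are chosen for illustration only.

```python
actual_fares = [5, 9]     # actual fares for 2 and 4 miles
predicted_fares = [6, 8]  # fares predicted by the model

# Square each error so positive and negative errors do not cancel out.
squared_errors = [(a - p) ** 2 for a, p in zip(actual_fares, predicted_fares)]

# MSE is the average of the squared errors.
mse = sum(squared_errors) / len(squared_errors)
print(squared_errors, mse)  # [1, 1] 1.0
```

Each prediction is off by $1, so both squared errors are 1 and the MSE is (1 + 1) / 2 = 1.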
Gradient Descent
Gradient Descent is the method we use to find the best line (or best model) by
minimizing the cost function. In other words, it helps us to find the values for
the coefficients (like the slope and intercept in a line) that make the cost
function as low as possible.
Imagine you are standing on top of a hill and want to reach the bottom. You
take small steps downhill in the direction that lowers your altitude the most.
Similarly, gradient descent takes small steps to reduce the cost.
How Gradient Descent Works:
Example:-
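A minimal sketch of how gradient descent might work for simple linear regression with the MSE cost: start from initial guesses for b0 and b1, compute the gradient of the cost with respect to each coefficient, move each coefficient a small step (the learning rate) against its gradient, and repeat. The learning rate, the iteration count, the zero starting point, and the function name are all illustrative assumptions. The data is the pair of actual taxi fares from the cost function example ($5 for 2 miles, $9 for 4 miles).

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y = b0 + b1 * x by minimizing MSE with batch gradient descent."""
    b0, b1 = 0.0, 0.0  # initial guess (assumption: start from zero)
    n = len(xs)
    for _ in range(iterations):
        # Prediction errors under the current coefficients.
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        # Partial derivatives of MSE = (1/n) * sum(error^2).
        grad_b0 = (2 / n) * sum(errors)
        grad_b1 = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        # Step downhill, i.e. against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


miles = [2, 4]
fares = [5, 9]
b0, b1 = gradient_descent(miles, fares)
print(f"fare = {b0:.3f} + {b1:.3f} * miles")
```

With these settings the coefficients settle at roughly b0 = 1 and b1 = 2, the line that passes through both data points. A learning rate that is too large makes the steps overshoot and diverge, while one that is too small makes convergence slow.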
Multiple Linear Regression-
Multiple linear regression (MLR) is a statistical technique that uses
several explanatory variables to predict the outcome of a response
variable.
The goal of multiple linear regression is to model the linear
relationship between the independent variables and the dependent
(response) variable,
e.g. how rainfall, temperature, and amount of fertilizer added affect
crop growth.
For MLR, the dependent or target variable (Y) must be
continuous/real, but the predictor or independent variables may be
continuous or categorical.
Each feature variable should have a linear relationship with the
dependent variable.
MLR fits a regression plane (a hyperplane when there are more than two
predictors) through a multidimensional space of data points.
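Extending the simple linear regression notation (b0 for the intercept, b1 for a coefficient), the MLR model can be written as below. The symbol n for the number of predictors, and the mapping of x1, x2, x3 to the crop example, are assumptions made for illustration.

```latex
y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n

% Crop example with three predictors:
\text{crop growth} = b_0 + b_1 \cdot \text{rainfall}
                   + b_2 \cdot \text{temperature}
                   + b_3 \cdot \text{fertilizer}
```

Each coefficient describes how much y changes for a unit change in its predictor while the other predictors are held constant.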
Polynomial Regression-
Polynomial regression is a special case of linear regression where we
fit a polynomial equation on the data with a curvilinear relationship
between the target variable and the independent variables.
If your data points clearly do not follow a straight line, so that a
linear regression fits them poorly, polynomial regression might be a
better choice.
Polynomial regression, like linear regression, uses the relationship
between the variables x and y to find the best way to draw a curve
through the data points.
Need-
If we apply a linear model to a linear dataset, it gives us
a good result, as we have seen in Simple Linear Regression. But if we
apply the same model without any modification to a non-linear
dataset, the fit will be poor: the loss function
will increase, the error rate will be high, and accuracy will decrease.
So for such cases, where data points are arranged in a non-linear
fashion, we need the Polynomial Regression model.
Example: -
In the above image, we have taken a dataset which is arranged
non-linearly. If we try to cover it with a linear model, we can
clearly see that the line passes close to very few data points. A curve,
on the other hand, covers most of the data points, and that curve is
what the Polynomial model provides.
Hence, if the datasets are arranged in a non-linear fashion, then we
should use the Polynomial Regression model instead of Simple Linear
Regression.
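Polynomial regression counts as a special case of linear regression because the model y = b0 + b1*x + b2*x^2 + ... is still linear in the coefficients; only the inputs (x, x^2, ...) are transformed. The sketch below illustrates this on a small made-up dataset where y = x^2, comparing a straight-line fit with a degree-2 fit. The data, helper names, and the choice to solve the normal equations directly are all assumptions of this example.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [b0, b1, ..., b_degree]."""
    rows = [[x ** k for k in range(degree + 1)] for x in xs]  # 1, x, x^2, ...
    size = degree + 1
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(size)] for i in range(size)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(size)]
    return solve_linear_system(xtx, xty)


def mse(xs, ys, coeffs):
    """Mean squared error of the polynomial with the given coefficients."""
    preds = [sum(b * x ** k for k, b in enumerate(coeffs)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical non-linear data: y = x^2.
xs = [-2, -1, 0, 1, 2]
ys = [4, 1, 0, 1, 4]

line = fit_polynomial(xs, ys, degree=1)
curve = fit_polynomial(xs, ys, degree=2)
mse_line = mse(xs, ys, line)
mse_curve = mse(xs, ys, curve)
print(f"linear MSE = {mse_line:.2f}, quadratic MSE = {mse_curve:.2f}")
```

On this data the straight line can do no better than predicting the mean (MSE of 2.8), while the degree-2 model recovers y = x^2 and its error drops to zero.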
Backward Elimination-
Backward elimination is a technique used in regression analysis to simplify a
model by removing features (independent variables) that have little impact on
the target variable (dependent variable). The goal is to keep only the most
relevant features to make the model easier to interpret and potentially
improve its performance.
Steps in Backward Elimination
1. Start with All Features: Begin with a model that includes all available
features.
2. Fit the Model: Run a regression analysis to see the effect of each feature
on the target variable.
3. Check Significance Levels: Identify the feature with the highest p-value
(indicating the lowest statistical significance).
4. Remove the Feature: If this p-value is greater than a chosen threshold
(e.g., 0.05), remove that feature from the model.
5. Repeat: Refit the model without the removed feature, and check the p-
values again.
6. Stop: Repeat steps 3-5 until all remaining features have a p-value below
the threshold, indicating they all have a significant impact on the model.
Example Calculation
Let’s say we are predicting a student’s exam score based on the time they
spent studying, their sleep hours, and their participation in class.
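The steps above can be sketched as a short loop. In this sketch the p-values are made up purely for illustration and are looked up from a table; in practice they would come from refitting a regression model on the remaining features at each round (for instance, the fitted OLS results in statsmodels expose a `pvalues` attribute). The function names and feature names are hypothetical.

```python
def backward_elimination(features, get_p_values, threshold=0.05):
    """Drop the least significant feature until all p-values are <= threshold.

    get_p_values is a caller-supplied function (an assumption of this sketch)
    that refits the model on the given features and returns {feature: p_value}.
    """
    remaining = list(features)                       # step 1: all features
    while remaining:
        p_values = get_p_values(remaining)           # steps 2 and 5: (re)fit
        worst = max(remaining, key=lambda f: p_values[f])  # step 3
        if p_values[worst] <= threshold:             # step 6: all significant
            break
        remaining.remove(worst)                      # step 4: remove feature
    return remaining


# Made-up p-values for illustration only, keyed by the set of features used.
HYPOTHETICAL_P_VALUES = {
    frozenset({"study_hours", "sleep_hours", "participation"}): {
        "study_hours": 0.001, "sleep_hours": 0.20, "participation": 0.45,
    },
    frozenset({"study_hours", "sleep_hours"}): {
        "study_hours": 0.001, "sleep_hours": 0.03,
    },
}


def lookup_p_values(features):
    """Stand-in for refitting a model; returns the made-up p-values."""
    return HYPOTHETICAL_P_VALUES[frozenset(features)]


selected = backward_elimination(
    ["study_hours", "sleep_hours", "participation"], lookup_p_values
)
print(selected)  # ['study_hours', 'sleep_hours']
```

In the first round, participation has the highest p-value (0.45 > 0.05) and is removed. After the refit, the highest remaining p-value is 0.03, which is below the threshold, so the loop stops. Note that p-values can change after each removal, which is why the model is refitted every round.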
Key Points to Remember
Significance Threshold: Common thresholds are 0.05 or 0.01. This value
depends on the strictness of the analysis.
Interpretability: Removing insignificant features can make the model
easier to understand.
Model Performance: Simplifying a model may improve generalizability
and reduce overfitting.
Advantages and Disadvantages
Advantages: Reduces model complexity and can improve interpretability
and model efficiency.
Disadvantages: Only removes features based on statistical significance,
which doesn’t always align with real-world importance.
Backward elimination is a useful method when building a linear regression
model and aiming for a simpler, more interpretable model.
Evaluating Regression Models-
When we evaluate regression models, we measure how well the model
predicts the target variable (output) based on the input variables (features).
Different metrics help us understand how accurate or precise the predictions
are.
1. Mean Absolute Error (MAE)
MAE measures the average absolute difference between the actual and
predicted values. It shows how much, on average, the model's predictions
deviate from the actual values, without considering the direction of the error.
For example: -
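As a small illustration, the sketch below computes MAE for the taxi fares used in the cost function example (actual fares of $5 and $9, predicted fares of $6 and $8). The function name is an assumption for this example.

```python
def mean_absolute_error(actual, predicted):
    """MAE = (1/n) * sum of |actual - predicted| over all data points."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


mae = mean_absolute_error([5, 9], [6, 8])
print(mae)  # 1.0
```

One prediction is $1 too high and the other $1 too low; because MAE ignores the direction of each error, the average error is $1.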
2. Mean Squared Error (MSE)
Definition: MSE is the average of the squared differences between the actual
and predicted values. By squaring the errors, it gives more weight to larger
errors.
3. Root Mean Squared Error (RMSE)
Definition: RMSE is the square root of MSE. It’s also sensitive to large errors
but is more interpretable than MSE because it’s in the same unit as the target
variable.
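To see how squaring gives more weight to large errors, the sketch below compares MAE, MSE, and RMSE on a small made-up set of values in which one prediction is off by 4 while the others are off by at most 1. The numbers and variable names are hypothetical.

```python
import math

actual = [3, 5, 7, 9]
predicted = [2, 5, 8, 13]  # the last prediction is off by 4

errors = [a - p for a, p in zip(actual, predicted)]

mae = sum(abs(e) for e in errors) / len(errors)  # average absolute error
mse = sum(e ** 2 for e in errors) / len(errors)  # average squared error
rmse = math.sqrt(mse)                            # same unit as the target

print(f"MAE = {mae:.2f}, MSE = {mse:.2f}, RMSE = {rmse:.2f}")
```

Here MAE is 1.5 while RMSE is about 2.12: the single large error pulls RMSE up more than MAE, which is the sensitivity to large errors described above.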
4. R-squared
R-squared tells us how much of the variation in the target variable (the
variable we want to predict) is explained by the model.
Variance is a statistical measure that represents the spread or dispersion
of a set of data points around their mean (average).
The value of R-squared typically ranges from 0 to 1, where:
0 means the model does not explain any of the variability (it is no
better than always predicting the mean of the target).
1 means the model perfectly explains all the variability in the data.
In simple terms, the closer R-squared is to 1, the better the model's
predictions align with the actual values.
R-squared helps answer the question: How well is my model doing?
Specifically, it tells you how much of the data’s behavior or "pattern" is
captured by your model.
Formula-
$$R^2 = 1 - \frac{\text{SSE}}{\text{SST}}$$
Where:
Sum of Squared Errors (SSE) is the sum of the squared differences
between actual and predicted values. It measures the error in the
model.
Total Sum of Squares (SST) is the sum of squared differences between
each actual value and the average of all actual values. It represents the
total variation in the data.
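As an illustration, the sketch below computes R-squared as 1 - SSE / SST for the farmer's seed and yield data. The predictions come from the line y = -45 + 4.9x, which is the least-squares fit to that data; the function and variable names are assumptions for this example.

```python
def r_squared(actual, predicted):
    """R-squared = 1 - SSE / SST."""
    mean_actual = sum(actual) / len(actual)
    # SSE: squared differences between actual and predicted values.
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared differences between actual values and their mean.
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


seeds_kg = [50, 60, 70, 80, 90]
yield_kg = [200, 250, 300, 340, 400]
predicted_kg = [-45 + 4.9 * x for x in seeds_kg]

r2 = r_squared(yield_kg, predicted_kg)
print(f"R-squared = {r2:.4f}")
```

Here SSE is 70 and SST is 24080, so R-squared is about 0.997: the line explains nearly all of the variation in yield for this small dataset.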
5. Adjusted R-squared
What is Adjusted R-squared?
Adjusted R-squared is a modified version of R-squared that adjusts for the number of
predictors (independent variables) in the model. It’s especially helpful when
comparing models with different numbers of predictors.
Why Adjust R-squared?
When you add more predictors to a model, R-squared will generally increase, even if
the new predictors don’t add much useful information. This can give a false
impression that the model is improving just by adding more variables.
Adjusted R-squared corrects this by penalizing the model if the new predictors
don’t actually improve its performance. This means it only increases if the new
predictors genuinely add value.
Adjusted R-squared Formula
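The formula for Adjusted R-squared is:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$
where R^2 is the R-squared value, n is the number of data points, and
p is the number of predictors.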
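The sketch below applies the usual Adjusted R-squared formula, 1 - (1 - R^2)(n - 1) / (n - p - 1), to two hypothetical models fitted on the same 50 data points. The R-squared values and predictor counts are made up to illustrate the penalty; they are not results from a real dataset.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n data points and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


# Hypothetical models fitted on the same 50 observations.
adj_small = adjusted_r_squared(r2=0.900, n=50, p=5)   # 5 predictors
adj_large = adjusted_r_squared(r2=0.902, n=50, p=10)  # 10 predictors

print(f"5 predictors:  adjusted R-squared = {adj_small:.4f}")
print(f"10 predictors: adjusted R-squared = {adj_large:.4f}")
```

Although the larger model has a slightly higher R-squared (0.902 versus 0.900), its Adjusted R-squared is lower (about 0.877 versus 0.889), because the five extra predictors add little explanatory value relative to the penalty.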
Choosing the Right Metric-
MAE is good for understanding the average size of errors without
emphasizing large outliers.
MSE and RMSE are useful if you want to penalize larger errors, with
RMSE being more interpretable due to its units.
R-squared is helpful for understanding how much of the target variable’s
variability is explained by the model.
Adjusted R-squared is essential for model selection, especially when
comparing models with different numbers of predictors.