Understanding Regression Models Basics

Regression Models

2.1 Regression Analysis
2.2 Linear Regression
2.3 Simple Linear Regression
2.4 Multiple Linear Regression
2.5 Polynomial Regression
2.6 Backward Elimination
2.7 Evaluating Regression Models

Regression Analysis-

Regression analysis is a statistical technique used to model the relationship
between a dependent (target) variable and one or more independent
(predictor) variables. Essentially, it helps us understand how changes in the
independent variable(s) affect the dependent variable when other factors
remain constant. This method is useful for predicting continuous values such as
temperature, age, salary, and price.

To better understand regression analysis, let’s consider an example with a
farmer who plants different amounts of seeds each year and records the
corresponding crop yield. The table below shows the amount of seeds planted
by the farmer over the past five years and the resulting yield:

Seeds Planted (kg)    Crop Yield (kg)
50                    200
60                    250
70                    300
80                    340
90                    400
100                   ??

In this scenario, the farmer wants to know the predicted yield if they plant 100
kg of seeds. By using regression analysis on the data, the farmer can estimate
the expected yield based on the pattern observed from past years.
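
As a rough illustration, the farmer's estimate can be sketched with NumPy by fitting a straight line to the five recorded years and reading off the value at 100 kg. The use of `np.polyfit` and the variable names are assumptions made for this sketch; the notes do not prescribe a tool.

```python
import numpy as np

# Data from the table above.
seeds_kg = np.array([50, 60, 70, 80, 90], dtype=float)
yield_kg = np.array([200, 250, 300, 340, 400], dtype=float)

# Fit a degree-1 polynomial (a straight line): yield = slope * seeds + intercept.
slope, intercept = np.polyfit(seeds_kg, yield_kg, deg=1)

# Estimate the yield for 100 kg of seeds.
predicted_yield = slope * 100 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={predicted_yield:.1f}")
# slope=4.90, intercept=-45.00, prediction=445.0
```

Under a straight-line assumption, the estimate for 100 kg of seeds is about 445 kg. Note that 100 kg lies outside the observed range, so this is an extrapolation.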

Regression is a supervised learning method that helps identify the relationship
between variables, enabling us to predict a continuous outcome based on one
or more predictor variables. It's commonly used for forecasting, time series
analysis, prediction, and understanding cause-and-effect relationships
between variables.

In regression, a graph is plotted to fit the best line or curve through the data
points, allowing the model to make predictions. Simply put, "Regression draws
a line or curve that best fits the target-predictor data points so that the vertical
distances between the points and the line are minimized." The closeness of the
data points to the regression line indicates the model's accuracy in capturing
the relationship.

Here are some practical applications of regression:

• Predicting rainfall based on temperature and other weather conditions
• Analyzing market trends
• Estimating the likelihood of road accidents due to reckless driving

Why Use Regression Analysis?

Regression analysis is valuable for predicting continuous variables, such as
weather conditions, sales forecasts, and market trends, making it essential in
various fields. Here are a few reasons for its use:

• Regression helps estimate the relationship between the target and
independent variables.
• It allows us to identify trends within data.
• It aids in predicting real/continuous values.
• By conducting regression analysis, we can understand the most
significant and least significant factors and how each influences the
others.

Key Terminologies in Regression Analysis:

• Dependent Variable: The main variable we want to predict or
understand, also known as the target variable.
• Independent Variable: A factor that influences or predicts the
dependent variable, also called a predictor variable.
• Outliers: Data points with unusually high or low values compared to
the others. Outliers can distort the results, so they are often excluded.
• Multicollinearity: This occurs when independent variables are highly
correlated with each other, making it difficult to determine which
variable has the most impact. Multicollinearity should be minimized.
For example, in most cases House Size is likely to be highly correlated
with Number of Bedrooms and Number of Bathrooms.
• Underfitting and Overfitting: Overfitting happens when a model
performs well on the training data but poorly on new data. Underfitting
occurs when the model doesn’t even perform well on the training data.

Types of Regression
Various types of regression are used in data science and machine learning.
Each type suits different scenarios, but at the core, all regression methods
analyze the effect of the independent variables on the dependent variable.
Some important types of regression are listed below:

• Linear Regression
• Logistic Regression
• Polynomial Regression
• Support Vector Regression
• Decision Tree Regression
• Random Forest Regression
• Ridge Regression
• Lasso Regression

[Figure: Regression models. Linear regression models: Simple, Multiple, Polynomial. Non-linear regression models: Decision Tree, Support Vector, Random Forest.]

Linear Regression-

• Linear regression is a simple yet widely used machine learning
algorithm.
• It is a statistical method applied in predictive analysis.
• Linear regression makes predictions for continuous numeric
variables, such as sales, salary, age, and product price.
• This algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), which is why
it is called "linear" regression.
• The model identifies how changes in the independent variable (x)
affect the dependent variable (y).
• A linear regression model generates a sloped straight line that
represents the relationship between the variables.

Linear Regression Line


A straight line showing the relationship between the dependent and
independent variables is called a regression line. A regression line can show
two types of relationship:

Positive Linear Relationship:
If the dependent variable (Y-axis) increases as the independent variable
(X-axis) increases, the relationship is called a positive linear relationship.

Negative Linear Relationship:
If the dependent variable (Y-axis) decreases as the independent variable
(X-axis) increases, the relationship is called a negative linear relationship.

Linear regression can be further divided into three types:

1. Simple Linear Regression
2. Multiple Linear Regression
3. Polynomial Regression

Simple Linear Regression-


• Simple Linear Regression is a type of regression algorithm that
models the relationship between a dependent variable and a single
independent variable.
• The relationship shown by a Simple Linear Regression model is a
sloped straight line, hence the name.
• The dependent variable must be a continuous/real value. However,
the independent variable can be continuous or categorical.
• The relationship between the independent and dependent variable
is linear: the line of best fit through the data points is a straight line.

A simple linear regression algorithm has two main objectives:

• Model the relationship between two variables, such as income and
expenditure, or experience and salary.
• Forecast new observations, such as the weather according to
temperature, or the revenue of a company according to its investments
in a year.

• Mathematically, we can represent a simple linear regression as:

$$ y = b_0 + b_1 x_1 $$

• Simple linear regression has only one independent variable (x1).
• Intercept (b0): the value of y when x1 is 0.
• Coefficient (b1): how much y changes for a unit change in x1.
• The observed values of x1 and y form the training data for the linear
regression model.

Example-
Finding the best fit line:

In linear regression, our primary objective is to find the best-fit line, minimizing
the error between predicted and actual values. The best-fit line is the one with
the smallest error.

Different values for the weights or coefficients (b0 and b1) produce different
regression lines, so we need to determine the optimal values of b0 and b1 to
achieve the best fit.

Residuals: The residual is the difference between the actual value and the
predicted value. If the observed data points are far from the regression line,
the residuals will be large. Conversely, if the points are close to the regression
line, the residuals will be small.
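
To make the best-fit line and its residuals concrete, here is a minimal sketch that reuses the farmer's data from the earlier table. It assumes the ordinary least-squares closed form (b1 = S_xy / S_xx, b0 = mean(y) − b1 · mean(x)), which these notes do not spell out; variable names are illustrative.

```python
# Data from the farmer example.
seeds_kg = [50.0, 60.0, 70.0, 80.0, 90.0]
yield_kg = [200.0, 250.0, 300.0, 340.0, 400.0]

n = len(seeds_kg)
mean_x = sum(seeds_kg) / n
mean_y = sum(yield_kg) / n

# Least-squares estimates: b1 = S_xy / S_xx, b0 = mean_y - b1 * mean_x.
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(seeds_kg, yield_kg))
s_xx = sum((x - mean_x) ** 2 for x in seeds_kg)
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual = actual value - predicted value, one per data point.
residuals = [y - (b0 + b1 * x) for x, y in zip(seeds_kg, yield_kg)]

print(round(b0, 2), round(b1, 2))  # -45.0 4.9
print([round(r, 2) for r in residuals])
```

The residuals come out to roughly 0, 1, 2, −7, and 4 kg. They sum to zero, which is a property of a least-squares line with an intercept.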
Cost function-

The cost function is a way to measure how well our model's predictions match
the actual data. In simple terms, it tells us how "wrong" the model is.

The cost function calculates the error between the actual values and the
predicted values. In linear regression, we often use the Mean Squared Error
(MSE) as the cost function.

Cost Function Formula (Mean Squared Error)


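The formula for the Mean Squared Error (MSE) cost function is:

$$ \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 $$

where n is the number of data points, y_i is the actual value, and \hat{y}_i is
the predicted value for the i-th data point.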

The cost function (MSE) calculates the average squared error for all data
points. A lower MSE value means that the predictions are closer to the actual
values, indicating a better-fitting model.

Example: -

Imagine you are trying to predict how much a taxi ride will cost based on the
distance travelled. You have data that shows the actual taxi fares for different
distances, and you want to create a line (a model) that predicts the fare based
on the distance. The closer your line's predictions are to the actual fares, the
better your model is.

Suppose we have two data points:

• For 2 miles, the actual fare was $5.
• For 4 miles, the actual fare was $9.

And our model predicts:

• For 2 miles, it predicts $6.
• For 4 miles, it predicts $8.
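
Carrying the arithmetic through for these two points gives an MSE of ((5 − 6)² + (9 − 8)²) / 2 = 1. The same calculation as a small Python sketch (variable names are illustrative):

```python
actual_fares = [5.0, 9.0]     # fares for 2 and 4 miles
predicted_fares = [6.0, 8.0]  # the model's predictions for the same rides

# Square each error, then average over all data points.
squared_errors = [(a - p) ** 2 for a, p in zip(actual_fares, predicted_fares)]
mse = sum(squared_errors) / len(squared_errors)
print(mse)  # 1.0
```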
Gradient Descent

Gradient Descent is the method we use to find the best line (or best model) by
minimizing the cost function. In other words, it helps us to find the values for
the coefficients (like the slope and intercept in a line) that make the cost
function as low as possible.

Imagine you are standing on top of a hill and want to reach the bottom. You
take small steps downhill in the direction that lowers your altitude the most.
Similarly, gradient descent takes small steps to reduce the cost.
How Gradient Descent Works:
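
The individual steps are not preserved in this text. As a hedged sketch, assuming the MSE cost function described above with coefficients b0 and b1, the usual update rule looks as follows. The learning rate symbol α is an assumption introduced here. At each iteration, both coefficients move a small step against the gradient of the cost, and the process repeats until the cost stops decreasing noticeably.

```latex
J(b_0, b_1) = \frac{1}{n} \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2

\frac{\partial J}{\partial b_0} = -\frac{2}{n} \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr), \qquad
\frac{\partial J}{\partial b_1} = -\frac{2}{n} \sum_{i=1}^{n} x_i \bigl(y_i - (b_0 + b_1 x_i)\bigr)

b_0 \leftarrow b_0 - \alpha \frac{\partial J}{\partial b_0}, \qquad
b_1 \leftarrow b_1 - \alpha \frac{\partial J}{\partial b_1}
```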

Example:-
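
The original example is not preserved here. As a minimal stand-in, the sketch below runs gradient descent on four made-up points that lie exactly on the line y = 1 + 2x, so the expected answer is known in advance. The data, the learning rate of 0.05, the iteration count, and the function name are all assumptions chosen for illustration.

```python
# Made-up data lying exactly on the line y = 1 + 2x (hypothetical).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]


def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y = b0 + b1 * x by minimizing the mean squared error."""
    n = len(xs)
    b0, b1 = 0.0, 0.0  # arbitrary starting point
    for _ in range(iterations):
        # Errors of the current line: actual minus predicted.
        errors = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to b0 and b1.
        grad_b0 = -2.0 / n * sum(errors)
        grad_b1 = -2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step downhill, against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


b0, b1 = gradient_descent(xs, ys)
print(round(b0, 4), round(b1, 4))  # 1.0 2.0
```

A learning rate that is too large makes the steps overshoot and diverge, while one that is too small makes convergence slow, so in practice the value is tuned to the data.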
Multiple Linear Regression-

• Multiple linear regression (MLR) is a statistical technique that uses
several explanatory variables to predict the outcome of a response
variable.
• The goal of MLR is to model the linear relationship between the
independent variables and the dependent (response) variable.
• Example: how rainfall, temperature, and amount of fertilizer added
affect crop growth.
• For MLR, the dependent or target variable (Y) must be
continuous/real, but the predictor or independent variables may be
continuous or categorical.
• Each feature variable should have a linear relationship with the
dependent variable.
• MLR fits a regression line (more precisely, a plane or hyperplane)
through a multidimensional space of data points.
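
A minimal sketch of MLR on the crop-growth example, assuming the model y = b0 + b1·x1 + b2·x2 + b3·x3 and solving it by least squares with NumPy. All numbers are made up, and the data are generated without noise on purpose so that the fitted coefficients can be checked against the ones used to create them.

```python
import numpy as np

# Hypothetical observations: rainfall (mm), temperature (deg C), fertilizer (kg).
X = np.array([
    [100.0, 20.0, 5.0],
    [120.0, 22.0, 6.0],
    [90.0, 25.0, 4.0],
    [110.0, 18.0, 7.0],
    [130.0, 24.0, 5.5],
    [95.0, 21.0, 6.5],
])

# Made-up "true" relationship used to generate noise-free crop growth values.
true_intercept = 10.0
true_coefficients = np.array([0.5, 1.5, 3.0])
y = true_intercept + X @ true_coefficients

# Add a column of ones so the model can learn the intercept b0.
design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution for [b0, b1, b2, b3].
coefficients, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coefficients, 3))  # intercept first, then one coefficient per feature
```

Each fitted coefficient describes how the target changes for a unit change in that feature while the other features are held fixed.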
Polynomial Regression-

• Polynomial regression is a special case of linear regression in which we
fit a polynomial equation to data with a curvilinear relationship
between the target variable and the independent variables.
• If your data points clearly will not fit a straight line, polynomial
regression might be a better choice.
• Polynomial regression, like linear regression, uses the relationship
between the variables x and y to find the best way to draw a line
through the data points; here the line is a curve.
Need-

• If we apply a linear model to a linear dataset, it gives a good result,
as we have seen in Simple Linear Regression. If we apply the same
model without any modification to a non-linear dataset, the result
will be poor: the loss will increase, the error rate will be high, and
accuracy will decrease.
• So for such cases, where data points are arranged in a non-linear
fashion, we need the Polynomial Regression model.

Example: -
• In the above image, we have taken a dataset which is arranged
non-linearly. If we try to cover it with a linear model, we can clearly
see that it hardly covers any data point. On the other hand, a curve,
which is what the Polynomial model produces, covers most of the
data points.
• Hence, if the datasets are arranged in a non-linear fashion, we
should use the Polynomial Regression model instead of Simple Linear
Regression.
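
The contrast between a straight line and a curve can also be sketched numerically. The example below uses made-up, exactly quadratic data and NumPy's `np.polyfit` (an assumed tool, not one named in these notes) to compare the mean squared error of a degree-1 fit with that of a degree-2 fit.

```python
import numpy as np

# Made-up data with a purely quadratic relationship: y = x^2.
x = np.arange(-3, 4, dtype=float)  # -3, -2, ..., 3
y = x ** 2


def fit_mse(degree):
    """Fit a polynomial of the given degree and return its MSE on the data."""
    coefficients = np.polyfit(x, y, deg=degree)
    predictions = np.polyval(coefficients, x)
    return float(np.mean((y - predictions) ** 2))


mse_line = fit_mse(1)   # straight line
mse_curve = fit_mse(2)  # quadratic curve
print(round(mse_line, 4), round(mse_curve, 4))  # 12.0 0.0
```

The model stays linear in its coefficients; only the inputs are extended with powers of x, which is why polynomial regression is treated as a special case of linear regression.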

Backward Elimination-

Backward elimination is a technique used in regression analysis to simplify a
model by removing features (independent variables) that have little impact on
the target variable (dependent variable). The goal is to keep only the most
relevant features to make the model easier to interpret and potentially
improve its performance.

Steps in Backward Elimination

1. Start with All Features: Begin with a model that includes all available
features.
2. Fit the Model: Run a regression analysis to see the effect of each feature
on the target variable.
3. Check Significance Levels: Identify the feature with the highest p-value
(indicating the lowest statistical significance).
4. Remove the Feature: If this p-value is greater than a chosen threshold
(e.g., 0.05), remove that feature from the model.
5. Repeat: Refit the model without the removed feature, and check the p-
values again.
6. Stop: Repeat steps 3-5 until all remaining features have a p-value below
the threshold, indicating they all have a significant impact on the model.

Example Calculation

Let’s say we are predicting a student’s exam score based on the time they
spent studying, their sleep hours, and their participation in class.
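
The original calculation is not preserved in this text. As a rough sketch of the steps above, the code below generates synthetic data in which the exam score depends on study and sleep hours but not on participation, then removes features one at a time. Everything here is an assumption for illustration: the data, the feature names, and in particular the p-values, which use a normal approximation to the t-distribution (reasonable for large samples) so that only NumPy and the standard library are needed.

```python
import math
import numpy as np


def approx_p_values(design, y):
    """Two-sided p-values for OLS coefficients (normal approximation)."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sigma2 = residuals @ residuals / (n - k)  # residual variance
    std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(design.T @ design)))
    z = coef / std_err
    return np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])


def backward_elimination(X, y, names, threshold=0.05):
    keep = list(range(X.shape[1]))  # step 1: start with all features
    while keep:
        # Step 2: fit the model (with an intercept column).
        design = np.column_stack([np.ones(len(y)), X[:, keep]])
        p_values = approx_p_values(design, y)[1:]  # skip the intercept
        # Step 3: find the least significant feature.
        worst = int(np.argmax(p_values))
        if p_values[worst] <= threshold:
            break  # step 6: everything left is significant
        keep.pop(worst)  # step 4: remove it, then refit (step 5)
    return [names[i] for i in keep]


# Synthetic data (made up): the score depends on study and sleep hours only.
rng = np.random.default_rng(0)
n = 200
study = rng.uniform(0, 10, n)
sleep = rng.uniform(4, 9, n)
participation = rng.uniform(0, 5, n)
score = 40 + 5 * study + 2 * sleep + rng.normal(0, 2, n)

X = np.column_stack([study, sleep, participation])
names = ["study_hours", "sleep_hours", "participation"]
selected = backward_elimination(X, score, names)
print(selected)
```

Because participation has no real effect in this synthetic data, it will usually be the feature that is dropped, although any single random sample can retain an irrelevant feature by chance.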
Key Points to Remember

• Significance Threshold: Common thresholds are 0.05 or 0.01. This value
depends on the strictness of the analysis.
• Interpretability: Removing insignificant features can make the model
easier to understand.
• Model Performance: Simplifying a model may improve generalizability
and reduce overfitting.

Advantages and Disadvantages

• Advantages: Reduces model complexity and can improve interpretability
and model efficiency.
• Disadvantages: Only removes features based on statistical significance,
which doesn’t always align with real-world importance.

Backward elimination is a useful method when building a linear regression
model and aiming for a simpler, more interpretable model.

Evaluating Regression Models-


When we evaluate regression models, we measure how well the model
predicts the target variable (output) based on the input variables (features).
Different metrics help us understand how accurate or precise the predictions
are.

1. Mean Absolute Error (MAE)

Definition: MAE is the average absolute difference between the actual and
predicted values. It shows how much, on average, the model's predictions
deviate from the actual values, without considering the direction of the error.

$$ \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i| $$

2. Mean Squared Error (MSE)

Definition: MSE is the average of the squared differences between the actual
and predicted values. By squaring the errors, it gives more weight to larger
errors.
3. Root Mean Squared Error (RMSE)

Definition: RMSE is the square root of MSE. It’s also sensitive to large errors
but is more interpretable than MSE because it’s in the same unit as the target
variable.
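
A small sketch comparing the three error metrics on the same predictions. The four actual and predicted values are made up for illustration.

```python
import math

actual = [3.0, 5.0, 7.0, 9.0]     # made-up actual values
predicted = [2.0, 5.0, 9.0, 9.0]  # made-up model predictions

errors = [a - p for a, p in zip(actual, predicted)]  # 1, 0, -2, 0
n = len(errors)

mae = sum(abs(e) for e in errors) / n  # average absolute error
mse = sum(e ** 2 for e in errors) / n  # average squared error
rmse = math.sqrt(mse)                  # back in the units of the target

print(mae, mse, round(rmse, 3))  # 0.75 1.25 1.118
```

The single error of 2 contributes two thirds of the MAE total but four fifths of the MSE total, which is how squaring gives more weight to larger errors.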

4. R-squared -

• R-squared (R²) tells us how much of the variation in the target variable
(the variable we want to predict) is explained by the model.
• Variance is a statistical measure that represents the spread or dispersion
of a set of data points around their mean (average).
• The value of R² typically ranges from 0 to 1, where:
o 0 means the model does not explain any of the variability (it’s no
better than always predicting the mean).
o 1 means the model perfectly explains all the variability in the data.

In simple terms, the closer R² is to 1, the better the model’s predictions align
with the actual values.

R-squared helps answer the question: How well is my model doing?
Specifically, it tells you how much of the data’s behavior or "pattern" is
captured by your model.

Formula-

$$ R^2 = 1 - \frac{SSE}{SST} $$

Where:

• Sum of Squared Errors (SSE) is the sum of the squared differences
between actual and predicted values. It measures the error in the
model.
• Total Sum of Squares (SST) is the sum of squared differences between
each actual value and the average of all actual values. It represents the
total variation in the data.
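
A short sketch of the calculation, using R² = 1 − SSE / SST with four made-up actual and predicted values:

```python
actual = [3.0, 5.0, 7.0, 9.0]     # made-up actual values
predicted = [2.0, 5.0, 9.0, 9.0]  # made-up model predictions

mean_actual = sum(actual) / len(actual)  # 6.0

# SSE: squared differences between actual and predicted values.
sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))  # 5.0
# SST: squared differences between actual values and their mean.
sst = sum((a - mean_actual) ** 2 for a in actual)  # 20.0

r_squared = 1 - sse / sst
print(r_squared)  # 0.75
```

Here the model explains 75% of the variation in the actual values.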
5. Adjusted R-squared

What is Adjusted R-squared?

Adjusted R-squared is a modified version of R² that adjusts for the number of
predictors (independent variables) in the model. It’s especially helpful when
comparing models with different numbers of predictors.

Why Adjust R-squared?

When you add more predictors to a model, R² never decreases and will
generally increase, even if the new predictors don’t add much useful
information. This can give a false impression that the model is improving just
by adding more variables.

Adjusted R-squared corrects this by penalizing the model if the new predictors
don’t actually improve its performance. This means it only increases if the new
predictors genuinely add value.

Adjusted R-squared Formula

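The formula for Adjusted R² is:

$$ R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1} $$

where n is the number of observations and p is the number of predictors.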
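
A small sketch of the adjustment in Python, assuming the standard definition 1 − (1 − R²)(n − 1) / (n − p − 1) with n observations and p predictors. The function name and the sample values are hypothetical.

```python
def adjusted_r_squared(r_squared, n_samples, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_samples - 1) / (n_samples - n_predictors - 1)


# Same R-squared and sample size, different numbers of predictors.
few = adjusted_r_squared(0.90, n_samples=50, n_predictors=3)
many = adjusted_r_squared(0.90, n_samples=50, n_predictors=10)
print(round(few, 4), round(many, 4))  # 0.8935 0.8744
```

With the same R² of 0.90, the model that needs ten predictors scores lower than the one that needs three, which is the penalty described above.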


Choosing the Right Metric-
• MAE is good for understanding the average size of errors without
emphasizing large outliers.
• MSE and RMSE are useful if you want to penalize larger errors, with
RMSE being more interpretable due to its units.
• R-squared is helpful for understanding how much of the target variable’s
variability is explained by the model.
• Adjusted R-squared is essential for model selection, especially when
comparing models with different numbers of predictors.
