Mekdela Amba University
College of Natural and Computational Science
Department of Statistics
Regression Analysis (Stat 3041) Lecture Note
10/5/2025 Regression Analysis by Nurye S. 1
Chapter One
Introduction
Outline
Overview of Regression Analysis
Application of Regression Analysis
Steps in Regression Analysis
1.1. Overview of Regression Analysis
The term regression was introduced by Francis Galton (a renowned British biologist and
polymath, and a half-cousin of Charles Darwin) in his studies on heredity in the late 19th
century.
Galton studied the relationship between the heights of parents and their children.
Galton's work on hereditary height demonstrated that, in a stable population, the
average height of the offspring of parents with an extreme height
tends to be closer to the population mean than the parents' height.
In other words, the average height of children born to parents of a given height tended to move, or
regress, toward the average height in the population as a whole.
• Galton's findings are described as Galton's Law of Filial Regression, now more commonly
known as regression toward the mean (or regression to mediocrity, as he first termed it
in his 1886 paper, "Regression towards Mediocrity in Hereditary Stature").
• This law demonstrated a fundamental principle of quantitative heredity and statistics.
This "regressing toward mediocrity" gave these statistical methods the name regression
analysis.
Significance and contributions of Galton's law of filial regression
1. Statistical Basis of Heredity: The law provided a statistical explanation for how the overall
variability of height remains constant from generation to generation in a population,
despite the tendency for children to inherit their parents' characteristics.
2. Creation of Core Statistical Concepts: Galton's analysis of this phenomenon led directly to
the creation of two of the most important concepts in modern statistics:
• Correlation: He developed the concept of correlation to quantify the strength of the
relationship between two variables (mid-parent height and offspring height).
• Regression Analysis: The line representing the average trend of the offspring's height
toward the mean was named the regression line.
3. Nature of Inheritance: While Galton initially thought this "regression" was a biological
force pulling traits back to an ancestral average (which he called "reversion"), the modern
understanding shows it is a statistical phenomenon resulting from two primary factors
(Imperfect Heredity and Polygenic Traits).
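The statistical nature of regression toward the mean can be illustrated with a small simulation. The sketch below does not use Galton's data: it assumes a hypothetical population with mean height 170 cm, standard deviation 7 cm, and a parent-child correlation of 0.5, all chosen only for illustration. Children of tall parents come out taller than average but, on average, less extreme than their parents, while the overall spread of heights is the same in both generations.

```python
import random
from statistics import mean

random.seed(42)  # fixed seed so the run is reproducible

POP_MEAN = 170.0  # assumed population mean height (cm), illustrative only
POP_SD = 7.0      # assumed population standard deviation (cm)
RHO = 0.5         # assumed parent-child correlation

def simulate_pair():
    """Return (parent, child) heights with correlation RHO and equal variance."""
    z_parent = random.gauss(0.0, 1.0)
    noise = random.gauss(0.0, 1.0)
    # The child shares only part of the parent's deviation from the mean;
    # the scaling of the noise keeps the child's variance equal to 1.
    z_child = RHO * z_parent + (1.0 - RHO ** 2) ** 0.5 * noise
    return POP_MEAN + POP_SD * z_parent, POP_MEAN + POP_SD * z_child

pairs = [simulate_pair() for _ in range(50_000)]

# Keep only parents at least one standard deviation above the mean
tall = [(p, c) for p, c in pairs if p >= POP_MEAN + POP_SD]
tall_parent_mean = mean(p for p, _ in tall)
tall_child_mean = mean(c for _, c in tall)

print(f"mean height of tall parents:   {tall_parent_mean:.1f} cm")
print(f"mean height of their children: {tall_child_mean:.1f} cm")
```

No biological "pull" is coded anywhere in the simulation; the effect arises only because the correlation between parent and child is less than 1.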
Regression analysis is a statistical technique for investigating and modeling the
relationship between variables.
It is one of the most widely used techniques for analyzing multifactor data.
Regression analysis is also interesting theoretically because of its elegant underlying
mathematics (the minimization problem) and a well-developed statistical theory
(powerful statistical properties of the OLS estimators).
Its broad application and usefulness result from the conceptually logical process of using
an equation to express the relationship between a variable of interest (the response) and
a set of related predictor variables.
Example: We may wish to examine whether substance abuse is related to various
socioeconomic and demographic variables such as age, education, and income. The
relationship is expressed in the form of an equation or a model connecting the response or
dependent variable and one or more explanatory or predictor variables. The response
variable is substance abuse (measured by the number of packs of substance used by a
person per day) and the explanatory or predictor variables are age, education, and income.
Even though correlation and regression are related in the sense that both
deal with relationships among variables, the two statistical techniques differ
in several ways:
o Although correlation is a useful quantity for measuring the direction and
the strength of a linear relationship, it cannot be used for prediction
purposes; that is, we cannot use correlation alone to predict the value of one
variable given the value of the other.
o In correlation, both variables are assumed to be random, but in regression
the independent variable is assumed to be fixed.
o Furthermore, correlation measures only pairwise relationships. Regression
analysis, however, can be used to relate one or more response variables to one
or more predictor variables.
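The contrast between the two techniques can be seen on a small data set. The sketch below uses made-up values for hours studied and exam score (an assumption for illustration only): the correlation coefficient gives a single unit-free number, whereas the regression line gives an equation that can be evaluated at a new value of x.

```python
from statistics import mean

# Hypothetical data: hours studied (x) and exam score (y)
x = [2.0, 4.0, 5.0, 7.0, 8.0, 10.0]
y = [50.0, 58.0, 61.0, 70.0, 74.0, 83.0]

x_bar, y_bar = mean(x), mean(y)
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Correlation: one number summarizing direction and strength
r = sxy / (sxx * syy) ** 0.5

# Regression: an equation y = b0 + b1 * x fitted by least squares
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

print(f"r = {r:.3f}")
print(f"fitted line: y = {b0:.2f} + {b1:.2f} x")
print(f"predicted score for 6 hours of study: {b0 + b1 * 6:.1f}")
```

The two quantities are linked (the slope equals r multiplied by the ratio of the standard deviations of y and x), but only the fitted equation produces a predicted value.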
• Thus, regression analysis is an attractive extension of correlation analysis
because it postulates a model that can be used not only to measure the
direction and the strength of a relationship between the response and
predictor variables, but also to numerically describe that relationship.
• Regression takes the information obtained from the correlation analysis and
chi-square test of association and tries to develop an equation that can be used to
describe the relationship between the variables.
• The regression equation is used to predict the values of one variable
from the values of one or more other variables.
• The variable whose values are to be estimated is called the dependent variable
or outcome variable, while the variables whose values are used to estimate it are
called independent variables or explanatory variables.
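Different names for dependent and independent variables
The dependent variable is also called the response or outcome variable; the independent
variables are also called explanatory, predictor, or control variables.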
Types of regression models
Depending on the nature of the dependent and independent variables and the type of
relationship between them, we can define different types of regression models.
• A simple regression model contains only one predictor variable.
• A multiple regression model is used when there is more than one independent
variable in the model.
• Logistic regression is used when the response variable is categorical.
There are different types of logistic regression:
o Binary logistic regression is used when the dependent variable is dichotomous
(with only two levels or categories).
o Multinomial logistic regression is used when the dependent variable is nominal
with more than two categories.
o Ordinal logistic regression is used when the dependent variable is categorical with
ordinal levels.
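In symbols, the simple, multiple, and binary logistic models listed above are commonly written as follows. The notation (coefficients β, error term ε, success probability π) is standard textbook notation assumed here; it is not taken from this note.

```latex
\begin{align*}
\text{Simple linear regression:}\quad
  & Y = \beta_0 + \beta_1 X + \varepsilon \\
\text{Multiple linear regression:}\quad
  & Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon \\
\text{Binary logistic regression:}\quad
  & \log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p,
    \qquad \pi = P(Y = 1 \mid X_1, \ldots, X_p)
\end{align*}
```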
Objective of regression analysis
• The objective of regression analysis is to explain the variability in the dependent
variable by means of one or more independent or control variables.
We are primarily interested in the following issues:
o The form of the relationship between the dependent variable and the independent
variables, that is, what the equation that represents the relationship looks like.
o The direction and strength of the relationships, based on the sign and size of
the slope coefficients.
o Which explanatory variables are important and which are not. This is
judged from the size of the slope coefficients relative to their uncertainty (by
constructing confidence intervals and testing hypotheses about the model
parameters).
o Predicting a value or set of values of the dependent variable for a given set of
values of the explanatory variables.
1.2. Application of Regression Analysis
Regression analysis is a versatile statistical tool: its applications are numerous
and occur in almost every field that deals with data,
including engineering, the physical and chemical sciences, economics,
management, life and biological sciences, and social sciences.
• Its primary uses are to predict or forecast an outcome and to draw inferences about
relationships between variables (causal conclusions also depend on the study design).
• Regression analysis is used extensively in data mining and is a basic tool of
data science and analytics.
The main applications of regression analysis across various sectors are:
1. Business and Finance: Regression models are essential for forecasting and risk assessment
in the corporate and financial worlds, for example:
• Sales Forecasting: Predicting future product or service sales based on variables like
advertising spend, seasonality, pricing, and competitor actions.
• Financial Risk Management: Assessing the relationship between economic indicators
(e.g., interest rates, GDP growth) and a company's financial performance or stock
returns.
• Asset Pricing (CAPM): The Capital Asset Pricing Model (CAPM) relates the expected
return of an asset (dependent variable) to the market's return through the asset's
beta, a measure of systematic risk that is estimated by
regression.
2. Economics and Public Policy: Economists use regression to test theories and understand
the impact of policy changes, for example:
• Economic Forecasting: Predicting macroeconomic variables like GDP growth, inflation
rates, or unemployment based on various leading indicators.
• Policy Evaluation: Estimating the effect of a new tax, minimum wage increase, or
education program on outcomes like employment, consumption, or poverty levels.
• Labor Economics: Analyzing the relationship between years of education and wages while
controlling for other factors like experience or gender.
3. Healthcare and Biostatistics: Regression is crucial for clinical research, epidemiology, and
public health, for example:
• Dose-Response Modeling: Determining the relationship between the dosage of a drug
and its therapeutic effect or the severity of side effects.
• Risk Factor Analysis (Logistic and Cox Regression):
o Using logistic regression to identify risk factors that predict a binary health outcome (e.g., predicting
the probability of developing a disease given age, BMI, and smoking status).
o Using Cox proportional hazards regression (survival analysis) to model the time until an event
occurs (e.g., time to recovery, time to relapse) based on various patient characteristics.
• Medical Cost Prediction: Forecasting a patient's expected healthcare costs based on age and
pre-existing conditions.
4. Marketing and Customer Analytics: Companies use regression to optimize their marketing
efforts and understand customer behavior, for example:
• Marketing Mix Modeling (MMM): Determining the optimal allocation of the marketing
budget by quantifying the impact of different channels (TV, social media, print) on sales
or brand awareness.
• Customer Lifetime Value (CLV): Predicting the total revenue a company can expect from
a single customer throughout their relationship.
• Product Pricing: Modeling how changes in price affect the quantity of goods sold
(demand elasticity).
5. Environmental Science and Engineering: Regression helps model physical and
environmental processes, for example:
• Climate Modeling: Estimating the relationship between greenhouse gas concentrations (independent
variables) and changes in global surface temperature (dependent variable).
• Crop Yield Prediction: Predicting agricultural yields based on factors like rainfall, temperature, fertilizer
use, and soil quality.
• Energy Consumption: Forecasting electricity demand based on weather, time of day, and day of the week to
optimize power plant operations.
6. Social Sciences and Education: Regression is fundamental for testing hypotheses about human behavior
and social trends, for example:
• Educational Research: Studying the impact of class size, teacher-student ratios, or curriculum changes on
student performance (e.g., test scores).
• Psychology: Analyzing the relationship between personality traits or cognitive test scores and specific
behavioral outcomes.
• Sociology: Modeling the factors that influence community crime rates, migration patterns, or voting
behavior.
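As one concrete instance of the applications listed above, the CAPM beta mentioned under Business and Finance is the slope of a simple regression of an asset's excess return on the market's excess return. The sketch below is a minimal illustration under stated assumptions: the monthly return figures are invented, and "excess return" means the return minus a risk-free rate.

```python
from statistics import mean

# Hypothetical monthly excess returns (return minus risk-free rate), in percent
market_excess = [1.2, -0.8, 2.5, 0.3, -1.5, 1.9, 0.7, -0.4]
asset_excess = [1.9, -1.1, 3.6, 0.2, -2.4, 2.8, 1.3, -0.9]

m_bar = mean(market_excess)
a_bar = mean(asset_excess)

# Least-squares slope: cross-product sum divided by the market's sum of squares
s_mm = sum((m - m_bar) ** 2 for m in market_excess)
s_ma = sum((m - m_bar) * (a - a_bar)
           for m, a in zip(market_excess, asset_excess))

beta = s_ma / s_mm            # estimated systematic risk of the asset
alpha = a_bar - beta * m_bar  # intercept of the fitted line

print(f"beta  = {beta:.2f}")
print(f"alpha = {alpha:.2f}")
```

A beta above 1 indicates that, in these made-up data, the asset's excess return moves more than one-for-one with the market's.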
1.3. Steps in Regression Analysis
Regression analysis includes the following steps:
1. Identifying the statistical problem (statement of the problem)
2. Selection of potentially relevant variables
3. Data collection
4. Formulation of the model
5. Model validation and criticism
6. Using the chosen model(s) for prediction.
1. Statement of the problem
Regression analysis usually starts with a formulation of the problem. This is
the first and perhaps the most important step in regression analysis.
It includes the determination of the question(s) to be addressed by the
analysis.
It is important because an ill-defined problem or a misformulated question
can lead to wasted effort.
It can lead to the selection of an irrelevant set of variables or to a wrong choice
of the statistical method of analysis; a question that is not carefully
formulated can also lead to the wrong choice of model.
2. Selection of potentially relevant variables
The next step after the statement of the problem is to select a set of variables
that are thought by the experts in the area of study to explain or predict the
response variable.
The response variable is denoted by Y and the explanatory or predictor
variables are denoted by X1, X2, …, Xp, where p denotes the number of
predictor variables.
An example of a response variable is the price of a single-family house in a
given geographical area.
A possible set of relevant predictor variables in this case is: area of the lot,
area of the house, age of the house, number of bedrooms, number of
bathrooms, type of neighborhood, style of the house, amount of real estate
taxes, etc.
3. Data collection
The next step after the selection of potentially relevant variables is to collect
the data from the environment under study to be used in the analysis.
The collected data consist of observations on n subjects.
Each of these n observations consists of measurements for each of the
potentially relevant variables.
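The layout of the collected data can be pictured as a table with n rows (observations) and one column per variable. The sketch below assumes a few invented house records and hypothetical variable names (price, lot_area, house_area, age, bedrooms); it only shows how the response and the predictor measurements are separated before modeling.

```python
# Hypothetical records: each dict is one observation (one house)
houses = [
    {"price": 250_000, "lot_area": 600, "house_area": 180, "age": 12, "bedrooms": 3},
    {"price": 310_000, "lot_area": 750, "house_area": 220, "age": 5, "bedrooms": 4},
    {"price": 190_000, "lot_area": 450, "house_area": 140, "age": 30, "bedrooms": 2},
    {"price": 275_000, "lot_area": 640, "house_area": 200, "age": 18, "bedrooms": 3},
]

response = "price"
predictors = [name for name in houses[0] if name != response]

n = len(houses)      # number of observations
p = len(predictors)  # number of predictor variables

# Response values and the n-by-p table of predictor measurements
y = [h[response] for h in houses]
X = [[h[name] for name in predictors] for h in houses]

print(f"n = {n} observations, p = {p} predictors: {predictors}")
```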
4. Model formulation
The form of the model that is thought to relate the response variable to the
set of predictor variables can be specified initially by the experts in the area
of study based on their knowledge or their objective and/or subjective
judgments.
Scatter plots, correlation coefficients, and chi-square tests of association can also
be used to select the form of the model.
After the model has been defined and the data have been collected, the next
task is to estimate the parameters of the model based on the collected data.
This is also referred to as parameter estimation or model fitting.
The most commonly used method of estimation is called the least squares
method.
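For a simple linear regression, the least squares estimates have a closed form, and the "minimization" can be checked directly. The sketch below is an illustration on invented data (house area and price are assumed values): it computes the estimates and shows that moving the slope away from the estimate increases the sum of squared errors.

```python
from statistics import mean

# Hypothetical data: house area in square metres (x), price in thousands (y)
x = [80.0, 100.0, 120.0, 150.0, 200.0]
y = [120.0, 150.0, 170.0, 210.0, 290.0]

def sse(b0, b1):
    """Sum of squared errors for the line y = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# Closed-form least squares estimates for one predictor
x_bar, y_bar = mean(x), mean(y)
b1_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
b0_hat = y_bar - b1_hat * x_bar

print(f"least squares estimates: b0 = {b0_hat:.2f}, b1 = {b1_hat:.3f}")
print(f"SSE at the estimates:    {sse(b0_hat, b1_hat):.2f}")
print(f"SSE with a nudged slope: {sse(b0_hat, b1_hat + 0.05):.2f}")
```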
5. Model validation and criticism
The validity of a statistical method, such as regression analysis, depends on
certain assumptions.
Assumptions are usually made about the data and the model.
The accuracy of the analysis and of the conclusions derived from it
depends crucially on the validity of these assumptions.
The adequacy of the linear model can also be checked by using summary
statistics such as the coefficient of determination (R²) and tests such as ANOVA.
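The coefficient of determination can be computed from the same sums of squares that an ANOVA table reports. The sketch below assumes a small invented data set: it fits a simple linear regression, splits the total variability of y into an explained part and a residual part, and reports the proportion explained.

```python
from statistics import mean

# Hypothetical data, used only to illustrate the calculation
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

# Fit the simple linear regression by least squares
x_bar, y_bar = mean(x), mean(y)
b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
      / sum((xi - x_bar) ** 2 for xi in x))
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * xi for xi in x]

# ANOVA-style decomposition of the variability in y
sst = sum((yi - y_bar) ** 2 for yi in y)                # total
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual (unexplained)
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # explained by the model

r_squared = 1.0 - sse / sst

print(f"SST = {sst:.3f}, SSR = {ssr:.3f}, SSE = {sse:.3f}")
print(f"coefficient of determination = {r_squared:.4f}")
```

A value close to 1 means the fitted line accounts for most of the variability in y; it does not by itself confirm that the model assumptions hold.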
6. Using the model for prediction and other purposes
The regression equation may be used for several purposes.
It may be used to evaluate the importance of individual predictors, to
analyze the effects of policy that involves changing values of the predictor
variables, or to forecast values of the response variable for a given set of
predictors.
Exercise
For each of the following sets of variables, identify which variable can be
regarded as the response variable and which can be used as independent variables, and
state a possible lurking variable.
i. Income and expenditure of households
ii. Number of hours spent studying and the score obtained
iii. Height and weight of students
iv. Distance covered and fuel consumed by a car
v. Demand and supply
vi. The time to run a race and the temperature at the time of running
vii. The weight of a person, whether or not the person is a smoker, and whether or
not the person has lung cancer
viii. The height and weight of a child, his/her parents' height and weight, and the
sex and age of the child