University of Khartoum
Faculty of Mathematical Sciences and Informatics
Descriptive Statistics (S1013)
Correlation and Regression
February 5, 2026
Week Outline
Correlation
Scatter Plots
Correlation Coefficient
Regression
Lecture objectives
Draw a scatter plot:
Visualize the relationship between two variables on a coordinate
plane.
Compute the correlation coefficient:
Measure the strength and direction of the linear relationship
between variables.
Regression:
Develop an understanding of regression analysis, including how to
interpret the regression equation and use it for predictive
purposes.
Correlation
Definitions
Correlation is a statistical method used to determine whether a
relationship exists between two or more variables.
There are two types of relationships:
simple relationships
multiple relationships.
Simple Relationship
In a simple relationship, also called simple regression, there are
two variables:
1. Independent variable, also called an explanatory variable or a
predictor variable
2. Dependent variable, also called a response variable
Example
A manager may wish to see whether the number of years the
salespeople have been working for the company has anything to do
with the amount of sales they make.
This type of study involves a simple relationship, since there are only
two variables: years of experience and amount of sales.
Simple Relationship
Simple relationships can be either positive or negative.
Positive relationship: As one variable increases, the other also
increases.
Example: A person’s height and weight tend to increase together.
Negative relationship: As one variable increases, the other
decreases.
Example: In older adults, as age increases, physical strength may
decrease.
Multiple Relationship
In a multiple relationship, also called multiple regression, two or
more independent variables are used to predict one dependent
variable.
Example:
An educator may want to study a student’s academic success based
on several predictors, such as:
the number of hours spent studying,
the student’s GPA, and
the student’s high school background.
This type of study involves multiple variables and examines their
collective influence on a single outcome.
Studying Relationships Between Variables
Once we identify that a relationship may exist between two variables,
we can study it in different ways:
Visually, using a scatter plot, to observe general patterns.
Numerically, by calculating the correlation coefficient to
measure the strength and direction of the relationship.
Analytically, using regression analysis to describe the
relationship with an equation and make predictions.
These methods allow us to assess whether a relationship exists, how
strong it is, and how we can model it for decision-making.
Scatter Plots
Definition
A scatter plot is a graph of ordered pairs (x, y) of numbers consisting
of the independent variable x and the dependent variable y. It is used
to visually determine whether there is a relationship between the two
variables.
Example 1:
The following table shows the number of hours students studied and
their test scores:
Hours (x)   Score (y)
2           65
4           70
6           75
8           80
10          85
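As a rough stand-in for a drawn scatter plot, the sketch below places each (x, y) pair from Example 1 on a small character grid. The function name `text_scatter` and the grid size are illustrative choices, not part of the lecture, and the code assumes the x values and the y values are not all identical (otherwise the scaling would divide by zero).

```python
def text_scatter(xs, ys, width=17, height=9):
    """Return a list of text rows with one '*' per (x, y) pair."""
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each value to a column/row index inside the grid.
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y - y_min) / (y_max - y_min) * (height - 1))
        # Row 0 is printed first, so flip the vertical axis.
        grid[height - 1 - row][col] = "*"
    return ["".join(line) for line in grid]


# Example 1 data: hours studied (x) and test score (y).
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 80, 85]

plot_rows = text_scatter(hours, scores)
for line in plot_rows:
    print("|" + line)
print("+" + "-" * 17)
```

The points rise from the lower left to the upper right, which is the visual pattern of a positive relationship.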
Example 2:
Problem:
Car Rental Companies: Construct a scatter plot for the data shown
for car rental companies in the United States for a recent year.
Company   Cars (in ten thousands)   Revenue (in billions)
A         63.0                      7.0
B         29.0                      3.9
C         20.8                      2.1
D         19.1                      2.8
E         13.4                      1.4
F         8.5                       1.5
Example 3:
Problem:
Absences and Final Grades: Construct a scatter plot for the data
obtained in a study on the number of absences and the final grades of
seven randomly selected students from a statistics class. The data are
shown here.
Student   Number of absences (x)   Final grade (y, %)
A         6                        82
B         2                        86
C         15                       43
D         9                        74
E         12                       58
F         5                        90
G         8                        78
Correlation Coefficient
Definition
The correlation coefficient r is a numerical measure that describes
the strength and direction of a linear relationship between two
variables.
There are several ways to compute the value of the correlation
coefficient. One method is to use the formula shown below:
Formula for the Correlation Coefficient r
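\[
r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}
\]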
Where:
n is the number of data pairs.
x represents values of the independent variable.
y represents values of the dependent variable.
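The formula can be translated almost term by term into code. The following is a minimal sketch in plain Python; the function name `correlation_coefficient` is an illustrative choice, and the data are the hours/scores pairs from Example 1 and the absences/grades pairs from Example 3.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Compute r with the computational formula from the slide."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Example 1: hours studied and test scores.
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 80, 85]
r_hours = correlation_coefficient(hours, scores)

# Example 3: absences and final grades.
absences = [6, 2, 15, 9, 12, 5, 8]
grades = [82, 86, 43, 74, 58, 90, 78]
r_absences = correlation_coefficient(absences, grades)

print(round(r_hours, 3), round(r_absences, 3))
```

The Example 1 points lie exactly on a straight line, so r comes out as 1, while the absences data give a value near −0.944, a strong negative linear relationship.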
Correlation Coefficient
The symbol for the sample correlation coefficient is r.
The symbol for the population correlation coefficient is ρ (Greek
letter rho).
The range of the correlation coefficient is from −1 to +1.
If there is a strong positive linear relationship between the
variables, the value of r will be close to +1.
If there is a strong negative linear relationship between the
variables, the value of r will be close to −1.
When there is no linear relationship or only a weak relationship
between the variables, the value of r will be close to 0.
Assumptions for the Correlation Coefficient
1. The sample is a random sample.
2. The data pairs fall approximately on a straight line and are
measured at the interval or ratio level.
3. The variables have a joint normal distribution. (This means that
given any specific value of x, the y values are normally
distributed; and given any specific value of y, the x values are
normally distributed.)
Example
Problem
For the previous Car Rental Companies example, compute the
correlation coefficient.
Solution
Company   x (Cars)   y (Revenue)   xy       x²        y²
A         63.0       7.0           441.00   3969.00   49.00
B         29.0       3.9           113.10   841.00    15.21
C         20.8       2.1           43.68    432.64    4.41
D         19.1       2.8           53.48    364.81    7.84
E         13.4       1.4           18.76    179.56    1.96
F         8.5        1.5           12.75    72.25     2.25
Totals:
\[
\sum x = 153.8, \quad \sum y = 18.7, \quad \sum xy = 682.77, \quad \sum x^2 = 5859.26, \quad \sum y^2 = 80.67
\]
Example
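\[
r = \frac{n\left(\sum xy\right) - \left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[n\left(\sum x^2\right) - \left(\sum x\right)^2\right]\left[n\left(\sum y^2\right) - \left(\sum y\right)^2\right]}}
\]
\[
r = \frac{6(682.77) - (153.8)(18.7)}{\sqrt{\left[6(5859.26) - (153.8)^2\right]\left[6(80.67) - (18.7)^2\right]}} = 0.982
\]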
Since r = 0.982 is close to +1, the correlation coefficient suggests a
strong positive linear relationship between the number of cars a
rental agency has and its annual revenue.
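The hand calculation can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the data are the six car rental companies from the table in the solution.

```python
from math import sqrt

cars = [63.0, 29.0, 20.8, 19.1, 13.4, 8.5]   # x, in ten thousands
revenue = [7.0, 3.9, 2.1, 2.8, 1.4, 1.5]     # y, in billions

n = len(cars)

# Column totals from the solution table.
sum_x = sum(cars)
sum_y = sum(revenue)
sum_xy = sum(x * y for x, y in zip(cars, revenue))
sum_x2 = sum(x * x for x in cars)
sum_y2 = sum(y * y for y in revenue)

# Pieces of the formula for r.
numerator = n * sum_xy - sum_x * sum_y   # about 1220.56
x_term = n * sum_x2 - sum_x ** 2         # about 11501.12
y_term = n * sum_y2 - sum_y ** 2         # about 134.33

r = numerator / sqrt(x_term * y_term)
print(f"r = {r:.3f}")
```

Rounded to three decimal places, the result agrees with the value 0.982 obtained by hand.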
Regression
Definition
Regression is a statistical method used to describe the nature of the
relationship between variables, that is, whether it is positive or
negative, linear or nonlinear.
Regression Line: It is the line of best fit.
The purpose of the regression line is to enable the researcher to
observe trends and make predictions based on the data.
Determination of the Regression Line Equation:
The equation of the regression line is:
y = a + bx
where:
a is the y-intercept
b is the slope of the line
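The slide does not state how a and b are obtained. Assuming the line of best fit is the usual least-squares line, the standard formulas are a = (Σy·Σx² − Σx·Σxy) / (nΣx² − (Σx)²) and b = (nΣxy − Σx·Σy) / (nΣx² − (Σx)²). The sketch below applies them to the hours/scores data from Example 1; the function name `fit_line` is an illustrative choice.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b*x.

    Assumption: "line of best fit" means the least-squares line.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # y-intercept
    b = (n * sum_xy - sum_x * sum_y) / denominator       # slope
    return a, b


# Example 1 data: hours studied (x) and test score (y).
hours = [2, 4, 6, 8, 10]
scores = [65, 70, 75, 80, 85]

a, b = fit_line(hours, scores)
print(a, b)  # prints 60.0 2.5
```

For these data the fitted line is y = 60 + 2.5x, which passes through every point because the scores rise by exactly 5 for every 2 additional hours.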
Example
Problem:
A farming cooperative in Sudan wants to estimate groundnut yield (in
tons per hectare) based on the number of irrigation days during the
growing season.
From past records, the estimated regression line is:
y = 0.35x + 1.2
where:
x = number of irrigation days
y = groundnut yield in tons per hectare
Question: What is the expected yield if a field is irrigated on 20 days?
Solution:
x = 20 ⇒ y = 0.35(20) + 1.2 = 8.2
Interpretation: The predicted yield is 8.2 tons per hectare.
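Using a regression line for prediction amounts to substituting a value of x into the equation. A minimal sketch, where the function name `predict_yield` is an illustrative choice and the intercept and slope are taken from the estimated line in this example:

```python
def predict_yield(irrigation_days, intercept=1.2, slope=0.35):
    """Predicted groundnut yield (tons per hectare) from y = a + b*x."""
    return intercept + slope * irrigation_days


predicted = predict_yield(20)
print(round(predicted, 2))  # prints 8.2
```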