Week 7 Notes - 2025
Durham University
Contents
1 Basic Concepts
1.1 Some basic concepts
1.2 Branches of statistics
1.3 What’s the big idea?
1.4 Notation
3 Descriptive Statistics
3.1 Measures of Central Tendency
3.2 Measure of Variation (Dispersion)
3.3 Shape of a distribution: Skewness
3.4 Shape of a distribution: Kurtosis
3.5 Modality
3.6 Symmetry
3.7 Empirical Rule
3.8 Measure of Position: z-score
3.9 Percentiles and Quartiles
3.10 Five-number summary & Boxplots
3.11 Outliers & Extreme values
3.12 Descriptive statistics for qualitative variables
3.13 Example: Accounting final exam grades
4 Probability
4.1 The Basic Idea
4.2 Outcomes and Events
4.3 Probabilities As Proportions
4.4 Relative Frequencies
4.5 Independence
4.6 Mutual Exclusivity
4.7 Complements
4.8 An Introduction To Expectation
4.9 Conditional Probability
4.10 Bayes Rule
5 Random Variables
5.1 What is a Variable?
5.2 Making Variables Random
5.3 Probability Functions
5.4 Introduction to Discrete and Continuous Probability Functions
5.5 Discrete Random Variables
5.6 Continuous Random Variables
5.7 Cumulative distribution function
5.8 Characteristics of probability distributions
5.9 Some useful continuous distributions
5.10 Joint distributions
5.11 Conditional probability (density) function, PDF
5.12 Properties of Expected values and Variance
5.13 Covariance
5.14 Correlation Coefficient
5.15 Conditional expectation and conditional variance
6 Sampling
6.1 Sampling
6.2 Random versus non-random sampling
6.3 Simple random sampling
6.4 Central limit theorem
6.5 Sampling distribution of the sample mean x̄
6.6 Sampling distribution of the sample proportion
6.7 Sampling distribution of the sample variance
6.8 Example
7 Estimation
7.1 Estimation
7.2 Point estimation
7.3 Interval estimation
7.4 Confidence intervals for the population mean
7.5 Interpreting confidence intervals
7.6 Confidence interval for a population proportion
7.7 Confidence interval for a population variance
7.8 Example
10 Nonparametric Tests
10.1 Wilcoxon signed-rank test (Paired samples)
10.2 Example
10.3 Wilcoxon rank-sum test (Independent samples)
10.4 Example
11 Correlation
11.1 Correlation and Causation
17 Multiple Linear Regression: Fit and Inference
17.1 Coefficient of determination, R² and adjusted R²
17.2 The residual standard error, se
17.3 Inferences about a particular predictor variable
17.4 How useful is the multiple regression model?
17.5 Used cars example continued
17.6 Regression in R
1 Basic Concepts
1.1 Some basic concepts
• Data consist of information coming from observations, counts, measurements, or responses.
• Statistics is the science of collecting, organising, analysing, and interpreting data in order to make
decisions.
• A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
Populations may be finite or infinite. If a population of values consists of a fixed number of these values,
the population is said to be finite. If, on the other hand, a population consists of an endless succession
of values, the population is an infinite one.
• A sample is a subset of a population.
• A parameter is a numerical description of a population characteristic.
• A statistic is a numerical description of a sample characteristic.
1.4 Notation
Below is a table containing commonly-used notation for some of the parameters and statistics we will deal
with most often.
                      Population    Sample
Size                  N             n

                      Parameter     Statistic
Mean                  µ             x̄
Variance              σ²            s²
Standard deviation    σ             s
Proportion            π             π̂
Correlation           ρ             r
2.4 Types of data (Econometrics)
• Cross-sectional data: Data on different entities (e.g. workers, consumers, firms, governmental units)
for a single time period. For example, data on test scores in different school districts.
• Time series data: Data for a single entity (e.g. person, firm, country) collected at multiple time
periods. For example, the rate of inflation or of unemployment for a country over the last 10 years.
• Panel data: Data for multiple entities in which each entity is observed at two or more time periods.
For example, the daily prices of a number of stocks over two years.
3 Descriptive Statistics
3.1 Measures of Central Tendency
Measures of central tendency provide numerical information about a ‘typical’ observation in the data.
• The mean (also called the average) of a data set is the sum of the data values divided by the number
of observations.
Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• The median is the middle observation when the data set is sorted in ascending order. If the data set
has an even number of observations, the median is the mean of the two middle observations.
• The mode is the data value that occurs with the greatest frequency. If no entry is repeated, the data
set has no mode. If two (more than two) values occur with the same greatest frequency, each value is a
mode and the data set is called bimodal (multimodal).
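Base R has no built-in function for the statistical mode (`mode()` reports an object's storage type instead). The following is a minimal sketch of one way to find it; the helper name `sample_modes` is hypothetical, and the grades are those used in the accounting exam example later in these notes.

```r
# Hypothetical helper: return every value that occurs with the greatest frequency.
# An empty vector means no value is repeated, i.e. the data set has no mode.
sample_modes <- function(x) {
  counts <- table(x)                      # frequency of each distinct value
  if (max(counts) == 1) {
    return(numeric(0))                    # no entry is repeated: no mode
  }
  as.numeric(names(counts)[counts == max(counts)])
}

grades <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)

mean(grades)                     # sum of the values divided by n
median(grades)                   # mean of the two middle values, since n is even
sample_modes(grades)             # 79 is the only repeated value
sample_modes(c(1, 1, 2, 2, 3))   # two values tie: a bimodal data set
sample_modes(c(4, 5, 6))         # nothing is repeated: no mode
```

Returning an empty vector for "no mode" is just one possible convention; it keeps the return type numeric in every case.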
Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• A shortcut formula for the sample variance is given by
Sample variance: $s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$
• The standard deviation (s) of a data set is the square root of the sample variance.
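As a quick check that the definition and the shortcut formula agree, the sketch below computes the sample variance both ways and compares the results with R's built-in `var()`. The variable names are illustrative, and the data are the exam grades used in the worked example later in these notes.

```r
grades <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)
n <- length(grades)
x_bar <- mean(grades)

# Definition: squared deviations from the mean, divided by n - 1
s2_definition <- sum((grades - x_bar)^2) / (n - 1)

# Shortcut: sum of squares minus n times the squared mean, divided by n - 1
s2_shortcut <- (sum(grades^2) - n * x_bar^2) / (n - 1)

c(definition = s2_definition, shortcut = s2_shortcut, built_in = var(grades))

# The standard deviation is the square root of the variance
sqrt(s2_definition)
sd(grades)
```

The shortcut needs only one pass over the data, but it subtracts two large numbers, so for data with a large mean and small spread the definition is usually the numerically safer choice.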
3.5 Modality
The number of highest points in a distribution gives us the modality. Note that in a situation in which a
distribution has two or more “humps” which aren’t equally high, we still describe the shape of the graph as
“bimodal” or “multimodal”, even though only the (equal) highest point of the curve represents the actual
mode(s) of the data.
3.6 Symmetry
Below are three common forms of symmetrical distribution. Note that if a symmetrical distribution is also
unimodal, the median, mode and mean will all be equal.
$z = \frac{x - \bar{x}}{s}$
As the z-score has no unit, it can be used to compare values from different data sets or to compare values
within the same data set. The mean of z-scores is 0 and the standard deviation is 1.
Note that s > 0 so if z is negative, the corresponding x-value is below the mean. If z is positive, the
corresponding x-value is above the mean. And if z = 0, the corresponding x-value is equal to the mean.
70% of people are shorter than the red figure. 30% of people are taller than the blue figure. The 70th
percentile for height therefore lies between the blue and red figures.
• The three quartiles (Q1, Q2 and Q3) divide a data set into quarters (four equal parts). As the diagram below shows, the
four equal parts do not necessarily have equal lengths; it is the number of data points which is the
same within each part.
• The interquartile range (IQR) of a data set is the difference between the third and first quartiles
(IQR = Q3 − Q1).
• The IQR is a measure of variation that gives you an idea of how much the middle 50% of the data
varies.
The box represents the interquartile range (IQR), which contains the middle 50% of values.
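The quartiles and the IQR can be computed directly in R. The sketch below uses the exam grades from the worked example later in these notes. Note that `quantile()` defaults to an interpolation method (its "type 7"), so other textbooks or software may give slightly different quartiles for small data sets.

```r
grades <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)

# Q1, Q2 (the median) and Q3
q <- quantile(grades, probs = c(0.25, 0.5, 0.75))
q

# IQR by hand: Q3 - Q1 = 79 - 66.25 = 12.75
iqr_by_hand <- unname(q["75%"] - q["25%"])
iqr_by_hand

# Built-in equivalent, using the same default quantile method
IQR(grades)
```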
3.12 Descriptive statistics for qualitative variables
• Frequency distributions are tabular or graphical presentations of data that show each category for a
variable and the frequency of the category’s occurrence in the data set. Percentages for each category
are often reported instead of, or in addition to, the frequencies.
• The mode can be used in this case as a measure of central tendency.
• Bar charts and pie charts are often used to display the results of categorical or qualitative variables. Pie
charts can become cluttered and difficult to read if variables have many categories. Pie charts should
always include information on the total number of data points.
• Bar charts can also be used to group together numerical values. Doing so loses the original values,
however. An alternative is a stem-and-leaf plot, which makes the bars out of data values themselves.
• A dot plot can be used to quickly compare numerical values between multiple categories (clustered bar
charts can also do this).
• Next we arrange the data from the lowest to the highest grade: 51, 63, 65, 70, 73, 77, 79, 79, 85,
88. The median grade is 75, which is located midway between the 5th and 6th ordered data points:
(73 + 77)/2 = 75.
• The mode is 79 since it appears twice and all other grades appeared only once.
• The range is 88 − 51 = 37.
• The sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{9} \left( (88 - 73)^2 + \dots + (77 - 73)^2 \right) = 123.78$
• The sample standard deviation: $s = \sqrt{123.78} = 11.13$
• The coefficient of variation: CV = s/x̄ = 11.13/73 = 0.1525
• Empirical rule: the empirical rule states that, for normally distributed data, approximately 68% of the
data falls within one standard deviation of the mean. In our example, this suggests that about 68% of the
grades fall between 61.87 and 84.13 (73 ± 11.12555).
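The empirical rule is an approximation, so it is worth checking how well it describes a small sample. The following sketch (variable names are illustrative) counts how many of the ten grades actually fall within one standard deviation of the mean.

```r
grades <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)

lower <- mean(grades) - sd(grades)
upper <- mean(grades) + sd(grades)
round(c(lower = lower, upper = upper), 2)

# TRUE for each grade inside the interval [lower, upper]
within_one_sd <- grades >= lower & grades <= upper
grades[within_one_sd]

# Proportion of grades inside the interval (7 of the 10 here)
mean(within_one_sd)
```

With only ten observations the proportion can only move in steps of 0.1, so 0.7 is as close to 68% as this sample can get.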
# R codes for "Accounting final exam grades" example
# Data example
grades<-c(88,51,63,85,79,65,79,70,73,77)
program<-factor(c("MA","MA","MBA","MBA","MBA","MBA","MBA","MSc","MSc","MSc"))
# no of observations
length(grades)
## [1] 10
# Mean, Median, Variance, standard deviation, range, quantile
mean(grades)
## [1] 73
median(grades)
## [1] 75
var(grades)
## [1] 123.7778
sd(grades)
## [1] 11.12555
range(grades)
## [1] 51 88
quantile(grades,probs=c(0,0.25,0.5,0.75,1))
## 0% 25% 50% 75% 100%
## 51.00 66.25 75.00 79.00 88.00
# Summary
summary(grades)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 51.00 66.25 75.00 73.00 79.00 88.00
# Calculate z-score
(grades-mean(grades))/sd(grades)
## [1] 1.3482484 -1.9774310 -0.8988323 1.0785987 0.5392994 -0.7190658
## [7] 0.5392994 -0.2696497 0.0000000 0.3595329
scale(grades)
## [,1]
## [1,] 1.3482484
## [2,] -1.9774310
## [3,] -0.8988323
## [4,] 1.0785987
## [5,] 0.5392994
## [6,] -0.7190658
## [7,] 0.5392994
## [8,] -0.2696497
## [9,] 0.0000000
## [10,] 0.3595329
## attr(,"scaled:center")
## [1] 73
## attr(,"scaled:scale")
## [1] 11.12555
# Histograms present frequencies for values grouped into intervals.
hist(grades,xlab="grades", main="Histogram of grades")
# Boxplot
boxplot(grades,xlab="grades")
[Figure: boxplot of grades]
In a stem-and-leaf plot, each score on a variable is divided into two parts: the stem gives the leading digits
and the leaf shows the trailing digits.
The accounting final exam grades (arranged from the lowest to the largest grade) are: 51, 63, 65, 70, 73, 77,
79, 79, 85, 88.
# Stem-and-leaf plot.
stem(grades)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 1
## 6 | 35
## 7 | 03799
## 8 | 58
A dot plot is a simple graph to show the relative positions of the data points.
# Dot plot: one colour per programme (MA red, MBA blue, MSc orange)
col2<-as.character(factor(program,labels=c("red","blue","orange")))
dotchart(grades, labels=factor(1:10), groups=program, pch=16, col=col2, xlab="Grades",xlim=c(45,100))
# Frequency table
table(program)
## program
## MA MBA MSc
## 2 5 3
# Pie and Bar charts
pie(table(program))
barplot(table(program))
4 Probability
4.1 The Basic Idea
Probability is a measurement, a way we express how likely something is to happen. What makes probability
so interesting as a measurement is a) it has no units, and b) there are lots of different ways to conceive of
and calculate what a probability could and “should” be.
Despite there being many different interpretations and philosophies of probability (a topic we will mostly be
steering clear of in this module, but which will become very important next term if you take any module
relating to the idea of Bayesian statistics), there are a few basic ideas everyone agrees on, which are:
1. An impossible event has probability 0, and no probability can ever be lower than 0.
2. A certain event has probability 1, and no probability can ever be higher than 1.
3. An event which is neither impossible nor certain has a probability between 0 and 1.
4. When comparing multiple events, an event with a higher probability is more likely to happen than an
event with a lower probability. Events with the same probability are equally likely to happen.
One common way to think about probabilities is as functions - we input an event, and our output is a
number telling us how likely that event is to happen. We’ll talk more about this idea of probabilities as
functions later in these notes.
Even this very quick, simple and broad summary throws up more questions, though, starting with one we’ll
answer in the next subsection: what is an event?
in different ways. As long as each of them is fully clear on their choice of outcomes and outcome space,
this is entirely fine. As I say in the videos, though, if in doubt, define “relevant” as widely as you can. If
information collected turns out to not be useful, you can disregard it. If information not collected turns out
to be relevant, then you could be in real trouble!
Example I am about to roll a six-sided dice, and I want to express the probability of each one of the six
numbers on the dice being the one that lands face-up. One way to define the outcomes here would be each of
the six possible numbers, 1 to 6. If those are my choice of outcomes, my outcome space would be each of
those numbers, {1, 2, 3, 4, 5, 6}.
Alternatively, if I wished to, I could define the outcomes as “an even number” and “an odd number”, in which
case I might write my outcome space as, say, {O, E}. We normally wouldn’t do this, because we can break
each of those outcomes up into three separate results, but if we considered the specific number we roll to be
irrelevant for our purposes, then nothing here about {O, E} as an outcome space is in any sense incorrect
or invalid.
Choosing your outcomes to be, say, “at least three” and “no more than three” would be incorrect, however,
because if you roll a three, that would mean more than one outcome has occurred at the same time, which
is not permissible. Similarly, you couldn’t choose “less than three” and “more than three” as your only
outcomes, as rolling a three would mean no outcome had occurred.
Example I am going to play Bizzfin, a game in which each turn I roll a fair six-sided dice and take a card
from a standard western deck of playing cards, containing 52 cards. I gain or lose points in the game depending
on the combination of dice and card. What is the outcome space for the game?
In this case, I have two things to keep track of, the dice score, of which there are 6 possible values, and the card
drawn, of which there are 52 possible values. The outcome space is every possible combination of dice score and
card, of which there are 6 × 52 = 312. I won’t list them all here, but the outcomes could be expressed as paired
values such as (1, 7C), representing a score of 1 on the dice and drawing the Seven of Clubs. The outcome
space could then be represented as, say, {(1, 1C), (2, 1C), . . . , (6, 1C), (1, 2C), . . . , (6, KC), (1, 1D), . . . , (6, AS)}.
There are other ways we could represent all this; what matters is making our intent clear, and being consistent
in whatever approach we’re using.
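Listing all 312 outcomes by hand is tedious, but R can build the outcome space for us. The sketch below is one possible representation; the rank and suit labels (for example "AC" for the Ace of Clubs) are assumptions made for illustration rather than notation fixed by the game.

```r
dice  <- 1:6
ranks <- c("A", 2:10, "J", "Q", "K")   # 13 ranks; labels are illustrative
suits <- c("C", "D", "H", "S")         # clubs, diamonds, hearts, spades

# Every rank paired with every suit: "AC", "2C", ..., "KS"
cards <- as.vector(outer(ranks, suits, paste0))
length(cards)

# Every dice score paired with every card: one row per outcome
outcome_space <- expand.grid(dice = dice, card = cards, stringsAsFactors = FALSE)
nrow(outcome_space)    # 6 * 52 = 312
head(outcome_space)
```

Each row of `outcome_space` is exactly one outcome, and no row is repeated, which matches the requirement that one and only one outcome occurs on each turn.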
We can now define an event. An event is either an outcome, or a combination of outcomes. For our previous
example, if {1, 2, 3, 4, 5, 6} is our outcome space, then any element or combination of elements from that set
is an event. “1” is an event, “4 or more” is an event, “not prime” is an event, and so on.
Note that this means all outcomes are events; indeed, we call them simple events. Not all events are outcomes,
though; if an event comprises more than one outcome, it is called a compound event.
One last event we need to consider is the empty event. This is the event that no outcome occurs. This
is impossible, as we must always have one outcome occur. As a result, the empty event has probability 0.
It might seem odd to insist on this idea of an impossible event, which doesn’t contain any outcomes and
therefore has probability 0. It’s very useful in terms of the set theory that mathematicians use to make
probability work, though, which is why it’s important to consider it.
$P(\text{Outcome}) = \frac{\text{Number of times outcome happens}}{\text{Number of times outcome could have happened}}$
Example I am about to roll a four-sided dice, which I know to be fair (each number is equally likely to be
rolled). I define my outcome space as {1, 2, 3, 4}. What is the probability I roll a 4?
Under the definition above, the probability of rolling a 4 is the number of times a 4 is rolled on a fair
four-sided dice, divided by the number of times the dice is rolled, because a four showing is something that
could happen on each roll.
Because the dice is fair, I will roll a 4 one fourth of the time I roll the dice.
It’s important to note in the above example that I used a theoretical property of the dice - it is “fair”. If I
actually roll the dice multiple times, there is no guarantee I will get a 4 precisely one fourth of the time -
indeed this is impossible if I don’t throw the dice a number of times which is divisible by four! We will come
back to this in the next subsection.
We find the probability of an event by adding together the probabilities of each outcome making the event up
(remember an event by definition is made up of one or more outcomes). Alternatively, we can just tweak our
previous definition - the probability of an event is defined as the proportion of times an event does happen,
out of all the times that event could have happened. Thanks to the laws of maths, these two definitions are
actually equivalent.
Example I am about to roll a four-sided dice, which I know to be fair (each number is equally likely to be
rolled). I define my outcome space as {1, 2, 3, 4}. What is the probability I roll less than 4?
We can find this in two ways. Firstly, I could add up the probabilities for each of the three outcomes (1, 2
and 3) which make up the event “less than 4”. Due to the dice being fair, each of these outcomes has the
same probability as the outcome of rolling a 4, so they sum to three quarters. Alternatively, if I roll a fair
four-sided dice, I will roll a 1, 2, or 3 three quarters of the time.
The above example shows us two important general ideas. Firstly, if we wanted to calculate the probability
of rolling a 1, 2, 3 or 4, then that probability would equal 0.25 + 0.25 + 0.25 + 0.25 = 1. This makes sense,
though, because the event I’m considering now is one which contains every outcome. Therefore that event
must happen.
Secondly, in a situation in which you have, say, n outcomes, each of which is equally likely, the probability of
each outcome must be 1/n, since they all have to have the same probability (otherwise they’re not equally
likely!) and adding all n of them together must result in a value of 1.
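The two ideas above can be sketched in a few lines of R for the fair four-sided dice: each of the n equally likely outcomes gets probability 1/n, and the probability of an event is the sum over the outcomes it contains. The variable names are illustrative.

```r
outcomes <- 1:4
n <- length(outcomes)

# Equally likely outcomes: each has probability 1/n
p <- rep(1 / n, n)
names(p) <- outcomes
p

# P(less than 4): add the probabilities of the outcomes 1, 2 and 3
p_less_than_4 <- sum(p[outcomes < 4])
p_less_than_4    # 0.75

# The event containing every outcome is certain
sum(p)           # 1
```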
The relative frequency of an event is our estimate of that event’s probability. It is equal to the number of
times we ran an experiment in which the event happened, divided by the total number of experiments we ran.
Note that if the event never happens in our experiments, the relative frequency is 0 (suggesting the event
could be impossible). If the event always happens in our experiments, the relative frequency is 1 (suggesting
the event might always happen). The more times we run the experiment, the more accurate we expect our
relative frequencies to be (we might have to do this a lot if we’re looking for an estimate of the probability
of a rare event, and we don’t want that estimate to just be 0). This is hopefully not surprising, since we
can think of a probability as being the proportion of times an event occurred during an infinite sequence of
events.
Example 4.5 I use R to simulate tosses of a fair coin. I ask R to do this 10 times, getting 2 results of Heads.
I then ask R to do this 10,000 times, getting 5,056 results of Heads. Finally, I ask R to do this 10,000,000
times, getting 4,997,386 results of Heads. The relative frequencies I get are shown in the table below.
Outcome    10 tosses    10,000 tosses    10,000,000 tosses
Heads      0.2          0.5056           0.4997386
Tails      0.8          0.4944           0.5002614
We can see here how the relative frequencies are approaching the true probability of Heads, which is 0.5.
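A simulation along these lines might look like the sketch below. It is not the code used to produce the table above: the seed and the helper name `relative_frequency_heads` are arbitrary choices, so the exact relative frequencies will differ from those shown.

```r
set.seed(2025)  # arbitrary seed, so that a rerun gives the same tosses

# Hypothetical helper: toss a fair coin n_tosses times and
# return the relative frequency of Heads
relative_frequency_heads <- function(n_tosses) {
  tosses <- sample(c("Heads", "Tails"), size = n_tosses, replace = TRUE)
  mean(tosses == "Heads")
}

# Relative frequency of Heads for increasingly long runs
sapply(c(10, 10000, 1000000), relative_frequency_heads)
```

The relative frequency of Tails is one minus each of these values. Short runs can land a long way from 0.5, whereas long runs are expected to sit very close to it.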
4.5 Independence
The concept of independence is absolutely critical to both probability and statistics. It is also very
commonly misunderstood.
There are a number of different ways of thinking about independence. Some get a little technical, and we’ll
return to those later in the module. For now, though, I’ll give you a definition in plain English. Two events
are independent if learning whether one event has happened gives no additional information about
whether the other event will happen.
For instance, say I roll two fair six-sided dice. The probability of rolling a one on such a dice is 1/6. If I tell
you I rolled a one on the first dice, this would not cause you to rethink what the probability of rolling a one
on the second dice is.
This kind of example is very commonly used to explain independence. Unfortunately, it can lead to people
mistakenly believing two events are independent if the processes which produce those events do not in any way
interact. This isn’t actually the case! Let’s consider another example.
Example 4.6 I roll a fair six-sided dice with one hand, and roll a fair twenty-sided dice with the other hand.
Let event A be “I roll a one on the dice in my left hand”, and let event B be “I roll a one on the dice in my
right hand”. I roll the dice in such a way that they never touch, and so the number each ends up showing
can’t have any effect on what the number the other dice shows is. Are A and B independent events?
We know the probability of rolling a one on the six-sided dice is 1/6, and the probability of rolling a one on
the twenty-sided dice is 1/20. Let’s say the probability the six-sided dice is in my left hand is 1/2. We shall
soon see how to find the probability of event A happening; it’s (1/2) × (1/6) + (1/2) × (1/20) = 13/120, with
event B having the same probability.
Now, suppose I roll the dice in my left hand, and I get a 20. This immediately tells us two things. First,
event A didn’t happen. Second, and much more importantly, the probability of event B can’t be 13/120 any
more. That’s because we now know that the dice in my right hand must have been the one with six sides.
Therefore, the probability of event B is now 1/6.
Note that if I told you ahead of time which dice was in each hand, events A and B would be independent,
because this trick of learning about which dice is which from one roll no longer works - if we already knew I
had the twenty-sided dice in my left hand, learning that I got a 20 from that roll would have no effect on our
beliefs about the other roll.
This highlights the extremely important idea that what we can say about the probabilities of events depends
on how we are defining our events, not just whatever’s going on in whatever process we’re interested in.
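The numbers in Example 4.6 can be checked with a short calculation. The sketch below compares P(A and B) with P(A) × P(B); for independent events these two values are equal. It assumes, as in the example, that each dice is equally likely to be in the left hand, and that once we know which dice is in which hand the two rolls do not affect each other.

```r
p_six_in_left <- 1 / 2   # probability the six-sided dice is in the left hand
p_one_d6  <- 1 / 6       # probability of a one on the six-sided dice
p_one_d20 <- 1 / 20      # probability of a one on the twenty-sided dice

# Event A: a one on the left-hand dice (13/120)
p_A <- p_six_in_left * p_one_d6 + (1 - p_six_in_left) * p_one_d20

# Event B: a one on the right-hand dice, which is whichever dice the left hand lacks
p_B <- p_six_in_left * p_one_d20 + (1 - p_six_in_left) * p_one_d6

# Both dice show a one: the same product whichever hand holds which dice (1/120)
p_A_and_B <- p_one_d6 * p_one_d20

c(P_A = p_A, P_B = p_B, P_A_times_P_B = p_A * p_B, P_A_and_B = p_A_and_B)

# FALSE: the product does not match, so A and B are not independent
isTRUE(all.equal(p_A * p_B, p_A_and_B))
```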
Example 4.7 In a game of Bizzfin (see Example 4.2), find the probability that I roll a 5 or 6 on the dice,
while also drawing a red card.
In this example, our two events - roll a 5 or a 6 on the dice (call this event A), and draw a red card (call this
event B) - are independent, because knowing what I rolled on the dice gives me no information about the
card I drew, and vice versa.
Hence, we can use the multiplicative rule. The probability of rolling a 5 or a 6 on a fair six-sided dice is
(1/6) + (1/6) = 1/3. Half the cards in a western deck are red (the other half are black), so the probability of
drawing a red card is 1/2.
$P(A \text{ and } B) = P(A) \times P(B) = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}.$
So why does the multiplicative rule work? In Section 4.3, we talked about how in this module, we think
of a probability of an event as being the proportion of times the event does happen, out of all the times it
could. If A and B are independent events, then B is no more or less likely to happen if A happens than if A
doesn’t happen. Therefore, A happens with a proportion P (A), and B happens with a proportion P (B) of
the times A has happened (and proportion P(B) of the times A hasn’t happened, but we’re not considering
that right now). This means they both happen for a proportion P (B) of the proportion P (A), which is an
overall proportion P (B) × P (A).
In other words, all Example 4.7 is doing is using the fact that half the time I’ll draw a red card, and one
third of those times, I will roll a 5 or a 6.
Note that we cannot perform a similar trick when we don’t have independence, because the proportion of
times B happens will be different depending on whether A happens or doesn’t happen; those two cases have
to be considered separately. We will see soon how we can go about doing this.
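The answer to Example 4.7 can also be found by brute force: list every outcome of a turn and take the proportion in which both events happen. This sketch assumes that all 312 dice-and-card combinations are equally likely (a fair dice and a well-shuffled deck); the column names and suit labels are illustrative.

```r
# One row per outcome: dice score, card rank (1 to 13) and suit
outcomes <- expand.grid(dice = 1:6, rank = 1:13,
                        suit = c("C", "D", "H", "S"),
                        stringsAsFactors = FALSE)

A <- outcomes$dice %in% c(5, 6)        # roll a 5 or a 6
B <- outcomes$suit %in% c("D", "H")    # red card: diamonds or hearts

# Proportions of outcomes in each event, and in both at once
c(P_A = mean(A), P_B = mean(B),
  P_A_and_B = mean(A & B), product = mean(A) * mean(B))
```

The last two entries agree (both equal 1/6), which is what the multiplicative rule says should happen for independent events.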
4.6.1 Additive Rule (Mutually Exclusive Results)
The additive rule for mutually exclusive events: If events A and B are mutually exclusive, then
P (A or B) = P (A) + P (B).
Example 4.8 My friend’s two sons Alfred and Martin love to compete in triathlons. He decides, based
on their previous times, that in the next triathlon they run, Alfred has a 0.07 probability of winning the
triathlon, and Martin has a 0.09 probability of winning the triathlon. What is the probability one of my
friend’s sons wins the triathlon?
Let us define A as being the event Alfred wins the race, and define M as the event Martin wins the race.
These events are mutually exclusive - they cannot both win the race. Therefore we can use the special
case of the additive rule: P (A or M) = P (A) + P (M) = 0.07 + 0.09 = 0.16.
So why does the additive rule work? There are two ways to think about this. One is to think about the
proportions involved. There is a certain proportion of times that A happens, and a certain proportion of
times that B happens. There is never an occasion when both happen. As a result, the proportion of times
when A or B happens must be the sum of the proportions when each happens, as shown in the diagrams
below.
Alternatively, we can think back to how we define probabilities. If we run an infinite series of experiments,
the number of times A or B happens must equal the number of times A happens plus the number of times B
happens, since they can never both happen at once. The additive rule therefore just tells us it doesn’t matter
if we add the number of times A happens to the number of times B happens before dividing by the total
number of times, or after.
Note that we can combine the multiplicative and additive rule where appropriate. That’s part of how, in
Example 4.6, I calculated the probability of rolling a one on a dice (call this R1) that had a probability 0.5
of being six-sided, and 0.5 of being twenty-sided. I used the fact that the event “this dice has six sides” (call
this S) is mutually exclusive to the event “this dice has twenty sides” (call this T ), and so
P (R1) = P (R1 and S) + P (R1 and T ).
We still need to do a bit more work to see how this calculation worked overall; we’ll come back to this in
Subsection 4.9.
4.6.2 Additive Rule (General)
We can’t use the additive rule in the same way when A and B are not mutually exclusive, because in that
situation, there is a certain proportion of times when A and B both happen.
This isn’t difficult to deal with, though. Rather than add the proportions for A to the proportion for B, we
could instead:
1. add the proportion for A to the proportion of B but not A;
2. add the proportion for B to the proportion of A but not B;
3. add the proportion for B but not A to the proportion of A but not B, and then add the proportion
of A and B.
4. add the proportion for A to the proportion of B, but then subtract the proportion of A and B (because
that proportion has been counted twice).
All four of these approaches give us the same answer, but it’s the fourth one we use to define the general
additive rule.
The additive rule in general is
$$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B).$$
Note that this still works for mutually exclusive events, since in such a case, P (A and B) = 0.
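As a quick check of the general additive rule, here is a minimal R sketch. The fair six-sided dice and the two events (A: roll an even number; B: roll more than 3) are illustrative assumptions, not taken from the examples in these notes.

```r
# Outcome space for one roll of a fair six-sided dice (an illustrative assumption)
outcomes <- 1:6

# Logical vectors marking which outcomes belong to each event
A <- outcomes %% 2 == 0   # A: roll an even number
B <- outcomes > 3         # B: roll more than 3

# With equally likely outcomes, a probability is the proportion of outcomes in the event
p_A       <- mean(A)
p_B       <- mean(B)
p_A_and_B <- mean(A & B)
p_A_or_B  <- mean(A | B)

# General additive rule: P(A or B) = P(A) + P(B) - P(A and B)
p_rule <- p_A + p_B - p_A_and_B
c(direct = p_A_or_B, rule = p_rule)
```

The two entries agree because the outcomes in both A and B are counted once in the direct calculation, and counted twice then subtracted once in the rule.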
4.7 Complements
The concept of a complement is a little different to what we’ve looked at so far. We’ve focussed on when
two events are independent, or when they’re mutually exclusive (or when they’re neither), but there’s nothing
stopping us thinking about any number of independent events, or any number of mutually exclusive events.
In contrast, each and every event has one and only one complement. That’s because for an event A, the
complement of A is the event that A doesn’t happen. We denote this event as A^C .
Rules of complements
1. The complement of a complement is itself: (A^C )^C = A. In words, if A doesn’t not happen, A happens.
2. Every outcome in the outcome space belongs to one and only one of events A and A^C .
3. A and A^C are mutually exclusive events, because A must either happen or not happen; it cannot do
both at the same time.
4. The event that A happens or A^C happens is certain.
5. The complement of the event including all outcomes is the empty event, and vice versa (this is why we
need the concept of the empty event).
As a consequence of the 3rd and 4th rules, we also have
P (A or A^C ) = P (A) + P (A^C ) = 1.
It is for this reason that we have the result, which you may have seen before, that
$$P(A^C) = 1 - P(A).$$
While each event A has only one complement, we can nevertheless extend what we’ve seen above by talking
about mutually exclusive and exhaustive events. This is a set of m events A1 , A2 , . . . , Am , such that
every outcome in the outcome space belongs to one and only one event. This in turn means Ai and Aj are
mutually exclusive whenever i ̸= j, and that one and only one Ai must occur.
When we have such events, we must have
$$\sum_{i=1}^{m} P(A_i) = 1 \quad (*).$$
One way to justify this result is that every event Ai is mutually exclusive to all other events, and therefore
the sum of their probabilities must equal the probability that (A1 or A2 or . . . or Am ) happens. Since every
outcome belongs to one of the events Ai , we must have
P (A1 or A2 or . . . or Am ) = 1
and hence
$$\sum_{i=1}^{m} P(A_i) = P(A_1 \text{ or } A_2 \text{ or } \dots \text{ or } A_m) = 1.$$
The rule that P (A) + P (A^C ) = 1 is actually a special case of rule (∗), because A and A^C are mutually
exclusive and exhaustive events - they can’t both happen, and one of them must happen.
Example 4.9 In the game of bizzfin (see Example 4.2), what is the probability that I either don’t roll a 1,
or that I don’t draw a Heart card?
We can count all the outcomes which lie inside this event, but it might be faster to count the outcomes which
lie inside its complement. There are 1 × 13 = 13 ways I can roll a one and draw a Heart card. The probability that I
don’t roll a 1 or don’t draw a Heart card is therefore 1 − 13/312 = 299/312.
Note that the probability of not rolling a 1 or not drawing a Heart card is not the same as not rolling a 1 and
not drawing a Heart card. The latter event here has a probability of 195/312 (try checking this for yourself).
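The counts in Example 4.9 can be checked by brute force. The sketch below assumes, based on the 312 outcomes used above, that bizzfin combines one roll of a fair six-sided dice with one card drawn from a standard 52-card deck; the object names are illustrative.

```r
# Assumed outcome space: 6 dice faces x 52 cards = 312 equally likely outcomes
suits <- c("Hearts", "Diamonds", "Clubs", "Spades")
space <- expand.grid(roll = 1:6, rank = 1:13, suit = suits,
                     stringsAsFactors = FALSE)
nrow(space)  # total number of outcomes

not_one   <- space$roll != 1
not_heart <- space$suit != "Hearts"

# Outcomes in "not a 1 OR not a Heart": counted directly, and via the complement
n_or         <- sum(not_one | not_heart)
n_complement <- nrow(space) - sum(!not_one & !not_heart)

# Outcomes in "not a 1 AND not a Heart"
n_and <- sum(not_one & not_heart)

c(or = n_or, or_via_complement = n_complement, and = n_and)
```

Dividing each count by `nrow(space)` gives the corresponding probability.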
Example 4.10 Suppose market research has been done on orders in the Palatine Cafe, and the research
shows that the probability a customer’s order includes coffee is 2/5. If the Palatine Cafe serves
300 people in a day, the expected number of orders including coffee will be 300 × (2/5) = 120.
if the same six numbers are chosen by the Lottery Machine at the end of the week. Assuming the Lottery
Machine chooses those six numbers randomly and fairly, my probability of winning the jackpot is 0.0000072%.
Imagine further though that when the first number is announced, it matches one of the six I have chosen.
I now need the Lottery Machine to choose the five numbers I have remaining in order to win the jackpot,
which has a probability of 0.000058%. If the second number also matches one of mine, the probability of
winning goes up to 0.00056%, and so on. If on the other hand any number comes out that doesn’t match
mine, then I know I’ve lost, and the probability drops to zero.
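The percentages quoted above can be reproduced with a short R sketch. It assumes a lottery in which six numbers are drawn without replacement from a pool of 49; that pool size is an assumption, chosen because it reproduces the figures in the text.

```r
# Assumption: 6 numbers are drawn without replacement from a pool of 49
pool  <- 49
picks <- 6

# After k of my numbers have already been matched, the machine must still choose
# my remaining (picks - k) numbers from the (pool - k) balls left
matched <- 0:2
p_win <- 1 / choose(pool - matched, picks - matched)

# Express as percentages, rounded to 2 significant figures, to compare with the text
signif(100 * p_win, 2)
```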
This is a very simple example of situations in which the probability of an event can change, based on
additional information we have received. Because situations in which we want to rethink probabilities based
on additional information are so common (especially in statistics), we need some way to update probabilities
in a mathematically rigorous manner. This is done through what we call conditional probabilities.
We often write P (A happens, given B has happened) as P (A|B), said out loud as “the probability of A given
B”, for ease of notation.
We can take this a stage further, using the mathematical result that a fraction does not change its value if
you divide both top and bottom by the same number.
$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)}$$
This alternative form of the formula is commonly considered to be “the” conditional probability formula.
Important: For events A and B, we can talk about P (B|A) just as easily as P (A|B). This sometimes seems
odd, if event B either happens or doesn’t happen before event A does or doesn’t happen. Probability is
all about our beliefs about events whose outcome we personally don’t yet know. It is perfectly
possible for an event to have happened or not without you knowing whether it has happened or not, and in
such circumstances it is still perfectly reasonable to discuss probability.
It’s also important to bear in mind that, in general, P (A|B) ̸= P (B|A). We can see from equation
(∗∗) that P (A|B) = P (B|A) would only be true when the number of times A could have happened =
the number of times B could have happened.
Example 4.11 In a catch-and-release study, one hundred swallows in the UK are captured during summer,
given leg rings (numbered 1 to 100), and released into the wild. Two months later, one of the swallows is
spotted in Tunisia.
Assuming the leg ring on the spotted swallow is equally likely to be each of the numbers between 1 and 100,
find:
1. P (A), where A is the event that the leg ring number is 20 or lower;
2. P (B) where B is the event that the leg ring number is prime;
3. P (A|B);
4. P (B|A).
Answers (we use {1, . . . , 100} as the outcome space here):
1. 20 of the 100 outcomes are 20 or lower, hence P (A) = 20/100 = 1/5.
2. There are 25 prime numbers between 1 and 100. Hence 25 of the 100 outcomes are prime, and
P (B) = 25/100 = 1/4.
3. Since we know B happened, our outcome space for this calculation contains only the 25 prime numbers
between 1 and 100. Of these, 8 are below 20, namely 2, 3, 5, 7, 11, 13, 17 and 19 (20 itself is not prime, of
course). Hence P (A|B) = 8/25.
4. Since we know A happened, our outcome space for this calculation contains only the numbers 1 to 20.
Of these, 8 are prime. Hence P (B|A) = 8/20 = 2/5.
Note that this is one of many situations in which P (A|B) ̸= P (B|A).
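The four probabilities in Example 4.11 can be checked by counting in R. This is a minimal sketch; the helper `is_prime` is a hypothetical name introduced here, and uses simple trial division.

```r
# Leg ring numbers: the outcome space {1, ..., 100}
rings <- 1:100

# Hypothetical helper: TRUE if n is prime, checked by trial division
is_prime <- function(n) {
  if (n < 2) return(FALSE)
  if (n < 4) return(TRUE)            # 2 and 3 are prime
  all(n %% 2:floor(sqrt(n)) != 0)    # no divisor between 2 and sqrt(n)
}

A <- rings <= 20                           # event A: ring number is 20 or lower
B <- vapply(rings, is_prime, logical(1))   # event B: ring number is prime

p_A <- mean(A)
p_B <- mean(B)
p_A_given_B <- sum(A & B) / sum(B)   # restrict the outcome space to B
p_B_given_A <- sum(A & B) / sum(A)   # restrict the outcome space to A

c(p_A = p_A, p_B = p_B, p_A_given_B = p_A_given_B, p_B_given_A = p_B_given_A)
```

Both conditional probabilities share the same numerator, the number of outcomes in A and B; they differ only in which event's count they are divided by.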
4.9.5 Conditional probabilities and independence
Our working definition of independence, as discussed in Section 4.5, is that the events A and B are independent
if learning whether event A (B) has happened does not cause us to change our beliefs about the chance of B
(A) happening.
If A and B are independent, then, what does that imply about the probabilities P (A|B) and P (B|A)? We can
answer this question using the definition of independence, but instead, we’ll consider it from the perspective
of the relevant equations, and then discuss how the implications of those equations match up to our definition
of independence.
We know that
$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \;\Rightarrow\; P(A \text{ and } B) = P(A|B)P(B)$$
whether A and B are independent or not. When A and B are independent, the multiplicative rule gives
P (A and B) = P (A)P (B), so we must have
$$P(A)P(B) = P(A|B)P(B)$$
and hence that P (A|B) = P (A). A very similar argument gives us that when A and B are independent,
P (B|A) = P (B).
So why does this make sense? What we are seeing here is that, when A and B are independent, what we believe the
probability for A is after learning B has happened should be the same as what we believe the probability
for A is without knowing B had happened. If A and B are independent, it doesn’t make any difference to
our beliefs about A if we learn B happened; that’s what independence means!
The same argument justifies why P (B|A) = P (B); learning A has happened gives us absolutely no additional
information about how likely an event independent of A is to happen.
This is a very useful result in statistics. The equation P (A and B) = P (A|B)P (B) also comes in handy an
awful lot, and is sometimes called the conditional multiplicative rule.
We can now finally fully understand what happened in Example 4.6. I showed already that the probability
of rolling a 1 with the dice in my left hand is
$$P(R1) = P(R1 \text{ and } S) + P(R1 \text{ and } T)$$
using the additive rule. We can now use the conditional multiplicative rule to get
$$P(R1) = P(R1|S)P(S) + P(R1|T)P(T) = \frac{1}{6} \times \frac{1}{2} + \frac{1}{20} \times \frac{1}{2} = \frac{13}{120}$$
as stated.
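The same calculation can be written out in R. This is only a sketch of the arithmetic, with variable names chosen here for illustration; the probabilities are those described for Example 4.6.

```r
# Probabilities described for Example 4.6
p_S <- 0.5             # P(S): the dice is six-sided
p_T <- 0.5             # P(T): the dice is twenty-sided
p_R1_given_S <- 1/6    # P(R1 | S)
p_R1_given_T <- 1/20   # P(R1 | T)

# Conditional multiplicative rule on each branch, then the additive rule,
# which applies because S and T are mutually exclusive
p_R1 <- p_R1_given_S * p_S + p_R1_given_T * p_T
p_R1
```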
$$P(B|A)P(A) = P(B \text{ and } A) = P(A \text{ and } B) = P(A|B)P(B)$$
$$\Rightarrow\; P(B|A) = \frac{P(A|B)P(B)}{P(A)}$$
This simple little trick unlocks an amazingly powerful result - we have a way of swapping round the order of
events in a conditional probability. We can go back in time!
There are any number of situations in which this is extremely useful. Here’s just one: imagine a GP wants to
diagnose a sick patient. What she might want to do is figure out the probability of different diseases, given
the symptoms being shown. That could be extremely hard to work out, though. It might be much easier to
look up the probabilities of the symptoms given different diseases, and then use Bayes’ rule to swap that
round into what she really wants. Let’s dig into how that might work with an example.
Example 4.12 A doctor wants to calculate the probability a patient has the flu, given they have a cough, a
sore throat, and aching arms. The doctor knows the following information:
a) Currently 3% of the UK population is suffering from the flu (based on the latest NHS data).
b) Currently 5% of the UK population is reporting they have all three of a cough, a sore throat, and aching
arms (again, based on the latest NHS data).
c) 90% of people with the flu report all three of a cough, a sore throat, and aching arms (this has been
demonstrated by previous medical research).
Using these values, the doctor can calculate the probability their patient has the flu. Denoting by F the
event that a patient has the flu, and denoting by S the event that a patient has all three listed symptoms:
$$P(F|S) = \frac{P(S|F)P(F)}{P(S)} = \frac{0.9 \times 0.03}{0.05} = 0.54.$$
One way to interpret this result is that, before learning about the patient’s symptoms, it would be sensible for
the doctor to assume their probability of having the flu is 0.03. Once she learns about the symptoms, this
probability increases by a factor of 18.
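The calculation in Example 4.12 is short enough to reproduce directly. The sketch below uses the three values given in the example; the variable names are illustrative.

```r
# Inputs from Example 4.12
p_F         <- 0.03   # P(F): probability of flu before seeing any symptoms
p_S         <- 0.05   # P(S): probability of showing all three symptoms
p_S_given_F <- 0.90   # P(S | F): probability of the symptoms, given flu

# Bayes' rule: P(F | S) = P(S | F) P(F) / P(S)
p_F_given_S <- p_S_given_F * p_F / p_S

# The updated probability, and how it compares with the starting value
c(p_F_given_S = p_F_given_S, ratio = p_F_given_S / p_F)
```

Changing any one of the three inputs shows how sensitive the answer is to it; for instance, a more common set of symptoms (larger P(S)) lowers P(F|S).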
5 Random Variables
We’ve now covered the basics of probability theory, upon which just about everything in the field of statistics
is based. There’s still a lot more to talk about, though. The kinds of examples we’ve been looking at have
been very simple in terms of the situations involved (which isn’t to say the maths might not have been
challenging). These sorts of situations - dice rolls, coin tosses, lotteries - aren’t really where we need to put
our focus. If you can work out probabilities just through counting equally likely outcomes, you don’t need to
call in a statistician.
The kinds of situation where a statistician becomes useful are those where you can’t just do some calculations
based on the situation’s properties. We get called in when conclusions - potentially quite complex and subtle
ones - need to be made by studying the behaviour of a system, via the data that system produces.
This will require an understanding of how we think about the data we collect in terms of the underlying
probabilities of the situation. This in turn will mean grasping the concept of a random variable, another
idea which is absolutely foundational to the practice of statistics.
So what is a random variable? How do we define them, how do they work, and why are they useful? Answering
all three of those questions requires we consider an even more fundamental question first: what is a variable?
even not assign a number at all, using algebra instead; P (H) = p, p ∈ [0, 1] expresses my belief that the
probability of getting Heads is p, where p is an unspecified number between 0 and 1 (note my use of the
Greek letter epsilon, ∈, to denote “belongs to” here).
Each of these expressions of belief turns X into a random variable, and turns 𝒳 into the outcome space for
X. We describe the value that X ends up taking as a realisation, often denoted x.
It’s a common mistake to see non-random variables and random variables as somehow opposite concepts. I
don’t think this is a useful approach. I think a better way to consider what we’re doing is in terms of a house.
Randomness is something you build on top of the concept of variables. It makes no more sense to say
non-random variables and random variables are opposites than it does to say one-storey houses and two-storey
houses are opposites. In particular, a variable in which you express certainty about which value it will take is
still a random variable, even though the outcome is fully known, because you’ve expressed that fact as a
belief.
Example 5.1 The data set chicks in R tracks the weights of 50 newly-hatched chicks, with each chick being
weighed once a day. For any one chick, we could imagine a function, say f , for which the input t is the time
in minutes since a chick hatched, and the output f (t) is the chick’s weight in grams.
We can draw a graph for this function, as shown below. Because we only have one measurement a day,
though, the graph has long stretches of time in which the chick must have weighed something, but we
cannot possibly say what. The function f therefore does not have a closed form. Attempting to estimate the
outputs for the values of t for which we didn’t record the chick’s weight, say by drawing a line of best fit (see
below), is a classic example of what we do in statistics.
statisticians may use the word “probability” to talk about the probability function, rather than the
probabilities that the function gives us as outputs. It should usually be clear in context which one is being
referred to, but keep on your toes!
Example 5.2 Consider a six-sided dice. Let the value I next roll on that dice be expressed as a variable
X. The variable can take values from 𝒳 = {1, . . . , 6}. To make this into a random variable, I need to add
a belief statement. I shall assume the dice is fair, and that therefore each of the elements of 𝒳 has a 1/6
probability of being the one I roll.
It is common in mathematics to denote outcomes using a lower-case Greek omega. I shall therefore denote
the outcome “I roll a one” as ω1 , the outcome “I roll a two” as ω2 , and so on, up to ω6 . I will now call my
probability function P , and express the outputs as follows
$$P(\omega_i) = \frac{1}{6}, \quad \text{for every } i \in \mathcal{X}.$$
I could instead offer a different belief statement. Perhaps I know the dice has been tampered with, so that
the probability of rolling a six is actually 1/2, with the other five outcomes being as likely as each other,
though all less likely than a six.
I’ll define a new probability function P̃ for this belief:
$$\tilde{P}(\omega_i) = \frac{1}{10}, \quad \text{for every } i \in \{1, 2, 3, 4, 5\},$$
$$\tilde{P}(\omega_6) = \frac{1}{2}.$$
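One way to make the two belief statements in Example 5.2 concrete is to store each probability function as a vector in R. This is a sketch only; the object names and the seed are arbitrary choices made here.

```r
# Outcome space for the dice
outcome_space <- 1:6

# Two belief statements, each stored as a vector of probabilities over the outcome space
P_fair     <- rep(1/6, 6)            # fair dice
P_tampered <- c(rep(1/10, 5), 1/2)   # a six has probability 1/2

# Any valid probability function must be non-negative and sum to 1
stopifnot(all(P_fair >= 0), isTRUE(all.equal(sum(P_fair), 1)))
stopifnot(all(P_tampered >= 0), isTRUE(all.equal(sum(P_tampered), 1)))

# Ten realisations of X under each belief (the seed is arbitrary, for reproducibility)
set.seed(1)
sample(outcome_space, size = 10, replace = TRUE, prob = P_fair)
sample(outcome_space, size = 10, replace = TRUE, prob = P_tampered)
```

The variable X and its outcome space are the same in both cases; only the belief statement attached to them changes.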
Continuous distributions, in contrast, are probability functions for which the outcome space includes every value in
some interval, and hence infinitely many values. The uniform distribution U [0, 1] is a good example of such a function. This is a probability function
which takes any value in the interval [0, 1] as an input (and there are an infinite number of such values), and
which gives its outputs in such a way that all outcome values are considered equally likely.
This quickly gets a bit complicated, though, because since there are infinitely many outcomes, and they
all have to have the same probability, the probability of getting any one specific outcome has to be zero.
This is true of all continuous distributions, in fact. In order to deal with this, we have to express continuous
distributions so they only give non-zero outputs when the inputs are intervals. Continuous distributions are
described by probability density functions, or PDFs.
“...is a random variable which has distribution...”. Hence X ∼ Ber(p) can be thought of as saying “X is a
random variable which has a Bernoulli distribution”.
$$P(X = r) = \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r}$$
We can break down this equation to see how it works. The term p^r represents the probability of observing r
successful trials in a row (using the fact that the trials are independent). The term (1 − p)^(n−r) represents
the probability of observing n − r failed trials in a row. Combining these gives us that the probability of
observing r successes followed by n − r failures is p^r (1 − p)^(n−r) . Again under the assumption of independence,
this must also equal, say, the probability of observing n − r failures followed by r successes, or the probability
of observing r successes and n − r failures in any other specific order.
We can therefore find P (X = r) by multiplying this probability of observing r successes and n − r failures in
one specific combination by the number of such combinations there are (this relies on the additive rule for
mutually exclusive events, and in noting each specific order of r successes and n − r failures must be mutually
exclusive to any other specific order of r successes and n − r failures).
The expression n!/(r!(n − r)!) gives us that number of different combinations of r successes and n − r failures. I
won’t explain how this works here, but I’ve put together a separate document with that explanation, which is
also available in the Week 3 material on Ultra.
I mentioned above that there are two ways to think about the binomial distribution. The second is to
recognise that each individual one of the n trials we’re observing has the same probability p of success.
Therefore, each individual trial can be represented by a Bernoulli random variable. The behaviour of the
random variable X ∼ Bin(n, p) is therefore equivalent to the behaviour of the sum of n Bernoulli random
variables Y1 , . . . , Yn , each of which is a Bernoulli random variable, Yi ∼ Ber(p).
For this to work, each of these n Bernoulli random variables has to be independent of all the others. We
refer to such groups of independent random variables, all with the same distribution, as independent and
identically distributed random variables, or iid RVs for short.
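Both ways of thinking about the binomial distribution can be illustrated in R. In this sketch the values n = 10 and p = 0.3, the number of simulated repetitions, and the seed are all arbitrary assumptions; `dbinom()` and `rbinom()` are R's built-in binomial functions.

```r
# Illustrative parameter values (assumptions, not from the text)
n <- 10
p <- 0.3
r <- 0:n

# First view: the probability of r successes, written out as in the formula above
p_formula <- factorial(n) / (factorial(r) * factorial(n - r)) * p^r * (1 - p)^(n - r)

# R's built-in binomial probability function gives the same values
p_builtin <- dbinom(r, size = n, prob = p)
all.equal(p_formula, p_builtin)

# Second view: X is the sum of n iid Bernoulli(p) random variables.
# rbinom(n, size = 1, prob = p) simulates n Bernoulli trials; their sum is one realisation of X.
set.seed(1)  # arbitrary seed, for reproducibility
x_sim <- replicate(20000, sum(rbinom(n, size = 1, prob = p)))

# Simulated relative frequencies should be close to the exact probabilities
freq <- as.numeric(table(factor(x_sim, levels = r))) / length(x_sim)
round(rbind(exact = p_builtin, simulated = freq), 3)
```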
for this to work, we assume that the events in question happen independently - that is, an event happening at
a given point in the interval tells us nothing additional about when we might expect the next event to occur.
The Poisson distribution has a single parameter, λ, which is referred to as the intensity. The larger the value
of λ, the more times we expect to see the event happen over the interval associated with the distribution. We
express that the random variable X has a Poisson distribution with intensity λ by writing X ∼ Pois(λ).
The formula for finding the probability of seeing exactly r events over the interval associated with
X ∼ Pois(λ) is written below.
$$P(X = r) = \frac{e^{-\lambda} \lambda^r}{r!}.$$
Note that the Poisson distribution is somewhat different to the Bernoulli and binomial distributions, in that
there is no upper limit to the value of its realisations. There are therefore infinitely many possible values a
Poisson random variable can take. Once past a certain point, though (which varies depending on the value of
λ for the distribution), the probabilities get smaller and smaller as the realisation value gets larger and larger,
and they do so in such a way that, even though there is an infinite number of them, these probabilities still
all sum to 1.
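The behaviour described above can be seen numerically. The following sketch assumes an illustrative intensity of λ = 4 and truncates the realisations at 100, which is far enough into the tail for the remaining probability to be negligible; `dpois()` is R's built-in Poisson probability function.

```r
lambda <- 4   # illustrative intensity (an assumption)
r <- 0:100    # the realisations have no upper limit, so truncate far into the tail

# Probabilities from the formula above, compared with R's built-in dpois()
p_formula <- exp(-lambda) * lambda^r / factorial(r)
all.equal(p_formula, dpois(r, lambda))

# Past a certain point the probabilities shrink rapidly:
# P(X = 0), P(X = 4), P(X = 10), P(X = 20), P(X = 40)
signif(p_formula[c(1, 5, 11, 21, 41)], 3)

# They shrink fast enough that the truncated sum is already 1 to within rounding error
sum(p_formula)
```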
• For any a < b, the probability that X falls in the interval (a, b) is the area under the density function
between a and b:
$$P(a < X < b) = \int_a^b f(x)\,dx$$
• Thus the probability that a continuous random variable X takes on any particular value is 0:
$$P(X = c) = \int_c^c f(x)\,dx = 0$$
Although this may seem strange initially, it is really quite natural. If a uniform random variable
on [0, 1] had a positive probability of being any particular number, it should have the same
probability for any number in [0, 1], in which case the sum of the probabilities of any countably infinite
subset of [0, 1] (for example, the rational numbers) would be infinite.
• If X is a continuous random variable, then
P (a < X < b) = P (a ≤ X < b) = P (a < X ≤ b) = P (a ≤ X ≤ b)
Note that this is not true for a discrete random variable.
• The cdf can be used to evaluate the probability that X falls in an interval:
$$P(a \le X \le b) = \int_a^b f(x)\,dx = F(b) - F(a)$$
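The link between the area under the density and the cdf can be checked numerically. This sketch assumes, purely for illustration, that X is exponential with rate 2, and uses R's `integrate()` for the area; the names `pdf_x` and `cdf_x` are introduced here.

```r
# Illustrative choice (an assumption): X is exponential with rate 2,
# so its density is f(x) = 2 * exp(-2 * x) for x >= 0
pdf_x <- function(x) dexp(x, rate = 2)
cdf_x <- function(x) pexp(x, rate = 2)

a <- 0.5
b <- 1.5

# Area under the density between a and b, by numerical integration
area <- integrate(pdf_x, lower = a, upper = b)$value

# The same probability from the cdf: F(b) - F(a)
c(integral = area, cdf_difference = cdf_x(b) - cdf_x(a))
```

The two values agree up to the numerical error of the integration routine.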
5.8 Characteristics of probability distributions
• If X is a continuous random variable with density f (x), then
$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
• The variance of X is
$$\sigma^2 = Var(X) = E[X - E(X)]^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$
• The variance of X is the average value of the squared deviation of X from its mean.
• The variance of X can also be expressed as Var(X) = E(X^2 ) − [E(X)]^2 .
$$F(x) = \begin{cases} 0 & \text{for } x < a \\ \dfrac{x-a}{b-a} & \text{for } a \le x < b \\ 1 & \text{for } x \ge b \end{cases}$$
• The cumulative distribution function is
$$F(x) = \int_{-\infty}^{x} f(u)\,du = 1 - e^{-\lambda x}, \quad x \ge 0$$
• The exponential distribution is often used to model lifetime or waiting-time data.
5.9.4 Standard normal distribution N (µ = 0, σ² = 1)
• The probability density function of the standardized normal distribution is given by:
$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty$$
• We write Z ∼ N (0, 1) as a short way of saying ‘Z follows a standard normal distribution with mean 0
and variance 1’.
• To standardize any variable X (into Z) we calculate Z as:
$$Z = \frac{X - \mu}{\sigma}$$
The Z-score calculated above indicates how many standard deviations X is from the mean.
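As a small illustration of standardizing, the sketch below uses made-up values (µ = 100, σ = 15, and an observed value of 130; all three are assumptions) and shows that a normal probability is unchanged by converting to a Z-score first.

```r
# Illustrative values (assumptions): X ~ N(mu = 100, sigma^2 = 15^2), observed x = 130
mu    <- 100
sigma <- 15
x     <- 130

# Z-score: how many standard deviations x lies from the mean
z <- (x - mu) / sigma
z

# P(X <= x) computed directly, and via the standard normal after standardizing
c(direct = pnorm(x, mean = mu, sd = sigma), standardized = pnorm(z))
```

This is why a single standard normal table is enough for any normal distribution.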
5.9.5 Example
• If fX is a normal density function with parameters µ and σ, and Y = aX + b with a > 0, then
$$f_Y(y) = \frac{1}{a\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - b - a\mu}{a\sigma}\right)^2\right]$$
• Thus, Y = aX + b follows a normal distribution with parameters aµ + b and aσ.
• If X ∼ N (µ, σ^2 ) and Y = aX + b, then Y ∼ N (aµ + b, a^2 σ^2 ).
• Can you use this to show that Z ∼ N (0, 1)?
• The joint density function f (x, y) of two continuous random variables X and Y is such that
$$f(x, y) \ge 0$$
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$
$$\int_c^d \int_a^b f(x, y)\,dx\,dy = P(a \le X \le b,\ c \le Y \le d)$$
The marginal density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$$
and
$$f(x, y) = \frac{\partial^2}{\partial x\,\partial y} F(x, y)$$
wherever the derivative is defined.
5.12 Properties of Expected values and Variance
• The expected value of a constant is the constant itself, i.e. if c is a constant, E(c) = c.
• The variance of a constant is zero, i.e. if c is a constant, Var(c) = 0.
• If a and b are constants, and Y = aX + b, then E(Y ) = aE(X) + b and Var(Y ) = a^2 Var(X) (if Var(X)
exists).
• If X and Y are independent, then E(XY ) = E(X)E(Y ) and
Var(X + Y ) = Var(X) + Var(Y )
Var(X − Y ) = Var(X) + Var(Y )
• If X and Y are independent random variables and g and h are fixed functions, then
E[g(X)h(Y )] = E[g(X)]E[h(Y )]
5.13 Covariance
• Let X and Y be two random variables with means µx and µy , respectively. Then the covariance
between the two variables is defined as
cov(X, Y ) = E {(X − µx )(Y − µy )} = E(XY ) − µx µy
• If X and Y are independent, then cov(X, Y ) = 0.
• If two variables are uncorrelated, that does not in general imply that they are independent.
• Var(X) = cov(X, X)
• cov(bX + a, dY + c) = bd cov(X, Y ), where a, b, c, and d are constants.
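The covariance identities above also hold exactly for the sample versions computed by R's `var()` and `cov()`, which makes them easy to check. The data vectors and constants below are made up purely for illustration.

```r
# Small made-up data vectors, purely for illustration
x <- c(2.1, 3.4, 1.9, 5.0, 4.2, 3.3)
y <- c(1.0, 2.2, 0.7, 3.9, 2.8, 2.5)

# Arbitrary constants (assumptions); c0 plays the role of the constant c
a <- 10; b <- 2; c0 <- -4; d <- 3

# Var(X) = cov(X, X)
all.equal(var(x), cov(x, x))

# cov(bX + a, dY + c) = bd cov(X, Y): shifts drop out, scale factors multiply
all.equal(cov(b * x + a, d * y + c0), b * d * cov(x, y))

# cov(X, Y) = E(XY) - E(X)E(Y); cov() divides by n - 1, so rescale it to divisor n
n <- length(x)
all.equal(mean(x * y) - mean(x) * mean(y), cov(x, y) * (n - 1) / n)
```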
and if X is continuous,
$$Var(X|Y = y) = \int_{-\infty}^{\infty} [x - E(X|Y = y)]^2 f_{X|Y}(x|Y = y)\,dx$$
6 Sampling
6.1 Sampling
• Sampling is widely used as a means of gathering useful information about a population.
• Data are gathered from samples and conclusions are drawn about the population as a part of the
inferential statistics process.
• Often, a sample provides a reasonable means for gathering such useful decision-making information
that might be otherwise unattainable and unaffordable.
• Sampling error occurs when the sample is not representative of the population.
Note that this result holds true regardless of the form of the underlying distribution. As a result, it follows
that
$$Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{as } n \to \infty$$
That is, Z is a standardized normal variable.
There are two cases:
1. Sampling is from a normally distributed population with a known population variance:
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
That is, the sampling distribution of the sample mean is normal with mean µx̄ = µ and standard
deviation σx̄ = σ/√n.
2. Sampling is from a non-normally distributed population with known population variance and n is large,
then the mean of x̄,
µx̄ = µ
and the variance,
$$\sigma_{\bar{x}}^2 = \begin{cases} \dfrac{\sigma^2}{n} & \text{with replacement (infinite population)} \\ \dfrac{\sigma^2}{n} \cdot \dfrac{N-n}{N-1} & \text{without replacement (finite population)} \end{cases}$$
• If the sample size is large, the central limit theorem applies and the sampling distribution of x̄ will be
approximately normal.
• The standard deviation of the sampling distribution of the sample mean, σx̄ , is called the standard
error of the mean or, simply, the standard error
• If x̄ is normally distributed (or approximately normally distributed), we can use the following formula to
transform x̄ to a Z-score.
$$Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$
where Z ∼ N (0, 1).
then
$$Z = \frac{\hat{\pi} - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} \approx N(0, 1)$$
where π̂ = x/n, and x is the number in the sample with the characteristic of interest.
• A widely used criterion is that both nπ and n(1 − π) must be greater than 5 for this approximation to
be reasonable.
and
$$E(s^2) = \sigma^2$$
$$Var(s^2) = 2\sigma^4/(n-1)$$
Then
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$
6.8 Example
Suppose that during any hour in a large department store, the average number of shoppers is 448, with a
standard deviation of 21 shoppers. What is the probability that a random sample of 49 different shopping
hours will yield a sample mean between 441 and 446 shoppers?
µ = 448, σ = 21, n = 49
This means there is a 24.15% chance of randomly selecting 49 hourly periods for which the sample mean is
between 441 and 446 shoppers.
We used the standard normal table to obtain these probabilities. We can also use R.
pnorm(-0.67)-pnorm(-2.33)
## [1] 0.2415258
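The table-based calculation rounds the z-values to two decimal places. As a sketch of the same calculation without that rounding, the standard error and z-values can be computed directly in R (the variable names are illustrative). The result differs slightly from the 24.15% quoted above, purely because of the rounding.

```r
# Values from Example 6.8
mu    <- 448   # population mean
sigma <- 21    # population standard deviation
n     <- 49    # sample size

# Standard error of the sample mean
se <- sigma / sqrt(n)

# z-values for sample means of 441 and 446
z_lower <- (441 - mu) / se
z_upper <- (446 - mu) / se
round(c(z_lower, z_upper), 2)

# P(441 < sample mean < 446), using the unrounded z-values
p <- pnorm(z_upper) - pnorm(z_lower)
p
```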
7 Estimation
7.1 Estimation
• The values of population parameters are often unknown.
• We use a representative sample of the population to estimate the population parameters.
There are two types of estimation:
• Point Estimation
• Interval Estimation
7.3 Interval estimation
• An interval estimate (confidence interval) is an interval, or range of values, used to estimate a
population parameter.
• The level of confidence (1 − α)100% is the probability that the interval estimate contains the
population parameter.
• Interval estimate components:
• If σ is unknown and n ≥ 30, the sample standard deviation $s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$ can be used in
place of σ.
• If the sampling is from a non-normal distribution and n ≥ 30, then the sampling distribution of x̄
is approximately normally distributed (Central Limit Theorem) and we can use the same formula,
x̄ ± zα/2 (σ/√n), to construct the approximate confidence interval for the population mean.
• When sampling is from a normal distribution whose standard deviation σ is unknown and the sample
size is small, the 100(1 − α)% confidence interval for the population mean µ is
$$\bar{x} \pm t_{\alpha/2}\,(s/\sqrt{n})$$
where tα/2 can be obtained from the t distribution table with df = n − 1 and s is the sample standard
deviation which is given by
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$
• If σ is unknown, and we have neither a normal population nor a large sample, then we should use
nonparametric statistics (not covered in this course).
7.7 Confidence interval for a population variance
The 100(1 − α)% confidence interval for the variance, σ^2 , of a normally distributed population is given by
$$\left( \frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}} \right)$$
where $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the sample variance.
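A minimal R sketch of this interval is given below. The function name `var_ci` and the data are hypothetical. It also assumes that the subscript in χ² with α/2 denotes the value with upper-tail area α/2 (the usual table convention), which corresponds to `qchisq(1 - alpha / 2, df)` in R.

```r
# Hypothetical helper: 100(1 - alpha)% confidence interval for a population variance,
# assuming the data are a random sample from a normal distribution
var_ci <- function(x, conf_level = 0.95) {
  n     <- length(x)
  alpha <- 1 - conf_level
  s2    <- var(x)   # sample variance, with divisor n - 1
  # Assumption: chi^2_{alpha/2, n-1} has upper-tail area alpha/2,
  # i.e. it is the larger critical value, qchisq(1 - alpha/2, df)
  lower <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)
  upper <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)
  c(lower = lower, sample_variance = s2, upper = upper)
}

# Made-up measurements, purely for illustration
x <- c(10.2, 9.8, 11.5, 10.9, 9.4, 10.1, 10.7, 11.0)
ci <- var_ci(x, conf_level = 0.95)
ci
```

Note that the interval is not symmetric around the sample variance, because the chi-squared distribution is skewed.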
7.8 Example
Suppose a car rental firm wants to estimate the average number of kilometres travelled per day by each of its
cars rented in London. A random sample of 20 cars rented in London reveals that the sample mean travel
distance per day is 85.5 kilometres, with a population standard deviation of 19.3 kilometres. Compute a 99%
confidence interval to estimate µ.
For a 99% level of confidence, a z value of 2.58 is obtained (from the standard normal table). Assume that
number of kilometres travelled per day is normally distributed.
$$\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$$
$$85.5 \pm 2.58 \times \frac{19.3}{\sqrt{20}}$$
$$85.5 \pm 11.1$$
thus 74.4 ≤ µ ≤ 96.6
Note that:
qnorm((1-0.99)/2)
## [1] -2.575829
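The whole interval in Example 7.8 can also be computed in R rather than from the table. This sketch uses the values given in the example, with illustrative variable names, and takes the critical value from `qnorm()` instead of the rounded 2.58.

```r
# Values from Example 7.8
x_bar <- 85.5   # sample mean (km per day)
sigma <- 19.3   # population standard deviation (km per day)
n     <- 20     # sample size
conf  <- 0.99   # confidence level

# Critical value z_{alpha/2}: the upper-tail counterpart of qnorm((1 - 0.99) / 2)
z <- qnorm(1 - (1 - conf) / 2)

# Margin of error and interval limits
margin <- z * sigma / sqrt(n)
ci <- c(lower = x_bar - margin, upper = x_bar + margin)
round(ci, 1)
```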
8 Hypothesis Testing One Sample
8.1 Hypothesis testing: Motivation
We often encounter such statements or claims:
• A newspaper claims that the average starting salary of MBA graduates is over £50K. (one sample test)
• A claim about the efficiency of a particular diet program, the average weight after the program is less
than the average weight before the program. (two paired samples test)
• On average female managers earn less than male managers, given that they have the same qualifications
and skills. (two independent samples test)
So we have claims about the populations’ means (averages) and we would like to verify or examine these
claims.
This is the kind of problem that hypothesis testing is designed to solve.
8.3 Type I and Type II Errors
H0 : µ = µ0
8.5 The p-value approach to hypothesis testing
• The p-value is the smallest significance level at which the null hypothesis would be rejected. The p-value
is also known as the observed significance level.
• The p-value measures how well the observed sample agrees with the null hypothesis. A small p-value
(close to zero) indicates that the sample is not consistent with the null hypothesis and the null hypothesis
should be rejected. On the other hand, a large p-value (larger than 0.10) generally indicates a reasonable
level of agreement between the sample and the null hypothesis.
• As a rule of thumb, if p-value ≤ α then reject H0 ; otherwise do not reject H0 .
For any specific significance level α, one can obtain these critical values ±zα/2 and ±zα from the standard
normal table.
If the value of the test statistic falls in the rejection region, reject H0 ; otherwise do not reject H0 .
For any specific significance level α, one can obtain these critical values ±tα/2 and ±tα from the t distribution
table. For example, for df = 9 and α = .05, the critical values are ±t0.025 = ±2.262 and ±t0.05 = ±1.833.
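The same critical values can be obtained in R with `qt()` instead of the table. This is a short sketch for the df = 9, α = 0.05 case quoted above; the variable names are illustrative.

```r
alpha <- 0.05
df    <- 9

# Two-sided test: put alpha/2 in each tail
t_two_sided <- qt(1 - alpha / 2, df = df)

# One-sided test: put all of alpha in one tail
t_one_sided <- qt(1 - alpha, df = df)

round(c(two_sided = t_two_sided, one_sided = t_one_sided), 3)
```

Replacing `qt(..., df = df)` with `qnorm(...)` gives the corresponding critical values for the standard normal distribution.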
rejected if and only if the value µ0 given for the mean in the null hypothesis lies outside the 100(1 − α)-level
confidence interval for µ.
Example:
• At significance level α = 0.05, we want to test H0 : µ = 40 against H1 : µ ̸= 40 (so here µ0 = 40).
• Suppose that the 95% confidence interval for µ is 35 < µ < 38.
• As µ0 = 40 lies outside this confidence interval, we reject H0 .
8.9 Example
A company reported that a new car model equipped with an enhanced manual transmission averaged 29
mpg on the highway. Suppose the Environmental Protection Agency tested 15 of the cars and obtained the
following gas mileages.
What decision would you make regarding the company’s claim on the gas mileage of the car? Perform the
required hypothesis test at the 5% significance level.
Solution:
The null and alternative hypotheses:
R output:
# Data
mlg<-c(27.3, 30.9, 25.9, 31.2, 29.7,
28.8, 29.4, 28.5, 28.9, 31.6,
27.8, 27.8, 28.6, 27.3, 27.6)
# t-test
t.test(mlg, alternative = "two.sided", mu = 29, conf.level = 0.95)
##
## One Sample t-test
##
## data: mlg
## t = -0.59878, df = 14, p-value = 0.5589
## alternative hypothesis: true mean is not equal to 29
## 95 percent confidence interval:
## 27.86979 29.63688
## sample estimates:
## mean of x
## 28.75333
# Normality test
# Kolmogorov Smirnov Test
ks.test(mlg, "pnorm", mean = mean(mlg), sd = sd(mlg))
## Warning in ks.test(mlg, "pnorm", mean = mean(mlg), sd = sd(mlg)): ties should
## not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: mlg
## D = 0.13004, p-value = 0.9616
## alternative hypothesis: two-sided
# Shapiro-Wilk test
shapiro.test(mlg)
##
## Shapiro-Wilk normality test
##
## data: mlg
## W = 0.95817, p-value = 0.6606
par(mfrow=c(1,2))
qqnorm(mlg)
qqline(mlg, col = "red")
hist(mlg)
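The t statistic, p-value and confidence interval in the output above can be reproduced by hand, which also illustrates the link between the two-sided test and the confidence interval. The following is a minimal sketch using only base R; the variable names (`mu0`, `t_stat`, `reject_by_ci`, and so on) are chosen here for illustration and do not appear in the original notes.

```r
# Gas mileage data from the example above
mlg <- c(27.3, 30.9, 25.9, 31.2, 29.7,
         28.8, 29.4, 28.5, 28.9, 31.6,
         27.8, 27.8, 28.6, 27.3, 27.6)

mu0   <- 29          # value of the mean under H0
alpha <- 0.05        # significance level
n     <- length(mlg)

# Test statistic: t = (xbar - mu0) / (s / sqrt(n)), with df = n - 1
t_stat <- (mean(mlg) - mu0) / (sd(mlg) / sqrt(n))

# Two-sided p-value from the t distribution
p_val <- 2 * pt(-abs(t_stat), df = n - 1)

# Critical value and 100(1 - alpha)% confidence interval for mu
t_crit <- qt(1 - alpha / 2, df = n - 1)
ci     <- mean(mlg) + c(-1, 1) * t_crit * sd(mlg) / sqrt(n)

# The two decision rules agree: reject H0 iff p <= alpha iff mu0 lies outside the CI
reject_by_p  <- p_val <= alpha
reject_by_ci <- mu0 < ci[1] || mu0 > ci[2]
```

Both rules lead to the same decision here: the p-value exceeds 0.05 and µ0 = 29 lies inside the interval, so H0 is not rejected.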
9 Hypothesis Testing Two Samples
9.1 Motivation
We often encounter such statements or claims:
• A newspaper claims that the average starting salary of MBA graduates is over £50K. (one sample test)
• A claim about the efficiency of a particular diet program, the average weight after the program is less
than the average weight before the program. (two paired samples test)
• On average female managers earn less than male managers, given that they have the same qualifications
and skills. (two independent samples test)
So we have claims about the populations’ means (averages) and we would like to verify or examine these
claims.
This is the kind of problem that hypothesis testing is designed to solve.
In order to compare two population means, we are going to test the null hypothesis
H0 : µ1 = µ2
• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$\bar{d} \pm t_{\alpha/2}\,\frac{s_d}{\sqrt{n}}$$
where tα/2 is the α/2 critical value from the t-distribution with df = n − 1
9.4 Comparing two means: Independent samples
To test H0 : µ1 = µ2 for two independent samples, we use one of the following test statistics, choosing the one whose assumptions are satisfied. Let σ1 and σ2 be the standard deviations of
population 1 and population 2, respectively.
9.4.1 z-test
• Assumptions: σ1 and σ2 are known and we have large samples (n1 ≥ 30, n2 ≥ 30)
• Test statistic: z-test
$$z = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$
• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
where zα/2 is the α/2 critical value from the standard normal distribution.
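As a rough illustration of the z-test and its confidence interval, the sketch below plugs summary statistics into the two formulas. All the numbers are hypothetical values made up for this illustration (they are not from the notes), and the population standard deviations are assumed known.

```r
# Hypothetical summary statistics (illustration only, not data from the notes)
xbar1 <- 52; sigma1 <- 4; n1 <- 40   # sample 1: mean, known sd, size
xbar2 <- 50; sigma2 <- 5; n2 <- 50   # sample 2: mean, known sd, size
alpha <- 0.05

# Standard error of the difference in sample means
se_diff <- sqrt(sigma1^2 / n1 + sigma2^2 / n2)

# z statistic for H0: mu1 = mu2
z <- (xbar1 - xbar2) / se_diff

# Two-sided critical value and p-value from the standard normal distribution
z_crit <- qnorm(1 - alpha / 2)
p_val  <- 2 * pnorm(-abs(z))

# 100(1 - alpha)% confidence interval for mu1 - mu2
ci <- (xbar1 - xbar2) + c(-1, 1) * z_crit * se_diff
```

With these made-up values |z| exceeds the critical value, so H0 would be rejected at the 5% level, and correspondingly the interval excludes zero.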
where tα/2 is the α/2 critical value from the t-distribution with df = n1 + n2 − 2.
• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
where tα/2 is the α/2 critical value from the t-distribution with df = ∆.
9.4.4 Levene’s Test for Equality of Variances
In order to choose between the pooled t-test and the non-pooled t-test, we need to check the assumption that the
two populations have equal (but unknown) variances. That is, we test the null hypothesis H0 : σ1² = σ2²
against the alternative H1 : σ1² ≠ σ2².
The test statistic of Levene’s test follows an F distribution with 1 and n1 + n2 − 2 degrees of freedom.
For any specific significance level α, one can obtain these critical values ±zα/2 and ±zα from the standard
normal distribution table. If the value of the test statistic falls in the rejection region, reject H0 ; otherwise
do not reject H0 .
For t-test:
For any specific significance level α, one can obtain these critical values ±tα/2 and ±tα from the T distribution
table. For example, for df = 9 and α = .05, the critical values are ±t0.025 = ±2.262 and ±t0.05 = ±1.833.
9.6 Example
In a study of the effect of cigarette smoking on blood clotting, blood samples were gathered from 11 individuals
before and after smoking a cigarette and the level of platelet aggregation in the blood was measured. Does
smoking affect platelet aggregation?
before after d
25 27 2
25 29 4
27 37 10
44 56 12
30 46 16
67 82 15
53 57 4
53 80 27
52 61 9
60 59 -1
28 43 15
$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = 10.27$$
$$s_d = 7.98$$
$$s_{\bar{d}} = \frac{s_d}{\sqrt{n}} = \frac{7.98}{\sqrt{11}} = 2.40$$
At the 90% level (α = 0.10), the critical value is t10,0.05 = 1.812, and so a 90% confidence interval is
$$\bar{d} \pm 1.812\,\frac{s_d}{\sqrt{n}} = 10.27 \pm 1.812 \times 2.40 \approx [5.91, 14.63]$$
before<-c(25,25,27,44,30,67,53,53,52,60,28)
after<-c(27,29,37,56,46,82,57,80,61,59,43)
d<-after-before
qt(0.1/2, df=10)
## [1] -1.812461
t.test(after, before, alternative = "two.sided", paired = TRUE, conf.level = 0.90)
##
## Paired t-test
##
## data: after and before
## t = 4.2716, df = 10, p-value = 0.001633
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## 5.913967 14.631488
## sample estimates:
## mean of the differences
## 10.27273
hist(d,main="",col = '#61B2F2')
qqnorm(d, pch = 1)
qqline(d, col = "steelblue", lwd = 2)
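The paired t-test output above can be checked against the hand calculation of the mean difference, its standard error and the 90% confidence interval. The sketch below uses base R only; names such as `d_bar`, `se_d` and `ci` are illustrative and are not part of the original code.

```r
# Platelet aggregation data from the example above
before <- c(25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28)
after  <- c(27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43)

d <- after - before        # paired differences
n <- length(d)

d_bar <- mean(d)           # mean of the differences
s_d   <- sd(d)             # standard deviation of the differences
se_d  <- s_d / sqrt(n)     # standard error of the mean difference

# t statistic for H0: mean difference = 0, with df = n - 1
t_stat <- d_bar / se_d

# 90% confidence interval: d_bar +/- t_{0.05, n-1} * se_d
t_crit <- qt(1 - 0.10 / 2, df = n - 1)
ci     <- d_bar + c(-1, 1) * t_crit * se_d
```

Because the interval lies entirely above zero, the two-sided test at α = 0.10 rejects H0, in agreement with the small p-value reported by R.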
10 Nonparametric Tests
10.1 Wilcoxon signed-rank test (Paired samples)
If the population of all paired differences d is symmetric but not necessarily normal, then we should use a
nonparametric test called Wilcoxon signed-rank test in order to compare the two populations, i.e. to test
H0 : no group difference.
To calculate the Wilcoxon signed-rank test statistic:
• Calculate all paired differences.
• Rank the absolute differences, that is ignoring the sign, after excluding the zeros.
• Sum the ranks of the positive and negative differences.
• The Wilcoxon signed-rank test statistic V is the minimum of these two sums. That is
V = min(V⁺, V⁻)
where V⁺ is the sum of the ranks of the absolute differences for all pairs with positive difference, and
V⁻ is the equivalent for negative differences.
• We can then compare V to the critical value, T , for a given significance level, α , and number of
non-zero differences, n, from the statistical table.
• We reject H0 at level α if V < T .
Under H0 and assuming no ties, V has the following properties:
• E[V] = µV = n(n + 1)/4.
• Var[V] = σV² = n(n + 1)(2n + 1)/24.
• The distribution of V is symmetric about µV .
• For large n, V ∼ N(µV , σV²).
So the standardized version of this test statistic is
$$Z = \frac{V - \dfrac{n(n+1)}{4}}{\sqrt{\dfrac{n(n+1)(2n+1)}{24}}}$$
10.2 Example
Consider a sample of five students’ grades in Finance and Accounting. We are interested in testing whether
the students’ grades in finance are lower than their grades in accounting, so we have a left-tailed test.
Use α = 10%.
x1 x2 x1 − x2 rank of |x1 − x2 |
73 88 -15 3
51 60 -9 2
85 65 20 4
65 66 -1 1
70 70 0 -
x1<-c(73,51,85,65,70)
x2<-c(88,60,65,66,70)
wilcox.test(x1, x2, paired = TRUE, alternative = "less")
## Warning in wilcox.test(x1, x2, paired = TRUE, alternative = "less"):
## cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: x1 and x2
## V = 4, p-value = 0.4276
## alternative hypothesis: true location shift is less than 0
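The steps for the signed-rank statistic can be traced by hand for this example. The sketch below is a minimal base-R illustration; the variable names are chosen here and are not from the notes. It also reproduces the reported p-value using the normal approximation with a continuity correction of 0.5, which is what `wilcox.test` falls back to when zero differences are present. With only four non-zero differences the approximation is rough.

```r
# Grades from the example above
x1 <- c(73, 51, 85, 65, 70)   # finance
x2 <- c(88, 60, 65, 66, 70)   # accounting

d <- x1 - x2                  # paired differences
d <- d[d != 0]                # exclude the zero differences
n <- length(d)                # number of non-zero differences

r <- rank(abs(d))             # ranks of the absolute differences

v_plus  <- sum(r[d > 0])      # sum of ranks for positive differences
v_minus <- sum(r[d < 0])      # sum of ranks for negative differences
V       <- min(v_plus, v_minus)

# Mean and standard deviation of the statistic under H0 (no ties)
mu_V    <- n * (n + 1) / 4
sigma_V <- sqrt(n * (n + 1) * (2 * n + 1) / 24)

# Left-tailed p-value via the normal approximation with continuity correction
p_less <- pnorm((v_plus - mu_V + 0.5) / sigma_V)
```

Note that R reports V⁺ (the sum of the positive ranks) as `V`; in this example it coincides with the minimum of the two sums.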
Let the ranks of the first sample in the combined sample be r1 , . . . , rn which are all integers from the set
{1, . . . , N }, where N = n + m.
The Wilcoxon rank-sum test statistic is then
$$W = \sum_{i=1}^{n} r_i$$
10.4 Example
Suppose we have two groups of salaries, in thousands of pounds, of women and men. Test the claim that, on
average, women earn less than men, so again we have a left-sided test. Use α = 5%.
Women Men
16 18
30 45
25 36
65 28
70 40
• We will consider women salaries, and the sum of ranks related to the women’s group is 1 + 3 + 5 = 9.
• For n = 3, m = 5 and α = 0.05, we can obtain the critical values from the table, so we have TL = 8 (as
we have a left-sided test)
• Since W = 9 is not less than TL = 8, we do not reject H0 .
• Notice that the value given by R is the Mann-Whitney U statistic, which is given by
$$U = W - \frac{n(n+1)}{2} = 9 - \frac{3(3+1)}{2} = 9 - 6 = 3.$$
Figure 1: [Link]
11 Correlation
11.1 Correlation and Causation
A note so important it gets its own chapter in the notes: correlation and causation are not the same thing.
Ice-cream sales are correlated with shark attacks, but this means neither that ice-cream attracts sharks, nor
that shark attacks make any survivors crave ice-cream. What’s happening is that both ice-cream sales and
shark attacks increase with temperature, which is why a correlation exists. Those two relationships really are
causal - more people want ice-creams as it gets hotter, and sharks become more active in warmer weather.
In this example, temperature is what we call a confounder, an unrecorded variable which is partially or
fully responsible for a relationship we have found between recorded variables. Note that this means the
same variable can be a confounder or not be a confounder, depending on whether we have recorded it.
12 Simple regression: Introduction
12.1 Motivation
Predicting the Price of a used car
then the “line of best fit” will correspond to the line with values of β0 and β1 that minimises Q(β0 , β1 ).
• The regression line is the line that fits a set of data points according to the least squares criterion.
• The regression equation is the equation of the regression line.
• The regression equation for a set of n data points is ŷ = b0 + b1 x, where
• y is the dependent variable (or response variable) and x is the independent variable (predictor variable
or explanatory variable).
• b0 is called the y-intercept and b1 is called the slope.
The standard error of the estimate, $s_e = \sqrt{SSE/(n-2)}$, indicates how much, on average, the observed
values of the response variable differ from the predicted values of the response variable.
Price (y) Age (x)
85 5
103 4
70 6
82 5
89 5
98 5
66 6
95 6
169 2
70 7
48 7
• For our example, age is the predictor variable and price is the response variable.
• The regression equation is ŷ = 195.47 − 20.26 x, where the slope b1 = −20.26 and the intercept
b0 = 195.47
• Prediction: for x = 4, that is, we would like to predict the price of a 4-year-old car, ŷ = 195.47 − 20.26(4) = 114.43.
Back to our used cars example, we want to find the “best line” through the data points, which can be used to
predict prices of used cars based on their age.
First we need to enter the data in R.
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)
## [1] -0.9237821
library(ggplot2)
12.5 Prediction
# simple linear regression
reg<-lm(Price~Age)
print(reg)
##
## Call:
## lm(formula = Price ~ Age)
##
## Coefficients:
## (Intercept) Age
## 195.47 -20.26
To predict the price of a 4-year-old car (x = 4):
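One way to obtain this prediction is with `predict()` on the fitted model, or equivalently by plugging x = 4 into the fitted equation. The following is a minimal sketch; the names `new_car`, `pred` and `pred_manual` are illustrative and not taken from the notes.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

# Fit the simple linear regression of Price on Age
reg <- lm(Price ~ Age)

# Prediction for a 4-year-old car using predict()
new_car <- data.frame(Age = 4)
pred    <- predict(reg, newdata = new_car)

# The same prediction from the fitted equation y-hat = b0 + b1 * x
b           <- coef(reg)
pred_manual <- b[["(Intercept)"]] + b[["Age"]] * 4
```

Because `predict()` uses the unrounded coefficients, its result can differ in the second decimal place from a calculation based on the rounded equation ŷ = 195.47 − 20.26x.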
13 Simple Regression: Coefficient of Determination
13.1 Extrapolation
• Within the range of the observed values of the predictor variable, we can reasonably use the regression
equation to make predictions for the response variable.
• However, to do so outside the range, which is called Extrapolation, may not be reasonable because
the linear relationship between the predictor and response variables may not hold here.
• To predict the price of an 11-year-old car, ŷ = 195.47 − 20.26(11) = −27.39, or −$2739. This result is
unrealistic, as no one is going to pay us $2739 to take away their 11-year-old car.
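A simple guard against extrapolation is to compare the new predictor value with the observed range before predicting. The sketch below shows one possible check; the names `x_new` and `in_range` are illustrative and not from the notes.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

x_new <- 11   # age at which we would like to predict

# Is the new value inside the range of the observed ages?
in_range <- x_new >= min(Age) && x_new <= max(Age)

# The model still returns a number, but it is an extrapolation
pred11 <- predict(reg, newdata = data.frame(Age = x_new))

if (!in_range) {
  message("Age = ", x_new, " is outside the observed range [",
          min(Age), ", ", max(Age), "]; the prediction is an extrapolation.")
}
```

Here the observed ages run from 2 to 7, so age 11 is well outside the data and the negative predicted price should not be trusted.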
13.3 Coefficient of determination
• The total variation in the observed values of the response variable, SST = Σ(yi − ȳ)², can be partitioned
into two components:
– The variation in the observed values of the response variable explained by the regression:
SSR = Σ(ŷi − ȳ)²
– The variation in the observed values of the response variable not explained by the regression:
SSE = Σ(yi − ŷi)²
• The coefficient of determination, R² (or R-square), is the proportion of the variation in the observed
values of the response variable explained by the regression, which is given by
$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$
where SST = SSR + SSE. R² is a descriptive measure of the utility of the regression equation for
making predictions.
• The coefficient of determination R2 always lies between 0 and 1. A value of R2 near 0 suggests that the
regression equation is not very useful for making predictions, whereas a value of R2 near 1 suggests
that the regression equation is quite useful for making predictions.
• For a simple linear regression (one independent variable) ONLY, R2 is the square of the Bravais
correlation coefficient, r.
• Adjusted R² is a modification of R² which takes into account the number of independent variables,
say k. In a simple linear regression k = 1. Adjusted R² increases only when a significantly related
independent variable is added to the model. Adjusted R² has a crucial role in the process of model
building. Adjusted R² is given by
$$\text{Adjusted-}R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$
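The decomposition of the total variation and the two R² measures can be computed directly from a fitted model. The sketch below does this for the used cars data with base R; names such as `SST`, `R2` and `adj_R2` are chosen for illustration.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

y_hat <- fitted(reg)                    # fitted values
SST <- sum((Price - mean(Price))^2)     # total variation
SSR <- sum((y_hat - mean(Price))^2)     # variation explained by the regression
SSE <- sum((Price - y_hat)^2)           # variation not explained

# Coefficient of determination
R2 <- SSR / SST

# Adjusted R-squared with k independent variables
n <- length(Price)
k <- 1
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)

# In simple linear regression R2 equals the squared correlation coefficient
r_squared_from_cor <- cor(Age, Price)^2
```

These values should agree with the `Multiple R-squared` and `Adjusted R-squared` lines printed by `summary(reg)`.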
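$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$$
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$$
where $\bar{x} = \frac{\sum x_i}{n}$ and $\bar{y} = \frac{\sum y_i}{n}$. And,
$$SST = S_{yy}, \qquad SSR = \frac{S_{xy}^2}{S_{xx}}, \qquad SSE = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$
and SST = SSR + SSE.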
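The computational forms of Sxx, Sxy and Syy can be verified numerically on the used cars data. This is a minimal sketch with illustrative variable names; the last line uses the standard identity r = Sxy / √(Sxx Syy) for the sample correlation coefficient, which is stated here as an assumption since that formula does not appear in this part of the notes.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)   # y
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)                # x
n     <- length(Age)

# Computational (shortcut) forms
Sxx <- sum(Age^2) - n * mean(Age)^2
Sxy <- sum(Age * Price) - n * mean(Age) * mean(Price)
Syy <- sum(Price^2) - n * mean(Price)^2

# Sums of squares expressed through Sxx, Sxy and Syy
SST <- Syy
SSR <- Sxy^2 / Sxx
SSE <- Syy - Sxy^2 / Sxx

# Sample correlation coefficient (standard identity, assumed here)
r <- Sxy / sqrt(Sxx * Syy)
```

The shortcut forms give the same values as the definitional sums of squared deviations, up to floating-point rounding.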
where r always lies between -1 and 1. Values of r near -1 or 1 indicate a strong linear relationship between
the variables whereas values of r near 0 indicate a weak linear relationship between variables. If r is zero the
variables are linearly uncorrelated, that is there is no linear relationship between the two variables.
13.6 Hypothesis testing for the population correlation coefficient ρ
Hypothesis testing for the population correlation coefficient ρ.
Assumptions:
• The sample of paired (x, y) data is a random sample.
• The pairs of (x, y) data have a bivariate normal distribution.
The null hypothesis
H0 : ρ = 0 (no significant correlation)
against one of the alternative hypotheses:
• H1 : ρ ≠ 0 (significant correlation) “Two-tailed test”
• H1 : ρ < 0 (significant negative correlation) “Left-tailed test”
• H1 : ρ > 0 (significant positive correlation) “Right-tailed test”
Compute the value of the test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim t_{n-2}, \quad \text{with } df = n-2.$$
• If the value of the test statistic falls in the rejection region, then reject H0 ; otherwise, do not reject H0 .
• Statistical packages report p-values rather than critical values which can be used in testing the null
hypothesis H0 .
13.8 Rho correlation coefficient (rs )
• When the normality assumption for the Bravais correlation coefficient r cannot be met, or when one or
both variables may be ordinal, then we should consider nonparametric methods such as the rho and
Kendall’s tau correlation coefficients.
• Rho correlation coefficient, rs , can be obtained by first ranking the x values (and the y values) among themselves
and then computing the Bravais correlation coefficient of the rank pairs. As with r, the values of rs
range from −1 to +1 inclusive, that is −1 ≤ rs ≤ 1.
• Rho correlation coefficient can be used to describe the strength of the linear relationship as well as the
nonlinear relationship.
$$r = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sqrt{\left[\sum x_i^2 - (\sum x_i)^2/n\right]\left[\sum y_i^2 - (\sum y_i)^2/n\right]}}$$
$$r = \frac{4732 - (58)(975)/11}{\sqrt{(326 - 58^2/11)(96129 - 975^2/11)}} = -0.924$$
the value of r = −0.924 suggests a strong negative linear correlation between age and price.
• Test the hypothesis H0 : ρ = 0 (no linear correlation) against H1 : ρ < 0 (negative correlation) at
significance level α = 0.05.
Compute the value of the test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.924\sqrt{11-2}}{\sqrt{1-(-0.924)^2}} = -7.249$$
Using R:
First we need to enter the data in R.
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)
From this plot it seems that there is a negative linear relationship between age and price. There are several
tools that can help us to measure this relationship more precisely.
cor.test(Age, Price,
alternative = "less",
method = "pearson", conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: Age and Price
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.7749819
## sample estimates:
## cor
## -0.9237821
Suppose now we scale both variables (standardized)
cor.test(scale(Age), scale(Price),
alternative = "less",
method = "pearson", conf.level = 0.95)
##
## Pearson's product-moment correlation
##
## data: scale(Age) and scale(Price)
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.7749819
## sample estimates:
## cor
## -0.9237821
We notice that corr(age, price in pounds) = corr(age, price in dollars).
cor.test(Age, Price,
alternative = "less",
method = "spearman", conf.level = 0.95)
##
## Spearman's rank correlation rho
##
## data: Age and Price
## S = 403.26, p-value = 0.0007267
## alternative hypothesis: true rho is less than 0
## sample estimates:
## rho
## -0.8330014
cor.test(Age, Price,
alternative = "less",
method = "kendall", conf.level = 0.95)
##
## Kendall's rank correlation tau
##
## data: Age and Price
## z = -2.9311, p-value = 0.001689
## alternative hypothesis: true tau is less than 0
## sample estimates:
## tau
## -0.7302967
As the p-values for all three tests (Bravais, rho, Kendall) are less than α = 0.05, we reject the null hypothesis of
no correlation between the age and the price at the 5% significance level.
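The t statistic and left-tailed p-value reported for the product-moment correlation can be reproduced from r and n alone, and the rho coefficient can be obtained as the ordinary correlation of the ranks. The following base-R sketch uses illustrative variable names.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n     <- length(Age)

# Sample correlation coefficient
r <- cor(Age, Price)

# Test statistic t = r * sqrt(n - 2) / sqrt(1 - r^2), with df = n - 2
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Left-tailed p-value for H1: rho < 0
p_left <- pt(t_stat, df = n - 2)

# Rho coefficient: correlation of the ranks (ties receive average ranks)
r_s <- cor(rank(Age), rank(Price))
```

The hand calculation with r rounded to −0.924 gives t = −7.249, while the unrounded r used by R gives the slightly different value shown in the output.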
Figure 2: [Link]
Y = β0 + β1 x + ϵ
where
Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.
and the regression equation for a set of n data points is ŷ = b0 + b1 x, where
$$b_1 = \frac{S_{xy}}{S_{xx}}$$
and
$$b_0 = \bar{y} - b_1\bar{x}$$
Here b0 is called the y-intercept and b1 is called the slope.
The residual standard error se can be defined as
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
se indicates how much, on average, the observed values of the response variable differ from the predicted
values of the response variable.
14.2 Example: used cars (cont.)
We can see that for each age, the mean price of all cars of that age lies on the regression line E(Y |x) = β0 +β1 x.
And, the prices of all cars of that age are assumed to be normally distributed with mean equal to β0 + β1 x
and variance σ 2 . For example, the prices of all 4-year-old cars must be normally distributed with mean
β0 + β1 (4) and variance σ 2 .
We used the least squares method to find the best fit for this data set, and the residuals can be obtained as
ei = yi − ŷi = yi − (195.47 − 20.26xi ).
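The residuals for the used cars model can be computed either by hand from the fitted values or with `resid()`. The sketch below is illustrative (the name `e` and the summary table are not from the notes) and also checks two basic properties of least squares residuals.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

# Residuals: observed minus fitted values
e <- Price - fitted(reg)

# Least squares residuals sum to zero and are uncorrelated with x
sum_e     <- sum(e)
sum_e_age <- sum(e * Age)

# Observed values, fitted values and residuals side by side
residual_table <- data.frame(Age = Age, Price = Price,
                             Fitted = round(fitted(reg), 2),
                             Residual = round(e, 2))
```

The smallest and largest residuals in `residual_table` correspond to the `Min` and `Max` entries in the residual summary printed by `summary(reg)`.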
14.3 Residual Analysis
The easiest way to check the simple linear regression assumptions is by constructing a scatterplot of the
residuals (ei = yi − ŷi ) against the fitted values (ŷi ) or against x. If the model is a good fit, then the residual
plot should show an even and random scatter of the residuals.
14.3.1 Linearity
The regression needs to be linear in the parameters.
Y = β0 + β1 x + ϵ
E(Yi |Xi ) = β0 + β1 xi ≡ E(ϵi |Xi ) = E(ϵi ) = 0
The residual plot below shows that a linear regression model is not appropriate for this data.
14.3.2 Constant error variance (homoscedasticity)
The plot shows that the spread of the residuals increases as the fitted values (or x) increase, which indicates
heteroskedasticity (non-constant variance). The standard errors are biased when heteroskedasticity
is present, but the regression coefficients will still be unbiased.
How to detect?
• Residuals plot (fitted vs residuals)
• Goldfeld–Quandt test
• Breusch-Pagan test
How to fix?
• White’s standard errors
• Weighted least squares model
• Taking the log
An example of positive autocorrelation
How to detect?
• Residuals plot
• Durbin-Watson test
• Breusch-Godfrey test
How to fix?
• Investigate omitted variables (e.g. trend, business cycle)
• Use advanced models (e.g. AR model)
##
## Call:
## lm(formula = infantMortality ~ ppgdp, data = newUN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.48 -18.65 -8.59 10.86 83.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.3780016 2.2157454 18.675 < 0.0000000000000002 ***
## ppgdp -0.0008656 0.0001041 -8.312 0.0000000000000173 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.13 on 191 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2618
## F-statistic: 69.08 on 1 and 191 DF, p-value: 0.0000000000000173
plot(newUN$infantMortality ~ newUN$ppgdp, xlab="GDP per Capita"
, ylab="Infant mortality (per 1000 births)"
, pch=16, col="cornflowerblue")
abline(fit,col="red")
We can see from the scatterplot that the relationship between the two variables is not linear. There is a
concentration of data points at small values of GDP (many poor countries) and a concentration of data
points at small values of infant mortality (many countries with very low mortality). This suggests a skewness
to both variables which would not conform to the normality assumption. Indeed, the regression line (the red
line) we construct is a poor fit and only has an R2 of 0.266.
From the residual plot below we can see clear evidence of structure to the residuals suggesting the linear
relationship is a poor description of the data, and substantial changes in spread suggesting the assumption of
homogeneous variance is not appropriate.
# diagnostic plots
plot(fit,which=1,pch=16,col="cornflowerblue")
So we can apply a transformation to one or both variables, e.g. taking the log or adding a quadratic term.
Notice that this will not violate the linearity assumption, as the regression will still be linear in the
parameters. Taking the logs of both variables gives the scatterplot of the transformed data set below,
which appears to show a more promising linear structure. The quality of the regression is now improved,
with an R² value of 0.766, which is still a little weak due to the rather large spread in the data.
fit1<- lm(log(infantMortality) ~ log(ppgdp), data=newUN)
summary(fit1)
##
## Call:
## lm(formula = log(infantMortality) ~ log(ppgdp), data = newUN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16789 -0.36738 -0.02351 0.24544 2.43503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.10377 0.21087 38.43 <0.0000000000000002 ***
## log(ppgdp) -0.61680 0.02465 -25.02 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5281 on 191 degrees of freedom
## Multiple R-squared: 0.7662, Adjusted R-squared: 0.765
## F-statistic: 625.9 on 1 and 191 DF, p-value: < 0.00000000000000022
plot(log(newUN$infantMortality) ~ log(newUN$ppgdp), xlab="log GDP per Capita"
, ylab="Log infant mortality (per 1000 births)", pch=16, col="cornflowerblue")
abline(fit1,col="red")
So we check the residuals again. We can see from the residual plot below that the log transformation has
corrected many of the problems with the residual plot, and the residuals are now much closer to the expected random
scatter.
# diagnostic plots
plot(fit1,which=1,pch=16,col="cornflowerblue")
Now let us check the Normality of the errors by creating a histogram and normal QQ plot for the residuals,
before and after the log transformation. The normal quantile (QQ) plot shows the sample quantiles of the
residuals against the theoretical quantiles that we would expect if these values were drawn from a Normal
distribution. If the Normal assumption holds, then we would see an approximate straight-line relationship on
the Normal quantile plot.
par(mfrow=c(2,2))
# before the log transformation.
plot(fit, which = 2,pch=16, col="cornflowerblue")
hist(resid(fit),col="cornflowerblue",main="")
# after the log transformation.
plot(fit1, which = 2, pch=16, col="hotpink3")
hist(resid(fit1),col="hotpink3",main="")
The normal quantile plot and the histogram of residuals (before the log transformation) show a strong departure
from the expectation of an approximate straight line, with curvature in the tails which reflects the skewness
of the data. Finally, the normal quantile plot and the histogram of residuals suggest that the residuals are much
closer to Normality after the transformation, with some minor deviations in the tails.
15 Simple Linear Regression: Inference
15.1 Simple Linear Regression Assumptions
• Linearity of the relationship between the dependent and independent variables
• Independence of the errors (no autocorrelation)
• Constant variance of the errors (homoscedasticity)
• Normality of the error distribution.
Y = β0 + β1 x + ϵ
where
Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.
and
b0 = ȳ − b1 x̄
• y is the dependent variable (or response variable) and x is the independent variable (predictor variable
or explanatory variable).
• b0 is called the y-intercept and b1 is called the slope.
15.5 Properties of Regression Coefficients
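Under the simple linear regression assumptions, the least squares estimates b0 and b1 are unbiased for β0
and β1 , respectively, i.e.
E[b0 ] = β0 and E[b1 ] = β1 .
The variances of the least squares estimators in simple linear regression are:
$$Var[b_0] = \sigma_{b_0}^2 = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
$$Var[b_1] = \sigma_{b_1}^2 = \frac{\sigma^2}{S_{xx}}$$
$$Cov[b_0, b_1] = \sigma_{b_0,b_1} = -\sigma^2\,\frac{\bar{x}}{S_{xx}}$$
Replacing σ² by its estimate s_e² gives the estimated versions:
$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$
$$s_{b_1}^2 = \frac{s_e^2}{S_{xx}}$$
$$s_{b_0,b_1} = -s_e^2\,\frac{\bar{x}}{S_{xx}}$$
and
$$b_1 \sim N(\beta_1, \sigma_{b_1}^2) \;\rightarrow\; \frac{b_1 - \beta_1}{\sigma_{b_1}} \sim N(0, 1)$$
and
$$\frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}$$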
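The estimated standard errors and covariance of b0 and b1 can be computed from these formulas and compared with what R stores for a fitted model. The following sketch uses the used cars data; the names `s_b0`, `s_b1` and `s_b0b1` are illustrative.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

n     <- length(Age)
x_bar <- mean(Age)
Sxx   <- sum((Age - x_bar)^2)

# Residual standard error s_e = sqrt(SSE / (n - 2))
s_e <- sqrt(sum(resid(reg)^2) / (n - 2))

# Estimated standard errors and covariance of the coefficients
s_b0   <- s_e * sqrt(1 / n + x_bar^2 / Sxx)
s_b1   <- s_e / sqrt(Sxx)
s_b0b1 <- -s_e^2 * x_bar / Sxx

# R's own estimates: vcov() returns the estimated covariance matrix
V <- vcov(reg)
```

The square roots of the diagonal of `V` are the values shown in the `Std. Error` column of `summary(reg)`.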
• The sample variance has n − 1 degrees of freedom, since it is computed from n pieces of data, minus the
1 parameter estimated as intermediate step, the sample mean. Similarly, having estimated the sample
mean we only have n − 1 independent pieces of data left, as if we are given the sample mean and any
n − 1 of the observations then we can determine the value of remaining observation exactly.
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$
• In linear regression, the degrees of freedom of the residuals is df = n − k∗ , where k∗ is the number of
estimated parameters (including the intercept). So for simple linear regression, we are estimating
β0 and β1 , thus df = n − 2.
• Test statistic:
$$t = \frac{b_0}{s_{b_0}}$$
has a t-distribution with df = n − 2, where sb0 is the standard error of b0 , given by
$$s_{b_0} = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$
and
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
We reject H0 at level α if |t| > tα/2 with df = n − 2.
• 100(1 − α)% confidence interval for β0 ,
$$b_0 \pm t_{\alpha/2}\, s_{b_0}$$
where tα/2 is the critical value obtained from the t-distribution table with df = n − 2.
• Test statistic:
$$t = \frac{b_1}{s_{b_1}}$$
has a t-distribution with df = n − 2, where sb1 is the standard error of b1 , given by
$$s_{b_1} = \frac{s_e}{\sqrt{S_{xx}}}$$
and
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
We reject H0 at level α if |t| > tα/2 with df = n − 2.
• 100(1 − α)% confidence interval for β1 ,
$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$
where tα/2 is the critical value obtained from the t-distribution table with df = n − 2.
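The test for the slope and its confidence interval can be carried out by hand and compared with `summary()` and `confint()`. This is a minimal sketch for the used cars data with α = 0.05; the variable names are illustrative.

```r
# Used cars data from the example above
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
reg   <- lm(Price ~ Age)

n     <- length(Age)
alpha <- 0.05
Sxx   <- sum((Age - mean(Age))^2)
s_e   <- sqrt(sum(resid(reg)^2) / (n - 2))   # residual standard error

b1   <- coef(reg)[["Age"]]    # estimated slope
s_b1 <- s_e / sqrt(Sxx)       # standard error of the slope

# t statistic and two-sided p-value for H0: beta1 = 0
t_stat <- b1 / s_b1
p_val  <- 2 * pt(-abs(t_stat), df = n - 2)

# 95% confidence interval: b1 +/- t_{alpha/2} * s_b1
t_crit <- qt(1 - alpha / 2, df = n - 2)
ci_b1  <- b1 + c(-1, 1) * t_crit * s_b1
```

Since |t| exceeds the critical value and the interval excludes zero, H0 : β1 = 0 is rejected at the 5% level.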
# simple linear regression
reg<-lm(Price~Age)
summary(reg)
##
## Call:
## lm(formula = Price ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.162 -8.531 -5.162 8.946 21.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.47 15.24 12.826 0.000000436 ***
## Age -20.26 2.80 -7.237 0.000048819 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.58 on 9 degrees of freedom
## Multiple R-squared: 0.8534, Adjusted R-squared: 0.8371
## F-statistic: 52.38 on 1 and 9 DF, p-value: 0.00004882
## 2.5 % 97.5 %
## (Intercept) 160.99243 229.94451
## Age -26.59419 -13.92833
15.12 R output
Recall that the simple linear regression model for Y on x is
Y = β0 + β1 x + ϵ
where
Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.
and the regression equation for a set of n data points is ŷ = b0 + b1 x, where
$$b_1 = \frac{S_{xy}}{S_{xx}}$$
and
$$b_0 = \bar{y} - b_1\bar{x}$$
Here b0 is called the y-intercept and b1 is called the slope.
Under the simple linear regression assumptions, s_e² is an unbiased estimate of the error variance σ², and
the residual standard error se is used as an estimate of the error standard deviation σ, where
$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$
se indicates how much, on average, the observed values of the response variable differ from the predicted
values of the response variable.
Below we will see how we can use these least squares estimates for prediction. First, we will consider
inference for the conditional mean of the response variable y given a particular value of the independent
variable x; let us call this particular value x∗ . Next we will see how to predict the value of the response
variable Y for a given value of the independent variable x∗ . For these confidence and prediction intervals to be
valid, the usual four simple regression assumptions must hold.
µ∗ = µY |x∗ = E [Y |x∗ ] = β0 + β1 x∗
but β0 and β1 are unknown. We can use the least square regression equation to estimate the unknown true
value of the regression line, so we have
µ̂∗ = b0 + b1 x∗ = ŷ ∗
This is simply a point estimate for the regression line. However, in statistics, point estimate is often not
enough, and we need to express our uncertainty about this point estimate, and one way to do so is via
confidence interval.
A 100(1 − α)% confidence interval for the conditional mean µ∗ is
$$\hat{y}^* \pm t_{\alpha/2}\cdot s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$, and tα/2 is the α/2 critical value from the t-distribution with df = n − 2.
Y∗ = β0 + β1 x∗ + ϵ
where β0 , β1 and ϵ are unknown. We will use ŷ∗ = b0 + b1 x∗ as a basis for our prediction.
A 95% confidence interval for the mean price of all 3-year-old cars is
$$\hat{y}^* \pm t_{\alpha/2} \times s_e\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$
$$[195.47 - 20.26(3)] \pm 2.262 \times 12.58\sqrt{\frac{1}{11} + \frac{(3 - 5.273)^2}{(11-1)\times 2.018}}$$
$$134.69 \pm 16.76$$
that is
$$117.93 < \mu^* < 151.45$$
15.17 Regression in R
# Build linear model
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price=Price,Age=Age)
##
## Call:
## lm(formula = Price ~ Age, data = carSales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.162 -8.531 -5.162 8.946 21.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.47 15.24 12.826 0.000000436 ***
## Age -20.26 2.80 -7.237 0.000048819 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.58 on 9 degrees of freedom
## Multiple R-squared: 0.8534, Adjusted R-squared: 0.8371
## F-statistic: 52.38 on 1 and 9 DF, p-value: 0.00004882
mean(Age)
## [1] 5.272727
var(Age)
## [1] 2.018182
qt(0.975,9)
## [1] 2.262157
newage<- data.frame(Age = 3)
predict(reg, newdata = newage, interval = "confidence")
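The interval returned by `predict()` can be reproduced from the confidence interval formula for the conditional mean. The sketch below also computes a prediction interval for a single new car, using the standard formula with an extra 1 under the square root; that formula is an assumption added here, as it does not appear in this part of the notes. Variable names are illustrative.

```r
# Used cars data and model from the example above
Price    <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age      <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales <- data.frame(Price = Price, Age = Age)
reg      <- lm(Price ~ Age, data = carSales)

x_star <- 3                          # age at which we estimate or predict
n      <- nrow(carSales)
x_bar  <- mean(carSales$Age)
Sxx    <- (n - 1) * var(carSales$Age)
s_e    <- sqrt(sum(resid(reg)^2) / (n - 2))
t_crit <- qt(0.975, df = n - 2)

# Point estimate y-hat* = b0 + b1 * x*
y_star <- sum(coef(reg) * c(1, x_star))

# 95% confidence interval for the mean price of all 3-year-old cars
ci_mean <- y_star + c(-1, 1) * t_crit * s_e *
  sqrt(1 / n + (x_star - x_bar)^2 / Sxx)

# 95% prediction interval for one 3-year-old car (assumed standard formula)
pi_new <- y_star + c(-1, 1) * t_crit * s_e *
  sqrt(1 + 1 / n + (x_star - x_bar)^2 / Sxx)
```

The prediction interval is always wider than the confidence interval, because it also accounts for the variation of an individual car's price around the regression line.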
16 Multiple Linear Regression
16.1 Multiple linear regression model
In simple linear regression, we have one dependent variable (y) and one independent variable (x). In multiple
linear regression, we have one dependent variable (y) and several independent variables (x1 , x2 , . . . , xk ).
• The multiple linear regression model, for the population, can be expressed as
Y = β0 + β1 x1 + β2 x2 + . . . + βk xk + ϵ
• The coefficient b0 (or β0 ) represents the y-intercept, that is, the value of y when x1 = x2 = . . . = xk = 0.
The coefficient bi (or βi ) (i = 1, . . . , k) is the partial slope of xi , holding all other x’s fixed. So bi (or βi )
tells us the change in y for a unit increase in xi , holding all other x’s fixed.
Price (y) Age (x1 ) Miles (x2 )
48 7 89
Below is the sample covariance matrix calculated in R, along with the scatter diagram.
The scatterplot and the correlation matrix show a fairly strong negative relationship between the price of the car and
both independent variables (age and miles). It is desirable to have a relationship between each independent
variable and the dependent variable. However, the scatterplot also shows a positive relationship between
age and miles, which is undesirable as it will cause the issue of multicollinearity.
ei = yi − ŷi
• In a multiple linear regression with k predictors, the standard error of the estimate, se , is defined by
s
SSE X
se = where SSE = (yi − ŷi )2
n − (k + 1)
• The standard error of the estimate, se, indicates how much, on average, the observed values of the
response variable differ from the predicted values of the response variable. The se is the estimate of
the common standard deviation σ.
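As a check on the definition, the standard error of the estimate can be computed by hand and compared with the value R reports. This is a sketch using the used-car data from the "Regression in R" section, with k = 2 predictors.

```r
# Used-car data as given in the "Regression in R" section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

reg <- lm(Price ~ Age + Miles, data = carSales)

n <- nrow(carSales)   # number of observations
k <- 2                # number of predictors

# Sum of squared residuals, then divide by n - (k + 1) degrees of freedom
SSE <- sum(residuals(reg)^2)
se  <- sqrt(SSE / (n - (k + 1)))

print(se)          # hand calculation
print(sigma(reg))  # residual standard error reported by R
```

The hand calculation agrees with the "Residual standard error" line in the summary output (8.805 on 8 degrees of freedom).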
\[ F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)} \]
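The F statistic can be rebuilt from the sums of squares. The sketch below uses the used-car data from the "Regression in R" section and takes SSR to be the regression sum of squares, as in the formula above.

```r
# Used-car data as given in the "Regression in R" section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

reg <- lm(Price ~ Age + Miles, data = carSales)
n <- nrow(carSales)
k <- 2

y_hat <- fitted(reg)
SSR <- sum((y_hat - mean(Price))^2)  # regression sum of squares
SSE <- sum((Price - y_hat)^2)        # error (residual) sum of squares

MSR <- SSR / k
MSE <- SSE / (n - k - 1)
F_stat <- MSR / MSE

# Upper-tail probability of the F distribution with k and n - k - 1 df
p_value <- pf(F_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

print(F_stat)
print(p_value)
```

These values match the "F-statistic: 58.61 on 2 and 8 DF" line of the summary output.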
17.5 Used cars example continued
Multiple regression equation: ŷ = 183.04 − 9.50x1 − 0.82x2
The predicted price for a 4-year-old car that has driven 45 thousand miles is
ŷ = 183.04 − 9.50(4) − 0.82(45) = 183.04 − 38.00 − 36.90 = 108.14
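The same prediction can be obtained in R with predict(). This is a sketch using the used-car data from the "Regression in R" section; new_car is a name chosen for illustration. Because predict() works from the unrounded coefficients, its answer (about 108.05) differs slightly from a hand calculation with the rounded equation.

```r
# Used-car data as given in the "Regression in R" section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

reg <- lm(Price ~ Age + Miles, data = carSales)

# A 4-year-old car with 45 thousand miles
new_car <- data.frame(Age = 4, Miles = 45)
pred <- predict(reg, newdata = new_car)
print(pred)

# Same calculation written out from the estimated coefficients
b <- coef(reg)
by_hand <- unname(b["(Intercept)"] + b["Age"] * 4 + b["Miles"] * 45)
print(by_hand)
```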
17.6 Regression in R
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles<-c(57,40,77,60,49,47,58,39,8,69,89)
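carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)
# Scatterplot matrix
# Customize upper panel: draw the points and print the correlation
upper.panel <- function(x, y){
  points(x, y, pch = 19, col = 4)
  r <- round(cor(x, y), digits = 3)
  txt <- paste0("r = ", r)
  # Save the plot coordinates and restore them when the function exits
  usr <- par("usr"); on.exit(par(usr = usr))
  par(usr = c(0, 1, 0, 1))
  text(0.5, 0.9, txt)
}
pairs(carSales, lower.panel = NULL,
      upper.panel = upper.panel)
# Fit the multiple regression model and print its summary
reg <- lm(Price ~ Age + Miles, data = carSales)
summary(reg)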
##
## Call:
## lm(formula = Price ~ Age + Miles, data = carSales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.364 -5.243 1.028 5.926 11.545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 183.0352 11.3476 16.130 0.000000219 ***
## Age -9.5043 3.8742 -2.453 0.0397 *
## Miles -0.8215 0.2552 -3.219 0.0123 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.805 on 8 degrees of freedom
## Multiple R-squared: 0.9361, Adjusted R-squared: 0.9201
## F-statistic: 58.61 on 2 and 8 DF, p-value: 0.00001666
confint(reg, level=0.95)
## 2.5 % 97.5 %
## (Intercept) 156.867552 209.2028630
## Age -18.438166 -0.5703751
## Miles -1.409991 -0.2329757
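The intervals returned by confint() are each estimate plus or minus a t critical value times its standard error. The sketch below rebuilds them by hand from the summary table, using the used-car data from this section; the names est, se_b and ci are chosen for illustration.

```r
# Used-car data as given earlier in this section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

reg <- lm(Price ~ Age + Miles, data = carSales)

# Estimates and standard errors from the coefficient table
coef_table <- coef(summary(reg))
est  <- coef_table[, "Estimate"]
se_b <- coef_table[, "Std. Error"]

# t critical value with n - k - 1 = 8 degrees of freedom
t_crit <- qt(0.975, df = df.residual(reg))

ci <- cbind(lower = est - t_crit * se_b,
            upper = est + t_crit * se_b)
print(ci)
```

Neither the Age interval nor the Miles interval contains zero, which agrees with both coefficients being significant at the 5% level in the summary output.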
17.6.1 Summary
18 Multiple Linear Regression: Assumptions
• Linearity: For each set of values, x1, x2, . . . , xk, of the predictor variables, the conditional mean of the
response variable y is β0 + β1 x1 + β2 x2 + . . . + βk xk.
• Equal variance (homoscedasticity): The conditional variances of the response variable are the same
(equal to σ²) for all sets of values, x1, x2, . . . , xk, of the predictor variables.
• Independent observations: The observations of the response variable are independent of one another.
• Normality: For each set of values, x1, x2, . . . , xk, of the predictor variables, the conditional distribution of
the response variable is a normal distribution.
• No multicollinearity: The predictor variables should not be highly correlated with one another.
Multicollinearity exists when two or more of the predictor variables are highly correlated.
18.0.1 Multicollinearity
• Multicollinearity refers to a situation in which two or more predictor variables in our multiple regression
model are highly (linearly) correlated.
• The least squares estimates remain unbiased but become unstable.
• The standard errors (of the affected variables) are likely to be high.
• Overall model fit (e.g. R-square, F, prediction) is not affected.
• Multicollinearity can be detected with the variance inflation factor (VIF) of the ith predictor,
VIFi = 1 / (1 − Ri²),
where Ri² is the R-square value obtained by regressing the ith predictor on the other predictor variables.
• VIF = 1 indicates that there is no correlation between the ith predictor variable and the other predictor
variables.
• As a rule of thumb, if VIF > 5 then multicollinearity could be a problem, and a serious problem if
VIF > 10.
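The VIF can be computed without any extra package by running the auxiliary regression directly. This is a sketch using the used-car data from the "Regression in R" section; the object names (aux_age, vif_age and so on) are chosen for illustration.

```r
# Used-car data as given in the "Regression in R" section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

# Regress each predictor on the other predictor(s) and take R-squared
aux_age   <- lm(Age ~ Miles, data = carSales)
aux_miles <- lm(Miles ~ Age, data = carSales)

R2_age   <- summary(aux_age)$r.squared
R2_miles <- summary(aux_miles)$r.squared

# VIF = 1 / (1 - R-squared of the auxiliary regression)
vif_age   <- 1 / (1 - R2_age)
vif_miles <- 1 / (1 - R2_miles)

print(c(Age = vif_age, Miles = vif_miles))
```

With only two predictors, both auxiliary regressions have the same R-square (the squared correlation between Age and Miles), so the two VIF values are identical.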
plot(reg, which=2, pch=19, col=4)
# install.packages("car")
library(car)
vif(reg)
## Age Miles
## 3.907129 3.907129
The value of VIF = 3.91 indicates a moderate correlation between age and miles in the model, but
this is not a major concern.
19 Dummy Variables
We will consider the case when we have a qualitative (categorical) predictor (also known as a factor) with
two or more levels (or possible values).
Qualitative predictors with only two levels
To include a qualitative predictor in our model, we create a dummy variable that takes on two possible
numerical values, e.g. 0 and 1.
Back to our used cars example, suppose we want to add the transmission type to our linear regression model.
So let d be a dummy variable that represents the car’s transmission type, taking the value 1 for a manual car
and the value 0 for an automatic car.
Again, y = Price and x1 = age, and let us not include x2 = miles for the moment.
Then we can regress price on age and transmission type as
y = β0 + β1 x1 + β2 d + ϵ
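Substituting the two possible values of d into this model shows what the dummy coefficient does. The following is a short worked derivation, using the coding stated above (d = 1 for manual, d = 0 for automatic):

```latex
\begin{align*}
\text{Automatic } (d = 0):&\quad y = \beta_0 + \beta_1 x_1 + \epsilon \\
\text{Manual } (d = 1):&\quad y = (\beta_0 + \beta_2) + \beta_1 x_1 + \epsilon
\end{align*}
```

The two lines share the slope β1 and differ only in their intercepts, so β2 is the difference in mean price between a manual and an automatic car of the same age.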
y = β0 + β1 x1 + β2 d1 + β3 d2 + ϵ
so for petrol cars:
y = (β0 + β2) + β1 x1 + ϵ
Y = β0 + β1 x1 + β2 x2 + ϵ
Price = β0 + (β1 + β3 × miles) × age + β2 miles + ϵ
Price = β0 + β̃1 × age + β2 miles + ϵ
where β̃1 = β1 + β3 × miles. Since β̃1 changes with x2 = miles, the effect of x1 = age on Y = Price is no
longer constant.
That is, adjusting x2 = miles will change the impact of x1 = age on Y = Price.
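A minimal sketch of this interaction model in R, using the used-car data from the "Regression in R" section. The helper age_slope() and the mileages 20 and 60 (in thousands) are hypothetical choices made only to illustrate how β̃1 varies with miles.

```r
# Used-car data as given in the "Regression in R" section
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

# Age * Miles expands to Age + Miles + Age:Miles (the interaction term)
reg_int <- lm(Price ~ Age * Miles, data = carSales)
b <- coef(reg_int)

# Effect of one extra year of age at a given mileage: beta1 + beta3 * miles
age_slope <- function(miles) {
  unname(b["Age"] + b["Age:Miles"] * miles)
}

print(age_slope(20))  # age effect for a car with 20 thousand miles
print(age_slope(60))  # age effect for a car with 60 thousand miles
```

The two printed slopes differ by 40 times the estimated interaction coefficient, which is the sense in which the effect of age depends on miles.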