Week 7 Notes - 2025

The document contains notes from Weeks 1 to 7 of an ISDS course at Durham University, covering fundamental concepts in statistics, data types, descriptive statistics, probability, random variables, and sampling. It includes detailed sections on measures of central tendency, variation, and various probability functions. The notes serve as a comprehensive guide for understanding statistical principles and methodologies.

Uploaded by

li2624425532

ISDS Week 1 to 7 Notes

Durham University

Contents
1 Basic Concepts
1.1 Some basic concepts
1.2 Branches of statistics
1.3 What’s the big idea?
1.4 Notation

2 Data Types in Statistics
2.1 Data collection methods (Traditional data)
2.2 Data collection methods (Big data)
2.3 Types of data
2.4 Types of data (Econometrics)
2.5 Levels of measurement

3 Descriptive Statistics
3.1 Measures of Central Tendency
3.2 Measure of Variation (Dispersion)
3.3 Shape of a distribution: Skewness
3.4 Shape of a distribution: Kurtosis
3.5 Modality
3.6 Symmetry
3.7 Empirical Rule
3.8 Measure of Position: z-score
3.9 Percentiles and Quartiles
3.10 Five-number summary & Boxplots
3.11 Outliers & Extreme values
3.12 Descriptive statistics for qualitative variables
3.13 Example: Accounting final exam grades

4 Probability
4.1 The Basic Idea
4.2 Outcomes and Events
4.3 Probabilities As Proportions
4.4 Relative Frequencies
4.5 Independence
4.6 Mutual Exclusivity
4.7 Complements
4.8 An Introduction To Expectation
4.9 Conditional Probability
4.10 Bayes Rule

5 Random Variables
5.1 What is a Variable?
5.2 Making Variables Random
5.3 Probability Functions
5.4 Introduction to Discrete and Continuous Probability Functions
5.5 Discrete Random Variables
5.6 Continuous Random Variables
5.7 Cumulative distribution function
5.8 Characteristics of probability distributions
5.9 Some useful continuous distributions
5.10 Joint distributions
5.11 Conditional probability (density) function, PDF
5.12 Properties of Expected values and Variance
5.13 Covariance
5.14 Correlation Coefficient
5.15 Conditional expectation and conditional variance

6 Sampling
6.1 Sampling
6.2 Random versus non-random sampling
6.3 Simple random sampling
6.4 Central limit theorem
6.5 Sampling distribution of the sample mean x̄
6.6 Sampling distribution of the sample proportion
6.7 Sampling distribution of the sample variance
6.8 Example

7 Estimation
7.1 Estimation
7.2 Point estimation
7.3 Interval estimation
7.4 Confidence intervals for the population mean
7.5 Interpreting confidence intervals
7.6 Confidence interval for a population proportion
7.7 Confidence interval for a population variance
7.8 Example

8 Hypothesis Testing One Sample
8.1 Hypothesis testing: Motivation
8.2 The nature of hypothesis testing
8.3 Type I and Type II Errors
8.4 Hypothesis tests for one population mean
8.5 The p-value approach to hypothesis testing
8.6 Critical-value approach to hypothesis testing
8.7 Hypothesis testing and confidence intervals
8.8 Test of Normality
8.9 Example

9 Hypothesis Testing Two Samples
9.1 Motivation
9.2 Hypothesis tests for two population means
9.3 Comparing two means: Paired (related) samples
9.4 Comparing two means: Independent samples
9.5 Critical-value approach to hypothesis testing
9.6 Example

10 Nonparametric Tests
10.1 Wilcoxon signed-rank test (Paired samples)
10.2 Example
10.3 Wilcoxon rank-sum test (Independent samples)
10.4 Example

11 Correlation
11.1 Correlation and Causation

12 Simple regression: Introduction
12.1 Motivation
12.2 Simple linear regression
12.3 Least-Squares criterion
12.4 Example: used cars (cont.)
12.5 Prediction

13 Simple Regression: Coefficient of Determination
13.1 Extrapolation
13.2 Outliers and influential observations
13.3 Coefficient of determination
13.4 Notation used in regression
13.5 Bravais correlation coefficient
13.6 Hypothesis testing for the population correlation coefficient ρ
13.7 Correlation and linear transformation
13.8 Rho correlation coefficient (rs)
13.9 Kendall’s tau (τ) correlation coefficient
13.10 Example: used cars

14 Simple Linear Regression: Assumptions
14.1 Simple Linear Regression Assumptions (SLR)
14.2 Example: used cars (cont.)
14.3 Residual Analysis
14.4 Example: Infant mortality and GDP

15 Simple Linear Regression: Inference
15.1 Simple Linear Regression Assumptions
15.2 Simple Linear Regression
15.3 The simple linear regression equation
15.4 Residual standard error, se
15.5 Properties of Regression Coefficients
15.6 Sampling distribution of the least square estimators
15.7 Degrees of Freedom
15.8 Inference for the intercept β0
15.9 Inference for the slope β1
15.10 How useful is the regression model?
15.11 Example: used cars (cont.)
15.12 R output
15.13 Simple Linear Regression: Confidence and Prediction intervals
15.14 Inference for the regression line E[Y | x∗]
15.15 Inference for the response variable Y for a given x = x∗
15.16 Example: used cars (cont.)
15.17 Regression in R

16 Multiple Linear Regression
16.1 Multiple linear regression model
16.2 Example: used cars (cont.)

17 Multiple Linear Regression: Fit and Inference
17.1 Coefficient of determination, R² and adjusted R²
17.2 The residual standard error, se
17.3 Inferences about a particular predictor variable
17.4 How useful is the multiple regression model?
17.5 Used cars example continued
17.6 Regression in R

18 Multiple Linear Regression: Assumptions
18.1 Regression in R (regression assumptions)

19 Dummy Variables

1 Basic Concepts
1.1 Some basic concepts
• Data consist of information coming from observations, counts, measurements, or responses.
• Statistics is the science of collecting, organising, analysing, and interpreting data in order to make
decisions.
• A population is the collection of all outcomes, responses, measurements, or counts that are of interest.
Populations may be finite or infinite. If a population of values consists of a fixed number of these values,
the population is said to be finite. If, on the other hand, a population consists of an endless succession
of values, the population is an infinite one.
• A sample is a subset of a population.
• A parameter is a numerical description of a population characteristic.
• A statistic is a numerical description of a sample characteristic.

1.2 Branches of statistics


The study of statistics has two major branches - descriptive statistics and inferential statistics:
• Descriptive statistics is the branch of statistics that involves the organisation, summarisation, and
display of data.
• Inferential statistics is the branch of statistics that involves using a sample to draw conclusions about
a population, e.g. estimation and hypothesis testing.

1.3 What’s the big idea?


There are many qualities of a population we might be interested in. These qualities are referred to as
parameters. In general, we can never know the exact values of these parameters. What we do instead is find
the corresponding values from a sample, and use these as estimates of the parameter values. These estimates
found from the sample are referred to as statistics.
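The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch: the population below is simulated rather than real, and the names `population`, `smp`, `mu` and `x_bar` are chosen here for illustration.

```r
# Hypothetical finite population of N = 10,000 values (simulated, not real data)
set.seed(1)
population <- rnorm(10000, mean = 50, sd = 10)

# Parameter: a numerical description of the whole population
mu <- mean(population)

# Statistic: the same description computed from a sample of n = 100
smp <- sample(population, size = 100)
x_bar <- mean(smp)

c(parameter = mu, statistic = x_bar)
```

In practice only the sample is observed, so `x_bar` is all we have to estimate `mu`. The simulation lets us see both side by side.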

1.4 Notation
Below is a table containing commonly-used notation for some of the parameters and statistics we will deal
with most often.

                     Population (parameter)   Sample (statistic)
Size                 N                        n
Mean                 µ                        x̄
Variance             σ²                       s²
Standard deviation   σ                        s
Proportion           π                        π̂
Correlation          ρ                        r

2 Data Types in Statistics


2.1 Data collection methods (Traditional data)
There are several approaches we can take when collecting data:
• Take a census: a census is a count or measure of an entire population. Taking a census provides
complete information, but it is often costly and difficult to perform.
• Use sampling: a sample is a count or measure of a part of a population. Statistics calculated from a
sample are used to estimate population parameters.
• Use a simulation: collecting data often involves the use of computers. Simulations allow studying
situations that are impractical or even dangerous to create in real life and often save time and money.
• Perform an experiment: e.g. to test the effect of imposing a new marketing strategy, one could
perform an experiment by using the new marketing strategy in a certain region.

2.2 Data collection methods (Big data)


The characteristics of big data (the 4Vs):
• Volume: how much data is there?
• Variety: different types of data?
• Velocity: at what speed?
• Veracity: how accurate?

2.3 Types of data


Data sets can consist of two types of data:
• Qualitative (categorical) data consist of attributes, labels, or non-numerical entries, e.g. names of
cities, gender.
• Quantitative data consist of numerical measurements or counts, e.g. heights, weights, age. Quantitative
data can be distinguished as:
– Discrete data result when the number of possible values is either a finite number or a “countable”
number, e.g. the number of phone calls you receive on any given day.
– Continuous data result from infinitely many possible values that correspond to some continuous
scale that covers a range of values without gaps, interruptions, or jumps, e.g. height, weight, sales
and market shares.

2.4 Types of data (Econometrics)
• Cross-sectional data: Data on different entities (e.g. workers, consumers, firms, governmental units)
for a single time period. For example, data on test scores in different school districts.
• Time series data: Data for a single entity (e.g. person, firm, country) collected at multiple time
periods. For example, the rate of inflation or of unemployment for a country over the last 10 years.
• Panel data: Data for multiple entities in which each entity is observed at two or more time periods.
For example, the daily prices of a number of stocks over two years.

2.5 Levels of measurement


• Nominal: Categories only, data cannot be arranged in an ordering scheme. (e.g. Marital status: single,
married etc.)
• Ordinal: Categories are ordered, but differences cannot be determined or they are meaningless (e.g. a
rating poor, average, good)
• Interval: Differences between values are meaningful, but there is no natural zero starting point, so ratios
are meaningless (e.g. we cannot say that a temperature of 80°F is twice as hot as 40°F).
• Ratio: Like the interval level, but there is a natural zero starting point and ratios are meaningful
(e.g. £20 is twice as much as £10).

3 Descriptive Statistics
3.1 Measures of Central Tendency
Measures of central tendency provide numerical information about a ‘typical’ observation in the data.
• The mean (also called the average) of a data set is the sum of the data values divided by the number
of observations.

Sample mean: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$
• The median is the middle observation when the data set is sorted in ascending order. If the data set
has an even number of observations, the median is the mean of the two middle observations.
• The mode is the data value that occurs with the greatest frequency. If no entry is repeated, the data
set has no mode. If two (more than two) values occur with the same greatest frequency, each value is a
mode and the data set is called bimodal (multimodal).
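The three measures above can be computed in R as follows. This is a minimal sketch on made-up data. Base R has no function for the statistical mode, so `sample_modes` is a hypothetical helper written here for illustration.

```r
x <- c(2, 3, 3, 5, 7, 7, 9)   # made-up data for illustration

mean(x)     # sum(x) / length(x)
median(x)   # middle value of the sorted data

# Base R's mode() reports the storage type, not the statistical mode,
# so we write a small helper (sample_modes is a hypothetical name).
sample_modes <- function(v) {
  counts <- table(v)
  if (max(counts) == 1) return(v[0])      # no value repeats: no mode
  as.numeric(names(counts)[counts == max(counts)])
}

sample_modes(x)   # 3 and 7 both occur twice, so this data set is bimodal
```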

3.2 Measure of Variation (Dispersion)


The variation (dispersion) of a set of observations refers to the variability that they exhibit - how far from
the average value do we expect individual values to be, in general?
• Range = maximum data value - minimum data value
• The variance measures the variability or spread of the observations from the mean.

Sample variance: $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$
• Shortcut formula for sample variance is given by

Sample variance: $s^2 = \frac{1}{n-1} \left( \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \right)$

• The standard deviation (s) of a data set is the square root of the sample variance.
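As a quick check, the definition and the shortcut formula for the sample variance can be compared numerically. The sketch below uses made-up data. `var()` and `sd()` are base R functions that use the n − 1 divisor.

```r
x <- c(4, 8, 6, 5, 7)   # made-up data
n <- length(x)

s2_definition <- sum((x - mean(x))^2) / (n - 1)
s2_shortcut   <- (sum(x^2) - n * mean(x)^2) / (n - 1)

all.equal(s2_definition, s2_shortcut)   # the two formulas agree
all.equal(s2_definition, var(x))        # var() uses the n - 1 divisor
all.equal(sqrt(s2_definition), sd(x))   # sd() is the square root of var()
```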

3.3 Shape of a distribution: Skewness


Skewness is a measure of the asymmetry of the distribution.

3.4 Shape of a distribution: Kurtosis


Kurtosis measures the degree of peakedness or flatness of the distribution.

3.5 Modality
The number of highest points in a distribution gives us the modality. Note that in a situation in which a
distribution has two or more “humps” which aren’t equally high, we still describe the shape of the graph as
“bimodal” or “multimodal”, even though only the (equal) highest point of the curve represents the actual
mode(s) of the data.

3.6 Symmetry
Below are three common forms of symmetrical distribution. Note that if a symmetrical distribution is also
unimodal, the median, mode and mean will all be equal.

3.7 Empirical Rule


The empirical rule states that, for normally distributed data, about 68% of the data fall within one standard
deviation of the mean, about 95% within two standard deviations, and about 99.7% within three standard
deviations.
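The percentages in the empirical rule can be recovered from the standard normal distribution function. The short sketch below uses base R's `pnorm()`. The variable names are chosen here for illustration.

```r
k <- 1:3                               # number of standard deviations
within_k_sd <- pnorm(k) - pnorm(-k)    # P(-k < Z < k) for a standard normal Z
round(100 * within_k_sd, 1)
```

The results are approximately 68.3%, 95.4% and 99.7%, which the rule rounds to 68%, 95% and 99.7%.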

3.8 Measure of Position: z-score


The z-score of an observation tells us the number of standard deviations that the observation is from the
mean, that is, how far the observation is from the mean in units of standard deviation.

$z = \frac{x - \bar{x}}{s}$
As the z-score has no unit, it can be used to compare values from different data sets or to compare values
within the same data set. The mean of z-scores is 0 and the standard deviation is 1.
Note that s > 0 so if z is negative, the corresponding x-value is below the mean. If z is positive, the
corresponding x-value is above the mean. And if z = 0, the corresponding x-value is equal to the mean.
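The properties stated above (mean 0, standard deviation 1, and the sign of z) can be checked on a small made-up data set. This is only an illustrative sketch.

```r
x <- c(12, 15, 9, 20, 14)            # made-up data
z <- (x - mean(x)) / sd(x)

mean(z)          # 0, up to floating-point rounding
sd(z)            # 1
z[x < mean(x)]   # observations below the mean have negative z-scores
```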

3.9 Percentiles and Quartiles


• Given a set of observations, the kth percentile Pk is the value of X such that k% or less of the
observations are less than Pk and (100 − k)% or less of the observations are greater than Pk.
• The 25th percentile, Q1, is often referred to as the first quartile.
• The 50th percentile (the median), Q2, is referred to as the second or middle quartile.
• The 75th percentile, Q3, is referred to as the third quartile.
• Here is a quick example.

70% of people are shorter than the red figure. 30% of people are taller than the blue figure. The 70th
percentile for height therefore lies between the blue and red figures.

• The three quartiles divide a data set into quarters (four equal parts). As the diagram below shows, the
four parts do not necessarily have equal lengths; it is the number of data points within each part that
is the same.

• The interquartile range (IQR) of a data set is the difference between the third and first quartiles
(IQR = Q3 − Q1).
• The IQR is a measure of variation that gives you an idea of how much the middle 50% of the data
varies.
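Quartiles and the IQR can be computed in R as sketched below, using the ten accounting grades from the example in Section 3.13. Note that `quantile()` offers several interpolation rules. The default (`type = 7`) is assumed here, and other rules or hand calculations can give slightly different quartiles.

```r
x <- c(51, 63, 65, 70, 73, 77, 79, 79, 85, 88)   # grades from Section 3.13

q <- quantile(x, probs = c(0.25, 0.5, 0.75))     # default interpolation (type = 7)
q

iqr_manual <- unname(q["75%"] - q["25%"])        # IQR = Q3 - Q1
iqr_manual
IQR(x)                                           # built-in equivalent
```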

3.10 Five-number summary & Boxplots


To draw a boxplot (also called a box-and-whisker plot), we need the following values (called the five-number
summary):
• The minimum entry
• The first quartile Q1
• The median (second quartile) Q2
• The third quartile Q3
• The maximum entry

The box represents the interquartile range (IQR), which contains the middle 50% of values.

3.11 Outliers & Extreme values


Some data sets contain outliers or extreme values: observations that fall well outside the overall pattern of
the data. Boxplots can help us to identify such values if some rules of thumb are used, e.g.:
• Outlier: Cases with values between 1.5 and 3 box lengths (the box length is the interquartile range)
from the upper or lower edge of the box.
• Extremes: Cases with values more than 3 box lengths from the upper or lower edge of the box.
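The two rules of thumb above can be applied in code. The sketch below uses made-up data and measures how far each value lies beyond the nearer edge of the box, in box lengths. It assumes the quartiles from `quantile()` with its default settings. `boxplot()` itself uses a slightly different quartile definition (hinges), so borderline cases may be classified differently.

```r
x <- c(10, 12, 13, 14, 15, 16, 17, 18, 30, 45)   # made-up data

q1 <- unname(quantile(x, 0.25))
q3 <- unname(quantile(x, 0.75))
box_length <- q3 - q1                     # the interquartile range

# Distance beyond the nearer edge of the box, in box lengths (0 if inside the box)
beyond <- pmax(q1 - x, x - q3, 0) / box_length

label <- cut(beyond, breaks = c(-Inf, 1.5, 3, Inf),
             labels = c("regular", "outlier", "extreme"))
data.frame(x, beyond = round(beyond, 2), label)
```

Here 30 lies between 1.5 and 3 box lengths above the box, so it is labelled an outlier, while 45 lies more than 3 box lengths above and is labelled an extreme.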

3.12 Descriptive statistics for qualitative variables
• Frequency distributions are tabular or graphical presentations of data that show each category for a
variable and the frequency of the category’s occurrence in the data set. Percentages for each category
are often reported instead of, or in addition to, the frequencies.
• The mode can be used in this case as a measure of central tendency.
• Bar charts and pie charts are often used to display the results of categorical or qualitative variables. Pie
charts can become cluttered and difficult to read if variables have many categories. Pie charts should
always include information on the total number of data points.
• Bar charts can also be used to group together numerical values. Doing so loses the original values,
however. An alternative is a stem-and-leaf plot, which makes the bars out of data values themselves.
• A dot plot can be used to quickly compare numerical values between multiple categories (clustered bar
charts can also do this).

3.13 Example: Accounting final exam grades


The accounting final exam grades of 10 students are: 88, 51, 63, 85, 79, 65, 79, 70, 73, and 77. Their study
programs, respectively, are: MA, MA, MBA, MBA, MBA, MBA, MBA, MSc, MSc, and MSc.
• The sample mean grade is
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = \frac{1}{10}(88 + 51 + \dots + 77) = 73$

• Next we arrange the data from the lowest to the highest grade: 51, 63, 65, 70, 73, 77, 79, 79, 85,
88. The median grade is 75, which is located midway between the 5th and 6th ordered data points:
(73 + 77)/2 = 75.
• The mode is 79 since it appears twice and all other grades appeared only once.
• The range is 88 − 51 = 37.
• The sample variance:
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{1}{9}\left((88 - 73)^2 + \dots + (77 - 73)^2\right) = 123.78$
• The sample standard deviation: $s = \sqrt{123.78} = 11.13$
• The coefficient of variation: CV = s/x̄ = 11.13/73 = 0.1525
• Empirical rule: the empirical rule states (for normally distributed data) that about 68% of the data fall
within one standard deviation of the mean. In our example, this would suggest that roughly 68% of the
grades fall between 61.87 and 84.13 (73 ± 11.12555).
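The coefficient of variation and the one-standard-deviation interval above can be checked directly. This is a short sketch using the same ten grades. The variable names `cv`, `lower` and `upper` are chosen here for illustration.

```r
grades <- c(88, 51, 63, 85, 79, 65, 79, 70, 73, 77)

cv <- sd(grades) / mean(grades)   # coefficient of variation
cv

lower <- mean(grades) - sd(grades)
upper <- mean(grades) + sd(grades)
c(lower, upper)

# Share of observed grades that actually fall within one standard deviation
mean(grades >= lower & grades <= upper)
```

Using the unrounded standard deviation gives a coefficient of variation of about 0.1524. The value 0.1525 above comes from the rounded s = 11.13. Seven of the ten grades (70%) lie inside the interval, close to the 68% suggested by the empirical rule.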

# R codes for "Accounting final exam grades" example
# Data example
grades<-c(88,51,63,85,79,65,79,70,73,77)
program<-factor(c("MA","MA","MBA","MBA","MBA","MBA","MBA","MSc","MSc","MSc"))

# no of observations
length(grades)
## [1] 10
# Mean, Median, Variance, standard deviation, range, quantile
mean(grades)

## [1] 73
median(grades)
## [1] 75
var(grades)
## [1] 123.7778
sd(grades)
## [1] 11.12555
range(grades)
## [1] 51 88
quantile(grades,probs=c(0,0.25,0.5,0.75,1))
## 0% 25% 50% 75% 100%
## 51.00 66.25 75.00 79.00 88.00

# Summary
summary(grades)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 51.00 66.25 75.00 73.00 79.00 88.00
# Calculate z-score
(grades-mean(grades))/sd(grades)
## [1] 1.3482484 -1.9774310 -0.8988323 1.0785987 0.5392994 -0.7190658
## [7] 0.5392994 -0.2696497 0.0000000 0.3595329
scale(grades)
## [,1]
## [1,] 1.3482484
## [2,] -1.9774310
## [3,] -0.8988323
## [4,] 1.0785987
## [5,] 0.5392994
## [6,] -0.7190658
## [7,] 0.5392994
## [8,] -0.2696497
## [9,] 0.0000000
## [10,] 0.3595329
## attr(,"scaled:center")
## [1] 73
## attr(,"scaled:scale")
## [1] 11.12555
# Histograms present frequencies for values grouped into intervals.
hist(grades,xlab="grades", main="Histogram of grades")

# Boxplot
boxplot(grades,xlab="grades")

[Figure: boxplot of grades]

In a stem-and-leaf plot, each score on a variable is divided into two parts: the stem gives the leading digits
and the leaf shows the trailing digits.
The accounting final exam grades (arranged from the lowest to the highest grade) are: 51, 63, 65, 70, 73, 77,
79, 79, 85, 88.
# Stem-and-leaf plot.
stem(grades)
##
## The decimal point is 1 digit(s) to the right of the |
##
## 5 | 1
## 6 | 35
## 7 | 03799
## 8 | 58
A dot plot is a simple graph to show the relative positions of the data points.
# One colour per study program (MA = red, MBA = blue, MSc = orange)
col2<-as.character(factor(program,labels=c("red","blue","orange")))
dotchart(grades, labels=factor(1:10), groups=program, pch=16, col=col2, xlab="Grades",xlim=c(45,100))

# Frequency table
table(program)
## program
## MA MBA MSc
## 2 5 3
# Pie and Bar charts
pie(table(program))

barplot(table(program))

4 Probability
4.1 The Basic Idea
Probability is a measurement, a way we express how likely something is to happen. What makes probability
so interesting as a measurement is that a) it has no units, and b) there are lots of different ways to conceive of
and calculate what a probability could and “should” be.
Despite there being many different interpretations and philosophies of probability (a topic we will mostly be
steering clear of in this module, but which will become very important next term if you take any module
relating to the idea of Bayesian statistics), there are a few basic ideas everyone agrees on, which are:
1. An impossible event has probability 0, and no probability can ever be lower than 0.
2. A certain event has probability 1, and no probability can ever be higher than 1.
3. An event which is neither impossible nor certain has a probability between 0 and 1.
4. When comparing multiple events, an event with a higher probability is more likely to happen than an
event with a lower probability. Events with the same probability are equally likely to happen.
One common way to think about probabilities is as functions - we input an event, and our output is a
number telling us how likely that event is to happen. We’ll talk more about this idea of probabilities as
functions later in these notes.
Even this very quick, simple and broad summary throws up more questions, though, starting with one we’ll
answer in the next subsection: what is an event?

4.2 Outcomes and Events


Before I can define an event, I need to define two other aspects of probability theory: outcomes and the
outcome space. For any situation in which we are uncertain about what is going to happen:
• An outcome is something that could happen, and which cannot happen in more than one relevant
way.
• The outcome space is the collection of all the outcomes for the situation. Whatever the result of the
situation, that result must correspond to one and only one outcome in the outcome space.
Note my use of the word “relevant” here - this is very important. What we consider relevant in any given
situation is up to us, so it’s possible two different people might define their outcomes for the same situation
in different ways. As long as each of them is fully clear on their choice of outcomes and outcome space,
this is entirely fine. As I say in the videos, though, if in doubt, define “relevant” as widely as you can. If
information collected turns out to not be useful, you can disregard it. If information not collected turns out
to be relevant, then you could be in real trouble!

Example 4.1 I am about to roll a six-sided dice, and I want to express the probability of each one of the six
numbers on the dice being the one that lands face-up. One way to define the outcomes here would be each of
the six possible numbers, 1 to 6. If those are my choice of outcomes, my outcome space would be each of
those numbers, {1, 2, 3, 4, 5, 6}.
Alternatively, if I wished to, I could define the outcomes as “an even number” and “an odd number”, in which
case I might write my outcome space as, say, {O, E}. We normally wouldn’t do this, because we can break
each of those outcomes up into three separate results, but if we considered the specific number we roll to be
irrelevant for our purposes, then nothing here about {O, E} as an outcome space is in any sense incorrect
or invalid.
Choosing your outcomes to be, say, “at least three” and “no more than three” would be incorrect, however,
because if you roll a three, that would mean more than one outcome has occurred at the same time, which
is not permissible. Similarly, you couldn’t choose “less than three” and “more than three” as your only
outcomes, as rolling a three would mean no outcome had occurred.

Example 4.2 I am going to play Bizzfin, a game in which each turn I roll a fair six-sided dice and take a card
from a standard western deck of playing cards, containing 52 cards. I gain or lose points in the game depending
on the combination of dice and card. What is the outcome space for the game?
In this case, I have two things to keep track of, the dice score, of which there are 6 possible values, and the card
drawn, of which there are 52 possible values. The outcome space is every possible combination of dice score and
card, of which there are 6 × 52 = 312. I won’t list them all here, but the outcomes could be expressed as paired
values such as (1, 7C), representing a score of 1 on the dice and drawing the Seven of Clubs. The outcome
space could then be represented as, say, {(1, 1C), (2, 1C), . . . , (6, 1C), (1, 2C), . . . , (6, KC), (1, 1D), . . . , (6, KS)}.
There are other ways we could represent all this, what matters is making our intent clear, and being consistent
in whatever approach we’re using.
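
As a rough sketch of how this outcome space could be built in R, the code below pairs every dice score with every card. The rank and suit labels (with 1 standing for the Ace, and C, D, H, S for the suits) are illustrative assumptions chosen to match the notation above; any consistent labelling would do.

```r
# Build the Bizzfin outcome space: every (dice score, card) pair.
# Labels are illustrative: rank 1 stands for the Ace; C, D, H, S are the suits.
ranks <- c(1:10, "J", "Q", "K")
suits <- c("C", "D", "H", "S")
cards <- as.vector(outer(ranks, suits, paste0))  # 52 labels such as "7C"

outcomes <- expand.grid(dice = 1:6, card = cards, stringsAsFactors = FALSE)

nrow(outcomes)   # number of outcomes, 6 * 52
head(outcomes)   # the first few (dice, card) pairs
```

Here expand.grid() varies the dice score fastest, so the rows appear in the same order as the listing above.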

We can now define an event. An event is either an outcome, or a combination of outcomes. For our previous
example, if {1, 2, 3, 4, 5, 6} is our outcome space, then any element or combination of elements from that set
is an event. “1” is an event, “4 or more” is an event, “not prime” is an event, and so on.
Note that this means all outcomes are events, indeed we call them simple events. Not all events are outcomes,
though; if an event comprises more than one outcome, it is called a compound event.
One last event we need to consider is the empty event. This is the event that no outcome occurs. This
is impossible, as we must always have one outcome occur. As a result, the empty event has probability 0.
It might seem odd to insist on this idea of an impossible event, which doesn’t contain any outcomes and
therefore has probability 0. It’s very useful in terms of the set theory that mathematicians use to make
probability work, though, which is why it’s important to consider it.

4.3 Probabilities As Proportions


So how do we calculate a probability? Again, there are a number of different ways to answer that question,
and again, we’re going to essentially ignore that fact during this module.
For our purposes, it suffices to think of probabilities as being proportions. The probability of an outcome is
defined as the proportion of times an outcome does happen, out of all the times that outcome could have
happened.

$$P(\text{Outcome}) = \frac{\text{Number of times outcome happens}}{\text{Number of times outcome could have happened}}$$

Example 4.3 I am about to roll a four-sided dice, which I know to be fair (each number is equally likely to be
rolled). I define my outcome space as {1, 2, 3, 4}. What is the probability I roll a 4?
Under the definition above, the probability of rolling a 4 is the number of times a 4 is rolled on a fair
four-sided dice, divided by the number of times the dice is rolled, because a four showing is something that
could happen on each roll.
Because the dice is fair, I will roll a 4 one fourth of the time I roll the dice.

It’s important to note in the above example that I used a theoretical property of the dice - it is “fair”. If I
actually roll the dice multiple times, there is no guarantee I will get a 4 precisely one fourth of the time -
indeed this is impossible if I don’t throw the dice a number of times which is divisible by four! We will come
back to this in the next subsection.
We find the probability of an event by adding together the probabilities of each outcome making the event up
(remember an event by definition is made up of one or more outcomes). Alternatively, we can just tweak our
previous definition - the probability of an event is defined as the proportion of times an event does happen,
out of all the times that event could have happened. Thanks to the laws of maths, these two definitions are
actually equivalent.

$$P(\text{Event}) = \frac{\text{Number of times event happens}}{\text{Number of times event could have happened}}$$

Example 4.4 I am about to roll a four-sided dice, which I know to be fair (each number is equally likely to be
rolled). I define my outcome space as {1, 2, 3, 4}. What is the probability I roll less than 4?
We can find this in two ways. Firstly, I could add up the probabilities for each of the three outcomes (1, 2
and 3) which make up the event “less than 4”. Due to the dice being fair, each of these outcomes has the
same probability as the outcome of rolling a 4, so they sum to three quarters. Alternatively, if I roll a fair
four-sided dice, I will roll a 1, 2, or 3 three quarters of the time.

The above example shows us two important general ideas. Firstly, if we wanted to calculate the probability
of rolling a 1, 2, 3 or 4, then that probability would equal 0.25 + 0.25 + 0.25 + 0.25 = 1. This makes sense,
though, because the event I’m considering now is one which contains every outcome. Therefore that event
must happen.
Secondly, in a situation in which you have, say, n outcomes, each of which is equally likely, the probability of
each outcome must be 1/n, since they all have to have the same probability (otherwise they’re not equally
likely!) and adding all n of them together must result in a value of 1.
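
A minimal sketch of these two ideas in R, using the fair four-sided dice from the examples above. The variable names are my own choices for illustration.

```r
# Fair four-sided dice: each of the n outcomes has probability 1/n.
outcome_space <- 1:4
n <- length(outcome_space)
p <- rep(1 / n, n)
names(p) <- outcome_space

sum(p[outcome_space < 4])   # P(less than 4): add up the outcome probabilities
mean(outcome_space < 4)     # the same value, as a proportion of the outcomes
sum(p)                      # the event containing every outcome
```

The first two lines of output should agree, which is the equivalence between "adding outcome probabilities" and "taking a proportion" described above.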

4.4 Relative Frequencies


Finding probabilities is actually very simple, then, so long as all outcomes are equally likely. Unfortunately,
that’s very often not the case. Often we don’t know how much more or less likely one outcome is than another.
Perhaps a coin is clipped, or a dice is loaded, or we’re in any one of the billions of other circumstances where
we can’t assume all possible outcomes are just as likely as each other.
In such circumstances, we tend to resort to experimentation. In what follows we assume it’s possible to run
multiple experiments, each in identical or nearly identical circumstances.

The relative frequency of an event is our estimate of that event’s probability. It is equal to the number of
times we ran an experiment in which the event happened, divided by the total number of experiments we ran.

$$P(\text{Event}) \approx \text{Relative frequency of event} = \frac{\text{Number of experiments in which event happens}}{\text{Total number of experiments run}}$$

Note that if the event never happens in our experiments, the relative frequency is 0 (suggesting the event
could be impossible). If the event always happens in our experiments, the relative frequency is 1 (suggesting
the event might always happen). The more times we run the experiment, the more accurate we expect our
relative frequencies to be (we might have to do this a lot if we’re looking for an estimate of the probability
of a rare event, and we don’t want that estimate to just be 0). This is hopefully not surprising, since we
can think of a probability as being the proportion of times an event occurs during an infinite sequence of
experiments.

Example 4.5 I use R to simulate tosses of a fair coin. I ask R to do this 10 times, getting 2 results of Heads.
I then ask R to do this 10,000 times, getting 5,056 results of Heads. Finally, I ask R to do this 10,000,000
times, getting 4,997,386 results of Heads. The relative frequencies I get are shown in the table below.
Outcome   10 tosses   10,000 tosses   10,000,000 tosses
Heads     0.2         0.5056          0.4997386
Tails     0.8         0.4944          0.5002614
We can see here how the relative frequencies are approaching the true probability of Heads, which is 0.5.
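
One way a simulation like this could be run in R is sketched below. This is not necessarily the code used to produce the table: the function name relative_freq and the seed are assumptions for illustration, the largest run uses 1,000,000 tosses to keep it quick, and because the tosses are random the numbers will not match the table exactly.

```r
# Simulate n tosses of a fair coin and return the relative frequencies.
# set.seed() only makes the run repeatable; the seed value is arbitrary.
set.seed(2025)

relative_freq <- function(n) {
  tosses <- sample(c("Heads", "Tails"), size = n, replace = TRUE)
  # factor() keeps both outcomes in the table even if one never occurs
  table(factor(tosses, levels = c("Heads", "Tails"))) / n
}

# One column per number of tosses: 10, then 10,000, then 1,000,000
sapply(c(10, 10000, 1000000), relative_freq)
```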

4.5 Independence
The concept of independence is absolutely critical to both probability and statistics. It is also very
commonly misunderstood.
There are a number of different ways of thinking about independence. Some get a little technical, and we’ll
return to those later in the module. For now, though, I’ll give you a definition in plain English. Two events
are independent if learning whether one event has happened gives no additional information about
whether the other event will happen.
For instance, say I roll two fair six-sided dice. The probability of rolling a one on such a dice is 1/6. If I tell
you I rolled a one on the first dice, this would not cause you to rethink what the probability of rolling a one
on the second dice is.
This kind of example is very commonly used to explain independence. Unfortunately, it can lead to people
mistakenly believing two events are independent if the processes which produce those events do not in any way
interact. This isn’t actually the case! Let’s consider another example.

Example 4.6 I roll a fair six-sided dice with one hand, and roll a fair twenty-sided dice with the other hand.
Let event A be “I roll a one on the dice in my left hand”, and let event B be “I roll a one on the dice in my
right hand”. I roll the dice in such a way that they never touch, and so the number each ends up showing
can’t have any effect on what the number the other dice shows is. Are A and B independent events?
We know the probability of rolling a one on the six-sided dice is 1/6, and the probability of rolling a one on
the twenty-sided dice is 1/20. Let’s say the probability the six-sided dice is in my left hand is 1/2. We shall
soon see how to find the probability of event A happening; it’s (1/2) × (1/6) + (1/2) × (1/20) = 13/120, with
event B having the same probability.
Now, suppose I roll the dice in my left hand, and I get a 20. This immediately tells us two things. First,
event A didn’t happen. Second, and much more importantly, the probability of event B can’t be 13/120 any
more. That’s because we now know that the dice in my right hand must have been the one with six sides.
Therefore, the probability of event B is now 1/6.

Note that if I told you ahead of time which dice was in each hand, events A and B would be independent,
because this trick of learning about which dice is which from one roll no longer works - if we already knew I
had the twenty-sided dice in my left hand, learning that I got a 20 from that roll would have no effect on our
beliefs about the other roll.
This highlights the extremely important idea that what we can say about the probabilities of events depends
on how we are defining our events, not just whatever’s going on in whatever process we’re interested in.
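
The reasoning in Example 4.6 can be checked by listing every possibility with its probability. The sketch below assumes, as in the example, that each dice is equally likely to be in either hand; the data frame layout and the variable names are my own.

```r
# Exact check of Example 4.6 by listing every weighted possibility.
# Case 1: six-sided dice in the left hand, twenty-sided in the right.
# Case 2: the other way round. Each case has probability 1/2.
case1 <- expand.grid(left = 1:6,  right = 1:20)
case2 <- expand.grid(left = 1:20, right = 1:6)
case1$prob <- (1 / 2) * (1 / 6) * (1 / 20)
case2$prob <- (1 / 2) * (1 / 20) * (1 / 6)
rolls <- rbind(case1, case2)

# P(B): a one on the dice in my right hand
p_B <- sum(rolls$prob[rolls$right == 1])

# P(B) once we know the left-hand dice showed a 20
left20 <- rolls$left == 20
p_B_given_left20 <- sum(rolls$prob[left20 & rolls$right == 1]) /
  sum(rolls$prob[left20])

c(p_B = p_B, p_B_given_left20 = p_B_given_left20)
```

If the two values differ, learning about the left-hand roll has changed the probability of B, which is exactly what it means for the events not to be independent.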

4.5.1 The Multiplicative Rule


The property of independence has an absolutely colossal number of uses within statistics. Perhaps the most
fundamental is what I shall call the multiplicative rule.

The multiplicative rule: If events A and B are independent, then

P (A and B) = P (A) × P (B).

Example 4.7 In a game of Bizzfin (see Example 4.2), find the probability that I roll a 5 or 6 on the dice,
while also drawing a red card.
In this example, our two events - roll a 5 or a 6 on the dice (call this event A), and draw a red card (call this
event B) - are independent, because knowing what I rolled on the dice gives me no information about the
card I drew, and vice versa.
Hence, we can use the multiplicative rule. The probability of rolling a 5 or a 6 on a fair six-sided dice is
(1/6) + (1/6) = 1/3. Half the cards in a western deck are red (the other half are black), so the probability of
drawing a red card is 1/2.

$$P(A \text{ and } B) = P(A) \times P(B) = \frac{1}{3} \times \frac{1}{2} = \frac{1}{6}.$$

So why does the multiplicative rule work? In Section 4.3, we talked about how in this module, we think
of a probability of an event as being the proportion of times the event does happen, out of all the times it
could. If A and B are independent events, then B is no more or less likely to happen if A happens than if A
doesn’t happen. Therefore, A happens with a proportion P (A), and B happens with a proportion P (B) of
the times A has happened (and proportion P(B) of the times A hasn’t happened, but we’re not considering
that right now). This means they both happen for a proportion P (B) of the proportion P (A), which is an
overall proportion P (B) × P (A).
In other words, all Example 4.7 is doing is using the fact that half the time I’ll draw a red card, and one
third of those times, I will roll a 5 or a 6.
Note that we cannot perform a similar trick when we don’t have independence, because the proportion of
times B happens will be different depending on whether A happens or doesn’t happen; those two cases have
to be considered separately. We will see soon how we can go about doing this.
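
Because every Bizzfin outcome is equally likely, the multiplicative rule in Example 4.7 can be checked by counting. The sketch below is one way to do that; the suit codes (with "D" and "H" as the red suits) are an assumption about labelling.

```r
# Count Bizzfin outcomes to check the multiplicative rule in Example 4.7.
# Suits are coded by hand; "D" and "H" are taken to be the red suits.
outcomes <- expand.grid(dice = 1:6, rank = 1:13, suit = c("C", "D", "H", "S"))

A <- outcomes$dice >= 5                  # roll a 5 or a 6
B <- outcomes$suit %in% c("D", "H")      # draw a red card

p_A <- mean(A)                           # proportion of outcomes in A
p_B <- mean(B)                           # proportion of outcomes in B
p_A_and_B <- mean(A & B)                 # proportion of outcomes in both

all.equal(p_A_and_B, p_A * p_B)          # TRUE when the rule holds
```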

4.6 Mutual Exclusivity


If independence is at one end of a scale regarding how much influence learning about one event has on our
thinking about the other event, then mutual exclusivity exists at the opposite end of that scale. Two
events are mutually exclusive if one of them happening means the other cannot possibly happen. Despite
this being as far from independence as you can get - learning about one event can give you total information
about the other - the two are very commonly mixed up. It’s even more common to mix up when we can
use the multiplicative rule which we’ve seen already, and when we can use the additive rule (or at least a
special case of it), which we can use when we have mutually exclusive events.

4.6.1 Additive Rule (Mutually Exclusive Results)
The additive rule for mutually exclusive events: If events A and B are mutually exclusive, then

P (A or B) = P (A) + P (B).

Example 4.8 My friend’s two sons Alfred and Martin love to compete in triathlons. He decides, based
on their previous times, that in the next triathlon they run, Alfred has a 0.07 probability of winning the
triathlon, and Martin has a 0.09 probability of winning the triathlon. What is the probability one of my
friend’s sons wins the triathlon?
Let us define A as being the event Alfred wins the race, and define M as the event Martin wins the race.
These events are mutually exclusive - they cannot both win the race. Therefore we can use the special
case of the additive rule.

P (A or M ) = P (A) + P (M ) = 0.07 + 0.09 = 0.16.

So why does the additive rule work? There are two ways to think about this. One is to think about the
proportions involved. There is a certain proportion of times that A happens, and a certain proportion of
times that B happens. There is never an occasion when both happen. As a result, the proportion of times
when A or B happens must be the sum of the proportions when each happens, as shown in the diagrams
below.

Alternatively, we can think back to how we define probabilities. If we run an infinite series of experiments,
the number of times A or B happens must equal the number of times A happens plus the number of times B
happens, since they can never both happen at once. The additive rule therefore just tells us it doesn’t matter
if we add the number of times A happens to the number of times B happens before dividing by the total
number of times, or after.
Note that we can combine the multiplicative and additive rules where appropriate. That’s part of how, in
Example 4.6, I calculated the probability of rolling a one on a dice (call this R1) that had a probability 0.5
of being six-sided, and 0.5 of being twenty-sided. I used the fact that the event “this dice has six sides” (call
this S) is mutually exclusive to the event “this dice has twenty sides” (call this T ), and so

P (R1) = P (R1 and S) + P (R1 and T )

We still need to do a bit more work to see how this calculation worked overall; we’ll come back to this in
Subsection 4.9.

4.6.2 Additive Rule (General)
We can’t use the additive rule in the same way when A and B are not mutually exclusive, because in that
situation, there is a certain proportion of times when A and B both happen.
This isn’t difficult to deal with, though. Rather than add the proportions for A to the proportion for B, we
could instead:
1. add the proportion for A to the proportion of B but not A;
2. add the proportion for B to the proportion of A but not B;
3. add the proportion for B but not A to the proportion of A but not B, and then add the proportion
of A and B.
4. add the proportion for A to the proportion of B, but then subtract the proportion of A and B (because
that proportion has been counted twice).
All four of these approaches give us the same answer, but it’s the fourth one we use to define the general
additive rule.
The additive rule in general is

P (A or B) = P (A) + P (B) − P (A and B).

Note that this still works for mutually exclusive events, since in such a case, P (A and B) = 0.
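
To see the general additive rule at work, here is a small sketch using a fair six-sided dice. The two events are illustrative choices of mine, picked because they overlap (a 4 and a 6 belong to both).

```r
# Check the general additive rule on a fair six-sided dice.
# A and B are illustrative events that are not mutually exclusive.
outcome_space <- 1:6
A <- outcome_space %% 2 == 0     # "even"
B <- outcome_space >= 4          # "4 or more"

p_direct <- mean(A | B)                          # count outcomes in A or B
p_rule   <- mean(A) + mean(B) - mean(A & B)      # general additive rule

c(direct = p_direct, rule = p_rule)
```

Leaving out the subtraction would count the outcomes 4 and 6 twice, giving a "probability" of 1 for an event that clearly misses the outcomes 1 and 3.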

4.7 Complements
The concept of a complement is a little different to what we’ve looked at so far. We’ve focussed on when
two events are independent, or when they’re mutually exclusive (or when they’re neither), but there’s nothing
stopping us thinking about any number of independent events, or any number of mutually exclusive events.
In contrast, each and every event has one and only one complement. That’s because for an event A, the
complement of A is the event that A doesn’t happen. We denote this event as A^C.

Rules of complements
1. The complement of a complement is the original event: (A^C)^C = A. In words, if A doesn’t not happen, A happens.
2. Every outcome in the outcome space belongs to one and only one of events A and A^C.
3. A and A^C are mutually exclusive events, because A must either happen or not happen; it cannot do
both at the same time.
4. The event that A happens or A^C happens is certain.
5. The complement of the event including all outcomes is the empty event, and vice versa (this is why we
need the concept of the empty event).
As a consequence of the 3rd and 4th rules, we also have

P (A or A^C) = P (A) + P (A^C) = 1.

It is for this reason that we have the result, which you may have seen before, that

P (A happens) = 1 − P (A doesn’t happen).

While each event A has only one complement, we can nevertheless extend what we’ve seen above by talking
about mutually exclusive and exhaustive events. This is a set of m events A1, A2, . . . , Am, such that
every outcome in the outcome space belongs to one and only one event. This in turn means Ai and Aj are
mutually exclusive whenever i ≠ j, and that one and only one Ai must occur.
When we have such events, we must have
$$\sum_{i=1}^{m} P(A_i) = 1 \quad (*).$$

One way to justify this result is that every event Ai is mutually exclusive to all other events, and therefore
the sum of their probabilities must equal the probability that (A1 or A2 or . . . or Am ) happens. Since every
outcome belongs to one of the events Ai , we must have

P (A1 or A2 or . . . or Am ) = 1

and hence
$$\sum_{i=1}^{m} P(A_i) = P(A_1 \text{ or } A_2 \text{ or } \dots \text{ or } A_m) = 1$$

The rule that P (A) + P (A^C) = 1 is actually a special case of rule (∗), because A and A^C are mutually
exclusive and exhaustive events - they can’t both happen, and one of them must happen.

Example 4.9 In the game of Bizzfin (see Example 4.2), what is the probability that I either don’t roll a 1,
or that I don’t draw a Heart card?
We can count all the outcomes which lie inside this event, but it might be faster to count the outcomes which
lie inside its complement. There are 1 × 13 = 13 ways I can roll a one and draw a Heart card. The probability that I
don’t roll a 1 or don’t draw a Heart card is therefore 1 − 13/312 = 299/312.

Note that the probability of not rolling a 1 or not drawing a Heart card is not the same as not rolling a 1 and
not drawing a Heart card. The latter event here has a probability of 195/312 (try checking this for yourself).
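
If you would like to check counts like these by machine rather than by hand, the sketch below lists all 312 Bizzfin outcomes and counts the ones in each event. The suit codes (with "H" for Hearts) are an assumed labelling.

```r
# Count outcomes for Example 4.9 directly from the Bizzfin outcome space.
# Suit codes are illustrative; "H" stands for Hearts.
outcomes <- expand.grid(dice = 1:6, rank = 1:13, suit = c("C", "D", "H", "S"))

not_one   <- outcomes$dice != 1
not_heart <- outcomes$suit != "H"

n_complement <- sum(!not_one & !not_heart)  # roll a 1 AND draw a Heart
n_or         <- sum(not_one | not_heart)    # don't roll a 1 OR don't draw a Heart
n_and        <- sum(not_one & not_heart)    # don't roll a 1 AND don't draw a Heart

c(complement = n_complement, or = n_or, and = n_and, total = nrow(outcomes))
```

Dividing each count by the total gives the corresponding probability, and the "or" count plus the complement count should always equal the total.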

4.8 An Introduction To Expectation


One of the most fundamental concepts in both probability and statistics is that of expectation. Indeed, it’s
so important we’ll meet it more than once, in what might seem to be quite different contexts.
For now, we’ll think about the relationship between the probability of an event, and how many times we
expect that event to happen. If we have r opportunities for an event A to happen, and the probability of the
event happening each time is P (A), then the number of times we expect the event to happen is r × P (A).

Example 4.10 Suppose market research has been done on orders in the Palatine Cafe, and the research
shows that the probability a customer’s order includes coffee is 2/5. If the Palatine Cafe serves
300 people in a day, the expected number of orders including coffee will be 300 × (2/5) = 120.
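
As a quick illustration, the sketch below computes r × P(A) for the cafe example and compares it with the average over many simulated days. The simulation assumes each customer orders coffee independently of the others, which is an extra assumption of mine and not something the market research tells us.

```r
# Expected number of coffee orders: r opportunities, each with probability p.
r <- 300
p <- 2 / 5
r * p                      # expected number of orders including coffee

# Simulated check, ASSUMING customers order independently of one another.
set.seed(1)                # arbitrary seed, for repeatability only
daily_orders <- rbinom(10000, size = r, prob = p)   # 10,000 simulated days
mean(daily_orders)         # should land close to r * p
```

Individual days will rarely hit the expected value exactly; it is the long-run average that settles near it.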

4.9 Conditional Probability


4.9.1 Motivation
Sometimes we learn information that will make us want to reassess the probability of something happening.
Imagine I’ve bought a National Lottery ticket, and chosen six numbers from 1 to 49. I will win the jackpot
if the same six numbers are chosen by the Lottery Machine at the end of the week. Assuming the Lottery
Machine chooses those six numbers randomly and fairly, my probability of winning the jackpot is 0.0000072%.
Imagine further though that when the first number is announced, it matches one of the six I have chosen.
I now need the Lottery Machine to choose the five numbers I have remaining in order to win the jackpot,
which has a probability of 0.000058%. If the second number also matches one of mine, the probability of
winning goes up to 0.00056%, and so on. If on the other hand any number comes out that doesn’t match
mine, then I know I’ve lost, and the probability drops to zero.
This is a very simple example of situations in which the probability of an event can change, based on
additional information we have received. Because situations in which we want to rethink probabilities based
on additional information are so common (especially in statistics), we need some way to update probabilities
in a mathematically rigorous manner. This is done through what we call conditional probabilities.
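
The lottery percentages quoted above can be reproduced by counting equally likely combinations with choose(). The sketch below assumes the order in which numbers are drawn doesn't matter; the variable names are my own.

```r
# Probability of the jackpot before any numbers are drawn, and after
# one or two of the drawn numbers have matched mine.
choose(49, 6)                        # ways to pick six numbers from 49

p_start     <- 1 / choose(49, 6)     # need all six of my numbers
p_after_one <- 1 / choose(48, 5)     # need my other five, from the 48 left
p_after_two <- 1 / choose(47, 4)     # need my other four, from the 47 left

# Expressed as percentages, to compare with the figures in the text
signif(100 * c(p_start, p_after_one, p_after_two), 2)
```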

4.9.2 Updating the outcome space


One useful way to think about how updating probabilities works is to consider the outcome space. As we saw
in Section 4.2, the outcome space is the set of all the possible outcomes of whatever situation we are looking
at, so that one and only one outcome can happen, and that no outcome listed can be further broken up into
two or more relevant possible results.
Gaining information about a situation corresponds to learning that some outcomes cannot have happened. To
return to my lottery example, we can think of the outcomes being the 13,983,816 different ways to choose six
different numbers from 1 to 49. Once I learn the first number chosen by the Lottery Machine matches one of
mine, all the outcomes which do not include that number become impossible.
This then gives me a new outcome space, where each outcome remaining includes the number the Lottery
Machine has picked first. Once we learn the second number picked, that gives us another new outcome space,
which contains only the outcomes which include both the numbers picked, and so on.
In situations in which all outcomes are equally likely, updating probabilities can be quite quick - you throw
away the outcomes which are no longer possible, and calculate the new probability using the outcomes you
have left.
In such a case, the probability event A happens given event B has happened is found as follows

$$P(A \text{ happens, given } B \text{ has happened}) = \frac{\text{Number of outcomes in both } A \text{ and } B}{\text{Number of outcomes in } B}$$

We often write P (A happens, given B has happened) as P (A|B), said out loud as “the probability of A given
B”, for ease of notation.

4.9.3 The conditional probability formula


Often we’re not lucky enough to be in a situation in which all outcomes are equally likely, of course. Still, the
approach we just looked at can be adapted to situations in which some outcomes are more likely than others.
Remember that we define the probability that event A happens as
$$P(A) = \frac{\text{Number of times } A \text{ happens}}{\text{Number of times } A \text{ could have happened}}$$
Now that we know B has happened, then all the times B didn’t happen, whether or not A happened as well,
are no longer relevant. We’re interested only in how many times B happened and A happened, out of the
number of times B happened and A could have happened.

$$P(A|B) = \frac{\text{Number of times } B \text{ and } A \text{ happen}}{\text{Number of times } B \text{ happens and } A \text{ could have happened}} \quad (**)$$

We can take this a stage further, using the mathematical result that a fraction does not change its value if
you divide both top and bottom by the same number.

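$$\begin{aligned}
P(A|B) &= \frac{\text{Number of times } B \text{ and } A \text{ happen} \,/\, \text{Number of times something could have happened}}{\text{Number of times } B \text{ happens and } A \text{ could have happened} \,/\, \text{Number of times something could have happened}} \\
&= \frac{P(A \text{ and } B)}{P(B)}.
\end{aligned}$$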

This alternative form of the formula is commonly considered to be “the” conditional probability formula.
Important: For events A and B, we can talk about P (B|A) just as easily as P (A|B). This sometimes seems
odd if event B either happens or doesn’t happen before event A does or doesn’t happen. But probability is
all about our beliefs regarding events when we personally don’t yet know whether they have happened. It is perfectly
possible for an event to have happened or not without you knowing whether it has happened or not, and in
such circumstances it is still perfectly reasonable to discuss probability.
It’s also important to bear in mind that, in general, P (A|B) ≠ P (B|A). We can see from equation
(∗∗) that the two share the same numerator, so (unless A and B never happen together) P (A|B) = P (B|A)
would only be true when the number of times B happens = the number of times A happens, that is, when P (A) = P (B).

Example 4.11 In a catch-and-release study, one hundred swallows in the UK are captured during summer,
given leg rings (numbered 1 to 100), and released into the wild. Two months later, one of the swallows is
spotted in Tunisia.
Assuming the leg ring on the spotted swallow is equally likely to be each of the numbers between 1 and 100,
find:
1. P (A), where A is the event that the leg ring number is 20 or lower;
2. P (B) where B is the event that the leg ring number is prime;
3. P (A|B);
4. P (B|A).
Answers (we use {1, . . . , 100} as the outcome space here):
1. 20 of the 100 outcomes are 20 or lower, hence P (A) = 20/100 = 1/5.
2. There are 25 prime numbers between 1 and 100. Hence 25 of the 100 outcomes are prime, and
P (B) = 25/100 = 1/4.
3. Since we know B happened, our outcome space for this calculation contains only the 25 prime numbers
between 1 and 100. Of these, 8 are below 20 (2, 3, 5, 7, 11, 13, 17 and 19; 20 itself is not prime, of course). Hence P (A|B) = 8/25.
4. Since we know A happened, our outcome space for this calculation contains only the numbers 1 to 20.
Of these, 8 are prime. Hence P (B|A) = 8/20 = 2/5.
Note that this is one of many situations in which P (A|B) ≠ P (B|A).
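
The counting in Example 4.11 can be checked with a short R sketch. The helper is_prime() is a hypothetical function written for this illustration (base R has no built-in primality test), and it simply tries every possible divisor up to the square root.

```r
# Conditional probabilities for Example 4.11 by counting outcomes.
rings <- 1:100

# Hypothetical helper: TRUE if n is prime, using trial division.
is_prime <- function(n) {
  n > 1 && all(n %% seq_len(floor(sqrt(n)))[-1] != 0)
}

A <- rings <= 20                               # ring number is 20 or lower
B <- vapply(rings, is_prime, logical(1))       # ring number is prime

p_A <- mean(A)
p_B <- mean(B)
p_A_given_B <- sum(A & B) / sum(B)   # outcomes in both, out of outcomes in B
p_B_given_A <- sum(A & B) / sum(A)   # outcomes in both, out of outcomes in A

rings[A & B]                         # the ring numbers in both events
c(p_A = p_A, p_B = p_B, p_A_given_B = p_A_given_B, p_B_given_A = p_B_given_A)
```

Both conditional probabilities share the same numerator, sum(A & B); only the denominator changes, which is why the two need not be equal.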

4.9.4 Conditional probabilities and relative frequency


Of course, there are many situations in which we don’t know the proportions we would need in order to directly
calculate P (A|B). In such situations, we can once again make use of relative frequencies, as in Section 4.4.

$$P(A|B) \approx \text{Relative frequency of } A|B = \frac{\text{Number of experiments run in which } A \text{ and } B \text{ happen}}{\text{Total number of experiments run in which } B \text{ happens}}$$

4.9.5 Conditional probabilities and independence
Our working definition of independence, as discussed in Section 4.5, is that the events A and B are independent
if learning whether event A (B) has happened does not cause us to change our beliefs about the chance of B
(A) happening.
If A and B are independent, then, what does that imply about the probabilities P (A|B) and P (B|A)? We can
answer this question using the definition of independence, but instead, we’ll consider it from the perspective
of the relevant equations, and then discuss how the implications of those equations match up to our definition
of independence.
We know that

P (A and B) = P (A)P (B)

when A and B are independent, and that

$$P(A|B) = \frac{P(A \text{ and } B)}{P(B)} \;\Rightarrow\; P(A \text{ and } B) = P(A|B)P(B)$$

whether A and B are independent or not. When A and B are independent, then, we must have

P (A and B) = P (A|B)P (B) = P (A)P (B)

and hence that P (A|B) = P (A). A very similar argument gives us that when A and B are independent,
P (B|A) = P (B).
So why does this make sense? What we are seeing here is that if A and B are independent, what we believe the
probability for A is after learning B has happened should be the same as what we believe the probability
for A is without knowing B had happened. If A and B are independent, it doesn’t make any difference to
our beliefs about A if we learn B happened; that’s what independence means!
The same argument justifies why P (B|A) = P (B); learning A has happened gives us absolutely no additional
information about how likely an event independent of A is to happen.
This is a very useful result in statistics. The equation P (A and B) = P (A|B)P (B) also comes in handy an
awful lot, and is sometimes called the conditional multiplicative rule.
We can now finally fully understand what happened in Example 4.6. I showed already that the probability
of rolling a 1 with the dice in my left hand is

P (R1) = P (R1 and S) + P (R1 and T )

using the additive rule. We can now use the conditional multiplicative rule to get
$$P(R1) = P(R1|S)P(S) + P(R1|T)P(T) = \frac{1}{6} \times \frac{1}{2} + \frac{1}{20} \times \frac{1}{2}$$

as stated.

4.10 Bayes Rule


One of the most important uses of the conditional multiplicative rule is that it leads us to Bayes rule.
This is a result so important it birthed an entire new approach to statistics; the helpfully-named Bayesian
statistics that most if not all of you will be learning more about in Epiphany term.
Bayes rule works by exploiting the fact that P (A and B) must equal P (B and A). Either both events happen,
or they don’t both happen; it can’t matter which of the two events we name first. Applying the conditional
multiplicative rule, then, we have

$$\begin{aligned}
P(B|A)P(A) &= P(B \text{ and } A) = P(A \text{ and } B) = P(A|B)P(B) \\
\Rightarrow P(B|A) &= \frac{P(A|B)P(B)}{P(A)}
\end{aligned}$$

This simple little trick unlocks an amazingly powerful result - we have a way of swapping round the order of
events in a conditional probability. We can go back in time!
There are any number of situations in which this is extremely useful. Here’s just one: imagine a GP wants to
diagnose a sick patient. What she might want to do is figure out the probability of different diseases, given
the symptoms being shown. That could be extremely hard to work out, though. It might be much easier to
look up the probabilities of the symptoms given different diseases, and then use Bayes rule to swap that
round into what she really wants. Let’s dig into how that might work with an example.

Example 4.12 A doctor wants to calculate the probability a patient has the flu, given they have a cough, a
sore throat, and aching arms. The doctor knows the following information:
a) Currently 3% of the UK population is suffering from the flu (based on the latest NHS data).
b) Currently 5% of the UK population is reporting they have all three of a cough, a sore throat, and aching
arms (again, based on the latest NHS data).
c) 90% of people with the flu report all three of a cough, a sore throat, and aching arms (this has been
demonstrated by previous medical research).
Using these values, the doctor can calculate the probability their patient has the flu. Denoting by F the
event that a patient has the flu, and denoting by S the event that a patient has all three listed symptoms:

$$P(F|S) = \frac{P(S|F)P(F)}{P(S)} = \frac{0.9 \times 0.03}{0.05} = 0.54.$$

One way to interpret this result is that, before learning about the patient’s symptoms, it would be sensible for
the doctor to assume their probability of having the flu is 0.03. Once she learns about the symptoms, this
probability increases by a factor of 18, from 0.03 to 0.54.
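The arithmetic in Example 4.12 can be checked with a few lines of R. This is only a sketch of the calculation; the variable names are made up for illustration, and the three input values are the ones quoted in the example.

```r
# Values taken from Example 4.12
p_flu <- 0.03                 # P(F): proportion of the population with flu
p_symptoms <- 0.05            # P(S): proportion reporting all three symptoms
p_symptoms_given_flu <- 0.9   # P(S|F)

# Bayes rule: P(F|S) = P(S|F) P(F) / P(S)
p_flu_given_symptoms <- p_symptoms_given_flu * p_flu / p_symptoms
p_flu_given_symptoms

# How many times larger the probability is once the symptoms are known
p_flu_given_symptoms / p_flu
```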

5 Random Variables
We’ve now covered the basics of probability theory, upon which just about everything in the field of statistics
is based. There’s still a lot more to talk about, though. The kinds of examples we’ve been looking at have
been very simple in terms of the situations involved (which isn’t to say the maths might not have been
challenging). These sorts of situations - dice rolls, coin tosses, lotteries - aren’t really where we need to put
our focus. If you can work out probabilities just through counting equally likely outcomes, you don’t need to
call in a statistician.
The kinds of situation where a statistician becomes useful are those where you can’t just do some calculations
based on the situation’s properties. We get called in when conclusions - potentially quite complex and subtle
ones - need to be made by studying the behaviour of a system, via the data that system produces.
This will require an understanding of how we think about the data we collect in terms of the underlying
probabilities of the situation. This in turn will mean grasping the concept of a random variable, another
idea which is absolutely foundational to the practice of statistics.
So what is a random variable? How do we define them, how do they work, and why are they useful? Answering
all three of those questions requires we consider an even more fundamental question first: what is a variable?

5.1 What is a Variable?


A variable is a quantity that can take one of several possible values. There are all sorts of ways in which
this can happen. A value can regularly change as we observe it (temperature in a room), a value might be
changeable by us as people (temperature of our oven), a value might be unchanging but unknown to us, so
that it has several possible values and we don’t know which is true (number of people born in the year 10,000
BCE). All of these are variables - they’re quantities that have more than one possible value. Actually, we can
even define a fixed and known value as a variable (number of people born in my house last year - that would
be zero). It’s quite a boring thing to do, but it’s possible to do it, and sometimes useful.
All of that is quite a broad definition, but that’s not a bad thing. Indeed, given there are so many situations
in which we might want to think about variables, it’s actually quite important to have so wide a definition.
We can get a little more technical, though, as follows: a variable X is associated with a set $\mathcal{X}$ of values, which
describes the possible values X can take. The following would all be examples of variables:
• X, $\mathcal{X} = \mathbb{N}$ (representing all positive whole numbers);
• Y, $\mathcal{Y} = [0, 1]$ (representing all real numbers between 0 and 1);
• Z, $\mathcal{Z} = \{\text{Red}, \text{Blue}, \text{Green}\}$ (this means Z is a qualitative variable);
• A, $\mathcal{A} = \{0\}$.
Note that variable A is of the kind earlier discussed, where it can only take one value, in this case 0. A little
dull, but nothing in the notation I just gave you requires the set associated with a variable has to have more
than one element.

5.2 Making Variables Random


So far, none of the variables I’ve defined are random, because they lack a crucial property of random variables.
To be a random variable, a variable has to be associated not just with a set of possible values, but an
expression of belief regarding how likely each of those values is to be the one taken.
For example, let’s say I define a variable X, with $\mathcal{X} = \{H, T\}$. This variable could very well represent the
result of a coin toss, which will be either Heads (H), or Tails (T). Coin tosses are a very common example of
a random situation, but the variable X itself is not random yet. To become a random variable, I need to
express my belief about how likely each of the values in $\mathcal{X}$ is to be the value X takes. I could do this by
setting the probability of Heads at 0.5, P(H) = 0.5, representing a fair coin. I could set P(H) = 0.3, if I
think the coin has been tampered with in some way so that Heads only comes up 30% of the time. I could
even not assign a number at all, using algebra instead; P(H) = p, p ∈ [0, 1] expresses my belief that the
probability of getting Heads is p, where p is an unspecified number between 0 and 1 (note my use of the
symbol ∈, a stylised Greek epsilon, to denote “belongs to” here).
Each of these expressions of belief turns X into a random variable, and turns $\mathcal{X}$ into the outcome space for
X. We describe the value that X ends up taking as a realisation, often denoted x.
It’s a common mistake to see non-random variables and random variables as somehow opposite concepts. I
don’t think this is a useful approach. I think a better way to consider what we’re doing is in terms of a house.
Randomness is something you build on top of the concept of variables. It makes no more sense to say
non-random variables and random variables are opposites than it does to say one-storey houses and two-storey
houses are opposites. In particular, a variable in which you express certainty about which value it will take is
still a random variable, even though the outcome is fully known, because you’ve expressed that fact as a
belief.

5.3 Probability Functions


5.3.1 Functions
Functions, in the context of mathematics, are processes by which an input is converted into an output. The
inputs, the processes, and the outputs can all be quite weird and complicated sometimes, but the basic idea
is always the same.
Commonly in maths, we use letters to denote functions. The expression $f(x) = x^2$, for instance, defines a
function f which takes an input x, and squares it.
Functions also require what we call a domain, which is all the values you are allowed to use as inputs, and a
range, which is all the values you can get as an output. What the range looks like will depend on what your
domain looks like. If I say the function f has a domain of R (all the real numbers), then the range is the
interval [0, ∞), because for any non-negative number, there is always some number you can square to get to
that number. You can’t get negative numbers after squaring a real number, though, so negative numbers
aren’t in the range of f. If I instead use the domain [2, ∞) for f, the range becomes [4, ∞), because the
smallest output I can get is now $2^2 = 4$.
We say a function is closed form if we always know what the output is for every input. Not all functions
have this property, especially in statistics, where we often think of real-world situations as being powered by
functions we can never find, merely try to estimate through our models.

Example 5.1 The data set chicks in R tracks the weights of 50 newly-hatched chicks, with each chick being
weighed once a day. For any one chick, we could imagine a function, say f , for which the input t is the time
in minutes since a chick hatched, and the output f (t) is the chick’s weight in grams.

We can draw a graph for this function, as shown below. Because we only have one measurement a day,
though, the graph has long stretches of time in which the chick must have weighed something, but we
cannot possibly say what. The function f therefore does not have closed form. Attempting to estimate the
outputs for the values of t for which we didn’t record the chick’s weight, say by drawing a line of best fit (see
below), is a classic example of what we do in statistics.

5.3.2 Probabilities as functions


We’ll now combine what we understand about probabilities with what we understand about functions. I’m
going to be quite precise in my terminology here, so make sure you’re comfortable with that terminology
before reading beyond it.
A probability function, which is also known as a probability distribution, is a function. The input
of a probability distribution will either be one outcome, or an event made up of multiple outcomes. The
output of a probability function will be a number between 0 and 1, representing the probability of the
outcome or outcomes which make up the input.
Be careful to note then that an outcome is not an output, it is an input. Also note that sometimes,
statisticians may use the word “probability” to talk about the probability function, rather than the
probabilities that the function gives us as outputs. It should usually be clear in context which one is being
referred to, but keep on your toes!

Example 5.2 Consider a six-sided dice. Let the value I next roll on that dice be expressed as a variable
X. The variable can take values from $\mathcal{X} = \{1, \ldots, 6\}$. To make this into a random variable, I need to add
a belief statement. I shall assume the dice is fair, and that therefore each of the elements of $\mathcal{X}$ has a 1/6
probability of being the one I roll.
It is common in mathematics to denote outcomes using a lower-case Greek omega. I shall therefore denote
the outcome “I roll a one” as $\omega_1$, the outcome “I roll a two” as $\omega_2$, and so on, up to $\omega_6$. I will now call my
probability function P, and express the outputs as follows
$$P(\omega_i) = \frac{1}{6}, \text{ for every } i \in \mathcal{X}.$$
I could instead offer a different belief statement. Perhaps I know the dice has been tampered with, so that
the probability of rolling a six is actually 1/2, with the other five outcomes being as likely as each other,
though all less likely than a six.
I’ll define a new probability function P̃ for this belief:
$$\tilde{P}(\omega_i) = \frac{1}{10}, \text{ for every } i \in \{1, 2, 3, 4, 5\},$$
$$\tilde{P}(\omega_6) = \frac{1}{2}.$$
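The two belief statements in Example 5.2 can be written down in R as named vectors, with one probability per outcome. This is a sketch only; the object names and the choice of the event “roll an even number” are illustrative assumptions, not part of the example.

```r
outcomes <- as.character(1:6)

# P: the fair dice
p_fair <- rep(1 / 6, 6)
names(p_fair) <- outcomes

# P-tilde: the tampered dice, with a six coming up half the time
p_loaded <- c(rep(1 / 10, 5), 1 / 2)
names(p_loaded) <- outcomes

# A probability function must assign total probability 1 to the outcome space
sum(p_fair)
sum(p_loaded)

# The input can also be an event made up of several outcomes,
# for instance "roll an even number" (mutually exclusive outcomes, so add)
even <- c("2", "4", "6")
sum(p_fair[even])
sum(p_loaded[even])
```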

5.4 Introduction to Discrete and Continuous Probability Functions


There are two broad forms of probability functions, both of which we’ll consider in more detail later in the
notes. One form is the discrete distribution. These are probability functions for which the outcomes (and hence
the inputs) are either finite in number (such as in Example 5.2, or when dealing with, say, drawing cards
from a deck, or pulling marbles from a bag), or they are what we call countably infinite. An outcome
space is countably infinite if there’s an infinite number of outcomes, but there are gaps between the outcomes
which represent impossible events.
For instance, if we think about the number of insects on the planet, there’s no easy way to think about the
biggest possible number that could be, so we would consider the outcome space to be infinite. However, we
would consider it impossible to have any number of insects that wasn’t a whole number. Therefore there are
gaps between outcomes which represent impossible events, and the outcome space is countably infinite.
Both of the probability functions we looked at in Example 5.2 are discrete distributions, because there is
only a finite number of rolls we could see. If we try to imagine a dice with infinite faces (or perhaps a finite
number, but a finite number which is unknown and with no limit on how big it could be), we would still call
any probability function related to that dice a discrete distribution, because non-whole number rolls would
be impossible. Sometimes we call a discrete distribution a probability mass function, or PMF.

Continuous distributions, in contrast, are probability functions for which the outcome space is a continuum
of values with no gaps between them, such as an interval of real numbers. The uniform distribution U[0, 1] is
a good example of such a function. This is a probability function
which takes any value in the interval [0, 1] as an input (and there are an infinite number of such values), and
which gives its outputs in such a way that all outcome values are considered equally likely.
This quickly gets a bit complicated, though, because since there are infinitely many outcomes, and they
all have to have the same probability, the probability of getting any one specific outcome has to be zero.
This is true of all continuous distributions, in fact. In order to deal with this, we have to express continuous
distributions so they only give non-zero outputs when the inputs are intervals. Continuous distributions are
also referred to as probability density functions, or PDFs.

5.5 Discrete Random Variables


In this section we shall consider three of the most common discrete random variables. In each case, the set of
realisations that the variable can take is either the natural numbers (0, 1, 2, etc.) or some subset of those
numbers. As we’ve noted, the realisations of a discrete random variable don’t have to be limited to natural
numbers (just imagine a six-sided die with the faces labelled “−1”, “0”, “0.5”, “√2”, “π”, and “e”), but they
will be here.

5.5.1 The Bernoulli distribution


The Bernoulli distribution is possibly the simplest distribution we can think of, other than a distribution
where a random variable always takes the same value (we call that the atomic distribution).
For a Bernoulli random variable X, the corresponding outcome space is $\mathcal{X} = \{0, 1\}$. We denote the probability
of a realisation equalling 1 as being p. This then means P(X = 0) = 1 − p.
We express this whole set-up through the notation X ∼ Ber(p). Here, p is the one and only parameter of
the distribution.
You will be seeing that tilde sign ∼ a lot in this module (and possibly others). It is a short way of saying
“...is a random variable which has distribution...”. Hence X ∼ Ber(p) can be thought of as saying “X is a
random variable which has a Bernoulli distribution”.

5.5.2 The binomial distribution


The binomial distribution can be thought of in two ways. The common way we express it is as follows:
imagine we are about to observe the results of n trials. Each trial can only have two outcomes (often denoted
1 and 0, or “success” and “failure”). Let the random variable X represent the number of successful trials (so
$\mathcal{X} = \{0, 1, \ldots, n\}$).
If each trial has the same probability p of success, and if every trial is independent of every other trial
(so learning about the result of one trial gives you no information about whether any other trial will be
successful), X will have a binomial distribution, X ∼ Bin(n, p). The two parameters here are n and p,
illustrating the fact that the two ways to change the behaviour of the series of trials is to change how many
trials are performed, and/or to change the probability of each trial being successful.
For given values of n and p, we can express the probability of each possible realisation within $\mathcal{X} = \{0, 1, \ldots, n\}$.
In what follows, the notation r!, pronounced r factorial, denotes the product of all natural numbers between
1 and r, i.e. r! = 1 × 2 × . . . × (r − 1) × r. For book-keeping reasons, we set 0! = 1! = 1.

$$P(X = r) = \frac{n!}{r!(n-r)!} p^r (1-p)^{n-r}$$

We can break down this equation to see how it works. The term $p^r$ represents the probability of observing r
successful trials in a row (using the fact that the trials are independent). The term $(1-p)^{n-r}$ represents
the probability of observing n − r failed trials in a row. Combining these gives us that the probability of
observing r successes followed by n − r failures is $p^r(1-p)^{n-r}$. Again under the assumption of independence,
this must also equal, say, the probability of observing n − r failures followed by r successes, or the probability
of observing r successes and n − r failures in any other one specific order.
We can therefore find P(X = r) by multiplying this probability of observing r successes and n − r failures in
one specific combination by the number of such combinations there are (this relies on the additive rule for
mutually exclusive events, and on noting that each specific order of r successes and n − r failures must be mutually
exclusive to any other specific order of r successes and n − r failures).
The expression $\frac{n!}{r!(n-r)!}$ gives us that number of different combinations of r successes and n − r failures. I
won’t explain how this works here, but I’ve put together a separate document with that explanation, which is
also available in the Week 3 material on Ultra.
I mentioned above that there are two ways to think about the binomial distribution. The second is to
recognise that each individual one of the n trials we’re observing has the same probability p of success.
Therefore, each individual trial can be represented by a Bernoulli random variable. The behaviour of the
random variable X ∼ Bin(n, p) is therefore equivalent to the behaviour of the sum of n random
variables $Y_1, \ldots, Y_n$, each of which is a Bernoulli random variable, $Y_i \sim Ber(p)$.
For this to work, each of these n Bernoulli random variables has to be independent of all the others. We
refer to such groups of independent random variables, all with the same distribution, as independent and
identically distributed random variables, or iid RVs for short.
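Both views of the binomial distribution can be tried out in R. The sketch below uses illustrative values of n and p that are not from the notes; `dbinom()` and `rbinom()` are R's built-in binomial functions, and a Bernoulli draw is obtained by setting `size = 1`.

```r
n <- 10; p <- 0.3   # illustrative values only
r <- 0:n

# First view: the formula for P(X = r), written out directly
by_formula <- factorial(n) / (factorial(r) * factorial(n - r)) * p^r * (1 - p)^(n - r)

# R's built-in binomial probabilities should agree with it
built_in <- dbinom(r, size = n, prob = p)
all.equal(by_formula, built_in)

# Second view: one binomial realisation as the sum of n iid Bernoulli realisations
set.seed(1)
y <- rbinom(n, size = 1, prob = p)   # n independent Ber(p) draws
x <- sum(y)                          # one realisation of Bin(n, p)
x
```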

5.5.3 The Poisson distribution


Both the Bernoulli and binomial distributions assume an event can only happen up to a certain number of times, and concern
themselves with how many times it in fact does happen. The Poisson distribution is a little different. We use
the Poisson distribution in situations where there is no specific limit to how many times something might happen. Rather
than fix a maximum number of events, then, the Poisson distribution works across a specific interval (commonly but
not always an interval of time), in which some event can happen (in theory) any number of times. In order
for this to work, we assume that the events in question happen independently - that is, an event happening at
a given point in the interval tells us nothing additional about when we might expect the next event to occur.
The Poisson distribution has a single parameter, λ, which is referred to as the intensity. The larger the value
of λ, the more times we expect to see the event happen over the interval associated with the distribution. We
express that the random variable X has a Poisson distribution with intensity λ by writing X ∼ Pois(λ).
The formula for finding the probability of seeing exactly r events over the interval associated with
X ∼ Pois(λ) is written below.

$$P(X = r) = \frac{e^{-\lambda} \lambda^r}{r!}.$$
Note that the Poisson distribution is somewhat different to the Bernoulli and binomial distributions, in that
there is no upper limit to the value of its realisations. There are therefore infinitely many possible values a
Poisson random variable can take. Once past a certain point, though (which varies depending on the value of
λ for the distribution), the probabilities get smaller and smaller as the realisation value gets larger and larger,
and they do so in such a way that, even though there is an infinite number of them, these probabilities still
all sum to 1.
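The Poisson formula, and the claim that infinitely many probabilities can still sum to 1, can both be checked numerically. This is a sketch with an illustrative intensity; `dpois()` is R's built-in Poisson probability function.

```r
lambda <- 4   # illustrative intensity, not from the notes
r <- 0:10

# The formula for P(X = r), written out directly
by_formula <- exp(-lambda) * lambda^r / factorial(r)

# R's built-in Poisson probabilities should agree with it
built_in <- dpois(r, lambda)
all.equal(by_formula, built_in)

# The probabilities shrink so quickly for large r that summing
# only the first 101 of them already gets extremely close to 1
sum(dpois(0:100, lambda))
```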

5.6 Continuous Random Variables


• For a continuous random variable, the role of the probability mass function is taken by a density
function, f(x), which has the properties that f(x) ≥ 0 and
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$

• For any a < b, the probability that X falls in the interval (a, b) is the area under the density function
between a and b:
$$P(a < X < b) = \int_{a}^{b} f(x)\,dx$$

• Thus the probability that a continuous random variable X takes on any particular value is 0:
$$P(X = c) = \int_{c}^{c} f(x)\,dx = 0$$

Although this may seem strange initially, it is really quite natural. If a uniform random variable
on [0, 1] had a positive probability of being any particular number, it should have the same
probability for any number in [0, 1], in which case the sum of the probabilities of any countably infinite
subset of [0, 1] (for example, the rational numbers) would be infinite.
• If X is a continuous random variable, then
$$P(a < X < b) = P(a \le X < b) = P(a < X \le b) = P(a \le X \le b)$$
Note that this is not true in general for a discrete random variable.

5.7 Cumulative distribution function


• The cumulative distribution function (cdf) of a continuous random variable X is defined as:
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(u)\,du$$

• The cdf can be used to evaluate the probability that X falls in an interval:
$$P(a \le X \le b) = \int_{a}^{b} f(x)\,dx = F(b) - F(a)$$
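The relationship between a density, its cdf, and interval probabilities can be explored with numerical integration. The sketch below assumes a made-up density, f(x) = 2x on [0, 1], chosen only because its cdf is easy to check by hand; `integrate()` is R's built-in one-dimensional integrator.

```r
# Hypothetical density for illustration: f(x) = 2x on [0, 1], and 0 elsewhere
f <- function(x) ifelse(x >= 0 & x <= 1, 2 * x, 0)

# A density must have total area 1
total_area <- integrate(f, lower = 0, upper = 1)$value

# The cdf by numerical integration; f is 0 below 0, so integration can start there
cdf <- function(x) integrate(f, lower = 0, upper = x)$value

# P(a <= X <= b) computed two ways: as an area, and as F(b) - F(a)
a <- 0.2; b <- 0.5
area <- integrate(f, lower = a, upper = b)$value
via_cdf <- cdf(b) - cdf(a)
c(total_area = total_area, area = area, via_cdf = via_cdf)
```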

5.8 Characteristics of probability distributions
• If X is a continuous random variable with density f(x), then
$$\mu = E(X) = \int_{-\infty}^{\infty} x f(x)\,dx$$
or in general, for any function g,
$$E(g(X)) = \int_{-\infty}^{\infty} g(x) f(x)\,dx$$

• The variance of X is
$$\sigma^2 = Var(X) = E\left\{[X - E(X)]^2\right\} = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx$$


• The variance of X is the average value of the squared deviation of X from its mean.
• The variance of X can also be expressed as $Var(X) = E(X^2) - [E(X)]^2$.
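The two expressions for the variance can be compared numerically. As a sketch, assume the made-up density f(x) = 2x on [0, 1] (not from the notes), for which the mean is 2/3 and the variance is 1/18.

```r
# Hypothetical density for illustration: f(x) = 2x on [0, 1]
f <- function(x) 2 * x

# Mean: integral of x f(x)
mu <- integrate(function(x) x * f(x), 0, 1)$value

# Variance from the definition: integral of (x - mu)^2 f(x)
var_def <- integrate(function(x) (x - mu)^2 * f(x), 0, 1)$value

# Variance from the shortcut: E(X^2) - [E(X)]^2
ex2 <- integrate(function(x) x^2 * f(x), 0, 1)$value
var_short <- ex2 - mu^2

c(mu = mu, var_def = var_def, var_short = var_short)
```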

5.9 Some useful continuous distributions


5.9.1 Uniform distribution
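• A random variable X with the density function
$$f(x) = \frac{1}{b-a}, \quad a \le x \le b$$
(and f(x) = 0 otherwise) is said to have the uniform distribution on the interval [a, b].

• The cumulative distribution function is
$$F(x) = \begin{cases} 0 & \text{for } x < a \\ \frac{x-a}{b-a} & \text{for } a \le x < b \\ 1 & \text{for } x \ge b \end{cases}$$

• A special case is the uniform distribution on [0, 1], for which f(x) = 1 for 0 ≤ x ≤ 1.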

5.9.2 Exponential distribution


• The exponential density function is
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0 \text{ and } \lambda > 0$$

• The cumulative distribution function is
$$F(x) = \int_{-\infty}^{x} f(u)\,du = 1 - e^{-\lambda x}, \quad x \ge 0$$

• The exponential distribution is often used to model lifetime or waiting-time data.

5.9.3 Normal distribution, N(µ, σ²)


• The normal (Gaussian) distribution plays a central role in probability and statistics, and is probably the most
widely known and used of all distributions.
• The normal distribution fits many natural phenomena, e.g. human height, weight, and IQ scores. In
business, one example among others is the annual cost of household insurance.
• The density function of the normal distribution depends on two parameters, µ and σ (where −∞ <
µ < ∞, σ > 0):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}, \quad -\infty < x < \infty$$
• The parameters µ and σ are the mean and standard deviation of the normal density.
• We write X ∼ N(µ, σ²) as a short way of saying ‘X follows a normal distribution with mean µ and
variance σ²’.

5.9.4 Standard normal distribution N(µ = 0, σ² = 1)
• The probability density function of the standardized normal distribution is given by:
$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}, \quad -\infty < z < \infty$$
• We write Z ∼ N(0, 1) as a short way of saying ‘Z follows a standard normal distribution with mean 0
and variance 1’.
• To standardize any variable X (into Z) we calculate Z as:
$$Z = \frac{X - \mu}{\sigma}$$
The Z-score calculated above indicates how many standard deviations X is from the mean.
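Standardizing can be tried out in R. The sketch below uses illustrative values of µ, σ and x that are not from the notes. One thing to watch: the notation N(µ, σ²) quotes the variance, whereas R's `dnorm()` and `pnorm()` take the standard deviation through their `sd` argument.

```r
mu <- 100; sigma <- 15   # illustrative parameters only
x <- 130

# Z-score: how many standard deviations x lies from the mean
z <- (x - mu) / sigma
z

# The normal density written out from its formula, compared with dnorm()
by_formula <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(by_formula, dnorm(x, mean = mu, sd = sigma))

# A probability about X equals the matching probability about Z
all.equal(pnorm(x, mean = mu, sd = sigma), pnorm(z))
```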

5.9.5 Example
• If $f_X$ is a normal density function with parameters µ and σ, and Y = aX + b with a > 0, then
$$f_Y(y) = \frac{1}{a\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{y - b - a\mu}{a\sigma}\right)^2\right]$$
• Thus, Y = aX + b follows a normal distribution with parameters aµ + b and aσ.
• If X ∼ N(µ, σ²) and Y = aX + b, then Y ∼ N(aµ + b, a²σ²).
• Can you use this to show that Z ∼ N (0, 1)?

5.10 Joint distributions


• The joint behaviour of two random variables, X and Y , is determined by the cumulative distribution
function,
F (x, y) = P (X ≤ x, Y ≤ y)
regardless of whether X and Y are continuous or discrete. The cdf gives the probability that the point
(X, Y ) belongs to a semi-infinite rectangle in the plane.

• The joint density function f(x, y) of two continuous random variables X and Y is such that
$$f(x, y) \ge 0$$
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$$
$$\int_{c}^{d} \int_{a}^{b} f(x, y)\,dx\,dy = P(a \le X \le b, c \le Y \le d)$$
The marginal density function of X is
$$f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$$

Similarly, the marginal density function of Y is
$$f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$

• The cdf of two continuous random variables X and Y can be obtained as
$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f(u, v)\,dv\,du$$
and
$$f(x, y) = \frac{\partial^2}{\partial x \partial y} F(x, y)$$
wherever the derivative is defined.
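Marginal densities and rectangle probabilities can be approximated by nesting one-dimensional integrals. The sketch below assumes a made-up joint density, f(x, y) = x + y on the unit square, which is not from the notes; the `sapply()` wrappers are there because `integrate()` passes a whole vector of points to its integrand.

```r
# Hypothetical joint density for illustration: f(x, y) = x + y on [0, 1] x [0, 1]
f <- function(x, y) x + y

# Marginal density of X: integrate y out, one x value at a time
f_X <- function(x) {
  sapply(x, function(xi) integrate(function(y) f(xi, y), 0, 1)$value)
}

# Total probability: the marginal must itself integrate to 1
total <- integrate(f_X, 0, 1)$value

# P(0 <= X <= 0.5, 0 <= Y <= 0.5) as an inner integral over y, then over x
inner <- function(x) {
  sapply(x, function(xi) integrate(function(y) f(xi, y), 0, 0.5)$value)
}
p_box <- integrate(inner, 0, 0.5)$value

c(total = total, marginal_at_0.3 = f_X(0.3), p_box = p_box)
```

For this particular density the marginal works out by hand to x + 1/2, which gives a quick sanity check on the numerical answer.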

5.11 Conditional probability (density) function, PDF


• The conditional probability (density) functions may be obtained as follows:
$$f_{X|Y}(x|y) = \frac{f(x, y)}{f_Y(y)} \quad \text{(conditional PDF of X)}$$
$$f_{Y|X}(y|x) = \frac{f(x, y)}{f_X(x)} \quad \text{(conditional PDF of Y)}$$
• Two random variables X and Y are statistically independent if and only if
$$f(x, y) = f_X(x) f_Y(y)$$
That is, if the joint PDF can be expressed as the product of the marginal PDFs. So, under independence,
$$f_{X|Y}(x|y) = f_X(x) \quad \text{and} \quad f_{Y|X}(y|x) = f_Y(y)$$

5.12 Properties of Expected values and Variance
• The expected value of a constant is the constant itself, i.e. if c is a constant, E(c) = c.
• The variance of a constant is zero, i.e. if c is a constant, V ar(c) = 0.
• If a and b are constants, and Y = aX + b, then E(Y) = aE(X) + b and Var(Y) = a²Var(X) (if Var(X)
exists).
• If X and Y are independent, then E(XY ) = E(X)E(Y ) and
V ar(X + Y ) = V ar(X) + V ar(Y )
V ar(X − Y ) = V ar(X) + V ar(Y )

• If X and Y are independent random variables and g and h are fixed functions, then
E[g(X)h(Y )] = E[g(X)]E[h(Y )]

5.13 Covariance
• Let X and Y be two random variables with means µx and µy , respectively. Then the covariance
between the two variables is defined as
cov(X, Y ) = E {(X − µx )(Y − µy )} = E(XY ) − µx µy
• If X and Y are independent, then cov(X, Y) = 0.
• If two variables are uncorrelated (that is, cov(X, Y) = 0), that does not in general imply that they are independent.
• V ar(X) = cov(X, X)
• cov(bX + a, dY + c) = bd cov(X, Y ), where a, b, c, and d are constants.

5.14 Correlation Coefficient


• The (population) correlation coefficient ρ is defined as
$$\rho = \frac{cov(X, Y)}{\sqrt{Var(X)Var(Y)}} = \frac{cov(X, Y)}{\sigma_x \sigma_y}$$
• Thus, ρ is a measure of linear association between two variables and lies between −1 (indicating perfect
negative association) and +1 (indicating perfect positive association).
• cov(X, Y ) = ρ σx σy
• Variances of correlated variables,
V ar(X ± Y ) = V ar(X) + V ar(Y ) ± 2cov(X, Y )
V ar(X ± Y ) = V ar(X) + V ar(Y ) ± 2ρ σx σy
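These identities also hold exactly for the sample versions computed by R's `var()`, `cov()`, `sd()` and `cor()`, which makes them easy to check on data. The sketch below uses simulated data; the way y is built from x is an arbitrary choice made only so that the two are correlated.

```r
set.seed(42)
x <- rnorm(200)
y <- 0.5 * x + rnorm(200)   # constructed so that x and y are correlated

# Correlation as covariance divided by the product of the standard deviations
rho <- cov(x, y) / (sd(x) * sd(y))
all.equal(rho, cor(x, y))

# Variance of a sum and of a difference of correlated variables
all.equal(var(x + y), var(x) + var(y) + 2 * cov(x, y))
all.equal(var(x - y), var(x) + var(y) - 2 * cov(x, y))

# The same thing written with rho
all.equal(var(x + y), var(x) + var(y) + 2 * rho * sd(x) * sd(y))
```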

5.15 Conditional expectation and conditional variance


Let f(x, y) be the joint PDF of random variables X and Y (or the joint PMF, if they are discrete). The conditional expectation of X, given Y = y,
is defined as
$$E(X|Y = y) = \sum_{x} x f_{X|Y}(x|Y = y) \quad \text{if X is discrete}$$
$$E(X|Y = y) = \int_{-\infty}^{\infty} x f_{X|Y}(x|Y = y)\,dx \quad \text{if X is continuous}$$
The conditional variance of X given Y = y is defined as, if X is discrete,
$$Var(X|Y = y) = \sum_{x} [x - E(X|Y = y)]^2 f_{X|Y}(x|Y = y)$$

and if X is continuous,
$$Var(X|Y = y) = \int_{-\infty}^{\infty} [x - E(X|Y = y)]^2 f_{X|Y}(x|Y = y)\,dx$$

6 Sampling
6.1 Sampling
• Sampling is widely used as a means of gathering useful information about a population.
• Data are gathered from samples and conclusions are drawn about the population as a part of the
inferential statistics process.
• Often, a sample provides a reasonable means for gathering such useful decision-making information
that might be otherwise unattainable and unaffordable.
• Sampling error occurs when the sample is not representative of the population.

6.2 Random versus non-random sampling


• In random sampling every unit of the population has the same probability of being selected into the
sample.
– Simple random sampling
– Stratified sampling
– Cluster sampling
– Multistage sampling
• In non-random sampling not every unit of the population has the same probability of being selected
into the sample.
– Convenience sampling
– Judgement sampling
– Quota sampling

6.3 Simple random sampling


Simple random sampling is the basic sampling technique where we select a group of subjects (a sample)
from a larger group (a population). Each individual is chosen entirely by chance and each member of the
population has an equal chance of being included in the sample.
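In R, a simple random sample can be drawn with `sample()`. This is a minimal sketch: the population here is just a made-up list of 500 unit labels, and the sample size of 20 is arbitrary.

```r
set.seed(7)
population <- 1:500   # hypothetical population: 500 unit labels

# sample() draws without replacement by default, so no unit is picked twice
# and every unit has the same chance of ending up in the sample
srs <- sample(population, size = 20)
srs
```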

6.4 Central limit theorem


Let $X_1, X_2, \ldots$ be independent and identically distributed (i.i.d.) random variables with mean µ and finite variance
σ². Then as n increases indefinitely (i.e. n → ∞), the distribution of the sample mean $\bar{X}_n = \sum_{i=1}^{n} X_i / n$ approaches the normal distribution with
mean µ and variance σ²/n. That is
$$\bar{X}_n \sim N(\mu, \sigma^2/n) \quad \text{as } n \to \infty$$

Note that this result holds true regardless of the form of the underlying distribution. As a result, it follows
that
$$Z = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{as } n \to \infty$$
That is, Z is a standardized normal variable.
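A small simulation gives a feel for the central limit theorem. The sketch below assumes samples from an exponential distribution with rate 1 (so µ = 1 and σ = 1), chosen because it is clearly not normal; the sample size and number of repetitions are arbitrary illustrative choices.

```r
set.seed(1)
n <- 50        # size of each sample
reps <- 5000   # number of samples drawn
mu <- 1; sigma <- 1   # mean and standard deviation of Exp(1)

# One sample mean per repetition
xbar <- replicate(reps, mean(rexp(n, rate = 1)))

mean(xbar)        # should be close to mu
sd(xbar)          # should be close to sigma / sqrt(n)
sigma / sqrt(n)

# Standardize the sample means; if Z is roughly N(0, 1),
# about 95% of the values should fall within +/- 1.96
z <- (xbar - mu) / (sigma / sqrt(n))
mean(abs(z) <= 1.96)
```

Plotting `hist(xbar)` shows the familiar bell shape even though the individual observations are heavily skewed.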

6.5 Sampling distribution of the sample mean x̄


The sampling distribution of a statistic is the probability distribution of that statistic.

There are two cases:
1. Sampling is from a normally distributed population with a known population variance:
$$\bar{x} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$
That is, the sampling distribution of the sample mean is normal with mean $\mu_{\bar{x}} = \mu$ and standard
deviation $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
2. Sampling is from a non-normally distributed population with known population variance and n is large,
then the mean of x̄ is
$$\mu_{\bar{x}} = \mu$$
and the variance is
$$\sigma_{\bar{x}}^2 = \begin{cases} \frac{\sigma^2}{n} & \text{with replacement (infinite population)} \\ \frac{\sigma^2}{n} \cdot \frac{N-n}{N-1} & \text{without replacement (finite population)} \end{cases}$$
where N is the population size.

• If the sample size is large, the central limit theorem applies and the sampling distribution of x̄ will be
approximately normal.
• The standard deviation of the sampling distribution of the sample mean, $\sigma_{\bar{x}}$, is called the standard
error of the mean or, simply, the standard error.
• If x̄ is normally distributed (or approximately normally distributed), we can use the following formula to
transform x̄ to a Z-score:
$$Z = \frac{\bar{x} - \mu_{\bar{x}}}{\sigma_{\bar{x}}}$$
where Z ∼ N(0, 1).

6.6 Sampling distribution of the sample proportion


• When the sample size n is large, the sample proportion, π̂, is approximately normally
distributed by the central limit theorem,
$$\hat{\pi} \approx N\left(\pi, \frac{\pi(1-\pi)}{n}\right)$$
then
$$Z = \frac{\hat{\pi} - \pi}{\sqrt{\pi(1-\pi)/n}} \approx N(0, 1)$$
where π̂ = x/n, x is the number in the sample with the characteristic of interest.
• A widely used criterion is that both nπ and n(1 − π) must be greater than 5 for this approximation to
be reasonable.
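The quality of the normal approximation can be judged by comparing it with the exact binomial answer. The sketch below uses made-up values of π and n and asks for P(π̂ ≤ 0.45); since n = 200, the event π̂ ≤ 0.45 is the same as seeing at most 90 units with the characteristic.

```r
pi0 <- 0.4; n <- 200   # hypothetical population proportion and sample size

# Rule of thumb: both of these should be greater than 5
c(n * pi0, n * (1 - pi0))

# Normal approximation to P(pi-hat <= 0.45) via the Z transformation
se <- sqrt(pi0 * (1 - pi0) / n)
p_approx <- pnorm((0.45 - pi0) / se)

# Exact binomial probability of at most 90 successes out of 200
p_exact <- pbinom(90, size = n, prob = pi0)

c(p_approx = p_approx, p_exact = p_exact)
```

The two answers are close but not identical, which is what "approximately normal" means in practice.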

6.7 Sampling distribution of the sample variance


Sampling is from a normally distributed population with mean µ and variance σ². The sample variance is
$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

and
$$E(s^2) = \sigma^2$$
$$Var(s^2) = 2\sigma^4/(n-1)$$
Then
$$\frac{(n-1)s^2}{\sigma^2} \sim \chi^2_{n-1}$$

6.8 Example
Suppose that during any hour in a large department store, the average number of shoppers is 448, with a
standard deviation of 21 shoppers. What is the probability that a random sample of 49 different shopping
hours will yield a sample mean between 441 and 446 shoppers?

$$\mu = 448, \quad \sigma = 21, \quad n = 49$$

$$P(441 \le \bar{x} \le 446) = P\left(\frac{441 - 448}{21/\sqrt{49}} \le \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} \le \frac{446 - 448}{21/\sqrt{49}}\right)$$
$$= P(-2.33 \le Z \le -0.67) = P(Z \le -0.67) - P(Z \le -2.33)$$
$$= 0.2514 - 0.0099 = 0.2415$$

This means there is a 24.15% chance of randomly selecting 49 hourly periods for which the sample mean is
between 441 and 446 shoppers.
We used the standard normal table to obtain these probabilities. We can also use R.
pnorm(-0.67)-pnorm(-2.33)
## [1] 0.2415258
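The same calculation can be done without rounding the Z-scores first, by giving `pnorm()` the mean and standard error directly. This is a sketch of that alternative; the answer differs slightly from the table-based one because −2.33 and −0.67 are rounded values.

```r
mu <- 448; sigma <- 21; n <- 49

# Standard error of the sample mean
se <- sigma / sqrt(n)

# P(441 <= x-bar <= 446), working on the original scale
p <- pnorm(446, mean = mu, sd = se) - pnorm(441, mean = mu, sd = se)
p
```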

7 Estimation
7.1 Estimation
• The values of population parameters are often unknown.
• We use a representative sample of the population to estimate the population parameters.
There are two types of estimation:
• Point Estimation
• Interval Estimation

7.2 Point estimation


• A point estimate is a single numerical value used to estimate the corresponding population parameter.
A point estimate is obtained by selecting a suitable statistic (a suitable function of the data) and
computing its value from the given sample data. The selected statistic is called the point estimator.
• The point estimator is a random variable, so it has a distribution, mean, variance etc.
• e.g. the sample mean $\bar{X} = (1/n) \sum_{i=1}^{n} X_i$ is one possible point estimator of the population mean
µ, and the point estimate is $\bar{x} = (1/n) \sum_{i=1}^{n} x_i$.
Properties:
• Let θ be the unknown population parameter and θ̂ be its estimator. The parameter space is denoted by
Θ.
• An estimator θ̂ is called unbiased estimator of θ if E(θ̂) = θ.
• The bias of the estimator θ̂ is defined as Bias(θ̂) = E(θ̂) − θ
• Mean Square Error (MSE) is a measure of how close θ̂ is, on average, to the true θ,

M SE = E[(θ̂ − θ)2 ] = V ar(θ̂) + [Bias(θ̂)]2
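The decomposition of the MSE into variance plus squared bias can be checked by simulation. The sketch below is only an illustration under assumed values (normal data, n = 10, true variance 4); it studies the variance estimator that divides by n rather than n − 1, which is biased.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 10; sigma2 <- 4             # assumed sample size and true variance

# Estimator under study: the variance estimate that divides by n (biased)
est <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  sum((x - mean(x))^2) / n
})

bias     <- mean(est) - sigma2           # theory: -sigma2 / n = -0.4
variance <- mean((est - mean(est))^2)    # spread of the estimator around its own mean
mse      <- mean((est - sigma2)^2)       # average squared distance from the true value

c(bias = bias, variance = variance, mse = mse, var_plus_bias2 = variance + bias^2)
```

The last two entries agree, which is the identity MSE = Var + Bias² evaluated on the simulated estimates.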

7.3 Interval estimation
• An interval estimate (confidence interval) is an interval, or range of values, used to estimate a
population parameter.
• The level of confidence (1 − α)100% is the probability that the interval-estimation procedure, applied to a random sample, produces an interval that contains the population parameter.
• Interval estimate components:

point estimate ± (critical value × standard error)

7.4 Confidence intervals for the population mean


• When sampling is from a normal distribution with known variance σ², then a 100(1 − α)% confidence interval for the population mean µ is
$$\bar{x} \pm z_{\alpha/2}\,(\sigma/\sqrt{n})$$
where zα/2 can be obtained from the standard normal distribution table.

100(1 − α)% α zα/2


90% 0.10 1.645
95% 0.05 1.96
99% 0.01 2.58

• If σ is unknown and n ≥ 30, the sample standard deviation $s = \sqrt{\sum (x_i-\bar{x})^2/(n-1)}$ can be used in place of σ.

• If the sampling is from a non-normal distribution and n ≥ 30, then the sampling distribution of x̄ is approximately normal (Central Limit Theorem) and we can use the same formula, $\bar{x} \pm z_{\alpha/2}(\sigma/\sqrt{n})$, to construct an approximate confidence interval for the population mean.
• When sampling is from a normal distribution whose standard deviation σ is unknown and the sample size is small, the 100(1 − α)% confidence interval for the population mean µ is
$$\bar{x} \pm t_{\alpha/2}\,(s/\sqrt{n})$$
where tα/2 can be obtained from the t distribution table with df = n − 1 and s is the sample standard deviation, which is given by
$$s = \sqrt{\frac{\sum (x_i-\bar{x})^2}{n-1}}$$
• If σ is unknown, and we have neither a normal population nor a large sample, then we should use nonparametric statistics (not covered in this course).

7.5 Interpreting confidence intervals


• Probabilistic interpretation: In repeated sampling, from some population, 100(1 − α)% of all
intervals which we constructed will in the long run include the population parameter.
• Practical interpretation: When sampling is from some population, we have 100(1 − α)% confidence
that the single computed interval contains the population parameter.
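The probabilistic interpretation can be illustrated by simulation. The sketch below assumes a normal population with known σ (the values of mu, sigma and n are made up for illustration), builds many 95% intervals, and records how often they contain µ.

```r
set.seed(42)                                   # arbitrary seed
mu <- 50; sigma <- 10; n <- 25; alpha <- 0.05  # assumed population and sample size

covered <- replicate(5000, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half <- qnorm(1 - alpha / 2) * sigma / sqrt(n)   # half-width of the z-interval
  (mean(x) - half <= mu) && (mu <= mean(x) + half) # does this interval contain mu?
})

coverage <- mean(covered)   # long-run proportion of intervals containing mu
coverage
```

The proportion should settle near 0.95, while any single interval either contains µ or does not.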

7.6 Confidence interval for a population proportion


The 100(1 − α)% confidence interval for a population proportion π is given by
$$\hat{\pi} \pm z_{\alpha/2}\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$$
where π̂ is the sample proportion.
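As a minimal sketch of this formula, suppose (hypothetically) that 60 of 200 sampled items have the property of interest. The counts are invented for illustration only.

```r
x <- 60; n <- 200; alpha <- 0.05        # hypothetical: 60 successes in 200 trials
p_hat <- x / n                           # sample proportion
se <- sqrt(p_hat * (1 - p_hat) / n)      # estimated standard error
ci <- p_hat + c(-1, 1) * qnorm(1 - alpha / 2) * se
ci                                       # about (0.236, 0.364)
```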

7.7 Confidence interval for a population variance
The 100(1 − α)% confidence interval for the variance, σ², of a normally distributed population is given by
$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2,\,n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2,\,n-1}}\right)$$
where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$ is the sample variance.
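A sketch of this interval in R, using a small made-up sample. It assumes the subscripts denote upper-tail areas, so the critical value with subscript α/2 is the larger one and corresponds to qchisq(1 - alpha/2, df); this reading is consistent with the lower limit coming first.

```r
x <- c(12.1, 9.8, 11.4, 10.6, 13.0, 9.5, 10.9, 11.7)   # hypothetical sample
n <- length(x); alpha <- 0.05
s2 <- var(x)                                            # sample variance, divisor n - 1

lower <- (n - 1) * s2 / qchisq(1 - alpha / 2, df = n - 1)   # divide by the larger critical value
upper <- (n - 1) * s2 / qchisq(alpha / 2, df = n - 1)       # divide by the smaller critical value
c(lower = lower, upper = upper)
```

Unlike the intervals for the mean, this interval is not symmetric about the point estimate s².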

7.8 Example
Suppose a car rental firm wants to estimate the average number of kilometres travelled per day by each of its
cars rented in London. A random sample of 20 cars rented in London reveals that the sample mean travel
distance per day is 85.5 kilometres, with a population standard deviation of 19.3 kilometres. Compute a 99%
confidence interval to estimate µ.
For a 99% level of confidence, a z value of 2.58 is obtained (from the standard normal table). Assume that
number of kilometres travelled per day is normally distributed.
$$\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} = 85.5 \pm 2.58 \times \frac{19.3}{\sqrt{20}} = 85.5 \pm 11.1$$
thus 74.4 ≤ µ ≤ 96.6

Note that:
qnorm((1-0.99)/2)
## [1] -2.575829
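The whole interval can also be computed in R rather than by hand. This is a sketch with illustrative variable names; it uses the unrounded critical value, so it agrees with the hand calculation to one decimal place.

```r
x_bar <- 85.5; sigma <- 19.3; n <- 20; conf <- 0.99
z <- qnorm(1 - (1 - conf) / 2)              # positive critical value, 2.575829
ci <- x_bar + c(-1, 1) * z * sigma / sqrt(n)
round(ci, 1)                                 # 74.4 96.6
```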

8 Hypothesis Testing One Sample
8.1 Hypothesis testing: Motivation
We often encounter such statements or claims:
• A newspaper claims that the average starting salary of MBA graduates is over £50K. (one sample test)
• A claim about the efficiency of a particular diet program, the average weight after the program is less
than the average weight before the program. (two paired samples test)
• On average female managers earn less than male managers, given that they have the same qualifications
and skills. (two independent samples test)
So we have claims about the populations’ means (averages) and we would like to verify or examine these
claims.
This is a kind of problem that hypothesis testing is designed to solve.

8.2 The nature of hypothesis testing


• We often use inferential statistics to make decisions or judgments about the value of a parameter, such
as a population mean.
• Typically, a hypothesis test involves two hypotheses:
– Null hypothesis: a hypothesis to be tested, denoted by H0 .
– Alternative hypothesis (or research hypothesis): a hypothesis to be considered as an
alternate to the null hypothesis, denoted by H1 or Ha .
• The problem in a hypothesis test is to decide whether or not the null hypothesis should be rejected in
favour of the alternative hypothesis.
• The choice of the alternative hypothesis should reflect the purpose of performing the hypothesis test.
• How do we decide whether or not to reject the null hypothesis in favour of the alternative hypothesis?
• Very roughly, the procedure for deciding is the following:
– Take a random sample from the population.
– If the sample data are consistent with the null hypothesis, then do not reject the null hypothesis;
if the sample data are inconsistent with the null hypothesis, then reject the null hypothesis and
conclude that the alternative hypothesis is true.
• Test statistic: the statistic used as a basis for deciding whether the null hypothesis should be rejected.
• The test statistic is a random variable which therefore has a sampling distribution with mean and
standard deviation (so-called standard error).

8.3 Type I and Type II Errors

• Type I error: rejecting the null hypothesis when it is in fact true.


• Type II error: not rejecting the null hypothesis when it is in fact false.
• The significance level, α, of a hypothesis test is defined as the probability of making a Type I error,
that is, the probability of rejecting a true null hypothesis.
• Relation between Type I and II error probabilities: For a fixed sample size, the smaller the
Type I error probability, α, of rejecting a true null hypothesis, the larger the Type II error probability
of not rejecting a false null hypothesis and vice versa.
• Possible conclusions for a hypothesis test: If the null hypothesis is rejected, we conclude that the
alternative hypothesis is probably true. If the null hypothesis is not rejected, we conclude that the data
do not provide sufficient evidence to support the alternative hypothesis.
• When the null hypothesis is rejected in a hypothesis test performed at the significance level α, we say
that the results are statistically significant at level α.
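The meaning of α as a Type I error probability can be illustrated by simulation. In the sketch below the null hypothesis is true by construction (all numbers are assumed for illustration), so every rejection is a Type I error.

```r
set.seed(123)                       # arbitrary seed
alpha <- 0.05; mu0 <- 100           # assumed significance level and null value

reject <- replicate(4000, {
  x <- rnorm(20, mean = mu0, sd = 15)      # data generated with H0 true
  t.test(x, mu = mu0)$p.value <= alpha     # TRUE means H0 was (wrongly) rejected
})

type1_rate <- mean(reject)          # should be close to alpha
type1_rate
```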

8.4 Hypothesis tests for one population mean


In order to test the hypothesis that the population mean µ is equal to a particular value µ0 , we are going to
test the null hypothesis

H0 : µ = µ0

against one of the following alternatives:


• H1 : µ ̸= µ0 (Two-tailed)
• H1 : µ < µ0 (Left-tailed)
• H1 : µ > µ0 (Right-tailed)
In order to test H0, we use one of the following test statistics, choosing the one whose assumptions are satisfied.
• If σ is known, and we have a normally distributed population or a large sample (n ≥ 30), then the test statistic, the so-called z-test, is
$$z = \frac{\bar{x}-\mu_0}{\sigma/\sqrt{n}}$$
where σ is the standard deviation of the population.
• If σ is unknown, and we have a normally distributed population or a large sample (n ≥ 30), then the test statistic, the so-called t-test, is
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} \quad \text{with } df = n-1,$$
where s is the standard deviation of the sample.

8.5 The p-value approach to hypothesis testing
• The p-value is the smallest significance level at which the null hypothesis would be rejected. The p-value
is also known as the observed significance level.
• The p-value measures how well the observed sample agrees with the null hypothesis. A small p-value
(close to zero) indicates that the sample is not consistent with the null hypothesis and the null hypothesis
should be rejected. On the other hand, a large p-value (larger than 0.10) generally indicates a reasonable
level of agreement between the sample and the null hypothesis.
• As a rule of thumb, if p-value ≤ α then reject H0 ; otherwise do not reject H0 .
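How the p-value is computed depends on the alternative hypothesis. The sketch below uses a hypothetical observed z statistic of 1.8 and shows the three cases with pnorm.

```r
z <- 1.8                                   # hypothetical observed z statistic
alpha <- 0.05

p_two   <- 2 * pnorm(-abs(z))              # H1: mu != mu0 (two-tailed)
p_left  <- pnorm(z)                        # H1: mu <  mu0 (left-tailed)
p_right <- pnorm(z, lower.tail = FALSE)    # H1: mu >  mu0 (right-tailed)

# TRUE means "reject H0" under the rule p-value <= alpha
c(two = p_two, left = p_left, right = p_right) <= alpha
```

With these numbers only the right-tailed test rejects H0, which shows why the alternative should be fixed before looking at the data.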

8.6 Critical-value approach to hypothesis testing

For any specific significance level α, one can obtain these critical values ±zα/2 and ±zα from the standard
normal table.

z0.10 z0.05 z0.025 z0.01 z0.005
1.282 1.645 1.960 2.326 2.576

If the value of the test statistic falls in the rejection region, reject H0 ; otherwise do not reject H0 .

For any specific significance level α, one can obtain these critical values ±tα/2 and ±tα from the T distribution
table. For example, for df = 9 and α = .05, the critical values are ±t0.025 = ±2.262 and ±t0.05 = ±1.833.
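The tabulated critical values can be reproduced with the quantile functions qnorm and qt, which is convenient when a table is not to hand. A short sketch:

```r
# Upper-tail z critical values for tail areas 0.10, 0.05, 0.025, 0.01, 0.005
z_crit <- qnorm(1 - c(0.10, 0.05, 0.025, 0.01, 0.005))
round(z_crit, 3)

# t critical values for df = 9 and alpha = 0.05
t_two <- qt(1 - 0.05 / 2, df = 9)   # two-tailed test: +/- t_{0.025}
t_one <- qt(1 - 0.05, df = 9)       # one-tailed test: t_{0.05}
round(c(t_two, t_one), 3)
```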

8.7 Hypothesis testing and confidence intervals


Hypothesis tests and confidence intervals are closely related. Consider, for instance, a two tailed hypothesis
test for a population mean at the significance level α. It can be shown that the null hypothesis will be rejected if and only if the value µ0 given for the mean in the null hypothesis lies outside the 100(1 − α)% confidence interval for µ.
Example:
• At significance level α = 0.05, we want to test H0 : µ = 40 against H1 : µ ̸= 40 (so here µ0 = 40).
• Suppose that the 95% confidence interval for µ is 35 <µ< 38.
• As µ0 = 40 lies outside this confidence interval, we reject H0.

8.8 Test of Normality


One of the assumptions in order to use a z-test or t-test is that the population from which we sampled is normally distributed. We have not yet checked this assumption, so we should perform a so-called test of normality. In order to do so:
• We can plot our data sample, e.g. histogram, boxplot, stem-and-leaf and normal Q-Q plot
• Use normality tests such as Kolmogorov-Smirnov test or Shapiro-Wilk test. The null and alternative
hypotheses are:
– H0 : the population being sampled is normally distributed.
– H1 : the population being sampled is not normally distributed.
If σ is unknown, and we neither have normal population nor large sample, then we should use nonparametric
tests instead of z-test or t-test.

8.9 Example
A company reported that a new car model equipped with an enhanced manual transmission averaged 29
mpg on the highway. Suppose the Environmental Protection Agency tested 15 of the cars and obtained the
following gas mileages.

27.3 30.9 25.9 31.2 29.7


28.8 29.4 28.5 28.9 31.6
27.8 27.8 28.6 27.3 27.6

What decision would you make regarding the company’s claim on the gas mileage of the car? Perform the
required hypothesis test at the 5% significance level.
Solution:
The null and alternative hypotheses:

H0 : µ = 29 mpg vs. H1 : µ ̸= 29 mpg

The value of the test statistic is
$$t = \frac{\bar{x}-\mu_0}{s/\sqrt{n}} = \frac{28.753-29}{1.595/\sqrt{15}} = -0.599$$
As the p-value = 0.559 > α = 0.05, we cannot reject H0. At the 5% significance level, the data do not provide sufficient evidence to conclude that the company’s report was incorrect.

R output:
# Data
mlg<-c(27.3, 30.9, 25.9, 31.2, 29.7,
28.8, 29.4, 28.5, 28.9, 31.6,
27.8, 27.8, 28.6, 27.3, 27.6)

# t-test
t.test(mlg, alternative = "two.sided", mu = 29, conf.level = 0.95)
##
## One Sample t-test
##
## data: mlg
## t = -0.59878, df = 14, p-value = 0.5589
## alternative hypothesis: true mean is not equal to 29
## 95 percent confidence interval:
## 27.86979 29.63688
## sample estimates:
## mean of x
## 28.75333
# Normality test
# Kolmogorov Smirnov Test
ks.test(mlg, "pnorm", mean = mean(mlg), sd = sd(mlg))
## Warning in ks.test(mlg, "pnorm", mean = mean(mlg), sd = sd(mlg)): ties should
## not be present for the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: mlg
## D = 0.13004, p-value = 0.9616
## alternative hypothesis: two-sided
# Shapiro-Wilk test
shapiro.test(mlg)
##
## Shapiro-Wilk normality test
##
## data: mlg
## W = 0.95817, p-value = 0.6606
par(mfrow=c(1,2))
qqnorm(mlg)
qqline(mlg, col = "red")
hist(mlg)

9 Hypothesis Testing Two Samples
9.1 Motivation
We often encounter such statements or claims:
• A newspaper claims that the average starting salary of MBA graduates is over £50K. (one sample test)
• A claim about the efficiency of a particular diet program, the average weight after the program is less
than the average weight before the program. (two paired samples test)
• On average female managers earn less than male managers, given that they have the same qualifications
and skills. (two independent samples test)
So we have claims about the populations’ means (averages) and we would like to verify or examine these
claims.
This is a kind of problem that hypothesis testing is designed to solve.

9.2 Hypothesis tests for two population means


We have two types of samples here:
• Paired samples: each case must have scores on two variables and it is applicable to two types of
studies, repeated-measures (e.g. weights before and after a diet plan) and matched-subjects designs
(e.g. measurements on twins or child/parent pairs).
• Independent samples: two samples are called independent samples if the sample selected from one
of the populations has no effect on (holds no information about) the sample selected from the other
population.

In order to compare two population means, we are going to test the null hypothesis

H0 : µ1 = µ2

against one of the following alternatives:


• H1 : µ1 ̸= µ2 or µ1 − µ2 ̸= 0 (Two-tailed)
• H1 : µ1 < µ2 or µ1 − µ2 < 0 (Left-tailed)
• H1 : µ1 > µ2 or µ1 − µ2 > 0 (Right-tailed)

9.3 Comparing two means: Paired (related) samples


• Assumptions: the paired differences, d = x1 − x2 , are normally distributed.
• Test statistic: Paired t-test
$$t = \frac{\bar{d}}{s_d/\sqrt{n}}$$
where $\bar{d} = \frac{1}{n}\sum d_i$ and $s_d^2 = \frac{1}{n-1}\sum (d_i-\bar{d})^2$.

• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$\bar{d} \pm t_{\alpha/2}\, s_d/\sqrt{n}$$
where tα/2 is the α/2 critical value from the t-distribution with df = n − 1

9.4 Comparing two means: Independent samples
In order to test H0 : µ1 = µ2 for two independent samples, we use one of the following test statistics, choosing the one whose assumptions are satisfied. Let σ1 and σ2 be the standard deviations of population 1 and population 2, respectively.

9.4.1 z-test
• Assumptions: σ1 and σ2 are known and we have large samples (n1 ≥ 30, n2 ≥ 30)
• Test statistic: z-test
$$z = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(\sigma_1^2/n_1)+(\sigma_2^2/n_2)}}$$

• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$(\bar{x}_1-\bar{x}_2) \pm z_{\alpha/2}\sqrt{(\sigma_1^2/n_1)+(\sigma_2^2/n_2)}$$
where zα/2 is the α/2 critical value from the standard normal distribution.

9.4.2 Pooled t-test


• Assumptions: Normal populations, σ1 and σ2 are unknown but equal (σ1 = σ2 )
• Test statistic: Pooled t-test
$$t = \frac{\bar{x}_1-\bar{x}_2}{s_p\sqrt{(1/n_1)+(1/n_2)}}$$
has a t-distribution with df = n1 + n2 − 2, where $s_p = \sqrt{\frac{(n_1-1)s_1^2+(n_2-1)s_2^2}{n_1+n_2-2}}$.
• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2}\, s_p\sqrt{(1/n_1)+(1/n_2)}$$
where tα/2 is the α/2 critical value from the t-distribution with df = n1 + n2 − 2.

9.4.3 Non-Pooled t-test


• Assumptions: Normal populations, σ1 and σ2 are unknown and unequal (σ1 ̸= σ2 )
• Test statistic: Non-Pooled t-test
$$t = \frac{\bar{x}_1-\bar{x}_2}{\sqrt{(s_1^2/n_1)+(s_2^2/n_2)}}$$
has a t-distribution with
$$df = \Delta = \frac{\left[(s_1^2/n_1)+(s_2^2/n_2)\right]^2}{\dfrac{(s_1^2/n_1)^2}{n_1-1}+\dfrac{(s_2^2/n_2)^2}{n_2-1}}$$

• 100(1 − α)% confidence intervals for the difference between two population means µ1 − µ2 are
$$(\bar{x}_1-\bar{x}_2) \pm t_{\alpha/2}\sqrt{(s_1^2/n_1)+(s_2^2/n_2)}$$
where tα/2 is the α/2 critical value from the t-distribution with df = ∆.
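The pooled and non-pooled statistics can be computed by hand and compared with t.test, which selects between them through its var.equal argument. The two samples below are simulated purely for illustration.

```r
set.seed(7)                                # arbitrary seed
x1 <- rnorm(12, mean = 50, sd = 5)         # hypothetical sample 1
x2 <- rnorm(15, mean = 46, sd = 9)         # hypothetical sample 2
n1 <- length(x1); n2 <- length(x2)
v1 <- var(x1) / n1; v2 <- var(x2) / n2     # s1^2/n1 and s2^2/n2

# Non-pooled t statistic and its degrees of freedom (Delta)
t_welch  <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Pooled t statistic with df = n1 + n2 - 2
sp <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
t_pooled <- (mean(x1) - mean(x2)) / (sp * sqrt(1 / n1 + 1 / n2))

welch  <- t.test(x1, x2, var.equal = FALSE)   # non-pooled test
pooled <- t.test(x1, x2, var.equal = TRUE)    # pooled test
c(t_welch = t_welch, df_welch = df_welch, t_pooled = t_pooled)
```

Note that ∆ is generally not an integer; R uses the fractional value directly.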

9.4.4 Levene’s Test for Equality of Variances
In order to choose between Pooled t-test and Non-Pooled t-test, we need to check the assumption that the
two populations have equal (but unknown) variances. That is, test the null hypothesis that H0 : σ12 = σ22
against the alternative that H1 : σ12 ̸= σ22 .
Under H0, the test statistic of Levene’s test approximately follows an F distribution with 1 and n1 + n2 − 2 degrees of freedom.
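One way to see where the F distribution comes from: Levene's statistic is the F statistic of a one-way ANOVA applied to the absolute deviations of each observation from its group centre. The sketch below uses base R only, simulated data, and deviations from the group means; the choice of centre (mean rather than median) is an assumption, as the notes do not specify it.

```r
set.seed(11)                                   # arbitrary seed
x1 <- rnorm(12, mean = 50, sd = 5)             # hypothetical sample 1
x2 <- rnorm(15, mean = 46, sd = 9)             # hypothetical sample 2

values <- c(x1, x2)
group  <- factor(rep(c("g1", "g2"), times = c(length(x1), length(x2))))

abs_dev <- abs(values - ave(values, group, FUN = mean))   # |x - group mean|
lev <- anova(lm(abs_dev ~ group))              # one-way ANOVA on the absolute deviations

lev$Df                                         # 1 and n1 + n2 - 2
c(F = lev[["F value"]][1], p = lev[["Pr(>F)"]][1])
```

A small p-value would point towards the non-pooled t-test; otherwise the pooled t-test is a reasonable choice.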

9.5 Critical-value approach to hypothesis testing


1. State the null and alternative hypotheses
2. Decide on the significance level α
3. Compute the value of the test statistic
4. Determine the critical value(s)
5. If the value of the test statistic falls in the rejection region, reject H0 ; otherwise do not reject H0 .
6. Interpret the result of the hypothesis test.
We can replace Steps 4 and 5 by using the p-value. A common rule of thumb is that we reject the null
hypothesis if the p-value is less than or equal to the significance level α.
For z-test:

For any specific significance level α, one can obtain these critical values ±zα/2 and ±zα from the standard
normal distribution table. If the value of the test statistic falls in the rejection region, reject H0 ; otherwise
do not reject H0 .

z0.10 z0.05 z0.025 z0.01 z0.005


1.282 1.645 1.96 2.326 2.576

For t-test:

For any specific significance level α, one can obtain these critical values ±tα/2 and ±tα from the T distribution
table. For example, for df = 9 and α = .05, the critical values are ±t0.025 = ±2.262 and ±t0.05 = ±1.833.

9.6 Example
In a study of the effect of cigarette smoking on blood clotting, blood samples were gathered from 11 individuals
before and after smoking a cigarette and the level of platelet aggregation in the blood was measured. Does
smoking affect platelet aggregation?

before after d
25 27 2
25 29 4
27 37 10
44 56 12
30 46 16
67 82 15
53 57 4
53 80 27
52 61 9
60 59 -1
28 43 15

$$\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i = 10.27, \qquad s_d = 7.98, \qquad s_{\bar{d}} = \frac{s_d}{\sqrt{n}} = \frac{7.98}{\sqrt{11}} = 2.40$$

At the 90% level (α = 0.10), the critical value is t10,0.05 = 1.812, and so a 90% confidence interval is
$$\bar{d} \pm 1.812\,(s_d/\sqrt{n}) = 10.27 \pm 1.812 \times 2.40 = [5.91, 14.63]$$
which clearly excludes 0.


To test the null hypothesis that the means before and after are the same, that is H0 : µbefore = µafter against H1 : µbefore ≠ µafter, we compute
$$t = \frac{\bar{d}}{s_d/\sqrt{n}} = \frac{10.27}{2.40} = 4.28$$
Since |t| = 4.28 > 1.812, we reject H0.

before<-c(25,25,27,44,30,67,53,53,52,60,28)
after<-c(27,29,37,56,46,82,57,80,61,59,43)
d<-after-before
qt(0.1/2, df=10)
## [1] -1.812461
t.test(after, before, alternative = "two.sided", paired = TRUE, conf.level = 0.90)
##
## Paired t-test
##
## data: after and before
## t = 4.2716, df = 10, p-value = 0.001633
## alternative hypothesis: true difference in means is not equal to 0
## 90 percent confidence interval:
## 5.913967 14.631488
## sample estimates:
## mean of the differences
## 10.27273
hist(d,main="",col = '#61B2F2')

qqnorm(d, pch = 1)
qqline(d, col = "steelblue", lwd = 2)

10 Nonparametric Tests
10.1 Wilcoxon signed-rank test (Paired samples)
If the population of all paired differences d is symmetric but not necessarily normal, then we should use a
nonparametric test called Wilcoxon signed-rank test in order to compare the two populations, i.e. to test
H0 : no group difference.
To calculate the Wilcoxon signed-rank test statistic:
• Calculate all paired differences.
• Rank the absolute differences, that is ignoring the sign, after excluding the zeros.
• Sum the ranks of the positive and negative differences.
• The Wilcoxon signed-rank test statistic V is the minimum of these two sums. That is

V = min(V + , V − )

where V + is the sum of the ranks of the absolute differences for all pairs with positive difference, and
V − is the equivalent for negative differences.
• We can then compare V to the critical value, T , for a given significance level, α , and number of
non-zero differences, n, from the statistical table.
• We reject H0 at level α if V < T .
Under H0 and assuming no ties, V has the following properties:
• E[V] = µV = n(n + 1)/4.
• Var[V] = σV² = n(n + 1)(2n + 1)/24.
• The distribution of V is symmetric about µV.
• For large n, V ∼ N(µV, σV²).
So the standardized version of this test statistic is
$$Z = \frac{V - \frac{n(n+1)}{4}}{\sqrt{\frac{n(n+1)(2n+1)}{24}}}$$

10.2 Example
Consider a sample of five students’ grades in Finance and Accounting. We are interested in testing whether the students’ grades in Finance are lower than their grades in Accounting, so we have a left-tailed test. Use α = 10%.

x1 x2 x1 − x2 rank of |x1 − x2 |
73 88 -15 3
51 60 -9 2
85 65 20 4
65 66 -1 1
70 70 0 -

The Wilcoxon signed-rank test statistic has value V = min(V+, V−) = min(4, 6) = 4.


We compare this value to the critical value T = 1 obtained using R, qsignrank(0.1,4), or we can use the
p-value as below (using R).

x1<-c(73,51,85,65,70)
x2<-c(88,60,65,66,70)
wilcox.test(x1, x2, paired = TRUE, alternative = "less")
## Warning in wilcox.test(x1, x2, paired = TRUE, alternative = "less"):
## cannot compute exact p-value with zeroes
##
## Wilcoxon signed rank test with continuity correction
##
## data: x1 and x2
## V = 4, p-value = 0.4276
## alternative hypothesis: true location shift is less than 0

As the p-value is large we do not reject H0 .
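The steps of the signed-rank calculation can be traced in R. The sketch below recomputes V+ and V− for this example and then applies the normal approximation with a continuity correction of 0.5, which reproduces the p-value reported above. Note that wilcox.test reports the sum of the positive ranks as V, which here coincides with the minimum.

```r
x1 <- c(73, 51, 85, 65, 70)
x2 <- c(88, 60, 65, 66, 70)

d <- x1 - x2
d <- d[d != 0]                     # drop zero differences
r <- rank(abs(d))                  # rank the absolute differences

v_plus  <- sum(r[d > 0])           # sum of ranks of positive differences
v_minus <- sum(r[d < 0])           # sum of ranks of negative differences
n <- length(d)                     # number of non-zero differences

mu_v <- n * (n + 1) / 4
sd_v <- sqrt(n * (n + 1) * (2 * n + 1) / 24)

# Left-tailed p-value from the normal approximation with continuity correction
z <- (v_plus - mu_v + 0.5) / sd_v
p <- pnorm(z)
c(v_plus = v_plus, v_minus = v_minus, p = round(p, 4))
```

With only four non-zero differences the normal approximation is rough, so the comparison with the critical value T remains the safer check.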

10.3 Wilcoxon rank-sum test (Independent samples)


If the two (independent) samples are not normally distributed then we should use a nonparametric test called
the Wilcoxon rank-sum test or alternatively the Mann-Whitney U test.
To calculate the Wilcoxon rank-sum test:
• First combine the two samples into one sample.
• Rank the combined sample.
• Calculate the sum of ranks corresponding to the first sample (we can also choose the second sample).
• The Wilcoxon rank-sum test statistic is the sum of ranks of the first sample.

Let the ranks of the first sample in the combined sample be r1 , . . . , rn which are all integers from the set
{1, . . . , N }, where N = n + m.
The Wilcoxon rank-sum test statistic is then
$$W = \sum_{i=1}^{n} r_i$$

The Mann-Whitney U test statistic is
$$U = W - \frac{n(n+1)}{2}$$
where n is the number of observations from the first sample, and m is the number of observations from the second sample.
When the samples are both large, the distribution of the Wilcoxon rank-sum statistic is approximately Normal.
For large n and m, under the null hypothesis of no group differences we have:
$$E[W] = \mu_W = \tfrac{1}{2}\,n(n+m+1), \qquad Var[W] = \sigma_W^2 = \tfrac{1}{12}\,nm(n+m+1), \qquad W \sim N(\mu_W, \sigma_W^2)$$
So for large samples, we can use these values to standardise W and use standard Normal tables to construct
confidence intervals and test hypotheses.
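The quantities above can be computed directly in R. The sketch below uses two small made-up samples without ties; it forms W and U by hand and standardises W with the large-sample mean and variance. With samples this small the normal approximation is only illustrative.

```r
x <- c(12, 19, 23, 31)            # hypothetical first sample (n = 4)
y <- c(15, 27, 35, 40, 44)        # hypothetical second sample (m = 5)
n <- length(x); m <- length(y)

r <- rank(c(x, y))[seq_len(n)]    # ranks of the first sample within the combined sample
W <- sum(r)                       # Wilcoxon rank-sum statistic
U <- W - n * (n + 1) / 2          # Mann-Whitney U, the value R labels "W"

mu_W <- n * (n + m + 1) / 2
sd_W <- sqrt(n * m * (n + m + 1) / 12)
z <- (W - mu_W) / sd_W            # standardised statistic for the normal approximation

c(W = W, U = U, z = round(z, 3))
```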

10.4 Example
Suppose we have two groups of salaries, in thousands of pounds, of women and men. Test the claim that, on average, women earn less than men, so again we have a left-sided test. Use α = 5%.

Women Men
16 18
30 45
25 36
65 28
70 40

• First we rank the combined sample.

Combined sample Rank


16 1
18 2
25 3
28 4
30 5
36 6
40 7
45 8

• We will consider women salaries, and the sum of ranks related to the women’s group is 1 + 3 + 5 = 9.
• For n = 3, m = 5 and α = 0.05, we can obtain the critical values from the table, so we have TL = 8 (as
we have a left-sided test)
• Since W = 9 is not less than TL = 8, we do not reject H0.
• Notice that the value given by R is the Mann-Whitney U statistic, which is given by U = 9 − 3(3 + 1)/2 = 9 − 6 = 3.

Or we can use R as follows:


w<-c(16,30,25)
m<-c(18,45,36,28)
wilcox.test(w, m, alternative = "less")
##
## Exact test
##
## data: w and m
## W = 3, p-value = 0.2
## alternative hypothesis: true location shift is less than 0

Again the p-value is large so we do not reject H0 .

Figure 1: [Link]

11 Correlation
11.1 Correlation and Causation
A note so important it gets its own chapter in the notes: correlation and causation are not the same thing.
Ice-cream sales are correlated with shark attacks, but this means neither that ice-cream attracts sharks, nor
that shark attacks make any survivors crave ice-cream. What’s happening is that both ice-cream sales and
shark attacks increase with temperature, which is why a correlation exists. Those two relationships really are
causal - more people want ice-creams as it gets hotter, and sharks become more active in warmer weather.
In this example, temperature is what we call a confounder, an unrecorded variable which is partially or
fully responsible for a relationship we have found between recorded variables. Note that this means the
same variable can be a confounder or not be a confounder, depending on whether we have recorded it.
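The effect of a confounder can be mimicked with simulated data. In the sketch below, ice-cream sales and shark attacks both depend on temperature and on nothing else; all coefficients are invented for illustration. The raw correlation is strong, but it almost vanishes once temperature is accounted for by correlating the residuals of each variable after regressing on temperature.

```r
set.seed(2025)                         # arbitrary seed
n <- 2000
temp  <- rnorm(n)                      # simulated temperature (the confounder)
ice   <- 2.0 * temp + rnorm(n)         # ice-cream sales driven by temperature only
shark <- 1.5 * temp + rnorm(n)         # shark attacks driven by temperature only

raw_cor <- cor(ice, shark)             # correlation when temperature is ignored
adj_cor <- cor(resid(lm(ice ~ temp)),  # correlation after removing the part of each
               resid(lm(shark ~ temp)))#   variable explained by temperature

round(c(raw = raw_cor, adjusted = adj_cor), 3)
```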

12 Simple regression: Introduction
12.1 Motivation
Predicting the Price of a used car

12.2 Simple linear regression


Simple linear regression (population)
Y = β0 + β1 x + ϵ
In our example:
P rice = β0 + β1 Age + ϵ

Simple linear regression (sample)


ŷ = b0 + b1 x
where the coefficient β0 (and its estimate b0 or β̂0 ) refers to the y-intercept or simply the intercept or the
constant of the regression line, and the coefficient β1 (and its estimate b1 or β̂1 ) refers to the slope of the
regression line.

12.3 Least-Squares criterion


• The least-squares criterion is that the line that best fits a set of data points is the one having the
smallest possible sum of squared errors. The ‘errors’ are the vertical distances of the data points to the
line.
• We need to use the data to estimate the values of the parameters β0 and β1 , i.e. to fit a straight line to
the set of points {(xi , yi )}. There are many straight lines we could use, so we need some idea of which
is best. Clearly, a bad straight line model would be one that had many large errors, and conversely, a
good straight line model will have, on average, small errors. We quantify this by the sum of squares of
the errors:
$$Q(\beta_0,\beta_1) = \sum_{i=1}^{n}\epsilon_i^2 = \sum_{i=1}^{n}\left[y_i-(\beta_0+\beta_1 x_i)\right]^2$$

then the “line of best fit” will correspond to the line with the values of β0 and β1 that minimise Q(β0, β1).

• The regression line is the line that fits a set of data points according to the least squares criterion.
• The regression equation is the equation of the regression line.
• The regression equation for a set of n data points is ŷ = b0 + b1 x, where
$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sum (x_i-\bar{x})^2} \quad \text{and} \quad b_0 = \bar{y} - b_1\bar{x}$$

• y is the dependent variable (or response variable) and x is the independent variable (predictor variable
or explanatory variable).
• b0 is called the y-intercept and b1 is called the slope.

SSE and the standard error


The least-squares regression line minimizes the error sum of squares
$$SSE = \sum e_i^2 = \sum (y_i-\hat{y}_i)^2$$

The standard error of the estimate, $s_e = \sqrt{SSE/(n-2)}$, indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable.

12.4 Example: used cars (cont.)


The table below displays data on Age (in years) and Price (in hundreds of dollars) for a sample of cars of a particular make and model (Weiss, 2012).

Price (y) Age (x)
85 5
103 4
70 6
82 5
89 5
98 5
66 6
95 6
169 2
70 7
48 7

• For our example, age is the predictor variable and price is the response variable.
• The regression equation is ŷ = 195.47 − 20.26 x, where the slope b1 = −20.26 and the intercept
b0 = 195.47
• Prediction: for x = 4, that is we would like to predict the price of a 4-year-old car,

ŷ = 195.47 − 20.26(4) = 114.43 or $11443

Back to our used cars example, we want to find the “best line” through the data points, which can be used to
predict prices of used cars based on their age.
First we need to enter the data in R.
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)

## 'data.frame': 11 obs. of 2 variables:


## $ Price: num 85 103 70 82 89 98 66 95 169 70 ...
## $ Age : num 5 4 6 5 5 5 6 6 2 7 ...
cor(Age, Price, method = "pearson")

## [1] -0.9237821

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.3.2


ggplot(carSales, aes(x=Age, y=Price)) +
geom_point()+
geom_smooth(method=lm, formula= y~x, se=TRUE)

12.5 Prediction
# simple linear regression
reg<-lm(Price~Age)
print(reg)

##
## Call:
## lm(formula = Price ~ Age)
##
## Coefficients:
## (Intercept) Age
## 195.47 -20.26
To predict the price of a 4-year-old car (x = 4):

ŷ = 195.47 − 20.26(4) = 114.43

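The coefficients reported by lm can be reproduced from the least-squares formulas b1 = Sxy/Sxx and b0 = ȳ − b1 x̄. The sketch below does this for the used-car data and also computes the standard error of the estimate and the prediction for a 4-year-old car; the object names are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

Sxx <- sum((Age - mean(Age))^2)
Sxy <- sum((Age - mean(Age)) * (Price - mean(Price)))
b1 <- Sxy / Sxx                          # slope
b0 <- mean(Price) - b1 * mean(Age)       # intercept

fit <- lm(Price ~ Age)                   # same line, fitted by R
SSE <- sum(resid(fit)^2)                 # error sum of squares
se  <- sqrt(SSE / (length(Price) - 2))   # standard error of the estimate

pred <- predict(fit, newdata = data.frame(Age = 4))
round(c(b0 = b0, b1 = b1, se = se, pred = unname(pred)), 2)
```

predict uses the unrounded coefficients, so its answer can differ in the second decimal place from a hand calculation based on 195.47 and −20.26.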
13 Simple Regression: Coefficient of Determination
13.1 Extrapolation
• Within the range of the observed values of the predictor variable, we can reasonably use the regression
equation to make predictions for the response variable.
• However, to do so outside the range, which is called Extrapolation, may not be reasonable because
the linear relationship between the predictor and response variables may not hold here.
• To predict the price of an 11-year-old car, ŷ = 195.47 − 20.26(11) = −27.39, or −$2739. This result is unrealistic, as no one is going to pay us $2739 to take away their 11-year-old car.

13.2 Outliers and influential observations


• Recall that an outlier is an observation that lies outside the overall pattern of the data. In the context
of regression, an outlier is a data point that lies far from the regression line, relative to the other data
points.
• An influential observation is a data point whose removal causes the regression equation (and line) to
change considerably.
• From the scatterplot, it seems that the data point (2,169) might be an influential observation. Removing
that data point and recalculating the regression equation yields ŷ = 160.33 − 14.24 x.
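The effect of removing the point (2, 169) can be checked by refitting the model without it. A short sketch with illustrative object names:

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

keep <- !(Age == 2 & Price == 169)        # drop the suspected influential observation

fit_all  <- lm(Price ~ Age)
fit_drop <- lm(Price[keep] ~ Age[keep])

# Compare intercept and slope with and without the point
round(rbind(all = coef(fit_all), dropped = coef(fit_drop)), 2)
```

Both the intercept and the slope change substantially, which is what marks the observation as influential.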

13.3 Coefficient of determination

• The total variation in the observed values of the response variable, $SST = \sum (y_i-\bar{y})^2$, can be partitioned into two components:
– The variation in the observed values of the response variable explained by the regression: $SSR = \sum (\hat{y}_i-\bar{y})^2$
– The variation in the observed values of the response variable not explained by the regression: $SSE = \sum (y_i-\hat{y}_i)^2$

• The coefficient of determination, R² (or R-square), is the proportion of the variation in the observed values of the response variable explained by the regression, which is given by
$$R^2 = \frac{SSR}{SST} = \frac{SST-SSE}{SST} = 1-\frac{SSE}{SST}$$
where SST = SSR + SSE. R² is a descriptive measure of the utility of the regression equation for making predictions.
• The coefficient of determination R2 always lies between 0 and 1. A value of R2 near 0 suggests that the
regression equation is not very useful for making predictions, whereas a value of R2 near 1 suggests
that the regression equation is quite useful for making predictions.
• For a simple linear regression (one independent variable) ONLY, R2 is the square of the Bravais
correlation coefficient, r.
• Adjusted R² is a modification of R² which takes into account the number of independent variables, say k. In a simple linear regression k = 1. Adjusted-R² increases only when a significantly related independent variable is added to the model. Adjusted-R² has a crucial role in the process of model building. Adjusted-R² is given by
$$\text{Adjusted-}R^2 = 1-(1-R^2)\,\frac{n-1}{n-k-1}$$
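For the used-car data, R² and adjusted R² can be computed from the sums of squares and compared with the values R reports. The sketch below also checks that, in simple linear regression, R² equals the squared correlation coefficient.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
fit <- lm(Price ~ Age)

SST <- sum((Price - mean(Price))^2)       # total variation
SSE <- sum(resid(fit)^2)                  # unexplained variation
r2  <- 1 - SSE / SST

n <- length(Price); k <- 1                # one independent variable
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)

round(c(R2 = r2, adjusted = adj_r2, r_squared = cor(Age, Price)^2), 4)
```

About 85% of the variation in price is explained by age in this sample.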

13.4 Notation used in regression

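Quantity    Defining formula                       Computing formula
Sxx         $\sum (x_i-\bar{x})^2$                 $\sum x_i^2 - n\bar{x}^2$
Sxy         $\sum (x_i-\bar{x})(y_i-\bar{y})$      $\sum x_i y_i - n\bar{x}\bar{y}$
Syy         $\sum (y_i-\bar{y})^2$                 $\sum y_i^2 - n\bar{y}^2$

where $\bar{x} = \sum x_i / n$ and $\bar{y} = \sum y_i / n$. And,
$$SST = S_{yy}, \qquad SSR = \frac{S_{xy}^2}{S_{xx}}, \qquad SSE = S_{yy}-\frac{S_{xy}^2}{S_{xx}}$$
and SST = SSR + SSE.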

13.5 Bravais correlation coefficient


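The Bravais correlation coefficient (r), also known as the Pearson correlation coefficient, is a measure of the strength and the direction of a linear relationship between two variables in the sample,
$$r = \frac{\sum (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum (x_i-\bar{x})^2 \sum (y_i-\bar{y})^2}}$$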

where r always lies between -1 and 1. Values of r near -1 or 1 indicate a strong linear relationship between
the variables whereas values of r near 0 indicate a weak linear relationship between variables. If r is zero the
variables are linearly uncorrelated, that is there is no linear relationship between the two variables.

13.6 Hypothesis testing for the population correlation coefficient ρ
Assumptions:
• The sample of paired (x, y) data is a random sample.
• The pairs of (x, y) data have a bivariate normal distribution.
The null hypothesis
H0 : ρ = 0 (no significant correlation)
against one of the alternative hypotheses:
• H1 : ρ ≠ 0 (significant correlation) “Two-tailed test”
• H1 : ρ < 0 (significant negative correlation) “Left-tailed test”
• H1 : ρ > 0 (significant positive correlation) “Right-tailed test”
Compute the value of the test statistic:
$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \sim T_{(n-2)} \quad \text{with } df = n-2,$$

where n is the sample size.


The critical value(s) for this test can be found from T distribution table ( ±tα/2 for a two-tailed test, −tα for
a left-tailed test and tα for a right-tailed test).

• If the value of the test statistic falls in the rejection region, then reject H0 ; otherwise, do not reject H0 .
• Statistical packages report p-values rather than critical values which can be used in testing the null
hypothesis H0 .

13.7 Correlation and linear transformation


• Suppose we have a linear transformation of the two variables x and y, say x1 = ax + b and y1 = cy + d
where a > 0 and c > 0. Then the Bravais correlation coefficient between x1 and y1 is equal to Bravais
correlation coefficient between x and y.
• For our example, suppose we convert cars’ prices from dollars to pounds (say $1 = £0.75, so y1 = 0.75y),
and we left the age of the cars unchanged. Then we will find that the correlation between the age of
the car and its price in pounds is equal to the one we obtained before (i.e. the correlation between the
age and the price in dollars).
• A special linear transformation is to standardize one or both variables. That is obtaining the values
zx = (x − x̄)/sx and zy = (y − ȳ)/sy . Then the correlation between zx and zy is equal to the correlation
between x and y.
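The invariance described above can be checked numerically. The sketch below uses the used-car data from the example later in these notes; the constants in `r_linear` (a = 2, b = 1, c = 0.75, d = 10) are arbitrary illustrative values, not taken from the notes.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)  # hundreds of dollars
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)               # years

r_dollars <- cor(Age, Price)

# Dollars to pounds: y1 = 0.75 * y (c = 0.75 > 0, d = 0), Age unchanged
r_pounds <- cor(Age, 0.75 * Price)

# A general linear transformation with positive slopes (illustrative constants)
r_linear <- cor(2 * Age + 1, 0.75 * Price + 10)

# Standardizing both variables
r_std <- cor(as.numeric(scale(Age)), as.numeric(scale(Price)))

c(dollars = r_dollars, pounds = r_pounds, linear = r_linear, standardized = r_std)
```

All four values coincide. If one of the slopes were negative (for example `cor(-Age, Price)`), the magnitude would be unchanged but the sign would flip, which is why the condition a > 0 and c > 0 is stated.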

13.8 Rho correlation coefficient (rs )
• When the normality assumption for the Bravais correlation coefficient r cannot be met, or when one or both variables are ordinal, we should consider nonparametric methods such as the rho (Spearman's rho) and Kendall’s tau correlation coefficients.
• The rho correlation coefficient, rs, is obtained by first ranking the x values (and the y values) among themselves and then computing the Bravais correlation coefficient of the rank pairs. As with r, −1 ≤ rs ≤ 1.
• Because it depends only on the ranks, the rho correlation coefficient describes the strength of a monotonic relationship, whether that relationship is linear or nonlinear.
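The "rank first, then correlate" recipe can be sketched in a few lines of R, again on the used-car data from the example later in the notes. `rank()` assigns average ranks to ties, which is also what `cor()` does for `method = "spearman"`.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

# Step 1: rank each variable among its own values (ties get average ranks)
rank_age   <- rank(Age)
rank_price <- rank(Price)

# Step 2: ordinary (product-moment) correlation of the rank pairs
rs_by_hand <- cor(rank_age, rank_price)

# Built-in equivalent
rs_builtin <- cor(Age, Price, method = "spearman")

c(by_hand = rs_by_hand, builtin = rs_builtin)
```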

13.9 Kendall’s tau (τ ) correlation coefficient


• Kendall’s tau, τ , measures the concordance of the relationship between two variables, and −1 ≤ τ ≤ 1.
• Any pair of observations (xi , yi ) and (xj , yj ) are said to be concordant if both xi > xj and yi > yj or if
both xi < xj and yi < yj . And they are said to be discordant, if xi > xj and yi < yj or if xi < xj and
yi > yj . We will have n(n − 1)/2 of pairs to compare.
• Kendall’s tau (τ) correlation coefficient is defined as:

$$\tau = \frac{\text{number of concordant pairs} - \text{number of discordant pairs}}{n(n-1)/2}$$
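To make the pair counting concrete, here is a minimal sketch on a small made-up data set without ties (the vectors `x` and `y` are hypothetical, chosen only for illustration). With tied values, R's `cor(..., method = "kendall")` applies a tie correction, so it would no longer match this simple formula exactly.

```r
# Hypothetical tie-free data
x <- c(1, 2, 3, 4, 5)
y <- c(2, 1, 4, 3, 5)
n <- length(x)

# All n(n-1)/2 pairs of observation indices (one pair per column)
idx <- combn(n, 2)

dx <- x[idx[1, ]] - x[idx[2, ]]
dy <- y[idx[1, ]] - y[idx[2, ]]

concordant <- sum(dx * dy > 0)   # x and y differences have the same sign
discordant <- sum(dx * dy < 0)   # x and y differences have opposite signs

tau <- (concordant - discordant) / (n * (n - 1) / 2)
c(concordant = concordant, discordant = discordant, tau = tau)
```

Here 8 of the 10 pairs are concordant and 2 are discordant, so τ = (8 − 2)/10 = 0.6, which agrees with `cor(x, y, method = "kendall")`.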

13.10 Example: used cars


The table below displays data on Age (in years) and Price (in hundreds of dollars) for a sample of cars of a
particular make and model (Weiss, 2012).

Price (y) Age (x)


85 5
103 4
70 6
82 5
89 5
98 5
66 6
95 6
169 2
70 7
48 7

• The Bravais correlation coefficient,

$$r = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sqrt{\left[\sum x_i^2 - (\sum x_i)^2/n\right]\left[\sum y_i^2 - (\sum y_i)^2/n\right]}}$$

$$r = \frac{4732 - (58)(975)/11}{\sqrt{(326 - 58^2/11)(96129 - 975^2/11)}} = -0.924$$

The value r = −0.924 suggests a strong negative linear correlation between age and price.
• Test the hypothesis H0 : ρ = 0 (no linear correlation) against H1 : ρ < 0 (negative correlation) at significance level α = 0.05.

Compute the value of the test statistic:

$$t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{-0.924\sqrt{11-2}}{\sqrt{1-(-0.924)^2}} = -7.249$$

Since t = −7.249 < −t_{0.05} = −1.833 (df = 9), reject H0.
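The same test can be reproduced step by step in R. This is a sketch of the hand calculation, using `qt()` and `pt()` in place of the t table; because it uses the unrounded value of r, the statistic comes out as about −7.237 rather than the −7.249 obtained above from r rounded to −0.924.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n <- length(Age)

r <- cor(Age, Price)                          # sample correlation
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)     # test statistic

alpha  <- 0.05
t_crit <- qt(alpha, df = n - 2)               # left-tail critical value, -t_alpha
p_value <- pt(t_stat, df = n - 2)             # left-tailed p-value

c(r = r, t = t_stat, critical = t_crit, p = p_value)
t_stat < t_crit                               # TRUE means: reject H0
```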

Using R:
First we need to enter the data in R.
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)

## 'data.frame': 11 obs. of 2 variables:
## $ Price: num 85 103 70 82 89 98 66 95 169 70 ...
## $ Age : num 5 4 6 5 5 5 6 6 2 7 ...
Now let us plot price against age, i.e. a scatterplot.
plot(Price ~ Age, pch=16, col=2)

or we can use ggplot2 for a much nicer plot.


library(ggplot2)
# Basic scatter plot
ggplot(carSales, aes(x=Age, y=Price)) + geom_point()

From this plot it seems that there is a negative linear relationship between age and price. There are several
tools that can help us to measure this relationship more precisely.
cor.test(Age, Price,
alternative = "less",
method = "pearson", conf.level = 0.95)

##
## Pearson's product-moment correlation
##
## data: Age and Price
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.7749819
## sample estimates:
## cor
## -0.9237821
Suppose now we standardize both variables:
cor.test(scale(Age), scale(Price),
alternative = "less",
method = "pearson", conf.level = 0.95)

##
## Pearson's product-moment correlation
##
## data: scale(Age) and scale(Price)
## t = -7.2374, df = 9, p-value = 2.441e-05
## alternative hypothesis: true correlation is less than 0
## 95 percent confidence interval:
## -1.0000000 -0.7749819
## sample estimates:
## cor
## -0.9237821
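We notice that the correlation between the standardized variables equals the correlation between the original variables (r = −0.9237821). The same holds for any linear transformation with positive slopes, e.g. corr(age, price in pounds) = corr(age, price in dollars).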

We can also obtain the rho and Kendall’s tau as follows.

cor.test(Age, Price,
alternative = "less",
method = "spearman", conf.level = 0.95)

##
## Spearman's rank correlation rho
##
## data: Age and Price
## S = 403.26, p-value = 0.0007267
## alternative hypothesis: true rho is less than 0
## sample estimates:
## rho
## -0.8330014
cor.test(Age, Price,
alternative = "less",
method = "kendall", conf.level = 0.95)

##
## Kendall's rank correlation tau
##
## data: Age and Price
## z = -2.9311, p-value = 0.001689
## alternative hypothesis: true tau is less than 0
## sample estimates:
## tau
## -0.7302967

As the p-values for all three tests (Bravais, rho, Kendall) are less than α = 0.05, we reject the null hypothesis of no correlation between the age and the price, at the 5% significance level.

Now what do you think about correlation and causation?

Figure 2: [Link]

14 Simple Linear Regression: Assumptions


Recall that the simple linear regression model for Y on x is

Y = β0 + β1 x + ϵ

where

Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.
and the regression equation for a set of n data points is ŷ = b0 + b1 x, where

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
where b0 is called the y-intercept and b1 is called the slope.
The residual standard error se can be defined as

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$

se indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable.

14.1 Simple Linear Regression Assumptions (SLR)


We have a collection of n pairs of observations {(xi , yi )}, and the idea is to use them to estimate the unknown
parameters β0 and β1 .
ϵi = Yi − (β0 + β1 xi ) , i = 1, 2, . . . , n

We need to make the following key assumptions on the errors:


A. $E(\epsilon_i) = 0$ (errors have mean zero and do not depend on x)
B. $Var(\epsilon_i) = \sigma^2$ (errors have a constant variance, i.e. are homoscedastic, and do not depend on x)
C. $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are independent.
D. $\epsilon_i$ are all i.i.d. $N(0, \sigma^2)$, meaning that the errors are independent and identically distributed as Normal with mean zero and constant variance $\sigma^2$.
The above assumptions, conditional on β0 and β1, imply:
a. Linearity: $E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i$
b. Homogeneity of variance (homoscedasticity): $Var(Y_i \mid x_i) = \sigma^2$
c. Independence: $Y_1, Y_2, \ldots, Y_n$ are all independent given $x_1, x_2, \ldots, x_n$.
d. Normality: $Y_i \mid x_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$

14.2 Example: used cars (cont.)

We can see that for each age, the mean price of all cars of that age lies on the regression line E(Y |x) = β0 + β1 x. The prices of all cars of that age are assumed to be normally distributed with mean β0 + β1 x and variance σ². For example, the prices of all 4-year-old cars are assumed to be normally distributed with mean β0 + β1 (4) and variance σ².
We used the least squares method to find the best fit for this data set, and the residuals can be obtained as
$e_i = y_i - \hat{y}_i = y_i - (195.47 - 20.26 x_i)$.
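For readers who want to see where 195.47 and −20.26 come from, the sketch below computes the least squares estimates, the residuals and the residual standard error directly from the formulas, and compares them with `lm()`. Variable names such as `b0`, `b1` and `s_e` are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n <- length(Age)

Sxx <- sum((Age - mean(Age))^2)
Sxy <- sum((Age - mean(Age)) * (Price - mean(Price)))

b1 <- Sxy / Sxx                       # slope
b0 <- mean(Price) - b1 * mean(Age)    # intercept

fitted_vals <- b0 + b1 * Age
e <- Price - fitted_vals              # residuals e_i = y_i - yhat_i
s_e <- sqrt(sum(e^2) / (n - 2))       # residual standard error

round(c(b0 = b0, b1 = b1, s_e = s_e), 2)

# The same quantities from lm()
reg <- lm(Price ~ Age)
coef(reg)
```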

14.3 Residual Analysis
The easiest way to check the simple linear regression assumptions is by constructing a scatterplot of the residuals ($e_i = y_i - \hat{y}_i$) against the fitted values ($\hat{y}_i$) or against x. If the model is a good fit, then the residual plot should show an even and random scatter of the residuals.

14.3.1 Linearity
The regression needs to be linear in the parameters.

$$Y = \beta_0 + \beta_1 x + \epsilon$$

$$E(Y_i \mid x_i) = \beta_0 + \beta_1 x_i \iff E(\epsilon_i \mid x_i) = E(\epsilon_i) = 0$$

The residual plot below shows that a linear regression model is not appropriate for this data.

14.3.2 Constant error variance (homoscedasticity)
The plot shows that the spread of the residuals increases as the fitted values (or x) increase, which indicates that we have heteroskedasticity (non-constant variance). The standard errors are biased when heteroskedasticity is present, but the regression coefficients will still be unbiased.

How to detect?
• Residuals plot (fitted vs residuals)
• Goldfeld–Quandt test
• Breusch-Pagan test
How to fix?
• White’s standard errors
• Weighted least squares model
• Taking the log

14.3.3 Independent errors terms (no autocorrelation)


The problem of autocorrelation is most likely to occur in time series data, however, it can also occur in
cross-sectional data, e.g. if the model is incorrectly specified. When autocorrelation is present, the regression
coefficients will still be unbiased, however, the standard errors and test statistics are no longer valid.
An example of no autocorrelation

An example of positive autocorrelation

An example of negative autocorrelation

How to detect?

• Residuals plot
• Durbin-Watson test
• Breusch-Godfrey test
How to fix?
• Investigate omitted variables (e.g. trend, business cycle)
• Use advanced models (e.g. AR model)

14.3.4 Normality of the errors


We need the errors to be normally distributed. Normality is only required for the sampling distributions,
hypothesis testing and confidence intervals.
How to detect?
• Histogram of residuals
• Q-Q plot of residuals
• Kolmogorov–Smirnov test
• Shapiro–Wilk test
How to fix?
• Change the functional form (e.g. taking the log)
• Larger sample if possible
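As an illustration of the detection tools listed above, the following sketch applies the Shapiro-Wilk test and a normal Q-Q plot to the residuals of the used-car regression (Price on Age). Both `shapiro.test()` and `qqnorm()`/`qqline()` are in base R; the object names are illustrative, and with only 11 observations the test has little power, so it should be read alongside the plot.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

reg <- lm(Price ~ Age)
res <- resid(reg)

# Shapiro-Wilk test: H0 is that the residuals come from a Normal distribution
sw <- shapiro.test(res)
sw

# Normal Q-Q plot of the residuals with a reference line
qqnorm(res, pch = 16)
qqline(res, col = "red")
```

A small p-value would be evidence against Normality; a large one means only that no departure was detected.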

14.4 Example: Infant mortality and GDP


Let us investigate the relationship between infant mortality and the wealth of a country. We will use data on 207 countries of the world gathered by the UN in 1998 (the ‘UN’ data set, loaded below from the R package ‘carData’). We use two of its variables: the infant mortality rate in deaths per 1000 live births, and the GDP per capita in US dollars. Data values are missing for some countries, so we will remove the missing data before we fit our model.
# install.packages("carData")
library(carData)
data(UN)
options(scipen=999)
# Remove missing data
newUN<-na.omit(UN)
str(newUN)

## 'data.frame': 193 obs. of 7 variables:
## $ region : Factor w/ 8 levels "Africa","Asia",..: 2 4 1 1 5 2 3 8 4 2 ...
## $ group : Factor w/ 3 levels "oecd","other",..: 2 2 3 3 2 2 2 1 1 2 ...
## $ fertility : num 5.97 1.52 2.14 5.13 2.17 ...
## $ ppgdp : num 499 3677 4473 4322 9162 ...
## $ lifeExpF : num 49.5 80.4 75 53.2 79.9 ...
## $ pctUrban : num 23 53 67 59 93 64 47 89 68 52 ...
## $ infantMortality: num 124.5 16.6 21.5 96.2 12.3 ...
## - attr(*, "na.action")= 'omit' Named int [1:20] 4 6 21 35 38 54 67 75 77 78 ...
## ..- attr(*, "names")= chr [1:20] "American Samoa" "Anguilla" "Bermuda" "Cayman Islands" ...
fit<- lm(infantMortality ~ ppgdp, data=newUN)
summary(fit)

##
## Call:
## lm(formula = infantMortality ~ ppgdp, data = newUN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.48 -18.65 -8.59 10.86 83.59
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.3780016 2.2157454 18.675 < 0.0000000000000002 ***
## ppgdp -0.0008656 0.0001041 -8.312 0.0000000000000173 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.13 on 191 degrees of freedom
## Multiple R-squared: 0.2656, Adjusted R-squared: 0.2618
## F-statistic: 69.08 on 1 and 191 DF, p-value: 0.0000000000000173
plot(newUN$infantMortality ~ newUN$ppgdp, xlab="GDP per Capita"
, ylab="Infant mortality (per 1000 births)"
, pch=16, col="cornflowerblue")
abline(fit,col="red")

We can see from the scatterplot that the relationship between the two variables is not linear. There is a
concentration of data points at small values of GDP (many poor countries) and a concentration of data
points at small values of infant mortality (many countries with very low mortality). This suggests a skewness
to both variables which would not conform to the normality assumption. Indeed, the regression line (the red
line) we construct is a poor fit and only has an R2 of 0.266.
From the residual plot below we can see clear evidence of structure to the residuals suggesting the linear
relationship is a poor description of the data, and substantial changes in spread suggesting the assumption of
homogeneous variance is not appropriate.

# diagnostic plots
plot(fit,which=1,pch=16,col="cornflowerblue")

So we can apply a transformation to one or both variables, e.g. taking the log or adding a quadratic term. Notice that this will not violate the linearity assumption, as the regression will still be linear in the parameters. Taking the logs of both variables gives the scatterplot of the transformed data set, below, which appears to show a more promising linear structure. The quality of the regression is now improved, with an R2 value of 0.766, which is still a little weak due to the rather large spread in the data.
fit1<- lm(log(infantMortality) ~ log(ppgdp), data=newUN)
summary(fit1)

##
## Call:
## lm(formula = log(infantMortality) ~ log(ppgdp), data = newUN)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16789 -0.36738 -0.02351 0.24544 2.43503
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.10377 0.21087 38.43 <0.0000000000000002 ***
## log(ppgdp) -0.61680 0.02465 -25.02 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5281 on 191 degrees of freedom
## Multiple R-squared: 0.7662, Adjusted R-squared: 0.765
## F-statistic: 625.9 on 1 and 191 DF, p-value: < 0.00000000000000022

plot(log(newUN$infantMortality) ~ log(newUN$ppgdp), xlab="log GDP per Capita"
, ylab="Log infant mortality (per 1000 births)", pch=16, col="cornflowerblue")
abline(fit1,col="red")

We now check the residuals again. The residual plot below shows that the log transformation has corrected many of the problems seen in the earlier residual plot, and the residuals are now much closer to the expected random scatter.
# diagnostic plots
plot(fit1,which=1,pch=16,col="cornflowerblue")

Now let us check the Normality of the errors by creating a histogram and normal QQ plot for the residuals,
before and after the log transformation. The normal quantile (QQ) plot shows the sample quantiles of the
residuals against the theoretical quantiles that we would expect if these values were drawn from a Normal
distribution. If the Normal assumption holds, then we would see an approximate straight-line relationship on
the Normal quantile plot.
par(mfrow=c(2,2))
# before the log transformation.
plot(fit, which = 2,pch=16, col="cornflowerblue")
hist(resid(fit),col="cornflowerblue",main="")
# after the log transformation.
plot(fit1, which = 2, pch=16, col="hotpink3")
hist(resid(fit1),col="hotpink3",main="")

The normal quantile plot and the histogram of residuals (before the log transformation) show a strong departure from the expected approximate straight line, with curvature in the tails which reflects the skewness of the data. After the transformation, the normal quantile plot and the histogram of residuals suggest that the residuals are much closer to Normality, with some minor deviations in the tails.

15 Simple Linear Regression: Inference
15.1 Simple Linear Regression Assumptions
• Linearity of the relationship between the dependent and independent variables
• Independence of the errors (no autocorrelation)
• Constant variance of the errors (homoscedasticity)
• Normality of the error distribution.

15.2 Simple Linear Regression


The simple linear regression model for Y on x is

Y = β0 + β1 x + ϵ

where
Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.

15.3 The simple linear regression equation


• The regression equation for a set of n data points is ŷ = b0 + b1 x, where

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
• y is the dependent variable (or response variable) and x is the independent variable (predictor variable
or explanatory variable).
• b0 is called the y-intercept and b1 is called the slope.

15.4 Residual standard error, se


The residual standard error, se, is defined by

$$s_e = \sqrt{\frac{SSE}{n-2}}$$

where SSE is the error sum of squares (also known as the residual sum of squares, RSS), which can be defined as

$$SSE = \sum e_i^2 = \sum (y_i - \hat{y}_i)^2 = S_{yy} - \frac{S_{xy}^2}{S_{xx}}$$

se indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable. Under the simple linear regression assumptions, $s_e^2$ is an unbiased estimate of the error variance $\sigma^2$, and se is used as the estimate of the error standard deviation σ.

15.5 Properties of Regression Coefficients
Under the simple linear regression assumptions, the least square estimates b0 and b1 are unbiased for the β0
and β1 , respectively, i.e.
E[b0 ] = β0 and E[b1 ] = β1 .
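The variances and covariance of the least squares estimators in simple linear regression are:

$$Var[b_0] = \sigma_{b_0}^2 = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$

$$Var[b_1] = \sigma_{b_1}^2 = \frac{\sigma^2}{S_{xx}}$$

$$Cov[b_0, b_1] = \sigma_{b_0,b_1} = -\sigma^2\,\frac{\bar{x}}{S_{xx}}$$

We use se to estimate the error standard deviation σ:

$$s_{b_0}^2 = s_e^2\left(\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right)$$

$$s_{b_1}^2 = \frac{s_e^2}{S_{xx}}$$

$$s_{b_0,b_1} = -s_e^2\,\frac{\bar{x}}{S_{xx}}$$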
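These estimated variances can be checked against what R reports. The sketch below, on the used-car data, computes the standard errors of b0 and b1 and their estimated covariance from the formulas and compares them with `summary()` and `vcov()`; the names `s_b0`, `s_b1` and `s_b0b1` are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n <- length(Age)

reg <- lm(Price ~ Age)
Sxx <- sum((Age - mean(Age))^2)
s_e <- sqrt(sum(resid(reg)^2) / (n - 2))          # residual standard error

s_b0   <- s_e * sqrt(1 / n + mean(Age)^2 / Sxx)   # standard error of b0
s_b1   <- s_e / sqrt(Sxx)                         # standard error of b1
s_b0b1 <- -s_e^2 * mean(Age) / Sxx                # estimated Cov(b0, b1)

round(c(s_b0 = s_b0, s_b1 = s_b1, s_b0b1 = s_b0b1), 3)

# R's own values for comparison
summary(reg)$coefficients[, "Std. Error"]
vcov(reg)
```

The standard errors come out as roughly 15.24 and 2.80, the values shown in the regression output for this example.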

15.6 Sampling distribution of the least square estimators


For the Normal error simple linear regression model:
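$$b_0 \sim N(\beta_0, \sigma_{b_0}^2) \;\Rightarrow\; \frac{b_0 - \beta_0}{\sigma_{b_0}} \sim N(0, 1)$$

and

$$b_1 \sim N(\beta_1, \sigma_{b_1}^2) \;\Rightarrow\; \frac{b_1 - \beta_1}{\sigma_{b_1}} \sim N(0, 1)$$

When we use se to estimate the error standard deviation σ, we obtain instead:

$$\frac{b_0 - \beta_0}{s_{b_0}} \sim t_{n-2}$$

and

$$\frac{b_1 - \beta_1}{s_{b_1}} \sim t_{n-2}$$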

15.7 Degrees of Freedom


• In statistics, degrees of freedom are the number of independent pieces of information that go into the
estimate of a particular parameter.
• Typically, the degrees of freedom of an estimate of a parameter are equal to the number of independent
observations that go into the estimate, minus the number of parameters that are estimated as intermediate
steps in the estimation of the parameter itself.

• The sample variance has n − 1 degrees of freedom, since it is computed from n pieces of data, minus the 1 parameter estimated as an intermediate step, the sample mean. Equivalently, having estimated the sample mean we only have n − 1 independent pieces of data left: given the sample mean and any n − 1 of the observations, we can determine the value of the remaining observation exactly.

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$$

• In linear regression, the degrees of freedom of the residuals is df = n − k∗, where k∗ is the number of estimated parameters (including the intercept). So for simple linear regression, we are estimating β0 and β1, thus df = n − 2.

15.8 Inference for the intercept β0


• Hypotheses:
H0 : β0 = 0 against H1 : β0 ≠ 0

• Test statistic:

$$t = \frac{b_0}{s_{b_0}}$$

has a t-distribution with df = n − 2, where $s_{b_0}$ is the standard error of b0, given by

$$s_{b_0} = s_e\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}$$

and

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$

We reject H0 at level α if $|t| > t_{\alpha/2}$ with df = n − 2.
• 100(1−α)% confidence interval for β0,

$$b_0 \pm t_{\alpha/2}\, s_{b_0}$$

where $t_{\alpha/2}$ is the critical value obtained from the t-distribution table with df = n − 2.

15.9 Inference for the slope β1


• Hypotheses:
H0 : β1 = 0 against H1 : β1 ≠ 0

• Test statistic:

$$t = \frac{b_1}{s_{b_1}}$$

has a t-distribution with df = n − 2, where $s_{b_1}$ is the standard error of b1, given by

$$s_{b_1} = \frac{s_e}{\sqrt{S_{xx}}}$$

and

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$

We reject H0 at level α if $|t| > t_{\alpha/2}$ with df = n − 2.
• 100(1−α)% confidence interval for β1,

$$b_1 \pm t_{\alpha/2}\, s_{b_1}$$

where $t_{\alpha/2}$ is the critical value obtained from the t-distribution table with df = n − 2.
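As a worked illustration of the slope inference, the sketch below builds the t statistic and the 95% confidence interval for β1 by hand for the used-car data and compares the interval with `confint()`. The object names are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n <- length(Age)

reg  <- lm(Price ~ Age)
b1   <- unname(coef(reg)["Age"])
s_b1 <- summary(reg)$coefficients["Age", "Std. Error"]

alpha  <- 0.05
t_stat <- b1 / s_b1                        # test statistic for H0: beta1 = 0
t_crit <- qt(1 - alpha / 2, df = n - 2)    # two-sided critical value

ci_slope <- c(lower = b1 - t_crit * s_b1,
              upper = b1 + t_crit * s_b1)

c(t = t_stat, critical = t_crit)
ci_slope
confint(reg, level = 0.95)["Age", ]        # should agree with ci_slope
```

Since |t| exceeds the critical value and the interval excludes zero, H0 : β1 = 0 is rejected at the 5% level; the two statements are equivalent.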

15.10 How useful is the regression model?


Goodness of fit test
• We test the null hypothesis H0 : β1 = 0 against H1 : β1 ≠ 0. The F-statistic

$$F = \frac{MSR}{MSE} = \frac{SSR}{SSE/(n-2)}$$

has an F-distribution with degrees of freedom df1 = 1 and df2 = n − 2.
• We reject H0, at level α, if F > Fα(df1, df2).
• For simple linear regression ONLY, the F-test is equivalent to the two-sided t-test for β1, since F = t².
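The equivalence of the F-test and the t-test in simple linear regression can be verified numerically. This sketch computes F from the sums of squares for the used-car data and compares it with the squared t statistic of the slope; the variable names are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
n <- length(Age)

reg <- lm(Price ~ Age)

SST <- sum((Price - mean(Price))^2)
SSE <- sum(resid(reg)^2)
SSR <- SST - SSE

F_stat  <- SSR / (SSE / (n - 2))                         # MSR / MSE with df1 = 1
t_slope <- summary(reg)$coefficients["Age", "t value"]   # t statistic for b1

c(F = F_stat, t_squared = t_slope^2)

# p-value of the F-test, from the F(1, n - 2) distribution
p_F <- pf(F_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)
p_F
```

The F p-value equals the two-sided p-value of the slope's t-test, which is why the two appear as the same number in the `summary()` output.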

15.11 Example: used cars (cont.)


Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price,Age)
str(carSales)

## 'data.frame': 11 obs. of 2 variables:
## $ Price: num 85 103 70 82 89 98 66 95 169 70 ...
## $ Age : num 5 4 6 5 5 5 6 6 2 7 ...

# simple linear regression
reg<-lm(Price~Age)
summary(reg)

##
## Call:
## lm(formula = Price ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.162 -8.531 -5.162 8.946 21.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.47 15.24 12.826 0.000000436 ***
## Age -20.26 2.80 -7.237 0.000048819 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.58 on 9 degrees of freedom
## Multiple R-squared: 0.8534, Adjusted R-squared: 0.8371
## F-statistic: 52.38 on 1 and 9 DF, p-value: 0.00004882

# To obtain the confidence intervals


confint(reg, level=0.95)

## 2.5 % 97.5 %
## (Intercept) 160.99243 229.94451
## Age -26.59419 -13.92833

15.12 R output

15.13 Simple Linear Regression: Confidence and Prediction intervals


Earlier we have introduced the simple linear regression as a basic statistical model for the relationship between
two random variables. We used the least square method for estimating the regression parameters.

Recall that the simple linear regression model for Y on x is

Y = β0 + β1 x + ϵ

where
Y : the dependent or response variable
x : the independent or predictor variable, assumed known
β0 , β1 : the regression parameters, the intercept and slope of the regression line
ϵ : the random regression error around the line.
and the regression equation for a set of n data points is ŷ = b0 + b1 x, where

$$b_1 = \frac{S_{xy}}{S_{xx}} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

and

$$b_0 = \bar{y} - b_1 \bar{x}$$
where b0 is called the y-intercept and b1 is called the slope.

Under the simple linear regression assumptions, the residual standard error se is used to estimate the error standard deviation σ (its square, $s_e^2$, is an unbiased estimate of $\sigma^2$), where

$$s_e = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n-2}}$$

se indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable.

Below we will see how we can use these least squares estimates for prediction. First, we will consider inference for the conditional mean of the response variable Y given a particular value of the independent variable x; let us call this particular value x∗. Next we will see how to predict the value of the response variable Y for a given value x∗ of the independent variable. For these confidence and prediction intervals to be valid, the usual four simple linear regression assumptions must hold.

15.14 Inference for the regression line E [Y |x∗ ]


Suppose we are interested in the value of the regression line at a new point x∗ . Let’s denote the unknown
true value of the regression line at x = x∗ as µ∗ . From the form of the regression line equation we have

µ∗ = µY |x∗ = E [Y |x∗ ] = β0 + β1 x∗

but β0 and β1 are unknown. We can use the least square regression equation to estimate the unknown true
value of the regression line, so we have

µ̂∗ = b0 + b1 x∗ = ŷ ∗

This is simply a point estimate for the regression line. However, in statistics, a point estimate is often not enough: we also need to express our uncertainty about it, and one way to do so is via a confidence interval.

A 100(1 − α)% confidence interval for the conditional mean µ∗ is

$$\hat{y}^* \pm t_{\alpha/2}\, s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$, and $t_{\alpha/2}$ is the α/2 critical value from the t-distribution with df = n − 2.

15.15 Inference for the response variable Y for a given x = x∗


Suppose now we are interested in predicting the value of Y ∗ if we have a new observation at x∗ .
At x = x∗, the value of Y∗ is unknown and given by

$$Y^* = \beta_0 + \beta_1 x^* + \epsilon$$

where β0, β1 and ϵ are unknown. We will use ŷ∗ = b0 + b1 x∗ as a basis for our prediction.

A 100(1 − α)% prediction interval for Y∗ at x = x∗ is

$$\hat{y}^* \pm t_{\alpha/2}\, s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

The extra '1' under the square root sign accounts for the extra variability of a single observation about the mean.
Note: we construct a confidence interval for a parameter of the population, which is the conditional mean in
this case, while we construct a prediction interval for a single value.

15.16 Example: used cars (cont.)


Estimate the mean price of all 3-year-old cars, E[Y |x = 3]:

µ̂∗ = 195.47 − 20.26(3) = 134.69 = ŷ ∗

A 95% confidence interval for the mean price of all 3-year-old cars is

$$\hat{y}^* \pm t_{\alpha/2} \times s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

$$[195.47 - 20.26(3)] \pm 2.262 \times 12.58 \sqrt{\frac{1}{11} + \frac{(3 - 5.273)^2}{(11 - 1) \times 2.018}}$$

$$134.69 \pm 16.76$$

that is

$$117.93 < \mu^* < 151.45$$

Predict the price of a 3-year-old car, Y |x = 3:

$$\hat{y}^* = 195.47 - 20.26(3) = 134.69$$

A 95% prediction interval for the price of a 3-year-old car is

$$\hat{y}^* \pm t_{\alpha/2} \times s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{S_{xx}}}$$

$$[195.47 - 20.26(3)] \pm 2.262 \times 12.58 \sqrt{1 + \frac{1}{11} + \frac{(3 - 5.273)^2}{(11 - 1) \times 2.018}}$$

$$134.69 \pm 33.025$$

that is

$$101.67 < Y^* < 167.72$$

where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = (n - 1)Var(x)$.

15.17 Regression in R
# Build linear model
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales<-data.frame(Price=Price,Age=Age)

reg <- lm(Price~Age,data=carSales)


summary(reg)

##
## Call:
## lm(formula = Price ~ Age, data = carSales)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.162 -8.531 -5.162 8.946 21.099
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 195.47 15.24 12.826 0.000000436 ***
## Age -20.26 2.80 -7.237 0.000048819 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.58 on 9 degrees of freedom
## Multiple R-squared: 0.8534, Adjusted R-squared: 0.8371
## F-statistic: 52.38 on 1 and 9 DF, p-value: 0.00004882
mean(Age)

## [1] 5.272727
var(Age)

## [1] 2.018182
qt(0.975,9)

## [1] 2.262157
newage<- data.frame(Age = 3)
predict(reg, newdata = newage, interval = "confidence")

## fit lwr upr


## 1 134.6847 117.9293 151.4401
predict(reg, newdata = newage, interval = "prediction")

## fit lwr upr


## 1 134.6847 101.6672 167.7022

We can plot the confidence and prediction intervals as follows:
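The plotting code and figure are not reproduced in this copy of the notes. One possible way to draw such a plot with base graphics is sketched below; the grid of ages, the object names (`conf_band`, `pred_band`) and the colours are illustrative choices rather than the original code.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
carSales <- data.frame(Price = Price, Age = Age)
reg <- lm(Price ~ Age, data = carSales)

# Evaluate both intervals on a fine grid of ages within the observed range
grid <- data.frame(Age = seq(min(Age), max(Age), length.out = 100))
conf_band <- predict(reg, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(reg, newdata = grid, interval = "prediction", level = 0.95)

# Scatterplot, fitted line, and the two pairs of bands
plot(Price ~ Age, data = carSales, pch = 16,
     ylim = range(pred_band, Price))
lines(grid$Age, conf_band[, "fit"], col = "black")
matlines(grid$Age, conf_band[, c("lwr", "upr")], lty = 2, col = "blue")
matlines(grid$Age, pred_band[, c("lwr", "upr")], lty = 3, col = "red")
legend("topright",
       legend = c("Fitted line", "95% confidence band", "95% prediction band"),
       lty = c(1, 2, 3), col = c("black", "blue", "red"))
```

The prediction band is wider than the confidence band at every age, reflecting the extra '1' under the square root, and both are narrowest at the mean age.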

16 Multiple Linear Regression
16.1 Multiple linear regression model
In simple linear regression, we have one dependent variable (y) and one independent variable (x). In multiple
linear regression, we have one dependent variable (y) and several independent variables (x1 , x2 , . . . , xk ).
• The multiple linear regression model, for the population, can be expressed as

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_k x_k + \epsilon$$

where ϵ is the error term.

• The corresponding least squares estimate, from the sample, of this multiple linear regression model is given by

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k$$

• The coefficient b0 (or β0 ) represents the y-intercept, that is, the value of y when x1 = x2 = . . . = xk = 0.
The coefficient bi (or βi ) (i = 1, . . . , k) is the partial slope of xi , holding all other x’s fixed. So bi (or βi )
tells us the change in y for a unit increase in xi , holding all other x’s fixed.

16.2 Example: used cars (cont.)


The table below displays data on Age (in years), Miles (in thousands of miles) and Price (in hundreds of dollars) for a sample of cars of a particular make and model.

Price (y) Age (x1 ) Miles (x2 )


85 5 57
103 4 40
70 6 77
82 5 60
89 5 49
98 5 47
66 6 58
95 6 39
169 2 8
70 7 69

48 7 89

Below is the sample correlation matrix calculated in R, along with the scatter diagram.

The scatterplot and the correlation matrix show a fairly strong negative relationship between the price of the car and both independent variables (age and miles). It is desirable to have a relationship between each independent variable and the dependent variable. However, the scatterplot also shows a positive relationship between the age and the miles, which is undesirable as it can cause multicollinearity.

17 Multiple Linear Regression: Fit and Inference


17.1 Coefficient of determination, R2 and adjusted R2
• Recall that, R2 is a measure of the proportion of the total variation in the observed values of the response
variable that is explained by the multiple linear regression in the k predictor variables x1 , x2 , . . . , xk .
• R2 will increase when an additional predictor variable is added to the model. One should not simply
select a model with many predictor variables because it has the highest R2 value, it is often good to
have a model with high R2 value but only few x’s included.
• Adjusted R2 is a modification of R2 that takes into account the number of predictor variables.
$$\text{Adjusted-}R^2 = 1 - (1 - R^2)\,\frac{n-1}{n-k-1}$$
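To see the adjustment at work, the sketch below applies the formula to the used-car model with two predictors (Age and Miles, so k = 2 and n = 11) and compares the result with the value reported by `summary()`. The object names are illustrative.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)

reg2 <- lm(Price ~ Age + Miles)
n <- length(Price)
k <- 2                                   # number of predictor variables

R2     <- summary(reg2)$r.squared
adj_R2 <- 1 - (1 - R2) * (n - 1) / (n - k - 1)

c(R2 = R2, adjusted_by_hand = adj_R2,
  adjusted_from_R = summary(reg2)$adj.r.squared)

# For comparison: R2 of the one-predictor model is never larger
R2_age_only <- summary(lm(Price ~ Age))$r.squared
R2_age_only
```

Adding Miles raises R2 (as adding any predictor does), while the adjusted value penalises the extra parameter.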

17.2 The residual standard error, se


• Recall that,
Residual = Observed value − Predicted value.

ei = yi − ŷi

• In a multiple linear regression with k predictors, the standard error of the estimate, se, is defined by

$$s_e = \sqrt{\frac{SSE}{n - (k + 1)}} \quad \text{where} \quad SSE = \sum (y_i - \hat{y}_i)^2$$

• The standard error of the estimate, se, indicates how much, on average, the observed values of the response variable differ from the predicted values of the response variable. se is the estimate of the common standard deviation σ.

17.3 Inferences about a particular predictor variable


• To test whether a particular predictor variable, say xi, is useful for predicting y we test the null hypothesis H0 : βi = 0 against H1 : βi ≠ 0.
• The test statistic

$$t = \frac{b_i}{s_{b_i}}$$

has a t-distribution with degrees of freedom df = n − (k + 1). So we reject H0, at level α, if $|t| > t_{\alpha/2}$.
• Rejection of the null hypothesis indicates that xi is useful as a predictor for y. However, failing to reject the null hypothesis suggests that xi may not be useful as a predictor of y, so we may want to consider removing this variable from the regression analysis.
• 100(1−α)% confidence interval for βi is

$$b_i \pm t_{\alpha/2}\, s_{b_i}$$

where $s_{b_i}$ is the standard error of bi.

17.4 How useful is the multiple regression model?


Goodness of fit test
To test how useful this model is, we test the null hypothesis
H0 : β1 = β2 = . . . = βk = 0, against
H1 : at least one of the βi’s is not zero.
The F-statistic

$$F = \frac{MSR}{MSE} = \frac{SSR/k}{SSE/(n - k - 1)}$$

has an F-distribution with degrees of freedom df1 = k and df2 = n − (k + 1).

We reject H0, at level α, if F > Fα(df1, df2).

17.5 Used cars example continued
Multiple regression equation: ŷ = 183.04 − 9.50x1 − 0.82x2

The predicted price for a 4-year-old car that has driven 45 thousand miles is

ŷ = 183.04 − 9.50(4) − 0.82(45) = 108.14

(as the price is recorded in units of $100, this means $10,814)
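The same prediction can be obtained from the fitted model with `predict()`. In this sketch the data frame `new_car` is an illustrative name; because `predict()` uses the unrounded coefficients, its answer (about 108.05) differs slightly from the 108.14 obtained above with coefficients rounded to two decimals.

```r
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

reg <- lm(Price ~ Age + Miles, data = carSales)

# A 4-year-old car that has been driven 45 thousand miles
new_car <- data.frame(Age = 4, Miles = 45)

pred <- predict(reg, newdata = new_car)
pred                                   # in hundreds of dollars

# The same value from the coefficients directly
by_hand <- sum(coef(reg) * c(1, 4, 45))
by_hand
```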


Extrapolation: to judge whether a prediction is an extrapolation, we need to look at the joint region of the observed predictor values (all combinations of values), not only the range of the observed values of each predictor variable separately.

17.6 Regression in R
Price<-c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age<- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles<-c(57,40,77,60,49,47,58,39,8,69,89)
carSales<-data.frame(Price=Price,Age=Age,Miles=Miles)

# Scatterplot matrix
# Customize upper panel
upper.panel<-function(x, y){
points(x,y, pch=19, col=4)
r <- round(cor(x, y), digits=3)
txt <- paste0("r = ", r)
# save the user coordinates and restore them when the panel function exits
usr <- par("usr"); on.exit(par(usr = usr))
par(usr = c(0, 1, 0, 1))
text(0.5, 0.9, txt)
}
pairs(carSales, lower.panel = NULL,
upper.panel = upper.panel)

Note that the coordinates must be restored with the named argument, par(usr = usr). Calling par(usr) with an unnamed numeric vector produces the warning "argument 1 does not name a graphical parameter" once per panel; the warning is harmless and does not prevent us from getting the figure we want, but it is easily avoided. The resulting figure is below.

reg <- lm(Price~Age+Miles,data=carSales)


summary(reg)

##
## Call:
## lm(formula = Price ~ Age + Miles, data = carSales)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -12.364  -5.243   1.028   5.926  11.545
##
## Coefficients:
##             Estimate Std. Error t value    Pr(>|t|)
## (Intercept) 183.0352    11.3476  16.130 0.000000219 ***
## Age          -9.5043     3.8742  -2.453      0.0397 *
## Miles        -0.8215     0.2552  -3.219      0.0123 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.805 on 8 degrees of freedom
## Multiple R-squared:  0.9361, Adjusted R-squared:  0.9201
## F-statistic: 58.61 on 2 and 8 DF,  p-value: 0.00001666

confint(reg, level=0.95)

##                  2.5 %      97.5 %
## (Intercept) 156.867552 209.2028630
## Age         -18.438166  -0.5703751
## Miles        -1.409991  -0.2329757

17.6.1 Summary

18 Multiple Linear Regression: Assumptions
• Linearity: For each set of values, x1 , x2 , . . . , xk , of the predictor variables, the conditional mean of the
response variable y is β0 + β1 x1 + β2 x2 + . . . + βk xk .
• Equal variance (homoscedasticity): The conditional variance of the response variable is the same
(equal to σ 2 ) for all sets of values, x1 , x2 , . . . , xk , of the predictor variables.
• Independent observations: The observations of the response variable are independent of one another.
• Normality: For each set of values, x1 , x2 , . . . , xk , of the predictor variables, the conditional distribution of
the response variable is a normal distribution.
• No multicollinearity: The predictor variables should not be highly correlated with one another.
Multicollinearity exists when two or more of the predictor variables are highly correlated.

18.0.1 Multicollinearity
• Multicollinearity refers to a situation in which two or more predictor variables in our multiple regression
model are highly (linearly) correlated.
• The least squares estimates remain unbiased but become unstable: small changes in the data can change them considerably.
• The standard errors (of the affected variables) are likely to be high.
• Overall model fit (e.g. R-square, F, prediction) is not affected.

18.0.2 Multicollinearity: Detect


• Scatterplot Matrix
• Variance Inflation Factors: the Variance Inflation Factor (VIF) for the ith predictor is
$$VIF_i = \frac{1}{1 - R_i^2}$$
where $R_i^2$ is the R-square value obtained by regressing the ith predictor on the other predictor variables.
• VIF = 1 indicates that there is no correlation between the ith predictor variable and the other predictor
variables.
• As a rule of thumb, if VIF > 5 then multicollinearity could be a problem, and it is a serious problem if
VIF > 10.
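The VIF definition can be checked by hand. The sketch below, using the Age and Miles values of the used cars data from Section 17.6, regresses each predictor on the other and applies the formula; the names `r2_age`, `vif_age`, `r2_miles` and `vif_miles` are illustrative.

```r
# Predictors from the used cars data in the notes
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)

# R-square from regressing each predictor on the other predictor
r2_age   <- summary(lm(Age ~ Miles))$r.squared
r2_miles <- summary(lm(Miles ~ Age))$r.squared

# VIF_i = 1 / (1 - R_i^2)
vif_age   <- 1 / (1 - r2_age)
vif_miles <- 1 / (1 - r2_miles)

c(Age = vif_age, Miles = vif_miles)
```

With only two predictors, both auxiliary regressions have the same R-square (the squared correlation between Age and Miles), so the two VIF values coincide. They should match the values reported by `vif(reg)` from the car package in Section 18.1.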

18.0.3 Multicollinearity: How to fix?


Ignore: if the model is going to be used for prediction only.
Remove: e.g. see if the variables are providing the same information.
Combine: combining highly correlated variables.
Advanced: e.g. Principal Components Analysis, Partial Least Squares.

18.1 Regression in R (regression assumptions)


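# Residuals vs fitted values (checks linearity and equal variance)
plot(reg, which=1, pch=19, col=4)

# Normal Q-Q plot of the residuals (checks normality)
plot(reg, which=2, pch=19, col=4)

# Variance inflation factors (requires the car package)
# install.packages("car")
library(car)
vif(reg)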

## Age Miles
## 3.907129 3.907129
The value of VIF = 3.91 indicates a moderate correlation between age and miles in the model, but since it is
below the rule-of-thumb threshold of 5, this is not a major concern.

19 Dummy Variables
We will consider the case when we have a qualitative (categorical) predictor (also known as a factor) with
two or more levels (or possible values).
Qualitative predictors with only two levels
To include a qualitative predictor in our model, we create a dummy variable that takes on two possible
numerical values, e.g. 0 and 1.
Back to our used cars example, suppose we want to add the transmission type to our linear regression model.
Let d be a dummy variable representing the car's transmission type, which takes the value 1 for a manual car
and the value 0 for an automatic car.
Again, y = Price and x1 = age, and let us not include x2 = miles for the moment.

$$d_i = \begin{cases} 1 & \text{if the $i$th car is manual,} \\ 0 & \text{if the $i$th car is automatic} \end{cases}$$

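Then we can regress price on age and transmission type as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 d + \epsilon$$

so for manual cars:

$$y = (\beta_0 + \beta_2) + \beta_1 x_1 + \epsilon$$

and for automatic cars:

$$y = \beta_0 + \beta_1 x_1 + \epsilon$$

or we can write

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 d_i + \epsilon_i = \begin{cases} (\beta_0 + \beta_2) + \beta_1 x_{1i} + \epsilon_i & \text{if the $i$th car is manual,} \\ \beta_0 + \beta_1 x_{1i} + \epsilon_i & \text{if the $i$th car is automatic} \end{cases}$$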
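In R, a two-level factor is turned into a 0/1 dummy variable automatically. The sketch below shows the coding with `model.matrix()`. The notes do not give transmission data, so the `Transmission` values here are hypothetical and invented purely to illustrate the mechanics; any estimates from this fit are meaningless.

```r
# Price and age from the used cars data in the notes
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)

# HYPOTHETICAL transmission types (not in the notes), for illustration only.
# The first level ("automatic") is the baseline, so d = 1 for manual cars.
Transmission <- factor(
  c("manual", "automatic", "manual", "manual", "automatic", "automatic",
    "manual", "automatic", "automatic", "manual", "automatic"),
  levels = c("automatic", "manual")
)

# Design matrix: the column "Transmissionmanual" is the dummy variable d
X <- model.matrix(~ Age + Transmission)
head(X)

# Fit y = beta0 + beta1 * x1 + beta2 * d + error
fit_dummy <- lm(Price ~ Age + Transmission)
names(coef(fit_dummy))
```

The coefficient on `Transmissionmanual` plays the role of β2: it shifts the intercept for manual cars while the slope on age is shared by both groups.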

Qualitative predictors with more than two levels


Suppose we now have a categorical variable with three levels, e.g. fuel type (petrol, diesel, and hybrid). So in
this case we need to create two dummy variables, d1 and d2 .

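$$d_{1i} = \begin{cases} 1 & \text{if the $i$th car has a petrol engine,} \\ 0 & \text{otherwise} \end{cases}$$

$$d_{2i} = \begin{cases} 1 & \text{if the $i$th car has a diesel engine,} \\ 0 & \text{otherwise} \end{cases}$$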

then one can regress price on age and fuel type as

y = β0 + β1 x1 + β2 d1 + β3 d2 + ϵ

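so for petrol cars:

$$y = (\beta_0 + \beta_2) + \beta_1 x_1 + \epsilon$$

for diesel cars:

$$y = (\beta_0 + \beta_3) + \beta_1 x_1 + \epsilon$$

and for hybrid cars:

$$y = \beta_0 + \beta_1 x_1 + \epsilon$$

This last model is often called the baseline model.

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 d_{1i} + \beta_3 d_{2i} + \epsilon_i = \begin{cases} (\beta_0 + \beta_2) + \beta_1 x_{1i} + \epsilon_i & \text{if the $i$th car has a petrol engine,} \\ (\beta_0 + \beta_3) + \beta_1 x_{1i} + \epsilon_i & \text{if the $i$th car has a diesel engine,} \\ \beta_0 + \beta_1 x_{1i} + \epsilon_i & \text{if the $i$th car has a hybrid engine} \end{cases}$$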
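For a factor with three levels, R creates the two dummy variables in the same way, with the first factor level acting as the baseline. The `Fuel` values below are hypothetical (the notes give no fuel data) and serve only to show the coding.

```r
# HYPOTHETICAL fuel types (not in the notes), for illustration only.
# Listing "hybrid" first makes it the baseline level.
Fuel <- factor(
  c("petrol", "diesel", "hybrid", "petrol", "petrol", "diesel",
    "hybrid", "diesel", "petrol", "hybrid", "diesel"),
  levels = c("hybrid", "petrol", "diesel")
)

# Design matrix: "Fuelpetrol" is d1 and "Fueldiesel" is d2
D <- model.matrix(~ Fuel)
D

# Hybrid cars have d1 = d2 = 0, so they are described by the baseline model
D[Fuel == "hybrid", c("Fuelpetrol", "Fueldiesel")]

# A different baseline can be chosen with relevel()
Fuel_petrol_base <- relevel(Fuel, ref = "petrol")
colnames(model.matrix(~ Fuel_petrol_base))
```

A factor with m levels always needs m − 1 dummy variables; the omitted level is absorbed into the intercept β0.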

The interaction effect


In our used car example, we concluded that both age and miles seem to be associated with the price:

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$

$$\text{Price} = \beta_0 + \beta_1\, \text{age} + \beta_2\, \text{miles} + \epsilon$$

That is, the linear regression model assumes that the average effect on price of a one-unit increase in age is
always $\beta_1$, regardless of the number of miles.
One can extend this model to allow for interaction effects by adding an interaction term, which is constructed
by computing the product of $x_1 = \text{age}$ and $x_2 = \text{miles}$, e.g. to reflect that older cars are associated with additional miles of
driving.

$$\text{Price} = \beta_0 + \beta_1\, \text{age} + \beta_2\, \text{miles} + \beta_3 (\text{age} \times \text{miles}) + \epsilon$$

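Grouping the terms that involve age, this can be rewritten as

$$\text{Price} = \beta_0 + (\beta_1 + \beta_3 \times \text{miles}) \times \text{age} + \beta_2\, \text{miles} + \epsilon$$

$$\text{Price} = \beta_0 + \tilde{\beta}_1 \times \text{age} + \beta_2\, \text{miles} + \epsilon$$

where $\tilde{\beta}_1 = \beta_1 + \beta_3 \times \text{miles}$. Since $\tilde{\beta}_1$ changes with $x_2 = \text{miles}$, the effect of $x_1 = \text{age}$ on $Y = \text{Price}$ is no
longer constant.
That is, adjusting $x_2 = \text{miles}$ will change the impact of $x_1 = \text{age}$ on $Y = \text{Price}$.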
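An interaction model of this form can be fitted in R with the `*` operator in the formula. The sketch below uses the used cars data from Section 17.6; the helper `age_slope()` is a hypothetical name introduced here to evaluate the estimated effect of age, b1 + b3 × miles, at chosen mileages.

```r
# Used cars data from the notes (price in units of $100, miles in thousands)
Price <- c(85, 103, 70, 82, 89, 98, 66, 95, 169, 70, 48)
Age   <- c(5, 4, 6, 5, 5, 5, 6, 6, 2, 7, 7)
Miles <- c(57, 40, 77, 60, 49, 47, 58, 39, 8, 69, 89)
carSales <- data.frame(Price = Price, Age = Age, Miles = Miles)

# Age * Miles expands to Age + Miles + Age:Miles (the interaction term)
fit_int <- lm(Price ~ Age * Miles, data = carSales)
b <- coef(fit_int)
names(b)

# Hypothetical helper: estimated effect of one extra year of age
# at a given mileage, i.e. b1 + b3 * miles
age_slope <- function(miles) unname(b["Age"] + b["Age:Miles"] * miles)
age_slope(c(20, 50, 80))

# Check: the slope equals the change in predicted price when age
# increases by one year with miles held fixed at 50
p5 <- predict(fit_int, newdata = data.frame(Age = 5, Miles = 50))
p4 <- predict(fit_int, newdata = data.frame(Age = 4, Miles = 50))
unname(p5 - p4)
```

With only 11 observations the interaction coefficient is estimated imprecisely, so this is best read as an illustration of the mechanics rather than as evidence for an interaction in these data.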
