Week 3
Yuting Huang
Review on Probability and
Statistics II
Population parameter and sample statistics
Statistical inference
Estimation of population mean
Mean and variance of sample mean
Let H be the outcome of a coin toss.
Events Probability
H = 0, Tail 0.5
H = 1, Head 0.5
Population parameters:
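For this fair coin, the population mean and variance follow directly from the probability table. A minimal sketch in R; the names `outcomes`, `probs`, `mu`, and `sigma2` are chosen here for illustration and are not from the slides:

```r
# Outcomes and probabilities of the coin toss H, taken from the table
outcomes <- c(0, 1)
probs <- c(0.5, 0.5)

# Population mean: probability-weighted average of the outcomes
mu <- sum(outcomes * probs)

# Population variance: probability-weighted squared deviation from the mean
sigma2 <- sum((outcomes - mu)^2 * probs)

mu      # 0.5
sigma2  # 0.25
```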
Population parameter and sample statistics
▶ So far, we have assumed that the population parameters (µ and
σ²) and distribution (e.g., Bernoulli) are known to us.
▶ This is not a realistic assumption.
[Figure: histogram of percent by income bracket, from below $1,000 to $20,000 and over]
Large sample approximations
Law of large numbers
As $n \to \infty$, $\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n) \to \mu$
Law of large numbers
Events Probability
H = 0, Tail 0.5
H = 1, Head 0.5
Simulation
[Figure: proportion of heads against number of trials, n = 10]
What happens if we increase the sample size to 20, 50, 250, . . . ?
[Figure: four panels of proportion of heads against number of trials, for n = 10, 20, 50, and 250]
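The coin-toss simulation can be sketched in R as follows. This is an illustration under assumptions, not the code behind the original figures: the seed, the helper name `running_proportion`, and the plotting choices are all made up here.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Running proportion of heads after each of n fair-coin tosses
running_proportion <- function(n) {
  tosses <- rbinom(n, size = 1, prob = 0.5)  # 1 = head, 0 = tail
  cumsum(tosses) / seq_len(n)
}

# One panel per sample size, with the population mean 0.5 as a reference line
par(mfrow = c(2, 2))
for (n in c(10, 20, 50, 250)) {
  plot(running_proportion(n), type = "l", ylim = c(0, 1),
       main = paste("n =", n),
       xlab = "Number of trials", ylab = "Proportion of heads")
  abline(h = 0.5, lty = 2)
}
```

With larger n, the running proportion should settle closer to 0.5, which is what the law of large numbers predicts.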
Law of large numbers
As $n \to \infty$, $\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n) \to \mu$
Central limit theorem
As $n \to \infty$, $\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$
Simulation
1. Flip the coin 10 times and calculate the mean outcome Ȳ1 .
2. Repeat the process. Then we have a set of sample means:
Ȳ1 , Ȳ2 , ...
3. Draw the histogram of these sample means:
[Figure: histogram of the sample means, horizontal axis from 0 to 1]
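The three steps can be sketched in R. The number of repetitions (`reps = 1000`) and the seed are assumptions made for this illustration; the slides do not state them.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

n <- 10       # tosses per sample
reps <- 1000  # number of repeated samples (assumed value)

# Steps 1 and 2: flip a fair coin n times, record the mean, and repeat
sample_means <- replicate(reps, mean(rbinom(n, size = 1, prob = 0.5)))

# Step 3: histogram of the sample means
hist(sample_means, xlim = c(0, 1),
     main = paste("n =", n), xlab = "Sample mean")

# Compare with the CLT values: mean 0.5 and variance 0.25 / n
mean(sample_means)
var(sample_means)
```

Changing `n` to 20, 50, or 250 and rerunning should give a histogram that is narrower and closer to a bell shape.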
What happens if we increase the sample size to 20, 50, 250, . . . ?
[Figure: histograms of sample means for n = 10, 20, 50, and 250; the horizontal range narrows from about 0.2 to 0.8 at n = 10 to about 0.40 to 0.60 at n = 250]
Central Limit Theorem
Central limit theorem when σ² is unknown
If the population variance σ² is unknown and is estimated by the sample
variance s², the standardized sample mean (Ȳ − µ)/(s/√n) follows a
Student's t distribution with n − 1 degrees of freedom.
▶ The t distribution also has a bell shape, but it has thicker tails
than the normal distribution.
[Figure: density curves of the normal distribution and the t distribution with df = 1, horizontal axis from −4 to 4]
As the sample size gets larger, the t distribution approaches the normal
distribution ⇒ the sample mean will approximately follow a normal distribution.
As $n \to \infty$, $\bar{Y} \sim N\left(\mu, \frac{s^2}{n}\right)$,
where s² is the sample variance.
[Figure: density curves of the normal distribution and t distributions with df = 1, 5, and 30, horizontal axis from −4 to 4]
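One way to see the t distribution approaching the normal is to compare their upper 2.5% quantiles. A minimal sketch using base R's `qt()` and `qnorm()`; the variable names are chosen here for illustration:

```r
# Degrees of freedom shown in the figure
dfs <- c(1, 5, 30)

# Upper 2.5% quantile of each t distribution and of the standard normal
t_crit <- qt(0.975, df = dfs)
z_crit <- qnorm(0.975)

# The t quantiles shrink toward the normal quantile as df grows
data.frame(df = dfs, t_crit = t_crit, z_crit = z_crit)
```

The thicker tails of the t distribution show up as larger quantiles, and the gap closes as the degrees of freedom increase.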
How large is a “large” number?
[Figure: distribution of the sample mean for large n and small n]
▶ The variance of the sample mean, σ²/n, is inversely proportional to the
sample size ⇒ precision increases with larger n.
▶ Rule of thumb: n ≥ 30 is considered sufficient for the CLT to hold.
Summary of LLN and CLT
Sample mean
Example
Draw a random sample of size n = 64 from a population with mean µ = 50 and
standard deviation σ = 16.
The sample mean x̄ has variance σ²/n = 256/64 = 4, so
$P(\bar{x} > 54) = P\left(\frac{\bar{x} - 50}{\sqrt{4}} > \frac{54 - 50}{\sqrt{4}}\right) = P(Z > 2) = 0.0228$
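The same probability can be checked in R with `pnorm()`. A minimal sketch; the variable names are chosen here for illustration:

```r
mu <- 50     # population mean
sigma <- 16  # population standard deviation
n <- 64      # sample size

# Standard deviation of the sample mean: sigma / sqrt(n) = 2
se <- sigma / sqrt(n)

# Standardize 54 and take the upper-tail probability of the standard normal
z <- (54 - mu) / se
p <- pnorm(z, lower.tail = FALSE)

z  # 2
p  # about 0.0228
```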
Hypothesis testing and
confidence interval
General idea
If the coin is fair, the distribution of the outcomes would be like this
General idea
[Figure: distribution of the proportion of heads under a fair coin, with the values 0.55 and 0.63 marked and the rejection region labeled]
Formulating the hypotheses
▶ If we want to test whether our coin is biased in general,
Testing the hypotheses
$H_0: p = 0.5 \qquad H_1: p \neq 0.5$
$E(\hat{p}) = \mu = 0.5$
$\mathrm{Var}(\hat{p}) = \frac{\sigma^2}{n} = 0.0025 \qquad \mathrm{SD}(\hat{p}) = 0.05$
One vs. two-sided hypothesis tests
Two-sided: $H_1: p \neq p_0$
One-sided: $H_1: p < p_0$ or $H_1: p > p_0$
The testing procedures are almost the same, except for the critical
value at each significance level, based on which we reject or do not
reject the null hypothesis.
One vs. two-sided hypothesis tests
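[Figure: standard normal density for a two-sided test, with central area 0.95, critical values z = −1.96 and z = 1.96, and area 0.025 in each tail]
[Figure: standard normal density for a one-sided test, with area 0.95 below the critical value z = 1.65 and area 0.050 in the upper tail]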
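The critical values at the 5% significance level can be reproduced with `qnorm()`. A minimal sketch; the variable names are chosen here for illustration:

```r
alpha <- 0.05  # significance level

# Two-sided test: alpha is split equally between the two tails
z_two_sided <- qnorm(1 - alpha / 2)

# One-sided test: all of alpha sits in one tail
z_one_sided <- qnorm(1 - alpha)

round(z_two_sided, 2)  # 1.96
round(z_one_sided, 2)  # 1.64 (the slides use 1.65)
```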
Confidence interval
Confidence interval
$\hat{p} \pm \underbrace{1.96 \times SE}_{\text{margin of error}}$
▶ We have $SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.63 \times (1 - 0.63)}{100}} = 0.0483$
Margin of error
The width of a CI is determined by the margin of error.
Confidence level and critical value
The margin of error changes as the confidence level changes.
$\text{point estimate} \pm \underbrace{z^* \times SE}_{\text{margin of error}}$
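A minimal sketch of how the critical value z* and the margin of error move with the confidence level. The three levels and the value SE = 0.0483 (from the coin example) are chosen here for illustration:

```r
levels <- c(0.90, 0.95, 0.99)  # confidence levels to compare (assumed)
se <- 0.0483                   # standard error from the coin example

# Critical value: leaves (1 - level) / 2 in each tail of the standard normal
z_star <- qnorm(1 - (1 - levels) / 2)

# Margin of error at each confidence level
margin <- z_star * se

data.frame(level = levels, z_star = round(z_star, 3), margin = round(margin, 4))
```

A higher confidence level requires a larger z*, and therefore a wider interval.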
Summary
Getting started with R and RStudio
Install the latest version of R and RStudio
The RStudio interface is organized into four panels in its default
layout.
The RStudio console
Use R as a calculator
1. Open a new R script file via File > New File > R Script.
2. R can be used as a calculator.
▶ Enter 11 + 1 in the script editor and click the Run button
at the top right of the panel. The output will appear in the
console.
▶ Alternatively, with the cursor on the line of code, use the
keyboard shortcut Ctrl/Cmd + Enter.
# Use R as a calculator
11 + 1
## [1] 12
The # symbol marks text as a comment. Comments are not run as code,
which makes them a useful tool for annotating your code.
Assign name to a value
3. We can save a value by assigning it a name using = or <-.
# Calculation
x + y
## [1] 12.69315
# Create a vector
a = c(-2, -1, 1.2, 1.8); a # semicolon (;) separates commands
## [1] 0
## [1] 1.796292
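The calls that produced the last two outputs are not shown here. A minimal sketch, assuming they are the mean and standard deviation of `a`, which reproduce the values 0 and 1.796292:

```r
# Assign a name with <- (or =); typing the name prints the stored value
a <- c(-2, -1, 1.2, 1.8)
a

# Assumed calls (see lead-in): mean and standard deviation of a
mean(a)
sd(a)
```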
Draw simple plots
7. Plots appear in the Plots tab on the lower-right panel.
# Plot a against b
plot(a, b)
[Figure: scatter plot of b against a]
8. Use the ? operator to find out more information about a
particular function.
▶ The help page will appear in the Help tab on the bottom
right.
Working directory
9. The Files tab on the lower-right panel shows all files in the
current working directory.
▶ Typically, this is the location of your current R script.
▶ To set or change your working directory, use Session > Set
Working Directory > Choose Directory...
10. To save your script file to a specific destination, use
File > Save As...
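The working directory can also be inspected and changed from code with base R's `getwd()`, `list.files()`, and `setwd()`. A minimal sketch; the path in the commented line is a hypothetical placeholder:

```r
# Print the current working directory
wd <- getwd()
wd

# List the files in it, similar to what the Files tab shows
files_here <- list.files()
files_here

# Change the working directory from code (hypothetical path, left commented out)
# setwd("path/to/your/project")
```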
Errors, warnings, and friendly messages
11. One thing that intimidates new R and RStudio users is how it
reports errors, warnings, and friendly messages.
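The three kinds of feedback can be told apart with a small sketch. The example expressions and the names `res_error`, `res_warning`, and `res_message` are chosen here for illustration; `tryCatch()` is used only to capture the text that R would otherwise print to the console.

```r
# Error: the code cannot run, and R stops at this line
res_error <- tryCatch(1 + "a",
                      error = function(e) conditionMessage(e))

# Warning: the code runs, but the result deserves a second look
res_warning <- tryCatch(log(-1),
                        warning = function(w) conditionMessage(w))

# Message: purely informational, nothing went wrong
res_message <- tryCatch(message("Data loaded"),
                        message = function(m) conditionMessage(m))

res_error
res_warning
res_message
```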
Read data file into R
First inspection of the data
summary(gpa_hours)
Scatter plot
plot(gpa_hours$height, gpa_hours$weight,
xlab = "Height of student", ylab = "Weight of student",
pch = 19, col = "brown1")
[Figure: scatter plot of weight of student against height of student]
[Figure: plot by gender of student (Female, Male)]
Histogram
hist(gpa_hours$GPA, xlab = "GPA")
[Figure: histogram of gpa_hours$GPA; horizontal axis GPA, vertical axis Frequency]
Heavily skewed data
summary(gpa_hours$pets)
▶ Notice that the maximum value of pets is much larger than the
other numbers in the summary statistics, which implies that the
variable is heavily right-skewed.
▶ This is also suggested by the fact that the mean is a lot larger
than the median.
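The link between right skew and a mean above the median can be seen on a small example. The vector `pets_demo` below is made-up illustration data, not the `gpa_hours` data:

```r
# Hypothetical counts of pets: mostly small values plus one very large one
pets_demo <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 20)

summary(pets_demo)

# The single large value pulls the mean up but leaves the median unchanged
mean(pets_demo)    # 3.4
median(pets_demo)  # 1.5
```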
Boxplot
[Figure: boxplot of the number of pets, axis from 0 to 20]
# Define 1st, 3rd quartiles, and IQR
q1 = quantile(gpa_hours$pets, 0.25)
q3 = quantile(gpa_hours$pets, 0.75)
iqr = q3 - q1
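A common next step is to flag points more than 1.5 × IQR beyond the quartiles, which is the default whisker rule in R's `boxplot()`. Whether the slides continue this way is an assumption. The sketch below runs on a made-up vector `pets_demo` so that it is self-contained:

```r
# Hypothetical data standing in for gpa_hours$pets
pets_demo <- c(0, 0, 1, 1, 1, 2, 2, 3, 4, 20)

# Quartiles and interquartile range
q1 <- unname(quantile(pets_demo, 0.25))
q3 <- unname(quantile(pets_demo, 0.75))
iqr <- q3 - q1

# Fences: values outside these limits are drawn as separate points in a boxplot
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Observations flagged as outliers
outliers <- pets_demo[pets_demo < lower_fence | pets_demo > upper_fence]
outliers  # 20
```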