
Week 3

The document covers key concepts in probability and statistics relevant to data science, including population parameters, sample statistics, the law of large numbers, and the central limit theorem. It emphasizes the importance of statistical inference and hypothesis testing in estimating population characteristics from sample data. Additionally, it introduces practical applications using R and RStudio for data analysis.


DSE1101

Introductory Data Science for Economics

Yuting Huang

Week 3, Semester 1 AY2022/23

1 / 64
Review on Probability and Statistics II

▶ Population parameter and sample statistics


▶ Distribution of sample mean
▶ Large sample approximations
▶ Law of large numbers
▶ Central limit theorem
▶ Hypothesis testing and confidence interval
▶ Lab: Getting started with R and RStudio

2 / 64
Population parameter and sample statistics

We are often interested in population parameters. However, in


real-world applications, we rarely observe the true population
distribution. Population parameters are usually unknown.

▶ We use sample statistics as point estimates for the unknown


population parameter of interest.
▶ Error = true value of population parameter - point estimate
▶ Bias is the systematic tendency to over- or under-estimate the
true population parameter.
▶ Sampling variability describes how much an estimate will tend
to vary from one sample to the next.

3 / 64
Statistical inference

Suppose you want to know the mean monthly earnings in Singapore.


What will you do?

▶ Perform an exhaustive survey of all residents in Singapore and


construct the population distribution of earnings.

Or

▶ Select a random sample from the residents in Singapore. Use


statistical methods to draw inference about the full population.

4 / 64
Statistical inference

Exhaustive survey is expensive. It is more practical to select a


random sample and estimate the average earnings of the population.

1. Choose a random sample of n individuals.


2. Ask each individual about their monthly earnings Y1, Y2, ..., Yn.
3. Compute the sample average

$$\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n)$$

▶ This is an example of an estimator of the population mean.


▶ Its value is called an estimate.

5 / 64
Estimation of population mean

Use the sample mean to estimate the population mean.


$$\bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n)$$

▶ The sample mean, Ȳ , is a random variable, since the sample is


drawn at random.
▶ If we draw another sample, Ȳ will be different.

6 / 64
Mean and variance of sample mean

Suppose Y1, Y2, ..., Yn are drawn at random from a population, of
which the mean and variance are µ and σ², respectively.

▶ The mean of the sample mean is denoted as E(Ȳ).

$$E(\bar{Y}) = E\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n}\sum_{i=1}^{n} E(Y_i) = \frac{1}{n}\, n\mu = \mu$$

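▶ The variance of the sample mean is denoted as Var(Ȳ). Because the draws are independent, the variance of the sum is the sum of the variances.

$$Var(\bar{Y}) = Var\left(\frac{1}{n}\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} Var(Y_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}$$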

7 / 64
Let H be the outcome of a coin toss.

Events Probability
H = 0, Tail 0.5
H = 1, Head 0.5

Population parameters:

▶ Population mean: µ = 0.5
▶ Population variance: σ² = 0.25

Suppose we flip the coin 100 times, n = 100.

Sampling distribution of the sample mean H̄:

▶ Expected value of the sample mean: E(H̄) = µ = 0.5
▶ Variance of the sample mean: Var(H̄) = σ²/100 = 0.0025
▶ Standard deviation of the sample mean: SD(H̄) = 0.05
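
The mean, variance, and standard deviation of the sample mean for 100 coin flips can be checked by simulation. The following is a minimal sketch, assuming a fair coin simulated with rbinom(); the seed and the number of repeated samples (10,000) are arbitrary choices for illustration and are not part of the original example.

```r
# Simulate many samples of n = 100 fair-coin flips (illustrative setup)
set.seed(1)        # arbitrary seed, for reproducibility only
n = 100            # flips per sample
reps = 10000       # number of repeated samples (an assumption)

# Each entry of h_bar is the proportion of heads in one sample of n flips
h_bar = rbinom(reps, size = n, prob = 0.5) / n

mean(h_bar)   # should be close to mu = 0.5
var(h_bar)    # should be close to sigma^2 / n = 0.0025
sd(h_bar)     # should be close to 0.05
```

The simulated values will not match the theoretical ones exactly, but they should get closer as reps increases.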

8 / 64
Population parameter and sample statistics

▶ A population parameter is a fixed feature of a particular population.
▶ Sometimes known (e.g., the probability of heads for a fair coin).
▶ Usually unknown in real life, as surveying the entire
population is usually not practical.
▶ A sample statistic is a quantity that varies from one sample to
another.
▶ Easy to compute, as it is just a statistic of a sample obtained by
simple random sampling.

9 / 64
▶ So far, we have assumed that the population parameters (µ and
σ²) and distribution (e.g., Bernoulli) are known to us.
▶ This is not a realistic assumption.

[Figure: Resident households by monthly household income, 2021, including employer CPF contributions. Vertical axis: percent of households. Horizontal axis: monthly income brackets from "Below $1,000" to "$20,000 & Over".]

Data source: [Link]

10 / 64
Large sample approximations

When we don’t know the parameters and/or the exact


distribution of the population, we use approximations to the
sampling distribution that rely on large samples.

▶ The resulting approximation is called the asymptotic distribution.


▶ We rely on the following law and theorem:
1. Law of large numbers: sample mean approaches
population mean as sample size increases.
2. Central limit theorem: using sample mean and sample
variance to approximate the distribution of the sample
mean.

11 / 64
Law of large numbers

Law of large numbers (LLN): as sample size n increases, the


sample mean Ȳ gets closer and closer to the population mean µ.

▶ Here is a formal definition:

$$\text{As } n \to \infty, \quad \bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n) \to \mu$$

12 / 64
Law of large numbers

Consider the coin tossing example, where H is the outcome of a coin


toss.

Events Probability
H = 0, Tail 0.5
H = 1, Head 0.5

▶ Population mean: µ = 0.5


▶ If we flip the coin ten times, what is the expected proportion of
heads?

13 / 64
Simulation

[Figure: running proportion of heads for n = 10. Vertical axis: proportion of heads (0 to 1). Horizontal axis: number of trials.]

▶ Ten outcomes = {Head, Tail, Head, Tail, Tail, . . . }

14 / 64
What happens if we increase the sample size to 20, 50, 250, . . . ?

[Figure: four panels showing the running proportion of heads for n = 10, n = 20, n = 50, and n = 250.]

15 / 64
Law of large numbers

Law of large numbers (LLN): as sample size n increases, the


sample mean Ȳ gets closer and closer to the population mean µ.

▶ In this coin tossing example, as the number of flips increases
(n = 10, 20, 50, 250, ...), the observed proportion of heads
approaches the population mean, which is 0.5.

$$\text{As } n \to \infty, \quad \bar{Y} = \frac{1}{n}(Y_1 + Y_2 + \dots + Y_n) \to \mu$$
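
The law of large numbers can be illustrated by tracking the running proportion of heads as flips accumulate. This is a minimal sketch, assuming a fair coin coded as 1 = head and 0 = tail; the seed and the variable names (flips, running_prop) are illustrative choices.

```r
# Running proportion of heads over 250 simulated fair-coin flips
set.seed(1)                                   # arbitrary seed
n_max = 250
flips = rbinom(n_max, size = 1, prob = 0.5)   # 1 = head, 0 = tail

# Proportion of heads after 1, 2, ..., n_max flips
running_prop = cumsum(flips) / seq_len(n_max)

# Proportions at the sample sizes used on the slides
running_prop[c(10, 20, 50, 250)]

# Plot the path and mark the population mean
plot(running_prop, type = "l", ylim = c(0, 1),
     xlab = "Number of trials", ylab = "Proportion of heads")
abline(h = 0.5, lty = 2)
```

Early values can be far from 0.5, while later values typically settle near the dashed line.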

16 / 64
Central limit theorem

How about the distribution of these sample means?

▶ Suppose we draw a random sample of size n from a population
with mean µ and variance σ².
▶ Central limit theorem (CLT): When n is large, the sampling
distribution of Ȳ is approximately normal, regardless of the
distribution of the underlying population.

$$\text{As } n \to \infty, \quad \bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$$

17 / 64
Simulation

1. Flip the coin 10 times and calculate the mean outcome Ȳ1 .
2. Repeat the process. Then we have a set of sample means:
Ȳ1 , Ȳ2 , ...
3. Draw the histogram of these sample means:

[Figure: histogram titled "Distribution of sample means, n = 10". Vertical axis: frequency. Horizontal axis: sample mean, from 0 to 1.]
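
The three simulation steps can be sketched in R as follows. This is an illustrative sketch, not the code behind the slide: the number of repetitions (1,000), the seed, and the histogram breaks are assumptions.

```r
# Steps 1-2: repeat "flip 10 times, take the mean" many times
set.seed(1)        # arbitrary seed
n = 10             # flips per sample
reps = 1000        # number of repeated samples (an assumption)
y_bar = replicate(reps, mean(rbinom(n, size = 1, prob = 0.5)))

# Step 3: histogram of the sample means
# (breaks are centered on the possible values 0, 0.1, ..., 1)
hist(y_bar, breaks = seq(-0.05, 1.05, by = 0.1),
     main = "Distribution of sample means, n = 10",
     xlab = "Sample mean")

# Compare with the theoretical values 0.5 and sqrt(0.25 / 10), about 0.158
mean(y_bar); sd(y_bar)
```

Changing n to 20, 50, or 250 (and adjusting the breaks) shows the histogram becoming narrower and more bell-shaped.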

18 / 64
What happens if we increase the sample size to 20, 50, 250, . . . ?

[Figure: four panels showing histograms of sample means for n = 10, n = 20, n = 50, and n = 250. The horizontal range narrows from roughly 0.2 to 0.8 (n = 10) to roughly 0.40 to 0.60 (n = 250).]

19 / 64
Central Limit Theorem

▶ Central Limit Theorem (CLT) states that as long as we have


a large enough sample, the sample mean is approximately
normally distributed with mean µ and variance σ²/n.
▶ This holds independently of how the underlying population is
distributed.

[Figure: two panels. Left: population distribution of the coin toss (outcomes 0 and 1, each with probability 0.5). Right: distribution of the sample mean for n = 250, ranging from about 0.40 to 0.60.]

20 / 64
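Central limit theorem when σ² is unknown
If the population variance σ² is unknown, we replace it with the sample variance s². The standardized sample mean (Ȳ − µ)/(s/√n) then follows a Student's t distribution with n − 1 degrees of freedom (exactly so when the population itself is normal).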

▶ The t distribution also has a bell shape, but it has thicker tails
than the normal distribution.

[Figure: density of the normal distribution compared with the t distribution with df = 1, plotted from −4 to 4.]

21 / 64
As the sample size gets larger, the t distribution approaches the normal
distribution ⇒ the sample mean is approximately normally distributed.

$$\text{As } n \to \infty, \quad \bar{Y} \sim N\left(\mu, \frac{s^2}{n}\right)$$

where s² is the sample variance.

[Figure: densities of the normal distribution and of t distributions with df = 1, df = 5, and df = 30, plotted from −4 to 4.]
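
One way to see the t distribution approaching the normal distribution is to compare their 97.5% quantiles. This is a minimal sketch using the base R functions qt() and qnorm(); the extra case df = 1000 is added here for illustration and is not on the slide.

```r
# 97.5% quantiles of the t distribution for increasing degrees of freedom
df = c(1, 5, 30, 1000)
t_crit = qt(0.975, df = df)

# 97.5% quantile of the standard normal distribution (about 1.96)
z_crit = qnorm(0.975)

# Side-by-side comparison
round(cbind(df, t_crit, z_crit), 3)
```

The t quantiles are larger than the normal quantile (thicker tails) and shrink toward it as the degrees of freedom grow.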

22 / 64
How large is a “large” number?

[Figure: sampling distribution of the sample mean for large n and for small n.]

▶ The variance of the sample mean, σ²/n, is inversely proportional to
sample size ⇒ precision increases with larger n.
▶ Rule of thumb: n ≥ 30 is considered sufficient for the CLT approximation to work well.

23 / 64
Summary on LLN and CLT

Sample mean

▶ Take a random sample of size n from a population with mean µ


and variance σ². Then calculate the sample average Ȳ.
▶ By the law of large numbers, if n is sufficiently large, then Ȳ
is close to µ.

Distribution of sample mean

▶ If the original distribution is normal, so is the distribution of the


sample means.
▶ By the central limit theorem, if n is sufficiently large (≥ 30),
the distribution of sample means will be approximately normal,
regardless of the original population distribution.

24 / 64
Example
Draw a random sample of size n = 64 from a population with mean µ = 50 and
standard deviation σ = 16.

What is the probability that the sample mean is above 54?

▶ Since the sample size is large (n ≥ 30), the CLT applies.
▶ The sample mean is approximately normally distributed with mean µ and
variance σ²/n = 16²/64 = 4, that is, N(50, 4).

$$P(\bar{x} > 54) = P\left(\frac{\bar{x} - 50}{\sqrt{4}} > \frac{54 - 50}{\sqrt{4}}\right) = P(Z > 2) = 0.0228$$
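
The same probability can be computed in R with pnorm(). This is a small sketch of the calculation above; the variable names are illustrative.

```r
# Example: n = 64, mu = 50, sigma = 16
mu = 50; sigma = 16; n = 64

# Standard deviation of the sample mean: 16 / sqrt(64) = 2
se = sigma / sqrt(n)

# Standardized value of 54
z = (54 - mu) / se

# Upper-tail probability P(sample mean > 54) under N(50, 2^2)
p_above = pnorm(54, mean = mu, sd = se, lower.tail = FALSE)
round(p_above, 4)   # 0.0228
```

Note that pnorm() takes the standard deviation (2), not the variance (4).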

25 / 64
Hypothesis testing and
confidence interval

26 / 64
General idea

We suspect that we might have a biased coin. How do we test it


scientifically?

▶ Step 1. Flip the coin 100 times.


▶ Step 2. Write down the number of heads as the outcome of the
experiment.
▶ Then. . . what?

Hypothesis testing: given the outcome (e.g., 55 heads, or 63


heads), try to make some inference on whether the coin is biased.

27 / 64
If the coin is fair, the distribution of the outcomes would be like this

[Figure: two histograms for a fair coin. Left: number of heads out of 100 flips (roughly 35 to 65). Right: proportion of heads (roughly 0.35 to 0.65).]

▶ The histogram on the left-hand side centers at 50.


▶ This is the expected number of heads if the coin is fair.

28 / 64
General idea

[Figure: distribution of the proportion of heads under a fair coin (0.35 to 0.65), with the outcomes 0.55 and 0.63 marked.]

▶ Suppose one of our samples has 55 heads. Is it necessarily
evidence that we indeed have a biased coin?
▶ What if we have 63 heads? Does it cast any doubt on the fairness
of the coin?

29 / 64
[Figure: distribution of the proportion of heads under a fair coin (0.35 to 0.65), with the outcomes 0.55 and 0.63 marked and the rejection region shaded.]

Hypothesis testing tells us how extreme our sample outcome is.

▶ It first makes a hypothesis on the population (the coin is fair).


▶ And creates a barrier (rejection region), beyond which we say
that the sample is too extreme for us to maintain the hypothesis.
▶ That is, if our sample falls within the rejection region, we
reject the hypothesis.
30 / 64
Formulating the hypotheses

▶ The parameter of interest is the proportion of heads from
tossing the coin.
▶ There may be two explanations why the sample proportion is
different from 0.50:
1. The true population parameter is 0.50, and the difference
between the true parameter and the sample statistic is
simply due to chance or sampling variability.
2. The true population parameter is different from 0.50.

31 / 64
Formulating the hypotheses

▶ We start with the assumption that the population parameter is


equal to a hypothesized value. This is called the null
hypothesis, denoted as H0 .
▶ We test the claim that the population parameter is different,
smaller, or larger than the hypothesized value. This is called the
alternative hypothesis, denoted as H1 .

32 / 64
▶ If we want to test whether our coin is biased in general,

H0 : p = 0.5 vs. H1 : p ≠ 0.5

▶ If we want to test whether our coin is head biased,

H0 : p = 0.5 vs. H1 : p > 0.5

▶ If we want to test whether our coin is tail biased,

H0 : p = 0.5 vs. H1 : p < 0.5

33 / 64
Testing the hypotheses

Suppose that out of 100 coin flips, we obtain an outcome of 63 heads.


We want to test whether the coin is biased in general.

▶ Formulate the hypothesis test

H0 : p = 0.5    H1 : p ≠ 0.5

▶ Under H0, the coin is fair. The population mean and
variance are µ = p = 0.5, σ² = p(1 − p) = 0.25.
▶ By the CLT, the sample mean (here the sample proportion p̂) is
approximately normally distributed with

$$E(\hat{p}) = \mu = 0.5, \qquad Var(\hat{p}) = \frac{\sigma^2}{n} = 0.0025, \qquad SD(\hat{p}) = 0.05$$

34 / 64
Testing the hypotheses

In order to evaluate if the observed sample is considered unusual for
the hypothesized value, we determine how many standard errors the
sample statistic is away from the value hypothesized under H0.

▶ Compute the test statistic

$$Z = \frac{\hat{p} - E(\hat{p})}{SD(\hat{p})} = \frac{0.63 - 0.5}{0.05} = 2.6$$

▶ The sample statistic is 2.6 standard errors away from the
hypothesized value.
▶ Is this considered unusual?
▶ Is this result statistically significant?

35 / 64
Testing the hypotheses

We can quantify how unusual it is using a p-value.

p-value = P(observing a sample at least as extreme as the current one | H0 is true)

As a general rule, we use a significance level α = 0.05.

▶ If p-value ≤ 0.05, we say that it would be very unlikely to observe
such a sample if the null hypothesis were true. We reject H0.
▶ If p-value > 0.05, we say that such a sample would not be unusual
if the null hypothesis were true. We fail to reject H0.

36 / 64
Testing the hypotheses

▶ The p-value is the probability of observing a sample at least as
extreme as the current one, if the null hypothesis is true. For the
two-sided test, both tails count:

p-value = P(|Z| ≥ 2.6) = 2 × P(Z ≥ 2.6) = 2 × 0.0047 = 0.0093

▶ Interpretation: If the coin is fair, there is less than a 1%
chance of observing a random sample at least as extreme as this
one.
▶ Conclusion: The difference between the null value of 0.50 and
the observed value of 0.63 is unlikely to be due to chance or
sampling variability alone.
▶ This result is statistically significant at the 0.05 level.
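
The test statistic and p-value for the 63-heads example can be reproduced with pnorm(). This is a minimal sketch of the normal-approximation test; the variable names are illustrative. The one-tail probability P(Z ≥ 2.6) is about 0.0047, and the two-sided p-value P(|Z| ≥ 2.6) is twice that.

```r
# Observed data and null value
n = 100; heads = 63; p0 = 0.5
p_hat = heads / n

# Standard deviation of p_hat under H0: sqrt(0.25 / 100) = 0.05
se0 = sqrt(p0 * (1 - p0) / n)

# Test statistic
z = (p_hat - p0) / se0

# One-tail probability and two-sided p-value
p_one_tail = pnorm(abs(z), lower.tail = FALSE)
p_two_sided = 2 * p_one_tail

round(c(z = z, one_tail = p_one_tail, two_sided = p_two_sided), 4)

# Decision at the 0.05 significance level (TRUE means reject H0)
p_two_sided <= 0.05
```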

37 / 64
One vs. two-sided hypothesis tests

▶ In a two-sided hypothesis test, we are interested in whether the


true parameter is different than the hypothesized value p0

H1 : p ≠ p0

▶ In a one-sided test, we are interested in whether the true


parameter is less than or greater than the hypothesized value.

H1 : p < p0 or H1 : p > p0

The testing procedures are almost the same, except for the critical
value at each significance level, based on which we reject or do not
reject the null hypothesis.

38 / 64
One vs. two-sided hypothesis tests

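[Figure: two standard normal densities plotted from −3 to 3. Top (two-sided test): central area 0.95, with area 0.025 in each tail beyond z = −1.96 and z = 1.96. Bottom (one-sided test): area 0.95 below z = 1.65, with area 0.050 in the upper tail.]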
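
The critical values for one-sided and two-sided tests can be obtained from qnorm() and compared with an observed test statistic. This is an illustrative sketch; the observed value z_obs = 2.6 is taken from the coin example, and the variable names are assumptions.

```r
alpha = 0.05

# Two-sided test: alpha is split between the two tails
z_two_sided = qnorm(1 - alpha / 2)   # about 1.96

# One-sided test: all of alpha sits in one tail
z_one_sided = qnorm(1 - alpha)       # about 1.645

# Observed test statistic from the coin example
z_obs = 2.6

# Rejection decisions under each alternative hypothesis
reject_two_sided = abs(z_obs) > z_two_sided   # H1: p != p0
reject_upper = z_obs > z_one_sided            # H1: p > p0
reject_lower = z_obs < -z_one_sided           # H1: p < p0

c(two_sided = reject_two_sided, upper = reject_upper, lower = reject_lower)
```

With z_obs = 2.6, the null hypothesis is rejected against the two-sided and the upper one-sided alternatives, but not against the lower one-sided alternative.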

39 / 64
Confidence interval

Confidence interval (CI) is a plausible range of values for the


population parameter.

▶ When we estimate the population parameter using a point


estimate, we probably won’t hit the exact population parameter.
▶ But if we report a range of plausible values, we have a good shot
(e.g., 95%) at capturing the parameter.

40 / 64
Confidence interval

Suppose that out of 100 coin flips, we obtain an outcome of 63 heads.


We want to estimate the true proportion of heads from flipping the
coin.

▶ Sample statistic p̂ = 0.63, sample size n = 100
▶ The 95% confidence interval is defined as:

$$\hat{p} \pm \underbrace{1.96 \times SE}_{\text{margin of error}}$$

▶ We have

$$SE = \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{0.63 \times (1 - 0.63)}{100}} = 0.0483$$

41 / 64
Confidence interval

The 95% CI for the population parameter is

0.63 ± 1.96 × 0.0483 = [0.5353, 0.7247]

What does 95% confidence mean?

▶ Suppose we take many samples and build a confidence interval


from each sample, then about 95% of these intervals would
contain the true population parameter.
▶ Since 0.5 ∉ 95% CI, we reject the null hypothesis at the 0.05
significance level.
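
The interval and the meaning of "95% confidence" can both be sketched in R. The first part reproduces the interval for p̂ = 0.63 and n = 100. The second part is a simulation under assumptions that are not in the original slides: a true proportion of 0.5, 10,000 repeated samples, and an arbitrary seed.

```r
# Part 1: 95% confidence interval for the coin example
n = 100; p_hat = 0.63
se = sqrt(p_hat * (1 - p_hat) / n)
ci = p_hat + c(-1, 1) * qnorm(0.975) * se
round(ci, 3)   # 0.535 0.725

# Part 2: how often do such intervals contain the true proportion?
set.seed(1)        # arbitrary seed
p_true = 0.5       # assumed true proportion for the simulation
reps = 10000       # number of repeated samples (an assumption)

ph = rbinom(reps, size = n, prob = p_true) / n   # one p_hat per sample
s = sqrt(ph * (1 - ph) / n)                      # one SE per sample
covered = (ph - 1.96 * s <= p_true) & (p_true <= ph + 1.96 * s)

mean(covered)   # share of intervals containing p_true, close to 0.95
```

The simulated share is typically close to, though not exactly, 95%, because the interval relies on a large-sample approximation.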

42 / 64
Margin of error
The width of a CI is determined by the margin of error.

▶ If we want to be more certain that we capture the population
parameter (i.e., increase the confidence level), should we use a
wider interval or a narrower interval?

▶ If the interval is too wide, it may not be very informative.

43 / 64
Confidence level and critical value
The margin of error changes as the confidence level changes.

$$\text{point estimate} \pm \underbrace{z^* \times SE}_{\text{margin of error}}$$

▶ z* is called the critical value for a confidence level.


▶ Commonly used confidence levels are 90%, 95%, and 99%.

Confidence level Critical value


90% 1.645
95% 1.960
99% 2.576

Using the standard normal table, it is possible to find the critical


value z ∗ for any confidence level.
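
In R, the critical value for a given confidence level can be computed with qnorm() instead of a printed table. This is a small sketch; the variable names conf_level and z_star are illustrative.

```r
# Confidence levels from the table above
conf_level = c(0.90, 0.95, 0.99)

# A two-sided interval leaves (1 - level) / 2 in each tail
z_star = qnorm(1 - (1 - conf_level) / 2)

data.frame(conf_level, z_star = round(z_star, 3))
```

This reproduces the tabulated values 1.645, 1.960, and 2.576, and the same line works for any other confidence level.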

44 / 64
Summary

▶ Population parameter and sample statistics


▶ Large sample approximations
▶ Law of large numbers
▶ Central limit theorem
▶ Hypothesis testing and confidence intervals
▶ Readings for Week 4:
▶ ISLR: Chapter 3.1 Simple linear regression.
▶ Introduction to Econometrics: Chapters 4-5 Linear
regression with one regressor; Hypothesis test and
confidence intervals.

45 / 64
Getting started with R and RStudio

46 / 64
Install the latest version of R and RStudio

▶ R is an open-source statistical programming language.


▶ To begin, download and install R from
[Link]
▶ RStudio is an integrated development environment (IDE) for R
that simplifies the way of using R.
▶ Download and install RStudio Desktop from
[Link]

Both R and RStudio run identically under MacOS and Windows.

47 / 64
The RStudio interface is organized into four panels, with the default
layout shown below.

48 / 64
The RStudio console

▶ Script editor is on the top left. It is used to create and edit


files, such as R script files.
▶ Console is on the bottom left. When commands are run in the
script editor, the commands and the corresponding output will
appear in the console.
▶ There are additional panels on the right.

Now let us get started with simple tasks in R and RStudio.

49 / 64
Use R as a calculator

1. Open a new R script file via File > New File > R Script.
2. R can be used as a calculator.
▶ Enter 11 + 1 in the script editor and click the Run button
at the top right of the panel. The output will appear in the
console.
▶ Alternatively, with the cursor on the line of code, use the
keyboard shortcut Ctrl/Cmd + Enter.

# Use R as a calculator
11 + 1

## [1] 12

The # symbol in the code marks off text as comments. Comments are not
run as code. This is a useful tool for annotating your code.

50 / 64
Assign name to a value
3. We can save a value by assigning it a name using = or <-.

# Create two values x and y


x = 11 + 1
y = log(2)

# Calculation
x + y

## [1] 12.69315

# Assign name to a value


z = x + y

4. Now, take a look at the Environment tab on the top right.


▶ The values of x, y, and z are displayed.
▶ Any created data values or loaded data sets will appear in
this tab.
51 / 64
5. Variables can contain not only a single value, but also vectors or
matrices of values. One simple way to create a vector is to use
the c() command.

# Create a vector
a = c(-2, -1, 1.2, 1.8); a # semicolon (;) separates commands

## [1] -2.0 -1.0 1.2 1.8

6. Use mean() and sd() to compute the mean and standard


deviation of the values in vector a.

# Calculate mean and standard deviation


mean(a); sd(a)

## [1] 0

## [1] 1.796292
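
To see what mean() and sd() compute, the same quantities can be built by hand. This is a minimal sketch; the names a_mean and a_sd are illustrative. Note that sd() divides by n − 1 (the sample standard deviation), not by n.

```r
a = c(-2, -1, 1.2, 1.8)
n = length(a)

# Sample mean: sum of the values divided by n
a_mean = sum(a) / n

# Sample standard deviation: divide the sum of squared deviations by n - 1
a_sd = sqrt(sum((a - a_mean)^2) / (n - 1))

a_mean; a_sd

# Check against the built-in function
all.equal(a_sd, sd(a))
```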

52 / 64
Draw simple plots
7. Plots appear in the Plots tab on the lower-right panel.

# Create another vector named b


b = 2*a + pi

# Plot b against a (a on the horizontal axis)
plot(a, b)

[Figure: scatter plot of b (vertical axis) against a (horizontal axis).]

53 / 64
8. Use the ? operator to find out more information about a
particular function.
▶ The help page will appear in the Help tab on the bottom
right.

# Help page for the mean() function


?mean

54 / 64
Working directory

9. The Files tab on the lower-right panel shows all files in the
current working directory.
▶ Typically, this is the location of your current R script.
▶ To set or change your working directory, use Session > Set
Working Directory > Choose Directory. . . .
10. To save your script file, use File > Save As. . . in a specific
destination.

55 / 64
Errors, warnings, and friendly messages

11. One thing that intimidates new R and RStudio users is how R
reports errors, warnings, and friendly messages.

▶ “Error in. . . ”: The code will not run.


▶ “Warning messages”: Generally your code will still work, but
with some caveats. R still produces results.
▶ When the red text does not start with “Error” or “Warning”, it
is just a friendly message.

56 / 64
Read data file into R

Now let us read a CSV data file into R.

▶ Download week3_gpa_hours.csv from LumiNUS and save it in


your working directory.
▶ Read the CSV file in R using read.csv().

gpa_hours = read.csv("Data/week3_gpa_hours.csv", header = TRUE)
# In R 4.0.0 or later, add stringsAsFactors = TRUE so that gender is read as a factor

▶ Take a look at the Environment tab, where a data set


gpa_hours should now be visible.
▶ Click the blue arrow next to the data set name to view a
summary of the variables. Click on the data set name or use the
command View(gpa_hours) to view the data set.

57 / 64
First inspection of the data

▶ This is the data set we used in Week 1 – the survey responses


from 351 students.
▶ We will use it to compute summary statistics and recreate some
visualizations.

summary(gpa_hours)

## gender height weight GPA pets


## Female:199 Min. :57.00 Min. : 83.0 Min. :2.300 Min. : 0.000
## Male :152 1st Qu.:64.00 1st Qu.:125.0 1st Qu.:3.900 1st Qu.: 0.000
## Median :67.00 Median :145.0 Median :4.300 Median : 1.000
## Mean :67.23 Mean :152.1 Mean :4.199 Mean : 2.319
## 3rd Qu.:70.00 3rd Qu.:172.0 3rd Qu.:4.600 3rd Qu.: 3.000
## Max. :79.00 Max. :300.0 Max. :5.000 Max. :24.000

58 / 64
Scatter plot
plot(gpa_hours$height, gpa_hours$weight,
xlab = "Height of student", ylab = "Weight of student",
pch = 19, col = "brown1")

[Figure: scatter plot of weight of student (vertical axis, 100 to 300) against height of student (horizontal axis, 60 to 75).]

▶ The $ operator is used to access variables within a data set.


▶ Use ?plot to understand the arguments of the plot function.
59 / 64
Bar plot
plot(gpa_hours$gender, xlab = "Gender of student", col = "steelblue")

[Figure: bar plot of the number of students by gender (Female, Male). Horizontal axis: gender of student.]

▶ Bar plot is used to summarize categorical data such as gender.


▶ A useful reference on colors in R:
[Link]

60 / 64
Histogram
hist(gpa_hours$GPA, xlab = "GPA")

[Figure: histogram titled "Histogram of gpa_hours$GPA". Vertical axis: frequency. Horizontal axis: GPA, from 2.5 to 5.0.]

▶ Histogram summarizes quantitative data such as GPA.

61 / 64
Heavily skewed data

Summary statistics of the variable pets.

summary(gpa_hours$pets)

## Min. 1st Qu. Median Mean 3rd Qu. Max.


## 0.000 0.000 1.000 2.319 3.000 24.000

▶ Notice that the maximum value of pets is much larger than the
other numbers in the summary statistics, which implies that the
variable is heavily right-skewed.
▶ This is also suggested by the fact that the mean is a lot larger
than the median.

62 / 64
Boxplot

boxplot(gpa_hours$pets, xlab = "Number of pets", horizontal = TRUE)

[Figure: horizontal boxplot of the number of pets, with the axis running from 0 to 20.]

We can use the following code to set up the definition of outliers.

63 / 64
# Define 1st, 3rd quartiles, and IQR
q1 = quantile(gpa_hours$pets, 0.25)
q3 = quantile(gpa_hours$pets, 0.75)
iqr = q3 - q1

# Define upper and lower bound for outliers
lower = q1 - 1.5*iqr; upper = q3 + 1.5*iqr

# Identify the rows corresponding to the outliers
# (pets cannot be negative, so only the upper bound matters here)
outlier_rows = which(gpa_hours$pets > upper)

# Remove outliers and save data in a new object
gpa_hours_new = gpa_hours[-outlier_rows, ]

# Alternatively, use the boxplot.stats() function
outliers = boxplot.stats(gpa_hours$pets)$out
gpa_hours_new = gpa_hours[-which(gpa_hours$pets %in% outliers), ]

▶ Check out the new data set gpa_hours_new in the


Environment tab. It contains 337 observations and 5 variables.
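
The 1.5 × IQR rule can also be tried on a small vector without the course data. This is a minimal sketch on a hypothetical toy vector (the values are made up for illustration). It uses a logical index rather than -which(...), because a negative index built from an empty which() result would drop every element when no outliers are found.

```r
# Hypothetical toy data (not the course data set)
pets = c(0, 0, 1, 1, 2, 3, 3, 4, 24)

# Quartiles and interquartile range
q1 = quantile(pets, 0.25)
q3 = quantile(pets, 0.75)
iqr = q3 - q1

# Bounds of the 1.5 x IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Logical index: TRUE for values inside the bounds
keep = pets >= lower & pets <= upper

pets[!keep]            # values flagged as outliers
pets_new = pets[keep]  # data with outliers removed
length(pets_new)
```

For a data frame, the same logical vector can be used as a row index, for example df[keep, ] for a hypothetical data frame df.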

64 / 64
