IIMT 2641 Introduction to Business Analytics
Module 2: Intro to Statistics
Topic 2: Sampling
Objectives
1. Introduction to Statistics: Sample versus Population
2. Define and understand sampling distribution.
3. Describe and apply the Central Limit Theorem.
Introduction to Statistics
[Figure: Population and Sample]
Introduction to Statistics
Population                           Sample
Size N                               Size n
Population Mean = μ                  Sample Mean = x̄
Population Standard Deviation = σ    Sample Standard Deviation = s
Introduction to Statistics
How do I calculate the sample mean and sample standard deviation?
1. Sample Mean: For a given sample, you can calculate the mean as
follows:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
Or, if the sample data is in a column of Excel, you can calculate it using the
=average() function.
2. Sample Variance/Standard Deviation: For a given sample, you can
calculate the variance and standard deviation of the sample:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad s = \sqrt{s^2}$$
Or, if the sample data is in a column of Excel, you can calculate the variance
using the =var.s() function and the standard deviation using the =stdev.s() function.
Introduction to Statistics
Example: The height of HKUers is known to be normally distributed with a
mean of 5.5 feet and standard deviation of 0.4 feet. Feng stands at the
entrance of the KKL building and with a special technology documents the
heights of 10 people that walk by. The heights are: {5, 6.5, 5.3, 5.9, 4.9, 6.2,
6.1, 5.3, 5.5, 5.7}.
1. What is μ?
2. What is σ?
3. What is the sample mean x̄?
4. What is the sample standard deviation s?
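The four questions separate what is given about the population from what has to be computed from the sample. A minimal Python sketch, assuming only the standard-library `statistics` module; the variable names are illustrative and not part of the course materials:

```python
import statistics

# Questions 1-2: population parameters are stated in the problem,
# they are not computed from the sample.
mu = 5.5      # population mean height (feet)
sigma = 0.4   # population standard deviation (feet)

# The n = 10 heights recorded at the entrance.
heights = [5, 6.5, 5.3, 5.9, 4.9, 6.2, 6.1, 5.3, 5.5, 5.7]

# Questions 3-4: sample statistics are computed from the data.
x_bar = statistics.mean(heights)   # plays the role of Excel's =average()
s = statistics.stdev(heights)      # n - 1 divisor, like Excel's =stdev.s()

print(f"sample mean = {x_bar:.2f}")
print(f"sample standard deviation = {s:.3f}")
```

The sample values differ from μ = 5.5 and σ = 0.4 because they are based on only 10 people, not the whole population.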
Notation: Population Versus Sample
Population (Size N)
▪ The mean μ is the true average (mean) over all N.
▪ The variance σ² and standard deviation σ are the true variance and standard
deviation over all N.
▪ μ, σ², σ are called parameters of a distribution.
Sample (Size n)
▪ The sample mean x̄ is the average based only on the sample n.
▪ The sample variance s² and sample standard deviation s are the variance and
standard deviation estimates based only on the sample n.
▪ x̄, s², s are called statistics of a sample.
Today’s Objectives
1. Introduction to Statistics: Sample versus Population
2. Define and understand sampling distribution.
3. Describe and apply the Central Limit Theorem.
Definitions
▪ The sampling distribution of a statistic is the distribution of that statistic
across an arbitrarily large number of samples.
▪ The sampling distribution of the mean is the distribution of the sample
means over all possible samples of size n.
[Figure: Population and Sample]
Constructing our own Sampling Distribution
Let X be a discrete random variable giving the outcome when rolling a single
fair die. The table and graph below give the pmf of X.
x    P(X = x)
1    1/6
2    1/6
3    1/6
4    1/6
5    1/6
6    1/6
[Figure: bar chart of P(X = x) for x = 1, …, 6]
What is the expected value of X?
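The expected value is the probability-weighted sum of the outcomes, E[X] = Σ x · P(X = x). A small sketch using exact fractions; the dictionary name `pmf` is an illustrative choice:

```python
from fractions import Fraction

# pmf of a fair six-sided die: each face 1..6 has probability 1/6.
pmf = {x: Fraction(1, 6) for x in range(1, 7)}

# E[X] = sum over x of x * P(X = x)
expected_value = sum(x * p for x, p in pmf.items())

print(float(expected_value))  # 3.5
```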
Constructing our own Sampling Distribution
X = Outcome of Rolling a Die
Step 1: Google “Roll Dice”, and click on the result.
Step 2: Roll a single die 15 times and record each outcome in a
single column in Excel.
Step 3: Calculate the sample mean x̄ of the n = 15 sample. Write the
value of x̄ in the blank below.
Step 4: Share the value of x̄ with the class.
Population (Size N)          Sample (Size n = 15)
Population Mean μ = 3.5      Sample mean x̄ = __________
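The classroom activity can be imitated in code. The sketch below is a stand-in under stated assumptions: it uses Python's `random` module instead of the online dice roller, and the 1,000 simulated students and the seed are made-up values.

```python
import random
import statistics

random.seed(2641)  # arbitrary seed so the run is repeatable

def roll_sample_mean(n=15):
    """Roll a fair die n times and return the sample mean of the rolls."""
    rolls = [random.randint(1, 6) for _ in range(n)]
    return statistics.mean(rolls)

# Pretend 1,000 students each roll 15 times and report their sample mean
# (the class size is hypothetical).
sample_means = [roll_sample_mean() for _ in range(1000)]

print("mean of the sample means:", round(statistics.mean(sample_means), 3))
print("smallest sample mean:", min(sample_means))
print("largest sample mean:", max(sample_means))
```

Each student's x̄ differs, but the collection of values clusters around the population mean of 3.5.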
Constructing our own Sampling Distribution
What do we observe?
                       Population Distribution of X    Sampling Distribution of X̄
Discrete/Continuous    __________                      __________
Shape                  __________                      __________
Mean                   __________                      __________
Constructing our own Sampling Distribution
What would the sampling distribution
look like if the population distribution
was different?
What would the sampling distribution
look like if the samples were bigger?
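One way to explore both questions is by simulation. The sketch below compares a fair die with a hypothetical loaded die (the weights are invented for illustration) at sample sizes n = 15 and n = 60; the repetition count and seed are arbitrary.

```python
import random
import statistics

random.seed(0)  # arbitrary seed for repeatability

def simulate_means(faces, weights, n, reps=2000):
    """Return reps sample means, each from n draws of a (possibly loaded) die."""
    return [statistics.mean(random.choices(faces, weights=weights, k=n))
            for _ in range(reps)]

faces = [1, 2, 3, 4, 5, 6]
fair = [1, 1, 1, 1, 1, 1]
loaded = [1, 1, 1, 1, 1, 5]   # hypothetical die that favours 6

spread = {}
for label, weights in [("fair", fair), ("loaded", loaded)]:
    for n in (15, 60):
        means = simulate_means(faces, weights, n)
        spread[(label, n)] = statistics.stdev(means)
        print(f"{label:6s} n={n:2d}  centre={statistics.mean(means):.2f}  "
              f"spread={spread[(label, n)]:.3f}")
```

In runs of this sketch, the centre follows the population mean of whichever die is used, and the spread of the sample means is smaller for the bigger samples.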
Today’s Objectives
1. Introduction to Statistics: Sample versus Population
2. Define and understand sampling distribution.
3. Describe and apply the Central Limit Theorem.
Random Samples
– The random variables X1, X2, …, Xn are a random sample of n observations if
❑ The random variables are independent, and
❑ The random variables have the same distribution.
– We will say that “X1, X2, …, Xn are iid”, where “iid” is short for “independent and
identically distributed”. This is another way of saying that they form a random sample.
– Random samples are a critical assumption for situations where we are interested
in making statements about the population and not merely describing the sample
of observations
❑ Scientific studies use random samples or more generally “probability samples”
❑ Many business applications do not need or have random samples
• Example: user generated content such as product reviews
Conceptual Model for Random Samples (not required)
1. Obtain a sampling frame that lists all N members of the population
2. Use a randomization device (e.g. uniform random numbers or =rand() in
Excel) to randomly select n of the N members of the population
▪ Examples
– The U.S. Bureau of Labor Statistics conducts a monthly survey of about 60,000
households that are randomly sampled from all households in the U.S. to
estimate unemployment rates
– Polling firms randomly sample from voter registration rolls to estimate
preferences for candidates
– Marketing firms use random digit dialing of phone numbers
❑ They hope this mimics sampling from a sampling frame
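The two-step conceptual model (a sampling frame plus a randomization device) can be sketched in a few lines. Here Python's `random.sample` plays the role of the randomization device; the frame of household IDs, the sizes N and n, and the seed are hypothetical.

```python
import random

random.seed(42)  # arbitrary seed so the draw is repeatable

# Step 1: a sampling frame listing all N members (hypothetical household IDs).
N = 1000
sampling_frame = [f"household_{i:04d}" for i in range(1, N + 1)]

# Step 2: a randomization device selects n of the N members,
# without replacement, so no member is chosen twice.
n = 25
sample = random.sample(sampling_frame, n)

print(sample[:5])
```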
Normal Population Distribution
If the distribution of X is normal with mean μ_X and standard deviation σ_X
then
$$\bar{X} \sim \text{Normal}\left(\mu_X, \frac{\sigma_X^2}{n}\right)$$
Standard Error
– The random variables X1, X2, …, Xn are a random sample of n observations
❑ The random variables are independent and Xi ~ N(μ_X, σ_X^2) for i = 1, 2, …, n
– X̄ is the sample mean of the random sample with n observations
❑ $\bar{X} = \frac{1}{n}(X_1 + X_2 + \cdots + X_n)$
– The standard deviation of X̄ is called the standard error. It equals $\sigma_X / \sqrt{n}$.
• The standard error captures how far a sample statistic typically falls from the true
population value.
• *The standard error decreases as n increases.*
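Why the standard error shrinks with n: a short derivation sketch, assuming only that X1, …, Xn are independent with common variance σ_X² (the notation Var and SE is introduced here for illustration).

```latex
\begin{aligned}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right)
   = \frac{1}{n^{2}}\sum_{i=1}^{n}\operatorname{Var}(X_i)
   = \frac{n\,\sigma_X^{2}}{n^{2}}
   = \frac{\sigma_X^{2}}{n}, \\
\operatorname{SE}(\bar{X})
  &= \sqrt{\operatorname{Var}(\bar{X})}
   = \frac{\sigma_X}{\sqrt{n}} .
\end{aligned}
```

The second equality uses independence: the variance of a sum of independent variables is the sum of their variances.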
Normal Population Distribution: Increasing the Sample Size
X is a normally distributed random variable that captures “Waiting Time”
X ~ Normal(35, 10^2), with μ_X = 35, σ_X = 10
[Figure: panel for sample size n = 5]
[Figure: panel for sample size n = 30]
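To put numbers on the effect of sample size for the waiting-time example, the sketch below assumes X̄ ~ Normal(μ_X, σ_X²/n) and compares n = 5 with n = 30. The 5-unit window around the mean is an arbitrary choice for illustration.

```python
import math
from statistics import NormalDist

mu_x, sigma_x = 35, 10   # population mean and standard deviation

results = {}
for n in (5, 30):
    se = sigma_x / math.sqrt(n)                 # standard error of the sample mean
    x_bar_dist = NormalDist(mu=mu_x, sigma=se)  # sampling distribution of the sample mean
    # Probability that the sample mean lands between 30 and 40.
    p_within_5 = x_bar_dist.cdf(mu_x + 5) - x_bar_dist.cdf(mu_x - 5)
    results[n] = (se, p_within_5)
    print(f"n={n:2d}  standard error={se:.3f}  P(30 < mean < 40)={p_within_5:.3f}")
```

The larger sample has the smaller standard error, so its sample mean is more likely to land close to 35.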
Mini Dooper Example
Mini Dooper, a car manufacturer, hires a marketing firm to conduct a
customer satisfaction survey of a randomly selected sample of n=25 US
customers. One of the survey questions is “What price did you pay for your
recently purchased mini?” Mini Dooper knows that the selling price for all of
the cars sold in the US is normally distributed with a mean of $27,500, with a
standard deviation of $7,500.
1. What is the distribution of the selling price?
2. What is the distribution of the sample mean selling price?
Mini Dooper Example
3. What is the standard deviation in selling price?
4. What is the standard error of the average selling price?
5. What is the probability a random customer’s purchase is greater than
$20,000?
6. What is the probability the sample mean is greater than $20,000?
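One way to work through the Mini Dooper questions is with the standard-library `NormalDist` class. This is a sketch, not the course's official solution; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 27_500, 7_500, 25

price = NormalDist(mu, sigma)        # Q1: one selling price, Normal(27500, 7500^2)
se = sigma / math.sqrt(n)            # Q4: standard error = 7500 / sqrt(25)
mean_price = NormalDist(mu, se)      # Q2: sample mean, Normal(27500, 7500^2 / 25)

p_single = 1 - price.cdf(20_000)     # Q5: P(X > 20,000)
p_mean = 1 - mean_price.cdf(20_000)  # Q6: P(sample mean > 20,000)

print(f"standard deviation = {sigma}, standard error = {se:.0f}")   # Q3 and Q4
print(f"P(X > 20000) = {p_single:.4f}")
print(f"P(sample mean > 20000) = {p_mean:.7f}")
```

$20,000 is one standard deviation below the mean for a single purchase, but five standard errors below the mean for the average of 25 purchases, so the second probability is much closer to 1.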
The Central Limit Theorem
The Central Limit Theorem (or CLT) states:
No matter the distribution of X, as long as the sample size n is “large enough”,
then approximately:
$$\bar{X} \sim \text{Normal}\left(\mu_X, \frac{\sigma_X^2}{n}\right)$$
What is “large enough”?
• As a rule of thumb, n should be at least 30.
• Along with large n, the assumption that X1, X2, …, Xn is a random sample
is critical for the Central Limit Theorem to hold
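The CLT can be checked informally by simulation. The sketch below draws samples of size n = 30 from an exponential population, chosen here only because it is clearly non-normal; it is not from the slides. The repetition count and seed are arbitrary.

```python
import random
import statistics

random.seed(1)  # arbitrary seed for repeatability

# Exponential population with rate 1: mean 1, standard deviation 1, strongly skewed.
mu_x, sigma_x, n, reps = 1.0, 1.0, 30, 20_000
se = sigma_x / n ** 0.5   # standard error predicted by the CLT

sample_means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
                for _ in range(reps)]

# If the sample mean is roughly normal, about 68% of the sample means
# should fall within one standard error of mu_x.
share_within_1_se = sum(abs(m - mu_x) <= se for m in sample_means) / reps

print(f"mean of sample means = {statistics.fmean(sample_means):.3f}")
print(f"sd of sample means   = {statistics.stdev(sample_means):.3f} (theory {se:.3f})")
print(f"share within 1 SE    = {share_within_1_se:.3f}")
```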
Unknown Distribution of X: Sample Size Matters!
X is a random variable that captures “Hourly Ad Revenue on Facebook”
X has unknown distribution but known mean and standard deviation
X ~ unknown distribution, with μ_X = $7.35, σ_X = $3.34
[Figure: panel for sample size n = 5]
[Figure: panel for sample size n = 30]
Risky Insurance Example
Based on market research, Risky Insurance knows that over all its homeowner’s
(HO) claims, the mean claim amount is $3,016 with a standard deviation of $227.
The distribution of claim amounts is unknown. They conduct a random survey of
homeowner claims (n = 100).
1. What is the sampling distribution of the sample mean amount of HO claims?
2. What is the probability the sample mean of HO claims will be less than
$3000?
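Because n = 100 is large, the CLT lets us treat the sample mean as approximately normal even though the claim distribution is unknown. A sketch of questions 1 and 2 under that approximation, with illustrative variable names:

```python
import math
from statistics import NormalDist

mu, sigma, n = 3016, 227, 100

se = sigma / math.sqrt(n)        # standard error = 227 / sqrt(100)
x_bar = NormalDist(mu, se)       # Q1: sample mean is approximately Normal(3016, 227^2 / 100)
p_below_3000 = x_bar.cdf(3000)   # Q2: P(sample mean < 3000)

print(f"standard error = {se:.1f}")
print(f"P(sample mean < 3000) = {p_below_3000:.4f}")
```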
Risky Insurance Example
3. What is the probability the mean of the random sample will be within
$10 of the population mean HO claim amount?
4. What is the probability a random sample of 400 customer claims will
have a sample mean within $2 of the population mean?
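Questions 3 and 4 both ask for the probability that the sample mean falls in a window around the population mean. A sketch under the CLT normal approximation; the helper name `prob_within` is hypothetical:

```python
import math
from statistics import NormalDist

mu, sigma = 3016, 227

def prob_within(margin, n):
    """P(|sample mean - mu| < margin) under the CLT normal approximation."""
    se = sigma / math.sqrt(n)
    x_bar = NormalDist(mu, se)
    return x_bar.cdf(mu + margin) - x_bar.cdf(mu - margin)

p_q3 = prob_within(10, 100)   # Q3: within $10 of the mean, n = 100
p_q4 = prob_within(2, 400)    # Q4: within $2 of the mean, n = 400

print(f"Q3: {p_q3:.4f}")
print(f"Q4: {p_q4:.4f}")
```

The larger sample in question 4 halves the standard error, but the window is five times narrower, so the probability is smaller than in question 3.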
Today’s Objectives
1. Introduction to Statistics: Sample versus Population
2. Define and understand sampling distribution.
3. Describe and apply the Central Limit Theorem.