Sampling Distributions in Data Analytics

The document discusses sampling distributions and the central limit theorem. It begins by defining key terms like population, sample, statistic, and sampling distribution. It then provides examples to illustrate concepts like the mean and standard deviation of the sample mean. The main points are that as sample size increases, the sampling distribution of the mean takes on a bell shape even if the population is not normally distributed, and for samples of 30 or more the sample mean is approximately normally distributed. It concludes with an example problem to demonstrate applying the central limit theorem.

Uploaded by

Jewel Galvez

Romblon State University

College of Engineering and Technology


Civil Engineering Department

Engineering Data Analysis

Chapter 6
Sampling Distributions

Prepared by:

Engr. Jeffy Jones F. Fetalvero


Lecturer
ENGINEERING DATA ANALYSIS SAMPLING DISTRIBUTIONS

Topic Overview

A statistic, such as the sample mean or the sample standard deviation, is a
number computed from a sample. Since a sample is random, every statistic is a
random variable: it varies from sample to sample in a way that cannot be
predicted with certainty. As a random variable it has a mean, a standard
deviation, and a probability distribution. The probability distribution of a statistic is
called its sampling distribution. Typically, sample statistics are not ends in
themselves, but are computed in order to estimate the corresponding population
parameters. This chapter introduces the concepts of the mean, the standard
deviation, and the sampling distribution of a sample statistic, with an emphasis on
the sample mean.

Intended Learning Outcomes

At the end of this chapter, the students are expected to:

1. Become familiar with the concept of the probability distribution of the
sample mean.
2. Understand the meaning of the formulas for the mean and standard
deviation of the sample mean.
3. Learn what the sampling distribution of 𝑋̅ is when the sample size is large.
4. Learn what the sampling distribution of 𝑋̅ is when the population is normal.
5. Understand the meaning of the formulas for the mean and standard
deviation of the sample proportion.
6. Learn what the sampling distribution of 𝑝̂ is when the sample size is large.


6.1: The Mean and Standard Deviation of the Sample Mean

Suppose we wish to estimate the mean μ of a population. In actual practice we
would typically take just one sample. Imagine however that we take sample after
sample, all of the same size n, and compute the sample mean 𝑥̅ each time. The
sample mean 𝑥̅ is a random variable: it varies from sample to sample in a way
that cannot be predicted with certainty. We will write 𝑋̅ when the sample mean
is thought of as a random variable, and write 𝑥̅ for the values that it takes. The
random variable 𝑋̅ has a mean, denoted 𝜇𝑋̅ , and a standard deviation, denoted
𝜎𝑋̅ . Here is an example with such a small population and small sample size that
we can actually write down every single sample.

Example 6.1.1
A rowing team consists of four rowers who weigh 152, 156, 160, and 164 pounds.
Find all possible random samples with replacement of size two and compute the
sample mean for each one. Use them to find the probability distribution, the
mean, and the standard deviation of the sample mean 𝑋̅.

Solution
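The following table shows all possible samples with replacement of size two, along
with the mean of each:

Sample     Mean    Sample     Mean    Sample     Mean    Sample     Mean
152, 152   152     156, 152   154     160, 152   156     164, 152   158
152, 156   154     156, 156   156     160, 156   158     164, 156   160
152, 160   156     156, 160   158     160, 160   160     164, 160   162
152, 164   158     156, 164   160     160, 164   162     164, 164   164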

The table shows that there are seven possible values of the sample mean 𝑋̅. The
value 𝑥̅ = 152 happens only one way (the rower weighing 152 pounds must be
selected both times), as does the value 𝑥̅ = 164, but the other values happen
more than one way, hence are more likely to be observed than 152 and 164 are.
Since the 16 samples are equally likely, we obtain the probability distribution of
the sample mean just by counting:

𝑥̅       152    154    156    158    160    162    164
P(𝑥̅)   1/16   2/16   3/16   4/16   3/16   2/16   1/16


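For 𝜇𝑋̅ , we obtain

𝜇𝑋̅ = Σ 𝑥̅ P(𝑥̅)
= 152(1/16) + 154(2/16) + 156(3/16) + 158(4/16) + 160(3/16) + 162(2/16) + 164(1/16) = 158

For 𝜎𝑋̅ , we first compute

Σ 𝑥̅² P(𝑥̅)
= 152²(1/16) + 154²(2/16) + 156²(3/16) + 158²(4/16) + 160²(3/16) + 162²(2/16) + 164²(1/16)

which is 24,974, so that

𝜎𝑋̅ = √(Σ 𝑥̅² P(𝑥̅) − 𝜇𝑋̅²) = √(24,974 − 158²) = √10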

The mean and standard deviation of the population {152,156,160,164} in the
example are μ = 158 and 𝜎 = √20. The mean of the sample mean 𝑋̅ that we have
just computed is exactly the mean of the population. The standard deviation of
the sample mean 𝑋̅ that we have just computed is the standard deviation of the
population divided by the square root of the sample size: √10 = √20/√2. These
relationships are not coincidences, but are illustrations of the following formulas.
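The counting argument in Example 6.1.1 can be checked by brute force. The following is a minimal Python sketch, not part of the original handout; the variable names (weights, dist, mu_xbar, and so on) are illustrative choices. It lists all 16 samples of size two drawn with replacement, builds the probability distribution of the sample mean, and compares its mean and standard deviation with those of the population.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import sqrt

weights = [152, 156, 160, 164]  # the population of four rowers
n = 2                           # sample size

# Every ordered sample of size n drawn with replacement (4**2 = 16 of them).
samples = list(product(weights, repeat=n))

# Sample mean of each sample, kept as an exact fraction.
means = [Fraction(sum(s), n) for s in samples]

# Probability distribution of the sample mean: count and divide by 16.
counts = Counter(means)
dist = {m: Fraction(c, len(samples)) for m, c in sorted(counts.items())}

# Mean and variance of the sample mean from its distribution.
mu_xbar = sum(m * p for m, p in dist.items())
var_xbar = sum(m**2 * p for m, p in dist.items()) - mu_xbar**2
sigma_xbar = sqrt(var_xbar)

# Population mean and (population) variance for comparison.
mu = Fraction(sum(weights), len(weights))
var = sum((w - mu) ** 2 for w in weights) / len(weights)
sigma = sqrt(var)

for m, p in dist.items():
    print(f"xbar = {m}: P = {p}")
print("mean of sample mean:", mu_xbar, "| population mean:", mu)
print("sd of sample mean:", sigma_xbar, "| sigma / sqrt(n):", sigma / sqrt(n))
```

Exact fractions are used so that the probabilities and the mean come out as exact values rather than rounded decimals.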

Suppose random samples of size n are drawn from a population with mean μ and
standard deviation σ. The mean 𝜇𝑋̅ and standard deviation 𝜎𝑋̅ of the sample
mean 𝑋̅ satisfy

𝜇𝑋̅ = 𝜇 and 𝜎𝑋̅ = 𝜎/√𝑛

The first equation says that if we could take every possible sample from the
population and compute the corresponding sample mean, then those numbers
would center at the number we wish to estimate, the population mean μ. The
second equation says that averages computed from samples vary less than
individual measurements on the population do, and quantifies the relationship.


Example 6.1.2
The mean and standard deviation of the tax value of all vehicles registered in a
certain state are μ = $13,525 and σ = $4,180. Suppose random samples of size 100
are drawn from the population of vehicles. What are the mean 𝜇𝑋̅ and standard
deviation 𝜎𝑋̅ of the sample mean 𝑋̅?

Solution
Since n = 100, the formulas yield

𝜇𝑋̅ = 𝜇 = $13,525 and 𝜎𝑋̅ = 𝜎/√𝑛 = $4,180/√100 = $418
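For readers who prefer to check such calculations numerically, here is a small Python sketch of the two formulas applied to the figures in Example 6.1.2. The helper name sample_mean_params is hypothetical and is introduced only for illustration.

```python
from math import sqrt


def sample_mean_params(mu, sigma, n):
    """Return (mean, standard deviation) of the sample mean for samples of size n."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    # The mean is unchanged; the standard deviation shrinks by a factor of sqrt(n).
    return mu, sigma / sqrt(n)


# Figures from Example 6.1.2: mu = 13,525, sigma = 4,180, n = 100.
mu_xbar, sigma_xbar = sample_mean_params(mu=13_525, sigma=4_180, n=100)
print("mean of the sample mean:", mu_xbar)
print("standard deviation of the sample mean:", sigma_xbar)
```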

6.2: The Sampling Distribution of the Sample Mean

In Example 6.1.1, we constructed the probability distribution of the sample mean
for samples of size two drawn from the population of four rowers. The probability
distribution is:

𝑥̅       152    154    156    158    160    162    164
P(𝑥̅)   1/16   2/16   3/16   4/16   3/16   2/16   1/16

Figure 6.2.1 shows a side-by-side comparison of a histogram for the original
population and a histogram for this distribution. Whereas the distribution of the
population is uniform, the sampling distribution of the mean has a shape
approaching the shape of the familiar bell curve. This phenomenon of the
sampling distribution of the mean taking on a bell shape even though the
population distribution is not bell-shaped happens in general. Here is a somewhat
more realistic example.

Figure 6.2.1. Distribution of a Population and a Sample Mean


Suppose we take samples of size 1, 5, 10, or 20 from a population that consists
entirely of the numbers 0 and 1, half the population 0, half 1, so that the
population mean is 0.5. The sampling distributions are:

Histograms illustrating these distributions are shown in Figure 6.2.2.

Figure 6.2.2. Distributions of the Sample Mean


As n increases the sampling distribution of 𝑋̅ evolves in an interesting way: the
probabilities on the lower and the upper ends shrink and the probabilities in the
middle become larger in relation to them. If we were to continue to increase n
then the shape of the sampling distribution would become smoother and more
bell-shaped.
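The sampling distributions for the population of 0s and 1s can be computed directly. The sketch below is an illustration, not part of the original handout. It assumes the draws are independent (sampling with replacement, or an effectively unlimited population), so that the number of 1s in a sample of size n follows a binomial distribution with success probability 0.5. The function name sampling_distribution is hypothetical.

```python
from math import comb


def sampling_distribution(n):
    """Return {xbar: probability} for samples of size n from a half-0, half-1 population.

    Assumes independent draws, so the count of 1s is binomial(n, 0.5)
    and the sample mean is that count divided by n.
    """
    return {k / n: comb(n, k) / 2**n for k in range(n + 1)}


for n in (1, 5, 10, 20):
    dist = sampling_distribution(n)
    end_prob = dist[0.0]            # probability of the extreme value xbar = 0
    mid_prob = max(dist.values())   # largest probability, found near xbar = 0.5
    print(f"n = {n:2d}: P(xbar = 0) = {end_prob:.6f}, largest P = {mid_prob:.6f}, "
          f"ratio = {mid_prob / end_prob:.1f}")
```

The printed ratio of the largest probability to the probability at the end point grows quickly with n, which is the numerical counterpart of the ends shrinking relative to the middle.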

What we are seeing in these examples does not depend on the particular
population distributions involved. In general, one may start with any distribution
and the sampling distribution of the sample mean will increasingly resemble the
bell-shaped normal curve as the sample size increases. This is the content of the
Central Limit Theorem.

The Central Limit Theorem


For samples of size 30 or more, the sample mean is approximately normally
distributed, with mean 𝜇𝑋̅ = 𝜇 and standard deviation 𝜎𝑋̅ = 𝜎/√𝑛, where n is the
sample size. The larger the sample size, the better the approximation. The Central
Limit Theorem is illustrated for several common population distributions in Figure
6.2.3.

Figure 6.2.3. Distribution of Populations and Sample Means


The dashed vertical lines in the figures locate the population mean. Regardless of
the distribution of the population, as the sample size is increased the shape of the
sampling distribution of the sample mean becomes increasingly bell-shaped,
centered on the population mean. Typically, by the time the sample size is 30 the
distribution of the sample mean is practically the same as a normal distribution.
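A short simulation can make this behavior concrete. The sketch below is an illustration under stated assumptions, not part of the original handout: the population is taken to be exponential with mean 1 and standard deviation 1 (a strongly skewed distribution chosen only as an example), the number of repeated samples and the random seed are arbitrary, and the sample size is n = 30.

```python
import random
from math import sqrt
from statistics import mean, pstdev

random.seed(0)         # fixed seed so that the run is repeatable

n = 30                 # sample size
num_samples = 10_000   # number of repeated samples (arbitrary choice)
mu = sigma = 1.0       # an exponential population with rate 1 has mean 1 and sd 1

# Draw many samples of size n and record the mean of each one.
sample_means = [
    sum(random.expovariate(1.0) for _ in range(n)) / n
    for _ in range(num_samples)
]

center = mean(sample_means)      # should be close to mu
spread = pstdev(sample_means)    # should be close to sigma / sqrt(n)
sigma_xbar = sigma / sqrt(n)

# Share of sample means within one sigma_xbar of mu; a normal curve gives about 68%.
within_one = sum(abs(m - mu) <= sigma_xbar for m in sample_means) / num_samples

print(f"mean of sample means: {center:.4f} (population mean {mu})")
print(f"sd of sample means:   {spread:.4f} (sigma / sqrt(n) = {sigma_xbar:.4f})")
print(f"share within one sd:  {within_one:.3f}")
```

Changing n to a smaller value such as 2 or 5 and plotting a histogram of sample_means shows the skew of the population still visible in the sampling distribution.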

The importance of the Central Limit Theorem is that it allows us to make probability
statements about the sample mean, specifically in relation to its value in
comparison to the population mean, as we will see in the examples. But to use
the result properly we must first realize that there are two separate random
variables (and therefore two probability distributions) at play:

1. X, the measurement of a single element selected at random from the
population; the distribution of X is the distribution of the population, with mean the
population mean μ and standard deviation the population standard deviation σ;

2. 𝑋̅, the mean of the measurements in a sample of size n; the distribution of 𝑋̅ is
its sampling distribution, with mean 𝜇𝑋̅ = 𝜇 and standard deviation 𝜎𝑋̅ = 𝜎/√𝑛.

Example 6.2.1
Let 𝑋̅ be the mean of a random sample of size 50 drawn from a population with
mean 112 and standard deviation 40.

1. Find the mean and standard deviation of 𝑋̅.
2. Find the probability that 𝑋̅ assumes a value between 110 and 114.
3. Find the probability that 𝑋̅ assumes a value greater than 113.

Solution:
1. By the formulas in the previous section

𝜇𝑋̅ = 𝜇 = 112 and 𝜎𝑋̅ = 𝜎/√𝑛 = 40/√50 = 5.65685

2. Since the sample size is at least 30, the Central Limit Theorem applies: 𝑋̅ is
approximately normally distributed. We compute probabilities using the normal
distribution in the usual way, just being careful to use 𝜎𝑋̅ and not σ when we
standardize:

P(110 < 𝑋̅ < 114) = P((110 − 112)/5.65685 < Z < (114 − 112)/5.65685)
= P(−0.35 < Z < 0.35) = 0.6368 − 0.3632 = 0.2736

3. Similarly

P(𝑋̅ > 113) = P(Z > (113 − 112)/5.65685) = P(Z > 0.18)
= 1 − P(Z < 0.18) = 1 − 0.5714 = 0.4286

Note that if in the above example we had been asked to compute the probability
that the value of a single randomly selected element of the population exceeds
113, that is, to compute the number P(X > 113), we would not have been able to
do so, since we do not know the distribution of X, but only that its mean is 112 and
its standard deviation is 40. By contrast we could compute P(𝑋̅ > 113) even without
complete knowledge of the distribution of X because the Central Limit Theorem
guarantees that 𝑋̅ is approximately normal.
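The probabilities in Example 6.2.1 can also be evaluated in software rather than from a printed table. The following is a minimal sketch using NormalDist from Python's standard statistics module; the variable names are illustrative. It treats 𝑋̅ as exactly normal, which is the approximation the Central Limit Theorem justifies here.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 112, 40, 50

# Approximate sampling distribution of the sample mean: note sigma / sqrt(n), not sigma.
xbar = NormalDist(mu=mu, sigma=sigma / sqrt(n))

p_between = xbar.cdf(114) - xbar.cdf(110)   # P(110 < Xbar < 114)
p_above = 1 - xbar.cdf(113)                 # P(Xbar > 113)

print(f"sd of the sample mean: {xbar.stdev:.5f}")
print(f"P(110 < Xbar < 114) = {p_between:.4f}")
print(f"P(Xbar > 113)       = {p_above:.4f}")
```

Because the software does not round z to two decimal places, its results can differ in the third decimal place from values read off a standard normal table.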

Normally Distributed Populations


The Central Limit Theorem says that no matter what the distribution of the
population is, as long as the sample is “large,” meaning of size 30 or more, the
sample mean is approximately normally distributed. If the population is normal to
begin with then the sample mean also has a normal distribution, regardless of the
sample size.

For samples of any size drawn from a normally distributed population, the sample
mean is normally distributed, with mean 𝜇𝑋̅ = 𝜇 and standard deviation 𝜎𝑋̅ = 𝜎/√𝑛,
where n is the sample size.

The effect of increasing the sample size is shown in Figure 6.2.4.


Figure 6.2.4: Distribution of Sample Means for a Normal Population

Example 6.2.2
An automobile battery manufacturer claims that its midgrade battery has a mean
life of 50 months with a standard deviation of 6 months. Suppose the distribution
of battery lives of this particular brand is approximately normal.

1. On the assumption that the manufacturer’s claims are true, find the probability
that a randomly selected battery of this type will last less than 48 months.
2. On the same assumption, find the probability that the mean of a random
sample of 36 such batteries will be less than 48 months.

Solution:
1. Since the population is known to have a normal distribution

P(X < 48) = P(Z < (48 − 50)/6) = P(Z < −0.33) = 0.3707

2. The sample mean has mean 𝜇𝑋̅ = 𝜇 = 50 and standard deviation
𝜎𝑋̅ = 𝜎/√𝑛 = 6/√36 = 1. Thus

P(𝑋̅ < 48) = P(Z < (48 − 50)/1) = P(Z < −2) = 0.0228
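The contrast between a single battery and the mean of 36 batteries in Example 6.2.2 can be reproduced with a few lines of Python. This is a sketch using NormalDist from the standard statistics module, with illustrative variable names; it relies on the example's own assumption that battery lives are normally distributed.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 50, 6, 36

single = NormalDist(mu=mu, sigma=sigma)            # life of one battery, X
xbar = NormalDist(mu=mu, sigma=sigma / sqrt(n))    # mean life of n batteries, Xbar

p_single = single.cdf(48)   # P(X < 48)
p_mean = xbar.cdf(48)       # P(Xbar < 48)

print(f"P(X < 48)    = {p_single:.4f}")
print(f"P(Xbar < 48) = {p_mean:.4f}")
```

The only difference between the two calculations is the standard deviation used, yet the second probability is far smaller: a sample mean 2 months below the claimed mean is much less likely than a single battery falling that short.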

6.3: The Sample Proportion

Often sampling is done in order to estimate the proportion of a population that
has a specific characteristic, such as the proportion of all items coming off an
assembly line that are defective or the proportion of all people entering a retail
store who make a purchase before leaving. The population proportion is denoted
p and the sample proportion is denoted 𝑝̂ . Thus, if in reality 43% of people entering
a store make a purchase before leaving, p = 0.43; if in a sample of 200 people
entering the store, 78 make a purchase, 𝑝̂ = 78/200 = 0.39.

The sample proportion is a random variable: it varies from sample to sample in a
way that cannot be predicted with certainty. Viewed as a random variable it will
be written 𝑃̂. It has a mean 𝜇𝑃̂ and a standard deviation 𝜎𝑃̂ . Here are formulas for
their values.

Suppose random samples of size n are drawn from a population in which the
proportion with a characteristic of interest is p. The mean 𝜇𝑃̂ and standard
deviation 𝜎𝑃̂ of the sample proportion 𝑃̂ satisfy

𝜇𝑃̂ = p and 𝜎𝑃̂ = √(pq/n)

where q = 1 − p.

The Central Limit Theorem has an analogue for the sample proportion 𝑃̂ . To
see how, imagine that every element of the population that has the characteristic
of interest is labeled with a 1, and that every element that does not is labeled with
a 0. This gives a numerical population consisting entirely of zeros and ones. Clearly
the proportion of the population with the special characteristic is the proportion
of the numerical population that are ones; in symbols,

p = (number of 1s)/N

where N is the number of elements in the population. But of course, the sum of all
the zeros and ones is simply the number of ones, so the mean μ of the numerical
population is

μ = Σx/N = (number of 1s)/N

Thus, the population proportion p is the same as the mean μ of the corresponding
population of zeros and ones. In the same way the sample proportion 𝑝̂ is the
same as the sample mean 𝑥̅ . Thus, the Central Limit Theorem applies to 𝑝̂ .
However, the condition that the sample be large is a little more complicated than
just being of size at least 30.

The Sampling Distribution of the Sample Proportion


For large samples, the sample proportion is approximately normally distributed,
with mean 𝜇𝑃̂ = p and standard deviation 𝜎𝑃̂ = √(pq/n).

A sample is large if the interval [p − 3𝜎𝑃̂ , p + 3𝜎𝑃̂ ] lies wholly within the interval
[0,1].

In actual practice p is not known, hence neither is 𝜎𝑃̂ . In that case in order to
check that the sample is sufficiently large we substitute the known quantity 𝑝̂ for
p. This means checking that the interval

[𝑝̂ − 3√(𝑝̂(1 − 𝑝̂)/n), 𝑝̂ + 3√(𝑝̂(1 − 𝑝̂)/n)]

lies wholly within the interval [0,1]. This is illustrated in the examples.

Figure 6.3.1 shows that when p = 0.1, a sample of size 15 is too small but a sample
of size 100 is acceptable.

Figure 6.3.1: Distribution of Sample Proportions

Figure 6.3.2 shows that when p = 0.5 a sample of size 15 is acceptable.

Figure 6.3.2: Distribution of Sample Proportions for p = 0.5 and n = 15
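The large-sample condition for proportions is easy to check in code. The following Python sketch is an illustration, not part of the original handout; the function name is_large_sample is hypothetical. It tests whether the interval from three standard deviations below p to three standard deviations above p lies wholly within [0, 1], and applies the test to the three cases shown in Figures 6.3.1 and 6.3.2.

```python
from math import sqrt


def is_large_sample(p, n):
    """Return True if [p - 3*sigma, p + 3*sigma] lies wholly within [0, 1].

    Here sigma = sqrt(p * (1 - p) / n) is the standard deviation of the
    sample proportion. When p is unknown, pass the sample proportion instead.
    """
    sigma = sqrt(p * (1 - p) / n)
    low, high = p - 3 * sigma, p + 3 * sigma
    return 0 <= low and high <= 1


for p, n in [(0.1, 15), (0.1, 100), (0.5, 15)]:
    verdict = "acceptable" if is_large_sample(p, n) else "too small"
    print(f"p = {p}, n = {n}: {verdict}")
```

The same sample size of 15 fails for p = 0.1 and passes for p = 0.5, which shows why the condition cannot be reduced to a single threshold on n.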

Example 6.3.1
Suppose that in a population of voters in a certain region 38% are in favor of
a particular bond issue. Nine hundred randomly selected voters are asked if they
favor the bond issue.

1. Verify that the sample proportion 𝑝̂ computed from samples of size 900 meets
the condition that its sampling distribution be approximately normal.
2. Find the probability that the sample proportion computed from a sample of
size 900 will be within 5 percentage points of the true population proportion.


Solution:
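1. The information given is that p = 0.38, hence q = 1 − p = 0.62. First, we use the
formulas to compute the mean and standard deviation of 𝑝̂ :

𝜇𝑃̂ = p = 0.38 and 𝜎𝑃̂ = √(pq/n) = √((0.38)(0.62)/900) = 0.01618

Then 3𝜎𝑃̂ = 3(0.01618) = 0.04854 ≈ 0.05 so

[p − 3𝜎𝑃̂ , p + 3𝜎𝑃̂ ] = [0.38 − 0.05, 0.38 + 0.05] = [0.33, 0.43]

which lies wholly within the interval [0,1], so it is safe to assume that 𝑝̂ is
approximately normally distributed.

2. To be within 5 percentage points of the true population proportion 0.38 means
to be between 0.38 − 0.05 = 0.33 and 0.38 + 0.05 = 0.43. Thus

P(0.33 < 𝑃̂ < 0.43) = P((0.33 − 0.38)/0.01618 < Z < (0.43 − 0.38)/0.01618)
= P(−3.09 < Z < 3.09) = 0.9990 − 0.0010 = 0.9980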
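The calculation in Example 6.3.1 can be verified numerically. This is a minimal Python sketch using NormalDist from the standard statistics module, with illustrative variable names; it treats the sample proportion as exactly normal, which is the large-sample approximation described above.

```python
from math import sqrt
from statistics import NormalDist

p, n = 0.38, 900
q = 1 - p

# Mean and standard deviation of the sample proportion.
mu_phat = p
sigma_phat = sqrt(p * q / n)

# Approximate sampling distribution of the sample proportion.
phat = NormalDist(mu=mu_phat, sigma=sigma_phat)

# Probability of landing within 5 percentage points of p.
prob_within = phat.cdf(p + 0.05) - phat.cdf(p - 0.05)

print(f"sd of the sample proportion: {sigma_phat:.5f}")
print(f"P(0.33 < P-hat < 0.43) = {prob_within:.4f}")
```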

