Probability Distribution
Continuous Probability Distributions
A continuous random variable can assume any value in an
interval on the real line or in a collection of intervals.
It is not possible to talk about the probability of the random
variable assuming a particular value.
Instead, we talk about the probability of the random
variable assuming a value within a given interval.
The probability of the random variable assuming a value
within some given interval from x1 to x2 is defined to be the
area under the graph of the probability density function
between x1 and x2.
[Figure: probability density curves f(x) for the uniform, normal, and exponential distributions, each with the interval from x1 to x2 marked on the x-axis.]
Normal Probability Distributions
The normal probability distribution is the most important
distribution for describing a continuous random variable.
It is widely used in statistical inference.
Normal Probability Density Function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-(x-\mu)^2 / (2\sigma^2)}$$
where:
μ = mean
σ = standard deviation
π = 3.14159
e = 2.71828
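A minimal Python sketch of the density formula above, checked against the standard library's `statistics.NormalDist`. The parameter values (500 and 50) are assumptions chosen only for illustration.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x, written term by term from the formula."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Hypothetical parameters, for illustration only.
mu, sigma = 500.0, 50.0
for x in (400.0, 500.0, 575.0):
    # The two columns should agree to floating-point precision.
    print(x, normal_pdf(x, mu, sigma), NormalDist(mu, sigma).pdf(x))
```

Note that the function returns the height of the curve, not a probability; probabilities come from areas under it.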
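It's a probability density function, so no matter what the values of
μ and σ are, it must integrate to 1:
$$\int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx = 1$$
$$E(X) = \mu = \int_{-\infty}^{+\infty} x \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx$$
$$\mathrm{Var}(X) = \sigma^2 = \left( \int_{-\infty}^{+\infty} x^2 \, \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \, dx \right) - \mu^2$$
Standard Deviation(X) = σ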
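The three integrals can be checked numerically. This is only a sketch: the midpoint rule, the helper name `integrate`, the cut-off at 10 standard deviations, and the values μ = 10, σ = 2 are all assumptions made for illustration.

```python
from statistics import NormalDist

def integrate(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

mu, sigma = 10.0, 2.0                      # hypothetical values
f = NormalDist(mu, sigma).pdf
lo, hi = mu - 10 * sigma, mu + 10 * sigma  # area beyond this range is negligible

total = integrate(f, lo, hi)                               # should be close to 1
mean = integrate(lambda x: x * f(x), lo, hi)               # should be close to mu
second_moment = integrate(lambda x: x * x * f(x), lo, hi)
variance = second_moment - mean ** 2                       # should be close to sigma^2

print(total, mean, variance, variance ** 0.5)
```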
Characteristics
The distribution is symmetric; its skewness
measure is zero.
The entire family of normal probability distributions
is defined by its mean μ and its standard deviation σ.
[Figure: normal curve with the mean μ and the standard deviation σ labeled.]
The highest point on the normal curve is at the mean,
which is also the median and mode.
The mean can be any numerical value: negative, zero,
or positive.
[Figure: normal curves with different means; x-axis marked at -10 and 20.]
The standard deviation determines the width of the
curve: larger values result in wider, flatter curves.
[Figure: normal curves with σ = 15 and σ = 25 plotted against x.]
Probabilities for the normal random variable are given
by areas under the curve. The total area under the curve
is 1 (0.5 to the left of the mean and 0.5 to the right).
[Figure: normal curve with an area of 0.5 on each side of the mean.]
Since the area under the curve represents probability, the
probability that a normal random variable equals one specific value is
zero. A single value encloses no area, since an area must be
bounded by two values. Thus,
P(x = 10) = 0
P(x = 3) = 0
P(x = 7.5) = 0
However, one can find probabilities such as the following:
P(1 < x < 3)
P(2.2 < x < 3.7)
P(x > 3)
Characteristics
68.26% of values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of values of a normal random variable
are within +/- 3 standard deviations of its mean.
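A short sketch that recomputes these three areas from the standard normal cumulative distribution function. Computed directly, the areas round to 68.27%, 95.45%, and 99.73%; the figures quoted above differ in the last digit, presumably because they come from a four-decimal table.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# Area within +/- k standard deviations of the mean, for k = 1, 2, 3.
areas = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, area in areas.items():
    print(f"within +/- {k} SD: {area:.2%}")
```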
[Figure: normal curve showing 68.26% of the area between μ - 1σ and μ + 1σ, 95.44% between μ - 2σ and μ + 2σ, and 99.72% between μ - 3σ and μ + 3σ.]
There may be thousands of normal distribution curves,
each with a different mean and a different standard
deviation. Since the shapes are different, the areas under
the curves between any two points are also different.
To make life easier, all normal distributions can be
converted to a standard normal distribution. A standard
normal distribution has a mean of 0 and a standard
deviation of 1.
No matter what μ and σ are, the area between μ - σ and μ + σ
is about 68.26%; the area between μ - 2σ and μ + 2σ is about
95.44%; and the area between μ - 3σ and μ + 3σ is about
99.72%.
Almost all values fall within 3 standard
deviations.
How good is this rule for real data?
Check some example data:
The mean weight of the women = 127.8 lbs
The standard deviation (SD) = 15.5 lbs
68% of 120 = 0.68 x 120 = ~82 runners
In fact, 79 runners fall within 1 SD (15.5 lbs) of the mean.
[Figure: histogram of the runners' weights (x-axis: pounds, 80 to 160; y-axis: percent), with 112.3, 127.8, and 143.3 marked.]
95% of 120 = 0.95 x 120 = 114 runners
In fact, 115 runners fall within 2 SDs of the mean.
[Figure: histogram of the runners' weights (x-axis: pounds, 80 to 160; y-axis: percent), with 96.8, 127.8, and 158.8 marked.]
99.7% of 120 = 0.997 x 120 = 119.6 runners
In fact, all 120 runners fall within 3 SDs of the mean.
[Figure: histogram of the runners' weights (x-axis: pounds, 80 to 160; y-axis: percent), with 81.3, 127.8, and 174.3 marked.]
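The arithmetic behind the three comparisons can be reproduced as below. The mean, SD, sample size, and observed counts are the ones quoted above; the variable names and the layout are only illustrative, and the raw weights are not available here, so nothing is recounted from data.

```python
mean, sd, n = 127.8, 15.5, 120        # values quoted in the text
observed = {1: 79, 2: 115, 3: 120}    # observed counts quoted in the text
rule = {1: 0.68, 2: 0.95, 3: 0.997}   # rounded rule-of-thumb fractions

summary = {}
for k in (1, 2, 3):
    lo, hi = mean - k * sd, mean + k * sd   # bounds of the +/- k SD interval
    expected = rule[k] * n                  # runners the rule predicts
    summary[k] = (round(lo, 1), round(hi, 1), expected)
    print(f"{k} SD: {lo:.1f} to {hi:.1f} lbs, "
          f"expected {expected:.1f}, observed {observed[k]}")
```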
Example
Suppose SAT scores roughly follow a normal distribution in
the U.S. population of college-bound students (with range
restricted to 200-800), and the average math SAT score is 500 with a
standard deviation of 50. Then:
68% of students will have scores between 450 and 550
95% will be between 400 and 600
99.7% will be between 350 and 650
Standard Normal Probability Distributions
The formula for the standardized normal probability
density function is
$$p(Z) = \frac{1}{(1)\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{Z-0}{1}\right)^2} = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} Z^2}$$
The Standard Normal Distribution (Z)
All normal distributions can be converted into the standard
normal curve by subtracting the mean and dividing by the
standard deviation:
$$Z = \frac{X - \mu}{\sigma}$$
Somebody calculated all the integrals for the standard
normal and put them in a table! So we never have to
integrate!
Even better, computers now do all the integration.
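As a small illustration of letting the computer do the integration, the sketch below standardizes a value and confirms that the area under the original curve equals the area under the standard normal curve. The numbers (mean 100, SD 15, x = 130) and the helper name `standardize` are hypothetical.

```python
from statistics import NormalDist

def standardize(x, mu, sigma):
    """Subtract the mean and divide by the standard deviation."""
    return (x - mu) / sigma

mu, sigma, x = 100.0, 15.0, 130.0        # hypothetical values
z = standardize(x, mu, sigma)

p_original = NormalDist(mu, sigma).cdf(x)  # area to the left of x on the original curve
p_standard = NormalDist().cdf(z)           # area to the left of z on the standard curve

print(z, p_original, p_standard)
```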
The letter z is used to designate the standard
normal random variable.
[Figure: standard normal curve centered at z = 0 with σ = 1.]
Applications of Standard Normal Distribution
Problem:
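What's the probability of getting a math SAT score of 575 or less,
given μ = 500 and σ = 50?
Z = (575 - 500) / 50 = 1.5
i.e., a score of 575 is 1.5 standard deviations above the mean.
$$P(X \le 575) = \int_{200}^{575} \frac{1}{(50)\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x-500}{50}\right)^2} \, dx \approx \int_{-\infty}^{1.5} \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2} Z^2} \, dz$$
Yikes!
But looking up Z = 1.5 in a standard normal chart (or entering it
into SAS) is no problem: the area is 0.9332.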
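To see that the table value really is the integral, the sketch below integrates the N(500, 50) density from 200 to 575 numerically and compares the result with a direct lookup at Z = 1.5. Python is used here instead of SAS, and the midpoint-rule helper `integrate` is an assumption for illustration only.

```python
from statistics import NormalDist

sat = NormalDist(mu=500, sigma=50)

def integrate(g, a, b, n=20000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

by_integration = integrate(sat.pdf, 200, 575)  # the "yikes" integral
by_lookup = NormalDist().cdf(1.5)              # the table lookup

print(round(by_integration, 4), round(by_lookup, 4))
```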
Problem:
Test scores of a special examination administered to all potential
employees of a firm are normally distributed with a mean of 500
points and a standard deviation of 100 points. What is the probability
that a score selected at random will be higher than 700?
P(x > 700) = ?
If we convert this normal variable, x, to a standard normal variable, z:
z = (x - μ) / σ = (700 - 500) / 100 = 2
-------------500----------700    x-scale
----------------0-----------2    z-scale
P(x > 700) = P(z > 2) = 1 - 0.9772 = 0.0228
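A quick check of this problem in Python, offered as a sketch: the upper-tail area is one minus the area to the left of 700.

```python
from statistics import NormalDist

scores = NormalDist(mu=500, sigma=100)

z = (700 - scores.mean) / scores.stdev  # standardized score
p_upper = 1 - scores.cdf(700)           # P(x > 700), same as P(z > 2)

print(z, round(p_upper, 4))
```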
Problem
If birth weights in a population are normally distributed with a
mean of 109 oz and a standard deviation of 13 oz,
a. What is the chance of obtaining a birth weight of 141 oz
or heavier when sampling birth records at random?
b. What is the chance of obtaining a birth weight of 120 oz or
lighter?
Solution
a. What is the chance of obtaining a birth weight of 141 oz
or heavier when sampling birth records at random?
Z = (141 - 109) / 13 = 2.46
From the chart or SAS, a Z of 2.46 corresponds to a right-tail (greater
than) area of: P(Z ≥ 2.46) = 1 - 0.9931 = 0.0069, or 0.69%
Solution
b. What is the chance of obtaining a birth weight of 120 oz
or lighter?
Z = (120 - 109) / 13 = 0.85
From the chart or SAS, a Z of 0.85 corresponds to a left-tail area of:
P(Z ≤ 0.85) = 0.8023 = 80.23%
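Both parts can be reproduced as sketched below. Z is rounded to two decimals to mimic the table lookup; the unrounded results are also computed, and they can differ slightly from the table-based answers.

```python
from statistics import NormalDist

weights = NormalDist(mu=109, sigma=13)
std = NormalDist()

# Table-style: round Z to two decimals first, as on the slide.
z_a = round((141 - 109) / 13, 2)
z_b = round((120 - 109) / 13, 2)
p_a_table = 1 - std.cdf(z_a)   # right tail, 141 oz or heavier
p_b_table = std.cdf(z_b)       # left tail, 120 oz or lighter

# Without rounding Z.
p_a_exact = 1 - weights.cdf(141)
p_b_exact = weights.cdf(120)

print(z_a, round(p_a_table, 4), round(p_a_exact, 4))
print(z_b, round(p_b_table, 4), round(p_b_exact, 4))
```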
Looking up probabilities in the
standard normal table
What is the area to the left of Z = 1.51 in a standard normal curve?
[Figure: standard normal table and curve with Z = 1.51 marked.]
Area is 93.45%
Are my data normal?
Not all continuous random variables are normally distributed!
It is important to evaluate how well the data are approximated
by a normal distribution.
Are my data normally distributed?
1. Look at the histogram! Does it appear bell shaped?
2. Compute descriptive summary measures: are the mean,
median, and mode similar?
3. Do 2/3 of observations lie within 1 std dev of the mean? Do
95% of observations lie within 2 std dev of the mean?
4. Look at a normal probability plot: is it approximately
linear?
5. Run tests of normality (such as Kolmogorov-Smirnov). But
be cautious: these tests are highly influenced by sample size!
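Checks 2 and 3 are easy to script. The sketch below runs them on simulated data; the sample (5000 draws with mean 127.8 and SD 15.5, fixed seed) is hypothetical and stands in for a real data set. The histogram, the probability plot, and formal tests are not covered here.

```python
import random
from statistics import mean, median, stdev

random.seed(0)  # fixed seed so the run is repeatable
# Simulated values; hypothetical data, not taken from the examples above.
data = [random.gauss(127.8, 15.5) for _ in range(5000)]

m, s = mean(data), stdev(data)

# Check 2: are the mean and median similar?
print(f"mean {m:.1f}, median {median(data):.1f}")

# Check 3: fraction of observations within 1 and 2 standard deviations.
within_1 = sum(abs(x - m) <= s for x in data) / len(data)
within_2 = sum(abs(x - m) <= 2 * s for x in data) / len(data)
print(f"within 1 SD: {within_1:.1%}, within 2 SD: {within_2:.1%}")
```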
Normal approximation to the binomial
When you have a binomial distribution where n is large and p
is middle-of-the-road (not too small, not too big, closer to .5),
then the binomial starts to look like a normal distribution. In
fact, this doesn't even take a particularly large n.
Recall: If the probability of being a smoker among a
group of cases with lung cancer is .6, what's the
probability that in a group of 8 cases you have fewer than 2
smokers?
When you have a binomial distribution where n is large and p
isn't too small (rule of thumb: mean > 5), then the binomial starts
to look like a normal distribution.
Recall: smoking example
[Figure: bar chart of the binomial distribution for the smoking example; y-axis marked at .27.]
Starting to have a normal shape even with fairly small n. You can
imagine that if n got larger, the bars would get thinner and thinner
and this would look more and more like a continuous function,
with a bell curve shape. Here np = 4.8.
What is the probability of fewer than 2 smokers?
Exact binomial probability (from before) = 0.00066 + 0.00786 = 0.0085
Normal approximation probability:
μ = 4.8
σ = 1.39
Z = (2 - 4.8) / 1.39 = -2.8 / 1.39 ≈ -2
P(Z < -2) = 0.0228
A little off, but in the right ballpark. We could also use the value
to the left of 1.5 (as we really wanted to know less than but not
including 2; this is called the continuity correction):
Z = (1.5 - 4.8) / 1.39 = -3.3 / 1.39 = -2.37
P(Z ≤ -2.37) = 0.0089
A fairly good approximation of
the exact probability, 0.0085.
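The comparison can be sketched in a few lines of Python. Z is left unrounded here, so the approximations differ a little from the table-based figures above; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 8, 0.6

# Exact binomial: P(X < 2) = P(X = 0) + P(X = 1)
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(2))

mu = n * p                     # 4.8
sigma = sqrt(n * p * (1 - p))  # about 1.39
std = NormalDist()

approx_plain = std.cdf((2 - mu) / sigma)        # area to the left of 2
approx_corrected = std.cdf((1.5 - mu) / sigma)  # continuity correction: left of 1.5

print(exact, approx_plain, approx_corrected)
```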
Practice problem
1. You are performing a cohort study. If the probability of
developing disease in the exposed group is .25 for the study
duration, and you randomly sample 500 exposed
people, what's the probability that at most 120 people
develop the disease?
Solution:
By hand (yikes!):
$$P(X \le 120) = P(X=0) + P(X=1) + P(X=2) + \dots + P(X=120) = \binom{500}{0}(.25)^{0}(.75)^{500} + \binom{500}{1}(.25)^{1}(.75)^{499} + \binom{500}{2}(.25)^{2}(.75)^{498} + \dots + \binom{500}{120}(.25)^{120}(.75)^{380}$$
OR Use SAS:
data _null_;
Cohort=cdf('binomial', 120, .25, 500);
put Cohort;
run;
0.323504227
OR use the normal approximation:
μ = np = 500(.25) = 125 and σ² = np(1-p) = 93.75; σ = 9.68
Z = (120 - 125) / 9.68 = -0.52
P(Z < -0.52) = 0.3015
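For readers without SAS, the same comparison can be sketched in Python. The exact sum plays the role of the binomial CDF call, and the continuity-corrected line is an addition that the solution above does not use.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, m = 500, 0.25, 120

# Exact: sum the binomial probabilities for X = 0, 1, ..., 120.
exact = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m + 1))

mu = n * p                     # 125
sigma = sqrt(n * p * (1 - p))  # about 9.68
cohort = NormalDist(mu, sigma)

approx = cohort.cdf(m)           # normal approximation with unrounded Z
approx_cc = cohort.cdf(m + 0.5)  # with continuity correction (not in the slide)

print(exact, approx, approx_cc)
```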
More Sample Problems
a. P(z ≤ 1.5) =
b. P(z ≤ 1.0) =
c. P(1 ≤ z ≤ 1.5) =
d. P(0 < z < 2.5) =
More Sample Problems
a. P(z ≤ 1.5) = 0.9332
b. P(z ≤ 1.0) = 0.8413
c. P(1 ≤ z ≤ 1.5) = 0.9332 - 0.8413 = 0.0919
d. P(0 < z < 2.5) = 0.9938 - 0.5000 = 0.4938
More Sample Problems
a. P(z ≤ -1.0) =
b. P(z ≥ -1) =
c. P(z ≥ -1.5) =
d. P(-3 < z ≤ 0) =
More Sample Problems
a. P(z ≤ -1.0) = 0.1587
b. P(z ≥ -1) = 1 - P(z ≤ -1) = 1 - 0.1587 = 0.8413
c. P(z ≥ -1.5) = 1 - P(z ≤ -1.5) = 1 - 0.0668 = 0.9332
d. P(-3 < z ≤ 0) = 0.5 - 0.0014 = 0.4986
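All eight answers can be checked against the standard normal CDF, as sketched below. Values computed this way may differ from table-based answers in the fourth decimal place.

```python
from statistics import NormalDist

phi = NormalDist().cdf  # area to the left of z under the standard normal curve

answers = {
    "P(z <= 1.5)":      phi(1.5),
    "P(z <= 1.0)":      phi(1.0),
    "P(1 <= z <= 1.5)": phi(1.5) - phi(1.0),
    "P(0 < z < 2.5)":   phi(2.5) - phi(0.0),
    "P(z <= -1.0)":     phi(-1.0),
    "P(z >= -1)":       1 - phi(-1.0),
    "P(z >= -1.5)":     1 - phi(-1.5),
    "P(-3 < z <= 0)":   phi(0.0) - phi(-3.0),
}

for label, value in answers.items():
    print(f"{label:18s} {value:.4f}")
```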
More Sample Problems
Given: μ = 77
σ = 20
a. P(x < 50) = ?
Convert to z: z = (x - μ) / σ = (50 - 77) / 20 = -1.35
P(x < 50) = P(z < -1.35) = 0.0885
b. P(x > 100) = ?
z = (100 - 77) / 20 = 1.15
P(x > 100) = P(z > 1.15) = 1 - P(z ≤ 1.15) = 1 - 0.8749 =
0.1251 or 12.51%
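Parts a and b, sketched in Python with the given mean of 77 and standard deviation of 20 (the variable names are illustrative):

```python
from statistics import NormalDist

dist = NormalDist(mu=77, sigma=20)

z_a = (50 - dist.mean) / dist.stdev    # z for x = 50
p_below_50 = dist.cdf(50)              # P(x < 50)

z_b = (100 - dist.mean) / dist.stdev   # z for x = 100
p_above_100 = 1 - dist.cdf(100)        # P(x > 100)

print(z_a, round(p_below_50, 4))
print(z_b, round(p_above_100, 4))
```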
Continuation of Sample Problem
c. x = ? to be considered a heavy user
The upper 20% of the area is in the right tail of the normal curve, so
80% of the area is to the left. Go to Table 1 and locate 0.8 (or
80%) as the table entry. The closest entry is 0.7995. That
point represents a z-value of 0.84. Use this value of z in the
following equation:
z = (x - μ) / σ
0.84 = (x - 77) / 20
x = 93.8 hours
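The reverse lookup (from an area back to a z-value) corresponds to the inverse CDF. A minimal sketch using `NormalDist.inv_cdf` from the Python standard library:

```python
from statistics import NormalDist

dist = NormalDist(mu=77, sigma=20)

z_cut = NormalDist().inv_cdf(0.80)         # z with 80% of the area to its left
x_cut = dist.mean + z_cut * dist.stdev     # solve z = (x - mu) / sigma for x

print(round(z_cut, 2), round(x_cut, 1))
```

Calling `dist.inv_cdf(0.80)` directly gives the same cutoff in one step.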
More Sample Problems
A statistics instructor grades on a curve. He does not want to
give an A to more than 15 percent of his class. If test scores of
students in statistics are normally distributed with a mean of
75 and a standard deviation of 10, what should be the cut-off
point for an A?
The upper 15% of the area is in the right tail, so 85% is to the
left; the closest table entry corresponds to z = 1.04.
z = (x - μ) / σ
1.04 = (x - 75) / 10
x = 85.4 or 85
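A sketch of the same cut-off calculation. It also shows what happens if the cut-off is rounded down to 85: the share of A grades then rises slightly above 15 percent, which may or may not matter to the instructor.

```python
from statistics import NormalDist

scores = NormalDist(mu=75, sigma=10)

cutoff = scores.inv_cdf(1 - 0.15)        # score with 85% of the class below it
share_above = 1 - scores.cdf(cutoff)     # by construction, 15%
share_if_85 = 1 - scores.cdf(85)         # share of A grades with a cut-off of 85

print(round(cutoff, 1), round(share_above, 4), round(share_if_85, 4))
```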
Sample Problems
The service life of a certain brand of automobile battery is
normally distributed with a mean of 1000 days and a standard
deviation of 100 days. The manufacturer of the battery wants
to offer a guarantee, but does not know the length of the
warranty. It does not want to replace more than 10 percent of
the batteries sold. What should be the length of the warranty?
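At most 10% of the area should lie in the left tail; the closest
table entry corresponds to z = -1.28.
z = (x - μ) / σ
-1.28 = (x - 1000) / 100
x = 872 days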
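The warranty length is a lower-tail cut-off, so the inverse CDF is evaluated at 0.10. A minimal sketch:

```python
from statistics import NormalDist

life = NormalDist(mu=1000, sigma=100)

warranty_days = life.inv_cdf(0.10)   # service life that 10% of batteries fall below
share_replaced = life.cdf(872)       # share failing within the 872-day answer above

print(round(warranty_days), round(share_replaced, 4))
```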
Reference:
Anderson Sweeney Williams