Book
Preface
1 Basic Concepts
1.1 Definitions and Properties
1.1.1 Definitions
1.1.2 Basic Properties
1.2 Equally Likely Outcomes
1.3 Conditional Probability and Bayes’ Theorem
1.3.1 Bayes’ Theorem
1.4 Independence
1.5 Using R for Computation
2 Sampling and Repeated Trials
2.1 Bernoulli Trials
2.1.1 Using R to Compute Probabilities
2.2 Poisson Approximation
2.3 Sampling With and Without Replacement
2.3.1 The Hypergeometric Distribution
2.3.2 Hypergeometric Distribution as a Series of Dependent Trials
2.3.3 Binomial Approximation to the Hypergeometric Distribution
3 Discrete Random Variables
3.1 Random Variables as Functions
3.1.1 Common Distributions
3.2 Independent and Dependent Variables
3.2.1 Independent Variables
3.2.2 Conditional, Joint, and Marginal Distributions
3.2.3 Memoryless Property of the Geometric Random Variable
3.2.4 Multinomial Distributions
3.3 Functions of Random Variables
3.3.1 Distribution of f(X) and f(X1, X2, . . . , Xn)
3.3.2 Functions and Independence
4 Summarizing Discrete Random Variables
4.1 Expected Value
4.1.1 Properties of the Expected Value
We believe that many foundational ideas of Probability and Statistics are best understood
when their natural connection is emphasised. We feel that the interested student should
learn the mathematical rigour of Probability, the motivating examples and techniques from
Statistics, and an instructive technology to perform computations relating to both in an
inclusive manner. These formed our main motivations for writing this book. We have
chosen to use the R software environment to demonstrate an available computational tool.
The book is intended to be an undergraduate text for a course on Probability Theory.
We had in mind courses such as the one year (two semester) Probability course at many
universities in India such as the Indian Statistical Institute or Chennai Mathematical
Institute, or a one semester (or two quarter) Probability course as is commonly offered
as an upper division, post-calculus elective at many North American universities. The
Statistics material and the package R are introduced so as to emphasise motivations and
applications of the probabilistic material. We assume that our readers are well-versed in
calculus, have a basic understanding of the theory of sets and functions, combinatorics,
and proof techniques, and have at least a passing awareness of the distinction between
countable and uncountable infinities. We do not assume any particular experience of Linear
Algebra or Real Analysis.
In Chapter 1 of this book we begin with an introduction to Outcomes, Sample Space,
Events and the axiomatic definition of Probability. Then we discuss the concepts of
conditional probability, independence and Bayes’ Theorem. We conclude this chapter with
a basic introduction to R. R is a Free Open Source software environment that runs on
all major software platforms, and instructions to download and install it are available at
[Link]
We begin Chapter 2 by applying the notion of independence to repeated trials (Bernoulli
Trials) and discuss the Binomial and Geometric distributions. We introduce the Poisson
distribution as a limiting approximation of the Binomial. We conclude this chapter with a
discussion on Sampling with and without replacement. The Hypergeometric Distribution
is thus introduced here and we prove that it can be approximated by the Binomial. Throughout
this chapter and later in the book we provide the R code for calculating the probabilities
associated with common distributions.
In Chapters 3 and 4 we introduce discrete random variables (functions on a sample space
whose range is countable) and related concepts. In Chapter 3, we define the probability
mass function, distribution function, and independence for random variables. We introduce
the Multinomial distribution and show the memoryless property of the Geometric random
variable. The chapter concludes by providing a method to compute the distributions of
functions of one and several random variables, defining the concept of joint distribution
along the way. In Chapter 4, we define Expectation, Variance, Covariance, Conditional
Expectation and Conditional Variance for discrete random variables. Results involving
these quantities for standard distributions are presented (and proved) as well. We also state
and prove the Markov and Chebyshev inequalities along with the notion of standardising
random variables to mean zero and variance one.
Working with uncountable spaces and understanding the probability density function of
an absolutely continuous random variable are challenging without assuming a background
in Real Analysis but we make a modest attempt towards this in Chapter 5. We begin with
a description of uncountable sample spaces. After having described events in a temporary
manner in Chapter 1 we provide a precise definition here but comfort the reader that we
shall avoid the most general events and at most consider countable union/intersection
of intervals. This allows us to be fairly rigorous with random variables having piecewise
continuous probability density functions using results from basic calculus. After this we
imitate the program conducted in Chapter 3. Standard distributions such as Uniform,
Exponential, and Normal are discussed. While computing densities of sums and ratios of
independent random variables we introduce the Gamma distribution and use it to derive
the Beta distribution as an example of ratio of dependent Gamma random variables.
In Chapter 6 we define Variance, Covariance, Conditional Expectation and Conditional
Variance for continuous random variables and summarise their properties. Moment
generating functions for random variables are defined. At this point, to respect the minimal
background assumption on our reader, we state a few important results without proof, such
as the fact that the moment generating function characterises the distribution of a random
variable. The chapter ends with a section on Bivariate Normal random variables. We
have done all computations in this section without using Linear Algebra, but the notational
efficiency of using Linear Algebra is explained via exercises for the interested reader.
With the foundational ideas of Probability laid out we proceed in Chapter 7 with
Sampling and Descriptive Statistics. The empirical distribution, the sample mean, variance
and proportion are defined along with their properties. Simulation is used to develop
intuition regarding sampling variability, and plots such as Histograms, Hanging Rootograms,
and Q-Q Plots are introduced and illustrated using R.
Limit Theorems for Sampling Distributions discussed in Chapter 7 are the objective
of Chapter 8. We begin with a brief description of multivariate joint densities and Order
statistics. The t-distribution and χ² (chi-square) distributions are introduced in this chapter.
The sample mean and variance from a normal population are discussed in relation to t and
χ². We prove the Weak Law of Large Numbers and the Central Limit Theorem for random
variables possessing a moment generating function. We do state a more general version of
the Central Limit Theorem and also the Strong Law of Large Numbers, providing a proof
of the latter in the Appendix. Along with R code we discuss the continuity correction
and applications of the Central Limit Theorem via examples. We then discuss the delta
method and its application to variance stabilising transformations. The chapter concludes
with a derivation of the Central Limit Theorem for the median.
We end the book with two chapters focused solely on results and techniques from
statistics. In Chapter 9 we discuss Estimation and Confidence Intervals. We briefly
describe Method of Moments Estimators and Maximum Likelihood Estimators. We then
introduce Confidence Intervals using the Pivotal Quantity approach. For cases when
natural pivotal quantities do not exist, we illustrate the use of the Central Limit Theorem
to obtain approximate confidence intervals. Finally we derive confidence intervals based on
the sample median and compare their performance with intervals based on the sample mean
via simulations.
In Chapter 10 we explore a non-traditional approach to Hypothesis Testing based on
p-values rather than pre-determined significance levels. We first formulate the multinomial
goodness of fit and independence problems in the familiar parametric set up. After that
we describe an intuitive approach to derive suitable test statistics for various hypothesis
testing examples. We then proceed to outline a likelihood ratio based approach to derive
test statistics systematically. We conclude this chapter with a discussion of the goodness
of fit and independence tests.
R code for most of the computations is given in the book itself, and the reader
should be able to reproduce and extend it easily. Code for the figures is not given in the
book, but is available at a website accompanying the book.
The Appendix includes some relevant mathematical details not covered in the main
matter of the book. The topics included are the Jacobian method for computing distribution
of transformations of random variables and the Strong Law of Large Numbers.
Most of the problems in probability and statistics involve determining how likely it is that
certain things will occur. Before we can talk about what is likely or unlikely, we need to
know what is possible. In other words, we need some framework in which to discuss what
sorts of things have the potential to occur. To that end, we begin by introducing the basic
concepts of “sample space”, “experiment”, “outcome”, and “event”. We also define what
we mean by a “probability” and provide some examples to demonstrate the consequences
of the definition.
1.1.1 Definitions
Definition 1.1.1. (Sample Space) A sample space S is a set. The elements of the
set S will be called “outcomes” and should be viewed as a listing of all possibilities
that might occur. We will call the process of actually selecting one of these outcomes
an “experiment”.
For its familiarity and simplicity we will frequently use the example of rolling a die. In
that case our sample space would be S = {1, 2, 3, 4, 5, 6}, a complete listing of all of the
outcomes on the die. Performing an experiment in this case would mean rolling the die and
recording the number that it shows. However, sample space outcomes need not be numeric.
If we are flipping a coin (another simple and familiar example) experiments would result in
one of two outcomes and the appropriate sample space would be S = {Heads, Tails}.
For a more interesting example, if we are discussing which country will win the next
World Cup, outcomes might include Brazil, Spain, Canada, and Thailand. Here the set
S might be all the world’s countries. An experiment in this case requires waiting for the
next World Cup and identifying the country which wins the championship game. Though
we have not yet explained how probability relates to a sample space, soccer fans amongst
our readers may regard this example as a foreshadowing that not all outcomes of a sample
space will necessarily have the same associated probabilities.
Definition 1.1.2. (Event) Given a sample space S, an “event” is any subset E ⊂ S.
This definition will allow us to talk about how likely it is that a range of possible outcomes
might occur. Continuing our examples above we might want to talk about the probability
that a die rolls a number larger than two. This would involve the event {3, 4, 5, 6} as a
subset of {1, 2, 3, 4, 5, 6}. In the soccer example we might ask whether the World Cup will
be won by a South American country. This subset of our list of all the world’s nations
would contain Brazil as an element, but not Spain.
It is worth noting that the definition of “event” includes both S, the sample space itself,
and ∅, the empty set, as legitimate examples. As we introduce more complicated examples
we will see that it is not always necessary (or even possible) to regard every single subset
of a sample space as a legitimate event, but since the reasons for that may be distracting
at this point we will use the above as a temporary definition of “event” and refine the
definition when it becomes necessary.
To each event, we want to assign a chance (or “probability”) which will be a number
between 0 and 1. So if the probability of an event E is 0.72, we interpret that as saying,
“When an experiment is performed, it has a 72% chance of resulting in an outcome contained
in the event E”. Probabilities will satisfy two axioms stated and explained below. This
formal definition is due to Andrey Kolmogorov (1903-1987).
Definition 1.1.3. (Probability Space Axioms) Let S be a sample space and let
F be the collection of all events.
A “probability” is a function P : F → [0, 1] such that
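(1) P (S ) = 1; and
(2) if E1 , E2 , . . . are a countable collection of disjoint events, then
\[ P\Big(\bigcup_{j=1}^{\infty} E_j\Big) = \sum_{j=1}^{\infty} P(E_j). \tag{1.1.1} \]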
The first axiom is relatively straightforward. It simply reiterates that S did, indeed,
include all possibilities, and therefore there is a 100% chance that an experiment will result
in some outcome included in S. The second axiom is not as complicated as it looks. It
simply says that probabilities add when combining a countable number of disjoint events.
It is implicit that the series on the right hand side of equation (1.1.1) converges. Further,
(1.1.1) also holds when combining a finite number of disjoint events (see Theorem 1.1.4
below).
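As a small illustration of the two axioms, the following Python sketch checks them on a finite sample space. The fair die, the outcome probabilities of 1/6, and the names `outcome_prob` and `P` are assumptions made for this illustration only; they are not part of the definition.

```python
from fractions import Fraction

# Assumed example: a fair six-sided die, so each outcome gets probability 1/6.
S = frozenset(range(1, 7))
outcome_prob = {omega: Fraction(1, 6) for omega in S}

def P(event):
    """Probability of an event (a subset of S), as an exact fraction."""
    return sum((outcome_prob[omega] for omega in event), Fraction(0))

# Axiom (1): the whole sample space has probability 1.
total = P(S)

# Axiom (2), finite case: probabilities of pairwise disjoint events add.
E1, E2, E3 = {1, 2}, {3}, {5, 6}
lhs = P(E1 | E2 | E3)            # probability of the union
rhs = P(E1) + P(E2) + P(E3)      # sum of the individual probabilities

print(total)     # 1
print(lhs, rhs)  # 5/6 5/6
```

Exact fractions are used rather than floating point numbers so that the two sides of the additivity check agree exactly.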
Returning to our World Cup example, suppose A is a list of all North American
countries and E is a list of all European countries. If it happens that P (A) = 0.05 and
P (E ) = 0.57 then P (A ∪ E ) = 0.62. In other words, if there is a 5% chance the next
World Cup will be won by a North American nation and a 57% chance that it will be
won by a European nation, then there is a 62% chance that it will be won by a nation
from either Europe or North America. The disjointness of these events is obvious as (if we
discount island territories) there is no country that is in both North America and Europe.
The requirement of axiom two that the collection of events be countable is important.
We shall see shortly that, as a consequence of axiom two, disjoint additivity also applies
to any finite collection of events. It does not apply to uncountably infinite collections
of events, though that fact will not be relevant until later in the text when we discuss
continuous probability spaces.
There are some immediate consequences of these probability axioms which we will state
and prove before returning to some simple examples.
(1) P (∅) = 0;
Proof. (1) - The empty set is disjoint from itself, so ∅, ∅, . . . is a countable disjoint
collection of events. From the second axiom, $P\big(\bigcup_{j=1}^{\infty} E_j\big) = \sum_{j=1}^{\infty} P(E_j)$. When this is
applied to the collection of empty sets we have $P(\emptyset) = \sum_{j=1}^{\infty} P(\emptyset)$. If P (∅) had any
non-zero value, the right hand side of this equation would be a divergent series while the
left hand side would be a number. Therefore, P (∅) = 0.
Proof of (2) - To use axiom two we need to make this a countable collection of events.
We may do so while preserving disjointness by including copies of the empty set. Define
Ej = ∅ for j > n. Then E1 , E2 , . . . , En , ∅, ∅, . . . is a countable collection of disjoint
sets and therefore $P\big(\bigcup_{j=1}^{\infty} E_j\big) = \sum_{j=1}^{\infty} P(E_j)$. However, the empty sets add nothing to the
union and so $\bigcup_{j=1}^{\infty} E_j = \bigcup_{j=1}^{n} E_j$. Likewise since we have shown P (∅) = 0 these sets also add
nothing to the sum, so $\sum_{j=1}^{\infty} P(E_j) = \sum_{j=1}^{n} P(E_j)$.
Proof of (3) - If E ⊂ F , then E and F \ E are disjoint events with a union equal to F .
Using (2) above gives P (F ) = P (E ∪ (F \ E )) = P (E ) + P (F \ E ).
Since probabilities are assumed to be non-negative, it follows that P (F ) ≥ P (E ).
Proof of (4) - As with the proof of (3) above, E and F \ E are disjoint events with
E ∪ (F \ E ) = F . Therefore P (F ) = P (E ) + P (F \ E ) from which we get the result.
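The monotonicity and difference properties established in the proofs of (3) and (4) can be checked numerically. The Python sketch below assumes, purely for illustration, a fair die with P(E) = |E|/6; the events E and F are hypothetical choices with E ⊂ F.

```python
from fractions import Fraction

# Assumed setting: a fair die, so P(event) = |event| / 6.
S = set(range(1, 7))

def P(event):
    """Probability of an event under the equally likely assumption."""
    return Fraction(len(event), len(S))

E = {3, 4}
F = {3, 4, 5, 6}
assert E <= F  # E is a subset of F

monotone = P(E) <= P(F)        # property (3): P(E) <= P(F)
difference = P(F - E)          # property (4), computed directly on F \ E
by_subtraction = P(F) - P(E)   # property (4), computed as P(F) - P(E)

print(monotone, difference, by_subtraction)  # True 1/3 1/3
```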
Example 1.1.5. A coin flip can come up either “heads” or “tails”, so S = {heads, tails}.
A coin is considered “fair” if each of these outcomes is equally likely. Which axioms or
properties above can be used to reach the (obvious) conclusion that both outcomes have a
50% chance of occurring?
Each outcome can also be regarded as an event. So E = {heads} and F = {tails}
are two disjoint events. If the coin is fair, each of these events is equally likely, so
P (E ) = P (F ) = p for some value of p. However, using the second axiom, 1 = P (S ) =
P (E ∪ F ) = P (E ) + P (F ) = 2p. Therefore, p = 0.5, or in other words each of the two
possibilities has a 50% chance of occurring on any flip of a fair coin. ■
In the examples above we have explicitly described the sample space S, but in many cases
this is neither necessary nor desirable. We may still use the probability space axioms and
their consequences when we know the probabilities of certain events even if the sample
space is not explicitly described.
Example 1.1.6. A certain sea-side town has a small fishing industry. The quantity of fish
caught by the town in a given year is variable, but we know there is a 35% chance that the
town’s fleet will catch over 400 tons of fish, but only a 10% chance that they will catch
over 500 tons of fish. How likely is it they will catch between 400 and 500 tons of fish?
The answer to this may be obvious without resorting to sets, but we use it as a first
example to illustrate the proper use of events. Note, though, that we will not explicitly
describe the sample space S.
There are two relevant events described in the problem above. We have F representing
“the town’s fleet will catch over 400 tons of fish” and E representing “the town’s fleet will
catch over 500 tons of fish”. We are given that P (E ) = 0.1 while P (F ) = 0.35.
Of course E ⊂ F since if over 500 tons of fish are caught, the actual tonnage will be
over 400 as well. The event that the town’s fleet will catch between 400 and 500 tons of
fish is F \ E since E hasn’t occurred, but F has. So using property (4) from above we
have P (F \ E ) = P (F ) − P (E ) = 0.35 − 0.1 = 0.25. In other words there is a 25% chance
that between 400 and 500 tons of fish will be caught. ■
Example 1.1.7. Suppose we know there is a 60% chance that it will rain tomorrow and a
70% chance the high temperature will be above 30◦ C. Suppose we also know that there is
a 40% chance that the high temperature will be above 30◦ C and it will rain. How likely is
it tomorrow will be a dry day that does not go above 30◦ C?
The answer to this question may not be so obvious, but our first step is still to view
the pieces of information in terms of events and probabilities. We have one event E which
represents “It will rain tomorrow” and another F which represents “The high will be
above 30◦ C tomorrow”. Our given probabilities tell us P (E ) = 0.6, P (F ) = 0.7, and
P (E ∩ F ) = 0.4. We are trying to determine P (E c ∩ F c ). We can do so using properties
(5) and (6) proven above, together with the set-algebraic fact that E c ∩ F c = (E ∪ F )c .
From (5) we know P (E ∪ F ) = P (E ) + P (F ) − P (E ∩ F ) = 0.7 + 0.6 − 0.4 = 0.9.
(This is the probability that it either will rain or be above 30◦ C).
Then from (6) and the set-algebraic fact, P (E c ∩ F c ) = P ((E ∪ F )c ) = 1 − P (E ∪ F ) =
1 − 0.9 = 0.1. So there is a 10% chance tomorrow will be a dry day that does not reach
30◦ C. ■
Figure 1.1: A Venn diagram that describes the probabilities from Example 1.1.7 (sets labeled “Rain” and “Temperature above 30℃”).
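The arithmetic of Example 1.1.7 can be reproduced in a few lines of Python. This is only a sketch: the variable names are hypothetical, and the four-region breakdown is an added illustration of what a Venn diagram for this example would contain.

```python
from fractions import Fraction

# Probabilities given in Example 1.1.7.
p_rain = Fraction(6, 10)          # P(E)
p_hot = Fraction(7, 10)           # P(F)
p_rain_and_hot = Fraction(4, 10)  # P(E ∩ F)

# P(E ∪ F) = P(E) + P(F) - P(E ∩ F)
p_rain_or_hot = p_rain + p_hot - p_rain_and_hot

# P(E^c ∩ F^c) = P((E ∪ F)^c) = 1 - P(E ∪ F)
p_dry_and_cool = 1 - p_rain_or_hot

# The four disjoint regions of the Venn diagram.
regions = {
    "rain only": p_rain - p_rain_and_hot,
    "hot only": p_hot - p_rain_and_hot,
    "rain and hot": p_rain_and_hot,
    "neither": p_dry_and_cool,
}

print(p_rain_or_hot, p_dry_and_cool)  # 9/10 1/10
print(sum(regions.values()))          # 1
```

The final line confirms that the four disjoint regions account for the whole sample space.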
Exercises
Ex. 1.1.1. Consider the sample space Ω = {a, b, c, d, e}. Given that {a, b, e} and {b, c}
are both events, what other subsets of Ω must be events due to the requirement that the
collection of events is closed under taking unions, intersections, and complements?
Ex. 1.1.2. There are two positions - Cashier and Waiter - open at the local restaurant.
There are two male applicants (David and Rajesh) and two female applicants (Veronica and
Megha). The Cashier position is chosen by selecting one of the four applicants at random.
The Waiter position is then chosen by selecting at random one of the three remaining
applicants.
(b) List the elements of the event A that the position of Cashier is filled by a female
applicant.
(c) List the elements of the event B that exactly one of the two positions is filled by a
female applicant.
(d) List the elements of the event C that neither position was filled by a female applicant.
(e) Sketch a Venn diagram to show the relationship among the events A, B, C and S.
Ex. 1.1.3. A jar contains a large collection of red, green, and white marbles. Marbles are
drawn from the jar one at a time. The color of the marble is recorded and it is put back in
the jar before the next draw. Let Rn denote the event that the n-th draw is a red marble
and let Gn denote the event that the n-th draw is a green marble. For example, R1 ∩ G2
would denote the event that the first marble was red and the second was green. In terms of
these events (and appropriate set-theoretic symbols – union, intersection, and complement)
find expressions for the events in parts (a), (b), and (c) below.
(a) The first marble drawn is white. (We might call this W1 , but show that it can be
written in terms of the Rn and Gn sets described above).
(b) The first marble drawn is green and the second marble drawn is not white.
(d) Let E = R1 ∪ G2 and let F = R1c ∩ R2 . Are E and F disjoint? Why or why not?
Ex. 1.1.4. Suppose there are only thirteen teams with a non-zero chance of winning the
next World Cup. Suppose those teams are Spain (with a 14% chance), the Netherlands
(with an 11% chance), Germany (with an 11% chance), Italy (with a 10% chance), Brazil
(with a 10% chance), England (with a 9% chance), Argentina (with a 9% chance), Russia
(with a 7% chance), France (with a 6% chance), Turkey (with a 4% chance), Paraguay
(with a 4% chance), Croatia (with a 4% chance) and Portugal (with a 1% chance).
(a) What is the probability that the next World Cup will be won by a South American
country?
(b) What is the probability that the next World Cup will be won by a country that is
not from South America? (Think of two ways to do this problem - one directly and
one using part (5) of Theorem 1.1.4. Which do you prefer and why?)
Ex. 1.1.5. If A and B are disjoint events and P (A) = 0.3 and P (B ) = 0.6, find P (A ∪ B ),
P (Ac ) and P (Ac ∩ B ).
Ex. 1.1.6. Suppose E and F are events in a sample space S. Suppose that P (E ) = 0.7
and P (F ) = 0.5.
Ex. 1.1.7. A biologist is modeling the size of a frog population in a series of ponds. She is
concerned with both the number of egg masses laid by the frogs during breeding season
and the annual precipitation into the ponds. She knows that in a given year there is an
86% chance that there will be over 150 egg masses deposited by the frogs (event E) and
that there is a 64% chance that the annual precipitation will be over 17 inches (event F ).
(a) In terms of E and F , what is the event “there will be over 150 egg masses and an
annual precipitation of over 17 inches”?
(b) In terms of E and F , what is the event “there will be 150 or fewer egg masses and
the annual precipitation will be over 17 inches”?
(c) Suppose the probability of the event from (a) is 59%. What is the probability of the
event from (b)?
P (E ∪ F ) = P (E ) + P (F ) − P (E ∩ F ).
Versions of this rule for three or more sets are explored below.
P (A) + P (B ) + P (C ) − P (A ∩ B ) − P (A ∩ C ) − P (B ∩ C ) + P (A ∩ B ∩ C )
(b) Use part (a) to answer the following question. Suppose that in a certain United
States city 49.3% of the population is male, 11.6% of the population is sixty-five
years of age or older, and 13.8% of the population is Hispanic. Further, suppose 5.1%
is both male and at least sixty-five, 1.8% is both male and Hispanic, and 5.9% is
Hispanic and at least sixty-five. Finally, suppose that 0.7% of the population consists
of Hispanic men that are at least sixty-five years old. What percentage of people in
this city consists of non-Hispanic women younger than sixty-five years old?
(c) Find a four-set version of the equation. That is, write P (A ∪ B ∪ C ∪ D ) in terms of
probabilities of intersections of A, B, C, and D.
Ex. 1.1.9. A and B are two events. P(A)=0.4, P(B)=0.3, P(A∪B)=0.6. Find the following
probabilities:
(a) P (A ∩ B );
Ex. 1.1.10. In the next subsection we begin to look at probability spaces where each of the
outcomes is equally likely. This problem will help develop some early intuition for such
problems.
(a) Suppose we roll a die and so S = {1, 2, 3, 4, 5, 6}. Each outcome separately
{1}, {2}, {3}, {4}, {5}, {6} is an event. Suppose each of these events is equally likely.
What must the probability of each event be? What axioms or properties are you
using to come to your conclusion?
(b) With the same assumptions as in part (a), how would you determine the probability
of an event like E = {1, 3, 4, 6}? What axioms or properties are you using to come
to your conclusion?
(c) If S = {1, 2, 3, ..., n} and each single-outcome event is equally likely, what would be
the probability of each of these events?
(d) Suppose E ⊂ S is an event in the sample space from part (c). Explain how you could
determine P (E ).
(b) Show by example that the equality doesn’t always hold if B is not a subset of A.
\[ P\Big(\bigcup_{n=1}^{\infty} A_n\Big) = \lim_{n\to\infty} P(A_n) \]
\[ P\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \lim_{n\to\infty} P(A_n) \]
When a sample space S consists of only a countable collection of outcomes, describing the
probability of each individual outcome is sufficient to describe the probability of all events.
This is because if A ⊂ S we may simply compute
\[ P(A) = P\Big(\bigcup_{\omega \in A} \{\omega\}\Big) = \sum_{\omega \in A} P(\{\omega\}). \]
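The formula above says that, on a countable sample space, the probability of any event is obtained by summing the probabilities of its outcomes. A minimal Python sketch follows; the loaded die and its outcome probabilities are invented for illustration and do not come from the text.

```python
from fractions import Fraction

# Hypothetical loaded die: these outcome probabilities are assumptions
# chosen for illustration; they only need to be non-negative and sum to 1.
outcome_prob = {
    1: Fraction(1, 12), 2: Fraction(1, 12),
    3: Fraction(1, 6),  4: Fraction(1, 6),
    5: Fraction(1, 4),  6: Fraction(1, 4),
}
assert sum(outcome_prob.values()) == 1

def P(event):
    """P(A) = sum of P({omega}) over all outcomes omega in A."""
    return sum((outcome_prob[omega] for omega in event), Fraction(0))

even = {2, 4, 6}
at_least_five = {5, 6}

print(P(even))           # 1/2
print(P(at_least_five))  # 1/2
```

Specifying six numbers is enough to determine the probability of all 64 subsets of the sample space.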
Proof. As E ⊂ S, we know that |E| ≤ |S| and so 0 ≤ P (E ) ≤ 1. So, we must prove that
P satisfies the two probability axioms.
The first axiom is satisfied because $P(S) = \frac{|S|}{|S|} = 1$. To verify the second axiom,
suppose E1 , E2 , ... is a countable collection of disjoint events. As S is finite, only finitely
many of these Ej can be non-empty, so we may list the non-empty events as E1 , E2 , . . . , En .
For j > n we know Ej = ∅ and so P (Ej ) = 0 by definition. As the events are disjoint,
to find the number of elements in their union we simply add the elements of each event
separately. That is, |E1 ∪ E2 ∪ · · · ∪ En | = |E1 | + |E2 | + · · · + |En |, and so
\[ P\Big(\bigcup_{j=1}^{\infty} E_j\Big) = \frac{\big|\bigcup_{j=1}^{n} E_j\big|}{|S|} = \frac{\sum_{j=1}^{n} |E_j|}{|S|} = \sum_{j=1}^{n} \frac{|E_j|}{|S|} = \sum_{j=1}^{n} P(E_j) = \sum_{j=1}^{\infty} P(E_j). \]
Finally, let ω ∈ S be any single outcome and let E = {ω}. Then $P(E) = \frac{1}{|S|}$, so every
outcome in S is equally likely. ■
Example 1.2.2. A deck of twenty cards labeled 1, 2, 3, . . . , 20 is shuffled and a card selected
at random. What is the probability that the number on the card is a multiple of six?
The description of the scenario suggests that each of the twenty cards is as likely to
be chosen as any other. In this case S = {1, 2, 3, ..., 20} while E = {6, 12, 18}. Therefore,
$P(E) = \frac{|E|}{|S|} = \frac{3}{20} = 0.15$. There is a 15% chance that the card will be a multiple of six. ■
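The count in Example 1.2.2 can be confirmed by listing the outcomes directly. The Python sketch below assumes, as the example does, that all twenty cards are equally likely.

```python
from fractions import Fraction

S = range(1, 21)                            # cards labeled 1, ..., 20
E = [card for card in S if card % 6 == 0]   # multiples of six

p = Fraction(len(E), len(S))                # P(E) = |E| / |S|

print(E, p, float(p))  # [6, 12, 18] 3/20 0.15
```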
Example 1.2.3. Two dice are rolled. How likely is it that their sum will equal eight?
Since we are looking at a sum of dice, it might be tempting to regard the sample space
as S = {2, 3, 4, ..., 11, 12}, the collection of possible sums. While this is a possible approach
(and one we will return to later), it is not the case that all of these outcomes are equally
likely. Instead we can view an experiment as tossing a first die and a second die and
recording the pair of numbers that occur on each of the dice. Each of these pairs is as
likely as any other to occur. So
\[ S = \left\{ \begin{array}{cccccc}
(1,1), & (1,2), & (1,3), & (1,4), & (1,5), & (1,6) \\
(2,1), & (2,2), & (2,3), & (2,4), & (2,5), & (2,6) \\
(3,1), & (3,2), & (3,3), & (3,4), & (3,5), & (3,6) \\
(4,1), & (4,2), & (4,3), & (4,4), & (4,5), & (4,6) \\
(5,1), & (5,2), & (5,3), & (5,4), & (5,5), & (5,6) \\
(6,1), & (6,2), & (6,3), & (6,4), & (6,5), & (6,6)
\end{array} \right\} \]
and |S| = 6 × 6 = 36. The event that the sum of the dice is an eight is
E = {(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)} .
Therefore, $P(E) = \frac{|E|}{|S|} = \frac{5}{36}$. ■
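The sample space of Example 1.2.3 is small enough to enumerate. The following Python sketch builds all 36 ordered pairs, picks out those summing to eight, and also tallies how many pairs produce each sum, which illustrates why the sums 2, 3, . . . , 12 are not equally likely. The variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All ordered pairs (first die, second die); each pair is equally likely.
S = list(product(range(1, 7), repeat=2))

# The event that the two dice sum to eight.
E = [pair for pair in S if sum(pair) == 8]
p = Fraction(len(E), len(S))

# Number of pairs giving each possible sum.
pairs_per_sum = Counter(sum(pair) for pair in S)

print(len(S), E, p)
# 36 [(2, 6), (3, 5), (4, 4), (5, 3), (6, 2)] 5/36
print(pairs_per_sum[2], pairs_per_sum[7], pairs_per_sum[8])
# 1 6 5
```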
Example 1.2.4. A seven letter code is selected at random with every code as likely to
be selected as any other code (so AQRVTAS and CRXAOLZ would be two possibilities).
How likely is it that such a code has at least one letter used more than once? (This would
happen with the first code above with a repeated A - but not with the second).
As with the examples above, the solution amounts to counting numbers of outcomes.
However, unlike the examples above the numbers involved here are quite large and we will
need to use some combinatorics to find the solution. The sample space S consists of all
seven-letter codes from AAAAAAA to ZZZZZZZ. Each of the seven spots in the code
could be any of twenty-six letters, so $|S| = 26^7 = 8{,}031{,}810{,}176$. If E is the event for
which there is at least one letter used more than once, it is easier to count E c , the event
where no letter is repeated. Since in this case each new letter rules out a possibility for the
next letter there are 26 × 25 × 24 × 23 × 22 × 21 × 20 = 3,315,312,000 such possibilities.
This lets us compute $P(E^c) = \frac{3{,}315{,}312{,}000}{8{,}031{,}810{,}176}$ from which we find
$P(E) = 1 - P(E^c) = \frac{4{,}716{,}498{,}176}{8{,}031{,}810{,}176} \approx 0.587$. That is, there is about a 58.7% chance that such a code will have a
repeated letter. ■
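The large counts in Example 1.2.4 are easy to verify by machine. This Python sketch uses `math.perm` from the standard library (available from Python 3.8) to count the codes with no repeated letter; the variable names are illustrative.

```python
import math
from fractions import Fraction

total_codes = 26 ** 7              # |S|: any of 26 letters in each of 7 spots
no_repeat = math.perm(26, 7)       # |E^c|: 26 * 25 * 24 * 23 * 22 * 21 * 20

p_no_repeat = Fraction(no_repeat, total_codes)
p_repeat = 1 - p_no_repeat         # P(E) = 1 - P(E^c)

print(total_codes)                 # 8031810176
print(no_repeat)                   # 3315312000
print(round(float(p_repeat), 3))   # 0.587
```

Counting the complement avoids having to classify codes by how many letters repeat and how often.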
Example 1.2.5. A group of twelve people includes Grant and Dilip. A group of three
people is to be randomly selected from the twelve. How likely is it that this three-person
group will include Grant, but not Dilip?
Here, S is the collection of all three-person groups, each of which is as likely to be
selected as any other. The number of ways of selecting a three-person group from a pool of
twelve is $|S| = \binom{12}{3} = 220$. The event E consists of those three-person groups that include
Grant, but not Dilip. Such groups must include two people other than Grant and there
are ten people remaining from which to select the two, so $|E| = \binom{10}{2} = 45$. Therefore,
$P(E) = \frac{45}{220} = \frac{9}{44}$. ■
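The counting argument of Example 1.2.5 can be cross-checked by brute force. In the Python sketch below the ten other group members are given placeholder names, which are an assumption made only so that the groups can be listed.

```python
import math
from fractions import Fraction
from itertools import combinations

# Twelve people: Grant, Dilip, and ten others with placeholder names.
people = ["Grant", "Dilip"] + [f"Person{i}" for i in range(3, 13)]

# Every unordered three-person group is an equally likely outcome.
groups = list(combinations(people, 3))

# Groups that include Grant but not Dilip.
favorable = [g for g in groups if "Grant" in g and "Dilip" not in g]

p = Fraction(len(favorable), len(groups))

print(len(groups), math.comb(12, 3))     # 220 220
print(len(favorable), math.comb(10, 2))  # 45 45
print(p)                                 # 9/44
```

The enumeration agrees with the binomial coefficients used in the example.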
Exercises
Ex. 1.2.1. A day is selected at random from a given week with each day as likely to be
selected as any other.
(b) Let E be the event that the selected day is a Saturday or a Sunday. What is the
probability of E?
Ex. 1.2.2. A box contains 500 envelopes, of which 50 contain Rs 100 in cash, 100 contain
Rs 50 in cash and 350 contain Rs 10. An envelope can be purchased at Rs 25 from the
owner, who will pick an envelope at random and give it to you. Write down the sample
space for the net money gained by you. If each envelope is as likely to be selected as any
other envelope, what is the probability that the first envelope purchased contains less than
Rs 100?
Ex. 1.2.3. Three dice are tossed.
(a) Describe (in words) the sample space S and give an example of an object in S.
(c) Let E be the event that the first two dice both come up “1”. What is the size of E?
What is the probability of E?
(d) Let G be the event that the three dice show three different numbers. What is the
size of G? What is the probability of G?
(e) Let F be the event that the third die is larger than the sum of the first two. What
is the size of F ? What is the probability of F ?
Ex. 1.2.4. Suppose that each of three women at a party throws her hat into the center of
the room. The hats are first mixed up and then each one randomly selects a hat. Describe
the probability space for the possible selection of hats. If all of these selections are equally
likely, what is the probability that none of the three women selects her own hat?
Ex. 1.2.5. A group of ten people includes Sona and Adam. A group of five people is to be
randomly selected from the ten. How likely is it that this group of five people will include
neither Sona nor Adam?
Ex. 1.2.6. There are eight students with two females and six males. They are split into
two groups A and B, of four each.
(c) What is the probability that there is one female in each group?
Ex. 1.2.7. Sheela has lost her key to her room. The security officer gives her 50 keys and
tells her that one of them will open her room. She decides to try each key successively
and notes down the number of the attempt at which the room opens. Describe the sample
space for this experiment. Do you think it is realistic that each of these outcomes is equally
likely? Why or why not?
Ex. 1.2.8. Suppose that n balls, of which k are red, are arranged at random in a line.
What is the probability that all the red balls are next to each other?
Ex. 1.2.9. Consider a deck of 50 cards. Each card has one of 5 colors (black, blue, green,
red, and yellow), and is printed with a number (1,2,3,4,5,6,7,8,9, or 10) so that each of the
50 color/number combinations is represented exactly once. A hand is produced by dealing
out five different cards from the deck. The order in which the cards were dealt does not
matter.
(b) How many hands consist of cards of identical color? What is the probability of being
dealt such a hand?
(c) What is the probability of being dealt a hand that contains exactly three cards with
one number, and two cards with a different number?
(d) What is the probability of being dealt a hand that contains two cards with one
number, two cards with a different number, and one card of a third number?
Ex. 1.2.10. Suppose you are in charge of quality control for a light bulb manufacturing
company. Suppose that in the process of producing 100 light bulbs, either all 100 bulbs
will work properly, or through some manufacturing error twenty of the 100 will not work.
Suppose your quality control procedure is to randomly select ten bulbs from a 100 bulb
batch and test them to see if they work properly. How likely is this procedure to detect if
a batch has bad bulbs in it?
Ex. 1.2.11. A fair die is rolled five times. What is the probability of getting at least two
5’s and at least two 6’s among the five rolls?
Ex. 1.2.12. (The “Birthday Problem”) For a group of N people, if their birthdays were
listed one-by-one, there are $365^N$ different ways that such a list might read (if we ignore
February 29 as a possibility). Suppose each of those possible lists is as likely as any other.
(a) For a group of two people, let E be the event that they have the same birthday.
What is the size of E? What is the probability of E?
(b) For a group of three people, let F be the event that at least two of the three have
the same birthday. What is the size of F ? What is the probability of F ? (Hint: It
is easier to find the size of F c than it is to find the size of F ).
(c) For a group of four people, how likely is it that at least two of the four have the same
birthday?
(d) How large a group of people would you need to have before it becomes more likely
than not that at least two of them share a birthday?
(a) How likely is it that the 100 tosses will produce exactly fifty heads and fifty tails?
(b) How likely is it that the number of heads will be between 50 and 55 (inclusive)?
[Figure: probability of a common birthday (vertical axis, 0.0 to 1.0), plotted over a horizontal axis running from 0 to 50.]
Ex. 1.2.14. Suppose I have a coin that I claim is “fair” (equally likely to come up heads or
tails) and that my friend claims is weighted towards heads. Suppose I flip the coin twenty
times and find that it comes up heads on sixteen of those twenty flips. While this seems to
favor my friend’s hypothesis, it is still possible that I am correct about the coin and that
just by chance the coin happened to come up heads more often than tails on this series of
flips. Let S be the sample space of all possible sequences of flips. The size of S is then $2^{20}$,
and if I am correct about the coin being “fair”, each of these outcomes is equally likely.
(a) Let E be the event that exactly sixteen of the flips come up heads. What is the size
of E? What is the probability of E?
(b) Let F be the event that at least sixteen of the flips come up heads. What is the size
of F ? What is the probability of F ?
Note that the probability of F is the chance of getting a result as extreme as the one I
observed if I happen to be correct about the coin being fair. The larger P (F ) is, the more
reasonable seems my assumption about the coin being fair. The smaller P (F ) is, the more
that assumption looks doubtful. This is the basic idea behind the statistical concept of
“hypothesis testing” which we will revisit in Chapter 9.
Ex. 1.2.15. Suppose that r indistinguishable balls are placed in n distinguishable boxes so
that each distinguishable arrangement is equally likely. Find the probability that no box
will be empty.
Ex. 1.2.16. Suppose that 10 potato sticks are broken into two - one long and one short
piece. The 20 pieces are now arranged into 10 random pairs chosen uniformly.
(a) Find the probability that each of the pairs consists of two pieces that were originally
part of the same potato stick.
(b) Find the probability that each pair consists of a long piece and a short piece.
Ex. 1.2.17. Let S be a non-empty, countable (finite or infinite) set such that for each
ω ∈ S, $0 \le p_\omega \le 1$. Let F be the collection of all events. Suppose P : F → [0, 1] is given by
\[ P(E) = \sum_{\omega \in E} p_\omega, \]
Let B be the event that there is a head in the first toss. As above,
Now suppose we are asked to find the probability of at least two heads among
the three tosses, but we are also given the additional information that the first toss was a
head. In other words, we are asked to find the probability of A, given the information that
event B has definitely occurred. As the additional information guarantees that B is now a
list of all possible outcomes, it makes intuitive sense to view the event B as a new sample
space and then identify the subset A ∩ B = {hhh, hht, hth} of B consisting of outcomes
for which there are at least two heads. We conclude that the probability of at least two
heads in three tosses given that the first toss was a head is
\[ \frac{|A \cap B|}{|B|} = \frac{3}{4}. \] ■
This is a legitimate way to view the problem and it leads to the correct solution. However,
this method has one very serious drawback—it requires us to change both our sample
space and our probability function in order to carry out the computation. It would be
preferable to have a method that allows us to work within the original framework of the
sample space S and to talk about the “conditional probability” of A given that the result
of the experiment will be an outcome in B. This is denoted as P (A | B ) and is read as
“the (conditional) probability of A given B.”
Suppose S is a finite set of equally likely outcomes from a given experiment. Then for
any two non-empty events A and B, the conditional probability of A given B is given by
\[ \frac{|A \cap B|}{|B|} = \frac{|A \cap B| / |S|}{|B| / |S|} = \frac{P(A \cap B)}{P(B)}. \]
This leads us to a formal definition of conditional probability for general sample spaces.
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)}. \]
then
\[ A \cap B = \{(4, 6), (6, 4)\}. \]
Hence
In many applications, the conditional probabilities are implicitly defined within the context
of the problem. In such cases, it is useful to have a method for computing non-conditional
probabilities from the given conditional ones. Two such methods are given by the next
results and the subsequent examples.
Example 1.3.4. An economic model predicts that if interest rates rise, then there is a 60%
chance that unemployment will increase, but that if interest rates do not rise, then there is
only a 30% chance that unemployment will increase. If the economist believes there is a
40% chance that interest rates will rise, what should she calculate is the probability that
unemployment will increase?
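Let B be the event that interest rates rise and A be the event that unemployment
increases. We know the values $P(A \mid B) = 0.6$, $P(A \mid B^c) = 0.3$, and $P(B) = 0.4$, so that $P(B^c) = 0.6$.
Since $A \cap B$ and $A \cap B^c$ are disjoint events whose union is A,
\begin{align*}
P(A) &= P((A \cap B) \cup (A \cap B^c)) \\
&= P(A \cap B) + P(A \cap B^c) \\
&= P(A \mid B) P(B) + P(A \mid B^c) P(B^c) \\
&= 0.6 \times 0.4 + 0.3 \times 0.6 = 0.42.
\end{align*}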
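The arithmetic of Example 1.3.4 can be reproduced in a few lines. This is a small sketch in R (introduced in Section 1.5); the variable names are illustrative and not part of the original example.

```r
p_B          <- 0.4   # P(B): interest rates rise
p_A_given_B  <- 0.6   # P(A | B): unemployment increases if rates rise
p_A_given_Bc <- 0.3   # P(A | B^c): unemployment increases if rates do not rise

# Split A according to whether B occurs, then add the two pieces
p_A <- p_A_given_B * p_B + p_A_given_Bc * (1 - p_B)
p_A
```

The value agrees with the 0.42 computed by hand.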
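\[ \bigcup_{i=1}^{n} (A \cap B_i) = A \cap \left( \bigcup_{i=1}^{n} B_i \right) = A. \]
So,
\begin{align*}
P(A) &= P\left( \bigcup_{i=1}^{n} (A \cap B_i) \right) \\
&= \sum_{i=1}^{n} P(A \cap B_i) \\
&= \sum_{i=1}^{n} P(A \mid B_i) P(B_i).
\end{align*}
■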
A nearly identical proof holds when there are only countably many Bi (see Exercise 1.3.11).
Example 1.3.6. Suppose we have coloured balls distributed in three boxes in quantities as
given by the table below:
A box is selected at random. From that box a ball is selected at random. How likely is it
that a red ball is drawn?
Let B1 , B2 , and B3 be the events that Box 1, 2, or 3 is selected, respectively. Note
that these events are disjoint and cover all possibilities in the sample space. Let R be the
event that the selected ball is red. Then by Theorem 1.3.5,
\begin{align*}
P(R) &= P(R \mid B_1) P(B_1) + P(R \mid B_2) P(B_2) + P(R \mid B_3) P(B_3) \\
&= \frac{4}{12} \cdot \frac{1}{3} + \frac{3}{8} \cdot \frac{1}{3} + \frac{3}{10} \cdot \frac{1}{3} = \frac{121}{360}.
\end{align*}
■
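When the partition has several pieces, the sum in Theorem 1.3.5 is conveniently written with vectors. The sketch below uses R (introduced in Section 1.5) with the conditional probabilities from Example 1.3.6; the names prior and p_red_given_box are assumptions made for this illustration.

```r
# P(B1), P(B2), P(B3): each box is equally likely to be selected
prior <- c(1/3, 1/3, 1/3)

# P(R | B1), P(R | B2), P(R | B3), as used in the example
p_red_given_box <- c(4/12, 3/8, 3/10)

# Multiply element by element, then add up the terms
p_R <- sum(p_red_given_box * prior)
p_R
```

The result should match 121/360, which is approximately 0.336.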
Example 1.3.7. (Polya’s Urn Scheme) Suppose there is an urn that contains r red balls
and b black balls. A ball is drawn at random and its colour noted. It is returned to the urn along with c > 0
additional balls of the same colour. The procedure is then repeated. For j = 1, 2, . . . , let $R_j$ and $B_j$
be the events that the j-th ball drawn is red and black respectively. Clearly $P(R_1) = \frac{r}{b+r}$
and $P(B_1) = \frac{b}{b+r}$. When the first ball is replaced, c new balls will be added to the urn, so
that when the second ball is drawn there will be r + b + c balls available. From this it can
easily be checked that $P(R_2 \mid R_1) = \frac{r+c}{b+r+c}$ and $P(R_2 \mid B_1) = \frac{r}{b+r+c}$. Noting that $R_1$ and
$B_1$ are disjoint and together represent the entire sample space, $P(R_2)$ can be computed as
\begin{align*}
P(R_2) &= P(R_2 \mid R_1) P(R_1) + P(R_2 \mid B_1) P(B_1) \\
&= \frac{r+c}{b+r+c} \cdot \frac{r}{b+r} + \frac{r}{b+r+c} \cdot \frac{b}{b+r} = \frac{r}{b+r}.
\end{align*}
The urn schemes were originally developed by George Polya (1887–1985). Various modifi-
cations to Polya’s urn scheme are discussed in the exercises.
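The computation of the probability that the second ball is red in Polya's urn scheme can be explored numerically. The sketch below, in R (introduced in Section 1.5), defines a hypothetical helper polya_p_red_second(); the function name and the argument name extra (standing for c) are choices made here, not notation from the text.

```r
# Hypothetical helper: probability that the second ball drawn is red,
# for an urn with r red and b black balls, where the drawn ball goes back
# together with `extra` additional balls of the same colour
polya_p_red_second <- function(r, b, extra) {
  p_R1 <- r / (r + b)                      # first ball red
  p_B1 <- b / (r + b)                      # first ball black
  p_R2_given_R1 <- (r + extra) / (r + b + extra)
  p_R2_given_B1 <- r / (r + b + extra)
  # Condition on the colour of the first ball (Theorem 1.3.5)
  p_R2_given_R1 * p_R1 + p_R2_given_B1 * p_B1
}

polya_p_red_second(3, 5, 2)   # compare with 3 / (3 + 5)
```

Trying other values of r, b, and extra suggests that the answer does not depend on the number of balls added, in line with the algebra.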
Above we have described how conditioning on an event B may be viewed as modifying
the original probability based on the additional information provided by knowing that
B has occurred. Frequently in applications, we gain information more than once in the
process of an experiment. The following theorem shows how to deal with such a situation.
\[ P\left( \bigcap_{j=1}^{n} A_j \right) = P(A_1) \cdot \prod_{j=2}^{n} P\left( A_j \,\Big|\, \bigcap_{k=1}^{j-1} A_k \right). \]
The proof of this theorem is left as Exercise 1.3.14, but we will provide a framework in
which to make sense of the equality. Usually the events A1 , . . . , An are viewed as a sequence
in time for which we know the probability of a given event provided that all of the others
before it have already occurred. Then we can calculate P (A1 ∩ A2 ∩ · · · ∩ An ) by taking the
product of the values P (A1 ), P (A2 | A1 ), P (A3 | A1 ∩ A2 ), . . . , P (An | A1 ∩ · · · ∩ An−1 ).
Example 1.3.9. A probability class has fifteen students - four seniors, eight juniors, and
three sophomores. Three different students are selected at random to present homework
problems. What is the probability the selection will be a junior, a sophomore, and a junior
again, in that order?
Let A1 be the event that the first selection is a junior. Let A2 be the event that the
second selection is a sophomore, and let A3 be the event that the third selection is a junior.
The problem asks for P (A1 ∩ A2 ∩ A3 ) which we can calculate using Theorem 1.3.8.
\begin{align*}
P(A_1 \cap A_2 \cap A_3) &= P(A_1) P(A_2 \mid A_1) P(A_3 \mid A_1 \cap A_2) \\
&= \frac{8}{15} \cdot \frac{3}{14} \cdot \frac{7}{13} = \frac{4}{65}.
\end{align*}
■
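The product in Example 1.3.9 can be evaluated directly. A minimal sketch in R (introduced in Section 1.5), with the three conditional probabilities stored in a vector whose name is chosen here for illustration:

```r
# P(A1), P(A2 | A1), P(A3 | A1 and A2) for junior, sophomore, junior
steps <- c(8/15, 3/14, 7/13)

# Theorem 1.3.8: the probability of the intersection is the product
p_sequence <- prod(steps)
p_sequence
```

The value should agree with 4/65, roughly 0.0615.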
It is often the case that we know the conditional probability of A given B, but want to
know the conditional probability of B given A instead. It is possible to calculate one
quantity from the other using a formula known as Bayes’ theorem. We introduce this with
a motivating example.
Example 1.3.10. We return to Example 1.3.6. In that example we had three boxes
containing balls given by the table below.
A box is selected at random. From the box a ball is selected at random. When we looked
at conditional probabilities we saw how to determine the probability of an event such as
{the ball drawn is red}. Now suppose we know the ball is red and want to determine
the probability of the event {the ball was drawn from box 3}. That is, if R is the event
that a red ball is chosen and if B1 , B2 , and B3 are the events that boxes 1, 2, and 3 are
selected, we want to determine the conditional probability P (B3 | R). The difficulty is
that while the conditional probabilities P (R | B1 ), P (R | B2 ), and P (R | B3 ) are easy to
determine, calculating the conditional probability with the order of the events reversed is
not immediately obvious.
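Using the definition of conditional probability we have that
\[ P(B_3 \mid R) = \frac{P(B_3 \cap R)}{P(R)}, \]
where $P(B_3 \cap R) = P(R \mid B_3) P(B_3) = \frac{3}{10} \cdot \frac{1}{3} = \frac{1}{10}$ and $P(R) = \frac{121}{360}$ from Example 1.3.6. Therefore,
\[ P(B_3 \mid R) = \frac{P(B_3 \cap R)}{P(R)} = \frac{1/10}{121/360} = \frac{36}{121} \approx 0.298. \]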
So if we know that a red ball was drawn, there is slightly less than a 30% chance that it
came from Box 3. ■
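The same reversal of conditioning can be carried out for all three boxes at once. The following R sketch (R is introduced in Section 1.5) uses the values from Example 1.3.6; the names prior, likelihood, joint, and posterior are illustrative assumptions.

```r
prior      <- c(1/3, 1/3, 1/3)      # P(B1), P(B2), P(B3)
likelihood <- c(4/12, 3/8, 3/10)    # P(R | B1), P(R | B2), P(R | B3)

joint     <- likelihood * prior     # P(R and Bi) for each box
posterior <- joint / sum(joint)     # P(Bi | R); sum(joint) equals P(R)

posterior[3]                        # probability the red ball came from Box 3
```

The three entries of posterior sum to one, and the third should agree with 36/121.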
In the above example the description of the experiment allowed us to determine P (B1 ),
P (B2 ), P (B3 ), P (R | B1 ), P (R | B2 ), and P (R | B3 ). We were then able to use the
definition of conditional probability to find P (B3 | R). Such a computation can be done in
general.
\[ P(B_i \mid A) = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) P(B_j)}. \tag{1.3.1} \]
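\begin{align*}
P(B_i \mid A) &= \frac{P(B_i \cap A)}{P(A)} = \frac{P(A \mid B_i) P(B_i)}{P\left( \bigcup_{j=1}^{n} (A \cap B_j) \right)} \\
&= \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \cap B_j)} = \frac{P(A \mid B_i) P(B_i)}{\sum_{j=1}^{n} P(A \mid B_j) P(B_j)}.
\end{align*}
■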
Equation (1.3.1) is sometimes referred to as “Bayes’ formula” or “Bayes’ rule” as well. This
result is originally due to Thomas Bayes (1701–1761).
Example 1.3.12. Shyam is randomly selected from the citizens of Hyderabad by the Health
authorities. A laboratory test on his blood sample tells Shyam that he has tested positive
for Swine Flu. It is found that 95% of people with Swine Flu test positive but 2% of people
without the disease will also test positive. Suppose that 1% of the population has the
disease. What is the probability that Shyam indeed has the Swine Flu?
Consider the events A = { Shyam has Swine Flu } and B = { Shyam tested positive
for Swine Flu }. We are given: $P(B \mid A) = 0.95$, $P(B \mid A^c) = 0.02$, and $P(A) = 0.01$. By Bayes’ theorem,
\begin{align*}
P(A \mid B) &= \frac{P(B \mid A) P(A)}{P(B \mid A) P(A) + P(B \mid A^c) P(A^c)} \\
&= \frac{(0.95)(0.01)}{(0.95)(0.01) + (0.02)(0.99)} \approx 0.324.
\end{align*}
Despite testing positive, there is only a 32.4 percent chance that Shyam has the disease. ■
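To see how the answer in Example 1.3.12 depends on the inputs, the calculation can be written out in R (introduced in Section 1.5). This is only a sketch, and the variable names are chosen here for readability.

```r
p_flu           <- 0.01   # P(A): proportion of the population with the disease
p_pos_given_flu <- 0.95   # P(B | A): test is positive for a person with the disease
p_pos_given_not <- 0.02   # P(B | A^c): test is positive for a person without it

# Denominator of Bayes' formula: overall probability of a positive test
p_pos <- p_pos_given_flu * p_flu + p_pos_given_not * (1 - p_flu)

# P(A | B): probability of disease given a positive test
p_flu_given_pos <- p_pos_given_flu * p_flu / p_pos
round(p_flu_given_pos, 3)
```

Changing p_flu shows how strongly the result depends on how common the disease is in the population.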
exercises
Ex. 1.3.1. There are two dice, one red and one blue, sitting on a table. The red die is a
standard die with six sides while the blue die is tetrahedral with four sides, so the outcomes
1, 2, 3, and 4 are all equally likely. A fair coin is flipped. If that coin comes up heads, the
red die will be rolled, but if the coin comes up tails the blue die will be rolled.
(a) Find the probability that the rolled die will show a 1.
(b) Find the probability that the rolled die will show a 6.
Ex. 1.3.2. A pair of dice are thrown. It is given that the outcome on one die is a 3. What
is the probability that the sum of the outcomes on both dice is greater than 7?
Ex. 1.3.3. Box A contains four white balls and three black balls and Box B contains three
white balls and five black balls.
(a) Suppose a box is selected at random and then a ball is chosen from the box. If the
ball drawn is black then what is the probability that it was from Box A?
(b) Suppose instead that one ball is drawn at random from Box A and placed (unseen)
in Box B. What is the probability that a ball now drawn from Box B is black?
Ex. 1.3.4. Tomorrow the weather will either be sunny, cloudy, or rainy. There is a 60%
chance tomorrow will be cloudy, a 30% chance tomorrow will be sunny, and a 10% chance
that tomorrow will be rainy. If it rains, I will not go on a walk. But if it is cloudy, there
is a 90% chance I will take a walk and if it’s sunny there is a 70% chance I will take a
walk. If I take a walk on a cloudy day, there is an 80% chance I will walk further than
five kilometers, but if I walk on a sunny day, there’s only a 50% chance I will walk further
than five kilometers. Using the percentages as given probabilities, answer the following
questions:
(a) How likely is it that tomorrow will be cloudy and I will walk over five kilometers?
(b) How likely is it I will take a walk over five kilometers tomorrow?
Ex. 1.3.5. A box contains B black balls and W white balls, where W ≥ 3, B ≥ 3. A sample of
three balls is drawn at random with each drawn ball being discarded (not put back into
the box) after it is drawn. For j = 1, 2, 3 let Aj denote the event that the ball drawn on
the j-th draw is white. Find P (A1 ), P (A2 ) and P (A3 ).
Ex. 1.3.6. There are two sets of cards, one red and one blue. The red set has four cards -
one that reads 1, two that read 2, and one that reads 3. An experiment involves flipping a
fair coin. If the coin comes up heads a card will be randomly selected from the red set
(and its number recorded) while if the coin comes up tails a card will be randomly selected
from the blue set (and its number recorded). You can construct the blue set of cards in
any way you see fit using any number of cards reading 1, 2, or 3. Explain how to build the
blue set of cards to make each of the experimental outcomes 1, 2, 3 equally likely.
Ex. 1.3.7. There are three tables, each with two drawers. Table 1 has a red ball in each
drawer. Table 2 has a blue ball in each drawer. Table 3 has a red ball in one drawer and a
blue ball in the other. A table is chosen at random, then a drawer is chosen at random
from that table. Find the conditional probability that Table 1 is chosen, given that a red
ball is drawn.
Ex. 1.3.8. In the G.R.E advanced mathematics exam, each multiple choice question has 4
choices for an answer. A prospective graduate student taking the test knows the correct
answer with probability $\frac{3}{4}$. If the student does not know the answer, she guesses randomly.
Given that a question was answered correctly, find the conditional probability that the
student knew the answer.
Ex. 1.3.9. You first roll a fair die, then toss as many fair coins as the number that showed
on the die. Given that 5 heads are obtained, what is the probability that the die showed 5?
Ex. 1.3.10. Manish is a student in a probability class. He gets a note saying, “I’ve organized
a probability study group tonight at 7pm in the coffee shop. Come if you want.” The note
is signed “Hannah”. However, Manish has class with two different Hannahs and he isn’t
sure which one sent the note. He figures that there is a 75% chance that Hannah A. would
have organized such a study group, but only a 25% chance that Hannah B. would have
done so. However, he also figures that if Hannah A. had organized the group, there is an
80% chance that she would have planned to meet on campus and only a 20% chance that
she would have planned to meet in the coffee shop. While if Hannah B. had organized the
group there is a 10% chance she would have planned for it on campus and a 90% chance
she would have chosen the coffee shop. Given all this information, determine whether it is
more likely that Manish should think the note came from Hannah A. or from Hannah B.
Ex. 1.3.11. State and prove a version of
(a) Theorem 1.3.5 when {Bi } is a countably infinite collection of disjoint events.
(b) Theorem 1.3.11 when {Bi } is a countably infinite collection of disjoint events.
Ex. 1.3.12. A bag contains 100 coins. Sixty of the coins are fair. The rest are biased to
land heads with probability p (where 0 ≤ p ≤ 1). A coin is drawn at random from the bag
and tossed.
(a) Given that the outcome was a head what is the conditional probability that it is a
biased coin?
(b) Evaluate your answer to (a) when p = 0. Can you explain why this answer should
be intuitively obvious?
(c) Evaluate your answer to (a) when $p = \frac{1}{2}$. Can you explain why this answer should
be fairly intuitive as well?
(d) View your answer to part (a) as a function f (p). Show that f (p) is an increasing
function when 0 ≤ p ≤ 1. Give an interpretation of this fact in the context of the
problem.
Ex. 1.3.13. An urn contains b black balls and r red balls. A ball is drawn at random. The
ball is replaced into the urn along with c balls of its colour and d balls of the opposite
colour. Then another random ball is drawn and the procedure is repeated.
(a) What is the probability that the second ball drawn is a red ball?
(b) Assume c = d. What is the probability that the second ball drawn is a black ball?
(c) Still assuming c = d, what is the probability that the nth ball drawn is a black ball?
(d) Assume c > 0 and d = 0, what is the probability that the nth ball drawn is a black
ball?
(e) Can you comment on the answers to (b) and/or (c) if the assumption that c = d was
removed?
(a) Prove Theorem 1.3.8 for the n = 2 case. (Hint: The proof should follow immediately
from the definition of conditional probability).
(b) Prove Theorem 1.3.8 for the n = 3 case. (Hint: Rewrite the conditional probabilities
in terms of ordinary probabilities).
(c) Prove Theorem 1.3.8 generally. (Hint: One method is to use induction, and parts (a)
and (b) have already provided a starting point).
1.4 Independence
In the previous section we have seen instances where the probability of an event may
change given the occurrence of a related event. However it is instructive and useful to
study the case of two events where the occurrence of one has no effect on the probability
of the other. Such events are said to be “independent”.
Example 1.4.1. Suppose we toss a coin three times. Then the sample space is
S = {hhh, hht, hth, htt, thh, tht, tth, ttt}.
Define A = {hhh, hht, hth, htt} = {the first toss is a head}, and similarly define B =
{hhh, hht, thh, tht} = {the second toss is a head}. Note that $P(A) = \frac{1}{2} = P(B)$, while
\[ P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{|A \cap B|}{|B|} = \frac{2}{4} = \frac{1}{2} \]
and
\[ P(A \mid B^c) = \frac{P(A \cap B^c)}{P(B^c)} = \frac{|A \cap B^c|}{|B^c|} = \frac{2}{4} = \frac{1}{2}. \]
We have shown that P (A) = P (A | B ) = P (A | B c ). Therefore we conclude that the
occurrence (or non-occurrence) of B has no effect on the probability of A. ■
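The counts in Example 1.4.1 can be confirmed by listing the sample space explicitly. The sketch below uses R (introduced in Section 1.5) and its expand.grid() function to enumerate the eight outcomes, assuming they are equally likely; the column and variable names are chosen for this illustration.

```r
# All 2 x 2 x 2 = 8 outcomes of three tosses, one row per outcome
tosses <- expand.grid(first  = c("h", "t"),
                      second = c("h", "t"),
                      third  = c("h", "t"),
                      stringsAsFactors = FALSE)

A <- tosses$first == "h"    # first toss is a head
B <- tosses$second == "h"   # second toss is a head

p_A          <- mean(A)                  # |A| / |S|
p_A_given_B  <- sum(A & B) / sum(B)      # |A and B| / |B|
p_A_given_Bc <- sum(A & !B) / sum(!B)    # |A and B^c| / |B^c|

c(p_A, p_A_given_B, p_A_given_Bc)
```

All three values come out equal, matching the conclusion of the example.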
This is the sort of condition we would want in a definition of independence. However, since
defining P (A | B ) requires that P (B ) > 0, our formal definition of “independence” will
appear slightly different.
P (A ∩ B ) = P (A)P (B ).
Example 1.4.3. Suppose we roll a die twice and denote as an ordered pair the result of
the rolls. Suppose
E = { a six appears on the first roll } = {(6, 1), (6, 2), (6, 3), (6, 4), (6, 5), (6, 6)}
and
F = { a six appears on the second roll } = {(1, 6), (2, 6), (3, 6), (4, 6), (5, 6), (6, 6)} .
\[ P(E \cap F) = \frac{1}{36}, \quad P(E) = \frac{6}{36} = \frac{1}{6}, \quad P(F) = \frac{6}{36} = \frac{1}{6}. \]
So E, F are independent as P (E ∩ F ) = P (E )P (F ). ■
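The product rule in Example 1.4.3 can also be checked by enumeration. This is a sketch in R (introduced in Section 1.5), assuming the 36 ordered pairs are equally likely; the names rolls, ev_E, and ev_F are illustrative.

```r
# All 36 ordered pairs (first roll, second roll)
rolls <- expand.grid(first = 1:6, second = 1:6)

ev_E <- rolls$first == 6    # a six appears on the first roll
ev_F <- rolls$second == 6   # a six appears on the second roll

p_E  <- mean(ev_E)
p_F  <- mean(ev_F)
p_EF <- mean(ev_E & ev_F)

# all.equal() compares with a small tolerance, which avoids
# spurious differences caused by floating point rounding
all.equal(p_EF, p_E * p_F)
```

The comparison is expected to report that the two sides agree, as the definition of independence requires.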
Using the definition of conditional probability it is not hard to show (see Exercise 1.4.9)
that if A and B are independent, and if 0 < P (B ) < 1 then
P (A | B ) = P (A) = P (A | B c ). (1.4.1)
If P (A) > 0 then the equations of (1.4.1) also hold with the roles of A and B reversed.
Thus, independence implies four conditional probability equalities.
If we want to extend our definition of independence to three events A1 , A2 , and A3 , we
would certainly want
\[ P(A_1 \cap A_2 \cap A_3) = P(A_1) P(A_2) P(A_3) \tag{1.4.2} \]
to hold. We would also want any pair of the three events to be independent of each other.
It is tempting to hope that pairwise independence is enough to imply (1.4.2). However,
this is not true, as shown by the next example.
Example 1.4.4. Suppose we toss a fair coin two times. Consider the three events A1 =
{hh, tt}, A2 = {hh, ht}, and A3 = {hh, th}. Then it is easy to calculate that
\begin{align*}
P(A_1) = P(A_2) = P(A_3) &= \frac{1}{2}, \\
P(A_1 \cap A_2) = P(A_1 \cap A_3) = P(A_2 \cap A_3) &= \frac{1}{4}, \text{ and} \\
P(A_1 \cap A_2 \cap A_3) &= \frac{1}{4}.
\end{align*}
So even though A1 , A2 and A3 are pairwise independent, they do not satisfy (1.4.2). ■
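The distinction drawn in Example 1.4.4 can be verified mechanically. The following R sketch (R is introduced in Section 1.5) represents each event as a logical vector over the four equally likely outcomes; the helper p() and the variable names are assumptions made for this illustration.

```r
S  <- c("hh", "ht", "th", "tt")   # equally likely outcomes of two tosses
A1 <- S %in% c("hh", "tt")
A2 <- S %in% c("hh", "ht")
A3 <- S %in% c("hh", "th")

# Probability of an event = proportion of outcomes it contains
p <- function(event) mean(event)

# Product rule for each pair of events
pairwise_ok <- c(p(A1 & A2) == p(A1) * p(A2),
                 p(A1 & A3) == p(A1) * p(A3),
                 p(A2 & A3) == p(A2) * p(A3))

# Product rule for all three events together
triple_ok <- p(A1 & A2 & A3) == p(A1) * p(A2) * p(A3)

pairwise_ok
triple_ok
```

Every pairwise check succeeds while the three-way check fails, since 1/4 is not equal to 1/8.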
It may also be tempting to hope that (1.4.2) is enough to imply pairwise independence,
but that is not true either (see Exercise 1.4.6). The root of the problem is that, unlike the
two event case, (1.4.2) does not imply that equality holds if any of the Ai are replaced by
their complements. One solution is to insist that the multiplicative equality hold for any
intersection of the events or their complements, which gives us the following definition.
exercises
Ex. 1.4.1. In the first semifinal of an international volleyball tournament Brazil has a 60%
chance to beat Pakistan. In the other semifinal Poland has a 70% chance to beat Mexico.
If the results of the two matches are independent, what is the probability that Pakistan
will meet Poland in the tournament final?
Ex. 1.4.2. A manufacturer produces nuts and markets them as having 50mm radius. The
machines that produce the nuts are not perfect. From repeated testing, it was established
that 15% of the nuts have radius below 49mm and 12% have radius above 51mm. If two
nuts are randomly (and independently) selected, find the probabilities of the following
events:
(a) The radii of both the nuts are between 49mm and 51mm;
Ex. 1.4.3. Four tennis players (Avinash, Ben, Carlos, and David) play a single-elimination
tournament with Avinash playing David and Ben playing Carlos in the first round and the
winner of each of those contests playing each other in the tournament final. Below is the
chart giving the percentage chance that one player will beat the other if they play. For
instance, Avinash has a 30% chance of beating Ben if they happen to play.
Suppose the outcomes of the games are independent. For each of the four players,
determine the probability that player wins the tournament. Verify that the calculated
probabilities sum to 1.
Ex. 1.4.4. Let A and B be events with P (A) = 0.8 and P (B ) = 0.7.
Ex. 1.4.5. Suppose we toss two fair dice. Let E1 denote the event that the sum of the dice
is six. Let E2 denote the event that the sum of the dice equals seven. Let F denote the event that
the first die equals four. Is E1 independent of F ? Is E2 independent of F ?
Ex. 1.4.6. Suppose a bowl has twenty-seven balls. One ball is black, two are white, and
eight each are green, red, and blue. A single ball is drawn from the bowl and its color is
recorded. Define
(a) Calculate P (A ∩ B ∩ C ).
Ex. 1.4.7. There are 150 students in the Probability 101 class. Of them, ninety are female,
sixty use a pencil (instead of a pen), and thirty are wearing eye glasses. A student is chosen
at random from the class. Define the following events:
(b) Give an example to show that it may be possible for these events to be pairwise
independent.
Ex. 1.4.8. When can an event be independent of itself? Do parts (a) and (b) below to
answer this question.
(a) Prove that if an event A is independent of itself then either P (A) = 0 or P (A) = 1.
(b) Prove that if A is an event such that either P (A) = 0 or P (A) = 1 then A is
independent of itself.
Ex. 1.4.9. This exercise explores the relationship between independence and conditional
probability.
(a) Suppose A and B are independent events with 0 < P (B ) < 1. Prove that P (A |
B ) = P (A) and that P (A | B c ) = P (A).
(b) Suppose that A and B are independent events. Prove that A and B c are also
independent.
(c) Suppose that A and B are events with P (B ) > 0. Prove that if P (A | B ) = P (A),
then A and B are independent.
(d) Suppose that A and B are events with 0 < P (B ) < 1. Prove that if P (A | B ) = P (A),
then P (A | B c ) = P (A) as well.
(a) Suppose A, B, and C are mutually independent. In particular, this means that
P (A ∩ B ∩ C ) = P (A) · P (B ) · P (C ), and
P (A ∩ B ∩ C c ) = P (A) · P (B ) · P (C c ).
Use these two facts to conclude that A and B are pairwise independent.
As we have already seen, and will see throughout this book, the general approach to solve
problems in probability and statistics is to put them in an abstract mathematical framework.
Many of these problems eventually simplify to computing some specific numbers. Usually
these computations are simple and can be done using a calculator. For some computations,
however, a more powerful tool is needed. In this book, we will use software called R to
illustrate such computations. R is freely available open source software that runs on a
variety of computer platforms, including Windows, macOS, and GNU/Linux.
R is many different things to different people, but for our purposes, it is best to think
of it as a very powerful calculator. Once you install and start R,¹ you will be presented
with a prompt that looks like the “greater than” sign (>). You can type expressions that
you want to evaluate here and press the Enter key to obtain the answer. For example,
9 / 44
[1] 0.2045455
0.6 * 0.4 + 0.3 * 0.6
[1] 0.42
log(0.6 * 0.4 + 0.3 * 0.6)
[1] -0.8675006
¹ Visit [Link] to download R and learn more about it.

It may seem odd to see a [1] at the beginning of each answer, but that is there for a
good reason. R is designed for statistical computations, which often require working with
a collection of numbers, which following standard mathematical terminology are referred
to as vectors. For example, we may want to do some computations on a vector consisting
of the first 5 positive integers. Specifically, suppose we want to compute the squares of
these integers, and then sum them up. Using R, we can do
c(1, 2, 3, 4, 5)^2
[1] 1 4 9 16 25
sum(c(1, 2, 3, 4, 5)^2)
[1] 55
Here the construct c(...) is used to create a vector containing the first five integers. Of
course, doing this manually is difficult for larger vectors, so another useful construct is m:n
which creates a vector containing all integers from m to n. Just as we do in mathematics, it
is also convenient to use symbols (called “variables”) to store intermediate values in long
computations. For example, to do the same operations as above for the first 40 integers,
we can do
x <- 1:40
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
[23] 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
x^2
sum(x^2)
[1] 22140
We can now guess the meaning of the number in square brackets at the beginning of
each line in the output: when R prints a vector that spans multiple lines, it prefixes each
line by the index of the first element printed in that line. The prefix appears for scalars
too because R treats scalars as vectors of length one.
In the example above, we see two kinds of operations. The expression x^2 is interpreted
as an element-wise squaring operation, which means that the result will have the same
length as the input. On the other hand, the expression sum(x) takes the elements of a
vector x and computes their sum. The first kind of operation is called a vectorized operation,
and most mathematical operations in R are of this kind.
To see how this can be useful, let us use R to compute factorials and binomial coefficients,
which will turn up frequently in this book. Recall that the binomial coefficient
\[ \binom{n}{k} = \frac{n!}{k!(n-k)!} \]
represents the number of ways of choosing k items out of n, where for any positive integer
m, m! is the product of the first m positive integers. Just as sum(x) computes the sum of
the elements of x, prod(x) computes their product. So, we can compute 10! and $\binom{10}{4}$ as
prod(1:10)
[1] 3628800
prod(1:10) / (prod(1:4) * prod(1:6))
[1] 210
Unfortunately, factorials can quickly become quite big, and may be beyond R’s ability to
compute precisely even for moderately large numbers. For example, trying to compute
$\binom{200}{4}$, we get
prod(1:200)
[1] Inf
prod(1:200) / (prod(1:4) * prod(1:196))
[1] NaN
The first computation yields Inf because at some point in the computation of the product,
the result becomes larger than the largest number R can store (this is often called
“overflowing”). The second computation essentially reduces to computing Inf/Inf, and the
resulting NaN indicates that the answer is ambiguous. The trivial mathematical fact that
\[ \log m! = \sum_{i=1}^{m} \log i \]
comes to our aid here because it lets us do our computations on much smaller numbers.
Using this, we can compute
logb <- sum(log(1:200)) - sum(log(1:4)) - sum(log(1:196))
logb
[1] 17.98504
exp(logb)
[1] 64684950
R actually has the ability to compute binomial coefficients built into it.
choose(200, 4)
[1] 64684950
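The log-sum idea above can be wrapped into a small reusable function. The following is only a sketch: the name log_choose is our own and not part of R, while lchoose() is the built-in function that returns the logarithm of a binomial coefficient.

```r
# log_choose() is a hypothetical helper name, not a base R function.
# seq_len() is used instead of 1:k so that k = 0 gives an empty sum.
log_choose <- function(n, k) {
  sum(log(seq_len(n))) - sum(log(seq_len(k))) - sum(log(seq_len(n - k)))
}

exp(log_choose(200, 4))   # agrees with choose(200, 4) up to rounding
lchoose(200, 4)           # base R's own log binomial coefficient
```

Working on the log scale keeps every intermediate value small, so the overflow seen with prod(1:200) does not arise.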
These named operations, such as sum(), prod(), log(), exp(), and choose(), are known
as functions in R. They are analogous to mathematical functions in the sense that they
map some inputs to an output. Vectorized functions map vectors to vectors, whereas
summary functions like sum() and prod() map vectors to scalars. It is common practice
in R to make functions vectorized whenever possible. For example, the choose() function
is also vectorized:
choose(10, 0:10)
[1] 1 10 45 120 210 252 210 120 45 10 1
choose(10:20, 4)
[1] 210 330 495 715 1001 1365 1820 2380 3060 3876 4845
choose(2:15, 0:13)
[1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105
A detailed exposition of R is beyond the scope of this book. In this book, we will only use
relatively basic R functions, which we will introduce as and when needed. There are many
excellent introductions available for the interested reader. In particular, R is very useful
for producing statistical plots, and most figures in this book are produced using R. We do
not describe how to create these figures in the book itself, but R code to reproduce them
is available on the website.
exercises
Ex. 1.5.2. Obtain a six-sided die, and throw it ten times, keeping a record of the face
that comes up each time. Store these values in a vector variable x. Find the output of the
built-in functions given in the previous exercise when applied to this vector.
Ex. 1.5.3. Use R to verify the calculations done in Example 1.2.4.
Ex. 1.5.4. We return to the Birthday Problem given in Exercise 1.2.12. Using R, calculate
the probability that at least two from a group of N people share the same birthday, for
N = 10, 12, 17, 26, 34, 40, 41, 45, 75, 105.
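\[
\begin{aligned}
P(\{\text{success, success}\}) &= P(E \cap F) \\
&= P(E)\,P(F) \quad \text{(using independence)} \\
&= P(\text{success on the first roll}) \cdot P(\text{success on the second roll}) \\
&= \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}.
\end{aligned}
\]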
To complete the problem, the event of rolling exactly one 6 among the two dice requires
exactly one success and exactly one failure. From the list above, this can happen in either
of two orders, so the probability of observing exactly one 6 is $\frac{5}{36} + \frac{5}{36} = \frac{10}{36}$.
■
For any two real numbers a, b and any integer n ≥ 1, it is well known that
\[ (a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \tag{2.1.1} \]
This is the binomial expansion due to Blaise Pascal (1623-1662). It turns out that when a and
b are positive numbers with a + b = 1, the terms on the right-hand side above have a
probabilistic interpretation. We illustrate it in the example below.
[Figure 2.1 has two panels; the left panel plots the cumulative number of successes against the trial number.]
Figure 2.1: The Binomial distribution as the number of successes in fifty Bernoulli(1/3) trials. The
paths on the left count the cumulative successes in the fifty trials. The graph on
the right shows the actual probabilities given by the Binomial(50, 1/3) distribution.
(c) How many attempts must be made before the first success is observed?
Ans (a) - Binomial(n,p): If n = 1, then the answer is clear, namely P({one success}) = p
and P({zero successes}) = 1 − p. For n > 1, let $\omega = (\omega_1, \omega_2, \dots, \omega_n)$ be an n-tuple of
outcomes. So we may view the sample space S as the set of all $\omega$ where each $\omega_i$ is allowed
to be either “success” or “failure”. Let $A_i$ represent either the event {the ith trial is a
success} or {the ith trial is a failure}. Then by independence
\[ P(A_1 \cap A_2 \cap \dots \cap A_n) = \prod_{i=1}^{n} P(A_i). \tag{2.1.2} \]
Let $B_k$ denote the event that there are k successes among the n trials. Then
\[ P(B_k) = \sum_{\omega \in B_k} P(\{\omega\}). \]
But if $\omega \in B_k$, then in notation (2.1.2), exactly k of the $A_i$ represent success trials and
the other n − k represent the failure trials. The order in which the successes and failures
appear does not matter since the probabilities are being multiplied together. So for every
$\omega \in B_k$,
\[ P(\{\omega\}) = p^k (1 - p)^{n-k}. \]
Consequently, we have
\[ P(B_k) = |B_k|\, p^k (1 - p)^{n-k}. \]
But $B_k$ is the event of all outcomes for which there are k successes, and the number of ways
in which k successes can occur in n trials is known to be $\binom{n}{k}$. Therefore, for 0 ≤ k ≤ n,
\[ P(B_k) = \binom{n}{k} p^k (1 - p)^{n-k}. \tag{2.1.3} \]
Note that if we are only interested in questions involving the number of successes, we could
ignore the set S described above and simply use {0, 1, 2, . . . , n} as our sample space with
$P(\{k\}) = \binom{n}{k} p^k (1 - p)^{n-k}$. We call this a Binomial distribution with parameters n and p
(or a Binomial(n, p) for short). It is also worth noting that the binomial expansion (2.1.1)
shows
\[ \sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n-k} = (p + (1 - p))^n = 1, \]
which simply provides additional confirmation that we have accounted for all possible
outcomes in our list of Bernoulli trials. See Figure 2.1 for a simulated example of fifty
replications of Bernoulli(1/3) trials.
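As a quick numerical check of formula (2.1.3), the sketch below evaluates all Binomial(50, 1/3) probabilities with choose() and confirms that they add up to 1. The variable names n, p, k, and pk are chosen here for illustration.

```r
n <- 50
p <- 1/3
k <- 0:n

# Formula (2.1.3), evaluated for every k at once (vectorized)
pk <- choose(n, k) * p^k * (1 - p)^(n - k)

sum(pk)   # should be 1, up to floating-point rounding
```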
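\[
\begin{aligned}
\frac{P(B_{k+1})}{P(B_k)} &= \frac{\binom{n}{k+1} p^{k+1} (1 - p)^{n-(k+1)}}{\binom{n}{k} p^k (1 - p)^{n-k}} \\
&= \frac{n!}{(k+1)!\,(n-(k+1))!} \cdot \frac{k!\,(n-k)!}{n!} \cdot \frac{p^{k+1} (1 - p)^{n-(k+1)}}{p^k (1 - p)^{n-k}} \\
&= \frac{p}{1-p} \cdot \frac{n-k}{k+1}.
\end{aligned}
\]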
If this ratio were to equal 1 we could conclude that {(k + 1) successes} was exactly
as likely as {k successes}. Similarly if the ratio were bigger than 1 we would know that
{(k + 1) successes} was the more likely of the two and if the ratio were less than 1 we
would see that {k successes} was the more likely case. Setting $\frac{P(B_{k+1})}{P(B_k)} \geq 1$ and solving for
k yields the following sequence of equivalent inequalities:
\[
\begin{aligned}
\frac{P(B_{k+1})}{P(B_k)} &\geq 1 \\
\frac{p}{1-p} \cdot \frac{n-k}{k+1} &\geq 1 \\
pn - pk &\geq k + 1 - pk - p \\
k &\leq p(n + 1) - 1.
\end{aligned}
\]
In other words if k starts at 0 and begins to increase, the probability of achieving exactly
k successes will increase while k < p(n + 1) − 1 and then will decrease once k > p(n + 1) − 1.
As a consequence the most likely number of successes is the integer value of k for which
k − 1 ≤ p(n + 1) − 1 < k. This gives the critical value of k = ⌊p(n + 1)⌋, the greatest
integer less than or equal to p(n + 1).
An unusual special case occurs if p(n + 1) is already an integer. Then the sequence of
inequalities above is equality throughout, so if we let k = ⌊p(n + 1)⌋ = p(n + 1) we find a
ratio P (Bk )/P (Bk−1 ) exactly equal to 1. In this case {k − 1 successes} and {k successes}
share the distinction of being equally likely.
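The formula ⌊p(n + 1)⌋ and the tie in the special case can both be checked numerically. The sketch below is one way to do so; binom_probs is a helper name introduced here for illustration, and the parameter values are our own choices.

```r
# Binomial(n, p) probabilities for k = 0, ..., n, from formula (2.1.3)
binom_probs <- function(n, p) {
  k <- 0:n
  choose(n, k) * p^k * (1 - p)^(n - k)
}

probs <- binom_probs(5, 0.25)
which.max(probs) - 1     # most likely k (subtract 1 because k starts at 0)
floor(0.25 * (5 + 1))    # value given by the formula

# Special case: p * (n + 1) = 2 is an integer, so k = 1 and k = 2 tie
tie <- binom_probs(3, 0.5)
tie[2]   # probability of 1 success
tie[3]   # probability of 2 successes
```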
Ans (c) - Geometric(p): It is possible we could see the first success as early as the first
trial and, in fact, the probability of this occurring is just p, the probability that the first
trial is a success. For the first success to come on the kth trial, the first k − 1 trials
must be failures and the kth trial must be a success. Let $A_i$ be the event {the ith
trial is a success} and let $C_k$ be the event {the first success occurs on the kth trial}. So,
by independence,
\[ P(C_k) = P(A_1^c \cap \dots \cap A_{k-1}^c \cap A_k) = (1 - p)^{k-1} p \]
for k > 0. If we view these as probabilities of the outcomes of a sample space {1, 2, 3, . . . },
we call this a geometric distribution with parameter p (or a Geometric(p) for short).
Ans (d) - Average: This is a natural question to ask but it requires a precise definition of
what we mean by “average” in the context of probability. We shall do this in Chapter 4
and return to answer (d) at that point in time.
■
Bernoulli trials may also be used to determine probabilities associated with who will
win a contest that requires a certain number of individual victories. Below is an example
applied to a “best two out of three” situation.
Example 2.1.3. Jed and Sania play a tennis match. The match is won by the first player to
win two sets. Sania is a bit better than Jed and she will win any given set with probability
$\frac{2}{3}$. How likely is it that Sania will win the match? (Assume the results of each set are
independent).
This can almost be viewed as three Bernoulli(2/3) trials where we view a success as a set
won by Sania. One problem with that perspective is that an outcome such as (win,win,loss)
never occurs since two wins put an end to the match and the third set will never be played.
Nevertheless, the same tools used to solve the earlier problem can be used for this one as
well. Sania wins the match if she wins the first two sets (which happens with probability
$\frac{4}{9}$). She also wins the match with either a (win,loss,win) or a (loss,win,win) sequence of
R can be used to compute probabilities of both the Binomial and Geometric distributions
quite easily. We can compute them directly from the respective formulas. For example,
with n = 5 and p = 0.25, all Binomial probabilities are given by
k <- 0:5
choose(5, k) * 0.25^k * 0.75^(5-k)
Similarly, with p = 0.25, the Geometric probabilities that the first success occurs on
trial k + 1, for k = 0, 1, . . . , 10, are given by
k <- 0:10
0.25 * 0.75^k
Actually, as both Binomial and Geometric are standard distributions, R has built-in
functions to compute these probabilities. These can be used as follows.
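A sketch of how the built-in functions might be called is given below. We assume the functions meant are dbinom() and dgeom() from R's standard stats package; the parameter values match the direct computations above.

```r
# Binomial(5, 0.25): probabilities of 0, 1, ..., 5 successes
dbinom(0:5, size = 5, prob = 0.25)

# Geometric(0.25): R's dgeom() counts the failures *before* the first
# success, so dgeom(k, p) equals p * (1 - p)^k for k = 0, 1, 2, ...
dgeom(0:10, prob = 0.25)
```

Because dgeom() is indexed by the number of failures, the probability that the first success occurs on trial k is dgeom(k - 1, p).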
exercises
Ex. 2.1.1. Three dice are rolled. How likely is it that exactly one of the dice shows a 6?
Ex. 2.1.2. A fair die is rolled repeatedly.
(a) What is the probability that the first 6 appears on the fifth roll?
(b) What is the probability that no 6’s appear in the first four rolls?
(c) What is the probability that the second 6 appears on the fifth roll?
Ex. 2.1.3. Suppose that airplane engines operate independently in flight and fail with
probability p (0 ≤ p ≤ 1). A plane makes a safe flight if at least half of its engines
are running. Kingfisher Airlines has a four-engine plane and Paramount Airlines has
a two-engine plane for a flight from Bangalore to Delhi. Which airline has the higher
probability for a successful flight?
Ex. 2.1.4. Two intramural volleyball teams have eight players each. There is a 10% chance
that any given player will not show up to a game, independently of any other. The game
can be played if each team has at least six members show up. How likely is it the game
can be played?
Ex. 2.1.5. Mark is a 70% free throw shooter. Assume each attempted free throw is
independent of every other attempt. If he attempts ten free throws, answer the following
questions.
(a) How likely is it that Mark will make exactly seven of ten attempted free throws?
(b) What is the most likely number of free throws Mark will make?
(c) How do your answers to (a) and (b) change if Mark only attempts 9 free throws
instead of 10?
Ex. 2.1.6. Continuing the previous exercise, Kalyani isn’t as good a free throw shooter as
Mark, but she can still make a shot 40% of the time. Mark and Kalyani play a game where
the first one to sink a free throw is the winner. Since Kalyani isn’t as skilled a player, she
goes first to make it more fair.
(a) How likely is it that Kalyani will win the game on her first shot?
(b) How likely is it that Mark will win this game on his first shot? (Remember, for Mark
even to get a chance to shoot, Kalyani must miss her first shot).
(c) How likely is it that Kalyani will win the game on her second shot?
Ex. 2.1.7. Recall from the text above that the R code
produces a vector of six outputs corresponding to the probabilities that a Binomial(5, 0.25)
distribution takes on the six values 0-5. Specifically, the output indicates that the probability
of the value 0 is approximately 0.2373046875, the probability of the value 1 is approximately
0.3955078125 and so on. In Example 2.1.2 we derived a formula for the most likely outcome
of such a distribution. In the case of a Binomial(5, 0.25) that formula gives the result
⌊(5 + 1)(0.25)⌋ = 1. We could have verified this via the R output above as well, since the
second number on the list is the largest of the probabilities.
(a) Find the most likely outcome of a Binomial(7, 0.34) distribution using the formula
from Example 2.1.2.
(c) Find the most likely outcome of a Binomial(8, 0.34) distribution using the formula
from Example 2.1.2.
Ex. 2.1.8. It is estimated that 0.8% of a large shipment of eggs to a certain supermarket are
cracked. The eggs are packaged in cartons, each with a dozen eggs, with the cracked eggs
being randomly distributed. A restaurant owner buys 10 cartons from the supermarket.
Call a carton “defective” if it contains at least one cracked egg.
(a) If she notes the number of defective cartons, what are the possible outcomes for this
experiment?
(b) If she notes the total number of cracked eggs, what are the possible outcomes for
this experiment?
(c) How likely is it that she will find exactly one cracked egg among all of her cartons?
(d) How likely is it that she will find exactly one defective carton?
(e) Explain why your answer to (d) is close to, but slightly larger than, your answer
to (c).
(f) What is the most likely number of cracked eggs she will find among her cartons?
(g) What is the most likely number of defective cartons she will find?
(h) How do you reconcile your answers to parts (f) and (g)?
Ex. 2.1.9. Steve and Siva enter a bar with $30 each. A round of drinks costs $10. For each
round, they roll a die. If the roll is even, Steve pays for the round and if the roll is odd,
Siva pays for it. This continues until one of them runs out of money.
(b) What is the probability that Siva runs out of money if Steve has cheated by bringing
a die that comes up even only 40% of the time?
Ex. 2.1.10. Let 0 < p < 1. Show that the mode of a Geometric(p) distribution is 1.
Ex. 2.1.11. Scott is playing a game where he rolls a standard die until it shows a 6. The
number of rolls needed therefore has a Geometric(1/6) distribution. Use the appropriate R
commands to do the following:
Ex. 2.1.12. Suppose a fair coin is tossed n times. Compute the following:
Ex. 2.1.13. At a basketball tournament, each round is on a “best of seven games” basis.
That is, Team I and Team II play until one of the teams has won four games. Suppose
each game is won by Team I with probability p, independently of all previous games. Are
the events A = {Team I wins the round} and B = {the round lasts exactly four games}
independent?
Ex. 2.1.14. Two coins are sitting on a table. One is fair and the other is weighted so that
it always comes up heads.
(a) If one coin is selected at random (each equally likely) and flipped, what is the
probability the result is heads?
(b) One coin is selected at random (each equally likely) and flipped five times. Each
flip shows heads. Given this information about the coin flip results, what is the
conditional probability that the selected coin was the fair one?
Ex. 2.1.15. For 0 < p < 1 we defined the geometric distribution as a probability on the set
{1, 2, 3, . . . } for which $P(\{k\}) = p(1 - p)^{k-1}$. Show that these outcomes account for all
possibilities by demonstrating that $\sum_{k=1}^{\infty} P(\{k\}) = 1$.
Ex. 2.1.16. The geometric distribution describes the waiting time to observe a single
success. A “Negative Binomial” distribution with parameters n and p (NegBinomial(n, p))
is defined as the number of Bernoulli(p) trials needed to observe n successes. The
following problem builds toward calculating some associated probabilities.
(a) If a fair die is rolled repeatedly and a number is recorded equal to the number of
rolls until the second 6 is observed, what is the sample space of possible outcomes
for this experiment?
(b) For k in the sample space you identified in part (a), what is P ({k})?
(c) If a fair die is rolled repeatedly and a number is recorded equal to the number of
rolls until the nth 6 is observed, what is the sample space of possible outcomes for
this experiment?
(d) For k in the sample space you identified in part (c), what is P ({k})?
(e) If a sequence of Bernoulli(p) trials (with 0 < p < 1) is performed and a number is
recorded equal to the number of trials until the nth success is observed, what is the
sample space of possible outcomes for this experiment?
(f) For k in the sample space you identified in part (e), what is P ({k})?
(g) Show that you have accounted for all possibilities in part (f) by showing
\[ \sum_{k \in S} P(\{k\}) = 1. \]
Calculating Binomial probabilities can be challenging when n is large. Let us consider the
following example:
Example 2.2.1. A small college has 1460 students. Assume that birthrates are constant
throughout the year and that each year has 365 days. What is the probability that five or
more students were born on Independence day?
The probability that any given student was born on Independence day is $\frac{1}{365}$. So the
exact probability is
\[ 1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k}. \]
The example above can be thought of as a series of Bernoulli trials where a success
means finding a student whose birthday is Independence day. In this case p is small ($\frac{1}{365}$)
and n is large (1460). To approximate we will consider a limiting procedure where p → 0
and n → ∞, but with limits carried out in such a way that np is held constant. The
computation below is called a Poisson approximation.
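Proof -
\[
\begin{aligned}
P(A_k) &= \binom{n}{k} \left(\frac{\lambda}{n}\right)^{k} \left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \frac{n(n-1) \cdots (n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \frac{\lambda^k}{k!} \cdot \frac{n(n-1) \cdots (n-k+1)}{n^k} \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k} \\
&= \frac{\lambda^k}{k!} \cdot 1 \left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{k-1}{n}\right) \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k} \\
&= \frac{\lambda^k}{k!} \prod_{r=1}^{k-1} \left(1 - \frac{r}{n}\right) \left(1 - \frac{\lambda}{n}\right)^{n} \left(1 - \frac{\lambda}{n}\right)^{-k}.
\end{aligned}
\]
As n → ∞, we have
\[ \lim_{n \to \infty} \left(1 - \frac{r}{n}\right) = 1 \quad \text{for all } r \geq 1; \]
\[ \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{-k} = 1 \quad \text{for all } \lambda \geq 0,\ k \geq 1; \text{ and} \]
\[ \lim_{n \to \infty} \left(1 - \frac{\lambda}{n}\right)^{n} = e^{-\lambda} \quad \text{for all } \lambda \geq 0. \]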
As P (Ak ) is a finite product of such expressions, the result is now immediate using the
properties of limits. ■
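The limit can also be observed numerically. The sketch below fixes λ = 4 (our choice, matching Example 2.2.1) and, for increasing n, reports the largest difference between the Binomial(n, λ/n) probabilities and the limiting values over k = 0, . . . , 4. The function name max_gap is introduced here for illustration.

```r
lambda <- 4
k <- 0:4
poisson_limit <- exp(-lambda) * lambda^k / factorial(k)

# Largest gap between Binomial(n, lambda/n) and the limit, over k = 0..4
max_gap <- function(n) {
  binom <- choose(n, k) * (lambda / n)^k * (1 - lambda / n)^(n - k)
  max(abs(binom - poisson_limit))
}

gaps <- sapply(c(10, 100, 1000, 10000), max_gap)
gaps   # the gaps shrink as n grows with np held at lambda
```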
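Returning to Example 2.2.1 and using the above approximation, we would take λ =
pn = 1460/365 = 4. So if E is the event {five or more Independence day birthdays},
\[
\begin{aligned}
P(E) &= 1 - \sum_{k=0}^{4} \binom{1460}{k} \left(\frac{1}{365}\right)^{k} \left(\frac{364}{365}\right)^{1460-k} \\
&\approx 1 - \left[ e^{-4} + 4e^{-4} + \frac{4^2}{2} e^{-4} + \frac{4^3}{6} e^{-4} + \frac{4^4}{24} e^{-4} \right].
\end{aligned}
\]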
[1] 0.3711629
[1] 0.3711631
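One way to evaluate both the exact expression and its Poisson approximation directly from the formulas is sketched below. The variable names exact and poisson_approx are ours.

```r
k <- 0:4

# Exact Binomial(1460, 1/365) probability of five or more
exact <- 1 - sum(choose(1460, k) * (1/365)^k * (364/365)^(1460 - k))

# Poisson approximation with lambda = 4
poisson_approx <- 1 - sum(exp(-4) * 4^k / factorial(k))

exact
poisson_approx
```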
It also turns out that the right hand side of (2.2.1) defines a probability on the sample space
of non-negative integers. The distribution is named after Siméon Poisson (1781–1840).
Poisson(λ): Let λ ≥ 0 and S = {0, 1, 2, 3, . . .} with probability P given by
\[ P(\{k\}) = \frac{e^{-\lambda} \lambda^{k}}{k!} \]
for k ∈ S. This distribution is called Poisson with parameter λ (or Poisson(λ) for short).
As with Binomial and Geometric, R has a built-in function to evaluate Poisson probabilities
as well. An alternative to the calculation above is the following.
[1] 0.3711631
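A sketch of such a call is given below, assuming the built-in functions meant are dpois() and ppois() from R's stats package. Both lines compute the probability that a Poisson(4) outcome is five or more.

```r
# Sum the Poisson(4) probabilities of 0, ..., 4 and subtract from 1
1 - sum(dpois(0:4, lambda = 4))

# Equivalent: upper tail P(X > 4) from the cumulative distribution function
ppois(4, lambda = 4, lower.tail = FALSE)
```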
It is important to note that for this approximation to work well, p must be small and n
must be large. For example, we may modify our question as follows:
Example 2.2.3. A class has 48 students. Assume that birthrates are constant throughout
the year and that each year has 365 days. What is the probability that five or more
students were born in September? ■
The correct answer to this question is
[1] 0.3710398
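To see how the quality of the approximation depends on n and p, the sketch below compares the individual probabilities for k = 0, . . . , 20. We assume p = 1/12 for the September example, as in the caption of Figure 2.2; the names gap_large_n and gap_small_n are ours.

```r
k <- 0:20
pois <- dpois(k, lambda = 4)

# Largest pointwise gap between each Binomial and Poisson(4)
gap_large_n <- max(abs(dbinom(k, size = 1460, prob = 1/365) - pois))
gap_small_n <- max(abs(dbinom(k, size = 48, prob = 1/12) - pois))

gap_large_n
gap_small_n   # noticeably larger: n = 48 is small and p = 1/12 is not
```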
Example 2.2.4. A computer transmits three digital messages of 12 million bits of information
each. Each bit has a probability of one one-billionth that it will be incorrectly
received, independent of all other bits. What is the probability that at least two of the
three messages will be received error free?
Since n = 12,000,000 is large and since p = 1/1,000,000,000 is small it is appropriate to
use a Poisson approximation where λ = np = 0.012. A message is error free if there isn’t
a single misread bit, so the probability that a given message will be received without an
error is $e^{-0.012}$.
Now we can think of each message being like a Bernoulli trial with probability $e^{-0.012}$,
so the number of messages correctly received is then like a Binomial(3, $e^{-0.012}$). Therefore
the probability of receiving at least two error-free messages is
\[ \binom{3}{3} (e^{-0.012})^3 (1 - e^{-0.012})^0 + \binom{3}{2} (e^{-0.012})^2 (1 - e^{-0.012})^1 \approx 0.9996. \]
There is about a 99.96% chance that at least two of the messages will be correctly
received. ■
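The arithmetic in this example can be reproduced as follows. This is a sketch using dbinom() from R's stats package; the names q and q_exact are ours, and the last lines compare the Poisson value with the exact probability that no bit is misread.

```r
# P(a given message is error free), by the Poisson approximation
q <- exp(-0.012)

# P(at least two of the three messages are error free)
sum(dbinom(2:3, size = 3, prob = q))

# Exact probability that all 12 million bits of one message are correct
q_exact <- (1 - 1e-9)^12e6
q - q_exact   # the approximation error is tiny
```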
exercises
Ex. 2.2.1. Do the problems below to familiarize yourself with the “sum” command in R.
(a) If a fair coin is tossed 100 times, what is the probability exactly 55 of the tosses show
heads?
(b) Example 2.2.3 showed how to use R to add the probabilities of a range of outcomes
for common distributions. Use this code as a guide to calculate the probability at
least 55 tosses show heads.
Ex. 2.2.2. Consider an experiment described by a Poisson(1/2) distribution and answer the
following questions.
(b) What is the probability the experiment will produce a result larger than 1?
Ex. 2.2.3. Suppose we perform 500 independent trials with probability of success being
0.02.
(a) Use R to compute the probability that there are six or fewer successes. Obtain a
decimal approximation accurate to five decimal places.
[Figure 2.2 consists of two plots, one above the other, each showing Probability for k = 0, . . . , 20.]
Figure 2.2: The Poisson approximation to the Binomial distribution. In both plots above,
the points indicate Binomial probabilities for k = 0, 1, 2, . . . , 20; the top plot for
Binomial(1460, 1/365), and the bottom for Binomial(48, 1/12). The lengths of the
vertical lines, “hanging” from the points, represent the corresponding probabilities
for Poisson(4). For a good approximation, the bottom of the hanging lines should
end up at the x-axis. As we can see, this happens in the top plot but not for
the bottom plot, indicating that Poisson(4) is a good approximation for the first
Binomial distribution, but not as good for the second.
(b) Use the Poisson approximation to estimate the probability that there are six or fewer
successes and compare it to your answer to (a).
Now suppose we perform 5000 independent trials with probability of success being
0.002.
(c) Use R to compute the probability that there are six or fewer successes. Obtain a
decimal approximation accurate to five decimal places.
(d) Use the Poisson approximation to estimate the probability that there are six or fewer
successes and compare it to your answer to (c).
(a) What is the chance that there are at least 2 mistakes on the first page?
(b) What is the chance that at least eight of the first ten pages are free of mistakes?
Ex. 2.2.6. Let λ > 0. For the problems below, assume the probability space is a Poisson(λ)
distribution.
(a) Let k be a non-negative integer. Calculate the ratio $\frac{P(\{k+1\})}{P(\{k\})}$.
Ex. 2.2.7. A number is to be produced as follows. A fair coin is tossed. If the coin comes
up heads the number will be the outcome of an experiment corresponding to a Poisson(1)
distribution. If the coin comes up tails the number will be the outcome of an experiment
corresponding to a Poisson(2) distribution. Given that the number produced was a 2,
determine the conditional probability that the coin came up heads.
Ex. 2.2.8. Suppose that the number of earthquakes that occur in a year in California has
a Poisson distribution with parameter λ. Suppose that the probability that any given
earthquake has magnitude at least 6 on the Richter scale is p.
(a) Given that there are exactly n earthquakes in a year, find an expression (in terms of
n and p) for the conditional probability that exactly one of them is magnitude at
least 6.
(b) Find an expression (in terms of λ and p) for the probability that there will be exactly
one earthquake of magnitude at least 6 in a year.
(c) Find an expression (in terms of k, λ, and p) for the probability that there will be
exactly k earthquakes of magnitude at least 6 in a year.
\[ P(\{k\}) = \frac{e^{-\lambda} \lambda^{k}}{k!}, \]
for k ≥ 0. Prove that this completely accounts for all possibilities by proving that
\[ \sum_{k=0}^{\infty} \frac{e^{-\lambda} \lambda^{k}}{k!} = 1. \]
Imagine a small town with 5000 residents, exactly 1000 of whom are under the age of
eighteen. Suppose we randomly select four of these residents and ask how many of the
four are under the age of eighteen. There is some ambiguity in how to interpret this idea
of selecting four residents. One possibility is “sampling with replacement” where each
selection could be any of the 5000 residents and the selections are all genuinely independent.
With this interpretation, the sample is simply a series of four independent Bernoulli(1/5)
trials, in which case the answer may be found using techniques from the previous sections.
Note, however, that the assumption of independence allows for the possibility that the
same individual will be chosen two or more times in separate trials. This is a situation that
might seem peculiar when we think about choosing four people from a population of 5000,
since we may not have four different individuals at the end of the process. To eliminate
this possibility consider “sampling without replacement” where it is assumed that if an
individual is chosen for inclusion in the sample, that person is no longer available to be
picked in a later selection. Equivalently we can consider all possible groups of four which
might be selected and view each grouping as equally likely. This change means the problem
can no longer be solved by viewing the situation as a series of independent Bernoulli trials.
Nevertheless, other tools that have been previously developed will serve to answer this new
problem.
Example 2.3.1. For the town described above, what is the probability that, of four
residents randomly selected (without replacement), exactly two of them will be under the
age of eighteen?
Since we are selecting four residents from the town of 5000, there are $\binom{5000}{4}$ ways this
may be done. If each of these is equally likely, the desired probability may be calculated
by determining how many of these selections result in exactly two people under the age of
eighteen. This requires selecting two of the 1000 who are in that younger age group and
also selecting two of the 4000 who are older. So there are $\binom{1000}{2}\binom{4000}{2}$ ways to make such
choices and therefore the probability of selecting exactly two residents under age eighteen
is $\binom{1000}{2}\binom{4000}{2} / \binom{5000}{4}$.
It is instructive to compare this to the solution if it is assumed the selection is done with
replacement. In that case, the answer is simply the probability that a Binomial(4, 1/5)
produces a result of two. From the previous sections, the answer is $\binom{4}{2}\left(\frac{1}{5}\right)^2\left(\frac{4}{5}\right)^2$.
To compare these answers we give decimal approximations of both. To six digits of
accuracy
\[ \frac{\binom{1000}{2}\binom{4000}{2}}{\binom{5000}{4}} \approx 0.153592 \quad \text{and} \quad \binom{4}{2}\left(\frac{1}{5}\right)^2\left(\frac{4}{5}\right)^2 = 0.1536, \]
so while the two answers are not equal, they are very close. This is a reflection of an
important fact in statistical analysis—when samples are small relative to the size of the
populations they came from, the two methods of sampling give very similar results. ■
[1] 0.1535923
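The comparison in Example 2.3.1 can be reproduced along the following lines. This is a sketch: the variable name hyper is ours, and we assume the relevant built-in function is dhyper() from R's stats package, whose arguments are the number of successes in the population (m), the number of failures in the population (n), and the sample size (k).

```r
# Sampling without replacement, directly from the counting argument
hyper <- choose(1000, 2) * choose(4000, 2) / choose(5000, 4)
hyper

# The same probability using R's built-in hypergeometric function
dhyper(2, m = 1000, n = 4000, k = 4)

# Sampling with replacement: Binomial(4, 1/5)
dbinom(2, size = 4, prob = 1/5)
```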
It is also possible (and useful) to view sampling without replacement as a series of dependent
Bernoulli trials for which each trial reduces the possible outcomes of subsequent trials. In
this case each trial is described in terms of conditional probabilities based on the results of
the preceding observations. We illustrate this by revisiting the previous example.
Example 2.3.1 Continued: We first solved this problem by considering every group of four
as equally likely to be selected. Now consider the sampling procedure as a series of four
separate Bernoulli trials where a success corresponds to the selection of a person under
eighteen and a failure as the selection of someone older. We still want to determine the
probability that a sample of size four will produce exactly two successes. One complication
with this perspective is that the successes and failures could come in many different orders,
so first consider the event where the series of selections follow the pattern “success-success-
failure-failure”. More precisely, for j = 1, 2, 3, 4 let
\[ P(A_3^c \mid A_1 \cap A_2) = \frac{4000}{4998} \]
and
\[ P(A_4^c \mid A_1 \cap A_2 \cap A_3^c) = \frac{3999}{4997}. \]
From those values, Theorem 1.3.8 may be used to calculate
Next we must account for the fact that this figure only considers the case where the two
younger people were chosen as the first two selections. There are $\binom{4}{2}$ different orderings
that result in two younger and two older people, and it happens that each of these has the
same probability calculated above. For example,
The individual fractions are different, but their product is the same. This will always
happen for different orderings of a specific number of successes since the denominators
(5000 through 4997) reflect the steady reduction of one available choice with each additional
selection. Similarly the numerators (1000 and 999 together with 4000 and 3999) reflect the
number of people available from each of the two different categories and their reduction
as previous choices eliminate possible candidates. Therefore the total probability is the
product of the number of orderings and the probability of each ordering.
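The claim that every ordering has the same probability is easy to check numerically. In the sketch below, the variable names are ours, and the second ordering (failure-success-failure-success) is one we picked for illustration.

```r
# success-success-failure-failure
one_ordering <- (1000/5000) * (999/4999) * (4000/4998) * (3999/4997)

# failure-success-failure-success: different fractions, same product
other_ordering <- (4000/5000) * (1000/4999) * (3999/4998) * (999/4997)

one_ordering
other_ordering

# Total probability: number of orderings times probability of each
choose(4, 2) * one_ordering
```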
\[ \binom{m}{k} \left[ \frac{r}{N} \cdot \frac{r-1}{N-1} \cdots \frac{r-(k-1)}{N-(k-1)} \right] \left[ \frac{N-r}{N-k} \cdot \frac{N-r-1}{N-k-1} \cdots \frac{N-r-(m-1-k)}{N-(m-1)} \right] \]
for any k ∈ S.
Proof. Following the previous example as a model, this can be proven by viewing the
hypergeometric distribution as a series of dependent trials. The first k fractions are the
probabilities the first k trials each result in successes conditioned on the successes of
the preceding trials. The remaining m − k fractions are the conditional probabilities the
remaining trials result in failures. The leading factor of $\binom{m}{k}$ accounts for the number of
different patterns of k successes and m − k failures, each of which is equally likely. It is
also possible to prove the equality directly using combinatorial identities and we leave this
as Exercise 2.3.4. ■
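The product formula can be turned into a short function and compared against R's dhyper(). This is only a sketch: hyper_by_trials is a name introduced here, with N the population size, r the number of population members counted as successes, and m the sample size.

```r
# Hypergeometric probability of k successes, computed as a product of
# conditional probabilities for the pattern "k successes, then m - k failures"
hyper_by_trials <- function(k, N, r, m) {
  n_fail <- m - k
  succ <- if (k > 0) (r - 0:(k - 1)) / (N - 0:(k - 1)) else numeric(0)
  fail <- if (n_fail > 0) {
    (N - r - 0:(n_fail - 1)) / (N - k - 0:(n_fail - 1))
  } else {
    numeric(0)
  }
  # prod(numeric(0)) is 1, which covers k = 0 and k = m
  choose(m, k) * prod(succ) * prod(fail)
}

hyper_by_trials(2, N = 5000, r = 1000, m = 4)
dhyper(2, m = 1000, n = 4000, k = 4)   # built-in, for comparison
```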
We saw with Example 2.3.1 that sampling with and without replacement may give very
similar results. The following theorem makes a precise statement to this effect.
Theorem 2.3.3. Let N , m, and r be positive integers for which m < r < N and
let k be a positive integer between 0 and m. Define
\[ p = \frac{r}{N}, \quad p_1 = \frac{r-k}{N-k}, \quad \text{and} \quad p_2 = \frac{r-k}{N-m}. \]
Proof- The inequalities may be verified by comparing $p$, $p_1$, and $p_2$ to the fractions from
Theorem 2.3.2. Specifically note that the k fractions
\[ \frac{r}{N}, \frac{r-1}{N-1}, \dots, \frac{r-(k-1)}{N-(k-1)} \]
\[ \frac{N-r}{N-k}, \frac{N-r-1}{N-k-1}, \dots, \frac{N-r-(m-1-k)}{N-(m-1)} \]
\[ \frac{N-r}{N-k}, \frac{N-r-1}{N-k-1}, \dots, \frac{N-r-(m-1-k)}{N-(m-1)} \]
all exceed $\frac{N-r-(m-k)}{N-m}$, which equals $1 - p_2$. ■
exercises
Ex. 2.3.1. Suppose there are thirty balls in an urn, ten of which are black and the
remaining twenty of which are red. Suppose three balls are selected from the urn (without
replacement).
(b) What is the probability that the three draws result in exactly two red balls?
Ex. 2.3.2. This exercise explores how to use R to investigate the Binomial approximation
to the Hypergeometric distribution.
(a) A jar contains forty marbles – thirty white and ten black. Ten marbles are drawn
at random from the jar. Use R to calculate the probability that exactly five of the
marbles drawn are black. Do two separate computations, one under the assumption
that the draws are with replacement and the other under the assumption that the
draws are without replacement.
(b) Repeat part (a) except now assume the jar contains 400 marbles – 300 white and
100 black.
(c) Repeat part (a) except now assume the jar contains 4000 marbles – 3000 white and
1000 black.
(d) Explain what you are observing with your results of parts (a), (b), and (c).
Ex. 2.3.3. Consider a room of one hundred people – forty men and sixty women.
(a) If ten people are selected from the room, find the probability that exactly six are
women. Calculate this probability with and without replacement and compare the
decimal approximations of your two results.
(b) If ten people are selected from the room, find the probability that exactly seven are
women. Calculate this probability with and without replacement and compare the
decimal approximations of your two results.
(c) If 100 people are selected from the room, find the probability that exactly sixty are
women. Calculate this probability with and without replacement and compare the
two answers.
(d) If 100 people are selected from the room, find the probability that exactly sixty-one
are women. Calculate this probability with and without replacement and compare
the two answers.
\[ \frac{r}{N} \cdot \frac{r-1}{N-1} \cdots \frac{r-(k-1)}{N-(k-1)}. \]
(b) Prove that $\frac{(N-r)!\,(N-m)!}{(N-k)!\,(N-r-(m-k))!}$ equals
\[ \frac{N-r}{N-k} \cdot \frac{N-r-1}{N-k-1} \cdots \frac{N-r-(m-1-k)}{N-(m-1)}. \]
Ex. 2.3.5. A box contains W white balls and B black balls. A sample of n balls is drawn
at random for some n ≤ min(W , B ). For j = 1, 2, · · · , n, let Aj denote the event that the
ball drawn on the j th draw is white. Let Bk denote the event that the sample of n balls
contains exactly k white balls.
Ex. 2.3.7. Biologists use a technique called “capture-recapture” to estimate the size of the
population of a species that cannot be directly counted. The following exercise illustrates
the role a hypergeometric distribution plays in such an estimate.
Suppose there is a species of unknown population size N . Suppose fifty members of
the species are selected and given an identifying mark. Sometime later a sample of size
twenty is taken from the population and it is found that four of the twenty were previously
marked. The basic idea behind mark-recapture is that since the sample showed
$\frac{4}{20} = 20\%$ marked members, that should also be a good estimate for the
fraction of marked members of the species as a whole. However, for the whole species
that fraction is $\frac{50}{N}$, which provides a population estimate of N ≈ 250.
Looking more deeply at the problem, if the second sample is assumed to be done at
random without replacement and with each member of the population equally likely to be
selected, the resulting number of marked members should follow a HyperGeo(N , 50, 20)
distribution.
Under these assumptions use the formula for the mode calculated in the previous
exercise to determine which values of N would cause a result of four marked members to
be the most likely of the possible outcomes.
Ex. 2.3.8. The geometric distribution was first developed to determine the number
of independent Bernoulli trials needed to observe the first success. When viewing the
hypergeometric distribution as a series of dependent trials, the same question may be
asked. Suppose we have a population of N people for which r have a certain characteristic
and the remaining N − r do not have that characteristic. Suppose an experiment consists
of sampling (without replacement) repeatedly and recording the number of the sample
that first corresponds to selecting someone with the specified characteristic. Answer the
questions below.
(As a consequence, when k is much smaller than r and N , the values of p1 and p are
approximately equal and the probabilities from (b) are closely approximated by a geometric
distribution).
Example 3.1.1. Suppose a coin is flipped three times. Consider the probabilities associated
with the following two questions:
(b) Which will be the first flip (if any) that shows heads?
At this point the answers to these questions should be easy to determine, but the purpose
of this example is to emphasize how functions could be used to answer both questions
within the context of a single sample space. Let S be a listing of all eight possible orderings
of heads and tails on the three flips, so that S = {hhh, hht, hth, htt, thh, tht, tth, ttt}. Now
define two functions on S. Let X be the function that describes the total number of heads
among the three flips and let Y be the function that describes the first flip that produces
heads. Then X and Y are given by the table
ω X (ω ) Y (ω )
hhh 3 1
hht 2 1
hth 2 1
htt 1 1
thh 2 2
tht 1 2
tth 1 3
ttt 0 none
where Y (ttt) is defined as “none” as there is no first time the coin produces heads.
Suppose we want to know the probability that exactly two of the three coins will be
heads. The relevant event is E = {hht, hth, thh}, but in the pre-image notation of function
theory this set may also be described as $X^{-1}(\{2\})$, the elements of S for which X produces
an output of 2. This allows us to describe the probability of the event as:
\[ P(\text{two heads}) = P(X^{-1}(\{2\})) = P(\{hht, hth, thh\}) = \frac{3}{8}. \]
Rather than use the standard pre-image notation, it is more common in probability to
write (X = 2) for the set $X^{-1}(\{2\})$ as this emphasizes that we are considering outcomes
for which the function X equals 2.
Similarly, if we wanted to know the probability that the first result of heads showed
up on the third flip, that is a question that involves the function Y . Using the notation
(Y = 3) in place of $Y^{-1}(\{3\})$ the probability may be calculated as
\[ P(\text{first heads on flip three}) = P(Y = 3) = P(\{tth\}) = \frac{1}{8}. \]
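\[ P(X = 0) = \frac{1}{8}, \quad P(X = 1) = \frac{3}{8}, \quad \text{and} \quad P(X = 3) = \frac{1}{8} \]
and
\[ P(Y = 1) = \frac{1}{2}, \quad P(Y = 2) = \frac{1}{4}, \quad \text{and} \quad P(Y = \text{none}) = \frac{1}{8}, \]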
thus giving a complete description of how X and Y distribute the probabilities onto their
range. For both cases only a single sample space was needed. Two different questions were
approached by defining two different functions on that sample space. ■
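The idea of defining two functions on one sample space can be mirrored in R. The following is only a sketch: the names `flips`, `prob`, `dist_X`, and `dist_Y` are our own illustrative choices, and the coin is assumed fair so that all eight outcomes are equally likely.

```r
# Sample space for three flips: all 8 sequences of h/t, each with probability 1/8.
flips <- expand.grid(f1 = c("h", "t"), f2 = c("h", "t"), f3 = c("h", "t"),
                     stringsAsFactors = FALSE)
prob <- rep(1 / nrow(flips), nrow(flips))

# X = total number of heads among the three flips.
X <- rowSums(flips == "h")

# Y = index of the first flip showing heads, or "none" if no head appears.
first_head <- apply(flips == "h", 1, function(row) match(TRUE, row))
Y <- ifelse(is.na(first_head), "none", as.character(first_head))

# Distribution of each function: add P({omega}) over outcomes sharing a value.
dist_X <- tapply(prob, X, sum)
dist_Y <- tapply(prob, Y, sum)
print(dist_X)
print(dist_Y)
```

Both distributions are computed from the same `flips` table; only the function applied to each outcome changes.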
The following theorem explains how the mechanism of the previous example may be more
generally applied.
\[ Q(B) = P(X^{-1}(B)). \]
Example 3.1.3. A board game has a wheel that is to be spun periodically. The wheel can
stop in one of ten equally likely spots. Four of these spots are red, three are blue, two are
green, and one is black. Let X denote the color of the spot. Determine the distribution of
X.
The function X is defined on a sample space S that consists of the ten spots the
wheel could stop, and it takes values on the set of colors T = {red, blue, green, black}. Its
distribution is a probability Q on the set of colors which can be determined by calculating
the probability of each color.
For instance $Q(\{\text{red}\}) = P(X = \text{red}) = P(X^{-1}(\{\text{red}\})) = \frac{4}{10}$ as four of the ten spots
on the wheel are red and all spots are equally likely. Similarly,
\[ Q(\{\text{blue}\}) = P(X = \text{blue}) = \frac{3}{10} \]
\[ Q(\{\text{green}\}) = P(X = \text{green}) = \frac{2}{10} \]
\[ Q(\{\text{black}\}) = P(X = \text{black}) = \frac{1}{10} \]
Example 3.1.4. For a certain lottery, a three-digit number is randomly selected (from 000
to 999). If a ticket matches the number exactly, it is worth $200. If the ticket matches
exactly two of the three digits, it is worth $20. Otherwise it is worth nothing. Let X be
the value of the ticket. Find the distribution of X.
The function X is defined on S = {000, 001, . . . , 998, 999} - the set of all one thousand
possible three digit numbers. The function takes values on the set {0, 20, 200}, so the
distribution Q is a probability on T = {0, 20, 200}.
First, $Q(\{200\}) = P(X = 200) = \frac{1}{1000}$ as only one of the one thousand three digit
numbers is going to be an exact match.
Next, Q({20}) = P (X = 20), so it must be determined how many of the one thousand
possibilities will have exactly two matches. There are $\binom{3}{2} = 3$ different ways to choose the
two digits that will match. Those digits are determined at that point and the remaining
digit must be one of the nine digits that do not match the third spot, so there are 3 · 9 = 27
three digit numbers that match exactly two digits. So $Q(\{20\}) = P(X = 20) = \frac{27}{1000}$.
\[ f_X(t) = P(X = t) \]
referred to as a “probability mass function”. Then for any event A ⊂ T the quantity
P (X ∈ A) may be computed via
\[ P(X \in A) = \sum_{t \in A} f_X(t) = \sum_{t \in A} P(X = t). \]
The function from Example 3.1.4 is a discrete random variable because it takes on one of
the real values 0, 20, or 200. We calculated its probability mass function when describing
its distribution and it is given by
\[ f_X(0) = \frac{972}{1000}, \quad f_X(20) = \frac{27}{1000}, \quad f_X(200) = \frac{1}{1000}. \]
The function from Example 3.1.3 is not a discrete random variable by the above definition
because its range is a collection of colors, not real numbers.
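The probability mass function of the lottery ticket in Example 3.1.4 can be checked by brute-force enumeration. The sketch below assumes a hypothetical ticket "123" (by symmetry the particular ticket should not matter) and counts matches position by position; the variable names are ours.

```r
# Fix one ticket and list all 1000 equally likely draws from 000 to 999.
ticket <- c(1, 2, 3)  # hypothetical ticket, chosen only for illustration
draws <- expand.grid(d1 = 0:9, d2 = 0:9, d3 = 0:9)

# Number of positions in which each possible draw agrees with the ticket.
matches <- (draws$d1 == ticket[1]) + (draws$d2 == ticket[2]) + (draws$d3 == ticket[3])

# Ticket value as a function of the number of matching positions.
value <- ifelse(matches == 3, 200, ifelse(matches == 2, 20, 0))

# Probability mass function: proportion of draws producing each value.
pmf <- table(value) / nrow(draws)
print(pmf)
```

The entry for the value 0 is obtained here by counting rather than by subtracting the other two probabilities from 1, which gives an independent check on the arithmetic.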
When studying random variables it is often more important to know how they distribute
probability onto their range than how they actually act as functions on their domains. As
such it is useful to have a notation that recognizes the fact that two functions may be very
different in terms of where they map domain elements, but nevertheless have the same
range and produce the same distribution on this range.
There are many distributions which appear frequently enough they deserve their own
special names for easy identification. We shall use the symbol ∼ to mean “is distributed
as” or “is equal in distribution to”. For example, in the definition below X ∼ Bernoulli(p)
should be read as “X has a Bernoulli(p) distribution”. This says nothing explicit about
how X behaves as a function on its domain, but completely describes how X distributes
probability onto its range.
The following are common discrete distributions which we have seen arise previously in
the text.
\[ P(X = k) = p \cdot (1-p)^{k-1} \]
\[ P(X = k) = \binom{k-1}{r-1} p^r \cdot (1-p)^{k-r} \]
for all k ≥ r, then X is a Negative Binomial random variable with parameters (r, p).
Such a random variable arises when determining how many Bernoulli trials must be
attempted before seeing r successes.
\[ P(X = k) = \frac{e^{-\lambda} \lambda^k}{k!} \]
for all k ≥ 0, then X is called a Poisson random variable with parameter λ. We first
used these distributions as approximations to a Binomial (n, p) when n was large
and p was small.
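When evaluating these formulas in R, some care is needed with conventions. R's `dgeom` and `dnbinom` count the number of failures before the relevant success rather than the trial number, so their arguments must be shifted to match the formulas above. The sketch below uses illustrative parameter values of our own choosing.

```r
p <- 0.3; r <- 2; lambda <- 1.5   # illustrative parameter values (assumptions)
k <- 5                            # value at which each pmf is evaluated

# Geometric(p): trial number of the first success.
# dgeom counts failures before the first success, so use k - 1.
geom_formula <- p * (1 - p)^(k - 1)
geom_r <- dgeom(k - 1, prob = p)

# Negative Binomial(r, p): trial number of the r-th success.
# dnbinom counts failures before the r-th success, so use k - r.
nbinom_formula <- choose(k - 1, r - 1) * p^r * (1 - p)^(k - r)
nbinom_r <- dnbinom(k - r, size = r, prob = p)

# Poisson(lambda): no shift is needed.
pois_formula <- exp(-lambda) * lambda^k / factorial(k)
pois_r <- dpois(k, lambda = lambda)

print(c(geom_formula, geom_r))
print(c(nbinom_formula, nbinom_r))
print(c(pois_formula, pois_r))
```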
exercises
Ex. 3.1.1. Consider the experiment of flipping a coin four times and recording the sequence
of heads and tails. Let S be the sample space of all sixteen possible orderings of the results.
Let X be the function on S describing the number of tails among the flips. Let Y be the
function on S describing the first flip (if any) to come up tails.
Ex. 3.1.2. A pair of fair dice are thrown. Let X represent the larger of the two values on
the dice and let Y represent the smaller of the two values.
(a) Describe S, the domain of functions X and Y . How many elements are in S?
(b) What are the ranges of X and Y ? Do X and Y have the same range? Why or why
not?
(c) Describe the distribution of X and describe the distribution of Y by finding the
probability mass function of each. Is it true that X and Y have the same distribution?
Ex. 3.1.3. A pair of fair dice are thrown. Let X represent the number of the first die and
let Y represent the number of the second die.
(a) Describe S, the domain of functions X and Y . How many elements are in S?
(b) Describe T , the range of functions X and Y . How many elements are in T ?
(c) Describe the distribution of X and describe the distribution of Y by finding the
probability mass function of each. Is it true that X and Y have the same distribution?
Ex. 3.1.4. Use the ∼ notation to classify the distributions of the random variables described
by the scenarios below. For instance, if a scenario said, “let X be the number of heads
in three flips of a coin” the appropriate answer would be X ∼ Binomial(3, $\frac{1}{2}$) as that
describes the number of successes in three Bernoulli trials.
(a) Let X be the number of 5’s seen in four die rolls. What is the distribution of X?
(b) Each ticket in a certain lottery has a 20% chance to be a prize-winning ticket. Let Y
be the number of tickets that need to be purchased before seeing the first prize-winner.
What is the distribution of Y ?
(c) A class of ten students is comprised of seven women and three men. Four students
are randomly selected from the class. Let Z denote the number of men among the
four randomly selected students. What is the distribution of Z?
(b) Theorem 3.1.2 does not require that X be real-valued. Why do you suppose that our
definition of “random variable” insisted that such functions should be real-valued?
Ex. 3.1.6. Let X : S → T be a discrete random variable. Suppose $\{B_i\}_{i \ge 1}$ is a sequence of
events in T . Show that
\[ X^{-1}\Big(\bigcup_{i=1}^{\infty} B_i\Big) = \bigcup_{i=1}^{\infty} X^{-1}(B_i) \]
and that if $B_i$ and $B_j$ are disjoint,
then so are $X^{-1}(B_i)$ and $X^{-1}(B_j)$.
Most interesting problems require the consideration of several different random variables
and an analysis of the relationships among them. We have already discussed what it means
for a collection of events to be independent, and it is useful to extend this notion to random
variables as well. As with events, we will first describe the notion of pairwise independence
of two objects, before defining mutual independence of an arbitrary collection of objects.
As events become more complicated and involve multiple random variables, a notational
shorthand will become useful. It is common in probability to write (X ∈ A, Y ∈ B ) for
the event (X ∈ A) ∩ (Y ∈ B ) and we will begin using this convention at this point.
Further, even though the definition of X : S → T and Y : S → U being independent
random variables requires that (X ∈ A) and (Y ∈ B ) be independent for all events A ⊂ T
and B ⊂ U , for discrete random variables it is enough to verify the events (X = t) and
(Y = u) are independent events for all t ∈ T and u ∈ U to conclude they are independent
(See Exercise 3.2.12).
Example 3.2.2. When we originally considered the example of rolling a pair of dice, we
viewed the results as thirty-six equally likely outcomes. However, it is also possible to view
the result of each die as a random variable in its own right, and then consider the possible
results of the pair of random variables. Let X, Y ∼ Uniform({1, 2, 3, 4, 5, 6}) and suppose
X and Y are independent. If x, y ∈ {1, 2, 3, 4, 5, 6} what is P (X = x, Y = y )?
By independence, $P(X = x, Y = y) = P(X = x)P(Y = y) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36}$. Therefore,
the result is identical to the original perspective – each of the thirty-six outcomes of the
pair of dice is equally likely. ■
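The joint probabilities of two independent dice can be tabulated in R by taking an outer product of the two marginal probability mass functions. This is a small sketch with names of our own choosing.

```r
die <- rep(1 / 6, 6)                 # pmf of Uniform({1, ..., 6})

# Under independence, joint[x, y] = P(X = x) * P(Y = y).
joint <- outer(die, die)
dimnames(joint) <- list(X = as.character(1:6), Y = as.character(1:6))
print(joint)
```

Every entry of the resulting 6 by 6 table is 1/36, which agrees with the equally likely outcomes perspective.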
For many problems it is useful to think about repeating a single experiment many times
with the results of each repetition being independent from every other. Though the results
are assumed to be independent, the experiment itself remains the same, so the random
variables produced all have the same distribution. The resulting sequence of random
variables X1 , X2 , X3 , . . . is referred to as “i.i.d.” (standing for “independent and identically
distributed”). When considering such sequences we will sometimes write X1 , X2 , X3 , . . .
are i.i.d. with distribution X, where X is a random variable that shares their common
distribution.
\begin{align*}
P(X_1 > j, X_2 > j, \ldots, X_n > j) &= P(X_1 > j) P(X_2 > j) \cdots P(X_n > j) \\
&= (1-p)^j \cdot (1-p)^j \cdots (1-p)^j \\
&= (1-p)^{nj}.
\end{align*} ■
Consider a problem involving two random variables. Let X be the number of centimeters
of rainfall in a certain forest in a given year, and let Y be the number of square meters
of the forest burned by fires that same year. It seems these variables should be related;
knowing one should affect the probabilities associated with the values of the other. Such
random variables are not independent of each other and we now introduce several ways to
compute probabilities under such circumstances. An important concept toward this end is
the notion of a “conditional distribution” which reflects the fact that the occurrence of an
event may affect the likely values of a random variable.
Q(B ) = P (X ∈ B | A) (3.2.1)
As with any discrete random variable, the distribution is completely determined by the
probabilities associated with each possible value the random variable may assume. This
means the conditional distribution may be considered known provided the values of
P (X = a|A) are known for every a ∈ Range(X ). Though this definition allows for A to be
any sort of event, in this section we will mainly consider examples where A describes the
outcome of some random variable. So a notation like P (X|Y = b) will be the conditional
distribution of the random variable X given that the random variable Y is known to have
the value b.
In many cases random variables are dependent in such a way that the distribution of
one variable is known in terms of the values taken on by another.
Example 3.2.6. Let X ∼ Uniform({1, 2}) and let Y be the number of heads in X tosses
of a fair coin. Clearly X and Y should not be independent. In particular, a result of Y = 0
could occur regardless of the value of X, but a result of Y = 2 guarantees that X = 2 as
two heads could not be observed with just one flip on the coin. Any information regarding
X or Y may influence the distribution of the other, but the description of the variables
makes it clearest how Y depends on X. If X = 1 then Y is the number of heads in one
flip of a fair coin. Letting A be the event (X = 1) and using the terminology of (3.2.1)
from Definition 3.2.5, we can say the conditional distribution of Y given that X = 1 is a
Bernoulli($\frac{1}{2}$). We will use the notation
\[ (Y \mid X = 1) \sim \text{Bernoulli}(\tfrac{1}{2}) \]
to indicate this fact. In other words, this notation means the same thing as the pair of
equations
\[ P(Y = 0 \mid X = 1) = \frac{1}{2} \]
\[ P(Y = 1 \mid X = 1) = \frac{1}{2} \]
If X = 2 then Y is the number of heads in two flips of a fair coin and therefore (Y | X = 2) ∼
Binomial(2, $\frac{1}{2}$) which means the following three equations hold:
\[ P(Y = 0 \mid X = 2) = \frac{1}{4} \]
\[ P(Y = 1 \mid X = 2) = \frac{1}{2} \]
\[ P(Y = 2 \mid X = 2) = \frac{1}{4} \] ■
The conditional probabilities of the previous example were easily determined in part
because the description of Y was already given in terms of X, but frequently random
variables may be dependent in some way that is not so explicitly described. A more
general method of expressing the dependence of two (or more) variables is to present the
probabilities associated with all combinations of possible values for every variable. This is
known as their joint distribution.
Definition 3.2.7. If X and Y are discrete random variables, the “joint distribution”
of X and Y is the probability Q on pairs of values in the ranges of X and Y defined
by
Q((a, b)) = P (X = a, Y = b).
\[ Q((a_1, a_2, \ldots, a_n)) = P(X_1 = a_1, X_2 = a_2, \ldots, X_n = a_n). \]
\[ Q(D) = \sum_{(a_1, a_2, \ldots, a_n) \in D} Q((a_1, a_2, \ldots, a_n)). \]
For a pair of random variables with few possible outcomes, it is common to describe the
joint distribution using a chart for which the columns correspond to possible X values, the
rows to possible Y values, and for which the entries of the chart are probabilities.
Example 3.2.8. Let X and Y be the dependent variables described in Example 3.2.6. The
X variable will be either 1 or 2. The Y variable could be as low as 0 (if no heads are flipped)
or as high as 2 (if two coins are flipped and both show heads). As Range(X ) = {1, 2}
and as Range(Y ) = {0, 1, 2}, the pair (X, Y ) could potentially be any of the six possible
pairings (though, in fact, one of the pairings has probability zero). To find the joint
distribution of X and Y we must calculate the probabilities of each possibility. In this case
the values may be obtained using the definition of conditional probability. For instance,
\[ P(X = 1, Y = 0) = P(Y = 0 \mid X = 1) \cdot P(X = 1) = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4} \]
and
\[ P(X = 1, Y = 2) = P(Y = 2 \mid X = 1) \cdot P(X = 1) = 0 \cdot \frac{1}{2} = 0. \]
The entire joint distribution P (X = a, Y = b) is described by the following chart.
X=1 X=2
Y =0 1/4 1/8
Y =1 1/4 1/4
Y =2 0 1/8 ■
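The chart can be reproduced in R by multiplying each conditional distribution of Y by the probability of the corresponding X value. The following is a sketch under the assumptions of Example 3.2.6; the object names are ours.

```r
x_vals <- 1:2
y_vals <- 0:2
p_x <- c(0.5, 0.5)                    # X ~ Uniform({1, 2})

# Column i holds P(Y = y | X = x_i) * P(X = x_i) for y = 0, 1, 2,
# where (Y | X = x) ~ Binomial(x, 1/2).
joint <- sapply(seq_along(x_vals), function(i) {
  dbinom(y_vals, size = x_vals[i], prob = 0.5) * p_x[i]
})
dimnames(joint) <- list(paste0("Y=", y_vals), paste0("X=", x_vals))
print(joint)
```

Note that `dbinom(2, size = 1, prob = 0.5)` evaluates to zero, which supplies the zero entry for the pair (X = 1, Y = 2) without any special handling.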
Knowing the joint distribution of random variables gives a complete picture of the proba-
bilities associated with those variables. From that information it is possible to compute
all conditional probabilities of one variable from another. For instance, in the example
analyzed above, the variable Y was originally described in terms of how it depended on X.
However, this also means that X should be dependent on Y . The joint distribution may
be used to determine how.
Example 3.2.9. Let X and Y be the variables of Example 3.2.6. How may the conditional
distributions of X given values of Y be determined?
There will be three different conditional distributions depending on whether Y = 0,
Y = 1, or Y = 2. Below we will solve the Y = 0 case. The other two cases will be left as
exercises. The conditional distribution of X given that Y = 0 is determined by the values
of P (X = 1|Y = 0) and P (X = 2|Y = 0) both of which may be computed using Bayes’
rule.
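\begin{align*}
P(X = 1 \mid Y = 0) &= \frac{P(Y = 0 \mid X = 1) \cdot P(X = 1)}{P(Y = 0)} \\
&= \frac{P(Y = 0 \mid X = 1) \cdot P(X = 1)}{P(Y = 0 \mid X = 1) \cdot P(X = 1) + P(Y = 0 \mid X = 2) \cdot P(X = 2)} \\
&= \frac{(1/2)(1/2)}{(1/2)(1/2) + (1/4)(1/2)} = \frac{2}{3}
\end{align*}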
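The Bayes' rule computation for the Y = 0 case may be sketched in R as follows. Only the Y = 0 case is treated, and the vector names are illustrative choices of ours.

```r
p_x <- c("1" = 1/2, "2" = 1/2)            # P(X = x)
p_y0_given_x <- c("1" = 1/2, "2" = 1/4)   # P(Y = 0 | X = x) from Example 3.2.6

# Denominator of Bayes' rule: P(Y = 0) summed over the possible values of X.
p_y0 <- sum(p_y0_given_x * p_x)

# Conditional distribution of X given Y = 0.
posterior <- p_y0_given_x * p_x / p_y0
print(posterior)
```

The two entries of `posterior` are P(X = 1 | Y = 0) and P(X = 2 | Y = 0), and they sum to one as any distribution must.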
Just because X and Y are dependent on each other doesn’t mean they need to be
thought of as a pair. It still makes sense to talk about the distribution of X as a random
variable in its own right while ignoring its dependence on the variable Y . When there are
two or more variables under discussion, the distribution of X alone is sometimes called the
“marginal” distribution of X because it can be computed using the margins of the chart
describing the joint distribution of X and Y .
Example 3.2.10. Continue with X and Y as described in Example 3.2.6. Below is the
chart describing the joint distribution of X and Y that was created in Example 3.2.8, but
with the addition of one column on the right and one row at the bottom. The entries in
the extra column are the sums of the values in the corresponding row; likewise the entries
in the extra row are the sums of the values in the corresponding column.
X=1 X=2 Sum
Y =0 1/4 1/8 3/8
Y =1 1/4 1/4 1/2
Y =2 0 1/8 1/8
Sum 1/2 1/2 1
The values in the right hand margin (column) exactly describe the distribution of Y . For
instance the event (Y = 0) can be partitioned into two disjoint events (X = 1, Y =
0) ∪ (X = 2, Y = 0) each of which is already described in the joint distribution chart.
Adding them together gives the result that $P(Y = 0) = \frac{3}{8}$. In a similar fashion, the
bottom margin (row) describes the distribution of X. This extended chart also makes it
numerically clearer why these two random variables are dependent. For instance,
\[ P(X = 1, Y = 0) = \frac{1}{4} \quad \text{while} \quad P(X = 1) \cdot P(Y = 0) = \frac{3}{16} \]
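Marginal distributions and the comparison above can be computed directly from the joint chart of Example 3.2.8. In this sketch the chart is entered by hand and the object names are our own.

```r
# Joint distribution of Example 3.2.8: rows are Y values, columns are X values.
joint <- matrix(c(1/4, 1/4, 0, 1/8, 1/4, 1/8), nrow = 3,
                dimnames = list(c("Y=0", "Y=1", "Y=2"), c("X=1", "X=2")))

p_y <- rowSums(joint)   # right-hand margin: distribution of Y
p_x <- colSums(joint)   # bottom margin: distribution of X

# Chart extended by one column and one row of sums.
with_margins <- rbind(cbind(joint, Sum = p_y), Sum = c(p_x, sum(joint)))
print(with_margins)

# If X and Y were independent, the joint chart would equal this product chart.
product <- outer(p_y, p_x)
independent <- isTRUE(all.equal(joint, product, check.attributes = FALSE))
print(independent)
```

Comparing `joint` with `product` entry by entry shows where the two charts disagree, for example in the (Y = 0, X = 1) position.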
\[ P(X = x, Y = y) = P(X = x) P(Y = y) \]
Example 3.2.11. Suppose we toss a fair coin until the first head appears. Let X be the
number of tosses performed. We have seen in Example 2.1.2 that X ∼ Geometric($\frac{1}{2}$). Note
that if m is a positive integer,
\[ P(X > m) = \sum_{k=m+1}^{\infty} P(X = k) = \sum_{k=m+1}^{\infty} \frac{1}{2^k} = \frac{1}{2^m}. \]
Now let n be a positive integer and suppose we take the event (X > n) as given. In other
words, we assume we know that none of the first n flips resulted in heads. What is the
conditional distribution of X given this new information? A routine calculation shows
\[ P(X > n + m \mid X > n) = \frac{P(X > n + m)}{P(X > n)} = \frac{1/2^{m+n}}{1/2^{n}} = \frac{1}{2^m}. \]
As a consequence,
P (X > n + m | X > n) = P (X > m). (3.2.2)
Given that a result of heads has not occurred by the n-th flip, the probability that such a
result will require more than m additional flips is identical to the (non-conditional) probability
the result would have required more than m flips from the start. In other words, if we
know that the first n flips have not yet produced a head, the number of additional flips
required to observe the first head still is a Geometric($\frac{1}{2}$) random variable. This is called
the “memoryless property” of the geometric distribution as it can be interpreted to mean
that when waiting times are geometrically distributed, no matter how long we wait for an
event to occur, the future waiting time always looks the same given that the event has not
occurred yet. The result remains true of geometric variables of any parameter p, a fact
which we leave as an exercise. ■
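Equation (3.2.2) can be checked numerically in R for the fair coin. The sketch below uses illustrative values of n and m chosen by us. Recall that R's `pgeom` describes the number of failures before the first success, so P(X > m) corresponds to more than m − 1 failures.

```r
p <- 1 / 2

# P(X > m), where X is the trial number of the first head.
# pgeom counts failures before the first success, hence the shift by one.
tail_prob <- function(m) pgeom(m - 1, prob = p, lower.tail = FALSE)

n <- 3; m <- 4   # illustrative values (assumptions)

conditional <- tail_prob(n + m) / tail_prob(n)   # P(X > n + m | X > n)
unconditional <- tail_prob(m)                    # P(X > m)
print(c(conditional, unconditional))
```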
Consider a situation similar to that of Bernoulli trials, but instead of results of each attempt
limited to success or failure, suppose there are many different possible results for each
trial. As with the Bernoulli trial cases we assume that the trials are mutually independent,
but identically distributed. In the next example we will show how to calculate the joint
distribution for the random variables representing the number of times each outcome
occurs.
Example 3.2.12. Suppose we perform n i.i.d. trials each of which has k different possible
outcomes. For j = 1, 2, . . . , k, let pj represent the probability any given trial results in
the j-th outcome and let Xj represent the number of the n trials that result in the j-th
outcome. The joint distribution of all of the random variables X1 , X2 , . . . , Xk is called a
“multinomial distribution”.
Let B (x1 , x2 , . . . , xk ) = {X1 = x1 , X2 = x2 , . . . , Xk = xk }. Then,
\[ P(B(x_1, x_2, \ldots, x_k)) = \sum_{\omega \in B(x_1, x_2, \ldots, x_k)} P(\{\omega\}) \]
\[ P(\{\omega\}) = \prod_{j=1}^{k} p_j^{x_j} \]
\[ |B(x_1, x_2, \ldots, x_k)| = \frac{n!}{x_1!\, x_2! \cdots x_k!} \]
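The three displayed facts suggest that P(B(x_1, ..., x_k)) is the number of outcomes in B times the common probability of each such outcome. The R sketch below checks this on a small case of our own choosing (n = 4 trials, k = 3 outcomes, with assumed probabilities 0.5, 0.3, and 0.2) by enumerating the sample space and by comparing with R's built-in `dmultinom`.

```r
n <- 4
p <- c(0.5, 0.3, 0.2)   # illustrative outcome probabilities (assumption)
x <- c(2, 1, 1)         # target counts for outcomes 1, 2, 3

# Enumerate all 3^4 sequences of trial results and pick out those in B(x).
outcomes <- expand.grid(rep(list(1:3), n))
in_B <- apply(outcomes, 1, function(w) all(tabulate(w, nbins = 3) == x))

count_enumerated <- sum(in_B)                         # |B(x)| by counting
count_formula <- factorial(n) / prod(factorial(x))    # n! / (x1! x2! x3!)

single <- prod(p^x)                  # probability of any one sequence in B(x)
by_formula <- count_formula * single
built_in <- dmultinom(x, size = n, prob = p)
print(c(count_enumerated, count_formula))
print(c(by_formula, built_in))
```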
exercises
Ex. 3.2.1. An urn has four balls labeled 1, 2, 3, and 4. A first ball is drawn and its number
is denoted by X. A second ball is then drawn from the three remaining balls in the urn
and its number is denoted by Y .
Ex. 3.2.2. Two dice are rolled. Let X denote the sum of the dice and let Y denote the
value of the first die.
Ex. 3.2.4. Let X and Y be random variables with joint distribution given by the chart
below.
X=0 X=1 X=2
Y =0 1/12 0 3/12
Y =1 2/12 1/12 0
Y =2 3/12 1/12 1/12
(d) Carry out a computation to show that X and Y are not independent.
Ex. 3.2.5. Let X be a random variable with range {0, 1} and distribution
\[ P(X = 0) = \frac{1}{3} \quad \text{and} \quad P(X = 1) = \frac{2}{3} \]
\[ P(Y = 0) = \frac{1}{5}, \quad P(Y = 1) = \frac{1}{5}, \quad \text{and} \quad P(Y = 2) = \frac{3}{5} \]
Suppose that X and Y are independent. Create a chart describing the joint distribution of
X and Y .
Ex. 3.2.6. Consider six independent trials each of which is equally likely to produce
a result of 1, 2, or 3. Let Xj denote the number of trials that result in j. Calculate
P (X1 = 1, X2 = 2, X3 = 3).
Ex. 3.2.7. Prove the combinatorial fact from Example 3.2.12 in the following way. Let
An (x1 , x2 , . . . , xk ) denote the number of ways of putting n balls into k boxes in such a way
that exactly xj balls wind up in box j for j = 1, 2, . . . , k.
(b) Use part (a) and induction to prove that $A_n(x_1, x_2, \ldots, x_k) = \frac{n!}{x_1!\, x_2! \cdots x_k!}$.
Ex. 3.2.8. Let X be the result of a fair die roll and let Y be the number of heads in X
coin flips.
(a) Both X and (Y |X = n) can be written in terms of common distributions using the
∼ notation. What is the distribution of X? What is the distribution of (Y |X = n)
for n = 1, . . . 6?
Ex. 3.2.9. Suppose the number of earthquakes that occur in a year, anywhere in the
world, is a Poisson random variable with mean λ. Suppose the probability that any given
earthquake has magnitude at least 5 on the Richter scale is p independent of all other quakes.
Let N ∼ Poisson(λ) be the number of earthquakes in a year and let M be the number of
earthquakes in a year with magnitude at least 5, so that (M |N = n) ∼ Binomial(n, p).
for m > 0.
(c) Perform a change of variables (where k = n − m) in the infinite series from part (b)
to prove
\[ P(M = m) = \frac{1}{m!} e^{-\lambda} (\lambda p)^m \sum_{k=0}^{\infty} \frac{(\lambda(1-p))^k}{k!} \]
(d) Use part (c) together with the infinite series equality $e^x = \sum_{k=0}^{\infty} \frac{x^k}{k!}$ to conclude that
M ∼ Poisson(λp).
Ex. 3.2.10. Let X be a discrete random variable which has N = {1, 2, 3, . . . } as its
range. Suppose that for all positive integers m and n, X has the memoryless property –
P (X > n + m | X > n) = P (X > m). Prove that X must be a geometric random variable.
[Hint: Define p = P (X = 1) and use the memoryless property to calculate P (X = n)
inductively].
Ex. 3.2.11. A discrete random variable X is called “constant” if there is a single value c
for which P (X = c) = 1.
(b) Prove that if X is a discrete random variable which is independent of itself, then X
must be constant. [Hint: It may help to look at Exercise 1.4.8].
P (X = t, Y = u) = P (X = t)P (Y = u)
There are many circumstances where we want to apply functions to random
variables. For a simple geometric example, suppose a rectangle
is selected in such a way that its width X and its length Y are both random variables
with known joint distribution. The area of the rectangle is A = XY , and as X and Y
are random, A should be random as well. How may the distribution of A be calculated
from the joint distribution of X and Y ? In general, if a new random variable Z depends
on random variables X1 , X2 , . . . , Xn which have a given joint distribution, how may the
distribution of Z be calculated from what is already known? In this section we discuss the
answers to such questions and also address related issues surrounding independence.
If X : S → T is a random variable and if f : T → R is a function, then the quantity
f (X ) makes sense as a composition of functions f ◦ X : S → R. In fact, as f (X ) is defined
on the sample space S, this new composition is itself a random variable.
The same reasoning holds for functions of more than one variable. If X1 , X2 , . . . , Xn
are random variables then f (X1 , X2 , . . . , Xn ) is a random variable provided f is defined for
the values the Xj variables produce. Below we illustrate how to calculate the distribution
of f (X1 , X2 , . . . , Xn ) in terms of the joint distribution of the Xj input variables. We
demonstrate the method with several examples followed by a general theorem.
Example 3.3.1. Let X ∼ Uniform({−2, −1, 0, 1, 2}) and let $f(x) = x^2$. Determine the
range and distribution of f (X ).
As $f(X) = X^2$, the values that f (X ) produces are the squares of the values that X
produces. Squaring the values in {−2, −1, 0, 1, 2} shows the range of f (X ) is {0, 1, 4}.
The probabilities that f (X ) takes on each of these three values determine the distribution
of f (X ) and these probabilities can be easily calculated from the known probabilities
associated with X.
\[ P(f(X) = 0) = P(X = 0) = \frac{1}{5} \]
\[ P(f(X) = 1) = P((X = 1) \cup (X = -1)) = \frac{1}{5} + \frac{1}{5} = \frac{2}{5} \]
\[ P(f(X) = 4) = P((X = 2) \cup (X = -2)) = \frac{1}{5} + \frac{1}{5} = \frac{2}{5} \] ■
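The same bookkeeping can be sketched in R: apply f to each value of X and add the probabilities of inputs that share an output. The names below are illustrative.

```r
x <- -2:2
p_x <- rep(1 / 5, 5)        # X ~ Uniform({-2, -1, 0, 1, 2})
f <- function(x) x^2

# Group the probabilities of X by the value of f(X) and add within each group.
dist_fX <- tapply(p_x, f(x), sum)
print(dist_fX)
```

The names of `dist_fX` give the range of f(X) and its entries give the corresponding probabilities.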
A complication with this method is that there may be many different inputs that produce
the same output. Sometimes a problem requires careful consideration of all ways that a
given output may be produced. For instance,
Example 3.3.2. What is the probability the sum of three dice will equal six? Let X, Y ,
and Z be the results of the first, second, and third die respectively. These are i.i.d. random
variables each distributed as Uniform({1, 2, 3, 4, 5, 6}). A sum of six can be arrived at in
three distinct ways:
\[ P(X = 2, Y = 2, Z = 2) = P(X = 2) \cdot P(Y = 2) \cdot P(Z = 2) = \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{216} \]
The other cases involve a similar computation, but are complicated by the consideration of
which number shows up on which die. For instance, both events (X = 1, Y = 2, Z = 3)
and (X = 3, Y = 2, Z = 1) are included as part of Case II as are four other permutations
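of the numbers. Likewise Case III includes three permutations, one of which is (X =
4, Y = 1, Z = 1). Putting all three cases together,
\[ P(X + Y + Z = 6) = \frac{1}{216} + \frac{6}{216} + \frac{3}{216} = \frac{10}{216} \approx 0.0463. \]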
So there is slightly less than a 5% chance three rolled dice will produce a sum of six. ■
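A case analysis like this one is easy to get wrong, so it can be worth confirming the count by listing all 216 equally likely outcomes. A minimal R sketch, with names of our own choosing:

```r
# All 6^3 = 216 equally likely results for the three dice.
rolls <- expand.grid(X = 1:6, Y = 1:6, Z = 1:6)

# Apply the function f(x, y, z) = x + y + z to every outcome.
total <- rowSums(rolls)

favorable <- sum(total == 6)            # outcomes whose sum is six
prob_six <- favorable / nrow(rolls)
print(c(favorable, prob_six))
```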
This method may also be used to show relationships among the common (named) distribu-
tions that have been previously described, as in the next two examples.
This result should not be surprising given how Bernoulli and Binomial distributions
arose in the first place. Each of X and Y produces a value of 0 if the corresponding
Bernoulli trial was a failure and 1 if the trial was a success. Therefore Z = X + Y equals
the total number of successes in two independent Bernoulli trials, which is exactly what
led us to the Binomial distribution in the first place. However, it is instructive to consider
how this problem relates to the current topic of discussion.
As each of X and Y is either 0 or 1 the possible values of Z are in the set {0,1,2}. A
result of Z = 0 can only occur if both X and Y are zero. So,
\begin{align*}
P(Z = 0) &= P(X = 0, Y = 0) \\
&= P(X = 0) \cdot P(Y = 0) \\
&= (1 - p)(1 - p) \\
&= (1 - p)^2.
\end{align*}
Similarly, $P(Z = 2) = P(X = 1) \cdot P(Y = 1) = p^2$.
There are two different ways that Z could equal 1, either X = 1 and Y = 0, or X = 0
and Y = 1. So,
P (Z = 1) = P ((X = 1, Y = 0) ∪ (X = 0, Y = 1))
= P (X = 1, Y = 0) + P (X = 0, Y = 1)
= p(1 − p) + (1 − p)p
= 2p(1 − p)
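The three probabilities for Z can be compared with the Binomial(2, p) probability mass function in R. The sketch below uses an illustrative value p = 0.3, which is our assumption, and builds the distribution of Z = X + Y from the joint distribution of the two independent Bernoulli variables.

```r
p <- 0.3                       # illustrative success probability (assumption)
bern <- c(1 - p, p)            # P(X = 0), P(X = 1); the same pmf is used for Y

# Joint pmf under independence: entry [i, j] is P(X = i - 1) * P(Y = j - 1).
joint <- outer(bern, bern)

# Value of Z = X + Y at each (x, y) pair, in the same layout as joint.
z <- outer(0:1, 0:1, "+")

# Add the joint probabilities of all pairs giving the same sum.
dist_Z <- tapply(as.vector(joint), as.vector(z), sum)
print(dist_Z)
print(dbinom(0:2, size = 2, prob = p))
```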
Two of the previous three examples involve adding random variables together. In fact,
addition is one of the most common examples of applying functions to random quantities.
In the previous situations, calculating the distribution of the sum was relatively simple
because the component variables only had finitely many outcomes. But now suppose X
and Y are random variables taking values in {0, 1, 2, . . . } and suppose Z = X + Y . How
could P (Z = n) be calculated?
As both X and Y are non-negative and as Z = X + Y , the value of Z must be at
least as large as either X or Y individually. If Z = n, then X could take on any value
j ∈ {0, 1, . . . , n}, but once that value is determined, the value of Y is compelled to be n − j to
give the appropriate sum. In other words, the event (Z = n) partitions into the following
union.
\[ (Z = n) = \bigcup_{j=0}^{n} (X = j, Y = n - j). \]
\[
P(X = x, Y = y) = P(X = x) \cdot P(Y = y)
= e^{-\lambda_1} \frac{\lambda_1^{x}}{x!} \cdot e^{-\lambda_2} \frac{\lambda_2^{y}}{y!}.
\]
(a) As computed above, the distribution of Z is given by the convolution. For any
n = 0, 1, 2, . . . we have
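\[
\begin{aligned}
P(Z = n) &= P(X + Y = n) \\
&= \sum_{j=0}^{n} P(X = j) \cdot P(Y = n - j) \\
&= \sum_{j=0}^{n} e^{-\lambda_1} \frac{\lambda_1^{j}}{j!} \cdot e^{-\lambda_2} \frac{\lambda_2^{n-j}}{(n-j)!} \\
&= e^{-(\lambda_1+\lambda_2)} \sum_{j=0}^{n} \frac{\lambda_1^{j} \lambda_2^{n-j}}{j!\,(n-j)!} \\
&= e^{-(\lambda_1+\lambda_2)} \frac{1}{n!} \sum_{j=0}^{n} \frac{n!}{j!\,(n-j)!} \lambda_1^{j} \lambda_2^{n-j} \\
&= e^{-(\lambda_1+\lambda_2)} \frac{(\lambda_1+\lambda_2)^{n}}{n!}
\end{aligned}
\]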
where in the last line we have used the binomial expansion (2.1.1). Hence we can conclude
that Z ∼ Poisson (λ1 + λ2 ).
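As a numerical sanity check on this computation, the convolution sum can be evaluated directly and compared with the Poisson(λ1 + λ2) probabilities. This is only a sketch: the function names and the rates 1.5 and 2.5 are assumptions chosen for illustration.

```python
from math import exp, factorial, isclose

def poisson_pmf(lam, n):
    """P(N = n) for N ~ Poisson(lam)."""
    return exp(-lam) * lam**n / factorial(n)

def convolution(pmf_x, pmf_y, n):
    """P(X + Y = n) for independent X, Y taking values in {0, 1, 2, ...}."""
    return sum(pmf_x(j) * pmf_y(n - j) for j in range(n + 1))

lam1, lam2 = 1.5, 2.5  # illustrative rates

def pmf_x(j):
    return poisson_pmf(lam1, j)

def pmf_y(j):
    return poisson_pmf(lam2, j)

for n in range(6):
    lhs = convolution(pmf_x, pmf_y, n)       # sum over j of P(X = j) P(Y = n - j)
    rhs = poisson_pmf(lam1 + lam2, n)        # claimed closed form
    print(n, round(lhs, 6), round(rhs, 6))
```

The two printed columns should agree up to floating-point rounding for every n.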
The above calculation is easily extended by an induction argument to obtain the following
fact: if $\lambda_i > 0$ and $X_i$, $1 \le i \le k$, are independent Poisson($\lambda_i$) random
variables, then $Z = \sum_{i=1}^{k} X_i$ has a Poisson($\sum_{i=1}^{k} \lambda_i$) distribution. In
particular, if we have k independent Poisson(λ) random variables, then $\sum_{i=1}^{k} X_i$ has a
Poisson(kλ) distribution.
(b) We readily observe that X and Z are dependent. We shall now try to understand
the conditional distribution of (X|Z = n). As the ranges of X and Y do not have any
negative numbers, given that Z = X + Y = n, X can only take values in {0, 1, 2, 3, . . . , n}.
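\[
\begin{aligned}
P(X = k \mid Z = n) &= \frac{P(X = k, X + Y = n)}{P(X + Y = n)} = \frac{P(X = k, Y = n - k)}{P(X + Y = n)} \\
&= \frac{P(X = k) P(Y = n - k)}{P(X + Y = n)} \\
&= \frac{e^{-\lambda_1} \frac{\lambda_1^{k}}{k!} \cdot e^{-\lambda_2} \frac{\lambda_2^{n-k}}{(n-k)!}}{e^{-(\lambda_1+\lambda_2)} \frac{(\lambda_1+\lambda_2)^{n}}{n!}}
= \frac{n!}{k!\,(n-k)!} \cdot \frac{\lambda_1^{k} \lambda_2^{n-k}}{(\lambda_1+\lambda_2)^{n}} \\
&= \binom{n}{k} \left(\frac{\lambda_1}{\lambda_1+\lambda_2}\right)^{k} \left(\frac{\lambda_2}{\lambda_1+\lambda_2}\right)^{n-k}.
\end{aligned}
\]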
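The final expression is a Binomial(n, λ1/(λ1 + λ2)) probability. A short numerical sketch of that identification follows; the function names and the parameter values are assumptions made for illustration.

```python
from math import comb, exp, factorial, isclose

def poisson_pmf(lam, n):
    """P(N = n) for N ~ Poisson(lam)."""
    return exp(-lam) * lam**n / factorial(n)

def conditional_pmf(k, n, lam1, lam2):
    """P(X = k | X + Y = n), using Z = X + Y ~ Poisson(lam1 + lam2)."""
    joint = poisson_pmf(lam1, k) * poisson_pmf(lam2, n - k)
    return joint / poisson_pmf(lam1 + lam2, n)

def binomial_pmf(n, q, k):
    """P(B = k) for B ~ Binomial(n, q)."""
    return comb(n, k) * q**k * (1 - q)**(n - k)

lam1, lam2, n = 2.0, 3.0, 7  # illustrative values
q = lam1 / (lam1 + lam2)
for k in range(n + 1):
    print(k, round(conditional_pmf(k, n, lam1, lam2), 6), round(binomial_pmf(n, q, k), 6))
```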
The point of the examples above is that a probability associated with a functional value
f (X1 , X2 , . . . , Xn ) may be calculated directly from the probabilities associated with the
input variables X1 , X2 , . . . , Xn . The following theorem explains how this may be accom-
plished generally for any number of variables.
This is because the expression $f(X_1(s), X_2(s), \dots, X_n(s)) \in B$ is what defines s to be an
outcome in the event $(f(X_1, X_2, \dots, X_n) \in B)$. Likewise, the expression
\[
(X_1(s), X_2(s), \dots, X_n(s)) \in f^{-1}(B)
\]
defines s to be in the event $((X_1, X_2, \dots, X_n) \in f^{-1}(B))$. As these events are equal, they
have the same probability. ■
If X and Y are independent random variables, does that guarantee that functions f (X )
and g (Y ) of these random variables are also independent? If we take the intuitive view of
independence as saying “knowing information about X does not affect the probabilities
associated with Y ” then it seems the answer should be “yes”. After all, X determines
the value of f (X ) and Y determines the value of g (Y ). So information about f (X )
should translate to information about X and information about g (Y ) should translate to
information about Y . Therefore if information about f (X ) affected probabilities associated
with g (Y ), then it seems there should be information about X that would affect the
probability associated with Y . We generalize this argument and make it more rigorous in
the following result.
Informally this theorem says that random quantities produced from independent inputs
will, themselves, be independent.
\[
\begin{aligned}
P(Y_1 \in B_1, \dots, Y_n \in B_n)
&= P(f_1(X_{1,1}, \dots, X_{m_1,1}) \in B_1, \dots, f_n(X_{1,n}, \dots, X_{m_n,n}) \in B_n) \\
&= P((X_{1,1}, \dots, X_{m_1,1}) \in f_1^{-1}(B_1), \dots, (X_{1,n}, \dots, X_{m_n,n}) \in f_n^{-1}(B_n)) \\
&= \prod_{i=1}^{n} P((X_{1,i}, \dots, X_{m_i,i}) \in f_i^{-1}(B_i)) \\
&= \prod_{i=1}^{n} P(f_i(X_{1,i}, \dots, X_{m_i,i}) \in B_i) \\
&= P(Y_1 \in B_1) \cdots P(Y_n \in B_n).
\end{aligned}
\]
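To see the factorization in a concrete case, the sketch below takes two independent fair dice X and Y and checks that f(X) and g(Y) have a joint distribution equal to the product of their marginals. The dice and the functions `parity` and `is_large` are hypothetical choices for illustration, not examples from the text.

```python
from fractions import Fraction
from itertools import product

# Two independent fair dice (an illustrative assumption).
die = {k: Fraction(1, 6) for k in range(1, 7)}

def parity(x):
    """Hypothetical f: 1 if the roll is odd, 0 if even."""
    return x % 2

def is_large(y):
    """Hypothetical g: True if the roll is 5 or 6."""
    return y >= 5

joint, marginal_f, marginal_g = {}, {}, {}
for (x, px), (y, py) in product(die.items(), die.items()):
    a, b = parity(x), is_large(y)
    prob = px * py  # independence of X and Y
    joint[(a, b)] = joint.get((a, b), 0) + prob
    marginal_f[a] = marginal_f.get(a, 0) + prob
    marginal_g[b] = marginal_g.get(b, 0) + prob

factorizes = all(joint[(a, b)] == marginal_f[a] * marginal_g[b]
                 for a in marginal_f for b in marginal_g)
print(factorizes)  # True
```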
exercises
Ex. 3.3.1. Let X ∼ Uniform({1, 2, 3}) and Y ∼ Uniform({1, 2, 3}) be independent and let
Z = X +Y.
Ex. 3.3.2. Consider the experiment of rolling three dice and calculating the sum of the
rolls. Answer the following questions.
Ex. 3.3.4. Let X ∼ Binomial(n, p) and Y ∼ Binomial(m, p). Assume X and Y are
independent and let Z = X + Y . Prove that Z ∼ Binomial(m + n, p).
Ex. 3.3.5. Let X ∼ Negative Binomial(r, p) and Y ∼ Negative Binomial(s, p). Assume X
and Y are independent and let Z = X + Y . Prove that Z ∼ Negative Binomial(r + s, p).
Ex. 3.3.6. Consider one flip of a single fair coin. Let X denote the number of heads on the
flip and let Y denote the number of tails on the flip.
(c) As (b) clearly says that Z cannot be a Binomial(2, 1/2), explain why this result does
not conflict with the conclusion of Example 3.3.3.
(c) Recall from the discussion of Geometric distributions that (X = 1) is the most likely
result for X and (Y = 1) is the most likely result for Y . This does not imply that
(Z = 2) is the most likely outcome for Z. Determine the values of p for which
P (Z = 3) is larger than P (Z = 2).
(b) Use the chart from (a) to explain why Y and Z are independent.
(c) Explain how you could use Theorem 3.3.6 to reach the conclusion that Y and Z are
independent without calculating their joint distribution.
(b) Use the chart from (a) to explain why Y and Z are not independent.
(c) Explain why the conclusion from (b) is not inconsistent with Theorem 3.3.6.
Ex. 3.3.13. Let X1 , X2 , . . . , Xn be an i.i.d. sequence of discrete random variables and let
Z be the maximum of these n variables. Let r be a real number and let R = P (X1 ≤ r ).
Prove that $P(Z \le r) = R^n$.
Ex. 3.3.14. Let X1 , X2 , . . . , Xn be an i.i.d. sequence of discrete random variables and let
Z be the minimum of these n variables. Let r be a real number and let R = P (X1 ≤ r ).
Prove that $P(Z \le r) = 1 - (1 - R)^n$.
Ex. 3.3.15. Let X ∼ Geometric(p) and let Y ∼ Geometric(q ) be independent random
variables. Let Z be the smaller of X and Y . It is a fact that Z is also geometrically
distributed. This problem asks you to prove this fact using two different methods.
METHOD I:
(a) Explain why the event (Z = n) can be written as the disjoint union
(Z = n) = (X = n, Y = n) ∪ (X = n, Y > n) ∪ (X > n, Y = n)
(b) Recall from the proof of the memoryless property of geometric random variables that
$P(X > m) = (1 - p)^m$. Use this fact and part (a) to prove that
(c) Use (b) to conclude that Z ∼ Geometric(r ) for some quantity r and calculate the
value of r in terms of p and q.
METHOD II: Recall that geometric random variables first arose from noting the time it
takes for a sequence of Bernoulli trials to first produce a success. With that in mind, let
A1 , A2 , . . . be Bernoulli(p) random variables and let B1 , B2 , . . . be Bernoulli(q ) random
variables. Further assume the Aj and Bk variables collectively are mutually independent.
The variable X may be viewed as the number of the first Aj that produces a result of 1
and the variable Y may be viewed similarly for the Bk sequence.
(b) Explain why the sequence C1 , C2 , . . . are mutually independent random variables.
(c) Let Z be the random variable that equals the number of the first Cj that results in a
1 and explain why Z is the smaller of X and Y .
(d) Use (c) to conclude that Z ∼ Geometric(r ) for the value of r calculated in part (a).
Ex. 3.3.16. Each day during the hatching season along the Odisha and Northern Tamil
Nadu coast line a Poisson (λ) number of turtle eggs hatch giving birth to young turtles. As
these turtles swim into the sea the probability that they will survive each day is p. Assume
that the number of hatchings on each day and the life of the turtles born are all independent.
Let X1 = 0 and for i ≥ 2, Xi be the total number of turtles alive at sea on the ith morning
of the hatching season before the hatchings on the i-th day. Find the distribution of Xn .
\[
\frac{1 + 2 + 3 + 4 + 5 + 6}{6}
= 1\left(\tfrac{1}{6}\right) + 2\left(\tfrac{1}{6}\right) + 3\left(\tfrac{1}{6}\right)
+ 4\left(\tfrac{1}{6}\right) + 5\left(\tfrac{1}{6}\right) + 6\left(\tfrac{1}{6}\right).
\]
From the perspective of the right hand side of the equation, the results of all outcomes
are added together after being weighted, each according to its probability. In the case of a
die, all six outcomes have probability 1/6.
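The "weighted sum" view translates directly into a few lines of code. The sketch below is an illustration only; the helper name `expected_value` and the dictionary representation of a distribution are assumptions, not notation from the text.

```python
from fractions import Fraction

def expected_value(pmf):
    """Weighted sum of the values, each weighted by its probability."""
    return sum(value * prob for value, prob in pmf.items())

# A fair die: each of the six outcomes has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

print(expected_value(die))         # 7/2
print(float(expected_value(die)))  # 3.5
```

The same helper works for any distribution with finitely many values, provided the probabilities sum to 1.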
provided that the sum converges absolutely. In this case we say that X has “finite
expectation”. If the sum diverges to ±∞ we say the random variable has infinite
expectation. If the sum diverges, but not to infinity, we say the expected value is
undefined.
Example 4.1.2. In the previous chapter, Example 3.1.4 described a lottery for which a
ticket could be worth nothing, or it could be worth either $20 or $200. What is the average
value of such a ticket?
\[
E[X] = 200\left(\frac{1}{1000}\right) + 20\left(\frac{27}{1000}\right) + 0\left(\frac{972}{1000}\right) = 0.74,
\]
Proof - By definition E [c] is a sum over all possible values of c, but in this case that is just
a single value, so E [c] = c · P (c = c) = c · 1 = c. ■
When the range of X is finite, E [X ] always exists since it is a finite sum. When the
range of X is infinite there is a possibility that the infinite series will not be absolutely
convergent and therefore that E [X ] will be infinite or undefined. In fact, when proving
theorems about how expected values behave, most of the complications arise from the fact
that one must know that an infinite sum converges absolutely in order to rearrange terms
within that sum with equality. The next examples explore ways in which expected values
may misbehave.
n=1 n=1
2n
which diverges to infinity, so this random variable has an infinite expected value. ■
Example 4.1.5. Suppose X is a random variable taking values in the range T =
{−2, 4, −8, 16, . . . } such that $P(X = (-2)^n) = \frac{1}{2^n}$ for all integers n ≥ 1.
\[
\sum_{n=1}^{\infty} (-2)^n \cdot P(X = (-2)^n) = \sum_{n=1}^{\infty} (-2)^n \cdot \frac{1}{2^n} = \sum_{n=1}^{\infty} (-1)^n .
\]
This infinite sum diverges (not to ±∞), so the expected value of this random variable is
undefined. ■
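The oscillation behind this conclusion is easy to see in the partial sums. The following sketch (variable names are illustrative assumptions) computes the first eight partial sums of the series with terms (−2)^n · (1/2^n).

```python
from fractions import Fraction
from itertools import accumulate

# Terms t * P(X = t) for t = (-2)^n and P(X = (-2)^n) = 1/2^n, n = 1, ..., 8.
terms = [Fraction((-2) ** n, 2 ** n) for n in range(1, 9)]
partial_sums = list(accumulate(terms))

print([int(s) for s in partial_sums])  # [-1, 0, -1, 0, -1, 0, -1, 0]
```

The partial sums alternate between −1 and 0 forever, so they neither settle on a limit nor grow without bound.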
The examples above were specifically constructed to produce series which clearly
diverged, but in general it can be complicated to check whether an infinite sum is absolutely
convergent or not. The next technical lemma provides a condition that is often simpler
to check. The convenience of this lemma is that, since |X| is always non-negative, the terms
of the series for E [|X|] may be freely rearranged without changing the value of (or the
convergence of) the sum.
\[
E[X] = \sum_{t \in T} t \cdot P(X = t).
\]
To more easily relate these two sums, define $\hat{T} = \{t : |t| \in U\}$. Since every u ∈ U came
from some t ∈ T, the new set $\hat{T}$ contains every element of T. For every $t \in \hat{T}$ for which
$t \notin T$, the element is outside of the range of X and so P(X = t) = 0 for such elements.
Because of this E[X] may be written as
\[
E[X] = \sum_{t \in \hat{T}} t \cdot P(X = t).
\]
Note that for each u ∈ U , the event (|X| = u) is equal to (X = u) ∪ (X = −u) where
each of u and −u is an element of T̂ . Therefore,
\[
\begin{aligned}
u \cdot P(|X| = u) &= u \cdot (P(X = u) + P(X = -u)) \\
&= u \cdot P(X = u) + u \cdot P(X = -u) \\
&= |u| \cdot P(X = u) + |-u| \cdot P(X = -u).
\end{aligned}
\]
Therefore the series describing E [X ] is absolutely convergent exactly when E [|X|] < ∞. ■
We will eventually wish to calculate the expected values of functions of multiple random
variables. Of particular interest to statistics is an understanding of expected values of sums
and averages of i.i.d. sequences. That understanding will be made easier by first learning
something about how expected values behave for simple combinations of variables.
Theorem 4.1.7. Suppose that X and Y are discrete random variables, both with
finite expected value and both defined on the same sample space S. If a and b are
real numbers then
(1) E [aX ] = aE [X ];
(2) E [X + Y ] = E [X ] + E [Y ]; and
(3) E [aX + bY ] = aE [X ] + bE [Y ].
(4) If X ≥ 0 then E [X ] ≥ 0.
Proof of (1) - If a = 0 then both sides of the equation are zero, so assume a ̸= 0. We know
that X is a function from S to some range U . So aX is also a random variable and its
range is T = {au : u ∈ U }.
By definition $E[aX] = \sum_{t \in T} t \cdot P(aX = t)$, but because of how T is defined, adding values
indexed by t ∈ T is equivalent to adding values indexed by u ∈ U where t = au. In other
words
\[
\begin{aligned}
E[aX] &= \sum_{t \in T} t \cdot P(aX = t) \\
&= \sum_{u \in U} au \cdot P(aX = au) \\
&= a \cdot \sum_{u \in U} u \cdot P(X = u) \\
&= aE[X].
\end{aligned}
\]
Proof of (2) - We are assuming that X and Y have the same domain, but they typically
have different ranges. Suppose X : S → U and Y : S → V . Then the random variable
X + Y is also defined on S and takes values in T = {u + v : u ∈ U , v ∈ V }. Therefore,
adding values indexed by t ∈ T is equivalent to adding values indexed by u and v as they
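range over U and V respectively. So,
\[
\begin{aligned}
E[X + Y] &= \sum_{t \in T} t \cdot P(X + Y = t) \\
&= \sum_{u \in U,\, v \in V} (u + v) \cdot P(X = u, Y = v) \\
&= \sum_{u \in U} \sum_{v \in V} (u + v) \cdot P(X = u, Y = v) \\
&= \sum_{u \in U} \sum_{v \in V} u \cdot P(X = u, Y = v) + \sum_{u \in U} \sum_{v \in V} v \cdot P(X = u, Y = v) \\
&= \sum_{u \in U} \sum_{v \in V} u \cdot P(X = u, Y = v) + \sum_{v \in V} \sum_{u \in U} v \cdot P(X = u, Y = v)
\end{aligned}
\]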
where the rearrangement of summation is legitimate since the series converges absolutely.
Notice that as u ranges over all of U the sets (X = u, Y = v ) partition the set (Y = v )
into disjoint pieces based on the value of X. Likewise the event (X = u) is partitioned by
(X = u, Y = v) as v ranges over all values of v ∈ V. Therefore, as a disjoint union,
\[
(Y = v) = \bigcup_{u \in U} (X = u, Y = v) \quad \text{and} \quad (X = u) = \bigcup_{v \in V} (X = u, Y = v),
\]
and so
\[
P(Y = v) = \sum_{u \in U} P(X = u, Y = v) \quad \text{and} \quad P(X = u) = \sum_{v \in V} P(X = u, Y = v).
\]
Proof of (3) - This is an easy consequence of (1) and (2). From (2) the expected value
E [aX + bY ] may be rewritten as E [aX ] + E [bY ]. From there, applying (1) shows this is
also equal to aE [X ] + bE [Y ]. (Using induction this theorem may be extended to any finite
linear combination of random variables, a fact which we leave as an exercise below.)
Proof of (4) - We know that X is a function from S to T where t ∈ T implies that t ≥ 0.
As
\[
E[X] = \sum_{t \in T} t \cdot P(X = t),
\]
make the game fair. Since X is the amount of money gained by the roll, the net change of
money for the roller is X − c after accounting for how much was paid to play. A fair game
requires
0 = E [X − c] = E [X ] − E [c] = 2 − c.
So the roller should pay his opponent $2 to make the game fair. ■
Theorem 4.1.10. Suppose that X and Y are discrete random variables, both with
finite expected value and both defined on the same sample space S. If X and Y are
independent, then E [XY ] = E [X ]E [Y ].
Before showing an example of how this theorem might be used, we provide a demon-
stration that the result will not typically hold without the assumption of independence.
However, the random variable XY can only take on two possible values. It may equal
3 (if either X = 1 and Y = 3 or vice versa) or it may equal 4 (if X = Y = 2). So,
$P(XY = 3) = \frac{2}{3}$ and $P(XY = 4) = \frac{1}{3}$. Therefore,
\[
E[XY] = 3\left(\frac{2}{3}\right) + 4\left(\frac{1}{3}\right) = \frac{10}{3} \neq 4.
\]
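The computation can be reproduced by enumeration. The setup of this example is not fully visible here, so the sketch below assumes one pair of variables consistent with the stated probabilities: X uniform on {1, 2, 3} and Y = 4 − X. That choice, and the variable names, are assumptions for illustration.

```python
from fractions import Fraction

# Assumed setup: X uniform on {1, 2, 3} and Y = 4 - X (so Y depends on X).
outcomes = [(x, 4 - x) for x in (1, 2, 3)]
prob = Fraction(1, 3)  # each (x, y) pair is equally likely

e_x = sum(x * prob for x, _ in outcomes)
e_y = sum(y * prob for _, y in outcomes)
e_xy = sum(x * y * prob for x, y in outcomes)

print(e_x * e_y)  # 4
print(e_xy)       # 10/3
```

Under this assumed setup the product of the expectations is 4 while E[XY] is 10/3, matching the values in the text.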
Example 4.1.12. Suppose an insurance company assumes that, for a given month, both
the number of customer claims X and the average cost per claim Y are independent
random variables. Suppose further the company is able to estimate that E [X ] = 100 and
E [Y ] = $1,250. How should the company estimate the total cost of all claims that month?
The total cost should be the number of claims times the average cost per claim, or XY .
Using Theorem 4.1.10 the expected value of XY is simply the product of the separate
expected values.
Notice, though, that the assumption of independence played a critical role in this computa-
tion. Such an assumption might not be valid for many practical problems. Consider, for
example, if a weather event such as a tornado tends to cause both a larger-than-average
number of claims and also a larger-than-average value per claim. This could cause the
variables X and Y to be dependent and, in such a case, estimating the total cost would
not be as simple as taking the product of the separate expected values. ■
A quick glance at the definition of expected value shows that it only depends on the
distribution of the random variable. Therefore one can compute the expected values for
the various common distributions we defined in the previous chapter.
second is simpler, but requires using the relationship between the Binomial and Bernoulli
random variables. In algebraic terms, if Y ∼ Binomial(n, p) then
\[
\begin{aligned}
E[Y] &= \sum_{k=0}^{n} k \cdot P(Y = k) \\
&= \sum_{k=1}^{n} k \cdot \binom{n}{k} p^{k} (1-p)^{n-k} \\
&= \sum_{k=1}^{n} k \cdot \frac{n!}{k!\,(n-k)!} p^{k} (1-p)^{n-k} \\
&= np \cdot \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,((n-1)-(k-1))!} p^{k-1} (1-p)^{(n-1)-(k-1)} \\
&= np \cdot \sum_{k=1}^{n} \binom{n-1}{k-1} p^{k-1} (1-p)^{(n-1)-(k-1)} \\
&= np \cdot \sum_{k=0}^{n-1} \binom{n-1}{k} p^{k} (1-p)^{(n-1)-k}
\end{aligned}
\]
where the last equality is a shift of variables. But now, by the binomial theorem, the sum
$\sum_{k=0}^{n-1} \binom{n-1}{k} p^{k} (1-p)^{(n-1)-k}$ is equal to 1 and therefore E[Y] = np.
Alternatively, recall that the Binomial distribution first came about as the total number
of successes in n independent Bernoulli trials. Therefore a Binomial(n, p) distribution results
from adding together n independent Bernoulli(p) random variables. Let X1 , X2 , . . . , Xn
be i.i.d. Bernoulli(p) and let Y = X1 + X2 + · · · + Xn . Then Y ∼ Binomial(n, p) and
E [Y ] = E [X1 + X2 + · · · + Xn ]
= E [X1 ] + E [X2 ] + · · · + E [Xn ]
= p + p + · · · + p = np.
This also provides the answer to part (d) of Example 2.1.2. The expected number of
successes in a series of n independent Bernoulli(p) trials is np. ■
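Both derivations can be checked against a direct evaluation of the defining sum. The following sketch uses exact fractions; the function name `binomial_mean` and the parameter values are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def binomial_mean(n, p):
    """E[Y] for Y ~ Binomial(n, p), computed from the definition of expected value."""
    return sum(k * comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1))

n, p = 5, Fraction(3, 10)   # illustrative parameters
print(binomial_mean(n, p))  # 3/2
print(n * p)                # 3/2
```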
In the next example we will calculate the expected value of a geometric random variable.
The computation illustrates a common technique from calculus for simplifying power series
by differentiating the sum term-by-term in order to rewrite a complicated series in a simpler
way.
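To evaluate the sum of the series we will need to work with its partial sums. For
any n ≥ 1, let
\[
\begin{aligned}
T_n &= \sum_{k=1}^{n} k p (1-p)^{k-1} = \sum_{k=1}^{n} k (1 - (1-p)) (1-p)^{k-1} \\
&= \sum_{k=1}^{n} k (1-p)^{k-1} - \sum_{k=1}^{n} k (1-p)^{k} \\
&= \sum_{k=1}^{n} (1-p)^{k-1} - n(1-p)^{n} = \frac{1 - (1-p)^{n}}{p} - n(1-p)^{n}.
\end{aligned}
\]
Using standard results from analysis we know that for 0 < p < 1,
\[
\lim_{n \to \infty} (1-p)^{n} = 0 \quad \text{and} \quad \lim_{n \to \infty} n(1-p)^{n} = 0.
\]
Therefore $T_n \to \frac{1}{p}$ as n → ∞. Hence
\[
E[X] = \frac{1}{p}.
\]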
For instance, suppose we wanted to know on average how many rolls of a die it would
take before we observed a 5. Each roll is a Bernoulli trial with a probability 1/6 of success.
The time it takes to observe the first success is distributed as a Geometric(1/6) and so
has expected value $\frac{1}{1/6} = 6$. On average it should take six rolls before observing this
outcome. ■
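The partial sums of the series k · p(1 − p)^(k−1) can be computed directly and compared with the closed form (1 − (1 − p)^n)/p − n(1 − p)^n. The sketch below does this for p = 1/6; the function names are illustrative assumptions.

```python
from fractions import Fraction

def partial_sum(p, n):
    """T_n = sum over k = 1..n of k * p * (1 - p)^(k - 1)."""
    return sum(k * p * (1 - p) ** (k - 1) for k in range(1, n + 1))

def closed_form(p, n):
    """(1 - (1 - p)^n) / p - n (1 - p)^n."""
    return (1 - (1 - p) ** n) / p - n * (1 - p) ** n

p = Fraction(1, 6)  # probability of rolling a 5 on a fair die
for n in (1, 10, 50, 300):
    print(n, float(partial_sum(p, n)))
```

As n grows the printed values approach 1/p = 6, in line with the limit argument.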
Taking the result as a given, we will illustrate how this expected value might be used
for an applied problem. Suppose an insurance company wants to model catastrophic floods
using a Poisson(λ) random variable. Since floods are rare in any given year, and since
the company is considering what might occur over a long span of years, this may be a
reasonable assumption.
As its name implies, a “50-year flood” is a flood so substantial that it should occur, on
average, only once every fifty years. However, this is just an average; it may be possible to
have two “50-year floods” in consecutive years, though such an event would be quite rare.
Suppose the insurance company wants to know how likely it is that there will be two or
more “50-year floods” in the next decade. How should this be calculated?
There is an average of one such flood every fifty years, so by proportional reasoning, in
the next ten years there should be an average of 0.2 floods. In other words, the number of
floods in the next ten years should be a random variable X ∼ Poisson(0.2) and we wish to
calculate P (X ≥ 2).
\[
\begin{aligned}
P(X \ge 2) &= 1 - P(X = 0) - P(X = 1) \\
&= 1 - e^{-0.2} - e^{-0.2}(0.2) \\
&\approx 0.0175.
\end{aligned}
\]
So assuming the Poisson random variable is an accurate model, there is only about a 1.75%
chance that two or more such disastrous floods would occur in the next decade. ■
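Evaluating the expression 1 − e^(−0.2) − 0.2 · e^(−0.2) numerically takes only a few lines. This is a minimal sketch; the variable names are assumptions.

```python
from math import exp

lam = 0.2                 # expected number of "50-year floods" in ten years
p_zero = exp(-lam)        # P(X = 0)
p_one = lam * exp(-lam)   # P(X = 1)
p_two_or_more = 1 - p_zero - p_one

print(round(p_two_or_more, 4))  # 0.0175
```

The result is roughly 0.0175, that is, a chance of about 1.75%.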
with X ∼ HyperGeo(N , r, m). To calculate the expected value of X, we begin with two
facts. The first is an identity involving combinations. If n ≥ k > 0 then
\[
\binom{n}{k} = \frac{n!}{k!\,(n-k)!}
= \frac{n}{k} \cdot \frac{(n-1)!}{(k-1)!\,((n-1)-(k-1))!}
= \frac{n}{k} \binom{n-1}{k-1}.
\]
The second comes from the consideration of the probabilities associated with a
HyperGeo(N − 1, r − 1, m − 1) distribution. Specifically, as k ranges over all possible values
of such a distribution, we have
\[
\sum_{k} \frac{\binom{r-1}{k} \binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} = 1
\]
since this is the sum over all outcomes of the random variable.
To calculate E [X ], let j range over the possible values of X. Recall that the minimum
value of j is max{0, m − (N − r )} and the maximum value of j is min{r, m}. Now let
k = j − 1. This means that the maximum value for k is min{r − 1, m − 1}. If the
minimum value for j was m − (N − r ) then the minimum value for k is m − (N − r ) − 1 =
((m − 1) − ((N − 1) − (r − 1))). If the minimum value for j was 0 then the minimum
value for k is −1.
The key to the computation is to note that as j ranges over all of the values of X, the
values of k cover all possible values of a HyperGeo(N − 1, r − 1, m − 1) distribution. In
fact, the only possible value k may assume that is not in the range of such a distribution is
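if k = −1 as a minimum value. Now,
\[
E[X] = \sum_{j} j \cdot \frac{\binom{r}{j} \binom{N-r}{m-j}}{\binom{N}{m}},
\]
and if j = 0 is in the range of X, then that term of the sum is zero and it may be
deleted without affecting the value. That is equivalent to deleting the k = −1 term, so
\[
\begin{aligned}
E[X] &= \sum_{j} j \cdot \frac{\binom{r}{j} \binom{N-r}{m-j}}{\binom{N}{m}} \\
&= \sum_{j} j \cdot \frac{\frac{r}{j} \binom{r-1}{j-1} \binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\frac{N}{m} \binom{N-1}{m-1}} \\
&= \left(\frac{rm}{N}\right) \cdot \sum_{j} \frac{\binom{r-1}{j-1} \binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\binom{N-1}{m-1}} \\
&= \left(\frac{rm}{N}\right) \cdot \sum_{k} \frac{\binom{r-1}{k} \binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}} \\
&= \left(\frac{rm}{N}\right) \cdot (1) = \frac{rm}{N}.
\end{aligned}
\]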
This nearly completes the goal of calculating the expected values of hypergeometric
distributions. The only remaining issues are the cases when m = 0 and r = 0. Since the
hypergeometric distribution was only defined when m and r were non-negative integers,
and since the proof above requires the consideration of such a distribution for the values
m − 1 and r − 1, the remaining cases must be handled separately. However, they are fairly
easy and yield the same result, a fact we leave to the reader to verify. ■
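The formula E[X] = rm/N can be spot-checked by summing j · P(X = j) over the range max{0, m − (N − r)} ≤ j ≤ min{r, m} used above. This is a sketch only; the function names and the test parameters are assumptions for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeo_pmf(N, r, m, j):
    """P(X = j) = C(r, j) C(N - r, m - j) / C(N, m)."""
    return Fraction(comb(r, j) * comb(N - r, m - j), comb(N, m))

def hypergeo_mean(N, r, m):
    """E[X] computed directly from the definition of expected value."""
    low, high = max(0, m - (N - r)), min(r, m)
    return sum(j * hypergeo_pmf(N, r, m, j) for j in range(low, high + 1))

N, r, m = 20, 7, 5  # illustrative parameters
print(hypergeo_mean(N, r, m))  # 7/4
print(Fraction(r * m, N))      # 7/4
```

The second test case in a check like this is worth choosing so that m > N − r, which exercises the non-zero lower limit of the range.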
Example 4.1.18. Returning to a setting first seen in Example 3.3.1 we will let X ∼
Uniform({−2, −1, 0, 1, 2}), and let $f(x) = x^2$. How may E[f(X)] be calculated?
We will demonstrate this in two ways: first by appealing directly to the definition, and
then using the distribution of X instead of the distribution of f(X). To use the definition
of expected value, recall that $f(X) = X^2$ takes values in {0, 1, 4} with the following
probabilities: $P(f(X) = 0) = \frac{1}{5}$ while $P(f(X) = 1) = P(f(X) = 4) = \frac{2}{5}$. Therefore,
\[
E[f(X)] = 0\left(\frac{1}{5}\right) + 1\left(\frac{2}{5}\right) + 4\left(\frac{2}{5}\right) = 2.
\]
However, the values of f(X) are completely determined from the values of X. For
instance, the event (f(X) = 4) had a probability of 2/5 because it was the disjoint union of
two other events (X = 2) ∪ (X = −2), each of which had probability 1/5. So the term 4(2/5)
in the computation above could equally well have been thought of in two pieces
\[
\begin{aligned}
4 \cdot P(f(X) = 4) &= 4 \cdot P((X = 2) \cup (X = -2)) \\
&= 4 \cdot (P(X = 2) + P(X = -2)) \\
&= 4 \cdot P(X = 2) + 4 \cdot P(X = -2) \\
&= 2^2 \cdot P(X = 2) + (-2)^2 \cdot P(X = -2),
\end{aligned}
\]
where the final expression emphasizes that the outcome of 4 resulted either from $2^2$ or
$(-2)^2$ depending on the value of X. Following a similar plan for the other values of f(X)
allows E[f(X)] to be calculated directly from the probabilities of X as
\[
E[f(X)] = (-2)^2 \cdot \tfrac{1}{5} + (-1)^2 \cdot \tfrac{1}{5} + 0^2 \cdot \tfrac{1}{5} + 1^2 \cdot \tfrac{1}{5} + 2^2 \cdot \tfrac{1}{5} = 2.
\]
The technique of the example above works for any functions as demonstrated by the
next two theorems. We first state and prove a version for functions of a single random
variable and then deal with the multivariate case.
\[
E[f(X)] = \sum_{t \in T} f(t) \cdot P(X = t).
\]
\[
P(f(X) = u) = \sum_{t \in f^{-1}(u)} P(X = t).
\]
\[
\begin{aligned}
E[f(X)] &= \sum_{u \in U} u \cdot P(f(X) = u) \\
&= \sum_{u \in U} u \cdot \sum_{t \in f^{-1}(u)} P(X = t) \\
&= \sum_{u \in U} \sum_{t \in f^{-1}(u)} u \cdot P(X = t) \\
&= \sum_{u \in U} \sum_{t \in f^{-1}(u)} f(t) \cdot P(X = t) \\
&= \sum_{t \in T} f(t) \cdot P(X = t),
\end{aligned}
\]
where the final step is simply the fact that $T = f^{-1}(U)$ and so summing over the values of
t ∈ T is equivalent to grouping them together in the sets $f^{-1}(u)$ and summing over all
values in U that may be achieved by f(X). ■
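The two ways of computing E[f(X)] can be compared in code for X ∼ Uniform({−2, −1, 0, 1, 2}) and f(x) = x². In this sketch the helper `pushforward` (which groups the values t by f(t), mirroring the sets f⁻¹(u) in the proof) is a hypothetical name introduced for illustration.

```python
from fractions import Fraction

def pushforward(pmf, f):
    """Distribution of f(X): add up P(X = t) over all t with f(t) = u."""
    result = {}
    for t, prob in pmf.items():
        result[f(t)] = result.get(f(t), 0) + prob
    return result

def square(x):
    return x * x

x_pmf = {t: Fraction(1, 5) for t in (-2, -1, 0, 1, 2)}
fx_pmf = pushforward(x_pmf, square)

via_definition = sum(u * prob for u, prob in fx_pmf.items())     # uses the law of f(X)
via_theorem = sum(square(t) * prob for t, prob in x_pmf.items())  # uses the law of X

print(via_definition, via_theorem)  # 2 2
```

The second sum never needs the distribution of f(X), which is the practical advantage of the theorem.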
\[
E[f(X)] = \sum_{t_1 \in T_1, \dots, t_n \in T_n} f(t_1, \dots, t_n) \cdot P(X_1 = t_1, \dots, X_n = t_n).
\]
The proof is nearly the same as for the one-variable case. The only difference is that $f^{-1}(u)$
is now a set of vectors of values $(t_1, \dots, t_n)$, so that the event (f(X) = u) decomposes into
events of the form $(X_1 = t_1, \dots, X_n = t_n)$. However, this change does not interfere with
the logic of the proof. We leave the details to the reader.
exercises
Ex. 4.1.2. A lottery is held every day, and on any given day there is a 30% chance that
someone will win, with each day independent of every other. Let X denote the random
variable describing the number of times in the next five days that the lottery will be won.
(b) On average (expected value), how many times in the next five days will the lottery
be won?
(c) When the lottery occurs for each of the next five days, what is the most likely number
(mode) of days there will be a winner?
(d) How likely is it the lottery will be won in either one or two of the next five days?
Ex. 4.1.3. A game show contestant is asked a series of questions. She has a probability of
0.88 of knowing the answer to any given question, independently of every other. Let Y
denote the random variable describing the number of questions asked until the contestant
does not know the correct answer.
(b) On average (expected value), how many questions will be asked until the first question
for which the contestant does not know the answer?
(c) What is the most likely number of questions (mode) that will be asked until the
contestant does not know a correct answer?
(d) If the contestant is able to answer twelve questions in a row, she will win the grand
prize. How likely is it that she will know the answers to all twelve questions?
Ex. 4.1.4. Sonia sends out invitations to eleven of her friends to join her on a hike she’s
planning. She knows that each of her friends has a 59% chance of deciding to join her
independently of each other. Let Z denote the number of friends who join her on the hike.
(b) What is the average (expected value) number of her friends that will join her on the
hike?
(c) What is the most likely number (mode) of her friends that will join her on the hike?
(d) How do your answers to (b) and (c) change if each friend has only a 41% chance of
joining her?
Ex. 4.1.5. A player rolls three dice and earns $1 for each die that shows a 6. How much
should the player pay to make this a fair game?
Ex. 4.1.6. (“The St. Petersburg Paradox”) Suppose a game is played whereby a player
begins flipping a fair coin and continues flipping it until it comes up heads. At that time
the player wins $2^n$ dollars where n is the total number of times he flipped the coin. Show
that there is no amount of money the player could pay to make this a fair game. (Hint:
See Example 4.1.4).
Ex. 4.1.7. Two different investment strategies have the following probabilities of return on
$10,000.
Strategy A has a 20% chance of returning $14,000, a 35% chance of returning $12,000,
a 20% chance of returning $10,000, a 15% chance of returning $8,000, and a 10% chance of
returning only $6,000.
Strategy B has a 25% chance of returning $12,000, a 35% chance of returning $11,000,
a 25% chance of returning $10,000, and a 15% chance of returning $9,000.
(c) Is one strategy clearly preferable to the other? Explain your reasoning.
Ex. 4.1.8. Calculate the expected value of a Uniform({1, 2, . . . , n}) random variable by
following the steps below.
(a) Prove the numerical fact that $\sum_{j=1}^{n} j = \frac{n^2 + n}{2}$. (Hint: There are many methods
to do this. One uses induction.)
Ex. 4.1.9. Use induction to extend the result of Theorem 4.1.7 by proving the following:
If X1 , X2 , . . . , Xn are random variables with finite expectation all defined on the same
sample space S and if a1 , a2 , . . . an are real numbers, then
Ex. 4.1.10. Suppose X and Y are random variables for which X has finite expected value
and Y has infinite expected value. Prove that X + Y has infinite expected value.
Ex. 4.1.11. Suppose X and Y are random variables. Suppose E [X ] = ∞ and E [Y ] = −∞.
(c) Provide an example to show that X + Y may have finite expected value.
(b) Every non-zero term in your answer to (a) should have a λ in it. Factor this λ out
and explain why the remaining sum equals 1. (Hint: One way to do this is through
the use of infinite series. Another way is to use the idea from Example 4.1.17).
Ex. 4.1.13. A daily lottery is an event that many people play, but for which the likelihood
of any given person winning is very small, making a Poisson approximation appropriate.
Suppose a daily lottery has, on average, two winners every five weeks. Estimate the
probability that next week there will be more than one winner.
As a single number, the average of a random variable may or may not be a good approxi-
mation of the values that variable is likely to produce. For example, let X be defined such
that P(X = 10) = 1, let Y be defined so that P(Y = 9) = P(Y = 11) = 1/2, and let Z be
defined such that P(Z = 0) = P(Z = 20) = 1/2. It is easy to check that all three of these
random variables have an expected value of 10. However the number 10 exactly describes
X, is always off from Y by an absolute value of 1 and is always off from Z by an absolute
value of 10.
It is useful to be able to quantify how far away a random variable typically is from
its average. Put another way, if we think of the expected value as somehow measuring
the “center” of the random variable, we would like to find a way to measure the size of the
“spread” of the variable about its center. Quantities useful for this are the variance and
standard deviation.
Definition 4.2.1. Let X be a random variable with finite expected value. Then the
variance of the random variable is written as V ar [X ] and is defined as
\[
\mathrm{Var}[X] = E[(X - E[X])^2].
\]
Notice that V ar [X ] is the average of the square distance of X from its expected value.
So if X has a high probability of being far away from E [X ] the variance will tend to be
large, while if X is very near E [X ] with high probability the variance will tend to be small.
In either case the variance is the expected value of a squared quantity, and as such is
always non-negative. Therefore SD [X ] is defined whenever V ar [X ] is defined.
If we were to associate units with the random variable X (say meters), then the units
of Var[X] would be meters$^2$ and the units of SD[X] would be meters. We will see that
the standard deviation is more meaningful as a measure of the “spread” of a random
variable while the variance tends to be a more useful quantity to consider when carrying
out complex computations.
Informally we will view the standard deviation as a typical distance from average. So
if X is a random variable and we calculate that E [X ] = 12 and SD [X ] = 3, we might
say, “The variable X will typically take on values that are in or near the range from 9 to 15,
one standard deviation either side of the average”. A goal of this section is to make that
language more precise, but at this point it will help with intuition to understand this
informal view.
The variance and standard deviation are described in terms of the expected value.
Therefore V ar [X ] and SD [X ] can only be defined if E [X ] exists as a real number. However,
it is possible that V ar [X ] and SD [X ] could be infinite even if E [X ] is finite (see Exercises).
In practical terms, if X has a finite expected value and infinite standard deviation, it
means that the random variable has a clear average, but is so spread out that any finite
number underestimates the typical distance of the random variable from its average.
Example 4.2.2. As above, let X be a constant variable with P(X = 10) = 1. Let Y be such
that P(Y = 9) = P(Y = 11) = 1/2 and let Z be such that P(Z = 0) = P(Z = 20) = 1/2.
Since X always equals E[X], the quantity $(X - E[X])^2$ is always zero and we can
conclude that Var[X] = 0 and SD[X] = 0. This makes sense given the view of SD[X] as
an estimate of how spread out the variable is. Since X is constant it is not at all spread
out and so SD[X] = 0.
To calculate Var[Y] we note that $(Y - E[Y])^2$ is always equal to 1. Therefore Var[Y] =
1 and SD[Y] = 1. Again this reaffirms the informal description of the standard deviation;
the typical distance between Y and its average is 1.
Likewise $(Z - E[Z])^2$ is always equal to 100. Therefore Var[Z] = 100 and SD[Z] = 10.
The typical distance between Z and its average is 10. ■
Example 4.2.3. What are the variance and standard deviation of a die roll?
Before we carry out the calculation, let us use the informal idea of standard deviation
to estimate an answer and help build intuition. We know the average of a die roll is 3.5.
The closest a die could possibly be to this average is 0.5 (if it were to roll a 3 or a 4) and
the furthest it could possibly be is 2.5 (if it were to roll a 1 or a 6). Therefore the standard
deviation, a typical distance from average, should be somewhere between 0.5 and 2.5.
To calculate the quantity exactly, let X represent the roll of a die. By definition,
V ar [X ] = E [(X − 3.5)2 ], and the values that (X − 3.5)2 may assume are determined by
the six values X may take on.
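\[
\begin{aligned}
\mathrm{Var}[X] &= E[(X - 3.5)^2] \\
&= \tfrac{1}{6}(2.5)^2 + \tfrac{1}{6}(1.5)^2 + \tfrac{1}{6}(0.5)^2 + \tfrac{1}{6}(-0.5)^2 + \tfrac{1}{6}(-1.5)^2 + \tfrac{1}{6}(-2.5)^2 \\
&= \frac{35}{12}.
\end{aligned}
\]
So SD[X] = $\sqrt{35/12}$ ≈ 1.71, which is near the midpoint of the range of our estimate above. ■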
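The die calculation can be checked mechanically. The following is a minimal sketch in Python using exact fractions; the helper name `mean_and_variance` is an illustrative choice and does not come from the text.

```python
from fractions import Fraction


def mean_and_variance(pmf):
    """Return (E[X], Var[X]) for a finite distribution given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    variance = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, variance


# A fair die: each face 1..6 has probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
mu, var = mean_and_variance(die)

print(mu)                  # exact expected value
print(var)                 # exact variance
print(float(var) ** 0.5)   # standard deviation as a decimal
```

Working with `Fraction` rather than floating point keeps the variance as the exact ratio that appears in the hand computation.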
Theorem 4.2.4. Let a ∈ R and let X be a random variable with finite variance
(and thus, with finite expected value as well). Then,
(a) $\mathrm{Var}[aX] = a^2 \cdot \mathrm{Var}[X]$;
(b) $SD[aX] = |a| \cdot SD[X]$;
(c) $\mathrm{Var}[X + a] = \mathrm{Var}[X]$; and
(d) $SD[X + a] = SD[X]$.
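Proof of (a) and (b) - By definition, $\mathrm{Var}[aX] = E[(aX - E[aX])^2]$. Using known properties of expected
value this may be rewritten as
\[
\begin{aligned}
\mathrm{Var}[aX] &= E[(aX - aE[X])^2] \\
&= E[a^2 (X - E[X])^2] \\
&= a^2 E[(X - E[X])^2] \\
&= a^2 \cdot \mathrm{Var}[X].
\end{aligned}
\]
That concludes the proof of (a). The result from (b) follows by taking square roots of both
sides of this equation.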
Proof of (c) and (d) - (See Exercises) ■
The variance may also be computed using a different (but equivalent) formula if E [X ]
and E [X 2 ] are known.
Theorem 4.2.5. Let X be a random variable for which E [X ] and E [X 2 ] are both
finite. Then
V ar [X ] = E [X 2 ] − (E [X ])2 .
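Proof -
\[
\begin{aligned}
\mathrm{Var}[X] &= E[(X - E[X])^2] \\
&= E[X^2 - 2XE[X] + (E[X])^2] \\
&= E[X^2] - 2E[XE[X]] + E[(E[X])^2].
\end{aligned}
\]
But E[X] is a constant, so
\[
\mathrm{Var}[X] = E[X^2] - 2E[X]E[X] + (E[X])^2 = E[X^2] - (E[X])^2.
\]
■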
In statistics we frequently want to consider the sum or average of many random variables.
As such it is useful to know how the variance of a sum relates to the variances of each
variable separately. Toward that goal we have
Theorem 4.2.6. If X and Y are independent random variables, both with finite
expectation and finite variance, then
(a) V ar [X + Y ] = V ar [X ] + V ar [Y ]; and
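(b) $SD[X + Y] = \sqrt{(SD[X])^2 + (SD[Y])^2}$.
Proof - Using Theorem 4.2.5,
\[
\begin{aligned}
\mathrm{Var}[X + Y] &= E[(X + Y)^2] - (E[X + Y])^2 \\
&= E[X^2 + 2XY + Y^2] - \left((E[X])^2 + 2E[X]E[Y] + (E[Y])^2\right) \\
&= E[X^2] + 2E[XY] + E[Y^2] - (E[X])^2 - 2E[X]E[Y] - (E[Y])^2.
\end{aligned}
\]
Since X and Y are independent, E[XY] = E[X]E[Y], so the middle terms cancel and
\[
\begin{aligned}
\mathrm{Var}[X + Y] &= E[X^2] - (E[X])^2 + E[Y^2] - (E[Y])^2 \\
&= \mathrm{Var}[X] + \mathrm{Var}[Y].
\end{aligned}
\]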
Part (b) follows immediately after rewriting the variances in terms of standard deviations
and taking square roots. As with expected values, this theorem may be generalized to a
sum of any finite number of independent random variables using induction. The proof of
that fact is left as Exercise 4.2.11. ■
Example 4.2.7. What is the standard deviation of the sum of two dice?
We previously found that if X represents one die, then Var[X] = 35/12. If X and Y are
two independent dice, then Var[X + Y] = Var[X] + Var[Y] = 35/12 + 35/12 = 35/6.
Therefore SD[X + Y] = $\sqrt{35/6}$ ≈ 2.42. ■
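As a cross-check that does not rely on Theorem 4.2.6, the distribution of the sum can be built directly from the 36 equally likely ordered pairs and its variance computed from the definition. This is a sketch only; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Distribution of the sum of two independent fair dice:
# each ordered pair (a, b) has probability 1/36.
pmf_sum = {}
for a, b in product(range(1, 7), repeat=2):
    pmf_sum[a + b] = pmf_sum.get(a + b, 0) + Fraction(1, 36)

mean_sum = sum(s * p for s, p in pmf_sum.items())
var_sum = sum((s - mean_sum) ** 2 * p for s, p in pmf_sum.items())

print(mean_sum, var_sum, float(var_sum) ** 0.5)
```

The variance obtained this way agrees with twice the variance of a single die, as the theorem predicts for independent variables.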
As with expected value, the variances of the common discrete random variables can be
calculated from their corresponding distributions.
Example 4.2.8. (Variance of a Bernoulli(p))
Let X ∼ Bernoulli(p). We have already calculated that E [X ] = p. Since X only takes
on the values 0 or 1 it is always true that X 2 = X. Therefore E [X 2 ] = E [X ] = p.
So, V ar [X ] = E [X 2 ] − (E [X ])2 = p − p2 = p(1 − p). ■
Example 4.2.9. (Variance of a Binomial(n,p))
We will calculate the variance of a Binomial random variable using the fact that it may
be viewed as the sum of n independent Bernoulli random variables. A strictly algebraic
computation is also possible (see Exercises).
Let X1 , X2 , . . . , Xn be independent Bernoulli(p) random variables. Therefore, if
Y = X1 + X2 + · · · + Xn then Y ∼ Binomial (n, p) and
V ar [Y ] = V ar [X1 + X2 + · · · + Xn ]
= V ar [X1 ] + V ar [X2 ] + · · · + V ar [Xn ]
= p(1 − p) + p(1 − p) + · · · + p(1 − p)
= np(1 − p).
do not. The goal is to provide an estimate of the number of people in the sample that have
the characteristic. For this example, suppose we were to randomly select 100 people from
a large city in which 20% of the population works in a service industry. How many of the
100 people from our sample should we expect to be service industry workers?
If the sampling is done without replacement (so we cannot pick the same person twice),
then strictly speaking the desired number would be described by a Hypergeometric random
variable. However, we have also seen that there is little difference between the Binomial
and Hypergeometric distributions when the size of the sample is small relative to the size
of the population. So since the sample is only 100 people from a “large city”, we will
assume this situation is modeled by a binomial random variable. Specifically, since 20% of
the population consists of service workers, we will assume X ∼ Binomial (100, 0.2).
The simplest way to answer the question of how many service industry workers to
expect within the sample is to compute the expected value of X. In this case E [X ] =
100(0.2) = 20, so we should expect around 20 of the 100 people in the sample to be service
workers. However, this is an incomplete answer to the question since it only provides an
average value; the actual number of service workers in the sample is probably not going to
be exactly 20, it’s only likely to be around 20 on average. A more complete answer to the
question would give an estimate as to how far away from 20 the actual value is likely to
be. But this is precisely what the standard deviation describes – an estimate of the likely
difference between the actual result of the random variable and its expected value.
In this case Var[X] = 100(0.2)(0.8) = 16 and so SD[X] = $\sqrt{16}$ = 4. This means that
the actual number of service industry workers in the sample will typically be about 4 or so
away from the expected value of 20, so a more complete answer to the question would be
“The sample is likely to have around 16 − 24 service workers in it”. That is not to say that
the actual number of service workers is guaranteed to fall in that range, but the range
provides a sort of likely error associated with the estimate of 20. Results in the 16 − 24
range should be considered fairly common. Results far outside that range, while possible,
should be considered fairly unusual. ■
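To see how "fairly common" the 16 − 24 range is, the exact Binomial(100, 0.2) probabilities can be summed. The sketch below assumes the binomial model adopted in the example and uses only the Python standard library; the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

n, p = 100, Fraction(1, 5)   # X ~ Binomial(100, 0.2)

# Exact binomial probabilities P(X = k) for k = 0, ..., n.
pmf = {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

mean = sum(k * q for k, q in pmf.items())
variance = sum((k - mean) ** 2 * q for k, q in pmf.items())
sd = float(variance) ** 0.5

# Probability of landing within one standard deviation of the mean (16 to 24).
within_one_sd = sum(q for k, q in pmf.items() if abs(k - mean) <= sd)
print(mean, variance, sd, float(within_one_sd))
```

The printed probability quantifies the informal statement that results in the 16 − 24 range are common but not guaranteed.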
Recall in Example 4.1.17 we calculated E [X ] using a technique in which the sum
describing E [X ] was computed based on another sum which only involved the distribution
of X directly. This second sum equalled 1 since it simply added up the probabilities that
X assumed each of its possible values. In a similar fashion, it is sometimes possible to
calculate a sum describing E [X 2 ] in terms of a sum for E [X ] which is already known. From
that point, Theorem 4.2.5 may be used to calculate the variance and standard deviation of
X. This technique will be illustrated in the next example in which we calculate the spread
associated with a geometric random variable.
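Example 4.2.10. (Variance of a Geometric(p))
Let X ∼ Geometric(p) with 0 < p < 1. Then
\[
E[X^2] = \sum_{k=1}^{\infty} k^2 p(1 - p)^{k-1}.
\]
To evaluate the sum of the series we will need to work with its partial sums. For
any n ≥ 1, let
\[
\begin{aligned}
S_n &= \sum_{k=1}^{n} k^2 p(1 - p)^{k-1} = \sum_{k=1}^{n} k^2 (1 - (1 - p))(1 - p)^{k-1} \\
&= \sum_{k=1}^{n} k^2 (1 - p)^{k-1} - \sum_{k=1}^{n} k^2 (1 - p)^{k} \\
&= 1 + \sum_{k=2}^{n} (2k - 1)(1 - p)^{k-1} - n^2 (1 - p)^n \\
&= 1 - \sum_{k=2}^{n} (1 - p)^{k-1} + 2\sum_{k=2}^{n} k(1 - p)^{k-1} - n^2 (1 - p)^n \\
&= 2 - \sum_{k=1}^{n} (1 - p)^{k-1} + 2\left(-1 + \sum_{k=1}^{n} k(1 - p)^{k-1}\right) - n^2 (1 - p)^n \\
&= -\frac{1 - (1 - p)^n}{p} + \frac{2}{p}\sum_{k=1}^{n} kp(1 - p)^{k-1} - n^2 (1 - p)^n.
\end{aligned}
\]
Using standard results from analysis and the result from Example 4.1.15 we know that for
0 < p < 1,
\[
\lim_{n\to\infty} \sum_{k=1}^{n} kp(1 - p)^{k-1} = \frac{1}{p}, \quad \lim_{n\to\infty} (1 - p)^n = 0, \quad \text{and} \quad \lim_{n\to\infty} n^2 (1 - p)^n = 0.
\]
Therefore $S_n \to -\frac{1}{p} + \frac{2}{p^2}$ as n → ∞. Hence
\[
E[X^2] = -\frac{1}{p} + \frac{2}{p^2}.
\]
Using Theorem 4.2.5,
\[
\begin{aligned}
\mathrm{Var}[X] &= E[X^2] - (E[X])^2 \\
&= \frac{2}{p^2} - \frac{1}{p} - \left(\frac{1}{p}\right)^2 \\
&= \frac{1}{p^2} - \frac{1}{p}.
\end{aligned}
\]
■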
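The partial-sum manipulation above is easy to get wrong by an index, so it may help to check it numerically. The sketch below compares the direct partial sum with the last line of the derivation for a few values of n, then compares a long partial sum with the limiting value. The choice p = 1/6 and the function names are illustrative assumptions.

```python
from fractions import Fraction


def partial_sum(p, n):
    """S_n = sum_{k=1}^{n} k^2 p (1-p)^(k-1), computed directly."""
    q = 1 - p
    return sum(k * k * p * q ** (k - 1) for k in range(1, n + 1))


def partial_sum_closed(p, n):
    """Last line of the derivation: -(1-q^n)/p + (2/p) sum k p q^(k-1) - n^2 q^n."""
    q = 1 - p
    first_moment_sum = sum(k * p * q ** (k - 1) for k in range(1, n + 1))
    return -(1 - q**n) / p + (2 / p) * first_moment_sum - n * n * q**n


p = Fraction(1, 6)                    # illustrative choice of p
second_moment = 2 / p**2 - 1 / p      # limiting value of S_n
variance = second_moment - (1 / p) ** 2

for n in (1, 5, 50):
    print(n, partial_sum(p, n) == partial_sum_closed(p, n))
gap = float(second_moment - partial_sum(p, 200))   # what remains after 200 terms
print(gap, variance)
```

With p = 1/6 this corresponds to waiting for a particular face of a die, so the variance printed is that of the number of rolls needed.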
A similar technique may be used for calculating the variance of a Poisson random
variable, a fact which is left as an exercise. We finish this subsection with a computation
of the variance of a hypergeometric distribution using an idea similar to how we calculated
its expected value in Example 4.1.17.
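Example 4.2.11. Let m and r be positive integers, let N be an integer with N >
max{m, r}, and let X ∼ HyperGeo(N , r, m). To calculate E[X^2], as j ranges over the
values of X,
\[
\begin{aligned}
E[X^2] &= \sum_{j} j^2 \cdot \frac{\binom{r}{j}\binom{N-r}{m-j}}{\binom{N}{m}} \\
&= \sum_{j} j^2 \cdot \frac{\frac{r}{j}\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\frac{N}{m}\binom{N-1}{m-1}} \\
&= \left(\frac{rm}{N}\right) \sum_{j} j \cdot \frac{\binom{r-1}{j-1}\binom{(N-1)-(r-1)}{(m-1)-(j-1)}}{\binom{N-1}{m-1}} \\
&= \left(\frac{rm}{N}\right) \sum_{k} (k + 1) \cdot \frac{\binom{r-1}{k}\binom{(N-1)-(r-1)}{(m-1)-k}}{\binom{N-1}{m-1}},
\end{aligned}
\]
where k = j − 1. The final sum is E[Y + 1] for a random variable Y ∼ HyperGeo(N − 1, r − 1, m − 1), so
\[
\begin{aligned}
E[X^2] &= \left(\frac{rm}{N}\right) E[Y + 1] \\
&= \left(\frac{rm}{N}\right)(E[Y] + 1) \\
&= \left(\frac{rm}{N}\right)\left(\frac{(r - 1)(m - 1)}{N - 1} + 1\right).
\end{aligned}
\]
Therefore,
\[
\begin{aligned}
\mathrm{Var}[X] &= E[X^2] - (E[X])^2 \\
&= \left(\frac{rm}{N}\right)\left(\frac{(r - 1)(m - 1)}{N - 1} + 1\right) - \left(\frac{rm}{N}\right)^2 \\
&= \frac{N^2 rm - N rm^2 - N r^2 m + r^2 m^2}{N^2 (N - 1)}.
\end{aligned}
\]
As with the computation of expected value, the cases of m = 0 and r = 0 must be handled
separately, but yield the same result. ■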
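The closed form can be compared against a brute-force computation from the probability mass function. This is a sketch under the assumption, read off from the sum above, that X counts marked items when m items are drawn without replacement from N items of which r are marked; the parameter values and function names are illustrative.

```python
from fractions import Fraction
from math import comb


def hypergeo_pmf(N, r, m):
    """P(X = j) = C(r, j) C(N - r, m - j) / C(N, m) over the achievable values j."""
    low, high = max(0, m - (N - r)), min(r, m)
    return {j: Fraction(comb(r, j) * comb(N - r, m - j), comb(N, m))
            for j in range(low, high + 1)}


def variance_formula(N, r, m):
    """Closed form for Var[X] obtained at the end of Example 4.2.11."""
    return Fraction(N**2 * r * m - N * r * m**2 - N * r**2 * m + r**2 * m**2,
                    N**2 * (N - 1))


N, r, m = 20, 7, 5   # illustrative parameters
pmf = hypergeo_pmf(N, r, m)
mean = sum(j * p for j, p in pmf.items())
variance = sum((j - mean) ** 2 * p for j, p in pmf.items())
print(mean, variance, variance_formula(N, r, m))
```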
Many random variables may be rescaled into a standard format by shifting them so that
they have an average of zero and then rescaling them so that they have a variance (and
standard deviation) of one. We introduce this idea now, though its chief importance will
not be realized until later.
Definition 4.2.12. A standardized random variable X is one for which
\[
E[X] = 0 \quad \text{and} \quad \mathrm{Var}[X] = 1.
\]
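Theorem 4.2.13. Let X be a discrete random variable with finite expected value
and finite, non-zero variance. Then $Z = \frac{X - E[X]}{SD[X]}$ is a standardized random variable.
Proof -
\[
\begin{aligned}
E[Z] &= E\left[\frac{X - E[X]}{SD[X]}\right] \\
&= \frac{E[X - E[X]]}{SD[X]} \\
&= \frac{E[X] - E[X]}{SD[X]} = 0
\end{aligned}
\]
and
\[
\begin{aligned}
\mathrm{Var}[Z] &= \mathrm{Var}\left[\frac{X - E[X]}{SD[X]}\right] \\
&= \frac{\mathrm{Var}[X - E[X]]}{(SD[X])^2} \\
&= \frac{\mathrm{Var}[X]}{\mathrm{Var}[X]} = 1.
\end{aligned}
\]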
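Standardization is a purely mechanical transformation, which the following sketch applies to a fair die. Floating point is used because the standard deviation is irrational, so the results are only expected to match 0 and 1 up to rounding error; the variable names are illustrative.

```python
from math import sqrt

die = {face: 1 / 6 for face in range(1, 7)}

mu = sum(x * p for x, p in die.items())
sigma = sqrt(sum((x - mu) ** 2 * p for x, p in die.items()))

# Distribution of Z = (X - mu) / sigma: same probabilities, rescaled values.
z_pmf = {(x - mu) / sigma: p for x, p in die.items()}

z_mean = sum(z * p for z, p in z_pmf.items())
z_var = sum((z - z_mean) ** 2 * p for z, p in z_pmf.items())
print(z_mean, z_var)
```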
For easy reference we finish off this section by providing a chart of values associated
with common discrete distributions.
exercises
Calculate the expected value and standard deviation of this random variable. What is the
probability this random variable will produce a result more than one standard deviation
from its expected value?
Ex. 4.2.2. Answer the following questions about flips of a fair coin.
(a) Calculate the standard deviation of the number of heads that show up in 100 flips of
a fair coin.
(b) Show that if the number of coins is quadrupled (to 400) the standard deviation only
doubles.
Ex. 4.2.3. Suppose we begin rolling a die, and let X be the number of rolls needed before
we see the first 3.
(b) Calculate SD [X ].
(c) Viewing SD [X ] as a typical distance of X from its expected value, would it seem
unusual to roll the die more than nine times before seeing a 3?
(e) Calculate the probability X produces a result within one standard deviation of its
expected value.
Ex. 4.2.4. A key issue in statistical sampling is the determination of how much a sample
is likely to differ from the population it came from. This exercise explores some of these
ideas.
(a) Suppose a large city is exactly 50% women and 50% men and suppose we randomly
select 60 people from this city as part of a sample. Let X be the number of women in
the sample. What are the expected value and standard deviation of X? Given these
values, would it seem unusual if fewer than 45% of the individuals in the sample were
women?
(b) Repeat part (a), but now assume that the sample consists of 600 people.
Ex. 4.2.5. Calculate the variance and standard deviation of the value of the lottery ticket
from Example 3.1.4.
Ex. 4.2.6. Prove parts (c) and (d) of Theorem 4.2.4.
Ex. 4.2.7. Let X ∼ Binomial (n, p). Show that for 0 < p < 1, this random variable has
the largest standard deviation when p = 1/2.
Ex. 4.2.8. Follow the steps below to calculate the variance of a random variable with a
Uniform({1, 2, . . . , n}) distribution.
(a) Prove that $\sum_{k=1}^{n} k^2 = \frac{n(n+1)(2n+1)}{6}$. (Induction is one way to do this).
Ex. 4.2.9. This exercise provides an example of a random variable with finite expected
value, but infinite variance. Let X be a random variable for which $P\left(X = \frac{2^n}{n(n+1)}\right) = \frac{1}{2^n}$
for all integers n ≥ 1.
(a) Prove that X is a well-defined variable by showing $\sum_{n=1}^{\infty} P\left(X = \frac{2^n}{n(n+1)}\right) = 1$.
Ex. 4.2.10. Recall that the hypergeometric distribution was first developed to answer
questions about sampling without replacement. With that in mind, answer the following
questions using the chart of expected values and variances.
(a) Use the formula in the chart to calculate the variance of a hypergeometric distribution
if m = 0. Explain this result in the context of what it means in terms of sampling.
(b) Use the formula in the chart to calculate the variance of a hypergeometric distribution
if r = 0. Explain this result in the context of what it means in terms of sampling.
(c) Though we only defined a hypergeometric distribution if N > max{r, m}, the
definition could be extended to N = max{r, m}. Use the chart to calculate the
variance of a hypergeometric distribution if N = m. Explain this result in the context
of what it means in terms of sampling without replacement.
Ex. 4.2.11. Prove the following facts about independent random variables.
(a) Use Theorem 4.2.6 and induction to prove that if X1 , X2 , . . . , Xn are independent,
then
V ar [X1 + · · · + Xn ] = V ar [X1 ] + · · · + V ar [Xn ].
\[
Y = \frac{X_1 + X_2 + \cdots + X_n}{\sqrt{n}}.
\]
Ex. 4.2.12. Let X be a discrete random variable which takes on only non-negative values.
Show that if E [X ] = 0 then P (X = 0) = 1.
Ex. 4.2.13. Suppose X is a discrete random variable with finite variance (and thus finite
expected value as well) and suppose there are two different numbers a, b ∈ R for which
P (X = a) and P (X = b) are both positive. Prove that V ar [X ] > 0.
Ex. 4.2.14. Let X be a discrete random variable with finite variance (and thus finite
expected value as well).
(b) Suppose there are two different numbers a, b ∈ R for which P (X = a) and P (X = b)
are both positive. Prove that E [X 2 ] > (E [X ])2 .
Ex. 4.2.15. Let X ∼ Binomial(n, p) for n > 1 and 0 < p < 1. Using the steps below,
provide an algebraic proof of the fact that V ar [X ] = np(1 − p) without appealing to the
fact that such a variable is the sum of Bernoulli trials.
(b) Use (a) to show that $E[X^2] = np \cdot \sum_{k=0}^{n-1} (k + 1)\binom{n-1}{k} p^k (1 - p)^{(n-1)-k}$.
(d) Use (c) together with Theorem 4.2.5 to prove that V ar [X ] = np(1 − p).
When there is no confusion about what random variable is being discussed, it is usual to
use the Greek letter µ in place of E [X ] and σ in place of SD [X ]. When more than one
variable is involved the same letters can be used with subscripts (µX and σX ) to indicate
which variable is being described.
In statistics one frequently measures results in terms of “standard units” – the number
of standard deviations a result is from its expected value. For instance if µ = 12 and σ = 5,
then a result of X = 20 would be 1.6 standard units because 20 = µ + 1.6σ. That is, 20 is
1.6 standard deviations above expected value. Similarly a result of X = 10 would be −0.4
standard units because 10 = µ − 0.4σ.
Since the standard deviation measures a typical distance from average, results that
are within one standard deviation from average (between −1 and +1 standard units) will
tend to be fairly common, while results that are more than two standard deviations from
average (less than −2 or greater than +2 in standard units) will usually be relatively rare.
The likelihoods of some such events will be calculated in the next two examples. Notice
that the event (|X − µ| ≤ kσ ) describes those outcomes of X that are within k standard
deviations from average.
Example 4.3.1. Let Y represent the sum of two dice. How likely is it that Y will be
within one standard deviation of its average? How likely is it that Y will be more than
two standard deviations from its average?
We can use our previous calculations that µ = 7 and σ = $\sqrt{35/6}$ ≈ 2.42. The achievable
values that are within one standard deviation of average are 5, 6, 7, 8, and 9. So the
probability that the sum of two dice will be within one standard deviation of average is
\[
P(|Y - \mu| \le \sigma) = P(Y \in \{5, 6, 7, 8, 9\}) = \frac{4}{36} + \frac{5}{36} + \frac{6}{36} + \frac{5}{36} + \frac{4}{36} = \frac{2}{3}.
\]
There is about a 66.7% chance that a pair of dice will fall within one standard deviation of
their expected value.
Two standard deviations is $2\sqrt{35/6}$ ≈ 4.83. Only the results 2 and 12 are further than this
distance from the expected value, so the probability that Y will be more than two standard
deviations from average is
\[
P(|Y - \mu| > 2\sigma) = P(Y \in \{2, 12\}) = \frac{1}{36} + \frac{1}{36} = \frac{1}{18}.
\]
There is only about a 5.6% chance that a pair of dice will be more than two standard
deviations from expected value. ■
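The two probabilities in this example can be reproduced by enumerating the distribution of the sum and filtering on the distance from the mean. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction
from itertools import product
from math import sqrt

# Distribution of the sum Y of two fair dice.
pmf = {}
for a, b in product(range(1, 7), repeat=2):
    pmf[a + b] = pmf.get(a + b, 0) + Fraction(1, 36)

mu = sum(y * p for y, p in pmf.items())
sigma = sqrt(sum((y - mu) ** 2 * p for y, p in pmf.items()))

# P(|Y - mu| <= sigma) and P(|Y - mu| > 2 sigma)
within_one = sum(p for y, p in pmf.items() if abs(y - mu) <= sigma)
beyond_two = sum(p for y, p in pmf.items() if abs(y - mu) > 2 * sigma)
print(within_one, beyond_two)
```

Only the comparison against σ uses floating point; the probabilities themselves remain exact fractions.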
Example 4.3.2. If X ∼ Uniform({1, 2, . . . , 100}), what is the probability that X will be
within one standard deviation of expected value? What is the probability it will be more
than two standard deviations from expected value?
For this distribution µ = 50.5 and σ = $\sqrt{9999/12}$ ≈ 28.9. Of the possible values that X
can achieve, only the numbers 22, 23, . . . , 79 fall within one standard deviation of average.
So the desired probability is
\[
P(|X - \mu| \le \sigma) = P(22 \le X \le 79) = \frac{58}{100}.
\]
There is a 58% chance that this random variable will be within one standard deviation of
expected value.
Similarly we can calculate that two standard deviations is $2\sqrt{9999/12}$ ≈ 57.7. Since
µ = 50.5 and since the minimal and maximal values of X are 1 and 100 respectively, results
that are more than two standard deviations from average cannot happen at all for
this random variable. In other words P (|X − µ| > 2σ ) = 0. ■
The examples of the previous section show that the exact probabilities a random variable
will fall within a certain number of standard deviations of its expected value depend on the
distribution of the random variable. However, there are some general results that apply to
all random variables. To prove these results we will need to investigate some inequalities.
Theorem 4.3.3. (Markov's Inequality) Let X be a discrete random variable which takes on
only non-negative values and suppose that X has finite expected value µ. Then for any c > 0,
\[
P(X \ge c) \le \frac{\mu}{c}.
\]
Proof - Let T be the range of X, so T is a countable subset of the non-negative real numbers.
By dividing T into those numbers smaller than c and those numbers that are at least as
large as c we have
\[
\begin{aligned}
\mu &= \sum_{t \in T} t \cdot P(X = t) \\
&= \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \ge c} t \cdot P(X = t).
\end{aligned}
\]
The first sum must be non-negative, since we assumed that T consisted of only non-negative
numbers, so we only make the quantity smaller by deleting it. Likewise, for each term in
the second sum, t ≥ c so we only make the quantity smaller by replacing t by c. This gives
us
\[
\begin{aligned}
\mu &= \sum_{t \in T,\, t < c} t \cdot P(X = t) + \sum_{t \in T,\, t \ge c} t \cdot P(X = t) \\
&\ge \sum_{t \in T,\, t \ge c} c \cdot P(X = t) \\
&= c \cdot \sum_{t \in T,\, t \ge c} P(X = t).
\end{aligned}
\]
The events (X = t) indexed over all values t ∈ T for which t ≥ c are a countable collection
of disjoint sets whose union is (X ≥ c). So,
\[
\begin{aligned}
\mu &\ge c \cdot \sum_{t \in T,\, t \ge c} P(X = t) \\
&= c\,P(X \ge c).
\end{aligned}
\]
Dividing both sides by c > 0 gives P (X ≥ c) ≤ µ/c. ■
Markov’s theorem can be useful in its own right for producing an upper bound on the
likelihood of certain events, but for now we will use it simply as a lemma to prove our next
result.
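Theorem 4.3.4. (Chebychev's Inequality) Let X be a discrete random variable with finite
expected value µ and finite, non-zero variance σ^2. Then for any k > 0,
\[
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.
\]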
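Proof - The event (|X − µ| ≥ kσ) is the same as the event ((X − µ)^2 ≥ k^2 σ^2). The
random variable (X − µ)^2 is certainly non-negative and its expected value is the variance
of X which we have assumed to be finite. Therefore we may apply Markov’s inequality to
(X − µ)^2 to get
\[
P(|X - \mu| \ge k\sigma) = P((X - \mu)^2 \ge k^2\sigma^2) \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.
\]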
Though the theorem is true for all k > 0, it doesn’t give any useful information unless
k > 1.
Example 4.3.5. Let X be a discrete random variable. Find an upper bound on the
likelihood that X will be more than two standard deviations from its expected value.
For the question to make sense we need to assume that X has finite variance to begin
with. In which case we may apply Chebychev’s inequality with k = 2 to find that
\[
P(|X - \mu| > 2\sigma) \le P(|X - \mu| \ge 2\sigma) \le \frac{1}{4}.
\]
There is at most a 25% chance that a random variable will be more than two standard
deviations from its expected value. ■
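Chebychev's bound holds for every distribution, so for any particular one it is usually far from sharp. The sketch below compares the actual tail probability with the bound 1/k^2 for a Uniform({1, . . . , 100}) variable; the function name `tail_probability` and the chosen values of k are illustrative assumptions.

```python
from math import sqrt


def tail_probability(pmf, k):
    """Return P(|X - mu| >= k * sigma) for a finite distribution {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    sigma = sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))
    return sum(p for x, p in pmf.items() if abs(x - mu) >= k * sigma)


uniform = {x: 1 / 100 for x in range(1, 101)}   # Uniform({1, ..., 100})

for k in (1.2, 1.5, 2, 3):
    # actual tail probability next to the Chebychev bound
    print(k, tail_probability(uniform, k), 1 / k**2)
```

For k = 2 the actual probability is zero while the bound is 1/4, which illustrates how conservative a distribution-free bound can be.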
exercises
(b) Calculate P (|X − µ| ≤ σ ), the probability that X will be within one standard
deviation of average. Approximate your answer to the nearest tenth of a percent.
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard
deviations from average. Approximate your answer to the nearest tenth of a percent.
(b) Calculate P (|X − µ| ≤ σ ), the probability that X will be within one standard
deviation of average. Approximate your answer to the nearest tenth of a percent.
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard
deviations from average. Approximate your answer to the nearest tenth of a percent.
(b) Calculate P (|X − µ| ≤ σ ), the probability that X will be within one standard
deviation of average. Approximate your answer to the nearest tenth of a percent.
(c) Calculate P (|X − µ| > 2σ ), the probability that X will be more than two standard
deviations from average. Approximate your answer to the nearest tenth of a percent.
Ex. 4.3.4. Let X ∼ Binomial (n, 1/2). Determine the smallest value of n for which P (|X −
µ| > 4σ ) > 0. That is, what is the smallest n for which there is a positive probability that
X will be more than four standard deviations from average?
Ex. 4.3.5. For k ≥ 1 there are distributions for which Chebychev’s inequality is an equality.
Ex. 4.3.6. Let X be a discrete random variable with finite expected value µ and finite
variance σ 2 .
(e) Use parts (b) and (d) to derive a contradiction. Note that this proves that the
assumption that was made in part (d), namely that P (|X − µ| > σ ) = 1, cannot be
true for any discrete random variable where µ and σ are finite quantities. In other
words, no random variable can produce only values that are more than one standard
deviation from average.
Ex. 4.3.7. Let X be a discrete random variable with finite expected value and finite
variance.
(b) Prove that if P (|X − µ| > σ ) > 0 then P (|X − µ| < σ ) > 0. (If a random variable is
able to produce values more than one standard deviation from average, it must also be
able to produce values that are less than one standard deviation from average).
In previous chapters we saw that information that a particular event had occurred could
substantially change the probability associated with another event. That realization led us
to the notion of conditional probability. It is also reasonable to ask how such information
might affect the expected value or variance of a random variable.
\[
E[X|A] = \sum_{t \in T} t \cdot P(X = t|A),
\]
Example 4.4.2. A die is rolled. What are the expected value and variance of the result
given that the roll was even?
Let X be the die roll. Then X ∼ Uniform({1, 2, 3, 4, 5, 6}), but conditioned on the
event A that the roll was even, this changes so that
\[
P(X = 2|A) = P(X = 4|A) = P(X = 6|A) = \frac{1}{3}.
\]
Therefore,
\[
E[X|A] = 2\left(\frac{1}{3}\right) + 4\left(\frac{1}{3}\right) + 6\left(\frac{1}{3}\right) = 4.
\]
Note that the (unconditioned) expected value of a die roll is E[X] = 3.5, so the knowledge
of event A slightly increases the expected value of the die roll.
The conditional variance is
\[
\mathrm{Var}[X|A] = (2 - 4)^2\left(\frac{1}{3}\right) + (4 - 4)^2\left(\frac{1}{3}\right) + (6 - 4)^2\left(\frac{1}{3}\right) = \frac{8}{3}.
\]
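Conditioning on an event amounts to discarding the outcomes outside the event and renormalizing what remains. The sketch below does this for the even-roll event; the helper name `condition` is an illustrative choice.

```python
from fractions import Fraction

die = {face: Fraction(1, 6) for face in range(1, 7)}


def condition(pmf, event):
    """Return the conditional distribution P(X = t | A), where A = {X satisfies event}."""
    p_event = sum(p for x, p in pmf.items() if event(x))
    return {x: p / p_event for x, p in pmf.items() if event(x)}


given_even = condition(die, lambda x: x % 2 == 0)
cond_mean = sum(x * p for x, p in given_even.items())
cond_var = sum((x - cond_mean) ** 2 * p for x, p in given_even.items())
print(given_even, cond_mean, cond_var)
```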
In many cases the event A on which an expected value is conditioned will be described in
terms of another random variable. For instance E [X|Y = y ] is the conditional expectation
of X given that variable Y has taken on the value y.
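Example 4.4.3. Cards are drawn from an ordinary deck of 52, one at a time, randomly
and with replacement. Let X and Y denote the number of draws until the first king and
first ace are drawn, respectively. We are interested in, say, E[X|Y = 3]. When Y = 3 an
ace was seen on draw 3, but not on draws 1 or 2. Hence
\[
P(\text{king on draw } n \mid Y = 3) =
\begin{cases}
\frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\
0 & \text{if } n = 3 \\
\frac{4}{52} & \text{if } n > 3
\end{cases}
\]
so that
\[
P(X = n \mid Y = 3) =
\begin{cases}
\left(\frac{44}{48}\right)^{n-1} \frac{4}{48} & \text{if } n = 1 \text{ or } 2 \\
0 & \text{if } n = 3 \\
\left(\frac{44}{48}\right)^{2} \left(\frac{48}{52}\right)^{n-4} \frac{4}{52} & \text{if } n > 3
\end{cases}
\]
For example, when n > 3, in order to have X = n a non-king must have been seen on
draws 1 and 2 (each with probability 44/48), a non-king must have resulted on draw 3 (which
is automatic, since an ace was drawn), a non-king must have been seen on each of draws 4
through n − 1 (each with probability 48/52), and finally a king was produced on draw n (with
probability 4/52).
Hence,
\[
\begin{aligned}
E[X|Y = 3] &= \sum_{n=1}^{2} n \left(\frac{44}{48}\right)^{n-1} \frac{4}{48} + \sum_{n=4}^{\infty} n \left(\frac{44}{48}\right)^{2} \left(\frac{48}{52}\right)^{n-4} \frac{4}{52} \\
&= \sum_{n=1}^{2} n \left(\frac{44}{48}\right)^{n-1} \frac{4}{48} + \sum_{m=0}^{\infty} (m + 4) \left(\frac{44}{48}\right)^{2} \left(\frac{48}{52}\right)^{m} \frac{4}{52}.
\end{aligned}
\]
But
\[
\begin{aligned}
\sum_{m=0}^{\infty} (m + 4) r^m &= \sum_{m=0}^{\infty} 3r^m + \sum_{m=0}^{\infty} \frac{d}{dr} r^{m+1} \\
&= \frac{3}{1 - r} + \frac{d}{dr}\left(\frac{r}{1 - r}\right) \\
&= \frac{3}{1 - r} + \frac{1}{(1 - r)^2},
\end{aligned}
\]
so
\[
\begin{aligned}
E[X|Y = 3] &= \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2} \frac{4}{52} \left(\frac{3}{1 - (48/52)} + \frac{1}{(1 - (48/52))^2}\right) \\
&= \frac{4}{48} + 2\left(\frac{44}{48}\right)\frac{4}{48} + \left(\frac{44}{48}\right)^{2} \frac{4}{52} \left(\frac{3 \times 52}{4} + \frac{52^2}{4^2}\right) \\
&= \frac{1}{12} + 2\left(\frac{11}{12}\right)\frac{1}{12} + 3\left(\frac{11}{12}\right)^{2} + \frac{52}{4}\left(\frac{11}{12}\right)^{2} \\
&= \frac{985}{72} \approx 13.68.
\end{aligned}
\]
Given that the first ace appeared on draw 3, it takes an average of between 13 and 14
draws until the first king appears. Compare this to the unconditional E[X]. Since X ∼
Geometric(4/52) we know E[X] = 52/4 = 13. In other words, on average it takes 13 draws
to observe the first king. But given that the first ace appeared on draw three, we should
expect to need about 0.68 draws more (on average) to see the first king. ■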
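The value 985/72 can be checked by summing the conditional distribution directly. The sketch below truncates the infinite sum at n = 2000, an arbitrary cutoff chosen so that the neglected terms are far below floating-point precision; the function name is illustrative.

```python
def conditional_pmf(n):
    """P(X = n | Y = 3): first king on draw n, given the first ace came on draw 3."""
    if n in (1, 2):
        return (44 / 48) ** (n - 1) * (4 / 48)
    if n == 3:
        return 0.0
    return (44 / 48) ** 2 * (48 / 52) ** (n - 4) * (4 / 52)


# Truncate the infinite sums; terms beyond n = 2000 are negligibly small.
ns = range(1, 2001)
total_probability = sum(conditional_pmf(n) for n in ns)
conditional_mean = sum(n * conditional_pmf(n) for n in ns)
print(total_probability, conditional_mean, 985 / 72)
```

Checking that the conditional probabilities sum to 1 is a useful guard against an error in the piecewise formula.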
Recall how Theorem 1.3.2 described a way in which a non-conditional probability could
be calculated in terms of conditional probabilities. There is an analogous theorem for
expected value.
\[
= \sum_{t \in T} t \cdot P(X = t) = E[X].
\]
■
Example 4.4.5. A venture capitalist estimates that regardless of whether the economy
strengthens, weakens, or remains the same in the next fiscal quarter, a particular investment
could either gain or lose money. However, he figures that if the economy strengthens,
the investment should, on average, earn 3 million dollars. If the economy remains the
same, he figures the expected gain on the investment will be 1 million dollars, while if the
economy weakens, the investment will, on average, lose 1 million dollars. He also trusts
economic forecasts which predict a 50% chance of a weaker economy, a 40% chance of a
stagnant economy, and a 10% chance of a stronger economy. What should he calculate is
the expected return on the investment?
Let X be the return on investment and let A, B, and C represent the events that the
economy will be stronger, the same, and weaker in the next quarter, respectively. Then
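the estimates on return give the following information in millions:
\[
E[X|A] = 3, \quad E[X|B] = 1, \quad E[X|C] = -1.
\]
Therefore,
\[
E[X] = E[X|A]P(A) + E[X|B]P(B) + E[X|C]P(C) = 3(0.1) + 1(0.4) + (-1)(0.5) = 0.2.
\]
The expected return on the investment is 0.2 million dollars. ■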
Theorem 4.4.6. Let X and Y be two discrete random variables on a sample space
S with Y : S → T . Let g : T → R be defined as g (y ) = E [X|Y = y ]. Then
E [g (Y )] = E [X ].
It is common to use E [X|Y ] to denote g (Y ) after which the theorem may be expressed
as E [E [X|Y ]] = E [X ]. This can be slightly confusing notation, but one must keep
in mind that the exterior expected value in the expression E [E [X|Y ]] refers to the
average of E [X|Y ] viewed as a function of Y .
Proof - As y ranges over T , the events (Y = y ) are disjoint and cover all of S. Therefore,
by Theorem 4.4.4,
\[
\begin{aligned}
E[g(Y)] &= \sum_{y \in T} g(y)P(Y = y) \\
&= \sum_{y \in T} E[X|Y = y]P(Y = y) \\
&= E[X].
\end{aligned}
\]
Example 4.4.7. Let Y ∼ Uniform({1, 2, . . . , n}) and let X be the number of heads on Y
flips of a coin. What is the expected value of X?
Without Theorem 4.4.6 this problem would require computing many complicated
probabilities. However, it is made much simpler by noting that the distribution of X is
given conditionally by (X|Y = j) ∼ Binomial(j, 1/2). Therefore we know E[X|Y = j] = j/2.
Using the notation above, this may be written as E[X|Y] = Y/2, after which
\[
E[X] = E[E[X|Y]] = E\left[\frac{Y}{2}\right] = \frac{1}{2} \cdot \frac{n + 1}{2} = \frac{n + 1}{4}.
\]
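The "many complicated probabilities" that the conditioning argument avoids can still be summed by a computer, which gives an independent check of the answer (n + 1)/4. A sketch, with an illustrative function name:

```python
from fractions import Fraction
from math import comb


def expected_heads(n):
    """E[X] where Y ~ Uniform({1..n}) and X | Y = j ~ Binomial(j, 1/2), by direct summation."""
    total = Fraction(0)
    for j in range(1, n + 1):
        for k in range(j + 1):
            # k * P(Y = j) * P(X = k | Y = j)
            total += k * Fraction(1, n) * Fraction(comb(j, k), 2**j)
    return total


for n in (1, 2, 10):
    print(n, expected_heads(n), Fraction(n + 1, 4))
```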
Therefore,
\[
\sum_{i=1}^{\infty} \left(\mathrm{Var}[X|B_i] + (E[X|B_i])^2\right)P(B_i) = \sum_{i=1}^{\infty} E[X^2|B_i]P(B_i),
\]
but the right hand side of this equation is E [X 2 ] from Theorem 4.4.4. The fact that
V ar [X ] = E [X 2 ] − (E [X ])2 completes the proof of the theorem. ■
As with expected value, this formula may be rewritten in a different form if the
conditioning events describe the outcomes of a random variable.
(3) V ar [E [X|Y ]] = E [(E [X|Y ])2 ] − (E [E [X|Y ]])2 = E [(E [X|Y ])2 ] − (E [X ])2 .
■
Example 4.4.10. The number of eggs N found in nests of a certain species of turtles has
a Poisson distribution with mean λ. Each egg has probability p of being viable and this
event is independent from egg to egg. Find the mean and variance of the number of viable
eggs per nest.
Let N be the total number of eggs in a nest and X the number of viable ones. Then if
N = n, X has a Binomial distribution with number of trials n and probability p of success
for each trial. Thus, if N = n, X has mean np and variance np(1 − p). That is,
\[
E[X|N = n] = pn; \quad \mathrm{Var}[X|N = n] = p(1 - p)n,
\]
or
\[
E[X|N] = pN; \quad \mathrm{Var}[X|N] = p(1 - p)N.
\]
Hence
E [X ] = E [E [X|N ]] = E [pN ] = pE [N ] = pλ
and
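\[
\begin{aligned}
\mathrm{Var}[X] &= E[\mathrm{Var}[X|N]] + \mathrm{Var}[E[X|N]] \\
&= E[p(1 - p)N] + \mathrm{Var}[pN] = p(1 - p)E[N] + p^2\,\mathrm{Var}[N].
\end{aligned}
\]
Since N has a Poisson distribution with mean λ, E[N] = Var[N] = λ, and so
Var[X] = p(1 − p)λ + p^2 λ = pλ.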
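The mean and variance of the number of viable eggs can also be computed numerically from the joint distribution of (N, X). The sketch below uses the illustrative values λ = 3 and p = 0.25, truncates the Poisson distribution at 60 (an assumption that discards only a negligible amount of probability for this λ), and relies on the standard fact that a Poisson variable has variance equal to its mean.

```python
from math import comb, exp, factorial

lam, p = 3.0, 0.25   # illustrative values of lambda and p
N_MAX = 60           # truncation point; Poisson(3) mass beyond 60 is negligible


def poisson_pmf(n):
    """P(N = n) for N ~ Poisson(lam)."""
    return exp(-lam) * lam**n / factorial(n)


def binomial_pmf(n, k):
    """P(X = k | N = n) for X | N = n ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# E[X] and E[X^2] by summing over the joint distribution of (N, X).
first = sum(poisson_pmf(n) * binomial_pmf(n, k) * k
            for n in range(N_MAX + 1) for k in range(n + 1))
second = sum(poisson_pmf(n) * binomial_pmf(n, k) * k * k
             for n in range(N_MAX + 1) for k in range(n + 1))
variance = second - first**2

print(first, p * lam)                               # E[X] next to p * lambda
print(variance, p * (1 - p) * lam + p**2 * lam)     # Var[X] next to the formula
```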
exercises
Ex. 4.4.1. Let X ∼ Geometric(p) and let A be event (X ≤ 3). Calculate E [X|A] and
V ar [X|A].
Ex. 4.4.2. Calculate the variance of the quantity X from Example 4.4.7.
Ex. 4.4.3. Return to Example 4.4.5. Suppose that, in addition to the estimates on average
return, the investor had estimates on the standard deviations. If the economy strengthens
or weakens, the estimated standard deviation is 3 million dollars, but if the economy stays
the same, the estimated standard deviation is 2 million dollars. So, in millions of dollars,
\[
SD[X|A] = 3, \quad SD[X|B] = 2, \quad SD[X|C] = 3.
\]
Use this information, together with the conditional expectations from Example 4.4.5, to
calculate Var[X].
Ex. 4.4.4. A standard light bulb has an average lifetime of four years with a standard
deviation of one year. A Super D-Lux light bulb has an average lifetime of eight years
with a standard deviation of three years. A box contains many bulbs – 90% of which are
standard bulbs and 10% of which are Super D-Lux bulbs. A bulb is selected at random
from the box. What are the average and standard deviation of the lifetime of the selected
bulb?
          X = −1   X = 0   X = 1
Y = −1     1/15    2/15    2/15
Y = 0      2/15    1/15    2/15
Y = 1      2/15    2/15    1/15
(b) Show that E [X|X = x] = x (and so E [X|X ] = X). (From results in this section we
know E [X|Y ] is always a random variable with expected value equal to E [X ]. The
results above in some sense show two extremes. When X and Y are independent,
E [X|Y ] is a constant random variable E [X ]. When X and Y are equal, E [X|X ] is
just X itself).
Ex. 4.4.7. Let X ∼ Uniform {1, 2, . . . , n} be independent of Y ∼ Uniform {1, 2, . . . , n}.
Let Z = max(X, Y ) and W = min(X, Y ).
(a) Find the joint distribution of (Z, W ).
(b) Find E [Z | W ].
When faced with two different random variables, we are frequently interested in how the two
different quantities relate to each other. Often the purpose of this is to predict something
about one variable knowing information about the other. For instance, if rainfall amounts
in July affect the quantity of corn harvested in August, then a farmer, or anyone else keenly
interested in the supply and demand of the agriculture industry, would like to be able to
use the July information to help make predictions about August costs.
4.5.1 Covariance
Just as we developed the concepts of expected value and standard deviation to summarize
a single random variable, we would like to develop a number that describes something
about how two different random variables X and Y relate to each other.
Definition 4.5.1. The covariance of X and Y is
\[
\mathrm{Cov}[X, Y] = E[(X - E[X])(Y - E[Y])].
\]
Since it is defined in terms of an expected value, there is the possibility that the
covariance may be infinite or not defined at all because the sum describing the
expectation is divergent.
Notice that if X is larger than its average at the same time that Y is larger than
its average (or if X is smaller than its average at the same time Y is smaller than its
average) then (X − E [X ])(Y − E [Y ]) will contribute a positive result to the expected
value describing the covariance. Conversely, if X is smaller than E [X ] while Y is larger
than E [Y ] or vice versa, a negative result will be contributed toward the covariance. This
means that when two variables tend to be both above average or both below average
simultaneously, the covariance will typically be positive (and the variables are said to be
positively correlated ), but when one variable tends to be above average when the other
is below average, the covariance will typically be negative (and the variables are said to
be negatively correlated ). When Cov [X, Y ] = 0 the variables X and Y are said to be
“uncorrelated”.
For example, suppose X and Y are the height and weight, respectively, of an individual
randomly selected from a large population. We might expect that Cov [X, Y ] > 0 since
people who are taller than average also tend to be heavier than average and people who are
shorter than average tend to be lighter. Conversely suppose X and Y represent elevation
and air density at a randomly selected point on Earth. We might expect Cov [X, Y ] < 0
since locations at a higher elevation tend to have thinner air.
Example 4.5.2. Consider a pair of random variables X and Y with joint distribution
          X = −1   X = 0   X = 1
Y = −1     1/15     2/15    2/15
Y = 0      2/15     1/15    2/15
Y = 1      2/15     2/15    1/15
while when X = 1, then Y is more likely to be below average than above. This suggests
the two random variables should have a negative correlation. In fact, we can calculate
\[ E[XY] = (-1)\left(\frac{4}{15}\right) + 0\left(\frac{9}{15}\right) + 1\left(\frac{2}{15}\right) = -\frac{2}{15}, \]
Cov [X, X ] = V ar [X ].
Theorem 4.5.4. Let X and Y be discrete random variables with finite mean for
which E [XY ] is also finite. Then Cov [X, Y ] = E [XY ] − E [X ]E [Y ].
As with the expected value, the covariance is a linear quantity. It is also related to the
concept of independence.
(d) If X and Y are independent with a finite covariance, then Cov [X, Y ] = 0.
Therefore, reversing the roles of X and Y does not change the correlation.
Proof of (2) - This follows from linearity properties of expected value. Using Theorem
4.5.4
Proof of (3) - This proof is essentially the same as that of (2) and is left as an exercise.
Proof of (4) - We have previously seen that if X and Y are independent, then E [XY ] =
E [X ]E [Y ]. Using Theorem 4.5.4 it follows that
Though independence of X and Y guarantees that they are uncorrelated, the converse
is not true. It is possible that Cov [X, Y ] = 0 and yet that X and Y are dependent, as the
next example shows.
Example 4.5.6. Let X, Y be two discrete random variables taking values {−1, 1}. Suppose
their joint distribution P (X = x, Y = y ) is given by the table
x=-1 x=1
Moreover,
4.5.2 Correlation
The possible size of Cov [X, Y ] has upper and lower bounds based on the standard deviations
of the two variables.
Theorem 4.5.7. Let X and Y be two discrete random variables both with finite
variance. Then
\[ -\sigma_X \sigma_Y \le \mathrm{Cov}[X, Y] \le \sigma_X \sigma_Y, \]
and therefore
\[ -1 \le \frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} \le 1. \]
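Proof - Standardize both variables and consider the expected value of their sum squared.
Since this is the expected value of a non-negative quantity,
\[
\begin{aligned}
0 &\le E\left[\left(\frac{X - \mu_X}{\sigma_X} + \frac{Y - \mu_Y}{\sigma_Y}\right)^2\right] \\
  &= E\left[\frac{(X - \mu_X)^2}{\sigma_X^2} + 2\,\frac{(X - \mu_X)(Y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(Y - \mu_Y)^2}{\sigma_Y^2}\right] \\
  &= \frac{E[(X - \mu_X)^2]}{\sigma_X^2} + \frac{2E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} + \frac{E[(Y - \mu_Y)^2]}{\sigma_Y^2} \\
  &= 1 + 2\,\frac{\mathrm{Cov}[X, Y]}{\sigma_X \sigma_Y} + 1.
\end{aligned}
\]
Rearranging this inequality gives −σX σY ≤ Cov [X, Y ]. A similar computation (see Exercises)
for the expected value of the squared difference of the standardized variables shows
\[ \mathrm{Cov}[X, Y] \le \sigma_X \sigma_Y. \]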
Definition 4.5.8. The quantity Cov [X, Y ]/(σX σY ) from Theorem 4.5.7 is known as
the “correlation” of X and Y and is often denoted as ρ[X, Y ]. Thinking in terms
of dimensional analysis, both the numerator and denominator include the units of
X and the units of Y . The correlation, therefore, has no units associated with it.
It is thus a dimensionless rescaling of the covariance and is frequently used as an
absolute measure of trends between the two variables.
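As a quick numerical check of these definitions, the following R sketch computes the covariance and correlation for the joint distribution tabulated in Example 4.5.2. It relies on the standard identity Cov [X, Y ] = E [XY ] − E [X ]E [Y ]; the variable names (vals, joint, cov_xy, rho) are illustrative choices and are not notation from the text.

```r
# Joint pmf of Example 4.5.2: rows index Y = -1, 0, 1; columns index X = -1, 0, 1.
vals  <- c(-1, 0, 1)
joint <- matrix(c(1, 2, 2,
                  2, 1, 2,
                  2, 2, 1) / 15, nrow = 3, byrow = TRUE)

p_x <- colSums(joint)   # marginal distribution of X
p_y <- rowSums(joint)   # marginal distribution of Y

e_x  <- sum(vals * p_x)
e_y  <- sum(vals * p_y)
e_xy <- sum(outer(vals, vals) * joint)   # entry [i, j] of outer() is y_i * x_j

cov_xy <- e_xy - e_x * e_y
sd_x   <- sqrt(sum(vals^2 * p_x) - e_x^2)
sd_y   <- sqrt(sum(vals^2 * p_y) - e_y^2)
rho    <- cov_xy / (sd_x * sd_y)

print(c(covariance = cov_xy, correlation = rho))
```

Both marginals are uniform on {−1, 0, 1}, so the covariance equals E [XY ] = −2/15 and dividing by σX σY = 2/3 gives a correlation of −1/5, a mild negative association.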
exercises
Ex. 4.5.1. Consider the experiment of flipping two coins. Let X be the number of heads
among the coins and let Y be the number of tails among the coins.
Ex. 4.5.2. Let X ∼ Uniform({0, 1, 2}) and let Y be the number of heads in X flips of a
coin.
V ar [X + Y ] = V ar [X ] + V ar [Y ] + 2Cov [X, Y ].
(b) Use (a) to conclude that when X and Y are positively correlated, then V ar [X + Y ] >
V ar [X ] + V ar [Y ], while when X and Y are negatively correlated, V ar [X + Y ] <
V ar [X ] + V ar [Y ].
(c) Suppose Xi , 1 ≤ i ≤ n, are discrete random variables with finite variances and
covariances. Use induction and (a) to conclude that
\[ \mathrm{Var}\left[\sum_{i=1}^{n} X_i\right] = \sum_{i=1}^{n} \mathrm{Var}[X_i] + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}[X_i, X_j]. \]
f (x1 , x2 , . . . , xn ) = P (X1 = x1 , . . . Xn = xn )
is a symmetric function.
Example 4.6.3. Suppose we have an urn of m distinct objects labelled {1, 2, . . . , m}.
Objects are drawn at random from the urn without replacement until the urn is empty.
Let Xi be the label of the i-th object that is drawn. Then X1 , X2 , . . . , Xm is a particular
ordering of the objects in the urn. Since each ordering is equally likely and there are m!
possible orderings, the joint probability mass function is
\[ f(x_1, x_2, \ldots, x_m) = P(X_1 = x_1, X_2 = x_2, \ldots, X_m = x_m) = \frac{1}{m!}, \]
Proof - The random variables (X1 , X2 , . . . , Xn ) are exchangeable. Then we have for
any permutation σ and xi ∈ Range(Xi )
As this is true for all permutations σ all the random variables must have same range.
Otherwise, if any two of them differed, then we could get a contradiction by choosing an
appropriate permutation.
A = {xj ∈ T : 1 ≤ j ≤ n, j ≠ 1, i}.
Using the exchangeable property with the permutation σ given by σ (i) = 1, σ (1) = i
and σ (j ) = j for all j ≠ 1, i, we have that for any x2 , . . . , xi−1 , xi+1 , . . . , xn ∈ A
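Therefore,
\[
\begin{aligned}
P(X_1 = a) &= P\Big(\bigcup_{b \in T} \{X_1 = a, X_i = b\}\Big) \\
 &= \sum_{b \in T} P(X_1 = a, X_i = b) \\
 &= \sum_{b \in T} \sum_{x_j \in A} P(X_1 = a, X_2 = x_2, \ldots, X_i = b, \ldots, X_n = x_n) \\
 &= \sum_{b \in T} \sum_{x_j \in A} P(X_1 = b, X_2 = x_2, \ldots, X_i = a, \ldots, X_n = x_n) \\
 &= P(X_i = a).
\end{aligned}
\]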
So the distribution of Xi is the same as the distribution of X1 and hence all of them have
the same distribution. ■
Example 4.6.5. (Sampling without Replacement) An urn contains b black balls and
r red balls. A ball is drawn at random and its colour noted. This procedure is repeated
n times. Assume that n ≤ b + r. Let max(0, n − r) ≤ k ≤ min(n, b). In this example we
examine the random variables Xi given by
We have already seen (see Theorem 2.3.2 and Example 2.3.1) that
\[ P(k \text{ black balls are drawn in } n \text{ draws}) = \binom{n}{k} \frac{\prod_{i=0}^{k-1} (b - i) \, \prod_{i=0}^{n-k-1} (r - i)}{\prod_{i=0}^{n-1} (r + b - i)}. \]
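The product formula above can be checked numerically against R's built-in hypergeometric mass function dhyper. The sketch below is only a sanity check under illustrative assumptions: the urn sizes b = 5, r = 7 and the number of draws n = 4 are made up, and the helper names falling and p_black are hypothetical.

```r
# Falling product x (x - 1) ... (x - j + 1); equals 1 when j = 0.
falling <- function(x, j) prod(x - (seq_len(j) - 1))

# Probability of k black balls in n draws without replacement (product form).
p_black <- function(k, n, b, r) {
  choose(n, k) * falling(b, k) * falling(r, n - k) / falling(b + r, n)
}

b <- 5; r <- 7; n <- 4            # illustrative values, not from the text
k_vals <- max(0, n - r):min(n, b)

from_product <- sapply(k_vals, p_black, n = n, b = b, r = r)
from_dhyper  <- dhyper(k_vals, m = b, n = r, k = n)  # m black, n red, k draws

print(cbind(k = k_vals, from_product, from_dhyper))
```

Note that dhyper names its arguments differently from the text: its m and n are the two colour counts and its k is the number of draws.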
Using the same proof we see that the joint probability mass function of (X1 , X2 , . . . , Xn )
is given by
where xi ∈ {0, 1}. It is clear from the right hand side of the above that the function f
depends only on the sum \(\sum_{i=1}^{n} x_i\). Hence any permutation of the xi ’s will not change the value
\[ P(X_i = 1) = P(X_1 = 1) = \frac{b}{b + r}. \]
So we can conclude that they are all identically distributed as Bernoulli(b/(b + r)) and the
probability of choosing a black ball in the i-th draw is b/(b + r) (see Exercise 4.6.4 for a similar
result). Further for any i, j
exercises
Ex. 4.6.1. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. For any 2 ≤ m < n,
show that X1 , X2 , . . . , Xm are also a collection of exchangeable random variables.
Ex. 4.6.2. Suppose X1 , X2 , . . . , Xn are exchangeable random variables. Let T denote their
common range. Suppose b : T → R. Show that b(X1 ), b(X2 ), . . . , b(Xn ) is also a collection
of exchangeable random variables.
Ex. 4.6.3. Suppose n cards are drawn from a standard pack of 52 cards without replacement
(so we will assume n ≤ 52). For 1 ≤ i ≤ n, let Xi be random variables given by
(a) Suppose n = 52. Using Example 4.6.3 and Exercise 4.6.2 show that (X1 , X2 , X3 , . . . , Xn )
are exchangeable.
(b) Show that (X1 , X2 , X3 , . . . , Xn ) are exchangeable for any 2 ≤ n ≤ 52. Hint: If
n < 52 extend the sample to exhaust the deck of cards. Use (a) and Exercise 4.6.1.
(c) Find the probability that the second and fourth card drawn have the same colour.
Ex. 4.6.4. (Polya Urn Scheme) An urn contains b black balls and r red balls. A ball is
drawn at random and its colour noted. Then it is replaced along with c ≥ 0 balls of the
same colour. This procedure is repeated n times.
(c) Let 1 ≤ m ≤ n. Let Bm be the event that the m-th ball drawn is black. Show that
\[ P(B_m) = \frac{b}{b + r}. \]
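Before attempting a proof, the claim in part (c) can be explored numerically. The R sketch below computes P (Bm ) exactly by tracking the distribution of the number of black balls drawn so far; it is a check rather than a proof, and the function name polya_prob_black and the argument add (standing for the c of the exercise) are hypothetical.

```r
# Exact P(m-th ball is black) in the Polya urn scheme.
# b, r: initial black and red counts; add: balls of the drawn colour added back.
polya_prob_black <- function(m, b, r, add) {
  # dist[i + 1] = P(exactly i black balls among the draws made so far)
  dist <- 1
  for (j in seq_len(m - 1)) {
    i       <- 0:(j - 1)                        # possible black counts before draw j
    p_black <- (b + i * add) / (b + r + (j - 1) * add)
    new     <- numeric(j + 1)
    new[i + 1] <- new[i + 1] + dist * (1 - p_black)   # draw j is red
    new[i + 2] <- new[i + 2] + dist * p_black         # draw j is black
    dist <- new
  }
  i <- 0:(m - 1)
  sum(dist * (b + i * add) / (b + r + (m - 1) * add))
}

# Illustrative urn: 3 black, 5 red, 2 extra balls added after each draw.
probs <- sapply(1:8, polya_prob_black, b = 3, r = 5, add = 2)
print(probs)
```

With these illustrative values every entry equals b/(b + r) = 3/8, whatever the value of m.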
Suppose we want to randomly select a number on the interval (0, 1) in some uniform way.
In the discrete setting we would have said that “uniform” meant that every outcome in
our sample space S = (0, 1) was equally likely. Suppose we took that same approach here
and declared that there was some value p for which P ({x}) = p for every x ∈ (0, 1). Then
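if we let E be the event E = {1/n : n = 2, 3, 4, . . . } ⊂ S, we find that
\[
\begin{aligned}
P(E) &= P\left(\left\{\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{4}, \ldots\right\}\right) \\
     &= P\left(\left\{\tfrac{1}{2}\right\}\right) + P\left(\left\{\tfrac{1}{3}\right\}\right) + P\left(\left\{\tfrac{1}{4}\right\}\right) + \cdots \\
     &= p + p + p + \cdots
\end{aligned}
\]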
If p > 0 this sum diverges to infinity, which cannot be since it describes a probability.
Therefore it must be that p = 0. If every individual outcome in S = (0, 1) is equally likely,
then each outcome must have a probability of zero. After several chapters considering only
discrete probabilities many readers may suspect that this, in and of itself, is a contradiction.
How is it possible for P (S ) = 1 when every single element of S has probability zero? Could
not one then show
\[
\begin{aligned}
P(S) &= P\Big(\bigcup_{s \in S} \{s\}\Big) \\
     &= \sum_{s \in S} P(\{s\}) \\
     &= \sum_{s \in S} 0 \\
     &= 0
\end{aligned}
\]
using the probability axioms? The answer to that question is “no”. The probability space
axiom that allows us to write the probability of a disjoint union as the sum of separate
probabilities only applies to countable collections of events. But the events {s} that
combine to create (0, 1) are an uncountable collection. If S is uncountable, we could still
have P (S ) = 1 even if every individual element s ∈ S has probability zero.
However, all of that does not yet explain how to define a uniform probability on
(0, 1). Knowing that each individual outcome has probability zero does not tell us how to
calculate P ([1/4, 3/4]), for example, since we cannot simply add up the probabilities of each
of the constituent outcomes individually. Instead we need to reinterpret what we mean
by “uniform” in this situation. It would make sense to suggest that the event [1/4, 3/4] should
have a probability of 1/2 since its length is exactly half of the length of (0, 1). Indeed it is
tempting (and essentially correct) to declare that P (A) should be the length of the set
A. The complication with making such a statement is that, although length is easy to
define if A is an interval or even a countable collection of disjoint intervals, it is not even
possible to consistently define a length for every single subset of (0, 1). Because of this
unfortunate fact, we will need to reconsider which subsets of S are actually events which
will be assigned a probability.
At a minimum we will want events to include any interval. The axioms and basic
properties of probability spaces also require that for any collection of events we must be
able to consider complements and countable unions of these events. Further, the entire
sample space S should also be considered a legitimate event. Consequently we make the
following definition.
(1) S ∈ F
(2) If A ∈ F then Ac ∈ F
(3) If A1 , A2 , . . . is a countable collection of sets in F then \(\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}\)
If S happens to be the set of real numbers there is a smallest σ-field that contains all
intervals, and this collection of subsets of R is known as the Borel sets. It happens that
the concept of the “length” of a set can be consistently described for such sets. Because of
this we will modify our definition of probability space slightly at this point.
Definition 5.1.2. (Probability Space Axioms) Let S be a sample space and let
F be a σ-field of S. A “probability” is a function P : F → [0, 1] such that
(1) P (S ) = 1;
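(2) if E1 , E2 , . . . is a countable collection of disjoint events in F, then
\[ P\Big(\bigcup_{j=1}^{\infty} E_j\Big) = \sum_{j=1}^{\infty} P(E_j). \]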
Our old definition is simply a special case where the σ-field was the collection of all
subsets of S, so all results we have previously seen in the discrete setting are still legitimate
in this new framework. There are many technicalities that arise due to the fact that not
every set may be viewed as an event, but these issues would be distracting from the primary
goal of this text. Thus we give the definitions above only to provide the modern definition
of probability space.
The primary way we will define continuous probabilities on R is through a “density function”.
We begin by providing an example of what should be meant by a uniform distribution on
(0, 1).
For an event A define \(P(A) = \int_A f(x)\,dx\). Note that for an interval A = [a, b] ⊂ (0, 1) it
happens that P (A) is just the length of the interval.
\[
\begin{aligned}
P(A) &= \int_A f(x)\,dx \\
     &= \int_a^b 1\,dx \\
     &= b - a
\end{aligned}
\]
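For disjoint unions of intervals, the lengths simply add. For instance if A = [1/5, 2/5] ∪ [3/5, 4/5],
then
\[
\begin{aligned}
P(A) &= \int_{[\frac{1}{5}, \frac{2}{5}] \cup [\frac{3}{5}, \frac{4}{5}]} f(x)\,dx \\
     &= \int_{1/5}^{2/5} 1\,dx + \int_{3/5}^{4/5} 1\,dx \\
     &= \frac{1}{5} + \frac{1}{5} = \frac{2}{5}
\end{aligned}
\]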
which is the sum of the lengths of the two component intervals. In particular note that
P ((0, 1)) = 1 while P ({c}) = 0 for any c since a single point has no length. Similarly, if
A = [a, b] is an interval that is disjoint from (0, 1), then
\[
\begin{aligned}
P(A) &= \int_A f(x)\,dx \\
     &= \int_a^b 0\,dx \\
     &= 0
\end{aligned}
\]
We will soon see that P defines a probability on R. From the computation above this
probability gives equal likelihood to all equal-width intervals within (0, 1) and assigns zero
probability to any interval outside of (0, 1). Therefore it is consistent with the properties a
uniform probability on (0, 1) should have. ■
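The interval computations in this example can be reproduced numerically. The R sketch below is a rough check, assuming the density is 1 on (0, 1) and 0 elsewhere as described above; the helper names f and prob are illustrative, and integrate is R's general-purpose numerical integrator.

```r
# Density of the uniform distribution on (0, 1).
f <- function(x) ifelse(x > 0 & x < 1, 1, 0)

# Probability of an interval [lo, hi], obtained by numerical integration.
prob <- function(lo, hi) integrate(f, lo, hi)$value

p_single <- prob(1/4, 3/4)                    # an interval of length 1/2
p_union  <- prob(1/5, 2/5) + prob(3/5, 4/5)   # disjoint pieces: lengths add
p_out    <- prob(2, 3)                        # an interval disjoint from (0, 1)

print(c(single = p_single, union = p_union, outside = p_out))
```

The three values agree with the lengths 1/2, 2/5 and 0 computed by hand.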
The function f from the example above is known as a density. What properties must be
required of such a function in order for it to define a probability? The fact that probabilities
cannot be negative suggests we will need to require f (x) to be non-negative for all real
numbers x. The fact that P (S ) = 1 means that \(\int_{-\infty}^{\infty} f(x)\,dx\) has to be 1. It turns out that
these two requirements are essentially all that are needed. The only other assumption we
will make is that a density function be piecewise continuous. Though this final requirement
is more restrictive than necessary, the assumption will help avoid technicalities and will
include all densities of interest to us in the remainder of the text. We give a precise
definition.
(i) f (x) ≥ 0,
We proceed to state and prove a result that will help us construct probabilities on R
with the help of density functions. This will also ensure that in Example 5.1.3 we indeed
constructed a probability on R.
by assumption, so the entire sample space has probability 1. Now let A be a Borel subset
of R. Since f (x) is non-negative,
\[ P(A) = \int_A f(x)\,dx \ge 0, \quad \text{and} \]
\[ P(A) = \int_A f(x)\,dx \le \int_{\mathbb{R}} f(x)\,dx = 1, \]
so P (A) ∈ [0, 1]. Finally, if E1 , E2 , . . . are a countable collection of disjoint events, then
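\[
\begin{aligned}
P\Big(\bigcup_{n=1}^{\infty} E_n\Big) &= \int_{\bigcup_{n=1}^{\infty} E_n} f(x)\,dx \\
 &= \sum_{n=1}^{\infty} \int_{E_n} f(x)\,dx \\
 &= \sum_{n=1}^{\infty} P(E_n).
\end{aligned}
\]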
while
\[ P\left(\left[\tfrac{3}{5}, \tfrac{4}{5}\right]\right) = \int_{3/5}^{4/5} 3x^2\,dx = \frac{37}{125}. \]
In other words, intervals of the same length do not have equal probabilities; this probability
is not uniform on (0, 1).
The probability of individual points is still zero, so P ({1/5}) = P ({2/5}) = 0, but in
terms of the density function, f (2/5) is four times as large as f (1/5). What does this mean in
practical terms?
Let ϵ be a small positive quantity (certainly less than 1/5). Then
\[ P\left(\left[\tfrac{1}{5} - \epsilon, \tfrac{1}{5} + \epsilon\right]\right) = \tfrac{6}{25}\epsilon + 2\epsilon^3 \approx \tfrac{6}{25}\epsilon \quad \text{while} \]
\[ P\left(\left[\tfrac{2}{5} - \epsilon, \tfrac{2}{5} + \epsilon\right]\right) = \tfrac{24}{25}\epsilon + 2\epsilon^3 \approx \tfrac{24}{25}\epsilon. \]
The fact that f (2/5) is four times as large as f (1/5) essentially means that a tiny interval
around 2/5 has approximately four times the probability of a similarly sized interval around
1/5.
■
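The factor of four can also be seen numerically. The R sketch below, which assumes the density 3x² on (0, 1) used in this example, integrates it over small intervals centred at 1/5 and 2/5 and compares the results with the exact antiderivative expression (a + ϵ)³ − (a − ϵ)³; the width eps = 0.001 and the helper name p_near are illustrative choices.

```r
# Density from the example: 3x^2 on (0, 1), zero elsewhere.
f <- function(x) ifelse(x > 0 & x < 1, 3 * x^2, 0)

eps <- 0.001   # half-width of the small interval (illustrative)

# Probability of the interval [a - eps, a + eps] by numerical integration.
p_near <- function(a) integrate(f, a - eps, a + eps)$value

p1 <- p_near(1/5)
p2 <- p_near(2/5)

# Exact values from the antiderivative x^3.
exact1 <- (1/5 + eps)^3 - (1/5 - eps)^3
exact2 <- (2/5 + eps)^3 - (2/5 - eps)^3

print(c(p1 = p1, exact1 = exact1, p2 = p2, exact2 = exact2, ratio = p2 / p1))
```

The ratio p2 / p1 is very close to, but slightly less than, 4 because of the common 2ϵ³ term; it approaches 4 as eps shrinks.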
exercises
\[ f(x) = \begin{cases} 2x & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
(c) Which will be larger, P ((0, 1/4)) or P ((1/4, 1/2))? Explain how you can answer this
question without actually calculating either probability.
(d) A game is played in the following way. A random variable X is selected with a density
described by f above. You must select a number r and you win the game if the
random variable results in an outcome in the interval (r − 0.01, r + 0.01). Explain
how you should choose r to maximize your chance of winning the game. (A formal
proof requires only basic calculus, but it should take very little computation to
determine the correct answer).
\[ f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{if } 0 < x \\ 0 & \text{otherwise} \end{cases} \]
\[ f(x) = \begin{cases} \frac{1}{b - a} & \text{if } a < x < b \\ 0 & \text{otherwise} \end{cases} \]
(b) Show that if I, J ⊂ [a, b] are two intervals that have the same length, then P (I ) =
P (J ).
\[ f(x) = \begin{cases} \frac{1}{x^2} & \text{if } 1 < x \\ 0 & \text{otherwise} \end{cases} \]
\[ f(x) = \begin{cases} \frac{1}{6} x^2 e^{-x} & \text{if } 0 < x \\ 0 & \text{otherwise} \end{cases} \]
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R} \]
Follow the steps below to show that the function f is a density function.
(a) Let \(I = \int_{-\infty}^{\infty} e^{-x^2/2}\,dx\) and then explain why
\[ I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2 + y^2)/2}\,dx\,dy \]
(Hint: Write I 2 as a product of two integrals each over a different variable and explain
why the resulting expression may be written as the double integral above).
after switching from rectangular to polar coordinates. (Hint: Use the fact from
multivariate calculus that after the change of variables (dx dy ) becomes (r dr dθ )
and explain the new limits of integration based on the region being described in the
plane).
Just as the move from discrete to continuous spaces required a slight change in the definition
of probability space, so it also requires a slight change in the definition of random variable.
In the discrete setting we frequently needed to consider the preimage X −1 (A) of a set.
Now we need to make sure that such a preimage is a legitimate event.
Note that in the discrete setting this extra condition was met trivially as every subset of
S was an event. Therefore the discrete setting is simply a special case of this new definition.
As with the introduction of σ-fields, we include this definition for completeness. We will
only consider functions which meet this criterion. In this section we shall consider only
continuous random variables. These are defined next.
P (X = a) = 0 (5.2.2)
Proof - Let a ∈ R. Then \(P(X = a) = \int_a^a f(x)\,dx = 0\). ■
Random variables may also be described using a “distribution function” (also commonly
known as a “cumulative distribution function”).
Though a distribution function is defined for any real-valued random variable, there is
a special relationship between fX (x) and FX (x) when the random variable has a density.
The result that F ′ (x) = f (x) then follows from the fundamental theorem of calculus
after taking derivatives of both sides of the equation (when such a derivative exists).
Note, in particular, that since densities are assumed to be piecewise continuous, their
corresponding distribution functions are piecewise differentiable. ■
This theorem will be useful for computation, but it also shows that the distribution of
a continuous random variable X is completely determined by its distribution function FX .
That is, if we know FX (x) and want to calculate P (X ∈ A) for some set A we could do so
by differentiating FX (x) to find fX (x) and then integrating this density over the set A. In
fact FX (x) always completely determines the distribution of X (regardless of whether or
not X is a continuous random variable), but a proof of that fact is beyond the scope of the
course and will not be needed for subsequent results.
In the literature random variables whose distributions satisfy (5.2.1) are called absolutely
continuous random variables and those that satisfy (5.2.2) are referred to as continuous
random variables. Since we shall only consider random variables that satisfy
(5.2.1), we refer to them simply as continuous random variables.
There are many continuous distributions that commonly arise. Some of these are
continuous analogs of discrete random variables we have already studied. We will define
these in the context of continuous random variables having the corresponding distributions.
We begin with the already discussed uniform distribution but on an arbitrary interval.
then X is said to be uniformly distributed on (a, b). Note that this is consistent with
the example at the beginning of the section since the density of a Uniform(0, 1) is
one on the interval (0, 1) and zero elsewhere. Further, recall that in Exercise 5.1.6
we have shown that f is indeed a probability density function.
Since X only takes values on (a, b) if x < a then P (X ≤ x) = 0 while if x > b then
P (X ≤ x) = 1. So let a ≤ x ≤ b. Then,
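\[ P(X \le x) = \int_{-\infty}^{x} f_X(y)\,dy = \int_{-\infty}^{a} 0\,dy + \int_{a}^{x} \frac{1}{b - a}\,dy = \frac{x - a}{b - a}. \]
\[ F_X(x) = \begin{cases} 0 & \text{if } x < a \\ \frac{x - a}{b - a} & \text{if } a \le x \le b \\ 1 & \text{if } x > b \end{cases} \]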
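The piecewise formula for FX can be compared with R's built-in uniform distribution function punif. In the sketch below the endpoints a = 2 and b = 5 and the evaluation points are illustrative, and F_unif is a hypothetical helper that simply transcribes the three cases.

```r
a <- 2; b <- 5   # illustrative endpoints

# Direct transcription of the piecewise distribution function.
F_unif <- function(x) ifelse(x < a, 0, ifelse(x <= b, (x - a) / (b - a), 1))

x <- c(1, 2, 3.5, 5, 7)
comparison <- cbind(x = x, formula = F_unif(x), punif = punif(x, min = a, max = b))
print(comparison)
```

At these points both columns give 0, 0, 0.5, 1 and 1 respectively.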
\[ \frac{N(t)}{N(0)} \approx e^{-\lambda t}, \]
for some λ > 0. One can introduce a probability model for the above experiment in the
following manner. Suppose X represented the time taken by a randomly chosen radioactive
atom to decay to its stable form. The distribution of the random variable X needs to
satisfy
P (X ≥ t) = e−λt , (5.2.5)
\[ F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 - e^{-\lambda x} & \text{if } 0 \le x \end{cases} \]
[Plot panels: Exp(1) and Exp(2); horizontal axis from 0 to 5.]
Figure 5.1: The shape of typical Exponential density and cumulative distribution functions.
We have previously seen that geometric random variables have the memoryless property
(See (3.2.2)). It turns out that the exponential random variable also possesses the memoryless
property in continuous time. Clearly if X ∼ Exp(λ) then P (X ≥ 0) = 1 and
\[ P(X \ge t) = P(X \in [t, \infty)) = \int_t^{\infty} \lambda e^{-\lambda x}\,dx = -e^{-\lambda x}\Big|_t^{\infty} = e^{-\lambda t}, \]
Thinking of the variables s and t in terms of time, this says that if an exponential random
variable has not yet occurred by time s, then its distribution from that time onward
continues to be distributed like an exponential random variable with the same parameter.
Situations that involve waiting times such as the lifetime of a light bulb or the time spent
in a queue at a service counter are often modelled with the exponential distribution. It is a
fact (see Exercise 5.2.12) that if a positive continuous random variable has the memoryless
property then it necessarily is an exponential random variable.
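The memoryless property, in the form P (X ≥ s + t | X ≥ s) = P (X ≥ t), can be checked numerically with R's exponential distribution function pexp. The rate and the two times below are illustrative values, and surv is a hypothetical helper for the tail probability.

```r
lambda <- 2          # illustrative rate parameter
s0 <- 0.7; t0 <- 0.4 # illustrative times

# Tail probability P(X > x) for X ~ Exp(lambda).
surv <- function(x) pexp(x, rate = lambda, lower.tail = FALSE)

conditional   <- surv(s0 + t0) / surv(s0)   # P(X >= s + t | X >= s)
unconditional <- surv(t0)                   # P(X >= t)

print(c(conditional = conditional,
        unconditional = unconditional,
        closed_form = exp(-lambda * t0)))
```

All three numbers coincide, as the identity e^{−λ(s+t)} / e^{−λs} = e^{−λt} predicts.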
Example 5.2.8. Let X ∼ Exp(2). Calculate the probability that X produces a value
larger than 4.
The density of X is
\[ f(x) = \begin{cases} 2e^{-2x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \]
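As a numerical companion to this example, the tail probability P (X > 4) can be read off from R's pexp and compared with the closed form e^{−λt} with λ = 2 and t = 4. This is only a sketch of the check; the variable names are illustrative.

```r
# P(X > 4) for X ~ Exp(2), using the upper tail of the distribution function.
p_tail <- pexp(4, rate = 2, lower.tail = FALSE)

# Closed form exp(-lambda * t) with lambda = 2 and t = 4.
closed_form <- exp(-2 * 4)

print(c(pexp = p_tail, closed_form = closed_form))
```

Both values equal e^{−8}, so a value larger than 4 is a rare event for this distribution.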
Of all continuous distributions, the normal distribution (also sometimes called a “Gaussian
distribution”) is the most fundamental for applications of statistics as it frequently arises as
a limiting distribution of sampling procedures.
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \tag{5.2.7} \]
for all x ∈ R. We will prove that µ and σ are, respectively, the mean and standard
deviation of such a random variable (See Definition 6.1.1, Definition 6.1.9, Example
6.1.11). Recall that in Exercise 5.1.10 we have seen that f is a probability density
function.
[Plot panels: Normal(0, 1) and Normal(1, 2); horizontal axis from −2 to 6.]
Figure 5.2: The shape of typical Normal density and cumulative distribution functions.
[Figure: cumulative distribution functions of Binomial(n, p) and of its normal approximation, in panels for n = 10, 50, 200 and p = 0.5, 0.25, 0.1.]
\[ \lim_{n \to \infty} P(S_n = k) = \frac{e^{-\lambda} \lambda^k}{k!} \]
Such an approximation was useful when p was decreasing to zero while n grew to infinity
with np remaining constant. The De Moivre-Laplace Central Limit Theorem allows us to
consider another form of limit where p remains fixed, but n increases.
\[ \lim_{n \to \infty} P\left(a < \frac{S_n - np}{\sqrt{np(1 - p)}} \le b\right) = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-\frac{x^2}{2}}\,dx \tag{5.2.8} \]
We shall omit the proof of the above Theorem for now. We prove it in a more general
setting in Chapter 8. For students well versed in Real Analysis, the proof is sketched
in Exercise 5.2.16. We refer the reader to [Ram97] for a detailed discussion of Theorem
5.2.10.
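The limit in (5.2.8) can be explored numerically by comparing exact binomial probabilities from pbinom with the normal integral from pnorm. The R sketch below is an illustration only: the values p = 0.25, a = −1, b = 1.5 and the list of n are arbitrary choices, and clt_gap is a hypothetical helper.

```r
# Exact binomial probability versus the normal limit in (5.2.8).
clt_gap <- function(n, p, a, b) {
  mu    <- n * p
  sigma <- sqrt(n * p * (1 - p))
  # P(a < (S_n - np)/sqrt(np(1-p)) <= b) = P(mu + a*sigma < S_n <= mu + b*sigma)
  exact  <- pbinom(mu + b * sigma, n, p) - pbinom(mu + a * sigma, n, p)
  normal <- pnorm(b) - pnorm(a)
  c(n = n, exact = exact, normal = normal, gap = abs(exact - normal))
}

# One row per value of n; p stays fixed while n grows.
result <- t(sapply(c(10, 50, 200, 5000), clt_gap, p = 0.25, a = -1, b = 1.5))
print(result)
```

The gap column need not shrink monotonically, because Sn only takes integer values, but it becomes small once n is large.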
In a standard introduction to integral calculus one learns many different techniques for
calculating integrals. But there are some functions whose indefinite integral has no closed-
form solution in terms of simple functions. The density of a normal random variable is one
such function. Because of this if X ∼ Normal(0, 1) the probability
\[ P(X \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \]
0.00 0.02 0.04 0.06 0.08 0.10 0.12 0.14 0.16 0.18
0.0 0.500 0.508 0.516 0.524 0.532 0.540 0.548 0.556 0.564 0.571
0.2 0.579 0.587 0.595 0.603 0.610 0.618 0.626 0.633 0.641 0.648
0.4 0.655 0.663 0.670 0.677 0.684 0.691 0.698 0.705 0.712 0.719
0.6 0.726 0.732 0.739 0.745 0.752 0.758 0.764 0.770 0.776 0.782
0.8 0.788 0.794 0.800 0.805 0.811 0.816 0.821 0.826 0.831 0.836
1.0 0.841 0.846 0.851 0.855 0.860 0.864 0.869 0.873 0.877 0.881
1.2 0.885 0.889 0.893 0.896 0.900 0.903 0.907 0.910 0.913 0.916
1.4 0.919 0.922 0.925 0.928 0.931 0.933 0.936 0.938 0.941 0.943
1.6 0.945 0.947 0.949 0.952 0.954 0.955 0.957 0.959 0.961 0.962
1.8 0.964 0.966 0.967 0.969 0.970 0.971 0.973 0.974 0.975 0.976
2.0 0.977 0.978 0.979 0.980 0.981 0.982 0.983 0.984 0.985 0.985
Table 5.1: Table of Normal(0, 1) probabilities. For X ∼ Normal(0, 1), the table gives values of
P (X ≤ z ) for various values of z between 0 and 2.18, rounded to three decimal places. The value
of z for each entry is obtained by adding the corresponding row and column labels.
Table 5.1 gives values only for positive values of z because for negative z, P (X ≤ z ) can
be easily computed using the symmetry of the Normal(0, 1) distribution as (see Figure 5.4)
\[ P(X \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \int_{-z}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = 1 - P(X \le -z). \tag{5.2.9} \]
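Both the table entries and the symmetry relation (5.2.9) are easy to reproduce with pnorm. The R sketch below regenerates the first row of Table 5.1 and checks the symmetry at a few illustrative values of z; the variable names are assumptions made for this example.

```r
# First row of Table 5.1: z = 0.00, 0.02, ..., 0.18, rounded to three decimals.
first_row <- round(pnorm(seq(0, 0.18, by = 0.02)), 3)
print(first_row)

# Symmetry (5.2.9): P(X <= -z) = 1 - P(X <= z) for the standard normal.
z <- c(0.5, 1, 1.26, 2)               # illustrative values
lower_tail   <- pnorm(-z)
via_symmetry <- 1 - pnorm(z)
print(cbind(z, lower_tail, via_symmetry))
```

The same call with a longer sequence of z values would reproduce the remaining rows of the table.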
A more complete version of this table is given in the Appendix. A similar computation can
be made for other normally distributed random variables by normalizing them. Suppose
[Plot of the Normal(0, 1) density; the horizontal axis runs from −3 to 3 with the points −z and z marked.]
Figure 5.4: Computation of Normal(0, 1) probabilities as area under the normal density curve.
For Normal(0, 1) and in fact for any symmetric distribution in general, it is enough
to know the distribution function for positive values (see Exercise 5.2.8).
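\[ P(Y \le y) = \int_{-\infty}^{y} \frac{1}{\sigma \sqrt{2\pi}} e^{-(z - \mu)^2 / 2\sigma^2}\,dz. \]
Now perform a change of variables u = (z − µ)/σ so that du = (1/σ) dz. This integral then becomes
\[ P(Y \le y) = \int_{-\infty}^{\frac{y - \mu}{\sigma}} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\,du = P\left(X \le \frac{y - \mu}{\sigma}\right), \tag{5.2.10} \]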
where X ∼ Normal (0, 1). Now we may use Table 5.1 to compute the distribution function
of Y . We conclude this section with two examples.
Example 5.2.11. If X ∼ Normal(0, 1), how likely is it that X will be within one standard
deviation of its expected value?
In this case the expected value of the random variable is zero and the standard deviation
is one. Therefore the answer is given by
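\[
\begin{aligned}
P(-1 \le X \le 1) &= \int_{-1}^{1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \\
 &= \int_{-\infty}^{1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx - \int_{-\infty}^{-1} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx \\
 &= P(X \le 1) - P(X \le -1)
\end{aligned}
\]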
R tells us that
pnorm(1)
[1] 0.8413447
pnorm(-1)
[1] 0.1586553
pnorm(1) - pnorm(-1)
[1] 0.6826895
Alternatively, using Table 5.1, we see that P (X ≤ 1) = 0.841 (up to three decimal places),
and by symmetry P (X ≤ −1) = P (X ≥ 1) = 1 − P (X ≤ 1) = 1 − 0.841 = 0.159.
Therefore, P (−1 ≤ X ≤ 1) ≈ 0.841 − 0.159 = 0.682. In other words, there is roughly a
68% chance that a standardized normal random variable will produce a value within one
standard deviation of its expected value. ■
Example 5.2.12. A machine fills bags with cashews. The intended weight of cashews in
the bag is 200 grams. Assume the machine has a tolerance such that the actual weight
of the cashews is a normally distributed random variable with an expected value of 200
grams and a standard deviation of 4 grams. How likely is it that a bag filled by this
machine will have fewer than 195 grams of cashews in it?
We know Y ∼ Normal(200, 4²) and we want the probability P (Y < 195). By the above
computation, (5.2.10),
\[ P(Y < 195) = P\left(X < \frac{195 - 200}{4}\right) = P\left(X < -\frac{5}{4}\right) \]
where X ∼ Normal(0, 1). Table 5.1 has no entry for z = 1.25, so if we were to use the table
we would take the neighbouring entry for z = 1.26 and obtain
\[ P\left(X < -\frac{5}{4}\right) = 1 - P\left(X < \frac{5}{4}\right) = 1 - P(X < 1.25) \approx 1 - 0.896 = 0.104. \]
Using the R command pnorm(-5/4), we obtain the value 0.1056498. That is, there is
slightly more than a 10% chance of a bag this light being produced by the machine. ■
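The standardization in (5.2.10) can also be left to R, since pnorm accepts a mean and a standard deviation directly. The short sketch below, with illustrative variable names, confirms that both routes give the same answer for the cashew example.

```r
mu <- 200; sigma <- 4; y <- 195

# Route 1: let pnorm handle the mean and standard deviation.
direct <- pnorm(y, mean = mu, sd = sigma)

# Route 2: standardize by hand as in (5.2.10), then use Normal(0, 1).
standardized <- pnorm((y - mu) / sigma)

print(c(direct = direct, standardized = standardized))
```

Note that pnorm takes the standard deviation (here 4) as its sd argument, not the variance 4² that appears in the notation Normal(200, 4²).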
We began this section by noting that continuous random variables must necessarily give
probability zero to any single outcome. It is an awkward consequence of this that two
different densities may give rise to exactly the same probabilities. For instance, the functions
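\[ f(x) = \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \]
and
\[ g(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise} \end{cases} \]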
are different because they assign different values to the points x = 0 and x = 1. However,
these individual points cannot affect the computation of probabilities so both f (x) and
g (x) give rise to the same probability distribution. The same thing would occur even if
f (x) and g (x) differed in a countably infinite number of points, since these will still have
probability zero when taken collectively.
Because of this we will describe f (x) and g (x) as the same density (and sometimes
even write f (x) = g (x)) when the two densities produce the same probabilities. We do this
even when f and g may technically be different functions. Though it is a more restrictive
assumption than is necessary, we have required densities to be piecewise continuous. As a
consequence of the explanation above, altering the values of the function at the endpoints
of intervals of continuity will not change the resulting probabilities and will result in the
same density.
exercises
Ex. 5.2.1. Suppose X is a continuous random variable with distribution function F .
Express the following probabilities in terms of F :
Ex. 5.2.2. Let R > 0 and X ∼ Uniform [0, R]. Let Y = min(X, 10
R
). Find the distribution
function of Y .
Ex. 5.2.4. Let X be a continuous random variable with distribution function F : R → [0, 1].
Then G : R → [0, 1] given by
G(x) = 1 − F (x)
is called the reliability function of X or the right tail distribution function of X. Suppose
T ∼ Exponential(λ) for some λ > 0, then find the reliability function of T .
Ex. 5.2.5. Let X be a random variable whose probability density function f : R → [0, 1] is
given by
\[ f(x) = \begin{cases} k x^{k-1} e^{-x^k} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases} \]
The distribution of X is called the Weibull distribution. Figure 5.5 plots the Weibull
distribution for selected values of k.
Ex. 5.2.6. Let X be a random variable whose probability density function f : R → [0, 1] is
given by
\[ f(x) = \begin{cases} \frac{2}{\pi R^2} \sqrt{R^2 - x^2} & \text{if } -R < x < R \\ 0 & \text{otherwise} \end{cases} \]
Figure 5.5: The shape of typical Weibull density and cumulative distribution functions.
[Plot panels: Semicircular(1) and Semicircular(2).]
Figure 5.6: The shape of the semicircular density and cumulative distribution functions.
Figure 5.7: Computation of probabilities as area under the density curve. For symmetric
distributions, it is enough to know the (cumulative) distribution function for
positive values.
Ex. 5.2.7. Let X be a random variable whose distribution function F : R → [0, 1] is given
by
\[ F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \frac{2}{\pi} \arcsin(\sqrt{x}) & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1 \end{cases} \]
Find the probability density function of X. The distribution of X is called the standard
arcsine law.
Ex. 5.2.8. Let X be a continuous random variable with probability density function f
and distribution function F . Suppose f is a symmetric function, i.e. f (x) = f (−x) for all
x ∈ R. Then show that
(a) P (X ≤ 0) = P (X ≥ 0) = 1/2,
We have observed this fact for the normal distribution earlier (see Figure 5.7).
Ex. 5.2.9. Let X ∼ Exp(λ). The “90th percentile” is a value a such that X is smaller than
a 90% of the time. Find the 90th percentile of X by determining the value of a for which
P (X < a) = 0.9.
Ex. 5.2.10. Let X be a continuous random variable such that its distribution function F is
strictly increasing on the set {x ∈ R : 0 < F (x) < 1}. The “median” of X is the value of x
for which P (X > x) = P (X < x) = 1/2.
Ex. 5.2.11. Let X ∼ Normal(µ, σ 2 ). Show that P (|X − µ| < kσ ) does not depend on the
values of µ or σ. (Hint: Use a change of variables for the appropriate integral).
Ex. 5.2.12. Above we saw that exponential random variables satisfied the memoryless
property, (5.2.6). It can be shown that any positive, continuous random variable with
the memoryless property must be exponential. Follow the steps below to prove a slightly
weakened version of this result. For all parts, suppose X is a positive, continuous random
variable with the memoryless property for which the distribution function FX (t) has a
continuous derivative for t > 0. Suppose further that limt→0+ F ′ (t) exists and call this
quantity α. Let G(t) = 1 − FX (t) = P (X > t) and do the following.
(a) Use the memoryless property to show that G(s + t) = G(s) · G(t) for all positive s
and t.
(b) Use part (a) to conclude that G′ (t) = −αG(t). (Hint: Take a derivative with respect
to s and then take an appropriate limit).
(c) It is a fact (which you may take as granted) that the differential equation from (b)
has solutions of the form G(t) = Ce^{−αt}. Use the fact that X is positive to explain
why it must be that C = 1.
(d) Use part (c) to calculate FX (t) and then differentiate to find fX (t).
(e) Conclude that X must be exponentially distributed and determine the associated
parameter in terms of α.
Ex. 5.2.13. Let X be a random variable with density f (x) = 2x for 0 < x < 1 (and
f (x) = 0 otherwise). Calculate the distribution function of X.
Ex. 5.2.14. Let X ∼ Uniform({1, 2, 3, 4, 5, 6}). Despite the fact this is a discrete random
variable without a density, the distribution function FX (x) is still defined. Find a piecewise
defined expression for FX (x) (see Figure 5.8 for a plot).
Ex. 5.2.15. Suppose F : R → [0, 1] is given by (5.2.3). Then show that
2. lim_{x→∞} F(x) = 1.
3. lim_{x→−∞} F(x) = 0.
[Figure 5.8: plot of the distribution function referenced in Ex. 5.2.14.]
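(a) Let
\[
A_n = \left\{ k : 0 \le k \le n,\ np + a\sqrt{np(1-p)} \le k \le np + b\sqrt{np(1-p)} \right\}.
\]
Show that
\[
P\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right) = \sum_{k \in A_n} P(S_n = k).
\]
(b) Let
\[
\xi_{k,n} = \frac{k - np}{\sqrt{np(1-p)}}.
\]
\[
\lim_{n\to\infty} \sum_{k \in A_n} \frac{e^{-\xi_{k,n}^2/2}}{\sqrt{2\pi np(1-p)}} = \int_a^b \frac{e^{-x^2/2}}{\sqrt{2\pi}}\,dx
\]
\[
\lim_{n\to\infty} \sup_{k \in A_n} \frac{\binom{n}{k} p^k (1-p)^{n-k}}{\frac{1}{\sqrt{2\pi np(1-p)}}\, e^{-\xi_{k,n}^2/2}} = 1
\]
\[
P\left(a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b\right)
= \sum_{k \in A_n} \frac{e^{-\xi_{k,n}^2/2}}{\sqrt{2\pi np(1-p)}}
+ \sum_{k \in A_n} \frac{e^{-\xi_{k,n}^2/2}}{\sqrt{2\pi np(1-p)}}
\left[ \frac{\binom{n}{k} p^k (1-p)^{n-k}}{\frac{1}{\sqrt{2\pi np(1-p)}}\, e^{-\xi_{k,n}^2/2}} - 1 \right]
\]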
In Section 3.3 we discussed functions of discrete random variables and how to find
their distributions. Given g : R → R and Y = g(X), we found the distribution of Y
by converting events associated with Y into events associated with X, inverting the function g. In
the setting of continuous random variables, distribution functions are used for calculating
probabilities associated with functions of a known random variable. We next present a
simple example for which g(x) = x², followed by a result that covers situations when g(x)
is any linear function.
Example 5.3.1. Let X ∼ Uniform(0, 1) and let Y = X². What is the density for Y ?
Since X takes values on (0, 1) and since Y = X², it will also be the case that Y
takes values on (0, 1). However, though X is uniform on the interval, there should be no
expectation that Y will also be uniform. In fact, since squaring a positive number less
than one results in a smaller number than the original, it should seem intuitive that results
of Y will be more likely to be near to zero than they are to be near to one.
It is not easy to see how to calculate the density of Y directly from the density of X.
However, it is a much easier task to compute the distribution of Y from the distribution of
X. Therefore we will use the following plan in the calculation below – integrate fX (x) to
find FX (x); use FX (x) to determine FY (y ); then differentiate FY (y ) to calculate fY (y ).
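For the first step, note
\[
F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x \le 1 \\ 1 & \text{if } x > 1. \end{cases}
\]
For the second step, let 0 < y < 1. Since P(X < −√y) = 0,
\[
\begin{aligned}
F_Y(y) = P(Y \le y) &= P(-\sqrt{y} \le X \le \sqrt{y}) \\
&= P(X < -\sqrt{y}) + P(-\sqrt{y} \le X \le \sqrt{y}) \\
&= P((X < -\sqrt{y}) \cup (-\sqrt{y} \le X \le \sqrt{y})) \\
&= P(X \le \sqrt{y}) = F_X(\sqrt{y}).
\end{aligned}
\]
Therefore,
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le 0 \\ \sqrt{y} & \text{if } 0 < y < 1 \\ 1 & \text{if } y \ge 1. \end{cases}
\]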
As noted in the beginning of this example, this distribution is far from uniform and gives
much more weight to intervals close to zero than it does intervals close to one. ■
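The conclusion of Example 5.3.1 can be checked by simulation. The R sketch below is an illustration only: the seed, the sample size and the variable names are our own choices and not part of the text. It squares uniform draws and compares the empirical distribution function of Y with the derived value √y.

```r
# Simulation sketch for Example 5.3.1: Y = X^2 with X ~ Uniform(0, 1).
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- runif(100000)               # draws from Uniform(0, 1)
y <- x^2                         # transformed sample

# Compare the empirical P(Y <= q) with the derived F_Y(q) = sqrt(q).
q <- c(0.04, 0.25, 0.64)
empirical <- sapply(q, function(t) mean(y <= t))
derived   <- sqrt(q)
print(cbind(q, empirical, derived))
```

The two columns should agree up to simulation error, and the large value of F_Y near zero reflects the extra weight that Y places on small values.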
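\[
P(Y \le y) = \int_{-\infty}^{y} \frac{1}{a}\, f_X\!\left(\frac{u-b}{a}\right) du. \tag{5.3.2}
\]
If a < 0 then
\[
P(Y \le y) = P(aX + b \le y) = P\!\left(X \ge \frac{y-b}{a}\right) = \int_{\frac{y-b}{a}}^{\infty} f_X(z)\,dz
\]
\[
P(Y \le y) = \int_{-\infty}^{y} \frac{1}{-a}\, f_X\!\left(\frac{u-b}{a}\right) du. \tag{5.3.3}
\]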
Using (5.3.2) and (5.3.3) we have that Y is a continuous random variable with density as in
(5.3.1). ■
Lemma 5.3.2 provides a method to standardize the normal random variable.
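(b) Let X ∼ Normal(µ, σ²) and let Z = (X − µ)/σ. Then Z ∼ Normal(0, 1).
\[
f_Y(y) = \frac{1}{|a|}\, f_X\!\left(\frac{y-b}{a}\right) = \frac{1}{\sqrt{2\pi}\,|a|}\, e^{-\frac{(y-b)^2}{2a^2}},
\]
\[
f_Z(z) = \sigma f_X\!\left(\sigma\left(z + \frac{\mu}{\sigma}\right)\right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}},
\]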
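The standardization Z = (X − µ)/σ can be illustrated numerically. In the R sketch below the parameter values and evaluation points are arbitrary assumptions made for the illustration. It checks that P(X ≤ x) computed directly agrees with the standard normal distribution function evaluated at (x − µ)/σ.

```r
# Check of the standardization Z = (X - mu) / sigma for X ~ Normal(mu, sigma^2).
mu <- 3; sigma <- 2              # illustrative parameter values (assumed)
x  <- c(-1, 0, 2.5, 3, 7)        # arbitrary evaluation points

lhs <- pnorm(x, mean = mu, sd = sigma)   # P(X <= x)
rhs <- pnorm((x - mu) / sigma)           # P(Z <= (x - mu) / sigma), Z ~ Normal(0, 1)
print(all.equal(lhs, rhs))
```

Note that pnorm() takes the standard deviation σ, not the variance σ², as its third argument.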
Example 5.3.4. Consider the two parallel lines in R2 , given by y = 0 and y = 1. Piku is
standing at the origin in the plane. She chooses an angle θ uniformly in (0, π ) and she
draws a line segment between the lines y = 0 and y = 1 at an angle θ from the origin in R2 .
Suppose the line segment meets the line y = 1 at the point (X, 1). Find the probability
density function of X.
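[Figure: the segment drawn from the origin (0, 0) at angle θ, meeting the line y = 1 at the point (X, 1).]
First observe that X = tan(π/2 − θ). We shall first find the distribution function of X.
Let x ∈ R. Observe that tan(x) is a strictly increasing function in the interval (−π/2, π/2)
and has an inverse denoted by arctan(x). So
\[
\begin{aligned}
P(X \le x) &= P\!\left(\tan\!\left(\tfrac{\pi}{2} - \theta\right) \le x\right) \\
&= P\!\left(\tfrac{\pi}{2} - \theta \le \arctan(x)\right) \\
&= P\!\left(\theta \ge \tfrac{\pi}{2} - \arctan(x)\right) \\
&= 1 - P\!\left(\theta \le \tfrac{\pi}{2} - \arctan(x)\right)
\end{aligned}
\]
For any x ∈ R, π/2 − arctan(x) ∈ (0, π). As θ has Uniform (0, π) distribution, the above is
\[
\begin{aligned}
&= 1 - \frac{1}{\pi}\left(\frac{\pi}{2} - \arctan(x)\right) \\
&= \frac{1}{2} + \frac{1}{\pi}\arctan(x)
\end{aligned}
\]
Hence the distribution function of X is differentiable and therefore the probability density
function of X is given by
\[
f_X(x) = \frac{1}{\pi}\,\frac{1}{1 + x^2},
\]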
for all x ∈ R. Such a random variable is an example of a Cauchy distribution which we
define more generally next. ■
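As a sanity check of Example 5.3.4, the closed forms 1/2 + arctan(x)/π and 1/(π(1 + x²)) can be compared with R's built-in standard Cauchy functions, and the geometric construction can be simulated directly. This is a sketch only: the seed, sample size and variable names are assumptions of the illustration.

```r
# Closed forms from Example 5.3.4 versus R's built-in standard Cauchy functions.
x <- c(-10, -1, 0, 0.5, 3)
F_derived <- 1 / 2 + atan(x) / pi
f_derived <- 1 / (pi * (1 + x^2))
print(all.equal(F_derived, pcauchy(x)))
print(all.equal(f_derived, dcauchy(x)))

# Simulate the construction: theta uniform on (0, pi), X = tan(pi/2 - theta).
set.seed(3)                                  # arbitrary seed
theta <- runif(100000, min = 0, max = pi)
X <- tan(pi / 2 - theta)
p_sim <- mean(X <= 1)                        # compare with pcauchy(1)
print(c(simulated = p_sim, exact = pcauchy(1)))
```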
[Figure 5.10: The shape of Cauchy density and cumulative distribution functions for selected
parameter values. Panels: Cauchy(0, 1) and Cauchy(1, 2).]
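\[
f(x) = \frac{1}{\pi}\,\frac{\alpha}{\alpha^2 + (x - \theta)^2} \tag{5.3.4}
\]
\[
F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan\!\left(\frac{x - \theta}{\alpha}\right) \tag{5.3.5}
\]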
Figure 5.10 gives plots of the Cauchy density and distribution functions.
Similar computations as above are useful for simulations. Most computer programming
languages and spreadsheets have a “Random” function designed to approximate a
Uniform(0, 1) random variable. How could one use such a feature to simulate random
variables with other densities? We start with an example.
Example 5.3.6. If X ∼ Uniform(0, 1), our goal is to find a function g : (0, 1) → R for
which Y = g (X ) ∼ Exponential (λ). We will try to find such a g : (0, 1) → R which
is strictly increasing so that it has an inverse. This will be important when it comes to
relating the distributions of X and Y .
\[
F_Y(y) = \begin{cases} 0 & \text{if } y \le 0 \\ 1 - e^{-\lambda y} & \text{if } y > 0 \end{cases}
\]
But
FY (y ) = P (Y ≤ y ) = P (g (X ) ≤ y ) = P (X ≤ g −1 (y ))
where the final equality comes from our decree that the function g should be strictly
increasing. Therefore,
FY (y ) = FX (g −1 (y )).
But the distribution function of a uniform random variable has previously been computed.
Hence,
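\[
F_X(g^{-1}(y)) = \begin{cases} 0 & \text{if } g^{-1}(y) \le 0 \\ g^{-1}(y) & \text{if } 0 < g^{-1}(y) < 1 \\ 1 & \text{if } g^{-1}(y) \ge 1 \end{cases}
\]
for y > 0. So inverting the above formula, we get g : (0, 1) → (0, ∞) is given by
\[
g(x) = -\frac{1}{\lambda}\log(1 - x),
\]
\[
X \sim \text{Uniform}(0, 1) \implies -\frac{1}{\lambda}\log(1 - X) \sim \text{Exponential}(\lambda).
\]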
In conclusion one could view g as the inverse of FY , on (0, ∞). It turns out that this is
a general result. We state a special case of this in the lemma below. ■
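The transformation found in Example 5.3.6 is easy to try out in R. The sketch below is illustrative: the rate λ = 2, the seed and the function name g are assumptions made for the example. It compares g with R's built-in exponential quantile function qexp() and then uses g to turn uniform draws into exponential ones.

```r
# Inverse-transform sketch for Example 5.3.6.
lambda <- 2                                      # illustrative rate (assumed)
g <- function(x, lambda) -log(1 - x) / lambda    # the map derived in the example

# g coincides with the quantile function (inverse of F_Y) of Exponential(lambda).
u <- c(0.1, 0.5, 0.9)
print(all.equal(g(u, lambda), qexp(u, rate = lambda)))

# Simulate Exponential(lambda) values from Uniform(0, 1) values.
set.seed(2)                                      # arbitrary seed
y <- g(runif(100000), lambda)
print(c(sample_mean = mean(y), exact_mean = 1 / lambda))
```

In practice one would call rexp() directly; the point here is only that a Uniform(0, 1) generator suffices.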
Lemma 5.3.7. Let U be a Uniform (0, 1) random variable. Let X be a continuous random
variable such that its distribution function, FX , is a strictly increasing continuous function.
Then
(a) We shall verify that Y and X have the same distribution function. Let y ∈ R, then
\[
F_Y(y) = P(Y \le y) = P(F_X^{-1}(U) \le y) = P(U \le F_X(y)) = F_X(y)
\]
P (Z ≤ z ) = P (F (X ) ≤ z ) = 0
P (Z ≤ z ) = P (F (X ) ≤ z ) = 1
as F : R → (0, 1). If 0 < z < 1 then F −1 (z ) is well defined as Range (F ) = (0, 1) and
P (Z ≤ z ) = P (F (X ) ≤ z ) = P (X ≤ F −1 (z )) = F (F −1 (z )) = z.
exercises
Ex. 5.3.1. Let X ∼ Uniform(0, 1) and let Y = √X. Determine the density of Y .
Ex. 5.3.2. Let X ∼ Uniform(0, 1) and let Z = 1/X. Determine the density of Z.
Ex. 5.3.3. Let X ∼ Uniform(0, 1). Let r > 0 and define Y = rX. Show that Y is uniformly
distributed on (0, r ).
Ex. 5.3.4. Let X ∼ Uniform(0, 1). Let Y = 1 − X. Show that Y ∼ Uniform(0, 1) as well.
Ex. 5.3.5. Let X ∼ Uniform(0, 1). Let a and b be real numbers with a < b and let
Y = (b − a)X + a. Show that Y ∼ Uniform(a, b).
Ex. 5.3.6. Let X ∼ Uniform(0, 1). Find a function g (x) (which is strictly increasing) such
that the random variable Y = g (X ) has density fY (y ) = 3y 2 for 0 < y < 1 (and fY (y ) = 0
otherwise).
Ex. 5.3.7. Let X ∼ Normal(µ, σ²). Let g : (−∞, ∞) → R be given by g(x) = x². Find
the probability density function of Y = g (X ).
[Figure 5.11: The shape of the Pareto density and cumulative distribution functions.
Panels: Pareto(1) and Pareto(2).]
Ex. 5.3.8. Let α > 0 and X be a random variable with the p.d.f given by
\[
f(x) = \begin{cases} \dfrac{\alpha}{x^{\alpha+1}} & 1 \le x < \infty \\ 0 & \text{otherwise} \end{cases}
\]
The random variable X is said to have Pareto (α) distribution (see Figure 5.11).
In the above exercises we assume that the transformation function is defined as above
when the p.d.f of X is positive and zero otherwise.
Ex. 5.3.9. Let X be a continuous random variable with probability density function
fX : R → R. Let a > 0, b ∈ R and Y = (1/a)(X − b)². Show that Y is also a continuous random
variable with probability density function fY : R → R given by
\[
f_Y(y) = \frac{\sqrt{a}}{2\sqrt{y}} \left[ f_X(\sqrt{ay} + b) + f_X(-\sqrt{ay} + b) \right]
\]
for y > 0.
Ex. 5.3.10. Let −∞ ≤ a < b ≤ ∞ and I = (a, b) and g : I → R. Let X be a continuous
random variable whose density fX is zero on the complement of I. Set Y = g (X ).
\[
f_Y(y) = f_X(g^{-1}(y))\, \frac{d}{dy} g^{-1}(y).
\]
\[
f_Y(y) = f_X(g^{-1}(y)) \left( -\frac{d}{dy} g^{-1}(y) \right).
\]
Ex. 5.3.11. Let X be a random variable having an exponential density. Let g : [0, ∞) → R
be given by g(x) = x^{1/β}, for some β ≠ 0. Find the probability density function of Y = g(X).
Ex. 5.3.12. Let U ∼ Uniform (0, 1). Let X be a continuous random variable with a
distribution function F . Extend F : R → R to F : R ∪ {−∞} ∪ {∞} → R by setting
F (∞) = 1 and F (−∞) = 0. Define the generalised inverse of F , G : [0, 1] → R ∪ {−∞} ∪
{∞} by
G(y ) = inf{x ∈ R : F (x) ≥ y}.
Show that
F (x) ≥ y ⇐⇒ x ≥ G(y ).
When analyzing multiple random variables at once, one may consider a “joint density”
analogous to the joint distribution of the discrete variable case. In this section we will
restrict considerations to only two random variables, but we shall see in Chapter 8 that
the definitions and results all generalize to any finite collection of variables.
Proof- The proof of the theorem is essentially the same as in the one-variable version
of Theorem 5.1.5. We will not reproduce it here. As in the discrete case we will typically
associate such densities with random variables. ■
Definition 5.4.2. A pair of random variables (X, Y ) is said to have a joint density
f (x, y ) if for every Borel set A ⊂ R2
\[
P((X, Y) \in A) = \iint_A f(x, y)\,dx\,dy.
\]
As in the one-variable case we describe this in terms of “Borel sets” to be precise, but
in practice we will only consider sets A which are simple regions in the plane. In fact
regions such as (−∞, a] × (−∞, b], for all real numbers a, b are enough to characterise the
joint distribution. As in the one variable case we can define a “joint distribution function”
of (X, Y ) as
\[
F_{(X,Y)}(a, b) = P((X \le a) \cap (Y \le b)) = \int_{-\infty}^{a} \int_{-\infty}^{b} f(z, w)\,dw\,dz \tag{5.4.1}
\]
for all a, b ∈ R. We will usually denote the joint distribution function by F omitting
the subscripts unless it is particularly needed. One can state and prove a similar type
of result as Theorem 5.2.5 for F (a, b) when (X, Y ) have a joint density. In particular,
we can conclude that since the joint densities are assumed to be piecewise continuous,
the corresponding distribution functions are piecewise differentiable. Further, the joint
distribution of two continuous random variables (X, Y ) is completely determined by their
joint distribution function F . That is, if we know the value of F (a, b) for all a, b ∈ R, we
could use multivariable calculus to differentiate F (a, b) to find f (a, b). Then P ((X, Y ) ∈ A)
for any event A is obtained by integrating the joint density f over the event A. We illustrate
this with a couple of examples.
Example 5.4.3. Consider the open rectangle in R2 given by R = (0, 1) × (3, 5) and
let | R | = 2 denote its area. Let (X, Y ) have a joint density f : R2 → R given by
\[
f(x, y) = \begin{cases} \frac{1}{2} & \text{if } (x, y) \in R \\ 0 & \text{otherwise.} \end{cases}
\]
The above is clearly a density function. So for any rectangle A = (a, b) × (c, d) ⊂ R,
\[
P((X, Y) \in A) = \int_c^d \int_a^b f(x, y)\,dx\,dy = \frac{(b - a)(d - c)}{2} = \frac{|A|}{|R|}.
\]
In general one can use the following definition to define a uniform random variable on
the plane.
When (X, Y ) ∼ Uniform (D ) then the probability that (X, Y ) lies in a region A ⊂ D
is proportional to the area of A.
We note that this really does describe a density. The function f (x, y ) is non-negative and
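\[
\begin{aligned}
\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy &= \int_0^1 \int_0^1 (x + y)\,dx\,dy \\
&= \int_0^1 \left( \tfrac{1}{2}x^2 + xy \right) \Big|_{x=0}^{x=1}\,dy \\
&= \int_0^1 \left( \tfrac{1}{2} + y \right) dy \\
&= \left( \tfrac{1}{2}y + \tfrac{1}{2}y^2 \right) \Big|_{y=0}^{y=1} = 1.
\end{aligned}
\]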
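Calculating a probability such as P ((X < 1/2) ∩ (Y < 1/2)) requires integrating over the
appropriate region.
\[
\begin{aligned}
P\left( (X < \tfrac{1}{2}) \cap (Y < \tfrac{1}{2}) \right) &= \int_{-\infty}^{1/2} \int_{-\infty}^{1/2} f(x, y)\,dx\,dy \\
&= \int_0^{1/2} \int_0^{1/2} (x + y)\,dx\,dy \\
&= \int_0^{1/2} \left( \tfrac{1}{8} + \tfrac{1}{2}y \right) dy \\
&= \tfrac{1}{8}.
\end{aligned}
\]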
A probability only involving one variable may still be calculated from the joint density.
For instance P (X < 1/2) does not appear to involve Y , but this simply means that Y is
unrestricted and the corresponding integral should range over all possible values of Y .
Therefore,
\[
\begin{aligned}
P\left( X < \tfrac{1}{2} \right) &= \int_{-\infty}^{\infty} \int_{-\infty}^{1/2} f(x, y)\,dx\,dy \\
&= \int_0^1 \int_0^{1/2} (x + y)\,dx\,dy = \tfrac{3}{8}.
\end{aligned}
\]
It is just as easy to compute that P (Y < 1/2) = 3/8. Note that these computations also
demonstrate that X and Y are not independent since
\[
P\left( X < \tfrac{1}{2} \right) \cdot P\left( Y < \tfrac{1}{2} \right) = \tfrac{9}{64} \ne P\left( (X < \tfrac{1}{2}) \cap (Y < \tfrac{1}{2}) \right).
\]
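The three probabilities computed for the density f (x, y) = x + y on the unit square can be reproduced numerically. The R sketch below is one possible way to do this with nested calls to integrate(); the helper name rect_prob is hypothetical and introduced only for this illustration.

```r
# Numerical check of the probabilities computed for f(x, y) = x + y on (0,1) x (0,1).
f <- function(x, y) x + y

# P(X < a, Y < b) for 0 <= a, b <= 1, as an iterated integral.
rect_prob <- function(a, b) {
  inner <- function(y) {
    # for each fixed y, integrate over x from 0 to a
    sapply(y, function(yy) integrate(function(x) f(x, yy), lower = 0, upper = a)$value)
  }
  integrate(inner, lower = 0, upper = b)$value
}

p_both <- rect_prob(1/2, 1/2)   # P(X < 1/2, Y < 1/2)
p_x    <- rect_prob(1/2, 1)     # P(X < 1/2)
p_y    <- rect_prob(1, 1/2)     # P(Y < 1/2)
print(c(p_both = p_both, p_x = p_x, p_y = p_y, product = p_x * p_y))
```

The product of the two marginal probabilities differs from the joint probability, which is the lack of independence noted above.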
[Figure 5.12: The subset A of the unit square that represents the region x + y < 1.]
As in the discrete case, when we begin with the joint density of many random variables,
but want to speak of the distribution of an individual variable we will frequently refer to it
as a “marginal distribution” .
Suppose (X, Y ) are random variables and have a joint probability density function
f : R2 → R. Then we observe that
\[
P(X \le x) = P(X \le x, -\infty < Y < \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f(u, y)\,dy\,du.
\]
If g : R → R is given by
\[
g(u) = \int_{-\infty}^{\infty} f(u, y)\,dy
\]
then
\[
P(X \le x) = \int_{-\infty}^{x} g(u)\,du.
\]
Using Theorem 5.2.5, by the continuity assumptions on f , we find that the random variable
X is also a continuous random variable with probability density function of X given by
\[
f_X(x) = g(x) = \int_{-\infty}^{\infty} f(x, y)\,dy. \tag{5.4.2}
\]
As it was derived from a joint probability density function, the density of X is referred to
as the marginal density of X. Similarly one can show that Y is also a continuous random
variable and its marginal density is given by
\[
f_Y(y) = \int_{-\infty}^{\infty} f(x, y)\,dx. \tag{5.4.3}
\]
Example 5.4.6. (Example 5.4.3 contd.) Going back to Example 5.4.3, we can compute
the marginal density of X and Y . The marginal density of X is given by
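\[
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy
= \begin{cases} \int_3^5 \frac{1}{2}\,dy & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}
\]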
5.4.2 Independence
Theorem 5.4.7. Let f be the joint density of random variables X and Y and let
fX and fY be the respective marginal densities. Then X and Y are independent if and only if
f (x, y ) = fX (x)fY (y )
Proof - First suppose X and Y are independent and consider the quantity P ((X ≤
x) ∩ (Y ≤ y )). On one hand independence gives
Since equations 5.4.4 and 5.4.5 are equal we may differentiate both with respect to each of
the variables x and y and they remain equal. However, differentiating the former gives
fX (x)fY (y ) because of the relationship between the distribution and the density, while
differentiating the latter yields f (x, y ) by a two-fold application of the fundamental theorem
of calculus.
To prove the opposite direction, suppose f (x, y ) = fX (x)fY (y ). Let A and B be Borel
sets in R. Then
\[
\begin{aligned}
P((X \in A) \cap (Y \in B)) &= \int_B \int_A f(x, y)\,dx\,dy \\
&= \int_B \int_A f_X(x) f_Y(y)\,dx\,dy \\
&= \left( \int_A f_X(x)\,dx \right) \left( \int_B f_Y(y)\,dy \right) \\
&= P(X \in A)\,P(Y \in B)
\end{aligned}
\]
Since this is true for all such sets A and B, the variables X and Y are independent. ■
Example 5.4.8. (Example 5.4.3 contd.) We had observed that if (X, Y ) ∼ Uniform (R)
then X ∼ Uniform (0, 1) and Y ∼ Uniform (3, 5). Note further that
f (x, y ) = fX (x)fY (y )
\[
P((X, Y) \in A) = \frac{|A|}{|C|},
\]
and the probability that (X, Y ) lies in A is proportional to the area of A. However the
marginal density calculation is a little different. The marginal density of X is given by
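\[
\begin{aligned}
f_X(x) = \int_{-\infty}^{\infty} f(x, y)\,dy
&= \begin{cases} \displaystyle\int_{-\sqrt{25 - x^2}}^{\sqrt{25 - x^2}} \frac{1}{|C|}\,dy & \text{if } -5 < x < 5 \\ 0 & \text{otherwise} \end{cases} \\
&= \begin{cases} \dfrac{2}{25\pi}\sqrt{25 - x^2} & \text{if } -5 < x < 5 \\ 0 & \text{otherwise.} \end{cases}
\end{aligned}
\]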
The distribution of X is the Semi-circular law described in Exercise 5.2.6. As the joint
density f is symmetric in x and y (i.e f (x, y ) = f (y, x)) the marginal density of Y is the
same as that of X (why ?). It is easy to see
\[
\frac{1}{25\pi} = f(0, 0) \ne f_X(0) f_Y(0) = \frac{4}{25\pi^2}
\]
Consequently X, Y are not independent. This fact should make intuitive sense as well, for
if X happens to take a value near 5 or −5 the range of possible values of Y is much more
restricted than if X takes a value near 0. ■
We shall see the utility of independence when computing distributions of various
functions of independent random variables (see Section 5.5). Independence of random
variables also makes it easier to compute their joint density and hence probabilities. For
instance, consider the following example.
Example 5.4.10. Suppose X ∼ Exponential(λ1 ), Y ∼ Exponential(λ2 ) are independent
random variables. Find P (X − Y < 0).
Therefore
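\[
\begin{aligned}
P(X - Y < 0) &= \int_0^{\infty} \int_0^{y} \lambda_1 \lambda_2 e^{-(\lambda_1 x + \lambda_2 y)}\,dx\,dy
= \lambda_1 \lambda_2 \int_0^{\infty} e^{-\lambda_2 y} \left[ \int_0^{y} e^{-\lambda_1 x}\,dx \right] dy \\
&= \lambda_1 \lambda_2 \int_0^{\infty} e^{-\lambda_2 y}\, \frac{1}{\lambda_1} \left[ 1 - e^{-\lambda_1 y} \right] dy \\
&= \lambda_2 \int_0^{\infty} \left( e^{-\lambda_2 y} - e^{-(\lambda_1 + \lambda_2) y} \right) dy \\
&= \lambda_2 \left[ \frac{-1}{\lambda_2} \left( e^{-\lambda_2 y} \Big|_0^{\infty} \right) + \frac{1}{\lambda_1 + \lambda_2} \left( e^{-(\lambda_1 + \lambda_2) y} \Big|_0^{\infty} \right) \right] \\
&= \lambda_2 \left[ \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} \right] \\
&= \frac{\lambda_1}{\lambda_1 + \lambda_2}.
\end{aligned}
\]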
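The value λ1/(λ1 + λ2) obtained in Example 5.4.10 can be checked with a one-dimensional numerical integral, since integrating out x gives P(X < Y) = ∫ fY(y) P(X < y) dy. In the R sketch below the two rates are arbitrary illustrative choices.

```r
# Numerical check of P(X - Y < 0) for independent exponentials (Example 5.4.10).
lambda1 <- 1.5; lambda2 <- 0.5     # illustrative rates (assumed)

# P(X < Y) = integral over y of f_Y(y) * P(X < y).
integrand <- function(y) dexp(y, rate = lambda2) * pexp(y, rate = lambda1)
p_numeric <- integrate(integrand, lower = 0, upper = Inf)$value
p_closed  <- lambda1 / (lambda1 + lambda2)
print(c(numeric = p_numeric, closed_form = p_closed))
```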
In Section 3.2.2 we have seen the notion of conditional distributions for discrete random
variables and in Section 4.4 we have seen the notions of conditional expectation and
variance for discrete random variables. Suppose X measures the parts per million of a
particulate matter less than 10 microns in the air and Y is the incidence rate of asthma in
the population. It is clear that X and Y ought to be related; for the distribution of one
affects the other. Towards this, in this section we shall discuss conditional distributions for
two continuous random variables having a joint probability density function. We recall
from Definition 3.2.5 that if X is a random variable on a sample space S and A ⊂ S be an
event such that P (A) > 0, then the probability Q described by
Q(B ) = P (X ∈ B|A)
Suppose X and Y have a joint probability density function f . Given our discussion
for discrete random variables it is natural to characterise the conditional distribution of
X given some information on Y . In the discrete setting we typically considered an event
A = {Y = b} for some real number b in the range of Y . In the continuous setting such an
event A would have zero probability, so the usual way of conditioning on an event would
not be possible. However, there is a way to make such a conditioning meaningful and
precise provided fY (b) > 0, where fY is the marginal density of Y .
P (X ∈ [3, 4] | Y = b).
We shall argue heuristically and arrive at an expression for the above probability. Suppose
the marginal density of X is fX (·), and that of Y is fY (·). Assume first that fY is piecewise
continuous and fY (b) > 0. Then it is a standard fact from real analysis to see that
1
P (Y ∈ [b, b + )) > 0,
n
for all n ≥ 1. One can then view the conditional probability as before, that is
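\[
\begin{aligned}
P\left( X \in [3, 4] \,\Big|\, Y \in [b, b + \tfrac{1}{n}) \right)
&= \frac{P\left( (X \in [3, 4]) \cap (Y \in [b, b + \tfrac{1}{n})) \right)}{P\left( Y \in [b, b + \tfrac{1}{n}) \right)} \\
&= \frac{\int_3^4 \int_b^{b + \frac{1}{n}} f(u, v)\,dv\,du}{\int_b^{b + \frac{1}{n}} f_Y(v)\,dv} \\
&= \frac{\int_3^4 n \int_b^{b + \frac{1}{n}} f(u, v)\,dv\,du}{n \int_b^{b + \frac{1}{n}} f_Y(v)\,dv}
\end{aligned}
\]
From facts in real analysis (under some mild assumptions on f ) the following can be
established,
\[
\lim_{n \to \infty} n \int_b^{b + \frac{1}{n}} f(u, v)\,dv = f(u, b),
\]
for all real numbers u and
\[
\lim_{n \to \infty} n \int_b^{b + \frac{1}{n}} f_Y(v)\,dv = f_Y(b).
\]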
\[
\lim_{n \to \infty} P\left( Y \in [b, b + \tfrac{1}{n}) \right) = P(Y = b).
\]
With the above motivation we are now ready to define conditional densities for two random
variables.
Definition 5.4.11. Let (X, Y ) be random variables having joint density f . Let the
marginal density of Y be fY (·). Suppose b is a real number such that fY (b) > 0 and
fY is continuous at b. Then the conditional density of X given Y = b is given by
\[
f_{X|Y=b}(x) = \frac{f(x, b)}{f_Y(b)} \tag{5.4.6}
\]
for all real numbers x. Similarly, let the marginal density of X be fX (·). Suppose a
is a real number such that fX (a) > 0 and fX is continuous at a. Then the conditional density
of Y given X = a is given by
\[
f_{Y|X=a}(y) = \frac{f(a, y)}{f_X(a)}
\]
This definition genuinely defines a probability density function, for fX|Y =b (x) ≥ 0 since
it is the ratio of a non-negative quantity and a positive quantity. Moreover,
\[
\int_{-\infty}^{\infty} f_{X|Y=b}(x)\,dx = \int_{-\infty}^{\infty} \frac{f(x, b)}{f_Y(b)}\,dx
= \frac{1}{f_Y(b)} \int_{-\infty}^{\infty} f(x, b)\,dx = \frac{1}{f_Y(b)}\, f_Y(b) = 1
\]
One can use the conditional density to compute the conditional probabilities, namely if
(X, Y ) are random variables having joint density f and b is a real number such that its
marginal density has the property fY (b) > 0 then
\[
P(X \in A \mid Y = b) = \int_A f_{X|Y=b}(x)\,dx = \int_A \frac{f(x, b)}{f_Y(b)}\,dx.
\]
We conclude this section with two examples where we compute conditional densities.
In both the examples the dependencies between the random variables imply that the
conditional distributions are different from the marginal distributions.
Example 5.4.12. Let (X, Y ) have joint probability density function f given by
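\[
f(x, y) = \frac{\sqrt{3}}{4\pi}\, e^{-\frac{1}{2}(x^2 - xy + y^2)}, \qquad -\infty < x, y < \infty.
\]
By a standard completing the square computation,
\[
\tfrac{1}{2}(x^2 - xy + y^2) = \tfrac{3x^2}{8} + \tfrac{1}{2}\left( y - \tfrac{x}{2} \right)^2.
\]
Therefore,
\[
f_X(x) = \frac{\sqrt{3}}{4\pi}\, e^{-\frac{3x^2}{8}} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left( y - \frac{x}{2} \right)^2} dy
\]
Observing that
\[
\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left( y - \frac{x}{2} \right)^2} dy = 1
\]
(why ?), we have
\[
f_X(x) = \frac{\sqrt{3}}{4\pi}\, e^{-\frac{3x^2}{8}} \sqrt{2\pi} = \sqrt{\frac{3}{4}}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{3x^2}{8}}
\]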
for many x, y ∈ R. Hence X and Y are not independent. Note that fX (x) ̸= 0 for all real
numbers x and is continuous at all x ∈ R. Fix x ∈ R, the conditional density of Y given
X = x is given by
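\[
f_{Y|X=x}(y) = \frac{f(x, y)}{f_X(x)}
= \frac{\frac{\sqrt{3}}{4\pi}\, e^{-\frac{1}{2}(x^2 - xy + y^2)}}{\sqrt{\frac{3}{4}}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{3x^2}{8}}}
= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left( y - \frac{x}{2} \right)^2} \qquad \forall\, y \in \mathbb{R}.
\]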
[Figure 5.13: The region T = {(x, y ) | 0 < x < y < 4} from Example 5.4.13, with the lines X = 1 and Y = 2 marked.]
Example 5.4.13. Suppose T = {(x, y ) | 0 < x < y < 4} and let (X, Y ) ∼ Uniform (T ).
Therefore its joint density is given by (see Figure 5.13)
\[
f(x, y) = \begin{cases} \frac{1}{8} & \text{if } (x, y) \in T \\ 0 & \text{otherwise.} \end{cases}
\]
Let us fix 0 < b < 4. So fY (·) is non-zero at b and is continuous at b. The conditional
density of (X | Y = b) is given by
Therefore (X | Y = b) ∼ Uniform (0, b). Similarly if we fix 0 < a < 4, we observe fX (·) is
non-zero at a and is continuous at a. The conditional density of (Y | X = a) is given by
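\[
f_{Y|X=a}(y) = \frac{f(a, y)}{f_X(a)}
= \begin{cases} \dfrac{1/8}{(4 - a)/8} & \text{if } a < y < 4 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} \dfrac{1}{4 - a} & \text{if } a < y < 4 \\ 0 & \text{otherwise.} \end{cases}
\]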
exercises
Ex. 5.4.1. Let (X, Y ) be random variables whose probability density function is given by
f : R2 → R. Find the probability density function of X and probability density function
of Y in each of the following cases:-
Ex. 5.4.2. Let c > 0. Suppose that X and Y are random variables with joint probability
density
\[
f(x, y) = \begin{cases} c(xy + 1) & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1 \\ 0 & \text{otherwise} \end{cases}
\]
(a) Find c.
(b) Compute the marginal densities fX (·) and fY (·) and the conditional density fX|Y =b (·)
Ex. 5.4.3. Let A = {(x, y ) ∈ R2 : x > 0, y > 0, x + y < 1} and let X and Y be random
variables defined by the joint density f (x, y ) = 24xy if (x, y ) ∈ A (and f (x, y ) = 0
otherwise).
(c) Explain why (b) doesn’t violate Theorem 5.4.7 despite the fact that 24xy is a product
of a function of x with a function of y.
L = {(x, y ) ∈ D : x = 0 or x = −1 or x = 1 or y = 0 or y = 1 or y = −1}
be the lines that create a tiling of D. Suppose we drop a coin of radius R at a uniformly
chosen point in D what is the probability that it will intersect the set L ?
Ex. 5.4.5. Let X and Y be two independent uniform (0, 1) random variables. Let
U = max(X, Y ) and V = min(X, Y ).
\[
f(x) = \begin{cases} c x^2 (1 - x) & \text{for } 0 \le x \le 1, \\ 0 & \text{otherwise.} \end{cases}
\]
Find:
(c) Given a γ and η from parts (a) and (b), find the marginal densities of X and Y .
Ex. 5.4.10. Let D = {(x, y ) : x3 ≤ y ≤ x}. A point (X, Y ) is chosen uniformly from D.
Find the joint probability density function of X and Y .
Ex. 5.4.11. Let X and Y be two random variables with the joint p.d.f given by
\[
f(x, y) = \begin{cases} a e^{-by} & 0 \le x \le y \\ 0 & \text{otherwise} \end{cases}
\]
Find conditions on a and b that make this a joint probability density function.
Ex. 5.4.12. Suppandi and Meera plan to meet at Gopalan Arcade between 7pm and 8pm.
Each will arrive at a time (independent of each other) uniformly between 7pm and 8pm
and will wait for 15 minutes for the other person before leaving. Find the probability that
they will meet ?
In Section 5.3 we have seen how to compute the distribution of Y = g (X ) from the
distribution of X for various g : R → R. Suppose (X, Y ) are random variables having
a joint probability density function f : R2 → R. Let h : R2 → R. A natural follow up
objective is then to determine the distribution of
Z = h(X, Y ).
In Section 3.3 we discussed an approach to this question when the random variables were
discrete.
One could prove a result as attained in Exercise 5.3.10 for functions of two variables
but this will require knowledge of Linear Algebra and multivariable calculus. Here we limit
our objective and shall focus on two specific functions namely the sum and the product.
Let X and Y be two independent continuous random variables with densities fX and fY .
In this section we shall see how to compute the distribution of Z = X + Y . We first prove
a proposition that describes the probability density function of Z.
Proposition 5.5.1. (Sum of two independent random variables) Let X and Y be two
independent random variables with marginal densities given by fX : R → R and fY : R → R.
Then Z = X + Y has a probability density function fZ : R → R given by
\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x) f_Y(z - x)\,dx. \tag{5.5.1}
\]
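\[
\begin{aligned}
F(z) &= P(Z \le z) \\
&= P(X + Y \le z) \\
&= \iint_{\{(x, y) : x + y \le z\}} f_X(x) f_Y(y)\,dy\,dx \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{z - x} f_X(x) f_Y(y)\,dy\,dx \\
&= \int_{-\infty}^{z} \left[ \int_{-\infty}^{\infty} f_X(x) f_Y(u - x)\,dx \right] du.
\end{aligned}
\]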
As fX (·) and fY (·) are densities, it can be shown that the integrand is a piecewise continuous
function. Hence F is of the form (5.2.4) and Theorem 5.2.5 implies that the probability
density function of Z is given by (5.5.1). ■
The integral expression on the right hand side of (5.5.1) is referred to as the convolution
of fX and fY and is denoted by fX ⋆ fY (z ). It is a property of convolutions that fX ⋆
fY (z ) = fY ⋆ fX (z ) for all z ∈ R. Thus if we view the sum of X and Y as Z = X + Y or
Z = Y + X the distribution will be the same (See Exercise 5.5.8).
[Figure 5.14: The region T = {(x, y ) | 0 < x < y < 4} from Example 5.5.2.]
Example 5.5.2. (Sum of Uniforms) Let X and Y be two independent Uniform (0, 1)
random variables. Let Z = X + Y . From the above proposition, Z has a density given
by (5.5.1). Note that
\[
f_X(x) f_Y(z - x) = \begin{cases} 1 & \text{if } 0 < x < 1,\ 0 < z - x < 1 \text{ and } 0 < z < 2 \\ 0 & \text{otherwise} \end{cases}
\]
Therefore fX (x)fY (z − x) is non-zero if and only if max{0, z − 1} < x < min{1, z}, 0 <
z < 2. So for 0 < z < 2,
\[
f_Z(z) = \int_{\max\{0, z - 1\}}^{\min\{1, z\}} f_X(x) f_Y(z - x)\,dx = \int_{\max\{0, z - 1\}}^{\min\{1, z\}} 1\,dx = \min\{1, z\} - \max\{0, z - 1\}.
\]
Therefore,
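\[
f_Z(z) = \begin{cases} \min\{1, z\} - \max\{0, z - 1\} & \text{if } 0 < z < 2 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} z & \text{if } 0 < z \le 1 \\ 2 - z & \text{if } 1 < z < 2 \\ 0 & \text{otherwise} \end{cases}
\]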
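The convolution formula (5.5.1) can also be evaluated numerically and compared with the triangular density just obtained. The R sketch below is illustrative: the function names conv_density and triangle are hypothetical, and the evaluation points are arbitrary.

```r
# Numerical convolution of two Uniform(0, 1) densities, following (5.5.1).
conv_density <- function(z) {
  sapply(z, function(zz) {
    # f_X vanishes outside (0, 1), so it suffices to integrate x over (0, 1)
    integrate(function(x) dunif(x) * dunif(zz - x), lower = 0, upper = 1)$value
  })
}

# Closed form from Example 5.5.2.
triangle <- function(z) ifelse(z > 0 & z < 2, pmin(1, z) - pmax(0, z - 1), 0)

z <- c(0.25, 0.5, 1.5, 1.75)
print(cbind(z, numeric = conv_density(z), closed_form = triangle(z)))
```

Because the integrand is an indicator function, integrate() has to locate its jumps; for less convenient values of z the numerical answer may be slightly less accurate than for smooth densities.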
Our next example will deal with sum of two independent exponential random variables.
This will lead us to the Gamma distribution which is of significant interest in statistics.
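Example 5.5.3. (Sum of Exponentials) Let λ > 0, and let X and Y be two independent
Exponential (λ) random variables. Let Z = X + Y . Then we know that Z has a density given by
(5.5.1). Further,
\[
f_X(x) f_Y(z - x)
= \begin{cases} \lambda^2 e^{-\lambda x} e^{-\lambda (z - x)} & \text{if } x \ge 0,\ z - x \ge 0 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} \lambda^2 e^{-\lambda z} & \text{if } x \ge 0,\ x \le z,\ z \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]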
Before we define the Gamma distribution more generally we prove a lemma in real analysis,
the proof of which can be skipped upon first reading.
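\[
\int_0^{\infty} x^{n-1} e^{-\lambda x}\,dx = \frac{(n - 1)!}{\lambda^n} \tag{5.5.2}
\]
\[
I^a_{n,\lambda} = \int_0^a x^{n-1} e^{-\lambda x}\,dx
\]
is a well defined finite positive number. As x^α e^{−βx} → 0 as x → ∞ for any α, β > 0 there is
a K > 0 such that
\[
0 \le x^{n-1} e^{-\lambda x} < e^{-\frac{\lambda x}{2}},
\]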
is a well defined finite positive number. Now, as u, v are differentiable we have by the
integration by parts formula
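\[
\int_0^a u(x) v'(x)\,dx = u(a) v(a) - u(0) v(0) - \int_0^a u'(x) v(x)\,dx.
\]
\[
-\lambda I^a_{n,\lambda} = a^{n-1} e^{-\lambda a} - (n - 1) I^a_{n-1,\lambda}.
\]
\[
\lambda I_{n,\lambda} = (n - 1) I_{n-1,\lambda}.
\]
\[
I_{n,\lambda} = \prod_{i=1}^{n-1} \frac{(n - i)}{\lambda}\, I_{1,\lambda} = \frac{(n - 1)!}{\lambda^{n-1}}\, I_{1,\lambda}.
\]
\[
f(x) = \frac{\lambda^n}{(n - 1)!}\, x^{n-1} e^{-\lambda x}, \tag{5.5.3}
\]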
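Identity (5.5.2), and the fact that it makes (5.5.3) integrate to one, can be checked numerically for particular parameter values. In the R sketch below the values n = 4 and λ = 1.5 are arbitrary choices for the illustration, and the last comparison assumes R's shape/rate parametrisation of dgamma().

```r
# Numerical check of (5.5.2) and of the density (5.5.3) for one choice of n and lambda.
n <- 4; lambda <- 1.5                     # illustrative values (assumed)

lhs <- integrate(function(x) x^(n - 1) * exp(-lambda * x), lower = 0, upper = Inf)$value
rhs <- factorial(n - 1) / lambda^n
print(c(integral = lhs, closed_form = rhs))

# The density in (5.5.3), written out by hand.
f <- function(x) lambda^n / factorial(n - 1) * x^(n - 1) * exp(-lambda * x)
total <- integrate(f, lower = 0, upper = Inf)$value
print(total)

# It agrees with R's built-in gamma density with shape n and rate lambda.
x <- c(0.5, 1, 3)
print(all.equal(f(x), dgamma(x, shape = n, rate = lambda)))
```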
We saw in Example 5.5.3 that sum of two exponential distributions resulted in a gamma
distribution. If X ∼ Exponential (λ) then it can also be viewed as a Gamma(1, λ)
distribution. The result in Example 5.5.3 could be rephrased as follows: the sum of two
gamma random variables with shape parameter 1 and rate parameter λ is distributed as a
gamma random variable with shape parameter 2 and rate parameter λ. This holds more
generally as we show in the next example.
[Figure 5.15: The Gamma density and cumulative distribution functions for various shape and
rate parameters.]
For z ≥ 0, we have
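\[
\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} f_{X_1}(x) f_{X_2}(z - x)\,dx = \int_0^z f_{X_1}(x) f_{X_2}(z - x)\,dx \\
&= \frac{e^{-\lambda z} \lambda^{n+m}}{(n - 1)!\,(m - 1)!} \int_0^z x^{n-1} (z - x)^{m-1}\,dx
\end{aligned}
\]
Define
\[
c(n, m) = \frac{\int_0^1 u^{n-1} (1 - u)^{m-1}\,du}{(n - 1)!\,(m - 1)!}.
\]
\[
f_Z(z) = \begin{cases} c(n, m) \cdot \lambda^{n+m} z^{n+m-1} e^{-\lambda z} & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}
\]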
To evaluate c(n, m) we use the following fact. From Proposition 5.5.1 fZ (·) (given by
(5.5.1)) is a Probability density function. Therefore,
\[
\begin{aligned}
1 &= \int_{-\infty}^{\infty} f_Z(z)\,dz \\
&= \int_0^{\infty} c(n, m) \lambda^{n+m} z^{n+m-1} e^{-\lambda z}\,dz \\
&= c(n, m)\,[(n + m - 1)!],
\end{aligned}
\]
where in the last line we have used (5.5.2) with n replaced by n + m. So c(n, m) = 1/(n + m − 1)!.
Hence Z has Gamma (n + m, λ) distribution. From the definition of c(n, m) we also have
\[
\int_0^1 u^{n-1} (1 - u)^{m-1}\,du = \frac{(n - 1)!\,(m - 1)!}{(n + m - 1)!}.
\]
The above calculation is easily extended by an induction argument to obtain the fact
that if λ > 0 and Xi , 1 ≤ i ≤ n, are independent Gamma(ni , λ) distributed random variables
(respectively), then
\[
Z = \sum_{i=1}^{n} X_i \quad \text{has} \quad \text{Gamma}\left( \sum_{i=1}^{n} n_i,\ \lambda \right) \text{ distribution.}
\]
As Exponential (λ) is the same as Gamma(1, λ) random variable, the above implies
that the sum of n independent Exponential (λ) random variables is a Gamma(n, λ) random
variable. ■
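The conclusion that a Gamma(n, λ) plus an independent Gamma(m, λ) variable is Gamma(n + m, λ) can be illustrated by evaluating the convolution (5.5.1) numerically. The R sketch below uses arbitrary illustrative values of n, m and λ, and the helper name conv_gamma is hypothetical. It also evaluates the integral appearing in the definition of c(n, m) with R's beta() function.

```r
# Numerical check: Gamma(n, lambda) + Gamma(m, lambda) has the Gamma(n + m, lambda) density.
n <- 2; m <- 3; lambda <- 1.5             # illustrative values (assumed)

conv_gamma <- function(z) {
  sapply(z, function(zz) {
    integrate(function(x) dgamma(x, shape = n, rate = lambda) *
                          dgamma(zz - x, shape = m, rate = lambda),
              lower = 0, upper = zz)$value
  })
}

z <- c(0.5, 2, 4)
print(cbind(z, convolution = conv_gamma(z),
            gamma_density = dgamma(z, shape = n + m, rate = lambda)))

# beta(n, m) is the integral of u^(n-1) (1-u)^(m-1) over (0, 1), so
# c(n, m) = beta(n, m) / ((n-1)! (m-1)!) should equal 1 / (n+m-1)!.
c_nm <- beta(n, m) / (factorial(n - 1) * factorial(m - 1))
print(c(c_nm = c_nm, reciprocal_factorial = 1 / factorial(n + m - 1)))
```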
It is possible to define the Gamma distribution when the shape parameter is not necessarily
an integer.
Definition 5.5.7. X ∼ Gamma(α, λ): Let λ > 0 and α > 0. Then X is said to
be Gamma distributed with shape parameter α and rate parameter λ if it has the
density
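\[
f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\lambda x}, \tag{5.5.4}
\]
where x ≥ 0 and for α > 0
\[
\Gamma(\alpha) = \int_0^{\infty} x^{\alpha - 1} e^{-x}\,dx \tag{5.5.5}
\]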
One can imitate the calculation done in Example 5.5.6 as well for such a Gamma distribution.
The distribution function of a gamma random variable involves an indefinite form of the
integral in (5.5.5). Such integrals are known as incomplete gamma functions, and have no
closed-form solution in terms of simple functions. In R, F (x) for the gamma distribution
\[
F(x) = P(X \le x) = \int_0^x \frac{\lambda^{\alpha}}{\Gamma(\alpha)}\, z^{\alpha - 1} e^{-\lambda z}\,dz, \qquad x > 0
\]
can be evaluated numerically with a function call of the form pgamma(x, alpha, lambda).
For example,
pgamma(1, 2, 1)
[1] 0.2642411
[1] 0.5627258
Similarly, the density function f (x) in (5.5.4) involves the normalising constant Γ(α) (also
known as the gamma function) which usually cannot be computed explicitly when α is not
an integer. Using R, one can evaluate f (x) numerically using the dgamma() function as
dgamma(1, 2, 1)
[1] 0.3678794
[1] 0.2769272
Let X and Y be two independent continuous random variables with densities fX and fY . In
this section we shall find out the probability density function of Z = X/Y .
As P (Y = 0) = 0, Z is a well defined random variable.
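\[
\begin{aligned}
F(z) &= P(Z \le z) \\
&= P\left( \frac{X}{Y} \le z \right) \\
&= \iint_{\{(x, y) : y \ne 0,\ \frac{x}{y} \le z\}} f_X(x) f_Y(y)\,dy\,dx \\
&= \iint_{\{(x, y) : y < 0,\ \frac{x}{y} \le z\}} f_X(x) f_Y(y)\,dy\,dx + \iint_{\{(x, y) : y > 0,\ \frac{x}{y} \le z\}} f_X(x) f_Y(y)\,dy\,dx \\
&= \iint_{\{(x, y) : y < 0,\ x \ge yz\}} f_X(x) f_Y(y)\,dy\,dx + \iint_{\{(x, y) : y > 0,\ x \le yz\}} f_X(x) f_Y(y)\,dy\,dx \\
&= \int_{-\infty}^{0} \int_{yz}^{\infty} f_X(x) f_Y(y)\,dx\,dy + \int_0^{\infty} \int_{-\infty}^{yz} f_X(x) f_Y(y)\,dx\,dy \\
&= I + II
\end{aligned}
\]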
Let us make a u-substitution x = yu in both I and II. For I, y < 0, so we will obtain,
\[
\begin{aligned}
I &= \int_{-\infty}^{0} \int_{z}^{-\infty} y f_X(yu) f_Y(y)\,du\,dy \\
&= \int_{-\infty}^{0} \int_{-\infty}^{z} (-y) f_X(yu) f_Y(y)\,du\,dy \\
&= \int_{-\infty}^{z} \int_{-\infty}^{0} (-y) f_X(yu) f_Y(y)\,dy\,du,
\end{aligned}
\]
where in the last line we have changed the order of integration¹. For II, y > 0 so we will
obtain (similarly as in I),
\[
\begin{aligned}
II &= \int_0^{\infty} \int_{-\infty}^{z} y f_X(yu) f_Y(y)\,du\,dy \\
&= \int_{-\infty}^{z} \int_0^{\infty} y f_X(yu) f_Y(y)\,dy\,du,
\end{aligned}
\]
¹ The change of order of integration is justifiable under certain hypotheses on the integrand. We shall assume
these are satisfied, as it is not possible to state and verify them within the scope of this book.
Therefore
\[
\begin{aligned}
F(z) &= I + II \\
&= \int_{-\infty}^{z} \int_{-\infty}^{0} (-y)\, f_X(yu) f_Y(y)\,dy\,du + \int_{-\infty}^{z} \int_{0}^{\infty} y\, f_X(yu) f_Y(y)\,dy\,du \\
&= \int_{-\infty}^{z} \int_{-\infty}^{\infty} |y|\, f_X(yu) f_Y(y)\,dy\,du.
\end{aligned}
\]
As fX (·) and fY (·) are densities, it can be shown that the integrand is a piecewise continuous
function. Hence F is of the form (5.2.4), and Theorem 5.2.5 implies that the probability
density function of Z is given by (5.5.6). ■
Using the above method for finding the distribution of quotient of two random variables,
we shall present three examples that will lead us to standard continuous distributions
that are useful in applications. We begin with an example that constructs the Cauchy
distribution.
Example 5.5.9. Let X and Y be two independent Normal random variables with mean 0
and variance σ² ≠ 0. Let Z = X/Y.
We know that the probability density function of Z is
given by (5.5.6). Further, for any y, z ∈ R,
\[
f_X(zy) f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{z^2 y^2}{2\sigma^2}} \cdot \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{y^2}{2\sigma^2}} = \frac{1}{2\pi\sigma^2} \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right).
\]
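Fix z ∈ R.
\[
\begin{aligned}
f_Z(z) &= \int_{-\infty}^{\infty} |y|\, \frac{1}{2\pi\sigma^2} \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy \\
&= \frac{1}{2\pi\sigma^2} \left[ \int_{-\infty}^{0} |y| \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy + \int_{0}^{\infty} |y| \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy \right] \\
&= \frac{1}{2\pi\sigma^2} \left[ \int_{-\infty}^{0} (-y) \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy + \int_{0}^{\infty} y \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy \right].
\end{aligned}
\]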
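It is easy to see that the two integrals are the same (perform a substitution of u = −y in the
first integral). So the above is
\[
= \frac{1}{\pi\sigma^2} \int_{0}^{\infty} y \exp\left(-\left(\frac{1+z^2}{2\sigma^2}\right) y^2\right) dy.
\]
Now perform the substitution $\frac{1+z^2}{2\sigma^2}\, y^2 = t$, so $\frac{1+z^2}{\sigma^2}\, y\,dy = dt$.
\[
\begin{aligned}
f_Z(z) &= \frac{1}{\pi\sigma^2} \cdot \frac{\sigma^2}{1+z^2} \int_{0}^{\infty} \exp(-t)\,dt \\
&= \frac{1}{\pi(1+z^2)} \left( -e^{-t} \Big|_{0}^{\infty} \right) = \frac{1}{\pi(1+z^2)}.
\end{aligned}
\]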
Therefore Z has the Cauchy distribution, which we first saw in the context of Example
5.3.4. ■
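The conclusion of Example 5.5.9 can be illustrated by simulation. The sketch below is only a sanity check under assumed values (the seed, the sample size, σ = 2 and the probe points are arbitrary choices): it compares the empirical distribution function of X/Y with pcauchy(), R's standard Cauchy distribution function.

```r
set.seed(1)            # arbitrary seed, for reproducibility only
n     <- 100000        # assumed sample size
sigma <- 2             # any sigma > 0 should give the same answer

x <- rnorm(n, mean = 0, sd = sigma)
y <- rnorm(n, mean = 0, sd = sigma)
z <- x / y             # ratio of two independent Normal(0, sigma^2) variables

# Compare P(Z <= q) estimated from the sample with the Cauchy(0, 1) value
probe       <- c(-2, -0.5, 0, 1, 3)
empirical   <- sapply(probe, function(q) mean(z <= q))
theoretical <- pcauchy(probe)

print(round(cbind(probe, empirical, theoretical), 4))
```

Changing sigma should leave the comparison essentially unchanged, in line with the fact that σ cancels in the final density.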
The next example considers the ratio of two gamma random variables. This motivates a
standard distribution called the F -distribution, which we will encounter in Chapter 8.
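Fix z > 0,
\[
\begin{aligned}
f_Z(z) &= \int_{0}^{\infty} y\, \frac{\lambda^{n+m}}{(n-1)!\,(m-1)!}\, y^{n+m-2} z^{m-1} e^{-\lambda(1+z)y}\,dy \\
&= \frac{z^{m-1} \lambda^{n+m}}{(n-1)!\,(m-1)!} \int_{0}^{\infty} y^{n+m-1} e^{-\lambda(1+z)y}\,dy \\
&= \frac{z^{m-1} \lambda^{m+n}}{(1+z)^{m+n}\,(m-1)!\,(n-1)!} \int_{0}^{\infty} t^{m+n-1} e^{-\lambda t}\,dt
\end{aligned}
\]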
■
Our next example is a construction of the Beta-distribution.
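Let W = Y/X.
Note that Z = 1/(1 + W).
In Example 5.5.10 we found the probability density
function of W . We shall use this to find the distribution function of Z. As P (W ≥ 0) = 1,
\[
P(Z \le z) =
\begin{cases}
0 & \text{if } z < 0 \\
1 & \text{if } z > 1.
\end{cases}
\]
For 0 < z < 1,
\[
\begin{aligned}
P(Z \le z) &= P\left(\frac{1}{1+W} \le z\right) = P\left(W \ge \frac{1-z}{z}\right) \\
&= 1 - P\left(W \le \frac{1-z}{z}\right).
\end{aligned}
\]
Differentiating with respect to z,
\[
\begin{aligned}
f_Z(z) &= \frac{1}{z^2} \cdot \frac{(m+n-1)!}{(m-1)!\,(n-1)!} \left(\frac{1-z}{z}\right)^{m-1} \left(1 + \frac{1-z}{z}\right)^{-(m+n)} \\
&= \frac{(m+n-1)!}{(m-1)!\,(n-1)!}\, z^{n-1} (1-z)^{m-1}.
\end{aligned}
\]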
Definition 5.5.12. X ∼ Beta(α, β): Let α > 0 and β > 0. Then X is said to be
Beta distributed with parameters α and β if it has the density
Figure 5.16: The Beta density and cumulative distribution functions for selected shape parameters.
\[
F(x) = P(X \le x) = \int_{0}^{x} \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, u^{\alpha-1} (1-u)^{\beta-1}\,du, \qquad 0 < x < 1
\]
can be evaluated numerically with a function call of the form pbeta(x, alpha, beta).
For example,
[1] 0.5
pbeta(0.5, 3, 6)
[1] 0.8554688
pbeta(0.2, 6, 1)
[1] 6.4e-05
pbeta(0.2, 1, 6)
[1] 0.737856
In the special case where either α or β equals 1, the distribution function of X can
be computed explicitly. Another special case is the standard arcsine law we previously
encountered in Exercise 5.2.7 in terms of its explicit distribution function; it is easy to
see that this is the same as the Beta( 21 , 12 ) distribution. The semicircular distribution
encountered in Exercise 5.2.6 is also related, in the sense that it can be viewed as a location
and scale transformed beta random variable.
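The special cases mentioned above can be checked numerically. The sketch below assumes the usual closed forms (x^α when β = 1, 1 − (1 − x)^β when α = 1, and (2/π) arcsin(√x) for the standard arcsine law); the exact form used in Exercise 5.2.7 is not reproduced in this section, so treat the arcsine expression as an assumption. The evaluation points and shape values are arbitrary.

```r
x <- c(0.1, 0.2, 0.5, 0.9)   # arbitrary evaluation points in (0, 1)
a <- 6                       # illustrative shape values
b <- 6

# beta = 1: the distribution function should reduce to x^alpha
case_beta_one  <- cbind(pbeta = pbeta(x, a, 1), closed_form = x^a)

# alpha = 1: the distribution function should reduce to 1 - (1 - x)^beta
case_alpha_one <- cbind(pbeta = pbeta(x, 1, b), closed_form = 1 - (1 - x)^b)

# Beta(1/2, 1/2) against the standard arcsine distribution function
arcsine <- cbind(pbeta = pbeta(x, 0.5, 0.5), closed_form = 2 / pi * asin(sqrt(x)))

print(case_beta_one)
print(case_alpha_one)
print(arcsine)
```

The rows for x = 0.2 reproduce the values 6.4e-05 and 0.737856 shown in the pbeta() calls above.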
exercises
Ex. 5.5.1. Suppose that X and Y are random variables with joint probability density
\[
f(x,y) =
\begin{cases}
\frac{4}{5}(xy + 1) & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1 \\
0 & \text{otherwise}
\end{cases}
\]
Ex. 5.5.2. Let X and Y be two random variables with the joint p.d.f given by
\[
f(x,y) =
\begin{cases}
\lambda^2 e^{-\lambda y} & 0 \le x \le y \\
0 & \text{otherwise}
\end{cases}
\]
Ex. 5.5.4. Let X1 , X2 , X3 be independent and identically distributed Uniform (0, 1) random
variables. Let $A = X_1 X_3$ and $B = X_2^2$. Find P (A < B ).
Ex. 5.5.5. Let X and Y be two independent exponential random variables each with mean
1.
1
(a) Find the density of U1 = X 2 .
Ex. 5.5.6. Suppose X is a uniform random variable in the interval (0, 1) and Y is an
independent exponential(2) random variable. Find the distribution of Z = X + Y .
Ex. 5.5.7. Let α > 0, β > 0, λ > 0, X and Y be two independent Gamma(α, λ) and
Gamma(β, λ) random variables respectively. Then Z = X + Y is distributed as a Gamma
(α + β, λ).
Ex. 5.5.8. Let X and Y be two independent random variables with probability density
function fX (·) and fY (·). Show that X + Y and Y + X have the same distribution by
showing that the integral expression defining fX ⋆ fY (·) is equal to the integral expression
defining fY ⋆ fX (·)).
Ex. 5.5.9. Let α > 0 and Γ(α) as in (5.5.5).
(a) Using the same technique as in Lemma 5.5.4, show that 0 < Γ(α) < ∞.
(b) Show that $\Gamma(\tfrac{1}{2}) = \int_0^{\infty} x^{-0.5} e^{-x}\,dx = \sqrt{\pi}$.
Ex. 5.5.10. Let α > 0, δ > 0, λ > 0. Let X and Y be two independent Gamma (α, λ) and
Gamma (δ, λ) random variables respectively.
(a) Let W = Y/X.
Find the probability density function of W .
(b) Let Z = X/(X + Y). Find the probability density function of Z.
Ex. 5.5.11. Suppose X, Y are independent random variables each normally distributed
with mean 0 and variance 1.
(a) Find the probability density function of $R = \sqrt{X^2 + Y^2}$.
(c) Find the probability density function of θ = arctan X
Y
The notion of expected value carries over from discrete to continuous random variables,
but instead of being described in terms of sums, it is defined in terms of integrals.
\[
E[X] = \int_{-\infty}^{\infty} x f(x)\,dx,
\]
provided that the integral converges absolutely.ᵃ In this case we say that X has
“finite expectation”. If the integral diverges to ±∞ we say the random variable has
infinite expectation. If the integral diverges, but not to ±∞, we say the expected value
is undefined.
ᵃ That is, $\lim_{M \to -\infty,\ N \to \infty} \int_{M}^{N} |x|\, f(x)\,dx < \infty$.
The next three examples illustrate the three possibilities: the first is an example where the
expectation exists as a real number; the next is an example of an infinite expected value;
and the final example shows that the expected value may not be defined at all.
Example 6.1.2. Let X ∼ Uniform(a, b). Then the expected value of X is given by
\[
E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx = \int_{a}^{b} x \cdot \frac{1}{b-a}\,dx = \frac{1}{2(b-a)}\,(b^2 - a^2) = \frac{b+a}{2}.
\]
This result is intuitive since it says that the average value of a Uniform(a, b) random
variable is the midpoint of its interval. ■
Example 6.1.3. Let 0 < α < 1 and X ∼ Pareto(α) which is defined to have the probability
density function
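\[
f(x) =
\begin{cases}
\dfrac{\alpha}{x^{\alpha+1}} & 1 \le x < \infty \\
0 & \text{otherwise}
\end{cases}
\]
\[
E[X] = \int_{1}^{\infty} x \cdot \frac{\alpha}{x^{\alpha+1}}\,dx = \alpha \lim_{M \to \infty} \int_{1}^{M} x^{-\alpha}\,dx = \frac{\alpha}{-\alpha+1} \left( -1 + \lim_{M \to \infty} M^{-\alpha+1} \right) = \infty
\]
as 0 < α < 1.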
Thus this Pareto random variable has an infinite expected value. ■
Example 6.1.4. Let X ∼ Cauchy(0, 1). Then the probability density function of X is
given by
\[
f(x) = \frac{1}{\pi}\, \frac{1}{1+x^2} \qquad \text{for all } x \in \mathbb{R}.
\]
Now,
\[
E[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\pi(1+x^2)}\,dx.
\]
By Exercise 6.1.10, we know that as M → −∞, N → ∞ the integral $\int_{M}^{N} \frac{x}{1+x^2}\,dx$ neither
converges nor diverges to ±∞. So E [X ] is not defined for this Cauchy random variable. ■
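To see concretely why the limit fails to exist, one can evaluate the truncated integral along two different ways of sending M → −∞ and N → ∞. The sketch below uses the antiderivative log(1 + x²)/2 and omits the constant factor 1/π; the function name truncated_integral and the two chosen paths are illustrative assumptions.

```r
# Integral of x / (1 + x^2) from M to N, via the antiderivative log(1 + x^2) / 2
truncated_integral <- function(M, N) {
  0.5 * (log(1 + N^2) - log(1 + M^2))
}

a <- 10^(1:6)                                  # growing truncation levels

symmetric <- truncated_integral(-a, a)         # path N = -M
lopsided  <- truncated_integral(-a, 2 * a)     # path N = -2M

print(cbind(a, symmetric, lopsided))
print(log(2))                                  # value approached along the second path
```

Along the symmetric path the integral is always 0, while along the second path it approaches log 2, so no single limit can be assigned.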
(b) Let Y be a continuous random variable such that (X, Y ) have a joint probability
density function f : R2 → R. Suppose h : R2 → R is piecewise continuous.
Then,
\[
E[h(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x,y) f(x,y)\,dx\,dy.
\]
Proof- The proof is beyond the scope of this book. For (a) when g is as in Exercise
5.3.10 then one can provide the proof using only the tools of basic calculus (we will leave
this case as an exercise to the reader) ■
Example 6.1.6. A piece of equipment breaks down after a functional lifetime that is a
random variable T ∼ Exp( 51 ). An insurance policy purchased on the equipment pays a
dollar amount equal to 1000 − 200t if the equipment breaks down at a time 0 ≤ t ≤ 5
and pays nothing if the equipment breaks down after time t = 5. What is the expected
payment of the insurance policy?
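\[
\begin{aligned}
E[g(T)] &= \int_{0}^{\infty} \frac{1}{5} e^{-(1/5)t} \max\{1000 - 200t,\ 0\}\,dt \\
&= \int_{0}^{5} \frac{1}{5} e^{-(1/5)t} (1000 - 200t)\,dt \\
&= 1000 e^{-1} \approx \$367.88
\end{aligned}
\]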
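The value in Example 6.1.6 can be confirmed numerically. This is a minimal sketch using dexp() for the Exp(1/5) density and integrate() for the integral; the names payment and integrand are illustrative.

```r
rate <- 1 / 5                                   # T ~ Exp(1/5)

# Payment as a function of the breakdown time t
payment <- function(t) pmax(1000 - 200 * t, 0)

# Integrand g(t) * f(t); the payment is zero beyond t = 5
integrand <- function(t) payment(t) * dexp(t, rate = rate)

expected_payment <- integrate(integrand, lower = 0, upper = 5,
                              rel.tol = 1e-10)$value

print(expected_payment)
print(1000 * exp(-1))
```

Integrating only over (0, 5) avoids asking the numerical routine to handle the kink of the payment function at t = 5.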
Example 6.1.7. Let X, Y ∼ Uniform(0, 1). What is the expected value of the larger of
the two variables?
We offer two methods of solving this problem. The first is to define Z = max{X, Y } and
then determine the density of Z. To do so, we first find its distribution. FZ (z ) = P (Z ≤ z ),
but max{X, Y } is less than or equal to z exactly when both X and Y are less than or
equal to z. So for 0 ≤ z ≤ 1,
\[
\begin{aligned}
F_Z(z) &= P((X \le z) \cap (Y \le z)) \\
&= P(X \le z) \cdot P(Y \le z) \\
&= z^2.
\end{aligned}
\]
Therefore $f_Z(z) = F_Z'(z) = 2z$, after which the expected value can be obtained through
integration:
\[
E[Z] = \int_{0}^{1} z \cdot 2z\,dz = \frac{2}{3} z^3 \Big|_{0}^{1} = \frac{2}{3}.
\]
An alternative method is to use Theorem 6.1.5 (b) to calculate the expectation directly
without finding a new density. Since X and Y are independent, their joint distribution is
the product of their marginal distributions. That is,
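\[
f(x,y) = f_X(x) f_Y(y) =
\begin{cases}
1 & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1 \\
0 & \text{otherwise}
\end{cases}
\]
Therefore,
\[
\begin{aligned}
E[\max\{X, Y\}] &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \max\{x, y\} \cdot f(x,y)\,dx\,dy \\
&= \int_{0}^{1} \int_{0}^{1} \max\{x, y\} \cdot 1\,dx\,dy
\end{aligned}
\]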
Results from calculus may be used to show that the linearity properties from Theorem
4.1.7 apply to continuous random variables as well as to discrete ones. We restate
them here for completeness.
Theorem 6.1.8. Suppose that X and Y are continuous random variables with
piecewise continuous joint density function f : R2 → R. Assume that both
have finite expected value. If a and b are real numbers then
(a) E [aX ] = aE [X ];
(b) E [aX + b] = aE [X ] + b
(c) E [X + Y ] = E [X ] + E [Y ]; and
(d) E [aX + bY ] = aE [X ] + bE [Y ].
(e) If X ≥ 0 then E [X ] ≥ 0.
We will use these now-familiar properties in the continuous setting. As in the discrete
setting we can define the variance and standard deviation of a continuous random variable.
Since the above terms are expected values, there is the possibility that they may be
infinite because the integral describing the expectation diverges to infinity. As the
integrand is strictly positive, it isn’t possible for the integral to diverge unless it
diverges to infinity.
Theorem 6.1.10. Let a ∈ R and let X be a continuous random variable with finite
variance (and thus, with finite expected value as well). Then,
(a) V ar [X ] = E [X 2 ] − (E [X ])2 .
(b) V ar [aX ] = a2 · V ar [X ];
(d) V ar [X + a] = V ar [X ]; and
(e) SD [X + a] = SD [X ].
(f) E [XY ] = E [X ]E [Y ];
(g) V ar [X + Y ] = V ar [X ] + V ar [Y ]; and
(h) $SD[X + Y] = \sqrt{(SD[X])^2 + (SD[Y])^2}$.
Proof- The proof is essentially an imitation of the proofs presented in Theorem 4.1.10,
Theorem 4.2.5, Theorem 4.2.4, and Theorem 4.2.6. One needs to use the respective
densities, integrals in lieu of sums, and use Theorem 6.1.11 and Theorem 6.1.5 when needed.
We will leave this as an exercise to the reader. ■
Example 6.1.11. Let X ∼ Normal (0, 1). In this example we shall show that E [X ] = 0
and V ar [X ] = 1. Before that we collect some facts about the probability density function
of X, given by (5.2.7). Using (5.2.9) with z = 0, we can conclude that
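\[
\int_{0}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = \frac{1}{2} \tag{6.1.1}
\]
\[
\max\{|x|,\ x^2\}\, e^{-\frac{x^2}{2}} \le c_1 e^{-c_1 |x|}
\]
\[
\int_{-\infty}^{\infty} |x|\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx \le \int_{-\infty}^{\infty} c_1 e^{-c_1 |x|}\,dx < \infty
\]
\[
\int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx \le \int_{-\infty}^{\infty} c_1 e^{-c_1 |x|}\,dx < \infty \tag{6.1.2}
\]
\[
E[X] = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx < \infty
\]
\[
E[X] = \int_{-\infty}^{0} x\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx + \int_{0}^{\infty} x\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx.
\]
\[
\int_{-\infty}^{0} x\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = -\int_{0}^{\infty} y\, \frac{1}{\sqrt{2\pi}} e^{-\frac{y^2}{2}}\,dy.
\]
So E [X ] = 0. Again by (6.1.2),
\[
Var[X] = \int_{-\infty}^{\infty} (x - E[X])^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = \int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx < \infty
\]
\[
\int_{-\infty}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = \int_{-\infty}^{0} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx + \int_{0}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx = 2 \int_{0}^{\infty} x^2\, \frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}\,dx.
\]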
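Then we use integration by parts as in Lemma 5.5.4. Set $u(x) = x$ and $v(x) = e^{-\frac{x^2}{2}}$, which
imply $u'(x) = 1$ and $v'(x) = -x e^{-\frac{x^2}{2}}$. Therefore for a > 0,
\[
\begin{aligned}
\int_{0}^{a} x^2 e^{-\frac{x^2}{2}}\,dx &= \int_{0}^{a} u(x)(-v'(x))\,dx = u(x)(-v(x)) \Big|_{0}^{a} - \int_{0}^{a} u'(x)(-v(x))\,dx \\
&= -a e^{-\frac{a^2}{2}} + \int_{0}^{a} e^{-\frac{x^2}{2}}\,dx.
\end{aligned}
\]
Using the fact that $\lim_{a \to \infty} a e^{-\frac{a^2}{2}} = 0$ and (6.1.1) we have
\[
\begin{aligned}
Var[X] &= \frac{2}{\sqrt{2\pi}} \int_{0}^{\infty} x^2 e^{-\frac{x^2}{2}}\,dx = \frac{2}{\sqrt{2\pi}} \lim_{a \to \infty} \int_{0}^{a} x^2 e^{-\frac{x^2}{2}}\,dx = \frac{2}{\sqrt{2\pi}} \lim_{a \to \infty} \left[ -a e^{-\frac{a^2}{2}} + \int_{0}^{a} e^{-\frac{x^2}{2}}\,dx \right] \\
&= \frac{2}{\sqrt{2\pi}} \left[ 0 + \int_{0}^{\infty} e^{-\frac{x^2}{2}}\,dx \right] = \frac{2}{\sqrt{2\pi}} \left[ 0 + \frac{\sqrt{2\pi}}{2} \right] = 1.
\end{aligned}
\]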
Suppose Y ∼ Normal (µ, σ²); then we know by Corollary 5.3.3 that $W = \frac{Y - \mu}{\sigma}$ ∼ Normal
(0, 1). By Example 6.1.11, E [W ] = 0 and V ar [W ] = 1. Also Y = σW + µ, so by
Theorem 6.1.8(b) E [Y ] = σE [W ] + µ = µ and by Theorem 6.1.10 (d) and (b) V ar [Y ] =
σ² V ar [W ] = σ². ■
Example 6.1.12. Let X ∼ Uniform(a, b). To calculate the variance of X first note that
Theorem 6.1.5(a) gives
\[
E[X^2] = \int_{-\infty}^{\infty} x^2 \cdot f(x)\,dx = \int_{a}^{b} x^2 \cdot \frac{1}{b-a}\,dx = \frac{1}{3(b-a)}\,(b^3 - a^3) = \frac{b^2 + ab + a^2}{3}.
\]
\[
Var[X] = E[X^2] - (E[X])^2 = \frac{b^2 + ab + a^2}{3} - \left(\frac{b+a}{2}\right)^2 = \frac{(b-a)^2}{12}.
\]
The Markov and Chebychev inequalities also apply to continuous random variables. As
with discrete variables, these help to estimate the probabilities that a random variable will
fall within a certain number of standard deviations from its expected value.
\[
P(X \ge c) \le \frac{\mu}{c}.
\]
\[
P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.
\]
As f (·) ≥ 0, we have xf (x) ≥ cf (x) whenever x > c. So again using facts about integrals
\[
\mu \ge \int_{c}^{\infty} c f(x)\,dx = c \int_{c}^{\infty} f(x)\,dx = c\, P(X > c).
\]
The last equality follows from definition. Hence we have the result.
(b) The event (|X − µ| ≥ kσ ) is the same as the event ((X − µ)2 ≥ k 2 σ 2 ). The random
variable (X − µ)2 is certainly non-negative, is continuous by Exercise 5.3.9, and its expected
value is the variance of X which we have assumed to be finite. Therefore we may apply
Markov’s inequality to (X − µ)2 to get
\[
P(|X - \mu| \ge k\sigma) = P((X - \mu)^2 \ge k^2\sigma^2) \le \frac{E[(X - \mu)^2]}{k^2\sigma^2} = \frac{Var[X]}{k^2\sigma^2} = \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.
\]
Though the theorem is true for all k > 0, it doesn’t give any useful information unless
k > 1.
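To get a feel for how conservative Chebychev's bound can be, the sketch below compares it with the exact probability for an exponential random variable. The choice of an Exponential(1) variable (mean 1 and standard deviation 1) and the values of k are assumptions made purely for illustration.

```r
lambda <- 1
mu     <- 1 / lambda     # mean of an Exponential(lambda) variable
sigma  <- 1 / lambda     # its standard deviation

k <- c(0.5, 1, 1.5, 2, 3, 4)

# P(|X - mu| >= k sigma) = P(X <= mu - k sigma) + P(X >= mu + k sigma)
actual <- pexp(mu - k * sigma, rate = lambda) +
  pexp(mu + k * sigma, rate = lambda, lower.tail = FALSE)

bound <- 1 / k^2         # Chebychev's bound

print(round(cbind(k, actual, bound), 4))
```

For k ≤ 1 the bound is at least 1 and so carries no information, which matches the remark above.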
exercises
Ex. 6.1.4. Let 1 < α and X ∼ Pareto(α). Calculate E [X ] to show that it is finite.
Ex. 6.1.5. Let X be a random variable with density f (x) = 2x for 0 < x < 1 (and f (x) = 0
otherwise).
(a) Calculate E [X ]. You should get a result larger than 1/2. Explain why this should be
expected even without computations.
(b) Calculate SD [X ].
Ex. 6.1.6. Let X ∼ Uniform(a, b) and let k > 0. Let µ and σ be the expected value and
standard deviation calculated in Example 6.1.12.
(a) Calculate P (|X − µ| ≤ kσ ). Your final answer should depend on k, but not on the
values of a or b.
(b) What is the value of k such that results of more than k standard deviations from
expected value are unachievable for X?
(b) Let µ and σ denote the mean and standard deviation of X respectively. Use your
computations from (a) to calculate P (|X − µ| ≤ kσ ). Your final answer should
depend on k, but not on the value of λ.
(c) Is there a value of k such that results of more than k standard deviations from
expected value are unachievable for X?
Ex. 6.1.8. Let X ∼ Gamma(n, λ) with n ∈ N and λ > 0. Using Example 5.5.3, Exercise
6.1.7(a) and Theorem 6.1.8(c) calculate E [X ]. Using Theorem 6.1.10 calculate V ar [X ].
Ex. 6.1.9. Let X ∼ Uniform(0, 10) and let g (x) = max{x, 4}. Calculate E [g (X )].
Ex. 6.1.10. Show that as M → −∞, N → ∞, the integral $\int_{M}^{N} \frac{x}{1+x^2}\,dx$ does not have a limit.
Ex. 6.1.11. Using the hints provided below prove the respective parts of Theorem 6.1.8.
(a) For a = 0 the result is clear. Let a ̸= 0 and fX : R → R be the probability density
function of X. Use Lemma 5.3.2 to find the probability density function of aX.
Compute the expectation of aX to obtain the result. Alternatively use Theorem
6.1.5(a).
(c) Use the joint density of (X, Y ) to write E [X + Y ]. Then use (5.4.2) and (5.4.3) to
prove the result.
(e) If X ≥ 0 then its marginal density fX : R → R is positive only when x ≥ 0. The
result immediately follows from the definition of expectation.
Covariance of continuous random variables (X, Y ) is used to describe how the two random
variables relate to each other. The properties proved about covariances for discrete random
variables in Section 4.5 apply to continuous random variables as well via essentially the
same arguments. We define covariance and state the properties next.
Definition 6.2.1. Let X and Y be random variables with joint probability density
function f : R2 → R. Suppose X and Y have finite expectation. Then the covariance
of X and Y is defined as
\[
Cov[X, Y] = E[(X - E[X])(Y - E[Y])] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - E[X])(y - E[Y])\, f(x,y)\,dx\,dy. \tag{6.2.1}
\]
Since it is defined in terms of an expected value, there is the possibility that the covariance
may be infinite or not defined at all. We now state the properties of Covariance.
Theorem 6.2.2. Let X, Y be continuous random variables such that they have
joint probability density function. Assume that 0 ̸= σx2 = Var(X ) < ∞, 0 ̸= σy2 =
Var(Y ) < ∞. Then
Definition 6.2.3. Let (X, Y ) be continuous random variables both with finite
variance and covariance. From Theorem 6.2.2(d) the quantity $\rho[X, Y] = \frac{Cov[X, Y]}{\sigma_X \sigma_Y}$ is
in the interval [−1, 1]. It is known as the “correlation” of X and Y . As discussed
earlier, both the numerator and denominator include the units of X and the units of Y .
The correlation, therefore, has no units associated with it. It is thus a dimensionless
rescaling of the covariance and is frequently used as an absolute measure of trends
between the two continuous random variables as well.
Example 6.2.4. Let X ∼ Uniform (0, 1) and be independent of Y ∼ Uniform (0, 1). Let
U = min(X, Y ) and V = max(X, Y ). We wish to find ρ[U , V ]. First, 0 < u < 1
P (V ≤ v ) = P (X ≤ v, Y ≤ v ) = P (X ≤ v )P (Y ≤ v ) = v 2 ,
P (U ≤ u, V ≤ v ) = P (V ≤ v ) − P (U > u, V ≤ v )
= v 2 − P (u < X ≤ v, u < Y ≤ v )
= v 2 − P (u < X ≤ v )P (u < Y ≤ v )
= v 2 − (v − u)2 ,
where we have used the formula for distribution function of V and the fact that X, Y are
independent uniform random variables. It is easily seen that P (U ≤ u, V ≤ v ) = 0 for all
other possibilities of (u, v ). As the joint distribution function is piecewise differentiable in
each variable, the joint probability density function of U and V , f : R2 → R, exists and is
obtained by differentiating it partially in u and v.
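\[
f(u, v) =
\begin{cases}
2 & \text{if } 0 < u < v < 1 \\
0 & \text{otherwise}
\end{cases}
\]
Now,
\[
\begin{aligned}
E[U] &= \int_{0}^{1} u \cdot 2(1-u)\,du = \left[ u^2 - \frac{2u^3}{3} \right]_{0}^{1} = \frac{1}{3} \\
E[V] &= \int_{0}^{1} v \cdot 2v\,dv = \frac{2v^3}{3} \Big|_{0}^{1} = \frac{2}{3} \\
E[U^2] &= \int_{0}^{1} u^2 \cdot 2(1-u)\,du = \left[ \frac{2u^3}{3} - \frac{2u^4}{4} \right]_{0}^{1} = \frac{1}{6} \\
E[V^2] &= \int_{0}^{1} v^2 \cdot 2v\,dv = \frac{2v^4}{4} \Big|_{0}^{1} = \frac{1}{2} \\
E[UV] &= \int_{0}^{1} \left[ \int_{0}^{v} uv \cdot 2\,du \right] dv = \int_{0}^{1} 2v\, \frac{u^2}{2} \Big|_{0}^{v}\,dv = \int_{0}^{1} 2v \cdot \frac{v^2}{2}\,dv = \frac{v^4}{4} \Big|_{0}^{1} = \frac{1}{4}
\end{aligned}
\]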
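Therefore
\[
\begin{aligned}
Var[U] &= E[U^2] - (E[U])^2 = \frac{1}{6} - \frac{1}{9} = \frac{1}{18} \\
Var[V] &= E[V^2] - (E[V])^2 = \frac{1}{2} - \frac{4}{9} = \frac{1}{18} \\
Cov[U, V] &= E[UV] - E[U]E[V] = \frac{1}{4} - \frac{1}{3} \cdot \frac{2}{3} = \frac{1}{36} \\
\rho[U, V] &= \frac{Cov[U, V]}{\sqrt{Var[V]}\,\sqrt{Var[U]}} = \frac{\frac{1}{36}}{\sqrt{\frac{1}{18}}\,\sqrt{\frac{1}{18}}} = \frac{1}{2}.
\end{aligned}
\]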
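A quick simulation is a useful safeguard for a computation with this many fractions. The sketch below draws independent Uniform(0, 1) pairs and estimates the variances, covariance and correlation of the minimum and maximum; from the moments E[U²] = 1/6, E[V²] = 1/2 and E[UV] = 1/4 computed above, the exact values are 1/18, 1/18, 1/36 and 1/2. The seed and sample size are arbitrary.

```r
set.seed(1)                 # arbitrary seed
n <- 200000                 # assumed sample size

x <- runif(n)
y <- runif(n)
u <- pmin(x, y)             # U = min(X, Y)
v <- pmax(x, y)             # V = max(X, Y)

estimates <- c(var_u = var(u), var_v = var(v),
               cov_uv = cov(u, v), rho = cor(u, v))
exact     <- c(var_u = 1 / 18, var_v = 1 / 18,
               cov_uv = 1 / 36, rho = 1 / 2)

print(round(rbind(estimates, exact), 4))
```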
As seen in Theorem 6.2.2 (e), independence of X and Y guarantees that they are uncorre-
lated (i.e ρ[X, Y ] = 0). The converse is not true (See Example 4.5.6 for discrete case). It
is possible that Cov [X, Y ] = 0 and yet that X and Y are dependent, as the next example
shows.
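Example 6.2.5. Let X ∼ Uniform (−1, 1). Let Y = X². Note from Example 6.1.2 and
Example 6.1.12 we have E [X ] = 0, E [Y ] = E [X²] = 1/3. Further using the probability
density function of X,
\[
E[XY] = E[X^3] = \int_{-1}^{1} x^3\, \frac{1}{2}\,dx = \frac{x^4}{8} \Big|_{-1}^{1} = 0.
\]
So ρ[X, Y ] = 0. Clearly X and Y are not independent. We verify this precisely as well.
Consider
\[
P\left(X \le -\tfrac{1}{4},\ Y \le \tfrac{1}{4}\right) = P\left(X \le -\tfrac{1}{4},\ X^2 \le \tfrac{1}{4}\right) = P\left(-\tfrac{1}{2} \le X \le -\tfrac{1}{4}\right) = \frac{1}{8},
\]
\[
P\left(X \le -\tfrac{1}{4}\right) P\left(Y \le \tfrac{1}{4}\right) = P\left(X \le -\tfrac{1}{4}\right) P\left(X^2 \le \tfrac{1}{4}\right) = P\left(X \le -\tfrac{1}{4}\right) P\left(-\tfrac{1}{2} \le X \le \tfrac{1}{2}\right) = \frac{3}{8} \cdot \frac{1}{2} = \frac{3}{16}.
\]
Clearly
\[
P\left(X \le -\tfrac{1}{4},\ Y \le \tfrac{1}{4}\right) \ne P\left(X \le -\tfrac{1}{4}\right) P\left(Y \le \tfrac{1}{4}\right)
\]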
implying they are not independent. ■
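The two features of Example 6.2.5, zero correlation together with dependence, can both be seen in a simulation. The sketch below is illustrative only; the seed and sample size are arbitrary.

```r
set.seed(1)                     # arbitrary seed
n <- 200000                     # assumed sample size

x <- runif(n, min = -1, max = 1)
y <- x^2                        # Y is a deterministic function of X

sample_cor <- cor(x, y)         # should be close to 0

# Joint probability versus product of marginal probabilities
joint   <- mean(x <= -1 / 4 & y <= 1 / 4)          # exact value 1/8
product <- mean(x <= -1 / 4) * mean(y <= 1 / 4)    # exact value 3/16

print(c(sample_cor = sample_cor, joint = joint, product = product))
```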
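\[
\begin{aligned}
Var[Y \mid X = x] &= E[(Y - E[Y \mid X = x])^2 \mid X = x] \\
&= \int_{-\infty}^{\infty} \left( y - \int_{-\infty}^{\infty} y\, \frac{f(x,y)}{f_X(x)}\,dy \right)^{2} \frac{f(x,y)}{f_X(x)}\,dy.
\end{aligned}
\]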
The results proved in Theorem 4.4.4, Theorem 4.4.6, Theorem 4.4.8, and Theorem 4.4.9
are all applicable when X and Y are continuous random variables having joint probability
density function f . The proofs of these results in the continuous setting follow very similarly
(though using facts about integrals from analysis).
Theorem 6.2.7. Let (X, Y ) be continuous random variables with joint probability
density function f : R2 → R. Let g, h : R → R be defined as
\[
g(y) =
\begin{cases}
E[X \mid Y = y] & \text{if } f_Y(y) > 0 \\
0 & \text{otherwise}
\end{cases}
\qquad \text{and} \qquad
h(y) =
\begin{cases}
Var[X \mid Y = y] & \text{if } f_Y(y) > 0 \\
0 & \text{otherwise}
\end{cases}
\]
E [g (Y )] = E [X ], (6.2.3)
and
V ar [X ] = E [h(Y )] + V ar [g (Y )]. (6.2.4)
Proof- The proof of (6.2.2) is beyond the scope of this book. We shall omit it. To prove
(6.2.3) we use the definition of g and Theorem 6.1.8 (a) to write
\[
E[g(Y)] = \int_{-\infty}^{\infty} g(y) f_Y(y)\,dy = \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x\, f_{X \mid Y = y}(x)\,dx \right) f_Y(y)\,dy.
\]
Using the definition of conditional density and rearranging the order of integration we
obtain that the above is
\[
= \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} x\, \frac{f(x,y)}{f_Y(y)}\,dx \right) f_Y(y)\,dy = \int_{-\infty}^{\infty} x \left( \int_{-\infty}^{\infty} f(x,y)\,dy \right) dx = \int_{-\infty}^{\infty} x f_X(x)\,dx = E[X].
\]
\[
E[h(Y)] = E[X^2] - E[g(Y)^2]
\]
\[
Var[g(Y)] = E[g(Y)^2] - (E[g(Y)])^2 = E[g(Y)^2] - (E[X])^2
\]
Keep in mind that the exterior expected value in the expression E [E [X|Y ]] refers to the
average of E [X|Y ] viewed as a function of Y .
Example 6.2.8. Let X ∼ Uniform (0, 1) and be independent of Y ∼ Uniform (0, 1). Let
U = min(X, Y ) and V = max(X, Y ). In Example 6.2.4 we found ρ[U , V ]. During that
computation we showed that the marginal densities of U and V were given by
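\[
f_U(u) =
\begin{cases}
2(1-u) & \text{if } 0 < u < 1 \\
0 & \text{otherwise}
\end{cases}
\qquad \text{and} \qquad
f_V(v) =
\begin{cases}
2v & \text{if } 0 < v < 1 \\
0 & \text{otherwise.}
\end{cases}
\]
\[
f_{V \mid U = u}(v) = \frac{f(u, v)}{f_U(u)}, \qquad \text{for } v \in \mathbb{R}.
\]
So,
\[
f_{V \mid U = u}(v) =
\begin{cases}
\dfrac{1}{1-u} & \text{if } u < v < 1 \\
0 & \text{otherwise}
\end{cases}
\]
\[
E[V \mid U = u] = \int_{u}^{1} \frac{v}{1-u}\,dv = \frac{1 - u^2}{2(1-u)} = \frac{1+u}{2}.
\]
\[
\begin{aligned}
Var[V \mid U = u] &= E[V^2 \mid U = u] - (E[V \mid U = u])^2 \\
&= \int_{u}^{1} \frac{v^2}{1-u}\,dv - \left(\frac{1+u}{2}\right)^2 \\
&= \frac{1 - u^3}{3(1-u)} - \frac{(1+u)^2}{4} = \frac{(1-u)^2}{12}.
\end{aligned}
\]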
We could have also concluded these from properties of Uniform distribution computed in
Example 6.1.2 and Example 6.1.12. We will use this approach in the next example. ■
Example 6.2.9. Let (X, Y ) have joint probability density function f given by
\[
f(x, y) = \frac{\sqrt{3}}{4\pi}\, e^{-\frac{1}{2}(x^2 - xy + y^2)}, \qquad -\infty < x, y < \infty.
\]
These random variables were considered in Example 5.4.12. We showed there that X is
a Normal random variable with mean 0 and variance 4/3 and Y is also a Normal random
variable with mean 0 and variance 4/3.
We observed that they are not independent as well
and the conditional distribution of Y given X = x was Normal with mean x/2 and variance
1. Either by direct computation or by definition we observe that
\[
E[Y \mid X = x] = \frac{x}{2}, \qquad Var[Y \mid X = x] = 1.
\]
\[
\begin{aligned}
Var[Y] &= Var[E[Y \mid X]] + E[Var[Y \mid X = x]] \\
&= Var\left[\frac{X}{2}\right] + E[1] \\
&= \frac{1}{4} Var[X] + 1 = \frac{1}{4} \cdot \frac{4}{3} + 1 = \frac{4}{3}.
\end{aligned}
\]
exercises
Ex. 6.2.1. Let (X, Y ) be uniformly distributed on the triangle 0 < x < y < 1.
Ex. 6.2.2. X is a random variable with mean 3 and variance 2. Y is a random variable
with mean −1 and variance 6. The covariance of X and Y is −2. Let U = X + Y and
V = X − Y . Find the correlation coefficient of U and V .
Ex. 6.2.3. Suppose X and Y are both uniformly distributed on [0, 1]. Suppose Cov [X, Y ] =
−1/24. Compute the variance of X + Y .
Ex. 6.2.4. A dice game between two people is played by a pair of dice being thrown. One
of the dice is green and the other is white. If the green die is larger than the white die,
player number one earns a number of points equal to the value on the green die. If the
green die is less than or equal to the white die, then player number two earns a number
of points equal to the value of the green die. Let X be the random variable representing
the number of points earned by player one after one throw. Let Y be the random variable
representing the number of points earned by player two after one throw.
(b) Without explicitly computing it, would you expect Cov [X, Y ] to be positive or
negative? Explain.
Ex. 6.2.6. Let (X, Y ) have the joint probability density function f : R2 → R given by
Ex. 6.2.7. Suppose Y is uniformly distributed on (0, 1), and suppose for 0 < y < 1 the
conditional density of X | Y = y is given by
\[
f_{X \mid Y = y}(x) =
\begin{cases}
\dfrac{2x}{y^2} & \text{if } 0 < x < y \\
0 & \text{otherwise.}
\end{cases}
\]
(b) Compute the joint p.d.f. of (X, Y ) and the marginal density of X.
(c) Compute the expected value and variance of X given that Y = y, with 0 < y < 1.
Ex. 6.2.8. Let (X, Y ) have joint probability density function f : R2 → R. Show that
V ar [X | Y = y ] = E [X 2 | Y = y ] − (E [X | Y = y ])2 .
(a) E [X ] and E [Y ]
(b) V ar [X ] and V ar [Y ]
Ex. 6.2.10. From Example 5.4.12, let (X, Y ) have joint probability density function f
given by
\[
f(x, y) = \frac{\sqrt{3}}{4\pi}\, e^{-\frac{1}{2}(x^2 - xy + y^2)}, \qquad -\infty < x, y < \infty.
\]
Find
(a) E [X ] and E [Y ]
(b) V ar [X ] and V ar [Y ]
Ex. 6.2.11. From Example 5.4.13, suppose T = {(x, y ) | 0 < x < y < 4} and let (X, Y ) ∼
Uniform (T ). Find
(a) E [X ] and E [Y ]
(b) V ar [X ] and V ar [Y ]
Ex. 6.2.12. From Example 5.4.9, consider the open disk in R2 given by C = {(x, y ) :
x² + y² < 25} and let | C | = 25π denote its area. Let (X, Y ) have a joint density f : R2 → R
given by
\[
f(x, y) =
\begin{cases}
\dfrac{1}{|C|} & \text{if } (x, y) \in C \\
0 & \text{otherwise.}
\end{cases}
\]
Find
(a) E [X ] and E [Y ]
(b) V ar [X ] and V ar [Y ]
Ex. 6.2.13. Using the hints provided below prove the respective parts of Theorem 6.2.2
(a) Use the linearity properties of the expected value from Theorem 6.1.8.
(e) Use part (a) of this problem and part (f) of Theorem ??.
(f) Use the linearity properties of the expected value from Theorem 6.1.8.
(g) Use the linearity properties of the expected value from Theorem 6.1.8.
Ex. 6.2.14. Let X, Y be continuous random variable with piecewise continuous densities
f (x) and g (y ) and well-defined expected values. Suppose X ≤ Y then show that E [X ] ≤
E [Y ].
Compute E [Y |X = 12 ].
Ex. 6.2.16. Let (X, Y ) be random variables with joint probability density function
f : R2 → R. Assume that both random variables have finite variances and that their
covariance is also finite.
(b) Show that when X and Y are positively correlated (i.e. ρ[X, Y ] > 0) then V ar [X +
Y ] > V ar [X ] + V ar [Y ], while when X and Y are negatively correlated (i.e. ρ[X, Y ] <
0), then V ar [X + Y ] < V ar [X ] + V ar [Y ].
We have already seen that the distribution of a discrete random variable or a continuous
random variable is determined by its distribution function. In this section we shall discuss
the concept of moment generating functions. Under suitable assumptions, these functions
will determine the distribution of random variables. They also serve as tools in
computations and come in handy for convergence concepts that we will discuss.
The moment generating function generates or determines the moments, which in turn,
under suitable hypotheses, determine the distribution of the corresponding random variable.
We begin with a definition of a moment.
is known as the “k-th moment of X”. As before the existence of a given moment is
determined by whether the above expectation exists or not.
We have previously seen many computations of the first moment E [X ] and also seen
that the second moment E [X 2 ] is related to the variance of the random variable. The
next theorem states that if a moment exists then it guarantees the existence of all lesser
moments.
= 1 + E [|X k |] < ∞
Therefore E [X j ] exists and is finite. See Exercise 6.3.7 when X is a discrete random
variable. ■
When a random variable has finite moments for all positive integers, then these moments
provide a great deal of information about the random variable itself. In fact, in some
cases, these moments serve to completely describe the distribution of the random variable.
One way to simultaneously describe all moments of such a variable in terms of a single
expression is through the use of a “moment generating function”.
M (t) = E [etX ],
The notation MX (t) will also be used when clarification is needed as to which variable
a particular moment generating function belongs. Note that M (0) = 1 will always be true,
but for other values of t, there is no guarantee that the function is even defined as the
expected value might be infinite. However, when M (t) has derivatives defined at zero,
these values incorporate information about the moments of X. For a discrete random
variable X : S → T with T = {xi : i ∈ N}, then for t ∈ D (as in Definition 6.3.3)
\[
M_X(t) = \sum_{i \ge 1} e^{t x_i} P(X = x_i).
\]
We compute moment generating function for a Poisson (λ) and a Gamma (n, λ), with
n ∈ N, λ > 0.
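\[
M_X(t) = \sum_{k=0}^{\infty} e^{tk} P(X = k) = \sum_{k=0}^{\infty} e^{tk}\, \frac{\lambda^k e^{-\lambda}}{k!} = e^{-\lambda} \sum_{k=0}^{\infty} \frac{(e^t \lambda)^k}{k!} = e^{-\lambda} e^{e^t \lambda} = e^{-\lambda(1 - e^t)}.
\]
So the moment generating function of X exists for all t ∈ R. Suppose Y ∼ Gamma (n, λ);
then for t < λ,
\[
M_Y(t) = \int_{0}^{\infty} e^{ty}\, \frac{\lambda^n}{\Gamma(n)}\, y^{n-1} e^{-\lambda y}\,dy = \frac{\lambda^n}{\Gamma(n)} \int_{0}^{\infty} y^{n-1} e^{-(\lambda - t)y}\,dy = \frac{\lambda^n}{\Gamma(n)} \cdot \frac{\Gamma(n)}{(\lambda - t)^n} = \left(\frac{\lambda}{\lambda - t}\right)^{n},
\]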
where we have used (5.5.3). The moment generating function of Y will not be finite if
t ≥ λ. ■
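Both closed forms can be checked numerically: for the Poisson variable, by summing e^{tk} P(X = k) over a long but finite range of k, and for the Gamma variable, by integrating e^{ty} f(y). The sketch below compares these with e^{−λ(1 − e^t)} and (λ/(λ − t))^n. The parameter values, the truncation point 200 and the variable names are illustrative assumptions.

```r
t <- 0.3                                   # evaluation point, must satisfy t < rate below

# Poisson(lambda): truncated sum of exp(t k) P(X = k)
lambda <- 2
k <- 0:200                                 # truncation point chosen generously
mgf_poisson_sum     <- sum(exp(t * k) * dpois(k, lambda))
mgf_poisson_formula <- exp(-lambda * (1 - exp(t)))

# Gamma(n, rate): numerical integral of exp(t y) f(y) over (0, Inf)
n    <- 3
rate <- 1.5
mgf_gamma_integral <- integrate(
  function(y) exp(t * y) * dgamma(y, shape = n, rate = rate),
  lower = 0, upper = Inf
)$value
mgf_gamma_formula  <- (rate / (rate - t))^n

print(c(mgf_poisson_sum, mgf_poisson_formula))
print(c(mgf_gamma_integral, mgf_gamma_formula))
```

Choosing t ≥ rate in the Gamma case would make the integrand fail to decay, consistent with the moment generating function not being finite there.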
We summarily compile some facts about moment generating functions. The proof of
some of the results are beyond the scope of this text.
Theorem 6.3.5. Suppose for a random variable X, there exists δ > 0 such that
MX (t) exists for t ∈ (−δ, δ ).
\[
E[X^k] = M_X^{(k)}(0),
\]
where $M_X^{(k)}$ denotes the k-th derivative of MX .
(c) Suppose Y is another independent random variable such that MY (t) exists for
t ∈ (−δ, δ ). Then
MX + Y ( t ) = MX ( t ) MY ( t ) .
for t ∈ (−δ, δ ).
Proof - (a) A precise proof is beyond the scope of this book. We provide a sketch. Express
etX as a power series in t.
\[
e^{tX} = 1 + tX + \frac{t^2 X^2}{2} + \cdots + \frac{t^n X^n}{n!} + \cdots
\]
The expected value of the left hand side is the moment generating function for X while
linearity may be used on the right hand side. So the power series of M (t) is given by
\[
M(t) = 1 + t \cdot E[X] + \frac{t^2}{2} \cdot E[X^2] + \cdots + \frac{t^n}{n!} \cdot E[X^n] + \cdots
\]
Taking k derivatives of both sides of the equation (which is valid in the interval of
convergence) yields
\[
M^{(k)}(t) = E[X^k] + t \cdot E[X^{k+1}] + \frac{t^2}{2} \cdot E[X^{k+2}] + \cdots
\]
Finally, when evaluating both sides at t = 0 all but one term on the right hand side
vanishes and the equation becomes simply M (k) (0) = E [X k ].
■
Theorem 6.3.5 applies equally well for both discrete and continuous variables. A discrete
example is presented next.
Example 6.3.6. Let X ∼ Geometric(p). We shall find MX (t) and use this function to
calculate the expected value and variance of X. For any t < −ln(1 − p), so that e^t (1 − p) < 1,
\[
\begin{aligned}
M_X(t) = E[e^{tX}] &= \sum_{n=1}^{\infty} e^{tn} P(X = n) = \sum_{n=1}^{\infty} (e^t)^n \cdot p(1-p)^{n-1} = p e^t \cdot \sum_{n=1}^{\infty} (e^t \cdot (1-p))^{n-1} \\
&= \frac{p e^t}{1 - e^t (1-p)}.
\end{aligned}
\]
Having completed that computation, the expected value and variance can be computed
simply by calculating derivatives.
\[
M_X'(t) = \frac{p e^t}{[1 - (1-p) e^t]^2}
\]
and so $E[X] = M_X'(0) = \frac{p}{p^2} = \frac{1}{p}$. Similarly,
\[
M_X''(t) = \frac{p e^t \left(1 + (1-p) e^t\right)}{[1 - (1-p) e^t]^3}
\]
and so $E[X^2] = M_X''(0) = \frac{2p - p^2}{p^3} = \frac{2}{p^2} - \frac{1}{p}$. Therefore, $Var[X] = E[X^2] - (E[X])^2 = \frac{1-p}{p^2}$.
Both the expected value and variance are in agreement with the previous computations for
the geometric random variable.
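The derivative computations for the geometric moment generating function can be checked with finite differences at t = 0. The sketch below is only a numerical sanity check; the value p = 0.3 and the step size h are arbitrary choices.

```r
p <- 0.3                                     # illustrative success probability

# Moment generating function of Geometric(p), valid for t < -log(1 - p)
mgf <- function(t) p * exp(t) / (1 - (1 - p) * exp(t))

h <- 1e-4                                    # finite-difference step

# Central differences approximate M'(0) and M''(0)
first_derivative  <- (mgf(h) - mgf(-h)) / (2 * h)
second_derivative <- (mgf(h) - 2 * mgf(0) + mgf(-h)) / h^2

mean_x <- first_derivative                   # approximates E[X] = 1 / p
var_x  <- second_derivative - first_derivative^2   # approximates (1 - p) / p^2

print(c(mean_x = mean_x, exact_mean = 1 / p))
print(c(var_x = var_x, exact_var = (1 - p) / p^2))
```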
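Let Y ∼ Normal(µ, σ²). The density of Y is $f_Y(y) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2 / 2\sigma^2}$.
For any t ∈ R,
\[
\begin{aligned}
M_Y(t) = E[e^{tY}] &= \int_{-\infty}^{\infty} e^{ty} \cdot \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y-\mu)^2 / 2\sigma^2}\,dy = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y^2 - (2\mu y + 2\sigma^2 t y) + \mu^2) / 2\sigma^2}\,dy \\
&= e^{\mu t + (1/2)\sigma^2 t^2} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(y - (\mu + \sigma^2 t))^2 / 2\sigma^2}\,dy \\
&= e^{\mu t + (1/2)\sigma^2 t^2}
\end{aligned}
\tag{6.3.1}
\]
where the integral in the final step is equal to one since it integrates the density of a
Normal(µ + σ²t, σ²) random variable. One can easily verify that $M_Y'(0) = \mu$ and
$M_Y''(0) = \mu^2 + \sigma^2$. ■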
As with the expected value and variance, moment generating functions behave well
when applied to linear combinations of independent variables (courtesy Theorem 6.3.5 (b)
and (c)).
MX (t) = MY1 +···+Yn (t) = MY1 (t) · . . . · MYn (t) = (pet + (1 − p))n .
Moment generating functions are an extraordinarily useful tool in analyzing the distri-
butions of random variables. Two particularly useful tools involve the uniqueness and limit
properties of such generating functions. Unfortunately these theorems require analysis
beyond the scope of this text to prove. We will state the uniqueness fact (unproven)
below and the limit property in Chapter 8. First we generalize the definition of moment
generating functions to pairs of random variables.
Definition 6.3.8. Suppose X and Y are random variables. Then the function
\[ M(s, t) = E[e^{sX + tY}] \]
is called the (joint) moment generating function for X and Y. The notation
$M_{X,Y}(s, t)$ will be used when confusion may arise as to which random variables are
being represented.
(a) (One variable) Suppose X and Y are random variables and MX (t) = MY (t)
in some open interval containing the origin. Then X and Y are equal in
distribution.
(b) (Two variable) Suppose (X, W ) and (Y , Z ) are pairs of random variables
and suppose MX,W (s, t) = MY ,Z (s, t) in some rectangle containing the origin.
Then (X, W ) and (Y , Z ) have the same joint distribution.
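\[
\begin{aligned}
M_Y(t) = E[e^{tY}] &= E[e^{t(X-\mu)/\sigma}] = E[e^{tX/\sigma} e^{-t\mu/\sigma}] = e^{-t\mu/\sigma} \cdot M_X\!\left(\frac{t}{\sigma}\right) \\
  &= e^{-t\mu/\sigma} \cdot e^{\mu(t/\sigma) + (1/2)\sigma^2 (t/\sigma)^2} = e^{t^2/2}.
\end{aligned}
\]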
But this expression is the moment generating function of a Normal(0, 1) random variable.
So by the uniqueness of moment generating functions, Theorem 6.3.9 (a), the distribution
of Y is Normal(0, 1). ■
Just as the joint density of a pair of random variables factors as a product of marginal
densities exactly when the variables are independent (Theorem 5.4.7), a similar result holds
for moment generating functions.
Theorem 6.3.11. Suppose (X, Y) is a pair of continuous random variables with
moment generating function M(s, t). Then X and Y are independent if and only if
\[ M(s, t) = M_X(s) \cdot M_Y(t). \]
Proof - One direction of the proof follows from basic facts about independence. If X
and Y are independent, then by Exercise 6.3.4, we have
\[ M(s, t) = E[e^{sX} e^{tY}] = E[e^{sX}] \cdot E[e^{tY}] = M_X(s) \cdot M_Y(t). \]
To prove the opposite direction, we shall use Theorem 6.3.9(b). Let X̂ and Ŷ be independent,
but have the same distributions as X and Y respectively. Since MX,Y (s, t) = MX (s)MY (t)
we have the following series of equalities:
MX,Y (s, t) = MX (s)MY (t) = MX̂ (s)MŶ (t) = MX̂,Ŷ (s, t).
By Theorem 6.3.9(b), this means that (X, Y ) and (X̂, Ŷ ) have the same distribution. This
would imply that
Example 6.3.12. Let a, b be two real numbers. Let X ∼ Normal(µ1, σ1²) and Y ∼
Normal(µ2, σ2²) be independent. Observe that
\[
M_X(at) M_Y(bt) = e^{a\mu_1 t + (1/2)a^2\sigma_1^2 t^2} \, e^{b\mu_2 t + (1/2)b^2\sigma_2^2 t^2}
                = e^{(a\mu_1 + b\mu_2)t + (1/2)(a^2\sigma_1^2 + b^2\sigma_2^2)t^2}.
\]
By independence the left-hand side equals $M_{aX+bY}(t)$, and the right-hand side is the
moment generating function of a Normal random variable with mean aµ1 + bµ2
and variance a²σ1² + b²σ2². So aX + bY ∼ Normal(aµ1 + bµ2, a²σ1² + b²σ2²). ■
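A simulation gives a quick check on the mean and variance of such a linear combination. This is a minimal sketch; the constants `a`, `b`, the parameter values, the sample size and the seed are all illustrative assumptions.

```r
# Simulation check that aX + bY has the stated mean and variance.
# All parameter values and the seed are illustrative choices.
set.seed(1)
a <- 2
b <- -1
mu1 <- 1; sigma1 <- 2
mu2 <- 3; sigma2 <- 0.5

x <- rnorm(100000, mean = mu1, sd = sigma1)
y <- rnorm(100000, mean = mu2, sd = sigma2)
w <- a * x + b * y

c(sample_mean = mean(w), theory_mean = a * mu1 + b * mu2)
c(sample_var  = var(w),  theory_var  = a^2 * sigma1^2 + b^2 * sigma2^2)
```

A call such as `qqnorm(w)` could additionally be used to judge visually whether the simulated values look normally distributed.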
Proof - This follows from the preceding example by induction and is left as an exercise.
■
exercises
Ex. 6.3.1. Let X ∼ Normal(0, 1). Use the moment generating function of X to calculate
$E[X^4]$.
(b) Use (a) to calculate $E[Y^3]$ and $E[Y^4]$, the third and fourth moments of an exponential
distribution.
Ex. 6.3.4. Let X and Y be two independent discrete random variables. Let h : R → R
and g : R → R. Show that
Show that the above holds if X and Y are independent continuous random variables.
Ex. 6.3.5. Suppose X is a discrete random variable and $D = \{t \in \mathbb{R} : E[t^X] \text{ exists}\}$. The
function ψ : D → R given by
\[ \psi(t) = E[t^X], \]
is called the probability generating function for X. Calculate the probability generating
function of X when X is
Ex. 6.3.6. Let X, Y : S → T be discrete random variables, where the number of elements in
T is finite. Prove part (a) of Theorem 6.3.9 in this case.
In Example 6.3.12, we saw that if X and Y are independent, normally distributed random
variables, any linear combination aX + bY is also normally distributed. In such a case
the joint density of (X, Y ) is determined easily (courtesy Theorem 5.4.7). We would like
to understand random variables that are not independent but have normally distributed
marginals. Motivated by the observations in Example 6.3.12 we provide the following
definition.
We need to be somewhat cautious in the above definition. Since the variables are
dependent, it may turn out that aX + bY is 0 or some other constant (e.g., Y = −X, or
Y = −X + 2, with a = 1, b = 1). We shall follow the convention that in such cases the
constant random variable c is a normal random variable with mean c and variance 0.
Theorem 6.4.2. Suppose (X, Y ) and (Z, W ) are two bivariate normal random
variables. If
\[ E[X] = E[Z] = \mu_1, \quad E[Y] = E[W] = \mu_2, \]
\[ Var[X] = Var[Z] = \sigma_1^2, \quad Var[Y] = Var[W] = \sigma_2^2 \]
and
\[ Cov[X, Y] = Cov[Z, W] = \sigma_{12}, \tag{6.4.1} \]
then (X, Y) and (Z, W) have the same joint distribution.
Proof - As (X, Y) and (Z, W) are bivariate normal random variables, given real numbers
s, t, both sX + tY and sZ + tW are normal random variables. Using (6.4.1) and the properties
of mean and covariance (see Theorem 6.2.2) we have
From the above, sX + tY and sZ + tW have the same mean and variance. So they have
the same distribution (as normal random variables are determined by their mean and
variances). By Theorem 6.3.9 (a) they have the same moment generating function. So, the
(joint) moment generating function of (X, Y ) at (s,t) is
\[ M_{X,Y}(s, t) = E[e^{sX + tY}] = M_{sX+tY}(1) = M_{sZ+tW}(1) = E[e^{sZ + tW}] = M_{Z,W}(s, t). \]
Therefore (Z, W ) has the same joint m.g.f. as (X,Y) and Theorem 6.3.9 (b) implies that
they have the same joint distribution. ■
Though, in general, two variables which are uncorrelated may not be independent, it
is a remarkable fact that the two concepts are equivalent for bivariate normal random
variables.
Proof - That independence implies a zero covariance is true for any pair of random variables
(use Theorem 6.1.10 (e)), so we need to only consider the reverse implication.
Suppose Cov[X, Y] = 0. Let $\mu_X$ and $\sigma_X^2$ denote the expected value and variance of X
and $\mu_Y$ and $\sigma_Y^2$ the corresponding values for Y. Let s and t be real numbers. Then, by
the bivariate normality of (X, Y ), we know sX + tY is normally distributed. Moreover by
properties of expected value and variance we have
[Figure: top row, three-dimensional views of the density; bottom row, contour plots; panels for ρ = −0.3, 0 and 0.7, with axes y1 and y2.]
Figure 6.1: The density function of Bivariate Normal distributions. The set of panels on top
show a three-dimensional view of the density function for various values of the
correlation ρ. The bottom set of panels show contour plots, where each ellipse
corresponds to the (y1 , y2 ) pairs corresponding to a constant value of g (y1 , y2 ).
and
\[ Var[sX + tY] = s^2 Var[X] + 2st\,Cov[X, Y] + t^2 Var[Y] = s^2\sigma_X^2 + t^2\sigma_Y^2. \]
\[
\begin{aligned}
M_{X,Y}(s, t) = E[e^{sX + tY}] = M_{sX+tY}(1)
  &= e^{(s\mu_X + t\mu_Y) + (1/2)(s^2\sigma_X^2 + t^2\sigma_Y^2)} \\
  &= e^{s\mu_X + (1/2)s^2\sigma_X^2} \cdot e^{t\mu_Y + (1/2)t^2\sigma_Y^2} \\
  &= M_X(s) \cdot M_Y(t).
\end{aligned}
\]
We conclude this section by finding the joint density of a Bivariate normal random
variable. See Figure 6.1 for a graphical display of this density.
From the discussion that follows (5.4.1), we can then conclude that the joint density of
(Y1 , Y2 ) is indeed given by g. To show (6.4.3) we find an alternate description of (Y1 , Y2 )
which is the same in distribution. Let Z1 , Z2 be two independent standard normal random
variables. Define
\[
\begin{aligned}
U &= \sigma_1 Z_1 + \mu_1 \\
V &= \sigma_2\left(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2\right) + \mu_2
\end{aligned}
\tag{6.4.4}
\]
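Let α, β ∈ R. Then
\[ \alpha U + \beta V = (\alpha\sigma_1 + \beta\sigma_2\rho) Z_1 + \left(\beta\sigma_2\sqrt{1-\rho^2}\right) Z_2 + \alpha\mu_1 + \beta\mu_2. \]
Using Corollary 5.3.3 (a), we then have that
$\alpha U + \beta V \sim \text{Normal}\left(\alpha\mu_1 + \beta\mu_2,\ (\alpha\sigma_1 + \beta\sigma_2\rho)^2 + (\beta\sigma_2\sqrt{1-\rho^2})^2\right)$.
As α, β were arbitrary real numbers, by Definition 6.4.1, (U, V) is a bivariate normal random variable, with
\[ \mu_1 = E[U], \quad \mu_2 = E[V], \quad Var[U] = \sigma_1^2. \]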
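In addition, using Exercise 6.2.16 and Theorem 6.2.2 (f), we have
\[
\begin{aligned}
Var[V] &= \sigma_2^2\rho^2\, Var[Z_1] + \sigma_2^2(1-\rho^2)\, Var[Z_2] + 2\sigma_2^2\rho\sqrt{1-\rho^2}\; Cov[Z_1, Z_2] \\
       &= \sigma_2^2\rho^2 + \sigma_2^2(1-\rho^2) + 0 = \sigma_2^2
\end{aligned}
\]
and
\[
\begin{aligned}
Cov[U, V] &= Cov\left[\sigma_1 Z_1 + \mu_1,\ \sigma_2\left(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2\right) + \mu_2\right] \\
          &= \sigma_1\sigma_2\rho\; Cov[Z_1, Z_1] + \sigma_1\sigma_2\sqrt{1-\rho^2}\; Cov[Z_1, Z_2] \\
          &= \sigma_1\sigma_2\rho + 0 = \sigma_{12}.
\end{aligned}
\]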
As bivariate normal random variables are determined by their means, variances and covariances (by Theorem
6.4.2), (Y1, Y2) and (U, V) have the same joint distribution. By the above, we have
\[ Z_1 = \frac{U - \mu_1}{\sigma_1}, \qquad Z_2 = \frac{V - \mu_2}{\sigma_2\sqrt{1-\rho^2}} - \frac{\rho Z_1}{\sqrt{1-\rho^2}}. \]
So
\[ \{U \le a,\ V \le b\} = \left\{ Z_1 \le \frac{a - \mu_1}{\sigma_1},\ Z_2 \le \frac{b - \mu_2}{\sigma_2\sqrt{1-\rho^2}} - \frac{\rho Z_1}{\sqrt{1-\rho^2}} \right\} \]
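\[ z_2 = \frac{y_2 - \mu_2}{\sigma_2\sqrt{1-\rho^2}} - \frac{\rho z_1}{\sqrt{1-\rho^2}} \]
\[
P(Y_1 \le a, Y_2 \le b) = \int_{-\infty}^{\frac{a-\mu_1}{\sigma_1}} \int_{-\infty}^{b}
\frac{\exp\left(-\frac{1}{2(1-\rho^2)}\left[z_1^2 + \left(\frac{y_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{y_2-\mu_2}{\sigma_2}\right) z_1\right]\right)}{2\pi\sigma_2\sqrt{1-\rho^2}}
\, dy_2\, dz_1.
\tag{6.4.7}
\]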
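Performing the substitution
\[ z_1 = \frac{y_1 - \mu_1}{\sigma_1} \]
on the outer integral above we obtain
\[
P(Y_1 \le a, Y_2 \le b) = \int_{-\infty}^{a} \int_{-\infty}^{b}
\frac{\exp\left(-\frac{1}{2(1-\rho^2)}\left[\left(\frac{y_1-\mu_1}{\sigma_1}\right)^2 + \left(\frac{y_2-\mu_2}{\sigma_2}\right)^2 - 2\rho\left(\frac{y_1-\mu_1}{\sigma_1}\right)\left(\frac{y_2-\mu_2}{\sigma_2}\right)\right]\right)}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}
\, dy_2\, dy_1.
\]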
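The construction (6.4.4) also gives a direct way to simulate a bivariate normal pair from two independent standard normal samples. The following is a minimal sketch under assumed parameter values (`mu1`, `mu2`, `sigma1`, `sigma2`, `rho`, the sample size and the seed are all illustrative); it uses rnorm for the standard normal samples and compares sample moments with the values derived above.

```r
# Simulate (U, V) from the construction (6.4.4) and inspect sample moments.
# Parameter values and the seed are illustrative choices.
set.seed(1)
n <- 100000
mu1 <- 1;    mu2 <- -2
sigma1 <- 2; sigma2 <- 3
rho <- 0.7

z1 <- rnorm(n)                 # independent standard normal samples
z2 <- rnorm(n)
u <- sigma1 * z1 + mu1
v <- sigma2 * (rho * z1 + sqrt(1 - rho^2) * z2) + mu2

c(mean_u = mean(u), mean_v = mean(v))                   # near mu1 and mu2
c(var_u = var(u), var_v = var(v))                       # near sigma1^2 and sigma2^2
c(cov_uv = cov(u, v), theory = sigma1 * sigma2 * rho)   # near sigma1 * sigma2 * rho
```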
exercises
Ex. 6.4.1. Let X1 , X2 be two independent Normal random variables with mean 0 and variance 1.
Show that (X1 , X2 ) is a bivariate normal random variable.
Ex. 6.4.2. Let (X1 , X2 ) be a bivariate normal random variable. Assume that the correlation
coefficient |ρ[X1 , X2 ]| ̸= 1. Show that X1 and X2 are Normal random variables by calculating their
marginal densities.
Ex. 6.4.3. Let X1 , X2 be two independent normal random variables with mean 0 and variance
1. Let (Y1 , Y2 ) be a bivariate normal random variable with zero means, variances equal to 1 and
correlation ρ = ρ[Y1 , Y2 ], with ρ2 ̸= 1. Let f be the joint probability density function of (X1 , X2 )
and g be the joint probability density function of (Y1 , Y2 ). For 0 < α < 1, let (Z1 , Z2 ) be a bivariate
random variable with joint density given by
(c) Show that Z1 and Z2 are Normal random variables by calculating their marginal densities.
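\[ \Sigma = \begin{bmatrix} Cov[X_1, X_1] & Cov[X_1, X_2] \\ Cov[X_1, X_2] & Cov[X_2, X_2] \end{bmatrix} \]
and $\mu_1 = E[X_1]$, $\mu_2 = E[X_2]$, $\mu_{2\times 1} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}$.
Σ is referred to as the covariance matrix of (X1, X2) and µ is the mean matrix of (X1, X2).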
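(b) Show that the joint density of (X1, X2) can be rewritten in matrix notation as
\[
g(x_1, x_2) = \frac{1}{2\pi\sqrt{\det(\Sigma)}} \exp\left( -\frac{1}{2}
\begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \Sigma^{-1}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \right)
\]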
Then (Y1, Y2) is also a bivariate Normal random variable, with covariance matrix $A \Sigma A^T$ and
mean matrix Aµ + η.
Hint: Compute means, variances and covariances of Y1 , Y2 and use Theorem 6.4.2
The distinction between Probability and Statistics is somewhat blurred, but largely has to do with
the perspective of what is known versus what is to be determined. One may think of Probability as
the study of models for (random) experiments when the model is fully known. When the model is
not fully known and one tries to infer about the unknown aspects of the model based on observed
outcomes of the experiment, this is where Statistics enters the picture. In this chapter we will be
interested in problems where we assume we know the outputs of random variables, and wish to use
that information to say what we can about their (unknown) distributions.
Suppose, for instance, we sample from a large population and record a numerical fact associated
with each selection. This may be recording the heights of people, recording the arsenic content
of water samples, recording the diameters of randomly selected trees, or anything else that may
be thought of as repeated, random measurements. Sampling an individual from a population in
this case may be viewed as a random experiment. If the sampling were done at random with
replacement with each selection independent of any other, we could view the resulting numerical
measurements as i.i.d. random variables X1 , X2 , . . . , Xn . A more common situation is sampling
without replacement, but we have previously seen (see Section 2.3) that when the sample size is
small relative to the size of the population, the two sampling methods are not dramatically different.
In this case we have the results of n samples from a distribution, but we do not actually know the
distribution itself. How might we use the samples to attempt to predict or “infer” such things as
expected value and variance?
A natural quantity we can create from the observed data, regardless of the underlying distribution
that generated it, is a discrete distribution that puts equal probability on each observed point. This
distribution is known as the empirical distribution. Inferences based on the empirical distribution
are traditionally referred to as “descriptive statistics”. In later chapters, we will see that making
additional assumptions lets us make “better” inferences, provided the additional assumptions are
valid.
We will assume that the random variables X1 , X2 , . . . , Xn are i.i.d. from some common
distribution, usually unknown. Some values of Xi can of course be repeated, so the empirical
distribution (and the empirical cumulative distribution function) is formally defined as follows.
\[ F_n(x) = \frac{|\{i : X_i \le x\}|}{n}, \]
is known as the “empirical cumulative distribution function” or ECDF of X1 , X2 , . . . , Xn .
Given a realisation of X1, X2, ..., Xn, the ECDF is easy to compute and provides information about
the underlying distribution. One can also show that as n → ∞ the ECDF converges to the
underlying distribution function.
Example 7.1.2. Suppose we surveyed 10 random people and asked them how many litres of water
they consume in a day. Suppose the data collected was the following:
3 4 2 5 2 4 4 6 3 4
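We can compute the empirical probability mass function and the empirical cumulative distribution
function. That is,
\[
f_{10}(t) = \begin{cases}
\frac{2}{10} & \text{if } t = 2, \\
\frac{2}{10} & \text{if } t = 3, \\
\frac{4}{10} & \text{if } t = 4, \\
\frac{1}{10} & \text{if } t = 5, \\
\frac{1}{10} & \text{if } t = 6, \\
0 & \text{otherwise}
\end{cases}
\qquad \text{and} \qquad
F_{10}(x) = \begin{cases}
0 & \text{if } x < 2, \\
\frac{2}{10} & \text{if } 2 \le x < 3, \\
\frac{4}{10} & \text{if } 3 \le x < 4, \\
\frac{8}{10} & \text{if } 4 \le x < 5, \\
\frac{9}{10} & \text{if } 5 \le x < 6, \\
1 & \text{if } 6 \le x.
\end{cases}
\]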
R has a built-in function called ecdf which will compute the empirical cumulative distribution
function given the data, with options for plotting as indicated below.
x <- c(3, 4, 2, 5, 2, 4, 4, 6, 3, 4)
Fn <- ecdf(x)   # ecdf() returns a function; the name Fn avoids masking F (FALSE)
plot(Fn)
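The empirical probability mass function and individual values of the ECDF in Example 7.1.2 can be reproduced as follows. This is a small sketch using only base R; the points at which the ECDF is evaluated are arbitrary illustrative choices.

```r
# Empirical pmf and ECDF for the water consumption data of Example 7.1.2.
x <- c(3, 4, 2, 5, 2, 4, 4, 6, 3, 4)

pmf <- table(x) / length(x)   # proportion of observations at each distinct value
pmf

Fn <- ecdf(x)                 # ecdf() returns a function
Fn(c(1.5, 2, 3.5, 4, 6))      # ECDF evaluated at a few points
```

Because ecdf returns a function, the resulting object can be evaluated at any real number, not only at the observed data points.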
Note that the empirical distribution is a random object, as it is defined in terms of random
variables. However, for any fixed realisation of these random variables X1 , X2 , . . . , Xn , the corre-
sponding empirical distribution is a fixed probability distribution, so we can now study it using
the tools of probability. Doing so does not make any additional assumptions about the underlying
distribution.
It is important to realize that the empirical distribution is itself a random quantity, as each
sample realisation will produce a different discrete distribution. We intuitively expect it to carry
information about the underlying distribution, especially as the sample size n grows. For example,
the expectation computed from the empirical distribution should be closely related to the true
underlying expectation, probabilities of events computed from the empirical distribution should be
related to the true probabilities of those events, and so on. In the remainder of this chapter, we
will make this intuition more precise and describe some tools to investigate the properties of the
empirical distribution.
It is easy to see that X̄ is the expected value of a random variable whose distribution is the empirical
distribution based on X1, X2, ..., Xn (see Exercise 7.1.5). Suppose the Xj random variables have a
finite expected value µ. The sample mean X̄ is not the same as this expected value. In particular
µ is a fixed constant while X̄ is a random variable. From the statistical perspective, µ is usually
assumed to be an unknown quantity while X̄ is something that may be computed from the results
of the sample X1, X2, ..., Xn. The next theorem is a first step in answering how well X̄ works
as an estimate of µ.
The fact that E[X̄] = µ implies that, on average, the quantity X̄ is accurately describing the
unknown mean µ. In the language of statistics X̄ is said to be an “unbiased estimator” of the
quantity µ. Note also that SD[X̄] → 0 as n → ∞, meaning that the larger the sample size, the
more accurately X̄ reflects its average of µ. In other words, if there is an unknown distribution
from which it is possible to sample, averaging a large sample should produce a value close to the
expected value of the distribution. In technical terms, this is a notion of consistency,
and we say that the sample mean is a “consistent estimator” of the population mean µ.
Given a sample of observations from a given distribution one may try to estimate the variance
of the distribution via the sample variance which we define below.
Note that this definition is not universal; it is common to define sample variance with n (instead
of n − 1) in the denominator, in which case the definition matches the variance of the empirical
distribution of X1 , X2 , . . . , Xn (Exercise 7.1.5). The definition given here produces a quantity that
is unbiased for the underlying population variance, a fact that follows from the next theorem.
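But $X_1 + X_2 + \dots + X_n = n\bar{X}$, so
\[
\begin{aligned}
E[(n-1)S^2] &= E[X_1^2 + X_2^2 + \dots + X_n^2] - 2nE[\bar{X}^2] + nE[\bar{X}^2] \\
            &= E[X_1^2 + X_2^2 + \dots + X_n^2] - nE[\bar{X}^2] \\
            &= n(\sigma^2 + \mu^2) - n\left(\frac{\sigma^2}{n} + \mu^2\right) = (n-1)\sigma^2.
\end{aligned}
\]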
A more important property (than unbiasedness) is that S² and its variant with n in the denominator
are both “consistent” for σ², just as X̄ was for µ, in the sense that Var[S²] → 0 as n → ∞ under
some mild conditions (see Exercise 7.1.7). One may also try to estimate σ from S, but due to the
vagaries of averaging (in turn, expectation) one will typically lose the unbiasedness property (see
Exercise 7.1.8).
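The difference between the two versions of the sample variance can be seen in a small simulation. The sketch below is an illustration only: the Normal population with variance 4, the sample size n = 5, the number of replications and the seed are assumptions. Note that R's var function uses n − 1 in the denominator.

```r
# Compare the two versions of the sample variance by simulation.
# The Normal population with variance 4, n = 5 and the seed are illustrative.
set.seed(1)
n <- 5
sigma2 <- 4

est <- replicate(20000, {
  x <- rnorm(n, mean = 0, sd = sqrt(sigma2))
  c(divide_by_n_minus_1 = var(x),               # var() divides by n - 1
    divide_by_n         = var(x) * (n - 1) / n)
})

rowMeans(est)   # compare with sigma2 = 4 and (n - 1) / n * sigma2 = 3.2
```

The first average should be close to σ², while the second should be close to (n − 1)σ²/n, which illustrates the bias of the version with n in the denominator for small n.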
Expectation and variance are commonly used summaries of a random variable, but they do not
characterize its distribution completely. In the next subsection we shall see how to use the idea of
sample proportion to understand the underlying distribution better.
In general, the distribution of a random variable X is fully known if we can compute P (X ∈ A) for
any event A. On the other hand if the distribution is not known and we have an event A of interest
then we can use the empirical distribution to estimate the probability P (X ∈ A).
Given a sample of i.i.d. observations X1, X2, ..., Xn from a common distribution defined by a
random variable X, let Y be a random variable whose distribution is the empirical
distribution based on the sample. More precisely,
\[ \text{Range}(Y) = \{X_1, X_2, \dots, X_n\} \quad \text{and} \quad P(Y = t) = \frac{|\{i : X_i = t\}|}{n}, \text{ for } t \in \text{Range}(Y). \]
\[ P(Y \in A) = \frac{|\{i : X_i \in A\}|}{n}. \]
In other words, P (Y ∈ A) is simply the proportion of sample observations for which the event A
happened. Not surprisingly, P (Y ∈ A) is a good estimator of P (X ∈ A).
\[ \hat{p}_n = \frac{|\{i : X_i \in A\}|}{n}. \]
and Zi ’s are independent because Xi ’s are independent (See Theorem 3.3.6 and Exercise 7.1.2) and
identically distributed with
P (Zi = 1) = P (Xi ∈ A) = p.
Thus, $\sum_{i=1}^{n} Z_i$ has the Binomial distribution with parameters n and p, with expectation np and
variance np(1 − p). It is immediate that
\[ E[\hat{p}_n] = E\left[\frac{\sum_{i=1}^{n} Z_i}{n}\right] = \frac{1}{n} E\left[\sum_{i=1}^{n} Z_i\right] = p \]
and
\[ Var[\hat{p}_n] = Var\left[\frac{\sum_{i=1}^{n} Z_i}{n}\right] = \frac{1}{n^2} Var\left[\sum_{i=1}^{n} Z_i\right] = p(1-p)/n. \tag{7.1.1} \]
This result is a special case of the more general “law of large numbers” we will encounter in Section
8.2. It is important because it gives formal credence to our intuition that the probability of an
event measures the limiting relative frequency of that event over repeated trials of an experiment.
Example 7.1.8. Suppose that U and V are independent Uniform(0, 1) random variables, and we interpret (U, V) as
coordinates of a point in R². Let A be the event that the point (U, V) is inside the unit circle
replicate(10, {
  u <- runif(10000)
  v <- runif(10000)
  z <- sqrt(u^2 + v^2)    # distance of the point (u, v) from the origin
  sum(z < 1) / 10000      # proportion of points inside the unit circle
})
[1] 0.7820 0.7791 0.7882 0.7834 0.7888 0.7802 0.7861 0.7816 0.7813
[10] 0.7872
We can see that with n = 10000 our estimates p̂n are very close to p. A little
thought tells us that the true probability is P(Z < 1) = π/4 ≈ 0.7854. The simulation experiment we
have performed above is in fact one way of estimating π, although it is not a particularly efficient
one. We illustrate this below by repeating the experiment with 1000000 trials and multiplying the
observed sample proportion by 4. Note that in this experiment z < 1 ⇐⇒ z² < 1, so calculating
the square root is unnecessary.
u <- runif(1000000)
v <- runif(1000000)
zsq <- u^2 + v^2       # squared distance; the square root is not needed
4 * mean(zsq < 1)      # four times the sample proportion estimates pi
[1] 3.13966
As the variance of the sample proportion p̂ is given by $\frac{1}{n}p(1-p)$ (see (7.1.1)), increasing the
number of replications by a factor of 100 (from $10^4$ to $10^6$) leads to an improvement in the accuracy
(in terms of standard deviation) of the estimate of π by a factor of $\sqrt{100} = 10$. ■
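The size of this improvement can be worked out directly from (7.1.1). The short sketch below evaluates the standard deviation of the estimate 4p̂ of π for the two sample sizes, taking p = π/4; the variable names are illustrative.

```r
# Standard deviation of the estimate 4 * p_hat of pi, using (7.1.1).
p <- pi / 4
n <- c(1e4, 1e6)

sd_pi_hat <- 4 * sqrt(p * (1 - p) / n)
sd_pi_hat
sd_pi_hat[1] / sd_pi_hat[2]   # ratio of the two standard deviations
```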
Example 7.1.9. Suppose A, B, C are independent Poisson random variables with
parameters α, β and γ respectively. What is the probability that the equation Ax² − Bx + C = 0
has a real solution? To answer this question one would have to calculate the probability that
B² − 4AC ≥ 0. That would mean evaluating
\[
\sum_{a=0}^{\infty} \sum_{b=0}^{\infty} \sum_{c=0}^{\infty}
1(b^2 - 4ac \ge 0)\, \exp(-\alpha - \beta - \gamma)\, \frac{\alpha^a \beta^b \gamma^c}{a!\, b!\, c!},
\]
which would require some combinatorial effort. However we can use the strong law of large numbers
and try to estimate the number via simulations.
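A simulation of this kind might look as follows. This is a minimal sketch under assumptions: the parameter values (`lam_a`, `lam_b`, `lam_c`), the number of trials and the seed are illustrative choices, and `disc` is simply a name used here for the simulated discriminant B² − 4AC.

```r
# Monte Carlo estimate of P(B^2 - 4AC >= 0) for independent Poisson A, B, C.
# The parameter values, the number of trials and the seed are illustrative.
set.seed(1)
n <- 100000
lam_a <- 1; lam_b <- 5; lam_c <- 1

a_obs <- rpois(n, lambda = lam_a)
b_obs <- rpois(n, lambda = lam_b)
c_obs <- rpois(n, lambda = lam_c)
disc  <- b_obs^2 - 4 * a_obs * c_obs    # simulated discriminants

p_hat <- mean(disc >= 0)                # sample proportion of real-root cases
p_hat
sqrt(p_hat * (1 - p_hat) / n)           # estimated standard deviation, from (7.1.1)
```

A histogram of the simulated discriminants, for example via `hist(disc)`, shows how their distribution changes with the chosen parameters.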
[Figure: two histograms of D, each produced by hist(D), with D on the horizontal axis and Frequency on the vertical axis.]
exercises
13 40 23 15 21 4 44 16 32 14
(a) Compute the probability mass function of the empirical distribution from the data and also
the corresponding ECDF.
Ex. 7.1.2. Verify that the proofs of Theorem 3.3.5 and Theorem 3.3.6 hold for continuous random
variables.
Ex. 7.1.3. Let X and Y be two continuous random variables having the same distribution. Let
f : R → R be a piecewise continuous function. Then show that f (X ) and f (Y ) have the same
distribution.
Ex. 7.1.4. Let X and Y be two discrete variables having the same distribution. Let f : R → R be
a piecewise continuous function. Then show that f (X ) and f (Y ) have the same distribution.
Ex. 7.1.5. Let P be the empirical distribution defined by sample observations X1 , X2 , . . . , Xn . In
other words, P is the discrete distribution with probability mass function given in Definition 7.1.1.
Let Y be a random variable with distribution P .
(a) Show that E[Y] = X̄.
(b) Show that $Var[Y] = \frac{n-1}{n} S^2$.
Ex. 7.1.6. Suppose that U and V are independent Uniform(0, 1), and we interpret (U , V ) as
coordinates of a point in R2 .
1. Let $Z = \sqrt{U^2 + V^2}$. Find p := P(Z < 1).
2. Can you modify the above R-code given in Example 7.1.8 to provide an estimate for π ?
3. Can you modify the above R-code to observe that the variance of the estimator of p goes to
0?
Ex. 7.1.7. Let X1, X2, ..., Xn be i.i.d. random variables with finite expectation µ, finite variance σ²,
and finite $\gamma = E[(X_1 - \mu)^4]$. Compute Var(S²) in terms of µ, σ², and γ and show that Var(S²) → 0
as n → ∞.
Ex. 7.1.8. Let X1, X2, ..., Xn be i.i.d. random variables with finite expectation µ and finite
variance σ². Let $S = \sqrt{S^2}$, the non-negative root of the sample variance. The quantity S is called
the “sample standard deviation”. Although E [S 2 ] = σ 2 , it is not true that E [S ] = σ. In other
words, S is not an unbiased estimator for σ. Follow the steps below to see why.
(a) Let Z be a random variable with finite mean and finite variance. Prove that E [Z 2 ] ≥ E [Z ]2
and give an example to show that equality may not hold. (Hint: Consider how these quantities
relate to the variance of Z).
(b) Use (a) to explain why E [S ] ≤ σ and give an example to show that equality may not hold.
7.2 simulation
The preceding discussion gives several mathematical statements about random samples, but it is
difficult to develop any intuition about what these statements mean unless we look at actual data.
Data is of course abundant in our world; however, the problem with real data is that we do not
usually know for certain the random variable that generated it. To hone our intuition, it is therefore
useful to be able to generate random samples from a distribution we specify. The process of doing
so using a computer program is known as “simulation”.
Simulation is not an easy task, because computers are by nature not random. Simulation is
in fact not a random process at all; it is a completely deterministic process that tries to mimic
randomness. We will not go into how simulation is done, but simply use R to obtain simulated
random samples.
R supports simulation from many distributions, including all the ones we have encountered.
The general pattern of usage is that each distribution has a corresponding function that is called
with the sample size as an argument, and further arguments specifying parameters. The function
returns the simulated observations as a vector. For example, 30 Binomial(100, 0.75) samples can be
generated by
rbinom(30, size = 100, prob = 0.75)
[1] 63 72 77 75 82 73 69 78 68 76 87 67 75 73 68 64 71 74 65 79 72 79
[23] 76 72 70 74 72 69 74 72
We usually want to do more than just print simulated data, so we typically store the result in
a variable and make further calculations with it; for example, compute the sample mean, or the
sample proportion of cases where a particular event happens.
[1] 75.66667
[1] 0.5666667
R has a useful function called replicate that allows us to repeat such an experiment several
times.
replicate(15, {
x <- rbinom(30, size = 100, prob = 0.75)
mean(x)
})
replicate(15, {
x <- rbinom(30, size = 100, prob = 0.75)
sum(x >= 75) / length(x)
})
This gives us an idea of the variability of the sample mean and sample proportion computed from a
sample of size 30. We know of course that the sample mean has expectation 100 × 0.75 = 75, and
we can use R to compute the expected value of the proportion as follows.
[1] 0.5534708
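The expected value of the sample proportion here is P(X ≥ 75) for X ∼ Binomial(100, 0.75). One way such a probability could be computed in R is sketched below; the use of pbinom and dbinom is an assumption about the approach, and the result can be compared with the value printed above.

```r
# P(X >= 75) for X ~ Binomial(100, 0.75): the expected value of the
# sample proportion of observations that are at least 75.
p <- pbinom(74, size = 100, prob = 0.75, lower.tail = FALSE)
p

# Equivalent computation by summing the probability mass function.
p_sum <- sum(dbinom(75:100, size = 100, prob = 0.75))
p_sum
```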
So the corresponding estimates are close to the expected values, but with some variability. We
expect the variability to go down if the sample size increases, say, from 30 to 3000.
replicate(15, {
x <- rbinom(3000, size = 100, prob = 0.75)
mean(x)
})
replicate(15, {
x <- rbinom(3000, size = 100, prob = 0.75)
sum(x >= 75) / length(x)
})
Indeed we see that the estimates are much closer to their expected values now.
We can of course repeat this process for other events of interest, and indeed for many other
distributions. We will see in the next section how we can simulate observations following the normal
distribution using the function rnorm, and the exponential distribution using the function rexp. It is
also interesting to think about how one can simulate observations from a given distribution when a
function to do so is not already available.
Recall from Lemma 5.3.7 that if U ∼ Uniform(0, 1) and X is a
continuous random variable whose distribution function $F_X$ is strictly increasing and continuous,
then $Y = F_X^{-1}(U)$ has the same distribution as X. This approach can be used to
simulate distributions (both discrete and continuous) from Uniform samples. The following examples
explore this approach. We begin with an example on how to simulate Poisson samples from Uniform samples.
Example 7.2.1. When trying to formulate a method to simulate random variables from a new
distribution, it is customary to assume that we already have a method to generate random variables
from Uniform(0, 1). Let us see how this can be used to generate random observations from a Poisson(λ)
distribution using its probability mass function.
Let X denote an observation from the Poisson(λ) distribution, and U ∼ Uniform(0, 1). Denote
pi = P (X = i). An algorithm to generate a random variable with the same distribution as X is
suggested by the following observation.
\[
\begin{aligned}
p_0 &= P(U \le p_0), \\
P(U \le p_0 + p_1) = p_0 + p_1 &\Rightarrow p_1 = P(p_0 < U < p_0 + p_1), \\
P(U \le p_0 + p_1 + p_2) = p_0 + p_1 + p_2 &\Rightarrow p_2 = P(p_0 + p_1 < U < p_0 + p_1 + p_2),
\end{aligned}
\]
and so on. Thus, if we set Y to be 0 if $U \le p_0$, and k if U satisfies
$\sum_{i=0}^{k-1} p_i < U < \sum_{i=0}^{k} p_i$, then Y has
the same distribution as X. To use this idea to generate 50 observations from Poisson(5), we can
use the following code in R, noting that $\sum_{i=0}^{k} p_i = P(X \le k)$.
replicate(50,
{
U <- runif(1)
Y <- 0
while (U > ppois(Y, lambda = 5)) Y <- Y + 1
Y
})
[1] 5 3 3 5 5 3 4 4 4 6 6 8 10 2 10 9 4 5 5 9 2 9
[23] 2 6 4 5 7 3 4 6 8 3 4 2 4 6 4 2 3 0 4 5 5 5
[45] 4 9 7 6 3 4
Of course, there is nothing in this procedure that is specific to the Poisson distribution. By replacing
the call to ppois() suitably, the same process can be used to simulate random observations from
any discrete distribution supported on the non-negative integers. ■
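The same loop can be wrapped in a small function that accepts any distribution function on the non-negative integers. The sketch below is only an illustration: `r_discrete` is a hypothetical helper name (not a function provided by R), and the sample sizes, parameters and seed are assumptions.

```r
# Inverse-transform sampling for a discrete distribution on 0, 1, 2, ...
# `cdf` is any function returning P(X <= k); `r_discrete` is a hypothetical
# helper name, not a function provided by R.
r_discrete <- function(n, cdf) {
  replicate(n, {
    u <- runif(1)
    k <- 0
    while (u > cdf(k)) k <- k + 1   # smallest k with u <= P(X <= k)
    k
  })
}

set.seed(1)
y <- r_discrete(5000, function(k) ppois(k, lambda = 5))
c(mean = mean(y), var = var(y))     # both should be near lambda = 5

# The same helper with a Binomial(10, 0.3) distribution function.
z <- r_discrete(5000, function(k) pbinom(k, size = 10, prob = 0.3))
mean(z)                             # should be near 10 * 0.3 = 3
```

Passing the distribution function as an argument keeps the sampling loop itself unchanged from one distribution to the next.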
The process described in the previous example cannot be used for continuous random variables.
In such cases, Lemma 5.3.7 often proves useful. We illustrate how to generate samples from
the Exponential distribution using Uniform samples.
Example 7.2.2. Consider the case where we want X to have the Exp(1) distribution. Then,
$F_X(x) = 1 - e^{-x}$ for x > 0. Solving $F_X(x) = u$ for x, we have
\[ 1 - e^{-x} = u \;\Rightarrow\; e^{-x} = 1 - u \;\Rightarrow\; x = -\log(1 - u), \]
that is, $F_X^{-1}(u) = -\log(1 - u)$. Thus, we can simulate 50 observations from the Exp(1) distribution
using the following R code.
-log(1 - runif(50))
This takes advantage of the ability of runif() to generate multiple values at once, and the fact
that the expression for $F_X^{-1}(u)$ can be easily vectorized. We can multiply the resulting observations
by 1/λ to simulate observations from the Exp(λ) distribution. ■
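The rescaling by 1/λ can be folded into a small function. The sketch below is an illustration under assumptions: `r_exp_inverse` is a hypothetical helper name, and the rate, sample size and seed are arbitrary. It compares the sample mean and a few sample quantiles with the corresponding Exp(λ) values obtained from qexp.

```r
# Inverse-transform sampling from Exp(lambda); `r_exp_inverse` is a
# hypothetical helper name. The rate and the seed are illustrative.
r_exp_inverse <- function(n, lambda = 1) {
  -log(1 - runif(n)) / lambda
}

set.seed(1)
x <- r_exp_inverse(10000, lambda = 2)
c(mean = mean(x), theory = 1 / 2)

# Compare a few sample quantiles with the Exp(2) quantiles from qexp().
probs <- c(0.25, 0.5, 0.75)
rbind(sample = quantile(x, probs), theory = qexp(probs, rate = 2))
```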
The approach illustrated in the last two examples has a disadvantage when the distribution
function F or its generalised inverse cannot be computed explicitly. This is the case when one
wishes to simulate samples from standard Normal distribution. We will discuss a few instances in
the Exercises next.
exercises
(c) Let X1 = R cos(Θ) and X2 = R sin(Θ). Show that X1 , X2 are i.i.d standard Normal random
variables.
(d) Write R code to simulate 100 samples from the standard Normal distribution from Uniform(0, 1) samples.
Ex. 7.2.2. Let Z1, Z2 be i.i.d. standard normal random variables, µ1, µ2 ∈ R, σ1, σ2 > 0 and
−1 < ρ < 1. Suppose $X_1 = \sigma_1 Z_1 + \mu_1$ and $X_2 = \sigma_2\left(\rho Z_1 + \sqrt{1-\rho^2}\, Z_2\right) + \mu_2$.
(b) Use Exercise 7.2.1 (d) to write R code to simulate 100 samples from a bivariate
Normal distribution where the correlation is ρ = 1/2 and the marginals are standard Normal random
variables.
Ex. 7.2.3. Use the approach in Example 7.2.1 to write R code that simulates, from Uniform(0, 1) samples, 100
samples of
(b) Geometric(1/4)
Ex. 7.2.4. Let X1 , X2 , . . . , Xn be an i.i.d. sample from the Poisson(λ) distribution, and suppose
we are interested in estimating λ.
(a) Show that both the sample mean and the sample variance of X1 , X2 , . . . , Xn are unbiased
estimators of λ.
(b) Which of these estimators is better? To answer this question, simulate random observations
from the Poisson(λ) distribution for various values of λ using the R function rpois. Explore
the behaviour of the two estimates by varying λ as well as the sample size.
Ex. 7.2.5. Exercise 2.3.7 described the technique called “capture-recapture” which biologists use
to estimate the size of the population of a species when it cannot be directly counted. Suppose
the unknown population size is N , and fifty members of the species are selected and given an
identifying mark. Sometime later a sample of size twenty is taken from the population, and it is
found to contain X of the fifty previously marked members. Equating the proportion of marked members
in the second sample and the population, we have $\frac{X}{20} = \frac{50}{N}$, giving an estimate of $\hat{N} = \frac{1000}{X}$.
Recall that X has a hypergeometric distribution that involves N as a parameter. It is not easy
to compute E [N̂ ] and V ar [N̂ ]. However, Hypergeometric random variables can be simulated in
R using the function rhyper. For each N = 50, 100, 200, 300, 400, and 500, use this function to
simulate 1000 values of N̂ and use them to estimate E [N̂ ] and V ar [N̂ ]. Plot these estimates as a
function of N .
Ex. 7.2.6. Suppose p is the unknown probability of an event A, and we estimate p by the sample
proportion p̂ based on an i.i.d. sample of size n.
(b) Using the relations derived above, determine the sample size n, as a function of p, that is
required to achieve SD(p̂) = 0.01. How does this required value of n vary with p?
(c) Design and implement the following simulation study to verify this behaviour. For p = 0.01,
0.1, 0.25, 0.5, 0.75, 0.9, and 0.99,
(ii) Simulate 1000 values of p̂ with n chosen according to the formula derived above.
In each case, you can think of the 1000 values as i.i.d. samples from the distribution of p̂,
and use the sample standard deviation as an estimate of SD [p̂]. Plot the estimated values of
SD (p̂) against p for both choices of n. Your plot should look similar to Figure 7.1.
Figure 7.1: Estimated standard deviation in estimating a probability using sample proportion
as a function of the probability being estimated. See exercise 7.2.6.
7.3 Plots
As we will see in later chapters, making more assumptions about the underlying distribution of X
allows us to give concrete answers to many important questions. This is indeed a standard and
effective approach to doing statistics, but in following that approach there is a danger of forgetting
that assumptions have been made, which we should guard against by doing our best to convince
ourselves beforehand that the assumptions we are making are reasonable.
Doing this is more of an art than a science, and usually takes the form of staring at plots
obtained from the sample observations, with the hope of answering the question: “does this plot
look like what I would have expected it to look like had my assumptions been valid?” Remember
that the sample X1 , X2 , . . . , Xn is a random sample, so any plot derived from it is also a “random
plot”. Unlike simple quantities such as sample mean and sample variance, it is not clear what to
“expect” such plots to look like, and the only way to really hone our instincts to spot anomalies is
through experience. In this section, we introduce some commonly used plots and use simulated
data to give examples of what such plots might look like when the usual assumptions we make are
valid or invalid.
Figure 7.2: Empirical frequency distribution of 10000 random samples from the Poisson(30)
distribution (x-axis: Value; y-axis: Proportion).
The typical assumption made about a random sample is that the underlying random variable
belongs to a family of distributions rather than a very specific one. For example, we may assume
that the random variable has a Poisson(λ) distribution for some λ > 0, without placing any further
restriction on λ, or a Binomial(n, p) distribution for some 0 < p < 1. Such families are known as
parametric families.
When the data X1 , X2 , . . . , Xn are from a discrete distribution, the simplest representation of
the data is its empirical distribution, which is essentially a table of the frequencies of each value
that appeared. For example, if we simulate 1000 samples from a Poisson distribution with mean 3,
its frequency table may look like
x
0 1 2 3 4 5 6 7 8
49 168 238 215 155 99 48 18 10
prop.table(table(x))
x
0 1 2 3 4 5 6 7 8
0.049 0.168 0.238 0.215 0.155 0.099 0.048 0.018 0.010
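A table of this kind can be produced along the following lines. This is a minimal sketch: the seed is arbitrary, so the counts will differ from those printed above, and the names freq and prop are illustrative.

```r
set.seed(1)                       # arbitrary seed; the counts above came from a different draw
x <- rpois(1000, lambda = 3)      # 1000 Poisson observations with mean 3
freq <- table(x)                  # number of times each distinct value appeared
prop <- prop.table(freq)          # counts divided by the sample size
print(freq)
print(round(prop, 3))
```

Because prop.table() divides by the total count, the proportions always add up to 1.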
The simplest graphical representation of such a table is through a plot similar to Figure 7.2, which
represents a larger Poisson sample with mean 30, resulting in many more distinct values. Although
in theory all non-negative integers have positive probability of occurring, the probabilities are too
small to be relevant beyond a certain range. This plot does not have a standard name, although
it may be considered a variant of the Cleveland Dot Plot. We will refer to it as the Empirical
Distribution Plot from now on.
We can make similar plots for samples from Binomial or any other distribution. Unfortunately,
looking at this plot does not necessarily tell us whether the underlying distribution is Poisson, in
part because the shape of the Poisson distribution varies with the λ parameter. A little later, we
will discuss a modification of the empirical distribution plot, known as a rootogram, that helps
make this kind of comparison a little easier.
In the case of continuous distributions, we similarly want to make assumptions about a random
sample being from a parametric family of distributions. For example, we may assume that the
random variable has a Normal(µ, σ 2 ) distribution without placing any further restriction on the
parameters µ or σ 2 (except of course that σ 2 > 0), or that it has an Exponential(λ) distribution
with any value of the parameter λ > 0. Such families, as noted earlier, are known as parametric
families. For both these examples, the shape of the distribution does not depend on the parameters,
and this makes various diagnostic plots more useful.
The empirical distribution plot above is not useful for data from a continuous distribution,
because by the very nature of continuous distributions, all the data points will be distinct with
probability 1, and the empirical distribution will therefore assign a proportion of exactly 1/n to
each of these points.
The plot that is most commonly used instead to study distributions is the histogram. It
is similar to the empirical distribution plot, except that it does not retain all the information
contained in the empirical distribution. Instead, it divides the range of the data into arbitrary bins
and counts the frequencies of data points falling into each bin, effectively discretizing the data.
More precisely, the histogram estimates the probability density function of the underlying random
variable by estimating the density in each bin as a quantity such that the probability of each bin
is proportional to the number of observations in that bin. By choosing the bins judiciously, for
example by having more of them as sample size increases, the histogram strikes a balance that
ensures that the histogram “converges” to the true underlying density as n → ∞.
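The density estimate described above can be computed by hand and compared with what R reports. The following is a sketch under assumptions: it uses the base R function hist() with plot = FALSE, and breaks = 20 is only a suggested number of bins that hist() is free to adjust.

```r
set.seed(1)
x <- rnorm(500)                          # simulated Normal(0, 1) sample
h <- hist(x, breaks = 20, plot = FALSE)  # 'breaks = 20' is only a suggestion to hist()
widths <- diff(h$breaks)                 # bin widths
prop_in_bin <- h$counts / length(x)      # proportion of observations in each bin
dens <- prop_in_bin / widths             # density estimate: proportion per unit length
print(all.equal(dens, h$density))        # agrees with the density computed by hist()
print(sum(dens * widths))                # total estimated probability
```

Dividing each bin's proportion by its width is what makes the bar areas, rather than the bar heights, add up to 1.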
Figure 7.3 gives examples of histograms where data are simulated from the normal and
exponential distributions for varying sample sizes. Five replications are shown for each sample size.
We can see that for large sample sizes, the shapes of the histograms are recognizably similar to the
shapes of the corresponding theoretical distributions seen in Figure 5.1 and Figure 5.2 in Chapter 5.
Figure 7.3: Histograms of random samples from the Normal(0, 1) (top) and Exponential(1)
(bottom) distributions. Columns represent increasing sample sizes, and rows are
independent repetitions of the experiment.
Moreover, the shape is consistent over the five replications. This is not true, however, for small
sample sizes. Remember that the histograms are based on the observed data, and are therefore
random objects themselves. As we saw with numerical properties like the mean, estimates have
higher variability when the sample size is small, and get less variable as sample size increases. The
same holds for graphical estimates, although making this statement precise is more difficult.
There are several ways to create histograms in R, which we will not go into here, but one
approach is explored in the exercises.
Graphical displays of data are almost always used for some kind of comparison. Sometimes these
are implicit comparisons, asking, say, “how many peaks does a density have”, or “is it symmetric?”
More often, they are used to compare samples from two subpopulations, say, the distribution of
height in males and females. Sometimes, as discussed above, they are used to compare an observed
sample to a hypothesized distribution.
In the case of the empirical distribution plot, a simple modification is to add the probability
mass function of the theoretical distribution. This, although a reasonable modification, is not
optimal. Research into human perception of graphical displays indicates that the human eye is more
adept at detecting departures from straight lines than from curves. Taking this insight into account,
John Tukey suggested “hanging” the vertical lines in an empirical distribution plot (which are after
all nothing but sample proportions) from their expected values under the hypothesized distribution.
He further suggested a transformation of what is plotted: instead of the sample proportions and
the corresponding expected probabilities, he suggested plotting their square roots, thus leading to
the name hanging rootogram for the resulting plot. The reason for making this transformation is as
follows. Recall that for a proportion p̂ obtained from a sample of size n,
$$\mathrm{Var}[\hat{p}] = \frac{p(1-p)}{n} \approx \frac{p}{n}$$
provided p is close to 0. In Chapter 8, we will encounter the Central Limit Theorem and the Delta
Method, which can be used to show (see Example 8.5.4) that as the sample size n grows large,
$\mathrm{Var}[\sqrt{\hat{p}}] \approx c/n$ for a constant c. This means that unlike $\hat{p} - p$, the variance of $\sqrt{\hat{p}} - \sqrt{p}$ will be
approximately independent of p.
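The effect of the square-root transformation can be checked by simulation. The sketch below uses illustrative choices for n, the values of p, and the number of replications; the helper sim_var() is a hypothetical name introduced only for this example.

```r
set.seed(1)
n <- 10000                                 # sample size behind each proportion
p <- c(0.01, 0.02, 0.05, 0.1)              # small true probabilities
reps <- 5000                               # simulated proportions per value of p

# Variance of p_hat and of sqrt(p_hat) for one true probability
sim_var <- function(prob) {
  p_hat <- rbinom(reps, size = n, prob = prob) / n
  c(raw = var(p_hat), root = var(sqrt(p_hat)))
}
vars <- sapply(p, sim_var)                 # 2 x 4 matrix, one column per value of p
colnames(vars) <- p
print(vars * n)                            # n * Var: 'raw' grows with p, 'root' changes little
```

In this setup the scaled variance of the raw proportions changes by roughly a factor of nine across the four values of p, while that of the square roots stays within a narrow band.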
Figure 7.4 gives examples of hanging rootograms. These examples have been created using the
rootogram() function in the latticeExtra package. The following R code is an example of its
use.
library(package = "latticeExtra")
xbin30 <- rbinom(10000, 100, 0.3)
rootogram(~ xbin30, dfun = function(x) dpois(x, lambda = 30), grid = TRUE)
This requires the latticeExtra package to be already installed on your system, which it most
likely will not be. To install it, type
install.packages("latticeExtra")
Figure 7.4: Hanging rootogram of 10000 random samples compared with the Poisson(30)
distribution (x-axis: Value; y-axis: P(X = x)). In the top plot, the samples are also from
Poisson(30), whereas in the bottom plot the samples are from the Binomial(100, 0.3)
distribution, which has the same mean but different variance. Note the similarities with Figure 2.2.
Figure 7.5: Conventional ECDF plot (top; x-axis: Value, y-axis: Empirical CDF) and its “inverted”
version (bottom; x-axis: Quantiles of U(0,1), y-axis: Sorted data values), with x- and
y-axes switched, and points instead of lines.
Just as histograms were binned versions of the empirical distribution plot, we can plot binned
versions of hanging rootograms for data from a continuous distribution as well. It is more common
however, to look at quantile-quantile plots (QQ plots), which do not bin the data, but instead plot
what is essentially a transformation of the empirical CDF.
Recall from Definition 7.1.1 that the ECDF of observations X1 , X2 , . . . , Xn is given by
$$\hat{F}_n(t) = P(Y \le t) = \frac{\#\{X_i \le t\}}{n}.$$
The top plot in Figure 7.5 is a conventional ECDF plot of 200 observations simulated from a
Normal(1, 0.5²) distribution. The bottom plot has the sorted data values on the y-axis and 200
equally spaced numbers from 0 to 1 on the x-axis. A little thought tells us that this plot is essentially the same
as the ECDF plot, with the x- and y-axes switched, and using points instead of lines. Naturally, we
expect that for reasonably large sample sizes, the ECDF plot obtained from a random sample will
be close to the true cumulative distribution function of the underlying distribution. If we know the
shape of the distribution we expect the data to be from, we can compare it with the shape seen in
the plot.
Although this is a fine idea in principle, it is difficult in practice to detect small differences
between the observed shape and the theorized or expected shape. Here, we are helped again by the
insight that the human eye finds it easier to detect deviations from a straight line than from curves.
By keeping the sorted data values unchanged, but transforming the equally spaced probability
values to the corresponding quantile values of the theorized distribution, we obtain a plot that we
expect to be linear. We define them formally below.
Definition 7.3.1. Let F be a distribution function. For 0 < p < 1, the p-th quantile of F
is defined as
$$q_p = \inf\{x \in \mathbb{R} : F(x) \ge p\}.$$
Note that $q_p = F^{-1}(p)$ when $F^{-1}$ exists, and $F(q_p-) \le p \le F(q_p)$. The quantile $q_{1/2}$ is referred to as the
median of F . For a sample X1 , X2 , . . . , Xn from distribution F, the sample p-th quantile is
defined as the p-th quantile of the empirical distribution function $\hat{F}_n$.
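For a sample, this definition can be evaluated directly. The sketch below uses made-up data and compares three routes: the definition applied to the ECDF, the equivalent order-statistic form, and R's quantile() with type = 1, which is documented as the inverse of the empirical distribution function. The probabilities are chosen so that no rounding issues arise in n * p.

```r
x <- c(2.1, 0.4, 3.3, 1.7, 0.9, 2.8, 1.2, 4.0, 0.1, 2.5)  # illustrative data
n <- length(x)
p <- c(0.25, 0.5, 0.75)
xs <- sort(x)

# Direct use of the definition: smallest data value whose ECDF is at least p
Fn <- ecdf(x)
by_definition <- sapply(p, function(pp) min(xs[Fn(xs) >= pp]))

# Equivalent closed form: the ceiling(n * p)-th order statistic
by_order_stat <- xs[ceiling(n * p)]

# quantile() with type = 1 inverts the ECDF in the same way
by_quantile <- unname(quantile(x, probs = p, type = 1))
print(rbind(by_definition, by_order_stat, by_quantile))
```

The default quantile() type interpolates between order statistics, so it need not agree with the definition above; type = 1 is the variant that matches it.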
Quantiles may be thought of informally as follows, generalizing the definition of median given in
Exercise 5.2.10: For a given CDF F , the quantile corresponding to a probability value p ∈ [0, 1] is a
value x such that F (x) = p. Such an x may not exist for all p and F , or it may not be unique, and
a formal definition of quantiles needs to be modified to take this into account. However, for most
standard continuous distributions used in Q-Q plots, one may work with this informal notion. Such
a plot with Normal (0, 1) quantiles is shown in Figure 7.6 for simulated normal and exponential
random samples. More examples are explored in the exercises.
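The construction of a Normal Q-Q plot described above can be sketched in base R. Two assumptions are made here: the equally spaced probabilities are taken from ppoints(), which keeps them strictly inside (0, 1) so that the Normal quantiles are finite, and the fitted line is added only as a rough numerical summary of linearity.

```r
set.seed(1)
x <- rnorm(200, mean = 1, sd = 0.5)   # simulated sample, as in the ECDF example
n <- length(x)

probs <- ppoints(n)                   # n equally spaced probabilities inside (0, 1)
theo  <- qnorm(probs)                 # corresponding Normal(0, 1) quantiles
plot(theo, sort(x),
     xlab = "Quantiles of N(0,1)", ylab = "Sorted data values")

# For Normal data the points lie near a line with intercept mu and slope sigma
fit <- coef(lm(sort(x) ~ theo))
print(fit)
```

The base R function qqnorm() uses the same theoretical quantiles, so qqnorm(x) should produce an equivalent display.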
Exercises
Ex. 7.3.1. The R functions histogram() and qqmath() in the lattice package can be used to
generate histograms and Q-Q plots respectively (although there are other alternatives as well).
Figure 7.6: Normal Q-Q plots of data generated from Normal and Exponential distributions,
with varying sample size (x-axis: Quantiles of N(0,1); y-axis: Sorted data values). The Q-Q
plots are more or less linear for Normal data, but exhibit curvature indicative of a relatively
heavy right tail for exponential data. Not surprisingly, the difference becomes easier to see
as the sample size increases.
This exercise guides you through the process of simulating data from a sampling distribution and
creating corresponding histograms and Normal Q-Q plots.
(a) Suppose Z1 , Z2 , . . . , Zn are independent Normal(0, 1). Then the distribution of the mean of
Z1 , Z2 , . . . , Zn is Normal(0, 1/n). To verify this, simulate such means for n = 50 using the
following R code.
(d) Study the behaviour of these plots over multiple repetitions, as well as by varying n and the
number of replications.
Ex. 7.3.2. If Z1 , Z2 , . . . , Zn are independent Normal(0, 1), what can you say about the distribution
of the median of Z1 , Z2 , . . . , Zn ? Use the median() function, using it to replace the call to mean()
in the previous exercise, to simulate observations from this distribution. Use histograms and Normal
Q-Q plots to study this distribution and compare it to the distribution of the mean. In particular, is
the distribution of the median also Normal? Does it have lower or higher variance than the mean?
Ex. 7.3.3. Repeat the previous exercise, replacing the median by the minimum and maximum of n
observations Z1 , Z2 , . . . , Zn that are independent Normal(0, 1). What are the distinguishing features
of these histograms and Normal Q-Q plots?
We have seen the use of these sample statistics in the previous chapter. In this chapter, we will
discuss the distributional properties and limiting behaviour of such statistics. In Chapters 9 and
10, we will discuss how these properties can be effectively used to estimate parameters related to
the underlying population and verify specific hypotheses about them. The corresponding fields of
study are called Estimation and Hypothesis Testing.
We will spend most of our time in finding the distribution of the sample mean and the
sample variance given the distribution of X1 . One immediately observes that these are somewhat
complicated functions of independent random variables. However in Section 3.3 and Section 5.5 we
have seen examples of functions for which we were able to explicitly compute the distribution. To
understand sampling statistics we must also understand the notion of joint distribution of more
than two continuous random variables (See Section 3.3 for discrete random variables).
In Chapter 3, while discussing discrete random variables, we had considered a finite collection
of random variables (X1 , X2 , . . . , Xn ). In Definition 3.2.7, we had described how to define their
joint distribution and we used this to understand the multinomial distribution in Example 3.2.12.
There are many instances in the continuous setting as well where it is relevant to study the joint
distribution of a finite collection of random variables. Suppose X is a point chosen randomly inside
the unit sphere in three dimensions. Then X has three coordinates, say X = (X1 , X2 , X3 ), where
each Xi is a random variable in (−1, 1). These coordinates are dependent because we know that
$\sqrt{X_1^2 + X_2^2 + X_3^2} \le 1$. To reason about the properties of X, it is useful and necessary to understand
the “joint distribution” of (X1 , X2 , X3 ). Similarly, to understand the distribution of the sample
mean and the sample variance, which are functions of X1 , X2 , . . . , Xn , we first need to understand
the joint distribution of (X1 , X2 , . . . , Xn ). We begin by defining the joint distribution function.
for x1 , x2 , . . . , xn ∈ R.
As in one-variable and two-variable situations, the joint distribution function determines the entire
joint distribution of (X1 , X2 , . . . , Xn ) for discrete random variables. More precisely, if all the
random variables were discrete with Xi : S → Ti , where Ti are countable subsets of R for 1 ≤ i ≤ n,
then from the joint distribution function one can determine
P (X1 = t1 , X2 = t2 , . . . , Xn = tn )
for all ti ∈ Ti , 1 ≤ i ≤ n (See Exercise 8.1.1). The joint distribution function determines the joint
distribution in the continuous setting as well, but we need to introduce some notation before we
can state this result rigorously.
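For n ≥ 1, let $f : \mathbb{R}^n \to \mathbb{R}$ be a non-negative function that is piecewise-continuous in each
variable, and for which
$$\int_{\mathbb{R}^n} f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n = 1.$$
If, for events $A \subset \mathbb{R}^n$, we define
$$P(A) = \int_A f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n,$$
then one can show as in Theorem 5.1.5 that P is a probability on $\mathbb{R}^n$. In this case, f is called the
density function for P .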
Density functions arise naturally from certain types of random variables. A collection of random
variables (X1 , X2 , . . . , Xn ) is said to have a joint density f : Rn → R if for every event A ⊂ Rn ,
$$P((X_1, X_2, \ldots, X_n) \in A) = \int_A f(x_1, x_2, \ldots, x_n)\, dx_1\, dx_2 \ldots dx_n.$$
In this setting, the joint distribution of (X1 , X2 , . . . , Xn ) is determined by their joint density f .
Using multivariable calculus we can state and prove a result similar to Theorem 5.2.5 for
random variables (X1 , X2 , . . . , Xn ) that have a joint density. In particular, we can conclude that
as the joint densities are assumed to be piecewise continuous in each variable, the corresponding
distribution functions are piecewise differentiable in each variable. Further, the joint distribution
of the continuous random variables (X1 , X2 , . . . , Xn ) is completely determined by their joint
distribution function F . That is, if we know F (x1 , x2 , . . . , xn ) for all x1 , x2 , . . . , xn ∈ R we could
use multivariable calculus to differentiate F to find f . Integrating this joint density over the event
A, we can then calculate P ((X1 , X2 , . . . , Xn ) ∈ A).
As in the n = 2 case, one can recover the marginal density of each Xi for i between 1 and n by
integrating over the other indices. So, the marginal density of Xi at a is given by
$$f_{X_i}(a) = \int_{\mathbb{R}^{n-1}} f(x_1, \ldots, x_{i-1}, a, x_{i+1}, \ldots, x_n)\, dx_1 \ldots dx_{i-1}\, dx_{i+1} \ldots dx_n.$$
Further, for n ≥ 3, we can deduce the joint density for any sub-collection of m ≤ n random variables
by integrating over the other variables. For instance, if we were interested in the joint density of
$(X_1, X_3, X_7)$, we would obtain
$$f_{X_1,X_3,X_7}(a_1, a_3, a_7) = \int_{\mathbb{R}^{n-3}} f(a_1, x_2, a_3, x_4, x_5, x_6, a_7, x_8, \ldots, x_n)\, dx_2\, dx_4\, dx_5\, dx_6\, dx_8 \ldots dx_n.$$
Suppose X1 , X2 , . . . , Xn are random variables defined on a single sample space S with joint density
f : Rn → R. Let g : Rn → R be a function of n variables for which g (X1 , X2 , . . . , Xn ) is defined
on the range of the Xj variables. Let B be an event in the range of g. Then, following the proof of
Theorem 3.3.5, we can show that
$$P(g(X_1, X_2, \ldots, X_n) \in B) = P\left((X_1, X_2, \ldots, X_n) \in g^{-1}(B)\right).$$
Although the above provides an abstract method of finding the distribution of the random variable
Y = g (X1 , X2 , . . . , Xn ), it can be difficult to use for explicit calculations. For n = 1 we discussed
this question in detail in Section 5.3, and for n = 2 we explored how to find the distributions of
sums and ratios of independent random variables (see Section 5.5). This method could be extended
by induction on n in a few cases, but in general this is not possible. In Appendix B, Section A.1.1,
we discuss a more general Jacobian-based method of finding the joint density of functions of random
variables.
The notion of independence, introduced in the discrete setting, also extends to multi-dimensional
continuous random variables. As discussed in Definition 3.2.3, a finite collection of continuous
random variables X1 , X2 , . . . , Xn is mutually independent if the sets (Xj ∈ Aj ) are mutually
independent for all events Aj in the ranges of the corresponding Xj . As proved for the n = 2
case in Theorem 5.4.7, we can similarly deduce that if (X1 , X2 , . . . , Xn ) are mutually independent
continuous random variables with marginal densities fXi then their joint density is given by
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i), \tag{8.1.2}$$
for $x_i \in \mathbb{R}$ and 1 ≤ i ≤ n. Further, for any finite sub-collection $(X_{i_1}, X_{i_2}, \ldots, X_{i_m})$ of the above
independent random variables, the joint density is given by
$$f(a_1, a_2, \ldots, a_m) = \prod_{j=1}^{m} f_{X_{i_j}}(a_j). \tag{8.1.3}$$
Theorem 8.1.2. Fix n ≥ 1. For each $j \in \{1, 2, \ldots, n\}$, let $i \in \{1, 2, \ldots, m_j\}$ for some
positive integer $m_j$. Suppose $X_{i,j}$ is an array of mutually independent continuous random
variables. Define $Y_j = g_j(X_{1,j}, X_{2,j}, \ldots, X_{m_j,j})$, where $g_j : \mathbb{R}^{m_j} \to \mathbb{R}$ are continuous
functions. Then the resulting variables $Y_1, Y_2, \ldots, Y_n$ are mutually independent.
For n ≥ 1, let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from a population with common
distribution function F . Arrange them in increasing order of magnitude, with the ordered observations
denoted by
$$X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}.$$
These ordered values are called the order statistics of the sample $X_1, X_2, \ldots, X_n$. For 1 ≤ r ≤ n,
$X_{(r)}$ is called the r-th order statistic. The median of $X_1, X_2, \ldots, X_n$ is defined as $X_{((n+1)/2)}$ when n
is odd and $X_{(n/2)}$ when n is even.
One can compute $F_{(r)}$, the distribution function of $X_{(r)}$, for 1 ≤ r ≤ n in terms of n and F .
We have
$$\begin{aligned}
F_{(1)}(x) = P(X_{(1)} \le x) &= 1 - P(X_{(1)} > x) = 1 - P\left(\bigcap_{i=1}^{n} (X_i > x)\right) \\
&= 1 - \prod_{i=1}^{n} P(X_i > x) = 1 - \prod_{i=1}^{n} \left(1 - P(X_i \le x)\right) \\
&= 1 - (1 - F(x))^n,
\end{aligned}$$
$$F_{(n)}(x) = P(X_{(n)} \le x) = P\left(\bigcap_{i=1}^{n} (X_i \le x)\right) = \prod_{i=1}^{n} P(X_i \le x) = (F(x))^n.$$
If the distribution function F had a probability density function f then each X(r ) has a probability
density function f(r ) . This can be obtained by differentiating F(r ) and is given by the following
expression.
$$f_{(r)}(x) = \begin{cases}
n(1 - F(x))^{n-1} f(x) & r = 1 \\
n f(x) (F(x))^{n-1} & r = n \\
\dfrac{n!}{(r-1)!(n-r)!}\, f(x) (F(x))^{r-1} (1 - F(x))^{n-r} & 1 < r < n
\end{cases} \tag{8.1.4}$$
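Example 8.1.3. Let n ≥ 1 and let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from a population
whose common distribution is Exponential(λ), with distribution function F . Then we know that
$$F(x) = \begin{cases} 0 & x < 0 \\ 1 - e^{-\lambda x} & x \ge 0. \end{cases}$$
Therefore using (8.1.4) and substituting for F as above we have that the densities of the order
statistics are given by
$$f_{(r)}(x) = \begin{cases}
n (e^{-\lambda x})^{n-1} \lambda e^{-\lambda x} & r = 1 \\
n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\
\dfrac{n!}{(r-1)!(n-r)!}\, \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r} & 1 < r < n,
\end{cases}$$
which simplifies to
$$f_{(r)}(x) = \begin{cases}
n \lambda e^{-n\lambda x} & r = 1 \\
n \lambda e^{-\lambda x} (1 - e^{-\lambda x})^{n-1} & r = n \\
\dfrac{\lambda\, n!}{(r-1)!(n-r)!}\, (1 - e^{-\lambda x})^{r-1} (e^{-\lambda x})^{n-r+1} & 1 < r < n,
\end{cases}$$
for x > 0. We note from the above that $X_{(1)}$, i.e. the minimum of the exponentials, is an Exponential(nλ)
random variable. However, the other order statistics are not exponentially distributed. ■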
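Both conclusions of this example can be checked numerically. The sketch below uses illustrative values of n, λ and r: it verifies that the density from (8.1.4) integrates to 1, and compares the simulated mean of the minimum with the Exponential(nλ) mean 1/(nλ). The function name f_r is introduced only for this illustration.

```r
set.seed(1)
n <- 5; lambda <- 2; r <- 3           # illustrative sample size, rate and rank

# Density of the r-th order statistic, from (8.1.4) with F(x) = 1 - exp(-lambda * x)
f_r <- function(x) {
  Fx <- pexp(x, rate = lambda)
  factorial(n) / (factorial(r - 1) * factorial(n - r)) *
    dexp(x, rate = lambda) * Fx^(r - 1) * (1 - Fx)^(n - r)
}
total <- integrate(f_r, 0, Inf)$value  # a density should integrate to 1

# Simulated minima: each row of the matrix is one sample of size n
samples <- matrix(rexp(10000 * n, rate = lambda), ncol = n)
minima  <- apply(samples, 1, min)
print(c(total = total,
        mean_min = mean(minima),       # simulated mean of the minimum
        expected = 1 / (n * lambda)))  # mean of Exponential(n * lambda)
```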
In many applications, one is interested in the range of values a random variable X assumes.
A method to understand this is to take an i.i.d. sample $X_1, X_2, \ldots, X_n$ with the same distribution as X and examine $R = X_{(n)} - X_{(1)}$.
Suppose X has a probability density function $f : \mathbb{R} \to \mathbb{R}$ and distribution function $F : \mathbb{R} \to [0, 1]$.
As before we can calculate the joint density of $(X_{(1)}, X_{(n)})$ by first computing the joint distribution
function. This is done by using the i.i.d. nature of the sample and the definition of the order
statistics.
From the above, differentiating partially in x and y we see that the joint density of $(X_{(1)}, X_{(n)})$ is
given by
$$f_{X_{(1)},X_{(n)}}(x, y) = \begin{cases}
n(n-1) f(x) f(y) [F(y) - F(x)]^{n-2} & x < y \\
0 & \text{otherwise.}
\end{cases} \tag{8.1.5}$$
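$$\begin{aligned}
P(R \le r) &= P(X_{(n)} \le X_{(1)} + r) \\
&= \int_{-\infty}^{\infty} \int_{x}^{x+r} f_{X_{(1)},X_{(n)}}(x, y)\, dy\, dx \\
&= \int_{-\infty}^{\infty} \int_{0}^{r} f_{X_{(1)},X_{(n)}}(x, z + x)\, dz\, dx \\
&= \int_{0}^{r} \int_{-\infty}^{\infty} f_{X_{(1)},X_{(n)}}(x, z + x)\, dx\, dz,
\end{aligned}$$
where we have done a change of variable y = z + x in the second last line and a change in the order
of integration in the last line. Differentiating the above we conclude that R has a density
given by
$$f_R(r) = \begin{cases}
\displaystyle\int_{-\infty}^{\infty} f_{X_{(1)},X_{(n)}}(x, r + x)\, dx & \text{if } r > 0 \\
0 & \text{otherwise.}
\end{cases} \tag{8.1.6}$$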
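Example 8.1.4. Let $X_1, X_2, \ldots, X_n$ be i.i.d. Uniform(0, 1). The probability density function and
distribution function of a Uniform(0, 1) random variable are given by
$$f(x) = \begin{cases} 1 & \text{if } x \in (0, 1) \\ 0 & \text{otherwise} \end{cases}
\quad \text{and} \quad
F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ x & \text{if } 0 < x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$$
Let $f_{X_{(r)}}$ be the probability density function of $X_{(r)}$ for 1 ≤ r ≤ n. Then, using (8.1.4), we have
$$f_{X_{(1)}}(x) = \begin{cases} n(1 - x)^{n-1} & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases}$$
$$f_{X_{(n)}}(x) = \begin{cases} n x^{n-1} & \text{if } x \in (0, 1) \\ 0 & \text{otherwise,} \end{cases}$$
and, for 1 < r < n,
$$f_{X_{(r)}}(x) = \begin{cases} \dfrac{n!}{(r-1)!(n-r)!}\, x^{r-1} (1 - x)^{n-r} & \text{if } x \in (0, 1) \\ 0 & \text{otherwise.} \end{cases}$$
Using (8.1.6), the probability density function of the range $R = X_{(n)} - X_{(1)}$ is given by
$$f_R(r) = \begin{cases} \displaystyle\int_0^{1-r} n(n-1)(x + r - x)^{n-2}\, dx & \text{if } 0 < r < 1 \\ 0 & \text{otherwise} \end{cases}
= \begin{cases} n(n-1)\, r^{n-2} (1 - r) & \text{if } 0 < r < 1 \\ 0 & \text{otherwise.} \end{cases}$$
It is easy to see by comparing density functions that $X_{(r)} \sim$ Beta(r, n − r + 1) for 1 ≤ r ≤ n, and
the range R ∼ Beta(n − 1, 2). ■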
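The distribution of the range can be checked by simulation. The sketch below takes an illustrative n and works with the density n(n − 1) r^(n−2) (1 − r) on (0, 1), which is the Beta(n − 1, 2) density with mean (n − 1)/(n + 1). The function name f_R is introduced only for this illustration.

```r
set.seed(1)
n <- 6                                        # illustrative sample size
samples <- matrix(runif(20000 * n), ncol = n) # each row is one Uniform(0, 1) sample
R <- apply(samples, 1, max) - apply(samples, 1, min)

# Density of the range for the Uniform(0, 1) case
f_R <- function(r) n * (n - 1) * r^(n - 2) * (1 - r)
total <- integrate(f_R, 0, 1)$value           # a density should integrate to 1
same_as_beta <- all.equal(f_R(c(0.2, 0.5, 0.9)),
                          dbeta(c(0.2, 0.5, 0.9), shape1 = n - 1, shape2 = 2))

print(same_as_beta)
print(c(total = total,
        sim_mean = mean(R),                   # simulated mean of the range
        beta_mean = (n - 1) / (n + 1)))       # mean of Beta(n - 1, 2)
```

The mean also follows directly from the order statistics, since E[X(n)] − E[X(1)] = n/(n + 1) − 1/(n + 1).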
In general, we may also be interested in the joint distribution of the order statistics. Suppose we
have an i.i.d. sample $X_1, X_2, \ldots, X_n$ with the same distribution as X. If X has a probability density function
$f : \mathbb{R} \to \mathbb{R}$ then one can show that the order statistics $(X_{(1)}, X_{(2)}, \ldots, X_{(n)})$ have a joint density
$h : \mathbb{R}^n \to \mathbb{R}$ given by
$$h(u_1, u_2, \ldots, u_n) = \begin{cases}
n!\, f(u_1) f(u_2) \cdots f(u_n) & u_1 < u_2 < \cdots < u_n, \\
0 & \text{otherwise.}
\end{cases}$$
The above fact should be intuitively clear: Any ordering u1 < u2 < . . . < un has “probability”
f (u1 )f (u2 ) . . . f (un ). Each Xi can assume any of the uk ’s. The total number of possible orderings
is n!. A formal proof involves using the Jacobian method and will be discussed in Appendix B.
8.1.2 χ², F and t
$$f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i) = \frac{1}{(\sqrt{2\pi})^n}\, e^{-\frac{1}{2} \sum_{i=1}^{n} x_i^2},$$
for $x_i \in \mathbb{R}$ and 1 ≤ i ≤ n. We are interested in the distribution of $Z = \sum_{i=1}^{n} X_i^2$.
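We shall find this distribution in two steps. Clearly, $X_1^2$ takes only non-negative values. The
distribution function for $X_1^2$ at z ≥ 0 is given by
$$\begin{aligned}
F_1(z) &= P(X_1^2 \le z) \\
&= P(-\sqrt{z} \le X_1 \le \sqrt{z}) \\
&= \frac{2}{\sqrt{2\pi}} \int_0^{\sqrt{z}} e^{-\frac{x^2}{2}}\, dx \\
&= \frac{1}{\sqrt{2\pi}} \int_0^{z} e^{-\frac{u}{2}} u^{-\frac{1}{2}}\, du,
\end{aligned}$$
where the third line uses the symmetry of the Normal(0, 1) density and the last line uses the substitution $x = \sqrt{u}$.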
Comparing it with the Gamma(α, λ) random variable defined in Definition 5.5.5 and using Exercise
5.5.9, we see that $X_1^2$ is distributed as a Gamma(1/2, 1/2) random variable. From the calculation done
in Example 5.5.6 for n = 2, it follows by using induction that $Z = \sum_{i=1}^{n} X_i^2$ has the Gamma(n/2, 1/2)
distribution. This distribution is referred to as χ² with n degrees of freedom. We define it precisely
next. ■
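Definition 8.1.6. ($\chi^2_n$, i.e. chi-square with n degrees of freedom) A random variable X
whose distribution is Gamma(n/2, 1/2) is said to have the chi-square distribution with n degrees
of freedom. Its probability density function is
$$f(x) = \frac{2^{-n/2}}{\Gamma(n/2)}\, x^{\frac{n}{2} - 1} e^{-\frac{x}{2}}$$
for x > 0.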
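The identification of the chi-square distribution with a Gamma distribution can be checked in R. The sketch below assumes that the second Gamma parameter is a rate, which is consistent with the calculation for $X_1^2$ above; the values of n and q are illustrative.

```r
set.seed(1)
n <- 4                                          # degrees of freedom (illustrative)
q <- c(0.5, 1, 2, 5, 10)                        # points at which to compare CDFs

# The chi-square(n) CDF coincides with the Gamma(n/2, rate = 1/2) CDF
cdf_match <- all.equal(pchisq(q, df = n),
                       pgamma(q, shape = n / 2, rate = 1 / 2))

# Simulated sums of n squared Normal(0, 1) variables
Z <- rowSums(matrix(rnorm(20000 * n), ncol = n)^2)
empirical <- sapply(q, function(t) mean(Z <= t))
print(cdf_match)
print(rbind(empirical = empirical, chisq = pchisq(q, df = n)))
```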
We show in Section 8.1.3 that the sample variance obtained from a Normal sample follows a
(scaled) χ2 random variable. The F distribution arises as the ratio of the sample variances of two
independent Normal samples, or in other words, as the ratio of two independent (scaled) χ2 random
variables, as we see in the next example.
Example 8.1.7. (F distribution) Let $X_1, X_2, \ldots, X_{n_1}$ be an i.i.d. random sample from the
Normal(0, $\sigma_1^2$) population, and $Y_1, Y_2, \ldots, Y_{n_2}$ be an independent i.i.d. random sample from
a Normal(0, $\sigma_2^2$) population. It follows from Example 8.1.5 that $U = \sum_{i=1}^{n_1} \left(\frac{X_i}{\sigma_1}\right)^2$ has the $\chi^2_{n_1}$
distribution, and $V = \sum_{i=1}^{n_2} \left(\frac{Y_i}{\sigma_2}\right)^2$ has the $\chi^2_{n_2}$ distribution. Further, by Theorem 8.1.2, U and V are
independent because the $X_i$ and $Y_j$ random variables are independent. Let
$$Z = \frac{U/n_1}{V/n_2} = \frac{U}{V} \cdot \frac{n_2}{n_1}.$$
It follows from Example 5.5.10 that the density of $W = \frac{U}{V}$ for w > 0 is given by
$$f_W(w) = \frac{w^{\frac{n_1}{2} - 1}}{(1 + w)^{\frac{n_1 + n_2}{2}}} \cdot \frac{\Gamma\left(\frac{n_1 + n_2}{2}\right)}{\Gamma\left(\frac{n_1}{2}\right) \Gamma\left(\frac{n_2}{2}\right)}.$$
Z is said to have the F distribution with degrees of freedom parameters $n_1$ and $n_2$, denoted $Z \sim F_{n_1,n_2}$. ■
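The density of Z follows from that of W by the change of variable Z = (n2/n1) W, and the result can be compared with R's built-in F density df(). The sketch below uses illustrative degrees of freedom; the function names f_W and f_Z are introduced only for this illustration.

```r
n1 <- 5; n2 <- 8                       # illustrative degrees of freedom
z  <- c(0.2, 0.7, 1, 2.5, 4)           # points at which to compare densities

# Density of W = U / V as given in the example
f_W <- function(w) {
  w^(n1 / 2 - 1) / (1 + w)^((n1 + n2) / 2) *
    gamma((n1 + n2) / 2) / (gamma(n1 / 2) * gamma(n2 / 2))
}

# Z = (n2 / n1) * W, so f_Z(z) = (n1 / n2) * f_W(z * n1 / n2)
f_Z <- function(z) (n1 / n2) * f_W(z * n1 / n2)

print(rbind(derived = f_Z(z), builtin = df(z, df1 = n1, df2 = n2)))
```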
Remark 8.1.8. In the previous example, the F distribution essentially arises as the ratio U /V
where U , V are independent χ² random variables. As the χ² distribution is a special case of the
Gamma distribution, it follows by Example 5.5.11 that $\frac{U}{U+V}$ is distributed as a Beta random
variable. Further, $\frac{U+V}{U} = 1 + \frac{V}{U}$, so the two distributions are simple transformations of each other.
In that sense, the F distribution is not a new distribution either, and it is studied separately mainly
for its natural definition as the ratio of sample variances.
The distribution of the ratio of sample mean and sample variance plays an important role in
estimation and hypothesis testing. This forms the motivation for the next example where the t
distribution arises naturally.
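Example 8.1.9. (t distribution) Let $X_1$ be a Normal(0, 1) random variable, and let $X_2$ be an
independent $\chi^2_n$ random variable. We wish to find the density of Z, where
$$Z = \frac{X_1}{\sqrt{X_2/n}}.$$
Observe that $U = Z^2$ is given by $\frac{X_1^2}{X_2/n}$. Now, $X_1^2$ has the $\chi^2_1$ distribution (see Example 8.1.5), so
applying Example 8.1.7 with $n_1 = 1$ and $n_2 = n$, we find that U has the $F_{1,n}$ distribution. The density
of U is given by
$$f_U(u) = \frac{\left(\frac{1}{n}\right)^{\frac{1}{2}} u^{\frac{1}{2} - 1}}{\left(1 + \frac{u}{n}\right)^{\frac{n+1}{2}}} \cdot \frac{\Gamma\left(\frac{n+1}{2}\right)}{\Gamma\left(\frac{1}{2}\right) \Gamma\left(\frac{n}{2}\right)}
= \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\, \Gamma\left(\frac{n}{2}\right)} \cdot \frac{u^{-\frac{1}{2}}}{\left(1 + \frac{u}{n}\right)^{\frac{n+1}{2}}}.$$
As $X_1$ is a symmetric random variable and $\sqrt{X_2/n}$ is positive valued, we conclude that Z is a
symmetric random variable (Exercise 8.1.11). So, for u > 0,
$$\begin{aligned}
P(U \le u) &= P(Z^2 \le u) \\
&= P(-\sqrt{u} \le Z \le \sqrt{u}) \\
&= P(Z \le \sqrt{u}) - P(Z \le -\sqrt{u}) \\
&= P(Z \le \sqrt{u}) - P(Z \ge \sqrt{u}) \\
&= 2P(Z \le \sqrt{u}) - 1.
\end{aligned}$$
Differentiating with respect to u, we get
$$f_U(u) = \frac{1}{\sqrt{u}}\, f_Z(\sqrt{u}).$$
Hence, using the symmetry of Z once more, for $z \ne 0$,
$$\begin{aligned}
f_Z(z) &= |z|\, f_U(z^2) \\
&= |z|\, \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\, \Gamma\left(\frac{n}{2}\right)} \cdot \frac{(z^2)^{-\frac{1}{2}}}{\left(1 + \frac{z^2}{n}\right)^{\frac{n+1}{2}}} \\
&= \frac{\Gamma\left(\frac{n+1}{2}\right)}{\sqrt{n\pi}\, \Gamma\left(\frac{n}{2}\right)} \left(1 + \frac{z^2}{n}\right)^{-\frac{n+1}{2}}.
\end{aligned}$$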
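The final expression can be compared with R's built-in t density dt(), and the construction of Z can be checked by simulation. The sketch below uses an illustrative value of n; the function name f_Z is introduced only for this illustration.

```r
n <- 7                                   # degrees of freedom (illustrative)
z <- c(-3, -1, 0, 0.5, 2)                # points at which to compare densities

# Density derived at the end of the example
f_Z <- function(z) {
  gamma((n + 1) / 2) / (sqrt(n * pi) * gamma(n / 2)) *
    (1 + z^2 / n)^(-(n + 1) / 2)
}
print(rbind(derived = f_Z(z), builtin = dt(z, df = n)))

# Simulation of Z = X1 / sqrt(X2 / n) with X1 ~ Normal(0, 1), X2 ~ chi-square(n)
set.seed(1)
Z <- rnorm(20000) / sqrt(rchisq(20000, df = n) / n)
print(c(empirical = mean(Z <= 1), builtin = pt(1, df = n)))
```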
We have already seen in Theorem 7.1.4 that $E[\bar{X}] = \mu$ and in Theorem 7.1.6 that $E[S^2] = \sigma^2$. It
is unreasonable to expect that we would be able to precisely describe the distribution of $\bar{X}$ or $S^2$
unless the distribution of the population is known. It turns out that even in that case, it is not easy
to derive these distributions in general. However, when the population is Normal, we can obtain
the joint distribution of $\bar{X}$ and $S^2$ completely. The main result of this section is the following.
Proof. (a) follows from Theorem 6.3.13. There are several proofs for (b) and (c), with the most
common ones requiring some knowledge of Linear Algebra (e.g., see [Rao73]). Here we will follow
Kruskal’s proof as illustrated in [Stig84]. The proof is by the method of induction on the sample
size n. To implement the inductive step, we shall replace $\bar{X}$ and $S^2$ with $\bar{X}_n$ and $S_n^2$ for the rest of
the proof. This notation also emphasizes that the distributions of $\bar{X}_n$ and $S_n^2$ depend on n, and that
as functions defined on the underlying sample space, they are in fact different random variables.
Step 1: (Proof for n = 2) Here
$$\bar{X}_2 = \frac{X_1 + X_2}{2} \quad \text{and} \quad S_2^2 = \left(X_1 - \frac{X_1 + X_2}{2}\right)^2 + \left(X_2 - \frac{X_1 + X_2}{2}\right)^2 = \frac{(X_1 - X_2)^2}{2}. \tag{8.1.7}$$
As $X_1$ and $X_2$ are independent Normal random variables with mean µ and variance σ², by Theorem
6.3.13, $\frac{X_1 - X_2}{\sigma\sqrt{2}}$ is a Normal random variable with mean 0 and variance 1. Using Example 8.1.5,
we know that $\frac{S_2^2}{\sigma^2}$ has the $\chi^2_1$ distribution and this proves (b).
From (8.1.7), $\bar{X}_2$ is a function of $X_1 + X_2$ and $S_2^2$ is a function of $X_1 - X_2$. Theorem 8.1.2
will imply that $\bar{X}_2$ and $S_2^2$ are independent if we show $X_1 + X_2$ and $X_1 - X_2$ are independent.
Let α, β ∈ R. Then using Theorem 6.3.13 again we have that $\alpha(X_1 + X_2) + \beta(X_1 - X_2) =
(\alpha + \beta)X_1 + (\alpha - \beta)X_2$ is normally distributed. As this is true for any α, β ∈ R, $(X_1 + X_2, X_1 - X_2)$
has a bivariate Normal distribution by Definition 6.4.1. Using Theorem 6.2.2 (f) and (g), along
with the fact that $X_1$ and $X_2$ are independent Normal random variables with mean µ and variance
σ², we have
Step 2: (inductive hypothesis) Let us inductively assume that (a),(b), and (c) are true when n = k
for some k ∈ N.
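Step 3: (Proof for n = k + 1) We shall rewrite $\bar{X}_{k+1}$ and $S_{k+1}^2$ using some elementary algebra.
$$\bar{X}_k - \bar{X}_{k+1} = \bar{X}_k - \frac{1}{k+1}\sum_{i=1}^{k+1} X_i = \left(1 - \frac{k}{k+1}\right)\bar{X}_k - \frac{1}{k+1}X_{k+1} = \frac{1}{k+1}(\bar{X}_k - X_{k+1}). \tag{8.1.8}$$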
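$$\begin{aligned}
S_{k+1}^2 &= \frac{1}{k}\sum_{i=1}^{k+1}(X_i - \bar{X}_{k+1})^2 = \frac{1}{k}\sum_{i=1}^{k+1}(X_i - \bar{X}_k + \bar{X}_k - \bar{X}_{k+1})^2 \\
&= \frac{1}{k}\sum_{i=1}^{k+1}\left[(X_i - \bar{X}_k)^2 + 2(X_i - \bar{X}_k)(\bar{X}_k - \bar{X}_{k+1}) + (\bar{X}_k - \bar{X}_{k+1})^2\right] \\
&= \frac{k-1}{k}S_k^2 + \frac{1}{k}(X_{k+1} - \bar{X}_k)^2 + \frac{1}{k}\left[2(X_{k+1} - \bar{X}_k)(\bar{X}_k - \bar{X}_{k+1}) + (k+1)(\bar{X}_k - \bar{X}_{k+1})^2\right] \\
&= \frac{k-1}{k}S_k^2 + \frac{1}{k}(X_{k+1} - \bar{X}_k)^2 - \frac{2}{k}(X_{k+1} - \bar{X}_k)\frac{(X_{k+1} - \bar{X}_k)}{k+1} + \frac{1}{k}\frac{(X_{k+1} - \bar{X}_k)^2}{k+1} \\
&= \frac{k-1}{k}S_k^2 + \frac{1}{k+1}(X_{k+1} - \bar{X}_k)^2,
\end{aligned}$$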
where we have used (8.1.8) in the second last equality. Dividing throughout by $\sigma^2$ and multiplying
by k we have
$$\frac{k}{\sigma^2}S_{k+1}^2 = \frac{k-1}{\sigma^2}S_k^2 + \frac{k}{\sigma^2(k+1)}(X_{k+1} - \bar{X}_k)^2. \tag{8.1.9}$$
Part (a) follows again from Theorem 6.3.13. To prove (b), it is enough to show that
$$\sqrt{\frac{k}{(k+1)\sigma^2}}\,(X_{k+1} - \bar{X}_k) \sim \text{Normal}(0, 1) \quad\text{and is independent of}\quad \frac{(k-1)}{\sigma^2}S_k^2.$$
This is so because $\frac{k}{\sigma^2(k+1)}(X_{k+1} - \bar{X}_k)^2$ then has the $\chi^2_1$ distribution by Example 8.1.5, and
is independent of $\frac{(k-1)}{\sigma^2}S_k^2$ by Theorem 8.1.2; by the induction hypothesis $\frac{(k-1)}{\sigma^2}S_k^2$ has the $\chi^2_{k-1}$
distribution, so using (8.1.9) along with Example 5.5.6 will imply that $\frac{k}{\sigma^2}S_{k+1}^2$ has the $\chi^2_k$ distribution.
It is a routine calculation using Theorem 6.3.13 to verify the above distribution by noting that
$$\sqrt{\frac{k}{(k+1)\sigma^2}}\,(X_{k+1} - \bar{X}_k) = \left(\sqrt{\frac{k}{(k+1)\sigma^2}}\right)X_{k+1} - \sum_{i=1}^{k}\left(\frac{1}{k}\sqrt{\frac{k}{(k+1)\sigma^2}}\right)X_i.$$
By the induction hypothesis, $\bar{X}_k$ and $\frac{k-1}{\sigma^2}S_k^2$ are independent. As $X_1, \ldots, X_k, X_{k+1}$ are mutually
independent, Theorem 8.1.2 implies that $X_{k+1}$ is independent of $\bar{X}_k$ and $\frac{k-1}{\sigma^2}S_k^2$. Therefore,
$$\bar{X}_k,\ \frac{k-1}{\sigma^2}S_k^2,\ X_{k+1} \text{ are mutually independent random variables.} \tag{8.1.10}$$
(ii) $\bar{X}_{k+1}$ is a function of $X_{k+1}$ and $\bar{X}_k$. So (8.1.10) and Theorem 8.1.2 will then imply $\bar{X}_{k+1}$ is
independent of $\frac{(k-1)}{\sigma^2}S_k^2$ and also $\frac{k}{\sigma^2(k+1)}(X_{k+1} - \bar{X}_k)^2$ is independent of $\frac{(k-1)}{\sigma^2}S_k^2$;
(iii) Using (i) and (ii) we can conclude that $\bar{X}_{k+1}$, $\frac{(k-1)}{\sigma^2}S_k^2$, and $\frac{k}{\sigma^2(k+1)}(X_{k+1} - \bar{X}_k)^2$ are
mutually independent; and
(iv) finally $S_{k+1}^2$ is a function of $\frac{(k-1)}{\sigma^2}S_k^2$ and $\frac{k}{\sigma^2(k+1)}(X_{k+1} - \bar{X}_k)^2$ by (8.1.9). Then (iii) and
Theorem 8.1.2 will imply that $S_{k+1}^2$ and $\bar{X}_{k+1}$ are independent.
Let α, β ∈ R. We have
$$\alpha\bar{X}_{k+1} + \beta(X_{k+1} - \bar{X}_k) = \sum_{i=1}^{k}\left(\frac{\alpha}{k+1} - \frac{\beta}{k}\right)X_i + \left(\frac{\alpha}{k+1} + \beta\right)X_{k+1}.$$
Theorem 6.3.13 will imply that $\alpha\bar{X}_{k+1} + \beta(X_{k+1} - \bar{X}_k)$ is a normally distributed random variable
for any α, β ∈ R. So by Definition 6.4.1 $(\bar{X}_{k+1}, X_{k+1} - \bar{X}_k)$ is a bivariate normal random variable.
Further, from Theorem 6.2.2 (f) and (g), we have
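$$\begin{aligned}
\text{Cov}[\bar{X}_{k+1}, X_{k+1} - \bar{X}_k] &= \text{Cov}\left[\frac{k\bar{X}_k + X_{k+1}}{k+1},\ X_{k+1} - \bar{X}_k\right] \\
&= \frac{1}{k+1}\text{Var}[X_{k+1}] + \frac{k-1}{k+1}\text{Cov}[\bar{X}_k, X_{k+1}] - \frac{k}{k+1}\text{Var}[\bar{X}_k] \\
&= \frac{\sigma^2}{k+1} + 0 - \frac{k}{k+1}\cdot\frac{\sigma^2}{k} = 0,
\end{aligned}$$
where we have used (8.1.10) in the last line. From Theorem 6.4.3 we conclude that $\bar{X}_{k+1}$ and $X_{k+1} - \bar{X}_k$
are independent. ■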
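The following important Corollary connects the sampling distributions of $\bar{X}$ and $S^2$ to the t distribution,
and will be important in the context of confidence intervals, which we discuss in Chapter 9.
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S}$$
has the $t_{n-1}$ distribution.
$$\frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim \text{Normal}(0, 1) \quad\text{and}\quad \frac{(n-1)}{\sigma^2}S^2 \sim \chi^2_{n-1}.$$
Noting that
$$\frac{\sqrt{n}(\bar{X} - \mu)}{S} = \frac{\dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}}{\sqrt{\dfrac{1}{n-1}\cdot\dfrac{(n-1)S^2}{\sigma^2}}},$$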
exercises
Ex. 8.1.1. Let n ≥ 1. F be the joint distribution function of real valued discrete random variables
X1 , X2 , . . . , Xn as in (8.1.1).
Ex. 8.1.4. Let D be a set in $\mathbb{R}^3$ with a well defined volume. $(X_1, X_2, X_3)$ are said to be uniform on the
set D if they have a joint density given by
$$f(x_1, x_2, x_3) = \begin{cases} \dfrac{1}{\text{Volume}(D)} & \text{if } x \in D \\ 0 & \text{otherwise.} \end{cases}$$
Ex. 8.1.5. Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables having a common distribution function
F : R → [0, 1] and probability density function f : R → R. Let $X_{(1)} < X_{(2)} < \ldots < X_{(n)}$ be
the corresponding order statistics. Show that for 1 ≤ i < j ≤ n, $(X_{(i)}, X_{(j)})$ has a joint density
function given by
$$f_{X_{(i)}, X_{(j)}}(x, y) = \frac{n!}{(i-1)!\,(j-1-i)!\,(n-j)!}\, f(x) f(y) [F(x)]^{i-1} [F(y) - F(x)]^{j-1-i} [1 - F(y)]^{n-j},$$
(a) Find the conditional distribution of X(n) | X(1) = x for some 0 < x < 1.
Ex. 8.1.11. Suppose X is a symmetric continuous random variable. Let Y be a continuous random
variable such that P(Y > 0) = 1. Show that $\frac{X}{Y}$ is symmetric.
$$\frac{1}{1 + x^2}\cdot\frac{1}{1 + (z - x)^2} = \frac{ax + b}{1 + x^2} + \frac{cx + d}{1 + (z - x)^2},$$
for all x ∈ R.
Ex. 8.1.14. Suppose U, V are independent random variables with $\chi^2_m$ and $\chi^2_n$ distributions respectively. Then
show that $Z = \frac{U}{U + V}$ is distributed as Beta$(\frac{m}{2}, \frac{n}{2})$.
and showed in Theorem 7.1.4 that E [X ] = µ. We also discussed that X could be considered as an
estimate for µ. The following result makes this precise, and is referred to as the Weak law of large
numbers. To emphasise the dependence of the sample mean and its behaviour on n, we will denote
X by X n .
So we have shown that the random variable X n has finite expectation and variance. By Chebychev’s
inequality (apply Theorem 6.1.13 (a) with k = σϵ ), we have
$$P(|\bar{X}_n - \mu| > \epsilon) \le \frac{\sigma^2}{n\epsilon^2}.$$
Therefore as $0 \le P(|\bar{X}_n - \mu| > \epsilon)$ for all n ≥ 1 and $\frac{\sigma^2}{n\epsilon^2} \to 0$ as n → ∞, by standard results in real
u <- runif(10000)
mean(log(u / (1-u)))
[1] 0.006766486
Of course, however good an approximation, this estimate is still random, so we should replicate it
several times to get an idea of its general behaviour.
replicate(10, {
x <- runif(10000)
mean(log(x / (1 - x)))
})
These ten replications suggest that the approximation is usually correct only up to the first decimal
place, even though the value of n = 10000 might normally be considered large. Not surprisingly,
the approximation gets worse for n = 100.
replicate(10, {
x <- runif(100)
mean(log(x / (1 - x)))
})
To get a sense of how the approximation improves with n, it is common to plot the cumulative or
partial means as a function of n. For example, Figure 8.1 is created using
N <- 10000
i <- seq(1, N) # to be used as denominator
x <- runif(N)
m <- cumsum(log(x / (1 - x))) / i
This plot suggests that the estimate gets close to zero for fairly small n, and after that improvement
is not substantial. This plot, however, only tells us about the behaviour of one particular sequence
of random variables, whose partial means are guaranteed to converge to the true mean as n → ∞
according to the Strong Law, which we have stated but not proved. The Weak Law, on the other
hand, states a result about the distribution of the sample mean. To assess whether it holds, we look
at independent replications of the experiment and plot the resulting paths taken by the partial
means together. We omit the code used to do this, but show the result of one such simulation
experiment in Figure 8.2. One can observe that there is a reduction in the variance of the partial
means as n increases, which was the essential requirement in the proof of the Weak Law. ■
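The replicated experiment can be sketched along the following lines. This is only one possible way to produce a plot in the spirit of Figure 8.2, not necessarily the code used for it; the seed, the number of replications R, and the plotting choices are assumptions made for illustration.

```r
set.seed(1)                  # assumption: fixed seed, only for reproducibility
N <- 10000                   # length of each path of partial means
R <- 50                      # number of independent replications

## Each column holds the partial means of one replication (an N x R matrix).
partial_means <- replicate(R, {
    x <- runif(N)
    cumsum(log(x / (1 - x))) / seq_len(N)
})

## Plot all paths together; the y-range is an assumed choice.
matplot(partial_means, type = "l", lty = 1, col = "grey40",
        ylim = c(-0.4, 0.4), xlab = "Index", ylab = "Partial Mean")
abline(h = 0)

## Spread of the partial means across replications at a few values of n.
spread <- apply(partial_means, 1, sd)
spread[c(100, 1000, 10000)]
```

The last line gives a numerical counterpart to the visual impression: the standard deviation of the partial means across replications shrinks as n grows.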
Example 8.2.4. We modify the previous example as follows. Suppose U, V ∼ Uniform(0, 1) are
independent, and X = max(U, V). What is $E\left(\log\frac{X}{1-X}\right)$?
Figure 8.1: Cumulative or partial means computed from 10000 random samples from the
population $\log\frac{X}{1-X}$, where X follows Uniform(0, 1). (x-axis: Index; y-axis: Partial Mean.)
Figure 8.2: Results of the same experiment that is shown in Figure 8.1, replicated 50 times. For
each replication, cumulative or partial means computed from 10000 random samples
are shown. The underlying population is $\log\frac{X}{1-X}$, where X follows Uniform(0, 1).
(x-axis: Index; y-axis: Partial Mean.)
The answer is not as obvious in this case. An approximate answer is easy to obtain by invoking
the Weak Law of large numbers.
replicate(10, {
u <- runif(10000)
v <- runif(10000)
x <- pmax(u, v)
mean(log(x / (1 - x)))
})
These results suggest that the expectation is 1, a fact that can be verified by explicit computation
(See Exercise 8.2.1). ■
Example 8.2.5. Suppose that U and V are independent Uniform(0, 1), and interpret them as
coordinates of a point in $\mathbb{R}^2$. Suppose we want to calculate the expected norm of (U, V). In other
words, if $Z = \sqrt{U^2 + V^2}$, we want to calculate E[Z].
As before, we can estimate the expectation by simulating the experiment a large number of
times.
replicate(10, {
u <- runif(10000)
v <- runif(10000)
z <- sqrt(u^2 + v^2)
mean(z)
})
Theorem 8.2.1 states that for any ϵ > 0, the probability P (|X n − µ| > ϵ) goes to zero as n → ∞.
This mode of convergence of the sample mean X n to the true mean µ is called “convergence in
probability” . We define it precisely below.
The notation
$$X_n \xrightarrow{p} X$$
Note that in the above definition the limit is allowed to be a non-trivial random variable X, although
in most examples we will consider, X will be a constant.
Example 8.2.7. Let X1 , X2 , . . . be i.i.d. random variables from the Uniform(0, 1) distribution.
We already know by the law of large numbers that $\bar{X}$ converges to $E(X_1) = \frac{1}{2}$ in probability. Often
we are interested in other functionals (i.e. f (X1 , X2 , . . . , Xn ) for some suitable f and n ≥ 1) of
the sample and their convergence properties. As an example, consider the n-th order statistic
X(n) = max{X1 , X2 , . . . , Xn }. Intuitively, as n increases, it is more and more likely that X(n) will
get closer to its maximum possible value 1. To see this formally, first note that for ϵ > 1,
$$P\left(|X_{(n)} - 1| \ge \epsilon\right) = P\left(X_{(n)} \le 1 - \epsilon\right) + P\left(X_{(n)} \ge 1 + \epsilon\right) = 0.$$
An important application of the Weak Law of large numbers follows by noting that the sample
proportion discussed in Section 7.1.2 is the sample mean of Bernoulli random variables.
Example 8.2.8. Suppose we are interested in an event A and want to estimate p = P(X ∈ A).
We consider a sample $X_1, X_2, \ldots, X_n$ which is i.i.d. X. We define a sequence of random variables
$\{Y_n\}_{n \ge 1}$ by
$$Y_n = \begin{cases} 1 & \text{if } X_n \in A \\ 0 & \text{if } X_n \notin A. \end{cases}$$
Clearly Yn are independent (as the Xn are), and further they are identically distributed with
P (Yn = 1) = P (Xn ∈ A) = p. In particular, {Yn } is an i.i.d. Bernoulli(p) sequence of random
variables. We readily observe (as done in Chapter 7) that
$$\bar{Y}_n = \frac{1}{n}\sum_{i=1}^{n} Y_i = \frac{1}{n}\#\{X_i \in A\} = \hat{p}.$$
Hence the Weak Law of large numbers (applied to the sequence Yn ) implies that the sample
proportion will converge to the true proportion p in probability. This provides legitimacy, as
discussed earlier, to the relationship between probability and relative frequency. ■
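As a small illustration of this use of the Weak Law, the following sketch estimates p = P(X ∈ A) by a sample proportion. The choices X ∼ Normal(0, 1) and A = (1, ∞) are hypothetical and made only so that the estimate can be compared with a value available from pnorm().

```r
set.seed(1)                  # assumption: fixed seed, only for reproducibility
n <- 100000
x <- rnorm(n)                # hypothetical population: X ~ Normal(0, 1)
y <- as.numeric(x > 1)       # Y_i = 1 if X_i is in A = (1, Inf), 0 otherwise
p_hat <- mean(y)             # sample proportion, the sample mean of the Y_i
p_true <- 1 - pnorm(1)       # exact probability, for comparison
c(estimate = p_hat, exact = p_true)
```

Repeating this with smaller values of n would show the estimate fluctuating more around the exact value.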
exercises
1. X ∼ Uniform(0, 1).
Ex. 8.2.3. Let (U , V ) ∼ Uniform(D) where D = {(x, y ) : x2 + y 2 = 1}. Find the distribution of
norm of Z = (U , V ) and E [Z ].
Ex. 8.2.4. Let X, X1 , X2 , . . . be i.i.d. random variables that are uniformly distributed over
the interval (0, 1). Consider the first order statistic X(1) = min{X1 , · · · , Xn }. Show that X(1)
converges to 0 in probability.
Ex. 8.2.5. Let $X_1, X_2, \ldots, X_n, \ldots$ be i.i.d. random variables with finite mean and variance. Define
$$Y_n = \frac{2}{n(n+1)}\sum_{i=1}^{n} i X_i.$$
Show that $Y_n \xrightarrow{p} E(X_1)$ as n → ∞.
Ex. 8.2.6. Let $\{X_i : i \ge 1\}$ be a sequence of i.i.d. Normal(0, 1) random variables. Let $S_n = \sum_{i=1}^{n} X_i$.
Design a suitable R-code as in Example 7.1.9 that will provide an estimate of the probability that
$S_1, \ldots, S_{100}$ all have the same sign.
Ex. 8.2.7. Suppose $X_n$ and X are random variables such that $X_n \xrightarrow{p} X$ as n → ∞. Suppose
h : R → R is a continuous function. Then show that $h(X_n) \xrightarrow{p} h(X)$ as n → ∞.
When discussing a collection of random variables it makes sense to think of them as a sequence of
objects, and as with any sequence in calculus we may ask whether the sequence converges in any
way. We have already seen “convergence in probability” in the previous section. Here we will be
interested in what is known as “convergence in distribution”. This type of convergence plays an
important role in understanding the limiting distribution of the sample mean, as we will see later,
particularly in the Central Limit Theorem, Theorem 8.4.1.
If X is the constant random variable for which P(X = 0) = 1, then X has distribution function
$$F_X(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$$
It is not true that FX (x) = F (x), but the two are equal at points where they are continuous.
Therefore the sequence X1 , X2 , . . . converges in distribution to the constant random variable 0. ■
Note that this form of convergence does not generally guarantee that probabilities associated with
X can be derived as limits of probabilities associated with Xn . For instance, in the example above
P (Xn = 0) = 0 for all n while P (X = 0) = 1. However, with a few additional assumptions a
stronger claim may be made.
Theorem 8.3.3. Let fX1 , fX2 , . . . be the respective densities of continuous random variables
X1 , X2 , . . . . Suppose they converge in distribution to a continuous random variable X with
density fX . Then for every interval A we have P (Xn ∈ A) → P (X ∈ A).
Proof. As X is a continuous random variable FX (x) is the integral of a density, and thus a
continuous function. Therefore convergence in distribution guarantees that FXn (x) converges to
FX (x) everywhere. Let A = (a, b) (and note that whether or not endpoints are included does not
matter as all random variables are taken to be continuous). Then
$$\begin{aligned}
P(X_n \in A) &= \int_a^b f_{X_n}(x)\,dx \\
&= F_{X_n}(b) - F_{X_n}(a) \\
&\to F_X(b) - F_X(a) \\
&= \int_a^b f_X(x)\,dx = P(X \in A).
\end{aligned}$$
■
In other words, if Z is exponentially distributed with mean 1, then we have shown that $P(nM_n \le x) \to P(Z \le x)$
for all x. So we have $M_n \xrightarrow{p} 0$ and $n(M_n - 0) \xrightarrow{d} Z$. ■
Establishing convergence in distribution using the definition, as done in the previous example,
is not always possible. There are three key results that we will use in the book. These provide
sufficient conditions that are intuitive and often easier to check. The first result deals with the
case of convergence in distribution for continuous random variables, and states that pointwise
convergence of densities implies convergence in distribution.
Theorem 8.3.5. (Scheffé’s Lemma) Let $f_{X_1}, f_{X_2}, \ldots$ be the respective densities of continuous
random variables $X_1, X_2, \ldots$, and let $f_X$ be the density of a continuous random variable
X. Suppose $f_{X_n}(x) \to f_X(x)$ as n → ∞ for all x ∈ R. Then, $X_n \xrightarrow{d} X$ as n → ∞.
This is a deceptively simple result. After all, one could argue that if fXn (·) converges to fX (·)
pointwise as n → ∞, then so should
$$\int_{-\infty}^{a} f_{X_n}(u)\,du \to \int_{-\infty}^{a} f_X(u)\,du$$
as n → ∞ for any a ∈ R. However, such interchanging of limits and integrals is not always
valid. The result that permits it in this particular situation, known as the “dominated convergence
theorem”, is beyond the scope of this book.
Example 8.3.6. Suppose $X_n \sim \text{Normal}\left(\frac{1}{n}, 1\right)$. Then it is intuitively clear that $X_n \xrightarrow{d} Z$, where
Z ∼ Normal(0, 1). This follows from an elementary application of Scheffé’s Theorem, as
$$f_{X_n}(x) = \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}\left(x - \frac{1}{n}\right)^2} \to \frac{1}{\sqrt{2\pi}}e^{-\frac{1}{2}x^2} = f_Z(x) \quad\text{for all } x \in \mathbb{R}.$$
A direct proof is also simple in this case. As $Y_n = X_n - \frac{1}{n}$ has the Normal(0, 1) distribution, we
have
$$F_{X_n}(x) = P(X_n \le x) = P(Y_n \le x - 1/n) = F_Z(x - 1/n) \to F_Z(x)$$
$$f_{X_n}(t) \to \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{t^2}{2}\right) \tag{8.3.1}$$
as n → ∞. Consequently by Scheffé’s Theorem $X_n \xrightarrow{d} Z$ as n → ∞ where Z ∼ Normal(0,1). ■
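The pointwise convergence in (8.3.1) can be checked numerically. The sketch below compares the density of the t distribution with the standard Normal density on a grid; the grid on [−3, 3] and the particular degrees of freedom are illustrative choices, not part of the argument above.

```r
x <- seq(-3, 3, by = 0.01)   # grid of points at which densities are compared
df <- c(1, 3, 5, 10, 50)     # degrees of freedom to try

## Largest absolute difference between the t_n density and the Normal density.
max_gap <- sapply(df, function(n) max(abs(dt(x, df = n) - dnorm(x))))
names(max_gap) <- paste0("t", df)
max_gap
```

The largest gap over the grid decreases as the degrees of freedom increase, in line with the convergence of densities used in the example.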
The second result, which works for both discrete and continuous random variables, formalizes the
intuition that if all moments of Xn exist and they converge to respective moments of X, then Xn
should converge in distribution to X. Unfortunately, a proof of this result is also beyond the scope
of this book.
[Figure: densities of the $t_1$, $t_3$, $t_5$, $t_{10}$, and $t_{50}$ distributions plotted over the range −3 to 3; y-axis: Density.]
To illustrate the use of this result, consider an alternative proof of the limiting relationship between
Binomial and Poisson random variables (See Theorem 2.2.2).
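$$\begin{aligned}
M_X(t) &= E[e^{tX}] \\
&= \sum_{j=0}^{\infty} e^{tj} P(X = j) \\
&= \sum_{j=0}^{\infty} e^{tj}\,\frac{\lambda^j e^{-\lambda}}{j!} \\
&= e^{\lambda e^t}\cdot e^{-\lambda}\cdot\sum_{j=0}^{\infty}\frac{(\lambda e^t)^j e^{-\lambda e^t}}{j!} \\
&= e^{\lambda(e^t - 1)}
\end{aligned}$$
where the series equals 1 since it is simply the sum of the probabilities of a Poisson($\lambda e^t$) random
variable.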
Since $M_{X_n}(t) \to M_X(t)$, by the M.G.F. convergence theorem (Theorem 8.3.8), $X_n$ converges in
distribution to X. That is, Binomial(n, p) random variables converge in distribution to a Poisson(λ)
distribution when $p = \frac{\lambda}{n}$ and n → ∞. ■
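A quick numerical check of this convergence is sketched below, both for the moment generating functions and for the probability mass functions. The values λ = 4 and t = 0.5, the range of k, and the helper names mgf_binom and mgf_pois are illustrative assumptions.

```r
lambda <- 4                  # assumed value of lambda for illustration
n_values <- c(10, 100, 1000, 10000)

## M.G.F. of Binomial(n, lambda / n) and of Poisson(lambda).
mgf_binom <- function(t, n) (1 - lambda / n + (lambda / n) * exp(t))^n
mgf_pois <- function(t) exp(lambda * (exp(t) - 1))

t <- 0.5
mgf_gap <- sapply(n_values, function(n) abs(mgf_binom(t, n) - mgf_pois(t)))

## Largest difference between the two probability mass functions over k = 0, ..., 20.
k <- 0:20
pmf_gap <- sapply(n_values, function(n) {
    max(abs(dbinom(k, size = n, prob = lambda / n) - dpois(k, lambda)))
})
rbind(n = n_values, mgf_gap = mgf_gap, pmf_gap = pmf_gap)
```

Both gaps shrink as n increases with p = λ/n, which is what the convergence in distribution suggests.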
The last result cannot be used to establish convergence in distribution directly. However, if
we already know that a sequence $X_n \xrightarrow{d} X$, then this result can often be used to establish the
convergence in distribution of small “perturbations” of $X_n$, as long as the perturbations converge
in probability.
Proof. We prove only (a); (b) and (c) can be proved similarly. Let ϵ > 0 be given. Write
Fn = FXn +Yn . Choose t such that t, t − c + ϵ, t − c − ϵ are all continuity points of FX . This is
possible as there can be at most countably many points of discontinuity of FX . Now,
and
Our primary application of Slutsky’s Theorem will come in Section 8.5. However, to illustrate its
usefulness, we show below that the result in Example 8.3.7 follows immediately from it.
Example 8.3.11. Recall the $t_n$ distribution from Example 8.1.9. The convergence of the $t_n$
distribution to the Normal(0, 1) distribution, proved in Example 8.3.7, would follow by Lemma 8.3.10
(c) if we could show that the sequence $\sqrt{Y_n/n} \xrightarrow{p} 1$, where $Y_n$ is the $\chi^2_n$ random variable in the
denominator in the definition of the $t_n$ distribution. This is shown in two steps. Either directly
applying Chebychev’s inequality (Theorem 6.1.13) on $\frac{Y_n}{n}$, or by an application of the Weak Law of
Large Numbers (Theorem 8.2.1) we can show that
$$\frac{Y_n}{n} \xrightarrow{p} 1 \text{ as } n \to \infty. \tag{8.3.2}$$
Indeed, as $Y_n = \sum_{i=1}^{n} X_i$ with $X_i$ i.i.d. $\chi^2_1$ random variables, and $E[X_1] = 1$ with $Var[X_1] = 2 < \infty$, it is immediate by
Theorem 8.2.1 that (8.3.2) holds. It then follows from Exercise 8.2.7 that $\sqrt{Y_n/n} \xrightarrow{p} 1$. ■
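The convergence in probability of $\sqrt{Y_n/n}$ to 1 can also be explored by simulation. The sketch below estimates $P(|\sqrt{Y_n/n} - 1| > \epsilon)$ for a few values of n; the tolerance ϵ = 0.05, the number of replications, and the seed are assumptions chosen for illustration.

```r
set.seed(1)                  # assumption: fixed seed, only for reproducibility
eps <- 0.05                  # assumed tolerance
n_values <- c(10, 100, 1000, 10000)

## Proportion of simulated values of sqrt(Y_n / n) further than eps from 1.
prob_far <- sapply(n_values, function(n) {
    y <- rchisq(5000, df = n)
    mean(abs(sqrt(y / n) - 1) > eps)
})
names(prob_far) <- n_values
prob_far
```

The estimated probabilities fall towards zero as n grows, as convergence in probability requires.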
exercises
for all continuity points of FX : R → [0, 1] with FXn , FX being the distribution functions of Xn
and X respectively.
Ex. 8.3.6. Let X1 , X2 , . . . be i.i.d. Uniform(0, 1) random variables. Generalize the definition
of Mn = min(X1 , X2 , . . . , Xn ) in Example 8.3.4 as follows: For fixed k ≥ 1, define Mn,k =
X(k +1) − X(k ) .
(a) Show that nMn,k = n(X(k+1) − X(k) ) also converges to Exponential(1) random variable in
distribution
(b) Show that for any fixed k ≥ 1, nX(k) converges to the Gamma(k, 1) distribution.
For n ≥ 1, let $X_1, X_2, \ldots, X_n$ be an i.i.d. random sample from a distribution X which has mean µ
and variance $\sigma^2$, but is otherwise unknown. Consider the sample mean
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i.$$
$$Y_n = \frac{(\bar{X} - \mu)}{\sigma/\sqrt{n}} = \sqrt{n}\,\frac{(\bar{X} - \mu)}{\sigma}.$$
As done earlier, we shall denote X by X n in the statement and proof of the Theorem below to
emphasise its dependence on n.
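Proof. Let $Y_n = \sqrt{n}\,\frac{(\bar{X}_n - \mu)}{\sigma}$. We will verify that
$$\lim_{n\to\infty} M_{Y_n}(t) = e^{\frac{t^2}{2}}.$$
Now, using the definition of the moment generating function and some elementary algebra we have
$$\begin{aligned}
M_{Y_n}(t) = E[\exp(tY_n)] &= E\left[\exp\left(t\sqrt{n}\,\frac{(\bar{X}_n - \mu)}{\sigma}\right)\right] \\
&= E\left[\exp\left(\frac{t\sqrt{n}}{\sigma}\left(\frac{1}{n}\sum_{i=1}^{n} X_i - \mu\right)\right)\right] = E\left[\exp\left(\sum_{i=1}^{n}\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] \\
&= E\left[\prod_{i=1}^{n}\exp\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right].
\end{aligned} \tag{8.4.2}$$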
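are also independent. From Exercise 7.1.3 and 7.1.4, they also have the same distribution. So from
the calculation in (8.4.2) and using Exercise 6.3.4 inductively we have
$$\begin{aligned}
M_{Y_n}(t) &= E\left[\prod_{i=1}^{n}\exp\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] = \prod_{i=1}^{n} E\left[\exp\left(\frac{t}{\sigma\sqrt{n}}(X_i - \mu)\right)\right] \quad\text{(Using Theorem 6.3.9(a))} \\
&= \left(E\left[\exp\left(\frac{t}{\sqrt{n}}\,\frac{(X - \mu)}{\sigma}\right)\right]\right)^n.
\end{aligned} \tag{8.4.3}$$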
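where X is the common distribution of $X_1, X_2, \ldots$. In other words, $M_{Y_n}(t) = \left(M_U\left(\frac{t}{\sqrt{n}}\right)\right)^n$, where
$U = \frac{X - \mu}{\sigma}$. As $E[U] = 0$, $E[U^2] = 1$ we have that $M_U'(0) = 0$ and $M_U''(0) = 1$. From Exercise
$$M_U(t) = 1 + \frac{t^2}{2} + g(t), \tag{8.4.4}$$
where g satisfies $\lim_{s\to 0}\frac{g(s)}{s^2} = 0$. Thus, we have
$$M_{Y_n}(t) = \left(M_U\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 + \frac{t^2}{2n} + g\left(\frac{t}{\sqrt{n}}\right)\right)^n = \left(1 + \frac{1}{n}\left(\frac{t^2}{2} + n\,g\left(\frac{t}{\sqrt{n}}\right)\right)\right)^n.$$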
Using the fact that for any fixed t, $\frac{t^2}{2} + n\,g\left(\frac{t}{\sqrt{n}}\right) \to \frac{t^2}{2}$ as n → ∞ and Exercise 8.4.4 it follows that,
$$\lim_{n\to\infty} M_{Y_n}(t) = e^{\frac{t^2}{2}}.$$
Theorem 8.3.8 then implies the result as the limit $e^{\frac{t^2}{2}}$ is the moment generating function of the
standard Normal distribution. ■
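The limit in the proof can be observed numerically for a distribution whose moment generating function is available in closed form. The sketch below assumes, purely for illustration, that X ∼ Exponential(1), so that µ = σ = 1, U = X − 1, and $M_U(s) = e^{-s}/(1 - s)$ for s < 1; the function names are hypothetical.

```r
## M.G.F. of U = X - 1 when X ~ Exponential(1); valid only for s < 1.
mgf_U <- function(s) exp(-s) / (1 - s)

## M.G.F. of Y_n, using M_{Y_n}(t) = (M_U(t / sqrt(n)))^n.
mgf_Yn <- function(t, n) mgf_U(t / sqrt(n))^n

t <- 1
n_values <- c(4, 100, 10000, 1e6)   # all satisfy t / sqrt(n) < 1
rbind(n = n_values,
      mgf_Yn = sapply(n_values, function(n) mgf_Yn(t, n)),
      limit = exp(t^2 / 2))
```

The values of $M_{Y_n}(t)$ approach $e^{t^2/2}$ as n grows, although for small n the gap is clearly visible.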
Remark 8.4.2. The existence of moment generating function is not essential for the Central Limit
Theorem, and (8.4.1) holds as long as X1 , X2 , . . . are i.i.d. random variables with finite mean µ
and finite variance σ 2 . However, the proof of this more general statement is more complicated.
Remark 8.4.3. An equivalent formulation of the Central Limit Theorem is often useful. By
definition of $\bar{X}_n$ and elementary algebra we see that $Y_n = \frac{S_n - n\mu}{\sqrt{n}\sigma}$, where $S_n = \sum_{i=1}^{n} X_i$. It follows
that
$$\frac{S_n - n\mu}{\sqrt{n}\sigma} \xrightarrow{d} \text{Normal}(0, 1). \tag{8.4.5}$$
Remark 8.4.4. The Central Limit Theorem is a remarkable result. But it perhaps bears emphasis
that the remarkable part of the result is not the specific statistic, the sample mean, but rather the
Normality of the limiting distribution, which arises in many other situations as well. Although
most such results are beyond the scope of this book, we show later in this chapter that the sample
median, when suitably standardized, also converges to the standard Normal distribution under
fairly general conditions.
The Central Limit Theorem for the sample mean can be viewed as a refinement of the Weak Law
of large numbers. The weak law tells us that the sample mean $\bar{X}$ converges to the expectation µ as
n → ∞. However, for any finite n, X is still a non-constant random variable, whose distribution
we may be interested in. This distribution can be quite complicated in general. The Central Limit
Theorem is remarkable because it says that regardless of the underlying distribution, probabilities
concerning the sample mean X can be well approximated by standard Normal probabilities for
large n.
Before looking at uses of such approximations, let us consider the factors that might affect the
quality of the approximation. The Central Limit Theorem does not say anything about how good
the approximation will be for any given n, but we can guess that it will be better for larger n, and
also depend on the distribution giving rise to the data.
Example 8.4.5. As we have seen earlier, an important application of the Weak Law is to estimate
probabilities of events by sample proportion. Here the underlying distribution is Bernoulli(p),
with the probability p estimated by the sample proportion $\hat{p}_n = S_n/n$. Suppose $X_1, \ldots, X_n$ are
independent Bernoulli(p) random variables. Then $S_n \sim$ Binomial(n, p) and
$$\frac{\sqrt{n}(\hat{p}_n - p)}{\sqrt{p(1 - p)}} = \frac{S_n - np}{\sqrt{np(1 - p)}} \xrightarrow{d} Z,$$
where Z is standard Normal. Let us see how the quality of this approximation changes with the
choice of n and p.
Instead of simulating Xi -s individually, we can simulate Sn directly using the rbinom() function.
For a specific choice of p and n, we could simulate standardized Sn values as follows.
p <- 0.5
n <- 25
s <- rbinom(1000, size = n, prob = p)
z <- (s - n * p) / sqrt(n * p * (1-p))
mean(z)
[1] 0.0052
sd(z)
[1] 0.9830459
The mean and standard deviation of the standardized sample proportion, computed over these 1000 replications,
match what we expect. To see how similar their overall distribution is to the standard Normal
distribution, the top panel in Figure 8.4 shows empirical frequency distribution plots obtained from
1000 replications for p = 0.5 and n = 10, 25, 50, and 100. Similar plots for p = 0.25 and p = 0.05
are shown in the middle and bottom panels. The Normal approximations obtained using the Central
Limit Theorem are added for comparison. As the sample spaces differ substantially depending on n,
the quantity plotted on the y-axis is not the relative frequency but rather a scaled version, similar
to the scaling done in histograms, that makes the scaled quantities comparable with each other and
the Normal density. From these plots, we can conclude that the distribution of the Binomial proportion
is well approximated by the Normal when p is close to $\frac{1}{2}$, although for smaller sample sizes the number
of ties can become an issue as well. Values of p away from $\frac{1}{2}$ can generate skewed (asymmetric)
distributions for which the Normal is not a good approximation. A general convention often used is
to consider the approximation valid if both np and n(1 − p) are at least 5.
As we saw in Chapter 7, Q-Q plots are often more useful for assessing departure from Normality.
Figure 8.5 shows Normal Q-Q plots that are analogous to the empirical frequency distribution plots
in Figure 8.4. Each plot represents 1000 replications, for p = 0.5, 0.25, 0.05 and n = 10, 25, 50, 100.
These largely confirm what we already saw from the empirical frequency distribution plots, and
suggest, in particular, that the Normal approximation may be unreliable when p is close to 0 or 1. ■
Example 8.4.6. The Central Limit Theorem applies not just to sample proportions but to general
discrete and continuous distributions if they have finite expectation and variance. For continuous
distributions, ties happen with probability 0, so empirical frequency distribution plots are not useful.
We can use histograms as an alternative, but Q-Q plots are more useful when the primary goal is
to compare with a Normal distribution.
In Figure 8.6, we show Q-Q plots similar to those in Figure 8.5, but instead of sample proportion,
we consider means of random samples from three continuous distributions, namely Uniform(0, 1),
Exp(1), and Cauchy, with sample sizes n = 5, 20, 50, 100. These plots suggest that even with a
shape very different from Normal, the distribution of the Uniform sample mean is well approximated
by a Normal distribution even for small n. For the heavily asymmetric Exponential distribution,
Figure 8.4: Empirical frequency distribution plots of standardized sample proportions when
true probability is (top) p = 0.5, (middle) p = 0.25, and (bottom) p = 0.05. In
each case, the standard Normal density has been added for comparison. To account
for the different sample spaces, the frequencies plotted on the y-axis have been
scaled to make them comparable with each other and the Normal density.
(Panels: n = 10, 25, 50, 100; y-axis: Scaled Frequency.)
Figure 8.5: Q-Q plot of standardized sample proportions when true probability is (top) p = 0.5,
(middle) p = 0.25, and (bottom) p = 0.05.
(Panels: n = 10, 25, 50, 100; y-axis: Sample Proportion (standardized).)
Figure 8.6: Q-Q plot of standardized sample mean of random sample from (top) Uniform(0, 1),
(middle) Exponential(1), and (bottom) Cauchy.
(Panels: n = 5, 20, 50, 100; y-axis: Sample Mean (standardized).)
this convergence requires a larger sample size. For the Cauchy distribution, which does not have
finite mean, the Central Limit Theorem does not hold at all. ■
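Plots in the spirit of Figure 8.6 can be produced along the following lines. This is a minimal sketch rather than the code behind the figure: the helper std_means, the seed, the choice n = 20, and the layout are all assumptions, and the Cauchy case is left out because it has no mean or variance with which to standardize.

```r
set.seed(1)                  # assumption: fixed seed, only for reproducibility

## Hypothetical helper: standardized means of `reps` samples of size n,
## drawn using the random generator `rdist` with known mean mu and SD sigma.
std_means <- function(rdist, n, mu, sigma, reps = 1000) {
    xbar <- replicate(reps, mean(rdist(n)))
    sqrt(n) * (xbar - mu) / sigma
}

z_unif <- std_means(runif, n = 20, mu = 1 / 2, sigma = sqrt(1 / 12))
z_exp <- std_means(rexp, n = 20, mu = 1, sigma = 1)

## Normal Q-Q plots with the line y = x added for reference.
op <- par(mfrow = c(1, 2))
qqnorm(z_unif, main = "Uniform(0, 1), n = 20")
abline(0, 1)
qqnorm(z_exp, main = "Exponential(1), n = 20")
abline(0, 1)
par(op)
```

Changing n in the calls to std_means allows the effect of sample size to be compared across the two distributions.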
Before moving on, let us summarize the main conclusions from the last two examples. Although the
sample mean converges to the population mean, the convergence is not necessarily immediate. Thus,
although we can expect that for large n the sample proportion or sample mean will be “close” to
the population proportion or mean, we cannot expect it to be exactly the same. The Central Limit
Theorem assures us that under fairly mild assumptions, the difference will behave like a Normal
random variable. As we see in the next chapter, this knowledge allows us to make useful statements
about the population proportion or mean, when it is unknown, based on what we observe.
A typical application of the Central Limit Theorem is to find an approximate value of the probability of
events related to $S_n$ or $\bar{X}$. For instance, suppose we were interested in calculating for any a, b ∈ R,
$P(a < S_n \le b)$ for large n. We would proceed in the following way. We know from (8.4.5) that
$$P\left(\frac{S_n - n\mu}{\sqrt{n}\sigma} \le x\right) \to P(Z \le x) \tag{8.4.6}$$
as n → ∞ for all x ∈ R.
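$$\begin{aligned}
P(a < S_n \le b) &= P\left(\frac{a - n\mu}{\sqrt{n}\sigma} < \frac{S_n - n\mu}{\sqrt{n}\sigma} \le \frac{b - n\mu}{\sqrt{n}\sigma}\right) \\
&= P\left(\frac{S_n - n\mu}{\sqrt{n}\sigma} \le \frac{b - n\mu}{\sqrt{n}\sigma}\right) - P\left(\frac{S_n - n\mu}{\sqrt{n}\sigma} \le \frac{a - n\mu}{\sqrt{n}\sigma}\right) \\
&\approx P\left(Z \le \frac{b - n\mu}{\sqrt{n}\sigma}\right) - P\left(Z \le \frac{a - n\mu}{\sqrt{n}\sigma}\right) \quad\text{(from (8.4.6) for large enough n)} \\
&= P\left(\frac{a - n\mu}{\sqrt{n}\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\sigma}\right),
\end{aligned}$$
where in the second last line we have used the notation ≈ to indicate that the right hand side is an
approximation. Therefore we would conclude that for large n,
$$P(a < S_n \le b) \approx P\left(\frac{a - n\mu}{\sqrt{n}\sigma} < Z \le \frac{b - n\mu}{\sqrt{n}\sigma}\right). \tag{8.4.7}$$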
We would then use the R function pnorm() or Normal Tables (See Table B.1) to compute the right
hand side. A similar computation would also yield
$$P\left(a < \bar{X} \le b\right) \approx P\left(\frac{\sqrt{n}(a - \mu)}{\sigma} < Z \le \frac{\sqrt{n}(b - \mu)}{\sigma}\right). \tag{8.4.8}$$
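The approximations (8.4.7) and (8.4.8) translate directly into R using pnorm(). The sketch below wraps them in two small functions; the function names and the fair-die illustration are hypothetical and not taken from the text.

```r
## Normal approximation (8.4.7) to P(a < S_n <= b).
clt_prob_sum <- function(a, b, n, mu, sigma) {
    upper <- (b - n * mu) / (sqrt(n) * sigma)
    lower <- (a - n * mu) / (sqrt(n) * sigma)
    pnorm(upper) - pnorm(lower)
}

## Normal approximation (8.4.8) to P(a < sample mean <= b).
clt_prob_mean <- function(a, b, n, mu, sigma) {
    pnorm(sqrt(n) * (b - mu) / sigma) - pnorm(sqrt(n) * (a - mu) / sigma)
}

## Hypothetical illustration: total and mean of n = 50 rolls of a fair die,
## for which mu = 3.5 and sigma^2 = 35 / 12.
mu <- 3.5
sigma <- sqrt(35 / 12)
p_sum <- clt_prob_sum(160, 190, n = 50, mu = mu, sigma = sigma)
p_mean <- clt_prob_mean(160 / 50, 190 / 50, n = 50, mu = mu, sigma = sigma)
c(p_sum, p_mean)
```

The two calls describe the same event, once in terms of the sum and once in terms of the mean, so they should return the same value.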
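Example 8.4.7. Let Y be a random variable distributed as Gamma(100, 4). Suppose we were
interested in finding P(20 < Y ≤ 30). Suppose $X_1, X_2, \ldots, X_{100}$ are independent Exponential(4)
random variables, then Y and $S_{100} = \sum_{i=1}^{100} X_i$ have the same distribution. Therefore, applying the
Central Limit Theorem with $\mu = E[X_1] = \frac{1}{4}$, $\sigma = SD[X_1] = \frac{1}{4}$, so that $n\mu = 25$ and $\sqrt{n}\sigma = 2.5$, we have
$$P(20 < Y \le 30) \approx P\left(\frac{20 - 25}{2.5} < Z \le \frac{30 - 25}{2.5}\right) = P(-2 < Z \le 2).$$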
Looking up Table B.1, we see that this value comes out to be approximately 2 × 0.9772 − 1 = 0.9544.
A more precise answer is given by R as
2 * pnorm(2) - 1
[1] 0.9544997
Using R, we can also compare this with the exact probability that we are approximating.
[1] 0.9550279
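One way to obtain an exact value of this kind in R is through pgamma(). The sketch below assumes that the second parameter of Gamma(100, 4) is a rate, which is consistent with Y being a sum of 100 Exponential(4) random variables each with mean 1/4; it is offered as an illustration rather than as the exact command used above.

```r
## Exact P(20 < Y <= 30) for Y ~ Gamma(shape = 100, rate = 4).
exact <- pgamma(30, shape = 100, rate = 4) - pgamma(20, shape = 100, rate = 4)

## Normal approximation from the Central Limit Theorem, for comparison.
approx <- 2 * pnorm(2) - 1

c(exact = exact, approx = approx)
```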
Suppose $X_1, X_2, X_3, \ldots$ are all integer valued random variables. Then $S_n = \sum_{i=1}^{n} X_i$ is also an integer
valued random variable. Now, for any integer k, $P(S_n \le k) = P(S_n \le k + h)$ for all 0 < h < 1.
However it is easy to see that two distinct values of h will lead to two different answers if we use
the Normal approximation provided by the Central Limit Theorem. It is customary to use $h = \frac{1}{2}$
when computing such probabilities using the Normal approximation, as
$$P(S_n \le a) = P(S_n \le a + 0.5) \approx P\left(Z \le \frac{a + 0.5 - n\mu}{\sqrt{n}\sigma}\right) \tag{8.4.9}$$
Example 8.4.8. Two types of coin are produced at a factory: a fair coin and a biased one that
comes up heads 55% of the time. Priya is the quality control scientist at the factory. She wants to
design an experiment that will test whether a coin is fair or biased. In order to ascertain which
type of coin she has, she prescribes the following experiment as a test: Toss the given coin 1000
times, if the coin comes up heads 525 or more times conclude that it is a biased coin. Otherwise
conclude that it is fair. Factory manager Ayesha is interested in the following question: What is
the probability that Priya’s test shall reach a false conclusion for a fair coin ?
Let $S_{1000}$ be the number of heads in 1000 tosses of a coin. As discussed in earlier chapters, we
know that $S_{1000} = \sum_{i=1}^{1000} X_i$ where the $X_i$ are i.i.d. Bernoulli random variables with parameter p. If
the coin is fair, then p = 0.5 and $E[X_1] = 0.5$, $Var[X_1] = 0.25$, and therefore $E[S_{1000}] = 500$ and
$SD[S_{1000}] = \sqrt{250} = 15.8114$. We want to approximate $P(S_{1000} \ge 525) = 1 - P(S_{1000} \le 524)$.
Without the continuity correction, the Normal approximation gives
$$1 - P\left(Z \le \frac{24}{15.8114}\right) = 1 - P(Z \le 1.52)$$
1 - pnorm(24 / sqrt(250))
[1] 0.06452065
With the continuity correction, the approximation would instead use z = 24.5/15.8114 = 1.55 ,
giving 1 − 0.9394 = 0.0606 using Table B.1 or
1 - pnorm(24.5 / sqrt(250))
[1] 0.06062886
in R. We can also compute the exact probability that we are trying to approximate, namely
P (S1000 ≥ 525), in R as
[1] 0.06060713
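An exact value of this kind can be obtained from the Binomial distribution function pbinom(). The following sketch is one way to do so, and also places the two Normal approximations next to it; the variable names are illustrative.

```r
n <- 1000
p <- 0.5

## Exact P(S_1000 >= 525) = 1 - P(S_1000 <= 524) under Binomial(1000, 0.5).
exact <- 1 - pbinom(524, size = n, prob = p)

## Normal approximations without and with the continuity correction.
plain <- 1 - pnorm((524 - n * p) / sqrt(n * p * (1 - p)))
corrected <- 1 - pnorm((524.5 - n * p) / sqrt(n * p * (1 - p)))

c(exact = exact, plain = plain, corrected = corrected)
```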
As we can see, the continuity correction gives us a slightly better approximation. These calculations
tell us that the probability of Priya’s test reaching a false conclusion if the coin is fair is approximately
0.061. We shall examine the topic of Hypothesis testing, which is what Priya was trying to do, in
more detail in Chapter 10. ■
Example 8.4.9. We return to the Birthday problem. Suppose a small town has 1460 students.
What is the probability that five or more students were born on independence day? Assume that
birthrates are constant throughout the year and that each year has 365 days.
The probability that any given student was born on independence day is $\frac{1}{365}$. So the exact
probability that five or more students were born on independence day is
$$1 - \sum_{k=0}^{4}\binom{1460}{k}\left(\frac{1}{365}\right)^k\left(\frac{364}{365}\right)^{1460-k}.$$
In Example 2.2.1 we have used the Poisson approximation with λ = 4 to estimate the above as
$$\begin{aligned}
1 - \sum_{k=0}^{4}\binom{1460}{k}\left(\frac{1}{365}\right)^k\left(\frac{364}{365}\right)^{1460-k}
&\approx 1 - \left(e^{-4} + 4e^{-4} + \frac{4^2}{2}e^{-4} + \frac{1}{6}4^3 e^{-4} + \frac{1}{24}4^4 e^{-4}\right) \\
&= 0.3711631
\end{aligned}$$
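We can do another approximation using the Central Limit Theorem, which is typically called the
Normal approximation. For 1 ≤ i ≤ 1460, define
$$X_i = \begin{cases} 1 & \text{if the i-th person’s birthday is on independence day} \\ 0 & \text{otherwise} \end{cases}$$
Given the assumptions above on birthrates we know the $X_i$ are i.i.d. random variables distributed as
Bernoulli($\frac{1}{365}$). Note that $S_{1460} = \sum_{i=1}^{1460} X_i$ is the number of people born on independence day
and we are interested in calculating
$$P(S_{1460} \ge 5).$$
Here $E[S_{1460}] = 4$ and $SD[S_{1460}] = \sqrt{1460 \cdot \frac{1}{365} \cdot \frac{364}{365}} = 1.9973$, so using the continuity correction (8.4.9),
$$\begin{aligned}
P(S_{1460} \ge 5) = 1 - P(S_{1460} \le 4) &\approx 1 - P\left(Z \le \frac{0.5}{1.9973}\right) \\
&= 0.401.
\end{aligned}$$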
Recall from the calculations done in Example 2.2.1 that the exact answer for this problem is 0.3711629. So in this example, the Poisson approximation seems to work better than the Normal approximation. This is due to the fact that more asymmetry in the underlying Bernoulli distribution worsens the Normal approximation, just as it improves the Poisson approximation, as we saw in Figure 2.2. ■
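The three numbers compared in this example can be reproduced side by side. The following is a minimal sketch, assuming the standard R distribution functions `pbinom`, `ppois` and `pnorm`; the variable names are illustrative and not taken from the text.

```r
# Birthday example: n students, each born on the given day with probability p
n <- 1460
p <- 1 / 365

# Exact Binomial probability P(S >= 5) = 1 - P(S <= 4)
exact <- 1 - pbinom(4, size = n, prob = p)

# Poisson approximation with lambda = n * p = 4
poisson <- 1 - ppois(4, lambda = n * p)

# Normal approximation with continuity correction
normal <- 1 - pnorm((4.5 - n * p) / sqrt(n * p * (1 - p)))

round(c(exact = exact, poisson = poisson, normal = normal), 4)
```

The Poisson value agrees with the exact one to several decimal places, while the Normal value is off in the second decimal place, in line with the discussion above.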
exercises
Ex. 8.4.1. Suppose $S_n$ is binomially distributed with parameters $n = 200$ and $p = 0.3$. Use the Central Limit Theorem to find an approximation for $P(99 \le S_n \le 101)$.
Ex. 8.4.2. Toss a fair coin 400 times. Use the Central Limit Theorem to
(d) find an approximation for the probability that the number of heads is between 140 and least
160.
Ex. 8.4.3. Suppose that the weight of open packets of daal in a home is uniformly distributed from 200 to 600 gms. In a random survey of 64 homes, find the (approximate) probability that the total weight of open packets is less than 25 kgs.
Ex. 8.4.4. Let $\{a_n\}_{n \ge 1}$ be a sequence of real numbers such that $a_n \to a$ as $n \to \infty$. Show that
$$\lim_{n \to \infty} \left(1 + \frac{a_n}{n}\right)^{n} = e^{a}.$$
Ex. 8.4.5. Suppose $U$ is a random variable (discrete or continuous) and $M_U(t) = E(e^{tU})$ exists for all $t$. Show that
$$M_U(t) = 1 + t M_U'(0) + \frac{t^2}{2} M_U''(0) + g(t),$$
where $\lim_{t \to 0} \frac{g(t)}{t^2} = 0$.
Ex. 8.4.6. Let $X_1, X_2, \ldots$ be a sequence of i.i.d. random variables with $X_1 \sim$ Exp(1). Find
$$\lim_{n \to \infty} P\left(\frac{n}{2} - \frac{\sqrt{n}}{2\sqrt{3}} \le \sum_{i=1}^{n} [1 - \exp(-X_i)] \le \frac{n}{2} + \frac{\sqrt{n}}{2\sqrt{3}}\right).$$
Ex. 8.4.7. Let $a_n = \sum_{k=0}^{n} \frac{n^k}{k!} e^{-n}$, $n \ge 1$. Using the Central Limit Theorem, evaluate $\lim_{n \to \infty} a_n$.
Ex. 8.4.8. How many times should you toss a coin:
(a) to be at least 90% sure that your estimate of the P(head) is within 0.1 of its true value ?
(b) to be at least 90% sure that your estimate of the P(head) is within 0.01 of its true value ?
Ex. 8.4.9. To forecast the outcome of the election in which two parties are contesting, an internet
poll via Facebook is conducted. How many people should be surveyed to be at least 95% sure that
the estimated proportion is within 0.05 of the true value?
Ex. 8.4.10. A medical study is conducted to estimate the proportion of people suffering from April
allergies in Bangalore. How many people should be surveyed to be at least 99% sure that the
estimate is within 0.02 of the true value?
In many situations one is interested in knowing whether convergence properties are preserved under transformations. Given random variables $X_1, X_2, \ldots$ and $Z$ such that $X_n \xrightarrow{d} Z$, and a function $g : \mathbb{R} \to \mathbb{R}$, we may be interested in knowing the limiting distribution of $g(X_n)$. In earlier
chapters, we have learnt techniques to calculate the distribution of g (X ) from the distribution of
X (see Section 3.3, Section 5.3), which may be helpful in studying this problem. In this section, we
discuss the Delta method, which answers this question in a specific situation, where g is a smooth
transformation that can be effectively approximated by a linear function in the region of interest.
Slutsky’s theorem (Lemma 8.3.10) is an important tool in proving the following result.
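$$\sqrt{n}\, \frac{(X_n - \mu)}{\sigma} \xrightarrow{d} Z \text{ as } n \to \infty.$$
Then
$$\sqrt{n}\, \frac{(g(X_n) - g(\mu))}{\sigma g'(\mu)} \xrightarrow{d} Z \text{ as } n \to \infty.$$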
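$$g(x) - g(\mu) = \int_{\mu}^{x} g'(t)\, dt = (x - \mu) \int_{0}^{1} g'(\mu + s(x - \mu))\, ds.$$
$$\sqrt{n}\, \frac{(g(X_n) - g(\mu))}{\sigma g'(\mu)} = \sqrt{n}\, \frac{(X_n - \mu)}{\sigma} \cdot \frac{1}{g'(\mu)} \int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds. \tag{8.5.1}$$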
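By our hypothesis on $X_n$, we know that $\sqrt{n}\, \frac{(X_n - \mu)}{\sigma} \xrightarrow{d} Z$ as $n \to \infty$. Slutsky's theorem (Lemma 8.3.10) will imply the result if we can show that
$$\frac{1}{g'(\mu)} \int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds \xrightarrow{p} 1 \text{ as } n \to \infty. \tag{8.5.2}$$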
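$$P\left(\left|\frac{1}{g'(\mu)} \int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds - 1\right| > \epsilon\right) = P\left(\left|\int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds - g'(\mu)\right| > |g'(\mu)|\, \epsilon\right)$$
$$\le P\left(\sup_{s \in [0,1]} \left|g'(\mu + s(X_n - \mu)) - g'(\mu)\right| > |g'(\mu)|\, \epsilon\right) \tag{8.5.3}$$
$$P\left(\left|\frac{1}{g'(\mu)} \int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds - 1\right| > \epsilon\right) \le P(|X_n - \mu| > \delta). \tag{8.5.5}$$
$$P\left(\left|\frac{1}{g'(\mu)} \int_{0}^{1} g'(\mu + s(X_n - \mu))\, ds - 1\right| > \epsilon\right) < 2\epsilon.$$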
Remark 8.5.2. The particular transformation $g$ affects the rate of convergence to normality of the sequence $g(X_n)$. Indeed Theorem 8.5.1 shows that
$$\text{if } \sqrt{n}(X_n - \mu) \xrightarrow{d} \text{Normal}(0, \sigma^2), \text{ then } \sqrt{n}(g(X_n) - g(\mu)) \xrightarrow{d} \text{Normal}\left(0, \sigma^2 (g'(\mu))^2\right).$$
The value of $g'(\mu)$ determines how large or small the variance of the limiting normal distribution is.
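Example 8.5.3. Suppose $\sqrt{n}(X_n - \mu) \xrightarrow{d} \text{Normal}(0, \sigma^2)$ as $n \to \infty$ for some $\mu \ne 0$ and $\sigma > 0$.
(b) Let $g : \mathbb{R} \to \mathbb{R}$ be given by $g(x) = \frac{1}{x}$ for $x \ne 0$ and $g(0) = 0$. Then by Theorem 8.5.1 we have that
$$\sqrt{n}\left(\frac{1}{X_n} - \frac{1}{\mu}\right) \xrightarrow{d} \text{Normal}\left(0, \frac{\sigma^2}{\mu^4}\right) \text{ as } n \to \infty.$$
Note that the transformed random variables $g(X_n)$ need not have finite expectation for any $n \ge 1$, e.g., take $X_n \sim \text{Normal}\left(\mu, \frac{\sigma^2}{n}\right)$ with $g(X_n) = \frac{1}{X_n}$ as in (b).
■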
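The limiting variance in (b) can be checked by simulation. The sketch below assumes $X_n \sim \text{Normal}(\mu, \sigma^2/n)$ exactly, and the values $\mu = 2$, $\sigma = 1$, $n = 400$, the seed and the variable names are illustrative choices rather than values from the text.

```r
set.seed(1)
mu <- 2; sigma <- 1   # assumed illustrative parameters
n <- 400              # index n of the sequence X_n
reps <- 20000         # number of simulated copies of X_n

# X_n ~ Normal(mu, sigma^2 / n), so sqrt(n) * (X_n - mu) ~ Normal(0, sigma^2)
xn <- rnorm(reps, mean = mu, sd = sigma / sqrt(n))

# Transformed and rescaled sequence sqrt(n) * (1 / X_n - 1 / mu)
tn <- sqrt(n) * (1 / xn - 1 / mu)

# Simulated variance versus the Delta method value sigma^2 / mu^4
c(simulated = var(tn), delta_method = sigma^2 / mu^4)
```

With $\mu$ this far from zero, the simulated values of $X_n$ stay away from the singularity of $g$ at 0, so the empirical variance is stable even though $g(X_n)$ has no finite expectation.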
A natural application of the Weak Law of Large Numbers and the Central Limit Theorem is to estimate the unknown mean $\mu$ of a population by the sample mean $\bar{X}_n$, provided we have a random sample from the population. Here the assumption is that $E[X_i] = \mu$ and $Var[X_i] = \sigma^2$ for each $1 \le i \le n$, so by the Central Limit Theorem,
$$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{d} \text{Normal}(0, \sigma^2).$$
(a) Suppose $X \sim$ Bernoulli$(p)$. Then $\sigma^2 = p(1 - p)$, and we need a transformation such that $g'(p)\sqrt{p(1 - p)} = c$ for some $c \in \mathbb{R}$. Indeed if $g(x) = \arcsin(\sqrt{x})$ then
$$\sqrt{n}(g(\bar{X}_n) - g(p)) \xrightarrow{d} \text{Normal}\left(0, \frac{1}{4}\right).$$
(b) Suppose $X \sim$ Poisson$(p)$. Then $\sigma^2 = p$, and we need a transformation such that $g'(p)\sqrt{p} = c$ for some $c \in \mathbb{R}$. Indeed if $g(x) = \sqrt{x}$ then
$$\sqrt{n}(g(\bar{X}_n) - g(p)) \xrightarrow{d} \text{Normal}\left(0, \frac{1}{4}\right).$$
(c) Suppose $X \sim \text{Normal}(0, \sigma^2)$ and we are interested in estimating $\sigma^2$. We will then use $\frac{1}{n}\sum_{i=1}^{n} X_i^2$ as the estimate for $\sigma^2$ and the Central Limit Theorem implies that
$$\sqrt{n}\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2 - \sigma^2\right) \xrightarrow{d} \text{Normal}(0, 2\sigma^4).$$
Thus we need a transformation such that $g'(\sigma^2)\sqrt{2}\sigma^2 = c$ for some $c \in \mathbb{R}$. Indeed if $g(x) = \log x$ then
$$\sqrt{n}\left(g\left(\frac{1}{n}\sum_{i=1}^{n} X_i^2\right) - g(\sigma^2)\right) \xrightarrow{d} N(0, 2).$$
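The effect of the transformation in (a) can be seen numerically. The following sketch, with an illustrative sample size, seed and helper name `stabilised_var` (none of which come from the text), compares the variance $p(1-p)$ of the untransformed limit with the simulated variance after applying $\arcsin(\sqrt{x})$.

```r
set.seed(2)
n <- 500        # sample size behind each sample mean
reps <- 20000   # number of simulated sample means per value of p

# Simulated variance of sqrt(n) * (g(Xbar_n) - g(p)) with g(x) = asin(sqrt(x))
stabilised_var <- function(p) {
  xbar <- rbinom(reps, size = n, prob = p) / n   # means of n Bernoulli(p) draws
  var(sqrt(n) * (asin(sqrt(xbar)) - asin(sqrt(p))))
}

probs <- c(0.2, 0.5, 0.8)
vars <- sapply(probs, stabilised_var)

# The raw limiting variance p(1 - p) changes with p; the stabilised one stays near 1/4
round(rbind(p = probs, raw = probs * (1 - probs), stabilised = vars), 3)
```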
exercises
Ex. 8.5.2. Suppose $\{X_n\}_{n \ge 1}$, $X$ are random variables and $\{a_n\}_{n \ge 1}$, $a$ are real numbers such that $X_n \xrightarrow{d} X$ and $a_n \to a$. Show that $a_n X_n \xrightarrow{d} aX$. Hint: Use Lemma 8.3.10.
Ex. 8.5.3. Suppose $\{X_n\}_{n \ge 1}$, $X$ and $\{Y_n\}_{n \ge 1}$, $Y$ are random variables such that $X_n \xrightarrow{d} X$ and $Y_n \xrightarrow{d} Y$. Show that $\lambda X_n + (1 - \lambda) Y_n \xrightarrow{d} \lambda X + (1 - \lambda) Y$. Hint: Use Lemma 8.3.10.
Ex. 8.5.4. Suppose $X_n \xrightarrow{d} X$. Show that $X_n^2 \xrightarrow{d} X^2$.
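Ex. 8.5.5. Let $\alpha, \mu > 0$. Let $\{X_i\}_{i \ge 1}$ be i.i.d. random variables following the Pareto$(\alpha, \mu)$ distribution. That is, the probability density function of $X_i$, for any $i \ge 1$, is given by
$$f_{(\alpha,\mu)}(x) = \begin{cases} \frac{\alpha \mu^{\alpha}}{x^{\alpha + 1}} & x \ge \mu, \\ 0 & \text{otherwise.} \end{cases}$$
(a) Let $\bar{Y}_n = \frac{1}{n} \sum_{i=1}^{n} \log\left(\frac{X_i}{\mu}\right)$. Show that
$$\sqrt{n}\left(\frac{1}{\bar{Y}_n} - \alpha\right) \xrightarrow{d} \text{Normal}(0, \alpha^2).$$
(b) Let $\bar{Z}_n = \frac{1}{n} \sum_{i=1}^{n} \log(X_i)$ and $M_n = \max\{X_1, X_2, \ldots, X_n\}$.
(i) Show that $\sqrt{n}(\log(M_n) - \log(\mu)) \xrightarrow{p} 0$.
(ii) Using (a) and Lemma 8.3.10 show that
$$\sqrt{n}\left(\bar{Z}_n - \log(M_n) - \frac{1}{\log(\mu)}\right) \xrightarrow{d} \text{Normal}\left(0, \frac{1}{\alpha^2}\right).$$
Ex. 8.5.6. For $\alpha, \mu > 0$, let $\{X_i\}_{i \ge 1}$ be i.i.d. random variables with the probability density function of $X_i$, for any $i \ge 1$, given by
$$f_{(\alpha,\mu)}(x) = \begin{cases} \mu e^{-\mu(x - \alpha)} & x \ge \alpha, \\ 0 & \text{otherwise.} \end{cases}$$
Let $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $M_n = \max\{X_1, X_2, \ldots, X_n\}$.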
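Ex. 8.5.7. Let $X_i$, $i \ge 1$ be i.i.d. Bernoulli$(p)$ random variables. Let $\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k$. Show that
$$\sqrt{n}\left(\frac{\bar{X}_n}{1 - \bar{X}_n} - \frac{p}{1 - p}\right) \xrightarrow{d} \text{Normal}\left(0, \frac{p}{(1 - p)^3}\right).$$
The statistic $\frac{\bar{X}_n}{1 - \bar{X}_n}$ is typically used to estimate the odds ratio $\frac{p}{1 - p}$.
Ex. 8.5.8. Let $X_i$, $i \ge 1$ be i.i.d. Bernoulli$(p)$ random variables. Let $\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k$. Show that
$$\sqrt{n}\left(\bar{X}_n (1 - \bar{X}_n) - p(1 - p)\right) \xrightarrow{d} \text{Normal}\left(0, \frac{p(1 - p)}{(1 - 2p)^2}\right),$$
for $p \ne \frac{1}{2}$. The statistic $\bar{X}_n (1 - \bar{X}_n)$ is typically used to estimate the variance $p(1 - p)$.
Ex. 8.5.9. Let $X_i$, $i \ge 1$ be i.i.d. Exp$(\lambda)$ random variables. Let $\bar{X}_n = \frac{1}{n} \sum_{k=1}^{n} X_k$. Show that
$$\sqrt{n}\left(\frac{1}{\bar{X}_n} - \frac{1}{\lambda}\right) \xrightarrow{d} \text{Normal}\left(0, \frac{1}{\lambda^2}\right).$$
Ex. 8.5.10. Consider the same set up as in Example 8.5.3 with $\mu = 0$. Then show that
$$\sqrt{n} X_n^2 \xrightarrow{p} 0$$
as $n \to \infty$, and that the correct scaling is $n$ as opposed to $\sqrt{n}$, that is, $n \frac{X_n^2}{\sigma^2} \xrightarrow{d} \chi^2_1$ as $n \to \infty$.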
The sample median is a natural alternative to the sample mean as a measure of centrality. For
continuous symmetric distributions, the median is the same as the mean when the mean exists. The
median of a distribution always exists, even if the mean does not. It is invariant under monotone
transformations, making it a more appealing measure of centrality for skewed distributions. The
asymptotic distribution of the sample median is therefore of natural interest.
The Central Limit Theorem establishes the Normal distribution as the limiting distribution
of the standardized sample mean for all distributions that have finite second moment. This is a
universality result which says that all sums and averages from a random sample are asymptotically
Normal. However, from a sample one can derive many other summary statistics, each with different
sampling distributions, and it is natural to ask about their asymptotic behaviour. In this section,
we show that the limiting distribution of the sample median is also Normal. We do this in two
stages. First we prove it for Uniform(0, 1) random variables, and then use the Delta method to
prove it for more general distributions.
Lemma 8.6.1. Suppose that $U_1, U_2, \ldots$ are i.i.d. Uniform(0, 1) random variables, and let $\tilde{U}_n$ be the sample median obtained from $U_1, \ldots, U_n$. Then,
$$2\sqrt{n}\left(\tilde{U}_n - \frac{1}{2}\right) \xrightarrow{d} Z, \tag{8.6.1}$$
Proof. To begin with, and to keep the definition of the median unambiguous, we consider odd sample sizes such that $n = 2k - 1$ for some positive integer $k$, and we shall let $k \to \infty$. In this case the median is $\tilde{U}_n = U_{(k)}$.
As seen in Example 8.1.4, $U_{(k)}$ has the Beta$(k, k)$ distribution with density
$$f_k(u) = \begin{cases} \frac{k}{2} \binom{2k}{k} u^{k-1} (1 - u)^{k-1} & 0 < u < 1, \\ 0 & \text{otherwise.} \end{cases}$$
[Figure: three panels, for sample sizes 5, 11 and 19; horizontal axis from −3 to 3, vertical axis "Density".]
Figure 8.7: Density of standardized sample median for a Uniform(0, 1) population, for sample
sizes 5, 11, and 19. The grey curve represents the standard Normal density. As the
sample size increases, the density function of the standardized median converges
to the standard Normal density.
Figure 8.7, which plots $g_k(z)$ for $k = 3, 6, 10$ (corresponding to $n = 5, 11, 19$), shows that the Normal approximation is good even for small values of $n$.
By Stirling's approximation for factorials we have that for all $k \ge 1$
$$k^k \sqrt{2\pi k} \exp(-k) < k! < k^k \sqrt{2\pi k} \exp\left(-k + \frac{1}{12k}\right). \tag{8.6.3}$$
$$g_k(z) \to \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} z^2} \text{ as } k \to \infty. \tag{8.6.5}$$
$$Z_k \xrightarrow{d} \text{Normal}(0, 1) \text{ as } k \to \infty. \tag{8.6.6}$$
As $2\sqrt{n}\left(\tilde{U}_n - \frac{1}{2}\right) = \frac{\sqrt{n}}{\sqrt{n+2}} Z_{\frac{n+1}{2}}$, an application of Lemma 8.3.10 (see Exercise 8.5.2) yields the result when $n$ is odd.
For $n = 2k$ even, the median may be defined as any value between $U_{(k)}$ and $U_{(k+1)}$, i.e., any convex combination of $U_{(k)}$ and $U_{(k+1)}$. For example, we could take $\frac{U_{(k)} + U_{(k+1)}}{2}$. Imitating the above argument for the odd case, one can prove the result in this case as well (see Exercise 8.6.1). ■
Having established the result for the special case of Uniform(0, 1), we next wish to prove that a
similar result holds for a much wider class of distributions. For this, we will use the Delta method,
generalizing the result to all continuous distributions which have strictly positive density at the
population median.
Theorem 8.6.2. Let $X_1, X_2, \ldots$ be i.i.d. random variables with probability density function $f$. Let $\tilde{X}_n$ be the sample median obtained from $X_1, X_2, \ldots, X_n$. Assume that $f(\mu) > 0$, where $\mu$ denotes the median of the random variable $X$. Then,
$$2\sqrt{n} f(\mu)\left(\tilde{X}_n - \mu\right) \xrightarrow{d} Z, \tag{8.6.7}$$
Proof. Define $U_i = F(X_i)$ for $i \ge 1$. By Lemma 5.3.7, Exercise 5.3.12 and Theorem 8.1.2, $U_1, U_2, \ldots$ are i.i.d. Uniform(0, 1). By Lemma 8.6.1
$$2\sqrt{n}\left(\tilde{U}_n - \frac{1}{2}\right) \xrightarrow{d} Z, \tag{8.6.8}$$
where $Z$ is standard Normal and $\tilde{U}_n$ is the median of $U_1, \ldots, U_n$. Let $F$ be the distribution function of $X$. Now define $G : [0, 1] \to \mathbb{R} \cup \{-\infty\} \cup \{\infty\}$ as
Recall from Exercise 5.3.12 that $G$ is the generalised inverse of $F$. First note, since the $X_i$ are sampled from $f$, $X_i = G(U_i)$ for $i \ge 1$ with probability 1. By definition $G$ is increasing, and so $\tilde{X}_n = G(\tilde{U}_n)$. Further, since $f(\mu) > 0$, $F$ is strictly monotone and $F'$ (which exists and is continuous) is strictly positive in a neighbourhood of $\mu$. This implies that $G(\frac{1}{2}) = \mu$, that $G$ is differentiable at $\frac{1}{2}$ with $G'(\frac{1}{2}) = \frac{1}{f(\mu)} > 0$, and that $G'(\cdot)$ is continuous in a neighbourhood of $\frac{1}{2}$.
As $G$ satisfies the hypothesis of Theorem 8.5.1, using (8.6.8) we have
$$\sqrt{n}\, \frac{G(\tilde{U}_n) - G(\frac{1}{2})}{\frac{1}{2} G'(\frac{1}{2})} \xrightarrow{d} Z,$$
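$$\sqrt{\frac{2}{\pi}}\, \sqrt{n}\, \frac{(\tilde{X}_n - \mu)}{\sigma} \xrightarrow{d} Z, \tag{8.6.9}$$
$$\sqrt{\frac{2}{\pi}}\, \sqrt{n}\, \frac{(\tilde{X}_n - \mu)}{\sigma} \xrightarrow{d} Z \quad \text{and} \quad \sqrt{n}\, \frac{(\bar{X}_n - \mu)}{\sigma} \xrightarrow{d} Z \tag{8.6.10}$$
The asymptotic efficiency of $\tilde{X}_n$ over $\bar{X}_n$ is defined as the inverse of the ratio of the limiting variances, i.e., $\frac{2}{\pi} \approx 0.64$. This number can be interpreted in the following manner: if one uses $\tilde{X}_n$ to estimate $\mu$, then one could instead have used $\bar{X}_m$ with sample size $m \approx 0.64n$ to get an estimator with the same variance. ■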
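The value $2/\pi$ can be compared with a finite-sample simulation. This is a rough sketch for a standard Normal population; the sample size, number of replications, seed and variable names are illustrative assumptions.

```r
set.seed(3)
n <- 201        # odd sample size, so the median is a single order statistic
reps <- 10000   # number of simulated samples

# Each row is one Normal(0, 1) sample of size n
samples <- matrix(rnorm(n * reps), nrow = reps)
means <- rowMeans(samples)
medians <- apply(samples, 1, median)

# Ratio of the variance of the sample mean to that of the sample median
eff <- var(means) / var(medians)
c(simulated = eff, limit = 2 / pi)
```

For moderate $n$ the simulated ratio is already close to 0.64, although it need not match the limit exactly.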
The previous example suggests that one should use the sample mean rather than the sample
median when the population is Normal. In practice, however, the underlying distribution is rarely known with certainty, and we will see in Chapter 9 that under even fairly mild departures
from Normality, the sample median may become much more useful than the sample mean in the
sense of asymptotic efficiency as defined above. An extreme case is the following example, where
the sample mean does not have finite variance, and hence the asymptotic efficiency of the sample
mean over the sample median is zero.
Example 8.6.4. Suppose that $X_1, X_2, \ldots$ are i.i.d. Cauchy$(\theta, \alpha^2)$ random variables. It is easy to see (by symmetry) that $\theta$ is the median. As the Cauchy distribution has no finite moments, one way to estimate $\theta$ is using the sample median. We can apply Theorem 8.6.2 to get
$$2\sqrt{n}\, \frac{1}{\pi\alpha}\, (\tilde{X}_n - \theta) \xrightarrow{d} Z \text{ as } n \to \infty,$$
where $Z$ is standard Normal. ■
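A simulation can illustrate this limit. The sketch below assumes that $\alpha$ is the scale parameter of the Cauchy distribution, so that the density at the median is $1/(\pi\alpha)$ and `rcauchy(n, location = theta, scale = alpha)` draws from it; the parameter values, seed and names are illustrative.

```r
set.seed(4)
theta <- 1; alpha <- 2   # assumed location and scale
n <- 201                 # odd sample size
reps <- 10000            # number of simulated samples

# Sample medians of Cauchy samples of size n
medians <- replicate(reps, median(rcauchy(n, location = theta, scale = alpha)))

# Standardise as in the display above; the result should be roughly Normal(0, 1)
z <- 2 * sqrt(n) * (medians - theta) / (pi * alpha)
c(mean = mean(z), variance = var(z))
```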
exercises
Ex. 8.6.1. Suppose that $U_1, U_2, \ldots$ are i.i.d. Uniform(0, 1) random variables, and let $\tilde{U}_n$ be the sample median obtained from $U_1, \ldots, U_n$, with $n = 2k$ for some $k \ge 1$.
(a) Using Example 8.1.4 find the distribution of $U_{(k)}$ and $U_{(k+1)}$ and compute $E[U_{(k)}]$, $Var[U_{(k)}]$, $E[U_{(k+1)}]$ and $Var[U_{(k+1)}]$.
(b) As in the proof of Lemma 8.6.1 find $a_k$ and $b_k$ such that both $Z_k = a_k(U_{(k)} - \frac{1}{2})$ and $\tilde{Z}_k = b_k(U_{(k+1)} - \frac{1}{2})$ converge in distribution to Normal(0, 1) as $k \to \infty$.
(c) Using Lemma 8.3.10, for $0 < \lambda < 1$ show that $\sqrt{n}(\lambda U_{(k)} + (1 - \lambda) U_{(k+1)} - \frac{1}{2}) \xrightarrow{d} \text{Normal}(0, 1)$ as $k \to \infty$.
Ex. 8.6.2. Suppose that $X_1, X_2, \ldots, X_n$ are i.i.d. Exp$(\lambda)$ random variables. Find the distribution of the sample median $\tilde{X}_n$ and also identify the standardization required to obtain a Normal distribution as the limiting distribution of the standardized sample median.
Ex. 8.6.3. Under the conditions of Theorem 8.6.2, show that the sample median converges to the population median in probability. Thus, the Weak Law of Large Numbers holds for the sample median.
as f (x | p), where p = (p1 , p2 , . . . , pd ) represents the vector of all the parameters. We will assume
that the set P of all possible values p can take is known, where P ⊂ Rd for some d ≥ 1. The set P
may be all of Rd or some proper subset depending on the nature of the parameters.
We now fix some notations and terminology for estimators.
In practice the function g is chosen keeping in mind the parameter θ (p) of interest. We have seen
the following in Chapter 7.
$$g(x) = \frac{1}{n} \sum_{i=1}^{n} x_i.$$
Then g (X1 , X2 , . . . , Xn ) is the (now familiar) sample mean and it is an estimator for µ. Further,
E [g (X1 , X2 , . . . , Xn )] = µ regardless of the true value of µ. We called such an estimator an
unbiased estimator. Finally we also know by the strong law of large numbers, Theorem 8.2.1, that $g(X_1, X_2, \ldots, X_n) \xrightarrow{p} \mu$ as $n \to \infty$. ■
Recall from Chapter 6 that E [X ] is the first moment of X. As noted in Chapter 7, we can thus
view the sample mean, which is the first moment of the empirical distribution based on a sample,
as estimating the first moment of the underlying distribution. A generalization of this method is
known as the method of moments.
$$m_k(x) = \frac{1}{n} \sum_{i=1}^{n} x_i^k.$$
Notice that mk (X1 , X2 , . . . , Xn ) is the k-th moment of the empirical distribution based on the
sample X1 , X2 , . . . , Xn , which we will refer to simply as the k-th sample moment.
Let µk = E [X k ], the k-th moment of the distribution X. As the distribution of X depends on
(p1 , p2 , . . . , pd ) one can view µk ≡ µk (p1 , p2 , . . . , pd ) as a function of p. The method of moments
estimator for (p1 , p2 , . . . , pd ) is obtained by equating the first d sample moments to the corresponding
moments of the distribution. Specifically, it requires solving the d equations in d unknowns given by
µk (p1 , p2 , . . . , pd ) = mk (X1 , X2 , . . . , Xn ) , k = 1, 2, . . . , d.
for p1 , p2 , . . . , pd . There is no guarantee in general that these equations have a unique solution or
that it can be computed, but in practice it is often possible to do so. The solution will be denoted
by p̂1 , p̂2 , . . . , p̂d which will be written in terms of the realised values for mk , k = 1, 2, . . . , d. We will
now explore this method for two examples.
$$\hat{N} = \frac{m_1^2}{m_1 - (m_2 - m_1^2)} \approx 19$$
$$\hat{p} = \frac{m_1 - (m_2 - m_1^2)}{m_1} \approx 0.371.$$
Thus, according to the method of moments, the distribution from which the sample came is estimated to be the Binomial(19, 0.371) distribution. In practice, we usually wish to restrict the estimates of the parameters based on the context of the problem. Since the $N$ value is surely some integer, the estimate $\hat{N}$ was rounded to the nearest meaningful value in this case. ■
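The two estimates can be recomputed directly from the data. This sketch uses the ten observations that the text later quotes for Example 9.1.1 (8, 7, 6, 11, 8, 5, 3, 7, 6, 9); the variable names follow Exercise 9.1.8 but are otherwise an illustrative choice.

```r
# Observations quoted for Example 9.1.1
x <- c(8, 7, 6, 11, 8, 5, 3, 7, 6, 9)

# First and second sample moments
m1 <- mean(x)
m2 <- mean(x^2)

# Method of moments estimates for Binomial(N, p), before rounding N
Nhat <- m1^2 / (m1 - (m2 - m1^2))
phat <- (m1 - (m2 - m1^2)) / m1

c(m1 = m1, m2 = m2, Nhat = Nhat, phat = phat)
```

Here `Nhat` comes out a little below 19, which is why the estimate is reported as 19 after rounding.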
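Example 9.1.2. Suppose our distribution of interest $X$ has a Normal$(\mu, \sigma^2)$ distribution. Therefore our probability density function is given by
$$f(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad x \in \mathbb{R}.$$
$$E[X] = \mu \quad \text{and} \quad E[X^2] = Var[X] + E[X]^2 = \mu^2 + \sigma^2.$$
$$m_1 = \mu \quad \text{and} \quad m_2 = \mu^2 + \sigma^2,$$
from which
$$\hat{\mu} = m_1 = \bar{X} \quad \text{and} \quad \hat{\sigma}^2 = m_2 - m_1^2 = \left(\frac{1}{n} \sum_{i=1}^{n} X_i^2\right) - \bar{X}^2 = \frac{n - 1}{n} S^2.$$
Here $\bar{X}$ and $S^2$ are, respectively, the sample mean and sample variance defined in Chapter 7. ■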
The method of moments estimators may not always be very reliable, in the sense that they might give implausible estimates. For instance, in Example 9.1.1 above, you can check that the estimate for $p$ would be negative if the sample mean $\bar{X}$ happened to be smaller than $\frac{n - 1}{n} S^2$. Such defects can be somewhat rectified using moment matching and other techniques (see [CasBer90]).
exercises
Ex. 9.1.1. Suppose X1 , . . . , X5 is an i.i.d. sample with Uniform(a, b) distribution for some unknown
a and b. Suppose the empirical realisation of these variables is 3.5, 2.1, 5.7, 4.8, 3.9. Use the method
of moments to estimate a and b.
Ex. 9.1.2. Suppose X1 , X2 , . . . , Xn is an i.i.d. sample with Uniform(a, b) distribution for some
unknown a and b. Let m1 and m2 be the empirical realisation of the first and second moments of
the X1 , X2 , . . . , Xn data. Find an expression for the estimates of a and b given by the method of
moments in terms of the quantities m1 and m2 .
Ex. 9.1.3. Suppose X1 , X2 , . . . , Xn is an i.i.d. sample with Uniform(a, b) distribution for some
unknown a and b. Prove that the method of moments produces estimates â and b̂ such that â = b̂
if and only if every data point in the empirical realisation has exactly the same value.
Ex. 9.1.4. Suppose X1 , . . . , X4 is an i.i.d. sample with Binomial(N , p) distribution for some
unknown N and p. Suppose the empirical realisation of these variables is 1, 2, 5, 12. Show that the
method of moments for estimating N and p gives negative (and therefore meaningless) results.
Ex. 9.1.5. Suppose X1 , X2 , . . . , Xn is an i.i.d. sample with Binomial(N , p) distribution for some
unknown N and p. Prove that the method of moments will produce a negative estimate for p if and
only if it also produces a negative estimate for N .
Ex. 9.1.6. Suppose X1 , . . . , X6 is an i.i.d. sample with Gamma(α, λ) distribution for some unknown
α and λ. Suppose the empirical realisation of these variables is 5.3, 2.4, 2.8, 7.6, 6.9, 4.2. Use the
method of moments to estimate α and λ.
Ex. 9.1.7. Suppose X1 , X2 , . . . , Xn is an i.i.d. sample with Gamma(α, λ) distribution for some
unknown α and λ. Let m1 and m2 be the empirical realisations of the first and second moments of
X1 , X2 , . . . , Xn .
(a) Find an expression for the estimates of α and λ given by the method of moments in terms of
the quantities m1 and m2 .
(b) Show that the estimates of α and λ from part (a) can never be negative.
Ex. 9.1.8. The following code simulates 100 samples from a population with distribution Binomial
(20, 0.4) and computes the method of moments estimate for the sample size parameter n and the
success probability p (n = 20 and p = 0.4 in the simulation).
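n <- 100                             # number of simulated observations
N <- 20                              # true number of trials
p <- 0.4                             # true success probability
x <- rbinom(n, N, p)                 # simulate the sample
m1 <- mean(x)                        # first sample moment
m2 <- mean(x^2)                      # second sample moment
Nhat <- m1^2 / (m1 - (m2 - m1^2))    # method of moments estimate of N
phat <- (m1 - (m2 - m1^2)) / m1      # method of moments estimate of p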
(b) Change the code suitably to simulate 1000 samples from Binomial (20, 0.4) and see if the
answer to (a) changes.
(c) Change the code suitably to simulate samples from Binomial (10, 0.1) and Binomial (10, 0.9),
and repeat (a) and (b).
Ex. 9.1.9. Using the method from Exercise 9.1.2, and by suitably modifying the R code in Exercise 9.1.8, write R code that computes the method of moments estimate for a and b in Uniform(a, b) when a = 3 and b = 5 by generating 100 samples from Uniform(3, 5).
Ex. 9.1.10. Using the method from Example 9.1.2, and by suitably modifying the R code in Exercise 9.1.8, write R code that computes the method of moments estimate for µ and σ 2 in Normal(µ, σ 2 ) when µ = 4 and σ 2 = 10 by generating 100 samples from Normal(4, 10).
Ex. 9.1.11. Using the method from Exercise 9.1.6, and by suitably modifying the R code in Exercise 9.1.8, write R code that computes the method of moments estimate for a and b in Gamma (a, b) when a = 10 and b = 0.5 by generating 100 samples from Gamma (10, 0.5).
For n ≥ 1, let X1 , X2 , . . . , Xn be an i.i.d. sample from the distribution X. Assume that X has
either probability mass function or probability density function denoted by f (x | p) depending on
parameter(s) p = (p1 , p2 , . . . , pd ) ∈ P ⊂ Rd .
Definition 9.2.1. The “likelihood function” for the sample $X_1, X_2, \ldots, X_n$ is the function $L : P \times \mathbb{R}^n \to \mathbb{R}$ given by
$$L(p; X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} f(X_i \mid p).$$
One observes readily that the likelihood function is the joint density or joint mass function of $(X_1, X_2, \ldots, X_n)$ when $p$ is fixed. Assuming that the MLE $\hat{p}$ as defined above is unique, it can be thought of as the most “likely” value of the parameter $p$ for the given realisation of $X_1, X_2, \ldots, X_n$, as for any other parameter value $\tilde{p}$, the corresponding joint density or joint mass function has a lower value at $(X_1, X_2, \ldots, X_n)$. If $\hat{p}$ is not unique, the same is true for any $\tilde{p}$ which is not an MLE.
To find the MLE, treating the given realisation $X_1, X_2, \ldots, X_n$ as fixed, one needs to maximise $L$ as a function of $p$. Noting that maximising $L$ is equivalent to maximising $\log_e L$ (as logarithm is an increasing function), the problem is then to find the minimum of $g : \mathbb{R} \to \mathbb{R}$ given by
$$g(p) = \sum_{i=1}^{n} (X_i - p)^2.$$
Method 1: Since $g(p) = \sum_{i=1}^{n} (X_i - \bar{X})^2 + n(\bar{X} - p)^2$ (see Exercise 9.2.2) and the first term does not depend on $p$, the minimum of $g$ will occur at $\hat{p} = \bar{X}$.
Method 2: An alternative approach is to use differential calculus. As $g$ is a quadratic function of $p$, it is differentiable at all $p$, and
$$g'(p) = -2 \sum_{i=1}^{n} (X_i - p) \quad \text{and} \quad g''(p) = 2n.$$
As $g''(\cdot) > 0$, the minimum will occur when $g'(p) = 0$. This occurs when $p$ is equal to $\frac{1}{n} \sum_{i=1}^{n} X_i$. So the MLE of $p$ is given by $\hat{p} = \bar{X}$. ■
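Both methods can be checked numerically. The following sketch minimises $g$ with the base R function `optimize` on simulated data; the simulated sample, its parameters, the seed and the test point `p0` are illustrative assumptions.

```r
set.seed(6)
x <- rnorm(25, mean = 3, sd = 1)   # illustrative sample with unknown mean

# Sum of squares g(p) to be minimised
g <- function(p) sum((x - p)^2)

# Numerical minimiser over the range of the data
fit <- optimize(g, interval = range(x), tol = 1e-10)
c(numerical = fit$minimum, sample_mean = mean(x))

# Decomposition used in Method 1, checked at an arbitrary point p0
p0 <- 1.7
decomposition_ok <- isTRUE(all.equal(
  g(p0),
  sum((x - mean(x))^2) + length(x) * (mean(x) - p0)^2
))
decomposition_ok
```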
Example 9.2.3. Let $p \in (0, 1)$ and $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a population distributed as Bernoulli$(p)$. The probability mass function $f$ can be written as
$$f(x \mid p) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} p^x (1 - p)^{1 - x} & \text{if } x \in \{0, 1\} \\ 0 & \text{otherwise.} \end{cases}$$
To find the MLE, treating the given realisation $X_1, X_2, \ldots, X_n$ as fixed, one needs to maximise $L$ as a function of $p$. We can use calculus to do this, but differentiating $L$ is cumbersome, so as before we look at $\log_e L$, which is called the log likelihood function.
As $\sum_{i=1}^{n} X_i$ is fixed for the purpose of this maximisation problem, we can approach the problem separately for the three cases above.
In the first case with $0 < \sum_{i=1}^{n} X_i < n$, we can rewrite
$$\ell(p; X_1, X_2, \ldots, X_n) = \log_e(p) \left(\sum_{i=1}^{n} X_i\right) + \left(n - \sum_{i=1}^{n} X_i\right) \log_e(1 - p).$$
Differentiating with respect to $p$ gives
$$\ell'(p; X_1, X_2, \ldots, X_n) = \frac{1}{p} \left(\sum_{i=1}^{n} X_i\right) - \frac{1}{1 - p} \left(n - \sum_{i=1}^{n} X_i\right)$$
and
$$\ell''(p; X_1, X_2, \ldots, X_n) = -\frac{1}{p^2} \left(\sum_{i=1}^{n} X_i\right) - \frac{1}{(1 - p)^2} \left(n - \sum_{i=1}^{n} X_i\right).$$
As $0 < \sum_{i=1}^{n} X_i < n$, $\ell''(p; X_1, X_2, \ldots, X_n) < 0$ for all $p$. So the global maximum will occur at the point where $\ell'(p; X_1, X_2, \ldots, X_n) = 0$. This happens when $p = \frac{1}{n} \sum_{i=1}^{n} X_i$.
In the second case with $\sum_{i=1}^{n} X_i = 0$, $\ell$ is a decreasing function of $p$ and the maximum occurs at $p = 0$, which can be trivially rewritten as $p = \frac{1}{n} \sum_{i=1}^{n} X_i$ in this case.
In the third case with $\sum_{i=1}^{n} X_i = n$, $\ell$ is an increasing function of $p$ and the maximum occurs at $p = 1$, which can be trivially rewritten as $p = \frac{1}{n} \sum_{i=1}^{n} X_i$ in this case.
Combining the three cases, we can conclude that the MLE of $p$ is $\hat{p} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}$. ■
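The first case can be checked numerically by maximising the log likelihood directly. This is a sketch on simulated Bernoulli data using the base R function `optimize`; the sample size, success probability, seed and names are illustrative.

```r
set.seed(7)
x <- rbinom(40, size = 1, prob = 0.3)   # illustrative Bernoulli sample

# Log likelihood of p for the observed 0/1 values
loglik <- function(p) sum(x) * log(p) + (length(x) - sum(x)) * log(1 - p)

# Numerical maximiser over the open interval (0, 1)
fit <- optimize(loglik, interval = c(1e-6, 1 - 1e-6), maximum = TRUE, tol = 1e-10)
c(numerical = fit$maximum, sample_mean = mean(x))
```

The numerical maximiser and the sample mean agree up to the tolerance of the optimiser, provided the sample contains both zeros and ones.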
At times we may wish to maximize the likelihood as a function of a parameter that takes values in
a discrete set. Consider a collection of empirical measurements of waiting times. Suppose we know
that each waiting time is the sum of some fixed number of i.i.d. Exp(λ) distributions, but we are
not certain how many such distributions are in each sum. We might let m represent that unknown
number, and attempt to find the m which maximizes the likelihood. As we have previously seen,
such sums have a Gamma (m, λ) distribution. In the example below we will assume λ is known.
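Example 9.2.4. Let $\lambda > 0$ and let $m$ be an unknown positive integer. Let $X_1, X_2, \ldots, X_n$ be an i.i.d. sample from a population distributed as Gamma$(m, \lambda)$. Then the likelihood function is given by
$$L(m) = L(m; X_1, \ldots, X_n) = \prod_{i=1}^{n} \frac{\lambda^m}{(m - 1)!} X_i^{m - 1} e^{-\lambda X_i}.$$
Consider the ratio of the likelihood at consecutive values of $m$,
$$\frac{L(m + 1)}{L(m)} = \prod_{i=1}^{n} \frac{\lambda X_i}{m} = \frac{\lambda^n \prod_{i=1}^{n} X_i}{m^n}.$$
This ratio is a decreasing function of $m$, so $L(m)$ is maximized at the smallest value of $m$ for which this ratio is less than 1. Therefore the maximum likelihood estimate for $m$ is the smallest integer which is larger than $\lambda \left(\prod_{i=1}^{n} X_i\right)^{1/n}$. The quantity $\left(\prod_{i=1}^{n} X_i\right)^{1/n}$ is known as the “geometric mean” of the $X_1, X_2, \ldots, X_n$ values. ■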
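The closed-form rule can be compared with a brute-force search over integer values of $m$. The sketch below simulates data with `rgamma` using the shape and rate parameterisation, which is assumed here to match Gamma$(m, \lambda)$ in the text; the true values, seed, search range and names are illustrative.

```r
set.seed(8)
lambda <- 2; m_true <- 5; n <- 50   # illustrative known rate and true shape
x <- rgamma(n, shape = m_true, rate = lambda)

# Smallest integer larger than lambda times the geometric mean
geo_mean <- exp(mean(log(x)))
m_formula <- floor(lambda * geo_mean) + 1

# Brute-force maximisation of the log likelihood over m = 1, ..., 50
loglik <- function(m) sum(dgamma(x, shape = m, rate = lambda, log = TRUE))
m_grid <- 1:50
m_search <- m_grid[which.max(sapply(m_grid, loglik))]

c(formula = m_formula, grid_search = m_search)
```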
As a final example, let us revisit Example 9.1.1, where we considered a Binomial distribution with
both parameters unknown.
To obtain MLEs for N and p, we need to maximise this expression as a function of N and p, for a
fixed set of empirical observations X1 , X2 , . . . , Xn .
This is unfortunately not easy to do explicitly. We can simplify the problem by observing that we have already calculated an estimate of $p$ if $N$ is known. In that case, $\sum_{i=1}^{n} X_i$ is the sum of $Nn$ independent Bernoulli$(p)$ random variables, for which the MLE of $p$ is (see Example 9.2.3 and Exercise 9.2.9)
$$\hat{p} = \frac{\sum_{i=1}^{n} X_i}{Nn}.$$
By plugging in this estimator in the expression for $L(N, p; X_1, X_2, \ldots, X_n)$, we obtain the so called “profiled likelihood function”
$$\tilde{L}(N) \equiv L(N, \hat{p}; X_1, X_2, \ldots, X_n),$$
which can now be viewed as a function of N only. Such profiled likelihood functions, where some
parameters in the likelihood function are replaced by estimators that depend on the remaining
parameters, are useful because they reduce the number of parameters over which the maximization
problem needs to be solved. It is easy to see that maximizing the profiled likelihood is equivalent
to maximizing the original likelihood function.
Unfortunately, further theoretical analysis of this function is difficult. Numerically, however, this
problem is not difficult to solve. Consider again the empirical realisations given in Example 9.1.1,
with n = 10 and observations 8, 7, 6, 11, 8, 5, 3, 7, 6, 9. Clearly, N must be at least 11, the largest
of the observations. Let us use R to compute the logarithm of the profiled likelihood for values of
N from 11 to 50.
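One way to carry out this computation is sketched below. It assumes the result is stored in a data frame `d` with columns `N`, `phat` and `logL`, matching the printed output further down; the construction itself is an illustration and may differ from the code used to produce that output.

```r
# Observations from Example 9.1.1
x <- c(8, 7, 6, 11, 8, 5, 3, 7, 6, 9)

# Candidate values of N and the corresponding MLE of p for each N
N <- 11:50
phat <- sum(x) / (N * length(x))

# Log of the profiled likelihood: Binomial log likelihood evaluated at (N, phat)
logL <- sapply(seq_along(N), function(i) {
  sum(dbinom(x, size = N[i], prob = phat[i], log = TRUE))
})

d <- data.frame(N = N, phat = phat, logL = logL)
```

A plot such as `plot(logL ~ N, data = d)` would then display the profile.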
We can now plot these log-likelihood values, as we do in Figure 9.1, or look at them directly.
[Figure 9.1: logarithm of the profiled likelihood (logL, vertical axis) plotted against N (horizontal axis, from 10 to 50).]
head(d, 15)
N phat logL
11 0.6363636 -22.77295
12 0.5833333 -22.14081
13 0.5384615 -21.86794
14 0.5000000 -21.73123
15 0.4666667 -21.65932
16 0.4375000 -21.62190
17 0.4117647 -21.60412
18 0.3888889 -21.59799
19 0.3684211 -21.59894
20 0.3500000 -21.60426
21 0.3333333 -21.61225
22 0.3181818 -21.62184
23 0.3043478 -21.63233
24 0.2916667 -21.64328
25 0.2800000 -21.65438
By inspecting the first few rows of the table, we see that the likelihood is maximized at N̂ = 18
and p̂ = 0.389. These estimates are not very different from the ones we obtained using the method
of moments. ■
Similar numerical methods are required to compute the maximum likelihood estimate in many
other examples (See Exercise 9.2.10).
Remark 9.2.6. In general, one may not be able to compute the sampling distribution (i.e. probability
distribution of the random-sample-based statistic) of the maximum likelihood estimate. However
there is a well-understood theory of limiting distributions of maximum likelihood estimators, whether
they are available in closed form solutions or not. They are “better” than other estimators in terms
of variance, and follow a normal distribution, asymptotically. A detailed discussion and proof of
these results are beyond the scope of this book.
Sampling distributions of these estimates do play an important role in obtaining confidence intervals (see Section 9.3) and tests of hypotheses (see Chapter 10). For this one must understand the limiting behaviour of the sampling distributions.
exercises
Ex. 9.2.1. In the examples above we have used the fact that an exponential with negative exponent
may be maximized by minimizing the exponent. Prove that this is generally true. Suppose f (x) is
a function which achieves a minimum when x = a. Let g (x) = e−f (x) . Prove that g (x) achieves a
maximum when x = a.
Ex. 9.2.2. Show that for any real numbers $p, x_1, x_2, \ldots, x_n$,
$$\sum_{i=1}^{n} (x_i - p)^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - p)^2.$$
Ex. 9.2.3. Let λ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with Exponential(λ)
distribution.
Ex. 9.2.4. Let X1 , X2 , . . . , Xn be an i.i.d. sample from a population with Poisson(λ) distribution,
where λ is known to be strictly positive.
(b) Prove that if at least one of the $X_j$ values is non-zero, then the maximum likelihood estimate for $\lambda$ is $\bar{X}$.
(c) Prove that if all of the Xj values are zero, then L(λ ; X1 , X2 , . . . , Xn ) has no maximum value
for λ > 0.
Ex. 9.2.5. Let 0 < p < 1 and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with
Geometric(p) distribution.
(b) Let ℓ(p ; X1 , X2 , . . . , Xn ) = loge (L(p ; X1 , X2 , . . . , Xn )). Find the value of p for which
ℓ(p ; X1 , X2 , . . . , Xn ) is maximized.
Ex. 9.2.6. Let σ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with Normal(0, σ 2 )
distribution.
(c) Prove that this maximum likelihood estimate is also an unbiased estimator for σ 2 in this
case.
Ex. 9.2.7. Let µ ∈ R, σ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with
Normal(µ, σ 2 ) distribution.
(a) Find the likelihood function L(µ, σ ; X1 , X2 , . . . , Xn ).
(b) Find the maximum likelihood estimators of µ and σ 2 .
Ex. 9.2.8. Let a < b and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with Uniform(a, b)
distribution. Prove that the maximum likelihood estimates for a and b are min{X1 , X2 , . . . , Xn }
and max{X1 , X2 , . . . , Xn } respectively.
Ex. 9.2.9. Let m be a known integer and p ∈ (0, 1) be an unknown parameter. Let X1 , X2 , . . . , Xn
be an i.i.d. sample from a population with Binomial(m, p) distribution.
(a) Find the likelihood function L(p ; X1 , X2 , . . . , Xn ).
(b) Let ℓ(p ; X1 , X2 , . . . , Xn ) = loge (L(p ; X1 , X2 , . . . , Xn )). Find the value of p that maximizes
ℓ(p ; X1 , X2 , . . . , Xn ).
(c) Prove that the maximum likelihood estimate for p is X̄/m.
Ex. 9.2.10. Let θ > 0 be unknown and X1 , X2 , . . . , Xn be an i.i.d. sample from a population with
Cauchy(θ, 1) distribution.
Ex. 9.2.11. Suppose we have a sample of size n from a Multinomial distribution with parameters
k, (p1 , p2 , . . . , pk ). Let Xj , 1 ≤ j ≤ k, represent the number of samples that correspond to the j-th
outcome. Show that for each 1 ≤ j ≤ k the MLE for pj is p̂j = Xj /n.
In the previous sections, we have considered data X1 , X2 , . . . , Xn whose distributions are governed
by parameters and described two general methods (namely the methods of moments and of maximum
likelihood) to estimate the parameters of the model from this data. In this section, we will try to
understand how we can quantify the accuracy of estimates.
We will start with the simple model considered in Example 9.2.2 where data are distributed as
Normal with unknown mean but known variance.
where the probability value on the right hand side can be computed using R.
pnorm(3) - pnorm(-3)
[1] 0.9973002
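By manipulating the inequalities, we can write the equation above in its more standard form,
namely,
\[ P\left( \hat{p} - \frac{3}{\sqrt{n}} \le p \le \hat{p} + \frac{3}{\sqrt{n}} \right) = 0.9973, \]
or
\[ P\left( p \in \left[ \hat{p} - \frac{3}{\sqrt{n}},\ \hat{p} + \frac{3}{\sqrt{n}} \right] \right) = 0.9973. \]
It is common practice to express this interval more concisely as
\[ \hat{p} \pm \frac{3}{\sqrt{n}}. \]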
When viewed as a random interval, the probability statement above implies that the interval
contains p with probability 0.9973. This property is usually conveyed by the statement that the
interval is a 99.73% “confidence interval.” The factor of 3 used above, which led to the confidence
level of 99.73%, is arbitrary. It is more common to specify a desired confidence level and then
calculate the factor accordingly. For example, to get a confidence level of 95%, we want a factor z
such that
P (Z > z ) = P (Z < −z ) = 0.05/2 = 0.025,
-qnorm(0.025)
[1] 1.959964
-qnorm(0.10)
[1] 1.281552
\[ 10.279 \pm \frac{3}{\sqrt{15}} = [9.5047, 11.054]. \]
Of course, this specific interval may or may not contain the true mean. Our “confidence” is in
the procedure which was used to produce the interval, in the sense that it will yield an interval
containing the true mean p with probability 99.73%, as long as the data are from the postulated
Normal model. ■
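The computation in this example can be sketched in R as follows. The helper `z_interval` is an illustrative name rather than a function from the text, and the known standard deviation is assumed to be 1, which is what the half-width 3/√15 above implies. The data are the 15 values listed later in this section.

```r
# Two-sided interval for a Normal mean when the standard deviation is known.
# `z_interval` is a hypothetical helper; sigma = 1 is an assumption.
z_interval <- function(x, level = 0.95, sigma = 1) {
  z <- -qnorm((1 - level) / 2)   # factor matching the requested level
  mean(x) + c(-1, 1) * z * sigma / sqrt(length(x))
}

x <- c(11.22, 9.56, 10.06, 10.21, 10.95, 10.03, 10.75, 10.71,
       11.42, 9.61, 10.91, 8.14, 8.95, 10.57, 11.1)

# A factor of 3 corresponds to the level pnorm(3) - pnorm(-3)
ci_3  <- z_interval(x, level = pnorm(3) - pnorm(-3))
ci_95 <- z_interval(x, level = 0.95)
ci_3
ci_95
```

Calling the helper with `level = 0.95` or `level = 0.80` uses the factors obtained from `-qnorm(0.025)` and `-qnorm(0.10)` above.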
We could simulate data using R to verify that these confidence intervals do in fact contain the true
parameter as often as expected. It would be also interesting to evaluate the statistical properties of
the intervals when the data are from a different distribution. We do this in Section 9.3.2.
The key observation that allows us to write the probability statement (9.3.1) in Example 9.3.1
is that √n(p̂ − p) ∼ Normal(0, 1). In other words, we have found a function of the data and the
parameter of interest, namely,
\[ T(X_1, X_2, \ldots, X_n, p) = \sqrt{n}\,(\hat{p} - p), \]
so that regardless of the value of p, T (X1 , X2 , . . . , Xn , p) has a completely known distribution. Such
functions are sometimes called pivotal quantities. The derived confidence interval is completely
specified, in principle, once we choose an interval in the support of the known distribution that
has the desired probability. For instance, in the previous example, to get a confidence interval
with coverage probability β = 0.95, we require an interval [a, b] such that P (Z ∈ [a, b]) = β for a
standard Normal random variable Z. There are many such intervals, but intuitively, the interval
[−1.959964, 1.959964] ≈ [−1.96, 1.96] is a good choice because it is the shortest such interval, as
the density of Z is symmetric and decreases away from 0.
In general, we could choose any interval that has probability β. Popular alternative choices in the
standard normal case are (−∞, Φ−1 (β )] and [Φ−1 (1 − β ), ∞), which give “one-sided” confidence
intervals. Once we choose a suitable interval [a, b], the confidence “interval” for p is given by the set
\[ \{\, p : a \le T(X_1, X_2, \ldots, X_n, p) \le b \,\}. \]
This set is random as it depends on the random sample X1 , X2 , . . . , Xn , but will have a specific
realisation for any particular empirical sample. We will try to follow this approach in a few other
situations.
Example 9.3.2. Let λ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from an Exponential population
with rate λ. We are interested in obtaining a confidence interval for λ. Take
\[ T(X_1, X_2, \ldots, X_n, \lambda) = n \lambda \bar{X}, \]
which follows a Gamma(n, 1) distribution regardless of the value of λ.
As the Gamma distribution is not symmetric, the choice of a two-sided interval is not obvious. A
simple choice is given by taking the exclusion probabilities on both tails to be equal, as follows.
This will typically not be the shortest interval. The shortest interval cannot be obtained in closed
form, but can be computed numerically by varying the left and right tail exclusion probabilities
together so that they add up to 1 − β, and choosing the one giving the shortest interval. ■
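Both choices of interval can be sketched in R. The function names below are hypothetical, and the sketch assumes, as stated for this pivot elsewhere in the section, that nλX̄ follows a Gamma(n, 1) distribution. The shortest interval is found with `optimize()`, varying the left tail exclusion probability over (0, 1 − β).

```r
# Equal-tail interval for the rate lambda, from the pivot n * lambda * mean(x)
exp_ci_equal_tail <- function(x, level = 0.95) {
  n <- length(x)
  alpha <- 1 - level
  qgamma(c(alpha / 2, 1 - alpha / 2), shape = n) / (n * mean(x))
}

# Shortest interval: exclude probability t on the left and alpha - t on the right
exp_ci_shortest <- function(x, level = 0.95) {
  n <- length(x)
  alpha <- 1 - level
  len <- function(t) qgamma(t + level, shape = n) - qgamma(t, shape = n)
  t_opt <- optimize(len, interval = c(0, alpha))$minimum
  qgamma(c(t_opt, t_opt + level), shape = n) / (n * mean(x))
}

set.seed(1)                 # arbitrary seed, for reproducibility only
x <- rexp(15, rate = 2)     # illustrative data with true rate 2
ci_eq <- exp_ci_equal_tail(x)
ci_sh <- exp_ci_shortest(x)
rbind(equal_tail = ci_eq, shortest = ci_sh)
```

Both intervals have the same coverage probability; they differ only in how the excluded probability 1 − β is split between the two tails.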
Example 9.3.3. Let θ > 0 and X1 , X2 , . . . , Xn be from the Uniform(0, θ) distribution. We are
interested in a confidence interval for θ. Take
\[ T(X_1, X_2, \ldots, X_n, \theta) = \frac{X_{(n)}}{\theta}. \]
From Example 8.1.4, we know that T (X1 , X2 , . . . , Xn , θ ) has a Beta(n, 1) distribution, which has
an increasing density supported on (0, 1). Thus the shortest interval of probability β will have right
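endpoint 1. The left endpoint will depend on n. For example, with n = 15 and β = 0.95, the left
endpoint is the 1 − β = 0.05 quantile of the Beta(15, 1) distribution, given by
qbeta(0.05, 15, 1)
[1] 0.8189637
\[ \left\{ \theta : 0.8189637 \le \frac{X_{(n)}}{\theta} \le 1 \right\} = \left\{ \theta : X_{(n)} \le \theta \le \frac{X_{(n)}}{0.8189637} \right\}. \] ■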
In the next example we discuss two situations where the pivotal quantity approach will not work.
Example 9.3.4. Suppose we want to find a procedure to obtain confidence intervals for the mean
parameter when the underlying distribution is Bernoulli or Poisson.
(a) Let p > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from the Bernoulli(p) distribution. We are
interested in a confidence interval for p. Unfortunately, there is no obvious pivotal quantity
T (X1 , X2 , . . . , Xn , p) in this example, so this approach does not work.
(b) Let λ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from the Poisson(λ) distribution. We
are interested in a confidence interval for λ. We see immediately that T (X1 , . . . , Xn , λ) =
X1 + · · · + Xn has a Poisson(nλ) distribution and consequently depends on λ. We could thus
take n = 1 without loss of generality. However, there is again no obvious pivotal quantity
T (X1 , λ) in this example, so this approach does not work.
In both these examples, it is common practice to obtain approximate confidence intervals using the
Central Limit Theorem. We discuss this approach in the next section. ■
We conclude this section with a final example, returning to the Normal distribution in Example
9.3.1, where we assume that both mean and variance are unknown.
\[ T(X_1, X_2, \ldots, X_n, \mu, \sigma) = \sqrt{n}\, \frac{(\bar{X} - \mu)}{\sigma}. \]
With a factor of 1.96 for a 95% coverage probability, the corresponding confidence interval for µ
would then be
\[ \bar{X} \pm 1.96\, \frac{\sigma}{\sqrt{n}}, \]
which agrees with our intuition. Unfortunately, σ 2 is not known, so this approach is not valid.
However, as we do have a natural estimator S 2 for σ 2 (recall Theorem 7.1.6), we can try replacing
σ 2 by this estimate and arrive at
\[ T(X_1, X_2, \ldots, X_n, \mu) = \sqrt{n}\, \frac{(\bar{X} - \mu)}{S}. \]
Recall from Corollary 8.1.11 that the exact distribution of T (X1 , X2 , . . . , Xn , µ) is tn−1 and thus
is a pivotal quantity (see also Exercise 9.3.1). We can now proceed as we did in Example 9.3.1.
For the standard normal, the shortest interval with probability 0.95 was the symmetric interval
[−1.96, 1.96]. For the tn−1 distribution, similar quantiles can be computed using the qt() function.
For example, with n = 15, the right endpoint is given by
qt(0.975, df = 14)
[1] 2.144787
Considering again the n = 15 data points 11.22, 9.56, 10.06, 10.21, 10.95, 10.03, 10.75, 10.71, 11.42,
9.61, 10.91, 8.14, 8.95, 10.57, 11.1, we have µ̂ = X̄ = 10.279 and σ̂ = S = 0.9112, so the confidence
interval for µ is given by
\[ \hat{\mu} \pm 2.145\, \frac{\hat{\sigma}}{\sqrt{n}} = 10.279 \pm 2.145\, \frac{0.9112}{\sqrt{15}} = [9.775, 10.784]. \]
The ability to derive these kinds of confidence intervals, where we can control the confidence level
exactly as long as the model assumptions hold, is one of the main reasons for studying the t
distribution. ■
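As a check on the arithmetic, the interval can be recomputed directly from the data. The variable names below are illustrative; `t.test()` is the standard function in R's stats package and should report the same interval.

```r
x <- c(11.22, 9.56, 10.06, 10.21, 10.95, 10.03, 10.75, 10.71,
       11.42, 9.61, 10.91, 8.14, 8.95, 10.57, 11.1)
n <- length(x)
level <- 0.95

q  <- qt(1 - (1 - level) / 2, df = n - 1)        # t quantile with n - 1 degrees of freedom
ci <- mean(x) + c(-1, 1) * q * sd(x) / sqrt(n)   # mean +/- q * S / sqrt(n)
ci

# The same interval as computed by the built-in one-sample t procedure
ci_builtin <- t.test(x, conf.level = level)$conf.int
ci_builtin
```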
The theoretical calculations in the previous section guarantee that the confidence intervals we
derived will satisfy the target coverage, as long as the data come from the distribution assumed. It
is still a good idea to verify this using simulation. Doing so will also allow us to test how the coverage
probabilities change when the data do not come from the postulated distribution.
We can use R to construct confidence intervals using the formulas derived above. For example,
we can generate a random sample from the Normal distribution and then construct 80% and 95%
confidence intervals for the mean using the following code.
ci95
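A minimal sketch of such a calculation is given below. The sample size, true mean, true standard deviation and seed are assumptions chosen for illustration (mean 10 and standard deviation 3 match the setup of Figure 9.2), and the names `ci80` and `ci95` are illustrative.

```r
set.seed(20)                       # arbitrary seed, for reproducibility only
n <- 30                            # assumed sample size
x <- rnorm(n, mean = 10, sd = 3)   # assumed true mean and standard deviation

# t-based interval for the mean: mean(x) +/- quantile * S / sqrt(n)
ci80 <- mean(x) + c(-1, 1) * qt(0.90,  df = n - 1) * sd(x) / sqrt(n)
ci95 <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * sd(x) / sqrt(n)
ci80
ci95
```

The 95% interval always contains the 80% interval computed from the same sample, since both are centred at the sample mean and only the quantile changes.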
We can now repeat this process using replicate(), but to do so, it would be convenient to put
the code above into an R function. Functions in R encapsulate repetitive calculations as code that
can be run with different values of certain variables that are provided to the function as arguments.
In this case, we would like to repeat the above process many times, but with different data, and
possibly different confidence levels. So, we can create a function that takes two arguments, x and
level, and repeats the calculations above.
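One possible version of such a function is sketched below. The name `normal_mean_ci` is hypothetical, and the simulation settings (sample size 30, mean 10, standard deviation 3) are assumptions made for illustration.

```r
# x: numeric data vector; level: desired confidence level.
# `normal_mean_ci` is a hypothetical name.
normal_mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  q <- qt(1 - (1 - level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * q * sd(x) / sqrt(n)   # last expression is the returned value
}

set.seed(21)                                 # arbitrary seed
x <- rnorm(30, mean = 10, sd = 3)
normal_mean_ci(x, level = 0.80)
normal_mean_ci(x, level = 0.95)

# Ten fresh samples, one 80% interval per row
ci10 <- t(replicate(10, normal_mean_ci(rnorm(30, mean = 10, sd = 3), level = 0.80)))
ci10
```

Since `replicate()` returns one interval per column, the result is transposed with `t()` to give a matrix with one interval per row.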
When the function is called with these two arguments, the value of the last expression in the
function is returned as its value. So, we can now repeat our earlier calculations using this more
general function as follows.
[,1] [,2]
[1,] 8.843671 10.18292
[2,] 8.803891 10.46128
[3,] 9.327093 10.80719
[4,] 9.176958 10.63078
[5,] 9.625395 11.23960
[6,] 8.745298 10.96407
[7,] 10.237023 12.58567
[8,] 10.148164 11.66396
[9,] 9.636105 11.04266
[10,] 9.099662 10.42231
Of course, to estimate the coverage probability we need to repeat this process a much larger number
of times and check how many times the interval contains the true mean. We can do this for 10000
replications as follows.
[1] 0.795
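A sketch of this coverage experiment follows. The helper name `normal_mean_ci` and the simulation settings are assumptions; the name `cirep` is borrowed from the `summary()` call used later in the text, whose construction is not shown here. The estimated coverage depends on the seed, so it will be close to, but generally not equal to, the value printed above.

```r
# Hypothetical helper: t-based interval for the mean
normal_mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}

set.seed(22)   # arbitrary seed
mu <- 10       # assumed true mean
cirep <- t(replicate(10000,
                     normal_mean_ci(rnorm(30, mean = mu, sd = 3), level = 0.80)))

# Proportion of intervals that contain the true mean
coverage <- mean(cirep[, 1] <= mu & mu <= cirep[, 2])
coverage
```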
We leave it to the reader to verify that the coverage probabilities seen in simulation match the
target level regardless of the mean, standard deviation or sample sizes.
A related question is how the lengths of the confidence intervals, which are random quantities,
behave. The empirical distribution of the lengths obtained in the 10000 replications above can be
summarized as follows.
summary(cirep[,2] - cirep[,1])
[Figure 9.2 plot: confidence interval (vertical axis) against sample size (horizontal axis).]
Figure 9.2: Simulated 80% confidence intervals for the mean of a Normal population with mean
10 and variance 3², computed assuming mean and variance are both unknown.
The sample sizes used are from 10, 11, . . . , 200. The interval widths decrease on
average as sample size increases. We expect roughly 1 in 5 intervals to exclude
the true mean (shown in a different color) regardless of sample size as data are
generated from a Normal distribution.
Example 9.3.6. Consider data X1 , X2 , . . . , Xn from the Bernoulli(p) distribution, for some unknown
p ∈ (0, 1). We saw earlier that the pivotal quantity approach did not work for this model. However,
as the parameter of interest p is the mean, we could blindly apply the normal confidence interval
formula and expect to get something reasonable.
[Figure 9.3 plot: three panels, p = 0.05, p = 0.20 and p = 0.50, each showing confidence interval (vertical axis) against sample size (horizontal axis).]
Figure 9.3: Simulated 80% confidence intervals for Bernoulli with probability p = 0.05, 0.2,
and 0.5, computed using the normal confidence interval formula for the mean
when mean and variance are both unknown. The sample sizes used are from
10, 11, . . . , 200. Notice that some intervals go below zero, especially for p = 0.05,
where in addition, some intervals consist of the single point {0}. This happens
when all outcomes are 0, which may happen for small n and small p.
We have already generated one Bernoulli sample above, in the experiment where we computed
confidence intervals from normal data 10000 times. If we denote by Xi whether the i-th interval
contained the true mean, then the vector of observed Xi values is given by
mean(x)
[1] 0.795
which is close to the nominal coverage probability 0.8. However, in view of the current discussion,
we would be more reassured if we see that the value 0.8 is included in a reasonable confidence
interval. We obtain the following 95% interval by applying the normal confidence interval formula.
This is of course just one example, but we can evaluate the performance of the method using the
same techniques as above, replacing the data generating process to simulate Bernoulli data instead
of normal. Figure 9.3 plots confidence intervals for data generated using Bernoulli(p) for p = 0.05,
0.2, and 0.5, with the setup otherwise similar to Figure 9.2. This plot illustrates some problems with
small p, which are also present for large p close to 1, but otherwise suggests reasonable performance.
To estimate coverage probability and average length for specific combinations of p and n, we
can use the replication approach.
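A sketch of this replication approach is given below. Both function names are hypothetical, the number of replications is kept small for speed, and the particular combinations of n and p are chosen only for illustration.

```r
# Hypothetical helper: t-based interval for the mean
normal_mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}

# Estimated coverage and average length for Bernoulli(p) samples of size n
bernoulli_summary <- function(n, p, level = 0.95, nrep = 2000) {
  ci <- t(replicate(nrep, normal_mean_ci(rbinom(n, size = 1, prob = p), level)))
  c(n = n, p = p,
    coverage   = mean(ci[, 1] <= p & p <= ci[, 2]),
    meanLength = mean(ci[, 2] - ci[, 1]))
}

set.seed(23)   # arbitrary seed
res <- rbind(bernoulli_summary(10, 0.05), bernoulli_summary(1000, 0.05),
             bernoulli_summary(10, 0.50), bernoulli_summary(1000, 0.50))
res
```

When every outcome in a sample is 0 the sample standard deviation is 0, so the interval collapses to the single point {0} and cannot contain p. For n = 10 and p = 0.05 this happens with probability 0.95^10, roughly 0.6, which by itself keeps the coverage far below the target.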
It is reassuring to see that for large n the coverage probability is close to the target of 95%, with
the average interval length decreasing with n. However, for small n and small p, the observed
coverage probability is substantially smaller than the target. In the next section, we will compare
these results with other approximate confidence intervals obtained using asymptotic results. ■
Example 9.3.7. Consider data X1 , X2 , . . . , Xn from the Exponential(λ) distribution, for some
unknown λ > 0. Here we have an exact confidence interval for λ based on the pivotal quantity
nλX̄ which follows a Gamma(n, 1) distribution. However, as the population mean 1/λ coincides
with λ when λ = 1, we can try using the normal confidence interval as well. Below we contrast the
coverage probability and average interval length for the two methods for λ = 1.
Here again we observe that for n = 10, the normal interval has lower than desired coverage probability,
possibly stemming from the lower interval length on average. As in the Bernoulli case with small p,
the normal interval goes below zero in a small proportion of cases. As n increases, however, the
performance of the normal interval becomes comparable with the exponential interval. ■
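A comparison of this kind can be sketched as follows. All function names are hypothetical and the number of replications is an arbitrary choice. Note that the normal interval targets the population mean 1/λ while the Gamma-based interval targets the rate λ; the two targets coincide only because λ = 1 here.

```r
normal_mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}

# Equal-tail interval for the rate, from the pivot n * lambda * mean(x) ~ Gamma(n, 1)
exp_rate_ci <- function(x, level = 0.95) {
  n <- length(x)
  alpha <- 1 - level
  qgamma(c(alpha / 2, 1 - alpha / 2), shape = n) / (n * mean(x))
}

compare_exp <- function(n, lambda = 1, level = 0.95, nrep = 5000) {
  res <- replicate(nrep, {
    x <- rexp(n, rate = lambda)
    ci_n <- normal_mean_ci(x, level)   # interval for the mean 1 / lambda
    ci_e <- exp_rate_ci(x, level)      # interval for the rate lambda
    c(coverNormal  = ci_n[1] <= 1 / lambda && 1 / lambda <= ci_n[2],
      lengthNormal = diff(ci_n),
      coverExact   = ci_e[1] <= lambda && lambda <= ci_e[2],
      lengthExact  = diff(ci_e))
  })
  c(n = n, rowMeans(res))
}

set.seed(28)   # arbitrary seed
res <- rbind(compare_exp(10), compare_exp(100))
res
```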
Example 9.3.8. Consider data X1 , X2 , . . . , Xn from the Cauchy(θ, α2 ) distribution, for some
unknown location θ ∈ R that we are interested in estimating and an unknown scale α > 0. The
Cauchy distribution does not have finite mean, so it is unclear whether it makes sense to use a
confidence interval designed for the population mean. However, we can still blindly apply it and
see what happens. In the simulation below, we take θ = 0 and α2 = 1.
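A sketch of such a simulation is given below. The function names and the number of replications are assumptions made for illustration. Because interval lengths for Cauchy data are extremely heavy-tailed, the sketch reports the median length alongside the mean length.

```r
normal_mean_ci <- function(x, level = 0.95) {
  n <- length(x)
  mean(x) + c(-1, 1) * qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
}

# Coverage and length of the normal interval for Cauchy samples of size n
cauchy_summary <- function(n, theta = 0, alpha = 1, level = 0.95, nrep = 2000) {
  ci <- t(replicate(nrep,
                    normal_mean_ci(rcauchy(n, location = theta, scale = alpha), level)))
  len <- ci[, 2] - ci[, 1]
  c(n = n,
    coverage     = mean(ci[, 1] <= theta & theta <= ci[, 2]),
    meanLength   = mean(len),
    medianLength = median(len))
}

set.seed(24)   # arbitrary seed
res <- t(sapply(c(10, 100, 1000), cauchy_summary))
res
```

Unlike in the Normal case, the typical interval length here need not shrink as n grows, which is one way to see that the interval says little about θ.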
We can now simulate Normal confidence intervals as before, and summarize their properties for
various values of n.
n coverage meanLength
10 0.9535 4.817333
100 0.9779 2.771039
1000 0.9593 1.195399
Here again the coverage probabilities are close to the target and the confidence intervals decrease
in size as sample size increases. ■
We have seen in the previous section that the Normal confidence intervals obtained from data
X1 , X2 , . . . , Xn perform quite reasonably even when the underlying distribution is not Normal, with
the exception of the Cauchy distribution. This essentially follows from the Central Limit Theorem
(Theorem 8.4.1), which can be interpreted as saying that for large n we have the approximation
\[ P\left( \left| \sqrt{n}\, \frac{(\bar{X} - \mu)}{\sigma} \right| \le 1.96 \right) \approx 0.95. \]
In the simulation examples in the previous section, we have replaced σ by the sample standard
deviation S, and the normal quantile 1.96 by the corresponding tn−1 quantile. As n → ∞, the tn−1
correction becomes unnecessary, and it can be shown that
\[ \sqrt{n}\, \frac{(\bar{X} - \mu)}{S} \xrightarrow{d} \text{Normal}(0, 1) \]
under reasonable moment conditions on the underlying distribution (see Example 8.3.7). In this
section, we will see whether applying the Central Limit Theorem to specific models can provide
improved confidence intervals.
Example 9.3.10. Consider data X1 , X2 , . . . , Xn from the Bernoulli(p) distribution, for some
unknown p ∈ (0, 1). We saw in Example 9.3.6 that the normal confidence interval performs
reasonably well for large n but not for small n, especially when p is small. Let us see if the Central
Limit Theorem for this specific model leads us to a better interval.
The Normal confidence interval we have been using so far assumes that both mean and variance
are unknown. However, the Bernoulli model has only one parameter p, on which both mean and
variance depend. Specifically, the Central Limit Theorem (see Example 8.4.5) gives the asymptotic
distribution of the sample proportion p̂ as
\[ T(X_1, X_2, \ldots, X_n, p) = \sqrt{n}\, \frac{(\hat{p} - p)}{\sqrt{p(1 - p)}} \xrightarrow{d} \text{Normal}(0, 1). \]
This T (X1 , X2 , . . . , Xn , p) can be viewed as an approximately pivotal quantity, and following our
earlier approach, an approximate 95% confidence interval for p is given by
\[ \left\{ p : -Q \le \sqrt{n}\, \frac{(\hat{p} - p)}{\sqrt{p(1 - p)}} \le Q \right\}, \tag{9.3.2} \]
where Q = 1.96 is the 0.975 quantile of the standard Normal. The bounds will be achieved by p that
satisfy the quadratic equation
\[ (\hat{p} - p)^2 = \frac{Q^2}{n}\, p(1 - p). \]
It is easy to verify that this equation has two real solutions, given by
\[ \frac{1}{C_n + 1} \left( \frac{C_n}{2} + \hat{p} \pm \frac{\sqrt{C_n^2 + 4 C_n \hat{p}(1 - \hat{p})}}{2} \right) \]
where Cn = Q2 /n. These are called Wilson Confidence intervals [Wil27]. Ignoring the possibility
that these solutions may lie outside [0, 1], we can compute this confidence interval in R as follows
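A minimal sketch of this computation is given below. The name `wilson_ci` is hypothetical, and x is assumed to be a vector of 0/1 outcomes. For comparison, `prop.test()` in R's stats package with `correct = FALSE` reports the same interval.

```r
# x: vector of 0/1 outcomes; `wilson_ci` is a hypothetical name
wilson_ci <- function(x, level = 0.95) {
  n    <- length(x)
  phat <- mean(x)
  Q    <- qnorm(1 - (1 - level) / 2)
  Cn   <- Q^2 / n
  # the two roots of (phat - p)^2 = Cn * p * (1 - p)
  (Cn / 2 + phat + c(-1, 1) * sqrt(Cn^2 + 4 * Cn * phat * (1 - phat)) / 2) / (Cn + 1)
}

x <- rep(c(0, 1), times = c(50, 50))   # illustrative data: 50 successes in 100 trials
wilson_ci(x)
prop.test(sum(x), length(x), correct = FALSE)$conf.int
```

Unlike the interval that plugs in the sample standard deviation, this interval does not collapse to a single point when all outcomes are 0: the lower endpoint is 0 and the upper endpoint is Cn/(Cn + 1).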
We can now repeat our earlier simulation experiment from Example 9.3.6, this time with these
intervals instead of the normal interval.
Comparing with the results for the Normal interval, we see that the simpler linear interval has
roughly similar performance. However, the more complicated quadratic interval has substantially
better performance for small n and small p. It is important to remember that all these intervals
are based on asymptotic results, and it is difficult to analyse performance in small samples except
through simulation. Nonetheless, it is not surprising that the methods which replace the variance
term by an estimate do worse than the method that does not. ■
Just as the Central Limit Theorem for the sample mean allows us to formulate approximate
confidence intervals for the population mean, we can try to use the limiting distribution of
the sample median to obtain a confidence interval for the population median. For symmetric
distributions, the population median coincides with the population mean, so the two intervals are
for the same population parameter.
Let X1 , X2 , . . . be i.i.d. random variables with probability density function f with median µ.
Theorem 8.6.2 tells us that if f (µ) > 0, then
\[ 2 \sqrt{n}\, f(\mu) \left( \tilde{X}_n - \mu \right) \xrightarrow{d} Z, \]
where X̃n is the sample median obtained from X1 , X2 , . . . , Xn . We cannot use this result directly
to get a confidence interval for µ, because f (µ) is unknown. To estimate it, we shall proceed with
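the following approximations. First, from the fact that F −1 (F (x)) = x and an application of the chain
rule,
\[ f(\mu) = F'(\mu) = \frac{1}{(F^{-1})'(F(\mu))} \approx \frac{2}{\sqrt{n} \left( F^{-1}\!\left( \tfrac{1}{2} + \tfrac{1}{\sqrt{n}} \right) - F^{-1}\!\left( \tfrac{1}{2} - \tfrac{1}{\sqrt{n}} \right) \right)}. \]
Now if X(r ) is the r-th order statistic derived from X1 , X2 , . . . , Xn (see Section 8.1.1) then it is
intuitively clear that
\[ F^{-1}\!\left( \tfrac{1}{2} + \tfrac{1}{\sqrt{n}} \right) \approx X_{\left( \left[ n \left( \frac{1}{2} + \frac{1}{\sqrt{n}} \right) \right] \right)} \quad \text{and} \quad F^{-1}\!\left( \tfrac{1}{2} - \tfrac{1}{\sqrt{n}} \right) \approx X_{\left( \left[ n \left( \frac{1}{2} - \frac{1}{\sqrt{n}} \right) \right] \right)}. \]
This suggests the estimate
\[ \hat{f}_n(\mu) = \frac{2}{\sqrt{n} \left( X_{\left( \left[ n \left( \frac{1}{2} + \frac{1}{\sqrt{n}} \right) \right] \right)} - X_{\left( \left[ n \left( \frac{1}{2} - \frac{1}{\sqrt{n}} \right) \right] \right)} \right)}, \]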
for f (µ). The approximation above can be made rigorous and can be used to provide confidence
intervals by using the lemma below. We refer the reader to [Ser09] for a complete treatment of
convergences of sample quantiles.
\[ \hat{f}_n(\mu) \xrightarrow{p} f(\mu) \]
as n → ∞. An application of the Central Limit Theorem for the median (Theorem 8.6.2) along
with Slutsky’s Theorem (Lemma 8.3.10) yields the result. ■
\[ \tilde{X}_n \pm \frac{1.96}{2 \sqrt{n}\, \hat{f}_n(\mu)} \]
for confidence level 0.95, with the numerator on the right hand side adjusted for other confidence
levels. We can calculate this confidence interval using R as follows.
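One way to carry out this calculation is sketched below. The name `median_ci` is hypothetical. Two details are assumptions not fixed by the formulas above: the bracket [·] is taken to be the integer part, and the order-statistic indices are clamped to the range 1, . . . , n so that the function also runs for very small samples.

```r
# `median_ci` is a hypothetical name
median_ci <- function(x, level = 0.95) {
  n  <- length(x)
  xs <- sort(x)                                          # order statistics
  hi <- min(n, floor(n * (1 / 2 + 1 / sqrt(n))))         # upper index, clamped (assumption)
  lo <- max(1, floor(n * (1 / 2 - 1 / sqrt(n))))         # lower index, clamped (assumption)
  fhat <- 2 / (sqrt(n) * (xs[hi] - xs[lo]))              # estimate of the density at the median
  Q <- qnorm(1 - (1 - level) / 2)                        # 1.96 for level 0.95
  median(x) + c(-1, 1) * Q / (2 * sqrt(n) * fhat)
}

set.seed(25)       # arbitrary seed
x <- rnorm(200)    # illustrative data with true median 0
median_ci(x)
```

The half-width simplifies to Q(X(hi) − X(lo))/4, so the interval is driven entirely by the spacing between two central order statistics.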
keep the true mean and variance fixed at 0 and 1, but assume that they are both unknown. The
sample size is varied to see how well the median approximation works for small samples.
The coverage of the median interval is reasonable, although a little lower than the target for smaller
n. Observe that on average the length of the median interval is larger than the mean interval, in
spite of the lower coverage probability. This is related to the asymptotic efficiency of the median
discussed in the previous chapter. ■
Example 9.3.13. Median intervals are potentially more useful when the underlying distribution is
not Normal. An extreme case illustrating this is the Cauchy distribution. Continuing Example 9.3.8,
consider data X1 , X2 , . . . , Xn from the Cauchy(θ, α2 ) distribution, where we are interested in
estimating the unknown location θ ∈ R, and the unknown scale α > 0 is a nuisance parameter. As
we saw in Example 9.3.8, the Normal mean confidence interval completely fails in this case.
Repeating the experiment with the median interval, we get the following.
With the median confidence interval, we get reasonable coverage with the average interval lengths
decreasing with sample size, as we would expect. ■
Example 9.3.14. Continuing Example 9.3.9, consider the following model that mimics low-
probability contamination of the data by outliers. Suppose X1 , X2 , . . . , Xn come from the Normal(0, 1)
distribution with probability 0.99, but with probability 0.01, they arise from the Normal(0, 100)
distribution.
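Data from this contamination model can be generated as sketched below. The function name `rcontam` and its arguments are hypothetical, and the sketch assumes that each observation is contaminated independently of the others. Since Normal(0, 100) has variance 100, the contaminating standard deviation is 10.

```r
# Hypothetical sampler for the contamination model
rcontam <- function(n, prob = 0.01, sd_main = 1, sd_outlier = 10) {
  outlier <- rbinom(n, size = 1, prob = prob) == 1       # TRUE with probability `prob`
  rnorm(n, mean = 0, sd = ifelse(outlier, sd_outlier, sd_main))
}

set.seed(26)       # arbitrary seed
x <- rcontam(2e5)
var(x)             # theoretical variance: 0.99 * 1 + 0.01 * 100 = 1.99
median(x)          # true median is 0
```

Even 1% contamination roughly doubles the variance, which widens intervals built from the sample standard deviation, while the central order statistics used by the median interval are barely affected.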
Although we have not compared properties of the sample mean and the sample median, one
intuitively obvious property of the median is that changing the values of a few extreme data points
will not affect it, whereas it might affect the sample mean. We can thus expect the median to be
more stable under this contamination model, in the sense that it will have lower variance. Although
we will not do it here, we can formalize this property in terms of the relative asymptotic efficiency
of the sample median over the sample mean. In the last chapter, we noted that the sample mean
was more efficient than the sample median when data are obtained from the normal distribution, and
this explains the wider median confidence intervals in Example 9.3.12. However, the situation is
reversed in this model, where the median is more efficient, and will lead to narrower confidence
intervals on average. We can see this in the results of the following simulation.
As we can see, the coverage probabilities in both intervals are reasonably close to the target level of
95%, but the median intervals are much narrower on average. ■
exercises
Ex. 9.3.1. Let X1 , X2 , . . . , Xn be an i.i.d. sample from a Normal population with unknown mean
µ ∈ R and unknown variance σ 2 > 0. Show that the function
\[ T(X_1, X_2, \ldots, X_n, \mu) = \sqrt{n}\, \frac{(\bar{X} - \mu)}{S} \]
is a pivotal quantity. (Hint: define Zi = (Xi − µ)/σ and show that T (X1 , X2 , . . . , Xn , µ) = T (Z1 , Z2 , . . . , Zn , 0).
The distribution of T (Z1 , Z2 , . . . , Zn , 0) clearly does not depend on µ or σ 2 .)
Ex. 9.3.2. X1 , X2 , . . . , Xn i.i.d. Uniform(0, θ), Confidence interval for θ (one-sided? shortest
length of the form [aX(n) , bX(n) ])
\[ T(X_1, X_2, \ldots, X_n, \sigma^2) = \sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\sigma^2}. \]
It is easy to see from Theorem 8.1.10 that T (X1 , X2 , . . . , Xn , σ 2 ) has a χ²_{n−1} distribution. Use this
to obtain a confidence interval for σ 2 .
Ex. 9.3.4. Let σ > 0 and X1 , X2 , . . . , Xn be an i.i.d. sample from the Normal(0, σ 2 ) distribution.
(b) Using qchisq() in R, find a 95% confidence interval for σ 2 with equal tail exclusion
probabilities.
(b) Construct confidence intervals using the variance stabilizing formula from (a), R-code below
and cisummaryBernoulli discussed in Example 9.3.10.
• Are temperatures on average higher now than they were a hundred years ago?
• Are people with higher blood glucose levels at age 30 more likely to develop diabetes by age
60?
• Does smoking decrease life expectancy? Does eating organic food increase life expectancy?
One way in which we can formulate and answer such questions is by viewing each as the “test of a
hypothesis”, which is a central problem in statistics. A hypothesis test makes a conjecture or
hypothesis about the population (e.g., an attribute has the same distribution in two subpopulations,
or two attributes are independent) and then carries out a computation to test the credibility
of the conjecture. Probability theory is important in hypothesis testing because hypotheses are
based on probability models, and the computations required to arrive at a decision are done using
probabilistic techniques.
In this chapter we will discuss simplified prototype hypothesis testing problems. These may not
be as nuanced as many real life hypothesis testing problems but will convey their essence. The
techniques described are in fact practically useful in a wide range of situations.
10.1 introduction
could be to say “yes” if a suitable confidence interval for p includes the value 0.5. This is a perfectly
reasonable approach for this problem, but for now we will take an alternative, more direct approach
that is easier to generalize to other testing problems. In this case, for example, we may argue as
follows: “If the coin had an equal chance of showing heads or tails, then the probability of observing
67 heads or more in 100 flips is around 0.0004 (this can be verified using R). This number is small
enough to suggest that the hypothesis of heads and tails being equally likely is inaccurate”. ■
We will try to make this intuition concrete in a way that can be generalized to other situations. In
Chapter 9 we introduced the idea of maximum likelihood as a systematic approach to estimation.
In this chapter, we will continue using it to develop a unified approach to hypothesis testing. We
will see that this approach leads to useful testing procedures in many situations. However, we will
also come across situations where it does not. In practice we often resort to ad hoc procedures
instead. Such procedures, while important and useful, are beyond the scope of this book.
As before, we assume that the sample X1 , X2 , . . . , Xn are i.i.d. copies of a random variable
X with a probability mass function or probability density function f (x), where f (x) = f (x |
p1 , p2 , . . . , pd ) = f (x | p) depends on one or more unknown parameters p = (p1 , p2 , . . . , pd ) ∈ P ⊂
Rd for some d ≥ 1. A conjecture or “null hypothesis” about X restricts the possible values that
p can take, and is represented by the statement that p ∈ P0 , where P0 is a proper subset of P.
Just as we computed the maximum likelihood estimator p̂ as the value of p that maximizes the
likelihood function, we can also compute an estimator p̂0 that maximizes the likelihood function
within this smaller subset P0 . If the null hypothesis holds, we expect the likelihoods at p̂ and p̂0 to
be close to each other, whereas we expect the likelihood at p̂0 to be substantially smaller if the null
hypothesis does not hold.
Our goal is to develop this idea into an approach to hypothesis testing. In the formal testing
problem described above, this approach will lead to a “test statistic”, which is a function of the data
X1 , X2 , . . . , Xn . Naturally, the probability distribution of this statistic will depend on the unknown
parameter p ∈ P. To “test” the conjecture that p ∈ P0 for a specified P0 ⊂ P, one essentially asks
whether the observed value of the test statistic could conceivably have come from a value of p ∈ P0 ,
quantifying the degree of this possibility through a probability that is referred to as the “p-value”.
The only requirement, albeit a very important one, is that this p-value not depend on the unknown
parameter p, which is equivalent to saying that the distribution of the test statistic should be fully
known when p ∈ P0 , so that probability calculations involving the test statistic can be performed
explicitly leading to a numerical answer.
The precise notion of the p-value is important, but somewhat difficult to convey. We will get
to a formal definition only in Section 10.5.3, but loosely speaking, it can be thought of as the
probability of the test statistic being “at least as extreme” as the observed statistic if the null
hypothesis was true. This makes intuitive sense: an observed value of the test statistic that is
“likely” to occur if the null hypothesis were true supports the null hypothesis, so conversely one
that is “unlikely” contradicts it. However, only considering the probability of the observed statistic
under the null is not enough, and we need to consider the probabilities of “more extreme” outcomes
as well. To get a sense of why this is important, let us revisit Example 10.1.1 above.
Example 10.1.1. (Continued) Suppose we use the natural test statistic $S = \sum_{i=1}^{100} X_i$ for this example.
Under the null hypothesis that the coin is fair (P0 = {1/2}), S ∼ Binomial(100, 1/2), so P (S = 67)
and P (S ≥ 67) are given by
[1] 0.0002324713
[1] 0.0004368599
Both of these probabilities are small, and provide evidence against the null. But consider the
situation where the observed sum was 50 rather than 67. Clearly, this outcome is the strongest possible
evidence in favour of the null hypothesis. Here, the corresponding probabilities P (S = 50) and
P (S ≥ 50) are
[1] 0.07958924
[1] 0.5397946
Here P (S = 50) is arguably quite small, and if we make a decision based on P (S = 50) alone, we
ignore the fact that while small, it is still the highest probability for any individual outcome. The
contrast is more extreme if we consider a similar problem with 1000 instead of 100 tosses; in this
case, P (S = 500) and P (S ≥ 500) are
[1] 0.02522502
[1] 0.5126125
In general, the probability of individual outcomes decreases as the size of the sample space
“increases”, and in fact for continuous distributions, the probability of any singleton outcome is 0. It
is thus more natural to define the p-value as P (S ≥ s) where s is the observed quantity. ■
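The probabilities quoted in this example can be computed with the standard Binomial functions `dbinom()` and `pbinom()`, as sketched below; the variable names are illustrative. Since S takes integer values, P (S ≥ s) equals P (S > s − 1), which is what `pbinom(s - 1, ..., lower.tail = FALSE)` returns.

```r
# Under the null hypothesis, S ~ Binomial(n, 1/2)
p_eq_67   <- dbinom(67, size = 100, prob = 0.5)                       # P(S = 67)
p_ge_67   <- pbinom(66, size = 100, prob = 0.5, lower.tail = FALSE)   # P(S >= 67)

p_eq_50   <- dbinom(50, size = 100, prob = 0.5)                       # P(S = 50)
p_ge_50   <- pbinom(49, size = 100, prob = 0.5, lower.tail = FALSE)   # P(S >= 50)

p_eq_500  <- dbinom(500, size = 1000, prob = 0.5)                     # P(S = 500), 1000 tosses
p_ge_500  <- pbinom(499, size = 1000, prob = 0.5, lower.tail = FALSE) # P(S >= 500)

c(p_eq_67, p_ge_67, p_eq_50, p_ge_50, p_eq_500, p_ge_500)
```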
We will see later that the above notion of the p-value defined in terms of the probability of the test
statistic being “at least as extreme as the observed test statistic” is generally useful regardless of
the underlying model.
A natural generalization of the previous example is to situations where there are more than two
possible outcomes, and we want to “test”, based on observed data, whether the probabilities
associated with each outcome are as expected. A prototype of this problem is given by the following
example.
Example 10.2.1. Suppose we roll a six-sided die n times, and record the outcome of the i-th roll
as Ri . Assuming that the rolls are independent, the distribution of R1 , R2 , . . . , Rn is determined
by the vector of probabilities of each outcome, which we denote by p = (p1 , p2 , . . . , p6 ). Formally,
the parameter space for the problem is
\[ \mathcal{P} = \left\{ p = (p_1, p_2, \ldots, p_6) : 0 \le p_i \le 1 \text{ for all } i,\ \sum_{i=1}^{6} p_i = 1 \right\}. \]
We wish to test whether the die is fair, or in other words, pi = 1/6 for all i. This happens if p
belongs to the singleton subset P0 = {(1/6, 1/6, . . . , 1/6)} of P. ■
Recall the multinomial distribution discussed in Example 3.2.12. If we define the vector X =
(X1 , X2 , . . . , X6 ) as the number of times each outcome occurs, that is,
\[ X_j = \sum_{i=1}^{n} 1\{R_i = j\}, \qquad j = 1, 2, \ldots, 6, \]
then we can identify the distribution of X as the multinomial distribution with size n and probability
vector p. Another useful way to represent X is using indicator vectors as follows. For j = 1, 2, . . . , 6,
let ej be the 6-dimensional unit vector with 1 in the j-th position and 0 otherwise. Define the
i.i.d. random variables Yi = ej if Ri = j. Then it is easy to verify that the distribution of Yi is
Multinomial with parameter 1 and probabilities p = (p1 , p2 , . . . , p6 ) and $X = \sum_{i=1}^{n} Y_i$ is Multinomial
with size n and probabilities p.
There is no obvious test statistic, although intuitively it seems that large values of $\sum_{i=1}^{6} (X_i - n/6)^2$
would suggest that the equal probability hypothesis does not hold.
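The two representations of X and the informal statistic above can be sketched in R as follows. The variable names, the number of rolls and the seed are illustrative assumptions, and the rolls are simulated from a fair die.

```r
set.seed(27)                                   # arbitrary seed
n <- 600                                       # assumed number of rolls
R <- sample(1:6, size = n, replace = TRUE)     # outcomes R_1, ..., R_n of a fair die

# Counts X_j = number of rolls equal to j
X <- tabulate(R, nbins = 6)

# The same counts from indicator vectors: row i of Y is e_j when R_i = j
Y <- diag(6)[R, ]
X_from_Y <- colSums(Y)

# Informal measure of departure from equal probabilities
stat <- sum((X - n / 6)^2)
X
stat
```

How large `stat` must be before it counts as evidence against a fair die is not settled by this sketch; that requires the distribution of the statistic under the null hypothesis.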
In its more general form, when the number of outcomes is some fixed number m (not necessarily
6) and P0 is a singleton set consisting of a fixed element of P (not necessarily with all components
equal), this problem is known as the “goodness of fit” problem. Despite the simplicity of the
problem, we will see that it is difficult to find a simple solution.
Another simple but important problem that arises in the context of categorical data is to decide
whether two categorical attributes are independent. Once suitably framed, this problem can also
be formulated in terms of the multinomial distribution, but with a more complicated P0 . As with
the goodness of fit problem, there is no simple solution, yet the simplicity and wide applicability
of the problem make it important to study. We use the remainder of this section to formulate
the problem precisely. We then move on to a discussion of the testing problem in general and
some specific problems that are simpler to analyze. We will come back to the goodness of fit and
independence of categorical attributes problems towards the end of the chapter.
Example 10.3.1. Suppose a medical research team has come up with a potential vaccine for a
dangerous disease, and we have been asked to design an experiment to determine whether it is
effective. There are of course established protocols to design such studies (commonly known as
‘clinical trials’), but the following strategy is reasonable as a first attempt. Choose n individuals
from a vulnerable population, give the vaccine to n1 of them (giving the remaining n2 = n − n1
individuals a placebo as “control”). Then, wait for a reasonable period of time (which may depend
on the features of the disease) to see how many individuals are affected by the disease in each
group. The result may be summarized in a 2 × 2 table as follows, where X11 denotes the number of
vaccinated individuals who were affected, X12 denotes the number of vaccinated individuals who
were not affected and so on.

              Affected    Not affected
   Vaccine      X11           X12
   Placebo      X21           X22

If the vaccine is effective, we expect a smaller proportion of the vaccinated group to be affected. If
the chance of getting affected does not depend on whether the vaccine was given, then the vaccine
is ineffective. The principle of scientific skepticism suggests that we should start by supposing
independence (i.e., the vaccine has no effect), unless convinced otherwise by evidence. ■
It is easy to see that the setup in the previous example can apply to a wide range of problems
of a similar nature. The essential aspects are that one of two “treatments” is applied to each unit in a group
of experimental units, and one of two possible “outcomes” is then recorded for each unit. In the
example above, treatments are vaccine and placebo, and outcomes are affected and not affected,
but in general they can represent any binary categorical variable. The number of participants
given treatment k who have outcome ℓ is denoted by Xkℓ for k, ℓ = 1, 2. In general, neither the
number of possible treatments nor the number of possible outcomes needs to be two, and both can
be categorical attributes with an arbitrary number of categories.
In the example above, we have not explicitly introduced any randomness. Indeed, many aspects
of the experiment could be random, such as the way the participants are selected from the population
of interest, the number of participants, the number of participants given each treatment, and so on.
We will now consider ways in which parametric probability models can describe such experiments.
To use our usual parametric setup, we need a sample of i.i.d. observations. The natural
independent units in Example 10.3.1 are the participants in the study; in general, we may assume
that individuals or units are selected randomly from a population of interest. Each such individual
has two categorical attributes (coded by the numbers 1 and 2 for convenience), which we will
generally refer to as treatment and outcome, even if they are not actually treatments or outcomes
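in the literal sense. There are four possible treatment-outcome combinations for each unit, namely
(1, 1), (2, 1), (1, 2), and (2, 2). If we identify these combinations respectively with the four matrices
\[
\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad
\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \text{and} \quad
\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix},
\]
so that the combination (k, ℓ) corresponds to the matrix with a 1 in row k and column ℓ,
then it is easy to see that the summary table is in fact simply the sum of these matrices over all
units in the experiment.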
It is now quite simple to formulate the problem parametrically. The information available for each
individual unit (namely, the values of the “treatment” and “outcome” attributes) can be one of four
possible 2 × 2 matrices, which can thus be thought of as the sample space of outcomes in a random
experiment. If we assume that these outcomes, say Mi for the i-th individual, are distributed
independently and identically for each individual unit, then we are back in our usual setup where
we have n i.i.d. random observations (i.e., a random sample) M1 , M2 , . . . , Mn from some unknown
distribution. The unknown distribution is discrete with four possible outcomes, so the most general
parametric model assigns an unknown probability to each of the four outcomes, with the only constraint
that they must add up to 1. In other words, the parameters in the model are the probabilities
p11 of seeing the outcome $\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}$,
p21 of seeing $\begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}$,
p12 of seeing $\begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}$, and
p22 of seeing $\begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}$,
with the constraints that 0 ≤ p11 , p21 , p12 , p22 ≤ 1 and p11 + p21 + p12 + p22 = 1. This
model is precisely the categorical distribution we saw in the previous section, with the number of
possible outcomes being four rather than six. As the n random outcomes are independent, the
individual outcomes Mi can be combined and summarized by their sum, giving the 2 × 2 table
\[
X = \sum_{i=1}^{n} M_i = \begin{bmatrix} X_{11} & X_{12} \\ X_{21} & X_{22} \end{bmatrix}.
\]
The distribution of the table X can thus be identified as a multinomial
distribution, with size n and probability vector (p11 , p21 , p12 , p22 ). The parameter space is a subset
of R^4, specifically
\[
\mathcal{P} = \left\{ p = (p_{11}, p_{21}, p_{12}, p_{22}) : 0 \le p_{ij} \le 1, \ \sum_{i,j} p_{ij} = 1 \right\}.
\]
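As a quick illustration of this multinomial model, the following sketch draws one 2 × 2 table in R. The sample size, the seed and the four cell probabilities are hypothetical values chosen only for the example.

```r
set.seed(2)                         # arbitrary seed
n <- 200                            # hypothetical sample size
# Hypothetical cell probabilities, in the order (p11, p21, p12, p22)
p <- c(p11 = 0.1, p21 = 0.3, p12 = 0.4, p22 = 0.2)

# One multinomial draw of the four cell counts
counts <- rmultinom(1, size = n, prob = p)[, 1]

# matrix() fills column by column, so this ordering gives
# rows = treatment (1, 2) and columns = outcome (1, 2)
X <- matrix(counts, nrow = 2,
            dimnames = list(treatment = c("1", "2"), outcome = c("1", "2")))
X
```

The ordering (p11 , p21 , p12 , p22 ) of the probability vector is convenient here precisely because R stores matrices column by column.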
This multinomial model is appropriate when individuals or units are sampled from the population
independently, with the total sample size n fixed in advance. This assumption is usually reasonable
in observational studies where both attributes are intrinsic properties of the units being sampled,
for example, in a study of college students that records each individual’s gender and whether they
need corrective lenses. For controlled trials such as the one described in Example 10.3.1, it is not
immediately clear whether this model is appropriate because one of the attributes, namely the
treatment, is assigned by the experimenter and not an intrinsic property of the individual subjects.
Consider this alternative probability model for such experiments: Suppose individuals are
chosen independently at random from the population of interest. Each chosen individual is assigned
treatment 1 with probability π1 and treatment 2 with probability π2 = 1 − π1 . Let q11 be the
conditional probability of observing outcome 1 given treatment 1 and q12 = 1 − q11 . Similarly,
suppose the conditional probability of outcome 1 given treatment 2 is q21 and the conditional
probability of outcome 2 given treatment 2 is q22 . Then, the unconditional probability that
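individual i gets treatment 1 and has outcome 1 can be calculated as
\begin{align*}
P\left( M_i = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \right)
  &= P(\text{treatment} = 1, \ \text{outcome} = 1) \\
  &= P(\text{treatment} = 1) \, P(\text{outcome} = 1 \mid \text{treatment} = 1) \\
  &= \pi_1 q_{11}.
\end{align*}
Similarly, we have
\begin{align*}
P\left( M_i = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} \right) &= \pi_2 q_{21}, \\
P\left( M_i = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix} \right) &= \pi_1 q_{12}, \text{ and} \\
P\left( M_i = \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix} \right) &= \pi_2 q_{22}.
\end{align*}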
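The relation between the two parametrizations can be checked numerically. The sketch below uses hypothetical values of π1 , q11 and q21 ; the choice q11 = q21 is made deliberately, so that the outcome does not depend on the treatment. It relies only on the standard fact that two categorical attributes are independent exactly when every joint probability is the product of the corresponding marginals.

```r
pi1 <- 0.5; pi2 <- 1 - pi1          # hypothetical treatment probabilities
q11 <- 0.2; q12 <- 1 - q11          # hypothetical P(outcome | treatment 1)
q21 <- 0.2; q22 <- 1 - q21          # hypothetical P(outcome | treatment 2)

# Cell probabilities p_kl = pi_k * q_kl, arranged as a 2 x 2 matrix
p <- rbind(c(pi1 * q11, pi1 * q12),
           c(pi2 * q21, pi2 * q22))

sum(p)                              # the four probabilities add up to 1

# Row (treatment) and column (outcome) marginals
p_row <- rowSums(p)
p_col <- colSums(p)

# Because q11 == q21 here, the joint probabilities equal
# the product of the marginals
all.equal(p, outer(p_row, p_col))
```

Changing q21 to a value different from q11 makes the last comparison fail, which is the situation in which the treatment has an effect.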
Our current interest is in testing whether the treatment and outcome attributes are independent.
We will need to understand how this “hypothesis” translates to restrictions on P that define P0 ,
the parameter space under the null hypothesis. This is given by the following lemma.
Lemma 10.3.2. Let {pkℓ , πk and qkℓ : k, ℓ = 1, 2} be as in Section 10.3.1. Let pk◦ = pk1 + pk2
and p◦ℓ = p1ℓ + p2ℓ for k, ℓ = 1, 2. For an individual, let T denote the treatment and Y denote the
outcome. Then, the following are equivalent.
Thus the conjecture or hypothesis of independence that we are interested in testing can be stated
as a constraint on the parametric model, specifically, that the true parameter p is in
So far, in this chapter, we have discussed testing problems with one common theme: the primary
attribute we were interested in was a categorical outcome, and could be modeled using a categorical
distribution (or the Bernoulli distribution in the simplest case). Although we have not yet been able
to derive a satisfactory general approach to testing in this problem, the setup is useful in motivating
the next set of examples, where instead of a categorical outcome we consider a continuous outcome,
which we model using a Normal distribution.
In this section, we consider the continuous analog of Example 10.1.1, with one set of univariate
i.i.d. observations X1 , X2 , . . . , Xn from the Normal distribution.
As we will soon see in Definition 10.4.1, any statistic, or function of the available data, can be
a test statistic in principle. To be of practical use, the distribution of a test statistic should be
completely known under the null hypothesis. In the Bernoulli coin toss example above, P is the
interval [0, 1], P0 is the singleton set {0.5}, and the test statistic is the total number of heads in, say,
n tosses, namely $S = \sum_{i=1}^{n} X_i$. Although the distribution of S (Binomial) generally involves the unknown
p, it is completely known for p = 0.5.
Ideally one wants to find the test statistic that is also “best” in some sense, but often the
optimality of a test is difficult to establish. In Section 10.5, we will describe one principled approach
that often, but not always, leads to a test statistic with some optimality properties. Sometimes,
however, a suitable test statistic is easy to derive intuitively, as in the coin toss example above. In
this section, we consider some specific examples where such tests are available.
Intuitively, a good test statistic should satisfy two main criteria. Firstly, the sampling distribution
of the test statistic should be known when p ∈ P0 . Secondly, it should have the “power” to
distinguish whether the conjecture is true or false (i.e., its distribution when p ∈ P0 should differ
substantially from its distribution when p ∉ P0 ). We have seen one important example, namely the Bernoulli
model, where an intuitively appealing test statistic was easy to find, but we have also seen closely
related multinomial models where such test statistics were not available. We will now investigate a
few examples involving the Normal distribution, which has two parameters, to see whether we can
come up with similarly simple and intuitive test statistics. Our goal is to get a sense of what test
procedures typically look like, before discussing a more systematic but somewhat abstract approach
in the next section. Before proceeding, let us formally define the notion of a test statistic in this
context.
10.4.2 The Normal Distribution: Test for Sample Mean when σ is Known
Suppose X ∼ Normal(µ, σ²) where µ is an unknown mean, but σ is a known standard deviation.
Thus, here the parameter of interest p ≡ µ and parameter space P = R.
Two-sided test
Suppose we want to test the null hypothesis that µ = a, where a is some known value that has
special meaning in the context of the problem. Thus, here P0 = {a}.
Example 10.4.2. Let X1 , X2 , . . . , Xn be an i.i.d. sample from the distribution X ∼ Normal(µ, 1),
where µ ∈ P = R. We are interested in testing the null hypothesis that µ = 0, or equivalently, that
µ ∈ P0 = {0}. Clearly, an intuitively appealing test statistic is the observed sample mean X̄. The
farther X̄ is from 0, the less confident we would be that the null hypothesis is true. The question is
how far away it should be before we decide that the evidence against the null hypothesis is strong
enough.
Note that although we have assumed σ = 1 and a = 0, this does not lead to any loss in generality.
Suppose the true distribution of X was Normal(µ, σ 2 ), and we were interested in testing the null
hypothesis that µ = a. Consider the transformed observations Zi = (Xi − a)/σ. Then their common
distribution is Z ∼ Normal((µ − a)/σ, 1), and the null hypothesis µ = a ⇐⇒ (µ − a)/σ = 0. ■
The sample mean X̄ can thus be viewed as the test statistic $T(X_1, X_2, \ldots, X_n) = \frac{1}{n} \sum_{i=1}^{n} X_i$. In
general, a value of X̄ close to a tends to support the conjecture that µ = a, and a value away from
a tends to make us suspect it. But the strength of the evidence against the conjecture depends
not only on the difference X̄ − a, but also on σ. It is easy to see from results we have already
encountered that if the null hypothesis is true, that is, µ = a, then X̄ ∼ Normal(a, σ²/n), and so
\[
T_a(X_1, X_2, \ldots, X_n) = \sqrt{n} \, \frac{\bar{X} - a}{\sigma} \sim \text{Normal}(0, 1).
\]
The if in the last sentence is important, and bears emphasizing. Another way to express the same
statement is to say that if Y1 , Y2 , . . . , Yn are i.i.d. observations from the Normal(a, σ 2 ) distribution,
then
\[
T_a(Y_1, Y_2, \ldots, Y_n) = \sqrt{n} \, \frac{\bar{Y} - a}{\sigma} \sim \text{Normal}(0, 1).
\]
This distinguishes between the observed data X1 , X2 , . . . , Xn whose assumed distribution depends
on the unknown µ, and the hypothetical random variables Y1 , Y2 , . . . , Yn whose distribution is
assumed to satisfy the null hypothesis.
The known distribution of the test statistic, calculated from the distribution of the hypothetical
data for which the null hypothesis is true, is known as the “null distribution”. It then remains
to compare the observed value of Ta to this known distribution to obtain a p-value. The formal
definition of p-value will be given later, but as noted earlier, we can think of it as the probability
that under the null distribution we will see a value of Ta (Y1 , Y2 , . . . , Yn ) at least as extreme as
the value of Ta (X1 , X2 , . . . , Xn ) observed from the data at hand. Here, “more extreme” is to be
interpreted as values that are even less likely, and thus would have provided even more evidence
against the null hypothesis. In this case, large positive and large negative values of Ta are both
evidence against the null hypothesis, and the symmetry of the distribution of Ta (Y1 , Y2 , . . . , Yn )
suggests that the p-value may be computed as
\[
2 \, P\big( T_a(Y_1, Y_2, \ldots, Y_n) \ge |T_a(X_1, X_2, \ldots, X_n)| \big) \tag{10.4.1}
\]
given the observed value of the test statistic Ta (X1 , X2 , . . . , Xn ). We would reject the null hypothesis
if the p-value obtained in (10.4.1) is small enough (say smaller than 0.05).
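The two-sided calculation can be wrapped in a small R function. This is only a sketch: the function name z_test_two_sided is a hypothetical helper, not something defined in the text, and the data are simulated with an arbitrary seed. The p-value is taken as twice the upper tail probability of the Normal(0, 1) distribution beyond the absolute value of the observed statistic.

```r
# Two-sided p-value for the null hypothesis mu = a when sigma is known.
# z_test_two_sided is a hypothetical helper, not a function from the text.
z_test_two_sided <- function(x, a, sigma) {
  n <- length(x)
  T_a <- sqrt(n) * (mean(x) - a) / sigma      # observed test statistic
  p_value <- 2 * pnorm(abs(T_a), lower.tail = FALSE)
  list(statistic = T_a, p.value = p_value)
}

set.seed(3)                                   # arbitrary seed
x <- rnorm(25, mean = 0.5, sd = 2)            # simulated data, true mean 0.5
z_test_two_sided(x, a = 0, sigma = 2)
```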
One-sided test
The test above conjectures that µ = a, and considers deviations on either side to be departures
that invalidate the null hypothesis. A common variation of the above test is interested in deviations
from the null hypothesis in only one direction. Suppose, as before, that the distribution of X
is Normal(µ, σ 2 ) with known σ 2 and µ ∈ P = R. Further, X1 , X2 , . . . , Xn is an i.i.d. sample
from X, and we are interested in testing the null hypothesis that µ ≤ a, or equivalently, that
µ ∈ P0 = (−∞, a]. Intuitively, the data support the null hypothesis if X̄ ≤ a, and oppose it more
and more strongly the larger X̄ is than a. Here, we would expect the null hypothesis to be rejected
if X̄ − a is large. It would be natural to compute
\[
P_{\mu}\big( T_a(Y_1, Y_2, \ldots, Y_n) \ge T_a(X_1, X_2, \ldots, X_n) \big),
\]
where Y1 , Y2 , . . . , Yn ∼ i.i.d. Normal(µ, σ²) and reject if this probability is uniformly small for all
µ ≤ a. We thus define the p-value to be
\[
\max_{\mu \le a} P_{\mu}\big( T_a(Y_1, Y_2, \ldots, Y_n) \ge T_a(X_1, X_2, \ldots, X_n) \big).
\]
One can in fact show that this maximum is achieved for µ = a. Let us look at a specific numerical
example to make these ideas concrete.
Example 10.4.3. Students in a probability class were surveyed to obtain the sex and heights of
one of their parents, along with those of that parent’s oldest sibling, provided they had one. When
the siblings compared were both male, the differences in heights (oldest sibling − parent), rounded to
the nearest centimeter, were as follows: -5 -5 -4 -2 -2 1 2 2 3 4 4 5 7.
Assume that these values are i.i.d. observations from a Normal distribution with unknown
mean µ and known variance σ² = 3². We want to test the null hypothesis that µ = 0.
The observed value of X̄ is 0.769, and thus $T_0(X_1, X_2, \ldots, X_n) = \sqrt{13} \, (0.769/3) = 0.925$. The
probability that the random variable T0 (Y1 , Y2 , . . . , Yn ), which follows Normal(0, 1), is larger than
0.925 can be calculated in R as follows.
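The call itself is not shown here; one way to obtain this upper tail probability is sketched below. The exact expression used to produce the output that follows is an assumption, and the variable name p_upper is introduced only for illustration.

```r
# Upper tail probability P(Z > 0.925) for Z ~ Normal(0, 1)
p_upper <- pnorm(0.925, lower.tail = FALSE)
p_upper
```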
[1] 0.177483
This is the p-value for the one-sided null hypothesis µ ≤ 0. For the two-sided null hypothesis the
p-value would then be 2 × 0.1775 = 0.354. As both p-values are larger than 0.05 we do not reject
either of the null hypotheses. ■
We will discuss the interpretation of the p-value in more detail in Section 10.5, but the value 0.05 is
generally considered a reasonable cutoff between “weak” and “strong” evidence against the null
hypothesis.
10.4.3 The Normal Distribution: Test for Sample Mean when σ is Unknown
Despite all the elaborate notation, at its core the previous example is not very different from the
Binomial example we started with. In both cases, there is only one unknown parameter of interest,
and the null hypothesis fixes it to a particular known value, making the distribution of the data
essentially known if the null hypothesis holds. Obtaining a test statistic whose distribution would
be known under the null hypothesis is then almost trivial, as literally any function of the data
would satisfy this requirement.
This is rarely the case in realistic hypothesis testing scenarios. Suppose again that X ∼
Normal(µ, σ²), but now both the mean µ and the standard deviation σ are unknown. Thus, here
the parameter is p ≡ (µ, σ²) and the parameter space is P = R × (0, ∞).
The essential approach remains the same as before. We wish to find a test statistic whose
distribution does not depend on the unknown parameters when the null hypothesis is true. The
test statistic in the earlier case was
\[
T_a(X_1, X_2, \ldots, X_n) = \sqrt{n} \, \frac{\bar{X} - a}{\sigma} \sim \text{Normal}(0, 1).
\]
However, this is not a valid test statistic in this case because it cannot even be calculated, as it
involves σ, which is unknown. Intuitively, we may expect to fix this problem by replacing σ with
an estimate, say the sample standard deviation S (see Definition 7.1.5 and Exercise 7.1.8). This
leads to a proper statistic
\[
T_a(X_1, X_2, \ldots, X_n) = \sqrt{n} \, \frac{\bar{X} - a}{S}.
\]
For this modified statistic Ta (X1 , X2 , . . . , Xn ) to be useful as a test statistic, its distribution under
the null hypothesis should be completely known. In general, the distribution of Ta would depend on
µ and σ 2 . If the null hypothesis µ = a is true, then µ is known, but σ 2 is still unknown. However,
it is easy to see that the distribution of Ta (Y1 , Y2 , . . . , Yn ) does not depend on σ 2 in this case. To
see this, let Y1 , Y2 , . . . , Yn be an i.i.d. sample from Normal(a, σ 2 ) for some arbitrary σ 2 > 0. Note,
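\begin{align*}
T_a(Y_1, Y_2, \ldots, Y_n)
  &= \sqrt{n} \, \frac{\bar{Y} - a}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \\
  &= \sqrt{n} \, \frac{\dfrac{\bar{Y} - a}{\sigma}}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} \left( \dfrac{Y_i - \bar{Y}}{\sigma} \right)^2}}.
\end{align*}
Writing Zi = (Yi − a)/σ, which are i.i.d. Normal(0, 1), the last expression equals
$\sqrt{n} \, \bar{Z} \big/ \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (Z_i - \bar{Z})^2}$, whose distribution does not involve σ².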
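a <- 0                                            # value of the mean under the null hypothesis
x <- c(-5, -5, -4, -2, -2, 1, 2, 2, 3, 4, 4, 5, 7)
n <- length(x)
mu_x <- mean(x)                                   # sample mean
sigma_x <- sqrt(sum((x - mu_x)^2) / (n - 1))      # sample standard deviation S
T_x <- sqrt(n) * ((mu_x - a) / sigma_x)           # observed test statistic
T_x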
[1] 0.6964513
The value of Ta (X1 , X2 , . . . , Xn ) is thus 0.6965. To obtain the corresponding p-value for the
two-sided null hypothesis µ = 0, we must compute
\[
2 \, P\big( T_0(Y_1, Y_2, \ldots, Y_n) \ge 0.6965 \big),
\]
where Y1 , Y2 , . . . , Yn are i.i.d. Normal(0, σ²). If we did not know the distribution of Ta (Y1 , Y2 , . . . , Yn ), we could approximately compute
this probability by simulating values of Y1 , Y2 , . . . , Yn . The critical observation here is that it does
not matter that we do not know σ 2 , because the distribution of Ta (Y1 , Y2 , . . . , Yn ) does not depend
on σ 2 . Thus, for simulation, any value of σ 2 is as good as any other. Here we choose to simulate
with σ 2 = 1. Once Y1 , Y2 , . . . , Yn are available, we simply repeat the same calculations as above to
calculate Ta (Y1 , Y2 , . . . , Yn ).
n <- length(x)
T_sim <-
    replicate(1000000,
    {
        y <- rnorm(n, mean = 0, sd = 1)               # hypothetical data under the null
        mu_y <- mean(y)
        sigma_y <- sqrt(sum((y - mu_y)^2) / (n - 1))
        T_y <- sqrt(n) * (mu_y / sigma_y)             # statistic with a = 0
        T_y
    })
uprob <- mean(T_sim >= abs(T_x))                      # estimated upper tail probability
uprob
[1] 0.249762
As before, the p-value is twice the upper tail probability, which in turn is estimated by uprob, the
proportion of cases out of a million simulations where the simulated T0 (Y1 , Y2 , . . . , Yn ) exceeds
0.6965. The estimated p-value is thus 2 × 0.2498 = 0.4995. Of course, this estimated p-value will
vary every time the simulation is run, but it should be reasonably close to the correct answer. In
fact, as this estimate is based on estimating the proportion p from a series of Bernoulli trials, we
can use what we learned in Chapter 9 to obtain a confidence interval for the true p-value. Using
the bernoulliQuadraticCI() and bernoulliLinearCI() functions defined in Section 9.3.3, we
have the following 95% confidence intervals for the p-value. The two intervals are essentially
identical, as expected given the large number of replications.
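The functions from Section 9.3.3 are not reproduced here. As a rough stand-in, the following sketch computes a simple Normal-approximation interval for the simulated proportion and doubles its endpoints. The variable names are illustrative, the value of uprob is the one reported above, and this interval is not claimed to coincide with the output of either function.

```r
uprob <- 0.249762    # estimated upper tail probability reported above
N <- 1000000         # number of simulated replications

# Standard error of a proportion estimated from N Bernoulli trials
se <- sqrt(uprob * (1 - uprob) / N)

# Approximate 95% interval for the upper tail probability,
# doubled to give an interval for the two-sided p-value
ci_pvalue <- 2 * (uprob + c(-1, 1) * qnorm(0.975) * se)
ci_pvalue
```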
In this somewhat special situation, however, this simulation approach is not necessary because the
distribution of Ta (Y1 , Y2 , . . . , Yn ) is well-studied theoretically, with its density function available
in closed form and numerical algorithms available to evaluate its cumulative distribution function.
Recall that we learned about the t distribution in Chapter 8, and note in particular that
Corollary 8.1.11 immediately implies that in this case,
\[
T_a(Y_1, Y_2, \ldots, Y_n) = \sqrt{n} \, \frac{\bar{Y} - a}{S} \sim t_{n-1}.
\]
Tail probabilities of the t distribution can be computed in R using the pt() function, which is
analogous to the pnorm() function for the Normal distribution. Thus, the exact p-value in this case
can be computed as follows.
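The call is not shown here; a minimal self-contained sketch that computes this quantity is given below. The exact expression used to produce the output that follows is an assumption, and p_exact is a name introduced only for illustration.

```r
x <- c(-5, -5, -4, -2, -2, 1, 2, 2, 3, 4, 4, 5, 7)
n <- length(x)
T_x <- sqrt(n) * mean(x) / sd(x)      # observed statistic with a = 0

# Two-sided p-value: twice the upper tail probability of t with n - 1 df
p_exact <- 2 * pt(abs(T_x), df = n - 1, lower.tail = FALSE)
p_exact
```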
[1] 0.4994148
As the p-values we have computed for this example are all larger than 0.05, we would not have
rejected the null hypothesis that µ = 0, irrespective of the approach taken.
As long as we are considering tests that are intuitively appealing, the test statistics outlined above
are not the only possibilities. For example, the parameter µ in the Normal(µ, σ²) distribution is
the median as well as the mean, so it may be reasonable to define a test statistic based on the
sample median instead of the sample mean. Specifically, to test the null hypothesis that µ = a,
consider the statistic
\[
\tilde{T}_a(X_1, X_2, \ldots, X_n) = \sqrt{n} \, \frac{\operatorname{median}(X) - a}{\tilde{S}},
\]
where
\[
\tilde{S} = \frac{1}{n} \sum_{i=1}^{n} |X_i - \operatorname{median}(X)|.
\]
Essentially, T̃a (X1 , X2 , . . . , Xn ) replaces the sample mean by the sample median in the formula for
Ta (X1 , X2 , . . . , Xn ), with an analogous change in the estimate of scale. We can argue as before that
the distribution of T̃a (Y1 , Y2 , . . . , Yn ), when Y1 , Y2 , . . . , Yn are independent Normal(a, σ²), does
not depend on σ².
Continuing Example 10.4.3, we can obtain the value of T̃a (X1 , X2 , . . . , Xn ), with a = 0, as
follows.
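a <- 0
x <- c(-5, -5, -4, -2, -2, 1, 2, 2, 3, 4, 4, 5, 7)
n <- length(x)
mu_x <- median(x)                          # sample median (location estimate)
sigma_x <- sum(abs(x - mu_x)) / n          # mean absolute deviation from the median
T_x <- sqrt(n) * ((mu_x - a) / sigma_x)    # observed median-based statistic
T_x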
[1] 2.232008
The observed value of T̃a (X1 , X2 , . . . , Xn ) here is 2.232. To compute the corresponding p-value, we
again simulate values of T̃0 (Y1 , Y2 , . . . , Yn ) where Y1 , Y2 , . . . , Yn are i.i.d. Normal(0, 1).
T_sim_median <-
    replicate(1000000,
    {
        y <- rnorm(n, mean = 0, sd = 1)        # hypothetical data under the null
        mu_y <- median(y)
        sigma_y <- sum(abs(y - mu_y)) / n
        T_y <- sqrt(n) * (mu_y / sigma_y)      # median-based statistic with a = 0
        T_y
    })
uprob <- sum(T_sim_median >= abs(T_x)) / 1000000   # estimated upper tail probability
uprob
[1] 0.094342
The desired approximation to the p-value in this case is again twice the upper tail probability, i.e.
2 × 0.0943 = 0.1887. As this value is larger than 0.05, we will not reject the null hypothesis that
µ = 0.
One may reasonably wonder which of the above tests is “better”. This is an important question
in general, but it is beyond the scope of this book. In this specific case, the first test can be shown
to have more “power” to identify situations where the null hypothesis does not hold: This means
that if the true value of µ differs from a, then the test based on the mean has higher probability of
rejecting the null hypothesis (by producing a p-value less than 0.05) than the test based on the
median. This holds for all values of µ ̸= a, although the actual probabilities of rejection would
certainly depend on the value of µ. However, this assurance requires Normality of the underlying
measurement. As in Chapter 9, simulation studies can usually provide helpful guidance regarding
the performance of specific tests under various scenarios.
The examples in the previous section illustrate the “intuitive” approach for developing a test
statistic given a hypothesis of interest. While this approach is useful in many situations, one is often
interested in a principle that may be applied in a general setup, much as the maximum likelihood
principle served for estimation in Chapter 9. We will describe a similar principle for testing in this
section. Although we will not delve into theoretical properties of this approach, we mention two
important results about it: First, the resulting test can be easily shown to be optimal in the special
case where both P0 and P \ P0 are singleton sets (this is a fundamental result in statistics
known as the Neyman-Pearson Lemma; see [CasBer90]), as well as more generally for certain
families of distributions. Second, even though the distribution of the resulting test statistic under
the null hypothesis may not always be computable, it can be determined asymptotically under fairly
general conditions (see Wilks’ Theorem, Theorem 10.7.2).
Recall from Chapter 9 that the likelihood function given the sample (X1 , X2 , . . . , Xn ) is defined as
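\[
L(p; X_1, X_2, \ldots, X_n) = \prod_{i=1}^{n} f(X_i \mid p).
\]
The maximum likelihood estimator (MLE) of p is
\[
\hat{p} = \arg\max_{p \in \mathcal{P}} L(p; X_1, \ldots, X_n),
\]
where we use the notation “arg max” to denote the value of the argument p for which the maximum
is obtained. We can similarly define the MLE restricted to the null hypothesis being tested as
\[
\hat{p}_0 = \arg\max_{p \in \mathcal{P}_0} L(p; X_1, \ldots, X_n).
\]
The likelihood ratio is then defined as
\[
\lambda(X_1, \ldots, X_n) = \frac{L(\hat{p}_0; X_1, \ldots, X_n)}{L(\hat{p}; X_1, \ldots, X_n)}, \tag{10.5.1}
\]
and the likelihood ratio statistic as
\[
\Lambda(X_1, \ldots, X_n) = -2 \log \lambda(X_1, \ldots, X_n) = 2 \log \frac{L(\hat{p}; X_1, \ldots, X_n)}{L(\hat{p}_0; X_1, \ldots, X_n)}, \tag{10.5.2}
\]
which should perhaps be called the log-likelihood ratio statistic, and often is.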
The intuition behind these definitions is as follows. By definition of p̂ and p̂0 , we must have
L(p̂; X1 , . . . , Xn ) ≥ L(p̂0 ; X1 , . . . , Xn ) ≥ 0, hence 0 ≤ λ(X1 , . . . , Xn ) ≤ 1 and thus Λ(X1 , . . . , Xn ) ≥
0. Equality is achieved if the unrestricted MLE p̂ ∈ P0 . If p̂ ̸∈ P0 then Λ(X1 , . . . , Xn ) > 0 gives a
measure of how far p̂ is from P0 (in terms of L). The further p̂ is away from P0 , the less likely it is
that the null hypothesis p ∈ P0 is true. The general principle then is to believe the null hypothesis
if the likelihood ratio λ is close to one, or equivalently, the likelihood ratio statistic Λ is small (close
to zero), and suspect it when Λ is large. The reason for taking the logarithm and including the factor of 2
is that doing so makes the distribution of Λ more convenient, as will become clear in due course.
It still remains for us to determine how large Λ(X1 , . . . , Xn ) should be before we conclude that the
balance of evidence suggests that the null hypothesis is false. We next present a simple example
where the underlying distribution depends only on one parameter, but is nonetheless helpful in
understanding this question.
\[
P(T \le c_1) + P(T \ge c_2) = 1 - \sum_{k = c_1 + 1}^{c_2 - 1} \binom{10}{k} \left( \frac{1}{2} \right)^{10}.
\]
For example, if c1 = 2 and c2 = 8, the probability of Type I error is 0.109. If instead c1 = 1 and
c2 = 9, the probability of Type I error is 0.021.
On the other hand, if the null hypothesis is not true, we may still mistakenly accept the null
hypothesis. This is called a “Type II error” or a “false negative”. The probability of making a Type
II error depends on the true value of the parameter p and is given by
\[
P(c_1 < T < c_2) = 1 - \big[ P(T \le c_1) + P(T \ge c_2) \big] = \sum_{k = c_1 + 1}^{c_2 - 1} \binom{10}{k} p^k (1 - p)^{10 - k}.
\]
For example, if p = 0.75, then with c1 = 2 and c2 = 8, probability of Type II error is 0.474, and
with c1 = 1 and c2 = 9, probability of Type II error is 0.756. Similarly, for p = 0.9, with c1 = 2
and c2 = 8, probability of Type II error is 0.07, and with c1 = 1 and c2 = 9, probability of Type II
error is 0.264.
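These error probabilities are easy to reproduce in R with pbinom(). In the sketch below, the function names reject_prob, type1 and type2 are hypothetical helpers introduced only for illustration; the test rejects when T ≤ c1 or T ≥ c2 , with T the number of heads in 10 tosses.

```r
n <- 10   # number of tosses in the example

# P(reject) = P(T <= c1) + P(T >= c2) when T ~ Binomial(n, p)
reject_prob <- function(p, c1, c2) {
  pbinom(c1, n, p) + pbinom(c2 - 1, n, p, lower.tail = FALSE)
}

# Type I error: rejecting when the null hypothesis p = 0.5 is true
type1 <- function(c1, c2) reject_prob(0.5, c1, c2)

# Type II error: not rejecting when the true p differs from 0.5
type2 <- function(p, c1, c2) 1 - reject_prob(p, c1, c2)

round(c(type1(2, 8), type1(1, 9)), 3)
round(c(type2(0.75, 2, 8), type2(0.75, 1, 9)), 3)
round(c(type2(0.9, 2, 8), type2(0.9, 1, 9)), 3)
```

Widening the acceptance region from (2, 8) to (1, 9) lowers the first pair of numbers and raises the others, which is the trade-off discussed next.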
These calculations illustrate a general principle, that trying to decrease the probability of
Type I error (e.g., by controlling c1 and c2 in this example) generally results in an increase in the
probability of Type II error. The standard way to resolve this trade-off has two steps:
(a) Obtain a reasonable “test statistic”, such as the likelihood ratio statistic Λ(X1 , . . . , Xn ) defined
above. (b) Choose a threshold for the test statistic (see footnote 2), beyond which the null hypothesis is declared
to be rejected, in a manner that ensures that the probability of Type I error does not exceed a
pre-determined limit.
An “optimal” choice of test statistic in (a) would ensure that any other test with equal or
lower probability of Type I error will always have higher probability of Type II error. Using
Λ(X1 , . . . , Xn ) as the test statistic ensures such optimality in many situations, provided that it is
possible to compute and control the probability of Type I error for the resulting test. Such a test is
known as the likelihood ratio test.
With this background, we will now proceed to outline a general test procedure, before getting
back to the specific examples cited above.
Let X be a random variable with probability mass function or probability density function f (x),
where f (x) ≡ f (x | p) depends on one or more unknown parameters p ∈ P. Let X1 , X2 , . . . , Xn be
an i.i.d. random sample with common distribution X. In our approach, the test statistic is the
likelihood ratio statistic Λ(X1 , . . . , Xn ), and we reject the null hypothesis if Λ is large. However,
rather than trying to find a rejection cutoff that controls the probability of Type I error, we will
take a more modern approach and introduce the closely related concept of p-value.
Note that Λ(X1 , . . . , Xn ) is also a random variable having a corresponding sampling distribution
of its own. When performing a test, we will work with a specific realization of this sample. Henceforth,
we will denote this realized value of Λ(X1 , . . . , Xn ) by d, to emphasize that it is a constant for the
purposes of the test.
Footnote 2: More generally, a set of possible values of the test statistic for which the null hypothesis is rejected.
Now, the sample X1 , X2 , . . . , Xn was generated with a particular value of p that may or may
not have belonged to P0 . The p-value is concerned with the probabilistic behaviour of the random
variable Λ(X1 , . . . , Xn ) when the underlying parameter does belong to P0 . To distinguish a sample
from such a “thought experiment” from the actual realized sample, imagine a second set of i.i.d.
observations Y1 , Y2 , . . . , Yn from the distribution of X for some p ∈ P.
The notation Pp emphasizes that the probability calculations are done with parameter value
p, and d = Λ(X1 , X2 , . . . , Xn ) is the likelihood ratio statistic calculated from the observed
sample.
Whether we can actually compute the p-value still remains to be seen, and will depend on the
problem. Assuming that it can be, consider the following test procedure.
Definition 10.5.3. (Level α test) Fix 0 < α < 1. Let X1 , X2 , . . . , Xn be an i.i.d. sample
from a population with distribution X. To test the null hypothesis p ∈ P0 at level α, compute
the p-value as above, and reject the null hypothesis if the p-value is less than or equal to α.
Otherwise, accept the null hypothesis.
The following result establishes that the probability of Type I error for this test procedure does not
exceed α. This property is conventionally taken to be the defining characteristic of a level α test.
Theorem 10.5.4. For a level α test obtained from the likelihood ratio statistic Λ(X1 , . . . , Xn ) with
p-value computed according to (10.5.3), the probability of Type I error does not exceed α.
PY (Λ(Y1 , . . . , Yn ) ≥ Λ(X1 , . . . , Xn )) ≤ α
as this inequality must hold for all p ∈ P0 and hence in particular for p0 . Therefore, we must have
This last result follows from Lemma 10.5.5 below, with Z = Λ(X1 , . . . , Xn ) and W = Λ(Y1 , . . . , Yn ).
■
Lemma 10.5.5. Let Z and W be i.i.d. random variables with distribution function F . Then
PZ (PW (W ≥ Z ) < α) ≤ α.
Proof. The proof is simple if we assume F is continuous and strictly increasing on its support.
PZ ( PW ( W ≥ Z ) < α ) = PZ ( 1 − F ( Z ) < α )
= PZ ( F ( Z ) > 1 − α )
= PZ (Z > F −1 (1 − α))
= 1 − F (F −1 (1 − α))
= 1 − (1 − α ) = α
The result also holds for more general F , but that case requires more careful manipulation, with
the first two equalities in the proof above becoming inequalities. ■
Let us now return to Example 10.5.1, where X1 , X2 , . . . , Xn is an i.i.d. sample from Bernoulli(p),
where p ∈ P = (0, 1). Then $T = \sum_{i=1}^{n} X_i$ has a Binomial(n, p) distribution. We are interested in
testing the null hypothesis p = p0 for a specific value p0.
It is easy to see that p̂ = T /n and p̂0 = p0 . Thus the likelihood ratio statistic is
\[
\Lambda(X) = 2 \log \frac{L(\hat{p}; X)}{L(\hat{p}_0; X)}
= 2X \log \frac{X/n}{p_0} + 2(n - X) \log \frac{1 - X/n}{1 - p_0}, \tag{10.6.1}
\]
where we interpret 0 log(0) as 0. To compute the p-value for this test for a specific realization of X,
we need to first compute d = Λ(X ), and then compute
\[
P_{p_0}(\Lambda(Y) \ge d),
\]
where Y has the Binomial(n, p0 ) distribution. Although we cannot express this probability in closed
form, we can explicitly write it as the following sum involving Binomial(n, p0 ) probabilities:
\[
P_{p_0}(\Lambda(Y) \ge d) = \sum_{0 \le k \le n \,:\, \Lambda(k) \ge d} \binom{n}{k} p_0^{k} (1 - p_0)^{n-k}. \tag{10.6.2}
\]
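As a minimal sketch of how the sum in (10.6.2) might be evaluated in R, the following defines two illustrative helper functions. The names `lambda_binom` and `pvalue_binom` are hypothetical and do not come from the text; the convention 0 log(0) = 0 is handled explicitly.

```r
# Likelihood ratio statistic (10.6.1) for a Binomial count x out of n,
# under the null value p0. The helper term() implements a * log(a / b)
# with the convention 0 * log(0) = 0.
lambda_binom <- function(x, n, p0) {
  term <- function(a, b) ifelse(a == 0, 0, a * log(a / b))
  2 * term(x, n * p0) + 2 * term(n - x, n * (1 - p0))
}

# p-value (10.6.2): add up the Binomial(n, p0) probabilities of all k
# whose statistic Lambda(k) is at least the realized value d.
pvalue_binom <- function(x, n, p0) {
  d <- lambda_binom(x, n, p0)
  k <- 0:n
  sum(dbinom(k[lambda_binom(k, n, p0) >= d], size = n, prob = p0))
}

# Illustrative call with the values used in Example 10.6.1
pvalue_binom(67, n = 100, p0 = 0.5)
```

The comparison `>= d` is done in floating point, so for values of p0 other than 0.5 it may be prudent to allow a small tolerance when deciding ties.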
Example 10.6.1. Consider a specific instance of the above experiment where n = 100, and X = 67.
Suppose we want to test the null hypothesis p0 = 0.5. Then the observed value of d is
[1] 0.0006412485
[Figure 10.1: plot of Λ(k) against k, for k from 20 to 80, with the realized value of d marked.]
As can be seen by inspecting Figure 10.1, the null hypothesis will be rejected at level α = 0.05
either if the realized X is 61 or higher, or if X is 39 or lower. In other words, this test is a two-sided
test. ■
Next we continue Example 10.4.2, where X1 , X2 , . . . , Xn is an i.i.d. sample from the Normal (µ, σ 2 )
distribution, where the variance σ 2 is known but the mean µ ∈ P = R is not. We are interested in
testing the null hypothesis µ = µ0 , or equivalently, that µ ∈ P0 = {µ0 }, for a specific value µ0 . It
is easily seen that
\[
2 \log L(\mu; X_1, \dots, X_n) = -n \log 2\pi - n \log \sigma^2 - \frac{1}{\sigma^2} \sum_{i} (X_i - \mu)^2.
\]
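The unrestricted MLE of µ is $\hat{\mu} = \bar{X}$, and the restricted MLE is $\hat{\mu}_0 = \mu_0$. It follows that
\[
\begin{aligned}
\Lambda(X_1, \dots, X_n) &= 2 \log \frac{L(\hat{\mu}; X_1, \dots, X_n)}{L(\hat{\mu}_0; X_1, \dots, X_n)} \\
&= 2 \log L(\hat{\mu}; X_1, \dots, X_n) - 2 \log L(\hat{\mu}_0; X_1, \dots, X_n) \\
&= \frac{1}{\sigma^2} \left[ \sum_{i} (X_i - \mu_0)^2 - \sum_{i} (X_i - \bar{X})^2 \right] \\
&= \frac{1}{\sigma^2} \left[ n\mu_0^2 - 2n\mu_0 \bar{X} + n\bar{X}^2 \right] \\
&= \frac{n}{\sigma^2} (\bar{X} - \mu_0)^2 = \left( \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right)^2
\end{aligned}
\]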
where Y1 , Y2 , . . . , Yn have a Normal(µ0 , σ²) distribution. Now, we know that in that case
$Z = \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}}$ has a Normal(0, 1) distribution, and hence
\[
\Lambda(Y_1, Y_2, \dots, Y_n) = \left( \frac{\bar{Y} - \mu_0}{\sigma/\sqrt{n}} \right)^2
\]
has a $\chi^2_1$ distribution. The p-value can thus be easily calculated using R. The following code
computes the p-value for the data previously seen in Example 10.4.3.
[1] 0.3552259
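As a rough sketch of this calculation, the following uses simulated data together with assumed values of µ0 and σ; these inputs are illustrative assumptions and are not the data of Example 10.4.3. It computes the realized statistic d and obtains the p-value from the χ²₁ distribution, and also from the equivalent two-sided Normal(0, 1) calculation.

```r
# Hypothetical inputs (assumptions for illustration only)
set.seed(10)
sigma <- 2                        # known standard deviation
mu0 <- 5                          # null value of the mean
x <- rnorm(25, mean = 5.5, sd = sigma)
n <- length(x)

# Realized likelihood ratio statistic d = ((xbar - mu0) / (sigma / sqrt(n)))^2
d <- ((mean(x) - mu0) / (sigma / sqrt(n)))^2

# p-value from the chi-squared distribution with 1 degree of freedom
p_chisq <- pchisq(d, df = 1, lower.tail = FALSE)

# Equivalent two-sided calculation using the standard Normal distribution
p_norm <- 2 * pnorm(sqrt(d), lower.tail = FALSE)

c(d = d, p_chisq = p_chisq, p_norm = p_norm)
```

The two p-values should agree up to rounding, since the square of a Normal(0, 1) variable has the χ²₁ distribution.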
As in the Binomial example, we can also derive a rejection cutoff for the observed d corresponding
to a fixed level α. For example, with α = 0.05, this cutoff would be the 0.95 quantile of the χ21
distribution, which is
qchisq(0.95, df = 1)
[1] 3.841459
Thus our approach gives the same rule as the intuitive test derived earlier, which rejects the
null hypothesis when the absolute value of $T_{\mu_0}(X_1, X_2, \dots, X_n) = \sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma}$ is large:
since d is the square of this quantity, d ≥ 3.841 is the same as $|T_{\mu_0}| \ge \sqrt{3.841} \approx 1.96$. As with
the test for Binomial proportion described earlier, this is a two-sided test. Recall that we have seen
the cutoff of 1.96 before in Section 9.3 where we obtained a confidence interval for µ. This is not a
coincidence, as the 0.95 quantile of $\chi^2_1$ is the square of the 0.975 quantile (and of the
0.025 quantile) of Normal (0, 1).
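The relation between the χ²₁ cutoff and the Normal(0, 1) cutoff can be checked numerically. The short sketch below only uses the standard quantile functions `qchisq` and `qnorm`.

```r
# Cutoff for d at level 0.05, from the chi-squared distribution with 1 df
cut_chisq <- qchisq(0.95, df = 1)

# Two-sided Normal cutoff at the same level
cut_norm <- qnorm(0.975)

# The chi-squared cutoff should equal the square of the Normal cutoff
c(cut_chisq = cut_chisq, cut_norm_squared = cut_norm^2, sqrt_cut_chisq = sqrt(cut_chisq))
```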
More interestingly, it follows that given i.i.d. data X1 , X2 , . . . , Xn ∼ Normal(µ, σ²), a particular
value of µ0 will belong to the level (1 − α) confidence interval for µ obtained in Example 9.3.1
if and only if the null hypothesis that µ = µ0 is accepted at level α. This observation applies
more generally, and every hypothesis test can be used to generate a confidence region, and vice
versa. This idea is often useful, especially in situations where a hypothesis test is available, but a
confidence region may not be easy to obtain directly.
An important variation of the above test is when we are interested in testing not a single point
value of µ, but rather a range. For example, we may want to “reject” the null hypothesis only
when the true mean is larger than the conjectured value, but not when it is lower. In this case, the
null hypothesis is actually µ ≤ µ0 , or µ ∈ P0 = (−∞, µ0 ].
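Although the unrestricted MLE $\hat{\mu} = \bar{X}$ remains unchanged, the restricted MLE $\hat{\mu}_0$ is now given
by (see Exercise 10.6.1)
\[
\hat{\mu}_0 = \begin{cases} \bar{X} & \text{if } \bar{X} \le \mu_0 \\ \mu_0 & \text{otherwise} \end{cases} \;=\; \min(\bar{X}, \mu_0).
\]
The likelihood ratio statistic is therefore
\[
\begin{aligned}
\Lambda(X_1, X_2, \dots, X_n) &= 2 \log \frac{L(\hat{\mu}; X_1, \dots, X_n)}{L(\hat{\mu}_0; X_1, \dots, X_n)} \\
&= \frac{1}{\sigma^2} \left[ \sum_{i} (X_i - \hat{\mu}_0)^2 - \sum_{i} (X_i - \bar{X})^2 \right] \\
&= \frac{n}{\sigma^2} (\bar{X} - \hat{\mu}_0)^2 = \begin{cases} 0 & \text{if } \bar{X} \le \mu_0 \\ n(\bar{X} - \mu_0)^2/\sigma^2 & \text{otherwise.} \end{cases}
\end{aligned}
\]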
where Z is a standard normal random variable with cumulative distribution function Φ. Now, as µ
increases to µ0 , $\frac{\mu_0 - \mu}{\sigma/\sqrt{n}}$ decreases to 0, and $P_\mu(\Lambda(Y_1, \dots, Y_n) \ge d)$ increases to $1 - \Phi(\sqrt{d})$. In other
words, the maximum for calculating the p-value is achieved for µ = µ0 , giving p-value
\[
1 - \Phi(\sqrt{d}) = P\left( Z \ge \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \right).
\]
As before, we can compute this p-value using R. To obtain a rejection cutoff, for example with
α = 0.05, we have
qnorm(0.95)
[1] 1.644854
or equivalently,
\[
\sqrt{n}\, \frac{\bar{X} - \mu_0}{\sigma} > 1.645 \iff \bar{X} > \mu_0 + \frac{\sigma}{\sqrt{n}}\, 1.645
\]
Unlike the case where the null hypothesis allowed a single point value of µ, this is a one-sided test
in terms of X.
As in the previous examples, suppose we have an i.i.d. sample X1 , X2 , . . . , Xn from the Normal(µ, σ 2 )
distribution, but now in addition to the mean µ, we assume more realistically that the variance
σ 2 is also unknown. Formally, the parameters of the problem are (µ, σ 2 ) ∈ P = R × (0, ∞).
As in Section 10.6.2, we are interested in testing for a specific value of µ. To simplify notation,
we will consider the null hypothesis µ = 0 instead of the more general µ = µ0 . As noted
earlier, this is not really a restriction; to test µ = µ0 , we can simply work with the transformed
data X1 − µ0 , X2 − µ0 , . . . , Xn − µ0 . Under the null hypothesis, the restricted parameter set is
P0 = {0} × (0, ∞).
Unlike the previous examples, we have two parameters in this problem, and we are interested
in testing a null hypothesis that only puts restrictions on one of them. It is easy to see that the
unrestricted MLEs of µ and σ² are given by $\hat{\mu} = \bar{X}$ and $\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2$ (see Exercise 9.2.7).
The restricted MLE of µ is of course $\hat{\mu}_0 = 0$. It is easy to verify that the restricted MLE of σ² is
$\hat{\sigma}_0^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \hat{\mu}_0)^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2$. It follows that
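\[
\begin{aligned}
\Lambda(X_1, \dots, X_n) &= 2 \log \frac{L(\hat{\mu}, \hat{\sigma}^2; X_1, \dots, X_n)}{L(\hat{\mu}_0, \hat{\sigma}_0^2; X_1, \dots, X_n)} \\
&= 2 \log L(\hat{\mu}, \hat{\sigma}^2; X_1, \dots, X_n) - 2 \log L(\hat{\mu}_0, \hat{\sigma}_0^2; X_1, \dots, X_n) \\
&= n \log \hat{\sigma}_0^2 + \frac{1}{\hat{\sigma}_0^2} \sum_{i=1}^{n} (X_i - \hat{\mu}_0)^2 - n \log \hat{\sigma}^2 - \frac{1}{\hat{\sigma}^2} \sum_{i=1}^{n} (X_i - \hat{\mu})^2 \\
&= n \log \left( \frac{1}{n} \sum_{i=1}^{n} X_i^2 \right) + n\, \frac{\sum_{i=1}^{n} X_i^2}{\sum_{i=1}^{n} X_i^2} - n \log \left( \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \right) - n\, \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \\
&= n \log \frac{\sum X_i^2}{\sum (X_i - \bar{X})^2}
\end{aligned}
\]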
distribution (i.e., µ = 0). In general, this distribution could depend on the other parameters,
namely, σ 2 in this example. As we saw in Section 10.4.3, this does not happen here; that is, the
distribution of Λ(Y1 , . . . , Yn ) does not depend on the value of σ 2 when µ = 0. To see this, notice
that
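\[
\begin{aligned}
\Lambda(Y_1, \dots, Y_n) &= n \log \frac{\sum Y_i^2}{\sum (Y_i - \bar{Y})^2} \\
&= n \log \frac{\sum (Y_i/\sigma)^2}{\sum (Y_i/\sigma - \bar{Y}/\sigma)^2} \\
&= n \log \frac{\sum Z_i^2}{\sum (Z_i - \bar{Z})^2} = \Lambda(Z_1, \dots, Z_n)
\end{aligned}
\]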
for all σ 2 , where Pσ2 denotes probability calculations when Y1 , Y2 , . . . , Yn are i.i.d. from Normal(0, σ 2 ).
In particular, for an observed value d = Λ(X1 , . . . , Xn ), the p-value is given by
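\[
\begin{aligned}
P(\Lambda(Z_1, \dots, Z_n) \ge d) &= P\left( n \log \frac{\sum Z_i^2}{\sum (Z_i - \bar{Z})^2} \ge d \right) \\
&= P\left( \frac{\sum Z_i^2}{\sum (Z_i - \bar{Z})^2} \ge e^{d/n} \right) \\
&= P\left( \frac{\sum (Z_i - \bar{Z})^2}{\sum Z_i^2} \le e^{-d/n} \right)
\end{aligned}
\]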
where Z1 , Z2 , . . . , Zn are i.i.d. from Normal(0, 1). Now, simple algebraic manipulation shows that
\[
\sum_{i=1}^{n} Z_i^2 = \sum_{i=1}^{n} (Z_i - \bar{Z} + \bar{Z})^2 = \sum_{i=1}^{n} (Z_i - \bar{Z})^2 + n\bar{Z}^2
\]
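Recall that $\bar{Z} \sim \text{Normal}(0, \frac{1}{n})$ and so $n\bar{Z}^2 = (\sqrt{n}\,\bar{Z})^2 \sim \chi^2_1$. An application of Theorem 8.1.10
further tells us that (i) $\sum (Z_i - \bar{Z})^2 \sim \chi^2_{n-1}$, and (ii) $\bar{Z}$ is independent of $\sum (Z_i - \bar{Z})^2$. Recall
Example 8.1.5 to note that the $\chi^2_n$ distribution is the same as the Gamma($\frac{n}{2}, \frac{1}{2}$) distribution. In
other words,
\[
\sum_{i=1}^{n} (Z_i - \bar{Z})^2 \sim \text{Gamma}\left( \frac{n-1}{2}, \frac{1}{2} \right) \quad \text{and} \quad n\bar{Z}^2 \sim \text{Gamma}\left( \frac{1}{2}, \frac{1}{2} \right),
\]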
\[
d = \Lambda(X_1, \dots, X_n) = n \log \frac{\sum_{i=1}^{n} X_i^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\]
For the data in Example 10.4.3, the p-value can be computed as follows.
[1] 0.5151229
[1] 0.4994148
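A minimal sketch of such a calculation is given below, using simulated data as a stand-in (an assumption; these are not the data of Example 10.4.3). It relies on the fact that the ratio $\sum (Z_i - \bar{Z})^2 / \sum Z_i^2$ has a Beta((n − 1)/2, 1/2) distribution when µ = 0, and compares the result with the two-sided p-value reported by R's `t.test`.

```r
# Hypothetical data (assumption for illustration only)
set.seed(1)
x <- rnorm(12, mean = 0.3, sd = 2)
n <- length(x)

# Realized likelihood ratio statistic for the null hypothesis mu = 0
d <- n * log(sum(x^2) / sum((x - mean(x))^2))

# p-value: P(B <= exp(-d / n)) where B ~ Beta((n - 1) / 2, 1 / 2)
p_beta <- pbeta(exp(-d / n), shape1 = (n - 1) / 2, shape2 = 1 / 2)

# Two-sided one-sample t-test p-value, for comparison
p_t <- t.test(x, mu = 0)$p.value

c(d = d, p_beta = p_beta, p_t = p_t)
```

The two p-values are expected to coincide up to numerical error, reflecting the equivalence of the Beta and t forms of the test.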
To obtain the rejection region at level α, we need to start with the lower α quantile, say b, which
for α = 0.05 can be computed as follows.
[1] 0.7165366
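\[
\frac{\sum (X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2 + n\bar{X}^2} \le b,
\]
or equivalently, if
\[
\begin{aligned}
& \frac{\sum (X_i - \bar{X})^2 + n\bar{X}^2}{\sum (X_i - \bar{X})^2} = 1 + n\, \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2} \ge \frac{1}{b} \\
\iff\; & \frac{\bar{X}^2}{\frac{1}{n-1} \sum (X_i - \bar{X})^2} \ge \frac{n-1}{n} \left( \frac{1}{b} - 1 \right) \\
\iff\; & \frac{|\bar{X}|}{S} \ge \sqrt{ \frac{n-1}{n} \left( \frac{1}{b} - 1 \right) } \\
\iff\; & \sqrt{n}\, \frac{|\bar{X}|}{S} \ge \sqrt{ (n-1) \left( \frac{1}{b} - 1 \right) },
\end{aligned}
\]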
where S² is the sample variance (see Definition 7.1.5 in Chapter 8). Note the similarity of the
rejection region with the one in Section 10.6.2 which addressed the analogous problem when σ² is
known; as in that case, this is also a two-sided test. It follows from Corollary 8.1.11 that $\sqrt{n}\,\bar{X}/S$ has
the $t_{n-1}$ distribution when µ = 0, and so p-values or cutoffs can be computed using the t distribution
as well. It can be numerically verified that a two-sided t-test rejection cutoff is equivalent to the
Beta test cutoff as follows.
[1] 2.178813
[1] 2.178813
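One way such a check might look is sketched below. The sample size n = 13 is an assumption, chosen because it is consistent with the printed values of b and of the cutoff; the variable names are illustrative.

```r
n <- 13          # assumed sample size (an inference, not stated in the text)
alpha <- 0.05

# Lower alpha quantile of the Beta((n - 1) / 2, 1 / 2) distribution
b <- qbeta(alpha, shape1 = (n - 1) / 2, shape2 = 1 / 2)

# Cutoff for sqrt(n) * |xbar| / S implied by the Beta quantile
cut_beta <- sqrt((n - 1) * (1 / b - 1))

# Two-sided t cutoff with n - 1 degrees of freedom
cut_t <- qt(1 - alpha / 2, df = n - 1)

c(b = b, cut_beta = cut_beta, cut_t = cut_t)
```

If the equivalence holds as described, `cut_beta` and `cut_t` should agree up to numerical error.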
Before moving on, we make some observations and remarks about this important example. It
should be easy to see that for the general null hypothesis µ = µ0 , the Beta statistic derived above
generalizes to
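\[
e^{-\Lambda(X_1, \dots, X_n)/n} = \frac{\sum (X_i - \bar{X})^2}{\sum (X_i - \mu_0)^2} = \frac{\sum (X_i - \bar{X})^2}{\sum (X_i - \bar{X})^2 + n(\bar{X} - \mu_0)^2},
\]
and the equivalent t statistic to
\[
\sqrt{n}\, \frac{\bar{X} - \mu_0}{S}.
\]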
The latter form of the statistic is more conventional, not least due to its intuitively appealing
form and similarity with the Normal statistic $\sqrt{n}\,\frac{\bar{X} - \mu_0}{\sigma}$ in the case when σ² is known, as it can be
motivated as a modification where σ is replaced by an estimator when it is unknown. As with the
case where σ² is known, the test can be adjusted for a one-sided null hypothesis of the form µ ≤ µ0
or µ ≥ µ0 , resulting in a one-sided rejection region in terms of the t statistic. Even though this test
is derived under the assumption of Normality, it performs well for moderate departures from this
assumption, and is among the most common statistical tests used in practice.
Another equivalent form of the test is also important. In its squared form, the statistic
\[
\frac{n(\bar{X} - \mu_0)^2}{S^2}
\]
has an $F_{1,n-1}$ distribution when µ = µ0 (see Example 8.1.7). This form and its generalizations are
also commonly used. A typical example is the problem of testing for equality of means of two or
more populations, which we discuss next.
Hypothesis tests may be used to compare two samples to each other to see if the populations
they were derived from are similar. This is of particular use in many applications. For instance:
Are the political preferences of people of one region different from another? Are test scores at one
school better than those at another school? These questions could be approached by taking random
samples from each population and comparing them with each other.
Suppose X1 , X2 , . . . , Xm is an i.i.d. sample from a distribution X ∼ Normal(µ1 , σ12 ) and
Y1 , Y2 , . . . , Yn is an i.i.d. sample from a distribution Y ∼ Normal(µ2 , σ22 ) independent of the Xj
variables. Assume that µ1 and µ2 are unknown, as well as σ12 and σ22 . How might we test the null
hypothesis that µ1 = µ2 against the alternative hypothesis µ1 ≠ µ2 ? It turns out that this problem
is not easy to solve in its general form, and we will only consider the problem with the additional
assumption that σ12 = σ22 . We will use σ 2 to denote this common variance.
Based on the examples we have seen so far, we might guess that a good test would be based
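on $\bar{X} - \bar{Y}$, as this should be close to 0 if the null hypothesis that µ1 = µ2 were true. It is simple
to check that $\bar{X} - \bar{Y}$ has a Normal distribution with mean µ1 − µ2 and variance $\left( \frac{1}{m} + \frac{1}{n} \right) \sigma^2$. If
\[
\frac{\bar{X} - \bar{Y}}{S \sqrt{\frac{1}{m} + \frac{1}{n}}}
\]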
would have a t distribution. In fact, it is not difficult to find such an estimator S 2 . Consider the
standard unbiased estimators of σ 2 from the two independent populations:
\[
S_1^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar{X})^2, \qquad
S_2^2 = \frac{1}{n-1} \sum_{j=1}^{n} (Y_j - \bar{Y})^2.
\]
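It follows from Theorem 8.1.10 that $(m-1)S_1^2/\sigma^2 \sim \chi^2_{m-1}$ independently of $\bar{X}$ and $(n-1)S_2^2/\sigma^2 \sim \chi^2_{n-1}$
independently of $\bar{Y}$. This suggests the natural estimator
\[
S^2 = \frac{1}{m+n-2} \left[ \sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2 \right],
\]
for which, when µ1 = µ2 ,
\[
\frac{\bar{X} - \bar{Y}}{S \sqrt{\frac{1}{m} + \frac{1}{n}}}
\]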
has a $t_{m+n-2}$ distribution. This is the basis for the standard two-sided test for this problem, and
this solution generalizes to the one-sided null hypotheses µ1 ≤ µ2 and µ1 ≥ µ2 in the usual way. As
in the test for the mean of a single Normal population, the squared version of the statistic
\[
\frac{\frac{mn}{m+n} (\bar{X} - \bar{Y})^2}{S^2}
\]
has the $F_{1,m+n-2}$ distribution.
It turns out, not surprisingly, that this is equivalent to the test derived using the likeli-
hood ratio statistic. The likelihood function for this model is (suppressing the dependence on
X1 , . . . , Xm , Y1 , . . . , Yn for brevity)
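\[
\begin{aligned}
L(\mu_1, \mu_2, \sigma^2) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (X_i - \mu_1)^2 \right) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{1}{2\sigma^2} (Y_j - \mu_2)^2 \right) \\
&= \left( \frac{1}{\sqrt{2\pi}\,\sigma} \right)^{m+n} \exp\left( -\frac{1}{2\sigma^2} \left[ \sum_{i=1}^{m} (X_i - \mu_1)^2 + \sum_{j=1}^{n} (Y_j - \mu_2)^2 \right] \right)
\end{aligned}
\]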
It is easy to verify, again following the approach of Exercise 9.2.7, that the unrestricted MLEs are
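\[
\hat{\mu}_1 = \bar{X}, \qquad
\hat{\mu}_2 = \bar{Y}, \qquad
\hat{\sigma}^2 = \frac{1}{m+n} \left[ \sum_{i=1}^{m} (X_i - \hat{\mu}_1)^2 + \sum_{j=1}^{n} (Y_j - \hat{\mu}_2)^2 \right],
\]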
and therefore
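\[
\begin{aligned}
L(\hat{\mu}_1, \hat{\mu}_2, \hat{\sigma}^2) &= \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \right)^{m+n} \exp\left( -\frac{\sum_{i=1}^{m} (X_i - \hat{\mu}_1)^2 + \sum_{j=1}^{n} (Y_j - \hat{\mu}_2)^2}{2 \cdot \frac{1}{m+n} \left[ \sum_{i=1}^{m} (X_i - \hat{\mu}_1)^2 + \sum_{j=1}^{n} (Y_j - \hat{\mu}_2)^2 \right]} \right) \\
&= \left( 2\pi e \hat{\sigma}^2 \right)^{-\frac{m+n}{2}}.
\end{aligned}
\]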
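\[
\hat{\mu}_0 = \hat{\mu}_{10} = \hat{\mu}_{20} = \frac{1}{m+n} (m\bar{X} + n\bar{Y}), \qquad
\hat{\sigma}_0^2 = \frac{1}{m+n} \left[ \sum_{i=1}^{m} (X_i - \hat{\mu}_0)^2 + \sum_{j=1}^{n} (Y_j - \hat{\mu}_0)^2 \right],
\]
and therefore
\[
L(\hat{\mu}_{10}, \hat{\mu}_{20}, \hat{\sigma}_0^2) = \left( 2\pi e \hat{\sigma}_0^2 \right)^{-\frac{m+n}{2}}.
\]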
The test rejects the null hypothesis when this ratio is large, which is in line with our intuition
because if the null hypothesis is not true, then we would expect the deviations in the numerator
(from a common mean) to be larger than the deviations in the denominator (from group-specific
means). The denominator clearly has a χ2m+n−2 distribution (by Theorem 8.1.10 followed by
Example 5.5.6), but the distribution of the numerator is not immediately obvious. To simplify the
numerator, note that
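\[
\sum_{i=1}^{m} (X_i - \hat{\mu}_0)^2 = \sum_{i=1}^{m} (X_i - \bar{X} + \bar{X} - \hat{\mu}_0)^2 = \sum_{i=1}^{m} (X_i - \bar{X})^2 + m(\bar{X} - \hat{\mu}_0)^2 + 0.
\]
\[
\bar{X} - \hat{\mu}_0 = \frac{(m+n)\bar{X} - m\bar{X} - n\bar{Y}}{m+n} = \frac{1}{m+n} (m\bar{X} + n\bar{X} - m\bar{X} - n\bar{Y}) = \frac{n}{m+n} (\bar{X} - \bar{Y}),
\]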
and consequently,
\[
\sum_{i=1}^{m} (X_i - \hat{\mu}_0)^2 = \sum_{i=1}^{m} (X_i - \bar{X})^2 + \frac{mn^2}{(m+n)^2} (\bar{X} - \bar{Y})^2.
\]
Analogously,
\[
\sum_{j=1}^{n} (Y_j - \hat{\mu}_0)^2 = \sum_{j=1}^{n} (Y_j - \bar{Y})^2 + \frac{m^2 n}{(m+n)^2} (\bar{X} - \bar{Y})^2
\]
As before, it follows from Theorem 8.1.10 and the discussion above that X − Y is independent of
the denominator
\[
\sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2
\]
and that, when µ1 = µ2 , $\frac{mn}{m+n} (\bar{X} - \bar{Y})^2 / \sigma^2$ has a $\chi^2_1$ distribution. Arguing similarly as we did in Section 10.6.4, it
follows that
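\[
\frac{\hat{\sigma}^2}{\hat{\sigma}_0^2} = 1 \Big/ \frac{\hat{\sigma}_0^2}{\hat{\sigma}^2} = \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2}{\sum_{i=1}^{m} (X_i - \bar{X})^2 + \sum_{j=1}^{n} (Y_j - \bar{Y})^2 + \frac{mn}{m+n} (\bar{X} - \bar{Y})^2}
\]
has the Beta($\frac{m+n-2}{2}, \frac{1}{2}$) distribution (test rejects null when small), and
\[
\frac{\frac{mn}{m+n} (\bar{X} - \bar{Y})^2}{\frac{1}{m+n-2} \left[ \sum (X_i - \bar{X})^2 + \sum (Y_j - \bar{Y})^2 \right]} = \frac{\frac{mn}{m+n} (\bar{X} - \bar{Y})^2}{S^2}
\]
has the $F_{1,m+n-2}$ distribution (test rejects null when large), and both are equivalent to the “intuitive”
two-sided t test derived above.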
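The equivalence of the three forms can be illustrated numerically. The sketch below uses simulated data (an assumption made purely for illustration) and checks the sum-of-squares decomposition as well as the agreement of the Beta, F, and pooled two-sample t p-values; the variable names are hypothetical.

```r
# Hypothetical two-sample data with a common variance
set.seed(7)
x <- rnorm(9, mean = 10, sd = 3)
y <- rnorm(12, mean = 12, sd = 3)
m <- length(x)
n <- length(y)

# Deviations from group-specific means and from the common mean
within <- sum((x - mean(x))^2) + sum((y - mean(y))^2)
mu0_hat <- (m * mean(x) + n * mean(y)) / (m + n)
total <- sum((x - mu0_hat)^2) + sum((y - mu0_hat)^2)
between <- m * n / (m + n) * (mean(x) - mean(y))^2

# Decomposition: total = within + between
all.equal(total, within + between)

# Likelihood ratio statistic, (m + n) * log(sigma0_hat^2 / sigma_hat^2)
Lambda <- (m + n) * log(total / within)

# p-values from the Beta, F, and t forms of the test
p_beta <- pbeta(within / total, (m + n - 2) / 2, 1 / 2)
F_stat <- between / (within / (m + n - 2))
p_F <- pf(F_stat, df1 = 1, df2 = m + n - 2, lower.tail = FALSE)
p_t <- t.test(x, y, var.equal = TRUE)$p.value

c(Lambda = Lambda, p_beta = p_beta, p_F = p_F, p_t = p_t)
```

All three p-values should agree up to numerical error; `var.equal = TRUE` is needed because R's `t.test` does not assume equal variances by default.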
The examples above may give the impression that the likelihood ratio approach always leads to a
useful test statistic. The next example, which is a simple and natural extension of the previous
problem, shows that this is not so.
Recall the setup of the previous problem, where we suppose that X1 , X2 , . . . , Xm is an i.i.d.
sample from X ∼ Normal($\mu_1, \sigma_1^2$) and Y1 , Y2 , . . . , Yn is an i.i.d. sample from Y ∼
Normal($\mu_2, \sigma_2^2$), independent of the Xi variables. We are still interested in testing the null hypothesis
that µ1 = µ2 against the alternative hypothesis µ1 ≠ µ2 , but this time we do not wish to assume
that $\sigma_1^2 = \sigma_2^2$.
The unrestricted MLEs of the parameters are straightforward, as the X and Y observations
do not share any common parameters. The detailed calculations for the restricted case are more
involved. The details are left as an exercise, but it can be shown that the MLEs satisfy the following
equations:
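\[
\begin{aligned}
\hat{\mu}_0 = \hat{\mu}_{10} = \hat{\mu}_{20} &= \frac{m \hat{\sigma}_{20}^2 \bar{X} + n \hat{\sigma}_{10}^2 \bar{Y}}{m \hat{\sigma}_{20}^2 + n \hat{\sigma}_{10}^2} \\
\hat{\sigma}_{10}^2 &= \frac{1}{m} \sum_{i=1}^{m} (X_i - \hat{\mu}_0)^2 \\
\hat{\sigma}_{20}^2 &= \frac{1}{n} \sum_{j=1}^{n} (Y_j - \hat{\mu}_0)^2
\end{aligned}
\]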
An exact solution can be obtained, although in practice an iterative approach works well. It can
be further shown that the likelihood ratio statistic can be expressed in terms of these
estimates as follows.
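One possible iterative approach is a simple fixed-point iteration on the three restricted MLE equations: start from the pooled mean, update the two variance estimates, recompute the weighted mean, and repeat until the mean stops changing. The sketch below is only an illustration under that assumption; the function name `restricted_mle`, its arguments, and the simulated data are hypothetical, and convergence is not guaranteed in general.

```r
# Fixed-point iteration for the restricted MLEs under mu1 = mu2
# with unequal variances (illustrative sketch).
restricted_mle <- function(x, y, tol = 1e-10, max_iter = 1000) {
  m <- length(x)
  n <- length(y)
  mu0 <- (sum(x) + sum(y)) / (m + n)      # starting value: pooled mean
  for (iter in seq_len(max_iter)) {
    s1 <- mean((x - mu0)^2)               # current estimate of sigma_10^2
    s2 <- mean((y - mu0)^2)               # current estimate of sigma_20^2
    mu_new <- (m * s2 * mean(x) + n * s1 * mean(y)) / (m * s2 + n * s1)
    converged <- abs(mu_new - mu0) < tol
    mu0 <- mu_new
    if (converged) break
  }
  list(mu0 = mu0,
       sigma10_sq = mean((x - mu0)^2),
       sigma20_sq = mean((y - mu0)^2),
       iterations = iter)
}

# Hypothetical data with unequal variances
set.seed(3)
x <- rnorm(15, mean = 10, sd = 1)
y <- rnorm(20, mean = 10.8, sd = 3)
fit <- restricted_mle(x, y)
fit
```

Since the update is a weighted average of the two sample means with positive weights, the estimate of the common mean always lies between them.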
This is where the usual procedure breaks down. Unlike in the previous examples, the distribution
of this quantity is not completely determined when the null hypothesis holds, because it depends
on the unknown ratio σ12 /σ22 . Thus, even using simulation to estimate the p-value is not feasible.
As alluded to earlier, one benefit of the likelihood ratio test is that even when the distribution
of the statistic is difficult to study, a powerful result gives its asymptotic distribution under fairly
general conditions. Applied to this problem, this result says that as m, n → ∞, the distribution of
Λ(X1 , . . . , Xm , Y1 , . . . , Yn ) converges to a χ21 distribution under the null hypothesis. This is indeed
true; however, for small sample sizes m and n, calculating p-values using this null distribution leads
to substantially larger probability of Type I error than nominally specified. A modification works
quite well in practice, but the details are beyond our scope. For us, this example serves to illustrate
the limitations of the likelihood ratio test approach.
exercises
Ex. 10.6.1. Consider an i.i.d. sample X1 , X2 , . . . , Xn ∼ Normal(µ, σ²). Show that the restricted
We now return to the problems discussed earlier in this chapter, in Sections 10.2 and 10.3, which we
left largely unresolved. We start with the “goodness of fit” problem, and specifically Example 10.2.1
which we can generalize as follows.
Consider a random experiment (such as rolling a die) which has k possible outcomes. As
in Example 3.2.12, let pj represent the probability that any individual trial results in the j-th
outcome, and let Xj represent the number of the n trials that result in the j-th outcome. The
joint distribution of the random variables (X1 , X2 , . . . , Xk ) is then a multinomial distribution with
parameters n and p = (p1 , p2 , . . . , pk ). Formally, the parameter space for the problem is
\[
P = \left\{ p = (p_1, p_2, \dots, p_k) : 0 \le p_j \le 1 \text{ for all } j, \; \sum_{j=1}^{k} p_j = 1 \right\}.
\]
We wish to test the null hypothesis that p = p0 for some p0 = (p01 , p02 , . . . , p0k ). In Example 10.2.1,
where the experiment was rolling a die, we had k = 6 and p0 = (1/6, 1/6, . . . , 1/6).
As P0 = {p0 } is a singleton set, the MLE p̂0 under the null hypothesis is simply p0 . As we saw
in Exercise 9.2.11, the unrestricted MLE p̂ of p is given by the coordinatewise sample proportions,
that is, p̂j = Xj /n.
From the multinomial distribution obtained in Example 3.2.12, it follows easily that the
likelihood function, given observed data X1 , X2 , . . . , Xk , has the form
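\[
L(p; X_1, X_2, \dots, X_k) = \frac{n!}{X_1! X_2! \cdots X_k!} \prod_{j=1}^{k} p_j^{X_j}, \quad \text{for } p \in P,
\]
so that
\[
\log L(p) = \log n! - \sum_{j=1}^{k} \log X_j! + \sum_{j=1}^{k} X_j \log p_j, \quad \text{for } p \in P.
\]
The likelihood ratio statistic is therefore
\[
\Lambda(X_1, \dots, X_k) = 2 \log \frac{L(\hat{p})}{L(\hat{p}_0)} = 2 \sum_{j=1}^{k} X_j \log \frac{\hat{p}_j}{\hat{p}_{0j}} = 2 \sum_{j=1}^{k} X_j \log \frac{X_j}{n p_{0j}}.
\]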
It is conventional in this context to define Ej = np0j , representing the expected value of Xj . This
leads to the test statistic
\[
\Lambda(X_1, \dots, X_k) = 2 \sum_{j=1}^{k} X_j \log \frac{X_j}{E_j}. \tag{10.7.1}
\]
Example 10.7.1. In a survey, a class of n = 71 students were asked what their birth month was,
and the following summary results were obtained:
Using this data, we wish to test the null hypothesis that all days in the year are equally likely as
birthdays. We will ignore the possibility of leap years for simplicity and assume that there are
365 days in a year. We also assume that the class represents an i.i.d. sample from some larger
population. We can set up the problem and calculate the test statistic in R as follows.
X <- c(7, 4, 5, 8, 2, 6, 6, 7, 3, 8, 9, 6)
n <- sum(X)
p0 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31) / 365
E <- n * p0
Lambda_x <- 2 * sum(X * log(X / E))
Lambda_x
[1] 9.008024
To compute the p-value for this test, we need the distribution of Λ(Y1 , . . . , Yk ) when (Y1 , Y2 , . . . , Yk )
follow the multinomial distribution with parameters n and p0 . Unfortunately, this distribution is
not easy to compute explicitly. However, as the distribution of (Y1 , Y2 , . . . , Yk ) is completely known
(because P0 has only one element, p0 ), we can easily simulate values of Λ(Y1 , . . . , Yk ) and estimate
the p-value. We can do this, using 1000000 replications, as follows.
Lambda_sim <-
replicate(1000000,
{
Y <- rmultinom(1, size = n, prob = p0)
2 * sum(Y * log(Y / E), na.rm = TRUE) # Use na.rm to allow for Y = 0
})
uprob <- sum(Lambda_sim >= Lambda_x) / 1000000
uprob
[1] 0.648674
A remarkable result, a version of which is stated without proof below in Section 10.7.1, tells us
that even if we had been unable to estimate the exact p-value, the asymptotic distribution of
Λ(Y1 , . . . , Yk ) (as the sample size n → ∞) is known, and is in fact χ2k−1 . This asymptotic result
allows us to obtain an approximate p-value for this example as follows.
[1] 0.6211516
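A sketch of how this approximate p-value might be computed is given below. It repeats the setup of the example so that it can be run on its own, and takes the degrees of freedom to be k − 1 = 11, in line with the asymptotic result quoted above.

```r
# Observed counts by birth month and null probabilities (from the example)
X <- c(7, 4, 5, 8, 2, 6, 6, 7, 3, 8, 9, 6)
p0 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31) / 365
E <- sum(X) * p0

# Likelihood ratio statistic (10.7.1)
Lambda_x <- 2 * sum(X * log(X / E))

# Approximate p-value from the chi-squared distribution with k - 1 df
p_asymptotic <- pchisq(Lambda_x, df = length(X) - 1, lower.tail = FALSE)
p_asymptotic
```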
While not exactly the same, this is reasonably close to the exact p-value estimated using simulation.
■
Much of the appeal of the likelihood-based approach we have explored in this chapter comes from a
very powerful result that states that the null distribution of the likelihood ratio test statistic can
be obtained asymptotically, even if it cannot be computed exactly for any finite sample size. In
simplified form, the result may be stated as follows.
The proof of this result is beyond the scope of this book. In fact, even a proper statement would be
unnecessarily complicated. For our purposes, it is sufficient to know that the result is applicable
in most situations, as long as two important conditions are met: The number of independent
parameters in P and P0 are fixed finite numbers, and no p ∈ P0 is on the “boundary” of P. We
will not get into the precise details of what it means for p to be in the boundary of P, but in the
goodness of fit example, an obvious way for this to happen would be if one of the components p0j
equals 0. If that were the case, the result would not be applicable.
To appreciate the power of this result, note that it applies even when the null parameter space
P0 is not a singleton set. Recall that to solve a testing problem we need two things: to find a
suitable test statistic, and then to find its null distribution. The likelihood ratio approach often gives
us a test statistic in situations where no natural candidate is available. In Example 10.7.1 above,
this approach gave a test statistic that we would probably not consider to be natural. However,
once we determine the test statistic, finding its null distribution is simple; even though this null
distribution is “unknown” in the sense that we cannot identify it as a standard distribution, it is
still possible to simulate from it because P0 is a singleton set, so any probabilities related to that
distribution can be computed as precisely as we wish.
The situation is fundamentally different when P0 contains multiple (usually infinitely many)
values. The likelihood ratio test statistic need not have a single “null distribution”, but rather a
different one for every p0 ∈ P0 , making the simulation approach impractical. Theorem 10.7.2 comes
to our rescue in such situations, giving us a single null distribution that is at least asymptotically
valid. We will see an example of such a test in Section 10.8.
We conclude this section with a discussion of a much more well known test for the goodness of fit
problem, given by the following test statistic.
\[
T(X_1, \dots, X_k) = \sum_{j=1}^{k} \frac{(X_j - E_j)^2}{E_j}
\]
This is an intuitively appealing test statistic. As with the likelihood ratio test described above,
the exact distribution of T (X1 , . . . , Xk ) cannot be computed explicitly, but can be studied using
simulation for any specific problem. This distribution also converges to the same χ2k−1 distribution
as n → ∞, which as we will see soon, is not a coincidence. This asymptotic result can be proved
without appealing to the more general Theorem 10.7.2, although the proof is still beyond the scope
of this book. For Example 10.7.1, the resulting test statistic and the corresponding p-value can be
obtained as follows.
[1] 8.110577
[1] 0.7033655
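As a sketch of this calculation, the statistic can be computed directly from its definition and compared with R's built-in `chisq.test`, which implements Pearson's goodness of fit test when a vector of null probabilities is supplied through its `p` argument. The setup of Example 10.7.1 is repeated so that the block is self-contained.

```r
# Observed counts and null probabilities (from Example 10.7.1)
X <- c(7, 4, 5, 8, 2, 6, 6, 7, 3, 8, 9, 6)
p0 <- c(31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31) / 365
E <- sum(X) * p0

# Pearson's statistic and its asymptotic p-value (k - 1 degrees of freedom)
T_x <- sum((X - E)^2 / E)
p_T <- pchisq(T_x, df = length(X) - 1, lower.tail = FALSE)
c(T_x = T_x, p_T = p_T)

# The same test using the built-in function
pearson <- chisq.test(X, p = p0)
c(statistic = unname(pearson$statistic), p.value = pearson$p.value)
```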
One may wonder, given the same asymptotic null distribution of Λ(X1 , . . . , Xk ) and T (X1 , . . . , Xk ),
whether they are related to each other. The answer is that they are indeed related. To see this,
define εj = Xj − Ej , so that we can write (10.7.1) as
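\[
\begin{aligned}
\Lambda(X_1, \dots, X_k) &= 2 \sum_{j=1}^{k} X_j \log \frac{X_j}{E_j} \\
&= 2 \sum_{j=1}^{k} (E_j + \varepsilon_j) \log \left( 1 + \frac{\varepsilon_j}{E_j} \right)
\end{aligned}
\]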
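Now, for x close to 0, we can write $\log(1 + x) \approx x - \frac{1}{2} x^2$, so we can write, provided $\varepsilon_j / E_j \approx 0$,
\[
\Lambda(X_1, \dots, X_k) \approx 2 \sum_{j=1}^{k} (E_j + \varepsilon_j) \left[ \frac{\varepsilon_j}{E_j} - \frac{1}{2} \left( \frac{\varepsilon_j^2}{E_j^2} \right) \right] \tag{10.7.2}
\]
Expanding the product, and noting that $\sum_j \varepsilon_j = \sum_j X_j - \sum_j E_j = n - n = 0$, we get
\[
\Lambda(X_1, \dots, X_k) \approx 0 + 2 \sum_{j=1}^{k} \frac{1}{2}\, \frac{(X_j - E_j)^2}{E_j} \left( 1 - \frac{X_j - E_j}{E_j} \right)
\]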
This approximation is of course only valid when $\frac{\varepsilon_j}{E_j} = \frac{X_j - E_j}{E_j} \approx 0$. In that case, we can further
approximate Λ(X1 , . . . , Xk ) by assuming that $1 - \frac{X_j - E_j}{E_j} \approx 1$, to get
\[
\Lambda(X_1, \dots, X_k) \approx \sum_{j=1}^{k} \frac{(X_j - E_j)^2}{E_j} = T(X_1, \dots, X_k).
\]
It is not particularly difficult to show that $\frac{\varepsilon_j}{E_j} \xrightarrow{p} 0$ as n → ∞, and thus establish using Slutsky’s
theorem that Λ(X1 , . . . , Xk ) and T (X1 , . . . , Xk ) have the same asymptotic distribution.
A natural next question is to ask which of these tests is better in terms of their power to identify
situations where the null hypothesis does not hold. The results we have cited so far do not give a
clear answer, but simulation studies can be used to get an indication.
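A small simulation of this kind might look as follows. Everything in this sketch is an illustrative assumption: the number of categories, the sample size, the alternative `p1`, the number of replications, and the helper names `stats` and `sim`. Cutoffs are taken as simulated 0.95 quantiles under the null, and power is estimated as the proportion of simulated alternative samples exceeding the cutoff.

```r
set.seed(123)
k <- 6
n <- 60
p0 <- rep(1 / k, k)                              # null: all outcomes equally likely
p1 <- c(0.25, 0.15, 0.15, 0.15, 0.15, 0.15)      # hypothetical alternative
E <- n * p0

# Both test statistics for one vector of counts Y
stats <- function(Y, E) {
  c(Lambda = 2 * sum(Y * log(Y / E), na.rm = TRUE),   # na.rm handles Y = 0
    Pearson = sum((Y - E)^2 / E))
}

# Simulate the statistics for multinomial samples with probabilities p;
# the result is a 2 x reps matrix with one row per statistic
sim <- function(p, reps) replicate(reps, stats(rmultinom(1, size = n, prob = p), E))

null_sim <- sim(p0, 10000)
cutoff <- apply(null_sim, 1, quantile, probs = 0.95)  # level 0.05 cutoffs

alt_sim <- sim(p1, 10000)
power <- rowMeans(alt_sim > cutoff)                   # estimated power of each test
power
```

A single alternative gives only a partial picture; a fuller study would vary the alternative, the sample size, and the number of categories.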
Next we consider the problem of testing whether two categorical attributes are independent. There
are several formulations of this problem that give potentially different answers. We start with the
multinomial formulation discussed earlier in Section 10.3.
Recall that in Section 10.3, we described a multinomial model for this problem parameterized by
the parameter vector p = (p11 , p21 , p12 , p22 ). This model is natural when the units studied in the
problem can be viewed as an i.i.d. sample from some population, as would be appropriate in an
observational study. An alternative formulation that is more appropriate for randomized controlled
trials is parameterized by p = (π1 , π2 , q11 , q12 , q21 , q22 ), where π1 and π2 are the probabilities of
a specific unit being assigned to treatment 1 and treatment 2 respectively, and q1ℓ and q2ℓ are
corresponding conditional probabilities of outcome ℓ. As the two formulations are equivalent (as
long as units are allocated treatment independently), we will only consider the first formulation.
We have already obtained (see Section 10.3.2) maximum likelihood estimates of p under the
unconstrained model, as well as under the null hypothesis of independence which restricts the
parameter values to
P0 = {p : pkℓ = pk◦ p◦ℓ for k, ℓ = 1, 2} .
To obtain the likelihood ratio statistic, we can follow the same calculations as in the goodness of fit
problem, to write the test statistic as
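\[
\begin{aligned}
\Lambda(X_{11}, X_{12}, X_{21}, X_{22}) &= 2 \log \frac{L(\hat{p})}{L(\hat{p}_0)} = 2 \sum_{k=1}^{2} \sum_{\ell=1}^{2} X_{k\ell} \log \frac{\hat{p}_{k\ell}}{\hat{p}_{0,k\ell}} \\
&= 2 \sum_{k=1}^{2} \sum_{\ell=1}^{2} X_{k\ell} \log \frac{X_{k\ell}/n}{(X_{k1} + X_{k2})(X_{1\ell} + X_{2\ell})/n^2}.
\end{aligned}
\]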
As in the case of the goodness of fit test, it is conventional to write this statistic as
\[
\Lambda(X_{11}, X_{12}, X_{21}, X_{22}) = 2 \sum_{k=1}^{2} \sum_{\ell=1}^{2} X_{k\ell} \log \frac{X_{k\ell}}{E_{k\ell}}, \tag{10.8.1}
\]
where $E_{k\ell} = \frac{1}{n} (X_{k1} + X_{k2})(X_{1\ell} + X_{2\ell}) = n \hat{p}_{k\circ} \hat{p}_{\circ\ell}$ is interpreted as the “expected” value of Xkℓ if
the null hypothesis is true. The more popular Pearson’s χ² test of independence, with test statistic
\[
T(X_{11}, X_{12}, X_{21}, X_{22}) = \sum_{k=1}^{2} \sum_{\ell=1}^{2} \frac{(X_{k\ell} - E_{k\ell})^2}{E_{k\ell}},
\]
can be similarly viewed as an approximation of Λ(X11 , X12 , X21 , X22 ). Both these tests can be
easily generalized to situations where there are more than two treatments or more than two outcomes.
Unlike the goodness of fit problem, the null parameter space P0 is no longer a singleton set.
Also, unlike in the examples involving the Normal distribution, the distribution of the test statistic
does not become independent of the choice of p0 ∈ P0 . This effectively rules out the simulation
approach to obtain the null distribution, as there are infinitely many choices of p0 which need to be
considered if we wish to compute the p-value using (10.5.3).
Fortunately, Theorem 10.7.2 is still applicable in this case. As the number of independent
parameters is 3 in P and 2 in P0 , both the likelihood ratio statistic and Pearson’s test statistic
have the χ21 distribution asymptotically. In general, the degrees of freedom of the asymptotic null
distribution will depend on the number of treatments and outcomes.
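To make the calculation concrete, the following sketch computes both statistics and their asymptotic p-values for a 2 × 2 table. The counts are made up for illustration (an assumption, not data from the text). The Pearson statistic is compared with R's `chisq.test`, called with `correct = FALSE` to switch off the continuity correction that R applies to 2 × 2 tables by default.

```r
# Hypothetical 2 x 2 table of counts (rows: treatments, columns: outcomes)
tab <- matrix(c(30, 20,
                15, 35),
              nrow = 2, byrow = TRUE,
              dimnames = list(Treatment = c("1", "2"), Outcome = c("1", "2")))
n <- sum(tab)

# Expected counts under independence: (row total * column total) / n
E <- outer(rowSums(tab), colSums(tab)) / n

# Likelihood ratio statistic (10.8.1) and Pearson's statistic
Lambda <- 2 * sum(tab * log(tab / E))
T_stat <- sum((tab - E)^2 / E)

# Asymptotic p-values, using the chi-squared distribution with 1 df
p_values <- pchisq(c(Lambda = Lambda, Pearson = T_stat), df = 1, lower.tail = FALSE)
p_values

# Built-in Pearson test without continuity correction, for comparison
builtin <- chisq.test(tab, correct = FALSE)
unname(builtin$statistic)
```

If any cell count were 0, the term `tab * log(tab / E)` would need the same 0 log(0) = 0 convention used earlier.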
A common strategy when designing clinical trials such as the one in Example 10.3.1 is to fix the
number of individuals in each treatment group in advance. In other words, the numbers n1 and n2
to be given treatment 1 and treatment 2 respectively are fixed in advance.
This still leaves open the question of how to choose the n1 individuals to be given treatment
1. A natural choice is to choose them randomly, uniformly from the $\binom{n}{n_1}$ possibilities. Such an
allocation scheme forms the basis of the approach described in Section 10.8.3.
Here, we consider the model where the treatment attribute is not random at all, but rather
fixed in advance. One may think of this as comparing two different populations, based on samples
of size n1 and n2 . In this setup, the null hypothesis of independence of treatment and outcome can
be reinterpreted to mean that the distribution of the outcome attribute does not depend on the
population from which the individual comes. It is similar in that sense to the two-sample test for
equality of population means discussed in Section 10.6.5. Alternatively, this model can be thought
to have been derived from the multinomial model by conditioning on the treatment attribute of all
individuals. In terms of Lemma 10.3.2, we have
Yi = 1 | Ti = 1 ∼ Bernoulli(q11 ),
Yi = 1 | Ti = 2 ∼ Bernoulli(q21 ).
is exactly the same as (10.8.1), the statistic obtained in the multinomial model. Although both P
and P0 have one fewer independent parameter in this case, P0 is still not a singleton set, so the
null distribution will depend on the unknown common value of q11 = q21 . Again, the asymptotic
distribution of the test statistic is χ21 .
The tests of independence derived above are approximate tests, valid asymptotically. An interesting
and somewhat natural extension of this setup does allow us to obtain an exact test, although its
formulation is such that it does not fit nicely into our usual parametric setup. Nonetheless, we end
this section with a discussion of this formulation because it provides a useful perspective on the
testing problem in general. The resulting test is known as Fisher’s exact test of independence.
Technically, the setup of Fisher’s test can be obtained from the multinomial model by conditioning
on a certain event. In this sense, it is an extension of the previous two models. Specifically,
the multinomial model in Section 10.3.1 defines a probability distribution on the set of all 2 × 2
tables of the form

              Outcome 1   Outcome 2   Total
Treatment 1   X11         X12         N1◦
Treatment 2   X21         X22         N2◦
Total         N◦1         N◦2         n

where Xkℓ , k, ℓ = 1, 2 are non-negative integer counts, Nk◦ and N◦ℓ are row and column totals, and
$$\sum_{k,\ell} X_{k\ell} = \sum_{k} N_{k\circ} = \sum_{\ell} N_{\circ\ell} = n,$$
the total number of participants. The row and column totals Nk◦
and N◦ℓ are random, although the total sum n is fixed as the total number of units (the sample
size) is fixed in advance. The model considered in Section 10.8.2, as we have noted above, can be
derived from the multinomial model by conditioning on the treatment attribute of each participant,
or equivalently by conditioning on N1◦ and N2◦ . The interpretation of this conditioning from the
perspective of the random experiment is that the number of individuals in each treatment group
(for instance, the number of individuals given the placebo and the vaccine in Example 10.3.1) is
fixed in advance. To derive Fisher’s exact test, we further condition on the column totals N◦1 and
N◦2 .
If both row and column totals are fixed in advance, then it is immediate that any one element
of the table determines the others. This solves one of our problems, namely, finding a test statistic:
without loss of generality, we can take X11 to be our test statistic, as it completely defines the entire
table. In the multinomial setup, the conditional distribution of X11 given the row and column
totals turns out to be the familiar hypergeometric distribution (Example 2.3.1).
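As a minimal sketch of this conditional distribution, the Python snippet below evaluates the hypergeometric probabilities of X11 once the margins are fixed. The function name `x11_pmf` and the margins used (row totals 10 and 10, first column total 8) are illustrative assumptions, not values taken from the text.

```python
from math import comb

def x11_pmf(n1, n2, c1):
    """Return {x: P(X11 = x)} given row totals n1, n2 and first column total c1."""
    n = n1 + n2
    lo = max(0, c1 - n2)   # X21 = c1 - x cannot exceed the second row total
    hi = min(n1, c1)       # X11 cannot exceed its own row or column total
    return {x: comb(n1, x) * comb(n2, c1 - x) / comb(n, c1)
            for x in range(lo, hi + 1)}

# Hypothetical margins: 10 units per treatment, 8 units with outcome 1.
pmf = x11_pmf(10, 10, 8)
for x, p in pmf.items():
    print(x, round(p, 4))
```

Because the margins determine the rest of the table, these probabilities describe the whole conditional distribution of the table, not just of its first entry.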
It is not, however, immediately obvious how the conditioning on column totals can be interpreted
from the perspective of the underlying random experiment. Fixing the total number of individuals
to be assigned treatment 1 and treatment 2 in advance is reasonable. However, it is completely
unreasonable to expect that the number of individuals who have outcome 1 and outcome 2 would
also be fixed in advance. To link the conditional model above to a reasonable experimental setup,
we have to view the experiment from a different, nonparametric perspective.
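Recall that there are four possible treatment-outcome combinations for each individual or unit,
namely (1, 1), (1, 2), (2, 1), and (2, 2), identified respectively with the four matrices
$$\begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}, \quad \text{and} \quad \begin{bmatrix} 0 & 0 \\ 0 & 1 \end{bmatrix}.$$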
As in Lemma 10.3.2, let (Ti , Yi ) denote the treatment and outcome pair for the i-th individual.
The treatment attribute Ti for each unit is usually random, but the nature of the randomness
depends on the type of experiment. For observational studies, individuals are sampled from a
population and both “treatment” and “outcome” attributes are observed; neither is under the
control of the experimenter. However, for controlled trials such as the vaccine trial described in
Example 10.3.1, the treatment is assigned randomly as part of the experiment, typically by ensuring
that all participants are equally likely to get a particular treatment. The outcome Yi is also random,
presumably in a way that depends on the individual involved. In the vaccine example, the outcome
may depend on age, gender, and other attributes of a participant that are not available to us. The
outcome is also possibly affected by the treatment assigned; in fact, in the vaccine trial, we hope
that being vaccinated makes an individual less likely to become affected with the disease.
However, we are interested in testing the hypothesis of independence of the two attributes,
which posits that the outcome is not affected by the treatment, even if it does depend on the
individual. How can we formulate this idea as a probability model? The multinomial model with
fixed row totals described in Section 10.8.2 assumes a parametric model where the distribution
of Yi | Ti = t depends only on the parameter q11 or q21 , depending on whether t = 1 or t = 2.
Fisher’s exact test uses a different probability formulation which in principle allows the distribution
of Yi | Ti = t to vary from individual to individual, without explicitly providing a parametric model.
We describe this formulation next.
Normally, probability statements are interpreted in terms of the outcome in repeated performances
of an experiment. Here, repeating the experiment may mean selecting a new sample of
units (participants) via some random sampling mechanism, randomly assigning them treatments,
and observing the outcomes. However, let us suppose that we simplify the process of repeating the
experiment by skipping the first step: instead of selecting a new set of units on which to perform
the experiment, suppose we use the same set of n participants. However, we do still randomize
the allocation of treatments, by randomly selecting a new subset of size n1 uniformly from the
$\binom{n}{n_1}$ possibilities, to receive treatment 1. Here n1 , the number of participants who get treatment
1, is assumed to be fixed in advance, and thus n1 can be viewed as the first row total N1◦ in
the multinomial formulation. Of course, we cannot actually perform this experiment and observe
the outcomes on the same units again (after all, someone who has been vaccinated cannot be
un-vaccinated), but we can conjecture about what could have happened if the experiment had been
performed with this new treatment assignment.
In general, we cannot say what the outcomes would have been, as they could have changed if
different treatments had been received. However, suppose we restrict ourselves to the case when
the outcome is independent of the treatment, which is what the position of the skeptic would be.
Interpreting the notion of independence literally rather than probabilistically, we can then say
that for each unit, the outcome depends only on the individual and not the treatment, so would
have remained the same as the outcome observed with the original assignment. In other words,
for the i-th unit if we had Yi = 1 in the original experiment, we would again have Yi = 1 in the
new hypothetical experiment, regardless of whether the value of Ti had changed. For each such
hypothetical experiment then, we can recreate the summary 2 × 2 table without requiring any new
information other than the new treatment assignments. An important point, which is easy to see, is
that the row and column totals of these tables remain unchanged from the original by construction.
The row totals are the number of units assigned treatments 1 and 2, which are always n1 and
n − n1 . The column totals are the total number of units with outcome 1 and 2, which also remain
unchanged, provided that outcomes are not affected by treatment assignment.
The argument above gives us a very concrete (if still somewhat abstract) procedure to test the
null hypothesis (the skeptic’s conjecture) that treatment does not affect outcome: If this conjecture
were true, then we can randomly choose a hypothetical treatment assignment to get a random
summary table, forming a probability distribution whose sample space consists of all 2 × 2 summary
tables with the given marginal row and column totals, or equivalently, just its first entry X11 . Even
if we could not say anything more about this distribution, we could always simulate as many such
tables as we wanted to get an empirical distribution of X11 . As it happens, we can actually say
more, because the distribution of X11 again turns out to be the same hypergeometric distribution.
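The re-randomisation argument above can be imitated directly. The following sketch, under the assumption of a hypothetical data set of 20 units of which 8 have outcome 1, repeatedly draws a new treatment group of size n1 = 10 while holding the outcomes fixed, and compares the empirical distribution of X11 with the hypergeometric probabilities. The function name `simulate_x11` and all numbers are illustrative.

```python
import random
from math import comb

def simulate_x11(outcomes, n1, reps, seed=0):
    """Empirical distribution of X11 under re-randomised treatment assignment.

    outcomes: fixed outcomes (1 or 2), one per unit, unaffected by treatment.
    n1: number of units assigned treatment 1 in every replication.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    counts = {}
    for _ in range(reps):
        treated = rng.sample(range(n), n1)   # uniform subset of size n1
        x11 = sum(1 for i in treated if outcomes[i] == 1)
        counts[x11] = counts.get(x11, 0) + 1
    return {x: c / reps for x, c in sorted(counts.items())}

# Hypothetical data: 20 units, 8 of which have outcome 1.
outcomes = [1] * 8 + [2] * 12
empirical = simulate_x11(outcomes, n1=10, reps=20000)

# Hypergeometric probabilities with the same margins, for comparison.
exact = {x: comb(8, x) * comb(12, 10 - x) / comb(20, 10) for x in range(0, 9)}
for x in exact:
    print(x, round(empirical.get(x, 0.0), 3), round(exact[x], 3))
```

The two columns printed for each value of X11 should agree up to simulation error, which is what the hypergeometric claim predicts.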
Finally, to decide whether or not to reject the null hypothesis, we use the same approach as
earlier. Depending on the problem at hand, departure from the null may be indicated either by
high values of X11 , or low values of X11 , or both. Accordingly, the test is one-sided or two-sided.
For one-sided tests, the p-value is given by the corresponding tail probability of the null distribution
starting from the observed value of X11 . For two-sided tests, the computation is less obvious as
the null distribution is not symmetric. Following the principle described in Example 10.6.1, the
p-value in this case is obtained by adding up the individual probabilities of all outcomes in the null
distribution that are at most as likely as the observed value of X11 .
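A short sketch of the p-value computation just described is given below. The function name `fisher_p_values`, the example table, and the small relative tolerance used to guard against floating-point ties are assumptions made for illustration.

```python
from math import comb

def fisher_p_values(x11, n1, n2, c1):
    """Lower-tail, upper-tail and two-sided p-values for a 2 x 2 table.

    n1, n2 are the row totals and c1 is the first column total.
    """
    n = n1 + n2
    support = range(max(0, c1 - n2), min(n1, c1) + 1)
    pmf = {x: comb(n1, x) * comb(n2, c1 - x) / comb(n, c1) for x in support}
    p_obs = pmf[x11]
    lower = sum(p for x, p in pmf.items() if x <= x11)   # small X11 is extreme
    upper = sum(p for x, p in pmf.items() if x >= x11)   # large X11 is extreme
    # Two-sided: add every outcome that is at most as likely as the observed one.
    two_sided = sum(p for p in pmf.values() if p <= p_obs * (1 + 1e-7))
    return lower, upper, two_sided

# Hypothetical table with rows (3, 7) and (9, 1).
lower, upper, two_sided = fisher_p_values(3, 10, 10, 12)
print(round(lower, 4), round(upper, 4), round(two_sided, 4))
```

In this example both row totals are equal, so the null distribution happens to be symmetric and the two-sided p-value is twice the lower-tail value; in general the two need not be related so simply.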
Suppose the random variable X : S → R is a continuous random variable, with probability density
function fX : R → R. Let g : R → R and Y = g (X ). In general it may be hard to find the
distribution of Y . For some specific classes of g the random variable Y will also be a continuous random
variable and we can calculate its probability density function. We recall the method discussed in
Section 5.3. One immediately observes that the distribution function of Y is given by
$$P(Y \le y) = P(g(X) \le y) = P(X \in g^{-1}(-\infty, y]).$$
Thus the above formula provides a theoretical expression for the distribution function of Y provided
for all y the function g is such that $g^{-1}(-\infty, y]$ is an event. Now, let us assume that g is a strictly
increasing and differentiable function with $g'$ being continuous and $g'(x) > 0$ for all x ∈ R. This
implies that $g^{-1} : \mathbb{R} \to \mathbb{R}$ exists and is differentiable. The distribution function of Y is given by
$$P(Y \le y) = P(g(X) \le y) = P(X \le g^{-1}(y)) = \int_{-\infty}^{g^{-1}(y)} f_X(x)\,dx.$$
From the above, using the fundamental theorem of calculus, we see that Y has a probability density
function $f_Y : \mathbb{R} \to \mathbb{R}$ given by
$$f_Y(y) = \frac{dg^{-1}}{dy}(y)\, f_X(g^{-1}(y)), \tag{A.1.1}$$
for all y ∈ R. Now, let us assume that g is a strictly decreasing and differentiable function with
$g'$ being continuous and $g'(x) < 0$ for all x ∈ R. This implies that $g^{-1} : \mathbb{R} \to \mathbb{R}$ exists and is
differentiable. The distribution function of Y is given by
$$P(Y \le y) = P(g(X) \le y) = P(X \ge g^{-1}(y)) = \int_{g^{-1}(y)}^{\infty} f_X(x)\,dx.$$
From the above, using the fundamental theorem of calculus, we see that Y has a probability density
function $f_Y : \mathbb{R} \to \mathbb{R}$ given by
$$f_Y(y) = -\frac{dg^{-1}}{dy}(y)\, f_X(g^{-1}(y)), \tag{A.1.2}$$
for all y ∈ R. We now present the above deductions as a theorem below.
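$$f_Y(y) = \begin{cases} \left| \dfrac{dg^{-1}}{dy}(y) \right| f_X(g^{-1}(y)), & y \in \mathrm{Range}(g) \\ 0 & \text{otherwise} \end{cases} \tag{A.1.3}$$
or equivalently
$$f_Y(y) = \begin{cases} \dfrac{1}{|g'(x)|}\, f_X(x), & \text{with } y = g(x),\ x \in I \\ 0 & \text{otherwise.} \end{cases} \tag{A.1.4}$$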
We can assume without loss of generality that X (s) ∈ I for all s ∈ S so that Y (s) = g (X (s)) is
well defined for all s ∈ S.
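To see formula (A.1.3) at work numerically, consider the following sketch. It assumes, purely for illustration, that X ∼ Uniform(0, 1) and g(x) = −log(x), which is strictly decreasing on I = (0, 1) with inverse $g^{-1}(y) = e^{-y}$, so that (A.1.3) gives $f_Y(y) = e^{-y}$ for y > 0. The function names and the finite-difference step are our own choices.

```python
import math
import random

def f_X(x):
    """Density of X ~ Uniform(0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def g_inv(y):
    """Inverse of g(x) = -log(x), which is strictly decreasing on I = (0, 1)."""
    return math.exp(-y)

def f_Y(y, h=1e-6):
    """Density of Y = g(X) via (A.1.3), with a central-difference derivative of g_inv."""
    if y <= 0.0:  # y lies outside Range(g) = (0, infinity)
        return 0.0
    dginv_dy = (g_inv(y + h) - g_inv(y - h)) / (2.0 * h)
    return abs(dginv_dy) * f_X(g_inv(y))

# Integrate f_Y over (0, 1] with the midpoint rule to get P(Y <= 1).
steps = 1000
p_formula = sum(f_Y((i + 0.5) / steps) for i in range(steps)) / steps

# Monte Carlo estimate of the same probability by sampling X and applying g.
rng = random.Random(1)
samples = [-math.log(1.0 - rng.random()) for _ in range(100000)]
p_simulated = sum(1 for y in samples if y <= 1.0) / len(samples)

print(round(p_formula, 4), round(p_simulated, 4))
```

Both numbers should be close to $1 - e^{-1}$, the value obtained by integrating $e^{-y}$ over (0, 1].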
Example A.1.2. Let X ∼ Uniform (−1, 1). Let g : (−1, 1) → R be given by $g(x) = x^2$ and Y = g (X ).
Observe that g is differentiable on (−1, 1) with g ′ : (−1, 1) → R given by g ′ (x) = 2x. Again g ′ (0) = 0
so Theorem A.1.3 is not applicable. As we have seen before we can calculate the probability density
function of Y . We first calculate the distribution function of Y .
$$P(Y \le y) = \begin{cases} 0 & y \le 0 \\ \sqrt{y} & 0 < y < 1 \\ 1 & y \ge 1 \end{cases}$$
We note that the distribution function of Y is piecewise differentiable and hence Y has a probability
density function given by
$$f_Y(y) = \begin{cases} \frac{1}{2}\, y^{-1/2} & y \in (0, 1), \\ 0 & \text{otherwise.} \end{cases}$$
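The distribution function in Example A.1.2 can be checked by simulation. The sketch below draws X uniformly from (−1, 1), squares it, and compares the empirical distribution function of Y with $\sqrt{y}$ at a few points; the sample size, seed and evaluation points are arbitrary choices.

```python
import random

rng = random.Random(2)
# Samples of Y = X^2 with X ~ Uniform(-1, 1).
ys = [rng.uniform(-1.0, 1.0) ** 2 for _ in range(100000)]

def empirical_cdf(y):
    """Fraction of simulated values of Y that are at most y."""
    return sum(1 for v in ys if v <= y) / len(ys)

for y in (0.04, 0.25, 0.64):
    print(y, round(empirical_cdf(y), 3), round(y ** 0.5, 3))
```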
In the above example the transformation g was not one-one and hence was not invertible. We
note that the function g has a well defined inverse on each of the intervals (−1, 0) and (0, 1). Intuitively one
should be able to apply Theorem A.1.3 on each of these intervals. The next theorem formalises
this.
An alternate way of viewing the change of distribution formula in the previous subsection is as the
familiar u-substitution from calculus. If g (x) is a differentiable function with differentiable inverse
and Y = g (X ) then using the substitution y = g (x) (so $x = g^{-1}(y)$) we have
$$\begin{aligned}
\int_A f_Y(y)\,dy &= P(Y \in A) \\
&= P(g(X) \in A) \\
&= P(X \in g^{-1}(A)) \\
&= \int_{g^{-1}(A)} f_X(x)\,dx \\
&= \int_A f_X(g^{-1}(y))\,\frac{dg^{-1}}{dy}(y)\,dy.
\end{aligned}$$
But if $\int_A f_Y(y)\,dy = \int_A f_X(g^{-1}(y))\,\frac{dg^{-1}}{dy}(y)\,dy$ for every event A, then the integrands must be
the same and $f_Y(y) = f_X(g^{-1}(y))\,\frac{dg^{-1}}{dy}(y)$.
When multiple random variables are involved, a similar formula may be derived using the
multivariate change of variables formula involving the Jacobian. Recall the following result from
multivariate calculus:
h ( y1 , y2 , . . . , yn ) = ( x1 , x2 , . . . , xn )
where in the final line it is understood that the xj variables have been written in terms of
y1 , y2 , . . . , yn via the h−1 function.
Now let n ≥ 1 and suppose X1 , X2 , . . . , Xn are random variables with a joint density fX (x1 , x2 , . . . , xn ).
Let h : S → T be as in the theorem above and define an Rn -valued random vector
(Y1 , Y2 , . . . , Yn ) = h(X1 , X2 , . . . , Xn ).
$$\begin{aligned}
P((Y_1, Y_2, \dots, Y_n) \in A) &= P(h(X_1, X_2, \dots, X_n) \in A) \\
&= P((X_1, X_2, \dots, X_n) \in h^{-1}(A)) \\
&= \int\!\!\int\!\cdots\!\int_{h^{-1}(A)} f_X(x_1, x_2, \dots, x_n)\,dx_1\,dx_2 \cdots dx_n \\
&= \int\!\!\int\!\cdots\!\int_{A} f_X(x_1, x_2, \dots, x_n)\,|J(y_1, y_2, \dots, y_n)|\,dy_1\,dy_2 \cdots dy_n
\end{aligned}$$
where, as in the prior theorem, the xj variables are understood to have been written in terms of
y1 , y2 , . . . , yn .
Let fY (y1 , y2 , . . . , yn ) represent the joint density for the (Y1 , Y2 , . . . , Yn ). Since that density is
defined by the equation
$$P((Y_1, Y_2, \dots, Y_n) \in A) = \int\!\!\int\!\cdots\!\int_{A} f_Y(y_1, y_2, \dots, y_n)\,dy_1\,dy_2 \cdots dy_n$$

$$J = \frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \det \begin{bmatrix} 1/2 & 1/2 \\ 1/2 & -1/2 \end{bmatrix} = -\frac{1}{2}.$$
where the region in which the density is non-zero is the square with corners (0, 0), (1, 1), (2, 0),
and (1, −1) in the (y1 , y2 )-plane. In particular we could use this joint density to provide another
derivation of the density for the sum of two independent Uniform(0, 1) random variables. That is
simply the marginal distribution of Y1 alone, which we can now calculate.
When 0 < y1 < 1 we have
$$f_{Y_1}(y_1) = \int_{-\infty}^{\infty} f_Y(y_1, y_2)\,dy_2 = \int_{-y_1}^{y_1} \frac{1}{2}\,dy_2 = y_1,$$
and when 1 ≤ y1 < 2 we have
$$f_{Y_1}(y_1) = \int_{-\infty}^{\infty} f_Y(y_1, y_2)\,dy_2 = \int_{y_1 - 2}^{2 - y_1} \frac{1}{2}\,dy_2 = 2 - y_1.$$
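The two marginal integrals can be verified numerically. The sketch below assumes, as the Jacobian above indicates, that the transformation is Y1 = X1 + X2 and Y2 = X1 − X2, so that x1 = (y1 + y2)/2 and x2 = (y1 − y2)/2; the function names and the number of integration steps are illustrative choices.

```python
def f_Y(y1, y2):
    """Joint density of (Y1, Y2): equal to |J| = 1/2 when (x1, x2) is in the unit square."""
    x1, x2 = (y1 + y2) / 2.0, (y1 - y2) / 2.0
    return 0.5 if 0.0 < x1 < 1.0 and 0.0 < x2 < 1.0 else 0.0

def marginal_y1(y1, steps=20000):
    """Midpoint-rule approximation of the integral of f_Y(y1, y2) over y2 in (-1, 1)."""
    width = 2.0 / steps
    return sum(f_Y(y1, -1.0 + (i + 0.5) * width) for i in range(steps)) * width

for y1 in (0.3, 1.0, 1.6):
    triangular = y1 if y1 < 1.0 else 2.0 - y1
    print(y1, round(marginal_y1(y1), 3), triangular)
```

The range (−1, 1) for y2 is enough because the square on which the density is non-zero lies between y2 = −1 and y2 = 1.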
not originally be the case, but this problem can often be alleviated by inserting extra variables.
For instance, consider Exercise 5.5.4 from Chapter 5. In that problem (X1 , X2 , X3 ) are given as
independent and uniformly distributed on (0, 1). The problem asks for the value of $P(X_1 X_3 < X_2^2)$.
While that problem can be solved using the techniques from that chapter, it can also be solved
using the Jacobian method.
Example A.1.6. Let $Y_1 = X_1 X_3$ and let $Y_2 = X_2^2$. Note that these are the quantities of interest
in the probability we are asked to compute. To use the Jacobian technique we will also introduce
$Y_3 = X_1$ simply to maintain an equal number of variables. On the region $X_1, X_2, X_3 \in (0, 1)$ where
the density is non-zero, this transformation is invertible. Solving for the X-variables gives: $X_1 = Y_3$,
$X_2 = \sqrt{Y_2}$, and $X_3 = Y_1 / Y_3$. Therefore,
$$J = \frac{\partial(x_1, x_2, x_3)}{\partial(y_1, y_2, y_3)} = \det \begin{bmatrix} 0 & 0 & 1 \\ 0 & \frac{1}{2\sqrt{y_2}} & 0 \\ \frac{1}{y_3} & 0 & -\frac{y_1}{y_3^2} \end{bmatrix} = -\frac{1}{2\sqrt{y_2}\, y_3}.$$
Since the joint density for the X-variables is $f_X(x_1, x_2, x_3) = 1$ whenever $0 < x_1, x_2, x_3 < 1$
(and 0 otherwise) we have
$$f_Y(y_1, y_2, y_3) = f_X(x_1, x_2, x_3)\,|J(y_1, y_2, y_3)| = \begin{cases} \dfrac{1}{2\sqrt{y_2}\, y_3} & \text{if } 0 < y_1 < y_3 < 1 \text{ and } 0 < y_2 < 1 \\ 0 & \text{otherwise.} \end{cases}$$
At that point we may calculate the desired probability using the new joint density.
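As a sanity check on the answer, the probability can also be estimated by direct simulation of the original variables. Working it out by hand, $P(X_1 X_3 < X_2^2) = 1 - E(\sqrt{X_1})\,E(\sqrt{X_3}) = 1 - (2/3)^2 = 5/9$; this value is our own calculation rather than one quoted from the exercise, and the function name `estimate`, the sample size and the seed below are arbitrary.

```python
import random

def estimate(reps=200000, seed=3):
    """Monte Carlo estimate of P(X1 * X3 < X2^2) for independent Uniform(0, 1) variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x1, x2, x3 = rng.random(), rng.random(), rng.random()
        if x1 * x3 < x2 ** 2:
            hits += 1
    return hits / reps

p_hat = estimate()
print(round(p_hat, 3), round(5 / 9, 3))
```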
In this section we shall state and prove the strong law of large numbers.
As remarked in Chapter 8, the above result states that the convergence of the sample mean to µ
actually happens with probability one. This mode of convergence of the sample mean to the true
mean is called “convergence with probability 1.” We define it precisely below.
P (A) = 1. (A.2.2)
is typically used to convey that the sequence X1 , X2 , . . . converges with probability one to X.
As alluded to earlier, this is a stronger mode of convergence than convergence in probability. We
prove this in the next proposition.
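Proof- Let ϵ > 0 and δ > 0 be given. We need to show ∃N such that
$$P(|X_n - X| > \epsilon) < \delta, \quad \forall n \ge N.$$
Let $A = \{\omega \in S : \lim_{n \to \infty} X_n(\omega) = X(\omega)\}$. By assumption,
$$P(A) = 1. \tag{A.2.4}$$
For η > 0 and n ≥ 1, define
$$A_n^{\eta} = \{\omega \in S : |X_n(\omega) - X(\omega)| \le \eta\};$$
then
$$A = \bigcap_{\eta > 0} \bigcup_{k=1}^{\infty} \bigcap_{n=k}^{\infty} A_n^{\eta}.$$
This can be verified using the fact that ω ∈ A if and only if for all η > 0, there is a k ≡ k (ω ) such
that
$$|X_n(\omega) - X(\omega)| \le \eta, \quad \forall n \ge k.$$
For m ≥ 1, define $B_m^{\epsilon} = \bigcap_{n=m}^{\infty} A_n^{\epsilon}$. Note
$$B_m^{\epsilon} \subset B_{m+1}^{\epsilon}, \tag{A.2.5}$$
and so, as m → ∞,
$$P(B_m^{\epsilon}) \uparrow P\Big(\bigcup_{m=1}^{\infty} B_m^{\epsilon}\Big). \tag{A.2.6}$$
As $A \subset \bigcup_{m=1}^{\infty} B_m^{\epsilon}$, using (A.2.4) we have $1 = P(A) \le P\big(\bigcup_{m=1}^{\infty} B_m^{\epsilon}\big) \le 1$. So
$$P\Big(\bigcup_{m=1}^{\infty} B_m^{\epsilon}\Big) = 1. \tag{A.2.7}$$
By (A.2.6) and (A.2.7), there exists N such that
$$P(B_m^{\epsilon}) > 1 - \delta, \quad \forall m \ge N.$$
As $B_m^{\epsilon} \subset A_m^{\epsilon}$,
$$P(A_m^{\epsilon}) > 1 - \delta, \quad \forall m \ge N,$$
that is, $P(|X_m - X| > \epsilon) < \delta$ for all m ≥ N.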
$X_n \xrightarrow{p} X$ and $X_n \xrightarrow{p} Y$
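$$0 \le P\Big(|X_n - X| > \frac{1}{2k}\Big) < \frac{\delta}{2} \quad \text{and} \quad 0 \le P\Big(|X_n - Y| > \frac{1}{2k}\Big) < \frac{\delta}{2}. \tag{A.2.8}$$
Using the triangle inequality we observe that $|X - Y| \le |X - X_n| + |X_n - Y|$ for all n ≥ 1. So,
$$A_k \subset \Big\{|X_n - X| > \frac{1}{2k}\Big\} \cup \Big\{|X_n - Y| > \frac{1}{2k}\Big\} \tag{A.2.9}$$
for all n ≥ 1. Combining (A.2.8) and (A.2.9) we have (using any n ≥ N )
$$0 \le P(A_k) \le P\Big(|X_n - X| > \frac{1}{2k}\Big) + P\Big(|X_n - Y| > \frac{1}{2k}\Big) \le \frac{\delta}{2} + \frac{\delta}{2} = \delta.$$
$$P(X \ne Y) = \lim_{k \to \infty} P(A_k) = 0.$$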
Hence P (X = Y ) = 1. ■
Proof of Theorem A.2.1 (Special Case)- We provide a complete proof of Theorem A.2.1 in the
special case when the random variables are i.i.d. Bernoulli(p) random variables. We will proceed in
two steps.
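$$0 \le \underline{S} \le \overline{S} \le 1.$$
$$N_k = \inf\Big\{n \in \mathbb{N} : \frac{X_k + X_{k+1} + \cdots + X_{k+n-1}}{n} \ge \overline{S} - \epsilon\Big\}.$$
The random variable $N_k$, in some sense, measures how close we are to $\overline{S}$ and our main effort will
be to control the size of $N_k$. It is easy to see that the $N_k$ are finite a.e. and are all identically distributed
(because the $X_i$ are independent and identically distributed). Hence we can choose an m such that $P(N_k > m) < \epsilon$ for all k.
Define random variables $Y_k$ and $N_k^Y$ by the following mechanism:
$$Y_k = \begin{cases} X_k & \text{if } N_k \le m \\ 1 & \text{if } N_k > m \end{cases} \tag{A.2.10}$$
$$N_k^Y = \inf\Big\{n \in \mathbb{N} : \frac{Y_k + Y_{k+1} + \cdots + Y_{k+n-1}}{n} \ge \overline{S} - \epsilon\Big\}. \tag{A.2.11}$$
Clearly $N_k^Y \le N_k$ and if k is such that $N_k > m$ then $N_k^Y = 1$ (since setting $Y_k = 1$ ensures
that we are above $\overline{S} - \epsilon$ immediately). So we have
$$N_k^Y \le m \quad \text{a.e.}$$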
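So for large enough n ∈ N we can break up $\sum_{k=1}^{n} Y_k$ into pieces of lengths at most m such that
the average over each piece is at least $\overline{S} - \epsilon$. Then finally stop at the n-th term. Then it is clear
that
$$\sum_{k=1}^{n} Y_k \ge (n - m)(\overline{S} - \epsilon). \tag{A.2.12}$$
By our choice of m,
$$E(Y_k) \le E(X_k) + P(N_k > m) < E(X) + \epsilon$$
for any k. Take expectations in (A.2.12) and use the above inequality to obtain
$$E(\overline{S}) \le E(X). \tag{A.2.13}$$
Let $\tilde{X}_k = 1 - X_k$. Applying the above argument to $\tilde{X}$ (verify this) we have
$$E(\tilde{S}) \le E(\tilde{X}), \tag{A.2.14}$$
where $\tilde{S} = 1 - \underline{S}$ is the corresponding limit superior for the $\tilde{X}_k$; that is, $E(\underline{S}) \ge E(X)$.
Now, $\underline{S} \le \overline{S}$ a.e. So the only way (A.2.14) and (A.2.13) can both hold is if $\underline{S} = \overline{S}$ a.e. Therefore $\lim_{n \to \infty} \overline{X}_n$
exists almost everywhere and let us call it X. This completes step 1.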
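Step 2: We shall now use the Weak Law of Large Numbers (Theorem 8.2.1), along with
Proposition A.2.3 and Lemma A.2.4, to complete the proof. The weak law implies that
$$\overline{X}_n \xrightarrow{p} \mu \quad \text{as } n \to \infty.$$
By Step 1 and Proposition A.2.3,
$$\overline{X}_n \xrightarrow{p} X \quad \text{as } n \to \infty.$$
By Lemma A.2.4, P (X = µ) = 1, and hence
$$\overline{X}_n \xrightarrow{\text{w.p.1}} \mu \quad \text{as } n \to \infty.$$
■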
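The statement just proved can be illustrated by following a single simulated sequence of Bernoulli(p) trials and watching its running sample mean settle near p. This is only an illustration along one sample path, not a substitute for the proof; the function name `running_means`, the value p = 0.3, the sequence length and the seed are arbitrary assumptions.

```python
import random

def running_means(p, n, seed=4):
    """Sample means of the first k trials, k = 1..n, along one Bernoulli(p) sequence."""
    rng = random.Random(seed)
    total = 0
    means = []
    for k in range(1, n + 1):
        total += 1 if rng.random() < p else 0
        means.append(total / k)
    return means

means = running_means(p=0.3, n=100000)
for k in (10, 100, 1000, 10000, 100000):
    print(k, round(means[k - 1], 4))
```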
Proof of Theorem A.2.1 (General Case) The essence of the proof is contained in the special
case proven above. We provide a sketch of the proof.
Case 1: (0 ≤ X ≤ 1) An imitation of Step 1 of the proof for Bernoulli(p) random variables will
show that there is a limit. Step 2 of the above proof follows readily.
Case 2: (Bounded Case) Suppose the random variable X is bounded, i.e. |X| ≤ M for some
M > 0. One can consider $Y = \frac{X + M}{2M}$ and $Y_i = \frac{X_i + M}{2M}$. As 0 ≤ Y ≤ 1, one can use Case 1 for
$Y_i$ to establish that there is a limit. Step 2 of the above proof follows readily.
Case 3: (General Case by Truncation) One fixes α, β > 0 and defines
$$\overline{S}^{(\alpha)} = \min\{\overline{S}, \alpha\}, \quad X^{(\beta)} = \max\{X, -\beta\} \quad \text{and} \quad X_k^{(\beta)} = \max\{X_k, -\beta\} \ \ \forall k \in \mathbb{N}.$$
The above quantities are all bounded. One imitates Step 1 of the above proof and this will result in
inequalities depending on α, β. One then allows α, β to approach infinity to establish that $\overline{X}_n \xrightarrow{\text{w.p.1}} X$
for a random variable X. Step 2 of the above proof follows readily. We refer the reader to [AS09]
for the complete proof. ■
z   0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998
Table B.1: Normal tables evaluating $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\,dx$
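Entries of Table B.1 can be reproduced with a few lines of code. The sketch below uses the identity $\Phi(z) = \frac{1}{2}\big(1 + \mathrm{erf}(z/\sqrt{2})\big)$ together with the error function from Python's standard `math` module; the function name `std_normal_cdf` is our own.

```python
import math

def std_normal_cdf(z):
    """Phi(z): the integral of exp(-x^2 / 2) / sqrt(2 * pi) from -infinity to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Each z is read from the table as row label plus column label, e.g. 1.9 + 0.06.
for z in (0.0, 1.0, 1.96, 2.58):
    print(z, round(std_normal_cdf(z), 4))
```

The printed values should match the table entries 0.5000, 0.8413, 0.9750 and 0.9951.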
[AS09] Athreya, Siva; Sunder, V.S. Measure and Probability. CRC Press (Outside India), 2009. ISBN
14 3980 126 6.
[CasBer90] Casella, George; Berger, Roger L. Statistical inference. The Wadsworth & Brooks/Cole
Statistics/Probability Series. Wadsworth & Brooks/Cole Advanced Books & Software, Pacific
Grove, CA, 1990. xviii+650 pp. ISBN: 0-534-11958-1
[Fel68] W. Feller, An introduction to probability theory and its applications, Vol. I. Third edition,
John Wiley & Sons, Inc., New York-London-Sydney, 1968.
[Fel71] W. Feller, An introduction to probability theory and its applications, Vol. II. Second edition,
John Wiley & Sons, Inc., New York-London-Sydney, 1971.
[FPP98] D. Freedman, R. Pisani, R. Purves, Statistics, Third edition, W. W. Norton and Company,
Inc., New York-London, 1998.
[FG97] B. Fristedt and L. Gray, A Modern Approach to Probability Theory, Birkhäuser, Boston, 1997.
[Ghah00] S. Ghahramani, Fundamentals of probability, Second edition, Prentice Hall, New Jersey,
2000.
[HPS72] P.G. Hoel, S.C. Port, and C.J. Stone, Introduction to Probability Theory, Houghton
Mifflin Company, 1972.
[Keane] M. Keane, The Essence of the Law of Large Numbers, pages 125–129, Algorithms, Fractals,
and Dynamics, Springer USA, 1995.
[Rao73] C.R. Rao, Linear Statistical Inference and its Applications, Second Edition, John Wiley,
New York, 1973.
[Ross84] S. Ross, A first course in probability, Second edition, Macmillan Publishing Company,
New York, 1984.
[Ser09] R.J. Serfling, Approximation theorems of mathematical statistics, John Wiley & Sons, 2009.
[Stig84] Stephen M. Stigler, Kruskal’s Proof of the Joint Distribution of $\overline{X}_n$ and $s_n^2$, The American
Statistician, Vol. 38, No. 2 (May 1984), pp. 134-135.
[Wil27] E.B. Wilson, Probable inference, the law of succession, and statistical inference, Journal of
the American Statistical Association, Vol. 22, No. 158 (June 1927), pp. 209-212.