Introduction to Probability Concepts
Introduction to Probability Concepts
Nathanaël Berestycki
Lecture notes1 - Michaelmas 2007
These notes are a list of some of the definitions and results presented
in the course. Since they are free from any motivating exposition or
examples, and since no proofs are given for any of the theorems, these
notes should be used only as a reference to supplement the recom-
mended texts. The class material consists, ultimately, of what was
done in class and in the exmaple sheets. In what follows, phrases en-
closed in brackets [ . . . ] are digressive or slightly outside of the syllabus
of the course, but are included for completeness. A table of notation
is in the appendix.
1. Basic tools
1.1. Probability spaces.
1
Version of November 28th, 2007
1
1.2. Random variables and distribution functions.
Definition 1.4. Let (Ω, F, P) be a probability space. A random vari-
able is a function X : Ω → R such that the set {ω ∈ Ω : X(ω) ≤ t} is
an element of F for all t ∈ R.
Let A be a subset of R, and let X be a random variable. We use the
notation {X ∈ A} to denote the set {ω ∈ Ω : X(ω) ∈ A}. A random
variable X is said to take values in a subset S ⊆ R if X ∈ S almost
surely.
The distribution function of X is the function FX : R → [0, 1] defined
by
FX (t) = P(X ≤ t)
for all t ∈ R.
[A distribution function is called defective if either limt↑∞ FX (t) < 1
or limt↓−∞ FX (t) > 0. Unless otherwise indicated, all random vari-
ables considered here are assumed to have non-defective distribution
functions.]
The law of X is the probability measure ν such that
ν([a, b]) = P(a ≤ X ≤ b)
for all a ≤ b. The statement “the random variable X has law ν” is
written X ∼ ν. [The measure ν is defined on theBorel sigma-field
B, which is defined as the smallest sigma-field on R containing the
intervals [a, b] for all a ≤ b.]
Definition 1.5. Let A be an event in Ω. The indicator function of the
event A is the random variable 1A : Ω → {0, 1} defined by
1 if ω ∈ A
1A (ω) =
0 if ω ∈ Ac
for all ω ∈ Ω.
Definition 1.6. A random variable X is called discrete if X takes
values in a countable set.
If X is discrete, the function pX : R → [0, 1] defined by pX (t) =
P(X = t) is called the mass function of X.
Definition 1.7. Let X is a discrete random variable taking values in
N with mass function pX .
The random variable X is called
• Bernoulli with parameter p if
pX (0) = 1 − p and pX (1) = p.
where 0 ≤ p ≤ 1.
2
• binomial with parameters n and p, written X ∼ bin(n, p), if
n k
pX (k) = p (1 − p)n−k for all k = 0, 1, . . . , n
k
where n ∈ N and 0 ≤ p ≤ 1.
• Poisson with parameter λ if
λk −λ
pX (k) = e for all k = 0, 1, 2, . . .
k!
where λ ≥ 0.
• geometric with parameter p if
pX (k) = p(1 − p)k−1 for all k = 1, 2, 3, . . .
where 0 ≤ p ≤ 1.
[In the above formulae, the convention 00 = 1 is used. If X is geometric
with parameter p = 0, then X = ∞ almost surely.]
In practice, a random variable is specified by its law. We almost
never need to construct explicitly the probability space (Ω, F, P) asso-
ciated with it.
Definition 1.8. Let FX be the distribution of a random variable X.
The random variable X is continuous if and only if FX is a continuous
function.
The random variable X is absolutely continuous if and only if there
exists a positive function fX : R → [0, ∞) such that
Z t
FX (t) = fX (s)ds
−∞
for all t ∈ R, in which case the function fX is called the density function
of X.
Definition 1.9. Let X is a continuous random variable with density
function fX .
The random variable X is called
• uniform on the interval (a, b), written X ∼ unif(a, b), if
1
fX (t) = for all a < t < b
b−a
for some a < b.
• normal with mean µ and variance σ 2 , written X ∼ N (µ, σ 2 ), if
1 (x − µ)2
fX (t) = √ exp − for all t ∈ R
2πσ 2σ 2
for some µ ∈ R and σ 2 > 0.
3
• exponential with rate λ, written X ∼ exp(λ), if
fX (t) = λe−λt for all t ≥ 0
for some λ > 0.
• Cauchy if
1
fX (t) = for all t ∈ R.
π(1 + t2 )
1.3. Expectations and variances.
Definition 1.10. Let X be a random variable on (Ω, F, P). The ex-
pected value of X is denoted by E(X) and is given by the integral
Z
E(X) = X(ω)P(dω).
Ω
The above integral is defined in the following cases:
• if X ≥ 0 almost surely.
• if either E(X + ) or E(X − ) is finite, in which case E(X) =
E(X + ) − E(X − ).
A random variable X is integrable if E|X| < ∞ and is square-integrable
if E(X 2 ) < ∞. The terms expected value, expectation, and mean are
interchangeable.
The variance of an integrable random variable X, written Var(X),
is
Var(X) = E(X 2 ) − E(X)2 .
The covariance of square-integrable random variable X and Y , writ-
ten Cov(X, Y ), is
Cov(X, Y ) = E(XY ) − E(X)E(Y ).
If neither X or Y is almost surely constant, then their correlation,
written ρ(X, Y ), is
Cov(X, Y )
ρ(X, Y ) = .
Var(X)1/2 Var(Y )1/2
Random variables X and Y are called uncorrelated if Cov(X, Y ) = 0.
Theorem 1.11. Let the function g : R → R be such that g(X) is
integrable.
If X is a discrete random variable with probability mass function pX
taking values in a countable set S then
X
E(g(X)) = g(t) pX (t).
t∈S
4
If X is an absolutely continuous integrable random variable with den-
sity function fX then
Z ∞
E(g(X)) = g(t) fX (t) dt.
−∞
if P(Y = t) = 0.
For fixed random variable X and Y , let h : R → R be the function
defined by h(t) = E(X|Y = t). The conditional expectation of X given
Y , written E(X|Y ), is
E(X|Y ) = h(Y ).
6
Theorem 1.19. Let the function g : R → R be such that g(X) is
integrable.
If X and Y are discrete taking values in S and pY (t) > 0, then
X
E(g(X)|Y = t) = g(s) pX|Y (s, t).
s∈S
for every finite subset I ⊂ N then the events are said to be independent.
Random variables X1 , X2 , . . . are called independent if the events
{X1 ≤ t1 }, {X2 ≤ t2 }, . . . are independent. [The phrase “independent
and identically distributed” is often abbreviated i.i.d.]
Theorem 1.22. If X and Y are discrete and independent random
variables then
pX,Y (s, t) = pX (s)pY (t).
If X and Y are jointly absolutely continuous and independent random
variables then
fX,Y (s, t) = fX (s)fY (t).
If X and Y are independent and integrable, then
E(XY ) = E(X)E(Y ).
1.5. Probability inequalities.
Theorem 1.23 (Markov’s inequality). Let X be a nonnegative random
variable with E(X) < ∞. Then
E(X)
P(X ≥ A) ≤
A
for all A > 0.
7
Corollary 1.24 (Chebychev’s inequality). Let X be a random variable
with E(X) = µ and Var(X) = σ 2 < ∞. Then
σ2
P(|X − µ| ≥ A) ≤
A2
for all A > 0.
1.6. Generating and characteristic functions.
Definition 1.25. Let X be a random variable on (Ω, F, P) taking
values in {0, 1, 2, . . .}. The moment generating function of X is the
function GX : [0, 1] → [0, 1] defined by
GX (t) = E(tX )
for all t ∈ (0, 1] and GX (0) = P(X = 0).
Theorem 1.26. Let GX be the moment generating function of X.
Then
1 (n)
P(X = n) = GX (0)
n!
(n)
for all n ∈ N, where GX (0) denotes the n-th derivative of GX evalu-
tated at 0.
Definition 1.27. Let X be a random variable on (Ω, F, P). The
Laplace transform of X is the function MX : R → R ∪ {∞} defined by
MX (t) = E(etX )
for all t ∈ R.
Theorem 1.28. Let MX be the Laplace transform of a random variable
X, and suppose there exists an > 0 such that MX (t) < ∞ for all
− < t < . Then
(n)
E(X n ) = MX (0)
(n)
for all n ∈ N, where MX (0) denotes the n-th derivative of MX eva-
lutated at 0. (The number µn = E(X n ) is called the n-th moment of
X.)
Definition 1.29. The characteristic function of a real-valued random
variable X is the function φX : R → C defined by
φX (t) = E(eitX )
√
for all t ∈ R, where i = −1.
Theorem 1.30 (Uniqueness of generating and characteristic func-
tions). Let X and Y be real-valued random variables with distribution
functions FX and FY .
8
• Let φX and φY be the characteristic functions of X and Y . Then
φX (t) = φY (t) for all t ∈ R
if and only if
FX (t) = FY (t) for all t ∈ R.
• Let X and Y be valued in N with moment generating functions
GX and GY . Then
GX (t) = GY (t) for all t ∈ [0, 1]
if and only if
FX (t) = FY (t) for all t ∈ R.
• Let MX and MY be the Laplace transforms of X and Y . If there
exists an > 0 such that
MX (t) = MY (t) < ∞ for all t ∈ (−, )
then
FX (t) = FY (t) for all t ∈ R.
Theorem 1.31. Inversion formula. Let X be a random variable with
characteristic function ϕ. Then for any a < b
Z T −ita
1 e − e−itb 1
lim ϕ(t)dt = P(a < X < b)+ (P(X = a)+P(X = b))
T →∞ 2π −T it 2
2. Fundamental probability results
Definition 2.1. Let x1 , x2 , . . . be a sequence of real numbers. The
limit superior is defined by
lim sup xn = inf sup xn
n↑∞ N ∈N n≥N
Equivalently, if x ∈ R then
{n ∈ N : xn > x + } is finite
x = lim sup xn ⇔ for all > 0
n↑∞ {n ∈ N : xn > x − } is infinite
and
{n ∈ N : xn < x + } is infinite
x = lim inf xn ⇔ for all > 0
n↑∞ {n ∈ N : xn < x − } is finite
If lim supn↑∞ xn = lim inf n↑∞ xn = x then the sequence x1 , x2 , . . . is
convergent and with limit x = limn↑∞ xn .
9
Definition 2.2 (Modes of convergence). Let X1 , X2 , . . . and X be ran-
dom variables.
• Xn → X almost surely if P(Xn → X) = 1
• Xn → X in Lp , for p ≥ 1, if E|X|p < ∞ and E|Xn − X|p → 0
• Xn → X in probability if P(|Xn − X| > ) → 0 for all > 0
• Xn → X in distribution if FXn (t) → FX (t) for all points t ∈ R
of continuity of FX
Theorem 2.3. The following implications hold:
Xn → X almost surely
or ⇒ Xn → X in probability ⇒ Xn → X in distribution
Xn → X in Lp , p ≥ 1
Furthermore, if r ≥ p ≥ 1 then Xn → X in Lr ⇒ Xn → X in Lp .
Definition 2.4. Let A1 , A2 , . . . be events. The term eventually is de-
fined by [ \
{An eventually} = An
N ∈N n≥N
and infinitely often by
\ [
{An infinitely often} = An .
N ∈N n≥N
3. Markov chains
Definition 3.1. A stochastic process is a collection (Xi )i∈I of random
variables. If the index set I is N, then the stochastic process is called
discrete time and is denoted by (Xn )n≥0 . If the index set I is [0, ∞),
then the stochastic process is called continuous time and is denoted by
(Xt )t≥0 .
3.1. Discrete time Markov chains.
Definition 3.2. Let (Xn )n≥0 be a discrete time stochastic process tak-
ing values in a countable set S. The process (Xn )n≥0 is called a Markov
chain if
P(Xn = in |X0 = i0 , . . . , Xn−1 = in−1 ) = P(Xn = in |Xn−1 = in−1 )
for each n ≥ 1, and s0 , . . . , sn ∈ S. The set S is called the state space of
the Markov chain, and an elements of S is called a state. The Markov
chain is called time homogeneous if
P(Xn = j|Xn−1 = i) = P(X1 = j|X0 = i)
for all n ≥ 1 and all states i, j ∈ S.
Throughout these notes, all Markov chains are assumed to be time
homogeneous.
Definition 3.3. Let (Xn )n≥0 be a Markov chain on a state space S.
The numbers pij = P(X1 = j|X0 = i) for i, j ∈ S are called the one-
step transition probabilities, and pij (n) = P(Xn = j|X0 = i) for n ≥ 0
the n-step transition probabilities.
The |S| × |S| matrix P = (pij )i,j∈S is called the transition matrix.
We use the notation
Pi (A) = P(A|X0 = i)
for any event A ∈ F, and
Ei (Z) = E(Z|X0 = i)
for any random variable Z for which the conditional expectation can
be defined.
Theorem 3.4 (Chapman-Kolmogorov equations). Let pij (n) for i, j ∈
S and n ≥ 0 denote the n-step transition probabilities of a Markov
chain with state space S. Then
X
pik (m + n) = pij (m)pjk (n)
j∈S
12
for all i, k ∈ S and m, n ≥ 0. In particular, the n-step transition
probabilities are given by
pij (n) = (P n )ij
where P is the transition matrix of the Markov chain.
Definition 3.5. A |S| × |S|P matrix P = (pij )i,j∈S is called stochastic
if pij ≥ 0 for all i, j ∈ S and j∈S pij = 1 for all i ∈ S. In particular, a
matrix is stochastic if and only if it is the transition matrix of a Markov
chain.
Theorem 3.6. Let P be an d × d matrix where d < ∞. Let λ1 , . . . , λd
be the eigenvalues of P . If the eigenvalues are distinct, then there exists
(k)
complex numbers aij for i, j, k ∈ {1, . . . , d} such that
d
X
n (k)
(P )ij = aij λnk
k=1
for all n ∈ N. More generally, if the eigenvalues are not necessarily dis-
(k)
tinct, then there exists polynomials aij : N → C for i, j, k ∈ {1, . . . , d}
of degree less than the multiplicity of the eigenvalue λk such that
d
X
n (k)
(P )ij = aij (n)λnk
k=1
for all n ∈ N.
Theorem 3.7. Let (Xn )n≥0 be a Markov chain on a state space S, and
let A ⊆ S. Define the random variable H A valued in N ∪ {∞} by
H A = inf{n ≥ 0 : Xn ∈ S}.
[The standard convention inf ∅ = ∞ is used throughout.] For each state
i ∈ S let hA A A
i = Pi (H < ∞). Then (hi )i∈S is the minimal non-negative
solution to
A 1 if i ∈ A
hi = P A
p h
i∈S ij j otherwise
Definition 3.8. Let pij (n) for i, j ∈ S and n ≥ 0 denote the n-step
transition probabilities of a Markov chain with state space S.
States i leads to state j, written i → j, if there exists an n ≥ 0 such
that pij (n) > 0. States i and j communicate, written i ↔ j, if i → j
and j → i. The communicating class containing a state i is the largest
subset C ⊆ S with the property that if state j is in C then states i and
j communicate.
13
A communicating class C is called closed it has the property if i ∈ C
and i → j then j ∈ C. Otherwise, C is called open if there exists a
state i in C and a state j not in C such that i leads to j.
A Markov chain is irreducible if all states in S communicate.
Definition 3.9. A state i ∈ S is called recurrent if
Pi (Xn = i infinitely often) = 1,
and called transient if
Pi (Xn = i infinitely often) < 1.
Let C ⊆ S be a communicating class of the Markov chain. The class
is called a transient class if every state i ∈ C is transient, and the class
is called a recurrent class if every state i ∈ C is recurrent.
[The terms recurrent and persistent are interchangeable.]
Definition 3.10. Let (Xn )n≥0 be a Markov chain with state space S.
A stopping time T is a random variable taking values in N∪{∞} having
the property that for each n ∈ N the event {T = n} is determined by
X0 , . . . , Xn } in the sense that there exists a function f : S n → {0, 1}
such that
1{T =n} = f (X0 , . . . , Xn ).
Theorem 3.11 (The strong Markov property). Let (Xn )n≥0 be a Markov
chain with state space S and let T be a stopping time. Then conditional
on the events {T < ∞} and XT = i, the random process (XT +n )n≥0 is
a Markov chain starting at i, independent of X0 , . . . , XT .
Theorem 3.12. Let (Xn )n≥0 be a Markov chain with state space S.
For each state i ∈ S define the random variable Ti taking values in
N ∪ {∞} by
Ti = inf{n ≥ 1 : Xn = i}.
The following are equivalent:
• State
P∞ i is recurrent.
• n=1 pii (n) = ∞.
• Pi (Ti < ∞) = 1
Moreover, the following are equivalent:
• State i is transient.
• Pi (Xn = i infinitely often) = 0.
• Pi (T
P i < ∞) < 1.
∞ 1
• n=1 pii (n) = 1−Pi (Ti <∞) < ∞.
14
Definition 3.13. Let (Xn )n≥0 be a Markov chain and let Ti = inf{n ≥
1 : Xn = i}. The state i is called positive recurrent if Ei (Ti ) < ∞ and is
called null recurrent otherwise. [The terms positive and non-null are
interchangeable in the context of recurrent Markov chains.]
Theorem 3.14 (Recurrence, transience, and positive recurrence are
class properties). Let i and j be communicating states of a Markov
chain. Then i is recurrent if and only if j is recurrent. Equivalently, i
is transient if and only if j is transient. Also, i is positive recurrent if
and only if j is positive recurrent.
Definition 3.15. Let P be the transition matrix of a Markov chain.
An invariant Pdistribution is a row vector π = (πi )i∈S such that πi ≥ 0
for all i ∈ S, i∈S πi = 1 and
πP = π.
[The terms invariant distribution and stationary distribution are inter-
changeable.]
Theorem 3.16. Let (Xn )n≥0 be an irreducible Markov chain with tran-
sition matrix P . Then (Xn )n≥0 is positive recurrent if an only if there
exists an invariant distribution for P . If the Markov chain is positive
recurrent then the invariant distribution is unique and is given by the
formula
1
πi =
Ei (Ti )
where Ti = inf{n ≥ 1 : Xn = i}.
Theorem 3.17. Let (Xn )n≥0 be an irreducible recurrent Markov chain
on S with transition matrix P . Let k ∈ S be a fixed state and let
Tk = inf{n ≥ 1 : Xn = k}. For each i ∈ S let
Tk
X
(k)
γi = Ek 1{Xn =i}
n=1
for all i.
• inf i gii > −∞
The matrix G = (gij )i,j∈S is called the generator of the Markov process.
All Markov chains considered here have uniform transition matrices.
Theorem 3.26. Let (Xt )t≥0 be a Markov chain with generator G, and
for each state i, let
Ui = inf{t > 0, Xt 6= i}.
Then conditional on X0 = i, the random variable Ui is exponential with
rate −gii ; that is
Pi (Ui > t) = egii t .
Furthermore,
gij
Pi (XU = j) = .
−gii
Theorem 3.27. Let G be the generator of a Markov chain with tran-
sition matrices (P (t))t≥0 . Then for all t ≥ 0
P
• j∈S gij = 0, or in matrix notation, G1 = 0 where 1 =
(1, 1, . . .)
• P (t) = etG = ∞ tn Gn
P
n=0 n!
• (Kolmogorov’s forward equation) P 0 (t) = P (t)G
• (Kolmogorov’s backward equation) P 0 (t) = GP (t)
17
where P 0 (t) denotes the matrix (p0ij (t))i,j∈S and p0ij denotes the deriva-
tive of pij .
Definition 3.28. An invariant distribution for a Markov chain with
transition matrices (P (t))t≥0 is a row vector π = (πi )i∈S such that
πP (t) = π for all t ≥ 0.
Theorem 3.29. Let G be the generator of a Markov chain. Then a
row vector π is an invariant distribution if and only if
πG = 0.
Definition 3.30. Let (pij (t))i,j∈S be the transition probabilities of a
Markov chain. The Markov chain is irreducible if pij (t) > 0 for all
t > 0.
Theorem 3.31. Let (pij (t))i,j∈S be the transition probabilities of an
irreducible Markov chain. If there exists an invariant distribution π,
the invariant distribution is unique and
pij (t) → πj as t ↑ ∞
for all i, j ∈ S . Otherwise, if no invariant distribution exists, then
pij (t) → 0.
4. Martingales
Theorem 4.1. Let X be an integrable random variable defined on the
probability space (Ω, F, P), and let G ⊆ F be a sigma-field of F. Then
there exists an integrable G-measureable random variable Y such that
E(1G Y ) = E(1G X)
for all G ∈ G. Furthermore, if there exists another G-measureable
random variable Y 0 such that E(1G Y 0 ) = E(1G X) for all G ∈ G, then
Y = Y 0 almost surely.
Definition 4.2. Let X be an integrable random variable and let G ⊆ F
be a sigma-field. The conditional expectation of X given G, written
E(X|G), is a G-measurable random variable with the property that
E [1G E(X|G)] = E(1G X)
for all G ∈ G.
Theorem 4.3. Let X and Y be integrable random variables defined on
(Ω, F, P), let H ⊆ G ⊆ F be sigma-algebras, and a, b ∈ R be constants.
• E(X|{∅, Ω}) = E(X)
• E(X|F) = X.
• E(aX + bY |G) = aE(X|G) + bE(Y |G)
18
• If X ≥ 0 almost surely, then E(X|G) ≥ 0 almost surely
• E[E(X|G)|H] = E[E(X|H)|G] = E(X|H)
• If Y is independent of G (the events {Y ≤ t} and G are inden-
dent for each t ∈ R and G ∈ G) then E(Y |G) = E(Y )
• If Y is G-measurable (the events {Y ≤ t} are in G for each
t ∈ R) then E(XY |G) = Y E(X|G)
Theorem 4.4. Let X be an integrable random variables defined on
(Ω, F, P).
• Let
S B1 , B2 , . . . be a sequence of disjoint non-null events with
n Bn = Ω. Let G be the smallest sigma-field containing {B1 , B2 , . . . , ...}.
Then
E(X1Bn )
E(X|G) = if ω ∈ Bn .
P(Bn )
• Let Y be a random variable on (Ω, F, P) and let G be the small-
est sigma-field containing the events {Y ≤ t} for all t ∈ R.
Then
E(X|G) = E(X|Y )
where the right side is defined by Definition 1.18.
Definition 4.5. A filtration (Fn )n≥0 on Ω is a collection of sigma-fields
on Ω such that
F0 ⊆ F1 ⊆ F2 ⊆ . . . .
Definition 4.6. A martingale relative to a filtration (Fn )n≥0 is a sto-
chastic process with the following properties:
• E|Mn | < ∞ for all n ≥ 0
• E(Mn+1 |Fn ) = Mn for all n ≥ 0.
Theorem 4.7 (The L2 martingale convergence theorem). Let (Mn )n≥0
be a martingale relative to a filtration (Fn )n≥0 . If supn≥0 E(Sn2 ) < ∞
then there exists a random variable M such that E(M 2 ) < ∞ and
Mn → M in L2 . Furthermore Mn = E(M |Fn ).
Definition 4.8. A stopping time for a filtration (Fn )n≥0 is a random
variable T : Ω → N ∪ {∞} with the property that {T = n} ∈ Fn for
each n ≥ 0.
Theorem 4.9 (The optional sampling theorem). Let (Mn )n≥0 be a
martingale relative to a filtration (Fn )n≥0 , let T a bounded stopping
time (there exists a real constant C > 0 such that T (ω) < C for almost
all ω ∈ Ω.) Then
E(MT ) = E(M0 ).
19
Theorem 4.10 (The optional stopping theorem). Let (Mn )n≥0 be a
martingale relative to a filtration (Fn )n≥0 , let T be almost surely finite
stopping time. (that is P(T < ∞) = 1.) If there is C > 0 such that
|Mn | ≤ C if n ≤ τ
then
E(MT ) = E(M0 ).
Theorem 4.11 (The optional stopping theorem). Let (Mn )n≥0 be a
martingale relative to a filtration (Fn )n≥0 , let T be almost surely finite
stopping time. (that is P(T < ∞) = 1.) If supn≥0 E(Mn2 ) < ∞ then
E(MT ) = E(M0 ).
As an application:
20
R the set of real numbers
R the set of extended real numbers {−∞} ∪ R ∪ {∞}
N the set of natural numbers {0, 1, 2, . . .}
C the set of complex numbers
Z the set of integers {. . . , −2, −1, 0, 1, 2, . . .}
P(Ω) the set of all subsets of Ω, called the power set of Ω
Ac the complement of the set A
A⊂B A is a proper subset of B
A⊆B A is a subset of B, and the possibility that A = B is allowed