Topic 1: Review of MAS 103
PRINCIPLES OF PROBABILITY
Notation
Events are denoted using capital letters so that p(A) denotes the probability that event A
occurs.
Range of Probability
0 ≤ p(A) ≤ 1
Axioms of Probability
The three basic probability axioms can be summarized as follows:
1. p(S) = 1.
It follows that, for any event A in the sample space S, p(A′) = 1 − p(A).
2. p(A) ≥ 0 for all A ⊂ S.
Rules 1 and 2 together are telling us that probabilities lie between 0 (impossible) and 1
(certain).
3. p(A ∪ B) = p(A) + p(B) if A ∩ B = ∅.
If two events cannot occur simultaneously, i.e. they are mutually exclusive, the probability of
the event defined by their union is equal to the sum of the probabilities of the two events.
This property is known as additivity.
Venn Diagrams
Venn diagrams are used to represent events graphically. We use set notation to identify different
areas on a Venn diagram.
ε or S, the universal set represents the sample space.
A, B, closed curves represent the events.
The event that A does not occur is denoted A′.
p(A′) = 1 − p(A)
The event that A or B or both occur is denoted A ∪ B.
p(A ∪ B) = p(A) + p(B) − p(A ∩ B) - the Addition Rule.
The event that both A and B occur is denoted A ∩ B.
p(A ∩ B) = p(A) × p(B|A) - the Multiplication Rule.
Conditional probability - the event that B occurs given that A has occurred is denoted B|A.
$$p(B|A) = \frac{p(A \cap B)}{p(A)}$$
Conditional probability - the event that A occurs given that B has occurred is denoted A|B.
$$p(A|B) = \frac{p(A \cap B)}{p(B)}$$
Conditioning shrinks the sample space to the event that is given to have occurred.
Mutually Exclusive events
If events A and B are mutually exclusive, then A ∩ B = ∅, so p(A ∩ B) = 0 and
p(A ∪ B) = p(A) + p(B).
Independent events
If events A and B are independent, then p(A ∩ B) = p(A) × p(B). This implies p(B|A) = p(B)
and p(A|B) = p(A).
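The rules above can be checked numerically by enumerating a small sample space. The sketch below assumes one roll of a fair six-sided die; the events A (even score) and B (score greater than 2) are illustrative choices and do not come from these notes.

```python
from fractions import Fraction

# Sample space for one roll of a fair die (illustrative assumption).
S = {1, 2, 3, 4, 5, 6}
A = {2, 4, 6}        # event A: the score is even
B = {3, 4, 5, 6}     # event B: the score is greater than 2

def p(event):
    """Probability of an event when all outcomes in S are equally likely."""
    return Fraction(len(event & S), len(S))

p_union = p(A | B)                 # p(A ∪ B)
p_inter = p(A & B)                 # p(A ∩ B)
p_B_given_A = p_inter / p(A)       # p(B|A) = p(A ∩ B) / p(A)

# Addition rule: p(A ∪ B) = p(A) + p(B) − p(A ∩ B)
addition_ok = p_union == p(A) + p(B) - p_inter
# Independence: p(A ∩ B) = p(A) × p(B), equivalently p(B|A) = p(B)
independent = p_inter == p(A) * p(B)

print(p_union, p_inter, p_B_given_A, addition_ok, independent)
```

For these particular events p(B|A) equals p(B), so A and B happen to be independent even though they are not mutually exclusive.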
The Law of Total Probability
Let A1, ..., Ak be mutually exclusive and exhaustive events. Then for any other event B,
$$p(B) = p(B|A_1)p(A_1) + \cdots + p(B|A_k)p(A_k) = \sum_{i=1}^{k} p(B|A_i)p(A_i)$$
The Bayes Theorem
The Bayes theorem is used to work out the probability that a given prior event occurred given
that a subsequent event has occurred.
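In symbols, combining the multiplication rule with the law of total probability gives p(Ai|B) = p(B|Ai)p(Ai) / [p(B|A1)p(A1) + ... + p(B|Ak)p(Ak)]. A minimal sketch follows; the three machines and all of the figures are hypothetical, chosen only to illustrate the calculation.

```python
from fractions import Fraction

# Hypothetical example: three machines A1, A2, A3 produce all items
# (mutually exclusive and exhaustive events); B = "item is defective".
prior = [Fraction(50, 100), Fraction(30, 100), Fraction(20, 100)]      # p(Ai)
likelihood = [Fraction(1, 100), Fraction(2, 100), Fraction(3, 100)]    # p(B|Ai)

# Law of total probability: p(B) = sum over i of p(B|Ai) p(Ai)
p_B = sum(l * a for l, a in zip(likelihood, prior))

# Bayes' theorem: p(Ai|B) = p(B|Ai) p(Ai) / p(B)
posterior = [l * a / p_B for l, a in zip(likelihood, prior)]

print(p_B)
print(posterior)
```

The posterior probabilities sum to 1 because the Ai are mutually exclusive and exhaustive.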
THE CONCEPT OF A RANDOM VARIABLE
A random variable is a way of mapping the outcomes of a random process to numbers (quantifying
the outcomes), e.g. let X be the outcome when you toss a coin,
$$X = \begin{cases} 1, & \text{if heads} \\ 0, & \text{if tails} \end{cases}$$
or let Y be the sum of the uppermost faces when two dice are rolled.
Once outcomes are quantified, you can do mathematics on them and write statements about
them in mathematical notation,
e.g. the probability that the sum of the uppermost faces is less than or equal to 12 is denoted
as p(Y ≤ 12).
Capital letters, X, are used to denote the random variables, while small letters x, denote a
particular value (or realized values) of the random variable.
Illustration
Consider a study where the objective is to estimate the average height of seedlings. Height
is a random variable, while 2.2 cm is a realized value of the random variable.
Random variables may be
i) Discrete – take particular values (values on a discrete scale)
ii) Continuous – take a given range of values (values in a given range)
Probability functions of random variables
A probability function of a random variable describes how total probability is distributed over
the various values that the random variable takes.
The probability function of a discrete random variable is termed its probability mass function
(pmf) while that of a continuous random variable is termed its probability density function (pdf).
The discrete case: probability mass function
x                   0     1     2     3
p(x) = p(X = x)     1/8   3/8   3/8   1/8
OR
$$p(x) = p(X = x) = \begin{cases} \frac{1}{8}, & x = 0, 3 \\ \frac{3}{8}, & x = 1, 2 \\ 0, & \text{otherwise} \end{cases}$$
The pmf satisfies two conditions namely:
i) p(x) ≥ 0.
ii) $\sum_{\forall x} p(x) = 1$.
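The two pmf conditions can be checked directly for the example pmf given above (1/8, 3/8, 3/8, 1/8 on x = 0, 1, 2, 3). This is a small sketch; the helper names `is_valid_pmf` and `prob` are made up for illustration.

```python
from fractions import Fraction

# pmf from the example above: p(0)=1/8, p(1)=3/8, p(2)=3/8, p(3)=1/8
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def is_valid_pmf(pmf):
    """Check condition i) p(x) >= 0 and condition ii) sum of p(x) = 1."""
    return all(px >= 0 for px in pmf.values()) and sum(pmf.values()) == 1

def prob(pmf, event):
    """p(X in event): add p(x) over the values x that satisfy the event."""
    return sum(px for x, px in pmf.items() if event(x))

valid = is_valid_pmf(pmf)
p_at_most_1 = prob(pmf, lambda x: x <= 1)   # p(X <= 1) = p(0) + p(1)
print(valid, p_at_most_1)
```

Exact fractions are used so that the sum-to-one check is not affected by floating-point rounding.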
The continuous case: probability density function
$$f(x) = \begin{cases} 3x^2, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases}$$
The pdf satisfies the following conditions:
i) f(x) ≥ 0.
ii) $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
iii) $p(a \le X \le b) = p(a < X < b) = \int_a^b f(x)\,dx$.
For a continuous random variable, p(X = x) = 0.
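As a rough numerical check of conditions ii) and iii) for the example pdf f(x) = 3x² on [0, 1], the sketch below uses a simple midpoint rule. The interval end points 0.2 and 0.5 and the helper `integrate` are illustrative choices, not part of the notes.

```python
def f(x):
    """pdf from the example above: f(x) = 3x^2 on [0, 1], 0 otherwise."""
    return 3 * x**2 if 0 <= x <= 1 else 0.0

def integrate(g, a, b, n=100_000):
    """Midpoint-rule approximation of the integral of g from a to b."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

# f is zero outside [0, 1], so integrating over [0, 1] covers the whole real line.
total = integrate(f, 0, 1)          # condition ii): should be close to 1
p_mid = integrate(f, 0.2, 0.5)      # condition iii): p(0.2 <= X <= 0.5)
print(round(total, 6), round(p_mid, 6))
```

The second value can be compared with the exact answer 0.5³ − 0.2³ = 0.117 obtained by integrating 3x² by hand.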
Cumulative distribution function
The cumulative distribution function (cdf) of a random variable X, also termed the distribution
function (df), is denoted F(x) = p(X ≤ x).
$$F(x) = \begin{cases} p(X \le x) = \sum_{t \le x} p(t), & X \text{ discrete} \\ \int_{-\infty}^{x} f(t)\,dt, & X \text{ continuous} \end{cases}$$
Properties of cumulative distribution functions
1) p(a < X ≤ b) = p(X ≤ b) − p(X ≤ a) = F(b) − F(a).
2) p(a ≤ X ≤ b) = p(a < X ≤ b) + p(X = a) = F(b) − F(a) + p(a).
3) p(a < X < b) = p(a < X ≤ b) − p(X = b) = F(b) − F(a) − p(b).
For continuous X,
$$f(x) = \frac{d}{dx}F(x) = F'(x)$$
Mode, median and quartiles of a continuous random variable
Given X is a continuous random variable with cdf F(x) then
1) Median, m, of X is given by F(m) = 0.5.
2) Lower quartile, Q1, is given by F(Q1) = 0.25.
3) Upper quartile, Q3, is given by F(Q3) = 0.75.
4) The mode is the value of the random variable where it is most dense, i.e. where the pdf
reaches its highest point. When this occurs at a maximum turning point,
$$\frac{df(x)}{dx} = 0; \quad \frac{d^2 f(x)}{dx^2} < 0$$
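When F(m) = 0.5 cannot be solved by hand, the quartile equations can be solved numerically. The sketch below assumes the example pdf f(x) = 3x² on [0, 1], whose cdf is F(x) = x³ there, and uses bisection; the function name `quantile` is a made-up helper.

```python
def F(x):
    """cdf of the pdf f(x) = 3x^2 on [0, 1]: F(x) = x^3 on that interval."""
    if x < 0:
        return 0.0
    if x > 1:
        return 1.0
    return x**3

def quantile(F, q, lo, hi, tol=1e-12):
    """Solve F(x) = q by bisection, assuming F is continuous and increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if F(mid) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

q1 = quantile(F, 0.25, 0, 1)        # F(Q1) = 0.25
median = quantile(F, 0.50, 0, 1)    # F(m) = 0.5
q3 = quantile(F, 0.75, 0, 1)        # F(Q3) = 0.75
print(q1, median, q3)
```

Here the answers can be confirmed by hand, since x³ = q gives x = q^(1/3). For this particular pdf the density increases over [0, 1], so its highest point is at the end point x = 1 rather than at a turning point.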
MOMENTS
Let g(x) denote any function of X, then the expected value of g(x) denoted E[g(x)] is
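$$E[g(x)] = \begin{cases} \sum_{\forall x} g(x)p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} g(x)f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$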
Special Case
If g(x) = x, then the expected value of X denoted E[X] is
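$$E[X] = \begin{cases} \sum_{\forall x} x\,p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} x f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$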
This special type of expectation is called the mean and is denoted by µ = E(X).
Properties of Expectations
1) E[kX] = kE[X],
2) E[X + k] = E[X] + k where k is a real number.
If g(x) = (x − µ)², where µ = E(X) is the mean of the random variable X, then the expected
value of (x − µ)², denoted E[(x − µ)²], is
$$E[(x-\mu)^2] = \begin{cases} \sum_{\forall x} (x-\mu)^2 p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$
This special type of expectation is called the variance and is denoted by σ² = Var(X).
Further, Var(X) = E[(x − µ)²] = E[X²] − (E[X])².
Properties of Variances
1) Var[kX] = k² Var[X],
2) Var[X + k] = Var[X] where k is a real number.
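The definitions of mean and variance, the shortcut Var(X) = E[X²] − (E[X])², and the two variance properties can be verified on the example pmf from the discrete case. This is only a sketch: the helpers `expect`, `variance` and `transform` and the constant k = 5 are illustrative assumptions.

```python
from fractions import Fraction

pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def expect(pmf, g=lambda x: x):
    """E[g(X)] = sum over x of g(x) p(x) for a discrete random variable."""
    return sum(g(x) * px for x, px in pmf.items())

def variance(pmf):
    """Var(X) = E[(X - mu)^2], computed from the definition."""
    mu = expect(pmf)
    return expect(pmf, lambda x: (x - mu) ** 2)

def transform(pmf, h):
    """pmf of h(X); probabilities of values that coincide are added."""
    out = {}
    for x, px in pmf.items():
        out[h(x)] = out.get(h(x), 0) + px
    return out

mu = expect(pmf)
var = variance(pmf)
shortcut = expect(pmf, lambda x: x**2) - mu**2          # E[X^2] - (E[X])^2
k = 5                                                   # arbitrary constant
var_scaled = variance(transform(pmf, lambda x: k * x))  # Var[kX]
var_shifted = variance(transform(pmf, lambda x: x + k)) # Var[X + k]
print(mu, var, shortcut, var_scaled, var_shifted)
```

Scaling multiplies the variance by k², while shifting leaves it unchanged, in line with the two properties listed above.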
Moments about the origin
Consider the random variable X and let g(x) = $x^r$, r > 0, then the expected value of g(x),
denoted E[g(x)] = E[$x^r$], is
$$E[x^r] = \begin{cases} \sum_{\forall x} x^r p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} x^r f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$
E[$x^r$] is termed the rth moment of the random variable X about the origin.
The mean µ = E[X] is the 1st moment of the random variable X about the origin.
The variance of the random variable X is σ² = Var(X) = E[(X − µ)²] = E[X²] − (E[X])²,
hence the [2nd moment about the origin] – [1st moment about the origin] squared.
Moments about the mean
Consider the random variable X with mean µ = E[X] and let g(x) = $(x-\mu)^r$, r > 0, then the
expected value of g(x), denoted E[g(x)] = E[$(x-\mu)^r$], is
$$E[(x-\mu)^r] = \begin{cases} \sum_{\forall x} (x-\mu)^r p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} (x-\mu)^r f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$
E[$(x-\mu)^r$] is termed the rth moment of the random variable X about the mean, µ.
When r = 2, E[(x − µ)²] = Var(X). Hence the variance is the 2nd moment about the mean.
When r = 1, E[(x − µ)] = E[X] − µ = µ − µ = 0. This implies the 1st moment of a random
variable X about its mean, µ, is 0.
Moment Generating Functions
The moment generating function of a random variable X, is used to generate its moments about
the origin. It is denoted by:
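$$M_X(t) = E[e^{tX}] = \begin{cases} \sum_{\forall x} e^{tx} p(x), & \text{for } X \text{ discrete} \\ \int_{-\infty}^{\infty} e^{tx} f(x)\,dx, & \text{for } X \text{ continuous} \end{cases}$$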
where t is a constant.
Mean and Variance using Moment Generating Functions
$E[X] = M_X'(0)$, i.e. differentiate the moment generating function once with respect to t and let
t = 0.
$Var(X) = M_X''(0) - [M_X'(0)]^2$.
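To see these two formulas at work without differentiating by hand, the sketch below builds the moment generating function of the example pmf from the discrete case and approximates its derivatives at t = 0 with central differences. The step size h and the helper names are assumptions made for illustration; exact differentiation is what the notes intend.

```python
import math

pmf = {0: 1/8, 1: 3/8, 2: 3/8, 3: 1/8}   # example pmf from the discrete case

def mgf(t):
    """M_X(t) = E[e^{tX}] = sum over x of e^{tx} p(x)."""
    return sum(math.exp(t * x) * px for x, px in pmf.items())

def first_derivative(M, t, h=1e-4):
    """Central-difference approximation of M'(t)."""
    return (M(t + h) - M(t - h)) / (2 * h)

def second_derivative(M, t, h=1e-4):
    """Central-difference approximation of M''(t)."""
    return (M(t + h) - 2 * M(t) + M(t - h)) / h**2

mean = first_derivative(mgf, 0.0)                 # E[X] = M'(0)
var = second_derivative(mgf, 0.0) - mean**2       # Var(X) = M''(0) - [M'(0)]^2
print(round(mean, 4), round(var, 4))
```

The results can be compared with the values 3/2 and 3/4 obtained directly from the pmf.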
Theorems of Moment Generating Functions
Given a random variable X,
1) $M_{cX}(t) = M_X(ct)$, where c is a constant.
2) $M_{X_1+X_2+\cdots+X_n}(t) = M_{X_1}(t)M_{X_2}(t)\cdots M_{X_n}(t)$, where the $X_i$'s are independent random variables.
3) $M_U(t) = e^{-\frac{at}{h}} M_X\left(\frac{t}{h}\right)$, where $U = \frac{X-a}{h}$ and a and h are constants.
4) Moment generating functions are unique to given distributions.
STANDARD DISCRETE DISTRIBUTIONS
Uniform Distribution
Conditions for a discrete uniform random variable X are:
i) X is defined over a set of n distinct values.
ii) Each value is equally likely
hence
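$$p(x) = p(X = x) = \begin{cases} \frac{1}{n}, & \text{for each } x \\ 0, & \text{otherwise} \end{cases}$$
When the n values are 1, 2, ..., n, it has properties $E[X] = \frac{n+1}{2}$ and $Var[X] = \frac{n^2-1}{12}$.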
Bernoulli Distribution
Conditions for a Bernoulli distribution include
1) A single trial, termed a Bernoulli trial.
2) The trial has two possible outcomes, termed a success and a failure:
$$X = \begin{cases} 1, & \text{if success} \\ 0, & \text{if failure} \end{cases}$$
3) p = p(success) and q = p(failure), where q = 1 − p.
The random variable X indicates a success and its probability mass function is given by
$$p(x) = p(X = x) = \begin{cases} p^x (1-p)^{1-x}, & x = 0, 1 \\ 0, & \text{otherwise} \end{cases}$$
denoted X ∼ B(p).
It has properties E[X] = p, Var[X] = pq and $M_X(t) = q + pe^t$.
Binomial Distribution
Conditions required for a Binomial distribution
1) A fixed number, n, of independent trials.
2) Each trial has two possible outcomes, technically termed a 'success' and a 'failure' (Bernoulli
trials).
3) The probability of success, p, in each trial is constant.
The random variable X, which denotes the number of successes in n trials, has a binomial
distribution. Its probability mass function is given by
$$p(x) = p(X = x) = \begin{cases} \binom{n}{x} p^x (1-p)^{n-x}, & x = 0, 1, 2, \ldots, n \\ 0, & \text{otherwise} \end{cases}$$
denoted X ∼ B(n, p).
It has properties E[X] = np, Var[X] = npq and $M_X(t) = (q + pe^t)^n$, where p = p(success),
q = p(failure) and q = 1 − p.
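The binomial pmf and its stated mean and variance can be checked with a few lines of code. This is a minimal sketch using the standard library function `math.comb`; the parameter values n = 10 and p = 0.3 are arbitrary illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, p):
    """p(X = x) for X ~ B(n, p); 0 outside x = 0, 1, ..., n."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 10, 0.3                      # illustrative parameters
probs = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(probs)                                        # should be 1
mean = sum(x * px for x, px in enumerate(probs))          # compare with np
var = sum((x - mean) ** 2 * px for x, px in enumerate(probs))   # compare with npq
print(round(total, 10), round(mean, 10), round(var, 10))
```

With these parameters np = 3 and npq = 2.1, which is what the sums should reproduce up to rounding.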
Poisson Distribution
The Poisson random variable X represents the number of events that occur in an interval. The
interval may be a fixed length in time or space. The events must occur:
i) singly in space or time;
ii) independently of each other;
iii) at a constant rate in the sense that the mean number of occurrences in an interval is
proportional to the length of the interval.
Such events are said to occur randomly.
The probability mass function of X is given by
$$p(x) = p(X = x) = \begin{cases} \frac{\lambda^x e^{-\lambda}}{x!}, & x = 0, 1, 2, \ldots \\ 0, & \text{otherwise} \end{cases}$$
denoted X ∼ Po(λ).
It has properties E[X] = λ, Var[X] = λ and $M_X(t) = e^{\lambda(e^t - 1)}$.
A limiting form of the Binomial Distribution
The Poisson distribution can be used as a limiting form of the binomial distribution, i.e. when
n, the number of trials, is large and p, the probability of success, is small (a rare event).
If X ∼ B(n, p) with n large and p small, then X can be approximated by a Po(λ) distribution where
λ = np.
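The quality of this approximation can be inspected by tabulating both pmfs side by side. The sketch below assumes n = 1000 and p = 0.002 (so λ = 2); these values are illustrative only.

```python
from math import comb, exp, factorial

def binomial_pmf(x, n, p):
    """p(X = x) for X ~ B(n, p), for integer x with 0 <= x <= n."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

def poisson_pmf(x, lam):
    """p(X = x) = lam^x e^{-lam} / x! for x = 0, 1, 2, ..."""
    return lam**x * exp(-lam) / factorial(x)

n, p = 1000, 0.002          # illustrative: many trials, rare success
lam = n * p                 # lambda = np

# Largest pointwise difference between the two pmfs over x = 0, ..., 10.
max_gap = max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(11))

for x in range(5):
    print(x, round(binomial_pmf(x, n, p), 5), round(poisson_pmf(x, lam), 5))
print(max_gap)
```

Repeating the comparison with a smaller n or a larger p shows the gap widening, which is why the approximation is reserved for large n and small p.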
Geometric Distribution
The geometric distribution models discrete waiting time. The random variable X denotes the
number of failures before the first success in a sequence of independent Bernoulli trials. Its
probability function is given by
$$p(x) = p(X = x) = \begin{cases} q^x p, & x = 0, 1, 2, \ldots \\ 0, & \text{otherwise} \end{cases}$$
where p = p(success), q = p(failure) and q = 1 − p.
It has properties $E[X] = \frac{q}{p}$, $Var[X] = \frac{q}{p^2}$ and $M_X(t) = \frac{p}{1 - qe^t}$.
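A truncated sum gives a quick check of the stated mean and variance. In this sketch p = 0.25 and the cut-off at 2000 terms are assumptions chosen so that the neglected tail is negligible.

```python
def geometric_pmf(x, p):
    """p(X = x) = q^x p for x = 0, 1, 2, ...: x failures, then the first success."""
    return (1 - p) ** x * p if x >= 0 else 0.0

p = 0.25                    # illustrative success probability
q = 1 - p
xs = range(2000)            # truncation of the infinite support

total = sum(geometric_pmf(x, p) for x in xs)
mean = sum(x * geometric_pmf(x, p) for x in xs)                  # compare with q/p
var = sum((x - mean) ** 2 * geometric_pmf(x, p) for x in xs)     # compare with q/p^2
print(round(total, 8), round(mean, 8), round(var, 8))
```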
Hypergeometric Distribution
The hypergeometric distribution is a discrete distribution that models the number of events in
a fixed sample size when you know the total number of items in the population that the sample
is from. Each item in the sample has two possible outcomes (either an event or a nonevent).
The hypergeometric distribution is used under the following conditions:
1) Total number of items (population), M, is fixed; with k of a certain type.
2) Sample size (number of trials), n, is a portion of the population; n items are drawn without
replacement.
3) Probability of success changes after each trial.
4) The random variable X denotes the number of successes.
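We note that the chosen group contains x successes and (n − x) failures. The x successes can be
picked from the k items of the certain type in $\binom{k}{x}$ ways, the (n − x) failures from the
remaining M − k items in $\binom{M-k}{n-x}$ ways, and the whole sample of n from the M items in $\binom{M}{n}$ ways.
The random variable X is said to have a hypergeometric distribution and its probability mass
function is given by
$$p(x) = p(X = x) = \begin{cases} \dfrac{\binom{k}{x}\binom{M-k}{n-x}}{\binom{M}{n}}, & \max(0, n - M + k) \le x \le \min(n, k) \\ 0, & \text{otherwise} \end{cases}$$
It has properties $E[X] = \frac{kn}{M}$ and $Var[X] = \frac{kn}{M} \cdot \frac{(M-k)(M-n)}{M(M-1)}$.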
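The counting argument translates directly into code. The sketch below uses exact fractions and `math.comb`; the values M = 20, k = 7 and n = 5 are hypothetical and serve only to check the mean and variance formulas.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, M, k, n):
    """p(X = x): x successes in n draws without replacement from M items, k of them successes."""
    if x < 0 or x > n:
        return Fraction(0)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible values of x automatically get probability 0.
    return Fraction(comb(k, x) * comb(M - k, n - x), comb(M, n))

M, k, n = 20, 7, 5          # illustrative: 20 items, 7 of the type of interest, sample of 5
probs = {x: hypergeom_pmf(x, M, k, n) for x in range(n + 1)}

total = sum(probs.values())
mean = sum(x * px for x, px in probs.items())
var = sum((x - mean) ** 2 * px for x, px in probs.items())
print(total, mean, var)
```

The printed mean and variance can be compared with kn/M and (kn/M) × (M − k)(M − n) / [M(M − 1)].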
Negative Binomial Distribution
In a sequence of independent Bernoulli(p) trials, let the random variable X denote the trial at
which the rth success occurs, where r is a fixed integer. Then
$$p(X = x|r, p) = \begin{cases} \binom{x-1}{r-1} p^r (1-p)^{x-r}, & x = r, r+1, \ldots \\ 0, & \text{otherwise} \end{cases}$$
and we say that X has a negative binomial(r, p) distribution.
The negative binomial distribution is sometimes defined in terms of the random variable Y =
number of failures before rth success. This formulation is statistically equivalent to the one
given above in terms of X = trial at which the rth success occurs, since Y = X − r. The
alternative form of the negative binomial distribution is
$$p(Y = y) = \begin{cases} \binom{r+y-1}{y} p^r (1-p)^y, & y = 0, 1, \ldots \\ 0, & \text{otherwise} \end{cases}$$
It has properties $E[Y] = \frac{r(1-p)}{p}$ and $Var[Y] = \frac{r(1-p)}{p^2}$.
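The equivalence of the two formulations (Y = X − r) and the stated moments of Y can be checked numerically. In this sketch r = 3, p = 0.4 and the truncation at 400 terms are illustrative assumptions, and the function names are made up.

```python
from math import comb

def nb_trials_pmf(x, r, p):
    """p(X = x): the r-th success occurs on trial x (x = r, r+1, ...)."""
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r) if x >= r else 0.0

def nb_failures_pmf(y, r, p):
    """p(Y = y): y failures occur before the r-th success (y = 0, 1, ...)."""
    return comb(r + y - 1, y) * p**r * (1 - p) ** y if y >= 0 else 0.0

r, p = 3, 0.4               # illustrative parameters
ys = range(400)             # truncation of the infinite support

# Since Y = X - r, p(Y = y) should equal p(X = y + r) for every y.
same = all(abs(nb_failures_pmf(y, r, p) - nb_trials_pmf(y + r, r, p)) < 1e-15 for y in ys)

mean_y = sum(y * nb_failures_pmf(y, r, p) for y in ys)                   # r(1-p)/p
var_y = sum((y - mean_y) ** 2 * nb_failures_pmf(y, r, p) for y in ys)    # r(1-p)/p^2
print(same, round(mean_y, 8), round(var_y, 8))
```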
STANDARD CONTINUOUS DISTRIBUTIONS
Rectangular Distribution
Models a random variable where probability is constant (or the same) over a given interval. Its
probability density function is given by
$$f(x) = \begin{cases} k, & a \le x \le b \\ 0, & \text{elsewhere} \end{cases}$$
This implies $k = \frac{1}{b-a}$.
It has properties $F(x) = \frac{x-a}{b-a}$ for a ≤ x ≤ b, $E[X] = \frac{a+b}{2}$ and $Var[X] = \frac{(b-a)^2}{12}$.
Exponential Distribution
1. Used to model the behaviour of probabilities that reflect a large number of small values
and a small number of large values.
2. It is often concerned with the amount of time until some specific event occurs. It models
the length of time between Poisson happenings.
Its probability density function is given by:
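$$f(x) = \begin{cases} \lambda e^{-\lambda x}, & x > 0 \\ 0, & \text{elsewhere} \end{cases}$$
denoted X ∼ Exp(λ). λ is the decay parameter, i.e. it controls the rate of decay or decline, and
$\lambda = \frac{1}{\mu}$, where µ = E[X] is the mean.
It has properties $F(x) = 1 - e^{-\lambda x}$ for x > 0, $E[X] = \frac{1}{\lambda}$, $Var[X] = \frac{1}{\lambda^2}$ and $M_X(t) = \frac{\lambda}{\lambda - t}$ for t < λ.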
Normal Distribution
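Used to model continuous random variables which have a symmetric distribution. Its probability
density function is given by:
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}, \quad -\infty < x < \infty, \; -\infty < \mu < \infty, \; \sigma^2 > 0$$
denoted X ∼ N(µ, σ²).
It has properties
$$F(x) = p(X \le x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{u-\mu}{\sigma}\right)^2}\,du$$
E(X) = µ
Var(X) = σ²
$$M_X(t) = e^{\frac{1}{2}(t^2\sigma^2 + 2\mu t)}$$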
Any normal variable X ∼ N(µ, σ²) can be transformed to a standard normal variable Z ∼ N(0, 1)
by the formula
$$Z = \frac{X-\mu}{\sigma}$$
This is called standardizing the normal random variable.
The probability density function of the standard normal variable is given by:
$$f(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{z^2}{2}}, \quad -\infty < z < \infty$$
denoted Z ∼ N(0, 1). It has properties
$$F(z) = \Phi(z) = p(Z \le z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{u^2}{2}}\,du$$
E(Z) = 0
Var(Z) = 1
$$M_Z(t) = e^{\frac{t^2}{2}}$$
The areas under the standard normal curve are given in the standard normal tables.
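In place of tables, Φ(z) can be evaluated with the error function, using the identity Φ(z) = ½[1 + erf(z/√2)]. The sketch below standardizes a normal variable and looks up the resulting area; the parameters µ = 50 and σ = 10 are illustrative assumptions.

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cdf, Phi(z) = p(Z <= z), via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def normal_cdf(x, mu, sigma):
    """p(X <= x) for X ~ N(mu, sigma^2), found by standardizing: Z = (X - mu) / sigma."""
    return phi((x - mu) / sigma)

mu, sigma = 50, 10                              # illustrative parameters
p_below_60 = normal_cdf(60, mu, sigma)          # equals Phi(1)
p_between = normal_cdf(60, mu, sigma) - normal_cdf(40, mu, sigma)   # p(40 < X <= 60)
print(round(p_below_60, 4), round(p_between, 4))
```

The values agree with the usual table entries Φ(1) ≈ 0.8413 and p(−1 < Z < 1) ≈ 0.6827.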
Normal Approximation to Binomial and Poisson
The normal distribution provides a simple approximation to the binomial and Poisson
distributions that is accurate when n or λ is large. The normal distribution is continuous, so
p(X = x) = 0 for every single value x, while the binomial and Poisson distributions are discrete.
We therefore use a continuity correction.
Using a continuity correction
We first write the probability using ≤ or ≥. With the discrete variable X approximated by the
normal variable Y, we approximate
1) p(X ≤ n) by p(Y < n + 0.5).
2) p(X ≥ n) by p(Y > n − 0.5).
Approximating a Binomial distribution
For a binomial distribution, there are two possible approximations, depending upon whether p
lies close to 0.5 (in which case a normal approximation is used) or p is small (in which case a
Poisson distribution is used).
If X ∼ B(n, p) and n is large and p is close to 0.5 then X can be approximated by
Y ∼ N(np, np(1 − p)).
If you are approximating a binomial distribution by a normal distribution you should go
directly to the normal distribution, not via a Poisson distribution, as this involves one
approximation rather than two and should therefore be more accurate.
If you are in doubt over which approximation is appropriate, a useful rule of thumb is to
calculate the mean np: if this is less than or equal to 10 use the Poisson approximation; if the
mean is more than 10 then a normal approximation is usually suitable.
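The effect of the continuity correction can be seen by comparing an exact binomial probability with its normal approximation. This sketch assumes X ∼ B(100, 0.5) and the probability p(X ≤ 55); both are illustrative choices.

```python
from math import comb, erf, sqrt

def phi(z):
    """Standard normal cdf."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def binomial_cdf(m, n, p):
    """Exact p(X <= m) for X ~ B(n, p)."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(m + 1))

n, p, m = 100, 0.5, 55                  # illustrative values
mu, sigma = n * p, sqrt(n * p * (1 - p))    # Y ~ N(np, np(1 - p))

exact = binomial_cdf(m, n, p)                    # p(X <= 55)
with_cc = phi((m + 0.5 - mu) / sigma)            # p(Y < 55.5), with continuity correction
without_cc = phi((m - mu) / sigma)               # p(Y < 55), without correction
print(round(exact, 4), round(with_cc, 4), round(without_cc, 4))
```

In this example the corrected value lies much closer to the exact probability than the uncorrected one.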
Approximating a Poisson distribution by a normal distribution
If X ∼ Po (λ) and λ is large then X can be approximated by Y ∼ N(λ, λ).