
Probability and Statistics for CS Students

This document outlines the course structure for 'Probability and Statistics for Computer Science' offered by the University of Delhi, detailing its components, evaluation scheme, and content coverage. The course includes theoretical and practical aspects, with a focus on foundational concepts in probability and statistics relevant to Artificial Intelligence and Machine Learning. It emphasizes the importance of these subjects in understanding uncertainty and making data-driven decisions.

Uploaded by

Mayank Goyal

Probability and Statistics for Computer Science

Rekha R
Assistant Professor
Department of Computer Engineering
Monsoon 2025

Abstract
This lecture note is based on the [Link] syllabus of the University of
Delhi for the course offered for Minors/Specializations by the Computer
Science and Engineering Department under the Faculty of Technology in
the third semester. The note is largely based on the textbook
Probability and Statistics for Engineering and the Sciences by Jay
Devore, 9th edition. This four-credit course has a lecture component of
3 credits and a practical component of 1 credit, and the evaluation
scheme covers both. The End Term Theory Exam carries 90 marks and is
conducted over 3 hours. In addition, Internal Assessment (IA)
contributes 30 marks, so the theory component (End Term Exam + IA)
totals 120 marks. The practical component has a Continuous Assessment
(CA) of 10 marks, while the End Term Practical Exam carries 20 marks and
the viva voce carries 10 marks, for a practical total of 40 marks. The
End Term Practical Exam and viva voce shall be conducted by an external
examiner. The grand total for the course is therefore 160 marks.
For the IA of 30 marks, 6 marks shall be for attendance, 12 marks for
the mid-semester test, and 12 marks for assignments/quizzes/presentations
(to be announced). The six marks for attendance shall be distributed as
follows:

Attendance Range                    Marks Awarded
More than 67% but less than 70%     1.2 marks
More than 70% but less than 75%     2.4 marks
More than 75% but less than 80%     3.6 marks
More than 80% but less than 85%     4.8 marks
More than 85%                       6.0 marks

Contents

1 Introduction

2 Probability Theory
2.1 Sample Spaces and Events
2.2 Axioms, Interpretations, and Properties of Probability
2.3 Interpreting Probability
2.4 Probability Properties
2.5 Conditional Probability

3 Random Variables
3.1 Discrete Random Variable
3.2 Probability Distribution
3.3 Expected Value
3.4 Variance

4 Probability Distribution on a Single RV
4.1 Binomial Distribution
4.1.1 Expected Value and Variance
4.2 Poisson Distribution
4.2.1 Expected Value and Variance

5 Continuous Random Variable
5.1 Probability Distribution
5.2 Expected Value
5.3 Variance
5.4 Normal Distribution
5.5 Moments of a Random Variable
5.5.1 Moment Generating Function (MGF)

6 Joint Probability Distributions
6.1 Two Discrete Random Variables
6.2 Two Continuous Random Variables
6.3 Conditional Distribution
6.4 Expected Values
6.5 Covariance
6.6 Bivariate Normal Distribution

7 Bridging Probability and Statistics
7.1 Random Samples
7.2 Law of Large Numbers (LLN)
7.3 Central Limit Theorem

8 Statistics
8.1 Point Estimation
8.2 Unbiased Estimators
8.2.1 Sample Mean
8.2.2 Sample Variance
8.2.3 Sample Standard Deviation
8.2.4 Estimators with Minimum Variance

9 Interval Estimation
9.1 Confidence Interval for the Mean
9.2 General Framework for Confidence Intervals
9.3 Large-Sample Confidence Intervals for Population Mean
9.4 Intervals Based on a Normal Population Distribution
9.4.1 Properties of t-Distribution
9.5 Confidence Intervals for the Variance and Standard Deviation of a Normal Population

10 Hypothesis Testing
10.1 Hypothesis Testing
10.2 Form of Hypotheses in Testing
10.3 Test Procedure
10.4 General Form of a Test Statistic
10.5 z Tests for Hypothesis about a Population Mean
10.5.1 A Normal Population Distribution with known σ
10.5.2 Large Sample Size
10.6 The One-Sample t Test
10.7 Errors in Hypothesis Testing
10.8 Chi-Squared Test
10.8.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified
10.8.2 Goodness-of-Fit Test When the P_i's Are Functions of Other Parameters
10.8.3 Two-Way Contingency Tables

1 Introduction
A Conceptual Journey from Theory to Practice. As Artificial Intelligence
and Machine Learning (AIML) continue to reshape industries and redefine
what is possible with data, students preparing for this field must develop
a strong foundation in Probability and Statistics. These are not just academic
prerequisites – they are the language of uncertainty, the engine behind models,
and the bridge between data and decisions.
You might have observed that the course begins with ‘Probability’ before
‘Statistics’. The reason is that probability is foundational, while statistics
builds upon it. Probability begins with a known or assumed model and calculates
the likelihood of outcomes. Example: Given a fair die, what is the chance of
rolling a 6? Statistics, on the other hand, starts with observed data and tries
to infer or test properties of the model. Example: Given 100 coin flips with 58
heads, is the coin fair? Probability lays the theoretical foundation. Statistics
builds on it to interpret data.

Unit 1. This unit introduces you to some basic probability concepts and
terminology. Many AIML models simulate or reason under uncertainty. Bayes’
Theorem finds application in classifiers, probability distributions in
generative models, and expectation and variance in model evaluation.

Unit 2. Probability theory works numerically, not symbolically. Random
variables attach numbers to outcomes so that probabilities can be expressed
and computed. They find application in joint distributions in Bayesian
networks, hypothesis testing, and sampling and the CLT for model evaluation.

Unit 3. Learning statistical methods is essential for inference, evaluation,
and model interpretation. Regression analysis forms the basis of supervised
learning.

Unit 4. The course concludes with some real-world applications of statistical
tools in computing. Monte Carlo methods are used in reinforcement learning and
simulations. Infrastructure performance can be better explained with Queueing
Theory. Markov chains are a foundation of many NLP and recommendation
systems.
Every time you train a model, test a hypothesis, or evaluate an algorithm in
AIML, you rely on the principles of Probability and Statistics. The former helps
you understand how systems might behave; the latter helps you learn from what
has actually occurred. Together, they provide the language of uncertainty, tools
for reasoning under risk, and methods to model, simulate, test, and improve
intelligent systems.

2 Probability Theory
Probability theory is the branch of mathematics that deals with uncertainty.
It provides a formal framework for reasoning about randomness, likelihood,
and chance events. We might have encountered the following expressions in
real life: “The odds favor Team India winning the match.”, “There is a 50–50
chance that the exam will be postponed.”, “It’s likely that the new software
update will fix the bug.”, etc. These are all statements involving uncertainty
– an acknowledgment that the outcome is not yet known, but we have some
intuition or evidence about how likely it is. In this unit, we introduce a precise
mathematical framework that allows us to convert statements like “There is
a 50% chance of rain tomorrow” into exact probabilistic models that can be
analyzed, simulated, or embedded into intelligent systems.

Origins. The emergence of probability as a useful science is primarily
attributed to Blaise Pascal and Pierre de Fermat in the 17th century. The
impetus came from a gambler, the Chevalier de Méré, who in 1654 got into a
dispute concerning the division of a stake between two players whose game was
interrupted before its close. De Méré consulted Blaise Pascal, who shared his
thoughts with Pierre de Fermat. A related question concerned the number of
throws of two dice needed to have a favourable chance of obtaining a double
six. The correspondence between them laid the fundamental groundwork of
probability theory.

2.1 Sample Spaces and Events

An operation which can produce some well-defined outcomes is called an
experiment. For example, tossing a coin once or several times, selecting a
card or cards from a deck, picking a 4-character password using lowercase
letters, etc. Each possible result of an experiment is called an outcome; an
event, defined below, is a set of outcomes. An experiment in which all
possible outcomes are known but the exact outcome cannot be predicted in
advance is called a random experiment. For example, when we toss a coin
we know the possible outcomes are ‘head’ and ‘tail’. But, if we toss a coin at
random, we cannot predict in advance whether its upper face will show a head
or a tail. Performing a random experiment is also referred to as a trial.
The sample space of an experiment, denoted by S, is the set of all possible
outcomes of that experiment. The sample space in the simple coin toss
experiment is S = {H, T}, where H stands for the appearance of ‘head’ and T
for ‘tail’. The sample space when the coin is tossed twice is
S = {HH, HT, TH, TT}.

Example 2.1. A student has 3 favorite songs – A, B, and C – in a playlist. The
music player randomly shuffles and queues them without repeats. What are all
the possible ways the songs could be arranged?

S = {ABC, ACB, BAC, BCA, CAB, CBA}
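
The two sample spaces discussed so far can also be listed mechanically. The following is a minimal sketch using Python's standard `itertools` module; the variable names are illustrative and not part of the course material.

```python
from itertools import permutations, product

# Two coin tosses: every ordered pair of H/T.
two_tosses = ["".join(p) for p in product("HT", repeat=2)]

# Example 2.1: every ordering of the three songs without repeats.
playlist_orders = ["".join(p) for p in permutations("ABC")]

print(two_tosses)       # ['HH', 'HT', 'TH', 'TT']
print(playlist_orders)  # ['ABC', 'ACB', 'BAC', 'BCA', 'CAB', 'CBA']
```

Tossing with repetition allowed corresponds to `product`, while arranging without repeats corresponds to `permutations`.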

An event is a subset of outcomes contained in the sample space. When an
experiment is performed, a particular event A is said to occur if the resulting
experimental outcome is contained in A.

Example 2.2. In throwing three coins simultaneously with sample space

S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

• {HHH, TTT} is the event that all three coins show the same face

• {HHH, HHT, HTH, THH} is the event of getting at least two heads

• {HHH, HHT, HTH, THH} is the event of getting at most one tail
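
Since an event is just a subset of the sample space, the three events of Example 2.2 can be built by filtering the outcomes. This is a small sketch with illustrative names, assuming outcomes are written as strings of H and T.

```python
from itertools import product

# Sample space for tossing three coins.
S = {"".join(p) for p in product("HT", repeat=3)}

all_same = {s for s in S if len(set(s)) == 1}
at_least_two_heads = {s for s in S if s.count("H") >= 2}
at_most_one_tail = {s for s in S if s.count("T") <= 1}

print(sorted(all_same))            # ['HHH', 'TTT']
print(sorted(at_least_two_heads))  # ['HHH', 'HHT', 'HTH', 'THH']
# Two different verbal descriptions can pick out the same subset of S.
print(at_least_two_heads == at_most_one_tail)  # True
```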

Disjoint events (also called mutually exclusive events) are events that
cannot happen at the same time. Let A and B be two events in a sample space
S. They are disjoint if
A ∩ B = ∅
That is, no outcome is common between the two events.

Example 2.3. Consider an experiment of selecting a student from the Faculty
of Technology. Let A be the event that the student picked is from the CSE
department and B be the event that the student picked is from the EEE
department. Assuming each student belongs to exactly one department, the two
events are disjoint.

Example 2.4. Consider another example where the sample space is the set of
integers from 1 to 10. Let E be the event ‘the number is even’ and M be the
event ‘the number is a multiple of 3’. These events are not disjoint since 6
is common to both.

2.2 Axioms, Interpretations, and Properties of Probability

Probability Approaches. Following are some approaches to defining
probability:
1. Classical Probability
2. Frequentist Probability
3. Axiomatic Probability
The Classical approach assumes that all the outcomes are equally likely. If
our event of interest E can happen in n ways out of a total of N ways, the
probability of the event, denoted P(E), is defined as
\[ P(E) = \frac{n}{N} \]
The Frequentist approach does not assume that all the outcomes are equally
likely. Instead, we repeat the experiment a large number of times, say N_e,
and observe how many times the particular event E occurred, say n. The
probability is defined as
\[ P(E) = \lim_{N_e \to \infty} \frac{n}{N_e} \]
The Axiomatic approach treats probability as a function defined on events.
It assumes that probability is a real-valued function whose domain is the set
of events and whose range lies between 0 and 1 (both inclusive). Further, the
assignment of real values to events should satisfy the following axioms of
probability, formulated by Andrey Kolmogorov in 1933.
1. Non-negativity. For any event A, P(A) ≥ 0. Probabilities can never be
negative.
2. Truth. P(S) = 1. Something in the sample space must happen.
3. Countable Additivity. If A_1, A_2, A_3, . . . is an infinite collection of
disjoint events, then
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \]
If no two of the events can occur simultaneously, then the probability that
at least one of them occurs is the sum of the probabilities of the
individual events.
Let us now understand why the condition 0 ≤ P(A) ≤ 1 is not added as a
separate axiom. This is because it follows from the above three axioms.
Theorem. For any event A ⊆ S, where S is a countable sample space and P
is a probability measure satisfying the Kolmogorov axioms, we have
\[ 0 \le P(A) \le 1 \]
Proof. By the truth axiom, the sample space S satisfies P(S) = 1. For any
event A ⊆ S, its complement A^c = S \ A is also an event.
Since A ∩ A^c = ∅ and A ∪ A^c = S, we apply the countable additivity axiom
to the two disjoint events A and A^c (its validity for a finite collection is
shown in the proposition that follows):
\[ P(S) = P(A \cup A^c) = P(A) + P(A^c) \]
Substituting P(S) = 1, we get
\[ P(A) = 1 - P(A^c) \]
From the non-negativity axiom, we have P(A) ≥ 0 and P(A^c) ≥ 0, which
implies
\[ P(A) \le 1 \]
Combining both bounds, we get 0 ≤ P(A) ≤ 1.
Proposition. P(∅) = 0, where ∅ is the null event (the event containing no
outcomes whatsoever). Also, the countable additivity property is valid for a
finite collection of disjoint events.
Proof. Part 1. First, consider the infinite collection A_1 = ∅, A_2 = ∅,
A_3 = ∅, . . .. Since A_i ∩ A_j = ∅ ∩ ∅ = ∅ for any i, j ≥ 1, the events in
this collection are disjoint, and their union is
\[ \bigcup_{i=1}^{\infty} A_i = \emptyset \]
By the countable additivity axiom, we have
\[ P(\emptyset) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{\infty} P(\emptyset) \]
Thus
\[ P(\emptyset) = \sum_{i=1}^{\infty} P(\emptyset) \]
This is only possible if P(∅) = 0, since otherwise the right-hand side diverges.

Part 2. Let A_1, A_2, . . . , A_k be disjoint events. Append to this the
infinite collection A_{k+1} = ∅, A_{k+2} = ∅, . . .. Then
\[ \bigcup_{i=1}^{\infty} A_i = \left(\bigcup_{i=1}^{k} A_i\right) \cup \left(\bigcup_{i=k+1}^{\infty} A_i\right) = \left(\bigcup_{i=1}^{k} A_i\right) \cup \emptyset = \bigcup_{i=1}^{k} A_i \]
By the countable additivity axiom
\[ P\left(\bigcup_{i=1}^{k} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) \]
since P(A_i) = 0, from Part 1, for all i > k.
This confirms that the axiom also applies to finite disjoint collections.

2.3 Interpreting Probability

The axioms do not completely determine an assignment of probabilities to
events. Instead, they serve only to rule out assignments inconsistent with our
intuitive notions of probability. Let us now understand some interpretations of
probability.

1. Classical Probability
2. Logical/Evidential Probability
3. Subjective Probability
4. Frequency Interpretations

Classical Probability, championed by de Moivre and Laplace, assigns
probabilities in the absence of any evidence, or in the presence of
symmetrically balanced evidence. The guiding idea is that in such
circumstances, probability is shared equally among all the possible outcomes,
so that the classical probability of an event is simply the fraction of the
total number of possibilities in which the event occurs.
\[ P(A) = \frac{\text{Number of favourable outcomes}}{\text{Total number of equally likely outcomes}} \]

For example, if a fair die is rolled and A is the event of getting a 4, then
P(A) = 1/6.

Logical Probability views probability as a logical relation between
propositions, where probabilities are determined by the strength of the
evidence supporting the event. For example, given that “All swans observed so
far are white,” what is the probability that the next swan is white?

Subjective Probability is a person’s degree of belief or confidence in an
event. Thus, we really have many interpretations of probability here – as many
as there are suitable agents deciding on it. Different people can assign different
probabilities to the same event. For example, “I believe there is a 70% chance
it will rain tomorrow.” This reflects personal belief based on experience or
intuition.

Frequency Interpretations define probability as the long-run relative
frequency of an event in repeated trials.
\[ P(A) = \lim_{n \to \infty} \frac{\text{Number of times } A \text{ occurs in } n \text{ trials}}{n} \]
For any finite number of trials, the relative frequency depends on the
outcomes actually observed. The premise of this interpretation is that the
relative frequency of an outcome approaches its theoretical probability as
the number of trials becomes large.
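
The long-run behaviour of the relative frequency can be explored with a short simulation. The sketch below assumes a fair six-sided die and uses Python's standard `random` module with a fixed seed so that runs are repeatable; the function name and the choice of target face are illustrative.

```python
import random

def relative_frequency(n_trials, target=6, seed=0):
    """Fraction of n_trials simulated fair-die rolls that show `target`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_trials) if rng.randint(1, 6) == target)
    return hits / n_trials

# The estimate should drift toward 1/6 (about 0.1667) as n grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With few trials the estimate can be far from 1/6; with many trials it is typically much closer, which is the behaviour the frequency interpretation relies on.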

2.4 Probability Properties

Proposition. For any event A, P(A) + P(A^c) = 1.

Proof. In the countable additivity axiom (in its finite form), let k = 2,
A_1 = A, and A_2 = A^c. By the definition of the complement, A ∪ A^c = S (the
sample space), and A ∩ A^c = ∅, so A and A^c are disjoint. Then by countable
additivity, P(A ∪ A^c) = P(A) + P(A^c). But P(A ∪ A^c) = P(S) = 1, by the
truth axiom. Hence,

\[ 1 = P(A) + P(A^c). \]

This proposition is useful because there are many situations in which P (Ac )
is more easily obtained by direct methods than is P (A).
From the countable additivity axiom, it is clear that when the events A and
B are disjoint then P (A ∪ B) = P (A) + P (B). For events that are not disjoint,
adding P (A) and P (B) results in double counting outcomes that lie in both A
and B. The next proposition shows how to correct this.
Proposition. For any two events A and B, P(A ∪ B) = P(A) + P(B) − P(A ∩ B).
Proof. We begin by observing that the union A ∪ B can be partitioned into
three disjoint events A \ B, B \ A, and A ∩ B. So,

\[ P(A \cup B) = P(A \setminus B) + P(B \setminus A) + P(A \cap B) \]

Now note that:

\[ P(A) = P(A \setminus B) + P(A \cap B) \quad \text{and} \quad P(B) = P(B \setminus A) + P(A \cap B) \]

Adding these gives:
\[ P(A) + P(B) = P(A \setminus B) + P(B \setminus A) + 2P(A \cap B) \]
so that
\[ P(A \setminus B) + P(B \setminus A) = P(A) + P(B) - 2P(A \cap B) \]
Substituting this into the expression for P(A ∪ B),
\[ P(A \cup B) = P(A) + P(B) - 2P(A \cap B) + P(A \cap B) = P(A) + P(B) - P(A \cap B) \]

Proposition. Inclusion–Exclusion Principle. For any three events A, B, and C,
\[ \begin{aligned} P(A \cup B \cup C) = {} & P(A) + P(B) + P(C) \\ & - P(A \cap B) - P(A \cap C) - P(B \cap C) \\ & + P(A \cap B \cap C) \end{aligned} \]
Intuition: Add P(A), P(B), and P(C), then subtract the pairwise overlaps
(because they were double-counted), and then add back the triple overlap
(because it was subtracted too many times).
This can be extended to k events by summing individual event probabilities,
subtracting double intersection probabilities, adding triple intersection
probabilities, subtracting quadruple intersection probabilities, and so on.
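
The two- and three-event formulas can be checked numerically on a small sample space. The sketch below assumes one roll of a fair die with equally likely outcomes; the events A, B, C and the helper `P` are illustrative choices, not taken from the notes.

```python
from fractions import Fraction

S = set(range(1, 7))  # one roll of a fair die

def P(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event & S), len(S))

A = {2, 4, 6}   # even
B = {4, 5, 6}   # greater than 3
C = {3, 6}      # multiple of 3

# In Python, `|` is set union and `&` is set intersection.
two = P(A) + P(B) - P(A & B)
three = (P(A) + P(B) + P(C)
         - P(A & B) - P(A & C) - P(B & C)
         + P(A & B & C))

print(P(A | B), two)        # 2/3 2/3
print(P(A | B | C), three)  # 5/6 5/6
```

Using `Fraction` keeps the arithmetic exact, so the two sides can be compared with `==` rather than with a floating-point tolerance.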


2.5 Conditional Probability

The probabilities assigned to various events are based on the initial
understanding of the experimental setup and the observations made. However,
as the experiment unfolds or additional evidence becomes available, our
knowledge may change. For example, suppose you are meeting someone at an
airport. The flight is likely to arrive on time; the probability of that is
0.8. Suddenly it is announced that the weather at the origin is bad. You now
realise that the flight may be delayed. Now it has a probability of only 0.05
of arriving on time. New information affected the probability of the flight
arriving on time. The new probability is called a conditional probability,
where the new information, that the weather at the origin is bad, is the
condition.
For a given event A, we denote its original probability, based solely on the
initial information, as P(A), often referred to as the unconditional or prior
probability. When new information, such as the occurrence of another event
B, becomes available, we update this to the conditional probability, denoted by
P(A | B), representing the revised belief in the occurrence of A given that B is
known to have occurred.
The relationship between the unconditional probability P(A) and the
conditional probability P(A | B) depends on how the events A and B are related.
P(A | B) can be greater than, less than, or equal to P(A). This depends on
whether A and B are positively associated, negatively associated, or
independent.

• Positive association. If knowing that B occurred increases the likelihood
of A, then P(A | B) > P(A). For example, let A be the event that a person
has flu and B be the event that the person has fever. Then P(A | B) > P(A).

• Negative association. If knowing B makes A less likely, then P(A | B) <
P(A). For example, let A be the event that a student passed an exam and B
be the event that the student skipped all classes. Then P(A | B) < P(A).

• Independence. If A and B are independent, then P(A | B) = P(A). For
example, let A be the event that a coin lands on heads and B be the event
that a die shows a 6. Then P(A | B) = P(A).

Conditional probability essentially tries to answer the question: “Given that
event B has occurred, what is the probability that event A also occurs?” To
understand this intuitively, think of probability in terms of area (such as in a
Venn diagram). The sample space S has total probability 1. Now that the event
B has occurred, the sample space shrinks to B. Within this new sample space,
how much of B’s area is also shared with A?
So, the conditional probability P(A | B) is the proportion of the area of B
that also lies in A, i.e.,
\[ P(A \mid B) = \frac{\text{Area of } A \cap B}{\text{Area of } B} = \frac{P(A \cap B)}{P(B)}, \quad \text{provided } P(B) > 0 \]

\[ P(A \cap B) = P(A \mid B) \cdot P(B), \quad \text{provided } P(B) > 0 \]

The second equation is used when P(A ∩ B) is desired and both P(B) and
P(A | B) can be obtained from the problem description.
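
The “shrinking sample space” picture can be made concrete by counting outcomes. The following sketch assumes two fair dice with 36 equally likely outcomes; the events and helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

S = set(product(range(1, 7), repeat=2))  # two fair dice, 36 outcomes

def P(event):
    return Fraction(len(event), len(S))

def P_given(A, B):
    """P(A | B) = P(A ∩ B) / P(B), defined only when P(B) > 0."""
    if not B:
        raise ValueError("conditioning event has probability zero")
    return P(A & B) / P(B)

A = {s for s in S if sum(s) == 8}   # the total is 8
B = {s for s in S if s[0] >= 5}     # the first die shows 5 or 6

print(P(A))            # 5/36
print(P_given(A, B))   # 1/6
print(P_given(A, B) * P(B) == P(A & B))  # True
```

Here P(A | B) = 1/6 exceeds P(A) = 5/36, so in this illustration A and B are positively associated.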


Law of Total Probability. Let A_1, A_2, . . . , A_k be a partition of the
sample space S, that is, A_i ∩ A_j = ∅ for all i ≠ j (mutually exclusive) and
A_1 ∪ A_2 ∪ . . . ∪ A_k = S (exhaustive). Then for any event B, the
probability of B is given by
\[ P(B) = \sum_{i=1}^{k} P(B \mid A_i) \cdot P(A_i) \]

The intuition is as follows. Suppose we want to compute the probability of an
event B, but we cannot compute it directly. Instead, we know that the sample
space is divided into mutually exclusive and exhaustive cases, and for each
case we know the probability of B within that case, i.e., P(B | A_i). So we
compute the chance that both A_i and B occur, i.e.,
P(B ∩ A_i) = P(B | A_i) · P(A_i). Adding these over all cases gives the total
chance of B occurring, since A_1, A_2, . . . , A_k cover the whole sample
space.
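
As an illustration of the law of total probability, consider a purely hypothetical setting in which a request is handled by exactly one of three servers, and B is the event that the request fails. All the numbers below are made up for the sketch.

```python
from fractions import Fraction as F

# Hypothetical partition: which of three servers handles a request.
prior = {"A1": F(50, 100), "A2": F(30, 100), "A3": F(20, 100)}
# Hypothetical P(B | Ai): chance that the request fails on each server.
fail_given = {"A1": F(1, 100), "A2": F(2, 100), "A3": F(5, 100)}

# The cases must be exhaustive, so their probabilities sum to 1.
assert sum(prior.values()) == 1

# P(B) = sum over i of P(B | Ai) * P(Ai)
p_fail = sum(fail_given[a] * prior[a] for a in prior)
print(p_fail, float(p_fail))  # 21/1000 0.021
```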
Let us now look at Bayes’ Theorem, which is just a rearrangement of the
conditional probability definition but with profound use. When P(A | B) is
hard to find directly but P(B | A), P(A), and P(B) are known or computable,
then the following Bayes’ Theorem is a handy tool.

\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)}, \quad \text{provided } P(B) > 0 \]
Some time in the 1740s, the Reverend Thomas Bayes, a clergyman and a
mathematician, made this ingenious discovery. It was rediscovered
independently by Pierre-Simon Laplace, who gave it its modern mathematical
form and scientific application. The following example, inspired by Jerome
Cornfield’s observations, illustrates the wide variety of applications of
Bayes’ Theorem. Cornfield used the theorem to solve a puzzle about the chances
of a person getting lung cancer. His paper helped epidemiologists to see how
patients’ histories could help measure the link between a disease and its
possible cause.

Example 2.5. Suppose a medical test is 95% accurate for detecting a disease.
However, only 1% of the population has the disease. If a person tests positive,
what is the probability that they actually have the disease?
Given
• P(Disease) = 0.01 (1% of the people have the disease)

• P(Positive | Disease) = 0.95 (the test correctly identifies the disease 95%
of the time)

• When we say a medical test is 95% accurate, we mean here that the result is
correct 95 out of 100 times and false 5 out of 100 times, whether or not the
disease is present. A false result means that either
1. disease detection is missed, leading to a false negative, in which case

\[ P(\text{Negative} \mid \text{Disease}) = 1 - 0.95 = 0.05 \]

2. the test raises a false alarm, leading to a false positive, in which case

\[ P(\text{Positive} \mid \text{NoDisease}) = 1 - 0.95 = 0.05 \]

To apply Bayes’ Theorem in this example we need the probability of testing
positive whether or not the disease is present. That is why we use the false
positive rate P(Positive | NoDisease) = 0.05 and not the false negative rate.

We have to compute P(Disease | Positive). For this we need P(Positive) first.

\[ \begin{aligned} P(\text{Positive}) &= P(\text{Positive} \mid \text{Disease}) P(\text{Disease}) + P(\text{Positive} \mid \text{NoDisease}) P(\text{NoDisease}) \\ &= 0.95 \times 0.01 + 0.05 \times 0.99 \\ &= 0.059 \end{aligned} \]

\[ P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) P(\text{Disease})}{P(\text{Positive})} = \frac{0.95 \times 0.01}{0.059} \approx 0.161 \]

So, even though the test is 95% accurate, if a person tests positive, the
probability they actually have the disease is only about 16.1%. This happens
because the disease is rare, so false positives from the large healthy group
(0.0495) outnumber true positives (0.0095).
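
The calculation in Example 2.5 can be reproduced in a few lines. The sketch below is one possible way to organise it; the function name `posterior` and its parameter names are illustrative, and the extra priors in the loop are hypothetical values chosen only to show how the answer changes with how common the disease is.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(Disease | Positive) via Bayes' theorem."""
    # Law of total probability for P(Positive).
    p_positive = (sensitivity * prior
                  + false_positive_rate * (1 - prior))
    return sensitivity * prior / p_positive

p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(p, 3))  # 0.161

# The same test applied where the disease is more common (hypothetical priors).
for prior in (0.01, 0.10, 0.50):
    print(prior, round(posterior(prior, 0.95, 0.05), 3))
```

The posterior rises sharply as the prior increases, which is another way of seeing that the low 16.1% figure comes from the rarity of the disease rather than from a poor test.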

Conditional probability and Bayes’ Theorem are informative only when the
events are dependent, i.e., the occurrence or non-occurrence of one event has
a bearing on the chance that the other will occur. For example, drawing a
red card without replacement affects the chance of drawing a second red card.
However, drawing a red card with replacement has no impact on the chance of
drawing a second red card. The two are independent events.
Formally, two events A and B are independent if P(A | B) = P(A), i.e.,
learning that B occurred doesn’t change our belief in A. Likewise,
P(B | A) = P(B). Also, P(A ∩ B) = P(A | B) · P(B) = P(A) · P(B).
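
The product rule for independent events can be verified by direct counting. This sketch assumes one fair coin toss combined with one fair die roll, giving 12 equally likely outcomes; the names are illustrative.

```python
from fractions import Fraction
from itertools import product

# One coin toss together with one die roll: 12 equally likely outcomes.
S = set(product("HT", range(1, 7)))

def P(event):
    return Fraction(len(event), len(S))

A = {s for s in S if s[0] == "H"}  # coin lands heads
B = {s for s in S if s[1] == 6}    # die shows 6

print(P(A & B), P(A) * P(B))    # 1/12 1/12
print(P(A & B) / P(B) == P(A))  # True, i.e. P(A | B) = P(A)
```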

3 Random Variables
While probability theory models uncertainty before observing data, statistics
offers tools to make sense of data once it has been observed. Random variables
are a conceptual bridge between mathematical analysis and real-world
experiments. Some of the experiments we have seen produce abstract outcomes
like “head” or “face 3” or “red card”. But to apply statistical methods, we
need to work with numbers and not such abstract outcomes. We need numeric data
to calculate the proportion x/n, the mean x̄, and the standard deviation. A
random variable is a function that assigns a number to each outcome in a
sample space, i.e., the domain is the sample space and the range is a set of
real numbers.

Example 3.1. Consider the coin tossing experiment where the sample space
is S = {Heads, Tails}. We define a random variable X such that X(Heads) =
1, X(Tails) = 0. This allows us to use mathematical tools to analyze the
outcomes. For example, we can write the probability of getting Heads as
P(X = 1), the expected value as E[X] = ∑_x x · P(X = x), and the variance as
Var(X) = E[(X − E[X])^2].
Thus, random variables provide a way to transition from qualitative outcomes
to quantitative analysis, forming the foundation for statistical inference.
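
The quantities named in Example 3.1 can be computed directly once the random variable is written as a mapping from outcomes to numbers. The sketch below assumes the coin is fair, which the example itself does not state; the dictionary names are illustrative.

```python
from fractions import Fraction

# Assumption: a fair coin, so each outcome has probability 1/2.
prob = {"Heads": Fraction(1, 2), "Tails": Fraction(1, 2)}
# The random variable X is a function from outcomes to numbers.
X = {"Heads": 1, "Tails": 0}

# E[X] = sum of x * P(X = x); Var(X) = E[(X - E[X])^2]
mean = sum(X[w] * prob[w] for w in prob)
var = sum((X[w] - mean) ** 2 * prob[w] for w in prob)

print(mean, var)  # 1/2 1/4
```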

In the following, we denote random variables by uppercase letters near the end
of the English alphabet. We use lowercase letters to represent a particular value
of the corresponding random variable. For example, X(ω) = x means that x is
the value associated with the outcome ω by the random variable X.
Any random variable whose only possible values are 0 and 1 is called a
Bernoulli random variable.

3.1 Discrete Random Variable

A discrete random variable is a random variable (or rv for short) whose
possible values either constitute a finite set or else can be listed in an
infinite sequence in which there is a first element, a second element, and so
on (“countably” infinite). To study basic properties of discrete rvs, only the
tools of discrete mathematics – summation and differences – are required.

Example 3.2. Consider a mobile gaming app that gives users a reward based on
a virtual die roll (a fair 6-sided die) each time they log in. The reward system
is as follows: face 1 → 0 coins, face 2 → 5 coins, face 3 → 10 coins, and so on.
Define the rv X as the number of coins received from a single die roll. So, the
domain of X is the finite set {1, 2, 3, 4, 5, 6} and the range is {0, 5, 10, 15, 20, 25}.
Since the die is fair, P(X = x) = 1/6 for each x in the range.
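
The reward variable of Example 3.2 can be written as an ordinary function on die faces. The sketch below assumes the “and so on” pattern continues as 5 coins per face above 1, which is consistent with the stated range; the names are illustrative.

```python
from collections import Counter
from fractions import Fraction

faces = range(1, 7)

def X(face):
    """Coins awarded for a face: 1 -> 0, 2 -> 5, 3 -> 10, and so on."""
    return 5 * (face - 1)

# Each face has probability 1/6; count the faces that map to each value.
counts = Counter(X(f) for f in faces)
pmf = {x: Fraction(c, 6) for x, c in counts.items()}

print(sorted(pmf))        # [0, 5, 10, 15, 20, 25]
print(pmf[10])            # 1/6
print(sum(pmf.values()))  # 1
```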

3.2 Probability Distribution


Probabilities assigned to various outcomes in S determine probabilities associ-
ated with the values of any particular rv X. The probability distribution of
X says how the total probability of 1 is distributed among (allocated to) the
various possible X values.

Example 3.3. Suppose a business has just purchased three laser printers, and
let X be the number among these that require service during the warranty
period. Possible X values are then 0, 1, 2, and 3. The probability distribution
will tell us how the probability of 1 is divided among these four possible values
– how much probability is associated with the X value 0, how much is
apportioned to the X value 1, and so on. We will use the following notation
for the probabilities in the distribution
p(0) = the probability of the X value 0 = P(X = 0)
p(1) = the probability of the X value 1 = P(X = 1)
and so on. The probability distribution is

x           0      1      2      3
P(X = x)    0.25   0.40   0.18   0.17


The probability mass function (pmf) of a discrete rv is defined for every
number x by p(x) = P(X = x) = P(all ω ∈ S : X(ω) = x). Probability
distribution and probability mass function are closely related, but they
are not exactly the same. Both describe the probabilities assigned to all possible
values of a random variable; the difference lies in the way they are
viewed. A probability distribution is typically presented as a table or list showing
each value the variable can take and the probability of that value. A pmf is a
function, usually denoted p(x) or P(X = x), that maps each value of a discrete
random variable to a probability. It satisfies the properties

a. 0 ≤ P(X = x) ≤ 1 for all x

b. $\sum_x P(X = x) = 1$

Example 3.4. In a group of five potential blood donors – a, b, c, d, e – only a
and b have type O-positive blood. Samples from all five individuals are typed
in random order until an O-positive individual is identified. Let the random
variable Y denote the number of typings necessary to identify the first O-positive
individual.
The sample space S is the set of all possible ordered sequences of typings
stopped at the first O-positive person. We do not continue testing after the first
O-positive is found. So each outcome is a sequence that ends with either a or b,
and all individuals before that are from {c, d, e} in some order. So the sample
space contains outcomes like

{a, b, ca, cb, da, db, ea, eb, cda, cdb, · · · , edcb}

We count the number of such sequences by cases

• 1st position. If a or b appears first, the sequence ends immediately:
count = 2

• 2nd position. One person from {c, d, e} is chosen for the first position,
followed by a or b: $\binom{3}{1} \cdot 1! \cdot 2 = 3 \cdot 1 \cdot 2 = 6$

• 3rd position. Two distinct people from {c, d, e} in any order, followed by
a or b: $\binom{3}{2} \cdot 2! \cdot 2 = 3 \cdot 2 \cdot 2 = 12$

• 4th position. All three c, d, e in any order, followed by a or b:
$\binom{3}{3} \cdot 3! \cdot 2 = 1 \cdot 6 \cdot 2 = 12$

Total number of outcomes in the sample space is |S| = 2 + 6 + 12 + 12 = 32.
These 32 outcomes are not equally likely (for example, P(a) = 1/5 while
P(ca) = 1/5 · 1/4 = 1/20), so the pmf is obtained by multiplying the probabilities
of the successive typings rather than by dividing counts by 32.

Then the pmf of Y is given as follows:
\[
p(1) = P(Y = 1) = P(\text{first is a or b}) = \frac{2}{5} = 0.4
\]
\[
p(2) = P(Y = 2) = P(\text{first is c, d, or e and then a or b})
     = \frac{3}{5} \cdot \frac{2}{4} = \frac{3}{10} = 0.3
\]
\[
p(3) = P(Y = 3) = P(\text{first two are c, d, or e; then a or b})
     = \frac{3}{5} \cdot \frac{2}{4} \cdot \frac{2}{3} = \frac{2}{10} = 0.2
\]
\[
p(4) = P(Y = 4) = P(\text{first three are c, d, or e; then a or b})
     = \frac{3}{5} \cdot \frac{2}{4} \cdot \frac{1}{3} \cdot 1 = \frac{1}{10} = 0.1
\]
p(y) = 0 for all other y

The probability distribution is

y       1     2     3     4     other
p(y)    0.40  0.30  0.20  0.10  0
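
The pmf of Example 3.4 can be cross-checked by brute force. The sketch below assumes that typing "in random order" means all 5! orderings of the donors are equally likely, and it reads Y off each full ordering (which gives the same Y as stopping at the first O-positive donor). The names `typings_needed` and `orders` are hypothetical.

```python
from fractions import Fraction
from itertools import permutations

donors = "abcde"
o_positive = {"a", "b"}

def typings_needed(order):
    """Y: 1-based position of the first O-positive donor in a typing order."""
    for position, donor in enumerate(order, start=1):
        if donor in o_positive:
            return position
    raise ValueError("order contains no O-positive donor")

# Assumption: all 5! = 120 typing orders are equally likely.
orders = list(permutations(donors))
pmf = {}
for order in orders:
    y = typings_needed(order)
    pmf[y] = pmf.get(y, Fraction(0)) + Fraction(1, len(orders))

for y in sorted(pmf):
    print(y, pmf[y], float(pmf[y]))
```

Working with `Fraction` keeps the probabilities exact, so they can be compared directly with 2/5, 3/10, 2/10 and 1/10.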


A useful pictorial representation of a pmf is a probability histogram. Above
each y with p(y) > 0, construct a rectangle centered at y. The height of each
rectangle is proportional to p(y), and the base width is the same for all
rectangles. When possible values are equally spaced, the base width is frequently
chosen as the distance between successive y values (though it could be smaller).
Figure 1 shows the probability histogram for Example 3.4.

Figure 1: Probability histogram

The pmf or probability distribution gives the probability of a single value
of X. Imagine a scenario where X is the number of beds occupied in
a hospital’s emergency room at a certain time of day and we are interested in
knowing the probability that at most two beds are occupied. This is where the
concept of the cumulative distribution function helps.

Example 3.5. Suppose the pmf of X for the above hospital scenario is given
by
x 0 1 2 3 4
p(x) 0.20 0.25 0.30 0.15 0.10
Then the probability that at most two beds are occupied is:

P (X ≤ 2) = p(0) + p(1) + p(2) = 0.20 + 0.25 + 0.30 = 0.75

Suppose we are interested in P(X ≤ 2.7). But note that X is a discrete
random variable; it only takes values from the set {0, 1, 2, 3, 4}. There is no
probability mass assigned to values like 2.7 because X never equals 2.7. So
when we ask for P(X ≤ 2.7), it is interpreted as the probability that X takes
one of the values {0, 1, 2}. The highest value X can take that is ≤ 2.7 is 2. Thus

P(X ≤ 2.7) = P(X ≤ 2) = p(0) + p(1) + p(2) = 0.20 + 0.25 + 0.30 = 0.75

Likewise, P(X ≤ 2.999) = 0.75. Since 0 is the smallest possible value of X, we
get P(X ≤ −10) = 0.

For any x < 0, P(X ≤ x) = 0

And because 4 is the largest possible value of X, P(X ≤ 4) = 1.

For any x > 4, P(X ≤ x) = 1.

It follows that for any real x, P(X ≥ x) = 1 − P(X < x).

The cumulative distribution function F(x) of a discrete random variable
X with probability mass function (pmf) p(x) is defined for every number x by
\[
F(x) = P(X \le x) = \sum_{y:\, y \le x} p(y)
\]
That is, for any number x, F(x) gives the probability that the observed value
of X will be at most x.
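
As a minimal sketch of this definition, the function below sums the pmf of Example 3.5 over all possible values not exceeding x. The helper name `cdf` and the choice of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction

# pmf of X from Example 3.5 (beds occupied), stored as exact fractions.
pmf = {0: Fraction(20, 100), 1: Fraction(25, 100), 2: Fraction(30, 100),
       3: Fraction(15, 100), 4: Fraction(10, 100)}

def cdf(x, pmf):
    """F(x) = P(X <= x): sum p(y) over all possible values y with y <= x."""
    return sum((p for y, p in pmf.items() if y <= x), Fraction(0))

for x in (-10, 2, 2.7, 2.999, 4, 7.5):
    print(x, float(cdf(x, pmf)))
```

Because the sum only changes when x passes a possible value of X, the function is a step function: it is flat between 2 and 3, which is why x = 2, 2.7 and 2.999 all give the same result.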
Proposition. Let X be a discrete random variable with cumulative distribution
function F (x) = P (X ≤ x). For any two real numbers a ≤ b, we have

P (a ≤ X ≤ b) = F (b) − F (a− )

where F (a− ) denotes the left-hand limit of the cdf at a, i.e., the sum of proba-
bilities of all values strictly less than a.

In particular, if X takes only integer values and a, b ∈ Z, then
\[
P(a \le X \le b) = \sum_{x=a}^{b} P(X = x) = F(b) - F(a - 1)
\]
Taking a = b yields
\[
P(X = a) = F(a) - F(a - 1)
\]
Proof. By definition of the cdf
\[
F(b) = P(X \le b) = \sum_{x \le b} P(X = x)
     = \sum_{x < a} P(X = x) + \sum_{a \le x \le b} P(X = x)
\]
and
\[
F(a^-) = \sum_{x < a} P(X = x)
\]
Substituting F(a⁻) in F(b)
\[
F(b) = F(a^-) + \sum_{a \le x \le b} P(X = x)
\]
\[
F(b) - F(a^-) = \sum_{a \le x \le b} P(X = x) = P(a \le X \le b)
\]
If X only takes integer values and a, b ∈ Z, then
\[
F(a - 1) = \sum_{x < a} P(X = x) \quad \text{(since the only values less than } a \text{ are integers)}
\]
So
\[
P(a \le X \le b) = F(b) - F(a - 1)
\]
And if a = b, then:
\[
P(X = a) = P(a \le X \le a) = F(a) - F(a - 1)
\]

Example 3.6. * Let X denote the number of days of sick leave taken by a
randomly selected employee of a large company during a particular year. If the
maximum number of allowable sick days per year is 14, then the possible values
of X are {0, 1, 2, . . . , 14}
Suppose the cumulative distribution function F (x) = P (X ≤ x) is known
for several values F (0) = 0.58, F (1) = 0.72, F (2) = 0.76, F (3) = 0.81, F (4) =
0.88, F (5) = 0.94.
Then we can compute the probability that an employee took between 2 and
5 sick days (inclusive) as
P (2 ≤ X ≤ 5) = P (X = 2 or 3 or 4 or 5) = F (5) − F (1) = 0.94 − 0.72 = 0.22
Similarly, the probability that an employee took exactly 3 sick days is
P (X = 3) = F (3) − F (2) = 0.81 − 0.76 = 0.05
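
The two computations in Example 3.6 follow one pattern, sketched below. The helper `prob_between` is hypothetical, and it assumes that a ≥ 1 and that both F(b) and F(a − 1) are among the known cdf values.

```python
from fractions import Fraction

# Known cdf values F(x) from Example 3.6, as exact fractions.
F = {0: Fraction(58, 100), 1: Fraction(72, 100), 2: Fraction(76, 100),
     3: Fraction(81, 100), 4: Fraction(88, 100), 5: Fraction(94, 100)}

def prob_between(a, b, F):
    """P(a <= X <= b) = F(b) - F(a - 1) for an integer-valued X."""
    return F[b] - F[a - 1]

print(float(prob_between(2, 5, F)))  # P(2 <= X <= 5)
print(float(prob_between(3, 3, F)))  # P(X = 3)
```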

* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

3.3 Expected Value
In many real-world situations, we are interested not just in the possible outcomes
of a random experiment but in summarizing those outcomes with a single rep-
resentative number, typically an average. For instance, when analyzing student
course enrollments at a university, knowing how many courses each student is
registered for is useful, but what is even more informative is the average num-
ber of courses per student. This average helps administrators plan resources,
allocate faculty, and understand student workload.

Example 3.7. * Consider a university having 15,000 students and let X be the
number of courses for which a randomly selected student is registered. The pmf
of X is as follows

x       1     2     3     4     5     6     7
p(x)    0.01  0.03  0.13  0.25  0.39  0.17  0.02

The basic experiment being performed is picking a student at random from the
university. So, the sample space S is the set of all possible outcomes of this
experiment, which are mutually exclusive (no overlap) and collectively exhaustive
(cover all possibilities). Here the sample space is
$S = \{\mathrm{Stud}_1, \mathrm{Stud}_2, \ldots, \mathrm{Stud}_{15000}\}$.
Let us now identify the random variable. In this example, the random variable
X is the function from S above to the number of courses taken by the
selected student. The range of X is {1, 2, 3, 4, 5, 6, 7}. For example,
$X(\mathrm{Stud}_{219}) = 3$, $X(\mathrm{Stud}_{1125}) = 7$, etc. Thus X collapses all students who take,
say, 5 courses into the same numerical value. It doesn’t identify who the student
is.
The number of students registered for each possible number of courses can be
computed as

x = 1:  0.01 × 15000 = 150
x = 2:  0.03 × 15000 = 450
x = 3:  0.13 × 15000 = 1950
x = 4:  0.25 × 15000 = 3750
x = 5:  0.39 × 15000 = 5850
x = 6:  0.17 × 15000 = 2550
x = 7:  0.02 × 15000 = 300


The average number of courses per student is given by computing the total
number of courses taken by all students and dividing by the total number of
students. Since each of 150 students is taking one course, these 150 contribute
150 courses to the total. Similarly, 450 students contribute 2 × 450 courses, and
so on. The population average value of X is then
\[
\begin{aligned}
&\frac{1(150) + 2(450) + 3(1950) + 4(3750) + 5(5850) + 6(2550) + 7(300)}{15000} \\
&= 1 \cdot \frac{150}{15000} + 2 \cdot \frac{450}{15000} + 3 \cdot \frac{1950}{15000}
 + 4 \cdot \frac{3750}{15000} + 5 \cdot \frac{5850}{15000} + 6 \cdot \frac{2550}{15000}
 + 7 \cdot \frac{300}{15000} \\
&= 1 \cdot p(1) + 2 \cdot p(2) + 3 \cdot p(3) + 4 \cdot p(4) + 5 \cdot p(5) + 6 \cdot p(6) + 7 \cdot p(7) \\
&= 4.57
\end{aligned}
\]
* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

This is nothing but the weighted average of all possible values a random variable
can take, where the weights are the probabilities. This naturally leads to the
concept of the expected value of a random variable – a fundamental quantity
that captures the long-run average outcome in probabilistic settings. Impor-
tantly, probabilities themselves can often be interpreted as relative proportions
in a large population, meaning we can compute expected values using just the
probability distribution without needing the full population data. This makes
expected value a powerful and practical tool even when only probabilistic
models are available. In statistics, the terms average and mean usually refer to
the empirical average of a data sample, whereas the expected value is the
corresponding theoretical long-run average over all outcomes, based on a model.
At their core, both refer to the central tendency.
The expected value in the above example is a decimal (4.57) but the random
variable itself takes only integer values in {1, 2, 3, 4, 5, 6, 7}. The expected value
is a theoretical center of gravity of the distribution and not necessarily a value
the random variable ever actually assumes.
Let X be a discrete random variable with set of possible values D and
probability mass function p(x). The expected value or mean value of X,
denoted by E[X] or $\mu_X$ (or simply µ), is defined as
\[
E[X] = \mu_X = \sum_{x \in D} x \cdot p(x)
\]
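
The definition can be checked against Example 3.7 with a short sketch. It computes the mean in two ways: directly from the pmf, and as a population average over the 15,000 students. The helper name `expected_value` is hypothetical.

```python
from fractions import Fraction

# pmf of X (courses per student) from Example 3.7, in hundredths.
pmf = {1: Fraction(1, 100), 2: Fraction(3, 100), 3: Fraction(13, 100),
       4: Fraction(25, 100), 5: Fraction(39, 100), 6: Fraction(17, 100),
       7: Fraction(2, 100)}

def expected_value(pmf):
    """E[X] = sum of x * p(x) over all possible values x."""
    return sum(x * p for x, p in pmf.items())

# Route 1: directly from the pmf.
mean_from_pmf = expected_value(pmf)

# Route 2: population average over all 15,000 students.
students = 15000
total_courses = sum(x * (p * students) for x, p in pmf.items())
mean_from_population = total_courses / students

print(float(mean_from_pmf), float(mean_from_population))
```

Both routes give the same number because dividing each head count by 15,000 simply recovers p(x).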

Expected Value of a Function. Suppose we are not directly interested in
the value of X, but rather in a transformation of it like g(X) = cost(X) where
each outcome leads to a cost. Then E[g(X)] gives the expected cost.


Example 3.8. Let a game be defined as follows:

• You earn Rs. 2 if you roll a 1 or 2,

• Rs. 5 if you roll a 3 or 4,

• Rs. 10 if you roll a 5 or 6.

Let X denote the outcome of a fair six-sided die and define the function g(x)
by
\[
g(x) =
\begin{cases}
2 & \text{if } x = 1, 2 \\
5 & \text{if } x = 3, 4 \\
10 & \text{if } x = 5, 6
\end{cases}
\]
Each outcome x ∈ {1, 2, 3, 4, 5, 6} has probability 1/6. Then the expected
value of the function g(X) is
\[
\begin{aligned}
E[g(X)] &= \sum_{x=1}^{6} g(x) \cdot \frac{1}{6} \\
&= 2\left(\frac{1}{6} + \frac{1}{6}\right) + 5\left(\frac{1}{6} + \frac{1}{6}\right)
 + 10\left(\frac{1}{6} + \frac{1}{6}\right) \\
&= \frac{2(2) + 5(2) + 10(2)}{6} \\
&= \frac{34}{6} \approx 5.67
\end{aligned}
\]
So, on average, you earn Rs. 5.67 per roll. This illustrates how expected
value generalizes to functions of random variables.
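
The calculation in Example 3.8 can be reproduced with a small sketch that weights each payout g(x) by the probability of the face x. The variable names are hypothetical.

```python
from fractions import Fraction

def g(x):
    """Payout in rupees for die face x, as defined in Example 3.8."""
    if x in (1, 2):
        return 2
    if x in (3, 4):
        return 5
    return 10  # faces 5 and 6

# Fair die: each face has probability 1/6.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

expected_payout = sum(g(x) * p for x, p in pmf.items())
print(expected_payout, round(float(expected_payout), 2))
```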

Example 3.9. A courier company charges customers based on the weight of
a package. Let X be the weight (in kg) of a randomly selected package. The
company charges a shipping cost according to the function $g(x) = 50 + 10x^2$
where 50 is a fixed base charge and $10x^2$ is a weight-based surcharge. The
probability mass function (pmf) of the random variable X is given as

x           1     2     3     4     5
P(X = x)    0.30  0.25  0.20  0.15  0.10

The goal is to compute the expected shipping cost
\[
E[g(X)] = \sum_{x=1}^{5} g(x) \cdot P(X = x)
\]
We compute the values of g(x) and their contributions to the expectation

x    g(x) = 50 + 10x^2    P(X = x)    g(x) · P(X = x)
1    60                   0.30        18.00
2    90                   0.25        22.50
3    140                  0.20        28.00
4    210                  0.15        31.50
5    300                  0.10        30.00
Total Expected Cost E[g(X)]           130.00

The expected shipping cost per package is computed to be Rs. 130. This
represents the company’s average cost for handling packages based on their
observed weight distribution. To operate at a profit in the long run, the company
must charge more than Rs. 130 per package on average. For instance, if it charges
Rs. 150, the company secures an average profit of Rs. 20 per package.

If the random variable X has a set of possible values D and probability mass
function p(x), then the expected value of any function h(X), denoted by
E[h(X)] or $\mu_{h(X)}$, is computed as
\[
E[h(X)] = \sum_{x \in D} h(x) \cdot p(x)
\]
This is incredibly useful because we don’t need to find the pmf of Y = h(X). We
can just “lift” the function h over the known distribution of X. This is called
the Law of the Unconscious Statistician (LOTUS). Paul Halmos, a prominent
mathematician, is credited with coining the term “Fundamental Theorem of
the Unconscious Statistician” in the early 1940s. Sheldon Ross popularized
this term by using it in the first edition of his textbook Introduction to
Probability Models, noting in a footnote: “This law got its name from ‘unconscious’
statisticians who have used it as if it were the definition of E[g(X)].” Many
statisticians didn’t appreciate this humor and the author had to remove it in
later editions.
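
To see what LOTUS saves, the sketch below computes the expected shipping charge of Example 3.9 twice: once by LOTUS, and once by first building the pmf of Y = h(X) and then applying the definition of expected value to Y. The names `pmf_x`, `pmf_y`, `lotus` and `direct` are hypothetical.

```python
from fractions import Fraction

# pmf of the package weight X (kg) from Example 3.9.
pmf_x = {1: Fraction(30, 100), 2: Fraction(25, 100), 3: Fraction(20, 100),
         4: Fraction(15, 100), 5: Fraction(10, 100)}

def h(x):
    """Shipping charge in rupees for a package of weight x kg."""
    return 50 + 10 * x**2

# Route 1 (LOTUS): weight h(x) by p(x); the pmf of Y = h(X) is never built.
lotus = sum(h(x) * p for x, p in pmf_x.items())

# Route 2 (definition): build the pmf of Y = h(X), then average y * p_Y(y).
pmf_y = {}
for x, p in pmf_x.items():
    pmf_y[h(x)] = pmf_y.get(h(x), Fraction(0)) + p
direct = sum(y * p for y, p in pmf_y.items())

print(lotus, direct)
```

The accumulation in Route 2 matters when h is not one-to-one (several x values mapping to the same y); LOTUS handles that case without any extra bookkeeping.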
Proposition. Let X be a discrete random variable with probability mass
function p(x), and let a and b be real constants. Then
\[
E[aX + b] = a \cdot E[X] + b
\]
Proof. Let Y = aX + b.
\[
\begin{aligned}
E[Y] = E[aX + b] &= \sum_{x \in D} (ax + b) \cdot p(x) \\
&= \sum_{x \in D} ax \cdot p(x) + \sum_{x \in D} b \cdot p(x) \\
&= a \sum_{x \in D} x \cdot p(x) + b \sum_{x \in D} p(x) \\
&= a \cdot E[X] + b \cdot 1 \\
&= a \cdot E[X] + b
\end{aligned}
\]

This means that a constant factor and a constant shift can be pulled outside the
expectation. Think of X as a measurement, for example, length in centimeters,
and suppose your American friend is familiar with measurements in inches. If you
convert each value of X to inches by multiplying by 0.39, then the expected value
also gets multiplied by 0.39. So, multiplying changes the unit or scale of the values
and proportionally changes the average. Likewise, suppose you increase every value
of X by 5, like adding a fixed processing fee or tax. Then the expected value also
increases by exactly 5. That means shifting the values just shifts the average
by the same amount. Changing the random variable by scaling (multiplying) or
shifting (adding) changes the expected value in a predictable way – the expected
value scales and shifts with it.
Let us consider a transformation of a random variable X into a new variable
Y as Y = aX + b. This is a transformation of the values of X, and depending on
whether b = 0 or not, we classify it differently. When b = 0, the transformation
becomes Y = aX and is known as a linear transformation. There is no shift
involved; we are just stretching or shrinking the random variable by the constant a.
When b ≠ 0, the transformation Y = aX + b is known as an affine transformation.
It first scales and then shifts the values of X. So not only do we
scale the average, but we also add the constant shift. For example, converting
Celsius to Fahrenheit by $F = \frac{9}{5} C + 32$ is an affine transformation.
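
A quick numerical check of E[aX + b] = a · E[X] + b, sketched for a fair die. The values a = 5 and b = −5 are an assumption: they correspond to reading the reward rule of Example 3.2 as 5 · (face − 1). The helper name `expectation` is hypothetical.

```python
from fractions import Fraction

# Fair die face X; the reward in Example 3.2 is read here as Y = 5*X - 5.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}
a, b = 5, -5

def expectation(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x); h defaults to the identity."""
    return sum(h(x) * p for x, p in pmf.items())

lhs = expectation(pmf, lambda x: a * x + b)  # E[aX + b] computed directly
rhs = a * expectation(pmf) + b               # a * E[X] + b
print(lhs, rhs)
```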

3.4 Variance
The expected value of a random variable X, denoted E[X] or µ, tells us where the
center of the probability distribution lies. It is like the balance point or fulcrum
of the distribution when all probabilities are imagined as weights placed on a
seesaw.
Imagine each value x as a point on a ruler, and the probability p(x) as a
small weight placed at that point. If you support the ruler at the expected value
µ it will balance and not tilt to one side or the other. This shows that µ is the
center of gravity of the distribution.
Even if two distributions have the same expected value (same center), they
can look very different. One might have most values tightly clustered around the
center, while another might have values spread far apart. For example, a factory
producing screws with a target length of 10 mm wants most of the
screws to be very close to that length. A large spread indicates inconsistency,
leading to defects or failures. A stock with 10% average return but high variance
might lose you money one year and double it the next. Understanding variance
helps balance risk vs return.
This spread between the desired behaviour (expected value) and the actual
behaviour is captured by the variance of X. It quantifies uncertainty
and inconsistency. Whether it is risk, quality, performance, or fairness, variance
helps us understand how much things fluctuate around their expected behavior.
We want to measure the “expected deviation” of a random variable from its
expected value. But the simple E[X − E[X]] always gives 0, because
E[X − µ] = E[X] − µ = 0: positive and negative deviations cancel.
In his early work on astronomical observations and measurement errors,
Carl Friedrich Gauss introduced the use of the expected value of the squared
difference between a random variable and its mean, which we now define as the
variance
\[
V[X] = E[(X - \mu)^2], \quad \text{where } \mu = E[X]
\]
Gauss developed this concept in the early 19th century, while seeking to find
the most probable value of an unknown quantity based on noisy measurements.
Measurements in astronomy often contained random errors, and Gauss was
concerned with how to best estimate unknown quantities from such noisy data.
He proposed minimizing the expected squared error, which naturally led to the
mean µ as the best estimate and variance as a measure of uncertainty or spread
around the mean. This principle underlies much of modern statistics and data
science, particularly in regression, estimation theory, and machine learning.
The deviation is squared, as in the equation above, to ensure that
all deviations contribute positively to the overall measure of spread. Moreover,
squaring gives more weight to larger deviations (for example, squaring a difference
of 2 gives 4, while squaring a difference of 10 gives 100). The variance is
then the average of the squared deviations, weighted by their probabilities.
If a random variable X is measured in some units, then its mean µ has
the same measurement unit as X. However, the variance V is measured in
squared units, and therefore it cannot be directly compared with X or µ. No
matter how unusual it sounds, it is mathematically correct to measure variance
of profit in squared rupees, variance of class enrollment in squared students,
and variance of available disk space in squared gigabytes. When we take the
square root of variance, the resulting standard deviation σ is again measured
in the same units as X. This is the main reason for introducing another measure
of variability – the standard deviation σ, which provides a more interpretable
sense of spread in the context of the original data. Variance, also denoted $\sigma^2$,
is essential for theoretical reasons (e.g., it is easier to manipulate algebraically).
But for interpreting data or comparing with real-world measurements, we usually
use the standard deviation.
The computation of variance can be reduced to this simple formula
\[
V(X) = \sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2
\]
which follows by expanding the square and using linearity of expectation:
$E[X^2 - 2\mu X + \mu^2] = E[X^2] - 2\mu^2 + \mu^2$.
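
The two expressions for the variance can be compared numerically. The sketch below reuses the pmf of Example 3.5 (beds occupied); the function names are hypothetical.

```python
from fractions import Fraction

# pmf of X from Example 3.5 (beds occupied), as exact fractions.
pmf = {0: Fraction(20, 100), 1: Fraction(25, 100), 2: Fraction(30, 100),
       3: Fraction(15, 100), 4: Fraction(10, 100)}

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance_definition(pmf):
    """E[(X - mu)^2], straight from the definition."""
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def variance_shortcut(pmf):
    """E[X^2] - (E[X])^2."""
    second_moment = sum(x**2 * p for x, p in pmf.items())
    return second_moment - mean(pmf) ** 2

var = variance_definition(pmf)
sd = float(var) ** 0.5  # standard deviation, in the same units as X
print(float(mean(pmf)), float(var), float(variance_shortcut(pmf)), sd)
```

The shortcut needs only the first two moments, which is why it is usually the more convenient form for hand calculation.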

Variance of a Linear Function. Let us now analyse how transformations
affect variance. Let X be a random variable, and let Y = aX + b be a linear
function of X. Then,
\[
V[Y] = V[aX + b] = a^2 \cdot V[X]
\]
Adding a constant b shifts all values but does not change the spread, so it does
not affect the variance. Multiplying by a constant a stretches or compresses
the distribution. Since variance involves squaring the deviations, the variance
scales by $a^2$. For example, if all values of X are doubled, then the deviations
from the mean are also doubled. Since variance is the expected value of squared
deviations
\[
V_{\text{new}} = (2)^2 \cdot V_{\text{old}} = 4 \cdot V_{\text{old}}
\]
Linear transformations are common in practice, such as converting units (e.g.,
inches to centimeters), normalizing or standardizing data, and adjusting values
based on cost, inflation, or other scaling factors.

Standard Deviation of a Linear Function. Since standard deviation is the
square root of the variance
\[
\sigma_Y = \sqrt{V[Y]} = \sqrt{a^2 \cdot V[X]} = \sqrt{a^2} \cdot \sqrt{V[X]} = |a| \cdot \sigma_X
\]
Note that $\sqrt{a^2} = |a|$, not a. Standard deviation measures spread, not
direction. A negative a flips the distribution but does not affect how
spread out the values are. Therefore, only the magnitude of a affects the standard
deviation, not its sign.
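
The scaling rules can be verified on a small case. The sketch below uses a fair die for X and hypothetical constants a = −2 and b = 7 (a negative a is chosen on purpose, to show that only |a| matters for the standard deviation).

```python
import math
from fractions import Fraction

pmf_x = {face: Fraction(1, 6) for face in range(1, 7)}  # fair die
a, b = -2, 7  # hypothetical scale and shift

def variance(pmf):
    mu = sum(x * p for x, p in pmf.items())
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

# pmf of Y = aX + b: each value moves, its probability stays the same.
pmf_y = {a * x + b: p for x, p in pmf_x.items()}

var_x, var_y = variance(pmf_x), variance(pmf_y)
sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
print(var_x, var_y, sd_x, sd_y)
```

The shift b drops out entirely, and the variance of Y is exactly four times the variance of X.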

Example 3.10. † We would like to invest Rs. 10,000 into shares of companies
XX and YY. Shares of XX cost Rs. 20 per share. The market analysis shows
that their expected return is Rs. 1 per share with a standard deviation of Rs. 0.5.
Shares of YY cost Rs. 50 per share, with an expected return of Rs. 2.50 and a
standard deviation of Rs. 1 per share, and returns from the two companies are
independent. In order to maximize the expected return and minimize the risk
(standard deviation or variance), is it better to invest (A) all Rs. 10,000 into
XX, (B) all Rs. 10,000 into YY, or (C) Rs. 5,000 in each company?
Let X be the actual (random) return from each share of XX, and Y be the
actual return from each share of YY. Compute the expectation and variance of
the return for each of the proposed portfolios (A, B, and C).
(a) At Rs. 20 apiece, we can use Rs. 10,000 to buy 500 shares of XX, collecting
a profit of A = 500X. Using (3.5) and (3.7),
\[
E(A) = 500\, E(X) = 500(1) = 500;
\]
\[
V(A) = 500^2\, V(X) = 500^2 (0.5)^2 = 62{,}500.
\]
(b) Investing all Rs. 10,000 into YY, we buy 10,000/50 = 200 shares of it and
collect a profit of B = 200Y,
\[
E(B) = 200\, E(Y) = 200(2.50) = 500;
\]
\[
V(B) = 200^2\, V(Y) = 200^2 (1)^2 = 40{,}000.
\]
(c) Investing Rs. 5,000 into each company makes a portfolio consisting of
250 shares of XX and 100 shares of YY; the profit in this case will be
C = 250X + 100Y. Following (3.7) for independent X and Y,
\[
E(C) = 250\, E(X) + 100\, E(Y) = 250 + 250 = 500;
\]
\[
V(C) = 250^2\, V(X) + 100^2\, V(Y) = 250^2 (0.5)^2 + 100^2 (1)^2 = 25{,}625.
\]
† taken from Probability and Statistics for Computer Scientists by Michael Baron, 2nd
edition.

Table 1: Portfolio Risk Comparison

Portfolio    Exp Return    Variance    SD       Risk
A            500           62,500      250      High
B            500           40,000      200      Medium
C            500           25,625      ≈ 160    Low
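
The three portfolios can be recomputed with a short sketch that applies the rules E(aX + bY) = aE(X) + bE(Y) and, for independent X and Y, V(aX + bY) = a²V(X) + b²V(Y). The function name `portfolio` is hypothetical.

```python
import math

# Per-share figures from Example 3.10 (rupees).
mean_x, sd_x = 1.0, 0.5   # company XX
mean_y, sd_y = 2.5, 1.0   # company YY

def portfolio(shares_x, shares_y):
    """Mean, variance and sd of shares_x*X + shares_y*Y, X and Y independent."""
    mean = shares_x * mean_x + shares_y * mean_y
    var = shares_x**2 * sd_x**2 + shares_y**2 * sd_y**2
    return mean, var, math.sqrt(var)

for name, (nx, ny) in {"A": (500, 0), "B": (0, 200), "C": (250, 100)}.items():
    mean, var, sd = portfolio(nx, ny)
    print(name, mean, var, round(sd, 2))
```

The variance formula inside `portfolio` is only valid under the independence assumption stated in the example; correlated returns would need a covariance term.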

Discussion. A portfolio will not generally yield the same return unless the
assets involved have identical proportional returns. But risk will almost always
change depending on how you combine assets. This example illustrates the point
well: the portfolios yield the same return but with different risk levels. The
expected return is the same for each of the proposed three portfolios because
each share of each company is expected to return Rs. 1/20 or Rs. 2.50/50, which
is 5%. But portfolio C, where investment is split between two companies, has
the lowest variance; therefore, it is the least risky.
High variance, as in Portfolio A, means high risk and you could see huge
drops before a recovery. The return might be delayed or even not materialize
within your time frame.
Expected return is an average over many possible outcomes (or many in-
vestors, or many years). It does not guarantee what happens in one realization.
Expected return becomes meaningful when the experiment is repeated many
times (e.g., investing every year) or when many investors are considered. Then
the average of outcomes will converge toward the expected value due to the Law
of Large Numbers.
A Systematic Investment Plan (SIP) involves investing a fixed amount at
regular intervals (e.g., monthly), typically in mutual funds or equity markets.
This process ties closely with ideas from expected value, variance, and the law of
large numbers. Over time, the average return stabilizes and starts approaching
the expected return. SIP is a practical implementation of diversifying not just
across assets, but across time.


4 Probability Distribution on a Single RV
A probability distribution (pd) of a single random variable provides a com-
plete description of the likelihood of its possible values. If the random variable
is discrete, the distribution is represented by a probability mass function (pmf)
that assigns a probability to each possible value. The probability distribution
encapsulates how the random variable behaves, and it forms the foundation
for further statistical analysis, such as computing expectations, variances, and
making inferences.
Absolutely different phenomena can be adequately described by the same
mathematical model, or a family of distributions. For example, the number
of virus attacks, received e-mails, error messages, network blackouts, telephone
calls, traffic accidents, earthquakes, and so on can all be modeled by the same
Poisson family of distributions.

4.1 Binomial Distribution


A Bernoulli trial is a single random experiment that results in one of two
possible outcomes

• Success (S) with probability p, and

• Failure (F) with probability 1 − p.

Each trial is independent and the probability of success remains constant.
For example, tossing a fair coin (head = success, tail = failure) is a Bernoulli
trial with p = 0.5.
A Binomial experiment is a random process that satisfies the following
four conditions

1. Fixed Number of Trials: The experiment consists of a sequence of n trials,
where n is predetermined.
2. Dichotomous Outcomes: Each trial has exactly two possible outcomes,
labeled success (S) and failure (F).
3. Independence: The outcomes of the trials are independent of one another.
4. Constant Probability of Success: The probability of success on each trial
is the same and denoted by p; thus, the probability of failure is 1 − p.
If these conditions are met, then the number of successes X in n trials follows
a Binomial distribution, denoted by X ∼ Bin(n, p), with probability mass
function (pmf)
\[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad \text{for } k = 0, 1, \ldots, n.
\]

Binomial Distribution Derivation. Let us consider a sequence of n
independent Bernoulli trials, where each trial results in a success (S) with probability
p, and a failure (F) with probability 1 − p. Let X denote the total number of
successes in the n trials. We are interested in computing the probability P(X = k),
i.e., the probability of observing exactly k successes in n trials.
Each specific sequence of k successes and (n − k) failures has the same
probability, since trials are independent. The probability of such a sequence is
\[
P(\text{one such sequence}) = p^k (1 - p)^{n-k}
\]
There are $\binom{n}{k}$ different ways to arrange k successes in n trials. Thus, the
total probability of observing k successes in n trials is
\[
P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}, \quad \text{for } k = 0, 1, 2, \ldots, n
\]
This is the probability mass function (pmf) of the Binomial distribution,
denoted by
X ∼ Bin(n, p)
This distribution is named after a Swiss mathematician Jacob Bernoulli
(1654-1705) who also proposed the model of Bernoulli experiment. The
distribution is widely used in situations involving repeated independent experiments,
such as transmitted or lost signals, quality control, clinical trials, and reliability
testing.
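
The counting argument in the derivation can be mirrored in code: enumerate every S/F sequence of length n, add up the probability of each sequence grouped by its number of successes, and compare with the closed-form pmf. The values n = 5 and p = 0.3 are hypothetical, chosen only for illustration, as are the function names.

```python
from itertools import product
from math import comb, isclose

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p), from the closed-form formula."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def pmf_by_enumeration(n, p):
    """Sum p^k (1-p)^(n-k) over every S/F sequence, grouped by k successes."""
    totals = [0.0] * (n + 1)
    for sequence in product("SF", repeat=n):
        k = sequence.count("S")
        totals[k] += p**k * (1 - p) ** (n - k)
    return totals

n, p = 5, 0.3  # hypothetical values chosen for illustration
enumerated = pmf_by_enumeration(n, p)
for k in range(n + 1):
    print(k, round(binomial_pmf(k, n, p), 5), round(enumerated[k], 5))
```

Enumeration visits 2^n sequences, so it is only practical for small n; the formula replaces that work with a single binomial coefficient.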

Example 4.1. Suppose a spam filter is known to incorrectly mark a legitimate
email as spam 5% of the time. That is, the probability of a false positive (a
“success” in this context, as we are interested in identifying such cases) is p = 0.05.
A user checks 4 important emails received during the day. Let us find the
probability that none of them were wrongly marked as spam, assuming
independence.
This is a binomial scenario with

• Number of trials: n = 4

• Probability of success (false positive): p = 0.05

• Probability of failure (correct classification): 1 − p = 0.95

To find the probability that none of the emails were marked as spam, we
calculate
\[
\begin{aligned}
P(\text{0 false positives}) &= \binom{4}{0} \cdot 0.05^0 \cdot 0.95^{4-0} \\
&= 1 \cdot 1 \cdot 0.95^4 \\
&\approx 0.815
\end{aligned}
\]
This implies that there is an approximately 81.5% chance that all 4 important
emails were delivered correctly (not misclassified).
This kind of probability estimation helps software engineers and product
managers assess the reliability of machine learning models used in email filtering
systems. If the probability of at least one misclassification becomes too high,
adjustments may be needed in the model.
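
The numbers in Example 4.1, together with the complementary probability of at least one misclassification mentioned above, can be reproduced with a short sketch. The function and variable names are hypothetical.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 4, 0.05                    # 4 emails, 5% false-positive rate
p_none = binomial_pmf(0, n, p)    # no email wrongly marked as spam
p_at_least_one = 1 - p_none       # complement: one or more misclassified
print(round(p_none, 4), round(p_at_least_one, 4))
```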

In most binomial experiments, we are not concerned with which specific trials
resulted in a success (S), but only with the total number of successes across
all the trials. For example, when a coin is flipped 10 times we are not interested in
whether the 1st, 3rd, or 7th toss was heads; just how many total heads we got.
This count is modeled by the binomial random variable X. It is a discrete
random variable that can take values from 0 to n. The domain of the random
variable X is the sample space S of the binomial experiment

S = {All sequences of length n consisting of S and F}.

There are $2^n$ such sequences, and each represents an outcome of the experiment.
The range of X is the set of possible values that X can take, that is, the number
of successes in n trials

Range of X = {0, 1, 2, . . . , n}.

For example, if n = 3, then the domain of X is

S = {SSS, SSF, SFS, FSS, SFF, FSF, FFS, FFF}

and range of X = {0, 1, 2, 3}. We denote a binomial random variable X with
parameters n and p as X ∼ Bin(n, p). The pmf of a binomial random variable
X is denoted by b(x; n, p).
Let us see how to compute the probability for a value of an rv. Let X be
the number of heads obtained when a fair coin is tossed three times. Then,
X = 3 corresponds to the outcome HHH. Since the coin is fair, each toss has a
probability P(H) = 1/2. Because the coin tosses are independent, the probability
of getting HHH is
\[
P(X = 3) = P(HHH) = P(H) \cdot P(H) \cdot P(H) = \left(\frac{1}{2}\right)^3 = \frac{1}{8}.
\]
25
20

Example 4.2. Consider a binomial experiment with 3 coin tosses, where ‘H’
is considered a success with probability p, and ‘T’ is a failure with probability
1 − p. The random variable X counts the number of successes (H) in the 3
tosses.

Outcome    x    Probability
HHH        3    p^3
HHT        2    p^2 (1 − p)
HTH        2    p^2 (1 − p)
HTT        1    p (1 − p)^2
THH        2    p^2 (1 − p)
THT        1    p (1 − p)^2
TTH        1    p (1 − p)^2
TTT        0    (1 − p)^3

Table 2: Outcomes, successes (x), and probabilities for 3 Bernoulli trials

For example,
$P(X = 2) = P(HHT) + P(HTH) + P(THH) = p^2(1 - p) + p^2(1 - p) + p^2(1 - p) = 3 p^2 (1 - p)$.

In general, the probability of observing exactly x successes in n trials is given
by the binomial probability mass function (pmf)
\[
P(X = x) = \binom{n}{x} p^x (1 - p)^{n-x}
\]
where $\binom{n}{x}$ is the number of ways to choose which x of the n trials are
successes, $p^x$ is the probability that those x trials are all successes, and
$(1 - p)^{n-x}$ is the probability that the remaining n − x trials are all failures.
For example,
\[
P(X = 2) = \binom{3}{2} p^2 (1 - p)^1 = 3 p^2 (1 - p)
\]
as computed in the above example.
When n = 1, the binomial distribution becomes P (X = 1) = p, P (X =
0) = 1 − p. This is exactly the pmf of a Bernoulli distribution with parameter p.
Therefore, the Bernoulli distribution is a special case of the binomial distribution
where n = 1. Let X ∼ Bernoulli(p). The expected value of X is
E[X] = 0 · (1 − p) + 1 · p = p
Thus, the expected number of successes in a single trial is p.

4.1.1 Expected Value and Variance


Each Bernoulli trial is associated with a Bernoulli variable that equals 1 if
the trial results in a success and 0 in case of a failure. The sum of these
variables is the overall number of successes. Thus, any Binomial variable X can
be represented as a sum of independent Bernoulli variables Xi ∼ Bernoulli(p)
\[ X = X_1 + X_2 + \cdots + X_n \]
Using the linearity of expectation
\[
\begin{aligned}
E[X] &= E[X_1 + X_2 + \cdots + X_n] \\
     &= E[X_1] + E[X_2] + \cdots + E[X_n] \\
     &= p + p + \cdots + p \\
     &= np
\end{aligned}
\]
Recollect from Sec.3.4 that the variance of a Bernoulli random variable is defined
as
\[ \mathrm{Var}(X_i) = E[X_i^2] - (E[X_i])^2 \]
Since Xi ∈ {0, 1}, we have Xi² = Xi. Therefore,
\[ E[X_i^2] = E[X_i] = p \]
Substituting into the variance formula gives
\[ \mathrm{Var}(X_i) = p - p^2 = p(1-p) \]
Since the Xi ’s are independent, we have the variance of Binomial variable X as
V[X] = V[X1 + X2 + · · · + Xn ]
= V[X1 ] + V[X2 ] + · · · + V[Xn ]
= n · V[X1 ]
= n · p(1 − p)
= np(1 − p)
Thus, for a binomial random variable X ∼ Bin(n, p), we have
E[X] = np and V[X] = np(1 − p)
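
As a sanity check on these two formulas, the short Python sketch below computes the mean and variance directly from the pmf, without using np or np(1 − p), and compares the results. The values n = 10 and p = 0.3 are illustrative assumptions, not taken from the notes.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def binomial_mean_var(n, p):
    """Mean and variance obtained by summing over the pmf."""
    mean = sum(x * binomial_pmf(x, n, p) for x in range(n + 1))
    second_moment = sum(x * x * binomial_pmf(x, n, p) for x in range(n + 1))
    return mean, second_moment - mean**2


n, p = 10, 0.3  # illustrative values
mean, var = binomial_mean_var(n, p)
print(mean, n * p)           # both close to 3.0
print(var, n * p * (1 - p))  # both close to 2.1
```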

4.2 Poisson Distribution
Let us now analyse scenarios like counting the number of phone calls arriving
at a call center in an hour, typos in a 1000 page book, network packet losses in
a second, customers entering a shop per minute, etc.
At first glance, we might model these using the Binomial distribution, where
each “trial” corresponds to a possible opportunity for the event (e.g., each mil-
lisecond for a call to arrive). The probability of the event in each trial is small
and the number of trials is fixed.
We may view this as a binomial experiment being performed by an unknown
agent, over a very large number of very small sub-intervals, where each sub-
interval has a very small chance of success, and we observe only the count of
successes (rare events).
For example, to count typos (an experiment) in a 1,000-page book (large
interval) split each page into 1,000 characters (small sub-intervals). Assume the
chance of a typo per character is very small. But over the whole book, you
might expect 5 typos in total (count successes). Even though the number of
trials is huge, the expected number of actual events (typos) is still small.
The binomial distribution models a finite, known number of trials. Here, however,
the events occur at unknown points in continuous space (or, in some cases, time),
so there is no natural count of trials. Hence the need for the Poisson distribution.
The Poisson distribution emerges naturally when we are interested in modeling rare
events that occur randomly over time or space. It arises as a limit of the
Binomial distribution
\[ \lim_{n \to \infty,\; p \to 0,\; np = \mu} \mathrm{Bin}(n, p) = \mathrm{Poisson}(\mu) \]
In simpler terms, the Poisson distribution answers the question “Given that
events happen independently and randomly at a constant average rate, what is
the probability of observing exactly k events in a fixed interval?” By letting
n → ∞ and p → 0 we can model many tiny opportunities for an event to occur
but the events occur rarely, at a steady average rate (or expected value as we
know it) µ = np. Given these conditions, we obtain the pmf of a Poisson random
variable X as
\[ P(X = x) = \lim_{n \to \infty} \binom{n}{x} p^{x} (1-p)^{n-x} = \frac{\mu^{x} e^{-\mu}}{x!}, \quad x = 0, 1, 2, \ldots \]
P(X = x) is also denoted as p(x; µ).
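
The limit can be observed numerically. The sketch below (an illustration, with µ = 4 and x = 3 chosen arbitrarily) holds np = µ fixed while n grows, and prints the binomial probability next to the Poisson probability.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu)."""
    return mu**x * exp(-mu) / factorial(x)


mu, x = 4.0, 3  # illustrative values
for n in (10, 100, 1000, 10000):
    p = mu / n  # keep np = mu fixed while n grows
    print(n, binomial_pmf(x, n, p), poisson_pmf(x, mu))
```

As n increases, the binomial column settles towards the Poisson value, which is the content of the limit statement.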

Example 4.3. Suppose the number of customer arrivals at a store in a 1-hour
interval follows a Poisson distribution with average rate µ = 4 customers per
hour. What is the probability that exactly 3 customers arrive in a given hour?
The set of all possible outcomes of the experiment is
\[ S = \{0, 1, 2, 3, 4, \ldots\} = \mathbb{N}_0 \]
Let X be the number of customer arrivals in the interval. Then X is a random
variable defined as
\[ X(x) = x, \quad \text{for each } x \in S \]
That is, the random variable maps each outcome to itself (an identity function).
The range is the set of values that X can take with non-zero probability. For a
Poisson distribution, all non-negative integers have positive probability, so
\[ \mathrm{Range}(X) = \{x \in \mathbb{N}_0 : P(X = x) > 0\} = \mathbb{N}_0 \]
We are given the Poisson parameter (the mean) µ = 4 and the event of interest
X = 3. Plug µ = 4 and x = 3 into
\[ P(X = x) = \frac{e^{-\mu} \mu^{x}}{x!} \]
\[ P(X = 3) = \frac{e^{-4} \cdot 4^{3}}{3!} = \frac{e^{-4} \cdot 64}{6} \]
Now, approximate numerically
\[ e^{-4} \approx 0.0183, \qquad \frac{64}{6} \approx 10.6667 \]
\[ P(X = 3) \approx 0.0183 \times 10.6667 \approx 0.195 \]
So, there is approximately a 19.5% chance that exactly 3 customers arrive during
the 1-hour period.


,F
CE
R.

4.2.1 Expected Value and Variance


a
kh

Since the pmf of a binomial random variable X, denoted b(x; n, p), tends to p(x; µ) as
n → ∞, p → 0, and np → µ, the mean and variance of a binomial variable
should approach those of a Poisson variable. Under the Poisson limit
\[ n \to \infty, \quad p \to 0, \quad \text{such that } np = \mu, \]
the probability mass function (pmf) is given by
\[ P(X = x) = \frac{\mu^{x} e^{-\mu}}{x!}, \quad x = 0, 1, 2, \ldots \]
We compute the expected value
\[ E[X] = \sum_{x=0}^{\infty} x \cdot P(X = x) = \sum_{x=0}^{\infty} x \cdot \frac{\mu^{x} e^{-\mu}}{x!} \]
Note that the term corresponding to x = 0 is 0, so we can write
\[ E[X] = \sum_{x=1}^{\infty} x \cdot \frac{\mu^{x} e^{-\mu}}{x!} = \sum_{x=1}^{\infty} \frac{x \mu^{x} e^{-\mu}}{x!} = \sum_{x=1}^{\infty} \frac{\mu^{x} e^{-\mu}}{(x-1)!} \]
Now, let k = x − 1. Then as x goes from 1 to ∞, k goes from 0 to ∞. Thus,
\[ E[X] = \mu e^{-\mu} \sum_{k=0}^{\infty} \frac{\mu^{k}}{k!} = \mu e^{-\mu} \cdot e^{\mu} = \mu \]

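For the variance, we substitute p = µ/n into the binomial variance expression
V[X] = np(1 − p)
\[ V[X] = n \cdot \frac{\mu}{n} \cdot \left(1 - \frac{\mu}{n}\right) = \mu \left(1 - \frac{\mu}{n}\right). \]
Now, take the limit as n → ∞
\[ \lim_{n \to \infty} \mu \left(1 - \frac{\mu}{n}\right) = \mu (1 - 0) = \mu. \]
Proposition. For a Poisson random variable X with parameter µ,
\[ E(X) = \mathrm{Var}(X) = \mu, \]
where µ is the expected number of events occurring in a fixed interval (of time,
space, etc.), or the Poisson parameter.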

Example 4.4. A public health researcher is studying rare dental defects in chil-
dren aged 6–10. Based on historical data, the average number of enamel defects
per child is known to be µ = 0.2. Let X be the random variable representing
the number of enamel defects in a randomly selected child. Since the events are
rare, independent, and counted over a fixed unit (per child) this justifies the use
of the Poisson distribution with parameter µ = 0.2.
In this case, the researcher examines one child at a time. For each individual
child, they count how many enamel defects occurred. So, “per child” acts as
a fixed unit of observation (just like “per hour”, “per square meter”, or “per
kilometer”).
The expected value of a Poisson random variable is equal to its parameter,
E[X] = µ = 0.2. Thus, each child has, on average, 0.2 defects (or 1 defect every
5 children).
The variance of a Poisson random variable, V[X], is also equal to µ = 0.2.
\[ P(X = 0) = \frac{e^{-0.2} \cdot 0.2^{0}}{0!} = e^{-0.2} \approx 0.8187 \]
There is an 81.87% chance that a child has no defects.
\[ P(X = 1) = \frac{e^{-0.2} \cdot 0.2^{1}}{1!} = 0.2 e^{-0.2} \approx 0.1637 \]
About 16.37% of children have exactly one defect.
\[ P(X \ge 2) = 1 - P(X = 0) - P(X = 1) = 1 - 0.8187 - 0.1637 = 0.0176 \]
Only 1.76% of children have two or more defects.
This example illustrates a classic application of the Poisson distribution in
public health. Knowing the probabilities helps in
• Allocating dental care resources,

• Planning targeted screenings,

• Communicating risk to stakeholders.


Even without knowing which specific child has a defect, the Poisson model
enables us to describe and quantify the likelihood of occurrences over a popula-
tion. It helps in resource planning for preventive care. Dentists can decide how
many children per 100 need follow-up.
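
The probabilities in this example can be reproduced with a few lines of Python. This is a sketch using the Poisson pmf with µ = 0.2; the last line, the expected number of children per 100 with two or more defects, is an illustrative extra. Computed without rounding the intermediate values, the tail probability is about 0.0175, and the 0.0176 above arises from subtracting the rounded figures.

```python
from math import exp, factorial


def poisson_pmf(x, mu):
    """P(X = x) for X ~ Poisson(mu)."""
    return mu**x * exp(-mu) / factorial(x)


mu = 0.2                # average number of enamel defects per child
p0 = poisson_pmf(0, mu)
p1 = poisson_pmf(1, mu)
p2_plus = 1 - p0 - p1   # complement rule: P(X >= 2)

print(round(p0, 4), round(p1, 4), round(p2_plus, 4))
print(100 * p2_plus)    # expected children per 100 with two or more defects
```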


5 Continuous Random Variable
Any discrete distribution is concentrated on a finite or countable number of iso-
lated values. Conversely, continuous variables can take any value of an interval,
(a, b), (a, +∞), (−∞, +∞), etc. Various times like service time, installation
time, download time, failure time, and also physical measurements like weight,
height, distance, velocity, temperature, and connection speed are examples of
continuous random variables.
Suppose some value x1 has P(x1) ≥ 1/2. Since the total probability cannot
exceed 1 (by the axioms of probability), at most one other value could also have
P(x) ≥ 1/2. More generally, at most 2 values can have P(x) ≥ 1/2, at most 4
values can have P(x) ≥ 1/4, and at most n values can have P(x) ≥ 1/n.
Consequently, only countably many values can carry positive probability.
An interval like [0, 1] is an uncountable set, so we cannot list its elements
one-by-one. If a pmf gave the same positive probability, say ε, to each point of
an uncountable set, the total probability would diverge
\[ \text{Total} = \varepsilon \times (\text{uncountable infinity}) = \infty \]
This violates the axiom of probability \sum_x P(x) = 1. Thus, a pmf cannot assign
positive probabilities to uncountably many values.
A random variable X is continuous if its possible values comprise either a
single interval on the number line (for some A < B, any number x between A
and B is a possible value) or a union of disjoint intervals, and P(X = x) = 0
for any number x that is a possible value of X.
Since P(X = x) = 0, the pmf does not carry any information about a continuous
random variable. So instead of a pmf, it has a probability density function (pdf).
Probabilities are assigned over intervals
\[ P(a \le X \le b) = \int_{a}^{b} f(x)\,dx \]
This allows continuous distributions to spread total probability 1 over an
uncountable set.
Suppose X is a continuous random variable representing the depth (in me-
ters) of a lake at a randomly chosen point on its surface. Let the maximum
possible depth of the lake be M , so that X takes values in the interval [0, M ].
To analyze this using discrete methods, we discretize the variable X by measur-
ing depth to the nearest meter. In this case, the new discrete random variable
Xdisc takes integer values from 0 to M . That is, Xdisc ∈ {0, 1, 2, . . . , M }.
We define the probability mass function (PMF) of Xdisc as
P (Xdisc = k) = proportion of the lake where the depth rounds to k meters.
This distribution can be visualized using a probability histogram in Fig.2a

(a) rounded to metre (b) rounded to cm (c) smooth curve

Figure 2: Probability histogram

The histogram above shows a possible distribution of depth, where
\[ \sum_{k=0}^{M} P(X_{\text{disc}} = k) = 1. \]
If depth is measured much more accurately and rounded to nearest centime-
tres, we get the histogram in Fig.2b. If we continue in this way to measure depth
more and more finely, the resulting sequence of histograms approaches a smooth
curve, such as is pictured in Fig.2c. The total area under the smooth curve is
1. The probability that the depth at a randomly chosen point is between a and
b is just the area under the smooth curve between a and b illustrated in Fig.3.

Figure 3: P (a ≤ x ≤ b)

A function f(x) is a valid probability density function (pdf) if it satisfies
the following two conditions

1. f(x) ≥ 0 for all x

2. \int_{-\infty}^{\infty} f(x)\,dx = 1
,F
CE

Example 5.1. Let X be the angle (in degrees) from a reference line to an
imperfection on a circular object. The pdf of X is
\[ f(x) = \begin{cases} \dfrac{1}{360}, & 0 \le x < 360 \\ 0, & \text{otherwise} \end{cases} \]
• f(x) ≥ 0 for all x

• Total area under the curve
\[ \int_{-\infty}^{\infty} f(x)\,dx = \int_{0}^{360} \frac{1}{360}\,dx = \frac{1}{360} \cdot 360 = 1 \]
So f(x) is a valid probability density function. The probability that
90° ≤ X ≤ 180° is
\[ P(90 \le X \le 180) = \int_{90}^{180} \frac{1}{360}\,dx = \frac{90}{360} = \frac{1}{4} = 0.25 \]
The probability that X is within 90° of the reference line covers both sides of
the line
\[ P(0 \le X \le 90) = \frac{90}{360} = 0.25, \qquad P(270 \le X < 360) = \frac{90}{360} = 0.25 \]
\[ P(\text{within } 90^{\circ} \text{ of reference}) = 0.25 + 0.25 = 0.50 \]

Figure: Uniform PDF f(x) against x (degrees), with shaded region for P(90° ≤ X ≤ 180°) = 0.25

In the example, whenever 0 ≤ a ≤ b ≤ 360, P(a ≤ X ≤ b) depends only
on the width b − a of the interval, so X is said to have a uniform distribution. A
continuous random variable X is said to have a uniform distribution on the
interval [A, B] if its probability density function (pdf) is given by
\[ f(x; A, B) = \begin{cases} \dfrac{1}{B - A}, & A \le x \le B \\ 0, & \text{otherwise} \end{cases} \]
In this case, we write
\[ X \sim \mathrm{Uniform}(A, B) \]
The motivation for the uniform distribution on an interval [A, B] is based
on the idea of complete uncertainty (or equal likelihood) within a known range.
Suppose you are told a delivery will arrive sometime between 2 PM and 4 PM,
with no further information. You assume the arrival time is equally likely at
2:05, 3:10, or 3:59. So you model the time X with X ∼ Uniform(2, 4).


Because single points have zero probability, it follows that for any interval
[a, b] with a < b

P (a ≤ X ≤ b) = P (a < X < b) = P (a < X ≤ b) = P (a ≤ X < b)

Suppose X ∼ Uniform(0, 10). Then the probability density function (pdf) is
\[ f(x) = \begin{cases} \dfrac{1}{10}, & 0 \le x \le 10 \\ 0, & \text{otherwise} \end{cases} \]
\[ P(3 \le X \le 7) = \int_{3}^{7} \frac{1}{10}\,dx = \frac{1}{10}(7 - 3) = 0.4 \]
\[ P(3 < X \le 7) = P(3 \le X \le 7) - P(X = 3) = 0.4 - 0 = 0.4 \]
\[ P(3 < X < 7) = 0.4, \qquad P(3 \le X < 7) = 0.4 \]
So all expressions give the same result. This implies that when working with
continuous variables, we never need to worry about whether the interval is open
or closed at the endpoints. It doesn’t affect the answer. If the variable was
discrete, the equality need not hold.
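
Since a uniform probability depends only on the width of the interval, it can be computed as a ratio of lengths. The Python sketch below (the helper name uniform_prob is hypothetical) uses exact fractions and reproduces the values from Example 5.1 and from the Uniform(0, 10) illustration. Endpoints need no special handling because single points carry zero probability.

```python
from fractions import Fraction


def uniform_prob(a, b, A, B):
    """P(a <= X <= b) for X ~ Uniform(A, B), with integer endpoints.

    The probability is the length of the overlap between [a, b] and [A, B]
    divided by the total length B - A.
    """
    lo, hi = max(a, A), min(b, B)
    return Fraction(max(hi - lo, 0), B - A)


print(uniform_prob(3, 7, 0, 10))      # 2/5
print(uniform_prob(90, 180, 0, 360))  # 1/4
# Within 90 degrees of the reference line: two pieces, one on each side.
print(uniform_prob(0, 90, 0, 360) + uniform_prob(270, 360, 0, 360))  # 1/2
```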

Example 5.2. “Time headway” in traffic flow is the elapsed time between the
moment one car finishes passing a fixed point and the instant the next car begins
to pass that point. Let X denote the time headway for two randomly chosen
consecutive cars on a freeway during a period of heavy flow with the pdf of X
\[ f(x) = \begin{cases} 0.15\,e^{-0.15(x - 0.5)}, & x \ge 0.5 \\ 0, & \text{otherwise} \end{cases} \]

The function reflects real-world traffic conditions where cars can’t be arbitrarily
close due to safety. The first condition models the physical constraint that
vehicles need at least 0.5 seconds of separation. But the exponential form of
the pdf shows that small headways are more likely (high density near x = 0.5).
The probability of large headways drops off rapidly as x → ∞. This matches
our intuition about traffic during heavy flow where cars are close together. The
graph of f(x) is given in Fig.4.

Figure 4: Density curve for the time headway

There is no density associated with headway times less than 0.5, and the
headway density decreases rapidly (exponentially fast) as x increases from 0.5.
Clearly, f(x) ≥ 0. To show that \int_{-\infty}^{\infty} f(x)\,dx = 1, we use the calculus result
\[ \int_{a}^{\infty} e^{-kx}\,dx = \frac{1}{k} e^{-ka} \]
\[
\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{0.5}^{\infty} 0.15\,e^{-0.15(x - 0.5)}\,dx \\
&= 0.15\,e^{0.075} \int_{0.5}^{\infty} e^{-0.15x}\,dx \\
&= 0.15\,e^{0.075} \cdot \frac{1}{0.15}\,e^{-0.15 \cdot 0.5} \\
&= e^{0.075} \cdot e^{-0.075} = 1
\end{aligned}
\]
This confirms that f(x) is a valid pdf, since the total area under the curve is 1.

The probability that the headway time is at most 5 seconds is
\[
\begin{aligned}
P(X \le 5) &= \int_{0.5}^{5} f(x)\,dx = \int_{0.5}^{5} 0.15\,e^{-0.15(x - 0.5)}\,dx \\
&= 0.15\,e^{0.075} \int_{0.5}^{5} e^{-0.15x}\,dx \\
&= 0.15\,e^{0.075} \cdot \left[ -\frac{1}{0.15}\,e^{-0.15x} \right]_{x=0.5}^{5} \\
&= e^{0.075} \cdot \left( -e^{-0.75} + e^{-0.075} \right) \approx 0.491
\end{aligned}
\]

Thus, the probability that the headway time is less than 5 seconds is ap-
proximately 0.491. This means that about 49.1% of the time, the headway is
less than 5 seconds. 
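
The headway probability can be cross-checked numerically. The sketch below assumes the density 0.15·e^(−0.15(x − 0.5)) for x ≥ 0.5, i.e. the coefficient equals the decay rate, which is what makes the total area 1. It compares the closed-form cdf with a simple midpoint-rule integration of the pdf; the helper names are hypothetical.

```python
from math import exp

RATE = 0.15   # assumed: coefficient and decay rate of the density
SHIFT = 0.5   # minimum headway in seconds


def headway_pdf(x):
    """Density of the time headway."""
    return RATE * exp(-RATE * (x - SHIFT)) if x >= SHIFT else 0.0


def headway_cdf(x):
    """Closed-form P(X <= x), obtained by integrating the density."""
    return 1 - exp(-RATE * (x - SHIFT)) if x >= SHIFT else 0.0


def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


print(headway_cdf(5))                          # about 0.491
print(midpoint_integral(headway_pdf, 0.5, 5))  # agrees to many decimals
```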

5.1 Probability Distribution


Let us now understand the continuous probability distribution which offers a way
of describing the probability behavior of a continuous random variable which
can take infinitely many values within a given interval.
The cumulative distribution function (cdf) F(x) for a discrete random
variable X gives, for any specified number x, the probability P(X ≤ x). It is
obtained by summing the probability mass function p(y) over all possible values
y satisfying y ≤ x
\[ F(x) = P(X \le x) = \sum_{y \le x} p(y) \]
The cdf of a continuous random variable also gives the same probabilities
P(X ≤ x), but it is obtained by integrating the probability density function
f(y) from −∞ to x as
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy \]

For each x, F (x) is the area under the density curve to the left of x.
The importance of the cumulative distribution function (cdf) here, just as
for discrete random variables, is that probabilities of various intervals can be
computed from a formula for or a table of F (x).

Proposition. Let X be a continuous random variable with probability density


function f (x) and cumulative distribution function F (x). Then, for any number
a,
P (X > a) = 1 − F (a)
and for any two numbers a and b with a < b,

P (a ≤ X ≤ b) = F (b) − F (a)

Example 5.3. The following example illustrates how uncertainty in forces can
be modeled with a probability distribution. Here the load on a bridge is modeled
as a continuous random variable.
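Suppose the probability density function (pdf) of the magnitude X of a
dynamic load on a bridge (in newtons) is given by
\[ f(x) = \begin{cases} \dfrac{1}{8} + \dfrac{3x}{8}, & 0 \le x \le 2 \\ 0, & \text{otherwise} \end{cases} \]
To find the cumulative distribution function F(x), we integrate the pdf from 0
to x
\[ F(x) = \int_{0}^{x} f(y)\,dy = \int_{0}^{x} \left( \frac{1}{8} + \frac{3y}{8} \right) dy = \frac{1}{8}x + \frac{3}{16}x^{2} \]
Thus, the full expression for F(x) is
\[ F(x) = \begin{cases} 0, & x < 0 \\ \dfrac{1}{8}x + \dfrac{3}{16}x^{2}, & 0 \le x \le 2 \\ 1, & x > 2 \end{cases} \]
The probability that the load is between 1 and 1.5 is
\[
\begin{aligned}
P(1 \le X \le 1.5) &= F(1.5) - F(1) \\
&= \left( \frac{1}{8} \cdot 1.5 + \frac{3}{16} \cdot (1.5)^{2} \right) - \left( \frac{1}{8} \cdot 1 + \frac{3}{16} \cdot (1)^{2} \right) \\
&= \left( \frac{3}{16} + \frac{27}{64} \right) - \left( \frac{1}{8} + \frac{3}{16} \right) \\
&= \frac{39}{64} - \frac{5}{16} = \frac{39}{64} - \frac{20}{64} = \frac{19}{64} \approx 0.297
\end{aligned}
\]
The probability that the load exceeds 1 is
\[
\begin{aligned}
P(X > 1) &= 1 - F(1) \\
&= 1 - \left( \frac{1}{8} \cdot 1 + \frac{3}{16} \cdot 1^{2} \right) = 1 - \left( \frac{1}{8} + \frac{3}{16} \right) \\
&= 1 - \frac{5}{16} = \frac{11}{16} \approx 0.688
\end{aligned}
\]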
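
The two probabilities can be verified with exact arithmetic. The following Python sketch encodes the cdf F(x) = x/8 + 3x²/16 derived above (the function name load_cdf is hypothetical) and applies the rules P(a ≤ X ≤ b) = F(b) − F(a) and P(X > a) = 1 − F(a).

```python
from fractions import Fraction


def load_cdf(x):
    """F(x) for the bridge-load example, evaluated with exact fractions."""
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x > 2:
        return Fraction(1)
    return x / 8 + 3 * x**2 / 16


p_between = load_cdf(Fraction(3, 2)) - load_cdf(1)  # P(1 <= X <= 1.5)
p_above = 1 - load_cdf(1)                           # P(X > 1)

print(p_between, float(p_between))  # 19/64 0.296875
print(p_above, float(p_above))      # 11/16 0.6875
```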


Relationship Between PDF and CDF


Proposition. If X is a continuous random variable with probability density
function f(x) and cumulative distribution function F(x), then at every point x
at which the derivative F′(x) exists,

\[ F'(x) = f(x) \]

This means that for a continuous random variable, the CDF is the integral
of the PDF, and the PDF is the derivative of the CDF, if it exists.

PDF and CDF Derivative for a Uniform Distribution When X has a
uniform distribution on the interval [A, B], denoted X ∼ U(A, B), F(x) = 0 for
x < A and F(x) = 1 for x > B. The cdf is constant on both of these regions, so
its slope is 0 there. Thus
\[ F'(x) = 0 = f(x) \quad \text{for } x < A \text{ or } x > B \]
For A < x < B, we have
\[ F(x) = \frac{x - A}{B - A} \;\Rightarrow\; F'(x) = \frac{1}{B - A} = f(x) \]
Thus the pdf and cdf are
\[ f(x) = \begin{cases} \dfrac{1}{B - A}, & \text{if } A < x < B \\ 0, & \text{otherwise} \end{cases} \qquad F(x) = \begin{cases} 0, & x < A \\ \dfrac{x - A}{B - A}, & A \le x \le B \\ 1, & x > B \end{cases} \]
Visual Representation: [Figure: graph of F(x) against x, equal to 0 up to x = A, rising linearly to 1 at x = B, and equal to 1 afterwards]

CDF has sharp corners at x = A and x = B
⇒ derivative undefined there

5.2 Expected Value


For a discrete random variable X, E[X] was obtained by summing x · p(x) over
possible X values. Here we replace summation by integration and the pmf by
the pdf to get a continuous weighted average.
The expected or mean value of a continuous random variable X with pdf
f(x) is
\[ \mu_X = E[X] = \int_{-\infty}^{\infty} x \cdot f(x)\,dx \]

Figure 5: Expectation of a continuous variable as a center of gravity
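
The defining integral can be approximated numerically. As a minimal sketch, the Python code below applies a midpoint rule (the helper midpoint_integral is written here for illustration) to the Uniform(0, 10) density used earlier. It first checks that the density integrates to 1 and then evaluates the integral of x·f(x), which should land on the midpoint of the interval, 5.

```python
def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def uniform_pdf(x):
    """Density of Uniform(0, 10)."""
    return 0.1 if 0 <= x <= 10 else 0.0


total = midpoint_integral(uniform_pdf, 0, 10)                  # should be 1
mean = midpoint_integral(lambda x: x * uniform_pdf(x), 0, 10)  # E[X]

print(total, mean)
```

The same two calls work for any density with bounded support, so they also serve as a quick validity check on a proposed pdf.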

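Example 5.4. * The pdf of X is given as
\[ f(x) = \begin{cases} \dfrac{3}{2}(1 - x^{2}), & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \]
To find the expected value E[X], we compute
\[
\begin{aligned}
E[X] &= \int_{0}^{1} x \cdot f(x)\,dx = \int_{0}^{1} x \cdot \frac{3}{2}(1 - x^{2})\,dx = \frac{3}{2} \int_{0}^{1} (x - x^{3})\,dx \\
&= \frac{3}{2} \left[ \frac{x^{2}}{2} - \frac{x^{4}}{4} \right]_{0}^{1} = \frac{3}{2} \left( \frac{1}{2} - \frac{1}{4} \right) = \frac{3}{2} \cdot \frac{1}{4} = \frac{3}{8}
\end{aligned}
\]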


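Example 5.5. Let the probability density function f(x) be defined as follows
\[ f(x) = \begin{cases} 4x, & 0 \le x < 0.5 \\ 4(1 - x), & 0.5 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \]
The two triangular pieces each have area 1/2, so the total area is 1 and f(x) is
a valid pdf. We want to compute the expected value of X, denoted E[X]. This
requires integrating over the support of f(x)
\[ E[X] = \int_{-\infty}^{\infty} x f(x)\,dx = \int_{0}^{0.5} x \cdot 4x\,dx + \int_{0.5}^{1} x \cdot 4(1 - x)\,dx \]
Compute the first integral
\[ \int_{0}^{0.5} 4x^{2}\,dx = 4 \int_{0}^{0.5} x^{2}\,dx = 4 \left[ \frac{x^{3}}{3} \right]_{0}^{0.5} = 4 \cdot \frac{(0.5)^{3}}{3} = 4 \cdot \frac{1}{24} = \frac{1}{6} \]
Compute the second integral
\[
\begin{aligned}
\int_{0.5}^{1} 4x(1 - x)\,dx &= 4 \int_{0.5}^{1} (x - x^{2})\,dx = 4 \left[ \frac{x^{2}}{2} - \frac{x^{3}}{3} \right]_{0.5}^{1} \\
&= 4 \left[ \left( \frac{1}{2} - \frac{1}{3} \right) - \left( \frac{(0.5)^{2}}{2} - \frac{(0.5)^{3}}{3} \right) \right] = 4 \left[ \frac{1}{6} - \left( \frac{1}{8} - \frac{1}{24} \right) \right] \\
&= 4 \left[ \frac{1}{6} - \frac{1}{12} \right] = 4 \cdot \frac{1}{12} = \frac{1}{3}
\end{aligned}
\]
Therefore, the expected value is
\[ E[X] = \frac{1}{6} + \frac{1}{3} = \frac{1}{2} \]
as expected from the symmetry of f(x) about x = 0.5.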

* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

Expected value of a function As in the discrete case, the expected value
of a function h(X) of a random variable X with pdf f(x) can be computed as
\[ E[h(X)] = \mu_{h(X)} = \int_{-\infty}^{\infty} h(x) \cdot f(x)\,dx \]

E[h(X)] is a weighted average of h(X) values.

Example 5.6. Two territorial animals, such as deer or hyenas, are randomly
dividing a stretch of riverbank to establish their respective feeding zones.
Suppose a boundary point is chosen at random (uniformly) along the riverbank to
determine the division between the two territories. Let X denote the proportion
of the riverbank controlled by Animal A. We assume that X is uniformly
distributed on the interval [0, 1]. Thus, the probability density function (pdf)
of X is
\[ f(x) = \begin{cases} 1, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \]
Let h(X) = max(X, 1 − X) denote the proportion of the riverbank controlled
by the dominant animal, i.e., the one that receives the larger share.

1. What does the function h(X) represent in this scenario?

2. Compute the expected value E[h(X)], the expected share of the dominant
animal.

3. What is the expected share of the less dominant animal?

4. What does this model suggest about competition based purely on random
allocation?

Two territorial animals randomly divide a stretch of riverbank of unit length
where the riverbank is modeled as a line. A boundary point is chosen uniformly
at random along the riverbank (or the line) to split the territory. Let X be the
proportion of the riverbank controlled by Animal A.
The sample space S represents all possible outcomes of the random experi-
ment S = [0, 1]. Each x ∈ S corresponds to a point on the riverbank indicating
the location of the boundary between the two animals. Since the point is se-
lected uniformly at random along the riverbank, all values in this interval are
equally likely.
We define a random variable X that maps each outcome x ∈ S to the
proportion of the riverbank controlled by Animal A. Since x is the position of
the boundary point from the left end of the riverbank X(x) = x. Thus, X is
the identity function on [0, 1]. It directly gives the share of territory controlled
by Animal A.
Let h(X) = max(X, 1 − X) represent the length of riverbank controlled by
the dominant animal i.e., the one with the larger share. This transformation
abstracts away from the identity of the animals and focuses instead on the out-
come of their competition. It measures how much of the territory is controlled
by the winning animal, regardless of whether that is Animal A or B. In doing
so, it allows us to quantify the level of asymmetry introduced by a random yet
fair division.

This function transforms the share of Animal A to the share of the animal
who holds more territory. Because X + (1 − X) = 1, the maximum of the two
values must lie in the interval Range(h(X)) = [1/2, 1]. Specifically, if X = 0.5,
both animals control the same amount: h(X) = 0.5. As X → 0 or X → 1, one
animal approaches control of the entire riverbank, h(X) → 1.
In simpler terms, the value of h(X) is the maximum of X and 1 − X. Since
X is the proportion of riverbank held by Animal A, 1 − X is the share held by
Animal B. Thus, h(X) gives the larger of the two shares – the amount controlled
by the animal with more territory. This reflects which animal is dominant in
terms of space.
To compute the expected value of h(X), we break the integration into two
intervals
• When 0 ≤ x < 1/2, we have h(x) = 1 − x
• When 1/2 ≤ x ≤ 1, we have h(x) = x

So
\[ E[h(X)] = \int_{0}^{1/2} (1 - x) \cdot 1\,dx + \int_{1/2}^{1} x \cdot 1\,dx \]
Compute each integral
\[ \int_{0}^{1/2} (1 - x)\,dx = \left[ x - \frac{x^{2}}{2} \right]_{0}^{1/2} = \frac{1}{2} - \frac{1}{8} = \frac{3}{8} \]
\[ \int_{1/2}^{1} x\,dx = \left[ \frac{x^{2}}{2} \right]_{1/2}^{1} = \frac{1}{2} - \frac{1}{8} = \frac{3}{8} \]
Therefore,
\[ E[h(X)] = \frac{3}{8} + \frac{3}{8} = \frac{6}{8} = \frac{3}{4} \]
Since the total riverbank is of unit length, the less dominant animal gets
\[ 1 - E[h(X)] = 1 - \frac{3}{4} = \frac{1}{4} \]
This result shows that, although the division point is chosen at random
(giving each animal an equal chance of dominance), the expected advantage for
the dominant animal is significant. On average, the dominant animal controls
75% of the territory, while the less dominant one controls only 25%.
This model with transformation function h(X) reveals a key insight that
although the dividing point is chosen uniformly at random, which treats both
animals equally, the resulting division tends to be unequal. The expected value
E[h(X)] = 3/4 indicates that, on average, the dominant animal controls 75%
of the resource. This demonstrates that fairness in the mechanism (uniform
randomness) does not guarantee equality in the outcome. Such insights are
important in fields like ecology, economics, and game theory, where resources or
advantages are allocated through random or stochastic processes. 
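
The value E[h(X)] = 3/4 can be confirmed numerically. The sketch below integrates h(x) = max(x, 1 − x) against the Uniform(0, 1) density (which equals 1 on [0, 1]) with a midpoint rule; the helper names are chosen here for illustration.

```python
def midpoint_integral(f, a, b, steps=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))


def dominant_share(x):
    """h(x) = max(x, 1 - x): share held by whichever animal has more."""
    return max(x, 1 - x)


# The Uniform(0, 1) pdf is 1 on [0, 1], so E[h(X)] is the integral of h.
expected_dominant = midpoint_integral(dominant_share, 0.0, 1.0)
expected_other = 1 - expected_dominant

print(expected_dominant, expected_other)  # close to 0.75 and 0.25
```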

5.3 Variance
The variance and standard deviation give quantitative measures of how much
spread there is in the distribution or population of x values. Again σ is roughly
the size of a typical deviation from µ.
The variance of a continuous random variable X with probability density
function f(x) and mean value µ is given by
\[ \sigma_X^{2} = V[X] = \int_{-\infty}^{\infty} (x - \mu)^{2} f(x)\,dx = E[(X - \mu)^{2}] \]
The standard deviation (SD) of X is
\[ \sigma_X = \sqrt{V[X]} \]

Example 5.7. A warehouse experiences random weekly demand for a product
that can vary anywhere between 50 and 150 units, with equal likelihood. Let
X denote the weekly demand, modeled as a continuous random variable with
a Uniform distribution on [50, 150]. That is, the probability density function of
X is
\[ f(x) = \begin{cases} \dfrac{1}{150 - 50} = \dfrac{1}{100}, & 50 \le x \le 150 \\ 0, & \text{otherwise} \end{cases} \]
The mean µ = E[X] of a uniform distribution on [a, b] is
\[ \mu = \frac{a + b}{2} = \frac{50 + 150}{2} = 100 \]
The variance of a uniform distribution on [a, b] is
\[ V[X] = \frac{(b - a)^{2}}{12} = \frac{(150 - 50)^{2}}{12} = \frac{100^{2}}{12} = \frac{10000}{12} = \frac{2500}{3} \approx 833.33 \]
The standard deviation is the square root of the variance
\[ \sigma_X = \sqrt{V[X]} = \sqrt{\frac{2500}{3}} \approx 28.87 \]
This tells us that the average weekly demand is 100 units, with roughly 58% of
weekly demand values lying within approximately ±29 units of the mean (an
interval of width 2σ ≈ 57.7 out of the 100-unit range). The wide
standard deviation reflects high uncertainty in the weekly demand. This model
is useful for inventory planning under uncertainty.
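
These figures can be reproduced from the moments of the uniform distribution. The Python sketch below (the function name uniform_moments is hypothetical) computes E[X²] in closed form and obtains the variance as E[X²] − (E[X])², which should agree with (b − a)²/12.

```python
from math import sqrt


def uniform_moments(a, b):
    """Mean, variance and standard deviation of Uniform(a, b)."""
    mean = (a + b) / 2
    # E[X^2] is the integral of x^2 / (b - a) over [a, b].
    second_moment = (b**3 - a**3) / (3 * (b - a))
    variance = second_moment - mean**2
    return mean, variance, sqrt(variance)


mean, var, sd = uniform_moments(50, 150)
print(mean, var, sd)      # 100.0, about 833.33, about 28.87
print(2 * sd / 100)       # share of demand within one SD of the mean, about 0.577
```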


As in the discrete case, computation of variance is facilitated by using the
same short-cut formula.
\[ V[X] = E[X^{2}] - [E[X]]^{2} \]

Let h(X) = aX + b, where a and b are constants, and let µ = E[X] and
σ² = Var(X) be the mean and variance of a continuous random variable X
with pdf f(x). Then the expected value and variance of h(X) satisfy the same
properties as in the discrete case
\[ E[h(X)] = E[aX + b] = a\mu + b \]
\[ \mathrm{Var}(h(X)) = \mathrm{Var}(aX + b) = a^{2} \sigma^{2} \]
These results show that shifting a random variable by b affects the mean but
not the spread, while scaling it by a affects both the mean and the variance.

5.4 Normal Distribution


In many real-world settings, outcomes vary unpredictably from one observation
to another. For example, individuals differ in their heights, manufacturing pro-
cesses introduce slight deviations in product dimensions, and repeated measure-
ments of the same quantity may yield slightly different results. These variations
are often due to the cumulative effect of many small, independent, and largely
unobservable factors like genetic, environmental, instrumental, or behavioral.
These small factors can push the outcome slightly up or down, like tiny nudges.
Importantly, most of the time, these small effects cancel out leading to values
near the average (mean). Occasionally, more of them align in one direction
giving a more extreme result but that is less likely. This balance, due to many
small effects mostly canceling and rarely reinforcing, produces the bell shape with
high probability near the mean (centre), symmetry because of equally likely
up/down pushes, and rapid fall-off because the chance of large combined effects
is small. To understand, summarize, and predict such phenomena, we seek a
mathematical model that captures the overall pattern of variability.
Empirical evidence from diverse domains shows that when individual
outcomes are the result of a large number of small additive effects, the resulting
distribution tends to be symmetric, unimodal, and bell-shaped. This leads to
the concept of the normal distribution, also known as the Gaussian distribution.
A continuous random variable X is said to have a normal distribution with
parameters µ and σ (or µ and σ²), where −∞ < µ < ∞ and σ > 0, if the
probability density function (pdf) of X is given by
\[ f(x; \mu, \sigma^{2}) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{(x - \mu)^{2}}{2\sigma^{2}}}, \quad -\infty < x < +\infty \]
where µ is the mean (the center of the distribution) and σ 2 is the variance
(a measure of spread). The curve is symmetric about µ with the majority of
probability mass concentrated within a few standard deviations from the mean.
Here, e denotes the base of the natural logarithm system and is approxi-
mately equal to 2.71828, and π represents the familiar mathematical constant
with approximate value 3.14159. The statement that X is normally distributed
with parameters µ and σ 2 is often abbreviated as X ∼ N (µ, σ 2 ).
Let us dig deep into the meaning of the above pdf. The negative exponent
in the function ensures a decay away from the mean. The term
exp(−(x − µ)²/(2σ²)) decreases as |x − µ| increases. This makes the Gaussian
curve highest at the mean (µ) and ensures smooth decay as we move away from
the mean.
If, on the other hand, the exponent were positive, the term
\[ \exp\left( \frac{(x - \mu)^{2}}{2\sigma^{2}} \right) \]
would grow to infinity as |x − µ| increases. The integral of the PDF would then
diverge, meaning total probability could not equal 1. Hence, it would not be a
valid probability distribution. Thus, the negative exponent is essential for the
bell-shaped behavior of the Normal distribution.
The standard deviation σ in the denominator controls the spread (or width)
of the bell curve. If σ is small, the denominator is small, which means the term
(x − µ)²/(2σ²) is large. The exponential decays faster, so the curve becomes
narrow. If, however, σ is large, the denominator is big and the term
(x − µ)²/(2σ²) becomes smaller. The exponential decays slowly, so the curve
becomes wider.
The term (x − µ)² has units of (value)². Dividing by σ² (the variance)
makes the exponent
\[ \frac{(x - \mu)^{2}}{2\sigma^{2}} \]
dimensionless, which is necessary since the exponential function e^z only makes
sense when z is unitless.
Figure 6 plots the pdf for normal distribution with different pairs of mean
and variance values. Each density curve is symmetric about µ and is bell-shaped.
Changing the σ value stretches or compresses the curve horizontally
(Fig.6a) while changing the µ value shifts the density curve to one side or the
other (Fig.6b) without changing the basic shape.
Between µ − σ and µ + σ, the curve is concave downward (hill-shaped ∩),
representing the region with the highest density of values. Outside this interval,
the curve is concave upward (tail region ∪), indicating that the slope of the curve
starts increasing (but the values remain very small). The points x = µ − σ and
x = µ + σ are the inflection points, where the curve changes concavity from
concave down (near the peak) to concave up (in the tails) or vice versa. The
interval [µ − σ, µ + σ] contains approximately 68% of the probability mass. The
inflection points help identify the boundary between the central concentration of
data and the tails. The plot in Fig.6c illustrates a normal distribution with mean
µ = 0 and SD σ = 2 along with its inflection points x = µ − σ = −2 and
x = µ + σ = 2.
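
The concavity change at µ ± σ and the 68% figure can both be checked numerically. The Python sketch below uses µ = 0 and σ = 2 as in Fig.6c. It estimates the second derivative of the pdf with a central finite difference, and it evaluates the normal cdf through the error function math.erf, a standard identity that is not derived in these notes.

```python
from math import erf, exp, pi, sqrt


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / sqrt(2 * pi * sigma**2)


def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def second_derivative(f, x, h=1e-3):
    """Central finite-difference estimate of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2


mu, sigma = 0.0, 2.0


def f(x):
    return normal_pdf(x, mu, sigma)


print(second_derivative(f, 1.0) < 0)  # True: concave down inside (mu - sigma, mu + sigma)
print(second_derivative(f, 3.0) > 0)  # True: concave up in the tail
mass = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
print(mass)                           # about 0.6827
```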

(a) same mean, different variances (b) same variance, different means (c) inflection points

Figure 6: Normal Distribution Curves


,U
oT
,F
CE

Standard Normal Distribution Suppose X is a normally distributed continuous
random variable with mean µ and standard deviation σ, i.e.,
\[ X \sim N(\mu, \sigma^2) \]
To compute the probability that X lies in the interval [a, b], we evaluate the
integral
\[ P(a \le X \le b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx \tag{1} \]
which represents the area under the normal curve between a and b. The function
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$ does not have an antiderivative that can be expressed
in terms of elementary functions. That is, none of the standard integration
techniques (such as substitution, integration by parts, etc.) can be used to
obtain a closed-form expression for this integral.
Suppose X ∼ N (µ, σ 2 ), meaning that most of the values of the random vari-
able X lie around the mean µ, and they are spread out with standard deviation
σ. Now consider the transformation
\[ Z = \frac{X - \mu}{\sigma} \]
This transformation performs two operations: it shifts the distribution left or
right so that the mean becomes 0 and it scales the distribution so that the stan-
dard deviation becomes 1. In other words, we are rescaling the original data so
that it is measured in units of standard deviation. This standardization process
converts the original normal distribution into a standard normal distribution,
i.e.,
\[ Z \sim N(0, 1) \]
so the transformed distribution has mean 0 and standard deviation 1. Then,
\[ P(a \le X \le b) = P\left( \frac{a-\mu}{\sigma} \le Z \le \frac{b-\mu}{\sigma} \right) \]

This reformulates the original problem in terms of the standard normal dis-
tribution with Z ∼ N (0, 1). The probability density function (pdf) of the
standard normal random variable Z, which has mean 0 and variance 1, is given
by
\[ f(z; 0, 1) = \frac{1}{\sqrt{2\pi}} \, e^{-z^2/2}, \quad -\infty < z < \infty \]

The graph of f (z; 0, 1) is called the standard normal curve (or z-curve).
Its inflection points occur at z = −1 and z = 1, where the curve transitions
from concave down to concave up.
The cumulative distribution function (cdf) of Z, denoted Φ(z), is given by
\[ \Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y; 0, 1) \, dy = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \, e^{-y^2/2} \, dy \]

The cumulative distribution function (cdf) of the standard normal, which gives
the area under the standard normal curve from −∞ to z, also has no closed-form
expression. Instead, we use numerical techniques that approximate the area under
the curve, such as the trapezoidal rule (which approximates the area as a series
of trapezoids) or Simpson's rule (which approximates the curve over intervals
using parabolas). These values are compiled into the standard normal tables
(also called Z-tables) so they can be looked up quickly without redoing the
integration every time. Today one can use Python, R, or C++ libraries to compute
Φ(z) values very accurately; these typically rely on carefully tuned numerical
approximations.
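The numerical approach described above can be sketched in a few lines. This is a minimal illustration, not how any particular library computes Φ(z): the function names, the lower cutoff of −8 standing in for −∞, and the number of trapezoids are all assumptions of this sketch. It compares a trapezoidal-rule approximation with the identity Φ(z) = ½(1 + erf(z/√2)), using `math.erf` from the Python standard library.

```python
import math

def std_normal_pdf(z: float) -> float:
    """Density of the standard normal distribution at z."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def phi_trapezoid(z: float, lower: float = -8.0, n: int = 20_000) -> float:
    """Approximate Phi(z) with the trapezoidal rule on [lower, z].

    The cutoff `lower` replaces -infinity (an assumption of this sketch);
    the area to the left of -8 is negligible.
    """
    if z <= lower:
        return 0.0
    h = (z - lower) / n
    total = 0.5 * (std_normal_pdf(lower) + std_normal_pdf(z))
    for i in range(1, n):
        total += std_normal_pdf(lower + i * h)
    return total * h

def phi_erf(z: float) -> float:
    """Phi(z) written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for z in (-1.0, 0.0, 1.5):
    print(z, round(phi_trapezoid(z), 6), round(phi_erf(z), 6))
```

The two columns should agree to several decimal places, which is the sense in which a Z-table entry is "just" a numerical integral.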


But in real-world data, the mean is rarely 0 and the standard deviation is
rarely 1. That means no naturally measured data (heights, weights, IQ scores,
etc.) exactly follows a standard normal distribution. So the standard normal
isn’t the real-world model. It is the universal yardstick we use to compute
probabilities for any normal distribution.

Example 5.8. Suppose exam scores in a university course are normally dis-
tributed with mean µ = 70 and standard deviation σ = 10. We want to compute
the probability that a student scores less than 85, i.e., P(X < 85).
We first standardize the variable
\[ Z = \frac{X - \mu}{\sigma} = \frac{85 - 70}{10} = 1.5 \]
Then
\[ P(X < 85) = P(Z < 1.5) \approx 0.9332 \]
where the value of P(Z < 1.5) is looked up in the Z-table. So, approximately
93.32% of students scored less than 85. Even though the scores follow a normal
distribution with µ = 70, σ = 10, we used the standard normal distribution to
compute the probability.
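The same calculation can be checked in code. A minimal sketch, assuming Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the variable names are illustrative only.

```python
from statistics import NormalDist

# Exam scores: mean 70, standard deviation 10 (values from Example 5.8).
scores = NormalDist(mu=70, sigma=10)

# Standardize the cutoff 85.
z = (85 - scores.mean) / scores.stdev

# The probability computed directly and via the standard normal should match.
p_direct = scores.cdf(85)
p_standard = NormalDist().cdf(z)   # NormalDist() defaults to N(0, 1)

print(z, round(p_direct, 4), round(p_standard, 4))
```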


5.5 Moments of a Random Variable
Moments are numerical summaries that describe the shape of a probability
distribution.

• 1st moment: $E[X]$ → mean (center).

• 2nd moment: $E[X^2]$, with variance $\mathrm{Var}(X) = E[X^2] - (E[X])^2$, describes the spread.

• 3rd moment: $E[X^3]$, related to skewness (asymmetry of the distribution).

• 4th moment: $E[X^4]$, related to kurtosis (tailedness/peakedness of the distribution).

• 5th and higher moments: provide finer details of the distribution's shape.

Odd moments (3rd, 5th, ...) capture asymmetry, while even moments
(2nd, 4th, ...) capture spread/peakedness.

5.5.1 Moment Generating Function (MGF)

The moment generating function (MGF) of a random variable is a mathematical
tool that encodes all the moments (mean, variance, skewness, etc.) of a prob-
ability distribution in a single function. Formally, the MGF of a random
variable X is defined (for values of t where the expectation exists) as
\[ M_X(t) = E[e^{tX}] = \begin{cases} \displaystyle\int_{-\infty}^{\infty} e^{tx} f(x) \, dx, & \text{if } X \text{ is continuous}, \\[2ex] \displaystyle\sum_{x} e^{tx} P(X = x), & \text{if } X \text{ is discrete}. \end{cases} \]

As we will see later, because of the Taylor expansion, derivatives of MX (t)
at t = 0 isolate moments. This property makes etX uniquely useful. Other
functions (like polynomials or sine/cosine) won’t neatly generate all moments
in this way.
The function MX (t) is only defined for those values of t ∈ R such that
the expectation is finite. That is, MX (t) < ∞. An expectation is essentially a
weighted average. If the integral (or sum) diverges to infinity, then we do not
have a valid “average value.”
If on the other hand the MGF is finite (or converges) in an open interval
around 0, then the Taylor expansion about t = 0 is possible. We can thus
expand etX using a Taylor series
\[ e^{tX} = 1 + tX + \frac{t^2 X^2}{2!} + \frac{t^3 X^3}{3!} + \cdots \]
Taking expectation
\[ M_X(t) = E[e^{tX}] = 1 + E[X]\,t + \frac{E[X^2]}{2!}\,t^2 + \frac{E[X^3]}{3!}\,t^3 + \cdots \]
Thus, the coefficient of $t^n/n!$ gives the n-th moment $E[X^n]$. If we take the
first derivative of the moment generating function $M_X(t)$ with respect to t and
then evaluate it at t = 0, we obtain the mean; higher derivatives give the
higher moments:
\[ M_X'(0) = E[X] \quad \text{(mean)}, \]
\[ M_X''(0) = E[X^2] \quad \text{(you can compute variance)}, \]
\[ M_X^{(3)}(0) = E[X^3] \quad \text{(helps explain skewness)}, \]
\[ \vdots \]
All higher-order moments (3rd, 4th, 5th, ...) together describe the entire shape
of a distribution. Practically, most work in probability/statistics stops at 4th
order (variance, skewness, kurtosis). Higher ones are sometimes used in ad-
vanced areas like financial risk modeling and signal processing. In short, the
MGF encodes all moments.
In general,
\[ M_X^{(n)}(0) = E[X^n]. \]

MGF of a Sum of Independent Random Variables Let X and Y be
independent random variables with moment generating functions
\[ M_X(t) = E[e^{tX}], \qquad M_Y(t) = E[e^{tY}]. \]
Consider their sum
\[ Z = X + Y. \]
The moment generating function of Z is
\[ M_Z(t) = E[e^{tZ}] = E[e^{t(X+Y)}] = E[e^{tX} \cdot e^{tY}]. \]
Since X and Y are independent, the expectation of a product is the product of
expectations:
\[ E[e^{tX} \cdot e^{tY}] = E[e^{tX}] \cdot E[e^{tY}], \]
\[ M_Z(t) = M_X(t) \cdot M_Y(t). \]
In general, if X1, X2, ..., Xn are independent, then
\[ M_{X_1 + X_2 + \cdots + X_n}(t) = \prod_{i=1}^{n} M_{X_i}(t). \]
This property is fundamental because it turns the distribution of a sum of
independent random variables into a simple product of MGFs. For example,
the sum of independent normal random variables is again normal, the sum of
independent Poisson random variables is again Poisson, etc.

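The product rule for MGFs of independent sums can be checked numerically on a small discrete case. The following is a sketch under stated assumptions: the two Bernoulli pmfs (with parameters 0.3 and 0.6) and the helper names `mgf_from_pmf` and `pmf_of_sum` are chosen here purely for illustration.

```python
import math

def mgf_from_pmf(pmf: dict, t: float) -> float:
    """M(t) = sum over x of e^{t x} P(X = x) for a discrete pmf."""
    return sum(math.exp(t * x) * p for x, p in pmf.items())

def pmf_of_sum(pmf_x: dict, pmf_y: dict) -> dict:
    """pmf of Z = X + Y when X and Y are independent (discrete convolution)."""
    out = {}
    for x, px in pmf_x.items():
        for y, py in pmf_y.items():
            out[x + y] = out.get(x + y, 0.0) + px * py
    return out

# Two independent Bernoulli variables (illustrative parameters).
pmf_x = {0: 0.7, 1: 0.3}
pmf_y = {0: 0.4, 1: 0.6}
pmf_z = pmf_of_sum(pmf_x, pmf_y)

for t in (-1.0, 0.5, 2.0):
    lhs = mgf_from_pmf(pmf_z, t)                          # M_Z(t)
    rhs = mgf_from_pmf(pmf_x, t) * mgf_from_pmf(pmf_y, t)  # M_X(t) * M_Y(t)
    print(t, lhs, rhs)
```

Both columns should agree up to floating-point rounding for every t tried.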
Uniqueness of the MGF Another important property of the moment gener-
ating function (MGF) is that it uniquely determines the distribution of a random
variable (provided the MGF exists in a neighborhood of t = 0). Formally,
\[ M_X(t) = M_Y(t) \ \text{for all } t \text{ in some interval around } 0 \implies X \overset{d}{=} Y, \]
where $X \overset{d}{=} Y$ means that X and Y have the same distribution. The MGF is
a unique fingerprint of a distribution (when it exists). If two random variables
share the same MGF, they must have the same distribution.
The idea is that the MGF encodes all the moments of the random variable
\[ M_X^{(n)}(0) = E[X^n]. \]

Since all moments are contained in it, and these moments uniquely describe the
distribution (for most well-behaved distributions), the MGF fully “captures”
the law of X. This is why MGFs are powerful: they do not just summarize –
they characterize. If you compute an MGF and recognize it as matching the
known MGF of a distribution, you can immediately identify the distribution of
the random variable. Examples:

\[ M_X(t) = \exp\big( \lambda (e^t - 1) \big) \implies X \sim \text{Poisson}(\lambda). \]
\[ M_X(t) = \frac{\lambda}{\lambda - t}, \ t < \lambda \implies X \sim \text{Exponential}(\lambda). \]

Example: Bernoulli Distribution Let X ∼ Bernoulli(p). Then
\[ P(X = 1) = p, \qquad P(X = 0) = 1 - p. \]
The moment generating function (MGF) is defined as
\[ M_X(t) = E[e^{tX}] = \sum_{x} e^{tx} P(X = x) \]
for a discrete random variable. Since X takes only the values 0 and 1, in the
Bernoulli case
\[ M_X(t) = e^{t \cdot 0} P(X = 0) + e^{t \cdot 1} P(X = 1) = e^{0}(1 - p) + e^{t} \cdot p = (1 - p) + p e^{t}. \]
The n-th moment is obtained by differentiating $M_X(t)$ n times and evaluat-
ing at t = 0.

• First derivative: $M_X'(t) = p e^{t}$, so $M_X'(0) = p$. Hence, $E[X] = p$.

• Second derivative: $M_X''(t) = p e^{t}$, so $M_X''(0) = p$. Thus, $E[X^2] = p$.

Thus the variance is
\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p). \]
The MGF of a Bernoulli random variable is
\[ M_X(t) = (1 - p) + p e^{t}, \]
and it allows us to directly compute the mean E[X] = p and variance Var(X) =
p(1 − p).
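The statement that derivatives of the MGF at t = 0 produce moments can be illustrated numerically. A minimal sketch, assuming p = 0.3 and using central finite differences in place of symbolic differentiation; the helper names and step sizes are choices made for this illustration.

```python
import math

def bernoulli_mgf(t: float, p: float) -> float:
    """M_X(t) = (1 - p) + p e^t for X ~ Bernoulli(p)."""
    return (1.0 - p) + p * math.exp(t)

def first_derivative_at_zero(mgf, h: float = 1e-5) -> float:
    """Central-difference approximation of M'(0)."""
    return (mgf(h) - mgf(-h)) / (2.0 * h)

def second_derivative_at_zero(mgf, h: float = 1e-4) -> float:
    """Central-difference approximation of M''(0)."""
    return (mgf(h) - 2.0 * mgf(0.0) + mgf(-h)) / (h * h)

p = 0.3  # illustrative value
mgf = lambda t: bernoulli_mgf(t, p)

mean = first_derivative_at_zero(mgf)            # approximates E[X] = p
second_moment = second_derivative_at_zero(mgf)  # approximates E[X^2] = p
variance = second_moment - mean ** 2            # approximates p(1 - p)

print(mean, second_moment, variance)
```

The printed values should be close to 0.3, 0.3 and 0.21, matching the closed-form results derived above.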

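Example: Normal Distribution Let X ∼ N(µ, σ²) with probability den-
sity function
\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right), \quad x \in \mathbb{R}. \]
The moment generating function is
\[ M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x) \, dx = \int_{-\infty}^{\infty} e^{tx} \cdot \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) dx. \]
Combine the exponent terms
\[ tx - \frac{(x - \mu)^2}{2\sigma^2}. \]
Expand the quadratic term
\[ tx - \frac{1}{2\sigma^2}\left( x^2 - 2\mu x + \mu^2 \right). \]
This becomes
\[ -\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2}\,x - \frac{\mu^2}{2\sigma^2} + tx. \]
Group the terms involving x
\[ -\frac{x^2}{2\sigma^2} + \left( \frac{\mu}{\sigma^2} + t \right) x - \frac{\mu^2}{2\sigma^2}. \]
The quadratic in x can be written as
\[ -\frac{1}{2\sigma^2}\left[ x^2 - 2(\mu + \sigma^2 t)\,x + \mu^2 \right]. \]
Completing the square
\[ -\frac{1}{2\sigma^2}\left[ \big( x - (\mu + \sigma^2 t) \big)^2 - (\mu + \sigma^2 t)^2 + \mu^2 \right]. \]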
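Thus the exponent simplifies to
\[ -\frac{\big( x - (\mu + \sigma^2 t) \big)^2}{2\sigma^2} + \mu t + \frac{1}{2}\sigma^2 t^2. \]
So the MGF becomes
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right) \cdot \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\big( x - (\mu + \sigma^2 t) \big)^2}{2\sigma^2} \right) dx. \]
The integral is just 1 (since it is the integral of a normal density with mean
µ + σ²t and variance σ²).
Hence the MGF is
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right), \quad t \in \mathbb{R}. \]
Extracting the moments

• First derivative:
\[ M_X'(t) = (\mu + \sigma^2 t) \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]
Thus, $E[X] = M_X'(0) = \mu$.

• Second derivative:
\[ M_X''(t) = \left[ \sigma^2 + (\mu + \sigma^2 t)^2 \right] \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]
Hence, $E[X^2] = M_X''(0) = \sigma^2 + \mu^2$.

Variance
\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = (\sigma^2 + \mu^2) - \mu^2 = \sigma^2. \]
Thus the MGF of a normal random variable is
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]
It compactly encodes the mean µ and variance σ².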
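The closed form can be sanity-checked by integrating $e^{tx} f(x)$ numerically. This is only a sketch: the parameter values (µ = 1, σ = 2, t = 0.5), the trapezoidal rule, and the finite integration window of ±12σ around µ + σ²t are all assumptions made here for illustration.

```python
import math

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def mgf_numeric(t: float, mu: float, sigma: float, n: int = 40_000, width: float = 12.0) -> float:
    """Trapezoidal approximation of the integral of e^{tx} f(x).

    The integrand peaks near mu + sigma^2 t (see the completed square above),
    so the finite window is centred there; its half-width is width * sigma.
    """
    center = mu + sigma ** 2 * t
    a, b = center - width * sigma, center + width * sigma
    h = (b - a) / n
    g = lambda x: math.exp(t * x) * normal_pdf(x, mu, sigma)
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * h)
    return total * h

def mgf_closed_form(t: float, mu: float, sigma: float) -> float:
    """exp(mu t + sigma^2 t^2 / 2), the result derived in the text."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)

mu, sigma, t = 1.0, 2.0, 0.5  # illustrative values
print(mgf_numeric(t, mu, sigma), mgf_closed_form(t, mu, sigma))
```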

6 Joint Probability Distributions
We have seen so far distributions that modeled the number of typos in a text-
book, the number of failed bulbs, heights of a population, temperature of the
engine when idle, etc. We also have scenarios where we have to analyse several
random variables simultaneously. For example the size of a RAM and the speed
of a CPU, rainfall and crop yield, technical and artistic performance, etc. Joint
distribution tells us how two variables behave together, not just individually. If
X and Y are random variables, then the pair (X, Y ) is a random vector. Its dis-
tribution is called the joint distribution of X and Y . Individual distributions
of X and Y are then called the marginal distributions.
Pierre-Simon Laplace, in the late 18th and early 19th centuries, reasoned
systematically about multiple variables and probability laws. Andrey Kolmogorov,
in the 1930s, defined random variables as measurable functions and is
credited with giving a rigorous definition of joint distributions.

6.1 Two Discrete Random Variables


Two or more random variables may be considered in joint distribution. They
may be a mix of discrete or continuous variables. The “pure” cases, in which
both variables are discrete or both are continuous, are the ones most frequently
encountered in practice. For example, student grade (letter) and attendance
(days) or height and weight. In this section, we focus on two discrete random
variables. All the concepts extend to a vector (X1, X2, ..., Xn) of n components
and its joint distribution.
Just like a single variable X has a distribution that tells us the probability
of X = x, the vector (X, Y) has a joint distribution that tells what is the
probability that X = x and Y = y at the same time. That is,
\[ P(x, y) = P\{(X, Y) = (x, y)\}. \]
Two vectors are equal, i.e., (X, Y) = (x, y) if and only if X = x and Y =
y. In probability, the logical “and” corresponds to the intersection of events.
Therefore, the joint probability mass function (PMF) is given by
\[ P(x, y) = P\{(X, Y) = (x, y)\} = P\{X = x \cap Y = y\}. \]

Each distinct pair (x, y) represents a unique outcome of the random vector
(X, Y). For example, if (X, Y) = (2, 3), then it cannot also be (2, 4) or (1, 3).
So, different outcomes (x1, y1) and (x2, y2) such that (x1, y1) ≠ (x2, y2) cannot
occur simultaneously. That is, they are mutually exclusive.
Moreover, the collection of all such pairs (x, y) is exhaustive, meaning all
possible outcomes of the vector are covered.
Since each possible outcome (x, y) is mutually exclusive and they cover all
the possibilities, their probabilities must add up to 1, just like in the case of a
single variable.
\[ \sum_{x} \sum_{y} P(x, y) = 1 \]

Let X and Y be two discrete random variables defined on the sample space
S of an experiment. The joint probability mass function p(x, y) is defined
for each pair of numbers (x, y) by
\[ p(x, y) = P(X = x \text{ and } Y = y) \]
Further, $p(x, y) \ge 0$ for all x, y and $\sum_{x} \sum_{y} p(x, y) = 1$.
Now let A be any particular set consisting of pairs of (x, y) values, for ex-
ample A = {(x, y) : x + y = 5} or A = {(x, y) : max(x, y) ≤ 3}. Then the
probability that the random pair (X, Y) lies in the set A is given by summing
the joint probability mass function over all pairs in A
\[ P[(X, Y) \in A] = \sum_{(x, y) \in A} p(x, y). \]

Example 6.1. At a university, students applying for financial aid must choose
among different scholarship and loan options. Suppose the available options are
“Scholarship amounts”: Rs. 2000, Rs. 4000, Rs. 6000 and “Loan amounts”:
Rs. 0, Rs. 2000, Rs. 4000. A student is randomly selected from the program.
Define X to be the amount of scholarship received and Y to be the amount of
loan taken. The joint probability mass function p(x, y) = P(X = x, Y = y) is
given below

Y \ X        Rs. 2000   Rs. 4000   Rs. 6000
Rs. 0          0.10       0.05       0.15
Rs. 2000       0.10       0.15       0.15
Rs. 4000       0.05       0.15       0.10

There are nine possible (X, Y) pairs, such as (2000, 0), (2000, 2000), ..., (6000, 4000).
Each probability p(x, y) ≥ 0, and it is easily verified that their sum is 1.
The probability that a randomly selected student receives a Rs. 4000 schol-
arship and takes a Rs. 2000 loan is
\[ P(X = 4000, Y = 2000) = p(4000, 2000) = 0.15. \]
We compute
\[ P(X = Y) = p(2000, 2000) + p(4000, 4000) = 0.10 + 0.15 = 0.25. \]
To compute the probability that the loan amount is at least Rs. 2000
\[ P(Y \ge 2000) = \sum_{x} \left[ p(x, 2000) + p(x, 4000) \right] \]
From the joint PMF:
\[ \begin{aligned} P(Y \ge 2000) &= p(2000, 2000) + p(4000, 2000) + p(6000, 2000) \\ &\quad + p(2000, 4000) + p(4000, 4000) + p(6000, 4000) \\ &= 0.10 + 0.15 + 0.15 + 0.05 + 0.15 + 0.10 = 0.70 \end{aligned} \]
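The event probabilities in this example are sums of table entries over a set A of (x, y) pairs, which translates directly into code. A minimal sketch: the dictionary layout (keys are (scholarship, loan) pairs) and the helper name `prob` are assumptions of this illustration; the numbers are those of the table above.

```python
# Joint pmf of Example 6.1; keys are (x, y) = (scholarship, loan) in rupees.
joint_pmf = {
    (2000, 0): 0.10, (4000, 0): 0.05, (6000, 0): 0.15,
    (2000, 2000): 0.10, (4000, 2000): 0.15, (6000, 2000): 0.15,
    (2000, 4000): 0.05, (4000, 4000): 0.15, (6000, 4000): 0.10,
}

def prob(event) -> float:
    """P[(X, Y) in A], where A is described by a predicate event(x, y)."""
    return sum(p for (x, y), p in joint_pmf.items() if event(x, y))

total = prob(lambda x, y: True)              # should be 1
p_equal = prob(lambda x, y: x == y)          # P(X = Y)
p_loan_2000 = prob(lambda x, y: y >= 2000)   # P(Y >= 2000)

print(round(total, 2), round(p_equal, 2), round(p_loan_2000, 2))
```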

Once we have the joint probability mass function p(x, y) of two discrete
random variables X and Y, we can find the distribution of just one variable,
called the marginal distribution, by summing over the other variable. For
example, let X and Y denote the number of statistics and mathematics courses,
respectively, currently being taken by a randomly selected statistics major.
Suppose we want the probability that X = 2, and that the only possible values
of Y are 0, 1, and 2. Then
\[ p_X(2) = P(X = 2) = P[(X, Y) = (2, 0) \text{ or } (2, 1) \text{ or } (2, 2)] = p(2, 0) + p(2, 1) + p(2, 2) \]
That is, the joint pmf is summed over all pairs of the form (2, y).
For any possible value x of X, the marginal probability mass function
of X, denoted by pX(x), is given by
\[ p_X(x) = \sum_{y \,:\, p(x, y) > 0} p(x, y) \]
This means we hold x fixed and sum the joint pmf over all values of y such that
p(x, y) > 0. Similarly, the marginal probability mass function of Y is
\[ p_Y(y) = \sum_{x \,:\, p(x, y) > 0} p(x, y) \]
In the table of Example 6.1, whose columns are indexed by x and whose rows are
indexed by y, the column totals give the marginal pmf of X and the row totals
give the marginal pmf of Y.

Example 6.2. Consider again the joint pmf of Example 6.1. Possible X values
are x = 2000, 4000, and 6000. Computing column totals from the joint
probability table yields
\[ p_X(2000) = p(2000, 0) + p(2000, 2000) + p(2000, 4000) = 0.10 + 0.10 + 0.05 = 0.25 \]
\[ p_X(4000) = 0.05 + 0.15 + 0.15 = 0.35 \]
\[ p_X(6000) = 1 - (0.25 + 0.35) = 0.40 \]
The marginal pmf of X is then
\[ p_X(x) = \begin{cases} 0.25, & x = 2000 \\ 0.35, & x = 4000 \\ 0.40, & x = 6000 \\ 0, & \text{otherwise} \end{cases} \]
From this pmf,
\[ P(X \ge 4000) = p_X(4000) + p_X(6000) = 0.35 + 0.40 = 0.75. \]
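Marginalization is simply "sum the joint pmf over the other variable", which the following sketch carries out for the table of Example 6.1. The dictionary layout and the helper name `marginals` are assumptions made for this illustration.

```python
from collections import defaultdict

# Joint pmf of Example 6.1; keys are (x, y) = (scholarship, loan) in rupees.
joint_pmf = {
    (2000, 0): 0.10, (4000, 0): 0.05, (6000, 0): 0.15,
    (2000, 2000): 0.10, (4000, 2000): 0.15, (6000, 2000): 0.15,
    (2000, 4000): 0.05, (4000, 4000): 0.15, (6000, 4000): 0.10,
}

def marginals(pmf: dict):
    """Return (p_X, p_Y) by summing the joint pmf over the other variable."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in pmf.items():
        p_x[x] += p   # hold x fixed, accumulate over y
        p_y[y] += p   # hold y fixed, accumulate over x
    return dict(p_x), dict(p_y)

p_x, p_y = marginals(joint_pmf)
print({x: round(p, 2) for x, p in p_x.items()})
print({y: round(p, 2) for y, p in p_y.items()})
print(round(p_x[4000] + p_x[6000], 2))   # P(X >= 4000)
```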

6.2 Two Continuous Random Variables


Let X and Y be continuous random variables. A joint probability density
function f(x, y) for these two variables is a function satisfying
\[ f(x, y) \ge 0 \quad \text{for all } x, y, \]
and
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1. \]
Then, for any two-dimensional set A, the probability that the random pair
(X, Y) lies in A is given by
\[ P[(X, Y) \in A] = \iint_{A} f(x, y) \, dx \, dy. \]
In particular, if A is a rectangle A = {(x, y) : a ≤ x ≤ b, c ≤ y ≤ d}, then
\[ P[(X, Y) \in A] = P(a \le X \le b, \ c \le Y \le d) = \int_{a}^{b} \int_{c}^{d} f(x, y) \, dy \, dx. \]

Imagine laying a thin, stretchable rubber sheet over a flat table representing
the xy-plane. Now, at each point (x, y), you push the sheet upward so that its
height matches f(x, y). The resulting shape of the sheet is the graph of f(x, y).
The total probability over a region A on the table is then the air space (volume)
trapped between the surface and the region A on the table.

Probability over region A = Volume under f(x, y) and above A

This volume is P[(X, Y) ∈ A]. This concept is illustrated in Fig. 7.

Figure 7: 3D surface plot of a joint probability density function

The marginal pdf of each variable can be obtained in a manner analogous
to what we did in the case of two discrete variables. The marginal pdf of X at
the value x results from holding x fixed in the pair (x, y) and integrating the
joint pdf over y. Integrating the joint pdf with respect to x gives the marginal
pdf of Y.
Let X and Y be continuous random variables with joint probability density
function f(x, y). The marginal probability density functions of X and Y,
denoted by fX(x) and fY(y) respectively, are obtained by integrating the joint
density over the other variable
\[ f_X(x) = \int_{-\infty}^{\infty} f(x, y) \, dy \quad \text{for all } x \in \mathbb{R}, \]
\[ f_Y(y) = \int_{-\infty}^{\infty} f(x, y) \, dx \quad \text{for all } y \in \mathbb{R}. \]
These marginal densities describe the distribution of X and Y individually,
regardless of any dependence between them.

Example 6.3. Let the joint probability density function of continuous random
variables X and Y be
\[ f(x, y) = \begin{cases} 4xy, & 0 \le x \le 1, \ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
We verify that this is a valid joint PDF by computing the total volume
\[ \int_{0}^{1} \int_{0}^{1} 4xy \, dy \, dx = 4 \int_{0}^{1} x \left[ \int_{0}^{1} y \, dy \right] dx = 4 \int_{0}^{1} x \cdot \frac{1}{2} \, dx = 4 \cdot \frac{1}{2} \cdot \frac{1}{2} = 1 \]
To find fX(x), we integrate f(x, y) with respect to y
\[ f_X(x) = \int_{0}^{1} 4xy \, dy = 4x \int_{0}^{1} y \, dy = 4x \cdot \frac{1}{2} = 2x \]
Thus, the marginal density of X is
\[ f_X(x) = \begin{cases} 2x, & 0 \le x \le 1 \\ 0, & \text{otherwise} \end{cases} \]
Similarly, for fY(y)
\[ f_Y(y) = \int_{0}^{1} 4xy \, dx = 4y \int_{0}^{1} x \, dx = 4y \cdot \frac{1}{2} = 2y \]
So, the marginal density of Y is
\[ f_Y(y) = \begin{cases} 2y, & 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
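The integrals in this example can be checked numerically. A minimal sketch, assuming a simple midpoint rule is accurate enough here (it is exact for the linear integrands that arise); the helper names and grid sizes are choices made for this illustration.

```python
def joint_pdf(x: float, y: float) -> float:
    """f(x, y) = 4xy on the unit square, 0 elsewhere (Example 6.3)."""
    return 4.0 * x * y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def midpoint_1d(g, a: float, b: float, n: int = 1000) -> float:
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def marginal_x(x: float) -> float:
    """f_X(x): integrate the joint pdf over y with x held fixed."""
    return midpoint_1d(lambda y: joint_pdf(x, y), 0.0, 1.0)

# Total volume under the surface: integrate the marginal over x.
total_volume = midpoint_1d(marginal_x, 0.0, 1.0, n=200)

print(round(total_volume, 6))     # should be close to 1
print(round(marginal_x(0.3), 6))  # should be close to 2 * 0.3 = 0.6
```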

The above describes a way to compute marginal distribution from joint dis-
tribution both for discrete and continuous random variables. But, in general,
the joint distribution cannot be computed from marginal distributions because
they carry no information about interrelations between random variables.

Dependence of Random Variables Two random variables X and Y are
said to be independent if for every pair of values x and y
\[ p(x, y) = p_X(x) \cdot p_Y(y) \quad \text{in the discrete case} \]
\[ f(x, y) = f_X(x) \cdot f_Y(y) \quad \text{in the continuous case} \]
Otherwise, X and Y are said to be dependent.
For example, consider an experiment where two fair coins are tossed. Let
\[ X = \begin{cases} 1, & \text{if the first coin is heads} \\ 0, & \text{otherwise} \end{cases} \quad \text{and} \quad Y = \begin{cases} 1, & \text{if the second coin is heads} \\ 0, & \text{otherwise} \end{cases} \]
Then the joint probability mass function satisfies:
\[ P(X = x, Y = y) = P(X = x) \cdot P(Y = y) \]
for all x, y ∈ {0, 1}, since the outcomes of the two coin tosses are independent.
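For a discrete pair, the factorization condition can be tested entry by entry. The sketch below checks it for the two-coin example and, for contrast, for a hypothetical pair in which Y always equals X; the helper name `is_independent`, the tolerance, and the second pmf are assumptions introduced for illustration.

```python
from itertools import product

def is_independent(joint: dict, tol: float = 1e-12) -> bool:
    """Check p(x, y) == p_X(x) * p_Y(y) for every pair (x, y)."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    p_x = {x: sum(joint.get((x, y), 0.0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0.0) for x in xs) for y in ys}
    return all(
        abs(joint.get((x, y), 0.0) - p_x[x] * p_y[y]) <= tol
        for x in xs for y in ys
    )

# Two fair coins: each of the four outcomes has probability 1/4.
two_coins = {xy: 0.25 for xy in product([0, 1], repeat=2)}

# Hypothetical dependent pair: Y is a copy of X.
copied_coin = {(0, 0): 0.5, (1, 1): 0.5}

print(is_independent(two_coins))    # factorizes
print(is_independent(copied_coin))  # does not factorize
```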
Now consider the case where X is the loan amount and Y is the interest rate.
In practice, the interest rate often depends on the loan amount. For instance,
larger loans may be associated with lower rates for preferred customers or higher
rates due to increased credit risk. Therefore, the joint distribution of X and Y
does not factor into the product of their marginals

\[ f(x, y) \ne f_X(x) \cdot f_Y(y) \]

This inequality implies that X and Y are dependent random variables.


If X and Y are independent random variables, then for any intervals [a, b]
and [c, d]

P (a ≤ X ≤ b, c ≤ Y ≤ d) = P (a ≤ X ≤ b) · P (c ≤ Y ≤ d)

This identity holds for both discrete and continuous random variables under the
assumption of independence.
The idea of a joint distribution for two variables generalizes naturally to
three, four, or even hundreds of variables, allowing us to model complex real-
world systems with multiple interacting quantities. For example, to model the
Indian monsoon meaningfully, one typically needs hundreds to thousands of
parameters like temperature (surface, mid-level, upper atmosphere), pressure
(mean sea level pressure, geopotential height), wind speed and direction (at
various altitudes), humidity, Sea Surface Temperature (SST), ocean currents,
El Niño–Southern Oscillation (ENSO) indicators, etc.
If X1, X2, ..., Xn are all discrete random variables, the joint probability
mass function (pmf) is the function
\[ p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \]
If the variables are continuous, the joint probability density function
(pdf) of X1, ..., Xn is a function f(x1, x2, ..., xn) such that for any n intervals
[a1, b1], ..., [an, bn],
\[ P(a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n) = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} f(x_1, \ldots, x_n) \, dx_n \cdots dx_1 \]

6.3 Conditional Distribution


In many real-life situations, we don’t start with complete uncertainty. We
already know something and we want to use that knowledge to make better
predictions. For example, suppose we are evaluating the reliability of a new
smartphone model. We know that one component failed after 200 hours of use.
What is the likelihood that another dependent component will fail within the
next 100 hours? This demonstrates how our knowledge of one variable reshapes
our belief about another. This is exactly the kind of reasoning powered by
conditional distributions.

Let X and Y be two continuous random variables with joint probability
density function f(x, y) and marginal density function of X, denoted by fX(x).
Then, for any value of x such that fX(x) > 0, the conditional probability
density function of Y given that X = x is defined as
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}, \quad -\infty < y < \infty \]
If X and Y are discrete random variables, replacing density functions with
probability mass functions in this definition yields the conditional probability
mass function of Y given X = x
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)}, \quad \text{for all } x \text{ with } P_X(x) > 0 \]
We can notice a close similarity between the definitions of conditional prob-
ability for events and conditional distributions for random variables.
For events A and B in a sample space S,
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided } P(A) > 0. \]
Here, A and B are subsets of the sample space. The idea is to restrict the
probability measure to the event A, and then renormalize.
Let X, Y be random variables with joint density (continuous case) or joint
pmf (discrete case). For the continuous case,
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}, \quad \text{provided } f_X(x) > 0, \]
where fX(x) is the marginal density of X.
For the discrete case, replacing densities with pmfs gives
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)}, \quad \text{for } P_X(x) > 0. \]

Example 6.4. Roll a fair six-sided die. Let
\[ A = \{\text{even outcome}\}, \qquad B = \{\text{outcome} \ge 4\}. \]
Then
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(\{4, 6\})}{P(\{2, 4, 6\})} = \frac{2/6}{3/6} = \frac{2}{3}. \]
Roll two fair dice. Define
\[ X = \text{outcome of die 1}, \qquad Y = \text{outcome of die 2}. \]
The joint pmf is
\[ P(X = x, Y = y) = \frac{1}{36}, \quad x, y \in \{1, \ldots, 6\}. \]
The conditional pmf of Y given X = 4 is
\[ P(Y = y \mid X = 4) = \frac{P(X = 4, Y = y)}{P(X = 4)} = \frac{1/36}{1/6} = \frac{1}{6}, \quad y = 1, \ldots, 6. \]
Thus, conditioning on X = 4, the distribution of Y remains uniform over
{1, 2, ..., 6}.

Example 6.5. Let X denote the proportion of time a main runway at an airport
is busy, and let Y denote the proportion of time a secondary runway is busy.
Suppose the joint pdf of X and Y is given by
\[ f(x, y) = k(x + y^2), \quad \text{for } 0 < x < 1, \ 0 < y < 1 \]
First, we aim to compute the constant k. For this, we have to assume that
f(x, y) is a valid pdf, i.e.,
\[ \int_{0}^{1} \int_{0}^{1} k(x + y^2) \, dy \, dx = 1 \]
\[ \int_{0}^{1} \left[ \int_{0}^{1} k(x + y^2) \, dy \right] dx = k \int_{0}^{1} \left( x + \frac{1}{3} \right) dx = k \left( \frac{1}{2} + \frac{1}{3} \right) = k \cdot \frac{5}{6} \ \Rightarrow \ k = \frac{6}{5} \]
Let us now aim to find the conditional pdf of Y | X = 0.6. First, we find the
marginal pdf of X
\[ f_X(x) = \int_{0}^{1} f(x, y) \, dy = \int_{0}^{1} \frac{6}{5}(x + y^2) \, dy = \frac{6}{5}\left( x + \frac{1}{3} \right) = \frac{6}{5}x + \frac{2}{5} \]
So the conditional pdf is
\[ f_{Y|X}(y \mid 0.6) = \frac{f(0.6, y)}{f_X(0.6)} = \frac{\frac{6}{5}(0.6 + y^2)}{\frac{6}{5}\left(0.6 + \frac{1}{3}\right)} = \frac{0.6 + y^2}{0.933}, \quad 0 < y < 1 \]
Let us now find the conditional probability P(Y ≤ 0.5 | X = 0.6)
\[ P(Y \le 0.5 \mid X = 0.6) = \int_{0}^{0.5} \frac{0.6 + y^2}{0.933} \, dy = \frac{1}{0.933} \left[ 0.6y + \frac{y^3}{3} \right]_{0}^{0.5} = \frac{1}{0.933} \left( 0.3 + \frac{0.125}{3} \right) = \frac{1}{0.933} \cdot 0.3417 \approx 0.366 \]
Let us now find the conditional expectation E(Y | X = 0.6)
\[ E(Y \mid X = 0.6) = \int_{0}^{1} y \cdot \frac{0.6 + y^2}{0.933} \, dy = \frac{1}{0.933} \int_{0}^{1} (0.6y + y^3) \, dy = \frac{1}{0.933} \left[ \frac{0.6y^2}{2} + \frac{y^4}{4} \right]_{0}^{1} = \frac{1}{0.933}(0.3 + 0.25) = \frac{0.55}{0.933} \approx 0.589 \]
The conditional distribution fY|X(y | 0.6) reflects how knowledge of the
main runway’s usage changes our prediction for the secondary runway. Given
that the main runway is busy 60% of the time (i.e., X = 0.6), the probability
that the secondary runway is busy at most 50% of the time (i.e., Y ≤ 0.5) is
approximately 36.6%. Also, the expected (i.e., average) proportion of time that
the secondary runway is busy is 58.9%.
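The numbers in this example can be reproduced by numerical integration. This is a sketch only: the midpoint rule, the grid size, and the helper names are assumptions of this illustration, and k = 6/5 is taken from the derivation above rather than recomputed.

```python
def integrate(g, a: float, b: float, n: int = 20_000) -> float:
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

K = 6 / 5  # normalizing constant found in Example 6.5

def joint_pdf(x: float, y: float) -> float:
    """f(x, y) = k (x + y^2) on the open unit square, 0 elsewhere."""
    return K * (x + y * y) if 0.0 < x < 1.0 and 0.0 < y < 1.0 else 0.0

def marginal_x(x: float) -> float:
    """f_X(x): integrate the joint pdf over y."""
    return integrate(lambda y: joint_pdf(x, y), 0.0, 1.0)

x0 = 0.6
fx0 = marginal_x(x0)

# Conditional probability and conditional expectation given X = 0.6.
p_y_le_half = integrate(lambda y: joint_pdf(x0, y), 0.0, 0.5) / fx0
e_y_given_x = integrate(lambda y: y * joint_pdf(x0, y), 0.0, 1.0) / fx0

print(round(fx0, 4), round(p_y_le_half, 3), round(e_y_given_x, 3))
```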

If two variables are independent, the marginal pmf or pdf in the denominator
will cancel the corresponding factor in the numerator. The conditional distribu-
tion is then identical to the corresponding marginal distribution. Let X and Y
be two independent random variables. Then, by the definition of independence

\[ P(X = x, Y = y) = P_X(x) \cdot P_Y(y) \]
\[ f(x, y) = f_X(x) \cdot f_Y(y) \]
Now consider the conditional probability of Y given X = x
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)} = \frac{P_X(x) \cdot P_Y(y)}{P_X(x)} = P_Y(y) \]
The conditional probability density function of Y given X = x is
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} = \frac{f_X(x) \cdot f_Y(y)}{f_X(x)} = f_Y(y) \]
If X and Y are independent, then knowing the value of X tells us nothing new
about Y or vice versa.

6.4 Expected Values


For a single random variable X, the expectation E[X] represents the average or
“center of mass” of its distribution on the real line. When we consider a random
vector (X, Y), the expectation is defined component-wise as
\[ E[(X, Y)] = \big( E[X], E[Y] \big). \]
Geometrically, E[(X, Y)] is the centroid (balance point) of the joint distribution
in the xy-plane. In general, for an n-dimensional random vector (X1, X2, ..., Xn),
the expectation is
\[ E[(X_1, X_2, \ldots, X_n)] = \big( E[X_1], E[X_2], \ldots, E[X_n] \big). \]

Expected Value of a function Any function h(X) of a single random vari-
able X is itself a random variable. However, to compute E[h(X)], it is not
necessary to obtain the probability distribution of h(X). Instead, E[h(X)] is
computed as a weighted average of h(x) values, where the weights are given by
the pmf p(x) (if X is discrete) or the pdf f(x) (if X is continuous). A similar
result holds for a function h(X, Y) of two jointly distributed random variables.
Proposition. Let X and Y be jointly distributed random variables with
joint pmf p(x, y) (if discrete) or joint pdf f(x, y) (if continuous). Then the
expected value of a function h(X, Y), denoted by E[h(X, Y)], is given by
\[ E[h(X, Y)] = \begin{cases} \displaystyle\sum_{x} \sum_{y} h(x, y) \, p(x, y), & \text{if } X, Y \text{ are discrete}, \\[2ex] \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) \, f(x, y) \, dx \, dy, & \text{if } X, Y \text{ are continuous}. \end{cases} \]
A special case is the expectation of the product of two random variables.
For two jointly distributed random variables X and Y, the expectation of their
product is defined as
\[ E[XY] = \begin{cases} \displaystyle\sum_{x} \sum_{y} xy \, p(x, y), & \text{if } X, Y \text{ are discrete}, \\[2ex] \displaystyle\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy \, f(x, y) \, dx \, dy, & \text{if } X, Y \text{ are continuous}. \end{cases} \]
E[XY] is the average product of X and Y over many repetitions of the
experiment. It captures how the two random variables jointly contribute to
outcomes. If X and Y are independent, then
\[ E[XY] = E[X] \cdot E[Y]. \]
Example 6.6. * Five friends have purchased tickets to a concert. If the tickets
correspond to seats 1–5 in a row and are randomly distributed among the five,
what is the expected number of seats separating any particular two of the five?

Let X and Y denote the seat numbers of the first and second individuals,
respectively. Possible (X, Y) pairs are
\[ \{(1, 2), (1, 3), \ldots, (5, 4)\} \]
with size 20. The joint pmf of (X, Y) is
\[ p(x, y) = \begin{cases} \frac{1}{20}, & x = 1, \ldots, 5; \ y = 1, \ldots, 5; \ x \ne y, \\ 0, & \text{otherwise}. \end{cases} \]
The number of seats separating the two individuals is defined as
\[ h(X, Y) = |X - Y| - 1. \]
Values of h(x, y)

x \ y   1   2   3   4   5
1       −   0   1   2   3
2       0   −   0   1   2
3       1   0   −   0   1
4       2   1   0   −   0
5       3   2   1   0   −

Expected value
\[ E[h(X, Y)] = \sum_{x} \sum_{y} h(x, y) \, p(x, y) = \frac{1}{20} \sum_{x} \sum_{y \ne x} h(x, y). \]
Computing the sum
\[ \sum_{x} \sum_{y \ne x} h(x, y) = 20. \]
Thus,
\[ E[h(X, Y)] = \frac{20}{20} = 1. \]
The expected number of seats separating any two friends is 1.
* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

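The expected separation in the concert-seat example can be confirmed by brute-force enumeration of the 20 ordered seat pairs. A minimal sketch; the use of exact fractions and the variable names are choices made for this illustration.

```python
from fractions import Fraction
from itertools import permutations

# All ordered pairs (x, y) of distinct seats from 1..5, each equally likely.
pairs = list(permutations(range(1, 6), 2))
p = Fraction(1, len(pairs))

# h(x, y) = |x - y| - 1 is the number of seats between the two friends.
expected_separation = sum((abs(x - y) - 1) * p for x, y in pairs)

print(len(pairs), expected_separation)
```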
6.5 Covariance
Variance measures how much a single variable deviates from its mean. Likewise,
covariance measures how two variables deviate from their respective means to-
gether. When X is far from its mean (positive or negative deviation), does Y
also tend to be far from its mean in the same direction? Specifically, it cap-
tures whether large (or small) values of one variable tend to occur with large (or
small) values of the other. A positive covariance means that when X is above
its mean, Y also tends to be above its mean, and when X is below its mean,
Y also tends to be below its mean (refer Fig.8a). A negative covariance means
that when one variable is above its mean, the other tends to be below its mean,
indicating an inverse relationship (refer Fig.8b). If the relationship between X
and Y is inconsistent, i.e.,
• for some x-values larger than the mean µX, the corresponding y-values are
also larger than µY (positive contribution), while
• for other x-values larger than µX, the corresponding y-values are smaller
than µY (negative contribution),
the overall covariance is close to zero because these positive and negative con-
tributions cancel out on average (refer Fig. 8c). This does not imply that X and
Y are completely unrelated, only that there is no systematic linear tendency for
them to move together (or against each other) across the distribution. Thus,
covariance provides a numerical measure of how two variables vary together.

Figure 8: Covariance graphs. (a) positive covariance; (b) negative covariance; (c) covariance near zero.

We can express the joint deviation from the respective means as a function
\[ g(X - \mu_X, \ Y - \mu_Y) \]
that captures both the direction and the magnitude of dependence between two
random variables. It should be positive when both deviations have the same sign
(the variables move together) and negative when the deviations have opposite
signs (the variables move in opposite directions). The magnitude quantifies how
far the deviations are from their means. The product
\[ g(X - \mu_X, \ Y - \mu_Y) = (X - \mu_X)(Y - \mu_Y) \]
is the simplest such function, but there are other possible choices: odd-power
products, sign-magnitude decomposition, normalized product (correlation), and
nonlinear kernels. The simple product is preferred because it is linear in each
deviation, easy to compute, preserves sign and scale naturally, and connects
directly to variance and correlation.
Covariance measures the average effect of the deviations across all outcomes.
Formally, the covariance between two random variables X and Y is defined as

\[ \mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big], \]

where µX = E[X] and µY = E[Y ].

If X and Y are discrete with joint pmf p(x, y), then

\[ \mathrm{Cov}(X, Y) = \sum_x \sum_y (x - \mu_X)(y - \mu_Y)\, p(x, y). \]

If X and Y are continuous with joint pdf f (x, y), then

\[ \mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu_X)(y - \mu_Y)\, f(x, y)\, dx\, dy. \]

Example 6.7. * The joint and marginal pmfs for X (automobile policy deductible
amount) and Y (homeowner policy deductible amount) are

             y = 500    y = 1000    y = 5000    pX (x)
x = 100       0.30       0.05        0.00        0.35
x = 500       0.15       0.20        0.05        0.40
x = 1000      0.10       0.10        0.05        0.25
pY (y)        0.55       0.35        0.10        1.00

From these, we compute the means

\[ \mu_X = \sum_x x\, p_X(x) = 485, \qquad \mu_Y = \sum_y y\, p_Y(y) = 1125. \]

Therefore, the covariance is

\[ \mathrm{Cov}(X, Y) = \sum_x \sum_y (x - 485)(y - 1125)\, p(x, y). \]

Expanding term by term,

\[ \mathrm{Cov}(X, Y) = (100 - 485)(500 - 1125)(0.30) + \cdots + (1000 - 485)(5000 - 1125)(0.05) = 136{,}875. \]
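
The double sum in Example 6.7 can be checked numerically. The following is a minimal Python sketch, not part of the original example; the dictionary layout and the function name covariance_from_pmf are illustrative choices.

```python
# Joint pmf from Example 6.7: keys are (x, y) pairs, values are p(x, y).
joint_pmf = {
    (100, 500): 0.30, (100, 1000): 0.05, (100, 5000): 0.00,
    (500, 500): 0.15, (500, 1000): 0.20, (500, 5000): 0.05,
    (1000, 500): 0.10, (1000, 1000): 0.10, (1000, 5000): 0.05,
}


def covariance_from_pmf(pmf):
    """Return (mu_x, mu_y, cov) for a discrete joint pmf given as a dict."""
    # Marginal means: summing x * p(x, y) over all cells equals sum of x * pX(x).
    mu_x = sum(x * p for (x, _), p in pmf.items())
    mu_y = sum(y * p for (_, y), p in pmf.items())
    # Definition of covariance: probability-weighted product of deviations.
    cov = sum((x - mu_x) * (y - mu_y) * p for (x, y), p in pmf.items())
    return mu_x, mu_y, cov


mu_x, mu_y, cov = covariance_from_pmf(joint_pmf)
print(mu_x, mu_y, cov)
```

Up to floating-point rounding, the printed values should agree with µX = 485, µY = 1125 and Cov(X, Y ) = 136,875 obtained above.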

Proposition.
Cov(X, Y ) = E(XY ) − µX µY .
* Taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

Proof. Start with the definition

Cov(X, Y ) = E[(X − µX )(Y − µY )] .

Expanding,

(X − µX )(Y − µY ) = XY − µX Y − µY X + µX µY .

Taking expectations term by term,

E[(X − µX )(Y − µY )] = E(XY ) − µX E(Y ) − µY E(X) + µX µY .

Since E(X) = µX and E(Y ) = µY ,

Cov(X, Y ) = E(XY ) − µX µY .

Example 6.8. Suppose we collect data on students’ study hours X and exam
scores Y for three students

(X, Y ) = (2, 65), (4, 75), (6, 85).

Computing the means,

\[ \mu_X = \frac{2 + 4 + 6}{3} = 4, \qquad \mu_Y = \frac{65 + 75 + 85}{3} = 75. \]

Computing E(XY ),

\[ E(XY) = \frac{1}{3}\big(2 \cdot 65 + 4 \cdot 75 + 6 \cdot 85\big) = \frac{1}{3}(130 + 300 + 510) = \frac{940}{3} \approx 313.33. \]

Applying the shortcut formula,

Cov(X, Y ) = E(XY ) − µX µY = 313.33 − (4)(75) = 313.33 − 300 = 13.33.

The positive covariance indicates that as study hours increase, exam scores
also tend to increase.
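
As a quick check of Example 6.8, the sketch below computes the covariance both from the definition and from the shortcut formula. It treats the three observed pairs as equally likely, as the example does; the function names are illustrative and not from the notes.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def covariance_definition(xs, ys):
    """Cov(X, Y) = E[(X - mu_X)(Y - mu_Y)], each pair equally likely."""
    mu_x, mu_y = mean(xs), mean(ys)
    return mean([(x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)])


def covariance_shortcut(xs, ys):
    """Cov(X, Y) = E(XY) - mu_X * mu_Y, each pair equally likely."""
    e_xy = mean([x * y for x, y in zip(xs, ys)])
    return e_xy - mean(xs) * mean(ys)


hours = [2, 4, 6]
scores = [65, 75, 85]
cov_def = covariance_definition(hours, scores)
cov_short = covariance_shortcut(hours, scores)
print(cov_def, cov_short)
```

Both routes give 40/3 ≈ 13.33 up to rounding, which is the equality stated in the proposition.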

Let us explore the kind of relationships measured by covariance. Recall the
definition of covariance

\[ \mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]. \]

Suppose Y is exactly linear in X

Y = aX + b,

where a ≠ 0 and b are constants. Then

µY = E[Y ] = E[aX + b] = aE[X] + b = aµX + b,

Y − µY = (aX + b) − (aµX + b) = a(X − µX ),

so that

Y − µY ∝ (X − µX ).

The constant term b just shifts the distribution up or down, so
it cancels out when subtracting the mean. The slope a scales the deviations of
X to produce the deviations of Y . Hence, in a perfectly linear relationship, the
deviations of Y are proportional to the deviations of X. This is why covariance
captures linear dependence exactly:

\begin{align*}
\mathrm{Cov}(X, Y) &= E\big[(X - \mu_X)(Y - \mu_Y)\big] \\
&= E\big[(X - \mu_X) \cdot a(X - \mu_X)\big] \\
&= a\, E\big[(X - \mu_X)^2\big] \\
&= a\, \mathrm{Var}(X).
\end{align*}

When X and Y are linearly related, the magnitude of the covariance is determined
by |a| and the variance of X, and its sign is determined by a as follows
• if a > 0, covariance is positive,
• if a < 0, covariance is negative, and
• if a = 0 (so that Y = b is constant), covariance is zero.

Theorem. Let Y = aX + b be a linear relation between random variables X
and Y . If Cov(X, Y ) = 0, then one of the following must hold:

1. a = 0, so that Y = b is a constant that does not depend on X.

2. Var(X) = 0, so that X is constant, and hence Y = aX + b is also constant.

In all other cases (i.e., when a ≠ 0 and Var(X) ≠ 0), we have

Cov(X, Y ) = a Var(X) ≠ 0.

Proof. This follows from Cov(X, Y ) = a Var(X): the product is zero only if
a = 0 or Var(X) = 0.


Hence, for linearly related variables, zero covariance occurs only in these
degenerate cases. Now consider the case when the variables are related, but not
linearly, i.e.,

Y = g(X)

where g is some nonlinear function such as X², X³, sin(X), or e^X. Then
g(X) − E[g(X)] is not proportional to (X − µX ). The product

(X − µX )(g(X) − E[g(X)])

can take positive and negative values that cancel each other out, so its
expectation can be zero even if Y is clearly a function of X.

Example. Consider X ∼ U (−1, 1) and let Y = X². The pdf of X is 1/2 on (−1, 1), so

\[ \mu_X = E[X] = \int_{-1}^{1} x \cdot \tfrac{1}{2}\, dx = \tfrac{1}{2}\left[\frac{x^2}{2}\right]_{-1}^{1} = 0, \]

\[ \mu_Y = E[X^2] = \int_{-1}^{1} x^2 \cdot \tfrac{1}{2}\, dx = \tfrac{1}{2}\left[\frac{x^3}{3}\right]_{-1}^{1} = \frac{1}{3}. \]

Since µX = 0 and Y = X²,

\[ \mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big] = E\big[X\,(X^2 - \tfrac{1}{3})\big] = E[X^3] - \tfrac{1}{3} E[X]. \]

Because X is symmetric around 0, E[X] = 0 and E[X³] = 0. Hence,

Cov(X, Y ) = 0.

Thus, although Y = X² is fully determined by X (a strong nonlinear dependence),
the covariance is zero because positive and negative contributions cancel
due to symmetry. This shows covariance only captures linear dependence.
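
The cancellation can also be seen numerically. The sketch below approximates the expectations for X ∼ U (−1, 1) with a midpoint Riemann sum; the helper name expectation_uniform and the number of steps are illustrative assumptions, and the result is an approximation rather than an exact integral.

```python
def expectation_uniform(g, a=-1.0, b=1.0, steps=100_000):
    """Approximate E[g(X)] for X ~ U(a, b) with a midpoint Riemann sum."""
    width = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * width   # midpoint of the i-th subinterval
        total += g(x)
    # density 1/(b - a) times the subinterval width equals 1/steps
    return total / steps


mu_x = expectation_uniform(lambda x: x)
mu_y = expectation_uniform(lambda x: x ** 2)
cov_xy = expectation_uniform(lambda x: (x - mu_x) * (x ** 2 - mu_y))
print(mu_x, mu_y, cov_xy)
```

The approximations should come out close to 0, 1/3 and 0 respectively, even though Y is an exact function of X.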
To sum up, covariance measures linear co-movement. Nonlinear relationships
can produce zero or misleading covariance, because positive and negative
deviations cancel out. Therefore, covariance and correlation are not complete
measures of dependence. They measure only linear dependence.

The numerical value of covariance is highly dependent on the units of
measurement.

Example 6.9. Suppose X represents the height of individuals (in centimeters)
and Y represents their weight (in kilograms). Assume we calculate

Cov(X, Y ) = 120 cm-kg.

Now, if we convert height from centimeters to meters, the new variable is
X′ = 0.01X. The covariance rescales accordingly

Cov(X′, Y ) = 0.01 × Cov(X, Y ) = 1.2 m-kg.

Alternatively, if we convert weight from kilograms to grams, then Y′ = 1000Y ,
and

Cov(X, Y′) = 1000 × Cov(X, Y ) = 120,000 cm-g.

Thus, the same relationship between height and weight produces very different
covariance values depending only on the units of measurement.

This example illustrates why raw covariance is not a reliable measure for
comparing the strength of association between variables. To eliminate this
defect, we define the correlation coefficient, which rescales the covariance by
the product of the standard deviations, resulting in a unit-free measure. If two
variables move perfectly together, their deviations from their means are aligned
exactly. In that case, the covariance reaches its largest possible value, which is
σX σY . If they move perfectly opposite, the covariance reaches −σX σY . Thus
the product of the standard deviations σX σY is not literally the covariance, but
it represents the maximum possible magnitude of co-movement that two variables
can have. Dividing by σX σY therefore scales the actual covariance relative to
perfect alignment, so the correlation coefficient is a unit-free measure between
−1 and 1.
The correlation coefficient of X and Y , denoted by Corr(X, Y ), ρX,Y , or
just ρ, is defined as

\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \]

Proposition. 1. If a and c are either both positive or both negative,

Corr(aX + b, cY + d) = Corr(X, Y )

2. For any two random variables X and Y ,

−1 ≤ ρ ≤ 1

The two variables are said to be uncorrelated when ρ = 0.


Proof. Part 1. We use the following properties of covariance and standard
deviation:

\[ \mathrm{Cov}(aX + b, cY + d) = ac\, \mathrm{Cov}(X, Y), \qquad \sigma_{aX+b} = |a|\, \sigma_X, \qquad \sigma_{cY+d} = |c|\, \sigma_Y. \]

Now, the correlation of any two random variables U and V is

\[ \mathrm{Corr}(U, V) = \frac{\mathrm{Cov}(U, V)}{\sigma_U \sigma_V}. \]

Hence,

\begin{align*}
\mathrm{Corr}(aX + b, cY + d) &= \frac{\mathrm{Cov}(aX + b, cY + d)}{\sigma_{aX+b}\, \sigma_{cY+d}} \\
&= \frac{ac\, \mathrm{Cov}(X, Y)}{|a|\, \sigma_X\, |c|\, \sigma_Y} \\
&= \frac{ac}{|a||c|} \cdot \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \\
&= \mathrm{sign}(ac)\, \mathrm{Corr}(X, Y).
\end{align*}

If a and c are either both positive or both negative, then sign(ac) = +1. Therefore,

Corr(aX + b, cY + d) = Corr(X, Y ).

This statement says that linear transformations with slopes of the same sign
don’t change ρ. This matters because changing units (e.g., temperature in °C
to °F) changes the covariance but leaves the correlation unchanged, since ρ
measures only relative co-movement.

If a and c have opposite signs, then sign(ac) = −1 and the correlation flips sign

Corr(aX + b, cY + d) = − Corr(X, Y ).
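
The effect of a linear change of units can be illustrated with a small numerical sketch. The temperature and sales figures below are made-up illustrative data, not from the notes, and the correlation function treats the listed pairs as equally likely.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient, treating the listed pairs as equally likely."""
    n = len(xs)
    mu_x, mu_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mu_x) * (y - mu_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mu_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mu_y) ** 2 for y in ys) / n)
    return cov / (sd_x * sd_y)


celsius = [12.0, 18.5, 21.0, 25.5, 30.0]   # hypothetical temperatures in degrees C
sales = [20.0, 31.0, 29.0, 42.0, 50.0]     # hypothetical sales figures

fahrenheit = [1.8 * c + 32 for c in celsius]   # a = 1.8 > 0, b = 32
negated = [-c for c in celsius]                # a = -1 < 0, b = 0

rho = correlation(celsius, sales)
rho_f = correlation(fahrenheit, sales)
rho_neg = correlation(negated, sales)
print(rho, rho_f, rho_neg)
```

The first two values agree (same-sign slopes), while the third has the same magnitude with the opposite sign, matching sign(ac) in the proof.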

Proof. Part 2. The correlation coefficient is defined as

\[ \rho = \mathrm{Corr}(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}, \]

where σX , σY > 0 are the standard deviations of X and Y .
By the Cauchy–Schwarz inequality, for any square-integrable random variables
U and V , we have

\[ |E[UV]| \le \sqrt{E[U^2]\, E[V^2]}. \]

Let U = X − µX and V = Y − µY . Then

\[ \big|E[(X - \mu_X)(Y - \mu_Y)]\big| \le \sqrt{E[(X - \mu_X)^2]\, E[(Y - \mu_Y)^2]}, \]

where

\[ E[(X - \mu_X)(Y - \mu_Y)] = \mathrm{Cov}(X, Y), \qquad E[(X - \mu_X)^2] = \sigma_X^2, \qquad E[(Y - \mu_Y)^2] = \sigma_Y^2. \]

Hence,

\[ |\mathrm{Cov}(X, Y)| \le \sigma_X \sigma_Y, \quad \text{i.e.,} \quad -\sigma_X \sigma_Y \le \mathrm{Cov}(X, Y) \le \sigma_X \sigma_Y. \]

Dividing throughout by σX σY > 0,

\[ -1 \le \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} \le 1. \]

That is,
−1 ≤ ρ ≤ 1.

ρ = +1 indicates a perfect positive linear relationship, ρ = −1 indicates a
perfect negative linear relationship, and ρ = 0 indicates no linear relationship
(variables are uncorrelated). As a common rule of thumb, |ρ| ≥ 0.8 indicates a
strong relationship, 0.5 < |ρ| < 0.8 indicates a moderate relationship, and
|ρ| ≤ 0.5 indicates a weak relationship.
Proposition. 1. If X and Y are independent, then ρ = 0. However, ρ = 0
does not imply independence.

2. ρ = 1 or ρ = −1 if and only if

Y = aX + b

for some numbers a and b with a ≠ 0.

The first statement says that if X and Y are independent, their correlation
is ρ = 0. However, the converse is not necessarily true, i.e., ρ = 0 does not
imply that X and Y are independent. There could still be a relationship, but
it is nonlinear. Correlation measures only linear relationships. The second
statement says that a perfect correlation,

ρ = 1 or ρ = −1,

happens only when Y is exactly a straight-line function of X, i.e.,

Y = aX + b, a ≠ 0.

This shows that correlation captures how close the relationship is to a straight
line. On the other hand
• if |ρ| < 1, there may still be a strong connection between X and Y which
is not linear.
• Even if |ρ| is close to 1, the relationship might be nonlinear, but it can be
well approximated by a straight line.
In short, correlation measures linear association and not all types of
relationships. Zero correlation does not mean no relationship; it only indicates no
linear relationship. Maximum correlation (|ρ| = 1) occurs only when the relationship
is perfectly linear.


Example 6.10. We are given two discrete random variables X and Y with the
following joint pmf

\[ p_{X,Y}(x, y) = \begin{cases} 0.25 & \text{for } (x, y) \in \{(-4, 1), (-2, -2), (2, 2), (4, -1)\} \\ 0 & \text{otherwise} \end{cases} \]

For each value of X, there is exactly one corresponding Y , and vice versa. This
implies X and Y are perfectly dependent: knowing X tells us exactly what Y
is and vice versa. Fig.9 shows a plot of these points.

Figure 9: A plot

The marginal means are µX = µY = 0 because the points are symmetric around
0 (positive and negative values cancel). The expected value of the product is

E(XY ) = (−4)(1)(0.25) + (−2)(−2)(0.25) + (2)(2)(0.25) + (4)(−1)(0.25) = 0,

so

Cov(X, Y ) = E(XY ) − µX µY = 0 − 0 · 0 = 0.

The correlation coefficient is

\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y} = 0. \]

Even though X and Y are perfectly dependent, the pattern of dependence
is nonlinear as reflected in Fig.9. So, correlation doesn’t detect it.
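
The two claims of Example 6.10 (perfect dependence, zero covariance) can be verified with a short sketch. The variable names are illustrative; the pmf is the one given in the example.

```python
# Joint pmf of Example 6.10: four equally likely points.
pmf = {(-4, 1): 0.25, (-2, -2): 0.25, (2, 2): 0.25, (4, -1): 0.25}

mu_x = sum(x * p for (x, _), p in pmf.items())
mu_y = sum(y * p for (_, y), p in pmf.items())
e_xy = sum(x * y * p for (x, y), p in pmf.items())
cov = e_xy - mu_x * mu_y   # shortcut formula Cov = E(XY) - mu_X * mu_Y

# Perfect dependence: every x value determines y, and every y value determines x.
x_to_y = {x: y for (x, y) in pmf}
y_to_x = {y: x for (x, y) in pmf}
is_one_to_one = len(x_to_y) == len(pmf) and len(y_to_x) == len(pmf)

print(mu_x, mu_y, cov, is_one_to_one)
```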

6.6 Bivariate Normal Distribution


In single-variable statistics, the normal distribution is the most widely used
because of its mathematical tractability and its natural appearance in real-world
data. Similarly, when dealing with two random variables jointly, the bivariate
normal distribution plays the same role. The probability density function
(pdf) of the bivariate normal, given |ρ| < 1, is

\[ f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\}. \]

The density looks complicated because it incorporates five parameters:
the means of X and Y , denoted µX , µY , their standard deviations σX , σY , and
their correlation ρ. When one variable is integrated out, the marginal
distributions are X ∼ N (µX , σX²) and Y ∼ N (µY , σY²). That means each variable,
considered alone, still follows a normal distribution. The middle term in the
exponent is the interaction term. It introduces dependence between X and Y .
Geometrically, the joint density surface looks like a “bell-shaped hill” in
3D as shown in Fig.10. The bivariate normal distribution reaches its peak
at (µX , µY ). The contour plots (a set of concentric closed curves of constant
density) are ellipses, whose orientation depends on ρ.


Figure 10: A graph of the bivariate normal pdf

Let us understand the shape fully. Contours are lines of constant density,
meaning
fX,Y (x, y) = constant.
Once the five parameters are fixed, the only varying term is the quadratic form
in the exponent of the bivariate normal pdf

\[ Q(x, y) = \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}. \]

Setting Q(x, y) = k for some constant k > 0 gives the contour. The equation

\[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} = k \]

is quadratic in x and y. In coordinate geometry, a quadratic equation of the
form
Ax² + Bxy + Cy² = D, with A > 0 and D > 0,
is an ellipse if B² − 4AC < 0. Here, measuring x and y from their means,

\[ A = \frac{1}{\sigma_X^2}, \qquad B = -\frac{2\rho}{\sigma_X \sigma_Y}, \qquad C = \frac{1}{\sigma_Y^2}. \]

We can check

\[ B^2 - 4AC = \frac{4\rho^2}{\sigma_X^2 \sigma_Y^2} - \frac{4}{\sigma_X^2 \sigma_Y^2} = \frac{4(\rho^2 - 1)}{\sigma_X^2 \sigma_Y^2} < 0 \]

because |ρ| < 1 and hence ρ² < 1. Hence the contours are ellipses.


If σX = σY = σ and ρ = 0 (i.e., the variables are independent and have the same
variance), the equation becomes

(x − µX )² + (y − µY )² = kσ²,

which is a circle. Intuitively, ρ = 0 means there is no tendency for the variables
to move together, which makes the ellipse axis-aligned. Additionally, if the
variances are the same, then the spread around the mean is the same in both
directions, which gives the contour a circular shape. Correlated or differently
scaled variables stretch the contours along certain directions, resulting in an
elliptical contour. Broadly, correlation controls the rotation (which way the
ellipse points) while the variances control how elongated the ellipse is. If
σX ≫ σY , the ellipse will be stretched more along the x-axis direction, then tilted
according to ρ. By “tilt” we mean the major axis of the ellipse is rotated by some
angle with respect to the x-axis.
Let us analyse each case geometrically:
1. Unrelated variables with equal variances (ρ = 0 and σX = σY ): The joint
density surface is a symmetric bell aligned with the axes (no tilt) as shown
in Fig.11a and the contours are perfect circles as shown in Fig.11b.

(a) no tilt symmetric bell (b) contours formed are perfect circles

Figure 11: Bell for ρ = 0 and σX = σY

2. Unrelated variables with unequal variances (ρ = 0 and σX ≠ σY ): The
contour is an axis-aligned ellipse. Since ρ = 0, the axes of the ellipse are
aligned with the x- and y-axes, i.e., the major axis (the longer direction of
the ellipse) is parallel to one of the coordinate axes while the minor axis (the
shorter direction) is parallel to the other. If σX > σY the ellipse is stretched
along the x-axis. If σY > σX the ellipse is stretched along the y-axis.

(a) elliptical contours elongated along x axis (σX > σY ) (b) elliptical contours elongated along y axis (σX < σY )

Figure 12: Bell for ρ = 0 and σX ≠ σY

3. Positively related variables with equal variances (ρ > 0 and σX = σY ):
Positive correlation means that as X increases, Y also tends to increase.
This makes the contours ellipses rotated toward the line y = x. Since the
variances are equal, the major axis lies exactly along the line through the
mean (µX , µY ) with slope 1, i.e., the contours are ellipses rotated by 45°.
With the variances held equal, Fig.13 shows the effect of the magnitude of
correlation. A higher correlation magnitude results in elliptical contours
that are narrower along the y = −x direction and vice versa. This is because
positive correlation means that when X increases, Y tends to increase too,
which reinforces variation along the y = x direction, so the ellipse gets longer
in that direction. In the perpendicular direction, i.e., along the y = −x
direction, if we try to move away, the correlation “fights it”.

(a) ρ = 0.4 (b) ρ = 0.9

Figure 13: Bell for ρ > 0 and σX = σY

4. Positively related variables with unequal variances (ρ > 0 and σX ≠ σY ):
The shape is similar to the above case, but the major axis is no longer at
exactly 45°. If σX > σY then the angle of rotation is less than 45° and if
σX < σY then the angle of rotation is more than 45°.

(a) tilted bell ρ > 0 and σX ≠ σY (b) contours formed are ellipses

Figure 14: Bell for ρ > 0 and σX ≠ σY

5. Negatively related variables with unequal variances (ρ < 0 and σX ≠ σY ):
The idea is similar, with the ellipses tilted in the opposite direction, as
shown in Fig.15.

(a) tilted bell ρ < 0 and σX ≠ σY (b) contours formed are ellipses

Figure 15: Bell for ρ < 0 and σX ≠ σY

6. Perfect positive correlation with equal variances (ρ = 1 and σX = σY ):
Since ρ = 1, the major axis is along the line y = x (shifted to pass through
the mean) and the distribution collapses onto this line. The contour is no
longer an ellipse but just a line on the 2D XY plane. The formula for the joint
pdf given above no longer holds (it requires |ρ| < 1); instead

\[ f_{X,Y}(x, y) = f_X(x)\, \delta\!\left( y - \left[ \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X) \right] \right), \]

where the second factor is the Dirac delta function. If we project onto the
X-axis we see the spread of X, a 1D Gaussian. If we project
onto the Y -axis we see the spread of Y , also a 1D Gaussian.
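
The tilt described in the cases above can be made quantitative. The sketch below uses a standard result that is not derived in these notes: the major axis of the contour ellipses points along the leading eigenvector of the covariance matrix, whose angle θ with the x-axis satisfies tan 2θ = 2ρσX σY /(σX² − σY²). The function name contour_tilt_degrees is an illustrative choice.

```python
import math


def contour_tilt_degrees(sigma_x, sigma_y, rho):
    """Angle of the major axis of the contour ellipses, in degrees from the x-axis.

    Uses theta = 0.5 * atan2(2 * rho * sx * sy, sx^2 - sy^2), the orientation of
    the leading eigenvector of the covariance matrix (assumed standard result).
    For rho = 0 and sigma_x = sigma_y the contours are circles and the angle is
    not meaningful.
    """
    theta = 0.5 * math.atan2(2 * rho * sigma_x * sigma_y,
                             sigma_x ** 2 - sigma_y ** 2)
    return math.degrees(theta)


print(contour_tilt_degrees(2.0, 1.0, 0.0))   # case 2, stretched along x
print(contour_tilt_degrees(1.0, 1.0, 0.5))   # case 3, equal variances
print(contour_tilt_degrees(2.0, 1.0, 0.5))   # case 4 with sigma_x > sigma_y
print(contour_tilt_degrees(1.0, 2.0, 0.5))   # case 4 with sigma_x < sigma_y
print(contour_tilt_degrees(2.0, 1.0, -0.5))  # case 5, negative correlation
```

Under this formula, equal variances with ρ > 0 give 45°, σX > σY gives an angle below 45°, and σX < σY gives an angle above 45°, in line with cases 3 and 4.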
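Let us now understand the relation between the two variables when ρ = 0.
If ρ = 0, X and Y are fully independent (their joint pdf factors into the product
of two normals):

\begin{align*}
f(x, y) &= \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - 0^2}} \exp\left\{ -\frac{1}{2(1 - 0^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - 0 + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \\
&= \frac{1}{2\pi \sigma_X \sigma_Y} \exp\left\{ -\frac{1}{2} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \\
&= \frac{1}{\sqrt{2\pi}\, \sigma_X} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_X)^2}{\sigma_X^2} \right\} \cdot \frac{1}{\sqrt{2\pi}\, \sigma_Y} \exp\left\{ -\frac{1}{2} \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right\} \\
&= f_X(x) \cdot f_Y(y).
\end{align*}

Here,

\[ f_X(x) = \frac{1}{\sqrt{2\pi}\, \sigma_X} \exp\left\{ -\frac{1}{2} \frac{(x - \mu_X)^2}{\sigma_X^2} \right\} \quad \text{is the pdf of } X \sim N(\mu_X, \sigma_X^2), \]

\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\, \sigma_Y} \exp\left\{ -\frac{1}{2} \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right\} \quad \text{is the pdf of } Y \sim N(\mu_Y, \sigma_Y^2). \]

In the case of the bivariate normal distribution when ρ = 0, the joint pdf
factorizes as f (x, y) = fX (x) fY (y), which is exactly the condition for true
independence. It even rules out nonlinear dependence.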

Conditional Distribution in a Bivariate Normal We have a bivariate
normal distribution of (X, Y ). Jointly, it forms a “bell” over the 2D XY plane.
Suppose we fix X = x and ask: how is Y distributed given this information? In
other words, given that we know X, what uncertainty remains in Y ? That is, we are
interested in the conditional distribution Y | X = x.
Picture the joint bell surface in 3D. If you take a vertical “slice” at X = x,
you cut the bell along that line. The resulting cross-section curve in the
Y-direction is, after normalization, a 1D Gaussian (normal curve). Thus,
conditionals of a bivariate normal are themselves normal.
The formula for the conditional mean is

\[ \mu_{Y|x} = \mu_Y + \rho\, \frac{\sigma_Y}{\sigma_X} (x - \mu_X). \]

This is linear in x. If x > µX (X is above its average), then the expected
value of Y also shifts upward (if ρ > 0) or downward (if ρ < 0). The line of
conditional means is the regression line of Y on X. If ρ = 0, knowing X doesn’t
change the expectation of Y .
The conditional variance is

\[ \sigma^2_{Y|x} = (1 - \rho^2)\, \sigma_Y^2. \]

The formula shows that the conditional variance does not depend on x.
The higher |ρ|, the smaller the variance. If ρ = 0, the conditional variance is
simply σY². If |ρ| → 1, the variance collapses toward 0 (almost perfect prediction
of Y from X). When |ρ| = 1, knowing X leaves no uncertainty in Y ,
i.e., Y is completely determined by X. The curve at X = x is no longer a bell
curve; it collapses to a single point along Y .
In short, if correlation is strong (|ρ| ≈ 1), once we know X, Y is almost
determined, i.e., very little conditional variance. If on the other hand correlation
is weak (ρ ≈ 0), knowing X doesn’t help much, i.e., conditional variance is
basically the same as the marginal variance of Y .
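
The two conditional formulas translate directly into code. The sketch below is illustrative only; the function name conditional_normal and the height/weight parameter values are hypothetical and chosen just to show the arithmetic.

```python
def conditional_normal(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Mean and variance of Y | X = x for a bivariate normal (X, Y)."""
    cond_mean = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
    cond_var = (1 - rho ** 2) * sigma_y ** 2   # does not depend on x
    return cond_mean, cond_var


# Hypothetical parameters: height in cm (X) and weight in kg (Y).
mu_x, mu_y = 170.0, 65.0
sigma_x, sigma_y = 10.0, 8.0

for rho in (0.0, 0.6, 0.95):
    m, v = conditional_normal(180.0, mu_x, mu_y, sigma_x, sigma_y, rho)
    print(rho, m, v)
```

With ρ = 0 the conditional mean stays at µY and the variance stays at σY²; as |ρ| grows, the mean shifts further for the same x and the variance shrinks.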

Multivariate Normal Distribution The bivariate normal distribution can
be generalized to the multivariate normal distribution. Its density function
is quite complicated, and the only way to write it compactly is to employ matrix
notation. If a collection of variables has this distribution, then the marginal
distribution of any single variable is normal, the conditional distribution of any
single variable given values of the other variables is normal, the joint marginal
distribution of any pair of variables is bivariate normal, and the joint marginal
distribution of any subset of three or more of the variables is again multivariate
normal.

7 Bridging Probability and Statistics
Data collection is a crucially important step in statistics. In real-world statistics,
we often collect data from samples drawn from a population, for example, by
measuring the fuel efficiency of randomly chosen cars. Let us say a first sample
of 3 cars gives x1 = 30.7, x2 = 29.4, x3 = 31.1 and a second (different) sample
gives x1 = 28.8, x2 = 30.0, x3 = 32.5. The numbers change from sample to
sample because each sample is chosen randomly. This means that before we collect
any data the values x1 , x2 , . . . , xn are unknown and random in nature. So we
treat them as random variables, denoted
X1 , X2 , . . . , Xn .
Since the sample values vary, any function we compute from the sample like
the sample mean X̄, the sample standard deviation s, or other statistics, will
also vary from sample to sample. That is why we think of the sample mean
X̄ as a random variable before we observe any data. This is the foundation
for concepts such as sampling distributions, standard error, and ultimately, the
Central Limit Theorem.
More formally, a population consists of all units of interest. Any numerical
characteristic of a population is a parameter. A parameter describes the
population. A sample consists of observed units collected from the population.
It is used to make statements about the population. Any function of a sample
is called a statistic. A statistic varies from sample to sample because different
samples give different numbers.
Prior to obtaining data, there is uncertainty as to what value of any particular
statistic will result. Therefore, a statistic is a random variable and will
be denoted by an uppercase letter; a lowercase letter is used to represent the
calculated or observed value of the statistic. Thus the sample mean, regarded as
a statistic (before a sample has been selected or an experiment carried out), is
denoted by X̄; the calculated value of this statistic is x̄. Similarly, S represents
the sample standard deviation thought of as a statistic, and its computed value
is s. Think of a population as a big jar of chocolates, where each chocolate has
some weight. A parameter can be the true average weight of all chocolates in the jar
(unknown unless we weigh every chocolate). A statistic can be the average weight
of a handful of chocolates we randomly pick (it can change if we pick a different
handful).
Every statistic we compute from a sample, like the sample mean X̄ or the
sample variance S², is based on random data. Since the data are random, the
statistic itself is also a random variable, and hence it has a probability
distribution. Suppose that n = 2 components are randomly selected and the number
of breakdowns while under warranty is determined for each one. Let X1 be the
number of breakdowns for the first component and X2 be for the second. Then
the sample mean is

\[ \bar{X} = \frac{X_1 + X_2}{2}. \]

Possible values for X̄ include 0, if X1 = X2 = 0; 0.5, if X1 = 0, X2 = 1
or vice versa; and 1, 1.5, . . ., depending on the values of X1 and X2 . Thus, the
probability distribution of X̄ includes

P (X̄ = 0), P (X̄ = 0.5), P (X̄ = 1), etc.

From this distribution, other probabilities such as P (1 ≤ X̄ ≤ 3) or P (X̄ ≥ 2.5)
can be computed.
Suppose each of the two observations, X1 and X2 , is randomly drawn from
the set {40, 45, 50}. Then, the sample space for ordered pairs (X1 , X2 ) consists
of 3 × 3 = 9 combinations

(40, 40), (40, 45), (40, 50),


(45, 40), (45, 45), (45, 50),
(50, 40), (50, 45), (50, 50)

For a sample of size 2, the formula for the sample variance is

\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{2 - 1} \sum_{i=1}^{2} (X_i - \bar{X})^2 = (X_1 - \bar{X})^2 + (X_2 - \bar{X})^2. \]

(A detailed explanation of this formula is given below.) Let us compute a few
cases starting with (40, 40).

\[ \bar{X} = \frac{40 + 40}{2} = 40, \qquad S^2 = (40 - 40)^2 + (40 - 40)^2 = 0 + 0 = 0. \]

For (40, 50)

\[ \bar{X} = \frac{40 + 50}{2} = 45, \qquad S^2 = (40 - 45)^2 + (50 - 45)^2 = 25 + 25 = 50. \]

For (40, 45)

\[ \bar{X} = \frac{40 + 45}{2} = 42.5, \qquad S^2 = (40 - 42.5)^2 + (45 - 42.5)^2 = 6.25 + 6.25 = 12.5. \]

If we compute for all 9 combinations, we get only 3 possible sample variance
values S² = 0, S² = 12.5, S² = 50. Thus, the probability distribution of S²
includes
P (S² = 0), P (S² = 12.5), P (S² = 50).
The probability distribution of a statistic is sometimes referred to as its
sampling distribution to emphasize that it describes how the statistic varies
in value across all samples that might be selected.
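
The sampling distributions of X̄ and S² for this small example can be enumerated directly. The sketch below ASSUMES that the three values 40, 45, 50 are equally likely and that the two draws are independent; the notes do not state the probabilities, so the resulting numbers are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

values = [40, 45, 50]
# ASSUMPTION: each value has probability 1/3 and draws are independent,
# so each ordered pair (x1, x2) has probability 1/9.
pair_prob = Fraction(1, 9)

mean_dist = Counter()   # sampling distribution of the sample mean
var_dist = Counter()    # sampling distribution of the sample variance

for x1, x2 in product(values, repeat=2):
    x_bar = Fraction(x1 + x2, 2)
    s2 = (x1 - x_bar) ** 2 + (x2 - x_bar) ** 2   # n - 1 = 1 for n = 2
    mean_dist[x_bar] += pair_prob
    var_dist[s2] += pair_prob

for value, prob in sorted(var_dist.items()):
    print("P(S^2 =", float(value), ") =", prob)
```

Under the equal-likelihood assumption, the three possible values of S² found above receive probabilities 3/9, 4/9 and 2/9.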
Note. Given a sample, the observed values fall, on average, closer to
the sample mean than to the population mean. Although a variance computed
with n in the denominator can overestimate or underestimate the population
variance for a particular sample, its expected value across all samples is
less than the population variance. Consider the population
{1, 2, 3, 4, 5} with mean

\[ \mu = \frac{1 + 2 + 3 + 4 + 5}{5} = 3. \]

The population variance is

\[ \sigma^2 = \frac{1}{5} \sum_{i=1}^{5} (X_i - \mu)^2 = \frac{(1 - 3)^2 + (2 - 3)^2 + (3 - 3)^2 + (4 - 3)^2 + (5 - 3)^2}{5} = 2. \]

Take the sample (1, 2) with sample mean

\[ \bar{X} = \frac{1 + 2}{2} = 1.5. \]

The sample variance (using n in the denominator) is

\[ s^2 = \frac{1}{2} \sum_{i=1}^{2} (X_i - \bar{X})^2 = \frac{(1 - 1.5)^2 + (2 - 1.5)^2}{2} = \frac{0.25 + 0.25}{2} = 0.25. \]

Here, the sample variance is s² = 0.25 and the population variance is σ² = 2. Thus,
the sample variance turned out to be an underestimate of the population variance. In
general, the expected value of this sample variance across all samples is less than
the population variance. Intuitively, the deviations Xi − X̄ from the sample mean
are smaller on average than deviations from the true population mean µ. That
is why dividing the sum of squared deviations by n gives a number that is, on
average, smaller than the population variance.

Sample (X1 , X2 )    X̄ = (X1 + X2 )/2    s² = [(X1 − X̄)² + (X2 − X̄)²]/2
(1,2)                1.5                 0.25
(1,3)                2                   1
(1,4)                2.5                 2.25
(1,5)                3                   4
(2,3)                2.5                 0.25
(2,4)                3                   1
(2,5)                3.5                 2.25
(3,4)                3.5                 0.25
(3,5)                4                   1
(4,5)                4.5                 0.25

Table 3: Sample means and sample variances for all 2-element samples
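
Table 3 can be regenerated with a few lines of code, which also lets us average the s² column. This is a minimal sketch; the variable names are illustrative.

```python
from itertools import combinations

population = [1, 2, 3, 4, 5]
mu = sum(population) / len(population)
sigma2 = sum((x - mu) ** 2 for x in population) / len(population)

rows = []
for x1, x2 in combinations(population, 2):   # the 10 two-element samples of Table 3
    x_bar = (x1 + x2) / 2
    s2_n = ((x1 - x_bar) ** 2 + (x2 - x_bar) ** 2) / 2   # divide by n = 2
    rows.append(((x1, x2), x_bar, s2_n))

avg_s2_n = sum(s2 for _, _, s2 in rows) / len(rows)

for sample, x_bar, s2_n in rows:
    print(sample, x_bar, s2_n)
print("population variance:", sigma2, "average s^2 over samples:", avg_s2_n)
```

The average of the s² column is 1.25, below the population variance σ² = 2, which is the underestimation described in the note.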

Let us prove this formally. We want to compute the expected value of the
sample variance when dividing by n, i.e.,

\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2, \]

and explain why it underestimates the population variance σ².
Recall the formula for the sample mean

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad n > 0. \]

We can write
Xi − X̄ = (Xi − µ) − (X̄ − µ),
so
(Xi − X̄)² = (Xi − µ)² − 2(Xi − µ)(X̄ − µ) + (X̄ − µ)².

Summing over i,

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) + n(\bar{X} - \mu)^2. \]

But note

\[ \sum_{i=1}^{n} (X_i - \mu) = \sum_{i=1}^{n} X_i - n\mu = n\bar{X} - n\mu = n(\bar{X} - \mu), \]

so

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \cdot n(\bar{X} - \mu) + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2. \]

Divide by n

\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - (\bar{X} - \mu)^2. \]

Taking expectations

\[ E[S_n^2] = \frac{1}{n} \sum_{i=1}^{n} E[(X_i - \mu)^2] - E[(\bar{X} - \mu)^2]. \]

Since E[(Xi − µ)²] = σ² and E[(X̄ − µ)²] = Var(X̄) = σ²/n *, we get

\[ E[S_n^2] = \frac{1}{n} \cdot n\sigma^2 - \frac{\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n} = \frac{n - 1}{n}\, \sigma^2. \]

Thus

\[ E[S_n^2] = \frac{n - 1}{n}\, \sigma^2 < \sigma^2, \quad \text{since } n > 0, \]

so dividing by n underestimates the true variance. If instead we divide by n − 1,
then
E[S²] = σ²,
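making it an unbiased estimator as shown below. Let

\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

Using the identity computed above

\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \]

we can write

\[ S^2 = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \right]. \]

Take expectation

\[ E[S^2] = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} E[(X_i - \mu)^2] - n\, E[(\bar{X} - \mu)^2] \right]. \]

Each term is

\[ E[(X_i - \mu)^2] = \sigma^2, \qquad E[(\bar{X} - \mu)^2] = \mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n}. \]

So,

\[ E[S^2] = \frac{1}{n - 1} \left( n\sigma^2 - n \cdot \frac{\sigma^2}{n} \right) = \frac{1}{n - 1} (n\sigma^2 - \sigma^2) = \frac{n - 1}{n - 1}\, \sigma^2 = \sigma^2. \]

* Since the Xi are independent,
\[ \mathrm{Var}(\bar{X}) = \mathrm{Var}\!\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = \left( \frac{1}{n} \right)^{2} \mathrm{Var}\!\left( \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{1}{n^2} \cdot n\sigma^2 = \frac{\sigma^2}{n}. \]
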
Thus, the sample variance S² is an unbiased estimator of the population
variance. The sample variance has two interpretations

1. Descriptive. It measures the spread of the observed data around the
sample mean

\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

2. Inferential. It is viewed as a statistic used to estimate the population
variance σ²,

\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad E[S^2] = \sigma^2, \]

so S² is an unbiased estimator of σ².

Using n in the denominator underestimates σ², while using n − 1 corrects
for this bias.
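
The two expectations derived above can be checked exactly on a small case. The sketch below enumerates all ordered samples of size n = 2 drawn with replacement from {1, 2, 3, 4, 5}, which matches the independence assumption used in the proof; exact fractions are used to avoid rounding. The variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 4, 5]
n = 2
N = len(population)

mu = Fraction(sum(population), N)
sigma2 = sum((x - mu) ** 2 for x in population) / N   # population variance

# Independent draws with replacement: all N**n ordered samples are equally likely.
samples = list(product(population, repeat=n))


def sum_sq_dev(sample):
    """Sum of squared deviations of a sample from its own mean."""
    x_bar = Fraction(sum(sample), len(sample))
    return sum((x - x_bar) ** 2 for x in sample)


# Expected value of the variance when dividing by n and by n - 1.
e_s2_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
e_s2_n1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, e_s2_n, e_s2_n1)
```

Here σ² = 2, the divide-by-n version averages to (n − 1)/n · σ² = 1, and the divide-by-(n − 1) version averages to exactly σ² = 2.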

The correction in the sample variance formula has a larger proportional
effect when the sample size n is small than when it is large. This is desirable
because for small n, the sample mean X̄ is likely to be a poor estimator of
the population mean µ, so the uncorrected variance (dividing by n) tends to
underestimate the true population variance. Specifically, for small n, using
n − 1 instead of n substantially increases the computed variance, correcting for
the bias. As n grows, X̄ approaches µ and the ratio n/(n − 1) approaches 1, so the
difference between dividing by n and n − 1 becomes negligible. When the sample is
the entire population, X̄ = µ, so no correction is needed and we divide by n.

7.1 Random Samples


A simple random sample of size n is a foundational concept in statistics. We
say that random variables X1 , X2 , . . . , Xn form such a sample if

1. Each Xi is independent of the others, i.e., the outcome of one observation
does not affect the others. For example, while measuring the heights of
10 randomly chosen people, knowing the height of the 1st person doesn’t
tell us anything about the 2nd. If the variables aren’t independent, then the
estimates made about the population may be biased or misleading.
2. Each Xi comes from the same probability distribution, i.e., they are all
“copies” of the same underlying process. Formally,

X1 , X2 , . . . , Xn ∼ F,

where F is a single distribution with parameters (e.g. mean µ, variance
σ², etc.).
This does not strictly mean that each Xi must come from the same physical
population, but rather that they are governed by the same statistical
distribution. Two common perspectives are:
• Statistical (population) viewpoint: All Xi are independent draws
from a single population with distribution F . Example: sampling
heights of students from Delhi University; each Xi comes from the
same student population.
• Probabilistic (distributional) viewpoint: More abstractly, the
only requirement is that all Xi follow the same distribution F ,
regardless of whether there is a real-world “population.” Example:
generating X1 , . . . , Xn by simulating independent draws from a N (0, 1)
distribution.
Thus, “same distribution” in i.i.d. sampling means that the Xi ’s are
governed by the same probability law, whether or not we refer to an actual
population.
Continuing with the example above, each Xi might follow a normal dis-
tribution with mean µ = 170 cm and standard deviation σ = 10 cm.
When the sample is taken properly (random, independent, identically dis-
tributed), the sample mean X̄n tends to be close to the population mean
µ, with only small deviations, especially as the sample size grows
X1 + X2 + · · · + Xn P
X̄n = −
→ µ.
n
If, however, the samples do not come from the same distribution, then combining
them (e.g., via the sample mean) may not reflect any one population.
To see the convergence in the i.i.d. case, let us say we roll a fair 6-sided die.
The population mean is µ = 3.5. Let the first roll give 2, resulting in a sample
mean of 2. After 2 rolls, (2, 5), the sample mean is 3.5. After 3 rolls, (2, 5, 4),
the sample mean is 3.67. After 1000 rolls the sample mean might be approximately
3.51, and after 10000 rolls approximately 3.5002. The more rolls (larger n), the
closer the sample mean tends to get to the true mean 3.5.
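The die-rolling illustration above can be reproduced with a short simulation. This is a minimal sketch, not part of the original notes: the helper name `running_means` and the fixed seed are assumptions made only for illustration, and the printed values will differ from the illustrative numbers in the text.

```python
import random

def running_means(num_rolls, seed=0):
    """Return the sample mean after each of num_rolls fair die rolls."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    total = 0
    means = []
    for n in range(1, num_rolls + 1):
        total += rng.randint(1, 6)  # one roll of a fair six-sided die
        means.append(total / n)     # sample mean of the first n rolls
    return means

means = running_means(10_000)
for n in (1, 10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: sample mean = {means[n - 1]:.4f}")
```

The early sample means fluctuate noticeably, while the later ones settle near 3.5.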
In short, the individual random variables Xi are independent and identically
distributed (iid).
If sampling is either with replacement or from an infinite population, these
conditions are satisfied. But in real life, especially in surveys or experiments we
often sample without replacement (i.e., we don’t select the same person or object
twice). Technically, this breaks the independence assumption. For example, if
we already picked one person, the chance of picking them again is zero. That
means the probability distribution for the second draw depends on the outcome
of the first draw.
If the population size is very large compared to the sample size, removing one
element hardly changes the probabilities for the next draw. To see this, suppose
N = 1,000,000. On the first draw, the probability of selecting any particular
person is 1/1,000,000. On the second draw (without replacement), the probability
of selecting a different specific person is 1/999,999. Since 1/999,999 ≈ 1/1,000,000,
the change in probability is microscopically small. Although the observations are
technically dependent, the effect is so minor that we can safely treat them as
independent in practice. Specifically, if the sample size is 5% or less of the
whole population, then we can treat it like a true random sample, even if it is
technically not. We can then proceed to use all the powerful statistical tools to
analyse the population.
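The size of this effect can be checked with exact arithmetic. The sketch below (the function name `relative_change` is a hypothetical helper, not from the notes) computes how much the selection probability of a specific remaining person changes between the first and second draw without replacement.

```python
from fractions import Fraction

def relative_change(N):
    """Relative change in a specific person's selection probability
    between the first draw (1/N) and the second draw (1/(N-1))."""
    p_first = Fraction(1, N)
    p_second = Fraction(1, N - 1)
    return float((p_second - p_first) / p_first)

for N in (20, 1_000, 1_000_000):
    print(f"N = {N:>9}: relative change = {relative_change(N):.8f}")
```

For a small population the change is several percent, whereas for N = 1,000,000 it is about one part in a million, which is why independence is a reasonable working assumption for large populations.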

7.2 Law of Large Numbers (LLN)


In the 16th century, the Italian mathematician Gerolamo Cardano observed what
would later become known as the Law of Large Numbers: the accuracy of empirical
observations tends to improve as the number of trials increases. It was the
Swiss mathematician Jacob Bernoulli who gave the first proof, published in 1713,
of what Cardano had observed more than a century earlier. Bernoulli recognized
the intuitive nature of the problem as well as its importance and spent twenty
years formulating a complicated proof for the case of a binary random variable.
Bernoulli's result states that, when estimating an unknown proportion, any
degree of accuracy can be achieved through an appropriate number of trials.
The name “The Law of Large Numbers” was not coined until 1837, by the French
mathematician Siméon Denis Poisson. Roughly stated, “as the size of a random
sample increases, the sample average tends to get closer and closer to the
population mean.” In simple terms, if we repeat an experiment many times, the
average result will eventually stabilize around the expected (true) value.
Imagine flipping a fair coin with the expected proportion of heads being 0.5.
If we flip it 10 times we might get 7 heads or 4 heads; there is more variability.
But if we flip it 1,000 or 10,000 times, the proportion of heads will very likely
be close to 0.5.
The weak law of large numbers states

\[ \lim_{n \to \infty} \bar{X}_n = \mu \quad \text{(in probability)}. \]

That is, for any ε > 0,

\[ \lim_{n \to \infty} P\left( \left| \bar{X}_n - \mu \right| > \varepsilon \right) = 0. \]

This means that the probability that the sample mean X̄n differs from the
population mean µ by more than a small positive number ε goes to zero as the
sample size n increases.
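The statement can be explored empirically for the coin example. The following is a rough sketch under stated assumptions: the tolerance ε = 0.1, the number of repetitions, and the helper name `deviation_probability` are all choices made for illustration, not part of the notes.

```python
import random

def deviation_probability(n, eps=0.1, trials=2000, seed=1):
    """Estimate P(|proportion of heads - 0.5| > eps) for n fair coin flips."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        heads = sum(rng.random() < 0.5 for _ in range(n))  # number of heads in n flips
        if abs(heads / n - 0.5) > eps:
            count += 1
    return count / trials

probs = {n: deviation_probability(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:>4}: estimated P(|X̄n - 0.5| > 0.1) = {p:.4f}")
```

The estimated probability of a large deviation shrinks as n grows, which is the behaviour the weak law describes.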

7.3 Central Limit Theorem
If the individual data points Xi follow a normal distribution, then the sample
mean X̄ also follows a normal distribution, regardless of sample size n. This is
a special property of the normal distribution; it is “stable” under averaging.
Let X1, X2, ..., Xn be i.i.d. random variables with

\[ X_i \sim N(\mu, \sigma^2). \]

(To rephrase: let Xi be a random variable representing a single draw from
a population with mean µ. If we repeatedly draw Xi and observe outcomes
$x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(N)}$, then the long-run average of these
draws converges to the population mean

\[ E[X_i] = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} = \mu, \]

where $x_i^{(j)}$ is the outcome of the i-th random variable in the j-th repetition
of the experiment. Even though any single $x_i^{(j)}$ may differ from µ, the
long-run average of all draws approaches µ.
This is consistent with the Law of Large Numbers (LLN), which states that
the sample mean of repeated independent draws converges to the population
mean as the number of draws grows.)
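Resuming the discussion on CLT, the moment generating function (MGF)
of a single Xi is (Sec. 5.5.1)

\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2} \sigma^2 t^2 \right). \]

For the sum Sn = X1 + X2 + · · · + Xn, independence gives (refer to the property
discussed in Sec. 5.5.1)

\[ M_{S_n}(t) = \left[ M_X(t) \right]^n
   = \left[ \exp\left( \mu t + \tfrac{1}{2} \sigma^2 t^2 \right) \right]^n
   = \exp\left( n\mu t + \tfrac{1}{2} n\sigma^2 t^2 \right). \]

But this is exactly the MGF of a normal distribution

\[ M_Y(t) = \exp\left( \mu_Y t + \tfrac{1}{2} \sigma_Y^2 t^2 \right), \]

with parameters µY = nµ and σY² = nσ². Hence

\[ S_n \sim N(n\mu, n\sigma^2) \quad \text{and} \quad
   \bar{X} = \frac{S_n}{n} \sim N\left( \mu, \frac{\sigma^2}{n} \right), \]

since

\[ E[\bar{X}] = E\left[ \frac{S_n}{n} \right] = \frac{1}{n} E[S_n] = \frac{1}{n}(n\mu) = \mu
   \quad \text{and} \quad
   \mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{S_n}{n} \right)
   = \frac{1}{n^2} \mathrm{Var}(S_n) = \frac{1}{n^2}(n\sigma^2) = \frac{\sigma^2}{n}. \]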
This shows a remarkable property: if the population is normal, then the
sample mean is exactly normal, no matter the sample size n.
But even if the population distribution is not normal, when we compute the
average of many such values, the result still tends to look bell-shaped (i.e.,
normal-like). This happens even if the original data is skewed, bimodal, or
otherwise far from normal. Let us analyse this case below.
Let X1, X2, ..., Xn be independent and identically distributed (i.i.d.) random
variables with population mean µ = E[Xi] and population variance σ² = Var(Xi).
The sample mean is

\[ \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i. \]

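We are interested in the distribution of X̄n as n grows large. Standardize the
independent variables by defining

\[ Y_i = \frac{X_i - \mu}{\sigma}, \quad i = 1, 2, \ldots, n. \]

Then each Yi has mean 0 and variance 1. Likewise standardize the sample mean

\[ Z_n = \frac{\bar{X}_n - \mu}{\sigma / \sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i. \]

Now consider the MGF of Zn

\[ M_{Z_n}(t) = E\left[ e^{t Z_n} \right]
   = E\left[ \exp\left( \frac{t}{\sqrt{n}} \sum_{i=1}^{n} Y_i \right) \right]. \]

Pull the factor inside the sum

\[ = E\left[ \exp\left( \sum_{i=1}^{n} \frac{t}{\sqrt{n}} Y_i \right) \right]. \]

Since exp(a + b) = exp(a) exp(b)

\[ = E\left[ \prod_{i=1}^{n} \exp\left( \frac{t}{\sqrt{n}} Y_i \right) \right]. \]

Using independence of the Yi, expectation of product = product of expectations

\[ = \prod_{i=1}^{n} E\left[ \exp\left( \frac{t}{\sqrt{n}} Y_i \right) \right]. \]

Since the Yi are from an identical distribution

\[ = \left( E\left[ \exp\left( \frac{t}{\sqrt{n}} Y_1 \right) \right] \right)^n
   = \left( M_{Y_1}\left( \frac{t}{\sqrt{n}} \right) \right)^n. \]

Thus

\[ M_{Z_n}(t) = \left( M_{Y_1}\left( \frac{t}{\sqrt{n}} \right) \right)^n. \]

We let

\[ u = \frac{t}{\sqrt{n}} \]

and express the Taylor expansion of the MGF of a standardized random variable
Y1 around u = 0:

\[ M_{Y_1}(u) = 1 + \underbrace{E[Y_1]}_{=0}\, u + \frac{1}{2} \underbrace{E[Y_1^2]}_{=1}\, u^2
   + \frac{1}{6} E[Y_1^3]\, u^3 + \cdots = 1 + \frac{1}{2} u^2 + o(u^2), \]

where Y1 is standardized: E[Y1] = 0 and E[Y1²] = 1 (since 1 = Var(Y1) =
E[Y1²] − E[Y1]² = E[Y1²] − 0 = E[Y1²]). The little-o notation o(u²) represents
terms that are of smaller order than u² as u → 0. Then,

\[ M_{Y_1}\left( \frac{t}{\sqrt{n}} \right)
   = 1 + \frac{1}{2} \left( \frac{t}{\sqrt{n}} \right)^2 + o\left( \left( \frac{t}{\sqrt{n}} \right)^2 \right)
   = 1 + \frac{t^2}{2n} + o\left( \frac{1}{n} \right). \]
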
Taking the limit as n → ∞

\[ \lim_{n \to \infty} M_{Z_n}(t)
   = \lim_{n \to \infty} \left( M_{Y_1}\left( \frac{t}{\sqrt{n}} \right) \right)^n
   = \lim_{n \to \infty} \left( 1 + \frac{t^2}{2n} + o\left( \frac{1}{n} \right) \right)^n
   = e^{t^2/2}. \]

This is the MGF of the standard normal distribution N(0, 1) (substitute µ = 0
and σ² = 1 in the MGF of a normal distribution
$M_X(t) = \exp\left( \mu t + \tfrac{1}{2} \sigma^2 t^2 \right)$).
In short, the higher-order terms collected in o(1/n) (which carry, for example,
the skewness) shrink faster than 1/n and hence become negligible as n grows.
Therefore, the distribution of the standardized sample mean depends only on the
mean and variance in the limit. Any skewness or kurtosis in the original
distribution does not affect the limiting normal distribution, which is why the
sample mean tends to be normal even if the population is not.
Thus, for large n, X̄ is approximately N(µ, σ²/n) even if the population
distribution with mean µ and variance σ² is not normal.
Even if the population is messy and non-normal, if the sample size is large
enough, we can still use the normal distribution to approximate the behavior
of the sample mean. It will still tend to the population mean (guaranteed by
the Law of Large Numbers) even if the population distribution is non-normal.
This is an incredibly powerful concept and is the foundation for most methods
in statistical inference (confidence intervals, hypothesis tests, etc.). Hence this
is formalised as a theorem called the Central Limit Theorem.

Figure 16: Illustration of Central Limit Theorem

Central Limit Theorem (CLT) Let X1, X2, ..., Xn be a random sample
from a distribution with mean µ and variance σ². Then, if n is sufficiently large,
the sample mean X̄ has approximately a normal distribution with

\[ \mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma_{\bar{X}}^2 = \frac{\sigma^2}{n}, \]

and the total $T = \sum_{i=1}^{n} X_i$ also has approximately a normal distribution with

\[ \mu_T = n\mu \quad \text{and} \quad
   \sigma_T^2 = \mathrm{Var}\left( \sum_{i=1}^{n} X_i \right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\sigma^2. \]

The larger the value of n, the better the approximation.
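A quick simulation can make the theorem tangible for a clearly non-normal population. The sketch below assumes an Exponential(1) population (mean 1, standard deviation 1), a sample size of 50, and a helper named `standardized_means`; all three are illustrative choices rather than anything prescribed by the notes.

```python
import math
import random

def standardized_means(n, trials=5000, seed=2):
    """Draw `trials` samples of size n from Exponential(1) and return
    the standardized sample means Z_n = (X̄ - mu) / (sigma / sqrt(n))."""
    rng = random.Random(seed)
    mu, sigma = 1.0, 1.0  # mean and standard deviation of Exponential(1)
    zs = []
    for _ in range(trials):
        xbar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        zs.append((xbar - mu) / (sigma / math.sqrt(n)))
    return zs

zs = standardized_means(n=50)
inside = sum(abs(z) <= 1.96 for z in zs) / len(zs)
print(f"fraction of Z_n within ±1.96: {inside:.3f} (normal theory: about 0.95)")
```

Although the exponential distribution is strongly skewed, roughly 95% of the standardized sample means fall within ±1.96, as the normal approximation predicts.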

Central Limit Theorem is one of the most powerful results in probability and
statistics with profound implications. It allows us to use the normal distribution,
which is mathematically simple and well-understood, to make inferences about
population parameters using sample data. The normal distribution has desirable
properties such as symmetry and is completely characterized by just two
parameters (mean and variance). This makes it the go-to choice in simulations
and modeling. So, irrespective of the population, software designed for one
population can be reused for another, possibly with minor changes.

Example 7.1. A disk has free space of 330 megabytes. Is it likely to be sufficient
for 300 independent images, if each image has expected size of 1 megabyte with
a standard deviation of 0.5 megabytes?

The sample space S consists of all possible combinations of 300 image sizes

\[ S = \{ (x_1, x_2, \ldots, x_{300}) \in \mathbb{R}^{300} : x_i \in \text{possible image sizes} \}. \]

Each outcome in the sample space is a 300-dimensional vector representing the
sizes of all images in one realization. Define a random variable Xi as the size of
the i-th image

\[ X_i : S \to \mathbb{R}, \quad \text{where } X_i(x_1, x_2, \ldots, x_{300}) = x_i. \]

Each random variable has E[Xi] = 1 and V[Xi] = 0.25.
Assume that the samples X1, X2, ..., X300 are independent and identically
distributed (i.i.d.). Define the total size of all 300 images as

\[ T = X_1 + X_2 + \cdots + X_{300}. \]

This is also a random variable on the same sample space. We are interested in
computing P(T ≤ 330), the probability that all 300 images fit within 330 MB.
Since the individual image sizes are independent and identically distributed
(i.i.d.), and n = 300 is large, we can apply the Central Limit Theorem

\[ T \approx N(n\mu, n\sigma^2). \]

That is

\[ T \approx N(300, 300 \times 0.5^2) = N(300, 75). \]

We standardize to find the z-score

\[ Z = \frac{T - n\mu}{\sigma\sqrt{n}} = \frac{330 - 300}{0.5\sqrt{300}}
   \approx \frac{30}{0.5 \times 17.32} \approx \frac{30}{8.66} \approx 3.46. \]

Now, using the standard normal table

\[ P(T \le 330) \approx P(Z \le 3.46) = \Phi(3.46) \approx 0.9997. \]

There is a 99.97% chance that 300 images will fit in 330 MB. Hence the disk
space is very likely sufficient. □
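The arithmetic of Example 7.1 can be checked without a normal table. This is a small sketch that assumes the CLT approximation used in the example; `normal_cdf` is a helper written here via the error function, not a function from the notes.

```python
import math

def normal_cdf(z):
    """Standard normal CDF Φ(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, mu, sigma, capacity = 300, 1.0, 0.5, 330.0
z = (capacity - n * mu) / (sigma * math.sqrt(n))  # z-score of the total size
p_fit = normal_cdf(z)                             # approximate P(T <= 330)
print(f"z = {z:.2f}, P(T <= 330) is approximately {p_fit:.4f}")
```

The computed values agree with the z-score of about 3.46 and the probability of about 0.9997 obtained above.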

Example 7.2. Let X be the number of different people sent text messages
during a particular day by a randomly selected student at a large university.
Suppose the mean value of X is 7 and the standard deviation is 6 (values very
close to those reported in the article “Cell Phone Use and Grade Point Average
Among Undergraduate University Students”, College Student J., 2011: 544–551).
Among 100 randomly selected such students, how likely is it that the sample
mean number of different people texted exceeds 5?

The experiment is randomly selecting one student and counting how many
different people she sent messages to in a day. So, the sample space is S =
{0, 1, 2, . . .}. This is a set of non-negative integers with no fixed upper bound,
but in reality, there is a practical maximum (e.g. maybe no more than 100–200
people).
The random variable X is defined as “X is the number of different people
messaged in a day by a randomly chosen student” where X : S → Z≥0 . It assigns
to each student (each outcome in the population) a number x ∈ {0, 1, 2, . . . }.
Now, if we repeat the experiment n = 100 times (selecting 100 students
independently as mentioned in the question), we get 100 i.i.d. observations
X1, X2, ..., X100. Each Xi is an independent copy of the same random variable X.
From these, we compute the sample mean

\[ \bar{X} = \frac{1}{100} \sum_{i=1}^{100} X_i. \]

This X̄ is also a random variable (since it depends on random observations)
whose set of possible values is now the set of averages of 100 non-negative integers.
Given mean µ = 7, standard deviation σ = 6, and sample size n = 100, we
want to compute P(X̄ > 5). Since n = 100 is large and the observations are iid,
by the Central Limit Theorem, approximately

\[ \bar{X} \sim N\left( 7, \frac{6^2}{100} \right) = N(7, 0.36). \]

We standardize as

\[ Z = \frac{5 - 7}{6 / \sqrt{100}} = \frac{-2}{0.6} \approx -3.33 \]

to compute the probability

\[ P(\bar{X} > 5) \approx P(Z > -3.33) = \Phi(3.33) \approx 0.9996. \]

There is a 99.96% probability that the sample mean number of people texted
by 100 students exceeds 5. Note: the cited article stated that text messaging
frequency was negatively correlated with GPA. □
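The same calculation can be done with Python's standard library. The sketch below relies on the CLT approximation from the example; the variable names are illustrative assumptions.

```python
from statistics import NormalDist

mu, sigma, n = 7.0, 6.0, 100

# Approximate sampling distribution of the sample mean: N(mu, sigma^2 / n)
xbar_dist = NormalDist(mu=mu, sigma=sigma / n ** 0.5)

z = (5.0 - mu) / xbar_dist.stdev        # z-score of the threshold 5
p_exceeds_5 = 1.0 - xbar_dist.cdf(5.0)  # approximate P(X̄ > 5)
print(f"z = {z:.2f}, P(X̄ > 5) is approximately {p_exceeds_5:.4f}")
```

This reproduces the z-score of about −3.33 and the probability of about 0.9996 found above.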

Challenge A challenge when applying the CLT in practice is knowing how


large the sample size n must be for the approximation to be good. The answer
depends on the shape of the original distribution. If the original distribution is
already close to normal, then even a small sample size (say n = 10 or 20) might
suffice. But if the original distribution is heavily skewed or irregular, we might
need a much larger sample (say n = 50, 100, or more) for the sample mean to
become approximately normal. As a rule of thumb, we can generally apply CLT
safely for n > 30.

For extremely skewed or heavy-tailed distributions, even n = 40 or 50 might
not be enough. However, such distributions are rare. For well-behaved
distributions (e.g., uniform, mildly skewed), the CLT approximation can work
well for n as small as 12. For example, if we are measuring heights (roughly
normal in humans), even n = 10 gives a decent approximation. If we are measuring
income or insurance claims due to catastrophic events (which are skewed), we
may need a much larger n (50 or more) for the CLT to hold well.

A bridge The Central Limit Theorem (CLT) serves as a fundamental bridge


between probability theory and statistical inference. It begins in the realm of
probability, where we consider a set of independent and identically distributed
(iid) random variables with a known mean and variance. In the real world,
we rarely have access to entire populations, so we rely on random samples
to estimate unknown parameters. The CLT states that, as the sample size
increases, the distribution of the sample mean approaches a normal distribution
regardless of the shape of the original population distribution.
While this result is specific to the sample mean, its importance is far broader:
many other statistics (such as sample variance, proportions, regression
coefficients, and maximum likelihood estimators) can be expressed as functions of
sample means. Through generalizations like the multivariate CLT and the Delta
method, these statistics also exhibit asymptotic normality. Thus, the CLT
provides the theoretical foundation for inference methods such as confidence
intervals and hypothesis testing, making it one of the most powerful results in
statistics.

8 Statistics
Statistics helps us make sense of data, and it is broadly divided into two branches:
descriptive and inferential statistics. Descriptive statistics are used to
summarize, organize, and present data in a meaningful way. They include measures
like mean, median, mode, standard deviation, and visual tools such as histograms
and pie charts, all of which help describe the main features of a dataset. For
example, stating that “the average score of a class of 50 students is 72” is a
descriptive statement. It tells us about the data we have, without making any
generalizations beyond that group.
In contrast, inferential statistics allow us to make predictions or draw
conclusions about a larger population based on a smaller sample. They in-
volve estimation techniques, hypothesis testing (like z-tests, t-tests, or chi-square
tests), regression analysis, and more. For instance, estimating that “60% of a
city’s population supports a candidate based on a survey of 200 voters” is an
inferential conclusion.
Unlike descriptive statistics, inferential statistics rely heavily on probability
theory to quantify uncertainty and assess the reliability of conclusions. Proba-
bility theory allows us to say not just what the estimate is, but also how reliable
it is. For example, instead of only reporting the sample mean, we report a
confidence interval which shows how much trust we can put in that estimate.
In hypothesis testing, reliability is assessed by controlling error probabilities.
In summary, descriptive statistics describe what is observed, while inferential
statistics predict what could be true for a broader population. Both are essential
tools in data analysis, serving different but complementary purposes.
As we move deeper into inferential statistics, one of the fundamental tasks is
estimation, which is the process of using sample data to make educated guesses
about unknown population parameters. A point estimate is a single value
derived from a sample that serves as the best guess for an unknown parameter
of the population, such as the mean, proportion, or variance. For example,
the sample mean (x̄) is a point estimate of the population mean (µ). While
point estimates are simple and intuitive, they do not convey the uncertainty
associated with sampling. That is why point estimation is often paired with
interval estimation, which provides a range of plausible values along with a
confidence level. Before exploring those intervals, it is important to understand
the properties of good estimators, such as unbiasedness, consistency, and
efficiency, which ensure that our point estimates are as accurate and reliable
as possible when drawn from representative samples.

8.1 Point Estimation


A point estimate of a parameter θ is a single number that can be regarded
as a sensible value for θ. It is obtained by selecting a suitable statistic and
computing its value from the given sample data. The selected statistic is called
the point estimator of θ.

Example 8.1. A semiconductor company manufactures a large batch of
microprocessor chips. The company is interested in estimating the average operational
lifetime of these chips under normal usage conditions. Let the true average
lifetime (in hours) of all chips be denoted by the population parameter µ, which is
unknown.
Since testing every chip is impractical, the company proceeds as follows. It
randomly selects a sample of n = 50 chips and measures the operational lifetime
(in hours) of each selected chip until it fails. It records the sample values as
X1, X2, ..., X50. It then computes the sample mean

\[ \bar{X} = \frac{1}{50} \sum_{i=1}^{50} X_i. \]

Suppose the result is X̄ = 72,500 hours. The statistic X̄ is called the point
estimator of the unknown population mean µ. The value 72,500 is the point
estimate of µ based on this particular sample. It serves as a reasonable,
data-driven guess for the true average lifetime of all chips produced. □

This is an example for which there is only one reasonable point estimator
for the parameter of interest. The next example shows that multiple sensible
estimators can exist for a given parameter.

Example 8.2.* Consider the following 20 observations on dielectric breakdown
voltage (in kilovolts) for pieces of epoxy resin, as introduced in Exercise 4.89 in
the textbook

24.46 25.61 26.25 26.42 26.66 27.15 27.31 27.54 27.74 27.94
27.98 28.04 28.28 28.49 28.50 28.87 29.11 29.13 29.50 30.88

The normal probability plot of this data appears to be approximately linear,
suggesting that the distribution of breakdown voltage is roughly normal.
Because normal distributions are symmetric, the population mean µ is also equal
to the median of the distribution.
We now treat these 20 observations as a random sample X1, X2, ..., X20
drawn from a normal population with mean µ. The goal is to estimate µ using
various estimators.
The sample mean X̄ is computed as

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{20} X_i = \frac{555.86}{20} = 27.793. \]

1. Here the estimator is X̄ (sample mean) and the estimate is 27.793.
Since the sample size is even (n = 20), the median is the average of the 10th
and 11th ordered values

\[ \tilde{X} = \frac{X_{(10)} + X_{(11)}}{2} = \frac{27.94 + 27.98}{2} = 27.960. \]

2. Here the estimator is the sample median and the estimate is 27.960.
The midrange is the average of the minimum and maximum observations

\[ \text{Midrange} = \frac{\min(X_i) + \max(X_i)}{2} = \frac{24.46 + 30.88}{2} = 27.670. \]

3. Here the estimator is $\frac{\min(X_i) + \max(X_i)}{2}$ and the estimate is 27.670.
In a 10% trimmed mean with n = 20, we discard the lowest 10% (2 values) and
the highest 10% (2 values), and compute the mean of the remaining 16 values.
The discarded values are 24.46, 25.61, 29.50, 30.88. So, the remaining sum is

\[ \sum_{i=3}^{18} X_{(i)} = 555.86 - 24.46 - 25.61 - 29.50 - 30.88 = 445.41, \]

\[ \bar{X}_{\text{trimmed}} = \frac{445.41}{16} = 27.838. \]

4. Here the estimator is the 10% trimmed mean and the estimate is 27.838.
* Taken from Probability and Statistics for Engineering and the Sciences by Jay Devore,
9th edition.
By presenting four different estimates, the example implicitly invites ques-
tions like “Which estimator is most reliable?”, “How does outlier sensitivity
affect each?”, “How efficient are these estimators in different situations?”, etc.
This helps us understand why the choice of estimator matters and that robust-
ness and efficiency are important properties.
A simple observation is that the sample mean is sensitive to outliers (it is
affected by extreme minimum and maximum values). The median is more robust to
extreme values, and the trimmed mean is a compromise, as it reduces the effect
of outliers while still using most of the data. □
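The four estimates of Example 8.2 can be recomputed in a few lines. This is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative assumptions.

```python
import statistics

# The 20 breakdown-voltage observations (kV) from Example 8.2
voltages = [
    24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94,
    27.98, 28.04, 28.28, 28.49, 28.50, 28.87, 29.11, 29.13, 29.50, 30.88,
]
n = len(voltages)

sample_mean = statistics.mean(voltages)
sample_median = statistics.median(voltages)          # average of 10th and 11th values
midrange = (min(voltages) + max(voltages)) / 2

k = round(0.10 * n)                                  # values trimmed from each end
trimmed_mean = statistics.mean(sorted(voltages)[k:n - k])

print(f"mean = {sample_mean:.3f}, median = {sample_median:.3f}, "
      f"midrange = {midrange:.3f}, 10% trimmed mean = {trimmed_mean:.3f}")
```

All four statistics are computed from the same data, yet they give four slightly different estimates of µ, which is the point of the example.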

Ideally, we would like an estimator that always gives us the exact value of
the population parameter µ, no matter what random sample we collect. But
such a perfect estimator almost never exists in practice, unless we have the full
population. An estimator is calculated using sample data (e.g., sample mean,
sample variance), and since the sample is chosen randomly, the estimator itself
becomes a random variable†. It doesn’t have a fixed value until the sample
is observed. Different random samples will give different estimates, sometimes
overestimating and sometimes underestimating the true parameter µ. Our aim
is to find an estimator θ̂ for which

\[ \hat{\theta} = \theta + \text{estimation error}, \]

where this error tends to be small.


We will restrict attention just to estimators that have some specified desir-
able property. A popular property of this sort in the statistical community is
unbiasedness.

8.2 Unbiased Estimators


Imagine we are using two instruments to repeatedly measure the same object
(e.g., the length of a rod) where instrument A is well-calibrated and accurate
while instrument B is miscalibrated such that it systematically overestimates
the actual length. Even the same instrument will give slightly different readings
each time due to random noise or minor fluctuations. This is just like sampling
variability in statistics.
Instrument A may produce different values each time, but on average, its
readings are centered around the true value. That is the definition of an unbiased
estimator in statistics: E[θ̂] = θ. Instrument B, in contrast, is consistently wrong
in one direction because it always reads too high. Even though its readings vary,
their average is not the true value. This is called bias in statistics:
Bias(θ̂) = E[θ̂] − θ ≠ 0. This analogy highlights the intuition for statistical
bias, which is not just random error but a systematic shift away from the truth.
† The form of the estimator is always the same, but its value changes from sample to sample,
because it is computed from random data. Because of this, statisticians treat the estimator
as a random variable.
For an unbiased estimator, the sampling distribution of θ̂ is always centered at
the true value of the parameter. For example, if θ = 100, then the mean of the
sampling distribution of θ̂ is 100; if θ = 27.5, then it is 27.5, and so on. The
figures below illustrate the concept of bias in estimators.

(a) Bias in θ1 (b) Bias in θ1 and θ2

Figure 17: Biased and unbiased estimators

In Fig. 17a, the mean of the sampling distribution of estimator θ̂1 does not
coincide with the true parameter value θ. The horizontal distance between θ
and the mean of the distribution of θ̂1 is the bias of θ̂1. This illustrates that θ̂1
is a biased estimator. In Fig. 17b, the distribution of θ̂1 is shifted to the right of
θ, whereas the distribution of θ̂2 is shifted to the left. Both are therefore biased,
but in opposite directions.
It may seem paradoxical to compute bias from Bias(θ̂) = E[θ̂] − θ because we
don’t know θ to start with! So at first, it seems like we cannot ever tell whether
an estimator is unbiased unless we already know the true value, which defeats
the point of estimation. Here is where theoretical analysis comes to our rescue.
Unbiasedness is a theoretical property of an estimator’s sampling distribution.
We can analyze it mathematically by computing the expected value E[θ̂] in terms
of the population distribution. If this expected value equals θ for all possible
values of θ (not just one specific value), then the estimator is unbiased. So we
don’t need to know the actual value of θ. Instead, we use theoretical tools (like
laws of expectation, properties of estimators) to prove that E[θ̂] = θ for all θ.

8.2.1 Sample Mean


We have seen in Sec. 7.3 that, for large n, the distribution of the sample mean X̄
is approximately normal with mean µ and variance σ²/n, irrespective of the
distribution of the population

\[ \bar{X} \approx N\left( \mu, \frac{\sigma^2}{n} \right). \]

Let us see why the sample mean is an unbiased estimator. Let X1, X2, ..., Xn be
independent and identically distributed (iid) random variables with mean µ and
finite variance. The sample mean is defined as

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]

To show that X̄ is an unbiased estimator of µ, we compute its expected value
and apply the linearity of expectation (refer to the proposition in 3.3)

\[ E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i]. \]

By construction, Xi is a random draw from the population distribution. So each
Xi has the same distribution as the population (same mean, same variance, same
shape), and in particular E[Xi] = µ. Therefore

\[ E[\bar{X}] = \frac{1}{n} \cdot n\mu = \mu. \]

This calculation does not require us to know the actual numerical value of µ;
we are only using the fact that each Xi has expected value µ. No matter what
the true value of µ is, the distribution of the estimator X̄ will be centered at
the true value. If we could take infinitely many random samples and compute their
sample means, then the average of those sample means would equal the true
population mean, whether or not the population is normal.

Proposition. Let X1, X2, ..., Xn be a random sample from a population with
mean µ. Then the sample mean

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

is an unbiased estimator of µ. If in addition the distribution is continuous and
symmetric, then X̃ and any trimmed mean are also unbiased estimators of µ.

This result holds for X̄ regardless of whether the distribution is discrete or
continuous, as long as the random variables are independent and identically
distributed (iid) with finite mean µ.


However, for other estimators, such as the sample median, a trimmed mean,
or statistics that involve nonlinear transformations of the data like standard
deviation, verifying unbiasedness is not as straightforward. In such cases, we
cannot directly apply basic expectation properties. Instead, verifying whether
an estimator θ̂ is unbiased requires more involved mathematical derivation.
In such situations, especially when the algebra is intractable, simulation-based
methods are used to approximate the bias empirically: we generate many samples
from a known distribution and compare the average value of the estimator over
those samples with the known parameter value.
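A simulation-based bias check of this kind might look as follows. This is only a sketch: the N(170, 10²) population, the sample size n = 5, the number of repetitions, and the helper name `estimate_bias` are all assumptions chosen for illustration.

```python
import random
import statistics

def estimate_bias(estimator, true_value, draw_sample, trials=20000):
    """Approximate Bias = E[estimator] - true_value by averaging the
    estimator over many independently generated samples."""
    estimates = [estimator(draw_sample()) for _ in range(trials)]
    return statistics.fmean(estimates) - true_value

rng = random.Random(3)
mu, sigma, n = 170.0, 10.0, 5  # assumed known population, small sample size

def draw_sample():
    return [rng.gauss(mu, sigma) for _ in range(n)]

bias_mean = estimate_bias(statistics.mean, mu, draw_sample)    # sample mean vs mu
bias_sd = estimate_bias(statistics.stdev, sigma, draw_sample)  # sample std. dev. vs sigma
print(f"estimated bias of sample mean: {bias_mean:+.3f}")
print(f"estimated bias of sample standard deviation: {bias_sd:+.3f}")
```

The estimated bias of the sample mean is close to zero, while the sample standard deviation comes out noticeably below σ on average for such a small n, illustrating that a nonlinear transformation of the data can introduce bias.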
The following example explains the concept of bias in an estimator using
a scenario where we want to estimate the upper bound u of U (0, u) uniform
distribution based on a sample. But before that let us prove the following

Proposition. Let X1, ..., Xn be an i.i.d. sample from the Uniform(0, u)
distribution, i.e. each Xi has pdf $f_X(x) = \frac{1}{u}$ for 0 ≤ x ≤ u. Let

\[ M = \max(X_1, \ldots, X_n). \]

Then

\[ E[M] = \frac{n}{n+1}\, u. \]

Proof. The CDF of a single observation is $F_X(x) = P(X_i \le x) = \frac{x}{u}$ for
0 ≤ x ≤ u. Define the sample maximum as

\[ M = \max(X_1, X_2, \ldots, X_n). \]

By definition,

\[ \{ M \le x \} \iff \{ X_1 \le x, X_2 \le x, \ldots, X_n \le x \}. \]

Since the Xi are independent,

\[ P(M \le x) = P(X_1 \le x, X_2 \le x, \ldots, X_n \le x) = \prod_{i=1}^{n} P(X_i \le x). \]

Because each Xi has distribution function F_X(x), the CDF of the sample
maximum M is

\[ F_M(x) = P(M \le x) = \left[ F_X(x) \right]^n = \left( \frac{x}{u} \right)^n, \quad 0 \le x \le u. \]

Differentiating gives the pdf of M (refer Sec. 5.1)

\[ f_M(x) = \frac{d}{dx} F_M(x) = n\, \frac{x^{n-1}}{u^n}, \quad 0 \le x \le u. \]

Now compute the expectation

\[ E[M] = \int_0^u x\, f_M(x)\, dx = \int_0^u x \left( n\, \frac{x^{n-1}}{u^n} \right) dx
   = \frac{n}{u^n} \int_0^u x^n\, dx. \]

Evaluate the integral

\[ \int_0^u x^n\, dx = \frac{u^{n+1}}{n+1}. \]

Hence

\[ E[M] = \frac{n}{u^n} \cdot \frac{u^{n+1}}{n+1} = \frac{n}{n+1}\, u. \]

Thus M has expectation $\frac{n}{n+1}\, u$.
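The formula for E[M] can be sanity-checked numerically. The sketch below assumes u = 10 and a handful of sample sizes purely for illustration; `mean_of_maximum` is a hypothetical helper name.

```python
import random

def mean_of_maximum(n, u, trials=50000, seed=4):
    """Monte Carlo estimate of E[max(X1..Xn)] for Xi ~ Uniform(0, u)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.uniform(0, u) for _ in range(n))
    return total / trials

u = 10.0
results = {n: (mean_of_maximum(n, u), n / (n + 1) * u) for n in (1, 5, 20)}
for n, (simulated, theoretical) in results.items():
    print(f"n = {n:>2}: simulated E[M] = {simulated:.3f}, n/(n+1)·u = {theoretical:.3f}")
```

The simulated averages should track n/(n+1)·u closely, and both approach u as n grows.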
Example 8.3. Suppose that X, the reaction time to a certain stimulus, has a
uniform distribution on the interval from 0 to an unknown upper limit u. Thus,
the probability density function (pdf) of X is rectangular in shape with height
$\frac{1}{u - 0} = \frac{1}{u}$ for 0 ≤ x ≤ u. It is desired to estimate u on the
basis of a random sample X1, X2, ..., Xn of reaction times.

First Estimator Since u is the largest possible time in the entire population
of reaction times, consider as a first estimator the largest sample reaction time

\[ \hat{u}_1 = \max(X_1, \ldots, X_n). \]

Suppose n = 5 and the observed data are

\[ x_1 = 4.2, \quad x_2 = 1.7, \quad x_3 = 2.4, \quad x_4 = 3.9, \quad x_5 = 1.3. \]

Then the point estimate of u is

\[ \hat{u}_1 = \max(4.2, 1.7, 2.4, 3.9, 1.3) = 4.2. \]

Let us analyse the problem with this estimator. û1 will never overestimate
u, because sample values cannot exceed the population maximum. It often
underestimates u, since unless the sample maximum exactly equals u, it lies
below it. Therefore, the distribution of û1 is not centered at u; thus it is a
biased estimator. From the above proposition

\[ E[\hat{u}_1] = \frac{n}{n+1} \cdot u. \]

Hence, the bias is

\[ \mathrm{Bias}(\hat{u}_1) = E[\hat{u}_1] - u = \frac{n}{n+1} \cdot u - u = -\frac{u}{n+1}. \]

So, û1 underestimates u on average. But as n → ∞, the bias → 0. That is, the
estimator becomes asymptotically unbiased.

Second Estimator We can correct the bias in the estimator û1 by multiplying
the sample maximum by $\frac{n+1}{n}$. This gives the unbiased estimator

\[ \hat{u}_2 = \frac{n+1}{n} \cdot \max(X_1, \ldots, X_n). \]

For the sample above max(Xi) = 4.2. Then

\[ \hat{u}_2 = \frac{6}{5} \cdot 4.2 = 5.04. \]

This adjusted estimator û2 is unbiased, meaning its expected value equals the
true parameter u, i.e., E[û2] = u. This is because

\[ E[\hat{u}_2] = \frac{n+1}{n} \cdot E[\hat{u}_1]
   = \frac{n+1}{n} \cdot \left( \frac{n}{n+1}\, u \right) = u. \]

Let us analyse the possible estimates made by this estimator û2. It undershoots when

\[ \hat{u}_2 < u \iff \frac{n+1}{n} \cdot \max(X_i) < u \iff \max(X_i) < \frac{n}{n+1} \cdot u. \]

So, whenever the sample maximum is less than $\frac{n}{n+1} \cdot u$, the adjusted
estimator û2 will also be less than u, i.e., it undershoots. For sample size 5,

\[ \frac{n}{n+1} = \frac{5}{6} \approx 0.833. \]

So, if

\[ \max(X_i) < 0.833 \cdot u \quad \text{then} \quad \hat{u}_2 < u. \]

This can and often does happen, especially in small samples, because it is
unlikely for any observation to land very close to the true upper limit u. By the
CDF of the sample maximum derived in the proof above, for n = 5 it happens with
probability (5/6)⁵ ≈ 0.40. Similarly,

\[ \hat{u}_2 > u \iff \frac{n+1}{n} \cdot \max(X_i) > u \iff \max(X_i) > \frac{n}{n+1} \cdot u. \]

Whenever the maximum sample value exceeds $\frac{n}{n+1} \cdot u$, the adjusted
estimator û2 overshoots the true value u even though the sample maximum never
exceeds u. Finally,

\[ \hat{u}_2 = u \iff \frac{n+1}{n} \cdot \max(X_i) = u \iff \max(X_i) = \frac{n}{n+1} \cdot u. \]

Whenever the maximum sample value is the same as $\frac{n}{n+1} \cdot u$, the
adjusted estimator û2 = u.
Note on notation: max(X1, X2, ..., Xn) is written, in more compact notation, as
$\max_{1 \le i \le n} X_i$ or, in short, max(Xi).

Figure 18: Estimator Comparison Plot

The plot in Fig.18 visually compares two estimators used to approximate the unknown upper bound u = 10 of a uniform distribution. The orange curve represents the estimator û1 = max(Xi), which is clearly biased. It always underestimates the true value of u, as the sample maximum cannot exceed the population maximum. This is evident from the orange distribution being entirely to the left of the red dashed line at u = 10, with its peak concentrated well below it.
On the other hand, the blue curve corresponds to the corrected estimator û2 = (n+1)/n · max(Xi), which is unbiased. This distribution is centered around the true value u = 10, as shown by the fact that the blue curve straddles the red dashed line. However, to achieve unbiasedness, û2 sometimes overshoots and sometimes undershoots, resulting in a wider, skewed distribution. Estimators don't have to be normally distributed, especially when based on nonlinear statistics (like max). In this example the density of the sample maximum is proportional to x^(n−1) on [0, u], so it rises towards the upper end and has a long tail to the left. For û2 this means that modest overshoots are more common, while undershoots are rarer but can be larger, and the two cancel out on average. Overall, the mean is exactly at u, making it unbiased. The contrast between the two curves illustrates how bias can be corrected at the cost of increased variability and occasional overshooting.

Summary Unbiasedness implies that some samples will yield estimates that exceed u and other samples will yield estimates smaller than u; otherwise u could not possibly be the center (balance point) of the estimator's distribution. However, the first estimator û1 will never overestimate u (since the largest sample value cannot exceed the largest population value), and will underestimate u unless the largest sample value equals u. This intuitive argument shows that û1 is a biased estimator.
The second estimator û2 is designed such that it will sometimes overshoot, sometimes undershoot, but on average, it equals u. This is a great example of how bias can be analyzed and corrected using knowledge of the sampling distribution.
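The bias of the sample maximum and its correction can also be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the function name `simulate_max_estimators`, the choice u = 10, n = 5, the number of repetitions and the seed are all illustrative assumptions.

```python
import random
import statistics


def simulate_max_estimators(u=10.0, n=5, reps=20000, seed=0):
    """Average u1_hat = max(X) and u2_hat = (n+1)/n * max(X) over many samples.

    Each sample holds n independent Uniform(0, u) draws. All parameter
    values here are illustrative assumptions.
    """
    rng = random.Random(seed)
    u1_values = []
    u2_values = []
    for _ in range(reps):
        sample_max = max(rng.uniform(0.0, u) for _ in range(n))
        u1_values.append(sample_max)                # biased estimator
        u2_values.append((n + 1) / n * sample_max)  # bias-corrected estimator
    return statistics.fmean(u1_values), statistics.fmean(u2_values)


mean_u1, mean_u2 = simulate_max_estimators()
print(f"average of u1_hat: {mean_u1:.3f}  (theory: n/(n+1) * u = {5 / 6 * 10:.3f})")
print(f"average of u2_hat: {mean_u2:.3f}  (theory: u = 10.000)")
```

Under these assumptions the first average should settle near 8.33 and the second near 10, in line with E[û1] = n/(n+1) · u and E[û2] = u.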

Proposition. When choosing among several different estimators of θ, select one that is unbiased.

8.2.2 Sample Variance


Consider now the problem of estimating population variance σ 2 . But before
that let us prove the following identity
Proposition (ANOVA Identity). Let X1, X2, ..., Xn be a sample from a population with mean µ and sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2. \]

Proof. We start by rewriting each deviation from the sample mean
\[ X_i - \bar{X} = (X_i - \mu) - (\bar{X} - \mu). \]
Squaring both sides gives
\[ (X_i - \bar{X})^2 = (X_i - \mu)^2 - 2(X_i - \mu)(\bar{X} - \mu) + (\bar{X} - \mu)^2. \]
Summing over i = 1, ..., n, we obtain
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) + \sum_{i=1}^{n} (\bar{X} - \mu)^2. \]
Note that
\[ \sum_{i=1}^{n} (X_i - \mu) = n\bar{X} - n\mu = n(\bar{X} - \mu). \]
Thus the middle term simplifies to
\[ -2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) = -2(\bar{X} - \mu) \cdot n(\bar{X} - \mu) = -2n(\bar{X} - \mu)^2, \]
and the last term simplifies to
\[ \sum_{i=1}^{n} (\bar{X} - \mu)^2 = n(\bar{X} - \mu)^2. \]
Combining, we obtain
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2n(\bar{X} - \mu)^2 + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2, \]
which establishes the identity.
The identity, also known as variance decomposition identity, splits the total
variation of the data about the sample mean into two parts. The total deviation
of the data around the sample mean can be expressed in terms of the total
deviation of the data around the population mean µ with a correction that
adjusts for the fact that we are centering around X̄ rather than µ.
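Because the identity is purely algebraic, it holds for any data set and any value of µ, which makes it easy to check numerically. The sketch below is an illustration only; the helper name `identity_sides` and the simulated data are assumptions introduced here.

```python
import random


def identity_sides(xs, mu):
    """Return both sides of the identity for data xs and centre mu."""
    n = len(xs)
    xbar = sum(xs) / n
    lhs = sum((x - xbar) ** 2 for x in xs)                           # deviations about the sample mean
    rhs = sum((x - mu) ** 2 for x in xs) - n * (xbar - mu) ** 2      # deviations about mu, corrected
    return lhs, rhs


rng = random.Random(1)
xs = [rng.gauss(3.0, 2.0) for _ in range(8)]  # hypothetical sample

lhs, rhs = identity_sides(xs, mu=3.0)
print(f"left side  = {lhs:.10f}")
print(f"right side = {rhs:.10f}")
```

The two printed values should agree up to floating-point rounding, and the same holds if `mu` is replaced by any other number.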
Proposition. Let X1, X2, ..., Xn be a random sample from a distribution with mean µ and variance σ². Then the estimator
\[ \hat{\sigma}^2 = S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} \]
is an unbiased estimator of σ².
Proof. Let X1, X2, ..., Xn be a random sample from a population with mean µ and variance σ². Define the sample mean X̄ and sample variance S² as follows
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
We aim to prove that the sample variance S² is an unbiased estimator of population variance σ². By definition, an estimator θ̂ for a parameter θ is unbiased if E[θ̂] = θ. That is, on average, over many samples, the estimator gives you the true parameter value. Hence, in this five step proof we aim to prove
\[ E[S^2] = \sigma^2 \]
1 : Use the ANOVA identity. Before proceeding with the proof let us understand the need for this identity. In this proof of unbiasedness, we wish to evaluate the expectation (refer Sec.3.4 for the definition of variance)
\[ E\left[ \sum_{i=1}^{n} (X_i - \bar{X})^2 \right] \]
But this is hard to compute directly since it involves X̄. However, using the ANOVA identity allows us to break the expression into terms involving the population mean µ, which is easier to work with. So, we use the ANOVA identity
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2 \]

2 : Take expectation of both sides of the S² equation
\[ E[S^2] = \frac{1}{n-1}\, E\left[ \sum_{i=1}^{n} (X_i - \bar{X})^2 \right] = \frac{1}{n-1} \left( E\left[ \sum_{i=1}^{n} (X_i - \mu)^2 \right] - E\left[ n(\bar{X} - \mu)^2 \right] \right) \]
3 : Evaluate expectations. Since each Xi has variance σ², we have §
\[ E\left[ \sum_{i=1}^{n} (X_i - \mu)^2 \right] = n\sigma^2 \]
Also, X̄ has variance Var(X̄) = σ²/n, so
\[ E[n(\bar{X} - \mu)^2] = n \cdot E[(\bar{X} - \mu)^2] = n \cdot \frac{\sigma^2}{n} = \sigma^2 \]
4 : Plug back into the formula
\[ E[S^2] = \frac{1}{n-1} \left( n\sigma^2 - \sigma^2 \right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2 \]
5 : Hence E[S²] = σ².

Biased Estimator with Divisor n Consider the variance estimator that uses divisor n instead of n − 1:
\[ \hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n-1}{n}\, S^2, \]
where S² is the unbiased sample variance with divisor n − 1. Taking expectation,
\[ E[\hat{\sigma}_n^2] = \frac{n-1}{n}\, E[S^2] = \frac{n-1}{n}\, \sigma^2. \]
Thus, σ̂n² is not unbiased. Its bias is
\[ \mathrm{Bias}(\hat{\sigma}_n^2) = E[\hat{\sigma}_n^2] - \sigma^2 = \frac{n-1}{n}\, \sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}. \]
Because the bias is negative, the estimator with divisor n systematically
underestimates the true variance σ 2 . This explains why the divisor n − 1 is
generally preferred. However, when n is large, the bias −σ 2 /n is small, so the
difference between the two estimators becomes negligible.
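The effect of the divisor can be seen in a short simulation. This is a hedged sketch under assumed settings (normal data with σ = 2, samples of size n = 5, a fixed seed); the function name `average_variance_estimates` is hypothetical and not taken from the notes.

```python
import random


def average_variance_estimates(sigma=2.0, n=5, reps=20000, seed=0):
    """Average the divisor-(n-1) and divisor-n variance estimates over many samples."""
    rng = random.Random(seed)
    total_s2 = 0.0   # running sum of S^2 (divisor n - 1)
    total_sn2 = 0.0  # running sum of the divisor-n estimator
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations
        total_s2 += ss / (n - 1)
        total_sn2 += ss / n
    return total_s2 / reps, total_sn2 / reps


mean_s2, mean_sn2 = average_variance_estimates()
print(f"average S^2 (divisor n-1): {mean_s2:.3f}   target sigma^2 = 4.000")
print(f"average with divisor n:    {mean_sn2:.3f}   theory (n-1)/n * sigma^2 = 3.200")
```

With these assumed values the first average should be close to σ² = 4, while the second should be close to (n − 1)/n · σ² = 3.2, matching the bias −σ²/n = −0.8.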

Variance: Different Perspectives


• Variance in Basic Probability Theory A random variable X with mean µ = E[X] has variance
\[ \mathrm{Var}(X) = E\left[ (X - \mu)^2 \right]. \]
X is random, so variance measures expected squared deviation from the mean.
§ We expand $E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] = \sum_{i=1}^{n} E\left[(X_i - \mu)^2\right]$ using linearity of expectation. Since each Xi comes from a population with variance σ², we know E[(Xi − µ)²] = Var(Xi) = σ². Summing over n terms gives nσ².
• Population Variance If the population follows a probability distribution
\[ \sigma^2 = \mathrm{Var}(X) = E[(X - \mu)^2]. \]
If all population values x1, ..., xN are known,
\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2, \qquad \mu = \frac{1}{N} \sum_{i=1}^{N} x_i. \]
Here no randomness remains, i.e., variance is a fixed property of the population.

• Sample Variance (Descriptive Statistic) For observed data x1, ..., xn,
\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i. \]
Here we are not taking expectations, because we already have the data. We literally compute deviations from the sample mean and average them. The variance is thus a fixed number describing variability in the dataset.

• Sample Variance (Inferential Statistic) For an i.i.d. random sample X1, ..., Xn, the estimator for variance is
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Here Xi are random variables, so S² is itself random. The denominator (n − 1) ensures that the expected value of the estimator is E[S²] = σ² and hence S² is an unbiased estimator.

In short, in probability theory and for a population, variance is defined through the expectation of a random variable. In descriptive statistics, data are fixed numbers and hence variance is just a descriptive measure. In inferential statistics, variance is a random statistic used to estimate the population variance.

8.2.3 Sample Standard Deviation


We know that the sample variance,
\[ S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
is an unbiased estimator of the population variance σ². That is, E[S²] = σ². A natural idea is to estimate the population standard deviation σ using the square root of the sample variance
\[ S = \sqrt{S^2}. \]
However, the property of unbiasedness does not carry over through nonlinear transformations such as the square root. Specifically, E[S] ≠ σ. This is because the expectation of a function of a random variable is not generally equal to the function of the expectation
\[ E[g(X)] \neq g(E[X]) \quad \text{in general.} \]
In our case, this is
\[ E\left[\sqrt{S^2}\right] \neq \sqrt{E[S^2]} = \sqrt{\sigma^2} = \sigma, \]
which means S is a biased estimator of σ, even though S² is unbiased for σ². Because the square root is a concave function, Jensen's inequality gives E[S] ≤ σ, so S tends to underestimate σ.
Fortunately, the bias of S is small unless the sample size n is quite small. As
the sample size increases, the bias decreases, and S becomes an approximately
unbiased estimator of σ.
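How quickly the bias of S fades with the sample size can be illustrated by simulation. The sketch below assumes normal data with σ = 1 and the sample sizes 2, 5 and 30; these values, the seed, and the function name `average_sample_sd` are illustrative choices rather than anything prescribed by the notes.

```python
import math
import random


def average_sample_sd(n, sigma=1.0, reps=20000, seed=0):
    """Average of S = sqrt(S^2) over many simulated normal samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # unbiased for sigma^2
        total += math.sqrt(s2)                           # but sqrt introduces bias
    return total / reps


results = {n: average_sample_sd(n) for n in (2, 5, 30)}
for n, mean_s in results.items():
    print(f"n = {n:2d}: average S = {mean_s:.4f}  (true sigma = 1)")
```

Under these assumptions every average should fall below 1, with the gap shrinking as n grows.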
Despite this bias, there are good reasons to prefer using S in practice, es-
pecially when the underlying population distribution is normal. The sample
standard deviation S plays a critical role in many statistical procedures, such as
confidence intervals and hypothesis tests, which we will explore in the following
sections.
In the earlier example, which estimated the mean µ of a normal distribution
using different statistics, we encountered situations where multiple unbiased
estimators are available. When such scenarios arise, where there are several
unbiased estimators for a particular parameter, the principle of unbiasedness
(introduced in Sec.8.2.1) alone is not enough to tell us which estimator is better, because all the candidates are unbiased. They all have the correct expected value. So, we need an additional criterion to decide which estimator to prefer. One such important criterion is minimum variance.

8.2.4 Estimators with Minimum Variance



Suppose θ̂1 and θ̂2 are two estimators of a population parameter θ, and both are unbiased. That is,
\[ E[\hat{\theta}_1] = E[\hat{\theta}_2] = \theta. \]
This means that the probability distributions of both θ̂1 and θ̂2 are centered at the true value of θ. However, even though their expected values are equal, the spread (i.e., variance) of their distributions might be different. In particular, one estimator might vary more from sample to sample than the other. That is,
\[ V[\hat{\theta}_1] \neq V[\hat{\theta}_2]. \]
When multiple unbiased estimators are available we may still prefer one over the other. This is because, in practice, we usually work with only a single sample from the population. In many real-world situations, repeated sampling is not possible due to cost, time, ethical constraints, or the destructive nature of measurement. For instance, in clinical drug trials, only one study might be conducted because replicating trials may be prohibitive due to ethical and cost constraints; in quality control, destructive testing allows sampling only once from a batch; and in environmental monitoring or historical data analysis, only a single set of data may be available. In such cases, the performance of an estimator on that one sample is crucial.
Among unbiased estimators, the one with the smaller variance tends to produce estimates that are more consistently close to the true parameter value. Therefore, we prefer the estimator with lower variance, leading to the important concept of the minimum variance unbiased estimator (MVUE). Choosing estimators based on both unbiasedness and spread ensures more reliable and accurate statistical inference in practice.

Principle of Minimum Variance Unbiased Estimation Among all estimators of θ that are unbiased, choose the one that has minimum variance. The resulting θ̂ is called the minimum variance unbiased estimator (MVUE) of θ.

Example 8.4. We argued in Example 8.3 that when X1, ..., Xn is a random sample from a uniform distribution on the interval [0, u], the estimator
\[ \hat{u}_1 = \frac{n+1}{n} \cdot \max(X_1, \ldots, X_n) \]
is unbiased for u (this was previously denoted as û2). However, this is not the only unbiased estimator of u. The expected value of a uniformly distributed random variable on [0, u] is E[Xi] = u/2. Since the sample mean is defined as
\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]
we can use the linearity of expectation to find the expected value of X̄
\[ E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \frac{u}{2} = \frac{u}{2} \]
\[ E[\bar{X}] = \frac{u}{2} \;\Rightarrow\; E[2\bar{X}] = u \]
Therefore, the estimator û2 = 2X̄ is also unbiased. Now, we compare the variances of û1 and û2 to find out which is better. Let X ∼ Uniform(0, u). Then the probability density function (pdf) of X is given by
\[ f(x) = \begin{cases} \dfrac{1}{u}, & 0 \le x \le u \\ 0, & \text{otherwise} \end{cases} \]
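To compute mean µ = E[X]
\[ E[X] = \int_0^u x \cdot f(x)\, dx = \int_0^u x \cdot \frac{1}{u}\, dx = \frac{1}{u} \int_0^u x\, dx = \frac{1}{u} \left[ \frac{x^2}{2} \right]_0^u = \frac{1}{u} \cdot \frac{u^2}{2} = \frac{u}{2} \]
To compute variance V[X] = E[X²] − (E[X])² first compute E[X²]
\[ E[X^2] = \int_0^u x^2 \cdot \frac{1}{u}\, dx = \frac{1}{u} \int_0^u x^2\, dx = \frac{1}{u} \left[ \frac{x^3}{3} \right]_0^u = \frac{1}{u} \cdot \frac{u^3}{3} = \frac{u^2}{3} \]
Now substitute into the variance formula
\[ V[X] = E[X^2] - (E[X])^2 = \frac{u^2}{3} - \left( \frac{u}{2} \right)^2 = \frac{u^2}{3} - \frac{u^2}{4} = \frac{u^2}{12} \]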
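For a uniform distribution on [0, u],
\[ V[X_i] = \frac{u^2}{12} \;\Rightarrow\; V[\bar{X}] = \frac{u^2}{12n} \;\Rightarrow\; V[2\bar{X}] = 4 \cdot \frac{u^2}{12n} = \frac{u^2}{3n} = V[\hat{u}_2] \]
For the maximum M of n i.i.d. U(0, u) random variables, we have
\[ E[M] = \frac{n}{n+1}\, u, \qquad E[M^2] = \frac{n}{n+2}\, u^2. \]
The variance of M is
\[ V[M] = E[M^2] - (E[M])^2 = \frac{n}{n+2}\, u^2 - \left( \frac{n}{n+1}\, u \right)^2 = \frac{n u^2}{(n+1)^2 (n+2)}. \]
Using the scaling property of variance, we get
\[ V[\hat{u}_1] = \left( \frac{n+1}{n} \right)^2 V[M] = \frac{u^2}{n(n+2)}. \]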

Comparing the two variances. We ask when V[û1] < V[û2].
\[ \frac{u^2}{n(n+2)} < \frac{u^2}{3n} \iff \frac{1}{n(n+2)} < \frac{1}{3n} \iff 3n < n(n+2) \iff 3 < n + 2 \]
This inequality holds whenever n > 1. Thus, for any sample size greater than 1, û1 has smaller variance than û2, making it the preferred estimator among the two. More advanced statistical theory can show that û1 is in fact the minimum variance unbiased estimator (MVUE) for u; that is, it has the smallest variance among all unbiased estimators of u.
This example highlights one of the key strengths of mathematical statistics: the ability to evaluate and compare estimators using theoretical tools from probability and algebra, even before any data is observed. Since estimators are functions of random variables, they possess distributions of their own. This allows us to compute important characteristics such as bias, variance, and mean squared error (MSE) directly from their probability models.
For instance, in the case of estimating the upper bound u of a uniform distribution on [0, u], both the scaled sample maximum and twice the sample mean can serve as unbiased estimators. However, by computing and comparing their variances analytically, we can determine which estimator is more efficient. This approach avoids the need for empirical simulation or sample data and provides deeper insight into the long-run performance of different estimation strategies. Such theoretical comparisons are foundational in the development of optimal statistical procedures.
This is the essence of mathematical statistics, distinguishing it from applied or computational statistics where data-driven methods dominate.
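Although the analytical comparison needs no data, a simulation offers a quick sanity check of the two variance formulas. The sketch below is illustrative only: u = 10, n = 5, the repetition count, the seed, and the function name `compare_uniform_estimators` are assumptions introduced here.

```python
import random
import statistics


def compare_uniform_estimators(u=10.0, n=5, reps=20000, seed=0):
    """Empirical variances of (n+1)/n * max(X) and 2 * mean(X) for Uniform(0, u) samples."""
    rng = random.Random(seed)
    scaled_max = []
    twice_mean = []
    for _ in range(reps):
        xs = [rng.uniform(0.0, u) for _ in range(n)]
        scaled_max.append((n + 1) / n * max(xs))
        twice_mean.append(2.0 * sum(xs) / n)
    return statistics.pvariance(scaled_max), statistics.pvariance(twice_mean)


var_scaled_max, var_twice_mean = compare_uniform_estimators()
print(f"variance of scaled maximum: {var_scaled_max:.3f}  (theory u^2/(n(n+2)) = {100 / 35:.3f})")
print(f"variance of twice the mean: {var_twice_mean:.3f}  (theory u^2/(3n)     = {100 / 15:.3f})")
```

Under these assumptions the scaled maximum should show a variance near 2.86 and twice the sample mean a variance near 6.67, so the ordering predicted by the algebra is visible in the simulated estimates as well.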

MVUE of sample mean Let X1, X2, ..., Xn be a random sample from a normal distribution with unknown parameters µ (mean) and σ (standard deviation), i.e.,
\[ X_i \sim N(\mu, \sigma^2) \quad \text{for } i = 1, 2, \ldots, n. \]
The sample mean is defined as
\[ \hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i. \]
Using linearity of expectation,
\[ E[\bar{X}] = E\left[ \frac{1}{n} \sum_{i=1}^{n} X_i \right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu. \]
Thus, X̄ is an unbiased estimator of µ. Since the Xi are independent and identically distributed,
\[ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{\sigma^2}{n}. \]

Variance of X̄ shrinks with n. The key theoretical result to use here is the Lehmann–Scheffé theorem:

If a statistic is unbiased for a parameter and is a function of a complete, sufficient statistic, then it is the unique minimum variance unbiased estimator (MVUE).

X̄ is a function of a sufficient statistic for µ. For the normal distribution, the sample mean is also complete. Thus by the theorem X̄ is the MVUE of µ. While other unbiased estimators of µ exist (e.g., selecting a single Xi), their variance is larger. X̄ aggregates information from all n observations, reducing variance to the minimum possible. Therefore, X̄ is the best unbiased estimator of µ in terms of variance.

Example 8.5. Suppose we wish to estimate the thermal conductivity µ of a certain material. Using standard measurement techniques, we obtain a random sample X1, X2, ..., Xn of n thermal conductivity measurements. Assume the population distribution is a member of one of the following three families
\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty \tag{6.1} \]
\[ f(x) = \frac{1}{\pi b \left[ 1 + \left( \frac{x-\mu}{b} \right)^2 \right]}, \quad -\infty < x < \infty \tag{6.2} \]
\[ f(x) = \begin{cases} \dfrac{1}{2c}, & \mu - c \le x \le \mu + c, \\ 0, & \text{otherwise,} \end{cases} \tag{6.3} \]
where (6.1) is the normal distribution, (6.2) is the Cauchy distribution, and (6.3) is the uniform distribution. All three are symmetric about µ. The Cauchy curve is bell-shaped but has much heavier tails than the normal. This means large outliers occur much more often than in the normal case. In fact, the mean of the Cauchy distribution does not exist, although µ is still the median and location parameter. The uniform distribution has no tails.
We consider four estimators of µ: X̄, X̃ (median), X̄e (average of extremes), and X̄tr(10) (10% trimmed mean). If the random sample comes from a normal distribution, then X̄ is the best estimator (MVUE), since it has minimum variance among all unbiased estimators. If the random sample comes from a Cauchy distribution, X̄ and X̄e are very poor choices (sensitive to outliers and heavy tails). The sample median X̃ performs quite well, though the MVUE is not known. If the sample comes from a uniform distribution, then the best estimator is X̄e. Outliers cannot occur since the distribution is bounded. The trimmed mean is not optimal in any case, but performs reasonably well across all three distributions. It is robust, balancing efficiency and resistance to outliers.
The best estimator for µ depends crucially on the underlying distribution. Different families reward different choices of estimators, and no single estimator is universally optimal. Robust choices like the trimmed mean provide a good compromise.
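The qualitative ranking in this example can be explored with a small simulation. The sketch below is an illustration under assumed settings, not a reproduction of any computation in the notes: it takes µ = 0, σ = b = c = 1, n = 20, trims 10% of the points from each end, and scores each estimator by its simulated mean squared error. The names `trimmed_mean`, `draw`, `ESTIMATORS` and `mean_squared_errors` are hypothetical.

```python
import math
import random
import statistics


def trimmed_mean(xs, proportion=0.10):
    """Mean after dropping the given proportion of points from each end."""
    ordered = sorted(xs)
    k = int(proportion * len(ordered))
    kept = ordered[k:len(ordered) - k] if k > 0 else ordered
    return statistics.fmean(kept)


ESTIMATORS = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "midrange": lambda xs: (min(xs) + max(xs)) / 2.0,  # average of extremes
    "trimmed10": trimmed_mean,
}


def draw(family, rng, mu=0.0):
    """One observation from the named family, centred at mu with unit scale."""
    if family == "normal":
        return rng.gauss(mu, 1.0)
    if family == "cauchy":
        # Inverse-CDF method for the Cauchy distribution with b = 1.
        return mu + math.tan(math.pi * (rng.random() - 0.5))
    if family == "uniform":
        return rng.uniform(mu - 1.0, mu + 1.0)
    raise ValueError(f"unknown family: {family}")


def mean_squared_errors(family, n=20, reps=3000, seed=0, mu=0.0):
    """Simulated mean squared error of each estimator of mu."""
    rng = random.Random(seed)
    totals = {name: 0.0 for name in ESTIMATORS}
    for _ in range(reps):
        xs = [draw(family, rng, mu) for _ in range(n)]
        for name, estimator in ESTIMATORS.items():
            totals[name] += (estimator(xs) - mu) ** 2
    return {name: total / reps for name, total in totals.items()}


mse = {family: mean_squared_errors(family) for family in ("normal", "cauchy", "uniform")}
for family, scores in mse.items():
    best = min(scores, key=scores.get)
    print(f"{family:8s} smallest simulated MSE: {best}")
```

Under these assumptions the sample mean should do best for normal data, the average of extremes for uniform data, and the mean and average of extremes should do very poorly for Cauchy data, while the trimmed mean stays reasonable throughout.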

9 Interval Estimation
A point estimate is a single number calculated from a sample, such as the
sample mean used to estimate a population mean. While point estimators are
often theoretically sound i.e., being unbiased, consistent, and efficient, they
still rely on the particular sample drawn. In practice, we usually have access
to only one sample, and the estimate we get from it may be higher or lower
than the true population value due to random sampling variability. Although a
good estimator performs well on average over many samples, any one estimate
could still be inaccurate. This means that a point estimate by itself provides no
information about how precise or reliable it is. Therefore, we accompany point
estimates with measures of uncertainty, such as standard errors or confidence
intervals, which help us express how much the estimate might vary and how
close it is likely to be to the true value.
To address this uncertainty, we accompany point estimates with an interval of plausible values for the parameter. This is called an interval estimate or confidence interval (CI). Constructing a confidence interval begins by choosing a confidence level, which reflects how confident we are that the interval captures the true parameter. For instance, suppose we use the sample mean X̄ to estimate the true average breaking strength (in grams) of a particular brand of paper towels, and we get a value of x̄ = 9322.7. A 95% confidence interval for the average breaking strength from 9162.5 to 9482.9 means we can be 95% confident that the true mean lies somewhere within this range.
More formally, a 95% confidence level means that if we repeatedly took samples and computed confidence intervals, then about 95% of those intervals would contain the true value µ, and only 5% would not. Commonly used confidence levels are 95%, 99%, and 90%. The higher the confidence level, the more certainty we have that the interval includes the true parameter. To be more confident that the interval contains the true parameter, we must be more cautious, which means a wider range. Our aim is a narrow interval with high confidence.

9.1 Confidence Interval for the Mean


We are interested in estimating the true mean µ of a population using a random
sample. For this we assume that the population is normally distributed, the
population standard deviation σ is known, and the sample consists of observa-
tions X1 , X2 , . . . , Xn .
The sample mean X̄ has the following distribution (refer Sec.8.2.1)
\[ \bar{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right) \]
That is, X̄ is normally distributed with mean E[X̄] = µ and standard deviation SD(X̄) = σ/√n. As discussed in Sec. 5.4, it is easy to work with the standard normal distribution. We standardize X̄ as follows
\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]
From standard normal tables, we know that
\[ \Phi(1.96) \approx 0.975, \qquad \Phi(-1.96) = 1 - \Phi(1.96) \approx 0.025. \]
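Therefore,
\[ P(-1.96 < Z < 1.96) = \Phi(1.96) - \Phi(-1.96) = 0.975 - 0.025 = 0.95. \]
Substituting for Z, we get
\[ P\left( -1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96 \right) = 0.95 \tag{1} \]
We now rearrange (1) algebraically to express it in terms of µ. Multiply through by σ/√n
\[ P\left( -1.96 \cdot \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96 \cdot \frac{\sigma}{\sqrt{n}} \right) = 0.95 \]
Subtract X̄ from each part and then multiply by −1, which reverses the inequalities
\[ P\left( \bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}} \right) = 0.95 \tag{2} \]
Equation (2) defines a 95% confidence interval for the population mean µ
\[ \left( \bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}} \right) \]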
This interval itself is random because it depends on the random variable X̄. Before the data is collected, the sample mean is unknown and hence the interval is not fixed. Once data is collected, we compute x̄, and the interval becomes
\[ \left( \bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}} \right) \]
In repeated sampling, 95% of all intervals constructed in this way will contain the true population mean µ. Before the data is collected, there is a 95% probability that the interval we construct will capture µ. Once the data is observed and the interval is calculated, it either contains µ or it does not. But the method used is such that, in the long run, 95% of such intervals will be successful.
The interval is centered at X̄, and its width is 2 × 1.96 × σ/√n. The standard error of the mean is SE(X̄) = σ/√n, and the margin of error is ME = zα/2 · SE = 1.96 · σ/√n. Crucially, the formula is valid only when the population is normal, or when the sample size is large enough for the Central Limit Theorem (Sec. 7.3) to apply.
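The interval formula translates directly into a few lines of code. The following is a minimal sketch using Python's standard library; the function name `z_confidence_interval` and the numbers in the usage line (x̄ = 50, σ = 4, n = 25) are hypothetical values chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for a mean when sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # critical value z_{alpha/2}
    margin = z * sigma / sqrt(n)                 # margin of error = z * SE
    return xbar - margin, xbar + margin


# Hypothetical inputs, not taken from the notes.
low, high = z_confidence_interval(xbar=50.0, sigma=4.0, n=25)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```

The critical value is obtained from the inverse standard normal CDF rather than typed in as 1.96, so the same function covers other confidence levels through the `confidence` argument.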

Example 9.1. We are constructing a 95% confidence interval for the true av-
erage preferred keyboard height µ, based on a sample of typists. We are given
sample mean x̄ = 80.0 cm, population standard deviation σ = 2.0 cm, sample
size n = 31, and confidence level 95%.
Since the population standard deviation is known, we use the z-based confidence interval formula
\[ \bar{x} \pm z^{*} \cdot \frac{\sigma}{\sqrt{n}} \]
For 95% confidence, z* = 1.96 (derived above). Substituting the values
\[ 80.0 \pm 1.96 \cdot \frac{2.0}{\sqrt{31}} = 80.0 \pm 1.96 \cdot 0.359 \approx 80.0 \pm 0.7 \]
Therefore, the confidence interval is (79.3, 80.7).
Interpretation: We are 95% confident that the true average preferred keyboard
height µ lies between 79.3 cm and 80.7 cm. The interval is relatively narrow,
which indicates that the estimate of µ is quite precise.
It might be tempting, but incorrect, to write
P (µ ∈ (79.3, 80.7)) = 0.95
This is incorrect because the value µ is a fixed but unknown constant. Once the
sample mean x̄ = 80.0 is observed, the interval is fixed and no longer random.
Probability statements apply to random variables or events, not fixed numbers.
The correct meaning of 95% confidence is based on the long-run frequency
interpretation of probability. If we were to repeatedly take random samples
from the same population, and compute a 95% confidence interval each time,
then approximately 95% of those intervals would contain the true mean µ.
That is, the procedure used to construct the interval has a 95% success rate
in the long run. We do not know whether this particular interval contains µ,
but we are 95% confident in the method used to obtain it.
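The long-run interpretation can be made concrete by simulating the procedure many times. The sketch below assumes, purely for illustration, that the true mean is µ = 80.0 with σ = 2.0 and n = 31; in practice µ is unknown, which is exactly why such a check is only possible in a simulation. The function name `coverage` and the seed are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(mu=80.0, sigma=2.0, n=31, confidence=0.95, reps=10000, seed=0):
    """Fraction of simulated z-intervals that contain the assumed true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - margin < mu < xbar + margin:
            hits += 1  # this interval captured the true mean
    return hits / reps


observed = coverage()
print(f"fraction of intervals containing mu: {observed:.3f}")
```

Under these assumptions the printed fraction should be close to 0.95. Each individual interval either contains µ or misses it; the 95% describes the procedure, not any single interval.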


Other Confidence Levels Different confidence levels reflect varying degrees of certainty about whether a constructed interval contains the true population parameter. The most commonly used level is 95%, meaning that if we were to repeat the sampling process many times, about 95% of the resulting confidence intervals would contain the true value. Other levels include 90% and 99%, each with its own trade-offs.
A 90% confidence level yields a narrower interval but offers less certainty (10% chance of error), while a 99% confidence level provides greater assurance that the true parameter is captured, but results in a wider interval. The choice of confidence level depends on the context. 90% may be sufficient for exploratory studies, 95% is standard in many scientific fields, and 99% is preferred when greater assurance or stricter error control is required, such as in clinical research or quality assurance. In general, as the confidence level increases, the interval becomes wider to reflect increased caution in capturing the true value.
A z-critical value (often written zα or zα/2 ) is a cutoff point on the standard
normal distribution N (0, 1) that separates the central probability from the tail
probability. zα is the number such that the area to the right of it under the
standard normal curve is α.
P (Z > zα ) = α, Z ∼ N (0, 1).
zα/2 is the value such that the area to the right is α/2.
\[ P(Z > z_{\alpha/2}) = \frac{\alpha}{2}. \]
Equivalently, the central probability is
P (−zα/2 < Z < zα/2 ) = 1 − α.
Consider the standard normal distribution curve shown in Figure 19. The central area under the curve is 1 − α, while the two tails each have probability α/2. The boundary points of this central region are the critical values of Z,
denoted by −zα/2 and zα/2 . The interval between −zα/2 and zα/2 contains the
middle 100(1 − α)% of the distribution. This is the probability basis for con-
structing two-sided confidence intervals. In confidence intervals, the z-critical
value tells us how many standard errors we need to move away from the mean
to capture a specified confidence level.

Figure 19: Standard normal curve showing the central probability P (−zα/2 <
Z < zα/2 ) = 1 − α.

For a 95% confidence interval (α = 0.05), z0.025 = 1.96 which means that 95% of the standard normal distribution lies between −1.96 and 1.96. For a 90% confidence interval (α = 0.10) z0.05 = 1.645 which means that 90% of the distribution lies between −1.645 and 1.645. The z-critical value is the cutoff on the standard normal curve corresponding to a chosen significance level α or confidence level (1 − α).
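Critical values need not be read from a table; they can be computed from the inverse of the standard normal CDF. A minimal sketch using Python's standard library follows; the helper name `z_critical` is an assumption made for this illustration.

```python
from statistics import NormalDist


def z_critical(confidence):
    """Return z_{alpha/2} for a two-sided interval at the given confidence level."""
    alpha = 1.0 - confidence
    # The area to the right of z_{alpha/2} is alpha/2, so the area to its left is 1 - alpha/2.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_critical(level):.3f}")
```

The loop reports the critical values for the three commonly used confidence levels, which can be compared against the tabulated 1.645 and 1.96 quoted above.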

Example 9.2. * The production process for engine control housing units has recently been modified. Historically, the hole diameters for bushings on these housings followed a normal distribution with a standard deviation of σ = 0.100 mm. It is believed that the process modification has not affected the shape of the distribution or the standard deviation, but the population mean diameter µ may have changed. To assess this, a random sample of n = 40 housing units was selected. The sample mean hole diameter was found to be x̄ = 5.426 mm.

Our goal is to construct a 90% confidence interval for the true average hole
diameter µ. Given: x̄ = 5.426, σ = 0.100, n = 40, confidence level = 90%
⇒ α = 0.10, and zα/2 = z0.05 = 1.645.
First, we compute the standard error (SE)
\[ \mathrm{SE} = \frac{\sigma}{\sqrt{n}} = \frac{0.100}{\sqrt{40}} \approx 0.0158 \]
Next, we compute the margin of error (ME)

ME = zα/2 · SE = 1.645 · 0.0158 ≈ 0.026

Now, we construct the confidence interval

x̄ ± ME = 5.426 ± 0.026 = (5.400, 5.452)


* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th edition.
We are 90% confident that the true average hole diameter µ lies between 5.400
mm and 5.452 mm. The interval is relatively narrow due to the small standard
deviation (σ = 0.100) and reasonably large sample size (n = 40), indicating that
the mean has been estimated with good precision. This interval can be used to
assess whether the process modification has significantly changed the average diameter.

Sample size The confidence interval (CI) for a population mean (when the population standard deviation σ is known) is
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]
This formula shows exactly how the CI depends on the sample size n. The term σ/√n is the standard error. As n increases, √n increases, so the standard error decreases. As a result, the margin of error zα/2 · σ/√n becomes smaller. This makes the confidence interval tighter (narrower) around the sample mean x̄ and thus the estimate of the population mean µ more precise.
If we could take an infinitely large sample, then
\[ \frac{\sigma}{\sqrt{n}} \to 0 \]
and the interval would collapse to just the point estimate x̄. In practice, increasing n improves CI precision, but with diminishing returns, since √n grows slowly. This means to halve the width of the confidence interval, we need to quadruple the sample size. A larger sample size leads to a narrower and more precise interval, but collecting more data may be costly or time-consuming.

Example 9.3. * Extensive monitoring of a computer time-sharing system has suggested that the response time to a particular editing command is normally distributed with a standard deviation of 25 milliseconds. A new operating system has been installed and we wish to estimate the true average response time µ under the new environment. Assuming the response times are still normally distributed with σ = 25, we want to determine the sample size n required to ensure that the resulting 95% confidence interval has a total width of at most 10 milliseconds.

We use the general formula for confidence interval width
\[ \text{Width} = 2 \cdot z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]
For a 95% confidence level, zα/2 = 1.96. Substituting into the equation
\[ 10 = 2 \cdot (1.96) \cdot \frac{25}{\sqrt{n}} \]
Solving for n
\[ \sqrt{n} = \frac{2 \cdot (1.96) \cdot 25}{10} = \frac{98}{10} = 9.80 \]
Squaring both sides gives
\[ n = (9.80)^2 = 96.04 \]
Since the sample size must be an integer, we round up to the next whole number. Thus, a sample size of n = 97 is required to ensure that the 95% confidence interval for the mean response time has a width no greater than 10 milliseconds.

* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th edition.
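The sample size calculation can be wrapped in a small helper that solves the width formula for n and rounds up. This is a sketch under the same assumptions as the example (normal population, known σ); the function name `required_sample_size` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def required_sample_size(sigma, width, confidence=0.95):
    """Smallest n for which the z-interval has total width at most `width`."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    # Width = 2 * z * sigma / sqrt(n)  =>  n = (2 * z * sigma / width)^2, rounded up.
    return ceil((2.0 * z * sigma / width) ** 2)


n_needed = required_sample_size(sigma=25.0, width=10.0)
print(f"required sample size: {n_needed}")
```

Halving the target width to 5 milliseconds in the same call roughly quadruples the required sample size, which illustrates the diminishing returns of the √n term.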


9.2 General Framework for Confidence Intervals¶


We just saw a formula for constructing confidence intervals for the population mean when the standard deviation σ is known
\[ \bar{X} \pm z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]

Here we introduce a framework for interval construction when σ is unknown or


we want a CI for a different parameter like the variance, difference of means, or
regression coefficient. In such cases, we cannot directly use X̄ ± z · SE. Instead,
we look for a statistic that has a known distribution when expressed in terms of
the parameter, use its probability distribution to invert a probability statement,
and thereby derive the confidence interval. This is exactly what the following
oD
,U

general method using h(X1 , . . . , Xn ; θ) does.


oT

Even when using a simple formula, it actually comes from this general
,F
CE

method. For example, the CI (refer Sec.9.1)


R.
a

σ
kh

X̄ ± zα/2 · √
Re

n
25
20

is derived by starting with the probability statement


\[ P\left( -z_{\alpha/2} < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < z_{\alpha/2} \right) = 1 - \alpha \]

This uses the general h(·)-based form and then solves for µ. So this framework
provides the foundation for why confidence interval formulas work.
To construct a confidence interval (CI) for a parameter θ (like µ, the pop-
ulation mean), we use a sample X1 , X2 , . . . , Xn . The idea is to find a random
variable h(X1 , . . . , Xn ; θ) that satisfies two important conditions
1. It must depend on both the sample values and the parameter θ that we
are estimating.
2. The distribution of this variable must be completely known, i.e., it should
not depend on θ or any other unknown parameters.
This kind of variable is useful because we can make probability statements
about it without knowing the parameter we are trying to estimate.
Suppose the population is normal with known standard deviation σ, and
we want to estimate the mean µ. We use the sample mean X̄, and define the
variable
¶ Sections in red may be skipped.

\[ h(X_1, \ldots, X_n; \mu) = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]
This is the standard Z-statistic, which has the standard normal distribution
N (0, 1). It depends on µ, yet its distribution is N (0, 1) regardless of what
the true value of µ is. This makes it ideal for constructing a confidence interval.
Likewise, we can construct a confidence interval for a general parameter θ.
Because we know the distribution of h, we can find values a and b such that

P (a < h(X1 , . . . , Xn ; θ) < b) = 1 − α

We then algebraically rearrange this inequality to isolate θ. That gives us

\[ P\big( \ell(X_1, \ldots, X_n) < \theta < u(X_1, \ldots, X_n) \big) = 1 - \alpha \]


This is exactly what a confidence interval is: a random interval that contains
θ with probability 1 − α. Rearranging the inequality for the standard normal
case leads to
\[ \bar{X} - z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \]
Here, the lower limit ℓ and upper limit u are functions of the sample values
and standard deviation σ, and the interval captures the true value of µ with
probability 1 − α.
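
The claim that the interval is random and captures µ with probability 1 − α can
be checked by simulation. The sketch below is illustrative only: the function
name coverage, the values µ = 50, σ = 25 and n = 20, the number of trials, and
the seed are all assumptions chosen for the demonstration, not values from these
notes.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, confidence=0.95, trials=4000, seed=0):
    """Fraction of simulated z-intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    half_width = z * sigma / math.sqrt(n)  # fixed, because sigma is known
    hits = 0
    for _ in range(trials):
        # Draw a fresh normal sample; only the interval's centre changes
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / trials


print(coverage(mu=50.0, sigma=25.0, n=20))
```

The printed fraction should be close to 0.95. The parameter µ never moves; it is
the interval endpoints that vary from sample to sample.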

As is common in practice, the standard deviation σ may not be known.
Hence we use the sample standard deviation s. We then replace the Z-score zα/2
with the t-score tα/2, n−1 , which adjusts for the additional variability introduced
by estimating σ from the sample. So the confidence interval becomes

\[ \bar{X} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}} \]

The t-interval is typically wider, especially for small sample sizes, reflecting the
increased uncertainty in estimating the population standard deviation.
The general method using h(X1 , . . . , Xn ; θ) gives a unified framework to
construct confidence intervals for any parameter and any distribution, as long
as the distribution of the statistic is known or can be derived. Depending on
whether we know σ, we either use a Z-distribution, or t-distribution. Thus
the framework is the blueprint behind every CI formula and becomes especially
powerful in advanced or custom settings.

9.3 Large-Sample Confidence Intervals for Population Mean


In computing, data science, and engineering applications, it is common to work
with large datasets. Large-sample confidence intervals are a basic and power-
ful tool for making estimates about populations from samples. Large-sample
methods (based on the Central Limit Theorem) allow for approximate inference
without knowing the exact distribution.
In the previous section we saw that the standard deviation σ may not be
known in real scenarios. So we are forced to replace the population standard

deviation σ in Z by the sample standard deviation S to obtain the standardized
variable
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]
Now, both the numerator X̄ and the denominator S are random, since both are
based on the sample. For small sample sizes, the estimate S can fluctuate quite
a bit from the true σ. If S is underestimated, the denominator is too small,
and T can take unusually large values. If S is overestimated, T is compressed
near zero. This extra randomness inflates the probability of extreme T -values,
giving the distribution of random variable T heavier tails than the normal.

[Plot: density against t for −4 ≤ t ≤ 4, with curves for the standard normal
N (0, 1) and the t-distribution with 2 df]

Figure 20: Standard normal distribution vs t-distribution



Figure 20 compares the standard normal distribution (blue solid line) with
the t-distribution with 2 degrees of freedom (red dashed line). Both distributions
are centered at 0 and symmetric, but the t-distribution has a lower peak and
noticeably heavier tails.
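
The heavier tails can also be seen numerically. The sketch below compares upper
tail areas of the two curves. It relies on the closed-form CDF of the
t-distribution with 2 df, F(t) = 1/2 + t / (2·√(2 + t²)), which is a standard
result brought in here for illustration and is not derived in these notes. The
function names are assumptions.

```python
import math
from statistics import NormalDist


def t2_cdf(t):
    """CDF of the t-distribution with 2 degrees of freedom (closed form)."""
    return 0.5 + t / (2.0 * math.sqrt(2.0 + t * t))


def upper_tail_normal(x):
    """Area to the right of x under the standard normal curve."""
    return 1.0 - NormalDist().cdf(x)


def upper_tail_t2(x):
    """Area to the right of x under the t curve with 2 df."""
    return 1.0 - t2_cdf(x)


for x in (1.0, 2.0, 3.0, 4.0):
    print(f"x = {x}: normal tail = {upper_tail_normal(x):.5f}, "
          f"t (2 df) tail = {upper_tail_t2(x):.5f}")
```

At every positive x the t tail area is larger, and the gap grows in relative
terms as x moves further from the center.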
But as the sample size increases, the Law of Large Numbers ensures that the
sample standard deviation S becomes a better and better estimate of σ. So,
the extra variability introduced by using S instead of σ becomes negligible.
Therefore, for large n, the distribution of T approaches the standard normal
distribution and we can use Z-based confidence intervals.
Proposition. If n is sufficiently large, the standardized variable

\[ Z = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

has approximately a standard normal distribution. This implies that

\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

is a large-sample confidence interval for µ with confidence level approximately
100(1 − α)%. This formula is valid regardless of the shape of the population
distribution.

In words, the CI is

point estimate of µ ± (z-critical value)×(estimated standard error of the mean).

Generally speaking, n > 40 will be sufficient to justify the use of this interval.
This is somewhat more conservative than the rule of thumb for the CLT because
of the additional variability introduced by using S in place of σ.

Example 9.4. An internet service provider (ISP) wants to estimate the true
average download speed its customers are experiencing. Due to practical con-
straints, it can’t test every customer’s speed, so it selects a random sample of
n = 64 customers. After measuring their speeds, it finds sample mean x̄ = 48.5
Mbps and sample standard deviation s = 5.2 Mbps. We want to construct a
95% confidence interval for the true average download speed µ that customers
experience.

Since the sample size n = 64 is large, we can use the normal approximation,
even though the population SD σ is unknown. So we use the following confidence
interval formula
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]

For 95% confidence, zα/2 = 1.96.

\[ \text{Standard error} = \frac{5.2}{\sqrt{64}} = \frac{5.2}{8} = 0.65 \]

\[ \text{Margin of error} = 1.96 \cdot 0.65 = 1.274 \]

\[ \text{Confidence interval} = 48.5 \pm 1.274 = (47.23, 49.77) \]

With 95% confidence, we estimate that the true average download speed
lies between 47.23 Mbps and 49.77 Mbps. The full width of the interval is only
about 2.55 Mbps, which indicates a tight interval. But this is not always the case.
The width depends on the variability in the data (reflected by s, and hence on
the nature of the population distribution), as well as on the sample size n and
the confidence level.
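
The arithmetic of Example 9.4 can be reproduced with a short Python sketch. The
function name large_sample_ci is an assumption made for illustration, and the
critical value comes from the standard library instead of the rounded table
value 1.96.

```python
import math
from statistics import NormalDist


def large_sample_ci(xbar, s, n, confidence=0.95):
    """Large-sample z-interval: xbar +/- z_{alpha/2} * s / sqrt(n)."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * s / math.sqrt(n)  # critical value times estimated standard error
    return xbar - margin, xbar + margin


low, high = large_sample_ci(xbar=48.5, s=5.2, n=64)
print(f"({low:.2f}, {high:.2f})")  # (47.23, 49.77)
```

Calling the same function with a larger s or a smaller n shows directly how the
interval widens.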

Understanding large-sample confidence intervals (CIs) prepares the ground for
more advanced concepts in machine learning, particularly confidence bands and
intervals. In classical statistics, a CI provides a range of plausible values for
a parameter (e.g., population mean), quantifying uncertainty using probability
theory. In machine learning, we also need to measure uncertainty, not just about
parameters but also about predictions and fitted functions. Confidence bands
extend the idea of CIs: instead of a single parameter, they give a range around
an estimated function (e.g., a regression line) that likely contains the true curve.
Large-sample methods (e.g., CLT approximations, bootstrap) form the statisti-
cal backbone for ML uncertainty quantification in regression, classification, and
prediction tasks. Thus, mastering CIs in statistics builds the foundation for in-
terpreting uncertainty in ML models, ensuring reliable and safe decision-making
in practice.

A General Framework for Large Sample CI The large-sample intervals
like
\[ \bar{x} \pm z_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
are special cases of a general large-sample confidence interval (CI) for a
parameter θ. Suppose that θ̂ is an estimator satisfying the following conditions:
(1) θ̂ has an approximately normal distribution, (2) θ̂ is (at least approximately)
unbiased, and (3) the standard deviation (standard error) of θ̂, denoted σθ̂ , is
known or can be estimated.
Standardizing θ̂ gives the random variable

\[ Z = \frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}} \]

which is approximately the standard normal N (0, 1). Therefore, the following
probability statement holds approximately
\[ P\left( -z_{\alpha/2} < \frac{\hat{\theta} - \theta}{\sigma_{\hat{\theta}}} < z_{\alpha/2} \right) \approx 1 - \alpha \]

Multiplying through by σθ̂ and rearranging


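\[ P\left( \hat{\theta} - z_{\alpha/2} \cdot \sigma_{\hat{\theta}} < \theta < \hat{\theta} + z_{\alpha/2} \cdot \sigma_{\hat{\theta}} \right) \approx 1 - \alpha \]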



So the general form of a large-sample confidence interval is


\[ \hat{\theta} \pm z_{\alpha/2} \cdot \sigma_{\hat{\theta}} \]

There are three cases to deal with



• If σθ̂ is completely known, this formula gives an exact large-sample CI.

• If σθ̂ involves other unknown parameters (but not θ), then we estimate it
by sθ̂ , the plug-in estimate. The resulting CI is

\[ \hat{\theta} \pm z_{\alpha/2} \cdot s_{\hat{\theta}} \]

• If σθ̂ involves θ itself (e.g., when θ = p, a population proportion), we
approximate σθ̂ by replacing θ with θ̂. The resulting approximate CI
becomes

\[ \hat{\theta} \pm z_{\alpha/2} \cdot s_{\hat{\theta}} \]

In short, the large-sample confidence interval is

point estimate of θ±(critical value)×(estimated standard error of the estimator).
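
As an illustration of the third case, consider a population proportion p. The
standard error of the sample proportion is √(p(1 − p)/n), which involves p
itself, so p is replaced by the estimate p̂. The sketch below is a minimal
example under that plug-in assumption. The function name proportion_ci and the
counts (120 successes in 400 trials) are hypothetical and chosen only for the
demonstration.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Large-sample CI for a proportion using the plug-in standard error."""
    p_hat = successes / n                          # point estimate of p
    se_hat = math.sqrt(p_hat * (1.0 - p_hat) / n)  # sigma with p replaced by p_hat
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return p_hat - z * se_hat, p_hat + z * se_hat


low, high = proportion_ci(successes=120, n=400)  # hypothetical counts
print(f"({low:.4f}, {high:.4f})")
```

The structure is the same as for the mean: point estimate ± critical value ×
estimated standard error. Only the estimator and its standard error change.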

One-Sided Confidence Intervals (Confidence Bounds) In many real-


world situations, we are not always interested in estimating both lower and
upper limits for a population parameter. Sometimes, it is sufficient or even nec-
essary to determine a bound in only one direction. For example, a manufacturer
may want to ensure that the average strength of a material does not fall below

a certain level, or a safety engineer might want to confirm that a measurement
does not exceed a maximum allowable threshold. In such cases, one-sided con-
fidence intervals, also known as confidence bounds, are more appropriate than
the usual two-sided intervals. These intervals provide either an upper or lower
limit for the parameter with a specified level of confidence, and are particularly
useful in threshold-based decisions or when prior knowledge justifies focusing
on one direction of error.

9.4 Intervals Based on a Normal Population Distribution


The large-sample confidence intervals we discussed do not require the popula-
tion to follow a specific distribution (like normal). Instead, they rely on the
Central Limit Theorem (CLT), which ensures that for large enough sample size
n, the sampling distribution of the sample mean X̄ is approximately normal,
regardless of the shape of the population distribution, provided the population
has a finite mean and variance. In this section, we focus on intervals for a
normal distribution.
For large sample sizes, the standardized variable Z is approximately stan-
dard normal due to the Central Limit Theorem. However, when the sample
size is small, using the sample standard deviation S instead of the population
standard deviation σ introduces extra variability. As a result, the distribution
of the statistic becomes wider than the normal distribution. To account for
this, statisticians use a new family of distributions called the t-distributions,
which are specifically designed for small-sample inference. Our focus here is on
t-distributions.

Theorem. When X̄ is the mean of a random sample of size n from a normal
distribution with mean µ, the random variable

\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

has a probability distribution called a t-distribution with n − 1 degrees of
freedom (df ).
The t-distribution and the standard normal distribution are both symmetric
and bell-shaped, centered at zero. However, they differ in their spread and tail
behavior. The standard normal distribution Z has a fixed shape with relatively
thin tails. It has no parameter since µ = 0 and σ² = 1. In contrast, the
t-distribution has its center at 0 (like the standard normal), but the spread
depends on the degrees of freedom. It has heavier tails, meaning it gives more
probability to extreme values. This reflects the extra uncertainty introduced
when the population variance σ² is unknown and is replaced by the sample
variance S². The t-distribution is parameterized by degrees of freedom (df),
typically n−1 for a sample of size n. When the degrees of freedom are small, the
distribution is more spread out compared to the standard normal. As the degrees
of freedom increase, the t-distribution gradually approaches the standard normal
distribution. In practice, the standard normal is used when the population
variance is known, while the t-distribution is applied when it is unknown and
must be estimated, especially in the case of small samples.

9.4.1 Properties of t-Distribution
When estimating the population mean µ and the population standard deviation
σ is unknown, we use the sample standard deviation S. This leads to using the
t-distribution rather than the normal distribution for the standardized statistic
\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

This variable T does not follow the standard normal distribution when the
sample size n is small. Instead, it follows a t-distribution, which depends on a
single parameter called the degrees of freedom (df ). For a sample of size n,
the degrees of freedom is usually ν = n − 1.

Degree of Freedom It refers to the number of values in a calculation that


are free to vary, given a constraint. When we estimate something from data, we
often impose constraints. These constraints reduce the number of independent
pieces of information available for further calculations. To calculate the sample
mean x̄
\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

all n values are free to vary, so df for mean = n.



Sample variance is given by


\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Here, sample mean x̄ is computed from the data and acts as a constraint. After
computing x̄, only n − 1 deviations from the mean can vary freely. So df for
variance or sd is n − 1.
In general, when we plug the sample mean into another formula, it becomes
a constraint and we lose 1 degree of freedom.
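
The constraint can be seen directly in a small example. The data values below
are hypothetical. Because the deviations from x̄ always sum to zero, the last
deviation is fully determined by the other n − 1.

```python
from statistics import fmean, variance

data = [4.0, 7.0, 1.0, 8.0]  # hypothetical sample, n = 4
n = len(data)
xbar = fmean(data)

# Deviations from the sample mean always sum to zero (the constraint)
deviations = [x - xbar for x in data]
print(sum(deviations))  # 0.0

# So the last deviation can be recovered from the first n - 1 of them
last_from_rest = -sum(deviations[:-1])
print(last_from_rest, deviations[-1])  # 3.0 3.0

# Sample variance divides by the n - 1 free deviations, not by n
s2 = sum(d * d for d in deviations) / (n - 1)
print(s2, variance(data))  # 10.0 10.0
```

The standard-library function statistics.variance uses the same n − 1 divisor.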

Key Properties of the t-Distribution Let tn denote the t-distribution


with n degrees of freedom (df). The following are key properties of the tn
distributions
1. Each tn curve is bell-shaped and centered at 0.
2. Each tn curve is more spread out than the standard normal (z) curve.

3. As n increases, the spread of the corresponding tn curve decreases.


4. As n → ∞, the sequence of tn curves approaches the standard normal
curve.
\[ \lim_{n \to \infty} t_n = N(0, 1) \]

Hence, the standard normal curve is often referred to as the t-curve with
infinite degrees of freedom.

Figure 21: Comparing Z-distribution with t-distributions

Figure 21 compares the standard normal distribution and t-distributions


with different degrees of freedom (df). All the curves are bell-shaped and cen-
tered at zero, but their spread varies. The z curve is the narrowest, indicating
less variability. The t5 curve, with a low degree of freedom, is the widest and
flattest, reflecting the greater uncertainty due to small sample size. As the de-
grees of freedom increase (e.g., t25 curve), the t-distribution becomes narrower
and more similar to the standard normal. This shows that the t-distribution
approaches the z-distribution as the sample size grows large.

t-critical value A t-distribution is a bell-shaped curve used when estimating
population parameters from a small sample and when the population standard
deviation is unknown. Each t-distribution depends on degrees of freedom ν
(usually n − 1, where n is sample size).
tα,ν is a number on the x-axis of the t-distribution such that the area to
the right of this number is α where α is a small probability. It is called the
t-critical value because it “cuts off” the tail area (e.g., the most extreme 5%) of
the t-curve. We use tα,ν in confidence intervals and hypothesis tests when the
sample is small, the population standard deviation is unknown, and we have to
assume the population is approximately normal.

Figure 22: t-critical value tα,ν for α = 0.05, ν = 9

Figure 22 clarifies the meaning of a t-critical value. It is the point beyond


which the area under the t-curve is α and it depends on the degrees of freedom
ν. The blue curve represents the t-distribution with ν = 9. The red shaded area
corresponds to the right tail where the area (or probability) is α = 0.05. The
dashed red line shows the critical value t0.05,9 ≈ 1.83.

One Sample t Confidence Interval We define a standardized variable

\[ T = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

which follows a t-distribution with n − 1 degrees of freedom (ν), where n is


the sample size. The standardized variable T has a t distribution and the area
under the corresponding t density curve between −tα/2,n−1 and +tα/2,n−1 is
1 − α (area α/2 lies under each tail). So

P (−tα/2,n−1 < T < tα/2,n−1 ) = 1 − α

Note: In the Z-distribution, we replace tα/2,n−1 by zα/2 .


Proposition. Let x̄ and s be the sample mean and sample standard deviation
computed from the results of a random sample from a normal population with
mean µ. Then a 100(1 − α)% confidence interval for µ is
 
\[ \left( \bar{x} - t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}},\; \bar{x} + t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}} \right) \]

or, more compactly,

\[ \bar{x} \pm t_{\alpha/2,\,n-1} \cdot \frac{s}{\sqrt{n}}. \]

An upper confidence bound for µ is

\[ \bar{x} + t_{\alpha,\,n-1} \cdot \frac{s}{\sqrt{n}}, \]

and replacing the + with − in this latter expression gives a lower confidence
bound for µ, both with confidence level 100(1 − α)%.


This procedure applies to a single random sample used to estimate the population mean µ.
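
The proposition can be sketched in Python as follows. The Python standard
library has no t-distribution, so the critical values are passed in as
arguments, as if read from a t-table. For ν = 9 the values used are
t0.025,9 ≈ 2.262 and t0.05,9 ≈ 1.833. The data values and the function names
are hypothetical.

```python
import math
from statistics import fmean, stdev


def t_interval(data, t_crit):
    """Two-sided interval xbar +/- t_{alpha/2, n-1} * s / sqrt(n)."""
    n = len(data)
    xbar = fmean(data)
    s = stdev(data)  # sample standard deviation (n - 1 divisor)
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin


def upper_confidence_bound(data, t_crit_one_sided):
    """Upper bound xbar + t_{alpha, n-1} * s / sqrt(n)."""
    n = len(data)
    return fmean(data) + t_crit_one_sided * stdev(data) / math.sqrt(n)


# Hypothetical sample of n = 10 measurements, so nu = 9
data = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.2]

low, high = t_interval(data, t_crit=2.262)                    # 95% two-sided
bound = upper_confidence_bound(data, t_crit_one_sided=1.833)  # 95% upper bound
print(f"95% CI: ({low:.3f}, {high:.3f}); 95% upper bound: {bound:.3f}")
```

The one-sided bound puts all of α in a single tail, so it uses the smaller
critical value tα,n−1 and lies below the upper limit of the two-sided interval.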

9.5 Confidence Intervals for the Variance and Standard Deviation of a Normal Population
Although statistical inferences about a population variance σ² or standard
deviation σ are generally of less interest than those about a mean or proportion,
there are situations where such procedures become important. These can be
described using the chi-squared distribution.
The Chi-Squared Distribution (χ²) is a probability distribution that arises
naturally in statistical inference. Formally, if X1 , X2 , . . . , Xk ∼ N (0, 1) are
independent standard normal random variables, then

\[ \chi^2 = X_1^2 + X_2^2 + \cdots + X_k^2 \]

follows a chi-squared distribution with k degrees of freedom. The distribution
has a single parameter – the degrees of freedom (df). It is denoted by χ²ₖ, where
k is the number of squared independent standard normals.
The distribution is right-skewed for small k and becomes more symmetric
as k → ∞. It has mean E[χ²ₖ] = k and variance Var(χ²ₖ) = 2k. For large k,
χ²ₖ ≈ N (k, 2k).
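
The definition can be checked by simulation: sum k squared standard normal draws
many times and compare the sample mean and variance with k and 2k. The sketch
below is illustrative only. The function name chi_squared_draws, the choice
k = 5, the number of draws, and the seed are assumptions.

```python
import random
from statistics import fmean, variance


def chi_squared_draws(k, draws=20000, seed=0):
    """Simulate chi-squared(k) values as sums of k squared N(0, 1) draws."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(k))
            for _ in range(draws)]


samples = chi_squared_draws(k=5)
print(f"mean ~ {fmean(samples):.2f} (theory: 5)")
print(f"variance ~ {variance(samples):.2f} (theory: 10)")
```

Every simulated value is positive, and the simulated mean and variance should
land close to k and 2k respectively.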

Figure 23: Graphs of chi-squared density functions

The graphs of several χ2 probability density functions (pdf’s) are illustrated


in Fig.23. Each pdf is positive only for x > 0, and each has a positive skew
(stretched out upper tail). As k increases, the distribution shifts to the right
and becomes more symmetric. The rightward shift follows from the mean k
increasing, while the skewness, √(8/k), shrinks as k grows.
Theorem. Let X1 , X2 , . . . , Xn be a random sample from a normal distribution
with parameters µ and σ². Then the random variable

\[ \frac{(n-1)S^2}{\sigma^2} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{\sigma^2} \]

has a chi-squared (χ²) distribution with n − 1 degrees of freedom.


When we compute the sample variance S², we use the sample mean X̄
instead of the true mean µ. This “costs” one degree of freedom and it is down
to n − 1.

10 Hypothesis Testing
When we collect sample data, the first step is often to compute a point estimate,
such as the sample mean X̄, to provide a single best guess for a population
parameter like the true mean µ. While this is simple and easy to report, it does
not convey any information about the uncertainty in the estimate. To quantify
this uncertainty, we construct a confidence interval (CI), which gives a range
of plausible values for the parameter. For example, a 95% CI for the average
blood pressure reduction from a new drug might be [2, 8] mmHg. This interval
indicates that, based on the sample data, we are 95% confident that the true
mean reduction lies somewhere within this range. All values inside the interval
are considered plausible, while those outside are unlikely.
However, in practice, decision-makers often face a specific question about
a particular value. For instance, a regulator may ask: “Does this drug reduce
blood pressure by at least 4 mmHg on average?” While the CI provides context
– showing that 4 mmHg is within the plausible range – it does not formally
quantify the strength of evidence for this specific claim. We can see that while
the average effect might exceed 4 mmHg, it could also be as low as 2 mmHg
which means the effect of the drug could be small. This is where hypothesis
testing (HT) comes in. Hypothesis testing is like a structured way of asking,
“Does the sample data support a specific claim about the population parameter?”
While the CI provides a range of plausible parameter values, hypothesis
testing allows us to examine the strength of evidence (via the test statistic and
p-value) for each specific value within or outside the CI.



10.1 Hypothesis Testing


In statistics, a hypothesis is a formal claim about a population parameter. The
null hypothesis (H0 ) represents the baseline assumption or “status quo,” such
as “the teacher calls on boys and girls equally often” or “the population mean is
µ0 .” The alternative hypothesis (Ha ) represents the competing claim we want
to test, such as “the teacher favors boys” or “the mean is greater than µ0 .” The
basic idea of hypothesis testing is to use data to assess whether the observed
evidence is so unlikely under H0 that we should reject it in favor of Ha . This
is done by choosing a test statistic, comparing its observed value to what is
expected under H0 , and computing the probability (the p-value) of obtaining
results at least as extreme as the observed data if H0 were true.
You may think of hypothesis testing like a courtroom trial. The null hypoth-
esis (H0 ) is like assuming the defendant is innocent – it is the default position
we start with. The alternative hypothesis (Ha ) is like saying the defendant is
guilty. The job of the data (like the evidence in court) is to challenge the null
hypothesis. We don’t reject innocence unless the evidence is strong enough.
The test statistic is a summary of the data, like the key piece of evidence. The
p-value tells us how surprising this evidence would be if the null were actually
true. If that surprise is too large (the p-value is small), we reject H0 and ac-
cept that the alternative Ha has stronger support. If the evidence isn’t strong
enough, we fail to reject H0 – which doesn’t prove innocence, but means there
isn’t enough reason to overturn the status quo.

10.2 Form of Hypotheses in Testing
Suppose a manufacturing process has a defect rate of p = 0.10. A new process
is proposed and we want to check if it improves quality, i.e., “does the new
process reduce defects below 10%?”. So we start with the assumption (status
quo) that the defect rate is at least 0.10 which is framed as the null hypothesis
H0 : p ≥ 0.10. But testing this inequality is mathematically harder; the test
statistic’s distribution under the null is no longer standard.
Suppose we inspect n = 50 items from the new process and record the
number of defectives, call it X. Since each item is defective with probability p,
X ∼ Binomial(n = 50, p). Now look at the null hypothesis: If we test H0 : p ≥
0.10, then p could be 0.10, 0.15, 0.20, . . . But the distribution of X depends on
the actual p

If p = 0.10, X ∼ Bin(50, 0.10), E[X] = 5.


If p = 0.15, X ∼ Bin(50, 0.15), E[X] = 7.5.
If p = 0.20, X ∼ Bin(50, 0.20), E[X] = 10.

So under H0 : p ≥ 0.10, the distribution of X is not uniquely determined – it


shifts depending on the actual p.
Hence we test the equality null hypothesis H0 : p = 0.10 against the
alternative Ha : p < 0.10 to effectively test the same claim as H0 : p ≥ 0.10.
Rejecting H0 : p = 0.10 implies the observed data is inconsistent with p ≥ 0.10
as well, since an unusually small number of defectives is even less likely when
p is larger than 0.10. If we fail to reject H0 , it means that the sample data are
not strong enough to show that the defect rate is below 0.10. We cannot conclude
that the new process improves quality; the data are consistent with the current
process (status quo). Importantly, this does not prove that H0 is true – we just
don’t have enough evidence to support Ha . In other words, failing to reject H0
leaves us inconclusive regarding the claim, but it respects the original
assumption that the defect rate is at least 0.10.


Hypothesis testing is framed in terms of H0 and is not symmetric. Rejecting
H0 means the sample data are strong enough to conclude that H0 is unlikely.
This can be taken as evidence in favor of Ha but we never formally accept Ha .
In general, the null hypothesis is usually stated as an equality

H0 : θ = θ0

where θ is the parameter of interest like µ and θ0 is the claimed value. Three
common forms of alternative hypotheses Ha are
1. Ha : θ > θ0 (Right-tailed test)
2. Ha : θ < θ0 (Left-tailed test)
3. Ha : θ ≠ θ0 (Two-tailed test)
The value θ0 used in both H0 and Ha is called the null value. It is the threshold
separating the null hypothesis from the alternative.

10.3 Test Procedure


Suppose a teacher claims to randomly call on students, and the class has an
equal number of boys and girls. You observe that she picks 20 students in a

row. We first define a test statistic which is a function of the data chosen such
that its sampling distribution under H0 is known (or well approximated) and it
makes discrimination between H0 and Ha possible. The test statistic captures
how far the observed data are from what we would expect under H0 .
Here we choose the test statistic X which is the number of boys in n = 20
picks. Under the null hypothesis of fairness H0 : p = 0.5, we know exactly the
distribution of X as
X ∼ Bin(n = 20, p = 0.5).
The sampling distribution of the test statistic X shows us how it would behave
over many samples if H0 were true. In this example
 
\[ P(X = k \mid H_0) = \binom{20}{k} (0.5)^{20}, \quad k = 0, 1, \ldots, 20. \]

Larger values of X are evidence against the fairness assumption encoded in H0


in favor of the alternative Ha : p > 0.5. So X satisfies both the conditions and
it qualifies to be a test statistic.
We then compute the p-value which is the probability, under H0 , of observ-
ing a value of the test statistic at least as extreme as the observed one, where
“extreme” depends on the chosen alternative Ha . Here extreme means
• right-tailed : probability of values ≥ observed,

• left-tailed : probability of values ≤ observed, and

• two-tailed : probability of values at least as far from the center in either
direction.

Coming back to the example,



• If observed X = 20 boys and we are interested in knowing if the teacher
favors boys (Ha : p > 0.5), the p-value is

\[ \text{p-value} = P(X \ge 20 \mid H_0) = P(X = 20 \mid H_0) = (0.5)^{20} \approx 9.54 \times 10^{-7}. \] †

(P (X ≥ 20) = P (X = 20) since the teacher picked 20 students and the
count of boys cannot be more.)
This tells us that if the teacher were fair, then the probability of seeing
all 20 boys is very small. But we have observed 20 boys and hence the
fairness assumption is rejected.
• If observed X = 12 boys, then for the same one-sided alternative Ha : p > 0.5
the p-value is

\[ \text{p-value} = P(X \ge 12 \mid H_0) = \sum_{k=12}^{20} \binom{20}{k} (0.5)^{20} \approx 0.25. \]

† In probability theory, conditional probability is written as P (A | B), where both A and


B are events (often involving random variables) in the same probability space. In hypothesis
testing, however, we often write P (X = k | H0 ). Here “| H0 ” does not represent conditioning
on another random variable. Instead, it is shorthand for “probability computed under the
assumption that H0 is true.” Formally, this is better written as PH0 (X = k), meaning the
probability of X = k when probabilities are evaluated under the distribution specified by H0 .

This tells us that if the teacher were fair, then the probability of seeing
12 or more boys is fairly high (about one in four). We have observed 12 boys
and hence the fairness assumption cannot be rejected.
We then choose a significance level α which is a threshold that determines how
small the p-value must be to reject H0 . Common values include 0.10, 0.05, 0.01,
and 0.001. We then execute the test procedure which is the following rule
that uses sample data to decide whether to reject the null hypothesis

If p-value ≤ α, reject H0 ; if p-value > α, do not reject H0 .

Continuing with the example


• For the observed value of X = 20, the p-value computed as 9.54 × 10⁻⁷ is
extremely small, so we reject H0 at any usual level α.
• For the observed value of X = 12, the p-value computed as 0.25 exceeds
every usual level α, so we do not reject H0 .
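
The p-values and decisions for the teacher example can be reproduced with a
short Python sketch. The function names below are assumptions made for
illustration, and α = 0.05 is just one of the common levels listed above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def right_tailed_p_value(observed, n, p0):
    """P(X >= observed) computed under H0: p = p0."""
    return sum(binomial_pmf(k, n, p0) for k in range(observed, n + 1))


def decide(p_value, alpha=0.05):
    """Test procedure: reject H0 when the p-value is at most alpha."""
    return "reject H0" if p_value <= alpha else "do not reject H0"


for observed in (20, 12):
    p_val = right_tailed_p_value(observed, n=20, p0=0.5)
    print(f"X = {observed}: p-value = {p_val:.3g}, decision: {decide(p_val)}")
```

For X = 20 the sum has a single term, (0.5)²⁰, while for X = 12 it adds the
nine terms from k = 12 to k = 20.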

10.4 General Form of a Test Statistic


In the teacher example, we simply judged based on the difference between
a single sample outcome and the claimed value. However, there are scenarios
where multiple observations must be made and reasoned about. We cannot
check each individually and are forced to summarize the sample with X̄.

A naive approach would be to compare the observed estimate θ̂ (e.g., sample
mean X̄) with a hypothesized (or claimed) population value θ0 . Simply looking
at the raw observed difference, θ̂ − θ0 , is insufficient, because the observed
difference could be large or small just by chance. To make a fair assessment, we
need a way to standardize this difference by the typical variability expected in
this estimator. Variability describes the spread of the estimator across repeated
samples. The typical variability is measured by the standard error (SE) of the
estimator. Recall from Sec. 8.2.1 that the typical variability for the sample mean
is nothing but the standard deviation of the sample mean X̄, which is σ/√n.
When σ is unknown we use the estimated standard error S/√n.
If we know the true population standard deviation σ, then the standard error
is completely reliable because it is based on a fixed population parameter;
there is no randomness or estimation involved. But in practice, we don’t know
σ. Hence when we estimate SE using s/√n, the randomness in selecting
the sample carries over into the estimated SE.
Intuitively, SE is smaller when the sample size is larger or the data are
less noisy, giving more precise and reliable estimate of the population value.
The same observed difference becomes stronger evidence. Conversely, a large
SE indicates noisy data (which implies a less reliable estimate) and the same
difference is less convincing. Incorporating SE into the test statistic allows us
to weigh the observed difference by the reliability of the estimate. This is done
by dividing the observed difference θ̂ − θ0 by the SE as it accounts for the scale
of expected variability. The general form of the test statistic is

\[ \text{test statistic} = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})} = \frac{\text{observed difference}}{\text{typical variability}} \]

which standardizes the difference by dividing it by the standard error.
The test statistic is a dimensionless quantity that reflects the signal-to-noise
ratio. A large value for the statistic indicates that the observed mean value is
far from the claim θ0 relative to expected fluctuations (SE), providing strong
evidence against the claim, whereas a small statistic value indicates that the
observed difference could easily arise by chance, or simply due to noisy data.
Thus, the statistic incorporates both the magnitude of the observed difference
and the reliability of the estimate, enabling a fair assessment of how surprising
the sample mean is under the claim. This standardized framework allows us to
make confident decisions about the claim, regardless of the units or scale of the
data.
Depending on the situation, this statistic follows different reference
distributions: for example, a standard normal distribution (Z) when the
standard deviation is known (or n is large), or a t-distribution (T) when it
must be estimated from a small sample. These are discussed in the following
sections.
Estimation feeds into testing because we rarely have the entire population
data when testing a claim. So, we often estimate first and then use that
estimate to test whether a specific assumption holds. Depending on the use
case, estimation and testing can be done by the same people or by different
ones. In most practical settings, such as academic research, scientific
studies and business analytics, the same person or team does both. In
regulatory or auditing contexts, different roles may be involved. A company
estimates its own metrics (e.g., “Our average delivery time is 2.5 days.”),
and an external auditor or regulator may test that claim using independent
sampling. Here the estimator and the tester are different entities.

10.5 z Tests for Hypothesis about a Population Mean


To compute a p-value in hypothesis testing, we rely on the distribution of the
test statistic under the null hypothesis H0. A z-test refers to hypothesis
testing procedures where the test statistic follows the standard normal
distribution.

10.5.1 A Normal Population Distribution with known σ


Here we consider the case in which the population distribution is normal with
known standard deviation σ. Then the sample mean X̄ has a normal distribution
with expected value µ_X̄ = µ and standard deviation σ_X̄ = σ/√n. When H0
is true, µ_X̄ = µ0. Consider now the statistic Z obtained by standardizing X̄
under the assumption that H0 is true

\[
Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}
\]

When certain conditions are met, this statistic follows a standard normal dis-
tribution Z ∼ N (0, 1). This allows us to compute p-values as areas under the
standard normal (Z) curve.
The test statistic Z measures how many standard errors the sample mean X̄ is
from the claimed (or null) mean µ0. Here σ is the population standard
deviation, assumed to be known in a z-test, n is the sample size (the number
of observations in the sample), and σ/√n is the standard error of the mean,
which measures the variability of X̄ across samples. A large absolute value
of Z indicates that the sample mean X̄ is far from the hypothesized mean µ0,
which may provide evidence against H0.
More generally, the sample data themselves need not follow a normal
distribution for a z-test. What matters is the distribution of the sample mean
X̄, which becomes approximately normal under the Central Limit Theorem (CLT)
if the sample size is large.
The z-test is exact if the population is normal and σ is known, since then
X̄ ∼ N(µ, σ²/n) exactly. If the population is not normal but the sample size n
is large, X̄ is approximately normal by the CLT and the test is approximately
valid.
We now ask: “If H0 : µ = µ0 were true, i.e., the population mean equals the
claimed mean, what is the probability of getting a sample mean at least as
extreme as the one the sample produced?” The answer is the p-value, which
connects the sample to the population. We are testing whether the evidence is
strong enough to support Ha by trying to rule out H0.
The right tailed hypothesis Ha : µ > µ0 says that the true population mean
is larger than the claimed mean. We now look for evidence in our data that
the sample mean X̄ is unusually large compared to µ0 . After standardization,
this corresponds to zobs , the numerical value of the test statistic computed from
your data. We now ask “what is the probability of getting a Z-value at least as
large as the one we observed?” That probability is exactly

\[
\begin{aligned}
p &= P(Z \ge z_{\text{obs}} \mid H_0 \text{ is true}) \\
  &= 1 - P(Z \le z_{\text{obs}}) \\
  &= 1 - \Phi(z_{\text{obs}})
\end{aligned}
\]

where Φ is the cumulative distribution function (CDF) of the standard normal
distribution.
A small p-value implies that the observed data is very unlikely under H0,
which implies that the evidence favors Ha. On the contrary, a large p-value
indicates that the observed data is plausible under H0, which implies that
there is insufficient evidence for Ha.
The p-value computations for the other alternative hypotheses are

• Ha : µ < µ0 (left-tailed test) ⇒ the p-value is the area to the left of the
observed Z, p = P(Z ≤ z_obs) = Φ(z_obs).
• Ha : µ ≠ µ0 (two-tailed test) ⇒ p = P(Z ≥ |z_obs|) + P(Z ≤ −|z_obs|).
By the symmetry of the standard normal distribution, P(Z ≤ −|z_obs|) =
P(Z ≥ |z_obs|), so the two-tailed p-value simplifies to
p = 2 P(Z ≥ |z_obs|) = 2[1 − Φ(|z_obs|)].
A summary of this is in Fig.24*.
* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.

Figure 24: Determination of the p-value for a z test
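
The three p-value formulas can be sketched in a few lines of Python. This is a minimal illustration using only the standard library: the function names and the labels "greater", "less" and "two-sided" are assumptions introduced here, and Φ is evaluated through the complementary error function `math.erfc`.

```python
import math

def phi(z):
    """Standard normal CDF, written via the complementary error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def z_test_p_value(z_obs, alternative):
    """p-value of a z test; alternative is 'greater', 'less' or 'two-sided'."""
    if alternative == "greater":      # Ha: mu > mu0, area to the right of z_obs
        return 1.0 - phi(z_obs)
    if alternative == "less":         # Ha: mu < mu0, area to the left of z_obs
        return phi(z_obs)
    if alternative == "two-sided":    # Ha: mu != mu0, both tails beyond |z_obs|
        return 2.0 * (1.0 - phi(abs(z_obs)))
    raise ValueError("unknown alternative: " + alternative)

# Illustrative value z_obs = 1.96 (not taken from the text)
for alt in ("greater", "less", "two-sided"):
    print(alt, round(z_test_p_value(1.96, alt), 4))
```

The same observed z gives very different p-values depending on the alternative, which is why Ha has to be fixed before looking at the data.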

10.5.2 Large Sample Size


If the sample size n is large, then by the Central Limit Theorem (CLT), the
distribution of the sample mean X̄ becomes approximately normal regardless of
the population’s shape. So, even if the population is not normally distributed,
we can still proceed with a z-test. In real-world problems, we often don’t know
the population standard deviation σ. If n is large, we can safely estimate σ
using the sample standard deviation s. The test statistic becomes

\[
Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
\]

which has approximately a standard normal distribution when H0 is true. As a
rule of thumb, this test statistic can be used if the sample size is greater
than 40.
We can still use the interpretations for p-value described in the previous
section. The probability of Type I error (rejecting a true H0 ) remains ap-
proximately equal to the chosen α. So the test is still valid with approximate
significance level α.

Example 10.1. * Urban storm water can be contaminated by materials such
as metals from discarded batteries. A random sample of 51 Panasonic AAA
batteries gave a sample mean zinc mass of 2.06 g and a sample standard
deviation of 0.141 g. Does this data provide compelling evidence for concluding
that the population mean zinc mass exceeds 2.0 g? Let us employ a significance
level of 0.01 to reach a conclusion.

Given a sample of size n = 51, sample mean x̄ = 2.06, sample standard
deviation s = 0.141 and significance level α = 0.01, the hypotheses are

H0 : µ = 2.0 vs Ha : µ > 2.0

Since n = 51 is large, the sampling distribution of X̄ is approximately normal
(by the Central Limit Theorem). The test statistic approximately follows a
standard normal distribution under H0

\[
Z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
\]

The test statistic is computed as

\[
z = \frac{2.06 - 2.0}{0.141/\sqrt{51}} = \frac{0.06}{0.0197} \approx 3.04
\]

The p-value is computed as

p-value = P(Z ≥ 3.04) = 1 − Φ(3.04) ≈ 1 − 0.9988 = 0.0012

Comparing the p-value to the significance level

p-value = 0.0012 ≤ 0.01 = α ⇒ Reject H0

At the 1% significance level, there is strong evidence that the population mean
zinc mass exceeds 2.0 g. The data support the claim that the average zinc
mass is more than 2.0 g.

Example 10.2. * A Dynamic Cone Penetrometer (DCP) test measures penetration
resistance (in mm/blow). For a pavement type to be acceptable, the true
mean DCP value must be less than 30. A sample of n = 52 observations yielded
a sample mean of x̄ = 28.76 and a sample standard deviation of s = 12.2647.
The population distribution is not normal, but n > 40, allowing use of the
z-test. Can we conclude at a significance level of α = 0.05 that the true average
DCP is less than 30?

Hypotheses

H0 : µ = 30 vs. Ha : µ < 30 (left-tailed test)

Test statistic

\[
z = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{28.76 - 30}{12.2647/\sqrt{52}} \approx \frac{-1.24}{1.701} \approx -0.73
\]

Since we are conducting a left-tailed z-test, the p-value is

P(Z ≤ −0.73) = Φ(−0.73) ≈ 0.2327

Since 0.2327 > 0.05, we do not reject H0. There is insufficient evidence to
conclude that the true mean DCP value is less than 30, so the pavement should
not be used. Note that we may possibly have committed a Type II error.
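
The two examples can be re-computed from their summary statistics with a short script. This is a sketch using only the Python standard library; the function name `large_sample_z_test` and its argument names are assumptions made for illustration. Because the script does not round z before looking up Φ, its p-values can differ from the table-based ones in the last digit.

```python
import math

def large_sample_z_test(xbar, mu0, s, n, alternative, alpha):
    """Return (z, p_value, reject) for a large-sample z test from summary statistics."""
    z = (xbar - mu0) / (s / math.sqrt(n))
    cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))   # Phi(z)
    if alternative == "greater":
        p = 1.0 - cdf
    elif alternative == "less":
        p = cdf
    else:                                        # two-sided
        p = 2.0 * min(cdf, 1.0 - cdf)
    return z, p, p <= alpha

# Example 10.1 (zinc mass): Ha: mu > 2.0 at alpha = 0.01
z1, p1, reject1 = large_sample_z_test(2.06, 2.0, 0.141, 51, "greater", 0.01)
# Example 10.2 (DCP): Ha: mu < 30 at alpha = 0.05
z2, p2, reject2 = large_sample_z_test(28.76, 30.0, 12.2647, 52, "less", 0.05)

print(round(z1, 2), round(p1, 4), reject1)
print(round(z2, 2), round(p2, 4), reject2)
```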

10.6 The One-Sample t Test
When the sample size n is small we cannot rely on the large-sample z-test: the
Central Limit Theorem (CLT) does not guarantee that the sampling distribution
of the sample mean is approximately normal, and the population standard
deviation σ is usually unknown, so replacing it by s adds extra variability.
We use a t-test instead (refer Sec. 9.4), which requires the assumption that
the population distribution is approximately normal. Let the small sample
X1, X2, ..., Xn be drawn from a normal population. The test statistic is

\[
T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}}
\]

where X̄ is the sample mean, S is the sample standard deviation, and µ0 is the
hypothesized mean under H0. Under the null hypothesis H0 : µ = µ0, the test
statistic T follows a t-distribution with n − 1 degrees of freedom

\[
T \sim t_{n-1}
\]

The p-value is determined from the area under the t_{n−1} curve corresponding to
the observed value of T, as shown in Fig. 25. It depends on whether the test is
one-tailed or two-tailed.

Figure 25: p-values for t-tests

When calculating p-values for t-tests, we need to find tail areas under a
t-distribution curve. For z-tests, this is easy: a single z-table gives areas
(i.e., cumulative probabilities) for many z-values (like 1.23, 2.71, etc.).
For t-distributions, things are different. Each t-distribution depends on its
degrees of freedom (df), so we would need a separate, full table of cumulative
or tail areas for each possible df, which is impractical. Instead, a t-table
typically stores, for each df, only the critical values corresponding to a few
common tail areas α = 0.10, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005, etc. A
p-value read from such a table is therefore usually bracketed between two
tabulated tail areas.

Example 10.3. * Carbon nanofibers have potential application as heat man-
agement materials, for composite reinforcement, and as components for nano-
electronics and photonics. The accompanying data on failure stress (MPa) of
fiber specimens is

300, 580, 312, 589, 327, 626, 368, 637, 400, 690, 425, 715, 470, 757, 556,
891, 573, 900, 575

with summary statistics n = 19, x̄ = 562.68, s = 180.874, s/√n = 41.495. We
are to determine whether the data provides compelling evidence for concluding
that the true average failure stress exceeds 500 MPa.

Hypotheses. Let µ denote the true average failure stress (MPa). The hypotheses
are

H0 : µ = 500 (null hypothesis)
Ha : µ > 500 (alternative hypothesis; right-tailed test)

Test Statistic. Since the population standard deviation is unknown and the
sample size is small (n = 19), we use the t-test

\[
t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{562.68 - 500}{180.874/\sqrt{19}} = \frac{62.68}{41.495} \approx 1.51
\]

Degrees of Freedom. df = n − 1 = 19 − 1 = 18

P-Value. This is a one-tailed (right-tailed) test. Using a t-distribution table

P-value = P(T18 ≥ 1.51) ≈ 0.075

At the α = 0.05 significance level, since P-value = 0.075 > 0.05, we fail to
reject H0. There is insufficient statistical evidence to conclude that the true
average failure stress exceeds 500 MPa. Note: If the significance level had
been higher (e.g., α = 0.10), the null hypothesis would have been rejected,
suggesting weak evidence in favor of Ha.
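
The example can be re-computed from the raw data without a t-table. The sketch below is one possible approach under stated assumptions: since the Python standard library has no t-distribution CDF, the tail area is obtained by integrating the t density numerically with Simpson's rule. The helper names `t_pdf` and `t_upper_tail` and the step count are choices made here for illustration.

```python
import math
import statistics

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0: Simpson's rule on [0, t], then symmetry about 0."""
    h = t / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

stress = [300, 580, 312, 589, 327, 626, 368, 637, 400, 690,
          425, 715, 470, 757, 556, 891, 573, 900, 575]
n = len(stress)
xbar = statistics.mean(stress)
s = statistics.stdev(stress)           # sample standard deviation (divisor n - 1)
t_obs = (xbar - 500) / (s / math.sqrt(n))
p_value = t_upper_tail(t_obs, n - 1)

print(round(xbar, 2), round(s, 3), round(t_obs, 2), round(p_value, 3))
```

The numerically integrated tail area can differ slightly from a value read off a coarse t-table, but it leads to the same decision at α = 0.05.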

10.7 Errors in Hypothesis Testing


Choosing a significance level α involves balancing the risk of making two types
of errors in decision-making. Let us try to understand the different implications
using the judicial analogy where the null hypothesis H0 : the person is innocent.
Decision                                 Truth: H0 is True (innocent)         Truth: H0 is False (guilty)
Reject H0 (Declare guilty)               Type I Error (Convict innocent)      Correct Decision (Convict guilty)
                                         Probability = α
Do Not Reject H0 (Declare not guilty)    Correct Decision                     Type II Error (Let guilty go free)
                                         (Let innocent go free)               Probability = β

A Type I error occurs when we reject the null hypothesis H0 although it is
actually true. Formally, P(Type I error) = P(Reject H0 | H0 true). Suppose we
are testing H0 : µ = µ0. If H0 is true, the test statistic has a known
distribution

\[
Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \quad (\text{if } \sigma \text{ known}),
\]

or

\[
T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad (\text{if } \sigma \text{ unknown}).
\]

We reject H0 when the test statistic is “too extreme.” For example, in a
right-tailed Z-test, Ha : µ > µ0, at level α = 0.05, we reject if
z_obs > z_0.95, where z_0.95 ≈ 1.645 is the 95th percentile of the standard
normal distribution. By construction,

P(Z > z_0.95 | H0 true) = 0.05.

The probability of a Type I error is therefore

P(Reject H0 | H0 true) = P(Z > z_0.95 | H0 true) = 0.05.

In general, the rejection region is chosen so that

P(Test statistic in rejection region | H0 true) = α.

Thus, P(Type I error) = α, fixed by design.
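
A small simulation can make this concrete: if we repeatedly draw samples from a population for which H0 is true and apply the right-tailed z-test each time, the fraction of rejections should be close to α. The sketch below is illustrative only; the population values (µ0 = 100, σ = 10), the sample size, the number of trials and the seed are assumptions, not values from the text.

```python
import math
import random

def type_one_error_rate(mu0, sigma, n, trials, z_crit=1.645, seed=0):
    """Fraction of samples drawn under H0 for which a right-tailed z test rejects."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Draw a sample from N(mu0, sigma^2), i.e. H0 is true by construction
        sample_mean = sum(rng.gauss(mu0, sigma) for _ in range(n)) / n
        z = (sample_mean - mu0) / (sigma / math.sqrt(n))
        if z > z_crit:                 # rejection region for alpha = 0.05
            rejections += 1
    return rejections / trials

rate = type_one_error_rate(mu0=100.0, sigma=10.0, n=20, trials=10000)
print(rate)   # expected to land near 0.05, up to simulation noise
```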


Recall that H0 specifies a single parameter value (e.g., θ = θ0). That makes
the distribution of the test statistic under H0 fixed, so the probability of
rejecting it (Type I error) is uniquely defined. Contrast this with Ha, which
allows many possible parameter values, so the Type II error probability
depends on which alternative is true. A Type II error occurs when the null
hypothesis H0 is false (that is, some alternative under Ha is true), but we
fail to reject H0. Formally,

P(Type II error) = P(Fail to reject H0 | Ha true).

This probability is denoted by β.


Under Ha , the parameter can take many possible values (for example, µ > µ0
could mean µ = µ0 + 1, or µ = µ0 + 5, etc.). Each such value leads to a different
distribution of the test statistic. Therefore, the Type II error probability is not
a single number, but a function β(θ) that depends on the true parameter value
under Ha .
Suppose we test H0 : µ = 100 with Ha : µ > 100 at significance level
α = 0.05. The rejection region is Z > 1.645. If the true mean is µ = 105,
the distribution of the test statistic shifts to the right, and the probability of
rejecting H0 is high (small β). If the true mean is µ = 101, the shift is small,
so the probability of failing to reject H0 is larger (large β).
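
To put numbers on this comparison we need σ and n, which the example does not specify. The sketch below therefore assumes σ = 10 and n = 25 (so the standard error is 2); these values and the function names are purely illustrative. Under a true mean µ, the statistic Z is normal with mean (µ − µ0)/(σ/√n) and unit variance, so β(µ) = Φ(1.645 − (µ − µ0)/(σ/√n)).

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def beta(mu_true, mu0=100.0, sigma=10.0, n=25, z_crit=1.645):
    """P(fail to reject H0) for the right-tailed z test when the true mean is mu_true.

    sigma and n are assumed values for illustration; they are not given in the text.
    """
    shift = (mu_true - mu0) / (sigma / math.sqrt(n))   # mean of Z under mu_true
    return phi(z_crit - shift)

for mu in (101, 103, 105):
    print(mu, round(beta(mu), 3))
```

With these assumed values, β is large when the true mean is close to 100 and drops as the true mean moves away from it, as described above.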
It is not realistic to demand error-free hypothesis testing procedures.
Instead, a good test procedure is one in which both types of errors are
unlikely, i.e., both α = P(reject H0 | H0 true) and
β = P(fail to reject H0 | H0 false) are small. Thus, a well-designed
hypothesis test seeks to keep both α and β small, though a trade-off exists:
for a fixed sample size, lowering α raises β. In practice, this involves
choosing an appropriate significance level and ensuring an adequate sample
size.

10.8 Chi-Squared Test
Traditional hypothesis tests focus on numerical/continuous outcomes, e.g.,
testing whether a sample mean differs from a hypothesized population mean. They
rely on assumptions like normality and require quantitative measurements. Some
datasets, however, are categorical, e.g., blood type, education level, survey
responses. Categorical data cannot be averaged. For example, blood types O, A,
B, AB have no meaningful arithmetic mean, so the mean-based testing methods
are not applicable here. Consider a sample of 10 people with blood types

Person 1: O, Person 2: A, Person 3: B, Person 4: AB, Person 5: O,
Person 6: B, Person 7: A, Person 8: O, Person 9: AB, Person 10: B

Blood types are labels, not numbers. There is no natural numerical ordering or
distance between O, A, B, and AB. It doesn’t make sense to ask “What is the
average of O, A, B, AB?”. There is no meaningful answer. Similarly, computing
a standard deviation or variance makes no sense, because variance requires a
numeric scale. On the contrary, for height measurements 160, 170, 175, 180 cm,
we can compute the mean, standard deviation, and apply t-tests.
Instead of working with numerical measurements, chi-squared tests work
with observed counts in categories. They allow us to test whether the
distribution of counts matches the expected distribution under a given
hypothesis. The idea is to compare observed frequencies with expected
frequencies. Large deviations between the observed counts O_i and the expected
counts E_i indicate that the null hypothesis does not fit the data, even though
we can’t apply a numerical mean-based test.
In the chi-squared goodness-of-fit test, each observation falls into one
of a finite number of categories (e.g., blood type: O, A, B, AB), and the null
hypothesis specifies fixed probabilities for each category, such as
H0 : p_O = 0.45, p_A = 0.35, p_B = 0.15, p_AB = 0.05. The test statistic

\[
\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}
\]

based on the difference between observed and expected counts, approximately
follows a chi-squared distribution under H0.
In more complex situations, the goodness-of-fit test for composite
hypotheses applies when the category probabilities depend on one or more
unknown parameters, such as a probability u or a mean µ (e.g., p1 = µ²,
p2 = 2µ(1 − µ), p3 = (1 − µ)²). These parameters are estimated from the data
before conducting the test. This is used to test whether data fit a specific
family of distributions, like Poisson (estimate λ) or normal (estimate µ, σ).
Chi-squared methods are also used with contingency tables, which involve two
categorical variables. In the test of homogeneity, the goal is to compare
distributions across different populations, for example “do 3 hospitals have
the same distribution of patient types?” The test of independence, on the other
hand, examines whether two categorical variables are independent within a
single population, for example “is religion independent of political
affiliation?” All these tests rely on comparing observed and expected
frequencies and assessing whether a deviation can be attributed to chance or
indicates a real effect.

10.8.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified
A multinomial experiment generalizes a binomial experiment by allowing each
trial to result in one of k possible outcomes or categories, where k > 2. For
instance, if a store accepts three types of credit cards, observing which card
type is used by each of the next n customers forms a multinomial experiment.
The probability that a trial results in category i is denoted by p_i, and the
null hypothesis H0 specifies the values of all p_i’s, such as H0 : p1 = 0.5,
p2 = 0.3, p3 = 0.2. The alternative hypothesis Ha asserts that at least one of
the p_i differs from its null value (in which case at least two must differ,
since they sum to 1). The symbol p_i0 represents the value of p_i claimed by
the null hypothesis. In the example just given, p10 = 0.5, p20 = 0.3, and
p30 = 0.2.
Let the random variable N_i represent the number of observations falling into
category i, and let n_i be its observed value. Since each trial results in
exactly one outcome, ∑ N_i = n, and similarly ∑ n_i = n. As an example, an
experiment with n = 100 and k = 3 might yield n1 = 46, n2 = 35, and n3 = 19.
Under H0, the expected number of observations in category i is E(N_i) =
n p_i0. For example, with n = 100 and H0 : p1 = 0.5, p2 = 0.3, p3 = 0.2, the
expected counts are E(N1) = 100 × 0.5 = 50, E(N2) = 100 × 0.3 = 30, and
E(N3) = 100 × 0.2 = 20. The observed (n_i) and expected (E(N_i)) frequencies
are often displayed in a table as in Table 4; Table 5 shows the counts for
this example. The chi-squared goodness-of-fit test is used to assess whether
the observed frequencies deviate significantly from the expected frequencies
under the null hypothesis. The “goodness-of-fit” test is so named because it
tests how well the observed data fit a theoretical probability model.

Table 4: Observed and Expected Cell Counts

Category    i = 1    i = 2    ···    i = k    Row total
Observed    n1       n2       ···    nk       n
Expected    np10     np20     ···    npk0     n

Table 5: Observed and Expected Cell Counts for the Example

Category    i = 1    i = 2    i = 3    Row total
Observed    46       35       19       100
Expected    50       30       20       100

Chi-square test. Under the probability distribution assumed by H0, we can
calculate what counts we should see (expected counts). But the only data we
have are the observed counts. It is like asking “if the theory says 45%
of people have blood type O and we observe 80 out of 200 (40%) with type
O, is that difference just random sampling variation, or is it too large to be
due to chance, suggesting that our assumption about the distribution is
incorrect?” This is exactly the hypothesis testing problem, but for categorical
data instead of numerical data. The chi-squared test measures that discrepancy
quantitatively.
A naive approach is to consider the squared deviations

\[
(n_1 - np_{10})^2,\ (n_2 - np_{20})^2,\ \ldots,\ (n_k - np_{k0})^2
\]

These can be summed into an overall measure

\[
\sum_{i=1}^{k} (n_i - np_{i0})^2
\]

However, this unadjusted sum can be misleading. For instance, if np10 = 100
and np20 = 10, and the observed values are n1 = 95 and n2 = 5, both terms
contribute equally to the sum

\[
(95 - 100)^2 = 25, \quad (5 - 10)^2 = 25
\]

Yet, relatively speaking, n1 is only 5% less than expected, while n2 is 50%
less. To account for this imbalance, each squared deviation is divided by its
corresponding expected count. This leads to the test statistic for the chi-square
goodness-of-fit test

\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}}
\]
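
As a quick check of the formula, the sketch below computes the expected counts and the statistic for the Table 5 counts (observed 46, 35, 19 under H0 : p1 = 0.5, p2 = 0.3, p3 = 0.2). The function name is an assumption made for illustration.

```python
def chi_squared_statistic(observed, null_probs):
    """Return (expected counts, chi-squared statistic) for a goodness-of-fit test."""
    n = sum(observed)
    expected = [n * p for p in null_probs]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return expected, stat

# Counts and null probabilities from the running example (Table 5)
expected, stat = chi_squared_statistic([46, 35, 19], [0.5, 0.3, 0.2])
print(expected, round(stat, 3))
```

Each term is (n_i − np_i0)²/np_i0, here 16/50, 25/30 and 1/20, so a deviation of the same absolute size counts for more in a category with a small expected count.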
Provided the expected counts are not too small (a common rule of thumb is
np_i0 ≥ 5 for every category), this statistic has approximately a chi-squared
distribution under H0. The chi-squared distribution has a single parameter ν,
called the degrees of freedom (df), which takes positive integer values
1, 2, 3, .... If a random variable Y follows a chi-squared distribution with ν
degrees of freedom, i.e., Y ∼ χ²(ν), then the expected value and variance of Y
are given by

E(Y) = ν and Var(Y) = 2ν

The χ² density curve is positively skewed. However, as the degrees of freedom
ν increase, the distribution becomes more symmetric and spreads out further to
the right, as illustrated by the typical chi-squared density plots in Fig. 26.

Figure 26: Chi-squared distribution with varying degrees of freedom

For the goodness-of-fit test, the degrees of freedom (refer Sec. 9.4.1 for the
definition) equal k − 1, where k is the number of categories. This arises from
the constraint that the total number of observations is fixed, i.e.,
∑ N_i = n. Although there are k observed cell counts, once any k − 1 of them
are known, the remaining count is uniquely determined. Therefore, only k − 1
values are free to vary, giving k − 1 degrees of freedom.
Interpretation of the Chi-Squared Test Statistic. The chi-squared test
statistic is defined as

\[
\chi^2_{\text{obs}} = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}},
\]

where n_i are the observed cell frequencies and np_i0 are the expected cell
frequencies under the null hypothesis H0. A small value of χ²_obs indicates
that the observed frequencies are close to the expected ones, which is
consistent with H0. A large value of χ²_obs suggests a substantial discrepancy
between the observed and expected frequencies, providing evidence against H0.
We now want to see how likely it is to get that value or anything more extreme
under H0. We cannot just look at the observed statistic in isolation and take
a formal decision on H0, because the observed discrepancy could reasonably
occur just by random chance. We want to check whether the discrepancy reflects
a systematic feature of the population rather than sampling noise. For this we
compute the p-value, which gives a probabilistic measure of evidence against
H0, rather than a subjective “seems large” judgment.
We are only interested in whether χ² is significantly large. Hence the
chi-squared test is a right-tailed test. The p-value is the area under the
chi-squared distribution curve to the right of the observed χ²_obs value

\[
\text{p-value} = P(\chi^2_{\text{df}} \ge \chi^2_{\text{obs}})
\]

We can use a chi-squared table to look up this probability. Suppose the
observed chi-squared statistic is χ²_obs = 12.83 with df = 5. A typical
chi-squared table provides critical values for selected upper-tail
probabilities

df | p    0.10    0.05     0.025    0.01     0.001
5         9.24    11.07    12.83    15.09    20.52

Here, χ²_obs = 12.83 falls exactly at the column for upper-tail probability
0.025, so the p-value is 0.025. If the significance level is α = 0.05, then
p < α, so we reject H0. If χ²_obs falls between two critical values, the
p-value lies between the corresponding tail probabilities, and one can
interpolate to obtain a rough value. For exact p-values, we may use software
tools.
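
As one possible way to obtain such a p-value without a table, the sketch below evaluates the upper-tail area for integer degrees of freedom using only the Python standard library. It relies on the recurrence Q(a + 1, y) = Q(a, y) + y^a e^(−y)/Γ(a + 1) for the regularized upper incomplete gamma function, started from Q(1, y) = e^(−y) for even df or Q(1/2, y) = erfc(√y) for odd df. The function name `chi2_upper_tail` is an assumption; statistical packages provide equivalent routines.

```python
import math

def chi2_upper_tail(x, df):
    """P(chi-squared with integer df >= x), for x >= 0."""
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)               # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))    # Q(1/2, y) = erfc(sqrt(y))
    while a < df / 2.0:                        # climb from a to df / 2 in unit steps
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Table row for df = 5 used in the text
for critical_value in (9.24, 11.07, 12.83, 15.09, 20.52):
    print(critical_value, round(chi2_upper_tail(critical_value, 5), 4))
```

Evaluating the function at the tabulated critical values should return tail areas close to the column headings 0.10, 0.05, 0.025, 0.01 and 0.001.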

Example 10.4. * Consider a genetic experiment involving a dihybrid cross
between pure strains with genotypes AABB and aabb, producing offspring with
genotype AaBb. When these first-generation organisms are crossed among
themselves, Mendel’s laws imply that four phenotypes should arise with
probabilities

\[
p_1 = \frac{9}{16}, \quad p_2 = \frac{3}{16}, \quad p_3 = \frac{3}{16}, \quad p_4 = \frac{1}{16}
\]

This corresponds to the null hypothesis

\[
H_0 : p_1 = \frac{9}{16},\ p_2 = \frac{3}{16},\ p_3 = \frac{3}{16},\ p_4 = \frac{1}{16}
\]

The observed data from a dihybrid cross in tomatoes (total n = 1611) are

i    Phenotype             Observed Count n_i    Expected Count np_i0
1    Tall, cut leaf        926                   906.2
2    Tall, potato leaf     288                   302.1
3    Dwarf, cut leaf       293                   302.1
4    Dwarf, potato leaf    104                   100.7

We compute the chi-squared statistic

\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{4} \frac{(n_i - np_{i0})^2}{np_{i0}} \\
       &= \frac{(926 - 906.2)^2}{906.2} + \frac{(288 - 302.1)^2}{302.1} + \frac{(293 - 302.1)^2}{302.1} + \frac{(104 - 100.7)^2}{100.7} \\
       &= 0.433 + 0.658 + 0.274 + 0.108 \\
       &= 1.473
\end{aligned}
\]

Degrees of freedom df = k − 1 = 4 − 1 = 3. From the chi-square table, the
critical value χ²_{0.10,3} for 3 degrees of freedom and upper-tail area 0.10 is
6.251. That is, the area to the right of 6.251 under the chi-squared curve is
0.10 (P(χ² > 6.251) = 0.10). The upper-tail area decreases as the cutoff
increases, and the observed test statistic χ²_obs = 1.473 is smaller than the
critical value, so the p-value is larger than 0.10

P(χ² > 1.473) > P(χ² > 6.251) = 0.10

Since the p-value is large, we fail to reject H0. The observed frequencies are
quite consistent with Mendel’s laws of inheritance.
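
The arithmetic of this example can be re-checked with a few lines of Python. This sketch uses unrounded expected counts, so the statistic differs from 1.473 in the third decimal place; the variable names are illustrative, and the critical value 6.251 is the one quoted from the table above.

```python
from fractions import Fraction

observed = [926, 288, 293, 104]                      # tomato phenotype counts
null_probs = [Fraction(9, 16), Fraction(3, 16), Fraction(3, 16), Fraction(1, 16)]

n = sum(observed)
expected = [float(n * p) for p in null_probs]        # unrounded expected counts
chi2_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 6.251   # chi-squared table value for df = 3, upper-tail area 0.10
print([round(e, 1) for e in expected], round(chi2_obs, 3))
print("reject H0 at the 0.10 level:", chi2_obs > critical_value)
```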

10.8.2 Goodness-of-Fit Test When the p_i’s Are Functions of Other Parameters
Suppose the category probabilities p_i are not specified directly but are
assumed to depend on a smaller set of parameters θ1, θ2, ..., θm (where
m < k). A specific hypothesis about these parameters then determines explicit
values p_i0. These p_i0 are subsequently used in the chi-squared test statistic

\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}}.
\]

Example 10.5. * In a genetics experiment by G. U. Yule (1923), 269 four-seed
pea pods were classified based on the number of yellow-round (YR) seeds.
The dominant traits were yellow (Y) and round (R), and the double dominant
outcome (YR) had a probability of u = 9/16. Let X be the number of YR seeds
in a pod. Then under the null hypothesis H0, the distribution of X is binomial

\[
X \sim \mathrm{Bin}\left(n = 4,\ u = \frac{9}{16}\right)
\]

We wish to test

H0 : p1 = p10, p2 = p20, ..., p5 = p50

where

\[
p_{i0} = P(i - 1 \text{ YR seeds among 4})
       = \binom{4}{i-1} u^{\,i-1} (1 - u)^{4-(i-1)}
       = \binom{4}{i-1} \left(\frac{9}{16}\right)^{i-1} \left(\frac{7}{16}\right)^{4-(i-1)},
\quad i = 1, \ldots, 5.
\]

Interpretation. In this genetics experiment, each seed is a Bernoulli trial
with success for a yellow-round (YR) seed and failure otherwise (refer
Sec. 4.1). But we are not looking at just one seed. We are looking at a pod of
4 seeds. The variable X is “Number of YR seeds in one pod”. The possible values
of X are {0, 1, 2, 3, 4}. These 5 possible counts (0 to 4) are the categories
for the chi-squared test, but they come from a binomial process. The chi-square
test checks whether the observed frequency of each count matches the binomial
prediction, thereby assessing how well the binomial distribution fits the
observed data.
Coming back to the question, the observed and expected counts are

i    YR seeds in pod    Observed (n_i)    Expected (np_i0)
1    0                  16                9.86
2    1                  45                50.68
3    2                  100               97.75
4    3                  82                83.78
5    4                  26                26.93

Chi-Square Test Statistic:

\[
\chi^2 = \sum_{i=1}^{5} \frac{(n_i - np_{i0})^2}{np_{i0}} = 3.823 + 0.637 + 0.052 + 0.038 + 0.032 = 4.582
\]

With df = k − 1 = 5 − 1 = 4, we check the critical value from a chi-squared
table. Since

4.582 = χ²_obs < χ²_{0.10,4} = 7.78

the p-value for χ²_obs exceeds 0.10, so we fail to reject H0 at any of the
usual significance levels. There is no compelling evidence against the
hypothesis that the number of YR seeds in a pod follows a binomial distribution
with u = 9/16. The observed frequencies are consistent with Mendelian
inheritance.
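
The binomial category probabilities and expected counts in this example can be generated rather than tabulated by hand. The sketch below uses `math.comb` from the Python standard library (available from Python 3.8); the variable names are illustrative, and because the expected counts are not rounded the statistic comes out marginally different from 4.582.

```python
import math

observed = [16, 45, 100, 82, 26]       # pods with 0, 1, 2, 3, 4 YR seeds
u = 9 / 16                             # P(seed is yellow-round) under H0
pods = sum(observed)

# Category probabilities p_i0 = P(X = x) for X ~ Bin(4, u), x = 0, ..., 4
probs = [math.comb(4, x) * u ** x * (1 - u) ** (4 - x) for x in range(5)]
expected = [pods * p for p in probs]
chi2_obs = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

critical_value = 7.78   # chi-squared table value for df = 4, upper-tail area 0.10
print([round(e, 2) for e in expected], round(chi2_obs, 3))
print("reject H0 at the 0.10 level:", chi2_obs > critical_value)
```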

10.8.3 Two-Way Contingency Tables


In the previous section, the observed frequencies were displayed in a single
row within a rectangular table. We now examine two scenarios where data are
presented in a table with I rows and J columns, creating I × J cells. This is
called a two-way contingency table.
1. Comparing multiple populations. We have I different groups (e.g.,
three store chains), each classified into J categories (e.g., five payment
methods like cash, check, AmEx, Visa, MasterCard). We sample from
each group and fill the counts into the corresponding row. Here we have
several distinct groups (populations), and within each group, we classify
individuals according to a single categorical variable.
For example, we wish to test whether the distribution of payment methods
differs across three retail chains (A, B, C). Each customer transaction is
classified by payment type: Cash, UPI, Credit, or Debit. A random
sample of transactions is summarized below

Store    Cash    UPI    Credit    Debit
A        45      30     15        10
B        25      40     20        15
C        20      25     35        20

Each row corresponds to a population (store), and each column represents
a categorical outcome (payment method). Each entry in the table is a count
n_ij (frequency) of individuals (or items) that fall into a specific
combination of row and column categories.
We can ask “Are the distributions of payment method the same across
the three stores?” This leads to a Chi-Square Test of Homogeneity, where
we compare whether the same categorical distribution holds for different
populations.

H0 : Distribution of payment methods is identical across stores.

Ha : At least one store differs in its payment method distribution.

2. Single population with two factors. We have one population (e.g.,
shoppers), and we classify each individual by two categorical variables
(e.g., department they visited and payment method). Each individual is
counted in the cell at row i, column j, and n_ij denotes the resulting cell
count.
For example, a supermarket manager wants to know whether the department
visited and the payment method used by customers are related. A
random sample of 360 shoppers is classified as follows

Department     Cash    UPI    Credit    Debit
Grocery        40      60     30        20
Electronics    15      25     35        25
Clothing       25      35     30        20

Here we have a single population (shoppers in a supermarket), but each
individual is classified by two categorical factors: Department (3 levels)
and Payment Method (4 levels). The resulting 3 × 4 table is a two-way
contingency table.
We can apply the Chi-Square Test of Independence to check whether the
department visited and the payment method are statistically independent.

H0 : Department and payment method are independent.

Ha : Department and payment method are associated.

Chi-Square Test for Homogeneity. We consider data arranged in an I × J
contingency table, where n_ij is the count in cell (i, j), with i = 1, ..., I
and j = 1, ..., J. The row totals are

\[
n_{i\cdot} = \sum_{j=1}^{J} n_{ij}, \quad \text{and the column totals are} \quad n_{\cdot j} = \sum_{i=1}^{I} n_{ij}.
\]

Sample Data: I × J Contingency Table with Row and Column Totals

Population          C1     C2     ···    CJ     Row Total n_i·
1                   n11    n12    ···    n1J    n1·
2                   n21    n22    ···    n2J    n2·
⋮                   ⋮      ⋮             ⋮      ⋮
I                   nI1    nI2    ···    nIJ    nI·
Column Total n·j    n·1    n·2    ···    n·J    n

Under the null hypothesis of homogeneity, each of the I populations has the
same distribution across the J categories; that is, p_1j = p_2j = ··· = p_Ij = p_j
for every category j

Population    C1     C2     ···    CJ     Row Total
1             p11    p12    ···    p1J    1
2             p21    p22    ···    p2J    1
⋮             ⋮      ⋮             ⋮      ⋮
I             pI1    pI2    ···    pIJ    1
Under H0      p1     p2     ···    pJ     1

Since p_j is the common category proportion, the expected count in cell (i, j)
is E(n_ij) = n_i· p_j

Expected Counts under Homogeneity: E(n_ij) = n_i· p_j

Population          C1        C2        ···    CJ        Row Total n_i·
1                   n1· p1    n1· p2    ···    n1· pJ    n1·
2                   n2· p1    n2· p2    ···    n2· pJ    n2·
⋮                   ⋮         ⋮                ⋮         ⋮
I                   nI· p1    nI· p2    ···    nI· pJ    nI·
Column Total n·j    np1       np2       ···    npJ       n

If the true population probabilities for each category are known, then the
expected count in a cell can be computed directly. But in practice, the true
category proportions $p_j$ are almost never known. Under the null hypothesis of
homogeneity, we claim that each population has the same proportions
$p_1, \dots, p_J$, but we do not know their actual values. Instead we have the
observed row totals $n_{i\cdot}$ and column totals $n_{\cdot j}$. We therefore
estimate $p_j$ from the data itself:
\[
\hat{p}_j = \frac{n_{\cdot j}}{n},
\]
where $n_{\cdot j}$ is the total count in category $j$ across all populations,
and $n$ is the grand total. Then the estimated expected count is
\[
\hat{E}(n_{ij}) = n_{i\cdot}\, \hat{p}_j = n_{i\cdot}\, \frac{n_{\cdot j}}{n}
                = \frac{n_{i\cdot} \times n_{\cdot j}}{n}.
\]

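This gives a data-driven estimate of what the count would be under $H_0$.
We compute the chi-square test statistic as
\[
\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}},
\]
where $\hat{E}_{ij} = \hat{E}(n_{ij})$. Under $H_0$ this statistic follows
approximately a chi-square distribution with $df = (I - 1)(J - 1)$ degrees of
freedom.
Decision rule. We compare the observed value $\chi^2_{\text{obs}}$ with the
critical value $\chi^2_{\alpha, df}$, or equivalently compute the p-value
\[
P(\chi^2_{df} \ge \chi^2_{\text{obs}}),
\]
where $\chi^2_{df}$ denotes a chi-square random variable with $df$ degrees of
freedom. If the p-value is less than $\alpha$ (equivalently, if
$\chi^2_{\text{obs}} > \chi^2_{\alpha, df}$), reject $H_0$; otherwise, fail to
reject.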
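The steps above (marginal totals, estimated expected counts, statistic, degrees of freedom) can be sketched in a few lines of Python. This is a minimal illustration using only built-in functions; the function name `chi_square_table` and the 2 × 2 toy table are hypothetical and are not part of the notes. The sketch assumes every row total and column total is positive, so that no expected count is zero.

```python
def chi_square_table(observed):
    """Return (expected, statistic, df) for an I x J table of counts.

    `observed` is a list of rows; all row and column totals are assumed
    to be positive so that every expected count is non-zero.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    # Estimated expected count: (row total) * (column total) / (grand total).
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Sum of (observed - expected)^2 / expected over all cells.
    statistic = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, statistic, df


# Hypothetical 2 x 2 table, used only to illustrate the call.
toy = [[10, 20], [20, 10]]
expected, statistic, df = chi_square_table(toy)
print(expected)   # every expected count is 30 * 30 / 60 = 15.0
print(statistic)  # 4 * (5**2 / 15) = 6.666...
print(df)         # (2 - 1) * (2 - 1) = 1
```

The same function serves both the homogeneity and the independence test, since the two tests use the same expected-count formula and the same statistic; only the sampling scheme and the hypotheses differ.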

Example 10.6. * A company samples nonconforming cans from three production
lines and classifies each defect into one of five categories: Blemish, Crack,
Location, Missing, and Other. The data are arranged in a 3 × 5 contingency
table:
Line    Blemish   Crack   Location   Missing   Other   Total
1            31      68         17        21      13     150
2            19      47         30        19      10     125
3            33      26         16        14      11     100
Total        83     141         63        54      34     375

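We wish to test
\[
H_0 : p_{1j} = p_{2j} = p_{3j} \quad \text{for all } j \in
\{\text{Blemish, Crack, Location, Missing, Other}\}
\]
versus
\[
H_a : p_{ij} \ne p_{i'j} \ \text{for at least one category } j
\text{ and one pair of lines } i \ne i'.
\]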


$H_0$ states that the production lines are homogeneous with respect to the five
nonconformance categories; $H_a$ states that they are not. Under $H_0$, the
estimated expected count for each cell is
\[
\hat{E}_{ij} = \frac{n_{i\cdot} \times n_{\cdot j}}{n}.
\]
For example, for Line 1 and Blemish,
\[
\hat{E}_{11} = \frac{150 \cdot 83}{375} = 33.20.
\]
The other expected counts and the chi-square contributions are calculated in a
similar manner and are shown in Table 6.
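As a worked instance of one such contribution, the Line 1, Blemish cell (observed count 31, estimated expected count 33.20) gives:

```latex
\[
\frac{(n_{11} - \hat{E}_{11})^2}{\hat{E}_{11}}
  = \frac{(31 - 33.20)^2}{33.20}
  = \frac{4.84}{33.20}
  \approx 0.146
\]
```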
* Taken from Probability and Statistics for Engineering and the Sciences by Jay
Devore, 9th edition.

Table 6: Contingency Table with Expected Counts and Chi-Square Contributions
(each cell lists the observed count, the expected count in parentheses, and the
chi-square contribution)

                     Reason for Nonconformance
Line    Blemish    Crack      Location   Missing    Other      Total
1       31         68         17         21         13         150
        (33.20)    (56.40)    (25.20)    (21.60)    (13.60)
        0.146      2.386      2.668      0.017      0.026
2       19         47         30         19         10         125
        (27.67)    (47.00)    (21.00)    (18.00)    (11.33)
        2.715      0.000      3.857      0.056      0.157
3       33         26         16         14         11         100
        (22.13)    (37.60)    (16.80)    (14.40)    (9.07)
        5.335      3.579      0.038      0.011      0.412
Total   83         141        63         54         34         375

The chi-square statistic is
\[
\chi^2 = \sum_{i=1}^{3} \sum_{j=1}^{5} \frac{(n_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}} = 21.403.
\]
The degrees of freedom are $df = (3 - 1) \times (5 - 1) = 8$. From chi-square
tables,
\[
\chi^2_{0.010,\,8} = 20.090 < 21.403 < \chi^2_{0.005,\,8} = 21.955,
\]
hence the p-value satisfies $0.005 < p < 0.010$, and a numerical computation
gives $p \approx 0.006$. At $\alpha = 0.01$, we reject $H_0$: the distribution
of defect types is not homogeneous across the three lines.
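The numbers in this example can be checked with a short script. The sketch below uses only the Python standard library; the helper `chi2_sf_even_df` is a hypothetical name introduced here, and it relies on the closed form $P(\chi^2_{2m} \ge x) = e^{-x/2} \sum_{k=0}^{m-1} (x/2)^k / k!$, which holds only when the degrees of freedom are even (here $df = 8$). For odd degrees of freedom a statistics library would be the more practical choice.

```python
from math import exp, factorial

# Observed counts from Example 10.6 (rows: lines 1-3; columns: Blemish,
# Crack, Location, Missing, Other).
observed = [
    [31, 68, 17, 21, 13],
    [19, 47, 30, 19, 10],
    [33, 26, 16, 14, 11],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Estimated expected counts: (row total) * (column total) / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]
statistic = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)


def chi2_sf_even_df(x, df):
    """Right-tail probability P(X >= x) for a chi-square X with EVEN df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only holds for even, positive df")
    half = x / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


p_value = chi2_sf_even_df(statistic, df)
print(f"E_11 = {expected[0][0]:.2f}")
print(f"chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.3f}")
```

Run as written, this should reproduce the values reported in the example: an expected count of 33.20 for Line 1 and Blemish, a statistic of about 21.403 on 8 degrees of freedom, and a p-value of about 0.006.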



Chi-Square Test for Independence. Suppose we have a single population
of interest where each individual is classified according to
1. Factor 1 with $I$ categories (e.g., supermarket department)
2. Factor 2 with $J$ categories (e.g., payment method)
We take a random sample of size $n$ and let $n_{ij}$ be the count of individuals
who fall into category $i$ of Factor 1 and category $j$ of Factor 2. These
counts form an $I \times J$ contingency table, with
\[
n_{i\cdot} = \sum_{j=1}^{J} n_{ij}, \qquad
n_{\cdot j} = \sum_{i=1}^{I} n_{ij}, \qquad \text{and total} \quad
n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij}.
\]

We define $p_{ij} = P(\text{individual falls in category } i \text{ of factor 1
and category } j \text{ of factor 2})$, with marginal probabilities
\[
p_{i\cdot} = \sum_{j} p_{ij}, \qquad p_{\cdot j} = \sum_{i} p_{ij}.
\]
Here $p_{i\cdot}$ is the probability that a randomly selected individual falls
in category $i$ of factor 1, and $p_{\cdot j}$ is the probability that a
randomly selected individual falls in category $j$ of factor 2. The null
hypothesis of independence states that an individual's category with respect to
factor 1 is independent of the category with respect to factor 2:
\[
H_0 : p_{ij} = p_{i\cdot} \times p_{\cdot j} \quad \text{for all } i, j.
\]
The expected count in cell $(i, j)$ under $H_0$ is
\[
E(n_{ij}) = n\, p_{ij} = n\, p_{i\cdot}\, p_{\cdot j}.
\]

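We estimate the marginal probabilities using sample proportions,
\[
\hat{p}_{i\cdot} = \frac{n_{i\cdot}}{n} \quad
\text{(sample proportion for category } i \text{ of factor 1)},
\]
\[
\hat{p}_{\cdot j} = \frac{n_{\cdot j}}{n} \quad
\text{(sample proportion for category } j \text{ of factor 2)},
\]
which yields
\[
\hat{E}(n_{ij}) = n\, \hat{p}_{ij} = n\, \hat{p}_{i\cdot}\, \hat{p}_{\cdot j}
= n \cdot \frac{n_{i\cdot}}{n} \cdot \frac{n_{\cdot j}}{n}
= \frac{n_{i\cdot}\, n_{\cdot j}}{n}.
\]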
The chi-square test statistic is
\[
\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J}
\frac{\bigl(n_{ij} - \hat{E}(n_{ij})\bigr)^2}{\hat{E}(n_{ij})}.
\]
This statistic follows approximately a chi-square distribution with
$df = (I - 1)(J - 1)$ degrees of freedom. To test $H_0$, compare
$\chi^2_{\text{obs}}$ with the chi-square critical value or compute the p-value
\[
P = P\bigl(\chi^2_{df} \ge \chi^2_{\text{obs}}\bigr).
\]
A large $\chi^2_{\text{obs}}$ (or small p-value) suggests dependence between the
factors. A small value is consistent with independence.
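As an illustration, the procedure can be applied to the supermarket table from earlier in this section (3 departments × 4 payment methods, 360 shoppers). The notes do not work this table through, so the sketch below is only an assumed application of the formulas above. It estimates the marginal proportions first, mirroring the derivation, and uses the even-degrees-of-freedom closed form for the chi-square right tail (here $df = 6$). The significance level of 0.05 is chosen purely for illustration.

```python
from math import exp, factorial

# Supermarket sample (rows: Grocery, Electronics, Clothing;
# columns: Cash, UPI, Credit, Debit).
observed = [
    [40, 60, 30, 20],
    [15, 25, 35, 25],
    [25, 35, 30, 20],
]

n = sum(sum(row) for row in observed)
p_row = [sum(row) / n for row in observed]         # estimates of p_i.
p_col = [sum(col) / n for col in zip(*observed)]   # estimates of p_.j

# Under independence the estimated expected count is n * p_i. * p_.j
statistic = 0.0
for i, row in enumerate(observed):
    for j, n_ij in enumerate(row):
        e_ij = n * p_row[i] * p_col[j]
        statistic += (n_ij - e_ij) ** 2 / e_ij

df = (len(p_row) - 1) * (len(p_col) - 1)   # (3 - 1) * (4 - 1) = 6

# df is even here, so the chi-square right tail has the closed form
# exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
half = statistic / 2
p_value = exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))

alpha = 0.05  # significance level chosen for illustration
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"n = {n}, chi-square = {statistic:.3f}, df = {df}, p-value = {p_value:.4f}")
print(decision)
```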

Example 10.7. * A sample of n = 13 262 newborns is cross-classified according
to
• Row factor, I = 4 levels of paternal education: completed university,
partial university, secondary, partial secondary (denoted $E_1$ to $E_4$).
• Column factor, J = 4 quartiles of neonatal weight gain: $Q_1$ (lowest
25%), $Q_2$, $Q_3$, $Q_4$.
The observed counts and expected counts (under independence) are summarized
in Table 7. The goal is to check whether educational level is independent of
neonatal weight gain (NNWG) in the sampled population.

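Test statistic. Writing $O_{ij} = n_{ij}$ for the observed count and $E_{ij}$
for the estimated expected count in cell $(i, j)$, the $\chi^2$ statistic is
computed as
\[
\chi^2 = \sum_{i=1}^{4} \sum_{j=1}^{4} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
= 0.261 + 0.313 + \cdots + 0.253 = 19.016.
\]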
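As a worked check of the first term in this sum, take cell $(E_1, Q_1)$, using the row total 1698 and the column total 3215 listed in Table 7:

```latex
\[
E_{11} = \frac{n_{1\cdot}\, n_{\cdot 1}}{n}
       = \frac{1698 \times 3215}{13\,262} \approx 411.63,
\qquad
\frac{(O_{11} - E_{11})^2}{E_{11}}
       = \frac{(422 - 411.63)^2}{411.63} \approx 0.261
\]
```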

* Taken from Probability and Statistics for Engineering and the Sciences by Jay
Devore, 9th edition.

Table 7: Observed counts, expected counts (in parentheses), and $\chi^2$
contributions

Education   Q1           Q2           Q3           Q4           Total
E1          422          433          429          414          1698
            (411.63)     (444.79)     (422.64)     (418.93)
            0.261        0.313        0.096        0.058
E2          1493         1655         1556         1605         6309
            (1529.44)    (1652.65)    (1570.35)    (1556.56)
            0.868        0.003        0.131        1.508
E3          1239         1276         1243         1179         4937
            (1196.84)    (1293.25)    (1228.85)    (1218.06)
            1.485        0.230        0.163        1.252
E4          61           110          73           74           318
            (77.09)      (83.30)      (79.15)      (78.46)
            3.358        8.558        0.478        0.253
Total       3215         3474         3301         3272         13 262

Under the null hypothesis of independence, the degrees of freedom are
$df = (I - 1)(J - 1) = (4 - 1)(4 - 1) = 9$. The expected value of a chi-square
distribution with $df$ degrees of freedom is equal to its degrees of freedom
(refer Sec. 9.5). In this example, the expected value of $\chi^2_9$ is 9,
whereas the observed test statistic is 19.016. A naive comparison of the two
shows that the test statistic is well above what would be expected if the two
factors were independent. But this comparison alone is not enough to declare
the null hypothesis implausible. We decide this formally by comparing the test
statistic with the critical values corresponding to the chosen significance
level. For $df = 9$, the right-tail critical values are

Significance level (α)    Critical value $\chi^2_{\alpha, 9}$
0.05                      16.919
0.01                      21.666

For the observed test statistic $\chi^2_{\text{obs}} = 19.016$,
\[
16.919 < 19.016 < 21.666.
\]

Hence,
• We reject $H_0$ at α = 0.05: the evidence suggests dependence between
education level and weight gain quartile.
• We fail to reject $H_0$ at α = 0.01: the evidence of dependence is not strong
enough at the 1% level.
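The statistic and both decisions can be reproduced from the observed counts in Table 7. The following is a minimal sketch using only built-in Python. Because $df = 9$ is odd, it does not compute a p-value; it compares the statistic with the two critical values quoted above. The variable names are illustrative.

```python
# Observed counts from Table 7 (rows E1-E4, columns Q1-Q4).
observed = [
    [422, 433, 429, 414],
    [1493, 1655, 1556, 1605],
    [1239, 1276, 1243, 1179],
    [61, 110, 73, 74],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Chi-square statistic with expected counts (row total * column total / n).
statistic = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Right-tail critical values for df = 9, copied from the table in the text.
critical_values = {0.05: 16.919, 0.01: 21.666}

decisions = {
    alpha: "reject H0" if statistic > critical else "fail to reject H0"
    for alpha, critical in critical_values.items()
}
print(f"n = {n}, chi-square = {statistic:.3f}, df = {df}")
for alpha, decision in decisions.items():
    print(f"alpha = {alpha}: {decision}")
```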

