Probability and Statistics for CS Students
Rekha R
Assistant Professor
Department of Computer Engineering
Monsoon 2025
Abstract
This lecture note is based on the [Link] syllabus of the University of
Delhi for the course offered for Minors/Specializations by the Computer
Science and Engineering Department under the Faculty of Technology in
the third semester. The note is largely based on the textbook Probability
and Statistics for Engineering and the Sciences by Jay Devore, 9th edition.
This four-credit course has a lecture component of 3 credits and a practical
component of 1 credit. The evaluation scheme for the course includes these
ment (IA) contributes 30 marks. The total marks allocated to the theory
component (i.e., End Term Exam + IA) amount to 120 marks. The prac-
the End Term Practical Exam carries 20 marks and Viva voce carries 10
marks. The End Term Practical Exam and Viva voce shall be conducted
by an external examiner. Therefore, the grand total for the course is 160
marks.
For the IA of 30 marks, 6 marks shall be for attendance, 12 marks for
mid semester test and 12 marks for assignments/quiz/presentations (to
be announced). Six marks for attendance shall be distributed as follows:
Contents
1 Introduction
2 Probability Theory
  2.1 Sample Spaces and Events
  2.2 Axioms, Interpretations, and Properties of Probability
  2.3 Interpreting Probability
  2.4 Probability Properties
  2.5 Conditional Probability
3 Random Variables
  3.1 Discrete Random Variable
  3.2 Probability Distribution
  3.3 Expected Value
  3.4 Variance
  5.3 Variance
8 Statistics
  8.1 Point Estimation
  8.2 Unbiased Estimators
    8.2.1 Sample Mean
    8.2.2 Sample Variance
    8.2.3 Sample Standard Deviation
    8.2.4 Estimators with Minimum Variance
    Parameters
1 Introduction
A Conceptual Journey from Theory to Practice. As Artificial Intelligence
and Machine Learning (AIML) continue to reshape industries and redefine
what is possible with data, students preparing for this field must develop
a strong foundation in Probability and Statistics. These are not just academic
prerequisites: they are the language of uncertainty, the engine behind models,
and the bridge between data and decisions.
You might have observed that the course begins with ‘Probability’ before
‘Statistics’. The reason is that probability is foundational, while statistics
builds upon it. Probability begins with a known or assumed model and calculates
the likelihood of outcomes. Example: Given a fair die, what is the chance of
rolling a 6? Statistics, on the other hand, starts with observed data and tries
to infer or test properties of the model. Example: Given 100 coin flips with 58
heads, is the coin fair? Probability lays the theoretical foundation; statistics
builds on it to interpret data.
Unit 1. This unit introduces you to some basic probability concepts and
terminology. Many AIML models simulate or reason under uncertainty. Bayes’
Theorem finds application in classifiers, probability distributions in generative
models, and expectation and variance in model evaluation.
in Bayesian networks, hypothesis testing, and sampling and CLT for model
evaluation.
2 Probability Theory
Probability theory is the branch of mathematics that deals with uncertainty.
It provides a formal framework for reasoning about randomness, likelihood,
and chance events. We might have encountered the following expressions in
real life: “The odds favor Team India winning the match.”, “There is a 50–50
chance that the exam will be postponed.”, “It’s likely that the new software
update will fix the bug.”, etc. These are all statements involving uncertainty
– an acknowledgment that the outcome is not yet known, but we have some
intuition or evidence about how likely it is. In this unit, we introduce a precise
mathematical framework that allows us to convert statements like “There is
a 50% chance of rain tomorrow” into exact probabilistic models that can be
analyzed, simulated, or embedded into intelligent systems.
periment. For example, tossing a coin once or several times, selecting a card or
cards from a deck, picking a 4-character password using lowercase letters, etc.
An experiment in which all possible outcomes are known but the exact outcome
cannot be predicted in advance is called a random experiment. For example, when
we toss a coin we know the possible outcomes are ‘head’ and ‘tail’. But if we
toss a coin at random, we cannot predict in advance whether its upper face will
show a head or a tail. Performing a random experiment is also referred to as a
trial.
The sample space of an experiment, denoted by S, is the set of all possible
outcomes of that experiment. The sample space in the simple coin toss experi-
ment is S = {H, T } where H stands for the appearance of ‘head’ and T for ‘tail’.
The sample space when the coin is tossed twice is S = {HH, HT, T H, T T }.
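The sample spaces above can also be generated mechanically. The following is a minimal sketch, assuming each outcome is written as a string of H and T; the helper name `sample_space` is introduced here only for illustration.

```python
from itertools import product

def sample_space(n_tosses):
    """Return every outcome of tossing a coin n_tosses times, e.g. 'HT'."""
    return ["".join(outcome) for outcome in product("HT", repeat=n_tosses)]

print(sample_space(1))        # ['H', 'T']
print(sample_space(2))        # ['HH', 'HT', 'TH', 'TT']
print(len(sample_space(3)))   # 8 outcomes for three tosses
```

Each additional toss doubles the number of outcomes, so n tosses give a sample space of size 2^n.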
Example 2.2. In throwing three coins simultaneously with sample space
Disjoint events (also called mutually exclusive events) are events that
cannot happen at the same time. Let A and B be two events in a sample space
S. They are disjoint if
A∩B =∅
That is, no outcome is common between the two events.
Example 2.4. Consider another example where the sample space is the power
set of integers from 1 to 10. Let E be the event ‘all even numbers’ and O be the
event ‘all odd numbers’. These events are not disjoint since 6 is common.
bility:
1. Classical Probability
2. Frequentist Probability
3. Axiomatic Probability
The Classical approach assumes that all the outcomes are equally likely. If
our event of interest E can happen in n ways out of a total of N ways, the
probability of the event, denoted P (E) is defined as
\[ P(E) = \frac{n}{N} \]
The Frequentist approach does not assume that all the outcomes are equally
likely. Instead, we repeat the experiment many times, say Ne times (a large
value), and observe how many times the particular event E occurs, say n. The
probability is defined as
\[ P(E) = \lim_{N_e \to \infty} \frac{n}{N_e} \]
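The limit in the frequentist definition can be illustrated by simulation. The sketch below assumes a fair six-sided die and takes E to be "a 6 is rolled"; the function name `relative_frequency` and the fixed seed are choices made here for reproducibility, not part of the definition.

```python
import random

def relative_frequency(num_trials, seed=0):
    """Estimate P(rolling a 6) as n / Ne over num_trials simulated rolls."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) == 6)
    return hits / num_trials

# The estimate should drift toward 1/6 (about 0.1667) as Ne grows.
for num_trials in (100, 10_000, 200_000):
    print(num_trials, relative_frequency(num_trials))
```

A simulation can only suggest the limiting value; the definition itself refers to an idealized infinite sequence of trials.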
The Axiomatic approach treats probability as a function defined on events.
Probability is a real-valued function whose domain is the set of events and
whose range is the set of real numbers between 0 and 1 (both inclusive).
Further, the assignment of real values to events should satisfy the following
axioms of probability, formulated by Andrey Kolmogorov in 1933.
1. Non-negativity. For any event A, P (A) ≥ 0. Probabilities can never be
negative.
2. Truth. P (S) = 1. Something in the sample space must happen.
3. Countable Additivity. If A1, A2, A3, . . . is an infinite collection of disjoint
events, then
\[ P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) \]
That is, if no two of the events can occur simultaneously, then the
probability that at least one of them occurs is the sum of the probabilities
of the individual events.
Let us now understand why the condition 0 ≤ P(A) ≤ 1 is not added as a
separate axiom. This is because it follows from the above three axioms.
Theorem. For any event A ⊆ S, where S is a countable sample space and P
is a probability function satisfying the above axioms,
\[ 0 \le P(A) \le 1 \]
Proof. We have P(S) = 1 (by the truth axiom). For any event A ⊆ S, its
complement Ac = S \ A is also an event. Since A and Ac are disjoint and
A ∪ Ac = S, additivity gives P(A) + P(Ac) = P(S) = 1, so
\[ P(A) = 1 - P(A^c) \]
From the non-negativity axiom, we have P(A) ≥ 0 and P(Ac) ≥ 0, which implies
\[ P(A) \le 1 \]
Thus, combining both bounds, we get 0 ≤ P(A) ≤ 1.
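Proposition. P(∅) = 0 where ∅ is the null event (the event containing no
outcomes whatsoever). Also, the countable additivity property is valid for a
finite collection of disjoint events.
Proof. Part 1. First, consider the infinite collection A1 = ∅, A2 = ∅, A3 = ∅, . . ..
Since Ai ∩ Aj = ∅ ∩ ∅ = ∅ for any i, j ≥ 1, the events in this collection are
disjoint, and their union is
\[ \bigcup_{i=1}^{\infty} A_i = \emptyset \]
By the countable additivity axiom, we have
\[ P(\emptyset) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{\infty} P(\emptyset) \]
Thus
\[ P(\emptyset) = \sum_{i=1}^{\infty} P(\emptyset) \]
This is only possible if P(∅) = 0, since otherwise the right-hand side diverges.
Part 2. Now let A1, . . . , Ak be a finite collection of disjoint events, and set
Ai = ∅ for all i > k. The extended collection is still disjoint and has the same
union, so by countable additivity and P(∅) = 0,
\[ P\left(\bigcup_{i=1}^{k} A_i\right) = P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i) = \sum_{i=1}^{k} P(A_i) \]
This confirms that the axiom also applies to finite disjoint collections.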
events. Instead serve only to rule out assignments inconsistent with our in-
1. Classical Probability
2. Logical/Evidential Probability
3. Subjective Probability
4. Frequency Interpretations
Logical Probability views probability as a logical relation between proposi-
tions, where probabilities are determined by the strength of the evidence sup-
porting the event. For example, given that “All swans observed so far are
white,” what is the probability that the next swan is white?
1 = P (A) + P (Ac ).
This proposition is useful because there are many situations in which P (Ac )
is more easily obtained by direct methods than is P (A).
From the countable additivity axiom, it is clear that when the events A and
B are disjoint then P (A ∪ B) = P (A) + P (B). For events that are not disjoint,
adding P (A) and P (B) results in double counting outcomes that lie in both A
and B. The next proposition shows how to correct this.
Proposition. For any two events A and B, P (A∪B) = P (A)+P (B)−P (A∩B).
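Proof. We begin by observing that the union A ∪ B can be partitioned into
three disjoint events A \ B, B \ A, and A ∩ B. So,
\[ P(A \cup B) = P(A \setminus B) + P(B \setminus A) + P(A \cap B) \]
Similarly, A is the disjoint union of A \ B and A ∩ B, and B is the disjoint
union of B \ A and A ∩ B, so
\[ P(A) = P(A \setminus B) + P(A \cap B), \qquad P(B) = P(B \setminus A) + P(A \cap B) \]
Adding these gives:
\[ P(A) + P(B) = P(A \setminus B) + P(B \setminus A) + 2P(A \cap B) \]
\[ P(A \setminus B) + P(B \setminus A) = P(A) + P(B) - 2P(A \cap B) \]
Substituting this into the expression for P(A ∪ B),
\[ P(A \cup B) = P(A) + P(B) - 2P(A \cap B) + P(A \cap B) \]
\[ P(A \cup B) = P(A) + P(B) - P(A \cap B) \]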
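The addition rule can be checked numerically on a small example. The sketch below assumes a fair six-sided die with equally likely outcomes; the events A ("even") and B ("greater than 3") are chosen here only for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))                 # sample space of one die roll
A = {x for x in S if x % 2 == 0}     # even faces: {2, 4, 6}
B = {x for x in S if x > 3}          # faces above 3: {4, 5, 6}

def prob(event):
    """Classical probability: favourable outcomes over total outcomes."""
    return Fraction(len(event), len(S))

lhs = prob(A | B)                          # P(A ∪ B) computed directly
rhs = prob(A) + prob(B) - prob(A & B)      # addition rule
print(lhs, rhs)                            # 2/3 2/3
```

Adding P(A) and P(B) alone would give 1, because the outcomes 4 and 6 would be counted twice.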
The probabilities assigned to various events are based on the initial
understanding of the experimental setup and the observations made. However, as
new information becomes available, these probabilities may change. For example,
suppose you are meeting someone at an airport. The flight is likely to arrive
on time; the probability of that is 0.8. Suddenly it is announced that the
weather at the origin is bad. You now realise that the flight may be delayed,
and it now has a probability of only 0.05 of arriving on time. New information
affected the probability of meeting this flight on time. The new probability is
called a conditional probability, where the new information, that the weather at
the origin is bad, is the condition.
For a given event A, we denote its original probability, based solely on the
initial information, as P(A), often referred to as the unconditional or prior
probability. When new information, such as the occurrence of another event
B, becomes available, we update this to the conditional probability, denoted by
P(A | B), representing the revised belief in the occurrence of A given that B is
known to have occurred.
The relationship between the unconditional probability P(A) and the
conditional probability P(A | B) depends on how the events A and B are related.
P(A | B) can be less than, equal to, or greater than P(A). This depends on
whether A and B are positively associated, negatively associated, or
independent.
Positive association. If knowing B makes A more likely, then P(A | B) >
P(A).
Negative association. If knowing B makes A less likely, then P(A | B) <
P(A). For example, let A be the event that a student passes an exam and B
be the event that the student skipped all classes. Then P(A | B) < P(A).
Independence. If A and B are independent, then P(A | B) = P(A). For
example, let A be the event that a coin lands on heads and B be the event
that a die shows a 6. Then P(A | B) = P(A).
The second equation is used when P (A ∩ B) is desired, whereas both P (B) and
space S such that \(A_i \cap A_j = \emptyset\) for all \(i \neq j\) (mutually exclusive) and \(\bigcup_{i=1}^{k} A_i = S\)
\[ P(B) = \sum_{i=1}^{k} P(B \mid A_i) \cdot P(A_i) \]
\[ P(A \mid B) = \frac{P(B \mid A) \cdot P(A)}{P(B)} \]
Some time in the 1740s, the Reverend Thomas Bayes, a clergyman and a
mathematician, made this ingenious discovery. It was rediscovered indepen-
dently by Pierre Simon Laplace who gave it its modern mathematical form and
scientific application. The following example, inspired by Jerome Cornfield’s
observations, illustrates the wide variety of applications of Bayes’ Theorem.
Cornfield used the theorem to solve a puzzle about the chances of a person
getting lung cancer. His paper helped epidemiologists see how patients’
histories could help measure the link between a disease and its possible cause.
Example 2.5. Suppose a medical test is 95% accurate for detecting a disease.
However, only 1% of the population has the disease. If a person tests positive,
what is the probability that they actually have the disease?
Given
P (Disease) = 0.01 (1% of the people have the disease)
2. the test raises false alarm leading to false positive in which case
In this example demonstrating the use of Bayes’ Theorem we need the prob-
why we use P(Positive | No Disease) and not the false negative rate.
\[ P(\text{Disease} \mid \text{Positive}) = \frac{P(\text{Positive} \mid \text{Disease}) \, P(\text{Disease})}{P(\text{Positive})} = \frac{0.95 \times 0.01}{0.059} \approx 0.161 \]
So, even though the test is 95% accurate, if a person tests positive, the proba-
bility they actually have the disease is only ∼16.1%. This happens because the
disease is rare, making false positives more significant.
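The calculation in Example 2.5 can be reproduced in a few lines. This is a sketch under one stated assumption: the false positive rate P(Positive | No Disease) is taken to be 0.05, which is the value implied by P(Positive) = 0.059 in the example. The function name `posterior` is introduced here for illustration.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return (P(Disease | Positive), P(Positive)) via Bayes' Theorem."""
    # Law of total probability over the partition {Disease, No Disease}.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / p_positive, p_positive

# prior = P(Disease), sensitivity = P(Positive | Disease),
# false_positive_rate = P(Positive | No Disease) (assumed to be 0.05).
p_disease_given_positive, p_positive = posterior(0.01, 0.95, 0.05)
print(round(p_positive, 3))                 # 0.059
print(round(p_disease_given_positive, 3))   # 0.161
```

Raising the prior to 0.10 with the same test pushes the posterior to roughly 0.68, which shows how strongly the result depends on how rare the disease is.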
Conditional probability and Bayes’ Theorem are informative only when the
events are dependent, i.e., the occurrence or non-occurrence of one event has
a bearing on the chance that the other will occur. For example, drawing a
red card without replacement affects the chance of drawing a second red card.
However, drawing a red card with replacement has no impact on the chance of
drawing a second red card; the two draws are independent events.
Formally, two events A and B are independent if P(A | B) = P(A); that is,
learning that B occurred doesn’t change our belief in A. Likewise, P(B | A) =
P(B). Equivalently, P(A ∩ B) = P(A | B) · P(B) = P(A) · P(B).
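The card example can be made concrete by comparing P(second red) with P(second red | first red) under both drawing schemes. The sketch assumes a standard 52-card deck with 26 red cards; the function name `second_red_probabilities` is introduced here for illustration.

```python
from fractions import Fraction

RED_CARDS, DECK_SIZE = 26, 52
p_first_red = Fraction(RED_CARDS, DECK_SIZE)

def second_red_probabilities(with_replacement):
    """Return (P(second red), P(second red | first red)) for one scheme."""
    remaining = DECK_SIZE if with_replacement else DECK_SIZE - 1
    red_left_after_red = RED_CARDS if with_replacement else RED_CARDS - 1
    p_given_red = Fraction(red_left_after_red, remaining)
    p_given_black = Fraction(RED_CARDS, remaining)
    # Law of total probability over the colour of the first card.
    p_second_red = p_given_red * p_first_red + p_given_black * (1 - p_first_red)
    return p_second_red, p_given_red

for with_replacement in (True, False):
    p_b, p_b_given_a = second_red_probabilities(with_replacement)
    # Independent exactly when the conditional equals the unconditional value.
    print(with_replacement, p_b, p_b_given_a, p_b == p_b_given_a)
```

With replacement both values are 1/2, so the draws are independent. Without replacement the conditional probability drops to 25/51 while the unconditional one stays at 1/2.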
3 Random Variables
While probability theory models uncertainty before observing data, statistics
offers tools to make sense of data once it has been observed. Random variables
are a conceptual bridge between mathematical analysis and real-world experiments.
Some of the experiments we have seen produce abstract outcomes like “head” or
“face 3” or “red card”. But to apply statistical methods, we need to work with
numbers and not such abstract outcomes. We need numeric data to calculate a
proportion x/n, a mean x̄, and a standard deviation. A random variable is a
function that assigns a number to each outcome in a sample space, i.e., its
domain is the sample space and its range is a set of real numbers.
Example 3.1. Consider the coin tossing experiment where the sample space
is S = {Heads, Tails}. We define a random variable X such that X(Heads) =
1, X(Tails) = 0. This allows us to use mathematical tools to analyze the
outcomes. For example, we can write the probability of getting Heads as
P(X = 1), the expected value as \(E[X] = \sum_x x \cdot P(X = x)\), and the
variance as \(\mathrm{Var}(X) = E[(X - E[X])^2]\).
Thus, random variables provide a way to transition from qualitative outcomes
to quantitative analysis, forming the foundation for statistical inference.
In the following, we denote random variables by uppercase letters near the end
of the English alphabet. We use lowercase letters to represent a particular value
of the corresponding random variable.
Any random variable whose only possible values are 0 and 1 is called a
Bernoulli random variable.
Example 3.2. Consider a mobile gaming app that gives users a reward based on
a virtual die roll (a fair 6-sided die) each time they log in. The reward system
is as follows: face 1 → 0 coins, face 2 → 5 coins, face 3 → 10 coins, and so on.
Define the rv X as the number of coins received from a single die roll. So, the
domain of X is the finite set {1, 2, 3, 4, 5, 6} and the range is {0, 5, 10, 15, 20, 25}.
Since the die is fair, P(X = x) = 1/6 for each x in the range.
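A random variable is literally a function on the sample space, which the following sketch makes explicit for Example 3.2. It assumes the "and so on" pattern means face k pays 5(k − 1) coins, which matches the stated range; the names `reward` and `pmf` are introduced here for illustration.

```python
from fractions import Fraction

def reward(face):
    """The random variable X: maps a die face (1..6) to coins received."""
    return 5 * (face - 1)   # assumed pattern: 0, 5, 10, 15, 20, 25

# Build the pmf of X by pushing each equally likely face through X.
pmf = {}
for face in range(1, 7):
    coins = reward(face)
    pmf[coins] = pmf.get(coins, Fraction(0)) + Fraction(1, 6)

for coins, probability in sorted(pmf.items()):
    print(coins, probability)   # each value of X has probability 1/6
```

Accumulating probability per value (rather than assigning it) also handles random variables that map several outcomes to the same number.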
Example 3.3. Suppose a business has just purchased three laser printers, and
let X be the number among these that require service during the warranty
period. Possible X values are then 0, 1, 2, and 3. The probability distribution
tells us how the total probability of 1 is divided among these four possible
values: how much probability is associated with the X value 0, how much is
apportioned to the X value 1, and so on. We will use the following notation for
the probabilities in the distribution
p(0) = the probability of the X value 0 = P(X = 0)
p(1) = the probability of the X value 1 = P(X = 1)
and so on. The probability distribution is
x          0     1     2     3
P(X = x)   0.25  0.40  0.18  0.17
usually denoted by something like P (X = x), that maps each value of a discrete
a. 0 ≤ P(X = x) ≤ 1 for all x
b. \(\sum_{x} P(X = x) = 1\)
and b have type O-positive blood. Samples from all five individuals are typed
in random order until an O-positive individual is identified. Let the random
variable Y denote the number of typings necessary to identify the first O-positive
individual.
The sample space S is the set of all possible ordered sequences of typings
stopped at the first O-positive person. We do not continue testing after the first
O-positive is found. So each outcome is a sequence that ends with either a or b,
and all individuals before that are from {c, d, e} in some order. So the sample
space contains outcomes like
{a, b, ca, cb, da, db, ea, eb, cda, cdb, · · · , edcb}
4th position. All three c, d, e in any order, followed by a or b: \(\binom{3}{3} \cdot 3! \cdot 2 = 1 \cdot 6 \cdot 2 = 12\)
y 1 2 3 4 other
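One way to obtain the pmf of Y is brute-force enumeration. The sketch below assumes that all 5! typing orders of the individuals a, b, c, d, e are equally likely and that a and b are the O-positive individuals; Y is the position of the first O-positive person in the order. The variable names are introduced here for illustration.

```python
from fractions import Fraction
from itertools import permutations

people = "abcde"
o_positive = {"a", "b"}

# Count, for each y, the orderings whose first O-positive person is at position y.
counts = {}
orderings = list(permutations(people))
for order in orderings:
    y = next(i for i, person in enumerate(order, start=1) if person in o_positive)
    counts[y] = counts.get(y, 0) + 1

pmf_y = {y: Fraction(c, len(orderings)) for y, c in sorted(counts.items())}
for y, probability in pmf_y.items():
    print(y, probability)
```

Under these assumptions the enumeration gives p(1) = 2/5, p(2) = 3/10, p(3) = 1/5, p(4) = 1/10, and p(y) = 0 for any other y. Note that the stopped sequences listed in the sample space are not equally likely, which is why the enumeration runs over full orderings.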
The pmf or probability distribution gives the probability of each individual
value of X. Imagine a scenario where X is the number of beds occupied in
a hospital’s emergency room at a certain time of day and we are interested in
knowing the probability that at most two beds are occupied. This is where the
concept of the cumulative distribution function helps.
Example 3.5. Suppose the pmf of X for the above hospital scenario is given
by
x      0     1     2     3     4
p(x)   0.20  0.25  0.30  0.15  0.10
Then the probability that at most two beds are occupied is:
\[ P(X \le 2) = p(0) + p(1) + p(2) = 0.20 + 0.25 + 0.30 = 0.75 \]
these values {0, 1, 2}. The highest value X can take that is ≤ 2.7 is 2. Thus
get P(X ≤ −10) = 0.
That is, for any number x, F (x) gives the probability that the observed value
of X will be at most x.
Proposition. Let X be a discrete random variable with cumulative distribution
function F (x) = P (X ≤ x). For any two real numbers a ≤ b, we have
P (a ≤ X ≤ b) = F (b) − F (a− )
where F (a− ) denotes the left-hand limit of the cdf at a, i.e., the sum of proba-
bilities of all values strictly less than a.
In particular, if X takes only integer values and a, b ∈ Z, then
\[ P(a \le X \le b) = \sum_{x=a}^{b} P(X = x) = F(b) - F(a - 1) \]
Taking a = b yields
\[ P(X = a) = F(a) - F(a - 1) \]
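Proof. By definition of the cdf,
\[ F(b) = P(X \le b) = \sum_{x \le b} P(X = x) = \sum_{x < a} P(X = x) + \sum_{a \le x \le b} P(X = x) \]
and
\[ F(a^-) = \sum_{x < a} P(X = x) \]
Subtracting the second equation from the first gives P(a ≤ X ≤ b) = F(b) − F(a−).
If X takes only integer values and a is an integer, then
\[ F(a - 1) = \sum_{x < a} P(X = x) \quad \text{(since the only possible values less than } a \text{ are integers at most } a - 1\text{)} \]
So
\[ P(a \le X \le b) = F(b) - F(a - 1) \]
And if a = b, then:
\[ P(X = a) = P(a \le X \le a) = F(a) - F(a - 1) \]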
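The cdf and the interval formula can be tried out on the hospital pmf of Example 3.5. The following is a minimal sketch; the function names `cdf` and `prob_between` are introduced here for illustration, and results are rounded only to hide floating-point noise.

```python
# pmf of the number of occupied beds, taken from Example 3.5.
pmf = {0: 0.20, 1: 0.25, 2: 0.30, 3: 0.15, 4: 0.10}

def cdf(x):
    """F(x) = P(X <= x): add up p(value) over all possible values <= x."""
    return sum(p for value, p in pmf.items() if value <= x)

def prob_between(a, b):
    """P(a <= X <= b) for integers a <= b, computed as F(b) - F(a - 1)."""
    return cdf(b) - cdf(a - 1)

print(round(cdf(2), 2))               # 0.75
print(round(cdf(2.7), 2))             # 0.75, same as F(2)
print(cdf(-10))                       # 0, no possible value is <= -10
print(round(prob_between(1, 3), 2))   # 0.7 = p(1) + p(2) + p(3)
```

Because X is discrete, F is a step function: it is constant between possible values and jumps by p(x) at each possible value x.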
Example 3.6. * Let X denote the number of days of sick leave taken by a
randomly selected employee of a large company during a particular year. If the
maximum number of allowable sick days per year is 14, then the possible values
of X are {0, 1, 2, . . . , 14}
Suppose the cumulative distribution function F (x) = P (X ≤ x) is known
for several values F (0) = 0.58, F (1) = 0.72, F (2) = 0.76, F (3) = 0.81, F (4) =
0.88, F (5) = 0.94.
Then we can compute the probability that an employee took between 2 and
5 sick days (inclusive) as
P (2 ≤ X ≤ 5) = P (X = 2 or 3 or 4 or 5) = F (5) − F (1) = 0.94 − 0.72 = 0.22
Similarly, the probability that an employee took exactly 3 sick days is
P (X = 3) = F (3) − F (2) = 0.81 − 0.76 = 0.05
* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.
3.3 Expected Value
In many real-world situations, we are interested not just in the possible outcomes
of a random experiment but in summarizing those outcomes with a single rep-
resentative number, typically an average. For instance, when analyzing student
course enrollments at a university, knowing how many courses each student is
registered for is useful, but what is even more informative is the average num-
ber of courses per student. This average helps administrators plan resources,
allocate faculty, and understand student workload.
Example 3.7. * Consider a university having 15,000 students and let X be the
number of courses for which a randomly selected student is registered. The pmf
of X is as follows
x 1 2 3 4 5 6 7
p(x) 0.01 0.03 0.13 0.25 0.39 0.17 0.02
The basic experiment being performed is picking a student at random from the
university. So, the sample space S is the set of all possible outcomes of this
experiment which is mutually exclusive (no overlap) and collectively exhaustive
(cover all possibilities). Here the sample space is
\[ S = \{\text{Stud}_1, \text{Stud}_2, \cdots, \text{Stud}_{15000}\} \]
Let us now identify the random variable. In this example, the random vari-
\(X(\text{Stud}_{219}) = 3\); \(X(\text{Stud}_{1125}) = 7\), etc. Thus X collapses all students who take,
say, 5 courses into the same numerical value. It doesn’t identify who the student
is.
The average number of courses per student is given by computing the total
number of courses taken by all students and dividing by the total number of
students. Since each of 150 students is taking one course, these 150 contribute
150 courses to the total. Similarly, 450 students contribute 2 × 450 courses, and
so on. The population average value of X is then
\[
\begin{aligned}
&\frac{1(150) + 2(450) + 3(1950) + 4(3750) + 5(5850) + 6(2550) + 7(300)}{15000} \\
&= 1 \cdot \frac{150}{15000} + 2 \cdot \frac{450}{15000} + 3 \cdot \frac{1950}{15000} + 4 \cdot \frac{3750}{15000} + 5 \cdot \frac{5850}{15000} + 6 \cdot \frac{2550}{15000} + 7 \cdot \frac{300}{15000} \\
&= 1 \cdot p(1) + 2 \cdot p(2) + 3 \cdot p(3) + 4 \cdot p(4) + 5 \cdot p(5) + 6 \cdot p(6) + 7 \cdot p(7) \\
&= 4.57
\end{aligned}
\]
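The two routes to 4.57 in the calculation above, averaging over the population and weighting values by the pmf, can be compared directly. The sketch uses the head counts and pmf from Example 3.7; the variable names are introduced here for illustration.

```python
# Number of students registered for x courses, out of 15,000 (Example 3.7).
counts = {1: 150, 2: 450, 3: 1950, 4: 3750, 5: 5850, 6: 2550, 7: 300}
pmf = {1: 0.01, 2: 0.03, 3: 0.13, 4: 0.25, 5: 0.39, 6: 0.17, 7: 0.02}

total_students = sum(counts.values())

# Route 1: total courses taken divided by the number of students.
population_mean = sum(x * n for x, n in counts.items()) / total_students

# Route 2: probability-weighted sum of the possible values.
expected_value = sum(x * p for x, p in pmf.items())

print(total_students)             # 15000
print(round(population_mean, 2))  # 4.57
print(round(expected_value, 2))   # 4.57
```

The second route needs only the pmf, not the population itself.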
* taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.
This is nothing but the weighted average of all possible values a random variable
can take, where the weights are the probabilities. This naturally leads to the
concept of the expected value of a random variable, a fundamental quantity
that captures the long-run average outcome in probabilistic settings.
Importantly, probabilities themselves can often be interpreted as relative
proportions in a large population, meaning we can compute expected values using
just the probability distribution without needing the full population data. This
makes expected value a powerful and practical tool even when only probabilistic
models are available. Average (or mean) is the term used in statistics for the
empirical average computed from a data sample; expected value is the analogous
notion for the theoretical long-run average over all outcomes, based on a model.
At their core, both refer to central tendency.
The expected value in the above example is a decimal (4.57) but the random
variable itself takes only integer values in {1, 2, 3, 4, 5, 6, 7}. The expected value
is a theoretical center of gravity of the distribution and not necessarily a value
the random variable ever actually assumes.
Let X be a discrete random variable with set of possible values D and
probability mass function p(x). The expected value or mean value of X,
denoted by E[X] or µX (or simply µ), is defined as
\[ E[X] = \mu_X = \sum_{x \in D} x \cdot p(x) \]
each outcome leads to a cost. Then E[g(X)] gives the expected cost.
Rs 2 if you roll a 1 or 2,
Rs 5 if you roll a 3 or 4,
Rs 10 if you roll a 5 or 6.
Let X denote the outcome of a fair six-sided die and define the function g(x)
by
\[
g(x) =
\begin{cases}
2 & \text{if } x = 1, 2 \\
5 & \text{if } x = 3, 4 \\
10 & \text{if } x = 5, 6
\end{cases}
\]
Each outcome x ∈ {1, 2, 3, 4, 5, 6} has probability 1/6. Then the expected
value of the function g(X) is
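\[
\begin{aligned}
E[g(X)] &= \sum_{x=1}^{6} g(x) \cdot \frac{1}{6} \\
&= 2\left(\frac{1}{6} + \frac{1}{6}\right) + 5\left(\frac{1}{6} + \frac{1}{6}\right) + 10\left(\frac{1}{6} + \frac{1}{6}\right) \\
&= \frac{2(2) + 5(2) + 10(2)}{6} \\
&= \frac{34}{6} \approx 5.67
\end{aligned}
\]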
So, on average, you earn Rs 5.67 per roll. This illustrates how expected
value generalizes to functions of random variables.
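The same expected payoff can be computed in two ways: by summing g(x) · p(x) over the die faces, or by first deriving the pmf of the payoff itself. The sketch below assumes the fair die and the payoff function g from the example; the remaining names are introduced here for illustration.

```python
from fractions import Fraction

def g(x):
    """Payoff in rupees for die face x, as defined in the example."""
    if x in (1, 2):
        return 2
    if x in (3, 4):
        return 5
    return 10

p = Fraction(1, 6)  # probability of each face of a fair die

# Route 1: sum g(x) * p(x) over the values of X, without the pmf of g(X).
expected_payoff = sum(g(x) * p for x in range(1, 7))

# Route 2: build the pmf of the payoff Y = g(X), then take its mean.
pmf_payoff = {}
for x in range(1, 7):
    pmf_payoff[g(x)] = pmf_payoff.get(g(x), Fraction(0)) + p
expected_from_pmf = sum(y * q for y, q in pmf_payoff.items())

print(expected_payoff, expected_from_pmf)   # 17/3 17/3
print(round(float(expected_payoff), 2))     # 5.67
```

Both routes agree (34/6 reduces to 17/3), but the first never requires working out the distribution of the payoff.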
\[ E[g(X)] = \sum_{x=1}^{5} g(x) \cdot P(X = x) \]
If the random variable X has a set of possible values D and probability mass
function p(x), then the expected value of any function h(X), denoted by
E[h(X)] or µh(X), is computed as
\[ E[h(X)] = \sum_{x \in D} h(x) \cdot p(x) \]
This is incredibly useful because we don’t need to find the pmf of Y = h(X). We
can just “lift” the function h over the known distribution of X. This is called
the Law of the Unconscious Statistician (LOTUS). Paul Halmos, a prominent
mathematician, is credited with coining the term “Fundamental Theorem of
the Unconscious Statistician” in the early 1940s. Sheldon Ross popularized
the term by using it in the first edition of his textbook Introduction to
Probability Models, noting in a footnote: “This law got its name from ‘unconscious’
statisticians who have used it as if it were the definition of E[g(X)].” Many
statisticians didn’t appreciate this humor and the author had to remove it in
later editions.
Proposition. Let X be a discrete random variable with probability mass func-
tion p(x), and let a and b be real constants. Then
E[aX + b] = a · E[X] + b
Proof. Let Y = aX + b.
\[
\begin{aligned}
E[Y] = E[aX + b] &= \sum_{x \in D} (ax + b) \cdot p(x) \\
&= \sum_{x \in D} ax \cdot p(x) + \sum_{x \in D} b \cdot p(x) \\
&= a \sum_{x \in D} x \cdot p(x) + b \sum_{x \in D} p(x) \\
&= a \cdot E[X] + b \cdot 1 \\
&= a \cdot E[X] + b
\end{aligned}
\]
This means sums and constants can be pulled outside the expectation.
value of X to inches by multiplying by 0.39 then the expected value also gets
multiplied by 0.39. So, multiplying changes the unit or scale of the values and
proportionally changes the average. Likewise, suppose you increase every value
of X by 5 like adding a fixed processing fee or tax. Then the expected value also
increases by exactly 5. That means shifting the values just shifts the average
by the same amount. Changing the random variable by scaling (multiplying) or
shifting (adding) changes the expected value in a predictable way – the expected
value scales and shifts with it.
Let us consider a transformation of a random variable X into a new variable
Y as Y = aX + b. This is a transformation of the values of X, and depending on
whether b = 0 or not, we classify it differently. When b = 0, the transformation,
known as a linear transformation, becomes Y = aX. There is no shift involved;
the random variable is just stretched or shrunk by the constant a.
When b ≠ 0, the transformation, known as an affine transformation, is
Y = aX + b. It first scales and then shifts the values of X. So not only do we
scale the average, but we also add the constant shift. For example, converting
Celsius to Fahrenheit by F = (9/5) C + 32 is an affine transformation.
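The rule E[aX + b] = a · E[X] + b can be checked on the temperature conversion. The pmf below is hypothetical, invented only for this illustration; the helper name `expectation` is likewise introduced here.

```python
from fractions import Fraction

# Hypothetical pmf of a temperature reading in degrees Celsius.
pmf_celsius = {20: Fraction(1, 4), 25: Fraction(1, 2), 30: Fraction(1, 4)}

def expectation(pmf, h=lambda x: x):
    """E[h(X)] = sum of h(x) * p(x) over the possible values of X."""
    return sum(h(x) * p for x, p in pmf.items())

a, b = Fraction(9, 5), 32   # Fahrenheit = (9/5) * Celsius + 32

mean_celsius = expectation(pmf_celsius)
direct = expectation(pmf_celsius, lambda c: a * c + b)   # E[aX + b]
shortcut = a * mean_celsius + b                          # a * E[X] + b

print(mean_celsius, direct, shortcut)   # 25 77 77
```

Converting every value and then averaging gives the same result as averaging first and converting the mean once.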
3.4 Variance
The expected value of a random variable X, denoted E[X] or µ, tells us where the
center of the probability distribution lies. It is like the balance point or fulcrum
of the distribution when all probabilities are imagined as weights placed on a
seesaw.
Imagine each value x as a point on a ruler, and the probability p(x) as a
small weight placed at that point. If you support the ruler at the expected value
µ it will balance and not tilt to one side or the other. This shows that µ is the
center of gravity of the distribution.
Even if two distributions have the same expected value (same center), they
can look very different. One might have most values tightly clustered around the
center, while another might have values spread far apart. For example, a factory
producing screws with a target length of 10mm will desire that most of the
screws are very close to that length. A high difference indicates inconsistency,
leading to defects or failures. A stock with 10% average return but high variance
might lose you money one year and double it the next. Understanding variance
helps balance risk vs return.
This difference between the desired behaviour (expected value) and the
actual behaviour is captured by the variance of X. It quantifies uncertainty
and inconsistency. Whether it is risk, quality, performance, or fairness, variance
helps us understand how much things fluctuate around their expected behavior.
We want to measure the “expected deviation” of a random variable from its
expected value. But the simple E[X − E[X]] always gives 0, because by linearity
E[X − µ] = E[X] − µ = 0: positive and negative deviations cancel.
In his early work on astronomical observations and measurement errors,
Carl Friedrich Gauss introduced the use of the expected value of the squared
difference between a random variable and its mean, which we now define as the
variance
\[ V(X) = E[(X - \mu)^2] \]
Gauss developed this concept, in the early 19th century, while seeking to find
concerned with how to best estimate unknown quantities from such noisy data.
He proposed minimizing the expected squared error, which naturally led to the
mean µ as the best estimate and variance as a measure of uncertainty or spread
around the mean. This principle underlies much of modern statistics and data
science, particularly in regression, estimation theory, and machine learning.
The deviation is squared, as in the equation above, to ensure that
all deviations contribute positively to the overall measure of spread. Moreover,
squaring gives more weight to larger deviations (for example, squaring a
difference of 2 gives 4, while squaring a difference of 10 gives 100). The variance
is then the average of the squared deviations, weighted by their probabilities.
If a random variable X is measured in some units, then its mean µ has
the same measurement unit as X. However, the variance V is measured in
squared units, and therefore it cannot be directly compared with X or µ. No
matter how unusual it sounds, it is mathematically correct to measure variance
of profit in squared rupees, variance of class enrollment in squared students,
and variance of available disk space in squared gigabytes. When we take the
square root of variance, the resulting standard deviation σ is again measured
in the same units as X. This is the main reason for introducing another measure
of variability – the standard deviation σ, which provides a more interpretable
sense of spread in the context of the original data. Variance, also denoted σ²,
is essential for theoretical reasons (e.g., it is easier to manipulate algebraically).
But for interpreting data or comparing with real-world measurements, we always
use the standard deviation.
The computation of variance can be reduced to this simple formula
\[ V(X) = \sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2 \]
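The definition and the shortcut formula can be compared numerically. The sketch below reuses the course-enrollment pmf of Example 3.7 as its data; the variable names are introduced here for illustration, and values are rounded to hide floating-point noise.

```python
import math

# pmf of the number of registered courses, from Example 3.7.
pmf = {1: 0.01, 2: 0.03, 3: 0.13, 4: 0.25, 5: 0.39, 6: 0.17, 7: 0.02}

mean = sum(x * p for x, p in pmf.items())                 # E[X]
mean_of_squares = sum(x * x * p for x, p in pmf.items())  # E[X^2]

# Definition: expected squared deviation from the mean.
variance_by_definition = sum((x - mean) ** 2 * p for x, p in pmf.items())
# Shortcut: E[X^2] - (E[X])^2.
variance_by_shortcut = mean_of_squares - mean ** 2

std_dev = math.sqrt(variance_by_shortcut)

print(round(mean, 2))                     # 4.57
print(round(variance_by_definition, 4))   # 1.2651 (squared courses)
print(round(variance_by_shortcut, 4))     # 1.2651
print(round(std_dev, 3))                  # 1.125 (courses)
```

The shortcut needs only the two sums E[X] and E[X²]. The standard deviation is back in the original unit, courses.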
\[ \sigma_Y = \sqrt{V[Y]} = \sqrt{a^2 \cdot V[X]} = \sqrt{a^2} \cdot \sqrt{V[X]} = |a| \cdot \sigma \]
rection. The negative square root flips the distribution but does not affect how
spread out the values are. Therefore, only the magnitude affects the standard
deviation, not its sign.
Example 3.10. We would like to invest Rs. 10,000 into shares of companies
XX and YY. Shares of XX cost Rs. 20 per share. The market analysis shows
that their expected return is Rs. 1 per share with a standard deviation of Rs. 0.5.
Shares of YY cost Rs. 50 per share, with an expected return of Rs. 2.50 and a
standard deviation of Rs. 1 per share, and returns from the two companies are
independent. In order to maximize the expected return and minimize the risk
(standard deviation or variance), is it better to invest (A) all Rs. 10,000 into
XX, (B) all Rs. 10,000 into YY, or (C) Rs. 5,000 in each company?
Let X be the actual (random) return from each share of XX, and Y be the
actual return from each share of YY. Compute the expectation and variance of
the return for each of the proposed portfolios (A, B, and C).
(a) At Rs. 20 a piece, we can use Rs. 10,000 to buy 500 shares of XX collecting
a profit of A = 500X. Using (3.5) and (3.7),
\[ E(A) = 500\,E(X) = 500(1) = 500; \]
\[ V(A) = 500^2\,V(X) = 500^2 (0.5)^2 = 62{,}500. \]
taken from Probability and Statistics for Computer Scientists by Michael Baron,
2nd edition.
(b) Investing all Rs. 10,000 into YY, we buy 10,000/50 = 200 shares of it and
collect a profit of B = 200Y,
\[ E(B) = 200\,E(Y) = 200(2.50) = 500; \]
\[ V(B) = 200^2\,V(Y) = 200^2 (1)^2 = 40{,}000. \]
(c) Investing Rs. 5,000 into each company makes a portfolio consisting of
250 shares of XX and 100 shares of YY; the profit in this case will be
C = 250X + 100Y. Following (3.7) for independent X and Y,
\[ E(C) = 250\,E(X) + 100\,E(Y) = 250(1) + 100(2.50) = 500; \]
\[ V(C) = 250^2\,V(X) + 100^2\,V(Y) = 250^2 (0.5)^2 + 100^2 (1)^2 = 25{,}625. \]
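The portfolio arithmetic in Example 3.10 can be checked with a few lines of Python. This is a minimal sketch, not part of the original example: the helper name `portfolio_stats` is illustrative, and it assumes the returns are independent, as the example states.

```python
def portfolio_stats(holdings):
    """Return (mean, variance) of a sum of independent share returns.

    holdings: list of (shares, mean_per_share, sd_per_share) tuples.
    Uses E(aX) = a E(X) and V(aX) = a^2 V(X); the variances are added
    only because the returns are assumed independent.
    """
    mean = sum(n * mu for n, mu, sd in holdings)
    var = sum((n * sd) ** 2 for n, mu, sd in holdings)
    return mean, var

xx = (1.0, 0.5)   # expected return and sd per share of XX (Rs.)
yy = (2.5, 1.0)   # expected return and sd per share of YY (Rs.)

portfolios = {
    "A": [(500, *xx)],
    "B": [(200, *yy)],
    "C": [(250, *xx), (100, *yy)],
}
results = {name: portfolio_stats(h) for name, h in portfolios.items()}
for name, (mean, var) in results.items():
    # portfolio, expected return, variance, standard deviation
    print(name, mean, var, var ** 0.5)
```

All three portfolios show the same expected return, while the variance is smallest for the split portfolio C.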
Discussion. A portfolio will not generally yield the same return unless the
assets involved have identical proportional returns. But risk will almost always
change depending on how you combine assets. This example illustrates a case
where the portfolios yield the same expected return but with different risk
levels. The expected return is the same for each of the three proposed portfolios
because each share of each company is expected to return Rs. 1/20 or
Rs. 2.50/50, which is 5%. But portfolio C, where the investment is split between
two companies, has the lowest variance; therefore, it is the least risky.
High variance, as in Portfolio A, means high risk and you could see huge
drops before a recovery. The return might be delayed or even not materialize
within your time frame.
Expected return is an average over many possible outcomes (or many in-
vestors, or many years). It does not guarantee what happens in one realization.
Expected return becomes meaningful when the experiment is repeated many
times (e.g., investing every year) or when many investors are considered. Then
the average of outcomes will converge toward the expected value due to the Law
of Large Numbers.
A Systematic Investment Plan (SIP) involves investing a fixed amount at
regular intervals (e.g., monthly), typically in mutual funds or equity markets.
This process ties closely with ideas from expected value, variance, and the law of
large numbers. Over time, the average return stabilizes and starts approaching
the expected return. SIP is a practical implementation of diversifying not just
across assets, but across time.
4 Probability Distribution on a Single RV
A probability distribution (pd) of a single random variable provides a com-
plete description of the likelihood of its possible values. If the random variable
is discrete, the distribution is represented by a probability mass function (pmf)
that assigns a probability to each possible value. The probability distribution
encapsulates how the random variable behaves, and it forms the foundation
for further statistical analysis, such as computing expectations, variances, and
making inferences.
Absolutely different phenomena can be adequately described by the same
mathematical model, or a family of distributions. For example, the number
of virus attacks, received e-mails, error messages, network blackouts, telephone
calls, traffic accidents, earthquakes, and so on can all be modeled by the same
Poisson family of distributions.
For example, tossing a fair coin (head = success, tail = failure) is a Bernoulli
four conditions
Each specific sequence of k successes and (n − k) failures has the same
probability, since trials are independent. The probability of such a sequence is
\[ p^k (1 - p)^{n-k} \]
email as spam 5% of the time. That is, the probability of a false positive (a
A user checks 4 important emails received during the day. Let us find the
probability that none of them were wrongly marked as spam, assuming
independence.
Number of trials: n = 4
To find the probability that none of the emails were marked as spam, we
calculate
\[
\begin{aligned}
P(0 \text{ false positives}) &= \binom{4}{0} \cdot 0.05^0 \cdot 0.95^{4-0} \\
&= 1 \cdot 1 \cdot 0.95^4 \\
&\approx 0.815
\end{aligned}
\]
This implies that there is an approximately 81.5% chance that all 4 impor-
tant emails were delivered correctly (not misclassified).
This kind of probability estimation helps software engineers and product
managers assess the reliability of machine learning models used in email filtering
systems. If the probability of at least one misclassification becomes too high,
adjustments may be needed in the model.
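The spam-filter calculation can be reproduced as follows. This is a small sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice, and the values n = 4 and p = 0.05 are the ones used in the example above.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Bin(n, p)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 4, 0.05                      # 4 emails, 5% false-positive rate
p_none = binomial_pmf(0, n, p)      # no email misclassified
p_at_least_one = 1 - p_none         # complement: one or more misclassified
print(round(p_none, 4), round(p_at_least_one, 4))
```

The second number is the probability of at least one misclassification, which is the quantity the closing remark above suggests monitoring.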
In most binomial experiments, we are not concerned with which specific trials
resulted in a success (S), but only with the total number of successes across
all the trials. For example, when a coin is flipped 10 times we are not interested
in whether the 1st, 3rd, or 7th toss was heads, only in how many heads we got in total.
This count is modeled by the binomial random variable X. It is a discrete
random variable that can take values from 0 to n. The domain of the random
variable X is the sample space S of the binomial experiment
There are $2^n$ such sequences, and each represents an outcome of the experiment.
The range of X is the set of possible values that X can take, that is, the number
of successes in n trials
\[ \text{Range}(X) = \{0, 1, 2, \ldots, n\} \]
X = 3 corresponds to the outcome HHH. Since the coin is fair, each toss has a
probability $P(H) = \frac{1}{2}$. Because the coin tosses are independent, the probability
of getting HHH is
\[ P(HHH) = \left(\frac{1}{2}\right)^3 = \frac{1}{8} \]
Example 4.2. Consider a binomial experiment with 3 coin tosses, where ‘H’
where $\binom{n}{x}$ is the number of ways to choose x successes from n trials, $p^x$ is the
\[ X = X_1 + X_2 + \cdots + X_n \]
\[
\begin{aligned}
E[X] &= E[X_1 + X_2 + \cdots + X_n] \\
&= E[X_1] + E[X_2] + \cdots + E[X_n] \\
&= p + p + \cdots + p \\
&= np
\end{aligned}
\]
Recall from Sec. 3.4 that the variance of a Bernoulli random variable is defined
as
\[ \mathrm{Var}(X_i) = E[X_i^2] - (E[X_i])^2 \]
Since $X_i \in \{0, 1\}$, we have $X_i^2 = X_i$. Therefore,
\[ E[X_i^2] = E[X_i] = p \]
Substituting into the variance formula gives
\[ \mathrm{Var}(X_i) = p - p^2 = p(1 - p) \]
Since the Xi ’s are independent, we have the variance of Binomial variable X as
V[X] = V[X1 + X2 + · · · + Xn ]
= V[X1 ] + V[X2 ] + · · · + V[Xn ]
= n · V[X1 ]
= n · p(1 − p)
= np(1 − p)
Thus, for a binomial random variable X ∼ Bin(n, p), we have
E[X] = np and V[X] = np(1 − p)
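The two formulas can be sanity-checked by computing the mean and variance directly from the binomial pmf. The sketch below is illustrative only: the function name `binomial_moments` and the values n = 10, p = 0.3 are assumptions chosen for the demonstration, not taken from the notes.

```python
from math import comb

def binomial_moments(n, p):
    """Mean and variance of Bin(n, p) computed directly from the pmf."""
    pmf = [comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)]
    mean = sum(x * px for x, px in enumerate(pmf))
    second = sum(x * x * px for x, px in enumerate(pmf))   # E[X^2]
    return mean, second - mean**2

n, p = 10, 0.3   # illustrative values, not from the text
mean, var = binomial_moments(n, p)
print(mean, n * p)             # both should be close to np
print(var, n * p * (1 - p))    # both should be close to np(1 - p)
```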
4.2 Poisson Distribution
Let us now analyse scenarios like counting the number of phone calls arriving
at a call center in an hour, typos in a 1000 page book, network packet losses in
a second, customers entering a shop per minute, etc.
At first glance, we might model these using the Binomial distribution, where
each “trial” corresponds to a possible opportunity for the event (e.g., each mil-
lisecond for a call to arrive). The probability of the event in each trial is small
and the number of trials is fixed.
We may view this as a binomial experiment being performed by an unknown
agent, over a very large number of very small sub-intervals, where each sub-
interval has a very small chance of success, and we observe only the count of
successes (rare events).
For example, to count typos (an experiment) in a 1,000-page book (large
interval) split each page into 1,000 characters (small sub-intervals). Assume the
chance of a typo per character is very small. But over the whole book, you
might expect 5 typos in total (count successes). Even though the number of
trials is huge, the expected number of actual events (typos) is still small.
The Binomial distribution models a finite and known number of trials. Here,
however, we observe discrete events occurring in continuous space (or, in some
cases, time), with no identifiable trials. Hence the need for the Poisson
distribution. The Poisson distribution emerges naturally when we are interested
in modeling rare events that occur randomly over time or space. The Poisson
distribution arises as a limit of the Binomial distribution
In simpler terms, the Poisson distribution answers the question “Given that
\[ P(X = x) = \lim_{n \to \infty} \binom{n}{x} p^x (1 - p)^{n-x} = \frac{\mu^x e^{-\mu}}{x!}, \quad x = 0, 1, 2, \ldots \]
P (X = x) is also denoted as p(x; µ).
S = {0, 1, 2, 3, 4, . . . } = N0
That is, the random variable maps each outcome to itself (an identity function).
The range is the set of values that X can take with non-zero probability. For a
Poisson distribution, all non-negative integers have positive probability, so
Range(X) = {x ∈ N0 : P (X = x) > 0} = N0
\[ P(X = x) = \frac{e^{-\mu} \mu^x}{x!} \]
\[ P(X = 3) = \frac{e^{-4} \cdot 4^3}{3!} = \frac{e^{-4} \cdot 64}{6} \]
Now, approximate numerically
\[ e^{-4} \approx 0.0183, \qquad \frac{64}{6} \approx 10.6667 \]
So, there is approximately a 19.5% chance that exactly 3 customers arrive during
\[ n \to \infty, \quad p \to 0, \quad \text{such that } np = \mu, \]
\[ P(X = x) = \frac{\mu^x e^{-\mu}}{x!}, \quad x = 0, 1, 2, \ldots \]
We compute the expected value
\[ E[X] = \sum_{x=0}^{\infty} x \cdot P(X = x) = \sum_{x=0}^{\infty} x \cdot \frac{\mu^x e^{-\mu}}{x!} \]
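We substitute p = µ/n into the variance expression
\[ V[X] = n \cdot \frac{\mu}{n} \cdot \left(1 - \frac{\mu}{n}\right) = \mu \left(1 - \frac{\mu}{n}\right). \]
Now, take the limit as n → ∞
\[ \lim_{n \to \infty} \mu \left(1 - \frac{\mu}{n}\right) = \mu(1 - 0) = \mu. \]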
Proposition.
E(X) = Var(X) = µ.
where µ is the expected number of events occurring in a fixed interval (of time,
space, etc.), or the Poisson parameter.
Example 4.4. A public health researcher is studying rare dental defects in chil-
dren aged 6–10. Based on historical data, the average number of enamel defects
per child is known to be µ = 0.2. Let X be the random variable representing
the number of enamel defects in a randomly selected child. Since the events are
rare, independent, and counted over a fixed unit (per child) this justifies the use
of the Poisson distribution with parameter µ = 0.2.
In this case, the researcher examines one child at a time. For each individual
child, they count how many enamel defects occurred. So, “per child” acts as
a fixed unit of observation (just like “per hour”, “per square meter”, or “per
kilometer”).
The expected value is E[X] = µ = 0.2. Thus, each child has, on average, 0.2 defects
(or 1 defect every 5 children).
\[ P(X = 0) = \frac{e^{-0.2} \cdot 0.2^0}{0!} = e^{-0.2} \approx 0.8187 \]
\[ P(X = 1) = \frac{e^{-0.2} \cdot 0.2^1}{1!} = 0.2\,e^{-0.2} \approx 0.1637 \]
About 16.37% of children have exactly one defect.
5 Continuous Random Variable
Any discrete distribution is concentrated on a finite or countable number of iso-
lated values. Conversely, continuous variables can take any value of an interval,
(a, b), (a, +∞), (−∞, +∞), etc. Various times like service time, installation
time, download time, failure time, and also physical measurements like weight,
height, distance, velocity, temperature, and connection speed are examples of
continuous random variables.
Suppose some value $x_1$ has $P(x_1) \ge \frac{1}{2}$. Since the total probability cannot
exceed 1 (by the Axiom of Probability), at most one other value could also have
$P(x) \ge \frac{1}{2}$. More generally, at most 2 values can have $P(x) \ge \frac{1}{2}$, at most 4
values can have $P(x) \ge \frac{1}{4}$, and at most n values can have $P(x) \ge \frac{1}{n}$.
An interval like [0, 1] is an uncountable set, so we cannot list its elements
one by one. If a pmf gave the same positive probability, say ε, to each point in
an uncountable set, the total probability would diverge
\[ \text{Total} = \varepsilon \times (\text{uncountable infinity}) = \infty \]
This violates the axiom of probability $\sum_x P(x) = 1$. Thus, a pmf cannot assign
positive probabilities to uncountably many values.
A random variable X is continuous if possible values comprise either a
single interval on the number line (for some A < B, any number x between A
Since P (X = x) = 0 the pmf does not carry any information about a random
\[ P(a \le X \le b) = \int_a^b f(x)\,dx \]
The histogram above shows a possible distribution of depth, where
$\sum_{k=0}^{M} P(X_{\text{disc}} = k) = 1$.
If depth is measured much more accurately and rounded to the nearest centimetre,
we get the histogram in Fig. 2b. If we continue in this way to measure depth
more and more finely, the resulting sequence of histograms approaches a smooth
curve, such as is pictured in Fig.2c. The total area under the smooth curve is
1. The probability that the depth at a randomly chosen point is between a and
b is just the area under the smooth curve between a and b illustrated in Fig.3.
Figure 3: P (a ≤ x ≤ b)
2. $\int_{-\infty}^{\infty} f(x)\,dx = 1$
Example 5.1. Let X be the angle (in degrees) from a reference line to an
\[
f(x) =
\begin{cases}
\frac{1}{360}, & 0 \le x < 360 \\
0, & \text{otherwise}
\end{cases}
\]
[Figure: Uniform PDF with shaded region for P(90° ≤ X ≤ 180°) = 0.25. Horizontal axis: x (degrees), 0 to 360; vertical axis: f(x).]
X ∼ Uniform(A, B)
on the idea of complete uncertainty (or equal likelihood) within a known range.
Suppose you are told a delivery will arrive sometime between 2 PM and 4 PM,
with no further information. You assume the arrival time is equally likely at
2:05, 3:10, or 3:59. So you model the time X with X ∼ Uniform(2, 4).
Because single points have zero probability, it follows that for any interval
[a, b] with a < b
Example 5.2. “Time headway” in traffic flow is the elapsed time between the
moment one car finishes passing a fixed point and the instant the next car begins
to pass that point. Let X denote the time headway for two randomly chosen
consecutive cars on a freeway during a period of heavy flow with the pdf of X
\[
f(x) =
\begin{cases}
0.15\,e^{-0.15(x - 0.5)}, & x \ge 0.5 \\
0, & \text{otherwise}
\end{cases}
\]
The function reflects real-world traffic conditions where cars can’t be arbitrarily
close due to safety. The first condition models the physical constraint that
vehicles need at least 0.5 seconds of separation. But the exponential form of
the pdf shows that small headways are more likely (high density near x = 0.5).
The probability of large headways drops off rapidly as x → ∞. This matches
our intuition about traffic during heavy flow where cars are close together. The
graph of f (x) is given in Fig.4.
There is no density associated with headway times less than 0.5, and the
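\[ \int_a^{\infty} e^{-kx}\,dx = \frac{1}{k} e^{-ka} \]
\[
\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{0.5}^{\infty} 0.15\,e^{-0.15(x - 0.5)}\,dx \\
&= 0.15\,e^{0.075} \int_{0.5}^{\infty} e^{-0.15x}\,dx \\
&= 0.15\,e^{0.075} \cdot \left[ -\frac{1}{0.15} e^{-0.15x} \right]_{x=0.5}^{\infty} \\
&= 0.15\,e^{0.075} \cdot \left( 0 + \frac{1}{0.15} e^{-0.15 \cdot 0.5} \right) \\
&= 0.15\,e^{0.075} \cdot \frac{1}{0.15} e^{-0.075} = 1
\end{aligned}
\]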
This confirms that f (x) is a valid pdf, since the total area under the curve is 1.
The probability that the headway time is at most 5 seconds is
\[
\begin{aligned}
P(X \le 5) = \int_{0.5}^{5} f(x)\,dx &= \int_{0.5}^{5} 0.15\,e^{-0.15(x - 0.5)}\,dx \\
&= 0.15\,e^{0.075} \int_{0.5}^{5} e^{-0.15x}\,dx \\
&= 0.15\,e^{0.075} \cdot \left[ -\frac{1}{0.15} e^{-0.15x} \right]_{x=0.5}^{5} \\
&= e^{0.075} \cdot \left( -e^{-0.75} + e^{-0.075} \right) \approx 0.491
\end{aligned}
\]
Thus, the probability that the headway time is less than 5 seconds is ap-
proximately 0.491. This means that about 49.1% of the time, the headway is
less than 5 seconds.
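The two integrals in Example 5.2 can be cross-checked numerically. The sketch below assumes the decay rate is 0.15, which is the value implied by the factor $e^{0.075}$ and the result 0.491 in the computation above; the function names and the midpoint rule are illustrative choices, not the method used in the notes.

```python
from math import exp

RATE, SHIFT = 0.15, 0.5   # assumed: f(x) = 0.15 * exp(-0.15 * (x - 0.5)) for x >= 0.5

def pdf(x):
    return RATE * exp(-RATE * (x - SHIFT)) if x >= SHIFT else 0.0

def cdf(x):
    """Closed form of the integral of pdf from 0.5 to x."""
    return 1 - exp(-RATE * (x - SHIFT)) if x >= SHIFT else 0.0

def midpoint_integral(f, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

print(cdf(5))                            # closed form for P(X <= 5)
print(midpoint_integral(pdf, 0.5, 5))    # numerical check of the same area
```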
The cumulative distribution function (cdf) F(x) for a discrete random variable
X is obtained by summing the probability mass function p(y) over all possible
values y satisfying y ≤ x
\[ F(x) = P(X \le x) = \sum_{y \le x} p(y) \]
The cdf of a continuous random variable also gives the same probabilities
P (X ≤ x), but it is obtained by integrating the probability density function
f (y) from −∞ to x as
\[ F(x) = P(X \le x) = \int_{-\infty}^{x} f(y)\,dy \]
For each x, F (x) is the area under the density curve to the left of x.
The importance of the cumulative distribution function (cdf) here, just as
for discrete random variables, is that probabilities of various intervals can be
computed from a formula for or a table of F (x).
P (a ≤ X ≤ b) = F (b) − F (a)
Example 5.3. The following example illustrates how uncertainty in forces can
be modeled with a probability distribution. Here the load on a bridge is modeled
as a continuous random variable.
Suppose the probability density function (pdf) of the magnitude X of a
dynamic load on a bridge (in newtons) is given by
\[
f(x) =
\begin{cases}
\frac{1}{8} + \frac{3x}{8}, & 0 \le x \le 2 \\
0, & \text{otherwise}
\end{cases}
\]
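To find the cumulative distribution function F(x), we integrate the pdf from 0
to x
\[ F(x) = \int_0^x f(y)\,dy = \int_0^x \left( \frac{1}{8} + \frac{3y}{8} \right) dy = \frac{1}{8}x + \frac{3}{16}x^2 \]
Thus, the full expression for F(x) is
\[
F(x) =
\begin{cases}
0, & x < 0 \\
\frac{1}{8}x + \frac{3}{16}x^2, & 0 \le x \le 2 \\
1, & x > 2
\end{cases}
\]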
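\[
\begin{aligned}
P(1 \le X \le 1.5) &= F(1.5) - F(1) \\
&= \left[ \frac{1}{8} \cdot 1.5 + \frac{3}{16} \cdot (1.5)^2 \right] - \left[ \frac{1}{8} \cdot 1 + \frac{3}{16} \cdot (1)^2 \right] \\
&= \left( \frac{3}{16} + \frac{27}{64} \right) - \left( \frac{1}{8} + \frac{3}{16} \right) \\
&= \frac{39}{64} - \frac{5}{16} = \frac{39}{64} - \frac{20}{64} = \frac{19}{64} \approx 0.297
\end{aligned}
\]
\[
\begin{aligned}
P(X > 1) &= 1 - F(1) \\
&= 1 - \left( \frac{1}{8} \cdot 1 + \frac{3}{16} \cdot 1^2 \right) = 1 - \left( \frac{1}{8} + \frac{3}{16} \right) \\
&= 1 - \frac{5}{16} = \frac{11}{16} \approx 0.688
\end{aligned}
\]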
\[ F'(x) = f(x) \]
This means that for a continuous random variable, the CDF is the integral
of the PDF, and the PDF is the derivative of the CDF, if it exists.
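The interval probabilities of Example 5.3 follow directly from the cdf. The sketch below uses exact fractions so that the results can be compared with 19/64 and 11/16; the function name `F` mirrors the notation of the example, and the use of `fractions.Fraction` is an illustrative choice.

```python
from fractions import Fraction

def F(x):
    """cdf of the bridge load: F(x) = x/8 + 3x^2/16 on [0, 2]."""
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x > 2:
        return Fraction(1)
    return x / 8 + 3 * x**2 / 16

p_interval = F(Fraction(3, 2)) - F(1)   # P(1 <= X <= 1.5) = F(1.5) - F(1)
p_tail = 1 - F(1)                        # P(X > 1) = 1 - F(1)
print(p_interval, float(p_interval))
print(p_tail, float(p_tail))
```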
PDF and CDF Derivative for a Uniform Distribution When X has a
uniform distribution on the interval [A, B] denoted X ∼ U (A, B), F (x) = 0 for
x < A and F (x) = 1 for x > B since the slope is 0. Thus
Visual Representation:
[Figure: plot of F(x) against x, with 1 marked on the vertical axis and A, B on the horizontal axis.]
Example 5.4. * The pdf of X is given as
\[
f(x) =
\begin{cases}
3(1 - x^2), & 0 \le x \le 1 \\
0, & \text{otherwise}
\end{cases}
\]
Example 5.5. Let the probability density function f(x) be defined as follows
\[
f(x) =
\begin{cases}
2x, & 0 \le x < 0.5 \\
2(1 - x), & 0.5 \le x \le 1 \\
0, & \text{otherwise}
\end{cases}
\]
Z ∞ Z 0.5 Z 1
−∞ 0 0.5
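\[ = 2\left[ \left( \frac{1}{2} - \frac{1}{3} \right) - \left( \frac{(0.5)^2}{2} - \frac{(0.5)^3}{3} \right) \right] = 2\left[ \frac{1}{6} - \left( \frac{1}{8} - \frac{1}{24} \right) \right] = 2\left[ \frac{1}{6} - \frac{1}{12} \right] = 2 \cdot \frac{1}{12} = \frac{1}{6} \]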
Expected value of a function As in the discrete case, the expected value
of a function h(x) of random variable X with pdf f (x) can be computed as
\[ E[h(X)] = \mu_{h(X)} = \int_{-\infty}^{\infty} h(x) \cdot f(x)\,dx \]
Example 5.6. Two territorial animals, such as deer or hyenas, are randomly
dividing a stretch of riverbank to establish their respective feeding zones. Sup-
pose a boundary point is chosen at random (uniformly) along the riverbank to
determine the division between the two territories. Let X denote the propor-
tion of the riverbank controlled by Animal A. We assume that X is uniformly
distributed on the interval [0, 1]. Thus, the probability density function (pdf)
of X is
\[
f(x) =
\begin{cases}
1, & 0 \le x \le 1 \\
0, & \text{otherwise}
\end{cases}
\]
Let h(X) = max(X, 1−X) denote the proportion of the riverbank controlled
by the dominant animal i.e., the one that receives the larger share.
2. Compute the expected value E[h(X)], the expected share of the dominant
animal.
4. What does this model suggest about competition based purely on random
allocation?
This function transforms the share of Animal A to the share of the animal
that holds more territory. Because X + (1 − X) = 1, the maximum of the two
values must lie in the interval Range(h(X)) = [1/2, 1]. Specifically, if X = 0.5,
both animals control the same amount: h(X) = 0.5. As X → 0 or X → 1, one
animal approaches control of the entire riverbank h(X) → 1.
In simpler terms, the value of h(X) is the maximum of X and 1 − X. Since
X is the proportion of riverbank held by Animal A, 1 − X is the share held by
Animal B. Thus, h(X) gives the larger of the two shares – the amount controlled
by the animal with more territory. This reflects which animal is dominant in
terms of space.
To compute the expected value of h(X), we break the integration into two
intervals
When $0 \le x < \frac{1}{2}$, we have h(x) = 1 − x
When $\frac{1}{2} \le x \le 1$, we have h(x) = x
So
\[ E[h(X)] = \int_0^{1/2} (1 - x) \cdot 1\,dx + \int_{1/2}^{1} x \cdot 1\,dx \]
\[ \int_0^{1/2} (1 - x)\,dx = \left[ x - \frac{x^2}{2} \right]_0^{1/2} = \frac{1}{2} - \frac{1}{8} = \frac{3}{8} \]
\[ \int_{1/2}^{1} x\,dx = \left[ \frac{x^2}{2} \right]_{1/2}^{1} = \frac{1}{2} - \frac{1}{8} = \frac{3}{8} \]
Therefore,
\[ E[h(X)] = \frac{3}{8} + \frac{3}{8} = \frac{6}{8} = \frac{3}{4} \]
Since the total riverbank is of unit length, the less dominant animal gets, on average,
\[ 1 - E[h(X)] = 1 - \frac{3}{4} = \frac{1}{4} \]
This result shows that, although the division point is chosen at random
(giving each animal an equal chance of dominance), the expected advantage for
the dominant animal is significant. On average, the dominant animal controls
75% of the territory, while the less dominant one controls only 25%.
This model with transformation function h(X) reveals a key insight that
although the dividing point is chosen uniformly at random, which treats both
animals equally, the resulting division tends to be unequal. The expected value
E[h(X)] = 3/4 indicates that, on average, the dominant animal controls 75%
of the resource. This demonstrates that fairness in the mechanism (uniform
randomness) does not guarantee equality in the outcome. Such insights are
important in fields like ecology, economics, and game theory, where resources or
advantages are allocated through random or stochastic processes.
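The value E[h(X)] = 3/4 from Example 5.6 can also be approximated by simulation. The following is a minimal Monte Carlo sketch; the function name, the number of trials, and the fixed seed are illustrative assumptions.

```python
import random

def simulate_dominant_share(trials=200_000, seed=0):
    """Monte Carlo estimate of E[max(X, 1 - X)] for X ~ Uniform(0, 1)."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    total = 0.0
    for _ in range(trials):
        x = rng.random()                # share of Animal A
        total += max(x, 1 - x)          # share of the dominant animal
    return total / trials

estimate = simulate_dominant_share()
print(estimate)   # should be close to 0.75
```

By the Law of Large Numbers, the estimate should settle near 0.75 as the number of trials grows.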
5.3 Variance
The variance and standard deviation give quantitative measures of how much
spread there is in the distribution or population of x values. Again σ is roughly
the size of a typical deviation from µ.
The variance of a continuous random variable X with probability density
function f (x) and mean value µ is given by
\[ \sigma_X^2 = V[X] = \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\,dx = E[(X - \mu)^2] \]
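\[ \mu = \frac{a + b}{2} = \frac{50 + 150}{2} = 100 \]
\[ V[X] = \frac{(b - a)^2}{12} = \frac{(150 - 50)^2}{12} = \frac{100^2}{12} = \frac{10000}{12} = \frac{2500}{3} \approx 833.33 \]
The standard deviation is the square root of the variance
\[ \sigma_X = \sqrt{V[X]} = \sqrt{\frac{2500}{3}} \approx 28.87 \]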
This tells us that the average weekly demand is 100 units with most weekly
demand values lying within approximately ±29 units of the mean. The wide
standard deviation reflects high uncertainty in the weekly demand. This model
is useful for inventory planning under uncertainty.
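The mean and variance of the weekly demand can be verified by integrating the definitions numerically. The sketch below assumes the demand is Uniform(50, 150), as the computation above indicates; the function name and the midpoint rule are illustrative choices.

```python
def uniform_moments(a, b, steps=100_000):
    """Mean and variance of Uniform(a, b) by midpoint-rule integration."""
    h = (b - a) / steps
    density = 1 / (b - a)                       # f(x) on [a, b]
    xs = [a + (i + 0.5) * h for i in range(steps)]
    mean = sum(x * density * h for x in xs)     # integral of x f(x)
    var = sum((x - mean) ** 2 * density * h for x in xs)
    return mean, var

mean, var = uniform_moments(50, 150)
print(mean, var, var ** 0.5)
print((50 + 150) / 2, (150 - 50) ** 2 / 12)     # closed-form values
```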
Let h(X) = aX + b, where a and b are constants, and let µ = E[X] and
σ 2 = Var(X) be the mean and variance of a continuous random variable X
with pdf f (x). Then the expected value and variance of h(X) satisfy the same
properties as in the discrete case
\[ E[h(X)] = E[aX + b] = a\mu + b \]
\[ \mathrm{Var}(h(X)) = \mathrm{Var}(aX + b) = a^2 \sigma^2 \]
These results show that shifting a random variable by b affects the mean but
not the spread, while scaling it by a affects both the mean and the variance.
high probability near the mean (centre), symmetry because of equally likely
up/down pushes and rapid fall-off because the chance of large combined effects
Empirical evidence from diverse domains shows that when individual outcomes
are the result of a large number of small additive effects, the resulting
the concept of the normal distribution, also known as the Gaussian distribution.
would grow to infinity as |x − µ| increases. The integral of the PDF would then
diverge, meaning total probability could not equal 1. Hence, it would not be a
valid probability distribution. Thus, the negative exponent is essential for the
bell-shaped behavior of the Normal distribution.
The standard deviation σ in the denominator controls the spread (or width)
of the bell curve. If σ is small, the denominator is small, which means the term
$\frac{(x - \mu)^2}{2\sigma^2}$ is large. The exponential decays faster, so the curve becomes narrow.
If, however, σ is large, the denominator is big and the term $\frac{(x - \mu)^2}{2\sigma^2}$ becomes
smaller. The exponential decays slowly, so the curve becomes wider.
The term $(x - \mu)^2$ has units of (value)$^2$. Dividing by $\sigma^2$ (the variance)
makes the exponent
\[ -\frac{(x - \mu)^2}{2\sigma^2} \]
dimensionless, which is necessary since the exponential function $e^z$ only makes
sense when z is unitless.
Figure 6 plots the pdf for normal distribution with different pairs of mean
and variance values. Each density curve is symmetric about µ and is bell-
shaped. Changing the σ value stretches or compresses the curve horizontally
(Fig.6a) while changing the µ value shifts the density curve to one side or the
representing the region with the highest density of values. Outside this interval,
the curve is concave upward (tail region ∪), indicating that the slope of the curve
starts increasing (but the values remain very small). These are the inflection
points where the curve changes concavity from concave down (near the peak)
approximately 68% of the probability mass. The inflection points help identify
the boundary between the central concentration of data and the tails. The plot
in Fig.6c illustrates a normal distribution with mean µ = 0 and SD σ = 2 along
with its inflection points x = µ − σ = −2 and x = µ + σ = 2.
[Figure 6 panels: (a) same mean, different variances; (b) same variance, different means.]
\[ X \sim N(\mu, \sigma^2) \]
To compute the probability that X lies in the interval [a, b], we evaluate the
integral
\[ P(a \le X \le b) = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx \tag{1} \]
which represents the area under the normal curve between a and b. The function
$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$ does not have an antiderivative that can be expressed
in terms of elementary functions. That is, none of the standard integration
techniques (such as substitution, integration by parts, etc.) can be used to
obtain a closed-form expression for this integral.
Suppose X ∼ N (µ, σ 2 ), meaning that most of the values of the random vari-
able X lie around the mean µ, and they are spread out with standard deviation
σ. Now consider the transformation
\[ Z = \frac{X - \mu}{\sigma} \]
This transformation performs two operations: it shifts the distribution left or
right so that the mean becomes 0 and it scales the distribution so that the stan-
dard deviation becomes 1. In other words, we are rescaling the original data so
that it is measured in units of standard deviation. This standardization process
converts the original normal distribution into a standard normal distribution,
i.e.,
Z ∼ N (0, 1)
i.e., the mean for this new transformed distribution is 0 and the new standard
deviation is 1. Then,
\[ P(a \le X \le b) = P\left( \frac{a - \mu}{\sigma} \le Z \le \frac{b - \mu}{\sigma} \right) \]
This reformulates the original problem in terms of the standard normal dis-
tribution with Z ∼ N (0, 1). The probability density function (pdf) of the
standard normal random variable Z, which has mean 0 and variance 1, is given
by
\[ f(z; 0, 1) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}, \quad -\infty < z < \infty \]
The graph of f (z; 0, 1) is called the standard normal curve (or z-curve).
Its inflection points occur at z = −1 and z = 1, where the curve transitions
from concave down to concave up.
The cumulative distribution function (cdf) of Z, denoted Φ(z), gives
\[ \Phi(z) = P(Z \le z) = \int_{-\infty}^{z} f(y; 0, 1)\,dy = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy \]
The cumulative distribution function (cdf) of the standard normal, which gives
the area under the standard normal curve from −∞ to z, is also not easy to
evaluate in closed form. Instead we can use smart techniques that approximate
the area under the curve, like the application of the trapezoidal rule to
approximate the area under
over intervals using parabolas. These values are compiled into the standard
normal tables (also called Z-tables) so they can be quickly used without redoing
the integration every time. One can, in modern times, use Python, R, or C++
libraries to compute Φ(z) values very accurately which often are implemented
Example 5.8. Suppose exam scores in a university course are normally
distributed with mean µ = 70 and standard deviation σ = 10. We want to compute
the probability that a student scores less than 85, that is, P(X < 85).
We first standardize the variable
\[ Z = \frac{X - \mu}{\sigma} = \frac{85 - 70}{10} = 1.5 \]
Then
P (X < 85) = P (Z < 1.5) ≈ 0.9332
where the value for P(Z < 1.5) is looked up in the Z-table. So, approximately 93.32% of
students scored less than 85. Even though the scores follow a normal distribution
with µ = 70, σ = 10, we used the standard normal distribution to compute
the probability.
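The table lookup in Example 5.8 can be reproduced in Python. The sketch below computes Φ(z) in two ways: through the standard identity Φ(z) = ½(1 + erf(z/√2)) using `math.erf`, and through a trapezoidal-rule approximation of the area under the z-curve. The function names, the cut-off at −8 in place of −∞, and the number of steps are illustrative assumptions.

```python
from math import erf, exp, pi, sqrt

def phi_cdf(z):
    """Standard normal cdf via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def phi_trapezoid(z, lower=-8.0, steps=100_000):
    """Trapezoidal-rule approximation of the area under the z-curve.

    The lower limit -8 stands in for -infinity; the area below it is negligible.
    """
    pdf = lambda y: exp(-y * y / 2) / sqrt(2 * pi)
    h = (z - lower) / steps
    inner = sum(pdf(lower + i * h) for i in range(1, steps))
    return h * (0.5 * pdf(lower) + inner + 0.5 * pdf(z))

mu, sigma = 70, 10
z = (85 - mu) / sigma          # standardize the score 85
print(z, phi_cdf(z), phi_trapezoid(z))
```

Both values should agree with the Z-table entry 0.9332 to four decimal places.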
5.5 Moments of a Random Variable
Moments are numerical summaries that describe the shape of a probability
distribution.
1st moment:
E[X] → mean (center).
2nd moment:
tool that encodes all the moments (mean, variance, skewness, etc.) of a prob-
Taking expectation
\[ M_X(t) = E[e^{tX}] = 1 + E[X]\,t + \frac{E[X^2]}{2!}\,t^2 + \frac{E[X^3]}{3!}\,t^3 + \cdots \]
Thus, the coefficient of $t^n/n!$ gives the n-th moment $E[X^n]$. If we differentiate
the moment generating function $M_X(t)$ with respect to t and then evaluate the
derivatives at t = 0, we obtain
\[
\begin{aligned}
M_X'(0) &= E[X] && \text{(mean)} \\
M_X''(0) &= E[X^2] && \text{(you can compute variance)}, \\
M_X^{(3)}(0) &= E[X^3] && \text{(helps explain skewness)}, \\
&\;\;\vdots
\end{aligned}
\]
All higher-order moments (3rd, 4th, 5th, · · · ) together describe the entire shape
of a distribution. Practically, most work in probability/statistics stops at 4th
order (variance, skewness, kurtosis). Higher ones are sometimes used in ad-
vanced areas like financial risk modeling and signal processing. In short, the
MGF encodes all moments.
In general,
\[ M_X^{(n)}(0) = E[X^n]. \]
Uniqueness of the MGF Another important property of the moment gener-
ating function (MGF) is that it uniquely determines the distribution of a random
variable (provided the MGF exists in a neighborhood of t = 0). Formally,
\[ M_X(t) = M_Y(t) \text{ for all } t \text{ in some interval around } 0 \implies X \overset{d}{=} Y, \]
where $X \overset{d}{=} Y$ means that X and Y have the same distribution. The MGF is
a unique fingerprint of a distribution (when it exists). If two random variables
share the same MGF, they must have the same distribution.
The idea is that MGF encodes all the moments of the random variable
\[ M_X^{(n)}(0) = E[X^n]. \]
Since all moments are contained in it, and these moments uniquely describe the
distribution (for most well-behaved distributions), the MGF fully “captures”
the law of X. This is why MGFs are powerful: they do not just summarize –
they characterize. If you compute an MGF and recognize it as matching the
known MGF of a distribution, you can immediately identify the distribution of
the random variable. Examples:
λ
P (X = 1) = p, P (X = 0) = 1 − p.
for a discrete random variable. Since X takes only the values 0 and 1, in the
Bernoulli case
\[ M_X(t) = e^{t \cdot 0} P(X = 0) + e^{t \cdot 1} P(X = 1). \]
\[ M_X(t) = e^{0} (1 - p) + e^{t} \cdot p. \]
\[ M_X(t) = (1 - p) + p e^{t}. \]
The n-th moment is obtained by differentiating MX (t) n times and evaluat-
ing at t = 0.
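First derivative:
\[ M_X'(t) = p e^{t}, \quad M_X'(0) = p. \]
Hence,
\[ E[X] = p. \]
Second derivative:
\[ M_X''(t) = p e^{t}, \quad M_X''(0) = p. \]
Thus,
\[ E[X^2] = p. \]
Thus the variance is
\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = p - p^2 = p(1 - p). \]
The MGF of the Bernoulli distribution is
\[ M_X(t) = (1 - p) + p e^{t}, \]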
and it allows us to directly compute the mean E[X] = p and variance Var(X) =
p(1 − p).
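The rule "differentiate the MGF and evaluate at t = 0" can be illustrated numerically for the Bernoulli case. The sketch below approximates the first two derivatives with central finite differences instead of symbolic differentiation; the function names, the step size h, and the value p = 0.3 are illustrative assumptions.

```python
from math import exp

def mgf_bernoulli(t, p):
    """M_X(t) = (1 - p) + p e^t for X ~ Bernoulli(p)."""
    return (1 - p) + p * exp(t)

def first_two_moments(mgf, h=1e-4):
    """Approximate M'(0) and M''(0) with central finite differences."""
    m1 = (mgf(h) - mgf(-h)) / (2 * h)
    m2 = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h**2
    return m1, m2

p = 0.3   # illustrative value, not from the text
m1, m2 = first_two_moments(lambda t: mgf_bernoulli(t, p))
# E[X], E[X^2], variance from the moments, and the closed form p(1 - p)
print(m1, m2, m2 - m1**2, p * (1 - p))
```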
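\[ M_X(t) = E[e^{tX}] = \int_{-\infty}^{\infty} e^{tx} f(x)\,dx. \]
For X ∼ N(µ, σ²) this is
\[ M_X(t) = \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right) dx. \]
The combined exponent is
\[ tx - \frac{(x - \mu)^2}{2\sigma^2}. \]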
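Expand the quadratic term
\[ tx - \frac{1}{2\sigma^2}\left( x^2 - 2\mu x + \mu^2 \right). \]
This becomes
\[ -\frac{x^2}{2\sigma^2} + \frac{\mu}{\sigma^2}x - \frac{\mu^2}{2\sigma^2} + tx. \]
Group the terms involving x
\[ -\frac{x^2}{2\sigma^2} + \left( \frac{\mu}{\sigma^2} + t \right) x - \frac{\mu^2}{2\sigma^2}. \]
The quadratic in x can be written as
\[ -\frac{1}{2\sigma^2}\left[ x^2 - 2(\mu + \sigma^2 t)x + \mu^2 \right]. \]
Completing the square
\[ -\frac{1}{2\sigma^2}\left[ \left( x - (\mu + \sigma^2 t) \right)^2 - (\mu + \sigma^2 t)^2 + \mu^2 \right]. \]
Thus the exponent simplifies to
\[ -\frac{\left( x - (\mu + \sigma^2 t) \right)^2}{2\sigma^2} + \mu t + \frac{1}{2}\sigma^2 t^2. \]
So the MGF becomes
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right) \cdot \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{\left( x - (\mu + \sigma^2 t) \right)^2}{2\sigma^2} \right) dx. \]
The remaining integral is the total area under a normal density with mean
µ + σ²t and variance σ², which equals 1. Hence
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right), \quad t \in \mathbb{R}. \]
First derivative:
\[ M_X'(t) = (\mu + \sigma^2 t) \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]
Thus,
\[ E[X] = M_X'(0) = \mu. \]
Second derivative:
\[ M_X''(t) = \left[ \sigma^2 + (\mu + \sigma^2 t)^2 \right] \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]
Hence,
\[ E[X^2] = M_X''(0) = \sigma^2 + \mu^2. \]
Variance
\[ \mathrm{Var}(X) = E[X^2] - (E[X])^2 = \sigma^2 + \mu^2 - \mu^2 = \sigma^2. \]
Thus the MGF of the normal distribution is
\[ M_X(t) = \exp\left( \mu t + \tfrac{1}{2}\sigma^2 t^2 \right). \]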
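The closed form exp(µt + ½σ²t²) can be compared against a direct numerical evaluation of the defining integral E[e^{tX}]. This is only a sketch: the function name, the integration window of ±12 standard deviations around µ + σ²t, and the values µ = 1, σ = 2, t = 0.5 are illustrative assumptions.

```python
from math import exp, pi, sqrt

def normal_mgf_numeric(t, mu, sigma, steps=200_000, width=12.0):
    """Midpoint-rule value of the integral of e^{tx} f(x) for X ~ N(mu, sigma^2).

    The integrand peaks at mu + sigma^2 * t, so integrate +/- width sigmas around it.
    """
    centre = mu + sigma**2 * t
    a, b = centre - width * sigma, centre + width * sigma
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        x = a + (i + 0.5) * h
        total += exp(t * x - (x - mu) ** 2 / (2 * sigma**2))
    return total * h / sqrt(2 * pi * sigma**2)

mu, sigma, t = 1.0, 2.0, 0.5   # illustrative values, not from the text
closed_form = exp(mu * t + 0.5 * sigma**2 * t**2)
print(normal_mgf_numeric(t, mu, sigma), closed_form)
```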
6 Joint Probability Distributions
We have seen so far distributions that modeled the number of typos in a text-
book, the number of failed bulbs, heights of a population, temperature of the
engine when idle, etc. We also have scenarios where we have to analyse several
random variables simultaneously. For example the size of a RAM and the speed
of a CPU, rainfall and crop yield, technical and artistic performance, etc. Joint
distribution tells us how two variables behave together, not just individually. If
X and Y are random variables, then the pair (X, Y ) is a random vector. Its dis-
tribution is called the joint distribution of X and Y . Individual distributions
of X and Y are then called the marginal distributions.
Pierre-Simon Laplace, in the late 18th and early 19th centuries, reasoned
systematically about multiple variables and probability laws. Andrey Kolmogorov,
in the 1930s, defined random variables as measurable functions and is also
credited with giving a rigorous definition of joint distributions.
(days) or height and weight. In this section, we focus on two discrete random
Just like a single variable X has a distribution that tells us the probability
of X = x, the vector (X, Y) has a joint distribution that tells what is the
Two vectors are equal, i.e., (X, Y ) = (x, y) if and only if X = x and Y =
y. In probability, the logical “and” corresponds to the intersection of events.
Therefore, the joint probability mass function (PMF) is given by
\[ P(x, y) = P\{(X, Y) = (x, y)\} = P\{X = x \cap Y = y\} \]
Each distinct pair (x, y) represents a unique outcome of the random vector
(X, Y ). For example, if (X, Y ) = (2, 3), then it cannot also be (2, 4) or (1, 3).
So, different outcomes $(x_1, y_1)$ and $(x_2, y_2)$ such that $(x_1, y_1) \ne (x_2, y_2)$ cannot
occur simultaneously. That is, they are mutually exclusive.
Moreover, the collection of all such pairs (x, y) is exhaustive, meaning all
possible outcomes of the vector are covered.
Since the outcomes (x, y) are mutually exclusive and together cover all
the possibilities, their probabilities must add up to 1, just like in the case of a
single variable.
\[ \sum_{x} \sum_{y} P(x, y) = 1 \]
Let X and Y be two discrete random variables defined on the sample space
S of an experiment. The joint probability mass function p(x, y) is defined
for each pair of numbers (x, y) by
p(x, y) = P (X = x and Y = y)
Further, \( p(x, y) \ge 0 \) for all x, y and \( \sum_{x} \sum_{y} p(x, y) = 1 \).
Now let A be any particular set consisting of pairs of (x, y) values, for ex-
ample A = {(x, y) : x + y = 5} or A = {(x, y) : max(x, y) ≤ 3}. Then the
probability that the random pair (X, Y ) lies in the set A is given by summing
the joint probability mass function over all pairs in A
\[ P[(X, Y) \in A] = \sum_{(x, y) \in A} p(x, y). \]
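The summation rule above can be sketched in a few lines of Python. This is a minimal illustration, assuming the joint pmf of two fair dice (every ordered pair has probability 1/36); the helper name `prob` is hypothetical and chosen only for this sketch. It evaluates the two example sets A mentioned in the text.

```python
from fractions import Fraction
from itertools import product

# Assumed joint pmf: two fair dice, every ordered pair (x, y) has probability 1/36.
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def prob(event, pmf):
    """Sum p(x, y) over all pairs (x, y) for which event(x, y) is true."""
    return sum(p for (x, y), p in pmf.items() if event(x, y))

# A = {(x, y) : x + y = 5}
p_sum_is_5 = prob(lambda x, y: x + y == 5, joint_pmf)
# A = {(x, y) : max(x, y) <= 3}
p_max_at_most_3 = prob(lambda x, y: max(x, y) <= 3, joint_pmf)

print(p_sum_is_5, p_max_at_most_3)
```

Exact fractions are used instead of floats so that the result can be compared directly with a hand computation (4 favourable pairs out of 36 for the first set, 9 out of 36 for the second).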
Example 6.1. At a university, students applying for financial aid must choose
among different scholarship and loan options. Suppose the available options are
“Scholarship amounts” : Rs. 2000, Rs. 4000, Rs. 6000 and “Loan amounts”:
Rs. 0, Rs. 2000, Rs. 4000. A student is randomly selected from the program.
Define X to be the amount of scholarship received and Y to be the amount of
loan taken. The joint probability mass function p(x, y) = P (X = x, Y = y) is
given below
There are nine possible (X, Y ) pairs, such as (2000, 0), (2000, 2000), . . . , (6000, 4000).
The probability that a randomly selected student receives a Rs. 4000 schol-
We compute
To compute the probability that the loan amount is at least Rs. 2000
\[ P(Y \ge 2000) = \sum_{x} \left[ p(x, 2000) + p(x, 4000) \right] \]
Once we have the joint probability mass function p(x, y) of two discrete
random variables X and Y , we can find the distribution of just one variable
called the marginal distribution by summing over the other variable. Let X
and Y denote the number of statistics and mathematics courses, respectively,
currently being taken by a randomly selected statistics major. Suppose we want
the probability that X = 2, and that the only possible values of Y are 0, 1, and 2.
Then
\[ p_X(2) = P(X = 2) = p(2, 0) + p(2, 1) + p(2, 2). \]
That is, the joint pmf is summed over all pairs of the form (2, y).
For any possible value x of X, the marginal probability mass function
of X, denoted by pX (x), is given by
\[ p_X(x) = \sum_{y:\, p(x, y) > 0} p(x, y) \]
This means we hold x fixed and sum the joint pmf over all values of y such that
p(x, y) > 0. Similarly, the marginal probability mass function of Y is
\[ p_Y(y) = \sum_{x:\, p(x, y) > 0} p(x, y) \]
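As a rough sketch of these two formulas, the snippet below computes both marginal pmfs from a joint pmf stored as a dictionary. The probability values are hypothetical and invented for illustration (they are not the table of Example 6.1); only the scholarship and loan amounts are taken from the text. The function name `marginals` is likewise an assumption of this sketch.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint pmf (illustrative values only, NOT the table of Example 6.1).
# Keys are (scholarship x, loan y) in rupees.
joint_pmf = {
    (2000, 0): Fraction(10, 100), (2000, 2000): Fraction(10, 100), (2000, 4000): Fraction(5, 100),
    (4000, 0): Fraction(15, 100), (4000, 2000): Fraction(20, 100), (4000, 4000): Fraction(10, 100),
    (6000, 0): Fraction(15, 100), (6000, 2000): Fraction(10, 100), (6000, 4000): Fraction(5, 100),
}

def marginals(pmf):
    """Return (p_X, p_Y): sum the joint pmf over the other variable."""
    p_x, p_y = defaultdict(Fraction), defaultdict(Fraction)
    for (x, y), p in pmf.items():
        p_x[x] += p   # hold x fixed, accumulate over y (row total)
        p_y[y] += p   # hold y fixed, accumulate over x (column total)
    return dict(p_x), dict(p_y)

p_x, p_y = marginals(joint_pmf)
print(p_x)
print(p_y)
```

Each marginal sums to 1, which is a quick sanity check that the joint table itself is a valid pmf.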
The row totals in the above example give the marginal pmf of X and the column
Example 6.2. Possible X values are x = 2000, 4000, and 6000. Computing
and
\[ \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y) \, dx \, dy = 1. \]
Then, for any two-dimensional set A, the probability that the random pair
(X, Y ) lies in A is given by
\[ P[(X, Y) \in A] = \iint_{A} f(x, y) \, dx \, dy. \]
Imagine laying a thin, stretchable rubber sheet over a flat table representing
the xy-plane. Now, at each point (x, y), you push the sheet upward so that its
height matches f (x, y). The resulting shape of the sheet is the graph of f (x, y).
The total probability over a region A on the table is then the air space (volume)
trapped between the surface and the region A on the table.
These marginal densities describe the distribution of X and Y individually,
regardless of any dependence between them.
Example 6.3. Let the joint probability density function of continuous random
variables X and Y be
\[ f(x, y) = \begin{cases} 4xy, & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
We verify that this is a valid joint PDF by computing the total volume
\[ \int_0^1 \int_0^1 4xy \, dy \, dx = 4 \int_0^1 x \left( \int_0^1 y \, dy \right) dx = 4 \int_0^1 x \cdot \frac{1}{2} \, dx = 4 \cdot \frac{1}{2} \cdot \frac{1}{2} = 1 \]
0, otherwise
\[ f_Y(y) = \int_0^1 4xy \, dx = 4y \int_0^1 x \, dx = 4y \cdot \frac{1}{2} = 2y \]
\[ f_Y(y) = \begin{cases} 2y, & 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
The above describes a way to compute marginal distribution from joint dis-
tribution both for discrete and continuous random variables. But, in general,
the joint distribution cannot be computed from marginal distributions because
they carry no information about interrelations between random variables.
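The continuous computations of Example 6.3 can be checked numerically. The sketch below is only an illustration: it assumes a simple midpoint rule (the helper `midpoint_integral` is a hypothetical name, not a library function) and confirms that the joint pdf 4xy integrates to 1 and that the marginal of Y at y = 0.3 is close to 2(0.3) = 0.6.

```python
def joint_pdf(x, y):
    """f(x, y) = 4xy on the unit square, 0 elsewhere (Example 6.3)."""
    return 4.0 * x * y if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def midpoint_integral(g, a, b, n=200):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

def marginal_y(y):
    """f_Y(y): integrate the joint pdf over x for a fixed y."""
    return midpoint_integral(lambda x: joint_pdf(x, y), 0.0, 1.0)

# Total volume under the surface: integrate the marginal of Y over y.
total_volume = midpoint_integral(marginal_y, 0.0, 1.0)
print(total_volume, marginal_y(0.3))
```

The midpoint rule is exact (up to rounding) for integrands that are linear in the integration variable, which is why such a coarse grid is enough here; for other densities a finer grid or a library integrator would be the safer choice.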
Then the joint probability mass function satisfies:
P (X = x, Y = y) = P (X = x) · P (Y = y)
for all x, y ∈ {0, 1}, since the outcomes of the two coin tosses are independent.
Now consider the case where X is the loan amount and Y is the interest rate.
In practice, the interest rate often depends on the loan amount. For instance,
larger loans may be associated with lower rates for preferred customers or higher
rates due to increased credit risk. Therefore, the joint distribution of X and Y
does not factor into the product of their marginals
P (a ≤ X ≤ b, c ≤ Y ≤ d) = P (a ≤ X ≤ b) · P (c ≤ Y ≤ d)
This identity holds for both discrete and continuous random variables under the
assumption of independence.
The idea of a joint distribution for two variables generalizes naturally to
world systems with multiple interacting quantities. For example, to model the
(mean sea level pressure, geopotential height), wind speed and direction (at
\[ p(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n) \]
Let X and Y be two continuous random variables with joint probability
density function f (x, y) and marginal density function of X, denoted by fX (x).
Then, for any value of x such that fX (x) > 0, the conditional probability
density function of Y given that X = x is defined as
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)}, \quad -\infty < y < \infty \]
If X and Y are discrete random variables, replacing density functions with
probability mass functions in this definition yields the conditional probability
mass function of Y given X = x
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)}, \quad \text{for all } x \text{ with } P_X(x) > 0 \]
We can notice a close similarity between the definitions of conditional prob-
ability for events and conditional distributions for random variables.
For events A and B in a sample space S,
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)}, \quad \text{provided } P(A) > 0. \]
Here, A and B are subsets of the sample space. The idea is to restrict the
probability measure to the event A, and then renormalize.
\[ f_{Y|X}(y \mid x) = \frac{f(x, y)}{f_X(x)} \]
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)}, \quad \text{for } P_X(x) > 0. \]
Example 6.4. Roll a fair six-sided die. Let
\[ A = \{\text{even outcome}\}, \qquad B = \{\text{outcome} \ge 4\}. \]
Then
\[ P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(\{4, 6\})}{P(\{2, 4, 6\})} = \frac{2/6}{3/6} = \frac{2}{3}. \]
Roll two fair dice. Define
X = outcome of die 1, Y = outcome of die 2.
The joint pmf is
\[ P(X = x, Y = y) = \frac{1}{36}, \quad x, y \in \{1, \ldots, 6\}. \]
The conditional pmf of Y given X = 4 is
\[ P(Y = y \mid X = 4) = \frac{P(X = 4, Y = y)}{P(X = 4)} = \frac{1/36}{1/6} = \frac{1}{6}, \quad y = 1, \ldots, 6. \]
Thus, conditioning on X = 4, the distribution of Y remains uniform over
{1, 2, . . . , 6}.
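The two-dice computation can be reproduced mechanically: restrict the joint pmf to the pairs with X = 4 and renormalize by P(X = 4). The sketch below is a minimal illustration; the function name `conditional_pmf_y_given_x` is an assumption of this example.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of two fair dice, as in the example above.
joint_pmf = {(x, y): Fraction(1, 36) for x, y in product(range(1, 7), repeat=2)}

def conditional_pmf_y_given_x(pmf, x0):
    """Return {y: P(Y = y | X = x0)} by restricting to X = x0 and renormalizing."""
    p_x0 = sum(p for (x, _), p in pmf.items() if x == x0)   # marginal P(X = x0)
    if p_x0 == 0:
        raise ValueError("conditioning event has probability zero")
    return {y: p / p_x0 for (x, y), p in pmf.items() if x == x0}

cond = conditional_pmf_y_given_x(joint_pmf, 4)
print(cond)
```

The explicit check for a zero marginal mirrors the condition P_X(x) > 0 in the definition: the conditional pmf is simply undefined otherwise.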
Example 6.5. Let X denote the proportion of time a main runway at an airport
is busy, and let Y denote the proportion of time a secondary runway is busy.
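Suppose the joint pdf of X and Y is given by
\[ f(x, y) = \begin{cases} k(x + y^2), & 0 \le x \le 1,\ 0 \le y \le 1 \\ 0, & \text{otherwise} \end{cases} \]
First, we compute the constant k. For f (x, y) to be a valid pdf, we need
\[ \int_0^1 \int_0^1 k(x + y^2) \, dy \, dx = 1 \]
\[ \int_0^1 \int_0^1 k(x + y^2) \, dy \, dx = k \int_0^1 \left( x + \frac{1}{3} \right) dx = k \left( \frac{1}{2} + \frac{1}{3} \right) = k \cdot \frac{5}{6} \ \Rightarrow\ k = \frac{6}{5} \]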
Let us now aim to find the conditional pdf of Y | X = 0.6. First, we find the
marginal pdf of X
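\[ f_X(x) = \int_0^1 f(x, y) \, dy = \int_0^1 \frac{6}{5}(x + y^2) \, dy = \frac{6}{5}\left( x + \frac{1}{3} \right) = \frac{6}{5}x + \frac{2}{5} \]
\[ f_{Y|X}(y \mid 0.6) = \frac{f(0.6, y)}{f_X(0.6)} = \frac{\frac{6}{5}(0.6 + y^2)}{\frac{6}{5}\left(0.6 + \frac{1}{3}\right)} = \frac{0.6 + y^2}{0.933}, \quad 0 < y < 1 \]
\[ P(Y \le 0.5 \mid X = 0.6) = \int_0^{0.5} \frac{0.6 + y^2}{0.933} \, dy = \frac{1}{0.933}\left[ 0.6y + \frac{y^3}{3} \right]_0^{0.5} = \frac{1}{0.933}\left( 0.3 + \frac{0.125}{3} \right) = \frac{1}{0.933} \cdot 0.3417 \approx 0.366 \]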
Let us now find the conditional expectation E(Y | X = 0.6)
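\[ E(Y \mid X = 0.6) = \int_0^1 y \cdot \frac{0.6 + y^2}{0.933} \, dy = \frac{1}{0.933} \int_0^1 (0.6y + y^3) \, dy = \frac{1}{0.933}\left[ \frac{0.6y^2}{2} + \frac{y^4}{4} \right]_0^1 = \frac{1}{0.933}(0.3 + 0.25) = \frac{0.55}{0.933} \approx 0.589 \]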
The conditional distribution fY |X (y | 0.6) reflects how knowledge of the
main runway’s usage changes our prediction for the secondary runway. Given
that the main runway is busy 60% of the time (i.e., X = 0.6), the probability
that the secondary runway is busy at most 50% of the time (i.e., Y ≤ 0.5) is
approximately 36.6%. Also, the expected (i.e., average) proportion of time that
the secondary runway is busy is 58.9%.
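The numbers in Example 6.5 can be cross-checked numerically. The following sketch assumes the joint pdf f(x, y) = (6/5)(x + y^2) on the unit square, as obtained above, and uses a plain midpoint rule; `midpoint_integral` is a hypothetical helper written for this illustration, not a library routine.

```python
def joint_pdf(x, y):
    """f(x, y) = (6/5)(x + y^2) on the unit square, 0 elsewhere (Example 6.5)."""
    return 1.2 * (x + y ** 2) if 0 <= x <= 1 and 0 <= y <= 1 else 0.0

def midpoint_integral(g, a, b, n=10_000):
    """Approximate the integral of g over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))

x0 = 0.6
# Marginal density f_X(0.6): integrate the joint pdf over y.
f_x0 = midpoint_integral(lambda y: joint_pdf(x0, y), 0.0, 1.0)

def conditional_pdf(y):
    """f_{Y|X}(y | 0.6) = f(0.6, y) / f_X(0.6)."""
    return joint_pdf(x0, y) / f_x0

p_y_at_most_half = midpoint_integral(conditional_pdf, 0.0, 0.5)
expected_y = midpoint_integral(lambda y: y * conditional_pdf(y), 0.0, 1.0)
print(f_x0, p_y_at_most_half, expected_y)
```

Note that the constant 0.933 in the worked solution is f_X(0.6) divided by 6/5; the code keeps the factor 6/5 in both numerator and denominator, so it cancels in the same way.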
If two variables are independent, the marginal pmf or pdf in the denominator
will cancel the corresponding factor in the numerator. The conditional distribu-
tion is then identical to the corresponding marginal distribution. Let X and Y
be two independent random variables. Then, by the definition of independence
\[ P(X = x, Y = y) = P_X(x) \cdot P_Y(y) \]
\[ P(Y = y \mid X = x) = \frac{P(X = x, Y = y)}{P_X(x)} = \frac{P_X(x) \cdot P_Y(y)}{P_X(x)} = P_Y(y) \]
For a single random variable X, the expectation E[X] represents the average or
“center of mass” of its distribution on the real line. When we consider a random
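\[ E[(X, Y)] = \big( E[X], E[Y] \big). \]
\[ E[h(X, Y)] = \begin{cases} \displaystyle \sum_{x} \sum_{y} h(x, y) \, p(x, y), & \text{if } X, Y \text{ are discrete}, \\[2ex] \displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} h(x, y) \, f(x, y) \, dx \, dy, & \text{if } X, Y \text{ are continuous}. \end{cases} \]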
A special case is the expectation of the product of two random variables.
For two jointly distributed random variables X and Y , the expectation of their
product is defined as
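\[ E[XY] = \begin{cases} \displaystyle \sum_{x} \sum_{y} xy \, p(x, y), & \text{if } X, Y \text{ are discrete}, \\[2ex] \displaystyle \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} xy \, f(x, y) \, dx \, dy, & \text{if } X, Y \text{ are continuous}. \end{cases} \]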
Let X and Y denote the seat numbers of the first and second individuals,
respectively. Possible (X, Y ) pairs are
{(1, 2), (1, 3), . . . , (5, 4)}
\[ p(x, y) = \begin{cases} \dfrac{1}{20}, & x = 1, \ldots, 5;\ y = 1, \ldots, 5;\ x \ne y, \\ 0, & \text{otherwise}. \end{cases} \]
\[ h(X, Y) = |X - Y| - 1. \]
Values of h(x, y)
x\y 1 2 3 4 5
1 − 0 1 2 3
2 0 − 0 1 2
3 1 0 − 0 1
4 2 1 0 − 0
5 3 2 1 0 −
Expected value
\[ E[h(X, Y)] = \sum_{x} \sum_{y} h(x, y) \, p(x, y) = \frac{1}{20} \sum_{x} \sum_{y \ne x} h(x, y). \]
Computing the sum
\[ \sum_{x} \sum_{y \ne x} h(x, y) = 20. \]
Thus,
\[ E[h(X, Y)] = \frac{20}{20} = 1. \]
The expected number of seats separating any two friends is 1.
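The table of h(x, y) and the resulting expectation can be verified by brute-force enumeration. The sketch below assumes, as in the example, five seats and the 20 equally likely ordered pairs (x, y) with x different from y.

```python
from fractions import Fraction
from itertools import permutations

seats = range(1, 6)
pairs = list(permutations(seats, 2))      # all ordered pairs (x, y) with x != y
p = Fraction(1, len(pairs))               # each pair has probability 1/20

def h(x, y):
    """Number of seats separating the two individuals."""
    return abs(x - y) - 1

total_h = sum(h(x, y) for x, y in pairs)
expected_h = sum(h(x, y) * p for x, y in pairs)
print(len(pairs), total_h, expected_h)
```

Using `permutations` rather than a double loop makes the constraint x ≠ y explicit and avoids accidentally counting the diagonal of the table.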
* taken from Probability and Statistics for Engineering and the Science by Jay Devore, 9th
edition.
6.5 Covariance
Variance measures how much a single variable deviates from its mean. Likewise,
covariance measures how two variables deviate from their respective means to-
gether. When X is far from its mean (positive or negative deviation), does Y
also tend to be far from its mean in the same direction? Specifically, it cap-
tures whether large (or small) values of one variable tend to occur with large (or
small) values of the other. A positive covariance means that when X is above
its mean, Y also tends to be above its mean, and when X is below its mean,
Y also tends to be below its mean (refer Fig.8a). A negative covariance means
that when one variable is above its mean, the other tends to be below its mean,
indicating an inverse relationship (refer Fig.8b). If the relationship between X
and Y is inconsistent, i.e.,
for some x-values larger than mean µX , the corresponding y-values are
also larger than µY (positive contribution), while
for other x-values larger than µX , the corresponding y-values are smaller
than µY (negative contribution)
the overall covariance is close to zero because these positive and negative con-
tributions cancel out on average (refer Fig.8c). This does not imply that X and
Y are completely unrelated, only that there is no systematic linear tendency for
them to move together (or against each other) across the distribution. Thus,
Figure 8: (a) positive covariance; (b) negative covariance; (c) covariance near zero.
We can express the joint deviation from the respective means as a function
g(X − µX , Y − µY ). The product
\[ g(X - \mu_X, Y - \mu_Y) = (X - \mu_X)(Y - \mu_Y) \]
is the simplest such function, but there are other possible choices:
odd-power products, sign-magnitude decomposition, the normalized product
(correlation), and nonlinear kernels. The simple product is preferred because
it is linear in each deviation separately (bilinear), is easy to compute,
preserves sign and scale naturally, and connects directly to variance and
correlation.
Covariance measures the average effect of the deviations across all outcomes.
Formally, the covariance between two random variables X and Y is defined as
\[ \operatorname{Cov}(X, Y) = E\big[ (X - \mu_X)(Y - \mu_Y) \big], \]
Example 6.7. * The joint and marginal pmf’s for X (automobile policy de-
Proposition.
Cov(X, Y ) = E(XY ) − µX µY .
* taken from Probability and Statistics for Engineering and the Science by Jay Devore, 9th
edition.
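Proof. Start with the definition
\[ \operatorname{Cov}(X, Y) = E\big[ (X - \mu_X)(Y - \mu_Y) \big]. \]
Expanding,
\[ (X - \mu_X)(Y - \mu_Y) = XY - \mu_X Y - \mu_Y X + \mu_X \mu_Y. \]
Taking expectations term by term, and using E(X) = µX and E(Y ) = µY ,
\[ \operatorname{Cov}(X, Y) = E(XY) - \mu_X \mu_Y - \mu_Y \mu_X + \mu_X \mu_Y = E(XY) - \mu_X \mu_Y. \]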
Example 6.8. Suppose we collect data on students’ study hours X and exam
scores Y for three students
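Computing means
\[ \mu_X = \frac{2 + 4 + 6}{3} = 4, \qquad \mu_Y = \frac{65 + 75 + 85}{3} = 75. \]
Computing E(XY ), with each of the three students equally likely,
\[ E(XY) = \frac{1}{3}(2 \cdot 65 + 4 \cdot 75 + 6 \cdot 85) = \frac{1}{3} \cdot 940 \approx 313.33, \]
so that
\[ \operatorname{Cov}(X, Y) = E(XY) - \mu_X \mu_Y \approx 313.33 - 4 \cdot 75 = 13.33. \]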
The positive covariance indicates that as study hours increase, exam scores
also tend to increase.
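As a small check of Example 6.8, the sketch below computes the covariance in two ways: from the definition and from the shortcut E(XY) − µXµY. The data pairs (2, 65), (4, 75), (6, 85) are read off from the mean and E(XY) computations in the example; treating the three students as equally likely is the same assumption the example makes.

```python
from fractions import Fraction

hours = [2, 4, 6]        # study hours X
scores = [65, 75, 85]    # exam scores Y
n = len(hours)

mu_x = Fraction(sum(hours), n)
mu_y = Fraction(sum(scores), n)
e_xy = Fraction(sum(x * y for x, y in zip(hours, scores)), n)

# Shortcut formula: Cov(X, Y) = E(XY) - mu_X * mu_Y
cov_shortcut = e_xy - mu_x * mu_y
# Definition: average of the products of deviations from the means
cov_definition = sum((x - mu_x) * (y - mu_y) for x, y in zip(hours, scores)) / n

print(mu_x, mu_y, e_xy, cov_shortcut, cov_definition)
```

Both routes give 40/3, about 13.33, which agrees with the positive sign discussed in the example.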
Y = aX + b,
Y − µY ∝ (X − µX ).
This implies the constant term b just shifts the distribution up or down, so
it cancels out when subtracting the mean. The slope a scales the deviations of
X to produce the deviations of Y . Hence, in a perfectly linear relationship, the
deviations of Y are proportional to the deviations of X. This is why covariance
captures linear dependence exactly.
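\[ \begin{aligned} \operatorname{Cov}(X, Y) &= E\big[ (X - \mu_X)(Y - \mu_Y) \big] \\ &= E\big[ (X - \mu_X) \times a(X - \mu_X) \big] \\ &= a\, E\big[ (X - \mu_X)^2 \big] \\ &= a \operatorname{Var}(X). \end{aligned} \]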
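When X and Y are linearly dependent, the magnitude of the covariance is
|a| Var(X), so it is set by the spread of X about its mean µX together with the
size of the slope, and the sign is decided by a as follows
if a > 0, covariance is positive,
if a < 0, covariance is negative,
if a = 0, covariance is zero.
In particular, whenever a ≠ 0 and Var(X) > 0,
\[ \operatorname{Cov}(X, Y) = a \operatorname{Var}(X) \ne 0. \]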
Y = g(X)
where
Y = X 2 , X 3 , sin(X), eX , or some nonlinear function
Then,
\[ g(X) - E[g(X)] \not\propto (X - \mu_X) \]
It can take positive and negative values that cancel each other out when multiplied
by (X − µX ). That is,
\[ (X - \mu_X)\big( g(X) - E[g(X)] \big) \]
Example Consider X ∼ U (−1, 1) and let Y = X 2 .
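\[ \mu_X = E[X] = \int_{-1}^{1} x \cdot \frac{1}{2} \, dx = \frac{1}{2}\left[ \frac{x^2}{2} \right]_{-1}^{1} = 0. \]
\[ \mu_Y = E[X^2] = \int_{-1}^{1} x^2 \cdot \frac{1}{2} \, dx = \frac{1}{2}\left[ \frac{x^3}{3} \right]_{-1}^{1} = \frac{1}{3}. \]
\[ \operatorname{Cov}(X, Y) = E\big[ (X - \mu_X)(Y - \mu_Y) \big]. \]
Since µX = 0 and Y = X 2 , this equals E[X(X 2 − 1/3)] = E[X 3 ] − (1/3)E[X], and
both terms vanish because the density is symmetric about 0. Hence
\[ \operatorname{Cov}(X, Y) = 0. \]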
ships can produce zero or misleading covariance, because positive and negative
surement.
Thus, the same relationship between height and weight produces very dif-
ferent covariance values depending only on the units of measurement.
This example illustrates why raw covariance is not a reliable measure for
comparing the strength of association between variables. To eliminate this de-
fect, we define the correlation coefficient, which rescales the covariance by
the product of the standard deviations, resulting in a unit-free measure. If two
variables move perfectly together, their deviations from their means are aligned
exactly. In that case, the covariance reaches its largest possible value which is
σX σY . If they move perfectly opposite, the covariance reaches −σX σY . Thus
the product of the standard deviations σX σY is not literally the covariance, but
it represents the maximum possible magnitude of co-movement that two vari-
ables can have. So dividing by σX σY scales the actual covariance relative to the
perfect alignment giving a correlation between -1 and 1. Thus, the correlation
coefficient is a unit-free measure between -1 and 1.
The correlation coefficient of X and Y , denoted by Corr(X, Y ), ρX,Y , or
just ρ, is defined as
\[ \rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \]
Proposition. 1. If a and c are either both positive or both negative,
Corr(aX + b, cY + d) = Corr(X, Y )
−1 ≤ ρ ≤ 1
\[ \operatorname{Cov}(aX + b, cY + d) = ac \operatorname{Cov}(X, Y), \]
\[ \operatorname{Corr}(U, V) = \frac{\operatorname{Cov}(U, V)}{\sigma_U \sigma_V}. \]
Hence,
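\[ \begin{aligned} \operatorname{Corr}(aX + b, cY + d) &= \frac{\operatorname{Cov}(aX + b, cY + d)}{\sigma_{aX+b}\, \sigma_{cY+d}} \\ &= \frac{ac \operatorname{Cov}(X, Y)}{|a| \sigma_X \, |c| \sigma_Y} \\ &= \frac{ac}{|a||c|} \cdot \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \\ &= \operatorname{sign}(ac) \operatorname{Corr}(X, Y). \end{aligned} \]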
If a and c are either both positive or both negative, then sign(ac) = +1. There-
fore,
Corr(aX + b, cY + d) = Corr(X, Y ).
This statement says that linear transformations with ac > 0 (for example, rescaling
both variables by positive factors) don’t change ρ. If X or Y is transformed
linearly in this way, the correlation remains the same.
This matters because changing units (e.g., temperature in °C to °F) changes
the covariance but leaves the correlation unchanged, since ρ measures only
relative co-movement.
And if a and c have opposite signs, the correlation flips sign
Corr(aX + b, cY + d) = − Corr(X, Y ).
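The effect of a change of units can be seen numerically. The sketch below uses hypothetical height and weight values, invented purely for illustration, and hand-written helpers (`covariance` and `correlation` are names assumed for this example). It rescales centimetres to metres and kilograms to grams, then negates one variable to show the sign flip.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance rescaled by the product of the standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical heights and weights, invented for illustration only.
height_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
weight_kg = [52.0, 58.0, 69.0, 77.0, 84.0]

# Change of units: a = 1/100 (cm to m) and c = 1000 (kg to g), both positive.
height_m = [h / 100 for h in height_cm]
weight_g = [w * 1000 for w in weight_kg]

# Opposite signs: negate one variable (a = -1, c = 1).
neg_height = [-h for h in height_cm]

print(covariance(height_cm, weight_kg), covariance(height_m, weight_g))
print(correlation(height_cm, weight_kg), correlation(height_m, weight_g))
print(correlation(neg_height, weight_kg))
```

The covariance is multiplied by ac = 10 under the change of units, while the correlation is unchanged; negating one variable leaves the magnitude of the correlation intact and reverses its sign.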
\[ \rho = \operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y}, \]
where σX , σY > 0 are the standard deviations of X and Y .
Using Cauchy–Schwarz inequality, for any square-integrable random vari-
ables U and V , we have
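\[ |E[UV]| \le \sqrt{E[U^2] \, E[V^2]}. \]
Taking U = X − µX and V = Y − µY , we have E[U V ] = Cov(X, Y ) and
\[ E[(X - \mu_X)^2] = \sigma_X^2, \qquad E[(Y - \mu_Y)^2] = \sigma_Y^2. \]
Hence,
\[ |\operatorname{Cov}(X, Y)| \le \sigma_X \sigma_Y. \]
\[ -\sigma_X \sigma_Y \le \operatorname{Cov}(X, Y) \le \sigma_X \sigma_Y. \]
\[ -1 \le \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \le 1. \]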
That is,
−1 ≤ ρ ≤ 1.
2. ρ = 1 or ρ = −1 if and only if
Y = aX + b
The first statement says that if X and Y are independent, their correlation
is ρ = 0. However, the opposite is not necessarily true, i.e., ρ = 0 does not
imply that X and Y are independent. There could still be a relationship, but
it is nonlinear. Correlation measures only linear relationships. The second
statement describes a perfect linear relationship
\[ \rho = 1 \ \text{or} \ \rho = -1 \iff Y = aX + b, \quad a \ne 0. \]
This shows that correlation captures how close the relationship is to a straight
line. On the other hand
if |ρ| < 1, there may still be a strong connection between X and Y which
is not linear.
Even if |ρ| is close to 1, the relationship might be nonlinear, but it can be
well approximated by a straight line.
In short, correlation measures linear association and not all types of relation-
ships. Zero correlation does not mean no relationship; it only indicates no linear
relationship. Maximum correlation (|ρ| = 1) occurs only when the relationship
is perfectly linear.
Example 6.10. We are given two discrete random variables X and Y with the
joint pmf
\[ p_{X,Y}(x, y) = \begin{cases} 0.25, & \text{for } (x, y) \in \{(-4, 1), (-2, -2), (2, 2), (4, -1)\} \\ 0, & \text{otherwise} \end{cases} \]
For each value of X, there is exactly one corresponding Y , and vice versa. This
implies X and Y are perfectly dependent. Knowing X tells us exactly what Y
is and vice versa. Fig.9 shows a plot of these points.
Figure 9: A plot
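The marginal means are µX = µY = 0 because the points are symmetric around
0 (positive and negative values cancel). The expected value of the product is
\[ E(XY) = 0.25 \, [(-4)(1) + (-2)(-2) + (2)(2) + (4)(-1)] = 0.25 \, (-4 + 4 + 4 - 4) = 0, \]
so
\[ \operatorname{Cov}(X, Y) = E(XY) - \mu_X \mu_Y = 0 - 0 \cdot 0 = 0 \]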
The correlation coefficient
\[ \rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = 0 \]
Even though X and Y are perfectly dependent, the pattern of dependence
is nonlinear as reflected in Fig.9. So, correlation doesn’t detect it.
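Example 6.10 is small enough to verify exactly. The sketch below stores the four-point joint pmf in a dictionary and checks two things: the covariance is zero, yet each value of X occurs with exactly one value of Y (and vice versa), so the variables are fully dependent.

```python
from fractions import Fraction

# Joint pmf of Example 6.10: four equally likely points.
pmf = {
    (-4, 1): Fraction(1, 4),
    (-2, -2): Fraction(1, 4),
    (2, 2): Fraction(1, 4),
    (4, -1): Fraction(1, 4),
}

mu_x = sum(x * p for (x, _), p in pmf.items())
mu_y = sum(y * p for (_, y), p in pmf.items())
e_xy = sum(x * y * p for (x, y), p in pmf.items())
cov = e_xy - mu_x * mu_y

# Perfect dependence: every x appears with exactly one y, and every y with one x.
x_determines_y = len({x for x, _ in pmf}) == len(pmf)
y_determines_x = len({y for _, y in pmf}) == len(pmf)

print(mu_x, mu_y, e_xy, cov, x_determines_y, y_determines_x)
```

Since the covariance is exactly zero and both standard deviations are positive, the correlation is zero as well, even though knowing one variable pins down the other.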
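\[ f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - \rho^2}} \times \exp\left\{ -\frac{1}{2(1 - \rho^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \]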
their correlation ρ. When you integrate out one variable, the marginals are
X ∼ N (µX , σX² ) and Y ∼ N (µY , σY² ). That means each variable, considered alone,
still follows a normal distribution. The middle term in the exponent is the
interaction term.
Let us understand the shape fully. Contours are lines of constant density,
meaning
fX,Y (x, y) = constant.
Once the five parameters are fixed, the only varying term is the exponential
argument of the bivariate normal pdf
\[ Q(x, y) = \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2}. \]
Setting Q(x, y) = k for some constant k > 0 gives the contour. The equation
\[ \frac{(x - \mu_X)^2}{\sigma_X^2} - \frac{2\rho (x - \mu_X)(y - \mu_Y)}{\sigma_X \sigma_Y} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} = k \]
We can check
\[ B^2 - 4AC = \frac{4\rho^2}{\sigma_X^2 \sigma_Y^2} - \frac{4}{\sigma_X^2 \sigma_Y^2} = \frac{4(\rho^2 - 1)}{\sigma_X^2 \sigma_Y^2} < 0 \]
\[ (x - \mu_X)^2 + (y - \mu_Y)^2 = k\sigma^2, \]
which is a circle. Intuitively, ρ = 0 means that the pull is roughly the same from both
the x and y axes, making the ellipse axis-aligned. Additionally, if the variances
are the same, then the individual distances from the mean are the same, which gives the
contour a circular shape. Correlated or differently scaled variables stretch the
contours along certain directions, resulting in an elliptical contour. Correlation
controls the rotation (which way the ellipse points) while the variances control
how elongated the ellipse is. If σX > σY , the ellipse will be stretched more
along the x-axis direction, then tilted according to ρ. By “tilt” we mean the
major axis of the ellipse is rotated by some angle with respect to the x-axis.
Let us analyse each case geometrically:
1. Unrelated variables with equal variances (ρ = 0 and σX = σY ): The joint
density surface is a symmetric bell aligned with the axes (no tilt) as shown
in Fig.11a and the contours are perfect circles as shown in Fig.11b.
Figure 11: (a) symmetric bell with no tilt; (b) contours formed are perfect circles.
[Figure: (a) ρ = 0.4; (b) ρ = 0.9.]
[Figure: (a) tilted bell, ρ > 0 and σX ≠ σY ; (b) contours formed are ellipses.]
[Figure: (a) tilted bell, ρ < 0 and σX ≠ σY ; (b) contours formed are ellipses.]
6. Perfect positive correlation with equal variances (ρ = 1 and σX = σY ):
Since ρ = 1 and the variances are equal, the major axis has slope 1 (it is the
line y = x when the means are equal) and the distribution collapses onto this
axis. It is no longer an ellipse but just a line in the 2D XY plane. The equation
for the joint pdf no longer holds; instead
\[ f_{X,Y}(x, y) = f_X(x) \, \delta\!\left( y - \left[ \mu_Y + \frac{\sigma_Y}{\sigma_X}(x - \mu_X) \right] \right) \]
where the second factor is the Dirac delta function. If we project onto the
X-axis we can see the spread of X along X, a 1D Gaussian. If we project
onto the Y -axis we can see the spread of Y along Y , a 1D Gaussian.
Let us now understand the relation between the two variables when ρ = 0.
If ρ = 0, X and Y are fully independent (their joint pdf factors into the product
of two normals).
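\[ f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y \sqrt{1 - 0^2}} \exp\left\{ -\frac{1}{2(1 - 0^2)} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} - 0 + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \]
\[ f(x, y) = \frac{1}{2\pi \sigma_X \sigma_Y} \exp\left\{ -\frac{1}{2} \left[ \frac{(x - \mu_X)^2}{\sigma_X^2} + \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right] \right\} \]
\[ f(x, y) = \left[ \frac{1}{\sqrt{2\pi}\, \sigma_X} \exp\left( -\frac{1}{2} \frac{(x - \mu_X)^2}{\sigma_X^2} \right) \right] \left[ \frac{1}{\sqrt{2\pi}\, \sigma_Y} \exp\left( -\frac{1}{2} \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right) \right] \]
Here,
\[ f_X(x) = \frac{1}{\sqrt{2\pi}\, \sigma_X} \exp\left( -\frac{1}{2} \frac{(x - \mu_X)^2}{\sigma_X^2} \right) \quad \text{is the pdf of } X \sim N(\mu_X, \sigma_X^2), \]
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\, \sigma_Y} \exp\left( -\frac{1}{2} \frac{(y - \mu_Y)^2}{\sigma_Y^2} \right) \quad \text{is the pdf of } Y \sim N(\mu_Y, \sigma_Y^2). \]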
In the case of the bivariate normal distribution when ρ = 0, the joint pdf
factorizes as f (x, y) = fX (x) fY (y), which is exactly the condition for true
independence. It even rules out nonlinear dependence.
This is linear in x. If x > µX (X is above its average), then the expected
value of Y also shifts upward (if ρ > 0) or downward (if ρ < 0). The line of
conditional means is the regression line of Y on X. If ρ = 0, knowing X doesn’t
change the expectation of Y .
The conditional variance is
\[ \sigma^2_{Y|x} = (1 - \rho^2)\, \sigma_Y^2. \]
The formula shows that the conditional variance does not depend on x.
The higher |ρ|, the smaller the variance. If ρ = 0, the conditional variance is
simply σY² . If |ρ| → 1, the variance collapses toward 0 (almost perfect prediction
of Y from X). When ρ = 1 and we know X, there is no uncertainty left in Y ,
i.e., Y is completely determined by X. The curve at X = x is no longer a bell
curve; it collapses to a single point along Y .
In short, if correlation is strong (|ρ| ≈ 1), once we know X, Y is almost
determined, i.e., very little conditional variance. If on the other hand correlation
is weak (ρ ≈ 0), knowing X doesn’t help much, i.e., conditional variance is
basically the same as the marginal variance of Y .
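The behaviour described above can be tabulated with a short helper. The sketch below assumes the standard bivariate normal result that the conditional mean of Y given X = x is µY + ρ(σY/σX)(x − µX), together with the conditional variance (1 − ρ²)σY² stated in the text. The function name `conditional_normal` and the parameter values are hypothetical, chosen only for illustration.

```python
def conditional_normal(x, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Return (mean, variance) of Y given X = x for a bivariate normal pair.

    Assumes the standard formulas:
      mean     = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
      variance = (1 - rho**2) * sigma_y**2
    """
    mean = mu_y + rho * (sigma_y / sigma_x) * (x - mu_x)
    variance = (1.0 - rho ** 2) * sigma_y ** 2
    return mean, variance

# Hypothetical parameters, invented for illustration only.
params = dict(mu_x=170.0, mu_y=70.0, sigma_x=10.0, sigma_y=12.0)

for rho in (0.0, 0.5, 0.8, 0.99):
    mean, variance = conditional_normal(180.0, rho=rho, **params)
    print(rho, mean, variance)
```

With ρ = 0 the conditional mean stays at µY and the variance equals σY²; as ρ approaches 1 the mean moves along the regression line and the variance shrinks toward zero, in line with the discussion above.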
is quite complicated, and the only way to write it compactly is to employ matrix
single variable given values of the other variables is normal, the joint marginal
distribution of any pair of variables is bivariate normal, and the joint marginal
normal.
7 Bridging Probability and Statistics
Data collection is a crucially important step in statistics. In real-world statistics,
we often collect data from samples drawn from a population, for example, by
measuring the fuel efficiency of randomly chosen cars. Let us say a first sample
of 3 cars gives x1 = 30.7, x2 = 29.4, x3 = 31.1 and a second (different) sample
gives x1 = 28.8, x2 = 30.0, x3 = 32.5. The numbers change from sample to
sample because each sample is chosen randomly. This means that, before we collect
any data, the values x1 , x2 , . . . , xn are unknown and random in nature. So we
treat them as random variables, denoted
X1 , X2 , . . . , Xn .
Since the sample values vary, any function we compute from the sample like
the sample mean X̄, the sample standard deviation s, or other statistics, will
also vary from sample to sample. That is why we think of the sample mean
X̄ as a random variable before we observe any data. This is the foundation
for concepts such as sampling distributions, standard error, and ultimately, the
Central Limit Theorem.
More formally, a population consists of all units of interest. Any numeri-
cal characteristic of a population is a parameter. A parameter describes the
population. A sample consists of observed units collected from the population.
It is used to make statements about the population. Any function of a sample
ticular statistic will result. Therefore, a statistic is a random variable and will
calculated or observed value of the statistic. Thus the sample mean, regarded as
the sample standard deviation thought of as a statistic, and its computed value
is s. Think of a population as a big jar of chocolates, where each chocolate has some
weight. A parameter can be the true average weight of all chocolates in the jar
(unknown unless we weigh every chocolate). A statistic can be the average weight
of a handful of chocolates we randomly pick (it can change if we pick a different
handful).
Every statistic we compute from a sample like the sample mean X̄ or the
sample variance S 2 is based on random data. Since the data are random, the
statistic itself is also a random variable, and hence it has a probability distri-
bution. Suppose that n = 2 components are randomly selected and the number
of breakdowns while under warranty is determined for each one. Let X1 be the
number of breakdowns for the first component and X2 be for the second. Then
the sample mean is
\[ \bar{X} = \frac{X_1 + X_2}{2} \]
Possible values for X̄ include 0, if X1 = X2 = 0; 0.5, if X1 = 0, X2 = 1
or vice versa; and 1, 1.5, . . ., depending on the values of X1 and X2 . Thus, the
probability distribution of X̄ includes
From this distribution, other probabilities such as P (1 ≤ X̄ ≤ 3) or P (X̄ ≥ 2.5)
can be computed.
Suppose each of the two observations, X1 and X2 , is randomly drawn from
the set {40, 45, 50}. Then, the sample space for ordered pairs (X1 , X2 ) consists
of 3 × 3 = 9 combinations
(Please see below a detailed explanation on this formula.) Let us compute a few
cases starting with (40, 40).
\[ \bar{X} = \frac{40 + 40}{2} = 40, \qquad S^2 = (40 - 40)^2 + (40 - 40)^2 = 0 + 0 = 0 \]
For (40, 50)
\[ \bar{X} = \frac{40 + 50}{2} = 45, \qquad S^2 = (40 - 45)^2 + (50 - 45)^2 = 25 + 25 = 50 \]
40 + 45
2
Take the sample (1, 2) with sample mean
\[ \bar{X} = \frac{1 + 2}{2} = 1.5. \]
The sample variance (using n in the denominator) is
\[ s^2 = \frac{1}{2} \sum_{i=1}^{2} (X_i - \bar{X})^2 = \frac{(1 - 1.5)^2 + (2 - 1.5)^2}{2} = \frac{0.25 + 0.25}{2} = 0.25. \]
Sample   Sample mean   Sample variance
(1,5)    3             4
(2,4)    3             1
(3,5)    4             1
Table 3: Sample means and sample variances for all 2-element samples
Let us prove this formally. We want to compute the expected value of the
sample variance when dividing by n, i.e.,
\[ S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2, \]
We can write
\[ X_i - \bar{X} = (X_i - \mu) - (\bar{X} - \mu), \]
so
\[ (X_i - \bar{X})^2 = (X_i - \mu)^2 - 2(X_i - \mu)(\bar{X} - \mu) + (\bar{X} - \mu)^2. \]
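\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) + n(\bar{X} - \mu)^2. \]
But note
\[ \sum_{i=1}^{n} (X_i - \mu) = \sum_{i=1}^{n} X_i - n\mu = n\bar{X} - n\mu = n(\bar{X} - \mu), \]
so
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \cdot n(\bar{X} - \mu) + n(\bar{X} - \mu)^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2. \]
Divide by n
\[ \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = S_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2 - (\bar{X} - \mu)^2. \]
Taking expectations
\[ E[S_n^2] = \frac{1}{n} \sum_{i=1}^{n} E[(X_i - \mu)^2] - E[(\bar{X} - \mu)^2]. \]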
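Since \( E[(X_i - \mu)^2] = \sigma^2 \) and \( E[(\bar{X} - \mu)^2] = \operatorname{Var}(\bar{X}) = \sigma^2 / n \), we get
\[ E[S_n^2] = \frac{1}{n} \cdot n\sigma^2 - \frac{\sigma^2}{n} = \sigma^2 - \frac{\sigma^2}{n} = \frac{n - 1}{n} \sigma^2. \]
Thus
\[ E[S_n^2] = \frac{n - 1}{n} \sigma^2 < \sigma^2, \quad \text{since } n > 0 \]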
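Take expectation
\[ E[S^2] = \frac{1}{n - 1} \left[ \sum_{i=1}^{n} E[(X_i - \mu)^2] - n\, E[(\bar{X} - \mu)^2] \right] \]
Each term is
\[ E[(X_i - \mu)^2] = \sigma^2, \qquad E[(\bar{X} - \mu)^2] = \operatorname{Var}(\bar{X}) = \frac{\sigma^2}{n}. \]
So,
\[ E[S^2] = \frac{1}{n - 1} \left( n\sigma^2 - n \cdot \frac{\sigma^2}{n} \right) = \frac{1}{n - 1} (n\sigma^2 - \sigma^2). \]
\[ E[S^2] = \frac{n - 1}{n - 1} \sigma^2 = \sigma^2. \]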
Thus, the sample variance S 2 is an unbiased estimator of the population vari-
ance. The sample variance has two interpretations
1. Descriptive. It measures the spread of the observed data around the sam-
ple mean
\[ S^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 \]
variance σ 2 . By construction,
\[ S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad E[S^2] = \sigma^2, \]
so S 2 is an unbiased estimator of σ 2 .
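Both expectations can be confirmed by exhaustive enumeration on a tiny population. The sketch below assumes the population {40, 45, 50} used earlier and samples of size n = 2 drawn with replacement, so that all 9 ordered samples are equally likely; the helper name `sum_sq_dev` is hypothetical.

```python
from fractions import Fraction
from itertools import product

population = [40, 45, 50]
n = 2

# Population mean and variance (each value equally likely).
mu = Fraction(sum(population), len(population))
sigma2 = sum((v - mu) ** 2 for v in population) / len(population)

# All ordered samples of size n drawn with replacement: 3 x 3 = 9 of them.
samples = list(product(population, repeat=n))

def sum_sq_dev(sample):
    """Sum of squared deviations of the sample values from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((v - xbar) ** 2 for v in sample)

# Average each estimator over all equally likely samples.
e_divide_by_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)
e_divide_by_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)

print(sigma2, e_divide_by_n, e_divide_by_n_minus_1)
```

Dividing by n gives an average of (n − 1)/n times σ², here half of σ², while dividing by n − 1 recovers σ² exactly, matching the two derivations above.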
1. Each Xi is independent of the others, i.e., the outcome of one observation
does not affect the others. For example, while measuring the heights of
10 randomly chosen people, knowing the height of the 1st person doesn’t
tell us anything about the 2nd . If the variables aren’t independent, then the
estimates made about the population may be biased or misleading.
2. Each Xi comes from the same probability distribution, i.e., they are all
“copies” of the same underlying process. Formally,
X1 , X2 , . . . , Xn ∼ F,
distribution.
population.
Continuing with the example above, each Xi might follow a normal dis-
tribution with mean µ = 170 cm and standard deviation σ = 10 cm.
When the sample is taken properly (random, independent, identically dis-
tributed), the sample mean X̄n tends to be close to the population mean
µ, with only small deviations, especially as the sample size grows
\[ \bar{X}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \xrightarrow{P} \mu. \]
If however the samples don’t come from the same distribution then com-
bining them (e.g., via sample mean) may not reflect any one population.
For example, let us say we roll a fair 6-sided die. The population mean
µ = 3.5. Let the first roll give 2 resulting in a sample mean of 2. Af-
ter 2 rolls: (2, 5) resulting in the sample mean 3.5. After 3 rolls: (2,
5, 4) resulting in the sample mean 3.67. After 1000 rolls: sample mean
is approximately 3.51. After 10000 rolls: sample mean is approximately
3.5002. The more rolls (larger n), the closer the sample mean gets to the
true mean 3.5.
In short, the individual random variables Xi are independent and identically
distributed (iid).
If sampling is either with replacement or from an infinite population, these
conditions are satisfied. But in real life, especially in surveys or experiments we
often sample without replacement (i.e., we don’t select the same person or object
twice). Technically, this breaks the independence assumption. For example, if
we already picked one person, the chance of picking them again is zero. That
means the probability distribution for the second draw depends on the outcome
of the first draw.
If the population size is very large compared to the sample size, removing one
element hardly changes the probabilities for the next draw. To see this, suppose
N = 1,000,000. On the first draw, the probability of selecting any particular
person is 1/1,000,000. On the second draw (without replacement), the probability
of selecting a different specific person is 1/999,999. Since 1/999,999 ≈ 1/1,000,000, the
change in probability is microscopically small. Although the observations are
technically dependent, the effect is so minor that we can safely treat them as
independent in practice. Specifically, if the sample size is 5% or less of the
whole population, then we can treat it like a true random sample, even if it is
technically not. We can then proceed to use all the powerful statistical tools to
analyse the population.
what would later become known as The Law of Large Numbers. He observed
the first proof in 1713 for what Cardano had observed centuries earlier. Bernoulli
recognized the intuitive nature of the problem as well as its importance and spent
twenty years formulating a complicated proof for the case of a binary random
variable. Bernoulli states that when estimating the unknown proportion any
degree of accuracy can be achieved through an appropriate number of trials.
The official name of the theorem “The Law of Large Numbers” was not coined
until the year 1837 by French mathematician Simeon Denis Poisson. Roughly
stated, “as the size of a random sample increases, the sample average tends to
get closer and closer to the population mean.” In simple terms, if we repeat an
experiment many times, the average result will eventually stabilize around the
expected (true) value.
Imagine flipping a fair coin, for which the expected proportion of heads is 0.5.
If we flip it 10 times we might get 7 heads or 4 heads; there is more variability.
But if we flip it 1,000 or 10,000 times, the proportion of heads will very likely be
close to 0.5.
The weak law of large numbers states
\[
\lim_{n \to \infty} \bar{X}_n = \mu \quad \text{(in probability)}
\]
This means that the probability that the sample mean X̄n differs from the
population mean µ by more than a small positive number goes to zero as the
sample size n increases.
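The coin-flip illustration above can be tried out numerically. The following is a minimal sketch, not part of the original notes: the function name `proportion_of_heads`, the seed and the list of sample sizes are illustrative choices.

```python
import random


def proportion_of_heads(num_flips, rng):
    """Flip a fair coin num_flips times and return the fraction of heads."""
    heads = sum(1 for _ in range(num_flips) if rng.random() < 0.5)
    return heads / num_flips


# Fixed seed (an arbitrary choice) so that the run is repeatable.
rng = random.Random(2025)
for n in (10, 100, 1_000, 10_000, 100_000):
    p_hat = proportion_of_heads(n, rng)
    print(f"n = {n:>7}: proportion of heads = {p_hat:.4f}")
```

For small n the printed proportion can be far from 0.5; as n grows it should settle close to 0.5, which is the behaviour the law of large numbers describes.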
7.3 Central Limit Theorem
If the individual data points Xi follow a normal distribution, then the sample
mean X̄ also follows a normal distribution, regardless of sample size n. This is
a special property of the normal distribution; it is “stable” under averaging.
Let X1 , X2 , . . . , Xn be i.i.d. random variables with
\[
X_i \sim N(\mu, \sigma^2).
\]
(To rephrase: let Xi be a random variable representing a single draw from
a population with mean µ. If we repeatedly draw Xi and observe outcomes
$x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(N)}$, then the long-run average of these draws converges to the
population mean
\[
E[X_i] = \lim_{N \to \infty} \frac{1}{N} \sum_{j=1}^{N} x_i^{(j)} = \mu
\]
where $x_i^{(j)}$ is the outcome of the i-th random variable in the j-th repetition of
the experiment. Even though any single $x_i^{(j)}$ may differ from µ, the long-run
average of all draws approaches µ.
This is consistent with the Law of Large Numbers (LLN), which states that
the sample mean of repeated independent draws converges to the population
mean as the number of draws grows.)
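\[
M_X(t) = \exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right).
\]
\[
M_{S_n}(t) = \left(M_X(t)\right)^n = \left(\exp\left(\mu t + \tfrac{1}{2}\sigma^2 t^2\right)\right)^n = \exp\left(n\mu t + \tfrac{1}{2} n\sigma^2 t^2\right).
\]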
We are interested in the distribution of X̄n as n grows large. Standardize the
independent variables by defining
\[
Y_i = \frac{X_i - \mu}{\sigma}, \qquad i = 1, 2, \ldots, n.
\]
Then each Yi has mean 0 and variance 1. Likewise standardize the sample mean
\[
Z_n = \frac{\bar{X}_n - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} Y_i.
\]
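\[
M_{Z_n}(t) = E\left[e^{t Z_n}\right]
= E\left[\prod_{i=1}^{n} \exp\left(\frac{t}{\sqrt{n}}\, Y_i\right)\right]
= \prod_{i=1}^{n} E\left[\exp\left(\frac{t}{\sqrt{n}}\, Y_i\right)\right]
\]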
\[
M_{Y_1}(u) = 1 + \underbrace{E[Y_1]}_{=0}\, u + \frac{1}{2}\underbrace{E[Y_1^2]}_{=1}\, u^2 + \frac{1}{6} E[Y_1^3]\, u^3 + \cdots = 1 + \frac{1}{2} u^2 + o(u^2),
\]
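Taking the limit as n → ∞
\[
\lim_{n \to \infty} M_{Z_n}(t) = \lim_{n \to \infty} \left(M_{Y_1}\!\left(\frac{t}{\sqrt{n}}\right)\right)^{n} = \lim_{n \to \infty} \left(1 + \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right)^{n} = e^{t^2/2}.
\]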
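\[
\mu_{\bar{X}} = \mu \quad \text{and} \quad \sigma^2_{\bar{X}} = \frac{\sigma^2}{n},
\]
and the total $T = \sum_{i=1}^{n} X_i$ also has approximately a normal distribution with
\[
\mu_T = n\mu \quad \text{and} \quad \sigma_T^2 = \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i) = n\sigma^2.
\]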
The Central Limit Theorem is one of the most powerful results in probability and
statistics, with profound implications. It allows us to use the normal distribution,
which is mathematically simple and well understood, to make inferences about
population parameters using sample data. The normal distribution has desirable
properties such as symmetry and is completely characterized by just two
parameters (mean and variance). This makes it the go-to choice in simulations
and modeling. So, irrespective of the population, software designed for one
population can be reused for another, possibly with minor changes.
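The claim that sample means look normal even when the population does not can be checked with a small simulation. This is only a sketch under stated assumptions: the Exponential(1) population (mean 1, standard deviation 1), the sample size n = 50, the number of repetitions and the helper name `sample_means` are all illustrative choices, not taken from the notes.

```python
import random
import statistics


def sample_means(n, num_samples, rng):
    """Return num_samples sample means, each computed from n Exponential(1) draws."""
    means = []
    for _ in range(num_samples):
        draws = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return means


rng = random.Random(7)      # arbitrary seed for repeatability
n = 50
mu, sigma = 1.0, 1.0        # mean and standard deviation of Exponential(1)
se = sigma / n ** 0.5       # standard deviation of the sample mean predicted by the CLT

means = sample_means(n, 5_000, rng)
inside = sum(1 for m in means if abs(m - mu) <= 1.96 * se) / len(means)

print(f"average of the sample means : {statistics.fmean(means):.3f} (theory {mu})")
print(f"std. dev. of the sample means: {statistics.stdev(means):.3f} (theory {se:.3f})")
print(f"fraction within 1.96 SE      : {inside:.3f} (normal theory 0.95)")
```

Although the exponential population is strongly skewed, the fraction of sample means within 1.96 standard errors of µ should come out close to the 0.95 that a normal distribution would give.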
Example 7.1. A disk has free space of 330 megabytes. Is it likely to be sufficient
for 300 independent images, if each image has expected size of 1 megabyte with
a standard deviation of 0.5 megabytes?
The sample space S consists of all possible combinations of 300 image sizes.
Assume that the samples X1 , X2 , . . . , X300 are independent and identically
distributed. The total size is
\[
T = X_1 + X_2 + \cdots + X_{300}
\]
This is also a random variable on the same sample space. We are interested in
computing P (T ≤ 330), the probability that all 300 images fit within 330 MB.
Since the individual image sizes are independent and identically distributed
(i.i.d.), and n = 300 is large, we can apply the Central Limit Theorem
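\[
T \approx N(n\mu, n\sigma^2)
\]
That is
\[
T \approx N(300, 300 \times 0.5^2) = N(300, 75)
\]
We standardize to find the z-score
\[
Z = \frac{T - n\mu}{\sigma\sqrt{n}} = \frac{330 - 300}{0.5\sqrt{300}} = \frac{30}{0.5 \times 17.32} \approx \frac{30}{8.66} \approx 3.46
\]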
Now, using the standard normal table, P (T ≤ 330) ≈ Φ(3.46) ≈ 0.9997.
There is a 99.97% chance that 300 images will fit in 330 MB. Hence the disk
space is very likely sufficient.
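The arithmetic of Example 7.1 can be reproduced without a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library for the standard normal CDF; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n, mu, sigma = 300, 1.0, 0.5    # number of images, mean and sd of one image size (MB)
capacity = 330.0                # free disk space (MB)

mean_total = n * mu             # E[T] = n * mu
sd_total = sigma * sqrt(n)      # sd(T) = sigma * sqrt(n)

z = (capacity - mean_total) / sd_total
p_fit = NormalDist().cdf(z)     # P(T <= 330) under the normal approximation

print(f"z = {z:.2f}, P(T <= 330) is approximately {p_fit:.4f}")
```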
Example 7.2. Let X be the number of different people sent text messages
during a particular day by a randomly selected student at a large university.
Suppose the mean value of X is 7 and the standard deviation is 6 (values very
close to those reported in the article Cell Phone Use and Grade Point Average
Among Undergraduate University Students, College Student J., 2011: 544–551).
Among 100 randomly selected such students, how likely is it that the sample
mean number of different people texted exceeds 5?
The experiment is randomly selecting one student and counting how many
different people she sent messages to in a day. So, the sample space is S =
{0, 1, 2, . . .}. This is a set of non-negative integers with no fixed upper bound,
but in reality, there is a practical maximum (e.g. maybe no more than 100–200
people).
The random variable X is defined as “X is the number of different people
messaged in a day by a randomly chosen student” where X : S → Z≥0 . It assigns
to each student (each outcome in the population) a number x ∈ {0, 1, 2, . . . }.
Now, if we repeat the experiment n = 100 times (selecting 100 students
independently as mentioned in the question), we get 100 i.i.d. observations
X1 , X2 , . . . , X100 . Each Xi is a realization of the same random variable X. From
these, we compute the sample mean
\[
\bar{X} = \frac{1}{100} \sum_{i=1}^{100} X_i
\]
whose sample space is now the set of averages of 100 non-negative integers.
We want to compute P (X̄ > 5). Since n = 100 is large and the observations are
i.i.d., the Central Limit Theorem gives, approximately,
\[
\bar{X} \sim N\left(7, \frac{6^2}{100}\right) = N(7, 0.36)
\]
We standardize as
\[
Z = \frac{5 - 7}{6/\sqrt{100}} = \frac{-2}{0.6} = -3.33
\]
to compute the probability
\[
P(\bar{X} > 5) \approx P(Z > -3.33) = \Phi(3.33) \approx 0.9996
\]
There is a 99.96% probability that the sample mean number of people texted
by 100 students exceeds 5. Note: the cited article stated that text messaging
frequency was negatively correlated with GPA.
For extremely skewed or heavy-tailed distributions, even n = 40 or 50 might
not be enough. However, such distributions are rare. For well-behaved
distributions (e.g., uniform, mildly skewed), the CLT approximation can work well
even with n as small as 12. For example, if we are measuring heights (roughly normal
in humans), even n = 10 gives a decent approximation. If we are measuring
income or insurance claims due to catastrophic events (which are skewed), we
may need a much larger n (50 or more) for the CLT approximation to hold well.
vides the theoretical foundation for inference methods such as confidence
intervals and hypothesis testing, making it one of the most powerful results in
statistics.
8 Statistics
Statistics helps us make sense of data, and it is broadly divided into two branches:
descriptive and inferential statistics. Descriptive statistics are used to
summarize, organize, and present data in a meaningful way. They include measures
like mean, median, mode, standard deviation, and visual tools such as histograms
and pie charts, all of which help describe the main features of a dataset. For
example, stating that “the average score of a class of 50 students is 72” is a
descriptive statement. It tells us about the data we have, without making any
generalizations beyond that group.
In contrast, inferential statistics allow us to make predictions or draw
conclusions about a larger population based on a smaller sample. They in-
volve estimation techniques, hypothesis testing (like z-tests, t-tests, or chi-square
tests), regression analysis, and more. For instance, estimating that “60% of a
city’s population supports a candidate based on a survey of 200 voters” is an
inferential conclusion.
Unlike descriptive statistics, inferential statistics rely heavily on probability
theory to quantify uncertainty and assess the reliability of conclusions. Proba-
bility theory allows us to say not just what the estimate is, but also how reliable
it is. For example, instead of only reporting the sample mean, we report a
confidence interval which shows how much trust we can put in that estimate.
statistics predict what could be true for a broader population. Both are essential
estimation which is the process of using sample data to make educated guesses
derived from a sample that serves as the best guess for an unknown parameter
unknown.
Since testing every chip is impractical, the company proceeds as follows. It
randomly selects a sample of n = 50 chips and measures the operational lifetime
(in hours) of each selected chip until it fails. It records the sample values as
X1 , X2 , . . . , X50 . It then computes the sample mean
\[
\bar{X} = \frac{1}{50} \sum_{i=1}^{50} X_i
\]
Suppose the result is X̄ = 72,500 hours. The statistic X̄ is called the point
estimator of the unknown population mean µ. The value 72,500 is the point
estimate of µ based on this particular sample. It serves as a reasonable, data-
driven guess for the true average lifetime of all chips produced.
This is an example for which there is only one reasonable point estimator
for the parameter of interest. The next example shows that multiple sensible
estimators can exist for a given parameter.
24.46 25.61 26.25 26.42 26.66 27.15 27.31 27.54 27.74 27.94
27.98 28.04 28.28 28.49 28.50 28.87 29.11 29.13 29.50 30.88
cause normal distributions are symmetric, the population mean µ is also equal
3. Here the estimator is $\frac{\min(X_i) + \max(X_i)}{2}$
and the estimate is 27.670.
In a 10% trimmed mean with n = 20, we discard the lowest 10% (2 values) and
the highest 10% (2 values), and compute the mean of the remaining 16 values.
The discarded values are 24.46, 25.61, 29.50, 30.88. So, the remaining sum is
\[
\sum_{i=3}^{18} X_i = 555.86 - 24.46 - 25.61 - 29.50 - 30.88 = 445.41
\]
\[
\bar{X}_{\text{trimmed}} = \frac{445.41}{16} = 27.838
\]
4. Here the estimator is 10% trimmed mean and the estimate is 27.838.
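The estimates for the 20 observations listed above can be recomputed in a few lines. This is a minimal sketch: the helper `trimmed_mean` is an illustrative function written for this note, and it simply drops the stated proportion of values from each end of the sorted data.

```python
from statistics import fmean, median

data = [
    24.46, 25.61, 26.25, 26.42, 26.66, 27.15, 27.31, 27.54, 27.74, 27.94,
    27.98, 28.04, 28.28, 28.49, 28.50, 28.87, 29.11, 29.13, 29.50, 30.88,
]


def trimmed_mean(values, proportion):
    """Mean after discarding the given proportion of values from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values dropped per end
    return fmean(ordered[k:len(ordered) - k])


estimates = {
    "sample mean": fmean(data),
    "sample median": median(data),
    "average of extremes": (min(data) + max(data)) / 2,
    "10% trimmed mean": trimmed_mean(data, 0.10),
}

for name, value in estimates.items():
    print(f"{name:<20} {value:.3f}")
```

All four numbers are computed from the same sample, yet they differ, which is exactly why the choice of estimator needs justification.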
By presenting four different estimates, the example implicitly invites ques-
tions like “Which estimator is most reliable?”, “How does outlier sensitivity
affect each?”, “How efficient are these estimators in different situations?”, etc.
This helps us understand why the choice of estimator matters and that robust-
ness and efficiency are important properties.
A simple observation is that the sample mean is sensitive to outliers (affected
by min/max values). The median is more robust to extreme values. And the
trimmed mean is a compromise as it reduces the effect of outliers while still
using most of the data.
Ideally, we would like an estimator that always gives us the exact value of
such a perfect estimator almost never exists in practice, unless we have the full
sample variance), and since the sample is chosen randomly, the estimator itself
becomes a random variable. It doesn’t have a fixed value until the sample
\[
\hat{\theta} = \theta + \text{estimation error}
\]
estimator in statistics: E[θ̂] = θ. Instrument B, in contrast, is consistently wrong in one
direction because it always reads too high. Even if its readings vary, their average
is not the true value. This is called bias in statistics: Bias(θ̂) = E[θ̂] − θ ≠ 0.
This analogy highlights the intuition for statistical bias, which is not just random
error but a systematic shift away from the truth.
For an unbiased estimator, the sampling distribution of θ̂ is always centered at
the true value of the parameter. For example, if µ = 100, then the mean of the
sampling distribution of θ̂ is 100; if µ = 27.5, then it is 27.5, and so on. The figures
below illustrate the concept of bias in estimators.
In Fig.17a, the mean of the sampling distribution of estimator θ̂1 does not
coincide with the true parameter value θ. The horizontal distance between θ
and the mean of the distribution of θ̂1 is the bias of θ̂1 . This illustrates that θ̂1
θ, whereas the distribution of θ̂2 is shifted to the left. Both are therefore biased,
don’t know θ to start with! So at first, it seems like we cannot ever tell whether
an estimator is unbiased unless we already know the true value which defeats
the point of estimation. Here is where theoretical analysis comes to our rescue.
Unbiasedness is a theoretical property of an estimator’s sampling distribution.
We can analyze it mathematically by computing the expected value E[θ̂] in terms
of the population distribution. If this expected value equals θ for all possible
values of θ (not just one specific value), then the estimator is unbiased. So we
don’t need to know the actual value of θ. Instead, we use theoretical tools (like
laws of expectation, properties of estimators) to prove that E[θ̂] = θ for all θ.
To show that X̄ is an unbiased estimator of µ, we compute its expected value
and apply the linearity of expectation (see the proposition in Section 3.3)
\[
E[\bar{X}] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n\mu = \mu
\]
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]
symmetric, then X̃ and any trimmed mean are also unbiased estimators of µ.
M = max(X1 , . . . , Xn ).
Then
\[
E[M] = \frac{n}{n+1}\, u.
\]
Proof. The CDF of a single observation is $F_X(x) = P(X_i \le x) = \frac{x}{u}$ for 0 ≤ x ≤ u.
Define the sample maximum as
M = max(X1 , X2 , . . . , Xn ).
By definition,
{M ≤ x} ⇐⇒ {X1 ≤ x, X2 ≤ x, . . . , Xn ≤ x}.
Because the Xi are independent and each has distribution function FX (x), the
CDF of the sample maximum M is
\[
F_M(x) = P(M \le x) = \left(F_X(x)\right)^n = \left(\frac{x}{u}\right)^n, \qquad 0 \le x \le u
\]
Differentiating gives the pdf of M (see Sec. 5.1)
\[
f_M(x) = \frac{d}{dx} F_M(x) = n\, \frac{x^{n-1}}{u^n}, \qquad 0 \le x \le u.
\]
\[
E[M] = \int_0^u x\, f_M(x)\, dx = \int_0^u x \cdot n\, \frac{x^{n-1}}{u^n}\, dx = \frac{n}{u^n} \int_0^u x^n\, dx.
\]
\[
\int_0^u x^n\, dx = \frac{u^{n+1}}{n+1}.
\]
Hence
\[
E[M] = \frac{n}{u^n} \cdot \frac{u^{n+1}}{n+1} = \frac{n}{n+1}\, u.
\]
Thus M has expectation $\frac{n}{n+1}\, u$.
Example 8.3. Suppose that X, the reaction time to a certain stimulus, has a
uniform distribution on the interval from 0 to an unknown upper limit u. Thus,
the probability density function (pdf) of X is rectangular in shape with height
1/(u − 0) = 1/u for 0 ≤ x ≤ u. It is desired to estimate u on the basis of a random
sample X1 , X2 , . . . , Xn of reaction times.
First Estimator Since u is the largest possible time in the entire population
of reaction times, consider as a first estimator the largest sample reaction time
û1 = max(X1 , . . . , Xn ).
Then the point estimate of u is
û1 = max(4.2, 1.7, 2.4, 3.9, 1.3) = 4.2.
Let us analyse the problem with this estimator. û1 will never overestimate
u, because sample values cannot exceed the population maximum. It often
underestimates u, since unless the sample maximum exactly equals u, it lies
below it. Therefore, the distribution of û1 is not centered at u; thus it is a
biased estimator. From the above proposition
\[
E[\hat{u}_1] = \frac{n}{n+1} \cdot u
\]
Hence, the bias is
\[
\mathrm{Bias}(\hat{u}_1) = E[\hat{u}_1] - u = \frac{n}{n+1} \cdot u - u = -\frac{u}{n+1}
\]
So, û1 underestimates u on average. But as n → ∞, the bias → 0. That is, the
estimator becomes asymptotically unbiased.
Second Estimator We can correct the bias in the estimator û1 by multiplying
the sample maximum by (n + 1)/n. This gives the unbiased estimator
\[
\hat{u}_2 = \frac{n+1}{n} \cdot \max(X_1, \ldots, X_n)
\]
For the sample of n = 5 reaction times above, the correction factor is 6/5, so the
point estimate is û2 = (6/5) · 4.2 = 5.04.
This adjusted estimator û2 is unbiased, meaning its expected value equals the
true value u.
Whenever the maximum sample value exceeds $\frac{n}{n+1} \cdot u$, the adjusted estimator
û2 overshoots the true value u even though the sample max never exceeds u.
\[
\hat{u}_2 = u \iff \frac{n+1}{n} \cdot \max(X_i) = u \iff \max(X_i) = \frac{n}{n+1} \cdot u
\]
Whenever the maximum sample value is the same as $\frac{n}{n+1} \cdot u$, the adjusted
estimator û2 = u.
the population maximum. This is evident from the orange distribution being
entirely to the left of the red dashed line at u = 10, with its peak concentrated
On the other hand, the blue curve corresponds to the corrected estimator
û2 = ((n + 1)/n) · max(Xi ), which is unbiased. This distribution is centered around the
true value u = 10, as shown by the fact that the blue curve straddles the red
dashed line. However, to achieve unbiasedness, û2 sometimes overshoots and
sometimes undershoots, resulting in a wider distribution.
Estimators don’t have to be normally distributed (as this example illustrates:
the density of the sample maximum is proportional to x^(n−1), which is strongly
skewed), especially when based on nonlinear statistics (like max). The skewed
shape tells us that overshoots are frequent but small (û2 can exceed u by at most
u/n), while undershoots are rarer but can be much larger, so the two cancel out.
Overall, the mean is exactly at u, making the estimator unbiased. The contrast
between the two curves illustrates how bias
can be corrected at the cost of increased variability and occasional overshooting.
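The behaviour of the two estimators can be reproduced by simulation. The sketch below assumes u = 10 (the value marked in the figure discussion) and n = 5 (the size of the reaction-time sample); the number of repetitions, the seed and the function name `simulate_estimators` are illustrative choices.

```python
import random
from statistics import fmean


def simulate_estimators(u, n, num_samples, rng):
    """Return lists of u1_hat = max(X_i) and u2_hat = (n + 1)/n * max(X_i)."""
    raw, corrected = [], []
    for _ in range(num_samples):
        sample_max = max(rng.uniform(0.0, u) for _ in range(n))
        raw.append(sample_max)
        corrected.append((n + 1) / n * sample_max)
    return raw, corrected


rng = random.Random(42)     # arbitrary seed for repeatability
u, n = 10.0, 5
raw, corrected = simulate_estimators(u, n, 20_000, rng)

overshoot = sum(1 for value in corrected if value > u) / len(corrected)

print(f"mean of u1_hat: {fmean(raw):.3f} (theory n/(n+1)*u = {n / (n + 1) * u:.3f})")
print(f"mean of u2_hat: {fmean(corrected):.3f} (theory u = {u})")
print(f"share of samples where u2_hat overshoots u: {overshoot:.3f}")
```

The average of û1 should sit near n/(n + 1) · u, below the true value, while the average of û2 should sit near u even though individual values of û2 land on both sides of it.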
Summary Unbiasedness implies that some samples will yield estimates that
exceed u and other samples will yield estimates smaller than u otherwise u
could not possibly be the center (balance point) of û1 ’s distribution. However,
the first estimator û1 will never overestimate u (since the largest sample value
cannot exceed the largest population value), and will underestimate u unless
the largest sample value equals u. This intuitive argument shows that û1 is a
biased estimator.
The second estimator û2 is designed such that it will sometimes overshoot,
sometimes undershoot, but on average, it equals u. This is a great example
of how bias can be analyzed and corrected using knowledge of the sampling
distribution.
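\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2(\bar{X} - \mu) \sum_{i=1}^{n} (X_i - \mu) + \sum_{i=1}^{n} (\bar{X} - \mu)^2.
\]
Note that
\[
\sum_{i=1}^{n} (X_i - \mu) = n\bar{X} - n\mu = n(\bar{X} - \mu).
\]
Combining, we obtain
\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - 2n(\bar{X} - \mu)^2 + n(\bar{X} - \mu)^2
= \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2.
\]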
The identity, also known as variance decomposition identity, splits the total
variation of the data about the sample mean into two parts. The total deviation
of the data around the sample mean can be expressed in terms of the total
deviation of the data around the population mean µ with a correction that
adjusts for the fact that we are centering around X̄ rather than µ.
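Because the identity is purely algebraic, it holds for any numbers and any value of µ, which makes it easy to sanity-check. The following sketch uses an arbitrary simulated data set; the values of µ, the spread and the sample size are illustrative.

```python
import random
from math import isclose

rng = random.Random(0)      # arbitrary seed
mu = 5.0                    # value used as the "population mean" in the identity
xs = [rng.gauss(mu, 2.0) for _ in range(12)]

n = len(xs)
xbar = sum(xs) / n

lhs = sum((x - xbar) ** 2 for x in xs)                            # deviations about the sample mean
rhs = sum((x - mu) ** 2 for x in xs) - n * (xbar - mu) ** 2       # deviations about mu, corrected

print(f"left-hand side : {lhs:.10f}")
print(f"right-hand side: {rhs:.10f}")
print("identity holds :", isclose(lhs, rhs, rel_tol=1e-9))
```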
Proposition. Let X1 , X2 , . . . , Xn be a random sample from a distribution with
mean µ and variance σ 2 . Then the estimator
\[
\hat{\sigma}^2 = S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}
\]
is an unbiased estimator of σ 2 .
Proof. Let X1 , X2 , . . . , Xn be a random sample from a population with mean µ
and variance σ 2 . Define the sample mean X̄ and sample variance S 2 as follows
\[
\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i, \qquad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2
\]
if E[θ̂] = θ. That is, on average, over many samples, the estimator gives you the
true parameter value. Hence, in this five step proof we aim to prove
\[
E[S^2] = \sigma^2
\]
1 : Use the ANOVA identity. Before proceeding with the proof let us understand
the need for this identity. In this proof of unbiasedness, we wish to evaluate the
expectation
\[
E\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]
\]
But this is hard to compute directly since it involves X̄. However, the
ANOVA identity allows us to break the expression into terms involving the
population mean µ, which is easier to work with. So, we use the ANOVA
identity
\[
\sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} (X_i - \mu)^2 - n(\bar{X} - \mu)^2
\]
2 : Take expectations on both sides and use linearity
\[
E[S^2] = \frac{1}{n-1}\, E\left[\sum_{i=1}^{n} (X_i - \bar{X})^2\right]
= \frac{1}{n-1} \left( E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] - E\left[n(\bar{X} - \mu)^2\right] \right)
\]
3 : Evaluate expectations. Since each Xi has variance σ², we have
\[
E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] = n\sigma^2
\]
Also, X̄ has variance Var(X̄) = σ²/n, so
\[
E\left[n(\bar{X} - \mu)^2\right] = n \cdot E\left[(\bar{X} - \mu)^2\right] = n \cdot \frac{\sigma^2}{n} = \sigma^2
\]
4 : Plug back into the formula
\[
E[S^2] = \frac{1}{n-1} \left( n\sigma^2 - \sigma^2 \right) = \frac{(n-1)\sigma^2}{n-1} = \sigma^2
\]
5 : Hence E[S²] = σ².
\[
\hat{\sigma}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2 = \frac{n-1}{n}\, S^2,
\]
where S² is the unbiased sample variance with divisor n − 1. Taking expectation,
\[
E[\hat{\sigma}_n^2] = \frac{n-1}{n}\, E[S^2] = \frac{n-1}{n}\, \sigma^2.
\]
Thus, σ̂n2 is not unbiased. Its bias is
\[
\mathrm{Bias}(\hat{\sigma}_n^2) = E[\hat{\sigma}_n^2] - \sigma^2 = \frac{n-1}{n}\, \sigma^2 - \sigma^2 = -\frac{\sigma^2}{n}.
\]
Because the bias is negative, the estimator with divisor n systematically
underestimates the true variance σ 2 . This explains why the divisor n − 1 is
generally preferred. However, when n is large, the bias −σ 2 /n is small, so the
difference between the two estimators becomes negligible.
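The effect of the divisor can be seen by averaging both estimators over many simulated samples. This is a sketch under assumed settings: a normal population with σ = 2 (so σ² = 4), a small sample size n = 5 to make the bias visible, and an illustrative helper `sum_sq_dev`.

```python
import random
from statistics import fmean


def sum_sq_dev(sample):
    """Sum of squared deviations of the sample values about their own mean."""
    xbar = sum(sample) / len(sample)
    return sum((x - xbar) ** 2 for x in sample)


rng = random.Random(123)        # arbitrary seed
mu, sigma, n = 0.0, 2.0, 5      # assumed population and sample size
num_samples = 20_000

s2_values, sn2_values = [], []
for _ in range(num_samples):
    sample = [rng.gauss(mu, sigma) for _ in range(n)]
    ss = sum_sq_dev(sample)
    s2_values.append(ss / (n - 1))      # divisor n - 1 (S^2)
    sn2_values.append(ss / n)           # divisor n

print(f"true variance           : {sigma ** 2:.3f}")
print(f"average of S^2 (n - 1)  : {fmean(s2_values):.3f}")
print(f"average with divisor n  : {fmean(sn2_values):.3f} "
      f"(theory (n-1)/n * sigma^2 = {(n - 1) / n * sigma ** 2:.3f})")
```

With n = 5 the divisor-n estimator should average about 20% below σ², in line with the bias −σ²/n derived above; rerunning with a larger n shows the gap shrinking.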
\[
\mathrm{Var}(X) = E\left[(X - \mu)^2\right].
\]
Population Variance If the population follows a probability distribution
Here we are not taking expectations, because we already have the data. We
literally compute deviations from the sample mean and average them. The
variance is thus a fixed number describing variability in the dataset.
Sample Variance (Inferential Statistic) For an i.i.d. random sample
\[
S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2, \qquad \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]
However, the property of unbiasedness does not carry over through nonlinear
transformations such as the square root. Specifically, E[S] ≠ σ. This is because
the expectation of a function of a random variable is not generally equal to the
function of the expectation: in general, E[g(X)] ≠ g(E[X]).
because all the candidates are unbiased. They all have the correct expected
Suppose θ̂1 and θ̂2 are two estimators of a population parameter θ, and both
\[
V[\hat{\theta}_1] \ne V[\hat{\theta}_2].
\]
When multiple unbiased estimators are available we may still prefer one
over the other. This is because, in practice, we usually work with only a single
sample from the population. In many real-world situations, repeated sampling
is not possible due to cost, time, ethical constraints, or the destructive nature
of measurement. For instance, in clinical drug trials, only one study might
be conducted because replicating trials may be prohibitive due to ethical and
cost constraints; in quality control, destructive testing allows sampling only
once from a batch; and in environmental monitoring or historical data analysis,
only a single set of data may be available. In such cases, the performance of an
estimator on that one sample is crucial.
Among unbiased estimators, the one with the smaller variance tends to pro-
duce estimates that are more consistently close to the true parameter value.
Therefore, we prefer the estimator with lower variance, leading to the impor-
tant concept of the minimum variance unbiased estimator (MVUE). Choosing
estimators based on both unbiasedness and spread ensures more reliable and
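accurate statistical inference in practice.
\[
E[\bar{X}] = E\left[\frac{1}{n} \sum_{i=1}^{n} X_i\right] = \frac{1}{n} \sum_{i=1}^{n} E[X_i] = \frac{1}{n} \cdot n \cdot \frac{u}{2} = \frac{u}{2}
\]
\[
E[\bar{X}] = \frac{u}{2} \;\Rightarrow\; E[2\bar{X}] = u
\]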
Therefore, the estimator û2 = 2X̄ is also unbiased. Now, we compare the
variances of û1 and û2 to find out which is better. Let X ∼ Uniform(0, u).
Then the probability density function (pdf) of X is given by
\[
f(x) =
\begin{cases}
\dfrac{1}{u}, & 0 \le x \le u \\
0, & \text{otherwise}
\end{cases}
\]
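For a uniform distribution on [0, u],
\[
V[X_i] = \frac{u^2}{12} \;\Rightarrow\; V[\bar{X}] = \frac{u^2}{12n} \;\Rightarrow\; V[2\bar{X}] = 4 \cdot \frac{u^2}{12n} = \frac{u^2}{3n} = V[\hat{u}_2]
\]
For the maximum M of n i.i.d. U (0, u) random variables, we have
\[
E[M] = \frac{n}{n+1}\, u, \qquad E[M^2] = \frac{n}{n+2}\, u^2.
\]
The variance of M is
\[
V[M] = E[M^2] - (E[M])^2 = \frac{n}{n+2}\, u^2 - \left(\frac{n}{n+1}\, u\right)^2 = \frac{n u^2}{(n+1)^2 (n+2)}.
\]
Since û1 = ((n + 1)/n) · M, its variance is V[û1 ] = ((n + 1)/n)² · V[M ] = u²/(n(n + 2)).
Comparing this with V[û2 ],
\[
\frac{u^2}{n(n+2)} < \frac{u^2}{3n} \iff \frac{1}{n(n+2)} < \frac{1}{3n} \iff 3n < n(n+2) \iff 3 < n + 2
\]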
This inequality holds whenever n > 1. Thus, for any sample size greater than 1,
û1 has smaller variance than û2 , making it the preferred estimator among the
two. More advanced statistical theory can show that û1 is in fact the minimum
variance unbiased estimator (MVUE) for u; that is, it has the smallest variance
It has the ability to evaluate and compare estimators using theoretical tools
from probability and algebra, even before any data is observed. Since estimators
are functions of random variables, they possess distributions of their own. This
allows us to compute important characteristics such as bias, variance, and mean
squared error (MSE) directly from their probability models.
For instance, in the case of estimating the upper bound u of a uniform
distribution on [0, u], both the sample maximum and twice the sample mean
can serve as unbiased estimators. However, by computing and comparing their
variances analytically, we can determine which estimator is more efficient. This
approach avoids the need for empirical simulation or sample data and provides
deeper insight into the long-run performance of different estimation strategies.
Such theoretical comparisons are foundational in the development of optimal
statistical procedures.
This is the essence of mathematical statistics distinguishing it from applied
or computational statistics where data-driven methods dominate.
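The analytical comparison needs no data, but a simulation is a useful sanity check on the algebra. The sketch below assumes u = 10 and n = 5 (illustrative values) and compares the simulated variances of û1 = ((n + 1)/n) · max(Xi) and û2 = 2X̄ with the formulas u²/(n(n + 2)) and u²/(3n) derived above; the function name `simulate_variances` is hypothetical.

```python
import random
from statistics import fmean, pvariance


def simulate_variances(u, n, num_samples, rng):
    """Simulated variances of the two unbiased estimators of u."""
    max_based, mean_based = [], []
    for _ in range(num_samples):
        sample = [rng.uniform(0.0, u) for _ in range(n)]
        max_based.append((n + 1) / n * max(sample))     # u1_hat
        mean_based.append(2.0 * fmean(sample))          # u2_hat
    return pvariance(max_based), pvariance(mean_based)


rng = random.Random(2024)   # arbitrary seed
u, n = 10.0, 5
var_max, var_mean = simulate_variances(u, n, 20_000, rng)

print(f"V[u1_hat]: simulated {var_max:.3f}, theory {u ** 2 / (n * (n + 2)):.3f}")
print(f"V[u2_hat]: simulated {var_mean:.3f}, theory {u ** 2 / (3 * n):.3f}")
```

Both estimators are unbiased, so the smaller simulated variance of the max-based estimator is what makes it the better choice here.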
The sample mean is defined as
\[
\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i.
\]
Variance of X̄ shrinks with n. The key theoretical result to use here is the
Lehmann–Scheffé theorem:
the sample mean is also complete. Thus by the theorem X̄ is the MVUE of µ.
of µ in terms of variance.
We consider four estimators of µ: X̄, X̃ (median), X̄e (average of extremes),
and X̄tr(10) (10% trimmed mean). If the random sample comes from a normal
distribution, then X̄ is the best estimator (MVUE), since it has minimum vari-
ance among all unbiased estimators. If the random sample comes from a Cauchy
distribution, X̄ and X̄e are very poor choices (sensitive to outliers and heavy
tails). The sample median X̃ performs quite well, though the MVUE is not
known. If the sample has a uniform distribution then the best estimator is X̄e .
Outliers cannot occur since the distribution is bounded. Trimmed mean is not
optimal in any case, but performs reasonably well across all three distributions.
It is robust, balancing efficiency and resistance to outliers.
The best estimator for µ depends crucially on the underlying distribution.
Different families reward different choices of estimators, and no single estimator
is universally optimal. Robust choices like the trimmed mean provide a good
compromise.
9 Interval Estimation
A point estimate is a single number calculated from a sample, such as the
sample mean used to estimate a population mean. While point estimators are
often theoretically sound i.e., being unbiased, consistent, and efficient, they
still rely on the particular sample drawn. In practice, we usually have access
to only one sample, and the estimate we get from it may be higher or lower
than the true population value due to random sampling variability. Although a
good estimator performs well on average over many samples, any one estimate
could still be inaccurate. This means that a point estimate by itself provides no
information about how precise or reliable it is. Therefore, we accompany point
estimates with measures of uncertainty, such as standard errors or confidence
intervals, which help us express how much the estimate might vary and how
close it is likely to be to the true value.
To address this uncertainty, we accompany point estimates with an interval
of plausible values for the parameter. This is called an interval estimate or
confidence interval (CI). Constructing a confidence interval begins by choos-
ing a confidence level, which reflects how confident we are that the interval
captures the true parameter. For instance, suppose we use a sample statistic
X̄ to estimate the true average breaking strength (in grams) of a particular
brand of paper towels, and we get a value of x̄ = 9322.7. A 95% confidence interval
for the average breaking strength from 9162.5 to 9482.9 means we can be 95%
confident that the true mean lies somewhere within this range.
samples and computed confidence intervals, then about 95% of those intervals
would contain the true value µ, and only 5% would not. Commonly used
confidence levels are 95%, 99%, and 90%. The higher the confidence level, the
more certainty we have that the interval includes the true parameter. To be
more confident that the interval contains the true parameter, we must be more
cautious, which means a wider range. Our aim is for a narrow interval with
high confidence.
Therefore,
\[
P(-1.96 < Z < 1.96) = \Phi(1.96) - \Phi(-1.96) = 0.975 - 0.025 = 0.95.
\]
Substituting for Z, we get
\[
P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95 \tag{1}
\]
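We now rearrange (1) algebraically to express it in terms of µ. Multiply through
by σ/√n
\[
P\left(-1.96 \cdot \frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = 0.95
\]
Subtract X̄ from each part and multiply through by −1 (which reverses the
inequalities)
\[
P\left(\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right) = 0.95 \tag{2}
\]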
Equation (2) defines a 95% confidence interval for the population mean µ
\[
\left(\bar{X} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right)
\]
This interval itself is random because it depends on the random variable X̄.
Before the data is collected, the sample mean is unknown and hence the interval
is not fixed. Once data is collected, we compute x̄, and the interval becomes
\[
\left(\bar{x} - 1.96 \cdot \frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.96 \cdot \frac{\sigma}{\sqrt{n}}\right)
\]
In repeated sampling, 95% of all intervals constructed in this way will con-
tain the true population mean µ. Before the data is collected, there is a 95%
probability that the interval we construct will capture µ. Once the data is ob-
served and the interval is calculated, it either contains µ or it does not. But
the method used is such that, in the long run, 95% of such intervals will be
successful.
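The repeated-sampling interpretation can be made concrete with a simulation in which, unlike in practice, the true mean is known. This is a minimal sketch: the normal population with µ = 80 and σ = 2, the sample size n = 31, the number of repetitions and the function name `coverage` are illustrative assumptions.

```python
import random
from math import sqrt
from statistics import fmean


def coverage(mu, sigma, n, num_intervals, rng, z=1.96):
    """Fraction of z-intervals (known sigma) that contain the true mean mu."""
    half_width = z * sigma / sqrt(n)
    hits = 0
    for _ in range(num_intervals):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - half_width < mu < xbar + half_width:
            hits += 1
    return hits / num_intervals


rng = random.Random(9)      # arbitrary seed
rate = coverage(mu=80.0, sigma=2.0, n=31, num_intervals=10_000, rng=rng)
print(f"fraction of intervals containing mu: {rate:.3f}")
```

Each individual interval either contains µ or does not; it is the long-run fraction of successful intervals that should come out close to 0.95.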
The interval is centered at X̄, and its width is 2 × 1.96 × σ/√n. The standard
error of the mean is SE(X̄) = σ/√n, and the margin of error is
ME = zα/2 · SE = 1.96 · σ/√n. Crucially, the formula is valid only when σ is known
and either the population is normal or the sample size is large enough for the
Central Limit Theorem (Sec. 7.3) to apply.
Example 9.1. We are constructing a 95% confidence interval for the true av-
erage preferred keyboard height µ, based on a sample of typists. We are given
sample mean x̄ = 80.0 cm, population standard deviation σ = 2.0 cm, sample
size n = 31, and confidence level 95%.
Since the population standard deviation is known, we use the z-based
confidence interval formula
\[
\bar{x} \pm z^{*} \cdot \frac{\sigma}{\sqrt{n}}
\]
For 95% confidence, z∗ = 1.96 (derived above). Substituting the values
\[
80.0 \pm 1.96 \cdot \frac{2.0}{\sqrt{31}} = 80.0 \pm 1.96 \cdot 0.359 \approx 80.0 \pm 0.7
\]
Therefore, the confidence interval is (79.3, 80.7).
Interpretation: We are 95% confident that the true average preferred keyboard
height µ lies between 79.3 cm and 80.7 cm. The interval is relatively narrow,
which indicates that the estimate of µ is quite precise.
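The calculation in Example 9.1 can be wrapped in a small reusable function. This is a sketch, not a library routine: the name `z_interval` and its signature are illustrative, and the critical value is obtained from `statistics.NormalDist().inv_cdf` rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, confidence=0.95):
    """Two-sided z-based confidence interval for the mean when sigma is known."""
    alpha = 1.0 - confidence
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)     # critical value z_(alpha/2)
    margin = z * sigma / sqrt(n)                    # margin of error
    return xbar - margin, xbar + margin


# Values from Example 9.1: x-bar = 80.0 cm, sigma = 2.0 cm, n = 31, 95% confidence.
low, high = z_interval(80.0, 2.0, 31, confidence=0.95)
print(f"95% CI: ({low:.1f}, {high:.1f})")
```

Keeping the confidence level as a parameter makes it easy to see how the interval widens as the level is raised from 90% to 99%.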
It might be tempting, but incorrect, to write
P (µ ∈ (79.3, 80.7)) = 0.95
This is incorrect because the value µ is a fixed but unknown constant. Once the
sample mean x̄ = 80.0 is observed, the interval is fixed and no longer random.
Probability statements apply to random variables or events, not fixed numbers.
The correct meaning of 95% confidence is based on the long-run frequency
interpretation of probability. If we were to repeatedly take random samples
from the same population, and compute a 95% confidence interval each time,
then approximately 95% of those intervals would contain the true mean µ.
That is, the procedure used to construct the interval has a 95% success rate
in the long run. We do not know whether this particular interval contains µ,
but we are 95% confident in the method used to obtain it.
parameter. The most commonly used level is 95%, meaning that if we were to
repeat the sampling process many times, about 95% of the resulting confidence
intervals would contain the true value. Other levels include 90% and 99%, each
A 90% confidence level yields a narrower interval but offers less certainty
(10% chance of error), while a 99% confidence level provides greater assurance
that the true parameter is captured, but results in a wider interval due to
the higher margin for coverage. The choice of confidence level depends on the
context. 90% may be sufficient for exploratory studies, 95% is standard in
many scientific fields, and 99% is preferred when higher assurance or stricter
error control is required, such as in clinical research or quality assurance. In
general, as the confidence level increases, the interval becomes wider to reflect
increased caution in capturing the true value.
A z-critical value (often written zα or zα/2 ) is a cutoff point on the standard
normal distribution N (0, 1) that separates the central probability from the tail
probability. zα is the number such that the area to the right of it under the
standard normal curve is α.
P (Z > zα ) = α, Z ∼ N (0, 1).
zα/2 is the value such that the area to the right is α/2.
\[
P(Z > z_{\alpha/2}) = \frac{\alpha}{2}.
\]
Equivalently, the central probability is
P (−zα/2 < Z < zα/2 ) = 1 − α.
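Critical values are usually read from a table, but they can also be computed from the inverse of the standard normal CDF. The sketch below uses `statistics.NormalDist().inv_cdf`; the helper name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist


def z_critical(alpha):
    """Return z_alpha, the point with right-tail area alpha under N(0, 1)."""
    return NormalDist().inv_cdf(1.0 - alpha)


for confidence in (0.90, 0.95, 0.99):
    alpha = 1.0 - confidence
    z_half = z_critical(alpha / 2.0)
    # Central probability between -z_(alpha/2) and z_(alpha/2).
    central = NormalDist().cdf(z_half) - NormalDist().cdf(-z_half)
    print(f"confidence {confidence:.2f}: z_(alpha/2) = {z_half:.3f}, central area = {central:.2f}")
```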
Consider the standard normal distribution curve shown in Figure 19. The
central area under the curve is 1 − α, while the two tails each have probability
α/2. The boundary points of this central region are the critical values of Z,
denoted by −zα/2 and zα/2 . The interval between −zα/2 and zα/2 contains the
middle 100(1 − α)% of the distribution. This is the probability basis for con-
structing two-sided confidence intervals. In confidence intervals, the z-critical
value tells us how many standard errors we need to move away from the mean
to capture a specified confidence level.
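The critical values can also be computed numerically instead of being read from a table. The following is a minimal Python sketch, assuming Python 3.8 or later (for statistics.NormalDist); the helper name z_critical is illustrative and not part of these notes.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)

def z_critical(alpha):
    """Return z_alpha, the point with area alpha to its right under N(0, 1)."""
    return STANDARD_NORMAL.inv_cdf(1.0 - alpha)

for level in (0.90, 0.95, 0.99):
    alpha = 1.0 - level
    z = z_critical(alpha / 2)
    # The area between -z and z should reproduce the confidence level.
    central = STANDARD_NORMAL.cdf(z) - STANDARD_NORMAL.cdf(-z)
    print(f"level={level:.2f}  z_(alpha/2)={z:.3f}  central area={central:.3f}")
```

Each printed row pairs a confidence level with its two-sided critical value and confirms that the central area equals 1 − α.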
Figure 19: Standard normal curve showing the central probability P (−zα/2 <
Z < zα/2 ) = 1 − α.
For a 95% confidence interval (α = 0.05), z0.025 = 1.96 which means that
95% of the standard normal distribution lies between −1.96 and 1.96. For a
90% confidence interval (α = 0.10) z0.05 = 1.645 which means that 90% of the
distribution lies between −1.645 and 1.645. The z-critical value is the cutoff
Example 9.2. * The production process for engine control housing units has
recently been modified. Historically, the hole diameters for bushings on these
mm. It is believed that the process modification has not affected the shape of
the distribution or the standard deviation, but the population mean diameter
µ may have changed. To assess this, a random sample of n = 40 housing units
was selected. The sample mean hole diameter was found to be x̄ = 5.426 mm.
Our goal is to construct a 90% confidence interval for the true average hole
diameter µ. Given: x̄ = 5.426, σ = 0.100, n = 40, confidence level = 90%
⇒ α = 0.10, and zα/2 = z0.05 = 1.645.
First, we compute the standard error (SE)
SE = σ/√n = 0.100/√40 ≈ 0.0158
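Next, we compute the margin of error (ME)
ME = zα/2 · SE = 1.645 × 0.0158 ≈ 0.026
so the 90% confidence interval is x̄ ± ME = 5.426 ± 0.026 = (5.400, 5.452).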
We are 90% confident that the true average hole diameter µ lies between 5.400
mm and 5.452 mm. The interval is relatively narrow due to the small standard
deviation (σ = 0.100) and reasonably large sample size (n = 40), indicating that
the mean has been estimated with good precision. This interval can be used to
assess whether the process modification has significantly changed the average
diameter.
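The arithmetic of Example 9.2 can be checked with a short Python sketch. It assumes Python 3.8 or later (for statistics.NormalDist), and the function name z_confidence_interval is illustrative, not something defined in these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, level):
    """Two-sided CI for the mean when the population sd sigma is known."""
    alpha = 1.0 - level
    z = NormalDist().inv_cdf(1.0 - alpha / 2)  # z_(alpha/2)
    standard_error = sigma / sqrt(n)
    margin = z * standard_error
    return xbar - margin, xbar + margin

# Values given in Example 9.2.
lower, upper = z_confidence_interval(xbar=5.426, sigma=0.100, n=40, level=0.90)
print(f"90% CI for the mean hole diameter: ({lower:.3f}, {upper:.3f}) mm")
```

Changing level to 0.95 or 0.99 in the call shows how the interval widens as the confidence level rises.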
Sample size The confidence interval (CI) for a population mean (when the
population standard deviation σ is known) is
x̄ ± zα/2 · σ/√n
This formula shows exactly how the CI depends on the sample size n. The
term σ/√n is the standard error. As n increases, √n increases, so the standard
error decreases. As a result, the margin of error zα/2 · σ/√n becomes smaller. This
makes the confidence interval tighter (narrower) around the sample mean x̄ and
thus the estimate of the population mean µ more precise.
If we could take an infinitely large sample, then
σ/√n → 0
and the interval would collapse to just the point estimate x̄. In practice, increasing
n shrinks the margin of error only in proportion to 1/√n, that is, rather
slowly. This means to halve the width of the confidence interval, we need to
R.
quadruple the sample size. A larger sample size leads to a narrower and more
Squaring both sides gives
n = (9.80)² = 96.04
Since the sample size must be an integer, we round up to the next whole number.
Thus, a sample size of n = 97 is required to ensure that the 95% confidence
interval for the mean response time has a width no greater than 10 milliseconds.
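The sample-size calculation can be sketched in Python as follows. The full statement of the example is not reproduced above, so σ = 25 ms is an assumed value, chosen because it gives √n = 2 · 1.96 · 25/10 = 9.80; the function name sample_size_for_width is likewise illustrative.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_width(sigma, width, level=0.95):
    """Smallest n for which the interval x̄ ± z·sigma/sqrt(n) has total width <= width."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2)
    # Total width is 2·z·sigma/sqrt(n); solve for n and round up.
    return ceil((2.0 * z * sigma / width) ** 2)

# sigma = 25 ms is an assumption (see the note above); width = 10 ms as in the text.
n_required = sample_size_for_width(sigma=25.0, width=10.0)
n_half_width = sample_size_for_width(sigma=25.0, width=5.0)
print(n_required, n_half_width)
```

Halving the target width roughly quadruples the required sample size, in line with the 1/√n behaviour of the standard error.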
Even when using a simple formula, it actually comes from this general
X̄ ± zα/2 · σ/√n
P (−zα/2 < (X̄ − µ)/(σ/√n) < zα/2 ) = 1 − α
This uses the general h(·)-based form and then solves for µ. So this framework
provides the foundation for why confidence interval formulas work.
To construct a confidence interval (CI) for a parameter θ (like µ, the pop-
ulation mean), we use a sample X1 , X2 , . . . , Xn . The idea is to find a random
variable h(X1 , . . . , Xn ; θ) that satisfies two important conditions:
1. It must depend on both the sample values and the parameter θ that we
are estimating.
2. The distribution of this variable must be completely known, i.e., it should
not depend on θ or any other unknown parameters.
This kind of variable is useful because we can make probability statements
about it without knowing the parameter we are trying to estimate.
Suppose the population is normal with known standard deviation σ, and
we want to estimate the mean µ. We use the sample mean X̄, and define the
variable
¶ Sections in red may be skipped.
h(X1 , . . . , Xn ; µ) = (X̄ − µ)/(σ/√n)
This is the standard Z-statistic, which has the standard normal distribution N (0, 1).
It depends on µ, yet its distribution is N (0, 1) regardless of what the true value of µ is. This makes it ideal
for constructing a confidence interval.
Likewise, we can construct a confidence interval for a general parameter θ.
Because we know the distribution of h, we can find values a and b such that
and standard deviation σ, and the interval captures the true value of µ with
probability 1 − α.
Hence we use the sample standard deviation s. We then replace the Z-score zα/2
with the t-score tα/2, n−1 , which adjusts for the additional variability introduced
X̄ ± tα/2, n−1 · s/√n
The t-interval is typically wider, especially for small sample sizes, reflecting the
increased uncertainty in estimating the population standard deviation.
The general method using h(X1 , . . . , Xn ; θ) gives a unified framework to
construct confidence intervals for any parameter and any distribution, as long
as the distribution of the statistic is known or can be derived. Depending on
whether we know σ, we either use a Z-distribution, or t-distribution. Thus
the framework is the blueprint behind every CI formula and becomes especially
powerful in advanced or custom settings.
deviation σ in Z by the sample standard deviation S to obtain the standardized
variable
T = (X̄ − µ)/(S/√n)
Now, both the numerator X̄ and the denominator S are random, since both are
based on the sample. For small sample sizes, the estimate S can fluctuate quite
a bit from the true σ. If S underestimates σ, the denominator is too small,
and T can take unusually large values. If S overestimates σ, T is compressed
near zero. This extra randomness inflates the probability of extreme T -values,
giving the distribution of random variable T heavier tails than the normal.
Figure 20: Density curves of the standard normal distribution and the t distribution with 2 df (horizontal axis t, vertical axis Density).
Figure 20 compares the standard normal distribution (blue solid line) with
the t-distribution with 2 degrees of freedom (red dashed line). Both distributions
are centered at 0 and symmetric, but the t-distribution has a lower peak and
noticeably heavier tails.
But as the sample size increases, the sample standard deviation S becomes a
better and better estimate of σ (by the Law of Large Numbers). So,
the extra variability introduced by using S instead of σ becomes negligible.
Therefore, for large n, the distribution of T approaches the standard normal
distribution and we can use Z-based confidence intervals.
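Both effects, heavier tails for small n and near-normal behaviour for large n, can be seen in a small simulation. The sketch below is illustrative only: it draws samples from N(0, 1), and the function name fraction_beyond, the seed, and the trial counts are arbitrary choices rather than anything prescribed in these notes.

```python
import random
from math import sqrt

def fraction_beyond(n, trials, cutoff=1.96, seed=2025):
    """Estimate P(|T| > cutoff) for T = (X̄ - mu)/(S/sqrt(n)), sampling from N(0, 1)."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
        t = xbar / (s / sqrt(n))  # the true mean mu is 0 in this simulation
        if abs(t) > cutoff:
            extreme += 1
    return extreme / trials

small_n = fraction_beyond(n=3, trials=20000)   # T has 2 degrees of freedom
large_n = fraction_beyond(n=100, trials=5000)  # T has 99 degrees of freedom
print(f"n=3:   fraction with |T| > 1.96 = {small_n:.3f}")
print(f"n=100: fraction with |T| > 1.96 = {large_n:.3f}")
```

For a standard normal variable the fraction beyond ±1.96 would be 0.05. The small-sample fraction comes out well above that, while the large-sample fraction is close to it.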
Proposition. If n is sufficiently large, the standardized variable
Z = (X̄ − µ)/(S/√n)
has approximately a standard normal distribution. This implies that
x̄ ± zα/2 · s/√n
is a large-sample confidence interval for µ with confidence level approximately
100(1 − α)%. This formula is valid regardless of the shape of the population
distribution.
In words, the CI is
Generally speaking, n > 40 will be sufficient to justify the use of this interval.
This is somewhat more conservative than the rule of thumb for the CLT because
of the additional variability introduced by using S in place of σ.
Example 9.4. An internet service provider (ISP) wants to estimate the true
average download speed its customers are experiencing. Due to practical con-
straints, it can’t test every customer’s speed, so it selects a random sample of
n = 64 customers. After measuring their speeds, it finds sample mean x̄ = 48.5
Mbps and sample standard deviation s = 5.2 Mbps. We want to construct a
95% confidence interval for the true average download speed µ that customers
experience.
Since the sample size n = 64 is large, we can use the normal approximation,
even though the population SD σ is unknown. So we use the following confidence
interval formula
x̄ ± zα/2 · s/√n
For 95% confidence, zα/2 = 1.96, so the interval is
48.5 ± 1.96 · 5.2/√64 = 48.5 ± 1.96 · 5.2/8 ≈ 48.5 ± 1.27 = (47.23, 49.77)
With 95% confidence, we estimate that the true average download speed
lies between 47.23 Mbps and 49.77 Mbps. The full width of the interval is only
about 2.55 Mbps, indicating a fairly tight interval. But this is not always the case.
The width depends on the variability in the data (reflected by s) and the nature
of the population distribution.
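As a check on Example 9.4, the following Python sketch recomputes the interval from the summary statistics. The function name large_sample_ci is illustrative, and the second call with n = 256 is a hypothetical scenario added only to show how the width reacts to the sample size.

```python
from math import sqrt
from statistics import NormalDist

def large_sample_ci(xbar, s, n, level=0.95):
    """Large-sample CI for the mean, using the sample sd s in place of sigma."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2)
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin

# Summary statistics from Example 9.4.
low, high = large_sample_ci(xbar=48.5, s=5.2, n=64)
width_64 = high - low

# Hypothetical: same mean and sd, but four times as many customers.
low_256, high_256 = large_sample_ci(xbar=48.5, s=5.2, n=256)
width_256 = high_256 - low_256

print(f"n=64:  ({low:.2f}, {high:.2f}), width {width_64:.2f} Mbps")
print(f"n=256: ({low_256:.2f}, {high_256:.2f}), width {width_256:.2f} Mbps")
```

With s held fixed, quadrupling n halves the width, so a larger s or a smaller n would produce a noticeably wider interval.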
A General Framework for Large Sample CI The large-sample intervals
like
x̄ ± zα/2 · s/√n
are special cases of a general large-sample confidence interval (CI) for a param-
eter θ. Suppose that θ̂ is an estimator satisfying the following conditions (1)
θ̂ has an approximately normal distribution, (2) θ̂ is (at least approximately)
unbiased, and (3) the standard deviation (standard error) of θ̂, denoted σθ̂ , is
known or can be estimated.
Standardizing θ̂ gives the random variable
Z = (θ̂ − θ)/σθ̂
which is approximately the standard normal N (0, 1). Therefore, the following
probability statement holds approximately
P (−zα/2 < (θ̂ − θ)/σθ̂ < zα/2 ) ≈ 1 − α
Solving these inequalities for θ gives the large-sample CI
θ̂ ± zα/2 · σθ̂
If σθ̂ involves other unknown parameters (but not θ), then we estimate it
by sθ̂ , the plug-in estimate. The resulting CI is
θ̂ ± zα/2 · sθ̂
a certain level, or a safety engineer might want to confirm that a measurement
does not exceed a maximum allowable threshold. In such cases, one-sided con-
fidence intervals, also known as confidence bounds, are more appropriate than
the usual two-sided intervals. These intervals provide either an upper or lower
limit for the parameter with a specified level of confidence, and are particularly
useful in threshold-based decisions or when prior knowledge justifies focusing
on one direction of error.
of the statistic becomes wider than the normal distribution. To account for
which are specifically designed for small-sample inference. Our focus here is on
t-distributions.
T = (X̄ − µ)/(S/√n)
9.4.1 Properties of t-Distribution
When estimating the population mean µ and the population standard deviation
σ is unknown, we use the sample standard deviation S. This leads to using the
t-distribution rather than the normal distribution for the standardized statistic
T = (X̄ − µ)/(S/√n)
This variable T does not follow the standard normal distribution when the
sample size n is small. Instead, it follows a t-distribution, which depends on a
single parameter called the degrees of freedom (df ). For a sample of size n,
the degrees of freedom is usually ν = n − 1.
s² = (1/(n − 1)) Σ_{i=1}^{n} (xi − x̄)²
Here, the sample mean x̄ is computed from the data and acts as a constraint. After
computing x̄, only n − 1 deviations from the mean can vary freely. So the df for
the sample variance or standard deviation is n − 1.
In general, when we plug the sample mean into another formula, it becomes
a constraint and we lose 1 degree of freedom.
Hence, the standard normal curve is often referred to as the t-curve with
infinite degrees of freedom.
Figure 21: Comparing Z-distribution with t-distributions
population parameters from a small sample and when the population standard
tα,ν is a number on the x-axis of the t-distribution such that the area to
t-critical value because it “cuts off” the tail area (e.g., the most extreme 5%) of
the t-curve. We use tα,ν in confidence intervals and hypothesis tests when the
One Sample t Confidence Interval We define a standardized variable
T = (X̄ − µ)/(S/√n)
x̄ ± tα/2,n−1 · s/√n.
x̄ + tα,n−1 · s/√n,
and replacing the + with − in this latter expression gives a lower confidence
Figure 23: Graphs of chi-squared density functions
instead of the true mean µ. This “costs” one degree of freedom and it is down
to n − 1.
10 Hypothesis Testing
When we collect sample data, the first step is often to compute a point estimate,
such as the sample mean X̄, to provide a single best guess for a population
parameter like the true mean µ. While this is simple and easy to report, it does
not convey any information about the uncertainty in the estimate. To quantify
this uncertainty, we construct a confidence interval (CI), which gives a range
of plausible values for the parameter. For example, a 95% CI for the average
blood pressure reduction from a new drug might be [2, 8] mmHg. This interval
indicates that, based on the sample data, we are 95% confident that the true
mean reduction lies somewhere within this range. All values inside the interval
are considered plausible, while those outside are unlikely.
However, in practice, decision-makers often face a specific question about
a particular value. For instance, a regulator may ask: “Does this drug reduce
blood pressure by at least 4 mmHg on average?” While the CI provides context
– showing that 4 mmHg is within the plausible range – it does not formally
quantify the strength of evidence for this specific claim. We can see that while
the average effect might exceed 4 mmHg, it could also be as low as 2 mmHg
which means the effect of the drug could be small. This is where hypothesis
testing (HT) comes in. Hypothesis testing is like a structured way of asking,
“Does the sample data support a specific claim about the population parame-
testing allows us to examine the strength of evidence (via the test statistic and
null hypothesis (H0 ) represents the baseline assumption or “status quo,” such
as “the teacher calls on boys and girls equally often” or “the population mean is
µ0 .” The alternative hypothesis (Ha ) represents the competing claim we want
to test, such as “the teacher favors boys” or “the mean is greater than µ0 .” The
basic idea of hypothesis testing is to use data to assess whether the observed
evidence is so unlikely under H0 that we should reject it in favor of Ha . This
is done by choosing a test statistic, comparing its observed value to what is
expected under H0 , and computing the probability (the p-value) of obtaining
results at least as extreme as the observed data if H0 were true.
You may think of hypothesis testing like a courtroom trial. The null hypoth-
esis (H0 ) is like assuming the defendant is innocent – it is the default position
we start with. The alternative hypothesis (Ha ) is like saying the defendant is
guilty. The job of the data (like the evidence in court) is to challenge the null
hypothesis. We don’t reject innocence unless the evidence is strong enough.
The test statistic is a summary of the data, like the key piece of evidence. The
p-value tells us how surprising this evidence would be if the null were actually
true. If that surprise is too large (the p-value is small), we reject H0 and ac-
cept that the alternative Ha has stronger support. If the evidence isn’t strong
enough, we fail to reject H0 – which doesn’t prove innocence, but means there
isn’t enough reason to overturn the status quo.
10.2 Form of Hypotheses in Testing
Suppose a manufacturing process has a defect rate of p = 0.10. A new process
is proposed and we want to check if it improves quality, i.e., “does the new
process reduce defects below 10%?”. So we start with the assumption (status
quo) that the defect rate is at least 0.10 which is framed as the null hypothesis
H0 : p ≥ 0.10. But testing this inequality is mathematically harder; the test
statistic’s distribution under the null is no longer a single, fully specified distribution.
Suppose we inspect n = 50 items from the new process and record the
number of defectives, call it X. Since each item is defective with probability p,
X ∼ Binomial(n = 50, p). Now look at the null hypothesis: If we test H0 : p ≥
0.10, then p could be 0.10, 0.15, 0.20, . . . But the distribution of X depends on
the actual p
as well. If we fail to reject H0 , it means that the sample data are not strong
enough to show that the defect rate is below 0.10. We cannot conclude that the
new process improves quality; the data are consistent with the current process
(status quo). Importantly, this does not prove that H0 is true – we just don’t
us inconclusive regarding the claim, but it respects the original assumption that
H0 : θ = θ0
where θ is the parameter of interest like µ and θ0 is the claimed value. Three
common forms of alternative hypotheses Ha are
1. Ha : θ > θ0 (Right-tailed test)
2. Ha : θ < θ0 (Left-tailed test)
3. Ha : θ ≠ θ0 (Two-tailed test)
The value θ0 used in both H0 and Ha is called the null value. It is the threshold
separating the null hypothesis from the alternative.
row. We first define a test statistic which is a function of the data chosen such
that its sampling distribution under H0 is known (or well approximated) and it
makes discrimination between H0 and Ha possible. The test statistic captures
how far the observed data are from what we would expect under H0 .
Here we choose the test statistic X which is the number of boys in n = 20
picks. Under the null hypothesis of fairness H0 : p = 0.5, we know exactly the
distribution of X as
X ∼ Bin(n = 20, p = 0.5).
The sampling distribution of the test statistic X shows us how it would behave
over many samples if H0 were true. In this example
P (X = k | H0 ) = (20 choose k) (0.5)^20 , k = 0, 1, . . . , 20.
direction.
This tells us that if the teacher were fair, then the probability of seeing
a result at least as extreme as 12 boys is fairly high. We have observed 12 boys and hence the
fairness assumption cannot be rejected.
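The tail probability behind this conclusion can be computed directly from the binomial distribution. The sketch below assumes the right-tailed alternative Ha : p > 0.5 (the teacher favors boys) and a significance level of α = 0.05; both are assumptions made for illustration, as is the function name binomial_upper_tail.

```python
from math import comb

def binomial_upper_tail(k_obs, n, p):
    """P(X >= k_obs) for X ~ Bin(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_obs, n + 1))

# Observed 12 boys in n = 20 picks; under H0 the teacher is fair (p = 0.5).
p_value = binomial_upper_tail(k_obs=12, n=20, p=0.5)

alpha = 0.05  # assumed significance level
decision = "reject H0" if p_value <= alpha else "do not reject H0"
print(f"p-value = {p_value:.4f}: {decision}")
```

Under these assumptions the p-value is about 0.25, so an outcome of 12 or more boys would occur in roughly one fair experiment out of four.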
We then choose a significance level α which is a threshold that determines how
small the p-value must be to reject H0 . Common values include 0.10, 0.05, 0.01,
and 0.001. We then execute the test procedure which is the following rule
that uses sample data to decide whether to reject the null hypothesis
If p-value ≤ α, reject H0 ;
if p-value > α, do not reject H0 .
a single sample outcome and the claimed value. However, there are scenarios
check each individually and are forced to summarize the sample with X̄.
mean X̄) with a hypothesized (or claimed) population value θ0 . Simply looking
(θ̂ − θ0)/SE(θ̂), i.e., test statistic = (observed difference)/(typical variability)
which standardizes the difference by dividing it by the standard error.
The test statistic is a dimensionless quantity that reflects the signal-to-noise
ratio. A large value for the statistic indicates that the observed mean value is
far from the claim θ0 relative to expected fluctuations (SE), providing strong
evidence against the claim, whereas a small statistic value indicates that the
observed difference could easily arise by chance, or simply due to noisy data.
Thus, the statistic incorporates both the magnitude of the observed difference
and the reliability of the estimate, enabling a fair assessment of how surprising
the sample mean is under the claim. This standardized framework allows us to
make confident decisions about the claim, regardless of the units or scale of the
data.
Depending on the situation, this statistic follows different reference distri-
butions. For example, a standard normal distribution (Z) when the standard
deviation is known (or n is large), or a t-distribution (T ) when it must be
estimated from the sample. These will be discussed in the later sections.
Estimation feeds into testing because we rarely have the entire population
data against which to check a claim. So, we often estimate first and then use that estimate to test
whether a specific assumption holds. Depending on the use case, estimation and
testing can be done by the same set of people or different sets. In most practical
settings, such as academic research, scientific studies, and business analytics, the
same person or team does both. In regulatory or auditing contexts, different
roles may be involved. A company estimates its own metrics (e.g., “Our average
delivery time is 2.5 days.”). An external auditor or regulator may test that claim
using independent sampling. So here, the estimator and the tester are different
entities.
test statistic under the null hypothesis H0 . A z-test refers to hypothesis testing
procedures where the test statistic follows the standard normal distribution.
Z = (X̄ − µ0)/(σ/√n)
When certain conditions are met, this statistic follows a standard normal dis-
tribution Z ∼ N (0, 1). This allows us to compute p-values as areas under the
standard normal (Z) curve.
The formula for the test statistic Z measures how many standard errors the
sample mean X̄ is from the claimed (or null) mean µ0 . σ is the population
standard deviation assumed to be known in a z-test. n is the sample size or
the number of observations in the sample. σ/√n is the standard error of the
mean. It measures the variability of X̄ across samples. A large absolute value
of Z indicates that the sample mean X̄ is far from the hypothesized mean µ0 ,
which may provide evidence against H0 .
Let it be clear that the sample data itself is not assumed to follow a normal
distribution. What matters is the distribution of the sample mean X̄, which
becomes approximately normal under the Central Limit Theorem if the sample
size is large.
The z-test is valid if the population is normal and σ is known, in which case
X̄ ∼ N (µ, σ²/n) exactly. If the population is not normal but the sample size n
is large, X̄ is approximately normal by the CLT.
We now ask: “If H0 : µ = µ0 were true, what is the probability of getting a
sample mean at least as extreme as the one our sample produced?”
The answer is the p-value, which connects the sample
to the population. We are testing whether the evidence is strong enough to
support Ha by trying to rule out H0 .
The right tailed hypothesis Ha : µ > µ0 says that the true population mean
is larger than the claimed mean. We now look for evidence in our data that
the sample mean X̄ is unusually large compared to µ0 . After standardization,
this corresponds to zobs , the numerical value of the test statistic computed from
your data. We now ask “what is the probability of getting a Z-value at least as
large as the one we observed?” That probability is exactly
p = P (Z ≥ zobs | H0 is true)
= 1 − P (Z ≤ zobs )
= 1 − Φ(zobs )
where Φ is the cumulative distribution function of the standard normal.
A small p-value implies that the observed data is very unlikely under H0
©
which implies that the evidence favors Ha . On the contrary, a large p-value
indicates that the observed data is plausible under H0 which implies that there
is insufficient evidence for Ha .
The p-value computation for the other cases of alternative hypotheses are
Figure 24: Determination of the p-value for a z test
the population’s shape. So, even if the population is not normally distributed,
we can still proceed with a z-test. In real-world problems, we often don’t know
Z ≈ (X̄ − µ0)/(s/√n)
Since n = 51 is large, the sampling distribution of X̄ is approximately normal
(by the Central Limit Theorem). The test statistic approximately follows a standard normal
distribution under H0
Z = (X̄ − µ0)/(s/√n)
The test statistic is computed as
Z = (2.06 − 2.0)/(0.141/√51) = 0.06/0.0197 ≈ 3.04
At the 1% significance level, there is strong evidence that the population mean
zinc mass exceeds 2.0 g. The data support the claim that the average zinc
mass is more than 2.0 g.
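The test statistic and its right-tail p-value for this example can be reproduced with a few lines of Python. This is a minimal sketch using the summary values quoted above; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics quoted in the example above.
xbar, mu0, s, n = 2.06, 2.0, 0.141, 51

z_obs = (xbar - mu0) / (s / sqrt(n))
# Right-tailed test: p-value is the area to the right of z_obs under N(0, 1).
p_value = 1.0 - NormalDist().cdf(z_obs)

alpha = 0.01
reject_h0 = p_value <= alpha
print(f"z = {z_obs:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

The p-value is far below 0.01, which is why H0 is rejected at the 1% level.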
tration resistance (in mm/blow). For a pavement type to be acceptable, the true
mean DCP value must be less than 30. A sample of n = 52 observations yielded
The population distribution is not normal, but n > 40, allowing use of the
z-test. Can we conclude at a significance level of α = 0.05 that the true average
Hypothesis
H0 : µ = 30 vs. Ha : µ < 30 (left tailed test)
Test statistic
Z = (x̄ − µ0)/(s/√n) = (28.76 − 30)/(12.2647/√52) ≈ −1.24/1.701 ≈ −0.73
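The p-value that goes with this statistic depends on the direction of Ha. The following Python sketch handles the three standard cases and applies the left-tailed one to the DCP data; the function name z_test_p_value and its tail labels are illustrative choices, not notation from these notes.

```python
from math import sqrt
from statistics import NormalDist

def z_test_p_value(z_obs, tail):
    """p-value of a z test; tail is 'right', 'left' or 'two'."""
    phi = NormalDist().cdf  # standard normal cdf
    if tail == "right":
        return 1.0 - phi(z_obs)
    if tail == "left":
        return phi(z_obs)
    if tail == "two":
        return 2.0 * (1.0 - phi(abs(z_obs)))
    raise ValueError("tail must be 'right', 'left' or 'two'")

# DCP example: H0: mu = 30 versus Ha: mu < 30, with n = 52.
z_dcp = (28.76 - 30) / (12.2647 / sqrt(52))
p_dcp = z_test_p_value(z_dcp, tail="left")
print(f"z = {z_dcp:.2f}, p-value = {p_dcp:.3f}, reject H0 at 0.05: {p_dcp <= 0.05}")
```

Here the p-value is roughly 0.23, well above α = 0.05, so on this calculation the data do not give strong evidence that the true mean DCP value is below 30.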
10.6 The One-Sample t Test
When the sample size n is small we cannot use z-tests, because the Central Limit
Theorem (CLT) doesn’t apply, i.e., the sampling distribution of the sample
mean may not be approximately normal. The population standard deviation σ
is usually unknown.
We do a t-test (refer to Sec. 9.4), but this requires the assumption that the
population distribution is approximately normal. Let the small sample X1 , X2 , . . . , Xn
be drawn from a normal population. Under H0 , the test statistic
T = (X̄ − µ0)/(S/√n)
follows a t-distribution with n − 1 degrees of freedom, T ∼ tn−1 .
The p-value is determined from the area under the tn−1 curve corresponding to
the observed value of T as shown in Fig.25. It depends on whether the test is
one-tailed or two-tailed.
When calculating p-values for t-tests, we need to find tail areas under a
t-distribution curve. For z-tests, this is easy. The z-table gives areas (i.e.,
cumulative probabilities) for many z-values (like 1.23, 2.71, etc.). But for
t-distributions, things are different. Each t-distribution depends on degrees
of freedom (df). So we would need a separate, full table of cumulative or
tail areas for each possible df, which is impractical. Generally, only the t-critical
values corresponding to a few common significance levels α =
0.10, 0.05, 0.025, 0.01, 0.005, 0.001, 0.0005, etc. are pre-computed and stored in
a t-table.
Example 10.3. * Carbon nanofibers have potential application as heat man-
agement materials, for composite reinforcement, and as components for nano-
electronics and photonics. The accompanying data on failure stress (MPa) of
fiber specimens is
300, 580, 312, 589, 327, 626, 368, 637, 400, 690, 425, 715, 470, 757, 556,
891, 573, 900, 575
Hypotheses. Let µ denote the true average failure stress (MPa). The hypotheses
are
H0 : µ = 500 (null hypothesis)
Ha : µ > 500 (alternative hypothesis; right-tailed test)
Test Statistic. Since the population standard deviation is unknown and the
sample size is small (n = 19), we use the t-test
t = (x̄ − µ0)/(s/√n) = (562.68 − 500)/(180.874/√19) = 62.68/41.495 ≈ 1.51
Degrees of Freedom. df = n − 1 = 19 − 1 = 18
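Instead of bracketing the p-value with a t-table, the tail area can be obtained by numerically integrating the t density. The sketch below uses only the Python standard library and Simpson's rule; the function names t_pdf and t_upper_tail and the number of integration steps are illustrative choices.

```python
from math import gamma, pi, sqrt
from statistics import mean, stdev

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t_obs, df, steps=2000):
    """P(T >= t_obs), using Simpson's rule on [0, |t_obs|] (steps must be even)."""
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3          # P(0 <= T <= |t_obs|)
    upper = 0.5 - central            # by symmetry about 0
    return upper if t_obs >= 0 else 1.0 - upper

# Failure stress data (MPa) from Example 10.3.
data = [300, 580, 312, 589, 327, 626, 368, 637, 400, 690,
        425, 715, 470, 757, 556, 891, 573, 900, 575]
n = len(data)
t_obs = (mean(data) - 500) / (stdev(data) / sqrt(n))
p_value = t_upper_tail(t_obs, df=n - 1)
print(f"t = {t_obs:.2f}, df = {n - 1}, p-value = {p_value:.3f}")
```

The computed right-tail area should land close to the table-based value used in the example.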
At the α = 0.05 significance level since P -value = 0.075 > 0.05, we fail to reject
A Type I error occurs when we reject the null hypothesis H0 although it is actually
true. Formally, P (Type I error) = P (Reject H0 | H0 true). Suppose we are
testing H0 : µ = µ0 . If H0 is true, the test statistic has a known distribution
Z = (X̄ − µ0)/(σ/√n) ∼ N (0, 1), (if σ known),
or
T = (X̄ − µ0)/(S/√n) ∼ tn−1 , (if σ unknown).
We reject H0 when the test statistic is “too extreme.” For example, in a
right-tailed Z-test, Ha : µ > µ0 , at level α = 0.05, we reject if zobs > z0.05 = 1.645, where z0.05
is the 95th percentile of the standard normal distribution (upper-tail area 0.05). By construction,
is uniquely defined. Contrast this with Ha which has many possible parameter
values, so the Type II error depends on which alternative is true. A Type II error
occurs when the null hypothesis H0 is false (that is, some alternative hypothesis
10.8 Chi-Squared Test
Traditional hypothesis tests focus on numerical/continuous outcomes, e.g., test-
ing whether a sample mean differs from a hypothesized population mean. They
rely on assumptions like normality and require quantitative measurements. However,
some datasets are categorical, e.g., blood type, education level, survey
responses. Categorical data cannot be averaged. For example, blood types O, A,
B, AB have no meaningful arithmetic mean. So standard testing methods are
not applicable here. Consider a sample of 10 people with blood types
and expected Ei counts indicate that the null hypothesis does not fit the data,
of a finite number of categories (e.g., blood type: O, A, B, AB), and the null
based on the difference between observed and expected counts, follows a chi-
squared distribution.
In more complex situations, the goodness-of-fit test for composite hy-
potheses applies when the category probabilities depend on one or more
unknown parameters (like a probability u, or a mean µ) (e.g., p1 = µ², p2 =
2µ(1 − µ), p3 = (1 − µ)²). These parameters are estimated from the data before
conducting the test. This is used to test if data fit a specific family of distribu-
tions, like Poisson (estimate λ) or normal (estimate µ, σ). Chi-squared methods
are also used with contingency tables, which involve two categorical variables.
In the test of homogeneity, the goal is to compare distributions across different
populations, for example, “do 3 hospitals have the same distribution of patient
types?” The test of independence, on the other hand, examines whether two
categorical variables are independent within a single population, for example,
“is religion independent of political affiliation?” All these tests rely on
comparing observed and expected frequencies and assessing whether any significant
deviation is due to chance or indicates a real effect.
10.8.1 Goodness-of-Fit Tests When Category Probabilities Are Completely Specified
A multinomial experiment generalizes a binomial experiment by allowing each
trial to result in one of k possible outcomes or categories, where k > 2. For
instance, if a store accepts three types of credit cards, observing which card
type is used by each of the next n customers forms a multinomial experiment.
The probability that a trial results in category i is denoted by pi , and the null
hypothesis H0 specifies the values of all pi ’s, such as H0 : p1 = 0.5, p2 =
0.3, p3 = 0.2. The alternative hypothesis Ha asserts that at least one of the pi
differs from the null values (in which case at least two must be different, since
they sum to 1). The symbol pi0 represents the value of pi claimed by the null
hypothesis. In the example just given, p10 = 0.5, p20 = 0.3, and p30 = 0.2.
Let the random variable Ni represent the number of observations falling into
category i, and let ni denote its observed value. Since each trial results in exactly one
outcome, Σ Ni = n, and similarly Σ ni = n. As an example, an experiment
with n = 100 and k = 3 might yield N1 = 46, N2 = 35, and N3 = 19.
Under H0 , the expected number of observations in category i is E(Ni ) =
npi0 . For example, with n = 100 and H0 : p1 = 0.5, p2 = 0.3, p3 = 0.2, the
expected counts are E(N1 ) = 100 × 0.5 = 50, E(N2 ) = 100 × 0.3 = 30, and
E(N3 ) = 100 × 0.2 = 20. The observed (ni ) and expected (E(Ni )) frequencies
are often displayed in a table as in Table 4. The chi-squared goodness-of-fit test
is used to assess whether the observed frequencies deviate significantly from the
named because it tests how well the observed data fit a theoretical probability
model.
These can be summed into an overall measure
Σ (ni − npi0 )²
However, this unadjusted sum can be misleading. For instance, if np10 = 100
and np20 = 10, and the observed values are n1 = 95 and n2 = 5, both terms
contribute equally to the sum
(95 − 100)² = 25, (5 − 10)² = 25
Yet, relatively speaking, n1 is only 5% less than expected, while n2 is 50%
less. To account for this imbalance, each squared deviation is divided by its
corresponding expected count. This leads to the test statistic for the chi-square
goodness-of-fit test
χ² = Σ (ni − npi0 )² / npi0
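The effect of dividing by the expected count, and the statistic itself, can be illustrated with the numbers used above. This is a minimal Python sketch; the function name chi_squared_statistic is illustrative.

```python
def chi_squared_statistic(observed, null_probabilities):
    """Sum of (n_i - n*p_i0)^2 / (n*p_i0) over all categories."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, null_probabilities))

# Scaling example from the text: equal squared deviations, unequal expected counts.
term_large_cell = (95 - 100) ** 2 / 100   # expected count 100
term_small_cell = (5 - 10) ** 2 / 10      # expected count 10

# Multinomial example from the text: n = 100, H0: p1 = 0.5, p2 = 0.3, p3 = 0.2.
chi2 = chi_squared_statistic([46, 35, 19], [0.5, 0.3, 0.2])
print(term_large_cell, term_small_cell, round(chi2, 4))
```

After scaling, the cell that is 50% below its expected count contributes ten times as much as the cell that is only 5% below, which is the imbalance the unadjusted sum failed to show.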
The chi-squared distribution has a single parameter ν, called the degrees of
freedom (df), where ν takes positive integer values: 1, 2, 3, . . . . If a random
variable Y follows a chi-squared distribution with ν degrees of freedom, i.e.,
Y ∼ χ²(ν), then the expected value and variance of Y are given by
E(Y ) = ν and Var(Y ) = 2ν
The shape of the χ2 density curve is positively skewed. However, as the degrees
out further to the right. This behavior illustrated in Fig.26 denotes a typical
The fact that the degrees of freedom (df) (refer to Sec. 9.4.1 for the definition)
equal k − 1, where k is the number of categories, arises from the constraint that
the total number of observations is fixed, i.e., Σ Ni = n. Although there are k
observed cell counts, once any k − 1 of them are known, the remaining count
is uniquely determined. Therefore, there are only k − 1 values that are freely
determined, giving k − 1 degrees of freedom.
Interpretation of the Chi-Squared Test Statistic. The chi-squared test statistic
is defined as
χ²obs = Σ (ni − npi0 )² / npi0 ,
where ni are the observed cell frequencies and npi0 are the expected cell
frequencies under the null hypothesis H0 . A small value of χ²obs indicates that the
observed frequencies are close to the expected ones, which is consistent with
H0 . A large value of χ²obs suggests a substantial discrepancy between the
observed and expected frequencies, providing evidence against H0 . We now want
to see how likely it is to get that value or anything more extreme under H0 .
We cannot just look at the observed statistic in isolation and make a formal
decision on H0 , because the observed discrepancy could reasonably occur just
by random chance. We want to check whether the discrepancy reflects a systematic
feature of the population. For this we compute the p-value, which gives a
probabilistic measure of evidence against H0 , rather than a subjective “seems
large” judgment.
We are only interested in whether $\chi^2$ is significantly large. Hence the
chi-squared test is a right-tailed test. The p-value is the area under the chi-squared
distribution curve to the right of the observed $\chi^2_{\text{obs}}$ value.
We can use a chi-squared table to look up the probability. Suppose the observed
chi-squared statistic is $\chi^2_{\text{obs}} = 12.83$ with degrees of freedom df = 5.
A typical chi-squared table provides critical values
for selected upper-tail probabilities.
Here, $\chi^2_{\text{obs}} = 12.83$ falls exactly at the column for upper-tail probability 0.025,
so the p-value is approximately 0.025. When the observed statistic falls
between two critical values, one can interpolate to obtain a more precise p-value.
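Instead of a table lookup, the right-tail area can be computed directly. The sketch below is one possible stdlib-only way to do this for integer degrees of freedom; the helper name `chi2_sf` is hypothetical, and it relies on the recurrence $Q(a+1, z) = Q(a, z) + z^a e^{-z}/\Gamma(a+1)$ for the regularized upper incomplete gamma function, starting from $Q(1/2, z) = \operatorname{erfc}(\sqrt{z})$ or $Q(1, z) = e^{-z}$.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for X ~ chi-squared(df).

    Hypothetical helper for illustration; df must be a positive integer.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)                  # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))       # Q(1/2, z)
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Values from the example in the text.
p_value = chi2_sf(12.83, 5)
print(round(p_value, 4))  # close to the tabulated upper-tail probability 0.025
```

In practice a statistics library would normally be used for this step; the hand-written helper only keeps the example free of external dependencies.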
We compute the chi-squared statistic
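\[
\begin{aligned}
\chi^2 &= \sum_{i=1}^{4} \frac{(n_i - np_{i0})^2}{np_{i0}} \\
&= \frac{(926 - 906.2)^2}{906.2} + \frac{(288 - 302.1)^2}{302.1} + \frac{(293 - 302.1)^2}{302.1} + \frac{(104 - 100.7)^2}{100.7} \\
&= 0.433 + 0.658 + 0.274 + 0.108 \\
&= 1.473
\end{aligned}
\]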
Since the p-value is large, we fail to reject H0 . The observed frequencies are
quite consistent with Mendel’s laws of inheritance.
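The computation above can be reproduced in a few lines. This is a sketch under stated assumptions: the null proportions 9:3:3:1 are inferred from the expected counts shown above (they are not restated in this part of the notes), and `chi2_sf` is a hypothetical stdlib-only helper for the right-tail area with integer degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for X ~ chi-squared(df), integer df.

    Hypothetical helper for illustration.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

observed = [926, 288, 293, 104]   # observed counts from the worked example
ratios = [9, 3, 3, 1]             # ASSUMPTION: 9:3:3:1 null proportions

n = sum(observed)
expected = [n * r / sum(ratios) for r in ratios]

chi2_stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1            # k - 1 degrees of freedom
p_value = chi2_sf(chi2_stat, df)

print(round(chi2_stat, 3), df, round(p_value, 3))
```

The statistic computed this way differs slightly in the third decimal from the hand calculation, because the hand calculation rounds the expected counts to one decimal place. The conclusion is unchanged: the p-value is large.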
Suppose the category probabilities $p_i$ are not specified directly but are assumed
to depend on a smaller number of unknown parameters. A
specific hypothesis about these parameters then determines explicit values $p_{i0}$.
\[
\chi^2 = \sum_{i=1}^{k} \frac{(n_i - np_{i0})^2}{np_{i0}} .
\]
\[
p_{i0} = \binom{4}{i-1} \left(\frac{9}{16}\right)^{i-1} \left(\frac{7}{16}\right)^{4-(i-1)}, \qquad i = 1, \dots, 5 .
\]
\[
\chi^2 = \sum_{i=1}^{5} \frac{(n_i - np_{i0})^2}{np_{i0}}
\]
The p-value for $\chi^2_{\text{obs}}$ exceeds the significance level 0.10, so we fail to reject $H_0$.
There is no compelling evidence against the hypothesis that the number of
YR seeds in a pod follows a binomial distribution with success probability $9/16$. The observed
frequencies are consistent with Mendelian inheritance.
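The hypothesized cell probabilities $p_{i0}$ for this example follow directly from the binomial formula with 4 trials and success probability $9/16$. A minimal sketch, with the function name `binomial_cell_probs` chosen here for illustration:

```python
from math import comb

def binomial_cell_probs(trials, p):
    """Return P(X = x) for x = 0, ..., trials when X ~ Binomial(trials, p)."""
    return [comb(trials, x) * p ** x * (1 - p) ** (trials - x)
            for x in range(trials + 1)]

# Cell i in the text corresponds to x = i - 1 successes, i = 1, ..., 5.
cell_probs = binomial_cell_probs(4, 9 / 16)

for i, prob in enumerate(cell_probs, start=1):
    print(i, round(prob, 4))
```

Multiplying each probability by the sample size $n$ gives the expected counts $np_{i0}$ that enter the statistic.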
classified by payment type – Cash, UPI, Credit, or Debit. A random
sample of transactions is summarized below
ables (e.g., department they visited and payment method). Each person’s
ment visited and the payment method used by customers are related. A
Sample Data: I × J Contingency Table with Row and Column Totals
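Under the null hypothesis of homogeneity, each of the $I$ populations has the
same distribution across the $J$ categories; that is,
\[
H_0:\ p_{1j} = p_{2j} = \cdots = p_{Ij} = p_j, \qquad j = 1, \dots, J,
\]
where $p_{ij}$ denotes the proportion of population $i$ that falls in category $j$.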
Since $p_j$ is the common category proportion, the expected count in cell $(i, j)$
is $E(n_{ij}) = n_{i\cdot}\, p_j$.
If the true population probabilities for each category are known, then the
expected count in a cell can be computed directly. But in practice, the true
category proportions $p_j$ are almost never known. Under the null hypothesis of
homogeneity, we claim that each population has the same proportions $p_1, \dots, p_J$,
but we do not know their actual values. Instead we have the observed row totals
$n_{i\cdot}$ and column totals $n_{\cdot j}$. We therefore estimate $p_j$ from the data itself:
\[
\hat{p}_j = \frac{n_{\cdot j}}{n},
\]
where $n_{\cdot j}$ is the total count in category $j$ across all populations, and $n$ is the
grand total. Then the expected count is
\[
\hat{E}(n_{ij}) = n_{i\cdot}\, \hat{p}_j = n_{i\cdot}\, \frac{n_{\cdot j}}{n} = \frac{n_{i\cdot} \times n_{\cdot j}}{n} .
\]
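The estimated expected counts depend only on the row totals, column totals, and grand total. The sketch below illustrates this on a small table; the counts are hypothetical and the function name `expected_counts` is chosen here for illustration.

```python
def expected_counts(table):
    """Estimated expected counts: (row total * column total) / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# HYPOTHETICAL data: 2 populations (rows) by 3 categories (columns).
observed = [[20, 30, 50],
            [30, 30, 40]]

expected = expected_counts(observed)
for row in expected:
    print(row)
```

Because both rows of this hypothetical table have the same total, the estimated expected counts are identical across rows, which is exactly what homogeneity asserts.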
This gives a data-driven estimate of what the count would be under H0 .
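We compute the chi-squared test statistic as
\[
\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - \hat{E}_{ij})^2}{\hat{E}_{ij}},
\]
where $\hat{E}_{ij} = \hat{E}(n_{ij})$ is the estimated expected count in cell $(i, j)$.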
3 33 26 16 14 11 100
We wish to test
versus
Table 6: Contingency Table with Expected Counts and Chi-Square Contributions
hence the p-value is 0.005 < p < 0.010, and an algorithmic computation gives
Here $p_{i\cdot}$ is the probability that a randomly selected individual falls in category
$i$ of factor 1 and $p_{\cdot j}$ is the probability that a randomly selected individual falls
in category $j$ of factor 2. The null hypothesis of independence
states that an individual's category with respect to factor 1 is independent of
the category with respect to factor 2, so that $p_{ij} = p_{i\cdot}\, p_{\cdot j}$ for every pair $(i, j)$.
\[
\chi^2_{\text{obs}} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{\left(n_{ij} - \hat{E}(n_{ij})\right)^2}{\hat{E}(n_{ij})}
\]
\[
P = P\left(\chi^2_{df} \ge \chi^2_{\text{obs}}\right).
\]
A large $\chi^2_{\text{obs}}$ (or small p-value) suggests dependence between factors. A small
$\chi^2_{\text{obs}}$ (or large p-value) is consistent with independence.
* Taken from Probability and Statistics for Engineering and the Sciences by Jay Devore, 9th
edition.
Table 7: Observed counts, expected counts (in parentheses), and $\chi^2$ contributions
Education          Q1           Q2           Q3           Q4     Total
E1                422          433          429          414      1698
             (411.63)     (444.79)     (422.64)     (418.93)
                0.261        0.313        0.096        0.058
E2               1493         1655         1556         1605      6309
            (1529.44)    (1652.65)    (1570.35)    (1556.56)
                0.868        0.003        0.131        1.508
E3               1239         1276         1243         1179      4937
            (1196.84)    (1293.25)    (1228.85)    (1218.06)
                1.485        0.230        0.163        1.252
E4                 61          110           73           74       318
              (77.09)      (83.30)      (79.15)      (78.46)
                3.358        8.558        0.478        0.253
Total            3215         3474         3301         3272     13262
In this example, for $\chi^2_9$, the expected value is 9, whereas the test statistic value is 19.016. A naive
comparison of the two shows that the test statistic value greatly exceeds
what would be expected if the two factors were independent. But this comparison
alone is not a formal test. We
decide this formally by comparing the test statistic value with the critical values
corresponding to the chosen significance level. For df = 9, the right-tail critical
values are
Significance level ($\alpha$)    Critical value $\chi^2_{\alpha,9}$
0.05                         16.919
0.01                         21.666
Hence, we reject $H_0$ at $\alpha = 0.05$, since $19.016 > 16.919$: the evidence suggests
dependence between education level and weight gain quartile.
We fail to reject $H_0$ at $\alpha = 0.01$, since $19.016 < 21.666$: there is no strong
evidence of dependence at the 1% level.
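The whole test for Table 7 can be reproduced from the observed counts alone. The following is a sketch, not a reference implementation: `chi2_sf` is a hypothetical stdlib-only helper for the right-tail area with integer degrees of freedom, and the critical values are the two quoted above.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for X ~ chi-squared(df), integer df.

    Hypothetical helper for illustration.
    """
    if x <= 0:
        return 1.0
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    while a < df / 2.0:
        q += math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
        a += 1.0
    return q

# Observed counts from Table 7 (rows E1..E4, columns Q1..Q4).
observed = [[422, 433, 429, 414],
            [1493, 1655, 1556, 1605],
            [1239, 1276, 1243, 1179],
            [61, 110, 73, 74]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Sum of (observed - expected)^2 / expected over all cells.
chi2_stat = sum(
    (observed[i][j] - row_totals[i] * col_totals[j] / n) ** 2
    / (row_totals[i] * col_totals[j] / n)
    for i in range(len(row_totals))
    for j in range(len(col_totals))
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = chi2_sf(chi2_stat, df)

reject_at_05 = chi2_stat > 16.919   # critical value for alpha = 0.05, df = 9
reject_at_01 = chi2_stat > 21.666   # critical value for alpha = 0.01, df = 9
print(round(chi2_stat, 3), df, round(p_value, 4), reject_at_05, reject_at_01)
```

The p-value falls between 0.01 and 0.05, which agrees with rejecting $H_0$ at the 5% level but not at the 1% level.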