Essential Probability for ML:
Events and Probability
ECEN 250: Machine Learning, 1
Resources
• Most of the materials and figures for this lecture are borrowed from:
Introduction to Probability for Computing, M. Harchol-Balter.
Sample Space
• Probability is defined in terms of some experiment
• Experiment of tossing a coin
• Experiment of rolling a die
• Experiment of rolling a die twice
• Experiment of conducting a survey (who will be the next president?)
• Experiment of measuring the temperature at 2PM in College Station in August
• Sample space is the set of all possible outcomes of an experiment
• Experiment of tossing a coin: Ω = {H, T}
• Experiment of rolling a die: Ω = {1, 2, …, 6}
• Experiment of rolling a die twice: Ω = {(1,1), (1,2), …, (6,6)}
• Experiment of measuring the temperature: Ω = [60, 120]
Events
• An event, 𝐸, is any subset of the sample space, Ω
• Experiment of tossing a coin, event of getting H
• E = {H}
• Experiment of rolling a die, event of getting an odd number
• E = {1, 3, 5}
• Experiment of rolling a die twice, event of getting identical numbers in each roll
• E = {(1,1), (2,2), (3,3), …, (6,6)}
• Experiment of measuring the temperature, event of temperature between 80 and 90
• E = [80, 90]
Sample Space and Events
Probability is defined in terms of some experiment.
Ω = Sample space of the experiment = Set of all possible outcomes
Defn: An event, 𝐸, is any subset of the sample space, Ω.
Example: Roll die twice
Q: What does event 𝐸₁ represent?
Q: What is 𝐸₁ ∪ 𝐸₂?
Q: What is $\overline{E_1}$?
Q: Are 𝐸₁ and 𝐸₂ independent? (we’ll see)
"Introduction to Probability for Computing", Harchol-Balter '24 5
Probability Defined on Events
• Probability of an event E, denoted as P(E), is the probability that the outcome of
the experiment lies in set E
• Experiment of tossing a coin, event of getting H
• E = {H}, P(E) =
• Experiment of rolling a die, event of getting an odd number
• E = {1, 3, 5}, P(E) =
• Experiment of rolling a die twice, event of getting identical numbers in each roll
• E = {(1,1), (2,2), (3,3), …, (6,6)}, P(E) =
• Experiment of measuring the temperature, event of temperature within 80 and 90
• E = [80, 90], P(E) =
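A minimal sketch of how the first three blanks could be filled in, assuming every outcome is equally likely (the slides do not say the coin and die are fair, so this is an assumption). Under that assumption P(E) = |E| / |Ω|. The helper name `prob` is ours. The temperature example is left out because it needs a distribution over the interval, which is not given.

```python
from fractions import Fraction
from itertools import product

def prob(event, omega):
    """P(E) = |E| / |Omega|, valid only when all outcomes are equally likely."""
    assert event <= omega, "an event must be a subset of the sample space"
    return Fraction(len(event), len(omega))

coin = {"H", "T"}
die = set(range(1, 7))
two_rolls = set(product(die, repeat=2))   # 36 ordered pairs

p_heads = prob({"H"}, coin)                       # event of getting H
p_odd = prob({1, 3, 5}, die)                      # event of an odd number
p_same = prob({(i, i) for i in die}, two_rolls)   # identical numbers in each roll

print(p_heads, p_odd, p_same)   # 1/2 1/2 1/6
```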
Three Axioms of Probability
• Non-negativity:
𝐏(E) ≥ 0 for any event E
• Additivity:
If A₁ and A₂ are disjoint events, then 𝐏(A₁ ∪ A₂) = 𝐏(A₁) + 𝐏(A₂)
• Normalization:
𝐏(Ω) = 1
Consequences of the Probability Axioms
• Result 1:
𝐏(A ∪ B) = 𝐏(A) + 𝐏(B) − 𝐏(A ∩ B)
• Corollary of Result 1:
𝐏(A ∪ B) ≤ 𝐏(A) + 𝐏(B)
• Result 2:
$\mathbf{P}(\overline{A}) = 1 - \mathbf{P}(A)$, where $\overline{A}$ is the complement of A
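The results above can be checked on a small example. This is only a sketch: the fair die and the events `A` and `B` are illustrative choices of ours, not taken from the slides.

```python
from fractions import Fraction

omega = set(range(1, 7))      # one roll of a die, assumed fair
A = {1, 3, 5}                 # odd number
B = {1, 2}                    # at most 2

def P(event):
    """Probability under equally likely outcomes."""
    return Fraction(len(event), len(omega))

# Result 1: P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
assert P(A.union(B)) == P(A) + P(B) - P(A.intersection(B))
# Corollary: P(A ∪ B) ≤ P(A) + P(B)
assert P(A.union(B)) <= P(A) + P(B)
# Result 2: probability of the complement is 1 − P(A)
assert P(omega - A) == 1 - P(A)

print(P(A.union(B)), P(A) + P(B))   # 2/3 5/6
```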
Conditional Probability on Events
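The conditional probability of event A given event B is
$$\mathbf{P}(A \mid B) = \frac{\mathbf{P}(A \cap B)}{\mathbf{P}(B)},$$
assuming $\mathbf{P}(B) > 0$.
Two equivalent views (42 equally likely outcomes in total, 10 of them in B, 2 of those also in A):
• View 1: $\mathbf{P}(A \mid B) = \frac{2}{10}$ (of the 10 outcomes in set 𝐵, only 2 of these are in set 𝐴)
• View 2: $\mathbf{P}(A \mid B) = \frac{\mathbf{P}(A \cap B)}{\mathbf{P}(B)} = \frac{2/42}{10/42} = \frac{2}{10}$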
Sandwich choices:
• 1st half of week: Mon – Jelly, Tues – Cheese, Wed – Turkey
• 2nd half of week: Thur – Cheese, Fri – Turkey, Sat – Cheese, Sun – None
Q: What is 𝐏(Cheese | 2nd half of week)? Argue this from 2 views.
• View 1: $\mathbf{P}(\text{Cheese} \mid \text{2nd half}) = \frac{2}{4}$ (of the 4 days in 2nd half, 2 are cheese sandwiches)
• View 2: $\mathbf{P}(\text{Cheese} \mid \text{2nd half}) = \frac{\mathbf{P}(\text{Cheese} \cap \text{2nd half})}{\mathbf{P}(\text{2nd half})} = \frac{2/7}{4/7} = \frac{2}{4}$
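The two views of the sandwich example can be sketched in code. The dictionary layout and variable names are ours. Each day is treated as equally likely, which is what the 2/7 and 4/7 in the second view imply.

```python
from fractions import Fraction

week = {
    "Mon": "Jelly", "Tues": "Cheese", "Wed": "Turkey",
    "Thur": "Cheese", "Fri": "Turkey", "Sat": "Cheese", "Sun": "None",
}
second_half = {"Thur", "Fri", "Sat", "Sun"}
cheese = {day for day, sandwich in week.items() if sandwich == "Cheese"}

# View 1: shrink the sample space to the conditioning event and count.
view1 = Fraction(len(cheese & second_half), len(second_half))

# View 2: ratio of two probabilities computed over the full week.
p_joint = Fraction(len(cheese & second_half), len(week))   # 2/7
p_second = Fraction(len(second_half), len(week))           # 4/7
view2 = p_joint / p_second

print(view1, view2)   # 1/2 1/2
```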
Chain Rule of Probability
Chain rule of probability:
𝐏(A ∩ B) = 𝐏(A|B) ⋅ 𝐏(B)
• This follows directly from the definition of conditional probability, $\mathbf{P}(A \mid B) = \frac{\mathbf{P}(A \cap B)}{\mathbf{P}(B)}$
Independent Events
Events 𝐴 and 𝐵 are independent, if: 𝑷(𝐴 ∩ 𝐵) = 𝑷(𝐴) ⋅ 𝑷(𝐵)
• This follows from the chain rule, 𝐏(A ∩ B) = 𝐏(A|B) ⋅ 𝐏(B), and the intuitive idea
of independence: knowing that B occurred does not change the probability of A, i.e., 𝐏(A|B) = 𝐏(A)
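A small sketch of testing the product condition directly. The two-dice sample space and the three events are illustrative choices of ours, and the dice are assumed fair.

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))   # two rolls, assumed fair

def P(event):
    return Fraction(len(event), len(omega))

def independent(A, B):
    """True when P(A ∩ B) = P(A) · P(B)."""
    return P(A & B) == P(A) * P(B)

first_even = {w for w in omega if w[0] % 2 == 0}
sum_is_7 = {w for w in omega if sum(w) == 7}
sum_is_8 = {w for w in omega if sum(w) == 8}

print(independent(first_even, sum_is_7))   # True
print(independent(first_even, sum_is_8))   # False
```

Exact fractions are used so that the equality test is not disturbed by floating-point rounding.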
Independent Events: Examples
• Coin Tosses: Tossing two coins. The outcome of the first coin (heads or tails) does
not affect the outcome of the second coin, so the two outcomes are independent events.
• Weather and Stock Prices: The rainy/sunny weather on a given day and the
performance of a particular stock on the same day are generally independent
events.
• Drawing Cards with Replacement: Drawing a card from a deck, recording the result,
and then replacing it before drawing again. Each draw is independent since the
outcome of one draw doesn't affect the next.
• Exam grade and phone battery charge: The grade a student receives on a test is
independent of the student's phone battery charge.
Practice with Independent Events
You are routing a packet from the source to the destination.
But each of the 16 edges in the network only works with probability 𝑝.
Q: What is the probability that you can get the packet from the source to the destination?
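Each edge works with probability 𝑝. There are 8 paths.
Let $E_i$ denote the event that the $i$-th path is usable (not broken).
Q: What is $\mathbf{P}(E_i)$? Q: What is $\mathbf{P}(\overline{E_1})$?
Q: What is $\mathbf{P}(\text{Can get from source to destination})$?
$$
\begin{aligned}
\mathbf{P}(\text{Can get from source to destination}) &= \mathbf{P}(\text{At least one path works}) \\
&= \mathbf{P}(E_1 \cup E_2 \cup \cdots \cup E_8) \\
&= 1 - \mathbf{P}(\text{All paths are broken}) \\
&= 1 - \mathbf{P}(\overline{E_1}) \cdot \mathbf{P}(\overline{E_2}) \cdots \mathbf{P}(\overline{E_8}) = 1 - (1 - p^2)^8
\end{aligned}
$$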
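The closed form can be compared against a simulation. The network figure is not reproduced here, so this sketch assumes what the final formula implies: 8 paths that share no edges, 2 edges per path (16 edges in total), and edges that work independently. The function names are ours.

```python
import random

def p_connected(p, n_paths=8, edges_per_path=2):
    """Closed form: 1 - (1 - p^2)^8 under the assumptions stated above."""
    return 1 - (1 - p ** edges_per_path) ** n_paths

def simulate(p, trials=50_000, n_paths=8, edges_per_path=2, seed=0):
    """Monte Carlo estimate: a trial succeeds if some path has all its edges working."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if any(all(rng.random() < p for _ in range(edges_per_path))
               for _ in range(n_paths)):
            hits += 1
    return hits / trials

print(p_connected(0.5))   # exact value from the formula
print(simulate(0.5))      # should land close to the exact value
```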
"Introduction to Probability for Computing", Harchol-Balter '24 17
Problem
• The offspring of a horse is called a foal. A horse couple has at most one foal at a time. Each foal is equally
likely to be a “colt” or a “filly.” We are told that a horse couple has two foals, and at least one of these is a
colt. Given this information, what’s the probability that both foals are colts?
• Question: What is P {both are colts | at least one is a colt}?
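One way to work the problem is to enumerate the sample space and apply the definition of conditional probability. The sketch below assumes the two foals are independent and each is a colt with probability 1/2, as the problem states. The variable names are ours.

```python
from fractions import Fraction
from itertools import product

# Sample space: (first foal, second foal), four equally likely outcomes.
omega = set(product(["colt", "filly"], repeat=2))

both_colts = {w for w in omega if w == ("colt", "colt")}
at_least_one = {w for w in omega if "colt" in w}

# P(A | B) = P(A ∩ B) / P(B); the common factor 1/|Ω| cancels.
answer = Fraction(len(both_colts & at_least_one), len(at_least_one))
print(answer)   # 1/3
```

The conditioning event keeps three of the four outcomes, and only one of those three has two colts.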
Law of Total Probability
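For any sets A and B, we have
$$A = (A \cap B) \cup (A \cap \overline{B})$$
$$
\begin{aligned}
\mathbf{P}(A) &= \mathbf{P}\big((A \cap B) \cup (A \cap \overline{B})\big) \\
&= \mathbf{P}(A \cap B) + \mathbf{P}(A \cap \overline{B}) && \text{(since } A \cap B \text{ and } A \cap \overline{B} \text{ are disjoint)} \\
&= \mathbf{P}(A \mid B)\,\mathbf{P}(B) + \mathbf{P}(A \mid \overline{B})\,\mathbf{P}(\overline{B}) && \text{(by chain rule)}
\end{aligned}
$$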
Bayes’ Law
$$\mathbf{P}\{F \mid E\} = \frac{\mathbf{P}\{E \cap F\}}{\mathbf{P}\{E\}} = \frac{\mathbf{P}\{E \mid F\} \cdot \mathbf{P}\{F\}}{\mathbf{P}\{E\}}$$
Suppose that there is a rare child cancer that occurs in one out of one million kids.
There’s a test for this cancer which is 99.9% effective.
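A hedged numerical sketch of Bayes' law for this setup. The slide does not define "99.9% effective", so the code assumes it means P(positive | cancer) = 0.999 and P(positive | no cancer) = 0.001. Both readings are our assumption, and other interpretations would change the result. Here F is the event "has cancer" and E is the event "test is positive".

```python
from fractions import Fraction

p_cancer = Fraction(1, 1_000_000)          # prior P{F}, from the slide
p_pos_given_cancer = Fraction(999, 1000)   # assumed reading of "99.9% effective"
p_pos_given_healthy = Fraction(1, 1000)    # assumed false-positive rate

# Denominator P{E} via the law of total probability.
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_healthy * (1 - p_cancer))

# Bayes' law: P{F | E} = P{E | F} · P{F} / P{E}
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(p_cancer_given_pos)   # 1/1002
```

Under these assumptions a positive test still leaves the probability of cancer at roughly 0.1%, because the disease is so rare that false positives dominate.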
Random Variables
A random variable (r.v.) is a real-valued function of the
outcome of an experiment involving randomness.
Example: Experiment: Roll two dice
Q: Here are some r.v.s. What values
can these take on?
X = sum of the rolls
Y = difference of the rolls
Z = max of the rolls
W = value of the first roll
We can now ask, “What is P(X = 11)? ”
Example: Throw 2 darts uniformly at random at the unit interval [0, 1]
Here are some random variables:
D = difference in location of the 2 darts
L = location of leftmost dart
Q: Can you define some more r.v.s?
A discrete random variable can take on at most a countably infinite
number of possible values, whereas a continuous random variable can
take on an uncountable set of possible values.
Q: Which of these random variables is discrete and which is continuous?
• The sum of the rolls of two dice
• The number of arrivals at a website by time t
• The time until the next arrival at a website
• The CPU time requirement of an HTTP request
From Random Variables to Events
When we set a random variable (r.v.) equal to a value, we get an event, and all the results
we learned about events and their probabilities now apply.
Random Variable (R.V.)                  Event     Probability of Event
𝑋 = sum of 2 rolls of a die             𝑋 = 7     P(X = 7) = 1/6
𝑁 = number of arrivals to a website     𝑁 > 10    P(N > 10) = P(N > 10 | weekday) ⋅ 5/7
    within the next hour                                    + P(N > 10 | weekend) ⋅ 2/7
Probability Mass Function (PMF)
For a discrete r.v. 𝑋, the probability mass function of 𝑋 is:
$$p_X(a) = \mathbf{P}(X = a)$$
Q: What is this? $\sum_x p_X(x)$
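As an illustration, the PMF of X = sum of two dice can be tabulated by counting outcomes. The sketch assumes fair dice, and the dictionary name `pmf` is ours. It also evaluates the sum in the question above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

rolls = list(product(range(1, 7), repeat=2))   # 36 equally likely outcomes (fair dice assumed)
counts = Counter(a + b for a, b in rolls)

# p_X(a) = P(X = a) for X = sum of the two rolls
pmf = {a: Fraction(c, len(rolls)) for a, c in sorted(counts.items())}

print(pmf[7], pmf[11])      # 1/6 1/18
print(sum(pmf.values()))    # 1
```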
Bernoulli Random Variable
Experiment: Flip a single coin, with probability 𝑝 of Heads.
Random Variable: 𝑋 = 1{toss == H}
Defn: $X \sim \text{Bernoulli}(p)$:
$$X = \begin{cases} 1 & \text{w.p. } p \\ 0 & \text{w.p. } 1 - p \end{cases}$$
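A minimal sketch of sampling a Bernoulli(p) random variable with the standard library. The function name `bernoulli`, the value p = 0.3, and the seed are illustrative choices of ours.

```python
import random

def bernoulli(p, rng=random):
    """Return 1 with probability p and 0 with probability 1 - p."""
    return 1 if rng.random() < p else 0

rng = random.Random(0)      # fixed seed so the run is repeatable
p = 0.3                     # hypothetical probability of Heads
samples = [bernoulli(p, rng) for _ in range(100_000)]

print(sum(samples) / len(samples))   # fraction of 1s, close to p
```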
Two Random Variables
The joint probability mass function of discrete r.v.’s 𝑋 and 𝑌 is:
$$p_{X,Y}(x, y) = \mathbf{P}(X = x \ \&\ Y = y)$$
or equivalently, $\mathbf{P}(X = x, Y = y)$, where, by definition:
$$\sum_x \sum_y p_{X,Y}(x, y) = 1.$$
Marginal Probability Mass Function
How is $p_X(x)$ related to $p_{X,Y}(x, y)$?
Table shows $p_{X,Y}(x, y)$:
         𝑋=0     𝑋=1     𝑋=2
𝑌=0      0.4     0.05    0.05
𝑌=1      0.05    0.05    0.1     $p_Y(1) = 0.2$
𝑌=2      0.1     0.2     0
         $p_X(0) = 0.55$
$$p_Y(y) = \sum_x p_{X,Y}(x, y), \qquad p_X(x) = \sum_y p_{X,Y}(x, y)$$
Called “marginal probabilities” because they are written in the margins.
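The marginals of the table can be computed by summing rows and columns. This is a sketch with the table values copied in as exact fractions. The list-of-rows layout and the names `p_X` and `p_Y` are ours.

```python
from fractions import Fraction as F

# joint[y][x] = p_{X,Y}(x, y): rows are Y = 0, 1, 2 and columns are X = 0, 1, 2
joint = [
    [F(40, 100), F(5, 100),  F(5, 100)],    # Y = 0
    [F(5, 100),  F(5, 100),  F(10, 100)],   # Y = 1
    [F(10, 100), F(20, 100), F(0)],         # Y = 2
]

p_Y = [sum(row) for row in joint]            # sum over x (across a row)
p_X = [sum(col) for col in zip(*joint)]      # sum over y (down a column)

print(p_X[0], p_Y[1])        # 11/20 1/5
print(sum(p_X), sum(p_Y))    # 1 1
```

The printed values 11/20 and 1/5 match the 0.55 and 0.2 written in the margins of the table.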
Independence
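Discrete random variables 𝑋 and 𝑌 are independent if:
$$p_{X,Y}(x, y) = p_X(x) \cdot p_Y(y)$$
Q: If 𝑋 and 𝑌 are independent, what does this say about $\mathbf{P}(X = x \mid Y = y)$?
$$
\begin{aligned}
\mathbf{P}(X = x \mid Y = y) &= \frac{\mathbf{P}(X = x, Y = y)}{\mathbf{P}(Y = y)} \\
&= \frac{\mathbf{P}(X = x)\,\mathbf{P}(Y = y)}{\mathbf{P}(Y = y)} \\
&= \mathbf{P}(X = x)
\end{aligned}
$$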
Continuous Random Variable
A continuous random variable (r.v.) has a continuous range of values that it can
take on. This might be an interval or set of intervals.
Thus a continuous r.v. can take on an uncountable set of possible values.
Examples:
• Time of an event
• Response time of a job
• Speed of a device
• Location of a satellite
• Distance between people’s eyeballs
Probability Density Function (PDF)
• The probability that a continuous r.v. is equal to any particular value is defined to be 0.
• Probability for a continuous r.v. is defined via a density function.
The probability density function (p.d.f.) of a continuous r.v. 𝑋 is a non-negative
function $f_X(\cdot)$, where
$$\mathbf{P}(a \le X \le b) = \int_a^b f_X(x)\,dx \quad \text{and} \quad \int_{-\infty}^{\infty} f_X(x)\,dx = 1$$
Normal (a.k.a. Gaussian) distribution
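Defn: $X \sim \text{Normal}(\mu, \sigma^2)$ if
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \quad -\infty < x < \infty$$
where $\sigma > 0$. The parameter $\mu$ is called the mean, and parameter $\sigma = \sqrt{\mathbf{Var}(X)}$ is
called the standard deviation.
Defn: 𝑋 follows a standard Normal distribution if $X \sim \text{Normal}(0, 1)$, i.e.,
$$f_X(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}, \quad -\infty < x < \infty$$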
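The density above can be coded directly and checked numerically. This is a sketch only: `normal_pdf` and `integrate` are our helper names, the parameters μ = 2 and σ = 3 are arbitrary, and the integral is approximated by a simple midpoint sum over a wide finite range rather than the whole real line.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of Normal(mu, sigma^2) as defined above; requires sigma > 0."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (math.sqrt(2 * math.pi) * sigma)

def integrate(f, a, b, n=10_000):
    """Midpoint Riemann sum, a crude stand-in for the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

mu, sigma = 2.0, 3.0    # hypothetical parameters

def f(x):
    return normal_pdf(x, mu, sigma)

# Cross-check one value against the standard library's implementation.
assert math.isclose(f(1.0), NormalDist(mu, sigma).pdf(1.0))

total = integrate(f, mu - 10 * sigma, mu + 10 * sigma)
within_1sd = integrate(f, mu - sigma, mu + sigma)

print(round(total, 6))        # 1.0
print(round(within_1sd, 4))   # 0.6827
```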
Expectation
Defn: The expectation of a discrete r.v. 𝑋, written $\mathbf{E}[X]$, is the sum of the
possible values of 𝑋, each weighted by its probability:
$$\mathbf{E}[X] = \sum_x x \cdot \mathbf{P}\{X = x\}$$
• Consider rolling a fair six-sided die. Let the random variable 𝑋 represent the
outcome of the die roll. What is $\mathbf{E}[X]$?
$$\mathbf{E}[X] = \sum_{x=1}^{6} x \cdot \mathbf{P}(X = x) = \sum_{x=1}^{6} x \cdot \frac{1}{6} = \frac{1}{6}(1 + 2 + \cdots + 6) = \frac{21}{6} = 3.5$$
• Let the random variable 𝑋 represent the number of customers visiting a coffee
shop in a day. The possible values of X and their respective probabilities are as follows:
0 customers with probability 0.05
10 customers with probability 0.10
20 customers with probability 0.30
30 customers with probability 0.40
40 customers with probability 0.10
50 customers with probability 0.05
We want to calculate the expected number of customers:
$$\mathbf{E}[X] = \sum_x x \cdot \mathbf{P}(X = x) = 0 \cdot 0.05 + 10 \cdot 0.10 + 20 \cdot 0.30 + 30 \cdot 0.40 + 40 \cdot 0.10 + 50 \cdot 0.05$$
$$= 0 + 1 + 6 + 12 + 4 + 2.5 = 25.5$$
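Both expectation calculations can be reproduced with a few lines of code. In this sketch a PMF is stored as a `{value: probability}` dictionary, which is our choice of representation, and exact fractions avoid rounding in the sum.

```python
from fractions import Fraction as F

def expectation(pmf):
    """E[X] = sum over x of x * P(X = x), for a PMF given as {value: probability}."""
    assert sum(pmf.values()) == 1, "probabilities must sum to 1"
    return sum(x * p for x, p in pmf.items())

die = {x: F(1, 6) for x in range(1, 7)}
customers = {0: F(5, 100), 10: F(10, 100), 20: F(30, 100),
             30: F(40, 100), 40: F(10, 100), 50: F(5, 100)}

print(float(expectation(die)))         # 3.5
print(float(expectation(customers)))   # 25.5
```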
Expectation: Intuition
• Consider rolling a fair six-sided die. Let the random variable 𝑋 represent the
outcome of the die roll. What is E[X]?
• The expected value, E[X] = 3.5, represents the long-run average outcome if you
were to roll the die many times and average the outcomes. Although you
can never roll a "3.5" on a single roll of the die, the average of your rolls
converges to this value as the number of rolls grows.
• Let the random variable 𝑋 represent the number of customers visiting a coffee
shop in a day, with the values and probabilities given earlier.
• The expected value, E[X] = 25.5, represents the average number of customers
you can expect to visit the coffee shop. Although you can’t have exactly 25.5
customers in a day, this value gives you a sense of what the average turnout is
over a large number of days.
Average vs Mean
• Consider rolling a fair six-sided die repeatedly and recording the outcomes as
{x₁, x₂, x₃, …, xₙ}. What is the average of the outcomes?
$$\text{Average} = \bar{X}(n) = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• What is the expected value of the outcomes?
$$\mathbf{E}[X] = \sum_{x=1}^{6} x \cdot \mathbf{P}(X = x)$$
Law of large numbers:
$$\bar{X}(n) \to \mathbf{E}[X] \quad \text{as } n \to \infty$$
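A short simulation can illustrate the law of large numbers for die rolls. The seed and the sample sizes are arbitrary choices of ours, and the exact averages printed depend on the random number generator, so no specific outputs are claimed.

```python
import random

rng = random.Random(250)   # arbitrary seed for repeatability

def sample_average(n):
    """Average of n simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

for n in (10, 1_000, 100_000):
    # The sample average tends to get closer to E[X] = 3.5 as n grows.
    print(n, sample_average(n))
```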
Variance
• The variance of a random variable is a measure of how much the values of the
random variable deviate from the expected value (mean).
$$\mathrm{Var}(X) = \mathbf{E}\big[(X - \mathbf{E}[X])^2\big]$$
• The standard deviation of a random variable is the square root of its variance:
$$\sigma_X = \mathrm{std}(X) = \sqrt{\mathrm{Var}(X)}$$
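As a worked instance, the variance and standard deviation of a fair die roll follow directly from the definitions. The fair die is our choice of example, and the PMF is stored as a dictionary for convenience.

```python
from fractions import Fraction as F
import math

die = {x: F(1, 6) for x in range(1, 7)}    # PMF of a fair die (assumed)

mean = sum(x * p for x, p in die.items())                  # E[X]
var = sum((x - mean) ** 2 * p for x, p in die.items())     # E[(X - E[X])^2]
std = math.sqrt(var)                                       # square root of the variance

print(mean, var)        # 7/2 35/12
print(round(std, 3))    # 1.708
```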
Covariance
• Covariance is a measure of how much two random variables change together.
• If the covariance is positive, it means that when one variable increases, the other
tends to increase as well.
• Conversely, if the covariance is negative, it indicates that when one variable
increases, the other tends to decrease.
$$\mathrm{Cov}(X, Y) = \mathbf{E}\big[(X - \mathbf{E}[X]) \cdot (Y - \mathbf{E}[Y])\big]$$
• Examples of positive correlation:
• Height and weight
• Education level and income
• Advertising expenditure and sales revenue
• Temperature and ice cream sales
• Examples of negative correlation:
• Unemployment rate and consumer spending
• Distance from city center and property prices
• Interest rates and borrowing
• Time spent studying and number of errors on a test
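The covariance definition can be evaluated on a concrete joint PMF. This sketch reuses the joint table from the Marginal Probability Mass Function slide, keyed by (x, y). The helper `E` and the dictionary layout are ours.

```python
from fractions import Fraction as F

# Joint PMF p_{X,Y}(x, y), keyed by (x, y), copied from the marginal-PMF table.
joint = {
    (0, 0): F(40, 100), (1, 0): F(5, 100),  (2, 0): F(5, 100),
    (0, 1): F(5, 100),  (1, 1): F(5, 100),  (2, 1): F(10, 100),
    (0, 2): F(10, 100), (1, 2): F(20, 100), (2, 2): F(0),
}

def E(g):
    """Expectation of g(X, Y) under the joint PMF."""
    return sum(g(x, y) * p for (x, y), p in joint.items())

mean_x = E(lambda x, y: x)
mean_y = E(lambda x, y: y)
cov = E(lambda x, y: (x - mean_x) * (y - mean_y))

print(mean_x, mean_y, cov)   # 3/5 4/5 17/100
```

The covariance is positive here, so in this table larger values of X tend to come with larger values of Y.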