Probability Theory
Probability theory is the branch of mathematics that deals with measuring the likelihood of
events occurring. It provides tools to analyze situations involving uncertainty and helps in
determining how likely certain outcomes are. The theory uses concepts such as random
variables, sample spaces, and probability distributions to quantify the likelihood of each
possible outcome of a situation.
Example: Flipping a Coin
Flipping a coin is a random experiment with two possible outcomes: heads or tails. For a fair
coin, each outcome has an equal chance of occurring. Therefore, the probability of landing on
heads is 1/2, and similarly, the probability of landing on tails is also 1/2.
Different Approaches In Probability Theory
Probability theory studies random events and tells us about their occurrence. The three main
approaches for studying probability theory are:
1. Theoretical Probability
2. Experimental Probability
3. Subjective Probability
Theoretical Probability
Theoretical probability is derived by reasoning about the possible outcomes instead of running
experiments, which avoids infeasible or costly repetition of experiments. The theoretical
probability of an event A can be calculated as follows:
P(A) = (Number of outcomes favorable to Event A) / (Number of all possible outcomes)
The image shown below shows the theoretical probability formula.
Note: Here we assume that all outcomes of the experiment are equally likely.
Now that we have the formula, let’s apply it to our coin-tossing case. Tossing a coin has two
possible outcomes: Head or Tail. Hence, the probability of getting a Head on tossing a coin is
P(H) = 1/2.
Similarly, the probability of getting a Tail on tossing a coin is P(T) = 1/2.
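The formula above can be sketched in a few lines of Python. This is a minimal illustration that assumes equally likely outcomes; the helper name theoretical_probability is made up for this example.

```python
from fractions import Fraction

def theoretical_probability(event, sample_space):
    """Favorable outcomes divided by all outcomes (equally likely outcomes assumed)."""
    favorable = [o for o in sample_space if o in event]
    return Fraction(len(favorable), len(sample_space))

coin = ["H", "T"]
p_heads = theoretical_probability({"H"}, coin)
p_tails = theoretical_probability({"T"}, coin)
print(p_heads, p_tails)  # 1/2 1/2
```

Using Fraction keeps the result exact, which matches the way the probabilities are written in the text.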
Experimental Probability
Experimental probability is found by performing a series of experiments and observing their
outcomes. Each repetition of the random experiment is known as a trial. The experimental
probability of an event A can be calculated as follows:
P(A) = (Number of times event A happened) / (Total number of trials)
The following image shows the Experimental Probability Formula.
Now that we have the formula, let’s apply it to our coin-tossing case. If we toss a coin 10
times and record heads 4 times and tails 6 times, then the experimental probability of getting
a Head is P(H) = 4/10.
Similarly, the experimental probability of getting a Tail is P(T) = 6/10.
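As a rough sketch, the same calculation can be run on recorded or simulated tosses. The function name, the seed, and the number of simulated trials below are arbitrary choices for illustration.

```python
import random

def experimental_probability(event, outcomes):
    """Fraction of trials in which the observed outcome belongs to the event."""
    hits = sum(1 for o in outcomes if o in event)
    return hits / len(outcomes)

# The 10 tosses from the text: 4 heads and 6 tails.
recorded = ["H"] * 4 + ["T"] * 6
p_heads_recorded = experimental_probability({"H"}, recorded)

# A larger simulated run; the seed only makes the run repeatable.
rng = random.Random(0)
simulated = [rng.choice("HT") for _ in range(10_000)]
p_heads_simulated = experimental_probability({"H"}, simulated)
print(p_heads_recorded, p_heads_simulated)
```

With many trials the experimental value tends to move close to the theoretical value of 1/2, although any single run can deviate from it.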
Subjective Probability
Subjective probability refers to the likelihood of an event occurring, as estimated by an
individual based on their personal beliefs, experiences, intuition, or knowledge, rather than on
objective statistical data or formal mathematical models.
Example: A cricket enthusiast might assign a 70% probability to a team’s victory based on
their understanding of the team’s recent form, the opponent’s strengths and weaknesses, and
other relevant factors.
Basics of Probability Theory
Random Experiment
In probability theory, a random experiment is a process that can be repeated many times under
the same conditions and whose outcome cannot be predicted with certainty in advance.
Repeating it does not change the set of possible outcomes. Tossing a coin, rolling dice, etc.
are random experiments.
Sample Space
The set of all possible outcomes of a random experiment is called its sample space. For
example, throwing a die results in one of six outcomes: 1, 2, 3, 4, 5, or 6. Thus, its
sample space is S = {1, 2, 3, 4, 5, 6}.
Event
An event is a set of one or more outcomes of an experiment, that is, a subset of the sample
space. Various types of events used in probability theory are:
Independent Events: Events whose outcomes are not affected by the outcomes of other
past or future events are called independent events. For example, the outcome of a coin toss
is not affected by the outcomes of previous tosses.
Dependent Events: The events whose outcomes are affected by the outcome of other events
are called dependent events. For example, picking oranges from a bag that contains 100
oranges without replacement.
Mutually Exclusive Events: Events that cannot occur simultaneously are called mutually
exclusive events. For example, obtaining a head and obtaining a tail in a single coin toss,
because both (head and tail) cannot be obtained together.
Equally Likely Events: Events that have an equal chance or probability of happening are
known as equally likely events. For example, each face of a fair die has an equal probability
of 1/6 of being observed.
Random Variable
A random variable is a function that assigns a numerical value to each possible outcome of an
experiment. Random variables in probability theory are of two types, which are discussed
below.
Discrete Random Variable
Variables that can take only countable values such as 0, 1, 2, … are called discrete random
variables.
Continuous Random Variable
Variables that can take any value within a given range (uncountably many values) are called
continuous random variables.
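To make the idea of a discrete random variable concrete, here is a small sketch. The choice of experiment (two fair dice) and the variable X (their sum) are assumptions made for illustration, not part of the text above.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered results of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))

def X(outcome):
    """The random variable: maps each outcome to a number (here, the sum)."""
    return outcome[0] + outcome[1]

counts = Counter(X(o) for o in outcomes)
# Probability mass function: P(X = value) for each countable value of X.
pmf = {value: Fraction(n, len(outcomes)) for value, n in sorted(counts.items())}

for value, p in pmf.items():
    print(value, p)
```

X takes only the countable values 2 through 12, so it is discrete, and the probabilities in pmf add up to 1.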
Probability Theory Formulas
Various formulas are used in probability theory and some of them are discussed below,
Theoretical Probability Formula: (Number of Favourable Outcomes) / (Number of Total
Outcomes)
Empirical Probability Formula: (Number of times event A happened) / (Total number of
trials)
Addition Rule of Probability: P(A ∪ B) = P(A) + P(B) – P(A∩B)
Complementary Rule of Probability: P(A’) = 1 – P(A)
Independent Events: P(A∩B) = P(A) ⋅ P(B)
Conditional Probability: P(A | B) = P(A∩B) / P(B), where P(B) > 0
Bayes’ Theorem: P(A | B) = P(B | A) ⋅ P(A) / P(B), where P(B) > 0
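The rules listed above can be checked numerically on a small example. The sketch below assumes one roll of a fair die and two events A and B chosen purely for illustration.

```python
from fractions import Fraction

S = set(range(1, 7))   # one roll of a fair die
A = {2, 4, 6}          # "the roll is even"
B = {4, 5, 6}          # "the roll is greater than 3"

def P(event):
    """Theoretical probability of an event (a subset of S)."""
    return Fraction(len(event & S), len(S))

addition = P(A | B) == P(A) + P(B) - P(A & B)        # addition rule
complement = P(S - A) == 1 - P(A)                    # complementary rule
p_a_given_b = P(A & B) / P(B)                        # conditional probability
p_b_given_a = P(A & B) / P(A)
bayes = p_a_given_b == p_b_given_a * P(A) / P(B)     # Bayes' theorem
independent = P(A & B) == P(A) * P(B)                # False: A and B are dependent
print(addition, complement, p_a_given_b, bayes, independent)
```

Here A and B are not independent, so the product rule P(A∩B) = P(A) ⋅ P(B) does not hold for them, while the other rules hold for any pair of events.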
What is an Uncertain Domain in AI?
An uncertain domain in artificial intelligence (AI) refers to a field or environment where the
information available is incomplete, ambiguous, noisy, or inherently unpredictable. Unlike
deterministic domains where outcomes can be predicted with certainty given the inputs, uncertain
domains require AI systems to handle and reason about uncertainty in a structured manner.
Characteristics of Uncertain Domains
1. Incomplete Information: The system does not have access to all the data required to make a
fully informed decision.
2. Ambiguity: Information might be unclear or open to multiple interpretations.
3. Noise: Data might be corrupted or imprecise due to measurement errors or external factors.
4. Stochastic Processes: The environment might involve random processes or events.
Importance of Handling Uncertainty
In many real-world applications, AI systems must operate effectively despite uncertainty. Accurately
representing and reasoning about uncertain information is crucial for making reliable predictions and
decisions. Handling uncertainty enables AI systems to:
Make informed decisions based on probabilistic reasoning.
Adapt to new information and changing environments.
Provide robust and reliable performance in complex scenarios.
Representing Knowledge in an Uncertain Domain
In real-world applications, AI systems frequently encounter incomplete, ambiguous, or noisy
information. Traditional deterministic approaches fall short in such scenarios, necessitating the use of
probabilistic and fuzzy methods to handle uncertainty effectively. These methods enable AI systems
to make informed decisions, predict outcomes, and adapt to changing environments.
Axiomatic Approach to Probability
Consider a random experiment whose sample space is S, and let P denote the probability of
occurrence of any event. In this model, P is a real-valued function whose values lie between
0 and 1. The domain of this function is defined to be the power set of the sample space, that
is, the set of all events. Such a function must satisfy the following axioms:
Axiom 1: For any given event X, the probability of that event must be greater than or equal
to 0. Thus,
0 ≤ P(X)
Axiom 2: The sample space S of the experiment is the set of all the outcomes, so some
outcome in S always occurs. Intuitively, this means that whenever the experiment is
performed, the probability of getting some outcome is 100 percent.
P(S) = 1
Axiom 3: For two events A and B of the experiment, if A and B are mutually exclusive, then
P(A ∪ B) = P(A) + P(B) and P(A ∩ B) = 0
Here, ∪ stands for the union and ∩ for the intersection of two sets. In words: “If A and B are
mutually exclusive events, the probability that either one of them happens is the probability
of A happening plus the probability of B happening.”
These axioms are also called Kolmogorov’s three axioms. The third axiom can also be
extended to any number of events, provided all of them are mutually exclusive.
Let’s say the experiment has events A1, A2, A3, …, An that are mutually exclusive and together
cover the whole sample space. In this case, the three axioms become:
Axiom 1: 0 ≤ P(Ai) ≤ 1 for all i = 1, 2, 3, …, n.
Axiom 2: P(A1) + P(A2) + P(A3) + … + P(An) = 1
Axiom 3: P(A1 ∪ A2 ∪ A3 ∪ … ∪ An) = P(A1) + P(A2) + P(A3) + … + P(An)
Let’s look at some sample problems based on these concepts.
Sample Problems on Axiomatic Approach to Probability
Question 1: Find out the sample space “S” for a random experiment involving the
tossing of three coins.
Solution.
We know that tossing a coin gives us either Heads or Tails. Tossing three coins gives a
triplet in which each entry is either a head or a tail. So, the possible outcomes are
HHH, HHT, HTH, HTT, …
All these outcomes constitute the sample space.
S = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
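The same sample space can be generated programmatically, and the three axioms can be checked on it. This is a sketch that assumes fair coins, so every outcome is equally likely; the events A and B are chosen only as an example of mutually exclusive events.

```python
from fractions import Fraction
from itertools import product

# Every outcome is a string such as "HTH"; three coins give 2**3 = 8 of them.
S = {"".join(t) for t in product("HT", repeat=3)}

def P(event):
    """Probability of an event (a subset of S), assuming fair coins."""
    return Fraction(len(event), len(S))

A = {s for s in S if s.count("H") == 3}   # all heads
B = {s for s in S if s.count("H") == 0}   # all tails

axiom1 = all(P({s}) >= 0 for s in S)                    # non-negativity
axiom2 = P(S) == 1                                      # total probability is 1
axiom3 = A.isdisjoint(B) and P(A | B) == P(A) + P(B)    # additivity
print(sorted(S), axiom1, axiom2, axiom3)
```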
Bayes’ Theorem
Bayes’ Theorem is a mathematical formula that helps determine the conditional probability
of an event based on prior knowledge and new evidence.
It adjusts probabilities when new information comes in and helps make better decisions in
uncertain situations.
Bayes Theorem and Conditional Probability
Bayes’ theorem (also known as Bayes’ Rule or Bayes’ Law) is used to determine the
conditional probability of event A when event B has already occurred.
The general statement of Bayes’ theorem is “The conditional probability of an event A, given
the occurrence of another event B, is equal to the product of the probability of B given A and
the probability of A, divided by the probability of event B,” i.e.
P(A | B) = P(B | A) ⋅ P(A) / P(B)
For example, if we want to find the probability that a white marble drawn at random came
from the first bag, given that a white marble has already been drawn, and there are three bags
each containing some white and black marbles, then we can use Bayes’ Theorem.
Bayes Theorem Formula
For any two events A and B, the formula for Bayes’ theorem is given by:
P(A | B) = P(B | A) ⋅ P(A) / P(B)
Where,
P(A) and P(B) are the probabilities of events A and B, and P(B) must not be equal to zero,
P(A|B) is the probability of event A given that event B has happened,
P(B|A) is the probability of event B given that event A has happened.
Bayes Theorem Statement
Bayes’ theorem for a set of n events is defined as follows.
Let E1, E2, …, En be a set of events associated with the sample space S, in which all the events
E1, E2, …, En have a non-zero probability of occurrence and together form a partition of S.
Let A be an event from space S for which we have to find the probability. Then, according to
Bayes’ theorem,
P(Ei|A) = P(Ei)P(A|Ei) / ∑ P(Ek)P(A|Ek)
where the sum runs over k = 1, 2, 3, …, n.
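The general formula can be sketched as a short function. The three-bag setup below follows the marble example mentioned earlier, but the bag contents are hypothetical numbers picked for illustration, and the function name posterior is likewise an assumption.

```python
from fractions import Fraction

def posterior(priors, likelihoods):
    """Return P(Ei | A) for every Ei, given P(Ei) and P(A | Ei)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]   # P(Ei) P(A | Ei)
    total = sum(joint)                                     # P(A), by total probability
    return [j / total for j in joint]

# Hypothetical bag contents as (white, black) counts; not taken from the text.
bags = [(3, 2), (2, 3), (4, 1)]
priors = [Fraction(1, 3)] * 3                              # each bag equally likely
likelihoods = [Fraction(w, w + b) for w, b in bags]        # P(white | bag i)

p_bag_given_white = posterior(priors, likelihoods)
print(p_bag_given_white)
```

The denominator is the same for every Ei, so the posteriors always sum to 1 across the partition.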
Bayes Theorem Derivation
The proof of Bayes’ theorem is as follows. According to the conditional probability formula,
P(Ei|A) = P(Ei∩A) / P(A)…..(i)
Then, by using the multiplication rule of probability, we get
P(Ei∩A) = P(Ei)P(A|Ei)……(ii)
Now, by the total probability theorem,
P(A) = ∑ P(Ek)P(A|Ek)…..(iii)
Substituting the value of P(Ei∩A) and P(A) from eq (ii) and eq(iii) in eq(i) we get,
P(Ei | A) = P(Ei)P(A | Ei) / ∑ P(Ek)P(A | Ek)
Bayes’ theorem is also known as the formula for the probability of “causes”. As we know, the
Ei’s are a partition of the sample space S, and at any given time only one of the events Ei
occurs. Thus we conclude that the Bayes’ theorem formula gives the probability of a particular
Ei, given that the event A has occurred.
Terms Related to Bayes Theorem
After learning about Bayes theorem in detail, let us understand some important terms related
to the concepts we covered in formula and derivation.
Hypotheses
Hypotheses refer to possible events or outcomes in the sample space; they are denoted as E1,
E2, …, En.
Each hypothesis represents a distinct scenario that could explain an observed event.
Prior Probability
Prior probability P(Ei) is the initial probability of an event occurring before any new data is
taken into account.
It reflects existing knowledge or assumptions about the event.
Example: The probability of a person having a disease before taking a test.
Posterior Probability
Posterior probability P(Ei|A) is the updated probability of an event after considering new
information.
It is derived using Bayes Theorem.
Example: The probability of having a disease given a positive test result.
Conditional Probability
The probability of an event A based on the occurrence of another event B is termed
conditional Probability.
It is denoted as P(A|B) and represents the probability of A when event B has already
happened.
Joint Probability
When the probability of two or more events occurring together at the same time is measured,
it is called joint probability.
For two events A and B, the joint probability is denoted as P(A∩B).
Random Variables
Real-valued variables whose possible values are determined by random experiments are
called random variables.
The probabilities associated with such variables can be estimated from repeated trials, as in
experimental probability.
Bayes' Rule
Bayes' Rule is one of the most important rules in data science. It is the mathematical rule that
describes how to update a belief, given some evidence. In other words, it describes the act of
learning.
The equation itself is not too complex:
The equation: Posterior = Prior × (Likelihood / Marginal probability)
There are four parts:
Posterior probability (updated probability after the evidence is considered)
Prior probability (the probability before the evidence is considered)
Likelihood (probability of the evidence, given the belief is true)
Marginal probability (probability of the evidence, under any circumstance)
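The four parts can be seen in a small numerical sketch. The disease-test numbers below are hypothetical and chosen only to illustrate the update; they do not come from the text.

```python
def bayes_update(prior, likelihood, marginal):
    """Posterior = Prior x Likelihood / Marginal probability."""
    return prior * likelihood / marginal

# Hypothetical numbers for a disease test; none of them come from the text.
prior = 0.01                 # P(disease) before seeing the test result
sensitivity = 0.95           # likelihood: P(positive | disease)
false_positive_rate = 0.05   # P(positive | no disease)

# Marginal: probability of a positive result under either circumstance.
marginal = sensitivity * prior + false_positive_rate * (1 - prior)
posterior = bayes_update(prior, sensitivity, marginal)
print(round(posterior, 4))
```

Under these assumed numbers the posterior stays well below the sensitivity of the test, because the low prior means most positive results come from people without the disease.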
Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key technique for dealing with probabilistic events and for
solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables
and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic because they are built from a probability distribution
and use probability theory for prediction and anomaly detection.
Real-world applications are probabilistic in nature, and to represent the relationships between
multiple events, we need a Bayesian network. It can be used in various tasks including
prediction, anomaly detection, diagnostics, automated insight, reasoning, time series
prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it
consists of two parts:
Directed Acyclic Graph
Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under
uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
Each node corresponds to a random variable, and a variable can be continuous or
discrete.
Arcs, or directed arrows, represent the causal relationships or conditional dependencies between
random variables. These directed links connect pairs of nodes in the graph.
A link represents that one node directly influences the other node; if there is no directed
link between two nodes, neither of them directly influences the other.
o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
Note: The Bayesian network graph does not contain any directed cycle. Hence, it is known as a
directed acyclic graph or DAG.
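Whether a graph is a DAG can be checked mechanically. The sketch below uses Python's standard graphlib module (Python 3.9 or later); the edges are hypothetical, since the original diagram is not reproduced here, and the helper name is_dag is an assumption.

```python
from graphlib import CycleError, TopologicalSorter

def is_dag(parents):
    """True if the graph, given as {node: set of parent nodes}, has no directed cycle."""
    try:
        # Consuming static_order() raises CycleError when a cycle exists.
        list(TopologicalSorter(parents).static_order())
        return True
    except CycleError:
        return False

# Hypothetical edges: A -> B, A -> D, C -> D (parents listed per node).
network = {"A": set(), "C": set(), "B": {"A"}, "D": {"A", "C"}}
with_cycle = {"A": {"B"}, "B": {"A"}}

print(is_dag(network), is_dag(with_cycle))  # True False
```

A topological order exists exactly when the graph has no directed cycle, which is why a topological sort doubles as an acyclicity test.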
The Bayesian network has mainly two components:
Causal Component
Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)),
which determines the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability. So
let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ..., xn, then the probabilities of the different combinations of
x1, x2, x3, ..., xn are known as the joint probability distribution.
Using the chain rule, P[x1, x2, x3, ..., xn] can be written in the following way:
= P[x1 | x2, x3, ..., xn] P[x2, x3, ..., xn]
= P[x1 | x2, x3, ..., xn] P[x2 | x3, ..., xn] ... P[xn-1 | xn] P[xn].
In general, when the variables are ordered so that every parent comes before its children, we
can write for each variable Xi:
P(Xi | Xi-1, ..., X1) = P(Xi | Parents(Xi))
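Combining the chain rule with the parent condition gives the usual compact factorization of the joint distribution. The block below is a sketch in LaTeX notation; it assumes the variables are indexed so that every parent precedes its children.

```latex
\begin{aligned}
P(x_1, x_2, \ldots, x_n)
  &= \prod_{i=1}^{n} P(x_i \mid x_{i-1}, \ldots, x_1) \\
  &= \prod_{i=1}^{n} P\bigl(x_i \mid \mathrm{Parents}(X_i)\bigr)
\end{aligned}
```

The second line is what makes a Bayesian network compact: each factor depends only on a node's parents and not on all earlier variables.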
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed acyclic
graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm
responds reliably to a burglary but also responds to minor earthquakes. Harry has two
neighbors, David and Sophia, who have taken responsibility for informing Harry at work
when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes
he confuses the phone ringing with the alarm and calls then too. On the other hand, Sophia
likes to listen to loud music, so she sometimes fails to hear the alarm. Here we would like
to compute the probability of the burglar alarm sounding.
Problem:
Calculate the probability that the alarm has sounded, but neither a burglary nor an
earthquake has occurred, and David and Sophia both called Harry.
Solution:
The Bayesian network for the above problem is given below. The network structure
shows that burglary and earthquake are the parent nodes of the alarm and directly
affect the probability of the alarm going off, whereas David's and Sophia's calls depend
only on the alarm.
The network represents our assumptions that the neighbors do not directly perceive the
burglary, do not notice the minor earthquake, and do not confer before calling.
The conditional distributions for each node are given as a conditional probability table,
or CPT.
Each row in the CPT must sum to 1 because the entries in a row represent an
exhaustive set of cases for the variable.
In a CPT, a boolean variable with k boolean parents contains 2^k probabilities, one for
each combination of parent values. Hence, if there are two parents, the CPT will contain
4 probability values.
List of all events occurring in this network:
Burglary (B)
Earthquake(E)
Alarm(A)
David Calls(D)
Sophia calls(S)
We can write the events of the problem statement in the form of a probability, P[D, S, A,
B, E], and rewrite it using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] ⋅ P[S, A, B, E]
= P[D | S, A, B, E] ⋅ P[S | A, B, E] ⋅ P[A, B, E]
= P[D | A] ⋅ P[S | A, B, E] ⋅ P[A, B, E]
= P[D | A] ⋅ P[S | A] ⋅ P[A | B, E] ⋅ P[B, E]
= P[D | A] ⋅ P[S | A] ⋅ P[A | B, E] ⋅ P[B | E] ⋅ P[E]
Since burglary and earthquake have no parents and do not influence each other in this
network, P[B | E] = P[B].
Let's take the observed probabilities for the burglary and earthquake components:
P(B= True) = 0.002, which is the probability of a burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake.
P(E= False)= 0.999, which is the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The conditional probability of Alarm A depends on Burglary and Earthquake:
B      E      P(A= True)  P(A= False)
True   True   0.94        0.06
True   False  0.95        0.05
False  True   0.31        0.69
False  False  0.001       0.999
Conditional probability table for David Calls:
The conditional probability that David calls depends on the state of the Alarm.
A      P(D= True)  P(D= False)
True   0.91        0.09
False  0.05        0.95
Conditional probability table for Sophia Calls:
The conditional probability that Sophia calls depends on its parent node "Alarm."
A      P(S= True)  P(S= False)
True   0.75        0.25
False  0.02        0.98
From the formula of the joint distribution, we can write the problem statement in the form
of a probability distribution:
P(S, D, A, ¬B, ¬E) = P(S|A) * P(D|A) * P(A|¬B ∧ ¬E) * P(¬B) * P(¬E)
= 0.75 * 0.91 * 0.001 * 0.998 * 0.999
= 0.00068045.
Hence, a Bayesian network can answer any query about the domain by using the
joint distribution.
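The calculation above can be reproduced with a short script. This is a sketch under stated assumptions: the dictionary layout and function names are made up for illustration, only P(X = True | parents) is stored for each node, and the follow-up query P(B | D, S) is an added example that does not appear in the text.

```python
from itertools import product

# Priors and CPTs from the tables above; only P(X = True | parents) is stored,
# and P(X = False | parents) is taken as its complement.
P_B = 0.002
P_E = 0.001
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # keyed by (B, E)
P_D = {True: 0.91, False: 0.05}                      # keyed by A
P_S = {True: 0.75, False: 0.02}                      # keyed by A

def prob(p_true, value):
    """Probability that a boolean variable equals `value`, given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(d, s, a, b, e):
    """P(D, S, A, B, E) = P(D|A) P(S|A) P(A|B,E) P(B) P(E)."""
    return (prob(P_D[a], d) * prob(P_S[a], s) * prob(P_A[(b, e)], a)
            * prob(P_B, b) * prob(P_E, e))

# The worked example: both neighbors call, alarm on, no burglary, no earthquake.
p_example = joint(d=True, s=True, a=True, b=False, e=False)

# Inference by enumeration (a hypothetical follow-up query, not in the text):
# P(B = True | D = True, S = True), summing the joint over A and E.
def p_calls(b):
    return sum(joint(True, True, a, b, e)
               for a, e in product([True, False], repeat=2))

p_burglary_given_calls = p_calls(True) / (p_calls(True) + p_calls(False))
print(round(p_example, 8), p_burglary_given_calls)
```

Summing the joint over the unobserved variables is the simplest way to answer a query; it scales exponentially with the number of variables, so practical systems use more efficient inference procedures.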
The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network, which are
given below:
1. To understand the network as a representation of the joint probability
distribution.
This view is helpful for understanding how to construct the network.
2. To understand the network as an encoding of a collection of conditional
independence statements.
This view is helpful in designing inference procedures.
What Are Temporal Models?
Temporal models are a class of models in AI and machine learning that explicitly take into
account the temporal (time-based) relationships in data. Unlike static models, which consider
data points as independent and identically distributed (i.i.d.), temporal models recognize that
data points may be dependent on previous values. This dependency is crucial for accurately
modeling and predicting sequences.
Types of Temporal Models
1. Time Series Models
Time series models are fundamental for handling temporal data, where observations are
recorded at specific time intervals. These models are extensively used in economics, finance,
and weather forecasting. Key types of time series models include:
Autoregressive (AR) Models: These models predict future values based on past values. The
AR model assumes that the current value is a linear combination of previous values.
Moving Average (MA) Models: In MA models, the current value is modeled as a linear
combination of the current and past random error terms (forecast errors). Despite the name,
this differs from moving-average smoothing, which averages observations over a window.
ARIMA Models: The Autoregressive Integrated Moving Average (ARIMA) model combines
AR and MA models and includes a differencing step to make the data stationary, which is
essential for accurate modeling.
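The autoregressive idea and the differencing step can be sketched in plain Python. This is a minimal illustration with a first-order model, AR(1); the coefficient 0.7, the seed, and the function names are assumptions, and real work would normally rely on a statistics library.

```python
import random

def simulate_ar1(phi, n, sigma=1.0, seed=0):
    """Generate x[t] = phi * x[t-1] + noise, an AR(1) series (seed is arbitrary)."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(phi * x[-1] + rng.gauss(0.0, sigma))
    return x

def fit_ar1(x):
    """Least-squares estimate of phi from consecutive pairs (x[t-1], x[t])."""
    num = sum(prev * cur for prev, cur in zip(x, x[1:]))
    den = sum(prev * prev for prev in x[:-1])
    return num / den

def difference(x):
    """First differencing, the 'I' step of ARIMA: y[t] = x[t] - x[t-1]."""
    return [cur - prev for prev, cur in zip(x, x[1:])]

series = simulate_ar1(phi=0.7, n=5000)
phi_hat = fit_ar1(series)
forecast = phi_hat * series[-1]            # one-step-ahead AR(1) prediction

trend = [2.0 * t for t in range(6)]        # a non-stationary linear trend
print(phi_hat, forecast, difference(trend))
```

Differencing the linear trend leaves a constant series, which illustrates how the differencing step removes a trend before the AR and MA parts are fitted.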
2. State-Space Models
State-space models provide a more flexible and comprehensive framework for temporal data.
These models consist of two equations: the state equation, which describes the evolution of
the hidden state over time, and the observation equation, which links the hidden state to the
observed data. Kalman filters and Hidden Markov Models (HMMs) are prominent examples.
Kalman Filters: Used for linear and Gaussian state-space models, Kalman filters are ideal for
real-time applications like tracking and navigation.
Hidden Markov Models (HMMs): HMMs are widely used for sequence data where the
system is assumed to be a Markov process with unobserved states. Applications include
speech recognition and biological sequence analysis.
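For the HMM case, the probability of an observed sequence can be computed with the forward algorithm. The sketch below uses a hypothetical two-state weather model; every state name and number in it is illustrative and not taken from the text.

```python
def forward(observations, states, start_p, trans_p, emit_p):
    """Return P(observations) under an HMM using the forward algorithm."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {
            s: emit_p[s][obs] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())

# Hypothetical two-state weather model; all numbers are illustrative only.
states = ("Rainy", "Sunny")
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {"Rainy": {"Rainy": 0.7, "Sunny": 0.3},
           "Sunny": {"Rainy": 0.4, "Sunny": 0.6}}
emit_p = {"Rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
          "Sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1}}

likelihood = forward(("walk", "shop", "clean"), states, start_p, trans_p, emit_p)
print(likelihood)
```

Here trans_p plays the role of the state equation and emit_p the role of the observation equation; the hidden state is summed out at each step instead of being enumerated over whole sequences.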
3. Recurrent Neural Networks (RNNs)
Recurrent Neural Networks (RNNs) are a class of artificial neural networks designed to
handle sequential data. RNNs have connections that form directed cycles, creating an internal
state that captures information from previous time steps. This makes them suitable for tasks
like language modeling and machine translation.
Long Short-Term Memory (LSTM) Networks: LSTMs are a special type of RNN designed to
overcome the vanishing gradient problem. They can capture long-term dependencies,
making them effective for tasks with long-range temporal dependencies, such as music
generation and video analysis.
Gated Recurrent Units (GRUs): GRUs are a simplified version of LSTMs, retaining essential
features to capture long-term dependencies while being computationally more efficient.
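The recurrence that gives an RNN its internal state can be sketched without any library. This is a basic (Elman-style) cell, not an LSTM or GRU, and the weights are hypothetical fixed values; a real network would learn them from data.

```python
import math

def rnn_step(x, h, w_xh, w_hh, b):
    """One step of a basic RNN with a scalar input and a small hidden state."""
    return [
        math.tanh(w_xh[i] * x + sum(w_hh[i][j] * h[j] for j in range(len(h))) + b[i])
        for i in range(len(h))
    ]

# Hypothetical fixed weights for a 2-unit hidden state; a real network learns these.
w_xh = [0.5, -0.3]
w_hh = [[0.1, 0.2], [-0.2, 0.4]]
b = [0.0, 0.1]

h = [0.0, 0.0]                      # initial hidden state
for x in [1.0, 0.5, -1.0]:          # a short input sequence
    h = rnn_step(x, h, w_xh, w_hh, b)
print(h)
```

Because each new state is computed from the previous one, the final h depends on the whole sequence and its order. LSTMs and GRUs add gates on top of this recurrence to control what is kept and what is forgotten.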
Applications of Temporal Models
1. Natural Language Processing (NLP)
2. Financial Forecasting
3. Healthcare
4. Autonomous Systems