Reasoning under Uncertainty
Introduction to Reasoning Under
Uncertainty
• Though there are various types of uncertainty in various
aspects of a reasoning system, the "reasoning with
uncertainty" (or "reasoning under uncertainty") in AI has
been focused on the uncertainty of truth value, that is, to
allow and process truth values other than "true" and
"false".
• Generally speaking, to develop a system that reasons with
uncertainty means to provide the following:
– a semantic explanation about the origin and nature of the
uncertainty
– a way to represent uncertainty in a formal language
– a set of inference rules that derive uncertain (though well-
justified) conclusions
– an efficient memory-control mechanism for uncertainty
management
Nonmonotonic Logics
• A reasoning system is monotonic if the truthfulness of
a conclusion does not change when new information is
added to the system — the set of theorems can only
monotonically grow when new axioms are added.
• In contrast, in a system doing non-monotonic
reasoning, the set of conclusions may either grow or
shrink when new information is obtained.
• Nonmonotonic logics are used to formalize plausible
reasoning, such as the following inference step:
Nonmonotonic Logics
• Such reasoning is characteristic of commonsense reasoning, where
default rules are applied when case-specific information is not
available.
• The conclusion of a nonmonotonic argument may turn out to be
wrong. For example, if Tweety is a penguin, it is incorrect to conclude
that Tweety flies. Nonmonotonic reasoning often requires jumping to
a conclusion and subsequently retracting that conclusion as further
information becomes available.
• All systems of nonmonotonic reasoning are concerned with the issue
of consistency. Inconsistency is resolved by removing the relevant
conclusion(s) derived previously by default rules. Simply speaking, the
truth value of propositions in a nonmonotonic logic can be classified
into the following types:
– facts that are definitely true, such as "Tweety is a bird"
– default rules that are normally true, such as "Birds fly"
– tentative conclusions that are presumably true, such as "Tweety flies"
• When an inconsistency is recognized, only the truth value of the last
type is changed.
Why Is Uncertainty Needed?
• The world is not just T/F, so our reasoning should be able to
model this and reason over the shades of grey we find in the
world.
• Input data may be questionable
– to what extent is a patient demonstrating some symptom?
– do we rely on their word?
• Knowledge may be questionable
– is this really a fact?
• Input may be ambiguous or unclear
– this is especially true if we are dealing with real-world
inputs from sensors, or dealing with situations where
ambiguity readily exists (natural languages for instance)
• Output may be expected in terms of a plausibility/probability
such as “what is the likelihood that it will rain today?”
Why Is Uncertainty Needed?
There are many situations where uncertainty arises:
• When you travel, you reason about the possibility of delays on the way, like
bad weather, a traffic jam due to a demonstration, an accident, an
earthquake …
• When an insurance company offers a policy it has calculated the risk that
you will claim.
• When your brain estimates what an object is from an unclear picture, it
filters random noise and fills in missing details.
• When you play a game you cannot be certain what the other player will do.
• A medical expert system that diagnoses disease has to deal with the results
of tests that are sometimes incorrect, symptoms that the patient may not
report, pre-existing conditions …
Systems which can reason about the effects of uncertainty should
do better than those that don’t. But how should uncertainty be
represented?
Why Is Uncertainty Needed?
Two examples:
• I have toothache. What is the cause? There are many possible
causes of an observed event. If I go to the dentist and he
examines me, and the probe catches, it indicates there may be
a cavity rather than another cause. The likelihood of a
hypothesised cause will change as additional evidence arrives.
• Bob lives in San Francisco. He has a burglar alarm in his house,
which can be triggered by burglars and earthquakes. He has
two neighbours, John and Mary, who will call him if the alarm
goes off while he is at work, but each is unreliable in their own
way. All these sources of uncertainty can be quantified. Mary
calls. How likely is it that there has been a burglary? Using
probabilistic reasoning, we can calculate the likelihood of a
hypothesised cause.
Uncertainty Handling
• There are many different approaches to handling
uncertainty
– Formal approaches based on mathematics (probabilities)
– Formal approaches based on logic
– Informal approaches
• Many questions arise
– How do we combine uncertainty values?
– How do we obtain uncertainty values?
– How do we interpret uncertainty values?
– How do we add uncertainty values to our knowledge and
inference mechanisms?
Uncertainty Handling
• Classical first-order logic has no room for uncertainty
Ɐp Symptom(p, Toothache) → Disease(p, Cavity)
• Not correct –toothache can be caused in many other cases
• In first-order logic, we have to include all possible causes
Ɐp Symptom(p, Toothache) → Disease(p, Cavity) ˅
Disease(p, GumDisease) ˅ Disease(p, ImpactedWisdom) ˅ …
• Similarly, a Cavity does not always cause a Toothache, so
the following is also not true
Ɐp Disease(p, Cavity) → Symptom(p, Toothache)
Reasons for using probability
• Specification becomes too large
– It is too much work to list the complete set of
antecedents or consequents needed to ensure an
exception-less rule.
• Theoretical ignorance
– The complete set of antecedents is not known
• Practical ignorance
– The truth of the antecedents is not known, but we
still wish to reason.
Predicting versus Diagnosing
• Probabilistic reasoning can be used for predicting
outcomes (from cause to effect)
– Given that I have a cavity, what is the chance that I will
have toothache?
• Probabilistic reasoning can also be used for
diagnosis (from effect to cause)
– Given that I am having toothache, what is the chance
that it is being caused by a cavity?
• We need a methodology for reasoning that can
work both ways.
Probabilistic Reasoning
• Basic idea: Use probability theory to represent and process
uncertainty. In probabilistic reasoning, the truth value of a
proposition is extended from {0, 1} to [0, 1], with binary logic as its
special case.
• Justification: Though no conclusion is absolutely true, the one with
the highest probability is preferred. Under certain assumptions,
probability theory yields optimal solutions.
• To extend the basic Boolean connectives to probability functions:
– negation: P(¬A) = 1 − P(A)
– conjunction: P(A∧B) = P(A) * P(B) if A and B are independent of each
other
– disjunction: P(A∨B) = P(A) + P(B) if A and B never happen at the same
time
• Furthermore, the conditional probability of B given A is P(B|A) =
P(B∧A) / P(A), from which Bayes' Theorem is derived, and it is
often used to update a system's belief according to new
information: P(H|E) = P(E|H) * P(H) / P(E).
Probabilistic Reasoning
• Probabilistic reasoning is a way of knowledge representation
where we apply the concept of probability to indicate the
uncertainty in knowledge. In probabilistic reasoning, we
combine probability theory with logic to handle the
uncertainty.
• Probability can be defined as a chance that an uncertain event
will occur. It is the numerical measure of the likelihood that an
event will occur. The value of probability always remains
between 0 and 1 that represent ideal uncertainties.
1. 0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
2. P(A) = 1, indicates total certainty of an event A.
3. P(A) =0, indicates total certainty of event A not happening.
Probability
We can find the probability of an uncertain event by using the
formula:
•P(¬A) = probability of an event A not happening.
•P(¬A) + P(A) = 1.
Random variables: Random variables are used to represent the
events and objects in the real world.
Event: Each possible outcome of a variable is called an event.
Sample space: The collection of all possible events is called the
sample space.
Prior probability: The probability of an event computed before
observing new information or evidence.
Posterior Probability: The probability that is calculated after all
evidence or information has taken into account. It is a
combination of prior probability and new information.
Conditional probability
• Conditional probability is a probability of an event
occurring when another event has already happened.
• Let's suppose, we want to calculate the event A when
event B has already occurred, "the probability of A under
the conditions of B", it can be written as:
P(A|B)=P(A∩B)/P(B)
where P(A ∩ B)= Joint probability of A and B
P(B)= Marginal probability of B.
• If the probability of A is given and we need to find the
probability of B, then it will be given as:
P(B|A)=P(A∩B)/P(A)
Conditional probability
Example: In a class, 70% of the students like English and 40% of
the students like both English and Mathematics. What
percentage of students who like English, also like Mathematics?
Solution:
• Let, A is an event that a student likes Mathematics
• B is an event that a student likes English
P(A|B) = P(A∩B)/P(B)
= 0.4/0.7 = 57%
• Hence, 57% are the students who like English also like
Mathematics.
Bayes' theorem
Bayes' theorem, also known as Bayesian reasoning, determines the
probability of an event with uncertain knowledge. In probability
theory, it relates the conditional probability and marginal probabilities
of two random events.
Bayes' theorem may be derived from the definition of conditional
probability:
• P(A|B)=P(A∩B)/P(B), if P(B)≠0,where P(A∩B) is the probability of both A and
B being true. Similarly,
• P(B|A)=P(A∩B)/P(A), if P(A)≠0.
Solving for P(A∩B) and substituting into the above expression for P(A|B) yields
Bayes' theorem:
P(A|B)=P(B|A).P(A)/P(B), if P(B)≠0.
It is a way to calculate the value of P(A|B) with the knowledge of
P(B|A).
Bayes' theorem allows updating the probability prediction of an event
by observing new information of the real world.
Bayes' theorem
P(A|B) = P(B|A).P(A)/P(B)
Bayes' theorem shows the simple relationship between joint and conditional
probabilities. Here,
• P(A|B) is known as the posterior probability, which we need to calculate,
and it will be read as the probability of hypothesis A when there is the
occurrence of evidence B.
• P(B|A) is called the likelihood, in which we consider that the hypothesis is
true, then we calculate the probability of the evidence.
• P(A) is called the prior probability, probability of the hypothesis before
considering the evidence.
• P(B) is called marginal probability, pure probability of an evidence.
P(cause|effect) = P(effect|cause). P(cause) / P(effect)
This is the Diagnosis Rule for finding the cause from observation of the effects.
Bayes' theorem
Question: What is the probability that a patient with a stiff neck has the
disease meningitis?
Given Data:
– A doctor is aware that disease meningitis causes a patient to have a stiff
neck, and it occurs 80% of the time.
– The Known probability that a patient has meningitis disease is 1/30,000.
– The Known probability that a patient has a stiff neck is 2%.
• Let a be the proposition that patient has stiff neck and b be the
proposition that patient has meningitis.
– P(a|b) = 0.8
– P(b) = 1/30000
– P(a)= .02
P(b|a) = P(a|b). P(b) / P(a)
= 0.8 * (1/30000) / 0.02 = 0.001333…
• Hence, we can assume that 1 out of 750 patients with a stiff neck has
meningitis disease.
Bayes' theorem
• Product rule gives an alternative formulation:
P(A∩B) = P(A|B) P(B) = P(B|A) P(A)
• Chain rule is derived by successive application
of the Product rule: The chain rule says that:
P(A, B, C, D, E) = P(A) * P(B | A) * P(C | A, B)
* P(D | A, B, C) * P(E | A, B, C, D)
Inference by Enumeration
In artificial intelligence, we often want to model the
relationships between various nondeterministic events.
• If the weather predicts a 40% chance of rain, should I
carry my umbrella?
• How many scoops of ice cream should I get if the more
scoops I get, the more likely I am to drop it all?
• If there was an accident 15 minutes ago on the freeway
on my route to Oracle Arena to watch the Warriors’
game, should I leave now or in 30 minutes?
All of these questions can be answered with probabilistic
inference.
Inference by Enumeration
For example, we might build a weather model in which the
state consists of the season, temperature, and weather. Our
model might say that
P(winter, 35°, cloudy) = 0.023.
This number represents the probability of the specific
outcome that it is winter, 35°, and cloudy.
More precisely, our model is a joint distribution, i.e., a table
of probabilities that captures the likelihood of each possible
outcome, also known as an assignment of variables.
Inference by Enumeration
Inference by Enumeration
Inference by Enumeration
Inference by Enumeration
In Inference By Enumeration, we follow the following
algorithm:
• Collect all the rows consistent with the observed
evidence variables.
• Sum out (marginalize) all the hidden variables.
• Normalize the table so that it is a probability distribution
(i.e., values sum to 1).
For example, if we wanted to compute
P(W|S=winter) using the above joint distribution, we’d
select the four rows where S is winter, then sum out over T
and normalize. This yields the following probability table:
Inference by Enumeration
Inference by Enumeration
Example : If the world consists of only two Boolean variables Cavity and
Toothache, then there are 4 distinct atomic events:
• Cavity = false ɅToothache = false
• Cavity = false Ʌ Toothache = true
• Cavity = true Ʌ Toothache = false
• Cavity = true Ʌ Toothache = true
Prior or unconditional probabilities of propositions
Example: P(Cavity = true) = 0.1 and P(Weather = sunny) = 0.72
correspond to belief prior to arrival of any (new) evidence.
Probability distribution gives values for all possible assignments.
Example : Weather can take one of 4 possible values i.e. <sunny, windy,
cloudy, rainy>
P(Weather) = <0.72, 0.1, 0.08, 0.1> (normalized, i.e., sums to 1)
Inference by Enumeration
Joint probability distribution for a set of random variables gives
the probability of every atomic event on those random variables.
Example: P(Weather, Cavity) = a 4 × 2 matrix of values
Weather: sunny windy cloudy rainy
Cavity 0.144 0.02 0.016 0.02
¬ Cavity 0.576 0.08 0.064 0.08
Every question about a domain can be answered by the joint
distribution.
– What is the probability of having a cavity?
– What is the probability of the weather being sunny?
– What is the probability of having a cavity and the weather being
cloudy or windy?
Inference by Enumeration
P(toothache) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(¬cavity | toothache) = P(¬cavity Ʌ toothache)/P(toothache)
= 0.016+0.064 /0.108 + 0.012 + 0.016 + 0.064 = 0.4
General idea: compute distribution on query variable by fixing
evidence variables and summing over hidden variables. i.e. catch.
Conditional Independence
• A and B are independent iff P(A|B) = P(A) or P(B|A) = P(B)
or P(A, B) = P(A) . P(B)
Example: P(Toothache, Catch, Cavity, Weather) = P(Toothache,
Catch, Cavity) . P(Weather)
• If I have a cavity, the probability that the probe catches in it
doesn't depend on whether I have a toothache:
(1) P(catch | toothache, cavity) = P(catch | cavity)
• The same independence holds if I haven't got a cavity:
(2) P(catch | toothache, ¬cavity) = P(catch | ¬cavity)
Catch is conditionally independent of Toothache given Cavity:
P(Catch | Toothache, Cavity) = P(Catch | Cavity)
Conditional Independence
• In most cases, the use of conditional independence
reduces the size of the representation of the joint
distribution from exponential in n to linear in n.
• Conditional independence is our most basic and
robust form of knowledge about uncertain
environments.
Equivalent statements:
P(Toothache | Catch, Cavity) = P(Toothache | Cavity)
P(Toothache, Catch | Cavity) = P(Toothache | Cavity).
P(Catch | Cavity)
The Traffic Problem
The joint probability distribution for the traffic and construction
variables
T ¬T
C 0.3 0.2
¬C 0.1 0.4
Given bad traffic, what is the probability of road construction?
p(C|T)=p(C=t, T=t)/(p(C=t, T=t)+p(C=f, T=t))=.3/(.3+.1)=.75
33
The Traffic Problem
• Events:
• Road construction C
• Accident A
• Orange barrels B
• Bad traffic T
• Flashing lights L
• Joint probability
• P(C,A,B,T,L)=p(C)*p(A|C)*p(B|C,A)*p(T|C,A,B)*p(L|C,A,B,T)
(using the Chain Rule)
• Number of parameters: 2^5=32 (each parameter is binary valued)
• Reduction
• Assumption: Parameters are only dependent on parents
• Calculation of joint probability
– P(C,A,B,T,L)=p(C)*p(A)*p(B|C)*p(T|C,A)*p(L|A)
– Number of parameters: 2+2+4+8+4=20
34
Conditional Independence
Example: You want to run the sprinkler system if it is not going to
rain and you base your decision on whether it will rain or not on
whether it is cloudy.
• The grass is wet, we want to know the probability that you ran
the sprinkler versus if it rained.
• Evidential probabilities P(sprinkler | wet) and P(rain | wet) are
not independent of whether it was cloudy or not.
• We cannot use Bayes theorem directly because the evidential
probabilities are based on the prior probability of cloudy.
• However, a propagation algorithm can be applied where the prior
probability for cloudiness will impact the evidential probabilities
of sprinkler and rain.
– from there, we can finally compute the likelihood of rain
versus sprinkler.
Conditional Independence
• We can avoid the assumption of
independence by including causality
in our knowledge
– For this, we enhance our previous
approach by using a network
where directed edges denote
some form of dependence or
causality
• The idea behind computing
probabilities in a Bayesian network is
the chain rule.
• In our previous example, we want to
know the probability that it rained
versus the probability that we ran the
sprinkler given that the grass is wet.
• P(C, S, W) = P(C) * P(S|C) * P(W|C,S)
• P(C, R, W) = P(C) * P(R|C) * P(W|C,R)
Introduction to Bayesian networks
• Real-world problems rarely provide complete or certain
information.
• Sensors fail, witnesses are unreliable, and future events are
inherently probabilistic.
• Reasoning under uncertainty involves updating beliefs and
making decisions despite this incompleteness and noise.
• Bayesian networks (BNs, also called belief networks or
probabilistic graphical models) combine probability with
graph theory to make this reasoning computationally
tractable. BNs let us:
– Represent dependencies compactly.
– Encode conditional independencies visually.
– Perform efficient inference (answering probabilistic queries).
They are widely used in medical diagnosis, spam filtering,
robotics, speech recognition, and fault diagnosis.
Probability Foundation
Structure and Semantics of Bayesian
Networks
Classic Example: Burglary-Alarm
Network
Example: Burglary-Alarm Network
Real World Example
Real World Example
D-Separation
• Understanding conditional independence is key to
interpreting and working with Bayesian networks.
• In these directed graphical models, the structure of the
graph itself encodes assumptions about which variables
are independent of each other, either unconditionally or
when conditioned on other variables.
• This is formalized through the concept of d-separation
(directional-separation). D-separation provides a
graphical criterion for determining whether a set of
variables is independent of another set, given some
observed variables.
• If two nodes in the graph are d-separated by a set of
variables, they are conditionally independent given that
set.
Conditional Independence and d-
Separation
Definition (D-Separation). A set of variables E d-separates
variables X and Y if E blocks every undirected path between X
and Y in the network.
To determine whether X and Y are independent given the
observed variables E, we can verify whether E d-separates X
and Y. If d-separation holds, then X and Y are independent.
Contd…
• First, to verify the definition, we need to consider every
undirected path between X and Y. The word ”un-directed”
means that we do not care about the direction of the arrows on
the path. As long as a series of nodes and edges connect X and
Y, we will consider it a path.
• Second, there may be multiple paths between X and Y. We
need to consider every path and verify that E blocks every path
between X and Y.
• Third, on each path, there could be multiple nodes between X
and Y. The path is blocked if at least one node blocks the path.
As soon as we find one node blocking the path, we can move
on to a different path. In the worst case, we need to check
every node and determine that none of them blocks the path.
Contd…
• Given this definition, our task boils down to the
following: pick a path between X and Y, pick a node on
the path, and determine whether the node blocks the
path or not.
• This leads to our next question: What does it mean to
”block a path”? Let me explain this in three scenarios.
Interestingly, these three scenarios correspond to the
three key structures that I discussed previously.
• In each of the three scenarios, we will look at one path
between X and Y and consider one node N on the path.
The three scenarios differ by the direction of the two
arrows on both sides of N.
Blocked Path: Chain (Causal Path)
The two arrows around N point in the same direction, forming a
chain around N. I drew the arrows pointing to the right, but it’s
fine if they point to the left. In this scenario, if N is observed,
then N blocks the path between X and Y.
Blocked Path: Fork (Common Cause)
The two arrows around N point away from N, to the two children
A and B. If the arrows depict causal relationships, you can think
of A and B as unreliable sensors of N. If N is observed, then N
blocks the path between X and Y.
Blocked Path: Collider (Common Effect)
The two arrows around N point toward N. A and B are both
parents of N. If the arrows depict causal relationships, then A and
B jointly cause N to happen. The descendants of N are also
important in this scenario. The rule says that: If we do not
observe N and do not observe any of N’s descendants, then the
path is blocked.
Example: A Scenario
If I travel on the
subway, I may get the
flu. If I take a trip to an
exotic destination, I
may catch malaria. Flu
and malaria will have
different symptoms.
Flu and malaria will
both cause fever. Flu
will also cause aches.
Malaria will also cause
jaundice. Having a
fever will also cause
me to have high
temperature.
Example: Questions
1) Are TravelSubway and High-Temp independent?
2) Are TravelSubway and HighTemp conditionally independent
given Flu?
3) Are Aches and HighTemp independent?
4) Are Aches and HighTemp conditionally independent given
Flu?
5) Are Flu and ExoticTrip independent?
6) Are Flu and ExoticTrip conditionally independent given
HighTemp?
• Find the probability that both John and Mary
call when the alarm rings but no burglary or
earthquake has taken place.
i.e. P(JohnCalls, MaryCalls, Alarm, ¬Burglary,
¬Earthquake)
= P(JohnCalls|Alarm) . P(MaryCalls|Alarm) .
P(Alarm| ¬Burglary Ʌ ¬Earthquake) .
P(¬Burglary) . P(¬Earthquake)
= 0.90 * 0.70 * 0.001 * 0.999 * 0.998
= 0.000628… (very small)
Solve the following:
• The probability of an alarm sounding. (ans = 0.0025) (Hint: Check antecedents of
all 4 possibilities.)
• The probability that John calls. (ans = 0.052125)
(Hint : P(¬Alarm) = 1 – P(Alarm))
• The probability that the alarm sounds because the burglary has happened. (ans =
0.00095) (Hint: Sum over the hidden variable.)
• The probability that the alarm sounds because of an earthquake.
(ans = 0.00058)
• The probability that the alarm sounds without an earthquake. (ans = 0.001945)
• The probability that no alarm sounds and there is no earthquake. (ans = 0.996)
• The probability that John calls and a burglary has actually happened.
(ans = 0.00086)
• The probability that John calls because a burglary has actually happened.
(ans = 0.86)
• The last 2 probabilities for Mary. (ans = 0.00067 and 0.67)
• The probability of a burglary when John has called. (ans = 0.016)
• The probability of a burglary when the alarm has sounded. (ans = 0.38)