Probabilistic Reasoning in AI: Unit 5 Notes
Probabilistic Reasoning in AI: Unit 5 Notes
Probabilistic reasoning
Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.
The probability in probabilistic reasoning because it provides a way to handle the uncertainty
that is the result of someone's laziness and ignorance.
In the real world, there are lots of scenarios, where the certainty of something is not confirmed,
such as "It will rain today," "behavior of someone for some situations," "A match between two
teams or two players." These are probable sentences for which we can assume that it will happen
but not sure about it, so here use probabilistic reasoning.
Probability: Probability can be defined as a chance that an uncertain event will occur. It is the
numerical measure of the likelihood that an event will occur. The value of probability always
remains between 0 and 1 that represent ideal uncertainties.
Let's suppose, to calculate the event A when event B has already occurred, "the probability of A
under the conditions of B", it can be written as:
Where P(A⋀B)= Joint probability of a
and B P(B)= Marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:
It can be explained by using the below Venn diagram, where B is occurred event, so sample
space will be reduced to set B, and now we can only calculate event A when event B is already
occurred by dividing the probability of P(A⋀B) by P( B ).
Example:
In a class, there are 70% of the students who like English and 40% of the students who likes English and
mathematics, and then what is the percent of students those who like English also like mathematics?
Solution:
Let, A is an event that a student likes
Mathematics B is an event that a student likes
English.
Hence, 57% are the students who like English also like Mathematics.
Probabilistically
In many problem domains it isn't possible to create complete, consistent models of the
world. Therefore agents (and people) must act in uncertain worlds (which the real
world is).
an agent to make rational decisions even when there is not enough information to prove
that an action will work.
Some of the reasons for reasoning under uncertainty:
o True uncertainty. E.g., flipping a coin.
o Theoretical ignorance. There is no complete theory which is known about the
problem domain. E.g., medical diagnosis.
o Laziness. The space of relevant factors is very large, and would require too much
work to list the complete set of antecedents and consequents. Furthermore, it
would be too hard to use the enormous rules that resulted.
o Practical ignorance. Uncertain about a particular individual in the domain
because all of the information necessary for that individual has not been
collected.
Probability theory will serve as the formal language for representing and
reasoning with uncertain knowledge.
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the
probability of an event with uncertain knowledge.
In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian inference is
an application of Bayes' theorem, which is fundamental to Bayesian statistics
to calculate the value of P(B|A) with the knowledge of P(A|B).
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.
Example: If cancer corresponds to one's age then by using Bayes' theorem, determine the
probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:
The above equation (a) is called as Bayes' rule or Bayes' theorem. This equation is basic of
most modern AI systems for probabilistic inference.
In the equation (a), in general, we can write P (B) = P(A)*P(B|Ai), hence the Bayes' rule can be
written as:
Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have a good probability of these three terms and want to
determine the fourth one.
Suppose we want to perceive the effect of some unknown cause, and want to compute that cause,
then the Bayes' rule becomes:
Example-1:
Question: what is the probability that a patient has diseases meningitis with a stiff neck?
Given Data:
A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80%
of the time. He is also aware of some more facts, which are given as follows:
Let a be the proposition that patient has stiff neck and b be the proposition that patient has meningitis. ,
so we can calculate the following as:
P(a|b) = 0.8
P(b)= 1/30000
P(a)=0.02
Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff
neck.
Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:
The generalized form of Bayesian network that represents and solve decision problems under
uncertain knowledge is known as an Influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links), where:
o Each node corresponds to the random variables, and a variable can be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes in
the graph.
These links represent that one node directly influence the other node, and if there is no
directed link that means that nodes are independent with each other
o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.
o Causal Component
o Actual numbers
Each node in the Bayesian network has condition probability distribution P(Xi |Parent(Xi) ),
which determines the effect of the parent on that node.
Rather than reasoning about the truth or falsity of a proposition, reason about the belief
that a proposition or event is true or false
For each primitive proposition or event, attach a degree of belief to the sentence
Use probability theory as a formal means of manipulating degrees of belief
Given a proposition, A, assign a probability, P(A), such that 0 <= P(A) <= 1, where if A
is true, P(A)=1, and if A is false, P(A)=0. Proposition A must be either true or false, but
P(A) summarizes our degree of belief in A being true/false.
o Examples
o P(Weather=Sunny) = 0.7 means that we believe that the weather will be Sunny with
70% certainty. In this case Weather is a random variable that can take on values in a
domain such as {Sunny, Rainy, Snowy, Cloudy}.
o P(Cavity=True) = 0.05 means that we believe there is a 5% chance that a person has a
cavity. Cavity is a Boolean random variable since it can take on possible values True
and False.
o Example: P(A=a ^ B=b) = P(A=a, B=b) = 0.2, where A=My_Mood, a=happy,
B=Weather, and b=rainy, means that there is a 20% chance that when it's raining
my mood is happy.
assume that in a given problem domain, the programmer and expert identify all of the
relevant propositional variables that are needed to reason about the domain.
Each of these will be represented as a random variable, i.e., a variable that can take on
values from a set of mutually exclusive and exhaustive values called the sample space or
partition of the random variable. Usually this will mean a sample space {True, False}.
For example, the proposition Cavity has possible values True and False indicating
whether a given patient has a cavity or not. A random variable that has True and False as
its possible values is called a Boolean random variable.
More generally, propositions can include the equality predicate with random variables
and the possible values they can have.
For example, we might have a random variable Color with possible values red, green,
blue, and other.
Then P(Color=red) indicates the likelihood that the color of a given object is red.
Similarly, for Boolean random variables we can ask P(A=True), which is abbreviated to P(A),
and P(A=False), which is abbreviated to P(~A).
From these axioms we can show the following properties also hold:
P(~A) = 1 - P(A)
P(A) = P(A ^ B) + P(A ^ ~B)
Sum{P(A=a)} = 1, where the sum is over all possible values a in the sample space of A
For example, consider a domain described by three Boolean random variables, Bird, Flier, and
Young. Then we can enumerate a table showing all possible interpretations and associated
probabilities:
that there are 8 rows in the above table representing the fact that there are 2 3 ways to assign
values to the three Boolean variables. More generally, with n Boolean variables the table will be
of size 2n. And if n variables each had k possible values, then the table would be size kn.
Also notice that the sum of the probabilities in the right column must equal 1 this means that for
n Boolean random variables, the table has 2n-1 values that must be determined to completely fill
in the table.
If all of the probabilities are known for a full joint probability distribution table, then we
can compute any probabilistic statement about the domain.
For example,
Conditional Probabilities
Conditional probabilities are key for reasoning because they formalize the process of
accumulating evidence and updating probabilities based on new evidence.
For example, if we know there is a 4% chance of a person having a cavity, we can
represent this as the prior (aka unconditional) probability P(Cavity)=0.04.
A person has a symptom of a toothache, to know what is the posterior
probability of a Cavity given this new evidence. That is, compute P(Cavity | Toothache).
If P(A|B) = 1, this is equivalent to the sentence in Propositional Logic B => A. Similarly, if
P(A|B)
=0.9, then this is like saying B => A with 90% certainty.
In other words, we've made implication fuzzy because it's not absolutely certain.
Given several measurements and other "evidence", E1, ..., Ek, we will formulate queries
as P(Q | E1, E2, ..., Ek) meaning "what is the degree of belief that Q is true given that
we know E1, ..., Ek and nothing else."
Example C o m p u t i n g C o n d i t i o n a l P r o b a b i l i t y f r o m t h e J o i n t
P r o b a b i l i t y D i s t r i b u t i o n Say we want to compute P(~Bird | Flier) and we know
the full joint probability distribution function given above.
P(~B|F) = P(~B,F) / P(F)
= (P(~B,F,Y) + P(~B,F,~Y)) / P(F)
= (.01 + .01)/P(F)
Next, we could either compute the marginal probability P(F) from the full joint
probability distribution, or, as is more commonly done, we could do it by
using a process called normalization, which first requires computing
a random variable, C, which represents the possible ways to classify an input pattern of
features that have been measured.
The domain of C is the set of possible classifications,
e.g., it might be the possible diagnoses in a medical domain.
Say the possible values for C are {a,b,c}, and the features we have
En=en),
P(C=c | E1=e1, ..., En=en) assuming E1, ..., En are conditionally independent
given C. Since for each value of C the denominators are the same above, they
P(C=a | E1=e1, ..., En=en) = P(C=a) * P(E1=e1 | C=a) * P(E2=e2 | C=a) * ... *
P(En=en | C=a) Choose the value for C that gives the maximum probability.
Finally, since only relative values are needed and probabilities are often very small, it is
common to compute the sum of logarithms of the probabilities:
log P(C=a | E1=e1, ..., En=en) = log P(C=a) + log P(E1=e1 | C=a) + ... + log P(En=en | C=a).
Example
The doctor wants to determine the likelihood that the patient has a PickledLiver.
Based on no other information, she knows that the prior probability P(PickledLiver) =
10-17. So, this represents the doctor's initial belief in this diagnosis. However, after
examination, she determines that the patient has jaundice. She knows that P(Jaundice) =
2-10 and P(Jaundice | PickledLiver) = 2-3, so she computes the new updated probability in
the patient having PickledLiver as:
So, based on this new evidence, the doctor increases her belief in this diagnosis from 2-17 to 2-10.
Next, determines that the patient's eyes are bloodshot, to add this new piece of evidence
and update the probability of PickledLiver given Jaundice and Bloodshot.
Say, P(Bloodshot) = 2-6 and P(Bloodshot | PickledLiver) = 2-1. Then, computes the new
conditional probability:
So, after taking both symptoms into account, the doctor's belief that the patient has a
PickledLiver is 2-5.
Bayesian Networks, also known as Bayes Nets, Belief Nets, Causal Nets, and Probability
Nets, are a space-efficient data structure for encoding all of the information in the full
joint probability distribution for the set of random variables defining a domain. That is,
from the Bayesian Net one can compute any value in the full joint probability distribution
of the set of random variables.
Represents all of the direct causal relationships between variables
Intuitively, to construct a Bayesian net for a given set of variables, draw arcs
from cause variables to immediate effects.
Space efficient because it exploits the fact that in many real-world problem
domains the dependencies between variables are generally local, so there are a lot
of conditionally independent variables
Captures both qualitative and quantitative relationships between variables
Can be used to reason
o Forward (top-down) from causes to effects -- predictive reasoning (aka
causal reasoning)
o Backward (bottom-up) from effects to causes -- diagnostic reasoning
Formally, a Bayesian Net is a directed, acyclic graph (DAG), where there is a node for
each random variable, and a directed arc from A to B whenever A is a direct causal
influence on B. Thus the arcs represent direct causal relationships and the nodes
represent states of affairs. The occurrence of A provides support for B, and vice versa.
The backward influence is call "diagnostic" or "evidential" support for A due to the
occurrence of B.
Each node A in a net is conditionally independent of any subset of nodes that
are not descendants of A given the parents of A.
Conditional independence defines local net structure. For example, if B and C are
conditionally independent given A, then by definition P(C|A,B) = P(C|A) and,
symmetrically, P(B|A,C) = P(B|A). Intuitively, think of A as the direct cause of both B
and C. In a Bayesian Net this will be represented by the local structure:
For example, in the dentist example in the textbook, having a Cavity causes both a
Toothache and the dental probe to Catch, but these two events are conditionally
independent given Cavity. That is, if we know nothing about whether or not someone has
a Cavity, then Toothache and Catch are dependent. But as soon as we definitely know the
person has a cavity or not, then knowing that the person has a Toothache as well has no
effect on whether Catch is true. This conditional independence relationship will be
reflected in the Bayesian Net topology as:
In general, we will construct the net so that given its parents, a node is
conditionally independent of the rest of the net variables. That is,
Example :
Consider the problem domain in which when I go home I want to know if someone in my
family is home before I go in. Let's say I know the following information:
(1) Why my wife leaves the house, she often (but not always) turns on the outside light.
(She also sometimes turns the light on when she's expecting a guest.)
(4) If the dog is outside, I will probably hear it barking (though it might not bark, or I
might hear a different dog barking and think it's my dog).
Given this information, define the following five Boolean random variables:
From this information, the following direct causal influences seem appropriate:
Based on the above, the following is a Bayesian Net that represents these direct causal
relationships (though it is important to note that these causal connections are not absolute, i.e.,
they are not implications):
Next, the following quantitative information is added to the net; this information is
usually given by an expert or determined empirically from training data.
o For each root node (i.e., node without any parents), the prior probability of the
random variable associated with the node is determined and stored there
o For each non-root node, the conditional probabilities of the node's variable
given all possible combinations of its immediate parent nodes are determined.
This results in a conditional probability table (CPT) at each non-root node.
Example, a total of 10 probabilities are computed and stored in the net, whereas the full
joint probability distribution would require a table containing 2 5 = 32 probabilities. The
reduction is due to the conditional independence of many variables.
Two variables that are not directly connected by an arc can still affect each other. For
example, B and H are not (unconditionally) independent, but H does not directly depend
on B.
Given a Bayesian Net, read off the conditional independence relations that are
represented. Specifically, each node, V, is conditionally independent of all nodes that
are not descendants of V, given V's parents. For example, in the above example H is
conditionally independent of B, O, and L given D. So, P(H | B,D,O,L) = P(H | D).
Intuitively, "to construct a Bayesian Net for a given set of variables, we draw arcs from cause
variables to immediate effects. In almost all cases, doing so results in a Bayesian network [whose
conditional independence implications are accurate]."
1. Identify a set of random variables that describe the given problem domain
2. Choose an ordering for them: X1, ..., Xn
3. for i=1 to n do
a. Add a new node for Xi to the net
b. Set Parents(Xi) to be the minimal set of already added nodes such that we have
conditional independence of Xi and all other members of {X1, ..., Xi-1} given
Parents(Xi)
c. Add a directed arc from each node in Parents(Xi) to Xi
d. If Xi has at least one parent, then define a conditional probability table at Xi:
P(Xi=x | possible assignments to Parents(Xi)). Otherwise, define a prior
probability at Xi: P(Xi)
There is not, in general, a unique Bayesian Net for a given set of random variables. But
all represent the same information in that from any net constructed every entry in the joint
probability distribution can be computed.
The "best" net is constructed if in Step 2 the variables are topologically sorted first. That
is, each variable comes before all of its children. So, the first nodes should be the roots,
then the nodes they directly influence, and so on.
The algorithm will not construct a net that is illegal in the sense of violating the rules of
probability.
To illustrate how a Bayesian Net can be used to compute an arbitrary value in the joint probability
distribution, consider the Bayesian Net shown above for the "home domain."
P(B,~O,D,~L,H) = P(H,~L,D,~O,B)
= P(H | ~L,D,~O,B) * P(~L,D,~O,B) by Product Rule
= P(H|D) * P(~L,D,~O,B) by Conditional
Independence of H and L,O, and B
given D
= P(H|D) P(~L | D,~O,B) P(D,~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D,~O,B) by Conditional Independence of L and D,
and L and B, given O
= P(H|D) P(~L|~O) P(D | ~O,B) P(~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O | B) P(B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O) P(B)by Independence of O and B
= (.3)(1 - .6)(.1)(1 - .6)(.3)
= 0.00144
where all of the numeric values are available directly in the Bayesian Net (since P(~A|B) = 1 -
P(A|B)).
The simplest kind of random sampling process for Bayesian networks generates events from
a network that has no evidence associated with it. The idea is to sample each variable in turn,
in topological order. The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable’s parents.
Likelihood weighting
Representing ignorance:
Dempster–Shafer theory
The masses still must add to 1 over all possible events. Bel(A) is defined to be the sum ofmasses
for all events that are subsets of (i.e., that entail) A, including A itself. With thisdefinition,
Bel(A) and Bel(¬A) sum to at most 1, and the gap—the interval between Bel(A)and 1 −
Bel(¬A)—is often interpreted as bounding the probability of A.
For eg:-
let us consider a room where four person are presented A, B, C, D(lets say) And suddenly lights
out and when the lights come back B has been died due to stabbing in his back with the help of a
knife. No one came into the room and no one has leaved the room and B has not committed
suicide. Then we have to find out who is the murdrer?
Mass function m(K): It is an interpretation of m({K or B}) i.e; it means there is evidence for {K
or B} which cannot be divided among more specific beliefs for K and B.
Belief in K: The belief in element K of Power Set is the sum of masses of element which are
subsets of K. This can be explained through an example
Lets say K = {a, b, c}
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) +
m(a, b, c) Plaausiblity in K: It is the sum of masses of set that
intersects with K. i.e; Pl(K) = m(a) + m(b) + m(c) + m(a, b) +
m(b, c) + m(a, c) + m(a, b, c)
Characteristics of Dempster Shafer Theory:
It will ignorance part such that probability of all events aggregate to [Link] is reduced in
this theory by adding more and more [Link] rule is used to combine various
types of possiblities.
Advantages:
Ucertainty interval reduces.
DST has much lower level of ignorance.
Diagnose Hierarchies can be represented using this.
Person dealing with such problems is free to think about
evidences. Disadvantages:
In this computation effort is high, as we have to deal with 2n of sets.
Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.
Forms Of Learning
Any component of an agent can be improved by learning from [Link] depends upon 4 factors: