0% found this document useful (0 votes)
35 views20 pages

Probabilistic Reasoning in AI: Unit 5 Notes

The document discusses the concepts of uncertain knowledge and learning in AI, focusing on core and probabilistic AI approaches. It explains probabilistic reasoning, causes of uncertainty, Bayes' theorem, and Bayesian networks, emphasizing how these concepts help in decision-making under uncertainty. Additionally, it outlines the axioms of probability theory and the importance of conditional probabilities in reasoning and updating beliefs based on new evidence.

Uploaded by

thippaninithin50
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views20 pages

Probabilistic Reasoning in AI: Unit 5 Notes

The document discusses the concepts of uncertain knowledge and learning in AI, focusing on core and probabilistic AI approaches. It explains probabilistic reasoning, causes of uncertainty, Bayes' theorem, and Bayesian networks, emphasizing how these concepts help in decision-making under uncertainty. Additionally, it outlines the axioms of probability theory and the importance of conditional probabilities in reasoning and updating beliefs based on new evidence.

Uploaded by

thippaninithin50
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

UNIT-V

Uncertain knowledge and Learning


Core vs. Probabilistic AI •
 Knowledge Reasoning : work with facts/assertions; develop rules of logical inference
 Planning: work with applicability/effects of actions; develop searches for actions which
achieve goals/avert disasters.
 Expert Systems: develop by hand a set of rules for examining inputs, updating internal
states and generating outputs
 Learning approach: use probabilistic models to tune performance based on many
data examples.
 Probabilistic AI: emphasis on noisy measurements, approximation in hard cases,
learning, algorithmic issues.
o logical assertions ⇒ probability distributions
o logical inference ⇒ conditional probability distributions
o logical operators ⇒ probabilistic generative models

Probabilistic reasoning
Causes of uncertainty:
Following are some leading causes of uncertainty to occur in the real world.

 Information occurred from unreliable sources.


 Experimental Errors
 Equipment fault
 Temperature variation
 Climate change.

Probabilistic reasoning is a way of knowledge representation where we apply the concept of


probability to indicate the uncertainty in knowledge. In probabilistic reasoning, we combine
probability theory with logic to handle the uncertainty.

The probability in probabilistic reasoning because it provides a way to handle the uncertainty
that is the result of someone's laziness and ignorance.

In the real world, there are lots of scenarios, where the certainty of something is not confirmed,
such as "It will rain today," "behavior of someone for some situations," "A match between two
teams or two players." These are probable sentences for which we can assume that it will happen
but not sure about it, so here use probabilistic reasoning.

Probability: Probability can be defined as a chance that an uncertain event will occur. It is the
numerical measure of the likelihood that an event will occur. The value of probability always
remains between 0 and 1 that represent ideal uncertainties.

0 ≤ P(A) ≤ 1, where P(A) is the probability of an event


AP(A) = 0, indicates total uncertainty in an event A.
P(A) =1, indicates total certainty in an event A.
To find the probability of an uncertain event by using the below formula.
P(¬A) = probability of a not happening event.
P(¬A) + P(A) = 1.

Event: Each possible outcome of a variable is called an event.


Sample space: The collection of all possible events is called sample space.
Random variables: Random variables are used to represent the events and objects in the real
world. Prior probability: The prior probability of an event is probability computed before
observing new information.
Posterior Probability: The probability that is calculated after all evidence or information has
taken into account. It is a combination of prior probability and new information.
Conditional probability:
Conditional probability is a probability of occurring an event when another event has already happened.

Let's suppose, to calculate the event A when event B has already occurred, "the probability of A
under the conditions of B", it can be written as:
Where P(A⋀B)= Joint probability of a
and B P(B)= Marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it will be given as:

It can be explained by using the below Venn diagram, where B is occurred event, so sample
space will be reduced to set B, and now we can only calculate event A when event B is already
occurred by dividing the probability of P(A⋀B) by P( B ).

Example:
In a class, there are 70% of the students who like English and 40% of the students who likes English and
mathematics, and then what is the percent of students those who like English also like mathematics?
Solution:
Let, A is an event that a student likes
Mathematics B is an event that a student likes
English.
Hence, 57% are the students who like English also like Mathematics.

Probabilistically

 In many problem domains it isn't possible to create complete, consistent models of the
world. Therefore agents (and people) must act in uncertain worlds (which the real
world is).
 an agent to make rational decisions even when there is not enough information to prove
that an action will work.
 Some of the reasons for reasoning under uncertainty:
o True uncertainty. E.g., flipping a coin.
o Theoretical ignorance. There is no complete theory which is known about the
problem domain. E.g., medical diagnosis.
o Laziness. The space of relevant factors is very large, and would require too much
work to list the complete set of antecedents and consequents. Furthermore, it
would be too hard to use the enormous rules that resulted.
o Practical ignorance. Uncertain about a particular individual in the domain
because all of the information necessary for that individual has not been
collected.
 Probability theory will serve as the formal language for representing and
reasoning with uncertain knowledge.

Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian reasoning, which determines the
probability of an event with uncertain knowledge.

In probability theory, it relates the conditional probability and marginal probabilities of two
random events.
Bayes' theorem was named after the British mathematician Thomas Bayes. The Bayesian inference is
an application of Bayes' theorem, which is fundamental to Bayesian statistics
to calculate the value of P(B|A) with the knowledge of P(A|B).
Bayes' theorem allows updating the probability prediction of an event by observing new
information of the real world.

Example: If cancer corresponds to one's age then by using Bayes' theorem, determine the
probability of cancer more accurately with the help of age.

Bayes' theorem can be derived using product rule and conditional probability of event A with
known event B:

As from product rule we can write:

1. P(A ⋀ B)= P(A|B) P(B) or

Similarly, the probability of event B with known event A:

1. P(A ⋀ B)= P(B|A) P(A)

Equating right hand side of both the equations, we will get:

The above equation (a) is called as Bayes' rule or Bayes' theorem. This equation is basic of
most modern AI systems for probabilistic inference.

the simple relationship between joint and conditional probabilities.


Here, P(A|B) is known as posterior,
to calculate, and it will be read as Probability of hypothesis A when we have occurred an evidence B.
P(B|A) is called the likelihood, in which we consider that hypothesis is true, then we calculate the
probability of evidence.
P(A) is called the prior probability, probability of hypothesis before considering the
evidence P(B) is called marginal probability, pure probability of an evidence.

In the equation (a), in general, we can write P (B) = P(A)*P(B|Ai), hence the Bayes' rule can be
written as:

Applying Bayes' rule:

Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A).
This is very useful in cases where we have a good probability of these three terms and want to
determine the fourth one.

Suppose we want to perceive the effect of some unknown cause, and want to compute that cause,
then the Bayes' rule becomes:

Example-1:

Question: what is the probability that a patient has diseases meningitis with a stiff neck?

Given Data:

A doctor is aware that disease meningitis causes a patient to have a stiff neck, and it occurs 80%
of the time. He is also aware of some more facts, which are given as follows:

o The Known probability that a patient has meningitis disease is 1/30,000.


o The Known probability that a patient has a stiff neck is 2%.

Let a be the proposition that patient has stiff neck and b be the proposition that patient has meningitis. ,
so we can calculate the following as:

P(a|b) = 0.8
P(b)= 1/30000

P(a)=0.02

Hence, we can assume that 1 patient out of 750 patients has meningitis disease with a stiff
neck.

Bayesian Network can be used for building models from data and experts opinions, and it
consists of two parts:

o Directed Acyclic Graph


o Table of conditional probabilities.

The generalized form of Bayesian network that represents and solve decision problems under
uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

o Each node corresponds to the random variables, and a variable can be continuous or discrete.
o Arc or directed arrows represent the causal relationship or conditional probabilities
between random variables. These directed links or arrows connect the pair of nodes in
the graph.
These links represent that one node directly influence the other node, and if there is no
directed link that means that nodes are independent with each other
o In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.
o If we are considering node B, which is connected with node A by a directed
arrow, then node A is called the parent of Node B.
o Node C is independent of node A.

The Bayesian network has mainly two components:

o Causal Component
o Actual numbers

Each node in the Bayesian network has condition probability distribution P(Xi |Parent(Xi) ),
which determines the effect of the parent on that node.

Representing Belief about Propositions

 Rather than reasoning about the truth or falsity of a proposition, reason about the belief
that a proposition or event is true or false
 For each primitive proposition or event, attach a degree of belief to the sentence
 Use probability theory as a formal means of manipulating degrees of belief
 Given a proposition, A, assign a probability, P(A), such that 0 <= P(A) <= 1, where if A
is true, P(A)=1, and if A is false, P(A)=0. Proposition A must be either true or false, but
P(A) summarizes our degree of belief in A being true/false.
o Examples
o P(Weather=Sunny) = 0.7 means that we believe that the weather will be Sunny with
70% certainty. In this case Weather is a random variable that can take on values in a
domain such as {Sunny, Rainy, Snowy, Cloudy}.
o P(Cavity=True) = 0.05 means that we believe there is a 5% chance that a person has a
cavity. Cavity is a Boolean random variable since it can take on possible values True
and False.
o Example: P(A=a ^ B=b) = P(A=a, B=b) = 0.2, where A=My_Mood, a=happy,
B=Weather, and b=rainy, means that there is a 20% chance that when it's raining
my mood is happy.

 assume that in a given problem domain, the programmer and expert identify all of the
relevant propositional variables that are needed to reason about the domain.
 Each of these will be represented as a random variable, i.e., a variable that can take on
values from a set of mutually exclusive and exhaustive values called the sample space or
partition of the random variable. Usually this will mean a sample space {True, False}.
 For example, the proposition Cavity has possible values True and False indicating
whether a given patient has a cavity or not. A random variable that has True and False as
its possible values is called a Boolean random variable.
More generally, propositions can include the equality predicate with random variables
and the possible values they can have.
For example, we might have a random variable Color with possible values red, green,
blue, and other.
Then P(Color=red) indicates the likelihood that the color of a given object is red.
Similarly, for Boolean random variables we can ask P(A=True), which is abbreviated to P(A),
and P(A=False), which is abbreviated to P(~A).

Axioms of Probability Theory


Probability Theory provides us with the formal mechanisms and rules for manipulating
propositions represented probabilistically. The following are the three axioms of probability
theory:

 0 <= P(A=a) <= 1 for all a in sample space of A


 P(True)=1, P(False)=0
 P(A v B) = P(A) + P(B) - P(A ^ B)

From these axioms we can show the following properties also hold:

 P(~A) = 1 - P(A)
 P(A) = P(A ^ B) + P(A ^ ~B)
 Sum{P(A=a)} = 1, where the sum is over all possible values a in the sample space of A

Joint Probability Distribution


Given an application domain in which we have determined a sufficient set of random variables to
encode all of the relevant information about that domain, we can completely specify all of the
possible probabilistic information by constructing the full joint probability distribution,
P(V1=v1, V2=v2, ..., Vn=vn), which assigns probabilities to all possible combinations of values
to all random variables.

For example, consider a domain described by three Boolean random variables, Bird, Flier, and
Young. Then we can enumerate a table showing all possible interpretations and associated
probabilities:

Bird Flier Young Probability


T T T 0.0
T T F 0.2
T F T 0.04
T F F 0.01
F T T 0.01
F T F 0.01
F F T 0.23
F F F 0.5

that there are 8 rows in the above table representing the fact that there are 2 3 ways to assign
values to the three Boolean variables. More generally, with n Boolean variables the table will be
of size 2n. And if n variables each had k possible values, then the table would be size kn.

Also notice that the sum of the probabilities in the right column must equal 1 this means that for
n Boolean random variables, the table has 2n-1 values that must be determined to completely fill
in the table.
If all of the probabilities are known for a full joint probability distribution table, then we
can compute any probabilistic statement about the domain.
For example,

 P(Bird=T) = P(B) = 0.0 + 0.2 + 0.04 + 0.01 = 0.25


 P(Bird=T, Flier=F) = P(B, ~F) = P(B, ~F, Y) + F(B, ~F, ~Y) = 0.04 + 0.01 = 0.05

Conditional Probabilities

 Conditional probabilities are key for reasoning because they formalize the process of
accumulating evidence and updating probabilities based on new evidence.
 For example, if we know there is a 4% chance of a person having a cavity, we can
represent this as the prior (aka unconditional) probability P(Cavity)=0.04.
 A person has a symptom of a toothache, to know what is the posterior
probability of a Cavity given this new evidence. That is, compute P(Cavity | Toothache).
 If P(A|B) = 1, this is equivalent to the sentence in Propositional Logic B => A. Similarly, if
P(A|B)
=0.9, then this is like saying B => A with 90% certainty.
 In other words, we've made implication fuzzy because it's not absolutely certain.
 Given several measurements and other "evidence", E1, ..., Ek, we will formulate queries
as P(Q | E1, E2, ..., Ek) meaning "what is the degree of belief that Q is true given that
we know E1, ..., Ek and nothing else."

Conditional probability is defined as: P(A|B) = P(A ^ B)/P(B) = P(A,B)/P(B)


One way of looking at this definition is as a normalized (using P(B)) joint probability (P(A,B)).

 Example C o m p u t i n g C o n d i t i o n a l P r o b a b i l i t y f r o m t h e J o i n t
P r o b a b i l i t y D i s t r i b u t i o n Say we want to compute P(~Bird | Flier) and we know
the full joint probability distribution function given above.

 P(~B|F) = P(~B,F) / P(F)
= (P(~B,F,Y) + P(~B,F,~Y)) / P(F)
= (.01 + .01)/P(F)

Next, we could either compute the marginal probability P(F) from the full joint
probability distribution, or, as is more commonly done, we could do it by
using a process called normalization, which first requires computing

P(B|F) = P(B,F) / P(F)


= (P(B,F,Y) + P(B,F,~Y)) / P(F)
= (0.0 + 0.2)/P(F)

Now we also know that P(~B|F) + P(B|F) = 1,


so substituting from above and solving for P(F)
P(F) = 0.22.
Hence, P(~B|F) = 0.02/0.22 = 0.091.
This is an effective procedure for computing conditional probabilities, it is intractable in
general because it means that we must compute and store the full joint probability
distribution table, which is exponential in size.

Some important rules related to conditional probability are:

o Rewriting the definition of conditional probability,


o the Product Rule: P(A,B) = P(A|B)P(B)
o Chain Rule: P(A,B,C,D) = P(A|B,C,D)P(B|C,D)P(C|D)P(D), which generalizes the product
rule for a joint probability of an arbitrary number of variables.
o Note that ordering the variables results in a different expression, but all have the same
resulting value.
o Conditionalized version of the Chain Rule: P(A,B|C) = P(A|B,C)P(B|C)
o Bayes's Rule: P(A|B) = (P(A)P(B|A))/P(B), which can be written as follows to more clearly
emphasize the "updating" aspect of the rule: P(A|B) = P(A) * [P(B|A)/P(B)] Note: The terms
P(A) and P(B) are called the prior (or marginal) probabilities. The term P(A|B) is called the
posterior probability because it is derived from or depends on the value of B.
o Conditionalized version of Bayes's Rule: P(A|B,C) = P(B|A,C)P(A|C)/P(B|C)
o Conditioning (aka Addition) Rule: P(A) = Sum{P(A|B=b)P(B=b)} where the sum is over
all possible values b in the sample space of B.
o P(~B|A) = 1 - P(B|A)

Assuming conditional independence of B and C given A, we can simplify Bayes's Rule


for two pieces of evidence B and C:

 P(A | B,C) = (P(A)P(B,C | A))/P(B,C)


= (P(A)P(B|A)P(C|A))/(P(B)P(C|B))
= P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C|B)]
= (P(A) * P(B|A) * P(C|A))/P(B,C)

Naive Bayes Classifier:

a random variable, C, which represents the possible ways to classify an input pattern of
features that have been measured.
The domain of C is the set of possible classifications,
e.g., it might be the possible diagnoses in a medical domain.

Say the possible values for C are {a,b,c}, and the features we have

measured are E1=e1, E2=e2, ..., En=en.

Then we can compute

P(C=a | E1=e1, ...,

En=en),

P(C=b | E1=e1, ..., En=en) and

P(C=c | E1=e1, ..., En=en) assuming E1, ..., En are conditionally independent

given C. Since for each value of C the denominators are the same above, they

can be ignored. So, for example

P(C=a | E1=e1, ..., En=en) = P(C=a) * P(E1=e1 | C=a) * P(E2=e2 | C=a) * ... *

P(En=en | C=a) Choose the value for C that gives the maximum probability.

Finally, since only relative values are needed and probabilities are often very small, it is
common to compute the sum of logarithms of the probabilities:

log P(C=a | E1=e1, ..., En=en) = log P(C=a) + log P(E1=e1 | C=a) + ... + log P(En=en | C=a).

If B and C are (unconditionally) independent, then P(C|B) = P(C), so

P(A | B,C) = P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C)]

 Example

Consider the medical domain consisting of three Boolean variables: PickledLiver,


Jaundice, Bloodshot, where the first indicates if a given patient has the "disease"
PickledLiver, and the second and third describe symptoms of the patient. We'll assume
that Jaundice and Bloodshot are independent.

The doctor wants to determine the likelihood that the patient has a PickledLiver.

Based on no other information, she knows that the prior probability P(PickledLiver) =
10-17. So, this represents the doctor's initial belief in this diagnosis. However, after
examination, she determines that the patient has jaundice. She knows that P(Jaundice) =
2-10 and P(Jaundice | PickledLiver) = 2-3, so she computes the new updated probability in
the patient having PickledLiver as:

P(PickledLiver | Jaundice) = P(P)P(J|P)/P(J)


= (2-17 * 2-3)/2-10
= 2-10

So, based on this new evidence, the doctor increases her belief in this diagnosis from 2-17 to 2-10.

Next, determines that the patient's eyes are bloodshot, to add this new piece of evidence
and update the probability of PickledLiver given Jaundice and Bloodshot.

Say, P(Bloodshot) = 2-6 and P(Bloodshot | PickledLiver) = 2-1. Then, computes the new
conditional probability:

P(PickledLiver | Jaundice, Bloodshot) = (P(P)P(J|P)P(B|P))/(P(J)P(B))


= 2-10 * [2-1 / 2-6]
= 2-5

So, after taking both symptoms into account, the doctor's belief that the patient has a
PickledLiver is 2-5.

Bayesian Networks (aka Belief Networks)

 Bayesian Networks, also known as Bayes Nets, Belief Nets, Causal Nets, and Probability
Nets, are a space-efficient data structure for encoding all of the information in the full
joint probability distribution for the set of random variables defining a domain. That is,
from the Bayesian Net one can compute any value in the full joint probability distribution
of the set of random variables.
 Represents all of the direct causal relationships between variables
 Intuitively, to construct a Bayesian net for a given set of variables, draw arcs
from cause variables to immediate effects.
 Space efficient because it exploits the fact that in many real-world problem
domains the dependencies between variables are generally local, so there are a lot
of conditionally independent variables
 Captures both qualitative and quantitative relationships between variables
 Can be used to reason
o Forward (top-down) from causes to effects -- predictive reasoning (aka
causal reasoning)
o Backward (bottom-up) from effects to causes -- diagnostic reasoning
 Formally, a Bayesian Net is a directed, acyclic graph (DAG), where there is a node for
each random variable, and a directed arc from A to B whenever A is a direct causal
influence on B. Thus the arcs represent direct causal relationships and the nodes
represent states of affairs. The occurrence of A provides support for B, and vice versa.
The backward influence is call "diagnostic" or "evidential" support for A due to the
occurrence of B.
 Each node A in a net is conditionally independent of any subset of nodes that
are not descendants of A given the parents of A.

Net Topology Reflects Conditional Independence Assumptions

 Conditional independence defines local net structure. For example, if B and C are
conditionally independent given A, then by definition P(C|A,B) = P(C|A) and,
symmetrically, P(B|A,C) = P(B|A). Intuitively, think of A as the direct cause of both B
and C. In a Bayesian Net this will be represented by the local structure:

For example, in the dentist example in the textbook, having a Cavity causes both a
Toothache and the dental probe to Catch, but these two events are conditionally
independent given Cavity. That is, if we know nothing about whether or not someone has
a Cavity, then Toothache and Catch are dependent. But as soon as we definitely know the
person has a cavity or not, then knowing that the person has a Toothache as well has no
effect on whether Catch is true. This conditional independence relationship will be
reflected in the Bayesian Net topology as:

 In general, we will construct the net so that given its parents, a node is
conditionally independent of the rest of the net variables. That is,

P(X1=x1, ..., Xn=xn) = P(xi | Parents(Xi)) * ... * P(xn | Parents(Xn))


Hence, we don't need the full joint probability distribution, only conditionals relative to
the parent variables.

 Example :
Consider the problem domain in which when I go home I want to know if someone in my
family is home before I go in. Let's say I know the following information:

(1) Why my wife leaves the house, she often (but not always) turns on the outside light.
(She also sometimes turns the light on when she's expecting a guest.)

(2) When nobody is home, the dog is often left outside.

(3) If the dog has bowel-troubles, it is also often left outside.

(4) If the dog is outside, I will probably hear it barking (though it might not bark, or I
might hear a different dog barking and think it's my dog).
Given this information, define the following five Boolean random variables:

O: Everyone is Out of the


house L: The Light is on
D: The Dog is outside
B: The dog has Bowel
troubles H: I can Hear the
dog barking

From this information, the following direct causal influences seem appropriate:

1. H is only directly influenced by D. Hence H is conditionally independent of L, O and B given D.


2. D is only directly influenced by O and B. Hence D is conditionally independent of L given O and B.
3. L is only directly influenced by O. Hence L is conditionally independent of D, H and B given O.
4. O and B are independent.

Based on the above, the following is a Bayesian Net that represents these direct causal
relationships (though it is important to note that these causal connections are not absolute, i.e.,
they are not implications):
Next, the following quantitative information is added to the net; this information is
usually given by an expert or determined empirically from training data.

o For each root node (i.e., node without any parents), the prior probability of the
random variable associated with the node is determined and stored there
o For each non-root node, the conditional probabilities of the node's variable
given all possible combinations of its immediate parent nodes are determined.
This results in a conditional probability table (CPT) at each non-root node.

Example, the following Bayesian Net:

Example, a total of 10 probabilities are computed and stored in the net, whereas the full
joint probability distribution would require a table containing 2 5 = 32 probabilities. The
reduction is due to the conditional independence of many variables.

Two variables that are not directly connected by an arc can still affect each other. For
example, B and H are not (unconditionally) independent, but H does not directly depend
on B.

Given a Bayesian Net, read off the conditional independence relations that are
represented. Specifically, each node, V, is conditionally independent of all nodes that
are not descendants of V, given V's parents. For example, in the above example H is
conditionally independent of B, O, and L given D. So, P(H | B,D,O,L) = P(H | D).

Building a Bayesian Net

Intuitively, "to construct a Bayesian Net for a given set of variables, we draw arcs from cause
variables to immediate effects. In almost all cases, doing so results in a Bayesian network [whose
conditional independence implications are accurate]."

More formally, the following algorithm constructs a Bayesian Net:

1. Identify a set of random variables that describe the given problem domain
2. Choose an ordering for them: X1, ..., Xn
3. for i=1 to n do
a. Add a new node for Xi to the net
b. Set Parents(Xi) to be the minimal set of already added nodes such that we have
conditional independence of Xi and all other members of {X1, ..., Xi-1} given
Parents(Xi)
c. Add a directed arc from each node in Parents(Xi) to Xi
d. If Xi has at least one parent, then define a conditional probability table at Xi:
P(Xi=x | possible assignments to Parents(Xi)). Otherwise, define a prior
probability at Xi: P(Xi)

 There is not, in general, a unique Bayesian Net for a given set of random variables. But
all represent the same information in that from any net constructed every entry in the joint
probability distribution can be computed.
 The "best" net is constructed if in Step 2 the variables are topologically sorted first. That
is, each variable comes before all of its children. So, the first nodes should be the roots,
then the nodes they directly influence, and so on.
 The algorithm will not construct a net that is illegal in the sense of violating the rules of
probability.

Computing Joint Probabilities from a Bayesian Net

To illustrate how a Bayesian Net can be used to compute an arbitrary value in the joint probability
distribution, consider the Bayesian Net shown above for the "home domain."

Goal: Compute P(B,~O,D,~L,H)

P(B,~O,D,~L,H) = P(H,~L,D,~O,B)
= P(H | ~L,D,~O,B) * P(~L,D,~O,B) by Product Rule
= P(H|D) * P(~L,D,~O,B) by Conditional
Independence of H and L,O, and B
given D
= P(H|D) P(~L | D,~O,B) P(D,~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D,~O,B) by Conditional Independence of L and D,
and L and B, given O
= P(H|D) P(~L|~O) P(D | ~O,B) P(~O,B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O | B) P(B) by Product Rule
= P(H|D) P(~L|~O) P(D|~O,B) P(~O) P(B)by Independence of O and B
= (.3)(1 - .6)(.1)(1 - .6)(.3)
= 0.00144

where all of the numeric values are available directly in the Bayesian Net (since P(~A|B) = 1 -
P(A|B)).

APPROXIMATE INFERENCE IN BAYESIAN NETWORKS

 Direct sampling methods

The simplest kind of random sampling process for Bayesian networks generates events from
a network that has no evidence associated with it. The idea is to sample each variable in turn,
in topological order. The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable’s parents.

 Likelihood weighting

Likelihood weighting avoids the inefficiency of rejection sampling by generating only


eventsthat are consistent with the evidence e. It is a particular instance of the general
statisticaltechnique of importance sampling, tailored for inference in Bayesian networks.

 Inference by Markov chain simulation


Markov chainMonte Carlo (MCMC) MARKOV CHAIN algorithms work quite differently
from rejection sampling and likelihood weighting. Instead of generating each sample from
scratch, MCMC algorithms generate each sample by making a random change to the
preceding sample. It is therefore helpful to think of an MCMC algorithm as being in a
particular current state specifying a value for every variable and generating a next state by
making random changes to the current state.

FIRST-ORDER PROBABILITY MODELS


 Possible worlds
For Bayesian networks, the possible worlds are assignmentsof values to variables; for the
Boolean case in particular, the possible worlds areidentical to those of propositional logic.
For a first-order probability model, then, it seemswe need the possible worlds to be those of
first-order logic—that is, a set of objects withrelations among them and an interpretation
that maps constant symbols to objects, predicatesymbols to relations, and function symbols
to functions on those objects.

 Relational probability models


Like first-order logic, RPMs have constant, function, and predicate [Link] can refine
the model by introducing a context-specific independence.

A context-specific independence allows a variable to be independent of some of its parents given


certain values of others.

 Open-universe probability models


 A vision system doesn’t know what exists, if anything, around the next corner, and may not
know if the object it sees now is the same one it saw a few minutes ago.
 A text-understanding system does not know in advance the entities that will be featured in a
text, and must reason about whether phrases such as “Mary,” “Dr. Smith,” “she,”“his
cardiologist,” “his mother,” and so on refer to the same object.
 An intelligence analyst hunting for spies never knows how many spies there really are and
can only guess whether various pseudonyms, phone numbers, and sightings belong to the
same individual.

Representing ignorance:

Dempster–Shafer theory

The Dempster–Shafer theory DEMPSTER–SHAFER is designed to deal with the distinction


between uncertainty and ignorance. Rather than computing the probability of a proposition, it
computes theprobability that the evidence supports the proposition. This measure of belief is
called abelief function, written Bel(X).
The mathematical formulation of Dempster–Shafer theory is similar tothose of probability
theory; the main difference is that, instead of assigning probabilities to possible worlds, the
theory assigns masses to sets of possible world, that is, to events.

The masses still must add to 1 over all possible events. Bel(A) is defined to be the sum ofmasses
for all events that are subsets of (i.e., that entail) A, including A itself. With thisdefinition,
Bel(A) and Bel(¬A) sum to at most 1, and the gap—the interval between Bel(A)and 1 −
Bel(¬A)—is often interpreted as bounding the probability of A.

As with default reasoning, there is a problem in connecting beliefs to actions. Wheneverthere is a


gap in the beliefs, then a decision problem can be defined such that a Dempster–Shafer system is
unable to make a decision.
Bel(A) should be interpretednot as a degree of belief in A but as the probability assigned to all
the possible worlds (nowinterpreted as logical theories) in which A is provable.

For eg:-
let us consider a room where four person are presented A, B, C, D(lets say) And suddenly lights
out and when the lights come back B has been died due to stabbing in his back with the help of a
knife. No one came into the room and no one has leaved the room and B has not committed
suicide. Then we have to find out who is the murdrer?

 Either {A} or{C} or {D} has killed him.


 Either {A, C} or {C, D} or {A, C} have killed him.
 Or the three of them kill him i.e; {A, C, D}
 None of the kill him {o}(let us say).
These will be the possible evidences by which we can find the murderer by measure of
plausiblity. Using the above example we can say :
Set of possible conclusion (P): ,p1, p2….pn-where P is set of possible conclusion and cannot be
exhaustive means at least one (p)i must be true.(p)i must be mutually [Link] Set will
contain 2n elements where n is number of elements in the possible set.
For eg:-
If P = { a, b, c}, then Power set is given as
{o, {a}, {b}, {c}, {a, b}, {b, c}, {a, c}, {a, b, c}}= 23 elements.

Mass function m(K): It is an interpretation of m({K or B}) i.e; it means there is evidence for {K
or B} which cannot be divided among more specific beliefs for K and B.

Belief in K: The belief in element K of Power Set is the sum of masses of element which are
subsets of K. This can be explained through an example
Lets say K = {a, b, c}
Bel(K) = m(a) + m(b) + m(c) + m(a, b) + m(a, c) + m(b, c) +
m(a, b, c) Plaausiblity in K: It is the sum of masses of set that
intersects with K. i.e; Pl(K) = m(a) + m(b) + m(c) + m(a, b) +
m(b, c) + m(a, c) + m(a, b, c)
Characteristics of Dempster Shafer Theory:
It will ignorance part such that probability of all events aggregate to [Link] is reduced in
this theory by adding more and more [Link] rule is used to combine various
types of possiblities.
Advantages:
 Ucertainty interval reduces.
 DST has much lower level of ignorance.
 Diagnose Hierarchies can be represented using this.
 Person dealing with such problems is free to think about
evidences. Disadvantages:
 In this computation effort is high, as we have to deal with 2n of sets.

Learning
An agent is learning if it improves its performance on future tasks after making
observations about the world.

Forms Of Learning
Any component of an agent can be improved by learning from [Link] depends upon 4 factors:

 Which component is to be improved


o direct mapping from conditions on the current state to actions
o infer relevant properties of the world
o results of possible actions
o Action-value information
o Goals that describe classes of states
 What prior knowledge the agent already has.

 What representation is used for the data and the component.


o representations: propositional and first-order logical sentences
o Bayesian networks for the inferential components
o factored representation—a vector of attribute values—and outputs that can be
either a continuous numerical value or a discrete value
 What feedback is available to learn from : types of feedback that determine the three main
types of learning
o In unsupervised learning the agent learns patterns in the input even though no
explicit feedback is supplied
o reinforcement learning the agent learns from a series of reinforcements—
rewards or punishments.
o supervised learning the agent observes some example input–output pairs and learns a
function that maps from input to output
o semi-supervised learning we are given a few labeled examples and must make what we
can of a large collection of unlabelled examples

You might also like