Probabilistic Reasoning in AI
Probabilistic Reasoning in AI
1. Laziness: It is too much work to list the complete set of antecedents ● Assigning a probability of 0 to a given sentence corresponds to an
or consequents needed to ensure an exception less rule and too hard unequivocal belief that the sentence is false, while assigning a
to use such rules. probability of 1 corresponds to an unequivocal belief that the
2. Theoretical ignorance: Medical science has no complete theory for sentence is true.
the domain. ● Probabilities between 0 and 1 correspond to intermediate degrees of
3. Practical ignorance: Even if we know all the rules, we might be belief in the truth of the sentence.
uncertain about a particular patient because not all the necessary ● The sentence itself is in fact either true or false.
tests have been or can be run.
● It is important to note that a degree of belief is different from a degree
● The connection between toothaches and cavities is just not a logical
of truth.
consequence in either direction.
● A probability of 0.8 does not mean "80% true" but rather an 80%
● This is typical of the medical domain, as well as most other judgmental
degree of belief-that is, a fairly strong expectation.
domains: law, business, design, automobile repair, gardening, dating,
● Thus, probability theory makes the same ontological commitment as
and so on.
logic namely, that facts either do or do not hold in the world.
● The agent's knowledge can at best provide only a degree of belief in
● Degree of truth, as opposed to degree of belief, is the subject of fuzzy
the relevant sentences. Our main tool for dealing with degrees of
logic.
belief will be probability theory, which assigns to each sentence a
numerical degree of belief between 0 and 1.
1. In logic, a sentence such as "The patient has a cavity" is true or false
depending on the interpretation and the world; it is true just when the
1. Probability provides a way of summarizing the uncertainty that comes
fact it refers to is the case.
from our laziness and ignorance.
2. In probability theory, a sentence such as "The probability that the
2. But we believe that there is, say, an 80% chance-that is, a probability
patient has a cavity is 0.8" is about the agent's beliefs, not directly
of 0.8-that the patient has a cavity if he or she has a toothache.
about the world.
3. This belief could be derived from statistical data-80% of the toothache 3. These beliefs depend on the percepts that the agent has received to
patients seen so far have had cavities-or from some general rules, or
date. These percepts constitute the evidence on which probability
from a combination of evidence sources.
assertions are based.
4. The missing 20% summarizes all the other possible causes of 4. Just as entailment status can change when more sentences are added
toothache that we are too lazy or ignorant to confirm or deny. to the knowledge base, probabilities can change when more evidence
is acquired.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 2
CS3491-Artificial Intelligence and Machine Learning
5. All probability statements must therefore indicate the evidence with 8. Utility theory says that every state has a degree of usefulness, or
respect to which the probability is being assessed. utility, to an agent and that the agent will prefer states with higher
6. As the agent receives new percepts, its probability assessments are utility.
updated to reflect the new evidence. 9. The utility of a state is relative to the agent whose preferences the
7. Before the evidence is obtained, we talk about prior or utility function is supposed to represent.
unconditional probability; after the evidence is obtained, we talk 10. Example: The utility of a state in which White has won a game of
about posterior or conditional probability. chess is obviously high for the agent playing White, but low for the
8. In most cases, an agent will have some evidence from its percepts and agent playing Black.
will be interested in computing the posterior probabilities of the 11. Or again, some players (including the authors) might be happy with a
outcomes it cares about. draw against the world champion, whereas other players (including
the former world champion) might not.
1.2 Uncertainty and rational decisions 12. A utility function can even account for altruistic behavior, simply by
1. The presence of uncertainty radically changes the way an agent makes including the welfare of others as one of the factors contributing to the
decisions. agent's own utility.
2. A logical agent typically has a goal and executes any plan that is
guaranteed to achieve it. ● Preferences, as expressed by utilities, are combined with probabilities
3. An action can be selected or rejected on the basis of whether it in the general theory of rationa1 decisions called decision theory:
achieves the goal, regardless of what other actions might achieve. Decision theory = probability theory + utility theory.
4. To make such choices, an agent must first have preferences between ● The fundamental idea of decision theory is that an agent is rational if and
the different possible outcomes of the various plans. only if it chooses the action that yields the highest expected utility,
5. A particular outcome is a completely specified state, including such averaged over all the possible outcomes of the action.
factors as whether the agent arrives on time and the length of the wait
● This is called the principle of Maximum Expected Utility (MEU).
at the airport.
6. We will be using utility theory to represent and reason with
1.3 Design for a decision-theoretic agent
preferences.
● The structure of an agent that uses decision theory to select actions
7. The term utility is used here in the sense of "the quality of being
useful," not in the sense of the electric company or water works. shows in figure.
● The primary difference is that the decision-theoretic agent's
knowledge of the current state is uncertain; the agent's belief state is
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 3
CS3491-Artificial Intelligence and Machine Learning
a representation of the probabilities of all possible actual states of the 2. Necessarily true (i.e., valid) propositions have probability I, and
world. necessarily false (i.e., unsatisfiable) propositions have probability 0.
● As time passes, the agent accumulates more evidence and its belief
state changes.
● Given the belief state, the agent can make probabilistic predictions of
action outcomes and hence select the action with highest expected 3. The probability of a disjunction is given by
utility.
● This rule is easily remembered by noting that the cases where a holds,
together with the cases where b holds, certainly cover all the cases
where a V b holds; but summing the two sets of cases counts their
intersection twice, so we need to subtract P (a b).
● These three axioms are often called Kolmogorov's axioms.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 4
CS3491-Artificial Intelligence and Machine Learning
● That is, any probability distribution on a single variable must sum to It is can an agent not hold the following set of beliefs, which clearly
also true that any joint probability distribution on any set of variables violates axiom 3?
must sum to 1: this can be seen simply by creating a single megavariable
whose domain is the cross product of the domains of the original
equation (1)
variables.
● Here, we give one argument for the axioms of probability, first stated in
● Recall that any proposition a is equivalent to the disjunction of all the
193 1 by Bruno de Finetti.
atomic events in which a holds; call this set of events e(a).
● The key to de Finetti's argument is the connection between degree of
● Recall also that atomic events are mutually exclusive, so the probability of
belief and actions.
any conjunction of atomic events is zero, by axiom 2.
● The idea is that if an agent has some degree of belief in a proposition a,
● Hence, from axiom 3, we can derive the following simple relationship: The
then the agent should be able to state odds at which it is indifferent to a
probability of a proposition is equal to the sum of the probabilities of the
bet for or against a.
atomic events in which it holds; that is,
● Think of it as a game between two agents: Agent 1 states "my degree of
belief in event a is 0.4." Agent 2 is then free to choose whether to bet for
or against a, at stakes that are consistent with the stated degree of
belief.
● This equation provides a simple method for computing the probability ● That is, Agent 2 could choose to bet that a will occur, betting $4 against
of any proposition, given a full joint distribution that specifies the Agent 1's $6. Or Agent 2 could bet $6 against $4 that A will not occur.
probabilities of all atomic events.
● If an agent's degrees of belief do not accurately reflect the world, then
1.5 Why the axioms of probability are reasonable
you would expect that it would tend to lose money over the long run to
● The axioms of probability can be seen as restricting the set of
an opposing agent whose beliefs more accurately reflect the state of the
probabilistic beliefs that an agent can hold. world.
● In the logical case, the semantic definition of conjunction means that ● But de Finetti proved something much stronger: ZfAgent I expresses a set
at least one of the three beliefs just mentioned must be false in the of degrees of belief that violate the axioms of probability theory then there
world, so it is unreasonable for an agent to believe all three. is a combination of bets by Agent 2 that guarantees that Agent I will lose
● With probabilities, on the other hand, statements refer not to the money every time.
world directly, but to the agent's own state of knowledge. Why, then,
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 5
CS3491-Artificial Intelligence and Machine Learning
● So if you accept the idea that an agent should be willing to "put its
money where its probabilities are," then you should accept that it is
irrational to have beliefs that violate the axioms of probability.
● Suppose that Agent 1 has the set of degrees of belief from Equation (1).
Figure shows that if Agent 2 chooses to bet $4 on a, $3 on b, and $2 on
Figure: A full joint distribution for the Toothache, Cavity, Catch world.
(a b), then Agent 1 always loses money, regardless of the outcomes
● Notice that the probabilities in the joint distribution sum to 1, as
for a and b.
required by the axioms of probability.
● For Example, there are six atomic events in which cavity V toothache
holds:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 6
CS3491-Artificial Intelligence and Machine Learning
● That is, a distribution over Y can be obtained by summing out all the ● In fact, it can be viewed as a normalization constant for the
other variables from any joint distribution containing Y. distribution P( Cavity / toothache), ensuring that it adds up to 1.
● Throughout the chapters dealing with probability, we will use a to
● A variant of this rule involves conditional probabilities instead of joint denote such constants. With this notation, we can write the two
probabilities, using the product rule: preceding equations in one:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 7
CS3491-Artificial Intelligence and Machine Learning
● Notice that together the variables X, E, and Y constitute the complete ● In a realistic problem, there might be hundreds or thousands of
set of variables for the domain, so P(X, e, y) is simply a subset of random variables to consider, not just three.
probabilities from the full joint distribution. ● It quickly becomes completely impractical to define the vast numbers
● It loops over the values of X and the values of Y to enumerate all of probabilities required-the experience needed in order to estimate
possible atomic events with e fixed, add up their probabilities from each of the table entries separately simply cannot exist.
the joint table, and normalize the results. ● For these reasons, the full joint distribution in tabular form is not a
An algorithm for probabilistic inference by enumeration of the entries in practical tool for building reasoning systems.
a full joint: distribution. ● Instead, it should be viewed as the theoretical foundation on which
more effective approaches may be built.
● That is, we expect only 1 in 5000 patients with a stiff neck to have
meningitis.
Applying Bayes' rule: The simple case ● Notice that, even though a stiff neck is quite strongly indicated by
● On the surface, Bayes' rule does not seem very useful. It requires three meningitis (with probability 0.5), the probability of meningitis in the
terms-a conditional probability and two unconditional probabilities- patient remains small.
just to compute one conditional probability.
● Bayes' rule is useful in practice because there are many cases where ● A process by which one can avoid assessing the probability of the
we do have good probability estimates for these three numbers and evidence (here, P(s)) by instead computing a posterior probability for
need to compute the fourth. each value of the query variable (here, m and m) and then
● In a task such as medical diagnosis, we often have conditional normalizing the results.
probabilities on causal relationships and want to derive a diagnosis. ● The same process can be applied when using Bayes' rule. We have
● A doctor knows that the disease meningitis causes the patient to have
a stiff neck, say, 50% of the time.
● The doctor also knows some unconditional facts: the prior probability
● Thus, in order to use this approach we need to estimate P(s\
that a patient has meningitis is 1/50,000, and the prior probability
m)instead of P(s).
that any patient has a stiff neck is 1/20.
● The general form of Bayes' rule with normalization is
● Letting s be the proposition that the patient has a stiff neck and m be
the proposition that the patient has meningitis, we have
Where a is the normalization constant needed to make the entries in
P(Y | X) sum to 1.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 9
CS3491-Artificial Intelligence and Machine Learning
● Now the information requirements are the same as for inference using
Using Bayes' rule: Combining evidence each piece of evidence separately: the prior probability P(Cavity) for
In particular, we have argued that probabilistic information is often available the query variable and the conditional probability of each effect, given
in the form P (effect | cause). its cause.
● The general definition of conditional independence of two variables X
and Y , given a third variable Z is
We know, however, that such an approach will not scale up to larger numbers
of variables. We can try using Bayes' rule to reformulate the problem: ● In the dentist domain, for example, it seems reasonable to assert
conditional independence of the variables Toothache and Catch, given
Cavity:
● For this reformulation to work, we need to know the conditional
probabilities of the conjunction toothache A catch for each value of
Cavity. ● Notice that this assertion is somewhat stronger than Equation (13.13),
● These variables are independent, however, given the presence or the which asserts independence only for specific values of Toothache and
absence of a cavity. Catch. As with absolute independence in Equation (13.8), the
● equivalent forms.
Each is directly caused by the cavity, but neither has a direct effect on
the other: toothache depends on the state of the nerves in the tooth,
whereas the probe's accuracy depends on the dentist's skill, to which can also be used.
the toothache is irrelevant. ● Absolute independence assertions allow a decomposition of the full
● Mathematically, this property is written as joint distribution into much smaller pieces. It turns out that the same
is true for conditional independence assertions.
● For example, given the assertion in Equation (13.14), we can derive a
● This equation expresses the conditional independence of toothache
decomposition as follows:
and catch given Cavity. We can plug it into Equation to obtain the
probability of a cavity:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 10
CS3491-Artificial Intelligence and Machine Learning
● In this way, the original large table is decomposed into three smaller ● Such a probability distribution is called a naive Bayes model-"naive"
tables. because it is often used (as a simplifying assumption) in cases where
● The original table has seven independent numbers (23 - 1, because the the "effect" variables are not conditionally independent given the
numbers must sum to 1). The smaller tables contain five independent cause variable.
numbers (2 x (21 - 1) for each conditional probability distribution and ● (The naive Bayes model is sometimes called a Bayesian classifier, a
21 - 1 for the prior on Cavity). somewhat careless usage that has prompted true Bayesians to call it
● This might not seem to be a major triumph, but the point is that, for n IDIOT BAYES the idiot Bayes model.) In practice, naive Bayes systems
symptoms that are all conditionally independent given Cavity, the size can work surprisingly well, even when the independence assumption
of the representation grows as O(n) instead of O(2"). is not true.
● Thus, conditional independence assertions can allow probabilistic
systems to scale up; moreover; they are much more commonly available
than absolute independence assertions. 2.1 Naïve Bayes Models
● Conceptually, Cavity separates Toothache and Catch because it is a o Naïve Bayes algorithm is a supervised learning algorithm, which is
direct cause of both of them. based on Bayes theorem and used for solving classification problems.
● The decomposition of large probabilistic domains into weakly o It is mainly used in text classification that includes a high-dimensional
connected subsets via conditional independence is one of the most training dataset.
important developments in the recent history of AI. o Naïve Bayes Classifier is one of the simple and most effective
● The dentistry example illustrates a commonly occurring pattern in Classification algorithms which helps in building the fast machine
which a single cause directly influences a number of effects, all of learning models that can make quick predictions.
which are conditionally independent, given the cause. o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.
● The full joint distribution can be written as o Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 11
CS3491-Artificial Intelligence and Machine Learning
6 Sunny Yes
7 Overcast Yes
8 Rainy No
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the 9 Sunny No
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given 10 Sunny Yes
that the probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing 11 Rainy No
the evidence.
P(B) is Marginal Probability: Probability of Evidence. 12 Overcast Yes
Example 1: Suppose we have a dataset of weather conditions and
corresponding target variable "Play". So using this dataset we need to 13 Overcast Yes
decide that whether we should play or not on a particular day according Applying Bayes'theorem:
to the weather conditions. P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
Sample No Outlook Play
P(Sunny)= 0.35
0 Rainy Yes P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
1 Sunny Yes P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)
P(No)= 0.29 o
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)
Hence on a Sunny day, Player can play the game.
o Example 2:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 13
CS3491-Artificial Intelligence and Machine Learning
o
3. Each node Xi has a conditional probability distribution P(Xi|
Parents(Xi))that quantifies the effect of the parents on the node.
set of nodes and links—specifies the conditional independence
relationships that hold in the domain, in a way that will be made precise
shortly. The intuitive meaning of an arrow is typically that X has a direct
influence on Y, which suggests that causes should be parents of effects.
Recall the simple world consisting of the variables Toothache, Cavity,
Catch, and Weather. We argued that Weather is independent of the other
variables; furthermore, we argued that Toothache and Catch are conditionally
independent, given Cavity. These relationships are represented by the
Bayesian network structure shown in Figure 14.1. Formally, the conditional
independence of Toothache and Catch, given Cavity, is indicated by the
absence of a link between Toothache and Catch. Intuitively, the network
o
So For the given instance Color=Red, Type=SUV, Origin= Domestic represents the fact that Cavity is a direct cause of Toothache and Catch,
,the Naïve Bayes Model predicts the output stolen=NO whereas no direct causal relationship exists between Toothache and Catch.
[Link] NETWORK
who have promised to call you at work when they hear the alarm. John nearly just a possible combination of values for the parent nodes—a miniature
always calls when he hears the alarm, but sometimes confuses the telephone possible world, if you like. Each row must sum to 1, because the entries
ringing with the alarm and calls then, too. Mary, on the other hand, likes represent an exhaustive set of cases for the variable. In general, a table for a
rather loud music and often misses the alarm altogether. Given the evidence Boolean variable with k Boolean parents contains 2k independently
of who has or has not called, we would like to estimate the probability of a specifiable probabilities.
burglary. Problem:
The network structure shows that burglary and earthquakes directly Calculate the probability that alarm has sounded, but there is neither a
affect the probability of the alarm’s going off, but whether John and Mary call burglary, nor an earthquake occurred, and Marry and John both called the
depends only on the alarm. The network thus represents you.
our assumptions that they do not perceive burglaries directly, they do not List of all events occurring in this network:
notice minor earthquakes, and they do not confer before calling. 1. Burglary (B)
2. Earthquake(E)
3. Alarm(A)
4. John Calls(J)
5. Marry calls(M)
Let's take the observed probability for the Burglary and earthquake
component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and
earthquake:
The conditional distributions in Figure 14.2 are shown as a conditional
B E P(A= True) P(A= False)
probability table, or CPT. True True 0.94 0.06
Each row CONDITIONING CASE in a CPT contains the conditional True False 0.95 0.04
probability of each node value for a conditioning case. A conditioning case is False True 0.31 0.69
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 15
CS3491-Artificial Intelligence and Machine Learning
False False 0.001 0.999 joint probability distribution. The second is to view it as an encoding of a
collection of conditional independence statements.
Conditional probability table for John Calls: 3.1.1Representing the full joint distribution
The Conditional probability of John that he will call depends on the Bayesian network is a directed acyclic graph with some numeric
probability of Alarm. parameters attached to each node. One way to define what the network
A P(J= P(J= False) means—its semantics—is to define the way in which it represents a specific
True) joint distribution over all the variables.
True 0.91 0.09 A generic entry in the joint distribution is the probability of a
False 0.05 0.95 conjunction of particular assignments to each variable, such as P(X1 = x1 ...
Xn = xn). We use the notation P(x1,...,xn) as an abbreviation for this. The
value of this entry is given by the formula.
3.1 SEMANTICS OF BAYESIAN NETWORK: 3.1.2 A method for constructing Bayesian networks
There are two ways in which one can understand the semantics of The next step is to explain how to construct a Bayesian network in such a
Bayesian networks. The first is to see the network as a representation of the way that the resulting joint distribution is a good representation of a given
domain.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 16
CS3491-Artificial Intelligence and Machine Learning
First, we rewrite the entries in the joint distribution in terms of conditional Even in a locally structured domain, we will get a compact Bayesian network
probability, using the product rule only if we choose the node ordering well. What happens if we happen to
choose the wrong order?
Then we repeat the process, reducing each conjunctive probability to a Consider the burglary example again. Suppose we decide to add the nodes in
conditional probability and a smaller conjunction. We end up with one big the order MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
product: We then get the somewhat more complicated network shown in Figure 14.3(a).
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 18
CS3491-Artificial Intelligence and Machine Learning
variable given discrete or continuous parents; and the conditional 5. A typical query asks for the posterior probability distribution
distribution for a discrete variable given continuous parents. P(X | e).
6. In the burglary network, we might observe the event in which
JohnCalls=true and MaryCalls=true.
7. We could then ask for, say, the probability that a burglary has
occurred:
P(Burglary | JohnCalls=true,MaryCalls=true) = {0.284, 0.716}
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 20
CS3491-Artificial Intelligence and Machine Learning
[Link] variable elimination algorithm: ● The process of evaluation is a process of summing out
1. Variable elimination works by evaluating expressions in variables (right to left) from point wise products of
right-to-left order (that is, bottom up). factors to produce new factors, eventually yielding a
2. Intermediate results are stored, and summations over factor that is the solution, i.e., the posterior distribution
each variable are done only for those portions of the over the query variable.
expression that depend on the variable. The steps are as follows:
3. Let us illustrate this process for the burglary network. ● First, we sum out A from the product of f3, f4, and f5.
We evaluate the expression ● This gives us a new 2 × 2 factor f6(B,E) whose indices
range over just B and E.
(The “first” element is given by P (a | b, e) =0.95 and the “last” ● This leaves the expression
by P ( a | b, e) =0.999.)
1. In terms of factors, the query expression is written as which can be evaluated by taking the point wise product and
normalizing the result.
is equal to 1 by definition!
● Hence, there was no need to include it in the first place;
the variable M is irrelevant to this query.
● Another way of saying this is that the result of the
queryP(JohnCalls| Burglary =true) is unchanged if we 5. APPROXIMATE INFERENCE IN BN
remove MaryCalls from the network altogether.
● In general, we can remove any leaf node that is not a Why use approximate inference?
query variable or an evidence variable. 1. Exact inference becomes intractable for large multiply-connected
● After its removal, there may be some more leaf nodes, networks
and these too may be irrelevant. 2. Variable elimination can have exponential time and space complexity
● Continuing this process, we eventually find that every 3. Exact inference is strictly HARDER than NP-complete problems (NP-
variable that is not an ancestor of a query variable or hard)
evidence variable is irrelevant to the query.
Randomized sampling
● A variable elimination algorithm can therefore remove
all these variables before evaluating the query. 1. Monte Carlo algorithms provide approximate answers whose
accuracy depends on the number of samples generated.
2. In recent years, Monte Carlo algorithms have become widely used in
computer science to estimate quantities that are difficult to calculate
exactly.
3. Hence Sampling applied to the computation of posterior probabilities.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 23
CS3491-Artificial Intelligence and Machine Learning
Two Types of algorithms: 7. The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable's parents.
1. Direct Sampling
2. Markov Chain Sampling.
Direct Sampling- Rejection Sampling
1. Used to compute conditional probabilities P(X|e)
2. Generate samples as before Reject samples that do not match
evidence
3. Estimate by counting the how often event X is in the resulting
samples.
Direct Sampling - are Likelihood Weighting
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 24
CS3491-Artificial Intelligence and Machine Learning
In this case, PRIOR-SAMPLE returns the event [true, false, true, true]
9. It is easy to see that PRIOR-SAMPLE generates samples from the prior
joint distribution specified by the network.
First, let Sps(xl, . . . , x,) be the probability that a specific event is generated
by the PRIOR-SAMPLE algorithm.
Steps:
1. First, it generates samples from the prior distribution specified
by the network.
2. Then, it rejects all samples that do not match the evidence
3. Finally, estimates by counting the how often event X is in the
resulting samples.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 25
CS3491-Artificial Intelligence and Machine Learning
• LIKELIHOOD fixes the values for the evidence variables E and samples
only the remaining variables X and Y.
• This guarantees that each event generated is consistent with the
evidence. Not all events are equal, however.
• Before tallying the counts in the distribution for the query variable,
each event is weighted by the likelihood that the event accords to the
evidence, as measured by the product of the conditional probabilities for each
evidence variable, given its parents.
5.1.2 Likelihood weighting • Intuitively, events in which the actual evidence appears unlikely
should be given less weight.
Likelihood weighting avoids the inefficiency of rejection sampling by
Let us apply the algorithm to the network shown in Figure 14.11(a), with the
generating only events that are consistent with the evidence e.
query
P(Rain/Sprinkler = true, WetGrass = true).
The process goes as follows: First, the weight w is set to 1.0. Then an event is
generated:
1. Cloudy is an evidence variable with value true. Therefore, we set
w w × P(Cloudy = true)=0.5 .
2. Sprinkler is not an evidence variable, so sample from P(Sprinkler |
Cloudy = true) =[0.1, 0.9]; suppose this returns false.
3. Similarly, sample from P(Rain | Cloudy = true) = 0.8, 0.2; suppose this
returns true.
4. WetGrass is an evidence variable with value true. Therefore, we set w
w × P(WetGrass = true | Sprinkler = false, Rain = true)=0.45 .
Here WEIGHTED-SAMPLE returns the event [true, false,true,true] with weight
0.45, and this is tallied under Rain = true.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 26
CS3491-Artificial Intelligence and Machine Learning
1. MCMC therefore wanders randomly around the state space-the space [Link] Bayesian Networks:
of possible complete assignments-flipping one variable at a time, but keeping
the evidence variables fixed. A directed network which illustrates casual dependencies among all the
2. Consider the query P(Rain1 Sprinkler = true, Wet Grass = true) components of the network.
applied to the network in Figure 14.1 l(a). A causal network for a domain of chance variables U is a directed acyclic
3. The evidence variables Sprinkler and WetGrass are fixed to their graph where Nodes correspond to chance variables in U and Non root
observed values and the hidden variables Cloudy and Rain are initialized node is a direct causal effect of its parents.
randomly-let us say to true and false respectively.
4. Thus, the initial state is [true, true, false, true]. An example of a causal network is shown in Figure :
5. Now the following steps are executed repeatedly:
Each state visited during this process is a sample that contributes to the
estimate for the query variable Rain. If the process visits 20 states where Rain
is true and 60 states where Rain is false, then the answer to the query is
NORMALIZE((^^, 60)) = (0.25,0.75). The complete algorithm is shown in
Figure 14.15.
• The diagram indicates that whether or not a car starts is caused by the
condition of its battery and fuel supply, that whether or not a car
moves is caused by whether or not it starts, and that (in this model)
the condition of the battery and the fuel supply have no causes.
In this example, we assume that all variables are binary.
A Causal Network:
• A Casual relationship exists when one variable in a data set has direct
influence on another variable. Thus one event triggers occurrence of
another event.
• Causal relationship also called as cause and effect.
Example:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 27
CS3491-Artificial Intelligence and Machine Learning
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 28