0% found this document useful (0 votes)
10 views28 pages

Probabilistic Reasoning in AI

The document discusses probabilistic reasoning in artificial intelligence, focusing on how agents act under uncertainty using Bayesian inference and decision theory. It highlights the challenges of logical reasoning in uncertain environments, particularly in medical diagnosis, and introduces concepts such as utility theory and the axioms of probability. The text emphasizes the distinction between degrees of belief and truth, and the importance of updating probabilities based on new evidence.

Uploaded by

kavindharan40828
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views28 pages

Probabilistic Reasoning in AI

The document discusses probabilistic reasoning in artificial intelligence, focusing on how agents act under uncertainty using Bayesian inference and decision theory. It highlights the challenges of logical reasoning in uncertain environments, particularly in medical diagnosis, and introduces concepts such as utility theory and the axioms of probability. The text emphasizes the distinction between degrees of belief and truth, and the importance of updating probabilities based on new evidence.

Uploaded by

kavindharan40828
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

CS3491-Artificial Intelligence and Machine Learning

7. The right thing to do-the rational decision—therefore depends on


UNIT II PROBABILISTIC REASONING 9 both the relative importance of various goals and the likelihood that,
and degree to which, they will be achieved.
Acting under uncertainty – Bayesian inference – naïve bayes models.
Probabilistic reasoning –Bayesian networks – exact inference in BN – 1.1 Handling uncertain knowledge
approximate inference in BN – causal networks. In this section, we look more closely at the nature of uncertain knowledge.
Course Outcomes: We will use a simple diagnosis example to illustrate the concepts involved.
At the end of the Course Students will be able to Diagnosis-whether for medicine, automobile repair, or whatever-is a task that
CO2: Apply reasoning under uncertainty almost always involves uncertainty.
Let us try to write rules for dental diagnosis using first-order logic, so that we
[Link] UNDER UNCERTAINTY can see how the logical approach breaks down.
Consider the following rule:
1. When an agent knows enough facts about its environment, the logical
approach enables it to derive plans that are guaranteed to work.
2. This is a good thing. Unfortunately, agents almost never have access to The problem is that this rule is wrong. Not all patients with toothaches have
the whole truth about their environment. Agents must, therefore, act cavities; some of them have gum disease, an abscess, or one of several other
under uncertainty. problems:
3. If a logical agent cannot conclude that any particular course of action
achieves its goal, then it will be unable to act.
4. Conditional planning can overcome uncertainty to some extent, but Unfortunately, in order to make the rule true, we have to add an almost
only if the agent's sensing actions can obtain the required information unlimited list of possible causes. We could try turning the rule into a causal
and only if there are not too many different contingencies. rule:
5. Another possible solution would be to endow the agent with a simple
but incorrect theory of the world that does enable it to derive a plan;
presumably, such plans will work most of the time, but problems arise The only way to fix the rule is to make it logically exhaustive: to augment the
when events contradict the agent's theory. left-hand side with all the qualifications required for a cavity to cause a
6. Moreover, handling the tradeoff between the accuracy and usefulness toothache.
of the agent's theory seems itself to require reasoning about ● Trying to use first-order logic to cope with a domain like medical
uncertainty. diagnosis thus fails for three main reasons:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 1
CS3491-Artificial Intelligence and Machine Learning

1. Laziness: It is too much work to list the complete set of antecedents ● Assigning a probability of 0 to a given sentence corresponds to an
or consequents needed to ensure an exception less rule and too hard unequivocal belief that the sentence is false, while assigning a
to use such rules. probability of 1 corresponds to an unequivocal belief that the
2. Theoretical ignorance: Medical science has no complete theory for sentence is true.
the domain. ● Probabilities between 0 and 1 correspond to intermediate degrees of
3. Practical ignorance: Even if we know all the rules, we might be belief in the truth of the sentence.
uncertain about a particular patient because not all the necessary ● The sentence itself is in fact either true or false.
tests have been or can be run.
● It is important to note that a degree of belief is different from a degree
● The connection between toothaches and cavities is just not a logical
of truth.
consequence in either direction.
● A probability of 0.8 does not mean "80% true" but rather an 80%
● This is typical of the medical domain, as well as most other judgmental
degree of belief-that is, a fairly strong expectation.
domains: law, business, design, automobile repair, gardening, dating,
● Thus, probability theory makes the same ontological commitment as
and so on.
logic namely, that facts either do or do not hold in the world.
● The agent's knowledge can at best provide only a degree of belief in
● Degree of truth, as opposed to degree of belief, is the subject of fuzzy
the relevant sentences. Our main tool for dealing with degrees of
logic.
belief will be probability theory, which assigns to each sentence a
numerical degree of belief between 0 and 1.
1. In logic, a sentence such as "The patient has a cavity" is true or false
depending on the interpretation and the world; it is true just when the
1. Probability provides a way of summarizing the uncertainty that comes
fact it refers to is the case.
from our laziness and ignorance.
2. In probability theory, a sentence such as "The probability that the
2. But we believe that there is, say, an 80% chance-that is, a probability
patient has a cavity is 0.8" is about the agent's beliefs, not directly
of 0.8-that the patient has a cavity if he or she has a toothache.
about the world.
3. This belief could be derived from statistical data-80% of the toothache 3. These beliefs depend on the percepts that the agent has received to
patients seen so far have had cavities-or from some general rules, or
date. These percepts constitute the evidence on which probability
from a combination of evidence sources.
assertions are based.
4. The missing 20% summarizes all the other possible causes of 4. Just as entailment status can change when more sentences are added
toothache that we are too lazy or ignorant to confirm or deny. to the knowledge base, probabilities can change when more evidence
is acquired.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 2
CS3491-Artificial Intelligence and Machine Learning

5. All probability statements must therefore indicate the evidence with 8. Utility theory says that every state has a degree of usefulness, or
respect to which the probability is being assessed. utility, to an agent and that the agent will prefer states with higher
6. As the agent receives new percepts, its probability assessments are utility.
updated to reflect the new evidence. 9. The utility of a state is relative to the agent whose preferences the
7. Before the evidence is obtained, we talk about prior or utility function is supposed to represent.
unconditional probability; after the evidence is obtained, we talk 10. Example: The utility of a state in which White has won a game of
about posterior or conditional probability. chess is obviously high for the agent playing White, but low for the
8. In most cases, an agent will have some evidence from its percepts and agent playing Black.
will be interested in computing the posterior probabilities of the 11. Or again, some players (including the authors) might be happy with a
outcomes it cares about. draw against the world champion, whereas other players (including
the former world champion) might not.
1.2 Uncertainty and rational decisions 12. A utility function can even account for altruistic behavior, simply by
1. The presence of uncertainty radically changes the way an agent makes including the welfare of others as one of the factors contributing to the
decisions. agent's own utility.
2. A logical agent typically has a goal and executes any plan that is
guaranteed to achieve it. ● Preferences, as expressed by utilities, are combined with probabilities
3. An action can be selected or rejected on the basis of whether it in the general theory of rationa1 decisions called decision theory:
achieves the goal, regardless of what other actions might achieve. Decision theory = probability theory + utility theory.
4. To make such choices, an agent must first have preferences between ● The fundamental idea of decision theory is that an agent is rational if and
the different possible outcomes of the various plans. only if it chooses the action that yields the highest expected utility,
5. A particular outcome is a completely specified state, including such averaged over all the possible outcomes of the action.
factors as whether the agent arrives on time and the length of the wait
● This is called the principle of Maximum Expected Utility (MEU).
at the airport.
6. We will be using utility theory to represent and reason with
1.3 Design for a decision-theoretic agent
preferences.
● The structure of an agent that uses decision theory to select actions
7. The term utility is used here in the sense of "the quality of being
useful," not in the sense of the electric company or water works. shows in figure.
● The primary difference is that the decision-theoretic agent's
knowledge of the current state is uncertain; the agent's belief state is
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 3
CS3491-Artificial Intelligence and Machine Learning

a representation of the probabilities of all possible actual states of the 2. Necessarily true (i.e., valid) propositions have probability I, and
world. necessarily false (i.e., unsatisfiable) propositions have probability 0.
● As time passes, the agent accumulates more evidence and its belief
state changes.
● Given the belief state, the agent can make probabilistic predictions of
action outcomes and hence select the action with highest expected 3. The probability of a disjunction is given by
utility.

● This rule is easily remembered by noting that the cases where a holds,
together with the cases where b holds, certainly cover all the cases
where a V b holds; but summing the two sets of cases counts their
intersection twice, so we need to subtract P (a b).
● These three axioms are often called Kolmogorov's axioms.

Using the axioms of probability


We can derive a variety of useful facts from the basic axioms. For example,
the familiar rule for negation follows by substituting la for b in axiom 3,
Figure: A decision-theoretic agent that selects rational actions. giving us:

1.4 The Axioms of Probability


We have defined syntax for propositions and for prior and conditional
probability statements about those propositions.
Now we must provide some sort of semantics for probability statements. Let the discrete variable D have the domain (d1.........., dn). Then it is easy to
We begin with the basic axioms that serve to define the probability scale and show (Exercise) that
its endpoints:
1. All probabilities are between 0 and 1. For any proposition a,

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 4
CS3491-Artificial Intelligence and Machine Learning

● That is, any probability distribution on a single variable must sum to It is can an agent not hold the following set of beliefs, which clearly
also true that any joint probability distribution on any set of variables violates axiom 3?
must sum to 1: this can be seen simply by creating a single megavariable
whose domain is the cross product of the domains of the original
equation (1)
variables.
● Here, we give one argument for the axioms of probability, first stated in
● Recall that any proposition a is equivalent to the disjunction of all the
193 1 by Bruno de Finetti.
atomic events in which a holds; call this set of events e(a).
● The key to de Finetti's argument is the connection between degree of
● Recall also that atomic events are mutually exclusive, so the probability of
belief and actions.
any conjunction of atomic events is zero, by axiom 2.
● The idea is that if an agent has some degree of belief in a proposition a,
● Hence, from axiom 3, we can derive the following simple relationship: The
then the agent should be able to state odds at which it is indifferent to a
probability of a proposition is equal to the sum of the probabilities of the
bet for or against a.
atomic events in which it holds; that is,
● Think of it as a game between two agents: Agent 1 states "my degree of
belief in event a is 0.4." Agent 2 is then free to choose whether to bet for
or against a, at stakes that are consistent with the stated degree of
belief.
● This equation provides a simple method for computing the probability ● That is, Agent 2 could choose to bet that a will occur, betting $4 against
of any proposition, given a full joint distribution that specifies the Agent 1's $6. Or Agent 2 could bet $6 against $4 that A will not occur.
probabilities of all atomic events.
● If an agent's degrees of belief do not accurately reflect the world, then
1.5 Why the axioms of probability are reasonable
you would expect that it would tend to lose money over the long run to
● The axioms of probability can be seen as restricting the set of
an opposing agent whose beliefs more accurately reflect the state of the
probabilistic beliefs that an agent can hold. world.
● In the logical case, the semantic definition of conjunction means that ● But de Finetti proved something much stronger: ZfAgent I expresses a set
at least one of the three beliefs just mentioned must be false in the of degrees of belief that violate the axioms of probability theory then there
world, so it is unreasonable for an agent to believe all three. is a combination of bets by Agent 2 that guarantees that Agent I will lose
● With probabilities, on the other hand, statements refer not to the money every time.
world directly, but to the agent's own state of knowledge. Why, then,

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 5
CS3491-Artificial Intelligence and Machine Learning

● So if you accept the idea that an agent should be willing to "put its
money where its probabilities are," then you should accept that it is
irrational to have beliefs that violate the axioms of probability.
● Suppose that Agent 1 has the set of degrees of belief from Equation (1).
Figure shows that if Agent 2 chooses to bet $4 on a, $3 on b, and $2 on
Figure: A full joint distribution for the Toothache, Cavity, Catch world.
(a b), then Agent 1 always loses money, regardless of the outcomes
● Notice that the probabilities in the joint distribution sum to 1, as
for a and b.
required by the axioms of probability.
● For Example, there are six atomic events in which cavity V toothache
holds:

● One particularly common task is to extract the distribution over some


subset of variables or a single variable.
● For example, adding the entries in the first row gives the unconditional
● Because Agent 1 has inconsistent beliefs, Agent 2 is able to devise a set of or marginal probability of cavity:
bets that guarantees a loss for Agent 1, no matter what the outcome of a
and b.
1.6 Inference using full joint distribution:
● A simple method for probabilistic inference-that is, the computation ● This process is called marginalization, or summing out-because the
from observed evidence of posterior probabilities for query propositions. variables other than Cavity are summed out.
● We will use the full joint distribution as the "knowledge base" from ● We can write the following general marginalization rule for any sets of
which answers to all questions may be derived. variables Y and Z:
● We begin with a very simple example: a domain consisting of just the
three Boolean variables Toothache, Cavity, and Catch (the dentist's nasty
steel probe catches in my tooth).
● The full joint distribution is a 2 x 2 x 2 table as shown in Figure

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 6
CS3491-Artificial Intelligence and Machine Learning

● That is, a distribution over Y can be obtained by summing out all the ● In fact, it can be viewed as a normalization constant for the
other variables from any joint distribution containing Y. distribution P( Cavity / toothache), ensuring that it adds up to 1.
● Throughout the chapters dealing with probability, we will use a to
● A variant of this rule involves conditional probabilities instead of joint denote such constants. With this notation, we can write the two
probabilities, using the product rule: preceding equations in one:

● This rule is called conditioning. Marginalization and conditioning will


turn out to be useful rules for all kinds of derivations involving
probability expressions. ● Normalization will turn out to be a useful shortcut in many probability
Conditional probabilities can be found by first using Equation P(a|b) = calculations.
P(a b)/ P(b) to obtain an expression in terms of unconditional probabilities ● From the example, we can extract a general inference procedure.
and then evaluating the expression from the full joint distribution. ● We will stick to the case in which the query involves a single variable.
● We will need some notation: let X be the query variable (Cavity in the
example), let E be the set of evidence variables (just Toothache in the
example), let e be the observed values for them, and let Y be the
remaining unobserved variables (just Catch in the example).
Just to check, we can also compute the probability that there is no cavity,
given a toothache: ● The query is P(X|e) and can be evaluated as

● Where the summation is over all possible ys (i.e., all possible


combinations of values of the unobserved variables Y).
● Notice that in these two calculations the term l/P(toothache) remains
constant, no matter which value of Cavity we calculate.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 7
CS3491-Artificial Intelligence and Machine Learning

● Notice that together the variables X, E, and Y constitute the complete ● In a realistic problem, there might be hundreds or thousands of
set of variables for the domain, so P(X, e, y) is simply a subset of random variables to consider, not just three.
probabilities from the full joint distribution. ● It quickly becomes completely impractical to define the vast numbers
● It loops over the values of X and the values of Y to enumerate all of probabilities required-the experience needed in order to estimate
possible atomic events with e fixed, add up their probabilities from each of the table entries separately simply cannot exist.
the joint table, and normalize the results. ● For these reasons, the full joint distribution in tabular form is not a
An algorithm for probabilistic inference by enumeration of the entries in practical tool for building reasoning systems.
a full joint: distribution. ● Instead, it should be viewed as the theoretical foundation on which
more effective approaches may be built.

2. BAYER'S R ULE AND ITS USE


● We defined the product rule and pointed out that can be written in
two forms because of the commutativity of conjunction:

● Equating the two right-hand sides and dividing by P(a), we get

● This equation is known as Bayes' rule (also Bayes' law or Bayes'


theorem).
● This simple equation underlies all modern A1 systems for
● Given the full joint distribution to work with, ENUMERATE-JOINT-ASK probabilistic inference.
is a complete algorithm for answering probabilistic queries for ● The more general case of multivalued variables can be written in the P
discrete variables. notation as
● For a domain described by n Boolean variables, it requires an input
table of size 0(2n) and takes 0(2n) time to process the table.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 8
CS3491-Artificial Intelligence and Machine Learning

Where again this is to be taken as representing a set of equations, each


dealing with specific values of the variables.

● That is, we expect only 1 in 5000 patients with a stiff neck to have
meningitis.
Applying Bayes' rule: The simple case ● Notice that, even though a stiff neck is quite strongly indicated by
● On the surface, Bayes' rule does not seem very useful. It requires three meningitis (with probability 0.5), the probability of meningitis in the
terms-a conditional probability and two unconditional probabilities- patient remains small.
just to compute one conditional probability.
● Bayes' rule is useful in practice because there are many cases where ● A process by which one can avoid assessing the probability of the
we do have good probability estimates for these three numbers and evidence (here, P(s)) by instead computing a posterior probability for
need to compute the fourth. each value of the query variable (here, m and m) and then
● In a task such as medical diagnosis, we often have conditional normalizing the results.
probabilities on causal relationships and want to derive a diagnosis. ● The same process can be applied when using Bayes' rule. We have
● A doctor knows that the disease meningitis causes the patient to have
a stiff neck, say, 50% of the time.
● The doctor also knows some unconditional facts: the prior probability
● Thus, in order to use this approach we need to estimate P(s\
that a patient has meningitis is 1/50,000, and the prior probability
m)instead of P(s).
that any patient has a stiff neck is 1/20.
● The general form of Bayes' rule with normalization is
● Letting s be the proposition that the patient has a stiff neck and m be
the proposition that the patient has meningitis, we have
Where a is the normalization constant needed to make the entries in
P(Y | X) sum to 1.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 9
CS3491-Artificial Intelligence and Machine Learning

● Now the information requirements are the same as for inference using
Using Bayes' rule: Combining evidence each piece of evidence separately: the prior probability P(Cavity) for
In particular, we have argued that probabilistic information is often available the query variable and the conditional probability of each effect, given
in the form P (effect | cause). its cause.
● The general definition of conditional independence of two variables X
and Y , given a third variable Z is
We know, however, that such an approach will not scale up to larger numbers
of variables. We can try using Bayes' rule to reformulate the problem: ● In the dentist domain, for example, it seems reasonable to assert
conditional independence of the variables Toothache and Catch, given
Cavity:
● For this reformulation to work, we need to know the conditional
probabilities of the conjunction toothache A catch for each value of
Cavity. ● Notice that this assertion is somewhat stronger than Equation (13.13),
● These variables are independent, however, given the presence or the which asserts independence only for specific values of Toothache and
absence of a cavity. Catch. As with absolute independence in Equation (13.8), the
● equivalent forms.
Each is directly caused by the cavity, but neither has a direct effect on
the other: toothache depends on the state of the nerves in the tooth,
whereas the probe's accuracy depends on the dentist's skill, to which can also be used.
the toothache is irrelevant. ● Absolute independence assertions allow a decomposition of the full
● Mathematically, this property is written as joint distribution into much smaller pieces. It turns out that the same
is true for conditional independence assertions.
● For example, given the assertion in Equation (13.14), we can derive a
● This equation expresses the conditional independence of toothache
decomposition as follows:
and catch given Cavity. We can plug it into Equation to obtain the
probability of a cavity:

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 10
CS3491-Artificial Intelligence and Machine Learning

● In this way, the original large table is decomposed into three smaller ● Such a probability distribution is called a naive Bayes model-"naive"
tables. because it is often used (as a simplifying assumption) in cases where
● The original table has seven independent numbers (23 - 1, because the the "effect" variables are not conditionally independent given the
numbers must sum to 1). The smaller tables contain five independent cause variable.
numbers (2 x (21 - 1) for each conditional probability distribution and ● (The naive Bayes model is sometimes called a Bayesian classifier, a
21 - 1 for the prior on Cavity). somewhat careless usage that has prompted true Bayesians to call it
● This might not seem to be a major triumph, but the point is that, for n IDIOT BAYES the idiot Bayes model.) In practice, naive Bayes systems
symptoms that are all conditionally independent given Cavity, the size can work surprisingly well, even when the independence assumption
of the representation grows as O(n) instead of O(2"). is not true.
● Thus, conditional independence assertions can allow probabilistic
systems to scale up; moreover; they are much more commonly available
than absolute independence assertions. 2.1 Naïve Bayes Models
● Conceptually, Cavity separates Toothache and Catch because it is a o Naïve Bayes algorithm is a supervised learning algorithm, which is
direct cause of both of them. based on Bayes theorem and used for solving classification problems.
● The decomposition of large probabilistic domains into weakly o It is mainly used in text classification that includes a high-dimensional
connected subsets via conditional independence is one of the most training dataset.
important developments in the recent history of AI. o Naïve Bayes Classifier is one of the simple and most effective
● The dentistry example illustrates a commonly occurring pattern in Classification algorithms which helps in building the fast machine
which a single cause directly influences a number of effects, all of learning models that can make quick predictions.
which are conditionally independent, given the cause. o It is a probabilistic classifier, which means it predicts on the basis
of the probability of an object.
● The full joint distribution can be written as o Some popular examples of Naïve Bayes Algorithm are spam
filtration, Sentimental analysis, and classifying articles.

Naïve: It is called Naïve because it assumes that the occurrence of a


certain feature is independent of the occurrence of other features. Such as
if the fruit is identified on the bases of color, shape, and taste, then red,

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 11
CS3491-Artificial Intelligence and Machine Learning

spherical, and sweet fruit is recognized as an apple. Hence each feature


3 Overcast Yes
individually contributes to identify that it is an apple without depending
on each other. 4 Sunny No
Bayes: It is called Bayes because it depends on the principle of Bayes'
Theorem. 5 Rainy Yes

6 Sunny Yes

7 Overcast Yes

8 Rainy No
Where,
P(A|B) is Posterior probability: Probability of hypothesis A on the 9 Sunny No
observed event B.
P(B|A) is Likelihood probability: Probability of the evidence given 10 Sunny Yes
that the probability of a hypothesis is true.
P(A) is Prior Probability: Probability of hypothesis before observing 11 Rainy No
the evidence.
P(B) is Marginal Probability: Probability of Evidence. 12 Overcast Yes
Example 1: Suppose we have a dataset of weather conditions and
corresponding target variable "Play". So using this dataset we need to 13 Overcast Yes
decide that whether we should play or not on a particular day according Applying Bayes'theorem:
to the weather conditions. P(Yes|Sunny)= P(Sunny|Yes)*P(Yes)/P(Sunny)
P(Sunny|Yes)= 3/10= 0.3
Sample No Outlook Play
P(Sunny)= 0.35
0 Rainy Yes P(Yes)=0.71
So P(Yes|Sunny) = 0.3*0.71/0.35= 0.60
1 Sunny Yes P(No|Sunny)= P(Sunny|No)*P(No)/P(Sunny)

2 Overcast Yes P(Sunny|NO)= 2/4=0.5


Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 12
CS3491-Artificial Intelligence and Machine Learning

P(No)= 0.29 o
P(Sunny)= 0.35
So P(No|Sunny)= 0.5*0.29/0.35 = 0.41
So as we can see from the above calculation that P(Yes|Sunny)>P(No|Sunny)
Hence on a Sunny day, Player can play the game.

Advantages of Naïve Bayes Classifier:


o Naïve Bayes is one of the fast and easy ML algorithms to predict a
class of datasets.
o It can be used for Binary as well as Multi-class Classifications.
o It performs well in Multi-class predictions as compared to the other
Algorithms.
o It is the most popular choice for text classification problems.

Disadvantages of Naïve Bayes Classifier: o Solution:


o Naive Bayes assumes that all features are independent or unrelated,
so it cannot learn the relationship between features.

o Example 2:

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 13
CS3491-Artificial Intelligence and Machine Learning

o
3. Each node Xi has a conditional probability distribution P(Xi|
Parents(Xi))that quantifies the effect of the parents on the node.
set of nodes and links—specifies the conditional independence
relationships that hold in the domain, in a way that will be made precise
shortly. The intuitive meaning of an arrow is typically that X has a direct
influence on Y, which suggests that causes should be parents of effects.
Recall the simple world consisting of the variables Toothache, Cavity,
Catch, and Weather. We argued that Weather is independent of the other
variables; furthermore, we argued that Toothache and Catch are conditionally
independent, given Cavity. These relationships are represented by the
Bayesian network structure shown in Figure 14.1. Formally, the conditional
independence of Toothache and Catch, given Cavity, is indicated by the
absence of a link between Toothache and Catch. Intuitively, the network
o
So For the given instance Color=Red, Type=SUV, Origin= Domestic represents the fact that Cavity is a direct cause of Toothache and Catch,
,the Naïve Bayes Model predicts the output stolen=NO whereas no direct causal relationship exists between Toothache and Catch.

[Link] NETWORK

Bayesian network used to represent the dependencies among


variables. Bayesian networks can represent essentially any full joint
probability distribution. Other names of Bayesian network is belief network,
probabilistic network, causal network, and knowledge map.
A Bayesian network is a directed graph in which each node is annotated with
quantitative probability information. The full specification is as follows:
1. Each node corresponds to a random variable, which may be discrete or Example 2:
continuous. You have a new burglar alarm installed at home. It is fairly reliable at
2. A set of directed links or arrows connects pairs of nodes. If there is an detecting a burglary, but also responds on occasion to minor earthquakes.
arrow from node X to node Y , X is said to be a parent of Y. The graph has no (This example is due to Judea Pearl, a resident of Los Angeles—hence the
directed cycles (and hence is a directed acyclic graph, or DAG. acute interest in earthquakes.) You also have two neighbours, John and Mary,
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 14
CS3491-Artificial Intelligence and Machine Learning

who have promised to call you at work when they hear the alarm. John nearly just a possible combination of values for the parent nodes—a miniature
always calls when he hears the alarm, but sometimes confuses the telephone possible world, if you like. Each row must sum to 1, because the entries
ringing with the alarm and calls then, too. Mary, on the other hand, likes represent an exhaustive set of cases for the variable. In general, a table for a
rather loud music and often misses the alarm altogether. Given the evidence Boolean variable with k Boolean parents contains 2k independently
of who has or has not called, we would like to estimate the probability of a specifiable probabilities.
burglary. Problem:
The network structure shows that burglary and earthquakes directly Calculate the probability that alarm has sounded, but there is neither a
affect the probability of the alarm’s going off, but whether John and Mary call burglary, nor an earthquake occurred, and Marry and John both called the
depends only on the alarm. The network thus represents you.
our assumptions that they do not perceive burglaries directly, they do not List of all events occurring in this network:
notice minor earthquakes, and they do not confer before calling. 1. Burglary (B)
2. Earthquake(E)
3. Alarm(A)
4. John Calls(J)
5. Marry calls(M)
Let's take the observed probability for the Burglary and earthquake
component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False)= 0.999, Which is the probability that an earthquake not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm A:
The Conditional probability of Alarm A depends on Burglar and
earthquake:
The conditional distributions in Figure 14.2 are shown as a conditional
B E P(A= True) P(A= False)
probability table, or CPT. True True 0.94 0.06
Each row CONDITIONING CASE in a CPT contains the conditional True False 0.95 0.04
probability of each node value for a conditioning case. A conditioning case is False True 0.31 0.69
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 15
CS3491-Artificial Intelligence and Machine Learning

False False 0.001 0.999 joint probability distribution. The second is to view it as an encoding of a
collection of conditional independence statements.
Conditional probability table for John Calls: 3.1.1Representing the full joint distribution
The Conditional probability of John that he will call depends on the Bayesian network is a directed acyclic graph with some numeric
probability of Alarm. parameters attached to each node. One way to define what the network
A P(J= P(J= False) means—its semantics—is to define the way in which it represents a specific
True) joint distribution over all the variables.
True 0.91 0.09 A generic entry in the joint distribution is the probability of a
False 0.05 0.95 conjunction of particular assignments to each variable, such as P(X1 = x1 ...
Xn = xn). We use the notation P(x1,...,xn) as an abbreviation for this. The
value of this entry is given by the formula.

Conditional probability table for Marry Calls:


The Conditional probability of Marry that she calls is depending on its
Parent Node "Alarm." where parents(Xi) denotes the values of Parents(Xi) that appear in x1,...,xn.
Thus, each entry in the joint distribution is represented by the product of the
A P(M= True) P(M= False)
True 0.70 0.25 appropriate elements of the conditional probability tables (CPTs) in the
False 0.02 0.98 Bayesian network.
From the formula of joint distribution, we can write the problem statement in To illustrate this, we can calculate the probability that the alarm has
the form of probability distribution: sounded, but neither a burglary nor an earthquake has occurred, and both
P(M, J, A, ¬B, ¬E) = P (M|A) *P (J|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E). John and Mary call. Probabilistic Reasoning from the joint distribution (using
= 0.70* 0.91* 0.001* 0.998*0.999 single-letter names for the variables):
= 0.00063509. P(j, m, a,¬b,¬e) = P(j | a)P(m | a)P(a | ¬b ¬e)P(¬b)P(¬e) = 0.90 × 0.70 × 0.001
Hence, a Bayesian network can answer any query about the domain by using × 0.999 × 0.998 = 0.000628 .
Joint distribution

3.1 SEMANTICS OF BAYESIAN NETWORK: 3.1.2 A method for constructing Bayesian networks
There are two ways in which one can understand the semantics of The next step is to explain how to construct a Bayesian network in such a
Bayesian networks. The first is to see the network as a representation of the way that the resulting joint distribution is a good representation of a given
domain.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 16
CS3491-Artificial Intelligence and Machine Learning

First, we rewrite the entries in the joint distribution in terms of conditional Even in a locally structured domain, we will get a compact Bayesian network
probability, using the product rule only if we choose the node ordering well. What happens if we happen to
choose the wrong order?
Then we repeat the process, reducing each conjunctive probability to a Consider the burglary example again. Suppose we decide to add the nodes in
conditional probability and a smaller conjunction. We end up with one big the order MaryCalls, JohnCalls, Alarm, Burglary, Earthquake.
product: We then get the somewhat more complicated network shown in Figure 14.3(a).

This identity is called the chain rule.


we see that the specification of the joint distribution is equivalent to the
general assertion that, for every variable Xi in the network,

Bayesian network is a correct representation of the domain only if each


node is conditionally independent of its other predecessors in the node
ordering, given its parents. We can satisfy this condition with this
methodology:
1. Nodes: First determine the set of variables that are required to model the The process goes as follows:
domain. Now order them, {X1,...,Xn}. Any order will work, but the resulting • Adding MaryCalls: No parents.
network will be more compact if the variables are ordered such that causes • Adding JohnCalls: If Mary calls, that probably means the alarm has gone
precede effects. off, which of course would make it more likely that John calls.
2. Links: For i = 1 to n do: Therefore, JohnCalls needs MaryCalls as a parent.
• Choose, from X1,...,Xi−1, a minimal set of parents for Xi, such that Equation • Adding Alarm: Clearly, if both call, it is more likely that the alarm has
(14.3) is satisfied. gone off than if just one or neither calls, so we need both
• For each parent insert a link from the parent to Xi. MaryCalls and JohnCalls as parents.
• CPTs: Write down the conditional probability table, P(Xi|Parents(Xi)). • Adding Burglary: If we know the alarm state, then the call from John or
3.1.3 Compactness and node ordering Mary might give us information about our phone ringing or Mary’s
music, but not about burglary:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 17
CS3491-Artificial Intelligence and Machine Learning

P(Burglary | Alarm, JohnCalls , MaryCalls) = P(Burglary | Alarm) .


Hence we need just Alarm as parent.
• Adding Earthquake: If the alarm is on, it is more likely that there has
been an earthquake. (The alarm is an earthquake detector of
sorts.) But if we know that there has been a burglary, then that
explains the alarm, and the probability of an earthquake would be
only slightly above normal. Hence, we need both Alarm and
Burglary as parents.
Figure 14.3(b) shows a very bad node ordering: MaryCalls, JohnCalls,
Earthquake, Burglary, Alarm. This network requires 31 distinct
probabilities to be specified—exactly the same number as the full joint
distribution.
3.1.4 Conditional independence relations in Bayesian networks
The topological semantics2 specifies that each variable is 3.1.5 Bayesian nets with continuous variables
conditionally independent of its non-descendants, given its Many real-world problems involve continuous quantities, such as
parents. For example, in Figure 14.2, JohnCalls is independent of height, mass, temperature, and money; in fact, much of statistics deals with
Burglary, Earthquake, and MaryCalls given the value of Alarm. random variables whose domains are continuous. By definition, continuous
Another important independence property is implied by the variables have an infinite number of possible values, so it is impossible to
topological semantics: a node is conditionally independent of all specify conditional probabilities explicitly for each value. One possible way
other nodes in the network, given its parents, children, and to handle continuous variables is to avoid them by using
children’s parents—that is, given its Markov blanket. discretization—that is, dividing up the possible values into a fixed set of
For example, Burglary is independent of JohnCalls and MaryCalls, intervals. For example, temperatures could be divided into (<0oC),
given Alarm and Earthquake. This property is illustrated in Figure (0oC−100oC), and (>100oC).
14.4(b). A network with both discrete and continuous variables is called a hybrid
Bayesian network. To specify a hybrid network, we have to specify two
new kinds of distributions: the conditional distribution for a continuous

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 18
CS3491-Artificial Intelligence and Machine Learning

variable given discrete or continuous parents; and the conditional 5. A typical query asks for the posterior probability distribution
distribution for a discrete variable given continuous parents. P(X | e).
6. In the burglary network, we might observe the event in which
JohnCalls=true and MaryCalls=true.
7. We could then ask for, say, the probability that a burglary has
occurred:
P(Burglary | JohnCalls=true,MaryCalls=true) = {0.284, 0.716}

Exact inference Algorithms for considering posterior probabilities:


1. Inference for enumeration
2. Variable elimination algorithm
4.1. Inference for enumeration:
Any conditional probability can be computed by summing terms from the
4. EXACT INFERENCE IN BAYESIAN NETWORK: full joint distribution.
1. The basic task for any probabilistic inference system is to More specifically, a query P(X | e) can be answered using
compute the posterior probability distribution for a set of
query variables, given some observed event—that is, some
assignment of values to a set of evidence variables. ● Now, a Bayesian network gives a complete
2. To simplify the presentation, only one query variable is representation of the full joint distribution.
considered at a time.
● Therefore, a query can be answered using a Bayesian
3. The algorithms can easily be extended to queries with multiple
network by computing sums of products of conditional
variables. probabilities from the network.
● X denotes the query variable; ● Consider the query P(Burglary |
● E denotes the set of evidence variables E1, . . . ,Em, and JohnCalls=true,MaryCalls=true).
e is a particular observed event; ● The hidden variables for this query are Earthquake and
●Y denotes the nonevidence, nonquery variables Y1, . . . Alarm.
, Yl(called the hidden variables).
4. Thus,the complete set of variables is X={X} E Y.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 19
CS3491-Artificial Intelligence and Machine Learning

An improvement can be obtained from the following simple


observations:
1. the P(b)term is a constant and can be moved outside the
summations over a and e, and the P(e) term can be moved
outside the summation over a. Hence, we have
2.

This expression can be evaluated by looping through the


variables in order, multiplying CPT entries as we go.
3. For each summation, we also need to loop over the variable’s
possible values.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 20
CS3491-Artificial Intelligence and Machine Learning

[Link] variable elimination algorithm: ● The process of evaluation is a process of summing out
1. Variable elimination works by evaluating expressions in variables (right to left) from point wise products of
right-to-left order (that is, bottom up). factors to produce new factors, eventually yielding a
2. Intermediate results are stored, and summations over factor that is the solution, i.e., the posterior distribution
each variable are done only for those portions of the over the query variable.
expression that depend on the variable. The steps are as follows:
3. Let us illustrate this process for the burglary network. ● First, we sum out A from the product of f3, f4, and f5.
We evaluate the expression ● This gives us a new 2 × 2 factor f6(B,E) whose indices
range over just B and E.

1. Each part of the expression is annoted with the name of


the corresponding factor each factor is a matrix indexed
by the values of its argument variables. ● This leaves the expression
2. For example, the factors f4 (A) and f5(A) corresponding
to P(j | a) and P(m| a) depend just on A because J and M
are fixed by the query. They are therefore two-element ● Next, we sum out E from the product of f2 and f6:
vectors:

(The “first” element is given by P (a | b, e) =0.95 and the “last” ● This leaves the expression
by P ( a | b, e) =0.999.)
1. In terms of factors, the query expression is written as which can be evaluated by taking the point wise product and
normalizing the result.

Two basic computational operations are required


where
the “×” operator is not ordinary matrix multiplication but instead 1. Point wise product of a pair of factors
the point wise product operation. 2. Summing out a variable from a product of factors
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 21
CS3491-Artificial Intelligence and Machine Learning

its values in turn. For example, to sum out A fromf3(A,B,C), we


Operations on factors: write
The pointwise product of two factors f1 and f2 yields a new
factor f whose variables are the union of the variables in f1 and
f2 and whose elements are given by the product of the
corresponding elements in the two factors. Suppose the two
factors have variables Y1, . . ., Ykin common.
Then we have For example, if we were to sum out E first in the burglary
network, the relevant part of the expression would be
Variable ordering and variable relevance
If all the variables are binary, then f1 and f2 have 2j+k and 2k+l
entries, respectively, and the pointwise product has 2j+k+l entries. ● Every choice of ordering yields a valid algorithm, but different
For example, given two factors f1(A,B) andf2(B,C), the orderings cause different intermediate factors to be generated
pointwise product f1 ×f2 =f3(A,B,C) has 21+1+1 =8 entries during the calculation.
● For example, in the calculation shown previously, we eliminated A
before E; if we do it the other way, the calculation becomes

during which a new factor f6(A,B) will be generated.


● In general, the time and space requirements of variable
elimination are dominated by the size of the largest
factor constructed during the operation of the algorithm.
● This in turn is determined by the order of elimination of
variables and by the structure of the network.
● It turns out to be intractable to determine the optimal
ordering, but several good heuristics are available.
● One fairly effective method is a greedy one: eliminate
Summing out a variable from a product of factors is done by whichever variable minimizes the size of the next factor
adding up the submatrices formed by fixing the variable to each of to be constructed.
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 22
CS3491-Artificial Intelligence and Machine Learning

Let us consider one more query: P(JohnCalls| Burglary =true).


As usual, the first step is to write out the nested summation:

Evaluating this expression from right to left, we notice something


interesting:

is equal to 1 by definition!
● Hence, there was no need to include it in the first place;
the variable M is irrelevant to this query.
● Another way of saying this is that the result of the
queryP(JohnCalls| Burglary =true) is unchanged if we 5. APPROXIMATE INFERENCE IN BN
remove MaryCalls from the network altogether.
● In general, we can remove any leaf node that is not a Why use approximate inference?
query variable or an evidence variable. 1. Exact inference becomes intractable for large multiply-connected
● After its removal, there may be some more leaf nodes, networks
and these too may be irrelevant. 2. Variable elimination can have exponential time and space complexity
● Continuing this process, we eventually find that every 3. Exact inference is strictly HARDER than NP-complete problems (NP-
variable that is not an ancestor of a query variable or hard)
evidence variable is irrelevant to the query.
Randomized sampling
● A variable elimination algorithm can therefore remove
all these variables before evaluating the query. 1. Monte Carlo algorithms provide approximate answers whose
accuracy depends on the number of samples generated.
2. In recent years, Monte Carlo algorithms have become widely used in
computer science to estimate quantities that are difficult to calculate
exactly.
3. Hence Sampling applied to the computation of posterior probabilities.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 23
CS3491-Artificial Intelligence and Machine Learning

Two Types of algorithms: 7. The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable's parents.
1. Direct Sampling
2. Markov Chain Sampling.
Direct Sampling- Rejection Sampling
1. Used to compute conditional probabilities P(X|e)
2. Generate samples as before Reject samples that do not match
evidence
3. Estimate by counting the how often event X is in the resulting
samples.
Direct Sampling - are Likelihood Weighting

1. Avoid inefficiency of rejection sampling


2. Fix values for evidence variables and only sample the remaining
variables
3. Weight samples with regard to how likely they are .

5.1. Direct sampling methods


1. The primitive element in any sampling algorithm is the generation of
samples from a known probability distribution.
2. For example, an unbiased coin can be thought of as a random variable
Coin with values (heads, tails) and a prior distribution
P(Coin) = (0.5,0.5).
8. We can illustrate its operation on the network in Figure 14.1 l(a), assuming
3. Sampling from this distribution is exactly like flipping the coin: with probability
0.5 it will return heads, and with probability 0.5 it will return tails. an ordering [Cloudy, Sprinkler, Rain, WetGrass] :
4. Given a source of random numbers in the range 1. Sample from P(Cloudy) = [0.5, 0.5], value is true.
5. The simplest kind of random sampling process for Bayesian networks 2. Sample from P(Sprinkler | Cloudy = true) = [0.1, 0.9], value is false.
generates events from a network that has no evidence associated with 3. Sample from P(Rain | Cloudy = true) = [0.8, 0.2], value is true.
it. 4. Sample from P(WetGrass | Sprinkler = false, Rain = true) = [0.9, 0.1], value
6. The idea is to sample each variable in turn, in topological order. is true.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 24
CS3491-Artificial Intelligence and Machine Learning

In this case, PRIOR-SAMPLE returns the event [true, false, true, true]
9. It is easy to see that PRIOR-SAMPLE generates samples from the prior
joint distribution specified by the network.
First, let Sps(xl, . . . , x,) be the probability that a specific event is generated
by the PRIOR-SAMPLE algorithm.

Steps:
1. First, it generates samples from the prior distribution specified
by the network.
2. Then, it rejects all samples that do not match the evidence
3. Finally, estimates by counting the how often event X is in the
resulting samples.

5.1.1 Rejection sampling in Bayesian networks

Rejection sampling is a general method for producing samples from a hard-to-


sample distribution given an easy-to-sample distribution.

In its simplest form, it can be used to compute conditional probabilities-


that is, to determine P(X|e)

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 25
CS3491-Artificial Intelligence and Machine Learning

• LIKELIHOOD fixes the values for the evidence variables E and samples
only the remaining variables X and Y.
• This guarantees that each event generated is consistent with the
evidence. Not all events are equal, however.
• Before tallying the counts in the distribution for the query variable,
each event is weighted by the likelihood that the event accords to the
evidence, as measured by the product of the conditional probabilities for each
evidence variable, given its parents.
5.1.2 Likelihood weighting • Intuitively, events in which the actual evidence appears unlikely
should be given less weight.
Likelihood weighting avoids the inefficiency of rejection sampling by
Let us apply the algorithm to the network shown in Figure 14.11(a), with the
generating only events that are consistent with the evidence e.
query
P(Rain/Sprinkler = true, WetGrass = true).
The process goes as follows: First, the weight w is set to 1.0. Then an event is
generated:
1. Cloudy is an evidence variable with value true. Therefore, we set
w w × P(Cloudy = true)=0.5 .
2. Sprinkler is not an evidence variable, so sample from P(Sprinkler |
Cloudy = true) =[0.1, 0.9]; suppose this returns false.
3. Similarly, sample from P(Rain | Cloudy = true) = 0.8, 0.2; suppose this
returns true.
4. WetGrass is an evidence variable with value true. Therefore, we set w
w × P(WetGrass = true | Sprinkler = false, Rain = true)=0.45 .
Here WEIGHTED-SAMPLE returns the event [true, false,true,true] with weight
0.45, and this is tallied under Rain = true.

5.2 Inference by Markov chain simulation(Markov Chain Sampling )


The MCMC algorithm

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 26
CS3491-Artificial Intelligence and Machine Learning

1. MCMC therefore wanders randomly around the state space-the space [Link] Bayesian Networks:
of possible complete assignments-flipping one variable at a time, but keeping
the evidence variables fixed. A directed network which illustrates casual dependencies among all the
2. Consider the query P(Rain1 Sprinkler = true, Wet Grass = true) components of the network.
applied to the network in Figure 14.1 l(a). A causal network for a domain of chance variables U is a directed acyclic
3. The evidence variables Sprinkler and WetGrass are fixed to their graph where Nodes correspond to chance variables in U and Non root
observed values and the hidden variables Cloudy and Rain are initialized node is a direct causal effect of its parents.
randomly-let us say to true and false respectively.
4. Thus, the initial state is [true, true, false, true]. An example of a causal network is shown in Figure :
5. Now the following steps are executed repeatedly:
Each state visited during this process is a sample that contributes to the
estimate for the query variable Rain. If the process visits 20 states where Rain
is true and 60 states where Rain is false, then the answer to the query is
NORMALIZE((^^, 60)) = (0.25,0.75). The complete algorithm is shown in
Figure 14.15.

• The diagram indicates that whether or not a car starts is caused by the
condition of its battery and fuel supply, that whether or not a car
moves is caused by whether or not it starts, and that (in this model)
the condition of the battery and the fuel supply have no causes.
In this example, we assume that all variables are binary.

A Causal Network:
• A Casual relationship exists when one variable in a data set has direct
influence on another variable. Thus one event triggers occurrence of
another event.
• Causal relationship also called as cause and effect.
Example:
Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 27
CS3491-Artificial Intelligence and Machine Learning

• Consider a hypothetical college admission example in which


applicants are admitted based on qualifications Q, choice of
department D, and gender G; and in which female applicants apply
more often to certain departments (for simplicity’s sake, we consider
gender as binary).

In above Figure 1, the path G D A is causal, whilst the path


G D A Q is non causal.
Gender has a direct influence on admission through the causal path
G A and an indirect influence through the causal path G D A.
• The direct influence captures the fact that individuals with the same
qualifications who are applying to the same department might be
treated differently based on their gender.
• The indirect influence captures differing admission rates between
female and male applicants due to their differing department choices.

Department Of CSE Anjalai Ammal- Mahalingam Engineering College -614 403 Page No 28

You might also like