0% found this document useful (0 votes)
7 views83 pages

Probabilistic Reasoning and Decision Theory

The document discusses probabilistic reasoning, focusing on how agents operate under uncertainty using Bayesian inference and decision theory. It explains the importance of probability in representing beliefs and making rational decisions, highlighting concepts such as prior and conditional probabilities, utility theory, and decision-making algorithms. Key elements include the definitions of random variables, atomic events, and the axioms of probability that govern their relationships.

Uploaded by

pattabirani
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
7 views83 pages

Probabilistic Reasoning and Decision Theory

The document discusses probabilistic reasoning, focusing on how agents operate under uncertainty using Bayesian inference and decision theory. It explains the importance of probability in representing beliefs and making rational decisions, highlighting concepts such as prior and conditional probabilities, utility theory, and decision-making algorithms. Key elements include the definitions of random variables, atomic events, and the axioms of probability that govern their relationships.

Uploaded by

pattabirani
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

UNIT II

Chapter: 7: Probabilistic Reasoning


Syllabus
Acting under uncertainty - Bayesian inference - naïve bayes models.
Probabilistic reasoning - Bayesian networks - exact inference in BN -
approximate inference in BN - causal networks

Introduction of Probabilistic Reasoning


AU Dec.-13, May-14,15,18
A agent working in real world environment almost never has access to
whole truth about its environment. Therefore, agent needs to work under
uncertainity.
Earlier agents we have seen make the epistemological commitment that
either the facts (expressed as prepositions) are true, false or else they are
unknown. When an agent knows enough facts about its environment, the
logical approach enables it to derive plans, which are guaranteed to work.
But when agent works with uncertain knowledge then it might be
impossible to construct a complete and correct description of how its
actions will work. If a logical agent can not conclude that any perticular
course of action achieves its goal, then it will be unable to act.
The right thing logical agent can do is, take a rational decision. The
rational decision depends on following things:
• The relative importance of various goals.
• The likelihood and the degree to which, goals will be achieved.

Acting Under Uncertainty


An agent would possess some early basic knowledge of the world
(Assume that knowledge is represented in first order logic sentence).
Using first order logic to handle real word problem domains fails for
three main reasons as discussed below:
1) Laziness:
It is too much work to list the complete set of antecedents or consequents
needed to ensure an exceptionless rule and too hard to use such rules.
2) Theoretical ignorance:
A perticular problem may not have complete theory for the domain.
3) Practical ignorance:
Even if all the rules are known, perticular aspects of problem are not
checked yet or some details are not considered at all (missing out the
details).
The agent's knowledge can provide it with a degree of belief with
relevent sentences. To this degree of belief probability theory is applied.
Probability assigns a numerical degree of belief between 0 and 1 to each
sentence.
Probability provides a way of summarizing the uncertainity that comes
from our laziness and ignorance.
Assigning probability of 0 to a given sentence corresponds to an
unequivocal belief saying that sentence is false. Assigning probability of
1 corresponds to an unequivocal belief saying that the sentence is true.
Probabilities between 0 and 1 correspond to intermediate degree of belief
in the truth of the sentence.
The beliefs completely depends on percepts of agent at perticular time.
These percepts constitute the evidence on which probability assertions are
based. Assignment of probability to a proposition is analogous to saying
that whether the given logical sentence (or its negation) is entailed by the
knowledge base rather than whether it is true or not. When more
sentences are added to knowledge base the entailment keeps on changing.
Similarly the probability would also keep on changing with additional
knowledge.
All probability statements must therefore, indicate the evidence with
respect to which the probability is being assessed. As the agent receives
new percepts, its probability assessments are updated to reflect the new
evidence. Before the evidence is obtained, we talk about prior or
unconditional probability; after the evidence is obtained, we talk about
posterior or conditional probability. In most cases, an agent will have
some evidence from its percepts and will be interested in computing the
posterior probabilities of the outcomes it cares about.
Uncertainty and rational decisions:
The presence of uncertainty drastically changes the way an agent makes
decision. At perticular time an agent can have various available decisions,
from which it has to make a choice. To make such choices an agent must
have a preferences between the different possible outcomes, of the
various plans.
A perticular outcome is completely specified state, along with the
expected factors related with the outcome.
For example: Consider a car driving agent who wants to reach at airport
by a specific time say at 7.30 pm.
Here factors like, whether agent arrived at airport on time, what is the
length of waiting duration at the airport are attached with the outcome.

Utility Theory
Utility theory is used to represent and reason with preferences. The term
utility in current context is used as "quality of being useful".
Utility theory says that every state has a degree of usefulness called as
utility. The agent will prefer the states with higher utility.
The utility of the state is relative to the agent for which utility function is
calculated on the basis of agent's preferences.
For example: The pay off functions for games are utility functions. The
utility of a state in which black has won a game of chess is obviously
high for the agent playing black and low for the agent playing white.
There is no measure that can count test or preferences. Someone loves
deep chocolate icecream and someone loves chocochip icecream. A
utility function can account for altruistic behavior, simply by including
the welfare of other as one of the factors contributing to the agent's own
utility.
Decision theory
Preferences as expressed by utilities are combined with probabilities for
making rational decisions. This theory, of rational decision making is
called as decision theory.
Decision theory can be summarized as,
Decision theory = Probability theory + Utility theory.
• The principle of Maximum Expected Utility (MEU):
Decision theory says that the agent is rational if and only if it chooses the
action that yields highest expected utility,averaged over all the possible
outcomes of the action.
• Design for a decision theoretic agent:
Following algorithm sketches the structure of an agent that uses decision
theory to select actions.
The algorithm
Function: DT-AGENT (percept) returns an action.
Static: belief-state, probabilistic beliefs about the current state of the
world. action, the agent's action.
- Update belief-state based on action and percept
- Calculate outcome probabilities for actions, given actions
descriptions and current belief-state
- Select action with highest expected utility given probabilities of
outcomes and utility information
- Return action.
A decision therotic agent that selects rational actions.
The decision theoretic agent is identical, at an abstract level, to the logical
agent. The primary difference is that the decision theoretic agent's
knowledge of the current state is uncertain; the agent's belief state is a
representation of the probabilities of all possible actual states of the
world.
As time passes, the agent accumulates more evidence and its belief state
changes. Given the belief state, the agent can make probabilistic
predictions of action outcomes and hence select the action with highest
expected utility.

The Basic Probability Notation


The probability theory uses propositional logic language with additional
expressivness.
The probability theory uses represent prior probability statements, which
apply before any evidence is obtained. The probability theory uses
conditional probability statements which include the evidence explicitly.
1. Propositions
1) The propositions (assertions) are attached with the degree of belief.
2) Complex proposition can be formed using standard logical
connectives.
For example: [(Cavity = True) (Toothache = False)] and [(Cavity
Toothache)] both are same assertions.
• The random variable:
1) The basic element of language is random variable.
2) It reffers to a "part" of the world whose "status" is initially unknown.
For example: In toothache problem 'cavity' is a random variable which
can refer to my left wisdom tooth or right wisdom tooth.
3) Random variables are like symbols in propositional logic.
4) Random variables are represented using capital letters. Whereas
unknown random variable can be represented with lowercase letter.
For example: P (a) = 1 - P (-a)
5) Each random variable has a domain of values that it can take on. That
is domain is, set of allowable values for random variable.
For example: The domain of cavity can be < true, false >
6) A random variable's proposition will assert that what value is drawn
for the random variable from its domain.
For example: Cavity = True is proposition.
Saying that "there is" cavity in my lower left wisdom tooth".
7) Random variables are divided into three kinds, depending on their
domain. The types are as follows.
i) Boolean random variables: These are random variables that can take
up only boolean values.
For example: Cavity, it takes value either true or false.
ii) Discrete random variables: They take values from countable domain.
They also include boolean domain. The values in the domain must be
mutually exclusive and exhaustive (finite).
For example: Weather, it has domain < sunny, rainy, cloudy, cold >
iii) Continuous random variables: They take values from real numbers.
The domain can be either entire real line or subset of it like intervals [2,
3].
For example: X = 4.14 asserts that X has exact value 4.14
Propositions having continuous random variable can have inequalities
like X ≤ 4.14.
2. Atomic Events
1) An atomic event is a complete specification of the state of the world
about which agent is uncertain.
2) They are represented as variables. These variables are assigned values
from the real world.
For example: If the world is consists of cavity and Toothache then there
are four distinct atomic events,
a) Cavity = False Ʌ Toothache = True
b) Cavity = False Ʌ Toothache = False
c) Cavity = True Ʌ Toothache = False
d) Cavity = True Ʌ Toothache = True
e) Properties of atomic events
i) They are mutually exclusive - That is at most one can actually be the
case. For example: (Cavity Ʌ Toothache) and (Cavity Ʌ ⌐ Toothache)
can not both be the case.
ii) The set of all possible atomic events is exhaustive out of which at least
one must be the case. That is, the disjunction of all atomic events is
logically equivalent to true.
iii) Any particular atomic event entails the truth or falsehood of every
proposition, whether simple or complex.
For example -
The atomic event
(Cavity Ʌ⌐ Toothache) entails the truth of cavity and the Falsehood of
(cavity → toothache).
iv) Any proposition is logically equivalent to the disjunction of all atomic
events that entail the truth of the proposition.
For example: The proposition cavity is equivalent to disjunction of the
atomic events (cavityɅtoothache) and (cavity Ʌ ⌐ toothache).
3. Prior Probability (Unconditional Probability)
1) The prior (unconditional) probability is associated with a proposition
'a'.
2) It is the degree of belief accorded to a proposition in the absence of
information.
3) It is written as P(a).
For example: The probability that, Ram has cavity = 0.1, then prior
probability is written as, P (Cavity = true) = 0.1 or P (Cavity) = 0.1
4) It should be noted that as soon as new information is received, one
should reason with the conditional probability of 'a' depending upon new
information. 5) When it is required to express probabilities of all the
possible values of a random variable, then a vector of values is used. It is
represented using P(a). This represents values for the probabilities of each
individual state of the 'a'.
For example: P (Weather) = < 0.7, 0.2, 0.08, 0.02 > is representing four
equations
P (Weather = Sunny) = 0.7
P (Weather = Rain) = 0.2
P (Weather = Cloudy) = 0.08
P (Weather = Cold) = 0.02
6) The expression P(a) is said to be defining prior probability distribution
for the random variable 'a'.
7) To denote probabilities of all random variables combinations, the
expression P(a1, a2) can be used. This is called as joint probability
distribution for random variables a 1, a2. Any number of random variables
can be mentioned in the expression.
8) A simple example of joint probability distribution is,
→ P<Weather, Cavity> can be represented as, 4×2 table of probabilities.
(Weather's probability) (Cavity probability)
9) A joint probability distribution that covers the complete set of random
variables is called as full joint probability distribution.
10) A simple example of full joint probability distribution is,
If problem world consists of 3 random variables, wheather, cavity,
toothache then full joint probability distribution would be,
P<Weather, Cavity, Toothache>
It will be represented as, 4×2×2, table of probabilities.
11) Prior probability for continuous random variable:
i) For continuous random variable it is not feasible to represent vector of
all possible values because the values are infinite. For continuous random
variable the probability is defined as a function with parameter x, which
indicates that random variable takes some value x.
For example: Let random variable x denotes the tomorrow's temperature
in Chennai. It would be represented as,
P(X = x) = U [25 – 37] (x).
This sentence express the belief that X is distributed uniformly between
25 and 37 degrees celcius.
ii) The probability distribution for continuous random variable has
probability density function.
4. Conditional Probability
1) When agent obtains evidence concerning previously unknown random
variables in the domain, then prior probability are not used. Based on new
information conditional or posterior probabilities are calculated.
2) The notation is P(a | b) where a and b are any proposition.
The 'P' is read as "the probability of a given that all we know is b". That
is when b is known it indicates probability of a.
For example: P (Cavity | Toothache) = 0.8
it means that, if patient has toothache (and no other information is
known) then the chances of having cavity are = 0.8
3) Prior probability are infact special case of conditional probability. It
can be represented as P(a) which means that probability 'a' is conditioned
on no evidence.
4) Conditional probability can be defined interms of unconditional
probabilities. The equation would look like,
P(a/b) =P(aɅb) / P(b), it holds whenever P(b) > 0 …(7.1.1)
The above equation can also be written as,
P(a Ʌ b) = P(a | b) P(b)
This is called as product rule. In other words it says, for 'a' and 'b' to be
true we need 'b' to be true and we need a to be true given b. It can be also
written as,
P(a Ʌ b)= P(b | a) P(a).
5) Conditional probability are used for probabilistic inferencing.
6) P notation can be used for conditional distribution. P(x | y) gives the
values of P(X = xi | Y = yj) for each possible i, j. following are the
individual equations,
P(X = x1 ^ Y = y1) = P(X = x1 |Y = y1) P(Y = y1)
P(X = x1 ^ Y = y2) = P(X = x1 |Y = y2) P(Y = y2)
These can be combined into a single equation as,
P(X, Y) = P(X | Y) P(Y)
7) Conditional probabilities should not be treated as logical implications.
That is, "When 'b' holds then probability of 'a' is something", is a
conditional probability o and not to be mistake as logical implication. It is
wrong on two points, one is, P(a) always denotes prior probability. For
this it do not require any evidence. P(a | b) = 0.7, is immediately relevant
when b is available evidence. This will keep on altering. When
information is updated logical Implications do not change over time.
5. The Probability Axioms
Axioms gives semantic of probability statements. The basic axioms
(Kolmogorov's axioms) serve to define the probability scale and its end
points.
1) All probabilities are between 0 and 1. For any proposition a, 0 ≤ P(a)
≤1.
2) Necessarily true (i.e, valid) propositions have probability 1, and
necessarily false (i.e., unsatisfiable) propositions have probability 0.
P(true) = 1 P(false) = 0
3) The probability of a disjunction is given by
P (a v b) = P(a) + P(b) - P(a Ʌ b)
This axiom connects the probabilities of logically related propositions.
This rule states that, the cases where 'a' holds, together with the cases
where 'b' holds, certainly cover all the causes where 'a v b' holds; but
summing the two sets of cases counts their intersection twice, so we need
to subtract P(a Ʌ b).
Note The axioms deal only with prior probabilities rather than conditional
probabilities; this is because prior probability can be defined in terms of
conditional probability.
Using the axioms of probability:
From basic probability axioms following facts can be deduced.
P(a v¬ a) = P(a) + P(¬a) - P(a Ʌ¬a) (by axiom 3 with b = ¬ a)
P(true) = P(a) + P(¬ a) - P(false) (by logical equivalence)
1 = P(a) + P(¬ a) (by axiom 2)
P(¬ a) = 1 - P(a) (by algebra).
• Let the discrete variable D have the domain <d1,....., dn>.
Then, Σni=1 P(D=di) = 1.
That is, any probability distribution on a single variable must sum to 1.
• It is also true that any joint probability distribution on any set of
variables must sum to 1. This can be seen simply by creating a single
megavariable whose domain is the cross product of the domains of the
original variables.
• Atomic events are mutually exclusive, so the probability of any
conjunction of atomic events is zero, by axiom 2.
• From axiom 3, we can derive the following simple relationship: The
probability of a proposition is equal to the sum of the probabilities of
atomic events in which it holds; that is P(a) = Σei ε e(a) P(ei) …..(7.1.2)
6. Inference using Full Joint Distribution
Probabilistic inference means, computation from observed evidence of
posterior probabilities, for query propositions. The knowledge base used
for answering the query is represented as full joint distribution. Consider
simple example, consists of three boolean variables, toothache, cavity,
catch. The full joint distribution is 2×2×2, as shown below.

Note that the probability in the joint distribution sum to 9.


• One particular common task in inferencing is to extract the distribution
over some subset of variables or a single variable. This distribution over
some variables or single variables is called as marginal probability.
For example: P(Cavity) = 0.108 +0.012+ 0.072 + 0.008. = 0.2
This process is called as marginalization or summing. Because the
variables other than 'cavity' (that is whose probability is being counted)
are summed out.
• The general marginalization rule is as follows,
For any sets of variables Y and Z,
P(Y) = Σ z P(Y, z). ….(7.1.3)
It indicates that distribution Y can be obtained by summing out all the
other paulay to variables from any joint distribution X containing Y.
• Variant of above example of general marginalization rule involved the
conditional to to probabilities using product rule.
P(Y) = Σ z P(Y | z) P(z) ….(7.1.4)
This rule is conditioning rule.
For example: Computing probability of a cavity, given evidence of a
toothacheis as follows:
P(Cavity | Toothache) =P(Cavity Ʌ Toothache)
/P(Toothache) = 0.108+0.012 / 0.108+0.012+0.016+0.064 = 0.6
• Normalization constant: It is variable that remains constant for the
distribution, which ensures that it adds in to 1. a is used to denote such
constant.
For example: We can compute the probability of a cavity, given evidence
of a toothache, as follows:
P(Cavity | Toothache) = P(Cavity Ʌ
Toothache)/ P(Toothache) = 0.108+0.012/
0.108+0.012+0.016+0.064 = 0.6
Just to check we can also compute the probability that there is no cavity
given a toothache:
P( Cavity / Toothache)= P (¬Cavity Ʌ Tootache) /P (Toothache) = 0.016+
0.064 / 0.108+0.012+ 0.016+ 0.064 = 0.4
Notice that in these two calculations the term 1/P (toothache) remains
constant, no matter which value of cavity we calculate. With this notation
we can write above two equations in one.
P(Cavity | Toothache) = αP(Cavity, Toothache)
= α [P(Cavity, Toothache, Catch) + P(Cavity, Toothache, ¬ Catch)]
= α [< 0.108, 0.016> + <0.012, 0.064>]
= α <0.12, 0.08> = <0.6, 0.4>
From above one can extract a general inference procedure.
Consider the case in which query involves a single variable. The notation
used is let X be the query variable (cavity in the example), let E be the set
of evidence variables (just toothache in the example) let e be the observed
values for them, and let Y be the remaining unobserved variable (just
catch in the example). The query is P(X/e) and can be evaluated as
P(X | e) = α P(X, e) = α Σ y P(X, e, y) ……(7.1.5)
where the summation is over all possible ys (i.e. all possible
combinations of values of the unobserved variables 'y'). Notice that
together the variables, X, E and Y constitute the complete set of variables
for the domain, so P (X, e, y) is simply a subset of probabilities from the
full joint distribution.
7. Independance
It is a relationship between two different sets of full joint distributions. It
is also called as marginal or absolute independance of the variables.
Independence indicates that whether the two full joint distributions
affects probability of each other.
The independance between variables X and Y can be written as follows,
P(X | Y) = P(X) or P(Y | X) = P(Y) or P(X, Y) = P(X) P(Y)
• For example: The weather is independant of once dental problem.
Which can be shown as below equation.
P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity)
P(Weather).
• Following diagram shows factoring a large joint distributing into
smaller distributions, using absolute independence. Weather and dental
problems are independent.
8. Bayes' Rule
Bayes' rule is derived from the product rule.
The product rule can be written as,
P(a Ʌ b)=P(a | b) P(b) .…(7.1.6)
P(a Ʌ b)=P(b | a) P(a) .... (7.1.7)
[because conjunction is commutative]
Equating right sides of equation (7.1.6) and equation (7.1.7) and dividing
by P(a),
P(b | a) = P(a | b)P(b) / P(a)
This equation is called as Bayes' rule or Bayes' theorem or Bayes' law.
This rule is very useful in probabilistic inferences.
Generalized Bayes' rule is,
P(Y | X) = P(X | Y) P(Y) / P(X)
(where P has same meanings)
We can have more general version, conditionalized on some background
evidence e.
P(Y | X, e) = P(X |Y, e) P(Y | e) / P(X | e)
General form of Bays' rule with normalization's
P(y | x) = α P(x | y) P(y).
Applying Bays' Rule:
1) It requires total three terms (1 conditional probability and 2
unconditional Probabilities). For computing one conditional probability.
For example: Probability of patient having low sugar has high blood
pressure is
50 %.
Let, M be proposition, 'patient has low sugar'.
S be a proposition, 'patient has high blood pressure'.
Suppose we assume that, doctor knows following unconditional fact,
i) Prior probabilition of (m) = 1/50,000.
ii) Prior probability of (s) = 1/20.
Then we have,
P(s | m) = 0.5
P(m) = 1 | 50000
P(s) = 1 | 20
P(m/s) = P(s | m) P(m) / P(s)
= 0.5×1/50000 /1/20
= 0.0002
That is, we can expect that 1 in 5000 with high B.P. will has low sugar.
2) Combining evidence in Bayes' rule.
Bayes rule is helpful for answering queries conditioned on evidences.
For example: Toothache and catch both evidences are available then
cavity is sure to exist. Which can be represented as,
P(Cavity | Toothache ^ Catch) = α <0.108, 0.016> ≈ <0.871, 0.129>
By using Bayes' rule to reformulate the problem:
P(Cavity | Toothache ^ Catch) = α P(Toothache A Catch | Cavity)
P(Cavity) ……(7.1.8)
For this reformulation to work, we need to know the conditional
probabilities of the conjunction Toothache Catch for each value of
Cavity. That might be feasible for just two evidence variables, but again it
will not scale up.
If there are n possible evidence variable (X-rays, diet, oral hygiene, etc.),
then there are 2n possible combinations of observed values for which we
would need to know conditional probabilities.
The notion of independence can be used here. These variables are
independent, however, given the presence or the absence of a cavity.
Each is directly caused by the cavity, but neither has a direct effect on the
other. Toothache depends on the state of the nerves in the tooth, where as
the probe's accuracy depends on the dentist's skill, to which the toothache
is irrelevant.
Mathematically, this property is written as,
P(Toothache ^ Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity) .
…(7.1.9)
This equation expresses the conditional independence of toothache and
catch, given cavity.
Substitute equation (7.1.3) into (7.1.4) to obtain the probability of a
cavity:
P (Cavity | Toothache ^Catch) = α P (Toothache | Cavity) P (Catch |
Cavity) P (Cavity)
Now, the information requirement are the same as for inference using
each piece of evidence separately the prior probability P(Cavity) for the
query variable and the conditional probability of each effect, given its
cause.
Conditional independence assertions can allow probabilistic systems to
scale up; more over, they are much more commonly available than
absolute independence assertions. When their are 'n' variables given that
they are all conditionally independent, the size of the representation
grows as O(n) instead of O(2n).
For example -
Consider dentistry example, in which a single cause, directly influences a
number of effects, all of which are conditionally independent, given the
cause.
The full joint distribution can be written as,
P(Cause, Effect1,..., Effectn) = P(Cause) II P(Effecti | Cause).
Such a probability distribution is called as naive Bayes' model - "naive"
because it is often used (as a simplifying assumption) in cases where the
"effect" variables are not conditionally independent given the cause
variable. The naive Bayes model is sometimes called as Bayesian
classifier.
Bayesian Learning and Inferencing
AU May-14, Dec.-14
1) It calculates the probability of each hypotheses, given the data and
makes the predictions on that basis.
2) Predictions are made by using all the hypotheses, weighted by their
probabilities, bay rather than a single "best" hypothesis. This way
learning is reduced to probabilistic inference.
3) Let D represent all the data with observed value d, then the probability
of each hypothesis obtained by Bayes' Rule -
P(hi | d) = α P(d | hi) P(hj) …….(7.2.1)
If we want to make predicition about an unknown quantity X, then we
have, P(X | P(X | d) = Σi P(X | d, hi) P(hi | d)
= Σi P(X | hi) P (hi | d) ... (7.2.2)
Where it is assumed that each hypothesis determines a probability
distribution over X.
4) Equation (7.2.2) shows that predictions are weighted averages over the
predictions of the individual hypotheses.
5) The important quantities in Bayesian learning are the hypothesis prior -
P(hi) and or the likelihood of the data under each hypothesis - P(d | hi).
6) The basic characteristic of Baysian learning is that "True hypothesis
dominates the Bayesian prediction".
7) For any fixed prior that does not rule out the true hypothesis, the
posterior probability of any false hypothesis will eventually vanish,
simply because the probability of generating "uncharacteristic" data
indefinitely is vanishingly small.
8) The Bayesian prediction is optimal whether the data set be small or
large.
9) For real learning problems, the hypothesis space is usually very large
or infinite.
10) Approximation in Bayesian learning: -
i) A prediction can be made on the basis of single most probable
hypothesis, hi, that maximizes P (hi | d). This is called as maximum a
posteriori or MAP hypothesis.
ii) Predictions made according to an MAP hypothesis h map are
approximately Bayesian to the extent that P(X | d) ≈ P(X |hmap).
iii) Finding MAP hypothesis is often much easier than Bayesian learning,
because it requires solving an optimization problem instead of a large
summation.
iv) Overfitting Trade-offs:
a) Overfitting can occur when the hypothesis space is too expressive, so
that it contains many hypotheses that fit the data set well.
b) Rather than placing an arbitrary limit on the hypotheses to be
considered, Bayesian and MAP learning methods use the prior to penalize
complexity.
c) Typically, more complex hypothesis have a lower prior probability-in
part because there are usually many more complex hypotheses than
simple hypotheses.
d) On the other hand, more complex hypotheses have a greater capacity
to fit the data.
v) Hence, the hypothesis prior embodies a trade-off between the
complexity of a (SS) hypothesis and its degree of fit to the data.
vi) If H contains only deterministic hypothesis, then in that case, P(d | h i)
is 1 if hi is consistent and 0 otherwise. Looking at equation (7.2.1) we see
that hMAP will then be the simplest logical theory that is consistent with
the data. Therefore, maximum a posteriori learning provides a natural
embodiment of Ockham's razor.
vii) a) Another trade-off between complexity and degree of fit is obtained
by taking the logarithm of equation (7.2.1).
b) Choosing hMAP to maximize P (d | hi) P(hi) is equivalent to minimizing
-log2 P (d | hi) – log2 P (hi)
c) Using the connection between information encoding and probability
we see that the -log2 P(hi) term equals the number of bits required to
specify the hypothesis hi.
d) log2 P(d | hi) is the additional number of bits required to specify the
data given the hypothesis.
e) To see this, consider that no bits are required if the hypothesis predicts
the data exactly as with h5 and the string of lime candies and log2 1 = 0.
f) MAP learning is choosing the hypothesis that provides maximum
compression of the data.
g) The same task is addressed more directly by the minimum description
length or MDL, learning method, which attempts to minimize the size of
hypothesis and data encodings rather than work with probabilities.
viii) a) Another approximation is provided by assuming a uniform prior
over the space of hypotheses. In that case, MAP learning reduces to
choosing an h; that maximizes P(d | H i). This is called a maximum-
likelihood (ML) hypothesis, hML.
b) Maximum-likelihood learning is very common in statistics, a
discipline in which many researchers destruct the subjective nature of
hypothesis priors. It is reasonable approach when there is no reason to
prefer one hypothesis over another a priori. For example, when all
hypotheses are equally complex.
c) It provides a good approximation to Bayesian and MAP learning when
the data set is large, because the data swamps the prior distribution over
stems hypotheses, but it has problems (all we shall see) with small data
sets.

Learning with Complete Data


1. Maximum-Likelihood Parameter Learning: (Discrete
Models)
Statistical learning methods have important task which is parameter
learning with complete data. Parameter learning task involves finding the
numerical parameters for a probability model whose structure is fixed.
Data is said to be complete when each data points contains values for
every variable in the mode.
Consider candy-bag example: -
New manufacturer: Then the lime/Cherry proportions is completely
unknown.
Parameter: 0 € [0, 1] (proportion of cherry)
Hypothesis: hӨ
Assumption: All proportions equally likely a priori.
BN's variables: Flavour € {Cherry, Lime}
N unwrapped candies: C cherries and I = N – C limes.
Likelihood of this particular data set:
P(d | hӨ)=IINj=1 P(di | |hӨ) = Өc (1 - Ө)i
• Finding the maximum-likelihood hypothesis hML is then equivalent to
maximising the log-likelihood:
L(d | hӨ) = log P(d | hӨ)
= ΣNj=1 log P(dj | hӨ) = clogӨ + l log (1 - Ө)
To that end : 1) differentiate L with respect to 0 and
2) set the resulting expression to 0.
dL(d|hӨ)/dӨ = c/Ө = l /1- Ө = 0 → Ө = c / c+l = c/N
• Previous result is obvious, but the process is important: -
1) Write down the expression for the likelihood of the data as a function
of the parameter(s).
2) Write down the derivative of the log-likelihood with respect to each
parameter.
3) Find the parameter values such that the derivatives are zero.
The last step can be tricky, using iterative solution algorithms or
numerical optimisation.
Problem with ML learning:
If some events have never been observed, hML assigns them 0 probability.

New pb(b): Wrappers red or blue. Colors selected probabilistically (see


new BN)
P (Flavor = Cherry, Wrapper = green | h0,01,02)
= P (Flavor = cherry | hӨ, Ө1, Ө2) P (Wrapper = green | Flavor = Cherry,
hӨ, Ө1, Ө2)
= Ө. (1- Ө1)
Experiment:
N = c + l = (rc + gc) + (rl + gl)
Likelihood:
P(d | h Ө, Ө1, Ө2) = Өc (1 - Ө)l. Ө1rc (1 – Ө1)gc . Ө2r1 (1 – Ө2)g1
log-likelihood:
L = [clog Ө + llog (1- Ө)] + [rc log Ө 1+ gc log (1- Ө 1)]+ [rl log Ө2 +
gl log (1- Ө 2)]
Derivatives:
dL/dӨ = c/Ө – l/1- Ө = 0 => Ө = c/c+l
dL/dӨ1 = rc /Ө1 – gc /1- Ө1 = 0 => Ө = rc /rc + gc
dL/dӨ2 = rl /Ө2 – gl /1- Ө2 = 0 => Ө = rl /rl + gl
With complete data, ML parameter learning for a BN decomposes into
separate learning problems (one per parameter).
2. Naive Bayes Models
1) This is the most common Bayesian network model used in machine
learning.
2) In this model, the "class" variable C (which is to be predicted) is the
root and the "attribute" variables Xi are the leaves.
3) The model is "naive" because it assumes that the attributes are
conditionally independent of each other, given the class.
4) Assuming Boolean variables the parameters are,
Ꮎ = P(C = true), Ꮎi1 =P(Xi = true | C = true),
Ꮎi2 = P(Xi= true | C = false)
The maximum likelihood parameter values are found in exactly the same
way as shown in Fig. 7.2.1 (b).
5) Once the model has been trained in this way, it can be used to classify
new examples for which the class variable C is unobserved. With
observed attribute values, x1, x2,.... xn, the probability of each class is
given by,
P(C x1, x2,… xn) = α P(C) IIi P(xi | C)
6) A deterministic prediction can be obtained by choosing the most likely
class.
7) The method learns fairly well but not as well as decision-tree learning;
this is presumably because the true hypothesis which is a decision tree is
not representable exactly using a naive Bayes model.
8) Naive Bayes learning do well in a wide range of applications; the
boosted version is one of the most effective general-purpose learning
algorithm.
9) Naive Bayes learning scales well to very large problems with n
Boolean attributes, there are just 2n+1 parameters, and no search is
required to find hML, the maximum likelihood naive Bayes hypothesis.
10) Naive Bayes learning has no difficulty with noisy data and can give
probabilistic predictions when appropriate.
3. Maximum-Likelihood Parameter Learning (Continuous
Models)
1) Continuous probability model such as the linear-Gaussian model is
used for maximum - likelihood parameter learning.
2) Because continuous variables are ubiquitous in real-world applications,
it is important to know how to learn continuous models from data.
3) The principles for maximum likelihood learning are identical to those
of the discrete case.
4) a) Let us begin with a very simple case: Learning the parametes of a
Gaussian density function on a single variable.
b) That is, the data are generated as follows: -

The parameters of this model are the mean u and the standard deviation σ.
(Notice that the normalizing "constant" depends on σ, so we cannot
ignore it.
c) Let the observed values be x1,..., xN. Then the log likelihood is,

d) Setting the derivatives to zero as usual, we obtain,


5) The maximum-likelihood value of the mean is the sample average and
the maximum-likelihood value of the standard deviation is the square root
of the sample variance. Again, these are comforting results that confirm
"commonsense" practice.
6) a) Now consider a linear Gaussian model with one continuous parent X
and acontinuous child Y.
b) Y has a Gaussian distribution whose mean depends linearly on the
value of X and whose standard deviation is fixed.
c) To learn the conditional distribution P(Y|X), we can maximize the
conditional likelihood.

d) Here, the parameters are Ө1, Ө2 and σ. The data are a collection of (x j,
yj)
pairs, as shown in Fig. 7.2.4.
e) Using the usual methods, we can find the maximum-likelihood values
of theparameters.
f) Here we want to make a different point. If we consider just the
parameters Ө1 and Ө2 that define the linear relationship between x and y,
it becomes clear that maximizing the log-likelihood with respect to these
parameters is the same as minimizing the numerator in the exponent of
above equation.
E =ΣNj=1 ( yj - (Ѳ1 xj +Ѳ2)2
g) The quantity (yj -(Ѳ1 xj +Ѳ2)) is the error for (xj, yj) that is, the
difference between the actual value y i and the predicted value (Ѳ1 xj +
Ѳ2). So E is the well-known sum of squared errors.
(h) This is the quantity that is minimized by the standard linear regression
procedure.
i) Minimizing the sum of squared errors gives the maximum - likelihood
straight line model, provided that the data are generated with Gaussian
noise of fixed variance.
4. Bayesian Parameter Learning
ML learning is simple, but not appropriate for small data sets.
Example -
If only cherries have been observed, hML →0=1.0
• Bayesian Approach:
- It uses hypothesis prior over possible values of the parameters.
- Update of this distribution is used as the data arrives.
• Candy example with Bayesian view :-
- Ө unknown value of a variable Ѳ.
- Hypothesis prior: P (Ѳ). (Continuous over [0, 1] and integrating to 1).

• Candidates:
beta distributions, defined by 2 hyperparameters a and b
such that:
beta [a, b] (Ѳ) = αѲa-1 (1-Ѳ)b-1
• Nice property of the beta family:
If Ѳ has prior beta [a, b] and a data point is observed, then the posterior
for is also a beta distribution.
Beta family:
A beta family is called as the conjugate prior for the family of
distributions for a
Boolean variable.
P(Ѳ | D1 Cherry)=α P(D1=Cherry | Ѳ) P(Ѳ)
= α՛ Ѳ. Beta[a,b] (Ѳ) = α՛ Ѳ. Ѳa-1 (1- Ѳ)b-1
= α՛ Ѳa (1- Ѳ)b-1 = beta [a + 1, b] (Ѳ)
Note: a and b are virtual counts (starting with beta [1, 1]).
With wrappers: 3 parameters. It need to specify P(Ѳ, Ѳ1, Ѳ2).
Assuming parameter independence: P (Ѳ, Ѳ1, Ѳ2) = P(Ѳ) P (Ѳ1) P (Ѳ2)
Fig. 7.2.4 A Bayesian network that corresponds to a Bayesian learning
process. Posterior distributions for the parameter variables 0, 1, 2 can be
inferred from their prior distributions and the evidence in the flavour,
wrapper; variables
• View of Bayesian learning as inference in a BN.
Variables: Unknown parameters + variables describing each instance.
P (Flavori= Cherry | Ѳ = Ө) = Ө
P (Wrapperi = red | Flavori = Cherry, Ѳ1 = Ө1) = Ө1
5. Learning Bayes Net Structures
Often the Bayes net structures are easy to get from expert knowledge. But
sometimes the causality relationships are debatable.
For example:
Smoking => Cancer?
too Much TV => Bad at school?
• To search for a good model:
- Start with a linkless model and add parents to each node, then learn
parameters and measure accuracy of the resulting model.
- Start with an initial guess of the structure, and use hill-climbing or
simulated annealing to make modifications (reversing, adding, or deleting
arcs).
Note
• Two ways for deciding when a good structure has been found: -
1) Test whether the conditional independence assertions implicit in the
structure, are satisfied in the data.
P(Fri Sat, Bar | WillWait)=P(Fri Sat WillWait) P(Bar | WillWait)
It requires an appropriate statistical test (with appropriate threshold).
2) Measure the degree to which the proposed model explains the data.
The ML hypothesis would give a fully connected network.
• Here we need to penalize complexity: -
a) MAP (MDL) approach: Substracts a penalty from the likelihood of
each structure.
b) Bayesian approach: places a joint prior over structures and parameters.
It too many structures are there then use sampling rather than exact
methods.

Probabilistic Reasoning and Bayesian Networks


The concept:
Probability based reasoning is the same as inferring directly from
knowledge that can be given a probability rating based on the amount of
uncertainty present. The uncertainties can arise from an inability to
predict outcomes due to unreliable, vague incomplete or inconsistent
knowledge. The probability of an uncertain event is a measure of the
degree of likelihood of occurence of that event.
When one takes it for granted that certain parameters exist or known with
certainty, solutions can be found out. Belief revision takes place
whenever one encounters a piece of information that decreases the belief
of the fact.
The source for all uncertainties is the real world. The following sources
constitute the major causes of uncertainties -
i) Most information is not obtained personally but got from sources on
which one has a lot of belief. Uncertainty prevails when the source does
not send information in time or the information provided is not
understood.
ii) In laboratory, experiments one takes it for granted that the equipment
is properly calibrated and error free. If the equipment is faulty, the picture
one gets about the situation is not correct which leads to uncertainty.
iii) Experimental errors like parallax errors are also sources of
uncertainty.
iv) Imprecision in natural language is also a source of uncertainty.
v) A random event occurring is a major source of uncertainty.
vi) In judicial courts one may have seen or heard delivering judgements
acquitting the accused for -
a) Lack of evidence
b) Lack of certainty in evidence thereby giving the benefit of doubt to the
accused.
To handle these uncertain data, probability is the oldest technique
available. Probabilistic reasoning is sometimes used when outcomes are
unpredictable. For example, when a physician examines a patient, the
patient's history, symptoms and test results provide some, but not
conclusive evidence of possible ailments. This knowledge, together with
the physician's experience with previous patients, improves the likelihood
of predicting the unknown event, but there is still much uncertainty in
most diagnoses.
One way to express confidence about such event is through probability. It
is a mere number that expresses the chance of something happening or
not happening.
Representing knowledge in uncertain domain:
Independence and conditional independence relationships among
variables can greatly reduce the number of probabilities that need to be
specified in order to define the full joint distribution. There is technique
called as Bayesian Network which can be used to represent the
dependencies among variables and to give a concise specification of any
full joint probability distribution.

Bayesian Network
1. Definition
It is a data structure which is a graph, in which each node is annotated
with quantitative probability information.
The nodes and edges in the graph are specified as follows:-
1) A set of random variables makes up the nodes of the network.
Variables may be discrete or continuous.
2) A set of directed links or arrows connects pairs of nodes. If there is an
arrow from node X to node Y, then X is said to be a parent of Y.
3) Each node Xi, has a conditional probability distribution P(X i |
Parents(Xi)) that quantifies the effect of the parents on the node.
4) The graph has no directed cycles (and hence is a directed, acyclic
graph, or DAG).
The set of nodes and links is called as topology of the network. The
topology of the network specifies the conditional independence
relationships that hold in the domain.
The intutive meaning of an arrow between two nodes X and Y, in a
properly constructed network, is that "X has direct influence on Y". The
domain expert is a person who can identify such influence associations.
Once Bayesian network topology is specified then conditional probability
distribution for each variable is specified. For specifying continuous
probability distribution for a variable, its parent information is required.
The combination of topology and the conditional distributions is
sufficient to specify the full joint distribution for all the variables.
Consider our example of simple world consisting of variables toothache,
cavity, catch and weather. Weather is surly independent of other
variables. Toothache and Catch are conditionally independent if given the
information of cavity.
This relationships are represented by Bayesian network as shown below:

Fig. 7.3.1 A simple Bayesian network in which Weather is independent of


other three variables and Toothache and Catch are conditionally
independent, given the cavity
Note
1) The conditional independence of toothache and catch given the cavity
is indicated by the absence of a link between toothache and catch.
2) The network represents the fact that cavity is a direct cause of
toothache and catch.
3) No direct causal relationship exists between toothache and catch.
Consider another example of burglar alarm installed at A's home. This
alarm notifies in case of burglary and can occasionally respond to minor
earthquakes. There are two neighbours to A, M and J. M and J promised
to call A in office when they hear alarm. J always calls A when he hears
alarm, but many times he is confused between telephone ring and burglar
alarm. M many a times misses alarm because he keeps on listening to
loud music. Given the evidence of who has or has not called, we are to
estimate the probability of burglary.
The Bayesian network for above problem along with the associated
probabilities is shown below:

Note In the figure, in the CPTS, the letters B, E, A, J and M stand for
Burglary, Earthquake, Alarm, Johncalls, Marycalls respectively.
1) In the figure, each distribution is shown as conditional probability
table. Each row in the table contains the conditional probability of each
node value for a 91 conditioning case. A conditioning case is just a
possible combination of values for the parent nodes. Each row must sum
upto 1 because the entries represent an exhaustive set of cases for the
variable.
2) For boolean variable if it is true and its probability is know to be 'P'
then when it is false its probability is 1-P and it is ommited from the table
(because it can be deduced from 'P').
3) A table for a boolean variable with k boolean parents contains 2k
independently specifiable probabilities. A node with no parents has only
one row, representing the prior probabilities of each possible value of the
variable.
2. The Semantics of Bayesian Networks
There are two ways through which semantic (meaning) of Bayesian
network can be understood.
One way is to view network as a representation of the joint probability
distribution. This view helps to constructs networks. Second way is to
view a network as an encoding of a collection of conditional
independence statements. This view helps in designing inference
procedure semantically both views are equivalent.
Understanding semantic of Bayesian network method:
• Representing the full joint distribution:
1) Every entry in the full joint probability distribution (hereafter
abbreviated as "joint") can be calculated from the information in the
network.
2) A generic entry in the joint distribution is the probability of a
conjunction of particular assignments to each variable, such as P(X 1 =
x1 ^.....^xn = xn)
follg3) The notation P(x1,...,xn) is used as an abbrevation for this.
4) The value of this entry is given by the formula.
P(x1,...,xn) = Пnj=1 P(xi | parents (Xi)) ……(7.3.1)
where parents (xi) denotes the specific values of the variables in parents
(xi).
5) Thus, each entry in the joint distribution is represented by the product
of the appropriate elements of the Conditional Probability Tables (CPTs)
in the Bayesian network. The CPTs therefore, provide a decomposed
representation of the joint distribution.
We can calculate the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred and both J and M call. We use
single letter names for the variables:
P(j Ʌ m Ʌ a Ʌ⌐ b Ʌ⌐ e) = P(j | a) P(m | a) P(a |⌐ b Ʌ⌐ e) P(⌐ b) P(⌐e)
= 0.90 × 0.70 × 0.001 × 0.999 x 0.998 = 0.00062
6) Remember that the full joint distribution can be used to answer any
query about the domain.
If a Bayesian network is a representation of the joint distribution, then it
too can be used to answer any query, by summing all the relevant joint
entries. A method for constructing Bayesian network:
1) We rewrite the joint distribution in terms of a conditional probability,
using the product rule.
P(x1,...,xn) = P(xn | Xn-1,...,X1) P(xn-1,...,x1).
2) Then the process is represented, reducing each conjunctive probability
to a conditional probability and a smaller conjunction. We end up with
one bigproduct.
P(x1,...,xn) = P(xn | xn-1,..., x1)P(xn-1 xn-2 ,...,x1).....P(x2 | x1) P(x1)
= IIni=1 P(xi | xj-1,….x1) …..(7.3.2)
3) Above identity holds true for any set of random variables and is called
the Chain rule. Comparing it with equation (7.3.1). We see that the
specification of the joint distribution is equivalent to the general assertion
that, for every variable Xi in the network,
P(Xi Xi-1,....X1) = P(Xi | Parents(Xj)),
provided that parents (Xi) (Xi-1,....X1}. This last condition is satisfied by
labeling the nodes in any order that is consistent with the partial order
implicit in the graph structure.
4) Bayesian network is correct representation of the domain only if each
node is conditionally independent of its predecessors in the node
ordering, given its parents.
5) In order to construct a Bayesian network with the correct structure for
the domain, we need to choose parents for each node such that this
property holds. Intuitively, the parents of node X; should contain all those
nodes in Xi,....Xi-1 that directly influence Xi.
For example: Suppose we have completed the network in Fig. 7.2.2
except for the choice of parents for M calls, M calls is certainly
influenced by whether there is a burglary or an earthquake, but not
directly influenced. Intitutively, our knowledge of the domain tells us that
these events influence M's calling behaviour only through their effect on
the alarm. Also given the state of the alarm, whether J calls has no
influence on M's calling. Formally speaking, we believe that the
following conditional independence statement holds:
P(M Calls |J Calls, Alarm, Earthquake, Burglary) = P(M calls | Alarm).
• Compactness and node ordering:
Bayesian network are compact and they possess a property of being
locally structured (also called as sparse systems). In a locally structured
system, each sub component interacts directly with only a bounded
number of other components, regardless of the total number of
components.
⇒ Local structure is usually associated with linear rather than
exponential growth in complexity.
⇒ We assume 'n' boolean variables for simplicity, then the amount of
information on needed to specify each conditional probability table will
be at most 2k numbers. Where each random variable is influenced by 'k'
other variables.
⇒ The complete network can be specified by n2k numbers.
With some values of 'n' the joint distribution contains 2n numbers.
⇒ To make this concrete, suppose we have n = 30 nodes, each with five
parents (k=5). Then the Bayesian network requires 960 numbers, but the
full joint distribution requires over a billion.
Ordering of nodes in Bayesian network:
1) The correct order in which to add nodes is to add the "root causes"
first, then the variables they influence, and so on, until we reach the
"leaves", which have no direct causal influence on the other variables.
2) If wrong order is choosen we get more complicated network.
For example: Consider network shown in following diagram.
The network construction process goes as follows:
i) Adding M calls: No parent.
ii) Adding J calls: If M calls, that probably means the alarm has gone off,
which ofcourse would make it more likely that J calls. Therefore J calls
needs M calls as a parent.
iii) Adding alarm: Clearly, if both call, it is more likely that the alarm has
gone off than if just one or neither call, so we need both M calls and J
calls as parent.
iv) Adding burglary: If we know the alarm state, then the call from J or M
might give us information about phone ringing or M's music, but not
about burglary : P(Burglary Alarm, J calls, M calls) = P(Burglary |
Alarm).
Hence we need just alarm as parent.
v) Adding earthquake: If the alarm is on, it is more likely that there has
been an 2 earthquake. (The alarm is an earthquake detector of sorts). But
if we know that there has been a burglary, then that explains the alarm,
and the probability of an earthquake would be only slightly above normal.
Hence, we need both Alarm and Burglary as parents. The network is
shown below.

• Consider example of a very bad node ordering as shown in the Fig.


7.2.4: M calls, J calls, Earthquake, Burglary, Alarm are nodes. This
network requires 31 distinct probabilities to be specified exactly the same
as the full joint distribution. It is important to realize, however, that any
of the three networks can represent exactly the same joint distribution.
The last two versions simply fail to represent all conditional
independence relationships and hence end up specifying a lot of
unnecessary numbers instead.
3. Conditional Independence Relations in Bayesian
Networks
One can start from a "topological" semantics that specifies the conditional
independence relationships encoded by the graph structure, and from
these we can derive the "numerical" semantics. The topological semantics
is given by either of the following specifications, which are equivalent.
1) A node is conditionally independent of its non-descendants, given its
parents. For example, in Fig. 7.3.2 J calls is independent of Burglary and
Earthquake, given the value of Alarm.
2) A node is conditionally independent of all other nodes in the network,
given it’s parents, children, and children’s parents- that is, given its
Markov blanket.

For example: Burglary is independent of J calls and M calls given Alarm


and Earthquake.
This specifications are illustrated in following figures. From these
conditional independence assertions and the CPTs, the full joint
distribution can be reconstructed; thus the "numerical" semantics and the
"topological" semantics are equivalent.
A node X is conditionally independent of its non-descendants (example -
- The Z1js) given its parents(the U; is shown in the gray area).
As node X is conditionally on independent of all other nodes in the
network given its Markove blanket (the gray area).
Efficient representation of conditional representation:
If the maximum number of parents k is smallish, filling in the CPT for a
node, would require up to O(2k) numbers.
To avoid this big relationship number canconical distribution is used. In
this a complete table is specified by naming the patterns supplied with
parameters, which can describe relationship which are necessary. A
deterministic node represents canonical distribution. A deterministic node
is a node whose value is exactly specified by the value of its parents with
no uncertainity.
The uncertain relationships are often represented by noisy-OR
relationships. [Noisy-OR is generalization of logical OR]. The noisy-OR
model allows uncertainity about the parent to cause the child to be true
i.e. the causal relationship between parent and child would be inhibited.
For example: A patient could have cold (parent) but not exhibit a fever
(child).
For noisy-OR model it is required that all the possible effects of parent
must be listed. It should be clear that inhibition of one parent is
independent of other parent.

The Hybrid Bayesian Network


A network with both discrete and continuous variables is called as hybrid
Bayesian network. For representing continuous variable its discretization
is done in terms of intervals (because it can have infinite values).
To specify a hybrid network, we have to specify two new kinds of
distributions. The conditional distribution for a continuous variable given
discrete or continuous parents; and the conditional distribution for a
discrete variable given continuous parents.
Consider the simple example in following diagram, in which a customer
buys some fruit depending on its cost, which depends in turn on the size
of the harvest and whether the government's subsidy scheme is operating.
The variable Cost is continuous

and has continuous and discrete parents; the variable Buys is discrete and
has a continuous parent.
For the cost variable, we need to specify P(Cost Harvest, Subsidy). The
discrete parent is handled by explicit enumeration that is, specifying both
P(Cost | Harvest, Subsidy) and P(Cost Harvest, Subsidy). To handle
Harvest, we specify how the distribution over the cost 'c' depends on the
continuous value 'h', of Harvest. In other words, we specify the parameter
of the cost distribution as a function of 'h'.

Inferencing in Bayesian Networks


1. Exact Inference in Bayesian Networks
For inferencing in probabilistic system, it is required to calculate
posterior probability distribution for a set of query variables, where some
observed events are given. [That is we have some values attached to
evidence variables].
• Notation Revisited:
The notation used in inferencing is same as the one used in probability
theory.
X: Query variable.
E: The set of evidence variables E 1,....., Em and 'e' is the perticular
observed event. Y: The set of non-evidence variables Y 1, Y2,.....
Yk [Non-evidence variables are also called as hidden variables].
X: It the complete set of all the types of variables, where X = {X} U E U
Y.
• Generally the query requires the posterior probability distribution P(X |
e) [assuming that query variable is not among the evidence variables, if it
is, then posterior distribution for X simply gives probability 1 to the
observed value]. [Note that query can contain more than one variable. For
study purpose we are assuming single variable].
• Example: In the burglary case, if the observed event is Jcalls = true and
Mcalls true.
The query is 'Has burglary occured?'
The probability distribution for this situation would be,
P(Burglary | J calls = true, M calls = true) = < 0.284, 0.716 >
2. Inference by Enumeration
A Bayesian network gives a complete representation of the full joint
distribution. These full joint distributions can be written as product of
conditional probabilities from the Bayesian network.
A query can be answered using Bayesian network by computing sums of
products of conditional probabilities from the network.
• The algorithm
The algorithm ENUMERATE-JOINT-ASK gives inference by
enumerating on full joint distribution.
Characteristics of algorithm:
1) It takes input a full joint distribution P and looks up values in it. [The
same algorithm can be modified to take input as Bayesian network and
looking up in joint entries by multiplying the corresponding conditional
probability table entries from Bayesian network.
2) The ENUMERATION-JOINT-ASK uses ENUMERATION-ASK
(EA) algorithm which evaluate expression using depth-first recursions.
Therefore, the space complexity of EA is only linear in the number of
variables. The algorithm sums over the full joint distribution without ever
constructing it explicitely. The time complexity for network with 'n'
boolean variables is always O(2n) which is better than the O(n 2n)
required in simple inferencing approach (using posterior probability).
3) The drawback of the algorithm is, it keeps on evaluating repeated sub
expression which results in wastage of computation time.
The enumeration algorithm for answering queries on Bayesian
network.
• The algorithm
Function ENUMERATION-ASK (X, e, bn) returns a distribution over X.
Inputs: X, the query variable
e, observed values for variables E.
bn, a Bayes net with variables {X} UEU Y/* y = Hidden variable */
Q(X) ← A distribution over X, initially empty for each value x i of X do
extend e with value xi for X.
Q(xi) ← ENUMERATE-ALL (VARS[bn] e) return NORMALIZE
(Q(x)).
Function ENUMERATE-ALL (vars, e) returns a real number.
if EMPTY? (vars) then return 1.0
Y ← FIRST (vars)
If Y has value y in e.
Then return P(y | parents (Y)) X ENUMERATE-ALL (REST(vars), e)
else return, Σy Ply parents (Y)) X ENUMERATE-ALL (REST(vars), ey)
where eyis e extended with Y = y.
Example:
Consider query,
P(Burglary | J calls = true, M calls =true)
Hidden variables in the queries are → Earthquake and Alarm.
using the query equation.
P(Burglary | j, m) = α P( Burglary, J, M) = α Σe Σa P(Burglary, e, a, j, m)
The semantics of Bayesian networks (equation 7.2.1) then gives us an
expression in terms of CPT entries. For simplicity, we will do this just for
Burglary = true.
P(b | j, m) = α Σe ΣaP(b)P(e)P(a | b, e)P(j | a) P(m | a)
• To compute this expression, we have to add four terms, each computed
by multiplying five numbers.
• Worst case, where we have to sum out almost all the variables, the
complexity of the algorithm for a network with n boolean variables is
O(n2n).
• An improvement can be obtained from the following simple
observations. The P(b) term is a constant and can be moved outside the
summations over a and e, and the P(e) term can be moved outside the
summation over a. Hence, we have
P(b | j, m) = α P(b) Σe P(e) ΣaP(a | b, e)P(j | a) P(m, a)
This expression can be evaluated by looping through the variables in
order, multiplying CPT entries as we go. For each summation, we also
need to loop over the variable's possible values. The structure of this
computation is shown in following diagram. Using the numbers from Fig.
7.3.2, we obtain P(b | j, m) = α × 0.00059224. The corresponding
computation for ⌐b yields α × 0.0014919;
Hence,
P(B | j, m)= α < 0.00059224, 0.0014919 >
≈ <0.284, 0.716 >
That is, the chance of burglary, given calls from both neighbours is about
28 %. Note In the Fig. 7.3.7, the evalution proceeds top to down,
multiplying values along each path and summing at the "t" nodes.
Observe that there is repetition of paths for j and m.
3. The Variable Elimination Algorithm
The enumeration algorithm can be improved substantially by eliminating
calculations of repeated sub expression in tree. Calculation can be done
once and save the results for later use. This is a form of dynamic
programming.
Working of variable elimination algorithm
1) It works by evaluating expressions such as [P(b | j, m) = α P(b) Σ e P(e)
Σa P(a | b, e) P(j | a) P(m | a)] in right-to-left order.
2) Intermediate results are stored and summations over each variable are
done only for those portions of the expression that depends on the
variable.
For example: Consider the Burglary network.
We evaluate the expression:

3) Factors: Each part of the expression is annotated with the name of the
associated variable, these parts are called factors.
Steps in algorithm:
i) The factor for M, P (m | a), does not require summing over M.
Probability is stored, given each value of a, in a two-element factor,
Note fM means that M was used to produce f.
ii) Store the factor for J as the two-element vector fj (A).
iii) The factor for A is P(a | B, e) which will be a 2×2×2 matrix f A (A, B,
E).
iv) Summing out A from the product of these tree factors. This will give
2×2 matrix whose indices range over just B and E. We put bar over A in
the name of the matrix to indicate that A has been summed out.
FȂJM(B, E) =Σa fA (a, B, E) × fj (a) × fM (a)
= fA (a, B, E) x fJ(a) x fM (a) + fA ( ⌐a, B, E) x fj (¬a)×fm (¬a)
The multiplication process used is called a printwise product.
v) Processing E in the same way (i.e.) sum out E from the product of
fE (E) and fȂJM (B, E):
fȂJM (B) = fE (e) × fȂJM (B, e) + fE (¬e) × fȂJM (B, ¬e).
vi) Compute the answer simply by multiplying the factor for B. (i.e.) (f B|
B) = P(B), 701 by the accumulated matrix (B) :
P(B | j, m) = α fB (B) x f EȂJM (B).
From the above sequence of steps it can be noticed that two
computational operations are required.
a) Pointwise product of a pair of factors.
b) Summing out a variable from a product of factors.
a) Pointwise product of a pair of factors: The pointwise product of two
factors f1 and f2 yields a new factor f, those variables are the union of the
variables in f1 and f2. Suppose the two factors have variables Y1,..., Yk.
Then we have f(x1,..., Xj, Y1,.... Yk, Z1,.....Zl) = f1 (X1,....Xj, Y1.... Yk)
f2(Y1,..... Yk, Z1,....Zl). If all the variables are binary, then f 1 and f2 have
2j+k and 2k+l entries and the pointwise product has 2j+k+1 entries.
For example: Given two factors f1 (A, B) and f2 (B, C) with probability
distributions shown below, the pointwise product f1 × f2 is given as f1 (A,
B, C).
b) Summing out a variable from a product of factors: It is a straight
forward computation. Any factor that does not depend on the variable to
be summed out can be moved outside the summation process.
For example:
Σe fE (e) × fA (A, B, e) × fj (A) × fM (A) = fj (A) × fM (A) × Σ e fE (e) ×
fA (A, B, e). Now, the pointwise product inside the summation is
computed and the variable is summed out of the resulting matrix.
Fj (A) × fM (A) × Σe fE (e) × fA (A, B, e) = fj (A) × fM (A) × fEA(A, B).
Matrices are not multiplied until we need to sum but a variable from the
accumulated product. At that point, multiply those matrices that include
the variable to be summed out.
The procedure for pointwise product and summing is given below, the
variable elimination algorithm is shown below:
Function ELIMINATION-ASK (X, e, bn) returns a distribution over X
Inputs: X, the query variable
e, evidence specified as an event
bn, a Bayesian network specifying joint distribution P(X1,..., Xn).
Factors ← [ ] : vars ← REVERSE (VARS [bn])
for each var in vars do
Factors ←[MAKE-FACTOR (var, e) factors]
if var is a hidden variable then
Factors←SUM-OUT (var, factors)
returns NORMALIZE (POINTWISE-PRODUCT (factors)).
P (J calls, Burglary = True)
The first step is to write out the nested summation.
P(J | b) = α P(b) Σe P(e) Σa P(a | b, e) P(J | a) Σm P(M | a).
Evaluating this expression from right to left,
Σm P(M | a) is equal to 1.
Note The variable M is irrelevant to this query. Result of the query P (J
calls/Burglary = True) is unchanged if we remove M calls from the
network. We can remove any leaf node which is not a query variable or
an evidence variable. After its removal, there may be more leaf nodes and
they may be irrelevant. Eventually we find that every variable that is not
an ancestor of a query variable or evidence variable is, irrelevant to the
query. A variable elimination algorithm can remove all these variables
before evaluating the query.
4. The Complexity Involved in Exact Inferencing
The variable elimination algorithm is more efficient than enumeration
algorithm because it avoids repeated computations as well as drops
irrelevant variables.
The variable elimination algorithm constructs the factor, deriving its
operation. The space and time complexity of variable elimination is
directly dependant on size of the largest factor constructed during the
operation. Basically the factor construction is determined by the order of
elimination of variables and by the structure of the network; which affects
both space and time complexity.
For developing more efficient process we can construct singly connected
networks which are also called as polytrees. In singly connected network,
there is at most one undirected path between any two nodes in the
networks. The singly connected networks have property that, the time and
space complexity of exact inference in polytrees is linear in the size of the
network. Here the size is defined as the number of CPT entries. If the
number of parents of each node is bounded by a constant, then the
complexity I will also be linear in the number of nodes.
For example: The Burglary network shown in the Fig. 7.3.2 is a polytrees.
[Note that every problem may not be represented as polytrees].
In multiply networks [In this, their can be multiple undirected paths
between any two nodes and more than one directed path between some
pair of nodes], variable elimination takes exponential time and space
complexity in the worst case, even when the number of parents per node
is bounded. It should be noted that variable elimination includes inference
in propositional logic as a special case and inference in Bayesian network
is NP-hard. In fact it is strictly harder than NP-complete problem.
Clustering algorithm:
1) Clustering algorithm (known as joint tree algorithms) in which
inferencing time can be reduced to O(n). In clustering individual nodes of
the network are joint to form cluster nodes to such a way that the
resulting network is a polytree.
2) The variable elimination algorithm is efficient algorithm for answering
individual queries. Posterior probabilities are computed for all the
variables in the network. It can be less efficient, in polytree network
because it needs to issue O(n) queries costing O(n) each, for a total of
O(n2) time, clustering algorithm, improves over it.

For example: The multiply connected network shown in Fig. 7.3.8 (a) can
be converted into a polytree by combining the Sprinkler and Rain node
into a clusternode called Sprinkler + Rain, as shown in Fig. 7.2.8 (b). The
two Boolean nodes are replaced by a meganode that takes on four
possible values: TT, TF, FT, FF. The meganode has only one parent, the
Boolean variable. Cloudy, so there are two conditioning cases.
Peculiarities of algorithm:
1) Once the network is in polytree form, a special-purpose inference
algorithm is applied. Essentially, the algorithm is a form of constraint
propagation where the constraints ensure that neighbouring clusters agree
on the posterior probability of any variables that they have in common.
2) With careful book keeping, this algorithm is able to compute posterior
probabilities for all the non-evidence nodes in the network in time O(n),
where n is now the size of the modified network.
3) However, the NP-hardness of the problem continues: if a network
requires exponential time and space with variable elimination, then the
CPTs in the clustered network will require exponential time and space to
construct.

Approximate Inference in Bayesian Network


Because of the fact that exact inference in multiply connected network is
intractable problem, we go for approximate inferencing.
For approximate inferencing randomize sampling algorithm (Monte Carlo
Algorithm) is used. Its accuracy depends on number of samples
generated. Monte Carlo Algorithm is widely used in problem areas where
it is difficult to calculate exact quantities.
There are two families of Monte Carlo algorithm:
1) Direct sampling algorithm.
2) Markov chain sampling algorithm.
1. Direct Sampling Algorithm
1) The basic element in any sampling algorithm is the generation of
samples from a known probability distribution.
For example: An unbiased coin can be thought of as a random variable
coin with values (heads, tails) and a prior distribution P(coin) <0.5, 0.5>.
Sampling from this distribution is exactly like flipping the coin: with
probability 0.5 it will return 9879 heads, and with probability 0.5 it will
return tails. Given a source of random numbers in the range [0, 1], it is a
simple matter to sample any distribution on a single variable.
2) The simplest kind of random sampling process for Bayesian networks
generates events from a network that has no evidence associated with it.
The idea is to sample each variable in turn, in topological order.
3) The probability distribution from which the value is sampled is
conditioned on the values already assigned to the variable's parents.
4) The sampling algorithm:
Function PRIOR-SAMPLE (bn) returns an event sampled from the prior
specified by bn.
inputs: bn, a Bayesian network specifying joint distribution P(X1,..., Xn).
X ← an event with n elements.
for i = 1 to n do
xi ← a random sample from P(Xi parent (Xi))
return x.
5) Applying operations of algorithm on the network in Fig. 7.3.8 (a)
assuming an ordering [Cloudy, Sprinkler, Rain, WetGrass]:
i) Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
ii) Sample from P(Sprinkler | Cloudy = true) = <0.1, 0.9>; suppose this
returns false.
iii) Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this
returns true. iv) Sample from P(WetGrass | Sprinkler = false, Rain = true)
<0.9, 0.1>; suppose this returns true.
In this case PRIOR-SAMPLE returns the event [true, false, true, true].
6) PRIOR-SAMPLE generates samples from the prior joint distribution
specified by the nework. First, let S ps (x1,...,xn) be the probability that a
specific event is generated by the PRIOR-SAMPLE algorithm. Just
looking at the sampling process, we have,
Sps (x1,...,xn)= Πni=1 P(xi | parents(Xi)).
Because each sampling step depends only on the parent values. This
expression is also the probability of the event according to the Bayesian
net's representation of the joint distribution. We have,
Sps (x1,...,xn) = P(x1,...,xn)
7) In any sampling algorithm, the answers are computed by counting the
actual samples generated. Suppose there are N total samples, and let N
(x1,...,xn) be the frequency of the specific event x 1,...,xn. We expect this
frequency to converge in the limit, to its expected value according to the
sampling probability:-
N → lim ∞ Nps (x1, …., xn )/N = Sps (x1, …, xn) = P(x1,….xn) ... (7.3.4)
For example: Consider the event produced earlier : [true, false, true, true].
The b. a sampling probability for this event is,
Sps (true, false, true, true) = 0.5 × 0.9 × 0.8 × 0.9
= 0.324
Hence, in the limit of large N, we expect 32.4 % of the samples to be of
this event.
8) Whenever we use an approximate equality ("≈"), we mean it in exactly
this sense that the estimated probability becomes exact in the large
sample limit. Such an estimate is called consistent.
For example: One can produce a consistent estimate of the probability of
any partially specified event x1,...,xn, where m≤n, as follows: -
P(x1,...,xm) ≈ Nps (x1,...,xm)/N ….(7.3.5)
That is, the probability of the event can be estimated as the fraction of all
complete events generated by the sampling process that match the
partially specified event.
For example: If we generate 1000 samples from the sprinkler network,
and 511 of them have, Rain = true, then the estimated probability of rain,
written as P(Rain = true), is 0.511.
2. Rejection Sampling in Bayesian Networks
1) Rejection sampling is a method for producing samples from a hard-to-
sample distribution given an easy-to-simple distribution.
2) It can be used to compute conditional probabilities; that is, to
determine P(X | e).
3) The Rejection Sampling algorithm is,
function REJECTION-SAMPLING (x, e, bn, N) returns an estimate of
P(X | e)
inputs: X, the query variable
e, evidence specified as an event.
bn, a Bayesian network.
N, the total number of samples to be generated.
local variables: N, a vector of counts over X, initially zero.
for j = 1 to N do
X ←PRIOR-SAMPLE (bn)
if X is consistent with e is N[x] ← N[x] +1 where x is the value of X in x.
return NORMALIZE (N[x]).
The rejection sampling algorithm for answering queries given evidence in
a Bayesian network.
• Working of algorithm:
i) It generates samples from the prior distribution specified by the
network.
ii) It rejects all those samples that do not match the evidence.
iii) Finally, the estimate P(X = x | e) is obtained by counting how often X
= x occurs in the remaining samples.
4) Let P(X | e) be the estimated distribution that the algorithm returns.
From the definition of the algorithm, we have,
P(X | e) = α Nps (X, e) = Nps (X, e)/Nps (e)
from equation (7.3.5) this becomes,
P(X | e) ≈ P(x, e)/P(e) = P(x, e).
That is, rejection sampling produces a consistent estimate of the rule
probability. 5) Applying operations of algorithm on the network in the
Fig. 7.3.8 (a), let us assume that we wish to estimate P (Rain/Sprinkler =
true), using 100 samples. Of the 100 that we generate, suppose 8 have
Rain = true and 19 have Rain = false. Hence,
P(Rain | Sprinkler = true) ≈NORMALIZE
(<8.19>) = <0.296, 0.704 >
The true answer is <0.3, 0.7>
6) As more samples are collected, the estimate will converges to the true
answer. The standard deviation of the error in each probability will be
proportional to 1/√n, where 'n' is the number of samples used in the
estimate.
7) The biggest problem with rejection sampling is that it rejects so many
samples ! The fraction of samples consistent with the evidence 'e' drops
exponentially as the number of evidence variables grows, so the
procedure is simply unusable for complex problems.
3. Likelihood Weighing in Bayesian Network
1) Likelihood weighing avoids the inefficiency of rejection sampling by
generating only events that are consistent with the evidence 'e'.
2) The Likelihood Weighing algorithm is,
Function LIKELIHOOD-WEIGHING (X, e, bn, N) returns an estimate of
P(X | e) inputs: X, the query variable
e, evidence specified as an event
bn, a Bayesian network.
N, the total number of samples to be generated.
Local variables: W, a vector of weighted counts over X, initially zero.
for j = 1 to N do
X, w← WEIGHTED-SAMPLE (bn)
W[X]← W[x] + w where x is the value of X in x.
return NORMALIZE (W[X]).
Function WEIGHTED-SAMPLE (bn, e) returns an event and a weight
X ← an event with n elements; w← 1.
for i = 1 to n do
if X, has a value xi in e
then w ← wX P(Xi = xi parents(Xi))
else xi ← a random sample from P(Xi | parents (Xi))
return X, w.
Note on likelihood weighing algorithm -
i) It fixes the values for the evidence variables E and samples only the
remaining variables X and Y. This guarantees that each event generated is
consistent with the evidence.
ii) Not all events are equal, however, before tallying the counts in the
distribution for the query variable, each event is weighted by the
likelihood that the event 19 accords to the evidence, as measured by the
product of the conditional of probabilities for each evidence variable,
given its parents.
iii) Intuitively, events in which the actual evidence appears unlikely
should be given less weight.
3) Applying operations of algorithm on the network Fig. 7.3.8 (a), with
the query P(Rain/Sprinkler = true, WetGrass = true). The process goes as
follows:
i) The weight w is set to 1.0
ii) Now, an event is generated.
• Sample from P(Cloudy) = <0.5, 0.5>; suppose this returns true.
• Sprinkler is an evidence variable with value true. Therefore, we set
w ← w P (Sprinkler = true | Cloudy = true) = 0.1.
• Sample from P(Rain | Cloudy = true) = <0.8, 0.2>; suppose this returns
true.
• WetGrass is an evidence variable with value true. Therefore, we set
W ← w P(Wet Grass = true | Sprinkler = true, Rain = true) = 0.099.
iii) Here WEIGHTED-SAMPLE returns the event [true, true, true, true]
with weight 0.099, and this is tallied under Rain = true.
iv) The weight is low because the event describes a cloudy day, which
makes the sprinkler unlikely to be on.
4) Likelihood Weighting working:
i)Examine the sampling distribution Sws for WEIGHTED-SAMPLE.
ii) The evidence variables E are fixed with values 'e'.
iii) Call the other variables Z, that is, Z = {X}U Y.
iv) The algorithm samples each variable in Z, in given its parent values:
Sws (Z, e) =IIli=1 P(Zi | parents (Zi))... (7.3.6)
Notice that Parents (Zi) can include both hidden variables and evidence
variables. Unlike the prior distribution P(Z), the distribution S ws pays
some attention to the son evidence the sampled values for each Z i will be
influenced by evidence among Zi's ancestors.
v) On the other hand, Sws pays less attention to the evidence than does the
true Som posterior distribution P(Z | e), because the sampled values for
each Zi ignore evidence among Zi's non ancestors.
vi) The likelihood weight w makes up for the difference between the
actual and desired sampling distribution. The weight for a given sample
'x', composed from Z and 'e', is the product of the likelihood for each
evidence variable given its parents (some or all of which may be among
the Zis):
w(Z, e) = IImj=l P(ei / parents(Ei) .... (7.3.7)
vii) Multiplying equations (7.3.4) and (7.3.5), we see that the weighted
probability of a sample has the particularly convenient form,
Sws (Z, e) W(Z, e) = Πli=1 P(yi | parents(Yi))
Πmi=1 P(ei | parents(Ei)) = P(y, e)
Because the two products cover all the variables in the network, allowing
us to use equation (7.3.1) for the joint probability.
viii) It can be easy to show that likelihood weighting estimates are
consistent. For any particular value x of X, the estimated posterior
probability can be calculated as follows:
P(x | e) =α Σy Nws (x, y, e) w (x, y, e)
from LIKELIHOOD-WEIGHTING.
= α՛ Σy Sws (x, y, e) w (x, y, e)
from large N.
= α' Σy Ρ(x, y, e)
= α' P(x, e) =P(x | e)by equation (7.3.8)
Hence, likelihood weighting returns consistent estimates.
5) Performance of algorithm:
i) Likelihood weighting uses all the samples generated therefore it can be
much more efficient than rejection sampling.
ii) It will, suffer a degradation in performance as the number of evidence
variables increases.
iii) Because most samples will have very low weights and hence the
weighted estimate will be dominated by the tiny fraction of samples that
accord more than an infinitesimal likelihood to the evidence.
iv) The problem is exacerbated if the evidence variables occur late in the
variable ordering, because then the samples will be the simulations that
bear little resemblance to the reality suggested by the evidence.
4. Markov Chain Monte Carlo (MCMC) Algorithm for
Inference in Bayesian Networks
Working of MCMC algorithm
1) MCMC generates each event by making a random change to the
preceding event. It is therefore helpful to think of the network as being in
a particular current state specifying a value for every variable.
2) The next state is generated by randomly sampling a value for one of
the non- evidence variables X i, conditioned on the current values of the
variables in the Markov blanket of Xi.
3) MCMC therefore wonders randomly around the state space - the space
of possible complete assignments flipping one variable at a time, but
keeping the evidence variables fixed.
4) For example: Consider the query P(rain | Sprinkler = true, WetGrass =
true) applied to the network of Fig. 7.3.8 (a). The evidence variables
'Sprinkler' and 'WetGrass' are fixed to their observed values and the
hidden variables 'Cloudy'and 'Rain' are initialized randomly. Let us say to
true and false respectively. Thus, the initial state is [true, true, false, true].
Now the following steps are executed repeatedly:
a) Cloudy is sampled, given the current values of its Markov blanket
variables : in this case, we sample from P(Cloudy | Sprinkler = true, Rain
= false), (Shortly we will show how to calculate this distribution).
Suppose the result is y = false. Then the new current state is [false, true,
false, true].
b) Rain is sampled, given the current values of its Markov blanket
variables : In this case we sample from P(Rain Cloudy false, Sprinkler =
true, WetGrass = true). Suppose this yields Rain = true. The new current
state is [false, true, true, true].
5) Each state visited during this process is a sample that contributes to the
estimate, for the query variable Rain. If the process visits 20 states where
Rain is true and 60 states where Rain is false, then the answer to the
query is NORMALIZE (<20, 60>) = <0.25, 0.75>).
6) The complete algorithm is shown as follows:-
Function MCMC-ASK (X, e, bn, N) returns an estimate of P(X | e)
local variables: N[X], a vector of counts over X, initially zero.
Z, the nonevidence variables in bn.
X, the current state of the network, initially copied from e.
initialize X with random values for the variables in Z.
for j = 1 to N do
N[x] ←N[x] +1 where x is the value of X in x.
for each Zi in Z do...
sample the value of Zi in X from P(Zi |mb(Zi)) given the value of MB (Zi)
in X return NORMALIZE (N[X]).
Why MCMC works
1) It should be noticed that the sampling process settles into a "dynamic
equilibrium" in which the long-run fraction of time spent in each state is
exactly proporitional to its posterior probability.
2) This remarkable property follows from the specific transition
probability with which the process moves from one state to another, as
defined by the conditional distribution given the Markov blanket of the
variable being sampled.
3) Let q(X →X') be the probability that the process makes a transition
from state X to X'. This transition probability defines what is called a
Markov chain on the state space.
4) Now suppose that we run the Markov chain for 't' steps, and let л(X) be
the probability that the system is in state X at time 't'.
- Similarly, let лt+1(X') be the probability of being in state X' at time 't+1.'
- Given лt (X)
- We can calculate лt+1(X') by summing for all states the system could be
in at time 't'.
- The probability of being in that state is the times the probability of
making the transition to X' →πt+1(X') = Σx πt (X) q(X→X')
5) We will say that the chain has reached its stationary distribution if л t =
πt+1. Call this stationary distribution л; its defining equation is therefore,
π (X') = ΣX π (X) q (X→X') for all X' …..(7.3.9)
Under certain standard assumptions about the transition probability
distribution there is exactly one distribution л satisfying this equation for
any given q.
6) Above equation (7.3.8) can be read as saying that the expected
"outflow" from each state (i.e., its current "population") is equal to the
expected "inflow" from all the state.
One way to satisfy this relationship is, the expected flow between any
pair of states is the same in both direction. This is the property of detailed
balance:
л(X) q (X→X')=π (X') q (X'→X) for all X, X'. …...(7.3.10)
7) It can be shown that detailed balance implies stationarily simply by
summing over X in equation (7.3.9). We have,
Σx π (X) q (X→X') = Σx π (X') q (X' →X) = π (X') Σx q (X' → X) = π
(X').
8) To show that the transition probability q (X→X') defined by the
sampling step in MCMC-ASK satisfies the detailed balance equation with
a stationary distribution f equal to P(X | e). (The true posterior
distribution on the hidden variables).
This is done in two steps:-
a) First, we will define a Markov chain in which each variable is sampled
conditionally on the current values of all the other variables, and we will
show that this satisfies detailed balance.
b) We will simply observe that, for Bayesian networks, doing that is
equivalent to sampling conditionally on the variable's Markov blanket.
• Let Xi be the variable to be sampled, and let be all the hidden
variables other than Xi. Their values in the current state are x i and . If
we sample a new xi for Xi conditionally on all the other variables,
including the evidence, we have, q(X→X') = q(x i, ) → (xi, xi) =
P(xi | , e).
• This transition probability is called Gibbs sampler and is a particularly
convenient form of MCMC. Now we show that the Gibbs sampler is, in
detailed balance with the true posterior.
π (X) q(X→ X') = P(X | e) P(x'I | , e) = P(xi, | e) P(x'i | , e)
= P(xi | , e) P( | e) P(x'i | , e) (using the chain rule of first term).
= P(xi | , e) P(x'i, | e) (using the chain rule backwards)
= π (X') q (X' →X).
• Note that, a variable is independent of all other variables given its
Markov blanket; hence
P(x'i + ,e) = P(x' | mb(Xi))
where mb (Xi) denotes the values of the variables in X i's Markov blanket,
MB (xi).
The probability of a variable, given its Markov blanket, is proportional to
the lo probability of the variable given its parents times the probability of
each child given its respective parents:-
P(x' | mb (Xi)) = α P(x' | Parents(Xi)) × II Yj € children(xi) P(yi |
Parents(Yi)) ….(7.3.11)
• Hence, to flip each variable , the number of multiplications required
is equal to the number of Xi 's children.
9) We have discussed here simple variant of MCMC, namely the Gibbs
sampler. In its most general form, MCMC is a powerful method for
computing with probability models and many variants have been
developed, including the simulated annealing algorithm presented.

Other Techniques for Uncertain Reasoning


AU: Dec.-14, 17, May-17
There are two basic reasons why probability can fail:-
1) One common view is that probability theory is essentially numerical,
whereas human judgemental reasoning is more "qualitative".
2) Probabilistic Reasoning did not scale up because of the exponential
number of probabilities required in the full joint distribution.

Default Reasoning
• We can do qualitative reasoning using technique like default reasoning.
• Default reasoning treats conclusions not as "believed to a certain
degree", but as "believed until a better reason is found to believe
something else".
Rule-based
1) This approach hope to build on the success of logical rule-based
systems, but add a sort of "Fudge factor" to each rule to accommodate
uncertainty. These methods were developed in the mid-1970s and formed
the basis for a large number of expert systems in medicine and other
areas.
2) Rule-based systems emerged from early work on practical and intuitive
systems for logical inference.
3) Logical systems in general, and logical rule-based systems in particular
have three desirable properties:

A ⇒ B, we conclude B, given evidence A, without worrying about any


i) Locality: In logical systems, whenever we have a rule of the form

other rules. In probabilistic systems, we need to consider all the evidence


in the Markov blanket.
ii) Detachment: Once a logical proof is found for a proposition B, the
proposition can be used regardless of how it was derived. That is, it can
be detached from its justification. In dealing with probabilities, on the
other hand, the source of the evidence for a belief is important for
subsequent reasoning.
iii) Truth functionality: In logic, the truth of complex sentences can be
computed from the truth of the components. Probability combination does
not work this way, except under strong global independence assumptions.
4) There have been several attempts to devise uncertain reasoning
schemes that retain these advantages. The idea is to attach degrees of
belief to propositions and rules and to devise purely local schemes for
combining and propagating those degrees of belief. The schemes are also
truth functional.
For example: The degree of belief in A v B is a function of the belief in A
and the belief in B.
5) Problem associated with rule based methods:-
i) The properties of locality, detachment, and truth-functionality are
simply not appropriate for uncertain reasoning.
ii) Let us look at truth functionality first. Let H 1, be the event that a fair
coin flip comes up heads. Let T1 be the event that the coin comes up tails
on that same flip, and let H2 be the event that the coin comes up heads on
a second flip. Clearly, all three events have the same probability, 0.5, and
so a truth functional system must assign the same belief to the disjunction
of any two of them. But we can see that the probability of the disjunction
depends on the events themselves and not just on their probabilities.

It gets worse when we chain evidence together. Truth-functional systems


have rules of the form A → B that allow us to compute the belief in B as
a function of the belief in the rule and the belief in A. Both forward and
backward-chaining systems can be devised. The belief in the rule is
assumed to be constant and is usually specified by the knowledge
engineer.
For example: as A →0.9B
Consider the WetGrass situation. If we wanted to be able to do both
causal and diagnostic reasoning, we would need the two rules,
Rain → WetGrass and WetGrass → Rain
These two rules form a feedback loop: evidence for Rain increases the
belief in WetGrass, which in turn increases the belief in Rain even more.
Clearly, uncertain reasoning systems need to keep track of the paths along
which evidence is propagated.
Inter-causal reasoning (for explaining away) is also tricky. Consider what
happens when we have the two rules,
Sprinkler → WetGrass and WetGrass → Rain.
Suppose we see that the Sprinkler is on. Chaining forward through our
rules, this increases the belief that the grass will be wet, which in turn
increases the belief that it is raining. But this is ridiculous; the fact that
the sprinkler is on explains away the wet grass and should reduce the
belief in rain. A truth functional system acts as if it also believes
Sprinkler → Rain.
6) If task is restricted and rules are engineered carefully then truth-
functional systems work well.
Ignorance
1) One area that we have not addressed so far is the question of
ignorance, as opposed to uncertainty. Consider the flipping of a coin. If
we know that the coin is fair, then a probability of 0.5 for heads is
reasonable. If we know that the coinis biased, but we do not know which
way, then 0.5 is the only reasonable gill b probability. The two cases are
different, yet probability seems not to distinguish buy them. The
Dempster-Shafer theory uses interval-valued degrees of belief to ow
represent an agent's knowledge of the probability of a proposition.
2) Representing ignorance using Dempster-Shafer theory:-
i) The Dempster- Shafer theory is designed to deal with the distinction
between uncertainty and ignorance.
ii) Rather than computing the probability of a proposition, it computes the
probability that the evidence supports the propositions.
iii) This measure of belief is called a belief function, written Bel(X).
iv) We return to coin flipping for an example of belief functions. Suppose
a shady character comes up to you and offers to bet you 10 that this coin
will come up heads on the next flip. Given that the coin might or might
not be fair, what belief should you ascribe to the event that it comes up
heads? Dempster-Shafer theory says that because you have no evidence
either way, you have to say that the belief Bel(Heads) = 0 and also that
Bel(¬Heads) 0. This make Dempster-Shafer reasoning systems skeptical
in a way that has some intuitive appeal.
v) Now suppose you have an expert at your disposal who testifies with
90% certainty that the coin is fair (i.e he is 90% sure that P(Heads) 0.5).
Then Dempster-Shafer theory gives Bel(Heads) = 0.9×0.5 0.45 and
likewise Bel (¬Heads) = 0.45. There is still a 10 percentage point "gap"
that is not accounted for by the evidence.
vi) 'Dempster's rule' (Dempster, 1968) shows how to combine evidence to
give new values for Bel, and Shafer's work extends this into a complete
computational model.
vii) Problems associated with Dempster-Shafer theory:-
1. There is a problem in connecting beliefs to actions.
2. With probabilities, decision theory says that if P(Heads) = P(¬Heads) =
0.5 then (assuming that winning 10 and losing 10 are considered equal
magnitude opposites). The reasoner will be indifferent between the action
of accepting and declining the bet.
3. The Dempster-Shafer reasoner has Bel(¬ Heads) = 0 and thus no
reason to accept the bet, but then it also has Bel(Heads) = 0 and thus no
reason to decline it. Thus, it seems that the Dempster-Shafer reasoner
comes to the same conclusion about how to act in this case.
4. Unfortunately, Dempster-Shafer theory allows no definite decision in
many other cases where probabilistic inference does yield a specific
choice.
viii) One inter-pretation of Dempster-Shafer theory is that it defines a
probability interval, the interval for Heads is [0,1] before our expert
testimony and [0.45, 0.55] after. The width of the interval might help in
deciding when we need to acquire more evidence. It can tell you that the
expert's testimony will help you if you do not know whether the coin is
fair, but will not help you if already learned that the coin is fair. However,
there are no clear guidelines for how to do this, because there is no clear
meaning for what the width of an interval means.
ix) In the Bayesian approach, this kind of reasoning can be done easily by
examining how much one's belief would change if one were to acquire
more evidence.
For example: Knowing whether the coin is fair would have a significant
impact on the belief that it will come up heads, and detecting an
asymmetric weight would have an impact on the belief that the coin is
fair. A complete Bayesian model would include probability estimates for
factors such as these, allowing us to express our "ignorance" in terms of
how our beliefs would change in the face of future information gathering.

7.4.4 Vagueness
• Probability makes the same ontological commitment as logic:- That
events are true or false in the world, even if the agent is uncertain as to
which is the case. Researchers in fuzzy logic have proposed an ontology
that allows vagueness:- or That an event can be "Sort of" true. Vagueness
and uncertainty are in fact orthogonal issues.
• Representing vagueness: (Fuzzy sets and fuzzy logic):
Fuzzy sets:
In standard set theory, an object is either a member of a set or it is not.
There is no other choice. For example, 2, 4, 10, 16 are member of the set
of even numbers but 15, Red, sky are not members.
Similarly, blue, white, red and black are members of the set of colors but
match, house and vehicles are not members.
Traditional logics are based on the notions that P(a) is true as long as a is
a member of the set belonging to class P and false otherwise. There is no
partial containment. This amounts to the use of a characteristic function f
from a set A, where fA (X) = 1 if x is in A; otherwise it is 0.
Thus f is defined on the universe U and for all x € U, f: U→ {0, 1}.
This notion may be generalized by allowing this set to have a
characteristic function assuming values other than 0 and 1.
For example, the notion of a fuzzy set is defined with the characteristic
function u which maps from v to a number in the real interval [0, 1]; that
is u : v→ [0, 1].
The concept of fuzzy sets:
Fuzzy set theory is used for specifying how well an object satisfies a
vague description.
For example: Consider the proposition "Anil is tall". Is this true, if Anil is
5'10"? most people would hesitate to answer "true" or "false", preferring
to say, "sort of".
Note that this is not a question of uncertainty about the external world.
We are sure of Anil's height. The issue is that the linguistic term "tall"
does not refer to a sharp demarcation of objects into two classes and there
are degrees of tallness.
Due to this reason, fuzzy set theory is not a method for uncertain
reasoning at all. Rather, fuzzy set theory treats Tall as a fuzzy predicate
and says that the truth value of Tall(Anil) is a number between 0 and 1,
rather than being just true or false.
The name "fuzzy set" derives from the interpretation of the predicate as
implicitly defining a set of its members - a set that does not have sharp
boundaries.
Definition - Fuzzy set Ā:
Definition: Let U be a set, denumerable or not and let x be an element of
U. A fuzzy subset Ā of U is a set of ordered pairs {(x, u A (x))3}, for all x
in U1 where uA (x) is a membership characteristic function with values in
[0, 1] and which indicates the degree or level of membership of x in Ā.
Here, uA (x) = 0 indicates that x is not in Ā.
UA (x) = 1 signifies that x is completely contained in Ā.
Values of 0 < UA (x) < 1 signify that x is a partial member of A.
Fuzzy characteristic function relates to vagueness and is a measure of the
feasibility or ease of attainment of an event. Fuzzy sets have been related
to possibility distributions which have some similarities to probability
distributions, but their meanings are entirely different.
Now, consider an example, where fuzzy set is defined as,
Ā = { tall }
and assign values
uA (0) = uA (10) = ....= uA (40) = 0
uA (50)= 0.2
uA (60) = 0.4
uA (70)= 0.6
uA (80) = 0.9
uA (90) = uA (100) = 1.0.
and TALL (Joe)= 0.5
One should also assign values to other fuzzy sets associated with
linguistic variables such as very short, short, medium etc.
So, by this one can conclude that now there is a means of expressing the
notion of TALL(x) for an individual x.
Operations on fuzzy sets and properties of fuzzy sets:
Operations on fuzzy sets are somewhat similar to the operations of
standard set theory, as given below -
1. Equality – A B if and only if UA (x) = UB (x) for all x ϵ U
2. Containment - A B if and only if UA (x) ≤ UB (x) for all x ϵ U
3. Intersection - UA∩B(x) = min{UA (x), UB (x)}
4. Union - UAUB (X) = Maxx {UA (x), UB (x)}
5. Complement set - UA (x) = 1-UA (x)
• Here, single quotation mark denotes the complement fuzzy set, A'.
• Intersection of two fuzzy sets Ā and is the largest fuzzy subset that is
a subset of both.
• Similarly, the union of two fuzzy sets Ā and is the smallest fuzzy
subset having both Ā and as subsets.
• By applying above operations on fuzzy sets, the properties that can be
derived, which are applicable of fuzzy sets are as follows -

Since in general for uA (x) = a, with 0 < a < 1, we have


i) uAUA' (x) = max [a, 1 - a] ≠ 1
and ii) uA∩A'(x)=min [a, 1 a] ≠ 0
On the other hand the following relations do hold
i) Ā∩ϕ = ϕ
ii) Ā U ϕ = Ā
iii) Ā ∩ U = Ā
and iv) Ā U U = U
• The universe from which a fuzzy set is constructed may also be
uncountable. For example, one can define values of u for the fuzzy set
Ā = {young} as

The values of UA (x) are depicted graphically in Fig. 7.4.1

• Except above defined operations there are some operations that are
unique to fuzzy sets only. Some of these are as follows -
i) Dilation - The dilation of Ā is defined as
DIL (Ā) = [uA (x)]1/2 for all x in U.
ii) Concentration - The concentration of Ā is defined as
CON (Ā) = [u (x)]2 for all x in U.
iii) Normalization - The normalization of Ā is defined as
NORM (Ā) = UA (x)/Maxx {uA (x)}
for all x in U. These operations are shown in Fig. 7.4.2.
Dilation tends to increase the degree of membership of all partial
members x by spreading out the characteristic function curve.
The concentration is the opposite of dilation. As it tends to decrease the
degree of membership of all partial members and concentrates the
characteristic function curve.
While normalization provides a means of normalizing all characteristic
functions to the same base much the same as vectors can be normalized to
unit vectors.
Fuzzification - Fuzzification permits one to fuzzify any normal set.
• Fuzzy logic:
Traditional logics used for representing knowledge have many
limitations. So a new method, which extends the expressive power of the
traditional logics and permits different forms of non monotonic reasoning
is known as fuzzy logic.
Traditional logic admits interpretations which are either true or false only.
The use of two valued logics is considered as too limiting. They fail to
effectively represent value or fuzzy concepts, i.e., the motivation for
fuzzy sets is provided by the need to represent such propositions as –
- Radha is very tall.
- Rima is slightly ill.
- Ram and Shyam are close friends.
- Exceptions to the rules are nearly impossible.
- Most chinese are not very tall.
While traditional set theory defines set membership as a Boolean
predicate, fuzzy set theory allows us to represent set membership as a
possibility distribution. Consider an example of fuzzy logic.
Suppose, one want to make distinction between a tall person and a short
heighted person. No doubt one would be willing to agree that the
predicate "TALL" is true for pole, the seven foot basketball player and
false for smidge the midget. But what value you would assign for Ram,
who is 5 foot 10 inches? What about Raju who is 6 foot 2 or Joe who is 5
foot 5?
If one agrees 7 foot is tall, then 6 foot 2 or Raj who is 5 foot 5? If one
agree 7 foot is tall, then is 6 foot 11 inches also tall? What about 6 foot 10
inches?
If continued this process of incrementally decreasing the height through a
sequence of applications of modus ponens, one would eventually,
conclude that a three foot person is tall. Intuitively, one expects the
inferences should have failed at some point, but at what point? In FOPL
there is no direct way to represent this type of concept. So, fuzzy logic is
useful for solving such problem, as fuzzy set theory allows us to represent
set membership as a possibility distribution such as the ones shown in
Fig. 7.4.3 for the set of tall people and the set of very tall people shown in
Fig. 7.4.4.

In the latter, one is either tall or not and there must be a specific height
that defined the boundary. The same is true for very tall.
In the former ones tallness increased with one's height until the value of 1
is reached. Fuzzy logic is a method for reasoning with logical expressions
describing membership in fuzzy sets. For example: The complex sentence
Tall (Anil) Heavy (Anil) has Λ truth value that is a function of the truth
values of its components.
The standard rules for evaluating the fuzzy truth, T, of a complex
sentence are
T (A ^ B) = min (T(A), T(B))
T (A v B) = max (T(A), T(B))
T(¬A) = 1 - T(A)
Fuzzy logic is therefore a truth-functional system - a fact that causes
serious difficulties.
For example: Suppose that T (Tall(Anil)) = 0.6 and T (Heavy(Anil)) =
0.4. Then we have T (Tall(Anil)) ^ T (Heavy (Anil)) = 0.4. Which seems
reasonable but we also get the result T (Tall(Anil)) ^ ¬ (Tall(Anil)) 0.4
which does not. The problem arises from the inability of a truth-
functional approach to take into account the correlations or anti-
correlations among the component propositions.
Benefits of fuzzy logic over classical probability theory:
According to Zadeh classical probability theory lack in following ways -
1) Classical probability theory is insufficient to deal with uncertainty.
2) Classical probability theory has no facilities for representing the
meaning of events containing -
i) Fuzzy predicates such as small, large, young, safe, much longer than,
soon.
ii) Fuzzy quantifiers such as most, means, few, several, often, usually.
iii) Fuzzy probabilities expressed as quite possible, almost impossible etc.
iv) Fuzzy truth values such as very quite, extremely, somewhat, slightly.
Lacking these facilities, one finds it is difficult to deal with statements
like –
i) More the cholesterol, more is the chance for heart attack.
ii) Almost all people prefer a strong union government.
iii) It is very likely that Radha is young.
iv) Ram is much taller than most of the friends.
Now, to overcome above problems one can use fuzzy logic. Consider an
example, where fuzzy is used -
Suppose, you have been asked by your friends to arrange a small party.
Now here the question is "what does 'small' mean?". Answer to this
question is different for persons of different societies. As if you are
affluent, small has one meaning and if you belong to middle class family,
the word small has a different meaning. Hence, one can say that sets for
whom the boundary is ill-defined are called fuzzy sets.
Now, the question arises that how can one represent a fuzzy set if the
boundary is not clear. This can be done by grade membership diagram.
All the member of a fuzzy set have a membership value between 0
{complete nonmembership) and 1 (complete membership).
For this consider a fuzzy set "small numbers" like "small party". The
number that has the minimal value (i.e. zero) is the smallest number and
hence the membership value is 1. A number increases, the value of
"small" decreases and at a point of the time the value reaches zero.
This is shown in Fig. 7.4.5.

From this curve, one can find the membership value or in other words the
element is a member of the set to some degree.
Reasoning using fuzzy logic:
In fuzzy logic, the degree of membership of x is Ā, where Ā defines some
propositional or predicate class. When U A (x) = 1, the proposition A is
completely true and when UA (x) 0 it is completely false. Values between
0 and 1 assumes corresponding values of truth or falsehood.
Truth value of a statement can be found by using truth tables. But in case
of fuzzy logic it is not possible as there may be infinite number of truth
values and one could tabulate only a limited number of truth values, as
those corresponding to the terms false, not very false, not true, very true
and so on.
So, here one has an inference rule equivalent to a fuzzy modus ponens.
Modus ponens for fuzzy sets are different from standard modus ponens in
that statements which are characterized by fuzzy sets are permitted and
the conclusion need not be identical to the implicand in the implication.
For example, let Ā, Ā1, and 1 be statements characterized by fuzzy
sets. Then one form of the generalized modus ponens reads
Premise: x is Ā1
Implication:If x is Ā then y is
Conclusion: y is 1
An example of this form of modus ponens is given as -
Premise: This man is very brilliant.
Implication: If a man is brilliant then the man is intelligent.
Conclusion: This man is very intelligent.
According to Zadeh's a relation can be defined as -
Relation - For two sets A and B, the cartesian product A × B is the set of
all, ordered pairs (a, b) for a A and b B.
A binary relation of two sets A and B is a subset of A × B.
Binary fuzzy relation - Binary fuzzy relation R is a subset of fuzzy
cartesian product Ā × , a mapping of Ā → characterized by the two
parameter membership function UR (a, b).
For example, let Ā → = R the set of real numbers and let R: = much
larger than. A membership function for this relation might then be
defined as –

Now, let X and Y be two universes and let Ā and be fuzzy sets in X
and X × Y respectively.
Define fuzzy relations ṜA (x), ṜB (x, y)and ṜC(y) in X, X × Y and Y
respectively.
Then the compositional rule of inference is the solution of the relation
equation. Ṝc(y) = ṜA (x) O ṜB (x, y) = maxx min{uA (X), uB (x, y)}
where the symbol O signifies the composition of Ā and as an example.
Let,
x = y = {1, 2, 3, 4}
Ā = {little} = {(1/1), (2/0.6), (3/0.2), (4.0/0)}
Ṝ = Approximately equal, a fuzzy relation defined by

Then applying the max-min composite rule


ṜC (y) = maxx min{uA (x), UR (x, y)}
= maxx {min[(1, 1), (0.6, 0.5), (0.2, 0), (0, 0)]},
= min[(1, 0.5), (0.6, 1), (0.2, 0.5), (0, 0)]
=min[(1, 0), (0.6, 0.5), (0.2, 1), (0, 0.5)]
= min[(1, 0), (0.6, 0), (0.2, 0.5), (0, 1)],
= maxx {[1, 0.5, 0, 0], [0.5, 0.6, 0.2, 0], [0, 0.5, 0.2, 0], [0, 0, 0.2, 0]}
= {[1], [0.6], [0.5], [0.2] }
Therefore the relation is
ṜC (y) = ((1/1), (2/0.6), (3/0.5), (4/0.2)}
Stated in terms of a fuzzy modus ponens, one might interpret this as the
inference.
- Premise: x is little
- Implication: x and y are approximately equal
- Conclusion: y is more or less little
The above notion can be generalized to any number of universes by
taking the cartesian product and defining relations on various subsets.
Applications of fuzzy logic:
One of the recent applications where fuzzy logic has been extensively
used is development of fuzzy logic controlled commercial air-conditioner
by Mitsubishi Heavy Industries of Japan.
Conventional air-conditioners have an annoying and discomforting
characteristic of turning ON and OFF when the temperature exceeds or
falls below a fixed temperature. Fuzzy based air-conditioners determines
the thermal characteristic of the room and temperature change required
and adjusts the air flow to minimize heating and cooling times and
maintain a stable room temperature.
An infra-red sensor in the equipment determines if anyone is there in the
room. If not, the system gradually reduces the air flow and temperature,
thereby minimizing power consumption.
The air-conditioner equipment is only the tip of the iceberg where fuzzy
logic is used. In future, many such equipment are expected.
Fuzzy logic can be represented in terms of probability theory. One idea is
to view assertions such as "Anil is Tall" as discrete observations, made
concerning a continuous hidden variable, Anil's actual height. The
probability model specifies P(observer says Anil is tall/Height), using a
probit distribution. A posterior distribution over Anil's height can then be
calculated in the usual way, for example if the model is part of a hybrid
Bayesian Network.
• Fuzzy control:
1. Fuzzy control is methodology for constructing control systems in
which the mapping between real-valued input and output parameters is
represented by fuzzy rules.
2. Fuzzy control has been very successful in commercial products such as
automatic transmissions, video cameras, and electric shavers.
3. These applications enjoy success may be because they have small rule
bases, no chaining of inferences and tunable parameters that can be
adjusted to improve the system's performance. The fact that they are
implemented with fuzzy operators might be incidental to their success;
the key is simply to provide a concise and intuitive way to specify a
smoothly interpolated, real-world function.
Fuzzy predicates can also be given a probabilistic interpretation in terms
of random sets - that is, random variables whose possible values are sets
of objects.
For example: Tall is random set whose possible values are sets of people.
The probability P(Tall = S 1), where S1 is some particular set of people, is
the probability that exactly the set would be identified as "tall" by an
observer. Then the probability that "Anil is tall" is the sum of the
probabilities of all the sets of which Anil is a member.
Both the hybrid Bayesian network approach and the random sets
approach appear to capture aspects of fuzziness without introducing
degrees of truth. But they can not handle proper representation of
linguistic observations and continuous quantities that have been neglected
by most outside the fuzzy community.

Forward and Backward Reasoning


• In forward reasoning, reasoning proceeds forward, beginning with
factor, chaining through rules and finally establishing the goal.
• When the left side of a sequence of rules is instantiated first and the
rules are executed from left to right the process is called forward
chaining/reasoning. This is also known as data-driven search, since, input
data are used to guide the direction of the inference process. For example,
one can chain forward to show that when a student is encouraged, is
healthy, and has goals, the student will succeed.
ENCOURAGED (student) → MOTIVATED (students)
MOTIVATED (student) and HEALTHY (student) → WORKHARD
(student)
WORKHARD (student) and HASGOALS (student) → EXCELL
(student)
EXCELL (student) → SUCCEED (student)
• On the other hand, when the right side of the rule is instantiated first, the
left-hand conditions become subgoals. These subgoals may in turn cause
sub-subgoals to be established, and so on until facts are found to match
the lowest subgoal conditions. When this form of inference takes place, it
is said that backward chaining is performed. This form of inference is
also known as goal-driven inference since an initial goal establishes the
backward direction of the inferring.
For example, in MYCIN the initial goal in a consultation is "Does the
patient have a certain disease?" This causes subgoals to be established
such as "are certain bacteria present in the patient?" Determining if
certain bacteria are present may require such things as tests on cultures
taken from the patient. This process of setting up subgoals to confirm a
goal continues until all the subgoals are eventually satisfied or fail. If
satisfied, the backward chain is established thereby confirming the main
goal.
• Some systems use both forward and backward chaining/reasoning,
depending on the type of problem and the information available.
Likewise rules may be tested or exhaustively or selectively, depending on
the control structure.
Solved Example
Example 7.5.1 Consider an incandescent bulb manufacturing unit. Here
machines M1, M2 and M3 make 30 %, 30 % and 40% of the total bulbs of
their, output, let's assume that 2 %, 3 % and 4 % are defective. A bulb is
drawn at random and is found defective. What is the probability that the
bulb is made by machine M1 or M2 or M3.
Solution:
• Let E1, E2 and E3 be the events that a bulb selected at random is made
by machine M1, M2 and M3.
• Let Q denote that it is defective.
Prob (E1) = 0.3
Prob (E2) = 0.3 and Prob (E3) = 0.4 (given data),
These represent the prior probabilities.
• Probability of drawing a defective bulb made by M 1 = Prob (Q/E1) =
0.02
• Probability of drawing a defective bulb made by M 2 = Prob (Q/E2) =
0.03
• Probability of drawing a defective bulb made by M 3 = Prob (Q/E3) =
0.04
•These values are the posterior probabilities
Therefore,
Prob (E1/Q) = Prob (E1/) * Prob (Q/E1)/ Σ3i=1 Prob (Ei) * Prob (Q/Ei)
= 0.3* 0.02/ (0.03* 0.2) + (0.03* 0.3) + (0.04 * 0.4)
= 0.1935
Similarly,
Prob (E2/Q) = 0.3* 0.03/ (0.03* 0.2) + (0.03* 0.3) + (0.04* 0.4)
= 0.2903
Prob (E3/Q) = (1-(Prob(E1/Q) + Prob(E2/Q)))
= (1-((0.1935) (0.2903)))
= 0.5162
Causal Networks
A directed network which illustrates the causal dependencies of all the
components in the network.
A causal relationship exists when one variable in a data set has a direct
influence on another variable. Thus, one event triggers the occurrence of
another event. A causal relationship is also referred to as cause and effect.
The ability to identify truly causal relationships is fundamental to
developing impactful interventions in medicine, policy, business, and
other domains.
Often, in the absence of randomised control trials, there is a need for
causal inference purely from observational data. However, in this case the
commonly known fact that - 'correlation does not imply causation'
distinguish between events that cause specific outcomes and those that
merely correlate. One possible explanation for correlation between
variables where neither causes the other is the presence of confounding
variables that influence both the target and a driver of that target.
Unobserved confounding variables are severe threats when doing causal
inference on observational data.
A causal generalization, e.g., that smoking causes lung cancer, is not
about an particular smoker but states a special relationship exists between
the property of smoking and the property of getting lung cancer. As a
causal statement, this says more than that there is a correlation between
the two properties.
Some causal conditions are necessary conditions: The presence of oxygen
is a necessary condition for combustion; in the absence of oxygen there is
no combustion. "Cause" is often used in this sense when the elimination
of the cause is sought to eliminate the effect (what's causing the pain?)
Some causal conditions are sufficient conditions: The presence of a
sufficient condition the effect must occur (being in temperature range R
in the presence of oxygen is sufficient for combustion of many
substances. "Cause" is often used in this sense when one seeks to produce
the effect (What causes this metal to be so strong?)
Looking for special circumstances: what was the cause of the fire?
Oxygen? or an arsonist's match?
Causes are sometimes said to be INUS conditions in that they are
Insufficient but Necessary parts of an unnecessary but sufficient set of
conditions for the effect. Striking a match may be said to be a cause of its
lighting. Suppose there is some set of conditions that is sufficient for a
match's lighting. This might include the presence of oxygen, the
appropriate chemicals in the matchhead and the striking. The striking can
be said to be a necessary part of this set (though insufficient by itself)
because without the striking among those other conditions the match
would not have lit. But the set itself, though sufficient, is not necessary
because other sets of conditions could have produced the lighting of the
match.
Statisticians are careful to distinguish between two different
interpretations of relationship - correlation and causal. Every successful
prediction model Y ~ X is ademonstration that there is a correlation
between the response Y and the explanatory variable X.1313
"Successful" means that the prediction performance of the model is better
than the performance of a no-input model. But the performance of the
model does not itself tell us that X causes Y in the real world. There are
other possible configurations that will produce a correlation between X
and Y. For instance, both X and Y may themselves have a common cause
C without X being otherwise related to Y. In such a circumstance, a real-
world intervention to change X will have no effect on Y. To put this in
the form of a story, consider that the start of the school year and leaves
changing color are correlated. But an intervention to start the school year
in mid-winter will not result in leaves changing color. There's a common
cause for the school year and colorful folliage that produces the
relationship: the end of summer.
Structural Causal Models (SCMs)
Structural causal models represent causal dependencies using graphical
models that provide an intuitive visualisation by representing variables as
nodes and relationships between variables as edges in a graph.
SCMS serve as a comprehensive framework unifying graphical models,
structural equations, and counterfactual and interventional logic.
Graphical models serve as a language for structuring and visualising
knowledge about the world and can incorporate both data-driven and
human inputs.
Counterfactuals enable the articulation of something there is a desire to
know, and structural equations serve to tie the two together.
SCMs had a transformative impact on multiple data-intensive disciplines
(e.g. epidemiology, economics, etc.), enabling the codification of the
existing knowledge in diagrammatic and algebraic forms and
consequently leveraging data to estimate the answers to interventional
and counterfacutal questions.
Bayesian Networks are one of the most widely used SCMs and are at the
core of this library.

Bayesian Networks (BNs)


1. Directed Acyclic Graph (DAG)
A graph is a collection of nodes and edges, where the nodes are some
objects, and edges between them represent some connection between
these objects. A directed graph, is a graph in which each edge is
orientated from one node to another node. In a directed graph, an edge
goes from a parent node to a child node. A path in a directed graph is a
sequence of edges such that the ending node of each edge is the starting
node of the next edge in the sequence. A cycle is a path in which the
starting node of its firstedge equals the ending node of its last edge. A
directed acyclic graph is a directed graph that has no cycles.
Bayesian Networks
Bayesian networks are probabilistic graphical models that represent the
dependency structure of a set of variables and their joint distribution
efficiently in a factorised way.
Bayesian network consists of a DAG, a causal graph where nodes
represents random variables and edges represent the the relationship
between them, and a conditional probability distribution (CPDs)
associated with each of the random variables.
If a random variable has parents in the BN then the CPD represents\(P(\
text{variablel parents}) \) i.e. the probability of that variable given its
parents. In the case, when the random variable has no parents it simply
represents \(P(\text{variable}) \) i.e. the probability of that variable.
Even though if one is interested in the joint distribution of the variables in
the graph, Bayes' rule requires to only specify the conditional
distributions of each variable given its parents.
The links between variables in Bayesian Networks encode dependency
but not necessarily causality. Here, the interest is in the case where
Bayesian Networks are causal. Hence, the edge between nodes should be
seen as a cause -> effect relationship.
Since BNs themselves are not inherently causal models, the structure
learning algorithms on their own merely learn that there are dependencies
between variables. A useful approach to the problem is to first group the
features into themes and constrain the search space to inspect how themes
of variables relate. If there is further domain knowledge available, it can
be used as additional constraints before learning a graph algorithmically.
Steps for working with a Bayesian Network
BN models are built in a multi-step process before they can be used for
analysis.
1. Structure learning: The structure of a network describing the
relationships between variables can be learned from data, or built from
expert knowledge.
2. Structure review: Each relationship should be validated, so that it can
be asserted to be causal. This may involve flipping / removing / adding
learned edges, or confirming expert knowledge from trusted literature or
empirical beliefs.
3. Likelihood estimation: The conditional probability distribution of
each variable given its parents can be learned from data. ibo sanit (ol
4. Prediction and inference: The given structure and likelihoods can be
used to make predictions, or perform observational and counterfactual
inference.
Review Questions
1. Derive Baye's theorem of probability. Explain with suitable example its
use in expert system. (Refer section 7.1)
2. What do you mean by probabilistic reasoning? Give an exmaple.
OR
Explain the probabilistic reasoning. (Refer section 7.1)
3. State and prove Baye's theorem. (Refer section 7.1)
OR
Write a short note on Baye's theorem.
4. Explain the term theorem providing and inferencing with
examples. (Refer section 7.3)
5. Explain fuzzy logic with examples.
OR
Write short note on fuzzy logic. (Refer section 7.4)
6. Draw fuzzy curve for tall, short, very tall. (Refer section 7.4)
7. Write short note on : Fuzzy logic. (Refer section 7.4)
8. Explain how Fuzzy logic is beneficial over classical probability
theory? (Refer section 7.4)
9. Explain the forward and backward reasoning.
OR
Explain forward and backward reasoning with examples.
OR
Explain reasoning with example.
OR
Differentiate between forward and backward reasoning. (Refer section
7.3)
Two Marks Questions with Answers
Q.1 Define Dempster-Shafer theory. AU: May-11
Ans.: The Dempster-Shafer theory is designed to deal to deal with the
distinction between uncertainty and ignorance.
Rather than computing the probability of a proposition, it computes the
probability the evidence that supports the proposition.
Q.2 Define Baye's theorem. AU: May-11,13,17, Dec.-12
Ans.: In probability theory and applications, Baye's theorem
(alternatively called as Baye's law or Bayes rule) links a conditional
probability to its inverse.
P(b | a) = P(a | b) P(b)/P(a)
This equation is called as Baye's Rule or Baye's Theorem.
Q.3 What is reasoning by default? AU : May-11
Ans.: We can do qualitative reasoning using technique like default
reasoning. Default reasoning treats conclusions not as "believed to a
certain degree", but as "believed until a better reason is found to believe
something else".
Q.4 What are the logics used in reasoning with uncertain
information? AU: Dec.-11
Ans.: There are two approaches that can be taken for reasoning with
uncertain information in which logic is used.
Non-monotonic logic is used in default reasoning process. Default
reasoning also uses other type or logic called as default logic.
The second approach towards reasoning is vagueness which uses fuzzy
logic. Fuzzy logic is a method for reasoning with logical expressions
describing membership in fuzzy sets.
Q.5 Define prior probability. AU: May-12
Ans. The prior (unconditional) probability is associated with a
proposition 'a'. The prior probability is the degree of belief accorded to a
proposition in the absence of any other information.
It is written as P(a). For example, the probability that, Ram has cavity =
0.1, then the prior priobility is written as,
P(Cavity = true) = 1 or P(cavity) = 0.1
Q.6 State the types of approximation methods. AU: May-12
Ans.: For approximate inferencing randomize sampling algorithm (Monte
Carlo Algorithm) is used. There are two approximation methods that are
used in randomize sampling algorithm which are 1) Direct sampling
algorithm and 2) Markov chain sampling algorithm.
In direct sampling algorithm samples are generated from known
probability distribution. In Markov chain sampling each event is
generated by making a random change to the preceding event.
Q.7 What do you mean by hybrid Bayesian network? AU: Dec.-12
Ans.: A network with both discrete and continuous variables is called as
hybrid Bayesian network. In hybrid Bayesian network, for representing
the continuous variable its discretization is done in terms of intervals
because it can have infinite values.
For specifying the hybrid network two kinds of distribution are specified.
The conditional distribution for a continuous variable given discrete or
continuous parents and the conditional distribution for a discrete variable
given continuous parent.
Q.8 Define computational learning theory. AU: Dec.-12
Ans.: The computational learning theory is a mathematical field related
to the analysis of machine learning algorithms.
The computational learning theory is used in the evaluation of sample
complexity and computational complexity. Sample complexity targets the
issue that, how many training examples are needed to learn a successful
hypothesis? The computational complexity evaluates that how much
computational effort is needed to learn a successful hypothesis?
In addition to performance bounds, computational learning theory also
deals with the time complexity and feasibility of learning.
Q.9 Give the full specification of Bayesian network. AU: May-13
Ans.:Bayesian network: Definition: It is a data structure which is a graph,
in which each node is annotated with quantitative probability information.
The nodes and edges in the graph are specified as follows:-
1) A set of random variables makes up the nodes of the network.
Variables may be discrete or continuous.
2) A set of directed links or arrows connects pairs of nodes. If there is an
arrow from node X to node Y, then X is said to be a parent of Y.
3) Each node Xi, has a conditional probability distribution P(X i |
Parents(Xi)) that quantifies the effect of the parents on the node.
4) The graph has no directed cycles (and hence is a directed, acyclic
graph, or DAG).
The set of nodes and links is called as topology of the network.
Q.10 Define uncertainty. AU: May-14
Ans.: A agent working in real world environment almost never has access
to whole truth about its environment. Therefore, agent needs to work
under uncertainity.
Earlier agents we have seen make the epistemological commitment that
either the facts (expressed as prepositions) are true, false or else they are
unknown. When an agent knows enough facts about its environment, the
logical approach enables it to derive plans, which are guaranteed to work.
But when agent works with uncertain knowledge then it might be
impossible to construct a complete and correct description of how its
actions will work. If a logical agent can not conclude that any perticular
course of action achieves its goal, then it will be unable to act.
Q.11 State Baye's rule. (Refer Q.2) AU: Dec.-14
Ans.: In probability theory and applications, Baye's theorem
(alternatively called as Baye's law or Bayes rule) links a conditional
probability to its inverse.
P(b | a) = P(a | b) P(b)/P(a)
This equation is called as Baye's Rule or Baye's Theorem.
Q.12 Define uncertainty. How it is solved? (Refer Q.10) AU: May-15
Ans.: A agent working in real world environment almost never has access
to whole truth about its environment. Therefore, agent needs to work
under uncertainity.
Earlier agents we have seen make the epistemological commitment that
either the facts (expressed as prepositions) are true, false or else they are
unknown. When an agent knows enough facts about its environment, the
logical approach enables it to derive plans, which are guaranteed to work.
But when agent works with uncertain knowledge then it might be
impossible to construct a complete and correct description of how its
actions will work. If a logical agent can not conclude that any perticular
course of action achieves its goal, then it will be unable to act.

University Questions with Answers


Dec. 2013
Q.1 How to handle uncertain knowledge with example? (Refer section
7.1) [8]
Q.2 How to represent knowledge in an uncertain domain? (Refer section
7.1) [8]
Q.3 Explain the need of fuzzy set and fuzzy logic with example. (Refer
section 7.4) [8]
May 2014
Q.4 Define uncertain knowledge, prior probability and conditional
probability. State the Baye's theorem. How is it useful for decision
making under uncertainty? Explain belief networks briefly. (Refer section
7.1)[6]
Q.5 What is a Bayesian network? How is the Bayesian network used in
representing the uncertainty about knowledge? Explain the method of
performing exact inference in won Bayesian networks. (Refer section 7.1)
[10]
Q.6 Consider the following facts: (Refer section 7.2)
i) I saw my cat in the living room 3 hours ago.
ii) 2 hours ago my door blew open.
iii) Three quarters of the time my door blows open, my cat runs outside
the door.
iv) One hour ago I thought I heard a cat noise in my living. Assume I was
half certain.
v) In one hour period the probability that the cat will leave the room is
0.2. There is also a 0.2 probability that he may enter the room. What ts
the certainty that the cat is in my living room? Use Bayesian networks to
answer this.
[8]
Dec. 2014
Q.7 Explain variable elimination algorithm for answering queries on
Bayesian networks. (Refer section 7.2) [8]
Q.8 Discuss forward - backward algorithm in detail. (Refer section 7.5)
[8]
May 2015
Q.9 Explain about the exact inference in Bayesian networks. (Refer
section 7.2) [16]
Q.10 List down the applications of Bayesian network. (Refer section 7.1)
[6]
Dec. 2016
Q.11 Explain the method of performing exact inference in Bayesian
Networks. (Refer section 7.2)[16]
May 2017
Q.12 Explain about Dempster shafer theory. (Refer section 7.4)[16]
Dec. 2017
Q.13 Discuss about Bayesian theory and Bayesian network. (Refer
section 7.1)
Q.14 Describe in details about Dampster-Shafer theory. (Refer section
7.4)
May 2018
Q.15 Discuss the need and structure of Bayesian networks. (Refer section
7.2) [13]

You might also like