+
Bayesian Learning
Dr. Megha Ummat
+
Elements of Probability
◼ A random experiment is one whose outcome is not
predictable with certainty in advance.
◼ The set of all possible outcomes is known as the
sample space S.
◼ A sample space is discrete if it consists of a finite
(or countably infinite) set of outcomes; otherwise
it is continuous.
+
Elements of Probability
◼ Any subset E of S is an event.
◼ Events are sets, and we can talk about their
complement, intersection, union, and so forth.
+
Probability
◼ Probability can be interpreted as a frequency.
◼ When an experiment is continually repeated under
the exact same conditions, for any event E, the
proportion of time that the outcome is in E
approaches some constant value.
◼ This constant limiting frequency is the probability
of the event, and we denote it as P(E).
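The frequency interpretation can be illustrated with a small simulation. This is only a sketch: the fair six-sided die, the event E = {2, 4, 6}, and the function name `relative_frequency` are assumptions chosen for illustration and do not come from the slides.

```python
import random


def relative_frequency(num_trials, seed=0):
    """Fraction of simulated fair-die rolls whose outcome lies in E = {2, 4, 6}."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) % 2 == 0)
    return hits / num_trials


# As the number of repetitions grows, the proportion settles near P(E) = 0.5.
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

With few trials the proportion can be far from 0.5; with many trials it stays close to it, which is the limiting behaviour described above.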
+
Bayesian Learning
◼ Bayesian learning is a probabilistic
approach to inference.
◼ Given a set of training data T, we are
interested in the best hypothesis h
from the space H.
◼ The best hypothesis can be
considered equivalent to the most
probable hypothesis.
+
Bayesian Learning
◼ For training data T, if we have any
previous knowledge about the
probabilities of the various hypotheses in
H, we can estimate the probability of
the best hypothesis with the aid of Bayes
theorem.
+
Bayes Theorem
◼ P(h): prior probability of h; denotes any previous
knowledge about the chance that h is correct
◼ P(T): prior probability of T; the probability that data
such as T will be observed
◼ P(T|h): probability of observing T, given that h
holds
◼ P(h|T): posterior probability of h; reflects confidence
that h holds after T has been observed
$$P(h \mid T) = \frac{P(T \mid h)\,P(h)}{P(T)}$$
+
Bayes Theorem Example 1
▪ Given:
◼ A doctor knows that meningitis causes stiff neck
50% of the time
◼ Prior probability of any patient having meningitis
is 1/50,000
◼ Prior probability of any patient having stiff neck
is 1/20
◼ If a patient has stiff neck, what’s the probability
he/she has meningitis?
$$P(M \mid S) = \frac{P(S \mid M)\,P(M)}{P(S)} = \frac{0.5 \times 1/50000}{1/20} = 0.0002$$
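The same arithmetic can be checked in a few lines of Python. This is a minimal sketch; the helper name `posterior` is an assumption introduced here, and the numbers are the ones given in the example.

```python
def posterior(likelihood, prior, evidence):
    """Bayes theorem: P(h|T) = P(T|h) * P(h) / P(T)."""
    return likelihood * prior / evidence


# P(S|M) = 0.5, P(M) = 1/50,000, P(S) = 1/20
p_m_given_s = posterior(likelihood=0.5, prior=1 / 50_000, evidence=1 / 20)
print(round(p_m_given_s, 6))  # 0.0002
```

Even though meningitis causes a stiff neck half of the time, the very small prior keeps the posterior small.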
+
Bayes Theorem Example 2
◼ In Denmark, 51% of the adults are males.
One adult is randomly selected for a
survey involving credit card usage.
◼ The selected survey subject was found
smoking a cigar. It is known that 9.5% of
males smoke cigars, whereas 1.7% of
females smoke cigars.
◼ Find the probability that the selected
subject is a male.
+
Bayes Theorem Example 2
◼ Solution
M → Male M’ → Not a Male
C → Cigar Smoker C’ → Not a Cigar Smoker
Based on the given information
P(M) = 0.51 P(M’) = 0.49
P(C|M)=0.095 P(C|M’) = 0.017
$$P(M \mid C) = \frac{P(C \mid M)\,P(M)}{P(C)}$$
+
Bayes Theorem Example 2
P(C) = P(M) * P(C|M) + P(M’) * P(C|M’)
Given
P(M) = 0.51 P(M’) = 0.49
P(C|M)=0.095 P(C|M’) = 0.017
$$P(M \mid C) = \frac{P(C \mid M)\,P(M)}{P(C \mid M)\,P(M) + P(C \mid M')\,P(M')} = \frac{0.095 \times 0.51}{0.095 \times 0.51 + 0.017 \times 0.49} = 0.853$$
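When P(C) is not given directly, it can be obtained from the law of total probability before applying Bayes theorem. The sketch below follows that two-step calculation; the function name `posterior_two_hypotheses` is an assumption introduced for illustration.

```python
def posterior_two_hypotheses(prior_h, lik_h, lik_not_h):
    """P(h|e) for a hypothesis h and its complement h'.

    The evidence P(e) is expanded as P(h) P(e|h) + P(h') P(e|h').
    """
    evidence = prior_h * lik_h + (1 - prior_h) * lik_not_h
    return prior_h * lik_h / evidence


# P(M) = 0.51, P(C|M) = 0.095, P(C|M') = 0.017
p_male_given_cigar = posterior_two_hypotheses(0.51, 0.095, 0.017)
print(round(p_male_given_cigar, 3))  # 0.853
```

Observing the cigar raises the probability that the subject is male from the prior 0.51 to about 0.853.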
+
Using Bayes Theorem for
Classification
◼ X → Set of Attributes, Y → Class Variable
◼ Given a record with attributes (X1, X2,…, Xd)
◼ Goal is to predict class Y
◼ The relationship can be captured probabilistically by
using P(Y|X)
◼ Specifically, we want to find the value of Y that
maximizes P(Y| X1, X2,…, Xd )
◼ This conditional probability is also known as posterior
probability for Y, as opposed to prior probability P(Y)
◼ Can we estimate P(Y| X1, X2,…, Xd ) directly from
data?
+
Using Bayes Theorem for
Classification
◼ During the training phase, we need to learn the
posterior probabilities P(Y|X) for every
combination of X and Y based on information
gathered from the training data.
◼ By knowing these probabilities, a test record X’ can
be classified by finding the class Y’ that maximizes
the posterior probability P(Y’|X’).
◼ Given a test record X and a binary class variable
with values “A” and “B”, we need to compute the
posterior probabilities P(A|X) and P(B|X) based
on the information available in the training data.
+
Using Bayes Theorem for
Classification
◼ If P(A|X) > P(B|X), then the record
is classified as “A” otherwise “B”.
◼ Estimating the posterior
probabilities accurately for every
possible combination of class label
and attribute value is a difficult
problem because it requires a very
large training set, even for a
moderate number of attributes.
+
Using Bayes Theorem for
Classification
◼ Bayes theorem is useful because it allows us to
express the posterior probability in terms of the prior
probability P(Y), the class-conditional probability
P(X|Y), and the evidence P(X).
$$P(Y \mid X) = \frac{P(X \mid Y)\,P(Y)}{P(X)}$$
+
Bayesian Classifiers
In many applications, the relationship between the
class variable and the attribute set is
non-deterministic.
This situation may arise due to the presence of noisy
data or of certain confounding factors that affect
classification but are not included in the analysis.
E.g.: Predicting heart disease on the basis of a
person’s diet and workout frequency. Confounding
factors such as heredity, excessive smoking, or
alcohol abuse may also cause heart disease, leading
to a non-deterministic relationship.
+
Conditional Independence
X is said to be conditionally independent of Y given
Z if P(X|Y,Z) = P(X|Z)
Example: Arm length and reading skills
– Young child has shorter arm length and limited
reading skills, compared to adults
– If age is fixed, no apparent relationship between
arm length and reading skills
– Arm length and reading skills are conditionally
independent given age
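The definition can be checked numerically on a small joint distribution. The sketch below is illustrative only: all probabilities are made-up numbers (not from the slides), with Z standing for age group, X for "long arms", and Y for "reads well". The joint is built so that X and Y are independent once Z is fixed.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
p_z = {"child": 0.3, "adult": 0.7}            # P(Z)
p_x_given_z = {"child": 0.1, "adult": 0.8}    # P(X=1 | Z)
p_y_given_z = {"child": 0.2, "adult": 0.9}    # P(Y=1 | Z)


def bernoulli(p, value):
    """Probability that a 0/1 variable with P(1) = p takes the given value."""
    return p if value == 1 else 1 - p


# Joint P(X, Y, Z) = P(Z) P(X|Z) P(Y|Z)
joint = {
    (x, y, z): p_z[z] * bernoulli(p_x_given_z[z], x) * bernoulli(p_y_given_z[z], y)
    for x, y, z in product((0, 1), (0, 1), p_z)
}


def prob(event):
    """Total probability of all outcomes (x, y, z) satisfying the predicate."""
    return sum(p for (x, y, z), p in joint.items() if event(x, y, z))


def cond(event, given):
    """Conditional probability P(event | given)."""
    return prob(lambda x, y, z: event(x, y, z) and given(x, y, z)) / prob(given)


# With age fixed, knowing Y does not change the probability of X.
p_x_given_yz = cond(lambda x, y, z: x == 1, lambda x, y, z: y == 1 and z == "child")
p_x_given_z_only = cond(lambda x, y, z: x == 1, lambda x, y, z: z == "child")

# Without fixing age, X and Y look related.
p_x_given_y = cond(lambda x, y, z: x == 1, lambda x, y, z: y == 1)
p_x = prob(lambda x, y, z: x == 1)

print(p_x_given_yz, p_x_given_z_only)  # equal up to rounding: P(X|Y,Z) = P(X|Z)
print(p_x_given_y, p_x)                # different: P(X|Y) != P(X)
```

The first pair of values agrees, matching P(X|Y,Z) = P(X|Z), while the second pair differs, which mirrors the apparent link between arm length and reading skill when age is ignored.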
+
Using Bayes Theorem for
Classification
◼ Compute posterior probability P(Y | X1, X2, …, Xd) using
the Bayes theorem.
$$P(Y \mid X_1, X_2, \ldots, X_d) = \frac{P(X_1, X_2, \ldots, X_d \mid Y)\,P(Y)}{P(X_1, X_2, \ldots, X_d)}$$
◼ For a given record, P(X1, X2,…, Xd) is the same for every class Y, so it
can be ignored when comparing classes
◼ Maximize the posterior probability: Choose Y that
maximizes P(Y | X1, X2, …, Xd)
◼ Equivalent to choosing value of Y that maximizes P(X1, X2,
…, Xd|Y) P(Y)
◼ How to estimate P(X1, X2, …, Xd | Y )?
+
Naïve Bayes Classifier
◼ A naïve Bayes classifier estimates the
class-conditional probability by assuming that the
attributes are conditionally independent, given the
class label Y.
◼ Assume independence among attributes Xi when
class is given:
◼ P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)
◼ Now we can estimate P(Xi| Yj) for all Xi and Yj
combinations from the training data
◼ A new point is classified to Yj if
P(Yj) * P(X1| Yj) P(X2| Yj)… P(Xd| Yj) is maximal.
+
Naïve Bayes Classifier
◼ With a conditional independence assumption, instead of
computing the class-conditional probability for every
combination of X, we only have to estimate the
conditional probability of each Xi, given Y.
◼ This approach is more practical because it does not
require a very large training set to obtain a good
estimate of the probability.
◼ To classify a test record, the Naïve Bayes classifier
computes the posterior probability for each class Y:
$$P(Y \mid X) = \frac{P(Y) \prod_{i=1}^{d} P(X_i \mid Y)}{P(X)}$$
+
Example
◼ Given a test record X = (Ease of Use: Easy; Quality:
Bad)
◼ Using Bayes Theorem
$$P(\text{Yes} \mid X) = \frac{P(X \mid \text{Yes})\,P(\text{Yes})}{P(X)}, \qquad P(\text{No} \mid X) = \frac{P(X \mid \text{No})\,P(\text{No})}{P(X)}$$
Product  Ease of Use  Quality  Satisfied?
P1       Easy         Good     Yes
P2       Moderate     Good     No
P3       Difficult    Bad      No
P4       Difficult    Good     No
P5       Moderate     Bad      Yes
▪ How to estimate P(Yes|X) and P(No|X)?
+
Naïve Bayes from Example Data
◼ Given a test record X = (Ease of Use: Easy; Quality: Bad)
P(X|Yes) = P(Ease of Use=Easy|Yes) * P(Quality=Bad|Yes)
P(X|No) = P(Ease of Use=Easy|No) * P(Quality=Bad|No)
+
Estimate Probabilities from Data
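◼ Class: P(Y) = Nc/N, where Nc is the number of training
instances in class Y and N is the total number of instances
◼ E.g., P(No) = 3/5, P(Yes) = 2/5
◼ For attributes: P(Xi | Yk) = |Xik|/ Nc
◼ where |Xik| is the number of instances having attribute
value Xi and belonging to class Yk, and Nc is the number
of instances in class Yk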
◼ Examples:
◼ P(Quality=Good|No) = 2/3
◼ P(Ease of use=Moderate|Yes)=1/2
+
Naïve Bayes from Example Data
◼ Given a test record X = (Ease of Use: Easy; Quality:
Bad)
P(No) = 3/5
P(Yes) = 2/5
P(Ease of use=Easy|Yes) =1/2
P(Ease of Use=Easy|No) = 0
P(Ease of Use=Moderate|Yes) = 1/2
P(Ease of Use=Moderate|No) =1/3
P(Ease of Use = Difficult |Yes) = 0
P(Ease of Use = Difficult |No) = 2/3
P (Quality = Good|Yes) = 1/2
P(Quality = Good |No) = 2/3
P (Quality = Bad|Yes) = 1/2
P(Quality = Bad|No) = 1/3
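The estimates listed above can be reproduced by counting over the five training records. The following is a minimal sketch: the data is the product table from the slides, while the variable and function names (`records`, `priors`, `conditional`) are assumptions introduced here.

```python
from collections import Counter, defaultdict

# Training data from the slides: (Ease of Use, Quality, Satisfied?)
records = [
    ("Easy", "Good", "Yes"),
    ("Moderate", "Good", "No"),
    ("Difficult", "Bad", "No"),
    ("Difficult", "Good", "No"),
    ("Moderate", "Bad", "Yes"),
]
attributes = ("Ease of Use", "Quality")

# Class priors: P(Y) = Nc / N
class_counts = Counter(label for *_, label in records)
priors = {label: count / len(records) for label, count in class_counts.items()}

# value_counts[(attribute, class)][value] = number of matching instances
value_counts = defaultdict(Counter)
for *values, label in records:
    for name, value in zip(attributes, values):
        value_counts[(name, label)][value] += 1


def conditional(name, value, label):
    """P(attribute = value | class = label), estimated as a simple fraction."""
    return value_counts[(name, label)][value] / class_counts[label]


print(priors)                                   # P(Yes) = 2/5, P(No) = 3/5
print(conditional("Quality", "Good", "No"))     # 2/3
print(conditional("Ease of Use", "Easy", "No"))  # 0.0
```

A value that never occurs with a class, such as Ease of Use = Easy with class No, gets an estimate of exactly zero under this counting scheme.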
+
Naïve Bayes from Example Data
◼ Given a test record X = (Ease of Use: Easy; Quality:
Bad)
P(X|Yes) = P(Ease of Use=Easy|Yes) * P(Quality=Bad|Yes)
= 1/2 * 1/2 = 1/4
P(X|No) = P(Ease of Use=Easy|No) * P(Quality=Bad|No)
= 0 * 1/3 = 0
P(X|Yes) * P(Yes) = 1/4 * 2/5 = 2/20 = 1/10
P(X|No) * P(No) = 0 * 3/5 = 0
Since P(X|Yes) * P(Yes) > P(X|No) * P(No),
it follows that P(Yes|X) > P(No|X)
Class Prediction → Yes
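The classification step can be sketched directly from the estimated probabilities. The tables below copy the values listed on the earlier slide; the names `likelihoods`, `score`, and `test_record` are assumptions used only for this illustration.

```python
from math import prod

priors = {"Yes": 2 / 5, "No": 3 / 5}

# P(attribute = value | class), as estimated from the training data
likelihoods = {
    "Yes": {
        "Ease of Use": {"Easy": 1 / 2, "Moderate": 1 / 2, "Difficult": 0},
        "Quality": {"Good": 1 / 2, "Bad": 1 / 2},
    },
    "No": {
        "Ease of Use": {"Easy": 0, "Moderate": 1 / 3, "Difficult": 2 / 3},
        "Quality": {"Good": 2 / 3, "Bad": 1 / 3},
    },
}


def score(record, label):
    """P(Y) times the product of P(Xi | Y); proportional to P(Y | X)."""
    return priors[label] * prod(
        likelihoods[label][name][value] for name, value in record.items()
    )


test_record = {"Ease of Use": "Easy", "Quality": "Bad"}
scores = {label: score(test_record, label) for label in priors}
prediction = max(scores, key=scores.get)

print(scores)      # {'Yes': 0.1, 'No': 0.0}
print(prediction)  # Yes
```

The evidence P(X) is left out because it is the same for both classes and does not affect which score is larger.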
+
Characteristics of Naïve Bayes
Classifier
◼ Apart from providing posterior probability
estimates, naïve Bayes classifiers attempt to capture the
underlying mechanism behind the generation of data instances
belonging to every class. Thus, they can be used for
predictive as well as descriptive insights.
◼ If the attributes are conditionally independent of each
other given the class, naïve Bayes can easily compute
class-conditional probabilities even in high-dimensional
settings. This makes it a simple and effective classification
technique that can be used in diverse problems such as text
classification.
+
Characteristics of Naïve Bayes
Classifier
◼ Naïve Bayes classifiers are robust to isolated noise points
because such points are averaged out when estimating
conditional probabilities from data.
◼ They can handle missing values in the training
data by ignoring the missing entries when estimating the
conditional probabilities. They can also handle missing values
in a test instance by using only the non-missing attribute
values while computing the posterior probabilities.
◼ They are robust to irrelevant attributes.
+
Characteristics of Naïve Bayes
Classifier
◼ Correlated attributes can degrade the
performance of Naïve Bayes, as the conditional
independence assumption no longer holds for
such attributes.
◼ In such cases, use other techniques such as
Bayesian Belief Networks.
+
Practice
Record  A  B  C  Class
1       0  0  0  +
2       0  0  1  -
3       0  1  1  -
4       0  1  1  -
5       0  0  1  +
6       1  0  1  +
7       1  0  1  -
8       1  0  1  -
9       1  1  1  +
10      1  0  1  +
a) Estimate the conditional probabilities for P(A|+), P(B|+),
P(C|+), P(A|-), P(B|-), P(C|-).
b) Use the estimates of the conditional probabilities from the
previous question to predict the class label for a test sample
(A=0, B=1, C=0) using the naïve Bayes approach.
+
Estimate Probabilities from Data
◼ For continuous attributes:
◼ Discretization: partition the range into bins
◆ Replace each continuous value with its bin value
◆ Attribute changes from continuous to ordinal
◼ Probability density estimation:
◆ Assume the attribute follows a normal distribution
◆ Use the data to estimate the parameters of the distribution
(e.g., mean and standard deviation)
◆ Once the probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)
◆ The distribution is characterized by its mean and
variance
+
Estimate Probabilities from Data
Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Evade (class)
Tid  Refund  Marital Status  Taxable Income  Evade
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
◼ Normal distribution:
$$P(X_i \mid Y_j) = \frac{1}{\sqrt{2\pi\sigma_{ij}^2}} \exp\left(-\frac{(X_i - \mu_{ij})^2}{2\sigma_{ij}^2}\right)$$
– One for each (Xi, Yj) pair
◼ For (Income, Class=No):
– If Class=No
◆ sample mean = 110
◆ sample variance = 2975
$$P(\text{Income} = 120 \mid \text{No}) = \frac{1}{\sqrt{2\pi}\,(54.54)} \exp\left(-\frac{(120 - 110)^2}{2(2975)}\right) = 0.0072$$
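The sample mean, sample variance, and density value for (Income, Class=No) can be verified with the standard library. This is a sketch under the normal-distribution assumption stated above; the incomes are the seven Class=No rows of the table (in thousands), and the names `income_no` and `gaussian_density` are assumptions introduced here.

```python
from math import exp, pi, sqrt
from statistics import mean, variance

# Taxable income (in K) of the seven training records with Evade = No
income_no = [125, 100, 70, 120, 60, 220, 75]


def gaussian_density(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


mu_no = mean(income_no)       # sample mean
var_no = variance(income_no)  # sample variance (divides by n - 1)

p_income_120_given_no = gaussian_density(120, mu_no, var_no)
print(mu_no, var_no)                     # 110 2975
print(round(p_income_120_given_no, 4))   # 0.0072
```

Note that `statistics.variance` divides by n - 1, which is what yields 2975 here; dividing by n would give a different value.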
+
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)
Naïve Bayes Classifier:
P(Refund = Yes | No) = 3/7
P(Refund = No | No) = 4/7
P(Refund = Yes | Yes) = 0
P(Refund = No | Yes) = 1
P(Marital Status = Single | No) = 2/7
P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
+
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)
Naïve Bayes Classifier:
For Taxable Income:
If class = No: sample mean = 110
sample variance = 2975
If class = Yes: sample mean = 90
sample variance = 25
+
Example of Naïve Bayes Classifier
Given a Test Record:
X = (Refund = No, Divorced, Income = 120K)
P(X | No) = P(Refund=No | No)
× P(Divorced | No)
× P(Income=120K | No)
= 4/7 × 1/7 × 0.0072 = 0.0006
P(X | Yes) = P(Refund=No | Yes)
× P(Divorced | Yes)
× P(Income=120K | Yes)
= 1 × 1/3 × 1.2 × 10^-9 = 4 × 10^-10
Since P(X|No)P(No) > P(X|Yes)P(Yes)
Therefore P(No|X) > P(Yes|X)
=> Class = No
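The whole calculation, mixing categorical estimates with a Gaussian density for income, can be sketched as follows. The conditional probabilities and Gaussian parameters are the ones listed on the preceding slides; the priors P(No) = 7/10 and P(Yes) = 3/10 are read off the training table (they are not stated explicitly on the slides), and all variable names are assumptions.

```python
from math import exp, pi, sqrt


def gaussian_density(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * var)) / sqrt(2 * pi * var)


# Priors counted from the table: 7 records with Evade = No, 3 with Evade = Yes
priors = {"No": 7 / 10, "Yes": 3 / 10}
p_refund_no = {"No": 4 / 7, "Yes": 1.0}              # P(Refund = No | class)
p_divorced = {"No": 1 / 7, "Yes": 1 / 3}             # P(Divorced | class)
income_params = {"No": (110, 2975), "Yes": (90, 25)}  # (mean, variance)

# Class-conditional probability of X = (Refund = No, Divorced, Income = 120K)
likelihood = {
    label: p_refund_no[label]
    * p_divorced[label]
    * gaussian_density(120, *income_params[label])
    for label in priors
}
scores = {label: likelihood[label] * priors[label] for label in priors}
prediction = max(scores, key=scores.get)

print(likelihood)  # roughly 0.0006 for No and 4e-10 for Yes
print(prediction)  # No
```

The income term dominates the comparison: 120K lies six standard deviations above the Class=Yes mean of 90K, so its density under that class is tiny.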
+
Handling Zero Conditional
Probabilities
◼ If P(Marital Status = Divorced | No) is zero instead
of 1/7, then a data instance with attribute set X =
(Refund = Yes, Marital Status = Divorced,
Income = 120K) will have the following
class-conditional probabilities:
◼ P(X|No) = 3/7 × 0 × 0.0072 = 0
◼ P(X|Yes) = 0 × 1/3 × 1.2 × 10^-9 = 0
As both class-conditional probabilities are zero,
Naïve Bayes will not be able to classify the
instance.
+
Handling Zero Conditional
Probabilities
◼ To address this issue, we need to adjust the conditional
probability estimates using the following alternate
estimates :
Laplace estimate:
$$P(X_i = c \mid y) = \frac{n_c + 1}{n + v}$$
m-estimate:
$$P(X_i = c \mid y) = \frac{n_c + m\,p}{n + m}$$
n: number of training instances belonging to class y
nc: number of training instances with Xi = c and Y = y
v: total number of values that Xi can take
p: initial estimate of P(Xi = c|y), known as the prior
m: hyper-parameter that indicates our confidence in using p when the
observed fraction nc/n is too brittle (based on too few training instances)
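Both smoothed estimates are one-line formulas, sketched below. The function names are assumptions, and the example call revisits the hypothetical case from the previous slide in which no Class=No record is divorced (nc = 0, n = 7, and v = 3 marital-status values); the choices m = 3 and p = 1/3 are illustrative assumptions, not values from the slides.

```python
def laplace_estimate(n_c, n, v):
    """Laplace estimate: (n_c + 1) / (n + v)."""
    return (n_c + 1) / (n + v)


def m_estimate(n_c, n, m, p):
    """m-estimate: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# Hypothetical zero count: no Class=No record is Divorced (n_c = 0, n = 7).
# Marital Status takes v = 3 values: Single, Married, Divorced.
print(laplace_estimate(0, 7, 3))  # 0.1

# With m = v and p = 1/v (assumed here), the m-estimate matches the Laplace estimate.
print(m_estimate(0, 7, m=3, p=1 / 3))
```

Either estimate replaces the problematic zero with a small positive value, so a single unseen attribute value no longer forces the whole product to zero.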
+
Exercise
◼ Prior probabilities in Bayes theorem that are
revised with the help of newly available information
are termed
A) independent probabilities
B) posterior probabilities
C) interior probabilities
D) dependent probabilities
+
Exercise
◼ Suppose the fraction of undergraduate students
who smoke is 15% and the fraction of graduate
students who smoke is 23%. If one-fifth of the
college students are graduate students and the rest
are undergraduates, what is the probability that a
student who smokes is a graduate student?
+
Solution
P(S|UG) = 0.15, P(S|G) = 0.23, P(G) = 0.2, P(UG) = 0.8.
We want to compute P(G|S).
According to Bayes theorem,
$$P(G \mid S) = \frac{P(S \mid G)\,P(G)}{P(S \mid UG)\,P(UG) + P(S \mid G)\,P(G)} = \frac{0.23 \times 0.2}{0.15 \times 0.8 + 0.23 \times 0.2} = 0.277$$