0% found this document useful (0 votes)
6 views18 pages

Data Mining Notes

Bayesian classification uses Bayes' theorem to predict class membership probabilities based on observed data. The naïve Bayesian classifier simplifies the process by assuming attribute independence, allowing for efficient computation of posterior probabilities. The document also covers k-means clustering, the EM algorithm, agglomerative clustering, and k-NN classification as additional methods for data classification and clustering.

Uploaded by

cse viji2021
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
6 views18 pages

Data Mining Notes

Bayesian classification uses Bayes' theorem to predict class membership probabilities based on observed data. The naïve Bayesian classifier simplifies the process by assuming attribute independence, allowing for efficient computation of posterior probabilities. The document also covers k-means clustering, the EM algorithm, agglomerative clustering, and k-NN classification as additional methods for data classification and clustering.

Uploaded by

cse viji2021
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOC, PDF, TXT or read online on Scribd

BAYESIAN CLASSIFICATION

o Bayesian classifiers are statistical classifiers. They can predict class


membership probabilities, such as the probability that a given tuple belongs to a
particular class.

o Bayesian classification is based on Bayes’ theorem. Bayes’ theorem is named


after Thomas Bayes.

Bayes’ Theorem

o Bayes theorem is:

P (H|X) = P (X|H) P (H)


-------------------------------
P(X)

o Let X be a data tuple.

o X is considered “evidence.” It is described by measurements made on a set of n


attributes.

o Let H be some hypothesis, such as that the data tuple X belongs to a specified
class C.

o P (H|X) is the probability that the hypothesis H holds given the “evidence” or
observed data tuple X.

o P (H|X) is the posterior probability, or a posteriori probability, of H


conditioned on X.

o P (H) is the priori probability of H.

o P (X|H) is the posteriori probability of X conditioned on H.

o P(X) is the prior probability of X.

o Example :
Suppose the data tuples are customers described by the attributes age and
income, X is a 35-year old customer with an income of $40,000; H is the
hypothesis that our customer will buy a computer.
P (H|X) represents the probability that customer X will buy a computer given
that we know the customer’s age and income.

P (H) is the probability that any given customer will buy a computer,
regardless of age, income or any other information.

P (X|H) is the probability that a customer X, is 35 years old and earns


$40,000, given that we know that a customer will buy a computer.

P(X) is the probability that a person from our set of customers is 35 years old
and earns $40,000.

Example:

Consider a football game between two rival teams Team 0 and Team 1. Suppose
Team 0 wins 65% of the time and Team 1 wins the remaining matches. Among the
games won by Team 0, only 30% of them come from paying on Team 1’s football
field. On the other hand, 75% of the victories for team 1 are obtained while playing
at home. If Team 1 is to host the next match between the two teams, which team will
most likely emerge as the winner?

1. Probability Team 0 wins is P(Y=0)=0.65


2. Probability Team 1 wins is P(Y=1)=1-0.65 = 0.35
3. Probability Team 1 hosted the match, it wins is P(X=1|Y=0)=0.75
4. Probability Team 1 hosted the match won by team 0 is P(X=1|Y=0)=0.3

Compute P(Y=1|X=1) is the conditional probability that Team 1 wins the next match
it will be hosting, and compares it against P(Y=0|X=1).

Bayes Theorem,

P(Y=1|X=1) = P(X=1|Y=1) * P(Y = 1)


------------------------------------
P(X=1)

= 0.75 X 0.35 / 0.75 x0.35 + 0.3 X0.65 = 0.5738

P(Y=0|X =1) = 1-0.5738 = 0.4262


P(Y=1|X =1) > P(Y=0|X =1), Team 1 has a better chance than Team 0 of winning the
next match.

Naïve Bayesian Classification

The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:


Step 1:
Let D be a training set of tuples and their associated class labels.
 Each tuple is represented by an n-dimensional attribute vector, X =
(x1, x2, : : : , xn), depicting n measurements made on the tuple from n
attributes, respectively, A1, A2, : : : , An.
Step 2:

Suppose that there are m classes, C1, C2, : : : , Cm. Given a tuple, X, the classifier
will predict that X belongs to the class having the highest posterior
probability, conditioned on X.

 The naïve Bayesian classifier predicts that tuple X belongs to the


class Ci if and only if,

 The class Ci for which P(CijX) is maximized is called the maximum


posteriori hypothesis.

Step 3:
As P(X) is constant for all classes, only P(XjCi)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = … = P(Cm), and we would therefore maximize
P(X|Ci). Otherwise, we maximize P(X|Ci)P(Ci).

Step 4:
A simplified assumption: attributes are conditionally independent:
Where V are the data samples, vi is the value of attribute i on the sample and C j is
the j-th class.

 Greatly reduces the computation cost, only count the class


distribution.
 xk refers to the value of attribute Ak for tuple X.
 For each attribute, we look at whether the attribute is categorical or
continuous-valued.
o For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk|Ci) is the number of tuples
of class Ci in D having the value xk for Ak, divided by |
Ci,D|, the number of tuples of class Ci in D.

(b) If Ak is continuous-valued, it is assumed to have a


Gaussian distribution with a mean μ and standard
deviation s.
Step 5:
In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci. The
classifier predicts that the class label of tuple X is the class Ci if and only if,

Advantages:

 Have the minimum error rate in comparison to all other classifiers.


 Easy to implement
 Good results obtained in most of the cases.
 They provide theoretical justification for other classifiers that do not
explicitly use Bayes theorem.
Disadvantages:

 Lack of available probability data.


 Inaccuracies in the assumption.
Example 1: Predicting a class label using naïve Bayesian classification.
The training data are described by the attributes age, income, student, and credit
rating.

The class label attribute, buys computer, has two distinct values (namely, {yes, no}).

Let, C1 correspond to the class buys computer = yes and

C2 correspond to buys computer = no.

The tuple is to classify is

X = (age = youth, income = medium, student = yes, credit rating = fair)

Solution:
Example 2:

Predicting a class label using naïve Bayesian classification. The training data set is
given below. The data tuples are described by the attributes Owns Home?, Married,
Gender and Employed. The class label attribute Risk Class has three distinct values.
Let C1 corresponds to the class A, and C2 corresponds to the class B and C3
corresponds to the class C.

The tuple is to classify is,


X = (Owns Home = Yes, Married = No, Gender = Female, Employed = Yes0

Owns Home Married Gender Employed Risk Class

Yes Yes Male Yes B


No No Female Yes A
Yes Yes Female Yes C
Yes No Male No B
No Yes Female Yes C

No No Female Yes A
No No Male No B
Yes No Female Yes A
No Yes Female Yes C
Yes Yes Female Yes C
Solution:

There are 10 samples and three classes.

Risk class A = 3

Risk class B = 3

Risk class C = 4

The prior probabilities are obtained by dividing these frequencies by the total
number in the training data,

P(A) = 3/10 = 0.3

P(B) = 3/10 = 0.3

P(C) = 4/10 = 0.4

To compute P(X/Ci) =P {yes, no, female, yes}/Ci) for each of the classes, the
conditional probabilities for each:

P(Owns Home = Yes/A) = 1/3 =0.33

P(Married = No/A) = 3/3 =1

P(Gender = Female/A) = 3/3 = 1

P(Employed = Yes/A) = 3/3 = 1

P(Owns Home = Yes/B) = 2/3 =0.67

P(Married = No/B) = 2/3 =0.67

P(Gender = Female/B) = 0/3 = 0

P(Employed = Yes/B) = 1/3 = 0.33

P(Owns Home = Yes/C) = 2/4 =0.5

P(Married = No/C) = 0/4 =0


P(Gender = Female/C) = 4/4 = 1

P(Employed = Yes/C) = 4/4 = 1

Using the above probabilities, we obtain

P(X/A)= P(Owns Home = Yes/A) X P(Married = No/A) x P(Gender = Female/A) X


P(Employed = Yes/A)

= 0.33 x 1 x 1 x 1 = 0.33

Similarly, P(X/B), P(X/C)

To find the class, G, that maximizes,

P(X/Ci)P(Ci), we compute,

P(X/A) P(A) = 0.33 X 0.3 = 0

P(X/B) P(B) =0 X 0.3 = 0

P(X/C) P(C) = 0 X 0.4 = 0.0

Therefore x is assigned to class A

K-MEANS ALGORITHM

The k-means algorithm, where each cluster is represented by the mean value of the
objects in the cluster.

Steps of the algorithm:

 Determine K centroids (randomly?)


 Iterate until stable (= no object move group)
o Determine the distance of each object to the centroids
o Group the object based on minimum distance
o When all objects have been assigned, recalculate the positions of the K
centroids

Centroid-Based Technique: The k-Means Method


Example:

Object attribut attribute 2


e1 (Y):
Medicine A 1 1
Medicine B 2 1
Medicine C 4 3
Medicine 5 4

Iteration 0:
Iteration 1:
EM (EXPECTATION-MAXIMIZATION) ALGORITHM

 EM algorithm is an iterative estimation algorithm that can derive the maximum


likelihood (ML) estimates in the presence of missing/hidden data (“incomplete
data”)
Example:

AGGLOMERATIVE CLUSTERING

There are two natural algorithms for clustering:

 In divisive clustering, the entire data set is regarded as a cluster, and then
clusters are recursively split to yield a good clustering.
 In agglomerative clustering, each data item is regarded as a cluster, and
clusters are recursively merged to yield a good clustering.

 Algorithm :
Make each point a separate cluster
Until the clustering is satisfactory
Merge the two clusters with the
smallest inter-cluster distance
end
AGNES (AGglomerative NESting)

Consider a data set of five objects, {a, b, c, d, e}.

AGNES places each object into a cluster of its own. The clusters are then
merged step-by-step according to some criterion.

Clusters C1 and C2 may be merged if an object in C1 and an object in C2 form


the minimum Euclidean distance between any two objects from different
clusters.

This is a single-linkage approach in that each cluster is represented by all of


the objects in the cluster, and the similarity between two clusters is measured
by the similarity of the closest pair of data points belonging to different
clusters.

DIANA (DIvisive ANAlysis),

The cluster merging process repeats until all of the objects are eventually
merged to form one cluster.
All of the objects are used to form one initial cluster.
The cluster is split according to some principle, such as the maximum
Euclidean distance between the closest neighboring objects in the cluster.
The cluster splitting process repeats until, eventually, each new cluster
contains only a single object.
CLASSIFICATION BY DECISION TREES

Example:
K-NN CLASSIFICATION

- Lazy Learning Algorithm


- Whenever we have a new point to classify, classify, we find its K nearest neighbors
from the training data.
- The distance is calculated using one of the following measures :
 Euclidean Distance
 Minkowski Distance
 Mahalanobis Distance

Algorithm
For each training example <x,f(x)>, add the example to the list of training_examples.
Given a query instance xq , Given a query instance x to be classified,
- Let x1,x2….xk denote the k instances from training_ examples that are
nearest to xq
- Return the class that represents the maximum of the k instances.

You might also like