巨量資料探勘與應用
Big Data Mining and Applications
Classification Algorithms II
李建樂
Chien-Yueh Lee, Ph.D.
Assistant Professor
Master Program in Artificial Intelligence
Innovation Frontier Institute of Research for Science and Technology
Department of Electrical Engineering
National Taipei University of Technology
Apr. 21, 2025
Introduction to Support Vector Machines (SVM)
• What is SVM?
○ A powerful discriminative classifier (判別分類器) designed to
find optimal boundaries for separating classes in data.
• Key Characteristics:
○ Maximizes margin for best class separation
○ Strong built-in regularization → prevents overfitting
○ Based on Structural Risk Minimization
○ Only relies on critical points called support vectors
○ Highly generalizable and effective for real-world tasks
○ Works in both linear and nonlinear domains (via kernel trick)
○ Common in applications like text classification, image
recognition
• Why Use SVM?
○ Effective even in high-dimensional spaces
○ Robust to overfitting
Support Vector Machines
• Find a linear hyperplane (decision boundary) that
will separate the data
Support Vector Machines
• One possible solution.
B1
Support Vector Machines
• Another possible solution
B2
Support Vector Machines
• Other possible solutions
B2
Support Vector Machines
• Which one is better? B1 or B2?
• How do you define better?
B1
B2
Support Vector Machines
• Find hyperplane maximizes the margin
=> B1 is better than B2 B
1
B2
b21
b22
margin
b11
b12
Linear separator of SVM
• A separating hyperplane: wT x + b = 0
(wT xi) + b ≥ 1 if yi = 1
(wT xi) + b ≤ -1 if yi = -1 B
1
• Decision function
f(x) = sgn(wT x + b)
wTx + b = 0
wTx + b = -1 wTx + b = 1
b11
b12
Linear separator of SVM
• x: Feature Vector
○ Represents the input data point; each x is a combination of
features for a sample.
○ It is the input to the model, used to determine on which side
of the hyperplane the point lies.
• w: Weight Vector
○ A vector with the same dimension as the data point x.
○ It determines the orientation of the hyperplane, also known
as the normal vector.
○ It is typically used to describe the direction of the
classification boundary.
• b: Bias or Offset
○ Determines the distance between the hyperplane and the
origin.
○ It controls the translation of the hyperplane, allowing it to not
pass through the origin.
Maximal margin
• Distance between wT x
+ b = 1 and −1: B1
!"("!) &
margin = =
𝒘 𝒘
• Maximizing the margin
⟺ minimizing 𝒘
wTx + b = -1 wTx + b = 1
!
• Minw,b 𝒘 2
&
s.t. y' 𝐰 ( 𝐱 ' + 𝑏 ≥ 1,
𝑖 = 1, … , 𝑛
g in
ar
m
b11
b12
A tour from building a Lagrangian
function to the end
A tour from building a Lagrangian
function to the end
Nonlinear SVM
• What if decision boundary is not linear?
Nonlinear SVM
• Transform data into higher dimensional space
• Decision boundary:
SVM with polynomial kernel
visualization
[Link]
Kernel trick
•
• K(u, v) is a kernel function (expressed in terms of
the coordinates in the original space)
• Kernels:
Decision boundaries of different
kernel functions
[Link]
Conditional probability
!(#∩%)
• Definition: 𝑃 𝑌 𝑋 =
!(%)
• X, Y:Events
• 𝑃 𝑌|𝑋 : The posterior probability of Y represents the
probability of event Y occurring given that event X has
occurred.
• 𝑃 𝑋 : The prior probability of X is the probability of event
X occurring independently.
• 𝑃 𝑌 ∩ 𝑋 : The joint probability represents the probability
of events Y and X occurring simultaneously.
Bayes' theorem
• The equation of Bayes' theorem:
𝑃(𝑋|𝑌)𝑃(𝑌)
𝑃 𝑌𝑋 =
𝑃(𝑋)
• Proof:
!(#∩%)
○𝑃 𝑌 𝑋 = !(%)
… Eq. 1
!(#∩%)
○𝑃 𝑋 𝑌 = !(#)
⇒ 𝑃 𝑌 ∩ 𝑋 = 𝑃 𝑋 𝑌 𝑃(𝑌), applying
to Eq. 1
!(%|#)!(#)
○𝑃 𝑌 𝑋 = !(%)
Naïve Bayes Classifier
• Bayes' theorem is used to calculate the
probabilistic relationship between individual
features and classification outcomes, under the
assumption that each feature is independent
and does not influence the others.
• This assumption is known as conditional
independence in each class.
• It is suitable for features that are categorical
(discrete variables). If the features are
continuous variables, the Gaussian Naïve Bayes
classifier should be used.
Using Bayes Theorem for Classification
• Consider each attribute and class label as random variables
• Given a record with attributes (X1, X2,…, Xd), the goal is to
predict class Y
○ Specifically, we want to find the value of Y that maximizes
P(Y| X1, X2,…, Xd )
• Can we estimate P(Y| X1, X2,…, Xd )
directly from data?
Using Bayes Theorem for Classification
• Approach:
○ Compute posterior probability P(Y | X1, X2, …, Xd) using the Bayes
theorem
P( X 1 X 2 X d | Y ) P(Y )
P(Y | X 1 X 2 X n ) =
P( X 1 X 2 X d )
○ Maximum a-posteriori: Choose Y that maximizes
P(Y | X1, X2, …, Xd)
○ Equivalent to choosing value of Y that maximizes
P(X1, X2, …, Xd|Y) P(Y)
• How to estimate P(X1, X2, …, Xd | Y )?
Example Data
• Given a Test Record: X = (Refund = False, Divorced, Income = 120K)
• We need to estimate
○ P(Evade = Yes|X) and P(Evade = No|X)
• Using Bayes Theorem:
• How to estimate P(X|Yes) and P(X|No)?
Conditional Independence
• X and Y are conditionally independent given Z if
P(X|YZ) = P(X|Z)
• Example: Arm length and reading skills
○ Young child has shorter arm length and limited
reading skills, compared to adults
○ If age is fixed, no apparent relationship between
arm length and reading skills
○ Arm length and reading skills are conditionally
independent given age
Naïve Bayes Classifier
• Assume independence among attributes Xi when
class Yj is given:
○ P(X1, X2, …, Xd |Yj) = P(X1| Yj) P(X2| Yj)… P(Xd| Yj)
○ Now we can estimate P(Xi| Yj) for all Xi and Yj
combinations from the training data
○ New point is classified to Yj if P(Yj) P P(Xi| Yj) is
maximal.
j
Naïve Bayes on Example Data
• Given a Test Record: X = (Refund = False, Divorced, Income = 120K)
• P(X | Yes) =
P(Refund = False | Yes) ×
P(Divorced | Yes) ×
P(Income = 120K | Yes)
• P(X | No) =
P(Refund = False | No) ×
P(Divorced | No) ×
P(Income = 120K | No)
Estimate Probabilities from Data
• P(y) = fraction of instances of class y
○ e.g., P(No) = 7/10,
P(Yes) = 3/10
• For categorical attributes:
○ P(Xi = c|y) = nc / n,
where |Xi = c| is number of
instances having attribute
value Xi = c and belonging to
class y
• Examples:
○ P(Status = Married|No) = 4/7
P(Refund = True|Yes) = 0
Estimate Probabilities from Data
• For continuous attributes:
○ Discretization: Partition the range into bins:
—Replace continuous value with bin value
=>Attribute changed from continuous to ordinal
○ Probability density estimation:
—Assume attribute follows a normal distribution
—Use data to estimate parameters of distribution,
such as, mean (𝜇) and variance (𝜎2)
—Once probability distribution is known, use it to
estimate the conditional probability P(Xi|Y)
Estimate Probabilities from Data
• Normal distribution:
( X i - µij ) 2
-
1 2s ij2
○ P( X i | Y j ) = e
2ps 2
ij
○ One for each (Xi,Yi) pair
• For (Income, Evade=No):
○ sample mean = 110
○ sample variance = 2975
1 -
( 120 -110 ) 2
P( Income = 120 | No) = e 2 ( 2975 )
= 0.0072
2p (54.54)
Example of Naïve Bayes Classifier
• Given a Test Record: X = (Refund = False, Divorced, Income = 120K)
• P(X | No) = P(Refund=False | No) × P(Refund = True | No) = 3/7
P(Refund = False | No) = 4/7
P(Divorced | No) × P(Refund = True | Yes) = 0
P(Income=120K | No) P(Refund = False | Yes) = 1
P(Marital Status = Single | No) = 2/7
= 4/7 × 1/7 × 0.0072 = 0.0006 P(Marital Status = Divorced | No) = 1/7
P(Marital Status = Married | No) = 4/7
• P(X | Yes) = P(Refund= False | Yes) × P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Divorced | Yes) × P(Marital Status = Married | Yes) = 0
P(Income=120K | Yes)
For Taxable Income:
= 1 × 1/3 × 1.2 × 10-9 = 4 × 10-10 If class = No: sample mean = 110
sample variance = 2975
• Since P(X | No)P(No) > P(X | Yes)P(Yes) If class = Yes: sample mean = 90
sample variance = 25
• Therefore P(No | X) > P(Yes | X)
=> Class (Evade) = No P(Yes) = 3/10
P(No) = 7/10
Example of Naïve Bayes Classifier
• Given a Test Record: X = (Refund = False, Divorced, Income = 120K)
• If we only know that marital status is Divorced, then:
○ P(Yes|Divorced) = P(Divorced|Yes) × P(Yes) / P(Divorced)
= 1/3 × 3/10 / P(Divorced)
○ P(No|Divorced) = P(Divorced|No) × P(No) / P(Divorced)
= 1/7 × 7/10 / P(Divorced)
• If we also know that Refund = False, then
○ P(Yes|Refund = False, Divorced)
= P(Refund = False, Divorced|Yes) × P(Yes) / P(Refund = False, Divorced)
= 1 × 1/3 × 3/10 / P(Refund = False, Divorced)
○ P(No|Refund = False, Divorced)
= 4/7 × 1/7 × 7/10 / P(Refund = False, Divorced)
• If we also know that Taxable Income = 120, then
○ P(Yes|Refund = False, Divorced, Income = 120)
= P(Refund = False, Divorced, Income = 120|Yes) × P(Yes) / P(Refund = False,
Divorced, Income = 120)
= 1 × 1/3 × 1.2 × 10-9 × 3/10 / P(Refund = False, Divorced, Income = 120 )
○ P(No|Refund = False, Divorced Income = 120)
= 4/7 × 1/7 × 0.0072 × 7/10 / P(Refund = False, Divorced, Income = 120)
Issues with Naïve Bayes Classifier
• Consider the table with Tid = 7 deleted
• Given X = (Refund = True, Divorced, 120K)
• P(X | No) = 2/6 × 0 × 0.0083 = 0
• P(X | Yes) = 0 × 1/3 × 1.2 × 10-9 = 0
• Naïve Bayes will not be able to classify X as Yes or No!
P(Refund = True | No) = 2/6
P(Refund = False | No) = 4/6
P(Refund = True | Yes) = 0
P(Refund = False | Yes) = 1
P(Marital Status = Single | No) = 2/6
P(Marital Status = Divorced | No) = 0
P(Marital Status = Married | No) = 4/6
P(Marital Status = Single | Yes) = 2/3
P(Marital Status = Divorced | Yes) = 1/3
P(Marital Status = Married | Yes) = 0
For Taxable Income:
If class = No: sample mean = 110
sample variance = 2975
If class = Yes: sample mean = 90
sample variance = 25
Issues with Naïve Bayes Classifier
• If one of the conditional probabilities is zero, then the entire
expression becomes zero.
• Need to use other estimates of conditional probabilities
than simple fractions.
• Probability estimation: n: number of training
(! instances belonging to class y
○ original: 𝑃 𝑋' = 𝑐 𝑦) =
( nc: number of instances with
(! )*
○ Laplace Estimate: 𝑃 𝑋' = 𝑐 𝑦) = Xi = c and Y = y
()+
(! ),- v: total number of attribute
○ m − estimate: 𝑃 𝑋' = 𝑐 𝑦) = values that Xi can take
(),
p: initial estimate of
(P(Xi = c|y) known apriori
m: hyper-parameter for our
confidence in p
Example of Naïve Bayes Classifier
Name Give Birth Can Fly Live in Water Have Legs Class
human yes no no yes mammals A: attributes
python no no no no non-mammals
salmon no no yes no non-mammals
M: mammals
whale yes no yes no mammals N: non-mammals
frog no no sometimes yes non-mammals
komodo no no no yes non-mammals
6 6 2 2
bat yes yes no yes mammals P ( A | M ) = ´ ´ ´ = 0.06
pigeon
cat
no
yes
yes
no
no
no
yes
yes
non-mammals
mammals
7 7 7 7
leopard shark yes no yes no non-mammals 1 10 3 4
turtle no no sometimes yes non-mammals P ( A | N ) = ´ ´ ´ = 0.0042
penguin no no sometimes yes non-mammals 13 13 13 13
porcupine yes no no yes mammals
eel no no yes no non-mammals 7
salamander no no sometimes yes non-mammals P ( A | M ) P ( M ) = 0.06 ´ = 0.021
gila monster no no no yes non-mammals 20
platypus no no no yes mammals
13
owl
dolphin
no
yes
yes
no
no
yes
yes
no
non-mammals
mammals
P ( A | N ) P ( N ) = 0.004 ´ = 0.0027
eagle no yes no yes non-mammals 20
Give Birth Can Fly Live in Water Have Legs Class
P(A|M)P(M) > P(A|N)P(N)
yes no yes no ? => Mammals
Naïve Bayes (Summary)
• Robust to isolated noise points
• Handle missing values by ignoring the instance
during probability estimate calculations
• Robust to irrelevant attributes
• Redundant and correlated attributes will violate
class conditional assumption
○ Use other techniques such as Bayesian Belief
Networks (BBN)
How does Naïve Bayes perform on
the following dataset?
• Violation of the Conditional Independence Assumption
○ Naïve Bayes assumes that features are conditionally
independent given the class label.
• In the image, data points of the same color are not
concentrated in a single cluster, but instead form two
separate subgroups positioned diagonally