0% found this document useful (0 votes)
13 views19 pages

Gaussian Classifier and Bayesian Decision Theory

The document discusses the Gaussian classifier and its foundation in Bayesian decision theory, emphasizing the Bayes decision rule and the maximum a-posteriori (MAP) rule for minimizing error in classification tasks. It explains the implications of class probabilities and how they affect decision boundaries, particularly in cases where classes have different covariance structures. The Gaussian classifier is presented as a multivariate extension, leading to linear discriminant functions when class covariances are equal.

Uploaded by

gasev82612
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views19 pages

Gaussian Classifier and Bayesian Decision Theory

The document discusses the Gaussian classifier and its foundation in Bayesian decision theory, emphasizing the Bayes decision rule and the maximum a-posteriori (MAP) rule for minimizing error in classification tasks. It explains the implications of class probabilities and how they affect decision boundaries, particularly in cases where classes have different covariance structures. The Gaussian classifier is presented as a multivariate extension, leading to linear discriminant functions when class covariances are equal.

Uploaded by

gasev82612
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

The Gaussian classifier

Nuno Vasconcelos
ECE Department, UCSD
Bayesian decision theory
recall that we have
• Y – state of the world
• X – observations
• g(x) – decision function
• L[g(x),y] – loss of predicting y with g(x)
Bayes decision rule is the rule that minimizes the risk
Risk = EX ,Y L( X , Y )

given x, it consists of picking the prediction of


minimum conditional risk
M
g ( x) = arg min  PY | X (i | x) L[ g ( x), i ]
*

g ( x) i =1

2
MAP rule
for the “0-1” loss
1, g ( x)  y
L[ g ( x), y ] = 
0, g ( x) = y
the optimal decision rule is the maximum a-posteriori
probability rule
g * ( x) = arg max PY | X (i | x)
i

the associated risk is the probability of error of this


rule (Bayes error)
there is no other decision function with lower error

3
MAP rule
by application of simple mathematical laws (Bayes
rule, monotonicity of the log)
we have shown that the following three decision
rules are optimal and equivalent
• 1) i * (x ) = arg max PY |X (i | x )
i

• 2) i * (x ) = arg maxPX Y| (x | i )PY (i )


i

• 3) i * (x ) = arg maxlog PX Y| (x | i ) + log PY (i )


i

• 1) is usually hard to use, 3) is frequently easier than 2)

4
Example
the Bayes decision rule is usually highly intuitive
we have used an example from communications
• a bit is transmitted by a source, corrupted by noise, and
received by a decoder
Y X
channel

• Q: what should the optimal decoder do to recover Y?

5
Example
this was modeled as a classification problem with
Gaussian classes

PX |Y ( x | 0) = G ( x, m 0 ,  ) 1 −
( x−m )2

G ( x, m ,  ) = e 2 2
PX |Y ( x | 1) = G ( x, m1 ,  ) 2 2

PY (0) = PY (1) = 1
2
• or, graphically,

m0 m1
6
BDR
for which the optimal decision boundary is a
threshold
• pick “0” if m + m0
x 1
2

m0 m1
pick 0 pick 1

7
BDR
what is the point of going through all the math?
• now we know that the intuitive threshold is actually optimal,
and in which sense it is optimal (minimum probability or
error)
• the Bayesian solution keeps us honest.
• it forces us to make all our assumptions explicit
• assumptions we have made
• uniform class probabilities PY (0) = PY (1) = 1
2

• Gaussianity PX |Y ( x | i) = G( x, mi ,  i )

• the variance is the same under the two states  i =  , i

• noise is additive X =Y +
• even for a trivial problem, we have made lots of
assumptions
8
BDR
what if the class probabilities are not the same?
• e.g. coding scheme 7 = 11111110
• in this case PY(1) >> PY(0)
• how does this change the optimal decision rule?

i* ( x) = arg maxlog PX |Y ( x | i) + log PY (i)


i
  1 −
( x − mi ) 2 
 
= arg max log  e 2  + log PY (i )
2

i   2 2  
 1 ( x − mi ) 2 
= arg max − log( 2 ) −
2
+ log PY (i )
i  2 2 2

 ( x − mi ) 2 
= arg min  − log P (i ) 
 2
2 Y
i 
9
BDR
 ( x − mi ) 2 
• or i* = arg min  − log PY (i )
 2
2
i 
= arg min ( x 2 − 2 xmi + mi − 2 2 log PY (i ))
2

= arg min (−2 xmi + mi − 2 2 log PY (i ))


2

• the optimal decision is, therefore


• pick 0 if
− 2 xm0 + m0 − 2 2 log PY (0)  −2 xm1 + m1 − 2 2 log PY (1)
2 2

PY (0)
2 x( m1 − m0 )  m1 − m0 + 2 log
2 2 2

PY (1)
• or, pick 0 if

m1 + m 0 2 P ( 0)
x + log Y
2 m1 − m 0 PY (1)
10
BDR
what is the role of the prior for class probabilities?
𝜇1 + 𝜇0 𝜎2 𝑃𝑌 (0)
𝑥< + log
2 𝜇1 − 𝜇0 𝑃𝑌 (1)

• the prior moves the threshold up or down, in an intuitive


way
• PY(0)>PY(1) : threshold increases
• since 0 has higher probability, we care more about errors on
the 0 side
• by using a higher threshold we are making it more likely to pick
0
• if PY(0)=1, all we care about is Y=0, the threshold becomes
infinite
• we never say 1
• how relevant is the prior?
• it is weighed by 1
൘𝜇1 − 𝜇0
𝜎2 11
BDR
how relevant is the prior?
• it is weighed by the inverse of the normalized distance
between the means
1
m1 − m 0 distance between the means
2 in units of variance
• if the classes are very far apart, the prior makes no
difference
• this is the easy situation, the observations are very clear, Bayes
says “forget the prior knowledge”
• if the classes are exactly equal (same mean) the prior gets
infinite weight
• in this case the observations do not say anything about the
class, Bayes says “forget about the data, just use the
knowledge that you started with”
• even if that means “always say 0” or “always say 1”
12
The Gaussian classifier
this is one example of a Gaussian classifier
• in practice we rarely have only one variable
• typically X = (X1, …, Xn) is a vector of observations
the BDR for this case is equivalent, but more
interesting
the main difference is in the class-conditional
distributions, which are multivariate Gaussian
PX |Y ( x | i ) =
1  1 
exp − ( x − mi )T i−1 ( x − mi )
(2 ) d | i |  2 

13
The Gaussian classifier
in this case
1  1 
PX |Y ( x | i ) = exp − ( x − mi )T  i−1 ( x − mi )
(2 ) d |  i |  2 

• the BDR

i * (x ) = arg maxlog PX Y| (x | i ) + log PY (i )


i

• becomes

 1
i (x ) = arg max − (x − mi )T i−1 (x − mi )
*

i  2
1 
− log( 2 )d i + log PY (i )
2 
14
1
𝑖 ∗ (𝑥) = argmax ൤− (𝑥 − 𝜇𝑖 )𝑇 Σ𝑖−1 (𝑥 − 𝜇𝑖 )

The Gaussian classifier


𝑖 2
1
− log( 2𝜋)𝑑 Σ𝑖 + log 𝑃𝑌 (𝑖)൨
2

this can be written as


discriminant:
i (x ) = arg min d i (x , mi ) + ai 
*
PY|X(1|x ) = 0.5
i

with

d i (x , y ) = (x − y )T i−1 (x − y )

a i = log( 2 )d i − 2 log PY (i )
the optimal rule is to assign x to the closest class
closest is measured with the Mahalanobis distance
di(x,y)
to which a constant is added to account for class prior
15
The Gaussian classifier
first special case of interest:
• classes have the same covariance,

i = , i

the BDR becomes

i* ( x) = arg min d ( x, mi ) + a i 
i

• with
same metric for
d ( x, y ) = ( x − y )T  −1 ( x − y ) all classes

constant, not function


a i = log( 2 )  − 2 log PY (i)
d
of i, can be dropped

16
The Gaussian classifier
in detail

𝑖 ∗ (𝑥) = argmin (𝑥 − 𝜇𝑖 )𝑇 Σ−1 (𝑥 − 𝜇𝑖 ) − 2 log 𝑃𝑌 (𝑖)


𝑖

= argmin 𝑥 𝑇 Σ −1 𝑥 − 𝑥 𝑇 Σ −1 𝜇𝑖 − 𝜇𝑖 𝑇 Σ −1 𝑥 + 𝜇𝑖 𝑇 Σ−1 𝜇𝑖 − 2 log 𝑃𝑌 (𝑖)


𝑖

= argmin 𝑥 𝑇 Σ −1 𝑥 − 2𝜇𝑖 𝑇 Σ−1 𝑥 + 𝜇𝑖 𝑇 Σ −1 𝜇𝑖 − 2 log 𝑃𝑌 (𝑖)


𝑖

𝑇 −1
1 𝑇 −1
= argmax 𝜇𝑖 Σ 𝑥− 𝜇𝑖 Σ 𝜇𝑖 + log 𝑃𝑌 (𝑖)
𝑖 2
𝑤𝑖𝑇 𝑤𝑖0

17
1
𝑖 ∗ (𝑥) = argmax 𝜇𝑖 𝑇 Σ −1 𝑥− 𝜇𝑖 𝑇 Σ −1 𝜇𝑖 + log 𝑃𝑌 (𝑖)
2

The Gaussian classifier


𝑖 𝑇
𝑤𝑖 𝑤𝑖0

in summary, when classes have equal covariance,

𝑖 ∗ (𝑥) = argmax𝑔𝑖 (𝑥) discriminant:


𝑖 PY|X(1|x ) = 0.5

• with

g i ( x) = wiT x + wi 0
wi =  −1mi
1 T −1
wi 0 = − mi  mi + log PY (i )
2
• the BDR is a linear function or a linear discriminant

18
19

You might also like