The Gaussian classifier
Nuno Vasconcelos
ECE Department, UCSD
Bayesian decision theory
recall that we have
• Y – state of the world
• X – observations
• g(x) – decision function
• L[g(x),y] – loss of predicting y with g(x)
Bayes decision rule is the rule that minimizes the risk
Risk = EX ,Y L( X , Y )
given x, it consists of picking the prediction of
minimum conditional risk
M
g ( x) = arg min PY | X (i | x) L[ g ( x), i ]
*
g ( x) i =1
2
MAP rule
for the “0-1” loss
1, g ( x) y
L[ g ( x), y ] =
0, g ( x) = y
the optimal decision rule is the maximum a-posteriori
probability rule
g * ( x) = arg max PY | X (i | x)
i
the associated risk is the probability of error of this
rule (Bayes error)
there is no other decision function with lower error
3
MAP rule
by application of simple mathematical laws (Bayes
rule, monotonicity of the log)
we have shown that the following three decision
rules are optimal and equivalent
• 1) i * (x ) = arg max PY |X (i | x )
i
• 2) i * (x ) = arg maxPX Y| (x | i )PY (i )
i
• 3) i * (x ) = arg maxlog PX Y| (x | i ) + log PY (i )
i
• 1) is usually hard to use, 3) is frequently easier than 2)
4
Example
the Bayes decision rule is usually highly intuitive
we have used an example from communications
• a bit is transmitted by a source, corrupted by noise, and
received by a decoder
Y X
channel
• Q: what should the optimal decoder do to recover Y?
5
Example
this was modeled as a classification problem with
Gaussian classes
PX |Y ( x | 0) = G ( x, m 0 , ) 1 −
( x−m )2
G ( x, m , ) = e 2 2
PX |Y ( x | 1) = G ( x, m1 , ) 2 2
PY (0) = PY (1) = 1
2
• or, graphically,
m0 m1
6
BDR
for which the optimal decision boundary is a
threshold
• pick “0” if m + m0
x 1
2
m0 m1
pick 0 pick 1
7
BDR
what is the point of going through all the math?
• now we know that the intuitive threshold is actually optimal,
and in which sense it is optimal (minimum probability or
error)
• the Bayesian solution keeps us honest.
• it forces us to make all our assumptions explicit
• assumptions we have made
• uniform class probabilities PY (0) = PY (1) = 1
2
• Gaussianity PX |Y ( x | i) = G( x, mi , i )
• the variance is the same under the two states i = , i
• noise is additive X =Y +
• even for a trivial problem, we have made lots of
assumptions
8
BDR
what if the class probabilities are not the same?
• e.g. coding scheme 7 = 11111110
• in this case PY(1) >> PY(0)
• how does this change the optimal decision rule?
i* ( x) = arg maxlog PX |Y ( x | i) + log PY (i)
i
1 −
( x − mi ) 2
= arg max log e 2 + log PY (i )
2
i 2 2
1 ( x − mi ) 2
= arg max − log( 2 ) −
2
+ log PY (i )
i 2 2 2
( x − mi ) 2
= arg min − log P (i )
2
2 Y
i
9
BDR
( x − mi ) 2
• or i* = arg min − log PY (i )
2
2
i
= arg min ( x 2 − 2 xmi + mi − 2 2 log PY (i ))
2
= arg min (−2 xmi + mi − 2 2 log PY (i ))
2
• the optimal decision is, therefore
• pick 0 if
− 2 xm0 + m0 − 2 2 log PY (0) −2 xm1 + m1 − 2 2 log PY (1)
2 2
PY (0)
2 x( m1 − m0 ) m1 − m0 + 2 log
2 2 2
PY (1)
• or, pick 0 if
m1 + m 0 2 P ( 0)
x + log Y
2 m1 − m 0 PY (1)
10
BDR
what is the role of the prior for class probabilities?
𝜇1 + 𝜇0 𝜎2 𝑃𝑌 (0)
𝑥< + log
2 𝜇1 − 𝜇0 𝑃𝑌 (1)
• the prior moves the threshold up or down, in an intuitive
way
• PY(0)>PY(1) : threshold increases
• since 0 has higher probability, we care more about errors on
the 0 side
• by using a higher threshold we are making it more likely to pick
0
• if PY(0)=1, all we care about is Y=0, the threshold becomes
infinite
• we never say 1
• how relevant is the prior?
• it is weighed by 1
൘𝜇1 − 𝜇0
𝜎2 11
BDR
how relevant is the prior?
• it is weighed by the inverse of the normalized distance
between the means
1
m1 − m 0 distance between the means
2 in units of variance
• if the classes are very far apart, the prior makes no
difference
• this is the easy situation, the observations are very clear, Bayes
says “forget the prior knowledge”
• if the classes are exactly equal (same mean) the prior gets
infinite weight
• in this case the observations do not say anything about the
class, Bayes says “forget about the data, just use the
knowledge that you started with”
• even if that means “always say 0” or “always say 1”
12
The Gaussian classifier
this is one example of a Gaussian classifier
• in practice we rarely have only one variable
• typically X = (X1, …, Xn) is a vector of observations
the BDR for this case is equivalent, but more
interesting
the main difference is in the class-conditional
distributions, which are multivariate Gaussian
PX |Y ( x | i ) =
1 1
exp − ( x − mi )T i−1 ( x − mi )
(2 ) d | i | 2
13
The Gaussian classifier
in this case
1 1
PX |Y ( x | i ) = exp − ( x − mi )T i−1 ( x − mi )
(2 ) d | i | 2
• the BDR
i * (x ) = arg maxlog PX Y| (x | i ) + log PY (i )
i
• becomes
1
i (x ) = arg max − (x − mi )T i−1 (x − mi )
*
i 2
1
− log( 2 )d i + log PY (i )
2
14
1
𝑖 ∗ (𝑥) = argmax − (𝑥 − 𝜇𝑖 )𝑇 Σ𝑖−1 (𝑥 − 𝜇𝑖 )
The Gaussian classifier
𝑖 2
1
− log( 2𝜋)𝑑 Σ𝑖 + log 𝑃𝑌 (𝑖)൨
2
this can be written as
discriminant:
i (x ) = arg min d i (x , mi ) + ai
*
PY|X(1|x ) = 0.5
i
with
d i (x , y ) = (x − y )T i−1 (x − y )
a i = log( 2 )d i − 2 log PY (i )
the optimal rule is to assign x to the closest class
closest is measured with the Mahalanobis distance
di(x,y)
to which a constant is added to account for class prior
15
The Gaussian classifier
first special case of interest:
• classes have the same covariance,
i = , i
the BDR becomes
i* ( x) = arg min d ( x, mi ) + a i
i
• with
same metric for
d ( x, y ) = ( x − y )T −1 ( x − y ) all classes
constant, not function
a i = log( 2 ) − 2 log PY (i)
d
of i, can be dropped
16
The Gaussian classifier
in detail
𝑖 ∗ (𝑥) = argmin (𝑥 − 𝜇𝑖 )𝑇 Σ−1 (𝑥 − 𝜇𝑖 ) − 2 log 𝑃𝑌 (𝑖)
𝑖
= argmin 𝑥 𝑇 Σ −1 𝑥 − 𝑥 𝑇 Σ −1 𝜇𝑖 − 𝜇𝑖 𝑇 Σ −1 𝑥 + 𝜇𝑖 𝑇 Σ−1 𝜇𝑖 − 2 log 𝑃𝑌 (𝑖)
𝑖
= argmin 𝑥 𝑇 Σ −1 𝑥 − 2𝜇𝑖 𝑇 Σ−1 𝑥 + 𝜇𝑖 𝑇 Σ −1 𝜇𝑖 − 2 log 𝑃𝑌 (𝑖)
𝑖
𝑇 −1
1 𝑇 −1
= argmax 𝜇𝑖 Σ 𝑥− 𝜇𝑖 Σ 𝜇𝑖 + log 𝑃𝑌 (𝑖)
𝑖 2
𝑤𝑖𝑇 𝑤𝑖0
17
1
𝑖 ∗ (𝑥) = argmax 𝜇𝑖 𝑇 Σ −1 𝑥− 𝜇𝑖 𝑇 Σ −1 𝜇𝑖 + log 𝑃𝑌 (𝑖)
2
The Gaussian classifier
𝑖 𝑇
𝑤𝑖 𝑤𝑖0
in summary, when classes have equal covariance,
𝑖 ∗ (𝑥) = argmax𝑔𝑖 (𝑥) discriminant:
𝑖 PY|X(1|x ) = 0.5
• with
g i ( x) = wiT x + wi 0
wi = −1mi
1 T −1
wi 0 = − mi mi + log PY (i )
2
• the BDR is a linear function or a linear discriminant
18
19