BLG 527E Machine Learning
FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun
Parametric Methods
Parametric Estimation
• X = { xt }t where xt ~ p (x)
• Parametric estimation:
Assume a form for p (x |q ) and estimate q , its sufficient statistics,
using X
e.g., N ( μ, σ2) where q = { μ, σ2}
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 2
Maximum Likelihood Estimation
• Likelihood of q given the sample X
l (θ|X) = p (X |θ) = ∏t p (xt|θ)
• Log likelihood
L(θ|X) = log l (θ|X) = ∑t log p (xt|θ)
• Maximum likelihood estimator (MLE)
θ* = argmaxθ L(θ|X)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 3
Examples: Bernoulli/Multinomial
• Bernoulli: Two states, failure/success, x in {0,1}
P (x) = pox (1 – po ) (1 – x)
L (po|X) = log ∏t poxt (1 – po ) (1 – xt)
MLE: po = ∑t xt / N
• Multinomial: K>2 states, xi in {0,1}
P (x1,x2,...,xK) = ∏i pixi
L(p1,p2,...,pK|X) = log ∏t ∏i pixit
MLE: pi = ∑t xit / N
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 4
Examples: Bernoulli (Derivation)
• Bernoulli: Two states, failure/success, x in {0,1}
P (x) = pox (1 – po ) (1 – x)
t t
L (po|X) = log ∏t pox (1 – po ) (1 – x )
dL( p0 | X ) N t d N
d
x log( p0 ) (1 x )
t
log(1 p0 )
dp0 t 1 dp0 t 1 dp0
N N
1 1
p0
x
t 1
t
(1 x t )
t 1 1 p0
0
5
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
Bernoulli (Derivation)
N N N
(1 p0 ) x t p0 1 p0 x t 0
t 1 t 1 t 1
N N
1
x p0 N 0 p0
t
x t
t 1 N t 1
MLE: po = ∑t xt / N
Gaussian (Normal) Distribution
• p(x) = N ( μ, σ2)
p x
1 x 2
1 x 2
px
exp -
exp
2
2 2
2 2 2
μ σ
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 7
Gaussian (Normal) Distribution
• Given that X = { xt }t with xt ~ N ( μ, σ2)
𝐿 ¿
x t
MLE for μ and σ2: m t
N
x m
t 2
s2 t
N
Bias and Variance
Let X be a sample from a population specified up to a
parameter q
To evaluate the quality of this estimator we can measure
how much it is different from q
That is (d (X)-q )2
But since it is random variable (it depends on the
sample) we need to average over all possible X and
consider meas square error of the estimator
Remember the properties of expectation
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 9
Bias and Variance
Unknown parameter q
Estimator di = d (Xi) on sample Xi
Bias: bq(d) = E [d] – q
Variance: E [(d–E [d])2]
Mean square error:
q
r (d,q) = E [(d–q)2]=E [(d–E[d]+E[d]-θ)2]
= (E[d]-θ)2+E[(d–E[d])2+2 (d–E[d])(E[d]-θ)]
Remember the properties of expectation
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 10
= E[(E[d]-θ)2]+E [(d–E [d])2] +2 E[(d–E[d])(E[d]-θ)]
= E[(E[d]-θ)2]+E [(d–E [d])2] +2 (E[d]–E[d])(E[d]-θ)
= (E [d] – θ)2 +E [(d–E [d])2]
= (E [d] – θ)2 + E [(d–E [d])2]
= Bias2 + Variance
Bayes’ Estimator
• Sometimes before looking at a sample we may have some prior
information on the possible value range that a parameter , θ, may
take.
• This information is quite useful especially when the sample is small.
• Treat θ as a random variable with prior p (θ)
• Bayes’ rule: p (θ|X) = p(X|θ) p(θ) / p(X)
• Density at x: p(x|X) = ∫ p(x|θ, X) p(θ|X) dθ= ∫ p(x|θ) p(θ|X) dθ
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 12
Bayes’ Estimator
• Evaluating the p(x|X) integrals may be quite difficult except in cases
where the posterior has a nice form
• Maximum a Posteriori (MAP): θMAP = argmaxθ p(θ|X)
• Maximum Likelihood (ML): θML = argmaxθ p(X|θ)
• Bayes’: θBayes’ = E[θ|X] = ∫ θ p(θ|X) dθ
• If we have no prior reason to favor some values of θ then the prior
density is flat and the posterior will have the same form as the
likelihood p(X|θ)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 13
Bayes’ Estimator: Example
• xt ~ N (θ, σo2) and θ ~ N ( μ, σ2) where are μ, σ2, σo2
known
• θML = m
• p(θ|X) ∝ p(X| θ)p(θ)
• Take the derivative with respect to θ
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 14
1 x 2
P( X | ) exp
2 0 2 0
2
1 2
P( ) exp
2 2
2
Likelihood :
1 2 N 1
xt
2
l ( ) exp exp
2 2 2
t 1 2 0 2 2
0
Loglikelihood :
1 N 1 x t
2
2
L( ) log log
2 2 2
t 1 2 0 2 2
0
1
2 N 1 xt
2
L( ) log log
2 2 2
t 1 2 0 2 2
0
L( )
N xt 0
2
t 1 02
N
xt N
N N x t N
t 1 0
2
2
t 1 0 2
2
0 t 1 N
t 1 0
2
2 2
N N
m 2 2 2
0 2
0
N /2
1/ 2
E | X 0
m
N / 0 1/
2 2
N / 0 1/
2 2
Bayes’ Estimator: Example
• xt ~ N (θ, σo2) and θ ~ N ( μ, σ2)
• θML = m
• θMAP = θBayes’ =
N / 2
1/ 2
E | X m 0
N / 0 1/
2 2
N / 0 1/
2 2
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 17
Bayesian Learning for Coin Model
• Bayesian Learning procedure:
• Given data x1, x2,…, xN write down expression for likelihood p(X| θ)
Specify a prior P (θ )
Compute posterior p(θ |X) = p(X| θ)p(θ)/ p(X)
p(θ |X) ∝ p(X| θ)p(θ)
• This example is obtained from Nando Freitas lecture notes
Bayesian Learning for Coin Model
• For coin model likelihood of data (i.i.d. in our case)
P(x1, x2,…, xN| θ ) = ∏t θ xt (1 – θ ) (1 – xt) = θ m (1 – θ ) (N – m)
Where xt ϵ {0,1} and m is the number of 1’s
• Specify a prior on θ. For this we need to introduce Beta distribution
Beta Distribution
( ) 1 , are hyperparameters
p ( ) (1 ) 1
( )( )
( z ) e x x z 1dx
0
p( )d 1
( ) 1
( )( )
1
(1 ) d 1
( )( )
(1 ) d ( )
1 1
E[ ]
• The figure is obtained from wikipedia
Bayesian Learning for Coin Model
Compute Posterior: p(θ |X) ∝ p(X| θ)p(θ)
P(x1, x2,…, xN| θ ) = ∏t θ xt (1 – θ ) (1 – xt) = θ m (1 – θ ) (N – m)
( ) 1 1 1 1
p ( ) (1 ) (1 ) 1
( )( ) const
1 1 ( ' ' ) ' 1
p ( | X ) (1 ) 1 m (1 ) N m p ( | X ) (1 ) ' 1
const ( ' )( ' )
' m , ' N m
Conjugate Prior
• Conjugate priors: A likelihood-prior pair is said to be conjugate if they
result in a posterior which is of the same form as the prior.
• This enables us to compute the posterior density analytically without
having to worry about computing the denominator in Bayes' rule, the
marginal likelihood.
Prior Likelihood
Gaussian Gaussian
Beta Binomial
Dirichlet Multinomial
Gamma Gaussian
Example
• Suppose that we observe X= {1,1,1,1,1,1} where each xt comes
from Bernoulli distribution θML = 1
• We can compute posterior and use its mean as the estimate
1 1 ( ' ' ) ' 1 '
p ( | X ) (1 ) 1 m (1 ) N m p ( | X ) (1 ) ' 1 E[ ]
const ( ' )( ' ) ' '
' m , ' N m
• Using Beta(2,2) prior 8
B
10
Parametric Classification
gi x px |C i P C i
or
gi x log px |C i log P C i
px |C i
1
exp
x i
2
2
2 i 2 i
1
gi x log 2 log i
x i 2
log P C i
2
2 2 i
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 24
• Given the sample X {x t ,r t }tN1
t
1 if x Ci
x t
ri t
0 if x C j , j i
• ML estimates are
• Discriminant becomes
1
gi x log 2 log si
x mi
log ˆC i
P
2
2 2 si 2
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 25
Equal variances
Single boundary at
halfway between means
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 26
Variances are different
Two boundaries
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 27
Probabilistic Interpretation of Linear Regression
r f x
estimator : gx |
~ N 0, 2
pr | x ~ N gx | , 2
N
L |X log px t , r t
t 1
N N
log pr t | x t log px t
t 1 t 1
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 28
Regression: From LogL to Error
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 29
Linear Regression
g x t |w1 , w 0 w1 x t w 0 Take derivative of E
r
t
t
Nw 0 w1 x t
t
…wrto w0
r x t t
w 0 x w1 xt
t 2 …wrto w1
t t t
N t w0
x t
r t
A w y t
t x w1 t x
t 2 t t
x t
r
t
w A 1y
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 30
Polynomial Regression
g x |w k , , w 2 , w1 , w 0 w k x
t
w x w x
t k
2
t 2
1
t
w0
1 x 1
x 1 2
x
1 k
r 1
2
D
1 x 2
x 2 2
x
2 k
r r
N
1 x N
x N 2
x
N 2
r
w D D DT r T 1
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 31
Other Error Measures
N 2
1
• Square Error:
E |X r t g x t |
2 t 1
N 2
r t
g x t |
E |X t 1 N 2
• Relative Square Error:
r
t 1
t
r
• Absolute Error: E (θ |X) = ∑t |rt – g(xt| θ)|
• ε-sensitive Error:
E (θ |X) = ∑ t 1(|rt – g(xt| θ)|>ε) (|rt – g(xt|θ)| – ε)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 32
Bias and Variance
E r gx 2 | x E r E r | x 2 | x E r | x gx 2
noise squared error
E X E r | x g x | x E r | x E X g x E X g x E X g x
2 2
2
bias variance
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 33
Estimating Bias and Variance
• M samples Xi={xti , rti}, i=1,...,M
are used to fit gi (x), i =1,...,M and t=1,…,N
g x f x
1
Bias g
2 t t 2
N t
g x g x
1
Variance g t t 2
i
NM t i
1
g x g i x
M i
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 34
Bias/Variance Dilemma
• Example: gi(x)=2 has no variance and high bias
gi(x)= ∑t rti/N has lower bias with variance
• As we increase complexity,
bias decreases (a better fit to data) and
variance increases (fit varies more with data)
• Bias/Variance dilemma: (Geman et al., 1992)
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 35
f
f
bias
gi g
variance
Bias Variance Image is obtained from: [Link] 36
Polynomial Regression
Best fit “min error”
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 37
Best fit, “elbow”
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 38
Regression example
Coefficients increase in magnitude as order
increases:
1: [-0.0769, 0.0016]
2: [0.1682, -0.6657, 0.0080]
3: [0.4238, -2.5778, 3.4675, -0.0002
4: [-0.1093, 1.4356, -5.5007, 6.0454, -0.0019]
Idea: Penalize large coefficients
Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0) 39
Regularization
• New Cost Function
2
E w | X y t g x t | w i wi2
N
1
2 t 1
• Ridge RegressionR( w)w 2 w2
i i
• LASSO: R( w) w 1 wi
i
)+
i wi2 wT w
• = -)+λ
• -) +λ = 0 w +λ
• +λ)
• =
• Image is obtained from Nando Freitas’ lecture notes