
Lec 04

The document provides an introduction to machine learning concepts, focusing on maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP) in the context of regression and classification. It discusses the derivation of ridge regression as a MAP estimate with a Gaussian prior, and explains the perceptron algorithm for classification, including its convergence theorem. Additionally, it introduces logistic regression as a model for binary classification based on probabilities.

Uploaded by

Dayem Mujahid
Copyright
© © All Rights Reserved

Introduction to Machine Learning

Ridge Regression, Perceptron, Logistic Regression

Irfan Chaudhary
EE 439: Introduction to Machine Learning
Department of Electrical Engineering
UET Lahore

18 February, 2026
Recap: MLE for a Coin Toss
Experiment: Toss a coin n times.

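$n_H$ heads, $n_T$ tails
Let
\[ P(H) = \theta, \quad \theta \in [0, 1] \]
Likelihood of the data:
\[ P(D; \theta) = \binom{n_H + n_T}{n_H} \, \theta^{n_H} (1 - \theta)^{n_T} \]

MLE estimate:
\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D; \theta) = \frac{n_H}{n_H + n_T} \]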

MLE = empirical frequency.
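
As a quick numerical check of the coin-toss MLE, the sketch below compares the closed-form estimate with a brute-force search over a grid of θ values. The counts (7 heads, 3 tails), the grid resolution, and the function names are illustrative assumptions, not part of the lecture.

```python
from math import comb

def likelihood(theta, n_heads, n_tails):
    """Binomial likelihood P(D; theta) of n_heads heads and n_tails tails."""
    n = n_heads + n_tails
    return comb(n, n_heads) * theta**n_heads * (1 - theta)**n_tails

def theta_mle(n_heads, n_tails):
    """Closed-form MLE: the empirical frequency of heads."""
    return n_heads / (n_heads + n_tails)

# Hypothetical data: 7 heads and 3 tails.
n_heads, n_tails = 7, 3

# Brute-force maximization over theta = 0.000, 0.001, ..., 1.000.
grid = [k / 1000 for k in range(1001)]
best_on_grid = max(grid, key=lambda t: likelihood(t, n_heads, n_tails))

print(theta_mle(n_heads, n_tails), best_on_grid)
```

Both numbers should agree up to the grid spacing, which is what "MLE = empirical frequency" predicts.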


Recap: MAP for a Coin Toss
Now treat θ as a random variable.
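Place a Beta prior:
\[ P(\theta) = \frac{1}{B(\alpha, \beta)} \, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \]
Posterior (Bayes rule):

\[ P(\theta \mid D) \propto P(D \mid \theta) \, P(\theta) \]

Maximizing the log-posterior gives:
\[ \hat{\theta}_{\mathrm{MAP}} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2} \]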

Interpretation:

α − 1 → prior successes
β − 1 → prior failures
MAP = smoothed estimate.
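
A minimal sketch of the smoothing effect, assuming a tiny hypothetical sample of 3 heads and 0 tails. The function name and the prior settings are illustrative choices; the formula is the MAP expression from this slide.

```python
def theta_map(n_heads, n_tails, alpha, beta):
    """MAP estimate under a Beta(alpha, beta) prior.

    Assumes alpha >= 1, beta >= 1 and a positive denominator.
    """
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Hypothetical data: 3 tosses, all heads.
n_heads, n_tails = 3, 0

flat = theta_map(n_heads, n_tails, alpha=1, beta=1)      # flat prior: same as MLE
smoothed = theta_map(n_heads, n_tails, alpha=2, beta=2)  # one prior success, one prior failure

print(flat, smoothed)
```

With the flat prior the estimate claims tails can never occur; with α = β = 2 the estimate is pulled toward 1/2, which illustrates why MAP is called a smoothed estimate.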
MLE vs MAP: Big Picture
MLE
\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D; \theta) \]
▶ Uses data only
▶ θ treated as fixed
▶ Can overfit with small data
▶ Coin toss → empirical frequency

MAP
 
\[ \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \bigl[ \log P(D \mid \theta) + \log P(\theta) \bigr] \]

▶ Uses data + prior belief


▶ θ treated as random
▶ Naturally introduces smoothing / regularization
▶ Coin toss → Laplace (add-one) smoothing when α = β = 2

Next: Apply the same logic to linear regression.


Connecting Back to Regression

We made regression probabilistic:

\[ y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) \]

Under MLE, this leads to:


\[ \hat{\theta}_{\mathrm{MLE}} = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^\top x^{(i)} \bigr)^2 \]

So ordinary least squares is the MLE solution.
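
The step from the Gaussian noise model to least squares can be written out as follows. This is a sketch that assumes the n samples are independent and reads P(D; θ) as the likelihood of the targets given the inputs.

```latex
\begin{align*}
\log P(D;\theta)
  &= \sum_{i=1}^{n} \log \left[ \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\!\left( -\frac{\bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2}{2\sigma^2} \right) \right] \\
  &= -\frac{n}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y^{(i)} - \theta^\top x^{(i)}\bigr)^2 .
\end{align*}
```

The first term does not depend on θ and the factor 1/σ² is a positive constant, so maximizing the log-likelihood over θ is the same as minimizing the sum of squared residuals.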


What About a MAP Estimate?

Now suppose we place a prior on θ.

\[ \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \bigl[ \log P(D \mid \theta) + \log P(\theta) \bigr] \]

Assume a Gaussian prior:


\[ P(\theta) = \frac{1}{(2\pi\tau^2)^{d/2}} \exp\!\left( -\frac{\theta^\top \theta}{2\tau^2} \right) \]

Interpretation:
Smaller weights are more likely a priori.
MAP Objective Derivation

Note on the bias term:


For notational simplicity, we omit the intercept term. All
derivations extend directly to the case with a bias.
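
Up to additive constants that do not depend on θ:

\[ \log P(D \mid \theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^\top x^{(i)} \bigr)^2 + \text{const} \]

\[ \log P(\theta) = -\frac{1}{2\tau^2} \theta^\top \theta + \text{const} \]

Combine (maximizing the sum is the same as minimizing its negative, so both signs change):

\[ \hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^\top x^{(i)} \bigr)^2 + \frac{1}{2\tau^2} \theta^\top \theta \right] \]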
Important Note: The Bias Term

If we include b inside θ and apply the Gaussian prior:

\[ P(\theta) \propto \exp\!\left( -\frac{\|\theta\|^2}{2\tau^2} \right), \]

then the intercept is also regularized.

In practice, we often:
▶ Do not regularize the bias
▶ Equivalent to placing a flat (improper) prior on b

This ensures the model can shift freely without penalty.


Ridge Regression Emerges
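Multiply the objective by $\sigma^2$ (this does not change the minimizer):
\[ \hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^\top x^{(i)} \bigr)^2 + \frac{\sigma^2}{2\tau^2} \theta^\top \theta \right] \]

Define:

\[ \lambda = \frac{\sigma^2}{\tau^2} \]

We obtain:
\[ \hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \bigl( y^{(i)} - \theta^\top x^{(i)} \bigr)^2 + \frac{\lambda}{2} \theta^\top \theta \right] \]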

This is ridge regression.
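
To see the shrinkage concretely, here is a minimal sketch for the one-dimensional case without an intercept, where setting the derivative of the ridge objective to zero gives θ = Σxy / (Σx² + λ). The data, the variance values, and the function name are hypothetical.

```python
def ridge_1d(xs, ys, lam):
    """Minimize 0.5 * sum((y - theta*x)**2) + 0.5 * lam * theta**2 over scalar theta.

    Setting the derivative to zero gives theta = sum(x*y) / (sum(x*x) + lam).
    Assumes sum(x*x) + lam > 0.
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

# Hypothetical noiseless data generated by y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

# Hypothetical variances: noise variance sigma^2 and prior variance tau^2.
sigma2, tau2 = 7.0, 0.5
lam = sigma2 / tau2          # lambda = sigma^2 / tau^2

theta_ols = ridge_1d(xs, ys, 0.0)    # lam = 0 recovers ordinary least squares
theta_ridge = ridge_1d(xs, ys, lam)  # lam > 0 shrinks the weight toward zero

print(theta_ols, theta_ridge)
```

The ridge estimate is smaller in magnitude than the least-squares estimate, which matches the Gaussian prior's preference for small weights.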


Interpreting the Regularization Parameter
Recall:

\[ \lambda = \frac{\sigma^2}{\tau^2} \]

Effect of prior variance $\tau^2$:

▶ Small $\tau^2$ (strong prior) ⇒ large $\lambda$
▶ Large $\tau^2$ (weak prior) ⇒ small $\lambda$

Effect of noise variance $\sigma^2$:

▶ Larger noise ⇒ stronger regularization
▶ Smaller noise ⇒ weaker regularization

Regularization strength reflects a balance between data noise and prior confidence.
Interpretation of the Regularization Term

The additional term:

\[ \frac{\lambda}{2} \theta^\top \theta = \frac{\lambda}{2} \|\theta\|_2^2 \]
▶ Penalizes large parameter values
▶ Encourages smaller weights
▶ Controls model complexity
▶ Guarantees a unique solution when $\lambda > 0$ (even if $X^\top X$ is singular)

Important:
Ridge regression is MAP with a Gaussian prior.
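
The uniqueness claim can be checked with a short derivation. As an assumption of this sketch, X denotes the n × d design matrix whose rows are the inputs $x^{(i)\top}$, and y the vector of targets; this matrix notation is not defined on the slide itself.

```latex
\begin{align*}
J(\theta) &= \tfrac{1}{2}\,\|y - X\theta\|_2^2 + \tfrac{\lambda}{2}\,\|\theta\|_2^2, \\
\nabla J(\theta) &= -X^\top (y - X\theta) + \lambda\theta = 0
  \;\Longrightarrow\; (X^\top X + \lambda I)\,\theta = X^\top y, \\
v^\top (X^\top X + \lambda I)\, v &= \|Xv\|_2^2 + \lambda \|v\|_2^2 > 0
  \quad \text{for all } v \neq 0 \text{ when } \lambda > 0 .
\end{align*}
```

So for λ > 0 the matrix $X^\top X + \lambda I$ is positive definite, hence invertible, and $\hat{\theta} = (X^\top X + \lambda I)^{-1} X^\top y$ is the unique minimizer.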
Summary: Probabilistic Modeling

We model a family of distributions indexed by θ.

Discriminative model:
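\[ P_\theta(y \mid x) \]

Parameter estimation:

\[ \hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P_\theta(D) \]

\[ \hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) \]

In MLE, θ is treated as fixed. In MAP, θ is treated as random with prior P(θ).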
From Regression to Classification
We now switch from regression to classification.
Linearly separable data:

\[ \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \quad x^{(i)} \in \mathbb{R}^d, \quad y^{(i)} \in \{-1, +1\} \]

Equation of a Plane


Let
▶ $\vec{n}$ be a normal vector to the plane,
▶ $\vec{r}_0$ be a fixed point on the plane,
▶ $\vec{r}$ be any point on the plane.
Since $\vec{r} - \vec{r}_0$ lies in the plane and $\vec{n}$ is perpendicular to the plane,

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]
Decision Rule from the Plane Equation

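We derived the plane:

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]

For any point $\vec{r}$ in space:

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0)
\begin{cases}
> 0 & \text{point lies on one side of the plane} \\
= 0 & \text{point lies on the plane} \\
< 0 & \text{point lies on the other side}
\end{cases} \]

Thus the sign of the dot product determines the class.

\[ \hat{y} = \operatorname{sign}\bigl( \vec{n} \cdot (\vec{r} - \vec{r}_0) \bigr) \]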
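
The side test can be sketched in a few lines. The plane used here (the horizontal plane z = 2 in three dimensions) and the function names are hypothetical examples chosen so the answers are easy to verify by hand.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def side_of_plane(n, r0, r):
    """Return +1, 0, or -1 according to the sign of n . (r - r0)."""
    s = dot(n, [ri - r0i for ri, r0i in zip(r, r0)])
    return (s > 0) - (s < 0)

# Hypothetical plane z = 2: normal pointing up, passing through (0, 0, 2).
n = [0.0, 0.0, 1.0]
r0 = [0.0, 0.0, 2.0]

print(side_of_plane(n, r0, [1.0, 1.0, 5.0]))   # point above the plane
print(side_of_plane(n, r0, [3.0, -1.0, 2.0]))  # point on the plane
print(side_of_plane(n, r0, [0.0, 0.0, 0.0]))   # point below the plane
```

Which side counts as the positive class depends only on the direction chosen for the normal vector; flipping the normal flips every label.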
Summary: Equation of a Linear Decision Boundary

Geometric definition of a plane


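Let $\vec{n}$ be a normal vector and $\vec{r}_0$ a fixed point on the plane.

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]
Define $b = -\vec{n} \cdot \vec{r}_0$:

\[ \implies \vec{n} \cdot \vec{r} + b = 0 \]
Machine learning notation: Rename $\vec{r} \to x$, $\vec{n} \to w$:

\[ w^\top x + b = 0 \]

Decision rule: $\operatorname{sign}(w^\top x + b)$ determines the class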


Absorbing the Bias Term

Define augmented vectors:


   
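\[ \tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} b \\ w \end{bmatrix} \]
Then:

\[ w^\top x + b = \tilde{w}^\top \tilde{x} \]
Decision boundary:

\[ \tilde{w}^\top \tilde{x} = 0 \]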
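
A small sketch confirming that the augmented form gives the same score as the explicit bias. The weight, bias, and input values are arbitrary hypothetical numbers, and the helper names are illustrative.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def augment_x(x):
    """Prepend a constant 1 to the input vector."""
    return [1.0] + list(x)

def augment_w(w, b):
    """Prepend the bias b to the weight vector."""
    return [b] + list(w)

# Hypothetical weights, bias, and input.
w, b = [2.0, -1.0], 0.5
x = [3.0, 4.0]

score_explicit = dot(w, x) + b
score_augmented = dot(augment_w(w, b), augment_x(x))

print(score_explicit, score_augmented)
```

Because the two scores coincide, any algorithm written for a boundary through the origin can be applied to the augmented vectors without treating the bias separately.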
Prediction Rule

Classifier:

\[ \hat{y} = \operatorname{sign}(w^\top x + b) \]
Correct classification:

\[ y^{(i)} \bigl( w^\top x^{(i)} + b \bigr) > 0 \]

Misclassification:

\[ y^{(i)} \bigl( w^\top x^{(i)} + b \bigr) \le 0 \]
Perceptron Algorithm

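Here $w$ and $x^{(i)}$ are the augmented vectors, so the bias is absorbed into $w$.

Initialize:
\[ w = 0 \]
Repeat until no mistakes:
For each $i$:
If
\[ y^{(i)} w^\top x^{(i)} \le 0 \]
then update:

\[ w \leftarrow w + y^{(i)} x^{(i)} \]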
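
The loop above can be sketched in plain Python as follows. The four training points are hypothetical, each input is already augmented with a leading 1, and the `max_epochs` cap is an addition of this sketch (not part of the slide) so that the loop also terminates on data that is not separable.

```python
def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def perceptron(xs, ys, max_epochs=1000):
    """Train on augmented inputs xs and labels ys in {-1, +1}.

    Returns (w, mistakes). max_epochs is a safety cap assumed by this sketch.
    """
    w = [0.0] * len(xs[0])
    mistakes = 0
    for _ in range(max_epochs):
        made_mistake = False
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:              # misclassified or on the boundary
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
                made_mistake = True
        if not made_mistake:                    # a full pass with no update: stop
            break
    return w, mistakes

# Hypothetical linearly separable data (first component is the constant 1).
xs = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
ys = [1, 1, -1, -1]

w, mistakes = perceptron(xs, ys)
print(w, mistakes)
print(all(y * dot(w, x) > 0 for x, y in zip(xs, ys)))
```

The final line checks that every training point satisfies the correct-classification condition with the learned weights.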
Perceptron Convergence Theorem

Theorem.
Assume the data are linearly separable. Then the perceptron algorithm makes at most
\[ \frac{R^2}{\gamma^2} \]
mistakes, where

\[ R = \max_{i} \|x^{(i)}\|, \qquad \gamma = \min_{i} \, y^{(i)} w_*^\top x^{(i)} \]

for some unit vector $w_*$ that perfectly separates the data. If the data are not separable, the algorithm does not converge. For details see:
[Link]
lectures/[Link]
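
As an illustration of the bound, the sketch below counts perceptron mistakes on a small hypothetical dataset and compares the count with R²/γ². The data, the choice of separating direction `w_star`, and the `max_epochs` safety cap are assumptions of this example; any unit vector that separates the data yields a valid bound, though not necessarily the tightest one.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return sqrt(dot(u, u))

def count_mistakes(xs, ys, max_epochs=1000):
    """Run the perceptron and return the number of updates (mistakes)."""
    w = [0.0] * len(xs[0])
    mistakes = 0
    for _ in range(max_epochs):             # safety cap, an assumption of this sketch
        clean_pass = True
        for x, y in zip(xs, ys):
            if y * dot(w, x) <= 0:
                w = [wj + y * xj for wj, xj in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:
            break
    return mistakes

def mistake_bound(xs, ys, w_star):
    """Compute R^2 / gamma^2 for a separating direction w_star (normalized here)."""
    scale = norm(w_star)
    unit = [wj / scale for wj in w_star]
    R = max(norm(x) for x in xs)
    gamma = min(y * dot(unit, x) for x, y in zip(xs, ys))
    if gamma <= 0:
        raise ValueError("w_star does not separate the data")
    return R * R / (gamma * gamma)

# Hypothetical separable data and a hand-picked separating direction.
xs = [[1.0, 2.0, 2.0], [1.0, 3.0, 1.0], [1.0, -1.0, -2.0], [1.0, -2.0, -1.0]]
ys = [1, 1, -1, -1]
w_star = [1.0, 2.0, 2.0]

mistakes = count_mistakes(xs, ys)
bound = mistake_bound(xs, ys, w_star)
print(mistakes, bound)
```

The observed number of mistakes should never exceed the bound, whatever order the points are visited in.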
Aside: Perceptron as Gradient Descent
Perceptron update:

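\[ w \leftarrow w + y_i x_i \quad \text{if } y_i w^\top x_i \le 0 \]

Define the perceptron loss:

\[ \ell_i(w) = \max(0, -y_i w^\top x_i) \]

If the point is misclassified ($y_i w^\top x_i \le 0$), a (sub)gradient is:

\[ \nabla \ell_i(w) = -y_i x_i \]

Gradient descent step (step size 1):

\[ w \leftarrow w - \nabla \ell_i(w) = w + y_i x_i \]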
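
The equivalence can be checked numerically. In this sketch the function names and the sample values are hypothetical, and the subgradient at the kink (where the score is exactly 0) is chosen to be −y·x so that it matches the perceptron's "≤ 0" update condition.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def perceptron_loss(w, x, y):
    """Perceptron loss max(0, -y * w.x) for one example."""
    return max(0.0, -y * dot(w, x))

def perceptron_subgradient(w, x, y):
    """Return -y*x when y * w.x <= 0, otherwise the zero vector."""
    if y * dot(w, x) <= 0:
        return [-y * xj for xj in x]
    return [0.0] * len(x)

def sgd_step(w, x, y, lr=1.0):
    """One stochastic (sub)gradient step on the perceptron loss."""
    g = perceptron_subgradient(w, x, y)
    return [wj - lr * gj for wj, gj in zip(w, g)]

# Misclassified example: the step equals the perceptron update w + y*x.
print(sgd_step([0.0, 0.0], [1.0, 2.0], -1))

# Correctly classified example: the weights are left unchanged.
print(sgd_step([1.0, 1.0], [1.0, 2.0], 1))
```

With step size 1 the stochastic step reproduces the perceptron update exactly; other step sizes only rescale the weights when starting from w = 0.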
Not All Data is Linearly Separable

Observed data: t-shirt wearing vs temperature

[Figure: observed days, with axes labeled Days and Temperature. Blue = wore a T-shirt (y = 1); red = did not wear a T-shirt (y = 0).]
We cannot linearly separate the data. What next?
Modeling T-Shirt vs Temperature
Let us try to model P(t-shirt | T).

[Figure: axes labeled Days, P(t-shirt | T), and Temperature.]

\[ P(\text{t-shirt} \mid T) = \sigma(wT + b) = \frac{1}{1 + e^{-(wT + b)}} \]
w controls how quickly the probability changes with temperature.
b shifts the temperature at which P = 0.5 (this happens at T = −b/w).
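
A minimal sketch of this model, assuming hypothetical parameters w = 0.5 and b = −10 (so the probability crosses 0.5 at T = 20). The function names are illustrative, and the sigmoid is written in two branches only to avoid overflow for very negative inputs.

```python
from math import exp

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), evaluated in an overflow-safe way."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def p_tshirt(temp, w, b):
    """Modeled probability of wearing a T-shirt at temperature temp."""
    return sigmoid(w * temp + b)

# Hypothetical parameters: P = 0.5 at T = -b/w = 20.
w, b = 0.5, -10.0

for temp in (10.0, 20.0, 30.0):
    print(temp, p_tshirt(temp, w, b))
```

Increasing w makes the transition around T = 20 sharper, while changing b moves the crossing temperature without altering the steepness.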
Logistic Regression Model

We model:

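\[ P(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^\top x) \]

\[ P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \]
Therefore,

\[ P(y \mid x; \theta) = h_\theta(x)^{y} \, \bigl( 1 - h_\theta(x) \bigr)^{1 - y}. \]

Note the label encoding used now:

\[ y \in \{0, 1\} \quad \text{(probability view)} \]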
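
The compact form can be checked against the two cases directly. The sketch below uses hypothetical parameters and inputs; the summed log-likelihood at the end assumes independent examples and is included only as an illustration of how the per-example formula would be used, not as the lecture's stated procedure.

```python
from math import exp, log

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sigmoid(z):
    """Logistic function, evaluated in an overflow-safe way."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    ez = exp(z)
    return ez / (1.0 + ez)

def h(theta, x):
    """Model probability P(y = 1 | x; theta)."""
    return sigmoid(dot(theta, x))

def likelihood(y, x, theta):
    """Compact Bernoulli form h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    p = h(theta, x)
    return p**y * (1 - p) ** (1 - y)

def log_likelihood(xs, ys, theta):
    """Sum of per-example log-probabilities (assumes independent examples)."""
    return sum(log(likelihood(y, x, theta)) for x, y in zip(xs, ys))

# Hypothetical parameters and inputs (first component is the constant 1).
theta = [-1.0, 0.5]
xs = [[1.0, 4.0], [1.0, 0.0]]
ys = [1, 0]

p = h(theta, xs[0])
print(likelihood(1, xs[0], theta) == p, likelihood(0, xs[0], theta) == 1 - p)
print(log_likelihood(xs, ys, theta))
```

Setting y = 1 selects the factor h and setting y = 0 selects 1 − h, which is why the {0, 1} encoding is convenient for this probability view.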
