Lec 04

The document provides an introduction to machine learning concepts, focusing on maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP) in the context of regression and classification. It discusses the derivation of ridge regression as a MAP estimate with a Gaussian prior, and explains the perceptron algorithm for classification, including its convergence theorem. Additionally, it introduces logistic regression as a model for binary classification based on probabilities.

Uploaded by

Dayem Mujahid

Introduction to Machine Learning

Ridge Regression, Perceptron, Logistic Regression

Irfan Chaudhary
EE 439: Introduction to Machine Learning
Department of Electrical Engineering
UET Lahore

18 February, 2026
Recap: MLE for a Coin Toss
Experiment: Toss a coin n times.

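$n_H$ heads, $n_T$ tails
Let
\[ P(H) = \theta, \quad \theta \in [0, 1] \]
Likelihood of the data:

\[ P(D; \theta) = \binom{n_H + n_T}{n_H} \theta^{n_H} (1 - \theta)^{n_T} \]

MLE estimate:
\[ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D; \theta) = \frac{n_H}{n_H + n_T} \]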

MLE = empirical frequency.
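
As a quick numerical check of this closed form, the sketch below compares the empirical frequency with a grid search over the log-likelihood. The counts (7 heads, 3 tails) and the function names are made up for illustration.

```python
import numpy as np

def coin_mle(n_heads, n_tails):
    """Closed-form MLE for P(H): the empirical frequency of heads."""
    return n_heads / (n_heads + n_tails)

def coin_log_likelihood(theta, n_heads, n_tails):
    """Log-likelihood without the binomial coefficient, which does not depend on theta."""
    return n_heads * np.log(theta) + n_tails * np.log(1.0 - theta)

# Hypothetical data: 7 heads and 3 tails.
n_heads, n_tails = 7, 3
theta_hat = coin_mle(n_heads, n_tails)

# Grid search over the open interval (0, 1) as an independent check.
grid = np.linspace(0.001, 0.999, 999)
theta_grid = grid[np.argmax(coin_log_likelihood(grid, n_heads, n_tails))]

print(theta_hat, theta_grid)
```

The two values should agree up to the grid spacing, since the log-likelihood is concave in θ with a single maximum.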

Recap: MAP for a Coin Toss
Now treat θ as a random variable.
Place a Beta prior:
\[ P(\theta) = \frac{1}{B(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} \]
Posterior (Bayes rule):

P(θ | D) ∝ P(D | θ)P(θ)

Maximizing the log-posterior gives:
\[ \hat{\theta}_{\text{MAP}} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2} \]

Interpretation:

α − 1 → prior successes
β − 1 → prior failures
MAP = smoothed estimate.
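
A small sketch of the smoothing effect, assuming a made-up data set of three tosses that all came up heads. The function name `coin_map` is illustrative, and the formula is used only for α, β ≥ 1, where the posterior mode lies in [0, 1].

```python
def coin_map(n_heads, n_tails, alpha, beta):
    """MAP estimate of P(H) under a Beta(alpha, beta) prior (mode of the posterior)."""
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Hypothetical data: 3 tosses, all heads.
n_heads, n_tails = 3, 0

mle = n_heads / (n_heads + n_tails)              # claims that tails is impossible
map_flat = coin_map(n_heads, n_tails, 1, 1)      # Beta(1, 1) is flat, so MAP equals MLE
map_smoothed = coin_map(n_heads, n_tails, 2, 2)  # one prior success and one prior failure

print(mle, map_flat, map_smoothed)
```

With α = β = 2 the estimate becomes (n_H + 1) / (n_H + n_T + 2), which pulls the extreme MLE back towards 0.5.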
MLE vs MAP: Big Picture
MLE
\[ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P(D; \theta) \]
▶ Uses data only
▶ θ treated as fixed
▶ Can overfit with small data
▶ Coin toss → empirical frequency

MAP

\[ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right] \]

▶ Uses data + prior belief

▶ θ treated as random
▶ Naturally introduces smoothing / regularization
▶ Coin toss → add-one (Laplace) smoothing when α = β = 2

Next: Apply the same logic to linear regression.

Connecting Back to Regression

We made regression probabilistic:

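\[ y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}, \quad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2) \]

Under MLE, this leads to:

\[ \hat{\theta}_{\text{MLE}} = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 \]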

So ordinary least squares is the MLE solution.
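
The sketch below illustrates this on synthetic data. The data-generating parameters and the helper `half_sse` are assumptions made for the example; `np.linalg.lstsq` is used as one standard way to obtain the least-squares minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumed for illustration): y = theta_true^T x + Gaussian noise.
n, d = 200, 3
theta_true = np.array([1.5, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(scale=0.1, size=n)

def half_sse(theta, X, y):
    """The objective from the slide: one half of the sum of squared residuals."""
    residuals = y - X @ theta
    return 0.5 * residuals @ residuals

# lstsq returns (solution, residuals, rank, singular values); keep the solution.
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_ols)
print(half_sse(theta_ols, X, y), half_sse(theta_true, X, y))
```

The fitted vector attains an objective value no larger than the true parameters do, because it minimizes the squared error on this particular sample.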

What About a MAP Estimate?

Now suppose we place a prior on θ.

θ̂MAP = arg max [log P(D | θ) + log P(θ)]

Assume a Gaussian prior:

\[ P(\theta) = \frac{1}{(2\pi\tau^2)^{d/2}} \exp\left( -\frac{\theta^\top \theta}{2\tau^2} \right) \]

Interpretation:
Smaller weights are more likely a priori.
MAP Objective Derivation

Note on the bias term:

For notational simplicity, we omit the intercept term. All
derivations extend directly to the case with a bias.
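
\[ \log P(D \mid \theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \text{const} \]

\[ \log P(\theta) = -\frac{1}{2\tau^2} \theta^\top \theta + \text{const} \]

Here "const" collects the terms that do not depend on $\theta$, so they do not affect the maximizer.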

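Combine and negate (arg max becomes arg min, so both signs change):

\[ \hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{1}{2\tau^2} \theta^\top \theta \right] \]
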
Important Note: The Bias Term

If we include b inside θ and apply the Gaussian prior:

\[ P(\theta) \propto \exp\left( -\frac{\|\theta\|^2}{2\tau^2} \right), \]
then the intercept is also regularized.

In practice, we often:
▶ Do not regularize the bias
▶ Equivalent to placing a flat (improper) prior on b

This ensures the model can shift freely without penalty.

Ridge Regression Emerges
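Multiply the objective by $\sigma^2$ (this does not change the minimizer):

\[ \hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\sigma^2}{2\tau^2} \theta^\top \theta \right] \]

Define:

\[ \lambda = \frac{\sigma^2}{\tau^2} \]

We obtain:

\[ \hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\lambda}{2} \theta^\top \theta \right] \]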

This is ridge regression.
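
Setting the gradient of the ridge objective to zero gives the linear system (XᵀX + λI)θ = Xᵀy. That closed form is not stated on the slide; the sketch below assumes it and checks numerically that the gradient vanishes at the solution. The data and the variances σ² and τ² are made-up values.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2.

    Setting the gradient to zero gives (X^T X + lam * I) theta = X^T y.
    """
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def ridge_objective(theta, X, y, lam):
    residuals = y - X @ theta
    return 0.5 * residuals @ residuals + 0.5 * lam * theta @ theta

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.3, size=50)

# lam = sigma^2 / tau^2; both variances are assumed values, not estimates.
sigma2, tau2 = 0.09, 1.0
lam = sigma2 / tau2
theta_ridge = ridge_fit(X, y, lam)

# The gradient of the objective should vanish at the minimizer.
grad = -X.T @ (y - X @ theta_ridge) + lam * theta_ridge
print(theta_ridge, np.abs(grad).max())
```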

Interpreting the Regularization Parameter
Recall:

\[ \lambda = \frac{\sigma^2}{\tau^2} \]

Effect of prior variance $\tau^2$:

▶ Small $\tau^2$ (strong prior) ⇒ large $\lambda$
▶ Large $\tau^2$ (weak prior) ⇒ small $\lambda$

Effect of noise variance $\sigma^2$:

▶ Larger noise ⇒ stronger regularization
▶ Smaller noise ⇒ weaker regularization

Regularization strength reflects a balance between data noise and prior confidence.
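
To see the effect of the prior variance in numbers, the sketch below fits ridge regression for three values of τ² at a fixed σ². The data, the variances, and the closed-form solver `ridge_fit` are assumptions made for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge solution of (X^T X + lam * I) theta = X^T y (assumed closed form)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + rng.normal(scale=0.5, size=30)

sigma2 = 0.25                  # assumed noise variance
taus2 = [100.0, 1.0, 0.01]     # weak, moderate, and strong prior
norms = []
for tau2 in taus2:
    lam = sigma2 / tau2
    theta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(theta))
    print(f"tau^2={tau2:g}  lambda={lam:g}  ||theta||={norms[-1]:.4f}")
```

The norm of the fitted weights shrinks as τ² decreases, which is the sense in which a stronger prior means stronger regularization.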
Interpretation of the Regularization Term

The additional term:

\[ \frac{\lambda}{2} \theta^\top \theta = \frac{\lambda}{2} \|\theta\|_2^2 \]
▶ Penalizes large parameter values
▶ Encourages smaller weights
▶ Controls model complexity
▶ Guarantees a unique solution for $\lambda > 0$ (even if $X^\top X$ is singular)

Important:
Ridge regression is MAP with a Gaussian prior.
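
The uniqueness claim can be illustrated with a deliberately degenerate design matrix. In the sketch below (made-up data, assumed closed-form ridge solution) a feature is duplicated, so XᵀX is singular, yet XᵀX + λI is invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Duplicate a feature so that the columns of X are linearly dependent.
x1 = rng.normal(size=20)
X = np.column_stack([x1, x1])
y = 3.0 * x1 + rng.normal(scale=0.1, size=20)

gram = X.T @ X
rank_gram = np.linalg.matrix_rank(gram)          # rank-deficient: OLS has many minimizers

lam = 0.1
regularized = gram + lam * np.eye(2)
rank_regularized = np.linalg.matrix_rank(regularized)  # full rank for any lam > 0

theta_ridge = np.linalg.solve(regularized, X.T @ y)
print(rank_gram, rank_regularized, theta_ridge)
```

By symmetry the penalty splits the weight equally between the two identical columns, which is the unique minimizer of the ridge objective here.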
Summary: Probabilistic Modeling

We model a family of distributions indexed by θ.

Discriminative model:
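\[ P_\theta(y \mid x) \]

Parameter estimation:

\[ \hat{\theta}_{\text{MLE}} = \arg\max_{\theta} P_\theta(D) \]

\[ \hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid D) \]

In MLE, $\theta$ is treated as fixed. In MAP, $\theta$ is treated as random with prior $P(\theta)$.
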
From Regression to Classification
We now switch from regression to classification.
Linearly separable data:

\[ \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \quad x^{(i)} \in \mathbb{R}^d, \quad y^{(i)} \in \{-1, +1\} \]

Image from [Link]

Equation of a Plane

Image from [Link]

Let

▶ $\vec{n}$ be a normal vector to the plane,
▶ $\vec{r}_0$ be a fixed point on the plane,
▶ $\vec{r}$ be any point on the plane.

Since $\vec{r} - \vec{r}_0$ lies in the plane and $\vec{n}$ is perpendicular to the plane,

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]

Decision Rule from the Plane Equation

We derived the plane:

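\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]

But for any point $\vec{r}$ in space:

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) \begin{cases} > 0 & \text{point lies on one side of the plane} \\ = 0 & \text{point lies on the plane} \\ < 0 & \text{point lies on the other side} \end{cases} \]

Thus the sign of the dot product determines the class.

\[ \hat{y} = \operatorname{sign}\left( \vec{n} \cdot (\vec{r} - \vec{r}_0) \right) \]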
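
A minimal sketch of this rule, using a hypothetical plane z = 1 in three dimensions. The function name `side_of_plane` and the test points are made up for illustration.

```python
import numpy as np

def side_of_plane(r, n, r0):
    """Return +1, 0, or -1 depending on which side of the plane the point r lies."""
    return int(np.sign(np.dot(n, r - r0)))

# Hypothetical plane z = 1: normal along z, passing through (0, 0, 1).
n = np.array([0.0, 0.0, 1.0])
r0 = np.array([0.0, 0.0, 1.0])

above = side_of_plane(np.array([2.0, -1.0, 3.0]), n, r0)   # z > 1
on = side_of_plane(np.array([5.0, 4.0, 1.0]), n, r0)       # z = 1
below = side_of_plane(np.array([0.0, 0.0, -2.0]), n, r0)   # z < 1
print(above, on, below)
```

Which side counts as the positive class depends only on the direction chosen for the normal vector.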
Summary: Equation of a Linear Decision Boundary

Geometric definition of a plane

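Let $\vec{n}$ be a normal vector and $\vec{r}_0$ a fixed point on the plane.

\[ \vec{n} \cdot (\vec{r} - \vec{r}_0) = 0 \]

Define $b = -\vec{n} \cdot \vec{r}_0$:

\[ \implies \vec{n} \cdot \vec{r} + b = 0 \]

Machine learning notation: Rename $\vec{r} \to x$, $\vec{n} \to w$:

\[ w^\top x + b = 0 \]

Decision rule: $\operatorname{sign}(w^\top x + b)$ determines the class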

Absorbing the Bias Term

Define augmented vectors:

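\[ \tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}, \quad \tilde{w} = \begin{bmatrix} b \\ w \end{bmatrix} \]

Then:

\[ w^\top x + b = \tilde{w}^\top \tilde{x} \]

Decision boundary:

\[ \tilde{w}^\top \tilde{x} = 0 \]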
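
In code, the augmentation amounts to prepending a column of ones to the data matrix and prepending b to the weight vector. The sketch below checks the identity on a few random points; the array names and values are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # five hypothetical points in R^3, one per row
w = np.array([0.5, -1.0, 2.0])
b = 0.25

# Prepend a constant 1 to every x and prepend b to w.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w_aug = np.concatenate([[b], w])

scores_original = X @ w + b
scores_augmented = X_aug @ w_aug
print(np.allclose(scores_original, scores_augmented))
```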
Prediction Rule

Classifier:

\[ \hat{y} = \operatorname{sign}(w^\top x + b) \]

Correct classification:

\[ y^{(i)} (w^\top x^{(i)} + b) > 0 \]

Misclassification:

\[ y^{(i)} (w^\top x^{(i)} + b) \le 0 \]

Perceptron Algorithm

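Here the bias is absorbed into $w$ through the augmented vectors.

Initialize: $w = 0$

Repeat until no mistakes:
    For each $i$:
        If $y^{(i)} w^\top x^{(i)} \le 0$, then update:

\[ w \leftarrow w + y^{(i)} x^{(i)} \]
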
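
A runnable sketch of the algorithm on augmented inputs. The `max_epochs` safeguard and the toy data set are additions for this example and are not part of the slide.

```python
import numpy as np

def perceptron(X, y, max_epochs=1000):
    """Perceptron on augmented inputs X (first column all ones), labels y in {-1, +1}.

    max_epochs is a safeguard added here, since the loop never ends
    when the data are not linearly separable.
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified, or exactly on the boundary
                w = w + y_i * x_i
                mistakes += 1
        if mistakes == 0:                 # a full pass with no updates: converged
            return w
    raise RuntimeError("no separating hyperplane found within max_epochs")

# Toy separable data (made up): the label is the sign of x1 - x2.
points = np.array([[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]])
labels = np.array([1, 1, -1, -1])
X_aug = np.hstack([np.ones((len(points), 1)), points])

w = perceptron(X_aug, labels)
print(w, np.sign(X_aug @ w))
```

On this data the loop makes two updates in the first pass and none in the second, so it stops with every point strictly on its correct side.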
Perceptron Convergence Theorem

Theorem.
Assume the data are linearly separable. Then the perceptron algorithm makes at most

\[ \frac{R^2}{\gamma^2} \]

mistakes, where

\[ R = \max_i \|x^{(i)}\|, \qquad \gamma = \min_i y^{(i)} w_*^\top x^{(i)} \]

for some unit vector $w_*$ that perfectly separates the data. If the data are not linearly separable, the algorithm does not converge.

For details see:
[Link]
lectures/[Link]
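
The bound can be checked empirically. The sketch below builds synthetic data that a known unit vector separates with margin at least 0.2, runs the perceptron from w = 0, and compares the number of updates with R²/γ². The data, the margin, and the separator `w_star` are all assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic separable data (assumed): label by a known unit vector w_star and
# keep only points at distance at least 0.2 from its hyperplane.
w_star = np.array([0.6, 0.8])                      # ||w_star|| = 1
X = rng.uniform(-1.0, 1.0, size=(500, 2))
X = X[np.abs(X @ w_star) >= 0.2]
y = np.sign(X @ w_star)

R = np.linalg.norm(X, axis=1).max()
gamma = (y * (X @ w_star)).min()
bound = R**2 / gamma**2

# Run the perceptron from w = 0 and count every update.
w = np.zeros(2)
mistakes = 0
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w = w + y_i * x_i
            mistakes += 1
            converged = False

print(f"mistakes = {mistakes}, bound R^2/gamma^2 = {bound:.1f}")
```

The bound depends only on the geometry of the data (R and γ), not on the number of points or the order in which they are visited.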
Aside: Perceptron as Gradient Descent
Perceptron update:

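\[ w \leftarrow w + y_i x_i \quad \text{if } y_i w^\top x_i \le 0 \]

Define the perceptron loss:

\[ \ell_i(w) = \max(0, -y_i w^\top x_i) \]

If the point is misclassified ($y_i w^\top x_i \le 0$), a (sub)gradient is:

\[ \nabla \ell_i(w) = -y_i x_i \]

Gradient descent step (step size 1):

\[ w \leftarrow w - \nabla \ell_i(w) = w + y_i x_i \]
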
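
The equivalence can be verified numerically on a single misclassified point. In the sketch below the point, the weights, and the function names are made up; central finite differences serve as an independent check of the gradient.

```python
import numpy as np

def perceptron_loss(w, x_i, y_i):
    """l_i(w) = max(0, -y_i * w^T x_i)."""
    return max(0.0, -y_i * (w @ x_i))

def perceptron_loss_grad(w, x_i, y_i):
    """Subgradient of l_i: -y_i * x_i on a mistake, zero otherwise."""
    if y_i * (w @ x_i) <= 0:
        return -y_i * x_i
    return np.zeros_like(w)

# Hypothetical misclassified point: true label +1 but w^T x < 0.
w = np.array([1.0, -2.0])
x_i = np.array([1.0, 1.0])
y_i = 1.0

# Central finite differences, one coordinate direction at a time.
eps = 1e-6
numeric = np.array([
    (perceptron_loss(w + eps * e, x_i, y_i) - perceptron_loss(w - eps * e, x_i, y_i)) / (2 * eps)
    for e in np.eye(2)
])

gd_step = w - 1.0 * perceptron_loss_grad(w, x_i, y_i)   # step size 1
perceptron_step = w + y_i * x_i
print(numeric, gd_step, perceptron_step)
```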
Not All Data is Linearly Separable

Observed data: T-shirt wearing vs temperature

[Figure: axes labelled Days and Temperature]

Blue = wore a T-shirt (y = 1)

Red = did not wear a T-shirt (y = 0)
We cannot linearly separate the data. What next?
Modeling T-Shirt vs Temperature
Let us try to model P(t-shirt | T).

[Figure: axes labelled Days, P(t-shirt | T), and Temperature]

\[ P(\text{t-shirt} \mid T) = \sigma(wT + b) = \frac{1}{1 + e^{-(wT + b)}} \]

$w$ controls how quickly the probability changes with temperature.
$b$ shifts the temperature at which $P = 0.5$, which occurs at $T = -b/w$.
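
A short sketch of the two roles of w and b, with made-up parameter values chosen so that the midpoint falls at 20 degrees. The function names are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def p_tshirt(T, w, b):
    """Modelled probability of wearing a T-shirt at temperature T."""
    return sigmoid(w * T + b)

# Made-up parameters: the probability crosses 0.5 at T = -b / w = 20.
w, b = 0.5, -10.0
for T in [10.0, 20.0, 30.0]:
    print(T, p_tshirt(T, w, b))

# A larger w (with b rescaled to keep the same midpoint) gives a sharper transition.
gentle = p_tshirt(22.0, 0.5, -10.0)
sharp = p_tshirt(22.0, 2.0, -40.0)
print(gentle, sharp)
```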
Logistic Regression Model

We model:

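\[ P(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^\top x) \]

\[ P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \]

Therefore,

\[ P(y \mid x; \theta) = h_\theta(x)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}. \]

Note the label encoding used now: $y \in \{0, 1\}$ (probability view), rather than $\{-1, +1\}$ as for the perceptron.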
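
The compact form simply selects h or 1 − h depending on the label. The sketch below evaluates it for both labels at one hypothetical input; `theta`, `x`, and the function names are made up for illustration.

```python
import numpy as np

def h(theta, x):
    """h_theta(x) = sigma(theta^T x)."""
    return 1.0 / (1.0 + np.exp(-(theta @ x)))

def likelihood(y, x, theta):
    """P(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    p = h(theta, x)
    return p**y * (1.0 - p) ** (1 - y)

# Hypothetical parameters and input (the first component of x is the constant 1).
theta = np.array([-1.0, 2.0])
x = np.array([1.0, 0.8])

p1 = likelihood(1, x, theta)
p0 = likelihood(0, x, theta)
print(p1, p0, p1 + p0)
```

The two probabilities sum to one, as they must for a Bernoulli model of a binary label.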

