Introduction to Machine Learning
Ridge Regression, Perceptron, Logistic Regression
Irfan Chaudhary
EE 439: Introduction to Machine Learning
Department of Electrical Engineering
UET Lahore
18 February, 2026
Recap: MLE for a Coin Toss
Experiment: toss a coin $n$ times and observe $n_H$ heads and $n_T$ tails, so $n = n_H + n_T$.
Let
P(H) = θ, θ ∈ [0, 1]
Likelihood of the data:
$$P(D; \theta) = \binom{n_H + n_T}{n_H}\, \theta^{n_H} (1 - \theta)^{n_T}$$
MLE estimate:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D; \theta) = \frac{n_H}{n_H + n_T}$$
MLE = empirical frequency.
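As a quick numerical check of this result, the sketch below evaluates the log-likelihood on a grid of θ values and compares the grid maximizer with the closed form. The counts (7 heads, 3 tails) and the function name are made up for illustration.

```python
import math

def log_likelihood(theta, n_heads, n_tails):
    """Log-likelihood of theta; the constant binomial coefficient is dropped."""
    return n_heads * math.log(theta) + n_tails * math.log(1.0 - theta)

n_heads, n_tails = 7, 3  # hypothetical counts

# Closed-form MLE: the empirical frequency of heads.
theta_mle = n_heads / (n_heads + n_tails)

# Grid search over the open interval (0, 1) as a sanity check.
grid = [k / 1000 for k in range(1, 1000)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, n_heads, n_tails))

print(theta_mle, theta_grid)
```

Dropping the binomial coefficient is harmless here because it does not depend on θ and so does not move the maximizer.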
Recap: MAP for a Coin Toss
Now treat θ as a random variable.
Place a Beta prior:
$$P(\theta) = \frac{1}{B(\alpha, \beta)}\, \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
Posterior (Bayes rule):
P(θ | D) ∝ P(D | θ)P(θ)
Maximizing the log-posterior gives:
$$\hat{\theta}_{\mathrm{MAP}} = \frac{n_H + \alpha - 1}{n_H + n_T + \alpha + \beta - 2}$$
Interpretation:
α − 1 → prior successes
β − 1 → prior failures
MAP = smoothed estimate.
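To see the smoothing effect on a small sample, here is a minimal sketch of the MAP formula. The function name and the sample (3 heads, 0 tails) are illustrative assumptions.

```python
def theta_map(n_heads, n_tails, alpha, beta):
    """MAP estimate of P(H) under a Beta(alpha, beta) prior.

    Assumes the denominator is positive.
    """
    return (n_heads + alpha - 1) / (n_heads + n_tails + alpha + beta - 2)

# Hypothetical small sample: 3 heads, 0 tails.
flat = theta_map(3, 0, 1, 1)      # alpha = beta = 1 (flat prior): same as the MLE
smoothed = theta_map(3, 0, 2, 2)  # one prior success and one prior failure

print(flat, smoothed)  # 1.0 0.8
```

With the flat prior the estimate says tails is impossible after only three tosses; with α = β = 2 it is pulled toward 0.5.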
MLE vs MAP: Big Picture
MLE
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P(D; \theta)$$
▶ Uses data only
▶ θ treated as fixed
▶ Can overfit with small data
▶ Coin toss → empirical frequency
MAP
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$
▶ Uses data + prior belief
▶ θ treated as random
▶ Naturally introduces smoothing / regularization
▶ Coin toss → smoothed frequency (Laplace smoothing when α = β = 2)
Next: Apply the same logic to linear regression.
Connecting Back to Regression
We made regression probabilistic:
$$y^{(i)} = \theta^\top x^{(i)} + \epsilon^{(i)}, \qquad \epsilon^{(i)} \sim \mathcal{N}(0, \sigma^2)$$
Under MLE, this leads to:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\min_{\theta} \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2$$
So ordinary least squares is the MLE solution.
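The equivalence can be illustrated numerically. The sketch below draws synthetic data from the Gaussian noise model and solves the least-squares problem two ways with NumPy. The ground-truth vector, sample size, and noise level are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
theta_true = np.array([1.5, -2.0, 0.5])             # hypothetical ground truth
X = rng.normal(size=(n, d))
y = X @ theta_true + rng.normal(scale=0.1, size=n)  # Gaussian noise, sigma = 0.1

# Least-squares solution, which is the MLE under the Gaussian noise model.
theta_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Same answer from the normal equations X^T X theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_mle)
print(theta_normal)
```

Note that the noise variance σ² does not appear in the solver: it scales the objective but does not change the minimizer.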
What About a MAP Estimate?
Now suppose we place a prior on θ.
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} \left[ \log P(D \mid \theta) + \log P(\theta) \right]$$
Assume a Gaussian prior:
$$P(\theta) = \frac{1}{(2\pi\tau^2)^{d/2}} \exp\left( -\frac{\theta^\top \theta}{2\tau^2} \right)$$
Interpretation:
Smaller weights are more likely a priori.
MAP Objective Derivation
Note on the bias term:
For notational simplicity, we omit the intercept term. All
derivations extend directly to the case with a bias.
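Up to additive constants that do not depend on θ:
$$\log P(D \mid \theta) = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \text{const}$$
$$\log P(\theta) = -\frac{1}{2\tau^2}\, \theta^\top \theta + \text{const}$$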
Combine (note both signs change):
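$$\hat{\theta}_{\mathrm{MAP}} = \arg\min_{\theta} \left[ \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{1}{2\tau^2}\, \theta^\top \theta \right]$$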
Important Note: The Bias Term
If we include b inside θ and apply the Gaussian prior:
$$P(\theta) \propto \exp\left( -\frac{\|\theta\|^2}{2\tau^2} \right),$$
then the intercept is also regularized.
In practice, we often do not regularize the bias.
▶ This is equivalent to placing a flat (improper) prior on b
▶ It ensures the model can shift freely without penalty
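One common way to leave the bias unpenalized is to center the data, fit the penalized weights on the centered data, and then recover b from the means. The sketch below assumes a squared-error objective with an L2 penalty of weight `lam` on w only. The function name, `lam`, and the synthetic data are illustrative choices, not part of the lecture.

```python
import numpy as np

def ridge_fit_with_bias(X, y, lam):
    """Penalized least squares that shrinks w but leaves the intercept b free."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centering removes the intercept
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w                  # optimal unpenalized intercept
    return w, b

rng = np.random.default_rng(2)
X = rng.normal(loc=5.0, size=(100, 2))
y = X @ np.array([1.0, -2.0]) + 10.0 + rng.normal(scale=0.1, size=100)

w, b = ridge_fit_with_bias(X, y, lam=1.0)
w_big, b_big = ridge_fit_with_bias(X, y, lam=1e9)   # very strong penalty on w
print(w, b)
print(w_big, b_big, y.mean())
```

Even with a very strong penalty the weights shrink toward zero while the intercept stays free and settles near the mean of y.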
Ridge Regression Emerges
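Multiply the objective by $\sigma^2 > 0$ (this does not change the minimizer):
$$\hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\sigma^2}{2\tau^2}\, \theta^\top \theta \right]$$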
Define:
$$\lambda = \frac{\sigma^2}{\tau^2}$$
We obtain:
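$$\hat{\theta} = \arg\min_{\theta} \left[ \frac{1}{2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^\top x^{(i)} \right)^2 + \frac{\lambda}{2}\, \theta^\top \theta \right]$$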
This is ridge regression.
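Setting the gradient of this objective to zero gives the linear system (XᵀX + λI) θ = Xᵀy, where X stacks the inputs x⁽ⁱ⁾ as rows. A minimal NumPy sketch of that solution follows. The function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize 0.5 * ||y - X theta||^2 + 0.5 * lam * ||theta||^2 (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
y = X @ np.array([2.0, -1.0, 0.0, 3.0]) + rng.normal(scale=0.5, size=50)

# Larger lam shrinks the weight vector toward zero.
norms = []
for lam in (0.0, 1.0, 100.0):
    theta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(theta))
    print(lam, theta)
```

With `lam = 0.0` this reduces to ordinary least squares, which matches the MLE case above.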
Interpreting the Regularization Parameter
Recall:
$$\lambda = \frac{\sigma^2}{\tau^2}$$
Effect of prior variance $\tau^2$:
▶ Small $\tau^2$ (strong prior) ⇒ large λ
▶ Large $\tau^2$ (weak prior) ⇒ small λ
Effect of noise variance $\sigma^2$:
▶ Larger noise ⇒ stronger regularization
▶ Smaller noise ⇒ weaker regularization
Regularization strength reflects a balance between data noise and
prior confidence.
Interpretation of the Regularization Term
The additional term:
$$\frac{\lambda}{2}\, \theta^\top \theta = \frac{\lambda}{2}\, \|\theta\|_2^2$$
▶ Penalizes large parameter values
▶ Encourages smaller weights
▶ Controls model complexity
▶ Guarantees a unique solution for λ > 0 (even if $X^\top X$ is singular)
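The uniqueness point can be seen on a tiny example with a duplicated feature, where XᵀX is singular and ordinary least squares has infinitely many solutions. The data and the value of `lam` below are made up for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x, x])      # duplicated feature, so X^T X is singular
y = 2.0 * x

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)
print(rank)                      # 1, which is less than the 2 columns

lam = 0.1                        # any lam > 0 makes the system invertible
theta = np.linalg.solve(gram + lam * np.eye(2), X.T @ y)
print(theta)                     # the weight is split equally between the copies
```

Among all weight pairs that fit the data equally well, the penalty selects the one with the smallest norm, which is why the two copies share the weight.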
Important:
Ridge regression is MAP with a Gaussian prior.
Summary: Probabilistic Modeling
We model a family of distributions indexed by θ.
Discriminative model:
$$P_\theta(y \mid x)$$
Parameter estimation:
$$\hat{\theta}_{\mathrm{MLE}} = \arg\max_{\theta} P_\theta(D)$$
$$\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta} P(\theta \mid D)$$
In MLE, θ is treated as fixed. In MAP, θ is treated as random with
prior P(θ).
From Regression to Classification
We now switch from regression to classification.
Linearly separable data:
$$\{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \qquad x^{(i)} \in \mathbb{R}^d, \qquad y^{(i)} \in \{-1, +1\}$$
Equation of a Plane
Let
▶ $\vec{n}$ be a normal vector to the plane,
▶ $\vec{r}_0$ be a fixed point on the plane,
▶ $\vec{r}$ be any point on the plane.
Since $\vec{r} - \vec{r}_0$ lies in the plane and $\vec{n}$ is perpendicular to the plane,
$$\vec{n} \cdot (\vec{r} - \vec{r}_0) = 0$$
Decision Rule from the Plane Equation
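We derived the plane:
$$\vec{n} \cdot (\vec{r} - \vec{r}_0) = 0$$
But for any point $\vec{r}$ in space:
$$\vec{n} \cdot (\vec{r} - \vec{r}_0) \; \begin{cases} > 0 & \text{point lies on one side of the plane} \\ = 0 & \text{point lies on the plane} \\ < 0 & \text{point lies on the other side} \end{cases}$$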
Thus the sign of the dot product determines the class.
$$\hat{y} = \operatorname{sign}\big( \vec{n} \cdot (\vec{r} - \vec{r}_0) \big)$$
Summary: Equation of a Linear Decision Boundary
Geometric definition of a plane
Let $\vec{n}$ be a normal vector and $\vec{r}_0$ a fixed point on the plane.
$$\vec{n} \cdot (\vec{r} - \vec{r}_0) = 0$$
Define $b = -\vec{n} \cdot \vec{r}_0$
$$\Longrightarrow \quad \vec{n} \cdot \vec{r} + b = 0$$
Machine learning notation: rename $\vec{r} \to x$, $\vec{n} \to w$:
$$w^\top x + b = 0$$
Decision rule: $\operatorname{sign}(w^\top x + b)$ determines the class
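The chain from plane geometry to the decision rule can be traced in a few lines. This sketch builds b from a normal vector and a point on the plane, then classifies points by the sign of wᵀx + b. The plane (z = 2), the test points, and the function name are illustrative assumptions.

```python
def classify(w, b, x):
    """Return +1, 0 or -1 according to the sign of w . x + b."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return (s > 0) - (s < 0)

# Plane z = 2: normal along the z-axis, passing through r0 = (0, 0, 2).
w = (0.0, 0.0, 1.0)
r0 = (0.0, 0.0, 2.0)
b = -sum(wi * ri for wi, ri in zip(w, r0))   # b = -n . r0

print(classify(w, b, (1.0, 1.0, 5.0)))    # 1  (above the plane)
print(classify(w, b, (3.0, -1.0, 2.0)))   # 0  (on the plane)
print(classify(w, b, (0.0, 0.0, -1.0)))   # -1 (below the plane)
```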
Absorbing the Bias Term
Define augmented vectors:
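$$\tilde{x} = \begin{bmatrix} 1 \\ x \end{bmatrix}, \qquad \tilde{w} = \begin{bmatrix} b \\ w \end{bmatrix}$$
Then:
$$w^\top x + b = \tilde{w}^\top \tilde{x}$$
Decision boundary:
$$\tilde{w}^\top \tilde{x} = 0$$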
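A short NumPy check of the augmentation identity, including the batched form that prepends a column of ones to a data matrix. All numeric values are made up for illustration.

```python
import numpy as np

w, b = np.array([2.0, -1.0]), 0.5
x = np.array([3.0, 4.0])

x_tilde = np.concatenate(([1.0], x))   # prepend the constant input 1
w_tilde = np.concatenate(([b], w))     # prepend the bias as the first weight

print(w @ x + b, w_tilde @ x_tilde)    # both are 2.5

# Batched version: one row per example.
X = np.array([[3.0, 4.0], [0.0, 1.0]])
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])
scores = X_tilde @ w_tilde
print(scores)
```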
Prediction Rule
Classifier:
$$\hat{y} = \operatorname{sign}(w^\top x + b)$$
Correct classification:
$$y^{(i)} \left( w^\top x^{(i)} + b \right) > 0$$
Misclassification:
$$y^{(i)} \left( w^\top x^{(i)} + b \right) \le 0$$
Perceptron Algorithm
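Using the augmented vectors (bias absorbed into w):
Initialize:
$$w = 0$$
Repeat until a full pass over the data makes no mistakes:
For each i = 1, ..., n:
If
$$y^{(i)} w^\top x^{(i)} \le 0$$
then update:
$$w \leftarrow w + y^{(i)} x^{(i)}$$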
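The steps above can be sketched in NumPy as follows. The `max_epochs` cap is an addition for safety on non-separable data and is not part of the algorithm as stated. The toy data set and function name are made up.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Perceptron on augmented inputs; labels y must be in {-1, +1}.

    Returns the augmented weight vector (bias first) and the number of updates.
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # absorb the bias
    w = np.zeros(X_aug.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        updated = False
        for x_i, y_i in zip(X_aug, y):
            if y_i * (w @ x_i) <= 0:       # misclassified or on the boundary
                w = w + y_i * x_i
                mistakes += 1
                updated = True
        if not updated:                    # a full pass with no mistakes
            break
    return w, mistakes

# Toy linearly separable data (hypothetical).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

w, mistakes = perceptron(X, y)
print(w, mistakes)
```

Because w starts at zero, the very first example always triggers an update: the condition uses ≤ rather than <.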
Perceptron Convergence Theorem
Theorem.
Assume the data are linearly separable. Then the perceptron algorithm makes at most
$$\frac{R^2}{\gamma^2}$$
mistakes, where
$$R = \max_i \|x^{(i)}\|, \qquad \gamma = \min_i \, y^{(i)} w_*^\top x^{(i)}$$
for some unit vector $w_*$ that perfectly separates the data.
If the data are not separable, the algorithm does not converge.
For details see:
[Link]
lectures/[Link]
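As an empirical illustration of the bound (not a proof), the sketch below computes R and γ for a hand-picked separator on a toy data set and compares R²/γ² with the number of updates the perceptron actually makes. The data and the choice of w∗ are assumptions. Any other separating unit vector would give a valid, possibly different, bound.

```python
import numpy as np

# Made-up separable data, already augmented with a leading 1.
X = np.array([[1.0, 2.0, 1.0],
              [1.0, 1.0, 3.0],
              [1.0, -1.0, -2.0],
              [1.0, -3.0, -1.0]])
y = np.array([1, 1, -1, -1])

# A hand-picked separator (assumption), normalized to unit length.
w_star = np.array([0.0, 1.0, 1.0])
w_star = w_star / np.linalg.norm(w_star)

R = np.max(np.linalg.norm(X, axis=1))    # largest input norm
gamma = np.min(y * (X @ w_star))         # margin of w_star on the data
bound = R**2 / gamma**2

# Run the perceptron until a clean pass and count the updates.
w, mistakes, converged = np.zeros(3), 0, False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w, mistakes, converged = w + y_i * x_i, mistakes + 1, False

print(mistakes, bound)
```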
Aside: Perceptron as Gradient Descent
Perceptron update:
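$$w \leftarrow w + y_i x_i \quad \text{if } y_i w^\top x_i \le 0$$
Define the perceptron loss:
$$\ell_i(w) = \max(0, -y_i w^\top x_i)$$
If the point is misclassified ($y_i w^\top x_i < 0$):
$$\nabla \ell_i(w) = -y_i x_i$$
At $y_i w^\top x_i = 0$ the loss is not differentiable, and $-y_i x_i$ is a valid subgradient.
Gradient descent step with step size 1:
$$w \leftarrow w - \nabla \ell_i(w) = w + y_i x_i$$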
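A finite-difference check makes the gradient claim concrete. The sketch picks a misclassified point, compares the analytic gradient with a numerical one, and confirms that one unit-step descent update equals the perceptron update. The specific vectors are arbitrary illustrative values.

```python
import numpy as np

def perceptron_loss(w, x, y):
    """Perceptron loss max(0, -y * w.x) for a single example."""
    return max(0.0, -y * (w @ x))

w = np.array([1.0, -1.0])
x = np.array([2.0, 3.0])
y = 1                              # y * w.x = -1 < 0, so the point is misclassified

analytic = -y * x                  # claimed gradient on the misclassified side

# Central finite differences along each coordinate direction.
eps = 1e-6
numeric = np.array([
    (perceptron_loss(w + eps * e, x, y) - perceptron_loss(w - eps * e, x, y)) / (2 * eps)
    for e in np.eye(len(w))
])

w_gd = w - analytic                # gradient step with step size 1
w_perceptron = w + y * x           # perceptron update
print(analytic, numeric, w_gd, w_perceptron)
```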
Not All Data is Linearly Separable
Observed data: t-shirt wearing vs temperature
[Figure: plot with axes labeled Days and Temperature]
Blue = wore a T-shirt (y = 1)
Red = did not wear a T-shirt (y = 0)
We cannot linearly separate the data. What next?
Modeling T-Shirt vs Temperature
Let us try to model P(t-shirt | T) directly.
[Figure: Days and P(t-shirt | T) plotted against Temperature]
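$$P(\text{t-shirt} \mid T) = \sigma(wT + b) = \frac{1}{1 + e^{-(wT + b)}}$$
Here σ(·) denotes the logistic (sigmoid) function, not the noise standard deviation used earlier.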
w controls how quickly the probability changes with temperature.
b shifts the temperature at which P = 0.5.
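The roles of w and b can be checked numerically. In the sketch below the probability equals 0.5 exactly where wT + b = 0, that is at T = −b/w, and a larger w sharpens the transition around that point. The parameter values and the function name are hypothetical.

```python
import math

def p_tshirt(T, w, b):
    """Modelled probability of wearing a T-shirt at temperature T."""
    return 1.0 / (1.0 + math.exp(-(w * T + b)))

w, b = 0.5, -10.0            # hypothetical parameter values
T_half = -b / w              # temperature at which w*T + b = 0, so P = 0.5

for T in (10.0, T_half, 30.0):
    print(T, p_tshirt(T, w, b))

# Same midpoint (T = 20), steeper slope: the probability at T = 22 is closer to 1.
gentle = p_tshirt(22.0, 0.5, -10.0)
steep = p_tshirt(22.0, 2.0, -40.0)
print(gentle, steep)
```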
Logistic Regression Model
We model:
$$P(y = 1 \mid x; \theta) = h_\theta(x) = \sigma(\theta^\top x)$$
$$P(y = 0 \mid x; \theta) = 1 - h_\theta(x)$$
Therefore,
$$P(y \mid x; \theta) = h_\theta(x)^{y} \left( 1 - h_\theta(x) \right)^{1 - y}.$$
Note: the labels are now encoded as
y ∈ {0, 1} (probability view), rather than {−1, +1} as for the perceptron.
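The compact form is just a way of writing both cases in one expression: the exponent y switches between the two factors. A minimal sketch, with hypothetical parameters and function names, confirms that it reproduces each case and that the two probabilities sum to one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def label_probability(y, x, theta):
    """P(y | x; theta) = h^y * (1 - h)^(1 - y) for y in {0, 1}."""
    h = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return h ** y * (1.0 - h) ** (1 - y)

theta = (-1.0, 0.5)    # hypothetical parameters
x = (1.0, 4.0)         # x[0] = 1 plays the role of the bias input; theta . x = 1.0

p1 = label_probability(1, x, theta)   # equals h
p0 = label_probability(0, x, theta)   # equals 1 - h
print(p1, p0, p1 + p0)
```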