ECE 449 Machine Learning: Eric Ji
1 K-Nearest Neighbors (Non-linear)
• Find top K nearest neighbors under metric d and return most common/average label/value among
them
– d(x, z) = ( |x − z|p )1/p
P
– L1 (Manhattan) Distance: p = 1
– L2 (Euclidean) Distance: p = 2
• Determine K using validation set
– Small K is sensitive to noise and will overfit
– Large K includes too far examples and will underfit
• Simple to implement but several issues
– Require memory to store dataset
– Computationally expensive inference time
– Sensitive to outliers
– Curse of dimensionality: High-dimensional data spreads far away from each other giving low
performance
• Nonparametric models place mild assumptions on data distribution and good for complex data, but
require storage/computation of entire dataset
• Parametric models place strong modeling assumptions and require fitting the model to achieve more
efficient storage/computation
2 Perceptron (Linear)
• Only applies to linearly separable data
• Perceptrons are linear classifiers trying to learn a hyperplane
• Hyperplane in Rd -space is represented as w0 + wT x = 0 where w ∈ Rd
– w is orthogonal to hyperplane and points to positive half-space
• Predicted label y(x) = sign(wt + b)
• Perceptron algorithm iterates through all the data and simultaneously updates W until all data is
correctly labeled
– Update rule: wnew = w + yx when y(wT x) <= 0
• Theorem: Given w∗ that perfectly separates the data and γ = min|w∗T x( i)|, ∀x(i) in D, the perception
algorithm takes at most 1/γ 2 to converge
1
3 Probability and Estimation
• Useful Probability Properties:
P (A,B)
– Conditional Probability: P (A|B) = P (B)
P (B|A)P (A)
– Bayes Rule: P (A|B) = P (B)
• Useful Log Rules:
– log(AB ) = B · log(A)
– log(AB) = log(A) + log(B)
– log(A/B) = log(A) − log(B)
• Useful Derivative Rules:
d 1
– dx log(x) = x
• From a dataset of joint probabilities P (X1 , X2 , x3 , ..., Xd , Y ) we can calculate P (Y |X1 , X2 , x3 , ..., Xd )
– Intuitive to learn P (Y |X) from joint distribution, but requires lots of data that may not be
attainable to produce accurate model
• Estimate parameters from sparse data using Maximum Likelihood Estimation and Maximum A Pos-
terior Estimation
• MLE chooses parameter θ that maximizes maximizes probability of observing dataset D
– θ̂ = argmaxθ P (D|θ) where P (D|θ) is the likelihood function
• Steps for solving MLE:
– Take log of likelihood
– Take derivative in respect to θ and set equal to 0
– Solve for θ that maximizes the likelihood
• MAP chooses parameter θ that is most probable given prior P (θ) and dataset D
– θ̂ = argmaxθ P (θ|D)
– θ̂ = argmaxθ P (D|θ)P
P (D)
(θ)
according to Bayes Rule
– θ̂ = argmaxθ P (D|θ)P (θ) as P (D) does not depend on θ
• MAP is better than MLE when small number of samples of dataset and prior is accurate
• As the number of samples from our dataset approaches infinity, the prior becomes irrelevant and MAP
will become MLE
2
4 Naive Bayes(Probalistic)
• Aims to learn P (Y |X) through P (X|Y ) and P (Y ) using Bayes rule with conditional independence
assumption to reduce number of parameters to estimate
P (X1 ,...,Xd |Y )P (Y )
– P (Y |X1 , ..., Xd ) = ∝ P (X1 , ..., Xd |Y )P (Y ) ignoring normalization
P (X1 ,...,Xd )
• Conditional Independence: P (X1 , ..., Xd |Y ) = j P (Xj |Y )
Q
– Requires estimating 2(2d − 1) + 1 parameters without assuming conditional independence
– Requires estimating 2d + 1 parameters with assuming conditional independence
• Utilize MLE and MAP to estimate the parameters to learn P (Y |X)
– MAP makes it such that P (Y |X) won’t be 0 if one component of the product is 0
• If X is a continuous value, it is common to assume P (X|Y ) follow a normal distribution
– Variance can be independent of class, feature, or both
• Gaussian Naive Bayes can be linear with many assumptions regarding the data’s distributions
5 Logistic Regression (Linear)
• Discriminative counterpart to Naive Bayes that directly learn P (Y |X)
– Discriminative models directly calculate the weights
– Generative models calculate all the probabilities/parameters to calculate the weights then
• Learn a set of weights for each class
• P (Y |X) can be represented by sigmoid function
1P
– P (Y = c|X) = 1+exp(w0 + j wj Xj )
• Calculate weights using MCLE
– wM CLE = argmaxw P (y ( i)|x(i), w)
Q
– Objective is concave, but does not have a closed form so needs optimization techniques
• Can apply MAP by placing a prior on the weights themselves
– wM CLE = argmaxw P (w) P (y ( i)|x(i), w)
Q
• Logistic regression typically gives the better solution compared to naive bayes, especially with lots of
data and conditional independence does not hold
3
6 Optimization
• Gradient Descent uses first order Taylor expansion approximation to assume an objective function l
around weights w is linear
– l(w + s) = l(w) + g(w)T s where g(w) = ∇l(w)
• Gradient Descent Update rule: wnew = w − αg(w) to minimize l(w)
– Step size α should decrease by a constant rate for each update for good convergence
• Batch gradient uses error over training of entire dataset and updates w
• Stochastic gradient uses error over single sample and updates w
• Newton’s Method uses 2nd order Taylor expansion approximation
– l(w + s) = l(w) + g(w)T s + 21 sT (H(w)s
• Newton’s Update rule: wnew = w − H(w)−1 g(w)
– H(w) is the Hessian matrix which composes of the outer-product of the second derivative of l(w)
in respect to w
• Encorporating a prior for a MAP estimate results in a regularization term when updating weights
– wnew = w − αg(w) − αλw
– Helps reduce overfitting by keeping weights near 0
7 Linear Regression
• Used to learn function that linearly maps X onto Y where Y is continuous
– First choose parameterized for for P (Y |X, w)
– Then derive MLE or MAP and estimate w
• MLE produces Squared loss for objective: l(w) = 1 i
− wT xi )2
P
N i (y
– Closed form solution that minimizes l(w) give w = (X T X)−1 X T y
• MAP produces Square loss plus sum of squared weights objective: l(w) = 1 i
− wT xi )2 + λ||w||22
P
N i (y
– Closed form solution that minimize l(w) gives w = (X T X + λI)−1 X T y
4
8 Support Vector Machine
• Separate positive and negative samples as wide as possible
• Hard margin SVM is for linearly separable data and expects perfect separation
– Objective is to minimize 12 ||w||22 such that y (i) (wT x(i) + b) ≥ 1
– Only need support vectors for inference
∗ y (i) (wT x(i) + b) = 1
• Soft margin SVM allows for misclassified samples in non-linearly separable data
– Objective is to minimize 21 ||w||22 + C i ξi such that y (i) (wT x(i) + b) ≥ 1 − ξi
P
– C is trade-off parameter where C = ∞ causes hard margin
– ξi is slack variable where ξi = max(0, 1 − y (i) (wT x(i) + b)
• In both objectives ||w||22 is the regularize
• In soft margin objective i max(0, 1 − y (i) is the hinge loss
P
• Utilize Lagrangian multiplier to minimize quadratic objective without the constraints
– Want to solve dual problem: maxα minw,b L(w, b, α)
• Hard margin objective: L(w, b, α) = 21 ||w||22 + i αi (1 − y (i) (wt x(i) + b))
P
– w∗ = i α∗ y (i) x(i)
P
• Can also be applied to soft margin