Background: Generative and
Logistic Discriminative Classifiers
Regressi
on
1
Logistic Regression
Important analytic tool in natural and
social sciences
Baseline supervised machine learning tool
for classification
It is also the foundation of neural
networks
2
Generative and
Discriminative Classifiers
Naive Bayes is a generative classifier
by contrast:
Logistic regression is a discriminative
classifier
3
Generative and
Discriminative Classifiers
Suppose we're distinguishing cat from dog images
imagenet imagenet 4
Generative Classifier:
• Build a model of what's in a cat image
• Knows about whiskers, ears, eyes
• Assigns a probability to any image:
• how cat-y is this image?
Also build a model for dog images
Now given a new image:
Run both models and see which one fits better 5
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, dogs have collars!
Let's ignore everything else 6
Finding the correct class c from a document d in
Generative vs Discriminative Classifiers
Naive Bayes
Logistic Regression
posterior
P(c|d)
7
Components of a probabilistic machine learning
classifier
Given m input/output pairs (x(i),y(i)):
1. A feature representation of the input. For each input
observation x(i), a vector of features [x1, x2, ... , xn]. Feature j
for input x(i) is xj, more completely xj(i), or sometimes fj(x).
2. A classification function that computes , the estimated class,
via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy loss.
4. An algorithm for optimizing the objective function: stochastic
gradient descent.
8
The two phases of logistic
regression
Training: we learn weights w and b using stochastic
gradient descent and cross-entropy loss.
Test: Given a test example x we compute p(y|x)
using learned weights w and b, and return
whichever label (y = 1 or y = 0) is higher probability
9
Classification in Logistic Regression
Logistic
Regressi
on
10
Classification Reminder
Positive/negative sentiment
Spam/not spam
Authorship attribution
(Hamilton or Madison?)
Alexander Hamilton
11
Text Classification: definition
Input:
◦ a document x
◦ a fixed set of classes C = {c1, c2,…, cJ}
Output: a predicted class C
12
Binary Classification in Logistic Regression
Given a series of input/output pairs:
◦ (x(i), y(i))
For each observation x(i)
◦ We represent x(i) by a feature vector [x1, x2,…, xn]
◦ We compute an output: a predicted class (i) {0,1}
13
Features in logistic regression
• For feature xi, weight wi tells is how important is xi
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2
14
Logistic Regression for one
observation x
Input observation: vector x = [x1, x2,…, xn]
Weights: one per feature: W = [w1, w2,…, wn]
◦ Sometimes we call the weights θ = [θ1, θ2,…, θn]
Output: a predicted class {0,1}
(multinomial logistic regression: {0, 1, 2, 3, 4})
15
How to do classification
For each feature xi, weight wi tells us importance of xi
◦ (Plus we'll have a bias b)
We'll sum up all the weighted features and the bias
If this sum is high, we say y=1; if low, then y=0
16
But we want a probabilistic
classifier
We need to formalize “sum is high”.
We’d like a principled classifier that gives us a
probability
We want a model that can tell us:
p(y=1|x; θ)
p(y=0|x; θ)
17
The problem: z isn't a probability, it's just a
number!
Solution: use a function of z that goes from 0 to 1
18
The very useful sigmoid or logistic function
19
Idea of logistic regression
We’ll compute w∙x+b
And then we’ll pass it through the
sigmoid function:
σ(w∙x+b)
And we'll just treat it as a probability
20
Making probabilities with sigmoids
21
Turning a probability into a classifier
0.5 here is called the decision boundary
22
The probabilistic classifier
P(y=1)
wx + b 23
Turning a probability into a classifier
if w∙x+b > 0
if w∙x+b ≤ 0
24
Logistic Regression: a text example
Logistic on sentiment classification
Regressi
on
25
Sentiment example: does y=1
or y=0?
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I was overcome with the urge to get off
the couch and start dancing . It sucked me in , and it'll do the same to you .
26
27
Classifying sentiment for input x
Suppose w =
b = 0.1 28
Classifying sentiment for input x
29
Classification in (binary) logistic regression: summary
Given:
◦ a set of classes: (+ sentiment,- sentiment)
◦ a vector x of features [x1, x2, …, xn]
◦ x1= count( "awesome")
◦ x2 = log(number of words in review)
◦ A vector w of weights [w1, w2, …, wn]
◦ w for each feature f
i i
30
Logistic Regression: a text example
Logistic on sentiment classification
Regressi
on
31
Learning: Cross-Entropy Loss
Logistic
Regressi
on
32
Wait, where did the W’s come
from?
Supervised classification:
• We know the correct label y (either 0 or 1) for each x.
• But what the system produces is an estimate,
We want to set w and b to minimize the distance between our
estimate (i) and the true y(i).
• We need a distance estimator: a loss function or a cost
function
• We need an optimization algorithm to update w and b to
minimize the loss. 33
Learning components
A loss function:
◦ cross-entropy loss
An optimization algorithm:
◦ stochastic gradient descent
34
The distance between and y
We want to know how far is the classifier output:
= σ(w∙x+b)
from the true output:
y [= either 0 or 1]
We'll call this difference:
L(,y) = how much differs from the true y
35
Intuition of negative log likelihood loss = cross-
entropy loss
A case of conditional maximum likelihood
estimation
We choose the parameters w,b that maximize
• the log probability
• of the true y labels in the training data
• given the observations x
36
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Since there are only 2 discrete outcomes (0 or 1) we can
express the probability p(y|x) from our classifier (the
thing we want to maximize) as
noting:
if y=1, this simplifies to
if y=0, this simplifies to 1-
37
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize:
Now take the log of both sides (mathematically handy)
Maximize:
Whatever values maximize log p(y|x) will also maximize
p(y|x)
38
Deriving cross-entropy loss for a single observation x
Goal: maximize probability of the correct label p(y|x)
Maximize:
Now flip sign to turn this into a loss: something to minimize
Cross-entropy loss (because is formula for cross-entropy(y, ))
Minimize:
Or, plugging in definition of
Let's see if this works for our sentiment example
We want loss to be:
• smaller if the model estimate is close to correct
• bigger if model is confused
Let's first suppose the true label of this is y=1 (positive)
It's hokey . There are virtually no surprises , and the writing is second-rate .
So why was it so enjoyable ? For one thing , the cast is great . Another
nice touch is the music . I was overcome with the urge to get off the couch
and start dancing . It sucked me in , and it'll do the same to you .
40
Let's see if this works for our sentiment example
True value is y=1. How well is our model doing?
Pretty well! What's the loss?
41
Let's see if this works for our sentiment example
Suppose true value instead was y=0.
What's the loss?
42
Let's see if this works for our sentiment example
The loss when model was right (if true y=1)
Is lower than the loss when model was wrong (if true y=0):
Sure enough, loss was bigger when model was wrong! 43
Cross-Entropy Loss
Logistic
Regressi
on
44
Stochastic Gradient Descent
Logistic
Regressi
on
45
Our goal: minimize the loss
by weights 𝛳=(w,b)
Let's make explicit that the loss function is parameterized
• And we’ll represent as f (x; θ ) to make the
dependence on θ more obvious
We want the weights that minimize the loss, averaged
over all examples:
46
Our goal: minimize the loss
For logistic regression, loss function is convex
• A convex function has just one minimum
• Gradient descent starting from any point is
guaranteed to find the minimum
• (Loss for neural networks is non-convex)
47
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
48
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
So we'll move positive
49
Let's first visualize for a single
scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the function
So we'll move positive
50
Gradients
The gradient of a function of many variables is a
vector pointing in the direction of the greatest
increase in a function.
Gradient Descent: Find the gradient of the loss
function at the current point and move in the
opposite direction.
51
How much do we move in that direction ?
• The value of the gradient (slope in our example)
weighted by a learning rate η
• Higher learning rate means move w faster
52
Now let's consider N dimensions
We want to know where in the N-dimensional space
(of the N parameters that make up θ ) we should
move.
The gradient is just such a vector; it expresses the
directional components of the sharpest slope along
each of the N dimensions.
53
Imagine 2 dimensions, w and b
Visualizing the
gradient vector at
the red point
It has two
dimensions shown
in the x-y plane
54
The gradient
We’ll represent as f (x; θ ) to make the dependence on θ more
obvious:
The final equation for updating θ based on the gradient is thus
55
What are these partial derivatives for logistic regression?
The loss function
The elegant derivative of this function
56
57
Hyperparameters
The learning rate η is a hyperparameter
◦ too high: the learner will take big steps and overshoot
◦ too low: the learner will take too long
Hyperparameters:
• Briefly, a special kind of parameter for an ML model
• Instead of being learned by algorithm from
supervision (like regular parameters), they are
chosen by algorithm designer.
58
Stochastic Gradient Descent
Logistic
Regressi
on
59
Stochastic Gradient Descent:
Logistic An example and more details
Regressi
on
60
Working through an example
One step of gradient descent
A mini-sentiment example, where the true y=1 (positive)
Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in Θ0 are zero:
w1 = w2 = b = 0
η = 0.1
61
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2
where
Gradient vector has 3 dimensions:
62
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2
where
Gradient vector has 3 dimensions:
63
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2
where
Gradient vector has 3 dimensions:
64
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2
where
Gradient vector has 3 dimensions:
65
Example of gradient descent
w =w = b = 0;
1 2
Update step for update θ is: x1 = 3; x2 = 2
where
Gradient vector:
66
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector
θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;
67
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector
θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;
68
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector
θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;
69
Example of gradient descent
Now that we have a gradient, we compute the new parameter vector
θ1 by moving θ0 in the opposite direction from the gradient:
η = 0.1;
Note that enough negative examples would eventually make w2 negative 70
Mini-batch training
Stochastic gradient descent chooses a single
random example at a time.
That can result in choppy movements
More common to compute gradient over batches of
training instances.
Batch training: entire dataset
Mini-batch training: m examples
71
Stochastic Gradient Descent:
Logistic An example and more details
Regressi
on
72
Regularization
Logistic
Regressi
on
73
Overfitting
A model that perfectly match the training data has a
problem.
It will also overfit to the data, modeling noise
◦ A random word that perfectly predicts y (it happens to
only occur in one class) will get a very high weight.
◦ Failing to generalize to a test set without this word.
A good model should be able to generalize
74
Overfitting Useful or harmless features
+ X1 = "this"
X2 = "movie
This movie drew me in, and it'll X3 = "hated"
do the same to you. X4 = "drew me in"
-
I can't tell you how much I
4gram features that just
"memorize" training set and
might cause problems
hated this movie. It sucked. X5 = "the same to you"
X7 = "tell you how much"
75
Overfitting
4-gram model on tiny data will just memorize the data
◦ 100% accuracy on the training set
But it will be surprised by the novel 4-grams in the test data
◦ Low accuracy on test set
Models that are too powerful can overfit the data
◦ Fitting the details of the training data so exactly that the
model doesn't generalize well to the test set
◦ How to avoid overfitting?
◦ Regularization in logistic regression
◦ Dropout in neural networks
76
Regularization
A solution for overfitting
Add a regularization term R(θ) to the loss function (for
now written as maximizing logprob rather than minimizing loss)
Idea: choose an R(θ) that penalizes large weights
◦ fitting the data well with lots of big weights not as good as
fitting the data a little less well, with small weights
77
L2 Regularization (= ridge
regression)
The sum of the squares of the weights
The name is because this is the (square of the)
L2 norm ||θ||2, = Euclidean distance of θ to the origin.
L2 regularized objective function:
78
L1 Regularization (= lasso
regression)
The sum of the (absolute value of the) weights
Named after the L1 norm ||W||1, = sum of the absolute
values of the weights, = Manhattan distance
L1 regularized objective function:
79