Natural Language Processing
Prof. Ahmed Guessoum
The National Higher School of AI
Chapter 5
Logistic Regression
Background: Generative and
Discriminative Classifiers
Logistic Regression
• Important analytic tool in natural and
social sciences
• Baseline supervised machine learning tool
for classification
• Is also the foundation of neural networks
• LR can be used to classify an observation
into one of two classes or into one of
many classes (multinomial LR).
Generative and Discriminative
Classifiers
• Naïve Bayes is a generative
classifier
• by contrast:
• Logistic regression is a
discriminative classifier
Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images
[Figure: example cat and dog images from ImageNet]
Generative Classifier
• Build a model of what's in a cat
image
• Knows about whiskers, ears, eyes
• Assigns a probability to any
image:
• how cat-y is this image?
• Also build a model for dog images
Now given a new image:
Run both models and see which one fits better
Discriminative Classifier
Just try to distinguish dogs from cats
Oh look, all dog images (happen to) have collars!
Let's ignore everything else
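Finding the correct class c from a document d
in Generative vs Discriminative Classifiers
• Naive Bayes (likelihood times prior):
$$\hat{c} = \operatorname*{argmax}_{c \in C} P(d|c)\,P(c)$$
• Logistic Regression (posterior):
$$\hat{c} = \operatorname*{argmax}_{c \in C} P(c|d)$$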
Components of a probabilistic machine
learning classifier
Given m input/output pairs $(x^{(i)}, y^{(i)})$:
1. A feature representation of the input.
For each input observation $x^{(i)}$, a vector of features $[x_1, x_2, \dots, x_n]$.
Feature i for input $x^{(j)}$ is $x_i$, more completely $x_i^{(j)}$, or sometimes $f_i(x)$.
2. A classification function that computes $\hat{y}$, the estimated
class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy
loss.
4. An algorithm for optimizing the objective function:
stochastic gradient descent.
The two phases of Logistic Regression
• Training: we learn weights w and b using
stochastic gradient descent and cross-entropy
loss.
• Test: Given a test example x, we compute
p(y|x) using learned weights w and b, and
return whichever label (y = 1 or y = 0) has
higher probability.
Classification in Logistic Regression
Classification Reminder
• Positive/negative
sentiment
• Spam/not spam
• Authorship attribution
(Hamilton or
Madison?)
Alexander Hamilton
Text Classification: definition
• Input:
a document x
a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class $\hat{y} \in C$
Binary Classification in Logistic Regression
• Goal of binary LR: train a classifier that can make
a binary decision about the class of a new input
observation
• Given a series of input/output pairs: $(x^{(i)}, y^{(i)})$
• For each observation $x^{(i)}$
We represent $x^{(i)}$ by a feature vector $[x_1, x_2, \dots, x_n]$
We compute an output: a predicted class $\hat{y}^{(i)} \in \{0,1\}$
Binary Classification in Logistic Regression
• We want to know the probability P(y = 1|x)
that this observation is a member of the
class.
E.g. “positive sentiment” versus “negative
sentiment”;
the features represent counts of words in a
document;
P(y = 1|x): probability that the document has
positive sentiment;
P(y = 0|x): probability that the document has
negative sentiment.
Features in logistic regression
• For feature xi, weight wi tells us how important
xi is.
• xi = "review contains 'awesome'": wi = +10
• xj = "review contains 'abysmal'": wj = -10
• xk = "review contains 'mediocre'": wk = -2
Logistic Regression for one observation
• Input observation: vector x = [x1, x2,…, xn]
• Weights: one per feature: W = [w1, w2,…, wn]
Sometimes the weights are called θ = [θ1, θ2,…, θn]
• Output: a predicted class $\hat{y} \in \{0,1\}$
(For multinomial logistic regression: $\hat{y} \in \{0, 1, 2, 3, 4\}$)
• The output is computed from a linear function of
the input
Problem
$$z = \sum_{i=1}^{n} w_i x_i + b$$
$$z = \mathbf{w} \cdot \mathbf{x} + b$$
• Inclusion of the weights and the bias. (Note the
dot product on the RHS.)
• But z isn't a probability, it's just a number!
• We need to decide: if this sum is high, we say y=1;
if low, then y=0
• Solution: use a function of z that goes from 0 to 1
$$y = \sigma(z) = \frac{1}{1+e^{-z}} = \frac{1}{1+\exp(-z)}$$
The very useful sigmoid or logistic function
• Nothing in Eq. 5.3 ($z = \mathbf{w} \cdot \mathbf{x} + b$) forces z to be a legal probability, that is, to lie between 0
and 1. In fact, since weights are real-valued, z ranges from $-\infty$ to $\infty$.
• To create a probability, we'll pass z through the sigmoid function, $\sigma(z)$.
• The sigmoid function (named because it looks like an s) is also called the logistic
function, and gives logistic regression its name.
$$y = \sigma(z) = \frac{1}{1+e^{-z}}$$
• The sigmoid function takes a real value and maps it to the range (0,1). It is nearly
linear around 0, but outlier values get squashed toward 0 or 1.
[Figure 5.1: plot of the sigmoid function]
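Turning a probability into a classifier
$$\text{decision}(x) = \begin{cases} 1 & \text{if } P(y=1|x) > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
0.5 here is called the decision boundary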
The probabilistic classifier
$$P(y=1) = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \frac{1}{1+e^{-(\mathbf{w} \cdot \mathbf{x} + b)}}$$
[Figure: sigmoid curve of P(y=1) plotted against w·x + b]
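Turning a probability into a classifier
$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le 0 \end{cases}$$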
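The sigmoid and the 0.5 decision boundary can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `sigmoid` and `predict` and the `threshold` parameter are assumptions made for the example.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)), arranged to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return ez / (1.0 + ez)

def predict(w, x, b, threshold=0.5):
    """Return (predicted class, P(y=1|x)) for weights w, features x and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = sigmoid(z)
    return (1 if p > threshold else 0), p
```

Because the sigmoid is monotonic and equals 0.5 at z = 0, thresholding the probability at 0.5 is the same test as checking whether w·x + b > 0.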
Logistic Regression: A Sentiment
Classification Example
Sentiment Example
• Consider a binary sentiment classification on
movie review texts
• We would like to know whether to assign the
sentiment class + or - to a review document doc.
Fig. 5.2
•It's hokey . There are virtually no surprises
, and the writing is second-rate . So why was
it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I
was overcome with the urge to get off the
couch and start dancing . It sucked me in ,
and it'll do the same to you .
Sentiment Example
• We decided to represent each input
observation (i.e. review) by the 6 features x1
… x6 of the input shown in the following table.
Figure 5.2: A sample mini test document showing the extracted features in the vector x:
x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19
Classifying sentiment for input x
Suppose we have learned w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]
and b = 0.1
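With these weights and the six feature values from Fig. 5.2, the class probabilities follow from one dot product and one sigmoid. A minimal sketch in plain Python (the variable names are illustrative):

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # learned weights
b = 0.1                                 # learned bias
x = [3, 2, 1, 3, 0, 4.19]               # features x1..x6 from Fig. 5.2

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + b
p_pos = 1.0 / (1.0 + math.exp(-z))             # P(+|x) = sigma(z)
p_neg = 1.0 - p_pos                            # P(-|x)

print(f"z = {z:.3f}, P(+|x) = {p_pos:.2f}, P(-|x) = {p_neg:.2f}")
```

Rounded to two decimal places, this gives 0.70 for the positive class and 0.30 for the negative class.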
Classifying sentiment for input x
Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:
$$\begin{aligned}
p(+|x) = P(Y=1|x) &= \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) \\
&= \sigma(0.833) \\
&= 0.70 \qquad (5.6) \\
p(-|x) = P(Y=0|x) &= 1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= 0.30
\end{aligned}$$
Results rounded here to 2 decimal places.
Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input can be a feature. Consider the task of period disambiguation: deciding
if a period is the end of a sentence or part of a word, by classifying each period
into one of two classes, EOS (end-of-sentence) and not-EOS.
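Building features for LR for any
classification task: e.g. period disambiguation
This ends in a period. (End of sentence)
The house at 465 Main St. is new. (The period after "St." is not the end of a sentence)
$$x_1 = \begin{cases} 1 & \text{if } \mathrm{Case}(w_i) = \mathrm{Lower} \\ 0 & \text{otherwise} \end{cases} \quad \text{(perhaps with a +ve weight)}$$
$$x_2 = \begin{cases} 1 & \text{if } w_i \in \mathrm{AcronymDict} \\ 0 & \text{otherwise} \end{cases} \quad \text{(perhaps with a -ve weight)}$$
$$x_3 = \begin{cases} 1 & \text{if } w_i = \text{St.} \text{ and } \mathrm{Case}(w_{i-1}) = \mathrm{Cap} \\ 0 & \text{otherwise} \end{cases}$$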
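The three indicator features above could be extracted as in the sketch below. It assumes whitespace tokenization, and the small `ACRONYMS` set is a hypothetical stand-in for AcronymDict; neither comes from the slides.

```python
# Hypothetical stand-in for AcronymDict; a real list would be much larger.
ACRONYMS = {"St.", "Dr.", "Mr.", "Inc."}

def period_features(tokens, i):
    """Return [x1, x2, x3] for tokens[i], a token that ends in a period."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    x1 = 1 if word.islower() else 0                           # Case(w_i) = Lower
    x2 = 1 if word in ACRONYMS else 0                         # w_i in AcronymDict
    x3 = 1 if word == "St." and prev[:1].isupper() else 0     # "St." after a capitalized word
    return [x1, x2, x3]

tokens = "The house at 465 Main St. is new.".split()
print(period_features(tokens, 5))   # features for "St."
print(period_features(tokens, 7))   # features for "new."
```

Each feature vector would then be scored with w·x + b exactly as in the sentiment example, with the positive class meaning end of sentence.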
Learning in Logistic Regression
How to get the W’s?
• Supervised classification: we know the
correct label y (either 0 or 1) for each x.
• What the system produces is an estimate, $\hat{y}$
• We want to set w and b to minimize the
distance between our estimate $\hat{y}^{(i)}$ and the
true $y^{(i)}$.
• We need a distance estimator: a loss
function or a cost function
• We need an optimization algorithm to
update w and b to minimize the loss.
Learning Components
• A loss function:
▪cross-entropy loss
• An optimization algorithm:
▪stochastic gradient descent
Cross-Entropy Loss
The distance between $\hat{y}$ and y
• We want to know how far the classifier output
(the predicted probability of the positive class, i.e.
y=1):
$\hat{y} = \sigma(\mathbf{w} \cdot \mathbf{x} + b)$
is from the true output (the actual label, 0 or 1):
y
• We'll call this difference:
$L(\hat{y}, y)$ = how much $\hat{y}$ differs from the true y
• We choose the parameters w, b that maximize
the log probability of the true y labels in the
training data given the observations x
Deriving cross-entropy loss for
a single observation x
Goal: Maximize the probability of the correct
label p(y|x) in the case of a binary classification
Maximize:
$$p(y|x) = \hat{y}^{\,y}\,(1-\hat{y})^{1-y}$$
• if y=1, this simplifies to $\hat{y}$; if y=0, this simplifies to $1-\hat{y}$
• Now take the log of both sides (mathematically handy)
So maximize:
$$\log p(y|x) = \log\left[\hat{y}^{\,y}\,(1-\hat{y})^{1-y}\right] = y\log\hat{y} + (1-y)\log(1-\hat{y})$$
• Whatever values maximize log p(y|x) will also maximize
p(y|x)
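Deriving cross-entropy loss for a single
observation x
Maximize:
$$\log p(y|x) = y\log\hat{y} + (1-y)\log(1-\hat{y})$$
• Now flip sign to turn this into a loss function:
something to minimize
• Cross-entropy loss. Minimize:
$$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right]$$
• Or, plugging in the definition of $\hat{y}$:
$$L_{CE}(\hat{y}, y) = -\left[y\log\sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1-y)\log\left(1-\sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right]$$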
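The binary cross-entropy loss for one observation, −[y log ŷ + (1−y) log(1−ŷ)], can be sketched as follows. The `eps` clipping is an added safeguard against log(0), not part of the derivation, and the function name is an assumption.

```python
import math

def cross_entropy(y_hat: float, y: int, eps: float = 1e-12) -> float:
    """Cross-entropy loss for one example with true label y in {0, 1}."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)   # keep log() finite at exactly 0 or 1
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# Sentiment example: the model estimates P(+|x) = 0.70.
loss_if_positive = cross_entropy(0.70, 1)   # true label agrees with the model
loss_if_negative = cross_entropy(0.70, 0)   # true label disagrees with the model
print(f"{loss_if_positive:.2f} {loss_if_negative:.2f}")
```

The loss is small when the model puts high probability on the true label and grows without bound as that probability approaches zero.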
Application to the sentiment example
• We want the loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused
• Let us first suppose the true label of this is y=1
(positive)
It's hokey . There are virtually no surprises
, and the writing is second-rate . So why was
it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I
was overcome with the urge to get off the
couch and start dancing . It sucked me in ,
and it'll do the same to you .
Application to the sentiment example
• True value is y=1. How well is our model doing?
$$p(+|x) = P(Y=1|x) = \sigma(\mathbf{w} \cdot \mathbf{x} + b) = \sigma(0.833) = 0.70$$
• So the loss is:
$$\begin{aligned}
L_{CE}(\hat{y}, y) &= -\left[y\log\sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1-y)\log\left(1-\sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right] \\
&= -\log\sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= -\log(0.70) = 0.36
\end{aligned}$$
Application to the sentiment example
• Suppose the true value instead was y = 0.
$$p(-|x) = P(Y=0|x) = 1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b) = 0.30$$
• So the loss is:
$$\begin{aligned}
L_{CE}(\hat{y}, y) &= -\left[y\log\sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1-y)\log\left(1-\sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right] \\
&= -\log\left(1-\sigma(\mathbf{w} \cdot \mathbf{x} + b)\right) \\
&= -\log(0.30) = 1.2
\end{aligned}$$
Application to the sentiment example
• The loss when the model was right (if true y=1): $-\log(0.70) = 0.36$
• is lower than the loss when the model was wrong (if
true y=0): $-\log(0.30) = 1.2$
• Sure enough, the loss was bigger when the
model was wrong!
Stochastic Gradient Descent
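Our goal: minimize the loss
• Let's make explicit that the loss function
is parameterized by weights θ = (w, b)
• We want the weights that minimize the
loss, averaged over all examples:
$$\hat{\theta} = \operatorname*{argmin}_{\theta} \frac{1}{m}\sum_{i=1}^{m} L_{CE}\left(f(x^{(i)}; \theta), y^{(i)}\right)$$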
Intuition of gradient descent
• How do I get to the bottom of this
river canyon?
Look around me 360°
Find the direction of steepest slope down
Go that way
Our goal: minimize the loss
• For logistic regression, the loss function
is convex
• A convex function has at most one
minimum, so there are no local minima to get stuck in
• Gradient descent starting from any
point is guaranteed to find the minimum
(given a suitable learning rate)
• (The loss for multi-layer neural networks is
non-convex)
Visualization for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the
function
[Figure: loss plotted against w; at w1, the initial value of w, the slope is negative]
So we'll move positive
(making w bigger)
Gradients
• The gradient of a function of many
variables is a vector pointing in the
direction of the greatest increase in a
function.
• Gradient Descent: Find the gradient of
the loss function at the current point and
move in the opposite direction.
Amount of move in that direction?
• The value of the gradient (slope in our
example) $\frac{d}{dw} f(x; w)$ weighted by a
learning rate η
• Higher learning rate means move w faster
$$w^{t+1} = w^{t} - \eta \frac{d}{dw} f(x; w)$$
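The single-variable update rule can be tried on any convex function. The sketch below uses f(w) = (w − 3)², chosen purely for illustration because its minimum at w = 3 is known; the function name and defaults are assumptions.

```python
def gradient_descent_1d(grad, w0, eta=0.1, steps=100):
    """Repeatedly apply w <- w - eta * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)   # move against the slope
    return w

# f(w) = (w - 3)^2 has derivative 2 * (w - 3) and its minimum at w = 3.
w_min = gradient_descent_1d(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_min)
```

Starting at w = 0 the slope is negative, so each step increases w. On this function a learning rate above 1.0 would make the iterates overshoot further each step and diverge.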
Case of N dimensions
• We want to know where in the N-
dimensional space (of the N parameters
that make up θ ) we should move.
• The gradient is just such a vector; it
expresses the directional components of
the sharpest slope along each of the N
dimensions.
Visualizing 2 dimensions, w and b
• Visualizing the gradient vector at the red point
• It has two dimensions shown in the x-y plane
[Figure 5.4: Visualization of the gradient vector at the red point in two dimensions w and b, showing the gradient as a red arrow in the x-y plane. Axes: w, b, Cost(w,b).]
Real gradients
• Are much longer; lots and lots of weights
• For each dimension wi the gradient
component i tells us the slope with respect
to that variable.
“How much would a small change in wi
influence the total loss function L?”
We express the slope as a partial derivative of
the loss with respect to wi
• The gradient is then defined as a vector of
these partial derivatives.
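The gradient
We'll represent $\hat{y}$ as f(x; θ) to make the dependence on θ more
obvious:
$$\nabla L(f(x;\theta), y) = \begin{bmatrix}
\frac{\partial}{\partial w_1} L(f(x;\theta), y) \\
\frac{\partial}{\partial w_2} L(f(x;\theta), y) \\
\vdots \\
\frac{\partial}{\partial w_n} L(f(x;\theta), y) \\
\frac{\partial}{\partial b} L(f(x;\theta), y)
\end{bmatrix}$$
The final equation for updating θ based on the gradient is thus
$$\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x;\theta), y)$$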
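Partial derivatives for logistic regression
The loss function
$$L_{CE}(\hat{y}, y) = -\left[y\log\sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1-y)\log\left(1-\sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right]$$
The elegant derivative of this function:
$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y\right] x_j = (\hat{y} - y)\,x_j$$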
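One way to gain confidence in the derivative (ŷ − y)·x_j is to compare it with a finite-difference estimate of the loss. The sketch below does this for one example; the weights, bias and features are illustrative values, not taken from the slides.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Cross-entropy loss for one example."""
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def gradient(w, b, x, y):
    """Analytic gradient: (y_hat - y) * x_j per weight, (y_hat - y) for the bias."""
    err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
    return [err * xj for xj in x], err

# Illustrative values for the check.
w, b, x, y = [0.5, -0.3], 0.1, [3.0, 2.0], 1
grad_w, grad_b = gradient(w, b, x, y)

h = 1e-6  # step for the central difference
numeric_w0 = (loss([w[0] + h, w[1]], b, x, y) - loss([w[0] - h, w[1]], b, x, y)) / (2 * h)
numeric_b = (loss(w, b + h, x, y) - loss(w, b - h, x, y)) / (2 * h)
print(grad_w[0], numeric_w0)
print(grad_b, numeric_b)
```

The analytic and numeric values should agree to several decimal places, which is a common sanity check before using a hand-derived gradient in training.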
Algorithm for Stochastic Gradient
Descent
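A minimal sketch of stochastic gradient descent for binary logistic regression follows, assuming the per-example gradient (ŷ − y)·x for the weights and (ŷ − y) for the bias. The function name `sgd`, its default settings and the toy dataset are assumptions made for illustration, not the algorithm figure from the slide.

```python
import math
import random

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def sgd(data, n_features, eta=0.1, epochs=100, seed=0):
    """Train weights w and bias b, updating after every single example."""
    rng = random.Random(seed)
    w, b = [0.0] * n_features, 0.0
    data = list(data)
    for _ in range(epochs):
        rng.shuffle(data)                      # visit examples in random order
        for x, y in data:
            y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = y_hat - y                    # dL/dz for cross-entropy loss
            w = [wj - eta * err * xj for wj, xj in zip(w, x)]
            b -= eta * err
    return w, b

# Toy data: [positive-word count, negative-word count] -> label (hypothetical).
toy = [([3.0, 0.0], 1), ([2.0, 1.0], 1), ([0.0, 3.0], 0), ([1.0, 2.0], 0)]
w, b = sgd(toy, n_features=2)
preds = [1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) > 0.5 else 0
         for x, _ in toy]
print(w, b, preds)
```

On this separable toy set the learned weight for the positive-word count ends up positive and the one for the negative-word count ends up negative.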
Hyperparameters
• The learning rate η is a hyperparameter
too high: the learner will take big steps and
overshoot
too low: the learner will take too long
• Hyperparameters are not learned by the algorithm
from supervision (like regular parameters);
they are chosen by the algorithm designer
(or by grid search).
• More on hyperparameters in Chapter 7
Stochastic Gradient Descent: An example
and more details
Working through an example
• One step of gradient descent
• A mini-sentiment example, where the true y=1
(positive)
• Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume 3 parameters (2 weights and 1 bias) in
$\theta^{0}$ are set to zero: $w_1 = w_2 = b = 0$
And learning rate η = 0.1
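Example of gradient descent
• Step to update θ: w1 = w2 = b = 0;
x1 = 3; x2 = 2
$$\theta^{t+1} = \theta^{t} - \eta \nabla L\left(f(x^{(i)}; \theta), y^{(i)}\right)$$
where
$$\frac{\partial L_{CE}}{\partial w_j} = \left[\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y\right] x_j, \qquad \frac{\partial L_{CE}}{\partial b} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) - y$$
• Gradient vector has 3 dimensions:
$$\nabla L = \begin{bmatrix} (\sigma(0) - 1)\,x_1 \\ (\sigma(0) - 1)\,x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5\,x_1 \\ -0.5\,x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$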
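Example of gradient descent
• Now that we have a gradient, we compute the
new parameter vector $\theta^{1}$ by moving $\theta^{0}$ in the
opposite direction from the gradient, with η = 0.1:
$$\theta^{t+1} = \theta^{t} - \eta \nabla L\left(f(x^{(i)}; \theta), y^{(i)}\right)$$
$$\theta^{1} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.1 \\ 0.05 \end{bmatrix}$$
• Note that enough negative examples would eventually make w2 negative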
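The single update step of this example can be reproduced numerically. The sketch below assumes the gradient (ŷ − y)·x_j for each weight and (ŷ − y) for the bias; the variable names are illustrative.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

w, b = [0.0, 0.0], 0.0       # theta^0: all parameters start at zero
x, y = [3.0, 2.0], 1         # x1 = 3, x2 = 2, true label positive
eta = 0.1                    # learning rate

y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)   # sigma(0) = 0.5
grad_w = [(y_hat - y) * xj for xj in x]                     # gradient w.r.t. w1, w2
grad_b = y_hat - y                                          # gradient w.r.t. b

w_new = [wj - eta * gj for wj, gj in zip(w, grad_w)]
b_new = b - eta * grad_b
print(grad_w, grad_b)
print(w_new, b_new)
```

After one step on a positive example, both weights and the bias have moved up from zero, including the weight on the negative-lexicon count.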
Mini-batch training
• Stochastic gradient descent chooses a single
random example at a time.
• That can result in choppy movements
• More common to compute gradient over batches of
training instances.
• Batch training: entire dataset
• Mini-batch training: m examples at a time (e.g. 512
or 1024)
• Mini-batches can easily be vectorized
• Size of the mini-batch is chosen based on the computational
resources.
➔ Can process all the examples in one mini-batch in
parallel and then accumulate the loss;
single-example training has nothing to parallelize, and a full batch may not fit in memory.
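A mini-batch gradient is the average of the per-example gradients in the batch. The sketch below computes it with a plain loop to stay dependency-free; with an array library the same computation would typically be written as one matrix operation. The function name and example values are assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gradient(w, b, batch):
    """Average the cross-entropy gradients over a batch of (x, y) pairs."""
    m = len(batch)
    grad_w, grad_b = [0.0] * len(w), 0.0
    for x, y in batch:
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
        for j, xj in enumerate(x):
            grad_w[j] += err * xj / m
        grad_b += err / m
    return grad_w, grad_b

# Two illustrative examples, one positive and one negative, with zero weights.
batch = [([3.0, 2.0], 1), ([1.0, 4.0], 0)]
grad_w, grad_b = minibatch_gradient([0.0, 0.0], 0.0, batch)
print(grad_w, grad_b)
```

Averaging over several examples smooths out the choppy steps that single-example updates can produce.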
Regularization
Overfitting
• A model that perfectly matches the training
data has a problem.
• It will also overfit to the data, modeling
noise
A random word that perfectly predicts y (it
happens to only occur in one class) will get a very
high weight.
Failing to generalize to a test set without this
word.
• A good model should be able to generalize
well to the unseen (test) data
• A model that overfits will have poor
generalization
Overfitting
+ This movie drew me in, and it'll do the same to you.
- I can't tell you how much I hated this movie. It sucked.
Useful or harmless features:
X1 = "this"
X2 = "movie"
X3 = "hated"
X4 = "drew me in"
4-gram features that just "memorize" training set and might cause problems:
X5 = "the same to you"
X7 = "tell you how much"
Overfitting
• A 4-gram model on tiny data will just memorize the data
➔ 100% accuracy on the training set
• But it will be surprised by the novel 4-grams in the test
data
➔ Low accuracy on the test set
• Models that are too powerful can overfit the data
Fitting the details of the training data so exactly that
the model doesn't generalize well to the test set
▪ How to avoid overfitting?
• Regularization in logistic regression
• Dropout in neural networks
Regularization
• A solution for overfitting
• Add a regularization term R(θ) to the loss
function (now written as maximizing
logprob rather than minimizing loss):
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)}) - \alpha R(\theta)$$
• Idea: choose an R(θ) that penalizes large
weights
Fitting the data well with lots of big weights is
not as good as fitting the data a little less
well, with small weights
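L2 Regularization (= Ridge Regression)
• The sum of the squares of the weights
$$R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$$
• The name is because this is the (square of
the) L2 norm $\|\theta\|_2$, = Euclidean distance
of θ to the origin.
• L2 regularized objective function (where α is
a hyperparameter):
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)})\right] - \alpha \sum_{j=1}^{n} \theta_j^2$$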
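L1 Regularization (= Lasso Regression)
• The sum of the (absolute value of the)
weights
$$R(\theta) = \|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$$
• Named after the L1 norm $\|\theta\|_1$, = sum of the
absolute values of the weights, = Manhattan
distance
• L1 regularized objective function:
$$\hat{\theta} = \operatorname*{argmax}_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)}|x^{(i)})\right] - \alpha \sum_{j=1}^{n} |\theta_j|$$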
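The L2 and L1 penalty terms, and the way they are subtracted from the log-likelihood, can be sketched as below. The function names and the numbers in the usage lines are illustrative assumptions.

```python
def l2_penalty(theta):
    """Sum of squared weights (squared L2 norm)."""
    return sum(t * t for t in theta)

def l1_penalty(theta):
    """Sum of absolute weights (L1 norm)."""
    return sum(abs(t) for t in theta)

def regularized_objective(log_likelihood, theta, alpha, penalty):
    """Objective to maximize: log-likelihood minus alpha times the penalty."""
    return log_likelihood - alpha * penalty(theta)

theta = [3.0, -4.0]   # illustrative weight vector
print(l2_penalty(theta), l1_penalty(theta))
print(regularized_objective(-2.0, theta, alpha=0.1, penalty=l2_penalty))
```

With the same α, the L2 penalty punishes the large weights here far more than the L1 penalty does, because it grows quadratically with each weight.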
L1 vs L2 Regularization
• Both are commonly used in language processing.
• L2 regularization: easier to optimize because of its
simple derivative (the derivative of $\theta^2$ is just $2\theta$)
• L1 regularization: more complex (the derivative of
$|\theta|$ is discontinuous at zero).
• L2 prefers weight vectors with many small weights
• L1 prefers sparse solutions with some larger
weights but many more weights set to zero.
➔ L1 regularization leads to much sparser weight
vectors, i.e. far fewer features.
Multinomial Logistic Regression
Multinomial Logistic Regression
• Often we need more than 2 classes
Positive/negative/neutral
Parts of speech (noun, verb, adjective, adverb,
preposition, etc.)
Classify emergency SMSs into different actionable
classes
• If >2 classes we use multinomial logistic regression
= Softmax regression
• So "logistic regression" will just mean binary (2 output
classes)
Multinomial Logistic Regression
• Recall the expression of the loss for binary LR:
$$L_{CE}(\hat{y}, y) = -\left[y\log\hat{y} + (1-y)\log(1-\hat{y})\right]$$
• Multinomial LR generalizes to K terms.
• y and $\hat{y}$ are represented as vectors.
• True label y is a vector with K elements, each
corresponding to a class, with $y_c = 1$ if the correct
class is c; all other elements of y being 0.
• Our classifier will produce an estimate vector $\hat{y}$
with K elements, each element $\hat{y}_k$ of which
represents the estimated probability $p(y_k = 1|x)$.
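Multinomial Logistic Regression
$$L_{CE}(\hat{y}, y) = -\sum_{k=1}^{K} y_k \log\hat{y}_k = -\log\hat{y}_c \quad \text{(where c is the correct class)}$$
The cross-entropy loss is simply the negative log of the output
probability corresponding to the correct class (hence the
name negative log likelihood loss).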
Multinomial Logistic Regression
• The gradient for a single example turns out to
be very similar to the gradient for binary LR:
$$\frac{\partial L_{CE}}{\partial w_{k,i}} = -(y_k - \hat{y}_k)\,x_i = (\hat{y}_k - y_k)\,x_i$$
The softmax function
• The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) +
P(neutral|doc) = 1
• Need a generalization of the sigmoid called the
softmax
Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
Outputs a probability distribution
▪ each value in the range [0,1]
▪ all the values summing to 1
The softmax function
Turns a vector z = [z1, z2, ... , zk] of k arbitrary values into
probabilities
$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)}, \quad 1 \le i \le k$$
The softmax function
• Example with 6 classes, suppose
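As an illustration of a 6-class case, the sketch below applies softmax to an assumed score vector. The values in `z` are chosen for the example and are not taken from the slide. Subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged.

```python
import math

def softmax(z):
    """Map a list of arbitrary real scores to a probability distribution."""
    m = max(z)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]          # illustrative scores for 6 classes
probs = softmax(z)
print([round(p, 3) for p in probs])
print(sum(probs))
```

The largest score (3.2) receives most of the probability mass, about 0.74, while the ordering of the scores is preserved in the probabilities.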
Softmax in multinomial logistic regression
• Input is still the dot product between weight
vector w and input vector x
• But now we’ll need separate weight vectors
for each of the K classes.
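With one weight vector (and bias) per class, prediction computes K scores and passes them through softmax. A minimal sketch, in which the weight values, class order and function names are hypothetical:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(W, b, x):
    """W holds one weight vector per class, b one bias per class."""
    scores = [sum(wj * xj for wj, xj in zip(wk, x)) + bk for wk, bk in zip(W, b)]
    return softmax(scores)

# Hypothetical 3-class model (positive, negative, neutral) over 2 features.
W = [[1.0, -1.0], [-1.0, 1.0], [0.0, 0.0]]
b = [0.0, 0.0, 0.0]
probs = predict_proba(W, b, [3.0, 2.0])
print(probs)
```

The predicted class is the index of the largest probability, here the first class because its score w·x + b is the highest.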
Features in binary versus multinomial
logistic regression
• Binary: positive weight → y=1
negative weight → y=0
(e.g. $w_5 = 3.0$)
• Multinomial: separate weights for each
class:
Multinomial Logistic Regression
and the
Reference Class Constraint
Model Identifiability
• A statistical model is identifiable if different
parameter values lead to different probability
distributions.
• Formally, a model is identifiable if, for all x and y:
$$P(y|x; \theta_1) = P(y|x; \theta_2) \Rightarrow \theta_1 = \theta_2$$
• That means: the parameters θ are uniquely
determined by the distribution they define.
• If a model is non-identifiable, there exist different
parameter sets that produce exactly the same
predictions, so you cannot tell which set of
parameters is the "true" one.
Multinomial Logistic Regression
• Suppose we have K classes:
$$P(y = k|x) = \frac{e^{w_k \cdot x}}{\sum_{j=1}^{K} e^{w_j \cdot x}}$$
• Here, each class k has its own weight
vector $w_k$.
The Problem:
Redundancy of Parameters
• If one adds the same constant vector $a$ to all the
weight vectors $w_k$, the probabilities do not change:
$$P(y = k|x) = \frac{e^{(w_k + a) \cdot x}}{\sum_{j=1}^{K} e^{(w_j + a) \cdot x}} = \frac{e^{a \cdot x}\, e^{w_k \cdot x}}{e^{a \cdot x} \sum_{j=1}^{K} e^{w_j \cdot x}} = \frac{e^{w_k \cdot x}}{\sum_{j=1}^{K} e^{w_j \cdot x}}$$
• The added term $e^{a \cdot x}$ cancels out of numerator and
denominator.
• So there are infinitely many different parameter sets
$\{w_1, \dots, w_K\}$ that yield the same probabilities,
meaning the model is non-identifiable.
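The cancellation can be checked numerically: shifting every class weight vector by the same vector a leaves the softmax probabilities unchanged. The weights, the shift and the input below are illustrative values.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def class_probs(W, x):
    """P(y = k | x) for each class k, with one weight vector per class."""
    return softmax([sum(wj * xj for wj, xj in zip(wk, x)) for wk in W])

W = [[1.0, -1.0], [0.5, 2.0], [-0.5, 0.5]]    # illustrative weights, K = 3
a = [5.0, -2.0]                                # the same shift added to every w_k
x = [3.0, 2.0]

W_shifted = [[wj + aj for wj, aj in zip(wk, a)] for wk in W]
p_original = class_probs(W, x)
p_shifted = class_probs(W_shifted, x)
print(p_original)
print(p_shifted)
```

Both parameter sets give the same distribution up to floating-point rounding, which is exactly the non-identifiability described above.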
Fixing the Identifiability Problem
• To make the model identifiable, we remove this
redundancy by fixing one class's parameters. This is
called choosing a reference class (or baseline category).
• Typically, we set $w_K = 0$ ($w_{ref} = 0$) for some class $K$
• Then, the model becomes:
$$P(y = c|x) = \frac{e^{w_c \cdot x}}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}, \quad c = 1, \dots, K-1$$
and
$$P(y = K|x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}$$
• Now, the model is identifiable: each probability distribution
corresponds to a unique parameter set.
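To see that the reference-class form describes the same model, one can subtract w_K from every weight vector of an unconstrained softmax model and compare the two sets of probabilities. The weights and input below are illustrative, and the function names are assumptions.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax_probs(W, x):
    """Unconstrained model: one weight vector for each of the K classes."""
    exps = [math.exp(dot(wk, x)) for wk in W]
    total = sum(exps)
    return [e / total for e in exps]

def reference_class_probs(W_free, x):
    """W_free holds w_1..w_{K-1}; the reference class K has w_K = 0."""
    exps = [math.exp(dot(wk, x)) for wk in W_free]
    denom = 1.0 + sum(exps)
    return [e / denom for e in exps] + [1.0 / denom]

W = [[1.0, -1.0], [0.5, 2.0], [-0.5, 0.5]]    # illustrative weights, K = 3
x = [3.0, 2.0]

# Subtract the last class's vector from the others so that w_K becomes 0.
W_free = [[wj - rj for wj, rj in zip(wk, W[-1])] for wk in W[:-1]]
p_full = softmax_probs(W, x)
p_ref = reference_class_probs(W_free, x)
print(p_full)
print(p_ref)
```

The constrained model needs only K − 1 weight vectors yet reproduces the same probabilities.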
Intuitive Explanation
• Without fixing one class, the system is
“floating”: all weights can be shifted up or
down and predictions stay the same.
• Choosing a reference class anchors the model,
eliminating this floating ambiguity.
• This subtle step is what makes the parameters of the
softmax model uniquely determined.
Summary & Takeaways
• Multinomial Logistic Regression generalizes
binary logistic regression to multiple classes.
• A reference class constraint (𝑤𝑟𝑒𝑓 =0) is
essential for identifiability.
• LR is widely used in NLP for text classification,
sentiment analysis, and more.
• Unlike Naive Bayes, LR can capture feature
interactions when features are engineered
appropriately.
Textbook
Jurafsky, D. and Martin, J.H. (2024) Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition with
Language Models (3rd ed. draft), Prentice Hall.
The draft version of the 3rd edition
will be used; it has been updated to
include more recent topics such as
Vector Semantics and Embeddings,
ANNs and Deep Learning for NLP,
Transformers, etc.
These course slides are largely
based on this textbook and its
authors' slides, with light
customisations where appropriate.