Logistic Regression in NLP Classification

This document discusses Logistic Regression as a fundamental tool in natural language processing and machine learning for classification tasks. It contrasts generative classifiers like Naïve Bayes with discriminative classifiers such as Logistic Regression, detailing the components, training, and testing phases involved in the logistic regression process. Additionally, it covers practical applications, including sentiment analysis and feature representation for various classification tasks.

Natural Language Processing

Prof. Ahmed Guessoum

The National Higher School of AI

Chapter 5

Logistic Regression
Background: Generative and
Discriminative Classifiers
Logistic Regression
• Important analytic tool in natural and
social sciences
• Baseline supervised machine learning tool
for classification
• Is also the foundation of neural networks
• LR can be used to classify an observation
into one of two classes or into one of
many classes (multinomial LR).

3
Generative and Discriminative
Classifiers
• Naïve Bayes is a generative
classifier

• by contrast:

• Logistic regression is a
discriminative classifier

4
Generative and Discriminative
Classifiers
Suppose we're distinguishing cat from dog images


5
Generative Classifier
• Build a model of what's in a cat
image
• Knows about whiskers, ears, eyes
• Assigns a probability to any
image:
• how cat-y is this image?
• Also build a model for dog images

Now given a new image:

Run both models and see which one fits better
6
Discriminative Classifier
Just try to distinguish dogs from cats

Oh look, all dog images (happen to) have collars!

Let's ignore everything else
7
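Finding the correct class c from a document d
in Generative vs Discriminative Classifiers

• Naive Bayes (likelihood × prior):
$$\hat{c} = \arg\max_{c \in C} P(d \mid c)\, P(c)$$

• Logistic Regression (posterior):
$$\hat{c} = \arg\max_{c \in C} P(c \mid d)$$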

8
Components of a probabilistic machine
learning classifier
Given m input/output pairs (x(i),y(i)):
1. A feature representation of the input.
For each input observation x(i), a vector of features [x1, x2, ... , xn].
Feature i for input x(j) is xi, more completely xi(j), or sometimes fi (x).
2. A classification function that computes ŷ, the estimated
class, via p(y|x), like the sigmoid or softmax functions.
3. An objective function for learning, like cross-entropy
loss.
4. An algorithm for optimizing the objective function:
stochastic gradient descent.
9
The two phases of Logistic Regression

• Training: we learn weights w and b using

stochastic gradient descent and cross-entropy
loss.
• Test: Given a test example x, we compute
p(y|x) using learned weights w and b, and
return whichever label (y = 1 or y = 0) has
higher probability.

10
Classification in Logistic Regression
Classification Reminder

• Positive/negative
sentiment
• Spam/not spam
• Authorship attribution
(Hamilton or
Madison?)
Alexander Hamilton

12
Text Classification: definition
• Input:
▪ a document x
▪ a fixed set of classes C = {c1, c2, …, cJ}

• Output: a predicted class ŷ ∈ C

13
Binary Classification in Logistic Regression
• Goal of binary LR: train a classifier that can make
a binary decision about the class of a new input
observation
• Given a series of input/output pairs: (x(i), y(i))
• For each observation x(i)
▪ We represent x(i) by a feature vector [x1, x2, …, xn]
▪ We compute an output: a predicted class ŷ(i) ∈ {0,1}

14
Binary Classification in Logistic Regression

• We want to know the probability P(y = 1|x)

that this observation is a member of the
class.
 E.g. “positive sentiment” versus “negative
sentiment”;
 the features represent counts of words in a
document;
 P(y = 1|x): probability that the document has
positive sentiment;
 P(y = 0|x): probability that the document has
negative sentiment.
15
Features in logistic regression

• For feature xi, weight wi tells us how important

xi is.
• xi ="review contains ‘awesome’": wi = +10
• xj ="review contains ‘abysmal’": wj = -10
• xk =“review contains ‘mediocre’": wk = -2

16
Logistic Regression for one observation
• Input observation: vector x = [x1, x2, …, xn]
• Weights: one per feature: w = [w1, w2, …, wn]
▪ Sometimes the weights are called θ = [θ1, θ2, …, θn]
• Output: a predicted class ŷ ∈ {0,1}

(For multinomial logistic regression, e.g. with five classes: ŷ ∈ {0, 1, 2, 3, 4})

• The output is computed from a linear function of
the input: the weighted sum of the features plus a bias
17
Problem
$$z = \sum_{i=1}^{n} w_i x_i + b$$
$$z = \mathbf{w} \cdot \mathbf{x} + b$$
• Inclusion of the weights and the bias. (Note the
dot product on the RHS)
• But z isn't a probability, it's just a number!
• We need to decide: if this sum is high, we say y=1;
if low, then y=0
• Solution: use a function of z that goes from 0 to 1
$$y = \sigma(z) = \frac{1}{1 + e^{-z}} = \frac{1}{1 + \exp(-z)}$$
18
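The very useful sigmoid or logistic function
• Note that nothing in Eq. 5.3 (z = w·x + b) forces z to be a legal probability, that
is, to lie between 0 and 1. In fact, since weights are real-valued, the output might
even be negative; z ranges from −∞ to ∞.
• To create a probability, we'll pass z through the sigmoid function, σ(z).
• The sigmoid function (named because it looks like an s) is also called the
logistic function, and gives logistic regression its name.
• The sigmoid has the following equation, shown graphically in Fig. 5.1:
$$y = \sigma(z) = \frac{1}{1 + e^{-z}}$$
• The sigmoid function takes a real value and maps it to the range (0, 1). It is
nearly linear around 0, but outlier values get squashed toward 0 or 1.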
19
Turning a probability into a classifier

0.5 here is called the decision boundary
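
As a minimal sketch of how the sigmoid and the 0.5 decision boundary fit together, the following Python defines a numerically safe sigmoid and a threshold classifier. The function names `sigmoid` and `classify`, and the two-feature weights and inputs, are assumptions chosen for illustration, not values from the slides.

```python
import math

def sigmoid(z: float) -> float:
    """Map any real z to the open interval (0, 1)."""
    # Branch on the sign of z so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def classify(w, x, b, boundary=0.5):
    """Return (predicted label, P(y=1|x)) for one observation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b   # z = w . x + b
    p = sigmoid(z)
    return (1 if p > boundary else 0), p

# Hypothetical two-feature example (values assumed for illustration).
label, p = classify(w=[1.0, -2.0], x=[3.0, 1.0], b=0.0)
print(label, round(p, 3))
```

Because the sigmoid is monotonic and σ(0) = 0.5, comparing P(y=1|x) with 0.5 is the same test as comparing w·x + b with 0.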

22
The probabilistic classifier

[Figure: P(y=1) plotted as a function of w·x + b]
23
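Turning a probability into a classifier

$$\hat{y} = \begin{cases} 1 & \text{if } \mathbf{w} \cdot \mathbf{x} + b > 0 \\ 0 & \text{if } \mathbf{w} \cdot \mathbf{x} + b \le 0 \end{cases}$$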

24
Logistic Regression: A Sentiment
Classification Example
Sentiment Example

• Consider a binary sentiment classification on

movie review texts
• We would like to know whether to assign the
sentiment class + or - to a review document doc.
Fig. 5.2
•It's hokey . There are virtually no surprises
, and the writing is second-rate . So why was
it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I
was overcome with the urge to get off the
couch and start dancing . It sucked me in ,
and it'll do the same to you .
26
Sentiment Example
• We decided to represent each input
observation (i.e. review) by the 6 features x1
… x6 of the input shown in the following table.

27
Figure 5.2: A sample mini test document showing the extracted features in the vector x:
x1=3, x2=2, x3=1, x4=3, x5=0, x6=4.19.

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

$$\begin{aligned}
p(+|x) = P(Y=1|x) &= \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) \\
&= \sigma(0.833) \\
&= 0.70 \qquad (5.6) \\
p(-|x) = P(Y=0|x) &= 1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= 0.30
\end{aligned}$$
28
Classifying sentiment for input x

Suppose we have learned w = [2.5, −5.0, −1.2, 0.5, 2.0, 0.7]

and b = 0.1
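
The arithmetic of this example can be checked with a few lines of Python. This is a sketch that simply plugs in the weight vector, the feature values from Fig. 5.2, and the bias as they appear in the worked equation; the variable names are assumptions.

```python
import math

w = [2.5, -5.0, -1.2, 0.5, 2.0, 0.7]   # learned weights from the worked equation
x = [3, 2, 1, 3, 0, 4.19]              # feature values x1..x6 from Fig. 5.2
b = 0.1                                # learned bias

z = sum(wi * xi for wi, xi in zip(w, x)) + b   # w . x + b
p_pos = 1.0 / (1.0 + math.exp(-z))             # P(+|x) = sigmoid(z)
p_neg = 1.0 - p_pos                            # P(-|x)

print(round(z, 3), round(p_pos, 2), round(p_neg, 2))
```

Since P(+|x) is above 0.5, the classifier would label this review as positive.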
29
Classifying sentiment for input x

Given these 6 features and the input review x, P(+|x) and P(−|x) can be computed using Eq. 5.5:

$$\begin{aligned}
p(+|x) = P(Y=1|x) &= \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= \sigma([2.5, -5.0, -1.2, 0.5, 2.0, 0.7] \cdot [3, 2, 1, 3, 0, 4.19] + 0.1) \\
&= \sigma(0.833) \\
&= 0.70 \qquad (5.6) \\
p(-|x) = P(Y=0|x) &= 1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b) \\
&= 0.30
\end{aligned}$$

Results here rounded to 2 decimal places.

Logistic regression is commonly applied to all sorts of NLP tasks, and any property
of the input can be a feature. Consider the task of period disambiguation: deciding
if a period is the end of a sentence or part of a word.
30
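Building features for LR for any
classification task: e.g. period disambiguation
• End of sentence: the period in "This ends in a period."
• Not end of sentence: the period after "St" in "The house at 465 Main St. is new."
$$x_1 = \begin{cases} 1 & \text{if } Case(w_i) = \text{Lower} \\ 0 & \text{otherwise} \end{cases} \quad \text{(perhaps with a +ve weight)}$$
$$x_2 = \begin{cases} 1 & \text{if } w_i \in \text{AcronymDict} \\ 0 & \text{otherwise} \end{cases} \quad \text{(perhaps with a -ve weight)}$$
$$x_3 = \begin{cases} 1 & \text{if } w_i = \text{St.} \text{ and } Case(w_{i-1}) = \text{Cap} \\ 0 & \text{otherwise} \end{cases}$$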
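
One way these three features might be computed is sketched below. The abbreviation list `ACRONYM_DICT`, the function name `period_features`, and the whitespace tokenization (which leaves the period attached to its word) are all assumptions made for illustration.

```python
# Hypothetical abbreviation dictionary; a real system would use a larger list.
ACRONYM_DICT = {"St.", "Dr.", "Mr.", "Prof."}

def period_features(tokens, i):
    """Return [x1, x2, x3] for the period-final token at position i."""
    word = tokens[i]
    prev = tokens[i - 1] if i > 0 else ""
    x1 = 1 if word.islower() else 0                          # Case(w_i) = Lower
    x2 = 1 if word in ACRONYM_DICT else 0                    # w_i in AcronymDict
    x3 = 1 if word == "St." and prev[:1].isupper() else 0    # "St." after a capitalized word
    return [x1, x2, x3]

tokens = "The house at 465 Main St. is new.".split()
print(period_features(tokens, 5))   # features for "St."
print(period_features(tokens, 7))   # features for "new."
```

With a positive weight on x1 and a negative weight on x2, the final "new." would be pushed toward EOS and "St." toward not-EOS.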
31
Learning in Logistic Regression
How to get the W’s?
• Supervised classification: we know the
correct label y (either 0 or 1) for each x.
• What the system produces is an estimate, ŷ
• We want to set w and b to minimize the
distance between our estimate ŷ(i) and the
true y(i).
• We need a distance estimator: a loss
function or a cost function
• We need an optimization algorithm to
update w and b to minimize the loss.
33
Learning Components

• A loss function:
▪cross-entropy loss
• An optimization algorithm:
▪stochastic gradient descent

34
Cross-Entropy Loss
The distance between ŷ and y
• We want to know how far the classifier output
(the predicted probability of the positive class, i.e.
y=1):
ŷ = σ(w∙x+b)
is from the true output (the actual label, 0 or 1):
y
• We'll call this difference:
L(ŷ, y) = how much ŷ differs from the true y

• We choose the parameters w, b that maximize

the log probability of the true y labels in the
training data given the observations x
36
Deriving cross-entropy loss for
a single observation x
Goal: Maximize the probability of the correct
label p(y|x) in the case of binary classification

Maximize:
$$p(y|x) = \hat{y}^{\,y} (1 - \hat{y})^{1-y}$$
• if y=1, this simplifies to ŷ; if y=0, this simplifies to 1 − ŷ
• Now take the log of both sides (mathematically handy)
So maximize:
$$\log p(y|x) = \log\left[\hat{y}^{\,y} (1 - \hat{y})^{1-y}\right] = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$
• Whatever values maximize log p(y|x) will also maximize
p(y|x)
37
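Deriving cross-entropy loss for a single
observation x
Maximize:
$$\log p(y|x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$
• Now flip sign to turn this into a loss function:
something to minimize
• Cross-entropy loss. Minimize:
$$L_{CE}(\hat{y}, y) = -\log p(y|x) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$$
• Or, plugging in the definition of ŷ:
$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1 - y) \log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right]$$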

38
Application to the sentiment example
• We want the loss to be:
• smaller if the model estimate is close to correct
• bigger if the model is confused
• Let us first suppose the true label of this review is y=1
(positive)
It's hokey . There are virtually no surprises
, and the writing is second-rate . So why was
it so enjoyable ? For one thing , the cast is
great . Another nice touch is the music . I
was overcome with the urge to get off the
couch and start dancing . It sucked me in ,
and it'll do the same to you .

39
Application to the sentiment example?

• True value is y=1. How well is our model doing?
The model's estimate is p(+|x) = P(Y=1|x) = σ(w·x + b) = σ(0.833) = 0.70, so p(−|x) = P(Y=0|x) = 0.30.

• So the loss is:
$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1 - y) \log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right] = -\log \sigma(\mathbf{w} \cdot \mathbf{x} + b) = -\log(0.70) \approx 0.36$$
40
Application to the sentiment example?
• Suppose the true value instead was y = 0.
• So the loss is:
$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1 - y) \log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right] = -\log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right) = -\log(0.30) \approx 1.2$$
41
Application to the sentiment example?
• The loss when the model was right (if true y=1): −log(0.70) ≈ 0.36

• is lower than the loss when the model was wrong (if
true y=0): −log(0.30) ≈ 1.2

• Sure enough, the loss was bigger when the
model was wrong!
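
The comparison between the two cases can be reproduced with a short sketch. It assumes the natural logarithm and uses the rounded estimate ŷ = 0.70 from the worked example; the function name `cross_entropy` is an assumption.

```python
import math

def cross_entropy(y_hat: float, y: int) -> float:
    """Binary cross-entropy loss (natural log) for a single observation."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

y_hat = 0.70   # P(y=1|x) from the worked example, rounded to 2 places

loss_if_positive = cross_entropy(y_hat, 1)   # true label y = 1: model was right
loss_if_negative = cross_entropy(y_hat, 0)   # true label y = 0: model was wrong

print(round(loss_if_positive, 2), round(loss_if_negative, 2))
```

The loss grows without bound as the probability assigned to the true label approaches 0, which is what makes confident wrong answers expensive.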
42
Stochastic Gradient Descent
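Our goal: minimize the loss

• Let's make explicit that the loss function
is parameterized by weights θ = (w, b)
• We want the weights that minimize the
loss, averaged over all examples:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{m} \sum_{i=1}^{m} L_{CE}\left(f(x^{(i)}; \theta), y^{(i)}\right)$$
where f(x; θ) is the model's estimate ŷ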

44
Intuition of gradient descent

• How do I get to the bottom of this
river canyon?
▪ Look around me 360°
▪ Find the direction of steepest slope
▪ Go that way

45
Our goal: minimize the loss

• For logistic regression, the loss function

is convex
• A convex function has just one
minimum
• Gradient descent starting from any
point is guaranteed to find the minimum
• (The Loss for multi-layer neural networks is
non-convex)

46
Visualization for a single scalar w
Q: Given current w, should we make it bigger or smaller?
A: Move w in the reverse direction from the slope of the
function

In the figure, w1 is the initial value of w. The slope of the loss at w1 is
negative, so we'll move in the positive direction (making w bigger).
47

Gradients

• The gradient of a function of many

variables is a vector pointing in the
direction of the greatest increase in a
function.

• Gradient Descent: Find the gradient of

the loss function at the current point and
move in the opposite direction.

48
Amount of move in that direction?
• The value of the gradient (slope in our
example) $\frac{d}{dw} f(x; w)$, weighted by a
learning rate η
• Higher learning rate means move w faster
$$w^{t+1} = w^{t} - \eta \frac{d}{dw} f(x; w)$$
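
The scalar update rule can be tried on a toy problem. The sketch below assumes a hypothetical convex function f(w) = (w − 3)², whose derivative is 2(w − 3); neither the function nor the helper name `gradient_descent_1d` comes from the slides.

```python
def gradient_descent_1d(grad, w0, eta, steps):
    """Repeatedly apply w <- w - eta * grad(w), starting from w0."""
    w = w0
    for _ in range(steps):
        w = w - eta * grad(w)
    return w

# Derivative of the assumed function f(w) = (w - 3)^2, minimized at w = 3.
def grad_f(w):
    return 2.0 * (w - 3.0)

w_final = gradient_descent_1d(grad_f, w0=0.0, eta=0.1, steps=100)
w_overshoot = gradient_descent_1d(grad_f, w0=0.0, eta=1.1, steps=20)

print(w_final, w_overshoot)
```

With η = 0.1 the iterate settles near the minimum at 3, while with η = 1.1 each step jumps past the minimum by more than it started from it, so the iterate moves further away over time.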
49
Case of N dimensions
• We want to know where in the N-
dimensional space (of the N parameters
that make up θ ) we should move.
• The gradient is just such a vector; it
expresses the directional components of
the sharpest slope along each of the N
dimensions.

50
Visualizing 2 dimensions, w and b
• Visualizing the gradient vector at the red point
• It has two dimensions shown in the x-y plane

Figure 5.4: Visualization of the gradient vector at the red point in two dimensions w and b,
showing the gradient as a red arrow in the x-y plane. (Axes: w, b, and Cost(w,b).)
51
Real gradients
• Are much longer; lots and lots of weights
• For each dimension wi the gradient
component i tells us the slope with respect
to that variable.
 “How much would a small change in wi
influence the total loss function L?”
 We express the slope as a partial derivative of
the loss with respect to wi
• The gradient is then defined as a vector of
these partial derivatives.
52
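The gradient
We'll represent ŷ as f(x; θ) to make the dependence on θ more
obvious:
$$\nabla L(f(x; \theta), y) = \begin{bmatrix} \frac{\partial}{\partial w_1} L(f(x; \theta), y) \\ \frac{\partial}{\partial w_2} L(f(x; \theta), y) \\ \vdots \\ \frac{\partial}{\partial w_n} L(f(x; \theta), y) \\ \frac{\partial}{\partial b} L(f(x; \theta), y) \end{bmatrix}$$
The final equation for updating θ based on the gradient is thus
$$\theta^{t+1} = \theta^{t} - \eta \nabla L(f(x; \theta), y)$$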
53
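Partial derivatives for logistic regression
The loss function
$$L_{CE}(\hat{y}, y) = -\left[y \log \sigma(\mathbf{w} \cdot \mathbf{x} + b) + (1 - y) \log\left(1 - \sigma(\mathbf{w} \cdot \mathbf{x} + b)\right)\right]$$
The elegant derivative of this function:
$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y\right] x_j = (\hat{y} - y)\, x_j$$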

54
Algorithm for Stochastic Gradient
Descent
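
A minimal sketch of stochastic gradient descent for binary logistic regression, assuming cross-entropy loss, zero initialization, a fixed learning rate, and a fixed number of passes over the data. The function name `sgd_logistic_regression`, its parameters, and the toy dataset are assumptions for illustration, not the slide's own pseudocode.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic_regression(data, n_features, eta=0.1, epochs=200, seed=0):
    """Learn (w, b) from a list of (x, y) pairs, one example at a time."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    b = 0.0
    examples = list(data)
    for _ in range(epochs):
        rng.shuffle(examples)                      # visit examples in random order
        for x, y in examples:
            y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = y_hat - y                        # gradient factor (y_hat - y)
            w = [wi - eta * err * xi for wi, xi in zip(w, x)]
            b -= eta * err
    return w, b

# Hypothetical toy data: x = [positive word count, negative word count].
toy = [([3, 0], 1), ([2, 1], 1), ([0, 3], 0), ([1, 2], 0)]
w, b = sgd_logistic_regression(toy, n_features=2)
print(w, b)
```

Each inner-loop step computes the estimate for one example, takes the gradient of the loss for that example, and moves the parameters a small distance in the opposite direction.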

55
Hyperparameters
• The learning rate η is a hyperparameter
▪ too high: the learner will take big steps and
overshoot
▪ too low: the learner will take too long
• More on hyperparameters in Chapter 7
• Instead of being learned by the algorithm
from supervision (like regular parameters),
they are chosen by the algorithm designer
(or by grid search).

56
Stochastic Gradient Descent: An example
and more details
Working through an example
• One step of gradient descent
• A mini-sentiment example, where the true y=1
(positive)
• Two features:
x1 = 3 (count of positive lexicon words)
x2 = 2 (count of negative lexicon words)
Assume the 3 parameters (2 weights and 1 bias) in
θ0 are set to zero: w1 = w2 = b = 0
And learning rate η = 0.1
58
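Example of gradient descent
• Step to update θ: w1 = w2 = b = 0;
x1 = 3; x2 = 2
$$\theta^{t+1} = \theta^{t} - \eta \nabla L\left(f(x^{(i)}; \theta), y^{(i)}\right)$$

where
$$\frac{\partial L_{CE}(\hat{y}, y)}{\partial w_j} = \left[\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y\right] x_j, \qquad \frac{\partial L_{CE}(\hat{y}, y)}{\partial b} = \sigma(\mathbf{w} \cdot \mathbf{x} + b) - y$$
• Gradient vector has 3 dimensions:
$$\nabla_{w,b} L = \begin{bmatrix} (\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y)\, x_1 \\ (\sigma(\mathbf{w} \cdot \mathbf{x} + b) - y)\, x_2 \\ \sigma(\mathbf{w} \cdot \mathbf{x} + b) - y \end{bmatrix} = \begin{bmatrix} (\sigma(0) - 1)\, x_1 \\ (\sigma(0) - 1)\, x_2 \\ \sigma(0) - 1 \end{bmatrix} = \begin{bmatrix} -0.5\, x_1 \\ -0.5\, x_2 \\ -0.5 \end{bmatrix} = \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix}$$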

59
Example of gradient descent

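• Now that we have a gradient, we compute the
new parameter vector θ1 by moving θ0 in the
opposite direction from the gradient:
$$\theta^{t+1} = \theta^{t} - \eta \nabla L\left(f(x^{(i)}; \theta), y^{(i)}\right), \qquad \eta = 0.1$$
$$\theta^{1} = \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} - 0.1 \begin{bmatrix} -1.5 \\ -1.0 \\ -0.5 \end{bmatrix} = \begin{bmatrix} 0.15 \\ 0.1 \\ 0.05 \end{bmatrix}$$

• Note that enough negative examples would eventually make w2 negative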
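
This single update step can be reproduced numerically. The sketch below uses the values stated in the example (x1 = 3, x2 = 2, y = 1, all parameters zero, η = 0.1) and the standard logistic regression gradient (ŷ − y)·x; the variable names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = [3.0, 2.0], 1                 # features and true label from the example
w, b, eta = [0.0, 0.0], 0.0, 0.1     # initial parameters and learning rate

y_hat = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)   # sigmoid(0)

grad_w = [(y_hat - y) * xi for xi in x]   # partial derivatives w.r.t. w1, w2
grad_b = y_hat - y                        # partial derivative w.r.t. b

# Move against the gradient.
w_new = [wi - eta * gi for wi, gi in zip(w, grad_w)]
b_new = b - eta * grad_b

print(grad_w, grad_b, w_new, b_new)
```

Note that w2 moves up even though x2 counts negative lexicon words, because this one positive example happened to contain two of them.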

60
Mini-batch training
• Stochastic gradient descent chooses a single
random example at a time.
• That can result in choppy movements
• More common to compute gradient over batches of
training instances.
• Batch training: entire dataset
• Mini-batch training: m examples at a time (512,
or 1024)
• Mini-batches can easily be vectorized
• Size of the minibatch based on the computational
resources.
➔ Can process all the examples in one mini-batch in
parallel and then accumulate the loss
- not possible with individual or batch training.
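
A sketch of the gradient computed over one mini-batch, assuming the per-example gradients are averaged over the batch (the averaging convention, the function name `minibatch_gradient`, and the two-example batch are assumptions). It is written in plain Python for self-containment; in practice the loop over examples would typically be replaced by a single matrix operation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def minibatch_gradient(batch, w, b):
    """Average cross-entropy gradient over a mini-batch of (x, y) pairs."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, y in batch:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
        for j, xj in enumerate(x):
            grad_w[j] += err * xj      # accumulate (y_hat - y) * x_j
        grad_b += err
    m = len(batch)
    return [g / m for g in grad_w], grad_b / m

# Hypothetical batch of two examples, evaluated at w = 0, b = 0.
batch = [([3.0, 2.0], 1), ([1.0, 4.0], 0)]
grad_w, grad_b = minibatch_gradient(batch, w=[0.0, 0.0], b=0.0)
print(grad_w, grad_b)
```

Averaging over several examples smooths out the choppy movements that a single random example can cause.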
61
Regularization
Overfitting
• A model that perfectly matches the training
data has a problem.
• It will also overfit to the data, modeling
noise
 A random word that perfectly predicts y (it
happens to only occur in one class) will get a very
high weight.
 Failing to generalize to a test set without this
word.
• A good model should be able to generalize
well to the unseen (test) data
• A model that overfits will have poor
generalization
63
Overfitting
+ This movie drew me in, and it'll do the same to you.
- I can't tell you how much I hated this movie. It sucked.

Useful or harmless features:
X1 = "this"
X2 = "movie"
X3 = "hated"
X4 = "drew me in"

4-gram features that just "memorize" training set and might cause problems:
X5 = "the same to you"
X7 = "tell you how much"
64
Overfitting
• A 4-gram model on tiny data will just memorize the data
➔ 100% accuracy on the training set
• But it will be surprised by the novel 4-grams in the test
data
➔ Low accuracy on the test set
• Models that are too powerful can overfit the data
 Fitting the details of the training data so exactly that
the model doesn't generalize well to the test set
▪ How to avoid overfitting?
• Regularization in logistic regression
• Dropout in neural networks
65
Regularization
• A solution for overfitting
• Add a regularization term R(θ) to the loss
function (now written as maximizing
logprob rather than minimizing loss)
$$\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)}) - \alpha R(\theta)$$
• Idea: choose an R(θ) that penalizes large
weights
▪ Fitting the data well with lots of big weights
is not as good as fitting the data a little less
well with small weights
66
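L2 Regularization (= Ridge Regression)
• The sum of the squares of the weights:
$$R(\theta) = \|\theta\|_2^2 = \sum_{j=1}^{n} \theta_j^2$$
• The name is because this is the (square of
the) L2 norm ||θ||₂, the Euclidean distance
of θ to the origin.

• L2 regularized objective function (where α is
a hyperparameter):
$$\hat{\theta} = \arg\max_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)})\right] - \alpha \sum_{j=1}^{n} \theta_j^2$$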

67
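L1 Regularization (= Lasso Regression)
• The sum of the (absolute value of the)
weights:
$$R(\theta) = \|\theta\|_1 = \sum_{j=1}^{n} |\theta_j|$$
• Named after the L1 norm ||θ||₁, the sum of the
absolute values of the weights, also called the Manhattan
distance

• L1 regularized objective function:
$$\hat{\theta} = \arg\max_{\theta} \left[\sum_{i=1}^{m} \log P(y^{(i)} \mid x^{(i)})\right] - \alpha \sum_{j=1}^{n} |\theta_j|$$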

68
L1 vs L2 Regularization
• Both are commonly used in language processing.
• L2 regularization: easier to optimize because of its
simple derivative (the derivative of θ² is just 2θ)
• L1 regularization: more complex (the derivative of
|θ| is discontinuous at zero).
• L2 prefers weight vectors with many small weights
• L1 prefers sparse solutions with some larger
weights but many more weights set to zero.
➔ L1 regularization leads to much sparser weight
vectors, i.e. far fewer features.
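
The two penalties and their derivatives can be compared side by side in a small sketch. The helper names and the weight vector `theta` are assumptions; for L1, the code returns the sign of each weight and picks 0 at θ = 0, which is one valid subgradient choice rather than a true derivative.

```python
def l2_penalty(theta):
    """Sum of squared weights (square of the L2 norm)."""
    return sum(t * t for t in theta)

def l1_penalty(theta):
    """Sum of absolute weights (L1 norm)."""
    return sum(abs(t) for t in theta)

def l2_grad(theta):
    """Derivative of theta_j^2 is 2 * theta_j: shrinks in proportion to size."""
    return [2.0 * t for t in theta]

def l1_subgrad(theta):
    """sign(theta_j); 0 is chosen at theta_j = 0, where |theta| has a kink."""
    return [(t > 0) - (t < 0) for t in theta]

theta = [0.5, -2.0, 0.0]   # hypothetical weight vector
print(l2_penalty(theta), l1_penalty(theta))
print(l2_grad(theta), l1_subgrad(theta))
```

The L2 gradient fades as a weight approaches zero, while the L1 push has constant size however small the weight is, which is one way to see why L1 tends to drive weights exactly to zero.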
69
Multinomial Logistic Regression
Multinomial Logistic Regression
• Often we need more than 2 classes
 Positive/negative/neutral
 Parts of speech (noun, verb, adjective, adverb,
preposition, etc.)
 Classify emergency SMSs into different actionable
classes
• If >2 classes we use multinomial logistic regression
= Softmax regression
• So "logistic regression" will just mean binary (2 output
classes)

71
Multinomial Logistic Regression
• Recall the expression of the loss for binary LR:
$$L_{CE}(\hat{y}, y) = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$$
• Multinomial LR generalizes to K terms:
$$L_{CE}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$
• y and ŷ are represented as vectors.
• True label y is a vector with K elements, each
corresponding to a class, with yc = 1 if the correct
class is c; all other elements of y being 0.
• Our classifier will produce an estimate vector ŷ
with K elements, each element ŷk of which
represents the estimated probability p(yk = 1 | x).
72
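Multinomial Logistic Regression
$$L_{CE}(\hat{\mathbf{y}}, \mathbf{y}) = -\log \hat{y}_c \qquad \text{(where } c \text{ is the correct class)}$$
The cross-entropy loss is simply the negative log of the output
probability corresponding to the correct class (hence the
name negative log likelihood loss).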
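
A short sketch of why only the correct class contributes to the loss: with a one-hot label vector, every other term is multiplied by zero. The function name `multinomial_cross_entropy` and the three-class probabilities are assumptions for illustration.

```python
import math

def multinomial_cross_entropy(y_hat, y_onehot):
    """Cross-entropy between a predicted distribution and a one-hot label."""
    # Terms with y_k = 0 vanish, so they are skipped.
    return -sum(yk * math.log(pk) for yk, pk in zip(y_onehot, y_hat) if yk)

y_hat = [0.7, 0.2, 0.1]    # hypothetical predicted probabilities, K = 3
y_onehot = [0, 1, 0]       # the correct class is the second one

loss = multinomial_cross_entropy(y_hat, y_onehot)
print(loss)
```

Here the loss equals −log(0.2), the negative log of the probability the model gave to the correct class.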
73
Multinomial Logistic Regression
• The gradient for a single example turns out to
be very similar to the gradient for binary LR:
$$\frac{\partial L_{CE}}{\partial w_{k,i}} = -(y_k - \hat{y}_k)\, x_i$$

74
The softmax function
• The probability of everything must still sum to 1
P(positive|doc) + P(negative|doc) +
P(neutral|doc) = 1

• Need a generalization of the sigmoid called the

softmax
 Takes a vector z = [z1, z2, ..., zk] of k arbitrary values
 Outputs a probability distribution
▪ each value in the range [0,1]
▪ all the values summing to 1
75
The softmax function
Turns a vector z = [z1, z2, ... , zk] of k arbitrary values into
probabilities
$$\text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^{k} \exp(z_j)}, \qquad 1 \le i \le k$$

76
The softmax function
• Example with 6 classes, suppose
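
A minimal softmax sketch for a six-class case. The score vector `z` below is assumed for illustration, and subtracting the maximum score before exponentiating is a common numerical-stability choice that does not change the result.

```python
import math

def softmax(z):
    """Turn a list of arbitrary real scores into a probability distribution."""
    m = max(z)                                  # shift for numerical stability
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.6, 1.1, -1.5, 1.2, 3.2, -1.1]   # hypothetical scores for 6 classes
probs = softmax(z)
print([round(p, 3) for p in probs])
```

The largest score receives most of the probability mass, but every class keeps a nonzero share and the values sum to 1.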

77
Softmax in multinomial logistic regression

• Input is still the dot product between weight

vector w and input vector x
• But now we’ll need separate weight vectors
for each of the K classes.

78
Features in binary versus multinomial
logistic regression
• Binary: positive weight → y=1
negative weight → y=0
w5 = 3.0

• Multinominal: separate weights for each

class:

79
Multinomial Logistic Regression
and the
Reference Class Constraint
Model Identifiability
• A statistical model is identifiable if different
parameter values lead to different probability
distributions.
• Formally, a model is identifiable if:
$$P(y \mid x; \theta_1) = P(y \mid x; \theta_2) \;\Rightarrow\; \theta_1 = \theta_2$$
• That means: the parameters θ are uniquely
determined by the distribution they define.
• If a model is non-identifiable, there exist different
parameter sets that produce exactly the same
predictions — so you cannot tell which set of
parameters is the “true” one.

81
Multinomial Logistic Regression
• Suppose we have K classes:
$$P(y = k \mid x) = \frac{e^{w_k \cdot x}}{\sum_{j=1}^{K} e^{w_j \cdot x}}$$
• Here, each class k has its own weight
vector wk.

82
The Problem:
Redundancy of Parameters
• If one adds the same constant vector a to all the
weight vectors wk, the probabilities do not change:
$$P(y = k \mid x) = \frac{e^{(w_k + a) \cdot x}}{\sum_{j=1}^{K} e^{(w_j + a) \cdot x}} = \frac{e^{a \cdot x}\, e^{w_k \cdot x}}{e^{a \cdot x} \sum_{j=1}^{K} e^{w_j \cdot x}} = \frac{e^{w_k \cdot x}}{\sum_{j=1}^{K} e^{w_j \cdot x}}$$

• The common factor e^{a·x} cancels out of numerator and
denominator.
• So there are infinitely many different parameter sets
{w1, …, wK} that yield the same probabilities,
meaning the model is non-identifiable.
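
The redundancy can be checked numerically. In this sketch the three weight vectors, the shift vector `a`, and the input `x` are hypothetical values chosen for illustration, and `class_probs` is an assumed helper name.

```python
import math

def class_probs(weights, x):
    """Softmax over the per-class scores w_k . x."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

weights = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]   # hypothetical, K = 3 classes
a = [5.0, -2.0]                                    # same vector added to every w_k
shifted = [[wi + ai for wi, ai in zip(w, a)] for w in weights]

x = [0.7, 1.5]
p_original = class_probs(weights, x)
p_shifted = class_probs(shifted, x)
print(p_original)
print(p_shifted)
```

Both parameter sets give the same distribution up to floating-point rounding, even though the weights themselves differ.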

83
Fixing the Identifiability Problem
• To make the model identifiable, we remove this
redundancy by fixing one class's parameters. This is
called choosing a reference class (or baseline category).
• Typically, we set: w_K = 0 (w_ref = 0) for some class K
• Then, the model becomes:
$$P(y = c \mid x) = \frac{e^{w_c \cdot x}}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}, \qquad c = 1, \dots, K - 1$$

and
$$P(y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} e^{w_j \cdot x}}$$

• Now, the model is identifiable: each parameter set
corresponds to a unique probability distribution.
84
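
The reference-class form can be compared against the unconstrained softmax in a short sketch. It assumes a hypothetical 3-class weight set and obtains the K − 1 free vectors by subtracting the last class's weights from the others, so that the reference class has weight zero; the function names are assumptions.

```python
import math

def softmax_probs(weights, x):
    """Unconstrained form: one weight vector per class."""
    exps = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def reference_class_probs(free_weights, x):
    """Constrained form: K-1 weight vectors, class K fixed at w_K = 0."""
    exps = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in free_weights]
    denom = 1.0 + sum(exps)                      # e^{0 . x} = 1 for the reference class
    return [e / denom for e in exps] + [1.0 / denom]

full = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]    # hypothetical, K = 3 classes
w_ref = full[-1]
free = [[wi - ri for wi, ri in zip(w, w_ref)] for w in full[:-1]]

x = [0.7, 1.5]
p_full = softmax_probs(full, x)
p_ref = reference_class_probs(free, x)
print(p_full)
print(p_ref)
```

The two parameterizations produce the same probabilities, but the constrained one has no remaining freedom to shift all weights together.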
Intuitive Explanation
• Without fixing one class, the system is
“floating”: all weights can be shifted up or
down and predictions stay the same.
• Choosing a reference class anchors the model,
eliminating this floating ambiguity.
• This subtle yet essential step underlies the
Softmax classifier.

85
Summary & Takeaways

• Multinomial Logistic Regression generalizes

binary logistic regression to multiple classes.
• A reference class constraint (𝑤𝑟𝑒𝑓 =0) is
essential for identifiability.
• LR is widely used in NLP for text classification,
sentiment analysis, and more.
• Unlike Naive Bayes, LR can capture feature
interactions when features are engineered
appropriately.

86
Textbook
Jurafsky, D. and Martin, J.H. (2024) Speech and Language
Processing: An Introduction to Natural Language Processing,
Computational Linguistics, and Speech Recognition with
Language Models (3rd ed. draft), Prentice Hall.
[Link]
The draft version of the 3rd edition
will be used; it has been updated to
include more recent topics such as
Vector Semantics and Embeddings,
ANNs and Deep Learning for NLP,
Transformers, etc.
These course slides are largely
based on this textbook and its
authors' slides, with light
customisations where appropriate.
87

3 pages
Introduction to Machine Learning Concepts
No ratings yet
Introduction to Machine Learning Concepts
21 pages
Detecting Amazon Fake Reviews with ML
No ratings yet
Detecting Amazon Fake Reviews with ML
6 pages


