L3 Classification Naivebayes

The document outlines the basics of Natural Language Processing (NLP) with a focus on classification techniques, including supervised learning and various classifiers like Naive Bayes and logistic regression. It discusses the importance of text classification in tasks such as spam detection, sentiment analysis, and intent detection. Additionally, it covers model building, feature selection, and the application of statistical methods in training classifiers.

Uploaded by

creakedeggs

SFU NatLangLab

CMPT 413/713: Natural Language Processing

Classification
Spring 2026
2026-01-12

Adapted from slides from Danqi Chen, Karthik Narasimhan, and Anoop Sarkar
1
Review: Basic Machine Learning Terminology

• Supervised (uses labeled training data) vs Unsupervised learning

• Classification vs Regression

• Discriminative vs Generative models

• We will do Supervised Text Classification

2
Why classify?

• Spam detection
• Sentiment analysis
• Authorship attribution
• Language detection
• News categorization

Sentiment analysis example (movie reviews):

neg: unbelievably disappointing
pos: Full of zany characters and richly applied satire, and some great plot twists
pos: this is the greatest screwball comedy ever filmed
neg: It was pathetic. The worst part about it was the boxing scenes.
3
Classification as a subtask in NLP

• NLP is all (mostly) about classification

• Text classification: Spam/Not Spam, Sentiment Analysis

• Language modelling / generating sentences: select the word to generate at each step (classification over the vocabulary!)

• Building dialog systems (identifying intent)

• Parsing (identifying word to attach to)

4
Classification as a subtask in NLP
Intent detection
ADDR_CHANGE: I just moved and want to change my address.
ADDR_CHANGE: Please help me update my address
FILE_CLAIM: I just got into a terrible accident and I want to file a claim
CLOSE_ACCOUNT: I’m moving and I want to disconnect my service

Prepositional phrase attachment

noun attach: I bought the shirt with pockets
verb attach: I bought the shirt with my credit card
noun attach: I washed the shirt with mud
verb attach: I washed the shirt with soap
5
Text classification: the task
• Inputs:

• A document d (a sequence of words, e.g. a sentence)

• A set of classes C = {c1, c2, c3, …, cm} (multiple classes: m; binary: m = 2)

• Output:

• Predicted class c for document d

"Movie was terrible" → Classify → Negative

"Amazing acting" → Classify → Positive

6
Rule-based classification
• Look for patterns and combinations of features over the words in the document and its metadata

IF there exists a word w in document d such that w in [good, great, extraordinary, …],
THEN output Positive

IF email address ends in [[Link], [Link], [Link], …]
THEN output SPAM

• Simple, can be very accurate

• But: rules may be hard to define (and some even unknown to us!)

• Expensive

• Not easily generalizable
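
The first rule above can be sketched in a few lines of Python. This is only an illustration: the word list `POSITIVE_WORDS`, the function name `rule_based_sentiment`, and the fallback label "Unknown" are assumptions, not part of the slides.

```python
# Hypothetical word list; a real system would need a much larger, hand-tuned lexicon.
POSITIVE_WORDS = {"good", "great", "extraordinary"}


def rule_based_sentiment(document: str) -> str:
    """Return "Positive" if any word of the document is in POSITIVE_WORDS."""
    words = document.lower().split()
    if any(w.strip(".,!?") in POSITIVE_WORDS for w in words):
        return "Positive"
    return "Unknown"  # the rule says nothing about this case


print(rule_based_sentiment("A great movie!"))     # Positive
print(rule_based_sentiment("This is not good."))  # Positive (the rule misses negation)
```

The second call shows why such rules are hard to get right: a single keyword match ignores negation.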

7
Supervised Learning: Let’s use statistics!

• Data-driven approach

Let the machine figure out the best patterns to use!

• Inputs:

• Set of m classes C = {c1, c2, …, cm}

• Set of n ‘labeled’ documents: {(d1, c1), (d2, c2), …, (dn, cn)}

• Output: Trained classifier, F : d → c

• What form should F take?

• How to learn F?

8
Designing machine learning models
General recipe

Building the model:
• Input features: f(x) → [f1, f2, …, fm]
• Need to determine features
• Output: estimate P(y | x) for each class y
• Need to model P(y | x) with a family of functions

• Train phase: learn the parameters of the model to minimize a loss function
• Need a training objective and an optimization algorithm
• Test phase: apply the learned parameters to predict the class of a new input

9
General guidelines for model building

Two steps to building a probability model:

1. Define the model

• What form should F take?

• What independence assumptions do we make?

• What are the model parameters (probability values)?

2. Estimate the model parameters (training/learning)

• How to learn F? What to optimize? What is the training objective?

10
Types of supervised classifiers

• Naive Bayes
• Logistic regression
• Support vector machines
• k-nearest neighbors
• Neural networks
11
Naive Bayes

12
Naive Bayes Classifier
General setting
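• Let the input x be represented as r features: f_j, 1 ≤ j ≤ r

• Let y be the output classification

• We can have a simple classification model using Bayes rule:

$$ \underbrace{P(y \mid x)}_{\text{posterior}} = \frac{\overbrace{P(y)}^{\text{prior}} \cdot \overbrace{P(x \mid y)}^{\text{likelihood}}}{\underbrace{P(x)}_{\text{evidence}}} $$

• Make strong (naive) conditional independence assumptions:

$$ P(x \mid y) = \prod_{j=1}^{r} P(f_j \mid y) $$

• Combined with Bayes rule:

$$ P(y \mid x) \propto P(y) \cdot \prod_{j=1}^{r} P(f_j \mid y) $$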

13
Naive Bayes classifier
for text classification
• For text classification: input x is document d = (w1, …, wk)

• Use as our features the words wj, 1 ≤ j ≤ | V | where V is our vocabulary

• c is the output classification

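• Predicting the best class with the maximum a posteriori (MAP) estimate:

$$ c_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(c)\, P(d \mid c)}{P(d)} = \arg\max_{c \in C} P(c)\, P(d \mid c) $$

• P(d) is the same for every class, so it can be dropped from the arg max
• P(d | c) → Conditional probability of generating document d from class c
• P(c) → Prior probability of class c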
14
Represent P(d | c) as Bag of Words model

• Assume the position of each word is irrelevant (both absolute and relative): order doesn't matter

• P(w1, w2, w3, …, wk | c) = P(w1 | c)P(w2 | c)…P(wk | c)

• Probability of each word is conditionally independent
given class c

[Figure: the words "the", "cat", "sat", "on", "the", "mat" scattered in a bag]
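
A small sketch of the bag-of-words idea, assuming whitespace tokenization and lowercasing (both are simplifications chosen for this example; the function name `bag_of_words` is hypothetical):

```python
from collections import Counter


def bag_of_words(document: str) -> Counter:
    """Map each token to its count, discarding word order."""
    return Counter(document.lower().split())


a = bag_of_words("the cat sat on the mat")
b = bag_of_words("the mat sat on the cat")
print(a["the"])  # 2
print(a == b)    # True: both orderings give the same bag
```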

15
Predicting with Naive Bayes

• Once we assume that the position of each word is irrelevant and that
the words are conditionally independent given class c, we have:

P(d | c) = P(w1, w2, w3, …, wk | c) = P(w1 | c)P(w2 | c)…P(wk | c)

• The maximum a posteriori (MAP) estimate is now:

$$ c_{MAP} = \arg\max_{c \in C} \hat{P}(c)\, \hat{P}(d \mid c) = \arg\max_{c \in C} \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c) $$

Here P̂ (\hat{P}) indicates an estimated probability.

Note that k is the number of tokens (words) in the document.

The index i is the position of the token.
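
A minimal sketch of this prediction rule. One assumption goes beyond the slides: the product is computed as a sum of logarithms, a common way to avoid numerical underflow when k is large; the arg max is unchanged because log is monotonic. The probability tables below are made-up numbers.

```python
import math


def map_class(tokens, log_prior, log_likelihood):
    """Return arg max over c of log P(c) + sum_i log P(w_i | c)."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens)
        for c in log_prior
    }
    return max(scores, key=scores.get)


# Hypothetical estimated probabilities for two classes.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"great": math.log(0.6), "boring": math.log(0.4)},
    "neg": {"great": math.log(0.2), "boring": math.log(0.8)},
}
print(map_class(["great", "great", "boring"], log_prior, log_likelihood))  # pos
```

In probability space the two scores are 0.5 · 0.6 · 0.6 · 0.4 = 0.072 for pos and 0.5 · 0.2 · 0.2 · 0.8 = 0.016 for neg.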

16
Maximum likelihood estimate
• Count and take average:

$$ \hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n} $$

$$ \hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)} $$

Can suffer from sparsity issues!

[Figure: Zipf's Law, word frequency plotted against rank]

17
Solution: Smoothing!
• Maximum likelihood estimate

$$ \hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n} $$

$$ \hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)} $$

• Smoothing: Laplace smoothing

$$ \hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} \left[ \mathrm{Count}(w, c_j) + \alpha \right]} $$

• Simple, easy to use
• Effective in practice
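
The effect of α can be sketched as follows. The function name `smoothed_likelihood`, the toy vocabulary, and the counts are assumptions made for this example; setting `alpha=0.0` recovers the unsmoothed maximum likelihood estimate.

```python
from collections import Counter


def smoothed_likelihood(word, class_counts: Counter, vocab, alpha=1.0):
    """P(word | class) = (Count(word, class) + alpha) / sum_w (Count(w, class) + alpha)."""
    denom = sum(class_counts[w] + alpha for w in vocab)
    return (class_counts[word] + alpha) / denom


vocab = {"good", "bad", "movie"}
pos_counts = Counter({"good": 3, "movie": 1})  # "bad" never seen with this class

print(smoothed_likelihood("bad", pos_counts, vocab, alpha=0.0))  # 0.0 (unsmoothed MLE)
print(smoothed_likelihood("bad", pos_counts, vocab, alpha=1.0))  # 1/7, about 0.143
```

Without smoothing, a single unseen word drives the whole product for that class to zero.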

18
Overall process
• Input: Set of annotated documents $\{(d_i, c_i)\}_{i=1}^{n}$

A. Compute vocabulary V of all words

B. Calculate

$$ \hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n} $$

C. Calculate

$$ \hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} \left[ \mathrm{Count}(w, c_j) + \alpha \right]} $$

D. (Prediction) Given document d = (w1, w2, …, wk)

$$ c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c) $$
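
Steps A to D can be put together in a short sketch. The names `train_nb` and `predict_nb`, the toy training set, the use of log probabilities, and the choice to ignore out-of-vocabulary words at prediction time are assumptions of this example rather than the course's reference implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns (vocab, log_prior, log_likelihood)."""
    vocab = {w for tokens, _ in docs for w in tokens}             # step A
    class_docs = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    log_prior = {c: math.log(n_c / len(docs))                     # step B
                 for c, n_c in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:                                          # step C
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return vocab, log_prior, log_likelihood


def predict_nb(tokens, vocab, log_prior, log_likelihood):        # step D
    known = [w for w in tokens if w in vocab]  # words outside V are ignored
    return max(
        log_prior,
        key=lambda c: log_prior[c] + sum(log_likelihood[c][w] for w in known),
    )


train = [
    ("movie was terrible".split(), "neg"),
    ("terrible boring plot".split(), "neg"),
    ("amazing acting".split(), "pos"),
    ("amazing plot great acting".split(), "pos"),
]
model = train_nb(train)
print(predict_nb("amazing movie".split(), *model))  # pos
```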
19
Variants
Name based on the distribution of the features: P(f_i | y) → P(w_i | c)

• Multinomial Naive Bayes
• Normal counts (0, 1, 2, …) for each document

• Binary Multinomial NB
• Binarized counts (0/1) for each document
• Some work shows this works better than full counts or the Multivariate Bernoulli NB

• Multivariate Bernoulli NB
• Estimate P(w | c) as the fraction of documents of class c with word w
• Explicitly model P(¬w | c) = 1 − P(w | c)
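
The three variants differ mainly in how words are counted. The sketch below contrasts the counts for a single class on two made-up documents; smoothing is left out to keep the comparison short.

```python
from collections import Counter

docs_in_class = [
    "chinese chinese chinese tokyo".split(),
    "chinese macao".split(),
]

# Multinomial NB: every occurrence counts.
multinomial = Counter(w for doc in docs_in_class for w in doc)
# Binary multinomial NB: each word counts at most once per document.
binary = Counter(w for doc in docs_in_class for w in set(doc))
# Multivariate Bernoulli NB (unsmoothed): fraction of documents containing the word.
bernoulli = {w: binary[w] / len(docs_in_class) for w in binary}

print(multinomial["chinese"])  # 4
print(binary["chinese"])       # 2
print(bernoulli["tokyo"])      # 0.5
```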

20
Naive Bayes Example
          Doc  Words                                Class
Training  1    Chinese Beijing Chinese              c
          2    Chinese Chinese Shanghai             c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

Estimates (smoothing with α = 1):

$$ \hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|} $$

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo | c) = (0+1) / (8+6) = 1/14
P(Japan | c) = (0+1) / (8+6) = 1/14
P(Chinese | j) = (1+1) / (3+6) = 2/9
P(Tokyo | j) = (1+1) / (3+6) = 2/9
P(Japan | j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001

(Credits: Dan Jurafsky)
22
Naive Bayes Example
Working through the estimates for the same training and test documents:

• Let's compute the priors: what is P̂(c) and P̂(j)?

P̂(c) = 3/4, P̂(j) = 1/4

• Let's compute P̂(Japan | c):

|V| = 6
count(Japan, c) = 0
count(c) = Σ_{w∈V} count(w, c) = 8

$$ \hat{P}(\text{Japan} \mid c) = \frac{\mathrm{count}(\text{Japan}, c) + 1}{\mathrm{count}(c) + |V|} = \frac{0 + 1}{8 + 6} = \frac{1}{14} $$

(Credits: Dan Jurafsky)
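
The arithmetic in this example can be checked with exact fractions. The sketch below re-enters the five documents by hand; the helper names `prior`, `likelihood`, and `score` are chosen for this example only.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for text, _ in train for w in text.split()}
counts = {"c": Counter(), "j": Counter()}
for text, label in train:
    counts[label].update(text.split())


def prior(label):
    return Fraction(sum(1 for _, l in train if l == label), len(train))


def likelihood(word, label):  # add-one smoothing
    return Fraction(counts[label][word] + 1,
                    sum(counts[label].values()) + len(vocab))


def score(label):  # proportional to P(label | d5)
    s = prior(label)
    for w in test:
        s *= likelihood(w, label)
    return s


print(likelihood("Chinese", "c"))  # 3/7
print(float(score("c")))           # about 0.0003
print(float(score("j")))           # about 0.0001
```

Class c gets the higher score, so document 5 is assigned to c.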
23
Some details
• Vocabulary is important

• Tokenization matters: it can affect your vocabulary

• Tokenization = how you break your sentence up into tokens / words

• Make sure you are consistent with your tokenization!

• Special multi-word tokens: NOT_happy

• Modern NLP systems use subword tokens (e.g. byte pair encoding)
25
Some details
• Handling unknown words in the test data that are not in your training vocabulary?

• Remove them from your test document! Just ignore them.

• Handling stop words (common words like "a", "the" that may not be useful)

• Remove them from the training data!

• In practice not that helpful, so use all words!

• Better to use:

• Modified counts (tf-idf) that down-weight frequent, unimportant words

• Better models!
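
A rough sketch of these details: one tokenizer used consistently for training and test text, a NOT_ prefix for the word following "not", and dropping test tokens that are outside the training vocabulary. The regular expression and the one-word negation scope are assumptions made for this example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-letters, and merge "not X" into a NOT_X token."""
    words = re.findall(r"[a-z']+", text.lower())
    tokens, negate = [], False
    for w in words:
        if w == "not":
            negate = True
            continue
        tokens.append("NOT_" + w if negate else w)
        negate = False
    return tokens


def drop_unknown(tokens, vocab):
    """Ignore test tokens that never appeared in training."""
    return [t for t in tokens if t in vocab]


vocab = set(tokenize("I am happy. I am not happy."))
print(sorted(vocab))  # ['NOT_happy', 'am', 'happy', 'i']
print(drop_unknown(tokenize("I am not happy today"), vocab))  # ['i', 'am', 'NOT_happy']
```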
26
Features
• In general, Naive Bayes can use any set of features, not just words

• URLs, email addresses, Capitalization, …

• Domain knowledge can be crucial to performance

[Figure: top features for spam detection]

27
Properties of Naive Bayes

+ Simple baseline method

+ Works well for small data sizes
+ Optimal if the independence assumptions hold: if the assumed independence is
correct, then it is the Bayes Optimal Classifier for the problem

- But not if the independence assumption is broken

- Does not handle rare classes well - will favour more common class

- Also need to design features

• Modern NLP: use large neural language models with learned representations

28
Generative vs Discriminative Models

• Naive Bayes is a Generative Model: It models

p(y | x) ∝ p(y)p(x | y)

• It models how the words of the document are generated given the class

• You can use this model to sample documents

• Next: Logistic Regression, a Discriminative model that models p(y | x) directly.
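
Sampling from a Naive Bayes model can be sketched as: draw a class from p(y), then draw each word independently from p(w | y). The priors and word distributions below are made-up numbers, and `sample_document` is a hypothetical helper.

```python
import random

# Hypothetical model parameters: class priors and per-class word distributions.
prior = {"pos": 0.5, "neg": 0.5}
word_probs = {
    "pos": {"great": 0.5, "fun": 0.4, "boring": 0.1},
    "neg": {"great": 0.1, "fun": 0.1, "boring": 0.8},
}


def sample_document(length, rng):
    """Draw a class from P(c), then draw each word independently from P(w | c)."""
    c = rng.choices(list(prior), weights=list(prior.values()))[0]
    words = rng.choices(list(word_probs[c]),
                        weights=list(word_probs[c].values()), k=length)
    return c, words


rng = random.Random(0)
label, words = sample_document(5, rng)
print(label, words)
```

Because words are drawn independently, the samples have no word order structure, which is exactly the bag-of-words assumption.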

29
Evaluation

30
Evaluation

• Consider binary classification

Confusion matrix
• Table of predictions

                      Truth
                      Positive  Negative
Predicted  Positive   100       5
           Negative   45        100

• Ideally, we want:

                      Truth
                      Positive  Negative
Predicted  Positive   145       0
           Negative   0         105

31
Evaluation Metrics
Confusion matrix

                      Truth
                      Positive   Negative
Predicted  Positive   100 (TP)   5 (FP)
           Negative   45 (FN)    100 (TN)

• True positive (TP): Predicted + and actual +

• True negative (TN): Predicted - and actual -

• False positive (FP): Predicted + and actual -

• False negative (FN): Predicted - and actual +

[Figure: diagram of actual positives and predicted positives with regions TP, FP, FN, TN (image credit: wikipedia)]

$$ \mathrm{Accuracy} = \frac{TP + TN}{\mathrm{Total}} = \frac{200}{250} = 80\% $$
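
A small sketch of how the four counts and accuracy could be computed from gold labels and predictions. The function names and the "+"/"-" label encoding are assumptions of this example.

```python
def confusion_counts(y_true, y_pred, positive="+"):
    """Return (TP, FP, FN, TN) for a binary task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)


y_true = ["+", "+", "-", "-", "+"]
y_pred = ["+", "-", "-", "+", "+"]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 1)
print(accuracy(100, 5, 45, 100))         # 0.8
```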
32
Evaluation Metrics

First model:
                      Truth
                      Positive  Negative
Predicted  Positive   100       5
           Negative   45        100

Second model:
                      Truth
                      Positive  Negative
Predicted  Positive   50        25
           Negative   25        150

• True positive (TP): Predicted + and actual +

• True negative (TN): Predicted - and actual -

• False positive (FP): Predicted + and actual -

• False negative (FN): Predicted - and actual +

$$ \mathrm{Accuracy} = \frac{TP + TN}{\mathrm{Total}} = \frac{200}{250} = 80\% $$

Accuracy is a coarse metric: it cannot distinguish between the two models!
33
Precision and Recall

• Precision: % of selected items that are correct

$$ \mathrm{Precision}(+) = \frac{TP}{TP + FP} $$

• Recall: % of correct items that are selected

$$ \mathrm{Recall}(+) = \frac{TP}{TP + FN} $$

[Figure: diagram of actual positives (relevant) and predicted positives (selected/retrieved) with regions TP, FP, FN, TN (image credit: wikipedia)]
34
Evaluation Metrics

First model:
                      Truth
                      Positive  Negative
Predicted  Positive   100       5
           Negative   45        100

Second model:
                      Truth
                      Positive  Negative
Predicted  Positive   50        25
           Negative   25        150

Precision(+) = TP / (TP + FP)
• First model: 100 / (100 + 5) = 0.95
• Second model: 50 / (50 + 25) = 0.67

Recall(+) = TP / (TP + FN)
• First model: 100 / (100 + 45) = 0.69
• Second model: 50 / (50 + 25) = 0.67

Two metrics - which one to use?

35
F-Score

• Combined measure

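• Harmonic mean of Precision and Recall

$$ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $$

• Or more generally, use β to control the importance of Precision vs Recall

$$ F_\beta = \frac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}} $$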
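
These formulas translate directly into code. The sketch below evaluates them on the (TP, FP, FN) counts read off the two confusion matrices used in these slides; the function names are chosen for this example.

```python
def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def f_beta(p, r, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R); beta = 1 gives F1."""
    return (1 + beta**2) * p * r / (beta**2 * p + r)


# Counts (TP, FP, FN) read off the two confusion matrices in these slides.
for name, (tp, fp, fn) in {"first": (100, 5, 45), "second": (50, 25, 25)}.items():
    p, r = precision(tp, fp), recall(tp, fn)
    print(name, round(p, 2), round(r, 2), round(f_beta(p, r), 2))
# first 0.95 0.69 0.8
# second 0.67 0.67 0.67
```

With β > 1 recall is weighted more heavily, with β < 1 precision is.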

36
Evaluation Metrics

For the same two confusion matrices:

               First model                Second model
Precision(+)   100 / (100 + 5) = 0.95     50 / (50 + 25) = 0.67
Recall(+)      100 / (100 + 45) = 0.69    50 / (50 + 25) = 0.67
F1(+)          0.80                       0.67

$$ F_1(+) = \frac{2 \cdot P(+) \cdot R(+)}{P(+) + R(+)} $$
37
Aggregating scores

• How to handle more than 2 classes?

• We have Precision, Recall, F1 for each class

• How to combine them for an overall score?

• Macro-average: Compute for each class, then average

• Micro-average: Collect predictions for all classes and jointly evaluate

(Credits: Dan Jurafsky)

39
Macro vs Micro average
• Micro-averaged score is dominated by score on common classes

(Credits: Dan Jurafsky)
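
A toy illustration of why the micro-average follows the common class. The per-class counts are made up: a common class that is predicted well and a rare class that is predicted badly.

```python
# Hypothetical per-class counts: (TP, FP) for a common and a rare class.
counts = {"common": (90, 10), "rare": (1, 9)}

# Macro: compute precision per class, then average.
per_class = {c: tp / (tp + fp) for c, (tp, fp) in counts.items()}
macro_precision = sum(per_class.values()) / len(per_class)

# Micro: pool the counts over all classes, then compute precision once.
total_tp = sum(tp for tp, _ in counts.values())
total_fp = sum(fp for _, fp in counts.values())
micro_precision = total_tp / (total_tp + total_fp)

print(round(macro_precision, 3))  # 0.5
print(round(micro_precision, 3))  # 0.827
```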

40
Precision Recall tradeoff

[Figure: precision-recall curve (precision vs recall, both axes from 0.2 to 1.0), with the point of maximum F1 marked]

Vary hyperparameters:
• Smoothing α
• Threshold T: predict + when P(+ | d) / P(− | d) > T

Tune on validation set
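
Picking the threshold could look like the sketch below: compute F1 on the validation set for a few candidate values of T and keep the best one. The scores, labels, and candidate thresholds are made up for illustration.

```python
def f1_at_threshold(ratios, labels, threshold):
    """ratios[i] stands for P(+|d_i) / P(-|d_i); predict + when it exceeds threshold."""
    tp = sum(r > threshold and y == "+" for r, y in zip(ratios, labels))
    fp = sum(r > threshold and y == "-" for r, y in zip(ratios, labels))
    fn = sum(r <= threshold and y == "+" for r, y in zip(ratios, labels))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Made-up validation scores and gold labels.
val_ratios = [4.0, 2.5, 1.2, 0.8, 0.3, 0.1]
val_labels = ["+", "+", "-", "+", "-", "-"]

best_t = max([0.2, 0.5, 1.0, 2.0],
             key=lambda t: f1_at_threshold(val_ratios, val_labels, t))
print(best_t, round(f1_at_threshold(val_ratios, val_labels, best_t), 2))  # 0.5 0.86
```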
41
Train, val, test split

• Train model on training set

• Tune hyperparameters on validation set

• Evaluate performance on unseen test set

Why do we do this? Want to have a model that generalizes.
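
One simple way to produce the three sets is to shuffle once and cut. The 80/10/10 proportions and the function name are assumptions of this sketch; the slides do not prescribe split sizes.

```python
import random


def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into three disjoint parts (fractions are assumptions)."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))  # 80 10 10
```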

42
Summary
• Evaluation Metrics

• Accuracy - coarse metric

• Precision, Recall, F1 for each class

• Aggregated scores

• Macro-average: Compute for each class, then average

• Micro-average: Collect predictions for all classes and jointly evaluate (dominated by common classes)

• Precision-Recall curve: pick threshold for maximum F1

• Use validation set to tune hyperparameters, test set should remain “unseen”

Common questions

A high precision score indicates that a large proportion of the predicted positive outcomes are true positives, meaning the model is effective in selecting relevant instances from those it identifies as positive. High precision suggests that the model has a low rate of false positives, providing a measure of its accuracy in predicting positive class labels.

The "Bag of Words" model represents a text as an unordered collection of words, disregarding grammar and word order, which can lead to the loss of semantic and syntactic information. Although it simplifies the feature space and helps manage computational demands, it doesn't capture contextual differences such as negation or phrasal meanings, potentially leading to misclassification in text classification tasks. Additionally, variations or synonyms of words with similar meanings are treated as distinct entities, which might reduce the model's overall precision and recall.

Tokenization involves breaking down text into smaller units or tokens, such as words or subwords, which are then used to construct the vocabulary in NLP models. The process is crucial as it determines how textual data is presented to the model, affecting its ability to recognize and process language patterns. Inconsistent or improper tokenization can lead to an inaccurate vocabulary, affecting the model's effectiveness by misrepresenting word usage and frequency. Effective tokenization ensures consistent handling of linguistic nuances and enhances model predictions.

Macro-average evaluation calculates metrics independently for each class and then averages them, giving equal weight to all classes regardless of their prevalence. This approach is useful for understanding performance across different classes, including minority ones. In contrast, micro-average evaluation aggregates all confusion matrix metrics across classes before calculating the overall metric, thereby being dominated by the performance on more common classes. This method is beneficial when class distribution reflects importance, but it can obscure performance on minority classes.

Laplace smoothing, also known as additive smoothing, addresses the issue of zero probabilities for unseen words in the training data by adding a constant (usually 1) to the frequency counts of each word in a class. This adjustment ensures that no probability is zero, which helps the Naive Bayes classifier handle unseen words during prediction and thus improves its performance. It is simple to implement and effective in practice.

Smoothing techniques, such as Laplace smoothing, serve to handle the problem of zero probabilities in probabilistic models like Naive Bayes. They adjust probability estimates for unseen words or features in the training data, preventing models from assigning zero probability to such instances during classification. This is crucial for Naive Bayes, as it relies heavily on probability estimates for making predictions. Smoothing ensures all event probabilities are non-zero, which stabilizes the model's predictive capability in the face of new, unseen data.

Generative models, like Naive Bayes, model the joint probability distribution P(X, Y) and can be used to generate new samples. They predict the output by estimating how the data is generated given a label and can be used for tasks like sampling documents. Discriminative models, such as logistic regression, model the conditional probability P(Y|X) directly, focusing on the boundary between classes rather than learning the data generation process. In text classification tasks, generative models are generally effective when the underlying assumptions about data independence hold. However, discriminative models often perform better with fewer features due to their focus on class distinction.

Naive Bayes classifiers tend to favor more common classes over rare ones due to their reliance on prior probabilities and frequency-based feature representation. This bias arises because they estimate probabilities based on class frequency, causing rare classes to have reduced influence in determining outcomes. One way to mitigate this limitation is to adjust class priors or apply techniques like class-specific smoothing to balance the representation and influence of less frequent classes during classification.

The F1-score is a harmonic mean of precision and recall, providing a balanced measure that accounts for both false positives and false negatives in model evaluation. It is particularly useful in scenarios with imbalanced data where one class is more frequent, as it avoids the bias of accuracy that doesn't account for class proportions. The F1-score facilitates comparison by reflecting both precision and recall, offering a single metric that considers both correctness and completeness of positive predictions.

The training objective function defines what the machine learning model is trying to optimize, usually by minimizing the error or maximizing the likelihood between predicted and actual values. The choice of this objective directly impacts the learning process, convergence rate, and final performance of the model. Optimization algorithms, such as stochastic gradient descent, determine how the parameters are updated in the model. A poorly chosen objective or optimization method can lead to inefficient training, overfitting, or suboptimal predictive performance. For text classification, these choices must align with the nature of the data and task requirements for effective model performance.
