
L3 Classification Naivebayes

The document outlines the basics of Natural Language Processing (NLP) with a focus on classification techniques, including supervised learning and various classifiers like Naive Bayes and logistic regression. It discusses the importance of text classification in tasks such as spam detection, sentiment analysis, and intent detection. Additionally, it covers model building, feature selection, and the application of statistical methods in training classifiers.

Uploaded by

creakedeggs
SFU NatLangLab

CMPT 413/713: Natural Language Processing

Classification
Spring 2026
2026-01-12

Adapted from slides from Danqi Chen, Karthik Narasimhan, and Anoop Sarkar
1
Review: Basic Machine Learning Terminology

• Supervised (labeled training data) vs Unsupervised learning

• Classification vs Regression

• Discriminative vs Generative models

• We will do Supervised Text Classification

2
Why classify?

• Spam detection
• Sentiment analysis
• Authorship attribution
• Language detection
• News categorization

Sentiment analysis example: Movie Reviews

neg: unbelievably disappointing
pos: Full of zany characters and richly applied satire, and some great plot twists
pos: this is the greatest screwball comedy ever filmed
neg: It was pathetic. The worst part about it was the boxing scenes.
3
Classification as a subtask in NLP

• NLP is all (mostly) about classification

• Text classification: Spam/Not Spam, Sentiment Analysis

• Language modelling / generating sentences: select word to generate at each step (classification over vocabulary!)

• Building dialog systems (identifying intent)

• Parsing (identifying word to attach to)

4
Classification as a subtask in NLP
Intent detection
ADDR_CHANGE: I just moved and want to change my address.
ADDR_CHANGE: Please help me update my address
FILE_CLAIM: I just got into a terrible accident and I want to file a claim
CLOSE_ACCOUNT: I’m moving and I want to disconnect my service

Prepositional phrase attachment


noun attach: I bought the shirt with pockets
verb attach: I bought the shirt with my credit card
noun attach: I washed the shirt with mud
verb attach: I washed the shirt with soap
5
Text classification: the task
• Inputs:

• A document d (a sequence of words, e.g. a sentence)

• A set of classes C = {c1, c2, c3, …, cm} (multiple classes: m; binary: m = 2)

• Output:

• Predicted class c ∈ C for document d

Movie was terrible → Classify → Negative

Amazing acting → Classify → Positive

6
Rule-based classification
• Look for patterns, and combinations of features on words in document, meta-data

IF there exists word w in document d such that w in [good, great, extra-ordinary, …],
THEN output Positive

IF email address ends in [[Link], [Link], [Link], …],
THEN output SPAM

• Simple, can be very accurate

• But: rules may be hard to define (and some even unknown to us!)

• Expensive

• Not easily generalizable
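
The two IF/THEN rules above can be sketched in a few lines of Python. This is only an illustration: the word list, the domain names (`spam.example`, `ads.example`), the function names, and the "Unknown" fallback are assumptions made for the example, not part of the slides.

```python
import re

# Hypothetical word list and spam domains, chosen only for illustration.
POSITIVE_WORDS = {"good", "great", "extra-ordinary"}
SPAM_DOMAINS = ("spam.example", "ads.example")


def classify_sentiment(document: str) -> str:
    """Output Positive if any word of the document is in the positive list."""
    words = re.findall(r"[a-z\-]+", document.lower())
    if any(w in POSITIVE_WORDS for w in words):
        return "Positive"
    return "Unknown"  # the rule says nothing about other documents


def classify_email(sender: str) -> str:
    """Output SPAM if the sender address ends in a listed domain."""
    if sender.lower().endswith(SPAM_DOMAINS):
        return "SPAM"
    return "Unknown"


print(classify_sentiment("A great movie"))
print(classify_sentiment("This was not good"))  # the rule ignores the negation
print(classify_email("offers@ads.example"))
```

The second call shows why such rules are hard to define well: a single keyword match cannot see the negation in "not good".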


7
Supervised Learning: Let’s use statistics!

• Data-driven approach

Let the machine figure out the best patterns to use!

• Inputs:

• Set of m classes C = {c1, c2, …, cm}

• Set of n ‘labeled’ documents: {(d1, c1), (d2, c2), …, (dn, cn)}

• Output: Trained classifier, F : d → c

• What form should F take?

• How to learn F?

8
Designing machine learning models
general recipe

• Input features: f(x) → [f1, f2, …, fm]
• Need to determine features

Building the model:
• Output: estimate P(y | x) for each class y
• Need to model P(y | x) with a family of functions
• Train phase: Learn parameters of model to minimize loss function
• Need training objective and optimization algorithm
• Test phase: Apply parameters to predict class given a new input

9
General guidelines for model building

Two steps to building a probability model:

1. Define the model

• What form should F take?

• What independence assumptions do we make?

• What are the model parameters (probability values)?

2. Estimate the model parameters (training/learning)

• How to learn F? What to optimize? What is the training objective?

10
Types of supervised classifiers

• Naive Bayes
• Logistic regression
• Neural networks
• Support vector machines
• k-nearest neighbors
11
Naive Bayes

12
Naive Bayes Classifier
General setting
• Let the input x be represented as r features: fj, 1 ≤ j ≤ r

• Let y be the output classification

• We can have a simple classification model using Bayes rule

$$\underbrace{P(y \mid x)}_{\text{Posterior}} = \frac{\overbrace{P(y)}^{\text{Prior}} \cdot \overbrace{P(x \mid y)}^{\text{Likelihood}}}{\underbrace{P(x)}_{\text{Evidence}}}$$

• Make strong (naive) conditional independence assumptions

$$P(x \mid y) = \prod_{j=1}^{r} P(f_j \mid y)$$

so that, by Bayes rule,

$$P(y \mid x) \propto P(y) \cdot \prod_{j=1}^{r} P(f_j \mid y)$$

13
Naive Bayes classifier
for text classification
• For text classification: input x is document d = (w1, …, wk)

• Use as our features the words wj, 1 ≤ j ≤ | V | where V is our vocabulary

• c is the output classification


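• Predicting the best class: the maximum a posteriori (MAP) estimate

$$c_{MAP} = \arg\max_{c \in C} P(c \mid d) = \arg\max_{c \in C} \frac{P(c)\,P(d \mid c)}{P(d)} = \arg\max_{c \in C} P(c)\,P(d \mid c)$$

The denominator P(d) is the same for every class, so it does not affect the arg max.

P(d | c) → Conditional probability of generating document d from class c
P(c) → Prior probability of class c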
14
Represent P(d | c) as Bag of Words model

• Assume position of each word is irrelevant (both absolute and relative): order doesn’t matter

• P(w1, w2, w3, …, wk | c) = P(w1 | c)P(w2 | c)…P(wk | c)

• Probability of each word is conditionally independent given class c

[Figure: the words "the", "cat", "sat", "on", "the", "mat" scattered in no particular order]


15
Predicting with Naive Bayes

• Once we assume that the position of each word is irrelevant and that
the words are conditionally independent given class c, we have:

P(d | c) = P(w1, w2, w3, …, wk | c) = P(w1 | c)P(w2 | c)…P(wk | c)

• The maximum a posteriori (MAP) estimate is now:

$$c_{MAP} = \arg\max_{c \in C} \hat{P}(c)\,\hat{P}(d \mid c) = \arg\max_{c \in C} \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$$

P̂ is used to indicate the estimated probability.

Note that k is the number of tokens (words) in the document.

The index i is the position of the token.
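
A product of k small probabilities can underflow for long documents, so implementations commonly sum log-probabilities instead; the arg max is unchanged because the logarithm is monotonic. The sketch below illustrates this under stated assumptions: the log-space trick is not on the slide, and the function name `predict_map` and the toy probability tables are hypothetical.

```python
import math


def predict_map(tokens, log_prior, log_likelihood):
    """Return arg max over c of log P(c) + sum_i log P(w_i | c)."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        # One term per token position i, so repeated words count repeatedly.
        score = log_prior[c] + sum(log_likelihood[c][w] for w in tokens)
        if score > best_score:
            best_class, best_score = c, score
    return best_class


# Toy estimates, invented for illustration only.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_likelihood = {
    "pos": {"great": math.log(0.6), "terrible": math.log(0.1), "movie": math.log(0.3)},
    "neg": {"great": math.log(0.1), "terrible": math.log(0.6), "movie": math.log(0.3)},
}

print(predict_map(["terrible", "movie"], log_prior, log_likelihood))
```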

16
Maximum likelihood estimate
• Count and take average:

$$\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$$

Can suffer from sparsity issues!

[Figure: Zipf’s Law, word frequency plotted against rank]

17
Solution: Smoothing!
• Maximum likelihood estimate

$$\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$$

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$$

• Smoothing (Laplace smoothing)

$$\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} \left[\mathrm{Count}(w, c_j) + \alpha\right]}$$

• Simple, easy to use
• Effective in practice
18
Overall process
• Input: Set of annotated documents $\{(d_i, c_i)\}_{i=1}^{n}$

A. Compute vocabulary V of all words

B. Calculate $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$

C. Calculate $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} \left[\mathrm{Count}(w, c_j) + \alpha\right]}$

D. (Prediction) Given document d = (w1, w2, …, wk)

$$c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$$
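
Steps A to D can be put together as a small training and prediction routine. This is a minimal sketch, not a reference implementation: the names `train_nb` and `predict_nb`, the use of log-probabilities, the choice to skip test words outside the vocabulary, and the toy documents are all assumptions. It requires α > 0, since log(0) is undefined.

```python
import math
from collections import Counter, defaultdict


def train_nb(documents, alpha=1.0):
    """documents: list of (tokens, label). Returns (vocab, log_prior, log_lik)."""
    vocab = {w for tokens, _ in documents for w in tokens}            # step A
    n = len(documents)
    doc_counts = Counter(label for _, label in documents)
    log_prior = {c: math.log(k / n) for c, k in doc_counts.items()}   # step B
    word_counts = defaultdict(Counter)
    for tokens, label in documents:
        word_counts[label].update(tokens)
    log_lik = {}
    for c in doc_counts:                                              # step C
        denominator = sum(word_counts[c].values()) + alpha * len(vocab)
        log_lik[c] = {
            w: math.log((word_counts[c][w] + alpha) / denominator) for w in vocab
        }
    return vocab, log_prior, log_lik


def predict_nb(tokens, vocab, log_prior, log_lik):
    """Step D in log space; tokens outside the vocabulary are skipped."""
    known = [w for w in tokens if w in vocab]
    scores = {c: log_prior[c] + sum(log_lik[c][w] for w in known) for c in log_prior}
    return max(scores, key=scores.get)


# Toy training set, invented for illustration.
docs = [("good great fun".split(), "pos"), ("bad awful boring".split(), "neg")]
vocab, log_prior, log_lik = train_nb(docs)
print(predict_nb("great fun film".split(), vocab, log_prior, log_lik))
```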
19
Variants
Named after the distribution assumed for the features: P(fi | y) → P(wi | c)

• Multinomial Naive Bayes
Normal counts (0, 1, 2, …) for each document

• Binary Multinomial NB
Binarized counts (0/1) for each document
Some work shows this works better than full counts or the Multivariate Bernoulli NB

• Multivariate Bernoulli NB
Estimate P(w|c) as fraction of documents of class c with word w

• Explicitly model P(!w|c) = 1 - P(w|c)
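
The three variants differ mainly in what is counted. The sketch below contrasts full counts, binarized counts, and the document-fraction estimate used by the Bernoulli variant. The function names and toy documents are hypothetical, and smoothing is left out to keep the contrast visible.

```python
from collections import Counter


def count_features(tokens):
    """Multinomial NB: every occurrence of a word counts."""
    return Counter(tokens)


def binarized_features(tokens):
    """Binary multinomial NB: each word counts at most once per document."""
    return Counter(set(tokens))


def bernoulli_likelihood(docs_in_class, word):
    """Bernoulli NB: fraction of documents of the class that contain the word."""
    return sum(word in set(doc) for doc in docs_in_class) / len(docs_in_class)


doc = ["great", "great", "plot"]
print(count_features(doc)["great"], binarized_features(doc)["great"])

docs_in_class = [["a", "b"], ["b"], ["a", "a"], ["c"]]
p_a = bernoulli_likelihood(docs_in_class, "a")
print(p_a, 1 - p_a)  # P(w | c) and P(!w | c)
```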

20
Naive Bayes Example

          Doc  Words                                Class
Training  1    Chinese Beijing Chinese              c
          2    Chinese Chinese Shanghai             c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

Estimates, with smoothing α = 1:

$$\hat{P}(c) = \frac{N_c}{N} \qquad \hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|}$$

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional Probabilities:
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1) / (8+6) = 1/14
P(Japan|c) = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j) = (1+1) / (3+6) = 2/9
P(Japan|j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c|d5) ∝ 3/4 * (3/7)^3 * 1/14 * 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 * (2/9)^3 * 2/9 * 2/9 ≈ 0.0001

(Credits: Dan Jurafsky)
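
The arithmetic of this example can be checked with exact fractions. The sketch below recomputes the priors, the add-one smoothed conditional probabilities, and the two unnormalized class scores from the four training documents. The helper names (`prior`, `likelihood`, `score`) are chosen for the example.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese", "c"),
    ("Chinese Chinese Shanghai", "c"),
    ("Chinese Macao", "c"),
    ("Tokyo Japan Chinese", "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for text, _ in train for w in text.split()}
counts = {"c": Counter(), "j": Counter()}
n_docs = Counter()
for text, label in train:
    counts[label].update(text.split())
    n_docs[label] += 1


def prior(label):
    return Fraction(n_docs[label], len(train))


def likelihood(word, label):
    # Add-one smoothing: (count(w, c) + 1) / (count(c) + |V|)
    return Fraction(counts[label][word] + 1, sum(counts[label].values()) + len(vocab))


def score(label):
    result = prior(label)
    for w in test_doc:
        result *= likelihood(w, label)
    return result


scores = {label: score(label) for label in ("c", "j")}
for label, value in scores.items():
    print(label, value, float(value))
```

Class c receives the larger score, so document 5 is assigned to c.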
22
Naive Bayes Example

• Let’s compute the priors: what is P̂(c) and P̂(j)?

P̂(c) = 3/4, P̂(j) = 1/4

• Let’s compute P̂(Japan | c):

|V| = 6
count(Japan, c) = 0
count(c) = Σ_{w∈V} count(w, c) = 8

P̂(Japan | c) = (count(Japan, c) + 1) / (count(c) + |V|) = (0 + 1) / (8 + 6) = 1/14

(Credits: Dan Jurafsky)
23
Some details
• Vocabulary is important

• Tokenization matters: it can affect your vocabulary

• Tokenization = how you break your sentence up into tokens / words

• Make sure you are consistent with your tokenization!

• Special multi-word tokens: NOT_happy

• Modern NLP systems use subword tokens (e.g. byte pair encoding)
25
Some details
• Handling unknown words (words in test that are not in your training vocabulary)?

• Remove them from your test document! Just ignore them.

• Handling stop words (common words like "a" and "the" that may not be useful)?

• Remove them from the training data!

• In practice not that helpful, so use all words!

• Better to use:

• Modified counts (tf-idf) that down-weight frequent, unimportant words

• Better models!
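
As an illustration of modified counts, the sketch below computes one common tf-idf variant: the raw term frequency multiplied by log(N / df), where N is the number of documents and df the number of documents containing the word. The slides do not say which variant is meant, so this formula, the function name `tf_idf`, and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: weight} dict per document, with weight = tf * log(N / df)."""
    n = len(docs)
    # Document frequency: in how many documents does each word appear?
    df = Counter(w for doc in docs for w in set(doc))
    return [
        {w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
        for doc in docs
    ]


docs = [["the", "cat"], ["the", "dog"]]
weights = tf_idf(docs)
print(weights[0])
```

A word such as "the" that appears in every document gets weight zero, while "cat", which appears in only one, keeps a positive weight.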
26
Features
• In general, Naive Bayes can use any set of features, not just words

• URLs, email addresses, Capitalization, …

• Domain knowledge can be crucial to performance

[Figure: Top features for spam detection]

27
Properties of Naive Bayes

+ Simple baseline method


+ Works well for small data sizes
+ Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem

- But not if the independence assumption is broken

- Does not handle rare classes well - will favour more common class

- Also need to design features

• Modern NLP: use large neural language models with learned representations

28
Generative vs Discriminative Models

• Naive Bayes is a Generative Model: it models the joint distribution p(x, y) = p(y) p(x | y), and classifies with

p(y | x) ∝ p(y) p(x | y)

• It models how the words of a document are generated given the class

• You can use this model to sample documents

• Next: Logistic Regression, a Discriminative model that models p(y | x) directly.

29
Evaluation

30
Evaluation

• Consider binary classification

Confusion matrix
• Table of predictions

                      Truth
                      Positive   Negative
Predicted  Positive   100        5
           Negative   45         100

• Ideally, we want:

                      Truth
                      Positive   Negative
Predicted  Positive   145        0
           Negative   0          105

31
Evaluation Metrics
Confusion matrix

                      Truth
                      Positive   Negative
Predicted  Positive   100 (TP)   5 (FP)
           Negative   45 (FN)    100 (TN)

• True positive (TP): Predicted + and actual +

• True negative (TN): Predicted - and actual -

• False positive (FP): Predicted + and actual -

• False negative (FN): Predicted - and actual +

[Figure: actual positives (TP + FN) and predicted positives (TP + FP) as overlapping regions; image credit: wikipedia]

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{200}{250} = 80\%$$
32
Evaluation Metrics

Model 1               Truth
                      Positive   Negative
Predicted  Positive   100        5
           Negative   45         100

Model 2               Truth
                      Positive   Negative
Predicted  Positive   50         25
           Negative   25         150

• True positive (TP): Predicted + and actual +

• True negative (TN): Predicted - and actual -

• False positive (FP): Predicted + and actual -

• False negative (FN): Predicted - and actual +

Accuracy is a coarse metric: it cannot distinguish between the two models!

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{200}{250} = 80\%$$
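
A short check that both confusion matrices give the same accuracy. The dictionary layout and the names `model_1` and `model_2` are chosen for the example; the counts are the ones in the two tables.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct: (TP + TN) / Total."""
    return (tp + tn) / (tp + fp + fn + tn)


# Counts read off the two confusion matrices (rows: predicted, columns: truth).
model_1 = dict(tp=100, fp=5, fn=45, tn=100)
model_2 = dict(tp=50, fp=25, fn=25, tn=150)

print(accuracy(**model_1), accuracy(**model_2))
```

Both calls return 0.8 even though the two models make quite different kinds of errors.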
33
Precision and Recall

• Precision: % of selected items that are correct

$$\text{Precision}(+) = \frac{TP}{TP + FP}$$

• Recall: % of correct items that are selected

$$\text{Recall}(+) = \frac{TP}{TP + FN}$$

[Figure: actual positives (relevant) = TP + FN; predicted positives (selected/retrieved) = TP + FP; image credit: wikipedia]
34
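Evaluation Metrics

Model 1               Truth
                      Positive   Negative
Predicted  Positive   100        5
           Negative   45         100

Model 2               Truth
                      Positive   Negative
Predicted  Positive   50         25
           Negative   25         150

$$\text{Precision}(+) = \frac{TP}{TP + FP}: \qquad \frac{100}{100 + 5} \approx 0.95 \qquad \frac{50}{50 + 25} \approx 0.67$$

$$\text{Recall}(+) = \frac{TP}{TP + FN}: \qquad \frac{100}{100 + 45} \approx 0.69 \qquad \frac{50}{50 + 25} \approx 0.67$$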
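
Precision and recall for the positive class can be computed directly from the counts in the two confusion matrices. The function names and the tuple layout below are chosen for the example.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that is actually positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Of everything actually positive, the fraction that was predicted positive."""
    return tp / (tp + fn)


# (TP, FP, FN) read off the two confusion matrices.
models = {"model 1": (100, 5, 45), "model 2": (50, 25, 25)}
for name, (tp, fp, fn) in models.items():
    print(name, round(precision(tp, fp), 2), round(recall(tp, fn), 2))
```

Unlike accuracy, these two numbers separate the models: the first is precise but misses many positives, while the second is balanced but weaker on both.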

Two metrics - which one to use?


35
F-Score

• Combined measure

• Harmonic mean of Precision and Recall

$$F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

• Or more generally, use β to control importance of Precision vs Recall

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
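
The general formula is easy to experiment with. In the sketch below, the function name `f_beta`, the convention of returning 0 when both inputs are 0, and the example values are assumptions. Setting β = 1 gives F1, β > 1 weights recall more heavily, and β < 1 weights precision more heavily.

```python
def f_beta(precision, recall, beta=1.0):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# A classifier with low precision but perfect recall (made-up numbers).
p, r = 0.5, 1.0
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(p, r, beta), 3))
```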

36
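Evaluation Metrics

Model 1               Truth
                      Positive   Negative
Predicted  Positive   100        5
           Negative   45         100

Model 2               Truth
                      Positive   Negative
Predicted  Positive   50         25
           Negative   25         150

$$\text{Precision}(+) = \frac{TP}{TP + FP}: \qquad \frac{100}{100 + 5} \approx 0.95 \qquad \frac{50}{50 + 25} \approx 0.67$$

$$\text{Recall}(+) = \frac{TP}{TP + FN}: \qquad \frac{100}{100 + 45} \approx 0.69 \qquad \frac{50}{50 + 25} \approx 0.67$$

$$F_1(+) = \frac{2 \cdot P(+) \cdot R(+)}{P(+) + R(+)}: \qquad 0.80 \qquad \approx 0.67$$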
37
Aggregating scores

• How to handle more than 2 classes?

• We have Precision, Recall, F1 for each class

(Credits: Dan Jurafsky)


38
Aggregating scores

• How to handle more than 2 classes?

• We have Precision, Recall, F1 for each class

• How to combine them for an overall score?

• Macro-average: Compute for each class, then average

• Micro-average: Collect predictions for all classes and jointly evaluate

39
Macro vs Micro average
• Micro-averaged score is dominated by score on common classes

(Credits: Dan Jurafsky)
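
The difference between the two averages shows up clearly when one class is much more common than another. The sketch below averages precision both ways. The per-class counts and the function name `macro_micro_precision` are invented for illustration.

```python
def macro_micro_precision(per_class):
    """per_class maps label -> (TP, FP). Returns (macro, micro) precision."""
    # Macro: compute precision per class, then average the per-class values.
    macro = sum(tp / (tp + fp) for tp, fp in per_class.values()) / len(per_class)
    # Micro: pool the counts of all classes, then compute a single precision.
    total_tp = sum(tp for tp, _ in per_class.values())
    total_fp = sum(fp for _, fp in per_class.values())
    micro = total_tp / (total_tp + total_fp)
    return macro, micro


# Hypothetical counts: one common class done well, one rare class done badly.
per_class = {"common": (90, 10), "rare": (1, 9)}
macro, micro = macro_micro_precision(per_class)
print(round(macro, 3), round(micro, 3))
```

The macro average (0.5) treats both classes equally, while the micro average stays close to the precision of the common class.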


40
Precision Recall tradeoff

[Figure: precision-recall curve, Precision (y-axis) against Recall (x-axis), with the point of maximum F1 marked]

Vary hyperparameters
• Smoothing α
• Threshold T: predict + when P(+ | d) / P(− | d) > T

Tune on validation set
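
Choosing the threshold can be sketched as a small sweep over candidate values, keeping the one with the highest F1 on the validation set. The validation ratios, labels, candidate thresholds, and function names below are made up for the example.

```python
def f1_at_threshold(ratios, labels, threshold):
    """ratios[i] = P(+ | d_i) / P(- | d_i); labels[i] is True for actual positives."""
    tp = sum(r > threshold and y for r, y in zip(ratios, labels))
    fp = sum(r > threshold and not y for r, y in zip(ratios, labels))
    fn = sum(r <= threshold and y for r, y in zip(ratios, labels))
    # F1 written in terms of counts: 2TP / (2TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(ratios, labels, candidates):
    """Pick the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(ratios, labels, t))


# Hypothetical validation set: probability ratios and true labels.
val_ratios = [9.0, 4.0, 1.5, 0.8, 0.4, 0.1]
val_labels = [True, True, False, True, False, False]
candidates = [0.5, 1.0, 2.0, 5.0]

for t in candidates:
    print(t, round(f1_at_threshold(val_ratios, val_labels, t), 3))
print("best:", best_threshold(val_ratios, val_labels, candidates))
```

Raising T makes the classifier more conservative about predicting +, which tends to trade recall for precision.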
41
Train, val, test split

• Train model on training set

• Tune hyperparameters on validation set

• Evaluate performance on unseen test set

Why do we do this? Want to have a model that generalizes.
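
A minimal sketch of such a split, assuming a shuffled 80/10/10 division. The proportions, the fixed seed, and the function name `train_val_test_split` are assumptions; the slides do not prescribe them.

```python
import random


def train_val_test_split(data, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut the data into disjoint train, validation and test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed keeps the split reproducible
    n_test = round(len(items) * test_frac)
    n_val = round(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Hyperparameters such as α and T are then chosen using `val` only, and `test` is evaluated once at the end.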

42
Summary
• Evaluation Metrics

• Accuracy - coarse metric

• Precision, Recall, F1 for each class

• Aggregated scores

• Macro-average: Compute for each class, then average

• Micro-average: Collect predictions for all classes and jointly evaluate (dominated by common classes)

• Precision-Recall curve: pick threshold for maximum F1

• Use validation set to tune hyperparameters, test set should remain “unseen”

43
