L3: Classification and Naive Bayes
Classification
Spring 2026
2026-01-12
Adapted from slides from Danqi Chen, Karthik Narasimhan, and Anoop Sarkar
Review: Basic Machine Learning Terminology
• Classification vs Regression
Why classify?
• Movie reviews
  pos: this is the greatest screwball comedy ever filmed
  neg: It was pathetic. The worst part about it was the boxing scenes.
• Language detection
• News categorization
Classification as a subtask in NLP
Intent detection
ADDR_CHANGE: I just moved and want to change my address.
ADDR_CHANGE: Please help me update my address
FILE_CLAIM: I just got into a terrible accident and I want to file a claim
CLOSE_ACCOUNT: I’m moving and I want to disconnect my service
• Output:
Rule-based classification
• Look for patterns and combinations of features on the words in the document and on its metadata
IF there exists a word w in document d such that w in [good, great, extraordinary, …],
THEN output Positive
• But: rules may be hard to define (and some even unknown to us!)
• Expensive
• Data-driven approach
• Inputs:
• How to learn F?
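As a point of comparison before learning F from data, the hand-written rule from the rule-based slide can be sketched in Python. This is only an illustration: the word list is the slide's (which is truncated with "…"), while the function name `rule_based_sentiment`, the punctuation stripping, and the fallback label `Unknown` are assumptions, since the slide does not say what to output when no cue word is found.

```python
# Positive cue words from the slide; the slide's list continues with "…",
# so this set is deliberately incomplete.
POSITIVE_WORDS = {"good", "great", "extraordinary"}

def rule_based_sentiment(document: str) -> str:
    """Output Positive if any word of the document is a positive cue word."""
    words = document.lower().split()
    if any(w.strip(".,!?") in POSITIVE_WORDS for w in words):
        return "Positive"
    # The slide's rule does not cover this case; "Unknown" is a placeholder.
    return "Unknown"
```

Every new cue word or exception has to be added by hand, which is the cost that motivates the data-driven approach.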
Designing machine learning models
general recipe
General guidelines for model building
Types of supervised classifiers
• Neural networks
• Support vector machines
• k-nearest neighbors
Naive Bayes
Naive Bayes Classifier
General setting
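• Let the input x be represented as r features: $f_j$, $1 \le j \le r$
• Features are assumed conditionally independent given y:
  $P(x \mid y) = \prod_{j=1}^{r} P(f_j \mid y)$
• Bayes rule:
  $P(y \mid x) \propto P(y) \cdot \prod_{j=1}^{r} P(f_j \mid y)$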
Naive Bayes classifier
for text classification
• For text classification: input x is document d = (w1, …, wk)
[Figure: the words of "the cat sat on the mat" shown as an unordered bag of words]
Predicting with Naive Bayes
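• Once we assume that the position of each word is irrelevant and that the words are conditionally independent given class c, we have:
  $P(d \mid c) = \prod_{i=1}^{k} P(w_i \mid c)$
  $c_{MAP} = \arg\max_{c} P(c) \prod_{i=1}^{k} P(w_i \mid c)$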
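One practical note that the slides do not state: multiplying many probabilities smaller than 1 can underflow to zero in floating point, so implementations commonly compare sums of log-probabilities instead. Because the logarithm is monotonic, the arg max is unchanged. The sketch below uses a made-up per-word probability purely to show the effect.

```python
import math

# Hypothetical per-word probability, repeated for a 400-word document.
word_probs = [0.001] * 400

product = 1.0
for p in word_probs:
    product *= p  # the true value, 1e-1200, is too small for a double

log_score = sum(math.log(p) for p in word_probs)  # stays finite

print(product)    # 0.0
print(log_score)  # roughly -2763.1
```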
Maximum likelihood estimate
• Count and take average:
  $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$
[Figure: Zipf's Law, word frequency plotted against rank]
Solution: Smoothing!
• Maximum likelihood estimate
  $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j)}{\sum_{w \in V} \mathrm{Count}(w, c_j)}$
• Smoothing: Laplace smoothing
  $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
  • Simple, easy to use
  • Effective in practice
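To make the effect of smoothing concrete, here is a minimal sketch that evaluates the word-likelihood estimate with and without the additive constant α. The counts and the three-word vocabulary are invented for illustration, and `word_prob` is a hypothetical helper name.

```python
from collections import Counter

def word_prob(word, class_counts, vocab, alpha=0.0):
    """Estimate P(word | class); alpha=0 gives the maximum likelihood estimate."""
    total = sum(class_counts[w] for w in vocab)
    # Summing (count + alpha) over the vocabulary equals total + alpha * |V|.
    return (class_counts[word] + alpha) / (total + alpha * len(vocab))

# Hypothetical counts for one class over a three-word vocabulary.
vocab = ["good", "great", "boring"]
pos_counts = Counter({"good": 3, "great": 1})  # "boring" never seen in this class

mle = word_prob("boring", pos_counts, vocab)                # 0 / 4
smoothed = word_prob("boring", pos_counts, vocab, alpha=1)  # 1 / 7
```

With α = 0 a single unseen word drives the whole product for that class to zero; with α = 1 the estimate stays positive and the smoothed probabilities still sum to 1 over the vocabulary.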
Overall process
• Input: Set of annotated documents $\{(d_i, c_i)\}_{i=1}^{n}$
B. Calculate $\hat{P}(c_j) = \frac{\mathrm{Count}(c_j)}{n}$
C. Calculate $\hat{P}(w_i \mid c_j) = \frac{\mathrm{Count}(w_i, c_j) + \alpha}{\sum_{w \in V} [\mathrm{Count}(w, c_j) + \alpha]}$
• Prediction: $c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i=1}^{k} \hat{P}(w_i \mid c)$
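The training and prediction steps above could be sketched as follows. This is one possible implementation under stated assumptions, not the course's reference code: the function names are hypothetical, scores are computed in log space (an implementation choice not on the slide), and test words outside the training vocabulary are simply skipped.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, alpha=1.0):
    """docs: list of (tokens, label). Returns (log_prior, log_likelihood, vocab)."""
    vocab = {w for tokens, _ in docs for w in tokens}
    doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)

    # P(c_j) = Count(c_j) / n
    log_prior = {c: math.log(doc_counts[c] / len(docs)) for c in doc_counts}
    # P(w_i | c_j) with additive smoothing
    log_likelihood = {}
    for c in doc_counts:
        denom = sum(word_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {
            w: math.log((word_counts[c][w] + alpha) / denom) for w in vocab
        }
    return log_prior, log_likelihood, vocab

def predict(tokens, log_prior, log_likelihood, vocab):
    """Return the class maximizing log P(c) + sum_i log P(w_i | c)."""
    def score(c):
        # Words outside the training vocabulary are skipped (an assumption).
        return log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
    return max(log_prior, key=score)
```

A call such as `predict(["good"], *train_naive_bayes(docs))` returns the label with the highest smoothed posterior score for the given tokens.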
Variants
Name based on the distribution of the features: $P(f_i \mid y) \rightarrow P(w_i \mid c)$
• Multinomial Naive Bayes
  Normal counts (0, 1, 2, …) for each document
• Binary Multinomial NB
  Binarized counts (0/1) for each document
  Some work shows this works better than full counts or the Multivariate Bernoulli NB
• Multivariate Bernoulli NB
  Estimate P(w | c) as the fraction of documents of class c with word w
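The three variants differ mainly in what is counted. A small sketch, using invented documents that all belong to one class c, shows the features each variant would see; `bernoulli_estimate` is a hypothetical helper and is left unsmoothed to match the slide's description.

```python
from collections import Counter

# Hypothetical tokenized documents, all from the same class c.
docs_c = [
    ["great", "great", "plot"],
    ["great", "acting"],
    ["slow", "plot"],
]

# Multinomial NB features: raw counts (0, 1, 2, ...) per document.
multinomial = [Counter(doc) for doc in docs_c]

# Binary multinomial NB features: counts clipped to 0/1 per document.
binarized = [Counter(set(doc)) for doc in docs_c]

# Multivariate Bernoulli NB: fraction of class-c documents containing the word.
def bernoulli_estimate(word, docs):
    return sum(word in doc for doc in docs) / len(docs)
```

Here "great" counts twice in the first document under the multinomial view, once under the binarized view, and `bernoulli_estimate("great", docs_c)` is 2/3 because two of the three documents contain it.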
Naive Bayes Example
$\hat{P}(c) = \frac{N_c}{N}$
Smoothing with α = 1: $\hat{P}(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{\mathrm{count}(c) + |V|}$

          Doc  Words                                Class
Training  1    Chinese Beijing Chinese              c
          2    Chinese Chinese Shanghai             c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo | c) = (0+1) / (8+6) = 1/14
P(Japan | c) = (0+1) / (8+6) = 1/14
P(Chinese | j) = (1+1) / (3+6) = 2/9
P(Tokyo | j) = (1+1) / (3+6) = 2/9
P(Japan | j) = (1+1) / (3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001

(Credits: Dan Jurafsky)
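The arithmetic in this example can be checked with exact fractions. The sketch below re-enters the four training documents and the test document from the table; the helper names `prior`, `likelihood`, and `score` are hypothetical, and α = 1 as on the slide.

```python
from collections import Counter
from fractions import Fraction

train = [
    ("Chinese Beijing Chinese".split(), "c"),
    ("Chinese Chinese Shanghai".split(), "c"),
    ("Chinese Macao".split(), "c"),
    ("Tokyo Japan Chinese".split(), "j"),
]
test_doc = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for words, _ in train for w in words}      # |V| = 6
class_docs = Counter(label for _, label in train)     # documents per class
word_counts = {label: Counter() for label in class_docs}
for words, label in train:
    word_counts[label].update(words)

def prior(label):
    return Fraction(class_docs[label], len(train))

def likelihood(word, label, alpha=1):
    total = sum(word_counts[label].values())           # count(c)
    return Fraction(word_counts[label][word] + alpha, total + alpha * len(vocab))

def score(doc, label):
    """Unnormalized posterior: P(label) times the product of P(w | label)."""
    result = prior(label)
    for w in doc:
        result *= likelihood(w, label)
    return result

for label in ("c", "j"):
    print(label, float(score(test_doc, label)))
```

The score for class c comes out larger than the score for class j, so document 5 is assigned to class c, in agreement with the slide.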
Naive Bayes Example
• Let's compute the priors: what is $\hat{P}(c)$ and $\hat{P}(j)$?
  $\hat{P}(c) = \frac{3}{4}$, $\hat{P}(j) = \frac{1}{4}$
• Let's compute $\hat{P}(\text{Japan} \mid c)$:
  $|V| = 6$
  $\mathrm{count}(\text{Japan}, c) = 0$
  $\mathrm{count}(c) = \sum_{w \in V} \mathrm{count}(w, c) = 8$
  $\hat{P}(\text{Japan} \mid c) = \frac{\mathrm{count}(\text{Japan}, c) + 1}{\mathrm{count}(c) + |V|} = \frac{0 + 1}{8 + 6} = \frac{1}{14}$
Some details
• Vocabulary is important
  • Modern NLP systems use subword tokens (e.g., byte pair encoding)
• Handling stop words (common words like "a" and "the" that may not be useful)
  • In practice not that helpful, so use all words!
• Modified counts (tf-idf) that down-weight frequent, unimportant words
• Better models!
Features
• In general, Naive Bayes can use any set of features, not just words
[Figure: top features for spam detection]
Properties of Naive Bayes
• Does not handle rare classes well: it will favour the more common class
• Modern NLP: use large neural language models with learned representations
Generative vs Discriminative Models
Evaluation
                       Truth
                       Positive   Negative
Predicted  Positive    100        5
           Negative    45         100
• Ideally, we want:
                       Truth
                       Positive   Negative
Predicted  Positive    145        0
           Negative    0          105
Evaluation Metrics
Confusion matrix
                       Truth
                       Positive    Negative
Predicted  Positive    100 (TP)    5 (FP)
           Negative    45 (FN)     100 (TN)

$\text{Accuracy} = \frac{TP + TN}{\text{Total}} = \frac{200}{250} = 80\%$
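As a small illustration of how the four cells are obtained from predictions, the sketch below counts TP, FP, FN, and TN from two label lists and then computes accuracy. The label strings `"pos"` and `"neg"` and the function names are assumptions; the lists are built only to reproduce the matrix above.

```python
from collections import Counter

def confusion_counts(predicted, truth, positive="pos"):
    """Count TP, FP, FN, TN with respect to one positive label."""
    counts = Counter()
    for p, t in zip(predicted, truth):
        if p == positive:
            counts["TP" if t == positive else "FP"] += 1
        else:
            counts["FN" if t == positive else "TN"] += 1
    return counts

def accuracy(counts):
    return (counts["TP"] + counts["TN"]) / sum(counts.values())

# Rebuild the slide's matrix: 100 TP, 5 FP, 45 FN, 100 TN.
predicted = ["pos"] * 100 + ["pos"] * 5 + ["neg"] * 45 + ["neg"] * 100
truth = ["pos"] * 100 + ["neg"] * 5 + ["pos"] * 45 + ["neg"] * 100
counts = confusion_counts(predicted, truth)
print(accuracy(counts))  # 0.8
```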
Precision and Recall
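• Precision: % of selected items that are correct
  $\text{Precision}(+) = \frac{TP}{TP + FP}$
• Recall: % of correct items that are selected
  $\text{Recall}(+) = \frac{TP}{TP + FN}$
[Figure: actual positives vs. predicted positives (selected/retrieved), partitioned into TP, FP, FN, TN] (image credit: wikipedia)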
Evaluation Metrics
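Two confusion matrices (columns: truth, Positive / Negative):
                                            First matrix                     Second matrix
$\text{Precision}(+) = \frac{TP}{TP + FP}$    $\frac{100}{100 + 5} = 0.95$     $\frac{50}{50 + 25} = 0.67$
$\text{Recall}(+) = \frac{TP}{TP + FN}$       $\frac{100}{100 + 45} = 0.69$    $\frac{50}{50 + 25} = 0.67$
• Combined measure
  $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$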
Evaluation Metrics
$F_1(+) = \frac{2 \cdot P(+) R(+)}{P(+) + R(+)}$
• First matrix: P(+) = 100 / (100 + 5) = 0.95, R(+) = 100 / (100 + 45) = 0.69, F1(+) = 0.80
• Second matrix: P(+) = 50 / (50 + 25) = 0.67, R(+) = 50 / (50 + 25) = 0.67, F1(+) = 0.67
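The precision, recall, and F1 values can be recomputed directly from the counts. In this sketch the first matrix uses TP = 100, FP = 5, FN = 45, and the second uses TP = 50, FP = 25, FN = 25 as read off the fractions on the slide; note that 50 / (50 + 25) evaluates to about 0.67. The function names are hypothetical.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# First confusion matrix from the slides: TP = 100, FP = 5, FN = 45.
p1, r1 = precision(100, 5), recall(100, 45)
# Second matrix, read off the slide's fractions: TP = 50, FP = 25, FN = 25.
p2, r2 = precision(50, 25), recall(50, 25)

for p, r in [(p1, r1), (p2, r2)]:
    print(f"P={p:.2f} R={r:.2f} F1={f1(p, r):.2f}")
```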
Aggregating scores
Macro vs Micro average
• Micro-averaged score is dominated by score on common classes
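A short numeric sketch shows why the micro average is dominated by common classes. The per-class counts are invented: one frequent class that is classified well and one rare class that is classified poorly. Precision is used as the metric here, but the same pattern applies to recall and F1.

```python
# Hypothetical per-class counts: one common class, one rare class.
per_class = {
    "common": {"TP": 90, "FP": 10},
    "rare": {"TP": 1, "FP": 9},
}

def precision(tp, fp):
    return tp / (tp + fp)

# Macro average: compute precision per class, then average the scores.
macro = sum(precision(c["TP"], c["FP"]) for c in per_class.values()) / len(per_class)

# Micro average: pool the counts over all classes, then compute one precision.
total_tp = sum(c["TP"] for c in per_class.values())
total_fp = sum(c["FP"] for c in per_class.values())
micro = precision(total_tp, total_fp)

print(f"macro={macro:.2f} micro={micro:.2f}")  # macro=0.50 micro=0.83
```

The rare class pulls the macro average down to 0.50, while the micro average stays close to the common class's precision of 0.90.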
[Figure: precision-recall curve (precision vs. recall, both from 0.2 to 1.0) with the maximum-F1 point marked]
Vary hyperparameters
• Smoothing α
• Threshold T: $\frac{P(+ \mid d)}{P(- \mid d)} > T$
Tune on validation set
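Tuning the threshold on a validation set might look like the sketch below: each candidate T is scored by F1 on held-out data and the best one is kept. The validation ratios, labels, and candidate thresholds are all made up for illustration, and `f1_at_threshold` is a hypothetical helper.

```python
def f1_at_threshold(ratios, labels, threshold):
    """F1 for the positive class when predicting positive iff ratio > threshold."""
    tp = sum(r > threshold and y for r, y in zip(ratios, labels))
    fp = sum(r > threshold and not y for r, y in zip(ratios, labels))
    fn = sum(r <= threshold and y for r, y in zip(ratios, labels))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical validation set: ratio P(+|d) / P(-|d), and whether d is truly positive.
val_ratios = [8.0, 3.0, 1.5, 0.9, 0.4, 0.1]
val_labels = [True, True, False, True, False, False]

candidates = [0.25, 0.5, 1.0, 2.0, 4.0]
best_T = max(candidates, key=lambda T: f1_at_threshold(val_ratios, val_labels, T))
print(best_T)  # 0.5
```

The same loop can be run over values of the smoothing constant α; in both cases the test set is not touched during the search.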
Train, val, test split
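A minimal sketch of such a split, assuming a single random shuffle and an 80/10/10 partition. The slides do not specify the fractions, so `val_frac`, `test_frac`, and the seed are illustrative defaults only.

```python
import random

def train_val_test_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then cut into train / validation / test lists."""
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```

Fixing the seed keeps the three sets stable across runs, so hyperparameters tuned on the validation set are never accidentally evaluated on examples that later move into the test set.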
Summary
• Evaluation Metrics
• Aggregated scores
• Use validation set to tune hyperparameters, test set should remain “unseen”
A high precision score indicates that a large proportion of the predicted positives are true positives: the model has a low rate of false positives among the instances it labels as positive. Precision therefore measures how reliable the model's positive predictions are; it does not measure how many actual positives were missed, which is what recall captures.
The "Bag of Words" model represents a text as an unordered collection of words, disregarding grammar and word order, which loses semantic and syntactic information. Although it simplifies the feature space and keeps computation manageable, it does not capture contextual differences such as negation or phrasal meanings, which can lead to misclassification in text classification tasks. In addition, variants or synonyms of a word are treated as distinct entities, which may reduce the model's precision and recall.
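A two-line experiment illustrates the loss of word order. The sketch below assumes lowercasing and whitespace tokenization, which is a simplification; `bag_of_words` is a hypothetical helper name.

```python
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, split on whitespace, and count tokens; word order is discarded."""
    return Counter(text.lower().split())

a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
print(a == b)  # True
```

The two sentences mean different things but produce identical features, so any classifier built on these counts must assign them the same label.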
Tokenization breaks text into smaller units, or tokens, such as words or subwords, which are then used to construct the vocabulary of an NLP model. The process is crucial because it determines how textual data is presented to the model and therefore which language patterns the model can recognize. Inconsistent or improper tokenization yields an inaccurate vocabulary and misrepresents word usage and frequency, which hurts the model's effectiveness. Consistent tokenization handles linguistic variation uniformly and improves model predictions.
Macro-averaged evaluation calculates a metric independently for each class and then averages the results, giving equal weight to all classes regardless of their prevalence. This is useful for understanding performance across classes, including minority ones. In contrast, micro-averaged evaluation pools the confusion matrix counts across classes before calculating the metric, so it is dominated by performance on the more common classes. Micro-averaging is appropriate when class frequency reflects importance, but it can obscure performance on minority classes.
Laplace smoothing, also known as additive smoothing, addresses the problem of zero probabilities for words that were not seen with a class in the training data by adding a constant (usually 1) to the count of each word in that class. This adjustment ensures that no probability is zero, which lets the Naive Bayes classifier handle such words during prediction and thus improves its performance. It is simple to implement and effective in practice.
Smoothing techniques, such as Laplace smoothing, handle the problem of zero probabilities in probabilistic models like Naive Bayes. They adjust the probability estimates for words or features that were not seen in the training data, preventing the model from assigning zero probability to such instances during classification. This is crucial for Naive Bayes, because its prediction is a product of probability estimates and a single zero factor eliminates a class. Smoothing ensures that all event probabilities are non-zero, which stabilizes the model's predictions in the face of new, unseen data.
Generative models, like Naive Bayes, model the joint probability distribution P(X, Y) and can be used to generate new samples, for example to sample documents. They predict the output by estimating how the data is generated given a label. Discriminative models, such as logistic regression, model the conditional probability P(Y|X) directly, focusing on the boundary between classes rather than on the data generation process. In text classification, generative models are generally effective when their independence assumptions approximately hold. Discriminative models often perform better when those assumptions are violated, for example when features are strongly correlated, because they do not need to model the features themselves.
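To illustrate what "sampling documents" from a Naive Bayes model means, the sketch below draws a class from P(Y) and then draws each word independently from P(w | Y). All probabilities are invented for illustration, and `sample_document` is a hypothetical helper.

```python
import random

# Hypothetical model parameters, chosen only for illustration.
prior = {"pos": 0.5, "neg": 0.5}
word_given_class = {
    "pos": {"great": 0.6, "fun": 0.3, "boring": 0.1},
    "neg": {"great": 0.1, "fun": 0.2, "boring": 0.7},
}

def sample_document(length, rng):
    """Sample a class from P(Y), then each word independently from P(w | Y)."""
    label = rng.choices(list(prior), weights=list(prior.values()))[0]
    words = word_given_class[label]
    tokens = rng.choices(list(words), weights=list(words.values()), k=length)
    return label, tokens

label, tokens = sample_document(5, random.Random(0))
print(label, tokens)
```

The sampled "documents" are unordered streams of words, which makes the conditional independence assumption visible. A discriminative model such as logistic regression defines no such sampling procedure, since it only models P(Y | X).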
Naive Bayes classifiers tend to favor more common classes over rare ones because of their reliance on prior probabilities and count-based estimates. The prior of a rare class is small, and its word probabilities are estimated from few examples, so rare classes have less influence on the outcome. One way to mitigate this limitation is to adjust the class priors or to apply techniques such as class-specific smoothing, which balance the influence of less frequent classes during classification.
The F1-score is the harmonic mean of precision and recall, providing a balanced measure that accounts for both false positives and false negatives. It is particularly useful with imbalanced data, where one class is much more frequent, because accuracy can look high simply by predicting the majority class. As a single number that reflects both the correctness and the completeness of positive predictions, the F1-score makes it easier to compare models.
The training objective function defines what the machine learning model is trying to optimize, usually by minimizing an error between predicted and actual values or by maximizing the likelihood of the training data. The choice of objective directly affects the learning process, the convergence rate, and the final performance of the model. Optimization algorithms, such as stochastic gradient descent, determine how the model's parameters are updated. A poorly chosen objective or optimization method can lead to inefficient training, overfitting, or suboptimal predictive performance. For text classification, these choices must match the nature of the data and the requirements of the task.