UNIT-2
Text Classification :
Automated Text Classification
1. Definition and Need
Text Classification: Assigning documents to predefined categories.
Manual classification is inefficient at scale; automation is essential.
2. Automation with Machine Learning (ML)
ML automates classification using algorithms and feature extraction.
Two main types of ML for this:
1. Supervised Learning
2. Unsupervised Learning
3. Unsupervised Learning
No labeled data needed.
Focuses on discovering patterns, clustering, and topic modeling.
Useful for document grouping and summarization.
4. Supervised Learning
Uses labeled training data.
Learns patterns to predict the class of new, unseen data.
Involves feature extraction, training, and prediction.
5. Types of Supervised Learning
Classification: Predicts categories (e.g., news types).
Regression: Predicts continuous values (e.g., prices).
6. Mathematical Representation
Training set: TS = {(d₁, c₁), (d₂, c₂), ..., (dₙ, cₙ)}
Classifier γ is trained using algorithm F:
F(TS) → γ
Prediction: γ(ND) = c_ND, where ND is a new document and c_ND is its predicted class.
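The notation above can be read as two plain functions. A minimal sketch, assuming a deliberately simple word-count learner stands in for F (the name `train`, the toy training set, and the scoring rule are illustrative assumptions, not part of the notes):

```python
from collections import Counter, defaultdict

def train(training_set):
    """F: takes TS = [(document, class), ...] and returns a classifier gamma."""
    word_counts = defaultdict(Counter)
    for document, label in training_set:
        word_counts[label].update(document.lower().split())

    def gamma(new_document):
        """gamma: maps a new document ND to its predicted class c_ND."""
        words = new_document.lower().split()
        # Score each class by how often it has seen the document's words.
        scores = {label: sum(counts[w] for w in words)
                  for label, counts in word_counts.items()}
        return max(scores, key=scores.get)

    return gamma

# Toy labeled training set TS (made up for illustration).
TS = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("parliament passed the budget", "politics"),
    ("the minister gave a speech", "politics"),
]
gamma = train(TS)                                   # F(TS) -> gamma
prediction = gamma("who scored the winning goal")   # gamma(ND) = c_ND
print(prediction)
```

Any real learning algorithm (naïve Bayes, SVM, and so on) fits the same shape: training consumes labeled pairs and returns something that maps a new document to a class.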
7. Training and Prediction Phases
Training: Learn from labeled data.
Prediction: Apply model to classify new data.
Requires initial manual labeling to start.
8. Model Evaluation and Tuning
Prevent overfitting using:
Validation set
Cross-validation
Measure performance using metrics like accuracy.
9. Types of Text Classification Tasks
Binary Classification: Two classes.
Multi-class Classification: More than two classes, one label per instance.
Multi-label Classification: Multiple labels per instance
---
Classification Algorithms definition:
Classification algorithms are supervised machine learning (ML) methods used to categorize
or label data points based on past observations. These algorithms learn from a training
dataset, which includes input data (usually feature vectors) and corresponding output labels
(class labels).
Key Processes in Classification Algorithms
1. Training
The algorithm analyzes training data to learn patterns that correlate features with
outcomes (class labels).
Involves feature extraction/engineering to convert raw data into meaningful inputs.
Results in a model, which should generalize well to unseen data.
2. Evaluation
Tests how well the model performs on validation and test (holdout) datasets.
Cross-validation is often used: data is split into parts, and models are trained and
tested on different combinations.
Models are not tuned using the test dataset to avoid overfitting.
Common metrics: accuracy, precision, recall, F1-score, etc.
3. Tuning (Hyperparameter Optimization)
Focuses on improving model performance by optimizing hyperparameters (e.g.,
regularization strength, kernel type).
Methods include grid search and randomized search.
Although important, this process is outside the current scope and not covered in
implementation.
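The cross-validation idea from the Evaluation step can be sketched in a few lines. This is a minimal illustration, assuming unshuffled data and a hypothetical helper name `k_fold_indices`; in practice the data is usually shuffled first, and the test (holdout) set is kept out of these folds entirely.

```python
def k_fold_indices(n_samples, k):
    """Yield (training_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        training = indices[:start] + indices[start + size:]
        yield training, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for training, validation in folds:
    print("train:", training, "validate:", validation)
```

Each sample is used for validation exactly once, and the k validation scores are typically averaged to compare models or hyperparameter settings.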
Algorithms Discussed for Text Classification
1. Multinomial Naïve Bayes
Probabilistic classifier based on Bayes’ theorem.
Well-suited for discrete features like word counts in text classification.
2. Support Vector Machines (SVM)
Finds the optimal hyperplane to separate different classes.
Effective in high-dimensional spaces, which is common in text data.
Other Classification Algorithms (Mentioned Briefly)
Logistic Regression
Decision Trees
Neural Networks
Ensemble Methods (e.g., Random Forests, Gradient Boosting) – Powerful but prone
to overfitting in text classification.
Deep Learning – Advanced methods involving multiple neural network layers, gaining
popularity in complex classification tasks.
Multinomial Naïve Bayes
This algorithm is a special case of the popular naïve Bayes algorithm, which is used
specifically for prediction and classification tasks where we have more than two classes.
Before looking at multinomial naïve Bayes, let us look at the definition and formulation of
the naïve Bayes algorithm.

The naïve Bayes algorithm is a supervised learning algorithm that puts into action the very
popular Bayes' theorem. However, there is a "naïve" assumption here that each feature is
independent of the others. Mathematically we can formulate this as follows: given a response
class variable y and a set of n features in the form of a feature vector {x1, x2, …, xn},
using Bayes' theorem we can denote the probability of the occurrence of y given the features
as

\[ P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y) \times P(x_1, x_2, \ldots, x_n \mid y)}{P(x_1, x_2, \ldots, x_n)} \]

under the assumption that

\[ P(x_i \mid y, x_1, x_2, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = P(x_i \mid y) \]

and for all i we can represent this as

\[ P(y \mid x_1, x_2, \ldots, x_n) = \frac{P(y) \times \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, x_2, \ldots, x_n)} \]

where i ranges from 1 to n. In simple terms, this can be written as

\[ \text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \]

and now, since P(x1, x2, …, xn) is constant, the model can be expressed like this:

\[ P(y \mid x_1, x_2, \ldots, x_n) \propto P(y) \times \prod_{i=1}^{n} P(x_i \mid y) \]

This means that under the previous assumptions of independence among the features, where
each feature is conditionally independent of every other feature, the conditional
distribution over the class variable to be predicted, y, can be represented using the
following mathematical equation:

\[ P(y \mid x_1, x_2, \ldots, x_n) = \frac{1}{Z} P(y) \times \prod_{i=1}^{n} P(x_i \mid y) \]

where the evidence measure, Z = p(x), is a constant scaling factor dependent on the feature
variables. From this equation, we can build the naïve Bayes classifier by combining it with a
rule known as the MAP decision rule, which stands for maximum a posteriori. Going into the
statistical details would be impossible in the current scope, but by using it, the classifier
can be represented as a mathematical function that assigns a predicted class label
\hat{y} = C_k for some k using the following representation:

\[ \hat{y} = \underset{k \in \{1, 2, \ldots, K\}}{\operatorname{argmax}} \; P(C_k) \times \prod_{i=1}^{n} P(x_i \mid C_k) \]

This classifier is often said to be simple, quite evident from its name and also because of
several assumptions we make about our data and features that might not be so in the real
world. Nevertheless, this algorithm still works remarkably well in many use cases related to
classification, including multi-class document classification, spam filtering, and so on.
Naïve Bayes classifiers can train really fast compared to other classifiers and also work
well even when we do not have sufficient training data. Models often do not perform well when
they have a lot of features, and this phenomenon is known as the curse of dimensionality.
Naïve Bayes takes care of this problem by decoupling the class variable–related conditional
feature distributions, thus leading to each distribution being independently estimated as a
single-dimension distribution.

Multinomial naïve Bayes is an extension of the preceding algorithm for predicting and
classifying data points, where the number of distinct classes or outcomes is more than two.
In this case the feature vectors are usually assumed to be word counts from the Bag of Words
model, but TF-IDF–based weights will also work. One limitation is that negative weight-based
features can't be fed into this algorithm. This distribution can be represented as

\[ p_y = \{p_{y1}, p_{y2}, \ldots, p_{yn}\} \]

for each class label y, where the total number of features is n, which could be represented
as the total vocabulary of distinct words or terms in text analytics. From the preceding
equation, p_{yi} = P(x_i | y) represents the probability of feature i in any observation
sample that has an outcome or class y. The parameter p_y can be estimated with a smoothed
version of maximum likelihood estimation (with relative frequency of occurrences), and
represented as

\[ \hat{p}_{yi} = \frac{F_{yi} + \alpha}{F_y + \alpha n} \]

where

\[ F_{yi} = \sum_{x \in TD} x_i \]

is the frequency of occurrence for the feature i in a sample for class label y in our
training dataset TD, and

\[ F_y = \sum_{i=1}^{n} F_{yi} \]

is the total frequency of all features for the class label y. There is some amount of
smoothing done with the help of priors α ≥ 0, which accounts for the features that are not
present in the learning data points and helps in getting rid of zero-probability–related
issues. Some specific settings for this parameter are used quite often. The value of α = 1 is
known as Laplace smoothing, and α < 1 is known as Lidstone smoothing. The scikit-learn library
provides an excellent implementation for multinomial naïve Bayes in the class MultinomialNB,
which we will be leveraging when we build our text classifier later on.
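To make the smoothed estimate and the MAP decision rule concrete, here is a minimal pure-Python sketch. It is not the scikit-learn implementation; the function names, the toy documents, and the choice to skip out-of-vocabulary words at prediction time are assumptions made for illustration. Log probabilities are summed instead of multiplying raw probabilities, which avoids numeric underflow but picks the same class.

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(documents, labels, alpha=1.0):
    """Estimate log P(y) and the smoothed log P(x_i | y) from tokenized documents."""
    vocabulary = sorted({word for doc in documents for word in doc})
    n = len(vocabulary)                            # number of features
    class_doc_counts = Counter(labels)
    feature_counts = defaultdict(Counter)          # F_yi per class
    for doc, label in zip(documents, labels):
        feature_counts[label].update(doc)

    log_prior, log_likelihood = {}, {}
    for label, doc_count in class_doc_counts.items():
        log_prior[label] = math.log(doc_count / len(documents))
        total = sum(feature_counts[label].values())    # F_y
        log_likelihood[label] = {
            word: math.log((feature_counts[label][word] + alpha)
                           / (total + alpha * n))
            for word in vocabulary
        }
    return log_prior, log_likelihood

def predict(document, log_prior, log_likelihood):
    """MAP decision rule: argmax over classes of log prior + sum of log likelihoods."""
    scores = {}
    for label in log_prior:
        scores[label] = log_prior[label] + sum(
            log_likelihood[label][word]
            for word in document
            if word in log_likelihood[label]       # assumption: skip unseen words
        )
    return max(scores, key=scores.get)

# Toy training data (made up for illustration).
docs = [
    "win cash prize now".split(),
    "cheap prize win win".split(),
    "meeting agenda for monday".split(),
    "lunch meeting on monday".split(),
]
labels = ["spam", "spam", "ham", "ham"]

log_prior, log_likelihood = train_multinomial_nb(docs, labels, alpha=1.0)
print(predict("win a prize".split(), log_prior, log_likelihood))
print(predict("monday meeting".split(), log_prior, log_likelihood))
```

With alpha = 1.0 this is Laplace smoothing: a word never seen in a class still gets a small non-zero probability, so one unseen word cannot zero out the whole product.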
Support Vector Machines (SVM) - Exam Summary
SVM is a powerful supervised machine learning algorithm primarily used for classification,
though it can also handle regression and anomaly detection tasks. It is especially effective
for binary classification problems but can also be extended to multi-class scenarios.
Introduction:
SVM is used for classification, regression, and anomaly detection.
Best suited for binary classification but supports multi-class classification too.
SVM finds the optimal hyperplane that best separates different class data points.
Key Concepts:
Hyperplane: A boundary that separates different classes in the feature space.
Support Vectors: Data points closest to the hyperplane; most influential in defining it.
Margin: Distance between hyperplane and nearest support vectors from each class; larger
margins reduce overfitting.
Types of SVM Classification:
Linear SVM: Used when data is linearly separable.
Non-linear SVM: Uses kernel trick (e.g., RBF, polynomial kernels) to handle complex data by
mapping it into higher dimensions.
Mathematical Representation:
Training Dataset: {(x₁, y₁), ..., (xₙ, yₙ)}, where y ∈ {-1, 1}
Objective: Maximize the margin while minimizing classification error.
Decision Boundary Equation: w · x + b = 0
Margins are defined by: w · x + b = ±1
Soft Margin vs Hard Margin:
Hard Margin: Used when data is perfectly linearly separable (no error allowed).
Soft Margin: Allows some misclassification using a hinge loss function to improve
generalization.
Loss Function:
Hinge Loss: max(0, 1 - yᵢ(w · xᵢ + b))
Used in implementations such as: SVC, LinearSVC, and SGDClassifier from scikit-learn.
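The decision function and hinge loss above can be checked by hand on a few points. A minimal sketch, assuming hand-picked (hypothetical) weights w and bias b rather than values learned by an actual SVM solver:

```python
def decision_function(w, x, b):
    """Signed score w . x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def hinge_loss(w, b, x, y):
    """max(0, 1 - y (w . x + b)) for a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * decision_function(w, x, b))

w, b = [1.0, -1.0], 0.0     # hypothetical weights, chosen for illustration

points = [
    ([3.0, 0.0], +1),   # correct side, outside the margin
    ([0.5, 0.0], +1),   # correct side, but inside the margin
    ([0.0, 2.0], +1),   # wrong side of the hyperplane
]
losses = [hinge_loss(w, b, x, y) for x, y in points]
print(losses)   # [0.0, 0.5, 3.0]
```

Points beyond the margin cost nothing, points inside the margin are penalized even when correctly classified, and misclassified points are penalized in proportion to how far they are on the wrong side. This is what lets a soft margin trade a few violations for a wider margin.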
Multi-class SVM:
When dealing with more than two classes, a common approach is the one-vs-rest strategy:
1. Train n binary classifiers for n classes, each separating one class from all the others.
2. The class whose classifier gives the highest signed distance score from its hyperplane is
selected as the prediction.
Example: Iris dataset with three flower species.
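The one-vs-rest prediction step can be sketched as an argmax over per-class scores. The weights and the two-feature input below are hypothetical numbers invented for illustration; they are not fitted to the real Iris data.

```python
def one_vs_rest_predict(classifiers, x):
    """Return the class whose binary classifier scores x highest, plus all scores."""
    scores = {
        label: sum(wi * xi for wi, xi in zip(w, x)) + b
        for label, (w, b) in classifiers.items()
    }
    return max(scores, key=scores.get), scores

# One hypothetical (w, b) pair per class, as if each had been trained
# to separate that species from the other two.
classifiers = {
    "setosa":     ([-1.0, 0.5], 2.0),
    "versicolor": ([0.2, -0.3], 0.0),
    "virginica":  ([1.0, 0.2], -4.0),
}

label, scores = one_vs_rest_predict(classifiers, [5.0, 1.5])
print(label, scores)
```

Only the ranking of the scores matters here, so the classifier with the largest w · x + b wins even if several classifiers return a positive score.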
Tools & Visual Aids:
Scikit-learn provides easy-to-use APIs for implementing SVM.
Visual aids (e.g., Figures 4-3, 4-4 in the book) illustrate hyperplanes, margins, and
classification areas clearly.
Evaluating Classification Models – Summary Notes
Purpose:
After training and tuning a model, it’s important to evaluate how well it performs—
especially on new, unseen data. This is done using a test (holdout) dataset that was not used
in training.
Steps:
1. Extract features from test data (same way as training).
2. Feed features to the trained model to get predictions.
3. Compare predicted labels with actual labels.
Key Performance Metrics:
Accuracy
Proportion of correct predictions (both classes).
Formula: Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision
Out of all predicted positives, how many were correct. Also called Positive Predictive Value.
Formula: Precision = TP / (TP + FP)
Recall
Out of all actual positives, how many were correctly predicted. Also known as Sensitivity or
True Positive Rate.
Formula: Recall = TP / (TP + FN)
F1 Score
Harmonic mean of Precision and Recall. Useful when you need a balance between both.
Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
Example: Spam vs Ham Classification
Total Emails: 20
Actual Labels Count: 10 spam, 10 ham
Predicted Labels Count: 11 spam, 9 ham
Confusion Matrix (Spam = Positive Class):
| | Predicted Spam | Predicted Ham |
|------|-------|-------|
| Actual Spam | TP = 5 | FN = 5 |
| Actual Ham | FP = 6 | TN = 4 |
Metric Calculations (Manual + sklearn):
Accuracy = (5 + 4) / (5 + 4 + 5 + 6) = 9 / 20 = 0.45
Precision = 5 / (5 + 6) = 5 / 11 ≈ 0.45
Recall = 5 / (5 + 5) = 5 / 10 = 0.5
F1 Score = (2 × 0.45 × 0.5) / (0.45 + 0.5) ≈ 0.47 with the rounded precision; using the
unrounded precision 5/11 gives 10 / 21 ≈ 0.48
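The manual calculations can be reproduced directly from the four confusion-matrix counts. A minimal sketch in plain Python (variable names are illustrative):

```python
# Confusion-matrix counts from the spam/ham example (spam = positive class).
tp, fn, fp, tn = 5, 5, 6, 4

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.45 precision=0.45 recall=0.50 f1=0.48
```

Computed from the unrounded precision, F1 is 10/21 ≈ 0.476. Plugging in the already-rounded 0.45 gives about 0.47, so small differences between hand and library results usually come from rounding intermediate values.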
Confusion Matrix Terms Recap:
TP (True Positive): Correctly predicted positives
FN (False Negative): Actual positives predicted as negative
FP (False Positive): Actual negatives predicted as positive
TN (True Negative): Correctly predicted negatives
Key Takeaway:
Understanding and manually computing these metrics helps in interpreting model
performance better, rather than blindly relying on library functions.
Multi-class classification
Multi-class classification is a type of supervised machine learning where the goal is to
classify input data into one of three or more distinct categories or classes.
In this type of problem, each input is assigned to exactly one class from a set of multiple
possible classes.
Why is Multi-Class Classification Used?
Multi-class classification is used when a task requires choosing one correct label out of many
possible categories. It helps computers understand, organize, and make decisions based on
different types of data.
Applications of Multi-Class Classification
Multi-class classification is widely used in many fields where a system needs to choose from
three or more categories. Here are some of its most common applications:
1. Text Classification
Email filtering: Categorizing emails into Primary, Social, Promotions, Spam.
News categorization: Grouping articles as Sports, Politics, Technology, Entertainment.
Sentiment analysis: Classifying text as Positive, Neutral, or Negative.
2. Image Classification
Object recognition: Identifying whether an image is of a cat, dog, car, or tree.
Medical imaging: Classifying X-rays or MRIs into different disease categories.
Facial recognition: Recognizing and labeling individuals.
3. Product Categorization (E-Commerce)
Automatically tagging products as Clothing, Electronics, Home Appliances, Books, etc.
Helps in organizing online catalogs and improving search.
4. Handwriting and Digit Recognition
Recognizing handwritten digits (0–9) or letters (A–Z).
Used in postal services, banking (check processing), and digitizing forms.
5. Language Identification
Detecting whether a piece of text is in English, French, Spanish, etc.
Used in chat apps, translators, and global content platforms.
6. Disease Diagnosis (Healthcare)
Predicting the type of illness based on patient symptoms or test results.
Classifying skin conditions, cancers, or eye diseases from images.
7. Customer Support Automation
Classifying support tickets into categories like Billing, Technical Issue, Product Inquiry, etc.
Helps route issues to the right department.
8. Fraud Detection
Labeling financial transactions as normal, suspicious, or fraudulent.
9. Speech Emotion Recognition
Identifying if the speaker sounds happy, angry, sad, or excited.
Useful in virtual assistants and customer service.
10. Document Classification
Classifying documents into legal, medical, technical, or educational.
Uses of Multi-Class Classification System:-
1. Sentiment Analysis
Beyond just positive/negative, multi-class classification can include categories
like very positive, positive, neutral, negative, very negative.
2. Topic Classification
Automatically categorizing news articles, emails, or blog posts into topics like
sports, politics, entertainment, technology, etc.
3. Intent Detection
In chatbots or virtual assistants, classifying user queries into intents like
booking a flight, checking weather, ordering food, etc.
4. Spam Detection
Instead of binary spam/not spam, classify emails or messages into multiple
types: spam, promotional, transactional, personal, etc.
5. Language Identification
Determining which language a given piece of text is written in (e.g., English,
French, Hindi, etc.).
6. Emotion Detection
Classifying text into emotional states like joy, anger, sadness, surprise, fear, and
disgust.
7. Document Categorization
Classifying documents or PDFs in industries (e.g., legal, healthcare) into types
like invoice, contract, medical report, prescription etc.
8. Product or Service Review Classification
For companies analyzing feedback, classifying reviews into categories like
delivery, product quality, customer service, pricing etc.
Building a multi class classification system in text analytics:-
1. Define the Problem and Classes
Clearly define:
- *What type of text you're analyzing* (e.g., tweets, reviews, emails)
- *What the possible categories are* (e.g., topics, sentiment levels, intent types)
> Goal: Classify customer feedback into 4 classes: `Product Quality`, `Delivery`,
`Customer Service`, `Others`.
2. Collect and Label Data
Gather a dataset of text samples, each with a label.
Example format:
| Text | Label |
|------|-------|
| "My package arrived late." | Delivery |
| "This phone broke in a week." | Product Quality |
Use sources like:
- Surveys
- Reviews
- Social media
- Support tickets
---
3. Preprocess the Text
Clean and standardize text:
- Lowercase everything
- Remove punctuation, numbers, stopwords
- Tokenize and lemmatize/stem words
- *Why?* To remove noise and help the model focus on meaningful patterns.
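The cleaning steps listed above can be sketched with the standard library alone. This is a minimal illustration: the stopword list is a tiny hand-written assumption, and lemmatization/stemming is left out because it normally relies on an NLP library.

```python
import re
import string

# A tiny illustrative stopword list; real projects use a much fuller one.
STOPWORDS = {"a", "an", "the", "in", "is", "my", "this", "for", "i"}

def preprocess(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\d+", " ", text)
    tokens = text.split()
    return [token for token in tokens if token not in STOPWORDS]

print(preprocess("My package arrived late."))      # ['package', 'arrived', 'late']
print(preprocess("This phone broke in a week."))   # ['phone', 'broke', 'week']
```

The same function must be applied to training text and to any new text at prediction time, otherwise the features will not line up.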
---
4. Convert Text into Numerical Features
Since ML models work with numbers, convert text into vectors using:
- *TF-IDF*: Measures how important a word is in a document relative to the corpus.
- *Word embeddings* (e.g., Word2Vec, GloVe): Capture semantic meaning.
- *Transformer embeddings* (e.g., BERT): Context-aware representations (state-of-the-art).
---
5. Choose a Classification Model
Pick a model that supports *multi-class classification*:
- *Traditional ML*:
- Logistic Regression (with one-vs-rest strategy)
- Naive Bayes
- Random Forest
- Support Vector Machine (SVM)
- *Deep Learning*:
- LSTM, GRU, CNN for text
- Fine-tuned *transformers* like BERT or RoBERTa
6. Train the Model
Split data into training and test sets. Train the model to learn from the training data.
Use:
- *Cross-entropy loss* for multi-class classification
- *Evaluation metrics* like:
- Accuracy
- Precision, Recall, F1-score (per class)
- Confusion matrix
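Cross-entropy loss for one training example can be computed by hand. A minimal sketch, assuming the model emits one raw score per class and that softmax turns those scores into probabilities; the scores below are hypothetical numbers, not output from a trained model.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]    # for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_index):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probabilities[true_index])

classes = ["Product Quality", "Delivery", "Customer Service", "Others"]
scores = [0.5, 2.0, 0.1, -1.0]     # hypothetical raw outputs for one text
probs = softmax(scores)

loss_if_delivery = cross_entropy(probs, classes.index("Delivery"))
loss_if_others = cross_entropy(probs, classes.index("Others"))
print(round(loss_if_delivery, 3), round(loss_if_others, 3))
```

The loss is small when the model puts high probability on the true class and grows quickly when it does not, which is the signal training uses to adjust the model.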
---
7. Evaluate and Improve
- Use validation metrics to check performance.
- Fine-tune hyperparameters, features, and text processing steps.
- Use techniques like class balancing if some labels are underrepresented.
---
8. Predict on New Text
Once trained, the model can assign a category to any new text input.
> Example: “I waited 2 weeks for delivery” → Predicted label: `Delivery`
9. (Optional) Deploy the System
Wrap the model in an API (e.g., Flask, FastAPI) and integrate it into:
- Customer support systems
- Feedback analysis dashboards
- Chatbots
---
What is Bag of Words?
Bag of Words is a way to turn text into numbers so that a computer can understand and
analyze it.
How it works (easy version):
1. Make a list of all words used in your text.
2. Count how many times each word appears in each sentence or document.
Example 1:
Let’s say you have two sentences:
Sentence 1: "I love apples"
Sentence 2: "I love oranges"
Step 1: Make a list of all the words:
["I", "love", "apples", "oranges"]
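Step 2: Count word appearances:
| Sentence | I | love | apples | oranges |
|----------|---|------|--------|---------|
| Sentence 1 | 1 | 1 | 1 | 0 |
| Sentence 2 | 1 | 1 | 0 | 1 |
Now each sentence is a row of numbers (a vector). That’s BoW!
Example 2: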
Sentences:
"Birds can fly"
"Fish can swim"
Vocabulary (all unique words):
["birds", "can", "fly", "fish", "swim"]
Each sentence becomes one BoW vector.
In our example:
"Birds can fly" → Vector: [1, 1, 1, 0, 0]
"Fish can swim" → Vector: [0, 1, 0, 1, 1]
So there are 2 BoW vectors — one for each sentence.
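The two steps (build the vocabulary, then count) can be written out in a few lines of plain Python. A minimal sketch; the function names are illustrative, and the vocabulary is kept in order of first appearance so the vectors match the example above.

```python
def build_vocabulary(sentences):
    """Collect unique lowercase words in order of first appearance."""
    vocabulary = []
    for sentence in sentences:
        for word in sentence.lower().split():
            if word not in vocabulary:
                vocabulary.append(word)
    return vocabulary

def bag_of_words(sentence, vocabulary):
    """Count how many times each vocabulary word occurs in the sentence."""
    words = sentence.lower().split()
    return [words.count(term) for term in vocabulary]

sentences = ["Birds can fly", "Fish can swim"]
vocabulary = build_vocabulary(sentences)
vectors = [bag_of_words(sentence, vocabulary) for sentence in sentences]

print(vocabulary)   # ['birds', 'can', 'fly', 'fish', 'swim']
for sentence, vector in zip(sentences, vectors):
    print(sentence, "->", vector)
```

Libraries typically sort the vocabulary alphabetically instead, which changes the column order but not the counts.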
Why it's useful in text analytics:
It helps computers analyze and compare different texts.
You can use it to build models for spam detection, topic classification, or sentiment analysis.
Keep in mind:
BoW doesn’t understand meaning or word order.
But it’s a simple and powerful starting point for analyzing text.
1. TF-IDF (Term Frequency-Inverse Document Frequency)
Definition: TF-IDF is a statistical measure used to evaluate the importance of a word in a
document relative to a collection of documents (corpus).
Term Frequency (TF): This measures how frequently a term appears in a document. It is
calculated as:
\[ \text{TF}(t, d) = \frac{\text{Number of times term } t \text{ appears in document } d}{\text{Total number of terms in document } d} \]
This means that if a word appears more often in a document, its TF score increases.
Inverse Document Frequency (IDF): This measures how important a term is across the
entire corpus. It is calculated as:
\[ \text{IDF}(t) = \log\left(\frac{\text{Total number of documents}}{\text{Number of documents containing term } t}\right) \]
A high IDF score indicates that the term is rare across documents, making it more significant.
TF-IDF Score: The final TF-IDF score for a term in a document is calculated by multiplying
its TF and IDF scores:
\[ \text{TF-IDF}(t, d) = \text{TF}(t, d) \times \text{IDF}(t) \]
This score helps in identifying keywords in documents, as higher scores indicate more
important terms.
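The three formulas translate almost line for line into code. A minimal sketch, assuming the natural logarithm (the notes do not fix a base) and a made-up three-document corpus; it also assumes the term occurs in at least one document, since otherwise the IDF denominator would be zero.

```python
import math

def tf(term, document):
    """Occurrences of the term divided by the number of terms in the document."""
    return document.count(term) / len(document)

def idf(term, corpus):
    """Log of (number of documents / number of documents containing the term)."""
    containing = sum(1 for document in corpus if term in document)
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# Toy corpus of tokenized documents (made up for illustration).
corpus = [
    "birds can fly".split(),
    "fish can swim".split(),
    "birds can sing".split(),
]

for term in ["can", "birds", "fly"]:
    print(term, round(tf_idf(term, corpus[0], corpus), 4))
```

"can" appears in every document, so its IDF is log(1) = 0 and its score vanishes, while "fly" appears in only one document and scores highest. Library implementations often use smoothed variants of IDF, so their numbers will differ from this textbook form.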
2. Advanced Word Vectorization Models
Definition: Word vectorization is the process of converting words into numerical vectors
that capture their meanings and relationships. Advanced models include:
Word2Vec: This model uses neural networks to learn word associations from a large
corpus of text. It represents words in a continuous vector space, where similar words are
located closer together. There are two main architectures:
CBOW (Continuous Bag of Words): Predicts a word based on its context (surrounding
words).
Skip-gram: Predicts the context based on a given word.
GloVe (Global Vectors for Word Representation): This model creates word vectors by
analyzing the global word co-occurrence matrix from a corpus. It captures the relationships
between words based on their co-occurrence probabilities.
FastText: An extension of Word2Vec that considers subword information (character n-
grams), allowing it to generate better representations for rare words and handle out-of-
vocabulary words more effectively.
These models help in various NLP tasks by providing a way to understand word meanings
and relationships in a numerical format.
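The difference between the two Word2Vec architectures is easiest to see in the training pairs each one builds from a sentence. This sketch only generates the pairs; it does not train any vectors, and the window size of 1 and function names are illustrative assumptions.

```python
def context_indices(position, length, window):
    """Indices of neighbours within the window, excluding the word itself."""
    start = max(0, position - window)
    stop = min(length, position + window + 1)
    return [j for j in range(start, stop) if j != position]

def skip_gram_pairs(tokens, window=1):
    """Skip-gram: (center word, one context word) pairs."""
    return [(center, tokens[j])
            for i, center in enumerate(tokens)
            for j in context_indices(i, len(tokens), window)]

def cbow_pairs(tokens, window=1):
    """CBOW: (list of context words, center word) pairs."""
    return [([tokens[j] for j in context_indices(i, len(tokens), window)], center)
            for i, center in enumerate(tokens)]

tokens = "the quick brown fox".split()
print(skip_gram_pairs(tokens))
print(cbow_pairs(tokens))
```

Skip-gram uses the center word as input and each neighbour as a target, while CBOW uses all neighbours together as input and the center word as the target.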
3. Understanding Text Syntax and Structure: Parts of Speech Tagging
Definition: Parts of Speech (POS) tagging is the process of labeling words in a text with
their corresponding part of speech, such as nouns, verbs, adjectives, etc. This helps in
understanding the grammatical structure of sentences.
Purpose: POS tagging is crucial for various NLP tasks, including:
Parsing: Understanding sentence structure.
Information Extraction: Identifying key information based on word types.
Sentiment Analysis: Understanding the sentiment by analyzing adjectives and verbs.
How it Works: POS tagging can be done using rule-based methods, statistical models, or
machine learning algorithms. For example, a simple rule might tag a word as a noun if it
follows a determiner (like "the" or "a").
Example: In the sentence "The quick brown fox jumps over the lazy dog":
"The" (determiner)
"quick" (adjective)
"brown" (adjective)
"fox" (noun)
"jumps" (verb)
"over" (preposition)
"the" (determiner)
"lazy" (adjective)
"dog" (noun)
By tagging each word, we can better understand the sentence's structure and meaning.
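A rule-based tagger of the kind described can be sketched with a small lexicon plus one fallback rule. Everything here is a toy assumption: the lexicon covers only this sentence, the Penn Treebank-style tag names (DT, JJ, NN, VBZ, IN) are used for brevity, and the fallback rule is widened slightly to fire after an adjective as well as after a determiner.

```python
# Hand-written lexicon of known words (an assumption for this example).
LEXICON = {
    "the": "DT", "a": "DT",
    "quick": "JJ", "brown": "JJ", "lazy": "JJ",
    "jumps": "VBZ", "over": "IN",
}

def rule_based_tag(tokens):
    """Tag known words from the lexicon; guess NN after a determiner or adjective."""
    tagged = []
    for token in tokens:
        word = token.lower()
        if word in LEXICON:
            tag = LEXICON[word]
        elif tagged and tagged[-1][1] in {"DT", "JJ"}:
            tag = "NN"      # fallback rule for unknown words
        else:
            tag = "UNK"     # no rule applies
        tagged.append((token, tag))
    return tagged

tokens = "The quick brown fox jumps over the lazy dog".split()
tagged = rule_based_tag(tokens)
print(tagged)
```

"fox" and "dog" are not in the lexicon and are tagged NN purely by the fallback rule. Rules like this misfire easily on real text, which is why statistical and machine learning taggers are preferred in practice.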
Parsing Techniques in NLP
1. Dependency-Based Parsing
Dependency parsing constructs a tree in which words are connected based on their
grammatical relationships.
Each word (except the root) is linked to another word, called its head, revealing relations
such as subject, object, modifier, etc.
Key Points:-
Focuses on binary relations between words.
Typically useful for languages with flexible word order.
Efficient and useful in information extraction and relation identification.
Often integrated with tools like spaCy, Stanford CoreNLP, and others.
Example using spaCy:
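import spacy

# Load the small English pipeline (it must be downloaded beforehand)
nlp = spacy.load('en_core_web_sm')
doc = nlp("The quick brown fox jumps over the lazy dog.")
# Print each token, its dependency label, and its head word
for token in doc:
    print(f"{token.text} --> {token.dep_} --> {token.head.text}")
Output: Displays dependency relations like 'nsubj' for subject, 'dobj' for direct object, etc.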
2. Shallow Parsing (Chunking)
Shallow parsing, or chunking, segments text into phrases such as noun phrases (NP)
or verb phrases (VP) without revealing internal structure.
It provides a surface-level analysis, which is efficient and simpler compared to full
parsing.
Key Points:-
Does not capture complete syntactic structure.
Useful as a preprocessing step in Named Entity Recognition (NER) and text
summarization.
Relies heavily on Part-Of-Speech (POS) tagging and regular expressions.
Enhances downstream tasks by quickly identifying key chunks of information.
Example using NLTK:
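import nltk

# POS-tagged sentence as (word, tag) pairs
sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
            ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
# NP chunk: optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
# Print the chunk tree as bracketed text
print(result)
This processes the sentence into identifiable chunks such as [NP The quick brown fox].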
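To show what the chunk pattern does without any library, the same NP rule (optional determiner, any number of adjectives, then one noun) can be hand-coded over the tagged tokens. This is a simplified stand-in written for illustration, not how NLTK implements chunking.

```python
def np_chunk(tagged):
    """Group tokens matching DT? JJ* NN into NP chunks; leave the rest as-is."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                           # optional determiner
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":    # any adjectives
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":       # required noun
            chunks.append(("NP", [word for word, _ in tagged[i:j + 1]]))
            i = j + 1
        else:
            # No noun phrase starts here: emit the single token with its own tag.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

tagged_sentence = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"), ("fox", "NN"),
                   ("jumps", "VBZ"), ("over", "IN"),
                   ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
chunks = np_chunk(tagged_sentence)
for label, words in chunks:
    print(label, words)
```

The result is flat: two NP chunks with the verb and preposition left between them, and no structure inside or above the chunks. That is exactly the "surface-level" analysis that distinguishes shallow parsing from full parsing.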
3. Constituency-Based Parsing
Constituency parsing uses grammatical rules based on Context-Free Grammar (CFG) to break
down sentences into hierarchical subphrases or constituents.
It reveals the full syntactic structure of a sentence by illustrating how groups of words form
phrases.
Key Points:-
Represents the complete grammatical structure.
Enables detailed analysis suitable for syntax-driven applications like machine translation
and question answering.
Often more computationally intensive due to the hierarchical representation and ambiguity
resolution.
Can be implemented using parsers like NLTK's ChartParser or Stanford Parser.
Example using NLTK:
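import nltk

# Toy context-free grammar covering just this sentence.
# NP allows one or two adjectives so that "The quick brown fox" can be parsed.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> DT JJ JJ NN | DT JJ NN
    VP -> VBZ PP
    PP -> IN NP
    DT -> 'The' | 'the'
    JJ -> 'quick' | 'brown' | 'lazy'
    NN -> 'fox' | 'dog'
    VBZ -> 'jumps'
    IN -> 'over'
""")
parser = nltk.ChartParser(grammar)
sentence = ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
# Print every parse tree the grammar allows for the sentence
for tree in parser.parse(sentence):
    tree.pretty_print()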
This builds a constituency parse tree representing the hierarchical structure of the sentence.
Conclusion
In text analytics, each parsing technique provides unique benefits:
- **Dependency Parsing** focuses on direct relations between words, making it great for
tasks like relation extraction.
- **Shallow Parsing** provides a fast and efficient way to identify key phrase boundaries,
useful as a preprocessing step.
- **Constituency Parsing** offers comprehensive syntactic analysis but with added
computational complexity.
The choice of parsing technique depends on the specific NLP application, balancing between
the desired detail of syntactic analysis and computational efficiency.