Text Classification Using NLP

The document outlines the process of text document categorization, focusing on text preprocessing techniques such as tokenization, stemming, and lemmatization to improve data quality and feature extraction. It also discusses vectorization models, including Bag of Words and TF-IDF, which convert text into numerical representations for machine learning. Additionally, the document covers Word2Vec as a predictive vectorization model, explaining its mechanisms and applications in understanding word relationships.

Uploaded by

rdharun36

Text Document Categorization

Step 1: Text Pre Processing

Need for Text Processing
• Raw text data
• Unstructured and noisy data
• Typing errors
• Slang
• Abbreviations or irrelevant information

• Improved data quality
• Better feature extraction
• Reduced computational complexity
Text Pre Processing Techniques
• Regular Expressions
• Tokenization
• Lemmatization & Stemming
• Part of Speech (PoS) Tagging
Sample Text Preprocessing
corpus = [
"I can't wait for the new season of my favorite show!",
"The COVID-19 pandemic has affected millions of people worldwide.",
"U.S. stocks fell on Friday after news of rising inflation.",
"<html><body>Welcome to the website!</body></html>",
"Python is a great programming language!!! ??"
]
Text Cleaning
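• Convert the text to lowercase and remove punctuation, numbers, special characters, and HTML tags.

Cleaned_Corpus:
['i cant wait for the new season of my favorite show',
 'the covid pandemic has affected millions of people worldwide',
 'us stocks fell on friday after news of rising inflation',
 'htmlbodywelcome to the websitebodyhtml',
 'python is a great programming language ']

• In this output the HTML tags were not fully removed: only the angle brackets and slashes were stripped, so the tag names remain fused to the text ('htmlbodywelcome', 'websitebodyhtml').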
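The cleaning step can be sketched with Python's standard re module. This is a minimal illustration, not the notebook's code: the function name clean_text and the exact regular expressions are assumptions. Unlike the output shown above, the sketch strips whole HTML tags before removing punctuation, so tag names do not end up fused to the text.

```python
import re

corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??",
]

def clean_text(text):
    """Hypothetical helper: lowercase and keep only letters and spaces."""
    text = re.sub(r"<[^>]+>", " ", text)       # drop whole HTML tags first
    text = text.lower()                        # normalize case
    text = re.sub(r"[^a-z\s]", "", text)       # drop punctuation, digits, symbols
    return re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace

cleaned_corpus = [clean_text(doc) for doc in corpus]

for doc in cleaned_corpus:
    print(doc)
```

Removing tags before punctuation matters: once the angle brackets are gone, a tag name can no longer be told apart from ordinary text.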
Tokenization
• Tokenization splits the cleaned text into tokens (words).
• Import word_tokenize to split text into individual words.
• Download the required NLTK tokenizer model ('punkt_tab').
• Tokenize each cleaned document in cleaned_corpus.
• Store the list of tokens for each document in tokenized_corpus.
• Print the final tokenized output (a list of word lists).

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has',
'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of',
'rising', 'inflation'], ['htmlbodywelcome', 'to', 'the', 'websitebodyhtml'], ['python', 'is', 'a', 'great',
'programming', 'language']]
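As a dependency-free sketch of this step, plain whitespace splitting is used below as a stand-in for NLTK's word_tokenize (a swapped-in technique, not the notebook's code). For this already-cleaned text, which is lowercase and free of punctuation, it yields the token lists shown above; on raw text with punctuation the two approaches would generally differ.

```python
# Cleaned corpus copied from the output above (tag names included).
cleaned_corpus = [
    "i cant wait for the new season of my favorite show",
    "the covid pandemic has affected millions of people worldwide",
    "us stocks fell on friday after news of rising inflation",
    "htmlbodywelcome to the websitebodyhtml",
    "python is a great programming language ",
]

# str.split() with no arguments splits on runs of whitespace and ignores
# leading/trailing whitespace, so the trailing space yields no empty token.
tokenized_corpus = [doc.split() for doc in cleaned_corpus]

for tokens in tokenized_corpus:
    print(tokens)
```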
Stop Words Removal
• Import the list of English stopwords from [Link].
• Download the stopwords corpus using [Link]('stopwords').
• Store all English stopwords (such as "the", "is", and "and") in a set called stop_words for fast lookup.
• Iterate over each document in tokenized_corpus and remove all stopwords.
• Save the remaining non-stopword tokens in filtered_corpus.

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people',
'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['htmlbodywelcome',
'websitebodyhtml'], ['python', 'great', 'programming', 'language']]
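The filtering logic can be sketched without any downloads. The stop word set below is a small hand-picked assumption for illustration only; NLTK's English list is considerably longer. With this set, the sketch reproduces the filtered output shown above.

```python
# Tokenized corpus copied from the tokenization output.
tokenized_corpus = [
    ['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'],
    ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'],
    ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'],
    ['htmlbodywelcome', 'to', 'the', 'websitebodyhtml'],
    ['python', 'is', 'a', 'great', 'programming', 'language'],
]

# Hand-picked stop words (assumption); a set gives constant-time membership tests.
stop_words = {"i", "for", "the", "of", "my", "has", "on", "after", "to", "is", "a", "and"}

# Keep only the tokens that are not stop words, document by document.
filtered_corpus = [
    [token for token in tokens if token not in stop_words]
    for tokens in tokenized_corpus
]

for tokens in filtered_corpus:
    print(tokens)
```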
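Stemming and Lemmatization - 1
Stemmed corpus (PorterStemmer):
[['cant', 'wait', 'new', 'season', 'favorit', 'show'],
 ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'],
 ['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'],
 ['htmlbodywelcom', 'websitebodyhtml'],
 ['python', 'great', 'program', 'languag']]

Lemmatized corpus (WordNetLemmatizer):
[['cant', 'wait', 'new', 'season', 'favorite', 'show'],
 ['covid', 'pandemic', 'affected', 'million', 'people', 'worldwide'],
 ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'],
 ['htmlbodywelcome', 'websitebodyhtml'],
 ['python', 'great', 'programming', 'language']]

• Stemming can produce truncated forms that are not dictionary words ('favorit', 'pandem', 'peopl').
• Lemmatization returns dictionary forms. The results are consistent with the lemmatizer's default of treating every token as a noun when no part-of-speech tag is passed: 'affected' and 'rising' are left unchanged, and 'us' is reduced to 'u'.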
Stemming and Lemmatization - 2
• Import PorterStemmer and WordNetLemmatizer from NLTK.
• Download the wordnet resource required for lemmatization.
• Initialize the stemmer and the lemmatizer.
• Apply stemming to each word in filtered_corpus and store the result in stemmed_corpus.
• Apply lemmatization to each word in filtered_corpus and store the result in lemmatized_corpus.
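To make the difference between the two ideas concrete, here is a deliberately crude toy sketch. It is not the Porter algorithm and not WordNet: crude_stem just strips a few suffixes by rule, and toy_lemmatize looks words up in a tiny hand-written table. Both function names, the suffix list, and the table are assumptions made for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # checked in this order (toy rule set)

def crude_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Tiny lookup table standing in for a real dictionary such as WordNet.
LEMMA_TABLE = {"millions": "million", "stocks": "stock"}

def toy_lemmatize(word):
    """Return the dictionary form if the word is known, else the word itself."""
    return LEMMA_TABLE.get(word, word)

words = ["millions", "stocks", "rising", "affected", "news"]
stemmed = [crude_stem(w) for w in words]
lemmatized = [toy_lemmatize(w) for w in words]

print(stemmed)
print(lemmatized)
```

The rule-based stemmer is fast but blind: it produces fragments such as 'ris' and wrongly shortens 'news' to 'new'. The lookup-based approach only changes words it knows, which is why real lemmatizers depend on a large dictionary.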
Types of Tokenization
• Tokenization using NLTK
• spaCy tokenizer
• Whitespace tokenization
• Dictionary-based tokenization

Python Notebook: Text Tokenization using NLTK

Vectorization Models
Need for Vectors
• Numerical representation - vectors translate words and sentences into numerical formats.
• Capturing semantics (meaning) - Word2Vec, GloVe, and BERT generate embeddings that place words with similar meanings closer together.
• Contextual understanding - vectors can capture the context and syntactic role of words within sentences, improving model accuracy in tasks such as named entity recognition.
• Efficient data processing
Types of Vectorization Models
Count-Based Vectorization Models
• Bag of Words Vectorizer
• TF-IDF Vectorizer
• GloVe Vectorizer Model
Predictive Vectorization Models
• Word2Vec - Skip Gram
• Word2Vec - Continuous Bag of Words
Bag of Words - Count-Based Vectorization Model
Bag of Words Model in NLP
• Text data needs to be converted into numbers so that machine learning algorithms can process it.
• The model turns a text (a sentence, paragraph, or document) into a collection of words and counts how often each word appears, ignoring word order.
Components of BoW
• Vocabulary - the list of all unique words in the entire dataset.
• Document representation - each document is represented as a vector in which each element is the frequency of one vocabulary word in that document.
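The two components can be sketched in a few lines of standard-library Python. The two example sentences and the helper name to_vector are assumptions chosen for illustration; a library vectorizer would normally be used in practice.

```python
from collections import Counter

documents = [
    "the cat sat on the mat",
    "the dog played in the park",
]

# Tokenize by whitespace (the text is already lowercase and punctuation-free).
tokenized = [doc.split() for doc in documents]

# Vocabulary: every unique word in the dataset, in a fixed (sorted) order.
vocabulary = sorted({token for tokens in tokenized for token in tokens})

def to_vector(tokens):
    """Count how often each vocabulary word occurs in one document."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [to_vector(tokens) for tokens in tokenized]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary word, so both documents map to vectors of the same length regardless of how long the original text is. Word order is lost: any reordering of a sentence gives the same vector.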

Python Notebook: 03. Bag of Words

TF-IDF Vectorization
TF-IDF (Term Frequency–Inverse Document Frequency)
• Measures how important a word is to a document relative to a larger collection of documents.
• Combines two components: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF)
• Measures how often a word appears in a document.
• A higher frequency suggests greater importance.
• If a term appears frequently in a document, it is likely relevant to the document’s content.
Inverse Document Frequency (IDF)
• Reduces the weight of common words across multiple documents while increasing the weight of
rare words.
• If a term appears in fewer documents, it is more likely to be meaningful and specific.
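One common way to write these definitions is given below. The symbols are notation introduced here, not taken from the slides: $f_{t,d}$ is the number of times term $t$ occurs in document $d$, and $|D|$ is the number of documents in the corpus. The base-10 logarithm is an assumption, chosen because it reproduces the IDF value 0.176 used in the worked example in this document; other variants (natural logarithm, smoothing terms) are also in wide use.

```latex
\mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}
\qquad
\mathrm{IDF}(t, D) = \log_{10} \frac{|D|}{|\{\, d \in D : t \in d \,\}|}
\qquad
\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
```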
TF-IDF
• Document 1: "The cat sat on the mat."
• Document 2: "The dog played in the park."
• Document 3: "Cats and dogs are great pets."

• Calculate the TF-IDF score for specific terms in these documents

Term Frequency (TF)
Document 1:
• The word "cat" appears 1 time.
• The total number of terms in Document 1 is 6 ("the", "cat", "sat", "on", "the", "mat").
• So TF(cat, Document 1) = 1/6 ≈ 0.167.

Document 2:
• The word "cat" does not appear.
• So TF(cat, Document 2) = 0.

Document 3:
• The word "cat" appears 1 time (as "cats", which is counted as "cat" once plurals are reduced to their base form).
• The total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great", "pets").
• So TF(cat, Document 3) = 1/6 ≈ 0.167.

• The word "cat" therefore has the same TF score in Document 1 and Document 3.
Inverse Document Frequency (IDF)
• Total number of documents in the corpus (D): 3
• Number of documents containing the term "cat": 2 (Document 1 and Document 3).
• So IDF(cat, D) = log10(3/2) ≈ 0.176.
Calculate TF-IDF
• Document 1: TF-IDF(cat, Document 1, D) = 0.167 × 0.176 ≈ 0.029
• Document 2: TF-IDF(cat, Document 2, D) = 0 × 0.176 = 0
• Document 3: TF-IDF(cat, Document 3, D) = 0.167 × 0.176 ≈ 0.029
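The worked example can be checked with a short script. This is a sketch under two assumptions: plurals in Document 3 are already reduced to their base form (so "cats" is stored as "cat"), and IDF uses a base-10 logarithm without smoothing, which is what the value 0.176 implies. The function names are hypothetical.

```python
import math

# Tokenized documents; Document 3 has plurals reduced to singular (assumption).
docs = {
    "Document 1": ["the", "cat", "sat", "on", "the", "mat"],
    "Document 2": ["the", "dog", "played", "in", "the", "park"],
    "Document 3": ["cat", "and", "dog", "are", "great", "pet"],
}

def tf(term, tokens):
    """Term frequency: occurrences of the term divided by document length."""
    return tokens.count(term) / len(tokens)

def idf(term, corpus):
    """Inverse document frequency with a base-10 log; 0.0 if the term is unseen."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log10(len(corpus) / df) if df else 0.0

def tf_idf(term, name, corpus):
    return tf(term, corpus[name]) * idf(term, corpus)

scores = {name: round(tf_idf("cat", name, docs), 3) for name in docs}
print(round(idf("cat", docs), 3))
print(scores)
```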
Word2Vec - Predictive Vectorization Model
Word embeddings: properties
• Relationships between words correspond to differences between vectors.
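This property is what makes analogy arithmetic possible. The sketch below uses hand-made two-dimensional toy vectors (axes loosely meaning "royalty" and "gender"), not learned embeddings, purely to show the mechanics; the vectors and the function names are assumptions.

```python
import math

# Hand-made toy vectors (royalty, gender); real embeddings are learned from data.
vectors = {
    "king":  (1.0, 1.0),
    "queen": (1.0, -1.0),
    "man":   (0.0, 1.0),
    "woman": (0.0, -1.0),
    "apple": (-1.0, 0.0),
}

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = tuple(vb - va + vc
                   for va, vb, vc in zip(vectors[a], vectors[b], vectors[c]))
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# "man" is to "king" as "woman" is to ...?
result = analogy("man", "king", "woman")
print(result)
```

Here the difference king - man isolates the "royalty" direction, and adding it to woman lands on queen. With learned embeddings the match is approximate rather than exact, which is why a nearest-neighbor search is used.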
Word embeddings: questions
• How big should the embedding space be?
  • As in any other machine learning problem, there is a trade-off: greater capacity versus efficiency and overfitting.
• How do we find W, the matrix of word embeddings?
  • Often as part of a prediction or classification task involving neighboring words.
word2vec
• Predict words using context
• Two versions: CBOW (continuous bag of words) and Skip-gram
CBOW
• Bag of words
  • Discards word order. Used in the discrete case, with counts of the words that appear.
• CBOW (continuous bag of words)
  • Takes the vector embeddings of the n words before the target and the n words after it and adds them (as vectors).
  • Also discards word order, but the vector sum is meaningful enough to deduce the missing word.
CBOW Model
“The cat sat on floor”
• Window size = 2
[Figure: CBOW network for the target word "sat". Input layer: one-hot vectors for the context words "cat" and "on", each with a 1 at the index of the word in the vocabulary and 0 elsewhere. Hidden layer. Output layer: one-hot vector for "sat".]
[Link]/~vagelis/classes/CS242/slides/[Link]
We must learn W and W’
[Figure: CBOW network with weight matrices. Each V-dimensional one-hot input ("cat", "on") is mapped by $W_{V \times N}$ to the N-dimensional hidden layer, and $W'_{N \times V}$ maps the hidden layer to the V-dimensional output layer (target "sat").]
N will be the size of the word vector.

Each context word vector is obtained by multiplying $W^{\top}$ by the word's one-hot vector, which selects one column:

$W^{\top}_{V \times N} \times x_{\text{cat}} = v_{\text{cat}}$

$$
\begin{bmatrix}
0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \cdots & 3.2 \\
0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \cdots & 6.1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \cdots & 1.2
\end{bmatrix}
\times
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
=
\begin{bmatrix} 2.4 \\ 2.6 \\ \vdots \\ 1.8 \end{bmatrix}
$$

$W^{\top}_{V \times N} \times x_{\text{on}} = v_{\text{on}}$

$$
\begin{bmatrix}
0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \cdots & 3.2 \\
0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \cdots & 6.1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \cdots & 1.2
\end{bmatrix}
\times
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}
=
\begin{bmatrix} 1.8 \\ 2.9 \\ \vdots \\ 1.9 \end{bmatrix}
$$

The hidden layer (N-dim) is the average of the context word vectors:

$\hat{v} = \dfrac{v_{\text{cat}} + v_{\text{on}}}{2}$
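[Figure: CBOW network, hidden-to-output step. The N-dimensional hidden vector $\hat{v}$ is mapped by $W'_{N \times V}$ to a V-dimensional score vector $z$, which is normalized into the output $\hat{y}$ and compared with the target "sat".]

$W'^{\top}_{N \times V} \times \hat{v} = z$

$\hat{y} = \mathrm{softmax}(z)$

N will be the size of the word vector.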
We would prefer $\hat{y}$ close to $y_{\text{sat}}$, the one-hot vector of the target word "sat".

[Figure: CBOW network with an example output $\hat{y} = \mathrm{softmax}(z)$, where $W'^{\top}_{N \times V} \times \hat{v} = z$. Example values of $\hat{y}$: (0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, …, 0.00), shown next to the one-hot target $y_{\text{sat}}$.]
N will be the size of the word vector.

$W^{\top}_{V \times N}$ contains the words' vectors (one column per vocabulary word):

$$
\begin{bmatrix}
0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \cdots & 3.2 \\
0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \cdots & 6.1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \cdots & 1.2
\end{bmatrix}
$$

We can consider either W or W’ as the word’s representation, or even take the average.
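The CBOW forward pass described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a training implementation: the five-word vocabulary comes from the example sentence, the embedding size N = 3 and the random initial weights are arbitrary choices, and the names cbow_forward, W_prime, and softmax are hypothetical. Because the weights are untrained, the predicted distribution is arbitrary; training would adjust W and W' so that the probability of "sat" rises.

```python
import math
import random

vocab = ["the", "cat", "sat", "on", "floor"]
V, N = len(vocab), 3                      # vocabulary size, embedding size
index = {word: i for i, word in enumerate(vocab)}

rng = random.Random(0)                    # fixed seed so the run is repeatable
W = [[rng.uniform(-1, 1) for _ in range(N)] for _ in range(V)]        # V x N
W_prime = [[rng.uniform(-1, 1) for _ in range(V)] for _ in range(N)]  # N x V

def softmax(z):
    """Turn scores into probabilities; subtracting the max avoids overflow."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cbow_forward(context_words):
    # W^T x for a one-hot x is simply the row of W at that word's index.
    context_vectors = [W[index[w]] for w in context_words]
    # Hidden layer: element-wise average of the context word vectors.
    v_hat = [sum(column) / len(context_vectors) for column in zip(*context_vectors)]
    # z = W'^T v_hat: one score per vocabulary word.
    z = [sum(v_hat[n] * W_prime[n][j] for n in range(N)) for j in range(V)]
    return v_hat, softmax(z)

v_hat, y_hat = cbow_forward(["cat", "on"])
for word, prob in zip(vocab, y_hat):
    print(f"{word}: {prob:.3f}")
```

Note that the one-hot multiplication never has to be carried out explicitly: it reduces to a table lookup, which is why the rows of W can be read directly as word vectors.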
Word Analogies
Skip gram
• Skip-gram is an alternative to CBOW.
• Start with a single word embedding and try to predict the surrounding words.
• A much less well-defined problem, but it works better in practice (scales better).
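Skip-gram training data consists of (center word, context word) pairs. The sketch below generates them for the example sentence used earlier with a window of 2; the function name skipgram_pairs is hypothetical, and real implementations usually add subsampling and negative sampling on top of this.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                      # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on floor".split()
pairs = skipgram_pairs(tokens, window=2)

for center, context in pairs:
    print(center, "->", context)
```

Each pair becomes one training example: the model receives the center word and is asked to assign high probability to the context word. This is the reverse of CBOW, where the context words are the input and the center word is the prediction target.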
Skip Gram - Sample
• Vocabulary of 10,000 words.
• Embedding vectors with 300 features.
• So the hidden layer is represented by a weight matrix with 10,000 rows and 300 columns, one row per word (the one-hot input vector multiplies it from the left).
Word2vec shortcomings
• Problem: 10,000 words and 300-dimensional embeddings give 10,000 × 300 = 3,000,000 weights to learn in the hidden layer alone, and 10K words is minimal for real applications.

• Slow to train, and needs lots of data, particularly to learn uncommon words.
