Text Classification Using NLP

The document outlines the process of text document categorization, focusing on text preprocessing techniques such as tokenization, stemming, and lemmatization to improve data quality and feature extraction. It also discusses vectorization models, including Bag of Words and TF-IDF, which convert text into numerical representations for machine learning. Additionally, the document covers Word2Vec as a predictive vectorization model, explaining its mechanisms and applications in understanding word relationships.

Uploaded by

rdharun36
Copyright
© All Rights Reserved

Text Document Categorization

Step 1: Text Pre Processing


Need for Text Processing
• Raw text data is unstructured and noisy. It typically contains:
  • Typing errors
  • Slang
  • Abbreviations or irrelevant information
• Preprocessing leads to:
  • Improved data quality
  • Better feature extraction
  • Reduced computational complexity
Text Pre Processing Techniques
• Regular Expressions
• Tokenization
• Lemmatization & Stemming
• Part of Speech (PoS) Tagging
Sample Text Preprocessing
corpus = [
"I can't wait for the new season of my favorite show!",
"The COVID-19 pandemic has affected millions of people worldwide.",
"U.S. stocks fell on Friday after news of rising inflation.",
"<html><body>Welcome to the website!</body></html>",
"Python is a great programming language!!! ??"
]
Text Cleaning
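• Convert the text to lowercase and remove punctuation, numbers, special characters, and HTML tags.

Cleaned_Corpus:
['i cant wait for the new season of my favorite show', 'the covid pandemic has affected millions of
people worldwide', 'us stocks fell on friday after news of rising inflation', 'htmlbodywelcome to the
websitebodyhtml', 'python is a great programming language ']

• Note: the fourth document still reads 'htmlbodywelcome to the websitebodyhtml'. Only the angle brackets and slashes were removed as punctuation, so the tag names remain. HTML tags have to be stripped before punctuation is removed.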
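The cleaning step can be sketched with the standard `re` module. This is a minimal sketch, not the notebook's code: the function name `clean_text` and the exact regular expressions are assumptions. It strips tags before punctuation, so the fourth document becomes 'welcome to the website' rather than the 'htmlbodywelcome ...' string in the slide output.

```python
import re

corpus = [
    "I can't wait for the new season of my favorite show!",
    "The COVID-19 pandemic has affected millions of people worldwide.",
    "U.S. stocks fell on Friday after news of rising inflation.",
    "<html><body>Welcome to the website!</body></html>",
    "Python is a great programming language!!! ??",
]

def clean_text(text):
    """Lowercase the text and strip HTML tags, digits, and punctuation."""
    text = re.sub(r"<[^>]+>", " ", text)      # remove HTML tags before anything else
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)      # keep only letters and whitespace
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated spaces

cleaned_corpus = [clean_text(doc) for doc in corpus]
print(cleaned_corpus)
```

Removing tags first matters: once the angle brackets are gone, the tag names can no longer be told apart from ordinary words.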
Tokenization
• Splits the cleaned text into tokens (words).
• Imports word_tokenize to split text into individual words.
• Downloads the necessary NLTK tokenizer model ('punkt_tab').
• Tokenizes each cleaned document in cleaned_corpus.
• Stores the list of tokens for each document in tokenized_corpus.
• Prints the final tokenized output (a list of word lists).

[['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'], ['the', 'covid', 'pandemic', 'has',
'affected', 'millions', 'of', 'people', 'worldwide'], ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of',
'rising', 'inflation'], ['htmlbodywelcome', 'to', 'the', 'websitebodyhtml'], ['python', 'is', 'a', 'great',
'programming', 'language']]
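The tokenization step can be approximated without any downloads. The sketch below swaps NLTK's `word_tokenize` for plain whitespace splitting, which should give the same tokens on this corpus because the text is already free of punctuation. The helper name `tokenize` is an assumption. With NLTK installed, `word_tokenize` from `nltk.tokenize` (after `nltk.download('punkt_tab')`) would take its place.

```python
# Cleaned corpus exactly as printed on the slide (including the trailing space).
cleaned_corpus = [
    "i cant wait for the new season of my favorite show",
    "the covid pandemic has affected millions of people worldwide",
    "us stocks fell on friday after news of rising inflation",
    "htmlbodywelcome to the websitebodyhtml",
    "python is a great programming language ",
]

def tokenize(text):
    # Whitespace tokenization: str.split() with no arguments drops leading and
    # trailing spaces and treats runs of spaces as a single separator.
    return text.split()

tokenized_corpus = [tokenize(doc) for doc in cleaned_corpus]
print(tokenized_corpus)
```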
Stop Words Removal
• Imports the list of English stopwords from nltk.corpus.
• Downloads the stopwords corpus using nltk.download('stopwords').
• Stores all English stopwords (like "the", "is", "and", etc.) in a set called stop_words for fast lookup.
• Iterates over each document in tokenized_corpus and removes all stopwords.
• Saves the cleaned, non-stopword tokens into filtered_corpus.

[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'millions', 'people',
'worldwide'], ['us', 'stocks', 'fell', 'friday', 'news', 'rising', 'inflation'], ['htmlbodywelcome',
'websitebodyhtml'], ['python', 'great', 'programming', 'language']]
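The filtering logic can be sketched as a set lookup inside a list comprehension. The `stop_words` set below is a hypothetical hand-picked subset, just large enough for this corpus. The slides use NLTK's full English list instead, obtained with `stopwords.words('english')`.

```python
tokenized_corpus = [
    ['i', 'cant', 'wait', 'for', 'the', 'new', 'season', 'of', 'my', 'favorite', 'show'],
    ['the', 'covid', 'pandemic', 'has', 'affected', 'millions', 'of', 'people', 'worldwide'],
    ['us', 'stocks', 'fell', 'on', 'friday', 'after', 'news', 'of', 'rising', 'inflation'],
    ['htmlbodywelcome', 'to', 'the', 'websitebodyhtml'],
    ['python', 'is', 'a', 'great', 'programming', 'language'],
]

# Hypothetical mini stop-word list (assumption); a set gives O(1) membership tests.
stop_words = {"i", "for", "the", "of", "my", "has", "on", "after", "to", "is", "a", "and"}

filtered_corpus = [
    [token for token in doc if token not in stop_words]
    for doc in tokenized_corpus
]
print(filtered_corpus)
```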
Stemming and Lemmatization - 1
Stemmed corpus (PorterStemmer):
[['cant', 'wait', 'new', 'season', 'favorit', 'show'], ['covid', 'pandem', 'affect', 'million', 'peopl', 'worldwid'],
['us', 'stock', 'fell', 'friday', 'news', 'rise', 'inflat'], ['htmlbodywelcom', 'websitebodyhtml'], ['python',
'great', 'program', 'languag']]

Lemmatized corpus (WordNetLemmatizer):
[['cant', 'wait', 'new', 'season', 'favorite', 'show'], ['covid', 'pandemic', 'affected', 'million', 'people',
'worldwide'], ['u', 'stock', 'fell', 'friday', 'news', 'rising', 'inflation'], ['htmlbodywelcome',
'websitebodyhtml'], ['python', 'great', 'programming', 'language']]
Stemming and Lemmatization - 2
• Imports PorterStemmer and WordNetLemmatizer from NLTK.
• Downloads the wordnet resource required for lemmatization.
• Initializes the stemmer and lemmatizer.
• Applies stemming to each word in filtered_corpus and stores the result in stemmed_corpus.
• Applies lemmatization to each word in filtered_corpus and stores the result in lemmatized_corpus.
• Stemming can produce non-words ('favorit', 'pandem'), whereas lemmatization returns dictionary forms. Note that the lemmatizer also turns 'us' into 'u'.
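The difference between the two techniques can be illustrated with toy stand-ins. This is deliberately not the Porter algorithm or WordNet: `toy_stem` applies a hypothetical four-rule suffix list and `toy_lemmatize` uses a hypothetical two-entry lookup table. With NLTK, the corresponding calls would be `PorterStemmer().stem(word)` and `WordNetLemmatizer().lemmatize(word)`.

```python
# Hypothetical toy rules (assumption); Porter's real rule set is far larger.
SUFFIXES = ("ing", "ed", "es", "s")

def toy_stem(word, min_stem=3):
    """Chop the first matching suffix, as long as a stem of min_stem letters remains."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table (assumption) standing in for a real dictionary.
LEMMA_TABLE = {"stocks": "stock", "millions": "million"}

def toy_lemmatize(word):
    """Return the dictionary form if it is known, otherwise the word unchanged."""
    return LEMMA_TABLE.get(word, word)

words = ["stocks", "millions", "rising", "programming"]
stems = [toy_stem(w) for w in words]
lemmas = [toy_lemmatize(w) for w in words]
print(stems)
print(lemmas)
```

Rule-based stripping is fast but may leave fragments such as 'ris'. A dictionary lookup only ever returns real words, at the cost of needing the dictionary.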
Types of Tokenization
• Tokenize text using NLTK
• spaCy Tokenizer
• White Space Tokenization
• Dictionary Based Tokenization

Python Notebook: Text Tokenization using NLTK


Vectorization Models
Need for Vectors
• Numerical Representation - Vectors translate words and sentences into numerical formats.
• Capturing Semantics (Meaning) - Word2Vec, GloVe, and BERT generate embeddings that place words with similar meanings closer together.
• Contextual Understanding - Vectors capture the context and syntactic role of words within sentences, enhancing model accuracy in tasks like named entity recognition.
• Efficient Data Processing
Types of Vectorization Models
Count Based Vectorization Models
• Bag of Words Vectorizer
• TF-IDF Vectorizer
• GloVe Vectorizer Model
Predictive Vectorization Models
• Word2Vec - Skip Gram
• Word2Vec - Continuous Bag of Words
Bag of Words - Count Based Vectorization Model
Bag of Words Model in NLP
• Text data needs to be converted into numbers so that machine learning algorithms can understand it.
• BoW turns a text (a sentence, paragraph, or document) into a collection of words and counts how often each word appears, ignoring the order of the words.
Components of BoW
• Vocabulary - The list of all unique words in the entire dataset.
• Document Representation - Each document is represented as a vector; each element is the frequency in that document of one word from the vocabulary.
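The two components can be sketched with `collections.Counter`. The two example sentences and the helper name `bow_vector` are assumptions for illustration. A library vectorizer would normally replace this hand-rolled version.

```python
from collections import Counter

# Example documents (assumption), already lowercased and free of punctuation.
docs = [
    "the cat sat on the mat",
    "the dog played in the park",
]

tokenized = [doc.split() for doc in docs]

# Vocabulary: every unique word across the dataset, sorted for a stable order.
vocabulary = sorted({token for doc in tokenized for token in doc})

def bow_vector(tokens, vocabulary):
    """Count each vocabulary word in one document; word order is discarded."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

vectors = [bow_vector(doc, vocabulary) for doc in tokenized]
print(vocabulary)
print(vectors)
```

Every document maps to a vector of the same length (the vocabulary size), which is what lets the vectors be stacked into a matrix for a classifier.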

Python Notebook: 03. Bag of Words


TF-IDF Vectorization
TF-IDF (Term Frequency–Inverse Document Frequency)
• Measures how important a word is to a document relative to a larger collection of documents.
• Combines two factors: Term Frequency (TF) and Inverse Document Frequency (IDF).
Term Frequency (TF)
• Measures how often a word appears in a document.
• A higher frequency suggests greater importance: if a term appears frequently in a document, it is likely relevant to the document’s content.
Inverse Document Frequency (IDF)
• Reduces the weight of words that are common across many documents and increases the weight of rare words.
• If a term appears in fewer documents, it is more likely to be meaningful and specific.
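One common way to write these definitions is shown here. The base-10 logarithm is an assumption, chosen because it reproduces the value 0.176 used in the worked example; other texts and libraries use the natural logarithm or add smoothing terms.

```latex
\mathrm{TF}(t, d) = \frac{\text{number of times } t \text{ appears in } d}{\text{total number of terms in } d}
\qquad
\mathrm{IDF}(t, D) = \log_{10} \frac{|D|}{|\{\, d \in D : t \in d \,\}|}
\qquad
\text{TF-IDF}(t, d, D) = \mathrm{TF}(t, d) \times \mathrm{IDF}(t, D)
```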
TF-IDF
• Document 1: "The cat sat on the mat."
• Document 2: "The dog played in the park."
• Document 3: "Cats and dogs are great pets."

• Calculate the TF-IDF score for specific terms in these documents


Term Frequency (TF)
Document 1:
• The word "cat" appears 1 time.
• The total number of terms in Document 1 is 6 ("the", "cat", "sat", "on", "the", "mat").
• So TF(cat, Document 1) = 1/6 ≈ 0.167.

Document 2:
• The word "cat" does not appear.
• So TF(cat, Document 2) = 0.

Document 3:
• The word "cats" appears 1 time. It is counted as "cat" here, which assumes the text has been stemmed or lemmatized first.
• The total number of terms in Document 3 is 6 ("cats", "and", "dogs", "are", "great", "pets").
• So TF(cat, Document 3) = 1/6 ≈ 0.167.

• In Document 1 and Document 3, the word "cat" has the same TF score.
Inverse Document Frequency (IDF)
• Total number of documents in the corpus (D): 3
• Number of documents containing the term "cat": 2 (Document 1 and Document 3).
• So IDF(cat, D) = log10(3/2) ≈ 0.176.
Calculate TF-IDF
• Document 1: TF-IDF(cat, Document 1, D) = 0.167 × 0.176 ≈ 0.029
• Document 2: TF-IDF(cat, Document 2, D) = 0 × 0.176 = 0
• Document 3: TF-IDF(cat, Document 3, D) = 0.167 × 0.176 ≈ 0.029
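The worked example can be checked in a few lines of Python. This is a sketch under two assumptions: Document 3 is lemmatized so that "cats" counts as "cat", and IDF uses a base-10 logarithm with no smoothing, which matches the 0.176 above. The function names are illustrative.

```python
import math

docs = {
    "Document 1": ["the", "cat", "sat", "on", "the", "mat"],
    "Document 2": ["the", "dog", "played", "in", "the", "park"],
    # Assumed lemmatized form of "Cats and dogs are great pets."
    "Document 3": ["cat", "and", "dog", "are", "great", "pet"],
}

def tf(term, tokens):
    """Share of the document's terms that are `term`."""
    return tokens.count(term) / len(tokens)

def idf(term, docs):
    """log10(number of documents / number of documents containing `term`)."""
    containing = sum(1 for tokens in docs.values() if term in tokens)
    return math.log10(len(docs) / containing) if containing else 0.0

def tf_idf(term, name, docs):
    return tf(term, docs[name]) * idf(term, docs)

for name in docs:
    print(name, round(tf_idf("cat", name, docs), 3))
```

Library implementations often differ in detail. For example, scikit-learn's `TfidfVectorizer` uses a natural logarithm with smoothing and normalizes each document vector by default, so its scores will not match these hand-computed values.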
Word2Vec - Predictive Vectorization Model
Word embeddings: properties
• Relationships between words correspond to differences between vectors.
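A toy example shows what this property means in practice. The 2-D vectors below are hypothetical and hand-built so that one axis behaves like "royalty" and the other like "gender". Real embeddings are learned and have hundreds of dimensions.

```python
import math

# Hypothetical 2-D embeddings (assumption): axis 0 ~ royalty, axis 1 ~ gender.
embeddings = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.0],
}

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' using vec(b) - vec(a) + vec(c)."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = [w for w in embeddings if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

result = analogy("man", "king", "woman")
print(result)
```

The difference king − man isolates the "royalty" direction, and adding it to woman lands on queen.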
Word embeddings: questions
• How big should the embedding space be?
  • Trade-offs like any other machine learning problem: greater capacity versus efficiency and overfitting.

• How do we find W (the embedding matrix)?
  • Often as part of a prediction or classification task involving neighboring words.
word2vec
• Predict words using context
• Two versions: CBOW (continuous bag of words) and Skip-gram
CBOW
• Bag of words
  • Gets rid of word order. Used in the discrete case, with counts of the words that appear.
• CBOW
  • Takes the vector embeddings of the n words before the target and the n words after it and adds them (as vectors).
  • Also removes word order, but the vector sum is meaningful enough to deduce the missing word.
CBOW Model
“The cat sat on floor”
• Window size = 2
• [Figure: CBOW network for the target word "sat". Input layer: V-dim one-hot vectors for the context words "cat" and "on", each with a 1 at the word's index in the vocabulary. Hidden layer: N-dim. Output layer: V-dim one-hot vector for "sat".]
• Source: [Link]/~vagelis/classes/CS242/slides/[Link]
We must learn W and W’
• $W_{V \times N}$ connects the V-dim input layer to the N-dim hidden layer.
• $W'_{N \times V}$ connects the N-dim hidden layer to the V-dim output layer.
• N will be the size of the word vector.
Hidden layer
• Multiplying $W^T$ by a one-hot vector selects that word's column of $W^T$.
• For "cat": $W^T_{V \times N} \times x_{cat} = v_{cat}$

$$
\begin{bmatrix}
0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \dots & 3.2 \\
0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \dots & 6.1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \dots & 1.2
\end{bmatrix}
\times
\begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}
=
\begin{bmatrix} 2.4 \\ 2.6 \\ \vdots \\ 1.8 \end{bmatrix}
$$

• For "on": $W^T_{V \times N} \times x_{on} = v_{on}$

$$
\begin{bmatrix}
0.1 & 2.4 & 1.6 & 1.8 & 0.5 & 0.9 & \dots & 3.2 \\
0.5 & 2.6 & 1.4 & 2.9 & 1.5 & 3.6 & \dots & 6.1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & & \vdots \\
0.6 & 1.8 & 2.7 & 1.9 & 2.4 & 2.0 & \dots & 1.2
\end{bmatrix}
\times
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}
=
\begin{bmatrix} 1.8 \\ 2.9 \\ \vdots \\ 1.9 \end{bmatrix}
$$

• The hidden layer is the average of the context vectors: $\hat{v} = \dfrac{v_{cat} + v_{on}}{2}$
Output layer
• $z = W'^T_{N \times V} \times \hat{v}$, a V-dim vector of scores.
• $\hat{y} = \mathrm{softmax}(z)$
• We would prefer $\hat{y}$ close to $y_{sat}$, the one-hot vector of the target word. In the slide's example, $\hat{y}$ has entries 0.01, 0.02, 0.00, 0.02, 0.01, 0.02, 0.01, 0.7, …, 0.00.
Word vectors
• $W^T$ contains the words' vectors.
• We can consider either W or W’ as the word’s representation, or even take the average.
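The CBOW forward pass (look up each context word's vector in W, average them, project through W', apply softmax) can be sketched in plain Python. Everything numeric here is an assumption for illustration: a five-word vocabulary, N = 3, a W whose entries are borrowed from the visible corner of the slide's matrix, and a hand-picked W' that happens to favor "sat". In a real model both matrices are learned.

```python
import math

vocab = ["the", "cat", "sat", "on", "floor"]
V, N = len(vocab), 3

# W is V x N: row i is the input vector of word i (values are illustrative).
W = [
    [0.1, 0.5, 0.6],
    [2.4, 2.6, 1.8],
    [1.6, 1.4, 2.7],
    [1.8, 2.9, 1.9],
    [0.5, 1.5, 2.4],
]
# W_prime is N x V: hypothetical output weights, not trained values.
W_prime = [
    [0.2, -0.1, 0.9, 0.0, 0.1],
    [0.0, 0.3, 0.8, -0.2, 0.1],
    [-0.3, 0.1, 0.7, 0.2, 0.0],
]

def one_hot(word):
    vec = [0] * V
    vec[vocab.index(word)] = 1
    return vec

def hidden_vector(x):
    """Compute W^T x; for a one-hot x this selects one row of W."""
    return [sum(W[i][j] * x[i] for i in range(V)) for j in range(N)]

def cbow_forward(context):
    vectors = [hidden_vector(one_hot(w)) for w in context]
    v_hat = [sum(column) / len(vectors) for column in zip(*vectors)]  # average
    z = [sum(W_prime[j][k] * v_hat[j] for j in range(N)) for k in range(V)]
    peak = max(z)  # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in z]
    total = sum(exps)
    return v_hat, [e / total for e in exps]

v_hat, y_hat = cbow_forward(["cat", "on"])
predicted = vocab[y_hat.index(max(y_hat))]
print(v_hat, predicted)
```

Training would compare `y_hat` with the one-hot vector for "sat" and adjust both matrices to reduce the difference.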
Word Analogies
Skip gram
• Skip-gram is an alternative to CBOW.
• Start with a single word embedding and try to predict the surrounding words.
• Much less well-defined problem, but works better in practice (scales better).
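The training examples for skip-gram are (center word, surrounding word) pairs. A minimal sketch of how they could be generated, reusing the sentence and window size from the CBOW slide; the function name `skipgram_pairs` is an assumption.

```python
def skipgram_pairs(tokens, window=2):
    """Pair every word with each neighbor at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the cat sat on floor".split()
pairs = skipgram_pairs(tokens, window=2)
sat_pairs = [p for p in pairs if p[0] == "sat"]
print(sat_pairs)
```

Where CBOW merges a whole context into one prediction of the center word, skip-gram turns each neighbor into its own training example.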
Skip Gram - Sample
• Vocabulary of 10,000 words.
• Embedding vectors with 300 features.
• So the hidden layer is represented by a weight matrix with 10,000 rows and 300 columns, one row per vocabulary word (the one-hot input multiplies it as a row vector on the left).
Word2vec shortcomings
• Problem: 10,000 words and 300-dimensional embeddings give a large parameter space to learn, and 10K words is minimal for real applications.

• Slow to train and needs lots of data, particularly to learn uncommon words.
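The size of that parameter space follows from the matrix shapes. This assumes the two-matrix layout described for CBOW (an input matrix W and an output matrix W') with V = 10,000 and N = 300:

```latex
|W| = V \times N = 10{,}000 \times 300 = 3{,}000{,}000
\qquad
|W| + |W'| = 2 \times 3{,}000{,}000 = 6{,}000{,}000 \text{ weights}
```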
