NLP Text Corpus Operations Fixed

The document explains operations on text corpora in NLP, including tokenization, stopword removal, stemming, lemmatization, and vectorization, which prepare raw text for machine learning. It also discusses TF-IDF for word importance and provides a text classification example using a Naive Bayes classifier. These preprocessing techniques are essential for applications such as chatbots, search engines, and sentiment analysis.

Uploaded by

shreya182btcse22
Copyright
© All Rights Reserved

Operations of Text Corpus in NLP

1. Introduction
A text corpus is a collection of text data used for various NLP tasks. Operations like tokenization,
stopword removal, stemming, lemmatization, and vectorization transform raw text into a format
suitable for machine learning models.

2. Tokenization
- Splitting text into words or sentences.
- Example:
Text: "NLP is exciting! It helps machines understand human language."
Word Tokens: ['NLP', 'is', 'exciting', '!', 'It', 'helps', 'machines', 'understand', 'human', 'language', '.']
Sentence Tokens: ['NLP is exciting!', 'It helps machines understand human language.']
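
The tokenization example above can be reproduced with a minimal sketch that uses only regular expressions from the Python standard library. The two patterns are assumptions chosen for this sentence; they are not the algorithm of any particular NLP library, and the sentence splitter would break on abbreviations such as "Dr.".

```python
import re

text = "NLP is exciting! It helps machines understand human language."

# Word tokens: runs of word characters, or any single punctuation mark
word_tokens = re.findall(r"\w+|[^\w\s]", text)

# Sentence tokens: split on whitespace that follows '.', '!' or '?'
sentence_tokens = re.split(r"(?<=[.!?])\s+", text)

print(word_tokens)
print(sentence_tokens)
```

Dedicated tokenizers (for example those shipped with NLTK or spaCy) handle abbreviations, contractions, and other edge cases that these two patterns ignore.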

3. Stopword Removal
- Removes common function words (stopwords) that carry little meaning on their own.
- Example: 'is' and 'It' removed from: 'NLP is exciting! It helps machines understand human
language.' ('helps' is a content word, so typical stopword lists keep it.)
- Result: ['NLP', 'exciting', '!', 'helps', 'machines', 'understand', 'human', 'language', '.']
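
Stopword filtering can be sketched as a simple membership test. The stopword set below is a small, hand-picked assumption for illustration; real lists (such as the English list shipped with NLTK) are much longer. Note that 'helps' is a content word and is not in this set, so it survives filtering.

```python
# Hypothetical, hand-picked stopword set for illustration only
STOPWORDS = {"a", "an", "the", "is", "it", "of", "to", "and"}

tokens = ['NLP', 'is', 'exciting', '!', 'It', 'helps', 'machines',
          'understand', 'human', 'language', '.']

# Compare in lowercase so that 'It' matches the entry 'it'
filtered = [tok for tok in tokens if tok.lower() not in STOPWORDS]

print(filtered)
```

Punctuation tokens are kept here; whether to drop them as well is a separate preprocessing decision.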

4. Stemming
- Reduces words to root forms by stripping suffixes; the resulting stem need not be a real word.
- Example: 'Processing' -> 'Process', 'Machines' -> 'Machin'
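
The idea of suffix stripping can be illustrated with a deliberately crude stemmer. This is a sketch under strong assumptions: the function name crude_stem and its three-suffix rule list are hypothetical, and it is not the Porter algorithm or any library's stemmer. It also lowercases its input, so the stems come out in lowercase.

```python
def crude_stem(word, suffixes=("ing", "es", "s")):
    """Lowercase the word and strip the first matching suffix."""
    word = word.lower()
    for suffix in suffixes:
        # Keep at least three characters so short words are left alone
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [crude_stem(w) for w in ["Processing", "Machines", "helps"]]
print(stems)
```

The stem 'machin' shows why stemming is fast but rough: the rule only looks at the surface form of the word.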

5. Lemmatization
- Converts words to their dictionary forms (lemmas), which are always valid words.
- Example: 'Machines' -> 'Machine', 'Helps' -> 'Help'
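
In contrast to suffix stripping, lemmatization relies on a vocabulary lookup. The sketch below uses a tiny hypothetical dictionary (LEMMAS) to show the principle; real lemmatizers draw on a full lexical resource and usually take the part of speech into account.

```python
# Hypothetical mini-dictionary mapping inflected forms to lemmas
LEMMAS = {
    "machines": "machine",
    "helps": "help",
    "better": "good",
    "was": "be",
}

def lemmatize(word):
    """Return the dictionary form if known, else the lowercased word."""
    key = word.lower()
    return LEMMAS.get(key, key)

lemmas = [lemmatize(w) for w in ["Machines", "Helps", "language"]]
print(lemmas)
```

Entries such as 'better' -> 'good' illustrate mappings that no suffix rule could produce.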

6. Vectorization
- Converts text into numerical form for machine learning.
- Example using CountVectorizer, fitted on the text from Section 2:
Features: ['exciting', 'helps', 'human', 'is', 'it', 'language', 'machines', 'nlp', 'understand']
- Sentence: "NLP is exciting!" -> Vector: [1 0 0 1 0 0 0 1 0]
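
A bag-of-words vector can be built by hand to see what a count vectorizer does. The sketch below is a standard-library approximation, not scikit-learn itself: it mimics CountVectorizer's default behaviour of lowercasing, keeping tokens of two or more word characters, and ordering the vocabulary alphabetically. Treating the two sentences from Section 2 as the corpus is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

corpus = ["NLP is exciting!", "It helps machines understand human language."]

# Vocabulary: every distinct token in the corpus, in alphabetical order
vocabulary = sorted({tok for doc in corpus for tok in tokenize(doc)})

def vectorize(text):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

vector = vectorize("NLP is exciting!")
print(vocabulary)
print(vector)
```

Every word of the fitted corpus, including 'human', gets its own position, so the vector has one entry per vocabulary term.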

7. TF-IDF
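- Weights each word by how often it occurs in a document (term frequency, TF), scaled down by
how many documents in the corpus contain it (inverse document frequency, IDF).
- Words that appear in most documents therefore receive lower weights.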
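
The weighting can be made concrete with a small sketch. It uses the plain textbook variant tf * log(N / df); libraries commonly apply smoothed variants, so their numbers will differ. The three documents are made-up examples.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "the movie was amazing",
    "the film was boring",
    "the plot was amazing",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf_idf(tokens):
    """Return {term: tf * idf} for one tokenized document."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }

weights = tf_idf(tokenized[0])
print(weights)
```

In the first document, 'the' and 'was' occur in every document and get weight 0, 'amazing' occurs in two documents and gets a small weight, and 'movie' occurs in only one and gets the largest weight.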

8. Text Classification Example
- Uses a Naive Bayes classifier trained on word counts.
- Example:
Training Data: 'This movie was amazing!' -> Positive, 'I hated this film.' -> Negative
- New sentence: "The movie was fantastic!" -> Predicted Label: 'Positive'
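
The prediction above can be traced with a from-scratch multinomial Naive Bayes sketch. This is an illustrative standard-library implementation under assumptions (add-one smoothing, tokens of two or more characters, unseen words ignored); the function names train and predict are hypothetical and do not belong to any library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def train(examples):
    """Collect per-class document counts, word counts, and the vocabulary."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        tokens = tokenize(text)
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_docs, word_counts, vocab

def predict(text, model, alpha=1.0):
    """Return the class with the highest log-probability score."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    scores = {}
    for label in class_docs:
        # Start from the log prior of the class
        score = math.log(class_docs[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training are ignored
            # Smoothed likelihood of the word given the class
            likelihood = (word_counts[label][tok] + alpha) / (
                total_words + alpha * len(vocab)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)

training_data = [
    ("This movie was amazing!", "Positive"),
    ("I hated this film.", "Negative"),
]
model = train(training_data)
prediction = predict("The movie was fantastic!", model)
print(prediction)
```

The new sentence is labelled Positive because 'movie' and 'was' were seen only in the positive example; 'the' and 'fantastic' are unseen and contribute nothing. With two training sentences this is a toy, and a real classifier needs far more data.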

9. Conclusion
These operations prepare raw text for applications such as chatbots, search engines, and sentiment
analysis.
