Text Corpus Operations in NLP
1. Introduction
A text corpus is a collection of text data used for various NLP tasks. Operations like tokenization,
stopword removal, stemming, lemmatization, and vectorization transform raw text into a format
suitable for machine learning models.
2. Tokenization
- Splits text into smaller units (tokens), such as words, punctuation marks, or sentences.
- Example:
Text: "NLP is exciting! It helps machines understand human language."
Word Tokens: ['NLP', 'is', 'exciting', '!', 'It', 'helps', 'machines', 'understand', 'human', 'language', '.']
Sentence Tokens: ['NLP is exciting!', 'It helps machines understand human language.']
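The example above can be reproduced with a minimal regex-based sketch. The function names split_words and split_sentences are illustrative, not taken from any library, and the sentence rule is a simplification that would mis-split abbreviations such as "Dr.".

```python
import re

def split_words(text):
    # A run of word characters is one token; every other non-space
    # character (punctuation) becomes a token of its own.
    return re.findall(r"\w+|[^\w\s]", text)

def split_sentences(text):
    # Split at whitespace that follows '.', '!' or '?'.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

text = "NLP is exciting! It helps machines understand human language."
print(split_words(text))
print(split_sentences(text))
```

Library tokenizers handle many more cases (abbreviations, contractions, URLs), so this sketch is only meant to show the idea.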
3. Stopword Removal
- Removes very common words (such as 'is', 'it', 'the') that carry little meaning on their own.
- Example: 'is' and 'It' removed from: 'NLP is exciting! It helps machines understand human
language.'
- Result: ['NLP', 'exciting', '!', 'helps', 'machines', 'understand', 'human', 'language', '.']
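A minimal sketch of the filtering step follows. The STOPWORDS set is a deliberately tiny, hand-picked assumption; real stopword lists are much longer, and which tokens get dropped depends entirely on the list used.

```python
# Hypothetical miniature stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "in", "is", "it", "of", "the", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    # Compare in lowercase so that "It" matches "it".
    return [t for t in tokens if t.lower() not in stopwords]

tokens = ['NLP', 'is', 'exciting', '!', 'It', 'helps', 'machines',
          'understand', 'human', 'language', '.']
print(remove_stopwords(tokens))
```

Passing a different set, for example STOPWORDS | {"helps"}, changes which tokens survive.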
4. Stemming
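- Reduces words to a stem by stripping suffixes with heuristic rules; the stem need not be a real word.
- Example (Porter stemmer, which also lowercases): 'Processing' -> 'process', 'Machines' -> 'machin'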
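The idea of rule-based suffix stripping can be sketched as below. This is a toy rule set chosen for illustration, not the Porter algorithm, which applies many more rules and conditions. The names SUFFIXES and naive_stem and the min_stem threshold are assumptions. Like most stemmers, the sketch lowercases its input.

```python
# Suffixes are tried in order, longer ones first, so "es" wins over "s".
SUFFIXES = ("ing", "es", "ed", "s")

def naive_stem(word, min_stem=3):
    word = word.lower()
    for suffix in SUFFIXES:
        # Strip only if at least min_stem characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["Processing", "Machines", "helps", "is"]:
    print(w, "->", naive_stem(w))
```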
5. Lemmatization
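- Converts words to their dictionary form (lemma) using a vocabulary lookup rather than suffix stripping, so the result is always a real word.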
- Example: 'Machines' -> 'Machine', 'Helps' -> 'Help'
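A lookup-based sketch shows how this differs from stemming. The LEMMAS table is a hypothetical miniature dictionary; real lemmatizers rely on a full lexicon and usually on part-of-speech information as well.

```python
# Hypothetical miniature lemma dictionary, for illustration only.
LEMMAS = {
    "machines": "machine",
    "helps": "help",
    "mice": "mouse",
    "was": "be",
}

def lemmatize(word, lemmas=LEMMAS):
    lemma = lemmas.get(word.lower())
    if lemma is None:
        return word  # unknown word: leave it unchanged
    # Keep the input's capitalisation, so 'Machines' -> 'Machine'.
    return lemma.capitalize() if word[:1].isupper() else lemma

for w in ["Machines", "Helps", "mice", "NLP"]:
    print(w, "->", lemmatize(w))
```

Irregular forms such as 'mice' -> 'mouse' are handled because the mapping is looked up, not derived from the spelling.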
6. Vectorization
- Converts text into numerical form for machine learning.
- Example using scikit-learn's CountVectorizer, fitted on the two example sentences from the Tokenization section:
Features: ['exciting', 'helps', 'human', 'is', 'it', 'language', 'machines', 'nlp', 'understand']
- Sentence: "NLP is exciting!" -> Vector: [1 0 0 1 0 0 0 1 0]
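The bag-of-words counting behind this can be sketched in plain Python. The sketch imitates CountVectorizer's default behaviour (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary) but is not the library itself. The helper names and the two-sentence corpus are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep tokens of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(corpus):
    # Sorted list of every distinct token in the corpus.
    return sorted({tok for doc in corpus for tok in tokenize(doc)})

def transform(text, vocabulary):
    # One count per vocabulary entry, in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

corpus = ["NLP is exciting!", "It helps machines understand human language."]
vocabulary = fit_vocabulary(corpus)
print(vocabulary)
print(transform("NLP is exciting!", vocabulary))
```

Each position in the vector corresponds to one vocabulary entry, so the vector length equals the vocabulary size.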
7. TF-IDF
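- Weights each word by how often it occurs in a document (term frequency), scaled down by how many
documents contain it (inverse document frequency). Words that appear in most documents are weighted
lower; words concentrated in a few documents are weighted higher.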
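A minimal sketch of the weighting, assuming the textbook variant weight = tf * log(N / df), where tf is the word's share of the document, N the number of documents, and df the number of documents containing the word. Libraries often use smoothed or normalised variants, so their numbers will differ. The three-document corpus is made up for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\w+", text.lower())

def tf_idf(corpus):
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus.
corpus = [
    "the movie was amazing",
    "the film was boring",
    "the plot was amazing and the cast was amazing",
]
for doc_weights in tf_idf(corpus):
    print(doc_weights)
```

In this corpus 'the' and 'was' occur in every document, so their weight is zero, while 'movie', which occurs in only one document, gets the largest weight in the first document.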
8. Text Classification Example
- Uses a Naive Bayes classifier, which picks the label under which the sentence's words are most probable.
- Example:
Training Data: ['This movie was amazing!' -> Positive, 'I hated this film.' -> Negative]
- New sentence: "The movie was fantastic!" -> Predicted Label: 'Positive'
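The example can be sketched with a small multinomial Naive Bayes written from scratch. This is an illustrative implementation under stated assumptions (add-one smoothing, words unseen in training are ignored, a simple lowercase tokenizer), not the code of any particular library.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def train(examples):
    # examples: list of (text, label) pairs.
    word_counts = defaultdict(Counter)  # label -> word -> count
    label_counts = Counter()            # label -> number of documents
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocabulary

def predict(text, model):
    word_counts, label_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label in label_counts:
        # Log prior plus the log likelihood of each known word,
        # with add-one (Laplace) smoothing.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word in vocabulary:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                )
        scores[label] = score
    return max(scores, key=scores.get)

model = train([
    ("This movie was amazing!", "Positive"),
    ("I hated this film.", "Negative"),
])
print(predict("The movie was fantastic!", model))
```

With only two training sentences the prediction rests on the words 'movie' and 'was', which occur only in the positive example; 'the' and 'fantastic' were never seen and do not contribute.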
9. Conclusion
These operations help in preprocessing text for tasks like chatbots, search engines, and sentiment
analysis.