NLP
Natural Language Processing
TABLE OF CONTENTS
Introduction
Applications
Approaches
Pipeline
INTRODUCTION
Natural Language Processing (NLP) is a part of Artificial Intelligence
that helps computers understand human language. It’s about teaching
machines to read, listen, and talk like people do.
NEED OF NLP: Humans communicate in languages like Hindi, English, and Tamil, but computers
only work with numbers (0 and 1). To bridge this gap we need NLP, so that
computers can read and understand our language.
REAL WORLD APPLICATIONS
Chatbots and Virtual Assistants
They use NLP to understand questions and give answers.
Example: Siri, Alexa, or Copilot replying to your queries in natural language.
Email Spam Detection
NLP helps filter unwanted or harmful emails.
It looks at words, patterns, and sender info to decide if an email is spam.
Sentiment Analysis
NLP checks text to find emotions or opinions.
Example: Analyzing customer reviews to see if they are positive, negative, or neutral.
APPROACHES OF NLP
01
Rule-Based Approach:
How it works: Uses manually written grammar rules, dictionaries, and pattern-matching to process language.
Example: A chatbot that replies based on fixed “if–then” rules.
Drawbacks:
Hard to scale (too many rules needed for complex language).
Not flexible: fails when input doesn’t match predefined rules.
Maintenance is difficult as language evolves.
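A minimal sketch of the fixed "if–then" chatbot described above. The rule table, the replies, and the function name `reply` are all made up for illustration; a real rule-based system would carry far more rules, which is exactly the scaling drawback listed.

```python
import re

# Hypothetical rule table: (pattern, reply). Rules are checked in order
# and the first match wins.
RULES = [
    (re.compile(r"\b(hi|hello|hey)\b", re.IGNORECASE), "Hello! How can I help you?"),
    (re.compile(r"\bopening hours\b", re.IGNORECASE), "We are open 9am to 5pm."),
    (re.compile(r"\b(bye|goodbye)\b", re.IGNORECASE), "Goodbye!"),
]
FALLBACK = "Sorry, I did not understand that."


def reply(message: str) -> str:
    """Return the reply of the first rule whose pattern matches the message."""
    for pattern, answer in RULES:
        if pattern.search(message):
            return answer
    return FALLBACK  # no rule matched: the bot cannot generalize


print(reply("Hello there"))
print(reply("What are your opening hours?"))
print(reply("When do you close?"))  # falls through to FALLBACK
```

The last call shows the flexibility problem: a perfectly reasonable question gets the fallback reply because no rule was written for it.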
02
Statistical / Machine Learning Approach:
How it works: Uses probability and statistical models trained on large text datasets.
Example: Naive Bayes classifier for spam detection.
Drawbacks:
Needs a lot of labeled data for good accuracy.
Struggles with rare words or unseen phrases.
Often ignores deeper meaning (focuses on word frequency, not context).
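A rough sketch of the Naive Bayes spam example, written in plain Python so the word-frequency idea stays visible. The four training messages are invented for illustration, and the class name `NaiveBayes` is hypothetical; it assumes a multinomial model with add-one (Laplace) smoothing, which is one common choice rather than the only one.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)  # documents per class
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                if word not in self.vocab:
                    continue  # unseen words carry no evidence here
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)


# Toy training data, made up for this sketch.
TRAIN_TEXTS = [
    "win free money now",
    "free prize click now",
    "meeting at noon tomorrow",
    "lunch with the team tomorrow",
]
TRAIN_LABELS = ["spam", "spam", "ham", "ham"]

model = NaiveBayes().fit(TRAIN_TEXTS, TRAIN_LABELS)
print(model.predict("free money prize"))       # spam
print(model.predict("team meeting tomorrow"))  # ham
```

Note how the classifier only counts words: word order and context play no part, which matches the last drawback above.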
03
Deep Learning-Based Approach:
How it works: Uses neural networks (RNNs, CNNs, Transformers) to learn complex patterns and context in language.
Example: BERT, GPT models for translation, chatbots, summarization.
Drawbacks:
Requires massive data and computing power.
Can be a “black box”: hard to explain why it gives certain results.
Risk of bias if training data is biased.
Early systems used rule-based sentiment lexicons.
Then came ML classifiers (Naive Bayes, SVM with Bag of Words/TF-IDF).
Now, deep learning with embeddings and transformers dominates.
PIPELINE
1. Text Acquisition
Collect raw text data (documents, tweets, emails, speech converted to
text).
2. Text Preprocessing
Tokenization: Split text into words or sentences.
Normalization: Lowercasing, removing punctuation, handling contractions.
Stopword Removal: Filter out common words like the, is, and.
Stemming/Lemmatization: Reduce words to their root form (running →
run).
Noise Removal: Clean HTML tags, special symbols, etc.
3. Feature Extraction
Convert text into numerical form for algorithms.
Methods: Bag of Words, TF‑IDF, Word Embeddings (Word2Vec, GloVe,
BERT).
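Steps 2 and 3 above can be sketched end to end in plain Python. Everything named here (`STOPWORDS`, `crude_stem`, `preprocess`, `bag_of_words`, `tf_idf`) is hypothetical, the stemmer is a deliberately crude stand-in for a real stemmer or lemmatizer, and the TF-IDF variant assumed is the unsmoothed tf × log(N/df); libraries often use smoothed variants.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "is", "and", "a", "on"}  # tiny illustrative list


def crude_stem(word):
    """Strip a few suffixes; not a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeioulsz":
                stem = stem[:-1]
            return stem
    return word


def preprocess(text):
    # Lowercase, keep alphabetic tokens only, drop stopwords, then stem.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def bag_of_words(tokens, vocab):
    counts = Counter(tokens)
    return [counts[term] for term in vocab]


def tf_idf(docs_tokens, vocab):
    n_docs = len(docs_tokens)
    # Document frequency: in how many documents each term appears.
    df = {term: sum(term in tokens for tokens in docs_tokens) for term in vocab}
    vectors = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vectors


docs = ["The cat is running.", "The dogs barked and the cat jumped."]
docs_tokens = [preprocess(d) for d in docs]
vocab = sorted({t for tokens in docs_tokens for t in tokens})

print(docs_tokens[0])                       # ['cat', 'run']
print(vocab)                                # ['bark', 'cat', 'dog', 'jump', 'run']
print(bag_of_words(docs_tokens[0], vocab))  # [0, 1, 0, 0, 1]
print(tf_idf(docs_tokens, vocab)[0])
```

In the TF-IDF output, "cat" gets weight 0 because it appears in every document, while "run" keeps a positive weight because it is specific to the first one.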
4. Modeling / Learning
Apply algorithms to learn patterns.
Approaches:
Rule‑based (simple patterns).
Statistical ML (Naive Bayes, SVM).
Deep Learning (RNNs, Transformers like BERT, GPT).
5. Evaluation
Measure performance using metrics like accuracy, precision, recall, F1‑score.
Ensures the model works well on unseen data.
6. Deployment
Integrate into applications (chatbots, sentiment analysis tools, search engines).
Monitor and update with new data.
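The metrics named in the Evaluation step can be computed by hand for a binary task. This is a small sketch with made-up labels; the function name `classification_metrics` and its `positive` parameter are assumptions for illustration, not part of any library.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 with `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Invented labels for five emails, held out from training.
y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]

metrics = classification_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

Here 3 of 5 predictions are correct (accuracy 0.6), and with 2 true positives, 1 false positive, and 1 false negative, precision, recall, and F1 all come out to 2/3.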
1 TEXT GATHERING
01
Public Datasets (e.g., Kaggle)
What it is: Ready-made datasets shared by researchers, organizations, or communities. Kaggle is one of the most popular platforms.
Advantages:
Free and easily accessible.
Often cleaned, preprocessed, and labeled (e.g., sentiment datasets, spam detection datasets).
Saves time compared to raw data collection.
Limitations:
May not perfectly fit your specific problem.
Risk of overfitting if the dataset is too small or outdated.
Example: Using Kaggle’s "IMDB Movie Reviews" dataset for sentiment analysis.
02
APIs (Application Programming Interfaces)
What it is: Structured access to data provided by platforms (Twitter API, Reddit API, News API, etc.).
Advantages:
Real-time or regularly updated data.
Structured format (JSON/XML), making parsing easier.
Often includes metadata (timestamps, user info, etc.).
Limitations:
Rate limits (restricted number of requests).
Requires authentication (API keys).
Sometimes paid access for large-scale usage.
Example: Collecting tweets via Twitter API for hate speech detection.
03
Web Scraping
What it is: Extracting text directly from websites using tools like BeautifulSoup, Scrapy, or Selenium.
Advantages:
Flexible: you can target any website with textual content.
Useful when APIs are unavailable or limited.
Limitations:
Legal/ethical concerns (must respect [Link] and site policies).
HTML parsing can be messy (ads, navigation text, duplicates).
Requires more preprocessing to clean raw text.
Example: Scraping product reviews from e-commerce sites for sentiment analysis.
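To illustrate why scraped HTML needs extra cleaning, here is a small sketch that pulls review text out of a page while skipping navigation, scripts, and duplicates. It uses Python's standard-library `html.parser` instead of the BeautifulSoup, Scrapy, or Selenium tools named in the slide, so that it runs without extra installs. It parses an inline sample page (made up for this example) rather than fetching a live site, and the class name `TextExtractor` is hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping tags that usually hold non-content."""

    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Drop exact duplicates while keeping the original order.
    return list(dict.fromkeys(parser.chunks))


# Invented sample page: navigation, a duplicated review, and a script.
SAMPLE_HTML = """
<html><body>
<nav>Home | Deals | Cart</nav>
<div class="review">Great phone, battery lasts all day.</div>
<div class="review">Great phone, battery lasts all day.</div>
<div class="review">Screen cracked after a week.</div>
<script>trackPageView();</script>
</body></html>
"""

reviews = extract_text(SAMPLE_HTML)
print(reviews)
```

Only the two distinct reviews survive; the navigation text, the script body, and the repeated review are dropped.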
2 TEXT CLEANING
LOWERCASING
REMOVING PUNCTUATION
REMOVING NUMBERS
REMOVING URLS / LINKS
REMOVING HTML TAGS
REMOVING EMOJIS / SPECIAL CHARACTERS
REMOVING STOPWORDS
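The cleaning steps listed above can be chained in one function. This is a sketch under simple assumptions: English ASCII text, regex-based tag and URL removal, and a tiny hypothetical `STOPWORDS` set. URLs and HTML tags are removed before punctuation, because stripping punctuation first would break the patterns that detect them.

```python
import re
import string

STOPWORDS = {"the", "is", "and", "a", "this", "at", "it", "for"}  # illustrative


def clean_text(text):
    text = text.lower()                                 # lowercasing
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # URLs / links
    text = re.sub(r"<[^>]+>", " ", text)                # HTML tags
    text = re.sub(r"\d+", " ", text)                    # numbers
    text = text.translate(
        str.maketrans("", "", string.punctuation)       # punctuation
    )
    text = re.sub(r"[^a-z\s]", " ", text)               # emojis / special chars
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopwords
    return " ".join(tokens)


raw = "<p>This phone is GREAT!!! 😍 Buy it at https://example.com for $299.</p>"
print(clean_text(raw))  # phone great buy
```

The `[^a-z\s]` step also removes accented and non-Latin letters, so it would need changing for text in languages such as Hindi or Tamil.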
THANK YOU
Presented by Group - 03