Natural language processing (NLP) is a subfield of computer science and artificial
intelligence (AI) that uses machine learning to enable computers to understand and
communicate with human language.
With the growing amount of text data from social media, websites and other sources,
NLP is becoming a key tool for gaining insights and automating tasks such as analyzing text or
translating languages.
Applications
● Text translation
● Voice recognition
● Text summarization
● Chatbots
● Voice-operated GPS systems
● Digital assistants
● Speech-to-text software
● Customer service bots
Text Input and Data Collection
❖ Data Collection - Gathering text from websites, books, social media or proprietary databases.
❖ Data Storage - Storing the collected text in a structured format, such as a database or a collection of documents.
Text Preprocessing
Clean and prepare the raw text data for analysis.
❖ Tokenization - Splitting text into smaller units like words or sentences.
❖ Lowercasing - Converting all text to lowercase to ensure uniformity.
❖ Stopword Removal - Removing common words that do not contribute significant
meaning, such as "and," "the," "is."
❖ Punctuation Removal - Removing punctuation marks.
❖ Stemming and Lemmatization - Reducing words to their base or root forms.
Stemming: Stemming chops off affixes (usually suffixes) based on simple rules.
It doesn't always give a real word. FASTER - LESS ACCURATE
Input Word : "running"
Stemmer Output : "run"
Examples:
● "connection" → "connect"
● "flies" → "fli"
● "studies" → "studi"
● "undone" → "done" (only with a prefix-stripping stemmer; suffix-only stemmers such as Porter keep the "un")
Lemmatization: Lemmatization uses vocabulary and grammar (like parts of speech) to return the correct
base or dictionary form (lemma) of a word. SLOWER - MORE ACCURATE
● "better" → "good"
● "mice" → "mouse"
● "was" → "be"
❖ Text Normalization - Standardizing text, including case normalization, removing
punctuation and correcting spelling errors.
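The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not a production pipeline: the stopword list, the suffix rules in `toy_stem` and the `LEMMAS` lookup table are all hypothetical and chosen only to reproduce the examples in these notes. Real systems typically rely on a library stemmer and a dictionary-backed lemmatizer.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in"}

def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()                      # lowercasing
    text = re.sub(r"[^\w\s]", " ", text)     # punctuation removal
    tokens = text.split()                    # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal

def toy_stem(word):
    """Crude suffix stripping; the rules only mirror the examples in the notes."""
    if word.endswith("ies"):
        return word[:-2]                     # "flies" -> "fli"
    for suffix in ("ing", "ion"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[: -len(suffix)]
            # Undo consonant doubling, e.g. "runn" -> "run".
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return word

# Lemmatization needs a vocabulary; a lookup table stands in for one here.
LEMMAS = {"better": "good", "mice": "mouse", "was": "be"}

def toy_lemmatize(word):
    """Return the dictionary form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)

tokens = preprocess("The mice were running, and the connection is better!")
print(tokens)
print([toy_stem(t) for t in tokens])
print([toy_lemmatize(t) for t in tokens])
```

Note how the stemmer leaves "mice" and "better" untouched, while the lookup-based lemmatizer maps them to "mouse" and "good". That difference is the speed versus accuracy trade-off described above.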
Text Representation
❖ Bag of Words (BoW) - Representing a document by its word frequencies, ignoring word order.
❖ Term Frequency-Inverse Document Frequency (TF-IDF) - Weighting a word by its importance in a
document relative to a collection of documents.
❖ Word Embeddings - Using dense vector representations of words where semantically
similar words are closer together in the vector space (e.g., Word2Vec, GloVe).
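As a rough sketch of the first two representations, the snippet below builds a bag of words and TF-IDF weights by hand. It assumes the variant tf = count / document length and idf = log(N / df); libraries usually add smoothing, so their numbers will differ. The three documents are made up for illustration.

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def bag_of_words(tokens):
    """Map each word to its count in one document."""
    return Counter(tokens)

def tf_idf(tokenized_docs):
    """Return one {word: weight} dict per document.

    tf = count / doc_length, idf = log(N / df), where df is the number
    of documents containing the word.
    """
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))               # count each word once per document
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        weights.append({
            w: (c / len(tokens)) * math.log(n_docs / df[w])
            for w, c in counts.items()
        })
    return weights

bow = bag_of_words(tokenized[0])
scores = tf_idf(tokenized)
print(bow)
print(scores[0])
```

In the first document "the" is the most frequent word, yet "cat" gets the higher TF-IDF weight because it appears in only one of the three documents.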
Feature Extraction
● Extracting meaningful features from the text data that can be used for various NLP tasks.
❖ Syntactic Features - Using parts of speech tags, syntactic dependencies and parse
trees.
❖ Semantic Features - Leveraging word embeddings and other representations to
capture word meaning and context.
❖ N-grams - Capturing sequences of N words to preserve some context and word order
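N-gram extraction is short enough to show directly. The helper name `ngrams` below is just a label for this sketch.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in their original order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is a big city".split()
print(ngrams(tokens, 2))   # bigrams keep pairs such as ("new", "york") together
print(ngrams(tokens, 3))   # trigrams
```

A bag of words would treat "new" and "york" as unrelated tokens; the bigram keeps them adjacent, which is the "some context and word order" mentioned above.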
Model Selection and Training
❖ Supervised Learning - Using labeled data to train models like Support Vector Machines (SVMs), Random
Forests or deep learning models like convolutional neural networks (CNNs) and recurrent neural networks (RNNs).
❖ Unsupervised Learning - Applying techniques like clustering or topic modeling
(e.g., Latent Dirichlet Allocation) on unlabeled data.
❖ Pre-trained Models - Utilizing pre-trained language models such as BERT, GPT
or other transformer-based models that have been trained on large corpora.
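To make supervised learning concrete, here is a small multinomial Naive Bayes text classifier written from scratch. Naive Bayes is chosen only because it fits in a few lines; it is not one of the models listed above, and the four training sentences and their labels are invented for the example.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)      # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            score = math.log(n_docs / total_docs)    # log prior
            total_words = sum(self.word_counts[label].values())
            for t in tokens:
                count = self.word_counts[label][t] + 1   # add-one smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

train_texts = ["win money now", "free prize win", "meeting at noon", "lunch at noon today"]
train_labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("win a free prize"))
print(model.predict("lunch meeting today"))
```

The labeled pairs are what make this supervised: the model only learns which words go with which class because each training text carries its label.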
Model Deployment and Inference
❖ Text Classification - Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
❖ Named Entity Recognition (NER) - Identifying and classifying entities in the text
❖ Machine Translation - Translating text from one language to another.
❖ Question Answering - Providing answers to questions based on the context
provided by text data.
Evaluation and Optimization
❖ Evaluation Metrics - Accuracy, precision, recall, F1-score and others.
❖ Hyperparameter Tuning - Adjusting model hyperparameters (e.g., learning rate) to improve performance.
❖ Error Analysis - Analyzing errors to understand model weaknesses and improve robustness.
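The evaluation metrics named above can be computed by hand for a binary task. The sketch below assumes one class is treated as "positive"; the label lists are made-up predictions, not results from any real model.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class named by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = ["spam", "spam", "ham", "ham", "spam"]
y_pred = ["spam", "ham", "ham", "spam", "spam"]
print(accuracy(y_true, y_pred))
print(precision_recall_f1(y_true, y_pred, positive="spam"))
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 while accuracy is 3/5.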
Origins of Natural Language Processing (NLP)
Interdisciplinary Roots
➢ Linguistics – Understanding grammar, syntax, semantics of human languages.
➢ Computer Science – Algorithms, data structures, parsing, and programming.
➢ Artificial Intelligence (AI) – Enabling machines to understand and simulate human
language understanding.
➢ Mathematics & Statistics – Especially in probabilistic models and machine learning
(used in modern NLP).
Historical Development
Birth of Machine Translation
➢ Early goal: translating Russian to English (Cold War era).
➢ Georgetown-IBM Experiment (1954) – Automatically translated about 60 Russian sentences into English.
➢ Only basic syntactic translation; failed in real-world scenarios.
AI and Symbolic NLP
➢ Development of rule-based systems using linguistic knowledge (syntax trees, grammars).
➢ SHRDLU (Terry Winograd) – an early system that could understand natural commands in a
blocks world.
Statistical Revolution
➢ Shift from symbolic to statistical NLP due to limitations of rules.
➢ Use of corpora (large text datasets) and probabilistic models like Hidden Markov Models
(HMMs)
➢ Focused on Part-of-Speech Tagging, Speech Recognition, etc.
Machine Learning Era
➢ Move to data-driven approaches.
➢ Popularity of annotated corpora like the Penn Treebank and lexical resources like WordNet.
➢ Algorithms like Naive Bayes, Decision Trees, and later Support Vector Machines
(SVMs).
Deep Learning and Transformers
➢ Breakthroughs with Neural Networks, especially Recurrent Neural Networks (RNNs),
LSTMs, and Transformers.
➢ Introduction of models like:
● Word2Vec, GloVe (for word embeddings)
● BERT, GPT, T5 (transformer-based language models)
Multilingual and Indian Language Processing
➢ Initial NLP focused on English and European languages.
➢ Later extended to Asian languages (including Indian languages) with
unique challenges:
● Complex morphology
● Lack of resources
● Free word order (e.g., Hindi, Tamil)
Emergence of Speech and Dialogue Systems
➢ Speech recognition started in the 1960s–70s (e.g., IBM’s Shoebox).
➢ Grew into modern voice assistants: Siri, Alexa, Google Assistant.
➢ Needed robust NLP for:
● Intent recognition
● Dialogue management
Corpus Linguistics & Real-World Texts
➢ Rise of text corpora like:
● Brown Corpus
● British National Corpus
● Penn Treebank
➢ These helped in:
● Training statistical models
● Part-of-speech tagging
● Syntax parsing
Key Role of Search Engines
➢ Search engines like Google used NLP for:
○ Understanding queries
○ Indexing and retrieving documents
○ Autocomplete and spelling correction
Challenges of NLP
Ambiguity: Linguistic ambiguity arises at lexical, syntactic, and semantic levels—words or phrases with multiple
possible interpretations.
Representation of Knowledge: Capturing world knowledge, context, and semantics remains difficult, especially in
open-ended language understanding.
Indian / Low‑Resource Language Processing: The book highlights the complexity of modeling languages like
Hindi, Urdu, Telugu, and Tamil, where resources, annotated corpora, and tools are scarce.
Integration of Multiple Levels: Combining morphology, syntax, semantics, pragmatics, and discourse in a coherent
pipeline is non‑trivial.
Scalability and Real‑World Deployment: Addressing efficiency, resource requirements, and adapting theoretical
models to practical systems is emphasized.
Ambiguity in Language
Natural language is inherently ambiguous at various levels:
Lexical Ambiguity
● A word has multiple meanings.
● Example: "bank" (river bank or financial institution)
Syntactic Ambiguity
● A sentence can have more than one structure.
● Example: "I saw the man with the telescope."
(Who has the telescope: the speaker or the man?)
Semantic Ambiguity
● The meaning is unclear, even if grammar is correct.
● Example: “He ate the cake with a spoon.”
(Did he use a spoon to eat it, or did the cake come with a spoon?)
Pragmatic Ambiguity
● Depends on context, speaker intention.
● Example: “Can you pass the salt?”
(Actually a request, not a question about ability.)
Variability of Language
● Many ways to express the same meaning.
● Example:
"I am going", "I’m going", "Gonna go", "I'll be there".
Multilinguality and Code-Mixing
● Human languages are diverse:
○ Grammar structures
○ Scripts
○ Word order (e.g., SOV in Hindi vs SVO in English)
● Code-mixing: Switching languages mid-sentence.
○ Example: “Mujhe ek coffee chahiye, strong one.”
Speech and Pronunciation Variability
● Spoken language adds more complexity:
○ Accents
○ Disfluencies (uh, um, etc.)
○ Noise
Evolving Language
● Language constantly changes (e.g., new words, memes, emojis).
● NLP systems must adapt to trends.
Context Understanding and Long Dependencies
● Some sentences depend on previous dialogue or long-range context.
● Example: “He is very kind.”
(Who is ‘he’? Need earlier context.)
Morphological Complexity
● Some languages have rich morphology (word formation).
● Example: In Tamil or Finnish, a single word can encode what English expresses as a whole phrase or clause.
World Knowledge and Common Sense
● NLP systems lack real-world understanding.
● Example: "The trophy doesn’t fit in the suitcase because it is too small."
(What is small? Trophy or suitcase?)
Lack of Annotated Data
● Supervised learning needs labeled data (e.g., POS tags, named entities).
● For many low-resource languages, data is missing or insufficient
Parsing and Grammar Irregularities
Parsing natural sentences is hard because:
● Natural language is not always grammatically perfect.
● Slang, typos and informal styles are common.
Language and Grammar in NLP
What is Language?
● A system of communication consisting of symbols (words) and rules (grammar).
● Two types:
○ Natural language (e.g., Hindi, Tamil, English)
○ Formal language (used in computer science, e.g., programming languages like C++ and Java)
Grammar defines the rules of how words combine to form valid sentences.
Levels of grammar:
● Phonology – sounds of language
● Morphology – structure of words (prefixes, suffixes)
● Syntax – sentence structure
● Semantics – meaning of words/sentences
● Pragmatics – context-based interpretation
Grammar Rules in NLP
● Context-Free Grammar (CFG): Used to describe the syntax of natural languages.
● Dependency Grammar: Focuses on the relationship between words.
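A context-free grammar also makes syntactic ambiguity measurable. The sketch below counts parse trees with a CYK-style chart for the telescope sentence used earlier in these notes. The grammar is a toy one written for this example (in Chomsky normal form, with "I" treated directly as a noun phrase), not a grammar of English.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form; every rule is an assumption for this sketch.
BINARY_RULES = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),    # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),    # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "i": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}

def count_parses(tokens, start="S"):
    """CYK chart storing, per span and symbol, the number of parse trees."""
    n = len(tokens)
    chart = defaultdict(int)   # (i, j, symbol) -> trees covering tokens[i:j]
    for i, tok in enumerate(tokens):
        for sym in LEXICON.get(tok, []):
            chart[(i, i + 1, sym)] += 1
    for width in range(2, n + 1):              # span length
        for i in range(n - width + 1):         # span start
            j = i + width
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY_RULES:
                    chart[(i, j, parent)] += (
                        chart[(i, k, left)] * chart[(k, j, right)]
                    )
    return chart[(0, n, start)]

print(count_parses("i saw the man with the telescope".split()))
print(count_parses("i saw the man".split()))
```

The first sentence has two parses under this grammar, one for each attachment of "with the telescope", while the shorter sentence has exactly one.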
Processing Indian Languages
Challenges in Indian Languages:
● Diversity: India has 22 scheduled languages (listed in the Eighth Schedule of the Constitution) and hundreds of dialects.
● Morphological richness: Indian languages (like Tamil, Hindi) are morphologically richer than English.
● Word order: Relatively free word order, with SOV as the default (English is SVO).
● Lack of standardized datasets.
● Multiple scripts (e.g., Devanagari, Tamil script, Bengali script).
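One small, practical consequence of multiple scripts is that a system often has to detect which script a piece of text uses before choosing a tokenizer or model. The sketch below uses Python's standard `unicodedata` module and takes the first word of each character's Unicode name as a stand-in for its script. That is a heuristic assumed for this example: it works for names like "DEVANAGARI LETTER KA" but would mislabel scripts whose names span two words.

```python
import unicodedata
from collections import Counter

def script_counts(text):
    """Count characters per script, using the first word of the Unicode name."""
    counts = Counter()
    for ch in text:
        # Keep letters (L*) and combining marks (M*), e.g. vowel signs.
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts

def dominant_script(text):
    """Return the most frequent script label, or None if no letters are found."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None

for sample in ["नमस्ते", "தமிழ்", "বাংলা", "hello"]:
    print(sample, dominant_script(sample))
```

Romanized text such as "Mujhe ek coffee chahiye" is reported as LATIN, which is one reason code-mixed input needs more than script detection.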
Approaches:
● Rule-based systems
● Machine learning and deep learning techniques
● Cross-lingual tools and transfer learning
Indian Use-Cases:
● Translating government documents into regional languages
● Voice assistants supporting Hindi, Tamil, etc.
● Sentiment analysis in regional languages for political and social media analysis
Applications
❖ Machine Translation - Google Translate, Microsoft Translator
❖ Speech Recognition - Siri, Alexa, Google Assistant
❖ Text Classification - Email spam detection, sentiment analysis
❖ Information Extraction - Extracting names, places from documents
❖ Chatbots - Customer support bots
❖ Spell Checking - Auto-correction tools
Information Retrieval (IR)
Process of finding relevant information (documents, web pages) from a large collection
based on a user’s query.
Components of IR System:
1. Document Collection – A set of documents (like Google’s indexed web pages)
2. Indexing – Creating a searchable index using keywords
3. Query Processing – Understanding what the user is searching for
4. Ranking – Displaying the most relevant results first
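The indexing and query-processing components can be illustrated with an inverted index and a Boolean AND query. This is a minimal sketch: the three documents and their IDs are invented, and real systems add normalization, compression and ranking on top.

```python
from collections import defaultdict

documents = {
    1: "natural language processing with python",
    2: "information retrieval and search engines",
    3: "language models for information retrieval",
}

def build_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the IDs of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(documents)
print(boolean_and(index, "information retrieval"))
print(boolean_and(index, "language retrieval"))
```

Looking a term up in the index avoids scanning every document at query time, which is why indexing is done once, ahead of the search.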
Techniques Used
● TF-IDF (Term Frequency-Inverse Document Frequency)
● Boolean Search
● Vector Space Models
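In a vector space model, the query and each document become vectors and are compared by cosine similarity. The sketch below uses raw term counts as vector components to stay short; a real system would normally use TF-IDF weights instead. The documents are made up for the example.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return (score, document) pairs sorted from most to least similar."""
    q_vec = Counter(query.lower().split())
    scored = [(cosine(q_vec, Counter(d.lower().split())), d) for d in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

docs = [
    "deep learning for speech recognition",
    "search engines rank web pages",
    "how search engines index and rank documents",
]
for score, doc in rank("search engines", docs):
    print(round(score, 3), doc)
```

Both search-related documents contain the two query terms, but the shorter one scores higher because the query terms make up a larger share of its vector.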
Examples:
● Google Search
● Digital Library Search (e.g., IEEE, ScienceDirect)
Language modeling
● A language model is the core component of modern Natural Language Processing
(NLP). It's a statistical model that is designed to analyze the pattern of human
language and predict the likelihood of a sequence of words or tokens.
● Language modeling, or LM, is the use of various statistical and probabilistic
techniques to determine the probability of a given sequence of words occurring in a
sentence.
● Language models analyze bodies of text data to provide a basis for their word
predictions.
Language modeling is used in artificial intelligence (AI), natural language processing
(NLP), natural language understanding and natural language generation systems,
particularly ones that perform text generation, machine translation and question answering.
Large language models (LLMs) also use language modeling. These are advanced language
models, such as OpenAI's GPT-3 and Google's PaLM 2, that have billions of parameters
learned from training data and generate text output.
How language modeling works
● Determines word probability by analyzing text data.
● Interprets this data by feeding it through an algorithm that establishes rules for context
in natural language.
● Applies these rules in language tasks to accurately predict or produce new sentences.
● The model essentially learns the features and characteristics of basic language and
uses those features to understand new phrases.
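The steps above can be sketched with a bigram model, one of the simplest language models: it counts word pairs in a corpus and estimates P(word | previous word) as count(previous, word) / count(previous). The three-sentence corpus and the `<s>` / `</s>` boundary markers are assumptions for this example, and no smoothing is applied, so unseen pairs get probability zero.

```python
from collections import Counter, defaultdict

corpus = [
    "i like natural language processing",
    "i like machine translation",
    "i enjoy language modeling",
]

def train_bigram(sentences):
    """Count bigrams, with <s> and </s> marking sentence boundaries."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return counts

def prob(counts, prev, curr):
    """Maximum-likelihood estimate of P(curr | prev)."""
    total = sum(counts[prev].values())
    return counts[prev][curr] / total if total else 0.0

def sentence_prob(counts, sentence):
    """Probability of a sentence as the product of its bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, curr in zip(tokens, tokens[1:]):
        p *= prob(counts, prev, curr)
    return p

model = train_bigram(corpus)
print(prob(model, "i", "like"))
print(sentence_prob(model, "i like machine translation"))
print(sentence_prob(model, "language i like"))
```

In this corpus "like" follows "i" in two of three sentences, so P(like | i) is 2/3, and a word order never seen in training gets probability zero. Smoothing and neural models exist largely to handle that second case.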
There are several different probabilistic approaches to modeling language. They vary
depending on the purpose of the language model.
For example, a language model designed to generate sentences for an automated social
media bot might use different math and analyze text data in different ways than a language
model designed for determining the likelihood of a search query.
Language modeling types: Some common statistical language modeling types are the
following:
● N-gram
● Unigram
● Bidirectional
● Exponential
● Neural language models
● Continuous space
Importance of language modeling
● Each language model type, in one way or another, turns qualitative information into
quantitative information. This allows people to communicate with machines as they
do with each other, to a limited extent.
● Language modeling is used in a variety of industries including information
technology, finance, healthcare, transportation, legal, military and government.
● In addition, it's likely that most people have interacted with a language model in
some way at some point in the day, whether through Google search, an autocomplete
text function or engaging with a voice assistant.
Uses and examples of language modeling
● Speech recognition
● Text generation
● Chatbots
● Machine translation
● Parts-of-speech tagging
● Parsing
● Optical character recognition
● Information retrieval
● Observed data analysis
● Sentiment analysis