Understanding Natural Language Processing

Uploaded by

deekshitha1325

Natural language processing (NLP) is a subfield of computer science and artificial intelligence (AI) that uses machine learning to enable computers to understand and communicate with human language.

With the growing amount of text data from social media, websites and other sources, NLP is becoming a key tool for gaining insights and automating tasks such as analyzing text or translating languages.
Applications

● Text translation
● Voice recognition
● Text summarization
● Chatbots
● Voice-operated GPS systems
● Digital assistants
● Speech-to-text software
● Customer service bots
Text Input and Data Collection

❖ Data Collection - Gathering text from websites, books, social media or proprietary databases.

❖ Data Storage - Storing the collected text in a structured format, such as a database or a collection of documents.

Text Preprocessing

Cleaning and preparing the raw text data for analysis:

❖ Tokenization - Splitting text into smaller units like words or sentences.

❖ Lowercasing - Converting all text to lowercase to ensure uniformity.
❖ Stopword Removal - Removing common words that do not contribute significant
meaning, such as "and," "the," "is."
❖ Punctuation Removal - Removing punctuation marks.
❖ Stemming and Lemmatization - Reducing words to their base or root forms.

Stemming: Stemming chops off affixes (usually suffixes) based on simple rules.

The result is not always a real word. Faster, but less accurate.

Input word: "running"

Stemmer output: "run"
Examples:
● "connection" → "connect"
● "flies" → "fli"
● "studies" → "studi"
● "undone" → "done" (only with a stemmer that also strips prefixes; suffix-only stemmers such as Porter's keep the "un")
Lemmatization: Lemmatization uses vocabulary and grammar (like parts of speech) to return the correct base or dictionary form (lemma) of a word. Slower, but more accurate.
● "better" → "good"
● "mice" → "mouse"
● "was" → "be"
❖ Text Normalization - Standardizing text, including case normalization, removing
punctuation and correcting spelling errors.
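The preprocessing steps above can be sketched as one small pipeline. This is only an illustration using the Python standard library: the stopword list is a hypothetical, deliberately tiny one, and `toy_stem` is a made-up three-rule suffix stripper, not the Porter algorithm or any library's stemmer.

```python
import re
import string

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"and", "the", "is", "a", "an", "of", "to", "in"}

def tokenize(text):
    """Split text into word tokens; each punctuation mark becomes its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

def toy_stem(word):
    """Toy suffix stripper with three rules (an assumption, not Porter)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "i"            # "studies" -> "studi"
    if word.endswith("ing") and len(word) > 5:
        stem = word[:-3]
        # Undo consonant doubling: "running" -> "runn" -> "run"
        if len(stem) > 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("ion") and len(word) > 5:
        return word[:-3]                  # "connection" -> "connect"
    return word

def preprocess(text):
    tokens = tokenize(text)                                      # tokenization
    tokens = [t.lower() for t in tokens]                         # lowercasing
    tokens = [t for t in tokens if t not in string.punctuation]  # punctuation removal
    tokens = [t for t in tokens if t not in STOPWORDS]           # stopword removal
    return [toy_stem(t) for t in tokens]                         # stemming

print(preprocess("The cat is running, and the studies show a connection."))
# ['cat', 'run', 'studi', 'show', 'connect']
```

Because the rules are so crude, the toy stemmer will mangle many ordinary words; it is meant only to show why stemmer output is not always a real word.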

Text Representation

❖ Bag of Words (BoW) - Representing each document by the frequency of the words it contains, ignoring word order.

❖ Term Frequency-Inverse Document Frequency (TF-IDF) - Measuring the importance of a word in a document relative to a collection of documents.

❖ Word Embeddings - Using dense vector representations of words where semantically similar words are closer together in the vector space (e.g., Word2Vec, GloVe).
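As a rough illustration of the first two representations, the sketch below counts words for a bag of words and then computes TF-IDF weights by hand. The three example documents are invented, and the weighting used here (relative term frequency times log(N / document frequency), with no smoothing) is just one common variant; libraries often use slightly different formulas.

```python
import math
from collections import Counter

# Invented example documents, already lowercased.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
tokenized = [d.split() for d in docs]

def bag_of_words(tokens):
    """Map each word to how often it occurs; word order is discarded."""
    return Counter(tokens)

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document (unsmoothed variant)."""
    n_docs = len(tokenized_docs)
    df = Counter()                       # document frequency of each term
    for tokens in tokenized_docs:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

bow = bag_of_words(tokenized[0])
weights = tf_idf(tokenized)
print(bow["the"])                                # 2
print(weights[0]["cat"] > weights[0]["the"])     # True
```

In the first document "the" occurs twice but also appears in another document, so its TF-IDF weight ends up lower than that of "cat", which occurs once but only in this document.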

Feature Extraction

● Extracting meaningful features from the text data that can be used for various NLP tasks.
❖ Syntactic Features - Using parts of speech tags, syntactic dependencies and parse
trees.
❖ Semantic Features - Leveraging word embeddings and other representations to
capture word meaning and context.
❖ N-grams - Capturing sequences of N words to preserve some context and word order
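
N-gram extraction is simple enough to show directly. The helper name `ngrams` and the sample sentence below are illustrative choices, not part of the original notes.

```python
def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "new york is a big city".split()
print(ngrams(tokens, 2))
# [('new', 'york'), ('york', 'is'), ('is', 'a'), ('a', 'big'), ('big', 'city')]
```

The bigram ('new', 'york') keeps together two words that a plain bag of words would treat as unrelated, which is the sense in which n-grams preserve some context and word order.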

Model Selection and Training

❖ Supervised Learning - Using labeled data to train models such as Support Vector Machines (SVMs), Random Forests, or deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs).
❖ Unsupervised Learning - Applying techniques like clustering or topic modeling (e.g., Latent Dirichlet Allocation) on unlabeled data.

❖ Pre-trained Models - Utilizing pre-trained language models such as BERT, GPT or other transformer-based models that have been trained on large corpora.
Model Deployment and Inference

❖ Text Classification - Categorizing text into predefined classes (e.g., spam detection,
sentiment analysis).
❖ Named Entity Recognition (NER) - Identifying and classifying entities in the text (e.g., people, organizations, locations).
❖ Machine Translation - Translating text from one language to another.
❖ Question Answering - Providing answers to questions based on the context
provided by text data.
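To make text classification concrete, here is a minimal sketch of a spam detector. It uses a multinomial Naive Bayes classifier with add-one smoothing, chosen here only because it fits in a few lines of standard-library Python; the notes do not prescribe a particular model, and the four training messages are invented.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns the counts the classifier needs."""
    label_counts = Counter()            # number of documents per label
    word_counts = defaultdict(Counter)  # word frequencies per label
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)   # log prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                      # ignore unseen words
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log((word_counts[label][token] + 1)
                              / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented toy training data.
training = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "free prize money"))      # spam
print(predict(model, "monday team meeting"))   # ham
```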
Evaluation and Optimization

Evaluation Metrics - Measuring model quality with accuracy, precision, recall, F1-score and others.

Hyperparameter Tuning - Adjusting settings chosen before training (e.g., learning rate, regularization strength) to improve performance.

Error Analysis - Analyzing errors to understand model weaknesses and improve robustness.
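
The evaluation metrics named in this section can be computed from counts of true positives, false positives and false negatives. The function name and the six example labels below are made up for illustration.

```python
def classification_metrics(y_true, y_pred, positive):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0   # of predicted positives, how many are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # of actual positives, how many were found
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)            # harmonic mean of the two
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Invented gold labels and predictions.
y_true = ["spam", "spam", "ham", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham", "spam", "spam", "ham"]
metrics = classification_metrics(y_true, y_pred, positive="spam")
print(metrics)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision, recall and F1 all come out to 2/3, and accuracy is 4/6.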
Origins of Natural Language Processing (NLP)
Interdisciplinary Roots

➢ Linguistics – Understanding the grammar, syntax and semantics of human languages.

➢ Computer Science – Algorithms, data structures, parsing, and programming.

➢ Artificial Intelligence (AI) – Enabling machines to understand and simulate human language use.

➢ Mathematics & Statistics – Especially probabilistic models and machine learning (used in modern NLP).
Historical Development
Birth of Machine Translation
➢ Early goal: translating Russian into English (Cold War era).
➢ Georgetown-IBM Experiment (1954) – Automatically translated about 60 Russian sentences into English.
➢ Only basic syntactic translation; failed in real-world scenarios.
AI and Symbolic NLP
➢ Development of rule-based systems using linguistic knowledge (syntax trees, grammars).
➢ SHRDLU (Terry Winograd) – an early system that could understand natural commands in a
blocks world.
Statistical Revolution
➢ Shift from symbolic to statistical NLP due to limitations of rules.
➢ Use of corpora (large text datasets) and probabilistic models like Hidden Markov Models
(HMMs)
➢ Focused on Part-of-Speech Tagging, Speech Recognition, etc.
Machine Learning Era
➢ Shift to data-driven approaches.
➢ Popularity of annotated corpora such as the Penn Treebank and lexical resources such as WordNet.
➢ Use of algorithms like Naive Bayes, Decision Trees, and later Support Vector Machines (SVMs).
Deep Learning and Transformers
➢ Breakthroughs with Neural Networks, especially Recurrent Neural Networks (RNNs),
LSTMs, and Transformers.

Introduction of models like:

● Word2Vec, GloVe (for word embeddings)

● BERT, GPT, T5 (transformer-based language models)

Multilingual and Indian Language Processing

➢ Initial NLP focused on English and European languages.

➢ Later extended to Asian languages (including Indian languages) with unique challenges:

● Complex morphology

● Lack of resources

● Free word order (e.g., Hindi, Tamil)

Emergence of Speech and Dialogue Systems
Early speech recognition systems date back to the 1950s–60s (e.g., IBM’s Shoebox in the early 1960s).

Grew into modern voice assistants: Siri, Alexa, Google Assistant.

Needed robust NLP for:

● Intent recognition

● Dialogue management
Corpus Linguistics & Real-World Texts

➢ Rise of text corpora like:

● Brown Corpus

● British National Corpus

● Penn Treebank

➢ These helped in:

● Training statistical models

● Part-of-speech tagging

● Syntax parsing
Key Role of Search Engines

➢ Search engines like Google used NLP for:

○ Understanding queries

○ Indexing and retrieving documents

○ Autocomplete and spelling correction

Challenges of NLP
Ambiguity: Linguistic ambiguity arises at the lexical, syntactic, and semantic levels: words or phrases can have multiple possible interpretations.

Representation of Knowledge: Capturing world knowledge, context, and semantics remains difficult, especially in
open-ended language understanding.

Indian / Low‑Resource Language Processing: The book highlights the complexity of modeling languages like
Hindi, Urdu, Telugu, and Tamil, where resources, annotated corpora, and tools are scarce

Integration of Multiple Levels: Combining morphology, syntax, semantics, pragmatics, and discourse in a coherent
pipeline is non‑trivial.

Scalability and Real‑World Deployment: Addressing efficiency, resource requirements, and adapting theoretical
models to practical systems is emphasized.
Ambiguity in Language
Natural language is inherently ambiguous at various levels:
Lexical Ambiguity
● A word has multiple meanings.

● Example: "bank" (river bank or financial institution)

Syntactic Ambiguity
● A sentence can have more than one structure.

Example: "I saw the man with the telescope."

(Who has the telescope?)
Semantic Ambiguity
● The meaning is unclear, even if grammar is correct.

● Example: “He ate the cake with a spoon.”

(Did he use the spoon to eat, or did the cake come with a spoon?)

Pragmatic Ambiguity
● Depends on context, speaker intention.

● Example: “Can you pass the salt?”

(Actually a request, not a question about ability.)
Variability of Language
● Many ways to express the same meaning.

● Example:
"I am going", "I’m going", "Gonna go", "I'll be there".
Multilinguality and Code-Mixing
● Human languages are diverse:

○ Grammar structures

○ Scripts

○ Word order (e.g., SOV in Hindi vs SVO in English)

● Code-mixing: Switching languages mid-sentence.

○ Example: “Mujhe ek coffee chahiye, strong one.” (Hindi-English: “I want a coffee, a strong one.”)

Speech and Pronunciation Variability
● Spoken language adds more complexity:

○ Accents

○ Disfluencies (uh, um, etc.)

○ Noise
Evolving Language
● Language constantly changes (e.g., new words, memes, emojis).

● NLP systems must adapt to trends.

Context Understanding and Long Dependencies

● Some sentences depend on previous dialogue or long-range context.

● Example: “He is very kind.”

(Who is ‘he’? Need earlier context.)
Morphological Complexity
Some languages have rich morphology (word formation).

Example: In Tamil or Finnish, a single word can encode what English expresses with a whole phrase or clause.

World Knowledge and Common Sense

● NLP systems lack real-world understanding.

● Example: "The trophy doesn’t fit in the suitcase because it is too small."
(What is small? Trophy or suitcase?)
Lack of Annotated Data
● Supervised learning needs labeled data (e.g., POS tags, named entities).

● For many low-resource languages, data is missing or insufficient

Parsing and Grammar Irregularities

Parsing natural sentences is hard because:
● Natural language is not always grammatically perfect.

● Slang, typos and informal styles are common.

Language and Grammar in NLP

What is Language?

● A system of communication consisting of symbols (words) and rules (grammar).

● Two types:

○ Natural language (e.g., Hindi, Tamil, English)

○ Formal language (used in computer science like C++, Java)

Grammar defines the rules of how words combine to form valid sentences.

Levels of grammar:

● Phonology – sounds of language

● Morphology – structure of words (prefixes, suffixes)

● Syntax – sentence structure

● Semantics – meaning of words/sentences

● Pragmatics – context-based interpretation

Grammar Rules in NLP

● Context-Free Grammar (CFG): Used to describe the syntax of natural languages.

● Dependency Grammar: Focuses on the relationship between words.
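
To show how a context-free grammar decides whether a sentence is valid, the sketch below runs the CYK recognition algorithm over a toy grammar in Chomsky normal form. The grammar (S → NP VP, NP → Det N, VP → V NP) and its four-word lexicon are invented for this example and cover only a handful of sentences.

```python
# Toy grammar in Chomsky normal form (an illustrative assumption):
#   S -> NP VP,  NP -> Det N,  VP -> V NP
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("Det", "N"): {"NP"},
    ("V", "NP"): {"VP"},
}
# Lexical rules: which categories each word can have.
LEXICON = {
    "the": {"Det"},
    "dog": {"N"},
    "cat": {"N"},
    "chased": {"V"},
}

def cyk_recognize(tokens, start="S"):
    """Return True if the grammar can derive the token sequence from `start`."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that can derive tokens[i..j].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(LEXICON.get(tok, ()))
    for span in range(2, n + 1):              # length of the substring
        for i in range(n - span + 1):         # start position
            j = i + span - 1                  # end position
            for k in range(i, j):             # split point
                for left in table[i][k]:
                    for right in table[k + 1][j]:
                        table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]

print(cyk_recognize("the dog chased the cat".split()))   # True
print(cyk_recognize("dog the chased cat the".split()))   # False
```

The second sentence uses the same words but in an order no rule sequence can produce, so the recognizer rejects it.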

Processing Indian Languages
Challenges in Indian Languages:
● Diversity: India has 22 scheduled languages (listed in the Eighth Schedule of the Constitution) and hundreds of dialects.

● Morphological richness: Many Indian languages (like Tamil, Hindi) have richer morphology than English.

● Word order: Relatively free word order, with SOV as the default (English is SVO).

● Lack of standardized datasets.

● Multiple scripts (e.g., Devanagari, Tamil script, Bengali script).

Approaches:
● Rule-based systems

● Machine learning and deep learning techniques

● Cross-lingual tools and transfer learning

Indian Use-Cases:
● Translating government documents into regional languages

● Voice assistants supporting Hindi, Tamil, etc.

● Sentiment analysis in regional languages for political and social media analysis
Applications

❖ Machine Translation - Google Translate, Microsoft Translator

❖ Speech Recognition - Siri, Alexa, Google Assistant

❖ Text Classification - Email spam detection, sentiment analysis

❖ Information Extraction - Extracting names, places from documents

❖ Chatbots - Customer support bots

❖ Spell Checking - Auto-correction tools

Information Retrieval (IR)
Information retrieval is the process of finding relevant information (documents, web pages) in a large collection based on a user’s query.

Components of IR System:
1. Document Collection – A set of documents (like Google’s indexed web pages)

2. Indexing – Creating a searchable index using keywords

3. Query Processing – Understanding what the user is searching for

4. Ranking – Displaying the most relevant results first

Techniques Used
● TF-IDF (Term Frequency-Inverse Document Frequency)

● Boolean Search

● Vector Space Models
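
The indexing, query processing and ranking components can be sketched in a few lines. The example below builds an inverted index, answers a Boolean AND query, and ranks documents by summed TF-IDF scores; the three documents and the function names are invented, and real search engines use far more elaborate scoring.

```python
import math
from collections import Counter, defaultdict

# Invented document collection: id -> text.
documents = {
    1: "natural language processing with python",
    2: "information retrieval and web search",
    3: "language models for information retrieval",
}

def build_index(docs):
    """Inverted index: term -> set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Boolean search: documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def rank(docs, index, query):
    """Rank documents by the sum of TF-IDF weights of the query terms."""
    n_docs = len(docs)
    scores = Counter()
    for term in query.lower().split():
        postings = index.get(term, set())
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id in postings:
            tokens = docs[doc_id].lower().split()
            scores[doc_id] += (tokens.count(term) / len(tokens)) * idf
    return [doc_id for doc_id, _ in scores.most_common()]

index = build_index(documents)
print(boolean_and(index, "information retrieval"))   # {2, 3}
print(rank(documents, index, "language retrieval")[0])   # 3
```

Document 3 is ranked first for "language retrieval" because it is the only one that matches both query terms.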

Examples:
● Google Search

● Digital Library Search (e.g., IEEE, ScienceDirect)

Language modeling

● A language model is the core component of modern Natural Language Processing (NLP). It's a statistical model designed to analyze the patterns of human language and predict the likelihood of a sequence of words or tokens.

● Language modeling, or LM, is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence.

● Language models analyze bodies of text data to provide a basis for their word predictions.
Language modeling is used in artificial intelligence (AI), natural language processing (NLP), natural language understanding and natural language generation systems, particularly ones that perform text generation, machine translation and question answering.

Large language models (LLMs) also use language modeling. These are advanced language models, such as OpenAI's GPT-3 and Google's PaLM 2, that have billions of parameters learned from training data and generate text output.
How language modeling works
● Determines word probability by analyzing text data.
● Interprets this data by feeding it through an algorithm that establishes rules for context in natural language.
● Applies these rules in language tasks to accurately predict or produce new sentences.
● In essence, the model learns the features and characteristics of basic language and uses those features to understand new phrases.
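
One of the simplest ways to realize these steps is a bigram model, which estimates the probability of each word given the word before it. The sketch below is a minimal version with add-one smoothing; the three-sentence corpus is invented, and `<s>` / `</s>` are conventional sentence-boundary markers assumed for this example.

```python
from collections import Counter

# Invented training corpus.
corpus = [
    "i like natural language processing",
    "i like machine translation",
    "i enjoy language modeling",
]

def train_bigram(sentences):
    """Count contexts and bigrams; <s> and </s> mark sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        contexts.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {t for s in sentences for t in s.split()} | {"</s>"}
    return contexts, bigrams, vocab

def bigram_prob(model, prev, word):
    """P(word | prev) with add-one (Laplace) smoothing."""
    contexts, bigrams, vocab = model
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))

def sentence_prob(model, sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        p *= bigram_prob(model, prev, word)
    return p

def predict_next(model, prev):
    """Most likely next word after `prev`."""
    _, _, vocab = model
    return max(sorted(vocab), key=lambda w: bigram_prob(model, prev, w))

model = train_bigram(corpus)
print(bigram_prob(model, "i", "like"))   # 3/13
print(predict_next(model, "i"))          # like
print(sentence_prob(model, "i like language modeling")
      > sentence_prob(model, "modeling language like i"))   # True
```

The model assigns a higher probability to the natural word order than to the scrambled one, which is the basic signal that applications such as speech recognition and autocomplete rely on.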
There are several different probabilistic approaches to modeling language. They vary
depending on the purpose of the language model.

For example, a language model designed to generate sentences for an automated social
media bot might use different math and analyze text data in different ways than a language
model designed for determining the likelihood of a search query.

Language modeling types: Some common statistical language modeling types are the following:

● N-gram
● Unigram
● Bidirectional
● Exponential
● Neural language models
● Continuous space
Importance of language modeling
● Each language model type, in one way or another, turns qualitative information into
quantitative information. This allows people to communicate with machines as they
do with each other, to a limited extent.

● Language modeling is used in a variety of industries including information technology, finance, healthcare, transportation, legal, military and government.

● In addition, it's likely that most people have interacted with a language model in
some way at some point in the day, whether through Google search, an autocomplete
text function or engaging with a voice assistant.
Uses and examples of language modeling

● Speech recognition
● Text generation
● Chatbots
● Machine translation
● Parts-of-speech tagging
● Parsing
● Optical character recognition
● Information retrieval
● Observed data analysis
● Sentiment analysis
