Tagged Corpora and NLP Tagging Methods


UNIT 4: Categorizing and Tagging words

1. How to represent tagged corpora, tokens and also explain how these tagged corpora can be read?

1. How to Represent Tagged Corpora in NLP


A tagged corpus is a collection of text where each word/token is assigned a
tag, usually a Part-of-Speech (POS) tag or some linguistic label.
Representation Formats
1. Pair / Tuple Format
o Each token is stored as a (word, tag) pair.
o Example:
o [('I', 'PRP'), ('love', 'VBP'), ('NLP', 'NNP')]
2. String Format
o Tagged words are written with a separator such as /.
o Example:
o I/PRP love/VBP NLP/NNP
3. Tree or Sentence Structure
o A corpus is represented as a list of sentences, each of which is a list of tagged tokens.
o Example (NLTK representation of one sentence):
o [('The', 'DT'), ('cat', 'NN'), ('runs', 'VBZ')]

2. What Are Tokens?


Tokens are the smallest units of text extracted during tokenization.
Examples of tokens:
• Words: “The”, “cat”, “runs”
• Punctuation: “.”
• Sometimes numbers, hashtags, emojis, etc.
In code:
["The", "cat", "runs", "."]
Tokens are used as input for tagging tasks, parsing, language models, etc.
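The token list above can be produced with a very small tokenizer. The following is a minimal sketch using only Python's standard re module; the function name simple_tokenize is an illustrative assumption and is not an NLTK function. It treats each run of word characters as one token and each punctuation mark as its own token.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of letters/digits/underscores; [^\w\s] matches one
    # character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tokenize("The cat runs.")
print(tokens)  # ['The', 'cat', 'runs', '.']
```

Real tokenizers handle many more cases (contractions, URLs, emojis); this sketch only illustrates the idea of splitting text into units.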

3. How Tagged Corpora Are Read (in Python / NLTK)


NLTK (Natural Language Toolkit) provides built-in functions to read tagged
corpora.
Reading Tagged Sentences
from nltk.corpus import brown

tagged_sents = brown.tagged_sents()
print(tagged_sents[0])
Output example:
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
Reading Tagged Words
tagged_words = brown.tagged_words()
print(tagged_words[:5])
Reading Corpora with a Specific Tagset
brown.tagged_words(tagset='universal')

4. Custom Tagged Corpus Reading


If the tagged file is stored in CoNLL-style column format (one token per line,
with the word and its tag in separate columns):
from nltk.corpus.reader import ConllCorpusReader

reader = ConllCorpusReader('.', '[Link]', ('words', 'pos'))
tagged_data = reader.tagged_sents()
Or, for a string in word/tag format, using simple parsing:
sentence = "I/PRP love/VBP NLP/NNP"
tagged = [tuple(word.split('/')) for word in sentence.split()]

2. Explain the following: i) Universal POS tag set ii) Nouns iii) Verbs
(i) Universal POS Tag Set
The Universal Part-of-Speech (POS) Tag Set is a simplified and standardized
set of POS tags used in Natural Language Processing (NLP). It was introduced
to provide a common tagging system that works across different languages
and corpora.
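The Universal Dependencies (UD) version contains 17 basic POS categories (the
earlier 12-tag version is what NLTK exposes as tagset='universal'), making it
easy to apply POS tagging in multilingual NLP tasks.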
Features of the Universal POS Tag Set
• It is language-independent.
• It provides a uniform set of tags for all languages.
• It simplifies the output of POS taggers by reducing complex tagsets.
• It is widely used in corpora such as Universal Dependencies (UD).
Examples of Universal POS Tags
• NOUN – common nouns
• VERB – action or state words
• ADJ – adjectives
• ADV – adverbs
• PRON – pronouns
• DET – determiners
• ADP – prepositions/postpositions
• SCONJ – subordinating conjunctions
• CCONJ – coordinating conjunctions
• NUM – numbers
• INTJ – interjections
The Universal POS tagset helps ensure consistency in linguistic analysis
across different languages.
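One way to see how a universal tagset simplifies a fine-grained one is a lookup table from detailed tags to coarse tags. The sketch below is a hedged illustration: the table PENN_TO_UNIVERSAL is a small, hand-picked subset assumed for this example (it is not a complete or official mapping), and to_universal is a hypothetical helper name.

```python
# Illustrative subset of a Penn Treebank -> universal tag mapping (assumed
# for this example; not a complete or official table).
PENN_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN",
    "VB": "VERB", "VBP": "VERB", "VBZ": "VERB", "VBD": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "UH": "INTJ",
}

def to_universal(tagged_tokens):
    """Replace each fine-grained tag with its coarse tag; unknown tags become 'X'."""
    return [(word, PENN_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_tokens]

print(to_universal([("The", "DT"), ("cat", "NN"), ("runs", "VBZ")]))
# [('The', 'DET'), ('cat', 'NOUN'), ('runs', 'VERB')]
```

Several detailed tags (VB, VBP, VBZ, VBD) collapse into the single tag VERB, which is exactly the simplification described above.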

(ii) Nouns
A noun is a part of speech that refers to a person, place, thing, idea, or
concept. Nouns are one of the most important grammatical categories in
language.
Types of Nouns
1. Proper Nouns
o Names of specific people, places, or organizations.
o Example: India, Ravi, Google
2. Common Nouns
o General names of things.
o Example: city, girl, book
3. Abstract Nouns
o Ideas, emotions, qualities.
o Example: love, freedom, honesty
4. Concrete Nouns
o Things that can be seen or touched.
o Example: table, car, dog
5. Countable Nouns
o Can be counted.
o Example: apples, books
6. Uncountable Nouns
o Cannot be counted.
o Example: water, sugar
Functions of Nouns
• Act as subject of a sentence
• Act as object
• Used in naming and identification
• Can form noun phrases
Example sentence:
The girl is reading a book.
(girl and book are nouns)

(iii) Verbs
A verb is a word that expresses an action, event, or state of being. Verbs are
the central part of a sentence because they describe what the subject is doing.
Types of Verbs
1. Action Verbs
o Show physical or mental action.
o Example: run, write, think
2. Linking Verbs
o Connect the subject to a state or condition.
o Example: am, is, are, seem
3. Auxiliary (Helping) Verbs
o Help form tenses, questions, negatives.
o Example: has, have, will, can
4. Transitive Verbs
o Require an object.
o Example: She wrote a letter.
5. Intransitive Verbs
o Do not require an object.
o Example: He sleeps.
6. Regular Verbs
o Form past tense with -ed.
o Example: walk → walked
7. Irregular Verbs
o Do not follow a regular rule.
o Example: go → went
Functions of Verbs
• Indicate action (e.g., The boy runs.)
• Express states (e.g., She is happy.)
• Show time/tense (past, present, future)
• Form questions, negatives, and passive voice

3. Explain how to map words to properties and explain different methods for same (indexing, list, dictionary).
Mapping Words to Properties and Its Methods (Indexing, List, Dictionary)
In Natural Language Processing (NLP), mapping words to properties refers to
the process of associating each word with certain information or attributes.
These properties can include part-of-speech tags, frequency, sentiment
scores, meanings, or any linguistic feature.
This mapping helps computers understand and process language efficiently.
To store and access these mappings, different data structures are used. The
three common methods are indexing, lists, and dictionaries.

1. Mapping Words to Properties


Words are assigned specific properties such as:
• POS tag (e.g., “run → VERB”)
• Frequency (e.g., “the → 5000 occurrences”)
• Word ID (used in text indexing)
• Semantic category (e.g., “apple → fruit”)
Mapping is important for tasks such as text classification, search engines, POS
tagging, sentiment analysis, and language models.

2. Methods to Map Words to Properties


(i) Indexing Method
Indexing assigns a unique number (index) to each word.
This index can then be used to refer to the word in vocabulary lists, matrices, or
models.
Features
• Every word gets a unique integer ID.
• Efficient for machine learning models.
• Commonly used in Bag-of-Words, TF-IDF, and embedding models.
Example
Word list:
["cat", "dog", "apple"]
Mapping:
cat → 0
dog → 1
apple → 2
Indexing simplifies storage and speeds up lookup.
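The word-to-index mapping above can be built in a few lines. This is a minimal sketch in plain Python; the variable names word_to_id and id_to_word are illustrative assumptions.

```python
words = ["cat", "dog", "apple"]

# Assign each distinct word the next free integer ID, in order of first appearance.
word_to_id = {}
for word in words:
    if word not in word_to_id:
        word_to_id[word] = len(word_to_id)

# The reverse table recovers the word from its ID.
id_to_word = {idx: word for word, idx in word_to_id.items()}

print(word_to_id)     # {'cat': 0, 'dog': 1, 'apple': 2}
print(id_to_word[2])  # apple
```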

(ii) List Method


A list stores words in a sequence, and their properties can be stored at
corresponding positions or as sub-lists.
Features
• Ordered collection.
• Useful when the mapping is simple and position-based.
• Easy to search small datasets.
Example
List of pairs:
[("cat", "animal"), ("apple", "fruit"), ("blue", "color")]
The list holds both the word and its property together.
(iii) Dictionary Method
A dictionary (or hash map) stores mappings in the form of key–value pairs,
where the key is the word and the value is its property.
Features
• Fast lookup.
• Keys are unique.
• Suitable for large NLP datasets.
• Can store complex properties.
Example
{
"cat": "NOUN",
"run": "VERB",
"happy": "ADJ"
}
Dictionaries are widely used because they allow efficient and flexible mapping.
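To illustrate "complex properties" and fast lookup, the sketch below stores a small property record per word and counts word frequencies with a defaultdict. The lexicon contents are made-up example values, not data from any corpus.

```python
from collections import defaultdict

# Each word maps to a dict of properties; the values are made-up examples.
lexicon = {
    "cat": {"pos": "NOUN", "category": "animal"},
    "run": {"pos": "VERB"},
}

print(lexicon["cat"]["pos"])                           # NOUN
print(lexicon.get("zebra", {}).get("pos", "UNKNOWN"))  # UNKNOWN

# defaultdict supplies a default value for unseen words, e.g. for counting.
freq = defaultdict(int)
for token in ["the", "cat", "saw", "the", "dog"]:
    freq[token] += 1
print(freq["the"], freq["bird"])                       # 2 0
```

Using .get() with a default avoids a KeyError for words that are not in the dictionary.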

4. Use of dictionaries in indexing.


Use of Dictionaries in Indexing (NLP / IR)
In Natural Language Processing (NLP) and Information Retrieval (IR),
dictionaries play an important role in indexing, which is the process of storing
and organizing words so that they can be searched quickly.
A dictionary is a data structure that stores information in the form of key–
value pairs, where the word acts as the key and its properties or information
act as the value.

Use of Dictionaries in Indexing


1. Fast Lookup of Words
Dictionaries allow very fast searching of words.
When a word appears in a document or corpus, it can be quickly checked and
located using the dictionary.
This speeds up indexing because, with a hash-based dictionary, each word can be
found in constant time on average (O(1)).
2. Storing Word Properties
Dictionaries store important details about each word, such as:
o frequency of the word
o document IDs where it appears
o positions of the word in the text
o part-of-speech tags
o synonyms or lemmas
This helps create an efficient index.
3. Constructing Inverted Index
In IR systems (e.g., search engines), dictionaries are used to form inverted
indexes, where each word maps to a list of documents containing that word.
Example:
o “book” → [Doc1, Doc4, Doc7]
4. Avoiding Duplication
Dictionaries ensure that each word appears only once.
When indexing, if a word is already present in the dictionary, its information is
updated instead of adding it again.
5. Efficient Text Processing
Dictionaries help in quick operations like:
o token lookup
o checking stopwords
o identifying vocabulary
o storing POS tags for each token
This improves indexing speed and accuracy.
6. Useful in NLP Tasks
Dictionaries are widely used for:
o POS tagging
o lemmatization
o word normalization
o vocabulary building
All these tasks support better indexing of text.
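The inverted index from point 3 can be sketched with a dictionary whose values are sets of document IDs. This is a minimal illustration in plain Python; the function name build_inverted_index and the three sample documents are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lower-cased word to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)  # a set ignores repeated words in a document
    return {word: sorted(ids) for word, ids in index.items()}

docs = {
    "Doc1": "the book is old",
    "Doc2": "the dog barked",
    "Doc4": "a book about dogs",
}
index = build_inverted_index(docs)
print(index["book"])  # ['Doc1', 'Doc4']
```

Because the word is the dictionary key, each word is stored once (point 4) and its posting list is simply updated when the word is seen again.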

5. Explain Automatic tagging and its methods (default tagger, regular expression tagger, lookup tagger).
Automatic Tagging and Its Methods
(Default Tagger, Regular Expression Tagger, Lookup Tagger)
Automatic tagging is a process in Natural Language Processing (NLP) where
each word in a text is automatically assigned a Part-of-Speech (POS) tag
without manual annotation. It is an essential step in syntactic analysis,
machine translation, information extraction, and many NLP applications.
Automatic taggers use predefined rules, patterns, or statistical information to
determine the correct tag for each word in a sentence.

1. Default Tagger
A Default Tagger is the simplest type of automatic tagger.
It assigns the same POS tag to every word in the text, regardless of the context
or meaning.
Working
• The system selects the most frequent tag in the training corpus (usually NN
for nouns).
• Every word in the corpus is tagged with this single tag.
Example
If the default tag is NN, then:
• He/NN is/NN running/NN fast/NN.
Advantages
• Very simple and fast
• Provides a baseline for evaluating more complex taggers
Disadvantages
• Very low accuracy, because not all words share the same POS
• Ignores context and sentence structure
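A default tagger and its use as a baseline can be sketched as follows. This is a plain-Python illustration rather than NLTK's own DefaultTagger; the helper names default_tag and accuracy and the small gold-standard sentence are assumptions for the example.

```python
def default_tag(tokens, tag="NN"):
    """Give every token the same tag."""
    return [(token, tag) for token in tokens]

def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = sum(p == g for (_, p), (_, g) in zip(predicted, gold))
    return correct / len(gold)

gold = [("The", "DT"), ("dog", "NN"), ("chased", "VBD"),
        ("the", "DT"), ("cat", "NN")]
predicted = default_tag([word for word, _ in gold])
print(predicted)
print(accuracy(predicted, gold))  # 0.4 (only the two nouns are correct)
```

The score obtained this way is the baseline that any more complex tagger should beat.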

2. Regular Expression Tagger


A Regular Expression (Regexp) Tagger uses patterns (regular expressions) to
assign tags to words.
It matches word endings, prefixes, digits, and other patterns.
Working
The tagger applies a list of pattern–tag rules, such as:
• Words ending in -ing → VBG
• Words ending in -ed → VBD
• Words ending in -ly → RB
• Numbers → CD
Example
• running ⇒ matches .*ing ⇒ tagged as VBG
• slowly ⇒ matches .*ly ⇒ tagged as RB
• played ⇒ matches .*ed ⇒ tagged as VBD
Advantages
• More accurate than default tagger
• Works well for languages with clear morphological patterns
• Useful for unknown words
Disadvantages
• Depends on manually crafted patterns
• Cannot resolve ambiguous words without context
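The pattern-tag rules above can be sketched with Python's re module. This is an illustrative plain-Python version, not NLTK's RegexpTagger; the PATTERNS list and the function name regexp_tag are assumptions. Rules are tried in order and the first match wins.

```python
import re

# Pattern -> tag rules, tried in order; the last rule is a catch-all.
PATTERNS = [
    (r"^-?\d+(\.\d+)?$", "CD"),  # numbers
    (r".*ing$", "VBG"),          # gerunds
    (r".*ed$", "VBD"),           # simple past
    (r".*ly$", "RB"),            # adverbs
    (r".*", "NN"),               # default: noun
]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for token in tokens:
        for pattern, tag in PATTERNS:
            if re.match(pattern, token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(["running", "slowly", "played", "42", "sing"]))
# [('running', 'VBG'), ('slowly', 'RB'), ('played', 'VBD'), ('42', 'CD'), ('sing', 'VBG')]
```

The last token shows the main weakness: "sing" ends in -ing and is wrongly tagged VBG, because the patterns look only at the word's shape and not at its context.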

3. Lookup Tagger
A Lookup Tagger assigns tags based on a word–tag dictionary (also called a
lexicon).
It looks up each word in a pre-compiled list of common words and their most
frequent POS tags.
Working
• A lexicon/dictionary is prepared from a tagged corpus.
• For each word in the input text:
o If the word exists in the dictionary → assign the stored tag
o If not → assign a default tag
Example
Dictionary entries:
• play → VB
• played → VBD
• dog → NN
Sentence: The dog played.
• The → not in the dictionary → default tag (e.g., NN)
• dog → NN
• played → VBD
Advantages
• More accurate than default and regex taggers
• Works well for high-frequency words
• Simple and efficient
Disadvantages
• Cannot tag unknown or rare words correctly
• Quality depends on size of dictionary
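The lookup procedure can be sketched with an ordinary dictionary. This is a minimal plain-Python illustration; LEXICON holds the example entries listed above, and the function name lookup_tag and the default tag NN are assumptions.

```python
# Hand-built lexicon using the example dictionary entries above.
LEXICON = {"play": "VB", "played": "VBD", "dog": "NN"}

def lookup_tag(tokens, lexicon, default="NN"):
    """Tag known words from the lexicon; unknown words get the default tag."""
    return [(token, lexicon.get(token, default)) for token in tokens]

print(lookup_tag(["The", "dog", "played"], LEXICON))
# [('The', 'NN'), ('dog', 'NN'), ('played', 'VBD')]
```

"The" is not in the lexicon, so it receives the default tag, which is wrong here; this is the unknown-word weakness listed under Disadvantages.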

Conclusion
Automatic tagging is used to assign POS tags to text automatically.
The main methods include:
• Default Tagger: assigns the same tag everywhere.
• Regular Expression Tagger: uses patterns to tag words.
• Lookup Tagger: uses a dictionary of known word–tag pairs.
These methods form the basis for more advanced taggers such as statistical
and machine-learning based taggers.

6. Illustrate different n-gram tagging methods (unigram, n-gram, combined tagger).
Illustrate Different N-gram Tagging Methods (Unigram, N-gram, Combined
Tagger)
N-gram tagging is an automatic POS-tagging technique used in Natural
Language Processing (NLP). An n-gram tagger predicts the tag of a word
from the word itself and the tags of the previous n-1 words. These taggers
learn patterns from a tagged corpus and apply them to new text.
The main n-gram tagging methods are: Unigram Tagger, N-gram
(Bigram/Trigram) Tagger, and Combined Tagger.

1. Unigram Tagger
Definition
A unigram tagger assigns each word the most frequent tag it received in the
training corpus.
It does not consider context or surrounding words.
Working
• The tagger scans a corpus.
• For each word, it finds the tag that most often occurs with it.
• During tagging, the same tag is assigned to the word every time it appears.
Example
In the corpus:
• “book” appears mostly as a noun → tag “book” as NN
• “run” appears mostly as a verb → tag “run” as VB
Advantages
• Simple and fast
• Works well for common words
Disadvantages
• Fails for words with multiple tags
• Does not consider sentence context
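Training a unigram tagger amounts to counting, for every word, how often each tag occurs with it. The sketch below is a plain-Python illustration and not NLTK's UnigramTagger; the function name train_unigram and the three training sentences are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Learn {word: most frequent tag} from sentences of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

train = [
    [("I", "PRP"), ("read", "VBP"), ("a", "DT"), ("book", "NN")],
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("please", "UH"), ("book", "VB"), ("a", "DT"), ("table", "NN")],
]
model = train_unigram(train)

# "book" was seen twice as NN and once as VB, so it is always tagged NN,
# even in "book a table" where it is a verb.
print([(w, model.get(w)) for w in ["book", "a", "table"]])
# [('book', 'NN'), ('a', 'DT'), ('table', 'NN')]
```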

2. N-gram Tagger (Bigram and Trigram Taggers)


Definition
An n-gram tagger uses the current word together with the previous (n-1) tags
to predict the current word’s tag.
Most common types:
• Bigram tagger → uses previous 1 tag
• Trigram tagger → uses previous 2 tags
Working
Instead of assigning tags only based on the word, it uses tag sequences from
the training corpus.
Example (Bigram tagger)
If previous tag = DT (determiner)
Then the next tag is likely NN (noun)
Sentence: “The / DT cat / NN”
Advantages
• Considers context
• More accurate than unigram tagger
Disadvantages
• Data sparsity problem (unseen tag sequences)
• Needs large corpus

3. Combined (Backoff) Tagger


Definition
A combined tagger, also called a backoff tagger, uses multiple taggers in
sequence.
If a higher-level tagger fails, it “backs off” to a simpler tagger.
Structure Example
1. Trigram tagger
2. If fails → use Bigram tagger
3. If fails → use Unigram tagger
4. If still fails → use Default tagger
Working
• The model first tries the most specific (highest-order) tagger.
• If it cannot tag a word (unknown word or unseen context),
it tries the next tagger in the chain.
• This usually increases both coverage and accuracy.
Advantages
• Higher accuracy than any single tagger in the chain
• Mitigates the data sparsity problem
• Handles unknown words (through the default tagger)
Disadvantages
• More complex to train
• Requires more memory
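The backoff chain can be sketched in plain Python with a bigram model that falls back to a unigram model and then to a default tag. This is an illustrative simplification and not NLTK's implementation; all function names and the training sentences are assumptions. The bigram context here is the pair (previous tag, current word).

```python
from collections import Counter, defaultdict

def train_unigram(sents):
    """Learn {word: most frequent tag}."""
    counts = defaultdict(Counter)
    for sent in sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def train_bigram(sents):
    """Learn {(previous tag, word): most frequent tag}; None marks sentence start."""
    counts = defaultdict(Counter)
    for sent in sents:
        prev = None
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def backoff_tag(tokens, bigram, unigram, default="NN"):
    tagged, prev = [], None
    for word in tokens:
        tag = bigram.get((prev, word))        # 1. try the bigram model
        if tag is None:
            tag = unigram.get(word, default)  # 2. then unigram, 3. then default
        tagged.append((word, tag))
        prev = tag
    return tagged

train = [
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("they", "PRP"), ("book", "VBP"), ("the", "DT"), ("room", "NN")],
    [("a", "DT"), ("book", "NN")],
]
bigram, unigram = train_bigram(train), train_unigram(train)
print(backoff_tag(["they", "book", "a", "room"], bigram, unigram))
# [('they', 'PRP'), ('book', 'VBP'), ('a', 'DT'), ('room', 'NN')]
```

The bigram model tags "book" as a verb after a pronoun, which the unigram model alone would get wrong. The context (VBP, "a") was never seen in training, so that word is handled by backing off to the unigram model.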

7. Illustrate the process of determining category of word.


Answer: Process of Determining the Category of a Word
Determining the category of a word means identifying its Part of Speech
(POS), such as noun, verb, adjective, etc. This is an essential task in Natural
Language Processing (NLP) because it helps machines understand the
grammatical role and meaning of each word in a sentence. The process is
known as POS tagging or morpho-syntactic analysis.
The determination of a word’s category involves several steps and linguistic
clues:

1. Morphological Analysis
This step examines the form and structure of the word.
a) Word endings (suffixes)
Many words indicate their category by their suffix.
Examples:
• -tion, -ment, -ness → nouns
(information, enjoyment, kindness)
• -ly → adverbs
(quickly, slowly)
• -able, -ive → adjectives
(comfortable, creative)
b) Word roots or stems
Some word families contain related categories.
Example:
• act (verb) → action (noun) → active (adjective)
Morphological clues help guess the likely category.

2. Syntactic Analysis (Position in Sentence)


A word’s place in a sentence gives strong hints about its category.
Examples:
• Words before nouns are often adjectives:
The beautiful garden…
• Words after articles (a, an, the) are usually nouns:
The dog barked.
• Words after the subject are often verbs:
He runs fast.
Syntactic context is usually one of the strongest clues for determining category.

3. Semantic Clues (Meaning-based Identification)


Meaning also helps identify categories.
Examples:
• If the word represents a thing/person/place, it is a noun.
• If the word shows action or state, it is a verb.
• If it describes a quality, it is an adjective.
Although meaning is helpful, POS taggers rely on it less because meaning varies
with context.

4. Lexical Resources (Dictionaries / Lexicons)


Computational systems use dictionaries that store:
• the word
• its definition
• its possible POS categories
Example:
run → verb / noun (multi-category word)
Lexicons provide the initial set of possible categories.

5. Statistical / Machine Learning Methods


Modern NLP uses probability models to determine the most likely category.
These include:
• Unigram taggers
Choose the most common tag for each word.
• Bigram / Trigram taggers
Choose the tag based on the current word and the tags of the preceding one or two words.
• Hidden Markov Models, CRFs, Neural Networks
Learn patterns from large corpora and predict categories with high accuracy.

6. Rule-based Determination
Some systems use linguistic rules such as:
• If a word ends with “-ing” and follows “is/was,” → verb
• If a word comes between an article and a noun → adjective
Rule-based methods predate machine-learning taggers.

7. Combining Multiple Sources (Hybrid Approach)


Modern POS taggers combine:
• rules
• dictionaries
• statistical models
This hybrid approach improves accuracy, especially for ambiguous words.

Example
Sentence: “The old man walks slowly.”
Word     Category Determination Clue     Final Category
The      Determiner before noun          DET
old      Adjective before noun           ADJ
man      Refers to a person              NOUN
walks    Action verb in present tense    VERB
slowly   Ends with –ly → adverb          ADV


UNIT 5 : Classify Text and Extracting Information from Text

1. Illustrate information extraction with its proper architectural diagram.
Information Extraction (IE)
Information Extraction (IE) is a Natural Language Processing (NLP) technique
used to automatically identify and extract structured information from
unstructured text such as documents, articles, emails, or web pages.
It focuses on extracting specific pieces of information, such as:
• Names of people, places, organizations
• Dates, times, quantities
• Relationships between entities
• Events and facts
IE converts raw text → structured, machine-readable information, useful for
search engines, AI assistants, summarization, and knowledge bases.

Objectives of Information Extraction


1. To identify important entities from text
2. To detect relations between these entities
3. To extract facts and events
4. To convert unstructured text into structured databases or knowledge graphs

Applications of IE
• Search engines (Google Knowledge Graph)
• Resume parsers
• Medical record mining
• Business intelligence
• Event extraction in news
• Chatbots and virtual assistants

Information Extraction Architecture


Below is a simple, neat architectural diagram you can draw in your exam:
┌─────────────────────────────┐
│      Unstructured Text      │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│     Text Preprocessing      │
│ (Tokenization, POS Tagging) │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│  Named Entity Recognition   │
│   (NER: Person, Location)   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│     Relation Extraction     │
│  (Identifying connections)  │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│      Event Extraction       │
│ (Who did What, When, Where) │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Structured Information    │
│ (Database / Knowledge Base) │
└─────────────────────────────┘

Explanation of Each Component


1. Unstructured Text
Input such as:
• News articles
• Emails
• Reports
• Social media posts
Text is raw and not suitable for direct analysis.

2. Preprocessing
To prepare text for extraction:
• Tokenization – splitting text into words
• POS Tagging – determining word categories
• Lemmatization/Stemming – reducing words to base form
• Stopword removal
Preprocessing simplifies the extraction task.

3. Named Entity Recognition (NER)


NER identifies important entities such as:
• Person (PER)
• Organization (ORG)
• Location (LOC)
• Date/Time
• Percentage, Money, Quantity
Example:
Sentence: "Microsoft hired John in 2020."
NER detects:
• Microsoft → ORG
• John → PERSON
• 2020 → DATE

4. Relation Extraction
Detects semantic relationships between entities.
Example:
“Alice works at Amazon.”
Relation: Works-At(Alice, Amazon)
Types of relations:
• Employment
• Ownership
• Location-based
• Membership
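A very simple, pattern-based form of relation extraction for the Works-At example can be sketched as below. The single regular expression WORKS_AT and the function name extract_works_at are assumptions made for illustration; practical systems build on NER output and use many patterns or a trained model.

```python
import re

# One hand-written pattern for the employment relation (an assumption for
# illustration only): a capitalized word, "works at", a capitalized word.
WORKS_AT = re.compile(r"\b([A-Z]\w*) works at ([A-Z]\w*)")

def extract_works_at(text):
    """Return (relation, person, organization) triples found in the text."""
    return [("Works-At", m.group(1), m.group(2)) for m in WORKS_AT.finditer(text)]

print(extract_works_at("Alice works at Amazon. Bob works at Google."))
# [('Works-At', 'Alice', 'Amazon'), ('Works-At', 'Bob', 'Google')]
```

Each triple is already a structured record that could be stored as a table row or as an edge in a knowledge graph.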

5. Event Extraction
Identifies:
• What happened
• Who is involved
• When and where it happened
Example event structure:
“Flood hit Mumbai in July 2021.”
• Event: Natural Disaster
• Location: Mumbai
• Time: July 2021

6. Structured Output
Finally, extracted information is stored as:
• Tables
• Knowledge graphs
• JSON/XML
• Databases
This structured format is usable for further analysis or machine learning tasks.

2. Define chunking list and explain various chunking methods (Noun phrase chunking, tag patterns, chunking with regex, representing chunks).
Chunking in NLP
Chunking, also known as shallow parsing, is the process of identifying and
grouping related words in a sentence into meaningful phrases, such as noun
phrases (NPs), verb phrases (VPs), or prepositional phrases (PPs).
Chunking works on top of POS-tagged text and extracts higher-level structures
without performing full parsing.
Example:
Sentence: "The quick brown fox jumps"
POS tags: The/DT quick/JJ brown/JJ fox/NN jumps/VBZ
Chunk (NP): [The quick brown fox]
Chunking helps in information extraction, question answering, text
summarization, and named entity recognition.

What is a Chunk List?


A chunk list is the structured output produced after applying chunking rules to a
tagged sentence.
It contains groups (chunks) of words along with the words' POS tags, usually
represented in a hierarchical or tree-like format.
Example of chunk list:
[NP The/DT quick/JJ brown/JJ fox/NN] [VP jumps/VBZ]
It is used to store chunks and process them for further NLP tasks.

Chunking Methods
Chunking can be performed using different techniques. The major methods
include:

1. Noun Phrase (NP) Chunking


NP chunking identifies noun-centered phrases using POS-tag patterns.
Typical NP pattern:
• Determiner (DT)
• Adjectives (JJ)
• Noun (NN)
Example pattern:
NP → DT JJ* NN
Example:
Sentence: “The beautiful red flower bloomed.”
Chunk: [The beautiful red flower] (one NP)
NP chunking is widely used because noun phrases are essential in identifying
entities and meaningful concepts.

2. Tag Patterns for Chunking


Chunking relies on patterns of POS tags to define phrase boundaries.
A tag pattern is a rule that describes how POS tags combine to form larger
phrases.
Example NP pattern:
<DT>? <JJ>* <NN>
Explanation:
• <DT>? → optional determiner
• <JJ>* → zero or more adjectives
• <NN> → one noun
Other examples:
• Verb phrase (VP): <VB.*> <RB>*
• Prepositional phrase (PP): <IN> <NP>
Tag patterns are the foundation of both regex-based and rule-based chunking.

3. Chunking with Regular Expressions


Chunking with regex means writing regular expression rules over POS tags to
identify chunks.
Most NLP libraries (like NLTK) support RegexpParser, which uses tag-based
regex rules.
Example regex rule for NP:
NP: {<DT>?<JJ>*<NN>}
Here:
• { } → defines a chunk
• <DT> → determiner
• <JJ> → adjective
• <NN> → noun
Advantages:
• Simple
• Transparent rules
• Easy to modify
• The mechanism is language-independent (the rules themselves depend on the tagset)
Regex chunking is widely used in shallow parsing.
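The idea behind the NP rule can be sketched in plain Python by writing the sentence's POS tags as one string and running an ordinary regular expression over it. This is an illustrative sketch and not NLTK's RegexpParser; the function name np_chunk is an assumption.

```python
import re

def np_chunk(tagged):
    """Return the NP chunks matched by the tag pattern <DT>?<JJ>*<NN>."""
    # Encode the tag sequence as a string such as "<DT><JJ><NN><VBZ>".
    tag_string = "".join(f"<{tag}>" for _, tag in tagged)
    # Remember where each token's tag starts inside that string.
    starts, pos = [], 0
    for _, tag in tagged:
        starts.append(pos)
        pos += len(tag) + 2  # +2 for the angle brackets
    pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")
    chunks = []
    for m in pattern.finditer(tag_string):
        # Map the matched character span back to token positions.
        idxs = [i for i, s in enumerate(starts) if m.start() <= s < m.end()]
        chunks.append([tagged[i] for i in idxs])
    return chunks

sent = [("The", "DT"), ("quick", "JJ"), ("brown", "JJ"),
        ("fox", "NN"), ("jumps", "VBZ")]
print(np_chunk(sent))
# [[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')]]
```

Changing the chunker means changing only the pattern string, which is why regex rules are described above as transparent and easy to modify.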

4. Representing Chunks
Chunks can be represented in multiple forms:
a) Bracketed Notation
[NP The/DT red/JJ car/NN] stopped.
b) Chunk Trees
Hierarchical representation using tree structures.
Example:
(S
(NP The/DT red/JJ car/NN)
(VP stopped/VBD)
)
c) BIO Tagging Format
Each word is labeled as:
• B- beginning of a chunk
• I- inside a chunk
• O outside any chunk
Example:
The B-NP
red I-NP
car I-NP
stopped O
BIO is used in machine learning and deep learning for chunking and named
entity recognition.
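Converting chunks into BIO labels is mechanical, as the following sketch shows. The input representation (a list whose items are either a (label, words) chunk or a plain word) and the function name chunks_to_bio are assumptions chosen for this example.

```python
def chunks_to_bio(items):
    """items: a list whose elements are either (label, [words]) chunks or plain words."""
    tagged = []
    for item in items:
        if isinstance(item, tuple):
            label, words = item
            for i, word in enumerate(words):
                prefix = "B" if i == 0 else "I"  # first word begins the chunk
                tagged.append((word, f"{prefix}-{label}"))
        else:
            tagged.append((item, "O"))           # word outside any chunk
    return tagged

print(chunks_to_bio([("NP", ["The", "red", "car"]), "stopped"]))
# [('The', 'B-NP'), ('red', 'I-NP'), ('car', 'I-NP'), ('stopped', 'O')]
```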

3. How to develop and evaluate chunkers (Reading IOB format, simple evaluation baseline, training classifier based chunkers).

Developing and Evaluating Chunkers


Chunking (also called shallow parsing) is the process of identifying phrases
such as noun phrases (NP), verb phrases (VP), or prepositional phrases (PP) in a
sentence. Chunkers label multi-word units based on POS-tagged input.
Developing an accurate chunker requires proper training data, a chunking
model, and evaluation techniques.
The development of chunkers generally involves three major steps:
1. Reading and preparing data (IOB format)
2. Establishing a simple evaluation baseline
3. Training classifier-based chunkers for better accuracy

1. Reading IOB Format


IOB stands for Inside–Outside–Beginning, a popular annotation scheme for
chunking corpora.
IOB Tags
• B-XXX: Beginning of a chunk of type XXX
• I-XXX: Inside a chunk
• O: Outside any chunk
Example sentence with chunks (Noun Phrase Chunking):
Word     POS    Chunk Tag
The      DT     B-NP
little   JJ     I-NP
girl     NN     I-NP
plays    VBZ    O
in       IN     O
the      DT     B-NP
park     NN     I-NP

Purpose of IOB Format


• Clearly identifies chunk boundaries
• Makes chunking suitable for machine learning
• Ensures consistency across dataset
Tools such as NLTK use built-in functions to read IOB formatted corpora like
CoNLL-2000.
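Reading IOB data means grouping consecutive B- and I- lines back into chunks. The sketch below parses the example sentence above in plain Python; it assumes one "word POS chunk-tag" triple per line, and the function name read_iob is an illustrative assumption rather than an NLTK function.

```python
def read_iob(text):
    """Parse 'word POS chunk-tag' lines into a list of (label, [words]) chunks."""
    chunks, current = [], None
    for line in text.strip().splitlines():
        word, pos, tag = line.split()
        if tag.startswith("B-"):
            current = (tag[2:], [word])  # start a new chunk
            chunks.append(current)
        elif tag.startswith("I-") and current is not None:
            current[1].append(word)      # extend the open chunk
        else:
            current = None               # an O tag closes any open chunk
    return chunks

sample = """
The DT B-NP
little JJ I-NP
girl NN I-NP
plays VBZ O
in IN O
the DT B-NP
park NN I-NP
"""
print(read_iob(sample))
# [('NP', ['The', 'little', 'girl']), ('NP', ['the', 'park'])]
```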

2. Simple Evaluation Baseline


Before building a complex model, a baseline system is established for
comparison. A baseline chunker helps evaluate whether advanced models truly
perform better.
Common baseline methods:
(a) Most Frequent Chunk Tag Baseline
• Assigns the most common chunk tag found in the training data to every word.
• Example: If "O" is the most frequent tag, all words are tagged "O".
(b) POS-based Baseline
• Assigns the most frequent chunk tag for a given POS tag.
• E.g., all nouns (NN) are tagged as I-NP.
Importance of Baseline
• Acts as a reference model
• Helps determine whether the classifier-based chunker is actually learning
• Avoids meaningless accuracy claims
A most-frequent-tag baseline usually performs poorly, while a POS-based baseline can be reasonably strong; either way, the baseline establishes a minimum standard.
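The POS-based baseline (b) can be sketched by counting which chunk tag most often accompanies each POS tag. This is a plain-Python illustration; the function names and the one-sentence training set are assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_pos_baseline(train_sents):
    """Learn the most frequent chunk tag for each POS tag."""
    counts = defaultdict(Counter)
    for sent in train_sents:
        for _word, pos, chunk in sent:
            counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def baseline_chunk(pos_tags, model, default="O"):
    """Assign each POS tag its most frequent chunk tag; unseen POS tags get 'O'."""
    return [model.get(pos, default) for pos in pos_tags]

train = [[
    ("The", "DT", "B-NP"), ("little", "JJ", "I-NP"), ("girl", "NN", "I-NP"),
    ("plays", "VBZ", "O"), ("in", "IN", "O"),
    ("the", "DT", "B-NP"), ("park", "NN", "I-NP"),
]]
model = train_pos_baseline(train)
print(baseline_chunk(["DT", "NN", "VBZ"], model))  # ['B-NP', 'I-NP', 'O']
```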

3. Training Classifier-Based Chunkers


Classifier-based chunkers use machine learning algorithms to learn patterns of
chunks from annotated training data.
How Classifier-Based Chunkers Work
1. Extract features from POS-tagged words
o Current word, POS tag
o Previous word/POS
o Next word/POS
o Word shape, suffix, capitalization, etc.
2. Train a classifier such as:
o Decision Trees
o Maximum Entropy classifier
o Naïve Bayes
o Support Vector Machines
3. The classifier predicts IOB tags for each word.
Advantages
• Learns complex patterns automatically
• Can use richer context (neighboring words, tags, and earlier chunk decisions) than tag patterns alone
• Typically higher accuracy than rule-based chunkers
Example Features for Chunking
• Current POS: NN
• Previous POS: DT
• Next POS: VBZ
• Previous chunk tag
• Bigram or trigram tag history
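The features listed above are usually collected into a dictionary for each token and passed to the classifier. The following sketch shows one possible feature extractor; the function name chunk_features, the feature names, and the boundary markers <START> and <END> are assumptions for illustration.

```python
def chunk_features(sentence, i, history):
    """Build the feature dict for token i.

    sentence: list of (word, pos) pairs
    history:  chunk tags already predicted for tokens 0..i-1
    """
    word, pos = sentence[i]
    prev_pos = sentence[i - 1][1] if i > 0 else "<START>"
    next_pos = sentence[i + 1][1] if i < len(sentence) - 1 else "<END>"
    prev_chunk = history[i - 1] if i > 0 else "<START>"
    return {
        "word": word.lower(),
        "pos": pos,
        "prev_pos": prev_pos,
        "next_pos": next_pos,
        "prev_chunk": prev_chunk,
        "suffix": word[-3:],
        "is_capitalized": word[0].isupper(),
    }

sentence = [("The", "DT"), ("girl", "NN"), ("plays", "VBZ")]
print(chunk_features(sentence, 1, ["B-NP"]))
```

A classifier is then trained on pairs of (feature dict, gold IOB tag), one pair per token.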
Training Process
• Provide IOB-tagged corpus as training set
• Extract linguistic features
• Train the classifier
• Test on unseen data
Performance Evaluation
• Precision = correctly predicted chunks / total predicted chunks
• Recall = correctly predicted chunks / total actual chunks
• F-measure = harmonic mean of precision and recall
Evaluation ensures that the chunker generalizes well and does not overfit.
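The three measures can be computed directly from sets of chunks. In the sketch below each chunk is represented as a (label, start, end) span, and a predicted chunk counts as correct only if the identical span is in the gold set; the function name chunk_scores and the example spans are assumptions.

```python
def chunk_scores(predicted, gold):
    """predicted/gold: sets of (label, start, end) chunk spans."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("NP", 0, 3), ("NP", 5, 7)}
predicted = {("NP", 0, 3), ("NP", 4, 7), ("NP", 3, 4)}
precision, recall, f1 = chunk_scores(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Here one of three predicted chunks is correct (precision 1/3) and one of two gold chunks is found (recall 1/2), which gives an F-measure of 0.4.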

4. Explain recursion in linguistic structure (how to build nested structures, trees, tree traversal).

Recursion in Linguistic Structure


Recursion is a fundamental concept in linguistics where a linguistic unit can
contain another unit of the same type inside it.
It allows a language to generate an unbounded number of sentences from a finite set of rules.
For example:
“The girl [who met the boy [who won the prize]] smiled.”
Here subordinate clauses are nested inside one another.
Recursion forms the basis of phrase structure grammars, syntax trees, and
hierarchical sentence representation in NLP.

1. Building Nested Structures


To represent recursion, linguistic structures are built using nested phrases.
A sentence (S) may contain:
• a noun phrase (NP)
• a verb phrase (VP)
Inside NP, another NP or clause may be nested.
Example:
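S → NP + VP
NP → Determiner + Noun + (Clause)
Clause → who + VP
VP → Verb + NP
Because a Clause contains a VP, and a VP can contain another NP, the NP rule
can apply inside itself.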
Each phrase can contain another phrase, creating hierarchical nesting.
Example of nested structure
The sentence:
“The book on the table near the window in the room is old.”
Nesting occurs:
• Book
o on the table
▪ near the window
▪ in the room
This type of structure is recursive because similar patterns repeat.

2. Trees in Linguistic Structure


A tree is a graphical representation of the hierarchical structure of a sentence.
It shows:
• Phrases (NP, VP)
• Words
• How units are grouped
Features of Linguistic Trees
• Represent grammar hierarchies.
• Show how words combine to form phrases.
• Demonstrate recursion clearly.
• Are used in parsing and syntax analysis.
Example Tree Structure
Sentence: “The boy eats an apple.”
             S
      _______|_______
      NP            VP
   ___|___      ____|_____
  Det    N      V        NP
   |     |      |     ___|___
  The   boy    eats  Det    N
                      |     |
                     an   apple
Each phrase branches into sub-phrases or words.

3. Tree Traversal
Tree traversal means visiting each node of the linguistic tree in a particular
order.
Traversal is important for processing sentences in NLP applications such as
parsing, chunking, grammar checking, etc.
Types of Tree Traversal

(a) Pre-order Traversal (Top–Down)


• Visit the root first
• Then visit left and right branches
Used in generating sentences from grammar.
Example order for above tree:
S → NP → Det → The → N → boy → VP → V → eats → NP → Det → an → N → apple

(b) Post-order Traversal (Bottom–Up)


• Visit children first
• Visit the parent last
Useful in parsing, where the structure is built from words upward.
Example:
The → Det → boy → N → NP → eats → V → an → Det → apple → N → NP → VP → S
(c) In-order Traversal
Common in binary trees, but can also be adapted for syntax trees for sequential
processing.
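The pre-order and post-order listings above can be reproduced with two short recursive functions. In this sketch the tree for "The boy eats an apple." is written as nested tuples of the form (label, child, child, ...) with plain strings as leaves; this representation and the function names are assumptions made for illustration.

```python
TREE = ("S",
        ("NP", ("Det", "The"), ("N", "boy")),
        ("VP", ("V", "eats"),
               ("NP", ("Det", "an"), ("N", "apple"))))

def preorder(node):
    """Visit the node's label first, then its children from left to right."""
    if isinstance(node, str):   # a leaf (word)
        yield node
        return
    yield node[0]
    for child in node[1:]:
        yield from preorder(child)

def postorder(node):
    """Visit the children from left to right first, then the node's label."""
    if isinstance(node, str):
        yield node
        return
    for child in node[1:]:
        yield from postorder(child)
    yield node[0]

print(" → ".join(preorder(TREE)))
print(" → ".join(postorder(TREE)))
```

Both functions call themselves on each subtree, so the code is recursive in the same way the tree is.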

Importance of Recursion in NLP


• Helps in building syntactic parsers
• Allows models to interpret complex and longer sentences
• Essential for context-free grammars and phrase structure rules
• Enables understanding of nested clauses, coordination, relative clauses
• Forms the basis of syntax trees, chunking, and semantic structure analysis

5. What is Named Entity Recognition (NER) also explain its usefulness.

Named Entity Recognition (NER)

1. Definition

Named Entity Recognition (NER) is a key task in Information Extraction (IE) that
involves identifying and classifying named entities in text into predefined categories
such as:

• Person (PER) – e.g., Albert Einstein, Priya Sharma

• Organization (ORG) – e.g., Google, IBM

• Location (LOC) – e.g., India, London

• Date/Time (DATE/TIME) – e.g., 5th December 2025

• Money/Percentage (MONEY/PERCENT) – e.g., $1000, 20%

• Miscellaneous (MISC) – e.g., product names, events

NER transforms unstructured text into structured information, which can be easily
analyzed by machines.

2. Importance and Usefulness

NER is useful in many NLP and AI tasks:

1. Information Retrieval – Extracts specific entities from documents to improve
search results.
2. Question Answering Systems – Finds precise answers by recognizing entities in
questions and text.

3. Text Summarization – Helps highlight important names, places, and dates.

4. Machine Translation – Ensures proper nouns are preserved during translation.

5. Sentiment Analysis – Identifies entities like brands, products, or people to
analyze opinions.

6. Knowledge Graphs and Ontologies – Converts text into structured entities and
relationships.

7. Content Categorization – Automatically classifies news articles, reports, and
social media posts.

3. Approaches to NER

NER can be implemented using different methods:

(a) Rule-based Approach

• Uses hand-crafted linguistic rules, dictionaries, and patterns.

• Example: Using regular expressions to detect dates or email addresses.

• Pros: Simple, interpretable.

• Cons: Limited coverage and requires manual updating.
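
The rule-based idea can be sketched with Python's standard re module. The two patterns below are deliberately simple, hypothetical rules for email addresses and for dates written like "5th December 2025"; a real system would need many more patterns, which illustrates the limited-coverage drawback.

```python
import re

# Hypothetical hand-written rules; they only cover the formats shown here.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE = re.compile(r"\b\d{1,2}(?:st|nd|rd|th)?\s+(?:" + MONTHS + r")\s+\d{4}\b")


def find_entities(text):
    """Return (matched text, label) pairs in order of appearance."""
    found = [(m.start(), m.group(), "EMAIL") for m in EMAIL.finditer(text)]
    found += [(m.start(), m.group(), "DATE") for m in DATE.finditer(text)]
    return [(span, label) for _, span, label in sorted(found)]


print(find_entities("Write to priya@example.com before 5th December 2025."))
```

A date written as "12/05/2025" would be missed until someone adds another rule, which is why such systems require manual updating.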

(b) Statistical / Machine Learning Approach

• Treats NER as a sequence labeling problem.

• Common models:

o Hidden Markov Models (HMM)

o Conditional Random Fields (CRF)

o Neural Networks (LSTM, BiLSTM, Transformers)

• Requires labeled corpus for training.

• Pros: More flexible, handles unseen entities.

• Cons: Requires large annotated datasets.
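
Treating NER as sequence labeling means giving every token one tag. A common convention is the BIO scheme (B- begins an entity, I- continues it, O is outside any entity). The sketch below only shows how annotated entity spans are converted into such per-token tags for training; the function name and the (start, end, label) span format are assumptions made for illustration.

```python
def to_bio(tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + label          # remaining tokens of the entity
    return tags


tokens = ["Microsoft", "CEO", "Satya", "Nadella", "visited",
          "Paris", "on", "1st", "December", "2025", "."]
spans = [(0, 1, "ORG"), (2, 4, "PER"), (5, 6, "LOC"), (7, 10, "DATE")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

An HMM, CRF, or neural model is then trained to predict the tag column from the token column.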

(c) Hybrid Approach

• Combines rules and ML to improve accuracy.


• Rules handle domain-specific entities; ML handles general cases.

4. Challenges in NER

1. Ambiguity – A word may refer to multiple entity types.

o Example: Apple (fruit) vs Apple (organization)

2. Out-of-Vocabulary Entities – New entities not seen during training.

3. Nested Entities – Entities inside entities: “University of California, Berkeley” is
an organization whose name contains the locations “California” and “Berkeley”.

4. Multi-word Entities – Proper detection of phrases like “Barack Obama”

5. Example of NER

Sentence:
"Microsoft CEO Satya Nadella visited Paris on 1st December 2025."

NER Output:

• Microsoft → Organization

• Satya Nadella → Person

• Paris → Location

• 1st December 2025 → Date

6. NER System Architecture (Conceptual)

1. Input Text – Raw unstructured text.

2. Preprocessing – Tokenization, POS tagging, sentence segmentation.

3. Entity Detection – Rule-based or ML-based identification of entities.

4. Entity Classification – Assigns categories to detected entities.

5. Output – Structured information with labeled entities.


UNIT 6 : Analyzing Sentence Structure

1. Interpret how Grammatical Dilemmas are used to analyze sentences
(Linguistic Data, Ubiquitous Ambiguity).

Interpreting How Grammatical Dilemmas Are Used to Analyze Sentences


In computational linguistics and NLP, grammatical dilemmas arise when a
sentence can be interpreted in multiple ways due to structural or lexical
ambiguities. These dilemmas help in understanding how language is processed
and analyzed.

1. Linguistic Data
Linguistic data refers to the textual or spoken material that is used to study
language patterns, syntax, and grammar. It includes:
• Words and tokens – the basic units of analysis
• Part-of-Speech (POS) tags – grammatical categories of words
• Sentence structure – the arrangement of words in phrases and clauses
Grammatical dilemmas occur when this data exhibits structural complexity or
ambiguity, requiring careful analysis to determine the intended meaning.

2. Ubiquitous Ambiguity
Ambiguity is a situation where a sentence or phrase can have multiple
interpretations. It is ubiquitous (common) in natural language and arises due to:
(a) Lexical Ambiguity
• A single word has multiple meanings.
• Example:
“He saw the bat.”
o Bat → an animal or a piece of sports equipment?
(b) Syntactic Ambiguity
• The structure of the sentence allows more than one interpretation.
• Example:
“I saw the man with a telescope.”
o Did I use a telescope to see the man?
o Or does the man have a telescope?
(c) Semantic Ambiguity
• Meaning of a sentence is unclear due to word interactions.
• Example:
“Visiting relatives can be annoying.”
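o Are relatives visiting me, or am I visiting relatives? (The two readings also
differ in structure: “visiting” either modifies “relatives” or takes it as its object.)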

3. How Grammatical Dilemmas Are Used in Sentence Analysis


Grammatical dilemmas are analyzed to improve parsing and understanding:
1. Parsing Sentences
o Ambiguities are detected during parsing.
o Multiple parse trees may be generated for a single sentence.
2. Disambiguation
o Context, POS tags, and grammatical rules are used to choose the correct
interpretation.
3. Improving NLP Systems
o Helps in building POS taggers, parsers, and machine translation
systems that can handle ambiguous sentences.
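
The claim that one sentence may receive several parse trees can be checked by counting them. The sketch below uses a CYK-style chart over a small hand-written grammar in Chomsky normal form; the grammar, the lexicon, and the function name are assumptions chosen to cover only the telescope sentence.

```python
from collections import defaultdict

# Toy binary rules (parent, left child, right child); illustrative only.
BINARY = [
    ("S", "NP", "VP"),
    ("VP", "V", "NP"),
    ("VP", "VP", "PP"),   # PP attaches to the verb phrase
    ("NP", "Det", "N"),
    ("NP", "NP", "PP"),   # PP attaches to the noun phrase
    ("PP", "P", "NP"),
]
LEXICON = {
    "I": ["NP"], "saw": ["V"], "the": ["Det"], "a": ["Det"],
    "man": ["N"], "telescope": ["N"], "with": ["P"],
}


def count_parses(sentence, start="S"):
    """Count the distinct parse trees the toy grammar assigns to a sentence."""
    words = sentence.split()
    n = len(words)
    chart = defaultdict(int)   # (i, j, symbol) -> number of subtrees
    for i, word in enumerate(words):
        for symbol in LEXICON.get(word, []):
            chart[(i, i + 1, symbol)] += 1
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):          # split point
                for parent, left, right in BINARY:
                    chart[(i, j, parent)] += (chart[(i, k, left)]
                                              * chart[(k, j, right)])
    return chart[(0, n, start)]


print(count_parses("I saw the man"))
print(count_parses("I saw the man with a telescope"))
```

The second sentence gets two trees because the prepositional phrase can attach either to the verb phrase or to the noun phrase; a disambiguation step then has to pick one.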

4. Example of Analysis
Sentence:
“Flying planes can be dangerous.”
• Interpretation 1: The act of flying planes is dangerous.
• Interpretation 2: Planes that are flying can be dangerous.
Analysis involves linguistic data (words, POS tags) and applying grammatical
rules to resolve ambiguity.

2. How does the use of syntax help in analyzing sentences?

How the Use of Syntax Helps in Analyzing Sentences


1. Definition of Syntax
Syntax is the branch of linguistics that studies the rules governing the structure
of sentences. It determines how words combine to form phrases and sentences.
In NLP, syntax is essential for analyzing and understanding text, as it provides the
structural framework of language.

2. Role of Syntax in Sentence Analysis


The use of syntax helps in systematic analysis of sentences through the following
ways:
(a) Structuring Sentences
• Syntax identifies constituents like noun phrases (NP), verb phrases (VP),
adjective phrases (ADJP), and prepositional phrases (PP).
• It creates a hierarchical representation of sentences.
Example:
"The cat chased the mouse."
• NP → The cat
• VP → chased the mouse
(b) Resolving Ambiguity
• Sentences in natural language often have multiple interpretations.
• Syntax helps disambiguate meaning by showing grammatical relationships.
Example:
"I saw the man with a telescope."
• Does with a telescope modify saw or man?
• Syntactic parsing clarifies the correct structure.
(c) Identifying Grammatical Roles
• Syntax helps in determining subject, predicate, object, modifiers, and
complements.
• Example:
"She gave him a gift."
• Subject → She
• Verb → gave
• Direct Object → a gift
• Indirect Object → him
(d) Enabling Automated Parsing
• Syntax allows building parse trees (constituency trees) or dependency trees to
represent sentence structure.
• These representations are crucial for computational analysis.
(e) Supporting Semantic Understanding
• Understanding syntax helps infer meaning and relationships between entities in
a sentence.
• For example, “The dog bit the man” vs “The man bit the dog”
• Syntax determines who did what to whom.

3. Methods for Syntactic Analysis


1. Constituency (Phrase Structure) Parsing
o Divides a sentence into nested phrases.
o Uses context-free grammar (CFG) rules.
o Example:
o S
o ├── NP: The cat
o └── VP: chased the mouse
2. Dependency Parsing
o Represents direct relationships between words (head → dependent).
o Shows which word governs others in the sentence.
o Example: chased → cat (subject), chased → mouse (object)
3. Grammar-Based Approaches
o Uses rules for sentence construction:
▪ Sentence → NP + VP
▪ NP → Det + Noun
▪ VP → Verb + NP / PP
4. Probabilistic Parsing
o Uses statistical models to choose the most likely parse from ambiguous
possibilities.

4. Applications of Syntax in NLP


• Machine Translation – Ensures correct sentence structure in target language.
• Information Extraction – Identifies subjects, objects, and relationships.
• Question Answering Systems – Determines correct answer by analyzing
grammatical roles.
• Text Summarization – Helps extract key information and preserve sentence
meaning.
• Speech Recognition – Ensures grammatically correct output.

5. Example Analysis
Sentence: “The quick brown fox jumps over the lazy dog.”
• Noun Phrase (NP): The quick brown fox
• Verb Phrase (VP): jumps over the lazy dog
• Prepositional Phrase (PP): over the lazy dog
• Grammatical Roles:
o Subject → fox
o Verb → jumps
o Object/Prepositional object → dog
Syntactic analysis provides a clear representation of relationships between
words, which is essential for both humans and machines to understand the
sentence.

3. Explain the role of Context-Free Grammar (CFG) in analyzing sentences.

Role of Context-Free Grammar (CFG) in Analyzing Sentences


1. Definition of CFG
A Context-Free Grammar (CFG) is a set of rules or productions that define the
syntactic structure of a language.
• Each rule specifies how a non-terminal symbol (like a sentence or phrase) can
be replaced by terminals (words) or other non-terminals.
• CFG is widely used in Natural Language Processing (NLP) for parsing and
analyzing sentence structure.

2. Structure of CFG
A CFG consists of four components:
1. Non-terminal symbols (N): Represent grammatical categories, e.g., S
(sentence), NP (noun phrase), VP (verb phrase).
2. Terminal symbols (Σ): Actual words in the language, e.g., “cat”, “runs”, “the”.
3. Production rules (P): Define how non-terminals can be expanded, e.g.,
S → NP VP
NP → Det N
VP → V NP
4. Start symbol (S): The root from which derivation begins (usually a sentence).

3. Role of CFG in Sentence Analysis


CFG plays a key role in analyzing sentences:
(a) Parsing Sentences
• CFG allows breaking down sentences into hierarchical structures.
• Example: Sentence → NP + VP → (Det + N) + (V + NP)
(b) Generating Parse Trees
• CFG rules are used to generate parse trees that represent syntactic structure.
• Example:
           S
         /   \
       NP     VP
      /  \     |
    Det   N    V
     |    |    |
    The  cat  chased
(c) Detecting Ungrammatical Sentences
• CFG can validate sentence structure by checking if a sentence can be derived
from its rules.
• Example: “Cat the chased dog” → Ungrammatical (does not follow CFG rules)
(d) Disambiguation
• CFG makes structural ambiguities explicit by licensing every possible parse tree;
choosing among the trees then requires context or rule probabilities.
• Example: “I saw the man with a telescope.”
o Multiple parse trees show different interpretations.
(e) Supporting NLP Applications
• Machine Translation – ensures syntactically correct translation.
• Information Extraction – identifies subjects, objects, verbs.
• Question Answering – determines who did what to whom.

4. Example CFG Rules for a Simple Sentence


1. S → NP VP
2. NP → Det N | N
3. VP → V NP | V
4. Det → the | a
5. N → cat | dog | mouse
6. V → chased | saw
Sentence: “The cat chased the mouse”
• Using CFG rules, we can derive the sentence and construct a parse tree.
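
Because this grammar has no recursive rule, every sentence it derives can be enumerated, which gives a direct (if naive) grammaticality check. The sketch below stores the rules in a plain dictionary; the representation and the helper name expand are assumptions, and the approach would not terminate on a recursive grammar.

```python
from itertools import product

# The rules from the list above; a symbol without an entry is a word.
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"], ["N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["cat"], ["dog"], ["mouse"]],
    "V":   [["chased"], ["saw"]],
}


def expand(symbol):
    """Return every word sequence (as a tuple) derivable from the symbol."""
    if symbol not in GRAMMAR:                 # terminal: the word itself
        return [(symbol,)]
    results = []
    for production in GRAMMAR[symbol]:
        parts = [expand(s) for s in production]
        for combo in product(*parts):         # one choice per right-hand symbol
            results.append(tuple(w for piece in combo for w in piece))
    return results


LANGUAGE = {" ".join(words) for words in expand("S")}

print(len(LANGUAGE))
print("the cat chased the mouse" in LANGUAGE)
print("cat the chased dog" in LANGUAGE)
```

The words are lower-cased here because the rules list "the" rather than "The". Real parsers never enumerate the language; they search for a derivation of the given sentence instead.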

5. Advantages of Using CFG in NLP


• Provides a formal and systematic method to describe sentence structure.
• Can handle recursive structures (e.g., nested phrases).
• Forms the basis of syntactic parsers and computational models.
• Helps in both sentence generation and analysis.

6. Conclusion
Context-Free Grammar (CFG) is essential in NLP for analyzing sentences. It
helps in:
• Structuring sentences into phrases and sub-phrases
• Generating parse trees
• Validating grammatical correctness
• Resolving syntactic ambiguities
CFG bridges the gap between linguistic theory and computational language
processing, making it fundamental for sentence analysis and advanced NLP
tasks.

4. Explain recursive descent parsing.

Recursive Descent Parsing


1. Definition
Recursive Descent Parsing is a top-down parsing technique used in Natural
Language Processing (NLP) and compiler design to analyze the syntactic
structure of sentences.
• It involves a set of recursive procedures where each procedure corresponds to
a non-terminal symbol in the grammar.
• The parser attempts to match the input sentence against the production rules
of a context-free grammar (CFG).

2. Working Principle
1. Start from the start symbol of the grammar (usually S for sentence).
2. For each non-terminal, call a recursive procedure that tries to expand it using
CFG rules.
3. Compare the terminals in the input sentence with the expected terminals in the
grammar.
4. If a match is found, continue; if not, backtrack to try another production rule.
5. Parsing succeeds if the entire input sentence is consumed and matches the
grammar; otherwise, it fails.

3. Example Grammar
Consider a simple CFG:
S → NP VP
NP → Det N
VP → V NP
Det → the | a
N → cat | dog
V → chased | saw
Sentence: “the cat chased the dog”
Parsing Steps:
1. Start with S → NP VP
2. NP → Det N → the cat (matches input)
3. VP → V NP → chased the dog (matches input)
4. Parsing successful → sentence is grammatical
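
The steps above can be sketched as a small backtracking recursive descent parser. This is an illustrative implementation under stated assumptions: the grammar is stored in a dictionary, trees are nested tuples, and the function names parse, expand, and recognize are made up for this example. Generators provide the backtracking: when one alternative fails, the next one is tried automatically.

```python
# The example grammar; a symbol without an entry is a terminal (a word).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"], ["a"]],
    "N":   [["cat"], ["dog"]],
    "V":   [["chased"], ["saw"]],
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for each way symbol matches tokens from pos."""
    if symbol not in GRAMMAR:                       # terminal: compare with input
        if pos < len(tokens) and tokens[pos] == symbol:
            yield symbol, pos + 1
        return
    for production in GRAMMAR[symbol]:              # try each alternative in turn
        yield from expand(symbol, production, tokens, pos)


def expand(symbol, production, tokens, pos):
    """Match the symbols of one production from left to right."""
    partial = [([], pos)]
    for part in production:
        partial = [(children + [subtree], nxt)
                   for children, p in partial
                   for subtree, nxt in parse(part, tokens, p)]
    for children, nxt in partial:
        yield (symbol, *children), nxt


def recognize(sentence):
    """Return a parse tree if the whole sentence is derivable from S, else None."""
    tokens = sentence.split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):                      # all input consumed
            return tree
    return None


print(recognize("the cat chased the dog"))
print(recognize("cat the chased dog"))
```

A left-recursive rule such as NP → NP PP would make parse call itself at the same position forever, which is the limitation usually cited for this technique.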

4. Features of Recursive Descent Parsing


1. Top-Down Approach
o Starts from the root symbol (S) and tries to reach the leaves (terminals).
2. Recursive Procedure for Each Non-Terminal
o Each non-terminal in the grammar is implemented as a function or
procedure that calls itself or other procedures.
3. Backtracking
o If a rule fails to match, the parser backtracks and tries the next
alternative rule.
4. Parse Tree Construction
o A parse tree can be built naturally during recursion, showing the
hierarchical structure of the sentence.

5. Advantages
• Simple and intuitive for small grammars.
• Easy to implement by hand or in code.
• Can produce parse trees directly.

6. Disadvantages
• Not suitable for left-recursive grammars (can cause infinite recursion).
• Inefficient for large grammars due to extensive backtracking.
• Cannot handle all context-free grammars without modifications.

7. Applications
• Syntactic Analysis in NLP – analyzing sentence structure.
• Compiler Design – parsing programming languages.
• Grammar Checking Tools – verifying sentence correctness.
• Machine Translation – parsing source sentences before translation.

8. Diagram (Conceptual)
Parse Tree for "the cat chased the dog":
            S
         /      \
    NP              VP
   /  \           /    \
 Det    N       V        NP
  |     |       |       /  \
 the   cat    chased  Det    N
                       |     |
                      the   dog

5. Explain shift-reduce parsing.


Shift-Reduce Parsing

1. Definition

Shift-Reduce Parsing is a bottom-up parsing technique used in Natural
Language Processing (NLP) and compiler design.
• It constructs the parse tree from the leaves (input words) up to the root (start
symbol).

• The parser uses a stack to hold grammar symbols and an input buffer to hold
remaining words.

• It operates by repeatedly performing shift and reduce operations until the input
is fully parsed.

2. Key Concepts

1. Shift

o Moves (shifts) the next input word from the input buffer to the stack.

2. Reduce

o Replaces a sequence of symbols on the stack that matches the right-hand
side of a grammar rule with the left-hand side non-terminal.

3. Stack

o Keeps track of partially processed symbols during parsing.

4. Input Buffer

o Holds the remaining words of the sentence to be parsed.

3. How Shift-Reduce Parsing Works

Step-by-Step Process:

1. Initialize stack as empty and input buffer with the sentence.

2. Repeat until the stack contains the start symbol and input is empty:

o Shift: Push the next input symbol onto the stack.

o Reduce: If the top elements of the stack match the right-hand side of a
grammar rule, replace them with the left-hand side non-terminal.

4. Example

Grammar:

S → NP VP
NP → Det N

VP → V NP

Det → the

N → cat | dog

V → chased

Sentence: “the cat chased the dog”

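Stepwise Parsing Table:

Stack               Input                    Action
[]                  the cat chased the dog   Shift
[the]               cat chased the dog       Reduce Det → the
[Det]               cat chased the dog       Shift
[Det, cat]          chased the dog           Reduce N → cat
[Det, N]            chased the dog           Reduce NP → Det N
[NP]                chased the dog           Shift
[NP, chased]        the dog                  Reduce V → chased
[NP, V]             the dog                  Shift
[NP, V, the]        dog                      Reduce Det → the
[NP, V, Det]        dog                      Shift
[NP, V, Det, dog]   []                       Reduce N → dog
[NP, V, Det, N]     []                       Reduce NP → Det N
[NP, V, NP]         []                       Reduce VP → V NP
[NP, VP]            []                       Reduce S → NP VP
[S]                 []                       Accept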

Parsing complete → Sentence is grammatical.
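
The shift and reduce operations can be sketched in a few lines of Python. The version below is a simplified, greedy parser: it reduces whenever the top of the stack matches the right-hand side of a rule and shifts otherwise. That strategy happens to work for this small grammar, but it is an assumption of the sketch; practical parsers use lookahead or parsing tables to decide between shifting and reducing. Each step is recorded as (stack, remaining input, action), including the lexical reductions such as Det → the.

```python
# Rules as (left-hand side, right-hand side); order sets reduce priority.
RULES = [
    ("S", ("NP", "VP")),
    ("NP", ("Det", "N")),
    ("VP", ("V", "NP")),
    ("Det", ("the",)),
    ("N", ("cat",)),
    ("N", ("dog",)),
    ("V", ("chased",)),
]


def shift_reduce(sentence):
    """Greedy shift-reduce parse; returns (accepted, list of trace steps)."""
    stack, buffer, trace = [], sentence.split(), []
    while True:
        for lhs, rhs in RULES:
            if tuple(stack[-len(rhs):]) == rhs:      # top of stack matches a rule
                action = "Reduce " + lhs + " → " + " ".join(rhs)
                trace.append((list(stack), list(buffer), action))
                stack[-len(rhs):] = [lhs]            # replace RHS with LHS
                break
        else:                                        # no reduction was possible
            if not buffer:
                break                                # nothing left to shift
            trace.append((list(stack), list(buffer), "Shift"))
            stack.append(buffer.pop(0))
    return stack == ["S"], trace


accepted, steps = shift_reduce("the cat chased the dog")
for stack, buffer, action in steps:
    print(stack, buffer, action)
print("accepted:", accepted)
```

Parsing succeeds only if the stack ends up holding just the start symbol S with the input buffer empty.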

5. Advantages

• Simple and efficient for bottom-up parsing.

• Can handle left-recursive grammars (unlike recursive descent).

• Naturally constructs parse trees from leaves to root.

6. Disadvantages

• Cannot handle all ambiguous grammars without modifications.

• Requires careful conflict resolution:


o Shift-Reduce Conflict: Parser is unsure whether to shift or reduce.

o Reduce-Reduce Conflict: Parser is unsure which reduction to apply.

• Needs lookahead tokens to resolve conflicts in complex grammars.

7. Applications

• Syntactic Analysis in NLP – identifying grammatical structure.

• Compilers – parsing programming languages.

• Machine Translation – bottom-up construction of parse trees for translation.

• Grammar Checking – detecting syntactic errors in sentences.

8. Parse Tree Illustration

For “the cat chased the dog”:

            S
         /      \
    NP              VP
   /  \           /    \
 Det    N       V        NP
  |     |       |       /  \
 the   cat    chased  Det    N
                       |     |
                      the   dog

6. Explain left corner parser.

1. Definition

Left Corner Parsing is a hybrid parsing technique used in Natural Language
Processing (NLP) that combines top-down and bottom-up parsing strategies.
• The left corner of a production rule is the first (leftmost) symbol on its right-hand
side. The parser recognizes this symbol in the input and works its way up to the root.

• This approach helps in efficiently predicting the structure of a sentence while
avoiding some limitations of pure top-down or bottom-up parsers.

2. Working Principle

1. Left Corner Identification

o The parser identifies the leftmost symbol (left corner) of a grammar
production.

o This symbol may be a terminal (actual word) or non-terminal (phrase
category).

2. Bottom-Up Recognition

o The parser first recognizes the left corner in the input sentence, similar to
bottom-up parsing.

3. Top-Down Prediction

o After recognizing the left corner, the parser predicts the higher-level
constituents (like NP, VP, S) that can generate it, similar to top-down
parsing.

4. Building Parse Trees

o By combining bottom-up recognition and top-down expectation, the
parser constructs the parse tree efficiently.

3. Advantages of Left Corner Parsing

• Handles left recursion naturally, unlike top-down recursive parsers.

• Reduces unnecessary backtracking compared to pure top-down parsers.

• Can predict sentence structure early using leftmost symbols.

• Works well with ambiguous sentences, producing multiple parse trees if
needed.

4. Example

Grammar:
S → NP VP

NP → Det N

VP → V NP

Det → the

N → cat | dog

V → chased

Sentence: “the cat chased the dog”

Parsing Steps Using Left Corner Parsing:

1. Identify left corner of S → NP VP → left corner = NP

2. Identify left corner of NP → Det N → left corner = Det → matches the

3. Recognize NP → the cat

4. Recognize VP → chased the dog

5. Combine NP + VP → S → parse tree complete
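
Steps 1 and 2 rely on knowing which symbols can appear as the left corner of a category, directly or through a chain of rules. The sketch below computes that left-corner relation for the example grammar. It is not a complete left-corner parser, and the dictionary encoding and the function name left_corners are assumptions made for illustration.

```python
# The example grammar; a symbol without an entry is a terminal (a word).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"]],
    "Det": [["the"]],
    "N":   [["cat"], ["dog"]],
    "V":   [["chased"]],
}


def left_corners(symbol):
    """Return every symbol reachable by repeatedly taking the first RHS symbol."""
    seen = set()
    pending = [symbol]
    while pending:
        current = pending.pop()
        for production in GRAMMAR.get(current, []):
            first = production[0]            # direct left corner of this rule
            if first not in seen:
                seen.add(first)
                pending.append(first)        # follow the chain downwards
    return seen


for category in ("S", "NP", "VP"):
    print(category, sorted(left_corners(category)))
```

Since "the" is a left corner of S, a parser that sees "the" as the first word can go up through Det and NP while still expecting a VP to complete S.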

5. Parse Tree Illustration

            S
         /      \
    NP              VP
   /  \           /    \
 Det    N       V        NP
  |     |       |       /  \
 the   cat    chased  Det    N
                       |     |
                      the   dog

6. Applications of Left Corner Parsing

• Syntactic Analysis in NLP – parsing natural language efficiently.

• Machine Translation – helps generate syntactic structures early.


• Speech Recognition – predicts sentence structures for correct interpretation.

• Grammar Checking – detects grammatical errors in real-time.

7. Explain treebanks and the use of grammars in grammar development.


Treebanks and Grammars in Grammar Development

1. Definition of Treebank

A Treebank is a linguistically annotated corpus in which sentences are parsed into
syntactic structures and represented as parse trees.

• It contains structured syntactic information for each sentence, including parts
of speech (POS), phrases, and hierarchical relationships.

• Example: Penn Treebank is a widely used resource in NLP.

2. Structure of a Treebank

A typical treebank contains:

1. Sentence Text – the original sentence.

2. Tokenization – splitting sentences into words or tokens.

3. POS Tags – grammatical categories assigned to each word.

4. Parse Trees – hierarchical representation showing phrases and sub-phrases
(NP, VP, PP, etc.).

5. Syntactic Labels – indicating grammatical relationships like subject, object,
modifiers.

Example Sentence:
"The cat chased the mouse."

Parse Tree Representation:

            S
         /      \
    NP              VP
   /  \           /    \
 Det    N       V        NP
  |     |       |       /  \
 The   cat    chased  Det    N
                       |     |
                      the  mouse

3. Role of Grammars in Grammar Development

Grammar provides a set of rules that define how words combine to form valid
sentences.

• In NLP, grammars are used to generate, parse, and validate sentence
structures.

• Types of grammars in NLP:

1. Context-Free Grammar (CFG) – rules like S → NP VP

2. Probabilistic CFG (PCFG) – CFG with probabilities for ambiguous rules

3. Dependency Grammar – focuses on relationships between words (head
→ dependent)

4. How Treebanks Help in Grammar Development

1. Training Data for Parsers

o Treebanks provide annotated parse trees used to train statistical
parsers or machine learning models.

2. Grammar Induction

o Patterns in treebanks are analyzed to extract grammar rules
automatically.

o Example: Observing NP → Det N in multiple sentences leads to a
generalized rule.

3. Handling Ambiguity

o Treebanks record the parse that annotators judged correct for each
ambiguous sentence in its context.

o The resulting rule frequencies help in developing probabilistic grammars that
choose the most likely structure.

4. Evaluation
o Treebanks serve as gold-standard references to evaluate new grammar
rules and parsing algorithms.

5. Lexicon Development

o Annotated words in treebanks help build dictionaries with POS tags,
useful for grammar construction.
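
Grammar induction from a treebank can be sketched by reading the productions off each tree and counting them. Dividing each count by the total for its left-hand side gives the rule probabilities of a simple PCFG. The two-tree "treebank" below is invented for illustration, and the nested-tuple encoding and function names are assumptions; real treebanks such as the Penn Treebank use bracketed text files and far more data.

```python
from collections import Counter

# A made-up two-sentence treebank; trees are (label, child, ...) tuples.
TREEBANK = [
    ("S", ("NP", ("Det", "the"), ("N", "cat")),
          ("VP", ("V", "chased"),
                 ("NP", ("Det", "the"), ("N", "mouse")))),
    ("S", ("NP", ("N", "dogs")),
          ("VP", ("V", "bark"))),
]


def productions(tree):
    """Yield (lhs, rhs) for every rule used in the tree."""
    label, *children = tree
    if all(isinstance(child, str) for child in children):
        yield label, tuple(children)                 # lexical rule, e.g. N → cat
        return
    yield label, tuple(child[0] for child in children)
    for child in children:
        yield from productions(child)


def induce_pcfg(treebank):
    """Estimate rule probabilities by relative frequency."""
    rule_counts = Counter(rule for tree in treebank for rule in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {rule: count / lhs_counts[rule[0]]
            for rule, count in rule_counts.items()}


PCFG = induce_pcfg(TREEBANK)
for (lhs, rhs), prob in sorted(PCFG.items()):
    print(f"{lhs} → {' '.join(rhs)}  [{prob:.2f}]")
```

In this toy data NP → Det N is seen twice and NP → N once, so the induced grammar prefers the first rule whenever both could apply.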

5. Advantages of Using Treebanks in Grammar Development

• Provides large-scale annotated linguistic data.

• Supports statistical and rule-based grammar extraction.

• Enables automatic parser training and evaluation.

• Handles complex sentence structures like recursion and coordination.

6. Applications

• Syntactic Parsing – building parsers for NLP applications.

• Machine Translation – using parse structures to translate accurately.

• Grammar Checking – automatic detection of ungrammatical sentences.

• Information Extraction – extracting entities and relations from structured parse
trees.
