UNIT - II LEXICAL LEVEL
Lexical level: erro r tolerant lexical processing (spelling error correction)-transducers for the design
of morphologic analyzers features-towards syntax: part-of-speech tagging (BRILL,HMM)-efficient
Representations for linguistic resources (lexica, grammars,....)tries and finite state automata.
LEXICAL LEVEL
✓ Lexical analysis is the process of breaking down text into individual words or tokens. It is
the first step in the compilation process and is used to identify the basic building blocks of
a language.
✓ The lexical phase in Natural Language Processing (NLP) involves scanning text and
breaking it down into smaller units such as paragraphs, sentences, and words. This
process, known as tokenization, converts raw text into manageable units called tokens or
lexemes.
Token:
✓ A lexical token is a sequence of characters that can be treated as a unit in the grammar of
the programming languages.
Example of tokens
• Type token (id, number, real, . . . )
• Punctuation tokens (IF, void, return, . . . )
• Alphabetic tokens (keywords)
Example of Non-Tokens
• Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme :
✓ The sequence of characters matched by a pattern to form the corresponding token or a
sequence of input characters that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
Lexical Analyzer Works:
• Input preprocessing: This stage involves cleaning up the input text and preparing it for
lexical analysis. This may include removing comments, whitespace, and other non-
essential characters from the input text.
• Tokenization: This is the process of breaking the input text into a sequence of tokens. This
is usually done by matching the characters in the input text against a set of patterns or
regular expressions that define the different types of tokens.
• Token classification: In this stage, the lexer determines the type of each token. For
example, in a programming language, the lexer might classify keywords, identifiers,
operators, and punctuation symbols as separate token types.
• Token validation: In this stage, the lexer checks that each token is valid according to the
rules of the programming language. For example, it might check that a variable name is a
valid identifier, or that an operator has the correct syntax.
• Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next
stage of compilation or interpretation.
Types of Lexical Analysis
1. Tokenization: Breaking down text into individual words or tokens.
2. Stopword removal: Removing common words like "the", "and", etc. that do not carry
significant meaning.
3. Stemming: Reducing words to their base form (e.g., "running" becomes "run").
4. Lemmatization: Converting words to their base or dictionary form (e.g., "running" becomes
"run").
5. Part-of-Speech (POS) Tagging: Identifying the grammatical category of each word (e.g., noun,
verb, adjective).
1. Tokenization
✓ Tokenization is the process of breaking down text into individual words or tokens. There
are several tokenization techniques, including:
i. Word-based tokenization: Breaking down text into individual words.
ii. Character-based tokenization: Breaking down text into individual characters.
iii. Subword-based tokenization: Breaking down text into subwords (e.g., word pieces).
2. Stopword Removal
✓ Stopwords are common words like "the", "and", etc. that do not carry significant meaning.
Removing stopwords can help reduce the dimensionality of text data and improve the
performance of text processing algorithms.
3. Stemming
✓ Stemming is the process of reducing words to their base form. There are several stemming
algorithms, including:
i. Porter Stemmer: A widely used stemming algorithm that reduces words to their base form.
ii. Snowball Stemmer: A stemming algorithm that uses a combination of rules and
dictionaries to reduce words to their base form.
4. Lemmatization
✓ Lemmatization is the process of converting words to their base or dictionary form. There
are several lemmatization algorithms, including:
i. WordNet: A lexical database that provides lemmatization information for English words.
ii. NLTK: A Python library that provides lemmatization tools for English words.
5. Part-of-Speech (POS) Tagging
✓ POS tagging is the process of identifying the grammatical category of each word. There
are several POS tagging algorithms, including:
i. Rule-based POS tagging: A POS tagging algorithm that uses a set of rules to identify the
grammatical category of each word.
ii. Machine learning-based POS tagging: A POS tagging algorithm that uses machine
learning techniques to identify the grammatical category of each word.
Applications of Lexical Analysis
1. Text classification: Lexical analysis is used to identify the features of text data that are used to
classify text into different categories.
2. Sentiment analysis: Lexical analysis is used to identify the sentiment of text data.
3. Information retrieval: Lexical analysis is used to index and retrieve text data.
4. Machine translation: Lexical analysis is used to translate text from one language to another.
ERROR TOLERANT LEXICAL PROCESSING (SPELLING ERROR CORRECTION)
✓ Error-tolerant lexical processing refers to the ability of a lexical processing system to
handle and recover from errors in the input text.
✓ This is essential in natural language processing (NLP) applications, where input text may
contain errors, ambiguities, or inconsistencies.
Types of Errors:
1. Typographical errors: Spelling mistakes, punctuation errors, or incorrect capitalization.
2. Lexical errors: Incorrect word usage, word order, or grammatical errors.
3. Semantic errors: Errors in meaning or context.
Error-Tolerant Lexical Processing Techniques:
1. Fuzzy matching: Using algorithms to match words or phrases with similar spellings or
meanings.
2. Edit distance: Calculating the minimum number of operations (insertions, deletions,
substitutions) required to transform one word into another.
3. N-gram analysis: Analyzing sequences of n items (e.g., words, characters) to identify patterns
and relationships.
4. Machine learning-based approaches: Using machine learning algorithms to learn patterns and
relationships in text data and correct errors.
Applications of Error-Tolerant Lexical Processing:
1. Spell checking and correction: Identifying and correcting spelling errors in text.
2. Text classification: Classifying text into categories despite errors or inconsistencies.
3. Information retrieval: Retrieving relevant information from text data despite errors or
inconsistencies.
4. Machine translation: Translating text from one language to another despite errors or
inconsistencies.
Tools and Resources:
1. NLTK: A popular Python library for NLP tasks, including error-tolerant lexical processing.
2. spaCy: A modern Python library for NLP tasks, including error-tolerant lexical processing.
3. Stanford CoreNLP: A Java library for NLP tasks, including error-tolerant lexical processing.
Spelling Error Correction:
✓ In Natural Language Processing it’s important that spelling errors should be as less as
possible so that whatever we are making should be highly accurate.
✓ Spelling error correction is the process of identifying and correcting spelling mistakes in
text. This is an essential task in natural language processing (NLP) applications, such as
text processing, information retrieval, and machine translation.
Types of Spelling Errors:
1. Typographical errors: Errors caused by incorrect typing, such as "teh" instead of "the".
2. Phonetic errors: Errors caused by words that sound similar, such as "their" instead of "there".
3. Orthographic errors: Errors caused by incorrect spelling, such as "accomodate" instead of
"accommodate".
Spelling Error Correction Techniques:
1. Rule-based approaches: Using predefined rules to correct spelling errors.
2. Statistical approaches: Using statistical models to predict the correct spelling.
3. Machine learning-based approaches: Using machine learning algorithms to learn patterns and
relationships in text data and correct spelling errors.
4. Hybrid approaches: Combining multiple techniques to correct spelling errors.
Algorithms for Spelling Error Correction:
1. Levenshtein distance: A measure of the minimum number of single-character edits (insertions,
deletions, substitutions) required to transform one word into another.
2. Jaro-Winkler distance: A measure of the similarity between two strings.
3. N-gram analysis: Analyzing sequences of n items (e.g., words, characters) to identify patterns
and relationships.
TRANSDUCERS FOR THE DESIGN OF MORPHOLOGIC ANALYZERS FEATURES
TOWARDS SYNTAX
✓ Transducers are mathematical models used to describe the behavior of systems that process
input sequences and produce output sequences.
✓ In the context of natural language processing (NLP), transducers are used to design
morphologic analyzers that can analyze the internal structure of words and identify their
grammatical features.
Features of Morphologic Analyzers:
1. Morphological analysis: Breaking down words into their constituent parts, such as roots,
prefixes, and suffixes.
2. Part-of-speech (POS) tagging: Identifying the grammatical category of each word, such as
noun, verb, adjective, etc.
3. Feature extraction: Extracting relevant features from words, such as their grammatical features,
semantic roles, and syntactic dependencies.
Transducers for Morphologic Analyzers:
1. Finite-state transducers (FSTs): Mathematical models that describe the behavior of systems
that process input sequences and produce output sequences.
2. Weighted finite-state transducers (WFSTs): FSTs that assign weights to transitions and outputs,
allowing for probabilistic modeling.
3. Morphological transducers: Transducers specifically designed for morphological analysis,
which can handle complex morphological phenomena.
Towards Syntax:
1. Syntactic analysis: Analyzing the grammatical structure of sentences, including phrase structure
and syntactic dependencies.
2. Syntactic features: Extracting relevant features from sentences, such as their grammatical
features, semantic roles, and syntactic dependencies.
3. Integration with morphologic analyzers: Combining morphologic analyzers with syntactic
analyzers to provide a comprehensive analysis of language.
Applications:
1. Language translation
2. Text summarization:
3. Question answering:
Tools and Resources:
1. OpenFST: An open-source library for working with FSTs and WFSTs.
2. Morphological transducers: Specialized transducers designed for morphological analysis,
available in libraries such as OpenFST.
3. Syntax-based machine translation: Systems that use syntactic analysis to improve machine
translation, such as Moses.
PART-OF-SPEECH TAGGING (BRILL,HMM)
✓ POS tagging is the process of identifying the grammatical category of each word in a
sentence, such as noun, verb, adjective, etc.
✓ POS tagging is important to get an idea that which parts of speech does tokens belongs to
i.e whether it is noun, verb, adverb, conjunction, pronoun, adjective, preposition,
interjection, if it is verb then which form and so on….. whether it is plural or singular
and many more conditions.
✓ In simple words process of finding the sequence of tags which is most likely to have
generated a given word sequence.
Why is POS Tagging Important?
1. Syntax Analysis: POS tagging is essential for syntax analysis, as it helps identify the
grammatical structure of a sentence.
2. Semantic Analysis: POS tagging helps identify the meaning of words and their relationships in
a sentence.
3. Information Retrieval: POS tagging can improve the accuracy of information retrieval systems
by allowing for more precise searches.
POS Tagging Techniques:
1. Rule-Based Approaches: Using predefined rules to assign POS tags to words.
2. Statistical Approaches: Using statistical models to predict POS tags based on context and word
frequencies.
3. Machine Learning Approaches: Using machine learning algorithms to learn POS tagging
patterns from labeled data.
POS Tagging Algorithms:
1. Hidden Markov Model (HMM): A statistical model that uses probability distributions to assign
POS tags.
2. Maximum Entropy (ME): A machine learning algorithm that uses a probabilistic framework to
assign POS tags.
3. Support Vector Machines (SVMs): A machine learning algorithm that uses a kernel-based
approach to assign POS tags.
Challenges in POS Tagging:
1. Ambiguity: Words can have multiple possible POS tags, making it challenging to assign the
correct tag.
2. Contextual Dependence: POS tags can depend on the context in which a word is used.
3. Limited Training Data: POS tagging models require large amounts of labeled training data to
achieve high accuracy.
Tools and Resources:
1. NLTK: A popular Python library for NLP tasks, including POS tagging.
2. spaCy: A modern Python library for NLP tasks, including POS tagging.
3. Stanford CoreNLP: A Java library for NLP tasks, including POS tagging.
Brill and HMM:
Brill and HMM are two popular algorithms used in Part-of-Speech (POS) tagging.
Brill's Algorithm:
1. Rule-based approach: Brill's algorithm uses a set of predefined rules to assign POS tags to
words.
2. Transformation-based learning: The algorithm learns to transform the initial POS tags into the
correct ones based on the context.
3. High accuracy: Brill's algorithm is known for its high accuracy, especially for languages with
complex grammar rules.
Hidden Markov Model (HMM):
1. Statistical approach: HMM is a statistical model that uses probability distributions to assign
POS tags to words.
2. Markov chain: HMM represents the sequence of words as a Markov chain, where each state
corresponds to a POS tag.
3. Baum-Welch algorithm: HMM uses the Baum-Welch algorithm to estimate the model
parameters and assign POS tags.
Comparison:
1. Accuracy: Brill's algorithm tends to perform better than HMM, especially for languages with
complex grammar rules.
2. Computational complexity: HMM is generally faster and more efficient than Brill's algorithm.
3. Training data: Brill's algorithm requires a large amount of training data to learn the
transformation rules, while HMM can work with smaller datasets.
Applications:
1. POS tagging: Both Brill's algorithm and HMM are widely used for POS tagging in NLP
applications.
2. Language modeling: HMM can be used for language modeling, while Brill's algorithm is more
suitable for POS tagging.
3. Information retrieval: Both algorithms can be used in information retrieval systems to improve
the accuracy of search results.
EFFICIENT REPRESENTATIONS FOR LINGUISTIC RESOURCES (LEXICA,
GRAMMARS,....)
✓ Efficient representations for linguistic resources, such as lexica and grammars, are crucial
for natural language processing (NLP) applications. These representations enable fast and
accurate processing of linguistic data.
Corpus :
✓ A corpus is a large and structured set of machine-readable texts that have been produced
in a natural communicative setting. Its plural is corpora. They can be derived in different
ways like text that was originally electronic, transcripts of spoken language and optical
character recognition, etc.
✓ Elements of Corpus Design: Language is infinite but a corpus has to be finite in size. For
the corpus to be finite in size, we need to sample and proportionally include a wide range
of text types to ensure a good corpus design.
✓ Corpus are determined by the following two factors:
1. Balance − The range of genre include in a corpus
2. Sampling − How the chunks for each genre are selected.
✓ In NLP, a corpus contains text and speech data that can be used to train AI and machine
learning systems.
✓ If a user has a specific problem or objective, they'll need a collection of data supporting or
at least a representation of what they're looking to achieve.
Types of TreeBank Corpus
✓ Semantic and Syntactic Treebanks are the two most common types of Treebanks in
linguistics.
1. Semantic Treebanks
✓ These Treebanks use a formal representation of sentence’s semantic structure. They vary
in the depth of their semantic representation.
✓ Examples: Robot Commands Treebank, Geoquery, Groningen Meaning Bank, RoboCup
Corpus
Syntactic Treebanks
✓ Opposite to the semantic Treebanks, inputs to the Syntactic Treebank systems are
expressions of the formal language obtained from the conversion of parsed Treebank data.
✓ For example, Penn Arabic Treebank, Columbia Arabic Treebank are syntactic
Treebanks created in Arabia language.
Lexical Representations:
1. Hash tables: Fast lookup and storage of lexical entries.
2. Trie data structures: Efficient storage and retrieval of lexical entries with shared prefixes.
3. Compact binary representations: Reduced memory footprint for large lexica.
Grammar Representations:
1. Context-Free Grammars (CFGs): Efficient representation of grammatical rules and structures.
2. Probabilistic Context-Free Grammars (PCFGs): Incorporate probabilities into CFGs for more
accurate parsing.
3. Lexicalized Grammars: Combine lexical and grammatical information for improved parsing.
Efficient Parsing Algorithms:
1. Chart parsing: Efficient parsing algorithm for CFGs and PCFGs.
2. Left-corner parsing: Fast and efficient parsing algorithm for CFGs.
3. Dynamic programming: Efficient parsing algorithm for PCFGs.
TRIES AND FINITE STATE AUTOMATA
✓ Tries and finite state automata are data structures and algorithms used in natural language
processing (NLP) and computer science to efficiently process and store large amounts of
data.
Tries:
1. Definition: A trie is a tree-like data structure that stores a collection of strings in a way that
allows for efficient retrieval and matching.
2. Properties: Tries are prefix trees, meaning that each node represents a prefix of a string.
3. Operations: Tries support operations such as insertion, deletion, and search.
Finite State Automata (FSA):
1. Definition: An FSA is a mathematical model that recognizes patterns in strings by transitioning
between states.
2. Properties: FSAs are composed of states, transitions, and accepting states.
3. Operations: FSAs support operations such as recognition, parsing, and generation.
Applications:
1. String matching: Tries and FSAs are used in string matching algorithms to efficiently search
for patterns in text.
2. Language modeling: Tries and FSAs are used in language modeling to predict the next word in
a sentence.
3. Text classification: Tries and FSAs are used in text classification to classify text into categories.
Types of Tries:
1. Standard trie: A basic trie data structure.
2. Compressed trie: A trie that stores multiple strings in a single node.
3. Suffix trie: A trie that stores the suffixes of a string.
Types of FSAs:
1. Deterministic FSA (DFA): An FSA that has a single transition for each input symbol.
2. Nondeterministic FSA (NFA): An FSA that has multiple transitions for each input symbol.
3. Weighted FSA (WFSA): An FSA that assigns weights to transitions.
Tools and Resources:
1. NLTK: A popular Python library for NLP tasks that includes implementations of tries and FSAs.
2. spaCy: A modern Python library for NLP tasks that includes implementations of tries and FSAs.
3. OpenFST: An open-source library for working with FSAs and other finite-state machines.