Spelling Error Detection in NLP

The document discusses various applications of Natural Language Processing (NLP), focusing on Information Extraction (IE), Named Entity Recognition (NER), and spell correction techniques. It outlines the goals and methods of IE, the importance of NER in classifying entities in text, and the challenges of correcting real-word spelling errors using contextual information and machine learning. Additionally, it highlights the use of language models and semantic relationships in improving spelling correction systems.

Uploaded by

janiita786

NLP Applications

(Information Extraction, Named Entity Recognition and Spell Corrections)
Lecture # 5
Information Extraction (IE)
IE systems extract clear, factual information
• E.g.,
• Gathering earnings, profits, board members, headquarters, etc.
from company reports
• The headquarters of BHP Billiton Limited, and the global
headquarters of the combined BHP Billiton Group, are located in
Melbourne, Australia.
• headquarters(“BHP Billiton Limited”, “Melbourne, Australia”)
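The step from the sentence to the headquarters(...) fact can be sketched with a hand-written pattern. This is only a minimal illustration that assumes a single sentence template; the name extract_headquarters and the regular expression are hypothetical, and practical IE systems rely on far more robust methods.

```python
import re

# Hypothetical pattern for one sentence template:
# "The headquarters of <ORG>, ... are located in <LOCATION>."
HQ_PATTERN = re.compile(
    r"headquarters of (?P<org>[A-Z][A-Za-z ]+?),"
    r".*?(?:is|are) located in "
    r"(?P<loc>[A-Z][A-Za-z]+(?:, [A-Z][A-Za-z]+)*)"
)

def extract_headquarters(sentence):
    """Return (organization, location) if the template matches, else None."""
    match = HQ_PATTERN.search(sentence)
    if match is None:
        return None
    return match.group("org"), match.group("loc")

sentence = (
    "The headquarters of BHP Billiton Limited, and the global "
    "headquarters of the combined BHP Billiton Group, are located in "
    "Melbourne, Australia."
)
print(extract_headquarters(sentence))
```

A single pattern like this breaks as soon as the wording changes, which is why IE is usually treated as a learning problem rather than a collection of regular expressions.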
Information Extraction (IE)
The goal of IE is to map a document corpus to some structured
database/format
Benefit: complex searches, e.g.
Find me all teaching jobs in Taxila paying at least Rs. 50K
How the job nature and requirements have changed over
the years
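Once the text has been mapped to structured records, a search such as the Taxila example becomes a simple filter. The sketch below assumes a hypothetical JobRecord schema and invented toy records standing in for the output of an IE system.

```python
from dataclasses import dataclass

@dataclass
class JobRecord:
    # Hypothetical schema for one extracted job posting.
    title: str
    category: str
    city: str
    monthly_salary_rs: int

# Toy records standing in for the output of an IE system (invented data).
jobs = [
    JobRecord("Lecturer in Computer Science", "teaching", "Taxila", 65000),
    JobRecord("School Teacher", "teaching", "Taxila", 40000),
    JobRecord("Software Engineer", "engineering", "Taxila", 90000),
    JobRecord("Assistant Professor", "teaching", "Islamabad", 120000),
]

def find_jobs(records, category, city, min_salary_rs):
    """Filter structured job records by category, city and minimum salary."""
    return [
        r for r in records
        if r.category == category
        and r.city == city
        and r.monthly_salary_rs >= min_salary_rs
    ]

# "Find me all teaching jobs in Taxila paying at least Rs. 50K"
matches = find_jobs(jobs, "teaching", "Taxila", 50_000)
print([r.title for r in matches])
```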
Information Extraction with Natural Language Understanding
Information Extraction
Named Entity Recognition (NER)
• A very important sub-task: find and classify names in text (Person, Date, Location, Organization), for example:
• The decision by the independent MP Andrew Wilkie to withdraw his support for the minority Labor government sounded dramatic but it should not further threaten its stability. When, after the 2010 election, Wilkie, Rob Oakeshott, Tony Windsor and the Greens agreed to support Labor, they gave just two guarantees: confidence and supply
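To make "find and classify" concrete, here is a minimal sketch that tags the example text with a tiny hand-made name list plus a pattern for years. The gazetteer entries, their labels and the function tag_entities are assumptions for illustration; real NER systems are typically trained models, not lookup tables.

```python
import re

# Tiny hand-made gazetteer (an assumption for illustration only).
GAZETTEER = {
    "Andrew Wilkie": "Person",
    "Rob Oakeshott": "Person",
    "Tony Windsor": "Person",
    "Wilkie": "Person",
    "Labor": "Organization",
    "Greens": "Organization",
}
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years as Date

def tag_entities(text):
    """Return (start offset, surface form, label) triples sorted by position."""
    found = []
    taken = set()  # character offsets already covered by a longer name
    # Try longer names first so "Andrew Wilkie" wins over "Wilkie".
    for name in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            span = range(m.start(), m.end())
            if taken.isdisjoint(span):
                taken.update(span)
                found.append((m.start(), name, GAZETTEER[name]))
    for m in YEAR.finditer(text):
        found.append((m.start(), m.group(), "Date"))
    return sorted(found)

text = ("When, after the 2010 election, Wilkie, Rob Oakeshott, "
        "Tony Windsor and the Greens agreed to support Labor")
for _, surface, label in tag_entities(text):
    print(surface, "->", label)
```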
Free writing
Free writing is a task in which the writer writes while
ignoring grammatical and spelling mistakes.

Does ChatGPT support free writing?


Spell Correction Applications Today
Spell correction is used in many of our day-to-day activities, such as
◦ word prediction while sending a text
◦ spell checker while writing a word document
◦ query prediction in search engines etc.
Spelling Tasks
Spelling Error Detection
Spelling Error Correction:
◦ Autocorrect
◦ hte ---> the
◦ Suggest a correction
◦ Suggestion lists
Spelling checker in Word
Processors
Nearly all word processors have a built-in spelling checker that flags spelling
mistakes.
It also offers a way to correct these mistakes by choosing a
possible alternative from a given list.
To identify spelling mistakes, most spellcheckers check each word of the
written text separately against the words stored in a dictionary.
If the word is found in the dictionary, it is considered correct
regardless of its context.
This approach is efficient for identifying non-word spelling mistakes, but
other mistakes cannot be identified with it.
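The dictionary lookup described above can be sketched in a few lines. The word list here is a toy assumption and flag_non_words is a hypothetical helper; the second call shows why a context-free lookup misses real-word mistakes.

```python
import re

# Toy dictionary; a real spellchecker would load a large word list.
DICTIONARY = {"i", "want", "to", "eat", "a", "piece", "peace", "of", "cake"}

def flag_non_words(text, dictionary=DICTIONARY):
    """Return the words of text that are not found in the dictionary."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in dictionary]

# A non-word mistake is flagged ...
print(flag_non_words("I want to eat a peice of cake"))
# ... but a real-word mistake passes, because "peace" is in the dictionary.
print(flag_non_words("I want to eat a peace of cake"))
```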
Types of spelling errors
• Non-word Errors
◦ graffe ---> giraffe
• Real-word Errors
◦ Typographical errors
three ---> there
◦ Cognitive Errors (homophones)
piece ---> peace,
too ---> two
Non-word spelling errors
• Non-word spelling error detection:
• Any word not in a dictionary is an error
• The larger the dictionary the better
• Non-word spelling error correction:
◦ Edit Distance Algorithms (already covered in Assignment # 1)
◦ Language Models (N-Gram) (in upcoming lectures)
◦ Spell-checking library in Python (NLTK)
◦ Contextual Correction (used in ChatGPT and BERT)
◦ Machine Learning Approaches
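As a reminder of the edit-distance option, the sketch below computes the standard Levenshtein distance (unit cost for insert, delete and substitute) and uses it to rank dictionary words for a misspelling. The dictionary, the max_distance threshold and the suggest helper are assumptions for illustration, not the assignment's reference solution.

```python
def edit_distance(source, target):
    """Levenshtein distance with unit-cost insert, delete and substitute."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

def suggest(word, dictionary, max_distance=2):
    """Rank dictionary words within max_distance of word, closest first."""
    scored = [(edit_distance(word, w), w) for w in dictionary]
    return [w for d, w in sorted(scored) if d <= max_distance]

DICTIONARY = ["giraffe", "graph", "the", "there", "three"]
print(edit_distance("graffe", "giraffe"))
print(suggest("graffe", DICTIONARY))
print(suggest("hte", DICTIONARY))
```

Note that plain Levenshtein distance counts the swap in "hte" as two substitutions; variants that treat a transposition as a single edit exist but are not shown here.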
N-gram similarity measures
Models that assign probabilities to sequences of words are called language
models or LMs.
The simplest model that assigns probabilities to sentences and sequences of words is
the n-gram.
An n-gram is a sequence of n words: a 2-gram (which we’ll call a bigram) is
a two-word sequence of words like “please turn”, “turn your”, or “your
homework”, and a 3-gram (a trigram) is a three-word sequence of words like
“please turn your”, or “turn your homework”.
This will be discussed in detail in upcoming lectures.
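As a small preview, the sketch below extracts the bigrams and trigrams of the example phrase and computes an unsmoothed maximum-likelihood bigram probability. The helper names and the toy corpus are assumptions; smoothing and other details are left to the later lectures.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "please turn your homework".split()
print(ngrams(tokens, 2))  # bigrams
print(ngrams(tokens, 3))  # trigrams

def bigram_probability(prev_word, word, corpus_tokens):
    """Unsmoothed estimate of P(word | prev_word) from bigram counts."""
    bigram_counts = Counter(ngrams(corpus_tokens, 2))
    # Count prev_word only where it starts a bigram.
    context_count = sum(
        count for (first, _), count in bigram_counts.items()
        if first == prev_word
    )
    if context_count == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / context_count

# Toy corpus (invented): "turn" is followed once by "your" and once by "in".
corpus = "please turn your homework please turn in".split()
print(bigram_probability("turn", "your", corpus))
```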
Real word spelling errors
Solving real-word errors in Natural Language Processing
(NLP) involves addressing issues related to
◦ Homophones
◦ Misspellings
◦ other errors that commonly occur in text
Real-word errors can affect the accuracy and
comprehensibility of NLP systems, so it's essential to handle
them effectively.
Cont.……
Real-word spelling mistakes are words that are correctly spelled but are not
intended by the user.
Mistakes in this category go unrecognized by most spellcheckers
because they handle only non-word spelling mistakes, by checking against the
dictionary word list.
To identify real-word spelling mistakes, we need to use the
neighboring contextual information of the target word.
An example of such a sentence is “I want to eat a piece of cake”, where the confusion
set is (piece, peace). To identify that ‘peace’ cannot be used in
this case, we use the neighboring context word ‘cake’, which supports the word
‘piece’.
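The piece/peace example can be sketched as a context vote: each member of the confusion set is scored by how often it co-occurs with the neighboring words, and the best-scoring member is kept. The confusion sets and co-occurrence counts below are invented toy values (in practice they would be estimated from a large corpus), and the function names are hypothetical.

```python
# Hypothetical confusion sets and co-occurrence counts (invented toy numbers).
CONFUSION_SETS = [{"piece", "peace"}, {"too", "two"}, {"there", "three"}]
COOCCURRENCE = {
    ("piece", "cake"): 120, ("piece", "eat"): 40,
    ("peace", "cake"): 1,   ("peace", "war"): 200,
}

def context_score(candidate, context_words):
    """Sum how often candidate co-occurs with each context word."""
    return sum(COOCCURRENCE.get((candidate, c), 0) for c in context_words)

def correct_real_words(tokens, window=3):
    """Replace each confusable word by the best-scoring member of its set."""
    corrected = list(tokens)
    for i, token in enumerate(tokens):
        for confusion_set in CONFUSION_SETS:
            if token in confusion_set:
                # Neighboring words on both sides of the target word.
                context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
                # On a tie, keep the word the user typed.
                corrected[i] = max(
                    sorted(confusion_set),
                    key=lambda w: (context_score(w, context), w == token),
                )
    return corrected

print(correct_real_words("i want to eat a peace of cake".split()))
```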
Need A Solution …
Correcting real-word errors
Machine Learning
◦ Relies on features extracted from an annotated training set and on the learning method

Semantic Information
◦ Real-word error checking is based on contextual semantic relations,
assuming that the right word has a strong semantic connection to its context,
while a real-word error does not have such a semantic association.
N-Gram Statistical Language Model
◦ This correction approach relies on a huge N-gram statistical model; capturing
longer-range semantic relations is difficult.
Solution?
An automatic spelling correction system identifies real-word
errors by semantic analysis of the surrounding context.
More complex error-detection systems may be used to
detect words that are correctly spelled but do not fit into the
syntactic or semantic context.
Neural networks can be trained on large textual corpora that are
already available.
Cont.…….
Correction using Trigrams
There are two ways to create a confusion set: it
can be generated in advance or at runtime.
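Both ways of obtaining a confusion set, and the trigram scoring itself, can be sketched as follows. The prebuilt sets, the toy corpus and all function names are assumptions for illustration; the runtime variant simply collects vocabulary words that are one edit (delete, transpose, replace or insert) away from the typed word.

```python
import string
from collections import Counter

# Way 1: confusion sets prepared in advance (hand-written toy example).
PREBUILT_SETS = {"piece": {"piece", "peace"}, "peace": {"piece", "peace"}}

def runtime_confusion_set(word, vocabulary):
    """Way 2: build the set at runtime from vocabulary words one edit away."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in letters}
    inserts = {a + c + b for a, b in splits for c in letters}
    candidates = deletes | transposes | replaces | inserts
    return (candidates & vocabulary) | {word}

def trigram_score(tokens, position, candidate, trigram_counts):
    """Sum the counts of every trigram that covers the candidate."""
    trial = tokens[:position] + [candidate] + tokens[position + 1:]
    score = 0
    for start in range(position - 2, position + 1):
        if start >= 0 and start + 3 <= len(trial):
            score += trigram_counts[tuple(trial[start:start + 3])]
    return score

def correct_word(tokens, position, trigram_counts, vocabulary):
    """Pick the confusion-set member with the highest trigram score."""
    word = tokens[position]
    confusion_set = PREBUILT_SETS.get(word) or runtime_confusion_set(word, vocabulary)
    # On a tie, keep the word the user typed.
    return max(
        sorted(confusion_set),
        key=lambda c: (trigram_score(tokens, position, c, trigram_counts), c == word),
    )

# Toy corpus (invented) used to collect the vocabulary and trigram counts.
CORPUS = [
    "there is a book on the table",
    "there is a cat in the garden",
    "i have three cats",
]
trigram_counts = Counter()
vocabulary = set()
for sentence in CORPUS:
    words = sentence.split()
    vocabulary.update(words)
    trigram_counts.update(zip(words, words[1:], words[2:]))

print(runtime_confusion_set("three", vocabulary))
print(correct_word("three is a cat".split(), 0, trigram_counts, vocabulary))
```

Prebuilt sets are cheap to look up but only cover pairs someone listed; runtime generation covers any word at the cost of extra computation per token.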
Correction using Machine Learning Techniques
Machine learning is one of the most widely used approaches to
NLP tasks; for example, part-of-speech tagging is used to help correct real-word
spelling errors. In this method, lexical disambiguation is
considered to be the main obstacle, and the ambiguity is resolved using
confusion sets.
Correction using Semantic
Relationships
The semantic relationship method is a correction method in which the meaning of a word is
analyzed with respect to the sentence.
The method is based on the observation that the meaning of each word should be consistent with
the surrounding words of the sentence.
Some real-word errors cannot be solved easily, such as malapropisms, which disturb the
coherence of the text.
For these malapropism errors, the spell checker works in two stages. In the first stage it
finds the suspects: a word that does not seem to be related to the other words is
considered a suspect.
The words belonging to the suspect group are discarded, and in the second stage
the most likely words are identified from the remaining words.
To represent the semantic relationships between the words used in the text, the noun portion of a
corpus/lexicon of the particular language is used.
Spell-checking library in Python

PySpellChecker
SpaCy
SymSpell
TextBlob