Department of Collegiate and Technical Education Diploma in CS&E
Course: Artificial Intelligence and Machine Learning Code: 20CS51I
WEEK- 11
NATURAL LANGUAGE PROCESSING
Session 1
11.1 UNDERSTANDING NATURAL LANGUAGE PROCESSING
Language is the primary means of communication used by humans. It is the tool we use to express the
greater part of our ideas and emotions. It shapes thought, has a structure, and carries meaning. Learning new
concepts and expressing ideas through them is so natural that we hardly realize how we process natural
language.
“A natural language or ordinary language is any language that has evolved naturally in humans
through use and repetition without conscious planning or premeditation.” In contrast to artificial languages
like Python, C, Java, etc., natural languages like English, Kannada, Hindi, French, etc. have evolved over time
through use, and it is difficult to express them in strict formal rules.
“Natural Language Processing (NLP) is a branch of Artificial Intelligence that helps computers to
understand, interpret and manipulate human languages to analyze and derive its meaning.”
NLP combines computational linguistics, i.e., rule-based modeling of human language, with statistical,
machine learning, and deep learning models. Together these allow computers to process text and spoken words
and to understand human language, intent, and sentiment. It helps developers to organize and structure knowledge
to perform tasks like translation, summarization, named entity recognition, relationship extraction, speech
recognition, topic segmentation, etc.
11.1.1 Why do we need NLP?
Natural language is full of ambiguities. Ambiguity is the property of having more than one
meaning or being understood in more than one way.
There are different types of ambiguities present in natural language:
1. Lexical Ambiguity: It is defined as the ambiguity associated with the meaning of a single word. A single
word can have different meanings. Also, a single word can be a noun, adjective, or verb. For example: The word
“bank” can have different meanings. It can be a financial bank or a riverbank. Similarly, the word “clean” can
be a noun, adverb, adjective, or verb.
2. Syntactic Ambiguity: It is defined as the ambiguity associated with the way the words are parsed. For
example, the sentence “Visiting relatives can be boring.” can have two different meanings. One
is that the act of visiting relatives can be boring. The second is that relatives who are visiting you can be boring.
3. Semantic Ambiguity: It is defined as the ambiguity that arises when the meaning of the words themselves can be
interpreted in more than one way. For example, in the sentence “Mary knows a little French.” the phrase “little French”
is ambiguous: we don’t know whether it refers to a small amount of the French language or to a small French person.
Let us now try to understand why NLP is considered hard using a few examples.
1. "There was not a single man at the party"
Does it mean that there were no men at the party? or
Does it mean that there was no one at the party?
Here does man refer to the gender "man" or "mankind"?
2. "The chicken is ready to eat"
Does this mean that the bird (chicken) is ready to feed on some grains? or
Does it mean that the meat is cooked well and is ready to be eaten by a human?
3. "Google is a great company." and "Google this word and find its meaning."
Google is being used as a noun in the first statement and as a verb in the second.
4. "The man saw a girl with a telescope"
Did the man use a telescope to see the girl? or
Did the man see a girl who was holding a telescope?
Ambiguity of this kind is a primary reason why NLP is considered hard. Another reason is that NLP deals with
the extraction of knowledge from unstructured data.
11.2 NLP APPROACHES
11.2.1 Rule Based NLP
Rule-based approaches are the oldest approaches to NLP. The rule-based or grammar-based approach implies
that a human writes the rules and refines them stepwise as the system is developed and improved. A rule-based NLP
system simply follows these hand-written rules to categorize the language it’s analyzing. If no rule exists for an
input, the system will be unable to ‘understand’ the human language and thus will fail to categorize it. Regular
expressions and context-free grammars are some examples of rule-based approaches to NLP.
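As a minimal sketch of the rule-based approach, the snippet below uses regular expressions as hand-written rules to categorize short sentences. The rule set, the category names, and the classify function are illustrative assumptions and not part of any library. A sentence that matches no rule simply falls through as 'unknown', which is the failure mode described above.

```python
import re

# Hand-written rules as (category, compiled pattern) pairs.
# These rules are illustrative assumptions; a real system needs many more.
RULES = [
    ("question", re.compile(r"\?\s*$")),
    ("greeting", re.compile(r"^(hi|hello|good (morning|evening))\b", re.IGNORECASE)),
    ("date", re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")),
]


def classify(sentence):
    """Return the category of the first matching rule, else 'unknown'."""
    for category, pattern in RULES:
        if pattern.search(sentence):
            return category
    return "unknown"


for s in ["Hello there", "Where is the bank?",
          "The exam is on 12/05/2024", "Visiting relatives can be boring."]:
    print(f"{s!r} -> {classify(s)}")
```

The last sentence is perfectly valid English, but because no rule covers it the system cannot categorize it.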
Advantages of Rule based approach:
A rule-based system is good at capturing a specific language phenomenon: it will decode the linguistic
relationships between words to interpret the sentence.
It tends to focus on pattern-matching or parsing.
Rule-based systems tend to have high precision but low recall, meaning they can have high performance in specific use
cases, but often suffer performance degradation when generalized.
Disadvantage of Rule-based approach:
It requires skilled experts: it takes a linguist or a knowledge engineer to manually encode each rule in NLP.
Rules need to be manually crafted and enhanced all the time.
Moreover, the system can become so complex that some rules can start contradicting each other.
Accuracy of the NLP system is dependent on the rules provided.
They cannot easily scale to accommodate a seemingly endless stream of exceptions or the increasing volumes
of text and voice data.
11.2.2 Statistical NLP
Statistical NLP aims to perform statistical inference for the field of NLP. It combines computer algorithms with
machine learning and deep learning models to automatically extract, classify, and label elements of text and voice
data and then assign a statistical likelihood to each possible meaning of those elements. Today, deep learning
models and learning techniques based on convolutional neural networks (CNNs) and recurrent neural networks
(RNNs) enable NLP systems that 'learn' as they work and extract ever more accurate meaning from huge volumes
of raw, unstructured, and unlabeled text and voice data sets.
Advantages of Statistical NLP:
The main advantage of statistical NLP is the “learnability” of machine learning algorithms: no manual
rule or grammar coding, which requires highly skilled experts, is needed.
The corpus can be annotated by a less specialized workforce.
The data fed to such a system is typically large and contains many data points (e.g. keywords), which makes
it easier for the machine to learn statistical clues about the words for a given task.
Machine learning approaches can significantly speed up the development of certain NLP
capabilities when good training data sets are available.
Disadvantages of Statistical NLP:
Lack of training data
Poorly labelled, insufficient data
New “preparation” of data is required each time as, once created and labelled, the corpus often can’t be reused
on new data schemas.
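To make the statistical approach concrete, here is a minimal sketch that trains NLTK's Naive Bayes classifier on a tiny hand-labelled corpus instead of hand-written rules. The six training sentences, the labels, and the features helper are made-up assumptions for illustration; a real system needs far more labelled data, which is exactly the disadvantage listed above.

```python
from nltk.classify import NaiveBayesClassifier


def features(text):
    """Represent a text as a dict of word-presence features."""
    return {word: True for word in text.lower().split()}


# Tiny hand-labelled corpus (illustrative assumption, not real data).
labelled = [
    ("win free prize now", "spam"),
    ("free money offer", "spam"),
    ("claim your free prize", "spam"),
    ("meeting at noon tomorrow", "ham"),
    ("lunch with the team", "ham"),
    ("project report attached", "ham"),
]
train_set = [(features(text), label) for text, label in labelled]

# The classifier learns word/label statistics from the labelled examples.
classifier = NaiveBayesClassifier.train(train_set)

for text in ["free prize inside", "team meeting tomorrow"]:
    print(text, "->", classifier.classify(features(text)))
```

No rule was written for the word "free"; the classifier assigns it a likelihood per label from the counts in the training data.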
11.3 NLP USE CASES
Natural language processing is the driving force behind machine intelligence in many modern real-world
applications. Here are a few examples:
Spam detection: NLP's text classification capabilities can be used to scan emails for language that often
indicates spam or phishing. These indicators can include overuse of financial terms, characteristic bad
grammar, threatening language, inappropriate urgency, misspelled company names, and more.
Machine translation: Machine translation involves translating words/sentences in one language to
another. Effective translation has to capture accurately the meaning and tone of the input language and
translate it to text with the same meaning and desired impact in the output language. Example: Google
Translate.
Speech Recognition: This is the process of mapping acoustic speech signals to a set of words. Difficulty
arises due to wide variations in the pronunciation of words, homophones (e.g., dear and deer), and acoustic
ambiguities (e.g., in the rest and interest).
Speech Synthesis: It refers to automatic production of speech (uttering sentences in natural language).
Such systems can read out your mail or messages for you.
Virtual agents and chatbots: Virtual agents such as Apple's Siri and Amazon's Alexa use speech
recognition to recognize patterns in voice commands and natural language generation to respond with
appropriate action or helpful comments. Chatbots perform the same magic in response to typed text
queries.
Social media sentiment analysis: Sentiment analysis can analyze language used in social media posts,
responses, reviews, and more to extract attitudes and emotions in response to products, promotions, and
events. Companies can use this information in product designs, advertising campaigns, and more.
Text summarization: Text summarization uses NLP techniques to digest huge volumes of digital text
and create summaries and synopses for indexes, research databases, or busy readers who don't have time
to read full text. The best text summarization applications use semantic reasoning and natural language
generation (NLG) to add useful context and conclusions to summaries.
11.4 NLP TOOLS & LIBRARIES
Some commonly used NLP tools and libraries are as follows:
1. NLTK - entry-level open-source NLP Tool
Natural Language Toolkit (AKA NLTK) is an open-source NLP library for Python. NLTK
provides users with a basic set of tools for text-related operations. It is a good starting point for beginners
in Natural Language Processing.
Natural Language Toolkit features include:
Text classification
Part-of-speech tagging
Entity extraction
Tokenization
Parsing
Stemming
Semantic reasoning
Natural Language Toolkit is useful for simple text analysis. But, if you need to work on a massive amount of
data, it requires significant resources.
2. Stanford Core NLP - Data Analysis, Sentiment Analysis, Conversational UI
Stanford CoreNLP is a multi-purpose tool for text analysis. Like NLTK, Stanford CoreNLP provides
many different natural language processing tools. But if you need more, you can use custom modules. The
main advantage of Stanford CoreNLP is scalability. Compared with NLTK, Stanford CoreNLP is a better choice for
processing large amounts of data and performing complex operations.
3. Apache OpenNLP - Data Analysis and Sentiment Analysis
Apache OpenNLP is an open-source library for those who prefer practicality and accessibility. Like
Stanford CoreNLP, it is a Java library, so Python programs use it through wrappers. While NLTK and Stanford CoreNLP are
feature-rich libraries with tons of additions, OpenNLP is a simple yet useful tool. Besides, you can configure
OpenNLP in the way you need and get rid of unnecessary features.
Apache OpenNLP is the right choice for:
Named Entity Recognition
Sentence Detection
POS tagging
Tokenization
4. SpaCy - Data Extraction, Data Analysis, Sentiment Analysis, Text Summarization
SpaCy is often described as the next step after NLTK. NLTK can be clumsy and slow when it comes to more complex
business applications, whereas SpaCy provides users with a smoother, faster, and more efficient experience.
SpaCy, an open-source NLP library, is a perfect match for comparing customer profiles, product profiles,
or text documents.
SpaCy is good at syntactic analysis, which is handy for aspect-based sentiment analysis and conversational
user interface optimization.
SpaCy is also an excellent choice for named-entity recognition. You can use SpaCy for business insights
and market research.
Another SpaCy advantage is word vector usage. Unlike OpenNLP and CoreNLP, SpaCy works
with word2vec and doc2vec.
5. GenSim - Document Analysis, Semantic Search, Data Exploration
GenSim is the perfect tool to extract particular information to discover business insights. It is an open-source
NLP library designed for document exploration and topic modeling. It would help you to navigate the various
databases and documents.
The key GenSim feature is word vectors. It sees the content of the documents as sequences of vectors and
clusters, and then, classifies them.
GenSim is also resource-saving when it comes to dealing with a large amount of data.
The main GenSim use cases are:
Data analysis
Semantic search applications
Text generation applications (chatbot, service customization, text summarization, etc.)
11.5 ENVIRONMENT SETUP
NLTK can be installed by using the pip package installer. NLTK no longer supports Python 2, so a
Python 3 installation is required (the minimum supported Python 3 version depends on the NLTK release).
Pip command to install NLTK is as follows:
pip install nltk
We can install SpaCy in a similar way using
pip install spacy
Session 2
11.6 TEXT PROCESSING TASKS
Text data derived from natural language is unstructured and noisy. Text preprocessing involves transforming
text into a clean and consistent format that can then be fed into a model for further analysis and learning.
Text preprocessing is an important step for natural language processing (NLP) tasks.
It transforms text into a more digestible form so that machine learning algorithms can perform better;
the quality of the results depends on how well the data has been preprocessed.
Text preprocessing improves the performance of an NLP system.
For tasks such as sentiment analysis, document categorization, document retrieval based upon user
queries, and more, adding a text preprocessing layer provides more accuracy.
Some of the preprocessing steps are:
Tokenization
Removing Stop words
Spelling correction
Stemming
Lemmatization
We will be using NLTK to perform all NLP tasks. If NLTK is not yet installed in your system, go to Anaconda
Command prompt and perform the following steps:
pip install nltk
Then, enter the Python shell in your terminal by simply typing python
Type import nltk
Type nltk.download('all')
The above installation will take quite some time due to the massive number of tokenizers, chunkers, other
algorithms, and all of the corpora to be downloaded.
11.6.1 Tokenization
A token in a text document is each unit produced when the text is split up according to a set of rules. For
example, each word is a token when a sentence is “tokenized” into words. Each sentence can also be a token,
if you tokenize the sentences out of a paragraph. So basically, tokenizing involves splitting sentences and
words from the body of the text.
NLTK Tokenizer Package
The nltk.tokenize package contains methods to tokenize the given text into sentences and words based on our
requirement.
1. Sentence tokenization
The given document is divided or tokenized into sentences.
Syntax:
nltk.tokenize.sent_tokenize(text, language='english')
Returns a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently
PunktSentenceTokenizer for the specified language).
Parameters:
text (str) – text to split into sentences
language (str) – the model name in the Punkt corpus. By default it is 'english'.
Example:
The following example reads a text file and tokenizes it into sentences.
Import the sent_tokenize method from the nltk.tokenize package.
Read the text file using the open method.
Divide the text into sentences using the sent_tokenize method.
Print the sentences to see how the method has divided the text into sentences.
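The steps above can be sketched as follows. Since the original text file is not available here, the sketch uses a short inline string as a stand-in (an assumption); to work from a file, replace it with open(path).read(). The fallback to an untrained PunktSentenceTokenizer is also an assumption, added only so that the sketch still runs when the Punkt model has not been downloaded.

```python
from nltk.tokenize import sent_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Stand-in for the contents of the text file (illustrative assumption).
text = ("NLTK is a toolkit for text processing. "
        "It can split text into sentences. "
        "Each sentence can then be split into words.")

try:
    # Uses the pre-trained Punkt model for English.
    sentences = sent_tokenize(text, language="english")
except LookupError:
    # Punkt model not downloaded: fall back to an untrained Punkt tokenizer.
    sentences = PunktSentenceTokenizer().tokenize(text)

for number, sentence in enumerate(sentences, start=1):
    print(number, sentence)
```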
2. Word tokenization
The given document is divided or tokenized into words.
Syntax:
nltk.tokenize.word_tokenize(text, language='english', preserve_line=False)
Returns a tokenized copy of text, using NLTK’s recommended word tokenizer (currently an improved
TreebankWordTokenizer along with PunktSentenceTokenizer for the specified language).
Parameters:
text (str) – text to split into words
language (str) – the model name in the Punkt corpus
preserve_line (bool) – a flag to decide whether to sentence tokenize the text or not; if True, the text is not split into sentences first.
Example:
For the text file used in the previous example, let us find the word tokens.
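A minimal sketch of word tokenization on a single sentence is shown below. The sentence is an illustrative stand-in for the text file. The sketch passes preserve_line=True, so the text is not sentence-tokenized first and the Punkt model is not needed; with the default preserve_line=False the text is first split into sentences.

```python
from nltk.tokenize import word_tokenize

# Stand-in sentence (illustrative assumption).
sentence = "The staff was friendly, and the rooms were clean."

# preserve_line=True: treat the input as one line, skip sentence splitting.
tokens = word_tokenize(sentence, preserve_line=True)
print(tokens)
```

Note that the comma and the final full stop come out as separate tokens.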
The tokens generated will contain special characters such as comma, dot, parentheses, apostrophe etc. We may
have to remove these special characters, and stop words to generate the vocabulary of our text document.
3. Visualizing Frequency Distribution of words
The standard method for visualizing the word frequency distribution is to count how often each word occurs in
a corpus and to sort the word frequency counts by decreasing magnitude. We need to create a Bag of Words
(BoW), which is a statistical language model used to analyze text and documents based on word count.
Example:
To visualize the frequency distribution, let us consider opinions given by customers about a hotel.
Read the “[Link]” text file which contains opinions given by multiple customers
Creating the Frequency Distribution
Without the NLTK package, creating a frequency distribution plot (histogram) for a BoW is possible, but
takes multiple lines of code. Using the FreqDist class, we can obtain the
frequencies of every token in the BoW with a single line of code.
Visualizing Frequency Distribution using matplotlib
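The frequency distribution can be sketched as follows. The review text is a made-up stand-in for the customer opinions file, and tokens are obtained with a plain split() to keep the sketch self-contained. The plotting call is left as a comment because it needs matplotlib and a display.

```python
from nltk.probability import FreqDist

# Stand-in for the customer opinions (illustrative assumption).
reviews = "the room was clean and the staff was friendly the food was good"
tokens = reviews.split()

# One line builds the frequency of every token in the Bag of Words.
fdist = FreqDist(tokens)

print(fdist.most_common(3))   # the most frequent tokens with their counts
print(fdist["staff"])         # count of one particular token

# fdist.plot(10)  # draws the 10 most frequent tokens (requires matplotlib)
```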
As we can see, the frequency distribution contains frequencies of tokens like ‘the’, ‘to’, ‘of’, which are
considered stop words. It also contains special characters, and tokens that differ only in case are treated
as different tokens. To overcome these issues, we have to perform text cleanup as follows:
Converting the text to lower case letters
Removing special characters from data
Removing words with numbers from data
Removing Stopwords
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and
“and” would easily qualify as stop words. In NLP and text mining applications, stop words are removed to
eliminate unimportant words, allowing applications to focus on the important words instead.
NLTK has an inbuilt list of stopwords. We can also create our own stopword list and use it based on
our requirements. When we plot the frequency distribution and its histogram again after text cleaning,
the stop words and special characters no longer appear.
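The cleanup steps listed above can be sketched as follows. The sample sentence is made up, and the small fallback stop word set is an assumption used only when NLTK's stopwords corpus has not been downloaded (it also illustrates using a custom stopword list).

```python
import re

try:
    from nltk.corpus import stopwords
    stop_words = set(stopwords.words("english"))
except LookupError:
    # Corpus not downloaded: fall back to a tiny custom stop word list.
    stop_words = {"the", "was", "and", "were", "is", "of", "to", "a"}

text = "The room was CLEAN, and the 2 beds were great!"

text = text.lower()                         # 1. convert to lower case
text = re.sub(r"[^a-z0-9\s]", " ", text)    # 2. remove special characters
tokens = [t for t in text.split()
          if not any(ch.isdigit() for ch in t)]            # 3. drop tokens with numbers
cleaned = [t for t in tokens if t not in stop_words]       # 4. remove stop words

print(cleaned)
```

The cleaned token list can then be passed to FreqDist in place of the raw tokens.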
4. Visualize with word cloud
A word cloud is a collection, or cluster, of words depicted in different sizes. The bigger and bolder the word
appears, the more often it’s mentioned within a given text and the more important it is.
To draw a word cloud, install it through pip command as follows:
pip install wordcloud
Example:
We can set a custom image as a mask to display the word cloud as follows:
Session 3
11.7 SPELL CORRECTION
Spelling correction is the process of correcting the spelling of a word, for example replacing “lisr” with “list”. Spelling
correction is important for many NLP applications like web search engines, text summarization, sentiment
analysis etc. Most approaches use parallel data of noisy and correct word mappings from different sources as
training data for automatic spelling correction.
Here we are going to use Levenshtein distance or Edit Distance method for spelling correction. This
method takes a list of misspelled words and gives the suggestion of the correct word for each incorrect word.
It tries to find a word in the list of correct spellings that has the shortest distance and the same initial letter as
the misspelled word. It then returns the word which matches the given criteria.
11.7.1 Edit Distance
Edit Distance or Levenshtein distance between two words is the minimum number of single-character edits
(insertions, deletions or substitutions) required to change one word into the other.
Edit Distance measures dissimilarity between two strings by finding the minimum number of operations
needed to transform one string into the other. The transformations that can be performed are:
Inserting a new character:
bat -> bats (insertion of 's')
Deleting an existing character:
care -> car (deletion of 'e')
Substituting an existing character:
bin -> bit (substitution of 'n' with 't')
Transposing two adjacent characters (counted as a single edit only in the Damerau-Levenshtein variant; plain
Levenshtein distance counts it as two edits):
sing -> sign (transposition of 'ng' to 'gn')
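The four examples above can be checked with NLTK's edit_distance function, as in the short sketch below. By default the function computes plain Levenshtein distance; passing transpositions=True additionally allows swapping two adjacent characters as a single edit.

```python
from nltk.metrics.distance import edit_distance

print(edit_distance("bat", "bats"))    # one insertion
print(edit_distance("care", "car"))    # one deletion
print(edit_distance("bin", "bit"))     # one substitution

# A swap of adjacent characters costs two edits in plain Levenshtein
# distance, but one edit when transpositions are allowed.
print(edit_distance("sing", "sign"))
print(edit_distance("sing", "sign", transpositions=True))
```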
Implementation using NLTK
Step 1: First of all, we install and import the nltk suite.
Step 2: Now, we download the ‘words’ resource (which contains correct spellings of words) from the nltk
downloader and import it through nltk.corpus and assign it to correct_words.
Step 3: We define the list of incorrect_words for which we need the correct spellings. Then we run a loop for
each word in the incorrect words list in which we calculate the Edit distance of the incorrect word with each
correct spelling word having the same initial letter. We then sort them in ascending order so the shortest
distance is on top and extract the word corresponding to it and print it.
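The three steps can be sketched as follows. To keep the sketch self-contained, correct_words is a small hand-made list (an assumption); in the setup described above it would instead be loaded from the ‘words’ resource in nltk.corpus. The suggest helper is a hypothetical name introduced only for this illustration.

```python
from nltk.metrics.distance import edit_distance

# Stand-in vocabulary (assumption); the text loads the 'words' corpus instead.
correct_words = ["happy", "hippo", "amazing", "amusing",
                 "intelligent", "intelligence", "list", "lift"]

incorrect_words = ["happpy", "amazaing", "intelliengt", "lisr"]


def suggest(word, vocabulary):
    """Return the closest vocabulary word that has the same initial letter."""
    candidates = [(edit_distance(word, w), w)
                  for w in vocabulary if w[0] == word[0]]
    # Sort ascending so the shortest distance is on top.
    return sorted(candidates)[0][1]


for word in incorrect_words:
    print(word, "->", suggest(word, correct_words))
```

Restricting candidates to the same initial letter keeps the search small, at the cost of missing corrections when the first letter itself is mistyped.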
11.8 NORMALIZATION
The base form of a word is called its root morpheme. A token is basically made up of two components: one is
the root morpheme and the other is the inflectional form, such as a prefix or suffix. For example, consider the word Antinationalist
(Anti + national + ist), which is made up of Anti and ist as inflectional forms and national as the root morpheme.
Normalization is the process of converting a token into its base form. In the normalization process,
the inflectional form of a word is removed so that the base form can be obtained. In the above example, the
normal form of antinationalist is national.
Why do we need text normalization?
Normalization is helpful in reducing the number of unique tokens present in the text, removing the
variations in a text and also cleaning the text by removing redundant information.
When we normalize text, we attempt to reduce its randomness, bringing it closer to a predefined
“standard”.
This helps us to reduce the amount of different information that the computer has to deal with, and
therefore improves efficiency.
Two popular methods used for normalization are
Stemming
Lemmatization
11.8.1 Stemming
Stemming is the process of reducing words to their root form. It is a rule-based process for removing
inflectional forms from a given token. The output of this process is called the stem. For example, “retrieval”,
“retrieved”, “retrieves” reduce to a common stem (ideally “retrieve”, although a stemmer such as Porter’s
actually outputs the non-word “retriev”). Similarly, "likes", "liked",
"likely", "liking" are reduced to the root word "like".
The objective of stemming is to reduce related words to the same stem even if the stem is not a dictionary
word. Stemming is therefore not always a good choice for normalization, since it can sometimes produce non-meaningful words
which are not present in the dictionary.
Implementation using Porter Stemmer
The most common algorithm for stemming English is Porter’s algorithm. There are other stemmers also
available such as Snowball Stemmer.
Step 1: First of all, we read the text from a file and perform text cleaning such as converting to lower case,
removing punctuations and numeric data from the text.
Step 2: The text is then split into tokens or words and each word is fed to the Porter Stemmer to get the stem
word.
As we can see, some words are converted to a correct root word: for example, ‘refers’ is reduced to ‘refer’ and
‘processing’ is reduced to ‘process’. But most of the words are stemmed to a form which is not present in the
dictionary and does not carry any meaning. Stemming a word or sentence may therefore result in words that are not
actual words.
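Steps 1 and 2 can be sketched as follows. The input sentence is a made-up stand-in for the text file, and the cleaning is done with a single regular expression; both are assumptions made for illustration.

```python
import re
from nltk.stem import PorterStemmer

# Stand-in for the text read from a file (illustrative assumption).
text = "Stemming refers to the processing of words: likes, liked, liking and 3 studies."

# Step 1: lower-case the text and remove punctuation and numeric data.
cleaned = re.sub(r"[^a-z\s]", " ", text.lower())

# Step 2: split into tokens and feed each one to the Porter stemmer.
stemmer = PorterStemmer()
stems = {word: stemmer.stem(word) for word in cleaned.split()}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

In this run ‘studies’ becomes ‘studi’, which is not a dictionary word, while ‘refers’ and ‘processing’ become ‘refer’ and ‘process’.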
11.8.2 Lemmatization
Lemmatization is a systematic process of removing the inflectional form of a token and transforming it into a base
word. It makes use of word structure, vocabulary, part of speech tags, and grammar relations.
Unlike stemming, lemmatization reduces words to their base word, reducing the inflected words
properly and ensuring that the root word belongs to the language. It’s usually more sophisticated than stemming,
since a stemmer works on an individual word without knowledge of the context. In lemmatization, a root word
is called a lemma. For example, “am”, “are”, “is” will be converted to “be”. Similarly, ‘running’, ‘runs’, ‘ran’
will be replaced by ‘run’.
Just like for stemming, there are different lemmatizers. Here we are going to use the WordNet lemmatizer.
It is possible to improve the results of lemmatization even further if you provide the context in which you
want to lemmatize, which you can do through parts-of-speech (POS) tagging.
POS tagging is the task of assigning each word in a sentence the part of speech that it assumes in that
sentence. The primary target of POS tagging is to identify the grammatical group of a given word: whether it is
a noun, pronoun, adjective, verb, adverb, etc. based on the context.
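A minimal sketch of the WordNet lemmatizer is shown below. The word list and the hand-assigned POS tags ('v' for verb, 'n' for noun) are assumptions for illustration; in practice the tags would come from a POS tagger. The lemmatizer needs the WordNet data, so the sketch reports a hint instead of failing when that data has not been downloaded.

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# (word, POS tag) pairs; the tags are assigned by hand for this sketch.
examples = [("running", "v"), ("ran", "v"), ("are", "v"),
            ("is", "v"), ("feet", "n")]

try:
    lemmas = [lemmatizer.lemmatize(word, pos=pos) for word, pos in examples]
except LookupError:
    lemmas = None
    print('WordNet data not found; run nltk.download("wordnet") first.')

if lemmas is not None:
    for (word, pos), lemma in zip(examples, lemmas):
        print(f"{word} ({pos}) -> {lemma}")
```

Without a POS tag, lemmatize treats every word as a noun, so a verb form such as "running" would be returned unchanged.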
Session 6
11.12 TF-IDF VECTORIZATION
All the methods discussed in the previous sessions were based on the Bag of Words model, which is simple and
works well. But the problem is that it treats all words equally. As a result, it cannot distinguish very
common words from rare words. To solve this problem, TF-IDF comes into the picture. TF-IDF is made up of
two terms: Term Frequency and Inverse Document Frequency.
1. Term Frequency
Term frequency denotes the frequency of a word in a document. For a specified word (x), it is defined as the ratio
of the number of times the word appears in a particular document (y) to the total number of words in that document.
For example, consider the following sentence:
‘Cat loves to play with ball’
For the above sentence, the term frequency value for the word cat will be: tf(‘cat’) = 1/6
This number always stays ≤ 1, so it tells us how frequent a word is relative to all of the
words in a document.
2. Inverse document frequency
Inverse document frequency looks at how common (or uncommon) a word is amongst the corpus. It measures
the importance of the word in the corpus.
Before we go into IDF, we must make sense of DF – Document Frequency. It’s given by the following formula:
$$DF = \frac{\text{Documents containing word } W}{\text{Total number of documents}}$$
DF tells us about the proportion of documents that contain a certain word. IDF is the logarithm of the reciprocal of the Document
Frequency.
$$IDF = \log\left(\frac{\text{Total number of documents}}{\text{Documents containing word } W}\right)$$
The intuition behind using IDF is that the more common a word is across all documents, the lesser its importance
is for the current document. A logarithm is taken to dampen the effect of (normalize) IDF in the final calculation.
The final TF-IDF score comes out to be:
$$TFIDF = TF \times IDF$$
Implementation
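The formulas above can be implemented directly in a few lines of plain Python, as in the sketch below. The three-document corpus is made up for illustration, the natural logarithm is assumed for log, and the function names tf, idf and tf_idf are hypothetical helpers. Library implementations often use smoothed variants of these formulas, so their numbers may differ.

```python
import math

# Tiny corpus (illustrative assumption); each document is tokenized by split().
corpus = [
    "cat loves to play with ball",
    "dog loves to play with ball",
    "cat sleeps all day",
]
documents = [doc.lower().split() for doc in corpus]


def tf(word, document):
    """Term frequency: occurrences of word / total words in the document."""
    return document.count(word) / len(document)


def idf(word, documents):
    """Inverse document frequency: log(total documents / documents containing word).

    Assumes the word occurs in at least one document.
    """
    containing = sum(1 for document in documents if word in document)
    return math.log(len(documents) / containing)


def tf_idf(word, document, documents):
    return tf(word, document) * idf(word, documents)


print(tf("cat", documents[0]))                 # 1/6, as in the example above
print(idf("dog", documents))                   # 'dog' is in 1 of 3 documents
print(tf_idf("dog", documents[1], documents))  # rarer word, higher score
print(tf_idf("loves", documents[1], documents))
```

In the second document, "dog" and "loves" have the same term frequency, but "dog" gets the higher TF-IDF score because it appears in fewer documents.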
Session 7
11.13 NLP PIPELINE
The set of ordered stages one should go through from a labeled dataset to creating a classifier that can be applied
to new samples is called the NLP pipeline. NLP Pipeline is a set of steps followed to build an end to end NLP
software.
11.13.1 Steps involved in building NLP Pipeline
1. Data Acquisition
In the data acquisition step, we collect the data required for building our NLP software. We can collect the data
using any of the following methods:
a) Survey – We can conduct a survey to collect data and then manually label the data.
b) Public Dataset – If a public dataset is available for our problem statement.
c) Web Scraping – Scraping data using Beautiful Soup or other libraries.
2. Text Preprocessing
Once the data collection step is done, we cannot use this data as is for model building. We have to do text
preprocessing. It helps to remove unhelpful parts of the data, or noise, by converting all characters to lowercase and
removing stop words, punctuation marks, and typos in the data. Preprocessing the data generally improves the accuracy of the
model.
Steps involved in Text Preprocessing –
1. Text Cleaning – In text cleaning we remove HTML tags and punctuation, correct spellings, etc.
2. Basic Preprocessing – In basic preprocessing we do tokenization (word or sentence tokenization), stop word
removal, digit removal, lowercasing, etc.
3. Advanced Preprocessing – In this step we do POS tagging, named entity recognition, etc.
3. Feature Engineering
After text cleaning and normalization, the processed text is converted to feature vectors so that we can feed it
to machine learning applications. Feature Engineering means converting text data to numerical data. But why
is it required to convert text data to numerical data? Because many Machine Learning algorithms and almost all
Deep Learning architectures are not capable of processing strings or plain text in their raw form. This step is
also called feature extraction from text.
In this step, we use multiple techniques to convert text to numerical vectors.
1. One Hot Encoder
2. Bag of Words (BoW)
3. n-grams
4. TF-IDF
5. Word2vec
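The first three techniques in the list can be sketched in a few lines, as shown below. The two documents are made up, and the one_hot and bag_of_words helpers are hypothetical names written for this illustration; only ngrams comes from NLTK.

```python
from nltk.util import ngrams

# Two tiny tokenized documents (illustrative assumption).
doc1 = "the cat sat".split()
doc2 = "the cat saw the dog".split()

# The vocabulary fixes the position of every word in the vectors.
vocabulary = sorted(set(doc1 + doc2))   # ['cat', 'dog', 'sat', 'saw', 'the']


def one_hot(word):
    """Vector with a single 1 at the position of the word."""
    return [1 if w == word else 0 for w in vocabulary]


def bag_of_words(tokens):
    """Vector of word counts for one document."""
    return [tokens.count(w) for w in vocabulary]


print(one_hot("dog"))          # one-hot encoding of a single word
print(bag_of_words(doc2))      # word counts for the second document
print(list(ngrams(doc1, 2)))   # bigrams (n-grams with n = 2)
```

Bag of Words ignores word order, whereas n-grams keep short sequences of neighbouring words.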
4. Modelling / Model Building
In the modeling step, we try to create a model based on the cleaned data. Here also we can use multiple
approaches to build the model based on the problem statement.
Approaches to building model –
1. Heuristic Approach
2. Machine Learning Approach
3. Deep Learning Approach
5. Model Evaluation
In the model evaluation, we can use different metrics for evaluation such as Accuracy, Recall, Confusion
Matrix, Perplexity, etc.
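As a small illustration of these metrics, the sketch below computes accuracy, recall, and the confusion matrix counts for a classifier's predictions in plain Python. The gold and predicted label lists are made-up values, and "spam" is arbitrarily treated as the positive class.

```python
from collections import Counter

# Made-up gold labels and model predictions (illustrative assumption).
gold = ["spam", "spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "ham", "ham", "spam", "spam"]

# Accuracy: fraction of predictions that match the gold label.
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# Recall for the positive class: true positives / all actual positives.
true_positives = sum(g == "spam" and p == "spam" for g, p in zip(gold, predicted))
recall = true_positives / gold.count("spam")

# Confusion matrix as counts of (gold label, predicted label) pairs.
confusion = Counter(zip(gold, predicted))

print("accuracy:", accuracy)
print("recall (spam):", recall)
for (g, p), count in sorted(confusion.items()):
    print(f"gold={g} predicted={p}: {count}")
```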
6. Deployment
In the deployment step, we deploy our model on the cloud for the users. Deployment has three stages:
deployment, monitoring, and retraining or model update.
Three stages of deployment –
1. Deployment – deploying the model on the cloud for users.
2. Monitoring – In the monitoring phase, we have to watch the model continuously. Here we create
a dashboard to show evaluation metrics.
3. Update – Retrain the model on new data and deploy it again.