Natural Language Processing (NLP) basics
NATURAL LANGUAGE PROCESSING WITH SPACY
Azadeh Mobasher
Principal Data Scientist
Natural Language Processing (NLP)
A subfield of Artificial Intelligence (AI)
Helps computers understand human language
Helps extract insights from unstructured data
Incorporates statistics, machine learning models and deep learning models
NLP use cases
Sentiment analysis
Using computers to determine the underlying subjective tone of a piece of writing
Named entity recognition (NER)
Locating and classifying named entities mentioned in unstructured text into pre-defined categories
Named entities are real-world objects such as a person or a location
Generating human-like responses to text input, as tools such as ChatGPT do
Introduction to spaCy
spaCy is a free, open-source library for NLP in Python which:
Is designed to build systems for information extraction
Provides production-ready code for NLP use cases
Supports 64+ languages
Is robust and fast, and comes with built-in visualization tools
Install and import spaCy
As the first step, spaCy can be installed using the Python package manager pip:
$ python3 -m pip install spacy
spaCy trained models can then be downloaded:
$ python3 -m spacy download en_core_web_sm
A downloaded model is loaded with spacy.load():
import spacy
nlp = spacy.load("en_core_web_sm")
Multiple trained models are available for the English language; they are listed in the spaCy models documentation
Read and process text with spaCy
The loaded spaCy model en_core_web_sm is the nlp object
The nlp object converts text into a Doc object, a container that stores the processed text
spaCy in action
Processing a string using spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)
Tokenization
A Token is defined as the smallest meaningful part of the text.
Tokenization: The process of dividing a text into a list of meaningful tokens
print([token.text for token in doc])
['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']
spaCy basics
spaCy NLP pipeline
Import spaCy
Use spacy.load() to return nlp, an instance of the Language class
The Language object is the text processing pipeline
Apply nlp() on any text to get a Doc container
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")
spaCy applies a sequence of processing steps to the text using its Language class, starting with tokenization
Container objects in spaCy
There are multiple data structures to represent text data in spaCy:
Name   Description
Doc    A container for accessing linguistic annotations of text
Span   A slice from a Doc object
Token  An individual token, i.e. a word, punctuation, whitespace, etc.
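A minimal sketch of how the three containers relate, assuming spaCy v3 is installed. To avoid a model download it uses a blank English pipeline (spacy.blank("en"), tokenizer only) instead of en_core_web_sm; the variable names are illustrative.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only) stands in for
# en_core_web_sm, so no trained model has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("A spaCy pipeline object is created.")

token = doc[0]   # indexing a Doc gives a Token
span = doc[1:3]  # slicing a Doc gives a Span

print(type(doc).__name__, type(span).__name__, type(token).__name__)
print(span.text)
```

A Span does not copy the text; it is a view onto the tokens of its Doc.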
Pipeline components
The spaCy language processing pipeline always depends on the loaded model and its capabilities.
Component         Name        Description
Tokenizer         Tokenizer   Segment text into tokens and create Doc object
Tagger            Tagger      Assign part-of-speech tags
Lemmatizer        Lemmatizer  Reduce the words to their root forms
EntityRecognizer  NER         Detect and label named entities
Each component has unique features to process text
Further classes: Language, DependencyParser, Sentencizer
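Which components a pipeline contains can be checked with nlp.pipe_names. The sketch below assumes spaCy v3 and, to avoid a model download, starts from a blank English pipeline rather than en_core_web_sm, then adds the rule-based sentencizer by hand.

```python
import spacy

# Assumption: blank English pipeline instead of en_core_web_sm.
nlp = spacy.blank("en")
names_before = list(nlp.pipe_names)
print(names_before)  # the tokenizer is not listed here; it always runs first

# Add the built-in rule-based sentence splitter by its registered name.
nlp.add_pipe("sentencizer")
names_after = list(nlp.pipe_names)
print(names_after)
```

With a trained model such as en_core_web_sm, the same attribute lists the components that ship with that model.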
Tokenization
Always the first operation
All the other operations require tokens
Tokens can be words, numbers and punctuation
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tokenization splits a sentence into its tokens.")
print([token.text for token in doc])
['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']
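To illustrate that tokens can be words, numbers and punctuation, the following sketch prints a few boolean Token attributes. It assumes spaCy v3 and uses a blank English pipeline (tokenizer only) rather than en_core_web_sm; the example sentence is made up for illustration.

```python
import spacy

# Assumption: blank English pipeline; tokenization needs no trained model.
nlp = spacy.blank("en")
doc = nlp("I bought 3 apples.")

# is_alpha: alphabetic word, like_num: looks like a number,
# is_punct: punctuation mark
kinds = [(t.text, t.is_alpha, t.like_num, t.is_punct) for t in doc]
for row in kinds:
    print(row)
```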
Sentence segmentation
More complex than tokenization
Is part of the DependencyParser component
import spacy
nlp = spacy.load("en_core_web_sm")
text = "We are learning NLP. This course introduces spaCy."
doc = nlp(text)
for sent in doc.sents:
    print(sent.text)
We are learning NLP.
This course introduces spaCy.
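When no dependency parser is available, sentence boundaries can instead come from the rule-based Sentencizer. This is a different technique from the parser-based segmentation above: it splits on sentence-final punctuation only. A minimal sketch, assuming spaCy v3 and a blank English pipeline:

```python
import spacy

# Assumption: blank English pipeline with the rule-based sentencizer,
# not the DependencyParser of en_core_web_sm.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = "We are learning NLP. This course introduces spaCy."
doc = nlp(text)

sentences = [sent.text for sent in doc.sents]
for sentence in sentences:
    print(sentence)
```

The rule-based splitter is fast but can be less accurate than the parser on text with unusual punctuation.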
Lemmatization
A lemma is the base form of a token
The lemma of eats and ate is eat
Can improve the accuracy of language models
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])
[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'),
('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]
Linguistic features in spaCy
POS tagging
Categorizing words grammatically, based on function and context within a sentence
POS    Description               Example
VERB   Verb                      run, eat, ate, take
NOUN   Noun                      man, airplane, tree, flower
ADJ    Adjective                 big, old, incompatible, conflicting
ADV    Adverb                    very, down, there, tomorrow
CCONJ  Coordinating conjunction  and, or, but
POS tagging with spaCy
POS tagging helps disambiguate the meaning of a word
Some words, such as watch, can be either a noun or a verb
spaCy exposes POS tags through the pos_ attribute of each Token
spacy.explain() explains a given POS tag
import spacy
nlp = spacy.load("en_core_web_sm")
verb_sent = "I watch TV."
print([(token.text, token.pos_,
        spacy.explain(token.pos_))
       for token in nlp(verb_sent)])
[('I', 'PRON', 'pronoun'),
 ('watch', 'VERB', 'verb'),
 ('TV', 'NOUN', 'noun'),
 ('.', 'PUNCT', 'punctuation')]
noun_sent = "I left without my watch."
print([(token.text, token.pos_,
        spacy.explain(token.pos_))
       for token in nlp(noun_sent)])
[('I', 'PRON', 'pronoun'),
 ('left', 'VERB', 'verb'),
 ('without', 'ADP', 'adposition'),
 ('my', 'PRON', 'pronoun'),
 ('watch', 'NOUN', 'noun'),
 ('.', 'PUNCT', 'punctuation')]
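spacy.explain() can also be called directly on a tag string, without loading a model. A small sketch, assuming spaCy v3 is installed; the tag list simply repeats the tags that appear in the outputs above.

```python
import spacy

# Look up the human-readable description of each POS tag in spaCy's glossary.
tags = ["PRON", "VERB", "NOUN", "ADP", "PUNCT"]
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(tag, "->", description)
```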
Named entity recognition
A named entity is a word or phrase that refers to a specific entity with a name
Named-entity recognition (NER) classifies named entities into pre-defined categories
Entity type  Description
PERSON       Named person or family
ORG          Companies, institutions, etc.
GPE          Geo-political entity, countries, cities, etc.
LOC          Non-GPE locations, mountain ranges, etc.
DATE         Absolute or relative dates or periods
TIME         Times smaller than a day
NER and spaCy
spaCy models extract named entities using the NER pipeline component
Named entities are available via the doc.ents property
spaCy will also tag each entity with its entity label (.label_)
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(ent.text, ent.start_char,
        ent.end_char, ent.label_) for ent in doc.ents])
[('Albert Einstein', 0, 15, 'PERSON')]
The entity type of each token in a Doc container is also accessible, through token.ent_type_
import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(token.text, token.ent_type_) for token in doc])
[('Albert', 'PERSON'), ('Einstein', 'PERSON'),
 ('was', ''), ('genius', ''), ('.', '')]
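The same doc.ents and token.ent_type_ attributes are filled in whichever component sets the entities. The sketch below swaps the statistical NER of en_core_web_sm for spaCy's rule-based EntityRuler on a blank English pipeline, so it runs without a model download. The single pattern is hypothetical and written only for this example; spaCy v3 is assumed.

```python
import spacy

# Assumption: blank pipeline plus a rule-based EntityRuler, standing in
# for the trained NER component of en_core_web_sm.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
# Hypothetical pattern for illustration only.
ruler.add_patterns([{"label": "PERSON", "pattern": "Albert Einstein"}])

doc = nlp("Albert Einstein was genius.")

# Entity-level view: text, character offsets and label of each entity.
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_)
            for ent in doc.ents]
print(entities)

# Token-level view: tokens outside any entity have an empty ent_type_.
token_types = [(token.text, token.ent_type_) for token in doc]
print(token_types)
```

Unlike the trained model, the rule-based approach only finds entities it has an explicit pattern for.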
displaCy
spaCy is equipped with a modern visualizer: displaCy
The displaCy entity visualizer highlights named entities and their labels
import spacy
from spacy import displacy
text = "Albert Einstein was genius."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
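Outside a notebook or web server, displacy.render() with jupyter=False returns the entity markup as an HTML string, which can be saved to a file. A minimal sketch, assuming spaCy v3; to avoid a model download the entity is set by hand on a blank English pipeline instead of being predicted by en_core_web_sm.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: blank English pipeline; the entity span is assigned manually
# rather than predicted by a trained NER component.
nlp = spacy.blank("en")
doc = nlp("Albert Einstein was genius.")
doc.ents = [Span(doc, 0, 2, label="PERSON")]  # tokens 0-1: "Albert Einstein"

# jupyter=False forces render() to return the markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)
print(html)
```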