spaCy for Natural Language Processing

The document provides an overview of Natural Language Processing (NLP) and its applications, including sentiment analysis and named entity recognition, using the spaCy library. It covers the installation, basic functionalities, and key components of spaCy, such as tokenization, lemmatization, and part-of-speech tagging. Additionally, it introduces the visualizer displaCy for highlighting named entities in text.

Uploaded by

Bhuvnesh Verma
Copyright
© All Rights Reserved

Natural Language Processing (NLP) basics
NATURAL LANGUAGE PROCESSING WITH SPACY

Azadeh Mobasher
Principal Data Scientist
Natural Language Processing (NLP)

A subfield of Artificial Intelligence (AI)

Helps computers to understand human language

Helps extract insights from unstructured data

Incorporates statistics, machine learning models and deep learning models


NLP use cases
Sentiment analysis

Use of computers to determine the underlying subjective tone of a piece of writing


NLP use cases
Named entity recognition (NER)

Locating and classifying named entities mentioned in unstructured text into pre-defined categories

Named entities are real-world objects such as a person or location


NLP use cases

Chatbots: generate human-like responses to text input (for example, ChatGPT)


Introduction to spaCy
spaCy is a free, open-source library for NLP in Python which:

Is designed to build systems for information extraction

Provides production-ready code for NLP use cases

Supports 64+ languages

Is robust and fast and has visualization libraries


Install and import spaCy

As the first step, spaCy can be installed using the Python package manager pip:

$ python3 -m pip install spacy

spaCy trained models can be downloaded:

$ python3 -m spacy download en_core_web_sm

A downloaded model is then loaded in Python:

import spacy
nlp = spacy.load("en_core_web_sm")

Multiple trained models are available for the English language at [Link]


Read and process text with spaCy
The loaded spaCy model en_core_web_sm is the nlp object
The nlp object converts text into a Doc object (a container) that stores the processed text


spaCy in action
Processing a string using spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "A spaCy pipeline object is created."
doc = nlp(text)

Tokenization
A Token is defined as the smallest meaningful part of the text.

Tokenization: The process of dividing a text into a list of meaningful tokens

print([token.text for token in doc])

['A', 'spaCy', 'pipeline', 'object', 'is', 'created', '.']


Let's practice!
spaCy basics
spaCy NLP pipeline
Import spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Here's my spaCy pipeline.")

Use spacy.load() to return nlp, an instance of the Language class

The Language object is the text processing pipeline

Apply nlp() on any text to get a Doc container


spaCy NLP pipeline

spaCy applies some processing steps using its Language class:


Container objects in spaCy
There are multiple data structures to represent text data in spaCy:

Name     Description
Doc      A container for accessing linguistic annotations of text
Span     A slice from a Doc object
Token    An individual token, i.e. a word, punctuation, whitespace, etc.
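
The relationship between the three containers can be sketched in a few lines. This is a minimal illustration, assuming spaCy v3 is installed; it uses a blank English pipeline (tokenizer only) via spacy.blank("en") rather than the trained model from the slides, so no model download is needed.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# Assumption: a blank English pipeline (tokenizer only) stands in for en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("A spaCy pipeline object is created.")

first_token = doc[0]   # indexing a Doc yields a Token
span = doc[1:4]        # slicing a Doc yields a Span

print(type(doc).__name__, type(span).__name__, type(first_token).__name__)
print(span.text)
```

Indexing and slicing follow ordinary Python sequence conventions, with positions counted in tokens rather than characters.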


Pipeline components
The spaCy language processing pipeline always depends on the loaded model and its capabilities.

Component          Name         Description
Tokenizer          Tokenizer    Segment text into tokens and create Doc object
Tagger             Tagger       Assign part-of-speech tags
Lemmatizer         Lemmatizer   Reduce the words to their root forms
EntityRecognizer   NER          Detect and label named entities


Pipeline components

Each component has unique features to process text


Language

DependencyParser

Sentencizer
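
One way to see which components a pipeline contains is to inspect nlp.pipe_names. The sketch below assumes spaCy v3 and starts from a blank English pipeline instead of a trained model, so the list is empty until a component is added; a trained model such as en_core_web_sm would list its own components here.

```python
import spacy

# Assumption: spaCy v3, where add_pipe() takes the component's string name.
nlp = spacy.blank("en")

# The tokenizer is always present but is not listed as a pipeline component.
components_before = list(nlp.pipe_names)

# Add the rule-based Sentencizer and list the components again.
nlp.add_pipe("sentencizer")
components_after = list(nlp.pipe_names)

print(components_before)
print(components_after)
```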


Tokenization
Always the first operation
All the other operations require tokens

Tokens can be words, numbers and punctuation

import spacy
nlp = spacy.load("en_core_web_sm")

doc = nlp("Tokenization splits a sentence into its tokens.")

print([token.text for token in doc])

['Tokenization', 'splits', 'a', 'sentence', 'into', 'its', 'tokens', '.']
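
To illustrate that tokens can be words, numbers and punctuation, the following sketch prints a few lexical flags per token. It assumes spaCy v3 and a blank English pipeline; the example sentence is made up for illustration and is not from the slides.

```python
import spacy

# Assumption: a blank English pipeline is enough, since only the tokenizer is needed.
nlp = spacy.blank("en")
doc = nlp("I bought 3 apples.")  # hypothetical example sentence

# Collect (text, is a word, looks like a number, is punctuation) for each token.
rows = [(token.text, token.is_alpha, token.like_num, token.is_punct) for token in doc]

for row in rows:
    print(row)
```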


Sentence segmentation
More complex than tokenization
Is part of the DependencyParser component

import spacy
nlp = spacy.load("en_core_web_sm")

text = "We are learning NLP. This course introduces spaCy."

doc = nlp(text)
for sent in doc.sents:
    print(sent.text)

We are learning NLP.
This course introduces spaCy.
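
Sentence boundaries can also come from the rule-based Sentencizer component listed earlier, which splits on sentence-final punctuation and does not need a trained model. This is a sketch of that alternative, assuming spaCy v3; it is not the DependencyParser approach used above.

```python
import spacy

# Assumption: blank English pipeline plus the rule-based sentencizer,
# used here instead of the dependency parser from en_core_web_sm.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("We are learning NLP. This course introduces spaCy.")
sentences = [sent.text for sent in doc.sents]

for sentence in sentences:
    print(sentence)
```

The rule-based approach is quick to set up, but unlike the parser it relies only on punctuation.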


Lemmatization
A lemma is the base form of a token
The lemma of eats and ate is eat

Improves accuracy of language models

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("We are seeing her after one year.")
print([(token.text, token.lemma_) for token in doc])

[('We', 'we'), ('are', 'be'), ('seeing', 'see'), ('her', 'she'),
 ('after', 'after'), ('one', 'one'), ('year', 'year'), ('.', '.')]


Let's practice!
Linguistic features in spaCy
POS tagging
Categorizing words grammatically, based on function and context within a sentence

POS    Description    Example
VERB   Verb           run, eat, ate, take
NOUN   Noun           man, airplane, tree, flower
ADJ    Adjective      big, old, incompatible, conflicting
ADV    Adverb         very, down, there, tomorrow
CONJ   Conjunction    and, or, but


POS tagging with spaCy

POS tagging confirms the meaning of a word

Some words, such as watch, can be both a noun and a verb

spaCy stores POS tags in the pos_ attribute of each Token produced by the nlp pipeline

spacy.explain() explains a given POS tag
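
As a small illustration of spacy.explain(), the sketch below looks up the descriptions of a few POS tags. It needs no trained model, since the function only consults spaCy's built-in glossary; the list of tags is chosen here for illustration.

```python
import spacy

# Tags chosen for illustration; spacy.explain() looks each one up in spaCy's glossary.
tags = ["PRON", "VERB", "NOUN", "ADP", "PUNCT"]
descriptions = {tag: spacy.explain(tag) for tag in tags}

for tag, description in descriptions.items():
    print(tag, "->", description)
```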


POS tagging with spaCy
import spacy
nlp = spacy.load("en_core_web_sm")

verb_sent = "I watch TV."
print([(token.text, token.pos_,
        spacy.explain(token.pos_))
       for token in nlp(verb_sent)])

[('I', 'PRON', 'pronoun'),
 ('watch', 'VERB', 'verb'),
 ('TV', 'NOUN', 'noun'),
 ('.', 'PUNCT', 'punctuation')]

noun_sent = "I left without my watch."
print([(token.text, token.pos_,
        spacy.explain(token.pos_))
       for token in nlp(noun_sent)])

[('I', 'PRON', 'pronoun'),
 ('left', 'VERB', 'verb'),
 ('without', 'ADP', 'adposition'),
 ('my', 'PRON', 'pronoun'),
 ('watch', 'NOUN', 'noun'),
 ('.', 'PUNCT', 'punctuation')]


Named entity recognition
A named entity is a word or phrase that refers to a specific entity with a name
Named-entity recognition (NER) classifies named entities into pre-defined categories

Entity type   Description
PERSON        Named person or family
ORG           Companies, institutions, etc.
GPE           Geo-political entity, countries, cities, etc.
LOC           Non-GPE locations, mountain ranges, etc.
DATE          Absolute or relative dates or periods
TIME          Time smaller than a day


NER and spaCy

spaCy models extract named entities using the NER pipeline component

Named entities are available via the doc.ents property


spaCy will also tag each entity with its entity label ( .label_ )


NER and spaCy

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(ent.text, ent.start_char,
        ent.end_char, ent.label_) for ent in doc.ents])

>>> [('Albert Einstein', 0, 15, 'PERSON')]


NER and spaCy
We can also access the entity type of each token in a Doc container

import spacy
nlp = spacy.load("en_core_web_sm")
text = "Albert Einstein was genius."
doc = nlp(text)
print([(token.text, token.ent_type_) for token in doc])

>>> [('Albert', 'PERSON'), ('Einstein', 'PERSON'),
     ('was', ''), ('genius', ''), ('.', '')]
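
The same doc.ents and token.ent_type_ attributes are filled in regardless of which component sets the entities. The sketch below swaps the trained NER component for spaCy's rule-based entity ruler on a blank pipeline, so it runs without a model download. This is an assumption made for illustration: the single pattern is hand-written and is not how en_core_web_sm finds entities.

```python
import spacy

# Assumption: spaCy v3; a rule-based entity ruler replaces the statistical NER component.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "PERSON", "pattern": "Albert Einstein"}])  # hand-written pattern

doc = nlp("Albert Einstein was genius.")

# Entity-level view: text, character offsets and label of each entity span.
entities = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

# Token-level view: each token with its entity type ('' when it is not part of an entity).
token_types = [(token.text, token.ent_type_) for token in doc]

print(entities)
print(token_types)
```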


displaCy
spaCy is equipped with a modern visualizer: displaCy
The displaCy entity visualizer highlights named entities and their labels

import spacy
from spacy import displacy

text = "Albert Einstein was genius."
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.serve(doc, style="ent")
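
Outside a notebook or a local web server, the entity markup can also be obtained as a string with displacy.render(). The sketch below assumes spaCy v3 and sets the entity by hand on a blank pipeline, purely so that it runs without a trained model; with en_core_web_sm the entity would come from the NER component instead.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: blank pipeline with a manually assigned entity, for illustration only.
nlp = spacy.blank("en")
doc = nlp("Albert Einstein was genius.")
doc.ents = [Span(doc, 0, 2, label="PERSON")]  # tokens 0-1 cover "Albert Einstein"

# jupyter=False makes render() return the HTML markup as a string.
html = displacy.render(doc, style="ent", jupyter=False)

print(html)
```

The returned string can be written to an .html file and opened in a browser.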


Let's practice!