AI: NLP
1
What is Natural Language Processing
(NLP)
• The process of computer analysis of input
provided in a human language (natural language),
and conversion of this input into a useful form of
representation.
• The field of NLP is primarily concerned with
getting computers to perform useful and
interesting tasks with human languages.
• The field of NLP is secondarily concerned with
helping us come to a better understanding of
human language.
2
Forms of Natural Language
• The input/output of a NLP system can be:
– written text
– speech
• We will mostly concerned with written text (not
speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything
required to process written text, plus the
challenges of speech recognition and speech
synthesis.
3
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different level of analysis required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different level of synthesis required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation. But, still
both of them are hard.
4
Why NL Understanding is hard?
• Natural language is extremely rich in form and structure,
and very ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can
be at different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect
the meaning of that sentence.
• Many input can mean the same thing.
• Interaction among components of the input is not clear.
5
Knowledge of Language
• Phonology – concerns how words are related to the sounds
that realize them.
• Morphology – concerns how words are constructed from
more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how can be put together to form correct
sentences and determines what structural role each word
plays in the sentence and what phrases are subparts of
other phrases.
• Semantics – concerns what words mean and how these
meaning combine in sentences to form sentence meaning.
The study of context-independent meaning.
6
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in
different situations and how use affects the
interpretation of the sentence.
• Discourse – concerns how the immediately preceding
sentences affect the interpretation of the next
sentence. For example, interpreting pronouns and
interpreting the temporal aspects of the information.
• World Knowledge – includes general knowledge about
the world. What each language user must know about
the other’s beliefs and goals.
7
What is Natural Language Processing
(NLP)
• The process of computer analysis of input
provided in a human language (natural language),
and conversion of this input into a useful
form of representation.
• The field of NLP is primarily concerned with
getting computers to perform useful and
interesting tasks with human languages.
• The field of NLP is secondarily concerned with
helping us come to a better understanding
of human language.
BİL711 Natural Language Processing 8
Forms of Natural Language
• The input/output of a NLP system can be:
– written text
– speech
• We will mostly concerned with written text (not
speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
• To process spoken language, we need everything
required to process written text, plus the
challenges of speech recognition and speech
synthesis.
BİL711 Natural Language Processing 9
Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different level of analysis required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different level of synthesis required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation. But, still
both of them are hard.
BİL711 Natural Language Processing 10
Why NL Understanding is hard?
• Natural language is extremely rich in form and structure,
and very ambiguous.
– How to represent meaning,
– Which structures map to which meaning structures.
• One input can mean many different things. Ambiguity can
be at different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect
the meaning of that sentence.
• Many input can mean the same thing.
• Interaction among components of the input is not clear.
BİL711 Natural Language Processing 11
Knowledge of Language
• Phonology – concerns how words are related to the sounds
that realize them.
• Morphology – concerns how words are constructed from
more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language.
• Syntax – concerns how can be put together to form correct
sentences and determines what structural role each word
plays in the sentence and what phrases are subparts of
other phrases.
• Semantics – concerns what words mean and how these
meaning combine in sentences to form sentence meaning.
The study of context-independent meaning.
BİL711 Natural Language Processing 12
Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in
different situations and how use affects the
interpretation of the sentence.
• Discourse – concerns how the immediately preceding
sentences affect the interpretation of the next
sentence. For example, interpreting pronouns and
interpreting the temporal aspects of the information.
• World Knowledge – includes general knowledge about
the world. What each language user must know about
the other’s beliefs and goals.
BİL711 Natural Language Processing 13
Language and Intelligence
Turing Test
Computer Human
Human Judge
• Human Judge asks tele-typed questions to Computer and
Human.
• Computer’s job is to act like a human.
• Human’s job is to convince Judge that he is not machine.
• Computer is judged “intelligent” if it can fool the judge
• Judgment of intelligence is linked to appropriate answers to
questions from the system.
BİL711 Natural Language Processing 14
NLP - an inter-disciplinary Field
• NLP borrows techniques and insights from several disciplines.
• Linguistics: How do words form phrases and sentences? What
constraints the possible meaning for a sentence?
• Computational Linguistics: How is the structure of sentences are
identified? How can knowledge and reasoning be modeled?
• Computer Science: Algorithms for automatons, parsers.
• Engineering: Stochastic techniques for ambiguity resolution.
• Psychology: What linguistic constructions are easy or difficult for
people to learn to use?
• Philosophy: What is the meaning, and how do words and sentences
acquire it?
BİL711 Natural Language Processing 15
Some NLP Applications
• Machine Translation – Translation between two natural
languages.
– See the Babel Fish translations system on Alta Vista.
• Information Retrieval – Web search (uni-lingual or
multi-lingual).
• Query Answering/Dialogue – Natural language
interface with a database system, or a dialogue system.
• Report Generation – Generation of reports such as
weather reports.
• Some Small Applications –
– Grammar Checking, Spell Checking, Spell Corrector
BİL711 Natural Language Processing 16
Brief History of NLP
• 1940s –1950s: Foundations
– Development of formal language theory (Chomsky, Backus, Naur,
Kleene)
– Probabilities and information theory (Shannon)
• 1957 – 1970s:
– Use of formal grammars as basis for natural language processing
(Chomsky, Kaplan)
– Use of logic and logic based programming (Minsky, Winograd,
Colmerauer, Kay)
• 1970s – 1983:
– Probabilistic methods for early speech recognition (Jelinek, Mercer)
– Discourse modeling (Grosz, Sidner, Hobbs)
• 1983 – 1993:
– Finite state models (morphology) (Kaplan, Kay)
• 1993 – present:
– Strong integration of different techniques, different areas.
BİL711 Natural Language Processing 17
Natural Language Understanding
Words
Morphological Analysis
Morphologically analyzed words (another step: POS tagging)
Syntactic Analysis
Syntactic Structure
Semantic Analysis
Context-independent meaning representation
Discourse Processing
Final meaning representation
BİL711 Natural Language Processing 18
Natural Language Generation
Meaning representation
Utterance Planning
Meaning representations for sentences
Sentence Planning and Lexical Choice
Syntactic structures of sentences with lexical choices
Sentence Generation
Morphologically analyzed words
Morphological Generation
Words
BİL711 Natural Language Processing 19
Morphological Analysis
• Analyzing words into their linguistic components (morphemes).
• Morphemes are the smallest meaningful units of language.
cars car+PLU
giving give+PROG
geliyordum gel+PROG+PAST+1SG - I was coming
• Ambiguity: More than one alternatives
flies flyVERB+PROG
flyNOUN+PLU
adamı adam+ACC - the man
(accusative)
adam+P1SG - my man
ada+P1SG+ACC - my island
(accusative)
BİL711 Natural Language Processing 20
Morphological Analysis (cont.)
• Relatively simple for English. But for some languages
such as Turkish, it is more difficult.
uygarlaştıramadıklarımızdanmışsınızcasına
uygar-laş-tır-ama-dık-lar-ımız-dan-mış-sınız-casına
uygar +BEC +CAUS +NEGABLE +PPART +PL +P1PL +ABL +PAST +2PL +AsIf
“(behaving) as if you are among those whom we could not civilize/cause to become civilized”
+BEC is “become” in English
+CAUS is the causative voice marker on a verb
+PPART marks a past participle form
+P1PL is 1st person plural possessive marker
+2PL is 2nd person plural
+ABL is the ablative (from/among) case marker
+AsIf is a derivational marker that forms an adverb from a finite verb form
+NEGABLE is “not able” in English
• Inflectional and Derivational Morphology.
• Common tools: Finite-state transducers
BİL711 Natural Language Processing 21
Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag to describe its category.
• Part-of-speech tag of a word is one of major word groups
(or its subgroups).
– open classes -- noun, verb, adjective, adverb
– closed classes -- prepositions, determiners, conjuctions,
pronouns, particples
• POS Taggers try to find POS tags for the words.
• duck is a verb or noun? (morphological analyzer cannot
make decision).
• A POS tagger may make that decision by looking the
surrounding words.
– Duck! (verb)
– Duck is delicious for dinner. (noun)
BİL711 Natural Language Processing 22
Lexical Processing
• The purpose of lexical processing is to determine meanings of
individual words.
• Basic methods is to lookup in a database of meanings -- lexicon
• We should also identify non-words such as punctuation marks.
• Word-level ambiguity -- words may have several meanings, and the
correct one cannot be chosen based solely on the word itself.
– bank in English
– yüz in Turkish
• Solution -- resolve the ambiguity on the spot by POS tagging
(if possible) or pass-on the ambiguity to the other levels.
BİL711 Natural Language Processing 23
Syntactic Processing
• Parsing -- converting a flat input sentence into a hierarchical
structure that corresponds to the units of meaning in the sentence.
• There are different parsing formalisms and algorithms.
• Most formalisms have two main components:
– grammar -- a declarative representation describing the syntactic
structure of sentences in the language.
– parser -- an algorithm that analyzes the input and outputs its
structural representation (its parse) consistent with the grammar
specification.
• CFGs are in the center of many of the parsing mechanisms. But they
are complemented by some additional features that make the
formalism more suitable to handle natural languages.
BİL711 Natural Language Processing 24
Semantic Analysis
• Assigning meanings to the structures created by
syntactic analysis.
• Mapping words and structures to particular domain
objects in way consistent with our knowledge of the
world.
• Semantic can play an import role in selecting among
competing syntactic analyses and discarding illogical
analyses.
– I robbed the bank -- bank is a river bank or a
financial institution
• We have to decide the formalisms which will be used in
the meaning representation.
BİL711 Natural Language Processing 25
Knowledge Representation for NLP
• Which knowledge representation will be used depends
on the application -- Machine Translation, Database
Query System.
• Requires the choice of representational framework, as
well as the specific meaning vocabulary (what are
concepts and relationship between these concepts --
ontology)
• Must be computationally effective.
• Common representational formalisms:
– first order predicate logic
– conceptual dependency graphs
– semantic networks
– Frame-based representations
BİL711 Natural Language Processing 26
How to get there
NLP applications are all similar in that they
require some level of understanding.
Understand the query, understand the
document, understand the data being
communicated…
Understanding Sentences: Overview
Parsing and Grammar
How is a sentence composed?
Lexicons
How is a word composed?
Ambiguity
Parsing Requirements
Requires a defined Grammar
Requires a big dictionary (10K words)
Requires that sentences follow the grammar
defined
Requires ability to deal with words not in
dictionary
Parsing (from Section 22.4)
Goal:
Understand a single sentence by syntax analysis
Methods
– Bottom-up
– Top-down
More efficient (and complicated) algorithm
given in 23.2
A Parsing Example
S NP VP
NP Article N | Proper
Rules: VP Verb NP
N home | boy | store
Proper Betty | John
Verb go|give|see
Article the | an | a
The Sentence: The boy went home.
n-grams
• Limit hi to n-1 preceding words
Most used cases
n
– Uni-gram: P ( s ) P( wi )
i 1
n
– Bi-gram: P( s) P( wi | wi 1 )
i 1
n
– Tri-gram: P( s) P( wi | wi 2 wi 1 )
i 1
A simple example
(corpus = 10 000 words, 10 000 bi-grams)
wi P(wi) wi-1 wi-1wi P(wi|wi-1)
I (10) 10/10 000 # (1000) (# I) (8) 8/1000
= 0.001 = 0.008
that (10) (that I) (2) 0.2
talk (8) 0.0008 I (10) (I talk) (2) 0.2
we (10) (we talk) (1) 0.1
…
talks (8) 0.0008 he (5) (he talks) (2) 0.4
she (5) (she talks) (2) 0.4
…
she (5) 0.0005 says (4) (she says) (2) 0.5
laughs (2) (she laughs) (1) 0.5
listens (2) (she listens) (2) 1.0
Uni-gram: P(I, talk) = P(I) * P(talk) = 0.001*0.0008
P(I, talks) = P(I) * P(talks) = 0.001*0.0008
Bi-gram: P(I, talk) = P(I | #) * P(talk | I) = 0.008*0.2
P(I, talks) = P(I | #) * P(talks | I) = 0.008*0
Smoothing
• Goal: assign a low probability to words or
n-grams not observed in the training corpus
P
MLE
smoothed
word
Smoothing methods
n-gram:
• Change the freq. of occurrences
– Laplace smoothing (add-one):
| | 1
Padd _ one ( | C )
(| i | 1)
i V
– Good-Turing
nr 1
change the freq. r to r* (r 1)
nr
nr = no. of n-grams of freq. r
Smoothing (cont’d)
• Combine a model with a lower-order model
– Backoff (Katz)
PGT (wi | wi 1 ) if | wi 1wi | 0
PKatz ( wi | wi 1 )
(wi 1 ) PKatz ( wi ) otherwise
– Interpolation (Jelinek-Mercer)
PJM ( wi | wi 1 ) wi1 PML ( wi | wi 1 ) (1 wi1 ) PJM ( wi )
Information Retrieval
Now the main focus of Natural Language
Processing
There are four types:
1. Query answering
2. Text categorization
3. Text summary
4. Data extraction
Information Retrieval: The task
Choose from some set of documents ones that
are related to my query
Ex. Internet search
Information Retrieval
Methods
Boolean: “(Natural AND Language) OR
(Computational AND Linguistics)”
• too confusing for most users
Vector: Assign different weights to each term in
query. Rank documents by distance from
query and report ones that are close.
Information Retrieval
Mostly implemented using simple statistical
models on the words only
More advanced NLP techniques have not
yielded significantly better results
Information in a text is mostly in its words
41
Text Categorization
Once upon a time… this was done by humans
Computers are much better at it (and more consistent)
Best success for NLP so far (90+ % accuracy)
Much faster and more consistent than humans.
Automated systems now perform most of the work.
NLP works better for TC than IR because categories are
fixed.
Text Summarization
Main task: understand main meaning and
describe in a shorter way
Common Systems: Microsoft
How:
– Sentence/paragraph extraction (find the most
important sentences/paragraphs and string them
together for a summary)
– Statistical methods are more common
The PageRank Algo
• PageRank3 was one of the two original ideas
that set Google’s search apart from other Web
search engines when it was introduced in
1997.
• “The other innovation was the use of anchor
text—the underlined text in a hyperlink—to
index a page, even though the anchor text was
ona different page than the one being
indexed.)
44
45
• The PageRank algorithm is designed to weight links
from high-quality sites more heavily. What is a high-
quality site? One that is linked to by other high-quality
sites. The definition is recursive, but we will see that
the recursion bottoms out properly.
• The PageRank for a page p is defined as:
• where PR(p) is the PageRank of page p, N is the total
number of pages in the corpus, ini are the pages that
link in to p, and C(ini) is the count of the total number
of out-links on page ini.
• The constant d is a damping factor. It can be
understood through the random surfer model: imagine
a Web surfer who starts at some random page and
begins exploring.
46
The HITS Algo: Question Answering
System
• The Hyperlink-Induced Topic Search algorithm,
also known as “Hubs and Authorities” or HITS,
is another influential link-analysis algorithm.
• Both PageRank and HITS played important
roles in developing our understanding of Web
information retrieval.
47
HITS differs from PageRank in several ways:
• First, it is a query-dependent measure: it rates
pages with respect to a query. That means that it
must be computed anew for each query—a
computational burden that most search engines
have elected not to take on.
• Given a query, HITS first finds a set of pages that
are relevant to the query. It does that by
intersecting hit lists of query words, and then
adding pages in the link neighborhood of these
pages—pages that link to or are linked from one
of the pages in the original relevant set.
48
Question Answering
• Information retrieval is the task of finding documents
that are relevant to a query, where the query may be a
question, or just a topic area or concept.
• Question answering is a somewhat different task, in
which the query really is a question, and the answer is
not a ranked list of documents but rather a short
response—a sentence, or even just a phrase.
• There have been question-answering NLP (natural
language processing) systems since the 1960s, but only
since 2001 have such systems used Web information
retrieval to radically increase their breadth of coverage.
49
Information Extraction
• In formation extraction is the process of acquiring
knowledge by skimming a text and looking for
occurrences of a particular class of object and for
relationships among objects.
• A typical task is to extract instances of addresses
from Web pages, with database fields for street,
city, state, and zip code; or instances of storms
from weather reports, with fields for
temperature, wind speed, and precipitation.
50
• 1. Tokenization
• 2. Complex-word handling
• 3. Basic-group handling
• 4. Complex-phrase handling
• 5. Structure merging
51