
CoSc581 NLP Topic 1 - Introduction PDF

Uploaded by

Abdisa Abdella

HARAMAYA UNIVERSITY

COLLEGE OF COMPUTING AND INFORMATICS


DEPARTMENT OF COMPUTER SCIENCE

NATURAL LANGUAGE PROCESSING (NLP)
COSC581

Wondwossen Mulugeta (PhD) email: [Link]@[Link]
What is NLP
 Natural Language Processing (NLP) is both a modern computational technology and a method of
investigating and evaluating claims about human language itself.
 It is also called Computational Linguistics and is linked to Artificial Intelligence (AI), the
general study of cognitive function by computational processes, normally with an emphasis on
the role of knowledge representations, that is, the representations of our knowledge of the
world that are needed in order to understand human language with computers.
Scope


Course Code: CoSc581
Course Title: Natural Language Processing
Course Description: This course is an introduction to natural language processing, the study of
human language from a computational perspective, and is designed to get students up to speed
with current research in the area. It covers morphological, syntactic, semantic and pragmatic
processing models, emphasizing statistical or corpus-based methods and algorithms. It also
covers applications of these methods and models in syntactic parsing, information extraction,
statistical machine translation, dialogue systems, and summarization.
Learning Outcomes: On successful completion of this module, students will be able to explain
and apply fundamental algorithms and techniques in the area of natural language processing
(NLP). In particular, students will:
➢ describe major trends and systems in natural language processing;
➢ define morphology, syntax, semantics and pragmatic processing, and give appropriate examples
to illustrate their definitions;
➢ describe approaches to syntax and semantics in NLP;
➢ describe approaches to pragmatics, generation, dialogue and summarization within NLP;
➢ describe current corpus-based methods in NLP;
➢ describe statistical techniques as applied within NLP;
➢ describe an application of natural language processing (for instance machine translation or
information retrieval) and show the place of syntactic, semantic and pragmatic processing.
Topics

Topic 1: Natural Language Processing: Background and Application
Subtopics: Definitions, Scope and Coverage of NLP, Applications of NLP, Approaches to NLP,
Levels of language processing, NLP from different views, Course project idea
Activity 1: Review the research in NLP conducted for Ethiopian languages

Topic 2: Morphology
Subtopics: Tokenization and Word Segmentation, Stemming, Punctuation, Number, Lemmatization,
Morphological Processing (Types of Morpheme, Morphological Types, Morphological Rules,
Morphemes and Words, Inflectional and Derivational Morphology)

Topic 3: Parsing and Syntax
Subtopics: Part-of-speech tagging, Syntax and Parsing (Introduction, Context-Free Grammar,
Parsing)

Topic 4: Lexical semantics and word-sense disambiguation
Subtopics: Practical Problem, Word Sense Relations, Lexical Semantics, Disambiguation
Approaches, Knowledge-based WSD, Machine-readable Dictionary, Selectional Preference for WSD

Topic 5: Natural Language Generation / Text Summarization
Subtopics: Application, Types, Single-document Summarization, Sentiment Analysis

Topic 6: Pragmatic Processing
Subtopics: Pragmatics in Language, Extracting Meaning beyond word-level semantics, Application

Topic 7: Machine Translation
Subtopics: Application, Challenges, Approaches

Optional: Speech Processing Systems
Subtopics: Introduction, Challenges, Automatic Speech Recognition (Approaches, Acoustic
Modeling, Lexical Modeling), Text-to-Speech System (Text Analysis, Waveform Synthesis),
Evaluations
Text Book and Reference
Text Book: Jurafsky, Daniel, and James H. Martin (2009). Speech and Language Processing: An
Introduction to NLP, Speech Recognition, and Computational Linguistics, 2nd ed.

Reference Materials:
[Link] Manning and Hinrich Schütze. Foundations of Statistical NLP, MIT Press, 1999.
[Link] Allen. Natural Language Understanding, 2nd edition.
[Link] R. Hausser. Foundations of Computational Linguistics: Human-Computer Communication in
Natural Language, Springer Verlag, 2001.
[Link] Mitchell. Machine Learning. McGraw Hill, 1997.
Evaluation
Assessment Method:
• Activity 1: 10%
• Activity 2: 15%
• Project: 25%
• Class Participation: 10%
• Final Exam: 40%
TOPIC 1: BACKGROUND AND APPLICATION

NATURAL LANGUAGE PROCESSING
COSC-581
Language Technology
 Mostly solved
 Spam detection: “Let’s go to Agra!” ✓ / “Buy DraG…”
 Part-of-speech (POS) tagging: Colorless (ADJ) green (ADJ) ideas (NOUN) sleep (VERB) furiously (ADV).
 Named entity recognition (NER): Einstein (PERSON) met with UN (ORG) officials in Princeton (LOC)
 Making good progress
 Sentiment analysis: “Best roast chicken in San Francisco!” / “The waiter ignored us for 20 minutes.”
 Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
 Word sense disambiguation (WSD): “I need new batteries for my mouse.”
 Parsing: “I can see Alcatraz from the window!”
 Machine translation (MT): 第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…
 Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → Party, May 27 (add)
 Still really hard
 Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
 Paraphrase: “XYZ acquired ABC yesterday” / “ABC has been taken over by XYZ”
 Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
 Dialog: “Where is Citizen Kane playing in SF?” / “Castro Theatre at 7:30. Do you want a ticket?”
Background
 Solving language-related problems is the main concern of the fields known as Natural Language
Processing, Computational Linguistics, and Speech Recognition and Synthesis.
 A few applications of language processing:
 spelling correction,
 grammar checking,
 information retrieval,
 machine translation, and
 speech processing, etc.
Knowledge in NLP
 Analyzing an incoming audio signal, recovering the exact sequence of words, and generating a
spoken response require knowledge of phonetics and phonology, which helps model how words are
pronounced in colloquial speech.
 Producing and recognizing the variations of individual words (e.g., recognizing that “doors”
is plural) requires knowledge of morphology, which captures information about the shape and
behavior of words in context.
Knowledge in NLP
 Syntax: the knowledge needed to order and group words together
 I’m I do, sorry that afraid Dave I’m can’t.
(Dave, I’m sorry I’m afraid I can’t do that.)
 ቤቴ ሄጄ እመጣለሁ vs እመጣለሁ ቤቴ ሄጄ vs ሄጄ እመጣለሁ ቤቴ
 Lexical semantics: knowledge of the meanings of the component words
 Compositional semantics: knowledge of how these components combine to form larger meanings
 data + base = database; ቤት + መጽሐፍ = ቤተመጽሐፍ
Knowledge in NLP
 Pragmatics: knowledge of the appropriate use of polite and indirect language
o No or
o No, I won’t open the door.
◼ I’m sorry, I’m afraid I can’t.
◼ I won’t.
 Discourse conventions: knowledge of how to structure such conversations correctly (intonation,
gesture, style, speech act, etc.)
◼ Dave, I’m sorry I’m afraid I can’t do that.
 The word “that” refers to something that is not part of the sentence.
Knowledge in Language Processing
 Phonetics and Phonology: the study of linguistic sounds
 Morphology: the study of the meaningful components of words
 Syntax: the study of the structural relationships between words
 Semantics: the study of meaning
 Pragmatics: the study of how language is used to accomplish goals
 Discourse: the study of linguistic units larger than a single utterance
Technologies
 Speech recognition
 Spoken language is recognized and transformed into text (as in dictation systems), into
commands (as in command-and-control systems), or into some other internal representation.
 Speech synthesis
 Utterances in spoken language are produced from text (text-to-speech systems) or from
internal representations of words or sentences (concept-to-speech systems).
Technologies
 Text Categorization
 This technology assigns texts to categories. Texts may belong to more than one category, and
categories may contain other categories.
◼ E.g.: news classification
 Text Summarization
 The most relevant portions of a text are extracted as a summary. The task depends on the
required length of the summary. Summarization is harder if the summary has to be specific to
a certain query.
◼ E.g.: news summarization
Technologies
 Text Indexing
 As a precondition for document retrieval, texts are stored in an indexed database. Usually a
text is indexed for all word forms or, after lemmatization, for all lemmas. Sometimes indexing
is combined with categorization and summarization.
 Text Retrieval
 The texts that best match a given query or document are retrieved from a database. The
candidate documents are ordered with respect to their expected relevance. Indexing,
categorization, summarization and retrieval are often subsumed under the term information
retrieval.
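The indexing and retrieval steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the three documents, the `tokenize` helper and the term-overlap score are assumptions made for the example, and word forms are indexed without lemmatization.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (w.strip(".,!?;:\"'()").lower() for w in text.split())
    return [t for t in tokens if t]

def build_index(docs):
    """Inverted index: map each word form to the ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def retrieve(index, query):
    """Order candidate documents by the number of distinct query terms they contain."""
    scores = defaultdict(int)
    for term in set(tokenize(query)):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical toy collection.
docs = {
    1: "Machine translation of Amharic text",
    2: "Speech recognition for Amharic",
    3: "Statistical machine translation",
}
index = build_index(docs)
print(retrieve(index, "Amharic machine translation"))  # [1, 3, 2]
```

Real retrieval systems usually replace the simple overlap count with a weighted relevance score, but the index structure is the same.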
Technologies
 Information Extraction
 Relevant pieces of information are discovered and marked for extraction. The extracted
pieces can be the topic, named entities such as company, place or person names, simple
relations such as prices, destinations or functions, or complex relations describing
accidents, company mergers or football matches.
◼ E.g.: Named Entity Recognition
 Data and Text Mining
 Extracted pieces of information from several sources are combined in one database.
Previously undetected relationships may be discovered.
Technologies
 Question Answering
 Natural language queries are used to access information in a database. The database may be
a base of structured data or a repository of digital texts in which certain parts have been
marked as potential answers.
 Report Generation
 A report in natural language is produced that describes the essential contents or changes
of a database.
Technologies
 Spoken Dialogue Systems
 The system can carry out a dialogue with a human user in which the user can solicit
information or conduct purchases, reservations or other transactions.
 Translation Technologies
 Technologies that translate texts or assist human translators. Automatic translation is
called machine translation. Translation memories use large amounts of text together with
existing translations for efficient look-up of possible translations for words, phrases and
sentences.
◼ What do we need to do?
Methods and Resources
 The methods of language technology come from several disciplines:
 computer science,
 computational and theoretical linguistics,
 mathematics,
 electrical engineering and
 psychology.
Methods and Resources
 Generic CS Methods
 Programming languages, algorithms for generic data types, and software engineering methods
for structuring and organizing software development and quality assurance.
 Specialized Algorithms
 Dedicated algorithms have been designed for parsing, generation and translation, for
morphological and syntactic processing with finite-state automata/transducers, and for many
other tasks.
 Nondiscrete Mathematical Methods
 Statistical techniques have become especially successful in speech processing, information
retrieval, and the automatic acquisition of language models. Other methods in this class are
neural networks and powerful techniques for optimization and search.
Methods and Resources
 Logical and Linguistic Formalisms
 For deep linguistic processing, constraint-based grammar formalisms are employed. Complex
formalisms have been developed for the representation of semantic content and knowledge.
 Linguistic Knowledge
 Linguistic knowledge resources for many languages are utilized: dictionaries, morphological
and syntactic grammars, and rules for semantic interpretation, pronunciation and intonation.
 Corpora and Corpus Tools
 Large application-specific or generic collections of spoken and written language are
exploited for the acquisition and testing of statistical or rule-based language models.
Approach to NLP
 Rule Based (Hand-Crafted Rules)
 Develop the rules to process natural languages based on known facts and exceptions
 Machine Learning
 Capture rules from examples and apply them to new instances
◼ Supervised: learn by comparing with the expected output
◼ Unsupervised: blind learning. Create knowledge by association rather than from a predefined
output
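As a small illustration of supervised learning, the sketch below trains a perceptron to separate spam from ordinary messages. The weights change only when the prediction differs from the expected output, which is the comparison the slide refers to. The four training messages and the bag-of-words features are hypothetical, and the perceptron is just one of many possible learners.

```python
def features(text):
    """Bag-of-words features: the set of lowercased tokens."""
    return set(text.lower().split())

def predict(weights, text):
    """Return +1 (spam) if the summed word weights are positive, otherwise -1 (not spam)."""
    score = sum(weights.get(w, 0) for w in features(text))
    return 1 if score > 0 else -1

def train(examples, epochs=10):
    """Perceptron training: update weights only when prediction and expected label disagree."""
    weights = {}
    for _ in range(epochs):
        for text, expected in examples:
            if predict(weights, text) != expected:
                for w in features(text):
                    weights[w] = weights.get(w, 0) + expected
    return weights

# Hypothetical labelled examples: +1 = spam, -1 = not spam.
examples = [
    ("buy cheap pills now", 1),
    ("cheap offer buy now", 1),
    ("let us go to agra", -1),
    ("meeting at noon today", -1),
]
weights = train(examples)
print(predict(weights, "buy cheap pills"))   # 1
print(predict(weights, "let us go today"))   # -1
```

A rule-based system would instead have a person write the word list by hand; here the same knowledge is captured from the labelled examples.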
Approach to NLP
 There are various approaches, but the most important models are:
◼ state machines,
◼ formal rule systems,
◼ logic,
◼ probability theory and
◼ other machine learning tools
 The most important algorithms for these models are:
 state space search algorithms and
 dynamic programming algorithms
Approach to NLP
 State machines are
 formal models that consist of states, transitions among states, and an input representation.
 Some of the variations of this basic model:
 deterministic and non-deterministic finite-state automata,
 finite-state transducers, which can write to an output device,
 Markov models and hidden Markov models, which have a probabilistic component.
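A deterministic finite-state automaton can be written directly as a transition table. The sketch below recognizes the "sheep language" (b, followed by two or more a's, followed by !), a common textbook example chosen here for illustration; the state numbering is an assumption of the example.

```python
# Transition table: (current state, input symbol) -> next state.
TRANSITIONS = {
    (0, "b"): 1,
    (1, "a"): 2,
    (2, "a"): 3,
    (3, "a"): 3,   # loop: any number of additional a's
    (3, "!"): 4,
}
START = 0
ACCEPTING = {4}

def accepts(string):
    """Run the automaton over the input; reject as soon as no transition applies."""
    state = START
    for symbol in string:
        key = (state, symbol)
        if key not in TRANSITIONS:
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING

print(accepts("baaa!"))  # True
print(accepts("ba!"))    # False
```

The states, the transitions and the input string correspond to the three components of a state machine named above.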
A Finite State Transducer
Generation using FST
Analysis using FST
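A minimal sketch of how one finite-state transducer can be used for both generation and analysis. The toy lexicon (cat, dog), the tags +N, +SG and +PL, and the arc layout are assumptions made for illustration and are not taken from the slides. Each arc pairs a lexical symbol with a surface symbol; generation reads the lexical side and writes the surface side, and analysis does the reverse.

```python
# Arcs: (source state, lexical symbol, surface symbol, target state); "" means epsilon.
ARCS = [
    (0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),   # stem "cat"
    (0, "d", "d", 4), (4, "o", "o", 5), (5, "g", "g", 3),   # stem "dog"
    (3, "+N", "", 6),     # the noun tag has no surface realization
    (6, "+SG", "", 7),    # singular: nothing is written
    (6, "+PL", "s", 7),   # plural: the suffix "s" is written
]
START, FINALS = 0, {7}

def transduce(symbols, direction="generate"):
    """Return every output string that the transducer pairs with the input symbols."""
    results = []
    agenda = [(START, 0, "")]  # (state, position in input, output so far)
    while agenda:
        state, pos, out = agenda.pop()
        if pos == len(symbols) and state in FINALS:
            results.append(out)
        for src, lexical, surface, dst in ARCS:
            if src != state:
                continue
            read, write = (lexical, surface) if direction == "generate" else (surface, lexical)
            if read == "":
                agenda.append((dst, pos, out + write))       # epsilon: consume no input
            elif pos < len(symbols) and symbols[pos] == read:
                agenda.append((dst, pos + 1, out + write))
    return results

print(transduce(["d", "o", "g", "+N", "+PL"]))   # ['dogs']
print(transduce(list("cats"), "analyze"))        # ['cat+N+PL']
```

The agenda-based search is needed because analysis is non-deterministic: after the stem, the transducer must try both the singular and the plural arc. This sketch assumes the transducer has no epsilon cycles.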
Approach to NLP
 Formal rule systems:
 regular grammars and regular relations, context-free grammars, as well as probabilistic
variants of them all.
 State machines and formal rule systems are the main tools used when dealing with knowledge
of phonology, morphology, and syntax.
 The algorithms associated with these models typically search through a space of states
representing hypotheses about an input.
 Among the algorithms that are often used for these tasks are well-known graph algorithms
such as depth-first search, as well as heuristic variants such as best-first and A* search.
 The dynamic programming paradigm ensures that redundant computations are avoided.
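Minimum edit distance is a standard example of dynamic programming in language processing and is used here as an illustration; it is not mentioned on the slide. Each table cell is computed once from its three neighbours, so no alignment of prefixes is ever recomputed. The sketch assumes unit costs for insertion, deletion and substitution.

```python
def edit_distance(source, target):
    """Levenshtein distance between two strings, computed bottom-up."""
    n, m = len(source), len(target)
    # dist[i][j] = minimum cost of turning source[:i] into target[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i            # delete all i characters
    for j in range(1, m + 1):
        dist[0][j] = j            # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m]

print(edit_distance("kitten", "sitting"))       # 3
print(edit_distance("intention", "execution"))  # 5
```

A naive recursive search over all alignments would revisit the same prefix pairs exponentially often; the table reduces the work to one step per cell.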
Approach to NLP
 The third model for capturing knowledge of language is logic:
 first-order logic,
 feature structures,
 semantic networks, and
 conceptual dependency.
 These logical representations have traditionally been the tool of choice when dealing with
knowledge of semantics, pragmatics, and discourse (although applications in these areas
increasingly rely on the simpler mechanisms used in phonology, morphology, and syntax).
Approach to NLP
 State machines, formal rule systems, and logic can be augmented with probabilities.
 In these models probability theory is used to resolve ambiguity:
 “given N choices for some ambiguous input, choose the most probable one”.
 Another major advantage of probabilistic models is that
 they are one of a class of machine learning models.
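The rule "choose the most probable one" can be sketched as follows. The tag counts for the ambiguous word "book" are hypothetical numbers standing in for counts from a tagged corpus; probabilities are estimated as relative frequencies.

```python
from collections import Counter

# Hypothetical counts of each part-of-speech tag for the word "book" in a tagged corpus.
tag_counts = Counter({"NOUN": 80, "VERB": 20})

def most_probable(counts):
    """Estimate P(choice) by relative frequency and return the best choice with its probability."""
    total = sum(counts.values())
    probabilities = {choice: count / total for choice, count in counts.items()}
    best = max(probabilities, key=probabilities.get)
    return best, probabilities[best]

print(most_probable(tag_counts))  # ('NOUN', 0.8)
```

Practical taggers also condition on the surrounding words, but the decision rule is the same argmax over the candidate choices.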
Approach to NLP
 Machine learning research has focused on ways to automatically learn the various
representations described above:
 automata, rule systems, search heuristics, classifiers.
 These systems can be trained on large corpora and can be used as a powerful modeling
technique, especially in places where we don’t yet have good causal models.
Language, Thought, and Understanding
 The effective use of language is intertwined with our general cognitive abilities.
 Turing Test by Alan Turing (1950)
 He suggested an empirical test, a game, in which a computer’s use of language would form
the basis for determining whether it could think. If the machine could win the game, it would
be judged intelligent.
Language, Thought, and Understanding
 How can the intelligence of a machine be measured?
 Turing test – Alan Turing (1950)
 A machine can be accepted as intelligent if it can fool a judge into believing that it is
human in a teletype conversation.
 ELIZA by Weizenbaum (1966)
 Pretends to be a psychiatrist and converses with a user about the user’s problems.
 Uses keyword pattern matching.
 Many users thought the machine really understood their problems.
 Many such systems exist now, e.g., Alan, Alice, etc.
Language, Thought, and Understanding
 ELIZA program (Weizenbaum, 1966)
 ELIZA was an early natural language processing system capable of carrying on a limited form
of conversation with a user.
 Consider the following session with a version of ELIZA that imitated the responses of a
Rogerian psychotherapist.
User1: You are like my father in some ways.
ELIZA1: WHAT RESEMBLANCE DO YOU SEE
User2: You are not very aggressive but I think you don’t want me to notice that.
ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
User3: You don’t argue with me.
ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOU
User4: You are afraid of me.
ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
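Keyword pattern matching of this kind can be approximated with a few regular expressions. The four rules below are written only to reproduce the session above; they are an illustrative assumption, not Weizenbaum's original script, and the fallback reply is likewise invented for the example.

```python
import re

# Each rule pairs a keyword pattern with a response template; \1 reuses the captured word.
RULES = [
    (re.compile(r".*\byou are like\b.*", re.I),
     "WHAT RESEMBLANCE DO YOU SEE"),
    (re.compile(r".*\byou are not (?:very )?(\w+).*", re.I),
     r"WHAT MAKES YOU THINK I AM NOT \1"),
    (re.compile(r".*\byou don.t (\w+) with me\b.*", re.I),
     r"WHY DO YOU THINK I DON'T \1 WITH YOU"),
    (re.compile(r".*\byou are (\w+) of me\b.*", re.I),
     r"DOES IT PLEASE YOU TO BELIEVE I AM \1 OF YOU"),
]

def respond(utterance):
    """Return the response of the first rule whose pattern matches the utterance."""
    for pattern, template in RULES:
        match = pattern.match(utterance)
        if match:
            return match.expand(template).upper()
    return "PLEASE GO ON"  # hypothetical default when no keyword is found

print(respond("You are afraid of me."))
# DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
```

The program has no model of meaning: it only rearranges words captured from the input, which is why its apparent understanding breaks down outside the anticipated patterns.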
NLP Applications
 Question answering
 Who is the first Taiwanese president?
 Text Categorization/Routing
 e.g., customer e-mails.
 Text Mining
 Find everything that interacts with BRCA1.
 Machine (Assisted) Translation
 Language Teaching/Learning
 Usage checking, grammar, spelling, etc.
 Spelling correction
 Is that just dictionary lookup?
Application areas
 Text-to-Speech & Speech Recognition
 Natural Language Dialogue Interfaces to Databases
 Information Retrieval
 Information Extraction
 Document Classification
 Document Image Analysis
 Automatic Summarization
 Text Proofreading – Spelling & Grammar
 Machine Translation
 Story Understanding Systems
 Plagiarism Detection
 Can you think of anything else?
Big Challenge
 Language = words + rules + exceptions.
 Ambiguity is present at all levels.
 There are very many different languages.
 Language has a cultural element.
 Languages are not equivalent.
 Language is highly systematic but also complex.
 Language keeps changing: new words, new rules and new exceptions.
 The sources (electronic texts, printed texts, acoustic speech signals) are noisy.
 Language looks obvious to us, but it is a big deal for computers and for artificial
intelligence.
Project Idea
1. Select a topic for your project. The topic should be in line with the NLP applications we
have identified.
2. Gather a sufficient number of references from the literature and review them in detail.
3. Produce your project proposal.
4. After getting approval, write a literature review that has at least the following sections:
a) Approaches that exist to address the problem
b) The model of the selected approach
c) Detailed description of the selected approach
d) Detailed description of the methodology you are planning to follow
e) Detailed description of the tools you are planning to use
Project Idea
5. After completing the literature review, proceed to:
a) Design your system
b) Do the project implementation
c) Test your system
d) Prepare the report for submission
e) Present your findings to the class
f) Submit your data, the tools used and your implementation together with the implementation
guide
End of Topic 1: Introduction
