CoSc581 NLP Topic 1 - Introduction PDF
NATURAL LANGUAGE PROCESSING (NLP)
COSC581
Wondwossen Mulugeta (PhD) email: [Link]@[Link]
What is NLP
Topics
Topic 1: Natural Language Processing: Background and Application
Subtopics: Definitions, scope and coverage of NLP; applications of NLP; approaches to NLP; levels of language processing; NLP from different views; course project idea
Activity 1: Review the research in NLP conducted for Ethiopian languages
Topic 2: Morphology
Subtopics: Tokenization and word segmentation; stemming; punctuation; numbers; lemmatization; morphological processing (types of morpheme, morphological types, morphological rules, morphemes and words, inflectional and derivational morphology)
Topic 4: Lexical semantics and word-sense disambiguation
Subtopics: Practical problem; word sense relations; lexical semantics; disambiguation approaches; knowledge-based WSD; machine-readable dictionary; selectional preference for WSD
Topic 5: Natural Language Generation/Text Summarization
Subtopics: Application; types; single-document summarization; sentiment analysis
Topic 6: Pragmatic Processing
Subtopics: Pragmatics in language; extracting meaning beyond word-level semantics; application
Topic 7: Machine Translation
Subtopics: Application; challenges; approaches
Background
Language technology tasks, with examples:
◼ Spam detection: "Let’s go to Agra!" ✓ / "Buy DraG…" ✗
◼ Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV
◼ Named entity recognition (NER): Einstein [PERSON] met with UN [ORG] officials in Princeton [LOC]
◼ Coreference resolution: "Carter told Mubarak he shouldn’t run again."
◼ Word sense disambiguation (WSD): "I need new batteries for my mouse."
◼ Parsing: "I can see Alcatraz from the window!"
◼ Machine translation (MT): 第13届上海国际电影节开幕… → The 13th Shanghai International Film Festival…
◼ Information extraction (IE): "You’re invited to our dinner party, Friday May 27 at 8:30" → Party, May 27 [add]
◼ Paraphrase: "XYZ acquired ABC yesterday" / "ABC has been taken over by XYZ"
◼ Summarization: "The Dow Jones is up", "The S&P500 jumped", "Housing prices rose" → "Economy is good"
◼ Dialog: "Where is Citizen Kane playing in SF?" / "Castro Theatre at 7:30. Do you want a ticket?"
machine translation,
Technologies
Speech recognition
Spoken language is recognized and transformed into text (as in dictation systems), into commands (as in command-and-control systems), or into some other internal representation.
Speech synthesis
Utterances in spoken language are produced from text (text-to-speech systems) or from internal representations of words or sentences (concept-to-speech systems).
Text categorization
This technology assigns texts to categories. A text may belong to more than one category, and categories may contain other categories.
◼ E.g., news classification
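The two properties mentioned here (a text may carry several labels, and categories may be nested) can be illustrated with a minimal keyword-based sketch. The keyword lists and the category tree are hypothetical and invented purely for illustration; practical classifiers are usually learned from labeled data rather than hand-written.

```python
# Hypothetical keyword lists and category tree, invented for illustration only.
KEYWORDS = {
    "football": {"goal", "striker", "league"},
    "economy": {"inflation", "market", "prices"},
    "politics": {"election", "parliament", "minister"},
}
PARENT = {"football": "sports"}  # the category "sports" contains "football"


def categorize(text):
    """Return every category whose keywords occur in the text, plus its ancestors."""
    words = {w.strip(".,!?;:").lower() for w in text.split()}
    labels = set()
    for category, keywords in KEYWORDS.items():
        if words & keywords:
            node = category
            labels.add(node)
            # Walk up the tree: membership in a nested category implies its parents.
            while node in PARENT:
                node = PARENT[node]
                labels.add(node)
    return sorted(labels)


print(categorize("The striker scored a late goal."))
# ['football', 'sports']
print(categorize("The minister said housing prices rose after the election."))
# ['economy', 'politics']
```

The second call shows a text assigned to two categories at once; the first shows a nested category pulling in its parent.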
Text Summarization
The most relevant portions of a text are extracted as a summary. The task depends on the required length of the summary. Summarization is harder if the summary has to be specific to a certain query.
◼ E.g., news summarization
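As a rough illustration of extraction-based summarization, the sketch below scores each sentence by the average frequency of its words in the whole text and keeps the top-scoring sentences. The stopword list, the `query_weight` parameter, and the scoring formula are assumptions made for this example, not a method prescribed by the course.

```python
import re
from collections import Counter

# Tiny hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "on", "at", "it"}


def tokenize(text):
    """Lowercase the text and keep word tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z0-9']+", text.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=1, query=None, query_weight=2.0):
    """Return the n highest-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    query_terms = set(tokenize(query)) if query else set()

    def score(sentence):
        tokens = tokenize(sentence)
        if not tokens:
            return 0.0
        base = sum(freq[t] for t in tokens) / len(tokens)  # average word frequency
        overlap = sum(1 for t in tokens if t in query_terms)  # query-specific bonus
        return base + query_weight * overlap

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = "Housing prices rose sharply. The stock market jumped as prices rose. Weather was mild."
print(summarize(text, n_sentences=1))
# ['Housing prices rose sharply.']
print(summarize(text, n_sentences=1, query="weather"))
# ['Weather was mild.']
```

The `n_sentences` argument reflects the dependence on summary length, and the optional `query` shows how a query-specific summary can select different sentences from the same text.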
Text Indexing
As a precondition for document retrieval, texts are stored in an indexed database. Usually a text is indexed for all word forms or, after lemmatization, for all lemmas. Sometimes indexing is combined with categorization and summarization.
Text Retrieval
The texts that best match a given query or document are retrieved from a database. The candidate documents are ordered by their expected relevance. Indexing, categorization, summarization and retrieval are often subsumed under the term information retrieval.
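Indexing and ranked retrieval can be sketched together with an inverted index and a simple TF-IDF score. This is a minimal sketch under stated assumptions: `normalize` is a crude stand-in for lemmatization (it only strips a plural "s"), the three documents are made up, and the particular TF-IDF variant is one of many in use.

```python
import math
import re
from collections import Counter, defaultdict


def normalize(word):
    """Crude stand-in for lemmatization (an assumption): strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def terms(text):
    return [normalize(w) for w in re.findall(r"[a-z]+", text.lower())]


def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(terms(text)).items():
            index[term][doc_id] = count
    return index


def retrieve(query, index, n_docs):
    """Order candidate documents by the summed TF-IDF of the query terms."""
    scores = Counter()
    for term in set(terms(query)):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings)) + 1.0  # rarer terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return [doc_id for doc_id, _ in scores.most_common()]


# Hypothetical mini-collection, invented for illustration.
docs = {
    "d1": "Housing prices rose",
    "d2": "The film festival opens in Shanghai",
    "d3": "Film prices and film tickets",
}
index = build_index(docs)
print(retrieve("film prices", index, len(docs)))  # "d3" is ranked first
```

Because the index stores normalized terms, the query word "prices" matches documents containing either "price" or "prices", which is the effect indexing by lemma is meant to achieve.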
Information Extraction
Relevant pieces of information are discovered and marked for extraction. The extracted pieces can be: the topic; named entities such as company, place or person names; simple relations such as prices, destinations, functions, etc.; or complex relations describing accidents, company mergers or football matches.
◼ E.g., named entity recognition
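A very small rule-based sketch can make the idea concrete: a lookup list (gazetteer) marks named entities, and regular expressions pick out simple date and time expressions. The gazetteer entries, the patterns, and the two example sentences are illustrative assumptions; real extraction systems rely on far richer rules or on trained models.

```python
import re

# Hypothetical gazetteer mapping known names to entity types.
GAZETTEER = {"Einstein": "PERSON", "UN": "ORG", "Princeton": "LOC"}

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b(?:{MONTHS})\s+\d{{1,2}}\b")  # e.g. "May 27"
TIME_RE = re.compile(r"\b\d{1,2}:\d{2}\b")               # e.g. "8:30"


def extract(text):
    """Mark gazetteer entities and simple date/time expressions in the text."""
    entities = [(tok, GAZETTEER[tok]) for tok in re.findall(r"\w+", text) if tok in GAZETTEER]
    return {
        "entities": entities,
        "dates": DATE_RE.findall(text),
        "times": TIME_RE.findall(text),
    }


print(extract("Einstein met with UN officials in Princeton"))
# {'entities': [('Einstein', 'PERSON'), ('UN', 'ORG'), ('Princeton', 'LOC')], 'dates': [], 'times': []}
print(extract("You're invited to our dinner party, Friday May 27 at 8:30"))
# {'entities': [], 'dates': ['May 27'], 'times': ['8:30']}
```

A lookup list like this cannot recognize names it has never seen, which is one reason named entity recognition is usually treated as a learning problem.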
Data and Text Mining
Extracted pieces of information from several sources are combined in one database. Previously undetected relationships may be discovered.
Question Answering
Natural language queries are used to access information in a database. The database may be a base of structured data or a repository of digital texts in which certain parts have been marked as potential answers.
Report Generation
A report in natural language is produced that describes the essential contents of, or changes to, a database.
mathematics,
psychology.
Methods and Resources
Generic CS Methods
Programming languages, algorithms for generic data types, and
software engineering methods for structuring and organizing software
development and quality assurance.
Specialized Algorithms
Dedicated algorithms have been designed for parsing, generation and
translation, for morphological and syntactic processing with finite state
automata/transducers and many other tasks.
Nondiscrete Mathematical Methods
Statistical techniques have become especially successful in speech
processing, information retrieval, and the automatic acquisition of
language models. Other methods in this class are neural networks and
powerful techniques for optimization and search.
conceptual dependency.
Application areas
◼ Question answering: e.g., "Who is the first Taiwanese president?"
◼ Text categorization/routing: e.g., customer e-mails
◼ Text mining: e.g., "Find everything that interacts with BRCA1."
◼ Machine (assisted) translation
◼ Language teaching/learning: usage checking, grammar, spelling, etc.
◼ Spelling correction: is that just dictionary lookup?
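On the question of whether spelling correction is just dictionary lookup: lookup can detect that a word is unknown, but choosing a replacement requires ranking candidate words. The sketch below ranks them by Levenshtein edit distance. The word list and the `max_distance` threshold are hypothetical, chosen only for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical mini-dictionary, invented for illustration.
DICTIONARY = {"language", "processing", "translation", "summary"}


def correct(word, dictionary=DICTIONARY, max_distance=2):
    """Return the word itself if known, else the closest dictionary word within max_distance."""
    if word in dictionary:
        return word  # plain lookup only tells us the word exists
    best = min(dictionary, key=lambda w: (edit_distance(word, w), w))
    return best if edit_distance(word, best) <= max_distance else word


print(correct("langauge"))  # language
print(correct("summary"))   # summary
```

Even with ranking, this approach misses real-word errors such as "their" typed for "there", since both are in the dictionary; catching those requires looking at context.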
Project Idea
1. Select a topic for your project. The topic should be in line with the NLP applications we have identified.
2. Gather a sufficient number of references and review them in detail.
3. Produce your project proposal.
4. After getting approval, write a literature review with at least the following sections:
a) Approaches that exist to address the problem
b) The model of the selected approach
c) Detailed description of the selected approach
d) Detailed description of the methodology you are planning to follow
e) Detailed description of the tools you are planning to use