0% found this document useful (0 votes)
12 views84 pages

Information Retrieval Techniques Overview

Uploaded by

deekshitha1325
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
12 views84 pages

Information Retrieval Techniques Overview

Uploaded by

deekshitha1325
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

NLP -Module-V

Information Retrieval(IR)
NLP MODULE 5
INFORMATION RETRIEVAL
What is IR?
• Information retrieval (IR) deals with the organization, storage,
retrieval and evaluation of information relevant to a user’s query
written in a natural language.

• ‘An information retrieval system does not inform (i.e. change the
knowledge of) the user on the subject of her inquiry. It merely
informs on the existence (or non-existence) and whereabouts of
documents relating to her request.’
DESIGN FEATURES OF INFORMATION
RETRIEVAL SYSTEMS
1. INDEXING

2. ELIMINATING STOP WORDS

3. STEMMING

4. ZIPF’S LAW
Design of a basic IR system
Basic IR process
• Problems:
• Problem of representation of documents and queries.
• Matching the query representation with document representation.
• Documents represented using ‘Index terms’ or ‘keywords’ which
provides logical view of document.
1. Indexing
• Transforming text document to some logical representation.
• Ex: Inverted indexing
• Reducing set of representative keywords:
• Stop word elimination
• Stemming
• Zipf’s law
• ‘Term-weighting’ to indicate the significance of index term to
document.
• Most of the indexing techniques involve identifying good document
descriptors, such as keywords or terms, to describe information content of
the documents.
• A good descriptor is one that helps in describing the content of the
document and in discriminating the document from other documents in the
collection.
• Term can be a single word or it can be multi-word phrases.
Example:
‘Design Features of Information Retrieval systems’
can be represented by single word terms :
Design, Features, Information, Retrieval, systems
or by the set of multi-term words terms:
Design, Features, Information Retrieval, Information Retrieval systems
Luhn’s early assumption
• Luhn(1957) assumed that frequency of word occurrence in
an article gives meaningful identification of their content.
• Discrimination power for index terms is a function of the
rank order of their frequency of occurrence.
• Middle frequency terms have the highest discrimination
power.
2. Eliminating stop words
•Stop words are high frequency words, which have little semantic
weight and are thus unlikely to help with retrieval.
•Such words are commonly used in documents, regardless of topics;
and have no topical specificity.
Example :
articles (“a”, “an” “the”) and prepositions (e.g. “in”, “of”, “for”,
“at” etc.).
•Advantage
Eliminating these words can result in considerable reduction in
number of index terms without losing any significant information.
•Disadvantage
It can sometimes result in elimination of terms useful for searching,
for instance the stop word A in Vitamin A. Some phrases like “to be
or not to be” consist entirely of stop words.
SAMPLE STOP WORDS IN ENGLISH
3. Stemming
•Stemming normalizes morphological variants
•It removes affixes from the words to reduce them to some root form
e.g. the words compute, computing, computes and computer will all
be reduced to same word stem comput.
•Widely used stemming algorithm: developed by Porter (1980).
•Example:
The stemmed representation of
Design Features of Information Retrieval systems
will be
{design, featur, inform, retriev, system}
PROBLEM WITH STEMMING

• It throws away useful distinctions.


• In some cases, it is useful to conflate similar terms
increasing RECALL.
• In other cases, it may be harmful, reducing
PRECISION.
4. Zipf’s law
“Frequency of words multiplied by their ranks in a large corpus is
approximately constant”, i.e.
• If we compute the frequencies of the words in a corpus and arrange
them in descending order of frequency, then the product of
frequency a word and its rank is approximately equal to the
product of frequency and rank of another word.

• This indicates that the frequency of the word is inversely


proportional to its rank. The relationship is shown below:
High frequency words:
Common words with less
discriminatory power.

Low frequency words:


Less likely to be included
in query.

Medium frequency
words:
Content bearing words
that can be used for
indexing , extracted using
high and low frequency
threshold.
INFORMATION RETRIEVAL
MODELS
Information Retrieval Models
• An IR model is a pattern that defines several aspects of retrieval procedure, for
example,
• how documents and user’s queries are represented
• how a system retrieves relevant documents according to users’ queries
• how retrieved documents are ranked.
• An IR system consists of
• a model of documents
• a model of queries
• a matching function which compares queries to documents.
•The central objective of the model is to retrieve all documents relevant to the
query.
• IR models can be classified as:
• Classical models of IR
• Non-Classical models of IR
• Alternative models of IR
TYPES OF INFORMATION
RETRIEVAL MODELS
1.

CLASSICAL MODELS OF IR

2.

NON-CLASSICAL MODELS OF IR

3.

ALTERNATE MODELS OF IR
• Classical IR models: Based on mathematical knowledge that was
easily recognized and well understood
• Simple, efficient and easy to implement
Ex: Boolean, Vector and Probabilistic models .

• Non-classical IR models: Based on principles other than similarity,


probability, Boolean operations etc. on which classical retrieval
models are based on.
Ex: Information logic model, Situation theory model and Interaction
model.

• Alternative IR models: Enhancements of classical models making


use of specific techniques from other fields.
Ex: Cluster model, fuzzy model and latent semantic indexing (LSI)
models.
1. CLASSICAL INFORMATION RETRIEVAL
MODELS
CLASSICAL INFORMATION RETRIEVAL
MODELS

1. BOOLEAN MODEL

2. PROBABILISTIC MODEL

3. VECTOR SPACE MODEL

4. TERM WEIGHTING
1. Boolean Model
• The oldest of the three classical models. It is based on Boolean logic
and classical set theory. Represents documents as a set of keywords,
usually stored in an inverted file.
• Users are required to express their queries as a boolean expression
consisting of keywords connected with boolean logical operators
(AND, OR, NOT).
• Retrieval is performed based on whether or not document contains
the query terms.
Given a finite set
T = {t1, t2, ...,ti,...,tm}
of index terms, a finite set
D = {d1, d2, ...,dj,...,dn}
of documents and a boolean expression in a normal form - representing a query Q
as follows:

1. The set Ri of documents are obtained that contain or not term ti:
Ri = { },
Where
2. Set operations are used to retrieve documents in response to Q:
Example:
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT
Calpurnia?
Document collection: A collection of Shakespeare's work.

1 if play contains
• So we have a 0/1 vector for each term. word, 0 otherwise
• To answer query: take the vectors for Brutus, Caesar and Calpurnia
(complemented) bitwise AND.
• 110100 AND 110111 AND 101111 = 100100.
Advantages:
Simple, efficient, easy to implement, performs well in terms
of recall and precision if the query is well formulated.

Drawbacks:
1. Cannot retrieve documents that are only partly relevant to
user query.
2. Boolean system cannot rank the retrieved documents.
3. User need to formulate the query in pure Boolean
expression.
2. Probabilistic model
• Ranks the document based on the probability of their relevance to a given
query.
• Retrieval depends on whether probability of relevance of a document is higher
than that of non relevance or threshold value.
• Given a set of document D, a query ‘q’ and a cut off value ‘α’ , this model
calculates the probability of relevance and irrelevance of a document to the
query, then ranks the document.
• P( R | d ) is the probability of relevance of a document d, for the query q, P( I |
d ) is the probability of irrelevance. Then the set of the documents retrieved is:
• S={dj | P( R | dj ) >=P( I | dj )} , P( R | dj ) >α
ASSUMPTIONS OF PROBABILISTIC MODEL
• TERMS ARE INDEPENDENT while estimating probabilities.

shortcoming:
Inaccurate assumption
Some terms in a given domain, tend to co-occur.
Eg. The term “match-point” co-occurs with TENNIS rather
than CRICKET.

DRAWBACKS
To determine the threshold value for the initially retrieved set.
3. Vector space model
• Most well studied retrieval model.
• Represents documents and queries as vectors of
features representing terms that occur within them.
Each document is characterized by a numerical
vector.
• Vector represented in multidimensional space , in
which each dimension corresponds to a distinct term
in the corpus of documents.
• Vector space = all the keywords encountered
T= <t1, t2, t3, …, tm>
• Document
D = < d1, d2, d3, …, dn>
• Each document is represented by a column vector of weights
{w1j, w2j, w3j,…. wij........wmj }t
Where Wij is the weight of the term ti in document dj
Term space
w11, w12, w13……………… w1n
Document space

w21, w22, w23……………… w2n


w31, w32, w33……………… w3n
wm1, wm2, wm3……………wmn
Example:
• D1= Information retrieval is concerned with the organization, storage,
retrieval and evaluation of information relevant to user’s query.
• D2=A user having an information needs to formulate a request in the form
of query written in natural languages.
• D3= The retrieval system responds by retrieving the document that seems
relevant to the query.
• T={information, retrieval, query}
• Based on the frequency of the term vector will be as below.
t1 t2 t3 T1 t2 T3
After Normalization ->
d1 2 2 1 d1 0.67 0.67 0.33
d2 1 0 1 d2 0.71 0 0.71
d3 0 1 1 d3 0 0.71 0.71
4. Term weighting
Example:

3.296
Weighting schemes
A. Term frequency
B. Inverse document frequency
C. Document length
Simple automatic method for indexing
• Step 1: Tokenization:
This extracts individual term from the document, converts all the letters lower
case, and removes punctuation marks.
• Step 2: Stop word elimination:
This removes the words that appear more frequently in the document.
• Step 3: Stemming:
This reduces the remaining terms to their linguistic root, to obtain the index
terms.
• Step 4: Term weighting:
This assigns weight to terms according to their importance in the document or
collection of documents.
2. Non-classical models of IR
1. Information logic model

2. Situation theory model

3. Interaction model
Non-classical models of IR
• Based on principles such as Information logic model,
situation theory model and interaction model.
1. Information Logic Model: Based on logical imaging, a
measure of uncertainty is associated with this
inferences which is given by

“ Given any two sentences x and y , a measure of uncertainty of


y->x relating to a given data set is determined by the minimal
extent to which one has to add information to the data set in
order to establish truth of y->x”
2. Situation Theory model: Retrieval is considered as a flow of
information from one document to another. A structure called
Infon(τ) is used to describe the situation and to model
information flow. The polarity of an infon is either 0 or 1,
indicating if infon carries positive information or negative.
Ex: Adil is serving a dish
τ=<<Serving Adil,dish;1>>
Polarity depends on the support represented by s |= τ
S can be “ I saw Adil serving a dish”
• A document d is relevant to a query q,
if d |=q
3. Interaction model:
Documents are interconnected and query interacts
with the connected modules. Retrieval is considered as
a result of this interaction. Implemented using
artificial neural networks where both document and
query acts as a neuron in the neural network.
Alternative models of IR

1 •CLUSTER MODEL

2 •FUZZY MODEL
Alternative models of IR
• 1. Cluster Model:
Attempts to reduce the number of matches during retrieval based
on cluster hypothesis:
“Closely associated documents tend to be relevant to the same
clusters”
And hence instead of matching a query with individual documents
, it is matched with representatives of the cluster(class), and only
documents from a class whose representative is close to query, are
considered for individual match.
Clustering can also be applied on terms based on co-occurrence.
Cluster generation based on similarity
matrix
• LetD={ d1,d2,d3,…..dm} be set of documents.
• Let (eij)n,n be the similarity matrix, element Ei,j
denotes a similarity between document di and dj if
similarity measure exceed threshold value then are
grouped to form a cluster.
2. Fuzzy model
• Document is represented as a fuzzy set of terms [ti ,μ(ti )]
• Where μ is a member function that assigns to each term of the document a numeric
membership degree.
• Ex: d1={information, retrieval, query}
d2={retrieval, query, model}
d3={information, retrieval}
T={information, model, query, retrieval}
Fuzzy set for terms
f1={(d1,1/3),(d2,0),(d3,1/2)}
f2={(d1,0),(d2,1/3),(d3,0)}
f3={(d1,1/3),(d2,1/3),(d3,0)}
f4={(d1,1/3),(d2,1/3),(d3,1/2)}
If the query is t2 ∧ t4, then document d2 is returned.
Evaluation of the IR systems
Six criteria
1. Coverage of the collection
2. Time lag
3. Presentation format
4. User effort
5. Precision
6. Recall
Both precision and recall are most frequently applies in
evaluating IR system, related to effectiveness of IR system
ie the system ability to retrieve relevant documents in
response to a query.
1. Relevance
• Not possible to measure true relevance as it is subjective in
nature.
• Relevance test based on known relevance judgments based
on assessment by experts of that discipline.
• Degree of relevance: binary or continuous function
• Relevance frameworks: System, communication,
psychological, situational
2. Effectiveness
Measure of ability of the system to satisfy the user in
terms of relevance of documents retrieved.
• Relevance of the document retrieved
• Order of relevance
• Number of relevant documents returned
• Two most important measures
1. Precision
2. Recall
Precision and recall

Relevant Non relevant


Retrieved A∩B A∩B B
Not-retrieved A ∩B A ∩B B
A A
Trade off between precision and recall
• High value of both at the same time is desirable.
• But precision is high when recall is low and as recall increases,
precision decreases.
• The ideal case of perfect retrieval requires that all relevant
documents be retrieved before the first non relevant document is
retrieved.
Non-interpolated average precision
• Average of the precision at observed recall points. Observed recall
points corresponds to points where a relevant document is retrieved.
3. Interpolated precision
•Precision values are interpolated for a set of recall points.
•Recall levels are 0.0, 0.1, 0.2…….1.0. Precisions are calculated at
each of these level and then averaged to get a single value.
•Precision at a given recall level is the greatest known precision at
any recall level greater than or equal to this given level.
Ex: precision observed at recall points:
0.25 1.0
0.4 0.67
0.55 0.8
0.8 0.6
1.0 0.5
Lexical resources
Lexical resources
• ‘Knowing where the information is half the
information”
• Tools and lexical resources to work with NLP.
• WordNet
• FrameNet
• Stemmers
• Taggers
• Parsers
• Text Corpus
WORDNET
FRAMENET
Feature WordNet FrameNet
Lexical-semantic Frame semantics
Basis
relations (synsets) (conceptual structures)
Synsets (sets of Frames (situations with
Unit of Representation
synonyms) roles)
Sentence-level meaning
Focus Word-level meaning
and relations
Synonymy, hypernymy, Frame-to-frame relations
Relations
hyponymy, etc. (Inheritance, Causative)
Semantic role labeling,
Applications WSD, semantic similarity
deep semantic parsing
Commerce_buy (Buyer,
Example {car, auto, automobile}
Seller, Goods, Money)
WordNet applications
• Concept identifications in Natural language
• Word sense disambiguation
• Automatic Query Expansion
• Document structuring and categorization
• Document summarization
FRAMENET
• FrameNet is a large database of semantically annotated English
sentences.
• British national corpus
• Each word evokes a particular situation with particular participants.
• FrameNet captures situation through frames, and frame elements.
Stemmers
• Reducing inflected word to its
root word.
• Most commonly used stemmers
• Porter’s algorithm stemmers
• Lovins stemmers
• Paice/Husk stemmers
• Snowball stemmer for European
languages.
POS Taggers
POS Taggers
1. Stanford Log-linear POS tagger (97.24% accuracy)
2. Postagger
3. TnT tagger (Trigrams’n’ Tag)
4. Brill tagger
5. CLAWS (constituent likelihood automatic word-tagging system)
6. Tree-tagger
7. ACOPOST (A collection of POS taggers)
i. Maximum entropy tagger
ii. Trigram tagger
iii. Error-driven transformation-based tagger
Example-based tagger
Tagger Name Model Type Key Features Accuracy Developer
Maximum Rich feature sets,
Stanford
Entropy overlapping ~97% Stanford NLP
Log-Linear
(Log-linear) features
English-specific
A POS Tagger Generic (HMM /
corpora and ~95–97% —
for English Rule-based)
tagsets

Statistical, fast,
TnT Tagger HMM (Trigram) ~96% Thorsten Brants
Viterbi decoding

Transformation-B Human-readable
Brill Tagger ased rules, ~95% Eric Brill
(Rule-based) interpretable
Detailed English
Hybrid (Rule + Lancaster
CLAWS Tagger tagset, ~96–97%
Statistical) University
corpus-based
Multilingual,
Decision Tree +
TreeTagger supports ~96% Helmut Schmid
Probabilistic
lemmatization
RESEARCH CORPORA
Type Description Examples

British National Corpus (BNC),


Monolingual Corpora Texts in a single language
Brown Corpus

Texts in several languages (not


Multilingual Corpora Leipzig Corpora Collection
aligned)

Same text translated into multiple


Parallel Corpora Europarl Corpus, OPUS Corpus
languages

Comparable Corpora Similar domain texts across languages News Commentary Corpus

Penn Treebank (POS + syntax),


Annotated Corpora Contain linguistic tags/annotations PropBank, WordNet annotated
corpora

Transcriptions of speech with prosodic Switchboard Corpus,


Spoken Corpora
features CALLHOME

Medical Corpus, Legal Corpus,


Domain-Specific Corpora Focus on specialized fields
Social Media Corpus
Importance of Research Corpora in NLP
a) Training and Testing NLP Models
•Corpora provide data for training statistical and machine learning models such as:
• POS taggers
• Parsers
• Named Entity Recognizers
• Machine translation systems
b) Empirical Language Analysis
•Helps in studying linguistic patterns such as collocations, frequency, and word usage.
c) Development of Lexical Resources
•Corpora are the base for building WordNet, FrameNet, and bilingual dictionaries.
d) Evaluation of NLP Systems
•Standard corpora provide benchmark datasets for evaluating accuracy and performance (e.g., BLEU
score for translation).
e) Domain and Language Adaptation
•Specialized corpora support NLP system adaptation to specific domains like medicine, law, or social
media.
f) Building Statistical Models
•Corpora provide n-gram probabilities, syntactic structures, and semantic role information used in
statistical NLP.
g) Automatic Annotation and Machine Learning
•Annotated corpora enable supervised learning, helping algorithms learn from examples.
Challenges in Corpus Construction

• Data collection and cleaning

• Annotation consistency and cost

• Copyright and availability issues

• Language coverage (especially for low-resource languages)


Case study:

GOOGLE ASSISTANT,
ALEXA,
APPLE SIRI
Assistant Developer Launch Year Core Function

Personal assistant for


Apple Siri Apple Inc. 2011
iOS/macOS

Voice assistant for Echo


Amazon Alexa Amazon 2014
devices & smart home control

AI-powered assistant
Google Assistant Google 2016 integrated with Google
ecosystem
Process Technique Example
Query Tokenization, POS Tagging, Extracts entities like “Paris”,
Understanding NER “France”
Word Embeddings, Understands “capital” ≈ “main
Semantic Search
Knowledge Graph city”
Sorts results based on
Ranking Vector Space Model, LTR
relevance
Maintains continuity in
Context Handling Dialogue State Tracking
conversations

Customizes results (e.g.,


Personalization User Profiling, Query Logs
location-based weather)
Feature Apple Siri Amazon Alexa Google Assistant

Search Engine Used Bing + Apple Graph Amazon APIs + Skills Google Search

Core IR Model Context-based Intent-based Semantic-based

Knowledge graph + Web + Google Knowledge


Response Source AWS + External APIs
Web Graph

Personalization
High (device-level) Medium (account-level) Very High (search + context)
Level

Dialogue Continuity Moderate Limited Strong (context retention)

You might also like