Information Retrieval Models Explained

NOTE

Uploaded by

mutgatkekdeng
Mizan Tepi University

Tepi campus
Department of IT

Information Storage and Retrieval


ITec3081

Chapter Four
IR models
IR Models - Basic Concepts
• Word evidence:
IR systems usually adopt index terms to index and
retrieve documents.
Each document is represented by a set of
representative keywords or index terms (called a
Bag of Words).
• An index term is a document word useful for remembering
the document's main themes.
• Not all terms are equally useful for
representing the document contents:
less frequent terms allow identifying a narrower
set of documents.
• But no ordering information is attached to the Bag of Words
identified from the document collection.
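A minimal Python sketch of the bag-of-words idea. It assumes lowercasing and whitespace tokenization, and the function name `bag_of_words` is purely illustrative; none of these choices come from the course material.

```python
from collections import Counter

def bag_of_words(text):
    """Return a term -> count mapping; word order is discarded."""
    return Counter(text.lower().split())

doc_a = bag_of_words("retrieval of information")
doc_b = bag_of_words("information of retrieval")

# Both orderings produce the same bag, because no ordering is stored.
same_bag = doc_a == doc_b
```

Counts are kept here only for convenience; taking the keys alone gives the plain set of index terms.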
IR Models - Basic Concepts
• One central problem regarding IR systems is
the issue of predicting the degree of relevance
of documents for a given query
• Such a decision is usually dependent on a
ranking algorithm which attempts to
establish a simple ordering of the
documents retrieved
• Documents appearing at the top of this
ordering are considered to be more likely to
be relevant
• Thus ranking algorithms are at the core of IR
systems
• The IR models determine the predictions of
what is relevant and what is not, based on
Classes of Retrieval Models
1. Boolean models (set theoretic)
• Simple model based on set theory and Boolean algebra.
• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords, connected by AND,
OR, and NOT, including the use of brackets to indicate scope.
– [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
• The Boolean model imposes a binary criterion for deciding relevance.
• Terms are either present or absent. Thus,
$w_{ij} \in \{0, 1\}$
• $sim(q, d_j) = 1$ if the document satisfies the Boolean query,
$sim(q, d_j) = 0$ otherwise.
• Note that no weights between 0 and 1 are assigned;
the only values are 0 and 1.
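The Boolean query above can be sketched with Python set operations. The four documents and their keywords are made up for illustration, and `matching` and `sim` are hypothetical helper names, not part of any library.

```python
# Hypothetical toy collection: each document is a set of keywords.
docs = {
    "d1": {"rio", "brazil", "hotel"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "beach"},
}

def matching(term):
    """Set of document ids whose keyword set contains the term."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)

# [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
# AND is set intersection, OR is union, NOT is the complement.
result = (
    ((matching("rio") & matching("brazil"))
     | (matching("hilo") & matching("hawaii")))
    & matching("hotel")
    & (all_ids - matching("hilton"))
)

def sim(doc_id):
    """Binary similarity: 1 if the document satisfies the query, else 0."""
    return 1 if doc_id in result else 0
```

With this collection only d1 and d3 satisfy the query. Document d4 matches "Rio & Brazil" but gets the same score of 0 as a document that matches nothing, which illustrates the lack of partial matching.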
Advantages of the Boolean model
• Popular retrieval model because:
• Easy to understand for simple queries; clean
formalism.
• Boolean models can be extended to include
ranking.
• Reasonably efficient implementations are
possible for normal queries.
• Efficient for the computer.
• Results are predictable and relatively easy to
explain.
Drawbacks of the Boolean Model
• Retrieval is based on binary decision criteria with no notion of partial
matching.
• No ranking of the documents is provided (absence of a grading scale).
• The information need has to be translated into a Boolean expression, which
most users find awkward.
• The Boolean queries formulated by users are most often too simplistic.
• The model frequently returns either too few or too many documents in
response to a user query.
• Very rigid: AND means all; OR means any.
• All index terms have equal weight.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved, because all
matched documents will be returned.
• Difficult to rank output, because all matched documents logically satisfy
the query.
• Difficult to perform relevance feedback: if a document is identified
by the user as relevant or irrelevant, it is unclear how the query should be modified.
2. Vector space models (statistical/algebraic)
• Boolean matching with binary weights is too limiting.
• The vector model proposes a framework in which partial matching is
possible.
• This is accomplished by assigning non-binary weights to index terms
in queries and in documents.
• Term weights are used to compute a degree of similarity between a
query and each document.
• The documents are ranked in decreasing order of their degree of
similarity.
• Also called the ‘term vector model’ or ‘vector processing model’.
• Represents both documents and queries by term sets and compares
global similarities between queries and documents.
• Used in information filtering, information retrieval, indexing, and
relevancy rankings.
• First used in the SMART Information Retrieval System.
Issues for Vector Space Model
• How to determine important words in a document?
• Word sense?
• Word n-grams (and phrases, idioms, …) → terms
• How to determine the degree of importance of a term
within a document and within the entire collection?
• How to determine the degree of similarity between a
document and the query?
• In the case of the web, what is a collection and what
are the effects of links, formatting information, etc.?
Cont...
• Advantages:
• Term weighting improves the quality of the answer set,
since it helps to display relevant documents in
ranked order
• Partial matching allows retrieval of documents that
approximate the query conditions
• The cosine ranking formula sorts documents according to
their degree of similarity to the query

• Disadvantages:
• Assumes independence of index terms; it does not
relate one term to another
• Computationally expensive, since it measures the
similarity between each document and the query
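A minimal Python sketch of cosine ranking in the vector space model. It assumes raw term frequency as the non-binary weight (other weighting schemes are possible), and the documents and the function names `to_vector`, `cosine`, and `rank` are illustrative only.

```python
import math
from collections import Counter

def to_vector(text):
    """Sparse term vector; raw term frequency is the assumed weight."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return (doc_id, score) pairs in decreasing order of similarity."""
    q = to_vector(query)
    scores = [(doc_id, cosine(q, to_vector(text)))
              for doc_id, text in docs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Made-up collection for illustration.
docs = {
    "d1": "hotel in rio brazil",
    "d2": "hawaii beach guide",
    "d3": "rio carnival",
}
ranking = rank("rio hotel", docs)
```

Here d3 contains only one of the two query terms yet still receives a non-zero score and is ranked between d1 and d2, which is the partial matching the Boolean model cannot express. The loop over every document also shows where the computational cost comes from.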
3. Probabilistic models
• The probabilistic model captures the IR problem using a
probabilistic framework.
• Given a user query, there is an ideal answer set for this query.
• Given a description of this ideal answer set, we could retrieve the
relevant documents.
• Querying is seen as a specification of the properties of this ideal
answer set.
But what are these properties?
• An initial set of documents is retrieved somehow.
• The user inspects these documents looking for the relevant ones (in
practice, only the top 10-20 need to be inspected).
• The IR system uses this information to refine the description of
the ideal answer set.
• By repeating this process, it is expected that the description of
the ideal answer set will improve.
Probability Ranking Principle
• The relevance of a given document to a user's query
can be estimated by a probability score.
• A high probability prob(rel | di, q) means users are more likely
to get relevant information by reading
document di.
• A probabilistic retrieval model follows the probability
ranking principle.
• You have a collection of documents.
• A set of relevant documents needs to be returned for
queries issued by users.
• Intuitively, we want the “best” document to be first, the second
best second, etc.
• According to the probability ranking principle, documents
are ranked in decreasing order of probability of
relevance to the user's information need.
• If p(R|D) > p(NR|D), then D is relevant, where R denotes relevance and NR non-relevance.
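The ranking principle and the decision rule can be sketched in a few lines of Python. The probability estimates are made up for illustration, and the sketch assumes p(NR|D) = 1 - p(R|D), i.e. that each document is either relevant or non-relevant.

```python
# Made-up estimates of p(R|D) for three documents.
p_rel = {"d1": 0.35, "d2": 0.80, "d3": 0.55}

# Probability ranking principle: decreasing order of probability of relevance.
ranking = sorted(p_rel, key=p_rel.get, reverse=True)

# Decision rule: D is relevant if p(R|D) > p(NR|D), with p(NR|D) = 1 - p(R|D).
relevant = [doc_id for doc_id in ranking if p_rel[doc_id] > 1 - p_rel[doc_id]]
```

How the estimates themselves are obtained is the hard part of a probabilistic model; this sketch only shows what is done with them.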
Terms Existence in Relevant Document
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti

Document Relevance
For term ti                   | No. of relevant docs | No. of non-relevant docs | Total
No. of docs including term ti | r                    | n - r                    | n
No. of docs excluding term ti | R - r                | N - R - (n - r)          | N - n
Total                         | R                    | N - R                    | N

$$w_i = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}$$
Advantages and drawbacks of the probabilistic model
• Advantages of the probabilistic model over vector-space:
• Strong theoretical basis
• Since the base is probability theory, it is very well
understood
• Easy to extend
• Disadvantages:
• Models are often complicated
• No term frequency weighting
• Which is better: vector-space or probabilistic?
• Both are approximately as good as each other
• Depends on collection, query, and other factors
Questions, Ambiguities, Doubts, … ???
End of Chapter 4