Information Retrieval Models Explained

NOTE

Uploaded by

mutgatkekdeng

Mizan Tepi University

Tepi campus
Department of IT

Information Storage and Retrieval

ITec3081

Chapter Four
IR models
IR Models - Basic Concepts
• Word evidence:
IR systems usually adopt index terms to index and
retrieve documents
Each document is represented by a set of
representative keywords or index terms (called a
Bag of Words)
• An index term is a document word that helps recall
the document's main themes
• Not all terms are equally useful for
representing the document contents:
less frequent terms identify a narrower
set of documents
• No ordering information is attached to the Bag of Words
identified from the document collection.
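As a minimal sketch of the Bag of Words idea above, the following Python fragment reduces a text to term counts. The function name `bag_of_words` and the simple tokenization rule (lowercase, runs of letters and digits) are assumptions made for illustration; the slides do not prescribe a tokenizer.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map a text to term counts; the order of the words is discarded."""
    # Assumed tokenization: lowercase, then runs of letters or digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

# Two texts with the same words in a different order
# produce the same bag, because no ordering is stored.
a = bag_of_words("retrieval of information")
b = bag_of_words("information retrieval of")
print(a == b)
```

Because only the terms and their counts survive, two documents that differ only in word order become indistinguishable to a model built on this representation.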
IR Models - Basic Concepts
• One central problem regarding IR systems is
the issue of predicting the degree of relevance
of documents for a given query
• Such a decision is usually dependent on a
ranking algorithm which attempts to
establish a simple ordering of the
documents retrieved
• Documents appearing at the top of this
ordering are considered more likely to
be relevant
• Thus ranking algorithms are at the core of IR
systems
• The IR model determines the predictions of
what is relevant and what is not
Classes of Retrieval Models
1. Boolean models (set theoretic)
• The Boolean model is a simple model based on set theory and Boolean algebra.
• A document is represented as a set of keywords.
• Queries are Boolean expressions of keywords, connected by AND,
OR, and NOT, with brackets to indicate scope.
 – [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton]
• The Boolean model imposes a binary criterion for deciding relevance
• Terms are either present or absent. Thus,
$w_{ij} \in \{0, 1\}$
• $sim(q, d_j) = 1$ if the document satisfies the Boolean query,
$0$ otherwise
• Note that no weights between 0 and 1 are assigned;
the only values are 0 and 1
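The binary matching rule can be sketched in Python with sets. The four-document collection `DOCS` is made up for illustration, and `sim` hard-codes the example query about Rio, Hilo, hotels and Hilton; neither comes from the slides beyond the query itself.

```python
# Hypothetical toy collection: each document is a set of keywords.
DOCS = {
    "d1": {"rio", "brazil", "hotel"},
    "d2": {"hilo", "hawaii", "hotel", "hilton"},
    "d3": {"hilo", "hawaii", "hotel"},
    "d4": {"rio", "brazil", "beach"},
}

def sim(doc_terms):
    """Return 1 if the document satisfies
    [[[Rio & Brazil] | [Hilo & Hawaii]] & hotel & !Hilton], else 0."""
    place = {"rio", "brazil"} <= doc_terms or {"hilo", "hawaii"} <= doc_terms
    return int(place and "hotel" in doc_terms and "hilton" not in doc_terms)

# Every matching document gets the same score of 1, so the
# answer is an unordered set rather than a ranking.
matches = sorted(d for d, terms in DOCS.items() if sim(terms) == 1)
print(matches)
```

Document d2 is rejected outright because it mentions Hilton, and d4 because it lacks "hotel", even though both match most of the query. There is no way to express "almost matches" in this model.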
Advantages of the Boolean model
• Popular retrieval model because:
 – Easy to understand for simple queries; clean
formalism.
 – Boolean models can be extended to include
ranking.
 – Reasonably efficient implementations are
possible for normal queries.
 – Efficient for the computer
 – Results are predictable and relatively easy to
explain
Drawbacks of the Boolean Model
• Retrieval is based on binary decision criteria with no notion of partial
matching
• No ranking of the documents is provided (absence of a grading scale)
• The information need has to be translated into a Boolean expression, which
most users find awkward.
• The Boolean queries formulated by users are most often too simplistic
• The model frequently returns either too few or too many documents in
response to a user query
• Very rigid: AND means all; OR means any.
• All index terms have equal weight.
• Difficult to express complex user requests.
• Difficult to control the number of documents retrieved, because all
matched documents are returned.
• Difficult to rank output, because all matched documents logically satisfy
the query.
• Difficult to perform relevance feedback: if a document is identified
by the user as relevant or irrelevant, it is unclear how the query should be modified.
2. Vector space models (statistical/algebraic)
• Boolean matching with binary weights is too limiting.
• The vector model proposes a framework in which partial matching is
possible.
• This is accomplished by assigning non-binary weights to index terms
in queries and in documents.
• Term weights are used to compute a degree of similarity between a
query and each document.
• The documents are ranked in decreasing order of their degree of
similarity.
• Also called the ‘term vector model’ or ‘vector processing model’
• Represents both documents and queries by term sets and compares
global similarities between queries and documents
• Used in information filtering, information retrieval, indexing and
relevancy rankings
• First used in the SMART Information Retrieval System
Issues for Vector Space Model
• How to determine important words in a document?
 – Word sense?
 – Word n-grams (and phrases, idioms, …) → terms
• How to determine the degree of importance of a term
within a document and within the entire collection?
• How to determine the degree of similarity between a
document and the query?
• In the case of the web, what is a collection and what
are the effects of links, formatting information, etc.?
Cont...
• Advantages:
• Term-weighting improves quality of the answer set
since it helps to display relevant documents in
ranked order
• Partial matching allows retrieval of documents that
approximate the query conditions
• Cosine ranking formula sorts documents according to
degree of similarity to the query

• Disadvantages:
• Assumes independence of index terms: it does not
relate one term to another
• Computationally expensive since it measures the
similarity between each document and the query
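The ranking step of the vector model can be sketched as follows. This is an illustration under stated assumptions, not the weighting scheme of any particular system: raw term frequency is assumed as the non-binary weight, whitespace splitting as the tokenizer, and the three documents are made up. Similarity is the cosine of the angle between the query vector and each document vector.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term vector; the weight is the raw term frequency (an assumption)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical three-document collection.
docs = {
    "d1": "information retrieval models rank documents",
    "d2": "boolean retrieval uses set theory",
    "d3": "cooking recipes for dinner",
}
query = tf_vector("retrieval models")

# Rank documents in decreasing order of similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(query, tf_vector(docs[d])), reverse=True)
print(ranking)
```

Document d2 contains only one of the two query terms, yet it still receives a non-zero score and is ranked below d1. This is the partial matching that the Boolean model cannot express. The loop over every document also shows where the computational cost comes from.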
3. Probabilistic models
• The probabilistic model captures the IR problem using a
probabilistic framework
• Given a user query, there is an ideal answer set for this query
• Given a description of this ideal answer set, we could retrieve the
relevant documents
• Querying is seen as a specification of the properties of this ideal
answer set
But what are these properties?
• An initial set of documents is retrieved somehow.
• The user inspects these docs looking for the relevant ones (in
practice, only the top 10-20 need to be inspected)
• The IR system uses this information to refine the description of
the ideal answer set
• By repeating this process, it is expected that the description of
the ideal answer set will improve.
Probability Ranking Principle
• The relevance of a given document to a user's query
can be estimated by a probability score
• A high probability prob(rel | di, q) means the user is more likely
to get relevant information by reading
document di.
• A probabilistic retrieval model follows the probability
ranking principle
 – You have a collection of documents
 – A set of relevant documents needs to be returned for
queries issued by users
 – Intuitively, we want the “best” document to be first, the second
best second, etc.
• According to the probability ranking principle, documents
are ranked in decreasing order of probability of
relevance to the user's information need
• If p(R|D) > p(NR|D), then D is relevant
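The ranking and decision rule can be illustrated with a few lines of Python. The probabilities in `p_rel` are invented for the example; a real system would have to estimate p(R|D) from the collection and from relevance judgments. The sketch also assumes p(NR|D) = 1 - p(R|D).

```python
# Hypothetical estimates of p(R|D) for four documents.
p_rel = {"d1": 0.35, "d2": 0.80, "d3": 0.55, "d4": 0.10}

# Probability ranking principle: decreasing order of p(R|D).
ranking = sorted(p_rel, key=p_rel.get, reverse=True)

# Decision rule: D is judged relevant when p(R|D) > p(NR|D),
# with p(NR|D) taken as 1 - p(R|D).
judged_relevant = [d for d in ranking if p_rel[d] > 1 - p_rel[d]]

print(ranking)
print(judged_relevant)
```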
Term Existence in Relevant Documents
N = the total number of documents in the collection
n = the total number of documents that contain term ti
R = the total number of relevant documents retrieved
r = the total number of relevant documents retrieved that contain term ti
Document Relevance
For term ti | No. of relevant docs | No. of non-relevant docs | Total
No. of docs including term ti | r | n-r | n
No. of docs excluding term ti | R-r | N-R-(n-r) | N-n
Total | R | N-R | N
$$w_i = \log \frac{(r + 0.5)(N - n - R + r + 0.5)}{(n - r + 0.5)(R - r + 0.5)}$$
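The term weight can be computed directly from the four counts N, n, R and r. The sketch below is a hedged illustration: the function name `term_weight` is an assumption, the natural logarithm is assumed because the slide does not state the base, and the counts in the usage lines are invented.

```python
import math

def term_weight(N, n, R, r):
    """Weight w_i of term ti from the contingency counts.

    N: documents in the collection      n: documents containing ti
    R: relevant documents retrieved     r: relevant documents containing ti
    The natural logarithm is an assumption; the source only says "log".
    """
    # Every cell of the contingency table must be non-negative.
    if min(r, n - r, R - r, N - n - R + r) < 0:
        raise ValueError("inconsistent counts")
    numerator = (r + 0.5) * (N - n - R + r + 0.5)
    denominator = (n - r + 0.5) * (R - r + 0.5)
    return math.log(numerator / denominator)

# Hypothetical counts: 100 documents, 20 contain the term, 10 are relevant.
w_strong = term_weight(N=100, n=20, R=10, r=8)  # term appears in most relevant docs
w_absent = term_weight(N=100, n=20, R=10, r=0)  # term appears in no relevant doc
print(w_strong, w_absent)
```

A term concentrated in the relevant documents gets a positive weight, while a term that appears only in non-relevant documents gets a negative one. The 0.5 added to each cell keeps the ratio defined when a cell is zero.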
Advantages and drawbacks of the probabilistic model
• Advantages of the probabilistic model over vector-space
 – Strong theoretical basis
 – Since the base is probability theory, it is very well
understood
 – Easy to extend
• Disadvantages
 – Models are often complicated
 – No term frequency weighting
• Which is better: vector-space or probabilistic?
 – Both are approximately as good as each other
 – Depends on collection, query, and other factors
Questions, Ambiguities, Doubts, … ???
End of Chapter 4
