Assignment No.
Problem Statement:
Implement a program to calculate precision and recall for a sample input (answer set A, query q1,
and the set Rq1 of documents relevant to query q1).
Objectives:
1. To understand precision and recall in information retrieval
2. To study indexing structures for information retrieval.
Outcomes:
At the end of the assignment the students should have:
1. Understood precision and recall in information retrieval.
2. Understood use of indexing in fast retrieval.
Theory:
Precision and Recall in Information Retrieval
Information retrieval systems are commonly evaluated with two metrics: precision and recall. When a
user searches for information on a topic, the documents in the database fall into 4 categories with
respect to that query:
1. Relevant and Retrieved
2. Relevant and Not Retrieved
3. Non-Relevant and Retrieved
4. Non-Relevant and Not Retrieved
Relevant items are those documents that help the user answer their question. Non-relevant items are
items that do not provide useful information. Each item is either retrieved or not retrieved by the
user’s query.
Precision is defined as the ratio of the number of relevant and retrieved documents (items retrieved
that are actually useful to the user and match the search need) to the total number of documents
retrieved by the query. Recall is defined as the ratio of the number of relevant and retrieved
documents to the total number of relevant documents in the database.
Precision measures one aspect of the information retrieval overhead a user incurs for a particular
search. If a search has 85 percent precision, then 15 percent (100 - 85) of the user's effort is
overhead spent reviewing non-relevant items. Recall measures the extent to which a system processing a
particular query retrieves the relevant items the user is interested in seeing. Recall is a very
useful concept, but its denominator (the total number of relevant items in the database) is generally
unknown in operational systems, so recall cannot be calculated directly. If the total set of relevant
items in the database is made known to the system, recall can be calculated.
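The four categories and the two ratios above can be sketched in C++ as follows. This is only an illustrative sketch: the names `Counts`, `classify`, `precision` and `recall` are hypothetical, documents are assumed to be identified by unique string IDs, and returning 0 when a denominator is zero is a convention chosen here, not something the theory prescribes.

```cpp
#include <cassert>
#include <set>
#include <string>

// Number of documents in each of the four categories for one query.
struct Counts {
    int relevantRetrieved = 0;        // category 1
    int relevantNotRetrieved = 0;     // category 2
    int nonRelevantRetrieved = 0;     // category 3
    int nonRelevantNotRetrieved = 0;  // category 4
};

// Place every document of the collection into exactly one category.
Counts classify(const std::set<std::string>& collection,
                const std::set<std::string>& relevant,
                const std::set<std::string>& retrieved) {
    Counts c;
    for (const std::string& doc : collection) {
        const bool isRelevant = relevant.count(doc) > 0;
        const bool isRetrieved = retrieved.count(doc) > 0;
        if (isRelevant && isRetrieved) ++c.relevantRetrieved;
        else if (isRelevant) ++c.relevantNotRetrieved;
        else if (isRetrieved) ++c.nonRelevantRetrieved;
        else ++c.nonRelevantNotRetrieved;
    }
    // Sanity check: the four categories partition the collection.
    assert(c.relevantRetrieved + c.relevantNotRetrieved +
           c.nonRelevantRetrieved + c.nonRelevantNotRetrieved ==
           static_cast<int>(collection.size()));
    return c;
}

// Relevant and retrieved / all retrieved (0 if nothing was retrieved).
double precision(const Counts& c) {
    const int retrieved = c.relevantRetrieved + c.nonRelevantRetrieved;
    return retrieved == 0 ? 0.0
                          : static_cast<double>(c.relevantRetrieved) / retrieved;
}

// Relevant and retrieved / all relevant (0 if nothing is relevant).
double recall(const Counts& c) {
    const int relevant = c.relevantRetrieved + c.relevantNotRetrieved;
    return relevant == 0 ? 0.0
                         : static_cast<double>(c.relevantRetrieved) / relevant;
}
```

For a made-up collection of 10 documents with 4 relevant documents and 3 retrieved documents, 2 of which are relevant, this gives a precision of 2/3 and a recall of 2/4.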
Precision/recall trade-off
You can increase recall by returning more docs: recall is a non-decreasing function of the number of
docs retrieved, so a system that returns all docs has 100% recall, usually at very low precision. The
converse is also usually true: it’s easy to get high precision at very low recall.
Consider an Information retrieval (IR) system returning relevant documents
Fig 1: IR system returning relevant documents
Precision and Recall explanation:
Consider,
I: an information request
R: the set of relevant documents for I
A: the answer set for I, generated by an IR system
R ∩ A: the intersection of the sets R and A
|A|: the number of documents in the set A
|R|: the number of documents in the set R
|Ra|: the number of documents in the intersection of the sets R and A
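With this notation, and writing |R| for the number of documents in the set R, the definitions given earlier in the theory can be written as the formulas below. The worked numbers are hypothetical and serve only to illustrate the arithmetic; `\text` assumes the amsmath package.

```latex
\[
\text{Precision} = \frac{|R_a|}{|A|}, \qquad
\text{Recall} = \frac{|R_a|}{|R|}
\]

% Hypothetical example: |R| = 10, |A| = 15, |R_a| = 5
\[
\text{Precision} = \frac{5}{15} \approx 0.33, \qquad
\text{Recall} = \frac{5}{10} = 0.5
\]
```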
The goal is to achieve high precision and high recall. The definition of precision and recall assumes
that all docs in the set A have been examined. However, the user is not usually presented with all docs
in the answer set A at once. The user sees a ranked list of documents and examines them starting from
the top. Thus, precision and recall vary as the user proceeds with the examination of the set A, and it
is most appropriate to plot a curve of precision versus recall.
If we proceed with our examination of the ranking generated, we can plot a curve of precision versus
recall as follows:
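The points of such a curve can be computed by walking down the ranking and recording precision and recall each time a relevant document is seen. The sketch below is one way to do this, under stated assumptions: the function name `precisionRecallCurve` is hypothetical, the ranking contains no duplicate documents, and the set of relevant documents is known and non-empty.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Returns one (recall, precision) point for each relevant document
// encountered while examining the ranking from the top.
std::vector<std::pair<double, double>> precisionRecallCurve(
        const std::vector<std::string>& ranking,
        const std::set<std::string>& relevant) {
    assert(!relevant.empty());  // recall is undefined without relevant documents
    std::vector<std::pair<double, double>> points;
    int hits = 0;  // relevant documents seen so far
    for (std::size_t i = 0; i < ranking.size(); ++i) {
        if (relevant.count(ranking[i]) == 0) continue;  // non-relevant: no new point
        ++hits;
        const double recall = static_cast<double>(hits) / relevant.size();
        const double precision = static_cast<double>(hits) / (i + 1);  // i + 1 docs examined
        points.emplace_back(recall, precision);
    }
    return points;
}
```

For a made-up ranking d3, d9, d1, d7, d2 with relevant set {d1, d2, d3, d4}, the points are (0.25, 1.0), (0.5, 2/3) and (0.75, 0.6): recall never decreases as more of the ranking is examined, while precision can fall.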
Precision and recall have been extensively used to evaluate the retrieval performance of IR systems
and algorithms. However, a more careful reflection reveals problems with these two measures. First,
the proper estimation of maximum recall for a query requires detailed knowledge of all the documents
in the collection. Second, in many situations the use of a single measure could be more appropriate.
Third, recall and precision measure the effectiveness over a set of queries processed in batch mode.
Fourth, for systems which require a weak ordering, recall and precision might be inadequate.
Sample code in C++
• Code
• Output
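A minimal sketch of such a program is given below; it is one possible approach, not a reference solution. It assumes the answer set A and the relevant set Rq1 are given as non-empty sets of document IDs; the names `Result`, `evaluate` and `report` and the sample data in the usage note are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <sstream>
#include <string>
#include <vector>

struct Result {
    std::size_t relevantRetrieved;  // |Ra| = |R ∩ A|
    double precision;               // |Ra| / |A|
    double recall;                  // |Ra| / |R|
};

// Compute precision and recall for answer set A and relevant set R.
Result evaluate(const std::set<std::string>& answerSet,
                const std::set<std::string>& relevantSet) {
    assert(!answerSet.empty() && !relevantSet.empty());  // avoid division by zero
    std::vector<std::string> common;
    // std::set iterates in sorted order, as std::set_intersection requires.
    std::set_intersection(answerSet.begin(), answerSet.end(),
                          relevantSet.begin(), relevantSet.end(),
                          std::back_inserter(common));
    Result r;
    r.relevantRetrieved = common.size();
    r.precision = static_cast<double>(common.size()) / answerSet.size();
    r.recall = static_cast<double>(common.size()) / relevantSet.size();
    return r;
}

// Format the result as text, e.g. for printing with std::cout.
std::string report(const std::string& query,
                   const std::set<std::string>& answerSet,
                   const std::set<std::string>& relevantSet) {
    const Result r = evaluate(answerSet, relevantSet);
    std::ostringstream out;
    out << "Query: " << query << '\n'
        << "|A| = " << answerSet.size() << ", |R| = " << relevantSet.size()
        << ", |Ra| = " << r.relevantRetrieved << '\n'
        << "Precision = " << r.precision << '\n'
        << "Recall = " << r.recall << '\n';
    return out.str();
}
```

With the made-up input A = {d1, d2, d3, d4, d5} and Rq1 = {d2, d4, d6, d8}, streaming `report("q1", A, Rq1)` to `std::cout` from `main` would print |A| = 5, |R| = 4, |Ra| = 2, a precision of 0.4 and a recall of 0.5.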
Conclusion: A program was implemented and executed to calculate precision and recall for a sample
input, given the relevant documents Rq1 for query q1.
A. Write short answers to the following questions:
1. What are precision and recall in IR systems?
2. How are the recall and precision measures defined?
B. Viva Questions:
1. What is the relevance of a document?
2. What metrics are used to measure information systems?
3. How are precision and recall calculated for information systems?
4. What are the problems with these two measures?