Retrieval Effectiveness
• Evaluation of IR systems
• Relevance judgement
• Performance measures
• Recall, Precision
• Single-valued measures
• User centred measures
Why System Evaluation?
• Any systems needs validation and verification
–Check whether the system is right or not
–Check whether it is the right system or not
• It provides the ability to measure the difference between
IR systems
–How well do our search engines work?
–Is system A better than B? Under what conditions?
• Evaluation drives what to study
–Identify techniques that work well and do not work
–There are many retrieval models/algorithms. which one is the
best?
–What is the best component for:
• Index term selection (tokenization, stop-word removal,
stemming, normalization…)
• Term weighting (TF, IDF, TF*IDF, P(R|D)…)
• Similarity measures (cosine, Euclidean, string editing…)
2
Evaluation Criteria
What are the main evaluation measures to check the
performance of an IR system?
• Efficiency
– Time and space complexity
❑ Speed in terms of retrieval time, indexing time, query
processing time
❑ The space taken by corpus vs. index file
• Index size: determine Index/corpus size ratio
• Is there a need for compression
• Effectiveness
–How is a system capable of retrieving relevant documents
from the collection: precision & recall?
–Is system X better than other systems?
–User satisfaction: How “good” are the documents that are
returned as a response to user query?
–Relevance of results to meet information need of users
Types of Evaluation Strategies
•User-centered evaluation
– Given several users, and at least two retrieval
systems
• Have each user try the same task on both systems
• Measure which system works the “best” for users
information need
• How to measure users satisfaction?
•System-centered evaluation
– Given documents, queries, and relevance
judgments
• Try several variations of the system
• Measure which system returns the “best” hit list
The Notion of Relevance Judgment
• Relevance is the measure of a correspondence
existing between a document and query.
–Construct document - query matrix as determined by:
(i) the user who posed the retrieval problem;
(ii) an external judge;
(iii) information specialist
–Is the relevance judgment made by users and external
person the same ?
• Relevance judgment is usually:
– Subjective: depends upon a specific user’s judgment.
– Situational: relates to user’s current needs.
– Cognitive: depends on human perception and behavior.
– Dynamic: changes over time.
Measuring Retrieval Effectiveness
Relevant Irrelevant
Metrics often “Type one
used to evaluate
effectiveness of
the system
Retrieved
A B error”
Not
retrieved C D
“Type two error”
• Retrieval of documents may result in:
–False positive (Errors of omission): some irrelevant
documents may be retrieved by the system as relevant.
–False negative (False drop or Errors of commission):
some relevant documents may not be retrieved by the system
as irrelevant.
–For many applications a good index should not permit any
false drops, but may permit a few false positives.
Measuring Retrieval Effectiveness
Relevant Not
relevant Collection size = A+B+C+D
Retrieved A B Relevant = A+C
Retrieved = A+B
Not retrieved C D
| {Relevant} {Retrieved} |
Re call =
| {Relevant} |
Relevant +
Relevant
Retrieved Retrieved
| {Relevant} {Retrieved} |
Pr ecision =
| {Retrieved} |
Irrelevant + Not Retrieved
• When is precision important? When is recall important?
Example
Assume that there are a total of 10 relevant document
Ranking Relevance Recall Precision
1. Doc. 50 R 0.10 1.00
2. Doc. 34 NR 0.10 0.50
3. Doc. 45 R 0.20 0.67
4. Doc. 8 NR 0.20 0.50
5. Doc. 23 NR 0.20 0.40
6. Doc. 16 NR 0.20 0.33
7. Doc. 63 R 0.30 0.43
8. Doc 119 R 0.40 0.50
9. Doc 21 NR 0.40 0.44
10. Doc 80 R 0.50 0.50
Graphing Precision and Recall
• Plot each (recall, precision) point on a graph
• Recall is a non-decreasing function of the number of
documents retrieved,
–Precision usually decreases (in a good system)
• Precision/Recall tradeoff
– Can increase recall by retrieving many documents (down to
a low level of relevance ranking),
•but many irrelevant documents would be fetched, reducing precision
– Can get high recall (but low precision) by retrieving all
documents for all queries
1 The ideal
Precision
Returns relevant
documents but
misses many Returns most relevant
useful ones too documents but includes
0 lots of junk
Recall 1
Need for Interpolation
•Two issues:
–How do you compare performance across
queries?
–Is the sawtooth shape intuitive of what’s going on?
1
0.8
0.6
Precision
0.4
0.2
0
0 0.2 0.4 0.6 0.8 1
Recall
Solution: Interpolation!
Interpolate a precision value for each standard recall level
Interpolation
• It is a general form of precision/recall calculation
• Precision change w.r.t. Recall (not a fixed point)
– It is an empirical fact that on average as recall increases,
precision decreases
• Interpolate precision at 11 standard recall levels:
– rj {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0},
where j = 0 …. 10
• The interpolated precision at the j-th standard recall
level is the maximum known precision at any recall
level between the jth and (j + 1)th level:
P(rj ) = max P(r )
r j r r j +1
Result of Interpolation
0.8
0.6
Precision
0.4
0.2
0
0 0.1 0.2 0.3 0.4 0.5
Recall
Exercise
• Let total number of relevant documents = 6, compute recall and
precision for each cut off point n:
n doc # relevant Recall Precision
1 588 x 0.167 1
2 589 x 0.333 1
3 576
4 590 x 0.5 0.75
5 986
6 592 x 0.667 0.667
7 984
8 988
9 578
10 985
11 103
12 591
13 772 x 0.833 0.38
14 990
Missing one relevant document. Never reach 100% recall 13
Interpolating a Recall/Precision
Curve: Exercise
Precision
1.0
0.8
0.6
0.4
0.2
0.2 0.4 0.6 0.8 1.0
Recall
14
Computing Recall/Precision Points:
Exercise 2
n doc # relevant
Let total # of relevant docs = 6
1 588 x
Check each new recall point:
2 576
3 589 x
R=1/6=0.167;
4 342
P=1/1=1
5 590 x
R=2/6=0.333;
6 717 P=2/3=0.66
7 984 R=3/6=0.5;
7
8 772 x P=3/5=0.6
9 321 x R=4/6=0.667;
10 498 P=4/8=0.5
11 113 R=5/6=0.833;
12 628 P=5/9=0.556
13 772
14 592 x R=6/6=1.0;
15 p=6/14=0.42
9
Interpolating a Recall/Precision Curve:
Exercise 2
Precision
1.0
0.8
0.6
0.4
0.2
0.2 0.4 0.6 0.8 1.0
Recall
16
Interpolating across queries
• For each query, calculate precision at 11 standard
recall levels
• Compute average precision at each standard recall
level across all queries.
• Plot average precision/recall curves to evaluate
overall system performance on a document/query
corpus.
• Average precision favors systems which produce
relevant documents high in rankings
Single-valued measures
• Single value measures: calculate a single value for
each query to evaluate performance
– Mean Average Precision at seen relevant
documents
• Typically average performance over a large set of
queries.
– R-Precision
• Precision at Rth relevant documents
MAP (Mean Average Precision)
• Computing mean average for more than one query
MAP = ( )
1 1 j
n Qi | Ri | D j Ri rij
– rij = rank of the jth relevant document for Qi
– |Ri| = number of relevant documents for Qi
– n = number of test queries
• E.g. Assume there are 3 relevant documents for Query 1 and 2
for query 2. Calculate MAP?
Relevant Docs. retrieved Query 1 Query 2
1st rel. doc. 1 4
2nd rel. doc. 5 8
3rd rel. doc. 10 -
1 1 1 2 3 1 1 2
MAP = [ ( + + ) + ( + )]
2 3 1 5 10 2 4 8
R- Precision
• Precision at the R-th position in the ranking of results
for a query, where R is the total number of relevant
documents.
– Calculate precision after R documents are seen
– Can be averaged over all queries
n doc # relevant
1 588 x
2 589 x R = # of relevant docs = 6
3 576
4 590 x
5 986
6 592 x
7 984
R-Precision = 4/6 = 0.67
8 988
9 578
10 985
11 103
12 591
13 772 x
20
14 990
F-Measure
• One measure of performance that takes into
account both recall and precision.
• Harmonic mean of recall and precision:
2 PR 2
F= = 1 1
P + R R+P
• Compared to arithmetic mean, both need to be high
for harmonic mean to be high.
• What if no relevant documents exist?
21
E Measure
• Associated with Van Rijsbergen
• Allows user to specify importance of recall and
precision
• It is parameterized F Measure. A variant of F
measure that allows weighting emphasis on precision
over recall:
(1 + ) PR (1 + )
2 2
E= = 2 1
P+R
2
+
R P
• Value of controls trade-off:
– = 1: Equal weight for precision and recall (E=F).
– > 1: Weight recall more. It emphasizes recall.
– < 1: Weight precision more. It emphasizes precision.
22
Problems with both precision and recall
• Number of irrelevant documents in the collection is not taken
into account.
• Recall is undefined when there is no relevant document in
the collection.
• Precision is undefined when no document is retrieved.
Other measures
• Noise = retrieved irrelevant docs / retrieved docs
• Silence/Miss = non-retrieved relevant docs / relevant docs
– Noise = 1 – Precision; Silence = 1 – Recall
| {Relevant} {NotRetrieved} |
Miss =
| {Relevant} |
| {Retrieved} {NotRelevant} |
Fallout =
| {NotRelevant} | 23
Programming Assignment (Due date: ____)
• Design an IR system for one of the local language following the
principle discussed in class.
– Form a group having up to three members
1. Construct inverted file (vocabulary file & posting file)
– Taking N document corpus, generate content-bearing index terms and
organize them using inverted file indexing structure, include frequencies
(TF, DF, CF) & position/location information of terms in each document.
[Link] Vector space retrieval model
– Implement a Vector space model that retrieve relevant documents in
ranking order
3. Test your system using five queries (with three to six words)
and report its performance
• Required: write a publishable report that have abstract (½ page),
introduction, problem statement & objective (1 page), literature
review (2 pages), methods used & architecture of your system (1
page) & experimentation (test results & findings) (2-3 pages),
concluding remarks with one basic recommendation (1 page) and
References (1 page).