Answer Sheet

The document provides detailed answers to questions related to Information Retrieval (IR) for the CS622 course, covering topics such as the definition of IR, its purpose, historical development, components of IR systems, and the impact of the web on IR. It also discusses various retrieval models, preprocessing techniques, and the role of open-source software in IR. Additionally, it highlights challenges faced by IR systems and the differences between classical IR and web search.

Uploaded by

sevensevs98
Copyright
© All Rights Reserved

Based on the "Information Extraction & Retrieval Question Bank" for Subject Code CS622,
here are the answers for the Unit I Part-A questions:


1.​ What is Information Retrieval?​
Information Retrieval (IR) is the process of finding and obtaining information system
resources (typically documents) that are relevant to an information need from a
collection of those resources. It focuses on the representation, storage, organization
of, and access to information items.
2.​ Specify the purpose/role of an IR system.​
The primary purpose is to help users find information that is relevant to their specific
queries within a large collection of unstructured or semi-structured data. It acts as an
intermediary, minimizing the user's effort while maximizing the accuracy of retrieved
results.
3.​ Outline the impact of the web on Information Retrieval.​
The web shifted IR from a specialized field to a daily necessity. It introduced
challenges of scale (billions of pages), data diversity (multimedia), and the need for
link-based ranking (like PageRank). It also led to the development of crawlers to
discover dynamic content.
4.​ What are the two kinds of Information Retrieval systems?​
The two main kinds are Classical IR Systems (used for static, controlled document
collections like library archives) and Web IR Systems (designed for the massive,
dynamic, and linked environment of the internet).
5.​ What are the issues/major challenges in Information Retrieval systems?​
Major challenges include scalability (handling growing data volumes), relevance
ranking, handling ambiguous natural language queries, filtering spam/low-quality
content, and maintaining real-time updates for dynamic information.
6.​ List some famous Open source Search engines.​
Famous open-source engines include Apache Lucene, Apache Solr, Elasticsearch,
and Meilisearch.
7.​ What are the components of any search engine?​
The core components include a Crawler (to gather data), an Indexer (to organize
data into searchable structures), a Query Processor (to interpret user input), and a
Ranking Function (to order results by relevance).
8.​ What are the performance measures for search engine?​
The primary measures are Precision (fraction of retrieved results that are relevant),
Recall (fraction of all relevant results in the database that were successfully
retrieved), and the F-Measure (the harmonic mean of precision and recall).
9.​ Give an example of Information Retrieval process.​
An example is a user searching for "symptoms of common cold" on a search engine.
The engine parses the query, searches its index for documents containing these
keywords, applies a ranking algorithm to find medical authorities, and displays a list
of websites to the user.
10.​What are the major sub systems in Information Retrieval process?​
The process involves the Ad-hoc Retrieval subsystem (answering one-time
queries), the Indexing subsystem (creating searchable files), and the User
Interface/Feedback subsystem (capturing user queries and refining results).
11.​Draw the broad outline of an Information Retrieval System.​
A broad outline typically shows the flow from the Document Collection → Text
Operations → Indexing → Matcher ← Query Formulation ← User Information Need.
12.​Draw the Query Model for Information Retrieval.​
The query model represents how a user's information need is transformed into a
system-readable query. It usually involves query expansion, term weighting (like
TF-IDF), and Boolean or Vector-based logical structures.
13. Define Zipf’s law.
Zipf's law states that in a large corpus of natural language, the frequency of a word
is approximately inversely proportional to its rank in the frequency table. For example,
the most frequent word occurs about twice as often as the second most frequent word
and about three times as often as the third.
14.​What is open source software?​
Open source software is software with source code that is made available for anyone
to inspect, modify, and distribute. It promotes collaborative development and
transparency.
15.​What is a search engine?​
A search engine is a software system designed to carry out web searches, which
means to search the World Wide Web in a systematic way for particular information
specified in a textual web search query.
16.​Explain difference between data retrieval and information retrieval.​
Data Retrieval (e.g., SQL) focuses on finding exact matches in structured data; it is
deterministic (all-or-nothing). Information Retrieval deals with unstructured text and
finds relevant documents based on partial matches; it is probabilistic and focuses on
relevance.
17.​List the advantage of open source.​
Advantages include lower costs (no licensing fees), increased security through
community audits, flexibility to customize the code, and no vendor lock-in.
18.​Draw the architecture diagram for Web Search using IR.​
The architecture includes the Web → Crawler → Repository → Indexer → Inverted
Index → Query Engine → User.
19.​Compare Information Retrieval and Web Search.​
While IR is the underlying science, Web Search is its largest application. IR often
happens on static, high-quality collections, whereas Web Search must deal with
massive scale, link analysis, adversarial SEO (spam), and highly diverse data
formats.
20.​What are the two major dimensions in the model representation?​
The two dimensions are the Mathematical Dimension (Boolean, Vector Space,
Probabilistic) and the Structural Dimension (how the content within the document is
handled, such as flat text vs. structured/hypertext).
21.​What is the first step towards the knowledge discovery of IR?​
The first step is Document Preprocessing, which involves cleaning raw data
through tokenization, stop-word removal, and normalization to prepare it for indexing.
22.​What is peer-to-peer search?​
Peer-to-peer (P2P) search is a decentralized retrieval method where search queries
are distributed across a network of individual nodes (peers) rather than relying on a
central server or index.
Here are the detailed answers for the Unit I Part-B questions from your question bank. Each
answer is structured to meet the depth required for a 7-mark question.

1(a) What is Information Retrieval? Explain the history of Information Retrieval.

Information Retrieval (IR) is the science of searching for information within a document,
searching for documents themselves, and searching for metadata that describes data. It
focuses on satisfying a user's information need by retrieving relevant resources from a
large collection of unstructured or semi-structured data (usually text).

History of Information Retrieval:


●​ 1940s – 1950s (The Beginning): The field emerged after WWII. Hans Peter Luhn
(IBM) proposed using words as indexing units and measuring word frequency to
determine relevance.
● 1960s (The Golden Age): Gerard Salton's group developed the SMART system, begun at
Harvard and continued at Cornell University. This work introduced the Vector Space
Model, term weighting, and relevance feedback, which remain the backbone of modern IR.
● 1970s – 1980s (Formalization): Karen Spärck Jones proposed inverse document
frequency (IDF) in 1972, completing the TF-IDF weighting scheme, and this era saw the
development of the Probabilistic Retrieval Model (Stephen Robertson and Karen Spärck
Jones). IR systems moved from specialized hardware to mainframe computers, primarily
serving libraries and legal sectors.
●​ 1990s (The Web Revolution): The launch of the World Wide Web shifted IR from a
niche academic field to a global necessity. Scale, link analysis (PageRank), and
crawlers became the new focus.
●​ 2000s – Present (Modern Era): Integration of Machine Learning, Learning to Rank
(LTR), and Natural Language Processing (NLP). Systems now focus on
understanding user intent, personalization, and multimedia retrieval.

2(b) Explain the purpose of Information Retrieval System.


The primary purpose of an IR system is to help a user find information that is relevant to
their need while minimizing the effort required.
1.​ Bridging the Gap: It acts as an interface between a massive, unorganized
collection of data and a user who has a specific question but doesn't know where the
answer is stored.
2.​ Organization of Unstructured Data: Unlike databases (which use tables), IR
systems organize "unstructured" text using indexing so it can be searched efficiently.
3.​ Relevance Ranking: The system doesn't just find matches; it ranks them. The goal
is to place the most useful information at the very top (Top-K results).
4.​ Filtering: It reduces "information overload" by filtering out noise and non-relevant
documents.
5.​ Handling Ambiguity: It interprets natural language queries, which are often vague,
to find the most likely intended information.
3. Explain in detail about the components of Information Retrieval
System with neat diagram.
An IR system consists of several functional modules that work together to process
documents and queries:

1.​ Document Corpus: The input collection of documents (text, PDFs, web pages).
2.​ Text Operations: Before indexing, documents undergo Tokenization (splitting into
words), Stop-word removal (removing common words like "the", "is"), and
Stemming/Lemmatization (reducing "walking" to "walk").
3.​ Indexer: Creates an Inverted Index, a data structure similar to a book index that
maps every unique word to the list of documents where it appears.
4.​ Query Formulation: The user provides a query, which is pre-processed using the
same text operations as the documents.
5.​ Matching Function: An algorithm (like Cosine Similarity) compares the query vector
against the document vectors in the index.
6.​ Ranking: The system assigns a score to each document and sorts them by
relevance for the user.

4. Explain in detail about need of Open source software.


In the context of Information Retrieval and Software Engineering, Open Source Software
(OSS) is critical for:
1.​ Innovation and Collaboration: Allows developers worldwide to contribute code,
leading to faster feature updates and bug fixes than proprietary software.
2.​ Transparency: In search, it is vital to know how a system ranks information. OSS
allows users to inspect the algorithm to ensure there is no hidden bias.
3.​ Cost Efficiency: Organizations can build enterprise-grade search engines (using
Lucene or Solr) without paying massive licensing fees.
4.​ No Vendor Lock-in: Users are not dependent on a single company for support or
updates. They have full control over the source code.
5.​ Security: Because the code is public, "many eyes" look for vulnerabilities, often
making it more secure than closed-source alternatives.

5. Explain detail about free open source software (FOSS).


FOSS emphasizes the freedom of the user. It is defined by the "Four Essential Freedoms":
1.​ Freedom 0: The freedom to run the program as you wish, for any purpose.
2.​ Freedom 1: The freedom to study how the program works and change it so it does
your computing as you wish. Access to source code is a precondition.
3.​ Freedom 2: The freedom to redistribute copies so you can help others.
4.​ Freedom 3: The freedom to distribute copies of your modified versions to others.
●​ Examples: Linux kernel, Apache Web Server, and the Python programming
language.

6. Explain detail about widely used open source software license.


Licenses define the legal boundaries of how OSS can be used:
1.​ GNU General Public License (GPL): A "Copyleft" license. If you modify the
software and distribute it, your version must also be open source under the GPL.
2.​ Apache License 2.0: A permissive license. It allows you to use, modify, and
distribute the code (even in commercial, closed products) as long as you keep the
original copyright and disclaimer.
3. MIT License: The simplest and most permissive of these licenses. You can do almost
anything with the code as long as you include the original license notice.
4.​ BSD License: Similar to MIT, it allows for redistribution and use in source and binary
forms with minimal restrictions.

7. Explain any four open source search engine frameworks.


1.​ Apache Lucene: A high-performance, full-featured text search engine library written
in Java. It provides the core indexing and searching logic.
2.​ Apache Solr: Built on top of Lucene, it is an enterprise search server. It adds
features like faceted search, hit highlighting, and integration with databases.
3.​ Elasticsearch: A distributed, RESTful search and analytics engine. It is famous for
its scalability and is used heavily for log analysis and real-time data searching.
4.​ Meilisearch: A modern, lightning-fast search engine focused on providing a
"search-as-you-type" experience with very low latency and high relevance.

8(i) Identify the various issues in IR system.


●​ Ambiguity: Words often have multiple meanings (Polysemy) or multiple words have
one meaning (Synonymy).
●​ Scalability: As the web grows, systems must index petabytes of data while keeping
search times under a second.
●​ Spam: Malicious actors use "search engine spamming" to trick algorithms into
ranking low-quality pages highly.
●​ Evaluation: It is difficult to measure "relevance" objectively since it varies from user
to user.

8(ii) Examine the various impacts of Web on IR.


●​ Link Analysis: The web introduced hyperlinks, leading to algorithms like PageRank
that use the "popularity" of a page as a ranking signal.
●​ Diversity of Media: IR had to evolve from searching just text to searching images,
videos, and dynamic JavaScript content.
●​ User Behavior: The web allowed IR systems to track click-through rates (CTR) to
learn what users actually find relevant.

9(a) Differentiate between Information Retrieval and Web Search.

Feature    | Information Retrieval (Classical)          | Web Search
Collection | Small, static, controlled (e.g., library). | Massive, dynamic, uncontrolled.
Goal       | High recall (find everything relevant).    | High precision (find the best results first).
Ranking    | Mostly based on text content (TF-IDF).     | Based on text plus link analysis (PageRank).
Spam       | Little to no spam.                         | Constant battle against SEO spam.

9(b) Explain the issues in the process of Information Retrieval.


●​ Query Formulation: Users often write short, vague queries (e.g., searching "Java"
when they want the island, not the language).
●​ Indexing Complexity: Keeping the index updated in real-time as documents are
modified.
●​ Cross-Lingual Retrieval: Finding relevant documents written in a language different
from the query.

10. Demonstrate the role of Artificial Intelligence in IR Systems.


AI transforms IR from "keyword matching" to "intent understanding":
1.​ Natural Language Processing (NLP): Using models like BERT or GPT to
understand the context of a sentence.
2.​ Neural Ranking: Using deep learning to rank documents based on semantic
similarity rather than just counting words.
3.​ Query Expansion: AI can automatically add synonyms to a user's query to improve
results.
4.​ Personalization: AI analyzes a user's past behavior to show results that match their
specific profile (e.g., showing coding results to a developer).

11 & 12. Explain in detail about the components of search engine.


1.​ Crawler (Spider): A module that automatically downloads pages from the web by
following links.
2.​ Page Repository: Stores the raw HTML of the downloaded pages.
3.​ Indexer: Extracts the text and builds the Inverted Index.
4.​ Query Engine: The front-end that accepts the user query, searches the index,
calculates scores, and returns the sorted list of results (SERP).
5.​ Ranking Algorithm: The specific logic (like BM25 or PageRank) used to determine
the order of results.
Based on the "Information Extraction & Retrieval Question Bank" for Subject Code CS622,
here are the answers for the Unit II Part-A questions:

1. Define an Information Retrieval model.

An Information Retrieval (IR) model is a mathematical framework that defines how
documents and queries are represented and how the similarity between them is calculated
to determine the relevance of a document to a user's query.

2. List the retrieval models.


The primary retrieval models include:
●​ Boolean Model (Set-theoretic)
●​ Vector Space Model (Algebraic)
●​ Probabilistic Model
●​ Language Models
●​ Latent Semantic Indexing (LSI)

3. Define Document Preprocessing.

Document preprocessing is the process of cleaning and transforming raw text into a
consistent format suitable for indexing. Key steps include tokenization (breaking text into
words), stop-word removal (filtering common words like "the" or "is"), and
stemming/lemmatization (reducing words to their root form).

4. What is Boolean model?

The Boolean model is a classic IR model based on set theory and Boolean algebra. It uses
operators such as AND, OR, and NOT to combine query terms. A document is either
retrieved (matches the logic) or not retrieved; there is no partial matching or ranking.

5. List the pros and cons of Boolean model.


●​ Pros: Easy to implement, predictable results, and allows for precise query
expression.
●​ Cons: No ranking (all results are treated as equally relevant), query formulation can
be difficult for average users, and it often leads to either too many or too few results
("feast or famine" effect).

6. Explain Vector Space model.

The Vector Space Model (VSM) represents both documents and queries as vectors in a
multi-dimensional space. Each dimension corresponds to a unique term in the collection.
The proximity of the query vector to a document vector (often measured by the angle
between them) determines the relevance score.

7. List the process of vector space model.


1.​ Tokenization: Extract terms from documents.
2.​ Weighting: Assign numerical weights to terms (e.g., using TF-IDF).
3.​ Vector Construction: Represent each document and the query as a vector of these
weights.
4.​ Similarity Calculation: Measure the similarity (typically Cosine Similarity) between
the query vector and document vectors.
5.​ Ranking: Sort documents based on their similarity scores.

8. Define Term Weighting and list its two factors.


Term weighting is a technique used to assign a numerical value to a term to indicate its
importance within a document or the entire collection. Its two primary factors are:
1.​ Term Frequency (TF): How often a word appears in a specific document.
2.​ Inverse Document Frequency (IDF): How rare or common a word is across the
whole collection.

9. Define tf-idf.

TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme that reflects
how important a word is to a document in a collection. It increases proportionally to the
number of times a word appears in the document but is offset by the frequency of the word
in the corpus, which helps to adjust for the fact that some words appear more frequently in
general.

10. Define similarity and various similarity coefficients.


Similarity in IR is a measure of how closely a document matches a query. Common similarity
coefficients include:
●​ Cosine Similarity: Measures the cosine of the angle between two vectors (the most
widely used in VSM).
●​ Jaccard Coefficient: Measures the intersection divided by the union of the terms in
the query and document.
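● Dice’s Coefficient: Similar to Jaccard, but computed as twice the size of the
intersection divided by the sum of the two set sizes, so shared terms count double.
● Euclidean Distance: Measures the straight-line distance between two points in the
vector space. It is a distance, so smaller values indicate greater similarity.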
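
The two set-based coefficients above can be sketched in a few lines. This is a minimal illustration, assuming the query and the document have already been reduced to sets of terms; the function names and the sample terms are made up for the example.

```python
def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; treated as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a: set, b: set) -> float:
    """2|A ∩ B| / (|A| + |B|); the shared terms are counted twice."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical query and document term sets.
query_terms = {"information", "retrieval"}
doc_terms = {"information", "retrieval", "system", "design"}

print(jaccard(query_terms, doc_terms))  # 2 shared / 4 in the union = 0.5
print(dice(query_terms, doc_terms))     # 2*2 / (2 + 4), about 0.667
```

For the same pair of sets, Dice is never lower than Jaccard, which is what "more weight to the intersection" means in practice.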

11. What is data preprocessing?

Data preprocessing (or document preprocessing) is the initial stage in an IR system where
raw text is cleaned and transformed into a structured format. It typically involves several
steps: Tokenization (breaking text into terms), Normalization (unifying case and spelling),
Stop-word removal (deleting high-frequency words with little semantic value), and
Stemming/Lemmatization (reducing words to their root form).
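
A rough sketch of the first three steps (normalization, tokenization, stop-word removal) is given below. The stop-word list is a tiny illustrative assumption, not a standard list, and stemming is deliberately left out.

```python
import re

# Tiny illustrative stop-word list (an assumption; real systems use longer lists).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}


def preprocess(text: str) -> list[str]:
    """Lowercase the text, split it into alphanumeric tokens, drop stop-words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("The Indexer is a component of the IR system."))
# ['indexer', 'component', 'ir', 'system']
```

The same function would be applied to queries, so that query terms and index terms are normalized in the same way.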

12. What is the need for preprocessing?


Preprocessing is essential to:
●​ Reduce Index Size: By removing stop-words and grouping word variants, the
storage requirements for the index are significantly lowered.
●​ Improve Matching: It ensures that a query for "running" matches a document
containing "run."
●​ Increase Efficiency: It simplifies the data the system must process, leading to faster
search results.
●​ Filter Noise: It removes irrelevant characters and terms that do not contribute to the
meaning of the document.

13. List the two key statistics used to assess effectiveness of an IR system.

The two primary statistics used are Precision and Recall.

14. Define Precision and Recall.


●​ Precision: The fraction of retrieved documents that are relevant to the user's query.
It measures the "accuracy" of the system.
●​ Recall: The fraction of all relevant documents in the entire collection that were
successfully retrieved by the system. It measures the "completeness" of the system.
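
As a small worked sketch, both measures (and the F-measure, their harmonic mean) can be computed from two sets of document IDs. The IDs below are invented for illustration.

```python
def precision_recall_f1(retrieved: set, relevant: set) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for one query."""
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical result set and ground-truth relevant set.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}

p, r, f = precision_recall_f1(retrieved, relevant)
print(p, r, round(f, 3))  # 0.5 0.4 0.444
```

Here 2 of the 4 retrieved documents are relevant (precision 0.5), and 2 of the 5 relevant documents were found (recall 0.4).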

15. Define Query Optimization.

Query Optimization is the process of modifying a user's query to improve the efficiency and
effectiveness of the retrieval process. This can involve reordering Boolean operations,
expanding the query with synonyms, or selecting the most discriminative terms to speed up
index lookup.

16. Define Language model.

A Language Model (LM) in IR is a probabilistic model that assigns a probability to a
sequence of words. In the context of retrieval, the system builds a model for each document
and calculates the probability that the user's query was generated by that document's model.

17. List the types of language models.


●​ Unigram Model: Assumes each word is generated independently.
●​ Bigram Model: Assumes each word depends on the previous word.
●​ N-gram Model: Assumes each word depends on the previous $n-1$ words.
●​ Query Likelihood Model: The most common LM used for ranking in IR.

18. What is probabilistic retrieval model?

The Probabilistic Retrieval Model is based on the Probability Ranking Principle, which
states that an IR system should rank documents in decreasing order of their probability of
relevance to the user's query. It uses statistical data to estimate the likelihood that a
document belongs to the "relevant" set vs. the "non-relevant" set.

19. Explain Relevance Feedback.

Relevance Feedback is a process where the user provides feedback on the initial results by
marking certain documents as "relevant" or "not relevant." The system then uses this
information to reformulate the query (usually by adding terms from relevant documents) to
produce a better second set of results.

20. What is zone index?

A Zone Index (or Parametric Index) is an indexing technique where specific parts of a
document (zones) such as the Title, Author, Abstract, or Body are indexed separately.
This allows the system to weigh matches in the "Title" more heavily than matches in the
"Body."

21. Differentiate between relevance feedback and pseudo relevance feedback.


●​ Relevance Feedback: Requires explicit human interaction where the user manually
labels the results.
●​ Pseudo Relevance Feedback (Blind Feedback): Does not involve the user. The
system assumes that the top $k$ documents from the initial search are relevant and
uses them to automatically expand the query.

22. What is inverted index? / What is inversion in indexing process?

An Inverted Index is the central data structure in most IR systems. It is a mapping from
every unique word (term) to a list of documents (postings) that contain that word. "Inversion"
is the process of turning a document-term list (Document A contains words 1, 2, 3) into a
term-document list (Word 1 is in Documents A, C, and F).

Based on the "Information Extraction & Retrieval Question Bank," here are the detailed
answers for the Unit II Part-B questions. Each response is structured to provide the depth
required for a 7-mark academic answer.

1(a) Discuss the Boolean retrieval model in detail with diagram.


The Boolean Retrieval Model is the oldest and most basic model of information retrieval. It
is based on set theory and Boolean algebra.
●​ Logic: It treats a document as a set of terms. Queries are expressed as Boolean
expressions using operators: AND, OR, and NOT.
○​ AND: Retrieves documents containing all specified terms.
○​ OR: Retrieves documents containing at least one of the specified terms.
○​ NOT: Excludes documents containing the specified term.
● Mathematical Representation: A document $D$ is represented as a binary vector in
$\{0, 1\}^n$, where $n$ is the vocabulary size, 1 indicates the presence of a term,
and 0 indicates its absence.
● Advantages: Simple to implement, clean formalism, and provides the user with
precise control.
● Disadvantages: It does not support ranking (all retrieved documents are treated as
equally relevant). It often results in "feast or famine": either too many results or
none at all.
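
The three operators map directly onto set operations. The sketch below is only an illustration under simple assumptions: a three-document toy collection, whitespace tokenization, and a hypothetical `postings` helper.

```python
# Toy collection (invented for illustration).
docs = {
    1: "information retrieval systems rank documents",
    2: "database systems store structured data",
    3: "web search is information retrieval at scale",
}

# Map each term to the set of document IDs that contain it.
index: dict[str, set[int]] = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def postings(term: str) -> set[int]:
    """Documents containing the term (empty set if the term is unknown)."""
    return index.get(term, set())


# information AND retrieval -> set intersection
print(sorted(postings("information") & postings("retrieval")))  # [1, 3]
# systems OR web -> set union
print(sorted(postings("systems") | postings("web")))            # [1, 2, 3]
# information AND NOT web -> set difference
print(sorted(postings("information") - postings("web")))        # [1]
```

Every document in a result set matches the expression exactly, which is why the model offers no ordering among the results.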
1(b) Explain Vector Space Retrieval Model. Explain various IR models in
detail.
The Vector Space Model (VSM) represents documents and queries as vectors in a
high-dimensional space.
●​ Mechanism: Every unique term in the collection is a dimension. A document $D_i$
is represented as a vector $(w_{i1}, w_{i2}, \dots, w_{in})$, where $w$ is the weight
of the term (usually TF-IDF).
●​ Ranking: Relevance is determined by the proximity of the query vector to the
document vector, typically calculated using Cosine Similarity.

Other Major IR Models:


1.​ Probabilistic Model: Estimates the probability that a document is relevant to a query
based on statistical distributions.
2.​ Language Models: Treats each document as a probability distribution over words.
The system calculates the likelihood that the query was "generated" by that
document’s model.
3.​ Latent Semantic Indexing (LSI): Uses singular value decomposition (SVD) to
identify patterns in relationships between terms and concepts (hidden meanings).

2 Develop an example to implement term weighting.


Scenario: Suppose we have a collection of 10,000 documents. We want to calculate the
weight of the term "Retrieval" in a specific Document $D_1$.
● Step 1: Term Frequency (TF). Suppose "Retrieval" appears 5 times in $D_1$. To
dampen the effect of repeated occurrences, we often use logarithmic TF (base 10):
$1 + \log_{10}(5) \approx 1.70$.
● Step 2: Document Frequency (DF). Suppose "Retrieval" appears in 100 documents
across the collection.
● Step 3: Inverse Document Frequency (IDF). $IDF = \log_{10}(\text{Total Documents} /
DF) = \log_{10}(10{,}000 / 100) = \log_{10}(100) = 2$.
● Step 4: TF-IDF Weight. $W = TF \times IDF = 1.70 \times 2 = 3.40$.
● Conclusion: The term "Retrieval" has a weight of about 3.40 in $D_1$. This value is
then used to construct the document vector for similarity comparisons.
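
The same calculation can be checked with a short script. This is a sketch assuming base-10 logarithms and the logarithmic TF variant used in the steps above; the function name is made up for the example.

```python
import math


def tf_idf(term_count: int, total_docs: int, doc_freq: int) -> float:
    """Logarithmic TF times IDF, both with base-10 logarithms."""
    tf = 1 + math.log10(term_count) if term_count > 0 else 0.0
    idf = math.log10(total_docs / doc_freq)
    return tf * idf


# "Retrieval": 5 occurrences in D1, present in 100 of 10,000 documents.
weight = tf_idf(term_count=5, total_docs=10_000, doc_freq=100)
print(round(weight, 3))  # 3.398
```

The unrounded weight is about 3.398; small differences in the last digit of a hand calculation come from rounding the TF factor before multiplying.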

3 Briefly explain weighting and cosine similarity.


Weighting is the process of assigning a numerical value to a term to indicate its importance.
A good weighting scheme should reward terms that are frequent in a specific document (TF)
but penalize terms that are common across the entire collection (IDF), as they are poor
discriminators.

Cosine Similarity measures how close a query vector ($q$) is to a document vector
($d$). Instead of calculating the Euclidean distance (which is affected by document
length), we measure the cosine of the angle between the two vectors.
● Formula: $\text{Similarity}(d, q) = \frac{d \cdot q}{\|d\| \|q\|}$
● A value of 1 means the vectors point in the same direction (the same term
proportions, regardless of length); a value of 0 means they are orthogonal (no shared
terms).
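
A minimal sketch of the formula, assuming each vector is stored sparsely as a dictionary from term to weight (the weights below are arbitrary illustrative numbers, not computed TF-IDF values):

```python
import math


def cosine_similarity(d: dict[str, float], q: dict[str, float]) -> float:
    """Dot product of d and q divided by the product of their lengths."""
    dot = sum(w * q.get(term, 0.0) for term, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # convention for an empty vector
    return dot / (norm_d * norm_q)


query = {"information": 3.0, "retrieval": 4.0}
long_doc = {"information": 6.0, "retrieval": 8.0}  # same direction, twice the length
unrelated = {"database": 2.0}                       # no shared terms

print(cosine_similarity(long_doc, query))   # 1.0
print(cosine_similarity(unrelated, query))  # 0.0
```

The longer document scores 1.0 because only the direction of the vector matters, which is the reason cosine is preferred over Euclidean distance here.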

4(i) Discuss the structure of inverted indices.


The Inverted Index is the central data structure of modern IR systems. It consists of two
main parts:
1.​ The Dictionary (Lexicon): A list of all unique terms found in the document
collection. It often includes metadata like the document frequency (how many
documents the term appears in).
2.​ Postings List: For every term in the dictionary, there is a linked list or array
containing the IDs of documents where that term appears.
○​ Example: Retrieval $\rightarrow$ [Doc 1, Doc 4, Doc 15]
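
One way to picture the two parts is a dictionary whose keys form the lexicon and whose values are sorted postings lists. The sketch below assumes whitespace tokenization and an invented four-document collection.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[int]]:
    """Map each term to a sorted list of the document IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical collection; the IDs mirror the example above.
docs = {
    1: "retrieval of information",
    4: "information retrieval models",
    7: "database systems",
    15: "retrieval evaluation",
}

index = build_inverted_index(docs)
print(index["retrieval"])       # [1, 4, 15]
print(len(index["retrieval"]))  # document frequency of the term: 3
```

Keeping each postings list sorted by document ID is what makes the merging step in the search process efficient.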

4(ii) Discuss the searching process in inverted file.


The search process involves three main steps:
1.​ Vocabulary Search: The system looks up the terms from the user query in the
Dictionary.
2.​ Retrieval of Postings: The system fetches the corresponding postings lists for those
terms.
3. Postings Merging:
○ If the query is A AND B, the system finds the intersection of the two lists.
○ If the query is A OR B, it finds the union.
○ For ranked retrieval, it calculates the scores for each document ID found in
the lists.
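
The merging step can be sketched as a two-pointer walk over postings lists, assuming both lists are sorted by document ID. The function names and sample lists are illustrative.

```python
def intersect(p1: list[int], p2: list[int]) -> list[int]:
    """A AND B: walk both sorted lists once, keeping IDs present in both."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the pointer that is behind
        else:
            j += 1
    return result


def union(p1: list[int], p2: list[int]) -> list[int]:
    """A OR B: every ID that appears in either list, in sorted order."""
    return sorted(set(p1) | set(p2))


a = [1, 4, 15, 20]
b = [4, 9, 15]
print(intersect(a, b))  # [4, 15]
print(union(a, b))      # [1, 4, 9, 15, 20]
```

The intersection takes time proportional to the combined length of the two lists, since each pointer only moves forward.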

5 Analyze the language model based IR and its probabilistic representation.

Language Modeling (LM) approaches retrieval by building a generative model for each
document.
● Probabilistic Representation: For a query $Q = (q_1, q_2, \dots, q_n)$, the goal is
to rank documents by $P(D|Q)$, the probability that document $D$ is relevant given
query $Q$. By Bayes' rule, $P(D|Q) \propto P(Q|D)\,P(D)$, so with a uniform document
prior the ranking depends only on $P(Q|D)$.
● In practice, we use the Query Likelihood Model, calculating $P(Q|M_d)$, which is
the probability that the query $Q$ would be generated by the model $M_d$
associated with document $D$. Under a unigram model,
$P(Q|M_d) = \prod_{i=1}^{n} P(q_i|M_d)$.
● Smoothing: A critical part of LM is smoothing (like Jelinek-Mercer or Dirichlet
(Bayesian) smoothing), which prevents the probability from becoming zero if a query
word is missing from a document.
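
The following sketch scores documents with a unigram query likelihood model and Jelinek-Mercer smoothing. The convention assumed here is $P(t|M_d) = (1-\lambda)\,tf(t,d)/|d| + \lambda\,cf(t)/|C|$; the value $\lambda = 0.5$ and the two toy documents are arbitrary choices for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query: list[str], doc: list[str],
                         collection: list[str], lam: float = 0.5) -> float:
    """log P(Q | M_d) under a unigram model with Jelinek-Mercer smoothing."""
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / len(doc)            # maximum-likelihood estimate
        p_coll = coll_tf[term] / len(collection)   # background (collection) model
        p = (1 - lam) * p_doc + lam * p_coll
        if p == 0:
            return float("-inf")  # term unseen anywhere in the collection
        score += math.log(p)
    return score


d1 = "information retrieval ranks documents".split()
d2 = "databases store structured records".split()
collection = d1 + d2
query = ["retrieval", "documents"]

s1 = query_log_likelihood(query, d1, collection)
s2 = query_log_likelihood(query, d2, collection)
print(s1 > s2)  # True: d1 contains both query terms
```

Without the collection term, `d2` would receive probability zero for each query word; smoothing keeps its score finite while still ranking it below `d1`.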

6(i) Explain Latent Semantic Indexing (LSI).


LSI (also called Latent Semantic Analysis) addresses the problems of Synonymy (different
words with same meaning) and Polysemy (one word with different meanings).
●​ Process: It uses a mathematical technique called Singular Value Decomposition
(SVD) on the term-document matrix.
●​ Goal: It identifies "latent" (hidden) concepts. It reduces the thousands of term
dimensions into a smaller number of "concept" dimensions.
●​ Benefit: If a user searches for "Physician," LSI can retrieve documents containing
"Doctor" even if the word "Physician" is absent, because they share a latent concept.

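The SVD step in LSI can be written compactly. The notation below is an assumption introduced for illustration: $A$ is the $t \times d$ term-document matrix, $k$ is the chosen number of concept dimensions, and $q$ is the query's term vector.

```latex
% Full decomposition of the term-document matrix
A = U \Sigma V^{T}

% Rank-k approximation: keep only the k largest singular values
A_k = U_k \Sigma_k V_k^{T}

% Folding a query into the k-dimensional concept space
\hat{q} = \Sigma_k^{-1} U_k^{T} q
```

Documents and the folded-in query $\hat{q}$ are then compared in the $k$-dimensional concept space (for example with cosine similarity) instead of the original term space.
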
8 Write about relevance feedback and query expansion.


Relevance Feedback: This is an iterative process where the system presents initial results,
the user marks some as "relevant," and the system re-runs the search. The most famous
algorithm is the Rocchio Algorithm, which moves the query vector closer to the center of
the relevant document vectors and further from the non-relevant ones.
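
A sketch of the Rocchio update is shown below, using the common form $q_{new} = \alpha q + \frac{\beta}{|D_r|}\sum d_r - \frac{\gamma}{|D_n|}\sum d_n$. The values of $\alpha$, $\beta$, $\gamma$ and the sample vectors are assumptions chosen for illustration, and dropping negative weights is a simplifying choice.

```python
from collections import defaultdict


def rocchio(query: dict[str, float],
            relevant: list[dict[str, float]],
            non_relevant: list[dict[str, float]],
            alpha: float = 1.0, beta: float = 0.75, gamma: float = 0.15) -> dict[str, float]:
    """Move the query toward the relevant centroid and away from the non-relevant one."""
    new_q = defaultdict(float)
    for term, w in query.items():
        new_q[term] += alpha * w
    for doc in relevant:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant)
    for doc in non_relevant:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(non_relevant)
    # Terms whose weight ends up zero or negative are dropped.
    return {term: w for term, w in new_q.items() if w > 0}


query = {"jaguar": 1.0}
relevant = [{"jaguar": 1.0, "cat": 1.0}]      # marked relevant by the user
non_relevant = [{"jaguar": 1.0, "car": 1.0}]  # marked not relevant

print(rocchio(query, relevant, non_relevant))
```

In this example the reformulated query gains the term "cat" from the relevant document, while "car" receives a negative weight and is dropped.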

Query Expansion: This involves adding related terms to the original query to improve recall.
●​ Manual: User picks from a list of suggested terms.
● Automatic: The system uses a Thesaurus or Pseudo-Relevance Feedback
(assuming the top $k$ results, for example the top 5, are relevant) to find new terms
to add to the query automatically.

9(ii) Explain in detail about conflation algorithm.


A Conflation Algorithm (better known as Stemming) is a procedure that reduces words to
their root or "stem" form.
●​ Purpose: To improve retrieval by matching different variations of a word (e.g.,
"connecting," "connected," "connection" all map to "connect").
●​ Types:
1.​ Suffix Stripping (e.g., Porter Stemmer): Uses a set of rules to chop off
common endings like -ing, -ed, or -es.
2.​ Lemmatization: Uses a dictionary and morphological analysis to find the
actual linguistic root (e.g., "better" $\rightarrow$ "good").
●​ Trade-off: Over-stemming (reducing "organization" and "organ" to the same root)
can hurt precision, while under-stemming can hurt recall.
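
A toy suffix-stripping routine illustrates the idea. This is not the Porter algorithm: the suffix list, its ordering, and the minimum stem length are simplifying assumptions made for the example.

```python
# Suffixes are tried in this order; a deliberately small, illustrative list.
SUFFIXES = ("ing", "ed", "ion", "es", "s")


def strip_suffix(word: str, min_stem: int = 3) -> str:
    """Remove the first matching suffix if at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


for w in ["connecting", "connected", "connection", "connects"]:
    print(w, "->", strip_suffix(w))  # each maps to "connect"
```

A rule set this crude over-stems and under-stems many words, which is why practical stemmers add conditions on the shape of the remaining stem.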
