0% found this document useful (0 votes)
41 views6 pages

Constructing an Inverted Index

1) The document discusses constructing an inverted index to summarize text documents. It involves tokenizing documents, sorting terms, calculating term frequency and document frequency, and separating the index into a vocabulary file and posting files. 2) An inverted index maps words to their locations in documents, allowing fast full-text searches. It contains a vocabulary listing all unique terms and posting files listing frequency and location for each term across documents. 3) The example shows tokenizing example documents, processing the terms, calculating statistics, and structuring the final inverted index files.

Uploaded by

Bini Teflon Ankh
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views6 pages

Constructing an Inverted Index

1) The document discusses constructing an inverted index to summarize text documents. It involves tokenizing documents, sorting terms, calculating term frequency and document frequency, and separating the index into a vocabulary file and posting files. 2) An inverted index maps words to their locations in documents, allowing fast full-text searches. It contains a vocabulary listing all unique terms and posting files listing frequency and location for each term across documents. 3) The example shows tokenizing example documents, processing the terms, calculating statistics, and structuring the final inverted index files.

Uploaded by

Bini Teflon Ankh
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd

Addis Ababa University

School of Information Science


Department of Information Science
IR
Assignment II
constructing inverted file

By:
Name ID No.
Biniam Worku GSE/6722/13

Submission Date: 29/9/2021

Inverted Index
Inverted index also known as Inverted file, an index data structure storing a
mapping from content, such as words or numbers, to its locations in a document or
a set of documents. Building and maintaining an inverted index is a relatively low-
cost risk. On a text of n words an inverted index can be built in O(n) time, n is
number of terms. The vocabulary (List of terms) and the occurrence (Location and
frequency of terms in a document collection) are the two contents of an inverted
file.

The occurrence contains one record per term and lists frequency of each term in a
document, also shows locations of words in the text.

A vocabulary file (Word list): stores all of the distinct terms (keywords) that
appear in any of the documents (in lexicographical order) and for each word a
pointer to posting file.

In the given assignment the first thing to do is tokenizing the documents. For a
reference, the documents given in the assignment are:

Doc 1  :  New home to home sales forecasts


Doc 2  : Rise in home sales in July
Doc 3  :  Home sales rise in July for new homes
Doc 4  :  July new home sales rise

After tokenization, the next thing to do is sorting the inverted file by terms.
Term Doc# Term Doc#
New 1 for 3
home 1 forecasts 1
to 1 home 1
home 1 home 1
sales 1 home 2
forecast home 3
1
s
home 4
Rise 2
homes 3
in 2
in 2
home 2
in 2
sales 2
in 3
in 2
july 2
july 2
Home 3 july 3
sales 3 july 4
rise 3 new 1
in 3 Next stop words (to, in and new 3 for) are
july 3 removed. new 4
for 3 rise 2
new 3 rise 3
By stemming the suffix ‘s’ is removed
homes 3 rise 4
from terms like, forecasts, sales and
July 4 sales 1
new 4
homes. sales 2
home 4 sales 3
sales 4 Then all the terms caps sales 4 changed
rise 4 to small caps for the to 1
normalizing purposes.

Multiple term entries in a single document are merged and frequency information
added.

Term frequency (TF) then calculated by counting number of occurrence of


terms in the collections.
Term Doc#
forecas
1
t Term Doc# TF
home 1 forecas
1
home 1 t 1
home 2 home 1 2
home 3 home 2 1
home 4 home 3 2
home 3 home 4 1
july 2 july 2 1
july 3 july 3 1
july 4 july 4 1
new 1 new 1 1
new 3 new 3 1
new 4 new 4 1
rise 2 rise 2 1
rise 3 rise 3 1
rise 4 rise 4 1
sale 1 sale 1 1
sale 2 sale 2 1
sale 3 sale 3 1
sale 4 sale 4 1

Term Doc# TF
forecas
1
t 1
home 1 2
home 2 1
home 3 2 Content Frequency (CF) and Document
home 4 1 Frequency(DF) are calculated by using Document
july 2 1 numbers and frequency of terms appear in the
july 3 1
documents . The result is shown on the following table .
july 4 1
new 1 1 Term DF CF
new 3 1 forecas
new 4 1 t 1 1
rise 2 1 home 4 6
rise 3 1 july 3 3
rise 4 1 new 3 3
sale 1 1 rise 3 3
sale 2 1 sale 4 4
sale 3 1
sale 4 1
The final step is Separation of inverted file into vocabulary and posting file.
Vocabulary: For searching purpose we need only word list. This allows the
vocabulary to be kept in memory at search time since the space required for the
vocabulary is small.

Posting file : requires much more space. For each word appearing in the text we
are keeping statistical information related to word occurrence in documents.

vocabulary Doc#
1 posting
TF
1
1 2
2 1
Term DF CF 3 2
forecast 1 1 4 1
home 4 6 2 1
july 3 3 3 1
new 3 3 4 1
rise 3 3 1 1
sale 3 1
4 4
4 1
2 1
3 1
4 1
1 1
2 1
3 1
4 1
Pointer
s

References
1. Modern information retrieval lecture note Addis Ababa university , school of
information science , 2021
2. Christopher D. Manning, Hinrich Schütze, and Prabhakar Raghavan (2007)
Introduction to information retrieval,cabridge university press, Cambridge
,England

Common questions

Powered by AI

Building an inverted index in O(n) time complexity is computationally advantageous because it scales linearly with the size of the term corpus, ensuring efficiency even as document collections grow larger. This time complexity allows for manageable computing resource use, enabling robust performance across varied and extensive datasets, which is critical for large-scale information retrieval tasks .

Removal of suffixes, such as 's', through stemming improves indexing by reducing the number of unique terms and consolidating term variations, such as making both 'sales' and 'sale' indexable under a single term. This reduces index size and improves retrieval precision by ensuring users get hits from both base and plural forms using a single query .

Stemming and stop word removal are crucial for normalizing text, as stemming reduces morphological variations of terms, while stop word removal eliminates non-informative words. This dual process creates a more standardized and concise representation of text data, enhancing the inverted index's capacity to deliver accurate and relevant search results by focusing on substantive words that convey essential meaning .

Calculating both Document Frequency (DF) and Content Frequency (CF) is significant as DF provides insights into how many documents contain a specific term, which helps in assessing term importance in the document set. CF gives the total count of term occurrences, aiding in understanding term distribution and commonality. These metrics are crucial for effective weighting schemes in retrieval algorithms .

Maintaining a separate vocabulary in memory is advantageous because it allows for quick access to term information, minimizing the time spent searching through the extensive posting lists. Since the vocabulary is smaller, it is practical to keep it in memory, facilitating rapid determination of whether a term is present and where its detailed posting information can be found .

Separation of the vocabulary file and posting file optimizes search operations by ensuring the vocabulary, which is smaller, can be kept in memory to quickly locate terms. The larger posting file, containing detailed statistical information on term occurrences, is accessed only when necessary, making the retrieval process more efficient by reducing memory load during searches .

Tokenizing documents as the initial step is pivotal because it breaks down text into discrete elements or tokens, forming the base structure on which the entire indexing process builds. Proper tokenization ensures accuracy in subsequent sorting, stop word removal, stemming, and indexing operations. If tokens are inaccurately identified, it can propagate errors throughout the indexing process, affecting retrieval efficiency and accuracy .

Term frequency (TF) impacts the structure of an inverted index by recording how often a term appears in different documents, which allows the index to quickly retrieve documents based on commonality of terms. In terms of efficiency, knowing TF helps optimize searches, as terms with higher frequency may indicate more relevant or central documents within a dataset .

The critical steps involved in constructing an inverted index include tokenizing the documents, sorting the inverted file by terms, removing stop words, stemming to unify suffix variations, normalizing by changing all terms to lower case, and merging multiple term entries within a single document while adding frequency information. Term frequency (TF) is then calculated by counting occurrences of terms across documents .

Stop words are removed as they are common words with little retrieval value, reducing index size and improving efficiency. Stemming unifies terms by removing suffix variations, like reducing 'sales' and 'homes' to 'sale' and 'home', respectively, which consolidates indexing and enhances retrieval accuracy by standardizing term variations .

You might also like