DEPARTMENT OF CSE (ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)
R23 Regulation V Semester A. Y.: 2025 -26
Lecture Notes
Information Retrieval System(23ML5T01)
Prepared By
Mr. P V Kishore Kumar [Link]., (Ph.D.)
Associate Professor
Department of CSE (AI & ML)
Ramachandra College of Engineering(A), Eluru.
pg. 1 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Course Objectives:
1 To introduce the fundamental concepts and domain of Information Retrieval (IR) systems
and how they differ from other information systems.
2 To explain the data structures and algorithms fundamental to building and operating
efficient IR systems.
3 To analyze the working of indexing techniques such as inverted files and signature files
and their enhancements.
4 To explore advanced indexing structures like PAT Trees and the role of lexical analysis,
stop lists, and stemming in IR systems.
5 To study various string searching algorithms and their applicability to IR systems for
efficient retrieval.
Unit – I Introduction to Information storage and retrieval systems Domain Analysis of IR
systems, IR and other types of Information Systems, IR System Evaluation Introduction to
Data structures and algorithms related to Information Retrieval: Basic Concepts, Data
structures, Algorithms.
Unit – II Inverted Files and Signature Files Introduction, Structures used in Inverted Files,
building an Inverted file using a sorted array, Modifications to the Basic Techniques. Signature
Files: Concepts of Signature files, Compression, Vertical Partitioning, Horizontal Partitioning.
Unit – III New Indices for Text, Lexical Analysis and Stoplists PAT Trees and PAT Arrays:
Introduction, PAT Tree structure, Algorithms on the PAT Trees, Building PAT Trees as
PATRICA Trees, PAT representation as Arrays. Stoplists.
Unit – IV Stemming Algorithms and Thesaurus Construction Types of Stemming
algorithms, Experimental Evaluations of Stemming, stemming to Compress Inverted Files.
Thesaurus Construction: Features of Thesauri, Thesaurus Construction, Thesaurus construction
from Texts, Merging existing Thesauri.
Unit – V String Searching Algorithms Introduction, Preliminaries, The Naive Algorithm, The
Knutt-Morris-Pratt Algorithm, The Boyer Moore Algorithm, The Shift-Or Algorithm, The
Karp-Rabin Algorithm.
Text Books:
1 Modern Information Retrieval, Ricardo Baeza-Yates, Neto, PEA,2007.
2 Information Storage and Retrieval Systems: Theory and Implementation, Kowalski, Gerald,
Mark Academic Press, 2000.
pg. 2 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Unit – III New Indices for Text, Lexical Analysis and Stoplists PAT Trees and PAT Arrays:
Introduction, PAT Tree structure, Algorithms on the PAT Trees, Building PAT Trees as
PATRICA Trees, PAT representation as Arrays. Stoplists.
INTRODUCTION
Text searching methods can be classified into:
1. Lexicographical indices – Indices that are sorted.
2. Clustering techniques – Grouping related items together.
3. Hashing-based indices – Using hash functions to access items quickly.
we focus on two lexicographical indices:
PAT Trees
PAT Arrays
Traditional Model of Text Retrieval
Used mostly in library applications.
Each document is assigned a list of keywords (attributes).
Sometimes, relevance weights are added to keywords.
Problems in the Traditional Model:
1. Fixed structure assumption – Works with documents and words only, not flexible for
all applications.
2. Indexing required – Keywords must be extracted, which is difficult and error-prone
(manual or automatic).
3. Query limitation – Queries are restricted to keywords.
Sometimes, all words (except common ones called stopwords) are indexed.
A New Model: Text as One Long String
Instead of treating text as documents + keywords, we see the text as one long string.
Each position in the text corresponds to a semi-infinite string (sistring):
A sistring starts at a position and continues till the end of the text.
Every sistring is unique (since it starts at a different position).
Advantages of this Model:
1. No structure needed – Works even if text has no document structure.
2. No keywords required – Queries are based on substrings (prefixes of sistrings).
3. Wider query domain – Any substring can be searched.
pg. 3 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
4. Supports multiple searching structures – Flexible to use different indexing
techniques.
Limitations In traditional DBs:
o Each document = record
o Each keyword = secondary key
Problems:
o Number of keywords is variable → database techniques not useful.
o Queries usually check equality or ranges.
o Do not support approximate text searching (e.g., partial match, pattern
matching).
PAT Trees – Original structure for efficient text searching.
PAT Arrays – Efficient implementation of PAT Trees.
Provide a more powerful query language than keyword-based or Boolean searching.
Independently discovered by:
o Gonnet (1987) – used in PATTM, a fast text searching system for the Oxford
English Dictionary (OED).
o Manber and Myers (1990) – focused on searching in large genetic databases.
Traditional model = documents + keywords → limited, error-prone, keyword-based.
New model = one long string + sistrings → flexible, substring-based searching.
PAT Trees & PAT Arrays = efficient lexicographical indices, support advanced queries beyond
Boolean keyword search.
PAT TREE STRUCTURE
PAT Tree = A data structure for very efficient text searching after preprocessing.
First described by Gonnet (1983) in Unstructured Data Bases.
Implemented in 1985 and used for the Oxford English Dictionary (OED) computerization.
The implementation became known as PATTM, a famous software package for fast string
searching. Semi-Infinite Strings (Sistrings)
Definition
Consider the text as a long array of characters (indexed sequentially from 1).
pg. 4 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
A semi-infinite string (sistring) is a subsequence starting at a given position and
continuing to the right.
If the end of text is reached, special null characters are added (different from any text
characters).
Sistrings are uniquely identified by their starting position.
Example -
Once upon a time, in a far away land . . .
Sistrings:
Sistring 1 → Once upon a time . . .
Sistring 2 → nce upon a time . . .
Sistring 8 → on a time, in a . . .
Sistring 11 → a time, in a far . . .
Sistring 22 → a far away land . . .
The main operation on sistrings = lexicographical comparison (alphabetical order).
Two sistrings are never equal unless they start at the same position.
Example Comparison: 22 < 11 < 2 < 8 < 1
(Here “a far away …” is the smallest sistring and “upon a time …” is the largest)
PAT Tree
A PAT Tree is a Patricia Tree built over all possible sistrings of a text.
A Patricia Tree is a compressed binary trie:
Uses individual bits of keys for branching.
0 bit → left subtree, 1 bit → right subtree.
Internal nodes store which bit to inspect (absolute position or skip count).
Eliminates unnecessary nodes → compact representation.
Key Features
External nodes = sistrings (represented by integer displacements).
Internal nodes = branching info (bit positions).
For text of size n:
n external nodes
n – 1 internal nodes
pg. 5 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Overall size = O(n) Very similar to compact suffix trees.
Example of PAT Tree
Example text (in binary): "01100100010111..."
Build PAT Tree after inserting first 8 sistrings.
Internal nodes: circles → contain bit positions.
External nodes: squares → point to sistrings.
Example Query
Searching for sistring 00101:
1. Inspect bit 1 → 0 → go left.
2. Inspect bit 2 → 0 → go left.
3. Inspect bit 3 → 1 → go right.
4. Inspect bit 5 → 1 → go right.
5. Compare with stored sistring to confirm.
Skipped bits require a final verification step
Fig:PAT tree when the sistrings 1 through 8
have been inserted
Fig shows an example of a PAT tree over a sequence of bits (normally it would be over a
sequence of characters), just for the purpose of making the example easier to understand. In
this example, we show the Patricia tree for the text "01100100010111..." after the first 8
sistrings have been inserted. External nodes are indicated by squares, and they contain a
reference to a sistring, and internal nodes are indicated by a circle and contain a displacement.
In this case we have used, in each internal node, the total displacement of the bit to be
inspected, rather than the skip value.
Indexing Points
Default: Every position in the text is indexed → n external nodes.
But: Not always necessary.
Example: If only word and phrase searches are needed → index only word
beginnings (~20% of text positions).
Decision depends on application requirements:
More sistrings indexed = bigger index, more flexibility.
Fewer sistrings indexed = smaller index, faster but limited search.
Summary:
pg. 6 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Sistrings = unique substrings starting from each text position.
PAT Tree = compact Patricia Tree built on sistrings.
Efficient for lexicographical comparisons and substring searching.
Useful for large text databases (e.g., OED, genetic databases).
Index size can be tuned depending on application (all positions vs. word starts).
ALGORITHMS ON THE PAT TREE
some of the algorithms for text searching when we have a PAT tree of our text.
Prefix Searching → All substrings starting with query.
Proximity Searching → Words close to each other.
Range Searching → Words between lower & upper bounds.
Longest Repetition → Deepest repeated substring.
Most Frequent → Substring with largest subtree count.
Regex Searching → Supports wildcards & complex queries.
1. Prefix Searching
Purpose:
Find all occurrences of a given prefix (substring starting characters) in the text.
Steps:
1. Take the query prefix.
2. Start from the root of the PAT tree.
3. Compare the prefix with sistrings (semi-infinite strings).
4. Follow the tree until all matches are grouped.
5. Output the text positions where the prefix occurs.
Example:
Text = "banana"
Sistrings = "banana", "anana", "nana", "ana", "na", "a"
Query = "ana" Result → Matches at positions 1 and 3.
Root
├── "a" → "ana", "anana"
├── "b" → "banana"
└── "n" → "na", "nana"
Key Points: Fast searching for prefixes. No need for keywords.
2. Proximity Searching
pg. 7 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Purpose: Find words/strings that occur near each other in the text.
Steps:
1. Perform prefix searches for both words.
2. Store the positions of each occurrence.
3. Compare positions → select those within a distance k.
Example:
Text = "data mining in big data"
Query = "data" and "mining", distance ≤ 2 words
Result → "data" (pos 1), "mining" (pos 2) → match.
Diagram :
Two lists of positions merged with distance check.
Key Points: Useful in search engines ("AI" near "research"). Requires position comparison.
3. Range Searching
Purpose: Find sistrings that fall within a given lexicographic range.
Steps:
1. Define a range [L, R].
2. Traverse PAT tree in lexicographic order.
3. Collect all sistrings between L and R.
Example:
Text = "cat, dog, cow"
Range = [“c”, “d”]
Result → "cat", "cow", "dog"
Diagram :
Lexicographic ordering of sistrings like a dictionary search.
Key Points: Works like dictionary word [Link] in alphabetical filtering.
4. Longest Repetition
Purpose: Find the longest repeated substring in the text.
Steps:
1. Build PAT tree with sistrings.
2. Check common prefixes of adjacent sistrings.
3. Keep track of longest match.
pg. 8 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Example:
Text = "banana"
Sistrings = "a", "ana", "anana", "banana", "na", "nana"
Longest repeated substring = "ana" (length 3).
Diagram Idea:
Common prefix branches in PAT tree highlight repetitions.
Key Points: Useful in DNA analysis, data compression.
5. Most Frequent String
Purpose: Find which substring occurs the maximum number of times.
Steps:
1. Build PAT tree.
2. Count occurrences of each substring in sistrings.
3. Track the maximum count.
Example:
Text = "banana"
Frequent substring = "a" (appears 3 times).
Diagram Idea:
Subtree counts showing "a" = 3, "ana" = 2, "na" = 2.
Key Points: Similar to word frequency analysis. Helps in keyword extraction.
6. Regular Expression Searching
Purpose: Support flexible queries with patterns (like a.*na).
Steps:
1. Convert regex into prefixes and wildcards.
2. Use PAT tree to find matches for fixed parts.
3. Expand results to check complete regex.
Example:
Text = "banana"
Query = b.*a
Matches → "ba", "bana", "banana".
Diagram Idea:
Regex → Tree traversal + wildcard expansion.
Key Points: Powerful search beyond exact strings. Used in advanced search engines & editors.
Algorithm Purpose Example (Text = Result
pg. 9 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
"banana")
Prefix Searching Find substrings starting "ana" Pos 1, 3
with q
Proximity Words within distance "data" near "mining" Match
Searching
Range Searching Strings within [L,R] Range [c,d] "cat", "cow",
"dog"
Longest Longest repeated "banana" "ana"
Repetition substring
Most Frequent Substring with max "banana" "a" (3 times)
String occurrences
Regex Searching Pattern-based search b.*a "ba", "bana",
"banana"
BUILDING PAT TREES AS PATRICIA TREES
PAT Trees (Practical Algorithm to Retrieve Information Coded in Alphanumeric) are efficient
data structures for text indexing.
They are often implemented using Patricia Trees (Practical Algorithm to Retrieve Information
Coded in Alphanumeric).
The challenge: when dealing with very large text databases, efficiency in space and time
becomes critical.
Storage Requirements
Internal nodes: about 3–4 words in size.
External nodes: about 1 word in size.
Total storage: ~ 4n to 5n words (≈ 18n characters for n characters of text).
This is too large and impractical.
Problems in Implementation
1. Large index size → storage overhead.
2. Slow access time → since external data is stored on disk in large blocks.
Solutions to Reduce Space & Access Time
a) Bucketing of External Nodes
pg. 10 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Definition: Group multiple external nodes into a bucket instead of keeping them
separately.
Benefit: Saves many internal nodes (up to b − 1).
o Pros → Reduces space significantly.
o Cons → Slightly increases time (because searching inside a bucket requires
checking multiple entries).
On average:
o Keys per bucket ≈ b/2.
o Internal nodes reduce to about (n / (b log₂n)) instead of (n − 1).
b) Supernodes (Disk Page Mapping)
Idea: Store as much of the tree as possible inside one disk page (supernode).
Working:
o Each disk page has one entry point.
o Contains internal nodes + pointers.
o Pointers may lead to:
Another node inside the same page, OR
Another disk page.
Advantages:
o Smaller pointers (≈ half a word).
o High storage efficiency (≈ 80% disk page utilization).
o Fewer disk accesses → faster search.
Disk page ≈ 1000 nodes (internal + external).
Average: 10 tree levels per page.
Root page always in memory.
Typical search:
o Needs 2–3 disk accesses (instead of 30–40).
o Plus one more access for verification.
Bucketing saves space by grouping nodes.
Supernodes reduce time by minimizing disk accesses.
pg. 11 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Space vs Time → balance depends on application needs.
Efficient PAT Tree implementation is crucial for large-scale text databases.
Bucketing like putting all small files into a single folder to save cabinet space.
Supernodes like storing an entire book chapter on one page, so you don’t have to flip through
many pages to read it.
PAT TREES REPRESENTED AS ARRAYS
In PAT tree implementation, buckets are used to group external nodes.
Problem: If bucket size is too large → searching becomes slow.
Solution: Store external nodes in sorted order inside the bucket → allows binary search
instead of sequential search.
Cost reduces from b comparisons → 2 log b − 1 comparisons.
If buckets grow large, even up to n elements, the whole index can be represented as a single
array of external nodes sorted lexicographically.
This leads to the concept of Suffix Arrays.
Key advantage: Same information as Patricia tree but with less storage cost.
Access time increases slightly (factor of log₂ n).
Searching in PAT Arrays
Prefix Search & Range Search:
o Done using indirect binary search over the array.
o Complexity:
At most 2 log₂ n − 1 comparisons.
At most 4 log₂ n disk accesses.
Other operations:
o "Most frequent" search → improved but more technical.
o "Longest repetition" → needs extra data structures.
o Regular expression search → slower by O(log n).
Searching is O(log² n) with 1 word per index (pointer) → very space-efficient.
Building PAT Trees as Arrays
pg. 12 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
a) The OED (Oxford English Dictionary) Challenge
OED text size: ~600 MB (~119 million entries).
Building Patricia tree directly:
o Would take 3.4 years (!) with available computers (30 random disk
accesses/sec).
Even optimized algorithms → still 45+ days.
Hence → need better, sequential access-based algorithms.
b) Building in Memory
If the text + PAT array fit in main memory, build efficiently.
Equivalent to string sorting.
Quicksort or External Quicksort is used.
Best for small files.
c) Merging Small vs Large Arrays
If a small file + its PAT array fit in memory, merge with a large file:
o Keep small file + array + counter array in memory.
o Sequentially read large file, locate each string in small array, update counters.
o Then merge sequentially.
Efficient: O(n₂ log n₁) comparisons, O(n₂) characters read sequentially.
Works very well in practice.
d) General Index Building
Split text into chunks.
Build PAT array for each chunk in memory.
Merge indices step by step using counters.
Time: O(n² / m), where n = index size, m = available memory.
Can be parallelized (different merges run independently).
Can build OED-sized index in a weekend.
Delayed Reading Paradigm
Problem: Random disk access = very costly.
Solution: Batch I/O requests instead of executing them immediately.
pg. 13 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Maintain 3 lists:
1. Index building program
2. Blocked requests
3. Satisfied requests
Process requests in optimal order to reduce disk movements.
This greatly improves efficiency.
Merging Large Files
Sometimes need to merge two or more large indices.
Naive method: n log m comparisons, but → requires too many random accesses.
Improvement:
o Read large blocks of pointers.
o Sort them by location in text.
o Read keys sequentially instead of random access.
Further optimization:
o Store only prefixes (stemming) of adjacent keys to save space.
Balances memory, disk space, and performance.
PAT Trees as Arrays = Suffix Arrays (compact, ordered).
Advantages:
Space-efficient (1 word per entry).
Easier prefix/range searching.
Efficient merging strategies for large text databases.
Challenges:
Slightly slower access (O(log² n)).
Needs smart building algorithms for huge datasets (like OED).
Patricia Tree like a library with shelves and sections (lots of pointers).
A PAT Array is like keeping all books sorted in a single long list → saves space but
searching may take slightly more steps.
LEXICAL ANALYSIS AND STOPLISTS
pg. 14 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Lexical Analysis = Breaking text into tokens (words with meaning).
Tokens = groups of characters with significance.
Example: sentence "AI helps in data mining" → tokens = [AI, helps, in, data, mining].
Automatic Indexing → system extracts index terms from documents.
Query Processing → system extracts terms from user query for comparison.
Stoplist = list of common words (e.g., "is", "the", "and") that are ignored.
Example: "the state of the art" → stoplist removes "the, of", tokens = [state, art].
Lexical Analysis for Automatic Indexing
Decide what counts as a word/token.
1. Digits → Usually ignored, but important in some cases.
Example: "B12 vitamin", "F-16 jet".
2. Hyphens → Should "state-of-the-art" be one token or split into many?
3. Punctuation → "[Link]", "OS/2" may need special treatment.
4. Case Sensitivity → Usually ignored (e.g., "Apple" = "apple"), but in
programming "Var" ≠ "var".
Keeping numbers → increases recall but decreases precision.
Breaking hyphens → helps matching but may lose specific meaning.
Lexical Analysis for Query Processing
Must match indexing analyzer (same token rules).
Extra work for queries:
o Handle Boolean operators (AND, OR, NOT).
o Handle grouping symbols (parentheses).
o Detect invalid characters.
Example query:
(AI AND healthcare) OR "disease prediction".
Tokens: (, AI, AND, healthcare, ), OR, disease, prediction.
Cost of Lexical Analysis
Very expensive step (checks every character).
In compilers, takes up to 50% of processing cost.
Must be made efficient → usually implemented as Finite State Machines (FSMs).
pg. 15 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
Implementing a Lexical Analyzer
Methods:
1. Using Lex tool (UNIX tool lex).
2. Ad-hoc method (not recommended, error-prone).
3. Finite State Machine (FSM) → preferred, efficient.
FSM Idea:
States represent token recognition progress.
Example:
o State 0 → start.
o See a letter → go to word state.
o See digits → go to number state.
o See "(" → return left parenthesis token.
Stoplists
Stoplist words = frequent words with little meaning.
Example stoplist: the, is, at, which, on, and.
Helps reduce: Index size, Noise in search results
Example:
Text: "The quick brown fox jumps over the lazy dog"
Tokens after stoplist: [quick, brown, fox, jumps, lazy, dog].
Example (Query Tokenization)
(AI AND Data) OR NOT cloud
left parenthesis
Summary
term: ai
Lexical analysis = first step in indexing & query processing.
and operator
Converts raw text → tokens.
term: data
Must handle digits, hyphens, punctuation, case.
right parenthesis
Query analyzers also handle Boolean operators.
or operator
Expensive → usually FSM-based.
not operator
term: cloud
Stoplists remove common words to improve efficiency.
end of string
pg. 16 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,
pg. 17 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,