0% found this document useful (0 votes)

37 views17 pages

PAT Trees in Information Retrieval

Q: What is range searching in a PAT Tree and how can it be used to filter text segments?

Range searching in a PAT Tree involves identifying and collecting sistrings that fall within a specified lexicographic range. Starting from a defined range [L, R], the PAT Tree is traversed in lexicographic order, collecting all sistrings that meet the criteria. This method functions like a dictionary search and is used to filter text segments alphabetically, making it incredibly effective for organizing large text datasets and performing quick categorizations .

Q: Describe the process and importance of prefix searching in a PAT Tree.

Prefix searching in a PAT Tree involves finding all occurrences of substrings that start with a given prefix. The process starts from the root of the tree and involves comparing the query prefix with sistrings as traversed. Each step follows the tree's structure until all sistrings matching the prefix are grouped, and then outputs the text positions where the prefix occurs. This is crucial because it enables efficient retrieval of text segments based purely on starting character sequences, useful in applications like search engines where prompt and accurate retrieval of information is key .

Q: How does a PAT Tree optimize regular expression searching, and what are the limitations?

A PAT Tree optimizes regular expression searching by converting regex patterns into prefixes and wildcard segments, which are then used to navigate the tree to find matches for the fixed parts. Subsequent expansions accommodate the full regex pattern. Although powerful, this approach can be slower than exact matches due to the need for multiple tree traversals and extended computations for wildcard expansions, resulting in complexity increase by O(log n).

Q: Evaluate how a stoplist can enhance the efficiency of lexical analysis in text indexing using a PAT Tree.

Stoplists enhance lexical analysis by filtering out common and semantically redundant words from indexing, which in turn reduces the size of the index and noise in search results. By focusing on meaningful tokens, stoplists streamline the indexing process within a PAT Tree, improving both space efficiency and the speed of text searches. This efficiency is crucial in large-scale text databases where processing costs of lexical analysis can be substantial, often constituting up to 50% of indexing resource expenditure .

Q: What is a semi-infinite string (sistring) and how is it utilized in a PAT Tree?

A semi-infinite string (sistring) is a subsequence of a string that starts at a specific position and continues to the right, with additional null characters added if the end of the text is reached. They are uniquely identified by their starting position and play a fundamental role in PAT Trees, where each external node in the tree represents a sistring. Sistrings enable efficient lexicographical comparisons and substring searches, which are key operations in a PAT Tree .

Q: Explain how a PAT Tree improves text indexing efficiency, specifically compared to traditional data structures like suffix trees.

PAT Trees improve text indexing efficiency by representing a compressed binary trie structure that uses individual bits of keys for branching, thereby eliminating unnecessary nodes and achieving a compact representation. Unlike traditional suffix trees, which might require more memory, PAT Trees utilize internal nodes to store branching information using a bit position or skip count, resulting in an overall smaller size, approximately O(n). This compact nature makes them more suitable for handling large text databases by balancing space efficiency and search speed .

Q: What challenges exist in implementing PAT Trees for large datasets like the Oxford English Dictionary, and what strategies can overcome them?

Implementing PAT Trees for large datasets like the Oxford English Dictionary presents challenges mainly in terms of storage space and access time. Direct building requires excessive time (up to 3.4 years with traditional methods). Strategies to overcome these hurdles include building Patricia trees using sequential access, merging indices progressively, and reducing disk access through batching I/O requests. Efficient algorithms that split text into manageable chunks and leverage memory for in-memory operations can also drastically reduce build times to more feasible durations like a weekend .

Q: How does the delay reading paradigm improve disk access efficiency when building large text indices with PAT Trees?

The delayed reading paradigm improves disk access efficiency by batching I/O requests rather than executing them immediately, which minimizes costly random disk access. By maintaining lists of blocked and satisfied requests and processing them in an optimal order, disk head movements are reduced, enhancing access speed dramatically. This approach is particularly effective in environments where PAT Trees are used to build large indices, as it balances the trade-offs between time consumption and storage overhead .

Q: Discuss the role of indexing points in a PAT Tree and their impact on search flexibility and performance.

Indexing points in a PAT Tree determine the text positions that are indexed. By default, every position could be indexed, creating n external nodes, but in practical scenarios, indexing only word beginnings—about 20% of text positions—can be sufficient depending on application needs. A higher indexed position count increases flexibility but also index size, leading to higher storage demands and potentially slower searches. Conversely, fewer index points create a smaller, faster index but might limit search range and flexibility .

Q: In what ways can lexically analyzing text to produce tokens impact the efficiency of a query processing system using PAT Trees?

Lexical analysis of text to produce tokens significantly affects query processing efficiency by ensuring consistent and accurate matchings of terms between the indexed data and user queries. The process involves handling case sensitivity, punctuation, and Boolean operators consistently. Moreover, aligning the lexical rules for indexing and querying enhances precision and recall rates. Despite its high processing cost, estimated at 50% of the processing effort in compilers, efficient lexical analysis through FSMs (Finite State Machines) is crucial for optimizing the performance of PAT Trees in large-scale query systems .

The document provides lecture notes on Information Retrieval Systems, detailing course objectives and key topics such as data structures, algorithms, indexing techniques, and string searching algorithms. It introduces concepts like PAT Trees and sistrings for efficient text searching, along with various searching algorithms including prefix, proximity, and regex searching. The notes also discuss the advantages of new models over traditional text retrieval methods, emphasizing flexibility and efficiency in information retrieval.

Uploaded by

kishore.rce

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

37 views17 pages

PAT Trees in Information Retrieval

Uploaded by

kishore.rce

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

DEPARTMENT OF CSE (ARTIFICIAL INTELLIGENCE & MACHINE LEARNING)

R23 Regulation V Semester A. Y.: 2025 -26

Lecture Notes
Information Retrieval System(23ML5T01)

Prepared By
Mr. P V Kishore Kumar [Link]., (Ph.D.)
Associate Professor

Department of CSE (AI & ML)

Ramachandra College of Engineering(A), Eluru.

pg. 1 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Course Objectives:
1 To introduce the fundamental concepts and domain of Information Retrieval (IR) systems
and how they differ from other information systems.
2 To explain the data structures and algorithms fundamental to building and operating
efficient IR systems.
3 To analyze the working of indexing techniques such as inverted files and signature files
and their enhancements.
4 To explore advanced indexing structures like PAT Trees and the role of lexical analysis,
stop lists, and stemming in IR systems.
5 To study various string searching algorithms and their applicability to IR systems for
efficient retrieval.

Unit – I Introduction to Information storage and retrieval systems Domain Analysis of IR

systems, IR and other types of Information Systems, IR System Evaluation Introduction to
Data structures and algorithms related to Information Retrieval: Basic Concepts, Data
structures, Algorithms.
Unit – II Inverted Files and Signature Files Introduction, Structures used in Inverted Files,
building an Inverted file using a sorted array, Modifications to the Basic Techniques. Signature
Files: Concepts of Signature files, Compression, Vertical Partitioning, Horizontal Partitioning.
Unit – III New Indices for Text, Lexical Analysis and Stoplists PAT Trees and PAT Arrays:
Introduction, PAT Tree structure, Algorithms on the PAT Trees, Building PAT Trees as
PATRICA Trees, PAT representation as Arrays. Stoplists.
Unit – IV Stemming Algorithms and Thesaurus Construction Types of Stemming
algorithms, Experimental Evaluations of Stemming, stemming to Compress Inverted Files.
Thesaurus Construction: Features of Thesauri, Thesaurus Construction, Thesaurus construction
from Texts, Merging existing Thesauri.
Unit – V String Searching Algorithms Introduction, Preliminaries, The Naive Algorithm, The
Knutt-Morris-Pratt Algorithm, The Boyer Moore Algorithm, The Shift-Or Algorithm, The
Karp-Rabin Algorithm.

Text Books:
1 Modern Information Retrieval, Ricardo Baeza-Yates, Neto, PEA,2007.
2 Information Storage and Retrieval Systems: Theory and Implementation, Kowalski, Gerald,
Mark Academic Press, 2000.

pg. 2 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Unit – III New Indices for Text, Lexical Analysis and Stoplists PAT Trees and PAT Arrays:
Introduction, PAT Tree structure, Algorithms on the PAT Trees, Building PAT Trees as
PATRICA Trees, PAT representation as Arrays. Stoplists.
INTRODUCTION
 Text searching methods can be classified into:
1. Lexicographical indices – Indices that are sorted.
2. Clustering techniques – Grouping related items together.
3. Hashing-based indices – Using hash functions to access items quickly.
 we focus on two lexicographical indices:
PAT Trees
PAT Arrays
 Traditional Model of Text Retrieval
Used mostly in library applications.
 Each document is assigned a list of keywords (attributes).
 Sometimes, relevance weights are added to keywords.
Problems in the Traditional Model:
1. Fixed structure assumption – Works with documents and words only, not flexible for
all applications.
2. Indexing required – Keywords must be extracted, which is difficult and error-prone
(manual or automatic).
3. Query limitation – Queries are restricted to keywords.
Sometimes, all words (except common ones called stopwords) are indexed.
A New Model: Text as One Long String
Instead of treating text as documents + keywords, we see the text as one long string.
 Each position in the text corresponds to a semi-infinite string (sistring):
 A sistring starts at a position and continues till the end of the text.
 Every sistring is unique (since it starts at a different position).
Advantages of this Model:
1. No structure needed – Works even if text has no document structure.
2. No keywords required – Queries are based on substrings (prefixes of sistrings).
3. Wider query domain – Any substring can be searched.

pg. 3 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

4. Supports multiple searching structures – Flexible to use different indexing
techniques.
Limitations In traditional DBs:
o Each document = record

o Each keyword = secondary key

Problems:
o Number of keywords is variable → database techniques not useful.

o Queries usually check equality or ranges.

o Do not support approximate text searching (e.g., partial match, pattern

matching).

 PAT Trees – Original structure for efficient text searching.

 PAT Arrays – Efficient implementation of PAT Trees.
 Provide a more powerful query language than keyword-based or Boolean searching.
 Independently discovered by:
o Gonnet (1987) – used in PATTM, a fast text searching system for the Oxford
English Dictionary (OED).
o Manber and Myers (1990) – focused on searching in large genetic databases.

Traditional model = documents + keywords → limited, error-prone, keyword-based.

New model = one long string + sistrings → flexible, substring-based searching.
PAT Trees & PAT Arrays = efficient lexicographical indices, support advanced queries beyond
Boolean keyword search.
PAT TREE STRUCTURE
PAT Tree = A data structure for very efficient text searching after preprocessing.
First described by Gonnet (1983) in Unstructured Data Bases.
Implemented in 1985 and used for the Oxford English Dictionary (OED) computerization.
The implementation became known as PATTM, a famous software package for fast string
searching. Semi-Infinite Strings (Sistrings)
Definition
 Consider the text as a long array of characters (indexed sequentially from 1).

pg. 4 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

 A semi-infinite string (sistring) is a subsequence starting at a given position and
continuing to the right.
 If the end of text is reached, special null characters are added (different from any text
characters).
Sistrings are uniquely identified by their starting position.
Example -
Once upon a time, in a far away land . . .
Sistrings:
Sistring 1 → Once upon a time . . .
Sistring 2 → nce upon a time . . .
Sistring 8 → on a time, in a . . .
Sistring 11 → a time, in a far . . .
Sistring 22 → a far away land . . .
The main operation on sistrings = lexicographical comparison (alphabetical order).
Two sistrings are never equal unless they start at the same position.
Example Comparison: 22 < 11 < 2 < 8 < 1
(Here “a far away …” is the smallest sistring and “upon a time …” is the largest)
PAT Tree
A PAT Tree is a Patricia Tree built over all possible sistrings of a text.
 A Patricia Tree is a compressed binary trie:
Uses individual bits of keys for branching.
0 bit → left subtree, 1 bit → right subtree.
Internal nodes store which bit to inspect (absolute position or skip count).
Eliminates unnecessary nodes → compact representation.
Key Features
 External nodes = sistrings (represented by integer displacements).
 Internal nodes = branching info (bit positions).
 For text of size n:
n external nodes
n – 1 internal nodes

pg. 5 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Overall size = O(n) Very similar to compact suffix trees.
Example of PAT Tree
Example text (in binary): "01100100010111..."
Build PAT Tree after inserting first 8 sistrings.
Internal nodes: circles → contain bit positions.
External nodes: squares → point to sistrings.
Example Query
Searching for sistring 00101:
1. Inspect bit 1 → 0 → go left.
2. Inspect bit 2 → 0 → go left.
3. Inspect bit 3 → 1 → go right.
4. Inspect bit 5 → 1 → go right.
5. Compare with stored sistring to confirm.
Skipped bits require a final verification step
Fig:PAT tree when the sistrings 1 through 8
have been inserted
Fig shows an example of a PAT tree over a sequence of bits (normally it would be over a
sequence of characters), just for the purpose of making the example easier to understand. In
this example, we show the Patricia tree for the text "01100100010111..." after the first 8
sistrings have been inserted. External nodes are indicated by squares, and they contain a
reference to a sistring, and internal nodes are indicated by a circle and contain a displacement.
In this case we have used, in each internal node, the total displacement of the bit to be
inspected, rather than the skip value.
Indexing Points
 Default: Every position in the text is indexed → n external nodes.
 But: Not always necessary.
 Example: If only word and phrase searches are needed → index only word
beginnings (~20% of text positions).
Decision depends on application requirements:
 More sistrings indexed = bigger index, more flexibility.
 Fewer sistrings indexed = smaller index, faster but limited search.
Summary:

pg. 6 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

 Sistrings = unique substrings starting from each text position.
 PAT Tree = compact Patricia Tree built on sistrings.
 Efficient for lexicographical comparisons and substring searching.
 Useful for large text databases (e.g., OED, genetic databases).
 Index size can be tuned depending on application (all positions vs. word starts).

ALGORITHMS ON THE PAT TREE

some of the algorithms for text searching when we have a PAT tree of our text.
Prefix Searching → All substrings starting with query.
Proximity Searching → Words close to each other.
Range Searching → Words between lower & upper bounds.
Longest Repetition → Deepest repeated substring.
Most Frequent → Substring with largest subtree count.
Regex Searching → Supports wildcards & complex queries.
1. Prefix Searching
Purpose:
Find all occurrences of a given prefix (substring starting characters) in the text.
Steps:
1. Take the query prefix.
2. Start from the root of the PAT tree.
3. Compare the prefix with sistrings (semi-infinite strings).
4. Follow the tree until all matches are grouped.
5. Output the text positions where the prefix occurs.
Example:
Text = "banana"
Sistrings = "banana", "anana", "nana", "ana", "na", "a"
Query = "ana" Result → Matches at positions 1 and 3.
Root
├── "a" → "ana", "anana"
├── "b" → "banana"
└── "n" → "na", "nana"
Key Points: Fast searching for prefixes. No need for keywords.
2. Proximity Searching

pg. 7 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Purpose: Find words/strings that occur near each other in the text.
Steps:
1. Perform prefix searches for both words.
2. Store the positions of each occurrence.
3. Compare positions → select those within a distance k.
Example:
Text = "data mining in big data"
Query = "data" and "mining", distance ≤ 2 words
Result → "data" (pos 1), "mining" (pos 2) → match.
Diagram :
Two lists of positions merged with distance check.
Key Points: Useful in search engines ("AI" near "research"). Requires position comparison.
3. Range Searching
Purpose: Find sistrings that fall within a given lexicographic range.
Steps:
1. Define a range [L, R].
2. Traverse PAT tree in lexicographic order.
3. Collect all sistrings between L and R.
Example:
Text = "cat, dog, cow"
Range = [“c”, “d”]
Result → "cat", "cow", "dog"
Diagram :
Lexicographic ordering of sistrings like a dictionary search.
Key Points: Works like dictionary word [Link] in alphabetical filtering.
4. Longest Repetition
Purpose: Find the longest repeated substring in the text.
Steps:
1. Build PAT tree with sistrings.
2. Check common prefixes of adjacent sistrings.
3. Keep track of longest match.

pg. 8 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Example:
Text = "banana"
Sistrings = "a", "ana", "anana", "banana", "na", "nana"
Longest repeated substring = "ana" (length 3).
Diagram Idea:
Common prefix branches in PAT tree highlight repetitions.
Key Points: Useful in DNA analysis, data compression.
5. Most Frequent String
Purpose: Find which substring occurs the maximum number of times.
Steps:
1. Build PAT tree.
2. Count occurrences of each substring in sistrings.
3. Track the maximum count.
Example:
Text = "banana"
Frequent substring = "a" (appears 3 times).
Diagram Idea:
Subtree counts showing "a" = 3, "ana" = 2, "na" = 2.
Key Points: Similar to word frequency analysis. Helps in keyword extraction.
6. Regular Expression Searching
Purpose: Support flexible queries with patterns (like a.*na).
Steps:
1. Convert regex into prefixes and wildcards.
2. Use PAT tree to find matches for fixed parts.
3. Expand results to check complete regex.
Example:
Text = "banana"
Query = b.*a
Matches → "ba", "bana", "banana".
Diagram Idea:
Regex → Tree traversal + wildcard expansion.
Key Points: Powerful search beyond exact strings. Used in advanced search engines & editors.

Algorithm Purpose Example (Text = Result

pg. 9 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

"banana")

Prefix Searching Find substrings starting "ana" Pos 1, 3

with q

Proximity Words within distance "data" near "mining" Match

Searching

Range Searching Strings within [L,R] Range [c,d] "cat", "cow",

"dog"

Longest Longest repeated "banana" "ana"

Repetition substring

Most Frequent Substring with max "banana" "a" (3 times)

String occurrences

Regex Searching Pattern-based search b.*a "ba", "bana",

"banana"

BUILDING PAT TREES AS PATRICIA TREES

PAT Trees (Practical Algorithm to Retrieve Information Coded in Alphanumeric) are efficient
data structures for text indexing.
They are often implemented using Patricia Trees (Practical Algorithm to Retrieve Information
Coded in Alphanumeric).
The challenge: when dealing with very large text databases, efficiency in space and time
becomes critical.
Storage Requirements
 Internal nodes: about 3–4 words in size.
 External nodes: about 1 word in size.
 Total storage: ~ 4n to 5n words (≈ 18n characters for n characters of text).
 This is too large and impractical.
Problems in Implementation
1. Large index size → storage overhead.
2. Slow access time → since external data is stored on disk in large blocks.
Solutions to Reduce Space & Access Time
a) Bucketing of External Nodes

pg. 10 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

 Definition: Group multiple external nodes into a bucket instead of keeping them
separately.
 Benefit: Saves many internal nodes (up to b − 1).
o Pros → Reduces space significantly.

o Cons → Slightly increases time (because searching inside a bucket requires

checking multiple entries).
 On average:
o Keys per bucket ≈ b/2.

o Internal nodes reduce to about (n / (b log₂n)) instead of (n − 1).

b) Supernodes (Disk Page Mapping)

 Idea: Store as much of the tree as possible inside one disk page (supernode).
 Working:
o Each disk page has one entry point.

o Contains internal nodes + pointers.

o Pointers may lead to:

 Another node inside the same page, OR

 Another disk page.
 Advantages:
o Smaller pointers (≈ half a word).

o High storage efficiency (≈ 80% disk page utilization).

o Fewer disk accesses → faster search.

Disk page ≈ 1000 nodes (internal + external).

Average: 10 tree levels per page.
Root page always in memory.
Typical search:
o Needs 2–3 disk accesses (instead of 30–40).

o Plus one more access for verification.

Bucketing saves space by grouping nodes.

Supernodes reduce time by minimizing disk accesses.

pg. 11 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Space vs Time → balance depends on application needs.
Efficient PAT Tree implementation is crucial for large-scale text databases.
Bucketing like putting all small files into a single folder to save cabinet space.
Supernodes like storing an entire book chapter on one page, so you don’t have to flip through
many pages to read it.
PAT TREES REPRESENTED AS ARRAYS
In PAT tree implementation, buckets are used to group external nodes.
Problem: If bucket size is too large → searching becomes slow.
Solution: Store external nodes in sorted order inside the bucket → allows binary search
instead of sequential search.
Cost reduces from b comparisons → 2 log b − 1 comparisons.
If buckets grow large, even up to n elements, the whole index can be represented as a single
array of external nodes sorted lexicographically.
This leads to the concept of Suffix Arrays.
Key advantage: Same information as Patricia tree but with less storage cost.
Access time increases slightly (factor of log₂ n).
Searching in PAT Arrays
 Prefix Search & Range Search:
o Done using indirect binary search over the array.

o Complexity:

 At most 2 log₂ n − 1 comparisons.

 At most 4 log₂ n disk accesses.
 Other operations:
o "Most frequent" search → improved but more technical.

o "Longest repetition" → needs extra data structures.

o Regular expression search → slower by O(log n).

Searching is O(log² n) with 1 word per index (pointer) → very space-efficient.

Building PAT Trees as Arrays

pg. 12 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

a) The OED (Oxford English Dictionary) Challenge
 OED text size: ~600 MB (~119 million entries).
 Building Patricia tree directly:
o Would take 3.4 years (!) with available computers (30 random disk
accesses/sec).
 Even optimized algorithms → still 45+ days.
 Hence → need better, sequential access-based algorithms.
b) Building in Memory
 If the text + PAT array fit in main memory, build efficiently.
 Equivalent to string sorting.
 Quicksort or External Quicksort is used.
 Best for small files.
c) Merging Small vs Large Arrays
 If a small file + its PAT array fit in memory, merge with a large file:
o Keep small file + array + counter array in memory.

o Sequentially read large file, locate each string in small array, update counters.

o Then merge sequentially.

 Efficient: O(n₂ log n₁) comparisons, O(n₂) characters read sequentially.

 Works very well in practice.
d) General Index Building
 Split text into chunks.
 Build PAT array for each chunk in memory.
 Merge indices step by step using counters.
 Time: O(n² / m), where n = index size, m = available memory.
 Can be parallelized (different merges run independently).
 Can build OED-sized index in a weekend.
Delayed Reading Paradigm
Problem: Random disk access = very costly.
Solution: Batch I/O requests instead of executing them immediately.

pg. 13 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

 Maintain 3 lists:
1. Index building program
2. Blocked requests
3. Satisfied requests
 Process requests in optimal order to reduce disk movements.
 This greatly improves efficiency.
Merging Large Files
 Sometimes need to merge two or more large indices.
 Naive method: n log m comparisons, but → requires too many random accesses.
 Improvement:
o Read large blocks of pointers.

o Sort them by location in text.

o Read keys sequentially instead of random access.

 Further optimization:
o Store only prefixes (stemming) of adjacent keys to save space.

 Balances memory, disk space, and performance.

PAT Trees as Arrays = Suffix Arrays (compact, ordered).
Advantages:
 Space-efficient (1 word per entry).
 Easier prefix/range searching.
 Efficient merging strategies for large text databases.
Challenges:
 Slightly slower access (O(log² n)).
 Needs smart building algorithms for huge datasets (like OED).

Patricia Tree like a library with shelves and sections (lots of pointers).
A PAT Array is like keeping all books sorted in a single long list → saves space but
searching may take slightly more steps.
LEXICAL ANALYSIS AND STOPLISTS

pg. 14 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Lexical Analysis = Breaking text into tokens (words with meaning).
Tokens = groups of characters with significance.
Example: sentence "AI helps in data mining" → tokens = [AI, helps, in, data, mining].
Automatic Indexing → system extracts index terms from documents.
Query Processing → system extracts terms from user query for comparison.
Stoplist = list of common words (e.g., "is", "the", "and") that are ignored.
Example: "the state of the art" → stoplist removes "the, of", tokens = [state, art].
Lexical Analysis for Automatic Indexing
Decide what counts as a word/token.
1. Digits → Usually ignored, but important in some cases.
Example: "B12 vitamin", "F-16 jet".
2. Hyphens → Should "state-of-the-art" be one token or split into many?
3. Punctuation → "[Link]", "OS/2" may need special treatment.
4. Case Sensitivity → Usually ignored (e.g., "Apple" = "apple"), but in
programming "Var" ≠ "var".
Keeping numbers → increases recall but decreases precision.
Breaking hyphens → helps matching but may lose specific meaning.
Lexical Analysis for Query Processing
Must match indexing analyzer (same token rules).
Extra work for queries:
o Handle Boolean operators (AND, OR, NOT).

o Handle grouping symbols (parentheses).

o Detect invalid characters.

Example query:
(AI AND healthcare) OR "disease prediction".
Tokens: (, AI, AND, healthcare, ), OR, disease, prediction.
Cost of Lexical Analysis
 Very expensive step (checks every character).
 In compilers, takes up to 50% of processing cost.
 Must be made efficient → usually implemented as Finite State Machines (FSMs).

pg. 15 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Implementing a Lexical Analyzer
Methods:
1. Using Lex tool (UNIX tool lex).
2. Ad-hoc method (not recommended, error-prone).
3. Finite State Machine (FSM) → preferred, efficient.
FSM Idea:
 States represent token recognition progress.
 Example:
o State 0 → start.

o See a letter → go to word state.

o See digits → go to number state.

o See "(" → return left parenthesis token.

Stoplists
 Stoplist words = frequent words with little meaning.
 Example stoplist: the, is, at, which, on, and.
Helps reduce: Index size, Noise in search results
Example:
Text: "The quick brown fox jumps over the lazy dog"
Tokens after stoplist: [quick, brown, fox, jumps, lazy, dog].
Example (Query Tokenization)
(AI AND Data) OR NOT cloud

left parenthesis
Summary
term: ai
Lexical analysis = first step in indexing & query processing.
and operator
Converts raw text → tokens.
term: data
Must handle digits, hyphens, punctuation, case.
right parenthesis
Query analyzers also handle Boolean operators.
or operator
Expensive → usually FSM-based.
not operator

term: cloud
Stoplists remove common words to improve efficiency.

end of string

pg. 16 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

pg. 17 - IRS - 23MLT01 – P V K K – Dept. of CSE (AIML) 8008066724,

Common questions