Unit 3 questions 5 marks
Q1. Explain PAT Trees and their importance in Information Retrieval.
Answer:
Definition:
A PAT Tree (Position Affix Tree) is a compact data structure used for fast searching
and indexing of strings in large text databases.
It stores all suffixes (or substrings) of a given text in a tree format, enabling quick
retrieval of patterns.
Structure:
Each node in a PAT tree represents a substring (affix) and contains:
o Key: A substring or a part of the original text.
o Pointers: To child nodes (representing further extensions of the substring).
Purpose:
o Used to represent text dictionaries efficiently.
o Supports fast pattern matching and prefix searching.
o Commonly applied in information retrieval systems.
Advantages:
o Efficient searching compared to linear scans.
o Space-efficient due to shared prefixes.
o Supports both exact and partial match queries.
Diagram (conceptual):
Root
/ \
pat tree
/ \
pattern trees
Each branch represents different substrings or words from the text.
🌲 2️⃣ Building PAT Trees as PATRICIA Trees
Q2. How are PAT Trees implemented as PATRICIA Trees? Explain the
process.
Answer:
Definition:
A PATRICIA Tree (Practical Algorithm To Retrieve Information Coded In
Alphanumeric) is a compressed binary trie that reduces storage requirements of
standard tries.
PAT Trees can be built using PATRICIA tree logic to improve efficiency.
Steps to Build:
1. Start with a set of strings (words, terms, or suffixes).
Example: {“bat”, “ball”, “bark”, “bar”}
2. Construct a Trie by inserting all strings character by character.
3. Compress the Trie by merging nodes that have a single child.
This reduces unnecessary nodes and saves space.
4. Add bit positions or indices to mark branching points.
Each internal node contains a bit index that decides where to branch next.
Features of PATRICIA Trees:
Only branch at the first bit where strings differ.
Each leaf node contains the complete key (string).
Traversal depends on key comparison at specific bit positions.
Advantages:
Very space-efficient (fewer nodes than Trie).
Faster lookup and insertion operations.
Suitable for large text datasets.
Time Complexity:
Average: O(log n)
Worst: O(n) (for very similar strings)
Example Diagram (simplified):
Root
|
b(2)
/ \
ar at
/ \ \
bark bar ball
Numbers represent branching bit positions.
📦 3️⃣ PAT Representation as Arrays
Q3. Explain how PAT Trees can be represented using Arrays.
Answer:
Purpose:
Storing a tree structure in array form makes traversal and search operations faster and
simpler in implementation.
Array Representation Components:
A PAT Tree can be stored using arrays that hold:
1. Key Array (K): Contains the strings or pattern keys.
2. Left Child Array (L): Stores left child indices.
3. Right Child Array (R): Stores right child indices.
4. Bit Index Array (B): Indicates bit position used for branching.
Example Table:
Node Key Bit Left Right
1 Root 0 2 3
2 bar 2 - -
3 bat 3 - -
Advantages:
Easy to store and access nodes using array indices.
No pointer manipulation required.
Faster for static trees (when data doesn’t change often).
Limitations:
Not efficient for dynamic updates (insert/delete).
Requires fixed memory allocation.
📝 4️⃣ Stop Lists
Q4. What are Stop Lists? Why are they used in Information Retrieval?
Answer:
Definition:
A Stop List (or Stop Word List) is a set of common, frequently occurring words
that are excluded during indexing and searching because they carry little semantic
meaning.
Examples:
“the”, “is”, “an”, “of”, “on”, “for”, “with”, etc.
Purpose:
To reduce index size.
To improve retrieval efficiency by ignoring words that do not affect search results.
To focus on content-bearing terms (nouns, verbs, adjectives).
Advantages:
1. Decreases storage space for index files.
2. Speeds up searching and matching.
3. Reduces noise in retrieval results.
Disadvantages:
1. May remove useful words in some queries (e.g., “To be or not to be”).
2. Reduces recall for phrase-based searches.
Example:
Sentence:
“The boy is playing in the park.”
After Stop-word Removal:
“boy playing park”
Applications:
Used in search engines, IR systems, and text mining.
Often customized for domain-specific use (medical, legal, etc.).
✅ Summary Table
Topic Key Points Marks
PAT Trees Compact data structure for fast string retrieval 5
Building PAT Trees
Compression of tries using bit-based branching 5
(PATRICIA)
PAT Representation (Arrays) Uses arrays for storing nodes and indices 5
Common words excluded from indexing to save
Stop Lists 5
space