New Indices for Text in IR Systems
New Indices for Text in IR Systems
R – 23
III [Link]. I Semester
Learning Material
On
UNIT – III
“New Indices for Text,Lexical Analysis and
Stoplists”
Dept of CSE(AI&ML)
SRGEC, Gudlavalleru
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
SYLLABUS
New Indices for Text, Lexical Analysis and Stop lists:
PAT Trees and PAT Arrays: Introduction, PAT Tree structure, Algorithms on the PAT Trees,
Building PAT Trees as PATRICA Trees, PAT representation as Arrays. Stop lists.
LEARNING MATERIAL
3.1 INTRODUCTION
Text searching methods may be classified as lexicographical indices (indices that are
sorted),clustering techniques, and indices based on hashing. In this chapter we discuss two new
lexicographical indices for text, called PAT trees and PAT arrays.
The traditional model of text used in information retrieval is that of a set of documents. Each
document is assigned a list of keywords (attributes), with optional relevance weights associated to
each keyword. This model is oriented to library applications, which it serves quite well. For more
general applications, it has some problems, namely:
A basic structure is assumed (documents and words). This may be reasonable for many
applications, but not for others.
Keywords must be extracted from the text (this is called "indexing"). This task is not
trivial and error prone, whether it is done by a person, or automatically by a computer.
Queries are restricted to keywords.
For some indices, instead of indexing a set of keywords, all words except for those deemed to be
too common (called stopwords) are indexed.
We prefer a different model. We see the text as one long string. Each position in the text
corresponds to a semi-infinite string (sistring), the string that starts at that position and extends
arbitrarily far to the right, or to the end of the text. It is not difficult to see that any two strings not
at the same position are different.
The main advantages of this model are:
No structure of the text is needed, although if there is one, it can be used.
No keywords are used. The queries are based on prefixes of sistrings, that is, on
any substring of the text.
This model is simpler and does not restrict the query domain. Furthermore, almost any searching
structure can be used to support this view of text.
In the traditional text model, each document is considered a database record, and each keyword a
value or a secondary key. Because the number of keywords is variable, common database
techniques are not useful in this context. Typical data-base queries are on equality or on ranges.
They seldom consider "approximate text searching."
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 2
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
conjunction with the computerization of the Oxford English Dictionary. The name of the
implementation, the PATTM system, has become well known in its own right, as a software
package for very fast string searching.
3.2.1 Semi-infinite Strings
In what follows, we will use a very simple model of text. Our text, or text data-base, will consist
of a single (possibly very long) array of characters, numbered sequentially from one onward.
Whether the text is already presented as such, or whether it can be viewed as such is not relevant.
To apply our algorithms it is sufficient to be able to view the entire text as an array of characters.
A semi-infinite string is a subsequence of characters from this array, taken from a given starting
point but going on as necessary to the right. In case the semi-infinite string (sistring) is used
beyond the end of the actual text, special null characters will be considered to be added at its end,
these characters being different than any other in the text. The name semi-infinite is taken from
the analogy with geometry where we have semi-infinite lines, lines with one origin, but infinite in
one direction. Sistrings are uniquely identified by the position where they start, and for a given,
fixed text, this is simply given by an integer.
Example:
Text :“ Once upon a time, in a far away land . . . “
sistring 1 :“ Once upon a time . . .”
Sistrings can be defined formally as an abstract data type and as such present a very useful and
important model of text. For the purpose of this section, the most important operation on sistrings
is the lexicographical comparison of sistrings and will be the only one defined. This comparison is
the one resulting from comparing two sistrings' contents (not their positions). Note that unless we
are comparing a sistring to itself, the comparison of two sistrings cannot yield equal. (If the
sistrings are not the same, sooner or later, by inspecting enough characters, we will have to find a
character where they differ, even if we have to start comparing the fictitious null characters at the
end of the text).
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 3
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
right subtree. Hence Patricia trees are binary digital trees. In addition, Patricia trees have in each
internal node an indication of which bit of the query is to be used for branching. This may be
given by an absolute bit position, or by a count of the number of bits to skip. This allows internal
nodes with single descendants to be eliminated, and thus all internal nodes of the tree produce a
useful branching, that is, both subtrees are non-null.
Patricia trees store key values at external nodes; the internal nodes have no key information, just
the skip counter and the pointers to the subtrees. The external nodes in a PAT tree are sistrings,
that is,integer displacements. For a text of size n, there are n external nodes in the PAT tree and n
- 1 internal nodes. This makes the tree O(n) in size, with a relatively small asymptotic constant.
Later we will want to store some additional information (the size of the subtree and which is the
taller subtree) with each internal node, but this information will always be of a constant size.
Figure 3.1 shows an example of a PAT tree over a sequence of bits (normally it would be over a
sequence of characters), just for the purpose of making the example easier to understand. In this
example, we show the Patricia tree for the text "01100100010111..." after the first 8 sistrings have
been inserted. External nodes are indicated by squares, and they contain a reference to a sistring,
and internal nodes are indicated by a circle and contain a displacement. In this case we have used,
in each internal node, the total displacement of the bit to be inspected, rather than the skip value.
Notice that to reach the external node for the query 00101 we first inspect bit 1 (it is a zero, we
goleft) then bit 2 (it is zero, we go left), then bit 3 (it is a one, we go right), and then bit 5 (it is a
one, we go right). Because we may skip the inspection of some bits (in this case bit 4), once we
reach our desired node we have to make one final comparison with one of the sistrings stored in
an external node of the current subtree, to ensure that all the skipped bits coincide. If they do not
coincide, then the key is not in the tree.
Figure 3.1: PAT tree when the sistrings 1 through 8 have been inserted
3.2.3 Indexing Points
So far we have assumed that every position in the text is indexed, that is, the Patricia tree has n
external nodes, one for each position in the text. For some types of search this is desirable, but in
some other cases not all points are necessary or even desirable to index. For example, if we are
interested in word and phrase searches, then only those sistrings that are at the beginning of words
(about 20% of the total for common English text) are necessary. The decision of how many
sistrings to include in the tree is application dependent, and will be a trade-off between size of the
index and search requirements.
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 4
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 5
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
will be in the answer even in the worst case (the worst case is 2 height - 1) and hence only O(log
n) time is necessary in total. As before, we can return the answer and the size of the answer in
time O(log n) independent of the actual size of the answer.
Searching for "most common" word is slightly more difficult but uses a similar algorithm. A word
could be defined as any sequence of characters delimited by a blank space. This type of search
will also require a traversal, but in this case the traversal is only done in a subtree (the subtree of
all sistrings starting with a space) and does not have a constant depth; it traverses the tree to the
places where each second blank appears.
We may also apply this algorithm within any arbitrary subtree. This is equivalent to finding the
most frequently occurring trigram, word, and the like, that follows some given prefix.
In all cases, finding the most frequent string with a certain property requires a subtree selection
and then a tree traversal which is at most O(n/k) but typically much smaller. Here, k is the average
size of each group of strings of the given property. Techniques similar to alpha-beta pruning can
be used to improve this search.
3.3.6 Regular Expression Searching
The main steps of the algorithm due by Baeza-Yates and Gonnet (1989) are:
a. Convert the regular expression passed as a query into a minimized deterministic finite
automation (DFA), which may take exponential space/time with respect to the query size
but is independent of the size of the text.
b. Next eliminate outgoing transitions from final states (see justification in step (e). This
may induce further minimization.
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 6
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
c. Convert character DFAs into binary DFAs using any suitable binary encoding of the input
alphabet;each state will then have at most two outgoing transitions, one labeled 0 and one
labeled 1.
d. Simulate the binary DFA on the binary digital trie from all sistrings of text using the same
binary encoding as in step b. That is, associate the root of the tree with the initial state,
and, for any internal node associated with state i, associate its left descendant with state j if
i j for a bit 0, and associate its right descendant with state k if i k for a 1 (see Figure 5.3).
e. For every node of the index associated with a final state, accept the whole subtree and halt
the search in that subtree. (For this reason, we do not need outgoing transitions in final
states).
f. On reaching an external node, run the remainder of the automaton on the single string
determined by this external node.
The implementation of PAT trees using conventional Patricia trees is the obvious one, except for
some implementation details which cannot be overlooked as they would increase the size of the
index or its accessing time quite dramatically. Since this type of index is typically built over very
large text databases, and the size of the index tree is linear in the size of the text, an economic
implementation is mandatory.
It is easy to see that the internal nodes will be between 3 and 4 words in size. Each external node
could be one word and consequently we are taking between 4n and 5n words for the index, or
about 18n chars tor indexing n characters, most likely unacceptable from the practical point of
view.
Second, the organization for the tree should be such that we can gain from the reading of large
records external files will use physical records which are certainly larger than one internal node).
The main ideas to solve (or alleviate) both problems are
a. bucketing of external nodes, and
b. mapping the tree onto the disk using supernodes.
Collecting more than one external node is called bucketing and is an old idea in data processing.
A bucket replaces any subtree with size less than a certain constant (say b) and hence saves up to
b – l internal nodes. The external nodes inside a bucket do not have any structure associated with
them, and any search that has to be done in the bucket has to be done on all the members of the
bucket. This increases the number of comparisons for each search, in the worst case by b. On the
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 7
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
other hand, and assuming a random distribution of keys, buckets save a significant number of
internal nodes. In general, it is not possible to have all buckets full, and on average the number of
keys per bucket is b In 2. Hence, instead of n - 1 internal nodes, we have about (n/bln2) internal
nodes for random strings.
This means that the overhead of the internal nodes, which are the largest part of the index, can be
cut down by a factor of b1n2. We have then a very simple trade-off between time (a factor of b)
and space (a factor of b1n2).
Organizing the tree in super-nodes has advantages from the point of view of the number of
accesses as well as in space. The main idea is simple: we allocate as much as possible of the tree
in a disk page as long as we preserve a unique entry point for every page. De facto, every disk
page has a single entry point, contains as much of the tree as possible, and terminates either in
external nodes or in pointers to other disk pages (notice that we need to access disk pages only, as
each disk page has a single entry point). The pointers in internal nodes will address either a disk
page or another node inside the same page, and consequently can be substantially smaller
(typically about half a word is enough). This further reduces the storage cost of internal nodes.
Unfortunately, not all disk pages will be 100 percent full. A bottom-up greedy construction
guarantees at least 50 percent occupation. Actual experiments indicate an occupation close to 80
percent.
With these constraints, disk pages will contain on the order of 1,000 internal/external nodes. This
means that on the average, each disk page will contain about 10 steps of a root-to-leaf path, or in
other words that the total number of accesses is a tenth of the height of the tree. Since it is very
easy to keep the root page of the tree in memory, this means that a typical prefix search can be
accomplished with 2-3 disk accesses to read the index (about 30 to 40 tree levels) and one
additional final access to verify the skipped bits (buckets may require additional reading of
strings). This implementation is the most efficient in terms of disk accesses for this type of search.
However, if the external nodes in the bucket are kept in the same relative order as they would be
in the tree, then we do not need to do a sequential search, we could do an indirect binary search
(i.e.,compare the sistrings referred to by the external node) to find the nodes that satisfy the
[Link], the cost of searching a bucket becomes 2 log b - 1 instead of b.
Although this is not significant for small buckets, it is a crucial observation that allows us to
develop another implementation of PAT trees. This is simply to let the bucket grow well beyond
normal bucket sizes, including the option of letting it be equal to n, that is, the whole index
degenerates into a single array of external nodes ordered lexicographically by sistrings. With
some additions, this idea was independently discovered by Manber and Myers (1990), who called
the structures suffix arrays.
There is a straightforward argument that shows that these arrays contain most of the information
we had in the Patricia tree at the cost of a factor of log2 n. The argument simply says that for any
interval in the array which contains all the external nodes that would be in a subtree, in log2 n
comparisons in the worst case we can divide the interval according to the next bit which is
different. The sizes of the subtrees are trivially obtained from the limits of any portion of the
array, so the only information that is missing is the longest-repetition bit, which is not possible to
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 8
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
represent without an additional structure. Any operation on a Patricia tree can be simulated in
O(log n) accesses. Figure 3.4 shows this structure.
It turns out that it is not neccessary to simulate the Patricia tree for prefix and range searching, and
we obtain an algorithm which is O(log n) instead of O(log2 n) for these operations. Actually,
prefix searching and range searching become more uniform. Both can be implemented by doing
an indirect binary search over the array with the results of the comparisons being less than, equal
(or included in the case of range searching), and greater than. In this way, the searching takes at
most 2 log2 n – 1 comparisons and 4log2 n disk accesses (in the worst case).
The "most frequent" searching can also be improved, but its discussion becomes too technical and
falls outside the scope of this section. The same can be said about "longest repetition" which
requires additional supporting data structures. For regular expression searching, the time increases
by a O(logn) factor.
Quicksort is an appropriate algorithm for this building phase since it has an almost sequential
pattern of access over the sorted file. For maximal results we can put all of the index array on
external storage and apply external quicksort (see Gonnet and Baeza-Yates, section 4.4.5 [1984])
indirectly over the text. In this case it is possible to build an index for any text which together with
the program can fit in main memory. With today's memory sizes, this is not a case to ignore. This
is the algorithm of choice for small files and also as a building block for other algorithms.
A second case that can be solved efficiently is the case of merging two indices (to produce a
single one) when the text plus twice the index of one of them fits in main memory. This algorithm
is not trivial and deserves a short explanation. The text of the small file together with a PAT array
for the small file (of size n1) plus an integer array of size n1 + 1 are kept in main memory. The
integer array is used to count how many sistrings of the big file fall between each pair of index
points in the small file (see Figure 5.5). To do this counting, the large file is read sequentially and
each sistring is searched in the PAT array of the small file until it is located between a pair of
points in the index. The corresponding counter is incremented. This step will require O(n2 log n1)
comparisons and O(n2) characters to be read sequentially. Once the counting is finished, the
merging takes place by reading the PAT array of the large file and inserting the PAT array of the
small file guided by the counts (see Figure 5.6). This will require a sequential reading of n1+ n2
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 9
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
words. In total, this algorithm performs a linear amount of sequential input and output and O(n2
log n1) internal work, and its behavior is not only acceptable but exceptionally good.
Given these simple and efficient building blocks we can design a general index building
[Link] we split the text file into pieces, the first piece being as large as possible to build
an index in main memory. The remaining pieces are as large as possible to allow merging via the
previous algorithm (small against large). Then we build indices for all these parts and merge each
part.
An improvement may be made by noticing that index points at the front of the text will be merged
many times into new points being added at the end. We take advantage of this by constructing
partial indices on blocks of text one half the size of memory. These indices may be merged with
each other,the entire merge taking place in memory. The merged index is not created at this point,
since it would fill memory and could not be merged with any further index. As before, a vector of
counters is kept,indicating how many entries of the first index fall between each pair of adjacent
entries in the second.
When the nth block of text is being indexed, the n - 1 previous indices are merged with it. The
counters are accumulated with each merge. When all of the merges have been done, the counters
are written to a file. When all of the blocks have been indexed and merged, the files of counters
are used as instructions to merge all the partial indices.
The number of parts is O(n / m) where m is the total amount of available memory, and overall the
algorithm requires O(n2 / m) sequential access and O(n2 log n / m) time. Given the sizes of
present day memories and the differences of accessing time for random versus sequential access,
the above quadratic algorithm would beat a linear algorithm even for databases like the Oxford
English Dictionary!
An interesting property of the last version of the algorithm is that each of the merges of partial
indices is independent. Therefore, all of the O((n / m)2) merges may be run in parallel on separate
processors. After all merges are complete, the counters for each partial index must be accumulated
and the final index merge may be done.
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 10
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
Another practical advantage of this algorithm is that the final index merge is done without
reference to the text. Thus, at the time that the final merge is being done, the text may be removed
from the system and reloaded after removal of the partial indices. In situations where the text, the
final index, and the sum of partial indices may each be one or more gigabytes, this is an important
consideration.
3.6 STOPLISTS
It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of
the most frequently occurring words in English (like "the," "of," "and," "to," etc.) are worthless as
index terms. A search using one of these terms is likely to retrieve almost every item in a database
regardless of its relevance, so their discrimination value is low.
Eliminating such words from consideration early in automatic indexing speeds processing, saves
huge amounts of space in indexes, and does not damage retrieval effectiveness. A list of words
filtered out during automatic indexing because they make poor index terms is called a stoplist or a
negative dictionary.
One way to improve information retrieval system performance, then, is to eliminate stopwords
during automatic indexing. As with lexical analysis, however, it is not clear which words should
be included in a stoplist. Traditionally, stoplists are supposed to have included the most frequently
occurring words. However, some frequently occurring words are too important as index terms.
For example,included among the 200 most frequently occurring words in general literature in
English are "time,""war," "home," "life," "water," and "world." On the other hand, specialized
databases will contain many words useless as index terms that are not frequent in general English.
For example, a computer literature database probably need not use index terms like "computer,"
"program," "source," "machine," and "language."
As with lexical analysis in general, stoplist policy will depend on the database and features of the
users and the indexing process. Commercial information systems tend to take a very conservative
approach, with few stopwords. For example, the ORBIT Search Service has only eight stopwords:
"and," "an," "by," "from," "of," "the," and "with." Larger stoplists are usually advisable.
There are two ways to filter stoplist words from an input token stream:
(a) examine lexical analyzer output and remove any stopwords, or
(b) remove stopwords as part of lexical analysis.
The first approach, filtering stopwords from lexical analyzer output, makes the stoplist problem
into a standard list searching problem: every token must be looked up in the stoplist, and removed
from further analysis if found. The usual solutions to this problem are adequate, including binary
search trees, binary search of an array, and hashing.
Undoubtedly the fastest solution is hashing.
When hashing is used to search a stoplist, the list must first be inserted into a hash table. Each
token is then hashed into the table. If the resulting location is empty, the token is not a stopword,
and is passed on; otherwise, comparisons must be made to determine whether the hashed value
really matches the entries at that hash table location. If not, then the token is passed on; if so, the
token is a word, and is eliminated from the token stream. This strategy is fast, but is slowed by the
need to re-examine each character in a token to generate its hash value, and by the need to resolve
collisions.
Although hashing is an excellent approach, probably the best implementation of stoplists is the
second strategy: remove stoplist words as part of the lexical analysis process. Since lexical
analysis must be done anyway, and recognizing even a large stoplist can be done at almost no
extra cost during lexical analysis, this approach is extremely efficient. Furthermore, lexical
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 11
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
analyzers that filter stoplists can be generated automatically, which is easier and less error-prone
than writing stopword filters by hand.
Assignment-Cum-Tutorial Questions
Part – A: Objective Questions
1. What does PAT in PAT Trees stand for?
a) Pattern Access Tree
b) Practical Algorithm Tree
c) PArTitioned Tree
d) Partial Access Tree
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 12
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 13
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
17. Which of the following is most suitable for exact match queries?
a) B-Trees
b) Hash Tables
c) PAT Trees
d) Graphs
20. What is a primary use case of PAT Trees in text retrieval systems?
a) Sorting strings
b) Text compression
c) Indexing suffixes for fast lookup
d) Tokenization
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 14
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
24. Which key feature differentiates PATRICIA Trees from standard tries?
a) Random access
b) Bitwise compression
c) Height balancing
d) Stack-based memory
26. Which data structure merges the concepts of tries and binary trees?
a) B+ Tree
b) Trie Hash
c) PATRICIA Tree
d) Ternary Tree
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 15
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
b) Tree
c) The
d) Data
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 16
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III
15. Describe a method for updating a PAT Tree when a new string is added.
16. What does PATRICIA stand for?
17. How does a PATRICIA Tree optimize space usage?
18. Compare PATRICIA Trees with standard PAT Trees.
19. What is the benefit of representing PAT Trees as PATRICIA Trees?
20. What is the purpose of representing PAT Trees as arrays?
21. Mention two benefits of using arrays for PAT representation.
22. How are suffixes organized in a PAT array?
23. What kind of search technique is used in PAT Arrays?
24. List two examples of common stop words.
1. How PAT trees are represented as PAT arrays. Discuss with an example.
2. Determine Semi-infinite strings (Sistrings) for the text “Once upon a time in a far away land”.
3. What is Sistring? Explain with an example.
4. Construct PAT Tree structure for the following data:
01100100010111-----Text
12345678901234-----Position
5. Explain prefix searching in PAT tree. Construct a prefix searching for the text “BAANANA”.
6. Construct PAT tree for the Text “Economics for Warsaw is complex” With
an example explain the process involved in the construction of PAT trees.
7. Describe the structure of a PAT Tree in detail. How does it represent suffixes, and what role do
internal and leaf nodes play?
8. Explain the concept of PAT Trees. Discuss their significance and applications in text indexing
and retrieval systems.
9. Explain Range searching with example.
[Link] Range searching and Proximity searching.
[Link] Proximity searching with Prefix searching.
12. How do you merge small against large PAT arrays? Explain with a neat sketch.
13. Illustrate the construction of a PAT Tree using a sample text. Explain each step involved in
building the tree.
14. Discuss the algorithm for pattern searching in a PAT Tree. How is efficient searching
achieved, and what is the time complexity?
15. What is a PATRICIA Tree? How can PAT Trees be implemented using PATRICIA Trees to
optimize space and performance?
16. Explain how PAT Trees can be represented using arrays. What are the advantages and
limitations of this representation?
17. How do you merge small against large PAT arrays? Explain with a neat sketch.
Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 17