0% found this document useful (0 votes)
14 views17 pages

New Indices for Text in IR Systems

The document provides learning material on Information Retrieval Systems, specifically focusing on new indices for text, lexical analysis, and stoplists, particularly PAT trees and PAT arrays. It discusses the structure and algorithms related to PAT trees, including prefix searching, proximity searching, and range searching, highlighting their efficiency in text searching. The document emphasizes a model of text as a long string and introduces the concept of semi-infinite strings for improved search capabilities.

Uploaded by

skafrid2465
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views17 pages

New Indices for Text in IR Systems

The document provides learning material on Information Retrieval Systems, specifically focusing on new indices for text, lexical analysis, and stoplists, particularly PAT trees and PAT arrays. It discusses the structure and algorithms related to PAT trees, including prefix searching, proximity searching, and range searching, highlighting their efficiency in text searching. The document emphasizes a model of text as a long string and introduces the concept of semi-infinite strings for improved search capabilities.

Uploaded by

skafrid2465
Copyright
© All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

SESHADRI RAO GUDLAVALLERU ENGINEERING COLLEGE

Academics Strengthening & Advancement


(AS&A)

Department of CSE(AI & ML)

R – 23
III [Link]. I Semester

Learning Material
On

INFORMATION RETRIEVAL SYSTEMS

UNIT – III
“New Indices for Text,Lexical Analysis and
Stoplists”

Prepared by : [Link] Begum, [Link]

Dept of CSE(AI&ML)

SRGEC, Gudlavalleru
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

SYLLABUS
New Indices for Text, Lexical Analysis and Stop lists:
PAT Trees and PAT Arrays: Introduction, PAT Tree structure, Algorithms on the PAT Trees,
Building PAT Trees as PATRICA Trees, PAT representation as Arrays. Stop lists.

LEARNING MATERIAL

3.1 INTRODUCTION
Text searching methods may be classified as lexicographical indices (indices that are
sorted),clustering techniques, and indices based on hashing. In this chapter we discuss two new
lexicographical indices for text, called PAT trees and PAT arrays.

The traditional model of text used in information retrieval is that of a set of documents. Each
document is assigned a list of keywords (attributes), with optional relevance weights associated to
each keyword. This model is oriented to library applications, which it serves quite well. For more
general applications, it has some problems, namely:
 A basic structure is assumed (documents and words). This may be reasonable for many
applications, but not for others.
 Keywords must be extracted from the text (this is called "indexing"). This task is not
trivial and error prone, whether it is done by a person, or automatically by a computer.
 Queries are restricted to keywords.

For some indices, instead of indexing a set of keywords, all words except for those deemed to be
too common (called stopwords) are indexed.
We prefer a different model. We see the text as one long string. Each position in the text
corresponds to a semi-infinite string (sistring), the string that starts at that position and extends
arbitrarily far to the right, or to the end of the text. It is not difficult to see that any two strings not
at the same position are different.
The main advantages of this model are:
 No structure of the text is needed, although if there is one, it can be used.
 No keywords are used. The queries are based on prefixes of sistrings, that is, on
any substring of the text.
This model is simpler and does not restrict the query domain. Furthermore, almost any searching
structure can be used to support this view of text.
In the traditional text model, each document is considered a database record, and each keyword a
value or a secondary key. Because the number of keywords is variable, common database
techniques are not useful in this context. Typical data-base queries are on equality or on ranges.
They seldom consider "approximate text searching."

3.2 THE PAT TREE STRUCTURE


The PAT tree is a data structure that allows very efficient searching with preprocessing. This
section describes the PAT data structure, how to do some text searches and algorithms to build
two of its
possible implementations. This structure was originally described by Gonnet in the paper
"Unstructured Data Bases" by Gonnet (1983). In 1985 it was implemented and later used in

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 2
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

conjunction with the computerization of the Oxford English Dictionary. The name of the
implementation, the PATTM system, has become well known in its own right, as a software
package for very fast string searching.
3.2.1 Semi-infinite Strings
In what follows, we will use a very simple model of text. Our text, or text data-base, will consist
of a single (possibly very long) array of characters, numbered sequentially from one onward.
Whether the text is already presented as such, or whether it can be viewed as such is not relevant.
To apply our algorithms it is sufficient to be able to view the entire text as an array of characters.
A semi-infinite string is a subsequence of characters from this array, taken from a given starting
point but going on as necessary to the right. In case the semi-infinite string (sistring) is used
beyond the end of the actual text, special null characters will be considered to be added at its end,
these characters being different than any other in the text. The name semi-infinite is taken from
the analogy with geometry where we have semi-infinite lines, lines with one origin, but infinite in
one direction. Sistrings are uniquely identified by the position where they start, and for a given,
fixed text, this is simply given by an integer.
Example:
Text :“ Once upon a time, in a far away land . . . “
sistring 1 :“ Once upon a time . . .”

sistring 2 :” nce upon a time . . .”


sistring 8 :” on a time, in a . . .”
sistring 11 :” a time, in a far . . .”
sistring 22 :” a far away land . . .”

Sistrings can be defined formally as an abstract data type and as such present a very useful and
important model of text. For the purpose of this section, the most important operation on sistrings
is the lexicographical comparison of sistrings and will be the only one defined. This comparison is
the one resulting from comparing two sistrings' contents (not their positions). Note that unless we
are comparing a sistring to itself, the comparison of two sistrings cannot yield equal. (If the
sistrings are not the same, sooner or later, by inspecting enough characters, we will have to find a
character where they differ, even if we have to start comparing the fictitious null characters at the
end of the text).

For example, the above sistrings will compare as follows:


22 < 11 < 2 < 8 < 1
Of the first 22 sistrings (using ASCII ordering) the lowest sistring is "a far away. . ." and the
highest is "upon a time. . . ."

3.2.2 PAT Tree


A Patricia tree is a digital tree where the individual bits of the keys are used to decide on the
branching. A zero bit will cause a branch to the left subtree, a one bit will cause a branch to the

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 3
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

right subtree. Hence Patricia trees are binary digital trees. In addition, Patricia trees have in each
internal node an indication of which bit of the query is to be used for branching. This may be
given by an absolute bit position, or by a count of the number of bits to skip. This allows internal
nodes with single descendants to be eliminated, and thus all internal nodes of the tree produce a
useful branching, that is, both subtrees are non-null.

Patricia trees store key values at external nodes; the internal nodes have no key information, just
the skip counter and the pointers to the subtrees. The external nodes in a PAT tree are sistrings,
that is,integer displacements. For a text of size n, there are n external nodes in the PAT tree and n
- 1 internal nodes. This makes the tree O(n) in size, with a relatively small asymptotic constant.
Later we will want to store some additional information (the size of the subtree and which is the
taller subtree) with each internal node, but this information will always be of a constant size.
Figure 3.1 shows an example of a PAT tree over a sequence of bits (normally it would be over a
sequence of characters), just for the purpose of making the example easier to understand. In this
example, we show the Patricia tree for the text "01100100010111..." after the first 8 sistrings have
been inserted. External nodes are indicated by squares, and they contain a reference to a sistring,
and internal nodes are indicated by a circle and contain a displacement. In this case we have used,
in each internal node, the total displacement of the bit to be inspected, rather than the skip value.
Notice that to reach the external node for the query 00101 we first inspect bit 1 (it is a zero, we
goleft) then bit 2 (it is zero, we go left), then bit 3 (it is a one, we go right), and then bit 5 (it is a
one, we go right). Because we may skip the inspection of some bits (in this case bit 4), once we
reach our desired node we have to make one final comparison with one of the sistrings stored in
an external node of the current subtree, to ensure that all the skipped bits coincide. If they do not
coincide, then the key is not in the tree.

Figure 3.1: PAT tree when the sistrings 1 through 8 have been inserted
3.2.3 Indexing Points
So far we have assumed that every position in the text is indexed, that is, the Patricia tree has n
external nodes, one for each position in the text. For some types of search this is desirable, but in
some other cases not all points are necessary or even desirable to index. For example, if we are
interested in word and phrase searches, then only those sistrings that are at the beginning of words
(about 20% of the total for common English text) are necessary. The decision of how many
sistrings to include in the tree is application dependent, and will be a trade-off between size of the
index and search requirements.

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 4
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

3.3 ALGORITHMS ON THE PAT TREE


Some of the algorithms for text searching are discussed below:

3.3.1 Prefix Searching


Notice that every subtree of the PAT tree has all the sistrings with a given prefix, by construction.
Then prefix searching in a PAT tree consists of searching the prefix in the tree up to the point
where we exhaust the prefix or up to the point where we reach an external node. At this point we
need to verify whether we could have skipped bits. This is done with a single comparison of any
of the sistrings in the subtree (considering an external node as a subtree of size one). If this
comparison is successful,then all the sistrings in the subtree (which share the common prefix) are
the answer; otherwise there are no sistrings in the answer.
It is important to notice that the search ends when the prefix is exhausted or when we reach an
external node and at that point all the answer is available (regardless of its size) in a single
subtree. For random Patricia trees, the height is O(log n) and consequently with PAT trees we can
do arbitrary prefix searching in O(log n) time, independent of the size of the answer. In practice,
the length of the query is less than O(log n), thus the searching time is proportional to the query
length.
By keeping the size of each subtree in each internal node, we can trivially find the size of any
matched subtree. (Knowing the size of the answer is very appealing for information retrieval
purposes.) Figure 3.2 shows the search for the prefix "10100" and its answer.

Figure 3.2: Prefix searching


3.3.2 Proximity Searching
We define a proximity search as finding all places where a string S1 is at most a fixed (given by
the user) number of characters away from another string S2. The simplest algorithm for this type
of query is to search for sl and s2. Then, sort by position the smaller of the two answers. Finally,
we traverse the unsorted answer set, searching every position in the sorted set, and checking if the
distance between positions (and order, if we always want s1 before s2) satisfies the proximity
condition. If m1and m2 (m1 < m2) are the respective answer set sizes, this algorithm requires (m1
+ m2) log m1 [Link] is not very appealing if m1 or m2 are O(n).

3.3.3 Range Searching


Searching for all the strings within a certain range of values (lexicographical range) can be done
equally efficiently. More precisely, range searching is defined as searching for all strings that
lexicographically compare between two given strings. For example, the range "abc" . . "acc" will
contain strings like "abracadabra," "acacia," "aboriginal," but not "abacus" or "acrimonious."
To do range searching on a PAT tree, we search each end of the defining intervals and then collect
all the subtrees between (and including) them. It should be noticed that only O(height) subtrees

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 5
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

will be in the answer even in the worst case (the worst case is 2 height - 1) and hence only O(log
n) time is necessary in total. As before, we can return the answer and the size of the answer in
time O(log n) independent of the actual size of the answer.

3.3.4 Longest Repetition Searching


The longest repetition of a text is defined as the match between two different positions of a text
where this match is the longest (the most number of characters) in the entire text. For a given text,
the longest repetition will be given by the tallest internal node in the PAT tree, that is, the tallest
internal node gives a pair of sistrings that match for the greatest number of characters. In this case,
tallest means not only the shape of the tree but has to consider the skipped bits also. For a given
text, the longest repetition can be found while building the tree and it is a constant, that is, will not
change unless we change the tree (that is, the text).
It is also interesting and possible to search for the longest repetition not just for the entire tree/text,
but for a subtree. This means searching for the longest repetition among all the strings that share a
common prefix. This can be done in O(height) time by keeping one bit of information at each
internal node, which will indicate on which side we have the tallest subtree. By keeping such a bit,
we can find one of the longest repetitions starting with an arbitrary prefix in O(log n) time. If we
want to search for all of the longest repetitions, we need two bits per internal node (to indicate
equal heights as well) and the search becomes logarithmic in height and linear in the number of
matches.
This type of search has great practical interest, but is slightly difficult to describe. By "most
significant" or "most frequent" we mean the most frequently occurring strings within the text
[Link] example, finding the "most frequent" trigram is finding a sequence of three letters
that appears most often within our text.
In terms of the PAT tree, and for the example of the trigrams, the number of occurrences of a
trigram is given by the size of the subtree at a distance 3 characters from the root. So finding the
most frequent trigram is equivalent to finding the largest subtree at a distance 3 characters from
the root. This can be achieved by a simple traversal of the PAT tree which is at most O(n/average
size of answer) but is usually much faster.

Searching for "most common" word is slightly more difficult but uses a similar algorithm. A word
could be defined as any sequence of characters delimited by a blank space. This type of search
will also require a traversal, but in this case the traversal is only done in a subtree (the subtree of
all sistrings starting with a space) and does not have a constant depth; it traverses the tree to the
places where each second blank appears.
We may also apply this algorithm within any arbitrary subtree. This is equivalent to finding the
most frequently occurring trigram, word, and the like, that follows some given prefix.
In all cases, finding the most frequent string with a certain property requires a subtree selection
and then a tree traversal which is at most O(n/k) but typically much smaller. Here, k is the average
size of each group of strings of the given property. Techniques similar to alpha-beta pruning can
be used to improve this search.
3.3.6 Regular Expression Searching
The main steps of the algorithm due by Baeza-Yates and Gonnet (1989) are:
a. Convert the regular expression passed as a query into a minimized deterministic finite
automation (DFA), which may take exponential space/time with respect to the query size
but is independent of the size of the text.
b. Next eliminate outgoing transitions from final states (see justification in step (e). This
may induce further minimization.

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 6
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

c. Convert character DFAs into binary DFAs using any suitable binary encoding of the input
alphabet;each state will then have at most two outgoing transitions, one labeled 0 and one
labeled 1.
d. Simulate the binary DFA on the binary digital trie from all sistrings of text using the same
binary encoding as in step b. That is, associate the root of the tree with the initial state,
and, for any internal node associated with state i, associate its left descendant with state j if
i j for a bit 0, and associate its right descendant with state k if i k for a 1 (see Figure 5.3).
e. For every node of the index associated with a final state, accept the whole subtree and halt
the search in that subtree. (For this reason, we do not need outgoing transitions in final
states).
f. On reaching an external node, run the remainder of the automaton on the single string
determined by this external node.

Figure 3.3: Simulating the automaton on a binary digital tree


For random text, it is possible to prove that for any regular expression, the average searching time
is sublinear (because of step e), and of the form:
O(logm(n)n )
where < 1, and m 0 (integer) depend on the incidence matrix of the simulated DFA (that is, they
depend on the regular expression).

3.4 BUILDING PAT TREES AS PATRICIA TREES

The implementation of PAT trees using conventional Patricia trees is the obvious one, except for
some implementation details which cannot be overlooked as they would increase the size of the
index or its accessing time quite dramatically. Since this type of index is typically built over very
large text databases, and the size of the index tree is linear in the size of the text, an economic
implementation is mandatory.
It is easy to see that the internal nodes will be between 3 and 4 words in size. Each external node
could be one word and consequently we are taking between 4n and 5n words for the index, or
about 18n chars tor indexing n characters, most likely unacceptable from the practical point of
view.
Second, the organization for the tree should be such that we can gain from the reading of large
records external files will use physical records which are certainly larger than one internal node).
The main ideas to solve (or alleviate) both problems are
a. bucketing of external nodes, and
b. mapping the tree onto the disk using supernodes.
Collecting more than one external node is called bucketing and is an old idea in data processing.
A bucket replaces any subtree with size less than a certain constant (say b) and hence saves up to
b – l internal nodes. The external nodes inside a bucket do not have any structure associated with
them, and any search that has to be done in the bucket has to be done on all the members of the
bucket. This increases the number of comparisons for each search, in the worst case by b. On the

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 7
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

other hand, and assuming a random distribution of keys, buckets save a significant number of
internal nodes. In general, it is not possible to have all buckets full, and on average the number of
keys per bucket is b In 2. Hence, instead of n - 1 internal nodes, we have about (n/bln2) internal
nodes for random strings.
This means that the overhead of the internal nodes, which are the largest part of the index, can be
cut down by a factor of b1n2. We have then a very simple trade-off between time (a factor of b)
and space (a factor of b1n2).
Organizing the tree in super-nodes has advantages from the point of view of the number of
accesses as well as in space. The main idea is simple: we allocate as much as possible of the tree
in a disk page as long as we preserve a unique entry point for every page. De facto, every disk
page has a single entry point, contains as much of the tree as possible, and terminates either in
external nodes or in pointers to other disk pages (notice that we need to access disk pages only, as
each disk page has a single entry point). The pointers in internal nodes will address either a disk
page or another node inside the same page, and consequently can be substantially smaller
(typically about half a word is enough). This further reduces the storage cost of internal nodes.
Unfortunately, not all disk pages will be 100 percent full. A bottom-up greedy construction
guarantees at least 50 percent occupation. Actual experiments indicate an occupation close to 80
percent.
With these constraints, disk pages will contain on the order of 1,000 internal/external nodes. This
means that on the average, each disk page will contain about 10 steps of a root-to-leaf path, or in
other words that the total number of accesses is a tenth of the height of the tree. Since it is very
easy to keep the root page of the tree in memory, this means that a typical prefix search can be
accomplished with 2-3 disk accesses to read the index (about 30 to 40 tree levels) and one
additional final access to verify the skipped bits (buckets may require additional reading of
strings). This implementation is the most efficient in terms of disk accesses for this type of search.

3.5 PAT TREES REPRESENTED AS ARRAYS


The previous implementation has a parameter, the external node bucket size, b. When a search
reaches a bucket, we have to scan all the external nodes in the bucket to determine which if any
satisfy the search. If the bucket is too large, these costs become prohibitive.

However, if the external nodes in the bucket are kept in the same relative order as they would be
in the tree, then we do not need to do a sequential search, we could do an indirect binary search
(i.e.,compare the sistrings referred to by the external node) to find the nodes that satisfy the
[Link], the cost of searching a bucket becomes 2 log b - 1 instead of b.

Although this is not significant for small buckets, it is a crucial observation that allows us to
develop another implementation of PAT trees. This is simply to let the bucket grow well beyond
normal bucket sizes, including the option of letting it be equal to n, that is, the whole index
degenerates into a single array of external nodes ordered lexicographically by sistrings. With
some additions, this idea was independently discovered by Manber and Myers (1990), who called
the structures suffix arrays.
There is a straightforward argument that shows that these arrays contain most of the information
we had in the Patricia tree at the cost of a factor of log2 n. The argument simply says that for any
interval in the array which contains all the external nodes that would be in a subtree, in log2 n
comparisons in the worst case we can divide the interval according to the next bit which is
different. The sizes of the subtrees are trivially obtained from the limits of any portion of the
array, so the only information that is missing is the longest-repetition bit, which is not possible to

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 8
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

represent without an additional structure. Any operation on a Patricia tree can be simulated in
O(log n) accesses. Figure 3.4 shows this structure.

Figure 3.4: A PAT array

3.5.1 Searching PAT Trees as Arrays

It turns out that it is not neccessary to simulate the Patricia tree for prefix and range searching, and
we obtain an algorithm which is O(log n) instead of O(log2 n) for these operations. Actually,
prefix searching and range searching become more uniform. Both can be implemented by doing
an indirect binary search over the array with the results of the comparisons being less than, equal
(or included in the case of range searching), and greater than. In this way, the searching takes at
most 2 log2 n – 1 comparisons and 4log2 n disk accesses (in the worst case).

The "most frequent" searching can also be improved, but its discussion becomes too technical and
falls outside the scope of this section. The same can be said about "longest repetition" which
requires additional supporting data structures. For regular expression searching, the time increases
by a O(logn) factor.

3.5.2 Building PAT arrays in memory


If a portion of the text is small enough to fit in main memory together with its PAT array, then this
process can be done very efficiently as it is equivalent to string sorting. Note that here we are
talking about main memory; if paging is used to simulate a larger memory, the random access
patterns over the text will certainly cause severe memory thrashing.

Quicksort is an appropriate algorithm for this building phase since it has an almost sequential
pattern of access over the sorted file. For maximal results we can put all of the index array on
external storage and apply external quicksort (see Gonnet and Baeza-Yates, section 4.4.5 [1984])
indirectly over the text. In this case it is possible to build an index for any text which together with
the program can fit in main memory. With today's memory sizes, this is not a case to ignore. This
is the algorithm of choice for small files and also as a building block for other algorithms.

Merging small against large PAT arrays

A second case that can be solved efficiently is the case of merging two indices (to produce a
single one) when the text plus twice the index of one of them fits in main memory. This algorithm
is not trivial and deserves a short explanation. The text of the small file together with a PAT array
for the small file (of size n1) plus an integer array of size n1 + 1 are kept in main memory. The
integer array is used to count how many sistrings of the big file fall between each pair of index
points in the small file (see Figure 5.5). To do this counting, the large file is read sequentially and
each sistring is searched in the PAT array of the small file until it is located between a pair of
points in the index. The corresponding counter is incremented. This step will require O(n2 log n1)
comparisons and O(n2) characters to be read sequentially. Once the counting is finished, the
merging takes place by reading the PAT array of the large file and inserting the PAT array of the
small file guided by the counts (see Figure 5.6). This will require a sequential reading of n1+ n2

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 9
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

words. In total, this algorithm performs a linear amount of sequential input and output and O(n2
log n1) internal work, and its behavior is not only acceptable but exceptionally good.

Figure 3.5: Small index in main memory

Figure 3.6: Merging the small and the large index

Given these simple and efficient building blocks we can design a general index building
[Link] we split the text file into pieces, the first piece being as large as possible to build
an index in main memory. The remaining pieces are as large as possible to allow merging via the
previous algorithm (small against large). Then we build indices for all these parts and merge each
part.
An improvement may be made by noticing that index points at the front of the text will be merged
many times into new points being added at the end. We take advantage of this by constructing
partial indices on blocks of text one half the size of memory. These indices may be merged with
each other,the entire merge taking place in memory. The merged index is not created at this point,
since it would fill memory and could not be merged with any further index. As before, a vector of
counters is kept,indicating how many entries of the first index fall between each pair of adjacent
entries in the second.
When the nth block of text is being indexed, the n - 1 previous indices are merged with it. The
counters are accumulated with each merge. When all of the merges have been done, the counters
are written to a file. When all of the blocks have been indexed and merged, the files of counters
are used as instructions to merge all the partial indices.
The number of parts is O(n / m) where m is the total amount of available memory, and overall the
algorithm requires O(n2 / m) sequential access and O(n2 log n / m) time. Given the sizes of
present day memories and the differences of accessing time for random versus sequential access,
the above quadratic algorithm would beat a linear algorithm even for databases like the Oxford
English Dictionary!
An interesting property of the last version of the algorithm is that each of the merges of partial
indices is independent. Therefore, all of the O((n / m)2) merges may be run in parallel on separate
processors. After all merges are complete, the counters for each partial index must be accumulated
and the final index merge may be done.

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 10
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

Another practical advantage of this algorithm is that the final index merge is done without
reference to the text. Thus, at the time that the final merge is being done, the text may be removed
from the system and reloaded after removal of the partial indices. In situations where the text, the
final index, and the sum of partial indices may each be one or more gigabytes, this is an important
consideration.

3.6 STOPLISTS
It has been recognized since the earliest days of information retrieval (Luhn 1957) that many of
the most frequently occurring words in English (like "the," "of," "and," "to," etc.) are worthless as
index terms. A search using one of these terms is likely to retrieve almost every item in a database
regardless of its relevance, so their discrimination value is low.
Eliminating such words from consideration early in automatic indexing speeds processing, saves
huge amounts of space in indexes, and does not damage retrieval effectiveness. A list of words
filtered out during automatic indexing because they make poor index terms is called a stoplist or a
negative dictionary.
One way to improve information retrieval system performance, then, is to eliminate stopwords
during automatic indexing. As with lexical analysis, however, it is not clear which words should
be included in a stoplist. Traditionally, stoplists are supposed to have included the most frequently
occurring words. However, some frequently occurring words are too important as index terms.
For example,included among the 200 most frequently occurring words in general literature in
English are "time,""war," "home," "life," "water," and "world." On the other hand, specialized
databases will contain many words useless as index terms that are not frequent in general English.
For example, a computer literature database probably need not use index terms like "computer,"
"program," "source," "machine," and "language."
As with lexical analysis in general, stoplist policy will depend on the database and features of the
users and the indexing process. Commercial information systems tend to take a very conservative
approach, with few stopwords. For example, the ORBIT Search Service has only eight stopwords:
"and," "an," "by," "from," "of," "the," and "with." Larger stoplists are usually advisable.

There are two ways to filter stoplist words from an input token stream:
(a) examine lexical analyzer output and remove any stopwords, or
(b) remove stopwords as part of lexical analysis.
The first approach, filtering stopwords from lexical analyzer output, makes the stoplist problem
into a standard list searching problem: every token must be looked up in the stoplist, and removed
from further analysis if found. The usual solutions to this problem are adequate, including binary
search trees, binary search of an array, and hashing.
Undoubtedly the fastest solution is hashing.
When hashing is used to search a stoplist, the list must first be inserted into a hash table. Each
token is then hashed into the table. If the resulting location is empty, the token is not a stopword,
and is passed on; otherwise, comparisons must be made to determine whether the hashed value
really matches the entries at that hash table location. If not, then the token is passed on; if so, the
token is a word, and is eliminated from the token stream. This strategy is fast, but is slowed by the
need to re-examine each character in a token to generate its hash value, and by the need to resolve
collisions.

Although hashing is an excellent approach, probably the best implementation of stoplists is the
second strategy: remove stoplist words as part of the lexical analysis process. Since lexical
analysis must be done anyway, and recognizing even a large stoplist can be done at almost no
extra cost during lexical analysis, this approach is extremely efficient. Furthermore, lexical

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 11
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

analyzers that filter stoplists can be generated automatically, which is easier and less error-prone
than writing stopword filters by hand.

Assignment-Cum-Tutorial Questions
Part – A: Objective Questions
1. What does PAT in PAT Trees stand for?
a) Pattern Access Tree
b) Practical Algorithm Tree
c) PArTitioned Tree
d) Partial Access Tree

2. PAT trees are primarily used for:


a) Arithmetic calculations
b) Graph traversal
c) Efficient text indexing and retrieval
d) Binary tree balancing

3. The main advantage of PAT trees over suffix trees is:


a) Smaller storage requirement
b) Faster search
c) Easier construction
d) Supports only numeric data

4. In PAT Trees, internal nodes store:


a) Complete strings
b) Only prefixes
c) Indices and offsets
d) Leaf nodes

5. Which data structure is closely related to PAT Trees?


a) Hash Tables
b) PATRICIA Trees
c) Heaps
d) AVL Trees

6. PAT Arrays are typically used for:


a) Sorting numeric arrays
b) Compact storage of suffix information
c) Tree balancing
d) Encryption

7. Which of the following is not true about PAT Trees?


a) Used for full-text search
b) Based on suffixes
c) Are balanced binary trees

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 12
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

d) Use compact prefix encoding

8. Which feature of PAT arrays makes them efficient?


a) Constant time updates
b) Use of hashing
c) Binary search over sorted suffixes
d) Graph-based storage

9. The leaves in a PAT Tree represent:


a) Prefixes
b) Suffixes of the text
c) Keywords
d) Stop words

10. PAT Trees avoid storing redundant substrings by:


a) Using AVL rotations
b) Using compressed edges
c) Using bloom filters
d) Using hash-based storage

11. Searching in a PAT Tree is similar to:


a) Binary search
b) Hash lookup
c) Trie traversal
d) Breadth-first traversal

12. Which operation is typically faster in PAT Trees?


a) Insertion
b) Deletion
c) Pattern search
d) Tree balancing

13. Time complexity of pattern matching in PAT Trees is:


a) O(log n)
b) O(n)
c) O(m), where m is pattern length
d) O(n log n)

14. Which algorithm can be used to build PAT Trees efficiently?


a) Boyer-Moore
b) Weiner’s algorithm
c) Rabin-Karp
d) Knuth-Morris-Pratt

15. In PAT Trees, longest prefix matching is used to:


a) Improve hash lookup
b) Identify redundant substrings
c) Traverse the tree
d) Compress search paths

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 13
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

16. In which traversal is text retrieval from a PAT Tree performed?


a) Post-order
b) In-order
c) Pre-order
d) Level-order

17. Which of the following is most suitable for exact match queries?
a) B-Trees
b) Hash Tables
c) PAT Trees
d) Graphs

18. Which operation is NOT generally supported in a PAT Tree?


a) Prefix search
b) Substring search
c) Range queries
d) Lexical sorting

19. What makes PAT Trees more space-efficient than tries?


a) Uses stacks
b) Uses pointers
c) Uses compressed paths
d) Uses fixed-length arrays

20. What is a primary use case of PAT Trees in text retrieval systems?
a) Sorting strings
b) Text compression
c) Indexing suffixes for fast lookup
d) Tokenization

21. PATRICIA stands for:


a) Practical Algorithm To Retrieve Information Coded In Alphanumeric
b) Prefix Algorithm Tree
c) Pattern Tree Algorithm
d) Partial Access Tree for Compression

22. PATRICIA Trees are:


a) Compressed Binary Tries
b) Balanced Trees
c) B-Trees
d) Hash Trees

23. In a PATRICIA Tree, branching occurs at:


a) Every character
b) First occurrence
c) Bit positions
d) Leaf nodes only

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 14
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

24. Which key feature differentiates PATRICIA Trees from standard tries?
a) Random access
b) Bitwise compression
c) Height balancing
d) Stack-based memory

25. PAT Trees can be implemented using PATRICIA Trees to:


a) Optimize insertion
b) Reduce tree depth
c) Avoid suffix representation
d) Enable sorting

26. Which data structure merges the concepts of tries and binary trees?
a) B+ Tree
b) Trie Hash
c) PATRICIA Tree
d) Ternary Tree

27. Which operation is optimized in PATRICIA Trees?


a) Sorting
b) Binary Search
c) Lookup and Insertion
d) Balancing

28. PATRICIA Trees are best suited for:


a) Handling numeric data
b) Word frequency analysis
c) Large string dictionary lookup
d) Grammar parsing

29. In PAT Tree represented as a PATRICIA Tree, leaf nodes contain:


a) Prefix matches
b) Indices to original text
c) Hash values
d) Stop word flags

30. What is minimized in a PATRICIA Tree compared to a Trie?


a) Memory usage
b) Speed
c) Accuracy
d) Recursion depth

31. A stop list is used to:


a) Speed up indexing
b) Remove irrelevant words
c) Encrypt text
d) Sort tokens

32. Which word is most likely to be on a stop list?


a) Algorithm

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 15
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

b) Tree
c) The
d) Data

33. Stop lists are applied during:


a) Compilation
b) Semantic analysis
c) Lexical analysis
d) Syntax tree construction

34. Removing stop words helps in:


a) Increasing storage
b) Reducing search quality
c) Enhancing precision in IR
d) Grammar correction

35. Which of the following is NOT a common stop word?


a) And
b) Or
c) Not
d) Algorithm

36. Which method helps generate a custom stop list?


a) Term Frequency analysis
b) Binary search
c) Sorting
d) Parsing

37. A stop list is typically stored as:


a) Binary Tree
b) Array
c) Hash Set or List
d) Stack

Part – B : SHORT ANSWER QUESTIONS


1. What is a PAT Tree and where is it used?
2. How does a PAT Tree differ from a standard trie?
3. Define the purpose of PAT Arrays in text indexing.
4. What kind of data is most suitable for indexing using PAT Trees?
5. Mention two advantages of PAT Trees in information retrieval systems.
6. Describe the structure of a PAT Tree.
7. What information is typically stored at the internal nodes of a PAT Tree?
8. Explain how suffixes are represented in a PAT Tree.
9. What is the role of leaf nodes in a PAT Tree?
10. How does path compression work in a PAT Tree?
11. How is pattern matching performed in a PAT Tree?
12. What is the time complexity for searching a pattern in a PAT Tree?
13. List the major steps involved in constructing a PAT Tree.
14. How does the PAT Tree handle duplicate suffixes?

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 16
R – 23
SRGEC INFORMATION RETREIVAL SYSTEMS UNIT - III

15. Describe a method for updating a PAT Tree when a new string is added.
16. What does PATRICIA stand for?
17. How does a PATRICIA Tree optimize space usage?
18. Compare PATRICIA Trees with standard PAT Trees.
19. What is the benefit of representing PAT Trees as PATRICIA Trees?
20. What is the purpose of representing PAT Trees as arrays?
21. Mention two benefits of using arrays for PAT representation.
22. How are suffixes organized in a PAT array?
23. What kind of search technique is used in PAT Arrays?
24. List two examples of common stop words.

Part – C :DESCRIPTIVE QUESTIONS

1. How PAT trees are represented as PAT arrays. Discuss with an example.
2. Determine Semi-infinite strings (Sistrings) for the text “Once upon a time in a far away land”.
3. What is Sistring? Explain with an example.
4. Construct PAT Tree structure for the following data:
01100100010111-----Text
12345678901234-----Position
5. Explain prefix searching in PAT tree. Construct a prefix searching for the text “BAANANA”.
6. Construct PAT tree for the Text “Economics for Warsaw is complex” With
an example explain the process involved in the construction of PAT trees.
7. Describe the structure of a PAT Tree in detail. How does it represent suffixes, and what role do
internal and leaf nodes play?
8. Explain the concept of PAT Trees. Discuss their significance and applications in text indexing
and retrieval systems.
9. Explain Range searching with example.
[Link] Range searching and Proximity searching.
[Link] Proximity searching with Prefix searching.
12. How do you merge small against large PAT arrays? Explain with a neat sketch.
13. Illustrate the construction of a PAT Tree using a sample text. Explain each step involved in
building the tree.
14. Discuss the algorithm for pattern searching in a PAT Tree. How is efficient searching
achieved, and what is the time complexity?
15. What is a PATRICIA Tree? How can PAT Trees be implemented using PATRICIA Trees to
optimize space and performance?
16. Explain how PAT Trees can be represented using arrays. What are the advantages and
limitations of this representation?
17. How do you merge small against large PAT arrays? Explain with a neat sketch.

Department of CSE (AI&ML) New Indices for Text,Lexical Analysis and Stoplists Page 17

You might also like