
Cataloging and Indexing Techniques

The document discusses the history and objectives of indexing, the indexing process, automatic indexing, and information extraction and summarization. Indexing involves determining terms to represent items to facilitate search and retrieval. It has evolved from manual subject indexing to include automatic indexing using terms or concepts. Automatic indexing aims to index items faster while maintaining consistency. Information extraction focuses on extracting specific facts or text like summaries rather than fully representing an item.

Uploaded by

7killers4u
Copyright
© Attribution Non-Commercial (BY-NC)

UNIT- II

Cataloging and Indexing


2.1 History and Objectives of Indexing
2.2 Indexing Process
2.3 Automatic Indexing
2.4 Information Extraction

2.1 History and Objectives of Indexing:

Overview:

- Indexing: the indexing process determines which terms (concepts) can represent a particular item. The transformation from the received item to the searchable data structure is called indexing. It can be manual or automatic.
- Search: once the searchable data structure has been created, techniques must be defined that correlate the user-entered query statement to the set of items in the database, to determine the items to be returned to the user.
- Information extraction: extracts specific information to be normalized and entered into a structured database (DBMS). It focuses on very specific concepts and contains a transformation process that modifies the extracted information into a form compatible with the end structured database. This process is also called Automatic File Build.
- Text summarization: one application of information extraction, which extracts larger contextual constructs (e.g., sentences) that are combined to form a summary of an item.

2.1.1 History:

- Indexing (originally called cataloging) is the oldest technique for identifying the contents of items to assist in their retrieval.
- Subject indexing evolved into hierarchical subject indexing.
- Indexing creates a bibliographic citation in a structured file that references the original text.
- The citation contains information about the item, keywords for the subjects of the item, and a constrained-length free-text field used for an abstract/summary.
- Indexing is usually performed by professional indexers.
- Later developments: automatic indexing and full-text search.

2.1.2 Objectives:

- Represent the concepts within an item to facilitate the user's finding relevant information.
- The full-text searchable data structures for items in the Document File provide a new class of indexing called total document indexing: all of the words within the item are potential index descriptors of the subjects of the item.
- Item normalization takes all possible words in an item and transforms them into processing tokens used in defining the searchable representation of the item.
- Current systems can automatically weight the processing tokens based on their potential importance in defining the concepts in the item.
- Other objectives of indexing: ranking and item clustering.

2.2 Indexing Process:


- When an organization with multiple indexers decides to create a public or private index, some procedural decisions on how to create the index terms help the indexers and end users know what to expect in the index file.
- Scope of the indexing: defines what level of detail the subject index will contain, based on the usage scenarios of the end users.
- Linkage: the need to link index terms together in a single index for a particular concept. This is needed when multiple independent concepts are found within an item.

2.2.1 Scope of Indexing :


- When performed manually, the process of reliably and consistently determining the bibliographic terms that represent the concepts in an item is extremely difficult.
  - The vocabulary domains of the indexer and the author may differ.
  - This results in differing quality levels of indexing.
- Two factors are involved in deciding at what level to index the concepts in an item:
  - Exhaustivity: the extent to which the different concepts in the item are indexed.
  - Specificity: the preciseness of the index terms used.
- What portions of an item should be indexed: only the title, or title + abstract? (low precision and recall)
- Weighting of index terms is not common in manual indexing.
  - Weighting is the process of assigning an importance to an index term's use in an item.
  - The weight should represent the degree to which the concept associated with the index term is represented in the item.
  - The weight should help in discriminating the extent to which the concept is discussed in items in the database.
  - The manual process of assigning weights adds overhead on the indexer and requires a more complex data structure to store the weights.

2.2.2 Precoordination and Linkage :


- Linkage: whether linkages are available between index terms for an item. Linkages are used to correlate related attributes associated with concepts discussed in an item.
- Precoordination: the process of creating term linkages at index creation time.
- Postcoordination: coordinating terms at search time by ANDing index terms together, which only finds items whose index contains all of the search terms.
- Factors that must be determined in the linkage process are the number of terms that can be related, any ordering constraints on the linked terms, and any additional descriptors associated with the index terms.
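
Postcoordination can be illustrated with a minimal Python sketch. The item IDs, index terms, and the function name `postcoordinate` are hypothetical and chosen only for illustration; the point is that ANDing terms at search time returns only items indexed under every search term.

```python
# Hypothetical index: each item ID maps to the set of index terms assigned to it.
index = {
    "item1": {"oil", "drilling", "mexico"},
    "item2": {"oil", "refining"},
    "item3": {"drilling", "mexico"},
}

def postcoordinate(index, search_terms):
    """Return the IDs of items whose index terms include every search term (Boolean AND)."""
    wanted = set(search_terms)
    return sorted(item for item, terms in index.items() if wanted <= terms)

print(postcoordinate(index, ["oil", "drilling"]))  # ['item1']
```

With precoordination, by contrast, the relationship between terms would already be stored in the index when it was created rather than being computed per query.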

[Figure: Linkage of Index Terms]

2.3 Automatic Indexing :

2.3.1 Overview

- Automatic indexing is the capability to automatically determine the index terms to be assigned to an item.
  - Simplest case: total document indexing.
  - Complex case: emulate a human indexer and determine a limited number of index terms for the major concepts in the item.
- Human indexing vs. automatic indexing:
  - Advantage of human indexing: the ability to determine concept abstraction and judge the value of a concept.
  - Disadvantages of human indexing: cost, processing time, and consistency.
- Processing time of an item by a human indexer varies significantly based upon the indexer's knowledge of the concepts being indexed, the exhaustivity and specificity guidelines, and the amount and accuracy of preprocessing via Automatic File Build. It usually takes at least five minutes per item.
- Automatic indexing requires only a few seconds or less of computer time, depending on the size of the processor and the complexity of the algorithms that generate the index.
- If the indexing is performed automatically by an algorithm, the index term selection process is consistent.
- Types of automatic indexing: weighted and unweighted; indexing by term and indexing by concept.

Un-weighted Automatic Indexing:

- The existence of an index term in a document, and sometimes its word location(s), is kept as part of the searchable data structure.
- No attempt is made to discriminate between the value of the index terms in representing concepts in the item.
- It is not possible to tell the difference between the main topics in the item and a casual reference to a concept.
- Queries against unweighted systems are based on Boolean logic, and the items in the resultant Hit file are considered equal in value.

Weighted Automatic Indexing :

- An attempt is made to place a value on the index term's representation of its associated concept in the document.
- An index term's weight is based on a function associated with the frequency of occurrence of the term in the item.
  - Luhn postulated that the significance of a concept in an item is directly proportional to the frequency of use of the word associated with the concept in the document.
- Values for the index terms are normalized between zero and one.
- The higher the weight, the more the term represents a concept discussed in the item.
- The query process uses the weights, along with any weights assigned to terms in the query, to determine a rank value used in predicting the likelihood that an item satisfies the query.
- Thresholds, or a parameter specifying the maximum number of items to be returned, are used to bound the number of items returned to a user.
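
The weighting and ranking ideas above can be sketched in Python. This is only one possible scheme, under stated assumptions: weights are normalized by dividing each term's frequency by the highest frequency in the item, and the rank value is a simple sum of products of query and item weights. The function names `term_weights` and `rank` and the parameters `max_items` and `threshold` are illustrative, not taken from any particular system.

```python
from collections import Counter

def term_weights(tokens):
    """Frequency-based weights in (0, 1]; normalizing by the maximum frequency is an assumption."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

def rank(items, query_weights, max_items=10, threshold=0.0):
    """Score each item against the weighted query and return the best (item_id, score) pairs."""
    scores = []
    for item_id, weights in items.items():
        score = sum(qw * weights.get(term, 0.0) for term, qw in query_weights.items())
        if score > threshold:  # the threshold bounds what is returned
            scores.append((item_id, score))
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:max_items]  # max_items bounds the size of the result

items = {
    "a": term_weights(["oil", "oil", "gas"]),
    "b": term_weights(["gas", "gas", "oil"]),
}
print(rank(items, {"oil": 1.0}))  # [('a', 1.0), ('b', 0.5)]
```

Unlike the unweighted Boolean case, the two items here are not treated as equal: item "a" ranks first because "oil" is its most frequent term.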

Indexing by Term:

- The terms of the original item are used as the basis of the index process.
- Two major techniques: statistical and natural language.
- Statistical techniques:
  - Calculation of weights uses statistical information such as the frequency of occurrence of words and their distributions in the searchable database.
  - Examples: vector models and probabilistic models.
- Natural language processing:
  - Processes items at the morphological, lexical, semantic, syntactic, and discourse levels.
  - Each level uses information from the previous level to perform its additional analysis.
  - The analysis can identify events and event relationships.

Indexing by Concept:

- The basis for concept indexing is that there are many ways to express the same idea, and increased retrieval performance comes from using a single representation.
- Indexing by term treats each of these occurrences as a different index term and then uses thesauri or other query expansion techniques to expand a query to find the different ways the same thing has been represented.
- Concept indexing determines a canonical set of concepts based on a test set of terms and uses them as the basis for indexing all items.

2.4 Information Extraction & Summarization:


- There are two processes associated with information extraction:
  - Determination of facts to go into structured fields in a database. In this case only a subset of the important facts in an item may be identified and extracted.
  - Extraction of text that can be used to summarize an item. In summarization, all of the major concepts in the item should be represented in the summary.
- The process of extracting facts to go into indexes is called Automatic File Build. Its goal is to process incoming items and extract index terms that will go into a structured database.
- This differs from indexing in that its objective is to extract specific types of information rather than to understand all of the text of the document.
  - An Information Retrieval System's goal is to provide an in-depth representation of the total contents of an item.
  - An Information Extraction system only analyzes those portions of a document that potentially contain information relevant to the extraction criteria.
- The objective of data extraction is in most cases to update a structured database with additional facts. The updates may be from a controlled vocabulary or substrings from the item, as defined by the extraction rules.
- The term "slot" is used to define a particular category of information to be extracted. Slots are organized into templates or semantic frames.
- Information extraction requires multiple levels of analysis of the text of an item. It must understand the words and their context (discourse analysis).
- Recall refers to how much information was extracted from an item versus how much should have been extracted from the item. It shows the amount of correct and relevant data extracted versus the correct and relevant data in the item.
- Precision refers to how much information was extracted accurately versus the total information extracted.
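
As a rough illustration of recall and precision for extraction, the following Python sketch compares a set of extracted slot fillers with the set that should have been extracted. The slot names, the values, and the function name `extraction_scores` are invented for this example, and treating a filler as correct only on an exact match is a simplifying assumption.

```python
def extraction_scores(extracted, expected):
    """Return (recall, precision) for extracted slot fillers versus the expected ones."""
    extracted, expected = set(extracted), set(expected)
    correct = extracted & expected  # fillers that were extracted and are right
    recall = len(correct) / len(expected) if expected else 0.0
    precision = len(correct) / len(extracted) if extracted else 0.0
    return recall, precision

# Hypothetical (slot, value) pairs for one item.
expected = {("company", "Acme"), ("location", "Paris"), ("date", "1995"), ("person", "Smith")}
extracted = {("company", "Acme"), ("location", "Paris"), ("date", "1996")}

recall, precision = extraction_scores(extracted, expected)
print(recall)  # 0.5: two of the four expected fillers were found
```

Here precision is 2/3, because two of the three extracted fillers are correct.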

- Other metrics used are overgeneration and fallout.
  - Overgeneration measures the amount of irrelevant information that is extracted. This could be caused by templates filled on topics that are not intended to be extracted, or by slots that get filled with non-relevant data.
  - Fallout measures how much a system assigns incorrect slot fillers as the number of potential incorrect slot fillers increases.
- The goal of document summarization is to extract a summary of an item, maintaining the most important ideas while significantly reducing the size.
- Examples of summaries that are often part of any item are titles, tables of contents, and abstracts, with the abstract being the closest to a summary.
- The abstract can be used to represent the item for search purposes, or as a way for a user to determine the utility of an item without having to read the complete item.
- It is not feasible to automatically generate a coherent narrative summary of an item with proper discourse, abstraction and language usage.
- Restricting the domain of the item can significantly improve the quality of the output.
- The more restricted goal of much of the research is to find subsets of the item that can be extracted and concatenated (usually extracting at the sentence level) and that represent the most important concepts in the item.
- There is no guarantee of readability as a narrative abstract, and it is seldom achieved.
- Different algorithms produce different summaries. Just as different humans create different abstracts for the same item, the fact that automated techniques generate different summaries does not intrinsically imply major deficiencies in the summaries.
- Most automated algorithms approach summarization by calculating a score for each sentence and then extracting the sentences with the highest scores.

Kupiec et al. are pursuing a statistical classification approach based upon a training set, reducing the heuristics by focusing on a weighted combination of criteria to produce an optimal scoring scheme (Kupiec-95). They selected the following five feature sets as a basis for their algorithm:

- Sentence Length Feature: requires a sentence to be over five words in length.
- Fixed Phrase Feature: looks for the existence of phrase cues (e.g., "in conclusion").
- Paragraph Feature: places emphasis on the first ten and last five paragraphs in an item, and also on the location of the sentences within the paragraph.
- Thematic Word Feature: uses word frequency.
- Uppercase Word Feature: places emphasis on proper names and acronyms.
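
The general score-and-extract approach can be sketched in Python. This is not the trained statistical classifier of Kupiec et al.: the equal hand-set feature values, the cue phrase list, the choice of the five most frequent longer words as thematic words, and the omission of the paragraph feature are all simplifying assumptions made for illustration.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary")  # assumed list of phrase cues

def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def score_sentence(sentence, thematic_words):
    """Add one point per feature that fires (hand-set values, not learned ones)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    score = 0.0
    if len(words) > 5:  # sentence length feature
        score += 1.0
    if any(cue in sentence.lower() for cue in CUE_PHRASES):  # fixed phrase feature
        score += 1.0
    if any(w.lower() in thematic_words for w in words):  # thematic word feature
        score += 1.0
    if any(w[0].isupper() for w in words[1:]):  # uppercase word feature
        score += 1.0
    return score

def summarize(text, n=2):
    """Return the n highest-scoring sentences in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w.lower() for w in re.findall(r"[A-Za-z]+", text) if len(w) > 3)
    thematic = {w for w, _ in freq.most_common(5)}
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -score_sentence(sentences[i], thematic))
    return [sentences[i] for i in sorted(ranked[:n])]

text = ("Indexing maps items to terms. In conclusion, automatic indexing of items "
        "is faster than manual indexing of items. Yes.")
print(summarize(text, n=1))
```

As the notes say, the extracted sentences are simply concatenated, so nothing in this sketch guarantees that the result reads as a narrative abstract.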

DATA STRUCTURES:

Introduction to Data Structure in IR :


- Stemming
  - Porter Stemming Algorithm
  - Dictionary Look-up Stemmers
  - Successor Stemmers
- Major Data Structures
  - Inverted File Structures
  - N-Gram Data Structures
  - PAT Data Structures
  - Signature File Structure
  - Hypertext Data Structures

Introduction to Data Structure

- Two aspects of data structures from an IRS perspective:
  - The ability to represent concepts and their relationships.
  - The support for locating those concepts.
- Two major data structures:
  - Document manager: stores and manages the received items in their normalized form.
  - Document search manager: contains the processing tokens and associated data to support search.
- The results of the search are references to the items, which are passed to the Document Manager for retrieval.
- Only the data structures that support the search function are dealt with here.

Major Data Structures:

- Before placing data in the searchable data structure, a transformation of the data called stemming is applied.
- Conflation is the term used for mapping multiple morphological variants to a single representation called the stem/root.
- Stemming reduces tokens to the root form of words to recognize morphological variation: "computer", "computational", and "computation" are all reduced to the same token, "compute".
- Correct morphological analysis is language-specific and can be complex.
- Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
- Stemming provides compression, with savings in storage and processing.
- Stemming improves recall.
- The stemming process has to categorize a word prior to making the decision to stem it: proper names and acronyms should not be stemmed, as they are not related to a common core concept.
- Stemming in NLP causes loss of information: tense information is lost, which is needed to determine whether a concept being indexed (e.g., "economic support") occurred in the past or will occur in the future.

The Porter algorithm:


- The Porter algorithm consists of a set of condition/action rules.
- The conditions fall into three classes:
  - Conditions on the stem
  - Conditions on the suffix
  - Conditions on rules

Conditions on the stem :


1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. A stem has the form [C](VC)^m[V], where C is a sequence of consonants, V is a sequence of vowels, and the bracketed parts are optional.

   Measure   Examples
   m = 0     TR, EE, TREE, Y, BY
   m = 1     TROUBLE, OATS, TREES, IVY
   m = 2     TROUBLES, PRIVATE, OATEN

2. *<X> --- the stem ends with a given letter X
3. *v* --- the stem contains a vowel
4. *d --- the stem ends in a double consonant
5. *o --- the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x or y
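
The measure m can be computed by counting how often a vowel run is followed by a consonant. The Python sketch below is illustrative only; the function names are hypothetical, and it assumes Porter's usual convention that "y" counts as a vowel when it follows a consonant and as a consonant otherwise.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant; 'y' is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Count the VC sequences in [C](VC)^m[V], i.e. the vowel-to-consonant transitions."""
    stem = stem.lower()
    m = 0
    previous_was_vowel = False
    for i in range(len(stem)):
        consonant = is_consonant(stem, i)
        if consonant and previous_was_vowel:
            m += 1
        previous_was_vowel = not consonant
    return m

for word in ["TREE", "BY", "TROUBLE", "IVY", "TROUBLES", "OATEN"]:
    print(word, measure(word))
```

Running this reproduces the values in the table above: TREE and BY give 0, TROUBLE and IVY give 1, and TROUBLES and OATEN give 2.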

Suffix conditions take the form:


(current_suffix == pattern)

Conditions on rules :
The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply.

    {
        step1a(word);
        step1b(stem);
        if (the second or third rule of step 1b was used)
            step1b1(stem);
        step1c(stem);
        step2(stem);
        step3(stem);
        step4(stem);
        step5a(stem);
        step5b(stem);
    }
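
To illustrate "only one rule from a step can apply", here is a minimal Python sketch of a single step. The four suffix rules are those of step 1a in Porter's published algorithm (SSES to SS, IES to I, SS to SS, S to nothing); the notes above do not list them, so treat the rule table and the function name as assumptions of this example.

```python
# (suffix, replacement) pairs, ordered so that the longest matching suffix is tried first.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def step1a(word):
    """Apply the first rule whose suffix matches, then stop: one rule per step."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats"]:
    print(word, "->", step1a(word))
```

The "SS to SS" rule changes nothing, but because it matches first it stops the "S" rule from turning "caress" into "cares".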

Table / Dictionary Look up:


- Store a table of all index terms and their stems.
- The original term or a stemmed version of the term is looked up in a dictionary and replaced by the stem that best represents it.
- Implemented in the INQUERY and RetrievalWare systems.
- Kstem, a table look-up algorithm implemented in INQUERY, uses the following six data files:
  - Dictionary of words (lexicon)
  - Supplemental list of words for the dictionary
  - Exceptions list for those words that should retain an "e" at the end (e.g., "suites" to "suite" but "suited" to "suit")
  - Direct_Conflation: allows definition of direct conflation via word pairs that override the stemming algorithm
  - Country_Nationality: conflations between nationalities and countries ("British" maps to "Britain")
  - Proper Nouns: a list of proper nouns that should not be stemmed
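
A table look-up stemmer can be sketched in a few lines of Python. This is not the INQUERY implementation: the three small tables below are hypothetical stand-ins for the much larger data files listed above, and the order in which they are consulted is an assumption.

```python
# Tiny hypothetical stand-ins for the real data files.
STEM_TABLE = {"suites": "suite", "suited": "suit", "indexes": "index"}
COUNTRY_NATIONALITY = {"british": "britain"}
PROPER_NOUNS = {"Luhn", "Porter"}

def lookup_stem(term):
    """Replace a term by its stored stem; proper nouns and unknown terms pass through."""
    if term in PROPER_NOUNS:  # proper nouns are never stemmed
        return term
    key = term.lower()
    if key in COUNTRY_NATIONALITY:  # nationality-to-country conflation
        return COUNTRY_NATIONALITY[key]
    return STEM_TABLE.get(key, key)

for term in ["suites", "suited", "British", "Porter"]:
    print(term, "->", lookup_stem(term))
```

The trade-off against rule-based stemming is that every conflation is explicit and auditable, at the cost of building and storing the tables.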

Successor Stemmer :

- Successor stemmers are based on the length of prefixes that optimally stem expansions of additional suffixes.
- The algorithm investigates word and morpheme boundaries based on the distribution of phonemes that distinguish one word from another.
- The process determines the successor variety for a word, uses this information to divide the word into segments, and selects one of the segments as the stem.
- The successor variety of a segment of a word in a set of words is the number of distinct letters that occupy the segment length plus one character.
- Example: the successor variety for the first 3 letters of a 5-letter word is the number of words that have the same first 3 letters but a different 4th letter, plus one.
- The successor variety of any prefix of a word is the number of children associated with the node in the symbol tree representing that prefix.
- In the accompanying symbol tree example, the successor variety for the first letter "b" is three, and the successor variety for the prefix "ba" is two.
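
Successor variety can be computed directly from a word set without building the symbol tree. In the Python sketch below, the word list is an assumption: it was chosen because it yields the values quoted above (three for "b", two for "ba"), and the function name is hypothetical.

```python
def successor_variety(prefix, words):
    """Number of distinct letters that follow the prefix in the word set."""
    next_letters = {
        word[len(prefix)]
        for word in words
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(next_letters)

# Assumed word set for illustration.
words = ["bag", "barn", "bring", "both", "box", "bottle"]

print(successor_variety("b", words))   # 3 (a, r, o)
print(successor_variety("ba", words))  # 2 (g, r)
```

Each distinct next letter corresponds to one child of the prefix's node in the symbol tree, which is why the two definitions agree.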

Affix Removal Stemmers:


- Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. Example rules (Harman 1991):
  - If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y"
  - If a word ends in "es" but not "aes", "ees" or "oes", then "es" -> "e"
  - If a word ends in "s" but not "us" or "ss", then "s" -> NULL
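
The three suffix rules attributed to Harman above can be written as a short Python function. The function name is hypothetical, and one behavior is an assumption of this sketch: the rules are tried in order, the first one whose full condition holds is applied, and a word that fails one rule's exclusion list falls through to the next rule.

```python
def s_stem(word):
    """Apply the first applicable plural-removal rule; otherwise return the word unchanged."""
    w = word.lower()
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    if w.endswith("es") and not w.endswith(("aes", "ees", "oes")):
        return w[:-2] + "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]
    return w

for word in ["ponies", "pages", "cats", "class", "corpus"]:
    print(word, "->", s_stem(word))
```

The exclusion lists are what keep words such as "class" and "corpus" from losing their final "s".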

Related Data Structures for PT Searchable Files:


- Inverted file system: minimizes secondary storage access when multiple search terms are applied across the total database.
- N-gram: breaks processing tokens into smaller string units and uses the token fragments for search. It improves efficiencies and conceptual manipulation over full-word inversion.
- PAT trees and arrays: view the text of an item as a single long stream rather than a juxtaposition of words.
- Signature file: provides fast elimination of non-relevant items, reducing the searchable items to a manageable subset.
- Hypertext: manually or automatically created embedded links within one item to a related item.
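
As a small illustration of the n-gram idea, the Python sketch below breaks a processing token into overlapping fixed-length fragments. The fragment length of three and the function name are assumptions; real systems may also add word-boundary markers, which are omitted here.

```python
def ngrams(token, n=3):
    """Return the overlapping n-character fragments of a token (empty if the token is shorter than n)."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]

print(ngrams("retrieval"))
# ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']
```

The fragments, rather than whole words, then become the units that are stored and searched.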

Inverted File Structure:


- Commonly used in DBMS and IR.
- For each word, a list of the documents in which the word is found is stored.
- Composed of three basic files:
  - Document file
  - Inversion lists: contain the document identifiers
  - Dictionary: lists all the unique words, along with other information used in query optimization (e.g., the length of inversion lists)
- The inversion list contains the document ID for each document in which the word is found.
- To support proximity, contiguous word phrases, and term weighting, all occurrences of a word are stored in the inversion list along with the word position.
- For systems that support ranking, the list is re-organized into rank order.
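
A minimal in-memory version of this structure can be sketched in Python. The document texts, the function names, and the choice to store the number of documents per word in the dictionary are assumptions made for illustration; a real system would keep these files on secondary storage.

```python
from collections import defaultdict

def build_inverted_file(documents):
    """Build (dictionary, inversion_lists) from a mapping of doc ID to text."""
    inversion_lists = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            inversion_lists[word].append((doc_id, position))  # keep the word position
    # Dictionary: each unique word with the number of documents it occurs in.
    dictionary = {
        word: len({doc_id for doc_id, _ in postings})
        for word, postings in inversion_lists.items()
    }
    return dictionary, dict(inversion_lists)

def phrase_search(inversion_lists, first, second):
    """Find documents where `second` occurs immediately after `first`."""
    following = set(inversion_lists.get(second, []))
    return sorted({
        doc_id
        for doc_id, position in inversion_lists.get(first, [])
        if (doc_id, position + 1) in following
    })

documents = {1: "oil wells in mexico", 2: "mexico exports oil", 3: "oil wells"}
dictionary, inversion_lists = build_inverted_file(documents)
print(dictionary["oil"])                               # 3
print(phrase_search(inversion_lists, "oil", "wells"))  # [1, 3]
```

Storing positions alongside document IDs is what makes the contiguous phrase search possible; with document IDs alone, only Boolean term matching could be supported.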

