Cataloging and Indexing Techniques

The document discusses the history and objectives of indexing, the indexing process, automatic indexing, and information extraction and summarization. Indexing involves determining terms to represent items to facilitate search and retrieval. It has evolved from manual subject indexing to include automatic indexing using terms or concepts. Automatic indexing aims to index items faster while maintaining consistency. Information extraction focuses on extracting specific facts or text like summaries rather than fully representing an item.

Uploaded by

7killers4u

UNIT-II

Cataloging and Indexing

2.1 History and Objectives of Indexing
2.2 Indexing Process
2.3 Automatic Indexing
2.4 Information Extraction

2.1 History and Objectives of Indexing:

Overview:

• The indexing process determines which terms (concepts) can represent a particular item.
• The transformation from the received item to the searchable data structure is called indexing. It can be manual or automatic.
• Search: once the searchable data structure has been created, techniques must be defined that correlate the user-entered query statement to the set of items in the database, in order to determine the items to be returned to the user.
• Information extraction: extracts specific information to be normalized and entered into a structured database (DBMS).
  ◦ It focuses on very specific concepts and includes a transformation process that converts the extracted information into a form compatible with the target structured database.
  ◦ This process is also called Automatic File Build.
• Text summarization: one application of information extraction, which extracts larger contextual constructs (e.g., sentences) that are combined to form a summary of an item.

2.1.1 History:
• Indexing (originally called cataloging) is the oldest technique for identifying the contents of items to assist in their retrieval.
• Subject indexing evolved into hierarchical subject indexing.
• Indexing creates a bibliographic citation in a structured file that references the original text.
  ◦ The citation holds information about the item, keywords describing the subjects of the item, and a constrained-length free-text field used for an abstract or summary.
  ◦ It is usually produced by professional indexers.
• Later developments: automatic indexing and full-text search.

2.1.2 Objectives:
• Represent the concepts within an item to help users find relevant information.
• The full-text searchable data structures for items in the Document File provide a new class of indexing called total document indexing.
  ◦ All of the words within the item are potential index descriptors of the subjects of the item.
• Item normalization takes all possible words in an item and transforms them into processing tokens, which define the searchable representation of the item.
• Current systems can automatically weight the processing tokens based on their potential importance in defining the concepts in the item.
• Other objectives of indexing: ranking and item clustering.

2.2 Indexing Process:

• When an organization with multiple indexers decides to create a public or private index, procedural decisions on how to create the index terms help the indexers and end users know what to expect in the index file.
• Scope of the indexing: defines what level of detail the subject index will contain.
  ◦ It is based on the usage scenarios of the end users.
• Linkage: the need to link index terms together in a single index for a particular concept.
  ◦ It is needed when there are multiple independent concepts within an item.

2.2.1 Scope of Indexing :

• When performed manually, reliably and consistently determining the bibliographic terms that represent the concepts in an item is extremely difficult.
  ◦ The vocabulary domains of the indexer and the author may differ.
  ◦ This results in varying quality of indexing.
• Two factors are involved in deciding at what level to index the concepts in an item:
  ◦ Exhaustivity: the extent to which the different concepts in the item are indexed.
  ◦ Specificity: the preciseness of the index terms used.
• Which portions of an item should be indexed?
  ◦ Indexing only the title, or only the title and abstract, gives low precision and recall.
• Weighting of index terms is not common in manual indexing.
  ◦ Weighting is the process of assigning an importance to an index term's use in an item.
  ◦ The weight should represent the degree to which the concept associated with the index term is represented in the item.
  ◦ The weight should help discriminate the extent to which the concept is discussed across the items in the database.
  ◦ Assigning weights manually adds overhead for the indexer and requires a more complex data structure to store the weights.

2.2.2 Precoordination and Linkage :

• Linkage concerns whether links are available between the index terms for an item.
  ◦ Linkages are used to correlate related attributes associated with concepts discussed in an item.
• Precoordination: the process of creating term linkages at index creation time.
• Postcoordination: coordinating terms at search time by ANDing index terms together, which finds only those index entries that contain all of the search terms.
• Factors that must be determined in the linkage process are the number of terms that can be related, any ordering constraints on the linked terms, and any additional descriptors associated with the index terms.
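The difference between the two coordination styles can be sketched in a few lines of Python. This is only an illustration under assumptions: the item, its terms, and the function names `postcoordinate` and `precoordinated` are all hypothetical. The item is imagined to discuss oil wells in Mexico and refineries in Peru.

```python
# Hypothetical item that discusses oil wells in Mexico and refineries in Peru.
# Flat index: unlinked terms, coordinated only at search time.
flat_terms = {"item1": {"oil wells", "mexico", "refineries", "peru"}}
# Linked index: term pairs related at index creation time (precoordination).
linked_terms = {"item1": {("oil wells", "mexico"), ("refineries", "peru")}}


def postcoordinate(index, *terms):
    """Boolean AND at search time: every search term must be in the item's term set."""
    wanted = set(terms)
    return sorted(item for item, item_terms in index.items() if wanted <= item_terms)


def precoordinated(index, *terms):
    """Match only when all search terms appear together in one stored linkage."""
    wanted = set(terms)
    return sorted(
        item
        for item, links in index.items()
        if any(wanted <= set(link) for link in links)
    )


print(postcoordinate(flat_terms, "refineries", "mexico"))    # ['item1']
print(precoordinated(linked_terms, "refineries", "mexico"))  # []
```

The postcoordinated search returns the item even though it never relates refineries to Mexico, while the linked index rejects that false association.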

Linkage of Index Terms :

2.3 Automatic Indexing:

2.3.1 Overview

• Automatic indexing is the capability to automatically determine the index terms to be assigned to an item.
  ◦ Simplest case: total document indexing.
  ◦ Complex case: emulate a human indexer and determine a limited number of index terms for the major concepts in the item.
• Human indexing versus automatic indexing:
  ◦ Advantage of human indexing: the ability to determine concept abstraction and judge the value of a concept.
  ◦ Disadvantages of human indexing: cost, processing time, and consistency.
• The processing time of an item by a human indexer varies significantly with the indexer's knowledge of the concepts being indexed, the exhaustivity and specificity guidelines, and the amount and accuracy of preprocessing via Automatic File Build. It usually takes at least five minutes per item.
• Automatic indexing requires only a few seconds or less of computer time, depending on the processor and the complexity of the algorithms that generate the index.
• When indexing is performed automatically by an algorithm, the index term selection process is consistent.
• Types of automatic indexing: weighted and unweighted; indexing by term and indexing by concept.

Unweighted Automatic Indexing:

• The existence of an index term in a document, and sometimes its word location(s), is kept as part of the searchable data structure.
• No attempt is made to discriminate between the value of the index terms in representing concepts in the item.
  ◦ It is not possible to tell the difference between the main topics of the item and a casual reference to a concept.
• Queries against unweighted systems are based on Boolean logic, and the items in the resulting Hit file are considered equal in value.

Weighted Automatic Indexing:

• An attempt is made to place a value on the index term's representation of its associated concept in the document.
• An index term's weight is based on a function of the frequency of occurrence of the term in the item.
  ◦ Luhn postulated that the significance of a concept in an item is directly proportional to the frequency of use of the word associated with that concept in the document.
• Values for the index terms are normalized between zero and one.
  ◦ The higher the weight, the more the term represents a concept discussed in the item.
• The query process uses these weights, along with any weights assigned to terms in the query, to determine a rank value that predicts the likelihood that an item satisfies the query.
• Thresholds, or a parameter specifying the maximum number of items to be returned, bound the number of items returned to the user.
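A minimal sketch of weighted indexing and ranking follows. The notes only say that the weight is "a function" of term frequency, so the specific choice here (frequency divided by the highest frequency in the item) is an assumption, as are the function names `index_item` and `rank`, the sample items, and the query weights.

```python
from collections import Counter


def index_item(text):
    """Weight each token by its frequency divided by the item's highest
    frequency, so every weight falls between zero and one (assumed formula)."""
    counts = Counter(text.lower().split())
    top = max(counts.values())
    return {term: n / top for term, n in counts.items()}


def rank(items, query, threshold=0.0, max_items=10):
    """Rank value = sum of (query weight * item weight) over the query terms.
    Items at or below the threshold are dropped; at most max_items are returned."""
    scored = []
    for item_id, weights in items.items():
        score = sum(qw * weights.get(term, 0.0) for term, qw in query.items())
        if score > threshold:
            scored.append((score, item_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(item_id, round(score, 3)) for score, item_id in scored[:max_items]]


# Hypothetical items: d1 is mainly about oil, d2 mentions it only in passing.
docs = {
    "d1": "oil oil oil wells mexico",
    "d2": "travel travel travel guide mexico oil",
    "d3": "computer hardware prices",
}
items = {doc_id: index_item(text) for doc_id, text in docs.items()}
print(rank(items, {"oil": 1.0, "mexico": 0.5}))  # [('d1', 1.167), ('d2', 0.5)]
```

Unlike a Boolean hit file, the result distinguishes the item whose main topic is oil from the one that makes a casual reference to it.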

Indexing by Term:

• The terms of the original item are used as the basis of the indexing process.
• Two major techniques: statistical and natural language.
• Statistical techniques:
  ◦ Weights are calculated from statistical information such as the frequency of occurrence of words and their distribution in the searchable database.
  ◦ Examples: vector models and probabilistic models.
• Natural language processing:
  ◦ Items are processed at the morphological, lexical, semantic, syntactic, and discourse levels.
  ◦ Each level uses information from the previous level to perform its additional analysis.
  ◦ The analysis can identify events and event relationships.

Indexing by Concept:

• The basis for concept indexing is that there are many ways to express the same idea, and that retrieval performance increases when a single representation is used.
• Indexing by term treats each of these occurrences as a different index term and then relies on thesauri or other query expansion techniques to expand a query to find the different ways the same thing has been represented.
• Concept indexing determines a canonical set of concepts based on a test set of terms and uses them as the basis for indexing all items.
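The contrast can be illustrated with a deliberately simplified sketch. Real concept indexing derives its concept set from a test set of terms; here a small hand-built lookup table stands in for that step, and the table `CONCEPT_OF` and both function names are hypothetical.

```python
# Hypothetical hand-built mapping from surface terms to canonical concepts.
# It stands in for a concept set that a real system would derive from data.
CONCEPT_OF = {
    "car": "AUTOMOBILE",
    "automobile": "AUTOMOBILE",
    "auto": "AUTOMOBILE",
    "physician": "DOCTOR",
    "doctor": "DOCTOR",
}


def index_by_term(text):
    """Every distinct word is its own index term."""
    return set(text.lower().split())


def index_by_concept(text):
    """Words are replaced by their canonical concept; unmapped words are dropped."""
    return {CONCEPT_OF[t] for t in text.lower().split() if t in CONCEPT_OF}


item = "the physician bought an automobile"
query = "doctor car"

print(sorted(index_by_term(item) & index_by_term(query)))        # []
print(sorted(index_by_concept(item) & index_by_concept(query)))  # ['AUTOMOBILE', 'DOCTOR']
```

With term indexing the query shares no index term with the item unless it is expanded first; with concept indexing both map to the same representation.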

2.4 Information Extraction & Summarization:

There are two processes associated with information extraction:

• Determination of facts to go into structured fields in a database. In this case only a subset of the important facts in an item may be identified and extracted.
• Extraction of text that can be used to summarize an item. In summarization, all of the major concepts in the item should be represented in the summary.

The process of extracting facts to go into indexes is called Automatic File Build.

• Its goal is to process incoming items and extract index terms that will go into a structured database.
• This differs from indexing in that its objective is to extract specific types of information rather than to understand all of the text of the document.
  ◦ An Information Retrieval System's goal is to provide an in-depth representation of the total contents of an item.
  ◦ An Information Extraction system analyzes only those portions of a document that potentially contain information relevant to the extraction criteria.
• The objective of data extraction is in most cases to update a structured database with additional facts. The updates may come from a controlled vocabulary or be substrings from the item, as defined by the extraction rules.
• The term "slot" is used to define a particular category of information to be extracted. Slots are organized into templates or semantic frames.
• Information extraction requires multiple levels of analysis of the text of an item. It must understand the words and their context (discourse analysis).
• Recall refers to how much information was extracted from an item versus how much should have been extracted. It is the amount of correct and relevant data extracted relative to the correct and relevant data in the item.
• Precision refers to how much information was extracted accurately versus the total information extracted.
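The two measures can be made concrete with a small calculation over slot fillers. This is a sketch under assumptions: the template, its slot names, the fillers, and the function name `extraction_scores` are all hypothetical, and a filler counts as correct only when it matches the expected value exactly.

```python
def extraction_scores(extracted, expected):
    """Return (recall, precision) for slot fillers given as {slot: filler} dicts.

    recall    = correct fillers extracted / fillers that should have been extracted
    precision = correct fillers extracted / all fillers extracted
    """
    correct = sum(1 for slot, value in extracted.items() if expected.get(slot) == value)
    recall = correct / len(expected) if expected else 0.0
    precision = correct / len(extracted) if extracted else 0.0
    return recall, precision


# Hypothetical template: four slots should be filled, three were, two correctly.
expected = {"company": "Acme", "event": "merger", "date": "1995-03-01", "partner": "Globex"}
extracted = {"company": "Acme", "event": "merger", "date": "1995-04-01"}

recall, precision = extraction_scores(extracted, expected)
print(round(recall, 3), round(precision, 3))  # 0.5 0.667
```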

• Other metrics used are overgeneration and fallout.
  ◦ Overgeneration measures the amount of irrelevant information that is extracted. It can be caused by templates filled on topics that were not intended to be extracted, or by slots that are filled with non-relevant data.
  ◦ Fallout measures how often a system assigns incorrect slot fillers as the number of potential incorrect slot fillers increases.
• The goal of document summarization is to extract a summary of an item that maintains the most important ideas while significantly reducing the size.
• Examples of summaries that are often part of an item are titles, tables of contents, and abstracts, with the abstract being the closest to a summary.
• The abstract can be used to represent the item for search purposes, or as a way for a user to determine the utility of an item without having to read the complete item.
• It is not feasible to automatically generate a coherent narrative summary of an item with proper discourse, abstraction, and language usage.
  ◦ Restricting the domain of the item can significantly improve the quality of the output.
• The more restricted goal of much of the research is to find subsets of the item (usually at the sentence level) that can be extracted and concatenated, and that represent the most important concepts in the item.
  ◦ There is no guarantee of readability as a narrative abstract, and it is seldom achieved.
• Different algorithms produce different summaries. Just as different humans create different abstracts for the same item, the fact that automated techniques generate different summaries does not in itself imply major deficiencies in those summaries.
• Most automated algorithms approach summarization by calculating a score for each sentence and then extracting the sentences with the highest scores.

Kupiec et al. pursue a statistical classification approach based on a training set, reducing the heuristics by focusing on a weighted combination of criteria to produce an optimal scoring scheme (Kupiec-95). They selected the following five feature sets as the basis for their algorithm:

• Sentence Length Feature: requires a sentence to be over five words in length.
• Fixed Phrase Feature: looks for the existence of phrase cues (e.g., "in conclusion").
• Paragraph Feature: places emphasis on the first ten and last five paragraphs in an item, and on the location of the sentences within the paragraph.
• Thematic Word Feature: uses word frequency.
• Uppercase Word Feature: places emphasis on proper names and acronyms.
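The score-and-extract idea can be sketched as follows. This is not the Kupiec et al. classifier, which learns its scoring from a training set; the weights in `WEIGHTS` are made up, the cue list is assumed, and the paragraph feature is reduced to rewarding the first and last sentence of a single block of text. The function names are hypothetical.

```python
import re
from collections import Counter

CUE_PHRASES = ("in conclusion", "in summary")  # assumed cue list
# Hand-picked weights for illustration only; a trained system would learn these.
WEIGHTS = {"length": 1.0, "cue": 2.0, "position": 1.0, "thematic": 1.0, "uppercase": 0.5}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(text, n_sentences=2, n_thematic=3):
    """Score every sentence on five binary features and return the top
    n_sentences in their original order."""
    sentences = split_sentences(text)
    words = re.findall(r"[A-Za-z]+", text.lower())
    # Thematic words: the most frequent words longer than three letters.
    frequent = Counter(w for w in words if len(w) > 3).most_common(n_thematic)
    thematic = {w for w, _ in frequent}

    scored = []
    for pos, sentence in enumerate(sentences):
        tokens = re.findall(r"[A-Za-z]+", sentence)
        features = {
            "length": len(tokens) > 5,
            "cue": any(cue in sentence.lower() for cue in CUE_PHRASES),
            "position": pos == 0 or pos == len(sentences) - 1,
            "thematic": any(t.lower() in thematic for t in tokens),
            "uppercase": any(t.isupper() and len(t) > 1 for t in tokens),
        }
        score = sum(WEIGHTS[name] for name, present in features.items() if present)
        scored.append((score, pos, sentence))

    best = sorted(scored, key=lambda entry: (-entry[0], entry[1]))[:n_sentences]
    return [sentence for _, _, sentence in sorted(best, key=lambda entry: entry[1])]
```

Because the output is a concatenation of extracted sentences, it carries the most highly scored content but, as noted above, with no guarantee that it reads as a narrative.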

DATA STRUCTURES:

Introduction to Data Structure in IR :

• Stemming
  ◦ Porter Stemming Algorithm
  ◦ Dictionary Look-up Stemmers
  ◦ Successor Stemmers
• Major Data Structures
  ◦ Inverted File Structures
  ◦ N-Gram Data Structures
  ◦ PAT Data Structures
  ◦ Signature File Structure
  ◦ Hypertext Data Structures

Introduction to Data Structure:

• Two aspects of data structures matter from the perspective of an Information Retrieval System (IRS):
  ◦ The ability to represent concepts and their relationships.
  ◦ The support provided for locating those concepts.
• Two major data structures:
  ◦ The Document Manager stores and manages the received items in their normalized form.
  ◦ The Document Search Manager contains the processing tokens and associated data to support search.
• The results of a search are references to the items, which are passed to the Document Manager for retrieval.
• This unit deals with the data structures that support the search function.

Major Data Structures:

• Before data is placed in the searchable data structure, a transformation called stemming is applied.
• Conflation is the term for mapping multiple morphological variants to a single representation, called the stem or root.
• Stemming reduces tokens to the root form of words in order to recognize morphological variation.
  ◦ For example, "computer", "computational", and "computation" are all reduced to the same token, "compute".
• Correct morphological analysis is language-specific and can be complex.
• Stemming blindly strips off known affixes (prefixes and suffixes) in an iterative fashion.
• Stemming provides compression, with savings in storage and processing.
• Stemming improves recall.
• The stemming process has to categorize a word before deciding whether to stem it.
  ◦ Proper names and acronyms should not be stemmed, because they are not related to a common core concept.
• In natural language processing, stemming causes a loss of information.
  ◦ Tense information is lost. For example, when a concept such as "economic support" is indexed, tense is needed to determine whether it occurred in the past or will occur in the future.

The Porter algorithm:

The Porter algorithm consists of a set of condition/action rules. The conditions fall into three classes:

• Conditions on the stem
• Conditions on the suffix
• Conditions on rules

Conditions on the stem :

1. The measure, denoted m, of a stem is based on its alternating vowel-consonant sequences. Every stem can be written in the form [C](VC)^m[V], where C is a sequence of consonants, V is a sequence of vowels, square brackets mark optional parts, and m is the number of VC repetitions.

   Measure    Examples
   m = 0      TR, EE, TREE, Y, BY
   m = 1      TROUBLE, OATS, TREES, IVY
   m = 2      TROUBLES, PRIVATE, OATEN

2. *<X>: the stem ends with a given letter X.
3. *v*: the stem contains a vowel.
4. *d: the stem ends in a double consonant.
5. *o: the stem ends with a consonant-vowel-consonant sequence, where the final consonant is not w, x, or y.
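The stem conditions above can be checked with a short sketch. It relies on the vowel definition from Porter's published algorithm (A, E, I, O, U, plus Y when it follows a consonant); the function names are hypothetical and only the conditions are implemented, not the rules that use them.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant. 'y' counts as a vowel only when
    it is preceded by a consonant. Expects a lowercase word."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Return m for a stem written as [C](VC)^m[V]."""
    stem = stem.lower()
    pattern = "".join("c" if is_consonant(stem, i) else "v" for i in range(len(stem)))
    # Each vowel-to-consonant transition closes one VC block.
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


def contains_vowel(stem):
    """Condition *v*: the stem contains a vowel."""
    stem = stem.lower()
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def ends_double_consonant(stem):
    """Condition *d: the stem ends in a double consonant."""
    stem = stem.lower()
    return len(stem) >= 2 and stem[-1] == stem[-2] and is_consonant(stem, len(stem) - 1)


def ends_cvc(stem):
    """Condition *o: consonant-vowel-consonant ending, final consonant not w, x or y."""
    stem = stem.lower()
    n = len(stem)
    return (
        n >= 3
        and is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


for word in ("TREE", "BY", "TROUBLE", "IVY", "PRIVATE", "OATEN"):
    print(word, measure(word))
```

Running the loop reproduces the measure values listed in the table for those six words.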

Suffix conditions take the form:

(current_suffix == pattern)

Conditions on rules :
The rules are divided into steps. The rules in a step are examined in sequence, and only one rule from a step can apply.

{
    step1a(word);
    step1b(stem);
    if (the second or third rule of step 1b was used)
        step1b1(stem);
    step1c(stem);
    step2(stem);
    step3(stem);
    step4(stem);
    step5a(stem);
    step5b(stem);
}

Table / Dictionary Look up:

• Store a table of all index terms and their stems.
• The original term, or a stemmed version of the term, is looked up in a dictionary and replaced by the stem that best represents it.
• Implemented in the INQUERY and RetrievalWare systems.
• Kstem, the table look-up algorithm implemented in INQUERY, uses the following six data files:
  ◦ Dictionary of words (lexicon)
  ◦ Supplemental list of words for the dictionary
  ◦ Exceptions list for those words that should retain an "e" at the end (e.g., "suites" to "suite" but "suited" to "suit")
  ◦ Direct_Conflation: allows definition of direct conflation via word pairs that override the stemming algorithm
  ◦ Country_Nationality: conflations between nationalities and countries ("British" maps to "Britain")
  ◦ Proper Nouns: a list of proper nouns that should not be stemmed
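How such tables might cooperate can be sketched as below. This is not the INQUERY implementation: every table is a tiny hypothetical stand-in for the data files listed above, the order in which the tables are consulted is an assumption, and the exceptions list is modeled simply as words that lose only their final "s".

```python
# Tiny hypothetical stand-ins for the data files; not the INQUERY contents.
LEXICON = {"suite", "suit", "compute", "walk"}
KEEP_FINAL_E = {"suites"}                      # exceptions: drop only the final "s"
DIRECT_CONFLATION = {"ran": "run", "mice": "mouse"}
COUNTRY_NATIONALITY = {"british": "britain", "french": "france"}
PROPER_NOUNS = {"porter", "harman"}
SUFFIXES = ("ing", "ed", "es", "s")


def lookup_stem(word):
    """Return a stem for `word` using table look-ups (assumed precedence)."""
    w = word.lower()
    if w in PROPER_NOUNS:                      # proper nouns are never stemmed
        return w
    if w in DIRECT_CONFLATION:                 # explicit pairs override everything else
        return DIRECT_CONFLATION[w]
    if w in COUNTRY_NATIONALITY:
        return COUNTRY_NATIONALITY[w]
    if w in KEEP_FINAL_E:                      # retain the "e" at the end
        return w[:-1]
    if w in LEXICON:                           # already a known root form
        return w
    for suffix in SUFFIXES:
        if w.endswith(suffix):
            base = w[: -len(suffix)]
            # Accept a candidate only if the dictionary confirms it is a word.
            for candidate in (base, base + "e"):
                if candidate in LEXICON:
                    return candidate
    return w                                   # unknown word: leave unchanged


for word in ("suites", "suited", "British", "mice", "computing", "Harman"):
    print(word, "->", lookup_stem(word))
```

The dictionary check is what separates this approach from blind affix stripping: a suffix is removed only when the result is a known word.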

Successor Stemmer :
• A successor stemmer is based on the length of prefixes that optimally stem expansions of additional suffixes.
• The algorithm investigates word and morpheme boundaries based on the distribution of phonemes that distinguish one word from another.
• The process determines the successor variety for a word, uses this information to divide the word into segments, and selects one of the segments as the stem.
• The successor variety of a segment of a word, in a set of words, is the number of distinct letters that occupy the position at the segment length plus one character.
  ◦ Example: the successor variety for the first three letters of a five-letter word is the number of words that have the same first three letters but a different fourth letter, plus one for the current word.
• The successor variety of any prefix of a word is the number of children associated with the node in the symbol tree representing that prefix.
• In the example symbol tree, the successor variety for the first letter "b" is three, and the successor variety for the prefix "ba" is two.
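Successor variety is easy to compute directly from a word list. In the sketch below the word set is an assumption chosen for illustration (the notes do not list the words behind their example); it happens to give the same values quoted above for "b" and "ba". The function names are hypothetical.

```python
def successor_variety(prefix, words):
    """Number of distinct letters that follow `prefix` among `words`,
    i.e. the number of children of the prefix's node in the symbol tree."""
    return len({w[len(prefix)] for w in words if w.startswith(prefix) and len(w) > len(prefix)})


def variety_profile(word, words):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [(word[:i], successor_variety(word[:i], words)) for i in range(1, len(word) + 1)]


# Assumed word set, not taken from the notes.
WORDS = ["bag", "barn", "bring", "both", "box", "bottle"]

print(successor_variety("b", WORDS))   # 3  (a, r, o)
print(successor_variety("ba", WORDS))  # 2  (g, r)
print(variety_profile("box", WORDS))   # [('b', 3), ('bo', 2), ('box', 0)]
```

A segmentation method would then look at where this profile peaks or drops to choose the boundary between stem and suffix.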

Affix Removal Stemmers:

Affix removal algorithms remove suffixes and/or prefixes from terms, leaving a stem. A simple example is the following rule set (Harman 1991):

• If a word ends in "ies" but not "eies" or "aies", then "ies" -> "y".
• If a word ends in "es" but not "aes", "ees", or "oes", then "es" -> "e".
• If a word ends in "s" but not "us" or "ss", then "s" -> NULL.
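The three rules translate almost directly into code. One point is an assumption of this sketch rather than something stated in the notes: once a word matches a rule's suffix, later rules are not tried, so a word blocked by an exception (such as "toes") is left unchanged. The function name `s_stem` is hypothetical.

```python
def s_stem(word):
    """Apply the three plural-removal rules; the first rule whose suffix
    matches decides the outcome (assumed interpretation)."""
    w = word.lower()
    if w.endswith("ies"):
        if w.endswith(("eies", "aies")):
            return w                # exception: leave unchanged
        return w[:-3] + "y"         # "ies" -> "y"
    if w.endswith("es"):
        if w.endswith(("aes", "ees", "oes")):
            return w                # exception: leave unchanged
        return w[:-1]               # "es" -> "e"
    if w.endswith("s") and not w.endswith(("us", "ss")):
        return w[:-1]               # "s" -> NULL
    return w


for word in ("queries", "plates", "cats", "corpus", "class", "toes"):
    print(word, "->", s_stem(word))
```

Because the rules only look at endings, they can produce non-words (for instance "glasses" becomes "glasse"), which is the usual trade-off of affix removal against dictionary look-up.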

Related Data Structures for PT Searchable Files:

• Inverted file system: minimizes secondary storage access when multiple search terms are applied across the total database.
• N-gram: breaks processing tokens into smaller string units and uses the token fragments for search. This improves efficiency and conceptual manipulation over full-word inversion.
• PAT trees and arrays: view the text of an item as a single long stream rather than a juxtaposition of words.
• Signature file: provides fast elimination of non-relevant items, reducing the searchable items to a manageable subset.
• Hypertext: manually or automatically created embedded links within one item to a related item.
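As a small illustration of the n-gram idea, the sketch below breaks a token into trigrams and compares two tokens by the fragments they share. The "#" boundary marker, the Dice-style overlap score, and the function names are assumptions made for this example.

```python
def ngrams(token, n=3, pad="#"):
    """Split a token into overlapping n-character fragments.
    The pad character (assumed here) marks the word boundaries."""
    padded = pad + token.lower() + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def similarity(a, b, n=3):
    """Share of n-grams two tokens have in common (Dice coefficient)."""
    grams_a, grams_b = set(ngrams(a, n)), set(ngrams(b, n))
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


print(ngrams("sea"))                                # ['#se', 'sea', 'ea#']
print(round(similarity("indexing", "indexes"), 3))  # 0.533
print(similarity("indexing", "summary"))            # 0.0
```

Because matching happens on fragments, related word forms score as similar without any stemming or dictionary.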

Inverted File Structure:

• Commonly used in DBMS and IR systems.
• For each word, a list of the documents in which the word is found is stored.
• Composed of three basic files:
  ◦ Document file
  ◦ Inversion lists: contain the document identifiers
  ◦ Dictionary: lists all the unique words, plus other information used in query optimization (e.g., the length of the inversion lists)
• The inversion list for a word contains the document identifier of each document in which the word is found.
• To support proximity search, contiguous word phrases, and term weighting, all occurrences of a word are stored in the inversion list along with their word positions.
• For systems that support ranking, the list is reorganized into rank order.
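The three components can be sketched with in-memory dictionaries. This is an illustration under assumptions: the sample documents, the function names, and the choice to keep only the inversion-list length in the dictionary are all made up for the example, and a real system would hold these structures in files.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Build (dictionary, inversion_lists) from a document file {doc_id: text}.

    inversion_lists maps word -> {doc_id: [word positions]}.
    dictionary maps word -> length of its inversion list.
    """
    inversion_lists = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            inversion_lists[word][doc_id].append(position)
    dictionary = {word: len(postings) for word, postings in inversion_lists.items()}
    return dictionary, inversion_lists


def phrase_search(inversion_lists, first, second):
    """Return ids of documents where `second` occurs immediately after `first`."""
    left = inversion_lists.get(first, {})
    right = inversion_lists.get(second, {})
    hits = []
    for doc_id in left.keys() & right.keys():
        following = set(right[doc_id])
        if any(pos + 1 in following for pos in left[doc_id]):
            hits.append(doc_id)
    return sorted(hits)


# Hypothetical document file.
documents = {
    1: "computer systems index text",
    2: "text index for computer memory",
    3: "memory systems",
}
dictionary, inversion_lists = build_inverted_file(documents)

print(dictionary["computer"])                                  # 2
print(sorted(inversion_lists["systems"]))                      # [1, 3]
print(phrase_search(inversion_lists, "computer", "systems"))   # [1]
```

Storing word positions is what makes the phrase query possible; with document identifiers alone, documents 1 and 2 could not be told apart for the words "index" and "text".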

Complete data structures material

UML Case Tools & Testing Lab Manual
No ratings yet
UML Case Tools & Testing Lab Manual
33 pages
Linux Programming Lecture Notes
79% (19)
Linux Programming Lecture Notes
190 pages
Automatic Indexing Techniques Overview
No ratings yet
Automatic Indexing Techniques Overview
28 pages
09a51202 Linuxprogramming
No ratings yet
09a51202 Linuxprogramming
4 pages
Understanding Clustering Techniques
No ratings yet
Understanding Clustering Techniques
13 pages
Case Tools & Software Testing Lab Manual
No ratings yet
Case Tools & Software Testing Lab Manual
63 pages
GATE 2013 Application Details
No ratings yet
GATE 2013 Application Details
1 page
Operating System Concepts Overview
No ratings yet
Operating System Concepts Overview
15 pages
LessLess Aluminum Table Collection
No ratings yet
LessLess Aluminum Table Collection
6 pages
Marketing Strategies of Grandma's Kitchen
No ratings yet
Marketing Strategies of Grandma's Kitchen
4 pages
Women in Indian Carp Culture - Final Draft For Circulation-28.5.11.
No ratings yet
Women in Indian Carp Culture - Final Draft For Circulation-28.5.11.
31 pages
Articles of Partnership for Brixton Coffee
No ratings yet
Articles of Partnership for Brixton Coffee
5 pages
Flexible Pavement Design Methods
No ratings yet
Flexible Pavement Design Methods
67 pages
Detailed Construction Estimate for Steps
No ratings yet
Detailed Construction Estimate for Steps
2 pages
US Government and Policy Exam Guide
No ratings yet
US Government and Policy Exam Guide
12 pages
UnlockingtheadvantagesofN Vinyl 2 Pyrrolidoneassup - 241123 - 212003
No ratings yet
UnlockingtheadvantagesofN Vinyl 2 Pyrrolidoneassup - 241123 - 212003
5 pages
Understanding 29-Bit CAN Identifier
No ratings yet
Understanding 29-Bit CAN Identifier
23 pages
Recovery Radio Schedule & Events Guide
No ratings yet
Recovery Radio Schedule & Events Guide
1 page
Lahti Precision Belt Scales Overview
No ratings yet
Lahti Precision Belt Scales Overview
8 pages
Movie Theater Subscription Trends
No ratings yet
Movie Theater Subscription Trends
9 pages
LAN Cable Types and Drawings Activity
No ratings yet
LAN Cable Types and Drawings Activity
3 pages
Thermoelectric Properties of Be3X2 Materials
No ratings yet
Thermoelectric Properties of Be3X2 Materials
10 pages
Schools in Patna: Locations & Details
No ratings yet
Schools in Patna: Locations & Details
14 pages
BSRM Group: Steel Industry Leader
No ratings yet
BSRM Group: Steel Industry Leader
13 pages
Understanding the Kano Model
No ratings yet
Understanding the Kano Model
1 page
e-Governance Systems: Features & Frameworks
No ratings yet
e-Governance Systems: Features & Frameworks
14 pages
Effective Planning in Management
No ratings yet
Effective Planning in Management
59 pages
Community Health Nursing Overview 2024
No ratings yet
Community Health Nursing Overview 2024
120 pages
ICT History Timeline in the Philippines
No ratings yet
ICT History Timeline in the Philippines
9 pages
AEFI Surveillance and Pharmacist Roles
No ratings yet
AEFI Surveillance and Pharmacist Roles
2 pages
Decree Absolute Search Revised - Aug - 16
No ratings yet
Decree Absolute Search Revised - Aug - 16
2 pages
Sodium Cyanide Safety Training Guide
No ratings yet
Sodium Cyanide Safety Training Guide
21 pages
iAccelerate Women's Hackathon 2025
No ratings yet
iAccelerate Women's Hackathon 2025
3 pages
Tech Discovery Guide
No ratings yet
Tech Discovery Guide
11 pages
Ground Floor Plan Layout Details
No ratings yet
Ground Floor Plan Layout Details
2 pages
Loan Packages and Complaints Overview
No ratings yet
Loan Packages and Complaints Overview
10 pages
Al-Haj FAW Motors: Expansion Strategies
No ratings yet
Al-Haj FAW Motors: Expansion Strategies
10 pages
Patta Vilekh Loan Approval Report
No ratings yet
Patta Vilekh Loan Approval Report
3 pages


UNIT- II

Cataloging and Indexing

2.1 History and Objectives of Indexing: Overview:

2.2 Indexing Process:

2.2.1 Scope of Indexing:

2.2.2 Precoordination and Linkage:

Linkage of Index Terms:

2.3 Automatic Indexing:

2.3.1 Overview

Un-weighted Automatic Indexing:

Weighted Automatic Indexing:

Indexing by Term:

Indexing by Concept:

2.4 Information Extraction & Summarization:

Introduction to Data Structure in IR :

Major Data Structures:

The Porter algorithm:

Conditions on the stem:

Suffix conditions take the form:

Table / Dictionary Look up:

Affix Removal Stemmers:

Related Data Structures for PT Searchable Files:

Inverted File Structure:

Complete data structures material

Common questions

Describe the main differences between human and automatic indexing processes and their respective advantages and disadvantages.

In what ways does stemming contribute to the efficiency of information retrieval systems?

What are the primary objectives of indexing in the context of information retrieval systems?

Discuss the role of sentence scoring in automated text summarization and the criteria used in this process.

What is the significance of inverted file structures in database management systems (DBMS) and information retrieval (IR) systems?

How does the weighting of index terms enhance the indexing process, and what challenges are associated with it in manual indexing?

How do the principles of automatic indexing by concept differ from those of indexing by term?

Explain the concept of information extraction and its relation to text summarization.

How does statistical term weighting work in automatic indexing systems, and how does it improve retrieval performance?

What is precoordination in the indexing process, and how does it differ from postcoordination?
