Self-Indexing Inverted Files for Fast Text Retrieval
- ACM Transactions on Information Systems, 1996
Cited by 127 (23 self)
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5 to 10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40 to 50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
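The idea of an internal index inside each inverted list can be illustrated with a small sketch. This is not the authors' implementation: the class name `SkippedList`, the `block_size` parameter, and the uncompressed in-memory lists are all assumptions made for illustration. In a compressed index the payoff would come from decoding only the one block a probe lands in; here the same effect is mimicked by scanning a single block.

```python
from bisect import bisect_right


class SkippedList:
    """Sorted document-ID list plus a small internal index (hypothetical sketch)."""

    def __init__(self, doc_ids, block_size=4):
        self.doc_ids = sorted(doc_ids)
        self.block_size = block_size
        # First doc ID of every block: the "internal index" of this list.
        self.block_heads = self.doc_ids[::block_size]

    def contains(self, doc_id):
        # Find the only block that could hold doc_id, then scan just that block.
        block = bisect_right(self.block_heads, doc_id) - 1
        if block < 0:
            return False
        start = block * self.block_size
        return doc_id in self.doc_ids[start:start + self.block_size]


def conjunctive_query(lists):
    """Return the doc IDs present in every list (Boolean AND)."""
    # Start from the shortest list and probe the longer ones through their index.
    ordered = sorted(lists, key=lambda lst: len(lst.doc_ids))
    candidates = ordered[0].doc_ids
    for other in ordered[1:]:
        candidates = [d for d in candidates if other.contains(d)]
    return candidates
```

Starting from the shortest list keeps the candidate set small, so each longer list is touched only at the few blocks that a candidate falls into.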
Inverted files versus signature files for text indexing
- ACM Transactions on Database Systems, 1998
Cited by 74 (3 self)
Two well-known indexing methods are inverted files and signature files. We have undertaken a detailed comparison of these two approaches in the context of text indexing, paying particular attention to query evaluation speed and space requirements. We have examined their relative performance using both experimentation and a refined approach to modeling of signature files, and demonstrate that inverted files are distinctly superior to signature files. Not only can inverted files be used to evaluate typical queries in less time than can signature files, but inverted files require less space and provide greater functionality. Our results also show that a synthetic text database can provide a realistic indication of the behavior of an actual text database. The tools used to generate the synthetic database have been made publicly available.
Integrating Structured Data and Text: A relational approach
- Journal of the American Society of Information Science, 1997
Cited by 50 (27 self)
We integrate structured data and text using the unchanged, standard relational model. We started with the premise that a relational system could be used to implement an Information Retrieval (IR) system. After implementing a prototype to verify that premise, we then began to investigate the performance of a parallel relational database system for this application. We also tested the effect of query reduction on accuracy and found that queries can be reduced prior to their implementation without incurring a significant loss in precision/recall. This reduction also serves to improve run-time performance. After comparing our results to a special purpose IR system, we conclude that the relational model offers scalable performance and includes the ability to integrate structured data and text in a portable fashion. 1 Introduction Increasingly, applications integrate structured and unstructured data, responding to requests such as "Find articles containing vehicle and sales published in jou...
Adding Compression to Block Addressing Inverted Indexes
- 2000
Cited by 47 (26 self)
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression significantly reduces the size of the index at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches. Keywords: Text compression, inverted files, block addressing, text databases.
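Block addressing on its own can be sketched in a few lines. The sketch below is a simplified illustration, not the scheme from the paper: it leaves both text and index uncompressed, measures blocks in words rather than bytes, and the function names `build_block_index` and `search` are hypothetical. It shows the trade-off described above: the index stores only block numbers, and exact positions are recovered by scanning the named blocks.

```python
def build_block_index(text, block_size):
    """Map each word to the sorted list of text blocks in which it occurs."""
    words = text.split()
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    index = {}
    for block_no, block in enumerate(blocks):
        for word in block:
            postings = index.setdefault(word, [])
            # Record each block once per word, however often the word repeats.
            if not postings or postings[-1] != block_no:
                postings.append(block_no)
    return blocks, index


def search(word, blocks, index):
    """Return exact word positions, scanning only the blocks the index names."""
    hits = []
    block_size = len(blocks[0]) if blocks else 0
    for block_no in index.get(word, []):
        # Sequential scan inside the candidate block recovers the positions.
        for offset, candidate in enumerate(blocks[block_no]):
            if candidate == word:
                hits.append(block_no * block_size + offset)
    return hits
```

Larger blocks shrink the index (fewer distinct block numbers per word) but lengthen the sequential scan, which is the block-size trade-off the abstract refers to.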
Execution Performance Issues in Full-Text Information Retrieval
- 1995
Cited by 18 (0 self)
The task of an information retrieval system is to identify documents that will satisfy a user's information need. Effective fulfillment of this task has long been an active area of research, leading to sophisticated retrieval models for representing information content in documents and queries and measuring similarity between the two. The maturity and proven effectiveness of these systems have resulted in demand for increased capacity, performance, scalability, and functionality, especially as information retrieval is integrated into more traditional database management environments. In this dissertation we explore a number of functionality and performance issues in information retrieval. First, we consider creation and modification of the document collection, concentrating on management of the inverted file index. An inverted file architecture based on a persistent object store is described and experimental results are presented for inverted file creation and modification. Our architecture provides performance that scales well with document collection size, and the database features supported by the persistent object store provide many solutions to issues that arise during integration of information retrieval into more general database environments. We then turn to query evaluation speed and introduce a new optimization technique for statistical ranking retrieval systems that support structured queries. Experimental results from a variety of query sets show that execution time can be reduced by more than 50% wit...
Improving Accuracy and Run-Time Performance for TREC-4
Cited by 7 (4 self)
For TREC-4, we enhanced our existing prototype that implements relevance ranking using the AT&T DBC-1012 Model 4 parallel database machine to support the entire document collection. Additionally, we developed a special purpose IR prototype to test a new index compression algorithm and to provide performance comparisons to the relational approach. We submitted official results for both automatic and manual ad hoc queries for the entire 2GB English collection and the provided Spanish collection. Additionally, we submitted results using n-grams to process the corrupted data. In addition to implementing the vector-space model, we experimented with query reduction based on term frequency. Query reduction was shown to result in dramatically improved run-time performance and, in many cases, resulted in little or no degradation of precision/recall. 1 Introduction For TREC-4, we implemented relevance ranking queries using SQL on an AT&T DBC-1012 (formerly Teradata) parallel database machi...
Incremental Processing of Vague Queries in Interactive Retrieval Systems
- In Hypertext - Information Retrieval - Multimedia ’97: Theorien, Modelle und Implementierungen, 1994
Cited by 5 (0 self)
The application of information retrieval techniques in interactive environments requires systems capable of efficiently processing vague queries. To reach reasonable response times, new data structures and algorithms have to be developed. In this paper we describe an approach taking advantage of the conditions of interactive usage and special access paths. As a reference, we investigated text queries and compared our algorithms to the well-known Buckley/Lewit algorithm. We achieved significant improvements in response times. 1 Introduction Information retrieval deals with information systems for vague queries and imprecise data. In text retrieval, imprecision is caused by the limited capabilities of a system for representing text content. Due to the implicit imprecision and vagueness of natural language, text queries are always vague. But even for fact retrieval, there is often also a need for vagueness. If we have a system containing texts as well as facts, then we would like ...
Efficiency Considerations in Very Large Information Retrieval Servers
- Journal of Digital Information, (British Computer Society, 1999
Cited by 3 (3 self)
It is estimated that the World Wide Web now contains more than twenty million different content areas, presented on more than 320 million web pages, and one million web servers, and it is doubling every nine months [16, 17]. To combat this, Moore's law suggests that computational resources will continue to double every eighteen months. This suggests that, although both curves are exponential, we may be losing ground in the quest to search and find useful information. We briefly review some of the algorithms known to be useful for very large scale information retrieval and suggest opportunities for improvement in this area. Without research in this area, the web and other electronic information sources will become large haystacks in which no one will be able to find any needles. 1 Introduction Information Retrieval (IR) is devoted to finding "relevant" documents. Simple pattern matching is one approach, but many other techniques exist. Unlike most algorithms found in Comp...
Modeling Word Occurrences for the Compression of Concordances
An earlier paper developed a procedure for compressing concordances, assuming that all elements occurred independently. The models introduced in that paper are extended here to take the possibility of clustering into account. The concordance is conceptualized as a set of bitmaps, in which the bit locations represent documents, and the 1-bits represent the occurrence of given terms. Hidden Markov models (HMM) are used to describe the clustering of the 1-bits. However, for computational reasons, the HMM is approximated by traditional Markov models. A set of criteria is developed to constrain the allowable set of n-state models, and a full inventory is given for n ≤ 4. Graph theoretic reduction and complementation operations are defined among the various models, and are used to provide a structure relating the models studied. Finally, the new methods were tested on the concordances of the English Bible and of two of the world's largest full-text retrieval systems: the Trésor de la Langue Fr...
Working with Compressed Concordances
A combination of new compression methods is suggested in order to compress the concordance of a large Information Retrieval system. The methods are aimed at allowing most of the processing directly on the compressed file, requesting decompression, if at all, only for small parts of the accessed data, saving I/O operations and CPU time.

