Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Self-Indexing Inverted Files for Fast Text Retrieval
ACM Transactions on Information Systems, 1996
Cited by 127 (23 self)
Query processing costs on large text databases are dominated by the need to retrieve and scan the inverted list of each query term. Here we show that query response time for conjunctive Boolean queries and for informal ranked queries can be dramatically reduced, at little cost in terms of storage, by the inclusion of an internal index in each inverted list. This method has been applied in a retrieval system for a collection of nearly two million short documents. Our experimental results show that the self-indexing strategy adds less than 20% to the size of the inverted file, but, for Boolean queries of 5 to 10 terms, can reduce processing time to under one fifth of the previous cost. Similarly, ranked queries of 40 to 50 terms can be evaluated in as little as 25% of the previous time, with little or no loss of retrieval effectiveness.
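
The idea of an internal index inside each inverted list can be illustrated with a minimal sketch. It is not the paper's implementation: the lists here are plain uncompressed Python lists, and the names `build_skips`, `contains`, and `intersect`, as well as the square-root spacing of the synchronisation points, are assumptions made for illustration. The point it shows is that a conjunctive query only has to scan one block of the long list per candidate document.

```python
from math import isqrt

def build_skips(postings, step=None):
    """Return (step, skips), where skips holds a (position, doc_id)
    synchronisation point for every `step`-th entry of the sorted list."""
    if step is None:
        # Assumed spacing: roughly sqrt(list length) entries per block.
        step = max(1, isqrt(len(postings)))
    return step, [(i, postings[i]) for i in range(0, len(postings), step)]

def contains(postings, step, skips, target):
    """Test membership while scanning only one block of the list."""
    start = None
    for pos, doc in skips:
        if doc > target:
            break
        start = pos  # last sync point whose doc id is <= target
    if start is None:
        return False
    return target in postings[start:start + step]

def intersect(short, long_list):
    """Conjunctive (AND) query: keep the docs of `short` found in `long_list`."""
    step, skips = build_skips(long_list)
    return [d for d in short if contains(long_list, step, skips, d)]

print(intersect([3, 40, 99], list(range(0, 100, 3))))
```

In a compressed index the benefit would come from not decoding the skipped blocks; this sketch only models the reduced scanning.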
Parameterised Compression for Sparse Bitmaps
Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, 1992
Cited by 26 (8 self)
Full-text retrieval systems typically use either a bitmap or an inverted file to identify which documents contain which words, so that the documents containing any combination of words can be quickly located. Bitmaps of word occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques.
Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data compaction and compression; H.3.2 [Information Storage]: File organisation.
Keywords: Full-text retrieval, data compression, document database, Huffman coding, geometric distribution, inverted file.
1 Introduction
Full-text retrieval systems are used for storing and accessing document collections such as newspaper a...
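
A per-bitvector parameterised code can be sketched with Golomb coding of the gaps between set bits, which suits the geometric distribution named in the keywords. This is an illustration under stated assumptions, not necessarily the code the paper evaluates: documents are numbered from 1, the function names are hypothetical, bits are held in a Python string for readability, and the parameter rule `b ≈ 0.69 * N / f` is a commonly quoted rule of thumb.

```python
def golomb_encode(x, b):
    """Encode integer x >= 1 with Golomb parameter b >= 1 as a bit string."""
    q, r = divmod(x - 1, b)
    bits = "1" * q + "0"              # quotient in unary
    if b == 1:
        return bits                   # no remainder bits needed
    k = (b - 1).bit_length()          # ceil(log2(b))
    cutoff = (1 << k) - b             # remainders below this use k-1 bits
    if r < cutoff:
        return bits + format(r, f"0{k - 1}b")
    return bits + format(r + cutoff, f"0{k}b")

def golomb_decode(bits, b, count):
    """Decode `count` integers from a Golomb-coded bit string."""
    out, i = [], 0
    k = (b - 1).bit_length()
    cutoff = (1 << k) - b
    for _ in range(count):
        q = 0
        while bits[i] == "1":
            q, i = q + 1, i + 1
        i += 1                        # skip the terminating 0
        r = 0
        if b > 1:
            if k > 1:
                r = int(bits[i:i + k - 1], 2)
            i += k - 1
            if r >= cutoff:           # long remainder: read one more bit
                r = ((r << 1) | int(bits[i])) - cutoff
                i += 1
        out.append(q * b + r + 1)
    return out

def encode_bitvector(doc_ids, n_docs):
    """Code one sparse bitvector (sorted, 1-based doc ids) with its own b."""
    gaps = [d - p for p, d in zip([0] + doc_ids[:-1], doc_ids)]
    b = max(1, round(0.69 * n_docs / len(doc_ids)))  # assumed rule of thumb
    return b, "".join(golomb_encode(g, b) for g in gaps)
```

A dense bitvector gets a small `b` and a sparse one a large `b`, so each list is coded with a parameter matched to its own density.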
Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
ACM Transactions on Information Systems, 1989
Cited by 21 (3 self)
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
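
The coin-flip claim can be made concrete with a small check of one necessary condition: for a source whose symbol probabilities are all powers of 1/2, the long-run fraction of 1 bits in the Huffman output is exactly one half. The sketch below is only an illustration, not the paper's argument. The four-symbol distribution and the function names are assumptions, and a balanced bit count alone does not establish indistinguishability.

```python
import heapq
from fractions import Fraction
from itertools import count

def huffman_codes(probs):
    """Return {symbol: bit string} for a dict of symbol probabilities."""
    tie = count()  # tie-breaker so the heap never has to compare dicts
    heap = [(p, next(tie), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        # Prefix the two least probable subtrees with 0 and 1 respectively.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tie), merged))
    return heap[0][2]

def ones_fraction(probs):
    """Expected 1 bits per symbol divided by expected code length.
    Needs at least two symbols, otherwise the code length is zero."""
    codes = huffman_codes(probs)
    ones = sum(p * codes[s].count("1") for s, p in probs.items())
    length = sum(p * len(codes[s]) for s, p in probs.items())
    return ones / length

# Hypothetical dyadic source: every probability is a power of 1/2.
probs = {"a": Fraction(1, 2), "b": Fraction(1, 4),
         "c": Fraction(1, 8), "d": Fraction(1, 8)}
print(huffman_codes(probs), ones_fraction(probs))
```

Using `Fraction` keeps the arithmetic exact, so the result can be compared with one half without rounding concerns.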
Compressing Inverted Files
2003
Cited by 16 (6 self)
Research into inverted file compression has focused on compression ratio, that is, on how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.
Posting Compression in Dynamic Retrieval Environments
Proc. 14th International Conference on Research and Development in Information Retrieval (SIGIR 91), 1991
Cited by 1 (0 self)
prohibited without the written consent of the copyright owner. NAT. LAB. UR 008/91
The Responsa Storage and Retrieval System - Whither?
1996
p. 173). We did develop such a tool [CCDFS1971]. As each of these methods has certain advantages and disadvantages, we ended up by merging them into a joint analysis-synthesis method; a global analysis of all words in the database is done, but without prepositions (otiyot shimush), in order to end up with a database of manageable size; the prepositions are left to the synthesis phase. See [AFCS1972] for full details. I also set up a "Committee for the Mechanization in Jewish Law Research" whose first members were, I think, Dr. Choueka, Mr. Asa Kasher, later professor of Philosophy at Tel Aviv University, Mr. Joseph Dueck, a young lawyer and research assistant at the IRJL, who served as their representative, and assistants, to formulate procedures for preediting and postediting texts to be inputted, and various algorithms needed for the work. (Many other persons, such as Mr. Reuven Mirkin of the Academy of the Hebrew Language, and research students, joined later.) I also felt ...
Working with Compressed Concordances
A combination of new compression methods is suggested in order to compress the concordance of a large Information Retrieval system. The methods are aimed at allowing most of the processing directly on the compressed file, requiring decompression, if at all, only for small parts of the accessed data, saving I/O operations and CPU time.

