Inverted files for text search engines
ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Parameterised Compression for Sparse Bitmaps
Proc. ACM-SIGIR International Conference on Research and Development in Information Retrieval, 1992
Cited by 26 (8 self)
Full-text retrieval systems typically use either a bitmap or an inverted file to identify which documents contain which words, so that the documents containing any combination of words can be quickly located. Bitmaps of word occurrences are large, but are usually sparse, and thus are amenable to a variety of compression techniques. Here we consider techniques in which the encoding of each bitvector within the bitmap is parameterised, so that a different code can be used for each bitvector. Our experimental results show that the new methods yield better compression than previous techniques. Categories and Subject Descriptors: E.4 [Coding and Information Theory]: Data compaction and compression; H.3.2 [Information Storage]: File organisation. Keywords: Full-text retrieval, data compression, document database, Huffman coding, geometric distribution, inverted file. 1 Introduction Full-text retrieval systems are used for storing and accessing document collections such as newspaper a...
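To make the idea of a per-bitvector code concrete, here is a minimal sketch, not the paper's method. It assumes each bitvector is stored as the gaps between its 1-bits and that those gaps are Golomb-coded with a parameter `b` chosen separately for each bitvector. The choice of Golomb coding, the rule of thumb `b ≈ 0.69 · N / f`, and all function names are assumptions for illustration only.

```python
import math

def golomb_encode(gap, b):
    """Golomb code of a positive integer gap with parameter b, as a bit string."""
    q, r = divmod(gap - 1, b)
    bits = "1" * q + "0"              # quotient in unary
    if b == 1:
        return bits                   # no remainder bits needed
    k = (b - 1).bit_length()          # ceil(log2(b)) for b >= 2
    cutoff = (1 << k) - b             # truncated-binary threshold
    if r < cutoff:
        return bits + format(r, "b").zfill(k - 1)
    return bits + format(r + cutoff, "b").zfill(k)

def encode_bitvector(positions, num_docs):
    """Encode sorted 1-based positions of 1-bits; b is chosen per bitvector."""
    # Assumed rule of thumb: b is about 0.69 * N / f for f ones among N bits.
    b = max(1, math.ceil(0.69 * num_docs / len(positions)))
    prev, chunks = 0, []
    for p in positions:
        chunks.append(golomb_encode(p - prev, b))
        prev = p
    return b, "".join(chunks)

def decode_bitvector(bits, b, count):
    """Invert encode_bitvector, returning the original positions."""
    k = (b - 1).bit_length() if b > 1 else 0
    cutoff = (1 << k) - b
    positions, pos, i = [], 0, 0
    for _ in range(count):
        q = 0
        while bits[i] == "1":         # read the unary quotient
            q += 1
            i += 1
        i += 1                        # skip the terminating "0"
        r = 0
        if b > 1:
            r = int(bits[i:i + k - 1] or "0", 2)
            i += k - 1
            if r >= cutoff:           # long codeword: one more bit follows
                r = 2 * r + int(bits[i]) - cutoff
                i += 1
        pos += q * b + r + 1
        positions.append(pos)
    return positions

# A sparse bitvector over 128 documents with six 1-bits.
positions = [3, 8, 9, 40, 41, 100]
b, bits = encode_bitvector(positions, 128)
print(b, len(bits), decode_bitvector(bits, b, len(positions)) == positions)
```

A dense bitvector would get a small `b` and a sparse one a large `b`, which is the sense in which the code is parameterised per bitvector in this sketch.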
Compression of Correlated Bit-Vectors
Information Systems, 1990
Cited by 23 (2 self)
Bitmaps are data structures occurring often in information retrieval. They are useful; they are also large and expensive to store. For this reason, considerable effort has been devoted to finding techniques for compressing them. These techniques are most effective for sparse bitmaps. We propose a preprocessing stage, in which bitmaps are first clustered and the clusters used to transform their member bitmaps into sparser ones that can be more effectively compressed. The clustering method efficiently generates a graph structure on the bitmaps. In some situations, it is desired to impose restrictions on the graph; finding the optimal graph satisfying these restrictions is shown to be NP-complete. The results of applying our algorithm to the Bible are presented: for some sets of bitmaps, our method almost doubled the compression savings. 1. Introduction Textual Information Retrieval Systems (IRS) are voracious consumers of computer storage resources. Most conspicuous, of course, is the...
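The preprocessing idea can be illustrated with a small sketch. It assumes the transform is an XOR against another, similar bitmap, and it picks that bitmap with a simple greedy scan over earlier bitmaps rather than the clustering and graph construction the abstract describes. The function names and the greedy strategy are hypothetical.

```python
def popcount(x):
    """Number of 1-bits in a non-negative integer."""
    return bin(x).count("1")

def xor_transform(bitmaps):
    """Replace each bitmap by (parent index, delta), where delta is the XOR
    with the earlier bitmap that leaves the fewest 1-bits (parent -1 = none)."""
    encoded = []
    for i, bm in enumerate(bitmaps):
        best_parent, best_delta = -1, bm
        for j in range(i):
            delta = bm ^ bitmaps[j]
            if popcount(delta) < popcount(best_delta):
                best_parent, best_delta = j, delta
        encoded.append((best_parent, best_delta))
    return encoded

def xor_restore(encoded):
    """Invert xor_transform by XOR-ing each delta with its restored parent."""
    bitmaps = []
    for parent, delta in encoded:
        bitmaps.append(delta if parent < 0 else delta ^ bitmaps[parent])
    return bitmaps

# Three correlated bitmaps and one unrelated bitmap, stored as integers.
bitmaps = [0b10110110, 0b10110111, 0b00001000, 0b10110010]
encoded = xor_transform(bitmaps)
ones_before = sum(popcount(b) for b in bitmaps)
ones_after = sum(popcount(d) for _, d in encoded)
print(ones_before, ones_after, xor_restore(encoded) == bitmaps)
```

Correlated bitmaps differ in few positions, so their XOR is sparser than either original and is a better input for a sparse-bitmap compressor.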
Storing Text Retrieval Systems on CD-ROM: Compression and Encryption Considerations
ACM Transactions on Information Systems, 1989
Cited by 21 (3 self)
The emergence of the CD-ROM as a storage medium for full-text databases raises the question of the maximum size database that can be contained by this medium. As an example, the problem of storing the Trésor de la Langue Française on a CD-ROM is examined in this paper. The text alone of this database is 700 MB long, more than a CD-ROM can hold. But in addition the dictionary and concordance needed to access this data must be stored. A further constraint is that some of the material is copyrighted, and it is desirable that such material be difficult to decode except through software provided by the system. Pertinent approaches to compression of the various files are reviewed and the compression of the text is related to the problem of data encryption: specifically, it is shown that, under simple models of text generation, Huffman encoding produces a bit-string indistinguishable from a representation of coin flips. Categories and Subject Descriptors: E.3 E.4 H.3.2 J.5 General terms: ...
Compressing Inverted Files
2003
Cited by 16 (6 self)
Research into inverted file compression has focused on compression ratio, that is, how small the indexes can be. Compression ratio is important for fast interactive searching. It is taken as read that the smaller the index, the faster the search.
Using Bitmaps for Medium Sized Information Retrieval Systems
Information Processing & Management, 1990
Cited by 6 (3 self)
We describe the use of various forms of bitmaps as a basic tool for improving the search algorithms in medium sized information retrieval systems. The bitmaps considered include and extend known techniques using occurrence maps and signatures. Such an approach to text retrieval is flexible, efficient and, relative to the customary concordance approach, inexpensive in storage costs. 1. Introduction Our ability to control textual information is being strongly influenced by a variety of technological advances. These include new means of storing and sharing information that make possible and realistic an information system model in which large bodies of full text are compactly stored, widely distributed, and shared by a large number of interested persons. Such changes require a careful search for techniques that promise convenient and effective access to such textual databases. The research that is required in this environment differs from that traditional in Information Retrieval (IR)...
Models of Bitmap Generation: A Systematic Approach to Bitmap Compression
Inf. Proc. & Management, v28, 1992
Cited by 5 (2 self)
In large IR systems, information about word occurrence may be stored in the form of a bit matrix, with rows corresponding to different words and columns to documents. Such a matrix is generally very large and very sparse. New methods for compressing such matrices are presented, which exploit possible correlations between rows and between columns. The methods are based on partitioning the matrix into small blocks and predicting the 1-bit distribution within a block by means of various bit generation models. Each block is then encoded using Huffman or arithmetic coding. The methods also use a new way of enumerating subsets of fixed size from a given superset. Preliminary experimental results indicate improvements over previous methods. 1. Introduction The common approach to processing complex boolean queries in large full-text document retrieval systems is to use inverted files: a concordance is accessed via a dictionary, and includes for each different word of the text, the ordered list ...
Improved Index Compression Techniques for Versioned Document Collections
Cited by 5 (2 self)
Current Information Retrieval systems use inverted index structures for efficient query processing. Due to the extremely large size of many data sets, these index structures are usually kept in compressed form, and many techniques for optimizing compressed size and query processing speed have been proposed. In this paper, we focus on versioned document collections, that is, collections where each document is modified over time, resulting in multiple versions of the document. Consecutive versions of the same document are often similar, and several researchers have explored ideas for exploiting this similarity to decrease index size. We propose new index compression techniques for versioned document collections that achieve reductions in index size over previous methods. In particular, we first propose several bitwise compression techniques that achieve a compact index structure but that are too slow for most applications. Based on the lessons learned, we then propose additional techniques that come close to the sizes of the bitwise technique while also improving on the speed of the best previous methods.
Theory and Practice of Time-Space Trade-Offs in Memory Limited Search
In Proceedings of KI-01, Lecture Notes in Computer Science, 2001
Cited by 4 (2 self)
Having to cope with memory limitations is a ubiquitous issue in heuristic search. We present theoretical and practical results on new variants for exploring state-space with respect to memory limitations. We establish O(log n) minimum-space algorithms that omit both the open and the closed list to determine the shortest path between every two nodes and study the gap between full memorization in a hash table and the information-theoretic lower bound. The proposed structure of suffix-lists elaborates on a concise binary representation of states by applying bit-state hashing techniques. Significantly more states can be stored while searching and inserting n items into suffix lists is still available in O(n log n) time. Bit-state hashing leads to the new paradigm of partial iterative-deepening heuristic search, in which full exploration is sacrificed for a better detection of duplicates at large search depths. We give first promising results in the application area of communication protocols.
Configuration Encoding Techniques for Fast FPGA Reconfiguration
2006
Cited by 1 (0 self)
‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project’s design and conception or in style, presentation and linguistic expression is acknowledged.’ Signed
Acknowledgements
I would like to thank my supervisor, Dr. Oliver Diessel, for his continuous support in this project. Thank you for your high throughput editing, short response time feedback and fine-grained discussions containing no null data!

