Inverted files for text search engines - ACM Computing Surveys, 2006
Cited by 136 (2 self)
The technology underlying text search engines has advanced dramatically in the past decade. The development of a family of new index representations has led to a wide range of innovations in index storage, index construction, and query evaluation. While some of these developments have been consolidated in textbooks, many specific techniques are not widely known or the textbook descriptions are out of date. In this tutorial, we introduce the key techniques in the area, describing both a core implementation and how the core can be enhanced through a range of extensions. We conclude with a comprehensive bibliography of text indexing literature.
Lightweight natural language text compression - Information Retrieval, 2007
Cited by 22 (18 self)
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve on the compression ratio of Tagged Huffman Code by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for several reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
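The abstract above does not spell out how a dense codeword is formed. As a rough illustration only, the following Python sketch shows a byte-oriented end-tagged dense code as it is commonly described: words are ranked by decreasing frequency, each rank is written in base 128, and only the last byte of a codeword has its high bit set. The function names (`etdc_encode`, `etdc_decode`, `compress`) and the whitespace tokenization in the usage note are assumptions made for this example, not the authors' implementation.

```python
from collections import Counter


def etdc_encode(rank):
    """Encode a 0-based frequency rank as an end-tagged dense codeword."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]       # last byte: high bit set marks the codeword end
    rank //= 128
    while rank > 0:
        rank -= 1                  # "dense" step: no byte combination is wasted
        out.append(rank % 128)     # earlier bytes keep the high bit clear
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(data):
    """Decode a concatenation of codewords back into a list of ranks."""
    ranks, acc = [], 0
    for b in data:
        if b < 128:                # continuation byte
            acc = acc * 128 + b + 1
        else:                      # tagged byte: the codeword ends here
            ranks.append(acc * 128 + (b - 128))
            acc = 0
    return ranks


def compress(words):
    """Semistatic pass: rank words by frequency, then encode each occurrence."""
    vocab = [w for w, _ in Counter(words).most_common()]
    rank = {w: i for i, w in enumerate(vocab)}
    return vocab, b"".join(etdc_encode(rank[w]) for w in words)
```

For instance, `compress("to be or not to be".split())` returns the frequency-sorted vocabulary together with one byte per word, since every rank is below 128; `etdc_decode` on that byte string yields the ranks needed to rebuild the text. Because the end of each codeword is visible from the high bit alone, codeword boundaries can be found without decoding, which is the property that makes direct search and random access plausible in this family of codes.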
Efficiently decodable and searchable natural language adaptive compression - In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR’05)
Cited by 5 (3 self)
We address the problem of adaptive compression of natural language text, focusing on the case where low bandwidth is available and the receiver has little processing power, as in mobile applications. Our technique achieves compression ratios around 32% and requires very little effort from the receiver. This tradeoff, not previously achieved with alternative techniques, is obtained by breaking the usual symmetry between sender and receiver present in statistical adaptive compression. Moreover, we show that our technique can be adapted to avoid decompression altogether in cases where the receiver only wants to detect the presence of some keywords in the document, which is useful in scenarios such as selective dissemination of information, news clipping, alert systems, text categorization, and clustering. We show that, thanks to the same asymmetry, the receiver can search the compressed text much faster than the plain text. This was previously achieved only in semistatic compression scenarios.
Dual-Sorted Inverted Lists
Cited by 4 (3 self)
Several IR tasks rely, to achieve high efficiency, on a single pervasive data structure called the inverted index. This is a mapping from the terms in a text collection to the documents where they appear, plus some supplementary data. Different orderings in the list of documents associated with a term, and different supplementary data, fit widely different IR tasks. Index designers have to choose the right order for one such task, rendering the index difficult to use for others. In this paper we introduce a general technique, based on wavelet trees, to maintain a single data structure that offers the combined functionality of two independent orderings for an inverted index, with competitive efficiency and within the space of one compressed inverted index. We show in particular that the technique allows combining an ordering by decreasing term frequency (useful for ranked document retrieval) with an ordering by increasing document identifier (useful for phrase and Boolean queries). We show not only that we can support the primitives required by the different search paradigms (e.g., in order to implement any intersection algorithm on top of our data structure), but also that the data structure offers novel ways of carrying out many operations of interest, including space-free treatment of stemming and hierarchical documents.
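To make the two orderings concrete, here is a small Python sketch of a plain inverted index whose postings are kept by increasing document identifier, plus a helper that re-sorts a list by decreasing term frequency. This is not the wavelet-tree structure the paper proposes: it is the conventional baseline, in which each ordering needs its own sorted copy. The function names and the lowercase whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_index(docs):
    """Map each term to [(doc_id, tf), ...], sorted by increasing doc_id."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for term, tf in Counter(text.lower().split()).items():
            index[term].append((doc_id, tf))   # doc_ids arrive in increasing order
    return dict(index)


def by_frequency(postings):
    """Reorder one postings list by decreasing term frequency (ranked retrieval)."""
    return sorted(postings, key=lambda p: (-p[1], p[0]))


def intersect(a, b):
    """Boolean AND: merge two doc_id-sorted postings lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append(a[i][0])
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out
```

With `docs = ["a b a", "b c", "a a a b"]`, the list for `"a"` is `[(0, 2), (2, 3)]` in document order and `[(2, 3), (0, 2)]` in frequency order, while `intersect(index["a"], index["b"])` gives `[0, 2]`. The merge in `intersect` relies on document order, and a top-k scan relies on frequency order, which is why a conventional index has to commit to one of them or store both.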
A compressed self-indexed representation of XML documents
Cited by 2 (2 self)
This paper presents a structure we call XML Wavelet Tree (XWT) to represent any XML document in a compressed and self-indexed form. Therefore, any query or procedure that could be performed over the original document can be performed more efficiently over the XWT representation because it is shorter and has some indexing properties. In fact, XWT answers XPath queries more efficiently than the uncompressed version of the documents. XWT is also competitive with inverted indexes over the XML document (if both structures use the same space).
Using Structural Contexts to Compress Semistructured Text Collections
Cited by 1 (0 self)
We describe a compression model for semistructured documents, called Structural Contexts Model (SCM), which takes advantage of the context information usually implicit in the structure of the text. The idea is to use a separate model to compress the text that lies inside each different structure type (e.g., different XML tag). The intuition behind SCM is that the distribution of all the texts that belong to a given structure type should be similar, and different from that of other structure types. We mainly focus on semistatic models, and test our idea using a word-based Huffman method. This is the standard for compressing large natural language text databases, because random access, partial decompression, and direct search of the compressed collection are possible. This variant, dubbed SCMHuff, retains those features and improves Huffman’s compression ratios. We consider the possibility that storing separate models may not pay off if the distributions of different structure types are not different enough, and present a heuristic to merge models with the aim of minimizing the total size of the compressed database. This gives an additional improvement over the plain technique. The comparison against existing prototypes shows that, among the methods that permit random access to the collection, SCMHuff achieves the best compression ratios, 2–4% better than the closest alternative. From a purely compression-aimed perspective, we combine SCM with PPM modeling. A separate PPM model is used to compress the text that lies inside each different structure type. The result, SCMPPM, permits neither random access nor direct search in the compressed text, but it gives 2–5% better compression ratios than other techniques for texts longer than 5 megabytes.
A general compression algorithm that supports fast searching, 2006
Cited by 1 (0 self)
Key words: algorithms, compression, searching in compressed text, q–grams
Compressing Distributed Text in Parallel with (s, c)-Dense Codes
Systems able to cope with very large text collections are making intensive use of distributed memory parallel computing platforms such as Clusters of PCs. This is particularly evident in Web Search Engines which must resort to parallelism in order to deal efficiently with both high rates of queries per unit time and high space requirements in the form of large numbers of small documents stored in secondary memory. Those documents can be stored in compressed format to reduce memory space and communication time. This paper proposes a parallel algorithm for compressing text in such a distributed memory environment. We show efficient performance against the usual-practice alternative of compressing the whole text on a single machine.
Huffman Coding with Non-Sorted Frequencies
A standard way of implementing Huffman’s optimal code construction algorithm is by using a sorted sequence of frequencies. The paper investigates how relaxing the requirement of keeping the frequencies in order affects several aspects of the algorithm. Using only a partial order may speed up the code construction, which is important in some applications, at the cost of increasing the size of the encoded file.
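For context on what a construction over sorted frequencies can look like, the following Python sketch uses the well-known two-queue method: once the leaf weights are sorted, merged nodes are produced in non-decreasing weight order, so the two smallest nodes are always at the front of one of two queues. Whether this is the exact variant the paper takes as its baseline is an assumption, and the function name `huffman_code_lengths` is chosen for this example only.

```python
from collections import deque


def huffman_code_lengths(freqs):
    """Return the Huffman codeword length of each symbol, in input order."""
    n = len(freqs)
    if n == 0:
        return []
    if n == 1:
        return [1]                                    # a lone symbol still needs one bit
    order = sorted(range(n), key=lambda i: freqs[i])  # the sorting step the paper relaxes
    leaves = deque((freqs[i], i) for i in order)      # node ids 0..n-1 are leaves
    internal = deque()                                # merged nodes, created in weight order
    parent = {}
    next_id = n

    def pop_smallest():
        # The lightest remaining node is at the head of one of the two queues.
        if not internal or (leaves and leaves[0][0] <= internal[0][0]):
            return leaves.popleft()
        return internal.popleft()

    while len(leaves) + len(internal) > 1:
        w1, a = pop_smallest()
        w2, b = pop_smallest()
        parent[a] = parent[b] = next_id
        internal.append((w1 + w2, next_id))
        next_id += 1

    lengths = []
    for leaf in range(n):
        depth, node = 0, leaf
        while node in parent:                         # walk up to the root
            node = parent[node]
            depth += 1
        lengths.append(depth)
    return lengths
```

After the initial sort, the merging loop runs in linear time because neither queue ever needs reordering. If the leaf queue were only partially ordered, `pop_smallest` could return a node that is not the lightest, and the resulting code could be longer than optimal, which is the kind of speed-for-size tradeoff the abstract refers to.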
On the Usefulness of Fibonacci Compression Codes, 2004
Recent publications advocate the use of various variable length codes for which each codeword consists of an integral number of bytes in compression applications using large alphabets. This paper shows that another tradeoff with similar properties can be obtained by Fibonacci codes. These are fixed codeword sets, using binary representations of integers based on Fibonacci numbers of order m ≥ 2. Fibonacci codes have been used before, and this paper extends previous work by presenting several novel features. In particular, the compression efficiency is analyzed and compared to that of dense codes, and various table-driven decoding routines are suggested.
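As a small illustration of the simplest case, the Python sketch below implements the classic order-2 Fibonacci code: a positive integer is written as a sum of non-consecutive Fibonacci numbers (1, 2, 3, 5, 8, ...), least significant position first, and a final 1 bit is appended so that every codeword ends in the only occurrence of "11". The higher orders (m > 2) and the table-driven decoders mentioned in the abstract are not covered here, and the function names are assumptions made for this example.

```python
def fib_encode(n):
    """Order-2 Fibonacci codeword of a positive integer, as a bit string."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    fibs = [1, 2]
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    fibs.pop()                           # keep only Fibonacci numbers <= n
    bits = ["0"] * len(fibs)
    for i in range(len(fibs) - 1, -1, -1):
        if fibs[i] <= n:                 # greedy choice never picks two neighbours
            bits[i] = "1"
            n -= fibs[i]
    return "".join(bits) + "1"           # the extra 1 creates the "11" terminator


def fib_decode(bits):
    """Decode a concatenation of order-2 Fibonacci codewords."""
    values, n, prev = [], 0, "0"
    a, b = 1, 2                          # a is the weight of the current position
    for bit in bits:
        if bit == "1" and prev == "1":   # terminator reached
            values.append(n)
            n, prev, a, b = 0, "0", 1, 2
            continue
        if bit == "1":
            n += a
        a, b = b, a + b
        prev = bit
    return values
```

For example, `fib_encode(4)` is `"1011"` (4 = 1 + 3, then the terminating 1), and `fib_decode("0111011")` returns `[2, 4]`. The codeword set is fixed in advance and does not depend on the symbol frequencies, which matches the abstract's description of Fibonacci codes as fixed codeword sets.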

