Adding Compression to Block Addressing Inverted Indexes
, 2000
Cited by 49 (28 self)
Inverted index compression, block addressing, and sequential search on compressed text are three techniques that have been developed separately for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly, faster than searching the uncompressed text. Inverted index compression significantly reduces the size of the index at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions, paying for the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against the three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% when allowing one error in the matches. Keywords: Text compression, inverted files, block addressing, text databases.
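The block-addressing idea in this abstract can be illustrated with a minimal sketch. It is not the authors' index: it uses uncompressed text and uncompressed lists, and the function names, the word-based block size, and the sample sentence are all assumptions made for illustration. Each inverted list stores block numbers only, so a query first narrows the search to candidate blocks and then scans those blocks sequentially.

```python
from collections import defaultdict

def split_blocks(words, block_size):
    """Cut the word sequence into consecutive blocks of block_size words."""
    return [words[i:i + block_size] for i in range(0, len(words), block_size)]

def build_block_index(blocks):
    """Map each word to the ascending list of block numbers that contain it."""
    index = defaultdict(list)
    for block_no, block in enumerate(blocks):
        for word in set(block):          # one entry per block, not per occurrence
            index[word].append(block_no)
    return dict(index)

def search(word, blocks, index, block_size):
    """Return (exact word positions, fraction of blocks that had to be scanned)."""
    candidates = index.get(word, [])
    positions = []
    for block_no in candidates:
        # Sequential scan inside a candidate block recovers exact positions.
        for offset, w in enumerate(blocks[block_no]):
            if w == word:
                positions.append(block_no * block_size + offset)
    return positions, len(candidates) / len(blocks)

# Hypothetical sample text, 10 words in blocks of 4 words.
words = "to be or not to be that is the question".split()
blocks = split_blocks(words, 4)
index = build_block_index(blocks)
print(search("be", blocks, index, 4))
print(search("question", blocks, index, 4))
```

Larger blocks shrink the lists (fewer distinct block numbers per word) but enlarge the portion of text scanned per query, which is the trade-off the abstract studies as a function of block size.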
Approximate String Matching over Ziv-Lempel Compressed Text
, 2000
Cited by 43 (13 self)
We present the first nontrivial algorithm for approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions. On LZ78/LZW we need O(mkn + R) time in the worst case and O(k ) +R) on average, where σ is the alphabet size. The experimental results show a practical speedup of up to 2X over the basic approach for moderate m and small k. We extend the algorithms to more general compression formats and approximate matching models.
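For orientation, the matching model in this abstract (up to k insertions, deletions and substitutions) can be sketched on plain, uncompressed text with the classical column-wise dynamic programming. This is a textbook baseline, not the compressed-text algorithm the abstract describes. The function name and the sample strings are assumptions for illustration.

```python
def approx_search(pattern, text, k):
    """Return the end positions in text where pattern matches with at most k errors.

    Classical O(m * len(text)) dynamic programming on uncompressed text:
    col[i] is the smallest edit distance between pattern[:i] and any
    substring of the text ending at the current position.
    """
    m = len(pattern)
    col = list(range(m + 1))     # before reading any text: i deletions
    ends = []
    for j, ch in enumerate(text):
        prev_diag = col[0]       # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            cost = 0 if pattern[i - 1] == ch else 1
            col[i] = min(prev_diag + cost,  # match or substitution
                         old + 1,           # extra text character
                         col[i - 1] + 1)    # missing pattern character
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_search("abc", "xxabcxx", 0))
print(approx_search("abc", "xxabcxx", 1))
```

Running this after decompressing the whole text corresponds to the "basic approach" that the abstract uses as its point of comparison.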
Hierarchies Of Generalized Kolmogorov Complexities And Nonenumerable Universal Measures Computable In The Limit
 INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE
, 2000
Cited by 38 (20 self)
The traditional theory of Kolmogorov complexity and algorithmic probability focuses on monotone Turing machines with one-way write-only output tape. This naturally leads to the universal enumerable Solomonoff-Levin measure. Here we introduce more general, nonenumerable but cumulatively enumerable measures (CEMs) derived from Turing machines with lexicographically nondecreasing output and random input, and even more general approximable measures and distributions computable in the limit. We obtain a natural hierarchy of generalizations of algorithmic probability and Kolmogorov complexity, suggesting that the "true" information content of some (possibly infinite) bitstring x is the size of the shortest nonhalting program that converges to x and nothing but x on a Turing machine that can edit its previous outputs. Among other things we show that there are objects computable in the limit yet more random than Chaitin's "number of wisdom" Omega, that any approximable measure of x is small for any x lacking a short description, that there is no universal approximable distribution, that there is a universal CEM, and that any nonenumerable CEM of x is small for any x lacking a short enumerating program. We briefly mention consequences for universes sampled from such priors.
Efficient tree layout in a multilevel memory hierarchy, arXiv:cs.DS/0211010
, 2003
Cited by 31 (7 self)
We consider the problem of laying out a tree with fixed parent/child structure in hierarchical memory. The goal is to minimize the expected number of block transfers performed during a search along a root-to-leaf path, subject to a given probability distribution on the leaves. This problem was previously considered by Gil and Itai, who developed optimal but slow algorithms when the block-transfer size B is known. We present faster but approximate algorithms for the same problem; the fastest such algorithm runs in linear time and produces a solution that is within an additive constant of optimal. In addition, we show how to extend any approximately optimal algorithm to the cache-oblivious setting in which the block-transfer size is unknown to the algorithm. The query performance of the cache-oblivious layout is within a constant factor of the query performance of the optimal known-block-size layout. Computing the cache-oblivious layout requires only logarithmically many calls to the layout algorithm for known block size; in particular, the cache-oblivious layout can be computed in O(N lg N) time, where N is the number of nodes. Finally, we analyze two greedy strategies, and show that they have a performance ratio between Ω(lg B / lg lg B) and O(lg B) when compared to the optimal layout.
An Efficient Compression Code for Text Databases
Cited by 23 (8 self)
We present a new compression format for natural language texts, allowing both exact and approximate search without decompression. This new code (called End-Tagged Dense Code) has some advantages with respect to other compression techniques with similar features such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a smaller and simpler vocabulary representation, and (iii) a simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code, i.e., exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes.
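A minimal sketch of how an end-tagged, byte-oriented dense code can support direct search, under stated assumptions: words are ranked by decreasing frequency, the high bit of a byte is set only on the last byte of a codeword, and the remaining 7 bits of each byte are used densely. The function names, the tie-breaking rule, and the sample text are illustrative and not taken from the paper.

```python
from collections import Counter

def etdc_encode(rank):
    """Encode a 0-based frequency rank; only the final byte has its high bit set."""
    out = [0x80 | (rank % 128)]          # last byte carries the end tag
    x = rank // 128
    while x > 0:
        x -= 1                           # dense numbering: no codeword is wasted
        out.append(x % 128)              # earlier bytes keep the high bit clear
        x //= 128
    return bytes(reversed(out))

def build_codebook(words):
    """Assign shorter codewords to more frequent words (ties broken alphabetically)."""
    freq = Counter(words)
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: etdc_encode(i) for i, w in enumerate(ranked)}

def compress(words, codebook):
    return b"".join(codebook[w] for w in words)

def find_word(compressed, code):
    """Search the compressed bytes directly with an ordinary byte-string search."""
    hits = []
    pos = compressed.find(code)
    while pos != -1:
        # Accept only matches that start at a codeword boundary: the previous
        # byte must be the tagged last byte of another codeword.
        if pos == 0 or compressed[pos - 1] & 0x80:
            hits.append(pos)
        pos = compressed.find(code, pos + 1)
    return hits

words = "a b a c a b".split()            # hypothetical sample text
codebook = build_codebook(words)
data = compress(words, codebook)
print(find_word(data, codebook["b"]))
```

Because the end tag marks codeword boundaries, any sequential pattern matching routine can be used on the compressed bytes; the boundary check discards matches that are only the tail of a longer codeword.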
Compression of Digital Holograms for Three-Dimensional Object Reconstruction and Recognition
, 2002
Cited by 21 (17 self)
We present the results of applying lossless and lossy data compression to a three-dimensional object reconstruction and recognition technique based on phase-shift digital holography. We find that the best lossless (Lempel-Ziv, Lempel-Ziv-Welch, Huffman, Burrows-Wheeler) compression rates can be expected when the digital hologram is stored in an intermediate coding of separate data streams for real and imaginary components. The lossy techniques are based on subsampling, quantization, and discrete Fourier transformation. For various degrees of speckle reduction, we quantify the number of Fourier coefficients that can be removed from the hologram domain, and the lowest level of quantization achievable, without incurring significant loss in correlation performance or significant error in the reconstructed object domain.
(S,C)-Dense Coding: An optimized compression code for natural language text databases
 In Proc. 10th Intl. Symp. on String Processing and Information Retrieval (SPIRE’03), LNCS 2857
, 2003
Cited by 18 (11 self)
This work presents (s, c)-Dense Code, a new method for compressing natural language texts. This technique is a generalization of a previous compression technique called End-Tagged Dense Code, which obtains a better compression ratio as well as a simpler and faster encoding than Tagged Huffman. At the same time, (s, c)-Dense Code is a prefix code that maintains the most interesting features of Tagged Huffman Code with respect to direct search on the compressed text. (s, c)-Dense Coding retains all the efficiency and simplicity of Tagged Huffman, and improves its compression ratios. We formally describe the (s, c)-Dense Code and show how to compute the parameters s and c that optimize the compression for a specific corpus. Our empirical results show that (s, c)-Dense Code improves on End-Tagged Dense Code and Tagged Huffman Code, and reaches only 0.5% overhead over plain Huffman Code.
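The role of the parameters s and c can be sketched as follows, under stated assumptions: a codeword is zero or more "continuer" bytes followed by exactly one "stopper" byte, with s stopper values and c continuer values. The concrete byte assignment used here (continuers are 0..c-1, stoppers are c..c+s-1), the function names, and the brute-force parameter search are illustrative choices, not the paper's procedure.

```python
def sc_encode(rank, s, c):
    """Encode a 0-based frequency rank with s stopper and c continuer byte values."""
    if s < 1 or c < 1 or s + c > 256:
        raise ValueError("need s >= 1, c >= 1 and s + c <= 256")
    out = [c + rank % s]                 # the single stopper ends the codeword
    x = rank // s
    while x > 0:
        x -= 1
        out.append(x % c)                # continuers precede the stopper
        x //= c
    return bytes(reversed(out))

def sc_decode(code, s, c):
    """Invert sc_encode for one complete codeword."""
    x = 0
    for b in code[:-1]:
        x = x * c + b + 1
    return x * s + (code[-1] - c)

def compressed_size(freqs_desc, s, c):
    """Total bytes needed for word frequencies given in decreasing order."""
    return sum(f * len(sc_encode(i, s, c)) for i, f in enumerate(freqs_desc))

def best_s(freqs_desc):
    """Brute-force choice of s (with c = 256 - s) minimising the compressed size."""
    return min(range(1, 256), key=lambda s: compressed_size(freqs_desc, s, 256 - s))

# With s = c = 128 the mapping coincides with an end-tagged code:
# stoppers are exactly the byte values with the high bit set.
print(sc_encode(128, 128, 128))
```

A larger s gives more one-byte codewords but fewer two-byte ones, so the best split depends on how skewed the word frequencies of the corpus are.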
Regular Expression Searching on Compressed Text
 Journal of Discrete Algorithms
, 2003
Cited by 13 (1 self)
We present a solution to the problem of regular expression searching on compressed text.
Automatic Synthesis of Compression Techniques for Heterogeneous Files
, 1995
Cited by 11 (5 self)
This paper uses a straightforward program synthesis technique: a compression plan, consisting of instructions for each block of input data, is generated, guided by the statistical properties of the input data. Because it uses algorithms specifically suited to the types of redundancy exhibited by the particular input file, the system achieves consistent average performance throughout the file, as shown by experimental evidence.