Results 1 - 10
of
22
Fast and Flexible Word Searching on Compressed Text
, 2000
"... ... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, ..."
Abstract
-
Cited by 74 (30 self)
- Add to MetaCart
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Lightweight natural language text compression. Information Retrieval
, 2007
"... Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in excha ..."
Abstract
-
Cited by 22 (18 self)
- Add to MetaCart
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11 % larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60 % faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
An asymptotic theory for Cauchy-Euler differential equations with applications to the analysis of algorithms
, 2002
"... Cauchy-Euler differential equations surfaced naturally in a number of sorting and searching problems, notably in quicksort and binary search trees and their variations. Asymptotics of coefficients of functions satisfying such equations has been studied for several special cases in the literature. We ..."
Abstract
-
Cited by 17 (8 self)
- Add to MetaCart
Cauchy-Euler differential equations surfaced naturally in a number of sorting and searching problems, notably in quicksort and binary search trees and their variations. Asymptotics of coefficients of functions satisfying such equations has been studied for several special cases in the literature. We study in this paper the most general framework for Cauchy-Euler equations and propose an asymptotic theory that covers almost all applications where Cauchy-Euler equations appear. Our approach is very general and requires almost no background on differential equations. Indeed the whole theory can be stated in terms of recurrences instead of functions. Old and new applications of the theory are given. New phase changes of limit laws of new variations of quicksort are systematically derived. We apply our theory to about a dozen of diverse examples in quicksort, binary search trees, urn models, increasing trees, etc.
The WARM-UP Algorithm: A Lagrangean Construction of Length Restricted Huffman Codes
- Departamento de Inform'atica, PUC-RJ, Rio de
, 1996
"... : Given an alphabet fa 1 ; : : : ; ang with corresponding set of weights fw 1 ; : : : ; wng, and a number L dlog ne, we introduce an O(n log n+n log w) algorithm for constructing a suboptimal prefix code with restricted maximal length L, where w is the highest presented weight. The number of additi ..."
Abstract
-
Cited by 13 (8 self)
- Add to MetaCart
: Given an alphabet fa 1 ; : : : ; ang with corresponding set of weights fw 1 ; : : : ; wng, and a number L dlog ne, we introduce an O(n log n+n log w) algorithm for constructing a suboptimal prefix code with restricted maximal length L, where w is the highest presented weight. The number of additional bits per symbol generated by our code is not greater than 1=/ L\Gammadlog(n+dlog ne\GammaL)e\Gamma2 when L ? dlog ne + 1, where / is the golden ratio 1:618. An important feature of the proposed algorithm is its implementation simplicity. The algorithm is basically a selected sequence of Huffman trees construction for modified weights. Keywords: Prefix codes, Huffman Trees, Lagragean Duality Resumo: Dado um alfabeto fa 1 ; : : : ; ang com pesos correspondentes fw 1 ; : : : ; wng e um n'umero L dlog ne, n'os apresentamoso um algoritmo de de complexidade O(n log n + n log w)para construit c'odigos de prefixo sub'otimos com restric~ao de comprimento L, onde w 'e o maior peso do dado co...
Improved Bounds on the Inefficiency of Length-Restricted Prefix Codes
- Departamento de Inform'atica, PUC-RJ, Rio de
, 1997
"... : Consider an alphabet \Sigma = fa 1 ; : : : ; ang with corresponding symbol probabilities p 1 ; : : : ; pn . The L\Gammarestricted prefix code is a prefix code where all the code lengths are not greater than L. The value L is a given integer such that L dlog ne. Define the average code length dif ..."
Abstract
-
Cited by 13 (5 self)
- Add to MetaCart
: Consider an alphabet \Sigma = fa 1 ; : : : ; ang with corresponding symbol probabilities p 1 ; : : : ; pn . The L\Gammarestricted prefix code is a prefix code where all the code lengths are not greater than L. The value L is a given integer such that L dlog ne. Define the average code length difference by ffl = P n i=1 p i :l i \Gamma P n i=1 p i :l i , where l 1 ; : : : ; l n are the code lengths of the optimal L-restricted prefix code for \Sigma and l 1 ; : : : ; l n are the code lengths of the optimal prefix code for \Sigma. Let / be the golden ratio 1,618. In this paper, we show that ffl ! 1=/ L\Gammadlog(n+dlog ne\GammaL)e\Gamma1 when L ? dlog ne. We also prove the sharp bound ffl ! dlog ne \Gamma 1, when L = dlog ne. By showing the lower bound 1 / L\Gammadlog ne+2+dlog n n\GammaL e \Gamma1 on the maximum value of ffl, we guarantee that our bound is asymptotically tight in the range dlog ne ! L n=2. Furthermore, we present an O(n) time and space 1=/ L\Gammadlo...
A fast and space-economical algorithm for length-limited coding
- Proc. Int. Symp. Algorithms and Computation, pp.1221
, 1995
"... Abstract. The minimum-redundancy prefix code problem is to determine a list of integer codeword lengths I = [li l i E {1... n}], given a list of n symbol weightsp = [pili C {1.n}], such that ~' ~ 2-l ' < 1, 9 " i = l--n and ~i=1 lipi is minimised. An extension is the minimum-redundancy length-l ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
Abstract. The minimum-redundancy prefix code problem is to determine a list of integer codeword lengths I = [li l i E {1... n}], given a list of n symbol weightsp = [pili C {1.n}], such that ~' ~ 2-l ' < 1, 9 " i = l--n and ~i=1 lipi is minimised. An extension is the minimum-redundancy length-limited prefix code problem, in which the further constraint li < L is imposed, for all i C {1...n} and some integer L> [log 2 hi. The package-merge algorithm of Larmore and Hirschberg generates lengthlimited codes in O(nL) time using O(n) words of auxiliary space. Here we show how the size of the work space can be reduced to O(L2). This represents a useful improvement, since for practical purposes L is O(log n). 1
Self-Indexing Natural Language ⋆
"... Abstract. Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size of the large indexes typically used on strings, namely suffix trees and arrays. Selfindexes represent a string in a space close to its compressed size and provide index ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Abstract. Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size of the large indexes typically used on strings, namely suffix trees and arrays. Selfindexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.
Lossless Compression for Text and Images
- International Journal of High Speed Electronics and Systems
, 1995
"... Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as images---particularly bilevel ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as images---particularly bilevel ones, or ones arising in medical and remotesensing applications, or ones that may be required to be certified true for legal reasons. Moreover, during the process of lossy compression, many occasions for lossless compression of coefficients or other information arise. This paper surveys techniques for lossless compression. The process of compression can be broken down into modeling and coding. We provide an extensive discussion of coding techniques, and then introduce methods of modeling that are appropriate for text and images. Standard methods used in popular utilities (in the case of text) and international standards (in the case of images) are described. Keywords Text compression, ima...
Efficient Implementation of the WARM-UP Algorithm for the Construction of Length-Restricted Prefix Codes
- in Proceedings of the ALENEX
, 1999
"... . Given an alphabet \Sigma = fa1 ; : : : ; ang with a corresponding list of positive weights fw1 ; : : : ; wng and a length restriction L, the length-restricted prefix code problem is to find, a prefix code that minimizes P n i=1 w i l i , where l i , the length of the codeword assigned to a i , ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
. Given an alphabet \Sigma = fa1 ; : : : ; ang with a corresponding list of positive weights fw1 ; : : : ; wng and a length restriction L, the length-restricted prefix code problem is to find, a prefix code that minimizes P n i=1 w i l i , where l i , the length of the codeword assigned to a i , cannot be greater than L, for i = 1; : : : ; n. In this paper, we present an efficient implementation of the WARM-UP algorithm, an approximative method for this problem. The worst-case time complexity of WARM-UP is O(n log n +n log wn ), where wn is the greatest weight. However, some experiments with a previous implementation of WARM-UP show that it runs in linear time for several practical cases, if the input weights are already sorted. In addition, it often produces optimal codes. The proposed implementation combines two new enhancements to reduce the space usage of WARM-UP and to improve its execution time. As a result, it is about ten times faster than the previous implementat...
Codes for the World Wide Web
"... Abstract. We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
Abstract. We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes. 1.

