Results 1-10 of 22
Fast and Flexible Word Searching on Compressed Text
, 2000
Cited by 81 (33 self)
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for display purposes.
Lightweight natural language text compression
 Information Retrieval
, 2007
Cited by 27 (21 self)
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s, c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve Tagged Huffman Code compression ratio by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants for various reasons: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which are not so close to the optimal size.
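To make the dense-code idea concrete, here is a minimal Python sketch of an end-tagged byte code. It assumes the commonly described byte-oriented scheme, which the abstract itself does not spell out: words are ranked by decreasing frequency, the last byte of every codeword has its high bit set, and all earlier bytes have it clear. The function names and the 0-based rank convention are illustrative assumptions, not the authors' implementation.

```python
def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank` (assumed convention)."""
    if rank < 0:
        raise ValueError("rank must be non-negative")
    out = [128 + rank % 128]       # final byte: high bit set marks the codeword end
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)     # earlier bytes: high bit clear
        rank = rank // 128 - 1
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single, complete codeword."""
    rank = 0
    for b in code[:-1]:
        rank = (rank + b + 1) * 128
    return rank + (code[-1] - 128)


if __name__ == "__main__":
    # The 128 most frequent words get one byte, the next 128**2 get two, and so on.
    for r in (0, 127, 128, 16511, 16512):
        print(r, etdc_encode(r).hex())
```

Because the codeword depends only on the rank, no tree has to be built, and the tag bit lets a byte-wise pattern matcher locate codeword boundaries in the compressed stream.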
An asymptotic theory for Cauchy-Euler differential equations with applications to the analysis of algorithms
, 2002
Cited by 22 (10 self)
Cauchy-Euler differential equations surfaced naturally in a number of sorting and searching problems, notably in quicksort and binary search trees and their variations. Asymptotics of coefficients of functions satisfying such equations have been studied for several special cases in the literature. We study in this paper the most general framework for Cauchy-Euler equations and propose an asymptotic theory that covers almost all applications where Cauchy-Euler equations appear. Our approach is very general and requires almost no background on differential equations. Indeed the whole theory can be stated in terms of recurrences instead of functions. Old and new applications of the theory are given. New phase changes of limit laws of new variations of quicksort are systematically derived. We apply our theory to about a dozen diverse examples in quicksort, binary search trees, urn models, increasing trees, etc.
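As a small illustration of the kind of recurrence involved (a standard textbook example, not taken from the paper), the expected number of comparisons C_n of quicksort satisfies C_n = n - 1 + (2/n) * sum of C_k for k < n. Its generating function C(z) obeys the first-order equation (1 - z) C'(z) - 2 C(z) = 2z / (1 - z)^2, which is of Cauchy-Euler type in the variable 1 - z, and the solution gives C_n = 2(n + 1)H_n - 4n. The Python sketch below, with illustrative function names, checks the recurrence against this closed form using exact rational arithmetic.

```python
from fractions import Fraction


def quicksort_comparisons(n_max: int) -> list:
    """Exact values C_0..C_n_max from C_n = n - 1 + (2/n) * sum_{k<n} C_k."""
    c = [Fraction(0)]
    running_sum = Fraction(0)
    for n in range(1, n_max + 1):
        running_sum += c[n - 1]            # now equals C_0 + ... + C_{n-1}
        c.append(Fraction(n - 1) + 2 * running_sum / n)
    return c


def closed_form(n: int) -> Fraction:
    """2 (n + 1) H_n - 4 n, where H_n is the n-th harmonic number."""
    harmonic = sum(Fraction(1, k) for k in range(1, n + 1))
    return 2 * (n + 1) * harmonic - 4 * n


if __name__ == "__main__":
    values = quicksort_comparisons(10)
    for n in range(1, 11):
        print(n, values[n], closed_form(n))
```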
A fast and space-economical algorithm for length-limited coding
 Proc. Int. Symp. Algorithms and Computation, pp. 12-21
, 1995
Cited by 15 (1 self)
Abstract. The minimum-redundancy prefix code problem is to determine a list of integer codeword lengths $l = [\,l_i \mid i \in \{1 \ldots n\}\,]$, given a list of $n$ symbol weights $p = [\,p_i \mid i \in \{1 \ldots n\}\,]$, such that $\sum_{i=1}^{n} 2^{-l_i} \le 1$ and $\sum_{i=1}^{n} l_i p_i$ is minimised. An extension is the minimum-redundancy length-limited prefix code problem, in which the further constraint $l_i \le L$ is imposed, for all $i \in \{1 \ldots n\}$ and some integer $L \ge \lceil \log_2 n \rceil$. The package-merge algorithm of Larmore and Hirschberg generates length-limited codes in $O(nL)$ time using $O(n)$ words of auxiliary space. Here we show how the size of the work space can be reduced to $O(L^2)$. This represents a useful improvement, since for practical purposes $L$ is $O(\log n)$.
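The length-limited problem stated above can be illustrated with a straightforward Python sketch of package-merge. This is a plain list-based version that keeps every package in memory, so it is neither the space-efficient formulation of Larmore and Hirschberg nor the O(L^2)-space variant the paper proposes. The function names are illustrative assumptions.

```python
from fractions import Fraction


def package_merge(weights, max_len):
    """Codeword lengths minimising sum(l_i * w_i) subject to l_i <= max_len."""
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two symbols")
    if n > 2 ** max_len:
        raise ValueError("max_len is too small for this many symbols")
    # Each entry is (total weight, list of symbol indices it contains).
    leaves = sorted(((w, [i]) for i, w in enumerate(weights)), key=lambda t: t[0])
    current = list(leaves)
    for _ in range(max_len - 1):
        # Package consecutive pairs, then merge the packages with fresh leaves.
        packages = [
            (current[j][0] + current[j + 1][0], current[j][1] + current[j + 1][1])
            for j in range(0, len(current) - 1, 2)
        ]
        current = sorted(leaves + packages, key=lambda t: t[0])
    lengths = [0] * n
    # Each appearance of a symbol among the 2n - 2 cheapest entries adds one bit.
    for _, symbols in current[: 2 * n - 2]:
        for i in symbols:
            lengths[i] += 1
    return lengths


def kraft_sum(lengths):
    """Left-hand side of the Kraft inequality, computed exactly."""
    return sum(Fraction(1, 2 ** l) for l in lengths)


if __name__ == "__main__":
    print(package_merge([1, 1, 2, 4], 3))   # limit loose enough for the Huffman lengths
    print(package_merge([1, 1, 2, 4], 2))   # limit forces a fixed-length code
```

With a limit of 3 the weights 1, 1, 2, 4 receive the same lengths as an unrestricted Huffman code, while a limit of 2 forces every codeword to two bits.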
Improved Bounds on the Inefficiency of Length-Restricted Prefix Codes
 Departamento de Informática, PUC-RJ, Rio de
, 1997
Cited by 14 (5 self)
Consider an alphabet $\Sigma = \{a_1, \ldots, a_n\}$ with corresponding symbol probabilities $p_1, \ldots, p_n$. The $L$-restricted prefix code is a prefix code where all the code lengths are not greater than $L$. The value $L$ is a given integer such that $L \ge \lceil \log n \rceil$. Define the average code length difference by $\epsilon = \sum_{i=1}^{n} p_i \, l_i - \sum_{i=1}^{n} p_i \, l_i^{*}$, where $l_1, \ldots, l_n$ are the code lengths of the optimal $L$-restricted prefix code for $\Sigma$ and $l_1^{*}, \ldots, l_n^{*}$ are the code lengths of the optimal prefix code for $\Sigma$. Let $\phi$ be the golden ratio 1.618. In this paper, we show that $\epsilon < 1/\phi^{\,L - \lceil \log(n + \lceil \log n \rceil - L) \rceil - 1}$ when $L > \lceil \log n \rceil$. We also prove the sharp bound $\epsilon < \lceil \log n \rceil - 1$ when $L = \lceil \log n \rceil$. By showing the lower bound $1/\bigl(\phi^{\,L - \lceil \log n \rceil + 2 + \lceil \log \frac{n}{n-L} \rceil} - 1\bigr)$ on the maximum value of $\epsilon$, we guarantee that our bound is asymptotically tight in the range $\lceil \log n \rceil < L \le n/2$. Furthermore, we present an $O(n)$ time and space $1/\phi^{\,L - \lceil \log \cdots}$...
The WARMUP Algorithm: A Lagrangean Construction of Length Restricted Huffman Codes
 Departamento de Informática, PUC-RJ, Rio de
, 1996
Cited by 13 (8 self)
Given an alphabet $\{a_1, \ldots, a_n\}$ with corresponding set of weights $\{w_1, \ldots, w_n\}$, and a number $L \ge \lceil \log n \rceil$, we introduce an $O(n \log n + n \log w)$ algorithm for constructing a suboptimal prefix code with restricted maximal length $L$, where $w$ is the highest presented weight. The number of additional bits per symbol generated by our code is not greater than $1/\phi^{\,L - \lceil \log(n + \lceil \log n \rceil - L) \rceil - 2}$ when $L > \lceil \log n \rceil + 1$, where $\phi$ is the golden ratio 1.618. An important feature of the proposed algorithm is its implementation simplicity. The algorithm is basically a selected sequence of Huffman tree constructions for modified weights. Keywords: Prefix codes, Huffman Trees, Lagrangean Duality. Resumo (translated from Portuguese): Given an alphabet $\{a_1, \ldots, a_n\}$ with corresponding weights $\{w_1, \ldots, w_n\}$ and a number $L \ge \lceil \log n \rceil$, we present an algorithm of complexity $O(n \log n + n \log w)$ for constructing suboptimal prefix codes with length restriction $L$, where $w$ is the largest weight of the given...
Codes for the World Wide Web
Cited by 8 (1 self)
Abstract. We introduce a new family of simple, complete instantaneous codes for positive integers, called ζ codes, which are suitable for integers distributed as a power law with small exponent (smaller than 2). The main motivation for the introduction of ζ codes comes from web-graph compression: if nodes are numbered according to URL lexicographical order, gaps in successor lists are distributed according to a power law with small exponent. We give estimates of the expected length of ζ codes against power-law distributions, and compare the results with analogous estimates for the more classical γ, δ and variable-length block codes.
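The abstract does not define the ζ codes, so the Python sketch below rests on an assumption: it follows the construction as it is usually described, with a shrinking parameter k. A positive integer x lying in the interval [2^(hk), 2^((h+1)k) - 1] is written as h + 1 in unary (h zeros and a one), followed by the minimal binary code of x - 2^(hk) within that interval. Under this definition k = 1 reproduces the γ code. All function names are illustrative.

```python
def to_bits(value: int, width: int) -> str:
    """`value` as a zero-padded binary string of exactly `width` bits."""
    return format(value, "b").zfill(width) if width > 0 else ""


def minimal_binary(value: int, size: int) -> str:
    """Minimal binary code of `value` in the range [0, size - 1]."""
    s = (size - 1).bit_length()          # ceil(log2(size)) for size >= 1
    short_count = (1 << s) - size        # how many values get the shorter code
    if value < short_count:
        return to_bits(value, s - 1)
    return to_bits(value - size + (1 << s), s)


def zeta_code(x: int, k: int) -> str:
    """Bit string of the zeta code with parameter k (assumed definition)."""
    if x < 1 or k < 1:
        raise ValueError("x and k must be positive")
    h = (x.bit_length() - 1) // k        # x lies in [2**(h*k), 2**((h+1)*k) - 1]
    low = 1 << (h * k)
    size = (1 << ((h + 1) * k)) - low
    unary = "0" * h + "1"                # h + 1 in unary
    return unary + minimal_binary(x - low, size)


if __name__ == "__main__":
    for x in range(1, 9):
        print(x, zeta_code(x, 1), zeta_code(x, 2))
```

Larger values of k spend more bits on small integers but grow more slowly, which suits the heavy-tailed gap distributions the abstract mentions.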
Compressed String Dictionaries
Cited by 7 (3 self)
The problem of storing a set of strings (a string dictionary) in compact form appears naturally in many cases. While classically it has represented a small part of the whole data to be processed (e.g., for natural language processing or for indexing text collections), recent applications in Web engines, RDF graphs, Bioinformatics, and many others, handle very large string dictionaries, whose size is a significant fraction of the whole data. Thus efficient approaches to compress them are necessary. In this paper we empirically compare time and space performance of some existing alternatives, as well as new ones we propose. We show that space reductions of up to 20% of the original size of the strings are possible while supporting dictionary searches within a few microseconds, and up to 10% within a few tens or hundreds of microseconds.
Self-Indexing Natural Language
, 2008
Cited by 6 (4 self)
Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful in reducing the size of the large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching for phrases.
Lossless Compression for Text and Images
 International Journal of High Speed Electronics and Systems
, 1995
Cited by 6 (0 self)
Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as images, particularly bilevel ones, or ones arising in medical and remote-sensing applications, or ones that may be required to be certified true for legal reasons. Moreover, during the process of lossy compression, many occasions for lossless compression of coefficients or other information arise. This paper surveys techniques for lossless compression. The process of compression can be broken down into modeling and coding. We provide an extensive discussion of coding techniques, and then introduce methods of modeling that are appropriate for text and images. Standard methods used in popular utilities (in the case of text) and international standards (in the case of images) are described. Keywords: Text compression, ima...