Results 1–10 of 45
Lightweight natural language text compression
Information Retrieval, 2007
Cited by 27 (21 self)

Abstract:
Variants of Huffman codes where words are taken as the source symbols are currently the most attractive choices to compress natural language text databases. In particular, Tagged Huffman Code by Moura et al. offers fast direct searching on the compressed text and random access capabilities, in exchange for producing around 11% larger compressed files. This work describes End-Tagged Dense Code and (s,c)-Dense Code, two new semistatic statistical methods for compressing natural language texts. These techniques permit simpler and faster encoding and obtain better compression ratios than Tagged Huffman Code, while maintaining its fast direct search and random access capabilities. We show that Dense Codes improve the compression ratio of Tagged Huffman Code by about 10%, reaching only 0.6% overhead over the optimal Huffman compression ratio. Being simpler, Dense Codes are generated 45% to 60% faster than Huffman codes. This makes Dense Codes a very attractive alternative to Huffman code variants: they are simpler to program, faster to build, of almost optimal size, and as fast and easy to search as the best Huffman variants, which do not come so close to the optimal size.
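The End-Tagged Dense Code mentioned above admits an unusually simple codeword assignment: the high bit of each byte flags the end of a codeword, and a word's frequency rank is written as a bijective base-128 numeral. A minimal sketch of that mapping (our illustration, not the authors' code):

```python
def etdc_encode(rank):
    """Map a 0-based frequency rank to an End-Tagged Dense Code codeword.
    The high bit (0x80) marks the last byte, so codewords are self-delimiting
    and the compressed text can be searched and accessed byte-wise."""
    out = [0x80 | (rank % 128)]        # flagged last byte
    rank //= 128
    while rank > 0:                    # bijective base-128 for the rest
        rank -= 1
        out.append(rank % 128)         # continuation bytes: high bit clear
        rank //= 128
    return bytes(reversed(out))

def etdc_decode(code):
    """Inverse mapping: codeword bytes back to the frequency rank."""
    rank = 0
    for b in code[:-1]:
        rank = rank * 128 + b + 1
    return rank * 128 + (code[-1] & 0x7F)
```

The 128 most frequent words get one-byte codewords, the next 128² two-byte codewords, and so on; no code tree or per-corpus optimization is needed, which is where the quoted encoding speedup comes from.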
(S,C)-Dense Coding: An optimized compression code for natural language text databases
In Proc. 10th Intl. Symp. on String Processing and Information Retrieval (SPIRE’03), LNCS 2857, 2003
Cited by 18 (11 self)

Abstract:
This work presents (s,c)-Dense Code, a new method for compressing natural language texts. This technique is a generalization of a previous compression technique called End-Tagged Dense Code that obtains a better compression ratio as well as simpler and faster encoding than Tagged Huffman. At the same time, (s,c)-Dense Code is a prefix code that maintains the most interesting features of Tagged Huffman Code with respect to direct search on the compressed text. (s,c)-Dense Coding retains all the efficiency and simplicity of Tagged Huffman and improves its compression ratios. We formally describe the (s,c)-Dense Code and show how to compute the parameters s and c that optimize the compression for a specific corpus. Our empirical results show that (s,c)-Dense Code improves on End-Tagged Dense Code and Tagged Huffman Code, and reaches only 0.5% overhead over plain Huffman Code.
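The parameters s and c above (with s + c = 256) split byte values into s "continuers" and c "stoppers"; End-Tagged Dense Code is the special case s = c = 128. A sketch of the encoder, with a brute-force search standing in for the paper's method of choosing s (illustrative only; ranks are assumed sorted by decreasing frequency):

```python
def scdc_encode(rank, s):
    """(s,c)-Dense Code: byte values 0..s-1 continue a codeword,
    s..255 end one. End-Tagged Dense Code is s = c = 128."""
    c = 256 - s
    out = [s + rank % c]          # stopper (last byte)
    rank //= c
    while rank > 0:               # bijective base-s for the remaining digits
        rank -= 1
        out.append(rank % s)      # continuer
        rank //= s
    return bytes(reversed(out))

def best_s(freqs):
    """Brute-force the s minimizing total compressed size for a ranked
    frequency list; the paper computes the optimum more cleverly."""
    return min(range(1, 256),
               key=lambda s: sum(f * len(scdc_encode(r, s))
                                 for r, f in enumerate(freqs)))
```

With s = 200 and c = 56, one-byte codewords cover ranks 0..55 and two-byte codewords the next 200 × 56 ranks, illustrating the trade the optimization exploits.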
Skeleton Trees for the Efficient Decoding of Huffman Encoded Texts
Information Retrieval, 1997
Cited by 10 (4 self)

Abstract:
A new data structure is investigated, which allows fast decoding of texts encoded by canonical Huffman codes. The storage requirements are much lower than for conventional Huffman trees, O(log² n) for trees of depth O(log n), and decoding is faster, because a part of the bit comparisons necessary for the decoding may be saved. Empirical results on large real-life distributions show a reduction of 50% and more in the number of bit operations. The basic idea is then generalized, yielding further savings. This is an extended version of a paper presented at the 8th Annual Symposium on Combinatorial Pattern Matching (CPM'97), which appeared in its proceedings, pp. 65–75.
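For context, a canonical Huffman code can be decoded from just the number of codewords of each length plus the symbols in canonical order; the skeleton tree saves many of the per-bit comparisons this plain decoder performs. A sketch of that baseline (our reconstruction, not the paper's data structure):

```python
def canonical_decode(bits, counts, symbols):
    """Decode a canonical-Huffman bit sequence.

    counts[l] -- number of codewords of length l
    symbols   -- symbols in canonical order (by length, then rank)
    """
    out = []
    code = first = idx = length = 0
    for bit in bits:
        code = (code << 1) | bit
        first <<= 1                       # first canonical code of this length
        length += 1
        n = counts.get(length, 0)
        if code - first < n:              # codeword complete at this length
            out.append(symbols[idx + code - first])
            code = first = idx = length = 0
        else:                             # skip past all length-l codewords
            idx += n
            first += n
    return out
```

For lengths a:1, b:2, c:3, d:3 the canonical codes are 0, 10, 110, 111, so the bit stream 111 0 10 decodes to d, a, b.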
Lossless Compression for Text and Images
International Journal of High Speed Electronics and Systems, 1995
Cited by 7 (0 self)

Abstract:
Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as images: particularly bilevel ones, ones arising in medical and remote-sensing applications, or ones that may be required to be certified true for legal reasons. Moreover, during the process of lossy compression, many occasions for lossless compression of coefficients or other information arise. This paper surveys techniques for lossless compression. The process of compression can be broken down into modeling and coding. We provide an extensive discussion of coding techniques, and then introduce methods of modeling that are appropriate for text and images. Standard methods used in popular utilities (in the case of text) and international standards (in the case of images) are described. Keywords: Text compression, ima...
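The modeling/coding split the survey describes can be made concrete with its simplest instance: an order-0 frequency model feeding a Huffman coder. A toy sketch (ours, for illustration):

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Modeling: order-0 symbol counts. Coding: a Huffman code over them."""
    freq = Counter(text)
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tie = len(heap)                       # tie-breaker so tuples stay comparable
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]                     # symbol -> bit string
```

Swapping the model (e.g. higher-order context counts) or the coder (e.g. arithmetic coding) independently is exactly the decomposition the survey builds on.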
Word-based Self-Indexes for Natural Language Text
Cited by 6 (3 self)

Abstract:
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text, using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but also efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases, which is useful for computing the relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removal of stopwords, etc., as is usual on inverted indexes.
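The first step toward such a word-based self-index is to recode the text as a sequence of word identifiers over a word vocabulary, applying normalizations such as case folding only to the searchable side. A toy sketch of that recoding (our illustration; the tokenizer and the handling of presentation data are simplified):

```python
import re

def word_sequence(text):
    """Split text into a searchable word-id sequence plus a vocabulary.
    Case folding is applied to the searchable form; in the paper's
    architecture the original spelling is kept as separate presentation data."""
    vocab, ids = {}, []
    for token in re.findall(r"\w+", text):
        word = token.lower()                      # case folding, searchable side
        ids.append(vocab.setdefault(word, len(vocab)))
    return ids, list(vocab)

ids, vocab = word_sequence("The cat saw the dog")
# vocab: ['the', 'cat', 'saw', 'dog'], ids: [0, 1, 2, 0, 3]
```

The self-index is then built over the id sequence, so "The" and "the" become one searchable symbol while the presentation layer can still restore either form.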
A block-based inter-band lossless hyperspectral image compressor
 in Proc. of IEEE Data Compression Conference, 2005
Cited by 5 (0 self)

Abstract:
We propose a hyperspectral image compressor called BH which considers its input image as being partitioned into square blocks, each lying entirely within a particular band, and compresses one such block at a time using the following steps: first predict the block from the corresponding block in the previous band, then select a pre-designed code based on the prediction errors, and finally encode the predictor coefficient and errors. Apart from giving good compression rates and being fast, BH can provide random access to spatial locations in the image. We hypothesize that BH works well because it accommodates the rapidly changing image brightness that often occurs in hyperspectral images. We also propose an intra-band compressor called LM which is worse than BH, but whose performance helps explain BH's performance.
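The prediction step can be illustrated with a one-coefficient linear predictor fitted per block by least squares; the abstract does not specify BH's exact predictor, so treat this as an assumed stand-in:

```python
def predict_block(cur, prev):
    """Fit cur ≈ a * prev over a block's pixels (least squares) and return
    the coefficient plus the residuals that would be entropy-coded.
    A single-coefficient linear predictor is our assumption, not BH's spec."""
    den = sum(p * p for p in prev)
    a = sum(c * p for c, p in zip(cur, prev)) / den if den else 0.0
    resid = [c - a * p for c, p in zip(cur, prev)]
    return a, resid
```

Because a is fitted per block, a sudden brightness change between bands is absorbed by the coefficient rather than inflating the residuals, which matches the paper's hypothesis about why BH works well.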
Efficient Implementation of the WARMUP Algorithm for the Construction of Length-Restricted Prefix Codes
In Proceedings of the ALENEX, 1999
Cited by 5 (0 self)

Abstract:
Given an alphabet Σ = {a_1, ..., a_n} with a corresponding list of positive weights {w_1, ..., w_n} and a length restriction L, the length-restricted prefix code problem is to find a prefix code that minimizes Σ_{i=1..n} w_i l_i, where l_i, the length of the codeword assigned to a_i, cannot be greater than L, for i = 1, ..., n. In this paper, we present an efficient implementation of the WARMUP algorithm, an approximative method for this problem. The worst-case time complexity of WARMUP is O(n log n + n log w_n), where w_n is the greatest weight. However, some experiments with a previous implementation of WARMUP show that it runs in linear time for several practical cases, if the input weights are already sorted. In addition, it often produces optimal codes. The proposed implementation combines two new enhancements to reduce the space usage of WARMUP and to improve its execution time. As a result, it is about ten times faster than the previous implementat...
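To make the problem statement concrete: a set of codeword lengths corresponds to some prefix code iff it satisfies the Kraft inequality, and the restriction adds l_i ≤ L. A small checker for candidate length assignments (illustrative only; WARMUP itself is not reproduced here):

```python
from fractions import Fraction

def feasible(lengths, L):
    """A prefix code with these codeword lengths exists iff the Kraft
    inequality sum(2^-l) <= 1 holds; the restriction adds l <= L."""
    return (all(l <= L for l in lengths)
            and sum(Fraction(1, 2 ** l) for l in lengths) <= 1)

def cost(weights, lengths):
    """The objective WARMUP approximately minimizes: sum of w_i * l_i."""
    return sum(w * l for w, l in zip(weights, lengths))
```

For instance, lengths (1, 2, 2) are feasible under L = 2, while (1, 1, 2) violates Kraft and (1, 2, 3) violates the length bound.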
Worst-Case Optimal Adaptive Prefix Coding
In: Proceedings of the Algorithms and Data Structures Symposium (WADS), 2009
Cited by 4 (4 self)

Abstract:
A common complaint about adaptive prefix coding is that it is much slower than static prefix coding. Karpinski and Nekrich recently took an important step towards resolving this: they gave an adaptive Shannon coding algorithm that encodes each character in O(1) amortized time and decodes it in O(log H) amortized time, where H is the empirical entropy of the input string s. For comparison, Gagie's adaptive Shannon coder and both Knuth's and Vitter's adaptive Huffman coders all use Θ(H) amortized time for each character. In this paper we give an adaptive Shannon coder that both encodes and decodes each character in O(1) worst-case time. As with both previous adaptive Shannon coders, we store s in at most (H + 1)|s| + o(|s|) bits. We also show that this encoding length is worst-case optimal up to the lower-order term.
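The idea behind adaptive Shannon coding can be seen in a deliberately slow sketch: encoder and decoder keep identical symbol counts, rebuild a canonical code with Shannon lengths ceil(log2(total/count)) before each character, and update the counts afterwards. (The paper's O(1)-per-character data structures are not reproduced; the toy alphabet and tie-breaking below are our assumptions.)

```python
import math

ALPHABET = "ab"   # toy alphabet; a real coder works over any fixed alphabet

def shannon_code(counts):
    """Canonical code with Shannon lengths ceil(log2(total/count))."""
    total = sum(counts.values())
    lengths = {s: max(1, math.ceil(math.log2(total / counts[s]))) for s in counts}
    code, prev_len, next_code = {}, 0, 0
    for s in sorted(counts, key=lambda s: (lengths[s], s)):  # canonical order
        next_code <<= (lengths[s] - prev_len)
        code[s] = format(next_code, "0{}b".format(lengths[s]))
        next_code += 1
        prev_len = lengths[s]
    return code

def encode(text):
    counts = {s: 1 for s in ALPHABET}         # both sides start identically
    bits = ""
    for ch in text:
        bits += shannon_code(counts)[ch]      # code from counts seen so far
        counts[ch] += 1
    return bits

def decode(bits, n):
    counts = {s: 1 for s in ALPHABET}         # mirror the encoder's model
    out, pos = "", 0
    for _ in range(n):
        inv = {v: k for k, v in shannon_code(counts).items()}
        buf = ""
        while buf not in inv:                 # prefix-freeness makes this safe
            buf += bits[pos]
            pos += 1
        out += inv[buf]
        counts[inv[buf]] += 1
    return out
```

Rebuilding the whole code per character costs far more than O(1); the cited results are precisely about replacing this rebuild with data structures that support constant-time encoding and decoding.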