Arithmetic coding revisited
 ACM Transactions on Information Systems
, 1995
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available.
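The interval-narrowing idea behind arithmetic coding can be illustrated with a minimal sketch. It is not the implementation the abstract describes: it assumes a fixed (non-adaptive) model and uses exact rational arithmetic from Python's `fractions` module rather than the low-precision shift/add arithmetic mentioned above. The names `build_model`, `encode`, and `decode` are hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def build_model(freqs):
    """Map each symbol to its cumulative-probability interval [low, high)."""
    total = sum(freqs.values())
    model, cum = {}, 0
    for sym in sorted(freqs):
        model[sym] = (Fraction(cum, total), Fraction(cum + freqs[sym], total))
        cum += freqs[sym]
    return model

def encode(message, model):
    """Narrow [0, 1) once per symbol; return the final interval as (low, width)."""
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        s_low, s_high = model[sym]
        low += width * s_low          # shift into the symbol's sub-interval
        width *= s_high - s_low       # shrink by the symbol's probability
    return low, width                 # any value in [low, low + width) identifies the message

def decode(value, length, model):
    """Recover `length` symbols from a value inside the final interval."""
    out = []
    for _ in range(length):
        for sym, (s_low, s_high) in model.items():
            if s_low <= value < s_high:
                out.append(sym)
                # Rescale so the chosen sub-interval becomes [0, 1) again.
                value = (value - s_low) / (s_high - s_low)
                break
    return "".join(out)

model = build_model({"a": 2, "b": 1, "c": 1})   # assumed toy frequencies
low, width = encode("abca", model)
decoded = decode(low, 4, model)
```

The final width equals the product of the symbol probabilities, so roughly -log2(width) bits suffice to name a point inside the interval. A practical coder would emit those bits incrementally with fixed-precision integers instead of keeping an unbounded fraction.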
Implementing the PPM Data Compression Scheme
, 1990
The “Prediction by Partial Matching” (PPM) data compression algorithm developed by Cleary and Witten is capable of very high compression rates, encoding English text in as little as 2.2 bits/character. Here it is shown that the estimates made by Cleary and Witten of the resources required to implement the scheme can be revised to allow for a tractable and useful implementation. In particular, a variant is described that encodes and decodes at over 4 kbytes/s on a small workstation, and operates within a few hundred kilobytes of data space, but still obtains compression of about 2.4 bits/character on English text.
Design and analysis of dynamic Huffman codes
 Journal of the ACM
, 1987
A new one-pass algorithm for constructing dynamic Huffman codes is introduced and analyzed. We also analyze the one-pass algorithm due to Faller, Gallager, and Knuth. In each algorithm, both the sender and the receiver maintain equivalent dynamically varying Huffman trees, and the coding is done in real time. We show that the number of bits used by the new algorithm to encode a message containing t letters is < t bits more than that used by the conventional two-pass Huffman scheme, independent of the alphabet size. This is best possible in the worst case, for any one-pass Huffman method. Tight upper and lower bounds are derived. Empirical tests show that the encodings produced by the new algorithm are shorter than those of the other one-pass algorithm and, except for long messages, are shorter than those of the two-pass method. The new algorithm is well suited for online encoding/decoding in data networks and for file compression.
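For reference, the conventional two-pass scheme used as the baseline in this abstract can be sketched as follows: the first pass counts letter frequencies, and the second pass encodes with a static Huffman code built from those counts. This is only the baseline, not the one-pass dynamic algorithm itself, and the function names are hypothetical. The cost computed here counts message bits only and ignores the cost of transmitting the code to the receiver.

```python
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a static Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}   # a lone symbol still needs one bit
    # Heap entries are (weight, unique tie-breaker, symbols in this subtree).
    heap = [(w, i, [sym]) for i, (sym, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:
            lengths[sym] += 1              # merged symbols move one level deeper
        heapq.heappush(heap, (w1 + w2, next_id, s1 + s2))
        next_id += 1
    return lengths

def two_pass_cost(message):
    """Bits needed to encode `message` with a static Huffman code (table excluded)."""
    freqs = Counter(message)               # pass 1: count letters
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs)   # pass 2: total code length

cost = two_pass_cost("abracadabra")
```

With a message of t = 11 letters, the bound stated in the abstract says the new one-pass algorithm would use fewer than 11 bits more than this two-pass cost.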
Data Compression
 ACM Computing Surveys
, 1987
This paper surveys a variety of data compression methods spanning almost forty years of research, from the work of Shannon, Fano and Huffman in the late 40's to a technique developed in 1986. The aim of data compression is to reduce redundancy in stored or communicated data, thus increasing effective data density. Data compression has important application in the areas of file storage and distributed systems. Concepts from information theory, as they relate to the goals and evaluation of data compression methods, are discussed briefly. A framework for evaluation and comparison of methods is constructed and applied to the algorithms presented. Comparisons of both theoretical and empirical natures are reported and possibilities for future research are suggested. INTRODUCTION Data compression is often referred to as coding, where coding is a very general term encompassing any special representation of data which satisfies a given need. Information theory is defined to be the study of eff...
Fast and Flexible Word Searching on Compressed Text
, 2000
... text. When searching complex or approximate patterns, our algorithms are up to 8 times faster than the search on uncompressed text. We also discuss the impact of our technique in inverted files pointing to logical blocks and argue for the possibility of keeping the text compressed all the time, decompressing only for displaying purposes.
Adding Compression to a Full-Text Retrieval System
, 1995
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text...
Compressed text databases with efficient query algorithms based on the compressed suffix array
 Proceedings of ISAAC'00, number 1969 in LNCS
, 2000
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however it also uses the text itself that occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text
Energy-Efficient Algorithms for . . .
, 2007
We study scheduling problems in battery-operated computing devices, aiming at schedules with low total energy consumption. While most of the previous work has focused on finding feasible schedules in deadline-based settings, in this article we are interested in schedules that guarantee good response times. More specifically, our goal is to schedule a sequence of jobs on a variable-speed processor so as to minimize the total cost consisting of the energy consumption and the total flow time of all jobs. We first show that when the amount of work, for any job, may take an arbitrary value, then no online algorithm can achieve a constant competitive ratio. Therefore, most of the article is concerned with unit-size jobs. We devise a deterministic constant-competitive online algorithm and show that
Code Compression for Embedded Systems
, 1998
Memory is one of the most restricted resources in many modern embedded systems. Code compression can provide substantial savings in terms of size. In a compressed-code CPU, a cache miss triggers the decompression of a main memory block, before it gets transferred to the cache. Because the code must be decompressible starting from any point (or at least at cache block boundaries), most file-oriented compression techniques cannot be used. We propose two algorithms to compress code in a space-efficient and simple-to-decompress way, one which is independent of the instruction set and another which depends on the instruction set. We perform experiments on two instruction sets, a typical RISC (MIPS) and a typical CISC (x86), and compare our results to existing file-oriented compression algorithms. 1 Introduction Many embedded computing systems are space and cost sensitive. As a result, available memory is limited, posing serious constraints on program size. We are studying ways of reducing the size of...
Adding Compression to Block Addressing Inverted Indexes
, 2000
Inverted index compression, block addressing and sequential search on compressed text are three techniques that have been separately developed for efficient, low-overhead text retrieval. Modern text compression techniques can reduce the text to less than 30% of its size and allow searching it directly and faster than the uncompressed text. Inverted index compression obtains significant reduction of their original size at the same processing speed. Block addressing makes the inverted lists point to text blocks instead of exact positions and pay the reduction in space with some sequential text scanning. In this work we combine the three ideas in a single scheme. We present a compressed inverted file that indexes compressed text and uses block addressing. We consider different techniques to compress the index and study their performance with respect to the block size. We compare the index against three separate techniques for varying block sizes, showing that our index is superior to each isolated approach. For instance, with just 4% of extra space overhead the index has to scan less than 12% of the text for exact searches and about 20% allowing one error in the matches. Keywords: Text compression, inverted files, block addressing, text databases.
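Block addressing on its own can be sketched in a few lines. The example below is an illustration under simplifying assumptions, not the scheme from the paper: the text and the index are left uncompressed, blocks are fixed-size runs of whitespace-separated words, and the names `build_block_index` and `search` are hypothetical.

```python
from collections import defaultdict

def build_block_index(text, block_size):
    """Split the text into blocks of `block_size` words; map each word to its block numbers."""
    words = text.split()
    blocks = [words[i:i + block_size] for i in range(0, len(words), block_size)]
    index = defaultdict(set)
    for block_no, block in enumerate(blocks):
        for word in block:
            index[word].add(block_no)      # store the block, not the exact position
    return blocks, index

def search(word, blocks, index, block_size):
    """Find exact word positions by sequentially scanning only the candidate blocks."""
    hits = []
    for block_no in sorted(index.get(word, ())):
        for offset, candidate in enumerate(blocks[block_no]):
            if candidate == word:
                hits.append(block_no * block_size + offset)
    return hits

text = "to be or not to be that is the question"
blocks, index = build_block_index(text, block_size=4)
positions = search("be", blocks, index, block_size=4)
```

Larger blocks shrink the index because each word stores fewer block numbers, at the price of more sequential scanning per query, which is the space/time trade-off the abstract studies with respect to block size.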