Results 1-10 of 94
Arithmetic coding revisited
ACM Transactions on Information Systems, 1995
Cited by 139 (2 self)
Over the last decade, arithmetic coding has emerged as an important compression tool. It is now the method of choice for adaptive coding on multisymbol alphabets because of its speed, low storage requirements, and effectiveness of compression. This article describes a new implementation of arithmetic coding that incorporates several improvements over a widely used earlier version by Witten, Neal, and Cleary, which has become a de facto standard. These improvements include fewer multiplicative operations, greatly extended range of alphabet sizes and symbol probabilities, and the use of low-precision arithmetic, permitting implementation by fast shift/add operations. We also describe a modular structure that separates the coding, modeling, and probability estimation components of a compression system. To motivate the improved coder, we consider the needs of a word-based text compression program. We report a range of experimental results using this and other models. Complete source code is available.
The String B-Tree: A New Data Structure for String Search in External Memory and its Applications
Journal of the ACM, 1998
Cited by 120 (11 self)
We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be successfully applied to database indexing and software duplication.
Unbounded Length Contexts for PPM
The Computer Journal, 1995
Cited by 111 (7 self)
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*. 1 PPM: Prediction by partial match The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k
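As an illustration of the finite-context idea in this abstract, the following is a minimal sketch of a single order-k model that counts which symbol follows each length-k context. It is not PPM or PPM* itself: there is no escape mechanism and no blending of several orders. The function names `build_context_model` and `predict` are hypothetical and chosen only for this example.

```python
from collections import Counter, defaultdict

def build_context_model(text, k):
    """Count which symbols follow each length-k context in text."""
    counts = defaultdict(Counter)
    for i in range(k, len(text)):
        context = text[i - k:i]      # the k symbols immediately before position i
        counts[context][text[i]] += 1
    return counts

def predict(counts, context):
    """Relative frequencies of symbols seen after context (empty dict if unseen)."""
    seen = counts.get(context)
    if not seen:
        return {}
    total = sum(seen.values())
    return {symbol: n / total for symbol, n in seen.items()}

# Order-1 model: in "abracadabra" the context "a" is followed by b, c, d, b.
model = build_context_model("abracadabra", 1)
probs = predict(model, "a")
print(probs)
```

An unseen context returns an empty dictionary here; a full PPM coder would instead fall back to a shorter context at that point.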
Analysis of Branch Prediction via Data Compression
In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
Cited by 83 (3 self)
Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. Knowing this theoretical basis helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy.
Adding Compression to a Full-Text Retrieval System
1995
Cited by 81 (25 self)
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text...
On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
Cited by 56 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
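The average log-loss mentioned in this abstract can be sketched as follows, assuming the usual definition: the mean of the negative logarithm of the probability a predictor assigned to each symbol that actually occurred. The base-2 logarithm (giving bits per symbol) and the function name `average_log_loss` are assumptions made for this example, not details taken from the paper.

```python
import math

def average_log_loss(probabilities):
    """Mean of -log2(p) over the probabilities assigned to the observed symbols."""
    if not probabilities:
        raise ValueError("need at least one prediction")
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)

# Hypothetical probabilities a predictor gave to three symbols that occurred.
loss = average_log_loss([0.5, 0.25, 0.25])
print(loss)  # (1 + 2 + 2) / 3 bits per symbol
```

Under this definition a lower value means better prediction, and the value equals the average code length per symbol that an ideal coder driven by the same predictor would need.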
A Corpus for the Evaluation of Lossless Compression Algorithms
1997
Cited by 55 (1 self)
This paper investigates how the reliability of these evaluations can be ensured, particularly the repeatability of experiments, in line with scientific method. The evaluation of compression methods can be analytical or empirical. Analytical results are generally expressed in terms of the compression of a system relative to the entropy of the source, which is assumed to belong to a specified class. Such results tend to have only asymptotic significance; for example, the LZ78 method [ZL78] converges to the entropy for very large inputs, but in practical situations files are far too short for this convergence to have any significance. For this reason empirical results are needed to establish the practical worth of a method. The main factor measured in typical empirical experiments is the amount of compression achieved on some set of files. Researchers also often report the speed of compression, and the amount of primary memory required to perform the compression. The speed and memory requirements can be different for the encoding and decoding processes, and may depend on the file being compressed. This can result in a daunting number of factors that need to be presented. A number of authors have used the "Calgary corpus" of texts to provide empirical results for lossless compression algorithms. This corpus was collected in 1987, although it was not published until 1990 [BCW90]. Recent advances with compression algorithms have been achieving relatively small improvements in compression, measured using the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not apply to other files. Furthermore, the corpus is almost ten years old, and over
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
Cited by 47 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
A Fast Block-sorting Algorithm for Lossless Data Compression
1996
Cited by 40 (0 self)
I describe a fast block-sorting algorithm and its implementation, to be used as a front end to simple lossless data compression algorithms like move-to-front coding. I also compare it with widely available data compression algorithms running on the same hardware. My algorithm achieves higher speed than comparable algorithms while maintaining the same good compression. Since it is derived from the algorithm published by M. Burrows and D.J. Wheeler, the size of the input blocks must be large to achieve good compression. Unlike their method, execution speed here does not depend on the block size used. I will also present improvements to the back end of block-sorting compression methods. Michael Schindler, A fast block-sorting algorithm for lossless data compression. 1 Introduction Today's popular lossless data compression algorithms are mainly based on the sequential data compression published by Lempel and Ziv in 1977 [1] and 1978 [2]. There were improvements like in [3] or the developme...
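The move-to-front coding named in this abstract can be illustrated with a short sketch. This is the textbook form of the technique, not the author's implementation, and it does not include the block-sorting stage; the function names and the explicit `alphabet` parameter are assumptions made for this example.

```python
def move_to_front_encode(data, alphabet):
    """Replace each symbol by its current index in a table, then move it to the front."""
    table = list(alphabet)
    codes = []
    for symbol in data:
        index = table.index(symbol)
        codes.append(index)
        table.insert(0, table.pop(index))   # recently used symbols get small indices
    return codes

def move_to_front_decode(codes, alphabet):
    """Invert move_to_front_encode by replaying the same table updates."""
    table = list(alphabet)
    symbols = []
    for index in codes:
        symbol = table.pop(index)
        symbols.append(symbol)
        table.insert(0, symbol)
    return "".join(symbols)

# Runs of equal symbols, as block sorting tends to produce, become runs of zeros.
codes = move_to_front_encode("bbbaaa", "abc")
print(codes)
print(move_to_front_decode(codes, "abc"))
```

The output of the encoder is dominated by small indices when the input has long runs, which is what makes it a convenient input for a simple back-end coder.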
Boosting textual compression in optimal linear time
Journal of the ACM, 2005
Cited by 39 (19 self)
Abstract. We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression Extended abstracts related to this article appeared in Proceedings of CPM 2001 and Proceedings of ACM-SIAM SODA 2004, and were combined due to their strong relatedness and complementarity. The work of P. Ferragina was partially supported by the Italian MIUR projects “Algorithms for the Next