Results 1  10
of
24
Analysis of Arithmetic Coding for Data Compression
 INFORMATION PROCESSING AND MANAGEMENT
, 1992
"... Arithmetic coding, in conjunction with a suitable probabilistic model, can provide nearly optimal data compression. In this article we analyze the effect that the model and the particular implementation of arithmetic coding have on the code length obtained. Periodic scaling is often used in arithmet ..."
Abstract

Cited by 41 (6 self)
 Add to MetaCart
Arithmetic coding, in conjunction with a suitable probabilistic model, can provide nearly optimal data compression. In this article we analyze the effect that the model and the particular implementation of arithmetic coding have on the code length obtained. Periodic scaling is often used in arithmetic coding implementations to reduce time and storage requirements; it also introduces a recency effect which can further affect compression. Our main contribution is introducing the concept of weighted entropy and using it to characterize in an elegant way the effect that periodic scaling has on the code length. We explain why and by how much scaling increases the code length for files with a homogeneous distribution of symbols, and we characterize the reduction in code length due to scaling for files exhibiting locality of reference. We also give a rigorous proof that the coding effects of rounding scaled weights, using integer arithmetic, and encoding endoffile are negligible.
Semantically Motivated Improvements for PPM Variants
 The Computer Journal
, 1997
"... This paper explains how to significantly improve the compression performance of any PPM variant ..."
Abstract

Cited by 25 (3 self)
 Add to MetaCart
(Show Context)
This paper explains how to significantly improve the compression performance of any PPM variant
OnLine Stochastic Processes in Data Compression
, 1996
"... The ability to predict the future based upon the past in finitealphabet sequences has many applications, including communications, data security, pattern recognition, and natural language processing. By Shannon's theory and the breakthrough development of arithmetic coding, any sequence, a 1 ..."
Abstract

Cited by 16 (6 self)
 Add to MetaCart
The ability to predict the future based upon the past in finitealphabet sequences has many applications, including communications, data security, pattern recognition, and natural language processing. By Shannon's theory and the breakthrough development of arithmetic coding, any sequence, a 1 a 2 \Delta \Delta \Delta a n , can be encoded in a number of bits that is essentially equal to the minimal informationlossless codelength, P i \Gamma log 2 p(a i ja 1 \Delta \Delta \Delta a i\Gamma1 ). The goal of universal online modeling, and therefore of universal data compression, is to deduce the model of the input sequence a 1 a 2 \Delta \Delta \Delta a n that can estimate each p(a i ja 1 \Delta \Delta \Delta a i\Gamma1 ) knowing only a 1 a 2 \Delta \Delta \Delta a i\Gamma1 so that the ex...
W.J.: Experiments on the zero frequency problem
 In: Proceedings of the Conference on Data Compression. p. 480. IEEE Computer Society
, 1995
"... ..."
(Show Context)
Lossless Compression for Text and Images
 International Journal of High Speed Electronics and Systems
, 1995
"... Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as imagesparticularly bilevel ..."
Abstract

Cited by 10 (0 self)
 Add to MetaCart
Most data that is inherently discrete needs to be compressed in such a way that it can be recovered exactly, without any loss. Examples include text of all kinds, experimental results, and statistical databases. Other forms of data may need to be stored exactly, such as imagesparticularly bilevel ones, or ones arising in medical and remotesensing applications, or ones that may be required to be certified true for legal reasons. Moreover, during the process of lossy compression, many occasions for lossless compression of coefficients or other information arise. This paper surveys techniques for lossless compression. The process of compression can be broken down into modeling and coding. We provide an extensive discussion of coding techniques, and then introduce methods of modeling that are appropriate for text and images. Standard methods used in popular utilities (in the case of text) and international standards (in the case of images) are described. Keywords Text compression, ima...
A Percolating State Selector for SuffixTree Context Models
 In Proceedings Data Compression Conference. IEEE Computer
, 1997
"... This paper introduces into practice and empirically evaluates a set of techniques for performing informationtheoretic state selection that have been developing in asymptotic results for over a decade. State selection, which actually implements the selection of an entire model from among a set of co ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
(Show Context)
This paper introduces into practice and empirically evaluates a set of techniques for performing informationtheoretic state selection that have been developing in asymptotic results for over a decade. State selection, which actually implements the selection of an entire model from among a set of competing models, is performed at least trivially by all of the suffixtree FSMs used for online probability estimation. The set of stateselection techniques presented here combines orthogonally with the other sets of design options covered in the companion papers, "A Generalization and Improvement to PPM's Blending," and, "An Executable Taxonomy of OnLine Modeling Algorithms," written by this author. The main results of this paper are: ffl a novel dynamic programming solution that does not resort to the suboptimal hillclimbing or global order bounds that are used in other techniques, ffl the successful combination of informationtheoretic state selection and em mixtures, which include ...
English to Persian transliteration
 In String Processing and Information Retrieval
, 2006
"... mapping of a Persian word that is not readily available in a bilingual dictionary—is an unstudied problem. In this paper we make three novel contributions. First, we present performance comparisons of existing graphemebased transliteration methods on English to Persian. Second, we discuss the diffi ..."
Abstract

Cited by 5 (3 self)
 Add to MetaCart
(Show Context)
mapping of a Persian word that is not readily available in a bilingual dictionary—is an unstudied problem. In this paper we make three novel contributions. First, we present performance comparisons of existing graphemebased transliteration methods on English to Persian. Second, we discuss the difficulties in establishing a corpus for studying transliteration. Finally, we introduce a new model of Persian that takes into account the habit of shortening, or even omitting, runs of English vowels. This trait makes transliteration of Persian particularly difficult for phonetic based methods. This new model outperforms the existing grapheme based methods on Persian, exhibiting a 24 % relative increase in transliteration accuracy measured using the top5 criteria. 1
Fast and Efficient Algorithms for Text and Video Compression
, 1997
"... There is a tradeoff between the speed of a data compressor and the level of compression it can achieve. Improving compression generally requires more computation; and improving speed generally sacrifices compression. In this thesis, we examine a range of tradeoffs for text and video. In text compres ..."
Abstract

Cited by 5 (1 self)
 Add to MetaCart
There is a tradeoff between the speed of a data compressor and the level of compression it can achieve. Improving compression generally requires more computation; and improving speed generally sacrifices compression. In this thesis, we examine a range of tradeoffs for text and video. In text compression, we attempt to bridge the gap between statistical techniques, which exhibit a greater amount of compression but are computationally intensive, and dictionarybased techniques, which give less compression but run faster. We combine the context modeling of statistical coding with dynamic dictionaries into a hybrid coding scheme we call Dictionary by Partial Matching. In lowbitrate video compression, we explore the speedcompression tradeoffs with a range of motion estimation techniques operating within the H.261 video coding standard. We initially consider algorithms that explicitly minimizes bit rate and combination of rate and distortion. With insights gained from the explicit minimization algorithms, we propose a new technique for motion estimation that minimizes an efficiently computed heuristic function. The new technique gives compression efficiency comparable to the explicitminimization algorithms while running much faster. We also explore bitminimization in a nonstandard quadtreebased video coder that codes
An MDL Estimate of the Significance of Rules
 In Proceedings of ISIS: Information, Statistics, and Induction in Science
, 1996
"... This paper proposes a new method for measuring the performance of modelswhether decision trees or sets of rulesinferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexit ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
This paper proposes a new method for measuring the performance of modelswhether decision trees or sets of rulesinferred by machine learning methods. Inspired by the minimum description length (MDL) philosophy and theoretically rooted in information theory, the new method measures the complexity of test data with respect to the model. It has been evaluated on rule sets produced by several different machine learning schemes on a large number of standard data sets. When compared with the usual percentage correct measure, it is shown to agree with it in restricted cases. However, in other more general cases taken from real data setsfor example, when rule sets make multiple or no predictionsit disagrees substantially. It is argued that the MDL measure is more reasonable in these cases. and represents a better way of assessing the significance of a rule set's performance. The question of the complexity of the rule set itself is not addressed in the paper. Keywords: Machine learn...
Symboldriven compression of burrows wheeler transformed text
, 2000
"... Despite the enormous growth in storage capacity in recent years, the search for fast and efficient text compression algorithms continues. As processor speed is increasing at a higher rate than disk access time is decreasing, there is now even more reason to store information in a compressed form th ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
Despite the enormous growth in storage capacity in recent years, the search for fast and efficient text compression algorithms continues. As processor speed is increasing at a higher rate than disk access time is decreasing, there is now even more reason to store information in a compressed form than there was previously. Prediction by Partial Matching (PPM), first published in 1984, was a significant step forward in the quest for efficient text compression. The Burrows Wheeler transform (BWT), introduced ten years later, has been the next significant breakthrough; its best implementations rank alongside those of PPM. In most BWT implementations, transformed text is converted to a string of ranks with a movetofront (MTF) or similar mechanism before being compressed. Ranks are then encoded with an Order model or a hierarchy of such models, with some substrings of repeated ranks encoded as run lengths. Although these rank based methods perform very well, the transformation to MTF numbers blurs the distinction between individual symbols and is a possible cause of ineffectiveness. Instead of relying on symbol ranking, we examine the problem of modelling the transformed text as a sequence of segments with iid symbols, using three different techniques.