The engineering of a compression boosting library: Theory vs practice in BWT compression
In Proc. 14th European Symposium on Algorithms (ESA ’06), 2006
Cited by 11 (6 self)
Abstract. Data Compression is one of the most challenging arenas both for algorithm design and engineering. This is particularly true for Burrows and Wheeler Compression, a technique that is important in itself and for the design of compressed indexes. There has been considerable debate on how to design and engineer compression algorithms based on the BWT paradigm. In particular, Move-to-Front Encoding is generally believed to be an “inefficient” part of the Burrows-Wheeler compression process. However, only recently two theoretically superior alternatives to Move-to-Front have been proposed, namely Compression Boosting and Wavelet Trees. The main contribution of this paper is to provide the first experimental comparison of these three techniques, giving a much needed methodological contribution to the current debate. We do so by providing a carefully engineered compression boosting library that can be used, on the one hand, to investigate the myriad new compression algorithms that can be based on boosting, and on the other hand, to make the first experimental assessment of how Move-to-Front behaves with respect to its recently proposed competitors. The main conclusion is that Boosting, Wavelet Trees and Move-to-Front yield quite close compression performance. Finally, our extensive experimental study of boosting technique brings to light a new fact overlooked in 10 years of experiments in the area: a fast adapting order-zero compressor is enough to provide state-of-the-art BWT compression by simply compressing the run-length encoded transform. In other words, Move-to-Front, Wavelet Trees, and Boosters can all be bypassed by a fast learner.
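For readers unfamiliar with the Move-to-Front stage discussed in this abstract, the following is a minimal textbook sketch of the encoding and its inverse. It is not taken from the paper's library; the function names and the explicit `alphabet` parameter are assumptions made for illustration.

```python
def mtf_encode(text, alphabet):
    """Replace each symbol by its current position in a list,
    then move that symbol to the front of the list."""
    table = list(alphabet)
    codes = []
    for ch in text:
        i = table.index(ch)            # current rank of the symbol
        codes.append(i)
        table.insert(0, table.pop(i))  # move the symbol to the front
    return codes


def mtf_decode(codes, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table.pop(i)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)


# Runs of equal symbols become runs of zeros.
print(mtf_encode("aaabbbccc", "abc"))  # [0, 0, 0, 1, 0, 0, 2, 0, 0]
```

Because BWT output tends to contain long runs of equal symbols, the resulting code sequence is dominated by small integers, which is what a later entropy coder exploits.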
List update with locality of reference
In Proceedings of the 8th Latin American Theoretical Informatics Symposium, 2008
Cited by 4 (3 self)
Abstract. It is known that in practice, request sequences for the list update problem exhibit a certain degree of locality of reference. Motivated by this observation we apply the locality of reference model for the paging problem due to Albers et al. [STOC 2002/JCSS 2005] in conjunction with bijective analysis [SODA 2007] to list update. Using this framework, we prove that Move-to-Front (MTF) is the unique optimal algorithm for list update. This addresses the open question of defining an appropriate model for capturing locality of reference in the context of list update [Hester and Hirschberg ACM Comp. Surv. 1985]. Our results hold both for the standard cost function of Sleator and Tarjan [CACM 1985] and the improved cost function proposed independently by Martínez and Roura [TCS 2000] and Munro [ESA 2000]. This result resolves an open problem of Martínez and Roura, namely proposing a measure which can successfully separate MTF from all other list-update algorithms.
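As a small illustration of the list update problem, the sketch below computes access costs under the standard model, in which accessing the item at position i costs i. It compares Move-to-Front with Transpose and with a static list on a request sequence with strong locality. The function name and the `rule` strings are hypothetical, and the code only simulates costs; it does not implement bijective analysis.

```python
def list_update_cost(requests, initial, rule):
    """Total access cost of serving `requests` on a list that starts
    as `initial`. Accessing the item at position i (1-based) costs i.
    rule: "mtf", "transpose", or "static" (illustrative names)."""
    lst = list(initial)
    cost = 0
    for x in requests:
        i = lst.index(x)
        cost += i + 1                              # 1-based position
        if rule == "mtf":
            lst.insert(0, lst.pop(i))              # move to front
        elif rule == "transpose" and i > 0:
            lst[i - 1], lst[i] = lst[i], lst[i - 1]  # swap with predecessor
    return cost


requests = "cccc"  # repeated requests to one item: extreme locality
for rule in ("mtf", "transpose", "static"):
    print(rule, list_update_cost(requests, "abc", rule))
# mtf 6, transpose 7, static 12
```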
Paging and List Update under Bijective Analysis
2009
Cited by 4 (0 self)
It has long been known that for the paging problem in its standard form, competitive analysis cannot adequately distinguish algorithms based on their performance: there exists a vast class of algorithms which achieve the same competitive ratio, ranging from extremely naive and inefficient strategies (such as Flush-When-Full), to strategies of excellent performance in practice (such as Least-Recently-Used and some of its variants). A similar situation arises in the list update problem: in particular, under the cost formulation studied by Martínez and Roura [TCS 2000] and Munro [ESA 2000] every list update algorithm has, asymptotically, the same competitive ratio. Several refinements of competitive analysis, as well as alternative performance measures have been introduced in the literature, with varying degrees of success in narrowing this disconnect between theoretical analysis and empirical evaluation. In this paper we study these two fundamental online problems under the framework of bijective analysis [Angelopoulos, Dorrigiv and López-Ortiz, SODA 2007 and LATIN 2008]. This is an intuitive technique which is based on pairwise comparison of the costs incurred by two algorithms on sets of request sequences of the same size. Coupled with a well-established model of locality of reference due to Albers, Favrholdt and Giel [JCSS 2005], we show that Least-Recently-Used and Move-to-Front are the unique optimal algorithms for paging and list update, respectively. Prior to this work, only measures based on average-cost analysis have separated LRU and MTF from all other algorithms. Given that bijective analysis is a fairly stringent measure (and also subsumes average-cost analysis), we prove that in a strong sense LRU and MTF stand out as the best (deterministic) algorithms.
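To make the contrast between Flush-When-Full and Least-Recently-Used concrete, here is a minimal sketch that counts page faults for both policies on one request sequence. The function names and the example sequence are assumptions for illustration; the code simulates the two policies only and does not perform bijective analysis.

```python
from collections import OrderedDict


def lru_faults(requests, k):
    """Page faults of Least-Recently-Used with a cache of k pages."""
    cache = OrderedDict()              # oldest entry first
    faults = 0
    for p in requests:
        if p in cache:
            cache.move_to_end(p)       # mark as most recently used
        else:
            faults += 1
            if len(cache) == k:
                cache.popitem(last=False)  # evict least recently used
            cache[p] = True
    return faults


def fwf_faults(requests, k):
    """Page faults of Flush-When-Full: empty the whole cache on a
    fault that occurs while the cache is full."""
    cache = set()
    faults = 0
    for p in requests:
        if p not in cache:
            faults += 1
            if len(cache) == k:
                cache.clear()
            cache.add(p)
    return faults


seq = [1, 2, 3, 1, 2, 4, 1, 2]
print(lru_faults(seq, 3), fwf_faults(seq, 3))  # 4 6
```

On this sequence the two policies differ in fault count even though, as the abstract notes, competitive analysis assigns them the same ratio.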
Move-to-Front, Distance Coding, and Inversion Frequencies Revisited
2007
Cited by 4 (0 self)
Move-to-Front, Distance Coding and Inversion Frequencies are three somewhat related techniques used to process the output of the Burrows-Wheeler Transform. In this paper we analyze these techniques from the point of view of how effective they are in the task of compressing low-entropy strings, that is, strings which have many regularities and are therefore highly compressible. This is a nontrivial task since many compressors have nonconstant overheads that become nonnegligible when the input string is highly compressible. Because of the properties of the Burrows-Wheeler transform, being locally optimal ensures an algorithm compresses low-entropy strings effectively. Informally, local optimality implies that an algorithm is able to effectively compress an arbitrary partition of the input string. We show that in their original formulation neither Move-to-Front, nor Distance Coding, nor Inversion Frequencies is locally optimal. Then, we describe simple variants of the above algorithms which are locally optimal. To achieve local optimality with Move-to-Front it suffices to combine it with Run Length Encoding. To achieve local optimality with Distance Coding and Inversion Frequencies we use a novel “escape and re-enter” strategy.
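The role of Run Length Encoding on highly compressible input can be illustrated with a short sketch. This is a generic run-length coder, not the specific variant analyzed in the paper; the function names and the (symbol, length) pair representation are assumptions.

```python
from itertools import groupby


def rle_encode(text):
    """Collapse each maximal run of equal symbols into (symbol, length)."""
    return [(sym, len(list(run))) for sym, run in groupby(text)]


def rle_decode(pairs):
    """Expand (symbol, length) pairs back into the original string."""
    return "".join(sym * n for sym, n in pairs)


print(rle_encode("aaaabbbaa"))   # [('a', 4), ('b', 3), ('a', 2)]
print(rle_encode("a" * 1000))    # [('a', 1000)]
```

A symbol-by-symbol coder emits one output item per input symbol, so a string of 1000 identical symbols still yields 1000 items. After run-length encoding, the same string is a single pair, and a later stage would see only one symbol per run.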
Empirical entropy in context
2007
In statistics as in life, many things become clearer when we consider context. Statisticians’ use of context itself becomes clearer, in fact, when we consider the past century. It was anathema to them prior to 1906, when Markov [22] proved the weak law of large numbers applies to chains of dependent events over finite domains (i.e., finite-state Markov processes). He published several papers on the statistics of dependent events and in 1913 gave an example of dependence in language: he analyzed the first 20000 characters of Pushkin’s Eugene Onegin and found the likelihood of a vowel was strongly affected by the presence of vowels in the four preceding positions. Many other examples have been found since, in physics, chemistry, biology, economics, sociology, psychology: every branch of the natural and social sciences. While Markov was developing the idea of Markov processes, another probability theorist, Borel, was starting an investigation into examples beyond their scope. Borel [2] defined a number to be normal in base b if, in its infinite b-ary representation, every k-tuple occurs with relative frequency 1/b^k; he called a number absolutely normal if normal in every base. Using the Borel-Cantelli Lemma, he showed nearly all numbers are absolutely normal, although his proof was completely nonconstructive. Sierpinski [28] gave the first example
Bounds for Compression in Streaming Models
2007
Compression algorithms and streaming algorithms are both powerful tools for dealing with massive data sets, but many of the best compression algorithms (e.g., those based on the Burrows-Wheeler Transform) at first seem incompatible with streaming. In this paper we consider several popular streaming models and ask in which, if any, we can compress as well as we can with the BWT. We first prove a nearly tight tradeoff between memory and redundancy for the Standard, Multipass and W-Streams models, demonstrating a bound that is achievable with the BWT but unachievable in those models. We then show we can compute the related Schindler Transform in the StreamSort model and the BWT in the Read-Write model and, thus, achieve that bound.
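For reference, the Burrows-Wheeler Transform itself can be sketched with the textbook sorted-rotations construction. This version holds the whole input in memory and takes quadratic space, which is exactly what a streaming model rules out, so it is not the construction used in the paper. The function names and the `$` sentinel are assumptions for illustration.

```python
def bwt(text, sentinel="$"):
    """Last column of the sorted rotations of text + sentinel.
    The sentinel must not occur in text; in ASCII, "$" sorts
    before all letters."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def inverse_bwt(last, sentinel="$"):
    """Rebuild the sorted rotation table column by column, then
    return the row that ends with the sentinel."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    row = next(r for r in table if r.endswith(sentinel))
    return row[:-1]


print(bwt("banana"))           # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```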
An Application of Self-organizing Data Structures to Compression
Abstract. List update algorithms have been widely used as subroutines in compression schemas, most notably as part of Burrows-Wheeler compression. The Burrows-Wheeler transform (BWT), which is the basis of many state-of-the-art general purpose compressors, applies a compression algorithm to a permuted version of the original text. List update algorithms are a common choice for this second stage of BWT-based compression. In this paper we perform an experimental comparison of various list update algorithms both as stand alone compression mechanisms and as a second stage of the BWT-based compression. Our experiments show MTF outperforms other list update algorithms in practice after BWT. This is consistent with the intuition that BWT increases locality of reference and the predicted result from the locality of reference model of Angelopoulos et al. [1]. Lastly, we observe that due to an often neglected difference in the cost models, good list update algorithms may be far from optimal for BWT compression and construct an explicit example of this phenomenon. This is a fact that had yet to be supported theoretically in the literature.