## The Burrows-Wheeler compression algorithm is even better than what you have thought (2005)

### BibTeX

@MISC{Landau05theburrows-wheeler,
author = {Shir Landau and Elad Verbin},
title = {The Burrows-Wheeler compression algorithm is even better than what you have thought},
year = {2005}
}

### Abstract

The best compression algorithm today for English text is based on the Burrows-Wheeler transform. This algorithm (whose common implementation is bzip2) consists of the following three essential steps: 1) obtain the Burrows-Wheeler transform of the text, 2) convert the transform into a sequence of integers using the move-to-front algorithm, 3) encode the integers using arithmetic coding or any order-0 encoding (possibly with run-length encoding). In this paper we achieve a strong bound on the worst-case compression ratio of this algorithm, which is significantly better than bounds known to date and is obtained via simple analytical techniques. Specifically, for any input string $s$ and any $\mu > 1$, the length of the compressed string is bounded by $\mu \cdot |s| H_k(s) + \log(\zeta(\mu)) \cdot |s| + g_k$, where $H_k$ is the $k$-th order empirical entropy, $g_k$ is a constant depending only on $k$ and on the size of the alphabet, and $\zeta(\mu) = \frac{1}{1^\mu} + \frac{1}{2^\mu} + \cdots$ is the standard zeta function. In fact we prove a stronger result: this bound, without the additive term $g_k$, holds when we replace $H_k(s)$ by the sum of the logarithms of the integers obtained by the move-to-front encoding of the transform. This refined bound is tight and close to the actual compression achieved in practice. To obtain this result we prove a tight result on the compressibility of integer sequences, which is of independent interest.
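The three steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code and not how bzip2 is implemented: the function names (`bwt`, `mtf_encode`, `order0_bits`, `sum_of_logs`) are hypothetical, the transform uses a naive sort of all rotations with an assumed `"\0"` end-of-string sentinel, step 3 is replaced by the ideal order-0 size $n \cdot H_0$ instead of a real arithmetic coder, and the "sum of logarithms" is assumed to be $\sum_i \log_2(x_i + 1)$ so that zeros are well defined.

```python
import math
from collections import Counter


def bwt(s: str, end: str = "\0") -> str:
    """Step 1: Burrows-Wheeler transform via sorted rotations (naive, demo only)."""
    assert end not in s, "sentinel must not occur in the input"
    t = s + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def mtf_encode(s: str, alphabet: list[str]) -> list[int]:
    """Step 2: move-to-front. Emit each symbol's current position, then move it to the front."""
    table = list(alphabet)
    out = []
    for ch in s:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out


def order0_bits(seq) -> float:
    """Stand-in for step 3: n * H0(seq), the ideal output size in bits of an order-0 coder."""
    n = len(seq)
    return sum(c * math.log2(n / c) for c in Counter(seq).values())


def sum_of_logs(ints) -> float:
    """Assumed form of the 'sum of logarithms' statistic: sum of log2(x + 1)."""
    return sum(math.log2(x + 1) for x in ints)


if __name__ == "__main__":
    text = "banana"
    transformed = bwt(text)
    ints = mtf_encode(transformed, sorted(set(transformed)))
    print(ints)  # [1, 3, 0, 3, 3, 3, 0]

    # Sanity check of the integer-sequence inequality for mu = 2,
    # using the closed form zeta(2) = pi^2 / 6.
    mu = 2
    bound = mu * sum_of_logs(ints) + len(ints) * math.log2(math.pi ** 2 / 6)
    print(order0_bits(ints) <= bound)
```

The final comparison only checks one instance of the inequality on a toy input; it says nothing about tightness, and the additive term $g_k$ from the entropy form of the bound does not appear because the check is stated in terms of the move-to-front integers.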

### Citations

688 | Arithmetic coding for data compression - Witten, Neal, et al. - 1987 |
Citation Context ... the paper we refer to an algorithm order0. By this we mean any order-0 algorithm, which is assumed to be a $(1, C_{order0})$-$n \cdot H_0$-competitive algorithm. For example, $C_{Huffman} = 1$ and $C_{Arithmetic} \approx 10^{-2}$ [19, 16]. 3.1 An optimal (µ, C)-SL-competitive algorithm We show, using a technique based on Lemma 2.2, that the algorithm order0 is $(\mu, \log \zeta(\mu) + C_{order0})$-sl-competitive for any µ > 1. In fact, we prove a s...

588 | A block-sorting lossless data compression algorithm - Burrows, Wheeler - 1994 |

352 | Universal Codeword Sets and Representations of the Integers - Elias - 1975 |
Citation Context ...tive and $\widehat{le}$-competitive compression algorithms. The problem we deal with in this section is related to the problem of universal encoding of integers. In the problem of universal encoding of integers [5, 3] the goal is to find a prefix-free encoding for integers, $U : \mathbb{Z}^+ \to \{0, 1\}^*$, such that for every x ≥ 0, $|U(x)| \le \mu \log(x + 1) + C$. A particularly nice solution for this is the Fibonacci Encoding [2,...

199 | Succinct Indexable Dictionaries with Applications to Encoding k-ary Trees and Multisets - Raman, Raman, et al. - 2002 |
Citation Context ...ed Text Index that comes within additive lower-order terms of the order-k entropy of the input text. This result makes heavy use of data structures for indexable dictionaries by Raman, Raman, and Rao [17]. For more on Compressed Text Indexing, see [11, 15, 8]. We leave open the question of how our techniques can be applied to the subject of Compressed Text Indexing. 2 Preliminaries Throughout the pape...

198 | High-Order Entropy-Compressed Text Indexes - Grossi, Gupta, et al. - 2003 |
Citation Context ...taneously both a compression algorithm and an indexing data structure. Early progress on Compressed Text Indexes was made by Manzini and Ferragina in [15]. A recent result by Grossi, Gupta and Vitter [10] presents a Compressed Text Index that comes within additive lower-order terms of the order-k entropy of the input text. This result makes heavy use of data structures for indexable dictionaries by Ra...

153 | A locally adaptive data compression scheme - Bentley, Sleator, et al. - 1986 |
Citation Context ...is bound is even more interesting than the bound itself. We define a new natural statistic of a text which we call the “Local Entropy” (le). This statistic was implicitly considered by Bentley et al. [3] as well as by Manzini [13]. Using two observations on the behavior of le we bypass some of the technical hurdles in the analysis of [13]. Our analysis actually proves a considerably stronger result: ...

145 | Arithmetic coding revisited - Moffat, Neal, et al. - 1998 |
Citation Context ... the paper we refer to an algorithm order0. By this we mean any order-0 algorithm, which is assumed to be a $(1, C_{order0})$-$n \cdot H_0$-competitive algorithm. For example, $C_{Huffman} = 1$ and $C_{Arithmetic} \approx 10^{-2}$ [19, 16]. 3.1 An optimal (µ, C)-SL-competitive algorithm We show, using a technique based on Lemma 2.2, that the algorithm order0 is $(\mu, \log \zeta(\mu) + C_{order0})$-sl-competitive for any µ > 1. In fact, we prove a s...

133 | An analysis of the Burrows-Wheeler transform - Manzini |
Citation Context ...endent interest. 1 Introduction In 1994, Burrows and Wheeler [4] introduced the Burrows-Wheeler Transform, and two new lossless text-compression algorithms that are based on this transform. Following [13], we refer to these algorithms as bw0 and bwRL. A well-known implementation of these algorithms, known as bzip2 [18], is among the best compressors for English text, and is definitely the fastest amon...

85 | Data compression - Lelewer, Hirschberg - 1987 |
Citation Context ...e Fibonacci Encoding [2, 9], for which $\mu = \log_\phi 2$, $C = 1 + \log_\phi \sqrt{5} \simeq 2.6723$. An additional solution for this problem was proposed by Elias [5]. This is an optimal solution, in the sense described in [12]. For more information on universal encoding of integers see the (somewhat outdated) survey paper [12]. It can be seen that a universal encoding scheme with parameters µ, C easily gives a (µ, C)-slcom...

58 | Engineering a lightweight suffix array construction algorithm - Manzini, Ferragina - 2002 |

45 | An alphabet-friendly FM-index - Ferragina, Manzini, et al. - 2004 |
Citation Context ...rder terms of the order-k entropy of the input text. This result makes heavy use of data structures for indexable dictionaries by Raman, Raman, and Rao [17]. For more on Compressed Text Indexing, see [11, 15, 8]. We leave open the question of how our techniques can be applied to the subject of Compressed Text Indexing. 2 Preliminaries Throughout the paper we assume 0 log 0 = 0, as customary. We will be caref...

43 | When indexing equals compression: Experiments with compressing suffix arrays and applications - Foschini, Grossi, et al. - 2006 |
Citation Context ...rder terms of the order-k entropy of the input text. This result makes heavy use of data structures for indexable dictionaries by Raman, Raman, and Rao [17]. For more on Compressed Text Indexing, see [11, 15, 8]. We leave open the question of how our techniques can be applied to the subject of Compressed Text Indexing. 2 Preliminaries Throughout the paper we assume 0 log 0 = 0, as customary. We will be caref...

40 | Boosting textual compression in optimal linear time - Ferragina, Giancarlo, et al. - 2005 |
Citation Context ...ions, we actually need to compute the bwt of s in reversed order, that is from right to left. This will, of course, not change our results and does not affect the compression ratio significantly (see [7] for a discussion on this), so we will ignore this point from now on. 2. Transform $\hat{s}$ to a string of integers $\dot{s} = \mathrm{mtf}(\hat{s})$ by using the move-to-front algorithm. This algorithm maintains the charact...

29 | Robust transmission of unbounded strings using Fibonacci representations - Apostolico, Fraenkel - 1987 |
Citation Context ... 3] the goal is to find a prefix-free encoding for integers, $U : \mathbb{Z}^+ \to \{0, 1\}^*$, such that for every x ≥ 0, $|U(x)| \le \mu \log(x + 1) + C$. A particularly nice solution for this is the Fibonacci Encoding [2, 9], for which $\mu = \log_\phi 2$, $C = 1 + \log_\phi \sqrt{5} \simeq 2.6723$. An additional solution for this problem was proposed by Elias [5]. This is an optimal solution, in the sense described in [12]. For more information ...

22 | On compressing and indexing data - Ferragina, Manzini |
Citation Context ...l text. A Compressed Text Index is therefore simultaneously both a compression algorithm and an indexing data structure. Early progress on Compressed Text Indexes was made by Manzini and Ferragina in [15]. A recent result by Grossi, Gupta and Vitter [10] presents a Compressed Text Index that comes within additive lower-order terms of the order-k entropy of the input text. This result makes heavy use o...

10 | Robust universal complete codes for transmission and compression - Fraenkel, Klein - 1996 |
Citation Context ... 3] the goal is to find a prefix-free encoding for integers, $U : \mathbb{Z}^+ \to \{0, 1\}^*$, such that for every x ≥ 0, $|U(x)| \le \mu \log(x + 1) + C$. A particularly nice solution for this is the Fibonacci Encoding [2, 9], for which $\mu = \log_\phi 2$, $C = 1 + \log_\phi \sqrt{5} \simeq 2.6723$. An additional solution for this problem was proposed by Elias [5]. This is an optimal solution, in the sense described in [12]. For more information ...

7 | bzip2, a program and library for data compression, http://www.bzip.org - Seward |
Citation Context ...ew lossless text-compression algorithms that are based on this transform. Following [13], we refer to these algorithms as bw0 and bwRL. A well-known implementation of these algorithms, known as bzip2 [18], is among the best compressors for English text, and is definitely the fastest among them. For example, bzip2 typically shrinks an English text to about 20% of its original size while gzip only gets 26% (...