## Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition

### Download Links

- [www.cs.technion.ac.il]
- [www.jmlr.org]
- [jmlr.org]
- [eprints.pascal-network.org]
- DBLP

Citations: 3 (0 self)

### BibTeX

@MISC{Begleiter_superiorguarantees,
  author = {Ron Begleiter and Ran El-Yaniv and Dana Ron},
  title = {Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition},
  year = {}
}

### Abstract

We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman's alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multi-alphabet prediction performance of CTW-based algorithms.

### Citations

8603 |
Elements of information theory
- Cover, Thomas
- 1991
Citation Context: ...optimal tree. However, if we replace each |S_i| value with its maximal value k^D, we are able to show that the bound is optimized when the decomposition tree is the Huffman decoding tree (see, e.g., Cover and Thomas, 1991, Chapter 5.6) of the sequence x_1^n. For any decomposition tree T and a sequence x_1^n, let n_i be the number of times that the internal node i ∈ T is visited when predicting x_1^n using the DECO al... |
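The excerpt above says the bound is optimized when the decomposition tree is the Huffman tree of the sequence. As an illustration only (the helper names `huffman_merge_order` and `depths` are ours, not from the paper), here is the standard greedy Huffman construction over symbol counts; each internal node of the resulting tree would correspond to one binary prediction problem in the hierarchical decomposition:

```python
import heapq

def huffman_merge_order(freqs):
    """Greedily merge the two least-frequent groups, as in Huffman coding.

    freqs: dict mapping symbol -> count. Returns a nested-tuple binary
    tree whose leaves are symbols; internal nodes correspond to binary
    prediction problems in the hierarchical decomposition.
    """
    # (count, unique index, subtree); the index breaks ties so that
    # heapq never compares two subtrees directly.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        n1, _, t1 = heapq.heappop(heap)
        n2, _, t2 = heapq.heappop(heap)
        counter += 1
        heapq.heappush(heap, (n1 + n2, counter, (t1, t2)))
    return heap[0][2]

def depths(tree, d=0):
    """Depth of each leaf symbol = number of binary predictors on its path."""
    if not isinstance(tree, tuple):
        return {tree: d}
    out = {}
    for child in tree:
        out.update(depths(child, d + 1))
    return out
```

For counts {a: 5, b: 2, c: 1, d: 1} the rare symbols c and d end up deepest, so they are routed through the most binary predictors, while the frequent symbol a is decided at the root.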

2348 | Computational complexity
- Papadimitriou
- 2003
Citation Context: ...if the sizes |S_i| were known, it is an NP-hard problem even to decide on the optimal partition corresponding to the root. This hardness result can be obtained by a reduction from MAX-CUT (see, e.g., Papadimitriou, 1994, Chapter 9.3). Hence, we can only hope to approximate the optimal tree. However, if we replace each |S_i| value with its maximal value k^D, we are able to show that the bound is optimized when the de... |

1141 | A universal algorithm for sequential data compression - Ziv, Lempel - 1977 |

857 | An empirical study of smoothing techniques for language modeling
- Chen, Goodman
- 1999
Citation Context: ...d-one" estimator by ẑ^{+1} and the "improved GT" by ẑ^{GT*}. The Good-Turing (GT) estimator (Good, 1953) is well known and most commonly used in language modeling for speech recognition (see, e.g., Chen and Goodman, 1996). The next-symbol probability generated by GT is ẑ^{GT}(σ|x_1^t) = (1/S_{t+1}) × { a'_1/(t×a_0), if N_σ = 0; …, otherwise }, where a'_m is a smoothed version of a_m and S_{t+1} is a normalization factor... |
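To make the a_m quantities in this excerpt concrete, here is a minimal sketch (ours; it computes only the classical Good-Turing mass a_1/t assigned to unseen symbols, not the paper's smoothed ẑ^{GT} with the a'_m smoothing and the S_{t+1} normalizer):

```python
from collections import Counter

def gt_unseen_mass(x):
    """Good-Turing estimate of the total probability of symbols not yet
    seen in x: a_1 / t, where a_1 is the number of symbols appearing
    exactly once and t = len(x).
    """
    counts = Counter(x)                 # N_sigma for each observed symbol
    a = Counter(counts.values())        # a[m] = number of symbols seen exactly m times
    return a[1] / len(x)
```

For example, in "abacdd" the symbols b and c each appear once, so the estimated unseen mass is 2/6.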

636 | On-Line Encyclopedia of Integer Sequences - Sloane |

617 |
Text Compression
- Bell, Cleary, et al.
- 1990
Citation Context: ...e lossless compression community, we examined the algorithms over the 'Calgary Corpus.' This corpus serves as a standard benchmark for testing log-loss prediction and lossless compression algorithms (Bell et al., 1990; Witten and Bell, 1991; Cleary and Teahan, 1995; Begleiter et al., 2004). The corpus consists of 18 files of nine different types. Most of the files are pure ASCII files and four are binary files. Th... |

566 | A Block – Sorting Lossless Data compression Algorithm
- Burrows, Wheeler
- 1994
Citation Context: ...algorithm's phenomenal commercial success, it is not among the best lossless compressors (see, e.g., Bell et al., 1990). Two more recent universal algorithms are the Burrows-Wheeler transform (BWT) (Burrows and Wheeler, 1994) and grammar-based compression (Yang and Kieffer, 2000). The public-domain... A similar paper by Tjalkens, Volf and Willems, proposing the same method and results, appeared a few years later (Tjalken... |

421 | Reducing multiclass to binary: a unifying approach for margin classifiers - Allwein, Schapire, et al. - 2001 |

415 |
Individual comparisons by ranking methods
- Wilcoxon
- 1945
Citation Context: ...will give rise to inferior performance. In all the experimental results below we analyzed the statistical significance of pairwise comparisons between algorithms using the Wilcoxon signed rank test (Wilcoxon, 1945) with a confidence level of 95%. Table 1 shows the average prediction performance of DECO compared to several tree structures over the text files of the Calgary Corpus. The slightly better but stat... |
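The Wilcoxon signed rank test used for these pairwise comparisons can be sketched as follows; this computes only the W statistic (the 95% significance decision would then come from a table or a normal approximation), and the function name is ours:

```python
def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic W for paired samples a, b.

    Ranks the nonzero absolute differences (average ranks for ties) and
    returns the smaller of the positive-rank and negative-rank sums;
    a small W suggests a systematic difference between the paired losses.
    """
    diffs = [x - y for x, y in zip(a, b) if x != y]   # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        # group tied absolute differences and give them their average rank
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1                          # average rank, 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = avg
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(w_plus, w_minus)
```

In a prediction benchmark, a and b would be the per-file log-losses of two algorithms over the corpus files.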

355 |
The population frequencies of species and the estimation of population parameters
- Good
- 1953
Citation Context: ...f symbols that appear exactly m times in x_1^t, i.e., a_m = |{σ ∈ Σ : N_σ = m}|. We denote the "improved add-one" estimator by ẑ^{+1} and the "improved GT" by ẑ^{GT*}. The Good-Turing (GT) estimator (Good, 1953) is well known and most commonly used in language modeling for speech recognition (see, e.g., Chen and Goodman, 1996). The next-symbol probability generated by GT is ẑ^{GT}(σ|x_1^t) = (1/S... |

249 | Aggregating strategies - Vovk - 1990 |

229 |
T.C.: The zero-frequency problem: estimating the probabilities of novel events in adaptive text compression
- Witten, Bell
- 1991
Citation Context: ...ion community, we examined the algorithms over the 'Calgary Corpus.' This corpus serves as a standard benchmark for testing log-loss prediction and lossless compression algorithms (Bell et al., 1990; Witten and Bell, 1991; Cleary and Teahan, 1995; Begleiter et al., 2004). The corpus consists of 18 files of nine different types. Most of the files are pure ASCII files and four are binary files. The ASCII files consist o... |

206 | In defense of one-vs-all classification - Rifkin, Klautau |

172 | The power of amnesia: learning probabilistic automata with variable memory length
- Ron, Singer, et al.
- 1996
Citation Context: ...ithm enjoys the tightest bound. Note that there exist sequential prediction algorithms that enjoy other types of performance guarantees. One example is the probabilistic suffix trees (PST) algorithm (Ron et al., 1996). The PST is a well-known algorithm that is mainly used in the bioinformatics community (see, e.g., Bejerano and Yona, 2001). The algorithm enjoys a PAC-like performance guarantee with respect to the... |

159 | The context-tree weighting method: basic properties
- Willems, Shtarkov, et al.
- 1995
Citation Context: ...z_s(y_i) and for the empty sequence z_s(ε) = 1. Thus, we can rewrite Equation (2) as P_S(x_1^n) = ∏_{s∈S} z_s(x_1^n). (3) [2.2 The Context-Tree Weighting Method] Here we describe the CTW prediction algorithm (Willems et al., 1995), originally presented as a lossless compression algorithm. The goal of the CTW algorithm is to predict a sequence (nearly) as well as the best tree-source. This goal can be divided into two su... |
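To make Equation (3) and the weighting concrete, here is a small illustrative block-probability sketch for binary sequences (function name ours), assuming the standard CTW recursion P_w(s) = ½KT(s) + ½P_w(0s)P_w(1s) with a KT estimator at each context node, and treating the first D bits as initial context only:

```python
import math

def ctw_prob(bits, D):
    """Block CTW probability of a binary string with context depth D.

    Each context s (|s| <= D) is scored by the KT estimator on the bits
    that follow it; the mixture is the recursion
        P_w(s) = KT(s)                                    if |s| == D,
        P_w(s) = (KT(s) + P_w('0'+s) * P_w('1'+s)) / 2    otherwise.
    """
    def kt(a, b):
        # KT block probability of a zeros and b ones (order-independent).
        num = 1.0
        for i in range(a):
            num *= i + 0.5
        for j in range(b):
            num *= j + 0.5
        return num / math.factorial(a + b)

    # counts[s] = [#zeros following context s, #ones following context s]
    counts = {}
    for t in range(D, len(bits)):
        ctx = bits[t - D:t]
        for l in range(D + 1):
            s = ctx[D - l:]            # the l bits immediately preceding position t
            counts.setdefault(s, [0, 0])[int(bits[t])] += 1

    def pw(s):
        a, b = counts.get(s, (0, 0))
        e = kt(a, b)
        if len(s) == D:
            return e
        return 0.5 * e + 0.5 * pw("0" + s) * pw("1" + s)

    return pw("")
```

For "0101" with D = 1 the mixture at the root combines the zero-order KT score with the product of the two depth-1 contexts, yielding probability 1/8.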

135 | Universal prediction
- Merhav, Feder
- 1998
Citation Context: ...nd therefore should mix the predictions associated with two... As mentioned above, any lossless compression algorithm can be translated into a sequence prediction algorithm and vice versa (see, e.g., Merhav and Feder, 1998). ...topologies: S_0 = {ε} (where ε is the empty sequence), and S_1 = {0,1}. Note that S_0 corresponds to the zero-order topology... |

130 |
Trofimov, “The performance of universal encoding
- Krichevsky, K
- 1981
Citation Context: ...The high practical performance of the original CTW algorithm is most apparent when applied to binary prediction problems, in which case it uses the well-known (binary) KT-estimator (Krichevsky and Trofimov, 1981). When the algorithm is applied to non-binary prediction/compression problems (using the multi-alphabet KT-estimator), its empirical performance is mediocre compared to the best known results (Tjalke... |
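The KT-estimator mentioned here has a simple sequential form, assigning the next symbol σ probability (N_σ + 1/2)/(t + k/2). A sketch, assuming the standard add-1/2 rule (function name ours):

```python
import math

def kt_log_prob(x, alphabet):
    """log2 of the sequential KT probability of x over the given alphabet:

        P(x_{t+1} = s | x_1^t) = (N_s + 1/2) / (t + k/2),

    where N_s counts occurrences of s among the first t symbols and
    k is the alphabet size.
    """
    k = len(alphabet)
    counts = dict.fromkeys(alphabet, 0)
    logp = 0.0
    for t, sym in enumerate(x):
        logp += math.log2((counts[sym] + 0.5) / (t + k / 2))
        counts[sym] += 1
    return logp
```

For binary x = "01" this gives log2(1/2 · 1/4) = −3 bits; per the redundancy bound quoted elsewhere on this page, the KT code length exceeds that of the best zero-order distribution by at most roughly ((k−1)/2) log n + log k bits.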

126 | Universal sequential coding of single messages - Shtarkov - 1987 |

84 |
A Philosophical Essay on Probabilities
- Laplace
- 1951
Citation Context: ...x_1^n ∈ Σ^n, R_KT(x_1^n) = log sup_{z∈Z} z(x_1^n) − log ẑ^{KT}(x_1^n) ≤ ((k−1)/2) log n + log k. (12) Another famous add-constant predictor is the add-one predictor, also called Laplace's law of succession (Laplace, 1995). Remark 8: Krichevsky and Trofimov (1981) originally defined KT to be a mixture of all zero-order distributions in Z, weighted by the Dirichlet(1/2) distribution. Thus, th... |

81 |
Statistical learning theory and stochastic optimization, ser
- Catoni
Citation Context: ...n the redundancy of the KT estimator is a key element in the proof of the following theorem, providing a finite-sample point-wise redundancy bound for the multi-CTW (see, e.g., Tjalkens et al., 1993; Catoni, 2004). Theorem 9 (Willems et al.) Let Σ be any alphabet with |Σ| = k ≥ 2. For any sequence x_1^n ∈ Σ^n and any D-bounded tree-source with a topology S and distribution P_S, the following holds: Proof R_CTW(... |

76 | Alan Turing: The Enigma - Hodges - 2012 |

74 | Sequential prediction of individual sequences under general loss functions
- Haussler, Kivinen, et al.
- 1998
Citation Context: ...-advice scheme (see, e.g., Merhav and Feder, 1998; Helmbold and Schapire, 1997). It can be shown that these two algorithms are identical when Vovk's algorithm is applied with the log-loss (see, e.g., Haussler et al., 1998, Example 3.12). In this case, the set of experts in Vovk's algorithm consists of all D-bounded tree-sources, C_D; the initial weight of each expert, S, corresponds to its complexity |T_S|; and the weig... |

69 | Predicting nearly as well as the best pruning of a decision tree
- Helmbold, Schapire
- 1997
Citation Context: ...hm of Vovk (1990). This observation is new, to the best of our knowledge, although there are citations that connect the CTW algorithm with the expert-advice scheme (see, e.g., Merhav and Feder, 1998; Helmbold and Schapire, 1997). It can be shown that these two algorithms are identical when Vovk's algorithm is applied with the log-loss (see, e.g., Haussler et al., 1998, Example 3.12). In this case, the set of experts in Vovk... |

58 | Variations on probabilistic suffix trees: Statistical modeling and the prediction of protein families - Bejerano, Yona - 2001 |

56 | On prediction using variable order Markov models - Begleiter, El-Yaniv, et al. - 2004 |

55 | Asymptotic minimax regret for data compression, gambling, and prediction - Xie, Barron - 1996 |

48 |
Narrative analysis
- Riesman, C
- 1993
Citation Context: ...e w(dz) is the Dirichlet distribution with parameter 1/2 defined by w(dz) = (1/√k) · (Γ(k/2)/Γ(1/2)^k) ∏_{i=1}^k z(i)^{−1/2} λ(dz), (13) where Γ(x) = ∫_{ℝ₊} t^{x−1} exp(−t) dt is the gamma function (see, for example, Courant and John, 1989), and λ(·) is a measure on Z. Shtarkov (1987) was the first to show that this mixture can be calculated sequentially as in Definition 6. The upper bound of Theorem 7 on the redundancy of the KT estim... |

42 | Always Good Turing: Asymptotically optimal probability estimation - Orlitsky, Santhanam, et al. - 2003 |

39 | Predicting a binary sequence almost as well as the optimal biased coin - Freund - 1996 |

38 | Universal lossless source coding with the burrows wheeler transform - Effros, Visweswariah, et al. - 2002 |

36 | On the convergence rate of good-turing estimators
- McAllester, Schapire
- 2000
Citation Context: ...ring the per-symbol ratio, KT is asymptotically optimal. Along with the above worst-case guarantees, the Good-Turing estimator also has a convergence guarantee to the "true" missing mass probability (McAllester and Schapire, 2000), assuming the existence of a true underlying distribution that generated the sequence. In Tables 3 and 4 we provide the respective per-symbol log-loss obtained with these estimators for all the text... |

36 |
PPM: One step to practicality
- Shkarin
- 2002
Citation Context: ...Tjalkens et al. (1994) and further developed by Volf (2002), does achieve state-of-the-art compression and prediction performance on standard benchmarks (see, e.g., Volf, 2002; Sadakane et al., 2000; Shkarin, 2002; Begleiter et al., 2004). In this approach the multi-alphabet problem is hierarchically decomposed into a number of binary prediction problems. We term the resulting procedure "the DECO algorithm." V... |

36 | The context-tree weighting method: extensions - Willems - 1998 |

26 | Redundancy of the Lempel-Ziv incremental parsing rule - Savari - 1997 |

21 | Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform— Part one: Without context models - Yang, Kieffer - 2000 |

18 | Markov types and minimax redundancy for Markov sources - Jacquet, Szpankowski - 2004 |

14 | Implementing the context tree weighting method for text compression
- Sadakane, Okazaki, et al.
- 2000
Citation Context: ...euristic, suggested by Tjalkens et al. (1994) and further developed by Volf (2002), does achieve state-of-the-art compression and prediction performance on standard benchmarks (see, e.g., Volf, 2002; Sadakane et al., 2000; Shkarin, 2002; Begleiter et al., 2004). In this approach the multi-alphabet problem is hierarchically decomposed into a number of binary prediction problems. We term the resulting procedure "the DEC... |

13 |
Weighting Techniques in Data Compression Theory and Algorithms
- Volf
- 2002
Citation Context: ...omposition heuristic, suggested by Tjalkens et al. (1994) and further developed by Volf (2002), does achieve state-of-the-art compression and prediction performance on standard benchmarks (see, e.g., Volf, 2002; Sadakane et al., 2000; Shkarin, 2002; Begleiter et al., 2004). In this approach the multi-alphabet problem is hierarchically decomposed into a number of binary prediction problems. We term the resul... |

10 | Experiments on the zero frequency problem
- Cleary, Teahan
- 1995
Citation Context: ...ned the algorithms over the 'Calgary Corpus.' This corpus serves as a standard benchmark for testing log-loss prediction and lossless compression algorithms (Bell et al., 1990; Witten and Bell, 1991; Cleary and Teahan, 1995; Begleiter et al., 2004). The corpus consists of 18 files of nine different types. Most of the files are pure ASCII files and four are binary files. The ASCII files consist of English texts (books 1-... |

9 | On the Optimality of Huffman Trees - Glassey, Karp - 1976 |

7 |
Support vector machines with binary tree architecture for multi-class classification
- Cheong, Oh, et al.
- 2004
Citation Context: ...nd 'all-pairs') commonly used in supervised learning. This result may further motivate the consideration of hierarchical decompositions in supervised learning (e.g., as suggested by Huo et al., 2002; Cheong et al., 2004; El-Yaniv and Etzion-Rosenberg, 2004). The fact that the other zero-order estimators can improve the multi-CTW performance (with larger alphabets) motivates further research along these lines. First,... |

7 |
A context-tree weighting method for text generating sources
- Tjalkens, Volf, et al.
- 1997
Citation Context: ...1981). When the algorithm is applied to non-binary prediction/compression problems (using the multi-alphabet KT-estimator), its empirical performance is mediocre compared to the best known results (Tjalkens et al., 1997). Nevertheless, a clever alphabet decomposition heuristic, suggested by Tjalkens et al. (1994) and further developed by Volf (2002), does achieve state-of-the-art compression and prediction performan... |

6 |
Context tree weighting: Multi-alphabet sources
- Tjalkens, Shtarkov, et al.
- 1993
Citation Context: ...er bound of Theorem 7 on the redundancy of the KT estimator is a key element in the proof of the following theorem, providing a finite-sample point-wise redundancy bound for the multi-CTW (see, e.g., Tjalkens et al., 1993; Catoni, 2004). Theorem 9 (Willems et al.) Let Σ be any alphabet with |Σ| = k ≥ 2. For any sequence x_1^n ∈ Σ^n and any D-bounded tree-source with a topology S and distribution P_S, the following holds... |

2 |
Hierarchical multiclass decompositions with application to authorship determination
- El-Yaniv, Etzion-Rosenberg
- 2004
Citation Context: ...nly used in supervised learning. This result may further motivate the consideration of hierarchical decompositions in supervised learning (e.g., as suggested by Huo et al., 2002; Cheong et al., 2004; El-Yaniv and Etzion-Rosenberg, 2004). The fact that the other zero-order estimators can improve the multi-CTW performance (with larger alphabets) motivates further research along these lines. First, it would be interesting to try combi... |

2 | Support vector trees: Simultaneously realizing the principles of maximal margin and maximal purity - BEGLEITER, Huo, et al. - 2002 |

1 | In Orlitsky et al. (2003) the ẑ^{+1} estimator is denoted by q+1′ and ẑ^{GT*} by q^{1/3} |

1 | Turing used this estimator to break the Enigma cipher (Hodges, 2000) during World War II. Orlitsky et al. mention that Turing had an intuitive motivation for this estimator; unfortunately, this explanation was never published - Good |

1 | A simple technique for bounding the pointwise redundancy of the 1978 Lempel-Ziv algorithm - Kieffer, Yang - 1999 |

1 | Redundancy estimates for the Lempel-Ziv algorithm of data compression - Potapov |