Results 1 - 10
of
19
MDL-based Decision Tree Pruning
, 1995
"... This paper explores the application of the Minimum Description Length principle for pruning decision trees. We present a new algorithm that intuitively captures the primary goal of reducing the misclassification error. An experimental comparison is presented with three other pruning algorithms. The ..."
Abstract
-
Cited by 54 (1 self)
- Add to MetaCart
This paper explores the application of the Minimum Description Length principle for pruning decision trees. We present a new algorithm that intuitively captures the primary goal of reducing the misclassification error. An experimental comparison is presented with three other pruning algorithms. The results show that the MDL pruning algorithm achieves good accuracy, small trees, and fast execution times. Introduction Construction or "induction" of decision trees from examples has been the subject of extensive research in the past [Breiman et. al. 84, Quinlan 86]. It is typically performed in two steps. First, training data is used to grow a decision tree. Then in the second step, called pruning, the tree is reduced to prevent "overfitting". There are two broad classes of pruning algorithms. The first class includes algorithms like cost-complexity pruning [Breiman et. al., 84], that use a separate set of samples for pruning, distinct from the set used to grow the tree. In many cases, ...
Efficient Bayesian Parameter Estimation in Large Discrete Domains
- Advances in Neural Information Processing Systems
, 1999
"... In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assum ..."
Abstract
-
Cited by 27 (1 self)
- Add to MetaCart
In this paper we examine the problem of estimating the parameters of a multinomial distribution over a large number of discrete outcomes, most of which do not appear in the training data. We analyze this problem from a Bayesian perspective and develop a hierarchical prior that incorporates the assumption that the observed outcomes constitute only a small subset of the possible outcomes. We show how to efficiently perform exact inference with this form of hierarchical prior and compare our method to standard approaches and demonstrate its merits. Category: Algorithms and Architectures Presentation preference: none This paper was not submitted elsewhere nor will be submitted during NIPS review period. 1 Introduction One of the most important problems in statistical inference is multinomialestimation: Given a past history of observations independent trials with a discrete set of outcomes, predict the probability of the next trial. Such estimators are the basic building blocks in mor...
Switching Portfolios
- International Journal of Neural Systems
, 1998
"... Recently, there has been work on on-line portfolio selection algorithms which are competitive with the best constant rebalanced portfolio determined in hindsight [2, 6, 3]. By their nature, these algorithms employ the assumption that high yield returns can be achieved using a fixed asset allocation ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
Recently, there has been work on on-line portfolio selection algorithms which are competitive with the best constant rebalanced portfolio determined in hindsight [2, 6, 3]. By their nature, these algorithms employ the assumption that high yield returns can be achieved using a fixed asset allocation strategy. However, stock markets are far from being stationary and in many cases the return of a constant rebalanced portfolio is much smaller than the return of an ad-hoc investment strategy that adapts to changes in the market. In this paper we present an efficient portfolio selection algorithm that is able to track a changing market. We also describe a simple extension of the algorithm for the case of a general transaction cost, including a fixed percentage transaction cost which was recently investigated [1]. We provide a simple analysis of the competitiveness of the algorithm and check its performance on real stock data from the New York Stock Exchange accumulated during a 22-year perio...
An Efficient Extension to Mixture Techniques for Prediction and Decision Trees
- Machine Learning
, 1999
"... We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "node-based" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The ..."
Abstract
-
Cited by 23 (4 self)
- Add to MetaCart
We present an e#cient method for maintaining mixtures of prunings of a prediction or decision tree that extends the previous methods for "node-based" prunings (Buntine, 1990; Willems, Shtarkov, & Tjalkens, 1995; Helmbold & Schapire, 1997; Singer, 1997) to the larger class of edge-based prunings. The method includes an online weight-allocation algorithm that can be used for prediction, compression and classification. Although the set of edge-based prunings of a given tree is much larger than that of node-based prunings, our algorithm has similar space and time complexity to that of previous mixture algorithms for trees. Using the general online framework of Freund & Schapire (1997), we prove that our algorithm maintains correctly the mixture weights for edge-based prunings with any bounded loss function. We also give a similar algorithm for the logarithmic loss function with a corresponding weight-allocation algorithm. Finally, we describe experiments comparing node-based and edge-based mixture models for estimating the probability of the next word in English text, which show the advantages of edge-based models. Keywords: mixture models, decision and prediction trees, on-line learning, statistical language modeling 1.
Precise Minimax Redundancy and Regret
- IEEE TRANS. INFORMATION THEORY
, 2004
"... Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal xed{to{variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario ..."
Abstract
-
Cited by 19 (8 self)
- Add to MetaCart
Recent years have seen a resurgence of interest in redundancy of lossless coding. The redundancy (regret) of universal xed{to{variable length coding for a class of sources determines by how much the actual code length exceeds the optimal (ideal over the class) code length. In a minimax scenario one nds the best code for the worst source either in the worst case (called also maximal minimax) or on average. We rst study the worst case minimax redundancy over a class of stationary ergodic sources and replace Shtarkov's bound by an exact formula. Among others, we prove that a generalized Shannon code minimizes the worst case redundancy, derive asymptotically its redundancy, and establish some general properties. This allows us to obtain precise redundancy rates for memoryless, Markov and renewal sources. For example, we derive the exact constant of the redundancy rate for memoryless and Markov sources by showing that an integer nature of coding contributes log(log m=(m 1))= log m+ o(1) where m is the size of the alphabet. Then we deal with the average minimax redundancy and regret. Our approach
Universal compression of memoryless sources over unknown alphabets
- IEEE TRANSACTIONS ON INFORMATION THEORY
, 2004
"... It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbol ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
It has long been known that the compression redundancy of independent and identically distributed (i.i.d.) strings increases to infinity as the alphabet size grows. It is also apparent that any string can be described by separately conveying its symbols, and its pattern—the order in which the symbols appear. Concentrating on the latter, we show that the patterns of i.i.d. strings over all, including infinite and even unknown, alphabets, can be compressed with diminishing redundancy, both in block and sequentially, and that the compression can be performed in linear time. To establish these results, we show that the number of patterns is the Bell number, that the number of patterns with a given number of symbols is the Stirling number of the second kind, and that the redundancy of patterns can be bounded using results of Hardy and Ramanujan on the number of integer partitions. The results also imply an asymptotically optimal solution for the Good-Turing probability-estimation problem.
Pointwise Redundancy in Lossy Data Compression and Universal Lossy Data Compression
- IEEE Trans. Inform. Theory
, 1999
"... We characterize the achievable pointwise redundancy rates for lossy data compression at a fixed distortion level. "Pointwise redundancy" refers to the difference between the description length achieved by an nth-order block code and the optimal nR(D) bits. For memoryless sources, we show that the be ..."
Abstract
-
Cited by 15 (10 self)
- Add to MetaCart
We characterize the achievable pointwise redundancy rates for lossy data compression at a fixed distortion level. "Pointwise redundancy" refers to the difference between the description length achieved by an nth-order block code and the optimal nR(D) bits. For memoryless sources, we show that the best achievable redundancy rate is of order O( p n) in probability. This follows from a second-order refinement to the classical source coding theorem, in the form of a "one-sided central limit theorem." Moreover, we show that, along (almost) any source realization, the description lengths of any sequence of block codes operating at distortion level D exceed nR(D) by at least as much as C p n log log n, infinitely often. Corresponding direct coding theorems are also given, showing that these rates are essentially achievable. The above rates are in sharp contrast with the expected redundancy rates of order O(log n) recently reported by various authors. Our approach is based on showing that...
Second-order noiseless source coding theorems
- IEEE Trans. Inform. Theory
, 1997
"... Abstract—Shannon’s celebrated source coding theorem can be viewed as a “one-sided law of large numbers. ” We formulate second-order noiseless source coding theorems for the deviation of the codeword lengths from the entropy. For a class of sources that includes Markov chains we prove a “one-sided ce ..."
Abstract
-
Cited by 11 (7 self)
- Add to MetaCart
Abstract—Shannon’s celebrated source coding theorem can be viewed as a “one-sided law of large numbers. ” We formulate second-order noiseless source coding theorems for the deviation of the codeword lengths from the entropy. For a class of sources that includes Markov chains we prove a “one-sided central limit theorem ” and a law of the iterated logarithm. Index Terms—Coding variance, convergence rates, source coding theorems. I.
Markov Types and Minimax Redundancy for Markov Sources
- IEEE Trans. Information Theory
, 2003
"... Redundancy of universal codes for a class of sources determines by how much the actual code length exceeds the optimal code length. In the minimax scenario one designs the best code for the worst source within the class. Such minimax redundancy comes in two flavors: either on average or for individu ..."
Abstract
-
Cited by 11 (6 self)
- Add to MetaCart
Redundancy of universal codes for a class of sources determines by how much the actual code length exceeds the optimal code length. In the minimax scenario one designs the best code for the worst source within the class. Such minimax redundancy comes in two flavors: either on average or for individual sequences. The latter is also known as the maximal or the worst case minimax redundancy. We study the maximal minimax redundancy of universal block codes for Markovian sources of any order. We prove that the maximal minimax redundancy for Markov sources of order r is asymptotically equal to 1) log 2 n + log 2 A (ln ln m 1/(m-1) )/ ln m + o(1), where n is the length of a source sequence, m is the size of the alphabet and A m is an explicit constant (e.g., we find that for a binary alphabet m = 2 and Markov of order r = 1 the constant 14.655449504 where G is the Catalan number). Unlike previous attempts, we view the redundancy problem as an asymptotic evaluation of certain sums over a set of matrices representing Markov types. The enumeration of Markov types is accomplished by reducing it to counting Eulerian paths in a multigraph. In particular, we propose an asymptotic formula for the number of strings of a given Markov type. All of these findings are obtained by analytic and combinatorial tools of analysis of algorithms. Index terms: Minimax redundancy, Markov sources, Markov types, Eulerian paths, multidimensional generating functions, analytic information theory. # A preliminary version of this paper was presented at Colloquium on Mathematics and Computer Science: Algorithms, Trees, Combinatorics and Probabilities, Versailles, 2002.
Precise Average Redundancy of an Idealized Arithmetic Coding
- Coding, Proc. Data Compression Conference, 222-231, Snowbird
, 2002
"... Redundancy is defined as the excess of the code length over the optimal (ideal) code length. We study the average redundancy of an idealized arithmetic coding (for memoryless sources with unknown distributions) in which the Krichevsky and Tro mov estimator is followed by the Shannon-Fano code. We sh ..."
Abstract
-
Cited by 10 (6 self)
- Add to MetaCart
Redundancy is defined as the excess of the code length over the optimal (ideal) code length. We study the average redundancy of an idealized arithmetic coding (for memoryless sources with unknown distributions) in which the Krichevsky and Tro mov estimator is followed by the Shannon-Fano code. We shall ignore here important practical implementation issues such as finite precisions and finite buffer sizes. In fact, our idealized arithmetic code can be viewed as an adaptive infinite precision implementation of arithmetic encoder that resembles Elias coding. However, we provide very precise results for the average redundancy that takes into account integer-length constraints. These findings are obtained by analytic methods of analysis of algorithms such as theory of distribution of sequences modulo 1 and Fourier series. These estimates can be used to study the average redundancy of codes for tree sources, and ultimately the context-tree weighting algorithms.

