Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
- In
, 2002
Cited by 5 (1 self)
The Bayesian framework is ideally suited for induction problems. The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be computed with Bayes' rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. The problem, however, is that in many cases one does not even have a reasonable guess of the true distribution. In order to overcome this problem a universal (or mixture) distribution $\xi$ is defined as a weighted sum or integral of distributions $\nu\!\in\!\M$, where $\M$ is any countable or continuous set of distributions including $\mu$. This is a generalization of Solomonoff induction, in which $\M$ is the set of all enumerable semi-measures. It is shown for several performance measures that using the universal $\xi$ as a prior is nearly as good as using the unknown true distribution $\mu$. In a sense, this solves the problem of the unknown prior in a universal way. All results are obtained for general finite alphabet. Convergence of $\xi$ to $\mu$ in a conditional mean squared sense and of $\xi/\mu\to 1$ with $\mu$ probability $1$ is proven. The number of additional errors $E_\xi$ made by the optimal universal prediction scheme based on $\xi$ minus the number of errors $E_\mu$ of the optimal informed prediction scheme based on $\mu$ is proven to be bounded by $O(\sqrt{E_\mu})$. The prediction framework is generalized to arbitrary loss functions. A system is allowed to take an action $y_t$, given $x_1...x_{t-1}$ and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions on $\ell$ are necessary, besides boundedness. Optimal universal $\Lambda_\xi$ and optimal informed $\Lambda_\mu$ prediction schemes are defined and the total loss of $\Lambda_\xi$ is bounded in terms of the total loss of $\Lambda_\mu$, similar to the error bounds. We show that the bounds are tight and that no other predictor can lead to smaller bounds.
Furthermore, for various performance measures we show Pareto-optimality of $\xi$ in the sense that there is no other predictor which performs better or equal in all environments $\nu\in\M$ and strictly better in at least one. So, optimal predictors can (w.r.t. most performance measures in expectation) be based on the mixture $\xi$. Finally we give an Occam's razor argument that Solomonoff's choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the length of the shortest program describing $\nu$. Furthermore, games of chance, defined as a sequence of bets, observations, and rewards, are studied. The average profit achieved by the $\Lambda_\xi$ scheme rapidly converges to the best possible profit. The time needed to reach the winning zone is proportional to the relative entropy of $\mu$ and $\xi$. The prediction schemes presented here are compared to the weighted majority algorithm(s). Although the algorithms, the settings, and the proofs are quite different, the bounds of both schemes have a very similar structure. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.
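As a rough illustration of the mixture idea in the abstract above, the following sketch builds $\xi$ as a weighted sum over a small finite class of Bernoulli distributions and updates the weights with Bayes' rule. The finite class, the binary alphabet, the uniform prior weights, and the function names are all assumptions made for this example; the paper itself treats general countable or continuous classes and general finite alphabets.

```python
def mixture_predict(thetas, weights):
    """Probability that the next symbol is 1 under the mixture xi.

    thetas:  Bernoulli parameters, one per candidate distribution nu (assumed class M)
    weights: current posterior weights w_nu, summing to 1
    """
    return sum(w * th for w, th in zip(weights, thetas))


def update_weights(thetas, weights, x):
    """Bayes' rule: multiply each weight by the likelihood of x, then renormalise."""
    likelihoods = [th if x == 1 else 1.0 - th for th in thetas]
    posterior = [w * lk for w, lk in zip(weights, likelihoods)]
    total = sum(posterior)
    return [p / total for p in posterior]


def run_mixture(thetas, weights, xs):
    """Predict each symbol of xs before seeing it; return predictions and final weights."""
    predictions = []
    for x in xs:
        predictions.append(mixture_predict(thetas, weights))
        weights = update_weights(thetas, weights, x)
    return predictions, weights


# Hypothetical data: a sequence whose frequency of ones is 0.8,
# and a class M that happens to contain the matching parameter.
thetas = [0.2, 0.5, 0.8]
prior = [1.0 / 3.0] * 3
xs = [1, 1, 1, 1, 0] * 20
predictions, posterior = run_mixture(thetas, prior, xs)
```

On this toy sequence the posterior weight concentrates on the parameter 0.8, so the mixture's prediction approaches that of the best distribution in the class, which is the qualitative behaviour the convergence results describe.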
Linear Classification and Selective Sampling Under Low Noise Conditions
Cited by 4 (2 self)
We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate $N^{-(1+\alpha)(2+\alpha)/2(3+\alpha)}$, where $N$ denotes the number of queried labels and $\alpha > 0$ is the exponent in the low noise condition. For all $\alpha > \sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the same classifier, which queries all labels, and for $\alpha \to \infty$ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.
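The threshold $\sqrt{3} - 1$ quoted in this abstract can be checked directly from the two exponents. The short derivation below is a sketch of that check, not taken from the paper; it only assumes $\alpha > 0$, so that all factors being cancelled are positive.

```latex
\[
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
\;\iff\; (2+\alpha)^2 > 2(3+\alpha)
\;\iff\; \alpha^2 + 2\alpha - 2 > 0
\;\iff\; \alpha > \sqrt{3} - 1 \approx 0.73 .
\]
```

As $\alpha \to \infty$ the selective-sampling exponent grows like $\alpha/2$ while the fully supervised exponent tends to $1$, which is consistent with the gap between the two rates widening without bound.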
Robust selective sampling from single and multiple teachers (Technical Report). Microsoft Research, Università dell’Insubria, TTI
, 2010
Cited by 4 (0 self)
We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Using a simple online-to-batch conversion technique, our selective sampling algorithm can be converted into a statistical (pool-based) active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label.
Competing with stationary prediction strategies
, 2006
Cited by 3 (3 self)
In this paper we introduce the class of stationary prediction strategies and construct a prediction algorithm that asymptotically performs as well as the best continuous stationary strategy. We make mild compactness assumptions but no stochastic assumptions about the environment. In particular, no assumption of stationarity is made about the environment, and the stationarity of the considered strategies only means that they do not depend explicitly on time; we argue that it is natural to consider only stationary strategies even for highly non-stationary environments.
Relative loss bounds and polynomial-time predictions for the K-LMS-NET algorithm
- Proc. of the 15-th Int. Conference on Algorithmic Learning Theory
, 2004
Cited by 2 (1 self)
We consider a two-layer network algorithm. The first layer consists of an uncountable number of linear units. Each linear unit is an LMS algorithm whose inputs are first “kernelized.” Each unit is indexed by the value of a parameter corresponding to a parameterized reproducing kernel. The first-layer outputs are then connected to an exponential weights algorithm which combines them to produce the final output. We give loss bounds for this algorithm, and for specific applications to prediction relative to the best convex combination of kernels and the best width of a Gaussian kernel. The algorithm’s predictions require the computation of an expectation which is a quotient of integrals, as seen in a variety of Bayesian inference problems. Typically this computational problem is tackled by MCMC, importance sampling, and other sampling techniques for which there are few polynomial time guarantees of the quality of the approximation in general and none for our problem specifically. We develop a novel deterministic polynomial time approximation scheme for the computations of expectations considered in this paper.
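To make the second layer more concrete, here is a minimal sketch of an exponential weights combiner over a finite set of first-layer outputs under square loss. It is a simplification chosen for illustration: the paper's first layer is uncountable and its predictions involve a quotient of integrals, whereas the finite expert list, the learning rate `eta`, and all function names below are assumptions of this example.

```python
import math


def exp_weights_predict(weights, unit_outputs):
    """Weighted average of the first-layer outputs under the current weights."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, unit_outputs)) / total


def exp_weights_update(weights, unit_outputs, y, eta):
    """Down-weight each unit exponentially in its square loss on the outcome y."""
    updated = [w * math.exp(-eta * (p - y) ** 2)
               for w, p in zip(weights, unit_outputs)]
    total = sum(updated)
    return [w / total for w in updated]  # renormalise to avoid underflow


def run_exp_weights(unit_sequences, ys, eta=0.5):
    """Combine several prediction sequences online; return outputs and final weights."""
    weights = [1.0 / len(unit_sequences)] * len(unit_sequences)
    outputs = []
    for t, y in enumerate(ys):
        unit_outputs = [seq[t] for seq in unit_sequences]
        outputs.append(exp_weights_predict(weights, unit_outputs))
        weights = exp_weights_update(weights, unit_outputs, y, eta)
    return outputs, weights


# Hypothetical data: one unit always predicts the outcome, the other never does.
ys = [1.0] * 20
unit_sequences = [[1.0] * 20, [0.0] * 20]
outputs, final_weights = run_exp_weights(unit_sequences, ys, eta=0.5)
```

In this toy run the combiner starts at the uninformed average and shifts almost all weight to the accurate unit within a few rounds.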
Sequential optimization through adaptive design of experiments. Engineering Systems Division
- MIT, PhD: 118, Cambridge,MA
, 2007
Cited by 2 (0 self)
This thesis considers the problem of achieving better system performance through adaptive experiments. For the case of discrete design space, I propose an adaptive One-Factor-at-A-Time (OFAT) experimental design, study its properties and compare its performance to saturated fractional factorial designs. The rationale for adopting the adaptive OFAT design scheme becomes clear if it is embedded in a Bayesian framework: OFAT is an efficient response to step by step accrual of sample information. The Bayesian predictive distribution for the outcome obtained by implementing OFAT, and the corresponding principal moments when a natural conjugate prior is assigned to parameters that are not known with certainty, are also derived. For the case of compact design space, I expand the treatment of OFAT by the
Shannon Information and Kolmogorov Complexity
, 2010
Cited by 1 (1 self)
The elementary theories of Shannon information and Kolmogorov complexity are compared, along with the extent to which they have a common purpose and where they are fundamentally different. The focus is on: Shannon entropy versus Kolmogorov complexity, the relation of both to universal coding, Shannon mutual information versus Kolmogorov (‘algorithmic’) mutual information, probabilistic sufficient statistic versus algorithmic sufficient statistic (related to lossy compression in the Shannon theory versus meaningful information in the Kolmogorov theory), and rate distortion theory versus Kolmogorov’s structure function. Part of the material has appeared in print before, scattered through various publications, but
Prediction with Expert Advice for the Brier Game
We show that the Brier game of prediction is mixable and find the optimal learning rate and substitution function for it. The resulting prediction algorithm is applied to predict results of football and tennis matches. The theoretical performance guarantee turns out to be rather tight on these data sets, especially in the case of the more extensive tennis data.
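For readers unfamiliar with the loss used in the Brier game, the sketch below computes the standard multi-outcome Brier loss of a probability forecast. The abstract does not spell out the loss or the aggregation algorithm, so the definition used here (sum of squared differences between the forecast and the indicator of the realised outcome), the three-outcome example, and the function names are assumptions of this illustration; the paper's learning rate and substitution function are not reproduced.

```python
def brier_loss(probs, outcome):
    """Brier loss of a forecast.

    probs:   predicted probabilities, one per possible outcome
    outcome: index of the outcome that actually occurred
    """
    return sum((p - (1.0 if i == outcome else 0.0)) ** 2
               for i, p in enumerate(probs))


def cumulative_brier_loss(forecasts, outcomes):
    """Total Brier loss of a sequence of forecasts against realised outcomes."""
    return sum(brier_loss(f, o) for f, o in zip(forecasts, outcomes))


# Hypothetical three-outcome match (home win, draw, away win), home win occurs.
example_loss = brier_loss([0.5, 0.3, 0.2], 0)
```

A perfect forecast has loss 0 and a fully confident wrong forecast has loss 2, so per-round losses are bounded, which is what makes cumulative-loss guarantees against the best expert meaningful.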
A Lower bound on the Performance of Sequential Prediction
We consider the problem of sequential linear prediction of real-valued sequences under the square-error loss function. For this problem, a prediction algorithm has been demonstrated [1][2] whose accumulated squared prediction error, for every bounded sequence, is asymptotically as small as that of the best fixed linear predictor for that sequence, taken from the class of all linear predictors of a given order $p$. The redundancy, or excess prediction error above that of the best predictor for that sequence, is upper bounded by $A^2 p \ln(n)/n$, where $n$ is the data length and the sequence is assumed to be bounded by some $A$. In this paper, we show that this predictor is optimal in a min-max sense, by deriving a corresponding lower bound, such that no sequential predictor can ever do better than a redundancy of $A^2 p \ln(n)/n$.

