General Loss Bounds for Universal Sequence Prediction
, 2001
Cited by 14 (9 self)
The Bayesian framework is ideally suited for induction problems. The probability of observing $x_k$ at time $k$, given past observations $x_1...x_{k-1}$, can be computed with Bayes' rule if the true distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. The problem, however, is that in many cases one does not even have a reasonable estimate of the true distribution. In order to overcome this problem a universal distribution $\xi$ is defined as a weighted sum of distributions $\mu_i\in M$, where $M$ is any countable set of distributions including $\mu$. This is a generalization of Solomonoff induction, in which $M$ is the set of all enumerable semimeasures. Systems which predict $y_k$, given $x_1...x_{k-1}$, and which receive loss $l_{x_k y_k}$ if $x_k$ is the true next symbol of the sequence are considered. It is proven that using the universal $\xi$ as a prior is nearly as good as using the unknown true distribution $\mu$. Furthermore, games of chance, defined as a sequence of bets, observations, and rewards, are studied. The time needed to reach the winning zone is estimated. Extensions to arbitrary alphabets, partial and delayed prediction, and more active systems are discussed.
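To make the mixture idea concrete, here is a minimal sketch of a Bayes mixture $\xi$ over a small finite class of Bernoulli distributions. The class `thetas`, the uniform prior `weights`, and the fixed binary `history` are illustrative assumptions, not taken from the paper; the paper's setting allows any countable class.

```python
def mixture_predict(thetas, weights, history):
    """Return (xi(next symbol = 1 | history), posterior weights)."""
    # Posterior weight of each model: prior weight times likelihood of the history.
    posts = []
    for theta, w in zip(thetas, weights):
        like = 1.0
        for x in history:
            like *= theta if x == 1 else 1.0 - theta
        posts.append(w * like)
    total = sum(posts)
    posts = [p / total for p in posts]
    # Mixture prediction: posterior-weighted average of the model predictions.
    p_next = sum(p * theta for p, theta in zip(posts, thetas))
    return p_next, posts

# Hypothetical model class M (Bernoulli parameters) with a uniform prior.
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]
weights = [0.2] * 5
# A fixed sequence with 70% ones, standing in for samples from mu = Bernoulli(0.7).
history = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0] * 30

p_next, posterior = mixture_predict(thetas, weights, history)
print(round(p_next, 3), [round(p, 3) for p in posterior])
```

On this sequence the posterior mass concentrates on the model with parameter 0.7, so the mixture's prediction approaches the one that the true distribution would give.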
Defensive Forecasting
Cited by 13 (12 self)
We consider how to make probability forecasts of binary labels. Our main mathematical result is that for any continuous gambling strategy used for detecting disagreement between the forecasts and the actual labels, there exists a forecasting strategy whose forecasts are ideal as far as this gambling strategy is concerned. A forecasting strategy obtained in this way from a gambling strategy demonstrating a strong law of large numbers is simplified and studied empirically.
Hedging predictions in machine learning
 Comput. J
, 2007
Cited by 12 (3 self)
Recent advances in machine learning make it possible to design efficient prediction algorithms for data sets with huge numbers of parameters. This article describes a new technique for ‘hedging’ the predictions output by many such algorithms, including support vector machines, kernel ridge regression, kernel nearest neighbours, and by many other state-of-the-art methods. The hedged predictions for the labels of new objects include quantitative measures of their own accuracy and reliability. These measures are provably valid under the assumption of randomness, traditional in machine learning: the objects and their labels are assumed to be generated independently from the same probability distribution. In particular, it becomes possible to control (up to statistical fluctuations) the number of erroneous predictions by selecting a suitable confidence level. Validity being achieved automatically, the remaining goal of hedged prediction is efficiency: taking full account of the new objects’ features and other available information to produce as accurate predictions as possible. This can be done successfully using the powerful machinery of modern machine learning.
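One established way to obtain hedged predictions of this kind is conformal prediction. The sketch below assumes that construction and is not taken from the article: the nearest-neighbour nonconformity score, the one-dimensional toy data, and all function names are hypothetical choices for illustration.

```python
def nonconformity(i, xs, ys):
    """Nearest same-label distance divided by nearest other-label distance."""
    same = [abs(xs[i] - xs[j]) for j in range(len(xs)) if j != i and ys[j] == ys[i]]
    other = [abs(xs[i] - xs[j]) for j in range(len(xs)) if ys[j] != ys[i]]
    d_same = min(same) if same else float("inf")
    d_other = min(other) if other else float("inf")
    if d_other == 0.0:
        return float("inf")
    return d_same / d_other

def prediction_set(xs, ys, x_new, labels, epsilon):
    """Return every label whose p-value exceeds the significance level epsilon."""
    result = []
    for y in labels:
        # Tentatively assign label y to the new object and score the whole bag.
        bag_x, bag_y = xs + [x_new], ys + [y]
        scores = [nonconformity(i, bag_x, bag_y) for i in range(len(bag_x))]
        # p-value: fraction of examples at least as strange as the new one.
        p = sum(s >= scores[-1] for s in scores) / len(scores)
        if p > epsilon:
            result.append(y)
    return result

# Toy data: two well-separated classes on the real line (illustrative only).
xs = [0.0, 0.1, 0.2, 0.3, 0.4, 1.0, 1.1, 1.2, 1.3, 1.4]
ys = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

set_80 = prediction_set(xs, ys, 0.25, [0, 1], epsilon=0.2)
set_95 = prediction_set(xs, ys, 0.25, [0, 1], epsilon=0.05)
print(set_80, set_95)
```

Lowering `epsilon` (that is, demanding higher confidence) can only enlarge the prediction set, which is the trade-off between validity and efficiency that the abstract describes.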
Regret bounds for hierarchical classification with linearthreshold functions
 Proceedings of the 17th Annual Conference on Learning Theory
, 2004
Cited by 9 (3 self)
We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce an incremental algorithm using a linear-threshold classifier at each node of the taxonomy. These classifiers are trained and evaluated in a hierarchical top-down fashion. We then define a hierarchical and parametric data model and prove a bound on the probability that our algorithm guesses the wrong multilabel for a random instance compared to the same probability when the true model parameters are known. Our bound decreases exponentially with the number of training examples and depends in a detailed way on the interaction between the process parameters and the taxonomy structure. Preliminary experiments on real-world data provide support to our theoretical results.
Linear Classification and Selective Sampling Under Low Noise Conditions
Cited by 8 (2 self)
We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate $N^{-(1+\alpha)(2+\alpha)/(2(3+\alpha))}$, where $N$ denotes the number of queried labels and $\alpha > 0$ is the exponent in the low noise condition. For all $\alpha > \sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the same classifier, which queries all labels, and for $\alpha \to \infty$ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.
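The threshold $\sqrt{3} - 1$ can be checked directly by comparing the two exponents. The short derivation below assumes the rates read $N^{-(1+\alpha)(2+\alpha)/(2(3+\alpha))}$ for the selective sampler and $N^{-(1+\alpha)/(2+\alpha)}$ for the fully supervised version, with $\alpha > 0$ so that every factor involved is positive.

```latex
\[
\frac{(1+\alpha)(2+\alpha)}{2(3+\alpha)} > \frac{1+\alpha}{2+\alpha}
\iff (2+\alpha)^2 > 2(3+\alpha)
\iff \alpha^2 + 2\alpha - 2 > 0
\iff \alpha > \sqrt{3} - 1 \approx 0.73 .
\]
```

As $\alpha \to \infty$ the selective-sampling exponent grows like $\alpha/2$ while the fully supervised exponent tends to $1$, which is consistent with the gap between the two rates widening without bound.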
Robust selective sampling from single and multiple teachers
, 2010
Cited by 7 (0 self)
We present a new online learning algorithm in the selective sampling framework, where labels must be actively queried before they are revealed. We prove bounds on the regret of our algorithm and on the number of labels it queries when faced with an adaptive adversarial strategy of generating the instances. Our bounds both generalize and strictly improve over previous bounds in similar settings. Using a simple online-to-batch conversion technique, our selective sampling algorithm can be converted into a statistical (pool-based) active learning algorithm. We extend our algorithm and analysis to the multiple-teacher setting, where the algorithm can choose which subset of teachers to query for each label.
Sparsity regret bounds for individual sequences in online linear regression
 JMLR Workshop and Conference Proceedings, 19 (COLT 2011 Proceedings):377–396
, 2011
Cited by 6 (0 self)
We consider the problem of online linear regression on arbitrary deterministic sequences when the ambient dimension d can be much larger than the number of time rounds T. We introduce the notion of sparsity regret bound, which is a deterministic online counterpart of recent risk bounds derived in the stochastic setting under a sparsity scenario. We prove such regret bounds for an online-learning algorithm called SeqSEW, which is based on exponential weighting and data-driven truncation. In a second part we apply a parameter-free version of this algorithm to the stochastic setting (regression model with random design). This yields risk bounds of the same flavor as in Dalalyan and Tsybakov (2012a) but which solve two questions left open therein. In particular our risk bounds are adaptive (up to a logarithmic factor) to the unknown variance of the noise if the latter is Gaussian. We also address the regression model with fixed design.
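For readers unfamiliar with exponential weighting, the following is a minimal sketch of a generic exponentially weighted average forecaster over a finite set of experts under squared loss. It is not SeqSEW: there is no sparsity prior and no data-driven truncation, and the experts, the learning rate `eta`, and all names are assumptions made for illustration.

```python
import math

def exp_weighted_forecast(expert_preds, outcomes, eta):
    """Return (forecasts, cumulative expert losses) under squared loss."""
    n_experts = len(expert_preds[0])
    cum_loss = [0.0] * n_experts
    forecasts = []
    for preds, y in zip(expert_preds, outcomes):
        # Weights decay exponentially in each expert's cumulative loss so far;
        # subtracting the best loss keeps the exponentials numerically stable.
        best = min(cum_loss)
        w = [math.exp(-eta * (loss - best)) for loss in cum_loss]
        total = sum(w)
        forecasts.append(sum(wi * p for wi, p in zip(w, preds)) / total)
        # The outcome is revealed only after the forecast is made.
        for k, p in enumerate(preds):
            cum_loss[k] += (p - y) ** 2
    return forecasts, cum_loss

# Two hypothetical constant experts; the outcome sequence always equals 1.0.
T = 50
expert_preds = [[0.0, 1.0] for _ in range(T)]
outcomes = [1.0] * T
eta = 0.5

forecasts, cum_loss = exp_weighted_forecast(expert_preds, outcomes, eta)
forecaster_loss = sum((f - y) ** 2 for f, y in zip(forecasts, outcomes))
regret = forecaster_loss - min(cum_loss)
print(round(regret, 3))
```

In this toy run the forecaster's regret against the better expert stays below $\ln(2)/\eta$, in line with the standard guarantee for exp-concave losses when predictions and outcomes lie in $[0,1]$.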
Learning Noisy Linear Classifiers via Adaptive and Selective Sampling
Cited by 6 (2 self)
We introduce efficient margin-based algorithms for selective sampling and filtering in binary classification tasks. Experiments on real-world textual data reveal that our algorithms perform significantly better than popular and similarly efficient competitors. Using the so-called Mammen-Tsybakov low noise condition to parametrize the instance distribution, and assuming linear label noise, we show bounds on the convergence rate to the Bayes risk of a weaker adaptive variant of our selective sampler. Our analysis reveals that, excluding logarithmic factors, the average risk of this adaptive sampler converges to the Bayes risk at rate $N^{-(1+\alpha)(2+\alpha)/(2(3+\alpha))}$, where $N$ denotes the number of queried labels and $\alpha > 0$ is the exponent in the low noise condition. For all $\alpha > \sqrt{3} - 1 \approx 0.73$ this convergence rate is asymptotically faster than the rate $N^{-(1+\alpha)/(2+\alpha)}$ achieved by the fully supervised version of the base selective sampler, which queries all labels. Moreover, for $\alpha \to \infty$ (hard margin condition) the gap between the semi- and fully-supervised rates becomes exponential.
Optimality of Universal Bayesian Sequence Prediction for General Loss and Alphabet
 In
, 2002
Cited by 6 (1 self)
The Bayesian framework is ideally suited for induction problems. The probability of observing $x_t$ at time $t$, given past observations $x_1...x_{t-1}$, can be computed with Bayes' rule if the true generating distribution $\mu$ of the sequences $x_1x_2x_3...$ is known. The problem, however, is that in many cases one does not even have a reasonable guess of the true distribution. In order to overcome this problem a universal (or mixture) distribution $\xi$ is defined as a weighted sum or integral of distributions $\nu\!\in\!\M$, where $\M$ is any countable or continuous set of distributions including $\mu$. This is a generalization of Solomonoff induction, in which $\M$ is the set of all enumerable semimeasures. It is shown for several performance measures that using the universal $\xi$ as a prior is nearly as good as using the unknown true distribution $\mu$. In a sense, this solves the problem of the unknown prior in a universal way. All results are obtained for general finite alphabet. Convergence of $\xi$ to $\mu$ in a conditional mean squared sense and of $\xi/\mu\to 1$ with $\mu$ probability $1$ is proven. The number of additional errors $E_\xi$ made by the optimal universal prediction scheme based on $\xi$ minus the number of errors $E_\mu$ of the optimal informed prediction scheme based on $\mu$ is proven to be bounded by $O(\sqrt{E_\mu})$. The prediction framework is generalized to arbitrary loss functions. A system is allowed to take an action $y_t$, given $x_1...x_{t-1}$, and receives loss $\ell_{x_t y_t}$ if $x_t$ is the next symbol of the sequence. No assumptions on $\ell$ are necessary, besides boundedness. Optimal universal $\Lambda_\xi$ and optimal informed $\Lambda_\mu$ prediction schemes are defined and the total loss of $\Lambda_\xi$ is bounded in terms of the total loss of $\Lambda_\mu$, similar to the error bounds. We show that the bounds are tight and that no other predictor can lead to smaller bounds. 
Furthermore, for various performance measures we show Pareto-optimality of $\xi$ in the sense that there is no other predictor which performs better or equal in all environments $\nu\in\M$ and strictly better in at least one. So, optimal predictors can (w.r.t. most performance measures in expectation) be based on the mixture $\xi$. Finally we give an Occam's razor argument that Solomonoff's choice $w_\nu\sim 2^{-K(\nu)}$ for the weights is optimal, where $K(\nu)$ is the length of the shortest program describing $\nu$. Furthermore, games of chance, defined as a sequence of bets, observations, and rewards, are studied. The average profit achieved by the $\Lambda_\xi$ scheme rapidly converges to the best possible profit. The time needed to reach the winning zone is proportional to the relative entropy of $\mu$ and $\xi$. The prediction schemes presented here are compared to the weighted majority algorithm(s). Although the algorithms, the settings, and the proofs are quite different, the bounds of both schemes have a very similar structure. Extensions to infinite alphabets, partial, delayed and probabilistic prediction, classification, and more active systems are briefly discussed.
Online regression competitive with reproducing kernel Hilbert spaces
, 2005
Cited by 6 (3 self)
We consider the problem of online prediction of real-valued labels of new objects. The prediction algorithm’s performance is measured by the squared deviation of the predictions from the actual labels. No probabilistic assumptions are made about the way the labels and objects are generated. Instead, we are given a benchmark class of prediction rules some of which are hoped to produce good predictions. We show that for a wide range of infinite-dimensional benchmark classes one can construct a prediction algorithm whose cumulative loss over the first $N$ examples does not exceed the cumulative loss of any prediction rule in the class plus $O(\sqrt{N})$. Our proof technique is based on the recently developed method of defensive forecasting.