Results 1–10 of 19
Philosophy and the practice of Bayesian statistics
2010
Abstract

Cited by 37 (8 self)
A substantial school in the philosophy of science identifies Bayesian inference with inductive inference and even rationality as such, and seems to be strengthened by the rise and practical success of Bayesian statistics. We argue that the most successful forms of Bayesian statistics do not actually support that particular philosophy but rather accord much better with sophisticated forms of hypothetico-deductivism. We examine the actual role played by prior distributions in Bayesian models, and the crucial aspects of model checking and model revision, which fall outside the scope of Bayesian confirmation theory. We draw on the literature on the consistency of Bayesian updating and also on our experience of applied work in social science. Clarity about these matters should benefit not just philosophy of science, but also statistical practice. At best, the inductivist view has encouraged researchers to fit and compare models without checking them; at worst, theorists have actively discouraged practitioners from performing model checking because it does not fit into their framework.
Asymptotics of Discrete MDL for Online Prediction
2005
Abstract

Cited by 10 (5 self)
Minimum Description Length (MDL) is an important principle for induction and prediction, with strong relations to optimal Bayesian learning. This paper deals with learning non-i.i.d. processes by means of two-part MDL, where the underlying model class is countable. We consider the online learning framework, i.e. observations come in one by one, and the predictor is allowed to update his state of mind after each time step. We identify two ways of predicting by MDL for this setup, namely a static and a dynamic one. (A third variant, hybrid MDL, will turn out inferior.) We will prove that under the only assumption that the data is generated by a distribution contained in the model class, the MDL predictions converge to the true values almost surely. This is accomplished by proving finite bounds on the quadratic, the Hellinger, and the Kullback-Leibler loss of the MDL learner, which are however exponentially worse than for Bayesian prediction. We demonstrate that these bounds are sharp, even for model classes containing only Bernoulli distributions. We show how these bounds imply regret bounds for arbitrary loss functions. Our results apply to a wide range of setups, namely sequence prediction, pattern classification, regression, and universal induction in the sense of Algorithmic Information Theory among others.
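The static two-part MDL idea can be sketched in a few lines: select the single model minimizing a two-part code length (model cost plus data cost) and predict with it. This is an illustrative toy over a Bernoulli grid; the grid, prior weights, and function name are invented for this example, not taken from the paper.

```python
import math

def two_part_mdl_predict(x, thetas, weights):
    """Pick the two-part MDL estimator after seeing binary data x,
    then (static MDL) return that model's P(next bit = 1)."""
    def code_length(theta, w):
        ones = sum(x)
        zeros = len(x) - ones
        # Deterministic models contradicted by the data get infinite cost.
        if (ones and theta == 0.0) or (zeros and theta == 1.0):
            return float("inf")
        ll = ones * math.log(theta) if ones else 0.0
        ll += zeros * math.log(1.0 - theta) if zeros else 0.0
        # Two-part code: -log(prior weight) + -log(likelihood).
        return -math.log(w) - ll

    best = min(zip(thetas, weights), key=lambda tw: code_length(*tw))
    return best[0]

# Countable class theta in {0.1, ..., 0.9} with uniform prior weights.
thetas = [k / 10 for k in range(1, 10)]
weights = [1 / 9] * 9
p = two_part_mdl_predict([1, 1, 0, 1, 1, 1], thetas, weights)
```

With 5 ones and 1 zero and a uniform prior, the selected model is the grid point with maximal likelihood, here 0.8; the dynamic variant would re-run the selection after every observation.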
Follow the leader if you can, Hedge if you must
 Journal of Machine Learning Research
Abstract

Cited by 9 (5 self)
Follow-the-Leader (FTL) is an intuitive sequential prediction strategy that guarantees constant regret in the stochastic setting, but has poor performance for worst-case data. Other hedging strategies have better worst-case guarantees but may perform much worse than FTL if the data are not maximally adversarial. We introduce the FlipFlop algorithm, which is the first method that provably combines the best of both worlds. As a stepping stone for our analysis, we develop AdaHedge, which is a new way of dynamically tuning the learning rate in Hedge without using the doubling trick. AdaHedge refines a method by Cesa-Bianchi, Mansour, and Stoltz (2007), yielding improved worst-case guarantees. By interleaving AdaHedge and FTL, FlipFlop achieves regret within a constant factor of the FTL regret, without sacrificing AdaHedge’s worst-case guarantees. AdaHedge and FlipFlop do not need to know the range of the losses in advance; moreover, unlike earlier methods, both have the intuitive property that the issued weights are invariant under rescaling and translation of the losses. The losses are also allowed to be negative, in which case they may be interpreted as gains.
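The contrast between the two baseline strategies can be sketched directly: FTL commits fully to the best expert so far, while Hedge with a fixed learning rate eta spreads weight exponentially in cumulative loss. This is an illustrative sketch only; AdaHedge and FlipFlop, which tune eta adaptively and interleave the two strategies, are more involved than what is shown here.

```python
import math

def follow_the_leader(cum_loss):
    """FTL plays the expert with the smallest cumulative loss so far."""
    return cum_loss.index(min(cum_loss))

def hedge_weights(cum_loss, eta):
    """Hedge weights are proportional to exp(-eta * cumulative loss)."""
    z = [math.exp(-eta * c) for c in cum_loss]
    total = sum(z)
    return [v / total for v in z]

# Two experts after a few rounds; expert 0 has been better.
cum_loss = [0.1, 2.9]
leader = follow_the_leader(cum_loss)  # FTL commits fully to expert 0
w = hedge_weights(cum_loss, eta=1.0)  # Hedge hedges: some weight on expert 1
```

On stochastic data FTL's hard commitment is what yields constant regret; on adversarial data it is what makes FTL exploitable, which is the gap FlipFlop is designed to close.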
Safe Learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity
Abstract

Cited by 7 (4 self)
We extend Bayesian MAP and Minimum Description Length (MDL) learning by testing whether the data can be substantially more compressed by a mixture of the MDL/MAP distribution with another element of the model, and adjusting the learning rate if this is the case. While standard Bayes and MDL can fail to converge if the model is wrong, the resulting “safe” estimator continues to achieve good rates with wrong models. Moreover, when applied to classification and regression models as considered in statistical learning theory, the approach achieves optimal rates under, e.g., Tsybakov’s conditions, and reveals new situations in which we can penalize by (−log prior)/n rather than √((−log prior)/n).
When Efficient Model Averaging Out-Performs Boosting and Bagging
17th European Conference on Machine Learning and 10th European Conference on Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD 2006)
2006
Abstract

Cited by 5 (2 self)
Bayesian model averaging, also known as the Bayes optimal classifier (BOC), is an ensemble technique used extensively in the statistics literature. However, compared to other ensemble techniques such as bagging and boosting, BOC is less known and rarely used in data mining. This is partly due to model averaging being perceived as inefficient and because bagging and boosting consistently outperform a single model, which raises the question: “Do we even need BOC in data mining?”. We show that the answer to this question is “yes” by illustrating that several recent efficient model averaging approaches can significantly outperform bagging and boosting in realistic difficult situations such as extensive class label noise, sample selection bias and many-class problems. To our knowledge, the insight that model averaging can outperform bagging and boosting in these situations has not been published in the machine learning, data mining or statistical communities.
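The textbook BOC idea behind this line of work can be sketched as posterior-weighted voting: each model's vote is weighted by its prior times its likelihood on the training data. The constant-probability models and the data below are invented for illustration; the paper's contribution is efficient approximations of this averaging, not the averaging itself.

```python
import math

def bayes_model_average(probs, priors, train_labels):
    """probs: each model's fixed P(y = 1). Returns the posterior-weighted
    averaged prediction P(y = 1) under the BOC weighting scheme."""
    def likelihood(p):
        # Probability the model assigns to the observed training labels.
        return math.prod(p if y == 1 else 1 - p for y in train_labels)
    # Posterior weight of each model: prior * likelihood (then normalize).
    posts = [pr * likelihood(p) for p, pr in zip(probs, priors)]
    z = sum(posts)
    return sum((w / z) * p for w, p in zip(posts, probs))

# Three simple candidate models under a uniform prior; the training
# labels (4 ones, 1 zero) favor the p = 0.8 model.
pred = bayes_model_average([0.2, 0.5, 0.8], [1 / 3, 1 / 3, 1 / 3],
                           [1, 1, 1, 0, 1])
```

The averaged prediction lands between 0.5 and 0.8: the best-fitting model dominates but the others still contribute, which is exactly the robustness to noise and bias the abstract attributes to model averaging.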
The Safe Bayesian: learning the learning rate via the mixability gap
Abstract

Cited by 2 (1 self)
Standard Bayesian inference can behave suboptimally if the model is wrong. We present a modification of Bayesian inference which continues to achieve good rates with wrong models. Our method adapts the Bayesian learning rate to the data, picking the rate minimizing the cumulative loss of sequential prediction by posterior randomization. Our results can also be used to adapt the learning rate in a PAC-Bayesian context. The results are based on an extension of an inequality due to T. Zhang and others to dependent random variables.
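The learning-rate selection can be sketched as follows: run a generalized (eta-tempered) Bayesian update for each candidate eta, and keep the eta with the smallest cumulative sequential (prequential) log loss. This is an illustrative stand-in under invented assumptions (a Bernoulli grid, a small eta grid, and the posterior-mixture predictive instead of the paper's posterior randomization).

```python
import math

def prequential_loss(data, thetas, eta):
    """Cumulative log loss of sequential prediction under an
    eta-tempered Bayesian update over Bernoulli models thetas."""
    w = [1.0 / len(thetas)] * len(thetas)  # uniform prior
    loss = 0.0
    for x in data:
        p = sum(wi * t for wi, t in zip(w, thetas))  # predictive P(x = 1)
        loss += -math.log(p if x == 1 else 1 - p)
        # Generalized Bayes: w_i is multiplied by P(x | theta_i)^eta.
        w = [wi * ((t if x == 1 else 1 - t) ** eta)
             for wi, t in zip(w, thetas)]
        z = sum(w)
        w = [wi / z for wi in w]
    return loss

thetas = [0.1, 0.5, 0.9]
data = [1, 0, 1, 1, 0, 1, 1, 1]
# "Safe" choice: the eta that would have predicted the sequence best.
best_eta = min([0.25, 0.5, 1.0],
               key=lambda e: prequential_loss(data, thetas, e))
```

Standard Bayes corresponds to eta = 1; when the model is wrong, a smaller eta can win this comparison, which is the mechanism the abstract describes for retaining good rates under misspecification.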
Bayesian Brittleness: Why no Bayesian model is “good enough”
, 2014
Abstract

Cited by 2 (1 self)
Although it is known that Bayesian estimators may fail to converge or may converge towards the wrong answer (i.e. be inconsistent) if the probability space is not finite or if the model is misspecified (i.e. the data-generating distribution does not belong to the family parametrized by the model), it is also a popular belief that a “good” or “close” enough model should have good convergence properties. This paper incorporates Bayesian priors into the Optimal Uncertainty Quantification (OUQ) framework [86] and in doing so reveals extreme brittleness in Bayesian inference. These brittleness results demonstrate that, contrary to popular belief, there is no such thing as a “close enough” model in Bayesian inference in the following sense: we derive optimal lower and upper bounds on posterior values obtained from models that exactly capture an arbitrarily large (but finite) number of finite-dimensional marginals of the data-generating distribution and/or that are arbitrarily close to the data-generating distribution in the Prokhorov or total variation metrics; these bounds show that such models may still make the largest possible prediction
A Fast Algorithm for Robust Mixtures in the Presence of Measurement Errors
Abstract

Cited by 2 (1 self)
Abstract—In experimental and observational sciences, detecting atypical, peculiar data from large sets of measurements has the potential of highlighting candidates of interesting new types of objects that deserve more detailed domain-specific follow-up study. However, measurement data is nearly never free of measurement errors. These errors can generate false outliers that are not truly interesting. Although many approaches exist for finding outliers, they have no means to tell to what extent the peculiarity is not simply due to measurement errors. To address this issue, we have developed a model-based approach to infer genuine outliers from multivariate data sets when measurement error information is available. This is based on a probabilistic mixture of hierarchical density models, in which parameter estimation is made feasible by a tree-structured variational expectation-maximization algorithm. Here, we further develop an algorithmic enhancement to address the scalability of this approach, in order to make it applicable to large data sets, via a K-dimensional-tree based partitioning of the variational posterior assignments. This creates a nontrivial tradeoff between a more detailed noise model to enhance the detection accuracy, and the coarsened posterior representation to obtain computational speedup. Hence, we conduct extensive experimental validation to study the accuracy/speed tradeoffs achievable in a variety of data conditions. We find that, at low-to-moderate error levels, a speedup factor that is at least linear in the number of data points can be achieved without significantly sacrificing the detection accuracy. The benefits of including measurement error information into the modeling are evident in all situations, and the gain roughly recovers the loss incurred by the speedup procedure in large error conditions.

We analyze and discuss in detail the characteristics of our algorithm based on results obtained on appropriately designed synthetic data experiments, and we also demonstrate its working in a real application example.

Index Terms—K-dimensional (KD) tree, measurement errors, outlier detection, robust mixture modeling, variational expectation-maximization (EM) algorithm.
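The KD-tree coarsening idea can be sketched in its simplest form: recursively split the data on the widest dimension until leaves are small, then let each leaf's mean stand in for its points, so per-point work becomes per-leaf work. This toy sketch only illustrates the partitioning mechanism; the paper partitions variational posterior assignments inside an EM loop, which is considerably more involved.

```python
def kd_partition(points, leaf_size):
    """Recursively split on the widest dimension; return a list of leaves
    (each leaf is a list of points of size <= leaf_size)."""
    if len(points) <= leaf_size:
        return [points]
    dims = len(points[0])
    spreads = [max(p[d] for p in points) - min(p[d] for p in points)
               for d in range(dims)]
    d = spreads.index(max(spreads))      # split the widest dimension
    pts = sorted(points, key=lambda p: p[d])
    mid = len(pts) // 2                  # median split
    return kd_partition(pts[:mid], leaf_size) + kd_partition(pts[mid:], leaf_size)

def leaf_means(leaves):
    """One representative (the mean) per leaf: the coarsened view."""
    return [tuple(sum(c) / len(leaf) for c in zip(*leaf)) for leaf in leaves]

points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1),
          (5.2, 4.9), (0.2, 0.1), (5.1, 5.0)]
leaves = kd_partition(points, leaf_size=2)
means = leaf_means(leaves)
```

The accuracy/speed tradeoff in the abstract corresponds to the choice of leaf size here: larger leaves mean fewer representatives (faster) but a coarser posterior representation (less accurate).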