An Empirical Study of Smoothing Techniques for Language Modeling
, 1998
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and n-gram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the cross-entropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of Jelinek-Mercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods.
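To make the evaluation measure concrete, the following is a minimal sketch of a bigram model with a single fixed interpolation weight, scored by the cross-entropy of test data in bits per word. It is not the authors' implementation: the weight `lam`, the add-one unigram estimate, the toy corpus, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def train_counts(tokens):
    """Collect unigram, context, and bigram counts from a token list."""
    unigrams = Counter(tokens)
    contexts = Counter(tokens[:-1])          # tokens that have a successor
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, contexts, bigrams


def interpolated_prob(prev, word, counts, lam=0.7):
    """P(word | prev) as a fixed-weight mix of bigram and unigram estimates.

    lam is an assumed constant; in practice interpolation weights are
    usually tuned on held-out data.
    """
    unigrams, contexts, bigrams = counts
    total = sum(unigrams.values())
    vocab = len(unigrams) + 1                # one extra slot for unseen words
    p_uni = (unigrams[word] + 1) / (total + vocab)   # add-one unigram (assumption)
    p_big = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_big + (1 - lam) * p_uni


def cross_entropy(test_tokens, counts, lam=0.7):
    """Average negative log2 probability per predicted word (bits/word)."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    log_sum = sum(math.log2(interpolated_prob(p, w, counts, lam)) for p, w in pairs)
    return -log_sum / len(pairs)


train = "the cat sat on the mat the cat ate".split()
counts = train_counts(train)
bits = cross_entropy("the cat sat on the rug".split(), counts)
```

Lower cross-entropy on held-out text indicates a better model; comparing smoothing methods amounts to swapping out `interpolated_prob` and recomputing the score on the same test data.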
Forecasting Time Series Subject to Multiple Structural Breaks
, 2004
This paper provides a novel approach to forecasting time series subject to discrete structural breaks. We propose a Bayesian estimation and prediction procedure that allows for the possibility of new breaks over the forecast horizon, taking account of the size and duration of past breaks (if any) by means of a hierarchical hidden Markov chain model. Predictions are formed by integrating over the hyperparameters from the meta distributions that characterize the stochastic break point process. In an application to US Treasury bill rates, we find that the method leads to better out-of-sample forecasts than alternative methods that ignore breaks, particularly at long horizons.
MCLUST: Software for Model-Based Cluster and Discriminant Analysis
, 1998
... (1) where $x$ represents the data, and $k$ is an integer subscript specifying a particular cluster. Clusters are ellipsoidal, centered at the means $\mu_k$. The covariances $\Sigma_k$ determine their other geometric features. [Funded by the Office of Naval Research under contracts N000149610192 and N000149610330. Footnote 1: MathSoft, Inc., Seattle, WA USA, http://www.mathsoft.com/splus. Footnote 2: see http://lib.stat.cmu.edu/R/CRAN.] Each covariance matrix is parameterized by eigenvalue decomposition in the form $\Sigma_k = \lambda_k D_k A_k D_k^T$, where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$, and $\lambda_k$ is a scalar. The orie
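The eigenvalue parameterization can be illustrated with a small two-dimensional sketch that assembles a covariance matrix from a scalar, a rotation, and a diagonal shape matrix. This is not MCLUST code: the helper names, the restriction to two dimensions, and the example values are assumptions made for illustration.

```python
import math


def matmul(a, b):
    """Multiply two small matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    """Transpose a nested-list matrix."""
    return [list(row) for row in zip(*a)]


def covariance_from_decomposition(lam, angle, shape):
    """Build Sigma = lam * D A D^T for a 2-D cluster.

    lam   : scalar controlling the size of the ellipse
    angle : rotation angle defining the orthogonal matrix D
    shape : diagonal entries of A (relative axis lengths squared)
    """
    c, s = math.cos(angle), math.sin(angle)
    D = [[c, -s], [s, c]]                       # orthogonal eigenvector matrix
    A = [[shape[0], 0.0], [0.0, shape[1]]]      # diagonal shape matrix
    dadt = matmul(matmul(D, A), transpose(D))
    return [[lam * v for v in row] for row in dadt]


# Hypothetical example values: an ellipse rotated by 30 degrees.
sigma = covariance_from_decomposition(lam=2.0, angle=math.pi / 6, shape=(1.0, 0.25))
```

Because $D$ is orthogonal, the trace of the result equals $\lambda$ times the trace of $A$ and its determinant equals $\lambda^2 \det A$, so changing the angle rotates the ellipse without changing its size or elongation.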
Could Fisher, Jeffreys, and Neyman Have Agreed on Testing?
, 2002
Ronald Fisher advocated testing using p-values; Harold Jeffreys proposed use of objective posterior probabilities of hypotheses; and Jerzy Neyman recommended testing with fixed error probabilities. Each was quite critical of the other approaches.
A Bayesian Framework for Concept Learning
 DEPARTMENT OF ARTIFICIAL INTELLIGENCE, EDINBURGH UNIVERSITY
, 1999
Human concept learning presents a version of the classic problem of induction, which is made particularly difficult by the combination of two requirements: the need to learn from a rich (i.e. nested and overlapping) vocabulary of possible concepts and the need to be able to generalize concepts reasonably from only a few positive examples. I begin this thesis by considering a simple number concept game as a concrete illustration of this ability. On this task, human learners can with reasonable confidence lock in on one out of a billion billion billion logically possible concepts, after seeing only four positive examples of the concept, and can generalize informatively after seeing just a single example. Neither of the two classic approaches to inductive inference (hypothesis testing in a constrained space of possible rules, and computing similarity to the observed examples) can provide a complete picture of how people generalize concepts in even this simple setting. This thesis prop...
Concerning Bayesian Motion Segmentation, Model Averaging, Matching and the Trifocal Tensor
 In European Conference on Computer Vision
, 1998
Motion segmentation involves identifying regions of the image that correspond to independently moving objects. The number of independently moving objects, and the type of motion model for each of the objects, is unknown a priori. In order to perform motion segmentation, the problems of model selection, robust estimation and clustering must all be addressed simultaneously. Here we place the three problems into a common Bayesian framework, investigating the use of model averaging (representing a motion by a combination of models) as a principled way for motion segmentation of images. The final result is a fully automatic algorithm for clustering that works in the presence of noise and outliers.
1 Introduction
Detection of independently moving objects is an essential but often neglected precursor to problems in computer vision, e.g. efficient video compression [3], video editing, surveillance, smart tracking of objects etc. The work in this paper stems from the desire to develop a g...
Bayesian Model Comparison and Backprop Nets
 ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 4
, 1992
The Bayesian model comparison framework is reviewed, and the Bayesian Occam's razor is explained. This framework can be applied to feedforward networks, making possible (1) objective comparisons between solutions using alternative network architectures; (2) objective choice of magnitude and type of weight decay terms; (3) quantified estimates of the error bars on network parameters and on network output. The framework also generates a measure of the effective number of parameters determined by the data. The relationship
A Bayesian Approach to Testing for Markov Switching in Univariate and Dynamic Factor Models
, 2000
Though Hamilton's (1989) Markov switching model has been widely estimated in various contexts, formal testing for Markov switching is not straightforward. Univariate tests in the classical framework by Hansen (1992) and Garcia (1998) do not reject the linear model for GDP. We present Bayesian tests for Markov switching in both univariate and multivariate settings based on sensitivity of the posterior probability to the prior. We find that evidence for Markov switching, and thus the business cycle asymmetry, is stronger in a switching version of the dynamic factor model of Stock and Watson (1991) than it is for GDP by itself. Key Words: Bayesian Model Selection, Business Cycle Asymmetry, Dynamic Factor Model, Pseudo Prior, Model Indicator Parameter, Test of Markov Switching. JEL Classifications: C11, C12, E32. "The Bayesian moral is simple: Never make anything more than relative probability statements about the models explicitly entertained. Be suspicious of those who promise more!" [Po...
Long-Run Performance of Bayesian Model Averaging
 Journal of the American Statistical Association
, 2003
Hjort and Claeskens (HC) argue that statistical inference conditional on a single selected model underestimates uncertainty, and that model averaging is the way to remedy this; we strongly agree. They point out that Bayesian model averaging (BMA) has been the dominant approach to this, but argue that its performance has been inadequately studied, and propose an alternative, Frequentist Model Averaging (FMA). We point out, however, that there is a substantial literature on the performance of BMA, consisting of three main threads: general theoretical results, simulation studies, and evaluation of out-of-sample performance. The theoretical results are scattered, and we summarize them. The results have been quite consistent: BMA has tended to outperform competing methods for model selection and taking account of model uncertainty. The theoretical results depend on the assumption that the "practical distribution" over which the performance of methods is assessed is the same as the prior distribution used, and we investigate sensitivity of results to this assumption in a simple normal example; they turn out not to be unduly sensitive.
On the Bayesianity of Pereira-Stern Tests
C. Pereira and J. Stern have recently introduced a measure of evidence of a precise hypothesis consisting of the posterior probability of the set of points having smaller density than the supremum over the hypothesis. The related procedure is seen to be a Bayes test for specific loss functions. The nature of such loss functions and their relation to stylised inference problems are investigated. The dependence of the loss function on the sample is also discussed as well as the consequence of the introduction of Jeffreys prior mass for the precise hypothesis on the separability of probability and utility.
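The evidence measure described above can be sketched by Monte Carlo in the simplest possible setting. The example below assumes a one-dimensional normal posterior and a point hypothesis, both chosen purely for illustration; the function names, sample size, and seed are likewise hypothetical and not taken from the paper.

```python
import math
import random


def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def evidence_for_point_hypothesis(theta0, mean, sd, n_draws=100_000, seed=0):
    """Estimate the posterior mass of points with density below the supremum.

    For a point hypothesis theta = theta0 the supremum of the posterior
    density over the hypothesis is simply the density at theta0.
    The posterior is assumed to be Normal(mean, sd) for this sketch.
    """
    rng = random.Random(seed)
    sup_density = normal_pdf(theta0, mean, sd)
    below = sum(
        normal_pdf(rng.gauss(mean, sd), mean, sd) < sup_density
        for _ in range(n_draws)
    )
    return below / n_draws


# Hypothetical example: standard normal posterior, hypothesis theta = 1.
estimate = evidence_for_point_hypothesis(theta0=1.0, mean=0.0, sd=1.0)
# Closed form in this special case: mass outside |theta - mean| <= |theta0 - mean|.
exact = math.erfc(1.0 / math.sqrt(2.0))
```

In this normal special case the set of lower-density points is the two tails beyond the hypothesized value, so the estimate can be checked against the closed form `exact`; for posteriors without a closed form only the Monte Carlo step carries over.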