Results 1  10
of
428
An Empirical Study of Smoothing Techniques for Language Modeling
, 1998
"... We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Br ..."
Abstract

Cited by 850 (20 self)
 Add to MetaCart
We present an extensive empirical comparison of several smoothing techniques in the domain of language modeling, including those described by Jelinek and Mercer (1980), Katz (1987), and Church and Gale (1991). We investigate for the first time how factors such as training data size, corpus (e.g., Brown versus Wall Street Journal), and ngram order (bigram versus trigram) affect the relative performance of these methods, which we measure through the crossentropy of test data. In addition, we introduce two novel smoothing techniques, one a variation of JelinekMercer smoothing and one a very simple linear interpolation technique, both of which outperform existing methods. 1
Bayesian Interpolation
 Neural Computation
, 1991
"... Although Bayesian analysis has been in use since Laplace, the Bayesian method of modelcomparison has only recently been developed in depth. In this paper, the Bayesian approach to regularisation and modelcomparison is demonstrated by studying the inference problem of interpolating noisy data. T ..."
Abstract

Cited by 520 (18 self)
 Add to MetaCart
Although Bayesian analysis has been in use since Laplace, the Bayesian method of modelcomparison has only recently been developed in depth. In this paper, the Bayesian approach to regularisation and modelcomparison is demonstrated by studying the inference problem of interpolating noisy data. The concepts and methods described are quite general and can be applied to many other problems. Regularising constants are set by examining their posterior probability distribution. Alternative regularisers (priors) and alternative basis sets are objectively compared by evaluating the evidence for them. `Occam's razor' is automatically embodied by this framework. The way in which Bayes infers the values of regularising constants and noise levels has an elegant interpretation in terms of the effective number of parameters determined by the data set. This framework is due to Gull and Skilling. 1 Data modelling and Occam's razor In science, a central task is to develop and compare models to a...
Model selection and accounting for model uncertainty in graphical models using Occam's window
, 1993
"... We consider the problem of model selection and accounting for model uncertainty in highdimensional contingency tables, motivated by expert system applications. The approach most used currently is a stepwise strategy guided by tests based on approximate asymptotic Pvalues leading to the selection o ..."
Abstract

Cited by 266 (46 self)
 Add to MetaCart
We consider the problem of model selection and accounting for model uncertainty in highdimensional contingency tables, motivated by expert system applications. The approach most used currently is a stepwise strategy guided by tests based on approximate asymptotic Pvalues leading to the selection of a single model; inference is then conditional on the selected model. The sampling properties of such a strategy are complex, and the failure to take account of model uncertainty leads to underestimation of uncertainty about quantities of interest. In principle, a panacea is provided by the standard Bayesian formalism which averages the posterior distributions of the quantity of interest under each of the models, weighted by their posterior model probabilities. Furthermore, this approach is optimal in the sense of maximising predictive ability. However, this has not been used in practice because computing the posterior model probabilities is hard and the number of models is very large (often greater than 1011). We argue that the standard Bayesian formalism is unsatisfactory and we propose an alternative Bayesian approach that, we contend, takes full account of the true model uncertainty byaveraging overamuch smaller set of models. An efficient search algorithm is developed for nding these models. We consider two classes of graphical models that arise in expert systems: the recursive causal models and the decomposable
Operations for Learning with Graphical Models
 Journal of Artificial Intelligence Research
, 1994
"... This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Wellknown examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models ..."
Abstract

Cited by 249 (12 self)
 Add to MetaCart
This paper is a multidisciplinary review of empirical, statistical learning from a graphical model perspective. Wellknown examples of graphical models include Bayesian networks, directed graphs representing a Markov chain, and undirected networks representing a Markov field. These graphical models are extended to model data analysis and empirical learning using the notation of plates. Graphical operations for simplifying and manipulating a problem are provided including decomposition, differentiation, and the manipulation of probability models from the exponential family. Two standard algorithm schemas for learning are reviewed in a graphical framework: Gibbs sampling and the expectation maximization algorithm. Using these operations and schemas, some popular algorithms can be synthesized from their graphical specification. This includes versions of linear regression, techniques for feedforward networks, and learning Gaussian and discrete Bayesian networks from data. The paper conclu...
Bayesian Model Averaging for Linear Regression Models
 Journal of the American Statistical Association
, 1997
"... We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest. A Bayesian solution to this problem in ..."
Abstract

Cited by 184 (13 self)
 Add to MetaCart
We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest. A Bayesian solution to this problem involves averaging over all possible models (i.e., combinations of predictors) when making inferences about quantities of
Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables
 Machine Learning
, 1997
"... We discuss Bayesian methods for learning Bayesian networks when data sets are incomplete. In particular, we examine asymptotic approximations for the marginal likelihood of incomplete data given a Bayesian network. We consider the Laplace approximation and the less accurate but more efficient BIC/MD ..."
Abstract

Cited by 178 (10 self)
 Add to MetaCart
We discuss Bayesian methods for learning Bayesian networks when data sets are incomplete. In particular, we examine asymptotic approximations for the marginal likelihood of incomplete data given a Bayesian network. We consider the Laplace approximation and the less accurate but more efficient BIC/MDL approximation. We also consider approximations proposed by Draper (1993) and Cheeseman and Stutz (1995). These approximations are as efficient as BIC/MDL, but their accuracy has not been studied in any depth. We compare the accuracy of these approximations under the assumption that the Laplace approximation is the most accurate. In experiments using synthetic data generated from discrete naiveBayes models having a hidden root node, we find that (1) the BIC/MDL measure is the least accurate, having a bias in favor of simple models, and (2) the Draper and CS measures are the most accurate. 1
Prior Probabilities
 IEEE Transactions on Systems Science and Cybernetics
, 1968
"... e case of location and scale parameters, rate constants, and in Bernoulli trials with unknown probability of success. In realistic problems, both the transformation group analysis and the principle of maximum entropy are needed to determine the prior. The distributions thus found are uniquely determ ..."
Abstract

Cited by 166 (3 self)
 Add to MetaCart
e case of location and scale parameters, rate constants, and in Bernoulli trials with unknown probability of success. In realistic problems, both the transformation group analysis and the principle of maximum entropy are needed to determine the prior. The distributions thus found are uniquely determined by the prior information, independently of the choice of parameters. In a certain class of problems, therefore, the prior distributions may now be claimed to be fully as "objective" as the sampling distributions. I. Background of the problem Since the time of Laplace, applications of probability theory have been hampered by difficulties in the treatment of prior information. In realistic problems of decision or inference, we often have prior information which is highly relevant to the question being asked; to fail to take it into account is to commit the most obvious inconsistency of reasoning and may lead to absurd or dangerously misleading results. As an extreme examp
Optimal Structure Identification with Greedy Search
, 2002
"... In this paper we prove the socalled "Meek Conjecture". In particular, we show that if a is an independence map of another DAG then there exists a finite sequence of edge additions and covered edge reversals in such that (1) after each edge modification and (2) after all modifications ..."
Abstract

Cited by 161 (1 self)
 Add to MetaCart
In this paper we prove the socalled "Meek Conjecture". In particular, we show that if a is an independence map of another DAG then there exists a finite sequence of edge additions and covered edge reversals in such that (1) after each edge modification and (2) after all modifications H.
Strictly Proper Scoring Rules, Prediction, and Estimation
, 2007
"... Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he ..."
Abstract

Cited by 143 (17 self)
 Add to MetaCart
Scoring rules assess the quality of probabilistic forecasts, by assigning a numerical score based on the predictive distribution and on the event or value that materializes. A scoring rule is proper if the forecaster maximizes the expected score for an observation drawn from the distribution F if he or she issues the probabilistic forecast F, rather than G ̸ = F. It is strictly proper if the maximum is unique. In prediction problems, proper scoring rules encourage the forecaster to make careful assessments and to be honest. In estimation problems, strictly proper scoring rules provide attractive loss and utility functions that can be tailored to the problem at hand. This article reviews and develops the theory of proper scoring rules on general probability spaces, and proposes and discusses examples thereof. Proper scoring rules derive from convex functions and relate to information measures, entropy functions, and Bregman divergences. In the case of categorical variables, we prove a rigorous version of the Savage representation. Examples of scoring rules for probabilistic forecasts in the form of predictive densities include the logarithmic, spherical, pseudospherical, and quadratic scores. The continuous ranked probability score applies to probabilistic forecasts that take the form of predictive cumulative distribution functions. It generalizes the absolute error and forms a special case of a new and very general type of score, the energy score. Like many other scoring rules, the energy score admits a kernel representation in terms of negative definite functions, with links to inequalities of Hoeffding type, in both univariate and multivariate settings. Proper scoring rules for quantile and interval forecasts are also discussed. We relate proper scoring rules to Bayes factors and to crossvalidation, and propose a novel form of crossvalidation known as randomfold crossvalidation. A case study on probabilistic weather forecasts in the North American Pacific Northwest illustrates the importance of propriety. We note optimum score approaches to point and quantile
Bayesian color constancy
 Journal of the Optical Society of America A
, 1997
"... The problem of color constancy may be solved if we can recover the physical properties of illuminants and surfaces from photosensor responses. We consider this problem within the framework of Bayesian decision theory. First, we model the relation among illuminants, surfaces, and photosensor response ..."
Abstract

Cited by 135 (18 self)
 Add to MetaCart
The problem of color constancy may be solved if we can recover the physical properties of illuminants and surfaces from photosensor responses. We consider this problem within the framework of Bayesian decision theory. First, we model the relation among illuminants, surfaces, and photosensor responses. Second, we construct prior distributions that describe the probability that particular illuminants and surfaces exist in the world. Given a set of photosensor responses, we can then use Bayes’s rule to compute the posterior distribution for the illuminants and the surfaces in the scene. There are two widely used methods for obtaining a single best estimate from a posterior distribution. These are maximum a posteriori (MAP) and minimum meansquarederror (MMSE) estimation. We argue that neither is appropriate for perception problems. We describe a new estimator, which we call the maximum local mass (MLM) estimate, that integrates local probability density. The new method uses an optimality criterion that is appropriate for perception tasks: It finds the most probable approximately correct answer. For the case of low observation noise, we provide an efficient approximation. We develop the MLM estimator for the colorconstancy problem in which flat matte surfaces are uniformly illuminated. In simulations we show that the MLM method performs better than the MAP estimator and better than a number of standard colorconstancy algorithms. We note conditions under which even the optimal estimator produces poor estimates: when the spectral properties of the surfaces in the scene are biased. © 1997 Optical Society of America [S07403232(97)016074] 1.