Results 1-10 of 12
Model selection and accounting for model uncertainty in graphical models using Occam's window
, 1993
Abstract

Cited by 365 (48 self)
We consider the problem of model selection and accounting for model uncertainty in high-dimensional contingency tables, motivated by expert system applications. The approach most used currently is a stepwise strategy guided by tests based on approximate asymptotic P-values leading to the selection of a single model; inference is then conditional on the selected model. The sampling properties of such a strategy are complex, and the failure to take account of model uncertainty leads to underestimation of uncertainty about quantities of interest. In principle, a panacea is provided by the standard Bayesian formalism which averages the posterior distributions of the quantity of interest under each of the models, weighted by their posterior model probabilities. Furthermore, this approach is optimal in the sense of maximising predictive ability. However, this has not been used in practice because computing the posterior model probabilities is hard and the number of models is very large (often greater than 10^11). We argue that the standard Bayesian formalism is unsatisfactory and we propose an alternative Bayesian approach that, we contend, takes full account of the true model uncertainty by averaging over a much smaller set of models. An efficient search algorithm is developed for finding these models. We consider two classes of graphical models that arise in expert systems: the recursive causal models and the decomposable ...
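The averaging-and-pruning idea described above can be sketched in a few lines. Assuming equal prior model probabilities and purely hypothetical log marginal likelihoods, an Occam's-window rule keeps only the models whose posterior probability is within a factor C of the best model's and renormalizes the weights; the function name, the cutoff C = 20, and the model labels are illustrative, not taken from the paper.

```python
import math

def occams_window(log_marglik, c=20.0):
    """Keep models whose posterior probability is within a factor c of the
    best model's (Occam's window), then renormalize the weights.
    log_marglik: dict mapping model name -> log marginal likelihood;
    equal prior model probabilities are assumed for simplicity."""
    best = max(log_marglik.values())
    kept = {m: lm for m, lm in log_marglik.items()
            if best - lm <= math.log(c)}          # prune implausible models
    z = sum(math.exp(lm - best) for lm in kept.values())
    return {m: math.exp(lm - best) / z for m, lm in kept.items()}

# Hypothetical log marginal likelihoods for three candidate models:
weights = occams_window({"A": -10.0, "B": -11.0, "C": -20.0})
```

Model-averaged inference then sums each model's posterior for the quantity of interest, weighted by these (renormalized) probabilities, over the much smaller retained set.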
Word-Sense Disambiguation Using Decomposable Models
 In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics
, 1994
Abstract

Cited by 155 (19 self)
Most probabilistic classifiers used for word-sense disambiguation have either been based on only one contextual feature or have used a model that is simply assumed to characterize the interdependencies among multiple contextual features. In this paper, a different approach to formulating a probabilistic model is presented along with a case study of the performance of models produced in this manner for the disambiguation of the noun interest. We describe a method for formulating probabilistic models that use multiple contextual features for word-sense disambiguation, without requiring untested assumptions regarding the form of the model. Using this approach, the joint distribution of all variables is described by only the most systematic variable interactions, thereby limiting the number of parameters to be estimated, supporting computational efficiency, and providing an understanding of the data.
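As a minimal illustration of a decomposable factorization over a sense tag and contextual features, consider Naive Bayes, the simplest member of the decomposable class: the joint distribution factors as P(sense) times a product of P(feature_i | sense) terms, each a low-order marginal estimated from counts. Note the hedge: the paper's contribution is searching for the right factorization rather than assuming one like this; the toy "interest" data and all names below are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(data, alpha=1.0):
    """data: list of (sense, tuple_of_feature_values).
    Estimates the factors of the Naive Bayes decomposable model
    with add-alpha smoothing."""
    senses = Counter(s for s, _ in data)
    feat = defaultdict(Counter)    # (sense, position) -> value counts
    vocab = defaultdict(set)       # position -> values seen in training
    for s, fs in data:
        for i, v in enumerate(fs):
            feat[(s, i)][v] += 1
            vocab[i].add(v)
    return senses, feat, vocab, alpha, len(data)

def classify(model, fs):
    """Return the sense maximizing log P(sense) + sum_i log P(f_i | sense)."""
    senses, feat, vocab, alpha, n = model
    def score(s):
        lp = math.log(senses[s] / n)
        for i, v in enumerate(fs):
            c = feat[(s, i)]
            lp += math.log((c[v] + alpha) /
                           (sum(c.values()) + alpha * len(vocab[i])))
        return lp
    return max(senses, key=score)

# Toy examples: disambiguating "interest" from the preceding word only.
data = [("money", ("rate",)), ("money", ("rate",)), ("money", ("bank",)),
        ("attention", ("public",)), ("attention", ("great",))]
model = train(data)
```

A richer decomposable model would instead keep only the systematic feature-feature interactions, which is exactly the selection problem the paper addresses.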
Improved learning of Bayesian networks
 Proc. of the Conf. on Uncertainty in Artificial Intelligence
, 2001
Abstract

Cited by 40 (6 self)
Two or more Bayesian network structures are Markov equivalent when the corresponding acyclic digraphs encode the same set of conditional independencies. Therefore, the search space of Bayesian network structures may be organized in equivalence classes, each of which represents a different set of conditional independencies. The collection of sets of conditional independencies obeys a partial order, the so-called "inclusion order." This paper discusses in depth the role that the inclusion order plays in learning the structure of Bayesian networks. In particular, this role involves the way a learning algorithm traverses the search space. We introduce a condition for traversal operators, the inclusion boundary condition, which, when it is satisfied, guarantees that the search strategy can avoid local maxima. This is proved under the assumptions that the data is sampled from a probability distribution which is faithful to an acyclic digraph, and the length of the sample is unbounded. The previous discussion leads to the design of a new traversal operator and two new learning algorithms in the context of heuristic search and the Markov Chain Monte Carlo method. We carry out a set of experiments with synthetic and real-world data that show empirically the benefit of striving for the inclusion order when learning Bayesian networks from data.
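The equivalence classes mentioned above have a classical characterization (due to Verma and Pearl): two DAGs are Markov equivalent iff they share the same skeleton and the same v-structures (colliders whose parents are non-adjacent). A minimal check is sketched below; representing DAGs as node-to-parent-set maps is our choice for illustration, not anything from the paper.

```python
def skeleton(dag):
    """Undirected edge set of a DAG given as {node: set_of_parents}."""
    return {frozenset((p, child))
            for child, parents in dag.items() for p in parents}

def v_structures(dag):
    """Triples (a, c, b) with a -> c <- b and a, b non-adjacent."""
    sk, vs = skeleton(dag), set()
    for child, parents in dag.items():
        ps = sorted(parents)
        for i in range(len(ps)):
            for j in range(i + 1, len(ps)):
                if frozenset((ps[i], ps[j])) not in sk:
                    vs.add((ps[i], child, ps[j]))
    return vs

def markov_equivalent(d1, d2):
    return skeleton(d1) == skeleton(d2) and v_structures(d1) == v_structures(d2)

chain1 = {"A": set(), "B": {"A"}, "C": {"B"}}         # A -> B -> C
chain2 = {"A": {"B"}, "B": {"C"}, "C": set()}         # C -> B -> A
collider = {"A": set(), "B": {"A", "C"}, "C": set()}  # A -> B <- C
```

Traversal operators that respect the inclusion order work over these classes rather than over individual DAGs.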
Advances in Markov chain Monte Carlo methods
, 2007
Abstract

Cited by 24 (3 self)
Probability distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Samples from distributions give insight into their typical behavior and can allow approximation of any quantity of interest, such as expectations or normalizing constants. Markov Chain Monte Carlo (MCMC), introduced by Metropolis et al. (1953), allows sampling from distributions with intractable normalization, and remains one of the most important tools for approximate computation with probability distributions.
While not needed by MCMC, normalizers are key quantities: in Bayesian statistics marginal likelihoods are needed for model comparison; in statistical physics many physical quantities relate to the partition function. In this thesis we propose and investigate several new Monte Carlo algorithms, both for evaluating normalizing constants and for improved sampling of distributions.
Many MCMC correctness proofs rely on using reversible transition operators; this can lead to chains exploring by slow random walks. After reviewing existing MCMC algorithms, we develop a new framework for constructing non-reversible transition operators from existing reversible ones.
Next we explore and extend MCMC-based algorithms for computing normalizing constants. In particular we develop a new MCMC operator and Nested Sampling approach for the Potts model. Our results demonstrate that these approaches can be superior to finding normalizing constants by annealing methods and can obtain better posterior samples.
Finally we consider "doubly-intractable" distributions with extra unknown normalizer terms that do not cancel in standard MCMC algorithms. We propose using several deterministic approximations for the unknown terms, and investigate their interaction with sampling algorithms. We then develop novel exact-sampling-based MCMC methods, the Exchange Algorithm and Latent Histories. For the first time these algorithms do not require separate approximation before sampling begins. Moreover, the Exchange Algorithm outperforms the only alternative sampling algorithm for doubly intractable distributions.
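The key trick of the Exchange Algorithm can be sketched on a toy model where exact sampling happens to be easy: the auxiliary draw y ~ f(.|theta') enters the acceptance ratio with theta and theta' swapped, so the unknown log-normalizers cancel and only unnormalized densities are needed. Everything below (the toy distribution f(x|theta) proportional to exp(theta*x) on {0,...,10}, the flat prior, step sizes, names) is an illustrative assumption, not the thesis's experimental setup.

```python
import math
import random

def exchange_mcmc(x_obs, log_q, exact_sample, theta0,
                  n_iters=5000, step=0.5, seed=0):
    """Exchange-algorithm sketch for a 1-D parameter with a flat prior.
    log_q(x, theta): log *unnormalized* likelihood.
    exact_sample(theta, rng): one exact draw from the normalized model.
    The intractable log Z(theta) cancels from log_a because the auxiliary
    draw appears with swapped parameters."""
    rng = random.Random(seed)
    theta, chain = theta0, []
    for _ in range(n_iters):
        theta_p = theta + rng.gauss(0.0, step)  # symmetric random-walk proposal
        y = exact_sample(theta_p, rng)          # auxiliary exact sample
        log_a = (log_q(x_obs, theta_p) - log_q(x_obs, theta)
                 + log_q(y, theta) - log_q(y, theta_p))
        if math.log(rng.random() + 1e-300) < log_a:
            theta = theta_p
        chain.append(theta)
    return chain

# Toy "intractable" model: f(x | theta) proportional to exp(theta * x), x in {0..10}.
def log_q(x, theta):
    return theta * x

def exact_sample(theta, rng):
    m = max(theta * k for k in range(11))       # subtract max for stability
    weights = [math.exp(theta * k - m) for k in range(11)]
    return rng.choices(range(11), weights=weights)[0]

chain = exchange_mcmc(x_obs=8, log_q=log_q, exact_sample=exact_sample, theta0=0.0)
```

With an observation of x = 8 (above the theta = 0 mean of 5), the chain should settle on positive values of theta.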
A New Approach to Word Sense Disambiguation
 In Proceedings of the ARPA Workshop on Human Language Technology
, 1994
Abstract

Cited by 15 (5 self)
This paper presents and evaluates models created according to a schema that provides a description of the joint distribution of the values of sense tags and contextual features that is potentially applicable to a wide range of content words. The models are evaluated through a series of experiments, the results of which suggest that the schema is particularly well suited to nouns but that it is also applicable to words in other syntactic categories. 1. INTRODUCTION Assigning sense tags to the words in a text can be viewed as a classification problem. A probabilistic classifier assigns to each word the tag that has the highest estimated probability of having occurred in the given context. Designing a probabilistic classifier for word-sense disambiguation includes two main subtasks: specifying an appropriate model and estimating the parameters of that model. The former involves selecting informative contextual features (such as collocations) and describing the joint distribution of the...
BIFROST  Block recursive models Induced From Relevant knowledge, Observations, and Statistical Techniques
 Computational Statistics and Data Analysis
, 1993
Abstract

Cited by 4 (0 self)
The theoretical background for a program for establishing expert systems on the basis of observations and expert knowledge is presented. Block recursive models form the basis of the statistical modelling. These models, together with various methods for automatic model selection, are presented. Additionally, the connection between a block recursive model and expert systems based on causal probabilistic networks is treated. A medical example concerning diagnosis of coronary artery disease forms the basis for an evaluation of the expert systems established. Keywords: causal probabilistic networks, graphical association models, machine learning, model selection, selection criteria, selection strategies. 1 Introduction BIFROST is a program for semi-automatic knowledge acquisition and is a continuation of developments made in (Greve, Højsgaard, Skjøth and Thiesson 1990). The objective is to obtain preliminary causal models for use in the HUGIN expert system shell (Ander...
MODEL SELECTION AND SIMPLIFICATION USING LATTICES
Abstract
This paper shows how to cope with a problem of model selection and simplification using the principle of coherence (Gabriel (1969): a procedure involving testing a set of models ought not accept a model while rejecting a more general model). Mathematical lattice theory is used to define a partial ordering over the space of considered models. Several examples of partial ordering in large families of models are given along with a search algorithm to determine the best model with respect to chosen criteria.
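Gabriel's coherence principle has a direct algorithmic reading: once a model is rejected, every model below it in the partial order (every specialization) must be rejected too, so a lattice search need never test them. A small sketch over a hypothetical four-model diamond lattice follows; the representation as a model-to-submodels map is ours, not the paper's.

```python
def coherent_rejections(rejected, submodels):
    """Propagate rejections down a model lattice: rejecting a model forces
    rejection of all its specializations (Gabriel's coherence principle).
    submodels: dict mapping a model to its direct submodels."""
    closed = set(rejected)
    stack = list(rejected)
    while stack:
        m = stack.pop()
        for s in submodels.get(m, ()):
            if s not in closed:
                closed.add(s)
                stack.append(s)
    return closed

# Diamond lattice: saturated > {AB, BC} > independence.
submodels = {"sat": {"AB", "BC"}, "AB": {"indep"}, "BC": {"indep"}}
rejected = coherent_rejections({"AB"}, submodels)
```

Pruning the rejection closure before each test is what makes lattice-based selection tractable in large model families.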
Advances in Markov chain Monte Carlo methods
, 2007
Abstract
I, Iain Murray, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. Probability distributions over many variables occur frequently in Bayesian inference, statistical physics and simulation studies. Samples from distributions give insight into their typical behavior and can allow approximation of any quantity of interest, such as expectations or normalizing constants. Markov chain Monte Carlo (MCMC), introduced by Metropolis et al. (1953), allows sampling from distributions with intractable normalization, and remains one of the most important tools for approximate computation with probability distributions. While not needed by MCMC, normalizers are key quantities: in Bayesian statistics marginal likelihoods are needed for model comparison; in statistical physics many physical quantities relate to the partition function. In this thesis we propose and investigate several new Monte Carlo algorithms, both for evaluating normalizing constants and for ...
User's guide to BIFROST version 1.3
, 1998
Abstract
Contents: 1 Introduction; 2 Block Recursive Models and BIFROST; 3 Starting BIFROST; 4 Specifications; 5 The Model Selection Screen; 6 Export to HUGIN; 7 Example (Survival of Breast Cancer Patients); 8 Acknowledgments; 9 Addendum to version 1.3: Case Selection; A The Datafile; B Installing BIFROST. 1 Introduction BIFROST is a program for semi-automatic knowledge acquisition. The objective is to obtain preliminary causal models for use in the HUGIN shell. Based on a database of observations and minimal expert guidance the program will search for a model giving a description of the structure of association among the variables. The model obtained can be saved as, and afterwards loaded as, a domain in the HUGIN shell. This domain forms the starting point for establishing a causal network. The program originates from the work done by the authors together with Jørgen Greve
LINEAR INFORMATION MODELS AND LOG-LINEAR MODELS: THE CONCEPTUAL DIFFERENCE
, 2008
Abstract
Abstract: Analysis of categorical data by linear information models (LIM) is revisited (Cheng et al. (2007)). LIM describe the association of variables in a contingency table by factoring the joint likelihood as a product of likelihood ratio (LR) statistics in the form of mutual information (MI). The sample versions of these logarithmic LR statistics (the deviances), defined as the mutual information (MI) and the conditional mutual information (CMI), form the components of non-unique orthogonal decompositions of the joint MI of the data likelihood. Distinct MI/CMI decompositions are used to express various interactive associations to yield varied choices of LIM for data interpretation. In principle, LIM are validated by testing sets of MI and CMI statistics observed directly from the data, without assuming constrained likelihood and/or moment equations to be tested and formulated as log-linear models (LLM); see, for example, Bishop et al. (1975), Goodman (1970), Gokhale and Kullback (1978), and Haberman (1973). In practice, LIM and LLM often yield distinct statistical inferences for multi-way contingency tables with four or more variables. Examples with data analyses from the literature are presented.
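The MI/CMI decompositions described above rest on standard entropy identities, I(A;B|C) = H(A,C) + H(B,C) - H(A,B,C) - H(C), and the chain rule I(X;Y,Z) = I(X;Z) + I(X;Y|Z), which holds exactly for any empirical table. A short sketch follows; the variable names and toy observations are ours, not from the paper.

```python
import math
from collections import Counter

def entropy(samples, keys):
    """Empirical joint entropy (in nats) of the named variables,
    where samples is a list of dicts mapping variable name -> value."""
    n = len(samples)
    counts = Counter(tuple(s[k] for k in keys) for s in samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def cmi(samples, a, b, cond=()):
    """Empirical conditional mutual information I(a; b | cond);
    a, b, cond are tuples of variable names."""
    a, b, cond = tuple(a), tuple(b), tuple(cond)
    return (entropy(samples, a + cond) + entropy(samples, b + cond)
            - entropy(samples, a + b + cond) - entropy(samples, cond))

# A tiny three-way table given as a list of observations.
samples = [{"X": 0, "Y": 0, "Z": 0}, {"X": 0, "Y": 1, "Z": 1},
           {"X": 1, "Y": 0, "Z": 1}, {"X": 1, "Y": 1, "Z": 0},
           {"X": 1, "Y": 1, "Z": 1}]
```

Different orderings of the chain rule give the distinct, non-unique decompositions of the joint MI that the paper uses to define alternative LIM.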