## The Bayes Net Toolbox for MATLAB (2001)

Venue: Computing Science and Statistics

Citations: 184 (1 self)

### BibTeX

```bibtex
@ARTICLE{Murphy01thebayes,
  author  = {Kevin P. Murphy},
  title   = {The Bayes Net Toolbox for MATLAB},
  journal = {Computing Science and Statistics},
  year    = {2001},
  volume  = {33},
  pages   = {2001}
}
```


### Abstract

The Bayes Net Toolbox (BNT) is an open-source Matlab package for directed graphical models. BNT supports many kinds of nodes (probability distributions), exact and approximate inference, parameter and structure learning, and static and dynamic models. BNT is widely used in teaching and research: the web page has received over 28,000 hits since May 2000. In this paper, we discuss a broad spectrum of issues related to graphical models (directed and undirected), and describe, at a high-level, how BNT was designed to cope with them all. We also compare BNT to other software packages for graphical models, and to the nascent OpenBayes effort.

### Citations

7441 | Probabilistic Reasoning in Intelligent Systems
- Pearl
- 1988

Citation Context: ...ntage of requiring a number of parameters that is exponential in the number of parents. Other representations, which only require a linear number of parameters, have been proposed, including noisy-OR [Pea88] and its generalizations [Hen89, Sri93, Die93, MH97], and the logistic (sigmoid) function [Nea92]. Decision trees [BFGK96] can be used to represent CPDs with a variable (data-dependent) number of param... |
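The noisy-OR parameterization mentioned in this context needs only one parameter per parent, rather than a table exponential in the number of parents. A minimal sketch in NumPy (the function name and arguments are illustrative, not BNT's actual API):

```python
import numpy as np

def noisy_or(q, x, leak=0.0):
    # q[i]: probability that parent i, when on, FAILS to activate the child.
    # x[i]: state of parent i (0 or 1). leak: spontaneous-activation prob.
    # Hypothetical helper for illustration only.
    p_off = (1.0 - leak) * np.prod(np.asarray(q, float) ** np.asarray(x))
    return 1.0 - p_off

# Two active parents: the child stays off only if both fail (0.1 * 0.2).
p = noisy_or(q=[0.1, 0.2], x=[1, 1])
```

With K parents this CPD uses K (plus one leak) parameters instead of 2^K table entries.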

1403 | Near Shannon limit error-correcting coding and decoding: Turbo-codes
- Berrou, Glavieux, et al.
- 1993
Citation Context: ...applying the message passing algorithm to the original graph, even if it has loops (undirected cycles). Originally this was believed to be unsound, but the outstanding empirical success of turbocodes [BGT93], which have been shown to be using the BP algorithm [MMC98], led to a lot of theoretical analysis, which has shown how BP is closely related to variational methods [YFW01, SO01]. Recently this techni... |

1278 | Factor graphs and the sum-product algorithm
- Kschischang, Frey, et al.
- 2001
Citation Context: ...observed pixel, and is caused by its hidden parent (clear); the hidden causes are correlated with each other, as modelled by a Markov Random Field with pairwise potentials. The factor graph formalism [KFL01] is a very general way of using graph structure to represent global models (not necessarily probabilistic) in terms of local terms, or factors. It is possible to inter-convert all of these representat... |
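The sum-product algorithm this context alludes to computes marginals by passing local messages between factors and variables. A toy sketch on a three-variable binary chain, checked against brute-force enumeration (the factor values are made up for illustration):

```python
import numpy as np

# Pairwise factors on a binary chain x1 - x2 - x3 (illustrative numbers).
f12 = np.array([[1.0, 2.0],
                [3.0, 1.0]])
f23 = np.array([[2.0, 1.0],
                [1.0, 4.0]])

# Sum-product messages flowing from x3 towards x1.
m3_to_2 = f23.sum(axis=1)       # marginalise x3 out of f23
m2_to_1 = f12 @ m3_to_2         # absorb x2, pass the result to x1
p_x1 = m2_to_1 / m2_to_1.sum()  # normalised marginal of x1

# Brute-force check: enumerate the full joint over all 2^3 states.
joint = f12[:, :, None] * f23[None, :, :]
brute = joint.sum(axis=(1, 2))
brute = brute / brute.sum()
```

On a tree the messages visit each factor once, so the cost is linear in the number of factors rather than exponential in the number of variables.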

1247 | Causality: models, reasoning, and inference
- Pearl

Citation Context: ...rected graphical models Directed acyclic graph (DAG) models, also known as Bayesian or belief networks, are popular in the AI community, partly because they lend themselves to a causal interpretation [Pea00], which makes their structure easy to design by hand (as in an expert system). DAG models are also useful for modelling temporal data and dynamical systems, because they can encode "the arrow of time"... |

1132 | A Bayesian method for the induction of probabilistic networks from data
- Cooper, Herskovits
- 1992
Citation Context: ...es MCMC (MCMC + ~S) By ~S, we mean an approximation to the Bayesian scoring function, such as those mentioned in [CH97]. The structural EM algorithm is described in [Fri97, Fri98]. The K2 algorithm [CH92] assumes a total ordering of the nodes is given. In principle one can search over this ordering [FK00]; this is more efficient than searching in the (much larger) space of graphs. Unlike all the others, ... |
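The K2 idea mentioned here (greedy parent selection under a fixed node ordering) can be sketched compactly. This toy version scores families with BIC on binary data as an illustrative stand-in for K2's Bayesian score; all names are hypothetical, not BNT code:

```python
import numpy as np
from itertools import product

def bic_family_score(data, child, parents):
    # BIC score of one node's family on binary data; an illustrative
    # stand-in for K2's Bayesian (marginal-likelihood) score.
    n = len(data)
    ll = 0.0
    for cfg in product([0, 1], repeat=len(parents)):
        mask = (np.all(data[:, parents] == cfg, axis=1)
                if parents else np.ones(n, dtype=bool))
        m = int(mask.sum())
        if m == 0:
            continue
        h = int(data[mask, child].sum())
        for cnt, p in ((h, h / m), (m - h, 1 - h / m)):
            if cnt > 0:
                ll += cnt * np.log(p)
    return ll - 0.5 * (2 ** len(parents)) * np.log(n)

def k2(data, order, max_parents=2):
    # Greedily add the best-scoring parent while the score improves;
    # parents must precede their child in the given ordering.
    parents = {v: [] for v in order}
    for i, v in enumerate(order):
        best = bic_family_score(data, v, parents[v])
        while len(parents[v]) < max_parents:
            scores = {c: bic_family_score(data, v, parents[v] + [c])
                      for c in order[:i] if c not in parents[v]}
            if not scores or max(scores.values()) <= best:
                break
            c_best = max(scores, key=scores.get)
            best = scores[c_best]
            parents[v].append(c_best)
    return parents

# Synthetic data: x1 copies x0 90% of the time, so 0 -> 1 should be found.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = np.where(rng.random(500) < 0.9, x0, 1 - x0)
learned = k2(np.column_stack([x0, x1]), order=[0, 1])
```

The ordering constraint is what makes the search tractable: each node only considers its predecessors as candidate parents.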

949 | Learning Bayesian networks: The combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995 |

866 | An introduction to variational methods for graphical models. Learning in Graphical Models
- Jordan
- 1999
Citation Context: ...meters becomes a proxy for inference. The mean-field approximation produces a lower bound on the likelihood. More sophisticated methods are possible, which give tighter lower (and upper) bounds. See [JGJS98] for a tutorial. Recently this technique has been extended to do approximate Bayesian inference, using a technique called Variational Bayes [GB00]. Belief propagation (BP). This entails applying the... |

666 | Probabilistic Networks and Expert Systems
- Cowell, Dawid, et al.
- 1999

Citation Context: ...3; then the graph has 3 nodes, and looks like x_1 - x_2 - x_3. 2.3 Mixed directed/undirected graphical models It is possible to combine directed and undirected graphs into what is called a chain graph [CDLS99]. A common example of this is in image processing, where the hidden nodes are connected in an undirected 2D grid, but each hidden node has a child which contains the observed value of that pixel (see ... |

650 | Learning in Graphical Models
- Jordan
- 1998
Citation Context: ...hine learning and engineering, ranging from mixture models to hidden Markov models (HMMs), from factor analysis (PCA) to Kalman filters. The reason for this is well-described in the following quotation [Jor99]: Graphical models are a marriage between probability theory and graph theory. They provide a natural tool for dealing with two problems that occur throughout applied mathematics and engineering - unc... |

641 | Bayesian Networks and Decision Graphs - Jensen - 2001 |

626 | Markov chain Monte Carlo in practice - Gilks, Richardson, et al. - 1996 |

590 | Probabilistic inference using Markov chain Monte Carlo methods - Neal - 1993 |

445 | Generalized iterative scaling for log-linear models
- Darroch, Ratcliff
- 1972

Citation Context: ...erative Proportional Fitting (IPF). Finally, if the potentials are represented in terms of features (no matter what the graph structure), we must use the generalized iterative scaling (GIS) algorithm [DR72], which requires that the features sum to a constant, or the improved iterative scaling (IIS) algorithm [PPL97], which makes no assumptions about the features. Usually one uses Monte Carlo sampling to... |
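Iterative Proportional Fitting, the simplest member of the scaling family discussed in this context, alternately rescales a table so its row and column sums match target marginals. A minimal sketch for a 2x2 contingency table (illustrative, not BNT code):

```python
import numpy as np

def ipf(table, row_marg, col_marg, iters=50):
    # Alternately rescale rows then columns to match the target marginals.
    t = np.array(table, dtype=float)
    for _ in range(iters):
        t *= (row_marg / t.sum(axis=1))[:, None]
        t *= (col_marg / t.sum(axis=0))[None, :]
    return t

# Fit a uniform 2x2 table to the given row/column sums.
t = ipf(np.ones((2, 2)), np.array([0.3, 0.7]), np.array([0.6, 0.4]))
```

Starting from a uniform table with consistent target marginals, the fixed point is simply the independent product of the two marginals.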

339 | Turbo decoding as an instance of Pearl’s ’belief propagation’ algorithm
- McEliece, MacKay, et al.
- 1998
Citation Context: ..., even if it has loops (undirected cycles). Originally this was believed to be unsound, but the outstanding empirical success of turbocodes [BGT93], which have been shown to be using the BP algorithm [MMC98], led to a lot of theoretical analysis, which has shown how BP is closely related to variational methods [YFW01, SO01]. Recently this technique has been extended to do approximate Bayesian inference, ... |

329 | Expectation propagation for approximate Bayesian inference
- Minka
- 2001

Citation Context: ...n" [Lau92], which reduces a mixture of Gaussians to a single Gaussian using moment matching. (The implementation of this in [Lau92] is numerically unstable, and has been improved in [LJ99].) See also [Min01]. 3 If messages are passed sequentially, the scheduling usually uses two passes, often called collect/distribute or forwards/backwards [PS91]. 5 Undirected models are already parameterized in terms o... |

319 | Understanding belief propagation and its generalizations
- Yedidia, Freeman, et al.

Citation Context: ... very general way of using graph structure to represent global models (not necessarily probabilistic) in terms of local terms, or factors. It is possible to inter-convert all of these representations [YFW01], although sometimes information is "lost" from the graph structure in the process (this information will be implicitly represented in the parameters). This can affect the computational complexity of ... |

308 | The generalized distributive law - McEliece, Aji |

293 | Bucket elimination: a unifying framework for probabilistic inference - Dechter - 1998 |

293 | Bayesian updating in causal probabilistic networks by local computations
- Jensen, Lauritzen, et al.
- 1990

Citation Context: ..., and the computation of the messages may or may not involve a division operation. For instance, Pearl's algorithm [Pea88] was formulated for directed trees without division; the Hugin/JLO algorithm [JLO90] was formulated for undirected trees with division; and belief propagation [YFW01] was formulated for undirected networks without division. All of these algorithms are essentially equivalent. The adva... |

275 | A Unifying Review of Linear Gaussian Models - Roweis, Ghahramani - 1999 |

254 | Operations for learning with graphical models - Buntine - 1994 |

227 | The EM algorithm for graphical association models with missing data
- Lauritzen
- 1995

Citation Context: ...imation schemes such as belief propagation [TW01]. 7 4.1.2 Partially observed case If there are missing values or latent variables, we can use the EM algorithm to find a locally optimal ML/MAP estimate [Lau95]. The E step requires calling an inference routine (exact or approximate) to compute the expected sufficient statistics, and the M step is similar to the fully observed case. (If the M step only takes a ... |
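The EM loop for a partially observed model described in this context (E step infers expected statistics for the latent variable, M step re-estimates parameters) can be illustrated with the classic two-coin mixture. A toy sketch, not BNT code; the data and initial values are invented:

```python
import numpy as np

def em_two_coins(heads, n, iters=100, init=(0.6, 0.4)):
    # Each trial: an unknown coin (A or B, equal prior) is tossed n times
    # and only the head count is observed. EM recovers both biases.
    pA, pB = init
    h = np.asarray(heads, float)
    for _ in range(iters):
        # E step: posterior responsibility of coin A for each trial.
        likA = pA ** h * (1 - pA) ** (n - h)
        likB = pB ** h * (1 - pB) ** (n - h)
        rA = likA / (likA + likB)
        # M step: responsibility-weighted maximum-likelihood estimates.
        pA = (rA * h).sum() / (rA * n).sum()
        pB = ((1 - rA) * h).sum() / ((1 - rA) * n).sum()
    return pA, pB

pA, pB = em_two_coins(heads=[9, 8, 2, 1, 9], n=10)
```

Here the latent variable is which coin generated each trial; the clearly separated head counts let EM converge to an essentially hard assignment.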

224 | The Bayesian structural EM algorithm - Friedman - 1998 |

217 | Being Bayesian about network structure: A Bayesian approach to structure discovery in Bayesian networks
- Friedman, Koller
- 2003

Citation Context: ...mentioned in [CH97]. The structural EM algorithm is described in [Fri97, Fri98]. The K2 algorithm [CH92] assumes a total ordering of the nodes is given. In principle one can search over this ordering [FK00]; this is more efficient than searching in the (much larger) space of graphs. Unlike all the others, the IC/PC algorithm [SGS01] is a constraint-based algorithm that does not need to do search: it starts... |

190 | Connectionist learning of belief networks
- Neal
- 1992

Citation Context: ...resentations, which only require a linear number of parameters, have been proposed, including noisy-OR [Pea88] and its generalizations [Hen89, Sri93, Die93, MH97], and the logistic (sigmoid) function [Nea92]. Decision trees [BFGK96] can be used to represent CPDs with a variable (data-dependent) number of parameters; they are also useful for variable (parent) selection inside a structure learning algorithm... |

179 | A guide to the literature on learning probabilistic networks from data - Buntine - 1996 |

171 | Introduction to Graphical Modelling
- Edwards
- 1995

Citation Context: ...ust grid-structured Markov networks. In the statistics community, undirected models are often used to model multiway contingency tables, in which case they are called (hierarchical) log-linear models [Edw00]. 2.2.1 Parameterization of undirected models The parameters of a Markov network are the clique potentials. For instance, in a 2D grid, these correspond to the edge potentials ψ(x_i, x_j) for neighb... |

162 | Adaptive Probabilistic Networks with Hidden Variables
- Binder, Koller, et al.
- 1997
Citation Context: ...ed the generalized EM algorithm. One can also imagine doing a "partial E step" [NH98].) Other methods for handling partial observability, such as "bound and collapse" [RS97] or gradient-based methods [BKRK97], are of course possible. The advantages of EM compared to gradient methods are its simplicity, its lack of step size parameters, and the fact that it deals with constraints automatically. It can be c... |

155 | Inference in belief networks: A procedural guide - Huang, Darwiche - 1996 |

151 | Introduction to Monte Carlo methods - MacKay - 1998 |

143 | Propagation of probabilities, means and variances in mixed graphical association models
- Lauritzen
- 1992
Citation Context: ...re of k Gaussians: it hasn't got any smaller. Hence repeated applications of sums and products will cause the size of the representation to blow up. One approximation is to use "weak marginalisation" [Lau92], which reduces a mixture of Gaussians to a single Gaussian using moment matching. (The implementation of this in [Lau92] is numerically unstable, and has been improved in [LJ99].) See also [Min01]. 3... |
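The moment-matching step behind "weak marginalisation" replaces a Gaussian mixture with the single Gaussian having the same mean and variance. A one-dimensional sketch (function name illustrative; BNT's own implementation differs):

```python
import numpy as np

def weak_marginalise(weights, means, variances):
    # Match the first two moments of a 1-D Gaussian mixture.
    w = np.asarray(weights, float)
    w = w / w.sum()
    m = np.asarray(means, float)
    v = np.asarray(variances, float)
    mean = (w * m).sum()
    var = (w * (v + m ** 2)).sum() - mean ** 2   # E[X^2] - E[X]^2
    return mean, var

# Symmetric two-component mixture: mean 0, inflated variance.
mean, var = weak_marginalise([0.5, 0.5], [-1.0, 1.0], [0.25, 0.25])
```

Note how the collapsed variance (1.25) exceeds each component's variance (0.25): the spread between the component means is absorbed into the single Gaussian.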

134 | An Algebra of Bayesian Belief Universes for Knowledge-Based Systems - Jensen, Olesen, et al. - 1990 |

112 | Some practical issues in constructing belief networks - Henrion - 1987 |

111 | Propagation algorithms for variational Bayesian learning
- Ghahramani, Beal
- 2000

Citation Context: ...e, which give tighter lower (and upper) bounds. See [JGJS98] for a tutorial. Recently this technique has been extended to do approximate Bayesian inference, using a technique called Variational Bayes [GB00]. Belief propagation (BP). This entails applying the message passing algorithm to the original graph, even if it has loops (undirected cycles). Originally this was believed to be unsound, but the ou... |

72 | AIS-BN: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks
- Cheng, Druzdzel
Citation Context: ...tial case being a notable exception.) Below we list some techniques that can be used to tackle both kinds of intractability. Sampling (Monte Carlo) methods. The simplest kind is importance sampling [CD00], where we draw random samples x from the prior P(X), the (unconditional) distribution on the hidden variables, and then weight the samples by their likelihood, P(y|x), where y is the evidence. A (s... |
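The importance-sampling scheme described here (sample hidden states from the prior, weight by the evidence likelihood) is easy to check on a model small enough to solve exactly. A toy sketch with invented numbers, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

p_x = 0.3                          # prior P(X = 1) on the hidden variable
p_y1 = {0: 0.1, 1: 0.8}            # observation model P(Y = 1 | X)

# Draw hidden states from the prior, weight each by the evidence likelihood.
x = rng.random(100_000) < p_x
w = np.where(x, p_y1[1], p_y1[0])  # evidence: y = 1
posterior = (w * x).sum() / w.sum()

# Exact answer by Bayes' rule, for comparison.
exact = (p_x * p_y1[1]) / (p_x * p_y1[1] + (1 - p_x) * p_y1[0])
```

When the evidence is unlikely under the prior, most weights are tiny and the estimate degrades, which is exactly the weakness that motivates the adaptive schemes such as AIS-BN cited above.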

71 | A generalization of the noisy-or model - Srinivas - 1993 |

70 | Parameter adjustment in Bayes networks. The generalized noisy-or gate - Diez - 1993 |

67 | Learning Belief Networks from Data: An Information Theory Based Approach - Cheng, Bell, et al. - 1997 |

67 | Stable local computation with conditional Gaussian distributions
- Lauritzen, Jensen
Citation Context: ...eak marginalisation" [Lau92], which reduces a mixture of Gaussians to a single Gaussian using moment matching. (The implementation of this in [Lau92] is numerically unstable, and has been improved in [LJ99].) See also [Min01]. 3 If messages are passed sequentially, the scheduling usually uses two passes, often called collect/distribute or forwards/backwards [PS91]. 5 Undirected models are already param... |

67 | Bayesian network induction via local neighborhoods - Margaritis, Thrun - 1999 |

62 | Causal discovery from a mixture of experimental and observational data
- Cooper, Yoo
- 1999
Citation Context: ...bers of the same equivalence class. However, it is simple to modify the scoring function to exploit interventional data: simply don't update the parameters of nodes that have been set by intervention [CY99]. This enables one to learn causal models from data, which is useful in such domains as bioinformatics. An alternative to searching in graph/edge space is to search in feature space, and then use the ... |

62 | Learning Bayesian Networks from Incomplete Databases
- Ramoni, Sebastiani
- 1997
Citation Context: ...of finding the optimum, it is called the generalized EM algorithm. One can also imagine doing a "partial E step" [NH98].) Other methods for handling partial observability, such as "bound and collapse" [RS97] or gradient-based methods [BKRK97], are of course possible. The advantages of EM compared to gradient methods are its simplicity, its lack of step size parameters, and the fact that it deals with con... |

57 | A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models
- Neal, Hinton
- 1998

Citation Context: ...ase. (If the M step only takes a step in the right direction of parameter space, instead of finding the optimum, it is called the generalized EM algorithm. One can also imagine doing a "partial E step" [NH98].) Other methods for handling partial observability, such as "bound and collapse" [RS97] or gradient-based methods [BKRK97], are of course possible. The advantages of EM compared to gradient methods a... |

52 | Conjugate gradient acceleration of the EM algorithm
- Jamshidian, Jennrich
- 1993

Citation Context: ...pared to gradient methods are its simplicity, its lack of step size parameters, and the fact that it deals with constraints automatically. It can be combined with gradient methods for increased speed [JJ93]. 4.2 Structure learning If we adopt a Bayesian approach, structure learning means returning a posterior distribution over all possible graphs, or at least computing the expected value of functions (e... |

45 | Thin junction trees
- Bach, Jordan
- 2001
Citation Context: ...es Monte Carlo sampling to compute the expectations needed by these scaling algorithms, but if the graph implied by the features has small treewidth, one can use exact inference, which is much faster [BJ01]. Alternatively, one could use deterministic approximation schemes such as belief propagation [TW01]. 7 4.1.2 Partially observed case If there are missing values or latent variables, we can use the EM... |

44 | A variational approximation for Bayesian networks with discrete and continuous latent variables
- Murphy
- 1999
Citation Context: ...nd represent the result as a table. (This is what the BNT function CPD_to_table does.) This suggests that we should convert CPDs to potentials after we have seen the evidence, an idea first proposed in [Mur99]. This is the approach adopted by BNT, which lets it apply exact algorithms to a much greater range of models (e.g., mixtures of experts, input-output HMMs) than most other software packages. (Unfortu... |

43 | The unified propagation and scaling algorithm
- Teh, Welling
Citation Context: ...raph implied by the features has small treewidth, one can use exact inference, which is much faster [BJ01]. Alternatively, one could use deterministic approximation schemes such as belief propagation [TW01]. 7 4.1.2 Partially observed case If there are missing values or latent variables, we can use the EM algorithm to find a locally optimal ML/MAP estimate [Lau95]. The E step requires calling an inference... |

41 | Efficient inference in Bayes networks as a combinatorial optimization problem - Li, D’Ambrosio - 1994 |

30 | Efficient approximations for the marginal likelihood of Bayesian networks with hidden variables
- Chickering, Heckerman
- 1997

Citation Context: ...l Partial Point K2, IC/PC, (MI), (hill climb) (structural EM), (hill climb + ~S) Bayes MCMC (MCMC + ~S) By ~S, we mean an approximation to the Bayesian scoring function, such as those mentioned in [CH97]. The structural EM algorithm is described in [Fri97, Fri98]. The K2 algorithm [CH92] assumes a total ordering of the nodes is given. In principle one can search over this ordering [FK00]; this is mor... |

25 | Learning Bayesian networks in the presence of missing values and hidden variables - Friedman - 1997 |