## Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables (2000)

### Download Links

- [www.cs.cmu.edu]
- [www-2.cs.cmu.edu]
- [www-cgi.cs.cmu.edu]
- [www.aladdin.cs.cmu.edu]
- [reports-archive.adm.cs.cmu.edu]
- [robotweb.ri.cmu.edu]

Citations: 7 (2 self)

### BibTeX

@MISC{Davies00mix-nets:factored,
  author = {Scott Davies and Andrew Moore},
  title = {Mix-nets: Factored Mixtures of Gaussians in Bayesian Networks with Mixed Continuous And Discrete Variables},
  year = {2000}
}

### Abstract

Recently developed techniques have made it possible to quickly learn accurate probability density functions from data in low-dimensional continuous spaces. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multiresolution kd-trees (Moore, 1999). In this paper, we propose a kind of Bayesian network in which low-dimensional mixtures of Gaussians over different subsets of the domain’s variables are combined into a coherent joint probability model over the entire domain. The network is also capable of modeling complex dependencies between discrete variables and continuous variables without requiring discretization of the continuous variables. We present efficient heuristic algorithms for automatically learning these networks from data, and perform comparative experiments illustrating how well these networks model real scientific data and synthetic data. We also briefly discuss some possible improvements to the networks, as well as possible applications.

### Citations

9028 | Maximum Likelihood from Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...s in many situations, but we have yet to verify this experimentally. 2.1.1 Learning Gaussian Mixtures From Data The EM algorithm is a popular method for learning mixture models from data (see, e.g., (Dempster et al., 1977)). The algorithm is an iterative algorithm with two steps per iteration. The Expectation or “E” step calculates the distribution over the unobserved mixture component variables, using the current est...
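The E/M iteration described in this context can be sketched for a one-dimensional Gaussian mixture. This is a generic illustration of the EM algorithm of Dempster et al., not the accelerated kd-tree variant the paper itself uses; the deterministic initialization, fixed iteration count, and variance floor are simplifying assumptions of this sketch.

```python
import math

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-((x - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def em_gmm_1d(data, k=2, iters=50):
    """Fit a k-component 1-D Gaussian mixture with plain (unaccelerated) EM."""
    lo, hi = min(data), max(data)
    # Simple deterministic initialization: means spread evenly over the data range.
    means = [lo + (c + 0.5) * (hi - lo) / k for c in range(k)]
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E step: responsibilities resp[j][c] = P(component c | datapoint j).
        resp = []
        for x in data:
            joint = [weights[c] * normal_pdf(x, means[c], variances[c]) for c in range(k)]
            z = sum(joint)
            resp.append([p / z for p in joint])
        # M step: re-estimate parameters from the weighted sufficient statistics.
        for c in range(k):
            n_c = sum(r[c] for r in resp)
            weights[c] = n_c / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / n_c
            variances[c] = max(
                sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / n_c,
                1e-6)  # floor the variance so a component cannot collapse
    return weights, means, variances

# Two well-separated clusters: the fitted means should land near 0.05 and 10.025.
w, mu, var = em_gmm_1d([0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9], k=2)
```

Each E step only uses the current parameter estimates, and each M step only uses the responsibilities, which is what makes the kd-tree acceleration of the E step possible in the paper's setting.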

7487 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...ference, and compression. 1 Introduction Bayesian networks (otherwise known as belief networks) are a popular method for representing joint probability distributions over many variables. (See, e.g., (Pearl, 1988).) A Bayesian network contains a directed acyclic graph G with one vertex V_i in the graph for each variable X_i in the domain. The directed edges in the graph specify a set of independence relations...
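The factorization this context alludes to (each variable conditioned only on its parents in the DAG) can be shown on a toy discrete network; the three-variable structure A → B, A → C and its probability tables are invented for illustration only.

```python
# Toy discrete Bayesian network with DAG A -> B, A -> C, so the joint factors as
# P(A, B, C) = P(A) * P(B | A) * P(C | A).  All tables are made-up examples.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_c_given_a = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.5, 1: 0.5}}

def joint(a, b, c):
    """Factored joint: each variable's factor conditions only on its parents."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_a[a][c]

# The factored joint is a proper distribution: it sums to 1 over all assignments.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Mix-nets keep this same factored form but replace the lookup-table factors with low-dimensional Gaussian mixtures.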

4171 | Pattern Classification and Scene Analysis
- Duda, Hart, et al.
- 2001
Citation Context ...ous values. What sorts of models might we want to return? One powerful type of model for representing probability density functions over small sets of variables is a Gaussian mixture model (see e.g. (Duda and Hart, 1973)). Let ~s_j represent the values that the j-th datapoint in the dataset D assigns to a variable set of interest ~S. In a Gaussian mixture model over ~S, we assume that the data are generated independently through th...

2755 | Estimating the dimension of a model
- Schwarz
- 1978
Citation Context ...ling with this tradeoff is to choose the model maximizing a scoring function that includes penalty terms related to the number of parameters in the model. We employ the Bayesian Information Criterion (Schwarz, 1978), or BIC, to choose between mixtures with different numbers of Gaussians. The BIC score for a given probability model P'(~S) is as follows: BIC(P') = log P'(D_S) - (log R / 2) |P'|, where D_S is th...
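The BIC score quoted in this context (log-likelihood minus (log R)/2 times the parameter count) is straightforward to compute once a mixture is fitted. The sketch below assumes a 1-D k-component mixture with |P'| = 3k − 1 free parameters (k means, k variances, k − 1 independent mixing weights); the example data and parameter values are invented.

```python
import math

def gmm_loglik(data, weights, means, variances):
    """Total log-likelihood of the data under a 1-D Gaussian mixture."""
    total = 0.0
    for x in data:
        density = sum(
            w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
        total += math.log(density)
    return total

def gmm_bic(data, weights, means, variances):
    """BIC(P') = log P'(D_S) - (log R / 2) * |P'|, with R = len(data) records
    and |P'| = 3k - 1 free parameters for a 1-D k-component mixture."""
    k = len(weights)
    n_params = 3 * k - 1
    return gmm_loglik(data, weights, means, variances) \
        - (math.log(len(data)) / 2.0) * n_params

data = [0.0, 0.1, -0.1, 10.0, 10.1, 9.9]
# A single broad Gaussian versus a two-component mixture sitting on the clusters:
bic_one = gmm_bic(data, [1.0], [5.0], [25.0])
bic_two = gmm_bic(data, [0.5, 0.5], [0.0, 10.0], [0.01, 0.01])
```

On clearly bimodal data like this, the two-component mixture's gain in log-likelihood outweighs its larger penalty term, so BIC prefers it.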

1141 | A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9
- Cooper, Herskovits
- 1992
Citation Context ...developed and experimented with variations of one particular network structure-learning algorithm. There is a wide variety of structure-learning algorithms for discrete Bayesian networks (see, e.g., (Cooper and Herskovits, 1992; Lam and Bacchus, 1994; Heckerman et al., 1995; Friedman et al., 1999)), many of which could be employed when learning mix-nets. The quicker and dirtier of these algorithms might be applicable direct...

955 | Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20
- Heckerman, Geiger, et al.
- 1995
Citation Context ...pular heuristic approach to finding networks that model discrete data well is to hillclimb over network structures, using a scoring function such as the BIC as the criterion to maximize. (See, e.g., (Heckerman et al., 1995).) Unfortunately, hillclimbing usually requires scoring a very large number of networks. While our algorithm for learning Gaussian mixtures from data is comparatively fast for the complex task it per...

686 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context ...frequently used class of algorithms involves measuring all pairwise interactions between the variables, and then constructing a network that models the strongest of these pairwise interactions (e.g. (Chow and Liu, 1968; Sahami, 1996; Friedman et al., 1999)). We use such an algorithm in this paper to automatically learn the structures of our Bayesian networks. In order to measure the pairwise interactions between th...
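The pairwise-interaction strategy cited here goes back to Chow and Liu's tree algorithm. Below is a minimal sketch for discrete data, using empirical mutual information as the pairwise weight and a Kruskal-style maximum-weight spanning tree; the paper's actual mix-net structure learner is more elaborate, and the three toy columns are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) between two discrete data columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p_joint = c / n
        # p_joint / (p_x * p_y) == p_joint * n * n / (count_x * count_y)
        mi += p_joint * math.log(p_joint * n * n / (px[x] * py[y]))
    return mi

def chow_liu_edges(columns):
    """Chow-Liu: weight every variable pair by mutual information, then keep
    a maximum-weight spanning tree (Kruskal's algorithm with a union-find)."""
    k = len(columns)
    pairs = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i in range(k) for j in range(i + 1, k)),
        reverse=True)
    parent = list(range(k))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a
    tree = []
    for _, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:          # keep the edge only if it joins two components
            parent[ri] = rj
            tree.append((i, j))
    return tree

# X1 copies X0 exactly; X2 is only weakly related, so the strongest edge is 0-1.
x0 = [0, 0, 1, 1, 0, 1, 0, 1]
x1 = x0[:]
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
edges = chow_liu_edges([x0, x1, x2])
```

For continuous or mixed variables, the paper replaces these discrete mutual-information weights with scores computed from fitted pairwise mixtures.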

637 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...s also easy to modify our greedy network-learning algorithm to learn networks for classification tasks; the resulting networks would be similar in structure to those generated by previous algorithms (Friedman et al., 1997, 1998; Sahami, 1996), but with more flexible parameterizations. While it is possible to perform exact inference in some kinds of networks modeling continuous values (e.g. (Driver and Morrell, 1995; A...

343 | Estimating continuous distributions in bayesian classifiers
- John, Langley
- 1995
Citation Context ...y assumed to be discrete; however, continuous variables have been handled in the past by using Gaussians or kernel density estimators for the conditional distributions of continuous variables (e.g., (John & Langley, 1995)). A recently developed type of classifier, Tree Augmented Naive Bayes (TAN) (Friedman et al., 1997), augments the network structure of Naive Bayes with additional arcs between the non-target variable...

306 | Introduction to Data Compression
- Sayood
- 1996
Citation Context ...n the data be compressed if we are willing to accept some given average loss of accuracy in the reconstruction? Lossily compressing values using a Gaussian model is a well-studied problem (see, e.g. (Sayood, 1996)). How do we lossily compress values coming from a mixture of Gaussians? One obvious approach would be to encode each point as follows. First, we calculate the likelihood with which it came from each...

276 | Graphical models for machine learning and digital communication
- Frey
- 1998
Citation Context ...opular and powerful methods for data compression such as arithmetic coding (see, e.g., (Witten et al., 1987)) rely on explicit probabilistic models of the data they are compressing. Recent research ((Frey, 1998), (Davies & Moore, 1999)) has shown that using automatically learned Bayesian networks for these models can result in compression ratios dramatically better than those achievable by gzip or bzip2, wh...

247 | Learning Bayesian Networks with Local Structure. Learning and Inference in Graphical Models
- Friedman, Goldszmidt
- 1998
Citation Context ...s, however. Another possibility would be to use decision trees over the discrete variables rather than full lookup tables, a technique previously explored for Bayesian networks over discrete domains (Friedman & Goldszmidt, 1996). The Gaussian mixture learning algorithm we currently employ attempts to find a mixture maximizing the joint likelihood of all the variables in the mixture rather than a conditional likelihood. Since ...

201 | Learning Bayesian belief networks: An approach based on mdl principle
- Lam, Bacchus
- 1994
Citation Context ...th variations of one particular network structure-learning algorithm. There is a wide variety of structure-learning algorithms for discrete Bayesian networks (see, e.g., (Cooper and Herskovits, 1992; Lam and Bacchus, 1994; Heckerman et al., 1995; Friedman et al., 1999)), many of which could be employed when learning mix-nets. The quicker and dirtier of these algorithms might be applicable directly to learning mix-net ...

188 | Learning Bayesian network structure from massive datasets: The ”sparse candidate” algorithm
- Friedman, Nachman, et al.
- 1999
Citation Context ...s involves measuring all pairwise interactions between the variables, and then constructing a network that models the strongest of these pairwise interactions (e.g. (Chow and Liu, 1968; Sahami, 1996; Friedman et al., 1999)). We use such an algorithm in this paper to automatically learn the structures of our Bayesian networks. In order to measure the pairwise interactions between the variables, we start with an empty B...

177 | Pattern Classification and Scene Analysis
- Duda, Hart
- 1973
Citation Context ...s values. What sorts of models might we want A to return? One powerful type of model for representing probability density functions over small sets of variables is a Gaussian mixture model (see e.g. (Duda & Hart, 1973)). Let ~s_j represent the values that the j-th datapoint in the dataset D assigns to a variable set of interest ~S. In a Gaussian mixture model over ~S, we assume that the data are generated indepe...

162 | Learning Bayesian networks is NP-complete
- Chickering
- 1996
Citation Context ... best Bayesian network structure on our own, or at least find a “good” network structure? In general, finding the optimal Bayesian network structure with which to model a given dataset is NP-complete (Chickering, 1996), even when all the data is discrete and there are no missing values or hidden variables. A popular heuristic approach to finding networks that model discrete data well is to hillclimb over network s...

112 | Learning limited dependence Bayesian classifiers
- Sahami
- 1996
Citation Context ...s of algorithms involves measuring all pairwise interactions between the variables, and then constructing a network that models the strongest of these pairwise interactions (e.g. (Chow and Liu, 1968; Sahami, 1996; Friedman et al., 1999)). We use such an algorithm in this paper to automatically learn the structures of our Bayesian networks. In order to measure the pairwise interactions between the variables, w...

96 | Very fast EM-based mixture model clustering using multiresolution kd-trees
- Moore
- 1999
Citation Context ...nctions from data in low-dimensional continuous spaces. In particular, mixtures of Gaussians can be fitted to data very quickly using an accelerated EM algorithm that employs multiresolution kd-trees (Moore, 1999). In this paper, we propose a kind of Bayesian network in which low-dimensional mixtures of Gaussians over different subsets of the domain’s variables are combined into a coherent joint probability m...

81 | Efficient Locally Weighted Polynomial Regression Predictions
- Moore, Schneider, et al.
- 1997
Citation Context ...tely, each iteration of the basic algorithm described above is slow, since it requires an entire pass through the dataset. Instead, we use an accelerated EM algorithm in which multiresolution kd-trees (Moore et al., 1997) are used to dramatically reduce the computational cost of each iteration (Moore, 1999). We refer the interested reader to this previous paper (Moore, 1999) for details. An important remaining issue ...

79 | A general algorithm for approximate inference and its application to hybrid Bayes nets
- Koller, Lerner, et al.
- 1999
Citation Context ...cision trees over the discrete variables rather than full lookup tables, a technique previously explored for Bayesian networks over discrete domains (Friedman and Goldszmidt, 1996). In previous work (Koller et al., 1999) employing representations closely related to those employed in this paper, a combination of these two approaches has been explored briefly in order to represent potentials in junction trees; fur...

78 | The anchors hierarchy: Using the triangle inequality to survive high dimensional data
- Moore
- 2000
Citation Context ...starts with a small number of Gaussians and stochastically tries adding or deleting Gaussians as the EM algorithm progresses. The details of this algorithm are described in a separate paper (Sand and Moore, 2000). 2.2 HANDLING DISCRETE VARIABLES Suppose now that a set of variables ~S_i we wish to model includes discrete variables as well as continuous variables. Let ~Q_i be the discrete variables in ~S_i, and the continuo...

71 | Nonuniform dynamic discretization in hybrid networks
- Kozlov, Koller
- 1997
Citation Context ... only a model of its variable’s conditional distribution given its parents rather than a joint distribution. Note that there are better ways of discretizing real variables in Bayesian networks (e.g. (Kozlov and Koller, 1997; Monti and Cooper, 1998a)); the simple discretization algorithm discussed here and currently implemented for our experiments is certainly not state-of-the-art. 4.2 DATASETS AND RESULTS We tested the ...

57 | A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models
- Neal, Hinton
- 1998
Citation Context ... by adjusting the datapoints' estimated class distributions, and the M step increases it by adjusting the model parameters. This view justifies many variants of EM that may provide faster convergence (Neal & Hinton, 1998). Another approach to accelerating the EM algorithm for Gaussian mixture models is to take a single pass through the dataset while heuristically maintaining in memory a limited-size buffer of datap...

45 | Scaling EM (ExpectationMaximization) Clustering to Large Databases
- Bradley, Reina
- 1998
Citation Context ... within mix-nets as we do for the “competing” Single-Gaussian algorithm described in Section 4.1. Other methods for accelerating EM have also been developed in the past (e.g., (Neal and Hinton, 1998; Bradley et al., 1998)), some of which might be used in our Bayesian network-learning algorithm instead of or in addition to the accelerated EM algorithm employed in this paper. Our current method of handling discrete var...

30 | Bayesian network classification with continuous attributes: getting the best of both discretization and parametric fitting
- Friedman, Goldszmidt, et al.
- 1998

28 | Bayesian Network Classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...ssians or kernel density estimators for the conditional distributions of continuous variables (e.g., (John & Langley, 1995)). A recently developed type of classifier, Tree Augmented Naive Bayes (TAN) (Friedman et al., 1997), augments the network structure of Naive Bayes with additional arcs between the non-target variables, where each non-target variable is conditioned on at most one other non-target variable. This clas...

27 | Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context ...r the datapoint's discrete values, where each discrete value is assumed to be conditionally independent of the others given the class variable. Such an approach has been used previously in AutoClass (Cheeseman & Stutz, 1996). The EM acceleration algorithm exploited in this paper would have to be generalized to handle this class of models, however. Another possibility would be to use decision trees over the discrete vari...

26 | Learning Bayesian networks is NP-complete, Lect. Notes Stat
- Chickering
- 1996
Citation Context ...the best Bayesian network structure on our own, or at least find a “good” network structure? In general, finding the optimal Bayesian network structure with which to model a given dataset is NP-complete (Chickering, 1996), even when all the data is discrete and there are no missing values or hidden variables. A popular heuristic approach to finding networks that model discrete data well is to hillclimb over network str...

24 | Models and selection criteria for regression and classification
- Heckerman, Meek
- 1997
Citation Context ...nal distribution of each variable in the network given its parents can be modeled by conditionalizing another “embedded” Bayesian network that specifies the joint between the variable and its parents (Heckerman & Meek, 1997a). (Some theoretical issues concerning the interdependence of parameters in such models appear in (Heckerman & Meek, 1997a) and (Heckerman & Meek, 1997b).) Joint distributions formed by convolving a ...

22 | Discovering structure in continuous variables using bayesian networks
- Hofmann, Tresp
- 1996
Citation Context ...d Meek, 1997b).) Joint distributions formed by convolving a Gaussian kernel function with each of the datapoints have also been conditionalized for use in Bayesian networks over continuous variables (Hofmann and Tresp, 1995). 2.1 HANDLING CONTINUOUS VARIABLES Suppose for the moment that contains only continuous values. What sorts of models might we want to return? One powerful type of model for representing probability ...

19 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997

19 | A multivariate discretization method for learning Bayesian networks from mixed data
- Monti, Cooper
- 1998
Citation Context ...able’s conditional distribution given its parents rather than a joint distribution. Note that there are better ways of discretizing real variables in Bayesian networks (e.g. (Kozlov and Koller, 1997; Monti and Cooper, 1998a)); the simple discretization algorithm discussed here and currently implemented for our experiments is certainly not state-of-the-art. 4.2 DATASETS AND RESULTS We tested the previously described alg...

11 | Learning Hybrid Bayesian Networks from Data. Learning in Graphical Models
- Monti, Cooper
- 1998
Citation Context ...able’s conditional distribution given its parents rather than a joint distribution. Note that there are better ways of discretizing real variables in Bayesian networks (e.g. (Kozlov and Koller, 1997; Monti and Cooper, 1998a)); the simple discretization algorithm discussed here and currently implemented for our experiments is certainly not state-of-the-art. 4.2 DATASETS AND RESULTS We tested the previously described alg...

8 | Embedded Bayesian network classifiers
- Heckerman, Meek
- 1997
Citation Context ...al distribution of each variable in the network given its parents can be modeled by conditionalizing another “embedded” Bayesian network that specifies the joint between the variable and its parents (Heckerman and Meek, 1997a). (Some theoretical issues concerning the interdependence of parameters in such models appear in (Heckerman and Meek, 1997a) and (Heckerman and Meek, 1997b).) Joint distributions formed by convolvin...

7 | Bayesian Networks for Lossless Dataset Compression
- Davies, Moore
- 1999
Citation Context ... network structures restricted to those candidate parents is performed greedily. (We have also previously used a very similar algorithm for learning networks with which to compress discrete datasets (Davies and Moore, 1999).) 4 EXPERIMENTS 4.1 ALGORITHMS We compare the performance of the network-learning algorithm described above to the performance of four other algorithms, each of which is designed to be similar to ou...

5 | Inference Using Message Propagation and Topology Transformation in Vector Gaussian Continuous Networks
- Alag
- 1996
Citation Context ...7, 1998; Sahami, 1996), but with more flexible parameterizations. While it is possible to perform exact inference in some kinds of networks modeling continuous values (e.g. (Driver and Morrell, 1995; Alag, 1996)), general-purpose exact inference in arbitrarily-structured mix-nets with continuous variables may not be possible. However, inference in these networks can be performed via stochastic sampling meth...

4 | Implementation of Continuous Bayesian Networks Using Sums of Weighted Gaussians
- Driver, Morrell
- 1995
Citation Context ...tly investigated the use of complex continuous distributions within Bayesian networks; for example, weighted sums of Gaussians have been used to approximate conditional probability density functions (Driver and Morrell, 1995). Such complex distributions over continuous variables are usually quite computationally expensive to learn. This expense may not be too problematic if an appropriate Bayesian network structure is kn...

4 | Embedded Bayesian network classifiers
- Heckerman, Meek
- 1997
Citation Context ...nal distribution of each variable in the network given its parents can be modeled by conditionalizing another “embedded” Bayesian network that specifies the joint between the variable and its parents (Heckerman & Meek, 1997a). (Some theoretical issues concerning the interdependence of parameters in such models appear in (Heckerman & Meek, 1997a) and (Heckerman & Meek, 1997b).) Joint distributions formed by convolving a ...

4 | The Generalized CEM Algorithm
- Jebara, Pentland
- 1999
Citation Context ...ities, some of their representational power may be used inefficiently. The EM algorithm has recently been generalized to learn joint distributions specifically optimized for being used conditionally (Jebara and Pentland, 1999). If this modified EM algorithm can be accelerated in a manner similar to our current accelerated EM algorithm, it may result in significantly more accurate networks. Finally, further comparisons wit...

2 | Fast Structure Search for Gaussian Mixture Models. Submitted to Knowledge Discovery and Data Mining 2000
- Sand, Moore
- 2000
Citation Context ...ious number of Gaussians, runs the EM algorithm for a few more iterations, and then continues stochastically from there. The details of this algorithm are described in a separate forthcoming paper (Sand & Moore, 2000). 2.3 Handling discrete variables Suppose now that a set of variables ~S_i we wish to model includes discrete variables as well as continuous variables. Let ~Q_i be the discrete variables in ~S_i ...

1 | Bayesian Network Classification with Continuous Attributes: Getting the Best
- Friedman, Goldszmidt, et al.
- 1998
Citation Context ... classifier has been extended to handle continuous variables by representing each continuous variable in the network twice: once in a discretized form, and once in a simple conditional parametric form (Friedman et al., 1998). Our greedy network-learning algorithm can easily be modified to learn mix-net classifiers similar in structure to TAN classifiers. By raising our algorithm's MAXPARS parameter, it can also be used to ...