## Learning with mixtures of trees (2000)


### Download Links

- [www.ics.uci.edu]
- [www.ai.mit.edu]
- [www.cs.berkeley.edu]
- [jmlr.csail.mit.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Journal of Machine Learning Research

Citations: 117 (2 self)

### BibTeX

```bibtex
@ARTICLE{Meilă00learningwith,
  author  = {Marina Meilă and Michael I. Jordan},
  title   = {Learning with mixtures of trees},
  journal = {Journal of Machine Learning Research},
  year    = {2000},
  volume  = {1},
  pages   = {1--48}
}
```



### Abstract

This paper describes the mixtures-of-trees model, a probabilistic model for discrete multidimensional domains. Mixtures-of-trees generalize the probabilistic trees of Chow and Liu [6] in a different and complementary direction to that of Bayesian networks. We present efficient algorithms for learning mixtures-of-trees models in maximum likelihood and Bayesian frameworks. We also discuss additional efficiencies that can be obtained when data are “sparse,” and we present data structures and algorithms that exploit such sparseness. Experimental results demonstrate the performance of the model for both density estimation and classification. We also discuss the sense in which tree-based classifiers perform an implicit form of feature selection, and demonstrate a resulting insensitivity to irrelevant attributes.

### Citations

8843 | Introduction to Algorithms - Cormen, Leiserson, et al.
Citation Context: ...P_uv(x_u, x_v) log [P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v))], for u, v ∈ V, u ≠ v, (7) an operation that requires O(n² r²_MAX) operations. Third, we run a maximum-weight spanning tree (MWST) algorithm (see, e.g., [8]), using I_uv as the weight for edge (u, v), ∀u, v ∈ V. Such algorithms, which run in time O(n²), return a spanning tree that maximizes the total mutual information for edges included in the tree. ...
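
The context above describes the core of the Chow-Liu procedure: estimate pairwise mutual informations from data, then run a maximum-weight spanning tree with those values as edge weights. A minimal sketch (function and variable names are mine, not the paper's), using Kruskal's algorithm with union-find:

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, u, v):
    """Empirical mutual information I(X_u; X_v) over a list of discrete tuples."""
    n = len(data)
    pu, pv, puv = Counter(), Counter(), Counter()
    for x in data:
        pu[x[u]] += 1
        pv[x[v]] += 1
        puv[(x[u], x[v])] += 1
    # I = sum_{a,b} P(a,b) * log[ P(a,b) / (P(a) P(b)) ]
    return sum((c / n) * math.log(c * n / (pu[a] * pv[b]))
               for (a, b), c in puv.items())

def chow_liu_edges(data, n_vars):
    """Maximum-weight spanning tree (Kruskal + union-find) with MI edge weights."""
    weights = {(u, v): mutual_information(data, u, v)
               for u, v in combinations(range(n_vars), 2)}
    parent = list(range(n_vars))
    def find(i):                      # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    tree = []
    for (u, v), w in sorted(weights.items(), key=lambda kv: -kv[1]):
        ru, rv = find(u), find(v)
        if ru != rv:                  # adding (u, v) creates no cycle
            parent[ru] = rv
            tree.append((u, v))
    return tree
```

A simple sort gives the O(n² log n) behavior the snippet alludes to; the paper's accelerated variant avoids computing all n² weights up front.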

8613 | Maximum Likelihood from Incomplete Data via the EM Algorithm - Dempster, Laird, et al. - 1977
Citation Context: ...tions scales linearly with the number of trees in the mixture. 3.2 Learning of MT models: The expectation-maximization (EM) algorithm provides an effective approach to solving many learning problems [11, 30], and has been employed with particular success in the setting of mixture models and more general latent variable models [25, 24, 46]. In this section we show that EM also provides a natural approach ...
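
As a reminder of the EM recipe these contexts refer to, here is one EM iteration for a simple mixture model, a mixture of independent-Bernoulli (factorial) components. All names are illustrative, and this is not the paper's MT algorithm, which instead fits a Chow-Liu tree per component in the M-step:

```python
import math

def em_step(data, weights, thetas):
    """One EM iteration for a mixture of independent-Bernoulli components.
    weights[k] is the mixing proportion of component k; thetas[k][j] is
    P(x_j = 1 | component k). Returns updated (weights, thetas)."""
    K, d, N = len(weights), len(thetas[0]), len(data)
    # E-step: responsibilities resp[i][k] proportional to weights[k] * P_k(x_i),
    # computed in log space for numerical stability
    resp = []
    for x in data:
        logp = [math.log(weights[k]) + sum(
                    math.log(thetas[k][j] if x[j] else 1 - thetas[k][j])
                    for j in range(d))
                for k in range(K)]
        m = max(logp)
        p = [math.exp(lp - m) for lp in logp]
        s = sum(p)
        resp.append([pi / s for pi in p])
    # M-step: responsibility-weighted ML updates, clamped away from 0/1
    Nk = [sum(resp[i][k] for i in range(N)) for k in range(K)]
    new_weights = [Nk[k] / N for k in range(K)]
    new_thetas = [[min(max(sum(resp[i][k] * data[i][j] for i in range(N)) / Nk[k],
                           1e-6), 1 - 1e-6)
                   for j in range(d)] for k in range(K)]
    return new_weights, new_thetas
```

In the MT model the M-step replaces the per-component Bernoulli fits with one weighted Chow-Liu tree fit per mixture component, which is why the cost scales linearly in the number of trees.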

7347 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference - Pearl - 1988
Citation Context: ...duction: Probabilistic inference has become a core technology in AI, largely due to developments in graph-theoretic methods for the representation and manipulation of complex probability distributions [42]. Whether in their guise as directed graphs (Bayesian networks) or as undirected graphs (Markov random fields), probabilistic graphical models have a number of virtues as representations of uncertaint...

1141 | Graphical models - Lauritzen - 1996
Citation Context: ...A ⊥ B | C (15) In addition, a MTSS can be represented as a Bayesian network (Figure 4a), as a Markov random field (Figure 4b) and as a chain graph (Figure 5). Chain graphs were introduced by [28]; they represent a superclass of both Bayesian networks and Markov random fields. A chain graph contains both directed and undirected edges. While we generally consider problems in which the choice va...

1119 | A Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992

942 | Learning Bayesian networks: the combination of knowledge and statistical data - Heckerman, Geiger, et al. - 1995
Citation Context: ...nnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems for classification, prediction and density estimation [1, 15, 22, 23, 17, 36, 47]. Indeed, it is possible to view a wide variety of classical machine learning architectures as instances of graphical models and the graphical model framework provides a natural design procedure for e...

851 | Introduction to Graph Theory - West - 1996
Citation Context: ...enerally subquadratic. Moreover, random graph theory implies that if the distribution of the weight values is the same for all edges, then Kruskal's algorithm should take time proportional to n log n [52]. To verify this latter result, we conducted a set of Monte Carlo experiments, in which we ran the Kruskal algorithm on sets of random weights over domains of dimension up to n = 3000. For each n, 100...
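
The n log n behavior cited above arises because, with i.i.d. edge weights, Kruskal's algorithm typically completes the spanning tree after inspecting only the first O(n log n) edges of the sorted list. A small Monte Carlo sketch in that spirit (dimensions and trial counts are illustrative, not the paper's experiment):

```python
import random

def edges_examined(n, rng):
    """Run Kruskal on a complete graph with i.i.d. random weights and
    count how many edges are inspected before the tree is complete."""
    edges = [(u, v) for u in range(n) for v in range(u + 1, n)]
    rng.shuffle(edges)  # a random shuffle is equivalent to sorting i.i.d. weights
    parent = list(range(n))
    def find(i):        # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    joined, examined = 0, 0
    for u, v in edges:
        examined += 1
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            joined += 1
            if joined == n - 1:  # spanning tree complete
                break
    return examined

rng = random.Random(0)
for n in (50, 100, 200):
    avg = sum(edges_examined(n, rng) for _ in range(20)) / 20
    print(n, round(avg, 1))  # grows far slower than the n*(n-1)/2 total edges
```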

797 | A view of the EM algorithm that justifies incremental, sparse and other variants - Neal, Hinton - 1998
Citation Context: ...d dataset D the logarithm of the posterior Pr[Q|D] equals log Pr[Q] + Σ_{x∈D} log Q(x) (34) plus an additive constant. The EM algorithm can be adapted to maximize the log posterior for every fixed m [39]. Indeed, by comparing with equation (22) we see that the quantity to be maximized is now: E[log Pr[Q | x_{1,...,N}, z_{1,...,N}]] = log Pr[Q] + E[l_c(x_{1,...,N}, z_{1,...,N} | Q)]. (35) The prior term do...

771 | Statistical Methods for Speech Recognition - Jelinek - 1998
Citation Context: ...orithm provides an effective approach to solving many learning problems [11, 30], and has been employed with particular success in the setting of mixture models and more general latent variable models [25, 24, 46]. In this section we show that EM also provides a natural approach to the learning problem for the MT model. An important feature of the solution that we present is that it provides estimates for both...

748 | Hierarchical mixtures of experts and the EM algorithm - Jordan, Jacobs - 1994
Citation Context: ...orithm provides an effective approach to solving many learning problems [11, 30], and has been employed with particular success in the setting of mixture models and more general latent variable models [25, 24, 46]. In this section we show that EM also provides a natural approach to the learning problem for the MT model. An important feature of the solution that we present is that it provides estimates for both...

673 | Approximating Discrete Probability Distributions with Dependence Trees - Chow, Liu - 1968
Citation Context: ...r 11, 2000. Abstract: This paper describes the mixtures-of-trees model, a probabilistic model for discrete multidimensional domains. Mixtures-of-trees generalize the probabilistic trees of Chow and Liu [6] in a different and complementary direction to that of Bayesian networks. We present efficient algorithms for learning mixtures-of-trees models in maximum likelihood and Bayesian frameworks. We also disc...

654 | Probabilistic Networks and Expert Systems - Cowell, Dawid, et al. - 1999
Citation Context: ...Even more importantly, the graph-theoretic framework has allowed for the development of general inference algorithms, which in many cases provide orders of magnitude speedups over brute-force methods [9, 48]. These virtues have not gone unnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems for classification, pre...

623 | Bayesian network classifiers - Friedman, Geiger - 1997
Citation Context: ...nnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems for classification, prediction and density estimation [1, 15, 22, 23, 17, 36, 47]. Indeed, it is possible to view a wide variety of classical machine learning architectures as instances of graphical models and the graphical model framework provides a natural design procedure for e...

592 | Fibonacci Heaps and Their Uses in Improved Network Optimization Algorithms - Fredman, Tarjan - 1987
Citation Context: ...al informations. What we aim to create is a mechanism that will output the edges (u, v) in decreasing order of their mutual information. We shall set up this mechanism in the form of a Fibonacci heap [12] called vheap that contains an element for each u ∈ V, represented by the edge with the highest mutual ... Algorithm acCL(D). Input: Variable set V of size n; Dataset D = {xlist_i, i = 1, ..., N} Proc...
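
A simplified version of the mechanism described above can be sketched with Python's binary-heap module standing in for the Fibonacci heap (and a plain heap over all edges, rather than the paper's per-vertex vheap):

```python
import heapq

def edges_by_decreasing_mi(mi):
    """Yield ((u, v), I_uv) pairs in decreasing order of mutual information.
    mi: dict mapping (u, v) edge tuples to their mutual information I_uv.
    heapq is a min-heap, so weights are negated to pop the largest first."""
    heap = [(-w, uv) for uv, w in mi.items()]
    heapq.heapify(heap)
    while heap:
        w, uv = heapq.heappop(heap)
        yield uv, -w

mi = {(0, 1): 0.7, (0, 2): 0.05, (1, 2): 0.3}
print(list(edges_by_decreasing_mi(mi)))
# → [((0, 1), 0.7), ((1, 2), 0.3), ((0, 2), 0.05)]
```

The paper's structure is lazier: it keeps one candidate edge per vertex and refills on demand, which matters when most pairwise counts are zero.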

528 | Learning Probabilistic Relational Models - Friedman, Getoor, et al. - 1999
Citation Context: ...nnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems for classification, prediction and density estimation [1, 15, 22, 23, 17, 36, 47]. Indeed, it is possible to view a wide variety of classical machine learning architectures as instances of graphical models and the graphical model framework provides a natural design procedure for e...

499 | Bayesian classification (AutoClass): Theory and results, Chapter 6 - Cheeseman, Stutz - 1996
Citation Context: ...stering and compression problems, the MT model makes contact with the large and active literature on mixture modeling. Let us briefly review some of the most salient connections. The AutoClass model [4] is a mixture of factorial distributions (MF), and its excellent cost/performance ratio motivates the MT model in much the same way as the Naive Bayes model motivates the TANB model in the classific...

311 | Stochastic Complexity in Statistical Inquiry - Rissanen - 1989
Citation Context: ...enalties to be proportional to the increase in the number of parameters caused by the addition of edge uv to the tree, β_uv = ½ (r_u − 1)(r_v − 1) log N (41), then a Minimum Description Length (MDL) [45] type of prior is implemented. In [22], in the context of learning Bayesian networks, the following prior is suggested: Pr[E] ∝ κ^Δ(E, E′) (42), where Δ() is a distance metric between Bayes net st...
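
The MDL-style penalty in equation (41) is straightforward to compute; a small helper (the function name is mine):

```python
import math

def mdl_edge_penalty(r_u, r_v, N):
    """MDL-style penalty for adding edge (u, v), per Eq. (41): half the
    increase in parameter count, (r_u - 1)(r_v - 1), times log N.
    r_u, r_v are the arities of the two variables; N is the sample size."""
    return 0.5 * (r_u - 1) * (r_v - 1) * math.log(N)

print(mdl_edge_penalty(2, 2, 1000))  # two binary variables, ≈ 3.454
```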

299 | Context-specific independence in Bayesian networks - Boutilier, Friedman, et al. - 1996
Citation Context: ...ility to represent context-specific independencies: situations in which subsets of variables exhibit certain conditional independencies for some, but not all, values of a conditioning variable. (See [2] for further work on context-specific independence). By making context-specific independencies explicit as multiple collections of edges, one can obtain (a) more parsimonious representations of joint ...

229 | The wake-sleep algorithm for unsupervised neural networks - Hinton, Dayan, et al. - 1995

225 | The EM algorithm for graphical association models with missing data - Lauritzen - 1995

224 | The Bayesian structural EM algorithm - Friedman - 1998
Citation Context: ...work on learning statistical models over large domains that focus on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm [14, 16] as well as the A-D trees of [37]. The latter are closely related to our representation of the pairwise marginals P_uv by counts. In fact, our representation can be viewed as a "reduced" A-D tree that...

179 | On structuring probabilistic dependencies in stochastic language modelling - Ney, Essen, et al. - Computer Speech and Language, 1994
Citation Context: ...ges with β_uv = β > 0. To regularize model parameters we use a Dirichlet prior derived from the pairwise marginal distributions for the data set. This approach is known as smoothing with the marginal [15, 40]. In particular, we set the parameter N′_k characterizing the Dirichlet prior for tree k by apportioning a fixed smoothing coefficient equally between the n variables and in an amount that is inverse...

173 | Probabilistic independence networks for hidden Markov probability models - Smyth, Heckerman, et al. - 1997
Citation Context: ...ssical machine learning architectures as instances of graphical models and the graphical model framework provides a natural design procedure for exploring architectural variations on classical themes [3, 49]. As in many machine learning problems, the problem of learning a graphical model from data can be divided into the problem of parameter learning and the problem of structure learning. Much progress h...

146 | Independence properties of directed Markov fields - Lauritzen, Dawid, et al. - 1990
Citation Context: ...d side of (4) shows that each edge (u, v) increases the number of parameters by (r_u − 1)(r_v − 1). The set of conditional independencies associated with a tree distribution are readily characterized [29]. In particular, two subsets A, B ⊂ V are independent given C ⊂ V if C intersects every path (ignoring the direction of edges in the directed case) between u and v for all u ∈ A and v ∈ B. ...

127 | Cached Sufficient Statistics for Efficient Machine Learning with Large Datasets - Moore, Lee - 1998
Citation Context: ...the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm (Friedman, 1998; Friedman & Getoor, 1999) as well as A-D trees (Moore & Lee, 1998). The latter are closely related to our representation of the pairwise marginals Puv by counts. In fact, our representation can be viewed as a “reduced” A-D tree that stores only pairwise statistics....

109 | Probability propagation - Shafer, Shenoy - 1990
Citation Context: ...Even more importantly, the graph-theoretic framework has allowed for the development of general inference algorithms, which in many cases provide orders of magnitude speedups over brute-force methods [9, 48]. These virtues have not gone unnoticed by researchers interested in machine learning, and graphical models are being widely explored as the underlying architectures in systems for classification, pre...

102 | algorithms for ML factor analysis - Rubin, Thayer, et al. - 1982
Citation Context: ...ubin, 1977; MacLachlan & Bashford, 1988), and has been employed with particular success in the setting of mixture models and more general latent variable models (Jordan & Jacobs, 1994; Jelinek, 1997; Rubin & Thayer, 1983). In this section we show that EM also provides a natural approach to the learning problem for the MT model. An important feature of the solution that we present is that it provides estimates for bot...

93 | Knowledge representation and inference in similarity networks and Bayesian multinets - Geiger, Heckerman - 1996
Citation Context: ...eters and structure of such models. One can also consider probabilistic mixtures of more general graphical models; indeed, the general case is the Bayesian multinet introduced by Geiger and Heckerman [20]. The Bayesian multinet is a mixture model in which each mixture component is an arbitrary graphical model. The advantage of Bayesian multinets over more traditional graphical models is the ability to...

70 | Training knowledge-based neural networks to recognize genes in DNA sequences - Noordewier, Towell, et al. - 1991
Citation Context: ...T). The dataset consists of 3,175 labeled examples. We ran two series of experiments comparing the MT model with competing models. In the first series of experiments, we compared to the results of [41], who used multilayer neural networks and knowledge-based neural networks for the same task. We replicated these authors' choice of training set size (2000) and test set size (1175) and sampled new tr...

65 | Learning belief networks from data: An information theory based approach - Cheng, Bell, et al. - 2002
Citation Context: ...ut is unable to discover the independence in the double digit set. 5.2.2 The ALARM network: Our second set of density estimation experiments features the ALARM network as the data generating mechanism [22, 5]. This Bayesian network was constructed from expert ... Table 3: Density estimation results for the mixtures of trees and other models on the ALARM data set. Training set size N_train = 10,000. Averag...

53 | Competition and multiple cause models - Dayan, Zemel - 1995
Citation Context: ...g model was 0.41 bits per example. 5.1.2 Random bars, small data set: The "bars" problem is a benchmark structure learning problem for unsupervised learning algorithms in the neural network literature [10]. The domain V is the l × l square of binary variables depicted in Figure 8. The data are generated in the following manner: first, one flips a fair coin to decide whether to generate horizontal or vert...

51 | The FERET evaluation methodology for face-recognition algorithms - Philips, Moon, et al.
Citation Context: ...trees over the mixture of factorials for this data set. 5.2.3 The FACES dataset: For the third density estimation experiment, we used a subset of 576 images from the normalized face images dataset of [43]. These images were downsampled to 48 variables (pixels) and 5 gray levels. We divided the data randomly into N_train = 500 and N_test = 76 examples; of the 500 training examples, 50 were left out as ...

33 | Latent variable models - Bishop - 1999

30 | An entropy-based learning algorithm of Bayesian conditional trees - Geiger - 1992
Citation Context: ...herwise disconnected set of nodes representing the input variables (i.e., attributes). Introducing additional edges between the input variables yields the Tree Augmented Naive Bayes (TANB) classifier [15, 19]. These authors also considered a less constrained model in which different patterns of edges were allowed for each value of the class node; this is formally identical to the Chow and Liu proposal. If...

28 | Bayesian Network Classification with Continuous Attributes: Getting the Best - Friedman, Goldszmidt, et al. - 1998
Citation Context: ...hese two guises. In the classification setting, the MT model builds on the seminal work on tree-based classifiers by Chow and Liu [6], and on recent extensions due to Friedman, Geiger, and Goldszmidt [15, 18]. Chow and Liu proposed to solve M-way classification problems by fitting a separate tree to the observed variables in each of the M classes, and classifying a new data point by choosing the class hav...

26 | Estimating dependency structure as a hidden variable - Meila, Jordan, et al. - 1997
Citation Context: ...ss label as simply another input variable. This yields a more discriminative approach to classification in which all of the training data are pooled for the purposes of training the model (Section 5, Meilă & Jordan, 1998). The choice variable remains hidden, yielding a mixture model for each class. This is similar in spirit to the “mixture discriminant analysis” model of Hastie and Tibshirani (1996), where a mixture ...

24 | Does the wake-sleep algorithm produce good density estimators - Frey, Hinton, et al. - 1995
Citation Context: ...mensions. The other dataset ("pairs") contains 128-dimensional vectors representing randomly paired digit images. These datasets, as well as the training conditions that we employed, are described in [13] (see Figure 12 for an example of a digit pair). The training, validation and test sets contained 6000, 2000, and 5000 exemplars respectively. Each model was trained on the training set until the like...

23 | Constructing Bayesian finite mixture models by the EM algorithm - Kontkanen, Myllymäki, et al. - 1996
Citation Context: ...sification setting. (A factorial distribution is a product of factors each of which depends on exactly one variable). Kontkanen et al. study a MF in which a hidden variable is used for classification [26]; this approach was extended by Monti and Cooper [36]. The idea of learning tractable but simple belief networks and superimposing a mixture to account for the remaining dependencies was developed ind...

21 | A guide to the literature on learning graphical models - Buntine - 1995
Citation Context: ...ssical machine learning architectures as instances of graphical models and the graphical model framework provides a natural design procedure for exploring architectural variations on classical themes [3, 49]. As in many machine learning problems, the problem of learning a graphical model from data can be divided into the problem of parameter learning and the problem of structure learning. Much progress h...

17 | Learning with mixtures of trees - Meila-Predoviciu - 1999
Citation Context: ...all maximum of Iuv can then be extracted as the maximum of the Fibonacci heap. For the lists V0(u) (that contain many elements) we use their “complements” V0(u) = {v ∈ V, v ≺ u, Nuv > 0}. It can be shown [34] that the computation of the cooccurrence counts and the construction of the lists C(u), V0(u), u ∈ V takes an amount of time proportional to the number of cooccurrences NC, up to a logarithmic facto...

12 | A mean field learning algorithm for unsupervised neural networks - Saul, Jordan - 1999

9 | Mixture Models: Inference and Application to Clustering - MacLachlan, Basford - 1988
Citation Context: ...paper we consider an alternative upgrade path. Inspired by the success of mixture models in providing simple, effective generalizations of classical methods in many simpler density estimation settings [30], we consider a generalization of tree distributions known as the mixtures-of-trees (MT) model. As suggested in Figure 1, the MT model involves the probabilistic mixture of a set of graphical componen...

9 | Cached sufficient statistics for efficient machine learning with large datasets - Moore, Lee - 1998
Citation Context: ...ver large domains that focus on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm [14, 16] as well as the A-D trees of [37]. The latter are closely related to our representation of the pairwise marginals P_uv by counts. In fact, our representation can be viewed as a "reduced" A-D tree that stores only pairwise statistics....

7 | algorithms for ML factor analysis, Psychometrika 47 - Rubin, Thayer, et al. - 1982
Citation Context: ...orithm provides an effective approach to solving many learning problems [11, 30], and has been employed with particular success in the setting of mixture models and more general latent variable models [25, 24, 46]. In this section we show that EM also provides a natural approach to the learning problem for the MT model. An important feature of the solution that we present is that it provides estimates for both...

6 | The DELVE manual. http://www.cs.utoronto.ca/~delve - Rasmussen, Neal, et al. - 1996
Citation Context: ...udied the classification performance of the MT model in the domain of DNA SPLICE-junctions. The domain consists of 60 variables, representing a sequence of DNA bases, and an additional class variable [44]. The task is to determine if the middle of the sequence is a splice junction and what is its type. Splice junctions are of two types: exonintron (EI) represents the end of an exon and the beginning o...

5 | Efficient learning using constrained sufficient statistics - Friedman, Getoor - 1999
Citation Context: ...istical models over large domains that focus on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm (Friedman, 1998; Friedman & Getoor, 1999) as well as A-D trees (Moore & Lee, 1998). The latter are closely related to our representation of the pairwise marginals Puv by counts. In fact, our representation can be viewed as a “reduced” A-D t...

3 | Discriminant analysis by mixture modeling - Hastie, Tibshirani - 1996
Citation Context: ...(see Section 5 and [33]). The choice variable remains hidden, yielding a mixture model for each class. This is similar in spirit to the "mixture discriminant analysis" model of Hastie and Tibshirani [21], where a mixture of Gaussians is used for each class in a multiway classification problem. In the setting of density estimation, clustering and compression problems, the MT model makes contact with t...

3 | A Bayesian network classifier that combines a finite mixture model and a naive Bayes model - Monti, Cooper - 1998

3 | Learning mixtures of Bayes networks - Thiesson, Meek, et al. - 1997
Citation Context: ...Cooper [36]. The idea of learning tractable but simple belief networks and superimposing a mixture to account for the remaining dependencies was developed independently of our work by Thiesson et al. [50], who studied mixtures of Gaussian belief networks. Their work interleaves EM parameter search with Bayesian model search in a heuristic but general algorithm. ...

2 | Efficient learning using constrained sufficient statistics - Friedman, Getoor - 1999
Citation Context: ...work on learning statistical models over large domains that focus on the efficient computation and usage of the relevant sufficient statistics. Work in this direction includes the structural EM algorithm [14, 16] as well as the A-D trees of [37]. The latter are closely related to our representation of the pairwise marginals P_uv by counts. In fact, our representation can be viewed as a "reduced" A-D tree that...