## Estimating dependency structure as a hidden variable (1998)

### Download Links

- [historical.ncstrl.org]
- [publications.ai.mit.edu]
- [cbcl.mit.edu]
- [ftp.ai.mit.edu]
- DBLP

### Other Repositories/Bibliography

Venue: NIPS

Citations: 26 (6 self)

### BibTeX

@INPROCEEDINGS{Meilă98estimatingdependency,

author = {Marina Meilă and Michael I. Jordan and Quaid Morris},

title = {Estimating dependency structure as a hidden variable},

booktitle = {In NIPS},

year = {1998},

pages = {584--590},

publisher = {MIT Press}

}

### Abstract

This publication can be retrieved by anonymous ftp to publications.ai.mit.edu. This paper introduces a probability model, the mixture of trees, that can account for sparse, dynamically changing dependence relationships. We present a family of efficient algorithms, based on the EM and the Minimum Spanning Tree algorithms, that learn mixtures of trees in the ML framework. The method can be extended to take priors into account and, for a wide class of priors that includes the Dirichlet and the MDL priors, it preserves its computational efficiency. Experimental results demonstrate the excellent performance of the new model both in density estimation and in classification. Finally, we show that a single tree classifier acts like an implicit feature selector, thus making the classification performance insensitive to irrelevant attributes.
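The tree-fitting step that the abstract pairs with EM can be illustrated with a minimal sketch in the style of Chow and Liu: score every pair of variables by empirical mutual information, then keep a maximum-weight spanning tree. The function names, the toy data, and the optional `weights` argument (standing in for per-sample posterior responsibilities that an E step would supply) are assumptions made for illustration, not the authors' code.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(data, u, v, weights=None):
    """Weighted empirical mutual information between columns u and v."""
    if weights is None:
        weights = [1.0] * len(data)
    total = sum(weights)
    joint, pu, pv = Counter(), Counter(), Counter()
    for row, w in zip(data, weights):
        joint[(row[u], row[v])] += w / total
        pu[row[u]] += w / total
        pv[row[v]] += w / total
    return sum(p * math.log(p / (pu[a] * pv[b]))
               for (a, b), p in joint.items() if p > 0)


def chow_liu_edges(data, weights=None):
    """Maximum-weight spanning tree over the variables (Kruskal, union-find)."""
    n_vars = len(data[0])
    scored = sorted(
        ((mutual_information(data, u, v, weights), u, v)
         for u, v in combinations(range(n_vars), 2)),
        reverse=True,
    )
    parent = list(range(n_vars))

    def find(i):
        # Follow parent pointers to the component root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = []
    for _, u, v in scored:
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so no cycle is created
            parent[ru] = rv
            edges.append((u, v))
    return edges


# Toy data (hypothetical): column 1 copies column 0, column 2 is unrelated.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
tree = chow_liu_edges(data)
```

In a mixture, one would presumably call `chow_liu_edges` once per component, passing that component's responsibilities as `weights`; with `weights=None` the sketch reduces to fitting a single tree.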

### Citations

9054 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...ctive. 2 3 THE BASIC ALGORITHM: ML FITTING OF MIXTURES OF TREES This section will show how a mixture of trees can be fit to an observed dataset in the Maximum Likelihood paradigm via the EM algorithm =-=[3]-=-. The observations are denoted by {x 1 ,x 2 , ..., x N }; the corresponding values of the structure variable are {z i ,i=1,... N}. Following a usual EM procedure for mixtures, the Expectation (E) step... |

7493 | Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ..., have been investigated recently by [9]. Work on fitting a tree to a distribution in a Maximum-Likelihood (ML) framework has been pioneered by Chow and Liu [2] and was extended to polytrees by Pearl =-=[13]-=- and to mixtures of trees with observed structure variable by Geiger [6] and Friedman [5]. This work presents efficient algorithms for learning mixture of trees models with unknown or hidden structure... |

953 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...hlet prior over a tree can be represented as a table of fictitious marginal probabilities $P'^{k}_{uv}$ for each pair u, v of variables plus an equivalent sample size $N'$ that gives the strength of the prior =-=[7]-=-. It is now straightforward to maximize the a-posteriori probability of a tree: one has to replace the marginals $P^k_{uv}$ in step M2 by $\tilde{P}^k_{uv} = (N P^k_{uv} + N' P'^{k}_{uv})/(N + N')$ (14). The Dirichlet p... |
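Formula (14) in the context above is a weighted average of the empirical pairwise marginal and the prior's fictitious marginal, with the equivalent sample size acting as the prior's weight. A small numeric sketch follows; the function name, the dictionary representation of a marginal table, and the example numbers are all assumptions for illustration.

```python
def map_marginal(p_data, p_prior, n, n_prior):
    """Blend an empirical pairwise marginal with a prior table, as in formula (14).

    p_data and p_prior map (value_u, value_v) cells to probabilities;
    n is the sample size and n_prior the equivalent sample size of the prior.
    """
    cells = set(p_data) | set(p_prior)
    return {
        cell: (n * p_data.get(cell, 0.0) + n_prior * p_prior.get(cell, 0.0))
        / (n + n_prior)
        for cell in cells
    }


# Hypothetical numbers: 8 observations, a uniform prior worth 2 observations.
p_data = {(0, 0): 0.5, (1, 1): 0.5}
p_prior = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
smoothed = map_marginal(p_data, p_prior, n=8, n_prior=2)
```

Cells never seen in the data receive a small positive probability, and the result still sums to one because both inputs do.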

685 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
Citation Context ...ributions, a subclass of tree distributions, have been investigated recently by [9]. Work on fitting a tree to a distribution in a Maximum-Likelihood (ML) framework has been pioneered by Chow and Liu =-=[2]-=- and was extended to polytrees by Pearl [13] and to mixtures of trees with observed structure variable by Geiger [6] and Friedman [5]. This work presents efficient algorithms for learning mixture of t... |

184 | On structuring probabilistic dependency in stochastic language modeling, Computer Speech & Language 8
- Ney
- 1994
Citation Context ...the pairwise marginals for the whole dataset $P^{total}_{uv}$ and replaces the marginals $P^k_{uv}$ by $\tilde{P}^k_{uv} = (1-\alpha) P^k_{uv} + \alpha P^{total}_{uv}$, $0 < \alpha < 1$ (16). This method and several variations thereof are discussed in =-=[11]-=-. Its effect is to give a small probability weight to unseen instances and to draw the components closer to each other, thereby reducing the effective value of m. For the method to be effective in pra... |
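Assuming formula (16) is the convex combination $\tilde{P}^k_{uv} = (1-\alpha) P^k_{uv} + \alpha P^{total}_{uv}$ of a component's pairwise marginal and the whole-dataset marginal, its effect on unseen cells can be sketched as follows. The function name, table representation, and numbers are hypothetical.

```python
def smooth_toward_total(p_component, p_total, alpha):
    """Shrink a component's pairwise marginal toward the whole-dataset marginal."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    cells = set(p_component) | set(p_total)
    return {
        cell: (1.0 - alpha) * p_component.get(cell, 0.0)
        + alpha * p_total.get(cell, 0.0)
        for cell in cells
    }


# Hypothetical tables: the component saw only (0, 0); the full dataset also has (1, 1).
p_component = {(0, 0): 1.0}
p_total = {(0, 0): 0.5, (1, 1): 0.5}
smoothed_16 = smooth_toward_total(p_component, p_total, alpha=0.1)
```

The cell the component never observed ends up with probability 0.05 rather than zero, which matches the stated effect of giving a small probability weight to unseen instances.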

79 | Building Classifiers Using Bayesian Networks
- Friedman, Goldszmidt
- 1996
Citation Context ...ximum-Likelihood (ML) framework has been pioneered by Chow and Liu [2] and was extended to polytrees by Pearl [13] and to mixtures of trees with observed structure variable by Geiger [6] and Friedman =-=[5]-=-. This work presents efficient algorithms for learning mixture of trees models with unknown or hidden structure variable. The following section introduces the model; then, section 3 develops the basic... |

70 | Training knowledge-based neural networks to recognize genes in DNA sequences
- Noordewier, Towell, et al.
- 1991
Citation Context ...two series of experiments. A third experiment involving the SPLICE data set will be described later. For the first series, we compared our model’s performance against the reported results of [16] and =-=[12]-=- who used multilayer neural networks and knowledge-based neural networks for the same task. The sizes of the training set and of the test set are reproduced from the above cited papers; they are 2000 ... |

57 | Interpretation of Artificial Neural Networks
- Towell, Shavlik
- 1992
Citation Context ...erformed two series of experiments. A third experiment involving the SPLICE data set will be described later. For the first series, we compared our model’s performance against the reported results of =-=[16]-=- and [12] who used multilayer neural networks and knowledge-based neural networks for the same task. The sizes of the training set and of the test set are reproduced from the above cited papers; they ... |

32 | An entropy-based learning algorithm of Bayesian conditional trees
- Geiger
- 1992
Citation Context ...tribution in a Maximum-Likelihood (ML) framework has been pioneered by Chow and Liu [2] and was extended to polytrees by Pearl [13] and to mixtures of trees with observed structure variable by Geiger =-=[6]-=- and Friedman [5]. This work presents efficient algorithms for learning mixture of trees models with unknown or hidden structure variable. The following section introduces the model; then, section 3 d... |

24 | Does the wake-sleep algorithm produce good density estimators?
- Frey, Hinton, et al.
- 1995
Citation Context ... digit images. The training, validation and test set contained 6000, 2000, and 5000 exemplars respectively. The data sets, the training conditions and the algorithms we compared with are described in =-=[4]-=-. We tried mixtures of 16, 32, 64 and 128 trees, fitted by the basic algorithm. The training set was used to fit the model parameters and the validation set to determine when EM has converged. The EM ... |

23 | Constructing Bayesian finite mixture models by the EM algorithm
- Kontkanen, Myllymäki, et al.
- 1996
Citation Context ...tion space, in the case of continuous variables overlaps with the space of mixtures of trees. Mixtures of factorial distributions, a subclass of tree distributions, have been investigated recently by =-=[9]-=-. Work on fitting a tree to a distribution in a Maximum-Likelihood (ML) framework has been pioneered by Chow and Liu [2] and was extended to polytrees by Pearl [13] and to mixtures of trees with obser... |

10 | Building classifiers using Bayesian networks
- Friedman, Goldszmidt
- 1996
Citation Context ...mum-Likelihood (ML) framework has been pioneered by Chow and Liu [1] and was extended to polytrees by Pearl [10] and to mixtures of trees with observed structure variable by Geiger [5] and Friedman =-=[4]-=-. This work presents efficient algorithms for learning mixture of trees models with unknown or hidden structure variable. The following section introduces the model; then, section 3 develops the basic a... |

6 | The DELVE manual. http://www.cs.utoronto.ca/~delve
- Rasmussen, Neal, et al.
- 1996
Citation Context ... contains no prior knowledge about the domain, the KBNN does. The second set of experiments pursued a comparison with benchmark experiments on the SPLICE data set that are part of the DELVE data base =-=[14]-=-. The DELVE benchmark uses subsets of the SPLICE database with 100, 200 and 400 examples for training. Testing is done on 1500 examples in all cases. The algorithms tested by DELVE and their performan... |

4 | Machine Learning Repository. ftp://ftp.ics.uci.edu/pub/machine-learning-databases
- Irvine
Citation Context ...d by picking the most likely value of the class variable given the other variables’ settings. We investigated the performance of mixtures of trees on four classification tasks from the UCI repository =-=[1]-=-. In the first experiment, the data set was the Australian database [1]. It has 690 examples each consisting of 14 attributes and a binary class variable. Six of the attributes (numbers 2, 3, 7, 10, 1... |

2 | Glass identification database
- German
- 1987

2 | Constructing Bayesian finite mixture models by the EM algorithm
- Kontkanen, Myllymäki, et al.
- 1994
Citation Context ...tion space, in the case of continuous variables overlaps with the space of mixtures of trees. Mixtures of factorial distributions, a subclass of tree distributions, have been investigated recently by =-=[8]-=-. Work on fitting a tree to a distribution in a Maximum-Likelihood (ML) framework has been pioneered by Chow and Liu [1] and was extended to polytrees by Pearl [10] and to mixtures of trees with obser... |

1 | Learning Bayesian networks: the combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...et prior over a tree can be represented as a table of fictitious marginal probabilities $P'^{k}_{uv}$ for each pair u, v of variables plus an equivalent sample size $N'$ that gives the strength of the prior =-=[8]-=-. It is straightforward to maximize the a-posteriori probability of a tree: one has to replace the marginals $P^k_{uv}$ in step M2 by $\tilde{P}^k_{uv} = (N P^k_{uv} + N' P'^{k}_{uv})/(N + N')$ (14) The above formula is... |

1 | Glass identification database
- German
Citation Context ...e any other variable. In the testing phase, a new instance was classified by picking the most likely value of the class variable given the other variables' settings. The first task used the Glass database =-=[6]-=-. The data set has 214 instances of 9-dimensional continuous valued vectors. The class variable has 6 values. The continuous variables were discretized in 4 uniform bins each. We tested mixtures with ... |

1 | Learning Bayesian networks: the combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ...et prior over a tree can be represented as a table of fictitious marginal probabilities $P'^{k}_{uv}$ for each pair u, v of variables plus an equivalent sample size $N'$ that gives the strength of the prior =-=[8]-=-. It is straightforward to maximize the a-posteriori probability of a tree: one has to replace the marginals $P^k_{uv}$ in step M2 by $\tilde{P}^k_{uv} = (N P^k_{uv} + N' P'^{k}_{uv})/(N + N')$ (14) The above formula ... |

1 | Does the wake-sleep algorithm produce good density estimators?
- Frey, Hinton, et al.
- 1996
Citation Context ... digit images. The training, validation and test set contained 6000, 2000, and 5000 exemplars respectively. The data sets, the training conditions and the algorithms we compared with are described in =-=[3]-=-. We tried mixtures of 16, 32, 64 and 128 trees, fitted by the basic algorithm. We showed only the best performance in the results table 2. The results are very encouraging: the mixture of trees is th... |