## Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning (2003)

Venue: Proceedings of the 20th International Conference on Machine Learning (ICML ’03)

Citations: 40 (6 self)

### BibTeX

@INPROCEEDINGS{Moore03optimalreinsertion:,

author = {Andrew Moore and Weng-keen Wong},

title = {Optimal reinsertion: A new search operator for accelerated and more accurate Bayesian network structure learning},

booktitle = {Proceedings of the 20th International Conference on Machine Learning (ICML ’03)},

year = {2003},

pages = {552--559},

publisher = {AAAI Press}

}

### Abstract

We show how a conceptually simple search operator called Optimal Reinsertion can be applied to learning Bayesian Network structure from data. On each step we pick a node called the target. We delete all arcs entering or exiting the target. We then find, subject to some constraints, the globally optimal combination of in-arcs and out-arcs with which to reinsert it. The heart of the paper is a new algorithm called ORSearch which allows each optimal reinsertion step to be computed efficiently on large datasets. Our empirical results compare Optimal Reinsertion against a highly tuned implementation of multi-restart hill climbing. The results typically show one to two orders of magnitude speed-up on a variety of datasets. They usually show better final results, both in terms of BDEU score and in modeling of future data drawn from the same distribution.

#### 1. Bayesian Network Structure Search

Given a dataset of R records and m categorical attributes, how can we find a Bayesian network structure that provides a good model of the data? Happily, the formulation of this question into a well-defined optimization problem is now fairly well understood (Heckerman et al., 1995; Cooper & Herskovits, 1992). However, finding the optimal solution is an NP-complete problem (Chickering, 1996a). The computational issues in performing heuristic search in this space are also severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g.
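The reinsertion step described in the abstract can be illustrated with a small brute-force sketch. This is not the paper's ORSearch algorithm, which exists precisely to avoid this kind of exhaustive enumeration; it only shows what "delete all arcs at the target, then pick the best combination of in-arcs and out-arcs" means. The function names (`node_score`, `is_acyclic`, `optimal_reinsertion`), the limits `max_parents` and `max_children`, and the use of a BIC-style score in place of BDEU are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations
from math import log


def node_score(data, child, parents, arity):
    """BIC-style node score (an assumed stand-in for BDEU)."""
    parents = tuple(sorted(parents))
    joint = Counter((tuple(r[p] for p in parents), r[child]) for r in data)
    marg = Counter(tuple(r[p] for p in parents) for r in data)
    loglik = sum(n * log(n / marg[pa]) for (pa, _), n in joint.items())
    n_params = arity[child] - 1
    for p in parents:
        n_params *= arity[p]
    return loglik - 0.5 * log(len(data)) * n_params


def is_acyclic(parents):
    """Depth-first check that the parent map describes a DAG."""
    state = {}  # 1 = on the current DFS path, 2 = finished

    def visit(v):
        if state.get(v) == 1:
            return False  # reached a node already on the path: cycle
        if state.get(v) == 2:
            return True
        state[v] = 1
        ok = all(visit(p) for p in parents[v])
        state[v] = 2
        return ok

    return all(visit(v) for v in parents)


def subsets(items, max_size):
    for k in range(max_size + 1):
        yield from combinations(items, k)


def optimal_reinsertion(data, parents, target, arity,
                        max_parents=2, max_children=2):
    """Exhaustively reinsert `target` with its best in-arcs and out-arcs."""
    # Step 1: delete every arc entering or leaving the target.
    base = {v: set(ps) - {target} for v, ps in parents.items()}
    base[target] = set()
    others = [v for v in base if v != target]

    best_gain, best_graph = float("-inf"), None
    # Step 2: try every small parent set and child set for the target.
    for new_parents in subsets(others, max_parents):
        for new_children in subsets(others, max_children):
            if set(new_parents) & set(new_children):
                continue  # a node cannot be both parent and child
            cand = {v: set(ps) for v, ps in base.items()}
            cand[target] = set(new_parents)
            for c in new_children:
                cand[c].add(target)
            if not is_acyclic(cand):
                continue
            # Only the target and its new children change score.
            gain = node_score(data, target, new_parents, arity)
            gain += sum(node_score(data, c, cand[c], arity)
                        - node_score(data, c, base[c], arity)
                        for c in new_children)
            if gain > best_gain:
                best_gain, best_graph = gain, cand
    return best_graph


# Toy data: attribute 1 copies attribute 0; attribute 2 is independent.
data = [(a, a, b) for a in (0, 1) for b in (0, 1)] * 10
arity = [2, 2, 2]
empty = {0: set(), 1: set(), 2: set()}
g = optimal_reinsertion(data, empty, target=1, arity=arity)
```

On this toy data the step should connect nodes 0 and 1 and leave node 2 isolated. The arc direction between 0 and 1 is a tie under a score-equivalent criterion such as BIC, so the sketch does not promise a particular orientation. The cost grows combinatorially with the number of attributes, which is the bottleneck the paper's ORSearch is designed to address.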

### Citations

2865 | UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context ... Datasets used. R = Number of records, m = number of attributes and AA = average arity of attributes. 3.2. Processing of the real datasets The empirical datasets were chosen from UCI Irvine datasets (Blake & Merz, 1998) that contained at least 10,000 records. Real-valued attributes were automatically converted to binary-valued categorical attributes by thresholding them at their median value (treating real-valued v...

1075 | A Bayesian method for the induction of probabilistic networks from data
- Cooper, Herskovits
- 1992
Citation Context ...ian network structure that provides a good model of the data? Happily, the formulation of this question into a well-defined optimization problem is now fairly well understood (Heckerman et al., 1995; Cooper & Herskovits, 1992). However, finding the optimal solution is an NP-complete problem (Chickering, 1996a). The computational issues in performing heuristic search in this space are also severe, even taking into account ...

903 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995
Citation Context ... how can we find a Bayesian network structure that provides a good model of the data? Happily, the formulation of this question into a well-defined optimization problem is now fairly well understood (Heckerman et al., 1995; Cooper & Herskovits, 1992). However, finding the optimal solution is an NP-complete problem (Chickering, 1996a). The computational issues in performing heuristic search in this space are also severe...

323 | Estimating the dimension of a model
- Schwarz
- 1978
Citation Context ... degree [...] that these parents predict the conditional distribution of i given the parents (while penalizing for model complexity). Examples of NodeScore functions that have been proposed are BIC (Schwartz, 1979), BD (Cooper & Herskovits, 1992), BDE (Heckerman et al., 1995) and BDEU (Buntine, 1991). The algorithms of this paper can be applied irrespective of the choice of NodeScore function. 1.1. Performing ...

303 | Learning in Embedded Systems
- Kaelbling
- 1993
Citation Context ...everal algorithms have been introduced that do this adaptively, with the algorithm dynamically determining from the data what sample size will be sufficient to very probably find a good answer, e.g. (Kaelbling, 1990; Maron & Moore, 1993; Hulten & Domingos, 2002; Pelleg & Moore, 2002). The most relevant recent example is (Hulten & Domingos, 2002) which learns Bayesian network structure from impressively massive d...

239 | The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks
- Beinlich, Suermondt, et al.
- 1989
Citation Context ...nth2 Synth3 Synth4 Figure 3. Synthetic datasets described in Section 3.1. R m AA adult 49K 15 7.7 Contributed to UCI by Ron Kohavi alarm 20K 37 2.8 Data generated from a standard Bayes Net benchmark (Beinlich et al., 1989). biosurv 150K 24 3.5 Anonymized, deidentified aggregate information about hospital admission rates covtype 150K 39 2.8 Contributed to UCI by Jock Blackard. con4 67K 43 3.0 Contributed to UCI by John...

183 | Theory refinement on Bayesian networks
- Buntine
- 1991
Citation Context ...arents (while penalizing for model complexity). Examples of NodeScore functions that have been proposed are BIC (Schwartz, 1979), BD (Cooper & Herskovits, 1992), BDE (Heckerman et al., 1995) and BDEU (Buntine, 1991). The algorithms of this paper can be applied irrespective of the choice of NodeScore function. 1.1. Performing the optimization The most common algorithm for optimizing DagScore(D) is a hill climbin...

180 | Learning of Bayesian network structure from massive datasets: The “sparse candidate” algorithm
- Friedman, Nachman, et al.
- 1999
Citation Context ...rch in this space are also severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, 1996b; Friedman & Goldszmidt, 1997; Xiang et al., 1997; Friedman et al., 1999; Elidan et al., 2002; Hulten & Domingos, 2002)), discussed in Section 4. Problem: From fully observed categorical data find an acyclic structure and tabular conditional probability tables (CPTs) that...

155 | Learning Bayesian networks is NP-complete
- Chickering
- 1996
Citation Context ...s question into a well-defined optimization problem is now fairly well understood (Heckerman et al., 1995; Cooper & Herskovits, 1992). However, finding the optimal solution is an NP-complete problem (Chickering, 1996a). The computational issues in performing heuristic search in this space are also severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, ...

129 | Learning equivalence classes of Bayesian-network structure
- Chickering
- 2002
Citation Context ...s question into a well-defined optimization problem is now fairly well understood (Heckerman et al., 1995; Cooper & Herskovits, 1992). However, finding the optimal solution is an NP-complete problem (Chickering, 1996a). The computational issues in performing heuristic search in this space are also severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, ...

120 | Cached sufficient statistics for efficient machine learning with large datasets
- Moore, Lee
- 1998
Citation Context ... tables in total, each needing O(R) work to construct naively. Constructing all these tables is a job suited for AD-search, introduced in (Moore & Schneider, 2002) and an extension of AD-trees (Moore & Lee, 1998). There is no space to review AD-search here, except to mention costs. Searching all contingency tables of dimension k would normally require $R\binom{m}{k}$ operations. In contrast, AD-search requires ...

119 | Learning belief networks in the presence of missing values and hidden variables
- Friedman
- 1997
Citation Context ...e equivalent of an Optimal Reinsertion operation exists. Structural EM. A very important problem is to learn Bayesian network structure in datasets where some attributes of some records are missing. (Friedman, 1997) and subsequent publications have pioneered an EM approach to this problem. The EM approach requires repeated Bayesian Network structure optimizations and we plan to apply Optimal Reinsertion to this...

101 | Hoeffding races: Accelerating model selection search for classification and function approximation
- Maron, Moore
- 1994
Citation Context ... have been introduced that do this adaptively, with the algorithm dynamically determining from the data what sample size will be sufficient to very probably find a good answer, e.g. (Kaelbling, 1990; Maron & Moore, 1993; Hulten & Domingos, 2002; Pelleg & Moore, 2002). The most relevant recent example is (Hulten & Domingos, 2002) which learns Bayesian network structure from impressively massive datasets using adaptiv...

45 | Sequential update of Bayesian networks structure
- Friedman, Goldszmidt
- 1997
Citation Context ... computational issues in performing heuristic search in this space are also severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, 1996b; Friedman & Goldszmidt, 1997; Xiang et al., 1997; Friedman et al., 1999; Elidan et al., 2002; Hulten & Domingos, 2002)), discussed in Section 4. Problem: From fully observed categorical data find an acyclic structure and tabular...

36 | Data perturbation for escaping local maxima in learning
- Elidan, Ninio, et al.
- 2002
Citation Context ...lso severe, even taking into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, 1996b; Friedman & Goldszmidt, 1997; Xiang et al., 1997; Friedman et al., 1999; Elidan et al., 2002; Hulten & Domingos, 2002)), discussed in Section 4. Problem: From fully observed categorical data find an acyclic structure and tabular conditional probability tables (CPTs) that optimize a Bayesian ...

29 | Mining complex models from arbitrarily large databases in constant time
- Hulten, Domingos
- 2002
Citation Context ...ng into account the numerous ingenious and effective innovations in recent years (e.g. (Chickering, 1996b; Friedman & Goldszmidt, 1997; Xiang et al., 1997; Friedman et al., 1999; Elidan et al., 2002; Hulten & Domingos, 2002)), discussed in Section 4. Problem: From fully observed categorical data find an acyclic structure and tabular conditional probability tables (CPTs) that optimize a Bayesian Network scoring criterion...

14 | Using Tarjan’s Red Rule for Fast Dependency Tree Construction. NIPS 15
- Pelleg, Moore
- 2002
Citation Context ... with the algorithm dynamically determining from the data what sample size will be sufficient to very probably find a good answer, e.g. (Kaelbling, 1990; Maron & Moore, 1993; Hulten & Domingos, 2002; Pelleg & Moore, 2002). The most relevant recent example is (Hulten & Domingos, 2002) which learns Bayesian network structure from impressively massive datasets using adaptive sampling. For massive data the sampling algor...

6 | The Edinburgh/Durham Southern Galaxy Catalogue - IX. The Galaxy Catalogue
- Nichol, Collins, et al.
- 2000
Citation Context ...ission rates covtype 150K 39 2.8 Contributed to UCI by Jock Blackard. con4 67K 43 3.0 Contributed to UCI by John Tromp edsgc 300K 24 2.0 Data on 300,000 galaxies from the Edinburgh-Durham Sky Survey (Nichol et al., 2000) synth2 25K 36 2.0 Generated from Figure 3 synth3 25K 36 2.0 Generated from Figure 3 synth4 25K 36 2.0 Generated from Figure 3 nursery 13K 9 3.6 Contributed to UCI by Marko Bohanec and Blaz Zupan let...

5 | Real-valued AllDimensions search: Low-overhead rapid searching over subsets of attributes
- Moore, Schneider
- 2002
Citation Context ... CPTs for each of the m target nodes, meaning $m\binom{m-1}{k}$ tables in total, each needing O(R) work to construct naively. Constructing all these tables is a job suited for AD-search, introduced in (Moore & Schneider, 2002) and an extension of AD-trees (Moore & Lee, 1998). There is no space to review AD-search here, except to mention costs. Searching all contingency tables of dimension k would normally require $R\binom{m}{k}$ op...