## Locally Minimax Optimal Predictive Modeling with Bayesian Networks

### BibTeX

```bibtex
@MISC{Silander_locallyminimax,
  author = {Tomi Silander and Teemu Roos and Petri Myllymäki},
  title  = {Locally Minimax Optimal Predictive Modeling with Bayesian Networks},
  year   = {}
}
```

### Abstract

We propose an information-theoretic approach for predictive modeling with Bayesian networks. Our approach is based on the minimax optimal Normalized Maximum Likelihood (NML) distribution, motivated by the MDL principle. In particular, we present a parameter learning method which, together with a previously introduced NML-based model selection criterion, provides a way to construct highly predictive Bayesian network models from data. The method is parameter-free and robust, unlike the currently popular Bayesian marginal likelihood approach, which has been shown to be sensitive to the choice of prior hyperparameters. Empirical tests show that the proposed method compares favorably with the Bayesian approach in predictive tasks.

### Citations

7072 | Probabilistic reasoning in intelligent systems: Networks of plausible inference
- Pearl
- 1988

Citation Context: ...be sensitive to the choice of prior hyperparameters. Empirical tests show that the proposed method compares favorably with the Bayesian approach in predictive tasks. 1 INTRODUCTION Bayesian networks (Pearl, 1988) are one of the most popular model classes for discrete vector-valued i.i.d. data. The popular Bayesian BDeu criterion (Heckerman, Geiger, & Chickering, 1995) for learning Bayesian network structures...

2321 | Estimating the dimension of a model
- Schwarz
- 1978

Citation Context: ...en reported to be very sensitive to the choice of prior hyper-parameters (Silander, Kontkanen, & Myllymäki, 2007). On the other hand, the general model selection criteria, AIC (Akaike, 1973) and BIC (Schwarz, 1978), are derived through asymptotics and their behavior is suboptimal for small sample sizes. Furthermore, it is not clear how to set the parameters... Appearing in Proceedings of the 12th International C...

1242 | Information theory and an extension of the maximum likelihood principle
- Akaike
- 1973

Citation Context: ...uctures has recently been reported to be very sensitive to the choice of prior hyper-parameters (Silander, Kontkanen, & Myllymäki, 2007). On the other hand, the general model selection criteria, AIC (Akaike, 1973) and BIC (Schwarz, 1978), are derived through asymptotics and their behavior is suboptimal for small sample sizes. Furthermore, it is not clear how to set the parameters...

905 | Learning Bayesian networks: The combination of knowledge and statistical data
- Heckerman, Geiger, et al.
- 1995

Citation Context: ...ian approach in predictive tasks. 1 INTRODUCTION Bayesian networks (Pearl, 1988) are one of the most popular model classes for discrete vector-valued i.i.d. data. The popular Bayesian BDeu criterion (Heckerman, Geiger, & Chickering, 1995) for learning Bayesian network structures has recently been reported to be very sensitive to the choice of prior hyper-parameters (Silander, Kontkanen, & Myllymäki, 2007). On the other hand, the gene...

275 | Fisher information and stochastic complexity
- Rissanen
- 1996

Citation Context: ...al range of values (Silander et al., 2007). 3.2 INFORMATION THEORY SCORES Our preferred model selection criterion would be to use the normalized maximum likelihood (NML) distribution (Shtarkov, 1987; Rissanen, 1996): P_NML(D | M) = P̂(D | M) / ∑_{D′} P̂(D′ | M), (7) where the normalization is over all data sets D′ of a fixed size N. The log of the normalizing factor is called the parametric complexity. NML is t...
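The NML distribution of Eq. (7) can be illustrated for a single multinomial variable. Below is a minimal sketch (hypothetical helper names, not from the paper) that normalizes the maximized likelihood over all sequences of the same length by brute-force enumeration, which is only feasible for tiny N and r — the point of the Kontkanen & Myllymäki algorithm cited later is precisely to avoid this enumeration:

```python
import itertools

def max_likelihood(counts, n):
    # P-hat(D | M): plug-in maximum-likelihood probability of data with these counts
    p = 1.0
    for c in counts:
        if c > 0:
            p *= (c / n) ** c
    return p

def pnml(data, r):
    """NML probability of `data` under a single r-ary multinomial (Eq. 7).
    Normalizes over all r**N sequences of the same length N (demo only)."""
    n = len(data)
    num = max_likelihood([data.count(k) for k in range(r)], n)
    denom = 0.0
    for seq in itertools.product(range(r), repeat=n):
        denom += max_likelihood([seq.count(k) for k in range(r)], n)
    return num / denom
```

Since the denominator is the same for every sequence of length N, the NML probabilities sum to one by construction, and sequences whose empirical distribution is more peaked receive higher probability.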

171 | Statistical theory: the prequential approach
- Dawid
- 1984

Citation Context: ...θ̃_ijk = (N_ijk + α_ijk) / ∑_{k′} (N_ijk′ + α_ijk′). (10) This choice of parameters can be further backed up by a prequential model selection principle (Dawid, 1984). Since the BDeu score is just a marginal likelihood P(D | G, α), it can be expressed as a product of predictive distributions P(D | G, α) = ∏_{n=1}^{N} P(D^(n) | D^(<n), α) = ∏_{n=1}^{N} P(D^(n) | θ̃(D^(<n), α)), ...
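The prequential decomposition described in the context above can be checked numerically for a single variable: the product of Dirichlet-smoothed one-step predictions equals the closed-form Dirichlet-multinomial marginal likelihood. A minimal sketch (hypothetical function names; a single r-ary variable with a symmetric prior rather than a full network):

```python
import math

def sequential_log_marginal(data, r, alpha):
    """log P(D | alpha) as a sum of one-step predictive log-probabilities,
    each prediction using Dirichlet-smoothed counts (the prequential view)."""
    counts = [0] * r
    logp = 0.0
    for x in data:
        logp += math.log((counts[x] + alpha) / (sum(counts) + r * alpha))
        counts[x] += 1
    return logp

def direct_log_marginal(data, r, alpha):
    """Closed-form Dirichlet-multinomial marginal likelihood, for comparison."""
    n = len(data)
    logp = math.lgamma(r * alpha) - math.lgamma(n + r * alpha)
    for k in range(r):
        c = data.count(k)
        logp += math.lgamma(c + alpha) - math.lgamma(alpha)
    return logp
```

For example, with data `[0, 1, 0]`, r = 2 and α = 1, both routes give log(1/2 · 1/3 · 2/4) = log(1/12), which is the telescoping the Dawid citation refers to.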

155 | Learning Bayesian networks is NP-complete
- Chickering
- 1996

Citation Context: ...r exponential with respect to the number of variables, and the model selection task has been shown to be NP-hard for practically all model selection criteria such as AIC, BIC and marginal likelihood (Chickering, 1996). However, all popular Bayesian network selection criteria S(G, D) feature a convenient decomposability property, S(G, D) = ∑_{i=1}^{m} S(D_i, D_{G_i}), (3) which makes implementing a heuristic search for models...
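The decomposability property of Eq. (3) is what makes heuristic local search affordable: changing the parent set of one node changes only that node's family term, so a single-edge move needs one family rescoring rather than a full re-evaluation. A minimal sketch with a hypothetical `family_score` callback (not a real scoring function from the paper):

```python
def total_score(parent_sets, family_score):
    """Decomposable score (Eq. 3): the global score is a sum of per-family
    terms. `parent_sets` maps each node i to its parent set G_i."""
    return sum(family_score(i, parents) for i, parents in parent_sets.items())

def rescore_after_change(old_total, i, old_parents, new_parents, family_score):
    """Only node i's family term changes when its parent set changes,
    so the new total is an O(1)-term update on the old one."""
    return old_total - family_score(i, old_parents) + family_score(i, new_parents)
```

A hill-climbing search would call `rescore_after_change` once per candidate edge addition, deletion, or reversal, keeping the best-scoring move.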

126 | Universal sequential coding of single messages
- Shtarkov
- 1987

Citation Context: ...completely normal range of values (Silander et al., 2007). 3.2 INFORMATION THEORY SCORES Our preferred model selection criterion would be to use the normalized maximum likelihood (NML) distribution (Shtarkov, 1987; Rissanen, 1996): P_NML(D | M) = P̂(D | M) / ∑_{D′} P̂(D′ | M), (7) where the normalization is over all data sets D′ of a fixed size N. The log of the normalizing factor is called the parametric comp...

44 | A simple approach for finding the globally optimal Bayesian network structure
- Silander, Myllymäki
- 2006

Citation Context: ...the counts N_ij. We omit the details. 5 EXPERIMENTS To empirically test our method, we selected 20 UCI data sets with fewer than 20 variables, so that we can use exact structure learning algorithms (Silander & Myllymäki, 2006) that eliminate the uncertainty due to the heuristic search for the best structure. We then compared our method, the fNML-based structure learning + fsNML parametrization, with the state-of-the-art Ba...

24 | A linear-time algorithm for computing the multinomial stochastic complexity
- Kontkanen, Myllymäki
- 2007

Citation Context: ...an exponential number of terms, it can be evaluated efficiently using the recently discovered linear-time algorithm for calculating the parametric complexity for a single r-ary multinomial variable (Kontkanen & Myllymäki, 2007). It is immediate from the construction that fNML is decomposable. Thus it can be used efficiently in heuristic local search. Empirical tests show that selecting the network structure with fNML compa...

22 | On the Dirichlet prior and Bayesian regularization
- Steck, Jaakkola
- 2002

Citation Context: ...twork structures encoding the same independence assumptions. The BDeu score depends only on a single parameter α, but the outcome of model selection is very sensitive to it: it has been previously shown (Steck & Jaakkola, 2002) that extreme values of α strongly affect the model selected by the BDeu score, and moreover, recent empirical studies have demonstrated great sensitivity to this parameter even within a complete...

7 | On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter
- Silander, Kontkanen, et al.
- 2007

Citation Context: ...pular Bayesian BDeu criterion (Heckerman, Geiger, & Chickering, 1995) for learning Bayesian network structures has recently been reported to be very sensitive to the choice of prior hyper-parameters (Silander, Kontkanen, & Myllymäki, 2007). On the other hand, the general model selection criteria, AIC (Akaike, 1973) and BIC (Schwarz, 1978), are derived through asymptotics and their behavior is suboptimal for small sample sizes. Further...

6 | On sequentially normalized maximum likelihood models
- Roos, Rissanen
- 2008

Citation Context: ...Hence, in accordance with the information-theoretic approach, we introduce a solution to the parameter learning task based on minimax rules. The so-called sequential NML model (Rissanen & Roos, 2007; Roos & Rissanen, 2008) is similar in spirit to the factorized NML model in the sense that the idea is to obtain a joint likelihood as a product of locally minimax (regret) optimal models. In sNML, the normalization is don...

6 | Factorized normalized maximum likelihood criterion for learning Bayesian network structures
- Silander, Roos, et al.
- 2008

Citation Context: ...twork structure with fNML compares favourably to the state-of-the-art model selection using BDeu scores even when the prior hyperparameter is optimized (with "hindsight") to maximize the performance (Silander, Roos, Kontkanen, & Myllymäki, 2008). 4 PREDICTION The scoring methods described in the previous section can be used for selecting the best Bayesian network structure. However, much of the appeal of Bayesian networks rests on the f...

5 | Conditional NML models
- Rissanen, Roos
- 2007

Citation Context: ...on the hyperparameters. Hence, in accordance with the information-theoretic approach, we introduce a solution to the parameter learning task based on minimax rules. The so-called sequential NML model (Rissanen & Roos, 2007; Roos & Rissanen, 2008) is similar in spirit to the factorized NML model in the sense that the idea is to obtain a joint likelihood as a product of locally minimax (regret) optimal models. In sNML, t...