## Asymptotic Model Selection for Naive Bayesian Networks (2002)

### Download Links

- [www.cs.technion.ac.il]
- [jmlr.csail.mit.edu]
- [jmlr.org]
- DBLP

### Other Repositories/Bibliography

Venue: In Proc. of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02)

Citations: 30 (3 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Rusakov02asymptoticmodel,
  author    = {Dmitry Rusakov and Dan Geiger},
  title     = {Asymptotic Model Selection for Naive Bayesian Networks},
  booktitle = {Proc. of the 18th Conference on Uncertainty in Artificial Intelligence (UAI-02)},
  year      = {2002},
  pages     = {438--445}
}
```

### Abstract

We develop a closed form asymptotic formula to compute the marginal likelihood of data given a naive Bayesian network model with two hidden states and binary features.

### Citations

7074 | Probabilistic Reasoning in Intelligent Systems - Pearl - 1988 |

Citation Context: ...denotes a particular state (class). Intuitively, this model describes the generation of data x that comes from r sources h_1, ..., h_r. Naive Bayesian models are a subclass of Bayesian networks (Pearl, 1988). In this work we focus on naive Bayesian networks that have two hidden states (r = 2) and n binary feature variables X_1, ..., X_n. We denote the parameters defining p(x_i | c_1) by a_i, the pa...
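The two-hidden-state model described in this context (class prior t = p(c_1), parameters a_i = p(x_i = 1 | c_1) and b_i = p(x_i = 1 | c_2)) can be sketched as a marginal log-likelihood computation. This is an illustrative snippet with hypothetical names, not code from the paper:

```python
import math

def naive_bayes_loglik(data, t, a, b):
    """Log-likelihood of binary vectors under a two-hidden-state naive Bayes model.

    t    -- p(c_1), the hidden-class prior (p(c_2) = 1 - t)
    a[i] -- p(x_i = 1 | c_1)
    b[i] -- p(x_i = 1 | c_2)
    """
    ll = 0.0
    for x in data:
        p1, p2 = t, 1.0 - t
        for xi, ai, bi in zip(x, a, b):
            p1 *= ai if xi else 1.0 - ai  # feature term under class c_1
            p2 *= bi if xi else 1.0 - bi  # feature term under class c_2
        ll += math.log(p1 + p2)           # sum out the hidden class
    return ll
```

With t = 0.5 and all parameters equal to 0.5, every binary vector of length n has probability 2^(-n), which gives a quick sanity check.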

2321 | Estimating the dimension of a model - Schwarz - 1978 |

Citation Context: ...r choosing model M. For many types of models the asymptotic evaluation of integral 1 (as N → ∞) is a classical Laplace procedure. This evaluation was first performed for Linear Exponential (LE) models (Schwarz, 1978) and then for Curved Exponential (CE) models under some additional technical assumptions (Haughton, 1988). It was shown that S(Y_D, N, M) = N ln P(Y_D | ω_ML) − (d/2) ln N + R, (2) where ln P(Y_D | ω_ML) ...
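The score in Eq. 2 of this context, S = N·ln P(Y_D | ω_ML) − (d/2)·ln N + R with R bounded, is the familiar BIC approximation. A minimal sketch of the computation, with a hypothetical helper name and the remainder R dropped:

```python
import math

def bic_score(avg_loglik, n_samples, dim):
    """S ~= N * ln P(Y_D | w_ML) - (d / 2) * ln N, dropping the bounded remainder R.

    avg_loglik -- per-sample log-likelihood at the maximum likelihood parameters
    n_samples  -- N, the number of observations
    dim        -- d, the (effective) dimension of the model
    """
    return n_samples * avg_loglik - 0.5 * dim * math.log(n_samples)
```

Among candidate models the one with the largest score is selected; the (d/2)·ln N term is what penalizes dimension as N grows.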

1079 | Bayesian method for the induction of probabilistic networks from data - Cooper, Herskovits - 1992 |

905 | Learning Bayesian networks: the combination of knowledge and statistical - Heckerman, Geiger, et al. - 1995 |

693 | Optimal Statistical Decisions - DeGroot - 1970 |

Citation Context: ...ln x + Y_1 ln[1 − x]) dx da db. I_1[N, Y] = ∫_a^b e^{N(Y_0 ln x + Y_1 ln(1 − x))} dx for some 0 ≤ a < b ≤ 1 (the case a > b is symmetric). This is the integral of the beta distribution with α = N·Y_0 + 1 and β = N·Y_1 + 1 (DeGroot, 1970, page 40). Let f(x) = Y_0 ln x + Y_1 ln(1 − x). The maximum of the integrand e^{N f(x)} on [0, 1] is achieved at x_0 = Y_0 and it is e^{N f(Y_0)}. There are three cases to consider according to the location ...
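The inner integral in this context, taken over the full range (0, 1), is ∫ x^{N·Y_0}(1 − x)^{N·Y_1} dx = B(N·Y_0 + 1, N·Y_1 + 1), a Beta function. It can be evaluated stably in log space via the log-gamma function; this is an illustrative sketch in the snippet's notation, not the paper's derivation:

```python
import math

def log_beta_integral(N, Y0, Y1):
    """log of the integral over (0,1) of exp(N*(Y0*ln x + Y1*ln(1-x))) dx,
    i.e. log B(N*Y0 + 1, N*Y1 + 1)."""
    alpha = N * Y0 + 1.0
    beta = N * Y1 + 1.0
    # B(a, b) = Gamma(a) * Gamma(b) / Gamma(a + b); lgamma avoids overflow for large N
    return math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
```

For example, B(1, 1) = 1 and B(2, 1) = 1/2, so the log values are 0 and −ln 2.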

594 | Bayesian network classifiers - Friedman, Geiger, et al. - 1997 |

Citation Context: ...the asymptotic evaluation of I[N, Y_D] for a binary naive Bayesian model with binary features. This model, described fully in Section 3, is useful in classification of binary vectors into two classes (Friedman et al., 1997). Our results are derived under similar assumptions to the ones made by Schwarz (1978) and Haughton (1988). In this sense, our paper generaliz...

481 | Bayesian Classification (AutoClass): Theory and Results. Knowledge Discovery - Cheeseman, Stutz - 1995 |

Citation Context: ...ely, this model describes the generation of data x that comes from r sources c_1, ..., c_r. Naive Bayesian models are a subclass of Bayesian networks (Pearl, 1988) and they are widely used in clustering (Cheeseman and Stutz, 1995). In this work we focus on naive Bayesian networks that have two hidden states (r = 2) and n binary feature variables X_1, ..., X_n. We denote the parameters defining p(x_i = 1 | c_1) by a_i, the parameters d...

468 | Resolution of singularities of an algebraic variety over a field of characteristic zero. I, II - Hironaka - 1964 |

Citation Context: ...l theory comes to the rescue. The resolution of singularities in algebraic geometry transforms the integrals J(λ) into a direct product of integrals of a single variable (Atiyah, 1970, Resolution Theorem; Hironaka, 1964). We demonstrate this technique in the next section. 3 APPLICATION OF WATANABE'S METHOD We now apply the method of Watanabe (2001) to approximate the integral J̃[N] = ∫_{(−ε,+ε)^n} e^{−N Σ_{1≤l≠k≤n} u_l² ...

94 | Real analysis - Lang - 1983 |

63 | On the Choice of a Model to Fit Data From an Exponential Family, The Annals of Statistics 16(1) - Haughton - 1988 |

Citation Context: ...ical Laplace procedure. This evaluation was first performed for Linear Exponential (LE) models (Schwarz, 1978) and then for Curved Exponential (CE) models under some additional technical assumptions (Haughton, 1988). It was shown that S(N, Y_D, M) = N · ln P(Y_D | ω_ML) − (d/2) ln N + R, (2) where ln P(Y_D | ω_ML) is the log-likelihood of Y_D given the maximum likelihood parameters of the model and d is the model dimension, i.e...

54 | Stratified exponential families: graphical models and model selection, Ann. Statist. - Geiger, Heckerman, et al. - 2001 |

Citation Context: ...e class of distributions represented by Bayesian networks with hidden variables is significantly richer than curved exponential models and it falls into the class of Stratified Exponential (SE) models (Geiger et al., 2001). For such models the effective dimensionality d (Eq. 2) of the model is no longer the number of network parameters (Geiger, Heckerman & Meek, 1996; Settimi & Smith, 1998). Moreover, the central prob...

48 | Algebraic geometry for scientists and engineers - Abhyankar - 1990 |

Citation Context: ...ded transformations for the integral under study, we apply a technique called blowing-up which consists of a series of quadratic transformations. For an accessible introduction to these concepts see (Abhyankar, 1990). We start with n = 3 and then generalize. Rescaling the integration range to (−1, 1) and then taking only the positive octant, which does not change the poles of J(λ), yields J(λ) = ∫_(0,1)^3 (u ...
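The n = 3 integral in this context, J(λ) = ∫_(0,1)^3 (u_1²u_2² + u_1²u_3² + u_2²u_3²)^λ du, can be spot-checked numerically before any blow-up transformation. A simple Monte Carlo sketch, purely for illustration:

```python
import random

def mc_J(lam, n=100_000, seed=0):
    """Monte Carlo estimate of J(lam), the integral over the unit cube (0,1)^3
    of (u1^2 u2^2 + u1^2 u3^2 + u2^2 u3^2)^lam."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u1, u2, u3 = rng.random(), rng.random(), rng.random()
        total += (u1*u1*u2*u2 + u1*u1*u3*u3 + u2*u2*u3*u3) ** lam
    return total / n  # the unit cube has volume 1, so the mean is the integral
```

For λ = 1 the exact value is 3·(E[u²])² = 3·(1/3)² = 1/3, a convenient sanity check on the estimate.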

46 | Asymptotic model selection for directed networks with hidden variables - Geiger, Heckerman, et al. - 1996 |

45 | Algebraic Analysis for Nonidentifiable Learning - Watanabe - 2001 |

41 | Resolution of singularities and division of distributions - Atiyah - 1970 |

Citation Context: ...e, another fundamental mathematical theory comes to the rescue. The resolution of singularities in algebraic geometry transforms the integrals J(λ) into a direct product of integrals of a single variable (Atiyah, 1970, Resolution Theorem; Hironaka, 1964). We demonstrate this technique in the next section. 3 APPLICATION OF WATANABE'S METHOD We now apply the method of Watanabe (2001) to approximate the integral J̃[...

33 | Consistent estimation of the order of mixture models - Keribin - 2000 |

32 | A New Look at the Statistical Model Identification - Akaike - 1974 |

Citation Context: ...select the model of highest possible dimension, contrary to the intuitive notion of choosing the right model. Penalized likelihood approaches such as AIC have been proposed to remedy this deficiency (Akaike, 1974). We focus on the Bayesian approach to model selection, by which a model M is chosen according to the maximum a posteriori probability given the observed data D: P(M | D) ∝ P(M, D) = P(M)P(D | M) = P...

21 | On the geometry of Bayesian graphical models with hidden variables - Settimi, Smith - 1998 |

Citation Context: ...nential (SE) models (Geiger et al., 2001). For such models the effective dimensionality d (Eq. 2) of the model is no longer the number of network parameters (Geiger, Heckerman & Meek, 1996; Settimi & Smith, 1998). Moreover, the central problem in the evaluation of the marginal likelihood for this class is that the set of maximum likelihood points is sometimes a complex self-crossing surface. Recently, major p...

9 | Graphical Models. Number 17 - Lauritzen - 1996 |

Citation Context: ...referred to as a (standard) BIC score. The use of the BIC score for Bayesian model selection for Graphical Models is valid for Undirected Graphical Models without hidden variables because these are LE models (Lauritzen, 1996). The justification of BIC for Directed Graphical Models (called Bayesian Networks) is somewhat more complicated. On one hand, discrete and Gaussian DAG models are CE models (Geiger, Heckerman, King &...

8 | The dimensionality of mixed ancestral graphs - Spirtes, Richardson, et al. - 1997 |



2 | Asymptotic Analysis. Number 48 - Murray - 1984 |


2 | Asymptotic Approximations of Integrals. Academic Press - Wong - 1989 |

Citation Context: ...or of ω_ML and this maximum is non-degenerate, i.e., the Hessian matrix Hf(ω_ML) of f at ω_ML is of full rank. In this case the approximation of I[N, Y] for N → ∞ is the classical Laplace procedure (e.g., Wong, 1989, page 495), summarized as follows. Lemma 1 (Laplace Approximation) Let I(N) = ∫_U e^{−N f(u)} φ(u) du where U ⊆ R^d. Suppose that f is twice differentiable and convex (Hf(u) > 0), the minimum of f on U is ...


2 | Geometry, moments and conditional independence trees with hidden variables - Settimi, Smith |
