## Robust inference of trees (2003)

### Download Links

- [www.hutter1.net]
- [www.idsia.ch]
- DBLP

### Other Repositories/Bibliography

Venue: | IDSIA, Manno (Lugano), Switzerland, 2003 |

Citations: | 7 - 7 self |

### BibTeX

@TECHREPORT{Zaffalon03robustinference,
  author      = {Marco Zaffalon and Marcus Hutter},
  title       = {Robust inference of trees},
  institution = {IDSIA, Manno (Lugano), Switzerland},
  year        = {2003}
}

### Abstract

This paper is concerned with the reliable inference of optimal tree approximations to the dependency structure of an unknown distribution generating data. The traditional approach to the problem measures the dependency strength between random variables by the index called mutual information. In this paper reliability is achieved by Walley’s imprecise Dirichlet model, which generalizes Bayesian learning with Dirichlet priors. Adopting the imprecise Dirichlet model results in posterior interval expectations for mutual information, and in a set of plausible trees consistent with the data. Reliable inference about the actual tree is achieved by focusing on the substructure common to all the plausible trees. We develop an exact algorithm that infers the substructure in time O(m^4), m being the number of random variables. The new algorithm is applied to a set of data sampled from a known distribution. The method is shown to reliably infer edges of the actual tree even when the data are very scarce, unlike the traditional approach. Finally, we provide lower and upper credibility limits for mutual information under the imprecise Dirichlet model. These enable the previous developments to be extended to a full inferential method for trees.
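The "traditional approach" the abstract contrasts with estimates mutual information by plugging empirical frequencies into the index. A minimal sketch of that plug-in estimate, with illustrative function names and data (not code from the paper):

```python
import math
from collections import Counter

def empirical_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired samples."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))  # joint counts
    px = Counter(xs)            # marginal counts for X
    py = Counter(ys)            # marginal counts for Y
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) )
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi

# Perfectly dependent binary variables: I(X;Y) = H(X) = log 2.
xs = [0, 0, 1, 1] * 50
print(empirical_mutual_information(xs, xs[:]))  # ≈ 0.693
```

On a finite sample this estimator fluctuates around the true value, which is exactly the unreliability the paper's interval-valued approach is designed to address.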

### Citations

7067 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
(Show Context)
Citation Context ...riable itself conditional on the state of the parent node. The distributions are given in Table 2. A model made by the graph and the probability tables, as the one above, is called a Bayesian network [Pea88]. We used the Bayesian network to sample units from the joint distribution of the variables in the graph. Each unit is a vector that ... |
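The context describes sampling units from a Bayesian network's joint distribution. A sketch of ancestral sampling over a toy parent-child network; the structure and probability tables below are invented for illustration (they are not the paper's Table 2):

```python
import random

# Toy network A -> B with hypothetical probability tables.
p_a = {0: 0.7, 1: 0.3}                       # P(A)
p_b_given_a = {0: {0: 0.9, 1: 0.1},          # P(B | A=0)
               1: {0: 0.2, 1: 0.8}}          # P(B | A=1)

def sample_categorical(dist, rng):
    """Draw a value from a dict {value: probability}."""
    r, acc = rng.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r < acc:
            return value
    return value  # guard against floating-point rounding

def sample_unit(rng):
    """One 'unit': a joint assignment drawn in topological order."""
    a = sample_categorical(p_a, rng)
    b = sample_categorical(p_b_given_a[a], rng)
    return (a, b)

rng = random.Random(0)
units = [sample_unit(rng) for _ in range(10_000)]
```

Sampling parents before children (ancestral order) guarantees each unit is a draw from the network's joint distribution.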

1276 |
Combinatorial Optimization: Algorithms and Complexity
- Papadimitriou, Steiglitz
- 1998
(Show Context)
Citation Context ...in the given total order. This is the case, for example, when mutual information is separately specified via intervals on the edges. [Figure 1: an example set-based weighted graph; the sets for the edges are specified separately by intervals that in two cases degenerate to real numbers.] ... |

1250 | Bayesian Data Analysis
- Gelman, Carlin, et al.
- 1995
(Show Context)
Citation Context ...nt to write n′_i = s·t_i with s := n′_+, hence t ∈ ∆. Examples for s are 0 for Haldane’s prior [Hal48], 1 for Perks’ prior [Per47], d/2 for Jeffreys’ prior [Jef46], and d for Bayes-Laplace’s uniform prior [GCSR95] (all with t_i = 1/d). These are also called noninformative priors. From the prior and the data likelihood one can determine the posterior p(π|D) = p(π|n) ∝ ∏_i π_i^(n_i + s·t_i − 1). The expected value or mean ... |
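The posterior quoted in the context is a Dirichlet with parameters n_i + s·t_i, whose mean is u_i = (n_i + s·t_i)/(n + s). A small sketch (the function name is illustrative):

```python
def dirichlet_posterior_mean(counts, s, t):
    """Posterior mean u_i = (n_i + s*t_i) / (n + s) under a
    Dirichlet(s*t) prior combined with multinomial counts n."""
    n = sum(counts)
    return [(ni + s * ti) / (n + s) for ni, ti in zip(counts, t)]

# Bayes-Laplace uniform prior over d=3 categories: s=d, t_i=1/d.
u = dirichlet_posterior_mean([2, 0, 1], s=3, t=[1/3, 1/3, 1/3])
print(u)  # posterior means 1/2, 1/6, 1/3
```

With s = 3 and uniform t this is Laplace's rule of succession: each category receives one "virtual" observation.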

1248 |
On information and sufficiency
- Kullback, Leibler
- 1951
(Show Context)
Citation Context ...lly be represented as an undirected tree T, that is the optimal approximating tree-dependency distribution (Section 2). This result is due to Chow and Liu [CL68], who use Kullback-Leibler’s divergence [KL51] to measure the similarity of two distributions. Since only a sample is available, the joint distribution π is unknown and an inferential approach is necessary. Prior uncertainty about the vector π is... |

1161 | Information Theory and Statistics
- Kullback
- 1997
(Show Context)
Citation Context ...ntly categorized according to a set of m nominal random variables ı, j, κ, etc. The dependency between two variables is measured by the information-theoretic symmetric index called mutual information [Kul68]. If the chances π of all instances defined by the co-occurrence of ı = i, j = j, κ = κ̇, etc., were known, it would be possible to approximate the distribution by another, for which all the depende... |

742 |
Statistical Reasoning with Imprecise Probabilities
- Walley
- 1991
(Show Context)
Citation Context ...sume exact prior knowledge p(π). The solution to the second problem is to model our ignorance by considering sets of priors p(π), a model that is part of the wider theory of imprecise probabilities [Wal91]. The specific imprecise Dirichlet model [Wal96] considers the set of all t ∈ ∆, i.e., {p(π) : t ∈ ∆}, which solves also the first problem. Walley suggests to fix the hyperparameter s somewhere in t... |

637 | Approximating discrete probability distributions with dependence trees
- Chow, Liu
- 1968
(Show Context)
Citation Context ...e dependencies are bivariate and can graphically be represented as an undirected tree T, that is the optimal approximating tree-dependency distribution (Section 2). This result is due to Chow and Liu [CL68], who use Kullback-Leibler’s divergence [KL51] to measure the similarity of two distributions. Since only a sample is available, the joint distribution π is unknown and an inferential approach is neces... |

594 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
(Show Context)
Citation Context ... other hand, one can think of applications for which the algorithm is probably not so well suited. For instance, in (precise-probability) problems of pattern classification based on Bayesian networks [FGG97], it is important to recover any tree (or forest) structure for which the sum of the edge weights is maximized. In this case, suppressing edges with large weights only because they are not strong migh... |

440 | The Advanced Theory of Statistics - Kendall, Stuart - 1961 |

216 |
Equivalence and synthesis of causal models
- Verma, Pearl
- 1990
(Show Context)
Citation Context ...he tree itself represents the dependencies between the variables. It is a well-known result that all the directed trees that share the same undirected structure represent the same set of dependencies [VP90]. This is the reason why the inference of directed trees from data focuses on recovering the undirected structure; and it is also the reason why this paper is almost entirely concerned with undirected... |

194 |
An invariant form for the prior probability in estimation problems
- Jeffreys
- 1946
(Show Context)
Citation Context ...n i can be modelled by large n′_i. It is convenient to write n′_i = s·t_i with s := n′_+, hence t ∈ ∆. Examples for s are 0 for Haldane’s prior [Hal48], 1 for Perks’ prior [Per47], d/2 for Jeffreys’ prior [Jef46], and d for Bayes-Laplace’s uniform prior [GCSR95] (all with t_i = 1/d). These are also called noninformative priors. From the prior and the data likelihood one can determine the posterior p(π|D) = p... |

124 |
Inferences from Multinomial Data: Learning About a Bag of Marbles
- Walley
- 1996
(Show Context)
Citation Context ... approach is necessary. Prior uncertainty about the vector π is described by the imprecise Dirichlet model (IDM) [Wal96]. (We denote vectors by x := (x_1,...,x_d) for x ∈ {n,t,u,π,...}.) This is an inferential model that generalizes Bayesian learning with Dirichlet priors, by using a set of prior densities to model prior (near-)ignorance. Using the IDM results in posterior uncertain... |
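Letting t range over the whole simplex, the IDM yields Walley's posterior probability intervals [n_i/(n+s), (n_i+s)/(n+s)] for each category chance. A minimal sketch of those intervals:

```python
def idm_intervals(counts, s=1.0):
    """Walley's IDM posterior intervals for category chances:
    each expectation u_i = (n_i + s*t_i)/(n + s) ranges over
    [n_i/(n+s), (n_i+s)/(n+s)] as t ranges over the simplex."""
    n = sum(counts)
    return [(ni / (n + s), (ni + s) / (n + s)) for ni in counts]

# Walley suggests s in [1, 2]; with s = 2 the intervals widen.
print(idm_intervals([3, 1], s=2.0))
```

Note the interval width s/(n+s) is the same for every category and shrinks as data accumulate, which is how the IDM encodes prior near-ignorance without committing to a single prior.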

96 |
Handbook of Mathematical Functions
- Abramowitz, Stegun
- 1968
(Show Context)
Citation Context ... specific imprecise Dirichlet model [26] considers the set of all t ∈ ∆, i.e., {p(π) : t ∈ ∆}, which solves also the first problem. Walley suggests to fix the hyperparameter s somewhere in the interval [1,2]. A set of priors results in a set of posteriors, set of expected values, etc. For real-valued quantities like the expected entropy E_t[H] the sets are typically intervals: E_t[H] ∈ [min_{t∈∆} E_t[H], max ... |

67 |
Partial Identification of Probability Distributions
- Manski
- 2003
(Show Context)
Citation Context ... dependency structures from incomplete samples. Recent research has developed robust approaches to incomplete samples that make very weak assumptions on the mechanism responsible for the missing data [18, 23, 29]. This would be an important step towards realism and reliability in structure inference. Appendix A. Properties of the digamma ψ function. The digamma function ψ is defined as the logarithmic derivati... |

48 | Robust learning with missing data - Ramoni, Sebastiani - 2000 |

38 | Distribution of Mutual Information
- Hutter
- 2002
(Show Context)
Citation Context ...oximations to the actual mutual information. Kleiter focuses on general graphical structures and is not concerned with questions of optimality. (Note that accurate expressions for credible mutual information intervals have been derived in [9, 11].) In the second case, Bernard [3] describes a method to build a directed graph from a multivariate binary database. The m... |

33 | An introduction to the imprecise Dirichlet model for multinomial data
- Bernard
- 2005
(Show Context)
Citation Context ...outputs depending on the random fluctuations involved in the generation of the sample. Reliability can be achieved by robust inferential tools. In this paper we consider the imprecise Dirichlet model [26, 4]. The IDM is a model of inference for multivariate categorical data. It models prior uncertainty with a set of Dirichlet prior densities and does posterior inference by combining them with the likelih... |

27 | The robust spanning tree problem with interval data
- Yaman, Karaşan, et al.
- 2001
(Show Context)
Citation Context ...erential approach should probably follow other lines than those described here. One possibility could be to exploit existing results in the literature of robust optimization; the work of Yaman et al. [YKP01] seems to be particularly worthy of consideration. Yaman et al. consider a problem of maximum spanning tree for a graph with weights specified by intervals (the weights are given no particular interpr... |

20 |
On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem
- Kruskal
- 1956
(Show Context)
Citation Context ...ed weighted graphs, as the set T of maximum spanning trees originated by the graphs in G. Recall that Kruskal’s algorithm only needs a total order on the edges to build a unique maximum spanning tree [15]. Therefore, in order to focus on T, we can equivalently focus on the set OT of total orders that are consistent with the graphs in G. In the following we find it more convenient not to directly deal... |
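The point quoted above, that Kruskal's algorithm needs only a total order on the edges, can be sketched with a generic union-find implementation (illustrative code, not the paper's):

```python
def max_spanning_tree(vertices, weighted_edges):
    """Kruskal's algorithm for a maximum spanning tree: scan edges
    in decreasing weight; the only input it really needs is this
    total order on the edges."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:              # edge joins two components: keep it
            parent[ru] = rv
            tree.append((u, v, w))
    return tree

edges = [(5, "A", "B"), (3, "B", "C"), (4, "A", "C"), (2, "C", "D")]
print(max_spanning_tree("ABCD", edges))
```

Because only the order matters, two weight assignments inducing the same edge ranking yield the same tree, which is what lets the paper reason about sets of orders instead of sets of weights.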

18 |
Exact credal treatment of missing data
- Zaffalon
(Show Context)
Citation Context ... dependency structures from incomplete samples. Recent research has developed robust approaches to incomplete samples that make very weak assumptions on the mechanism responsible for the missing data [18, 23, 29]. This would be an important step towards realism and reliability in structure inference. Appendix A. Properties of the digamma ψ function. The digamma function ψ is defined as the logarithmic derivati... |

14 | Distribution of mutual information from complete and incomplete data
- Hutter, Zaffalon
(Show Context)
Citation Context ...oximations to the actual mutual information. Kleiter focuses on general graphical structures and is not concerned with questions of optimality. (Note that accurate expressions for credible mutual information intervals have been derived in [9, 11].) In the second case, Bernard [3] describes a method to build a directed graph from a multivariate binary database. The m... |

14 | The posterior probability of Bayes nets with strong dependences
- Kleiter
- 1999
(Show Context)
Citation Context ...posed in this paper, aiming to exploit the results presented here in wider contexts. To our knowledge, the literature only reports two other attempts to infer robust structures of dependence. Kleiter [Kle99] uses approximate confidence intervals on mutual information to measure the dependence between random variables. Kleiter’s work is different in spirit from ours. We look for tree structures that are... |

11 | On the complexity of the robust spanning tree problem with interval data
- Aron, Hentenryck
- 2004
(Show Context)
Citation Context ...s sense the approach adopted by Yaman et al. is in the long tradition of the popular maximin (or minimax) decision criterion. From the computational point of view, although the problem is NP-complete [AVH04], recent results show that relatively large instances of the problem can be solved efficiently [Mon0X]. The trees defined by Yaman et al. could probably be combined with the IDM-based inferential appr... |

10 |
Non-parametric inference about an unknown mean using the imprecise dirichlet model
- Bernard
- 2001
(Show Context)
Citation Context ..., by using systematic and reliable interval approximations to the actual mutual information. Kleiter focuses on general graphical structures and is not concerned with questions of optimality. Bernard [Ber01] describes a method to build a directed graph from a multivariate binary database. The method is based on the IDM and Bayesian implicative analysis. The connection with our work is looser here since t... |

8 |
The precision of observed values of small frequencies
- Haldane
- 1948
(Show Context)
Citation Context ...y fractional) “virtual” sample numbers. High prior belief in i can be modelled by large n′_i. It is convenient to write n′_i = s·t_i with s := n′_+, hence t ∈ ∆. Examples for s are 0 for Haldane’s prior [Hal48], 1 for Perks’ prior [Per47], d/2 for Jeffreys’ prior [Jef46], and d for Bayes-Laplace’s uniform prior [GCSR95] (all with t_i = 1/d). These are also called noninformative priors. From the prior and the... |


7 | Robust estimators under the imprecise Dirichlet model
- Hutter
- 2003
(Show Context)
Citation Context ...H_lb (= H − O(σ²)) ≤ H ≤ H(u) (= H + O(σ²)). For robust estimates, the lower bound is more interesting. General approximation techniques for other quantities of interest are developed in [Hut03]. Exact expressions for [H̲, H̄] are also derived there. 4.3 Robust estimates for mutual information. Here we generalize the bounds for the entropy found in Section 4.2 to the mutual i... |

6 |
Estimating functions of distributions from a finite set of samples
- Wolpert, Wolf
- 1995
(Show Context)
Citation Context ... is E_t[H] = ∫_∆ H(π) p(π|n) dπ. An approximate solution can be obtained by exchanging E with H (exact only for linear functions): E_t[H(π)] ≈ H(E_t[π]) = H(u). The approximation error is typically of the order 1/n. In [27, 9, 11] exact expressions have been obtained: E_t[H] = H(u) := Σ_i h(u_i) with h(u) = u·[ψ(n + s + 1) − ψ((n + s)u + 1)] (1), where ψ(x) = d log Γ(x)/dx is the logarithmic derivative of the Gamma fu... |
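Equation (1) in the context can be evaluated directly. The sketch below assumes a standard asymptotic approximation of the digamma function ψ rather than a library call, so it stays self-contained:

```python
import math

def digamma(x):
    """Digamma via the recurrence psi(x) = psi(x+1) - 1/x plus the
    standard asymptotic expansion (adequate for this illustration)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def expected_entropy(counts, s, t):
    """E_t[H] = sum_i u_i * (psi(n+s+1) - psi((n+s)*u_i + 1)),
    with u_i = (n_i + s*t_i) / (n + s), as quoted in the context."""
    n = sum(counts)
    us = [(ni + s * ti) / (n + s) for ni, ti in zip(counts, t)]
    return sum(u * (digamma(n + s + 1) - digamma((n + s) * u + 1))
               for u in us)

# For large symmetric counts the estimate approaches ln 2.
print(expected_entropy([5000, 5000], s=1.0, t=[0.5, 0.5]))
```

As the counts grow, the exact posterior expectation converges to the plug-in entropy H(u), consistent with the 1/n error bound stated in the context.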

3 | A Benders decomposition approach for the robust spanning tree with interval data
- Montemanni
(Show Context)
Citation Context ...ax) decision criterion. From the computational point of view, although the problem is NP-complete [AVH04], recent results show that relatively large instances of the problem can be solved efficiently [Mon0X]. The trees defined by Yaman et al. could probably be combined with the IDM-based inferential approach presented here, suitably modified for classification problems, in order to yield relative robust ... |

3 |
Some observations on inverse probability
- Perks
- 1947
(Show Context)
Citation Context ...le numbers. High prior belief in i can be modelled by large n′_i. It is convenient to write n′_i = s·t_i with s := n′_+, hence t ∈ ∆. Examples for s are 0 for Haldane’s prior [Hal48], 1 for Perks’ prior [Per47], d/2 for Jeffreys’ prior [Jef46], and d for Bayes-Laplace’s uniform prior [GCSR95] (all with t_i = 1/d). These are also called noninformative priors. From the prior and the data likelihood one can det... |
