## Bayesian treatment of incomplete discrete data applied to mutual information and feature selection (2003)


### Download Links

- [www.idsia.ch]
- [www.hutter1.de]
- [www.hutter1.net]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of the Twenty-sixth German Conference on Artificial Intelligence (KI-2003), volume 2821 of Lecture Notes in Computer Science

Citations: 4 (4 self)

### BibTeX

    @INPROCEEDINGS{Hutter03bayesiantreatment,
      author    = {Marcus Hutter and Marco Zaffalon},
      title     = {Bayesian treatment of incomplete discrete data applied to mutual information and feature selection},
      booktitle = {Proceedings of the Twenty-sixth German Conference on Artificial Intelligence (KI-2003), volume 2821 of Lecture Notes in Computer Science},
      year      = {2003},
      pages     = {396--406},
      publisher = {Springer}
    }


### Abstract

Given the joint chances of a pair of discrete random variables, one can compute quantities of interest such as the mutual information. The Bayesian treatment of unknown chances involves computing, from a second-order prior distribution and the data likelihood, a posterior distribution of the chances. A common treatment of incomplete data is to assume ignorability and determine the chances by the expectation-maximization (EM) algorithm. These two methods are well established but are typically kept separate. This paper joins the two approaches in the case of Dirichlet priors, and derives efficient approximations for the mean, mode, and (co)variance of the chances and of the mutual information. Furthermore, we prove the unimodality of the posterior distribution, from which follows the important property that EM converges to the global maximum in the chosen framework. These results are applied to the problem of selecting features for incremental learning and naive Bayes classification. A fast filter based on the distribution of mutual information is shown to outperform the traditional filter based on empirical mutual information on a number of incomplete real data sets.
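The paper's analytic approximations are not reproduced here, but the object they describe, the posterior distribution p(I|D) of the mutual information under a Dirichlet prior, can be sketched by Monte Carlo for the complete-data case. This is an illustrative sketch, not the authors' code: the counts, the function names, and the uniform-prior choice are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def mutual_information(pi):
    """Mutual information I(pi) of a joint probability table pi (r x s), in nats."""
    pi = np.asarray(pi, dtype=float)
    row = pi.sum(axis=1, keepdims=True)
    col = pi.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = pi * np.log(pi / (row * col))
    return float(np.nansum(terms))  # 0 * log 0 treated as 0

def mi_posterior_sample(counts, prior=1.0, n_samples=5000):
    """Draw pi ~ Dirichlet(counts + prior) and return samples of I(pi):
    a Monte Carlo view of p(I|D) for complete data and a uniform prior."""
    alpha = np.asarray(counts, dtype=float).ravel() + prior
    pis = rng.dirichlet(alpha, size=n_samples)
    return np.array([mutual_information(p.reshape(np.shape(counts))) for p in pis])

counts = np.array([[30, 5], [4, 25]])   # hypothetical 2x2 contingency table
samples = mi_posterior_sample(counts)
mean, lo = samples.mean(), np.quantile(samples, 0.05)
# A robust filter keeps a feature only if a credible lower bound on I,
# not just the point estimate, exceeds the threshold.
```

The 5% quantile `lo` plays the role of the credible-interval lower bound that the paper's robust filters use in place of the empirical mutual information.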

### Citations

4983 | C4.5: Programs for Machine Learning - Quinlan - 1993

Citation Context: ... Bayes classifier [DH73], which is often a good classification model. Despite its simplifying assumptions (see [DP97]), it often competes successfully with much more complex classifiers, such as C4.5 [Qui93]. Our experiments focus on the incremental use of the naive Bayes classifier, a natural learning process when the data are available sequentially: the data set is read instance by instance; each time, ...

3940 | Pattern Classification and Scene Analysis - Duda, Hart - 1973

Citation Context: ...o use fewer features, i.e. only those for which there is substantial evidence about them being useful in predicting the class. For the subsequent classification task we use the naive Bayes classifier [DH73], which is often a good classification model. Despite its simplifying assumptions (see [DP97]), it often competes successfully with much more complex classifiers, such as C4.5 [Qui93]. Our experiments...

1936 | Pattern classification - Duda, Hart, et al. - 2001

Citation Context: ...lly in this case) that use credible intervals based on p(I|D) to robustly estimate mutual information. The filters are empirically tested in Section 5 by coupling them with the naive Bayes classifier [DHS01] to incrementally learn from and classify incomplete data. On five real data sets that we used, one of the two proposed filters consistently outperforms the traditional filter. ...

1882 | Numerical Recipes: The Art of Scientific Computing - Press, Flannery, et al. - 1986

Citation Context: ... leads to a full rs-dimensional Gaussian with kernel Ã = A + uvᵀ, u = v = e/√ε. The covariance of a Gaussian with kernel Ã is Ã⁻¹. Using the Sherman-Morrison formula Ã⁻¹ = A⁻¹ − (A⁻¹uvᵀA⁻¹)/(1 + vᵀA⁻¹u) [PFTV92, p. 73] and ε→0 we get Cov(ij)(kl)[π] := E[ΔijΔkl] ≃ [Ã⁻¹](ij)(kl) = [A⁻¹ − (A⁻¹eeᵀA⁻¹)/(eᵀA⁻¹e)](ij)(kl), (5). Note that ni? > 0 ∀i is not sufficient, since vi+ ≡ 0 for v ≠ 0 is possible. Actually v++ = ...
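As a numerical sanity check of the Sherman-Morrison identity cited in this context (not of the paper's covariance derivation), the rank-one inverse update can be verified directly; the matrix and vectors below are arbitrary examples chosen for illustration.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Rank-one inverse update:
    (A + u v^T)^-1 = A^-1 - (A^-1 u)(v^T A^-1) / (1 + v^T A^-1 u)."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(1)
n = 5
A = np.diag(rng.uniform(1.0, 2.0, n))   # an arbitrary invertible example matrix
u = v = np.ones(n)                      # rank-one direction e, as in the quoted kernel
updated = sherman_morrison(np.linalg.inv(A), u, v)
direct = np.linalg.inv(A + np.outer(u, v))
assert np.allclose(updated, direct)
```

The update costs O(n²) instead of the O(n³) of a fresh inversion, which is why it appears in the paper's derivation of the covariance in leading order.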

1261 | Bayesian data analysis - Gelman, Carlin, et al. - 1995

Citation Context: ...es (i,j), and n′′ij comprises prior information (1 for the uniform prior, 1/2 for Jeffreys' prior, 0 for Haldane's prior, 1/rs for Perks' prior, and other numbers in case of specific prior knowledge [GCSR95]). Furthermore, in leading order in 1/N, any Dirichlet prior with n′′ij = O(1) leads to the same results, hence we can simply assume a uniform prior. The reason for the δ(π++ − 1) is that π must be c...

1146 | Statistical analysis with missing data - Little, Rubin - 1987

Citation Context: ... of uncertainty about π. From the prior p(π) and the likelihood p(D|π) of data D one can compute the posterior p(π|D). The traditional solution to (b) is to assume that the data are missing at random [LR87]. A (local) maximum likelihood estimate π̂ can then be obtained by the expectation-maximization (EM) algorithm [CF74]. In this work we present a full Bayesian treatment of incomplete discrete data...

743 | UCI repository of machine learning databases - Murphy, Aha - 1994

Citation Context: ... together with their number of features, instances, missing values, and the relative frequency of the majority class. The data sets are available from the UCI repository of machine learning data sets [MA95]. The average numbers of features selected by the filters on the entire data set are reported in the last three columns. FF always selected fewer features than F; F almost always selected fewer features th...

605 | On the optimality of the simple Bayesian classifier under zero-one loss - Domingos, Pazzani - 1997

Citation Context: ...g useful in predicting the class. For the subsequent classification task we use the naive Bayes classifier [DH73], which is often a good classification model. Despite its simplifying assumptions (see [DP97]), it often competes successfully with much more complex classifiers, such as C4.5 [Qui93]. Our experiments focus on the incremental use of the naive Bayes classifier, a natural learning process when ...

429 | Selection of relevant features and examples in machine learning - Blum, Langley - 1997

Citation Context: ...or incomplete data. Since we only used labeled data we could use (11) with (6), and (14) with (7) and (3). Feature selection is a basic step in the process of building classifiers [BL97]. We consider the well-known filter (F) that computes the empirical mutual information I(π̂) between features and the class, and discards features with I(π̂) < ε for some threshold ε [Lew92]. This is an...

99 | Feature selection and feature extraction for text categorization - Lewis - 1992

Citation Context: ... classifiers [BL97]. We consider the well-known filter (F) that computes the empirical mutual information I(π̂) between features and the class, and discards features with I(π̂) < ε for some threshold ε [Lew92]. This is an easy and effective approach that has gained popularity with time. We compare F to the two filters introduced in [ZH02] for the case of complete data, and extended here to the more general...
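The baseline filter F described in this context, which discards features whose empirical mutual information with the class falls below a threshold ε, can be sketched as follows. The data, the threshold value, and the function names are all hypothetical.

```python
import numpy as np

def empirical_mi(feature, labels):
    """Empirical mutual information I(pi_hat), in nats, between a discrete
    feature and the class, computed from their joint frequency table."""
    _, f_idx = np.unique(feature, return_inverse=True)
    _, c_idx = np.unique(labels, return_inverse=True)
    counts = np.zeros((f_idx.max() + 1, c_idx.max() + 1))
    np.add.at(counts, (f_idx, c_idx), 1.0)   # accumulate joint counts
    pi = counts / counts.sum()
    row = pi.sum(axis=1, keepdims=True)
    col = pi.sum(axis=0, keepdims=True)
    nz = pi > 0                              # skip 0 * log 0 terms
    return float((pi[nz] * np.log(pi[nz] / (row * col)[nz])).sum())

def filter_F(X, y, eps=0.02):
    """The filter F: keep the columns of X whose empirical MI with y is >= eps."""
    return [j for j in range(X.shape[1]) if empirical_mi(X[:, j], y) >= eps]

# Hypothetical data: column 0 copies the class, column 1 is independent noise.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 500)
X = np.column_stack([y, rng.integers(0, 2, 500)])
kept = filter_F(X, y)   # column 0 passes; column 1 is typically discarded
```

The robust filters of [ZH02] replace the point estimate I(π̂) in `filter_F` with quantiles of the distribution p(I|D), which is what makes them less prone to keeping noise features whose empirical MI is spuriously high.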

98 | MLC++: A machine learning library in C++ - Kohavi, John, et al. - 1994

39 | Distribution of Mutual Information - Hutter - 2002

Citation Context: ...form leading order expressions of the distribution of mutual information p(I|D). In the case of complete data, the mean and variance of I have been approximated numerically in [Kle99] and analytically in [Hut02]. The results are then applied to feature selection in Section 4. A popular filter approach discards features of low empirical mutual information I(π̂) [Lew92, BL97, CHH+02]. We compare this filter ...

26 | Robust feature selection by mutual information distributions - Zaffalon, Hutter - 2002

Citation Context: ...ithm [CF74]. In this work we present a full Bayesian treatment of incomplete discrete data with Dirichlet prior p(π) and apply the results to feature selection. This work is a natural continuation of [ZH02], which focused on the case of complete data and, by working out a special case, provided encouraging evidence for the extension of the proposed approach to incomplete data. Here we develop that frame...

22 | KDD cup 2001 report - Cheng, Hatzis, et al.

14 | The posterior probability of Bayes nets with strong dependences - Kleiter - 1999

9 | Two-dimensional contingency tables with both completely and partially cross-classified data - Chen, Fienberg - 1974

Citation Context: ...e traditional solution to (b) is to assume that the data are missing at random [LR87]. A (local) maximum likelihood estimate π̂ can then be obtained by the expectation-maximization (EM) algorithm [CF74]. In this work we present a full Bayesian treatment of incomplete discrete data with Dirichlet prior p(π) and apply the results to feature selection. This work is a natural continuation of [ZH02], whi...