## Robust Learning with Missing Data

© 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.

### BibTeX

@MISC{ramoni2001robust,

author = {Marco Ramoni and Paola Sebastiani and Pat Langley},

title = {Robust Learning with Missing Data},

year = {2001}

}


### Abstract

This paper introduces a new method, called the robust Bayesian estimator (RBE), to learn conditional probability distributions from incomplete data sets. The intuition behind the RBE is that, when no information about the pattern of missing data is available, an incomplete database constrains the set of all possible estimates, and this paper provides a characterization of these constraints. An experimental comparison with two popular methods for estimating conditional probability distributions from incomplete data, Gibbs sampling and the EM algorithm, shows a gain in robustness. An application of the RBE to quantify a naive Bayesian classifier from an incomplete data set illustrates its practical relevance.
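The constraint intuition in the abstract can be sketched numerically: every possible completion of the missing entries yields an estimate between two extremes, so an incomplete database bounds the estimate in an interval. The function name and the exact bound formulas below are illustrative assumptions, not the paper's actual equations.

```python
# Hypothetical sketch of the interval idea behind the RBE:
# with `counts[v]` complete observations of each value v and
# `m` cases where the value is missing, every completion of
# the data yields an estimate between these two extremes.

def probability_interval(counts, value, m):
    """Return (p_min, p_max) for P(value) over all completions.

    p_min: none of the m missing cases takes `value`;
    p_max: all of the m missing cases take `value`.
    """
    n = sum(counts.values())
    p_min = counts[value] / (n + m)
    p_max = (counts[value] + m) / (n + m)
    return p_min, p_max

counts = {"yes": 30, "no": 10}   # 40 complete cases
lo, hi = probability_interval(counts, "yes", m=10)
print(lo, hi)  # 30/50 = 0.6 and 40/50 = 0.8
```

The width of the interval grows with the number of missing entries, which is how an incomplete database "constrains" rather than determines the estimate.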

### Citations

9033 | Maximum likelihood from incomplete data via the EM algorithm
- Dempster, Laird, et al.
- 1977
Citation Context ...uring the past few years, several methods have been proposed for learning conditional probabilities from incomplete data sets. The two most popular methods are the expectation maximization algorithm (Dempster, Laird, & Rubin, 1977) and Gibbs sampling (Geman & Geman, 1984). Both methods make the simplifying assumption that data are missing at random (Rubin, 1976). Under this assumption, the probability that an entry is not repo...

7489 |
Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference
- Pearl
- 1988
Citation Context ...set of variable values observed is called evidence, which we denote by e. The solution is to compute the probability distribution of Xi given the evidence e—using some standard propagation algorithm (Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997)—and then to select the value of Xi with the largest probability, given e. We can similarly propagate the probability intervals computed by the RBE with one of the ...

5430 |
C4.5 – Programs for Machine Learning
- Quinlan
- 1993
Citation Context ... procedure, in this case, would be to not update the counts when a value is missing (Domingos & Pazzani, 1997; Friedman, Geiger, & Goldszmidt, 1997) or to assign the unknown entries to a dummy value (Quinlan, 1993). We describe, first, the data set and the statistical model used in this application, and we then show the advantages of using the RBE in a supervised learning task. 6.1. The data set We used data o...

3081 |
UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
- Blake, Merz
- 1998
Citation Context ...of using the RBE in a supervised learning task. 6.1. The data set We used data on Congressional Voting Records, available from the Machine Learning Repository at the University of California, Irvine (Blake, Keogh, & Merz, 1998). The data set describes votes of each of the 435 members of the US House of Representatives on 16 key issues during 1984. Hence, the data set consists of...

1141 | A Bayesian method for the induction of probabilistic networks from data, Machine Learning 9
- Cooper, Herskovits
- 1992
Citation Context ... Boolean risk factors X1,...,X6 observed in a sample of 1841 employees of a Czech car factory. Table 2 describes the variables and their values. The data set is complete and we used the K2 algorithm (Cooper & Herskovits, 1992) to extract the most probable structure, reported in figure 3. The K2 algorithm extracts the most probable network consistent with a partial order among the variables in the data set. We chose X1 ≤ X...

955 | Learning Bayesian networks: The combination of knowledge and statistical data, Machine Learning 20
- Heckerman, Geiger, et al.
- 1995
Citation Context ...n successive estimates is smaller than a fixed threshold. The EM algorithm produces an approximation of the estimate (αijk + n(xik | πij) − 1)/(αij + n(πij) − si), the so called maximum a posteriori (Heckerman, Geiger, & Chickering, 1995). However, by setting α′ijk = αijk + 1, the estimate (α′ijk + n(xik | πij) − 1)/(α′ij + n(πij) − si) becomes exactly that given in Eq. (2). The convergence rate of this process can be slow and ...
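The estimate quoted in this excerpt is the mode of a Dirichlet posterior, and the α′ijk = αijk + 1 shift makes it coincide with the posterior mean. A small sketch (with illustrative function and variable names, not the paper's code) shows the two formulas for a single parent configuration:

```python
# Dirichlet posterior mode (MAP) vs posterior mean for one
# parent configuration pi_ij. `alphas` are the hyperparameters
# alpha_ijk and `counts` are the frequencies n(x_ik | pi_ij).
# Names are illustrative assumptions.

def dirichlet_map(alphas, counts):
    s = len(alphas)                    # s_i: number of states of X_i
    tot = sum(alphas) + sum(counts)    # alpha_ij + n(pi_ij)
    return [(a + n - 1) / (tot - s) for a, n in zip(alphas, counts)]

def dirichlet_mean(alphas, counts):
    tot = sum(alphas) + sum(counts)
    return [(a + n) / tot for a, n in zip(alphas, counts)]

alphas, counts = [1, 1], [7, 3]
print(dirichlet_map(alphas, counts))     # [0.7, 0.3]
print(dirichlet_mean(alphas, counts))    # [8/12, 4/12]
# Shifting alpha'_ijk = alpha_ijk + 1 makes the MAP under alpha'
# equal the mean under alpha, as the excerpt notes.
print(dirichlet_map([2, 2], counts) == dirichlet_mean(alphas, counts))
```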

643 | On the optimality of the simple Bayesian classifier under zero-one loss
- Domingos, Pazzani
- 1997
Citation Context ...nd tested using classification rules that do not rely on assumptions about the missing data mechanism. The standard procedure, in this case, would be to not update the counts when a value is missing (Domingos & Pazzani, 1997; Friedman, Geiger, & Goldszmidt, 1997) or to assign the unknown entries to a dummy value (Quinlan, 1993). We describe, first, the data set and the statistical model used in this application, and we t...

637 | Bayesian network classifiers
- Friedman, Geiger, et al.
- 1997
Citation Context ...tion rules that do not rely on assumptions about the missing data mechanism. The standard procedure, in this case, would be to not update the counts when a value is missing (Domingos & Pazzani, 1997; Friedman, Geiger, & Goldszmidt, 1997) or to assign the unknown entries to a dummy value (Quinlan, 1993). We describe, first, the data set and the statistical model used in this application, and we then show the advantages of using the R...
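The two standard procedures named in this excerpt, skipping the count when a value is missing versus treating "missing" as an extra dummy value, can be sketched as follows; the `MISSING` sentinel and counting scheme are illustrative assumptions, not the paper's implementation.

```python
# Two common ways a naive Bayes learner can handle a missing
# attribute value, per the excerpt above. Illustrative sketch.

from collections import Counter

MISSING = None  # assumed sentinel for an unknown entry

def count_skip(records):
    """Do not update the counts when a value is missing."""
    return Counter(v for v in records if v is not MISSING)

def count_dummy(records):
    """Assign unknown entries to a dummy value of their own."""
    return Counter("?" if v is MISSING else v for v in records)

votes = ["y", "n", MISSING, "y", MISSING]
print(count_skip(votes))   # Counter({'y': 2, 'n': 1})
print(count_dummy(votes))  # Counter({'y': 2, '?': 2, 'n': 1})
```

The first scheme changes the effective sample size per attribute; the second keeps it fixed but adds a state whose meaning depends on why values are missing.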

514 |
Bayesian classification (AutoClass): Theory and results
- Cheeseman, Stutz
- 1996
Citation Context ... Bayesian estimation of conditional probabilities from data is a task relevant to a variety of machine learning applications, such as classification (Langley, Iba, & Thompson, 1992) and clustering (Cheeseman & Stutz, 1996). When no entry is missing in the database, these conditional probabilities can be efficiently estimated using standard Bayesian analysis (Good, 1968). Unfortunately, when the database is incomplete,...

468 | Graphical Models in Applied Multivariate Statistics - Whittaker - 1990

372 |
Inference and missing data
- Rubin
- 1976
Citation Context ...ods are the expectation maximization algorithm (Dempster, Laird, & Rubin, 1977) and Gibbs sampling (Geman & Geman, 1984). Both methods make the simplifying assumption that data are missing at random (Rubin, 1976). Under this assumption, the probability that an entry is not reported is independent of the missing entries in the data set and, in this situation, the missing values can be inferred from the availa...

362 | An Analysis of Bayesian Classifiers
- Langley, Wayne, et al.
- 1992
Citation Context ...lity intervals, missing data 1. Introduction The Bayesian estimation of conditional probabilities from data is a task relevant to a variety of machine learning applications, such as classification (Langley, Iba, & Thompson, 1992) and clustering (Cheeseman & Stutz, 1996). When no entry is missing in the database, these conditional probabilities can be efficiently estimated using standard Bayesian analysis (Good, 1968). Unfort...

229 |
The EM algorithm for graphical association models with missing data
- Lauritzen
- 1995
Citation Context ...α′ijk + n(xik | πij) − 1)/(α′ij + n(πij) − si) becomes exactly that given in Eq. (2). The convergence rate of this process can be slow and several modifications have been proposed to increase it (Lauritzen, 1995; Zhang, 1996; Russell et al., 1995; Friedman, 1997). In contrast to the EM algorithm, which is iterative but deterministic, Gibbs sampling is a stochastic method that produces a sample of values for ...

190 |
Expert systems and probabilistic network models
- Castillo, Gutiérrez, et al.
- 1997
Citation Context ...le values observed is called evidence, which we denote by e. The solution is to compute the probability distribution of Xi given the evidence e—using some standard propagation algorithm (Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997)—and then to select the value of Xi with the largest probability, given e. We can similarly propagate the probability intervals computed by the RBE with one of the existing propagation algorithms for...

164 |
The Estimation of Probability: An Essay on Modern Bayesian Methods
- Good
- 1968
Citation Context ...a, & Thompson, 1992) and clustering (Cheeseman & Stutz, 1996). When no entry is missing in the database, these conditional probabilities can be efficiently estimated using standard Bayesian analysis (Good, 1968). Unfortunately, when the database is incomplete, i.e., some entries are reported as unknown, the simplicity and efficiency of this analysis are lost. Exact Bayesian analysis requires that one estima...

131 | Learning Belief Networks in the presence of Missing Values and Hidden Variables - Friedman - 1997

81 | Local learning in probabilistic networks with hidden variables
- Russell, Binder, et al.
- 1995
Citation Context ... ′ij + n(πij) − si) becomes exactly that given in Eq. (2). The convergence rate of this process can be slow and several modifications have been proposed to increase it (Lauritzen, 1995; Zhang, 1996; Russell et al., 1995; Friedman, 1997). In contrast to the EM algorithm, which is iterative but deterministic, Gibbs sampling is a stochastic method that produces a sample of values for the probabilities {p(xik | πij)} fr...

72 |
BUGS: a program to perform Bayesian inference using Gibbs sampling
- Thomas, Spiegelhalter, et al.
- 1992
Citation Context ...e posterior probabilities {ˆp(xik | πij)}. In practice, the algorithm iterates a number of times—called the burn in—to reach stability and then takes a final sample from the equilibrium distribution (Thomas, Spiegelhalter, & Gilks, 1992). The advantage of Gibbs sampling over EM is that the simulated sample provides empirical estimates of the variance, as well as credible intervals, that is, intervals that contain the p(xik | πij) va...
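The burn-in pattern this excerpt describes can be illustrated with a toy data-augmentation Gibbs sampler for a single binary variable with missing entries: run the chain for a while, discard those draws, then keep the rest. The model, names, and counts here are assumptions for illustration, not the paper's experiment.

```python
# Toy Gibbs sampler with burn-in: alternately (1) impute the
# missing entries given the current parameter, (2) redraw the
# parameter from its Beta posterior given the completed data.
# Draws made during the burn-in period are discarded.

import random

def gibbs_estimate(observed_ones, observed_zeros, n_missing,
                   burn_in=500, n_samples=2000, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    draws = []
    for t in range(burn_in + n_samples):
        # Imputation step: fill each missing entry given theta.
        imputed = sum(rng.random() < theta for _ in range(n_missing))
        # Parameter step: Beta(1, 1) prior plus completed counts.
        a = 1 + observed_ones + imputed
        b = 1 + observed_zeros + (n_missing - imputed)
        theta = rng.betavariate(a, b)
        if t >= burn_in:           # keep only post-burn-in draws
            draws.append(theta)
    return sum(draws) / len(draws)

# 30 ones, 10 zeros, 10 missing: under missing-at-random the
# average draw is close to the Beta(31, 11) mean, 31/42 ≈ 0.74.
print(gibbs_estimate(30, 10, 10))
```

Averaging the kept draws gives the point estimate, while their spread gives the empirical variance and credible intervals the excerpt mentions as the advantage of Gibbs sampling over EM.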

30 | Accelerated Quantification of Bayesian Networks with Incomplete Data
- Thiesson
- 1995
Citation Context ...ases as the accuracy of the approximation decreases. We compared the results obtained from the RBE estimates with the results obtained from an implementation of the accelerated EM algorithm in GAMES (Thiesson, 1995) and the implementation of Gibbs sampling called BUGS (Thomas, Spiegelhalter, & Gilks, 1992). 5.3. Results and discussion We begin by describing the use of the probability intervals computed by the R...

27 |
Learning in probabilistic expert systems
- Spiegelhalter, Cowell
- 1992
Citation Context ...erred from the available data. However, there is no way to verify this assumption on a database and, when this assumption is violated, all these methods can suffer a dramatic decrease in accuracy (Spiegelhalter & Cowell, 1992). This situation motivated the recent development of a deterministic method, called Bound and Collapse (Ramoni & Sebastiani, 1998), that does not rely, per se, on a pa...

23 | Parameter estimation in Bayesian networks from incomplete databases. Intelligent Data Analysis
- Ramoni, Sebastiani
- 1998
Citation Context ...f a dramatic decrease in accuracy (Spiegelhalter & Cowell, 1992). This situation motivated the recent development of a deterministic method, called Bound and Collapse (Ramoni & Sebastiani, 1998), that does not rely, per se, on a particular assumption about the missing data mechanism but allows the user to specify one, including the missing at random assumption. However, it still requires th...

19 |
Full belief
- Kyburg
- 1988
Citation Context ...n any particular assumption about the model for the missing data. We accomplish this task by choosing a criterion upon which to base the selection of the Xi value. The stochastic dominance criterion (Kyburg, 1983) selects the value xik of Xi if the minimum probability p(xik | e) is larger than the maximum probability ¯p(xih | e), for any h ≠ k. Stochastic dominance is the safest and most conservative criteri...
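The stochastic dominance rule quoted here, select xik only when its lower probability bound exceeds every other value's upper bound, can be sketched directly; the function name and interval data are illustrative assumptions.

```python
# Stochastic dominance over probability intervals: a value wins
# only if its lower bound beats every rival's upper bound;
# otherwise judgment is suspended (return None). Illustrative.

def stochastic_dominant(intervals):
    """intervals: {value: (p_min, p_max)} given the evidence e.
    Return the dominant value, or None if no value dominates."""
    for k, (lo, _) in intervals.items():
        if all(lo > hi for h, (_, hi) in intervals.items() if h != k):
            return k
    return None

print(stochastic_dominant({"a": (0.6, 0.8), "b": (0.2, 0.4)}))  # a
print(stochastic_dominant({"a": (0.4, 0.7), "b": (0.3, 0.6)}))  # None
```

The second call returns None because the intervals overlap: this is the conservative behavior the excerpt describes, since no completion of the missing data could reverse a dominance verdict.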

9 | Ignorant influence diagrams, in
- Ramoni
- 1995
Citation Context ...tional probabilities. During the past few years, there has been an increasing interest in algorithms that propagate probability intervals during inference in Bayesian networks (Fertig & Breese, 1993; Ramoni, 1995) but, to our knowledge, no effort has been made to apply the same interval-based approach to the task of learning such networks. The method presented in this paper can be therefore regarded as the le...

7 |
Probability intervals over influence diagrams
- Fertig, Breese
- 1993
Citation Context ... task of learning conditional probabilities. During the past few years, there has been an increasing interest in algorithms that propagate probability intervals during inference in Bayesian networks (Fertig & Breese, 1993; Ramoni, 1995) but, to our knowledge, no effort has been made to apply the same interval-based approach to the task of learning such networks. The method presented in this paper can be therefore rega...

7 | Bayesian Methods
- Ramoni, Sebastiani
- 2003
Citation Context ...ary sample upon which we base the formulation of the prior probability. As such, α represents a confidence measure of our prior probabilities and, therefore, it is called prior precision (Good, 1968; Ramoni & Sebastiani, 1999). Unfortunately, the simplicity and efficiency of this closed form solution are lost when the database is incomplete, that is, some entries are reported as unknown. The issues involved in the estimat...

6 |
Improved posterior probability estimates from prior and conditional linear constraint systems
- Snow
- 1991
Citation Context ...requirements. When attributes are not binary or more than two classes are involved, however, more general methods must be used to apply Bayes’ theorem to probability intervals (Fertig & Breese, 1993; Snow, 1991). A further difference between a standard Bayesian classifier and one trained with the RBE lies in the class assignment criterion. A standard scheme assigns a case to the class with the highest poste...

5 | An ignorant belief network to forecast glucose concentration from clinical databases - Ramoni, Riva, et al. - 1995