## On the optimality of the simple Bayesian classifier under zero-one loss (1997)

Venue: Machine Learning

Citations: 817 (27 self)

### Citations

6600 | C4.5: Programs for Machine Learning - Quinlan - 1993
Context: "...degree of attribute dependence in the data sets. The learners used were state-of-the-art representatives of three major approaches to classification learning: decision tree induction (C4.5 release 8, Quinlan, 1993), instance-based learning (PEBLS 2.1, Cost & Salzberg, 1993) and rule induction (CN2 version 6.1, Clark & Boswell, 1991). A simple Bayesian classifier was implemented for these experiments. Three main..."

4842 | Pattern Classification and Scene Analysis - Duda, Hart - 1973
Context: "...assigns it to a class. Many classifiers can be viewed as computing a set of discriminant functions of the example, one for each class, and assigning the example to the class whose function is maximum (Duda & Hart, 1973). If E is the example, and fi(E) is the discriminant function corresponding to the ith class, the chosen class Ck is the one for which fk(E) > fi(E) for all i ≠ k (Eq. 1). Suppose an example is a vector of a at..."
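The excerpt above describes classification as computing one discriminant function per class and picking the argmax. A minimal sketch of that scheme, using log-space naive-Bayes discriminants (log-prior plus summed log-likelihoods under attribute independence; the function names and data layout are illustrative, not from the paper):

```python
import math

def naive_bayes_discriminant(example, prior, cond_prob):
    # f_i(E) = log P(C_i) + sum_j log P(A_j = v_j | C_i)
    return math.log(prior) + sum(
        math.log(cond_prob[(attr, value)]) for attr, value in example.items()
    )

def classify(example, priors, cond_probs):
    # Choose the class C_k whose discriminant f_k(E) is maximal,
    # i.e., f_k(E) > f_i(E) for all i != k.
    return max(
        priors,
        key=lambda c: naive_bayes_discriminant(example, priors[c], cond_probs[c]),
    )
```

Any classifier fits this template by swapping in a different f_i; naive Bayes is simply the case where f_i factors over attributes.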

3469 | UCI repository of machine learning databases - Blake, Merz - 1998

890 | The CN2 induction algorithm - Clark, Niblett - 1989

788 | Probability and Statistics - DeGroot, Schervish - 2002
Context: "...ier is more accurate than C4.5 and CN2, if this sample of data sets is assumed to be representative. The fourth line shows the confidence levels obtained by applying the more sensitive Wilcoxon test (DeGroot, 1986) to the 28 average accuracy differences obtained, and results in high confidence that the Bayesian classifier is more accurate than each of the other learners. The fifth line shows the average accura..."

722 | Approximate statistical tests for comparing supervised classification learning algorithms - Dietterich - 1998

540 | Supervised and unsupervised discretization of continuous features - Dougherty, Kohavi, et al. - 1995

499 | Estimating continuous distributions in Bayesian classifiers - John, Langley - 1995

439 | An analysis of Bayesian classifiers - Langley, Iba, et al. - 1992

385 | Rule induction with CN2: Some recent improvements - Clark, Boswell - 1991

353 | Syskill and Webert: Identifying interesting web sites - Pazzani, Muramatsu, et al. - 1996

309 | A weighted nearest neighbor algorithm for learning with symbolic features - Cost, Salzberg - 1993
Context: "...learners used were state-of-the-art representatives of three major approaches to classification learning: decision tree induction (C4.5 release 8, Quinlan, 1993), instance-based learning (PEBLS 2.1, Cost & Salzberg, 1993) and rule induction (CN2 version 6.1, Clark & Boswell, 1991). A simple Bayesian classifier was implemented for these experiments. Three main issues arise here: how to handle numeric attributes, zero..."

265 | Induction of selective Bayesian classifiers - Langley, Sage - 1994

253 | Quantifying Inductive Bias: AI Learning Algorithms and Valiant's Model - Haussler - 1988
Context: "...is a subset of the perceptron's, or of a linear machine's (Duda & Hart, 1973). This leads to the following result. Let the Vapnik-Chervonenkis dimension, or VC dimension for short, be defined as in (Haussler, 1988). Corollary 2: In domains composed of a nominal attributes, the VC dimension of the simple Bayesian classifier is O(a). Proof: This result follows immediately from Theorem 4 and the fact that, given a..."

248 | On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery - Friedman - 1997
Context: "...o this question, but some elements can be gleaned from the results in this article, and from the literature. It is well known that squared-error loss can be decomposed into three additive components (Friedman, 1996): the intrinsic error due to noise in the sample, the statistical bias (systematic component of the approximation error, or error for an infinite sample) and the variance (component of the error due..."

231 | Scaling up the accuracy of Naive-Bayes classifiers: A decision-tree hybrid - Kohavi - 1996

212 | Bias plus variance decomposition for zero-one loss functions - Kohavi, Wolpert - 1996
Context: "...s: those with greater representational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations..."

194 | Estimating probabilities: A crucial task in machine learning - Cestnik - 1990

170 | Error-correcting output coding corrects bias and variance, in: ICML - Kong, Dietterich - 1995
Context: "...r of estimation algorithms: those with greater representational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, usi..."

131 | Learning Limited Dependence Bayesian Classifiers - Sahami - 1996

125 | Wrappers for Performance Enhancement and Oblivious Decision Graphs - Kohavi - 1995
Context: "...m-of-n concepts for which the Bayesian classifier makes errors, even when the examples are noise-free (i.e., an example always has the same class) and the Bayes rate is therefore zero (e.g., 3-of-7, Kohavi, 1995). Let P(A|C) represent the probability that an arbitrary attribute A is true given that the concept C is true, let a bar represent negation, and let all examples be equally probable. In general, if..."
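The claim in the excerpt above, that a simple Bayesian classifier errs on m-of-n concepts such as 3-of-7 even with noise-free data and exact probabilities, can be checked by brute force. A sketch (the function name is illustrative): enumerate all 2^n Boolean examples, label each by whether at least m bits are set, derive the exact class-conditional probabilities (identical for every attribute by symmetry), and count disagreements between the naive-Bayes prediction and the true class.

```python
from itertools import product

def naive_bayes_error_on_m_of_n(m=3, n=7):
    examples = list(product([0, 1], repeat=n))
    labels = [int(sum(e) >= m) for e in examples]
    pos = labels.count(1)
    neg = labels.count(0)
    # Exact P(A_i = 1 | C): by symmetry the same for every attribute,
    # so the first bit suffices.
    p1_pos = sum(e[0] for e, y in zip(examples, labels) if y == 1) / pos
    p1_neg = sum(e[0] for e, y in zip(examples, labels) if y == 0) / neg
    errors = 0
    for e, y in zip(examples, labels):
        # Unnormalized class scores: prior count times product of
        # class-conditional attribute probabilities.
        score_pos, score_neg = pos, neg
        for bit in e:
            score_pos *= p1_pos if bit else 1 - p1_pos
            score_neg *= p1_neg if bit else 1 - p1_neg
        errors += int(int(score_pos > score_neg) != y)
    return errors / len(examples)
```

For 3-of-7 the returned error rate is nonzero even though the Bayes rate is zero: the independence-based scores tip examples with exactly two true attributes over the decision boundary.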

123 | Semi-naive Bayesian classifier - Kononenko - 1991
Context: "...between pairs of attributes given the class). Given attributes Am and An and the class variable C, a possible measure of the degree of pairwise dependence between Am and An given C (Wan & Wong, 1989; Kononenko, 1991) is D(Am, An|C) = H(Am|C) + H(An|C) − H(AmAn|C) (Eq. 4), where AmAn represents the Cartesian product of attributes Am and An (i.e., a derived attribute with one possible value corresponding to each combination..."
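The dependence measure quoted above can be estimated empirically from the examples of a single class. A minimal sketch, assuming the input is a list of (Am, An) value pairs observed within one class (the helper names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(values):
    # Empirical Shannon entropy of a sample of discrete values.
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def pairwise_dependence(pairs):
    # D(Am, An | C) = H(Am|C) + H(An|C) - H(AmAn|C), where the list of
    # pairs itself plays the role of the Cartesian-product attribute AmAn.
    am_values = [am for am, _ in pairs]
    an_values = [an for _, an in pairs]
    return entropy(am_values) + entropy(an_values) - entropy(pairs)
```

The measure is zero when the empirical joint entropy equals the sum of the marginal entropies, i.e., when Am and An look independent given the class, and grows as knowing one attribute says more about the other.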

105 | Bias, variance and arcing classifiers - Breiman - 1996
Context: "...er, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations to the class probabilities, that..."

85 | Bayesian network classifiers, Machine Learning 29 (2-3) - Friedman, Geiger, et al. - 1997

74 | Searching for dependencies in Bayesian classifiers - Pazzani - 1995

71 | Constructing decision trees in noisy domains - Niblett - 1987

54 | Comparison of inductive and naïve Bayesian learning approaches to automatic knowledge acquisition - Kononenko - 1990

54 | Efficient learning of selective Bayesian network classifiers - Singh, Provan - 1995

50 | Induction of recursive Bayesian classifiers - Langley - 1993

40 | Bias, variance and prediction error for classification rules - Tibshirani - 1996
Context: "...presentational power, and thus greater ability to respond to the sample, tend to have lower bias, but also higher variance. Recently, several authors (Kong & Dietterich, 1995; Kohavi & Wolpert, 1996; Tibshirani, 1996; Breiman, 1996; Friedman, 1996) have proposed similar bias-variance decompositions for zero-one loss functions. In particular, Friedman (1996) has shown, using normal approximations to the class prob..."

27 | A comparison of induction algorithms for selective and non-selective Bayesian classifiers - Singh, GM - 1995

20 | A framework for the average case analysis of conjunctive learning algorithms - Pazzani, Sarrett - 1992

12 | Discovering patterns in EEG signals: Comparative study of a few methods - Kubat, Flotzinger, et al. - 1993

6 | Efficient learning of selective Bayesian network classifiers - Provan - 1996

5 | Error rates in quadratic discrimination with constraints on the covariance matrices - Flury, Schmid, et al. - 1994

5 | A comparison of induction algorithms for selective and non-selective Bayesian classifiers - Provan - 1995

4 | The effect of assuming independence in applying Bayes' theorem to risk estimation and classification in diagnosis. Comput Biomed Res 16: 357 - Russek, RA, et al. - 1983

2 | Sensitivity Analysis in Bayesian Classification Models: Multiplicative Deviations - Ben-Bassat, Klove - 1980

1 | Improving simple Bayes (technical report) - Kohavi, Becker, et al. - 1997

1 | A measure for concept dissimilarity and its applications in machine learning - Wan, Wong - 1989
Context: "...i.e., dependencies between pairs of attributes given the class). Given attributes Am and An and the class variable C, a possible measure of the degree of pairwise dependence between Am and An given C (Wan & Wong, 1989; Kononenko, 1991) is D(Am, An|C) = H(Am|C) + H(An|C) − H(AmAn|C) (Eq. 4), where AmAn represents the Cartesian product of attributes Am and An (i.e., a derived attribute with one possible value corresponding to..."

1 | Constructing decision trees in noisy domains. Proceedings of the Second European Working Session on Learning (pp. 67-78) - Niblett - 1987
Context: "...j = vjk|Ci) when they are multiplied according to Equation 2. A principled solution to this problem is to incorporate a small-sample correction into all probabilities, such as the Laplace correction (Niblett, 1987). If nijk is the number of times class Ci and value vjk of attribute Aj occur together, and ni is the total number of times class Ci occurs in the training set, the uncorrected estimate of P(Aj = vj..."
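The Laplace correction mentioned in the excerpt above replaces the uncorrected frequency estimate n_ijk / n_i with a smoothed version so that no conditional probability is ever exactly zero. A sketch of the common add-one variant, (n_ijk + 1) / (n_i + v_j), where v_j is the number of values of attribute A_j (the excerpt is truncated before giving the paper's exact formula, so this particular form is an assumption):

```python
def laplace_estimate(n_ijk, n_i, n_values):
    # Add-one Laplace smoothing: even when the count n_ijk is zero,
    # the estimate stays positive, and it tends to the raw frequency
    # n_ijk / n_i as n_i grows. n_values is v_j, the number of
    # distinct values of attribute A_j.
    return (n_ijk + 1) / (n_i + n_values)
```

This matters for the product in Equation 2: a single zero estimate would zero out the whole class score regardless of the other attributes.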