Results 1 
5 of
5
On the optimality of the simple Bayesian classifier under zeroone loss
 MACHINE LEARNING
, 1997
"... The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containin ..."
Abstract

Cited by 601 (25 self)
 Add to MetaCart
The simple Bayesian classifier is known to be optimal when attributes are independent given the class, but the question of whether other sufficient conditions for its optimality exist has so far not been explored. Empirical results showing that it performs surprisingly well in many domains containing clear attribute dependences suggest that the answer to this question may be positive. This article shows that, although the Bayesian classifier’s probability estimates are only optimal under quadratic loss if the independence assumption holds, the classifier itself can be optimal under zeroone loss (misclassification rate) even when this assumption is violated by a wide margin. The region of quadraticloss optimality of the Bayesian classifier is in fact a secondorder infinitesimal fraction of the region of zeroone optimality. This implies that the Bayesian classifier has a much greater range of applicability than previously thought. For example, in this article it is shown to be optimal for learning conjunctions and disjunctions, even though they violate the independence assumption. Further, studies in artificial domains show that it will often outperform more powerful classifiers for common training set sizes and numbers of attributes, even if its bias is a priori much less appropriate to the domain. This article’s results also imply that detecting attribute dependence is not necessarily the best way to extend the Bayesian classifier, and this is also verified empirically.
Tree induction vs. logistic regression: A learningcurve analysis
 CEDER WORKING PAPER #IS0102, STERN SCHOOL OF BUSINESS
, 2001
"... Tree induction and logistic regression are two standard, offtheshelf methods for building models for classi cation. We present a largescale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on classmembership pr ..."
Abstract

Cited by 62 (16 self)
 Add to MetaCart
Tree induction and logistic regression are two standard, offtheshelf methods for building models for classi cation. We present a largescale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on classmembership probabilities. We use a learningcurve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about inductionalgorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective atproducing probabilitybased rankings, although apparently comparatively less so foragiven training{set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable canbecharacterized surprisingly well by a simple measure of signaltonoise ratio.
Regularized Gaussian Discriminant Analysis Through Eigenvalue Decomposition
 Journal of the American Statistical Association
, 1996
"... Friedman (1989) has proposed a regularization technique (RDA) of discriminant analysis in the Gaussian framework. RDA makes use of two regularization parameters to design an intermediate classi cation rule between linear and quadratic discriminant analysis. In this paper, we propose an alternative a ..."
Abstract

Cited by 40 (6 self)
 Add to MetaCart
Friedman (1989) has proposed a regularization technique (RDA) of discriminant analysis in the Gaussian framework. RDA makes use of two regularization parameters to design an intermediate classi cation rule between linear and quadratic discriminant analysis. In this paper, we propose an alternative approach to design classi cation rules which have also a median position between linear and quadratic discriminant analysis. Our approach is based on the reparametrization of the covariance matrix k of a group Gk in terms of its eigenvalue decomposition, k = kDkAkD 0 k where k speci es the volume of Gk, Ak its shape, and Dk its orientation. Variations on constraints concerning k�Ak and Dk lead to 14 discrimination models of interest. For each model, we derived the maximum likelihood parameter estimates and our approach consists in selecting the model among the 14 possible models by minimizing the samplebased estimate of future misclassi cation risk by crossvalidation. Numerical experiments show favorable behavior of this approach as compared to RDA.
Clustering Via Normal Mixture Models
"... We consider a modelbased approach to clustering, whereby each observation is assumed to have arisen from an underlying mixture of a finite number of distributions. The number of components in this mixture model corresponds to the number of clusters to be imposed on the data. A common assumption is ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
We consider a modelbased approach to clustering, whereby each observation is assumed to have arisen from an underlying mixture of a finite number of distributions. The number of components in this mixture model corresponds to the number of clusters to be imposed on the data. A common assumption is to take the component distributions to be multivariate normal with perhaps some restrictions on the component covariance matrices. The model can be fitted to the data using maximum likelihood implemented via the EM algorithm. There is a number of computational issues associated with the fitting, including the specification of initial starting points for the EM algorithm and the carrying out of tests for the number of components in the final version of the model. We shall discuss some of these problems and describe an algorithm that attempts to handle them automatically. 1. INTRODUCTION In some applications of mixture models, questions related to clustering may arise only after the mixture mo...