Results 1  10
of
353
Induction of Decision Trees
 MACH. LEARN
, 1986
"... The technology for building knowledgebased systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such syste ..."
Abstract

Cited by 4303 (4 self)
 Add to MetaCart
The technology for building knowledgebased systems by inductive inference from examples has been demonstrated successfully in several practical applications. This paper summarizes an approach to synthesizing decision trees that has been used in a variety of systems, and it describes one such system, ID3, in detail. Results from recent studies show ways in which the methodology can be modified to deal with information that is noisy and/or incomplete. A reported shortcoming of the basic algorithm is discussed and two means of overcoming it are compared. The paper concludes with illustrations of current research directions.
Wrappers for Feature Subset Selection
 AIJ SPECIAL ISSUE ON RELEVANCE
, 1997
"... In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a ..."
Abstract

Cited by 1522 (3 self)
 Add to MetaCart
In the feature subset selection problem, a learning algorithm is faced with the problem of selecting a relevant subset of features upon which to focus its attention, while ignoring the rest. To achieve the best possible performance with a particular learning algorithm on a particular training set, a feature subset selection method should consider how the algorithm and the training set interact. We explore the relation between optimal feature subset selection and relevance. Our wrapper method searches for an optimal feature subset tailored to a particular algorithm and a domain. We study the strengths and weaknesses of the wrapper approach andshow a series of improved designs. We compare the wrapper approach to induction without feature subset selection and to Relief, a filter approach to feature subset selection. Significant improvement in accuracy is achieved for some datasets for the two families of induction algorithms used: decision trees and NaiveBayes.
Bayesian Network Classifiers
, 1997
"... Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with stateoftheart classifiers such as C4.5. This fact raises the question of whether a classifier with less restr ..."
Abstract

Cited by 788 (23 self)
 Add to MetaCart
(Show Context)
Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong assumptions of independence among features, called naive Bayes, is competitive with stateoftheart classifiers such as C4.5. This fact raises the question of whether a classifier with less restrictive assumptions can perform even better. In this paper we evaluate approaches for inducing classifiers from data, based on the theory of learning Bayesian networks. These networks are factored representations of probability distributions that generalize the naive Bayesian classifier and explicitly represent statements about independence. Among these approaches we single out a method we call Tree Augmented Naive Bayes (TAN), which outperforms naive Bayes, yet at the same time maintains the computational simplicity (no search involved) and robustness that characterize naive Bayes. We experimentally tested these approaches, using problems from the University of California at Irvine repository, and compared them to C4.5, naive Bayes, and wrapper methods for feature selection.
An Empirical Comparison of Voting Classification Algorithms: Bagging, Boosting, and Variants
 MACHINE LEARNING
, 1999
"... Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and realworld datasets. We review these algorithms and describe a large empirical study comparing several variants in co ..."
Abstract

Cited by 695 (2 self)
 Add to MetaCart
Methods for voting classification algorithms, such as Bagging and AdaBoost, have been shown to be very successful in improving the accuracy of certain classifiers for artificial and realworld datasets. We review these algorithms and describe a large empirical study comparing several variants in conjunction with a decision tree inducer (three variants) and a NaiveBayes inducer.
The purpose of the study is to improve our understanding of why and
when these algorithms, which use perturbation, reweighting, and
combination techniques, affect classification error. We provide a
bias and variance decomposition of the error to show how different
methods and variants influence these two terms. This allowed us to
determine that Bagging reduced variance of unstable methods, while
boosting methods (AdaBoost and Arcx4) reduced both the bias and
variance of unstable methods but increased the variance for NaiveBayes,
which was very stable. We observed that Arcx4 behaves differently
than AdaBoost if reweighting is used instead of resampling,
indicating a fundamental difference. Voting variants, some of which
are introduced in this paper, include: pruning versus no pruning,
use of probabilistic estimates, weight perturbations (Wagging), and
backfitting of data. We found that Bagging improves when
probabilistic estimates in conjunction with nopruning are used, as
well as when the data was backfit. We measure tree sizes and show
an interesting positive correlation between the increase in the
average tree size in AdaBoost trials and its success in reducing the
error. We compare the meansquared error of voting methods to
nonvoting methods and show that the voting methods lead to large
and significant reductions in the meansquared errors. Practical
problems that arise in implementing boosting algorithms are
explored, including numerical instabilities and underflows. We use
scatterplots that graphically show how AdaBoost reweights instances,
emphasizing not only "hard" areas but also outliers and noise.
The Case Against Accuracy Estimation for Comparing Induction Algorithms
 In Proceedings of the Fifteenth International Conference on Machine Learning
, 1997
"... We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for compar ..."
Abstract

Cited by 410 (23 self)
 Add to MetaCart
We analyze critically the use of classification accuracy to compare classifiers on natural data sets, providing a thorough investigation using ROC analysis, standard machine learning algorithms, and standard benchmark data sets. The results raise serious concerns about the use of accuracy for comparing classifiers and drawinto question the conclusions that can be drawn from such studies. In the course of the presentation, we describe and demonstrate what we believe to be the proper use of ROC analysis for comparative studies in machine learning research. We argue that this methodology is preferable both for making practical choices and for drawing scientific conclusions.
The Foundations of CostSensitive Learning
 In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence
, 2001
"... This paper revisits the problem of optimal learning and decisionmaking when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically i ..."
Abstract

Cited by 398 (6 self)
 Add to MetaCart
This paper revisits the problem of optimal learning and decisionmaking when different misclassification errors incur different penalties. We characterize precisely but intuitively when a cost matrix is reasonable, and we show how to avoid the mistake of defining a cost matrix that is economically incoherent. For the twoclass case, we prove a theorem that shows how to change the proportion of negative examples in a training set in order to make optimal costsensitive classification decisions using a classifier learned by a standard noncostsensitive learning method. However, we then argue that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods. Accordingly, the recommended way of applying one of these methods in a domain with differing misclassification costs is to learn a classifier from the training set as given, and then to compute optimal decisions ...
Learning and Revising User Profiles: The Identification of Interesting Web Sites
 Machine Learning
, 1997
"... . We discuss algorithms for learning and revising user profiles that can determine which World Wide Web sites on a given topic would be interesting to a user. We describe the use of a naive Bayesian classifier for this task, and demonstrate that it can incrementally learn profiles from user feedback ..."
Abstract

Cited by 375 (15 self)
 Add to MetaCart
. We discuss algorithms for learning and revising user profiles that can determine which World Wide Web sites on a given topic would be interesting to a user. We describe the use of a naive Bayesian classifier for this task, and demonstrate that it can incrementally learn profiles from user feedback on the interestingness of Web sites. Furthermore, the Bayesian classifier may easily be extended to revise user provided profiles. In an experimental evaluation we compare the Bayesian classifier to computationally more intensive alternatives, and show that it performs at least as well as these approaches throughout a range of different domains. In addition, we empirically analyze the effects of providing the classifier with background knowledge in form of user defined profiles and examine the use of lexical knowledge for feature selection. We find that both approaches can substantially increase the prediction accuracy. Keywords: Information filtering, intelligent agents, multistrategy lea...
Robust Classification for Imprecise Environments
, 1989
"... In realworld environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclas ..."
Abstract

Cited by 332 (15 self)
 Add to MetaCart
(Show Context)
In realworld environments it is usually difficult to specify target operating conditions precisely. This uncertainty makes building robust classification systems problematic. We present a method for the comparison of classifier performance that is robust to imprecise class distributions and misclassification costs. The ROC convex hull method combines techniques from ROC analysis, decision analysis and computational geometry, and adapts them to the particulars of analyzing learned classifiers. The method is efficient and incremental, minimizes the management of classifier performance data, and allows for clear visual comparisons and sensitivity analyses. We then show that it is possible to build a hybrid classifier that will perform at least as well as the best available classifier for any target conditions. This robust performance extends across a wide variety of comparison frameworks, including the optimization of metrics such as accuracy, expected cost, lift, precision, recall, and ...
Employing EM in PoolBased Active Learning for Text Classification
, 1998
"... This paper shows how a text classifier's need for labeled training data can be reduced by a combination of active learning and Expectation Maximization (EM) on a pool of unlabeled data. QuerybyCommittee is used to actively select documents for labeling, then EM with a naive Bayes model furthe ..."
Abstract

Cited by 316 (10 self)
 Add to MetaCart
This paper shows how a text classifier's need for labeled training data can be reduced by a combination of active learning and Expectation Maximization (EM) on a pool of unlabeled data. QuerybyCommittee is used to actively select documents for labeling, then EM with a naive Bayes model further improves classification accuracy by concurrently estimating probabilistic labels for the remaining unlabeled documents and using them to improve the model. We also present a metric for better measuring disagreement among committee members; it accounts for the strength of their disagreement and for the distribution of the documents. Experimental results show that our method of combining EM and active learning requires only half as many labeled training examples to achieve the same accuracy as either EM or active learning alone. Keywords: text classification active learning unsupervised learning information retrieval 1 Introduction In many settings for learning text classifiers, obtaining lab...
Distributional Clustering of Words for Text Classification
, 1998
"... This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionalityreduction techniques, such as Latent Sem ..."
Abstract

Cited by 299 (1 self)
 Add to MetaCart
This paper describes the application of Distributional Clustering [20] to document classification. This approach clusters words into groups based on the distribution of class labels associated with each word. Thus, unlike some other unsupervised dimensionalityreduction techniques, such as Latent Semantic Indexing, we are able to compress the feature space much more aggressively, while still maintaining high document classification accuracy. Experimental results obtained on three realworld data sets show that we can reduce the feature dimensionality by three orders of magnitude and lose only 2% accuracysignificantly better than Latent Semantic Indexing [6], classbased clustering [1], feature selection by mutual information [23], or Markovblanketbased feature selection [13]. We also show that less aggressive clustering sometimes results in improved classification accuracy over classification without clustering. 1 Introduction The popularity of the Internet has caused an exponent...