Results 1-10 of 218
Multicategory Support Vector Machines, theory, and application to the classification of microarray data and satellite radiance data
Journal of the American Statistical Association, 2004
Cited by 189 (22 self)
Two-category support vector machines (SVM) have been very popular in the machine learning community for classification problems. Solving multicategory problems by a series of binary classifiers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassification costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through the applications to cancer classification using microarray data and cloud classification with satellite radiance profiles.
A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-three Old and New Classification Algorithms
2000
Cited by 186 (7 self)
Twenty-two decision tree, nine statistical, and two neural network algorithms are compared on thirty-two datasets in terms of classification accuracy, training time, and (in the case of trees) number of leaves. Classification accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, spline-based algorithm called Polyclass at the top, although it is not statistically significantly different from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is Quest with linear splits, which ranks fourth and fifth, respectively. Although spline-based statistical algorithms tend to have good accuracy, they also require relatively long training times. Polyclass, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The Quest and logistic regression algor...
Tree Induction for Probability-based Ranking
2002
Cited by 142 (4 self)
Tree induction is one of the most effective and widely used methods for building classification models. However, many applications require cases to be ranked by the probability of class membership. Probability estimation trees (PETs) have the same attractive features as classification trees (e.g., comprehensibility, accuracy and efficiency in high dimensions and on large data sets). Unfortunately, decision trees have been found to provide poor probability estimates. Several techniques have been proposed to build more accurate PETs, but, to our knowledge, there has not been a systematic experimental analysis of which techniques actually improve the probability-based rankings, and by how much. In this paper we first discuss why the decision-tree representation is not intrinsically inadequate for probability estimation. Inaccurate probabilities are partially the result of decision-tree induction algorithms that focus on maximizing classification accuracy and minimizing tree size (for example via reduced-error pruning). Larger trees can be better for probability estimation, even if the extra size is superfluous for accuracy maximization. We then present the results of a comprehensive set of experiments, testing some straightforward methods for improving probability-based rankings. We show that using a simple, common smoothing method (the Laplace correction) uniformly improves probability-based rankings. In addition, bagging substantially improves the rankings, and is even more effective for this purpose than for improving accuracy. We conclude that PETs, with these simple modifications, should be considered when rankings based on class-membership probability are required.
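The Laplace correction named in the abstract above can be illustrated with a minimal Python sketch. It assumes the common form of the correction, in which a leaf holding k examples of a class out of n examples in total, with C classes, gives the estimate (k + 1) / (n + C). The function name and the example counts are hypothetical and are not taken from the paper.

```python
def laplace_estimate(class_count, leaf_total, num_classes):
    """Smoothed class-probability estimate for one leaf of a tree.

    Assumed form of the Laplace correction: (k + 1) / (n + C).
    """
    return (class_count + 1) / (leaf_total + num_classes)


# Hypothetical leaf: 3 positive examples, 0 negative examples, two classes.
raw = 3 / 3                           # unsmoothed frequency estimate
smoothed = laplace_estimate(3, 3, 2)  # (3 + 1) / (3 + 2)
```

With raw frequencies, every pure leaf receives the same extreme estimate regardless of how many examples it holds. The smoothed estimate pulls small leaves toward the uniform prior, so leaves of different sizes can be told apart when cases are ranked.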
Learning when Training Data are Costly: The Effect of Class Distribution on Tree Induction
2002
Cited by 129 (9 self)
For large, real-world inductive learning problems, the number of training examples often must be limited due to the costs associated with procuring, preparing, and storing the data and/or the computational costs associated with learning from the data. One question of practical importance is: if n training examples are going to be selected, in what proportion should the classes be represented? In this article we analyze the relationship between the marginal class distribution of training data and the performance of classification trees induced from these data, when the size of the training set is fixed. We study twenty-six data sets and, for each, determine the best class distribution for learning. Our results show that, for a fixed number of training examples, it is often possible to obtain improved classifier performance by training with a class distribution other than the naturally occurring class distribution. For example, we show that to build a classifier robust to different misclassification costs, a balanced class distribution generally performs quite well. We also describe and evaluate a budget-sensitive progressive-sampling algorithm that selects training examples such that the resulting training set has a good (near-optimal) class distribution for learning.
A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data
2004
Cited by 108 (0 self)
There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance, in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real-world data describing an infrequent but important event, the learning system may have difficulty learning the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods, Smote + Tomek and Smote + ENN, deal with these conditions directly, allying a known oversampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, oversampling methods provide more accurate results than undersampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Smote + Tomek and Smote + ENN presented very good results for data sets with a small number of positive examples. Moreover, Random oversampling, a very simple oversampling method, is very competitive with more complex oversampling methods. Since the oversampling methods provided very good performance results, we also measured the syntactic complexity of decision trees induc...
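As a rough illustration of the random oversampling baseline mentioned in the abstract above, the following Python sketch duplicates randomly chosen minority-class examples until both classes are the same size. Balancing to exactly equal counts, the function name, and the toy data are assumptions made for this sketch; the paper's exact procedure may differ.

```python
import random


def random_oversample(examples, labels, minority_label, seed=0):
    """Duplicate random minority examples until the classes are balanced."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no examples carry the minority label")
    # Number of duplicates needed to match the majority class size.
    shortfall = max(0, len(majority) - len(minority))
    extra = [rng.choice(minority) for _ in range(shortfall)]
    keep = list(range(len(labels))) + extra
    return [examples[i] for i in keep], [labels[i] for i in keep]


# Hypothetical toy data: four majority examples, one minority example.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

No new information is created here: the original examples are all kept and the added rows are exact copies, which is what distinguishes this baseline from interpolating methods such as Smote.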
RainForest - A Framework for Fast Decision Tree Construction of Large Datasets
In VLDB, 1998
Cited by 107 (9 self)
Classification of large datasets is an important data mining problem. Many classification algorithms have been proposed in the literature, but studies have shown that so far no algorithm uniformly outperforms all other algorithms in terms of quality. In this paper, we present a unifying framework for decision tree classifiers that separates the scalability aspects of algorithms for constructing a decision tree from the central features that determine the quality of the tree. This generic algorithm is easy to instantiate with specific algorithms from the literature (including C4.5, CART,
BOAT - Optimistic Decision Tree Construction
1999
Cited by 105 (2 self)
Classification is an important data mining problem. Given a training database of records, each tagged with a class label, the goal of classification is to build a concise model that can be used to predict the class label of future, unlabeled records. A very popular class of classifiers are decision trees. All current algorithms to construct decision trees, including all main-memory algorithms, make one scan over the training database per level of the tree. We introduce a new algorithm (BOAT) for decision tree construction that improves upon earlier algorithms in both performance and functionality. BOAT constructs several levels of the tree in only two scans over the training database, resulting in an average performance gain of 300% over previous work. The key to this performance improvement is a novel optimistic approach to tree construction in which we construct an initial tree using a small subset of the data and refine it to arrive at the final tree. We guarantee that any differen...
The Effect of Class Distribution on Classifier Learning: An Empirical Study
2001
Cited by 86 (2 self)
In this article we analyze the effect of class distribution on classifier learning. We begin by describing the different ways in which class distribution affects learning and how it affects the evaluation of learned classifiers. We then present the results of two comprehensive experimental studies. The first study compares the performance of classifiers generated from unbalanced data sets with the performance of classifiers generated from balanced versions of the same data sets. This comparison allows us to isolate and quantify the effect that the training set's class distribution has on learning and contrast the performance of the classifiers on the minority and majority classes. The second study assesses what distribution is "best" for training, with respect to two performance measures: classification accuracy and the area under the ROC curve (AUC). A tacit assumption behind much research on classifier induction is that the class distribution of the training data should match the "natural" distribution of the data. This study shows that the naturally occurring class distribution often is not best for learning, and often substantially better performance can be obtained by using a different class distribution. Understanding how classifier performance is affected by class distribution can help practitioners to choose training data; in real-world situations the number of training examples often must be limited due to computational costs or the costs associated with procuring and preparing the data.
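The area under the ROC curve (AUC) used as a performance measure in the abstract above can be computed directly from classifier scores. The Python sketch below uses the standard pairwise interpretation: the AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name and the toy scores are hypothetical, and the quadratic loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three positives and two negatives.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 1, 0, 0]
value = auc(scores, labels)  # 5 of the 6 pairs are ordered correctly
```

Because the measure depends only on the ordering of the scores, it does not change when the decision threshold moves, which makes it a natural companion to accuracy when comparing class distributions.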
Support Vector Machines for Classification in Nonstandard Situations
Machine Learning, 2000
Cited by 79 (15 self)
The majority of classification algorithms are developed for the standard situation in which it is assumed that the examples in the training set come from the same distribution as that of the target population, and that the costs of misclassification into different classes are the same. However, these assumptions are often violated in real-world settings. For some classification methods, this can often be taken care of simply with a change of threshold; for others, additional effort is required. In this paper, we explain why the standard support vector machine is not suitable for the nonstandard situation, and introduce a simple procedure for adapting the support vector machine methodology to the nonstandard situation. Theoretical justification for the procedure is provided. Simulation study illustrates that the modified support vector machine significantly improves upon the standard support vector machine in the nonstandard situation. The computational load of the proposed procedure is th...
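The "change of threshold" remark in the abstract above can be made concrete for classifiers that output a class probability. The Python sketch below is not the paper's SVM procedure; it shows only the standard decision-theoretic rule under assumed costs. If a false positive costs cost_fp, a false negative costs cost_fn, and correct decisions cost nothing, then predicting positive has expected cost (1 - p) * cost_fp and predicting negative has expected cost p * cost_fn, so positive is the cheaper choice once p >= cost_fp / (cost_fp + cost_fn). All names and cost values are hypothetical.

```python
def cost_threshold(cost_fp, cost_fn):
    """Probability above which predicting positive has lower expected cost."""
    return cost_fp / (cost_fp + cost_fn)


def classify(p_positive, cost_fp, cost_fn):
    """Return 1 (positive) or 0 (negative), minimizing expected cost."""
    return 1 if p_positive >= cost_threshold(cost_fp, cost_fn) else 0


# Hypothetical costs: missing a positive is four times as costly.
t = cost_threshold(1.0, 4.0)           # 1 / (1 + 4)
pred = classify(0.3, 1.0, 4.0)         # 0.3 is above the lowered threshold
pred_equal = classify(0.3, 1.0, 1.0)   # equal costs give the usual 0.5 cut
```

The same example is labeled positive under unequal costs and negative under equal costs, which is the whole effect of the threshold change.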
Statistical Fraud Detection: A Review
2002
Cited by 72 (0 self)
Fraud is increasing dramatically with the expansion of modern technology and the global superhighways of communication, resulting in the loss of billions of dollars worldwide each year. Although prevention technologies are the best way of reducing fraud, fraudsters are adaptive and, given time, will usually find ways to circumvent such measures. Methodologies for the detection of fraud are essential if we are to catch fraudsters once fraud prevention has failed. Statistics and machine learning provide effective technologies for fraud detection and have been applied successfully to detect activities such as money laundering, e-commerce credit card fraud, telecommunication fraud, and computer intrusion, to name but a few. We describe the tools available for statistical fraud detection and the areas in which fraud detection technologies are most used. Keywords: Fraud detection, fraud prevention, statistics, machine learning, money laundering, computer intrusion, e-commerce, credit cards, telecommunications. Author's note: Richard J. Bolton is Research Associate and David J. Hand Professor of Statistics, Department of Mathematics, Imperial College, 180 Queen's Gate, London SW7 2BZ, UK. Contact email: {r.bolton, d.j.hand}@ic.ac.uk