Results 1  10
of
89
Additive Logistic Regression: a Statistical View of Boosting
 Annals of Statistics
, 1998
"... Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and t ..."
Abstract

Cited by 1217 (21 self)
 Add to MetaCart
Boosting (Freund & Schapire 1996, Schapire & Singer 1998) is one of the most important recent developments in classification methodology. The performance of many classification algorithms can often be dramatically improved by sequentially applying them to reweighted versions of the input data, and taking a weighted majority vote of the sequence of classifiers thereby produced. We show that this seemingly mysterious phenomenon can be understood in terms of well known statistical principles, namely additive modeling and maximum likelihood. For the twoclass problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion. We develop more direct approximations and show that they exhibit nearly identical results to boosting. Direct multiclass generalizations based on multinomial likelihood are derived that exhibit performance comparable to other recently proposed multiclass generalizations of boosting in most...
Fisher Discriminant Analysis With Kernels
, 1999
"... A nonlinear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space. The linear classification in feature space corresponds to a (powerful) nonlinear decision functi ..."
Abstract

Cited by 312 (15 self)
 Add to MetaCart
A nonlinear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick which allows the efficient computation of Fisher discriminant in feature space. The linear classification in feature space corresponds to a (powerful) nonlinear decision function in input space. Large scale simulations demonstrate the competitiveness of our approach.
Classification by pairwise coupling
, 1998
"... We discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the BradleyTerry method for paired comparisons. We study the nature of the class probability estim ..."
Abstract

Cited by 273 (0 self)
 Add to MetaCart
We discuss a strategy for polychotomous classification that involves estimating class probabilities for each pair of classes, and then coupling the estimates together. The coupling model is similar to the BradleyTerry method for paired comparisons. We study the nature of the class probability estimates that arise, and examine the performance of the procedure in real and simulated datasets. Classifiers used include linear discriminants, nearest neighbors, and the support vector machine.
A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirtythree Old and New Classification Algorithms
, 2000
"... . Twentytwo decision tree, nine statistical, and two neural network algorithms are compared on thirtytwo datasets in terms of classication accuracy, training time, and (in the case of trees) number of leaves. Classication accuracy is measured by mean error rate and mean rank of error rate. Both cr ..."
Abstract

Cited by 167 (7 self)
 Add to MetaCart
. Twentytwo decision tree, nine statistical, and two neural network algorithms are compared on thirtytwo datasets in terms of classication accuracy, training time, and (in the case of trees) number of leaves. Classication accuracy is measured by mean error rate and mean rank of error rate. Both criteria place a statistical, splinebased, algorithm called Polyclass at the top, although it is not statistically signicantly dierent from twenty other algorithms. Another statistical algorithm, logistic regression, is second with respect to the two accuracy criteria. The most accurate decision tree algorithm is Quest with linear splits, which ranks fourth and fth, respectively. Although splinebased statistical algorithms tend to have good accuracy, they also require relatively long training times. Polyclass, for example, is third last in terms of median training time. It often requires hours of training compared to seconds for other algorithms. The Quest and logistic regression algor...
Discriminant Analysis by Gaussian Mixtures
 Journal of the Royal Statistical Society, Series B
, 1996
"... FisherRao linear discriminant analysis (LDA) is a valuable tool for multigroup classification. LDA is equivalent to maximum likelihood classification assuming Gaussian distributions for each class. In this paper, we fit Gaussian mixtures to each class to facilitate effective classification in nonn ..."
Abstract

Cited by 149 (10 self)
 Add to MetaCart
FisherRao linear discriminant analysis (LDA) is a valuable tool for multigroup classification. LDA is equivalent to maximum likelihood classification assuming Gaussian distributions for each class. In this paper, we fit Gaussian mixtures to each class to facilitate effective classification in nonnormal settings, especially when the classes are clustered. Low dimensional views are an important byproduct of LDAour new techniques inherit this feature. We are able to control the withinclass spread of the subclass centers relative to the betweenclass spread. Our technique for fitting these models permits a natural blend with nonparametric versions of LDA. Keywords: Classification, Pattern Recognition, Clustering, Nonparametric, Penalized. 1 Introduction In the generic classification or discrimination problem, the outcome of interest G falls into J unordered classes, which for convenience we denote by the set J = f1; 2; 3; \Delta \Delta \Delta Jg. We wish to build a rule for pred...
Model Selection and the Principle of Minimum Description Length
 Journal of the American Statistical Association
, 1998
"... This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This ..."
Abstract

Cited by 145 (5 self)
 Add to MetaCart
This paper reviews the principle of Minimum Description Length (MDL) for problems of model selection. By viewing statistical modeling as a means of generating descriptions of observed data, the MDL framework discriminates between competing models based on the complexity of each description. This approach began with Kolmogorov's theory of algorithmic complexity, matured in the literature on information theory, and has recently received renewed interest within the statistics community. In the pages that follow, we review both the practical as well as the theoretical aspects of MDL as a tool for model selection, emphasizing the rich connections between information theory and statistics. At the boundary between these two disciplines, we find many interesting interpretations of popular frequentist and Bayesian procedures. As we will see, MDL provides an objective umbrella under which rather disparate approaches to statistical modeling can coexist and be compared. We illustrate th...
Polynomial Splines and Their Tensor Products in Extended Linear Modeling
 Ann. Statist
, 1997
"... ANOVA type models are considered for a regression function or for the logarithm of a probability function, conditional probability function, density function, conditional density function, hazard function, conditional hazard function, or spectral density function. Polynomial splines are used to m ..."
Abstract

Cited by 142 (14 self)
 Add to MetaCart
ANOVA type models are considered for a regression function or for the logarithm of a probability function, conditional probability function, density function, conditional density function, hazard function, conditional hazard function, or spectral density function. Polynomial splines are used to model the main effects, and their tensor products are used to model any interaction components that are included. In the special context of survival analysis, the baseline hazard function is modeled and nonproportionality is allowed. In general, the theory involves the L 2 rate of convergence for the fitted model and its components. The methodology involves least squares and maximum likelihood estimation, stepwise addition of basis functions using Rao statistics, stepwise deletion using Wald statistics, and model selection using BIC, crossvalidation or an independent test set. Publically available software, written in C and interfaced to S/SPLUS, is used to apply this methodology to...
Penalized Discriminant Analysis
 Annals of Statistics
, 1995
"... Fisher's linear discriminant analysis (LDA) is a popular dataanalytic tool for studying the relationship between a set of predictors and a categorical response. In this paper we describe a penalized version of LDA. It is designed for situations in which there are many highly correlated predictors, ..."
Abstract

Cited by 131 (9 self)
 Add to MetaCart
Fisher's linear discriminant analysis (LDA) is a popular dataanalytic tool for studying the relationship between a set of predictors and a categorical response. In this paper we describe a penalized version of LDA. It is designed for situations in which there are many highly correlated predictors, such as those obtained by discretizing a function, or the greyscale values of the pixels in a series of images. In cases such as these it is natural, efficient, and sometimes essential to impose a spatial smoothness constraint on the coefficients, both for improved prediction performance and interpretability. We cast the classification problem into a regression framework via optimal scoring. Using this, our proposal facilitates the use of any penalized regression technique in the classification setting. The technique is illustrated with examples in speech recognition and handwritten character recognition. AMS 1991 Classifications: Primary 62H30, Secondary 62G07 1 Introduction Linear discrim...
Hazard Regression
 Journal of the American Statistical Association
, 1995
"... An automatic procedure that uses linear splines and their tensor products is proposed for tting a regression model to data involving a polychotomous response variable and one or more predictors. The tted model can be used for multiple classi cation. The automatic tting procedure involves maximum lik ..."
Abstract

Cited by 80 (19 self)
 Add to MetaCart
An automatic procedure that uses linear splines and their tensor products is proposed for tting a regression model to data involving a polychotomous response variable and one or more predictors. The tted model can be used for multiple classi cation. The automatic tting procedure involves maximum likelihood estimation, stepwise addition, stepwise deletion, and model selection by AIC, crossvalidation or an independent test set. A modi ed version of the algorithm has been constructed that is applicable to large data sets, and it is illustrated using a phoneme recognition data set with 250,000 cases, 45 classes and 63 predictors.
Efficient and Robust Feature Extraction by Maximum Margin Criterion
 In Advances in Neural Information Processing Systems 16
, 2003
"... In pattern recognition, feature extraction techniques are widely employed to reduce the dimensionality of data and to enhance the discriminatory information. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two most popular linear dimensionality reduction methods. Howev ..."
Abstract

Cited by 43 (4 self)
 Add to MetaCart
In pattern recognition, feature extraction techniques are widely employed to reduce the dimensionality of data and to enhance the discriminatory information. Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) are two most popular linear dimensionality reduction methods. However, PCA is not very effective for the extraction of the most discriminant features and LDA is not stable due to the small sample size problem. In this paper, we propose some new (linear and nonlinear) feature extractors based on maximum margin criterion (MMC). Geometrically, feature extractors based on MMC maximize the (average) margin between classes after dimensionality reduction. It is shown that MMC can represent class separability better than PCA. As a connection to LDA, we may also derive LDA from MMC by incorporating some constraints. By using some other constraints, we establish a new linear feature extractor that does not suffer from the small sample size problem, which is known to cause serious stability problems for LDA. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in the paper. Our extensive experiments demonstrate that the new feature extractors are effective, stable, and efficient.