Statistical analysis of some multicategory large margin classification methods
 Journal of Machine Learning Research
, 2004
Cited by 45 (2 self)
The purpose of this paper is to investigate statistical properties of risk minimization based multicategory classification methods. These methods can be considered as natural extensions of binary large margin classification. We establish conditions that guarantee the consistency of classifiers obtained in the risk minimization framework with respect to the classification error. Examples are provided for four specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information can be useful for statistical inferencing tasks beyond classification.
Boosting with early stopping: convergence and consistency
 Annals of Statistics
, 2003
Cited by 42 (6 self)
Boosting is one of the most significant advances in machine learning for classification and regression. In its original and computationally flexible version, boosting seeks to minimize empirically a loss function in a greedy fashion. The resulting estimator takes an additive function form and is built iteratively by applying a base estimator (or learner) to updated samples depending on the previous iterations. An unusual regularization technique, early stopping, is employed based on CV or a test set. This paper studies numerical convergence, consistency, and statistical rates of convergence of boosting with early stopping, when it is carried out over the linear span of a family of basis functions. For general loss functions, we prove the convergence of boosting's greedy optimization to the infimum of the loss function over the linear span. Using the numerical convergence result, we find early stopping strategies under which boosting is shown to be consistent based on iid samples, and we obtain bounds on the rates of convergence for boosting estimators. Simulation studies are also presented to illustrate the relevance of our theoretical results for providing insights to practical aspects of boosting. As a side product, these results also reveal the importance of restricting the greedy search step sizes, as known in practice through the works of Friedman and others. Moreover, our results lead to a rigorous proof that for a linearly separable problem, AdaBoost with ε → 0 step size becomes an L1-margin maximizer when left to run to convergence.
1 Introduction
In this paper we consider boosting algorithms for classification and regression. These algorithms represent one of the major advances in machine learning. In their original version, the computational aspect is explicitly specified as part of the estimator/algorithm. That is, the empirical minimization of an appropriate loss function is carried out in a greedy fashion, which means that at each step, a basis function that leads to the largest reduction of empirical risk is added into the estimator. This specification distinguishes boosting from other statistical procedures which are defined by an empirical minimization of a loss function without the numerical optimization details.
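The greedy procedure described in this entry can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes squared loss, takes the basis functions to be the raw feature coordinates, and stops when the error on a held-out validation set has not improved for a few steps. The names `greedy_boost`, `max_step_size` and `patience` are hypothetical and chosen only for illustration.

```python
def mean_squared_error(coefs, X, y):
    """Average squared error of the linear combination sum_j coefs[j] * X[i][j]."""
    total = 0.0
    for row, target in zip(X, y):
        pred = sum(c * v for c, v in zip(coefs, row))
        total += (pred - target) ** 2
    return total / len(y)


def greedy_boost(X_train, y_train, X_val, y_val,
                 max_steps=200, max_step_size=0.1, patience=5):
    """Greedy least-squares boosting over coordinate basis functions (assumed setup)."""
    n_features = len(X_train[0])
    columns = [[row[j] for row in X_train] for j in range(n_features)]
    coefs = [0.0] * n_features
    best_coefs = list(coefs)
    best_val = mean_squared_error(coefs, X_val, y_val)
    steps_since_best = 0

    for _ in range(max_steps):
        # Residual of the current additive model on the training sample.
        residual = [t - sum(c * v for c, v in zip(coefs, row))
                    for row, t in zip(X_train, y_train)]

        # Greedy step: pick the basis function whose (restricted) step
        # gives the largest reduction of the empirical squared loss.
        best_j, best_step, best_gain = None, 0.0, 0.0
        for j, col in enumerate(columns):
            norm = sum(v * v for v in col)
            if norm == 0.0:
                continue
            inner = sum(r * v for r, v in zip(residual, col))
            # Restrict the step size by clipping the optimal step.
            step = max(-max_step_size, min(max_step_size, inner / norm))
            gain = 2.0 * step * inner - step * step * norm
            if gain > best_gain:
                best_j, best_step, best_gain = j, step, gain
        if best_j is None:
            break  # no basis function reduces the empirical loss
        coefs[best_j] += best_step

        # Early stopping: keep the coefficients with the lowest validation error.
        val = mean_squared_error(coefs, X_val, y_val)
        if val < best_val:
            best_val, best_coefs, steps_since_best = val, list(coefs), 0
        else:
            steps_since_best += 1
            if steps_since_best >= patience:
                break
    return best_coefs
```

With a small `max_steps` the returned model is deliberately under-fitted, which is the regularizing effect of early stopping; clipping the step plays the role of the restricted step size mentioned in the abstract.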
Boosting algorithms: Regularization, prediction and model fitting
 Statistical Science
, 2007
Cited by 38 (5 self)
We present a statistical perspective on boosting. Special emphasis is given to estimating potentially complex parametric or nonparametric models, including generalized linear and additive models as well as regression models for survival analysis. Concepts of degrees of freedom and corresponding Akaike or Bayesian information criteria, particularly useful for regularization and variable selection in high-dimensional covariate spaces, are discussed as well. The practical aspects of boosting procedures for fitting statistical models are illustrated by means of the dedicated open-source software package mboost. This package implements functions which can be used for model fitting, prediction and variable selection. It is flexible, allowing for the implementation of new boosting algorithms optimizing user-specified loss functions.
Key words and phrases: Generalized linear models, generalized additive models, gradient boosting, survival analysis, variable selection, software.
Statistical analysis of Bayes optimal subset ranking
 IEEE Transactions on Information Theory
, 2008
Cited by 23 (0 self)
The ranking problem has become increasingly important in modern applications of statistical methods in automated decision making systems. In particular, we consider a formulation of the statistical ranking problem which we call subset ranking, and focus on the DCG (discounted cumulated gain) criterion that measures the quality of items near the top of the rank-list. Similar to error minimization for binary classification, direct optimization of natural ranking criteria such as DCG leads to nonconvex optimization problems that can be NP-hard. Therefore a computationally more tractable approach is needed. We present bounds that relate the approximate optimization of DCG to the approximate minimization of certain regression errors. These bounds justify the use of convex learning formulations for solving the subset ranking problem. The resulting estimation methods are not conventional, in that we focus on the estimation quality in the top portion of the rank-list. We further investigate the asymptotic statistical behavior of these formulations. Under appropriate conditions, the consistency of the estimation schemes with respect to the DCG metric can be derived.
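For readers unfamiliar with the DCG criterion named in this entry, a minimal sketch follows. The abstract does not give the exact gain and discount used in the paper, so the code assumes one common convention: the gain is the raw relevance grade and the discount is 1 / log2(1 + rank). The helper names `dcg` and `rank_by_score` are hypothetical.

```python
import math


def dcg(relevances, k=None):
    """DCG of a ranked list, assuming discount 1 / log2(1 + rank) with rank starting at 1.

    Only the top k items contribute, which is why the criterion emphasizes
    the top of the list.
    """
    if k is None:
        k = len(relevances)
    return sum(rel / math.log2(1 + rank)
               for rank, rel in enumerate(relevances[:k], start=1))


def rank_by_score(scores, relevances):
    """Order the true relevance grades by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [relevances[i] for i in order]
```

Ranking items by a regression estimate of their relevance and then scoring the result with `dcg` mirrors, in a simplified form, the link between regression error and DCG that the abstract describes.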
AdaBoost is consistent
 In Advances in Neural Information Processing Systems
, 2006
Cited by 22 (0 self)
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n^(1−ε) iterations, for sample size n and ε ∈ (0,1), the sequence of risks of the classifiers it produces approaches the Bayes risk.
On surrogate loss functions and f-divergences
Cited by 10 (2 self)
The goal in the binary classification problem is to estimate a discriminant function γ from observations of covariate vectors and corresponding binary labels. We consider an elaboration of this problem in which the covariates are not available directly, but are transformed by a dimensionality-reducing quantizer Q. We present conditions on loss functions such that empirical risk minimization yields Bayes consistency when both the discriminant function and the quantizer are estimated. These conditions are stated in terms of a general correspondence between loss functions and a class of functionals known as Ali-Silvey or f-divergence functionals. For the 0-1 loss, this correspondence was established by Blackwell [3]; we extend the correspondence to the broader class of surrogate loss functions that play a key role in the general theory of Bayes consistency for binary classification. Our result makes it possible to pick out the (strict) subset of surrogate loss functions that yield Bayes consistency for joint estimation of the discriminant function and the quantizer.
An Infinity-sample Theory For Multicategory Large Margin Classification
 Advances in Neural Information Processing
, 2004
Cited by 4 (0 self)
The purpose of this paper is to investigate infinity-sample properties of risk minimization based multicategory classification methods. These methods can be considered as natural extensions to binary large margin classification. We establish conditions that guarantee the infinity-sample consistency of classifiers obtained in the risk minimization framework. Examples are provided for two specific forms of the general formulation, which extend a number of known methods. Using these examples, we show that some risk minimization formulations can also be used to obtain conditional probability estimates for the underlying problem. Such conditional probability information will be useful for statistical inferencing tasks beyond classification.
Discriminative Methods for Label Sequence Learning
, 2005
Cited by 3 (1 self)
The discriminative learning framework is one of the most successful fields of machine learning. The methods of this paradigm, such as Boosting and Support Vector Machines, have significantly advanced the state-of-the-art for classification by improving the accuracy and by increasing the applicability of machine learning methods. One of the key benefits of these methods is their ability to learn efficiently in high dimensional feature spaces, either by the use of implicit data representations via kernels or by explicit feature induction. However, traditionally these methods do not exploit dependencies between class labels where more than one label is predicted. Many real-world classification problems involve sequential, temporal or structural dependencies between multiple labels. The goal of this research is to generalize discriminative learning methods for such scenarios. In particular, we focus on label sequence learning. Label sequence learning is the problem of inferring a state sequence from an observation sequence, where the state sequence may encode a labeling, an annotation or a segmentation of the sequence. Prominent examples include part-of-speech tagging, named entity classification, information extraction, continuous speech recognition, and secondary protein structure prediction. In this thesis, we present three novel discriminative methods that are generalizations of AdaBoost and multiclass Support Vector Machines (SVM) and a Gaussian Process formulation for label sequence learning. These techniques combine the efficiency of dynamic programming methods with the advantages of the state-of-the-art learning methods. We present theoretical analysis and experimental evaluations on pitch accent prediction, named entity recognition and part-of-speech tagging which demonstrate the advantages over classical approaches like Hidden Markov Models as well as the state-of-the-art methods like Conditional Random Fields.
Data Dependent Risk Bounds for Hierarchical Mixture of Experts Classifiers
, 2004
Cited by 3 (0 self)
The hierarchical mixture of experts architecture provides a flexible procedure for implementing classification algorithms. The classification is obtained by a recursive soft partition of the feature space in a data-driven fashion. Such a procedure enables local classification where several experts are used, each of which is assigned the task of classification over some subspace of the feature space. In this work, we provide data-dependent generalization error bounds for this class of models, which lead to effective procedures for performing model selection.