Results 1–10 of 28
Boosting with early stopping: convergence and consistency
 Annals of Statistics, 2003
Abstract

Cited by 64 (8 self)
Abstract Boosting is one of the most significant advances in machine learning for classification and regression. In its original and computationally flexible version, boosting seeks to minimize a loss function empirically in a greedy fashion. The resulting estimator takes an additive functional form and is built iteratively by applying a base estimator (or learner) to samples updated according to the previous iterations. An unusual regularization technique, early stopping, is employed based on cross-validation or a test set. This paper studies the numerical convergence, consistency, and statistical rates of convergence of boosting with early stopping when it is carried out over the linear span of a family of basis functions. For general loss functions, we prove the convergence of boosting's greedy optimization to the infimum of the loss function over the linear span. Using the numerical convergence result, we find early stopping strategies under which boosting is shown to be consistent based on i.i.d. samples, and we obtain bounds on the rates of convergence for boosting estimators. Simulation studies are also presented to illustrate the relevance of our theoretical results for providing insights into practical aspects of boosting. As a side product, these results also reveal the importance of restricting the greedy search step sizes, as known in practice through the works of Friedman and others. Moreover, our results lead to a rigorous proof that for a linearly separable problem, AdaBoost with step size ε → 0 becomes an L1-margin maximizer when left to run to convergence.

1 Introduction. In this paper we consider boosting algorithms for classification and regression. These algorithms represent one of the major advances in machine learning. In their original version, the computational aspect is explicitly specified as part of the estimator/algorithm. That is, the empirical minimization of an appropriate loss function is carried out in a greedy fashion: at each step, the basis function that leads to the largest reduction of empirical risk is added to the estimator. This specification distinguishes boosting from other statistical procedures, which are defined by the empirical minimization of a loss function without the numerical optimization details.
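The greedy, small-step additive fitting with early stopping described above can be sketched for squared loss over a coordinate basis (a minimal illustration in the spirit of ε-boosting, not the paper's exact algorithm; the function name and all constants are ours):

```python
import numpy as np

def boost_early_stop(X_tr, y_tr, X_va, y_va, eps=0.05, max_iter=2000, patience=50):
    """Greedy stagewise fitting for squared loss: at each step, move eps
    along the basis function (here, a column of X) most correlated with
    the current residual; stop when validation loss stops improving."""
    beta = np.zeros(X_tr.shape[1])
    best_beta, best_loss, since_best = beta.copy(), np.inf, 0
    for _ in range(max_iter):
        resid = y_tr - X_tr @ beta
        corr = X_tr.T @ resid                 # negative gradient of squared loss
        j = int(np.argmax(np.abs(corr)))      # greedy basis selection
        beta[j] += eps * np.sign(corr[j])     # restricted (small) step size
        va_loss = float(np.mean((y_va - X_va @ beta) ** 2))
        if va_loss < best_loss:
            best_loss, best_beta, since_best = va_loss, beta.copy(), 0
        else:
            since_best += 1
            if since_best >= patience:        # early stopping
                break
    return best_beta, best_loss

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
truth = np.zeros(20)
truth[:3] = [2.0, -1.5, 1.0]
y = X @ truth + rng.normal(scale=0.5, size=200)
beta, va_loss = boost_early_stop(X[:150], y[:150], X[150:], y[150:])
```

The small fixed step size `eps` plays the role of the restricted greedy step the abstract highlights, and the held-out set supplies the early-stopping rule.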
A scalability analysis of classifiers in text categorization
 In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2003
Abstract

Cited by 61 (4 self)
Real-world applications of text categorization often require a system to deal with tens of thousands of categories defined over a large taxonomy. This paper addresses the problem with respect to a set of popular algorithms in text categorization, including Support Vector Machines, k-nearest neighbor (kNN), ridge regression, linear least-squares fit, and logistic regression. By providing a formal analysis of the computational complexity of each classification method, followed by an investigation of the usage of different classifiers in a hierarchical setting of categorization, we show how the scalability of a method depends on the topology of the hierarchy and the category distributions. In addition, we obtain tight bounds for the complexities by using the power law to approximate category distributions over a hierarchy. Experiments with kNN and SVM classifiers on the OHSUMED corpus are reported as concrete examples.
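As a toy illustration of why power-law category distributions matter for scalability (our sketch, not the paper's analysis; all names and constants are ours): if the category at rank r has roughly n_max · r^(−alpha) training documents and a local classifier costs n^p to train, the total cost is a power series whose growth depends on alpha and p:

```python
import numpy as np

def total_training_cost(num_categories, alpha, p, n_max=10_000):
    """Total cost of training one local classifier per category when the
    category at rank r has ~ n_max * r**(-alpha) documents and a single
    classifier costs n**p to train (all constants illustrative)."""
    ranks = np.arange(1, num_categories + 1)
    sizes = n_max * ranks.astype(float) ** (-alpha)   # power-law category sizes
    return float(np.sum(sizes ** p))

flat = total_training_cost(10_000, alpha=0.0, p=2)    # all categories equally large
zipf = total_training_cost(10_000, alpha=1.0, p=2)    # Zipf-like category sizes
```

Under the Zipf-like distribution, the sum over r^(−2) converges, so the large head categories dominate and the total cost is orders of magnitude below the uniform case.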
Bayesian adaptive user profiling with explicit & implicit feedback
 In Conference on Information and Knowledge Management, 2006
Abstract

Cited by 28 (4 self)
Research in information retrieval is now moving into a personalized scenario where a retrieval or filtering system maintains a separate user profile for each user. In this framework, information delivered to the user can be automatically personalized and catered to an individual user's information needs. However, a practical concern for such a personalized system is the "cold start" problem: any user new to the system must endure poor initial performance until sufficient feedback from that user is provided. To solve this problem, we use both explicit and implicit feedback to build a user's profile and use Bayesian hierarchical methods to borrow information from existing users. We analyze the usefulness of implicit feedback and the adaptive performance of the model on two data sets gathered from user studies in which users' interaction with a document, or implicit feedback, was recorded along with explicit feedback. Our results are twofold: first, we demonstrate that the Bayesian modeling approach effectively trades off between shared and user-specific information, alleviating poor initial performance for each user. Second, we find that implicit feedback has very limited and unstable predictive value by itself, and only marginal value when combined with explicit feedback.
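The trade-off between shared and user-specific information can be illustrated with a toy normal-normal hierarchical posterior mean (our sketch, not the paper's model; the function name and variances are illustrative):

```python
def shrunk_profile(user_obs, pop_mean, obs_var=1.0, prior_var=1.0):
    """Posterior mean of a user-level parameter in a toy normal-normal
    hierarchy: with little user feedback, lean on the population mean;
    with lots of feedback, lean on the user's own data."""
    n = len(user_obs)
    if n == 0:
        return pop_mean                       # cold start: borrow entirely
    user_mean = sum(user_obs) / n
    w = (n / obs_var) / (n / obs_var + 1.0 / prior_var)
    return w * user_mean + (1.0 - w) * pop_mean
```

With no feedback the estimate is the population mean; as feedback accumulates, the weight w approaches 1 and the estimate converges to the user's own mean, which is the shrinkage behavior the abstract describes.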
Author Identification on the Large Scale
 In Proc. of the Meeting of the Classification Society of North America, 2005
Abstract

Cited by 21 (0 self)
This paper is on techniques for identifying authors in large collections of textual artifacts (emails, communiqués, transcribed speech, etc.). Our approach focuses on very high-dimensional, topic-free document representations and particular attribution problems, such as: (1) Which one of these K authors wrote this particular document? (2) Did any of these K authors write this particular document? Scientific investigation into measuring the style and authorship of texts goes back to the late nineteenth century, with the pioneering studies of Mendenhall [36] and Mascol [34, 35] on distributions of sentence and word lengths in works of literature and the gospels of the New Testament. The underlying notion was that works by different authors are strongly distinguished by quantifiable features of the text. By the mid-twentieth century, this line of research had grown into what became known as "stylometrics", and a variety of textual statistics had been proposed to quantify textual style. Early work was characterized by a search for invariant properties of textual statistics, such as Zipf's distribution and Yule's K statistic
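One of the classical textual statistics mentioned above, Yule's K, is easy to state concretely: with N tokens and V_i word types occurring exactly i times, K = 10^4 (Σ_i i² V_i − N) / N². A small sketch (the function name and example text are ours):

```python
from collections import Counter

def yules_k(tokens):
    """Yule's K statistic from a token list: a repeat-rate measure of
    vocabulary concentration, roughly independent of text length."""
    N = len(tokens)
    freqs = Counter(tokens)                   # word -> count
    spectrum = Counter(freqs.values())        # count i -> number of types V_i
    s = sum(i * i * v for i, v in spectrum.items())
    return 10_000 * (s - N) / (N * N)

text = "the cat sat on the mat the cat".split()
k = yules_k(text)
```

Here N = 8, the spectrum is {3: 1, 2: 1, 1: 3}, so K = 10^4 · (16 − 8) / 64 = 1250.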
Bayesian Multinomial Logistic Regression for Author Identification
 In Maxent Conference, 2005
Abstract

Cited by 17 (0 self)
Motivated by high-dimensional applications in authorship attribution, we describe a Bayesian multinomial logistic regression model together with an associated learning algorithm.
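A minimal sketch of the general model class (ours; the paper's specific prior and algorithm are not reproduced here): multinomial logistic regression with a Gaussian prior on the weights, fit by gradient ascent on the log-posterior:

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_map(X, y, num_classes, prior_var=1.0, lr=0.1, steps=500):
    """Gradient ascent on the log-posterior: average multinomial
    log-likelihood plus a Gaussian N(0, prior_var) prior on each weight."""
    n, d = X.shape
    W = np.zeros((d, num_classes))
    Y = np.eye(num_classes)[y]                # one-hot labels
    for _ in range(steps):
        P = softmax(X @ W)
        grad = X.T @ (Y - P) / n - W / prior_var
        W += lr * grad
    return W

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = (X[:, 0] + 0.3 * rng.normal(size=300) > 0).astype(int)  # two "authors"
W = fit_map(X, y, num_classes=2)
acc = float(np.mean(softmax(X @ W).argmax(axis=1) == y))
```

The prior term acts as the regularizer that makes such models workable in the very high-dimensional feature spaces authorship problems produce.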
Multi-label Classification with Meta-level Features
Abstract

Cited by 14 (2 self)
Effective learning in multi-label classification (MLC) requires an appropriate level of abstraction for representing the relationship between each instance and multiple categories. Current MLC methods have focused on learning to map from instances to ranked lists of categories in a relatively high-dimensional space. The fine-grained features in such a space may not be sufficiently expressive for characterizing discriminative patterns and, worse, make the model complexity unnecessarily high. This paper proposes an alternative approach: transforming conventional representations of instances and categories into a relatively small set of link-based meta-level features, and leveraging successful learning-to-rank retrieval algorithms (e.g., SVM-MAP) over this reduced feature space. Controlled experiments on multiple benchmark datasets show strong empirical evidence for the strength of the proposed approach, as it significantly outperformed several state-of-the-art methods, including RankSVM, ML-kNN and
A Stagewise Least Square Loss Function for Classification
Abstract

Cited by 2 (2 self)
This paper presents a stagewise least square (SLS) loss function for classification. It uses a least-squares form within each stage to approximate a bounded, monotonic, non-convex loss function in a stagewise manner. Several benefits are obtained from using the SLS loss function: (i) higher generalization accuracy and better scalability than the classical least-squares loss; (ii) improved performance and robustness compared with convex losses (e.g., the hinge loss of SVMs); (iii) computational advantages compared with non-convex losses (e.g., the ramp loss in ψ-learning); (iv) the ability to resist the myopia of empirical risk minimization and to boost the margin without boosting the complexity of the classifier. In addition, it naturally results in a kernel machine that is as sparse as an SVM, yet much faster and simpler to train. A fast online learning algorithm with an integrated sparsification procedure is also provided. Experimental results on several benchmarks confirm the advantages of the proposed approach.
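One plausible reading of the stagewise idea (our generic sketch, not the paper's exact SLS construction; names and constants are ours): each stage solves a least-squares problem against the negative gradient of a bounded, monotone, non-convex margin loss, here l(m) = 1 / (1 + e^m) with margin m = y·f(x):

```python
import numpy as np

def stagewise_fit(X, y, stages=200, lr=0.5):
    """Each stage fits a least-squares update direction to the negative
    gradient of the bounded loss l(m) = 1/(1 + exp(m)), m = y * f(x)."""
    w = np.zeros(X.shape[1])
    f = np.zeros(len(y))
    for _ in range(stages):
        m = np.clip(y * f, -30.0, 30.0)              # clip for numerical safety
        g = y * np.exp(m) / (1.0 + np.exp(m)) ** 2   # -dl/df, bounded weights
        step, *_ = np.linalg.lstsq(X, g, rcond=None) # per-stage least squares
        w += lr * step
        f = X @ w
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.0]) + 0.2 * rng.normal(size=400))
w = stagewise_fit(X, y)
acc = float(np.mean(np.sign(X @ w) == y))
```

Because the loss is bounded, badly misclassified points receive vanishing weight, which is one route to the robustness against outliers that the abstract contrasts with convex losses.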
In Language and Information Technologies, 2007
"... learning with large sparse undirected ..."
Semi-Supervised Regression using Spectral Techniques
, 2006
Abstract

Cited by 1 (0 self)
Graph-based approaches for semi-supervised learning have received an increasing amount of interest in recent years. Despite their good performance, many purely graph-based algorithms do not have explicit functions and cannot predict the labels of unseen data. Graph regularization is a recently proposed framework that incorporates the intrinsic geometrical structure as a regularization term. It can be used for semi-supervised learning when unlabeled samples are available. However, our theoretical analysis shows that such an approach may not be optimal for multi-class problems. In this paper, we propose a novel method called Spectral Regression (SR). Using spectral techniques, we first compute a set of responses for each sample that respects both the label information and the geometrical structure. Once the responses are obtained, ordinary ridge regression can be applied to find the regression functions. Our proposed algorithm is particularly designed for multi-class problems. Experimental results on two real-world classification problems arising in visual and speech recognition demonstrate the effectiveness of our algorithm.
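The two-step recipe in the abstract (spectral responses first, then ordinary ridge regression) can be sketched on a toy two-cluster, unlabeled problem (our illustration; the function names and the Gaussian affinity choice are ours, and the paper's label-respecting responses are replaced by plain Laplacian eigenvectors):

```python
import numpy as np

def spectral_responses(W, k=1):
    """Smallest nontrivial eigenvectors of the normalized graph
    Laplacian L = I - D^{-1/2} W D^{-1/2} serve as the responses."""
    d = W.sum(axis=1)
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(len(W)) - Dinv @ W @ Dinv
    _, vecs = np.linalg.eigh(L)               # ascending eigenvalues
    return vecs[:, 1:1 + k]                   # skip the trivial eigenvector

def ridge(X, Y, lam=1e-2):
    """Ordinary ridge regression of responses Y onto features X."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Two well-separated clusters on a line, joined by a Gaussian affinity graph.
x = np.concatenate([np.linspace(0.0, 1.0, 10), np.linspace(5.0, 6.0, 10)])
X = x[:, None] - x.mean()                     # centered feature matrix
W = np.exp(-(x[:, None] - x[None, :]) ** 2)   # graph affinity matrix
Y = spectral_responses(W)                     # step 1: spectral responses
beta = ridge(X, Y)                            # step 2: ridge regression
pred = X @ beta                               # opposite signs across clusters
```

Unlike a purely graph-based method, the fitted regression function extends to unseen points: any new x can be centered and multiplied by beta.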