Results 1–10 of 34
Multicategory Support Vector Machines: Theory and Application to the Classification of Microarray Data and Satellite Radiance Data
 Journal of the American Statistical Association
, 2004
Abstract

Cited by 189 (22 self)
Two-category support vector machines (SVM) have been very popular in the machine learning community for classification problems. Solving multicategory problems by a series of binary classifiers is quite common in the SVM paradigm; however, this approach may fail under various circumstances. We propose the multicategory support vector machine (MSVM), which extends the binary SVM to the multicategory case and has good theoretical properties. The proposed method provides a unifying framework when there are either equal or unequal misclassification costs. As a tuning criterion for the MSVM, an approximate leave-one-out cross-validation function, called Generalized Approximate Cross Validation, is derived, analogous to the binary case. The effectiveness of the MSVM is demonstrated through applications to cancer classification using microarray data and cloud classification with satellite radiance profiles.
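The multicategory extension replaces the binary hinge loss with one defined over all k class outputs. As a rough sketch of such a loss, assuming the sum-to-zero formulation common to this line of work (the function name and example values are ours, not the paper's):

```python
import numpy as np

def msvm_loss(f, y, k):
    """Multicategory hinge loss (sketch, assumptions ours).

    f : array of shape (k,), classifier outputs, assumed to satisfy
        the sum-to-zero constraint sum_j f[j] == 0.
    y : integer in {0, ..., k-1}, true class.
    The loss sums the positive parts of f[j] + 1/(k-1) over j != y.
    """
    margin = f + 1.0 / (k - 1)
    margin[y] = 0.0          # the true class does not contribute
    return np.sum(np.maximum(margin, 0.0))

# A confident, constraint-satisfying output for class 0 incurs zero loss:
f_good = np.array([1.0, -0.5, -0.5])   # sums to zero, class 0 favoured
print(msvm_loss(f_good, 0, 3))         # 0.0
```

With unequal misclassification costs, each term in the sum would simply be weighted by the cost of predicting class j when the truth is y.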
Kernel partial least squares regression in reproducing kernel Hilbert space
 Journal of Machine Learning Research
, 2001
Abstract

Cited by 112 (5 self)
A family of regularized least squares regression models in a Reproducing Kernel Hilbert Space is extended by the kernel partial least squares (PLS) regression model. Similar to principal components regression (PCR), PLS is a method based on the projection of input (explanatory) variables to the latent variables (components). However, in contrast to PCR, PLS creates the components by modeling the relationship between input and output variables while maintaining most of the information in the input variables. PLS is useful in situations where the number of explanatory variables exceeds the number of observations and/or a high level of multicollinearity among those variables is assumed. Motivated by this fact, we provide a kernel PLS algorithm for the construction of nonlinear regression models in possibly high-dimensional feature spaces. We give a theoretical description of the kernel PLS algorithm and experimentally compare it with the existing kernel PCR and kernel ridge regression techniques. We demonstrate that on the data sets employed kernel PLS achieves the same results as kernel PCR but uses significantly fewer, qualitatively different components.
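The projection-based construction can be sketched in a few lines. The following is a simplified single-response version of kernel PLS score extraction with deflation; the variable names and the one-step inner loop are our simplification, not the authors' exact procedure:

```python
import numpy as np

def kernel_pls_components(K, y, n_components):
    """Extract orthonormal score vectors t_1..t_m by kernel PLS (sketch).

    K : (n, n) centered kernel matrix; y : (n,) centered response.
    For a single response the NIPALS inner loop collapses, so each
    component is t = K y (normalized), followed by deflating K and y
    against the extracted score.
    """
    K = K.copy().astype(float)
    y = y.copy().astype(float)
    n = K.shape[0]
    T = np.zeros((n, n_components))
    for m in range(n_components):
        t = K @ y
        t /= np.linalg.norm(t)
        T[:, m] = t
        P = np.eye(n) - np.outer(t, t)   # projector orthogonal to t
        K = P @ K @ P                    # deflate the kernel matrix
        y = y - t * (t @ y)              # deflate the response
    return T
```

A regression model would then be fit on the extracted scores T rather than on principal components of K, which is what lets PLS use fewer components than kernel PCR.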
Diffusion Kernels on Statistical Manifolds
, 2004
Abstract

Cited by 92 (6 self)
A family of kernels for statistical learning is introduced that exploits the geometric structure of statistical models. The kernels are based on the heat equation on the Riemannian manifold defined by the Fisher information metric associated with a statistical family, and generalize the Gaussian kernel of Euclidean space. As an important special case, kernels based on the geometry of multinomial families are derived, leading to kernel-based learning algorithms that apply naturally to discrete data. Bounds on covering numbers and Rademacher averages for the kernels are proved using bounds on the eigenvalues of the Laplacian on Riemannian manifolds. Experimental results are presented for document classification, for which the use of multinomial geometry is natural and well motivated, and improvements are obtained over the Gaussian and linear kernels that have been standard for text classification.
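For the multinomial special case the geometry is explicit: under the Fisher metric the probability simplex maps isometrically onto a sphere, so the geodesic distance has a closed form. A minimal sketch of the resulting kernel, assuming the common Gaussian (leading-order heat kernel) approximation; the paper's exact normalization constants may differ:

```python
import numpy as np

def multinomial_diffusion_kernel(p, q, t=1.0):
    """Heat-kernel approximation on the multinomial simplex (sketch).

    Geodesic distance under the Fisher information metric:
        d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)),
    i.e. twice the arccosine of the Bhattacharyya coefficient.
    We use the leading-order approximation exp(-d^2 / (4 t)); treat
    the constants as an assumption, not the paper's exact form.
    """
    bc = np.sum(np.sqrt(np.asarray(p) * np.asarray(q)))
    d = 2.0 * np.arccos(np.clip(bc, -1.0, 1.0))
    return np.exp(-d * d / (4.0 * t))

p = np.array([0.5, 0.3, 0.2])
print(multinomial_diffusion_kernel(p, p))   # identical inputs: kernel value 1.0
```

Applied to text, p and q would be (smoothed) normalized term-frequency vectors, which is what makes the kernel fit discrete data naturally.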
Kernel Conditional Random Fields: Representation and Clique Selection
 in ICML
, 2004
Abstract

Cited by 80 (5 self)
Kernel conditional random fields (KCRFs) are introduced as a framework for discriminative modeling of graph-structured data. A representer theorem for conditional graphical models is given which shows how kernel conditional random fields arise from risk minimization procedures defined using Mercer kernels on labeled graphs. A procedure for greedily selecting cliques in the dual representation is then proposed, which allows sparse representations. By incorporating kernels and implicit feature spaces into conditional graphical models, the framework enables semi-supervised learning algorithms for structured data through the use of graph kernels.
Efficient Leave-One-Out Cross-Validation of Kernel Fisher Discriminant Classifiers
 Pattern Recognition
, 2003
Abstract

Cited by 34 (5 self)
Mika et al. [1] apply the "kernel trick" to obtain a nonlinear variant of Fisher's linear discriminant analysis method, demonstrating state-of-the-art performance on a range of benchmark data sets. We show that leave-one-out cross-validation of kernel Fisher discriminant classifiers can be implemented with a computational complexity of only O(l^3) operations rather than the O(l^4) of a naïve implementation, where l is the number of training patterns. Leave-one-out cross-validation then becomes an attractive means of model selection in large-scale applications of kernel Fisher discriminant analysis, being significantly faster than the conventional k-fold cross-validation procedures commonly used.
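The speed-up rests on a closed-form leave-one-out identity for regularized least-squares systems: one O(l^3) factorization yields all l held-out residuals. As an illustration of the idea on kernel ridge regression, whose training problem has the same regularized least-squares form (the function is ours):

```python
import numpy as np

def loo_residuals_krr(K, y, lam):
    """Closed-form leave-one-out residuals for kernel ridge regression.

    With hat matrix H = K (K + lam*I)^{-1} and fitted values yhat = H y,
    the exact leave-one-out residual for point i is
        (y_i - yhat_i) / (1 - H_ii),
    so a single matrix inversion replaces l separate refits.
    """
    n = K.shape[0]
    H = K @ np.linalg.inv(K + lam * np.eye(n))   # one O(n^3) solve
    yhat = H @ y
    return (y - yhat) / (1.0 - np.diag(H))
```

Summing the squared residuals gives the leave-one-out error for a candidate regularization parameter at essentially the cost of a single fit.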
Measure Based Regularization
 Advances in Neural Information Processing Systems 16
, 2004
Abstract

Cited by 32 (6 self)
We address in this paper the question of how knowledge of the marginal distribution P(x) can be incorporated in a learning algorithm. We suggest three theoretical methods for taking this distribution into account for regularization and provide links to existing graph-based semi-supervised learning algorithms. We also propose practical implementations.
A Note on Margin-based Loss Functions in Classification
 Statistics and Probability Letters
, 2002
Abstract

Cited by 30 (1 self)
In many classification procedures, the classification function is obtained (or trained) by minimizing a certain empirical risk on the training sample. The classification is then based on the sign of the classification function. In recent years, there has been a host of classification methods proposed in machine learning that use different margin-based loss functions in the training. Examples include the AdaBoost procedure, the support vector machine, and many variants of them. The margin-based loss functions used in these procedures are usually motivated as upper bounds of the misclassification loss, but this cannot explain the statistical properties of the classification procedures. We consider margin-based loss functions from a statistical point of view. We first show that under general conditions, margin-based loss functions are Fisher consistent for classification. That is, the population minimizer of the loss function leads to the Bayes optimal rule of classification. In particular, almost all margin-based loss functions that have appeared in the literature are Fisher consistent. We then study margin-based loss functions in the method of sieves and the method of regularization. We show that the Fisher consistency of margin-based loss functions often leads to consistency and rate-of-convergence (to the Bayes optimal risk) results under general conditions. The common notion of margin-based loss functions as upper bounds of the misclassification loss is formalized and investigated. It is shown that the hinge loss is the tightest convex upper bound of the misclassification loss. Simulations are carried out to compare some commonly used margin-based loss functions.
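The upper-bound relationship among these losses is easy to check numerically. A small sketch comparing the usual surrogates against the 0-1 loss (the 1/log 2 rescaling of the logistic loss is our choice, made so that it also bounds the 0-1 loss):

```python
import numpy as np

# Margin-based losses as functions of the margin m = y * f(x).
# The misclassification (0-1) loss is 1{m <= 0}; each surrogate below
# upper-bounds it pointwise, which is the "upper bound" notion the
# abstract formalizes.
m = np.linspace(-2.0, 2.0, 401)
zero_one = (m <= 0).astype(float)
hinge = np.maximum(1.0 - m, 0.0)                 # SVM
exponential = np.exp(-m)                         # AdaBoost
logistic = np.log1p(np.exp(-m)) / np.log(2.0)    # rescaled logistic

print(np.all(hinge >= zero_one))                 # True
```

On this grid the hinge loss sits below the other surrogates for large margins while still dominating the 0-1 loss, which is the sense in which it is the tighter bound.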
Variable Selection and Model Building via Likelihood Basis Pursuit
 Journal of the American Statistical Association
, 2002
Abstract

Cited by 23 (10 self)
This paper presents a nonparametric penalized likelihood approach for variable selection and model building, called likelihood basis pursuit (LBP). In the setting of a tensor product reproducing kernel Hilbert space, we decompose the log likelihood into the sum of different functional components such as main effects and interactions, with each component represented by appropriate basis functions. The basis functions are chosen to be compatible with variable selection and model building in the context of a smoothing spline ANOVA model. Basis pursuit is applied to obtain the optimal decomposition in terms of having the smallest l1 norm on the coefficients. We use the functional L1 norm to measure the importance of each component and determine the "threshold" value by a sequential Monte Carlo bootstrap test algorithm. As a generalized LASSO-type method, LBP produces shrinkage estimates for the coefficients, which greatly facilitates the variable selection process, and at the same time provides highly interpretable multivariate functional estimates. To choose the regularization parameters appearing in the LBP models, generalized approximate cross validation (GACV) is derived as a tuning criterion. To make GACV widely applicable to large data sets, a randomized version is proposed as well. A technique called "slice modeling" is used to solve the optimization problem and make the computation more efficient. LBP has great potential for a wide range of research and application areas such as medical studies, and in this paper we apply it to two large ongoing epidemiological studies: the Wisconsin Epidemiological Study of Diabetic Retinopathy (WESDR) and the Beaver Dam Eye Study (BDES).
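At its core, the basis pursuit step is an l1-penalized fit over basis-function coefficients. As a generic sketch of that step only (plain iterative soft-thresholding on a design matrix B of basis evaluations; this is not the paper's slice-modeling algorithm):

```python
import numpy as np

def ista_lasso(B, y, lam, n_iter=500):
    """L1-penalized least squares by ISTA (sketch, stand-in for the
    basis pursuit step): minimize ||y - B c||^2 / (2n) + lam * ||c||_1
    over basis-function coefficients c.  Soft-thresholding drives many
    coefficients exactly to zero, which is the shrinkage/selection
    behaviour a LASSO-type method exploits.
    """
    n = B.shape[0]
    c = np.zeros(B.shape[1])
    L = np.linalg.norm(B, 2) ** 2 / n   # Lipschitz constant of the gradient
    for _ in range(n_iter):
        g = B.T @ (B @ c - y) / n       # gradient of the smooth part
        z = c - g / L
        c = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return c
```

Components whose coefficients are driven to zero drop out of the model, and the surviving functional components are then ranked by their L1 norms as the abstract describes.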
Online Manifold Regularization: A New Learning Setting and Empirical Study
Abstract

Cited by 19 (3 self)
We consider a novel "online semi-supervised learning" setting where (mostly unlabeled) data arrives sequentially in large volume, and it is impractical to store it all before learning. We propose an online manifold regularization algorithm. It differs from standard online learning in that it learns even when the input point is unlabeled. Our algorithm is based on convex programming in kernel space with stochastic gradient descent, and inherits the theoretical guarantees of standard online algorithms. However, a naïve implementation of our algorithm does not scale well. This paper focuses on efficient, practical approximations; we discuss two sparse approximations using buffering and online random projection trees. Experiments show our algorithm achieves risk and generalization accuracy comparable to standard batch manifold regularization, while each step runs quickly. Our online semi-supervised learning setting is an interesting direction for further theoretical development, paving the way for semi-supervised learning to work on real-world lifelong learning tasks.
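The setting can be illustrated with a toy stochastic gradient learner. The following linear sketch couples each point to its predecessor as a crude stand-in for the paper's buffered graph term; it illustrates learning from unlabeled points, not the authors' kernel-space algorithm:

```python
import numpy as np

def online_manifold_sgd(stream, lam1=0.1, lam2=0.1, eta=0.1, gamma=1.0):
    """Simplified online manifold regularization with a linear model (sketch).

    `stream` yields (x, y) pairs with y in {-1, +1}, or y = None for
    unlabeled points.  Each step takes a stochastic gradient of
      hinge loss (labeled points only)
      + lam1 * ||w||^2
      + lam2 * sim(x_prev, x) * (w.x_prev - w.x)^2   (manifold coupling),
    so even unlabeled points update the model through the last term.
    """
    w, x_prev = None, None
    for x, y in stream:
        x = np.asarray(x, dtype=float)
        if w is None:
            w = np.zeros_like(x)
        grad = 2.0 * lam1 * w
        if y is not None and y * (w @ x) < 1.0:
            grad -= y * x                       # hinge subgradient
        if x_prev is not None:
            sim = np.exp(-gamma * np.sum((x - x_prev) ** 2))
            diff = w @ x_prev - w @ x
            grad += 2.0 * lam2 * sim * diff * (x_prev - x)
        w = w - eta * grad
        x_prev = x
    return w
```

In the paper's setting the pairwise term runs over a buffer of recent points (or a random projection tree) rather than only the immediate predecessor, which is exactly the scaling problem the sparse approximations address.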
Parsing with a Single Neuron: Convolution Kernels for Natural Language Problems
, 2001
"... This paper introduces new training ..."