Results 1-10 of 29
Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods
ADVANCES IN LARGE MARGIN CLASSIFIERS, 1999
Cited by 1041 (0 self)
The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.
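The sigmoid-fitting step described in this abstract can be illustrated with a minimal sketch. It assumes a sigmoid of the form P(y=1 | f) = 1 / (1 + exp(A f + B)) fitted by plain gradient descent on the negative log-likelihood with raw 0/1 labels; the optimizer, the learning rate, the function names `fit_sigmoid` and `sigmoid_prob`, and the synthetic scores standing in for SVM outputs are all assumptions made for illustration, not details taken from the chapter.

```python
import numpy as np

def fit_sigmoid(scores, labels, lr=0.5, n_iter=3000):
    """Fit A, B in P(y=1 | f) = 1 / (1 + exp(A*f + B)) by gradient descent."""
    f = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)      # labels assumed to be 0 or 1
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * f + B))  # current probability estimates
        # The derivative of the negative log-likelihood w.r.t. z = A*f + B is (y - p).
        resid = y - p
        A -= lr * np.mean(resid * f)
        B -= lr * np.mean(resid)
    return A, B

def sigmoid_prob(scores, A, B):
    """Map raw classifier scores to probabilities with the fitted sigmoid."""
    return 1.0 / (1.0 + np.exp(A * np.asarray(scores, dtype=float) + B))

# Synthetic stand-in for SVM decision values (hypothetical data).
rng = np.random.default_rng(0)
scores = rng.normal(size=500)
labels = (rng.random(500) < 1.0 / (1.0 + np.exp(-2.0 * scores))).astype(int)

A, B = fit_sigmoid(scores, labels)
probs = sigmoid_prob([-3.0, 0.0, 3.0], A, B)
print(A, B, probs)
```

Because larger scores should mean a higher probability of the positive class, the fitted `A` is expected to be negative under this parameterization.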
Support Vector Machines, Reproducing Kernel Hilbert Spaces and the Randomized GACV
1998
Cited by 187 (12 self)
In this paper we very briefly review some of these results. RKHS can be chosen tailored to the problem at hand in many ways, and we review a few of them, including radial basis function and smoothing spline ANOVA spaces. Girosi (1997), Smola and Schölkopf (1997), Schölkopf et al. (1997) and others have noted the relationship between SVMs and penalty methods as used in the statistical theory of nonparametric regression. In Section 1.2 we elaborate on this, and show how replacing the likelihood functional of the logit (log odds ratio) in penalized likelihood methods for Bernoulli [yes-no] data with certain other functionals of the logit (to be called SVM functionals) results in several of the SVMs that are of modern research interest. The SVM functionals we consider more closely resemble a "goodness-of-fit" measured by classification error than a "goodness-of-fit" measured by the comparative Kullback-Leibler distance, which is frequently associated with likelihood functionals. This observation is not new or profound, but it is hoped that the discussion here will help to bridge the conceptual gap between classical nonparametric regression via penalized likelihood methods, and SVMs in RKHS. Furthermore, since SVMs can be expected to provide more compact representations of the desired classification boundaries than boundaries based on estimating the logit by penalized likelihood methods, they have potential as a prescreening or model selection tool in sifting through many variables or regions of attribute space to find influential quantities, even when the ultimate goal is not classification, but to understand how the logit varies as the important variables change throughout their range. This is potentially applicable to the variable/model selection problem in demographic m...
Smoothing Spline ANOVA for Exponential Families, with Application to the Wisconsin Epidemiological Study of Diabetic Retinopathy
ANN. STATIST, 1995
Cited by 101 (46 self)
Let $y_i$, $i = 1, \dots, n$, be independent observations with the density of $y_i$ of the form $h(y_i, f_i) = \exp[y_i f_i - b(f_i) + c(y_i)]$, where $b$ and $c$ are given functions and $b$ is twice continuously differentiable and bounded away from 0. Let $f_i = f(t(i))$, where $t = (t_1, \dots, t_d) \in T^{(1)} \otimes \cdots \otimes T^{(d)} = T$, the $T^{(\alpha)}$ are measurable spaces of rather general form, and $f$ is an unknown function on $T$ with some assumed `smoothness' properties. Given $\{y_i, t(i), i = 1, \dots, n\}$, it is desired to estimate $f(t)$ for $t$ in some region of interest contained in $T$. We develop the fitting of smoothing spline ANOVA models to this data of the form $f(t) = C + \sum_{\alpha} f_{\alpha}(t_{\alpha}) + \sum_{\alpha < \beta} f_{\alpha\beta}(t_{\alpha}, t_{\beta}) + \cdots$. The components of the decomposition satisfy side conditions which generalize the usual side conditions for parametric ANOVA. The estimate of $f$ is obtained as the minimizer...
When is there a representer theorem? Vector vs matrix regularizers
J. of Machine Learning Res
Cited by 26 (4 self)
We consider a general class of regularization methods which learn a vector of parameters on the basis of linear measurements. It is well known that if the regularizer is a nondecreasing function of the inner product then the learned vector is a linear combination of the input data. This result, known as the representer theorem, is at the basis of kernel-based methods in machine learning. In this paper, we prove the necessity of the above condition, thereby completing the characterization of kernel methods based on regularization. We further extend our analysis to regularization methods which learn a matrix, a problem which is motivated by the application to multi-task learning. In this context, we study a more general representer theorem, which holds for a larger class of regularizers. We provide a necessary and sufficient condition for this class of matrix regularizers and highlight them with some concrete examples of practical importance. Our analysis uses basic principles from matrix theory, especially the useful notion ... Regularization in Hilbert spaces is an important methodology for learning from examples and has a long history in a variety of fields. It has been studied, from different perspectives, in statistics
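The classical (vector) representer theorem mentioned in this abstract can be checked numerically in a simple special case. The sketch below assumes ridge regression with a linear kernel, where the regularizer λ‖w‖² is a nondecreasing function of the inner product ⟨w, w⟩; the synthetic data, the value of `lam`, and the variable names are illustrative assumptions, not material from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))   # 20 synthetic inputs x_i in R^5
y = rng.normal(size=20)
lam = 0.1                      # regularization weight (arbitrary choice)

# Primal problem: minimize ||X w - y||^2 + lam * ||w||^2 over w in R^5.
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# Representer form: w = sum_i c_i x_i, with c = (K + lam I)^{-1} y and K = X X^T.
K = X @ X.T
c = np.linalg.solve(K + lam * np.eye(20), y)
w_dual = X.T @ c

# The two solutions coincide, so the learned vector lies in the span of the inputs.
print(np.allclose(w_primal, w_dual))
```

The agreement follows from the identity (XᵀX + λI)⁻¹Xᵀ = Xᵀ(XXᵀ + λI)⁻¹, which is what lets kernel methods work with the n × n Gram matrix instead of the parameter space.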
Approximating Thin-Plate Splines for Elastic Registration: Integration of Landmark Errors and Orientation Attributes
In Proc. of IPMI'99, volume 1613 of LNCS, 1999
Cited by 20 (1 self)
We introduce an approach to elastic registration of tomographic images based on thin-plate splines. Central to this scheme is a well-defined minimizing functional for which the solution can be stated analytically. In this work, we consider the integration of anisotropic landmark errors as well as additional attributes at landmarks. As attributes we use orientations at landmarks and we incorporate the corresponding constraints through scalar products. With our approximation scheme it is thus possible to integrate statistical as well as geometric information as additional knowledge in elastic image registration. On the basis of synthetic as well as real tomographic images we show that this additional knowledge can significantly improve the registration result. In particular, we demonstrate that our scheme incorporating orientation attributes can preserve the shape of rigid structures (such as bone) embedded in an otherwise elastic material. This is achieved without selecting...
Approximate methods for propagation of uncertainty with gaussian process models. Doctoral dissertation
2004
Cited by 17 (0 self)
This thesis presents extensions of the Gaussian Process (GP) model, based on approximate methods allowing the model to deal with input uncertainty. Zero-mean GPs with Gaussian covariance function are of particular interest, as they allow many derivations to be carried out exactly and have been shown to have modelling abilities and predictive performance comparable to those of neural networks (Rasmussen, 1996a). With this model, given observed data and a new input, making a prediction corresponds to computing the (Gaussian) predictive distribution of the associated output, whose mean can be used as an estimate. The predictive variance then provides error bars or confidence intervals on this estimate: it quantifies the model’s degree of belief in its ‘best guess’. Using the knowledge of the predictive variance in an informative manner is at the centre of this thesis, as the problems of how to propagate it in the model, how to account for it when derivative observations are available, and how to derive a control law with a cautious behaviour are addressed. The task of making a prediction when the new input presented to the model is noisy is introduced. Assuming a normally distributed input, only the mean and variance of the corresponding non-Gaussian predictive distribution are computed (Gaussian approximation). Depending on the parametric form of
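The predictive mean and variance that this abstract refers to can be sketched for the noise-free-input case. The code below assumes a zero-mean GP on one-dimensional inputs with a Gaussian (squared-exponential) covariance and fixed hyperparameters; the function names `sq_exp_kernel` and `gp_predict`, the hyperparameter values, and the toy data are assumptions for illustration, and the thesis's treatment of noisy inputs is not reproduced here.

```python
import numpy as np

def sq_exp_kernel(a, b, length_scale=1.0, signal_var=1.0):
    """Gaussian (squared-exponential) covariance between 1-D input arrays."""
    a = np.asarray(a, dtype=float).reshape(-1, 1)
    b = np.asarray(b, dtype=float).reshape(-1, 1)
    return signal_var * np.exp(-0.5 * (a - b.T) ** 2 / length_scale**2)

def gp_predict(x_train, y_train, x_new, noise_var=1e-6):
    """Predictive mean and variance of a zero-mean GP at the points x_new."""
    K = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_star = sq_exp_kernel(x_train, x_new)          # shape (n_train, n_new)
    mean = k_star.T @ np.linalg.solve(K, y_train)
    # Prior variance minus the reduction explained by the training data.
    prior_var = np.diag(sq_exp_kernel(x_new, x_new))
    var = prior_var - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)
    return mean, var

x_train = np.array([-1.0, 0.0, 1.0])
y_train = np.sin(x_train)
mean, var = gp_predict(x_train, y_train, np.array([0.0, 10.0]))
print(mean, var)
```

Near a training point the variance shrinks towards the noise level, while far from the data it returns to the prior variance, which is the "error bar" behaviour the abstract describes.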
Smoothing Spline ANOVA Fits for Very Large, Nearly Regular Data Sets, with Application to Historical Global Climate Data
1995
Cited by 13 (5 self)
... validation (GCV), provided that matrix decompositions of size $n \times n$ can be carried out, where $n$ is the sample size. We review the randomized trace technique and the backfitting algorithm, and remark that they can be combined to solve the variational problem while choosing the smoothing parameters by GCV for data sets that are much too large to use matrix decomposition methods directly. Some intermediate calculations to speed up the backfitting algorithm are given which are useful when the data has a tensor product structure. We describe an imputation procedure which can take advantage of data with a (nearly) tensor product structure. As an illustration of an application we discuss the algorithm in the context of fitting and smoothing historical global winter mean surface temperature data and examining the main effects and interactions for time and space.
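The general idea behind a randomized trace estimate can be sketched as follows: the trace of a matrix that is only available through matrix-vector products is approximated by averaging quadratic forms over random probe vectors. The sketch assumes Rademacher (±1) probes and a synthetic symmetric matrix; the function name `randomized_trace`, the probe distribution, and the probe count are illustrative assumptions, and the paper's exact procedure may differ.

```python
import numpy as np

def randomized_trace(matvec, n, n_probes=2000, seed=0):
    """Estimate tr(A) using only products A @ z with random +/-1 vectors z."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe vector
        total += z @ matvec(z)                # E[z^T A z] = tr(A)
    return total / n_probes

# Synthetic symmetric test matrix (hypothetical stand-in for an influence matrix).
rng = np.random.default_rng(1)
M = rng.normal(size=(50, 50))
A = M @ M.T / 50.0

estimate = randomized_trace(lambda v: A @ v, n=50)
exact = np.trace(A)
print(estimate, exact)
```

The appeal for large data sets is that no $n \times n$ decomposition is needed: each probe costs one application of the operator, which can itself be computed iteratively.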
Generalization And Regularization in Nonlinear Learning Systems
The Handbook of Brain Theory and Neural Networks, 1994
Cited by 12 (3 self)
In this article we will describe generalization and regularization from the point of view of multivariate function estimation in a statistical context. Multivariate function estimation is not, in principle, distinguishable from supervised machine learning. However, until fairly recently supervised machine learning and multivariate function estimation had fairly distinct groups of practitioners, and small overlap in language, literature, and in the kinds of practical problems under study. In any case, we are given a training set, consisting of pairs of input (feature) vectors and associated outputs $\{t(i), y_i\}$, for $n$ training or example subjects, $i = 1, \dots, n$. From this data, it is desired to construct a map which generalizes well, that is, given a new value of $t$, the map will provide a reasonable prediction for the unobserved output associated with this $t$.
Smoothing Spline Analysis Of Variance For Polychotomous Response Data
1998
Cited by 10 (1 self)
We consider the penalized likelihood method with smoothing spline ANOVA for estimating nonparametric functions to data involving a polychotomous response. The fitting procedure involves minimizing the penalized likelihood in a Reproducing Kernel Hilbert Space. A one-step block SOR-Newton-Raphson algorithm is used to solve the minimization problem. Generalized Cross-Validation or unbiased risk estimation is used to empirically assess the amount of smoothing (which controls the bias and variance tradeoff) at each one-step block SOR-Newton-Raphson iteration. Under some regular smoothness conditions, the one-step block SOR-Newton-Raphson will produce a sequence which converges to the minimizer of the penalized likelihood for the fixed smoothing parameters. Monte Carlo simulations are conducted to examine the performance of the algorithm. The method is applied to polychotomous data from the Wisconsin Epidemiological Study of Diabetic Retinopathy to estimate the risks of cause-specific mortality given several potential risk factors at the start of the study. Strategies to obtain smoothing spline estimates for large data sets with polychotomous response are also proposed in this thesis. Simulation studies are conducted to check the performance of the proposed method.
Acknowledgements. I would like to express my sincerest gratitude to my advisor, Professor Grace Wahba, for her invaluable advice during the course of this dissertation. Appreciation is extended to Professors Michael Kosorok, Mary Lindstrom, Olvi Mangasarian, and Kam-Wah Tsui for their service on my final examination committee, their careful reading of this thesis and their valuable comments. I would like to thank Ronald Klein, MD and Barbara Klein, MD for providing the WESDR data. Fellow graduate students Fangy...
Tree Structured Nonlinear Signal Modeling and Prediction
Proc. of the IEEE 1995 International Conference on Acoustics, Speech and Signal Processing
Cited by 8 (0 self)
In this paper, we develop a regression tree approach to identification and prediction of signals that evolve according to an unknown nonlinear state space model. In this approach, a tree is recursively constructed that partitions the �dimensional state space into a collection of piecewise homogeneous regions utilizing a P �ary splitting rule with an entropy-based node impurity criterion. On this partition, the joint density of the state is approximately piecewise constant, leading to a nonlinear predictor that nearly attains minimum mean square error. This process decomposition is closely related to a generalized version of the thresholded AR signal model (ART), which we call piecewise constant AR (PCAR). We illustrate the method for two cases where classical linear prediction is ineffective: a chaotic “double-scroll” signal measured at the output of a Chua-type electronic circuit and a second-order ART model. We show that the prediction errors are comparable with the nearest neighbor approach to nonlinear prediction but with greatly reduced complexity. Index Terms: chaotic signal analysis, nonlinear and nonparametric modeling and prediction, piecewise constant AR models, recursive partitioning, regression trees.