Input Space Versus Feature Space in Kernel-Based Methods
IEEE TRANSACTIONS ON NEURAL NETWORKS, 1999
Cited by 88 (3 self)
This paper collects some ideas targeted at advancing our understanding of the feature spaces associated with support vector (SV) kernel functions. We first discuss the geometry of feature space. In particular, we review what is known about the shape of the image of input space under the feature space map, and how this influences the capacity of SV methods. Following this, we describe how the metric governing the intrinsic geometry of the mapped surface can be computed in terms of the kernel, using the example of the class of inhomogeneous polynomial kernels, which are often used in SV pattern recognition. We then discuss the connection between feature space and input space by dealing with the question of how one can, given some vector in feature space, find a preimage (exact or approximate) in input space. We describe algorithms to tackle this issue, and show their utility in two applications of kernel methods. First, we use it to reduce the computational complexity of SV decision functions; second, we combine it with the Kernel PCA algorithm, thereby constructing a nonlinear statistical denoising technique which is shown to perform well on real-world data.
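As a small illustration of how feature-space geometry can be computed from the kernel alone, the sketch below evaluates the squared feature-space distance k(x,x) - 2k(x,y) + k(y,y) for an inhomogeneous polynomial kernel and checks it against an explicit degree-2 feature map. The function names, the offset of 1 and the two-dimensional inputs are assumptions chosen for the example; this is not the paper's metric computation or its preimage algorithm.

```python
import numpy as np

def poly_kernel(x, y, degree=2, offset=1.0):
    """Inhomogeneous polynomial kernel k(x, y) = (<x, y> + offset)^degree."""
    return (np.dot(x, y) + offset) ** degree

def feature_space_sq_dist(x, y, kernel=poly_kernel):
    """Squared distance between the images of x and y, using only kernel values."""
    return kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)

def explicit_map_deg2(x):
    """Explicit feature map for degree 2, offset 1, 2-D inputs (illustration only)."""
    x1, x2 = x
    r2 = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, r2 * x1 * x2, r2 * x1, r2 * x2, 1.0])

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])

d_kernel = feature_space_sq_dist(x, y)
d_explicit = np.sum((explicit_map_deg2(x) - explicit_map_deg2(y)) ** 2)
print(d_kernel, d_explicit)  # the two values should agree up to rounding
```

The explicit map is only feasible here because the degree and input dimension are tiny; the kernel expression gives the same quantity without ever constructing feature vectors.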
Covering Number Bounds of Certain Regularized Linear Function Classes
Journal of Machine Learning Research, 2002
Cited by 42 (3 self)
Recently, sample complexity bounds have been derived for problems involving linear functions such as neural networks and support vector machines. In many of these theoretical studies, the concept of covering numbers played an important role. It is thus useful to study covering numbers for linear function classes. In this paper, we investigate two closely related methods to derive upper bounds on these covering numbers. The first method, already employed in some earlier studies, relies on the so-called Maurey's lemma; the second method uses techniques from the mistake bound framework in online learning. We compare results from these two methods, as well as their consequences in some learning formulations.
Further Results on the Margin Distribution
In Proc. 12th Annu. Conf. on Comput. Learning Theory, 1999
Cited by 31 (9 self)
A number of results have bounded generalization of a classifier in terms of its margin on the training points. There has been some debate about whether the minimum margin is the best measure of the distribution of training set margin values with which to estimate the generalization. Freund and Schapire [7] have shown how a different function of the margin distribution can be used to bound the number of mistakes of an online learning algorithm for a perceptron, as well as an expected error bound. Shawe-Taylor and Cristianini [13] showed that a slight generalization of their construction can be used to give a PAC-style bound on the tail of the distribution of the generalization errors that arise from a given sample size. We show that in the linear case the approach can be viewed as a change of kernel and that the algorithms arising from the approach are exactly those originally proposed by Cortes and Vapnik [4]. We generalise the basic result to function classes with bounded fat-shatteri...
Robust Bounds on Generalization from the Margin Distribution
1998
Cited by 25 (1 self)
A number of results have bounded generalization of a classifier in terms of its margin on the training points. There has been some debate about whether the minimum margin is the best measure of the distribution of training set margin values with which to estimate the generalization. Freund and Schapire [8] have shown how a different function of the margin distribution can be used to bound the number of mistakes of an online learning algorithm for a perceptron, as well as an expected error bound. We show that a slight generalization of their construction can be used to give a PAC-style bound on the tail of the distribution of the generalization errors that arise from a given sample size. Algorithms arising from the approach are related to those of Cortes and Vapnik [5]. We generalise the basic result to function classes with bounded fat-shattering dimension and the 1-norm of the slack variables, which gives rise to Vapnik's box constraint algorithm. We also extend the results to the reg...
Covering numbers for support vector machines
IEEE Trans. Inform. Theory, 2002
Cited by 19 (6 self)
Support vector (SV) machines are linear classifiers that use the maximum margin hyperplane in a feature space defined by a kernel function. Until recently, the only bounds on the generalization performance of SV machines (within Valiant’s probably approximately correct framework) took no account of the kernel used except in its effect on the margin and radius. More recently, it has been shown that one can bound the relevant covering numbers using tools from functional analysis. In this paper, we show that the resulting bound can be greatly simplified. The new bound involves the eigenvalues of the integral operator induced by the kernel. It shows that the effective dimension depends on the rate of decay of these eigenvalues. We present an explicit calculation of covering numbers for an SV machine using a Gaussian kernel, which is significantly better than that implied by previous results. Index Terms: Covering numbers, entropy numbers, kernel machines, statistical learning theory, support vector (SV) machines.
KMOD - A Two-parameter SVM Kernel for Pattern Recognition
In ICPR, 2002
Cited by 4 (3 self)
It has been shown that Support Vector Machine theory optimizes a smoothness functional hypothesis through kernel application. We present KMOD, a two-parameter SVM kernel with distinctive properties of good discrimination between patterns while preserving the data neighborhood information. In classification problems, the experiments we carried out on the Breast Cancer benchmark produced better performance than the RBF kernel and some state-of-the-art classifiers. It also generated favorable results when subjected to a 10-class problem of recognizing handwritten digits in the NIST database.
Automatic Model Selection for the Optimization of SVM Kernels
Cited by 3 (0 self)
This approach aims to optimize the kernel parameters and to efficiently reduce the number of support vectors, so that the generalization error can be reduced drastically. The proposed methodology suggests the use of a new model selection criterion based on the estimation of the probability of error of the SVM classifier. For comparison, we considered two more model selection criteria: GACV (‘Generalized Approximate Cross-Validation’) and VC (‘Vapnik-Chervonenkis’) dimension. These criteria are algebraic estimates of upper bounds of the expected error. For the former, we also propose a new minimization scheme. The experiments conducted on a biclass problem show that we can adequately choose the SVM hyperparameters using the empirical error criterion. Moreover, it turns out that the criterion produces a less complex model with fewer support vectors. For multiclass data, the optimization strategy is adapted to the one-against-one data partitioning. The approach is then evaluated on images of handwritten digits from the USPS database.
Sample Based Generalization Bounds
1999
Cited by 2 (1 self)
It is known that the covering numbers of a function class on a double sample (length 2m, where m is the number of points in the sample) can be used to bound the generalization performance of a classifier by using a margin based analysis. Traditionally this has been done using a "Sauer-like" relationship involving a combinatorial dimension such as the fat-shattering dimension. In this paper we show that one can utilize an analogous argument in terms of the observed covering numbers on a single m-sample (being the actual observed data points). The significance of this is that for certain interesting classes of functions, such as support vector machines, one can readily estimate the empirical covering numbers quite well. We show how to do so in terms of the eigenvalues of the Gram matrix created from the data. These covering numbers can be much less than a priori bounds indicate in situations where the particular data received is "easy". The work can be considered an extension of previous results which provided generalization performance bounds in terms of the VC-dimension of the class of hypotheses restricted to the sample, with the considerable advantage that the covering numbers can be readily computed, and they often are small.
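The quantity this abstract builds on, the eigenvalue spectrum of the Gram matrix of the observed sample, is cheap to compute. The sketch below builds a Gaussian-kernel Gram matrix on synthetic data and returns its eigenvalues in decreasing order. The kernel choice, the width, the random data and the function names are all assumptions for illustration, and the code stops at the spectrum: it does not reproduce the paper's covering-number estimate.

```python
import numpy as np

def gaussian_gram(X, width=1.0):
    """Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    # Clamp tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * width ** 2))

def gram_spectrum(K):
    """Eigenvalues of a symmetric Gram matrix, sorted from largest to smallest."""
    eigvals = np.linalg.eigvalsh(K)  # eigvalsh returns ascending order
    return np.clip(eigvals[::-1], 0.0, None)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))  # synthetic sample: 20 points in 3 dimensions
K = gaussian_gram(X)
spectrum = gram_spectrum(K)
print(spectrum[:5])
```

A rapidly decaying spectrum is the kind of situation the abstract calls "easy" data; how that decay translates into a bound is the subject of the paper itself.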
Optimal Hyperplane Classifier with Adaptive Norm
1999
Cited by 1 (0 self)
The conventional optimal hyperplane classifier is defined on Euclidean space where the norm is given a priori. In this paper, we propose a new optimal hyperplane classifier in which the norm also adapts for learning. For practical implementation, the norm is restricted to a weighted Euclidean norm and the weights are controlled in learning. The statistical properties of this classifier are analyzed via the generalization bound characterized by entropy numbers. An iterative training algorithm is designed to minimize the generalization bound while keeping the complete separation of training samples. In online character recognition experiments, the optimal hyperplane classifier with adaptive norm outperformed the conventional one. 1 Introduction A learning machine is defined as a mapping from a set of training samples (i.e. training set) to a function. This function performs various roles according to the problem setting. For example, the function performs as a discriminant fu...