Results 1  10
of
38
A tutorial on support vector machines for pattern recognition
 Data Mining and Knowledge Discovery
, 1998
"... The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and nonseparable data, working through a nontrivial example in detail. We describe a mechanical analogy, and discuss when SV ..."
Abstract

Cited by 3096 (12 self)
 Add to MetaCart
(Show Context)
The tutorial starts with an overview of the concepts of VC dimension and structural risk minimization. We then describe linear Support Vector Machines (SVMs) for separable and nonseparable data, working through a nontrivial example in detail. We describe a mechanical analogy, and discuss when SVM solutions are unique and when they are global. We describe how support vector training can be practically implemented, and discuss in detail the kernel mapping technique which is used to construct SVM solutions which are nonlinear in the data. We show how Support Vector machines can have very large (even infinite) VC dimension by computing the VC dimension for homogeneous polynomial and Gaussian radial basis function kernels. While very high VC dimension would normally bode ill for generalization performance, and while at present there exists no theory which shows that good generalization performance is guaranteed for SVMs, there are several arguments which support the observed high accuracy of SVMs, which we review. Results of some experiments which were inspired by these arguments are also presented. We give numerous examples and proofs of most of the key theorems. There is new material, and I hope that the reader will find that even old material is cast in a fresh light.
Gene selection for cancer classification using support vector machines
 Machine Learning
"... Abstract. DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must ..."
Abstract

Cited by 999 (25 self)
 Add to MetaCart
Abstract. DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether those genes are active, hyperactive or silent in normal or cancerous tissue. Because these new microarray devices generate bewildering amounts of raw data, new analytical methods must be developed to sort out whether cancer tissues have distinctive signatures of gene expression over normal tissues or other types of cancer tissues. In this paper, we address the problem of selection of a small subset of genes from broad patterns of gene expression data, recorded on DNA microarrays. Using available training examples from cancer and normal patients, we build a classifier suitable for genetic diagnosis, as well as drug discovery. Previous attempts to address this problem select genes with correlation techniques. We propose a new method of gene selection utilizing Support Vector Machine methods based on Recursive Feature Elimination (RFE). We demonstrate experimentally that the genes selected by our techniques yield better classification performance and are biologically relevant to cancer. In contrast with the baseline method, our method eliminates gene redundancy automatically and yields better and more compact gene subsets. In patients with leukemia our method discovered 2 genes that yield zero leaveoneout error, while 64 genes are necessary for the baseline method to get the best result (one leaveoneout error). In the colon cancer database, using only 4 genes our method is 98 % accurate, while the baseline method is only 86 % accurate.
Scalesensitive Dimensions, Uniform Convergence, and Learnability
, 1997
"... Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distributionfree convergence property of means to expectations uniformly over classes of random variables. Classes of realvalued functions ..."
Abstract

Cited by 237 (2 self)
 Add to MetaCart
Learnability in Valiant's PAC learning model has been shown to be strongly related to the existence of uniform laws of large numbers. These laws define a distributionfree convergence property of means to expectations uniformly over classes of random variables. Classes of realvalued functions enjoying such a property are also known as uniform GlivenkoCantelli classes. In this paper we prove, through a generalization of Sauer's lemma that may be interesting in its own right, a new characterization of uniform GlivenkoCantelli classes. Our characterization yields Dudley, Gin'e, and Zinn's previous characterization as a corollary. Furthermore, it is the first based on a simple combinatorial quantity generalizing the VapnikChervonenkis dimension. We apply this result to obtain the weakest combinatorial condition known to imply PAC learnability in the statistical regression (or "agnostic") framework. Furthermore, we show a characterization of learnability in the probabilistic concept model, solving an open problem posed by Kearns and Schapire. These results show that the accuracy parameter plays a crucial role in determining the effective complexity of the learner's hypothesis class.
Local Learning Algorithms
 Neural Computation
, 1992
"... Very rarely are training data evenly distributed in the input space. Local learning algorithms attempt to locally adjust the capacity of the training system to the properties of the training set in each area of the input space. The family of local learning algorithms contains known methods, like the ..."
Abstract

Cited by 147 (1 self)
 Add to MetaCart
Very rarely are training data evenly distributed in the input space. Local learning algorithms attempt to locally adjust the capacity of the training system to the properties of the training set in each area of the input space. The family of local learning algorithms contains known methods, like the kNearest Neighbors method (kNN) or the Radial Basis Function networks (RBF), as well as new algorithms. A single analysis models some aspects of these algorithms. In particular, it suggests that neither kNN or RBF, nor non local classifiers, achieve the best compromise between locality and capacity. A careful control of these parameters in a simple local learning algorithm has provided a performance breakthrough for an optical character recognition problem. Both the error rate and the rejection performance have been significantly improved. 1 Introduction. Here is a simple local algorithm: For each testing pattern, (1) select the few training examples located in the vicinity of the testing...
Improving Regression Estimation: Averaging Methods for Variance Reduction with Extensions to General Convex Measure Optimization
, 1993
"... ..."
Measuring the VCdimension of a Learning Machine
 Neural Computation
, 1994
"... A method for measuring the capacity of learning machines is described. The method is based on fitting a theoretically derived function to empirical measurements of the maximal difference between the error rates on two separate data sets of varying sizes. Experimental measurements of the capacity of ..."
Abstract

Cited by 54 (2 self)
 Add to MetaCart
(Show Context)
A method for measuring the capacity of learning machines is described. The method is based on fitting a theoretically derived function to empirical measurements of the maximal difference between the error rates on two separate data sets of varying sizes. Experimental measurements of the capacity of various types of linear classifiers are presented. 1 Introduction. Many theoretical and experimental studies have shown the influence of the capacity of a learning machine on its generalization ability (Vapnik, 1982; Baum and Haussler, 1989; Le Cun et al., 1990; Weigend, Rumelhart and Huberman, 1991; Guyon et al., 1992; AbuMostafa, 1993). Learning machines with a small capacity may not require large training sets to approach the best possible solution (lowest error rate on test sets). Highcapacity learning machines, on the other hand, may provide better asymptotical solutions (i.e. lower test error rate for very large training sets), but may require large amounts of training data to reach...
Discovering Neural Nets With Low Kolmogorov Complexity And High Generalization Capability
 Neural Networks
, 1997
"... Many neural net learning algorithms aim at finding "simple" nets to explain training data. The expectation is: the "simpler" the networks, the better the generalization on test data (! Occam's razor). Previous implementations, however, use measures for "simplicity&quo ..."
Abstract

Cited by 54 (30 self)
 Add to MetaCart
Many neural net learning algorithms aim at finding "simple" nets to explain training data. The expectation is: the "simpler" the networks, the better the generalization on test data (! Occam's razor). Previous implementations, however, use measures for "simplicity" that lack the power, universality and elegance of those based on Kolmogorov complexity and Solomonoff's algorithmic probability. Likewise, most previous approaches (especially those of the "Bayesian" kind) suffer from the problem of choosing appropriate priors. This paper addresses both issues. It first reviews some basic concepts of algorithmic complexity theory relevant to machine learning, and how the SolomonoffLevin distribution (or universal prior) deals with the prior problem. The universal prior leads to a probabilistic method for finding "algorithmically simple" problem solutions with high generalization capability. The method is based on Levin complexity (a timebounded generalization of Kolmogorov comple...
Design of a Linguistic Postprocessor using Variable Memory Length Markov Models
 In International Conference on Document Analysis and Recognition
, 1995
"... We present the design of a linguistic postprocessor for character recognizers. The central module of our system is a trainable variable memory length Markov model (VLMM) which predicts the next character given a variable length window of past characters. The overall system is composed of several fin ..."
Abstract

Cited by 51 (1 self)
 Add to MetaCart
(Show Context)
We present the design of a linguistic postprocessor for character recognizers. The central module of our system is a trainable variable memory length Markov model (VLMM) which predicts the next character given a variable length window of past characters. The overall system is composed of several finite state automata, including the main VLMM and a proper noun VLMM. The best model reported in the literature (Brown et al 1992) achieves 1.75 bits per character on the Brown corpus. On that same corpus, our model, trained on 10 times less data, reaches 2.19 bits per character and is 200 times smaller (_ 160,000 parameters). The model was designed for handwriting recognition applications but can be used for other OCR problems and speech recognition.
Automatic Capacity Tuning of Very Large VCdimension Classifiers
 Advances in Neural Information Processing Systems
, 1993
"... Large VCdimension classifiers can learn difficult tasks, but are usually impractical because they generalize well only if they are trained with huge quantities of data. In this paper we show that even very highorder polynomial classifiers can be trained with a small amount of training data and ..."
Abstract

Cited by 45 (1 self)
 Add to MetaCart
(Show Context)
Large VCdimension classifiers can learn difficult tasks, but are usually impractical because they generalize well only if they are trained with huge quantities of data. In this paper we show that even very highorder polynomial classifiers can be trained with a small amount of training data and yet generalize better than classifiers with a smaller VCdimension. This is achieved with a maximum margin algorithm (the Generalized Portrait).