• Documents
  • Authors
  • Tables
  • Other Seers ▼
    RefSeer AckSeer CollabSeer SeerSeer
  • Log in
  • Sign up
  • MetaCart

CiteSeerX logo

Advanced Search Include Citations
Advanced Search Include Citations | Disambiguate

A bound on the error of cross validation using the approximation and estimation rates, with consequences for the trainingtest split (1997)

by Michael Kearns
Venue:Neural Computation
Add To MetaCart

Tools

Sorted by:
Results 1 - 10 of 20
Next 10 →

Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods

by John C. Platt - ADVANCES IN LARGE MARGIN CLASSIFIERS , 1999
"... The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. Howev ..."
Abstract - Cited by 503 (0 self) - Add to MetaCart
The output of a classifier should be a calibrated posterior probability to enable post-processing. Standard SVMs do not provide such probabilities. One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score will produce non-sparse kernel machines. Instead, we train an SVM, then train the parameters of an additional sigmoid function to map the SVM outputs into probabilities. This chapter compares classification error rate and likelihood scores for an SVM plus sigmoid versus a kernel method trained with a regularized likelihood error function. These methods are tested on three data-mining-style data sets. The SVM+sigmoid yields probabilities of comparable quality to the regularized maximum likelihood kernel method, while still retaining the sparseness of the SVM.

The Connection between Regularization Operators and Support Vector Kernels

by Alex J. Smola, Bernhard Schölkopf, Klaus-Robert Müller , 1998
"... In this paper a correspondence is derived between regularization operators used in Regularization Networks and Support Vector Kernels. We prove that the Green's Functions associated with regularization operators are suitable Support Vector Kernels with equivalent regularization properties. Moreover ..."
Abstract - Cited by 119 (35 self) - Add to MetaCart
In this paper a correspondence is derived between regularization operators used in Regularization Networks and Support Vector Kernels. We prove that the Green's Functions associated with regularization operators are suitable Support Vector Kernels with equivalent regularization properties. Moreover the paper provides an analysis of currently used Support Vector Kernels in the view of regularization theory and corresponding operators associated with the classes of both polynomial kernels and translation invariant kernels. The latter are also analyzed on periodical domains. As a by-product we show that a large number of Radial Basis Functions, namely conditionally positive definite functions, may be used as Support Vector kernels.

Estimating the Generalization Performance of an SVM Efficiently

by Thorsten Joachims , 2000
"... This paper proposes and analyzes an approach to estimating the generalization performance of a support vector machine (SVM) for text classification. Without any computation intensive resampling, the new estimators are computationally much more ecient than cross-validation or bootstrap, since they ca ..."
Abstract - Cited by 79 (1 self) - Add to MetaCart
This paper proposes and analyzes an approach to estimating the generalization performance of a support vector machine (SVM) for text classification. Without any computation intensive resampling, the new estimators are computationally much more ecient than cross-validation or bootstrap, since they can be computed immediately from the form of the hypothesis returned by the SVM. Moreover, the estimators delevoped here address the special performance measures needed for text classification. While they can be used to estimate error rate, one can also estimate the recall, the precision, and the F 1 . A theoretical analysis and experiments on three text classification collections show that the new method can effectively estimate the performance of SVM text classifiers in a very efficient way.

Model Selection for Probabilistic Clustering Using Cross-Validated Likelihood

by Padhraic Smyth - Statistics and Computing , 1998
"... Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to mod ..."
Abstract - Cited by 46 (3 self) - Add to MetaCart
Cross-validated likelihood is investigated as a tool for automatically determining the appropriate number of components (given the data) in finite mixture modelling, particularly in the context of model-based probabilistic clustering. The conceptual framework for the cross-validation approach to model selection is direct in the sense that models are judged directly on their out-of-sample predictive performance. The method is applied to a well-known clustering problem in the atmospheric science literature using historical records of upper atmosphere geopotential height in the Northern hemisphere. Cross-validated likelihood provides strong evidence for three clusters in the data set, providing an objective confirmation of earlier results derived using non-probabilistic clustering techniques. 1 Introduction Cross-validation is a well-known technique in supervised learning to select a model from a family of candidate models. Examples include selecting the best classification tree using cr...

On Feature Selection: Learning with Exponentially many Irrelevant Features as Training Examples

by Andrew Y. Ng - Proceedings of the Fifteenth International Conference on Machine Learning , 1998
"... We consider feature selection in the "wrapper " model of feature selection. This typically involves an NP-hard optimization problem that is approximated by heuristic search for a "good" feature subset. First considering the idealization where this optimization is performed exactly, we give a rigorou ..."
Abstract - Cited by 32 (4 self) - Add to MetaCart
We consider feature selection in the "wrapper " model of feature selection. This typically involves an NP-hard optimization problem that is approximated by heuristic search for a "good" feature subset. First considering the idealization where this optimization is performed exactly, we give a rigorous bound for generalization error under feature selection. The search heuristics typically used are then immediately seen as trying to achieve the error given in our bounds, and succeeding to the extent that they succeed in solving the optimization. The bound suggests that, in the presence of many "irrelevant" features, the main source of error in wrapper model feature selection is from "overfitting " hold-out or cross-validation data. This motivates a new algorithm that, again under the idealization of performing search exactly, has sample complexity (and error) that grows logarithmically in the number of "irrelevant" features -- which means it can tolerate having a number of "irrelevant" f...

Preventing "Overfitting" of Cross-Validation Data

by Andrew Y. Ng - In Proceedings of the Fourteenth International Conference on Machine Learning , 1997
"... Suppose that, for a learning task, we have to select one hypothesis out of a set of hypotheses (that may, for example, have been generated by multiple applications of a randomized learning algorithm). A common approach is to evaluate each hypothesis in the set on some previously unseen cross-validat ..."
Abstract - Cited by 29 (1 self) - Add to MetaCart
Suppose that, for a learning task, we have to select one hypothesis out of a set of hypotheses (that may, for example, have been generated by multiple applications of a randomized learning algorithm). A common approach is to evaluate each hypothesis in the set on some previously unseen cross-validation data, and then to select the hypothesis that had the lowest cross-validation error. But when the cross-validation data is partially corrupted such as by noise, and if the set of hypotheses we are selecting from is large, then "folklore" also warns about "overfitting" the crossvalidation data [Klockars and Sax, 1986, Tukey, 1949, Tukey, 1953]. In this paper, we explain how this "overfitting" really occurs, and show the surprising result that it can be overcome by selecting a hypothesis with a higher cross-validation error, over others with lower cross-validation errors. We give reasons for not selecting the hypothesis with the lowest cross-validation error, and propose a new algorithm, L...

Adaptive Regularization in Neural Network Modeling

by Jan Larsen, Claus Svarer, Lars Nonboe Andersen, Lars Kai Hansen , 1997
"... . In this paper we address the important problem of optimizing regularization parameters in neural network modeling. The suggested optimization scheme is an extended version of the recently presented algorithm [24]. The idea is to minimize an empirical estimate -- like the cross-validation estimate ..."
Abstract - Cited by 12 (2 self) - Add to MetaCart
. In this paper we address the important problem of optimizing regularization parameters in neural network modeling. The suggested optimization scheme is an extended version of the recently presented algorithm [24]. The idea is to minimize an empirical estimate -- like the cross-validation estimate -- of the generalization error with respect to regularization parameters. This is done by employing a simple iterative gradient descent scheme using virtually no additional programming overhead compared to standard training. Experiments with feed-forward neural network models for time series prediction and classification tasks showed the viability and robustness of the algorithm. Moreover, we provided some simple theoretical examples in order to illustrate the potential and limitations of the proposed regularization framework. 1 Introduction Neural networks are flexible tools for time series processing and pattern recognition. By increasing the number of hidden neurons in a 2-layer architec...

Pruning decision trees and lists

by Eibe Frank , 2000
"... at the ..."
Abstract - Cited by 10 (0 self) - Add to MetaCart
Abstract not found

Additive Regularization: Fusion of Training and Validation Levels in Kernel Methods

by K. Pelckmans, J. A. K. Suykens, B. De Moor - INTERNAL REPORT 03-184, ESAT-SCD-SISTA, K.U.LEUVEN , 2003
"... In this paper the training of Least Squares Support Vector Machines (LS-SVMs) for classification and regression and the determination of its regularization constants is reformulated in terms of additive regularization. In contrast with the classical Tikhonov scheme, a major advantage of this additiv ..."
Abstract - Cited by 8 (7 self) - Add to MetaCart
In this paper the training of Least Squares Support Vector Machines (LS-SVMs) for classification and regression and the determination of its regularization constants is reformulated in terms of additive regularization. In contrast with the classical Tikhonov scheme, a major advantage of this additive regularization mechanism is that it enables to achieve computational fusion of the training and validation levels leading to the solution of one single set of linear equations that characterizes the training and validation at once. The problem of avoiding overfitting on validation data is approached by restricting explicitly the degrees of freedom of the regularization constants. Di#erent restriction schemes are investigated, including an ensemble model approach. The link between the Tikhonov scheme and additive regularization is explained and an efficient cross-validation method with additive regularization is proposed. The new methods are illustrated with several examples on synthetic and real-life data sets.

Best-first decision tree learning

by Haijian Shi - University of Waikato , 2007
"... Decision trees are potentially powerful predictors and explicitly represent the structure of a dataset. Standard decision tree learners such as C4.5 expand nodes in depth-first order (Quinlan, 1993), while in best-first decision tree learners the ”best ” node is expanded first. The ”best ” node is t ..."
Abstract - Cited by 8 (0 self) - Add to MetaCart
Decision trees are potentially powerful predictors and explicitly represent the structure of a dataset. Standard decision tree learners such as C4.5 expand nodes in depth-first order (Quinlan, 1993), while in best-first decision tree learners the ”best ” node is expanded first. The ”best ” node is the node whose split leads to maximum reduction of impurity (e.g. Gini index or information in this thesis) among all nodes available for splitting. The resulting tree will be the same when fully grown, just the order in which it is built is different. In practice, some branches of a fully-expanded tree do not truly reflect the underlying information in the domain. This problem is known as overfitting and is mainly caused by noisy data. Pruning is necessary to avoid overfitting the training data, and discards those parts that are not predictive of future data. Best-first node expansion enables us to investigate new pruning techniques by determining the number of expansions performed based on cross-validation. This thesis first introduces the algorithm for building binary best-first decision trees for classification problems. Then, it investigates two new pruning methods that
The National Science Foundation
  • About CiteSeerX
  • Submit Documents
  • Privacy Policy
  • Help
  • Data
  • Source
  • Contact Us

Developed at and hosted by The College of Information Sciences and Technology

© 2007-2010 The Pennsylvania State University