On the Generalization Ability of Online Learning Algorithms
 IEEE Transactions on Information Theory
, 2001
Cited by 135 (8 self)
In this paper we show that online algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary online learning algorithms. Furthermore, when applied to concrete online algorithms, our results yield tail bounds that in many cases are comparable to or better than the best known bounds.
An introduction to boosting and leveraging
 Advanced Lectures on Machine Learning, LNCS
, 2003
Learning the Kernel with Hyperkernels
, 2003
Cited by 81 (2 self)
This paper addresses the problem of choosing a kernel suitable for estimation with a Support Vector Machine, hence further automating machine learning. This goal is achieved by defining a Reproducing Kernel Hilbert Space on the space of kernels itself. Such a formulation leads to a statistical estimation problem very much akin to the problem of minimizing a regularized risk functional.
The Set Covering Machine
, 2002
Cited by 24 (7 self)
We extend the classical algorithms of Valiant and Haussler for learning compact conjunctions and disjunctions of Boolean attributes to allow features that are constructed from the data and to allow a trade-off between accuracy and complexity. The result is a general-purpose learning machine, suitable for practical learning tasks, that we call the set covering machine. We present a version of the set covering machine that uses data-dependent balls for its set of features and compare its performance with the support vector machine. By extending a technique pioneered by Littlestone and Warmuth, we bound its generalization error as a function of the amount of data compression it achieves during training. In experiments with real-world learning tasks, the bound is shown to be extremely tight and to provide an effective guide for model selection.
Controlling sparseness in nonnegative tensor factorization
 In: ECCV
, 2006
Cited by 17 (0 self)
Nonnegative tensor factorization (NTF) has recently been proposed as a sparse and efficient image representation (Welling and Weber, Patt. Rec. Let., 2001). Until now, sparsity of the tensor factorization has been empirically observed in many cases, but there was no systematic way to control it. In this work, we show that a sparsity measure recently proposed for nonnegative matrix factorization (Hoyer, J. Mach. Learn. Res., 2004) applies to NTF and allows precise control over sparseness of the resulting factorization. We devise an algorithm based on sequential conic programming and show improved performance over classical NTF codes on artificial and on real-world data sets.
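For orientation, the sparsity measure of Hoyer (2004) referred to in this abstract is commonly written as sparseness(x) = (sqrt(n) - ||x||_1 / ||x||_2) / (sqrt(n) - 1) for a vector x of length n. The sketch below only evaluates that measure for a single vector; it is an illustration under that assumption and is not the sequential conic programming algorithm of the paper. The function name `hoyer_sparseness` is hypothetical.

```python
import math


def hoyer_sparseness(x):
    """Hoyer-style sparseness of a vector: 1 for a single nonzero entry,
    0 when all entries have equal magnitude. Illustrative helper only."""
    n = len(x)
    if n < 2:
        raise ValueError("need at least two entries")
    l1 = sum(abs(v) for v in x)
    l2 = math.sqrt(sum(v * v for v in x))
    if l2 == 0.0:
        raise ValueError("sparseness is undefined for the zero vector")
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


if __name__ == "__main__":
    print(hoyer_sparseness([1.0, 0.0, 0.0, 0.0]))  # maximally sparse
    print(hoyer_sparseness([1.0, 1.0, 1.0, 1.0]))  # maximally dense
```

Constraining each factor to a target value of this measure is what "control over sparseness" would mean in practice; how the constraint is enforced is specific to the paper and is not shown here.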
Simpler knowledge-based support vector machines
 In ICML
, 2006
Cited by 8 (0 self)
If appropriately used, prior knowledge can significantly improve the predictive accuracy of learning algorithms or reduce the amount of training data needed. In this paper we introduce a simple method to incorporate prior knowledge in support vector machines by modifying the hypothesis space rather than the optimization problem. The optimization problem is amenable to solution by the constrained concave convex procedure, which finds a local optimum. The paper discusses different kinds of prior knowledge and demonstrates the applicability of the approach in some characteristic experiments.
PAC-Bayesian generalisation error bounds for Gaussian process classification
 Journal of Machine Learning Research
, 2002
Cited by 8 (0 self)
Approximate Bayesian Gaussian process (GP) classification techniques are powerful nonparametric learning methods, similar in appearance and performance to support vector machines. Based on simple probabilistic models, they render interpretable results and can be embedded in Bayesian frameworks for model selection, feature selection, etc. In this paper, by applying the PAC-Bayesian theorem of McAllester (1999a), we prove distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques. We also provide a new and much simplified proof for this powerful theorem, making use of the concept of convex duality which is a backbone of many machine learning techniques. We instantiate and test our bounds for two particular GPC techniques, including a recent sparse method which circumvents the unfavourable scaling of standard GP algorithms. As is shown in experiments on a real-world task, the bounds can be very tight for moderate training sample sizes. To the best of our knowledge, these results provide the tightest known distribution-free error bounds for approximate Bayesian GPC methods, giving a strong learning-theoretical justification for the use of these techniques.
Mathematical Aspects of Neural Networks
 European Symposium of Artificial Neural Networks 2003
, 2003
Cited by 6 (4 self)
In this tutorial paper about mathematical aspects of neural networks, we will focus on two directions: on the one hand, we will motivate standard mathematical questions and well-studied theory of classical neural models used in machine learning. On the other hand, we collect some recent theoretical results (as of beginning of 2003) in the respective areas. Thereby, we follow the dichotomy offered by the overall network structure and restrict ourselves to feedforward networks, recurrent networks, and self-organizing neural systems, respectively.
The sample complexity of dictionary learning
 In Proc. Conference on Learning Theory
, 2011
Cited by 6 (1 self)
A large set of signals can sometimes be described sparsely using a dictionary, that is, every element can be represented as a linear combination of few elements from the dictionary. Algorithms for various signal processing applications, including classification, denoising and signal separation, learn a dictionary from a given set of signals to be represented. Can we expect that the error in representing by such a dictionary a previously unseen signal from the same source will be of similar magnitude as those for the given examples? We assume signals are generated from a fixed distribution, and study these questions from a statistical learning theory perspective. We develop generalization bounds on the quality of the learned dictionary for two types of constraints on the coefficient selection, as measured by the expected L2 error in representation when the dictionary is used. For the case of l1 regularized coefficient selection we provide a generalization bound of the order of $O\left(\sqrt{np\ln(m\lambda)/m}\right)$, where n is the dimension, p is the number of elements in the dictionary, λ is a bound on the l1 norm of the coefficient vector and m is the number of samples, which complements existing results. For the case of representing a new signal as a combination of at most k dictionary elements, we provide a bound of the order $O\left(\sqrt{np\ln(mk)/m}\right)$ under an assumption on the closeness to orthogonality of the dictionary (low Babel function). We further show that this assumption holds for most dictionaries in high dimensions in a strong probabilistic sense. Our results also include bounds that converge as 1/m, not previously known for this problem. We provide similar results in a general setting using kernels with weak smoothness requirements.
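To get a feel for how the two rates in this abstract, sqrt(n p ln(m λ) / m) and sqrt(n p ln(m k) / m), scale with the sample size m, the sketch below evaluates them numerically. The constants hidden by the O(.) notation are not given in the abstract, so they are assumed to be 1 here and the values are only meaningful relative to one another. The function names are hypothetical.

```python
import math


def l1_rate(n, p, m, lam):
    """Order of the l1-regularized bound, sqrt(n*p*ln(m*lam)/m).
    The leading constant is assumed to be 1 (not stated in the abstract)."""
    if m * lam <= 1:
        raise ValueError("need m * lam > 1 so that the logarithm is positive")
    return math.sqrt(n * p * math.log(m * lam) / m)


def k_sparse_rate(n, p, m, k):
    """Order of the k-sparse bound, sqrt(n*p*ln(m*k)/m), again with an
    assumed leading constant of 1."""
    if m * k <= 1:
        raise ValueError("need m * k > 1 so that the logarithm is positive")
    return math.sqrt(n * p * math.log(m * k) / m)


if __name__ == "__main__":
    # Dimension n = 10, dictionary size p = 20; grow the sample size m.
    for m in (1_000, 10_000, 100_000):
        print(m, l1_rate(10, 20, m, lam=1.0), k_sparse_rate(10, 20, m, k=3))
```

Both rates shrink roughly like sqrt(ln(m) / m) as m grows, while they grow with the square root of n times p.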
PAC-Bayesian compression bounds on the prediction error of learning algorithms for classification
 Machine Learning
, 2005
Cited by 5 (1 self)
We consider bounds on the prediction error of classification algorithms based on sample compression. We refine the notion of a compression scheme to distinguish permutation and repetition invariant and non-permutation and repetition invariant compression schemes leading to different prediction error bounds. Also, we extend known results on compression to the case of non-zero empirical risk. We provide bounds on the prediction error of classifiers returned by mistake-driven online learning algorithms by interpreting mistake bounds as bounds on the size of the respective compression scheme of the algorithm. This leads to a bound on the prediction error of perceptron solutions that depends on the margin a support vector machine would achieve on the same training sample. Furthermore, using the property of compression we derive bounds on the average prediction error of kernel classifiers in the PAC-Bayesian framework. These bounds assume a prior measure over the expansion coefficients in the data-dependent kernel expansion and bound the average prediction error uniformly over subsets of the space of expansion coefficients.
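The link between mistake bounds and compression mentioned in this abstract can be illustrated with the classical perceptron: its final weight vector depends only on the examples on which it made a mistake, so the list of those examples acts as a compressed sample. The sketch below is a minimal illustration under that reading, not the paper's construction; the function names and the toy data are hypothetical.

```python
def perceptron_with_mistakes(xs, ys, max_epochs=100):
    """Run the classical perceptron (labels in {-1, +1}, no bias term) and
    record the index of every example that triggered an update."""
    w = [0.0] * len(xs[0])
    mistakes = []
    for _ in range(max_epochs):
        clean_pass = True
        for i, (x, y) in enumerate(zip(xs, ys)):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            if margin <= 0:  # mistake: update and remember the example
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes.append(i)
                clean_pass = False
        if clean_pass:
            break
    return w, mistakes


def reconstruct(xs, ys, mistakes):
    """Rebuild the weight vector from the mistake examples alone."""
    w = [0.0] * len(xs[0])
    for i in mistakes:
        w = [wi + ys[i] * xi for wi, xi in zip(w, xs[i])]
    return w


if __name__ == "__main__":
    xs = [[1.0, 1.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]]
    ys = [1, 1, -1, -1]
    w, mistakes = perceptron_with_mistakes(xs, ys)
    print(w, mistakes, reconstruct(xs, ys, mistakes) == w)
```

Because the weight vector is a sum over the recorded mistakes, an index that appears twice contributes twice, which is one way to see why the treatment of repetitions matters when counting the size of the compression set.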