## Adaptive Sparseness for Supervised Learning (2003)

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence

Citations: 81 (4 self)

### BibTeX

```bibtex
@ARTICLE{Figueiredo03adaptivesparseness,
  author  = {Mario A. T. Figueiredo},
  title   = {Adaptive Sparseness for Supervised Learning},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2003},
  volume  = {25},
  pages   = {1150--1159}
}
```

### Abstract

The goal of supervised learning is to infer a functional mapping based on a set of training examples. To achieve good generalization, it is necessary to control the "complexity" of the learned function. In Bayesian approaches, this is done by adopting a prior for the parameters of the function being learned. We propose a Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero. Other ways to obtain sparse classifiers (such as Laplacian priors, support vector machines) involve (hyper)parameters which control the degree of sparseness of the resulting classifiers; these parameters have to be somehow adjusted/estimated from the training data. In contrast, our approach does not involve any (hyper)parameters to be adjusted or estimated. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, which is then modified by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an expectation-maximization (EM) algorithm. Experiments with several benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms SVMs and performs competitively with the best alternative techniques, although it involves no tuning or adjustment of sparseness-controlling hyperparameters.
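
The abstract's EM scheme can be illustrated with a rough sketch: an iteratively reweighted ridge update in which each coefficient's effective prior variance is re-estimated from its current magnitude, so that irrelevant coefficients are driven exactly to zero without any sparseness hyperparameter. The specific update form, function name, and toy data below are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def adaptive_sparse_regression(H, y, sigma2=0.1, n_iter=100):
    """Sketch of an EM-style iteratively reweighted ridge update in the
    spirit of adaptive sparseness (assumed form, not the paper's exact
    algorithm).  H: (n, d) design matrix, y: (n,) targets."""
    n, d = H.shape
    beta = np.linalg.lstsq(H, y, rcond=None)[0]  # least-squares start
    for _ in range(n_iter):
        V = np.diag(np.abs(beta))  # small |beta_i| -> strong shrinkage; zeros stay zero
        M = sigma2 * np.eye(d) + V @ H.T @ H @ V
        beta = V @ np.linalg.solve(M, V @ (H.T @ y))
    return beta

# toy usage: only two of ten coefficients are relevant
rng = np.random.default_rng(0)
H = rng.standard_normal((100, 10))
beta_true = np.zeros(10)
beta_true[0], beta_true[3] = 5.0, -3.0
y = H @ beta_true
beta_hat = adaptive_sparse_regression(H, y)
```

Note the reweighted form avoids dividing by current coefficient values, so a coefficient that reaches zero remains exactly zero on later iterations.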

### Citations

9002 | The Nature of Statistical Learning Theory - Vapnik - 1995 |

3927 | Classification and Regression Trees - Breiman - 1984 |

Citation Context: ...is on learning the function f(x; β) directly from the training set. Well-known discriminative approaches include linear and generalized linear models, k-nearest neighbor classifiers, tree classifiers [9], feed-forward neural networks [6], support vector machines (SVM), and other kernel-based methods [4], [7], [10]. Although usually computationally more demanding, discriminative approaches tend to pe...

1858 | Regression shrinkage and selection via the Lasso - Tibshirani - 1996 |

Citation Context: ...is density. The sparseness-inducing nature of the Laplacian prior (or, equivalently, of the l1 penalty from a regularization point of view) is well-known and has been exploited in several areas [15], [16], [17]. SVMs are another approach to supervised learning leading to sparse structures. Both in approaches based on Laplacian priors and in SVMs, there are hyperparameters (e.g., in (1)) controlling th...

1673 | Atomic decomposition by basis pursuit - Chen, Donoho, et al. - 1998 |

Citation Context: ...of this density. The sparseness-inducing nature of the Laplacian prior (or, equivalently, of the l1 penalty from a regularization point of view) is well-known and has been exploited in several areas [15], [16], [17]. SVMs are another approach to supervised learning leading to sparse structures. Both in approaches based on Laplacian priors and in SVMs, there are hyperparameters (e.g., in (1)) controll...

1551 | An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods - Cristianini, Shawe-Taylor - 2000 |

Citation Context: ...e function may not be rich enough to capture the true underlying relationship (underfitting). This is a well-known problem which has been addressed with a variety of formal tools (see, e.g., [3], [4], [5], [6], [7], and references therein). 2 OVERVIEW OF METHODS AND THE PROPOSED APPROACH 2.1 Discriminative versus Generative Learning Supervised learning can be formulated using either a generative ...

1375 | Generalized Linear Models - McCullagh, Nelder - 1989 |

Citation Context: ...dels In classification problems, the formulation is somewhat more complicated due to the categorical nature of the output variable. A standard approach is to consider a generalized linear model (GLM) [28]. For a two-class problem (y ∈ {−1, 1}), a GLM models the probability that an observation x belongs to, say, class 1, as given by a nonlinearity applied to the output of a linear regression function. F...

1281 | Spline Models for Observational Data - Wahba - 1990 |

Citation Context: ...like ridge regression [11], or weight decay [5], [6]. Gaussian priors are also at the heart of the nonparametric Gaussian processes approach [5], [10], [12], which has roots in earlier spline models [13] and regularized radial basis functions [14]. The main disadvantage of Gaussian priors is that they do not control the structural complexity of the learned function. That is, if one of the components ...

1245 | Statistical Decision Theory and Bayesian Analysis - Berger - 1985 |

Citation Context: ...as proposed for robust regression in [18]); 2. a Jeffreys' noninformative second-level hyperprior (in the same spirit as [19]), which expresses scale-invariance and, more importantly, is parameter-free [20]; 3. an expectation-maximization (EM) algorithm which yields a maximum a posteriori (MAP) estimate of β (and of the observation noise variance, in the case of regression). Experimental evaluation of the...

1116 | Pattern Recognition and Neural Networks - Ripley - 1996 |

Citation Context: ...may not be rich enough to capture the true underlying relationship (underfitting). This is a well-known problem which has been addressed with a variety of formal tools (see, e.g., [3], [4], [5], [6], [7], and references therein). 2 OVERVIEW OF METHODS AND THE PROPOSED APPROACH 2.1 Discriminative versus Generative Learning Supervised learning can be formulated using either a generative approach o...

607 | Bayesian Learning for Neural Networks - Neal - 1996 |

493 | Ridge regression: Biased estimation for nonorthogonal problems - Hoerl, Kennard - 1970 |

Citation Context: ...is a vector of hyperparameters) [5]. The usual choice, namely for analytical and computational tractability, is a zero-mean Gaussian prior, which appears under different guises, like ridge regression [11], or weight decay [5], [6]. Gaussian priors are also at the heart of the nonparametric Gaussian processes approach [5], [10], [12], which has roots in earlier spline models [13] and regularized radial...
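
The context above notes that a zero-mean Gaussian prior on the weights corresponds to ridge regression. As a quick illustration (with an assumed penalty weight `lam` standing in for the noise-to-prior variance ratio), the MAP estimate has the familiar closed form:

```python
import numpy as np

def ridge(H, y, lam):
    """MAP estimate under a zero-mean Gaussian prior on the weights:
    minimizes ||y - H @ beta||^2 + lam * ||beta||^2."""
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(d), H.T @ y)

# with H = I and y = [2, 2], lam = 1: beta = (I + I)^{-1} y = [1, 1]
beta = ridge(np.eye(2), np.array([2.0, 2.0]), 1.0)
```

This shrinks all coefficients toward zero uniformly, which is exactly the structural limitation of Gaussian priors the context goes on to describe: no coefficient is ever set exactly to zero.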

455 | Bayesian analysis of binary and polychotomous response data - Albert, Chib - 1993 |

Citation Context: ...t this scale is implicitly absorbed by β. 4.3 Learning a Probit Classifier via EM The fundamental reason behind our choice of the probit model is its simple interpretation in terms of hidden variables [30]. Consider a hidden variable z = βᵀh(x) + w, where w is a zero-mean unit-variance Gaussian noise sample, p(w) = N(w|0, 1). Then, if the classification rule is y = 1 if z ≥ 0, and y = −1 if z < 0, we obtain the...
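
The hidden-variable view quoted above can be checked numerically: with z = m + w and w ~ N(0, 1), the rule y = 1 iff z ≥ 0 yields P(y = 1) = Φ(m), the probit link. A small Monte Carlo sketch (the value m = 0.7 and sample size are arbitrary choices):

```python
import numpy as np
from math import erf, sqrt

def probit(m):
    """Standard normal cdf Phi(m)."""
    return 0.5 * (1.0 + erf(m / sqrt(2.0)))

# Monte Carlo: P(m + w >= 0) for w ~ N(0, 1) should match Phi(m)
rng = np.random.default_rng(1)
m = 0.7
w = rng.standard_normal(200_000)
frac = np.mean(m + w >= 0)  # empirical fraction of y = 1 outcomes
```

The empirical fraction `frac` agrees with `probit(m)` to within Monte Carlo error, which is the "simple interpretation in terms of hidden variables" the context refers to.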

280 | Multivariate Statistical Modeling Based on Generalized Linear Models - Fahrmeir, Tutz - 1994 |

Citation Context: ...cumulative distribution function (cdf) Φ(z) = ∫_{−∞}^{z} N(x|0, 1) dx (18). The rescaled probit Φ(3z/2) is plotted in Fig. 2, together with the logistic function; notice that they are almost indistinguishable [29]. Of course, both the logistic and probit functions can be rescaled (horizontally), but this scale is implicitly absorbed by β. 4.3 Learning a Probit Classifier via EM The fundamental reason behind our...

215 | Variational Relevance Vector Machines - Bishop, Tipping - 2000 |

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond - Williams - 1999 |

Citation Context: ...e linear and generalized linear models, k-nearest neighbor classifiers, tree classifiers [9], feed-forward neural networks [6], support vector machines (SVM), and other kernel-based methods [4], [7], [10]. Although usually computationally more demanding, discriminative approaches tend to perform better, especially with small training data sets (see [7]). The approach described in this paper falls in ...

175 | Analysis of multiresolution image denoising schemes using generalized Gaussian and complexity priors - Moulin, Liu - 1999 |

Citation Context: ...cting a constant (equal to the threshold) from the absolute value of (Hᵀy)_i. This rule is called the soft threshold (see Fig. 1), and is widely used in wavelet-based signal/image estimation [24], [25]. 3.3 A Hierarchical-Bayes View of the Laplacian Prior Let us now consider that each β_i has a zero-mean Gaussian prior p(β_i|τ_i) = N(β_i|0, τ_i), with its own variance τ_i, and that each τ_i has an exponential (...
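
The soft-threshold rule described in the context above (subtract a constant from the absolute value, keep the sign, and clip at zero) can be written in a few lines; the input values and threshold below are arbitrary:

```python
import numpy as np

def soft_threshold(x, t):
    """Soft-threshold rule: shrink |x| by t, set to zero when |x| <= t."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# 3 -> 2, -0.5 -> 0, -2 -> -1, 0.8 -> 0 with threshold t = 1
vals = soft_threshold(np.array([3.0, -0.5, -2.0, 0.8]), 1.0)
```

Unlike the hard threshold, which keeps surviving coefficients unchanged, the soft threshold also shrinks them by t, which is what a Laplacian (l1) prior induces.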

163 | A new approach to variable selection in least squares problems - Osborne, Presnell, et al. - 2000 |

Citation Context: ...arizing, the E-step is implemented by (10), while (12) and (13) constitute the M-step. This EM algorithm is not the most computationally efficient way to solve (3); see, e.g., the methods proposed in [27], [16]. However, it is very simple to implement and serves our main goal, which is to open the way to the adoption of different hyperpriors. 3.5 Comparison with the RVM We are now in position to explai...

84 | Bayesian regularization and pruning using a Laplace prior - Williams - 1995 |

58 | A Theory of Networks for Approximation and Learning - Poggio, Girosi - 1989 |

Citation Context: ...[5], [6]. Gaussian priors are also at the heart of the nonparametric Gaussian processes approach [5], [10], [12], which has roots in earlier spline models [13] and regularized radial basis functions [14]. The main disadvantage of Gaussian priors is that they do not control the structural complexity of the learned function. That is, if one of the components of β (say, a given coefficient of a linear cla...

50 | Wavelet-based image estimation: an empirical Bayes approach using Jeffreys' noninformative prior - Figueiredo, Nowak - 2001 |

Citation Context: ...es interpretation of the Laplacian prior as a normal/independent distribution (as proposed for robust regression in [18]); 2. a Jeffreys' noninformative second-level hyperprior (in the same spirit as [19]), which expresses scale-invariance and, more importantly, is parameter-free [20]; 3. an expectation-maximization (EM) algorithm which yields a maximum a posteriori (MAP) estimate of β (and of the observa...

46 | Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers - Seeger - 1999 |

Citation Context: ...120 test samples. For the WBC problem, there is a total of 569 samples; the results reported were obtained by averaging over 30 random partitions with 300 training samples and 269 test samples (as in [31]). Prior to applying our algorithm, all the inputs are normalized to zero mean and unit variance, as is customary in kernel-based methods. The kernel width was set to 4, for the Pima and crabs probl...

38 | Adaptive sparseness using Jeffreys prior - Figueiredo |

Citation Context: ...hod strongly depend on an adequate choice of these parameters, and our formulation does not contribute to the solution of this problem. ACKNOWLEDGMENTS Earlier versions of this work were presented in [1] and [2]. This work was supported by the Foundation for Science and Technology, Portuguese Ministry of Science and Technology, under project POSI/33143/SRI/2000. 5. Available at www.stats.ox.ac.uk/pub...

29 | Least absolute shrinkage is equivalent to quadratic penalization - Grandvalet - 1998 |

Citation Context: ...Dashed line: Hard-threshold rule. Dotted line: Soft-threshold rule (obtained with a Laplacian prior). robust regression under Laplacian noise models [18]. A related equivalence was also considered in [26]. 3.4 Sparse Regression via EM The hierarchical decomposition of the Laplacian prior allows using the expectation-maximization (EM) algorithm to implement the LASSO criterion in (3). This is done simp...

24 | Bayesian learning of sparse classifiers - Figueiredo, Jain - 2001 |

Citation Context: ...ngly depend on an adequate choice of these parameters, and our formulation does not contribute to the solution of this problem. ACKNOWLEDGMENTS Earlier versions of this work were presented in [1] and [2]. This work was supported by the Foundation for Science and Technology, Portuguese Ministry of Science and Technology, under project POSI/33143/SRI/2000. 5. Available at www.stats.ox.ac.uk/pub/PRNN/. ...

21 | Normal/Independent Distributions and Their Applications in Robust Regression - Lange, Sinsheimer - 1993 |

Citation Context: ...arseness. This is achieved with the following building blocks: 1. a hierarchical-Bayes interpretation of the Laplacian prior as a normal/independent distribution (as proposed for robust regression in [18]); 2. a Jeffreys' noninformative second-level hyperprior (in the same spirit as [19]), which expresses scale-invariance and, more importantly, is parameter-free [20]; 3. an expectation-maximization (EM)...
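
The normal/independent (Gaussian scale mixture) view of the Laplacian prior quoted above can be verified by simulation: if τ ~ Exponential with rate λ²/2 and β | τ ~ N(0, τ), the marginal of β is Laplacian with parameter λ, so E|β| = 1/λ and E[β²] = 2/λ². A sketch with λ = 1 (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 1.0
n = 500_000

# tau ~ Exponential with rate lam^2 / 2, i.e., mean 2 / lam^2
tau = rng.exponential(scale=2.0 / lam**2, size=n)
# beta | tau ~ N(0, tau): Gaussian with per-sample variance tau
beta = rng.standard_normal(n) * np.sqrt(tau)

mean_abs = np.abs(beta).mean()  # should approach E|beta| = 1 / lam
```

This decomposition is exactly what makes the EM treatment possible: conditioned on the variances τ, the prior on β is Gaussian and the M-step is a ridge-like problem.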

19 | Bayesian non-linear modelling for the 1993 energy prediction competition - MacKay - 1996 |

Citation Context: ...vely with (often better than) the state-of-the-art methods (such as SVM). 2.5 Related Approaches Our method is formally and conceptually related to the automatic relevance determination (ARD) concept [21], [5], which underlies the recently proposed relevance vector machine (RVM) [22], [23]. The RVM exhibits state-of-the-art performance: It beats SVMs, both in terms of accuracy and sparseness [22], [23...

16 | Using the Nyström Method to Speed Up Kernel Machines - Williams, Seeger - 2001 |

Citation Context: ...the number of training points), whose computational requirements scale with the third power of n. This computational issue is a topic of current interest to researchers in kernel-based methods (e.g., [32]), and we also intend to focus on it. Another well-known limitation of kernel methods is the need to adjust the kernel parameter(s) (e.g., the Gaussian kernel width in (22)). Of course, the results of...

15 | On different facets of regularization theory - Chen, Haykin |

Citation Context: ...simple function may not be rich enough to capture the true underlying relationship (underfitting). This is a well-known problem which has been addressed with a variety of formal tools (see, e.g., [3], [4], [5], [6], [7], and references therein). 2 OVERVIEW OF METHODS AND THE PROPOSED APPROACH 2.1 Discriminative versus Generative Learning Supervised learning can be formulated using either a genera...

14 | The Relevance Vector Machine - Tipping - 1999 |

Citation Context: ...lated Approaches Our method is formally and conceptually related to the automatic relevance determination (ARD) concept [21], [5], which underlies the recently proposed relevance vector machine (RVM) [22], [23]. The RVM exhibits state-of-the-art performance: It beats SVMs, both in terms of accuracy and sparseness [22], [23]. However, our approach does not rely on a type-II maximum likelihood approxima...

14 | Ideal Spatial Adaptation by Wavelet Shrinkage - Donoho, Johnstone - 1994 |

Citation Context: ...subtracting a constant (equal to the threshold) from the absolute value of (Hᵀy)_i. This rule is called the soft threshold (see Fig. 1), and is widely used in wavelet-based signal/image estimation [24], [25]. 3.3 A Hierarchical-Bayes View of the Laplacian Prior Let us now consider that each β_i has a zero-mean Gaussian prior p(β_i|τ_i) = N(β_i|0, τ_i), with its own variance τ_i, and that each τ_i has an exponen...

6 | Bayesian Classification with Gaussian Priors - Williams, Barber - 1998 |

Citation Context: ...an prior, which appears under different guises, like ridge regression [11], or weight decay [5], [6]. Gaussian priors are also at the heart of the nonparametric Gaussian processes approach [5], [10], [12], which has roots in earlier spline models [13] and regularized radial basis functions [14]. The main disadvantage of Gaussian priors is that they do not control the structural complexity of the learn...

3 | On gaussian radial basis function approximations: Interpretation, extensions, and learning strategies - Figueiredo |

Citation Context: ...tional densities, p(x|y), and the probability of each class, p(y) [6]. In regression, this can be done, for example, by representing the joint density using a kernel method or a Gaussian mixture (see [8] and references therein). From this joint probability function estimate, optimal Bayesian decision rules can be derived by the standard Bayesian decision theory machinery [6]. In the discriminative ap...