## Bayesian learning of sparse classifiers (2001)

### Download Links

- [www.lx.it.pt]
- [red.lx.it.pt]
- [www.tsi.enst.fr]
- [perso.telecom-paristech.fr]
- DBLP

### Other Repositories/Bibliography

Venue: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Hawaii

Citations: 24 (2 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Figueiredo01bayesianlearning,
  author    = {Mário A. T. Figueiredo and Anil K. Jain},
  title     = {Bayesian learning of sparse classifiers},
  booktitle = {IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001)},
  address   = {Hawaii},
  year      = {2001},
  pages     = {35--41}
}
```

### Abstract

Bayesian approaches to supervised learning use priors on the classifier parameters. However, few priors aim at achieving “sparse” classifiers, where irrelevant/redundant parameters are automatically set to zero. Two well-known ways of obtaining sparse classifiers are: using a zero-mean Laplacian prior on the parameters, and the “support vector machine” (SVM). Whether one uses a Laplacian prior or an SVM, one still needs to specify/estimate the parameters that control the degree of sparseness of the resulting classifiers. We propose a Bayesian approach to learning sparse classifiers which does not involve any parameters controlling the degree of sparseness. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, followed by the adoption of a Jeffreys’ non-informative hyper-prior. Implementation is carried out by an EM algorithm. Experimental evaluation of the proposed method shows that it performs competitively with (often better than) the best classification techniques available.
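The hierarchical construction the abstract refers to can be made explicit. Assuming (the notation here is mine) that each parameter $\beta_i$ receives a zero-mean Gaussian prior with its own variance $\tau_i$, and that the variance receives the Jeffreys non-informative hyper-prior $p(\tau_i) \propto 1/\tau_i$, integrating the hyper-parameter out yields a parameter-free sparseness-inducing prior:

```latex
p(\beta_i) = \int_0^{\infty} \mathcal{N}(\beta_i \mid 0, \tau_i)\,\frac{1}{\tau_i}\,\mathrm{d}\tau_i
           = \int_0^{\infty} \frac{1}{\sqrt{2\pi\tau_i}}\,
             \exp\!\left(-\frac{\beta_i^2}{2\tau_i}\right)\frac{\mathrm{d}\tau_i}{\tau_i}
           \propto \frac{1}{|\beta_i|}
```

(the substitution $u = \beta_i^2/(2\tau_i)$ reduces the integral to $\Gamma(1/2)/\sqrt{\pi\beta_i^2}$). By contrast, an exponential hyper-prior on $\tau_i$ recovers the Laplacian prior together with its free sparseness parameter, which is exactly what the Jeffreys choice removes.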

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning is to avoid overfitting the training data. In other words, to achieve good ge...

4828 | Neural Networks for Pattern Recognition
- Bishop
- 1995
Citation Context: ...r is directly learned from the data. Well known discriminative techniques include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning...

3909 | Classification and Regression Trees
- Breiman, Friedman, et al.
- 1984
Citation Context: ...explicitly modelled; the classifier is directly learned from the data. Well known discriminative techniques include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitt...

1832 | Regression shrinkage and selection via the lasso
- Tibshirani
- 1996
Citation Context: ...‖β‖₁}, where ‖·‖₁ denotes the l1 norm. The sparseness-inducing nature of the Laplacian prior (or equivalently, of the l1 penalty) is well known and has been exploited in several research areas [6, 16, 23, 30]. When using a Laplacian prior on β, the question remains of how to adjust or estimate the parameter which ultimately controls the degree of sparseness of the obtained estimates. Concerning the SVM, ...
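The sparseness-inducing effect of the l1 penalty mentioned in this context is easiest to see in the scalar (orthonormal) case, where the MAP estimate under a Laplacian prior has a closed form: the soft-threshold rule. A minimal sketch, not taken from the paper (`lam` denotes the penalty weight, a name of my choosing):

```python
def soft_threshold(z, lam):
    """MAP estimate of a scalar mean under unit-variance Gaussian noise
    and a Laplacian prior: argmin_b 0.5*(z - b)**2 + lam*abs(b).
    Components smaller than lam in magnitude are set exactly to zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

# Small coefficients are zeroed out; large ones are shrunk toward zero.
print([soft_threshold(z, 1.0) for z in [3.0, 0.5, -0.2, -2.0]])
# → [2.0, 0.0, 0.0, -1.0]
```

The hard zeroing of small components (rather than mere shrinkage, as with a Gaussian prior) is precisely what makes the Laplacian prior produce sparse classifiers.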

1652 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 2001
Citation Context: ...‖β‖₁}, where ‖·‖₁ denotes the l1 norm. The sparseness-inducing nature of the Laplacian prior (or equivalently, of the l1 penalty) is well known and has been exploited in several research areas [6, 16, 23, 30]. When using a Laplacian prior on β, the question remains of how to adjust or estimate the parameter which ultimately controls the degree of sparseness of the obtained estimates. Concerning the SVM, ...

1552 | An Introduction to Support Vector Machines and Other Kernel-based Learning Methods
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning is to avoid overfitting the training data. In other words, to achieve good ge...

1363 | Generalized linear models
- McCullagh, Nelder
- 1990
Citation Context: ...interested in learning a function g(x; β) taking values in [0, 1] (rather than just {0, 1}) which can be interpreted as the probability that x belongs to, say, class 1. In logistic (linear) regression [18], P(y = 1 | x) = g(x; β) = σ(β₀ + Σᵢ βᵢ xᵢ), where σ(z) = (1 + exp(−z))⁻¹ (1) is called the logistic function (see Fig. 1). The function σ(·) yielding the class probabilities is known as the link. An adv...
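The logistic link σ(z) = (1 + exp(−z))⁻¹ from the context above is simple to compute directly; a small sketch (function and variable names are mine, not the paper's):

```python
import math

def logistic(z):
    """sigma(z) = 1 / (1 + exp(-z)), mapping a real score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_posterior(x, beta0, beta):
    """P(y = 1 | x) for linear logistic regression:
    sigma(beta0 + sum_i beta_i * x_i)."""
    return logistic(beta0 + sum(b * xi for b, xi in zip(beta, x)))

print(logistic(0.0))  # → 0.5: a zero score means maximal class uncertainty
```

The symmetry σ(z) + σ(−z) = 1 is what makes the same function serve as the probability of either class.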

1273 | Spline models for observational data
- Wahba
- 1990
Citation Context: ...nown as weight decay [3, 19, 21]. Gaussian priors are also used in non-parametric contexts, like the Gaussian processes (GP) approach [8, 19, 27, 28], which has roots in earlier work on spline models [14, 26] and regularized radial basis function (RBF) approximations [20]. The main disadvantage of Gaussian priors is that they do not explicitly control the structural complexity of the classifiers. That is, ...

1240 | Statistical Decision Theory and Bayesian Analysis (Springer Series in Statistics)
- Berger
- 1985
Citation Context: ...priors with an exponential hyper-prior for the variance. 3. Replacement of the exponential hyper-prior by the Jeffreys' prior, which expresses scale-invariance and, more importantly, is parameter-free [2]. 4. An expectation-maximization (EM) algorithm which yields a maximum a posteriori estimate of β. Experimental evaluation of the proposed method shows that it performs competitively with (often better...

1113 | Pattern Recognition and Neural Networks
- Ripley
- 1996
Citation Context: ...r is directly learned from the data. Well known discriminative techniques include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning...

667 | Statistical pattern recognition: a review
- Jain, Duin, et al.
- 2000
Citation Context: ...from the training data; then a Bayes classifier is obtained by inserting (plugging in) these class-conditional probability functions and the a priori class probabilities into the Bayes decision rule [13]. In discriminative learning, the class-conditional densities are not explicitly modelled; the classifier is directly learned from the data. Well known discriminative techniques include linear and logist...

639 | Networks for approximation and learning
- Poggio, Girosi
- 1990
Citation Context: ...on-parametric contexts, like the Gaussian processes (GP) approach [8, 19, 27, 28], which has roots in earlier work on spline models [14, 26] and regularized radial basis function (RBF) approximations [20]. The main disadvantage of Gaussian priors is that they do not explicitly control the structural complexity of the classifiers. That is, if one of the components of β (say, the weight of a given feature...

608 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...r is directly learned from the data. Well known discriminative techniques include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning...

452 | Bayesian analysis of binary and polychotomous response data
- Albert, Chib
- 1993
Citation Context: ...desirable, not simply to classify x into one of the classes, but to know the degree of confidence of that classification. In that case we are interested in learning a function g(x; β) taking values in [0, 1] (rather than just {0, 1}) which can be interpreted as the probability that x belongs to, say, class 1. In logistic (linear) regression [18], P(y = 1 | x) = g(x; β) = σ(β₀ + ...

384 | Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...lem, z = Hβ + w, (3) where H is the design matrix H = [h(x⁽¹⁾)ᵀ, ..., h(x⁽ⁿ⁾)ᵀ]ᵀ, (4) and w a vector of i.i.d. zero-mean unit-variance Gaussian samples. This suggests using the EM algorithm [10] to find a maximum a posteriori (MAP) estimate of β. The EM algorithm produces a sequence of estimates β̂(t), for t = 0, 1, 2, ..., by alternating between two steps: E-step: Compute the expected value...
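For a linear-Gaussian observation model like z = Hβ + w above, one classic way an EM-style alternation yields a MAP estimate under a Laplacian prior is iteratively reweighted ridge regression: each iteration solves a ridge problem whose per-component weights come from 1/|βᵢ| at the previous iterate. The toy sketch below handles only the orthonormal case H = I, and is not the paper's algorithm (whose hyper-prior is Jeffreys' rather than exponential); it illustrates the alternation pattern under those stated assumptions:

```python
def irls_lasso_diag(z, lam, iters=300, eps=1e-12):
    """Iteratively reweighted ridge for argmin_b 0.5*(z_i - b_i)**2 + lam*|b_i|
    in the H = I case. Each sweep applies the closed-form ridge update
    b_i <- z_i * |b_i| / (|b_i| + lam), i.e. a quadratic penalty with
    weight lam / |b_i| taken from the previous iterate."""
    b = list(z)  # initialize at the unpenalized (least-squares) solution
    for _ in range(iters):
        b = [zi * abs(bi) / (abs(bi) + lam + eps) for zi, bi in zip(z, b)]
    return b

# Converges toward the soft-threshold solution: large entries are shrunk
# by lam, small entries decay geometrically toward exact zero.
print(irls_lasso_diag([3.0, 0.5, -2.0], 1.0))
```

Each sweep is a ridge solve, so every step is cheap; the l1 behavior emerges only through the reweighting across iterations.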

283 | Using the Nyström method to speed up kernel machines
- Williams, Seeger
- 2001

277 | Multivariate Statistical Modelling Based on Generalized Linear Models
- Fahrmeir, Tutz
- 2001
Citation Context: ...mean m and (co)variance C. The re-scaled probit is plotted in Fig. 1, together with the logistic function, showing that (apart from a scale factor) they are almost indistinguishable [11]. Of course, both the logistic and probit functions can be re-scaled (horizontally), but this scale is implicitly absorbed by β. [Figure 1: the logistic function and a re-scaled probit, plotted for z from −5 to 5.] The...
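The near-indistinguishability of the logistic and re-scaled probit links that Figure 1 illustrated can be checked numerically. The scaling constant 1.702 below is the classical value from the item-response-theory literature (not from this paper); with it, the two curves agree to within about 0.01 everywhere:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def probit(z):
    """Standard Gaussian cdf, computed via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Maximum gap |sigma(z) - Phi(z / 1.702)| over a dense grid: roughly 0.0095.
gap = max(abs(logistic(z) - probit(z / 1.702))
          for z in [i * 0.01 for i in range(-800, 801)])
print(round(gap, 3))
```

This is why results derived for the probit link often transfer, up to a horizontal rescaling, to the logistic link and vice versa.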

257 | Learning overcomplete representations
- Lewicki, Sejnowski
Citation Context: ...‖β‖₁}, where ‖·‖₁ denotes the l1 norm. The sparseness-inducing nature of the Laplacian prior (or equivalently, of the l1 penalty) is well known and has been exploited in several research areas [6, 16, 23, 30]. When using a Laplacian prior on β, the question remains of how to adjust or estimate the parameter which ultimately controls the degree of sparseness of the obtained estimates. Concerning the SVM, ...

233 | Learning from Data: Concepts, Theory, and Methods
- Cherkassky, Mulier
- 1998
Citation Context: ...on the other hand, may not be able to capture the main behavior of the underlying relationship (under-fitting). This well-known trade-off has been addressed with a variety of formal tools (see, e.g., [3, 7, 8, 19, 21, 25]). 1.4. Bayesian discriminative learning. The Bayesian approach to controlling the complexity in discriminative supervised learning is to place a prior on the function to be learned (i.e., on β) favorin...

215 | The Relevance Vector Machine
- Tipping
- 2000
Citation Context: ...the best classification techniques available. Our method is related to the automatic relevance determination (ARD) idea [19, 17], which underlies the recently proposed relevance vector machine (RVM) [4, 24]. The RVM exhibits state-of-the-art performance, beating SVM both in terms of accuracy and sparseness [4, 24]. However, rather than using a type-II maximum likelihood approximation [2] (as in ARD and ...

195 | Prediction with Gaussian processes: From linear regression to linear prediction and beyond
- 1999
Citation Context: ...include linear and logistic discrimination, k-nearest neighbor classifiers, tree classifiers [5], feedforward neural networks [3, 19, 21], support vector machines (SVM) and other kernel-based methods [8, 25, 27, 28]. This paper focuses on discriminative learning. 1.3. Over-fitting and under-fitting. A main concern in supervised learning is to avoid overfitting the training data. In other words, to achieve good ge...

132 | A correspondence between Bayesian estimation on stochastic processes and smoothing by splines
- Kimeldorf, Wahba
- 1970
Citation Context: ...nown as weight decay [3, 19, 21]. Gaussian priors are also used in non-parametric contexts, like the Gaussian processes (GP) approach [8, 19, 27, 28], which has roots in earlier work on spline models [14, 26] and regularized radial basis function (RBF) approximations [20]. The main disadvantage of Gaussian priors is that they do not explicitly control the structural complexity of the classifiers. That is, ...

84 | Bayesian regularization and pruning using a Laplace prior
- Williams
- 1995

51 | Wavelet-based image estimation: an empirical bayes approach using Jeffreys’ noninformative prior
- Figueiredo, Nowak
- 2001
Citation Context: ...n Eq. (10) by a non-informative Jeffreys hyper-prior: p(τᵢ) = 1/τᵢ. The Jeffreys prior expresses the notion of ignorance/invariance, in this case with respect to changes in measurement scale (see [2, 12]). Of course, we no longer have the Laplacian prior on β, but some other prior resulting from the adoption of the Jeffreys hyper-prior. It turns out that this new hyper-prior leads to a minor modificat...

46 | Bayesian model selection for support vector machines, Gaussian processes and other kernel classifiers
- Seeger
- 1999
Citation Context: ...ere obtained by averaging over 30 random partitions with 300 training samples and 269 test samples (as in [22]). (Footnote: Available at the Machine Learning Repository: http://www.ics.uci.edu/mlearn/MLSummary.html.) Prior to applying our algorithm, all the inputs are normalized to zero mean and unit variance, as is customary in kernel-based methods. The kernel width was set to h = 4, for the Pima and crabs pro...

21 | Normal/Independent Distributions and Their Applications in Robust Regression
- Lange, Sinsheimer
- 1993
Citation Context: ...is the Gaussian cumulative distribution function (cdf) [18]. 2. A hierarchical-Bayes interpretation of the Laplacian prior as a normal/independent distribution (as has been used in robust regression [15]). More specifically, a Laplacian prior can be decomposed into a continuous mixture of zero-mean Gaussian priors with an exponential hyper-prior for the variance. 3. Replacement of the exponential hyp...
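The normal/independent (Gaussian scale mixture) decomposition in this context can be verified by simulation: drawing a variance from an exponential density and then a zero-mean Gaussian with that variance yields exactly Laplacian draws. A sketch under my own parametrization choice: with rate λ²/2 on the variance, the resulting Laplacian has density (λ/2)·exp(−λ|β|) and hence E|β| = 1/λ:

```python
import math
import random

random.seed(0)
lam = 1.0
n = 200_000

samples = []
for _ in range(n):
    # variance ~ Exponential with density (lam^2/2) * exp(-(lam^2/2) * v)
    v = random.expovariate(lam ** 2 / 2.0)
    # beta | v ~ N(0, v): a continuous mixture of zero-mean Gaussians
    samples.append(random.gauss(0.0, math.sqrt(v)))

# For a Laplacian with density (lam/2) * exp(-lam * |b|), E|b| = 1/lam.
mean_abs = sum(abs(b) for b in samples) / n
print(round(mean_abs, 2))
```

The mixture representation is what makes the EM treatment possible: conditioned on the variances, the prior is Gaussian and the M-step is a tractable quadratic problem.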

20 | Bayesian non-linear modelling for the 1993 energy prediction competition
- MacKay
- 1995
Citation Context: ...e proposed method shows that it performs competitively with (often better than) the best classification techniques available. Our method is related to the automatic relevance determination (ARD) idea [19, 17], which underlies the recently proposed relevance vector machine (RVM) [4, 24]. The RVM exhibits state-of-the-art performance, beating SVM both in terms of accuracy and sparseness [4, 24]. However, ra...

6 | Bayesian Classification with Gaussian Priors
- Williams, Barber
- 1998

5 | Discriminative versus informative learning
- Rubinstein, Hastie
- 1997