Sparse multinomial logistic regression: fast algorithms and generalization bounds
 IEEE Trans. on Pattern Analysis and Machine Intelligence
Cited by 113 (1 self)
Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
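As a rough illustration of how a Laplacian (L1) prior can drive weights to exactly zero, the sketch below fits a binary logistic regression by proximal gradient descent with soft-thresholding. This is a swapped-in textbook technique, not the bound-optimization and component-wise update algorithm the abstract describes, and it covers only the binary case. The function names, the toy data, and the `penalty` and `step` values are all assumptions chosen for illustration.

```python
import math


def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside [-threshold, threshold] becomes 0.0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_sparse_logistic(features, labels, penalty=0.1, step=0.5, iterations=500):
    """Minimize mean logistic loss + penalty * sum(|w_j|) by proximal gradient steps."""
    n = len(features)
    d = len(features[0])
    weights = [0.0] * d
    for _ in range(iterations):
        # Gradient of the mean logistic loss at the current weights.
        gradient = [0.0] * d
        for x, y in zip(features, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
            for j in range(d):
                gradient[j] += error * x[j] / n
        # Gradient step followed by the L1 proximal map (soft-thresholding).
        weights = [
            soft_threshold(w - step * g, step * penalty)
            for w, g in zip(weights, gradient)
        ]
    return weights


# Hypothetical toy data: the first feature separates the classes,
# the second is only weakly related to the label.
features = [[2.0, 0.2], [2.0, 0.0], [-2.0, 0.0], [-2.0, -0.1]]
labels = [1, 1, 0, 0]
weights = fit_sparse_logistic(features, labels)
```

On this toy data the weight of the weak second feature stays exactly zero while the first weight becomes positive, which is the "significantly large or exactly zero" behavior the abstract refers to.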
Adaptive Sparseness for Supervised Learning
 IEEE Transactions on Pattern Analysis and Machine Intelligence
, 2003
Cited by 80 (4 self)
The goal of supervised learning is to infer a functional mapping based on a set of training examples. To achieve good generalization, it is necessary to control the "complexity" of the learned function. In Bayesian approaches, this is done by adopting a prior for the parameters of the function being learned. We propose a Bayesian approach to supervised learning, which leads to sparse solutions; that is, in which irrelevant parameters are automatically set exactly to zero. Other ways to obtain sparse classifiers (such as Laplacian priors, support vector machines) involve (hyper)parameters which control the degree of sparseness of the resulting classifiers; these parameters have to be somehow adjusted/estimated from the training data. In contrast, our approach does not involve any (hyper)parameters to be adjusted or estimated. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, which is then modified by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an expectation-maximization (EM) algorithm. Experiments with several benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms SVMs and performs competitively with the best alternative techniques, although it involves no tuning or adjustment of sparseness-controlling hyperparameters.
Constructive Algorithms for Structure Learning in Feedforward Neural Networks for Regression Problems
 IEEE Transactions on Neural Networks
, 1997
Cited by 66 (2 self)
In this survey paper, we review the constructive algorithms for structure learning in feedforward neural networks for regression problems. The basic idea is to start with a small network, then add hidden units and weights incrementally until a satisfactory solution is found. By formulating the whole problem as a state space search, we first describe the general issues in constructive algorithms, with special emphasis on the search strategy. A taxonomy, based on the differences in the state transition mapping, the training algorithm and the network architecture, is then presented. Keywords: constructive algorithm, structure learning, state space search, dynamic node creation, projection pursuit regression, cascade-correlation, resource-allocating network, group method of data handling. I. Introduction. A. Problems with Fixed Size Networks. In recent years, many neural network models have been proposed for pattern classification, function approximation and regression problems. Among...
Exponential Priors for Maximum Entropy Models
 In Proceedings of the Annual Meeting of the Association for Computational Linguistics
, 2003
Cited by 57 (0 self)
this paper. Finally, thanks to Stan Chen and Roni Rosenfeld: our derivation for Exponential priors closely follows the text of their derivation for Gaussian priors.
Performance guarantees for regularized maximum entropy density estimation
 Proceedings of the 17th Annual Conference on Computational Learning Theory
, 2004
Cited by 53 (8 self)
We consider the problem of estimating an unknown probability distribution from samples using the principle of maximum entropy (maxent). To alleviate overfitting with a very large number of features, we propose applying the maxent principle with relaxed constraints on the expectations of the features. By convex duality, this turns out to be equivalent to finding the Gibbs distribution minimizing a regularized version of the empirical log loss. We prove nonasymptotic bounds showing that, with respect to the true underlying distribution, this relaxed version of maxent produces density estimates that are almost as good as the best possible. These bounds are in terms of the deviation of the feature empirical averages relative to their true expectations, a number that can be bounded using standard uniform-convergence techniques. In particular, this leads to bounds that drop quickly with the number of samples, and that depend very moderately on the number or complexity of the features. We also derive and prove convergence for both sequential-update and parallel-update algorithms. Finally, we briefly describe experiments on data relevant to the modeling of species geographical distributions.
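The duality mentioned in this abstract can be written out for one common instance, box constraints on the feature expectations. The notation here is an assumption and is not taken from the abstract: features $f_j$, empirical averages $\tilde{\mathbb{E}}[f_j]$ over samples $x_1, \dots, x_m$, relaxation widths $\beta_j \ge 0$, and Gibbs distributions $q_\lambda$.

```latex
% Primal: maximum entropy with relaxed (box) constraints
\max_{p} \; H(p)
\quad \text{subject to} \quad
\bigl| \mathbb{E}_{p}[f_j] - \tilde{\mathbb{E}}[f_j] \bigr| \le \beta_j
\quad \text{for all } j

% Dual: regularized empirical log loss over Gibbs distributions
\min_{\lambda} \; -\frac{1}{m} \sum_{i=1}^{m} \ln q_\lambda(x_i)
  + \sum_{j} \beta_j \, |\lambda_j|,
\qquad
q_\lambda(x) = \frac{\exp\bigl(\sum_j \lambda_j f_j(x)\bigr)}{Z_\lambda}
```

Under these assumptions, a wider constraint $\beta_j$ corresponds to a heavier penalty on the weight $\lambda_j$, so loosening the constraint on a feature pushes its weight toward zero.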
Adaptive Sparseness Using Jeffreys Prior
, 2001
Cited by 38 (2 self)
In this paper we introduce a new sparseness-inducing prior which does not involve any (hyper)parameters that need to be adjusted or estimated. Although other applications are possible, we focus here on supervised learning problems: regression and classification. Experiments with several publicly available benchmark data sets show that the proposed approach yields state-of-the-art performance. In particular, our method outperforms support vector machines and performs competitively with the best alternative techniques, both in terms of error rates and sparseness, although it involves no tuning or adjusting of sparseness-controlling hyperparameters.
A least-squares approach to direct importance estimation
 Journal of Machine Learning Research
, 2009
Cited by 36 (24 self)
We address the problem of estimating the ratio of two probability density functions, which is often referred to as the importance. The importance values can be used for various succeeding tasks such as covariate shift adaptation or outlier detection. In this paper, we propose a new importance estimation method that has a closed-form solution; the leave-one-out cross-validation score can also be computed analytically. Therefore, the proposed method is computationally highly efficient and simple to implement. We also elucidate theoretical properties of the proposed method such as the convergence rate and approximation error bounds. Numerical experiments show that the proposed method is comparable to the best existing method in accuracy, while it is computationally more efficient than competing approaches.
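To make the "closed-form solution" concrete, here is a sketch of how a least-squares fit of a density ratio can reduce to a linear system. The abstract does not give the formulas, so the notation is assumed: a linear model $\hat{w}(x) = \sum_{l} \alpha_l \varphi_l(x)$ with basis functions $\varphi_l$, numerator samples $x^{\mathrm{nu}}_i$, denominator samples $x^{\mathrm{de}}_j$, and a ridge parameter $\lambda > 0$.

```latex
\hat{H} = \frac{1}{n_{\mathrm{de}}} \sum_{j=1}^{n_{\mathrm{de}}}
  \varphi(x^{\mathrm{de}}_j) \, \varphi(x^{\mathrm{de}}_j)^{\top},
\qquad
\hat{h} = \frac{1}{n_{\mathrm{nu}}} \sum_{i=1}^{n_{\mathrm{nu}}}
  \varphi(x^{\mathrm{nu}}_i)

\hat{\alpha}
  = \arg\min_{\alpha} \Bigl[ \tfrac{1}{2} \alpha^{\top} \hat{H} \alpha
      - \hat{h}^{\top} \alpha + \tfrac{\lambda}{2} \alpha^{\top} \alpha \Bigr]
  = \bigl( \hat{H} + \lambda I \bigr)^{-1} \hat{h}
```

Because the objective is an unconstrained quadratic in $\alpha$, setting its gradient to zero gives the linear system directly, with no iterative optimization needed.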
Bayesian learning of sparse classifiers
 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), Hawaii
, 2001
Cited by 24 (2 self)
Bayesian approaches to supervised learning use priors on the classifier parameters. However, few priors aim at achieving “sparse” classifiers, where irrelevant/redundant parameters are automatically set to zero. Two well-known ways of obtaining sparse classifiers are: use a zero-mean Laplacian prior on the parameters, and the “support vector machine” (SVM). Whether one uses a Laplacian prior or an SVM, one still needs to specify/estimate the parameters that control the degree of sparseness of the resulting classifiers. We propose a Bayesian approach to learning sparse classifiers which does not involve any parameters controlling the degree of sparseness. This is achieved by a hierarchical-Bayes interpretation of the Laplacian prior, followed by the adoption of a Jeffreys' noninformative hyperprior. Implementation is carried out by an EM algorithm. Experimental evaluation of the proposed method shows that it performs competitively with (often better than) the best classification techniques available.
Maximum Entropy Density Estimation with Generalized Regularization and an Application to Species Distribution Modeling
Cited by 20 (1 self)
We present a unified and complete account of maximum entropy density estimation subject to constraints represented by convex potential functions or, alternatively, by convex regularization. We provide fully general performance guarantees and an algorithm with a complete convergence proof. As special cases, we easily derive performance guarantees for many known regularization types, including $\ell_1$, $\ell_2$, $\ell_2^2$, and $\ell_1 + \ell_2^2$ style regularization. We propose an algorithm solving a large and general subclass of generalized maximum entropy problems, including all discussed in the paper, and prove its convergence. Our approach generalizes and unifies techniques based on information geometry and Bregman divergences as well as those based more directly on compactness. Our work is motivated by a novel application of maximum entropy to species distribution modeling, an important problem in conservation biology and ecology. In a set of experiments on real-world data, we demonstrate the utility of maximum entropy in this setting. We explore effects of different feature types, sample sizes, and regularization levels on the performance of maxent, and discuss interpretability of the resulting models.
Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation
Cited by 17 (2 self)
Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time-consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use “default settings”, tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence-only data. We evaluate our method on independently collected high-quality presence-absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce “hinge features” that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore “background sampling” strategies that cope with sample selection bias and decrease model-building time. Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence-only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model
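The abstract does not define hinge features, so the following is only a sketch under an assumed definition: a forward hinge that is zero below a knot and then rises linearly to one at the variable's maximum. The names `hinge_feature` and `piecewise_linear_response`, and the example knots and weights, are hypothetical and are not taken from the Maxent software.

```python
def hinge_feature(value, knot, maximum):
    """Assumed forward hinge: 0 up to the knot, then linear, reaching 1 at maximum."""
    if maximum <= knot:
        raise ValueError("maximum must be greater than knot")
    return max(0.0, (value - knot) / (maximum - knot))


def piecewise_linear_response(value, knots, weights, maximum):
    """Weighted sum of hinge features: a piecewise-linear function of value."""
    return sum(
        weight * hinge_feature(value, knot, maximum)
        for knot, weight in zip(knots, weights)
    )


# Hypothetical example: an environmental variable ranging up to 20,
# with hinges at 0 and 10 that together form a rise-then-fall response.
response = piecewise_linear_response(15.0, knots=[0.0, 10.0], weights=[1.0, -2.0], maximum=20.0)
```

A weighted sum of such hinges yields a piecewise-linear response curve, which is one way a model can capture relationships more complex than a single linear term.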