## Sparse multinomial logistic regression: Fast algorithms and generalization bounds (2005)

### Download Links

- [www.cs.duke.edu]
- [www.lx.it.pt]
- [www.stat.columbia.edu]
- DBLP

Venue: IEEE Transactions on Pattern Analysis and Machine Intelligence

Citations: 113 (1 self)

### BibTeX

@ARTICLE{Krishnapuram05sparsemultinomial,
  author  = {Balaji Krishnapuram and Lawrence Carin and Mário A. T. Figueiredo and Alexander J. Hartemink},
  title   = {Sparse multinomial logistic regression: Fast algorithms and generalization bounds},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year    = {2005},
  volume  = {27},
  pages   = {957--968}
}

### Abstract

Recently developed methods for learning sparse classifiers are among the state-of-the-art in supervised learning. These methods learn classifiers that incorporate weighted sums of basis functions with sparsity-promoting priors encouraging the weight estimates to be either significantly large or exactly zero. From a learning-theoretic perspective, these methods control the capacity of the learned classifier by minimizing the number of basis functions used, resulting in better generalization. This paper presents three contributions related to learning sparse classifiers. First, we introduce a true multiclass formulation based on multinomial logistic regression. Second, by combining a bound optimization approach with a component-wise update procedure, we derive fast exact algorithms for learning sparse multiclass classifiers that scale favorably in both the number of training samples and the feature dimensionality, making them applicable even to large data sets in high-dimensional feature spaces. To the best of our knowledge, these are the first algorithms to perform exact multinomial logistic regression with a sparsity-promoting prior. Third, we show how nontrivial generalization bounds can be derived for our classifier in the binary case. Experimental results on standard benchmark data sets attest to the accuracy, sparsity, and efficiency of the proposed methods.
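The recipe the abstract describes, a multinomial logistic likelihood combined with a sparsity-promoting (Laplacian/l1) prior, can be illustrated with a plain proximal-gradient sketch. This is our own minimal illustration, not the paper's bound-optimization algorithm; the function names, fixed step size, and hyperparameters are all assumptions:

```python
import numpy as np

def soft_threshold(a, delta):
    # soft(a, delta) = sign(a) * max(0, |a| - delta); clamps small entries to exactly 0
    return np.sign(a) * np.maximum(0.0, np.abs(a) - delta)

def smlr_fit(X, y, lam=0.1, lr=0.1, n_iter=500):
    """l1-penalized multinomial logistic regression via proximal gradient.

    X: (n, d) design matrix; y: (n,) integer class labels in 0..m-1.
    Returns a (d, m) weight matrix in which many entries are driven
    exactly to zero by the soft-threshold (proximal) step.
    """
    n, d = X.shape
    m = int(y.max()) + 1
    Y = np.eye(m)[y]                          # one-hot labels, (n, m)
    W = np.zeros((d, m))
    for _ in range(n_iter):
        Z = X @ W
        Z -= Z.max(axis=1, keepdims=True)     # numerical stability
        P = np.exp(Z)
        P /= P.sum(axis=1, keepdims=True)     # softmax probabilities
        grad = X.T @ (P - Y) / n              # gradient of avg. neg. log-likelihood
        W = soft_threshold(W - lr * grad, lr * lam)   # l1 proximal step
    return W
```

The paper's algorithms replace the fixed step size with a bound on the Hessian, which yields monotone updates and an explicit basis-function inclusion/exclusion criterion; the sketch above only captures the shape of the computation.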

### Citations

8980 | Statistical Learning Theory
- Vapnik
- 1998
Citation Context: ...linear transformation of those features, or even kernels centered on the training samples. In this latter case, the learned classifier will be similar in flavor to a support vector machine (SVM) [5], [38], although, in contrast to an SVM, the kernel is not required to satisfy the Mercer condition...

2026 | A Wavelet Tour of Signal Processing
- Mallat
- 1999
Citation Context: ...sign(a) max{0, |a| − λ} is the soft threshold function, well-known in the wavelets literature [24]. The weight update equation (15) provides an explicit criterion for whether or not to include each basis function in the classifier, in a similar vein to the criterion derived in [36]. In the case of...
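The soft threshold function named in this excerpt has a one-line implementation. This small demo is our own illustration (the name `soft` and the threshold value are assumptions); the point is that entries whose magnitude falls below the threshold are mapped to exactly zero, which is what makes the resulting classifiers sparse:

```python
import numpy as np

def soft(a, delta):
    # soft threshold: shrink a toward zero by delta, clamping to exactly 0
    return np.sign(a) * np.maximum(0.0, np.abs(a) - delta)

a = np.array([-1.5, -0.2, 0.05, 0.7])
out = soft(a, 0.3)
# large entries shrink by 0.3; entries within the threshold become exactly 0.0
```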

1832 | Regression shrinkage and selection via the lasso - Tibshirani - 1994 |

1694 | A Theory of the Learnable
- Valiant
- 1984
Citation Context: ...(1/n) Σ_{i=1}^{n} l(y_i, f(x_i)) (18). One of the key goals of learning theory is to obtain upper bounds on R_true[f] that hold uniformly for any P(x, y). In the probably approximately correct (PAC) framework [37], such bounds are of the form P(D : R_true[f] ≤ bound(f, D, δ, n)) ≥ 1 − δ (19), where the probability is over the random draw of a training set D consisting of n i.i.d. samples from P(x, y). In other words, for...

1651 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1999
Citation Context: ...nature of the Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is that it is the most heavy-tailed density that is still log-concave (though not strictly so); thus, when combined with a concave log...

943 | An Introduction to Support Vector Machines
- Cristianini, Shawe-Taylor
- 2000
Citation Context: ...nonlinear transformation of those features, or even kernels centered on the training samples. In this latter case, the learned classifier will be similar in flavor to a support vector machine (SVM) [5], [38], although, in contrast to an SVM, the kernel is not required to satisfy the Mercer condition...

927 | Emergence of Simple-Cell Receptive Field Properties by Learning a Sparse Code for Natural Images
- Olshausen, Field
- 1996
Citation Context: ...Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is that it is the most heavy-tailed density that is still log-concave (though not strictly so); thus, when combined with a concave log-likelihood,...

608 | Bayesian Learning for Neural Networks
- Neal
- 1996
Citation Context: ...is typically regularized by some prior belief about the weights that promotes their sparsity. The prior is sometimes implicit, as is the case with the automatic relevance determination (ARD) framework [28] exploited by the RVM, but is often explicit (as in, e.g., [11], [17], [34]). In the latter case, a common choice of prior in this family of algorithms is the Laplacian, which results in an l1-penalty,...

384 | Maximum Likelihood Estimation from Incomplete Data via the EM Algorithm
- Dempster, Laird, et al.
- 1977
Citation Context: ...(see (4)). The standard expectation-maximization (EM) algorithm for ML estimation with missing data is a special case of this approach, with the key condition being a consequence of Jensen's inequality [8], [18], [19]. As suggested in [7], [19], the bound optimization perspective allows us to derive alternative EM-style algorithms without invoking the concept of missing data. In fact, a monotonic algorithm...

364 | Optimal sparse representation in general (nonorthogonal) dictionaries via l 1 minimization
- DONOHO, ELAD
- 2003
Citation Context: ...algorithms is the Laplacian, which results in an l1-penalty, analogous to the LASSO penalty for regression [34]. The sparsity-promoting nature of the Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian...

257 | Rademacher and Gaussian complexities: Risk bounds and structural results
- Bartlett, Mendelson
Citation Context: ...one has extremely large training sets. Other more recent approaches include compression bounds [13] and minimum description length (MDL) bounds [38], as well as bounds based on Rademacher complexities [1], [26] and PAC-Bayesian bounds [25], [33]. It has often been noted that compression and MDL bounds are not directly applicable to sparse classification algorithms like the RVM (e.g., [15, p. 178]), so...

257 | Learning overcomplete representations
- Lewicki, Sejnowski
Citation Context: ...of the Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is that it is the most heavy-tailed density that is still log-concave (though not strictly so); thus, when combined with a concave log-likelihood...

123 | Fast sparse Gaussian process methods: The informative vector machine
- Lawrence, Seeger, et al.
Citation Context: ...family of algorithms includes the relevance vector machine (RVM) [35], the sparse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifiers that are constructed as weighted linear combinations of basis functions; the weights...

120 | Sparse on-line Gaussian processes
- Csató, Opper
- 2002
Citation Context: ...state-of-the-art in supervised learning. This family of algorithms includes the relevance vector machine (RVM) [35], the sparse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifiers that are constructed as weighted linear combinations...

120 | Optimization transfer using surrogate objective functions (with discussion)
- Lange, Hunter, et al.
- 2000
Citation Context: ...of our approach in order to contrast it with “1-versus-all” and other similar heuristics that are frequently adopted in current practice. Second, by combining a bound optimization approach [7], [18], [19] with a component-wise update procedure, we derive in Section 3 a series of new fast algorithms for learning a sparse multiclass classifier that scale favorably in both the number of training samples...

114 | Use of the zero norm with linear models and kernel methods
- Weston, Elisseeff, et al.
- 2003
Citation Context: ...our attention on Rademacher and PAC-Bayesian bounds. It has been shown that sparsity alone does not guarantee good generalization performance, especially in algorithms that take sparsity to extremes [39], so to rigorously analyze the generalization performance of our SMLR algorithm, we derive two closely related upper bounds on the error rate of binary logistic classifiers with a Laplacian prior. The...

103 | Some PAC-Bayesian theorems
- McAllester
- 1999
Citation Context: ...extremely large training sets. Other more recent approaches include compression bounds [13] and minimum description length (MDL) bounds [38], as well as bounds based on Rademacher complexities [1], [26] and PAC-Bayesian bounds [25], [33]. It has often been noted that compression and MDL bounds are not directly applicable to sparse classification algorithms like the RVM (e.g., [15, p. 178]), so we focus our attention on Rademacher...

85 | A comparison of numerical optimizers for logistic regression (Technical Report). Microsoft Research
- Minka
- 2007
Citation Context: ...accomplished using Newton's method, also known, in this case, as iteratively reweighted least squares (IRLS). Although there are other methods for performing this maximization, none clearly outperforms IRLS [27]. However, when the training data is separable, the function ℓ(w) can be made arbitrarily large, so a prior on w is crucial. This motivates the adoption of a maximum a posteriori (MAP) estimate (or penalized...
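As background for this excerpt, IRLS for binary logistic regression is Newton's method in which each step solves a weighted least-squares system. The sketch below is our own (names and the tiny ridge term are assumptions, with the ridge standing in loosely for the prior the excerpt says is crucial when the data are separable):

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-8):
    """Binary logistic regression by IRLS (Newton's method).

    Each step solves the normal equations (X^T S X + ridge*I) dw = X^T (y - p),
    where S = diag(p * (1 - p)) and p are the current predicted probabilities.
    Without some regularization, separable data sends the weights to infinity.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))        # predicted probabilities
        S = p * (1.0 - p)                       # IRLS weights
        H = X.T @ (S[:, None] * X) + ridge * np.eye(d)
        g = X.T @ (y - p)                       # gradient of the log-likelihood
        w += np.linalg.solve(H, g)              # Newton step
    return w
```

At a fixed point the gradient X^T(y − p) vanishes, which is a convenient convergence check.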

84 | Bayesian regularization and pruning using a Laplace prior
- Williams
- 1995
Citation Context: ...Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is that it is the most heavy-tailed density that is still log-concave (though not strictly so); thus, when combined with a concave log-likelihood, it leads...

81 | Tutorial on practical prediction theory for classification - Langford |

80 | Adaptive sparseness for supervised learning
- Figueiredo
- 2003
Citation Context: ...established themselves among the state-of-the-art in supervised learning. This family of algorithms includes the relevance vector machine (RVM) [35], the sparse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifiers that...

78 | Learning kernel classifiers: theory and algorithms
- Herbrich
- 2002
Citation Context: ...s. For example, in a kernel classifier, the basis functions will be kernels centered at the training samples, meaning that d will equal n regardless of the number of originally observed features [5], [15]. To simplify notation and exposition, we will denote any of these choices simply as x and remind the reader that what follows is equally applicable in the context of linear, nonlinear, or kernel clas...

65 | Fast marginal likelihood maximisation for sparse Bayesian models
- Tipping, Faul
- 2003
Citation Context: ...wavelets literature [24]. The weight update equation (15) provides an explicit criterion for whether or not to include each basis function in the classifier, in a similar vein to the criterion derived in [36]. In the case of a multinomial logistic likelihood, B can be precomputed according to (8), so the computational cost of (15) is essentially that of computing one element of the gradient vector, g(ŵ^(t))...

58 | PAC-Bayes & margins
- Langford, Shawe-Taylor
Citation Context: ...PAC-Bayesian Error Bounds: The PAC-Bayesian formalism has been utilized to provide upper bounds on the generalization error of other classification algorithms. Margin bounds due to Langford and Shawe-Taylor [20] have been used to justify the SVM classifier; the application of the PAC-Bayesian theorem in our paper is similar to their derivation. In another closely related paper, Seeger [33] provides PAC-Bayesian...

41 | Predictive automatic relevance determination by expectation propagation
- Qi, Minka, et al.
Citation Context: ...usually achieved somewhat superior generalization. In recent work, this observation about the RVM has been studied systematically and addressed rather elegantly via approximate cross-validation methods [31]. 6.3 MAP versus Fully Bayesian Classification: Our algorithm learns the maximum a posteriori classifier, which is only a point estimate from the posterior. Thus, in comparison to fully Bayesian classi...

38 | Multinomial logistic regression algorithm
- Böhning
- 1992
Citation Context: ...introducing the notation for this formulation in the next section; strictly speaking, a multinomial logistic regression formulation for multiclass classification is certainly not new (for example, see [2]), but it is rarely employed in the pattern recognition and machine learning literature. Although not common in the current literature, such an approach can be fruitfully extended to many other sparse...

34 | PAC-Bayesian generalization error bounds for gaussian process classification. Informatics report series EDI-INF-RR-0094
- Seeger
- 2002
Citation Context: ...penalty) and called RMLR (for ridge multinomial logistic regression). Third, in Section 4, we derive generalization bounds for our methods based on recently published learning-theoretic results [26], [33]. Similar in nature to the margin bounds that are frequently used to justify the SVM [38], these bounds can be used to provide theoretical insight into and justification for our algorithm. Section 5 c...

30 | Adaptive overrelaxed bound optimization methods
- Salakhutdinov, Roweis
- 2003
Citation Context: ...the component-wise update algorithm; this was true for Crabs and Mines, so we used the noncomponent-wise bound optimization algorithm of Section 3.4 and adopted an adaptive over-relaxation technique [32]. The over-relaxation approach effectively speeds up the block update algorithm and makes it comparable to quasi-Newton methods. In terms of feature selection, because our approach can be formulated di...

27 | Generalization error bounds for Bayesian mixture algorithms
- Meir, Zhang
Citation Context: ...r (l2-penalty) and called RMLR (for ridge multinomial logistic regression). Third, in Section 4, we derive generalization bounds for our methods based on recently published learning-theoretic results [26], [33]. Similar in nature to the margin bounds that are frequently used to justify the SVM [38], these bounds can be used to provide theoretical insight into and justification for our algorithm. Secti...

24 | Bayesian learning of sparse classifiers
- Figueiredo, Jain
- 2001
Citation Context: ...quickly established themselves among the state-of-the-art in supervised learning. This family of algorithms includes the relevance vector machine (RVM) [35], the sparse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifie...

24 | Sparse Bayesian Learning and the Relevance Vector Machine
- Tipping
- 2001
Citation Context: ...developed sparse classification algorithms have quickly established themselves among the state-of-the-art in supervised learning. This family of algorithms includes the relevance vector machine (RVM) [35], the sparse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) alg...

22 | From margin to sparsity
- Graepel, Herbrich, et al.
- 2000
Citation Context: ...‖ŵ‖₁ is large (or, in the case of the SVM, if the margin separating the classes is small), good generalization may be possible. In the past, margin bounds have been criticized on these grounds (see [14] for an elegant example); similar criticisms remain valid about our bounds also. While they are theoretically interesting, our bounds provide only partial justification to our algorithm; we have chose...

20 | A Bayesian approach to joint feature selection and classifier design
- Krishnapuram, Hartemink, et al.
- 2004
Citation Context: ...probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifiers that are constructed as weighted linear combinations of basis functions; the weights are estimated in the presence of training data. In many of these algorithms, t...

19 | Generalisation error bounds for sparse linear classifiers
- Graepel, Herbrich, et al.
- 2000
Citation Context: ...est known approach to deriving bounds [38]; however, VC theory usually leads to very loose bounds unless one has extremely large training sets. Other more recent approaches include compression bounds [13] and minimum description length (MDL) bounds [38], as well as bounds based on Rademacher complexities [1], [26] and PAC-Bayesian bounds [25], [33]. It has often been noted that compression and MDL bou...

18 | Monotonicity of quadratic-approximation algorithms
- Böhning, Lindsay
- 1988
Citation Context: ...e probabilistic modeling aspects of the problem [7]. One way to obtain a surrogate function Q(w|w0) when L(w) is concave is by using a bound on its Hessian (which, if it exists, is negative definite) [3], [18], [19]. If the Hessian H is lower bounded, i.e., if there exists a negative definite matrix B such that H(w) ⪰ B for any w, then it is easy to prove that, for any w0,...
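The Hessian-bound construction in this excerpt can be made concrete in the binary case: the logistic log-likelihood Hessian satisfies H(w) ⪰ B = −(1/4)XᵀX (the fixed curvature bound associated with Böhning), so maximizing the resulting quadratic minorizer gives a monotone update with a fixed matrix inverse. The sketch below is our own illustration of that idea, not the paper's multiclass algorithm; all names are assumptions:

```python
import numpy as np

def bound_opt_logistic(X, y, n_iter=300):
    """Bound-optimization (minorize-maximize) for binary logistic regression.

    The Hessian of the log-likelihood satisfies H(w) >= B with B = -(1/4) X^T X,
    so the quadratic surrogate maximizer gives the monotone update
        w <- w - B^{-1} g(w)  =  w + 4 (X^T X)^{-1} X^T (y - p),
    with (X^T X)^{-1} computed once, outside the loop. No line search needed.
    """
    XtX_inv = np.linalg.inv(X.T @ X)          # fixed curvature matrix, precomputed
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # current predicted probabilities
        w = w + 4.0 * XtX_inv @ (X.T @ (y - p))
    return w
```

Compared with Newton/IRLS this trades a slower (linear) convergence rate for a Hessian factorization that is computed only once, which is the efficiency argument the paper builds on.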

9 | Joint classifier and feature optimization for cancer diagnosis using gene expression data. The
- Krishnapuram
- 2003
Citation Context: ...parse probit regression (SPR) algorithm [10], [11], sparse online Gaussian processes [6], the informative vector machine (IVM) [22], and the joint classifier and feature optimization (JCFO) algorithm [16], [17]. These algorithms learn classifiers that are constructed as weighted linear combinations of basis functions; the weights are estimated in the presence of training data. In many of these algorit...

7 | Discussion of boosting papers
- Friedman, Hastie, et al.
Citation Context: ...ms is the Laplacian, which results in an l1-penalty, analogous to the LASSO penalty for regression [34]. The sparsity-promoting nature of the Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is th...

6 | Feature selection, L1 vs. L2 regularization, and rotational invariance
- Ng
- 2004
Citation Context: ...the Laplacian, which results in an l1-penalty, analogous to the LASSO penalty for regression [34]. The sparsity-promoting nature of the Laplacian prior is theoretically well-justified (see [9], [12], [29], as well as references therein) and has been found to be practically and conceptually useful in several research areas [4], [23], [30], [41]. Another interesting property of the Laplacian is that it...

6 | Bayesian Classification with Gaussian Priors - Williams, Barber - 1998 |

1 | Block Relaxation Methods in Statistics (technical report)
- de Leeuw, Michailidis
- 1993
Citation Context: ...this aspect of our approach in order to contrast it with “1-versus-all” and other similar heuristics that are frequently adopted in current practice. Second, by combining a bound optimization approach [7], [18], [19] with a component-wise update procedure, we derive in Section 3 a series of new fast algorithms for learning a sparse multiclass classifier that scale favorably in both the number of train...

1 | Regularized Linear Classification Methods
- Zhang, Oles
- 2001
Citation Context: ...inclusion/exclusion criterion for basis functions remains, which is consistent with the fact that the Gaussian is not a sparsity-promoting prior. A related component-wise update algorithm was proposed in [42], which is only applicable in the case of a Gaussian prior (not a Laplacian one). One issue that has to be addressed is the choice of component to update at each step. Because of the concavity of our...

1 |
- Zhu, Hastie
- 2002
Citation Context: ...ŵ = arg max_w L(w) = arg max_w [ℓ(w) + log p(w)] (2), with p(w) being some prior on the parameters w. Although the IRLS algorithm can be trivially modified to accommodate a Gaussian prior on w (see, for example, [43]), other priors are not so easily handled. In particular, a sparsity-promoting Laplacian...