## Algorithms for sparse linear classifiers in the massive data setting, 2006. Manuscript. Available from www.stat.rutgers.edu/~madigan/papers (2005)

Citations: 16 (0 self)

### BibTeX

@MISC{Balakrishnan05algorithmsfor,

author = {Suhrid Balakrishnan and David Madigan},

title = {Algorithms for sparse linear classifiers in the massive data setting, 2006. Manuscript. Available from www.stat.rutgers.edu/~madigan/papers},

year = {2005}

}

### Abstract

Classifiers favoring sparse solutions, such as support vector machines, relevance vector machines, LASSO-regression based classifiers, etc., provide competitive methods for classification problems in high dimensions. However, current algorithms for training sparse classifiers typically scale quite unfavorably with respect to the number of training examples. This paper proposes online and multi-pass algorithms for training sparse linear classifiers for high dimensional data. These algorithms have computational complexity and memory requirements that make learning on massive datasets feasible. The central idea that makes this possible is a straightforward quadratic approximation to the likelihood function.
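
The abstract's central idea, a quadratic approximation to the likelihood, can be illustrated with a small sketch. What follows is not the authors' algorithm: it assumes a logistic model, a second-order Taylor expansion of each example's log-likelihood around a current estimate `beta_hat`, and a hypothetical class name `QuadraticLikelihoodSummary`. It only shows why memory can be independent of the number of training examples: each example is folded into a d × d matrix and a d-vector and can then be discarded.

```python
import numpy as np


def sigmoid(z):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


class QuadraticLikelihoodSummary:
    """Running quadratic surrogate  -0.5 * beta^T A beta + b^T beta
    for the logistic log-likelihood (illustrative assumption, not the
    paper's exact update rules)."""

    def __init__(self, d):
        self.A = np.zeros((d, d))  # accumulated negative Hessians
        self.b = np.zeros(d)       # accumulated linear coefficients

    def update(self, x, y, beta_hat):
        """Fold one example (x, y) into the summary, expanding around beta_hat."""
        p = sigmoid(x @ beta_hat)
        w = p * (1.0 - p)                      # curvature of this term
        self.A += w * np.outer(x, x)
        self.b += (y - p) * x + w * (x @ beta_hat) * x

    def value(self, beta):
        """Surrogate log-likelihood at beta, up to an additive constant."""
        return -0.5 * beta @ self.A @ beta + self.b @ beta
```

The storage cost is O(d²) regardless of how many examples have been processed, which is the property the abstract emphasizes. The choice of expansion point for each example is left open in this sketch.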

### Citations

3268 | Convex Analysis
- Rockafellar
- 1970
Citation Context ...eeded. Appendix B. In this appendix we derive the modified Shooting algorithm, Algorithm 1, and discuss its efficient implementation. We derive Shooting by analyzing the subdifferential of the system (Rockafellar, 1970). We need convex non-smooth analysis results because the regularization term is non-differentiable at zero. Reviewing concepts very briefly, the subgradient $\xi \in \mathbb{R}^{|x|}$ of a convex function f at $x_0$ i... |
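
As a worked instance of the non-smooth analysis this excerpt refers to, the standard subgradient inequality and the subdifferential of the absolute value are written out below. The L1 term $\gamma\|\beta\|_1$ and the index $j$ are used for illustration; the excerpt itself is truncated before its own definition.

```latex
% Subgradient inequality for a convex function f at x_0:
f(x) \;\ge\; f(x_0) + \xi^{T}(x - x_0) \quad \text{for all } x .

% Subdifferential of the absolute value:
\partial |x_0| =
\begin{cases}
  \{\operatorname{sign}(x_0)\} & \text{if } x_0 \neq 0, \\
  [-1,\, 1]                    & \text{if } x_0 = 0 .
\end{cases}

% Hence, coordinate-wise for the L1 penalty:
\partial_{\beta_j}\, \gamma \|\beta\|_1 =
\begin{cases}
  \{\gamma \operatorname{sign}(\beta_j)\} & \text{if } \beta_j \neq 0, \\
  [-\gamma,\, \gamma]                     & \text{if } \beta_j = 0 .
\end{cases}
```

The interval at zero is what allows a coordinate to sit exactly at zero at the optimum, which is the source of sparsity.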

1835 | Regression shrinkage and selection via the Lasso
- Tibshirani
- 1996
Citation Context ... 2003) and attack overfitting and variable selection in a unified manner. L1-regularization and a maximum a posteriori (MAP) Bayesian analysis with so-called Laplacian priors yield identical results (Tibshirani, 1996) and in order to streamline our presentation, we adopt the Bayesian approach. Many training algorithms now exist for L1-logistic regression that can handle high-dimensional input vectors (Hastie et a... |

1654 | Atomic decomposition by basis pursuit
- Chen, Donoho, et al.
- 1998
Citation Context ...our predictive model, or shrunk towards zero. With this prior distribution, (2) presents a convex optimization problem and yields the same solutions as the LASSO (Tibshirani, 1996) and Basis Pursuit (Chen et al., 1999):

$$\max_{\beta} \log p(\beta \mid D_t) \equiv \max_{\beta} \left( \sum_{i=1}^{t} \log\left( y_i \Phi(\beta^T x_i) + (1 - y_i)\left(1 - \Phi(\beta^T x_i)\right) \right) - \gamma \|\beta\|_1 \right). \qquad (3)$$

The parameter γ in the above problem controls the amount of regularization. Figure 2 shows a 2-dime... |
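
The objective in Equation (3) can be evaluated directly. The sketch below assumes Φ is either the logistic function or the standard normal CDF (the excerpt does not say which), that labels are coded 0/1, and that the rows of `X` are the vectors x_i. The function names are hypothetical.

```python
import math

import numpy as np


def logistic_link(z):
    """Logistic choice of the link Phi."""
    return 1.0 / (1.0 + np.exp(-z))


def probit_link(z):
    """Probit choice of Phi: the standard normal CDF."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))


def l1_log_posterior(beta, X, y, gamma, link=logistic_link):
    """Sum of log-likelihood terms minus gamma times the L1 norm of beta."""
    p = link(X @ beta)                       # Phi(beta^T x_i) for every row
    likelihood = y * p + (1 - y) * (1 - p)   # picks p or 1 - p per label
    return np.sum(np.log(likelihood)) - gamma * np.sum(np.abs(beta))
```

Evaluating this function needs the whole dataset in memory, which is the batch cost the paper sets out to avoid.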

1039 | Bayesian theory
- Bernardo, Smith
- 1994
Citation Context ...riate Normal (Gaussian) distribution seems a natural first step backed by asymptotic Bayesian central limit results that imply this approximation will get better and better with the addition of data (Bernardo and Smith, 1994). The version of Assumed Density Filtering closest to our approach is described in Minka (2001b). The posterior distribution is assumed to be multivariate Normal and observations are processed sequen... |

984 | Bayes factors
- Kass, Raftery
- 1995
Citation Context ...eds of thousands. 4. Related work Approximating the log-likelihood function by a quadratic polynomial is a standard technique in Bayesian learning applications; see for example Laplace approximation (Kass and Raftery, 1995; MacKay, 1995), Assumed Density Filtering (ADF)/Expectation Propagation (EP) (Minka, 2001b), some variational approximation methods such as Jaakkola and Jordan (2000) and in Bayesian online learning ... |

753 | Least angle regression
- Efron, Hastie, et al.
- 2004
Citation Context ...dimensional visualization of how the objective function of the optimization problem changes as γ is varied. The choice of the regularization parameter is an important but separate question in itself (Efron et al., 2004; Hastie et al., 2004). While methods such as cross validation can be used to pick its value, we do not address such issues in this manuscript, and we simply assume γ is some fixed, user-specified con... |

436 | RCV1: A new benchmark collection for text categorization research
- Lewis, Yang, et al.
Citation Context ...Lewis, 2004). We examine one particular category, money-fx, to which we fit a logistic regression model. • BIG-RCV dataset: d = 288,062, t = 421,816, a dataset constructed from the RCV1-v2 dataset (Lewis et al., 2004). It consists of the training portion of the LYRL2004 split plus 2 parts of the test data (the test data is made publicly available in 4 ≈ 350 MB parts); see Figure 6. We also use just the training po... |

309 | Expectation propagation for approximate Bayesian inference, tech - Minka - 2005 |

265 | A family of algorithms for approximate Bayesian inference - Minka - 2001 |

162 | A new approach to variable selection in least squares problems
- Osborne, Presnell, et al.
- 2000
Citation Context ...ant. To the best of our knowledge, all existing algorithms solve the above convex optimization problem in the batch setting, i.e., by storing the dataset Dt in memory and iterating over it (Fu, 1998; Osborne et al., 2000; Zhang and Oles, 2001; Zhang, 2002; Shevade and Keerthi, 2003; Genkin et al., 2003). Consequently, these algorithms cannot be used in the massive data/online scenario, where memory costs dependent on... |

156 | Reuters-21578 text categorization test collection. http://www.daviddlewis.com/resources/testcollections/reuters21578
- Lewis
- 2004
Citation Context ...model parameters as above, but with t = 100,000 and only a logistic regression model. • ModApte training dataset: d = 21,989, t = 9,603. This is a text dataset, the ModApte split of Reuters-21578 (Lewis, 2004). We examine one particular category, money-fx, to which we fit a logistic regression model. • BIG-RCV dataset: d = 288,062, t = 421,816, a dataset constructed from the RCV1-v2 dataset (Lewis et al... |

153 | An Interior-Point Method for Large-Scale l1 Regularized Logistic Regression
- Koh, Kim, et al.
- 2007
Citation Context ...entation, we adopt the Bayesian approach. Many training algorithms now exist for L1-logistic regression that can handle high-dimensional input vectors (Hastie et al., 2004; Shevade and Keerthi, 2003; Koh et al., 2007). However, these algorithms generally begin with a “load data into memory” step that precludes applications with large numbers of training examples. More precisely, consider a training dataset that c... |

147 | The entire regularization path for the support vector machine
- Hastie, Rosset, et al.
Citation Context ...irani, 1996) and in order to streamline our presentation, we adopt the Bayesian approach. Many training algorithms now exist for L1-logistic regression that can handle high-dimensional input vectors (Hastie et al., 2004; Shevade and Keerthi, 2003). However, these algorithms generally begin with a “load data into memory” step that precludes applications with large numbers of training examples. More precisely, conside... |

139 | Probable networks and plausible predictions -- a review of practical Bayesian methods for supervised neural networks
- MacKay
- 1995
Citation Context ...ated work Approximating the log-likelihood function by a quadratic polynomial is a standard technique in Bayesian learning applications; see for example Laplace approximation (Kass and Raftery, 1995; MacKay, 1995), Assumed Density Filtering (ADF)/Expectation Propagation (EP) (Minka, 2001b), some variational approximation methods such as Jaakkola and Jordan (2000) and in Bayesian online learning (Opper, 1996).... |

113 | Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence - Krishnapuram, Carin, et al. - 2005 |

105 | Penalized regressions: The bridge versus the Lasso
- Fu
- 1998
Citation Context ...fied constant. To the best of our knowledge, all existing algorithms solve the above convex optimization problem in the batch setting, i.e., by storing the dataset Dt in memory and iterating over it (Fu, 1998; Osborne et al., 2000; Zhang and Oles, 2001; Zhang, 2002; Shevade and Keerthi, 2003; Genkin et al., 2003). Consequently, these algorithms cannot be used in the massive data/online scenario, where mem... |

105 | Bayesian parameter estimation via variational methods, Statistics and Computing 10 - Jaakkola, Jordan - 1999 |

81 | Text categorization based on regularized linear classifiers - Zhang, Oles - 2001 |

59 | A Simple and Efficient Algorithm for Gene Selection using Sparse Logistic Regression
- Shevade, Keerthi
- 2003
Citation Context ...r L1-regularized logistic and probit regression models. Such models have provided excellent predictive accuracy in many applications (see, for example, Genkin et al., 2003; Figueiredo and Jain, 2001; Shevade and Keerthi, 2003) and attack overfitting and variable selection in a unified manner. L1-regularization and a maximum a posteriori (MAP) Bayesian analysis with so-called Laplacian priors yield identical results (Tibsh... |

41 | Predictive automatic relevance determination by expectation propagation - Qi, Minka, et al. |

33 | Making logistic regression a core data mining tool: A practical investigation of accuracy, speed, and simplicity
- Komarek, Moore
- 2005
Citation Context ...timization problem for logistic regression (essentially the terms in Equation 3, but with L2 regularization of β) with techniques such as fixed memory BFGS (Minka, 2000), modified conjugate gradient (Komarek and Moore, 2005) and cyclic coordinate descent (Zhang and Oles, 2001; Genkin et al., 2007). In this paper, we employ instead a slight modification of the Shooting algorithm (Fu, 1998), see Algorithm 1. Shooting is e... |
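
For orientation, a generic Shooting-style coordinate update is sketched below. It is not the paper's modified Algorithm 1, which this excerpt does not reproduce. The sketch assumes the problem has already been reduced to maximizing a quadratic surrogate −½ βᵀAβ + bᵀβ − γ‖β‖₁, with `A` symmetric positive definite; the names `shooting` and `soft_threshold` are illustrative.

```python
import numpy as np


def soft_threshold(r, gamma):
    """Shrink r towards zero by gamma; exactly zero when |r| <= gamma."""
    return np.sign(r) * max(abs(r) - gamma, 0.0)


def shooting(A, b, gamma, n_sweeps=100, tol=1e-10):
    """Cyclic coordinate ascent on -0.5*beta^T A beta + b^T beta - gamma*||beta||_1.

    Assumes A is symmetric with strictly positive diagonal entries.
    """
    d = len(b)
    beta = np.zeros(d)
    for _ in range(n_sweeps):
        max_change = 0.0
        for j in range(d):
            # Linear coefficient of beta_j with all other coordinates held fixed.
            r_j = b[j] - A[j] @ beta + A[j, j] * beta[j]
            new_value = soft_threshold(r_j, gamma) / A[j, j]
            max_change = max(max_change, abs(new_value - beta[j]))
            beta[j] = new_value
        if max_change < tol:  # stop once a full sweep no longer moves beta
            break
    return beta
```

Each coordinate update is the exact maximizer of the one-dimensional problem, obtained from the subdifferential condition at zero. With `gamma = 0` the loop reduces to Gauss-Seidel iteration for the linear system Aβ = b.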

28 | A Bayesian approach to on-line learning
- Opper
- 1998
Citation Context ... MacKay, 1995), Assumed Density Filtering (ADF)/Expectation Propagation (EP) (Minka, 2001b), some variational approximation methods such as Jaakkola and Jordan (2000) and in Bayesian online learning (Opper, 1996). Our approach is closest in spirit to the online Bayesian method presented in Opper (1996) but is closer in the details of the approximation to ADF/EP as described in Minka (2001b). We briefly outli... |

24 | Bayesian learning of sparse classifiers
- Figueiredo, Jain
- 2001
Citation Context ... classification and consider L1-regularized logistic and probit regression models. Such models have provided excellent predictive accuracy in many applications (see, for example, Genkin et al., 2007; Figueiredo and Jain, 2001; Shevade and Keerthi, 2003) and attack overfitting and variable selection in a unified manner. L1-regularization and a maximum a posteriori (MAP) Bayesian analysis with so-called Laplacian priors yie... |

11 | Sparse bayesian classifiers for text categorization - Eyheramendy, Genkin, et al. - 2003 |

10 | Logistic Regression for Data Mining and High-Dimensional Classification - Komarek - 2004 |

6 | Laplace propagation
- Smola, Vishwanathan, et al.
- 2004
Citation Context ...erior distribution. Thus, if the approximation converges to a fixed point, it is the correct optima location. The above is a modification of the fixed point Lemma in the paper on Laplace Propagation (Eskin et al., 2003). One can also prove unbiasedness which follows from our update rules and a minor modification of a theorem in Opper (1999). Even though Opper derives his results based on a Gaussian prior on the pa... |

3 | Large-scale Bayesian logistic regression for text categorization
- Genkin, Lewis, et al.
Citation Context ...cifically with binary classification and consider L1-regularized logistic and probit regression models. Such models have provided excellent predictive accuracy in many applications (see, for example, Genkin et al., 2007; Figueiredo and Jain, 2001; Shevade and Keerthi, 2003) and attack overfitting and variable selection in a unified manner. L1-regularization and a maximum a posteriori (MAP) Bayesian analysis with so-... |

2 | On the dual formulation of regularized linear systems
- Zhang
Citation Context ...ing algorithms solve the above convex optimization problem in the batch setting, i.e., by storing the dataset Dt in memory and iterating over it (Fu, 1998; Osborne et al., 2000; Zhang and Oles, 2001; Zhang, 2002; Shevade and Keerthi, 2003; Genkin et al., 2003). Consequently, these algorithms cannot be used in the massive data/online scenario, where memory costs dependent on t must be avoided. The approach we... |