## Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions


Citations: 70 (9 self)

### BibTeX

@MISC{Duchi_efficientprojections,
  author = {John Duchi and Yoram Singer and Tushar Chandra},
  title  = {Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions},
  year   = {}
}


### Abstract

We describe efficient algorithms for projecting a vector onto the ℓ1-ball. We present two methods for projection. The first performs exact projection in O(n) expected time, where n is the dimension of the space. The second works on vectors of which k elements are perturbed outside the ℓ1-ball, projecting in O(k log(n)) time. This setting is especially useful for online learning in sparse feature spaces such as text categorization applications. We demonstrate the merits and effectiveness of our algorithms in numerous batch and online learning tasks. We show that variants of stochastic gradient projection methods augmented with our efficient projection procedures outperform interior point methods, which are considered state-of-the-art optimization techniques. We also show that in online settings gradient updates with ℓ1 projections outperform the exponentiated gradient algorithm while obtaining models with high degrees of sparsity.
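The exact O(n) method in the abstract improves on a classic O(n log n) sort-based projection, which is the easiest way to see the structure of the problem: sort the magnitudes, find the largest support size ρ whose implied soft-threshold keeps all ρ entries positive, then shrink toward zero. A minimal sketch (the function name, NumPy usage, and interface are illustrative choices, not the paper's):

```python
import numpy as np

def project_l1_ball(v, z=1.0):
    # Sort-based projection onto {w : ||w||_1 <= z}. The paper's contribution
    # is faster variants; this O(n log n) baseline shows the structure.
    v = np.asarray(v, dtype=float)
    if np.abs(v).sum() <= z:
        return v.copy()                      # already inside the ball
    u = np.sort(np.abs(v))[::-1]             # magnitudes, descending
    cssv = np.cumsum(u)                      # partial sums of magnitudes
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (cssv - z) / j > 0)[0][-1]   # last valid support size
    theta = (cssv[rho] - z) / (rho + 1)      # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```

For example, projecting [3, 0] onto the unit ℓ1-ball yields [1, 0], and projecting [1, 1] yields [0.5, 0.5]: the shrinkage amount θ is shared across all surviving coordinates.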

### Citations

9158 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 1998
Citation Context: ...stemming from the need to compute summations (while searching) of the form given by Eq. (9). Our efficient projection algorithm is based on a modification of the randomized median finding algorithm (Cormen et al., 2001). The algorithm computes partial sums just-in-time and has expected linear time complexity. The algorithm identifies ρ and the pivot value v(ρ) without sorting the vector v by using a divide and con...
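The context above describes the expected-linear-time variant: like randomized median finding, it picks a pivot, partitions the remaining magnitudes, and maintains just the partial sum and count of elements already known to lie above the threshold, so no full sort is needed. A hedged sketch under those assumptions (variable names and the stdlib-only style are ours):

```python
import random

def project_l1_ball_linear(v, z=1.0):
    # Expected O(n) projection onto {w : ||w||_1 <= z}, in the spirit of
    # randomized median finding with just-in-time partial sums.
    mags = [abs(x) for x in v]
    if sum(mags) <= z:
        return list(v)
    U = mags[:]      # candidate magnitudes still undecided
    s = 0.0          # sum of magnitudes known to lie above the threshold
    rho = 0          # how many such magnitudes
    while U:
        pivot = random.choice(U)
        G = [x for x in U if x >= pivot]     # contains at least the pivot
        L = [x for x in U if x < pivot]
        ds, drho = sum(G), len(G)
        if (s + ds) - (rho + drho) * pivot < z:
            # all of G belongs to the support: accept it, recurse on L
            s, rho = s + ds, rho + drho
            U = L
        else:
            G.remove(pivot)                  # pivot is below the threshold
            U = G
    theta = (s - z) / rho                    # same soft-threshold as before
    return [(1 if x >= 0 else -1) * max(abs(x) - theta, 0.0) for x in v]
```

Each iteration discards the rejected half of the candidates, so the expected work forms a geometric series summing to O(n), mirroring quickselect's analysis.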

2105 | Regression Shrinkage and Selection via the Lasso
- Tibshirani
- 1996
Citation Context: ...nsion of w is very high, a sparse solution enables easier interpretation of the problem in a lower dimension space. For the usage of ℓ1-based approaches in statistical machine learning see for example (Tibshirani, 1996) and the references therein. Donoho (2006b) provided sufficient conditions for obtaining an optimal ℓ1-norm solution which is sparse. Recent work on compressed sensing (Candes, 2006; Donoho, 2006a) f...

1922 | Compressed sensing
- Donoho
Citation Context: ...Tibshirani, 1996) and the references therein. Donoho (2006b) provided sufficient conditions for obtaining an optimal ℓ1-norm solution which is sparse. Recent work on compressed sensing (Candes, 2006; Donoho, 2006a) further explores how ℓ1 constraints can be used for recovering a sparse signal sampled below the Nyquist rate. The second motivation for using ℓ1 constraints in machine learning problems is that in...

811 | Nonlinear Programming. Athena Scientific
- Bertsekas
- 1999
Citation Context: ...ect to an ℓ1-norm constraint on w. Formally, the problem we need to solve is: minimize L(w) s.t. ‖w‖₁ ≤ z (1). Our focus is on variants of the projected subgradient method for convex optimization (Bertsekas, 1999). Projected subgradient methods minimize a function L(w) subject to the constraint that w ∈ X, for X convex, by generating the sequence {w^(t)} via w^(t+1) = Π_X(w^(t) − η_t ∇^(t)), where ∇^(t) is (an...
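The update rule quoted above, w^(t+1) = Π_X(w^(t) − η_t ∇^(t)), can be sketched end to end with an ℓ1-ball projection as the Π_X step. Everything here is illustrative: the step-size schedule η_t = η₀/√t, the function names, and the compact sort-based projection are our choices, not the paper's exact settings.

```python
import numpy as np

def proj_l1(v, z):
    # Pi_X for X = {w : ||w||_1 <= z}: simple sort-based projection.
    if np.abs(v).sum() <= z:
        return v
    u = np.sort(np.abs(v))[::-1]
    cssv = np.cumsum(u) - z
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > cssv)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def projected_subgradient(grad, w0, z, steps=500, eta0=1.0):
    # w_{t+1} = Pi_X(w_t - eta_t * grad(w_t)), eta_t = eta0 / sqrt(t).
    # `grad` returns a (sub)gradient of the loss L at w.
    w = proj_l1(np.asarray(w0, dtype=float), z)
    for t in range(1, steps + 1):
        w = proj_l1(w - (eta0 / np.sqrt(t)) * grad(w), z)
    return w
```

As a smoke test, minimizing L(w) = ‖w − c‖² with c = [2, 0] under ‖w‖₁ ≤ 1 should drive the iterate to the projection of c onto the ball, namely [1, 0].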

709 | An introduction to compressive sampling
- Candes, Wakin
- 2008
Citation Context: ...for example (Tibshirani, 1996) and the references therein. Donoho (2006b) provided sufficient conditions for obtaining an optimal ℓ1-norm solution which is sparse. Recent work on compressed sensing (Candes, 2006; Donoho, 2006a) further explores how ℓ1 constraints can be used for recovering a sparse signal sampled below the Nyquist rate. The second motivation for using ℓ1 constraints in machine learning probl...

490 | Rcv1: A new benchmark collection for text categorization research
- Lewis, Yang, et al.
- 2004
Citation Context: ...of IP methods is counteracted by their quadratic dependence on the dimension of the space. We also ran a series of experiments on two real datasets with high dimensionality: the Reuters RCV1 Corpus (Lewis et al., 2004) and the MNIST handwritten digits database. The Reuters Corpus has 804,414 examples; with simple stemming and stop-wording, there are 112,919 unigram features and 1,946,684 bigram features. With our ...

372 | For Most Large Underdetermined Systems of Linear Equations the Minimal ℓ1-norm Solution is also the Sparsest Solution - Donoho - 2006 |

313 | Pegasos: Primal Estimated sub-GrAdient SOlver for SVM - Shalev-Shwartz, Singer, et al. - 2007 |

260 | Exponentiated gradient versus gradient descent for linear predictors
- Kivinen, Warmuth
- 1997
Citation Context: ...jected subgradient methods on real data, we used a method known in the literature as either entropic descent, a special case of mirror descent (Beck & Teboulle, 2003), or exponentiated gradient (EG) (Kivinen & Warmuth, 1997). EG maintains a weight vector w subject to the constraint that ∑_i w_i = z and w ⪰ 0; it can easily be extended to work with negative weights under a 1-norm constraint by maintaining two vectors w+ ...
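The EG update the context refers to is multiplicative: each weight is scaled by exp(−η∇_i) and the vector is renormalized so the weights stay nonnegative and sum to z. A minimal sketch (η and the function name are our illustrative choices):

```python
import math

def eg_update(w, grad, eta, z=1.0):
    # One exponentiated-gradient step on {w : sum_i w_i = z, w >= 0}.
    # Multiplicative update followed by renormalization to total mass z.
    unnorm = [wi * math.exp(-eta * gi) for wi, gi in zip(w, grad)]
    total = sum(unnorm)
    return [z * u / total for u in unnorm]
```

Note that weights never change sign, which is why the snippet above mentions maintaining two nonnegative vectors (a positive and a negative part whose difference is w) to handle signed weights under a 1-norm constraint.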

181 | Pathwise coordinate optimization - Friedman, Hastie, et al. |

178 | On the learnability and design of output codes for multiclass problems - Crammer, Singer |

144 | Feature selection, L1 vs. L2 regularization, and rotational invariance - Ng - 2004 |

88 | Mirror descent and nonlinear projected subgradient methods for convex optimization
- Beck, Teboulle
Citation Context: ...s.t. ‖w_j‖₁ ≤ z, w_j ⪰ 0 (12). As a comparison to our projected subgradient methods on real data, we used a method known in the literature as either entropic descent, a special case of mirror descent (Beck & Teboulle, 2003), or exponentiated gradient (EG) (Kivinen & Warmuth, 1997). EG maintains a weight vector w subject to the constraint that ∑_i w_i = z and w ⪰ 0; it can easily be extended to work with negative weights...

72 | Data Structures and Network Algorithms, Society for Industrial and Applied Mathematics (SIAM)
- Tarjan
- 1983
Citation Context: ...he non-zero elements in our weight vector by updating Θ_t. We then remove all the components that are less than v(ρ) (i.e. less than Θ_t); this removal is efficient and requires only logarithmic time (Tarjan, 1983). The course of the algorithm is as follows. After t projected gradient iterations we have a vector v^(t) whose non-zero elements are stored in a red-black tree T and a global deduction value Θ_t whic...

56 | An interior-point method for large-scale ℓ1-regularized logistic regression
- Koh, Kim, et al.
Citation Context: ...ate Flops Figure 4. Comparison of methods on ℓ1-regularized least squares. The left has dimension n = 800, the right n = 4000. ...methods for both least squares and logistic regression (Koh et al., 2007; Kim et al., 2007). The algorithms we use are batch projected gradient, stochastic projected subgradient, and batch projected gradient augmented with a backtracking line search (Koh et al., 2007). The IP and coordinat...

48 | Two-metric projection methods for constrained optimization - Gafni, Bertsekas - 1984 |

24 | Efficient learning of label ranking by soft projections onto polyhedra
- Shalev-Shwartz, Singer
- 2006
Citation Context: ...vector w are tied via a single variable, so knowing the indices of these elements gives a much simpler problem. Upon first inspection, finding these indices seems difficult, but the following lemma (Shalev-Shwartz & Singer, 2006) provides a key tool in deriving our procedure for identifying non-zero elements. Lemma 1. Let w be the optimal solution to the minimization problem in Eq. (3). Let s and j be two indices such that v...

4 | Approximate convex optimization by online game playing - Hazan |
