## Efficient Online and Batch Learning using Forward Backward Splitting

Citations: 56 (1 self)

### BibTeX

@MISC{Duchi_efficientonline,
  author = {John Duchi and Yoram Singer},
  title = {Efficient Online and Batch Learning using Forward Backward Splitting},
  year = {}
}


### Abstract

We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two-phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as ℓ1. We derive concrete and very simple algorithms for minimization of loss functions with ℓ1, ℓ2, ℓ2², and ℓ∞ regularization. We also show how to construct efficient algorithms for mixed-norm ℓ1/ℓq regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural datasets.
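The two-phase update described in the abstract can be sketched for the ℓ1 case, where the second phase has a closed-form soft-thresholding solution. This is an illustrative sketch with invented names (`eta`, `lam`), not the paper's reference implementation:

```python
# Sketch of the two-phase update from the abstract, in plain Python.
# Phase 1: unconstrained gradient step. Phase 2: solve a small proximal
# problem that trades off the regularizer against proximity to phase 1.

def l1_prox(v, threshold):
    """Phase 2 for r(w) = lam * ||w||_1: coordinate-wise soft-thresholding."""
    return [max(abs(vj) - threshold, 0.0) * (1.0 if vj >= 0 else -1.0)
            for vj in v]

def fobos_step(w, grad, eta, lam):
    """One iteration: gradient step, then closed-form regularization step."""
    # Phase 1: unconstrained (sub)gradient descent step.
    v = [wj - eta * gj for wj, gj in zip(w, grad)]
    # Phase 2: argmin_u 0.5*||u - v||^2 + eta*lam*||u||_1, solved coordinate-wise.
    return l1_prox(v, eta * lam)

w = fobos_step([0.5, -0.2, 1.0], grad=[0.1, 0.1, 0.1], eta=0.5, lam=0.3)
# Coordinates whose magnitude falls below the threshold are set exactly
# to zero, which is how the ℓ1 version produces sparse solutions.
```

The same two-phase structure applies to the other regularizers in the abstract; only the phase-2 closed form changes.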

### Citations

3668 |
Convex optimization
- Boyd, Vandenberghe
- 2004
Citation Context ...o w and setting the result to zero, we get that the optimal solution is w⋆ = v − λ̃ + β. If w⋆ > 0, then from the complementary slackness condition that the optimal pair w⋆ and β must satisfy w⋆β = 0 (Boyd and Vandenberghe, 2004), we must have β = 0, and therefore w⋆ = v − λ̃. If v < λ̃, then v − λ̃ < 0, so we must have β > 0 and again by complementary slackness, w⋆ = 0. The case when v ≤ 0 is analogous and amounts to simp... |
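The scalar solution discussed in this excerpt can be checked numerically. A minimal sketch (illustrative names, not the paper's notation) of the problem min over w ≥ 0 of ½(w − v)² + λ̃w, whose KKT cases collapse to w⋆ = max(v − λ̃, 0):

```python
# Numeric check of the scalar case above: for
#     min_{w >= 0}  0.5 * (w - v)**2 + lam * w
# stationarity gives w = v - lam + beta, and the two complementary-
# slackness cases collapse to the closed form w* = max(v - lam, 0).

def scalar_solution(v, lam):
    return max(v - lam, 0.0)

def objective(w, v, lam):
    return 0.5 * (w - v) ** 2 + lam * w

v, lam = 0.8, 0.3
w_star = scalar_solution(v, lam)  # beta = 0 case: w* = v - lam
# Brute-force: no feasible grid point beats the closed-form solution.
grid = [i / 1000.0 for i in range(2001)]
assert all(objective(w_star, v, lam) <= objective(w, v, lam) + 1e-9
           for w in grid)
```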

3265 | Variational Analysis - Rockafellar, Wets - 1997 |

419 | An iterative thresholding algorithm for linear inverse problems with a sparsity constraint - Daubechies, Defrise, et al. - 2004 |

382 | High-dimensional graphs and variable selection with the Lasso - Meinshausen, Bühlmann - 2006 |

282 | Signal recovery by proximal forward backward splitting - Combettes, Wajs - 2005 |

280 | Pegasos: Primal estimated sub-gradient solver for svm
- Shalev-Shwartz, Singer, et al.
- 2011
Citation Context ...) loss. Both algorithms quickly approach the optimal value. In this experiment we let both Pegasos and Fobos employ a projection after each gradient step onto an ℓ2-norm ball in which w⋆ must lie (see Shalev-Shwartz et al., 2007, and the discussion following the proof of Theorem 2). However, in the experiment corresponding to the right plot of Fig. 4, we eliminated the additional projection step and ran the algorithms with th... |
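The ℓ2-ball projection mentioned in this excerpt has a one-line closed form: rescale w whenever it lies outside the ball. A hedged sketch, where the radius `R` is a hypothetical bound on ‖w⋆‖ rather than a value from the paper:

```python
import math

# Projection onto an l2-norm ball of radius R: if ||w|| <= R, w is
# unchanged; otherwise w is rescaled onto the sphere of radius R.

def project_l2_ball(w, radius):
    norm = math.sqrt(sum(wj * wj for wj in w))
    if norm <= radius:
        return list(w)
    scale = radius / norm
    return [wj * scale for wj in w]

w = project_l2_ball([3.0, 4.0], radius=1.0)  # norm 5 -> rescaled onto the unit ball
```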

222 | On Model Selection Consistency of Lasso - Zhao, Yu - 2006 |

191 |
Gradient methods for minimizing composite objective function. Core discussion paper 2007/96
- Nesterov
- 2007
Citation Context ...ly, 1/T in terms of the number of iterations needed to be ε-close to the optimum). A more complicated algorithm related to Nesterov's "estimate functions" (Nesterov, 2004) leads to O(1/√ε) convergence (Nesterov, 2007). For completeness, we give a simple proof of 1/T convergence in Appendix C. Finally, the above proof can be modified slightly to give convergence of the stochastic gradient method. In particular, 7... |

168 | Sparse reconstruction by separable approximation - Wright, Nowak, et al. |

140 | The group lasso for logistic regression - Meier, Geer, et al. |

122 | Logarithmic regret algorithms for online convex optimization
- HAZAN, AGARWAL, et al.
- 2007
Citation Context ...mixed-norm regularization). There is also a body of literature on regret analysis for online learning and online convex programming with convex constraints, which we build upon here (Zinkevich, 2003; Hazan et al., 2006; Shalev-Shwartz and Singer, 2007). Learning sparse models generally is of great interest in the statistics literature, specifically in the context of consistency and recovery of sparsity patterns thr... |

109 | Splitting algorithms for the sum of two nonlinear operators - Lions, Mercier - 1979 |

106 | Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification - Blitzer, Dredze, et al. - 2007 |

88 | Grouped and hierarchical model selection through composite absolute penalties
- Zhao, Rocha, et al.
- 2006
Citation Context ...re, specifically in the context of consistency and recovery of sparsity patterns through ℓ1 or mixed-norm regularization across multiple tasks (Meinshausen and Bühlmann, 2006; Obozinski et al., 2008; Zhao et al., 2006). In this paper, we describe a general gradient-based framework for online and batch convex programming. To make our presentation a little simpler, we call our approach Fobos, for FOrward-Backward Spl... |

82 | Mirror descent and nonlinear projected subgradient methods for convex optimization - Beck, Teboulle |

66 | Efficient projections onto the ℓ1-ball for learning in high dimensions
- Duchi, Shalev-Shwartz, et al.
- 2008
Citation Context ...lies that θ2 ≤ θ1. We now examine the solution vectors to the dual problems of P.1, α1 and α2. We know that ‖α1‖1 = λ1 so that ‖w0 − α1‖1 > λ2 and hence α2 is at the boundary ‖α2‖1 = λ2 (see again Duchi et al., 2008). Furthermore, the sum of these vectors is α1 + α2 = [w0 − θ1]+ + [w0 − [w0 − θ1]+ − θ2]+. (37) Let v denote a component of w0 greater than θ1. For any such component the right hand side of... |

62 |
Introductory Lectures on Convex Optimization
- Nesterov
- 2003
Citation Context ...mization. The main reason this weaker result occurs is that we cannot guarantee a strict descent direction when using arbitrary subgradients (see, for example, Theorem 3.2.2 of Nesterov, 2004). Another consequence of using non-differentiable functions is that analyses such as those carried out by Tseng (2000) and Chen and Rockafellar (1997) are difficult to apply, as the stronger rates... |

60 | Accelerated projected gradient methods for linear inverse problems with sparsity constraints - Daubechies, Fornasier, et al. - 2008 |

60 |
Denoising via soft-thresholding
- Donoho
- 1995
Citation Context ... gives a simple online and offline method for minimizing a convex f with ℓ1 regularization. Such soft-thresholding operations are common in the statistics literature and have been used for some time (Donoho, 1995; Daubechies et al., 2004). Langford et al. (2008) recently proposed and analyzed the same update, terming it the "truncated gradient." The analysis presented here is different from the analysis in th... |

59 | Sparse online learning via truncated gradient
- Langford, Li, et al.
Citation Context ... the section we denote by w⋆ the minimizer of f(w) + r(w). In what follows, define ‖∂f(w)‖ ≜ sup_{g∈∂f(w)} ‖g‖. We begin by deriving convergence results under the fairly general assumption (see, e.g., Langford et al., 2008 or Shalev-Shwartz and Tewari, 2009) that the subgradients are bounded as follows: ‖∂f(w)‖² ≤ A f(w) + G², ‖∂r(w)‖² ≤ A r(w) + G². (8) For example, any Lipschitz loss (such as the logistic loss or... |
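The boundedness assumption quoted in this excerpt can be sanity-checked on a concrete loss. For the scalar squared loss f(w) = ½(w − y)², the gradient is (w − y), so ‖∂f(w)‖² = (w − y)² = 2 f(w) and the bound holds exactly with A = 2, G = 0. A minimal check (our own toy example, not from the paper):

```python
# Illustration of the subgradient-growth assumption ||df(w)||^2 <= A*f(w) + G^2
# for the scalar squared loss f(w) = 0.5*(w - y)**2, where it holds with
# A = 2 and G = 0 since the gradient is (w - y).

def f(w, y):
    return 0.5 * (w - y) ** 2

def grad_f(w, y):
    return w - y

A, G = 2.0, 0.0
for w in [-3.0, -0.5, 0.0, 1.7, 10.0]:
    assert grad_f(w, 1.0) ** 2 <= A * f(w, 1.0) + G ** 2 + 1e-12
```

A Lipschitz loss instead satisfies the bound with A = 0 and G equal to its Lipschitz constant, which is the case the excerpt goes on to mention.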

57 | An interior point method for large-scale ℓ1-regularized least squares - Kim, Koh, et al. - 2007 |

51 | A Coordinate Gradient Descent Method for Nonsmooth Separable Minimization - Tseng, Yun |

48 | A modified forward-backward splitting method for maximal monotone mappings - Tseng |

45 | A discriminative kernel-based model to rank images from text queries - Grangier, Bengio |

39 | A fixed-point continuation method for ℓ1-regularized minimization with applications to compressed sensing - Hale, Yin, et al. - 2007 |

28 | High-dimensional union support recovery in multivariate regression
- Obozinski, Wainwright, et al.
- 2008
Citation Context ... the statistics literature, specifically in the context of consistency and recovery of sparsity patterns through ℓ1 or mixed-norm regularization across multiple tasks (Meinshausen and Bühlmann, 2006; Obozinski et al., 2008; Zhao et al., 2006). In this paper, we describe a general gradient-based framework for online and batch convex programming. To make our presentation a little simpler, we call our approach Fobos, for ... |

24 | Optimizing costly functions with simple constraints: A limited-memory projected quasi-newton algorithm - Schmidt, Berg, et al. - 2009 |

21 | Convergence rates in forward-backward splitting - Chen, Rockafellar - 1997 |

17 | Stochastic methods for ℓ 1 -regularized loss minimization
- Shalev-Shwartz, Tewari
Citation Context ...w⋆ the minimizer of f(w) + r(w). In what follows, define ‖∂f(w)‖ ≜ sup_{g∈∂f(w)} ‖g‖. We begin by deriving convergence results under the fairly general assumption (see, e.g., Langford et al. 2008 or Shalev-Shwartz and Tewari 2009) that the subgradients are bounded as follows: ‖∂f(w)‖² ≤ A f(w) + G², ‖∂r(w)‖² ≤ A r(w) + G². (6) For example, any Lipschitz loss (such as the logistic loss or hinge loss used in support vector ma... |

16 | Logarithmic regret algorithms for strongly convex repeated games
- Shalev-Shwartz, Singer
- 2007
Citation Context ...ation). There is also a body of literature on regret analysis for online learning and online convex programming with convex constraints, which we build upon here (Zinkevich, 2003; Hazan et al., 2006; Shalev-Shwartz and Singer, 2007). Learning sparse models generally is of great interest in the statistics literature, specifically in the context of consistency and recovery ... |

12 | Joint covariate selection for grouped classification
- Obozinski, Taskar, et al.
- 2007
Citation Context ...ere the jth column of the matrix is the weight vector w^j associated with class j. Thus, the ith row corresponds to the weight of the ith feature with respect to all classes. The mixed ℓr/ℓs-norm (Obozinski et al., 2007) of W, denoted ‖W‖ℓr/ℓs, is obtained by computing the ℓs-norm of each row of W and then applying the ℓr-norm to the resulting n-dimensional vector; for instance, ‖W‖ℓ1/ℓ∞ = ∑_{i=1}^n max_j |Wi,j|. T... |
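The mixed-norm definition in this excerpt is straightforward to compute directly: take the ℓs-norm of each row of W, then the ℓr-norm of the resulting vector. A small illustrative sketch (our own helper, not the paper's code), with W as a list of rows:

```python
# Mixed l_r/l_s matrix norm: the l_s-norm of each row, then the l_r-norm
# of the resulting per-row vector. The l1/l_inf case sums per-row maxima.

def mixed_norm(W, r, s):
    def lp(v, p):
        if p == float("inf"):
            return max(abs(x) for x in v)
        return sum(abs(x) ** p for x in v) ** (1.0 / p)
    return lp([lp(row, s) for row in W], r)

W = [[1.0, -3.0],
     [2.0,  0.5]]
# l1/l_inf: max of row 1 is 3, max of row 2 is 2, so the norm is 5.
print(mixed_norm(W, 1, float("inf")))  # 5.0
```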

10 | Phase transitions for high-dimensional joint support recovery. NIPS - Negahban, Wainwright - 2009 |

10 | A robust hybrid of lasso and ridge regression
- Owen
- 2006
Citation Context ... inverse Huber) regularization results in sparse solutions, but its hybridization with ℓ2² regularization prevents the weights from being excessively large. Berhu regularization (Owen, 2006) is defined as r(w) = λ ∑_{j=1}^n b(wj) = λ ∑_{j=1}^n [ |wj| [[|wj| ≤ γ]] + ((wj² + γ²)/(2γ)) [[|wj| > γ]] ]. In the above, [[·]] is 1 if its argument is true and 0 otherwise. The positive scalar γ controls ... |
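The berhu penalty defined in this excerpt behaves like ℓ1 near zero (promoting sparsity) and like a quadratic beyond the knee γ (discouraging very large weights). A minimal sketch with our own names, not the paper's code:

```python
# Berhu penalty from the excerpt: b(w) = |w| when |w| <= gamma, and
# (w^2 + gamma^2) / (2*gamma) when |w| > gamma; r(w) = lam * sum_j b(w_j).

def berhu(wj, gamma):
    if abs(wj) <= gamma:
        return abs(wj)
    return (wj * wj + gamma * gamma) / (2.0 * gamma)

def berhu_penalty(w, lam, gamma):
    return lam * sum(berhu(wj, gamma) for wj in w)

# The penalty is continuous at the knee: both branches give b(gamma) = gamma.
print(berhu_penalty([0.5, -2.0], lam=1.0, gamma=1.0))  # 0.5 + 2.5 = 3.0
```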


4 |
An efficient projection for ℓ1,∞ regularization
- Quattoni, Carreras, et al.
- 2009
Citation Context ...hermore, the sparse FOBOS updates apply equally well to mixed-norm regularization, and while there are efficient algorithms for projection onto both ℓ1/ℓ2 and ℓ1/ℓ∞ balls (Schmidt et al., 2009; Quattoni et al., 2009), they are more complicated than the FOBOS steps. Lastly, though it may be possible to extend the efficient data structures of Duchi et al. (2008) to the ℓ1/ℓ2 case, there is no known algorithm for e... |
