## Composite Objective Mirror Descent

Citations: 27 (5 self-citations)

### BibTeX

```bibtex
@MISC{Duchi_compositeobjective,
  author = {John Duchi and Shai Shalev-Shwartz and Yoram Singer and Ambuj Tewari},
  title = {Composite Objective Mirror Descent},
  year = {}
}
```

### Abstract

We present a new method for regularized convex optimization and analyze it under both online and stochastic optimization settings. In addition to unifying previously known first-order algorithms, such as the projected gradient method, mirror descent, and forward-backward splitting, our method yields new analysis and algorithms. We also derive specific instantiations of our method for commonly used regularization functions, such as ℓ1, mixed norm, and trace norm.
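To make the composite update concrete, here is a minimal sketch (not code from the paper; the function name and signature are illustrative assumptions) of a single Comid-style step for the simplest instantiation, ψ(w) = ½‖w‖² and r(w) = λ‖w‖₁, where the update has a closed-form soft-thresholding solution:

```python
import numpy as np

def comid_step_l1(w, grad, eta, lam):
    """One composite-objective mirror-descent step with Euclidean
    psi(w) = 0.5*||w||^2 and regularizer r(w) = lam*||w||_1.

    The update  w_{t+1} = argmin_w  eta*<g_t, w> + B_psi(w, w_t) + eta*r(w)
    reduces here to a gradient step followed by soft-thresholding.
    """
    z = w - eta * grad  # plain (sub)gradient step on the loss part
    # soft-thresholding handles the l1 regularizer in closed form
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)
```

With other Bregman divergences ψ, the same scheme yields mirror-descent and forward-backward-splitting style updates.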

### Citations

4676 | Matrix Analysis - Horn, Johnson - 1986

Citation Context: "...te Mirror Descent We now consider a setting that generalizes the previous discussions, in which our variables Wt are matrices Wt ∈ Ω = ℝ^(d1×d2). We use Bregman functions based on Schatten p-norms (e.g. Horn and Johnson, 1985, Section 7.4). Schatten p-norms are the family of unitarily invariant matrix norms arising out of applying p-norms to the singular values of the matrix W. That is, letting σ(W) denote the vector of sing..."
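The definition in this excerpt translates directly into code; a small illustration (hypothetical helper name, assuming NumPy):

```python
import numpy as np

def schatten_norm(W, p):
    """Schatten p-norm of W: the vector p-norm applied to sigma(W),
    the vector of singular values. Unitarily invariant by construction."""
    sigma = np.linalg.svd(W, compute_uv=False)
    return np.sum(sigma ** p) ** (1.0 / p)
```

For p = 2 this is the Frobenius norm; for p = 1, the trace (nuclear) norm.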

3667 | Convex Optimization - Boyd, Vandenberghe - 2004

Citation Context: "...bra of Bψ (see Lemma 14 at the end and set c = w∗, a = wt+1, and b = wt). The second-to-last inequality follows from the Fenchel-Young inequality applied to the conjugate pair ½‖·‖², ½‖·‖∗² (Boyd and Vandenberghe, 2004, Example 3.27). The last inequality follows from the strong convexity of Bψ with respect to the norm ‖·‖. Lemma 1 allows us to easily give a proof that Comid achieves low regret. The following theore..."

3267 | Convex Analysis - Rockafellar - 1970

Citation Context: "...(1/η)Bψ(w∗, w1) + r(w1) + (Tη/(2α))G∗². If we take η ∝ 1/√T, then we have a regret which is O(√T) when the functions ft are Lipschitz. If Ω is compact, the ft are guaranteed to be Lipschitz continuous (Rockafellar, 1970). Corollary 4: Suppose that either Ω is compact or the functions ft are Lipschitz, so ‖f′t‖∗ ≤ G∗. Also assume r(w1) = 0. Then setting η = √(2α Bψ(w∗, w1))/(G∗√T), Rφ(T) ≤ √(2T Bψ(w∗, w1)) G∗/√α. It ..."
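The step size in Corollary 4 comes from balancing the two η-dependent terms of the bound (writing B for Bψ(w∗, w1) and assuming r(w1) = 0):

```latex
\min_{\eta>0}\ \frac{B}{\eta}+\frac{T\eta}{2\alpha}G_*^2
\;\Longrightarrow\;
\eta^{\star}=\frac{\sqrt{2\alpha B}}{G_*\sqrt{T}},
\qquad
R_{\phi}(T)\le \frac{B}{\eta^{\star}}+\frac{T\eta^{\star}}{2\alpha}G_*^2
=\sqrt{2TB}\,\frac{G_*}{\sqrt{\alpha}}.
```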

669 | The Weighted Majority Algorithm - Littlestone, Warmuth - 1994

Citation Context: "...φt(wt) − inf_{w∈Ω} ∑_{t=1}^{T} φt(w), where {wt} is the sequence generated by mirror descent and the φt are convex functions. In fact, one can view popular online learning algorithms, such as weighted majority (Littlestone and Warmuth, 1994) and online gradient descent (Zinkevich, 2003), as special cases of mirror descent. A guarantee on the online regret can be translated directly to a guarantee on the convergence rate of the algorithm to the optimum of..."

419 | An iterative thresholding algorithm for linear inverse problems with a sparsity constraint - Daubechies, Defrise, et al. - 2004 |

235 | Weighted sums of certain dependent random variables - Azuma - 1967 |

215 | Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization - Recht, Fazel, et al. - 2010

Citation Context: "...ation of functions on matrices, which include efficient and simple algorithms for trace-norm minimization. Trace-norm minimization has recently found strong applicability in matrix rank minimization (Recht et al., 2007), which has been shown to be very useful, for example, in collaborative filtering (Srebro et al., 2004). A special case of Comid has recently been developed for this task, which is very similar in sp..."
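The shrinkage methods mentioned here act on singular values; a minimal sketch of that singular-value soft-thresholding step, the prox of τ‖·‖_tr used in fixed-point/SVT-style schemes (the function name is an illustrative assumption):

```python
import numpy as np

def svd_soft_threshold(W, tau):
    """Soft-threshold the singular values of W, leaving the singular
    vectors unchanged -- the shrinkage step of fixed-point / SVT-style
    trace-norm minimization methods."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```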

191 | Gradient methods for minimizing composite objective function. Core discussion paper 2007/96 - Nesterov - 2007 |

170 | Problem complexity and method efficiency in optimization. Nauka - Nemirovskii, Yudin - 1983

Citation Context: "...ng a single example (or subset of the examples) at each iteration or accessing the entire training set at each iteration. The method we describe is an adaptation of the Mirror Descent (MD) algorithm (Nemirovski and Yudin, 1983; Beck and Teboulle, 2003), an iterative method for minimizing a convex function φ : Ω → ℝ. If the dimension d is large enough, MD is optimal among first-order methods, and it has a close connection t..."

168 | Sparse reconstruction by separable approximation - Wright, Nowak, et al. |

146 | Maximum Margin Matrix Factorization - Srebro, Rennie, et al. - 2005

Citation Context: "...ion. Trace-norm minimization has recently found strong applicability in matrix rank minimization (Recht et al., 2007), which has been shown to be very useful, for example, in collaborative filtering (Srebro et al., 2004). A special case of Comid has recently been developed for this task, which is very similar in spirit to fixed-point and shrinkage methods from signal processing for ℓ1-minimization (Ma et al., 2009). ..."

134 | On the Generalization Ability of On-line Learning Algorithms - Cesa-Bianchi, Conconi, Gentile

Citation Context: "...ions between the regret of the algorithm and generalization performance using martingale concentration results (Littlestone, 1989). We build on known techniques for data-driven generalization bounds (Cesa-Bianchi et al., 2004) to give concentration results for Comid in the stochastic optimization setting. Further work on this subject for the strongly convex case can be found in Kakade and Tewari (2008), though we focus on..."

133 | The Tradeoffs of Large Scale Learning - Bottou, Bousquet - 2007

Citation Context: "...e perspective of achieving good statistical performance on unseen data, first-order methods are preferable to higher-order approaches, especially when the number of training examples n is very large (Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008). Furthermore, in large-scale problems it is often prohibitively expensive to compute the gradient of the entire objective function (thus accessing all the examples i..."

122 | Logarithmic regret algorithms for online convex optimization - HAZAN, AGARWAL, et al. - 2007 |

109 | Splitting algorithms for the sum of two nonlinear operators - Lions, Mercier - 1979 |

91 | Fixed point and Bregman iterative methods for matrix rank minimization - Ma, Goldfarb, et al.

85 | From On-line to Batch Learning - Littlestone

Citation Context: "...roblems. The techniques we use have a long history in online algorithms and make connections between the regret of the algorithm and generalization performance using martingale concentration results (Littlestone, 1989). We build on known techniques for data-driven generalization bounds (Cesa-Bianchi et al., 2004) to give concentration results for Comid in the stochastic optimization setting. Further work on this s..."

82 | Mirror descent and nonlinear projected subgradient methods for convex optimization - Beck, Teboulle |

73 | Primal-Dual subgradient methods for convex problems - Nesterov - 2009 |

66 | Efficient Projections onto the ℓ1-Ball for Learning in High Dimensions - Duchi, Shalev-Shwartz, et al. - 2008

Citation Context: "...ν ≥ 0 so that ∑_{i=1}^{d} θ̂i(ν) = ∑_{i=1}^{d} (v_i^(q−1) − min{ν, v_i^(q−1)})^(1/(q−1)) = λ. (16) Interestingly, this reduces to exactly the same root-finding problem as that for solving Euclidean projection onto an ℓ1-ball (Duchi et al., 2008). As shown by Duchi et al., it is straightforward to find the optimal ν in time linear in the dimension d. An open problem is to find an efficient algorithm for solving the generalized projections ab..."
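For the Euclidean case this context refers to, the ℓ1-ball projection can be sketched as follows (an O(d log d) sorting-based variant; Duchi et al. (2008) also give an expected linear-time version; the function name is an assumption):

```python
import numpy as np

def project_l1_ball(v, radius=1.0):
    """Euclidean projection of v onto {w : ||w||_1 <= radius}."""
    if np.abs(v).sum() <= radius:
        return v.copy()                      # already feasible
    u = np.sort(np.abs(v))[::-1]             # sorted magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, len(u) + 1)
    rho = np.nonzero(u * idx > (css - radius))[0][-1]  # last active index
    theta = (css[rho] - radius) / (rho + 1.0)          # the root of Eq.-(16)-type condition
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```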

62 | The robustness of the p-norm algorithms - Gentile - 2003 |

59 | Sparse online learning via truncated gradient - Langford, Li, et al. |

59 | Dual averaging methods for regularized stochastic learning and online optimization - Xiao |

52 | SVM Optimization: Inverse Dependence on Training Set Size - Shalev-Shwartz, Srebro - 2008

Citation Context: "...good statistical performance on unseen data, first-order methods are preferable to higher-order approaches, especially when the number of training examples n is very large (Bottou and Bousquet, 2008; Shalev-Shwartz and Srebro, 2008). Furthermore, in large-scale problems it is often prohibitively expensive to compute the gradient of the entire objective function (thus accessing all the examples in the training set), and randomly..."

46 | Sharp Uniform Convexity and Smoothness Inequalities for Trace Norms - Ball, Carlen, et al.

Citation Context: "...te of Eq. (3). 7.2 p-norm divergences: Now we consider divergence functions ψ which are the ℓp-norms squared. ½‖w‖p² is (p−1)-strongly convex over ℝ^d with respect to the ℓp-norm for any p ∈ (1, 2] (Ball et al., 1994). We see that if we choose ψ(w) = ½‖w‖p² to be the divergence function, we have a corollary to Theorem 2. Corollary 9: Suppose that r(0) = 0 and that w1 = 0. Let p = 1 + 1/log d and use the Bregman..."
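The choice ψ(w) = ½‖w‖p² makes the mirror map explicit; a short sketch of its gradient, the p-norm "link" function (function name assumed):

```python
import numpy as np

def pnorm_link(w, p):
    """Gradient of psi(w) = 0.5 * ||w||_p^2:
    (grad psi(w))_i = sign(w_i) * |w_i|^(p-1) / ||w||_p^(p-2).
    For p = 2 this reduces to the identity map."""
    norm = np.sum(np.abs(w) ** p) ** (1.0 / p)
    if norm == 0.0:
        return np.zeros_like(w)
    return np.sign(w) * np.abs(w) ** (p - 1) / norm ** (p - 2)
```

With the p = 1 + 1/log d of Corollary 9, the strong-convexity constant is α = p − 1 = 1/log d.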

29 | Matrix analysis. Cambridge university press - Horn, Johnson - 1990 |

28 | The Convex Analysis of Unitarily Invariant Matrix Norms - Lewis - 1995

Citation Context: "...W̃t) = ∇ψ(Wt) − ηf′t(Wt) and Wt+1 = argmin_W {Bψ(W, W̃t) + ηr(W)}. Since Wt has singular value decomposition Ut diag(σ(Wt)) Vt⊤ and ψ(W) = Ψ(σ(W)) is unitarily invariant, ∇ψ(Wt) = Ut diag(∇Ψ(σ(Wt))) Vt⊤ (Lewis, 1995, Corollary 2.5). This means that the Θt computed in step 2 above is simply ∇ψ(W̃t). The proof essentially amounts to a reduction to the vector case, since the norms are unitarily invariant, and wil..."
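Lewis's corollary is what makes the matrix update computable via the vector case; a minimal sketch (helper name assumed):

```python
import numpy as np

def spectral_gradient(W, vec_grad):
    """For unitarily invariant psi(W) = Psi(sigma(W)) with
    W = U diag(sigma) V^T, Lewis (1995, Cor. 2.5) gives
    grad psi(W) = U diag(vec_grad(sigma)) V^T."""
    U, sigma, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(vec_grad(sigma)) @ Vt
```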

22 | On the generalization ability of online strongly convex programming - Kakade, Tewari - 2008 |

17 | Stochastic Methods for ℓ1-Regularized Loss Minimization - Shalev-Shwartz, Tewari

Citation Context: "...= ‖w‖1 include iterative shrinkage and thresholding from the signal-processing literature (Daubechies et al., 2004), and from machine learning, Truncated Gradient (Langford et al., 2009) and SMIDAS (Shalev-Shwartz and Tewari, 2009) are both special cases of Comid. In the optimization community there has been significant recent interest, both applied and theoretical, in minimization of composite objective functions such..."

13 | Approximation Accuracy, Gradient Methods, and Error Bound for Structured Convex Optimization - Tseng

12 | Online and Batch Learning using Forward-Backward Splitting - Duchi, Singer - 2009 |

12 | Joint Covariate Selection for Grouped Classification - Obozinski, Taskar, et al. - 2007

Citation Context: "...1, ∇ψ(w) + v − θ = 0 at optimum, or w = (∇ψ)⁻¹(θ − v). When p2 = 1, we easily recover Eq. (14) as our update. However, the case p2 = ∞ is more interesting, as it can be a building block for group sparsity (Obozinski et al., 2007). In this case our problem is min_θ ½‖v − θ‖q² s.t. ‖θ‖1 ≤ λ. It is clear by symmetry in the above that we can assume v ≽ 0 with no loss of generality. We can raise the ℓq-norm to a power greater..."