## Efficient Online Learning via Randomized Rounding

Citations: 3 (0 self)

### BibTeX

@MISC{Cesa-bianchi_efficientonline,
    author = {Nicolò Cesa-Bianchi and Ohad Shamir},
    title = {Efficient Online Learning via Randomized Rounding},
    year = {}
}

### Abstract

Most online algorithms used in machine learning today are based on variants of mirror descent or follow-the-leader. In this paper, we present an online algorithm based on a completely different approach, which combines “random playout” and randomized rounding of loss subgradients. As an application of our approach, we provide the first computationally efficient online algorithm for collaborative filtering with trace-norm constrained matrices. As a second application, we solve an open question linking batch learning and transductive online learning.

### Citations

321 | How to use expert advice
- Cesa-Bianchi, Freund, et al.
- 1997
Citation Context ...(see for instance [14]). In this work we revisit, and significantly extend, an algorithm which uses a completely different approach. This algorithm, known as the Minimax Forecaster, was introduced in [9, 11] for the setting of prediction with static experts. It computes minimax predictions in the case of known horizon, binary outcomes, and absolute loss. Although the original version is computationally e...

273 | Rademacher and Gaussian complexities: Risk bounds and structural results
- Bartlett, Mendelson
- 2002
Citation Context ...kably, the value of $\inf_A V_T^{\text{abs}}(A, F)$ is exactly the Rademacher complexity $R_T(F)$ of the class $F$, which is known to play a crucial role in understanding the sample complexity in statistical learning [5]. In this paper, we define it as $R_T(F) = \mathbb{E}\left[\sup_{f \in F} \sum_{t=1}^{T} \sigma_t f_t\right]$, where $\sigma_1, \dots, \sigma_T$ are i.i.d. Rademacher random variables, taking values −1, +1 with equal probability. When $R_T(F) = o(T)$, we get ...

159 | Maximum-margin matrix factorization
- Srebro, Rennie, et al.
Citation Context ...of its observed entries. A common approach is norm regularization, where we seek a low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where th...

149 | Probabilistic matrix factorization
- Salakhutdinov, Mnih
- 2008
Citation Context ...of its observed entries. A common approach is norm regularization, where we seek a low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where th...

130 | Collaborative filtering with temporal dynamics
- Koren
- 2009
Citation Context ..., [23, 21]). However, even when the guarantees are distribution-free, assuming a fixed distribution fails to capture important aspects of collaborative filtering in practice, such as non-stationarity [17]. Thus, an online adversarial setting, where no distributional assumptions whatsoever are required, seems to be particularly well-suited to this problem domain. In an online setting, at each round t t...

43 | Consistency of trace norm minimization
- Bach
Citation Context ...of its observed entries. A common approach is norm regularization, where we seek a low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where th...

33 | Rank, trace-norm and max-norm
- Srebro, Shraibman
- 2005
Citation Context ... the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e.g., [23, 21]). However, even when the guarantees are distribution-free, assuming a fixed distribution fails to capture important aspects of collaborative filtering in practice, such as non-stationarity [17]. Thus...

27 | A stochastic view of optimal regret through minimax duality
- Abernethy, Agarwal, et al.
- 2009
Citation Context ...mn/T). This is a trivial bound, since it becomes “meaningful” (smaller than a constant) only after all T = mn entries have been predicted. On the other hand, based on general techniques developed in [15] and greatly extended in [1], it can be shown that online learnability is information-theoretically possible for such W. However, these techniques do not provide a computationally efficient algorithm....

23 | Practical large-scale optimization for max-norm regularization
- Lee, Recht, et al.
- 2010
Citation Context ... low-norm matrix which matches the observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e....

22 | Optimal strategies and minimax lower bounds for online convex games
- Abernethy, Bartlett, et al.
Citation Context ...s). The regret of the R² Forecaster is determined by the Rademacher complexity of the comparison class. The connection between online learnability and Rademacher complexity has also been explored in [2, 1]. However, these works focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms. The idea of “random playout”, in the context of online learning, ha...

20 | Separating distribution-free and mistake-bound learning models over the Boolean domain
- Blum
- 1994
Citation Context ...pirical risk minimization) implies efficient learning in the transductive online setting. This is an important result, as online learning can be computationally harder than batch learning (see, e.g., [8] for an example in the context of Boolean learning). A major open question posed by [16] was whether one can achieve the optimal rate $O(\sqrt{dT})$, matching the rate of a batch learning algorithm in the s...

20 | Collaborative filtering in a non-uniform world: Learning with the weighted trace norm
- Salakhutdinov, Srebro
Citation Context ...e observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e.g., [23, 21]). However, even when the...

19 | The convex optimization approach to regret minimization
- Hazan
- 2012
Citation Context ...ly fair to say that at their core, most of these algorithms are based on the same small set of fundamental techniques, in particular mirror descent and regularized follow-the-leader (see for instance [14]). In this work we revisit, and significantly extend, an algorithm which uses a completely different approach. This algorithm, known as the Minimax Forecaster, was introduced in [9, 11] for the settin...

14 | Agnostic online learning
- Ben-David, Pál, et al.
- 2009
Citation Context ...cludes the proof of the first part of the theorem. The second part is an immediate corollary of Thm. 3. We close this section by contrasting our results for online transductive learning with those of [7] about standard online learning. If F contains {0, 1}-valued functions, then the optimal regret bound for online learning is of order $\sqrt{d'T}$, where $d'$ is the Littlestone dimension of F. Since the L...

13 | Approximate methods for sequential decision making using expert advice
- Chung
- 1994
Citation Context ...(see for instance [14]). In this work we revisit, and significantly extend, an algorithm which uses a completely different approach. This algorithm, known as the Minimax Forecaster, was introduced in [9, 11] for the setting of prediction with static experts. It computes minimax predictions in the case of known horizon, binary outcomes, and absolute loss. Although the original version is computationally e...

10 | Online learning: Random averages, combinatorial parameters, and learnability
- Rakhlin, Sridharan, et al.
- 2010
Citation Context ...s). The regret of the R² Forecaster is determined by the Rademacher complexity of the comparison class. The connection between online learnability and Rademacher complexity has also been explored in [2, 1]. However, these works focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms. The idea of “random playout”, in the context of online learning, ha...

9 | Learning with the weighted trace-norm under arbitrary sampling distributions
- Foygel, Salakhutdinov, et al.
- 2011
Citation Context ...e observed entries as best as possible. The norm is often taken to be the trace-norm [22, 19, 4], although other norms have also been considered, such as the max-norm [18] and the weighted trace-norm [20, 13]. Previous theoretical treatments of this problem assumed a stochastic setting, where the observed entries are picked according to some underlying distribution (e.g., [23, 21]). However, even when the...

6 | Repeated games against budgeted adversaries
- Abernethy, Warmuth
- 2010
Citation Context ...ks focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms. The idea of “random playout”, in the context of online learning, has also been used in [16, 3], but we apply this idea in a different way. We show that the R² Forecaster can be used to design the first efficient online learning algorithm for collaborative filtering with trace-norm constrained...

6 | From batch to transductive online learning
- Kakade, Kalai
- 2005
Citation Context ...ks focus on the information-theoretically achievable regret, as opposed to computationally efficient algorithms. The idea of “random playout”, in the context of online learning, has also been used in [16, 3], but we apply this idea in a different way. We show that the R² Forecaster can be used to design the first efficient online learning algorithm for collaborative filtering with trace-norm constrained...

5 | Online learning versus offline learning
- Ben-David, Kushilevitz, et al.
- 1997
Citation Context ...he permutation chosen by the adversary. 4 Application 1: Transductive Online Learning The first application we consider is a rather straightforward one, in the context of transductive online learning [6]. In this model, we have an arbitrary sequence of labeled examples $(x_1, y_1), \dots, (x_T, y_T)$, where only the set $\{x_1, \dots, x_T\}$ of unlabeled instances is known to the learner in advance. At each...

5 | Collaborative filtering with the trace norm
- Shamir, Shalev-Shwartz
- 2011
Citation Context ...uch as mirror descent, appear to give only trivial performance guarantees. Moreover, our regret bound matches the best currently known sample complexity bound in the batch distribution-free setting [21]. As a different application, we consider the relationship between batch learning and transductive online learning. This relationship was analyzed in [16], in the context of binary prediction with res...

3 | Empirical processes. In Ecole de Probabilité de St. Flour
- Dudley
- 1982
Citation Context ... predictions and apply Thm. 2 to bound the expected transductive online regret with $R_T(F)$. For a class with VC dimension d, $R_T(F) \le O(\sqrt{dT})$ for some constant c > 0, using Dudley’s chaining method [12], and this concludes the proof of the first part of the theorem. The second part is an immediate corollary of Thm. 3. We close this section by contrasting our results for online transductive learning ...

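The citation context for Bartlett and Mendelson above refers to the unnormalized Rademacher complexity of a class F. As a rough illustration only, the sketch below computes that expectation exactly for a small finite class by enumerating all sign vectors. It assumes each function is represented as a tuple of its T values; the function name `rademacher_complexity` and this representation are hypothetical choices made for the example and do not come from the paper.

```python
from itertools import product


def rademacher_complexity(function_class):
    """Exact unnormalized Rademacher complexity of a finite class.

    Each element of `function_class` is a sequence (f_1, ..., f_T) of
    real values. The result is the average, over all 2^T sign vectors
    sigma in {-1, +1}^T, of max_f sum_t sigma_t * f_t.
    Illustrative sketch: enumeration is only feasible for small T.
    """
    horizon = len(function_class[0])
    total = 0.0
    for sigma in product((-1, 1), repeat=horizon):
        # Supremum over the (finite) class for this fixed sign pattern.
        total += max(
            sum(s * value for s, value in zip(sigma, f))
            for f in function_class
        )
    return total / 2 ** horizon


# A single function cannot correlate with random signs on average,
# while the pair {f, -f} always picks up the absolute value of the sum.
single = rademacher_complexity([(1, 1, 1, 1)])
symmetric_pair = rademacher_complexity([(1, 1, 1, 1), (-1, -1, -1, -1)])
```

Enumeration costs 2^T evaluations, so for longer horizons one would presumably replace the loop with random sampling of sign vectors.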