
## SGD Algorithms based on Incomplete U-statistics: Large-Scale Minimization of Empirical Risk

### Citations

408 | Stochastic approximation and recursive algorithms and applications
- Kushner, Yin
- 2003
Citation Context: ...and following [1], observe that the sequence $(a_t)$ satisfies the recursion $a_{t+1} \le a_t \left(1 - 2\alpha\gamma_t(1 - \gamma_t L)\right) + 2\gamma_t^2 \sigma_n^2(\theta_n^*)$. A standard stochastic approximation argument yields an upper bound for $a_t$ (cf. [17, 1]), which, combined with $\hat{L}_n(\theta) - \hat{L}_n(\theta_n^*) \le \frac{L}{2}\|\theta - \theta_n^*\|^2$ (see [23] for instance), gives the desired result. Sketch of Proof of Theorem 1: The proof relies on stochastic approximation arguments (see [10, ...
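The recursion in this context can be unrolled in the standard way; the following bound for a constant step size $\gamma_t \equiv \gamma < 1/L$ is a textbook computation sketched here for convenience, not a claim taken from the cited works:

```latex
% With \rho := 1 - 2\alpha\gamma(1 - \gamma L) \in (0, 1), iterating
% a_{t+1} \le \rho\, a_t + 2\gamma^2 \sigma_n^2(\theta_n^*) gives
a_t \;\le\; \rho^{t} a_0 \;+\; 2\gamma^2 \sigma_n^2(\theta_n^*) \sum_{s=0}^{t-1} \rho^{s}
    \;\le\; \rho^{t} a_0 \;+\; \frac{\gamma\, \sigma_n^2(\theta_n^*)}{\alpha\,(1 - \gamma L)},
% using \sum_{s \ge 0} \rho^{s} = 1 / (2\alpha\gamma(1 - \gamma L)).
```

The second term makes explicit how the conditional variance $\sigma_n^2(\theta_n^*)$ of the gradient estimator enters the bound.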

269 | The tradeoffs of large scale learning
- Bottou, Bousquet
- 2008
Citation Context: ...sed on a subsample). While [6] has investigated maximal deviations between U-processes and their incomplete approximations, the performance analysis carried out in the present paper is inspired by [4] and involves both the optimization error of the SGD algorithm and the estimation error induced by the statistical finite-sample setting. We first provide non-asymptotic rate bounds and asymptotic con...

267 | Robust stochastic approximation approach to stochastic programming
- Nemirovski, Juditsky, et al.
- 2009
Citation Context: ...minimizer. We point out that the present analysis can be extended to the smooth but non-strongly convex case, see [1]. A classical argument based on convex analysis and stochastic optimization (see [1, 22] for instance) shows precisely how the conditional variance of the gradient estimator impacts the empirical performance of the solution produced by the corresponding SGD method and thus strongly advoc...

124 | U-Statistics: Theory and Practice
- Lee
- 1990
Citation Context: ...ariance among all unbiased estimates) is obtained by averaging over all tuples of observations and thus takes the form of a U-statistic (an average of dependent variables generalizing sample means, see [19]). The Empirical Risk Minimization (ERM) principle, one of the main paradigms of statistical learning theory, has been extended to the case where the empirical risk of a prediction rule is a U-statis...

73 | A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. arXiv preprint arXiv:1202.6258
- Roux, Schmidt, et al.
- 2012
Citation Context: ...ment of a wide variety of SGD variants implementing a variance reduction method in order to improve convergence. Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9] among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to...
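The variance-reduction scheme alluded to in this context (occasionally computing the exact gradient, as in SVRG [15]) can be sketched as follows; the function names and the one-dimensional toy objective are illustrative assumptions, not taken from the cited paper:

```python
import numpy as np

def svrg(grad_i, n, theta0, step=0.1, epochs=5, m=100, rng=None):
    """Minimal SVRG sketch: a full gradient is recomputed at each
    snapshot and used as a control variate for stochastic updates."""
    rng = rng or np.random.default_rng(0)
    theta = theta0.copy()
    for _ in range(epochs):
        snapshot = theta.copy()
        # exact gradient, computed only once per epoch
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            # unbiased, variance-reduced gradient estimate
            g = grad_i(theta, i) - grad_i(snapshot, i) + full_grad
            theta -= step * g
    return theta

# toy strongly convex problem: f(theta) = mean_i 0.5 * (theta - x_i)^2
x = np.array([1.0, 2.0, 3.0, 4.0])
grad_i = lambda th, i: th - x[i]
theta_hat = svrg(grad_i, len(x), np.zeros(1))  # converges to mean(x)
```

As the context notes, this strategy is hard to transfer to U-statistic risks: a single exact gradient already requires summing over all tuples.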

70 | Accelerating stochastic gradient descent using predictive variance reduction.
- Johnson, Zhang
- 2013
Citation Context: ...in the case where empirical risk functionals are computed by summing over independent observations (sample mean statistics), is its slow convergence due to the variance of the gradient estimates, see [15]. This has recently motivated the development of a wide variety of SGD variants implementing a variance reduction method in order to improve convergence. Variance reduction is achieved by occasional...

48 | Optimising area under the ROC curve using gradient descent, in ICML
- Herschtal, Raskutti
- 2004
Citation Context: ...the VUS criterion (Eq. 2) when $K = 2$. Given a sequence of i.i.d. observations $Z_i = (X_i, Y_i)$ where $X_i \in \mathbb{R}^p$ and $Y_i \in \{-1, 1\}$, we denote by $X^+ = \{X_i : Y_i = 1\}$, $X^- = \{X_i : Y_i = -1\}$ and $N = |X^+||X^-|$. As done in [27, 13], we take a linear scoring rule $s_\theta(x) = \theta^T x$ where $\theta \in \mathbb{R}^p$ is the parameter to learn, and use the logistic loss as a smooth convex function upper bounding the Heaviside function, leading to the followi...
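The pairwise surrogate described in this context can be written down directly. The sketch below assumes the empirical risk averages the logistic loss $\log(1 + e^{-(s_\theta(x^+) - s_\theta(x^-))})$ over all $N = |X^+||X^-|$ positive/negative pairs; variable names are hypothetical:

```python
import numpy as np

def pairwise_logistic_risk(theta, X_pos, X_neg):
    """Smooth convex surrogate of 1 - AUC for a linear scorer
    s_theta(x) = theta^T x: logistic loss averaged over all
    positive/negative pairs (a U-statistic of degree (1, 1))."""
    diffs = X_pos[:, None, :] - X_neg[None, :, :]  # all pairwise differences
    margins = diffs @ theta                        # s(x+) - s(x-) per pair
    return np.mean(np.log1p(np.exp(-margins)))

X_pos = np.array([[2.0, 0.0], [3.0, 1.0]])
X_neg = np.array([[0.0, 0.0], [1.0, -1.0]])
theta = np.array([1.0, 0.0])
risk = pairwise_logistic_risk(theta, X_pos, X_neg)
```

A risk below $\log 2$ indicates the scorer ranks positives above negatives better than chance on average.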

47 | Non-asymptotic analysis of stochastic approximation algorithms for machine learning.
- Bach, Moulines
- 2011
Citation Context: ...$\hat{L}_n(\theta_1) - \hat{L}_n(\theta_2) \le \nabla_\theta \hat{L}_n(\theta_1)^T (\theta_1 - \theta_2) - \frac{\alpha}{2}\|\theta_1 - \theta_2\|^2$ (7) and we denote by $\theta_n^*$ its unique minimizer. We point out that the present analysis can be extended to the smooth but non-strongly convex case, see [1]. A classical argument based on convex analysis and stochastic optimization (see [1, 22] for instance) shows precisely how the conditional variance of the gradient estimator impacts the empirical pe...

36 | Hamming distance metric learning.
- Norouzi, Fleet, et al.
- 2012
Citation Context: ...ochastic optimization techniques such as Stochastic Gradient Descent (SGD), where at each iteration only a small number of randomly selected terms are used to compute an estimate of the gradient (see [27, 24, 16, 26] for instance). A drawback of the original SGD learning method, introduced in the case where empirical risk functionals are computed by summing over independent observations (sample mean statistics),...

34 | A survey on metric learning for feature vectors and structured data,” arXiv preprint arXiv:1306.6709
- Bellet, Habrard, et al.
- 2013
Citation Context: ...etric Learning. We now turn to a metric learning formulation, where we are given a sample of N i.i.d. observations $Z_i = (X_i, Y_i)$ where $X_i \in \mathbb{R}^p$ and $Y_i \in \{1, \ldots, c\}$. Following the existing literature [2], we focus on (pseudo-)distances of the form $D_M(x, x') = (x - x')^T M (x - x')$ where $M$ is a $p \times p$ symmetric positive semi-definite matrix. We again use the logistic loss to obtain a convex and smooth surr...
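The pseudo-distance in this context is straightforward to compute; the following sketch uses a PSD matrix built as $M = A^T A$ and hypothetical variable names chosen for illustration:

```python
import numpy as np

def mahalanobis_sq(x, xp, M):
    """Pseudo-distance D_M(x, x') = (x - x')^T M (x - x'),
    with M symmetric positive semi-definite (so the value is >= 0)."""
    d = x - xp
    return d @ M @ d

# a toy PSD matrix M = A^T A and two points
A = np.array([[1.0, 0.5], [0.0, 1.0]])
M = A.T @ A
x, xp = np.array([1.0, 2.0]), np.array([0.0, 1.0])
dist = mahalanobis_sq(x, xp, M)
```

Writing $M = A^T A$ guarantees positive semi-definiteness, since $d^T M d = \|A d\|^2 \ge 0$ for any $d$.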

30 | SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. arXiv preprint arXiv:1407.0202
- Defazio, Bach, et al.
Citation Context: ...ts implementing a variance reduction method in order to improve convergence. Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9] among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to the overwhelming number of possibl...

25 | Introductory lectures on convex optimization, volume 87
- Nesterov
- 2004
Citation Context: ...recursion $a_{t+1} \le a_t \left(1 - 2\alpha\gamma_t(1 - \gamma_t L)\right) + 2\gamma_t^2 \sigma_n^2(\theta_n^*)$. A standard stochastic approximation argument yields an upper bound for $a_t$ (cf. [17, 1]), which, combined with $\hat{L}_n(\theta) - \hat{L}_n(\theta_n^*) \le \frac{L}{2}\|\theta - \theta_n^*\|^2$ (see [23] for instance), gives the desired result. Sketch of Proof of Theorem 1: The proof relies on stochastic approximation arguments (see [10, 25, 11]). We first show that $\sqrt{1/\gamma_t}\,(\theta_t - \theta_n^*) \Rightarrow \mathcal{N}(0, \Sigma_n^*)$. Then, w...

23 | Incremental majorization-minimization optimization with application to large-scale machine learning.
- Mairal
- 2015
Citation Context: ...of SGD variants implementing a variance reduction method in order to improve convergence. Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9] among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to the overwhelming numb...

23 | Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing
- Pelletier
- 1998
Citation Context: ...[17, 1]), which, combined with $\hat{L}_n(\theta) - \hat{L}_n(\theta_n^*) \le \frac{L}{2}\|\theta - \theta_n^*\|^2$ (see [23] for instance), gives the desired result. Sketch of Proof of Theorem 1: The proof relies on stochastic approximation arguments (see [10, 25, 11]). We first show that $\sqrt{1/\gamma_t}\,(\theta_t - \theta_n^*) \Rightarrow \mathcal{N}(0, \Sigma_n^*)$. Then, we apply the second-order delta method to derive the asymptotic behavior of the objective function. Eq. (9) is obtained by standard algebra. Sk...

21 | Ranking and empirical risk minimization of U-statistics
- Clemencon, Lugosi, et al.
- 2008
Citation Context: ...he Empirical Risk Minimization (ERM) principle, one of the main paradigms of statistical learning theory, has been extended to the case where the empirical risk of a prediction rule is a U-statistic [5], using concentration properties of U-processes (i.e. collections of U-statistics). The computation of the empirical risk is however numerically unfeasible in large and even moderate-scale situation...

15 | Binary decomposition methods for multipartite ranking
- Fürnkranz, Hüllermeier, et al.
Citation Context: ...popular examples include bipartite ranking (see [27] for instance), where the goal is to maximize the number of concordant pairs (i.e. AUC maximization), and more generally multipartite ranking (cf. [12]), as well as pairwise clustering (see [7]). Given a data sample, the most natural empirical risk estimate (which is known to have minimal variance among all unbiased estimates) is obtained by averagi...

12 | Stochastic approximation with decreasing gain: convergence and asymptotic theory. Unpublished lecture notes, http://perso.univ-rennes1.fr/bernard.delyon/as cours.ps
- Delyon
- 2000
Citation Context: ...[17, 1]), which, combined with $\hat{L}_n(\theta) - \hat{L}_n(\theta_n^*) \le \frac{L}{2}\|\theta - \theta_n^*\|^2$ (see [23] for instance), gives the desired result. Sketch of Proof of Theorem 1: The proof relies on stochastic approximation arguments (see [10, 25, 11]). We first show that $\sqrt{1/\gamma_t}\,(\theta_t - \theta_n^*) \Rightarrow \mathcal{N}(0, \Sigma_n^*)$. Then, we apply the second-order delta method to derive the asymptotic behavior of the objective function. Eq. (9) is obtained by standard algebra. Sk...

8 | On the generalization ability of online learning algorithms for pairwise loss functions
- Kar, Sriperumbudur, et al.
- 2013
Citation Context: ...ochastic optimization techniques such as Stochastic Gradient Descent (SGD), where at each iteration only a small number of randomly selected terms are used to compute an estimate of the gradient (see [27, 24, 16, 26] for instance). A drawback of the original SGD learning method, introduced in the case where empirical risk functionals are computed by summing over independent observations (sample mean statistics),...

7 | Central Limit Theorems for Stochastic Approximation with controlled Markov chain dynamics. ArXiv e-prints
- Fort
- 2013
Citation Context: ...[17, 1]), which, combined with $\hat{L}_n(\theta) - \hat{L}_n(\theta_n^*) \le \frac{L}{2}\|\theta - \theta_n^*\|^2$ (see [23] for instance), gives the desired result. Sketch of Proof of Theorem 1: The proof relies on stochastic approximation arguments (see [10, 25, 11]). We first show that $\sqrt{1/\gamma_t}\,(\theta_t - \theta_n^*) \Rightarrow \mathcal{N}(0, \Sigma_n^*)$. Then, we apply the second-order delta method to derive the asymptotic behavior of the objective function. Eq. (9) is obtained by standard algebra. Sk...

7 | Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm.
- Needell, Ward, et al.
- 2014
Citation Context: ...nvergence. Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9] among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to the overwhelming number of possible tuples, computing even a single exact gradient or maintaining a proba...

3 | On U-processes and clustering performance
- Clémençon
- 2011
Citation Context: ...(see [27] for instance), where the goal is to maximize the number of concordant pairs (i.e. AUC maximization), and more generally multipartite ranking (cf. [12]), as well as pairwise clustering (see [7]). Given a data sample, the most natural empirical risk estimate (which is known to have minimal variance among all unbiased estimates) is obtained by averaging over all tuples of observations and thu...

3 | The asymptotic distributions of incomplete U-statistics
- Janson
- 1984
Citation Context: ...n in [19]. Remark 1. The results of this paper can be extended to other sampling schemes to approximate (4), such as Bernoulli sampling or sampling without replacement in Λ, following the proposal of [14]. For clarity, we focus on sampling with replacement, which is computationally more efficient. 3.2 A Conditional Performance Analysis. As a first go, we investigate and compare the performance of the S...
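The sampling-with-replacement scheme discussed in this context (drawing tuples uniformly, with replacement across draws, from the set Λ of all tuples) can be sketched for a degree-2 U-statistic as follows; the variance kernel and function name are illustrative assumptions:

```python
import numpy as np

def incomplete_u_stat(h, X, B, rng=None):
    """Approximate the complete degree-2 U-statistic
    (2 / (n (n - 1))) * sum_{i < j} h(X_i, X_j) by averaging h over
    B index pairs drawn uniformly from Λ, with replacement across
    draws (an incomplete U-statistic)."""
    rng = rng or np.random.default_rng(0)
    n = len(X)
    total = 0.0
    for _ in range(B):
        # one pair of distinct indices, uniform over all pairs
        i, j = rng.choice(n, size=2, replace=False)
        total += h(X[i], X[j])
    return total / B

# kernel of the variance U-statistic: h(x, y) = (x - y)^2 / 2, whose
# complete average equals the unbiased sample variance
h = lambda x, y: 0.5 * (x - y) ** 2
X = np.arange(10, dtype=float)
approx = incomplete_u_stat(h, X, B=5000)  # close to var(X, ddof=1)
```

Unlike the complete statistic, the cost here is O(B) rather than O(n²), which is the computational point made in the surrounding discussion.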

3 | Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). arXiv preprint arXiv:1304.1192v1
- Qian, Jin, et al.
- 2013
Citation Context: ...ochastic optimization techniques such as Stochastic Gradient Descent (SGD), where at each iteration only a small number of randomly selected terms are used to compute an estimate of the gradient (see [27, 24, 16, 26] for instance). A drawback of the original SGD learning method, introduced in the case where empirical risk functionals are computed by summing over independent observations (sample mean statistics),...

3 | Stochastic optimization with importance sampling for regularized loss minimization
- Zhao, Zhang
- 2015
Citation Context: ...nvergence. Variance reduction is achieved by occasionally computing the exact gradient (see SAG [18], SVRG [15], MISO [20] and SAGA [9] among others) or by means of nonuniform sampling schemes (see [21, 28] for instance). However, such ideas can hardly be applied to the case under study here: due to the overwhelming number of possible tuples, computing even a single exact gradient or maintaining a proba...

2 | Scaling up M-estimation via sampling designs: the Horvitz-Thompson stochastic gradient descent
- Clémençon, Bertail, et al.
- 2014
Citation Context: ...ese performance gains are very significant in practice when dealing with large-scale datasets. In future work, we plan to investigate how one may extend the nonuniform sampling strategies proposed in [8, 21, 28] to our setting in order to further improve convergence. This is a challenging goal since we cannot hope to maintain a distribution over the set of all possible tuples of data points. A tractable solu...

1 | Metric Learning
- Bellet, Habrard, et al.
- 2015
Citation Context: ...machine learning problems, the statistical risk functional is an expectation over d-tuples (d ≥ 2) of observations, rather than over individual points. This is the case in supervised metric learning [3], where one seeks to optimize a distance function such that it assigns smaller values to pairs of points with the same label than to those with different labels. Other popular examples include biparti...

1 | Maximal deviations of incomplete U-processes with applications to Empirical Risk Sampling
- Clémençon, Robbiano, et al.
- 2013
Citation Context: ...n drawing a subset of observations without replacement and forming all possible tuples based on these (the corresponding gradient estimate is then a complete U-statistic based on a subsample). While [6] has investigated maximal deviations between U-processes and their incomplete approximations, the performance analysis carried out in the present paper is inspired by [4] and involves both the opti...