Results 1–10 of 11
Optimal distributed online prediction using mini-batches
, 2010
Abstract

Cited by 17 (2 self)
Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely related distributed stochastic optimization problem, achieving an asymptotically linear speedup over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem.
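The core idea of the distributed mini-batch approach described above can be sketched in a few lines: workers share the work of computing one mini-batch gradient, and the averaged gradient drives a single serial-style update per round. The following single-process simulation is a minimal sketch; the function name, shard scheme, and constants are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def distributed_minibatch(grad_fn, w0, stream, k=4, b=32, lr=0.1, rounds=50):
    """Simulate k workers jointly computing one mini-batch gradient per round.

    Each round consumes b inputs from the stream, splits them into k shards
    (one per simulated worker), averages the per-shard gradients (the
    all-reduce step), and applies a single serial-style update.
    """
    w = np.asarray(w0, dtype=float)
    for _ in range(rounds):
        batch = [next(stream) for _ in range(b)]
        shards = [batch[i::k] for i in range(k)]      # one shard per worker
        g = np.mean([np.mean([grad_fn(w, x) for x in shard], axis=0)
                     for shard in shards], axis=0)    # averaged mini-batch gradient
        w = w - lr * g
    return w
```

For a quadratic loss (1/2)(w − x)², the update contracts toward the stream mean, mirroring the serial algorithm's behavior while the b inputs per round are processed in parallel.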
Distributed delayed stochastic optimization
, 2011
Abstract

Cited by 16 (3 self)
We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
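A stale-gradient update of the kind analyzed here, w_{t+1} = w_t − η_t · g(w_{t−τ}), can be simulated serially. The sketch below, with an assumed decaying step size η_t = η_0/√t, is illustrative rather than the paper's exact procedure:

```python
from collections import deque

def delayed_sgd(grad_fn, w0, steps=2000, delay=5, lr0=0.2):
    """Gradient descent whose update uses a gradient evaluated at a stale
    iterate: w_{t+1} = w_t - (lr0 / sqrt(t)) * grad_fn(w_{t-delay})."""
    w = w0
    history = deque([w0] * (delay + 1), maxlen=delay + 1)  # recent iterates
    for t in range(1, steps + 1):
        stale = history[0]                 # iterate from `delay` steps ago
        w = w - (lr0 / t ** 0.5) * grad_fn(stale)
        history.append(w)                  # oldest entry falls off the left
    return w
```

Despite the delay, the decaying step size keeps the iteration stable, and on a smooth strongly convex objective the iterates still approach the minimizer, in line with the result that the delays are asymptotically negligible.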
Better Mini-Batch Algorithms via Accelerated Gradient Methods
Abstract

Cited by 6 (3 self)
Mini-batch algorithms have been proposed as a way to speed up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speedup, and we propose a novel accelerated gradient algorithm that deals with this deficiency, enjoys a uniformly superior guarantee, and works well in practice.
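As a point of reference for how acceleration interacts with mini-batches, here is a generic Nesterov-style accelerated update applied to mini-batch gradients. It is a standard accelerated method with assumed constants, not the specific algorithm proposed in the paper:

```python
def accelerated_minibatch(grad_fn, batches, w0, lr=0.1, mu=0.9):
    """Nesterov-style accelerated updates driven by mini-batch gradients:
    the gradient is averaged over each batch and evaluated at a look-ahead
    point w + mu * v rather than at w itself."""
    w, v = w0, 0.0
    for batch in batches:
        lookahead = w + mu * v
        g = sum(grad_fn(lookahead, x) for x in batch) / len(batch)
        v = mu * v - lr * g                # momentum with the fresh gradient
        w = w + v
    return w
```

On a smooth convex objective this converges markedly faster per mini-batch than plain gradient steps, which is the kind of gap the paper's analysis quantifies.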
Distributed learning, communication complexity and privacy
, 2012
Abstract

Cited by 4 (0 self)
We consider the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. We provide general upper and lower bounds on the amount of communication needed to learn well, showing that in addition to VC-dimension and covering number, quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also present tight results for a number of common concept classes, including conjunctions, parity functions, and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a version of the Perceptron algorithm to learn with much less communication than the number of updates given by the usual margin bound. We also show how boosting can be performed in a generic manner in the distributed setting to achieve communication with only logarithmic dependence on 1/ɛ for any concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can be used to achieve low communication in agnostic settings as well. We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.
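The Perceptron idea alluded to above can be illustrated with a toy coordinator/sites simulation in which communication is charged only when a site ships a misclassified example to the coordinator. The message-counting scheme and protocol details here are assumptions for illustration, not the paper's protocol:

```python
import numpy as np

def distributed_perceptron(sites, d, max_rounds=1000):
    """Coordinator/sites Perceptron: the coordinator holds w, and a site
    communicates only when it finds a local mistake, shipping that single
    example; messages therefore track updates, not dataset size."""
    w = np.zeros(d)
    messages = 0
    for _ in range(max_rounds):
        updated = False
        for data in sites:                 # each site scans its local data
            for x, y in data:
                if y * (w @ x) <= 0:       # mistake found at this site
                    w = w + y * x          # coordinator applies the update
                    messages += 1
                    updated = True
                    break
        if not updated:                    # no site reports a mistake: done
            break
    return w, messages
```

On separable data the total communication is bounded by the Perceptron mistake bound, independent of how many examples sit at each site.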
Solving Large Scale Linear SVM with Distributed Block Minimization
Abstract
Over recent years we have seen the appearance of huge datasets that do not fit into memory and do not even fit on the hard disk of a single computer. Moreover, even when processed on a cluster of machines, data are usually stored in a distributed way. The transfer of significant subsets of such datasets from one node to another is very slow. We present a new algorithm for training linear Support Vector Machines over such large datasets. Our algorithm assumes that the dataset is partitioned over several nodes of a cluster and performs distributed block minimization along with a subsequent line search. The communication complexity of our algorithm is independent of the number of training examples. With our MapReduce/Hadoop implementation of this algorithm, accurate training of an SVM over datasets of tens of millions of examples takes less than 11 minutes.
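A much-simplified, single-process sketch of block minimization with a line search might look as follows. The local solver (a few subgradient steps on the block's share of the hinge loss) and all constants are stand-ins, not the paper's implementation; the point is that each outer round communicates only O(d) numbers, regardless of how many examples each node holds.

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """L2-regularized average hinge loss."""
    return lam / 2 * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))

def distributed_block_min(X_parts, y_parts, lam=0.1, outer=20, inner=50, lr=0.05):
    """Each node improves w using only its local block; the summed deltas
    are combined via a line search on the global objective."""
    X, y = np.vstack(X_parts), np.concatenate(y_parts)
    w = np.zeros(X.shape[1])
    for _ in range(outer):
        deltas = []
        for Xp, yp in zip(X_parts, y_parts):        # runs in parallel in practice
            wl = w.copy()
            for _ in range(inner):                  # local subgradient descent
                viol = yp * (Xp @ wl) < 1
                g = lam * wl
                if viol.any():
                    g = g - (yp[viol][:, None] * Xp[viol]).sum(axis=0) / len(yp)
                wl = wl - lr * g
            deltas.append(wl - w)
        D = np.sum(deltas, axis=0)                  # O(d) communication per node
        steps = np.linspace(0.0, 1.0, 21)
        fvals = [svm_objective(w + a * D, X, y, lam) for a in steps]
        w = w + steps[int(np.argmin(fvals))] * D    # line search on global f
    return w
```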
Distributed Non-Stochastic Experts
Abstract
We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time step t, one of the k site nodes has to pick an expert from the set {1, ..., n}, and the same site receives information about the payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) full communication, which essentially simulates the non-distributed setting to obtain the optimal O(√(log(n) T)) regret bound at the cost of T communication; and (ii) no communication, where each site runs an independent copy, so the regret is O(√(log(n) kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that, for an oblivious adversary, achieves a non-trivial tradeoff: regret O(√(k^(5(1+ɛ)/6) T)) and communication O(T/k^ɛ), for any value of ɛ ∈ (0, 1/5). We also consider a variant of the model where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives a strategy that is near optimal in the regret vs. communication tradeoff.
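The full-communication extreme, item (i) above, reduces to running one shared exponentially weighted forecaster over the n experts. A minimal sketch of that baseline, assuming payoffs in [0, 1] and a fixed learning rate, is:

```python
import math

def hedge(payoffs, eta):
    """Exponentially weighted forecaster over n experts; `payoffs` is a list
    of per-round payoff vectors in [0, 1]. Returns the expected regret
    against the single best expert in hindsight."""
    n = len(payoffs[0])
    w = [1.0] * n
    total = 0.0
    best = [0.0] * n
    for p in payoffs:
        s = sum(w)
        probs = [wi / s for wi in w]
        total += sum(qi * pi for qi, pi in zip(probs, p))     # expected payoff
        w = [wi * math.exp(eta * pi) for wi, pi in zip(w, p)]  # reweight
        best = [bi + pi for bi, pi in zip(best, p)]            # per-expert totals
    return max(best) - total
```

Simulating this shared state across sites costs one message per round, i.e. T communication, which is exactly the cost the paper's algorithm trades against regret.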
Online Learning and Adaptation over Networks: More Information is Not Necessarily Better
Abstract
We examine the performance of stochastic-gradient learners over connected networks for global optimization problems involving risk functions that are not necessarily quadratic. We consider two well-studied classes of distributed schemes: consensus strategies and diffusion strategies. We quantify how the mean-square error and the convergence rate of the network vary with the combination policy and with the fraction of informed agents. Several combination policies are considered, including doubly-stochastic rules, the averaging rule, the Metropolis rule, and the Hastings rule. It will be seen that the performance of the network does not necessarily improve with a larger proportion of informed agents. A strategy to counter the degradation in performance is presented.
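The adapt-then-combine diffusion strategy mentioned above can be sketched directly: each agent takes a local gradient step, then mixes the intermediate iterates with its neighbors through a combination matrix A. The quadratic local costs and the averaging rule below are illustrative choices, not the paper's setup:

```python
import numpy as np

def diffusion_atc(local_grads, A, w0, lr=0.05, steps=300):
    """Adapt-then-combine diffusion: every agent takes a local gradient step
    (adapt), then averages the intermediate iterates with its neighbors via
    the combination matrix A (combine)."""
    N = len(local_grads)
    W = np.tile(np.asarray(w0, dtype=float), (N, 1))    # one row per agent
    for _ in range(steps):
        psi = np.stack([W[k] - lr * local_grads[k](W[k]) for k in range(N)])  # adapt
        W = A @ psi                                                           # combine
    return W
```

With quadratic local costs (w − c_k)²/2 and a doubly-stochastic A, all agents converge to the minimizer of the aggregate cost, the mean of the c_k; changing A to a non-averaging rule changes the steady-state error, which is the effect the paper quantifies.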
A Theoretical Analysis of a Warm Start Technique
Abstract
Batch gradient descent looks at every data point for every step, which is wasteful for early steps where the current position is nowhere near optimal. There has been a lot of interest in warm-start approaches to gradient descent techniques, but little analysis. In this paper, we formally analyze a method of warm-starting batch gradient descent using small batch sizes. We argue that this approach is fundamentally different from mini-batch, in that after an initial shuffle, it requires only sequential passes over the data, improving performance on datasets stored on a disk drive.

1 Support Vector Machine Problems. Suppose you have a set of examples (x_1, y_1), ..., (x_m, y_m) drawn from a distribution D, and a regularization parameter λ. For each example i, there is some convex loss L_i : R^d → R over the parameter space associated with example i, such that for any S ⊆ {1, ..., m} we have a cost function f_S : R^d → R: f_S(w) = (λ/2)‖w‖² + ...
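The warm-start scheme described in the abstract can be sketched as follows: run batch gradient descent on a small shuffled prefix of the data first, then reuse that solution to initialize runs on larger prefixes and finally the full batch. The schedule sizes, step size, and iteration counts below are assumptions, not the paper's analyzed settings:

```python
import random
import numpy as np

def warm_start_batch_gd(grad_fn, data, w0, schedule=(32, 256), lr=0.1, inner=100):
    """Warm-started batch gradient descent: after one initial shuffle, run
    batch GD on growing prefixes of the data, reusing each solution to
    initialize the next (and finally the full-batch) run. Only sequential
    passes over the stored data are needed."""
    w = np.asarray(w0, dtype=float)
    data = list(data)
    random.Random(0).shuffle(data)              # the single initial shuffle
    for size in list(schedule) + [len(data)]:
        prefix = data[:size]
        for _ in range(inner):                  # batch GD on the current prefix
            g = np.mean([grad_fn(w, x) for x in prefix], axis=0)
            w = w - lr * g
    return w
```

Unlike mini-batch SGD, every pass reads a contiguous prefix of the shuffled data in order, which is the property that makes this disk-friendly.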
25th Annual Conference on Learning Theory Distributed Learning, Communication Complexity and Privacy