Results 1–10 of 17
Optimal distributed online prediction using mini-batches
, 2010
Abstract

Cited by 69 (7 self)
Online prediction methods are typically presented as serial algorithms running on a single processor. However, in the age of web-scale prediction problems, it is increasingly common to encounter situations where a single processor cannot keep up with the high rate at which inputs arrive. In this work, we present the distributed mini-batch algorithm, a method of converting many serial gradient-based online prediction algorithms into distributed algorithms. We prove a regret bound for this method that is asymptotically optimal for smooth convex loss functions and stochastic inputs. Moreover, our analysis explicitly takes into account communication latencies between nodes in the distributed environment. We show how our method can be used to solve the closely related distributed stochastic optimization problem, achieving an asymptotically linear speedup over multiple processors. Finally, we demonstrate the merits of our approach on a web-scale online prediction problem.
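The core mechanism this abstract describes, converting a serial gradient method into a distributed one by averaging the workers' stochastic gradients into a single mini-batch update, can be sketched in a few lines. This is a serial simulation under assumed names (`distributed_minibatch_sgd`, `grad_fn`, the toy quadratic objective), not the paper's exact algorithm or its regret analysis:

```python
import numpy as np

def distributed_minibatch_sgd(grad_fn, x0, num_workers, rounds, lr):
    """Serial simulation of the distributed mini-batch idea: each round,
    every worker computes one stochastic gradient at the current iterate,
    and the gradients are averaged into a single mini-batch update."""
    x = np.asarray(x0, dtype=float)
    for _ in range(rounds):
        # Each worker draws an independent stochastic gradient at x.
        grads = [grad_fn(x) for _ in range(num_workers)]
        x = x - lr * np.mean(grads, axis=0)  # one averaged update per round
    return x

# Toy problem: minimize E[0.5*||x - z||^2] with z ~ N(1, 0.1); optimum x* = 1.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x - (1.0 + 0.1 * rng.standard_normal(x.shape))
x_final = distributed_minibatch_sgd(noisy_grad, np.zeros(3), num_workers=8,
                                    rounds=200, lr=0.1)
```

With more workers, each averaged update has lower variance, which is the intuition behind the linear speedup in the stochastic setting.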
Distributed delayed stochastic optimization
, 2011
Abstract

Cited by 52 (5 self)
We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show n-node architectures whose optimization error in stochastic problems, in spite of asynchronous delays, scales asymptotically as O(1/√(nT)), which is known to be optimal even in the absence of delays.
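The delayed-gradient setting this abstract analyzes can be mimicked directly: each update applies a stochastic gradient that was computed several iterations earlier, as if it had just arrived from an asynchronous worker. The function name, the toy objective, and the chosen step size below are illustrative assumptions, not the authors' procedure:

```python
from collections import deque
import numpy as np

def delayed_sgd(grad_fn, x0, steps, lr, delay):
    """SGD where each update applies a stochastic gradient computed at the
    iterate from `delay` steps earlier, mimicking asynchronous workers."""
    x = np.asarray(x0, dtype=float)
    pending = deque()  # gradients "in flight" from workers
    for _ in range(steps):
        pending.append(grad_fn(x))          # gradient evaluated at current x
        if len(pending) > delay:
            x = x - lr * pending.popleft()  # apply a stale gradient
    return x

# Toy problem: minimize E[0.5*(x - z)^2] with z ~ N(2, 0.1); optimum x* = 2.
rng = np.random.default_rng(1)
noisy_grad = lambda x: x - (2.0 + 0.1 * rng.standard_normal())
x_final = delayed_sgd(noisy_grad, np.zeros(1), steps=500, lr=0.05, delay=5)
```

For a small enough step size relative to the delay, the iterate still converges, which is the intuition behind the delays becoming asymptotically negligible.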
Better Mini-Batch Algorithms via Accelerated Gradient Methods
Abstract

Cited by 28 (6 self)
Mini-batch algorithms have been proposed as a way to speed up stochastic convex optimization problems. We study how such algorithms can be improved using accelerated gradient methods. We provide a novel analysis, which shows how standard gradient methods may sometimes be insufficient to obtain a significant speedup, and propose a novel accelerated gradient algorithm, which deals with this deficiency, enjoys a uniformly superior guarantee and works well in practice.
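A generic way to combine mini-batching with acceleration is to drive Nesterov-style momentum steps with averaged stochastic gradients. The sketch below is exactly that generic combination, with hypothetical names and a toy objective; it is not the specific accelerated algorithm or guarantee proposed in the paper:

```python
import numpy as np

def accelerated_minibatch(grad_fn, x0, batch_size, rounds, lr):
    """Nesterov-style accelerated gradient driven by mini-batch stochastic
    gradients: a generic sketch of combining acceleration with mini-batching."""
    x = np.asarray(x0, dtype=float)
    y = x.copy()
    for t in range(1, rounds + 1):
        # Average a mini-batch of stochastic gradients at the lookahead point.
        g = np.mean([grad_fn(y) for _ in range(batch_size)], axis=0)
        x_next = y - lr * g                                 # gradient step
        y = x_next + (t - 1.0) / (t + 2.0) * (x_next - x)   # momentum extrapolation
        x = x_next
    return x

# Toy problem: minimize E[0.5*||x - z||^2] with z ~ N(1, 0.1); optimum x* = 1.
rng = np.random.default_rng(2)
noisy_grad = lambda v: v - (1.0 + 0.1 * rng.standard_normal(v.shape))
x_final = accelerated_minibatch(noisy_grad, np.zeros(2), batch_size=4,
                                rounds=150, lr=0.1)
```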
Distributed learning, communication complexity and privacy
, 2012
Abstract

Cited by 19 (1 self)
We consider the problem of PAC-learning from distributed data and analyze fundamental communication complexity questions involved. We provide general upper and lower bounds on the amount of communication needed to learn well, showing that in addition to VC-dimension and covering number, quantities such as the teaching-dimension and mistake-bound of a class play an important role. We also present tight results for a number of common concept classes including conjunctions, parity functions, and decision lists. For linear separators, we show that for non-concentrated distributions, we can use a version of the Perceptron algorithm to learn with much less communication than the number of updates given by the usual margin bound. We also show how boosting can be performed in a generic manner in the distributed setting to achieve communication with only logarithmic dependence on 1/ɛ for any concept class, and demonstrate how recent work on agnostic learning from class-conditional queries can be used to achieve low communication in agnostic settings as well. We additionally present an analysis of privacy, considering both differential privacy and a notion of distributional privacy that is especially appealing in this context.
Online bandit learning against an adaptive adversary: from regret to policy regret
 In Proceedings of the 29th International Conference on Machine Learning
, 2012
Abstract

Cited by 15 (5 self)
Online learning algorithms are designed to learn even when their input is generated by an adversary. The widely accepted formal definition of an online algorithm’s ability to learn is the game-theoretic notion of regret. We argue that the standard definition of regret becomes inadequate if the adversary is allowed to adapt to the online algorithm’s actions. We define the alternative notion of policy regret, which attempts to provide a more meaningful way to measure an online algorithm’s performance against adaptive adversaries. Focusing on the online bandit setting, we show that no bandit algorithm can guarantee a sublinear policy regret against an adaptive adversary with unbounded memory. On the other hand, if the adversary’s memory is bounded, we present a general technique that converts any bandit algorithm with a sublinear regret bound into an algorithm with a sublinear policy regret bound. We extend this result to other variants of regret, such as switching regret, internal regret, and swap regret.
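The distinction between the two regret notions can be made concrete in code. Below, `standard_regret` scores the player against the best fixed action on the loss vectors that were actually realized, while `policy_regret` regenerates the adaptive adversary's losses as if a fixed action had been played all along; the memory-1 adversary and the alternating player are invented for illustration, not taken from the paper:

```python
def standard_regret(realized_losses, played):
    """Regret vs. the best fixed action, evaluated on the loss vectors that
    were actually generated (which depended on the player's own moves)."""
    n = len(realized_losses[0])
    algo = sum(l[a] for l, a in zip(realized_losses, played))
    best = min(sum(l[a] for l in realized_losses) for a in range(n))
    return algo - best

def policy_regret(adversary, played, num_actions):
    """Regret vs. the best fixed action, where the adaptive adversary's
    losses are re-generated as if that action had been played throughout."""
    horizon = len(played)
    algo, history = 0.0, []
    for t in range(horizon):
        algo += adversary(history)[played[t]]
        history.append(played[t])
    best = min(sum(adversary([a] * t)[a] for t in range(horizon))
               for a in range(num_actions))
    return algo - best

# A memory-1 adversary on two actions that penalizes switching.
def adversary(history):
    if not history:
        return [0.0, 0.0]
    loss = [0.0, 0.0]
    loss[1 - history[-1]] = 1.0
    return loss

played = [0, 1] * 5                    # a player that alternates for 10 rounds
realized, history = [], []
for a in played:
    realized.append(adversary(history))
    history.append(a)
```

On this example the two quantities disagree (standard regret 5, policy regret 9), because the realized loss sequence is not what the adversary would have produced against a fixed action.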
Algorithms and hardness results for parallel large margin learning
Abstract

Cited by 5 (0 self)
We study the fundamental problem of learning an unknown large-margin halfspace in the context of parallel computation. Our main positive result is a parallel algorithm for learning a large-margin halfspace that is based on interior point methods from convex optimization and fast parallel algorithms for matrix computations. We show that this algorithm learns an unknown γ-margin halfspace over n dimensions using poly(n, 1/γ) processors and runs in time Õ(1/γ) + O(log n). In contrast, naive parallel algorithms that learn a γ-margin halfspace in time that depends polylogarithmically on n have Ω(1/γ²) runtime dependence on γ. Our main negative result deals with boosting, which is a standard approach to learning large-margin halfspaces. We give an information-theoretic proof that in the original PAC framework, in which a weak learning algorithm is provided as an oracle that is called by the booster, boosting cannot be parallelized: the ability to call the weak learner multiple times in parallel within a single boosting stage does not reduce the overall number of successive stages of boosting that are required.
O(log T) Projections for Stochastic Optimization of Smooth and Strongly Convex Functions
Abstract

Cited by 5 (4 self)
Traditional algorithms for stochastic optimization require projecting the solution at each iteration into a given domain to ensure its feasibility. When facing complex domains, such as the positive semidefinite cone, the projection operation can be expensive, leading to a high computational cost per iteration. In this paper, we present a novel algorithm that aims to reduce the number of projections for stochastic optimization. The proposed algorithm combines the strength of several recent developments in stochastic optimization, including mini-batches, extra-gradient, and epoch gradient descent, in order to effectively exploit the smoothness and strong convexity. We show, both in expectation and with a high probability, that when the objective function is both smooth and strongly convex, the proposed algorithm achieves the optimal O(1/T) rate of convergence with only O(log T) projections. Our empirical study verifies the theoretical result.
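The projection-saving idea, run cheap unprojected stochastic gradient steps inside doubling epochs and project onto the feasible domain only at epoch boundaries, can be sketched on a toy constrained problem. The function and parameter names, the doubling/step-halving schedule, and the 1-D interval domain are illustrative assumptions, not the paper's precise method:

```python
import numpy as np

def epoch_sgd_few_projections(grad_fn, project, x0, total_steps, lr0):
    """Epoch-based SGD: the iterate moves unprojected within an epoch and is
    projected onto the domain once per epoch. Epoch lengths double, so about
    log2(T) projections are performed over T gradient steps."""
    x, steps, projections = float(x0), 0, 0
    epoch_len, lr = 1, lr0
    while steps < total_steps:
        for _ in range(epoch_len):
            if steps >= total_steps:
                break
            x -= lr * grad_fn(x)     # unprojected inner step
            steps += 1
        x = project(x)               # the only projection in this epoch
        projections += 1
        epoch_len *= 2               # doubling epochs => O(log T) projections
        lr /= 2
    return x, projections

# Toy problem: minimize 0.5*(x - 2)^2 over the interval [-1, 1]; optimum x* = 1.
rng = np.random.default_rng(3)
grad = lambda x: x - 2.0 + 0.1 * rng.standard_normal()
clip = lambda v: max(-1.0, min(1.0, v))
x_final, num_proj = epoch_sgd_few_projections(grad, clip, 0.0, 100, 0.5)
```

Here 100 gradient steps trigger only 7 projections, versus 100 for a projected-SGD baseline.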
Solving Large Scale Linear SVM with Distributed Block Minimization
Abstract

Cited by 4 (0 self)
Over recent years we have seen the appearance of huge datasets that do not fit into memory and do not even fit on the hard disk of a single computer. Moreover, even when processed on a cluster of machines, data are usually stored in a distributed way. The transfer of significant subsets of such datasets from one node to another is very slow. We present a new algorithm for training linear Support Vector Machines over such large datasets. Our algorithm assumes that the dataset is partitioned over several nodes on a cluster and performs a distributed block minimization along with the subsequent line search. The communication complexity of our algorithm is independent of the number of training examples. With our MapReduce/Hadoop implementation of this algorithm, the accurate training of an SVM over datasets of tens of millions of examples takes less than 11 minutes.
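A minimal sketch of the communication pattern, under assumed details: each "node" holds one block of examples and contributes, per round, a single d-dimensional hinge-loss subgradient (independent of how many examples it holds), after which a backtracking line search on the global objective sets the step. This mirrors only the block-partitioned, per-round O(d) communication idea with a primal subgradient step swapped in; it is not the authors' dual block minimization:

```python
import numpy as np

def global_objective(w, blocks, C):
    """L1-hinge linear SVM objective; each node could report its local loss
    term, so evaluating this needs one number of communication per node."""
    hinge = sum(np.maximum(1 - y * (X @ w), 0.0).sum() for X, y in blocks)
    return 0.5 * w @ w + C * hinge

def distributed_svm(blocks, C, rounds):
    """Per round: every node sends the hinge-loss subgradient of its block
    (one d-vector); the coordinator adds the regularizer gradient and picks
    a step size by backtracking line search on the global objective."""
    d = blocks[0][0].shape[1]
    w = np.zeros(d)
    for _ in range(rounds):
        g = w.copy()                              # gradient of 0.5*||w||^2
        for X, y in blocks:                       # one O(d) message per node
            active = (1 - y * (X @ w)) > 0        # margin-violating examples
            g -= C * (y[active, None] * X[active]).sum(axis=0)
        step, current = 1.0, global_objective(w, blocks, C)
        while step > 1e-8 and global_objective(w - step * g, blocks, C) >= current:
            step /= 2                             # backtracking line search
        w = w - step * g
    return w

# Toy data: two well-separated 2-D clusters, split across two "nodes".
rng = np.random.default_rng(4)
def make_block(m):
    Xp = rng.normal(loc=[2, 2], scale=0.3, size=(m, 2))
    Xn = rng.normal(loc=[-2, -2], scale=0.3, size=(m, 2))
    return np.vstack([Xp, Xn]), np.array([1.0] * m + [-1.0] * m)

blocks = [make_block(10), make_block(10)]
w = distributed_svm(blocks, C=0.1, rounds=30)
X_all = np.vstack([b[0] for b in blocks])
y_all = np.concatenate([b[1] for b in blocks])
accuracy = float(np.mean(np.sign(X_all @ w) == y_all))
```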
Distributed Non-Stochastic Experts
Abstract

Cited by 1 (0 self)
We consider the online distributed non-stochastic experts problem, where the distributed system consists of one coordinator node that is connected to k sites, and the sites are required to communicate with each other via the coordinator. At each time step t, one of the k site nodes has to pick an expert from the set {1, ..., n}, and the same site receives information about the payoffs of all experts for that round. The goal of the distributed system is to minimize regret at time horizon T, while simultaneously keeping communication to a minimum. The two extreme solutions to this problem are: (i) Full communication: this essentially simulates the non-distributed setting to obtain the optimal O(√(log(n) T)) regret bound at the cost of T communication. (ii) No communication: each site runs an independent copy; the regret is O(√(log(n) kT)) and the communication is 0. This paper shows the difficulty of simultaneously achieving regret asymptotically better than √(kT) and communication better than T. We give a novel algorithm that for an oblivious adversary achieves a non-trivial tradeoff: regret O(√(k^{5(1+ɛ)/6} T)) and communication O(T/k^ɛ), for any value of ɛ ∈ (0, 1/5). We also consider a variant of the model, where the coordinator picks the expert. In this model, we show that the label-efficient forecaster of Cesa-Bianchi et al. (2005) already gives us a strategy that is near-optimal in the regret vs. communication tradeoff.
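The "full communication" baseline corresponds to running a standard serial experts algorithm on all of the loss information. A minimal sketch of one such forecaster, multiplicative weights (Hedge), with invented toy losses and the usual step-size tuning for losses in [0, 1]; the distributed and communication-limited aspects of the paper are not modeled here:

```python
import numpy as np

def hedge(loss_matrix, eta):
    """Multiplicative-weights (Hedge) forecaster over n experts, the kind of
    serial algorithm the full-communication baseline simulates; its regret
    is O(sqrt(T log n))."""
    T, n = loss_matrix.shape
    w = np.ones(n)
    total = 0.0
    for losses in loss_matrix:
        p = w / w.sum()               # play the normalized weights
        total += p @ losses           # expected loss this round
        w *= np.exp(-eta * losses)    # exponentially downweight bad experts
    return total

rng = np.random.default_rng(5)
T, n = 2000, 10
L = rng.random((T, n))                # toy losses in [0, 1]
L[:, 3] *= 0.5                        # expert 3 is best on average
eta = np.sqrt(8.0 * np.log(n) / T)    # standard tuning for the horizon
regret = hedge(L, eta) - L.sum(axis=0).min()
```

With this tuning the regret stays below √(T ln(n)/2) ≈ 48 over these 2000 rounds, far below the ~500-loss gap to a mediocre fixed expert.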