Results 1–10 of 43
Mergeable Summaries
"... We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means t ..."
Abstract

Cited by 22 (7 self)
We study the mergeability of data summaries. Informally speaking, mergeability requires that, given two summaries on two data sets, there is a way to merge the two summaries into a single summary on the union of the two data sets, while preserving the error and size guarantees. This property means that the summaries can be merged in the same way as other algebraic operators such as sum and max, which is especially useful for computing summaries on massive distributed data. Several data summaries are trivially mergeable by construction, most notably all the sketches that are linear functions of the data sets. But some other fundamental ones, like those for heavy hitters and quantiles, are not (known to be) mergeable. In this paper, we demonstrate that these summaries are indeed mergeable or can be made mergeable after appropriate modifications. Specifically, we show that for ε-approximate heavy hitters, there is a deterministic mergeable summary of size O(1/ε); for ε-approximate quantiles, there is a deterministic summary of size O((1/ε) log(εn)) that has a restricted form of mergeability, and a randomized one of size O((1/ε) log^{3/2}(1/ε)) with full mergeability. We also extend our results to geometric summaries such as ε-approximations and ε-kernels. We also achieve two results of independent interest: (1) we provide the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and (2) we demonstrate that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
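The heavy-hitter result builds on the MG (Misra-Gries) summary, whose merge rule is simple enough to sketch: add the two counter sets coordinate-wise, then subtract the k-th largest counter value from every entry and discard anything non-positive. A minimal Python sketch of that rule (the function names and dict representation are our own):

```python
def mg_update(counters, x, k):
    """Streaming Misra-Gries update: maintain at most k-1 counters."""
    if x in counters:
        counters[x] += 1
    elif len(counters) < k - 1:
        counters[x] = 1
    else:
        # Decrement every counter; drop those that reach zero.
        for key in list(counters):
            counters[key] -= 1
            if counters[key] == 0:
                del counters[key]

def mg_merge(c1, c2, k):
    """Merge two MG summaries: add counters, then subtract the k-th
    largest value from all entries and prune non-positive ones."""
    merged = dict(c1)
    for key, v in c2.items():
        merged[key] = merged.get(key, 0) + v
    if len(merged) >= k:
        kth = sorted(merged.values(), reverse=True)[k - 1]
        merged = {key: v - kth for key, v in merged.items() if v > kth}
    return merged
```

Each summary keeps at most k − 1 counters, and the merge preserves the additive error bound of n/k over the combined stream, which is what makes the summary mergeable rather than merely streaming.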
Nearest Neighbor based Greedy Coordinate Descent
"... Increasingly, optimization problems in machine learning, especially those arising from highdimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number ..."
Abstract

Cited by 10 (4 self)
Increasingly, optimization problems in machine learning, especially those arising from high-dimensional statistical estimation, have a large number of variables. Modern statistical estimators developed over the past decade have statistical or sample complexity that depends only weakly on the number of parameters when there is some structure to the problem, such as sparsity. A central question is whether similar advances can be made in their computational complexity as well. In this paper, we propose strategies that indicate that such advances can indeed be made. In particular, we investigate the greedy coordinate descent algorithm, and note that performing the greedy step efficiently weakens the costly dependence on the problem size provided the solution is sparse. We then propose a suite of methods that perform these greedy steps efficiently by a reduction to nearest neighbor search. We also devise a more amenable form of greedy descent for composite non-smooth objectives, as well as several approximate variants of such greedy descent. We develop a practical implementation of our algorithm that combines greedy coordinate descent with locality-sensitive hashing. Without tuning the latter data structure, we are not only able to significantly speed up the vanilla greedy method, but also outperform cyclic descent when the problem size becomes large. Our results indicate the effectiveness of our nearest neighbor strategies, and also point to many open questions regarding the development of computational geometric techniques tailored towards first-order optimization methods.
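The "vanilla greedy method" the abstract refers to is Gauss-Southwell coordinate descent: at each step, update the coordinate with the largest gradient magnitude. A minimal sketch for a least-squares objective (our own stand-in problem; the full-gradient argmax below is exactly the per-step cost the paper reduces via nearest neighbor search):

```python
import numpy as np

def greedy_coordinate_descent(A, b, iters=2000):
    """Gauss-Southwell greedy coordinate descent for f(x) = 0.5*||Ax - b||^2.
    Assumes A has no all-zero columns."""
    n, d = A.shape
    x = np.zeros(d)
    r = A @ x - b                      # residual, maintained incrementally
    col_sq = (A ** 2).sum(axis=0)      # per-coordinate curvature ||A_i||^2
    for _ in range(iters):
        g = A.T @ r                    # full gradient: the O(d) step the
                                       # paper replaces with NN search
        i = int(np.argmax(np.abs(g)))  # greedy (Gauss-Southwell) choice
        step = g[i] / col_sq[i]        # exact minimization along coordinate i
        x[i] -= step
        r -= step * A[:, i]            # O(n) residual update
    return x
```

When the solution is sparse, only a few coordinates ever receive large updates, which is why replacing the argmax with an approximate nearest neighbor query can pay off.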
Protocols for learning classifiers on distributed data
In AISTATS, 2012
"... We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models realworld communication ..."
Abstract

Cited by 9 (1 self)
We consider the problem of learning classifiers for labeled data that has been distributed across several nodes. Our goal is to find a single classifier, with small approximation error, across all datasets while minimizing the communication between nodes. This setting models real-world communication bottlenecks in the processing of massive distributed datasets. We present several very general sampling-based solutions as well as some two-way protocols which have a provable exponential speed-up over any one-way protocol. We focus on core problems for noiseless data distributed across two or more nodes. The techniques we introduce are reminiscent of active learning, but rather than actively probing labels, nodes actively communicate with each other, each node simultaneously learning the important data from another node. Keywords: distributed learning, communication complexity, one-way/two-way communication, two-party/k-party protocols.
A Dynamic Data Structure for Approximate Range Searching
, 2010
"... In this paper, we introduce a simple, randomized dynamic datastructureforstoringmultidimensionalpointsets, called a quadtreap. This data structure is a randomized, balanced variant of a quadtree data structure. In particular, it defines a hierarchical decomposition of space into cells, which are bas ..."
Abstract

Cited by 5 (3 self)
In this paper, we introduce a simple, randomized dynamic data structure for storing multidimensional point sets, called a quadtreap. This data structure is a randomized, balanced variant of a quadtree data structure. In particular, it defines a hierarchical decomposition of space into cells, which are based on hyperrectangles of bounded aspect ratio, each of constant combinatorial complexity. It can be viewed as a multidimensional generalization of the treap data structure of Seidel and Aragon. When inserted, points are assigned random priorities, and the tree is restructured through rotations as if the points had been inserted in priority order. In any fixed dimension d, we show it is possible to store a set of n points in a quadtreap of space O(n). The height h of the tree is O(log n) with high probability. It supports point insertion in time O(h). It supports point deletion in worst-case time O(h^2) and expected-case time O(h), averaged over the points of the tree. It can answer ε-approximate spherical range counting queries over groups and approximate nearest neighbor queries in time O(h + (1/ε)^{d−1}).
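The rebalancing mechanism the quadtreap inherits from the treap is worth seeing concretely: every inserted element draws a random priority and is rotated upward until priorities satisfy heap order, so the tree looks as if elements arrived in priority order. The classical one-dimensional treap insert it generalizes can be sketched as follows (our illustration, not the paper's structure):

```python
import random

class Node:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()   # random priority, drawn on insert
        self.left = self.right = None

def insert(root, key):
    """BST insert, then a rotation if the new child violates heap order
    on priorities. Returns the (possibly new) subtree root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
        if root.left.priority > root.priority:      # rotate right
            pivot, root.left = root.left, root.left.right
            pivot.right = root
            return pivot
    else:
        root.right = insert(root.right, key)
        if root.right.priority > root.priority:     # rotate left
            pivot, root.right = root.right, root.right.left
            pivot.left = root
            return pivot
    return root
```

Because priorities are random, the resulting tree is distributed like a random binary search tree, giving O(log n) expected height regardless of insertion order; the quadtreap applies the same idea to quadtree cells.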
Weighted Geometric Set Cover Problems Revisited
, 2008
"... We study several set cover problems in low dimensional geometric settings. Specifically, we describe a PTAS for the problem of computing a minimum cover of given points by a set of weighted fat objects. Here, we allow the objects to expand by some prespecified δfraction of their diameter. Next, we ..."
Abstract

Cited by 4 (0 self)
We study several set cover problems in low-dimensional geometric settings. Specifically, we describe a PTAS for the problem of computing a minimum cover of given points by a set of weighted fat objects. Here, we allow the objects to expand by some pre-specified δ-fraction of their diameter. Next, we show that the problem of computing a minimum-weight cover of points by weighted half-planes (without expansion) can be solved exactly in the plane. We also study the problem of covering R^d by weighted half-spaces, and provide approximation algorithms and hardness results. We also investigate the “dual” setting of computing a minimum-weight simplex that covers a given target point. Finally, we provide a near-linear time algorithm for the problem of minimizing the total weight of the constraints that must be removed from an LP to make it feasible.
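For context, the standard baseline that the geometric PTAS improves on is the classical greedy algorithm for weighted set cover, which repeatedly picks the set minimizing weight per newly covered element and guarantees only an O(log n) approximation. A short sketch of that baseline (not the paper's algorithm):

```python
def greedy_weighted_set_cover(universe, sets, weights):
    """Classical greedy O(log n)-approximation for weighted set cover.
    Assumes the union of `sets` covers `universe`."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the set with the smallest weight per newly covered element.
        best = min(
            (i for i in range(len(sets)) if sets[i] & uncovered),
            key=lambda i: weights[i] / len(sets[i] & uncovered),
        )
        chosen.append(best)
        uncovered -= sets[best]
    return chosen
```

The point of the geometric setting is that fatness (plus the δ-expansion) lets one beat this logarithmic ratio and get a (1 + ε)-approximation.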
Self-Approaching Graphs
"... In this paper we introduce selfapproaching graph drawings. A straightline drawing of a graph is selfapproaching if, for any origin vertex s and any destination vertex t, there is an stpath in the graph such that, for any point q on the path, as a point p moves continuously along the path from ..."
Abstract

Cited by 4 (0 self)
In this paper we introduce self-approaching graph drawings. A straight-line drawing of a graph is self-approaching if, for any origin vertex s and any destination vertex t, there is an st-path in the graph such that, for any point q on the path, as a point p moves continuously along the path from the origin to q, the Euclidean distance from p to q is always decreasing. This is a more stringent condition than a greedy drawing (where only the distance between vertices on the path and the destination vertex must decrease), and guarantees that the drawing is a 5.33-spanner. We study three topics: (1) recognizing self-approaching drawings; (2) constructing self-approaching drawings of a given graph; (3) constructing a self-approaching Steiner network connecting a given set of points. We show that: (1) there are efficient algorithms to test if a polygonal path is self-approaching in R^2 and R^3, but it is NP-hard to test if a given graph drawing in R^3 has a self-approaching uv-path; (2) we can characterize the trees that have self-approaching drawings; (3) for any given set of terminal points in the plane, we can find a linear-sized network that has a self-approaching path between any ordered pair of terminals.
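The definition has a clean infinitesimal form: since d/dt |p(t) − q|² = −2 p′(t)·(q − p(t)), a path is self-approaching exactly when, at every point p, the forward direction makes a non-obtuse angle with the vector to every later point q. That condition can be checked by brute force on sample points along a polygonal path (a sampled approximation of the test, our sketch; the paper gives efficient exact algorithms):

```python
def is_self_approaching(path, samples_per_edge=20, eps=1e-9):
    """Sampled self-approaching test for a polygonal path in R^2.
    At every sample point p with forward direction d, require
    d . (q - p) >= 0 for every later sample point q."""
    pts, dirs = [], []
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        for t in range(samples_per_edge):
            s = t / samples_per_edge
            pts.append((x0 + s * (x1 - x0), y0 + s * (y1 - y0)))
            dirs.append((x1 - x0, y1 - y0))
    pts.append(path[-1])
    dirs.append(dirs[-1])
    for i in range(len(pts)):
        px, py = pts[i]
        dx, dy = dirs[i]
        for j in range(i + 1, len(pts)):
            qx, qy = pts[j]
            if dx * (qx - px) + dy * (qy - py) < -eps:
                return False
    return True
```

A path that ever doubles back fails the dot-product test at some point shortly before the turn, which is what makes the condition strictly stronger than greediness.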
Down the Rabbit Hole: Robust Proximity Search and Density Estimation in Sublinear Space
, 2013
"... For a set of n points in IR d, and parameters k and ε, we present a data structure that answers (1 + ε, k)ANN queries in logarithmic time. Surprisingly, the space used by the datastructure is Õ(n/k); that is, the space used is sublinear in the input size if k is sufficiently large. Our approach pr ..."
Abstract

Cited by 2 (1 self)
For a set of n points in R^d, and parameters k and ε, we present a data structure that answers (1 + ε, k)-ANN queries in logarithmic time. Surprisingly, the space used by the data structure is Õ(n/k); that is, the space used is sublinear in the input size if k is sufficiently large. Our approach provides a novel way to summarize geometric data, such that meaningful proximity queries on the data can be carried out using this sketch. Using this, we provide a sublinear-space data structure that can estimate the density of a point set under various measures, including: (i) sum of distances of the k closest points to the query point, and (ii) sum of squared distances of the k closest points to the query point. Our approach generalizes to other distance-based density estimates of a similar flavor. We also study the problem of approximating some of these quantities when using sampling. In particular, we show that a sample of size Õ(n/k) is sufficient, in some restricted cases, to estimate the above quantities. Remarkably, the sample size has only linear dependency on the dimension.
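The two density measures themselves are simple to state; a brute-force reference implementation of the definitions (this is the exact quantity the sublinear-space sketch approximates, not the data structure itself):

```python
import math

def knn_density(points, q, k):
    """Return (sum of distances, sum of squared distances) from query
    point q to its k nearest points, by brute force in O(n log n)."""
    d = sorted(math.dist(p, q) for p in points)
    return sum(d[:k]), sum(x * x for x in d[:k])
```

The point of the paper is that both values can be approximated from a structure (or sample) of size Õ(n/k), rather than scanning all n points as done here.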
Range Counting Coresets for Uncertain Data
, 2013
"... We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncer ..."
Abstract

Cited by 2 (1 self)
We study coresets for various types of range counting queries on uncertain data. In our model each uncertain point has a probability density describing its location, sometimes defined as k distinct locations. Our goal is to construct a subset of the uncertain points, including their locational uncertainty, so that range counting queries can be answered by just examining this subset. We study three distinct types of queries. RE queries return the expected number of points in a query range. RC queries return the number of points in the range with probability at least a threshold. RQ queries return the probability that fewer than some threshold fraction of the points are in the range. In both RC and RQ coresets the threshold is provided as part of the query. For each type of query we provide coreset constructions with approximation-size trade-offs. We show that random sampling can be used to construct each type of coreset, and we also provide significantly improved bounds using discrepancy-based approaches on axis-aligned range queries.
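For the simplest of the three query types, an RE query and its random-sampling coreset can be sketched directly in the k-distinct-locations model: answer a query by summing each uncertain point's probability of lying in the range, and build a coreset by sampling uncertain points and reweighting. A minimal sketch under those assumptions (names are ours):

```python
import random

def re_query(points, rect):
    """Exact RE query: expected number of uncertain points inside rect.
    Each uncertain point is a list of equally likely (x, y) locations."""
    (x0, y0), (x1, y1) = rect
    total = 0.0
    for locs in points:
        inside = sum(1 for x, y in locs if x0 <= x <= x1 and y0 <= y <= y1)
        total += inside / len(locs)
    return total

def re_sample_coreset(points, m, rng):
    """Random-sampling RE coreset: keep m uncertain points (with their
    full location lists) and scale query answers by n/m."""
    return rng.sample(points, m), len(points) / m
```

A query against the coreset is then `weight * re_query(sample, rect)`; the discrepancy-based constructions in the paper achieve better size for the same error on axis-aligned ranges.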
The traveling salesman problem for lines, balls and planes
, 2013
"... We revisit the traveling salesman problem with neighborhoods (TSPN) and obtain several approximation algorithms. These constitute either improvements over previously best approximations achievable in comparable times (for unit disks in the plane), or first approximations ever (for hyperplanes and li ..."
Abstract

Cited by 2 (1 self)
We revisit the traveling salesman problem with neighborhoods (TSPN) and obtain several approximation algorithms. These constitute either improvements over the previously best approximations achievable in comparable time (for unit disks in the plane), or the first approximations ever (for hyperplanes and lines in R^d, and unit balls in R^3). (I) Given a set of n hyperplanes in R^d, a TSP tour that is at most O(1) times longer than the optimal can be computed in O(n) time, when d is constant. (II) Given a set of n lines in R^d, a TSP tour that is at most O(log^3 n) times longer than the optimal can be computed in polynomial time, when d is constant. (III) Given a set of n unit disks in the plane or n unit balls in R^3, we improve the approximation ratios relying on a black box that computes a good approximate tour for a set of points in the ambient space (in our case, these are the centers of a subset of the disks or the balls).