Results 1–10 of 19
Cover trees for nearest neighbor
 In Proceedings of the 23rd international conference on Machine learning
, 2006
Abstract

Cited by 139 (0 self)
ABSTRACT. We present a tree data structure for fast nearest neighbor operations in general n-point metric spaces. The data structure requires O(n) space regardless of the metric’s structure. If the point set has an expansion constant c in the sense of Karger and Ruhl [KR02], the data structure can be constructed in O(c^6 n log n) time. Nearest neighbor queries obeying the expansion bound require O(c^12 log n) time. In addition, the nearest neighbor of all points can be queried in O(c^16 n) time. We experimentally test the algorithm, showing speedups over brute-force search varying between 1 and 2000 on natural machine learning datasets.
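The expansion constant in the Karger–Ruhl sense can be checked by brute force on a small point set; a minimal sketch (the helper name and the example points are illustrative, not from the paper):

```python
from itertools import product

def expansion_constant(points, dist):
    """Smallest c with |B(p, 2r)| <= c * |B(p, r)| for every point p and
    radius r, where B(p, r) is the closed ball around p. For a finite set
    the ball sizes only change at pairwise distances and their halves,
    so those radii suffice as candidates."""
    dists = {dist(p, q) for p, q in product(points, points)}
    radii = {r for d in dists for r in (d, d / 2) if d > 0}
    c = 1.0
    for p in points:
        for r in radii:
            inner = sum(1 for q in points if dist(p, q) <= r)
            outer = sum(1 for q in points if dist(p, q) <= 2 * r)
            c = max(c, outer / inner)
    return c

# Five evenly spaced points on a line: a tiny ball around an interior
# point holds 1 point, and doubling its radius captures 3.
pts = [0.0, 1.0, 2.0, 3.0, 4.0]
print(expansion_constant(pts, lambda a, b: abs(a - b)))
```

For this example the worst case is an interior point at radius 0.5, giving c = 3.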
Meridian: A Lightweight Network Location Service without Virtual Coordinates
 In SIGCOMM
, 2005
Abstract

Cited by 139 (7 self)
This paper introduces a lightweight, scalable and accurate framework, called Meridian, for performing node selection based on network location. The framework consists of an overlay network structured around multi-resolution rings, query routing with direct measurements, and gossip protocols for dissemination. We show how this framework can be used to address three commonly encountered problems, namely, closest node discovery, central leader election, and locating nodes that satisfy target latency constraints in large-scale distributed systems without having to compute absolute coordinates. We show analytically that the framework is scalable with logarithmic convergence when Internet latencies are modeled as a growth-constrained metric, a low-dimensional Euclidean metric, or a metric of low doubling dimension. Large-scale simulations, based on latency measurements from 6.25 million node pairs, as well as an implementation deployed on PlanetLab, show that the framework is accurate and effective.
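The multi-resolution ring idea can be sketched directly: each node buckets its peers by latency into exponentially widening rings and keeps only a few representatives per ring. A toy illustration (the constants alpha and s and the per-ring cap are placeholders, not Meridian's actual parameters):

```python
import math

def ring_index(latency_ms, alpha=1.0, s=2.0):
    """Ring i holds peers with latency in [alpha * s**(i-1), alpha * s**i);
    ring 0 holds anything closer than alpha."""
    if latency_ms < alpha:
        return 0
    return int(math.floor(math.log(latency_ms / alpha, s))) + 1

def build_rings(peer_latencies, per_ring=3, alpha=1.0, s=2.0):
    """Assign each peer to its ring, keeping at most per_ring members."""
    rings = {}
    for peer, lat in peer_latencies.items():
        i = ring_index(lat, alpha, s)
        members = rings.setdefault(i, [])
        if len(members) < per_ring:
            members.append(peer)
    return rings

print(build_rings({"a": 0.5, "b": 3.0, "c": 5.0, "d": 40.0}))
```

The exponentially growing radii are what make the structure "multi-resolution": nearby peers are tracked at fine latency granularity, distant ones coarsely.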
Fast construction of nets in low-dimensional metrics and their applications
 SIAM Journal on Computing
, 2006
Abstract

Cited by 98 (10 self)
We present a near-linear time algorithm for constructing hierarchical nets in finite metric spaces with constant doubling dimension. This data structure is then applied to obtain improved algorithms for the following problems: approximate nearest neighbor search, well-separated pair decomposition, spanner construction, compact representation scheme, doubling measure, and computation of the (approximate) Lipschitz constant of a function. In all cases, the running (preprocessing) time is near-linear and the space used is linear.
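The nets in question are r-nets: sets of centers that are pairwise more than r apart yet cover every point within r. A quadratic-time greedy sketch (not the paper's near-linear construction) makes the definition concrete:

```python
def greedy_net(points, r, dist):
    """Return an r-net of points: centers are pairwise > r apart, and
    every point is within r of some center. Greedy rule: accept a point
    as a center unless an existing center already covers it."""
    centers = []
    for p in points:
        if all(dist(p, c) > r for c in centers):
            centers.append(p)
    return centers

pts = [0.0, 0.4, 1.0, 1.3, 2.5]
print(greedy_net(pts, 0.5, lambda a, b: abs(a - b)))
```

Stacking r-nets for r = 1, 2, 4, ... yields exactly the kind of hierarchy the abstract refers to.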
Distributed Approaches to Triangulation and Embedding
 In Proceedings of the 16th ACM-SIAM Symposium on Discrete Algorithms (SODA)
, 2005
Abstract

Cited by 30 (6 self)
A number of recent papers in the networking community study the distance matrix defined by the node-to-node latencies in the Internet and, in particular, provide a number of quite successful distributed approaches that embed this distance into a low-dimensional Euclidean space. In such algorithms it is feasible to measure distances among only a linear or near-linear number of node pairs; the rest of the distances are simply not available. Moreover, for applications it is desirable to spread the load evenly among the participating nodes. Indeed, several recent studies use this “fully distributed” approach and achieve, empirically, a low distortion for all but a small fraction of node pairs. This is concurrent with the large body of theoretical work on metric embeddings, but there is a fundamental distinction: in the theoretical approaches to metric embeddings, full and centralized access to the distance matrix is assumed and heavily used. In this paper we present the first fully distributed embedding algorithm with provable distortion guarantees for doubling metrics (which have been proposed as a reasonable abstraction of Internet latencies), thus providing some insight into the empirical success of the recent Vivaldi algorithm [7]. The main ingredient of our embedding algorithm is an improved fully distributed algorithm for the more basic problem of triangulation, where the triangle inequality is used to infer the distances that have not been measured; this problem has received considerable attention in the networking community, and has also been studied theoretically in [19]. We use our techniques to extend ɛ-relaxed embeddings and triangulations to infinite metrics and arbitrary measures, and to improve on the approximate distance labeling scheme of Talwar [36].
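The triangulation step described above can be sketched in a few lines: for two nodes u and v that have both measured their latencies to a common set of beacons, the triangle inequality sandwiches the unmeasured distance d(u, v). A toy illustration (beacon names and values are made up):

```python
def triangulate(du, dv):
    """Lower/upper bounds on the unmeasured distance d(u, v).

    du[b] and dv[b] are measured latencies from u and v to beacon b.
    The triangle inequality gives, for every beacon b:
        |du[b] - dv[b]| <= d(u, v) <= du[b] + dv[b].
    Taking the tightest bound over all beacons:
    """
    lo = max(abs(du[b] - dv[b]) for b in du)
    hi = min(du[b] + dv[b] for b in du)
    return lo, hi

du = {"b1": 10.0, "b2": 25.0}
dv = {"b1": 18.0, "b2": 20.0}
print(triangulate(du, dv))
```

More beacons can only tighten the interval, which is why beacon selection matters in the distributed setting.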
Searching dynamic point sets in spaces with bounded doubling dimension
 In the thirty-eighth annual ACM Symposium on Theory of Computing (STOC)
, 2006
Abstract

Cited by 29 (8 self)
We present a new data structure that facilitates approximate nearest neighbor searches on a dynamic set of points in a metric space that has a bounded doubling dimension. Our data structure has linear size and supports insertions and deletions in O(log n) time, and finds a (1 + ɛ)-approximate nearest neighbor in time O(log n) + (1/ɛ)^O(1). The search and update times hide multiplicative factors that depend on the doubling dimension; the space does not. These performance times are independent of the aspect ratio (or spread) of the points. Categories and Subject Descriptors: F.2.2 [Nonnumerical Algorithms and Problems]: Sorting and searching, computations on discrete structures; E.1 [Data Structures]: Graphs and networks, trees.
Small hop-diameter sparse spanners for doubling metrics
 In SODA ’06: Proceedings of the seventeenth annual ACM-SIAM Symposium on Discrete Algorithms
, 2006
Abstract

Cited by 18 (3 self)
Given a metric M = (V, d), a graph G = (V, E) is a t-spanner for M if every pair of nodes in V has a “short” path (i.e., of length at most t times their actual distance) between them in the spanner. Furthermore, this spanner has a hop diameter bounded by D if every such short path also uses at most D edges. We consider the problem of constructing sparse (1 + ε)-spanners with small hop diameter for metrics of low doubling dimension. In this paper, we show that given any metric with constant doubling dimension k, and any 0 < ε < 1, one can find a (1 + ε)-spanner for the metric with a nearly linear number of edges (i.e., only O(n log* n + nε^(−O(k))) edges) and a constant hop diameter, and also a (1 + ε)-spanner with a linear number of edges (i.e., only nε^(−O(k)) edges) which achieves a hop diameter that grows like the functional inverse of Ackermann’s function. Moreover, we prove that such tradeoffs between the number of edges and the hop diameter are asymptotically optimal.
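The t-spanner condition itself is easy to verify on small instances by comparing all-pairs shortest paths in the spanner against the metric. A brute-force check (Floyd–Warshall; the example metric is illustrative):

```python
def is_t_spanner(n, metric, edges, t):
    """metric[i][j] is the metric distance; edges lists the (i, j) pairs
    kept in the spanner. Returns True iff every spanner shortest path is
    within factor t of the corresponding metric distance."""
    INF = float("inf")
    sp = [[INF] * n for _ in range(n)]
    for i in range(n):
        sp[i][i] = 0.0
    for i, j in edges:
        sp[i][j] = sp[j][i] = metric[i][j]
    for k in range(n):                      # Floyd-Warshall relaxation
        for i in range(n):
            for j in range(n):
                if sp[i][k] + sp[k][j] < sp[i][j]:
                    sp[i][j] = sp[i][k] + sp[k][j]
    return all(sp[i][j] <= t * metric[i][j]
               for i in range(n) for j in range(n) if i != j)

# Points 0, 1, 2 on a line: dropping edge (0, 2) still gives a 1-spanner
# because the detour through node 1 has the same total length.
m = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(is_t_spanner(3, m, [(0, 1), (1, 2)], 1.0))
```

Note that the hop diameter is a separate constraint: the 2-edge path above is short in length but costs two hops, which is exactly the tradeoff the paper quantifies.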
Disorder inequality: A combinatorial approach to nearest neighbor search
 In WSDM’08
Abstract

Cited by 16 (4 self)
We say that an algorithm for nearest neighbor search is combinatorial if only direct comparisons between two pairwise similarity values are allowed. Combinatorial algorithms for nearest neighbor search have two important advantages: (1) they do not map similarity values to artificial distance values and do not use the triangle inequality for the latter, and (2) they work for arbitrarily complicated data representations and similarity functions. In this paper we introduce a special property of the similarity function on a set S that leads to efficient combinatorial algorithms for S. The disorder constant D(S) of a set S is defined to ensure the following inequality: if x is the a’th most similar object to z and y is the b’th most similar object to z, then x is among the D(S) · (a + b) most similar objects to y. Assuming that disorder is small, we present the first two known combinatorial algorithms for nearest neighbors whose query time has logarithmic dependence on the size of S. The first one, called Ranwalk, is a randomized zero-error algorithm that always returns the exact nearest neighbor. It uses space quadratic in the input size in preprocessing, but is very efficient in query processing. The second algorithm, called Arwalk, uses near-linear space. It uses random choices in preprocessing, but the query processing is essentially deterministic. For an arbitrary query q, there is only a small probability that the chosen data structure does not support q. Finally, we show that for the Reuters corpus the average disorder is indeed quite small and that Ranwalk efficiently computes the nearest neighbor in most cases.
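The disorder constant defined above can be computed by brute force on a tiny dataset: compute every similarity rank and take the worst ratio over all triples. A sketch (1-based ranks with rank_z(z) = 1 is one possible convention; the paper's exact convention may differ):

```python
def disorder_constant(points, dist):
    """Brute-force disorder constant: the smallest D such that
    rank_y(x) <= D * (rank_z(x) + rank_z(y)) holds for all x, y, z,
    where rank_z(x) is x's 1-based position when all points are
    sorted by distance to z (so rank_z(z) = 1)."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for z in range(n):
        order = sorted(range(n), key=lambda x: dist(points[z], points[x]))
        for pos, x in enumerate(order, start=1):
            rank[z][x] = pos
    D = 0.0
    for z in range(n):
        for x in range(n):
            for y in range(n):
                D = max(D, rank[y][x] / (rank[z][x] + rank[z][y]))
    return D

# Three points on a line; distances to each anchor are distinct,
# so the rankings are unambiguous.
print(disorder_constant([0.0, 1.0, 3.0], lambda a, b: abs(a - b)))
```

The O(n³) triple loop is only for illustration; the point of the paper is that queries can be answered without ever materializing these ranks.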
Combinatorial algorithms for nearest neighbors, near-duplicates and small-world design
 In Proceedings of the 20th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA’09
, 2009
Abstract

Cited by 11 (1 self)
We study the so-called combinatorial framework for algorithmic problems in similarity spaces. Namely, the input dataset is represented by a comparison oracle that, given three points x, y, y′, answers whether y or y′ is closer to x. We assume that the similarity order of the dataset satisfies the four variations of the following disorder inequality: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z, where D is a relatively small disorder constant. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, one can still design very efficient algorithms for various fundamental computational tasks. For nearest neighbor search we present a deterministic and exact algorithm with almost linear time and space complexity of preprocessing, and near-logarithmic time complexity of search. Then, for near-duplicate detection we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output. Finally, we show that for any dataset satisfying the disorder inequality a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor of a target in a logarithmic number of steps. The latter result is the first known workaround for Navarro’s impossibility of generalizing Delaunay graphs. The technical contribution of the paper consists of handling “false positives” in data structures and an algorithmic technique, the “up-aside-down” filter.
A QPTAS for TSP with Fat Weakly Disjoint Neighborhoods in Doubling Metrics
Abstract

Cited by 4 (0 self)
We consider the Traveling Salesman Problem with Neighborhoods (TSPN) in doubling metrics. The goal is to find a shortest tour that visits each of a collection of neighborhoods.
Similarity search via combinatorial nets
Abstract

Cited by 1 (0 self)
We consider the Nearest Neighbor Search problem in the so-called combinatorial framework: only direct comparisons between two pairwise similarity values are allowed. We assume that the similarity order for the input dataset has the following consistency property: if x is the a’th most similar object to y and y is the b’th most similar object to z, then x is among the D(a + b) most similar objects to z. Though the oracle gives much less information compared to the standard general metric space model where distance values are given, it turns out that one can still design a deterministic preprocessing algorithm with almost linear time and space complexity, and answer queries deterministically in near-logarithmic time. A key procedure of our main algorithm is the efficient construction of combinatorial nets. We show that this data structure is useful for solving other important problems. For example, motivated by navigability questions, we show that for any dataset a visibility graph can be constructed: all out-degrees are near-logarithmic and greedy routing deterministically converges to the nearest neighbor in a logarithmic number of steps. Also, for the near-duplicate detection problem we present the first known deterministic algorithm that requires just near-linear time plus time proportional to the size of the output.
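The greedy routing guarantee described above is easy to simulate: from the current node, hop to whichever out-neighbor is most similar to the target, and stop when no neighbor improves. A toy sketch (the graph and similarity function are illustrative, not the paper's construction):

```python
def greedy_route(graph, sim, start, target):
    """Hop to the out-neighbor most similar to the target; stop when the
    current node is at least as similar to the target as every neighbor.
    Returns the full path taken."""
    path = [start]
    cur = start
    while True:
        best = max(graph[cur], key=lambda v: sim(v, target), default=None)
        if best is None or sim(best, target) <= sim(cur, target):
            return path
        cur = best
        path.append(cur)

# Nodes 0..3 laid out on a line; similarity is negative distance, so the
# node most similar to the target is the nearest one. Each hop strictly
# improves similarity, so the walk terminates.
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
sim = lambda u, v: -abs(u - v)
print(greedy_route(graph, sim, 0, 3))
```

On a graph built as the paper describes, such a walk reaches the true nearest neighbor in a logarithmic number of hops; this sketch only demonstrates the routing rule, not that guarantee.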