## Nearest-neighbor searching and metric space dimensions (2006)

Venue: | In Nearest-Neighbor Methods for Learning and Vision: Theory and Practice |

Citations: | 87 - 0 self |

### BibTeX

@INPROCEEDINGS{Clarkson06nearest-neighborsearching,

author = {Kenneth L. Clarkson},

title = {Nearest-neighbor searching and metric space dimensions},

booktitle = {Nearest-Neighbor Methods for Learning and Vision: Theory and Practice},

year = {2006},

publisher = {MIT Press}

}

### Abstract

Given a set S of n sites (points), and a distance measure d, the nearest-neighbor searching problem is to build a data structure so that given a query point q, the site nearest to q can be found quickly. This paper gives a data structure for this problem; the data structure is built using the distance function as a “black box”. The structure is able to speed up nearest-neighbor searching in a variety of settings, for example: points in low-dimensional or structured Euclidean space, strings under Hamming and edit distance, and bit vector data from an OCR application. The data structures are observed to need linear space, with a modest constant factor. The preprocessing time needed per site is observed to match the query time. The data structure can be viewed as an application of a “kd-tree” approach in the metric space setting, using Voronoi regions of a subset in place of axis-aligned boxes.
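
To make the problem statement concrete, the following is a minimal sketch of the brute-force baseline that any such data structure tries to beat: a linear scan that treats the distance function as a black box and evaluates it once per site. It is not the data structure from the paper. The names `nearest_site` and `hamming`, and the sample strings, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def nearest_site(
    sites: Sequence[T], q: T, d: Callable[[T, T], float]
) -> Tuple[T, float]:
    """Linear-scan baseline: return the site nearest to q and its distance.

    The distance d is used only as a black box; it is called exactly
    once per site, so a query costs len(sites) distance evaluations.
    """
    if not sites:
        raise ValueError("sites must be non-empty")
    best = sites[0]
    best_dist = d(q, best)
    for s in sites[1:]:
        dist = d(q, s)
        if dist < best_dist:  # ties keep the earlier site
            best, best_dist = s, dist
    return best, best_dist


def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


# Illustrative data only (not from the paper).
sites = ["karolin", "kathrin", "kerstin"]
result = nearest_site(sites, "kathrln", hamming)
print(result)
```

Because the scan only ever calls `d(q, s)`, the same function works unchanged for Euclidean points, edit distance, or any other distance measure; a search structure built on the same black-box interface aims to answer the query with far fewer than n distance evaluations.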

### Citations

1213 | An algorithm for vector quantizer design - Linde, Buzo, et al. - 1980 |

994 |
A probabilistic theory of pattern recognition
- Devroye, Györfi, et al.
- 1996
Citation Context ... practice.) Vector quantization, and the quantization dimension, are discussed further in §4.3. As noted above, the problem of classification has long been approached using nearest-neighbor searching [DGL96]. Points in the space (typically $\Re^d$) correspond to sets of objects, and the point coordinates encode various properties of the objects. Each object also has a “color,” say red or blue, corresponding... |

764 |
Geometric Measure Theory
- Federer
- 1969
Citation Context ...is a number $\mathrm{doub}_M(Z)$ such that $\mu(B(x, 2r)) \le \mu(B(x, r))\, 2^{\mathrm{doub}_M(Z)}$ for all $x$ and $r$. Such a space is also called a growth-restricted metric [KR02] or Federer measure or a diametrically regular measure [Fed69]. The definition is sometimes relaxed, so that only balls $B(x, r)$ with $\mu(B(x, r))$ sufficiently large need satisfy the doubling condition. For such a space there is a smallest number $\dim_D(Z)$ such that ... |

696 |
The Art of Computer Programming, Volume 3: Sorting and Searching
- Knuth
- 1974
Citation Context ...h that D(q, s) = D(q, S). This problem has been studied for a long time, and has many names in a large and diverse literature. In an early proposal for a solution, due to McNutt (as discussed by Knuth [Knu98]), it was called the post office problem. In another ∗ Bell Labs; 600 Mountain Avenue; Murray Hill, New Jersey 07974; clarkson@research.bell-labs.com † Previous version: April 2005; Original: February... |

536 | Fractal Geometry: Mathematical Foundations and Applications - Falconer - 1990 |

404 |
Geometry of Sets and Measures in Euclidean spaces. Cambridge Univ
- Mattila
- 1995
Citation Context ...The Riesz $t$-energy of a measure is $I_t(\mu) := \iint \frac{1}{D(x, y)^t}\, d\mu(x)\, d\mu(y)$, and the energy dimension is $\sup\{t \mid I_t(\mu) < \infty\}$. This energy is related to the pointwise dimension: for given $x$, it can be shown [Mat95] that $\int \frac{1}{D(x, y)^t}\, d\mu(y) = t \int_0^\infty r^{-t-1} \mu(B(x, r))\, dr$. If $\mu(U)$ is bounded, and the upper pointwise dimension is bounded everywhere by some $v > t$, that is, for all $x \in U$, $\mu(B(x, r)) = O(r^v)$, then I... |

341 | On the resemblance and containment of documents
- Broder
- 1997
Citation Context ... = µ(A) + µ(B) + µ(A∆B) 2µ(A ∪ B) = µ(A∆B)/µ(A ∪ B), which is called the Steinhaus distance [DL97]. The special case for finite sets, |A∆B|/|A ∪ B|, is called the Tanimoto distance [RT60], resemblance [Bro97], set similarity distance [Cha02], Jaccard distance [Jac01, Spa80], or Marczewski-Steinhaus distance [MS58]. It has been proven a metric in several ways [DL97, Spa80, XA03, Cha02]. The above follows i... |

322 | Searching in Metric Spaces - Chávez, Navarro, et al. - 2001 |

322 | Skip lists: A probabilistic alternative to balanced trees
- Pugh
- 1989
Citation Context ...l number of sites to determine the closest site at level i − 1 or higher, repeating until the closest site at level 0 is found. This description shows that the data structure is similar to a skip list [Pug90], which is a way to accelerate searching in a linear list of ordered values; such searching is the one-dimensional version of nearest-neighbor searching. The skip list approach can be applied to a bro... |

280 |
Clustering to minimize the maximum intercluster distance, Theoret
- Gonzalez
- 1985
Citation Context .../ɛ. There is a greedy algorithm for finding ɛ-nets that has been applied to building data structures for nearest-neighbor searching [Bri95, Woj03, Cla03, HPS05], as well as other optimization problems [Gon85]. These relations are discussed in Section 4 and Subsection 5.2.4. 1.1 Scope, and Survey of Surveys There are many important aspects of nearest-neighbor searching that are not covered here, but have b... |

271 | Data structures and algorithms for nearest neighbor search in general metric spaces
- Yianilos
- 1993
Citation Context ...ren. Burkhard and Keller [BK73] proposed a multibranch version for discrete-valued metrics. Metric trees, in many variations, were also invented by Omohundro [Omo89], by Uhlmann [Uhl91], and by Yianilos [Yia93], and they have a large literature. For further discussion of them, prior surveys can be consulted [HS03, CNBYM01]. 4 Dimensions While it is easy to construct or encounter metric spaces for which brut... |

254 | Think globally, fit locally: unsupervised learning of low dimensional manifolds
- Saul, Roweis
Citation Context ...N) problem is to find, for each site s, the k sites closest to s. Solving this problem is a common preprocessing step for “manifold reconstruction” in the computational geometry [FR02], learning theory [SR03], and computer graphics [Hor03] literatures. Note that the answer to the closest-pair problem can easily be found using the answer to the all-k-NN problem. Similarly, the max-min distance max_i min_j D(... |

252 |
Measuring the strangeness of strange attractors
- Grassberger, Procaccia
- 1983
Citation Context ...ral can be estimated using a fixed-radius all-sites query. Historically, the quadtree-based view was proposed first, and the distance-based version was proposed as a more accurate empirical estimator [GP83]. For a given set of sample points, the quadtree estimate is easier to compute than the correlation integral, and so Belussi and Faloutsos [BF98] use the quadtree estimator in the context of dat... |

244 |
Searching in High-Dimensional Spaces – Index Structures for Improving the Performance of Multimedia Databases
- Böhm, Berchtold, et al.
Citation Context ...st-pair problems, including insertions and deletions of sites [Smi00]; and another [AGE+02] on data structures to allow moving sites to be handled efficiently [Ata85, Kah91, BGH97]. A recent survey [BBK01] and book [PM05] describe nearest-neighbor searching from a database perspective. There are at least two prior surveys of searching in general metric spaces [CNBYM01, HS03]. These surveys discuss in d... |

240 |
Geometry of cuts and metrics
- Deza, Laurent
- 1997
Citation Context ...ar, scaling a single metric by a positive constant also gives a metric.) In other words, the set of metrics on U is closed under nonnegative combination, and forms a cone; such cones are well studied [DL97]. • Metric Transforms. If f is a real-valued function of the nonnegative reals, and f(0) = 0, and f(z) is monotone increasing and concave for z ≥ 0, then $\hat{D}(x, y) := f(D(x, y))$ is a metric [DL97]. Fo... |

231 | Data Structures for Mobile Data - Basch, Guibas, et al. - 1997 |

230 | Similarity estimation techniques from rounding algorithms
- Charikar
Citation Context ... = µ(A∆B)/µ(A ∪ B), which is called the Steinhaus distance [DL97]. The special case for finite sets, |A∆B|/|A ∪ B|, is called the Tanimoto distance [RT60], resemblance [Bro97], set similarity distance [Cha02], Jaccard distance [Jac01, Spa80], or Marczewski-Steinhaus distance [MS58]. It has been proven a metric in several ways [DL97, Spa80, XA03, Cha02]. The above follows in part Deza and Laurent [DL97], a... |

217 | Probability Theory - Rényi - 1970 |

192 | Data networks as cascades: Investigating the multifractal nature of Internet WAN traffic
- Feldmann, Gilbert, et al.
- 1998
Citation Context ...corresponds to v = 1, and the correlation dimension corresponds to v = 2. The Rényi spectrum is much-studied in the area of chaotic, multifractal systems, such as turbulence, the web, network traffic [FGW98], and Bayesian belief networks [GH04]. Another dimension value on the Rényi spectrum can be computed by way of minimum spanning trees, or other extremal geometric graphs, as discussed in Section 5. Th... |

189 |
A best possible heuristic for the k-center problem
- Hochbaum, Shmoys
- 1985
Citation Context ... it is an approximation algorithm for the k-center problem, of finding the k points whose maximum distance to any point in U is minimized. Gonzalez [Gon85], and, independently, Hochbaum and Shmoys [HS85], showed that this is the best possible approximation factor for a polynomial-time algorithm on a general metric space, unless P = NP. As mentioned, this algorithm has been used in building nearest-n... |

176 | Scaling and related techniques for geometry problems - Gabow, Bentley, et al. - 1984 |

165 |
Satisfying General Proximity/Similarity Queries with Metric Trees
- Uhlmann
- 1991
Citation Context ...ng to explore both children. Burkhard and Keller [BK73] proposed a multibranch version for discrete-valued metrics. Metric trees, in many variations, were also invented by Omohundro [Omo89], by Uhlmann [Uhl91], and by Yianilos [Yia93], and they have a large literature. For further discussion of them, prior surveys can be consulted [HS03, CNBYM01]. 4 Dimensions While it is easy to construct or encounter metr... |

154 |
The shortest path through many points
- Beardwood, Halton, et al.
- 1959
Citation Context ...ed on a d-manifold. For $v$ with $0 < v < d$, let $L(G, v) := \sum_{e \text{ an edge of } G} \ell(e)^v$, an edge length power sum of G. Costa and Hero use the fact, going back to the celebrated results of Beardwood et al. [BHH59], that $L(G, v)/n = n^{-v/d + o(1)}$ as $n \to \infty$, for the extremal graphs just mentioned, and others. (cf. (7), (8), (10)) Yukich’s monograph [Yuk98] surveys results in this setting.) This allows the topologic... |

149 | Finding nearest neighbors in growth-restricted metrics
- Karger, Ruhl
- 2002
Citation Context ... D, µ) with a doubling measure [Hei03] is one for which there is a number $\mathrm{doub}_M(Z)$ such that $\mu(B(x, 2r)) \le \mu(B(x, r))\, 2^{\mathrm{doub}_M(Z)}$ for all $x$ and $r$. Such a space is also called a growth-restricted metric [KR02] or Federer measure or a diametrically regular measure [Fed69]. The definition is sometimes relaxed, so that only balls $B(x, r)$ with $\mu(B(x, r))$ sufficiently large need satisfy the doubling condition. ... |

134 | Index-Driven Similarity Search in Metric Spaces
- Hjaltason, Samet
- 2003
Citation Context ... early proposal, it was called best-match file searching [BK73]. In the database or information-retrieval literature, it might be called the problem of building an index for similarity search [HS03]. In the information theory literature, it arises as the problem of building a vector quantization encoder [LBG80, GN93]. In the pattern recognition (or statistics or learning theory) literature, it m... |

129 |
Some approaches to best-match file searching
- Burkhard, Keller
- 1973
Citation Context ... early proposal, it was called best-match file searching [BK73]. In the database or information-retrieval literature, it might be called the problem of building an index for similarity search [HS03]. In the information theory literature, it arises as the problem ... |

123 | Navigating nets: simple algorithms for proximity search
- Krauthgamer, Lee
- 2004
Citation Context ...condition. For such a space there is a smallest number $\dim_D(Z)$ such that $\sup_{x \in U,\, r > 0} \mu(B(x, r))/\mu(B(x, \epsilon r)) = 1/\epsilon^{\dim_D(Z) + o(1)}$ as $\epsilon \to 0$. (cf. (3).) It is not hard to show that $\mathrm{doub}_A(Z) \le 4\, \mathrm{doub}_M(Z)$ [KL04] and that $\dim_A(Z) \le \dim_D(Z)$. If the inputs to a nearest-neighbor searching problem are such that (S ∪ {q}, D, µc) is a doubling measure, then several provably good data structures exist for searchi... |

120 | Étude Comparative de la Distribution Florale dans une Portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles - Jaccard - 1901 |

118 |
A branch and bound algorithm for computing k-nearest neighbors
- Fukunaga, Narendra
- 1975
Citation Context ..., the closest site currently known is maintained, and a subtree need not be searched if its sites can be ruled out using the covering radius, as above. (A very early proposal by Fukunaga and Narendra [FN75] for nearest-neighbor searching uses Voronoi grouping with a large branching factor, that is, |P| is large. Their method does not apply to general metric spaces.) Another data structure that uses Vor... |

114 |
A Randomized Algorithm for Closest-Point Queries
- Clarkson
- 1988
Citation Context ...mines the number of levels in the data structure, and the latter is needed to determine the size of the data structure. One example of such a scheme is an algorithm by Clarkson for the Euclidean case [Cla88]. In the examples below, the divide-and-conquer scheme is based on finding the nearest neighbor to q in a subset P ⊂ S. To motivate such approaches, we return to some basic considerations regarding ne... |

112 | Nearest neighbor queries in metric spaces
- Clarkson
- 1999
Citation Context ...], while the empirical doubling measure (growth-restricted) scheme follows ideas of an earlier paper by Karger and Ruhl [KR02]. Finally, the “exchangeable queries” model follows a still earlier paper [Cla99]. The problem with applying the direct approach, as described in Subsection 5.2.2, is that the sizes of the subproblems are too big: ideally, the sum of the subproblem sizes $|B_p \cap S|$, over $p \in P$, woul... |

109 |
Foundations of quantization for probability distributions. Volume 1730, Lecture Notes in Mathematics
- Graf, Luschgy
- 2000
Citation Context ... n → ∞. The quantization dimension can also be defined for v = ∞, and with “upper” and “lower” versions, and these are equal to the upper and lower box dimension of the support of µ. Graf and Luschgy [GL00] discuss the quantization dimension in detail. The energy dimension is defined as follows. The Riesz $t$-energy of a measure is $I_t(\mu) := \iint \frac{1}{D(x, y)^t}\, d\mu(x)\, d\mu(y)$, and the energy dimension is sup{t | I... |

108 | Influence sets based on reverse nearest neighbor queries - Korn, Muthukrishnan - 2000 |

107 |
Dimension theory in dynamical systems: Contemporary views and applications
- Pesin
- 1997
Citation Context ...e answer to a fixed-radius neighbor query is likely to be. (Such a query asks for all sites inside a given ball.) It is a basic concept of multifractal analysis, as used in studying dynamical systems [Pes97]. As another example: ɛ-nets are a kind of well-distributed subset of a metric space, such that every point in the space is within distance ɛ of the net. The box dimension of the space determines the ... |

104 | Approximate Nearest Neighbor Queries in Fixed Dimensions
- Arya, Mount
- 1993
Citation Context ...Delaunay method can be very expensive, because the total number of Delaunay neighbors can be large, in the Euclidean case there are some traversal approximation algorithms. Given ɛ > 0, Arya and Mount [AM93] found an easily computed list $L_s$ of size independent of the number of sites, such that for any q, if s is closer to q than any member of $L_s$, then s is (1 + ɛ)-near to q in s. This yields a traversal ... |

97 | Fast construction of nets in low dimensional metrics, and their applications
- Har-Peled, Mendel
Citation Context ...The ɛ-net divide-and-conquer approach can also use a permutation: the one that arises from the greedy ɛ-net construction procedure described in Subsection 4.1. This permutation is used in [Cla03] and [HPM05]. 5.2.3 Traversal Data Structures and Skip Lists Four ways of generating a nested sequence of subsets of the sites were just described, two for the random approaches, and two for the ɛ-net approaches.... |

97 |
Maximum likelihood estimation of intrinsic dimension
- Levina, Bickel
Citation Context ...ervations were made by Pettis et al. [KTAD79], Verveer and Duin [VD95], and van de Water and Schram [vdWS88]. A derivation of a similar estimator via maximum likelihood was given by Levina and Bickel [LB05]. Heuristically, (9) can be understood by considering $\epsilon_k$ such that the ball $B(x, \epsilon_k)$ has probability mass $\mu(B(x, \epsilon_k)) = k/n$. The expected number of points in the sample falling in $B(x, \epsilon_k)$ is k, and s... |

84 | Entropy, Hausdorff measures old and new, and limit sets of geometrically finite Kleinian groups - Sullivan - 1984 |

77 |
The earth mover’s distance is the mallows distance: some insights from statistics
- Levina, Bickel
- 2001
Citation Context ...DL97, Spa80, XA03, Cha02]. The above follows in part Deza and Laurent [DL97], and also Indyk and Matoušek [IM04]; the latter describe other metric constructions, including the earth-mover (or Mallows [LB01]), Fréchet, and block-edit distances. The above hardly exhausts the distances and metrics that have been considered, even by applying the constructions repeatedly. For example, for two probability dis... |

76 | Nearest Neighbors in High-Dimensional Spaces. Handbook of Discrete and Computational Geometry (2nd Edition)
- Indyk
- 2004
Citation Context ...s of nearest-neighbor searching that are not covered here, but have been surveyed elsewhere. Several surveys of nearest-neighbor searching in $\Re^d$ have been done: one focuses on high-dimensional spaces [Ind04]; another on closest-pair problems, including insertions and deletions of sites [Smi00]; and another [AGE+02] on data structures to allow moving sites to be handled efficiently [Ata85, Kah91, BGH97]... |

68 |
An algorithm for approximate closest-point queries
- Clarkson
- 1994
Citation Context ...ithm. In this setting, it is possible to find a list with the same properties as described for Arya and Mount, and whose size is within a provably small factor of the smallest possible for such a list [Cla94]. In the metric-space setting, Navarro [Nav02] proposed a heuristic data structure with a similar but somewhat more complicated searching method. His construction is very similar to one of those of Ar... |

66 | Geodesic entropic graphs for dimension and entropy estimation in manifold learning
- Costa, Hero
Citation Context ...of the Hausdorff and pointwise dimensions, perhaps their bound is a kind of graph Hausdorff dimension. Extremal Graphs as Dimensional Estimators. In the setting of Euclidean manifolds, Costa and Hero [CH04] propose the use as dimension estimators of the minimum spanning tree, matching, k-NN graph, or other extremal graphs. Suppose G is such a graph for a set of n sites independently, identically dist... |

65 | Searching in metric spaces by spatial approximation
- Navarro
- 2002
Citation Context ...a list with the same properties as described for Arya and Mount, and whose size is within a provably small factor of the smallest possible for such a list [Cla94]. In the metric-space setting, Navarro [Nav02] proposed a heuristic data structure with a similar but somewhat more complicated searching method. His construction is very similar to one of those of Arya and Mount. Even when the sizes of the lists... |

65 | Closest point problems in computational geometry
- Smid
- 2000
Citation Context ...here. Several surveys of nearest-neighbor searching in $\Re^d$ have been done: one focuses on high-dimensional spaces [Ind04]; another on closest-pair problems, including insertions and deletions of sites [Smi00]; and another [AGE+02] on data structures to allow moving sites to be handled efficiently [Ata85, Kah91, BGH97]. A recent survey [BBK01] and book [PM05] describe nearest-neighbor searching from a da... |

64 |
Probability Theory of Classical Euclidean Optimization
- Yukich
- 1998
Citation Context ...t, going back to the celebrated results of Beardwood et al. [BHH59], that $L(G, v)/n = n^{-v/d + o(1)}$ as $n \to \infty$, for the extremal graphs just mentioned, and others. (cf. (7), (8), (10)) Yukich’s monograph [Yuk98] surveys results in this setting.) This allows the topological dimension d of a manifold to be estimated as a function of $L(G, v)$ and $n$, so for example, $d = \lim_{n \to \infty} \frac{\log(1/n)}{\log(L(G, 1)/n)}$ with probab... |

60 | Five balltree construction algorithms
- Omohundro
- 1989
Citation Context ...t is, without needing to explore both children. Burkhard and Keller [BK73] proposed a multibranch version for discrete-valued metrics. Metric trees, in many variations, were also invented by Omohundro [Omo89], by Uhlmann [Uhl91], and by Yianilos [Yia93], and they have a large literature. For further discussion of them, prior surveys can be consulted [HS03, CNBYM01]. 4 Dimensions While it is easy to construc... |

60 | Two definitions of fractional dimension - Tricot - 1982 |

57 | A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements
- Micó, Oncina, et al.
- 1994
Citation Context ...t random. While this scheme is simple and answers queries quickly, the quadratic preprocessing and storage limit its applicability. The Linear Approximating and Eliminating Search Algorithm, or LAESA [MOV94], reduces these needs by precomputing and storing the distances from all sites to only a subset V of the sites, called pivots. The algorithm proceeds as in AESA, but only applies the update step 4 whe... |

55 | Fast algorithms for the all nearest neighbors problem - Clarkson - 1983 |

53 | Intrinsic dimension estimation using packing numbers - Kegl - 2003 |