## Fast algorithms for nearest neighbour search (2007)

Citations: 2 (1 self)

### BibTeX

```bibtex
@techreport{Kibriya07fastalgorithms,
  author      = {Ashraf Masood Kibriya},
  title       = {Fast algorithms for nearest neighbour search},
  institution = {},
  year        = {2007}
}
```

### Abstract

The nearest neighbour problem is of practical significance in a number of fields. Often we are interested in finding an object near to a given query object. The problem is old, and a large number of solutions have been proposed for it in the literature. However, it remains the case that even the most popular of the techniques proposed for its solution have not been compared against each other. Also, many techniques, including the old and popular ones, can be implemented in a number of ways, and often the different implementations of a technique have not been thoroughly compared either. This research presents a detailed investigation of different implementations of two popular nearest neighbour search data structures, KDTrees and Metric Trees, and compares the different implementations of each of the two structures against each other. The best implementations of these structures are then compared against each other and against two other techniques, the Annulus Method and Cover Trees. The Annulus Method is an old technique that was rediscovered during the research for this thesis. Cover Trees are one of the most novel and promising data structures for nearest neighbour search that have been proposed in the literature.

### Acknowledgments

The continued support of the Department of Computer Science's Machine Learning group, and particularly of my supervisor Dr. Eibe Frank, is greatly appreciated; without it this thesis would not have been possible.

### Citations

3016 | Indexing by Latent Semantic Analysis
- Deerwester, Dumais, et al.
- 1990
Citation Context: ...m, which tracks players in a hockey rink (Cai et al., 2006). • Document/information retrieval: Here, NN search is an often-used method to retrieve and rank documents given a user query (Lucarella, 1988; Deerwester et al., 1990; Faloutsos & Oard, 1995). 1.4 Characteristics Common to NN Applications In almost all of the above, the general representation of objects of interest (documents, images, etc.) including the queries i...

2380 | R-trees: A Dynamic Index Structure for Spatial Searching
- Guttman
- 1984
Citation Context: ...with external memory, and to work well in a more dynamic setting (i.e. with efficient insertion and deletion operations, since all the data points cannot be known in advance at construction). R-Trees (Guttman, 1984), like KDTrees, hierarchically partition the data into hyperrectangles. However, the partitioning is achieved also using hyperrectangles instead of hyperplanes as in KDTrees. The partitioned rectangu...

2323 | The Elements of Statistical Learning
- Hastie, Tibshirani, et al.
- 2001
Citation Context: ...gorithms for ɛ-NN search, and also the newer algorithms that are based on the concept of intrinsic dimensionality (defined below) of the data. A more meticulous coverage of the curse can be found in (Hastie et al., 2001), (Katayama & Satoh, 2001) and (Fayyad et al., 1996). Also, there is some debate regarding the usefulness of the NN problem at such high dimensions (Beyer et al., 1999; Hinneburg et al., 2000). In pr...

1210 | Multidimensional binary search trees used for associative searching
- Bentley
- 1975
Citation Context: ...rees), BBF-Trees and Variants Multidimensional binary search trees, called in short by the author as KDTrees (where k is the dimensionality of the space), were originally proposed by Bentley in 1975 (Bentley, 1975) for associative retrieval of records in a file. Their potential for NN search was observed by Bentley, and hence they were quickly adopted for NN searching, with an optimized version by Friedman, Bentley...
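The construction scheme the context describes — recursively splitting on one coordinate at a time — can be sketched as follows. This is a minimal illustration of a Bentley-style KDTree with median splits and branch-and-bound NN search, not the thesis's implementation; all names are illustrative.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a KD-tree, cycling through the k dimensions."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }

def nearest(node, query, best=None):
    """Depth-first NN search, pruning subtrees on the splitting hyperplane."""
    if node is None:
        return best
    d = math.dist(node["point"], query)
    if best is None or d < best[1]:
        best = (node["point"], d)
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    # Visit the far side only if the ball around the current best
    # crosses the splitting hyperplane.
    if abs(diff) < best[1]:
        best = nearest(far, query, best)
    return best
```

The median split gives a balanced tree; as noted later in the document, other split choices (midpoint of widest dimension, median of widest dimension) are possible.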

1066 | The r*-tree: An efficient and robust access method for points and rectangles
- Beckmann, Kriegel, et al.
- 1990
Citation Context: ... minimum bounding rectangles of the points contained in their descendant leaves. A number of variants exist that improve on the basic algorithm, such as the R+-Trees (Sellis et al., 1987), R*-Trees (Beckmann et al., 1990) and recently proposed Priority R-Trees (PR-Trees) (Arge et al., 2004). The variant R*-Tree is the most popular one used in practice. X-Trees (Berchtold et al., 1996) are similar to R-Trees. The only...

965 | Nearest neighbor pattern classification
- Cover, Hart
- 1967
Citation Context: ...oded with the index of its nearest neighbour among the codevectors. • Pattern recognition, datamining and machine learning: Here, one of the most widely used classifiers/learners is the kNN classifier (Cover & Hart, 1967). It is based on a straightforward adoption of kNN search, and works by assigning a given test point the majority class of its k-nearest neighbours. Also, Locally weighted learning (Atkeson et al....
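The kNN classifier described in this context — assign a test point the majority class of its k nearest neighbours — fits in a few lines. A minimal sketch assuming Euclidean distance and a list of (point, label) training pairs; the names are illustrative:

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """Majority vote among the k training points nearest to `query`.
    `train` is a list of (point, label) pairs."""
    neighbours = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

The brute-force sort here is O(n log n) per query; the data structures surveyed in this thesis exist precisely to avoid that linear scan.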

934 | Query by image and video content: the QBIC system
- Flickner, Sawhney, et al.
- 1995
Citation Context: ...ntent similar to a user query. The systems usually allow content-based queries, i.e. queries in the form of object shapes, texture, dominant colours, and scene descriptions etc. for images and video (Flickner et al., 1995; Pentland et al., 1994; Smith & Chang, 1996; Bach et al., 1996), and in the form of dominant frequency and pitch (which can also be given as an acoustic input from the user) etc. in case of an audio/...

818 | An optimal algorithm for approximate nearest neighbor searching in fixed dimensions
- Arya, Mount, et al.
- 1998
Citation Context: ...d for some particular metric (e.g. the initial version of LSH (Indyk & Motwani, 1998)). Many times even the NN-search problem itself has been defined with insistence on points being in a metric space (Arya et al., 1998; Maneewongvatana & Mount, 2002; Indyk & Motwani, 1998; Indyk, 1998, 2002). This, though, is pretty restrictive, given some of the earliest studies of the problem used more general measures instead...

762 | Approximate nearest neighbors: Towards removing the curse of dimensionality
- Indyk, Motwani
- 1998
Citation Context: ...hich we are interested in the k (≤ |S|) nearest points to q, contained in S. The NN search then just becomes a special case of kNN search with k=1. A slight variation of NN search, advocated by some (Indyk & Motwani, 1998; Datar et al., 2004) in place of NN search, is ɛ-approximate NN (ɛ-NN) search, where given a user-defined error bound ɛ ≥ 0, the task is to fi... [Figure 1.1: ɛ-Approximate Nearest Neighbour]
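The ɛ-NN acceptance condition the context introduces is: a returned point p′ is acceptable when d(q, p′) ≤ (1 + ɛ) · d(q, p*), where p* is the true nearest neighbour. A small checker, illustrative only (the function name and brute-force true-NN computation are assumptions, not from the thesis):

```python
import math

def is_eps_approximate_nn(query, candidate, points, eps):
    """True iff `candidate` is within a factor (1 + eps) of the true
    nearest-neighbour distance to `query` over `points`."""
    true_nn_dist = min(math.dist(query, p) for p in points)
    return math.dist(query, candidate) <= (1 + eps) * true_nn_dist
```

With ɛ = 0 this reduces to exact NN search, matching the text's remark that NN is the k = 1, ɛ = 0 special case.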

730 | The Art of Computer Programming, Volume 2, Seminumerical Algorithms
- Knuth
- 1998
Citation Context: ...mean 0 and standard deviation specified by the parameter std_dev. This distribution in the ANN library is implemented using Box, Muller, and Marsaglia's polar method, similar to the one described in (Knuth, 1997). For this thesis, the std_dev parameter for the generated sets was left at the default value of 1.0. • Laplacian: In this distribution each coordinate xi of a point was generated from a Laplacian distr...
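The polar method mentioned here rejection-samples a point in the unit disc and transforms its squared radius into a Gaussian deviate. A minimal sketch of Marsaglia's polar method, not the ANN library's actual code:

```python
import math
import random

def polar_gaussian(mean=0.0, std_dev=1.0):
    """One sample from N(mean, std_dev^2) via Marsaglia's polar method."""
    while True:
        # Uniform point in [-1, 1]^2; accept only points inside the unit disc.
        u = 2.0 * random.random() - 1.0
        v = 2.0 * random.random() - 1.0
        s = u * u + v * v
        if 0.0 < s < 1.0:
            break
    # Transform: u * sqrt(-2 ln s / s) is a standard normal deviate.
    return mean + std_dev * u * math.sqrt(-2.0 * math.log(s) / s)
```

The method avoids the trigonometric calls of the basic Box-Muller transform at the cost of rejecting about 21% of candidate points.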

636 | An Algorithm for Finding Best Matches in Logarithmic Expected Time
- Friedman, Bentley, et al.
- 1977
Citation Context: ...tana & Mount, 2002; Indyk & Motwani, 1998; Indyk, 1998, 2002). This, though, is pretty restrictive, given some of the earliest studies of the problem used more general measures instead of metrics (Friedman et al., 1977). 1.5 Objectives and Scope of the Thesis It was observed during the review of the literature that even the most popular of the large number of techniques proposed since the initial inception of the p...

558 | Comparison of discrimination methods for the classification of tumors using gene expression data
- Dudoit, Fridlyand, et al.
- 2002
Citation Context: ...al data. They have been used for cancer classification (Niijima & Kuhara, 2005), for detecting rRNA sequences (Robinson-Cox et al., 1995), and, using gene-expression data, for tumour classification (Dudoit et al., 2002) and tissue classification (Li et al., 2004). In cases involving gene selection (Niijima & Kuhara, 2005; Dudoit et al., 2002; Li et al., 2004), these classifiers have been observed to perform as wel...

556 | M-tree: An efficient access method for similarity search in metric spaces
- Ciaccia, Patella, et al.
- 1997
Citation Context: ...imizing split algorithm and the concept of super nodes, which, as shown by the authors, enhances their performance by orders of magnitude compared to R*-trees and TV-Trees (discussed below). M-Trees (Ciaccia et al., 1997) are the database variant of metric trees, with optimizations for reducing I/O costs. They hierarchically partition the point space, just like metric trees, into hyperspherical regions. The authors ...

543 | The X-Tree: An Index Structure for High-Dimensional Data
- Berchtold, Keim, et al.
- 1996
Citation Context: ...s (Sellis et al., 1987), R*-Trees (Beckmann et al., 1990) and recently proposed Priority R-Trees (PR-Trees) (Arge et al., 2004). The variant R*-Tree is the most popular one used in practice. X-Trees (Berchtold et al., 1996) are similar to R-Trees. The only major difference is that they employ an overlap-minimizing split algorithm and the concept of super nodes, which, as shown by the authors, enhances their performance...

495 | Locally weighted learning
- Atkeson, Schaal, et al.
- 1997
Citation Context: ...r & Hart, 1967). It is based on a straightforward adoption of kNN search, and works by assigning a given test point the majority class of its k-nearest neighbours. Also, Locally weighted learning (Atkeson et al., 1997) is another technique which utilizes kNN search. It works by training its base classifier/learner on training points that are nearest neighbours of a given test point. • Bioinformatics: Here, kNN and...

461 | Similarity search in high dimensions via hashing
- Gionis, Indyk, et al.
- 1999
Citation Context: ...The technique has a worst-case query time of O(dn^(1/(1+ɛ))). It works well both in main and in external memory, and in a later study by the same authors has been shown to perform better than SR-Trees (Gionis et al., 1999). No evaluation of the technique, however, is known to have been carried out against established main-memory methods such as KDTrees or BBF-Trees. The basic idea behind the technique is to use a numb...

401 | The SR-tree: An index structure for high-dimensional nearest neighbor queries
- Katayama, Satoh
- 1997
Citation Context: ...would allow one or more objects to have the same value, and thus allow the idea of telescoping to work. The trees have been shown empirically by the authors to perform better than R*-Trees. SR-Trees (Katayama & Satoh, 1997) are a combination of R-Trees and SS-Trees. Each node in the tree stores the minimum bounding rectangle as well as the minimum bounding sphere of the points it contains. On real data, the trees have ...

319 | When is “nearest neighbor” meaningful?, in
- Beyer, Goldstein, et al.
- 1999
Citation Context: ... of the curse can be found in (Hastie et al., 2001), (Katayama & Satoh, 2001) and (Fayyad et al., 1996). Also, there is some debate regarding the usefulness of the NN problem at such high dimensions (Beyer et al., 1999; Hinneburg et al., 2000). In practical problems it often turns out that the data points, even though being embedded in a high dimensional space, are not so widely scattered after all. Due to the depe...

319 | Locality-sensitive hashing scheme based on p-stable distributions
- Datar, Immorlica, et al.
- 2004
Citation Context: ...in the k (≤ |S|) nearest points to q, contained in S. The NN search then just becomes a special case of kNN search with k=1. A slight variation of NN search, advocated by some (Indyk & Motwani, 1998; Datar et al., 2004) in place of NN search, is ɛ-approximate NN (ɛ-NN) search, where given a user-defined error bound ɛ ≥ 0, the task is to find a point p′ in S ... [Figure 1.1: ɛ-Approximate Nearest Neighbour]

299 | The Virage image search engine: An open framework for image management - Bach, Fuller, et al. - 1996 |

205 | Shape indexing using approximate nearest-neighbor search in high dimensional spaces
- Beis, Lowe
- 1997
Citation Context: ...rse-of-dimensionality and all, from what is known, have only been devised with at most Minkowski-p metrics in view. Best-Bin-First (BBF) trees, developed by (Arya & Mount, 1993) and independently by (Beis & Lowe, 1997), are a modification of KDTrees for ɛ-NN search. Using a priority queue, they visit those regions first during backtracking that are nearer to the query point, and terminate the search early if the ...

199 | Query by humming: Musical information retrieval in an audio database
- Ghias, Logan, et al.
- 1995
Citation Context: ...l., 1994; Smith & Chang, 1996; Bach et al., 1996), and in the form of dominant frequency and pitch (which can also be given as an acoustic input from the user) etc. in case of an audio/music library (Ghias et al., 1995; McNab et al., 1997; Tseng, 1999; Uitdenbogerd & Zobel, 2002; Zhu et al., 2003). • Computer vision: Here, NN search is an important tool used for the task of object classification, which involves fin...

182 | Voronoi diagrams—a survey of a fundamental geometric data structure
- Aurenhammer
- 1991
Citation Context: ...sorting the array in this case). For d = 2, O(log n) query time in the worst case, with linear space and near-linear preprocessing time, is possible using methods based on Voronoi diagrams (Lee, 1982; Aurenhammer, 1991). However, for d > 2, no known solution exists that can guarantee a sublinear query time while still keeping the space complexity linear and the preprocessing time near linear. Still, for moderate va...

154 | Find nearest neighbors in growthrestricted metrics
- Karger, Ruhl
- 2002
Citation Context: ...es in which they are embedded), regardless of their actual number of dimensions, exhibit certain restricted or bounded growth. A simple notion of such bounded growth was presented by Karger and Ruhl (Karger & Ruhl, 2002). They defined a growth bound on a dataset such that the number of points in a ball (hypersphere to be precise) centred at any point p is at most c times the number of points in a ball of half the rad...
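The Karger-Ruhl bound just described — |B(p, r)| ≤ c · |B(p, r/2)| for every centre p and radius r — is easy to probe empirically on a dataset. An illustrative sketch (function name and brute-force ball counting are assumptions, not from the cited paper):

```python
import math

def expansion_ratio(points, centre, r):
    """Return |B(centre, r)| / |B(centre, r/2)|: the local expansion rate.
    A dataset satisfies a Karger-Ruhl growth bound with constant c when
    this ratio is at most c for every centre and radius considered."""
    ball = sum(1 for q in points if math.dist(centre, q) <= r)
    half = sum(1 for q in points if math.dist(centre, q) <= r / 2)
    return ball / half if half else float("inf")
```

For points spread uniformly along a line the ratio stays near 2, matching the intuition that the constant c reflects a kind of intrinsic dimension.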

148 | Cover trees for nearest neighbor
- Beygelzimer, Kakade, et al.
- 2006
Citation Context: ...), then the search procedure takes no longer than O(log n). Beygelzimer, Kakade and Langford presented a data structure for NN and ɛ-NN search based on Navigating Nets, which they called Cover Trees (Beygelzimer et al., 2006). In a Navigating Net each point at some lower level is allowed to have more than one parent point from the previous level (i.e. points are allowed to overlap among balls in any intermediate level), ...

127 | Navigating nets: simple algorithms for proximity search
- Krauthgamer, Lee
- 2004
Citation Context: ...net. Karger and Ruhl also presented a data structure for NN search which works well for geometries/datasets satisfying their growth bound. A similar bound property was defined by Krauthgamer and Lee (Krauthgamer & Lee, 2004). Their growth bound, however, as shown in (Gupta et al., 2003), is more general than the one by Karger and Ruhl. Their growth bound definition is: every set of points in the dataset should be able t...

120 | What is the nearest neighbor in high dimensional spaces
- Hinneburg, Aggarwal, et al.
- 2000
Citation Context: ... found in (Hastie et al., 2001), (Katayama & Satoh, 2001) and (Fayyad et al., 1996). Also, there is some debate regarding the usefulness of the NN problem at such high dimensions (Beyer et al., 1999; Hinneburg et al., 2000). In practical problems it often turns out that the data points, even though being embedded in a high dimensional space, are not so widely scattered after all. Due to the dependencies among the dimen...

112 | Nearest neighbor queries in metric spaces
- Clarkson
- 1997
Citation Context: ...against any other known technique for ɛ-NN search, whereas Cover Trees have only been evaluated (in the paper in which they were presented) against the little-known sb(S) data structures by Clarkson (Clarkson, 1999, 2002). This is in spite of the fact that an excellent implementation of KDTrees and BBF-Trees, for NN and ɛ-NN search, is freely available (Mount & Arya, 1997) that can be easily integrated with any...

90 | A Survey of Information Retrieval and Filtering Methods
- Faloutsos, Oard
- 1996
Citation Context: ...n a hockey rink (Cai et al., 2006). • Document/information retrieval: Here, NN search is an often-used method to retrieve and rank documents given a user query (Lucarella, 1988; Deerwester et al., 1990; Faloutsos & Oard, 1995). 1.4 Characteristics Common to NN Applications In almost all of the above, the general representation of objects of interest (documents, images, etc.) including the queries is as vectors or points i...

87 | Nearest-neighbor searching and metric space dimensions - Clarkson - 2006 |

72 | An improvement of the minimum distortion encoding algorithm for vector quantization
- Bei, Gray
- 1985
Citation Context: ...(hence are also mentioned in Table 2.1). These and the techniques in Table 2.1 are described in brief in the paragraphs below. 2.1.1 Partial Distance Search (PDS) Originally proposed by Bei and Gray (Bei & Gray, 1985), it provides only moderate acceleration on its own. However, it is extremely simple and is general enough to be applied in conjunction with almost any other technique known for NN search. The techniq...
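Partial Distance Search accelerates a linear scan by accumulating the squared distance coordinate by coordinate and abandoning a candidate as soon as the partial sum already exceeds the best distance found. A minimal sketch, with illustrative names:

```python
def pds_nearest(points, query):
    """Linear NN scan with Partial Distance Search (early abort)."""
    best_point, best_sq = None, float("inf")
    for p in points:
        acc = 0.0
        for a, b in zip(p, query):
            acc += (a - b) ** 2
            if acc >= best_sq:  # partial sum already too large: abandon p
                break
        else:  # all coordinates accumulated without aborting: new best
            best_point, best_sq = p, acc
    return best_point, best_sq ** 0.5
```

Because the squared distance is a monotone sum of per-coordinate terms, the abort never discards the true nearest neighbour; this is also why PDS composes with nearly any other NN technique, as the context notes.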

66 | Algorithms for Fast Vector Quantization
- Arya, Mount
- 1993
Citation Context: ...however, fall far short of removing the curse-of-dimensionality and all, from what is known, have only been devised with at most Minkowski-p metrics in view. Best-Bin-First (BBF) trees, developed by (Arya & Mount, 1993) and independently by (Beis & Lowe, 1997), are a modification of KDTrees for ɛ-NN search. Using a priority queue, they visit those regions first during backtracking that are nearer to the query poin...

63 | Multiresolution instance-based learning
- Deng, Moore
- 1995
Citation Context: ...ruction method of Friedman et al. (Friedman et al., 1977), which could be called the Median of Widest Dimension, or one of their own proposed, Midpoint of Widest Dimension. For example in (Moore, 1991), (Deng & Moore, 1995) and (Moore et al., 1997) they have used Midpoint of Widest Dimension, whereas in (Gray & Moore, 2004) they have suggested to use Median of Widest Dimension. Hence, it is not clear which method is be...

59 | The Priority R-tree: A practically efficient and worst-case-optimal R-tree - Arge, Berg, et al. |

36 | Fast full search equivalent encoding algorithms for image compression using vector quantization
- Huang, Bi, et al.
- 1992
Citation Context: ...e curse-of-dimensionality. 2.1.6 Annulus Method The Annulus Method, based around the mathematical concept of an annulus (a ring-shaped object), was also designed specifically for Vector Quantization (Huang et al., 1992). Like Orchard's method, it also exploits a geometrical property. It works by projecting the points to their scalar distances from a fixed reference point (which is usually the origin). First, the di...
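The Annulus Method described here projects each point to its scalar distance from a fixed reference point and scans candidates outward from the query's projection. By the triangle inequality, |d(p, ref) − d(q, ref)| lower-bounds d(p, q), so the scan can stop once that gap alone exceeds the best distance found. A minimal sketch under those assumptions (names illustrative, not the thesis's rediscovered variant):

```python
import bisect
import math

def build_annulus_index(points, ref=None):
    """Precompute each point's distance to `ref` (default: the origin)
    and sort by it."""
    if ref is None:
        ref = tuple(0.0 for _ in points[0])
    index = sorted((math.dist(p, ref), p) for p in points)
    return index, ref

def annulus_nearest(index, ref, query):
    """Scan outward from the query's projected distance, pruning with
    the annulus gap |d(p, ref) - d(q, ref)| <= d(p, q)."""
    qnorm = math.dist(query, ref)
    i = bisect.bisect_left(index, (qnorm,))
    best, best_d = None, float("inf")
    lo, hi = i - 1, i
    while lo >= 0 or hi < len(index):
        lo_gap = qnorm - index[lo][0] if lo >= 0 else float("inf")
        hi_gap = index[hi][0] - qnorm if hi < len(index) else float("inf")
        if min(lo_gap, hi_gap) >= best_d:
            break  # no remaining annulus can hold a closer point
        if lo_gap <= hi_gap:
            _, p = index[lo]
            lo -= 1
        else:
            _, p = index[hi]
            hi += 1
        d = math.dist(p, query)
        if d < best_d:
            best, best_d = p, d
    return best, best_d
```

Preprocessing is one sort; each query then touches only the points whose annulus gap is below the current best distance.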

34 | Robust visual tracking for multiple targets - Cai, Cai |

34 | On approximate nearest neighbors in non-Euclidean spaces
- Indyk
- 1999
Citation Context: ...twani, 1998)). Many times even the NN-search problem itself has been defined with insistence on points being in a metric space (Arya et al., 1998; Maneewongvatana & Mount, 2002; Indyk & Motwani, 1998; Indyk, 1998, 2002). This, though, is pretty restrictive, given some of the earliest studies of the problem used more general measures instead of metrics (Friedman et al., 1977). 1.5 Objectives and Scope of th...

29 | Space-efficient approximate Voronoi diagrams
- Arya, Malamatos, et al.
- 2002
Citation Context: ...tationally only efficient for at most d = 2, for which they require O(n log n) preprocessing and O(n) space. For values higher than 2, their complexity grows exponentially in d (O(n^⌈d/2⌉) according to (Arya et al., 2002)). The NN search on Voronoi diagrams takes only O(log n) time in the worst case, while kNN search takes only O(log n + k) time. The search is done using an approach for planar point location in straigh...

28 |
Solving query-retrieval problems by compacting Voronoi diagrams
- Aggarwal, Hansen, et al.
- 1990
Citation Context: ...or drawback of this technique was that k for kNN search needed to be known in advance before preprocessing, and had to remain fixed for all queries. However, an approach developed by Aggarwal et al. (Aggarwal et al., 1990) for compacting order-k Voronoi diagrams allows all possible order-k Voronoi diagrams (i.e. for k = 1...n − 1) to be stored in O(n log n) space, while still guaranteeing O(log n + k) query time. An ext...

27 | Approximate nearest neighbor algorithms for Hausdorff metrics via embeddings - Farach-Colton, Indyk - 1999 |

25 |
Fast nearest-neighbor search in dissimilarity spaces
- Faragó, Linder, et al.
- 1993
Citation Context: ...e Warping (DTW) distance measure used in speech recognition (Vidal et al., 1988), and the NEMr shape-distance measure employed in IBM's QBIC system (Flickner et al., 1995), as noted by Faragó et al. (Faragó et al., 1993) and Fagin and Stockmeyer (Fagin & Stockmeyer, 1998) respectively, are not exact metrics. Nevertheless, almost all of the studies of (k)NN search that were reviewed as part of the research for this t...

22 | Relaxing the triangle inequality in pattern matching - Fagin, Stockmeyer |

14 | Distinctiveness-sensitive nearest-neighbor search for efficient similarity retrieval of multimedia information
- Katayama, Satoh
- 2001
Citation Context: ...h, and also the newer algorithms that are based on the concept of intrinsic dimensionality (defined below) of the data. A more meticulous coverage of the curse can be found in (Hastie et al., 2001), (Katayama & Satoh, 2001) and (Fayyad et al., 1996). Also, there is some debate regarding the usefulness of the NN problem at such high dimensions (Beyer et al., 1999; Hinneburg et al., 2000). In practical problems it often ...

13 | Efficient algorithms for substring near neighbor problem
- Andoni, Indyk
- 2006
Citation Context: ...n order to reduce the I/O overhead in databases), and often cannot be fairly compared to those designed to work in main memory. Indyk and his group have corrected this in their latest publications (Andoni & Indyk, 2006; Shakhnarovich et al., 2006a). Hence, to make the project more tractable, the goal of the research was narrowed down to include only the most popular of the old and the most promising of the novel t...

10 | On Effective Classification of Strings with Wavelets
- Aggarwal
- 2002
Citation Context: ...e of similarity/nearness measure are usually dependent on the application domain. In string classification, for example, the objects are generally represented as string sequences rather than vectors (Aggarwal, 2002; Mollineda et al., 2003), even in cases when the employed distance measure (usually edit distance) is a metric (Mollineda et al., 2003). Similarly, the Dynamic Time Warping (DTW) distance measure u...

3 | Cover trees for nearest neighbor (unpublished manuscript)
- Beygelzimer, Kakade, et al.
- 2005
Citation Context: ...ed in (Krauthgamer & Lee, 2004) (see section), on which the Cover Trees are based, is more general than the one above by Karger and Ruhl. It, however, as noted by Beygelzimer, Kakade and Langford in (Beygelzimer et al., 2005), does not have strong provable results, and the ones that are present are only applicable to ɛ-NN search. Since the growth bound of Karger and Ruhl above is a subclass of the bound of Krauthgamer an...

1 | Near neighbor search in large metric spaces - unknown authors - 1995 |

1 | Vector quantization and signal compression
- Gersho, Gray
- 1991
Citation Context: ...number of fields. Some of those, along with examples of their use of kNN search, include: • Data compression: Here, it is used in a method called vector quantization for speech and image compression (Gersho & Gray, 1991). It involves blocking speech or image waveform signals into vectors of fixed length. A set of codevectors is first computed based on a set of training vectors, then each new vector is encoded with t...

1 | Tutorial on Data Structures for Fast Statistics
- Gray, Moore
- 2004
Citation Context: ...Dimension, or one of their own proposed, Midpoint of Widest Dimension. For example in (Moore, 1991), (Deng & Moore, 1995) and (Moore et al., 1997) they have used Midpoint of Widest Dimension, whereas in (Gray & Moore, 2004) they have suggested to use Median of Widest Dimension. Hence, it is not clear which method is best to use in general practice. The group of Mount and Arya have provided a number of construction methods, an...

1 | http://www.cs.cmu.edu/~agray/icml.html - Krauthgamer, R., et al. - 2004 |