## Similarity Indexing: Algorithms and Performance (1996)

### Other Repositories/Bibliography

Venue: In Storage and Retrieval for Image and Video Databases (SPIE

Citations: 120 - 1 self

### BibTeX

@INPROCEEDINGS{White96similarityindexing:,

author = {David A. White and Ramesh Jain},

title = {Similarity Indexing: Algorithms and Performance},

booktitle = {In Storage and Retrieval for Image and Video Databases (SPIE},

year = {1996},

pages = {62--73}

}

### Abstract

Efficient indexing support is essential to allow content-based image and video databases using similarity-based retrieval to scale to large databases (tens of thousands up to millions of images). In this paper, we take an in-depth look at this problem. One of the major difficulties in solving this problem is the high dimension (6-100) of the feature vectors that are used to represent objects. We provide an overview of the work in computational geometry on this problem and highlight the results we found to be most useful in practice, including the use of approximate nearest neighbor algorithms. We also present a variant of the optimized k-d tree we call the VAM k-d tree, and provide algorithms to create an optimized R-tree we call the VAMSplit R-tree. We found that the VAMSplit R-tree provided better overall performance than all competing structures we tested for main memory and secondary memory applications. We observed large improvements in performance relative to the R*-tree and SS-tree...

### Citations

2999 | Eigenfaces for Recognition
- Turk, Pentland
- 1991
Citation Context ...provide marginally better query performance, but it requires orders of magnitude more time to optimize an index. In the future, we plan to run more tests on real datasets, such as an EigenFace dataset [37] and a large texture dataset [23], and provide results on approximate query performance. Such test results were omitted from this paper due to space considerations. However, we note that the results in...

2354 | R-trees: A dynamic index structure for spatial searching
- Guttman
- 1984
Citation Context ...y different method. In the database literature, research has mainly focused on indexing lower dimensional data and on other types of queries besides nearest neighbor or similarity queries. The R-tree [17] and its most successful variant, the R*-tree [5], have been used most often for indexing high dimensional datasets in the database literature [14]. Henrich [18] provides algorithms for nearest neighbor...

1585 | The C++ Programming Language
- Stroustrup
- 1997
Citation Context ...nto an R-tree, we provide a simpler and more efficient algorithm that creates a VAMSplit R-tree directly from a dataset, based on the k-d tree variant described above. Our algorithm is provided in C++ [35] and uses the conventions of the Standard Template Library (STL) [26, 25], although only simple C++ syntax is used for this presentation. We implemented both the k-d tree above and the VAMSplit R-tree ...

1047 | The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles
- Beckmann, Kriegel, et al.
- 1990
Citation Context ...search has mainly focused on indexing lower dimensional data and on other types of queries besides nearest neighbor or similarity queries. The R-tree [17] and its most successful variant, the R*-tree [5], have been used most often for indexing high dimensional datasets in the database literature [14]. Henrich [18] provides algorithms for nearest neighbor searching and related problems, although he does...

810 | An optimal algorithm for approximate nearest neighbor searching in fixed dimensions
- Arya, Mount, et al.
- 1998
Citation Context ...e focus on what we call the similarity selection operation [38]. The similarity selection operation is a generalization of the k-nearest neighbor query and spherical range query, and approximate versions [4, 2] of those queries. The parameters of the similarity selection query operation are the following: 1. A query vector q ∈ R^d. The query results are ordered in increasing distance from q, where distance...

633 | An algorithm for finding best matches in logarithmic expected time
- Friedman, Bentley, et al.
- 1977
Citation Context ...sited by a k nearest neighbor search increases linearly with k, since the search ball increases its (average) volume by a factor of k. The optimized k-d tree proposed by Friedman, Bentley, and Finkel [15] is probably the data structure most often used in practice for nearest neighbor searching in main memory. The optimized k-d tree requires logarithmic expected time with respect to the size of the dat...

404 | The K-D-B-Tree: A Search Structure for large Multidimensional Dynamic Index
- Robinson
- 1981
Citation Context ... required for k-d tree internal nodes. In disk-based implementations, however, it is important to "trim the fat" because of paging issues, so we recommend a structure similar to the K-D-B-tree instead [32]. As a comparison, our implementation, designed for main memory and disk-based applications, uses about 1/3 the space of Bentley's [7] to store the k-d tree nodes, primarily because we eliminated "buck...

398 | Photobook: tools for content-based manipulation of image databases
- Pentland, Picard, et al.
- 1994
Citation Context ...jects are based on similarity, where similarity is measured by some type of distance in feature space. Some representative samples of systems using this strategy are IBM's QBIC [27] and MIT's PhotoBook [29] project. Because the number of objects (i.e., images) stored in such systems is usually small (100-10,000), typically a simple linear or optimized linear search is used to perform queries since it provi...

312 | Similarity indexing with the SS-tree
- White, Jain
- 1995
Citation Context ...atabases will not provide adequate performance. We call the general problem of providing indexing support for similarity-based queries of medium or high dimensional feature vectors similarity indexing [38]. Figure 1 shows the dependencies that exist in similarity indexing applications. The domain expert must use domain knowledge to convert domain objects (i.e., images) into feature vectors, and provide s...

281 | Chabot: Retrieval from a Relational Database of Images
- Stonebraker
- 1995
Citation Context ...provides adequate performance. However, in order to allow systems such as these to scale to large or very large databases (100,000-10,000,000 objects) that will soon exist as part of digital libraries [28, 24] and other information systems, indexing support needs to be developed, because a linear search of such databases will not provide adequate performance. We call the general problem of providing indexi...

124 | STL Tutorial and Reference Guide C++ Programming with the Standard Template
- Musser, Saini
- 1996
Citation Context ...t creates a VAMSplit R-tree directly from a dataset, based on the k-d tree variant described above. Our algorithm is provided in C++ [35] and uses the conventions of the Standard Template Library (STL) [26, 25], although only simple C++ syntax is used for this presentation. We implemented both the k-d tree above and the VAMSplit R-tree (and SS-tree) code in the generic algorithms framework of STL [26]. This ...

111 | Refinements to nearest-neighbor searching in k-dimensional trees
- Sproull
- 1991
Citation Context ... size of the dataset. Bentley [7] proposed a modified k-d tree that in some applications can allow constant time searching (with respect to the dataset size) of a k-d tree in lower dimension. Sproull [34] provided refinements to the k-d tree and observed that in practice the k-d tree performance degrades rapidly with dimension. Arya and Mount [3] analyzed the k-d tree (and the bucketing algorithm) tak...

107 | The LSD tree: Spatial access to multidimensional point and non-point objects
- Henrich, Six, et al.
- 1989
Citation Context ...applications, uses about 1/3 the space of Bentley's [7] to store the k-d tree nodes, primarily because we eliminated "bucket" nodes. Our implementation also requires much less storage than the LSD-tree [19], which must store a number of extra fields in each k-d tree node in main memory in order to support dynamic updates. 5 The VAMSplit R-tree Because our preliminary tests showed the optimized k-d tree ...

96 | K-d trees for semidynamic point sets
- Bentley
- 1990
Citation Context ...he data structure most often used in practice for nearest neighbor searching in main memory. The optimized k-d tree requires logarithmic expected time with respect to the size of the dataset. Bentley [7] proposed a modified k-d tree that in some applications can allow constant time searching (with respect to the dataset size) of a k-d tree in lower dimension. Sproull [34] provided refinements to the ...

95 | QBIC project: querying images by content, using color, texture, and shape
- Niblack, Barber, et al.
- 1993
Citation Context ... and queries on those objects are based on similarity, where similarity is measured by some type of distance in feature space. Some representative samples of systems using this strategy are IBM's QBIC [27] and MIT's PhotoBook [29] project. Because the number of objects (i.e., images) stored in such systems is usually small (100-10,000), typically a simple linear or optimized linear search is used to perfor...

87 | Approximate range searching
- Arya, Mount
- 2000
Citation Context ...e focus on what we call the similarity selection operation [38]. The similarity selection operation is a generalization of the k-nearest neighbor query and spherical range query, and approximate versions [4, 2] of those queries. The parameters of the similarity selection query operation are the following: 1. A query vector q ∈ R^d. The query results are ordered in increasing distance from q, where distance...

72 | An improvement of the minimum distortion encoding algorithm for vector quantization
- Bei, Gray
- 1985
Citation Context ...split orientation and (2) split position, but incorporates Arya's [1] distance and priority refinements (however, see caveat above). We also use the well-known partial distance calculation optimization [6, 34] in the buckets. We call this refinement of the optimized k-d tree the VAM k-d tree, because its (1) split orientation is based on the variance, and the (2) split position is approximately the median. 1....
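
The partial distance calculation mentioned in this excerpt can be illustrated with a short sketch. This is not the authors' code: the function name `partial_distance_sq`, the use of squared Euclidean distance, and the `std::vector<double>` point type are all assumptions made for illustration. The idea is to stop accumulating a distance as soon as the running sum can no longer beat the best candidate found so far.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: squared Euclidean distance with early exit.
// Returns true and stores the full squared distance in `out` only when that
// distance is strictly below `bound` (for example, the squared distance to the
// current k-th best candidate). Otherwise returns false and leaves `out`
// unchanged, abandoning the sum as soon as it reaches `bound`.
bool partial_distance_sq(const std::vector<double>& a,
                         const std::vector<double>& b,
                         double bound, double& out) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double diff = a[i] - b[i];
        sum += diff * diff;
        if (sum >= bound) return false;  // remaining dimensions cannot help
    }
    const bool within = sum < bound;
    if (within) out = sum;
    return within;
}
```

In a bucket scan, `bound` would typically shrink as better candidates are found, so later points tend to be rejected after only a few dimensions. The saving grows with the dimension of the feature vectors.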

70 | An algorithm for approximate closest-point queries
- Clarkson
- 1994
Citation Context ...ts showed that there was "relatively little difference in running time and effective performance between splitting rules..." used by their optimal algorithms and the optimized k-d tree [15]. Clarkson [10] also has provided an optimal algorithm for this problem using a totally different method. In the database literature, research has mainly focused on indexing lower dimensional data and on other types...

55 | Efficient and Effective Querying by Image Content
- Faloutsos
- 1994
Citation Context ...es nearest neighbor or similarity queries. The R-tree [17] and its most successful variant, the R*-tree [5], have been used most often for indexing high dimensional datasets in the database literature [14]. Henrich [18] provides algorithms for nearest neighbor searching and related problems, although he does not provide performance results for high-dimensional data. White and Jain proposed the SS-tree [...

48 | Representation, similarity, and the chorus of prototypes
- Edelman
- 1995
Citation Context ...ce dataset). In fact, high intrinsic dimensionality (for instance, a high fractal dimension [33]) might be an indication that an ineffective or non-intuitive feature vector representation is being used [13]. 2. The simple optimized k-d tree [15] and variants [7, 34, 3] provide the best main memory performance in practice of algorithms proposed in the computational geometry and algorithms literature. Fri...

42 | Fast k-dimensional tree algorithms for nearest-neighbor search with application to vector quantization encoding
- Ramasubramanian, Paliwal
- 1992
Citation Context ...des. However, more complicated split rules may not yield a large improvement in the average search performance in practice, except for certain types of distributions. Both Ramasubramanian and Paliwal [30] and Arya et al. [4] found that in the average case, the standard k-d tree split rule provides performance close to that of other split rules, in cases where the query distribution is not known. Howev...

42 | Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise
- Schroeder
- 1991
Citation Context ...imensional (!20D) or has a high embedded dimension but is not intrinsically high dimensional (i.e., the EigenFace dataset). In fact, high intrinsic dimensionality (for instance, a high fractal dimension [33]) might be an indication that an ineffective or non-intuitive feature vector representation is being used [13]. 2. The simple optimized k-d tree [15] and variants [7, 34, 3] provide the best main memor...

36 | Accounting for Boundary Effects in Nearest Neighbor Searching
- Arya, Mount, et al.
- 1995
Citation Context .... For example, if ε = 0.5, the distance to the kth approximate nearest neighbor might be as much as 50% greater than the distance to the true kth nearest neighbor. However, we and other researchers [3] have found that in practice, the average error is much less than the maximum allowed error, and for small values of ε, the probability that a non-exact result is actually returned is often very...
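
The guarantee described in this excerpt can be written as a single inequality. The symbols are assumptions introduced here for illustration and may differ from the paper's notation: $q$ is the query vector, $p_k$ the true $k$th nearest neighbor, $\hat{p}_k$ the $k$th neighbor returned by the approximate search, and $\varepsilon \ge 0$ the allowed relative error.

$$
\operatorname{dist}(q, \hat{p}_k) \;\le\; (1 + \varepsilon)\,\operatorname{dist}(q, p_k)
$$

With $\varepsilon = 0.5$ the right-hand side is $1.5\,\operatorname{dist}(q, p_k)$, which is the "as much as 50% greater" case in the excerpt. Setting $\varepsilon = 0$ recovers the exact query.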

36 | Algorithms for dynamic closest-pair and n-body potential fields
- Callahan, Kosaraju
- 1995
Citation Context ... In addition, the structure proposed by Arya et al. [4] can be generalized to be an optimal dynamic structure (O(log n) point insertion and deletion) using the recent results of Callahan and Kosaraju [9] and Bespamyatnikh [8]. However, Arya et al. [4] also state that their empirical tests showed that there was "relatively little difference in running time and effective performance between splitting ru...

35 | An optimal algorithm for closest pair maintenance
- Bespamyatnikh
- 1995
Citation Context ...ucture proposed by Arya et al. [4] can be generalized to be an optimal dynamic structure (O(log n) point insertion and deletion) using the recent results of Callahan and Kosaraju [9] and Bespamyatnikh [8]. However, Arya et al. [4] also state that their empirical tests showed that there was "relatively little difference in running time and effective performance between splitting rules..." used by their...

32 | A distance-scan algorithm for spatial access structures
- Henrich
- 1994
Citation Context ...ighbor or similarity queries. The R-tree [17] and its most successful variant, the R*-tree [5], have been used most often for indexing high dimensional datasets in the database literature [14]. Henrich [18] provides algorithms for nearest neighbor searching and related problems, although he does not provide performance results for high-dimensional data. White and Jain proposed the SS-tree [38] as a dyna...

29 | Analysis of an algorithm for finding nearest neighbors in Euclidean space
- Cleary
- 1979
Citation Context ...rithm provides good performance on uniformly distributed data, but is not practical for most non-uniform distributions used in practice. This algorithm was analyzed by Rivest [31] and later by Cleary [11]. Cleary's analysis, assuming a uniform distribution and a sufficiently large dataset (N ≫ 2^d), shows that the algorithm requires time exponential with dimension (lower bound of 0.886 · 2^d bu...

23 | Image indexing using a texture dictionary
- Manjunath, Ma
Citation Context ...performance, but it requires orders of magnitude more time to optimize an index. In the future, we plan to run more tests on real datasets, such as an EigenFace dataset [37] and a large texture dataset [23], and provide results on approximate query performance. Such test results were omitted from this paper due to space considerations. However, we note that the results in table 1 are much worse than we ...

22 | Nearest Neighbor Searching and Applications
- Arya
- 1995
Citation Context ...y effects into account, showing that dependence on dimension is much better than Cleary's bound when the number of data points is not large with respect to dimension (N not ≫ 2^d). In his thesis, Arya [1] provides further refinements to the k-d tree and suggests a k-d tree variant called the priority k-d tree (see figure 2(b), numbering shows bucket search order). Recent work in computational geometry ...

22 | On the optimality of Elias’s algorithm for performing best-match searches
- Rivest
- 1974
Citation Context ...e figure 2(a)). This algorithm provides good performance on uniformly distributed data, but is not practical for most non-uniform distributions used in practice. This algorithm was analyzed by Rivest [31] and later by Cleary [11]. Cleary's analysis, assuming a uniform distribution and a sufficiently large dataset (N ≫ 2^d), shows that the algorithm requires time exponential with dimension (lower bou...

18 | Average Case Selection
- Cunto, Munro
- 1989
Citation Context .... This can be performed in linear time using a selection algorithm. A variant of Hoare's [20] algorithm is provided as the nth_element() template function in STL, although more sophisticated algorithms [12] are faster for large datasets. We note that for large datasets where runtime is dominated by disk accesses, the algorithm can be improved by a constant factor (perhaps 10-20%), by combining the varia...

12 | R-Tree Index Optimization
- Gavrila
- 1994
Citation Context ...time to construct an index. Because of this, we suggest that new update routines or reorganization routines for the R-tree structure should be developed based on ideas presented in this paper. Gavrila [16] suggested another method for creating an optimized R-tree that might provide marginally better query performance, but it requires orders of magnitude more time to optimize an index. In the future, we...

9 | Algorithm 63 PARTITION, algorithm 64 QUICKSORT, and algorithm 65 FIND
- Hoare
- 1961
Citation Context ...offset lo size is in its sorted position on the given dimension, and the dataset is partitioned around its value. This can be performed in linear time using a selection algorithm. A variant of Hoare's [20] algorithm is provided as the nth_element() template function in STL, although more sophisticated algorithms [12] are faster for large datasets. We note that for large datasets where runtime is dominat...
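
The selection step described in this excerpt can be sketched with `std::nth_element`. This is a minimal illustration under stated assumptions, not the paper's algorithm: the names `Point`, `max_variance_dim`, and `split_at_median` are hypothetical, the split dimension is chosen by maximum variance (following the variance-based split orientation that another excerpt on this page attributes to the VAM k-d tree), and the split position is the exact median rather than the paper's approximate median.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;  // hypothetical feature-vector type

// Returns the dimension with the largest variance over pts[lo, hi).
std::size_t max_variance_dim(const std::vector<Point>& pts,
                             std::size_t lo, std::size_t hi) {
    assert(lo < hi && hi <= pts.size());
    const std::size_t dims = pts[lo].size();
    const double n = static_cast<double>(hi - lo);
    std::size_t best = 0;
    double best_var = -1.0;
    for (std::size_t d = 0; d < dims; ++d) {
        double sum = 0.0, sum_sq = 0.0;
        for (std::size_t i = lo; i < hi; ++i) {
            sum += pts[i][d];
            sum_sq += pts[i][d] * pts[i][d];
        }
        const double mean = sum / n;
        const double var = sum_sq / n - mean * mean;
        if (var > best_var) { best_var = var; best = d; }
    }
    return best;
}

// Rearranges pts[lo, hi) so the median on `dim` sits at the returned index,
// with no larger values before it and no smaller values after it.
// std::nth_element does this in linear time on average.
std::size_t split_at_median(std::vector<Point>& pts,
                            std::size_t lo, std::size_t hi, std::size_t dim) {
    assert(lo < hi && hi <= pts.size());
    const std::size_t mid = lo + (hi - lo) / 2;
    const auto first = pts.begin() + static_cast<std::ptrdiff_t>(lo);
    const auto nth = pts.begin() + static_cast<std::ptrdiff_t>(mid);
    const auto last = pts.begin() + static_cast<std::ptrdiff_t>(hi);
    std::nth_element(first, nth, last,
                     [dim](const Point& a, const Point& b) {
                         return a[dim] < b[dim];
                     });
    return mid;
}
```

A recursive tree builder would call `max_variance_dim` and then `split_at_median` on a range, and recurse on the two halves. Because the partition is done in place, no extra copy of the dataset is needed.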

4 | Image descriptions for browsing and retrieval
- Tomasi, Guibas
- 1994
Citation Context ...neighbor problem, although, to our knowledge, these algorithms have not yet been used in image and video retrieval applications, although use of these algorithms for image retrieval has been suggested [36]. Approximate nearest neighbor algorithms seem to be appropriate for image and video retrieval applications because often fast response time is more important than receiving exact query results. The e...

2 | Image Browsing
- Manjunath
- 1995
Citation Context ...provides adequate performance. However, in order to allow systems such as these to scale to large or very large databases (100,000-10,000,000 objects) that will soon exist as part of digital libraries [28, 24] and other information systems, indexing support needs to be developed, because a linear search of such databases will not provide adequate performance. We call the general problem of providing indexi...

1 | Chapter 5: Image transforms
- Jain
- 1989
Citation Context ...proach for highly correlated datasets. Rather than allowing splits on arbitrary dimensions, we believe the axes themselves should be changed so they are uncorrelated using the Karhunen-Loeve transform [21] of the dataset before indexing in a k-d tree (or VAMSplit R-tree). The distance refinement can be used in this case since the dimensions are orthogonal and distance in the transformed space is equiva...

1 | Algorithm-oriented generic libraries
- Musser, Stepanov
- 1994
Citation Context ...t creates a VAMSplit R-tree directly from a dataset, based on the k-d tree variant described above. Our algorithm is provided in C++ [35] and uses the conventions of the Standard Template Library (STL) [26, 25], although only simple C++ syntax is used for this presentation. We implemented both the k-d tree above and the VAMSplit R-tree (and SS-tree) code in the generic algorithms framework of STL [26]. This ...