Indexing Large Metric Spaces For Similarity Search Queries (2002)
| Venue: | ACM Transactions on Database Systems |
| Citations: | 57 - 0 self |
BibTeX
@ARTICLE{Bozkaya02indexinglarge,
author = {Tolga Bozkaya and Meral Ozsoyoglu},
title = {Indexing Large Metric Spaces For Similarity Search Queries},
journal = {ACM Transactions on Database Systems},
year = {2002},
volume = {24},
pages = {361--404}
}
Abstract
In many database applications, a common query is to find approximate matches to a given query item in a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures have been proposed for applications where distance computations between objects of the data domain are expensive (as with high-dimensional data) and the distance function is a metric. In this paper, we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach of using reference points (vantage points) to partition the data space hierarchically into spherical shell-like regions. We introduce the multi-vantage point tree structure (mvp-tree), which uses more than one vantage point to partition the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the distances between the data points and the vantage points that were pre-computed at construction time. We summarize experiments comparing mvp-trees with vp-trees, which have a similar partitioning strategy but use only one vantage point at each level and do not make use of the pre-computed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points, and discuss the results of experiments on how varying the number of vantage points used in a node affects search performance, and how much performance gain is obtained by making use of pre-computed distances. The results show that it may be best to use a large number of vantage points in an internal node, so that the tree ends up with a single directory node, and to keep as many of the pre-computed distances as possible to provide more efficient filtering during search operations.
Finally, we provide some experimental results comparing mvp-trees with M-trees, a dynamic distance-based index structure for metric domains.
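
To make the partitioning idea in the abstract concrete, the following is a minimal sketch of a plain single-vantage-point tree (a vp-tree, not the paper's mvp-tree) with a range query. It is an illustration under stated assumptions, not the authors' implementation: the names VPNode, build_vp_tree and range_search are hypothetical, the vantage point is simply the first point of each subset, and the split radius is taken as the median distance. Pruning relies only on the triangle inequality, so any metric distance function can be passed in.

```python
class VPNode:
    """One node of a simple vp-tree (hypothetical illustration)."""
    def __init__(self, vantage, radius, inside, outside):
        self.vantage = vantage    # the vantage (reference) point
        self.radius = radius      # split radius around the vantage point
        self.inside = inside      # subtree with d(vantage, p) <= radius
        self.outside = outside    # subtree with d(vantage, p) > radius


def build_vp_tree(points, dist):
    """Recursively partition points into a ball and the shell outside it."""
    if not points:
        return None
    # Assumption: take the first point as the vantage point.
    vantage, rest = points[0], points[1:]
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    distances = [dist(vantage, p) for p in rest]
    radius = sorted(distances)[len(distances) // 2]   # median distance
    inside = [p for p, d in zip(rest, distances) if d <= radius]
    outside = [p for p, d in zip(rest, distances) if d > radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist),
                  build_vp_tree(outside, dist))


def range_search(node, query, r, dist, out=None):
    """Collect every indexed point p with dist(query, p) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: a match p satisfies d - r <= d(vantage, p) <= d + r.
    if d - r <= node.radius:      # the query ball can reach the inner region
        range_search(node.inside, query, r, dist, out)
    if d + r > node.radius:       # the query ball can reach the outer region
        range_search(node.outside, query, r, dist, out)
    return out


# Usage with a toy metric on integers (absolute difference).
abs_diff = lambda a, b: abs(a - b)
tree = build_vp_tree([0, 1, 2, 5, 9], abs_diff)
matches = sorted(range_search(tree, 2, 1.5, abs_diff))   # [1, 2]
```

Each node here costs one distance computation per query, and a whole subtree is skipped whenever the query ball cannot intersect its region. The mvp-tree described in the abstract extends this scheme by using more than one vantage point per node and by also filtering with distances stored at construction time, which this sketch does not attempt.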







