## Indexing Large Metric Spaces For Similarity Search Queries (2002)

Venue: ACM Transactions on Database Systems

Citations: 76 (0 self)

### BibTeX

@ARTICLE{Bozkaya02indexinglarge,

author = {Tolga Bozkaya and Meral Ozsoyoglu},

title = {Indexing Large Metric Spaces For Similarity Search Queries},

journal = {ACM Transactions on Database Systems},

year = {2002},

volume = {24},

pages = {361--404}

}

### Abstract

In many database applications, a common query is to find approximate matches to a given query item in a collection of data items. For example, given an image database, one may want to retrieve all images that are similar to a given query image. Distance-based index structures are proposed for applications where distance computations between objects of the data domain are expensive (as with high-dimensional data) and the distance function used is a metric. In this paper, we consider using distance-based index structures for similarity queries on large metric spaces. We elaborate on the approach of using reference points (vantage points) to partition the data space into spherical shell-like regions in a hierarchical manner. We introduce the multi-vantage point tree structure (mvp-tree), which uses more than one vantage point to partition the space into spherical cuts at each level. In answering similarity-based queries, the mvp-tree also utilizes the distances between the data points and the vantage points that are pre-computed at construction time. We summarize experiments comparing mvp-trees with vp-trees, which have a similar partitioning strategy but use only one vantage point at each level and do not make use of the pre-computed distances. Empirical studies show that the mvp-tree outperforms the vp-tree by 20% to 80% for varying query ranges and different distance distributions. Next, we generalize the idea of using multiple vantage points, and discuss the results of experiments on how varying the number of vantage points used in a node affects search performance, and how much performance gain is obtained by making use of pre-computed distances. The results show that it may be best to use a large number of vantage points in an internal node, to end up with a single directory node, and to keep as many of the pre-computed distances as possible to provide more efficient filtering during search operations. Finally, we provide some experimental results comparing mvp-trees with M-trees; the M-tree is a dynamic distance-based index structure for metric domains.
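The partitioning idea described in the abstract can be illustrated with a minimal sketch of the simpler baseline, a binary vp-tree with one vantage point per node. This is not the paper's mvp-tree and not the authors' code: the names `VPNode`, `build`, and `range_search` are hypothetical, the vantage point is simply the first point in the list (an assumption made for brevity), and no pre-computed distances are kept. It only shows how the triangle inequality lets a range query skip a whole spherical region.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Sequence


@dataclass
class VPNode:
    vantage: Any                        # the reference (vantage) point
    radius: float = 0.0                 # median distance from the vantage point
    inside: Optional["VPNode"] = None   # points with dist(vantage, p) <= radius
    outside: Optional["VPNode"] = None  # points with dist(vantage, p) > radius


def build(points: Sequence[Any], dist: Callable[[Any, Any], float]) -> Optional[VPNode]:
    """Build a binary vp-tree; assumption: the first point serves as vantage point."""
    if not points:
        return None
    vantage, rest = points[0], list(points[1:])
    if not rest:
        return VPNode(vantage)
    dists = [dist(vantage, p) for p in rest]
    radius = sorted(dists)[len(dists) // 2]  # median splits the set roughly in half
    inside = [p for p, d in zip(rest, dists) if d <= radius]
    outside = [p for p, d in zip(rest, dists) if d > radius]
    return VPNode(vantage, radius, build(inside, dist), build(outside, dist))


def range_search(node: Optional[VPNode], query: Any, r: float,
                 dist: Callable[[Any, Any], float],
                 out: Optional[List[Any]] = None) -> List[Any]:
    """Collect every stored point p with dist(query, p) <= r."""
    if out is None:
        out = []
    if node is None:
        return out
    d = dist(query, node.vantage)
    if d <= r:
        out.append(node.vantage)
    # Triangle inequality: a match p satisfies d - r <= dist(vantage, p) <= d + r.
    # The inner region can hold a match only if d - r <= radius.
    if d - r <= node.radius:
        range_search(node.inside, query, r, dist, out)
    # The outer region can hold a match only if d + r > radius.
    if d + r > node.radius:
        range_search(node.outside, query, r, dist, out)
    return out
```

For example, with integers under absolute difference as the metric, searching a tree built from `[1, 5, 9, 12, 20, 21, 30]` for points within distance 3 of 10 returns 9 and 12, and the region holding 21 and 30 is pruned at the root without computing any distance to those points. The mvp-tree described above extends this scheme with additional vantage points per node and with stored construction-time distances that filter candidates further.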

### Citations

2381 | R-trees: A dynamic index structure for spatial searching - Guttman - 1984 |

1246 | Design and Analysis of Spatial Data Structures
- Samet
- 1990
Citation Context: ...revious work. In section 3.3, vp-tree structure is discussed in more detail. 3.1 Distance Transformations to Euclidean Spaces For low-dimensional Euclidean domains, the conventional index structures ([Sam89]) such as R-trees (and its variations) [Gut84, SRF87, BKSS90] can be used effectively to answer similarity queries. In such cases, a near neighbor search query would ask for all the objects in (or tha... |

1066 | The R*-tree: An efficient and robust access method for points and rectangles - Beckmann, Kriegel, et al. - 1990 |

556 | M-tree: An efficient access method for similarity search in metric spaces
- Ciaccia, Patella, et al.
- 1997
Citation Context: ...is more expensive than the vp-tree, but its search algorithm makes less distance computations in the experiments for different data sets. More recently, Ciaccia et al. introduced the M-tree structure [CPZ97], which differs from the other distance-based index structures by being able to handle dynamic operations. The M-tree is constructed bottom-up (in contrast to the other structures such as vp-tree, GNA... |

544 | The X-Tree: An index structure for high-dimensional data
- Berchtold, Keim, et al.
- 1996
Citation Context: ...sionality of the Euclidean data sets, or the choice of Euclidean distance L2 metric is not of particular significance here. There are many other indexing techniques such as TV-trees [LJF94], X-trees [BKP96] that are particularly designed for high dimensional Euclidean data. For general metric spaces, we only use the pairwise distances between objects in the data space for both index construction and sea... |

519 | Nearest neighbor queries
- Roussopoulos, Kelley, et al.
- 1995
Citation Context: ...ere the center is the query object and the radius is the tolerance factor r. There are some special techniques for other forms of similarity queries, such as nearest neighbor queries. For example, in [RKV95], some heuristics are introduced to efficiently search the R-tree structure to answer nearest neighbor queries. However, the conventional spatial structures stop being efficient if the dimensionality ... |

447 | Fast subsequence matching in time-series databases - Faloutsos, Ranganathan, et al. - 1994 |

290 | The R+-tree: A dynamic index for multi-dimensional objects - Sellis, Roussopoulos, et al. - 1987 |

208 | The TV-Tree: An Index Structure for High-Dimensional Data
- Lin, Jagadish, et al.
- 1994
Citation Context: .... Note that dimensionality of the Euclidean data sets, or the choice of Euclidean distance L2 metric is not of particular significance here. There are many other indexing techniques such as TV-trees [LJF94], X-trees [BKP96] that are particularly designed for high dimensional Euclidean data. For general metric spaces, we only use the pairwise distances between objects in the data space for both index con... |

193 | A cost model for nearest neighbor search in high dimensional data spaces - Berchtold, Bohm, et al. - 1997 |

189 | Near neighbor search in large metric spaces
- Brin
- 1995
Citation Context: ... using a single reference point for all nodes in the same level is an interesting idea. We use a similar technique in the design of mvp-trees. The GNAT (Geometric Near-Neighbor Access Tree) structure [Bri95] is another mechanism for answering near neighbor queries. A k number of split points are chosen at the top level. Each one of the remaining points are associated with one of the k data sets (one for ... |

180 | Satisfying general proximity/similarity queries with metric trees
- Uhlmann
- 1991
Citation Context: ...tage point tree) as a general solution to the problem of answering similarity based queries efficiently for high-dimensional metric spaces. The mvp-tree is similar to the vp-tree (vantage point tree) [Uhl91] in the sense that both structures use relative distances from a vantage point to partition the domain space. In vp-trees, at every node of the tree, a vantage point is chosen among the data points, a... |

131 | Some approaches to best-match file searching
- Burkhard, Keller
- 1973
Citation Context: ...exing techniques in section 3.2. 3.2 Distance-Based Index Structures There are a number of research results on efficiently answering similarity search queries in different contexts. Burkhard & Keller [BK73] suggested the use of three different techniques for the problem of finding best matching (closest) key words in a file to a given query key. They employ a metric distance function on the key space wh... |

122 | Distance-Based Indexing for High-Dimensional Metric Spaces
- Bozkaya, Ozsoyoglu
- 1997
Citation Context: ...ompared on a pixel by pixel basis by calculating the distance between two images as the accumulation of the differences between the intensities of their pixels. 1 A preliminary version of this paper ([BO97]) appeared in ACM-SIGMOD 1997. 2 This research is partially supported by the National Science Foundation grant IRI 92-24660, and the National Science Foundation FAW award IRI90 2 In all the applicatio... |

117 | Fast Similarity Search - Agrawal, Lin, et al. - 1995 |

74 | Content-Based Image Indexing
- Chiueh
- 1994
Citation Context: ...it is also possible to generalize it to a multi-way tree for larger fanouts. In [Yia93], Yiannilos provided some analytical results on vp-trees, and suggested ways to pick better vantage points. In [Chi94], Chiueh proposed an algorithm for the vp-tree structure to answer nearest neighbor queries. We talk about vp-trees in detail in section 3.3. The gh-tree (generalized hyperplane tree) structure was al... |

55 | A Cost Model for Similarity Queries in Metric Spaces
- Ciaccia, Patella, et al.
Citation Context: ...ing objects in their parent nodes. Experimental results for M-trees are provided in [CPZ97, CP98, CPZ98a, CPZ98b]. An analytical cost model based on distance distribution of the objects is derived in [CPZ98b] for M-trees. Evaluation of complex similarity queries (with multiple similarity predicates) using M-trees are discussed in [CPZ98a]. [CP98] provides an algorithm for creating an M-tree from a given s... |

55 | New techniques for best-match retrieval
- Shasha, Wang
- 1990
Citation Context: ...rch. Note that keys may appear in more than one clique; so the aim is to select the representative keys to be the ones that appear in as many cliques as possible. In another approach, Shasha and Wang [SW90] suggested using pre-computed distances between data elements to efficiently answer similarity search queries. The aim is to minimize the number of distance computations as much as possible, as they a... |

34 | Bulk loading the M-tree
- Ciaccia, Patella
- 1998
Citation Context: ...sed on distance distribution of the objects is derived in [CPZ98b] for M-trees. Evaluation of complex similarity queries (with multiple similarity predicates) using M-trees are discussed in [CPZ98a]. [CP98] provides an algorithm for creating an M-tree from a given set of objects via bulkloading. We provide some experimental results with M-trees in Section 8.2. 3.3 Vantage point tree structure Let us bri... |

29 | Processing Complex Similarity Queries with Distance-Based Access Methods
- Ciaccia, Patella, et al.
- 1998
Citation Context: ...t model based on distance distribution of the objects is derived in [CPZ98b] for M-trees. Evaluation of complex similarity queries (with multiple similarity predicates) using M-trees are discussed in [CPZ98a]. [CP98] provides an algorithm for creating an M-tree from a given set of objects via bulkloading. We provide some experimental results with M-trees in Section 8.2. 3.3 Vantage point tree structure Le... |

16 | Approximate matching with high dimensionality R-trees. M.Sc. scholarly paper
- Otterman
- 1992
Citation Context: ...ed to efficiently search the R-tree structure to answer nearest neighbor queries. However, the conventional spatial structures stop being efficient if the dimensionality is high. Experimental results [Ott92] show that R-trees become inefficient for n-dimensional spaces where n is greater than 20. It is possible to make use of conventional spatial index structures for some high-dimensional Euclidean domai... |

14 | Efficient and effective querying by image content - Faloutsos, et al. - 1994 |

6 | Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces
- Yiannilos
- 1993
Citation Context: ...low that node, which are constructed in the same way recursively. Although the vp-tree was introduced as a binary tree, it is also possible to generalize it to a multi-way tree for larger fanouts. In [Yia93], Yiannilos provided some analytical results on vp-trees, and suggested ways to pick better vantage points. In [Chi94], Chiueh proposed an algorithm for the vp-tree structure to answer nearest neigh... |

1 | A Cost Model for Similarity Queries in Metric Spaces, to appear - Ciaccia, Patella, et al. - 1998 |