Results 1-10 of 43
Index-driven similarity search in metric spaces
ACM Transactions on Database Systems, 2003
Cited by 134 (6 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
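As a rough illustration of distance-based indexing, the sketch below answers a range query with a pivot table and triangle-inequality pruning. It is only one simple member of the family the survey covers, not the article's framework; the names `PivotIndex`, `range_query`, and `euclidean` are hypothetical, and the Euclidean metric stands in for an arbitrary metric d.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    """Example metric; any function satisfying the metric axioms would do."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


class PivotIndex:
    """Stores the distance from every object to a few pivot objects."""

    def __init__(self, objects: Sequence[Point], pivots: Sequence[Point],
                 d: Callable[[Point, Point], float]) -> None:
        self.objects = list(objects)
        self.pivots = list(pivots)
        self.d = d
        # table[i][j] = d(objects[i], pivots[j]), computed once at build time
        self.table = [[d(o, p) for p in self.pivots] for o in self.objects]

    def range_query(self, q: Point, radius: float) -> List[Point]:
        """Return all objects o with d(q, o) <= radius."""
        q_to_pivots = [self.d(q, p) for p in self.pivots]
        hits = []
        for obj, row in zip(self.objects, self.table):
            # Triangle inequality: |d(q, p) - d(o, p)| <= d(q, o).
            # If the lower bound already exceeds the radius, skip the object
            # without computing d(q, o).
            if any(abs(qp - op) > radius for qp, op in zip(q_to_pivots, row)):
                continue
            if self.d(q, obj) <= radius:
                hits.append(obj)
        return hits
```

The pruning step never discards a true answer, because the quantity it tests is a lower bound on the real distance; it only saves distance computations.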
On the Marriage of Lp-norms and Edit Distance
In VLDB, 2004
Cited by 58 (2 self)
Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are non-metric. The first ...
Incremental Similarity Search in Multimedia Databases
2000
Cited by 23 (2 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some distance measure d, usually a distance metric. Existing methods for handling similarity search in this setting fall into one of two classes. The first is based on mapping to a low-dimensional vector space (making use of data structures such as the R-tree), while the second directly indexes the objects based on distances (making use of data structures such as the M-tree). We introduce a general framework for performing search based on distances, and present an incremental nearest neighbor algorithm that operates on an arbitrary "search hierarchy". We show how this framework can be applied in both classes of similarity search methods, by defining a suitable search hierarchy for a number of different indexing structures. Armed with an appropriate search hierarchy, our algorithm thus performs incremental similarity search, wherein the result objects are reported one by one in order of similarity to a query object, with as little effort as possible expended to produce each new result object. This is especially important in interactive database applications, as it makes it possible to display partial query results early. The incremental aspect also provides significant benefits in situations when the number of desired neighbors is unknown in advance. Furthermore, our algorithm is at least as efficient as existing k-nearest neighbor algorithms, in terms of the number of distance computations and index node accesses. In fact, provided that the search hierarchy is properly defined, our algorithm can be shown to be optimal in the sense of performing as few distance ...
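The incremental idea can be sketched as a best-first traversal driven by a priority queue. The code below is a minimal sketch under stated assumptions, not the paper's algorithm as published: the search hierarchy is assumed to be a tree of bounding balls, and `Node`, `incremental_nn`, and `euclidean` are hypothetical names. Each node is keyed by a lower bound on the distance from the query to anything it contains, so results can be reported one at a time in non-decreasing distance.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Iterator, List, Tuple

Point = Tuple[float, ...]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class Node:
    """Assumed hierarchy element: a ball covering everything beneath it."""
    center: Point
    radius: float
    children: List["Node"] = field(default_factory=list)
    objects: List[Point] = field(default_factory=list)


def incremental_nn(root: Node, q: Point,
                   d: Callable[[Point, Point], float]
                   ) -> Iterator[Tuple[Point, float]]:
    """Yield (object, distance) pairs in non-decreasing distance from q."""
    tie = itertools.count()  # tie-breaker so heap entries never compare items

    def lower_bound(node: Node) -> float:
        # No object inside the ball can be closer to q than this.
        return max(0.0, d(q, node.center) - node.radius)

    heap = [(lower_bound(root), next(tie), False, root)]
    while heap:
        key, _, is_object, item = heapq.heappop(heap)
        if is_object:
            yield item, key  # nothing left in the queue can be closer
            continue
        for child in item.children:
            heapq.heappush(heap, (lower_bound(child), next(tie), False, child))
        for obj in item.objects:
            heapq.heappush(heap, (d(q, obj), next(tie), True, obj))
```

Because the function is a generator, a caller can stop after any number of neighbors (for example with `itertools.islice`) without fixing k in advance, and subtrees whose lower bound is never reached are never expanded.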
Query-sensitive embeddings
In ACM International Conference on Management of Data (SIGMOD), 706–717. ACM Transactions on Database Systems, Vol. ?, No. ?, ? 20?. Vassilis Athitsos et al.
Cited by 21 (11 self)
A common problem in many types of databases is retrieving the most similar matches to a query object. Finding those matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. Embedding methods can significantly speed up retrieval by mapping objects into a vector space, where distances can be measured rapidly using a Minkowski metric. In this paper we present a novel way to improve embedding quality. In particular, we propose to construct embeddings that use a “query-sensitive” distance measure for the target space of the embedding. This distance measure is used to compare the vectors that the query and database objects are mapped to. The term “query-sensitive” means that the distance measure changes depending on the current query object. We demonstrate theoretically that using a query-sensitive distance measure increases the modeling power of embeddings and allows them to capture more of the structure of the original space. We also demonstrate experimentally that query-sensitive embeddings can significantly improve retrieval performance. In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better tradeoffs between efficiency and retrieval accuracy.
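To make "the distance measure changes depending on the current query object" concrete, here is a minimal sketch of a weighted L1 distance whose per-dimension weights are functions of the query's own coordinates. The interval-based weighting rule and the names `interval_weight` and `query_sensitive_l1` are assumptions chosen for illustration; the abstract does not specify how the weights are obtained.

```python
from typing import Callable, Sequence

WeightFn = Callable[[float], float]


def interval_weight(lo: float, hi: float, alpha: float) -> WeightFn:
    """Hypothetical rule: a dimension counts (with weight alpha) only when
    the query's coordinate in that dimension falls inside [lo, hi]."""
    return lambda q_coord: alpha if lo <= q_coord <= hi else 0.0


def query_sensitive_l1(q_vec: Sequence[float], x_vec: Sequence[float],
                       weight_fns: Sequence[WeightFn]) -> float:
    """Weighted L1 distance in the embedded space; the weights are
    evaluated on the query vector, never on the database vector."""
    return sum(w(qi) * abs(qi - xi)
               for w, qi, xi in zip(weight_fns, q_vec, x_vec))


weights = [interval_weight(0.0, 1.0, 2.0), interval_weight(0.0, 1.0, 1.0)]
d_forward = query_sensitive_l1((0.5, 5.0), (1.5, 9.0), weights)
d_reverse = query_sensitive_l1((1.5, 9.0), (0.5, 5.0), weights)
```

In this sketch `d_forward` and `d_reverse` differ, because the weights depend on which vector plays the role of the query.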
Using MoBIoS' Scalable Genome Joins to Find Conserved Primer Pair Candidates Between Two Genomes
Bioinformatics, 2004
Cited by 13 (9 self)
For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous DNA regions using the polymerase chain reaction (PCR). We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of metric-space indexing techniques, two genomes can be compared in O(m log n), where m and n are the lengths of the genomes, versus O(mn) for BLAST-based analysis. The filtering of low complexity regions may also be accomplished by directly assessing the uniqueness of the region. We describe mSQL, a SQL extension being developed for MoBIoS that encapsulates the algorithmic details in a common database programming language, shielding end users from esoteric programming.
Efficiently answering top-k typicality queries on large databases
In VLDB, 2007
Cited by 12 (1 self)
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognition science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. To answer questions like “Who are the top-k most typical NBA players?”, the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Computing the exact answer to a top-k typicality query requires quadratic time which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations. (1) The randomized tournament algorithm has linear complexity though it does not provide a theoretical guarantee on the quality of the answers. (2) The direct local typicality approximation using VP-trees provides an approximation quality guarantee. (3) A VP-tree can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree data structure. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.
BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval
2008
Cited by 12 (5 self)
This paper describes BoostMap, a method for efficient nearest neighbor retrieval under computationally expensive distance measures. Database and query objects are embedded into a vector space in which distances can be measured efficiently. Each embedding is treated as a classifier that predicts for any three objects X, A, B whether X is closer to A or to B. It is shown that a linear combination of such embedding-based classifiers naturally corresponds to an embedding and a distance measure. Based on this property, the BoostMap method reduces the problem of embedding construction to the classical boosting problem of combining many weak classifiers into an optimized strong classifier. The classification accuracy of the resulting strong classifier is a direct measure of the amount of nearest neighbor structure preserved by the embedding. An important property of BoostMap is that the embedding optimization criterion is equally valid in both metric and nonmetric spaces. Performance is evaluated in databases of hand images, handwritten digits, and time series. In all cases, BoostMap significantly improves retrieval efficiency with small losses in accuracy compared to brute-force search. Moreover, BoostMap significantly outperforms existing nearest neighbor retrieval methods such as Lipschitz embeddings, FastMap, and VP-trees.
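The correspondence between embeddings, triple classifiers, and a distance measure can be sketched in a few lines. This is a sketch under assumptions rather than the BoostMap training procedure: one-dimensional embeddings are taken to be distances to a reference object, the weights are supplied by hand instead of being learned by boosting, and all function names are hypothetical.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]
Embedding = Callable[[Point], float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def reference_embedding(r: Point, d=euclidean) -> Embedding:
    """Assumed 1D embedding: map x to its distance from reference object r."""
    return lambda x: d(x, r)


def triple_classifier(F: Embedding):
    """Positive output: X is predicted closer to A; negative: closer to B."""
    return lambda x, a, b: abs(F(x) - F(b)) - abs(F(x) - F(a))


def combined_classifier(embeddings: Sequence[Embedding],
                        alphas: Sequence[float]):
    """Weighted vote of the per-embedding classifiers (weights given here,
    not learned)."""
    parts = [triple_classifier(F) for F in embeddings]
    return lambda x, a, b: sum(w * c(x, a, b) for w, c in zip(alphas, parts))


def weighted_l1(embeddings: Sequence[Embedding], alphas: Sequence[float],
                u: Point, v: Point) -> float:
    """Distance in the multi-dimensional embedding (F_1(x), ..., F_n(x))."""
    return sum(w * abs(F(u) - F(v)) for w, F in zip(alphas, embeddings))
```

For any triple, the combined classifier's output equals `weighted_l1(..., x, b) - weighted_l1(..., x, a)`, which is the sense in which a linear combination of classifiers corresponds to an embedding together with a weighted L1 distance.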
Nearest Neighbor Retrieval Using Distance-Based Hashing
Cited by 12 (1 self)
A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including nonmetric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good tradeoffs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
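One way binary hash functions can be built from distances alone is to project each object onto the "line" through two pivot objects and threshold the result. The sketch below assumes that construction, with a median threshold estimated from a sample; both choices, and the names `line_projection`, `make_binary_hash`, `build_table`, and `lookup`, are assumptions for illustration, since the abstract does not give the construction.

```python
import statistics
from typing import Callable, Dict, Hashable, List, Sequence, Tuple

Distance = Callable[[Hashable, Hashable], float]


def line_projection(x, x1, x2, d: Distance) -> float:
    """Position of x along the pivot pair (x1, x2), using distances only.
    Requires d(x1, x2) > 0."""
    d12 = d(x1, x2)
    return (d(x, x1) ** 2 + d12 ** 2 - d(x, x2) ** 2) / (2 * d12)


def make_binary_hash(x1, x2, sample: Sequence, d: Distance):
    """Return h(x) in {0, 1}; the threshold is the sample median, so about
    half of the sample lands on each side (an assumed balancing rule)."""
    threshold = statistics.median(
        [line_projection(s, x1, x2, d) for s in sample])
    return lambda x: 1 if line_projection(x, x1, x2, d) >= threshold else 0


def build_table(objects: Sequence, hash_fns: Sequence) -> Dict[Tuple[int, ...], List]:
    """Multi-bit table: the bucket key concatenates several binary hashes."""
    table: Dict[Tuple[int, ...], List] = {}
    for obj in objects:
        key = tuple(h(obj) for h in hash_fns)
        table.setdefault(key, []).append(obj)
    return table


def lookup(table: Dict[Tuple[int, ...], List], hash_fns: Sequence, q) -> List:
    """Candidates for query q: the objects sharing q's bucket."""
    return table.get(tuple(h(q) for h in hash_fns), [])
```

Only the candidates returned by `lookup` need an exact distance computation, which is where the speed-up comes from; near neighbors that fall in a different bucket are missed, which is why retrieval is approximate and several tables are typically used.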
A Metric Cache for Similarity Search
In LSDS-IR, 2008
Cited by 9 (0 self)
Similarity search in metric spaces is a general paradigm that can be used in several application fields. It can also be effectively exploited in content-based image retrieval systems, which are shifting their target towards the Web-scale dimension. In this context, an important issue becomes the design of scalable solutions, which combine parallel and distributed architectures with caching at several levels. To this end, we investigate the design of a similarity cache that works in metric spaces. It is able to answer with exact and approximate results: even when an exact match is not present in cache, our cache may return an approximate result set with quality guarantees. By conducting tests on a collection of one million high-quality digital photos, we show that the proposed caching techniques can have a significant impact on performance, just as caching of text queries has proved effective for traditional Web search engines.
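The reason a cache can answer a query it has never seen is the triangle inequality. The sketch below covers the simplest case only, range queries whose ball lies entirely inside a cached ball, and is an assumption-laden illustration rather than the paper's cache design; `RangeCache`, `put`, and `get` are hypothetical names.

```python
from typing import Callable, List, Optional, Tuple


class RangeCache:
    """Caches complete answers to range queries in a metric space."""

    def __init__(self, d: Callable[[object, object], float]) -> None:
        self.d = d
        # Each entry: (query object, radius, all objects within that radius)
        self.entries: List[Tuple[object, float, list]] = []

    def put(self, q, radius: float, results: list) -> None:
        self.entries.append((q, radius, list(results)))

    def get(self, q, radius: float) -> Optional[list]:
        """Return the exact answer if some cached ball contains the query
        ball, otherwise None (the caller falls back to the index)."""
        for cq, cr, cres in self.entries:
            # If d(q, cq) + radius <= cr then every o with d(q, o) <= radius
            # satisfies d(cq, o) <= d(cq, q) + d(q, o) <= cr, so o is already
            # among the cached results; filtering them is enough.
            if self.d(q, cq) + radius <= cr:
                return [o for o in cres if self.d(q, o) <= radius]
        return None
```

When the containment test fails, the cached objects close to the new query could still be returned as an approximate answer; how to attach a quality guarantee to such an answer is the part this sketch leaves out.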
Effective Indexing and Filtering for Similarity Search in Large Biosequence Databases
In Third IEEE Symposium on BioInformatics and BioEngineering (BIBE’03), 2003
Cited by 7 (0 self)
We present a multidimensional indexing approach for fast sequence similarity search in DNA and protein databases. In particular, we propose effective transformations of subsequences into numerical vector domains and build efficient index structures on the transformed vectors. We then define distance functions in the transformed domain and examine properties of these functions. We experimentally compared their (a) approximation quality for k-Nearest Neighbor (k-NN) queries, (b) pruning ability and (c) approximation quality for ε-range queries. Results for k-NN queries, which we present here, show that our proposed distances FD2 and WD2 (i.e. Frequency and Wavelet Distance functions for 2-grams) perform significantly better than the others. We then develop effective index structures, based on R-trees and scalar quantization, on top of the transformed vectors and distance functions. Promising results from the experiments on real biosequence data sets are presented.
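As a rough picture of what transforming a subsequence into a numerical vector can look like, the sketch below counts overlapping 2-grams over the DNA alphabet and compares two vectors with a plain L1 distance. The L1 comparison is a stand-in assumption: the paper's FD2 and WD2 functions are not defined in the abstract and are not reproduced here, and the names `two_gram_vector` and `l1` are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import List, Sequence

ALPHABET = "ACGT"
# Fixed ordering of the 16 possible DNA 2-grams: AA, AC, AG, ..., TT
GRAMS = ["".join(p) for p in product(ALPHABET, repeat=2)]


def two_gram_vector(seq: str) -> List[int]:
    """Map a sequence to a 16-dimensional vector of overlapping 2-gram counts."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [counts[g] for g in GRAMS]


def l1(u: Sequence[int], v: Sequence[int]) -> int:
    """Stand-in distance in the transformed domain (not the paper's FD2)."""
    return sum(abs(a - b) for a, b in zip(u, v))
```

Vectors of this kind have a fixed, small dimensionality regardless of sequence length, which is what makes them suitable for multidimensional index structures.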