Results 1 - 10 of 93
Index-driven similarity search in metric spaces
ACM Transactions on Database Systems, 2003
Cited by 192 (8 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
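To make the idea of distance-based indexing concrete, here is a minimal Python sketch of a range query that prunes candidates with the triangle inequality. It is not taken from the article: the single-pivot table and the names `build_pivot_table` and `range_query` are assumptions chosen for illustration, and real methods use richer search hierarchies.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_pivot_table(data: Sequence[T], pivot: T,
                      dist: Callable[[T, T], float]) -> List[float]:
    """Precompute d(pivot, o) for every object o (hypothetical one-pivot index)."""
    return [dist(pivot, o) for o in data]


def range_query(data: Sequence[T], pivot: T, table: Sequence[float],
                dist: Callable[[T, T], float], q: T,
                r: float) -> Tuple[List[T], int]:
    """Return all objects within distance r of q, plus the number of
    distance computations performed at query time.

    For a metric d, |d(q, p) - d(p, o)| <= d(q, o), so any object whose
    lower bound already exceeds r can be discarded without computing d(q, o).
    """
    dq = dist(q, pivot)
    computed = 1  # the query-to-pivot distance
    results: List[T] = []
    for o, dpo in zip(data, table):
        if abs(dq - dpo) > r:
            continue  # pruned by the triangle inequality
        computed += 1
        if dist(q, o) <= r:
            results.append(o)
    return results, computed
```

The pruning step is only valid when `dist` satisfies the triangle inequality, which is the metric assumption the abstract states.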
On the Marriage of Lp-norms and Edit Distance
In VLDB, 2004
Cited by 101 (3 self)
Existing studies on time series are based on two categories of distance functions. The first category consists of the Lp-norms. They are metric distance functions but cannot support local time shifting. The second category consists of distance functions which are capable of handling local time shifting but are non-metric. The first
Melody extraction from polyphonic music signals using pitch contour characteristics
IEEE Transactions on Audio, Speech, and Language Processing, 2012
Cited by 69 (24 self)
In this paper we describe our submission for the audio melody extraction task of the Music Information Retrieval Evaluation eXchange (MIREX) 2011 campaign. The system presented here is an updated version of the one submitted to last year’s campaign. Following a detailed analysis of each step of our method, system parameters have been optimised for melody extraction and the implementation is now more efficient. Two variants of the system have been submitted, each making use of a different spectral transform, allowing us to assess whether the difference between them is significant for overall performance. Following the description of the system, we describe the datasets and metrics used for evaluation. This is followed by a summary of the results and some conclusions.
Nearest Neighbor Retrieval Using Distance-Based Hashing
Cited by 27 (1 self)
A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as Locality Sensitive Hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multi-bit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
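The abstract does not spell out how the binary hash functions are built. The Python sketch below shows one plausible construction that needs only a distance function: project an object onto the "line" through two pivot objects and threshold the result to a bit. The projection formula, the interval thresholds `t1` and `t2`, and all function names are assumptions made for illustration, not details confirmed by the text.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]


def line_projection(x: T, x1: T, x2: T, dist: Dist) -> float:
    """Pseudo-projection of x onto the line through pivots x1 and x2,
    computed from distances alone (assumes dist(x1, x2) > 0)."""
    d12 = dist(x1, x2)
    return (dist(x, x1) ** 2 + d12 ** 2 - dist(x, x2) ** 2) / (2.0 * d12)


def make_binary_hash(x1: T, x2: T, t1: float, t2: float,
                     dist: Dist) -> Callable[[T], int]:
    """Return h(x) = 1 if the projection falls in [t1, t2], else 0.
    The thresholds are hypothetical parameters of this sketch."""
    def h(x: T) -> int:
        return 1 if t1 <= line_projection(x, x1, x2, dist) <= t2 else 0
    return h


def hash_key(x: T, hashes: Sequence[Callable[[T], int]]) -> Tuple[int, ...]:
    """Concatenate several binary hashes into one multi-bit table key."""
    return tuple(h(x) for h in hashes)
```

In a scheme like this, each hash table would use its own set of pivot pairs, and a query would be compared only against objects that share its key in at least one table.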
Techniques for Similarity Searching in Multimedia Databases
2010
Cited by 27 (3 self)
Techniques for similarity searching in multimedia databases are reviewed. This includes a discussion of the curse of dimensionality, as well as multidimensional indexing, distance-based indexing, and the actual search process which is realized by nearest neighbor finding.
Query-sensitive embeddings
In ACM International Conference on Management of Data (SIGMOD). 706–717. ACM Transactions on Database Systems, Vol. ?, No. ?, ? 20?. Vassilis Athitsos et al.
Cited by 24 (11 self)
A common problem in many types of databases is retrieving the most similar matches to a query object. Finding those matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. Embedding methods can significantly speed up retrieval by mapping objects into a vector space, where distances can be measured rapidly using a Minkowski metric. In this paper we present a novel way to improve embedding quality. In particular, we propose to construct embeddings that use a “query-sensitive” distance measure for the target space of the embedding. This distance measure is used to compare the vectors that the query and database objects are mapped to. The term “query-sensitive” means that the distance measure changes depending on the current query object. We demonstrate theoretically that using a query-sensitive distance measure increases the modeling power of embeddings and allows them to capture more of the structure of the original space. We also demonstrate experimentally that query-sensitive embeddings can significantly improve retrieval performance. In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better trade-offs between efficiency and retrieval accuracy.
BoostMap: An Embedding Method for Efficient Nearest Neighbor Retrieval
2008
Cited by 23 (5 self)
This paper describes BoostMap, a method for efficient nearest neighbor retrieval under computationally expensive distance measures. Database and query objects are embedded into a vector space in which distances can be measured efficiently. Each embedding is treated as a classifier that predicts for any three objects X, A, B whether X is closer to A or to B. It is shown that a linear combination of such embedding-based classifiers naturally corresponds to an embedding and a distance measure. Based on this property, the BoostMap method reduces the problem of embedding construction to the classical boosting problem of combining many weak classifiers into an optimized strong classifier. The classification accuracy of the resulting strong classifier is a direct measure of the amount of nearest neighbor structure preserved by the embedding. An important property of BoostMap is that the embedding optimization criterion is equally valid in both metric and non-metric spaces. Performance is evaluated in databases of hand images, handwritten digits, and time series. In all cases, BoostMap significantly improves retrieval efficiency with small losses in accuracy compared to brute-force search. Moreover, BoostMap significantly outperforms existing nearest neighbor retrieval methods such as Lipschitz embeddings, FastMap, and VP-trees.
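The link between embeddings and triple classifiers can be sketched in a few lines of Python. The sketch assumes the simplest kind of one-dimensional embedding, the distance to a fixed reference object, and assumes that the weights `alphas` are already given; it does not implement the boosting step that would choose them. All names are hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Embedding = Callable[[T], float]


def reference_embedding(r: T, dist: Callable[[T, T], float]) -> Embedding:
    """1D embedding F(x) = dist(x, r) for a reference object r (an assumption)."""
    return lambda x: dist(x, r)


def triple_classifier(F: Embedding) -> Callable[[T, T, T], float]:
    """Classifier on triples (x, a, b): a positive output predicts that
    x is closer to a than to b, a negative output predicts the opposite."""
    def classify(x: T, a: T, b: T) -> float:
        return abs(F(x) - F(b)) - abs(F(x) - F(a))
    return classify


def combined_classifier(Fs: Sequence[Embedding],
                        alphas: Sequence[float]) -> Callable[[T, T, T], float]:
    """Weighted sum of 1D triple classifiers. The sum equals a difference of
    weighted L1 distances in the multi-dimensional embedding (F1, ..., Fk)."""
    def weighted_l1(u: T, v: T) -> float:
        return sum(a * abs(F(u) - F(v)) for F, a in zip(Fs, alphas))

    def classify(x: T, a: T, b: T) -> float:
        return weighted_l1(x, b) - weighted_l1(x, a)
    return classify
```

Because the combined output is a difference of weighted L1 distances, a strong classifier of this form determines both an embedding and a distance measure on it, which is the correspondence the abstract describes.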
Using MoBIoS' Scalable Genome Joins to Find Conserved Primer Pair Candidates Between Two Genomes
Bioinformatics, 2004
Cited by 18 (10 self)
For the purpose of identifying evolutionary reticulation events in flowering plants, we determine a large number of paired, conserved DNA oligomers that may be used as primers to amplify orthologous DNA regions using the polymerase chain reaction (PCR). We develop an initial candidate set by comparing the Arabidopsis and rice genomes using MoBIoS (Molecular Biological Information System). MoBIoS is a metric-space database management system targeting life science data. Through the use of metric-space indexing techniques, two genomes can be compared in O(m log n), where m and n are the lengths of the genomes, versus O(mn) for BLAST-based analysis. The filtering of low complexity regions may also be accomplished by directly assessing the uniqueness of the region. We describe mSQL, a SQL extension being developed for MoBIoS that encapsulates the algorithmic details in a common database programming language, shielding end-users from esoteric programming.
Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces
2007
Cited by 16 (5 self)
In multimedia systems we usually need to retrieve DB objects based on their similarity to a query object, while the similarity assessment is provided by a measure which defines a (dis)similarity score for every pair of DB objects. In most existing applications, the similarity measure is required to be a metric, where the triangle inequality is utilized to speed up the search for relevant objects by use of metric access methods (MAMs), e.g. the M-tree. Recent research has shown, however, that non-metric measures are more appropriate for similarity modeling due to their robustness and the ease of modeling a made-to-measure similarity. Unfortunately, due to the lack of triangle inequality, the non-metric measures cannot be directly utilized by MAMs. From another point of view, some sophisticated similarity measures could be available in a black-box non-analytic form (e.g. as an algorithm or even a hardware device), where no information about their topological properties is provided, so we have to consider them as non-metric measures as well. From yet another point of view, the concept of similarity measuring itself is inherently imprecise and we often prefer fast but approximate retrieval over an exact but slower one. To date, the mentioned aspects of similarity retrieval have been solved separately, i.e. exact vs. approximate search or metric vs. non-metric search. In this paper we introduce a similarity retrieval framework which incorporates both of the aspects into a single unified model. Based on the framework, we show that for any dissimilarity measure (either a metric or non-metric) we are able to change the “amount” of triangle inequality, and so to obtain an approximate or full metric which can be used for MAM-based retrieval. Due to the varying “amount” of triangle inequality, the measure is modified in a way suitable for either an exact but slower or an approximate but faster retrieval. Additionally, we introduce the TriGen algorithm, aimed at constructing the desired modification of any black-box distance automatically, using just a small fraction of the database.
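One way to picture the "amount" of triangle inequality is as the fraction of sampled object triplets whose three pairwise distances can form a triangle. The Python sketch below measures that fraction and applies a concave, increasing modifier to a dissimilarity. The fractional-power form `d ** (1 / (1 + w))` and the function names are assumptions chosen for illustration; the abstract does not state which modifiers the framework or TriGen actually use.

```python
from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Dist = Callable[[T, T], float]


def triangle_fraction(sample: Sequence[T], dist: Dist) -> float:
    """Fraction of triplets in the sample that satisfy the triangle
    inequality (largest distance <= sum of the other two)."""
    total = 0
    ok = 0
    for a, b, c in combinations(sample, 3):
        x, y, z = sorted((dist(a, b), dist(b, c), dist(a, c)))
        total += 1
        if z <= x + y + 1e-12:  # small tolerance for floating-point error
            ok += 1
    return ok / total if total else 1.0


def power_modifier(dist: Dist, w: float) -> Dist:
    """Wrap dist with the concave increasing function d -> d ** (1 / (1 + w)).
    Larger w makes the function more concave; w = 0 leaves dist unchanged."""
    exponent = 1.0 / (1.0 + w)
    return lambda a, b: dist(a, b) ** exponent
```

For example, squared difference on the numbers 0, 1, 2, 3 violates the triangle inequality on every triplet, while applying the modifier with `w = 1` (a square root) recovers the absolute difference, which is a metric. Because the modifier is increasing, it preserves the ordering of distances to any query, so similarity rankings are unchanged.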
Efficiently answering top-k typicality queries on large databases
In VLDB, 2007
Cited by 12 (1 self)
Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and cognition science to database query answering, and study the novel problem of answering top-k typicality queries. We model typicality in large data sets systematically. To answer questions like “Who are the top-k most typical NBA players?”, the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Computing the exact answer to a top-k typicality query requires quadratic time, which is often too costly for online query answering on large databases. We develop a series of approximation methods for various situations. (1) The randomized tournament algorithm has linear complexity though it does not provide a theoretical guarantee on the quality of the answers. (2) The direct local typicality approximation using VP-trees provides an approximation quality guarantee. (3) A VP-tree can be exploited to index a large set of objects. Then, typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree data structure. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.