Index-driven similarity search in metric spaces
- ACM Transactions on Database Systems, 2003
- Cited by 118 (6 self)
Similarity search is a very important operation in multimedia databases and other database applications involving complex objects, and involves finding objects in a data set S similar to a query object q, based on some similarity measure. In this article, we focus on methods for similarity search that make the general assumption that similarity is represented with a distance metric d. Existing methods for handling similarity search in this setting typically fall into one of two classes. The first directly indexes the objects based on distances (distance-based indexing), while the second is based on mapping to a vector space (mapping-based approach). The main part of this article is dedicated to a survey of distance-based indexing methods, but we also briefly outline how search occurs in mapping-based methods. We also present a general framework for performing search based on distances, and present algorithms for common types of queries that operate on an arbitrary “search hierarchy.” These algorithms can be applied on each of the methods presented, provided a suitable search hierarchy is defined.
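To make the idea of distance-based indexing concrete, here is a minimal sketch of a range query that uses one precomputed pivot distance per object and the triangle inequality to skip distance computations. It is not one of the surveyed methods verbatim; the class name `PivotIndex` and its methods are hypothetical, and a single pivot is assumed for brevity.

```python
from typing import Callable, List, Sequence, Tuple


class PivotIndex:
    """Hypothetical single-pivot index over objects in a metric space."""

    def __init__(self, objects: Sequence[float], pivot: float,
                 dist: Callable[[float, float], float]) -> None:
        self.dist = dist
        self.pivot = pivot
        # Precompute d(o, pivot) once for every object.
        self.table: List[Tuple[float, float]] = [(o, dist(o, pivot)) for o in objects]

    def range_query(self, q: float, radius: float) -> Tuple[List[float], int]:
        """Return (objects within radius of q, number of distance computations)."""
        d_qp = self.dist(q, self.pivot)
        computed = 1  # the query-to-pivot distance
        hits: List[float] = []
        for o, d_op in self.table:
            # Triangle inequality: d(q, o) >= |d(q, p) - d(o, p)|.
            # If that lower bound already exceeds the radius, o cannot qualify.
            if abs(d_qp - d_op) > radius:
                continue
            computed += 1
            if self.dist(q, o) <= radius:
                hits.append(o)
        return hits, computed
```

The pruning is exact only because the distance is assumed to be a metric; with a measure that violates the triangle inequality, the lower bound could discard true results.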
Top-k Query Evaluation with Probabilistic Guarantees
- In VLDB, 2004
- Cited by 73 (15 self)
Top-k queries based on ranking elements of multidimensional datasets are a fundamental building block for many kinds of information discovery. The best known general-purpose algorithm for evaluating top-k queries is Fagin’s threshold algorithm (TA). Since the user’s goal behind top-k queries is to identify one or a few relevant and novel data items, it is intriguing to use approximate variants of TA to reduce run-time costs. This paper introduces a family of approximate top-k algorithms based on probabilistic arguments. When scanning index lists of the underlying multidimensional data space in descending order of local scores, various forms of convolution and derived bounds are employed to predict when it is safe, with high probability, to drop candidate items and to prune the index scans. The precision and the efficiency of the developed methods are experimentally evaluated based on a large Web corpus and a structured data collection.
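For orientation, the exact baseline that the abstract refers to, Fagin's threshold algorithm, can be sketched as follows. This is a minimal sketch of the exact TA, not the paper's approximate variants; it assumes summation as the aggregation function, non-negative local scores, and a score of 0 for items absent from a list. The function name `threshold_algorithm` is hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def threshold_algorithm(index_lists: Sequence[Sequence[Tuple[str, float]]],
                        k: int) -> List[Tuple[str, float]]:
    """Exact top-k by summed score; each list is sorted by descending local score."""
    # Dictionaries stand in for random access to an item's score in each list.
    lookup: List[Dict[str, float]] = [dict(lst) for lst in index_lists]
    seen: Dict[str, float] = {}
    depth = max((len(lst) for lst in index_lists), default=0)
    for pos in range(depth):
        threshold = 0.0
        for lst in index_lists:
            if pos >= len(lst):
                continue  # an exhausted list contributes 0 to the threshold
            item, score = lst[pos]  # sorted access
            threshold += score
            if item not in seen:
                # Random access: fetch the item's score from every list.
                seen[item] = sum(t.get(item, 0.0) for t in lookup)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best total is at least the best total
        # that any still-unseen item could reach.
        if len(top) == k and top[-1][1] >= threshold:
            return top
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
```

The approximate variants described in the abstract replace the deterministic stopping test with probabilistic predictions; that part is not modeled here.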
KLEE: A Framework for Distributed Top-K Query Algorithms
- In VLDB, 2005
- Cited by 53 (11 self)
This paper addresses the efficient processing of top-k queries in wide-area distributed data repositories where the index lists for the attribute values (or text terms) of a query are distributed across a number of data peers and the computational costs include network latency, bandwidth consumption, and local peer work. We present KLEE, a novel algorithmic framework for distributed top-k queries, designed for high performance and flexibility. KLEE makes a strong case for approximate top-k algorithms over widely distributed data sources. It shows how great gains in efficiency can be enjoyed at low result-quality penalties. Further, KLEE affords the query-initiating peer the flexibility to trade off result quality against expected performance, and to trade off the number of communication phases engaged during query execution against network bandwidth performance. We have implemented KLEE and related algorithms and conducted a comprehensive performance evaluation. Our evaluation employed large real-world and synthetic web-data collections and query benchmarks. Our experimental results show that KLEE can achieve major performance gains in terms of network bandwidth, query response times, and peer loads, all with small errors in result precision and other result-quality measures.
Effective Proximity Retrieval by Ordering Permutations
- 2007
- Cited by 13 (4 self)
We introduce a new probabilistic proximity search algorithm for range and K-nearest neighbor (K-NN) searching in both coordinate and metric spaces. Although there exist solutions for these problems, they boil down to a linear scan when the space is intrinsically high-dimensional, as is the case in many pattern recognition tasks. This, for example, renders the K-NN approach to classification rather slow in large databases. Our novel idea is to predict closeness between elements according to how they order their distances towards a distinguished set of anchor objects. Each element in the space sorts the anchor objects from closest to farthest to it, and the similarity between orders turns out to be an excellent predictor of the closeness between the corresponding elements. We present extensive experiments comparing our method against state-of-the-art exact and approximate techniques, both in synthetic and real, metric and non-metric databases, measuring both CPU time and distance computations. The experiments demonstrate that our technique almost always improves upon the performance of alternative techniques, in some cases by a wide margin.
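The ordering idea in this abstract can be illustrated with a small sketch: each object is represented by the permutation of anchor indices sorted by distance, and database objects are visited in order of how similar their permutation is to the query's. The choice of the Spearman footrule as the permutation distance is an assumption made here for illustration, as are all function names.

```python
from typing import Callable, List, Sequence


def anchor_permutation(obj: float, anchors: Sequence[float],
                       dist: Callable[[float, float], float]) -> List[int]:
    """Anchor indices sorted from closest to farthest from obj."""
    return sorted(range(len(anchors)), key=lambda i: dist(obj, anchors[i]))


def footrule(perm_a: Sequence[int], perm_b: Sequence[int]) -> int:
    """Spearman footrule: total displacement of each anchor between two orders."""
    pos_b = {anchor: rank for rank, anchor in enumerate(perm_b)}
    return sum(abs(rank - pos_b[anchor]) for rank, anchor in enumerate(perm_a))


def candidate_order(query: float, database: Sequence[float],
                    anchors: Sequence[float],
                    dist: Callable[[float, float], float]) -> List[int]:
    """Database indices ordered by permutation similarity to the query."""
    q_perm = anchor_permutation(query, anchors, dist)
    # In a real index these permutations would be precomputed at build time.
    perms = [anchor_permutation(o, anchors, dist) for o in database]
    return sorted(range(len(database)), key=lambda i: footrule(q_perm, perms[i]))
```

A probabilistic search would then compute true distances for only a prefix of this order, so results are approximate: an object with a dissimilar permutation can still be close to the query.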
Reverse Nearest Neighbor Search in Metric Spaces
- TKDE
- Cited by 11 (0 self)
Given a set D of objects, a reverse nearest neighbor (RNN) query returns the objects o in D such that o is closer to a query object q than to any other object in D, according to a certain similarity metric. The existing RNN solutions are not sufficient because they either 1) rely on precomputed information that is expensive to maintain in the presence of updates or 2) are applicable only when the data consists of “Euclidean objects” and similarity is measured using the L2 norm. In this paper, we present the first algorithms for efficient RNN search in generic metric spaces. Our techniques require no detailed representations of objects, and can be applied as long as their mutual distances can be computed and the distance metric satisfies the triangle inequality. We confirm the effectiveness of the proposed methods with extensive experiments.
Index Terms: Reverse nearest neighbor, metric space.
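The RNN definition can be checked directly with a brute-force sketch that needs only a distance function. This is a quadratic-time reference for the definition, not the paper's algorithm; the function name is hypothetical, and treating ties as non-members (strict inequality) is an assumption.

```python
from typing import Callable, List, Sequence


def reverse_nearest_neighbors(q: float, data: Sequence[float],
                              dist: Callable[[float, float], float]) -> List[float]:
    """Objects o in data that are strictly closer to q than to any other object."""
    result: List[float] = []
    for i, o in enumerate(data):
        d_oq = dist(o, q)
        # o qualifies only if q beats every other object as o's nearest neighbor.
        if all(d_oq < dist(o, other) for j, other in enumerate(data) if j != i):
            result.append(o)
    return result
```

With data [1.0, 2.0, 6.0, 10.0] on the real line and q = 4.0, the objects 2.0 and 6.0 are equally near to q, yet only 6.0 has q as its own nearest neighbor. The relation is not symmetric, which is why nearest-neighbor indexes cannot answer RNN queries directly.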
Dynamic Skyline Queries in Metric Spaces
- Cited by 5 (1 self)
The skyline query is of great importance in many applications, such as multi-criteria decision making and business planning. In particular, a skyline point is a data object in the database whose attribute vector is not dominated by that of any other object. Previous methods to retrieve skyline points usually assume static data objects in the database (i.e., their attribute vectors are fixed), whereas several recent works focus on skyline queries with dynamic attributes. In this paper, we propose a novel variant of skyline queries, namely the metric skyline, whose dynamic attributes are defined in a metric space (i.e., not limited to Euclidean space). We illustrate an efficient and effective pruning mechanism to answer metric skyline queries through a metric index. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed pruning techniques over the metric index in answering metric skyline queries.
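As a rough illustration of dominance over dynamic attributes, the sketch below assumes that an object's dynamic attribute vector consists of its distances to a set of query objects, and that smaller distances are better. That reading is an assumption, since the abstract does not spell out the definition, and the brute-force scan stands in for the paper's index-based pruning. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def dominates(u: Tuple[float, ...], v: Tuple[float, ...]) -> bool:
    """u dominates v if it is no worse in every attribute and better in at least one."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def metric_skyline(data: Sequence[float], queries: Sequence[float],
                   dist: Callable[[float, float], float]) -> List[float]:
    """Objects whose vector of distances to the query objects is not dominated."""
    # Assumed dynamic attributes: one distance per query object.
    vecs = [tuple(dist(o, q) for q in queries) for o in data]
    return [
        o for i, o in enumerate(data)
        if not any(dominates(vecs[j], vecs[i]) for j in range(len(data)) if j != i)
    ]
```

Because the attribute vectors depend on the query objects, the skyline cannot be precomputed once and reused, which is what distinguishes this setting from a static skyline.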
Indexing schemes for similarity search in datasets of short protein fragments
- ArXiv e-print cs.DS/0309005, version 2, 2006
- Cited by 4 (3 self)
We propose a family of very efficient hierarchical indexing schemes for ungapped, score matrix-based similarity search in large datasets of short (4-12 amino acid) protein fragments. This type of similarity search has importance in both providing a building block to more complex algorithms and for possible use in direct biological investigations where datasets are of the order of 60 million objects. Our scheme is based on the internal geometry of the amino acid alphabet and performs exceptionally well, for example outputting 100 nearest neighbours to any possible fragment of length 10 after scanning on average less than one per cent of the entire dataset.
Supporting exact indexing of arbitrarily rotated shapes and periodic time series under Euclidean and warping distance measures
- 2009
Unified Framework for Fast Exact and Approximate Search in Dissimilarity Spaces
- 2003
- Cited by 3 (1 self)
In multimedia systems we usually need to retrieve DB objects based on their similarity to a query object, while the similarity assessment is provided by a measure which defines a (dis)similarity score for every pair of DB objects. In most existing applications, the similarity measure is required to be a metric, where the triangle inequality is utilized to speed up the search for relevant objects by use of metric access methods (MAMs), e.g. the M-tree. Recent research has shown, however, that non-metric measures are more appropriate for similarity modeling due to their robustness and the ease of modeling a made-to-measure similarity. Unfortunately, due to the lack of triangle inequality, the non-metric measures cannot be directly utilized by MAMs. From another point of view, some sophisticated similarity measures could be available in a black-box non-analytic form (e.g. as an algorithm or even a hardware device), where no information about their topological properties is provided, so we have to consider them as non-metric measures as well. From yet another point of view, the concept of similarity measuring itself is inherently imprecise and we often prefer fast but approximate retrieval over an exact but slower one. To date, the mentioned aspects of similarity retrieval have been addressed separately, i.e. exact vs. approximate search or metric vs. non-metric search. In this paper we introduce a similarity retrieval framework which incorporates both of the aspects into a single unified model. Based on the framework, we show that for any dissimilarity measure (either a metric or non-metric) we are able to change the “amount” of triangle inequality, and so to obtain an approximate or full metric which can be used for MAM-based retrieval. Due to the varying “amount” of triangle inequality, the measure is modified in a way suitable for either an exact but slower or an approximate but faster retrieval. Additionally, we introduce the TriGen algorithm, which aims to construct the desired modification of any black-box distance automatically, using just a small fraction of the database.
A Framework for the Comparison of Complex Patterns
- In: Proceedings of the International Workshop on Pattern Representation and Management (PaRMa 2004), Vol. 96 of CEUR Workshop Proceedings, CEUR-WS.org, Heraklion, Hellas, 2004
- Cited by 3 (0 self)
Data mining and knowledge discovery techniques are commonly used to extract condensed artifacts representing huge volumes of data. The comparison of such compact and semantically rich representations (which we call patterns) can be useful to avoid the direct comparison of underlying raw data. In this paper, we present a general framework for the assessment of similarity between patterns, by identifying the common features that characterize approaches proposed in the literature for particular applications. We also propose an implementation of the framework using a UML formalism, and discuss efficiency issues that arise when similarity queries are considered, i.e. when a similarity predicate is used to query a collection of patterns.

