A cost model and index architecture for the similarity join
- In ICDE, 2001
Cited by 31 (7 self)
The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: Fine-grained index structures are beneficial for the CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allows a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
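The join predicate described in this abstract can be illustrated with a naive nested loop. This is only a sketch of the definition (all pairs within distance ε), not the index-based algorithm the paper proposes; the function name, the sample points, and the choice of Euclidean distance are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(r, s, eps):
    """Return every pair (p, q), p in r and q in s, with distance <= eps.

    Naive O(|r| * |s|) reference version; hypothetical helper, not the
    paper's index architecture.
    """
    return [(p, q) for p in r for q in s if dist(p, q) <= eps]


# Two small 2-D point sets (made-up sample data).
R = [(0.0, 0.0), (1.0, 1.0)]
S = [(0.0, 0.5), (3.0, 3.0)]

pairs = similarity_join(R, S, eps=1.0)
print(pairs)
```

Every point of one set is compared with every point of the other, so the distance computations dominate the cost; that CPU effort is what the search structure inside each page is meant to reduce.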
Analysis of Predictive Spatio-Temporal Queries
- TODS, 2003
Cited by 29 (5 self)
In this paper we present probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s). Our models capture any query/object mobility combination (moving queries, moving objects or both) and any data type (points and rectangles) in arbitrary dimensionality. In addition, we develop specialized spatio-temporal histograms, which take into account both location and velocity information, and can be incrementally maintained. Extensive performance evaluation verifies that the proposed techniques produce highly accurate estimation on both uniform and non-uniform data.
Techniques for Similarity Searching in Multimedia Databases
- 2010
Cited by 27 (3 self)
Techniques for similarity searching in multimedia databases are reviewed. This includes a discussion of the curse of dimensionality, as well as multidimensional indexing, distance-based indexing, and the actual search process which is realized by nearest neighbor finding.
Region proximity in metric spaces and its use for approximate similarity search
- ACM Trans. Inf. Syst., 2003
Cited by 18 (3 self)
Similarity search structures for metric data typically bound object partitions by ball regions. Since regions can overlap, a relevant issue is to estimate the proximity of regions in order to predict the number of objects in the regions' intersection. This paper analyzes the problem using a probabilistic approach and provides a solution that effectively computes the proximity through realistic heuristics that only require small amounts of auxiliary data. An extensive simulation to validate the technique is provided. An application is developed to demonstrate how the proximity measure can be successfully applied to the approximate similarity search. Search speedup is achieved by ignoring data regions whose proximity to the query region is smaller than a user-defined threshold. This idea is implemented in a metric tree environment for the similarity range and “nearest neighbors” queries. Several measures of efficiency and effectiveness are applied to evaluate proposed approximate search algorithms on real-life data sets. An analytical model is developed to relate proximity parameters and the quality of search. Improvements of two orders of magnitude are achieved for moderately approximated search results. We demonstrate that the precision of proximity measures can significantly influence the quality of approximated algorithms.
Peer-to-peer similarity search in metric spaces
- In Proceedings of VLDB'07, 2007
Cited by 15 (10 self)
This paper addresses the efficient processing of similarity queries in metric spaces, where data is horizontally distributed across a P2P network. The proposed approach does not rely on arbitrary data movement, hence each peer joining the network autonomously stores its own data. We present SIMPEER, a novel framework that dynamically clusters peer data, in order to build distributed routing information at super-peer level. SIMPEER allows the evaluation of range and nearest neighbor queries in a distributed manner that reduces communication cost, network latency, bandwidth consumption and computational overhead at each individual peer. SIMPEER utilizes a set of distributed statistics and guarantees that all objects similar to the query are retrieved, without necessarily flooding the network during query processing. The statistics are employed for estimating an adequate query radius for k-nearest neighbor queries and for transforming the query into a range query. Our experimental evaluation employs both real-world and synthetic data collections, and our results show that SIMPEER performs efficiently, even in the case of a high degree of distribution.
String matching with metric trees using an approximate distance
- In SPIRE, LNCS 2476, 2002
Cited by 11 (0 self)
Searching a large data set for the strings that are most similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance of up to 90% with respect to the basic case. This, along with the fact that our solution is independent of both the distance used in the pre-test and the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, such as multimedia, data mining and pattern recognition, but also for the string matching problem.
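The filtering idea in this abstract can be sketched as follows, assuming the usual definition of the bag distance (the larger of the two multiset differences of the strings' characters), which never exceeds the edit distance. The sketch uses a plain linear scan rather than an M-tree, and all function names and sample strings are hypothetical.

```python
from collections import Counter


def bag_distance(a, b):
    """Cheap lower bound on the edit distance, computed on character multisets."""
    ca, cb = Counter(a), Counter(b)
    # Counter subtraction keeps only positive counts (multiset difference).
    return max(sum((ca - cb).values()), sum((cb - ca).values()))


def edit_distance(a, b):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def range_search(strings, query, radius):
    """Return the strings within edit distance `radius` of `query`."""
    hits = []
    for s in strings:
        if bag_distance(s, query) > radius:
            continue  # lower bound already exceeds the radius: discard cheaply
        if edit_distance(s, query) <= radius:
            hits.append(s)
    return hits


print(range_search(["kitten", "mitten", "banana"], "sitting", 3))
```

Because the bag distance is a lower bound, the pre-test can only discard strings that the exact test would also reject, so the result set is unchanged while the quadratic edit-distance computation is skipped for the discarded candidates.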
Nearest Neighbor Search in Multidimensional Spaces
- 1999
Cited by 11 (0 self)
The Nearest Neighbor Search problem is defined as follows: given a set P of n points, preprocess the points so as to efficiently answer queries that require finding the closest point in P to a query point q. If we are willing to settle for a point that is almost as close as the nearest neighbor, then we can relax the problem to the approximate Nearest Neighbor Search. Nearest Neighbor Search (exact or approximate) is an integral component in a wide range of applications that include multimedia databases, computational biology, data mining, and information retrieval. The common thread in all these applications is similarity search: given a database of objects, we want to return the object in the database that is most similar to a query object. The objects are mapped onto points in a high dimensional metric space, and similarity search reduces to a nearest neighbor search. The dimension of the underlying space may be on the order of a few hundred, or thousands; therefore, we r...
A Query-sensitive Cost Model for Similarity Queries with M-tree
- In Proc. of the 10th ADC, 1999
Cited by 10 (2 self)
We introduce a cost model for the M-tree access method [Ciaccia et al., 1997] which provides estimates of CPU (distance computations) and I/O costs for the execution of similarity queries as a function of each single query. This model is said to be query-sensitive, since it takes into account, by relying on the novel notion of "witness", the "position" of the query point inside the metric space indexed by the M-tree. We describe the basic concepts underlying the model along with different methods which can be used for its implementation; finally, we experimentally validate the model over both real and synthetic datasets.
PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases
Cited by 9 (0 self)
In this paper we introduce the Pivoting M-tree (PM-tree), a metric access method combining M-tree with the pivot-based approach. While in M-tree a metric region is represented by a hyper-sphere, in PM-tree the shape of a metric region is determined by intersection of the hyper-sphere and a set of hyper-rings. The set of hyper-rings for each metric region is related to a fixed set of pivot objects. As a consequence, the shape of a metric region bounds the indexed objects more tightly which, in turn, significantly improves the overall efficiency of similarity search. We present basic algorithms on PM-tree and two cost models for range query processing. Finally, the PM-tree efficiency is experimentally evaluated on large synthetic as well as real-world datasets.
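A minimal sketch of how the tighter region shape could prune a range query, assuming each region stores its hyper-sphere plus one [min, max] distance interval (hyper-ring) per pivot. The function and parameter names are hypothetical, the Euclidean distance stands in for an arbitrary metric, and this is not the authors' implementation.

```python
from math import dist  # stand-in metric; any metric distance could be passed in


def may_overlap(query, r_q, center, radius, pivots, rings, d=dist):
    """Return False if the query ball cannot intersect the region.

    The region is the intersection of a hyper-sphere (center, radius) and,
    for each pivot p_i, a hyper-ring rings[i] = (lo_i, hi_i) bounding the
    distances from p_i to the indexed objects.
    """
    # M-tree style test: the two balls must intersect.
    if d(query, center) > r_q + radius:
        return False
    # Additional hyper-ring tests: the query's distance interval to each
    # pivot, [d(q, p_i) - r_q, d(q, p_i) + r_q], must meet [lo_i, hi_i].
    for p, (lo, hi) in zip(pivots, rings):
        dq = d(query, p)
        if dq - r_q > hi or dq + r_q < lo:
            return False
    return True


pivots = [(0.0, 0.0)]
rings = [(4.5, 5.5)]  # made-up ring around the single pivot
# Passes the sphere test (1.5 <= 0.5 + 2.0) but fails the ring test (4.0 < 4.5).
print(may_overlap((3.5, 0.0), 0.5, (5.0, 0.0), 2.0, pivots, rings))
```

In the usage line, the sphere test alone would force a visit to the region, while the ring test discards it; this is the kind of extra pruning the abstract attributes to the tighter region shape.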
Towards Measuring the Searching Complexity of Metric Spaces
- In Proc. of the Mexican Computing Meeting, 2001
Cited by 7 (1 self)
In this paper we introduce a new measure of the intrinsic searching complexity of a general metric space. This measure reflects the expected behavior of the search algorithms on the metric space, yet it is easy to estimate and independent of the search algorithm. We prove average case lower bounds, in terms of this complexity measure, for a large class of proximity search algorithms. This gives some new insight on the intrinsic difficulty of the search problem in metric spaces.