A cost model and index architecture for the similarity join
 In ICDE
, 2001
Cited by 31 (7 self)
The similarity join is an important database primitive which has been successfully applied to speed up data mining algorithms. In the similarity join, two point sets of a multidimensional vector space are combined such that the result contains all point pairs where the distance does not exceed a parameter ε. Due to its high practical relevance, many similarity join algorithms have been devised. In this paper, we propose an analytical cost model for the similarity join operation based on indexes. Our problem analysis reveals a serious optimization conflict between CPU time and I/O time: fine-grained index structures are beneficial for the CPU efficiency, but deteriorate the I/O performance. As a consequence of this observation, we propose a new index architecture and join algorithm which allow a separate optimization of CPU time and I/O time. Our solution utilizes large pages which are optimized for I/O processing. The pages accommodate a search structure which minimizes the computational effort. In the experimental evaluation, a substantial improvement over competitive techniques is shown.
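The join definition in this abstract can be illustrated with a minimal sketch. It is only the naive nested-loop baseline, not the index architecture the paper proposes; the function name `similarity_join` and the choice of Euclidean distance are assumptions made for illustration, since the abstract does not fix a distance function.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def similarity_join(r: List[Point], s: List[Point], eps: float) -> List[Tuple[int, int]]:
    """Return all index pairs (i, j) with distance(r[i], s[j]) <= eps.

    Assumption: Euclidean distance. Every pair is compared, so the cost is
    len(r) * len(s) distance computations.
    """
    result = []
    for i, p in enumerate(r):
        for j, q in enumerate(s):
            if math.dist(p, q) <= eps:
                result.append((i, j))
    return result


r = [(0.0, 0.0), (1.0, 1.0)]
s = [(0.0, 0.5), (3.0, 3.0)]
pairs = similarity_join(r, s, eps=1.0)  # -> [(0, 0)]
```

An index-based join aims to avoid most of these pairwise comparisons, which is where the CPU versus I/O trade-off described in the abstract arises.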
Analysis of Predictive Spatio-Temporal Queries
 TODS
, 2003
Cited by 29 (5 self)
In this paper we present probabilistic cost models that estimate the selectivity of spatio-temporal window queries and joins, and the expected distance between a query and its nearest neighbor(s). Our models capture any query/object mobility combination (moving queries, moving objects or both) and any data type (points and rectangles) in arbitrary dimensionality. In addition, we develop specialized spatio-temporal histograms, which take into account both location and velocity information, and can be incrementally maintained. Extensive performance evaluation verifies that the proposed techniques produce highly accurate estimation on both uniform and non-uniform data.
Techniques for Similarity Searching in Multimedia Databases
, 2010
Cited by 27 (3 self)
Techniques for similarity searching in multimedia databases are reviewed. This includes a discussion of the curse of dimensionality, as well as multidimensional indexing, distance-based indexing, and the actual search process which is realized by nearest neighbor finding.
Region proximity in metric spaces and its use for approximate similarity search
 ACM Trans. Inf. Syst
, 2003
Cited by 18 (3 self)
Similarity search structures for metric data typically bound object partitions by ball regions. Since regions can overlap, a relevant issue is to estimate the proximity of regions in order to predict the number of objects in the regions' intersection. This paper analyzes the problem using a probabilistic approach and provides a solution that effectively computes the proximity through realistic heuristics that only require small amounts of auxiliary data. An extensive simulation to validate the technique is provided. An application is developed to demonstrate how the proximity measure can be successfully applied to the approximate similarity search. Search speedup is achieved by ignoring data regions whose proximity to the query region is smaller than a user-defined threshold. This idea is implemented in a metric tree environment for the similarity range and “nearest neighbors” queries. Several measures of efficiency and effectiveness are applied to evaluate proposed approximate search algorithms on real-life data sets. An analytical model is developed to relate proximity parameters and the quality of search. Improvements of two orders of magnitude are achieved for moderately approximated search results. We demonstrate that the precision of proximity measures can significantly influence the quality of approximated algorithms.
Peer-to-peer similarity search in metric spaces
 IN PROCEEDINGS OF VLDB’07
, 2007
Cited by 15 (10 self)
This paper addresses the efficient processing of similarity queries in metric spaces, where data is horizontally distributed across a P2P network. The proposed approach does not rely on arbitrary data movement, hence each peer joining the network autonomously stores its own data. We present SIMPEER, a novel framework that dynamically clusters peer data, in order to build distributed routing information at super-peer level. SIMPEER allows the evaluation of range and nearest neighbor queries in a distributed manner that reduces communication cost, network latency, bandwidth consumption and computational overhead at each individual peer. SIMPEER utilizes a set of distributed statistics and guarantees that all similar objects to the query are retrieved, without necessarily flooding the network during query processing. The statistics are employed for estimating an adequate query radius for k-nearest neighbor queries, and transform the query to a range query. Our experimental evaluation employs both real-world and synthetic data collections, and our results show that SIMPEER performs efficiently, even in the case of high degree of distribution.
String matching with metric trees using an approximate distance
 In SPIRE, LNCS 2476
, 2002
Cited by 11 (0 self)
Searching in a large data set those strings that are more similar, according to the edit distance, to a given one is a time-consuming process. In this paper we investigate the performance of metric trees, namely the M-tree, when they are extended using a cheap approximate distance function as a filter to quickly discard irrelevant strings. Using the bag distance as an approximation of the edit distance, we show an improvement in performance up to 90% with respect to the basic case. This, along with the fact that our solution is independent of both the distance used in the pre-test and of the underlying metric index, demonstrates that metric indices are a powerful solution, not only for many modern application areas, such as multimedia, data mining and pattern recognition, but also for the string matching problem.
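The filtering idea in this abstract can be sketched as follows. The sketch assumes the usual definition of the bag distance (the larger of the two multiset differences of the strings' characters), which never exceeds the edit distance and so can safely discard candidates. It uses a plain linear scan rather than the M-tree studied in the paper, and the names `bag_distance`, `edit_distance` and `range_search` are hypothetical.

```python
from collections import Counter
from typing import List


def bag_distance(x: str, y: str) -> int:
    """Cheap lower bound on the edit distance, based on character multisets."""
    cx, cy = Counter(x), Counter(y)
    # Counter subtraction keeps only positive counts (multiset difference).
    return max(sum((cx - cy).values()), sum((cy - cx).values()))


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]


def range_search(query: str, strings: List[str], radius: int) -> List[str]:
    """Return strings within `radius` edits of `query`, filtering first."""
    hits = []
    for s in strings:
        if bag_distance(query, s) > radius:
            continue  # lower bound already too large: skip the costly check
        if edit_distance(query, s) <= radius:
            hits.append(s)
    return hits


hits = range_search("tree", ["tree", "free", "trie", "metric"], radius=1)
# -> ["tree", "free", "trie"]; "metric" is discarded by the filter alone
```

The filter costs time linear in the string lengths, while the edit distance is quadratic, which is why discarding candidates early can pay off.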
Nearest Neighbor Search in Multidimensional Spaces
, 1999
Cited by 11 (0 self)
The Nearest Neighbor Search problem is defined as follows: given a set P of n points, preprocess the points so as to efficiently answer queries that require finding the closest point in P to a query point q. If we are willing to settle for a point that is almost as close as the nearest neighbor, then we can relax the problem to the approximate Nearest Neighbor Search. Nearest Neighbor Search (exact or approximate) is an integral component in a wide range of applications that include multimedia databases, computational biology, data mining, and information retrieval. The common thread in all these applications is similarity search: given a database of objects, we want to return the object in the database that is most similar to a query object. The objects are mapped onto points in a high-dimensional metric space, and similarity search reduces to a nearest neighbor search. The dimension of the underlying space may be in the order of a few hundreds, or thousands; therefore, we r...
A Query-sensitive Cost Model for Similarity Queries with M-tree
 IN PROC. OF THE 10TH ADC
, 1999
Cited by 10 (2 self)
We introduce a cost model for the M-tree access method [Ciaccia et al., 1997] which provides estimates of CPU (distance computations) and I/O costs for the execution of similarity queries as a function of each single query. This model is said to be query-sensitive, since it takes into account, by relying on the novel notion of "witness", the "position" of the query point inside the metric space indexed by the M-tree. We describe the basic concepts underlying the model along with different methods which can be used for its implementation; finally, we experimentally validate the model over both real and synthetic datasets.
PM-tree: Pivoting Metric Tree for Similarity Search in Multimedia Databases
Cited by 9 (0 self)
In this paper we introduce the Pivoting M-tree (PM-tree), a metric access method combining M-tree with the pivot-based approach. While in M-tree a metric region is represented by a hypersphere, in PM-tree the shape of a metric region is determined by intersection of the hypersphere and a set of hyperrings. The set of hyperrings for each metric region is related to a fixed set of pivot objects. As a consequence, the shape of a metric region bounds the indexed objects more tightly which, in turn, significantly improves the overall efficiency of similarity search. We present basic algorithms on PM-tree and two cost models for range query processing. Finally, the PM-tree efficiency is experimentally evaluated on large synthetic as well as real-world datasets.
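To make the "hypersphere intersected with hyperrings" shape more concrete, here is a rough sketch of how a range query might test such a region. It is not the paper's algorithm: the `Region` class, the function `may_contain_results` and the Euclidean default are assumptions for illustration. The only property relied on is the triangle inequality, which holds in any metric space.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


@dataclass
class Region:
    center: Point                     # hypersphere center
    radius: float                     # hypersphere radius
    rings: List[Tuple[float, float]]  # per pivot: (min, max) distance of indexed objects


def may_contain_results(region: Region, pivots: List[Point], query: Point,
                        query_radius: float,
                        dist: Callable[[Point, Point], float] = math.dist) -> bool:
    """Return False only if the region provably holds no object within
    query_radius of query."""
    # Hypersphere test (the only test available for a sphere-only region).
    if dist(query, region.center) > query_radius + region.radius:
        return False
    # Hyperring tests: the query ball must overlap every ring.
    for pivot, (lo, hi) in zip(pivots, region.rings):
        d = dist(query, pivot)
        if d - query_radius > hi or d + query_radius < lo:
            return False
    return True


pivots = [(0.0, 0.0)]
region = Region(center=(5.0, 0.0), radius=2.0, rings=[(4.5, 5.5)])
sphere_only = Region(center=(5.0, 0.0), radius=2.0, rings=[])

# The query ball overlaps the hypersphere but misses the ring around the pivot.
pruned = not may_contain_results(region, pivots, (7.0, 0.0), 1.0)        # -> True
kept = may_contain_results(sphere_only, pivots, (7.0, 0.0), 1.0)         # -> True
```

In this example the ring rules out a region that the hypersphere test alone would have to visit, which is the kind of tighter bounding the abstract refers to.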
Towards Measuring the Searching Complexity of Metric Spaces
 In Proc. of the Mexican Computing Meeting
, 2001
Cited by 7 (1 self)
In this paper we introduce a new measure of the intrinsic searching complexity of a general metric space. This measure reflects the expected behavior of the search algorithms on the metric space, yet it is easy to estimate and independent of the search algorithm. We prove average-case lower bounds, in terms of this complexity measure, for a large class of proximity search algorithms. This gives some new insight on the intrinsic difficulty of the search problem in metric spaces.