Domain-independent data cleaning via analysis of entity-relationship graph
ACM Transactions on Database Systems (TODS), 2006
Cited by 64 (23 self)
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RELDC) and the traditional techniques is that RELDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real data sets and over synthetic data sets show that analysis of relationships significantly improves the quality of the result.
Main Memory Evaluation of Monitoring Queries over Moving Objects
Distributed and Parallel Databases, 2004
Cited by 59 (6 self)
In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects or on the fraction of objects that move at any moment in time. We present a detailed analysis of a grid approach which shows the best results for both skewed and uniform data. A sorting-based optimization is developed for significantly improving the cache hit rate. Experimental evaluation establishes that indexing queries using the grid index yields orders of magnitude better performance than other index structures such as R*-trees.
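The query-indexing idea in the abstract above can be illustrated with a minimal sketch. It assumes 2D points and axis-aligned rectangular range queries. The class name `QueryGrid`, its methods, and the `cell_size` parameter are hypothetical names chosen for illustration and are not taken from the paper. The grid stores the (static) queries rather than the (moving) objects, so an object update is a single cell lookup and no index maintenance is needed.

```python
from collections import defaultdict


class QueryGrid:
    """Uniform grid that indexes range queries instead of moving objects."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> list of query records

    def _cell(self, x, y):
        # Map a coordinate to the integer index of its grid cell.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def add_query(self, qid, xmin, ymin, xmax, ymax):
        # Register the query in every cell that its rectangle overlaps.
        cx0, cy0 = self._cell(xmin, ymin)
        cx1, cy1 = self._cell(xmax, ymax)
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                self.cells[(cx, cy)].append((qid, xmin, ymin, xmax, ymax))

    def matching_queries(self, x, y):
        # Only the queries registered in the object's own cell are tested.
        candidates = self.cells.get(self._cell(x, y), [])
        return [qid for qid, x0, y0, x1, y1 in candidates
                if x0 <= x <= x1 and y0 <= y <= y1]


grid = QueryGrid(cell_size=10.0)
grid.add_query("q1", 0, 0, 15, 5)
grid.add_query("q2", 12, 2, 30, 30)
hits = grid.matching_queries(13, 3)  # object currently at (13, 3)
```

Each time an object reports a new position, `matching_queries` is called again with the new coordinates; the grid itself changes only when queries are added or removed.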
Efficient evaluation of continuous range queries on moving objects
In DEXA 2002, Proc. of the 13th International Conference and Workshop on Database and Expert Systems Applications, Aix-en-Provence, 2002
Cited by 38 (11 self)
In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects or on the fraction of objects that move at any moment in time. We present a detailed analysis of a grid approach which shows the best results for both skewed and uniform data. A sorting-based optimization is developed for significantly improving the cache hit rate. Experimental evaluation establishes that indexing queries using the grid index yields orders of magnitude better performance than other index structures such as R*-trees.
Metric Space Similarity Joins
Cited by 24 (2 self)
Similarity join algorithms find pairs of objects that lie within a certain distance ε of each other. Algorithms that are adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of a multidimensional index. For these algorithms, when the data lies in a metric space, the usual solution is to embed the data in vector space and then make use of a multidimensional index. Such an approach has a number of drawbacks when the data is high-dimensional as we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome via the introduction of a new method called Quickjoin that does not require a multidimensional index and instead adapts techniques used in distance-based indexing for use in a method that is conceptually similar to the Quicksort algorithm. A formal analysis is provided of the Quickjoin method. Experiments show that the Quickjoin method significantly outperforms two existing techniques.
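The Quicksort-like idea described above can be sketched as a recursive ball partitioning with "window" sets near each partition boundary. This is a simplified illustration written from the abstract's description, not the authors' implementation. The function name `quickjoin_sketch`, the `small` cutoff, the `seed` parameter, and the fallback to a nested loop when a split makes no progress are all assumptions of this sketch. It relies only on a distance function that obeys the triangle inequality.

```python
import math
import random
from itertools import combinations


def quickjoin_sketch(points, eps, dist=math.dist, small=8, seed=0):
    """Return sorted index pairs (i, j), i < j, with dist(points[i], points[j]) <= eps."""
    rng = random.Random(seed)
    result = set()

    def add_if_close(i, j):
        if dist(points[i], points[j]) <= eps:
            result.add((min(i, j), max(i, j)))

    def brute_self(ids):
        for i, j in combinations(ids, 2):
            add_if_close(i, j)

    def brute_cross(a, b):
        for i in a:
            for j in b:
                add_if_close(i, j)

    def split(ids, pivot, r):
        # Ball partitioning around the pivot, plus the eps-wide windows
        # on either side of the boundary at radius r.
        low, high, win_low, win_high = [], [], [], []
        for i in ids:
            d = dist(points[pivot], points[i])
            if d < r:
                low.append(i)
                if d >= r - eps:
                    win_low.append(i)
            else:
                high.append(i)
                if d <= r + eps:
                    win_high.append(i)
        return low, high, win_low, win_high

    def join_self(ids):
        if len(ids) <= small:
            brute_self(ids)
            return
        p1, p2 = rng.sample(ids, 2)
        low, high, win_low, win_high = split(ids, p1, dist(points[p1], points[p2]))
        if not low or not high:  # e.g. duplicate points: the split made no progress
            brute_self(ids)
            return
        join_self(low)
        join_self(high)
        join_cross(win_low, win_high)  # pairs that straddle the boundary

    def join_cross(a, b):
        if not a or not b:
            return
        if len(a) + len(b) <= small:
            brute_cross(a, b)
            return
        p1, p2 = rng.sample(a + b, 2)
        r = dist(points[p1], points[p2])
        a_low, a_high, a_wl, a_wh = split(a, p1, r)
        b_low, b_high, b_wl, b_wh = split(b, p1, r)
        for x, y in ((a_low, b_low), (a_high, b_high), (a_wl, b_wh), (a_wh, b_wl)):
            if len(x) + len(y) < len(a) + len(b):
                join_cross(x, y)
            else:  # subproblem did not shrink: fall back to the nested loop
                brute_cross(x, y)

    join_self(list(range(len(points))))
    return sorted(result)
```

By the triangle inequality, a pair within `eps` that is split across the boundary must have one member in each window, which is why only the two windows need to be joined against each other.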
An Approximate Algorithm for Top-k Closest Pairs Join Query in Large High Dimensional Data
Proc. Int’l Database Eng. and Applications Symp. (DKE), 2005
Cited by 9 (0 self)
In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity $O(d^{2}nk)$ and space complexity $O(nd)$ and guarantees a solution within a $O(d^{1+\frac{1}{t}})$ factor of the exact one, where $t \in \{1, 2, \ldots, \infty\}$ denotes the Minkowski metrics $L_t$ of interest and $d$ the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most $d + 1$ sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low dimensional spaces; it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error to the exact one.
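The general idea of ordering points along a space filling curve and comparing only curve neighbours can be sketched as follows. The abstract does not say which curve is used, so this sketch assumes a Z-order (Morton) curve over non-negative integer coordinates for simplicity. The function names and the `window` and `bits` parameters are hypothetical, and the sketch carries none of the approximation guarantees stated in the abstract.

```python
import bisect
import math


def morton_key(coords, bits=16):
    """Interleave the bits of non-negative integer coordinates (Z-order)."""
    key = 0
    d = len(coords)
    for bit in range(bits):
        for axis, c in enumerate(coords):
            key |= ((c >> bit) & 1) << (bit * d + axis)
    return key


def approx_closest_pair(a, b, window=2, bits=16):
    """Approximate the closest pair (p in a, q in b) via one curve-ordered scan."""
    b_sorted = sorted(b, key=lambda p: morton_key(p, bits))
    b_keys = [morton_key(p, bits) for p in b_sorted]
    best = (math.inf, None, None)
    for p in a:
        # Locate p in the curve order of b and inspect only nearby entries.
        pos = bisect.bisect_left(b_keys, morton_key(p, bits))
        for q in b_sorted[max(0, pos - window): pos + window]:
            d = math.dist(p, q)
            if d < best[0]:
                best = (d, p, q)
    return best


best = approx_closest_pair([(0, 0), (10, 10)], [(1, 1), (50, 50)])
```

A single curve order can place spatially close points far apart, which is one reason a method of this kind would use several sorts and scans rather than one.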
Fast similarity join for multidimensional data
2005
Cited by 6 (3 self)
To appear in Information Systems Journal, Elsevier, 2005. The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state-of-the-art join algorithm, the EGO-join. Key words: similarity join, grid-based joins
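For readers unfamiliar with grid-based joins, the following is a generic in-memory sketch of an ε-join that hashes one data set into cells of side ε and probes neighbouring cells for each point of the other. It is not the Grid-join or EGO*-join of the paper. The function name `grid_epsilon_join` is hypothetical and Euclidean distance is assumed.

```python
import math
from collections import defaultdict
from itertools import product


def grid_epsilon_join(a, b, eps):
    """Return index pairs (i, j) with Euclidean dist(a[i], b[j]) <= eps."""

    def cell(p):
        # Cells have side eps, so matches lie in the same or an adjacent cell.
        return tuple(math.floor(c / eps) for c in p)

    grid = defaultdict(list)
    for j, q in enumerate(b):
        grid[cell(q)].append(j)

    pairs = []
    for i, p in enumerate(a):
        base = cell(p)
        # Probe the 3^d cells surrounding (and including) p's own cell.
        for offset in product((-1, 0, 1), repeat=len(p)):
            key = tuple(c + o for c, o in zip(base, offset))
            for j in grid.get(key, ()):
                if math.dist(p, b[j]) <= eps:
                    pairs.append((i, j))
    return pairs
```

The number of probed cells grows as 3^d, so a plain grid of this kind is mainly attractive at low dimensionality, which is consistent with the abstract's point that the best join algorithm depends on the dimensionality of the data.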
Solving Similarity Joins and Range Queries in Metric Spaces with the List of Twin Clusters
2008
Cited by 2 (0 self)
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We solve two variants of the similarity join problem: (1) range joins: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r; and (2) k-closest pair joins: Find the k closest object pairs (one from each set). For this sake, we devise a new metric index, coined List of Twin Clusters (LTC), which indexes both sets jointly, instead of the natural approach of indexing one or both sets independently. Finally, we show how to use the LTC in order to solve classical range queries. Our results show significant speedups over the basic quadratic-time naive alternative for both join variants, and that the LTC is competitive with the original list of clusters when solving range queries. Furthermore, we show that our technique has a great potential for improvements.
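The two join variants defined above, and the quadratic-time naive baseline they are compared against, can be written down directly. This sketch shows only that baseline, not the LTC index. The function names are hypothetical, and Hamming distance over equal-length strings is used purely as an example of a metric that is not a vector-space distance.

```python
import heapq
from itertools import product


def hamming(s, t):
    # A metric on equal-length strings: number of differing positions.
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def range_join(a, b, r, dist):
    """All pairs (x in a, y in b) at distance at most r (quadratic time)."""
    return [(x, y) for x, y in product(a, b) if dist(x, y) <= r]


def k_closest_pairs(a, b, k, dist):
    """The k pairs (x in a, y in b) with the smallest distances (quadratic time)."""
    return heapq.nsmallest(k, product(a, b), key=lambda pair: dist(*pair))


a = ["karolin", "kathrin"]
b = ["kerstin", "karolin"]
within_3 = range_join(a, b, 3, hamming)
closest = k_closest_pairs(a, b, 1, hamming)
```

Both functions evaluate the distance for every pair, so their cost is proportional to the product of the two set sizes; an index over the sets aims to avoid most of those evaluations.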
List of twin clusters: a data structure for similarity joins in metric spaces
In Proc. 1st Intl. Workshop on Similarity Search and Applications (SISAP’08), 2008
Cited by 1 (1 self)
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We consider a particular type of similarity join: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r. For this sake, we devise a new metric index, coined List of Twin Clusters, which indexes both sets jointly (instead of the natural approach of indexing one or both sets independently). Our results show significant speedups over the basic quadratic-time naive alternative. Furthermore, we show that our technique can be easily extended to other similarity join variants, e.g., finding the k-closest pairs.
Super-EGO: fast multi-dimensional similarity join
The VLDB Journal, DOI 10.1007/s00778-012-0305-7
Efficient processing of high-dimensional similarity joins plays an important role for a wide variety of data-driven applications. In this paper, we consider the ε-join variant of the problem. Given two d-dimensional datasets and parameter ε, the task is to find all pairs of points, one from each dataset, that are within ε distance from each other. We propose a new ε-join algorithm, called Super-EGO, which belongs to the EGO family of join algorithms. The new algorithm gains its advantage by using a novel data-driven dimensionality reordering technique, developing a new EGO-strategy that more aggressively avoids unnecessary computation, as well as by developing a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates a significant advantage of the proposed solution over the existing state-of-the-art techniques.