Domain-independent data cleaning via analysis of entity-relationship graph
- ACM Transactions on Database Systems (TODS), 2006
Cited by 64 (23 self)
In this article, we address the problem of reference disambiguation. Specifically, we consider a situation where entities in the database are referred to using descriptions (e.g., a set of instantiated attributes). The objective of reference disambiguation is to identify the unique entity to which each description corresponds. The key difference between the approach we propose (called RELDC) and the traditional techniques is that RELDC analyzes not only object features but also inter-object relationships to improve the disambiguation quality. Our extensive experiments over two real data sets and over synthetic datasets show that analysis of relationships significantly improves quality of the result.
Main Memory Evaluation of Monitoring Queries over Moving Objects
- Distributed and Parallel Databases, 2004
Cited by 59 (6 self)
In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects or fraction of objects that move at any moment in time. We present a detailed analysis of a grid approach which shows the best results for both skewed and uniform data. A sorting based optimization is developed for significantly improving the cache hit-rate. Experimental evaluation establishes that indexing queries using the grid index yields orders of magnitude better performance than other index structures such as R*-trees.
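The idea of query indexing mentioned in this abstract can be illustrated with a minimal sketch. It assumes 2-D points in the unit square, axis-aligned rectangular queries, and a uniform grid; the class name `QueryGrid` and its methods are hypothetical and are not taken from the paper, which also describes optimizations (such as the sorting step) that are not shown here.

```python
from collections import defaultdict


class QueryGrid:
    """Uniform grid over the unit square that indexes range queries, not objects.

    Hypothetical illustration: names and layout are assumptions, not the paper's code.
    """

    def __init__(self, cells_per_side):
        self.n = cells_per_side
        self.cells = defaultdict(list)  # (col, row) -> ids of queries overlapping the cell
        self.queries = {}               # query id -> (xmin, ymin, xmax, ymax)

    def _cell(self, v):
        # Clamp so that a coordinate of exactly 1.0 falls into the last cell.
        return min(int(v * self.n), self.n - 1)

    def add_query(self, qid, xmin, ymin, xmax, ymax):
        """Register a rectangular query in every grid cell it overlaps."""
        self.queries[qid] = (xmin, ymin, xmax, ymax)
        for cx in range(self._cell(xmin), self._cell(xmax) + 1):
            for cy in range(self._cell(ymin), self._cell(ymax) + 1):
                self.cells[(cx, cy)].append(qid)

    def matching_queries(self, x, y):
        """Return ids of all queries whose rectangle contains the point (x, y)."""
        hits = []
        for qid in self.cells.get((self._cell(x), self._cell(y)), []):
            xmin, ymin, xmax, ymax = self.queries[qid]
            if xmin <= x <= xmax and ymin <= y <= ymax:
                hits.append(qid)
        return hits


grid = QueryGrid(10)
grid.add_query("q1", 0.15, 0.15, 0.35, 0.35)
grid.add_query("q2", 0.25, 0.25, 0.85, 0.85)
print(grid.matching_queries(0.3, 0.3))
```

Because the queries are indexed rather than the objects, a moving object never triggers an index update in this sketch; each new position is simply probed against the one cell it falls into.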
Efficient evaluation of continuous range queries on moving objects
- In DEXA 2002, Proc. of the 13th International Conference and Workshop on Database and Expert Systems Applications, Aix en Provence, 2002
Cited by 38 (11 self)
In this paper we evaluate several in-memory algorithms for efficient and scalable processing of continuous range queries over collections of moving objects. Constant updates to the index are avoided by query indexing. No constraints are imposed on the speed or path of moving objects or fraction of objects that move at any moment in time. We present a detailed analysis of a grid approach which shows the best results for both skewed and uniform data. A sorting based optimization is developed for significantly improving the cache hit-rate. Experimental evaluation establishes that indexing queries using the grid index yields orders of magnitude better performance than other index structures such as R*-trees.
Metric Space Similarity Joins
Cited by 24 (2 self)
Similarity join algorithms find pairs of objects that lie within a certain distance ɛ of each other. Algorithms that are adapted from spatial join techniques are designed primarily for data in a vector space and often employ some form of a multi-dimensional index. For these algorithms, when the data lies in a metric space, the usual solution is to embed the data in vector space and then make use of a multidimensional index. Such an approach has a number of drawbacks when the data is high dimensional as we must eventually find the most discriminating dimensions, which is not trivial. In addition, although the maximum distance between objects increases with dimension, the ability to discriminate between objects in each dimension does not. These drawbacks are overcome via the introduction of a new method called Quickjoin that does not require a multi-dimensional index and instead adapts techniques used in distance-based indexing for use in a method that is conceptually similar to the Quicksort algorithm. A formal analysis is provided of the Quickjoin method. Experiments show that the Quickjoin method significantly outperforms two existing techniques.
An Approximate Algorithm for Top-k Closest Pairs Join Query in Large High Dimensional Data
- Proc. Int’l Database Eng. and Applications Symp. (DKE), 2005
Cited by 9 (0 self)
In this paper we present a novel approximate algorithm to calculate the top-k closest pairs join query of two large and high dimensional data sets. The algorithm has worst case time complexity $O(d^2 nk)$ and space complexity $O(nd)$ and guarantees a solution within a $O(d^{1+\frac{1}{t}})$ factor of the exact one, where $t \in \{1, 2, \ldots, \infty\}$ denotes the Minkowski metrics $L_t$ of interest and $d$ the dimensionality. It makes use of the concept of space filling curve to establish an order between the points of the space and performs at most $d + 1$ sorts and scans of the two data sets. During a scan, each point from one data set is compared with its closest points, according to the space filling curve order, in the other data set and points whose contribution to the solution has already been analyzed are detected and eliminated. Experimental results on real and synthetic data sets show that our algorithm behaves as an exact algorithm in low dimensional spaces; it is able to prune the entire (or a considerable fraction of the) data set even for high dimensions if certain separation conditions are satisfied; in any case it returns a solution within a small error to the exact one.
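To make the stated approximation bound concrete, the factor $O(d^{1+1/t})$ can be instantiated for the three most commonly used Minkowski metrics. This is only a worked reading of the formula in the abstract, not a result taken from the paper's analysis:

```latex
\[
t = 1 \ (L_1):\ O\!\left(d^{2}\right), \qquad
t = 2 \ (L_2):\ O\!\left(d^{3/2}\right), \qquad
t = \infty \ (L_\infty):\ O\!\left(d\right).
\]
```

The exponent $1 + 1/t$ decreases as $t$ grows, so under this bound the guarantee is tightest for $L_\infty$ and loosest for $L_1$.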
Fast similarity join for multi-dimensional data
, 2005
Cited by 6 (3 self)
To appear in Information Systems Journal, Elsevier, 2005. The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state-of-the-art join algorithm, the EGO-join. Key words: similarity join, grid-based joins
Solving Similarity Joins and Range Queries in Metric Spaces with the List of Twin Clusters
, 2008
Cited by 2 (0 self)
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We solve two variants of the similarity join problem: (1) range joins: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r; and (2) k-closest pair joins: Find the k closest object pairs (one from each set). For this sake, we devise a new metric index, coined List of Twin Clusters (LTC), which indexes both sets jointly, instead of the natural approach of indexing one or both sets independently. Finally, we show how to use the LTC in order to solve classical range queries. Our results show significant speedups over the basic quadratic-time naive alternative for both join variants, and that the LTC is competitive with the original list of clusters when solving range queries. Furthermore, we show that our technique has a great potential for improvements.
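The two join variants defined in this abstract, and the quadratic-time naive baseline they are compared against, can be written down directly. The sketch below is that naive baseline only, not the List of Twin Clusters; the function names are hypothetical, and Euclidean distance (`math.dist`) is assumed as the default metric, although any metric function can be passed in.

```python
import heapq
import math


def range_join(a, b, r, dist=math.dist):
    """Naive range join: all pairs (x, y), x from a and y from b, with dist(x, y) <= r."""
    return [(x, y) for x in a for y in b if dist(x, y) <= r]


def k_closest_pairs(a, b, k, dist=math.dist):
    """Naive k-closest pair join: the k pairs (x, y) with the smallest dist(x, y)."""
    all_pairs = ((x, y) for x in a for y in b)
    return heapq.nsmallest(k, all_pairs, key=lambda p: dist(*p))


a = [(0, 0), (5, 5)]
b = [(0, 1), (5, 7), (10, 10)]
print(range_join(a, b, 1.5))
print(k_closest_pairs(a, b, 2))
```

Both functions evaluate the distance for every pair, which is the quadratic cost that an index built jointly over the two sets aims to avoid.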
List of twin clusters: a data structure for similarity joins in metric spaces
- In Proc. 1st Intl. Workshop on Similarity Search and Applications (SISAP’08), 2008
Cited by 1 (1 self)
The metric space model abstracts many proximity or similarity problems, where the most frequently considered primitives are range and k-nearest neighbor search, leaving out the similarity join, an extremely important primitive. In fact, despite the great attention that this primitive has received in traditional and even multidimensional databases, little has been done for general metric databases. We consider a particular type of similarity join: Given two sets of objects and a distance threshold r, find all the object pairs (one from each set) at distance at most r. For this sake, we devise a new metric index, coined List of Twin Clusters, which indexes both sets jointly (instead of the natural approach of indexing one or both sets independently). Our results show significant speedups over the basic quadratic-time naive alternative. Furthermore, we show that our technique can be easily extended to other similarity join variants, e.g., finding the k-closest pairs.
Super-EGO: fast multi-dimensional similarity join
- The VLDB Journal, DOI 10.1007/s00778-012-0305-7
Efficient processing of high-dimensional similarity joins plays an important role for a wide variety of data-driven applications. In this paper, we consider the ε-join variant of the problem. Given two d-dimensional datasets and a parameter ε, the task is to find all pairs of points, one from each dataset, that are within ε distance from each other. We propose a new ε-join algorithm, called Super-EGO, which belongs to the EGO family of join algorithms. The new algorithm gains its advantage by using a novel data-driven dimensionality re-ordering technique, developing a new EGO-strategy that more aggressively avoids unnecessary computation, as well as by developing a parallel version of the algorithm. We study the newly proposed Super-EGO algorithm on large real and synthetic datasets. The empirical study demonstrates a significant advantage of the proposed solution over the existing state-of-the-art techniques.