Pass-join: A partition-based method for similarity joins
2011
Cited by 27 (13 self)
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this paper, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a partition-based method called Pass-Join. Pass-Join partitions a string into a set of segments and creates inverted indices for the segments. Then for each string, Pass-Join selects some of its substrings and uses the selected substrings to find candidate pairs using the inverted indices. We devise efficient techniques to select the substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidate pairs. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real datasets.
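The partition idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It relies on the pigeonhole argument that if a string is cut into tau + 1 segments and another string lies within edit distance tau, at least one segment must survive unchanged as a substring of the other string, shifted by at most tau positions. The even partition, the naive position window, and every function name below are assumptions made for illustration; the substring-selection and verification optimizations the abstract mentions are not reproduced.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def layout(length, tau):
    """(start, size) of each of the tau + 1 segments of a string of this length."""
    base, extra = divmod(length, tau + 1)
    spans, pos = [], 0
    for i in range(tau + 1):
        # The last `extra` segments are one character longer (hypothetical choice).
        size = base + (1 if i >= tau + 1 - extra else 0)
        spans.append((pos, size))
        pos += size
    return spans


def segment_join(strings, tau):
    """Self-join: all id pairs (i, j), i < j, with edit distance <= tau."""
    # Inverted index: (string length, segment number, segment text) -> string ids.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for i, (start, size) in enumerate(layout(len(s), tau)):
            index[(len(s), i, s[start:start + size])].add(sid)

    pairs = set()
    for sid, s in enumerate(strings):
        candidates = set()
        # Only strings whose length differs by at most tau can match.
        for length in range(max(0, len(s) - tau), len(s) + tau + 1):
            for i, (start, size) in enumerate(layout(length, tau)):
                lo = max(0, start - tau)
                hi = min(len(s) - size, start + tau)
                for p in range(lo, hi + 1):
                    candidates |= index.get((length, i, s[p:p + size]), set())
        # Verify every candidate with the exact edit distance.
        for rid in candidates:
            if rid < sid and edit_distance(strings[rid], s) <= tau:
                pairs.add((rid, sid))
    return pairs
```

For example, `segment_join(["vldb", "pvldb", "sigmod", "sigmmod", "icde"], 1)` pairs "vldb" with "pvldb" and "sigmod" with "sigmmod" without comparing every pair of strings.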
Fast-join: An efficient method for fuzzy token matching based string similarity join
In ICDE, 2011
Cited by 21 (11 self)
String similarity join that finds similar string pairs between two string sets is an essential operation in many applications, and has attracted significant attention recently in the database community. A significant challenge in similarity join is to implement an effective fuzzy match operation to find all similar string pairs which may not match exactly. In this paper, we propose a new similarity metric, called “fuzzy token matching based similarity”, which extends token-based similarity functions (e.g., Jaccard similarity and Cosine similarity) by allowing fuzzy match between two tokens. We study the problem of similarity join using this new similarity metric and present a signature-based method to address this problem. We propose new signature schemes and develop effective pruning techniques to improve the performance. Experimental results show that our approach achieves high efficiency and result quality, and significantly outperforms state-of-the-art methods.
Can We Beat the Prefix Filtering? An Adaptive Framework for Similarity Join and Search
Cited by 20 (9 self)
As two important operations in data cleaning, similarity join and similarity search have attracted much attention recently. Existing methods to support similarity join usually adopt a prefix-filtering-based framework. They select a prefix of each object and prune object pairs whose prefixes have no overlap. We have an observation that prefix lengths have significant effect on the performance. Different prefix lengths lead to significantly different performance, and prefix filtering does not always achieve high performance. To address this problem, in this paper we propose an adaptive framework to support similarity join. We propose a cost model to judiciously select an appropriate prefix for each object. To efficiently select prefixes, we devise effective indexes. We extend our method to support similarity search. Experimental results show that our framework beats the prefix-filtering-based framework and achieves high efficiency.
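The baseline this abstract refers to, fixed-length prefix filtering, can be sketched as follows for a Jaccard threshold. This is a minimal illustration of the standard baseline only, not the adaptive framework the paper proposes. The function name, the rare-tokens-first ordering, and the small floating-point tolerance are assumptions. For a record with n distinct tokens and threshold t, the sketch uses the usual prefix length n - ceil(t * n) + 1: two records that reach the threshold must share at least one token within their prefixes.

```python
import math
from collections import Counter, defaultdict


def prefix_filter_join(records, t):
    """Self-join on token sets: id pairs (i, j), i < j, with Jaccard >= t (0 < t <= 1)."""
    # Global token order: rare tokens first, ties broken alphabetically.
    freq = Counter(tok for r in records for tok in set(r))
    ordered = [sorted(set(r), key=lambda tok: (freq[tok], tok)) for r in records]

    index = defaultdict(set)  # prefix token -> ids of records indexed so far
    results = set()
    for rid, toks in enumerate(ordered):
        if not toks:
            continue
        # Standard Jaccard prefix length; the tolerance only ever lengthens the prefix.
        plen = len(toks) - math.ceil(t * len(toks) - 1e-9) + 1
        prefix = toks[:plen]

        # Candidates are earlier records sharing at least one prefix token.
        candidates = set()
        for tok in prefix:
            candidates |= index[tok]

        # Verify each candidate with the exact Jaccard similarity.
        current = set(toks)
        for cid in candidates:
            other = set(ordered[cid])
            if len(other & current) >= t * len(other | current) - 1e-9:
                results.add((cid, rid))

        for tok in prefix:
            index[tok].add(rid)
    return results
```

A longer prefix never loses results but yields more candidates to verify, which is the trade-off the abstract points to when it says prefix length has a significant effect on performance.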
A Survey of Large-Scale Analytical Query Processing in MapReduce
The VLDB Journal, 2013
Cited by 10 (0 self)
Enterprises today acquire vast volumes of data from different sources and leverage this information by means of data analysis to support effective decision-making and provide new functionality and services. The key requirement of data analytics is scalability, simply due to the immense volume of data that need to be extracted, processed, and analyzed in a timely fashion. Arguably the most popular framework for contemporary large-scale data analytics is MapReduce, mainly due to its salient features that include scalability, fault-tolerance, ease of programming, and flexibility. However, despite its merits, MapReduce has evident performance limitations in miscellaneous analytical tasks, and this has given rise to a significant body of research that aims at improving its efficiency, while maintaining its desirable properties. This survey aims to review the state-of-the-art in improving the performance of parallel query processing using MapReduce. A set of the most significant weaknesses and limitations of MapReduce is discussed at a high level, along with solving techniques. A taxonomy is presented for categorizing existing research on MapReduce improvements according to the specific problem they target. Based on the proposed taxonomy, a clas-
Similarity-aware query processing and optimization
In Proceedings of the International Conference on Very Large Data Bases PhD Workshop, 2009
Cited by 5 (2 self)
Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, require or can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities, there has not been much study on the role, interaction, and implementation of similarity-aware operators as first-class database operators. The focus of the thesis work presented in this paper is the proposal and study of several similarity-aware database operators and a systematic analysis of their role as query operators, interactions, optimizations, and implementation techniques. This paper presents the core research questions that drive our research work and the physical database operators that were studied as part of this thesis work so far, i.e., Similarity Group-by and Similarity Join. We describe multiple optimization techniques for the introduced operators. Specifically, we present: (1) multiple non-trivial equivalence rules that enable similarity query transformations, (2) Eager and Lazy aggregation transformations for Similarity Group-by and Similarity Join to allow pre-aggregation before potentially expensive joins, and (3) techniques to use materialized views to answer similarity-based queries. This paper also presents the main guidelines to implement the presented operators as integral components of a DBMS query engine and some of the key performance evaluation results of this implementation in an open source DBMS. In addition, we present the way the proposed operators are efficiently exploited to answer more useful business questions in a decision support system.
Set Similarity Join on Probabilistic Data
Cited by 5 (0 self)
Set similarity join has played an important role in many real-world applications such as data cleaning, near duplication detection, data integration, and so on. In these applications, set data often contain noise and are thus uncertain and imprecise. In this paper, we model such probabilistic set data on two uncertainty levels, that is, set and element levels. Based on them, we investigate the problem of probabilistic set similarity join (PS²J) over two probabilistic set databases, under the possible worlds semantics. To efficiently process the PS²J operator, we first reduce our problem by condensing the possible worlds, and then propose effective pruning techniques, including Jaccard distance pruning, probability upper bound pruning, and aggregate pruning, which can filter out false alarms of probabilistic set pairs, with the help of indexes and our designed synopses. We demonstrate through extensive experiments the PS²J processing performance on both real and synthetic data.
Similarity queries: their conceptual evaluation, transformations, and processing
The VLDB Journal, 2013
Cited by 4 (3 self)
Many application scenarios can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities, there has not been much study on the role and implementation of similarity-aware operations as first-class database operators. Furthermore, very little work has addressed the problem of evaluating and optimizing queries that combine several similarity operations. The focus of this paper is the study of similarity queries that contain one or multiple first-class similarity database operators, e.g., Similarity Selection, Similarity Join, and Similarity Group-by. In particular, we analyze the implementation techniques of several similarity operators, introduce a consistent and comprehensive conceptual evaluation model for similarity queries, and present a rich set of transformation rules to extend cost-based query optimization to the case of similarity queries.
Scalable all-pairs similarity search in metric spaces
In Proceedings of KDD, 2013
Cited by 4 (0 self)
Given a set of entities, the all-pairs similarity search aims at identifying all pairs of entities that have similarity greater than (or distance smaller than) some user-defined threshold. In this article, we propose a parallel framework for solving this problem in metric spaces. Novel elements of our solution include: i) flexible support for multiple metrics of interest; ii) an autonomic approach to partition the input dataset with minimal redundancy to achieve good load-balance in the presence of limited computing resources; iii) an on-the-fly lossless compression strategy to reduce both the running time and the final output size. We validate the utility, scalability and the effectiveness of the approach on hundreds of machines using real and synthetic datasets.
Efficient record linkage using a double embedding scheme
In DMIN’09, Las Vegas, 2009
Cited by 3 (0 self)
Record linkage is the problem of identifying similar records across different data sources. The similarity between two records is defined based on domain-specific similarity functions over several attributes. In this paper, a novel approach is proposed that uses a two-level matching based on double embedding. First, records are embedded into a metric space of dimension K, then they are embedded into a smaller dimension K. The first matching phase operates on the K-vectors, performing a quick-and-dirty comparison, pruning a large number of true negatives while ensuring a high recall. Then a more accurate matching phase is performed on the matching pairs in the K-dimension. Experiments have been conducted on real data sets and results revealed a gain in time performance ranging from 30% to 60% while achieving the same level of recall and accuracy as in previous single embedding schemes.
Keywords: data cleaning; similarity matching; record linkage; embedding schemes
Exploiting database similarity joins for metric spaces.
Proc. VLDB Endow., 2012
Cited by 3 (1 self)
Similarity Joins are recognized among the most useful data processing and analysis operations. They retrieve all data pairs whose distances are smaller than a predefined threshold ε. While several standalone implementations have been proposed, very little work has addressed the implementation of Similarity Join as a physical database operator. In this paper, we focus on the study, design and implementation of a Similarity Join database operator for any dataset that lies in a metric space (DBSimJoin). We describe the changes in each query engine module to implement DBSimJoin and provide details of our implementation in PostgreSQL. The extensive performance evaluation shows that DBSimJoin significantly outperforms alternative approaches.
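The operation defined in this abstract, returning all pairs whose distance is smaller than ε, can be sketched for any metric as below. This is a standalone illustration, not the DBSimJoin operator or its PostgreSQL implementation. The function name, the choice of the first point as pivot, and the default Euclidean distance are assumptions. The only property used is the triangle inequality: if the distances of two points to a common pivot differ by at least ε, the two points are themselves at least ε apart, so the pair can be skipped without computing its distance.

```python
import math


def eps_join(points, eps, dist=math.dist):
    """All id pairs (i, j), i < j, with dist(points[i], points[j]) < eps."""
    if not points:
        return set()
    pivot = points[0]  # hypothetical pivot choice: any point works for correctness
    # Sort point ids by their distance to the pivot.
    keyed = sorted((dist(p, pivot), i) for i, p in enumerate(points))

    pairs = set()
    for a in range(len(keyed)):
        da, i = keyed[a]
        for b in range(a + 1, len(keyed)):
            db, j = keyed[b]
            # Triangle inequality: dist(i, j) >= db - da, so later points cannot match.
            if db - da >= eps:
                break
            if dist(points[i], points[j]) < eps:
                pairs.add((min(i, j), max(i, j)))
    return pairs
```

Passing a different `dist` function (for example an edit distance over strings) works unchanged as long as it satisfies the metric axioms.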