Results 1 - 7 of 7
Large-scale deduplication with constraints using dedupalog
- In: Proceedings of the 25th International Conference on Data Engineering (ICDE), 2009
"... Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication v ..."
Abstract
-
Cited by 49 (3 self)
- Add to MetaCart
(Show Context)
Abstract — We present a declarative framework for collective deduplication of entity references in the presence of constraints. Constraints occur naturally in many data cleaning domains and can improve the quality of deduplication. An example of a constraint is “each paper has a unique publication venue”; if two paper references are duplicates, then their associated conference references must be duplicates as well. Our framework supports collective deduplication, meaning that we can dedupe both paper references and conference references collectively in the example above. Our framework is based on a simple declarative Datalog-style language with precise semantics. Most previous work on deduplication either ignores constraints or uses them in an ad-hoc, domain-specific manner. We also present efficient algorithms to support the framework. Our algorithms have precise theoretical guarantees for a large subclass of our framework. We show, using a prototype implementation, that our algorithms scale to very large datasets. We provide thorough experimental results over real-world data demonstrating the utility of our framework for high-quality and scalable deduplication.
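As a minimal illustration of the hard-constraint example in this abstract (this is not Dedupalog's language or algorithms; the union-find propagation and the sample records are only a sketch), merging two paper references can be made to force a merge of their venue references:

```python
# Sketch of the constraint "each paper has a unique publication venue":
# declaring two paper references duplicates propagates a merge to their venues.

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

papers = UnionFind()
venues = UnionFind()

# Hypothetical data: each paper reference points to its venue reference.
venue_of = {"p1": "ICDE 2009", "p2": "Intl. Conf. on Data Engineering '09"}

def merge_papers(p, q):
    """Declare two paper references duplicates and propagate the constraint."""
    papers.union(p, q)
    venues.union(venue_of[p], venue_of[q])  # collective deduplication step

merge_papers("p1", "p2")
assert venues.find("ICDE 2009") == venues.find("Intl. Conf. on Data Engineering '09")
```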
RAVEN – Active Learning of Link Specifications
"... Abstract. With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. We present R ..."
Abstract
-
Cited by 16 (5 self)
- Add to MetaCart
(Show Context)
Abstract. With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. We present RAVEN, an approach for the semi-automatic determination of link specifications. Our approach is based on the combination of stable solutions of matching problems and active learning with the time-efficient link discovery framework LIMES. RAVEN aims at requiring a small number of interactions with the user to generate classifiers of high accuracy. We focus on using RAVEN to compute and configure Boolean and weighted classifiers, which we evaluate in three experiments against link specifications created manually. Our evaluation shows that we can compute linking configurations that achieve more than 90% F-score by asking the user to verify at most twelve potential links.
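As a loose illustration of the active-learning loop described above (this is not RAVEN's or LIMES's actual API; the similarity measures, the perceptron-style update, and the simulated user are all invented), a weighted link classifier can be refined by asking about the candidate links closest to its decision boundary, here a dozen questions in total:

```python
# Hypothetical sketch: query the most uncertain candidate links, let a
# (simulated) user verify them, and nudge the weights of a linear classifier.
import random

random.seed(0)

def score(weights, sims):
    # Weighted sum of attribute similarities minus a fixed acceptance threshold.
    return sum(w * s for w, s in zip(weights, sims)) - 1.0

def most_uncertain(cands, weights, k=3):
    # The k candidate links whose scores lie closest to the decision boundary.
    return sorted(cands, key=lambda c: abs(score(weights, c["sims"])))[:k]

def ask_user(pair):
    # Placeholder for manual verification; simulated with the stored label.
    return pair["is_link"]

# Invented candidate links with two similarity values each (e.g. label
# similarity and date similarity) and a simulated ground truth.
candidates = []
for _ in range(200):
    sims = [random.random(), random.random()]
    candidates.append({"sims": sims, "is_link": sims[0] + sims[1] > 1.2})

weights = [0.5, 0.5]
for _ in range(4):                                   # 4 rounds x 3 questions
    for pair in most_uncertain(candidates, weights):
        label = 1.0 if ask_user(pair) else 0.0
        pred = 1.0 if score(weights, pair["sims"]) > 0 else 0.0
        weights = [w + 0.2 * (label - pred) * s      # perceptron-style update
                   for w, s in zip(weights, pair["sims"])]

print("learned weights:", weights, "threshold: 1.0")
```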
GDR: a system for guided data repair
- In SIGMOD, pages 1223–1226, 2010
"... Improving data quality is a time-consuming, labor-intensive and often domain specific operation. Existing data repair approaches are either fully automated or not efficient in interactively involving the users. We present a demo of GDR, a Guided Data Repair system that uses a novel approach to effic ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
Improving data quality is a time-consuming, labor-intensive and often domain-specific operation. Existing data repair approaches are either fully automated or not efficient in interactively involving the users. We present a demo of GDR, a Guided Data Repair system that uses a novel approach to efficiently involve the user alongside automatic data repair techniques to reach better data quality as quickly as possible. Specifically, GDR generates data repairs and acquires feedback on those repairs that would be most beneficial in improving the data quality. GDR quantifies the data quality benefit of generated repairs by combining mechanisms from decision theory and active learning. Based on these benefit scores, groups of repairs are ranked and displayed to the user. User feedback is used to train a machine learning component to eventually replace the user in deciding on the validity of a suggested repair. We describe how the generated repairs are ranked and displayed to the user in a “useful-looking” way and demonstrate how data quality can be effectively improved with minimal feedback from the user.
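To make the benefit-ranking idea concrete (a minimal sketch only, not GDR's actual benefit model; the Repair fields, the scoring formula, and the example values are invented), candidate repairs can be ordered by an expected benefit that combines a learned confidence with the number of violations a repair would resolve:

```python
# Hypothetical sketch: rank candidate repairs by expected benefit and show
# the top group to the user first; feedback would retrain the confidence model.
from dataclasses import dataclass

@dataclass
class Repair:
    tuple_id: int
    attribute: str
    old_value: str
    new_value: str
    p_correct: float       # e.g. estimated by a classifier trained on feedback
    violations_fixed: int  # constraint violations this repair would resolve

def expected_benefit(r: Repair) -> float:
    return r.p_correct * r.violations_fixed

repairs = [
    Repair(1, "zip",   "4780",  "47907",          0.9, 3),
    Repair(2, "city",  "W Laf", "West Lafayette", 0.6, 1),
    Repair(3, "state", "INN",   "IN",             0.8, 2),
]

for r in sorted(repairs, key=expected_benefit, reverse=True)[:2]:
    print(f"Suggest {r.attribute}: {r.old_value!r} -> {r.new_value!r} "
          f"(expected benefit {expected_benefit(r):.2f})")
# Accepted/rejected suggestions would be fed back into the model behind p_correct.
```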
Supporting Efficient Record Linkage for Large Data Sets Using Mapping Techniques
"... This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual att ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. Since the scalability issue of record linkage was addressed in [21], the repertoire of database techniques dealing with multidimensional data sets has significantly increased. Specifically, many effective and efficient approaches for distance-preserving transforms and similarity joins have been developed. Based on these advances, we explore a novel approach to record linkage. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the Fastmap approach [16] as an example. Given the merging rule that defines when two records are similar based on their attribute-level similarities, a set of attributes is chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to find similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and recall.
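As a toy illustration of the mapping step (assuming edit distance as the domain-specific string similarity and a single, hand-picked pivot pair; the paper itself builds multidimensional embeddings and runs a proper multidimensional similarity join rather than the quadratic scan below), one FastMap axis can be computed like this:

```python
# Sketch: project string values onto one FastMap axis using edit distance,
# then find similar pairs with a brute-force threshold join in the mapped space.

def edit_distance(s, t):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def fastmap_axis(values, a, b):
    """One FastMap coordinate per value, defined by the pivot pair (a, b)."""
    dab = edit_distance(a, b)
    return {v: (edit_distance(a, v) ** 2 + dab ** 2 - edit_distance(b, v) ** 2)
               / (2 * dab)
            for v in values}

names = ["john smith", "jon smith", "jane doe", "j. smith", "jane d."]
coords = fastmap_axis(names, "john smith", "jane doe")  # pivots chosen by hand

threshold = 2.0
pairs = [(u, v) for i, u in enumerate(names) for v in names[i + 1:]
         if abs(coords[u] - coords[v]) <= threshold]
print(pairs)
```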
Active learning for crowd-sourced databases
- 2012
"... Crowd-sourcing has become a popular means of acquiring labeled data for a wide variety of tasks where humans are more accurate than computers, e.g., labeling images, matching objects, or ana-lyzing sentiment. However, relying solely on the crowd is often impractical even for datasets with thousands ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
(Show Context)
Crowd-sourcing has become a popular means of acquiring labeled data for a wide variety of tasks where humans are more accurate than computers, e.g., labeling images, matching objects, or analyzing sentiment. However, relying solely on the crowd is often impractical even for datasets with thousands of items, due to time and cost constraints of acquiring human input. In this paper, we propose algorithms for integrating machine learning into crowd-sourced databases, with the goal of allowing crowd-sourcing applications to scale, i.e., to handle larger datasets at lower costs. The key observation is that, in many of the above tasks, humans and machine learning algorithms can be complementary, as humans are often more accurate but slow and expensive, while algorithms are usually less accurate, but faster and cheaper. Based on this observation, we present two new active learning algorithms to combine humans and algorithms together in a crowd-sourced database. Our algorithms are based on the theory of non-parametric bootstrap, which makes our results applicable to a broad class of machine learning models. Our results, on three real-life datasets collected with Amazon’s Mechanical Turk, and on 15 well-known UCI data sets, show that our methods on average ask humans to label one to two orders of magnitude fewer items to achieve the same accuracy as the baseline (which randomly chooses the items to be labeled by the crowd), and two to eight times fewer questions than previous active learning schemes.
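A rough sketch of the bootstrap idea, not the paper's algorithms: train a model on several bootstrap resamples of the crowd-labeled data, treat disagreement across the resamples as uncertainty, and route the most uncertain unlabeled items to the crowd. The sketch assumes numpy and scikit-learn are available; the data and the choice of logistic regression are placeholders.

```python
# Hypothetical sketch: bootstrap-based uncertainty for crowd-label routing.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Invented data: a few crowd-labeled items and a large unlabeled pool.
X_labeled = rng.normal(size=(60, 5))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)
X_pool = rng.normal(size=(1000, 5))

def bootstrap_uncertainty(X_l, y_l, X_u, n_boot=20):
    """Std. dev. of predicted probabilities across bootstrap-trained models."""
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(X_l), size=len(X_l))  # resample with replacement
        model = LogisticRegression().fit(X_l[idx], y_l[idx])
        preds.append(model.predict_proba(X_u)[:, 1])
    return np.std(preds, axis=0)

uncertainty = bootstrap_uncertainty(X_labeled, y_labeled, X_pool)
ask_crowd = np.argsort(uncertainty)[-25:]  # items the crowd should label next
print(ask_crowd)
```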
RAVEN: Towards zero-configuration link discovery
- In Proceedings of OM@ISWC, 2011
"... Abstract. With the growth of the Linked Data Web, time-efficient ap-proaches for computing links between data sources have become indis-pensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. In this ar ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
Abstract. With the growth of the Linked Data Web, time-efficient approaches for computing links between data sources have become indispensable. Yet, in many cases, determining the right specification for a link discovery problem is a tedious task that must still be carried out manually. In this article we present RAVEN, an approach for the semi-automatic determination of link specifications. Our approach is based on the combination of stable solutions of matching problems and active learning leveraging the time-efficient link discovery framework LIMES. RAVEN is designed to require a small number of interactions with the user in order to generate classifiers of high accuracy. We focus with RAVEN on the computation and configuration of Boolean and weighted classifiers, which we evaluate in three experiments against link specifications created manually. Our evaluation shows that we can compute linking configurations that achieve more than 90% F-score by asking the user to verify at most twelve potential links.
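The abstract mentions stable solutions of matching problems; purely as background (this is not RAVEN's actual procedure, and the property names and preference lists are invented), a Gale-Shapley pairing of source and target properties by similarity-based preferences looks like this:

```python
# Generic Gale-Shapley stable matching between source and target properties,
# with preference lists that would come from, e.g., string similarity of labels.

def gale_shapley(pref_src, pref_tgt):
    """Return a stable matching; pref_* map each element to a ranked list."""
    free = list(pref_src)                     # unmatched source properties
    next_choice = {s: 0 for s in pref_src}    # next target each source proposes to
    match = {}                                # target -> source
    rank = {t: {s: i for i, s in enumerate(p)} for t, p in pref_tgt.items()}
    while free:
        s = free.pop()
        t = pref_src[s][next_choice[s]]
        next_choice[s] += 1
        if t not in match:
            match[t] = s
        elif rank[t][s] < rank[t][match[t]]:  # t prefers the new proposer
            free.append(match[t])
            match[t] = s
        else:
            free.append(s)
    return {s: t for t, s in match.items()}

# Invented property names and preferences.
pref_src = {"foaf:name": ["dbo:name", "dbo:label"],
            "rdfs:label": ["dbo:label", "dbo:name"]}
pref_tgt = {"dbo:name": ["foaf:name", "rdfs:label"],
            "dbo:label": ["rdfs:label", "foaf:name"]}
print(gale_shapley(pref_src, pref_tgt))
```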
Scaling up the ALIAS Duplicate Elimination System: A Demonstration
"... Duplicate elimination is an important stage in integrating data from multiple sources. The challenges involved are finding a robust deduplication function that can identify when two records are duplicates and efficiently applying the function on very large lists of records. In ALIAS the task of desi ..."
Abstract
- Add to MetaCart
(Show Context)
Duplicate elimination is an important stage in integrating data from multiple sources. The challenges involved are finding a robust deduplication function that can identify when two records are duplicates and efficiently applying the function on very large lists of records. In ALIAS the task of designing a deduplication function is eased by learning the function from examples of duplicates and non-duplicates and by using active learning to spot such examples effectively [1]. Here we investigate the issues involved in efficiently applying the learnt deduplication system on large lists of records. We demonstrate the working of the ALIAS evaluation engine and highlight the optimizations it uses to significantly cut down the number of record pairs that need to be explicitly materialized.
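One standard way to avoid materializing all record pairs is blocking; the sketch below is only a generic illustration of that idea, not ALIAS's actual optimizations, and the blocking key and the stand-in classifier are invented:

```python
# Sketch: group records on a cheap blocking key so the learned deduplication
# function is applied only within blocks, never to all O(n^2) pairs.
from collections import defaultdict
from itertools import combinations

records = [
    {"id": 1, "name": "John Smith", "city": "Seattle"},
    {"id": 2, "name": "Jon Smith",  "city": "Seattle"},
    {"id": 3, "name": "Jane Doe",   "city": "Portland"},
    {"id": 4, "name": "J. Smith",   "city": "Seattle"},
]

def blocking_key(rec):
    # Cheap key: first letter of the surname plus the city.
    return (rec["name"].split()[-1][0].lower(), rec["city"])

def is_duplicate(a, b):
    # Stand-in for a learned deduplication function; here, equal surnames.
    return a["name"].split()[-1].lower() == b["name"].split()[-1].lower()

blocks = defaultdict(list)
for rec in records:
    blocks[blocking_key(rec)].append(rec)

candidate_pairs = [pair for block in blocks.values()
                   for pair in combinations(block, 2)]
duplicates = [(a["id"], b["id"]) for a, b in candidate_pairs if is_duplicate(a, b)]
print(len(candidate_pairs), "pairs materialized instead of",
      len(records) * (len(records) - 1) // 2)
print("duplicates:", duplicates)
```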