Duplicate record detection: A survey
- TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2007
Cited by 155 (4 self)
Often, in the real world, entities have two or more representations in databases. Duplicate records do not share a common key and/or they contain errors that make duplicate matching a difficult task. Errors are introduced as the result of transcription errors, incomplete information, lack of standard formats or any combination of these factors. In this article, we present a thorough analysis of the literature on duplicate record detection. We cover similarity metrics that are commonly used to detect similar field entries, and we present an extensive set of duplicate detection algorithms that can detect approximately duplicate records in a database. We also cover multiple techniques for improving the efficiency and scalability of approximate duplicate detection algorithms. We conclude with a coverage of existing tools and with a brief discussion of the big open problems in the area.
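To make the idea of a field-level similarity metric concrete, here is a minimal sketch that assumes edit (Levenshtein) distance as the metric. The function names, the lowercase normalization, and the 0.8 threshold are illustrative assumptions, not taken from the survey, and the all-pairs loop ignores the efficiency techniques the abstract mentions.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def likely_duplicates(values, threshold=0.8):
    """Return pairs of field values whose similarity reaches the threshold.

    The threshold is a hypothetical tuning knob; this naive version
    compares every pair, which is quadratic in the number of values.
    """
    pairs = []
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if similarity(values[i].lower(), values[j].lower()) >= threshold:
                pairs.append((values[i], values[j]))
    return pairs
```

For instance, `likely_duplicates(["Jon Smith", "John Smith", "Mary Jones"])` pairs the two Smith spellings, which differ by a single inserted character.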
Efficient top-k query evaluation on probabilistic data
- in ICDE, 2007
Cited by 106 (26 self)
Modern enterprise applications are forced to deal with unreliable, inconsistent and imprecise information. Probabilistic databases can model such data naturally, but SQL query evaluation on probabilistic databases is difficult: previous approaches have either restricted the SQL queries, or computed approximate probabilities, or did not scale, and it was shown recently that precise query evaluation is theoretically hard. In this paper we describe a novel approach, which efficiently computes and ranks the top-k answers to a SQL query on a probabilistic database. The restriction to top-k answers is natural, since imprecisions in the data often lead to a large number of answers of low quality, and users are interested only in the answers with the highest probabilities. The idea in our algorithm is to run several Monte Carlo simulations in parallel, one for each candidate answer, and approximate each probability only to the extent needed to correctly compute the top-k answers. The algorithm is in a certain sense provably optimal and scales to large databases: we have measured running times of 5 to 50 seconds for complex SQL queries over a large database (10M tuples, of which 6M are probabilistic). Additional contributions of the paper include several optimization techniques, and a simple data model for probabilistic data that achieves completeness by using SQL views.
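The idea of refining each estimate only as far as the top-k decision requires can be sketched as follows. This is a simplified illustration, not the paper's algorithm: the Hoeffding-style confidence interval, the batch size, the stopping rule, and the `samplers` interface (one hypothetical zero-argument function per candidate answer that returns whether the answer holds in one randomly drawn world) are all assumptions. The sketch also only identifies the top-k set and does not guarantee the order within it.

```python
import math
from typing import Callable, Dict, List


def top_k_by_simulation(
    samplers: Dict[str, Callable[[], bool]],
    k: int,
    batch: int = 100,
    delta: float = 0.01,
    max_rounds: int = 1000,
) -> List[str]:
    """Pick the k candidates with the highest estimated probability.

    Extra simulation steps are spent only on the two candidates whose
    confidence intervals straddle the boundary between top-k and the rest.
    """
    if k <= 0:
        return []
    hits = {name: 0 for name in samplers}
    trials = {name: 0 for name in samplers}

    def simulate(name: str) -> None:
        # Run one batch of Monte Carlo trials for a single candidate.
        for _ in range(batch):
            hits[name] += 1 if samplers[name]() else 0
        trials[name] += batch

    def interval(name: str):
        # Hoeffding-style interval (an assumption of this sketch).
        mean = hits[name] / trials[name]
        half = math.sqrt(math.log(2.0 / delta) / (2.0 * trials[name]))
        return max(0.0, mean - half), min(1.0, mean + half)

    for name in samplers:
        simulate(name)  # every candidate gets one initial batch

    for _ in range(max_rounds):
        ranked = sorted(samplers, key=lambda n: interval(n)[0], reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if not rest:
            return top
        weakest_top = min(top, key=lambda n: interval(n)[0])
        strongest_rest = max(rest, key=lambda n: interval(n)[1])
        # Stop once the top-k intervals sit entirely above all the others.
        if interval(weakest_top)[0] >= interval(strongest_rest)[1]:
            return top
        simulate(weakest_top)
        simulate(strongest_rest)

    # Fallback after max_rounds: rank by the plain point estimates.
    return sorted(samplers, key=lambda n: hits[n] / trials[n], reverse=True)[:k]
```

Candidates whose intervals are already far from the boundary receive no further trials, which is where the savings over estimating every probability to the same precision come from.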
Efficient Development of Data Migration Transformations
- IN ACM SIGMOD INT’L CONF. ON THE MANAGEMENT OF DATA, 2004
Improving Data Cleaning Quality using a Data Lineage Facility
- In: Proc. Design and Management of Data Warehouses (DMDW), 2001
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. However, for some applications, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. One important challenge with them is the design of a data flow graph that effectively generates clean data. A generalized difficulty is the lack of explanation of cleaning results and user interaction facilities to tune a data cleaning program. This paper presents a solution to handle this problem by enabling users to express user interactions declaratively and tune data cleaning programs.
Benchmarking Declarative Approximate Selection Predicates
- Copyright © 2007 by Oktie Hassanzadeh
Declarative data quality has been an active research topic. The fundamental principle behind a declarative approach to data quality is the use of declarative statements to realize data quality primitives on top of any relational data source. A primary advantage of such an approach is the ease of use and integration with existing applications. Over the last couple of years several similarity predicates have been proposed for common quality primitives (approximate selections, joins, etc.) and have been fully expressed using declarative SQL statements. In this thesis, new similarity predicates are proposed along with their declarative realization, based on notions of probabilistic information retrieval. Then, full declarative specifications of previously proposed similarity predicates in the literature are presented, grouped into classes according to their primary characteristics. Finally, a thorough performance and accuracy study comparing a large number of similarity predicates for data cleaning operations is performed.
Dedication: This thesis is dedicated to my brother, Aidin, and to my parents who have always supported me.
Acknowledgements: First, I would like to thank my supervisor, Nick Koudas. This work would not have been possible without his invaluable guidance and support. Special thanks to John Mylopoulos, the second reader of my thesis, for his valuable time and comments. During my research, I had the pleasure of working in a wonderful atmosphere in the database lab. I had an unforgettable year with my colleagues there. While enjoying the taste of fresh coffee from our fancy coffee machine that helped us stay awake all long nights before the deadlines, we had many fruitful discussions that often resulted in brilliant new ideas. I would like to thank all my friends in the database lab, particularly
Management of Data with Uncertainties
- CIKM'07, 2007
Since their invention in the early 70s, relational databases have been deterministic. They were designed to support applications such as accounting, inventory, customer care, and manufacturing, and these applications require precise semantics. Thus, database systems are deterministic. A row is either in the database or is not; a tuple is either in the query answer or is not. The foundations of query processing and the tools that exist today for managing data rely fundamentally on the assumption that the data is deterministic. Increasingly, today we need to manage data that is uncertain. The uncertainty can be in the data itself, in the schema, in the mapping between different data instances, or in the user query. We find increasingly large amounts of uncertain data in a variety of domains: in data integration, in scientific data, in information extracted automatically from text, in data from the physical world. Large enterprises today can sometimes afford to cope with the uncertainty in their data by completely removing it, by using some expensive data cleaning or ETL tools. But increasingly today organizations or users need to cope directly with uncertain data, either because cleaning it is prohibitively expensive (e.g. in scientific data integration or in integration of Web data), or because it is even impossible to clean (e.g. sensor data or RFID data). It becomes clear that we need

