Data Cleaning: Problems and Current Approaches
IEEE Data Engineering Bulletin, 2000
Cited by 132 (7 self)
We classify data quality problems that are addressed by data cleaning and provide an overview of the main solution approaches. Data cleaning is especially required when integrating heterogeneous data sources and should be addressed together with schema-related data transformations. In data warehouses, data cleaning is a major part of the so-called ETL process. We also discuss current tool support for data cleaning.
Potter's Wheel: An Interactive Data Cleaning System
2001
Cited by 128 (4 self)
Cleaning data of errors in structure and content is important for data warehousing and integration. Current solutions for data cleaning involve many iterations of data "auditing" to find errors, and long-running transformations to fix them. Users need to endure long waits, and often write complex transformation scripts. We present Potter's Wheel, an interactive data cleaning system that tightly integrates transformation and discrepancy detection. Users gradually build transformations to clean the data by adding or undoing transforms on a spreadsheet-like interface; the effect of a transform is shown at once on records visible on screen. These transforms are specified either through simple graphical operations, or by showing the desired effects on example data values. In the background, Potter's Wheel automatically infers structures for data values in terms of user-defined domains, and accordingly checks for constraint violations. Thus users can gradually build a transformation as discrepancies are found, and clean the data without writing complex programs or enduring long delays.
Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration
2002
Cited by 96 (6 self)
Part of the process of data integration is determining which sets of identifiers refer to the same real-world entities. In integrating databases found on the Web or obtained by using information extraction methods, it is often possible to solve this problem by exploiting similarities in the textual names used for objects in different databases. In this paper we describe techniques for clustering and matching identifier names that are both scalable and adaptive, in the sense that they can be trained to obtain better performance in a particular domain. An experimental evaluation on a number of sample datasets shows that the adaptive method sometimes performs much better than either of two non-adaptive baseline systems, and is nearly always competitive with the best baseline system.
Declarative Data Cleaning: Language, Model, and Algorithms
In VLDB, 2001
Cited by 86 (4 self)
The problem of data cleaning, which consists of removing inconsistencies and errors from original data sets, is well known in the area of decision support systems and data warehouses. This holds regardless of the application: relational database joining, web-related, or scientific. In all cases, existing ETL (Extraction Transformation Loading) and data cleaning tools for writing data cleaning programs are insufficient. The main challenge is the design and implementation of a dataflow graph that effectively and efficiently generates clean data. Needed improvements to the current state of the art include (i) a clear separation between the logical specification of data transformations and their physical implementation, (ii) an explanation of the reasoning behind cleaning results, and (iii) interactive facilities to tune a data cleaning program. This paper presents a language, an execution model and algorithms that enable users to express data cleaning specifications declaratively and perform the cleaning efficiently. We use as an example a set of bibliographic references used to construct the Citeseer Web site. The underlying data integration problem is to derive structured and clean textual records so that meaningful queries can be performed. Experimental results report on the assessment of the proposed framework for data cleaning.
Integrating Keyword Search into XML Query Processing
2000
Cited by 65 (1 self)
Due to the popularity of the XML data format, several query languages for XML have been proposed, specially devised to handle data whose structure is unknown, loose, or absent. While these languages are rich enough to allow for querying the content and structure of an XML document, a varying or unknown structure can make formulating queries a very difficult task. We propose an extension to XML query languages that enables keyword search at the granularity of XML elements, helps novice users formulate queries, and also yields new optimization opportunities for the query processor. We present an implementation of this extension on top of a commercial RDBMS; we then discuss implementation choices and performance results.
Keywords: XML query processing, full-text index
1 Introduction: There is no doubt that XML is rapidly becoming one of the most important data formats. It is already used for scientific data (e.g., DNA sequences), in linguistics (e.g., the Treebank database at the U...
Mining database structure; or, how to build a data quality browser
In SIGMOD, 2002
Cited by 53 (6 self)
Data mining research typically assumes that the data to be analyzed has been identified, gathered, cleaned, and processed into a convenient form. While data mining tools greatly enhance the ability of the analyst to make data-driven discoveries, most of the time in an analysis is spent identifying, gathering, cleaning and processing the data. Similarly, schema mapping tools have been developed to help automate the task of using legacy or federated data sources for a new purpose, but assume that the structure of the data sources is well understood. However, the data sets to be federated may come from dozens of databases containing thousands of tables and tens of thousands of fields, with little reliable documentation about primary keys or foreign keys. We are developing a system, Bellman, which performs data mining on the structure of the database. In this paper, we present techniques for quickly identifying which fields have similar values, identifying join paths, estimating join directions and sizes, and identifying structures in the database. The results of the database structure mining allow the analyst to make sense of the database content. This information can be used, e.g., to prepare data for data mining, find foreign key joins for schema mapping, or identify steps to be taken to prevent the database from collapsing under the weight of its complexity.
10^(10^6) Worlds and Beyond: Efficient Representation and Processing of Incomplete Information
2006
Cited by 46 (6 self)
Current systems and formalisms for representing incomplete information generally suffer from at least one of two weaknesses. Either they are not strong enough for representing results of simple queries, or the handling and processing of the data, e.g. for query evaluation, is intractable. In this paper, we present a decomposition-based approach to addressing this problem. We introduce world-set decompositions (WSDs), a space-efficient formalism for representing any finite set of possible worlds over relational databases. WSDs are therefore a strong representation system for any relational query language. We study the problem of efficiently evaluating relational algebra queries on sets of worlds represented by WSDs. We also evaluate our technique experimentally in a large census data scenario and show that it is both scalable and efficient.
A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification
In ACM SIGMOD International Conference on Management of Data, 2005
Cited by 42 (4 self)
Data integrated from multiple sources may contain inconsistencies that violate integrity constraints. The constraint repair problem attempts to find “low cost” changes that, when applied, will cause the constraints to be satisfied. While in most previous work repair cost is stated in terms of tuple insertions and deletions, we follow recent work to define a database repair as a set of value modifications. In this context, we introduce a novel cost framework that allows for the application of techniques from record-linkage to the search for good repairs. We prove that finding minimal-cost repairs in this model is NP-complete in the size of the database, and introduce an approach to heuristic repair-construction based on equivalence classes of attribute values. Following this approach, we define two greedy algorithms. While these simple algorithms take time cubic in the size of the database, we develop optimizations inspired by algorithms for duplicate-record detection that greatly improve scalability. We evaluate our framework and algorithms on synthetic and real data, and show that our proposed optimizations greatly improve performance at little or no cost in repair quality.
Learning to Match and Cluster Entity Names
In ACM SIGIR-2001 Workshop on Mathematical/Formal Methods in Information Retrieval, 2001
Cited by 28 (1 self)
Introduction: Information retrieval is, in large part, the study of methods for assessing the similarity of pairs of documents. Document similarity metrics have been used for many tasks including ad hoc document retrieval, text classification [YC1994], and summarization [GC1998, SSMB1997]. Another problem area in which similarity metrics are central is record linkage (e.g., [KA1985]), where one wishes to determine if two database records taken from different source databases refer to the same entity. For instance, one might wish to determine if two database records from two different hospitals, each containing a patient's name, address, and insurance information, refer to the same person; as another example, one might wish to determine if two bibliography records, each containing a paper title, list of authors, and journal name, refer to the same publication. In both of these examples (and in many other practical cases) most of the record fields
Conceptual Modeling for ETL Processes
2002
Cited by 28 (9 self)
... software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. In this paper, we focus on the problem of the definition of ETL activities and provide formal foundations for their conceptual representation. The proposed conceptual model is (a) customized for the tracing of inter-attribute relationships and the respective ETL activities in the early stages of a data warehouse project; (b) enriched with a 'palette' of frequently used ETL activities, like the assignment of surrogate keys, the check for null values, etc.; and (c) constructed in a customizable and extensible manner, so that the designer can enrich it with his own recurring patterns for ETL activities.

