Results 1 - 7 of 7
KATARA: A Data Cleaning System Powered by Knowledge Bases and Crowdsourcing
"... Classical approaches to clean data have relied on using in-tegrity constraints, statistics, or machine learning. These approaches are known to be limited in the cleaning accu-racy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowl ..."
Cited by 4 (2 self)
Classical approaches to clean data have relied on using integrity constraints, statistics, or machine learning. These approaches are known to be limited in cleaning accuracy, which can usually be improved by consulting master data and involving experts to resolve ambiguity. The advent of knowledge bases (KBs), both general-purpose and within enterprises, and of crowdsourcing marketplaces is providing yet more opportunities to achieve higher accuracy at a larger scale. We propose Katara, a knowledge base and crowd powered data cleaning system that, given a table, a KB, and a crowd, interprets table semantics to align it with the KB, identifies correct and incorrect data, and generates top-k possible repairs for incorrect data. Experiments show that Katara can be applied to various datasets and KBs, and can efficiently annotate data and suggest possible repairs.
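The core loop the abstract describes, aligning table columns with KB relations and then validating cell pairs against the KB, can be sketched as follows. This is a minimal illustration assuming a KB given as (subject, relation, object) triples and a table given as rows of values; the function names are mine, not Katara's API.

```python
# Minimal sketch of KB-powered table annotation in the spirit of Katara
# (not the authors' implementation). Assumes the KB is a set of
# (subject, relation, object) triples and the table is a list of tuples.
from collections import Counter

def align_columns(table, kb, col_a, col_b, min_support=0.5):
    """Guess which KB relation explains the (col_a, col_b) column pair."""
    support = Counter()
    for row in table:
        for subj, rel, obj in kb:
            if row[col_a] == subj and row[col_b] == obj:
                support[rel] += 1
    return [r for r, c in support.items() if c / len(table) >= min_support]

def validate(table, kb, col_a, col_b, rel):
    """Split rows into KB-validated ones and candidates for repair."""
    triples = {(s, o) for s, r, o in kb if r == rel}
    ok = [row for row in table if (row[col_a], row[col_b]) in triples]
    suspect = [row for row in table if (row[col_a], row[col_b]) not in triples]
    return ok, suspect

kb = {("Italy", "capital", "Rome"), ("France", "capital", "Paris")}
table = [("Italy", "Rome"), ("France", "Paris"), ("Italy", "Paris")]
print(align_columns(table, kb, 0, 1))           # ['capital']
print(validate(table, kb, 0, 1, "capital")[1])  # [('Italy', 'Paris')]
```

In the full system, rows that the KB cannot validate are the ones routed to the crowd and to top-k repair generation.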
BigDansing: A System for Big Data Cleansing
"... Data cleansing approaches have usually focused on detect-ing and fixing errors with little attention to scaling to big datasets. This presents a serious impediment since data cleansing often involves costly computations such as enu-merating pairs of tuples, handling inequality joins, and deal-ing wi ..."
Cited by 3 (3 self)
Data cleansing approaches have usually focused on detecting and fixing errors with little attention to scaling to big datasets. This presents a serious impediment, since data cleansing often involves costly computations such as enumerating pairs of tuples, handling inequality joins, and dealing with user-defined functions. In this paper, we present BigDansing, a Big Data Cleansing system to tackle efficiency, scalability, and ease-of-use issues in data cleansing. The system can run on top of most common general-purpose data processing platforms, ranging from DBMSs to MapReduce-like frameworks. A user-friendly programming interface allows users to express data quality rules both declaratively and procedurally, with no requirement to be aware of the underlying distributed platform. BigDansing translates these rules into a series of transformations that enable distributed computations and several optimizations, such as shared scans and specialized join operators. Experimental results on both synthetic and real datasets show that BigDansing outperforms existing baseline systems by up to more than two orders of magnitude without sacrificing the quality provided by the repair algorithms.
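The costly pairwise computation the abstract refers to is easy to see for a functional dependency rule. A toy sketch, assuming rows as dictionaries; BigDansing's actual rule interface and operators are not shown here:

```python
# Rule: functional dependency zipcode -> city, checked as a predicate
# over tuple pairs; no pair may agree on zipcode yet disagree on city.
from itertools import combinations

def fd_violations(rows, lhs, rhs):
    """Enumerate tuple pairs that agree on lhs but disagree on rhs."""
    return [
        (t1, t2)
        for t1, t2 in combinations(rows, 2)
        if t1[lhs] == t2[lhs] and t1[rhs] != t2[rhs]
    ]

rows = [
    {"zipcode": "10001", "city": "New York"},
    {"zipcode": "10001", "city": "Newark"},   # violates zipcode -> city
    {"zipcode": "60601", "city": "Chicago"},
]
print(fd_violations(rows, "zipcode", "city"))
```

The quadratic pair enumeration here is precisely the kind of cost that optimizations such as shared scans and specialized join operators are meant to tame at scale.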
Discovering Ontology Functional Dependencies
"... Abstract Functional Dependencies (FDs) are commonly used in data cleaning to identify dirty and inconsistent data values. However, many errors require user input for specific domain knowledge. For example, let us consider the drugs, Advil and Crocin. FDs will consider these two drugs different beca ..."
Functional Dependencies (FDs) are commonly used in data cleaning to identify dirty and inconsistent data values. However, many errors require user input for specific domain knowledge. For example, consider the drugs Advil and Crocin. FDs will consider these two drugs different because they are not syntactically equal. However, Advil and Crocin are synonyms: two drugs with similar chemical compounds but marketed under distinct names in different countries. While FDs have traditionally been used in existing data cleaning solutions to model syntactic equivalence, they are not able to model broader relationships (e.g., synonym, Is-A (inheritance)) defined by ontologies. In this thesis, we take a first step towards discovering a new dependency called Ontology Functional Dependencies (OFDs). OFDs model attribute relationships based on relationships in a given ontology. We present two effective algorithms to discover OFDs using synonym and inheritance relationships. Our discovery algorithms search for minimal OFDs and prune the redundant ones. Both algorithms traverse the search lattice in a level-wise Breadth-First Search (BFS) manner. In addition, we have developed a set of pruning rules so that we can avoid considering unnecessary candidates in the search lattice. We present an experimental study describing the performance of our algorithms.
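A hedged sketch of the relaxation OFDs introduce: the consequent equality of a classic FD is replaced by membership in a synonym class from the ontology. The synonym table and function names below are illustrative, not the thesis's implementation:

```python
# Illustrative check of an FD whose consequent equality is relaxed to
# ontology synonymy (the OFD idea); discovery and pruning are not shown.
from itertools import combinations

SYNONYMS = [{"Advil", "Crocin"}]  # toy ontology, per the abstract's example

def same_class(a, b):
    """Equal values, or values sharing a synonym class, count as one."""
    return a == b or any(a in s and b in s for s in SYNONYMS)

def holds_ofd(rows, lhs, rhs):
    """OFD lhs -> rhs: equal antecedents imply synonymous consequents."""
    return all(
        same_class(t1[rhs], t2[rhs])
        for t1, t2 in combinations(rows, 2)
        if t1[lhs] == t2[lhs]
    )

rows = [
    {"symptom": "headache", "drug": "Advil"},
    {"symptom": "headache", "drug": "Crocin"},  # synonym, not a violation
]
print(holds_ofd(rows, "symptom", "drug"))  # True under the ontology
```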
On Axiomatization and Inference Complexity over a Hierarchy of Functional Dependencies
"... Abstract. Functional dependencies (FDs) have recently been extended for data quality purposes with various notions of similarity instead of strict equality. We study these extensions in this paper. We begin by constructing a hierarchy of dependencies, showing which dependencies generalize others. W ..."
Functional dependencies (FDs) have recently been extended for data quality purposes with various notions of similarity instead of strict equality. We study these extensions in this paper. We begin by constructing a hierarchy of dependencies, showing which dependencies generalize others. We then focus on an extension of FDs that we call Antecedent Metric Functional Dependencies (AMFDs). An AMFD asserts that if two tuples have similar but not necessarily equal values of the antecedent attributes, then their consequent values must be equal. We present a sound and complete axiomatization as well as an inference algorithm for AMFDs. We compare the axiomatization of AMFDs to those of the other dependencies, and we show that while the complexity of inference for some FD extensions is quadratic or even co-NP-complete, the inference problem for AMFDs remains linear, as in traditional FDs.
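The AMFD statement quoted above can be written formally as follows, using my own notation: d for a metric on antecedent values and θ for the similarity threshold.

```latex
% AMFD X -> Y on relation R: similar antecedents force equal consequents.
\forall\, t_1, t_2 \in R:\quad
  d\bigl(t_1[X],\, t_2[X]\bigr) \le \theta
  \;\Longrightarrow\; t_1[Y] = t_2[Y]
```

Taking θ = 0 under the discrete metric recovers the classical FD, consistent with the generalization hierarchy the abstract describes.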
IBM Toronto Centre for Advanced Studies
"... Business-intelligence queries often involve SQL functions and al-gebraic expressions. There can be clear semantic relationships be-tween a column’s values and the values of a function over that col-umn. A common property is monotonicity: as the column’s values ascend, so do the function’s values. Th ..."
Business-intelligence queries often involve SQL functions and algebraic expressions. There can be clear semantic relationships between a column's values and the values of a function over that column. A common property is monotonicity: as the column's values ascend, so do the function's values. This we call an order dependency (OD). Queries can be evaluated more efficiently when the query optimizer uses order dependencies. They can be run even faster when the optimizer can also reason over known ODs to infer new ones. Order dependencies can be declared as integrity constraints, and they can be detected automatically for many types of SQL functions and algebraic expressions. We present optimization techniques using ODs for queries that involve join, order by, group by, partition by, and distinct. Essentially, ODs can further exploit interesting orders to eliminate or simplify potentially expensive sorts in the query plan. We evaluate these techniques over our implementation in IBM® DB2® V10 using the TPC-DS® benchmark schema and some IBM customer-inspired queries. Our experimental results demonstrate a significant performance gain. We additionally devise an algorithm for testing logical implication for ODs which is polynomial in the size of the given set of ODs. We show that the inference algorithm which we have implemented in DB2 is sound and complete over sets of ODs over natural domains. This enables the optimizer to infer useful ODs from known ODs.
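To make the monotonicity property concrete: an OD such as date orders year(date) says a list sorted on date is automatically sorted on year, so a sort on year can be skipped. A small runtime check of this property (my own illustration; DB2 reasons over declared or inferred ODs, not by scanning data):

```python
# Illustrative check of an order dependency: if rows are sorted by a
# column, a monotone function of that column is sorted too.

def od_holds(rows, col, func):
    """Does ordering by col also order func(col)?"""
    ordered = sorted(rows, key=lambda r: r[col])
    values = [func(r[col]) for r in ordered]
    return all(a <= b for a, b in zip(values, values[1:]))

# Sorting on (year, month, day) dates also sorts their year component,
# so an order by year over date-sorted input needs no extra sort.
rows = [{"date": (2014, 3, 1)}, {"date": (2013, 7, 9)}, {"date": (2014, 1, 5)}]
print(od_holds(rows, "date", lambda d: d[0]))  # True
```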
Improving Data Cleansing Accuracy: A Model-Based Approach
"... Abstract: Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and then value) from data. Information Systems generate a lot of data useful for studying the dynamics of subjects ’ behaviours or phenomena over time, making th ..."
Research on data quality is growing in importance in both industrial and academic communities, as it aims at deriving knowledge (and thus value) from data. Information systems generate a lot of data useful for studying the dynamics of subjects' behaviours or phenomena over time, making the quality of data a crucial aspect for guaranteeing the believability of the overall knowledge discovery process. In such a scenario, data cleansing techniques, i.e., automatic methods to cleanse a dirty dataset, are paramount. However, when multiple cleansing alternatives are available, a policy is required for choosing between them. The policy design task still relies on the experience of domain experts, and this makes the automatic identification of accurate policies a significant issue. This paper extends the Universal Cleaning Process, enabling the automatic generation of an accurate cleansing policy derived from the dataset to be analysed. The proposed approach has been implemented and tested on an online benchmark dataset, a real-world instance of the labour market domain. Our preliminary results show that our approach represents a contribution towards the generation of data-driven policies, significantly reducing the domain experts' intervention in policy specification. Finally, the generated results have been made publicly available for download.
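As a loose illustration of what a cleansing policy does operationally, the sketch below picks among repair alternatives by a hypothetical accuracy score; the paper's model-based Universal Cleaning Process derives this policy from the dataset itself and is far richer.

```python
# Hypothetical policy-driven choice among cleansing alternatives.
# score() stands in for an accuracy estimate derived from the data;
# all names here are illustrative, not the paper's implementation.

def apply_policy(record, alternatives, score):
    """Pick the repair whose estimated accuracy is highest."""
    return max(alternatives, key=lambda repair: score(record, repair))

record = {"status": "employed", "contract": None}
alternatives = [{"contract": "permanent"}, {"contract": "temporary"}]
# Toy data-derived scores standing in for the generated policy.
score = lambda rec, rep: {"permanent": 0.8, "temporary": 0.2}[rep["contract"]]
print(apply_policy(record, alternatives, score))  # {'contract': 'permanent'}
```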