Results 1 - 10 of 31
Data Cleaning and Query Answering with Matching Dependencies and Matching Functions
Cited by 10 (9 self)
Matching dependencies were recently introduced as declarative rules for data cleaning and entity resolution. Enforcing a matching dependency on a database instance identifies the values of some attributes for two tuples, provided that the values of some other attributes are sufficiently similar. Assuming the existence of matching functions for making two attribute values equal, we formally introduce the process of cleaning an instance using matching dependencies as a chase-like procedure. We show that matching functions naturally introduce a lattice structure on attribute domains, and a partial order of semantic domination between instances. Using the latter, we define the semantics of clean query answering in terms of certain/possible answers as the greatest lower bound/least upper bound of all possible answers obtained from the clean instances. We show that clean query answering is intractable in some cases. Then we study queries that behave monotonically w.r.t. the semantic domination order, and show that we can provide an under/over approximation for clean answers to monotone queries. Moreover, non-monotone positive queries can be relaxed into monotone queries.
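The chase-like enforcement described in the abstract above can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the similarity test (a `difflib` ratio with a threshold of 0.8), the matching function `match` (keep the longer string, which is idempotent, commutative and associative, so it induces an order on values), and the sample rows.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar(a, b, threshold=0.8):
    """Assumed similarity test: difflib ratio at or above a threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def match(a, b):
    """Assumed matching function: keep the longer string (ties: the larger one).

    Taking a maximum under a total order is idempotent, commutative and
    associative, which is what lets repeated application converge.
    """
    return max(a, b, key=lambda v: (len(v), v))


def enforce_md(rows, lhs, rhs):
    """Chase-like loop: while two rows are similar on `lhs` but differ on
    `rhs`, replace both `rhs` values by their match. Returns a cleaned copy."""
    rows = [dict(r) for r in rows]  # do not mutate the input instance
    changed = True
    while changed:
        changed = False
        for r1, r2 in combinations(rows, 2):
            if similar(r1[lhs], r2[lhs]) and r1[rhs] != r2[rhs]:
                r1[rhs] = r2[rhs] = match(r1[rhs], r2[rhs])
                changed = True
    return rows


# Hypothetical instance: the first two rows have similar names.
people = [
    {"name": "John Smith", "addr": "10 Main St"},
    {"name": "Jon Smith", "addr": "10 Main Street, Ottawa"},
    {"name": "Alice Wong", "addr": "5 Oak Ave"},
]
clean = enforce_md(people, lhs="name", rhs="addr")
```

With a single dependency and this particular matching function the result does not depend on the order in which pairs are visited; with several interacting dependencies, different enforcement orders may yield different clean instances, which is why the abstract speaks of certain and possible answers.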
A Self-Training Approach for Resolving Object Coreference on the Semantic Web
, 2011
Cited by 6 (0 self)
An object on the Semantic Web is likely to be denoted with multiple URIs by different parties. Object coreference resolution is to identify “equivalent” URIs that denote the same object. Driven by the Linking Open Data (LOD) initiative, millions of URIs have been explicitly linked with owl:sameAs statements, but a considerable number of potentially coreferent URIs remain unlinked. Existing approaches address the problem mainly from two directions: one is based upon equivalence inference mandated by OWL semantics, which finds semantically coreferent URIs but probably omits many potential ones; the other is via similarity computation between property-value pairs, which is not always accurate enough. In this paper, we propose a self-training approach for object coreference resolution on the Semantic Web, which leverages the two classes
Matching Dependencies with Arbitrary Attribute Values: Semantics, Query Answering and Integrity Constraints
Cited by 5 (4 self)
Matching dependencies (MDs) were introduced to specify the identification or matching of certain attribute values in pairs of database tuples when some similarity conditions are satisfied. Their enforcement can be seen as a natural generalization of entity resolution. In what we call the pure case of MDs, any value from the underlying data domain can be used for the value in common that does the matching. We investigate the semantics and properties of data cleaning through the enforcement of matching dependencies for the pure case. We characterize the intended clean instances and also the clean answers to queries as those that are invariant under the cleaning process. The complexity of computing clean instances and clean answers to queries is investigated. Tractable and intractable cases depending on the MDs are identified.
Data fusion–resolving data conflicts for integration. PVLDB
, 2009
Cited by 4 (2 self)
The amount of information produced in the world increases by 30% every year and this rate will only go up. With advanced network technology, more and more sources are available either over the Internet or in enterprise intranets. Modern data management applications, such as setting up Web portals, managing enterprise data, managing community data, and sharing scientific data, often require integrating available data sources and providing a uniform interface for users to access data from different sources; such requirements have been driving fruitful research on data integration over the last two decades [11, 13]. Data integration systems face a twofold challenge. First, data from disparate sources are often heterogeneous. Heterogeneity can exist at the schema level, where different data sources often describe the same domain using different schemas; it can also exist at the instance level, where different sources can represent the same real-world entity in different ways. There has been a rich body of work on resolving heterogeneity in data, including, at the schema level, schema mapping and matching [14], model management [1], answering queries using views [12], data exchange [8], and at the instance level, record linkage (entity resolution, object matching, reference linkage, etc.) [7, 15], string similarity comparison [4], etc. Second, different sources can provide conflicting data. Conflicts can arise because of incomplete data, erroneous data, and out-of-date data. Returning incorrect data in a query result can be misleading and even harmful: one may contact a person by an out-of-date phone number, visit a clinic at a wrong address, and even make poor business decisions. It is thus critical for data integration systems to resolve conflicts from various sources and identify true values. This problem becomes especially prominent with the ease of publishing and spreading false information on the Web.
This tutorial focuses on data fusion, which addresses the second challenge by fusing records on the same real-world entity into a single record and resolving possible conflicts from different data sources. Data fusion plays an important
Query Answering under Matching Dependencies for Data Cleaning: Complexity and Algorithms. CoRR, arXiv paper cs.DB/1112.5908
, 2012
Cited by 2 (2 self)
Matching dependencies (MDs) have been recently introduced as declarative rules for entity resolution (ER), i.e. for identifying and resolving duplicates in a relational instance D. A set of MDs can be used as the basis for a possibly nondeterministic mechanism that computes a duplicate-free instance from D. The possible results of this process are the clean, minimally resolved instances (MRIs). There might be several MRIs for D, and the resolved answers to a query are those that are shared by all the MRIs. We investigate the problem of computing resolved answers. We look at various sets of MDs, developing syntactic criteria for determining (in)tractability of the resolved answer problem, including a dichotomy result. For some tractable classes of MDs and conjunctive queries, we present a query rewriting methodology that can be used to retrieve the resolved answers. We also investigate connections with consistent query answering, deriving further tractability results for MD-based ER.
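The definition of resolved answers in the abstract above (answers shared by all MRIs) can be read as a set intersection. The sketch below is only an illustration of that definition, assuming the MRIs are already given as explicit lists of tuples; the function name `resolved_answers`, the two hand-written instances and the sample queries are hypothetical. Enumerating all MRIs is in general expensive, which is what the query rewriting mentioned in the abstract is meant to avoid.

```python
def resolved_answers(instances, query):
    """Return the answers to `query` that appear in every given instance.

    `instances` is a list of already computed MRIs (each a list of tuples),
    and `query` is any function mapping an instance to an iterable of answers.
    """
    answer_sets = [set(query(inst)) for inst in instances]
    if not answer_sets:
        return set()
    return set.intersection(*answer_sets)


# Two hypothetical MRIs that disagree on how John Smith's address was resolved.
mri_1 = [("John Smith", "10 Main St"), ("Alice Wong", "5 Oak Ave")]
mri_2 = [("John Smith", "12 Main St"), ("Alice Wong", "5 Oak Ave")]


def all_tuples(instance):
    """Query returning every (name, address) tuple."""
    return instance


def names(instance):
    """Query projecting on the name attribute."""
    return [name for name, _ in instance]


full = resolved_answers([mri_1, mri_2], all_tuples)
only_names = resolved_answers([mri_1, mri_2], names)
```

Here the full tuple for John Smith is not a resolved answer because the two instances disagree on his address, while his name alone is.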
Scalable Data Exchange with Functional Dependencies
Cited by 2 (2 self)
The recent literature has provided a solid theoretical foundation for the use of schema mappings in data-exchange applications. Following this formalization, new algorithms have been developed to generate optimal solutions for mapping scenarios in a highly scalable way, by relying on SQL. However, these algorithms suffer from a serious drawback: they are not able to handle key constraints and functional dependencies on the target, i.e., equality generating dependencies (egds). While egds play a crucial role in the generation of optimal solutions, handling them with first-order languages is a difficult problem. In fact, we start from a negative result: it is not always possible to compute solutions for scenarios with egds using an SQL script. Then, we identify many practical cases in which this is possible, and develop a best-effort algorithm to do this. Experimental results show that our algorithm produces solutions of better quality with respect to those produced by previous algorithms, and scales nicely to large databases.
Query Rewriting using Datalog for Duplicate Resolution
Cited by 2 (2 self)
Matching Dependencies (MDs) are a recent proposal for declarative entity resolution. They are rules that specify, given the similarities satisfied by values in a database, what values should be considered duplicates, and have to be matched. On the basis of a chase-like procedure for MD enforcement, we can obtain clean (duplicate-free) instances; actually possibly several of them. The clean answers to queries (which we call the resolved answers) are invariant under the resulting class of instances. In this paper, we investigate a query rewriting approach to obtaining the resolved answers (for certain classes of queries and MDs). The rewritten queries are specified in stratified Datalog^{not,s} with aggregation. In addition to the rewriting algorithm, we discuss the semantics of the rewritten queries, and how they could be implemented by means of a DBMS.
Subsumption and Complementation as Data Fusion Operators
Cited by 1 (1 self)
The goal of data fusion is to combine several representations of one real-world object into a single, consistent representation, e.g., in data integration. A very popular operator to perform data fusion is the minimum union operator. It is defined as the outer union and the subsequent removal of subsumed tuples. Minimum union is used in other applications as well, for instance in database query optimization to rewrite outer join queries, in the semantic web community in implementing SPARQL’s OPTIONAL operator, etc. Despite its wide applicability, there are only a few efficient implementations, and until now, minimum union is not a relational database primitive. This paper fills this gap as we present implementations of subsumption that serve as a building block for minimum union. Furthermore, we consider this operator as a database primitive and show how to perform optimization of query plans in the presence of subsumption and minimum union through rule-based plan transformations. Experiments on both artificial and real-world data show that our algorithms outperform existing algorithms used for subsumption in terms of runtime and that they scale to large volumes of data. In the context of data integration, we observe that performing data fusion calls for more than subsumption and minimum union. Therefore, another contribution of this paper is the definition of the complementation and complement union operators. Intuitively, these allow merging tuples that have complementing values and thus eliminate unnecessary null values. Research was partially performed while at Hasso-Plattner-Institut.
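The definition of minimum union given in the abstract above (outer union followed by removal of subsumed tuples) can be sketched naively in Python. This is a quadratic illustration of the definition only, not one of the efficient implementations the paper presents. It assumes the usual reading of subsumption (one tuple agrees with another on all of the other's non-null values and has strictly more non-null values), represents nulls as `None`, and removes exact duplicates; the relation contents are invented.

```python
def outer_union(r1, r2):
    """Union of two relations over the combined schema, padding with None."""
    attrs = sorted({a for row in r1 + r2 for a in row})
    return [{a: row.get(a) for a in attrs} for row in r1 + r2]


def subsumes(t1, t2):
    """t1 subsumes t2 if t1 agrees with t2 on every non-null value of t2
    and t1 has strictly more non-null values (same schema assumed)."""
    agrees = all(t1[a] == v for a, v in t2.items() if v is not None)
    count1 = sum(v is not None for v in t1.values())
    count2 = sum(v is not None for v in t2.values())
    return agrees and count1 > count2


def minimum_union(r1, r2):
    """Outer union, then drop exact duplicates and subsumed tuples."""
    unique = []
    for row in outer_union(r1, r2):
        if row not in unique:
            unique.append(row)
    return [t for t in unique if not any(subsumes(u, t) for u in unique)]


# Hypothetical source relations describing the same person.
source_a = [{"name": "Ann", "city": "Berlin"}]
source_b = [{"name": "Ann", "phone": "123"}, {"name": "Ann"}]
fused = minimum_union(source_a, source_b)
```

The tuple that only carries the name is subsumed by both others and disappears, while the two remaining tuples do not subsume each other. Merging those two into a single tuple is the kind of case the complementation operator from the abstract addresses.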
Complement union for data integration
- In Proc. of NTII
, 2010
Cited by 1 (1 self)
A data integration process consists of mapping source data into a target representation (schema mapping [1]), identifying multiple representations of the same real-world object (duplicate detection [2]), and finally combining these representations
Experiments with Wikipedia Cross-Language Data Fusion
Cited by 1 (0 self)
There are currently Wikipedia editions in 264 different languages. Each of these editions contains infoboxes that provide structured data about the topic of the article in which an infobox is contained. The content of infoboxes about the same topic in different Wikipedia editions varies in completeness, coverage and quality. This paper examines the hypothesis that by extracting infobox data from multiple Wikipedia editions and by fusing the extracted data among editions it should be possible to complement data from one edition with previously missing values from other editions and to increase the overall quality of the extracted dataset by choosing property values that are most likely correct in case of inconsistencies among editions. We will present a software framework for fusing RDF datasets based on different conflict resolution strategies. We will apply the framework to fuse infobox data that has been extracted from the English, German, Italian and French editions of Wikipedia and will discuss the accuracy of the conflict resolution strategies that were used in this experiment.
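The idea in the abstract above of filling gaps and resolving inconsistencies with a pluggable conflict resolution strategy can be sketched as follows. This is not the framework the paper describes: the function names `fuse` and `majority_vote`, the dictionary layout, and the sample values are all hypothetical, and majority voting is just one simple strategy chosen for illustration.

```python
from collections import Counter


def majority_vote(candidates):
    """Assumed strategy: pick the value reported by the most editions.

    `candidates` is a list of (edition, value) pairs.
    """
    return Counter(value for _, value in candidates).most_common(1)[0][0]


def fuse(records, strategy):
    """Fuse per-edition property maps into one record.

    Missing or None values are ignored, so a value present in only one
    edition fills the gap; conflicting values go through `strategy`.
    """
    properties = {p for record in records.values() for p in record}
    fused = {}
    for prop in sorted(properties):
        candidates = [
            (edition, record[prop])
            for edition, record in records.items()
            if record.get(prop) is not None
        ]
        if candidates:
            fused[prop] = strategy(candidates)
    return fused


# Invented infobox values for one topic in three editions.
editions = {
    "en": {"population": 3500000, "mayor": None},
    "de": {"population": 3500000, "mayor": "A. Example"},
    "it": {"population": 3400000},
}
result = fuse(editions, majority_vote)
```

Because the strategy receives the edition alongside each value, other strategies (for example, preferring a particular edition) could be passed in without changing `fuse`.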

