Data integration with uncertainties
In Proc. of VLDB, 2007
Cited by 41 (2 self)
This paper reports our first set of results on managing uncertainty in data integration. We posit that data-integration systems need to handle uncertainty at three levels, and do so in a principled fashion. First, the semantic mappings between the data sources and the mediated schema may be approximate because there may be too many of them to be created and maintained or because in some domains (e.g., bioinformatics) it is not clear what the mappings should be. Second, queries to the system may be posed with keywords rather than in a structured form. Third, the data from the sources may be extracted using information extraction techniques and so may yield imprecise data. As a first step to building such a system, we introduce the concept of probabilistic schema mappings and analyze their formal foundations. We show that there are two possible semantics for such mappings: by-table semantics assumes that there exists a correct mapping but we don’t know what it is; by-tuple semantics assumes that the correct mapping may depend on the particular tuple in the source data. We present the query complexity and algorithms for answering queries in the presence of approximate schema mappings, and we describe an algorithm for efficiently computing the top-k answers to queries in such a setting.
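To make the difference between the two semantics in the abstract above concrete, here is a minimal sketch for a single-attribute projection query. It is not the paper's algorithm: the table, the column names, the mapping probabilities, and the function names are all hypothetical, and the by-tuple computation assumes each source tuple picks its mapping independently.

```python
from collections import defaultdict

# Hypothetical source table with two candidate columns for the
# mediated attribute "address".
SOURCE = [
    {"name": "Alice", "mailing_addr": "Sunnyvale", "home_addr": "Mountain View"},
    {"name": "Bob", "mailing_addr": "Sunnyvale", "home_addr": "Mountain View"},
]

# Hypothetical probabilistic mapping: (source column, probability).
# The probabilities are assumed to sum to 1.
P_MAPPING = [("mailing_addr", 0.6), ("home_addr", 0.4)]


def by_table(source, p_mapping):
    """One mapping is correct for the whole table.

    An answer's probability is the total probability of the mappings
    under which it appears in the query result.
    """
    probs = defaultdict(float)
    for column, p in p_mapping:
        for value in {row[column] for row in source}:
            probs[value] += p
    return dict(probs)


def by_tuple(source, p_mapping):
    """Each tuple chooses its mapping independently (an assumption).

    An answer appears if at least one tuple produces it, so its
    probability is 1 minus the probability that every tuple misses it.
    """
    miss = defaultdict(lambda: 1.0)
    for row in source:
        hit = defaultdict(float)
        for column, p in p_mapping:
            hit[row[column]] += p
        for value, p in hit.items():
            miss[value] *= 1.0 - p
    return {value: 1.0 - m for value, m in miss.items()}
```

On this toy input the two semantics disagree: by-table assigns "Sunnyvale" the probability of the single mapping that produces it, while by-tuple gives it a higher probability because either tuple may produce it.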
Why is schema matching tough and what can we do about it
SIGMOD Record, 2007
Cited by 10 (1 self)
In this paper we analyze the problem of schema matching, explain why it is such a “tough” problem and suggest directions for handling it effectively. In particular, we present the monotonicity principle and see how it leads to the use of top-K mappings rather than a single mapping.
Rank Aggregation for Automatic Schema Matching
2006
Cited by 6 (2 self)
Schema matching is a basic operation of data integration and several tools for automating it have been proposed and evaluated in the database community. Research in this area reveals that there is no single schema matcher that is guaranteed to succeed in finding a good mapping for all possible domains, and thus an ensemble of schema matchers should be considered. In this paper we introduce schema meta-matching, a general framework for composing an arbitrary ensemble of schema matchers, and generating a list of best-ranked schema mappings. Informally, schema meta-matching stands for computing a “consensus” ranking of alternative mappings between two schemata, given the “individual” graded rankings provided by several schema matchers. We introduce several algorithms for this problem, varying from adaptations of some standard techniques for general quantitative rank aggregation to novel techniques specific to the problem of schema matching, and to combinations of both. We provide a formal analysis of the applicability and relative performance of these algorithms, and evaluate them empirically on a set of real-world schemata.
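As a rough illustration of what a consensus ranking over graded matcher scores can look like, the sketch below averages each candidate mapping's scores across matchers and returns the top-k. Score averaging is a generic stand-in and not one of the algorithms the paper introduces; the matcher names, mapping identifiers, and scores are hypothetical, and every matcher is assumed to score every candidate.

```python
from statistics import mean

# Hypothetical graded scores: matcher name -> {candidate mapping id: score}.
SCORES = {
    "name_matcher": {"m1": 0.9, "m2": 0.6, "m3": 0.2},
    "type_matcher": {"m1": 0.5, "m2": 0.8, "m3": 0.3},
    "instance_matcher": {"m1": 0.7, "m2": 0.6, "m3": 0.9},
}


def consensus_top_k(scores_by_matcher, k):
    """Rank candidate mappings by their mean score across all matchers.

    Assumes every matcher scores every candidate. Ties are broken by
    mapping id so the result is deterministic.
    """
    matchers = list(scores_by_matcher.values())
    candidates = matchers[0].keys()
    consensus = {c: mean(m[c] for m in matchers) for c in candidates}
    ranked = sorted(consensus, key=lambda c: (-consensus[c], c))
    return ranked[:k]
```

Here no single matcher's ranking matches the consensus: "m3" is ranked first by the hypothetical instance matcher but last overall once the other two matchers are taken into account.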
Uncertainty in data integration: current approaches and open problems
Cited by 4 (0 self)
Uncertainty is an intrinsic feature of automatic and semiautomatic data integration processes. Although many solutions have been proposed to reduce uncertainty, if we do not explicitly represent it and keep it up to the end of the integration process, we risk losing relevant information and producing misleading results. Models for uncertain data can then be used to represent integrated data sources resulting from uncertain data integration processes. In this paper we present a survey of existing approaches directly dealing with uncertainty in data integration, define a generic data integration process that explicitly represents uncertainty during all its steps, and present some preliminary results and open issues in the field.
Aggregate query answering under uncertain schema mappings
Cited by 3 (0 self)
Recent interest in managing uncertainty in data integration has led to the introduction of probabilistic schema mappings and the use of probabilistic methods to answer queries across multiple databases using two semantics: by-table and by-tuple. In this paper, we develop three possible semantics for aggregate queries: the range, distribution, and expected value semantics, and show that these three semantics combine with the by-table and by-tuple semantics in six ways. We present algorithms to process COUNT, AVG, SUM, MIN, and MAX queries under all six semantics and develop results on the complexity of processing such queries under all six semantics. We show that computing COUNT is in PTIME for all six semantics and computing SUM is in PTIME for all but the by-tuple/distribution semantics. Finally, we show that AVG, MIN, and MAX are PTIME computable for all by-table semantics and for the by-tuple/range semantics. We developed a prototype implementation and experimented with both real-world traces and simulated data. We show that, as expected, naive processing of aggregates does not scale beyond small databases with a small number of mappings. The results also show that the polynomial time algorithms are scalable up to several million tuples as well as with a large number of mappings.
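The three aggregate semantics named in the abstract above can be sketched for COUNT under by-table semantics, where each candidate mapping yields exactly one count. This is an illustrative sketch only, not the paper's algorithms: the rows, column names, mapping probabilities, threshold, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical source rows; the mediated attribute "price" may map to
# either column.
ROWS = [
    {"list_price": 120, "sale_price": 90},
    {"list_price": 150, "sale_price": 110},
    {"list_price": 80, "sale_price": 70},
]

# Hypothetical probabilistic mapping: (source column, probability).
P_MAPPING = [("list_price", 0.7), ("sale_price", 0.3)]


def count_distribution(rows, p_mapping, threshold):
    """Distribution semantics: each possible COUNT value with its probability.

    Under by-table semantics one mapping holds for the whole table, so each
    mapping contributes its probability to the single count it produces.
    """
    dist = defaultdict(float)
    for column, p in p_mapping:
        count = sum(1 for row in rows if row[column] > threshold)
        dist[count] += p
    return dict(dist)


def count_range(dist):
    """Range semantics: the smallest and largest possible COUNT."""
    return (min(dist), max(dist))


def count_expected(dist):
    """Expected value semantics: the probability-weighted mean COUNT."""
    return sum(value * p for value, p in dist.items())
```

For the query "COUNT rows with price > 100", the hypothetical list-price mapping yields 2 and the sale-price mapping yields 1, so the range and expected value both follow directly from the two-point distribution.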
Analyzing and Revising Mediated Schemas to Improve Their Matchability
Cited by 2 (0 self)
Data integration systems often provide a uniform interface, called a mediated schema, to a multitude of disparate data sources. To answer user queries posed over the mediated schema, such systems employ a set of semantic matches between this schema and the local schemas of the data sources. Finding such matches is well known to be difficult. Hence much work has focused on developing semi-automatic techniques to efficiently find the matches. In this paper, however, we consider the complementary problem of improving the mediated schema, to make finding such matches easier. Specifically, a mediated schema S will typically be matched with many source schemas. Thus, can the developer of S analyze and revise S in a way that preserves S’s semantics, and yet makes it easier to match with in the future? We describe mSeer, a solution to this problem. Given a mediated schema S, mSeer first computes a matchability score that quantifies how well S can be matched against. Next, mSeer generates a matchability report that shows where the problems in matching S come from. Finally, mSeer automatically suggests changes to S (e.g., renaming an attribute, reformatting data values, etc.) that it believes will preserve the semantics of S and yet make it more amenable to matching. The creator of S is free to accept or revise the changes suggested by mSeer. We present extensive experiments over several real-world domains that demonstrate the effectiveness of our approach.
Managing Uncertainty in Schema Matcher Ensembles
Cited by 2 (1 self)
Schema matching is the task of matching between concepts describing the meaning of data in various heterogeneous, distributed data sources. With many heuristics to choose from, several tools have enabled the use of schema matcher ensembles, combining principles by which different schema matchers judge the similarity between concepts. In this work, we investigate means of estimating the uncertainty involved in schema matching and harnessing it to improve an ensemble outcome. We propose a model for schema matching, based on simple probabilistic principles. We then propose the use of machine learning in determining the best mapping and discuss its pros and cons. Finally, we provide a thorough empirical analysis, using both real-world and synthetic data, to test the proposed technique. We conclude that the proposed heuristic performs well, given an accurate modeling of uncertainty in matcher decision making.
OpenKnowledge Deliverable 3.1: Dynamic Ontology Matching: a Survey
2006
Cited by 1 (0 self)
Matching has been recognized as a plausible solution for the semantic heterogeneity problem in many traditional applications, such as schema integration, ontology integration, data warehouses, data integration, and so on. Recently, a line of new applications characterized by their dynamics has emerged, such as peer-to-peer systems, agents, and web services. In this deliverable we extend the notion of ontology matching, as it has been understood in traditional applications, to dynamic ontology matching. In particular, we examine real-world scenarios and collect the requirements they pose towards a plausible solution. We consider five general matching directions which we believe can appropriately address those requirements. These are: (i) approximate and partial ontology matching, (ii) interactive ontology matching, (iii) continuous “design-time” ontology matching, (iv) community-driven ontology matching and
Managing Uncertainty of XML Schema Matching
Despite advances in machine learning technologies, a schema matching result between two database schemas (e.g., those derived from COMA++) is likely to be imprecise. In particular, numerous instances of “possible mappings” between the schemas may be derived from the matching result. In this paper, we study the problem of managing possible mappings between two heterogeneous XML schemas. We observe that for XML schemas, their possible mappings have a high degree of overlap. We hence propose a novel data structure, called the block tree, to capture the commonalities among possible mappings. The block tree is useful for representing the possible mappings in a compact manner, and can be generated efficiently. Moreover, it supports the evaluation of probabilistic twig query (PTQ), which returns the probability of portions of an XML document that match the query pattern. For users who are interested only in answers with k-highest probabilities, we also propose the top-k PTQ, and present an efficient solution for it. The second challenge we have tackled is to efficiently generate possible mappings for a given schema matching. While this problem can be solved by existing algorithms, we show how to improve the performance of the solution by using a divide-and-conquer approach. An extensive evaluation on realistic datasets shows that our approaches significantly improve the efficiency of generating, storing, and querying possible mappings.
PruSM: A Prudent Schema Matching Approach for Web Forms
There has been a substantial increase in the number of Web data sources whose contents are hidden and can only be accessed through form interfaces. To leverage this data, several applications have emerged that aim to automate and simplify the access to these data sources, from hidden-Web crawlers and meta-searchers to Web information integration systems. A requirement shared by these applications is the ability to understand these forms, so that they can automatically fill them out. In this paper, we address a key problem in form understanding: how to match elements across distinct forms. Although this problem has been studied in the literature, existing approaches have important limitations. Notably, they only handle small form collections and assume that form elements are clean and normalized, often through manual pre-processing. When a large number of forms is automatically gathered, matching form schemata presents new challenges: data heterogeneity is compounded with the Web-scale and noise introduced by automated processes. We propose PruSM, a prudent schema matching strategy that determines matches for form elements in a prudent fashion, with the goal of minimizing error propagation. An experimental evaluation of PruSM using widely available data sets shows that the approach is effective and able to accurately match a large number of form schemata without requiring any manual pre-processing.

