Results 1 - 10 of 475
Data Integration: A Theoretical Perspective
Symposium on Principles of Database Systems, 2002
Cited by 585 (35 self)
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents an overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
Information integration using logical views
1997
Cited by 395 (4 self)
A number of ideas concerning information-integration tools can be thought of as constructing answers to queries using views that represent the capabilities of information sources. We review the formal basis of these techniques, which are closely related to containment algorithms for conjunctive queries and/or Datalog programs. Then we compare the approaches taken by AT&T Labs' "Information Manifold" and the Stanford "Tsimmis" project in these terms.
1 Theoretical Background
Before addressing information-integration issues, let us review some of the basic ideas concerning conjunctive queries, Datalog programs, and their containment. To begin, we use the logical rule notation from [Ull88].
Example 1. The following:
p(X,Z) :- a(X,Y) & a(Y,Z).
is a rule that talks about a, an EDB predicate ("Extensional DataBase," or stored relation), and p, an IDB predicate ("Intensional DataBase," or predicate whose relation is constructed by rules). In this and several other examples, it is useful to think of a as an "arc" predicate defining a graph, while other predicates define certain structures that might exist in the graph. That is, a(X, Y) means there is an arc from node X to node Y. In this case, the rule says "p(X, Z) is true if there is an arc from node X to node Y and also an arc from Y to Z." That is, p represents paths of length 2.
In general, there is one atom, the head, on the left of the "if" sign, :-, and zero or more atoms, called subgoals, on the right side (the body). The head always has an IDB predicate; the subgoals can have IDB or EDB predicates. Thus, here p(X, Z) is the head, while a(X, Y) and a(Y, Z) are subgoals.
We assume that each variable appearing in the head also appears somewhere in the body. This "safety" requirement assures that when we use a rule, we are not left with undefined variables in the head when we try to infer a fact about the head's predicate. We also assume that atoms consist of a predicate and zero or more arguments. An argument can be either a variable or a constant. However, we exclude function symbols from arguments.
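The rule in Example 1 and the safety requirement can be illustrated with a small Python sketch. It is not taken from the paper: the function names, the representation of the EDB relation a as a set of (source, target) pairs, and the convention that variables are strings starting with an uppercase letter are all assumptions made for illustration.

```python
def paths_of_length_two(arcs):
    """Evaluate p(X, Z) :- a(X, Y) & a(Y, Z) over a set of (source, target) arcs."""
    # Index arcs by source node so the join on Y becomes a dictionary lookup.
    successors = {}
    for x, y in arcs:
        successors.setdefault(x, set()).add(y)
    return {(x, z) for x, y in arcs for z in successors.get(y, ())}


def is_variable(term):
    """Assumed convention: variables are strings starting with an uppercase letter."""
    return isinstance(term, str) and term[:1].isupper()


def is_safe(head_args, body_atoms):
    """Check safety: every variable in the head also appears in some subgoal.

    body_atoms is a list of (predicate_name, argument_tuple) pairs.
    """
    head_vars = {t for t in head_args if is_variable(t)}
    body_vars = {t for _, args in body_atoms for t in args if is_variable(t)}
    return head_vars <= body_vars


arcs = {(1, 2), (2, 3), (3, 4)}
print(paths_of_length_two(arcs))
print(is_safe(("X", "Z"), [("a", ("X", "Y")), ("a", ("Y", "Z"))]))
```

With the three arcs above, the only two-step paths run from node 1 to node 3 and from node 2 to node 4. A head such as p(X, W) would fail the safety check against the same body, because W never appears in a subgoal.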
Answering Queries Using Views: A Survey
2000
Cited by 395 (27 self)
The problem of answering queries using views is to find efficient methods of answering a query using a set of previously defined materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems. In query optimization, finding a rewriting of a query using a set of materialized views can yield a more efficient query execution plan. To support the separation of the logical and physical views of data, a storage schema can be described using views over the logical schema. As a result, finding a query execution plan that accesses the storage amounts to solving the problem of answering queries using views. Finally, the problem arises in data integration systems, where data sources can be described as precomputed views over a mediated schema. This article surveys the state of the art on the problem of answering queries using views, and synthesizes the disparate works into a coherent framework. We describe the different applications of the problem, the algorithms proposed to solve it and the relevant theoretical results.
Reconciling Schemas of Disparate Data Sources: A Machine-Learning Approach
In SIGMOD Conference, 2001
Cited by 300 (47 self)
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
Complexity of Answering Queries Using Materialized Views
In PODS, 1998
Cited by 248 (5 self)
We study the complexity of the problem of answering queries using materialized views. This problem has attracted a lot of attention recently because of its relevance in data integration. Previous work considered only conjunctive view definitions. We examine the consequences of allowing more expressive view definition languages. The languages we consider for view definitions and user queries are: conjunctive queries with inequality, positive queries, datalog, and first-order logic. We show that the complexity of the problem depends on whether views are assumed to store all the tuples that satisfy the view definition, or only a subset of it. Finally, we apply the results to the view consistency and view self-maintainability problems which arise in data warehousing.
1 Introduction
The notion of materialized view is essential in databases [34] and is attracting more and more attention with the popularity of data warehouses [28]. The problem of answering queries using materialized views [24...
Optimizing Queries across Diverse Data Sources
In Proc. of VLDB, 1997
Cited by 241 (15 self)
Businesses today need to interrelate data stored in diverse systems with differing capabilities, ideally via a single high-level query interface. We present the design of a query optimizer for Garlic [C+95], a middleware system designed to integrate data from a broad range of data sources with very different query capabilities. Garlic's optimizer extends the rule-based approach of [Loh88] to work in a heterogeneous environment, by defining generic rules for the middleware and using wrapper-provided rules to encapsulate the capabilities of each data source. This approach offers great advantages in terms of plan quality, extensibility to new sources, incremental implementation of rules for new sources, and the ability to express the capabilities of a diverse set of sources. We describe the design and implementation of this optimizer, and illustrate its actions through an example.
Query Answering in Inconsistent Databases
2003
Cited by 227 (57 self)
In this chapter, we summarize the research on querying inconsistent databases we have been conducting over the last five years. The formal framework we have used is based on two concepts: repair and consistent query answer. We describe different approaches to the issue of computing consistent query answers: query transformation, logic programming, inference in annotated logics, and specialized algorithms. We also characterize the computational complexity of this problem. Finally, we discuss related research in artificial intelligence, databases, and logic programming.
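The two concepts named in this abstract, repair and consistent query answer, can be made concrete with a brute-force Python sketch. It is only an illustration under stated assumptions and is not one of the chapter's methods (query transformation, logic programming, annotated logics, or specialized algorithms). It assumes a single key constraint, assumes a repair keeps exactly one row from each group of rows sharing a key value, and uses hypothetical names (`repairs`, `consistent_answers`, the `employees` data).

```python
from itertools import product


def repairs(rows, key):
    """Enumerate repairs of `rows` when `key(row)` is required to be unique.

    Assumption: a repair keeps exactly one row from each group of rows
    that share a key value, so it is a maximal consistent subset.
    """
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    return [set(choice) for choice in product(*groups.values())]


def consistent_answers(rows, key, query):
    """Return the answers to `query` that are produced in every repair."""
    answer_sets = [set(query(repair)) for repair in repairs(rows, key)]
    return set.intersection(*answer_sets)


# Hypothetical (name, salary) relation where name should be a key.
employees = {("alice", 50), ("alice", 60), ("bob", 40)}


def by_name(row):
    return row[0]


print(consistent_answers(employees, by_name, lambda db: db))
print(consistent_answers(employees, by_name, lambda db: {n for n, _ in db}))
```

Here the two conflicting rows for alice yield two repairs. Only bob's row survives in both, so it is the only consistent answer to the query returning whole rows, while both names are consistent answers to the projection on name. The number of repairs grows exponentially with the number of conflicting groups, so this enumeration is useful only for small examples.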
Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Data Sources
1997
Cited by 200 (2 self)
Garlic is a middleware system that provides an integrated view of a variety of legacy data sources, without changing how or where data is stored. In this paper, we describe our architecture for wrappers, key components of Garlic that encapsulate data sources and mediate between them and the middleware. Garlic wrappers model legacy data as objects, participate in query planning, and provide standard interfaces for method invocation and query execution. To date, we have built wrappers for 10 data sources. Our experience shows that Garlic wrappers can be written quickly and that our architecture is flexible enough to accommodate data sources with a variety of data models and a broad range of traditional and non-traditional query processing capabilities.
Integration of Heterogeneous Databases Without Common Domains Using Queries Based on Textual Similarity
1998
Cited by 193 (13 self)
Most databases contain "name constants" like course numbers, personal names, and place names that correspond to entities in the real world. Previous work in integration of heterogeneous databases has assumed that local name constants can be mapped into an appropriate global domain by normalization. However, in many cases, this assumption does not hold; determining if two name constants should be considered identical can require detailed knowledge of the world, the purpose of the user's query, or both. In this paper, we reject the assumption that global domains can be easily constructed, and assume instead that the names are given in natural language text. We then propose a logic called WHIRL which reasons explicitly about the similarity of local names, as measured using the vector-space model commonly adopted in statistical information retrieval. We describe an efficient implementation of WHIRL and evaluate it experimentally on data extracted from the World Wide Web. We show that WHIR...
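The vector-space similarity mentioned in this abstract can be sketched in Python as cosine similarity over term-weight vectors. This is not WHIRL's implementation: the whitespace tokenization, the weighting tf * log(1 + N/df), and the function names are assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def tfidf_vectors(names):
    """Build one sparse term-weight vector (a dict) per name.

    Assumed weighting: term frequency times log(1 + N / document frequency).
    """
    docs = [Counter(name.lower().split()) for name in names]
    n = len(docs)
    # Each doc contributes each of its distinct terms once to the count.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: tf * math.log(1 + n / df[term]) for term, tf in doc.items()}
        for doc in docs
    ]


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


names = ["Stanford University", "Stanford Univ", "MIT"]
vectors = tfidf_vectors(names)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The first two names share the token "stanford" and so receive a positive similarity, while the first and third share no tokens and score zero. A similarity join in this style would rank candidate pairs by such scores instead of requiring exact equality of name constants.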

