Results 1 - 10
of
286
Data Integration: A Theoretical Perspective
- Symposium on Principles of Database Systems
, 2002
"... Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interestin ..."
Abstract
-
Cited by 585 (35 self)
- Add to MetaCart
Data integration is the problem of combining data residing at different sources, and providing the user with a unified view of these data. The problem of designing data integration systems is important in current real world applications, and is characterized by a number of issues that are interesting from a theoretical point of view. This document presents on overview of the material to be presented in a tutorial on data integration. The tutorial is focused on some of the theoretical issues that are relevant for data integration. Special attention will be devoted to the following aspects: modeling a data integration application, processing queries in data integration, dealing with inconsistent data sources, and reasoning on queries.
Querying Semi-Structured Data
, 1997
"... The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including W ..."
Abstract
-
Cited by 467 (19 self)
- Add to MetaCart
The amount of data of all kinds available electronically has increased dramatically in recent years. The data resides in different forms, ranging from unstructured data in file systems to highly structured in relational database systems. Data is accessible through a variety of interfaces including Web browsers, database query languages, application-specific interfaces, or data exchange formats. Some of this data is raw data, e.g. images or sound. Some of it has structure even if the structure is often implicit, and not as rigid or regular as that found in standard database systems. Sometimes the structure exists but has to be extracted from the data. Sometimes also it exists but we prefer to ignore it for certain purposes such as browsing. We call here semi-structured data this data that is (from a particular viewpoint) neither raw data nor strictly typed, i.e., not table-oriented as in a relational model or sorted-graph as in object databases...
DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases
, 1997
"... In semistructured databases there is no schema fixed in advance. To provide the benefits of a schema in such environments, we introduce DataGuides: concise and accurate structural summaries of semistructured databases. DataGuides serve as dynamic schemas, generated from the database; they are ..."
Abstract
-
Cited by 459 (14 self)
- Add to MetaCart
In semistructured databases there is no schema fixed in advance. To provide the benefits of a schema in such environments, we introduce DataGuides: concise and accurate structural summaries of semistructured databases. DataGuides serve as dynamic schemas, generated from the database; they are useful for browsing database structure, formulating queries, storing information such as statistics and sample values, and enabling query optimization. This paper presents the theoretical foundations of DataGuides along with an algorithm for their creation and an overview of incremental maintenance. We provide performance results based on our implementation of DataGuides in the Lore DBMS for semistructured data. We also describe the use of DataGuides in Lore, both in the user interface to enable structure browsing and query formulation, and as a means of guiding the query processor and optimizing query execution.
Answering Queries Using Views: A Survey
, 2000
"... The problem of answering queries using views is to find efficient methods of answering a query using a set of previously defined materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a w ..."
Abstract
-
Cited by 395 (27 self)
- Add to MetaCart
The problem of answering queries using views is to find efficient methods of answering a query using a set of previously defined materialized views over the database, rather than accessing the database relations. The problem has recently received significant attention because of its relevance to a wide variety of data management problems. In query optimization, finding a rewriting of a query using a set of materialized views can yield a more efficient query execution plan. To support the separation of the logical and physical views of data, a storage schema can be described using views over the logical schema. As a result, finding a query execution plan that accesses the storage amounts to solving the problem of answering queries using views. Finally, the problem arises in data integration systems, where data sources can be described as precomputed views over a mediated schema. This article surveys the state of the art on the problem of answering queries using views, and synthesizes the disparate works into a coherent framework. We describe the different applications of the problem, the algorithms proposed to solve it and the relevant theoretical results.
Relational Databases for Querying XML Documents: Limitations and Opportunities
, 1999
"... XML is fast emerging as the dominant standard for representing data in the World Wide Web. Sophisticated query engines that allow users to effectively tap the data stored in XML documents will be crucial to exploiting the full power of XML. While there has been a great deal of activity recently prop ..."
Abstract
-
Cited by 359 (10 self)
- Add to MetaCart
XML is fast emerging as the dominant standard for representing data in the World Wide Web. Sophisticated query engines that allow users to effectively tap the data stored in XML documents will be crucial to exploiting the full power of XML. While there has been a great deal of activity recently proposing new semistructured data models and query languages for this purpose, this paper explores the more conservative approach of using traditional relational database engines for processing XML documents conforming to Document Type Descriptors (DTDs). To this end, we have developed algorithms and implemented a prototype system that converts XML documents to relational tuples, translates semi-structured queries over XML documents to SQL queries over tables, and converts the results to XML. We have qualitatively evaluated this approach using several real DTDs drawn from diverse domains. It turns out that the relational approach can handle most (but not all) of the semantics of semi-structured queries over XML data, but is likely to be effective only in some cases. We identify the causes for these limitations and propose certain extensions to the relational
A Query Language for XML
, 1998
"... An important application of XML is the interchange of electronic data (EDI) between multiple data sources on the Web. As XML data proliferates on the Web, applications will need to integrate and aggregate data from multiple source and clean and transform data to facilitate exchange. Data extraction, ..."
Abstract
-
Cited by 301 (19 self)
- Add to MetaCart
An important application of XML is the interchange of electronic data (EDI) between multiple data sources on the Web. As XML data proliferates on the Web, applications will need to integrate and aggregate data from multiple source and clean and transform data to facilitate exchange. Data extraction, conversion, transformation, and integration are all well-understood database problems, and their solutions rely on a query language. We present a query language for XML, called XML-QL, which we argue is suitable for performing the above tasks. XML-QL is a declarative, "relational complete" query language and is simple enough that it can be optimized. XML-QL can extract data from existing XML documents and construct new XML documents. Keywords: XML, query languages, electronic-data interchange (EDI) 1. Introduction The goal of XML is to provide many of SGML's benefits not available in HTML and to provide them in a language that is easier to learn and use than complete SGML. These benefits...
Efficient Filtering of XML Documents for Selective Dissemination of Information
, 2000
"... Information Dissemination applications are gaining increasing popularity due to dramatic improvements in communications bandwidth and ubiquity. The sheer volume of data available necessitates the use of selective approaches to dissemination in order to avoid overwhelming users with unnecessaryi ..."
Abstract
-
Cited by 272 (13 self)
- Add to MetaCart
Information Dissemination applications are gaining increasing popularity due to dramatic improvements in communications bandwidth and ubiquity. The sheer volume of data available necessitates the use of selective approaches to dissemination in order to avoid overwhelming users with unnecessaryinformation. Existing mechanisms for selective dissemination typically rely on simple keyword matching or "bag of words" information retrieval techniques. The advent of XML as a standard for information exchangeand the development of query languages for XML data enables the development of more sophisticated filtering mechanisms that take structure information into account. We have developed several index organizations and search algorithms for performing efficient filtering of XML documents for large-scale information dissemination systems. In this paper we describe these techniques and examine their performance across a range of document, workload, and scale scenarios. 1
Index Structures for Path Expressions
, 1997
"... In recent years there has been an increased interest in managing data which does not conform to traditional data models, like the relational or object oriented model. The reasons for this non-conformance are diverse. One one hand, data may not conform to such models at the physical level: it may be ..."
Abstract
-
Cited by 255 (7 self)
- Add to MetaCart
In recent years there has been an increased interest in managing data which does not conform to traditional data models, like the relational or object oriented model. The reasons for this non-conformance are diverse. One one hand, data may not conform to such models at the physical level: it may be stored in data exchange formats, fetched from the Internet, or stored as structured les. One the other hand, it may not conform at the logical level: data may have missing attributes, some attributes may be of di erent types in di erent data items, there may be heterogeneous collections, or the data may be simply specified by a schema which is too complex or changes too often to be described easily as a traditional schema. The term semistructured data has been used to refer to such data. The data model proposed for this kind of data consists of an edge-labeled graph, in which nodes correspond to objects and edges to attributes or values. Figure 1 illustrates a semistructured database providing information about a city. Relational databases are traditionally queried with associative queries, retrieving tuples based on the value of some attributes. To answer such queries efciently, database management systems support indexes for translating attribute values into tuple ids (e.g. B-trees or hash tables). In object-oriented databases, path queries replace the simpler associative queries. Several data structures have been proposed for answering path queries e ciently: e.g., access support relations 14] and path indexes 4]. In the case of semistructured data, queries are even more complex, because they may contain generalized path expressions 1, 7, 8, 16]. The additional exibility is needed in order to traverse data whose structure is irregular, or partially unknown to the user.
Why and Where: A Characterization of Data Provenance
- In ICDT
, 2001
"... With the proliferation of database views and curated databases, the issue of data provenance # where a piece of data came from and the process by which it arrived in the database # is becoming increasingly important, especially in scienti#c databases where understanding provenance is crucial to ..."
Abstract
-
Cited by 254 (18 self)
- Add to MetaCart
With the proliferation of database views and curated databases, the issue of data provenance # where a piece of data came from and the process by which it arrived in the database # is becoming increasingly important, especially in scienti#c databases where understanding provenance is crucial to the accuracy and currency of data. In this paper we describe an approach to computing provenance when the data of interest has been created by a database query.We adopt a syntactic approach and present results for a general data model that applies to relational databases as well as to hierarchical data such as XML. A novel aspect of our work is a distinction between #why" provenance #refers to the source data that had some in#uence on the existence of the data# and #where" provenance #refers to the location#s# in the source databases from which the data was extracted#.
Semistructured data
, 1997
"... In semistructured data, the information that is normally as-sociated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semi-structured data there is no separate schema, in others it exists but only places loose constraints on the data. Semi-str ..."
Abstract
-
Cited by 230 (0 self)
- Add to MetaCart
In semistructured data, the information that is normally as-sociated with a schema is contained within the data, which is sometimes called “self-describing”. In some forms of semi-structured data there is no separate schema, in others it exists but only places loose constraints on the data. Semi-structured data has recently emerged as an important topic of study for a variety of reasons. First, there are data sources such as the Web, which we would like to treat as databases but which cannot be constrained by a schema. Second, it may be desirable to have an extremely flexible format for data exchange between disparate databases. Third, even when dealing with structured data, it may be helpful to view it. as semistructured for the purposes of browsing. This tu-torial will cover a number of issues surrounding such data: finding a concise formulation, building a sufficiently expres-sive language for querying and transformation, and opti-mizat,ion problems. 1 The motivation The topic of semistructured data (also called unstructured data) is relatively recent, and a tutorial on the topic may well be premature. It represents, if anything, the conver-gence of a number of lines of thinking about new ways to represent and query data that do not completely fit with conventional data models. The purpose of this tutorial is to to describe this motivation and to suggest areas in which further research may be fruitful. For a similar exposition, the reader is referred to Serge Abiteboul’s recent survey pa-per PI. The slides for this tutorial will be made available from a section of the Penn database home page

