Curated databases
- PODS'08, 2008
Cited by 43 (6 self)
Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries – dictionaries, encyclopedias, gazetteers etc. – are now curated databases. Since it is now easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. The value of curated databases lies in the organization and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of some subject area. Curated databases present a number of challenges for database research. The topics of annotation, provenance, and citation are central, because curated databases are heavily cross-referenced with, and include data from, other databases, and much of the work of a curator is annotating existing data. Evolution of structure is important because these databases often evolve from semistructured representations, and because they have to accommodate new scientific discoveries. Much of the work in these areas is in its infancy, but it is beginning to suggest new research for both theory and practice. We discuss some of this research and emphasize the need to find appropriate models of the processes associated with curated databases.
Provenance as dependency analysis
- Proceedings of the 11th International Symposium on Database Programming Languages (DBPL 2007), number 4797 in LNCS, 2007
Cited by 25 (9 self)
Provenance is information recording the source, derivation, or history of some information. Provenance tracking has been studied in a variety of settings; however, although many design points have been explored, the mathematical or semantic foundations of data provenance have received comparatively little attention. In this paper, we argue that dependency analysis techniques familiar from program analysis and program slicing provide a formal foundation for forms of provenance that are intended to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic characterization of such dependency provenance, show that this form of provenance is not computable, and provide dynamic and static approximation techniques.
Containment of Conjunctive Queries on Annotated Relations
Cited by 14 (7 self)
We study containment and equivalence of (unions of) conjunctive queries on relations annotated with elements of a commutative semiring. Such relations and the semantics of positive relational queries on them were introduced in a recent paper as a generalization of set semantics, bag semantics, incomplete databases, and databases annotated with various kinds of provenance information. We obtain positive decidability results and complexity characterizations for databases with lineage, why-provenance, and provenance polynomial annotations, for both conjunctive queries and unions of conjunctive queries. At least one of these results is surprising given that provenance polynomial annotations seem “more expressive” than bag semantics and under the latter, containment of unions of conjunctive queries is known to be undecidable. The decision procedures rely on interesting variations on the notion of containment mappings. We also show that for any positive semiring (a very large class) and conjunctive queries without self-joins, equivalence is the same as isomorphism.
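To make the idea of semiring-annotated relations concrete, here is a minimal Python sketch (not taken from the paper) that evaluates one conjunctive query, Q(x, z) :- R(x, y), S(y, z), over relations whose tuples carry annotations. The names `Semiring`, `compose`, `BOOL`, `NAT`, and the sample relations are all hypothetical, introduced only for illustration. Joint use of tuples multiplies annotations and alternative derivations add them, so the same code yields set semantics under the Boolean semiring and bag semantics under the natural numbers.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring given by its two constants and two operations."""
    zero: Any
    one: Any
    add: Callable[[Any, Any], Any]
    mul: Callable[[Any, Any], Any]


# Boolean semiring: annotations mean "present / absent" (set semantics).
BOOL = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
# Natural numbers: annotations are multiplicities (bag semantics).
NAT = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)

# An annotated binary relation: tuple -> annotation (illustrative encoding).
Relation = Dict[Tuple[str, str], Any]


def compose(r: Relation, s: Relation, k: Semiring) -> Relation:
    """Evaluate Q(x, z) :- R(x, y), S(y, z) over annotations from semiring k."""
    out: Relation = {}
    for (x, y), a in r.items():
        for (y2, z), b in s.items():
            if y == y2:
                # Joining multiplies annotations; projecting away y adds them.
                out[(x, z)] = k.add(out.get((x, z), k.zero), k.mul(a, b))
    # Tuples annotated with zero are considered absent.
    return {t: v for t, v in out.items() if v != k.zero}


# Sample data (made up for this sketch).
R_BAG = {("a", "b"): 2, ("a", "c"): 1}
S_BAG = {("b", "d"): 3, ("c", "d"): 1}
R_SET = {t: True for t in R_BAG}
S_SET = {t: True for t in S_BAG}
```

With this data, `compose(R_BAG, S_BAG, NAT)` annotates `("a", "d")` with 2*3 + 1*1 = 7, while `compose(R_SET, S_SET, BOOL)` only records that the tuple is present.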
Querying data provenance
- In SIGMOD, 2010
Cited by 11 (5 self)
Many advanced data management operations (e.g., incremental maintenance,
Updatable Security Views
- 2009
Cited by 10 (2 self)
Security views are a flexible and effective means of controlling access to confidential information. Rather than allowing untrusted users to access the source data directly, they can instead be provided with a restricted view, from which all confidential information has been removed. The program that generates the view effectively embodies a confidentiality policy for the underlying source data. However, this approach has a significant drawback: it prevents users from updating the data in the view. To address the “view update problem” in general, a number of bidirectional languages have been proposed. Programs in these languages, often called lenses, can be run in two directions: read from left to right, they map sources to views; read from right to left, they map updated views back to updated sources. However, existing bidirectional languages do not deal adequately with security issues. In particular, they do not provide a way to ensure the integrity of data in the source as it is manipulated by untrusted users of the view. We propose a novel framework of secure lenses that addresses these shortcomings. We first enrich the types of basic lenses with equivalence relations capturing notions of confidentiality and integrity and formulate the essential security conditions on source data as non-interference properties. We then offer a concrete instantiation of our framework in the domain of string transformations, developing concrete syntax for security-annotated regular expressions as well as a collection of bidirectional string combinators with annotated expressions as their types.
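The two directions of a lens can be illustrated with a small Python sketch. This is only a basic lens over dictionary records, not the secure lenses or string combinators the paper proposes; the field name `ssn` and the functions `get` and `put` are assumptions made for the example. `get` produces a view with the confidential field removed, and `put` maps an updated view back to an updated source by restoring that field from the old source.

```python
from typing import Dict

# Hypothetical confidential field that must not appear in the view.
HIDDEN = "ssn"


def get(source: Dict[str, str]) -> Dict[str, str]:
    """Left to right: map a source record to a view without the hidden field."""
    return {k: v for k, v in source.items() if k != HIDDEN}


def put(view: Dict[str, str], source: Dict[str, str]) -> Dict[str, str]:
    """Right to left: merge an updated view with the old source.

    Visible fields come from the view; the hidden field is copied
    unchanged from the old source.
    """
    updated = dict(view)
    if HIDDEN in source:
        updated[HIDDEN] = source[HIDDEN]
    return updated
```

For any source `s` and any view `v` that does not itself contain the hidden field, this sketch satisfies the usual round-trip laws for well-behaved lenses: `put(get(s), s) == s` and `get(put(v, s)) == v`. It says nothing about integrity or non-interference, which are the paper's actual subject.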
Provenance for aggregate queries
- In PODS, 2011. Available at http://arxiv.org/abs/1101.1110
Cited by 7 (3 self)
doi:10.1145/1989284.1989302 © ACM, 2011.
Provenance in ORCHESTRA
Cited by 1 (0 self)
Sharing structured data today requires agreeing on a standard schema, then mapping and cleaning all of the data to achieve a single queriable mediated instance. However, for settings in which structured data is collaboratively authored by a large community, such as in the sciences, there is seldom consensus about how the data should be represented, what is correct, and which sources are authoritative. Moreover, such data is dynamic: it is frequently updated, cleaned, and annotated. The ORCHESTRA collaborative data sharing system develops a new architecture and consistency model for such settings, based on the needs of data sharing in the life sciences. A key aspect of ORCHESTRA’s design is that the provenance of data is recorded at every step. In this paper we describe ORCHESTRA’s provenance model and architecture, emphasizing its integral use of provenance in enforcing trust policies and translating updates efficiently.
Collaborative Data Sharing with Mappings and Provenance
- 2009
Cited by 1 (0 self)
A key challenge in science today involves integrating data from databases managed by different collaborating scientists. In this dissertation, we develop the foundations and applications of collaborative data sharing systems (CDSSs), which address this challenge. A CDSS allows collaborators to define loose confederations of heterogeneous databases, relating them through schema mappings that establish how data should flow from one site to the next. In addition to simply propagating data along the mappings, it is critical to record data provenance (annotations describing where and how data originated) and to support policies allowing scientists to specify whose data they trust, and when. Since a large data sharing confederation is certain to evolve over time, the CDSS must also efficiently handle incremental changes to data, schemas, and mappings. We focus in this dissertation on the formal foundations of CDSSs, as well as practical issues of its implementation in a prototype CDSS called Orchestra. We propose a novel model of data provenance appropriate for CDSSs, based on a framework of semiring-annotated relations. This framework elegantly generalizes a number of other important database semantics involving annotated relations, including ranked results, prior provenance models, and probabilistic databases. We describe the design and implementation of the Orchestra prototype, which supports update
Annotated XML: Queries and Provenance
We present a formal framework for capturing the provenance of data appearing in XQuery views of XML. Building on previous work on relations and their (positive) query languages, we decorate unordered XML with annotations from commutative semirings and show that these annotations suffice for a large positive fragment of XQuery applied to this data. In addition to tracking provenance metadata, the framework can be used to represent and process XML with repetitions, incomplete XML, and probabilistic XML, and provides a basis for enforcing access control policies in security applications. Each of these applications builds on our semantics for XQuery, which we present in several steps: we generalize the semantics of the Nested Relational Calculus (NRC) to handle semiring-annotated complex values, we extend it with a recursive type and structural recursion operator for trees, and we define a semantics for XQuery on annotated XML by translation into this calculus.
Research Statement
- 2008
My research strives to develop flexible, efficient, and easy-to-use platforms for sharing and integrating large, heterogeneous collections of data, based on sound formal foundations. My dissertation on collaborative data sharing takes an important step toward this goal, and develops new data models, query semantics, optimization strategies, and implementation techniques needed for the next generation of data integration platforms. As a member of the Orchestra project team at Penn, I built and evaluated a prototype collaborative data sharing system (CDSS), and developed solutions to several key challenges, such as recording and maintaining data provenance and propagating changes efficiently, that arise in building such a system. CDSS targets application domains such as bioinformatics, where there exists a plethora of different databases, each providing a different perspective on a collection of organisms, genes, proteins, diseases, and so on. The data in these databases is highly interrelated, but the databases have proved very difficult to integrate using existing technologies. A CDSS allows collections of autonomous peer data sources related by declarative schema mappings to exchange data and updates, while retaining local control and enforcing provenance-based trust policies. In the rest of this research statement, I will describe my research approach and thesis work, and

