Provenance management in curated databases
In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006
Cited by 66 (16 self)
Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database, in order to record the user’s actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naïve approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
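As a rough illustration of the idea in this abstract (recording each copy action so that it can be queried later), the following minimal sketch logs one record per copy into a curated store. The names `CuratedDB`, `CopyRecord`, `copy_from`, and `provenance` are hypothetical and are not taken from the paper; the paper's actual implementation and its optimizations are not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class CopyRecord:
    """One logged copy action (hypothetical record layout)."""
    txn: int       # sequence number of the copy action
    source: str    # where the value came from, e.g. "SourceDB/P12345"
    target: str    # key it was written to in the curated database


@dataclass
class CuratedDB:
    """A toy curated database that logs every copy it performs."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    txn: int = 0

    def copy_from(self, source_name, source_db, src_key, dst_key):
        # Perform the copy and record where the value came from.
        self.txn += 1
        self.data[dst_key] = source_db[src_key]
        self.log.append(CopyRecord(self.txn, f"{source_name}/{src_key}", dst_key))

    def provenance(self, dst_key):
        # Query the log: every copy action that wrote to dst_key, oldest first.
        return [r for r in self.log if r.target == dst_key]


source = {"P12345": "kinase"}
curated = CuratedDB()
curated.copy_from("SourceDB", source, "P12345", "entry1/function")
```

Logging one record per action is the naive strategy; the abstract notes that its overhead is fairly high and can be reduced with simple optimizations, which this sketch does not attempt.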
Distance makes the types grow stronger: A calculus for differential privacy
In ICFP, 2010
Cited by 7 (2 self)
We want assurances that sensitive information will not be disclosed when aggregate data derived from a database is published. Differential privacy offers a strong statistical guarantee that the effect of the presence of any individual in a database will be negligible, even when an adversary has auxiliary knowledge. Much of the prior work in this area consists of proving algorithms to be differentially private one at a time; we propose to streamline this process with a functional language whose type system automatically guarantees differential privacy, allowing the programmer to write complex privacy-safe query programs in a flexible and compositional way. The key novelty is the way our type system captures function sensitivity, a measure of how much a function can magnify the distance between similar inputs: well-typed programs not only can’t go wrong, they can’t go too far on nearby inputs. Moreover, by introducing a monad for random computations, we can show that the established definition of differential privacy falls out naturally as a special case of this soundness principle. We develop examples including known differentially private algorithms, privacy-aware variants of standard functional programming idioms, and compositionality principles for differential privacy.
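To make the notion of function sensitivity concrete, here is a small runtime sketch, not the paper's type system. A counting query changes by at most 1 when one row is added or removed (sensitivity 1), so adding Laplace noise with scale 1/epsilon is the standard way to make it epsilon-differentially private. The helper names `count_over`, `laplace_noise`, and `noisy_count` are assumptions introduced for illustration.

```python
import random


def count_over(db, predicate):
    """Counting query: number of rows satisfying the predicate."""
    return sum(1 for row in db if predicate(row))


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def noisy_count(db, predicate, epsilon, rng=None):
    """Counting query plus Laplace noise calibrated to its sensitivity."""
    rng = rng or random.Random()
    sensitivity = 1  # adding or removing one row changes a count by at most 1
    return count_over(db, predicate) + laplace_noise(sensitivity / epsilon, rng)


ages = [10, 20, 30, 40]
released = noisy_count(ages, lambda age: age > 15, epsilon=0.5)
```

In the calculus described by the abstract, the sensitivity bound would be checked statically by the type system instead of being asserted in a comment as it is here.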
Data Provenance: A Categorization of Existing Approaches
2007
Cited by 4 (2 self)
In many application areas like e-science and data-warehousing, detailed information about the origin of data is required. This kind of information is often referred to as data provenance or data lineage. The provenance of a data item includes information about the processes and source data items that led to its creation and current representation. The diversity of data representation models and application domains has led to a number of more or less formal definitions of provenance. Most of them are limited to a special application domain, data representation model or data processing facility. Not surprisingly, the associated implementations are also restricted to some application domain and depend on a special data model. In this paper we give a survey of data provenance models and prototypes, present a general categorization scheme for provenance models and use this categorization scheme to study the properties of the existing approaches. This categorization enables us to distinguish between different kinds of provenance information and could lead to a better understanding of provenance in general. Besides the categorization of provenance types, it is important to include the storage, transformation and query requirements for the different kinds of provenance information and application domains in our considerations. The analysis of existing approaches will assist us in revealing open research problems in the area of data provenance.
Towards a Secure and Efficient System for End-to-End Provenance
Appears in the Proceedings of the Second USENIX Workshop on Theory and Practice of Provenance (TaPP 2010), 2010
Cited by 1 (0 self)
Work on the End-to-End Provenance System (EEPS) began in the late summer of 2009. The EEPS effort seeks to explore the three central questions in provenance systems: (1) “Where and how do I design secure host-level provenance collecting instruments (called provenance monitors)?”; (2) “How do I extend completeness and accuracy guarantees to distributed systems and computations?”; and (3) “What are the costs associated with provenance collection?” This position paper discusses our initial exploration into these issues and posits several challenges to the realization of the EEPS vision.
Graduating committee:
2008
Database systems are widely used in today’s world. Almost every information system contains one or more databases. From a traditional perspective, databases are used to store precise values about objects in the ‘real world’. However, much information is uncertain or imprecise. Consider, for example, sensor applications. Sensors produce uncertain and imprecise data since readings of sensors are inherently imprecise and uncertain. Current database management systems are not able to store, manipulate or query continuous uncertain data except through user-defined attributes. However, this approach delegates the responsibility of managing the uncertainty associated with the data to the end-user. In many situations, the uncertainty associated with the data is distributed continuously, so the data can be represented in terms of a continuous probability distribution. In this thesis, we present an extension to an existing probabilistic data model, resulting in a data model which is capable of storing continuous uncertain data in XML documents. We give a sound semantical foundation to this data model. The probabilistic XML data model is based on the probabilistic tree. In the probabilistic tree, elements and subtrees can be associated with
Refining Information Extraction Rules using Data Provenance
This paper is posted at ScholarlyCommons.
2007
A lens is a bidirectional program. When read from left to right, it denotes an ordinary function that maps inputs to outputs. When read from right to left, it denotes an “update translator” that takes an input together with an updated output and produces a new input that reflects the update. Many variants of this idea have been explored in the literature, but none deal fully with ordered data. If, for example, an update changes the order of a list in the output, the items in the output list and the chunks of the input that generated them can be misaligned, leading to lost or corrupted data. We attack this problem in the context of bidirectional transformations over strings, the primordial ordered data type. We first propose a collection of bidirectional string lens combinators, based on familiar operations on regular transducers (union, concatenation, Kleene-star) and with a type system based on regular expressions. We then design a new semantic space of dictionary lenses, enriching the lenses of Foster et al. (2007b) with support for two additional combinators for marking “reorderable chunks” and their keys. To demonstrate the effectiveness of these primitives, we describe the design and implementation of Boomerang, a full-blown bidirectional programming language with dictionary lenses at its core. We have used Boomerang to build transformers for complex real-world data formats including the SwissProt genomic database. We formalize the essential property of resourcefulness, the correct use of keys to associate chunks in the input
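The two reading directions of a lens described above can be sketched as a pair of functions: `get` maps an input to an output, and `put` takes an updated output together with the original input and produces a new input. The example below is a basic string lens over a `key=value` line, written in Python instead of Boomerang; it is not one of the dictionary lenses the abstract introduces, and the names `Lens` and `value_lens` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Lens:
    """A basic lens: a forward function and an update translator."""
    get: Callable[[str], str]        # input -> output
    put: Callable[[str, str], str]   # (updated output, original input) -> new input


def _get_value(source):
    # Forward direction: expose only the text after the first "=".
    return source.split("=", 1)[1]


def _put_value(view, source):
    # Backward direction: keep the key from the original input and
    # replace the value with the updated output.
    key = source.split("=", 1)[0]
    return f"{key}={view}"


value_lens = Lens(get=_get_value, put=_put_value)

original = "name=Ada"
updated = value_lens.put("Grace", original)
```

Such a lens is usually expected to satisfy two round-trip laws: putting back an unchanged output returns the original input, and getting from the result of a put returns the output that was put. This sketch handles a single value, so it does not exhibit the reordering and alignment problem that motivates dictionary lenses.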

