Results 1 - 10
of
54
Curated databases
- PODS'08
, 2008
"... Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries – dictionaries, encyclopedias, gazetteers etc. – are now curated databases. Since it is now easy to publish databa ..."
Abstract
-
Cited by 43 (6 self)
- Add to MetaCart
Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries – dictionaries, encyclopedias, gazetteers etc. – are now curated databases. Since it is now easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. The value of curated databases lies in the organization and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of some subject area. Curated databases present a number of challenges for database research. The topics of annotation, provenance, and citation are central, because curated databases are heavily cross-referenced with, and include data from, other databases, and much of the work of a curator is annotating existing data. Evolution of structure is important because these databases often evolve from semistructured representations, and because they have to accommodate new scientific discoveries. Much of the work in these areas is in its infancy, but it is beginning to provide suggest new research for both theory and practice. We discuss some of this research and emphasize the need to find appropriate models of the processes associated with curated databases.
On the expressiveness of implicit provenance in query and update languages
- In ICDT
, 2007
"... Abstract. Information concerning the origin of data (that is, its provenance) is important in many areas, especially scientific recordkeeping. Currently, provenance information must be maintained explicitly, by added effort of the database maintainer. Since such maintenance is tedious and error-pron ..."
Abstract
-
Cited by 33 (12 self)
- Add to MetaCart
Abstract. Information concerning the origin of data (that is, its provenance) is important in many areas, especially scientific recordkeeping. Currently, provenance information must be maintained explicitly, by added effort of the database maintainer. Since such maintenance is tedious and error-prone, it is desirable to provide support for provenance in the database system itself. In order to provide such support, however, it is important to provide a clear explanation of the behavior and meaning of existing database operations, both queries and updates, with respect to provenance. In this paper we take the view that a query or update implicitly defines a provenance mapping linking components of the output to the originating components in the input. Our key result is that the proposed semantics are expressively complete relative to natural classes of queries that explicitly manipulate provenance. 1
Provenance in Databases: Past, Current, and Future
, 2007
"... The need to understand and manage provenance arises in almost every scientific application. In many cases, information about provenance constitutes the proof of correctness of results that are generated by scientific applications. It also determines the quality and amount of trust one places on the ..."
Abstract
-
Cited by 27 (0 self)
- Add to MetaCart
The need to understand and manage provenance arises in almost every scientific application. In many cases, information about provenance constitutes the proof of correctness of results that are generated by scientific applications. It also determines the quality and amount of trust one places on the results. For these reasons, the knowledge of provenance of a scientific result is typically regarded to be as important as the result itself. In this paper, we provide an overview of research in provenance in databases and discuss some future research directions. The content of this paper is largely based on the tutorial presented at SIGMOD 2007 [11].
Provenance as dependency analysis
- Proceedings of the 11th International Symposium on Database Programming Languages (DBPL 2007), number 4797 in LNCS
, 2007
"... Abstract. Provenance is information recording the source, derivation, or history of some information. Provenance tracking has been studied in a variety of settings; however, although many design points have been explored, the mathematical or semantic foundations of data provenance have received comp ..."
Abstract
-
Cited by 25 (9 self)
- Add to MetaCart
Abstract. Provenance is information recording the source, derivation, or history of some information. Provenance tracking has been studied in a variety of settings; however, although many design points have been explored, the mathematical or semantic foundations of data provenance have received comparatively little attention. In this paper, we argue that dependency analysis techniques familiar from program analysis and program slicing provide a formal foundation for forms of provenance that are intended to show how (part of) the output of a query depends on (parts of) its input. We introduce a semantic characterization of such dependency provenance, show that this form of provenance is not computable, and provide dynamic and static approximation techniques. 1
Materialized views in probabilistic databases for information exchange and query optimization
- IN PROCEEDINGS OF VLDB
, 2007
"... Views over probabilistic data contain correlations between tuples, and the current approach is to capture these correlations using explicit lineage. In this paper we propose an alternative approach to materializing probabilistic views, by giving conditions under which a view can be represented by a ..."
Abstract
-
Cited by 24 (9 self)
- Add to MetaCart
Views over probabilistic data contain correlations between tuples, and the current approach is to capture these correlations using explicit lineage. In this paper we propose an alternative approach to materializing probabilistic views, by giving conditions under which a view can be represented by a block-independent disjoint (BID) table. Not all views can be represented as BID tables and so we propose a novel partial representation that can represent all views but may not define a unique probability distribution. We then give conditions on when a query’s value on a partial representation will be uniquely defined. We apply our theory to two applications: query processing using views and information exchange using views. In query processing on probabilistic data, we can ignore the lineage and use materialized views to more efficiently answer queries. By contrast, if the view has explicit lineage, the query evaluation must reprocess the lineage to compute the query resulting in dramatically slower execution. The second application is information exchange when we do not wish to disclose the entire lineage, which otherwise may result in shipping the entire database. The paper contains several theoretical results that completely solve the problem of deciding whether a conjunctive view can be represented as a BID and whether a query on a partial representation is uniquely determined. We validate our approach experimentally showing that representable views exist in real and synthetic workloads and show over three magnitudes of improvement in query processing versus a lineage based approach.
Efficient provenance storage
- In Proceedings of the ACM International Conference on Management of Data (SIGMOD
, 2008
"... As the world is increasingly networked and digitized, the data we store has more and more frequently been chopped, baked, diced and stewed. In consequence, there is an increasing need to store and manage provenance for each data item stored in a database, describing exactly where it came from, and w ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
As the world is increasingly networked and digitized, the data we store has more and more frequently been chopped, baked, diced and stewed. In consequence, there is an increasing need to store and manage provenance for each data item stored in a database, describing exactly where it came from, and what manipulations have been applied to it. Storage of the complete provenance of each data item can become prohibitively expensive. In this paper, we identify important properties of provenance that can be used to considerably reduce the amount of storage required. We identify three different techniques: a family of factorization processes and two methods based on inheritance, to decrease the amount of storage required for provenance. We have used the techniques described in this work to significantly reduce the provenance storage costs associated with constructing MiMI [22], a warehouse of data regarding protein interactions, as well as two provenance stores, Karma [31] and PReServ [20], produced through workflow execution. In these real provenance sets, we were able to reduce the size of the provenance by up to a factor of 20. Additionally, we show that this reduced store can be queried efficiently and further that incremental changes can be made inexpensively.
bdbms: A database management system for biological data
- In CIDR
, 2007
"... Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are neede ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs with: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms. 1.
The Case of the Fake Picasso: Preventing History Forgery with Secure Provenance
"... As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intellige ..."
Abstract
-
Cited by 17 (3 self)
- Add to MetaCart
As increasing amounts of valuable information are produced and persist digitally, the ability to determine the origin of data becomes important. In science, medicine, commerce, and government, data provenance tracking is essential for rights protection, regulatory compliance, management of intelligence and medical data, and authentication of information as it flows through workplace tasks. In this paper, we show how to provide strong integrity and confidentiality assurances for data provenance information. We describe our provenance-aware system prototype that implements provenance tracking of data writes at the application layer, which makes it extremely easy to deploy. We present empirical results that show that, for typical real-life workloads, the runtime overhead of our approach to recording provenance with confidentiality and integrity guarantees ranges from 1 % – 13%. 1
Approximate Lineage for Probabilistic Databases
"... In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems s.a. Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can b ..."
Abstract
-
Cited by 17 (7 self)
- Add to MetaCart
In probabilistic databases, lineage is fundamental to both query processing and understanding the data. Current systems s.a. Trio or Mystiq use a complete approach in which the lineage for a tuple t is a Boolean formula which represents all derivations of t. In large databases lineage formulas can become huge: in one public database (the Gene Ontology) we often observed 10MB of lineage (provenance) data for a single tuple. In this paper we propose to use approximate lineage, which is a much smaller formula keeping track of only the most important derivations, which the system can use to process queries and provide explanations. We discuss in detail two specific kinds of approximate lineage: (1) a conservative approximation called sufficient lineage that records the most important derivations for each tuple, and (2) polynomial lineage, which is more aggressive and can provide higher compression ratios, and which is based on Fourier approximations of Boolean expressions. In this paper we define approximate lineage formally, describe algorithms to compute approximate lineage and prove formally their error bounds, and validate our approach experimentally on a real data set. 1.
Towards a model of provenance and user views in scientific workflows
- In Data Integration in the Life Sciences
, 2006
"... Abstract. Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final results, is derived by chaining and nesting together multiple database searches and analytical tools. In many c ..."
Abstract
-
Cited by 16 (4 self)
- Add to MetaCart
Abstract. Scientific experiments are becoming increasingly large and complex, with a commensurate increase in the amount and complexity of data generated. Data, both intermediate and final results, is derived by chaining and nesting together multiple database searches and analytical tools. In many cases, the means by which the data are produced is not known, making the data difficult to interpret and the experiment impossible to reproduce. Provenance in scientific workflows is thus of paramount importance. In this paper, we provide a formal model of provenance for scientific workflows which is general (i.e. can be used with existing workflow systems, such as Kepler, myGrid and Chimera) and sufficiently expressive to answer the provenance queries we encountered in a number of case studies. Interestingly, our model not only takes into account the chained and nested structure of scientific workflows, but allows asks for provenance at different levels of abstraction (user views). 1

