Results 1 - 10 of 70
Chimera: A Virtual Data System For Representing, Querying, and Automating Data Derivation
- In Proceedings of the 14th Conference on Scientific and Statistical Database Management, 2002
Cited by 187 (19 self)
Much scientific data is not obtained from measurements but rather derived from other data by the application of computational procedures. We hypothesize that explicit representation of these procedures can enable documentation of data provenance, discovery of available methods, and on-demand data generation (so-called "virtual data"). To explore this idea, we have developed the Chimera virtual data system, which combines a virtual data catalog, for representing data derivation procedures and derived data, with a virtual data language interpreter that translates user requests into data definition and query operations on the database. We couple the Chimera system with distributed "Data Grid" services to enable on-demand execution of computation schedules constructed from database queries. We have applied this system to two challenge problems: the reconstruction of simulated collision event data from a high-energy physics experiment, and the search of digital sky survey data for galactic clusters, with promising results.
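The "virtual data" idea in this abstract can be illustrated with a minimal sketch. It is not Chimera's catalog schema or its virtual data language; the class `VirtualDataCatalog`, its methods, and the dataset names are all hypothetical, chosen only to show how recorded derivations allow a dataset to be generated on demand and its provenance to be queried.

```python
from typing import Callable, Dict, List, Tuple


class VirtualDataCatalog:
    """Toy catalog (hypothetical, not Chimera's API): stores data and derivations."""

    def __init__(self) -> None:
        # Datasets that already exist, keyed by name.
        self.materialized: Dict[str, object] = {}
        # For each derivable dataset: the procedure and the names of its inputs.
        self.derivations: Dict[str, Tuple[Callable[..., object], List[str]]] = {}

    def add_data(self, name: str, value: object) -> None:
        self.materialized[name] = value

    def add_derivation(self, output: str, procedure: Callable[..., object],
                       inputs: List[str]) -> None:
        self.derivations[output] = (procedure, inputs)

    def request(self, name: str) -> object:
        """Return the dataset, deriving it (and its inputs) only if needed."""
        if name in self.materialized:
            return self.materialized[name]
        if name not in self.derivations:
            raise KeyError(f"no data and no derivation recorded for {name!r}")
        procedure, inputs = self.derivations[name]
        value = procedure(*[self.request(i) for i in inputs])
        self.materialized[name] = value
        return value

    def provenance(self, name: str) -> List[str]:
        """List the datasets that `name` was derived from, nearest first."""
        if name not in self.derivations:
            return []
        _, inputs = self.derivations[name]
        lineage: List[str] = []
        for parent in inputs:
            lineage.append(parent)
            lineage.extend(self.provenance(parent))
        return lineage


catalog = VirtualDataCatalog()
catalog.add_data("raw", [3, -1, 4, -1, 5])
catalog.add_derivation("cleaned", lambda xs: [x for x in xs if x >= 0], ["raw"])
catalog.add_derivation("total", lambda xs: sum(xs), ["cleaned"])

print(catalog.request("total"))     # derives "cleaned" first, then "total"
print(catalog.provenance("total"))  # datasets that "total" depends on
```

In this sketch a request for `total` triggers the derivation of `cleaned` as a side effect, and both results are kept so that later requests do not recompute them.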
Trio: a system for integrated management of data, accuracy, and lineage
- Presented at CIDR 2005, 2005
Cited by 174 (11 self)
Trio is a new database system that manages not only data, but also the accuracy and lineage of the data. Inexact (uncertain, probabilistic, fuzzy, approximate, incomplete, and imprecise!) databases have been proposed in the past, and the lineage problem also has been studied. The goals of the Trio project are to combine and distill previous work into a simple and usable model, design a query language as an understandable extension to SQL, and most importantly build a working system, a system that augments conventional data management with both accuracy and lineage as an integral part of the data. This paper provides numerous motivating applications for Trio and lays out preliminary plans for the data model, query language, and prototype system.
The Virtual Data Grid: A New Model and Architecture for Data-Intensive Collaboration
- In CIDR, 2003
Cited by 75 (8 self)
It is now common to encounter communities engaged in the collaborative analysis and transformation of large quantities of data over extended time periods. We argue that these communities require a scalable system for managing, tracing, communicating, and exploring the derivation and analysis of diverse data objects. Such a system could bring significant productivity increases, facilitating discovery, understanding, assessment, and sharing of both data and transformation resources, as well as the productive use of distributed resources for computation, storage, and collaboration. We define a model and architecture for a virtual data grid to address this requirement. Using a broadly applicable "typed dataset" as the unit of derivation tracking, we introduce simple constructs for describing how datasets are derived from transformations and from other datasets. We also define mechanisms for integrating with, and adapting to, existing data management systems and transformation and analysis tools, as well as Grid mechanisms for distributed resource management and computation planning. We report on successful application results obtained with a prototype system called Chimera that implements some of these concepts, involving challenging analyses of high-energy physics and astronomy data.
Provenance management in curated databases
- In SIGMOD ’06: Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, 2006
Cited by 66 (16 self)
Curated databases in bioinformatics and other disciplines are the result of a great deal of manual annotation, correction and transfer of data from other sources. Provenance information concerning the creation, attribution, or version history of such data is crucial for assessing its integrity and scientific value. General purpose database systems provide little support for tracking provenance, especially when data moves among databases. This paper investigates general-purpose techniques for recording provenance for data that is copied among databases. We describe an approach in which we track the user’s actions while browsing source databases and copying data into a curated database, in order to record the user’s actions in a convenient, queryable form. We present an implementation of this technique and use it to evaluate the feasibility of database support for provenance management. Our experiments show that although the overhead of a naïve approach is fairly high, it can be decreased to an acceptable level using simple optimizations.
VisTrails: Enabling interactive multiple-view visualizations
- In IEEE Visualization 2005, 2005
Cited by 50 (23 self)
Figure 1: VisTrails Visualization Spreadsheet. This ensemble shows the surface salinity variation at the mouth of the Columbia River over the period of a day. The green regions represent the fresh-water discharge of the river into the ocean. A single vistrail specification is used to construct this ensemble. Each cell corresponds to an instance of this specification executed using a different timestamp value.
VisTrails is a new system that enables interactive multiple-view visualizations by simplifying the creation and maintenance of visualization pipelines, and by optimizing their execution. It provides a general infrastructure that can be combined with existing visualization systems and libraries. A key component of VisTrails is the visualization trail (vistrail), a formal specification of a pipeline. Unlike existing dataflow-based systems, in VisTrails there is a clear separation between the specification of a pipeline and its execution instances. This separation enables powerful scripting capabilities and provides a scalable mechanism for generating a large number of visualizations. VisTrails also leverages the vistrail specification to identify and avoid redundant operations. This optimization is especially useful while exploring multiple visualizations. When variations of the same pipeline need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the pipelines. In this paper, we describe the design and implementation of VisTrails, and show its effectiveness in different application scenarios.
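The caching of overlapping subsequences mentioned in this abstract can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the VisTrails implementation: pipelines are treated as linear lists of steps, the operations in `OPS` are hypothetical stand-ins for visualization modules, and the cache key is simply the prefix of steps executed so far.

```python
from typing import Callable, Dict, List, Tuple

# A step is (operation name, parameter); a pipeline is an ordered list of steps.
Step = Tuple[str, int]

# Hypothetical operations standing in for visualization modules.
OPS: Dict[str, Callable[[int, int], int]] = {
    "load": lambda _prev, p: p,
    "scale": lambda prev, p: prev * p,
    "offset": lambda prev, p: prev + p,
}


class PrefixCache:
    """Runs pipelines and reuses results of prefixes that were already executed."""

    def __init__(self) -> None:
        self.results: Dict[Tuple[Step, ...], int] = {}
        self.executed = 0  # number of steps actually computed

    def run(self, pipeline: List[Step]) -> int:
        value = 0
        for i, (name, param) in enumerate(pipeline):
            key = tuple(pipeline[: i + 1])  # identifies this step and all before it
            if key not in self.results:
                self.results[key] = OPS[name](value, param)
                self.executed += 1
            value = self.results[key]
        return value


cache = PrefixCache()
first = cache.run([("load", 3), ("scale", 2), ("offset", 1)])
# Same pipeline with a different final parameter: only the last step is recomputed.
second = cache.run([("load", 3), ("scale", 2), ("offset", 5)])
print(first, second, cache.executed)
```

Because the specification (the list of steps) is separate from any execution, two variations that share a prefix can be compared before running them, which is what makes the reuse possible in this sketch.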
Curated databases
- PODS'08, 2008
Cited by 43 (6 self)
Curated databases are databases that are populated and updated with a great deal of human effort. Most reference works that one traditionally found on the reference shelves of libraries – dictionaries, encyclopedias, gazetteers, etc. – are now curated databases. Since it is now easy to publish databases on the web, there has been an explosion in the number of new curated databases used in scientific research. The value of curated databases lies in the organization and the quality of the data they contain. Like the paper reference works they have replaced, they usually represent the efforts of a dedicated group of people to produce a definitive description of some subject area. Curated databases present a number of challenges for database research. The topics of annotation, provenance, and citation are central, because curated databases are heavily cross-referenced with, and include data from, other databases, and much of the work of a curator is annotating existing data. Evolution of structure is important because these databases often evolve from semistructured representations, and because they have to accommodate new scientific discoveries. Much of the work in these areas is in its infancy, but it is beginning to suggest new research for both theory and practice. We discuss some of this research and emphasize the need to find appropriate models of the processes associated with curated databases.
Recording and using provenance in a protein compressibility experiment
- In HPDC’05, 2005
Cited by 41 (14 self)
Very large scale computations are now becoming routinely used as a methodology to undertake scientific research. In this context, "provenance systems" are regarded as the equivalent of the scientist’s logbook for in silico experimentation: provenance captures the documentation of the process that led to some result. Using a protein compressibility analysis application, we derive a set of generic use cases for a provenance system. In order to support these, we address the following fundamental questions: what is provenance? how to record it? what is the performance impact for grid execution? what is the performance of reasoning? In doing so, we define a technology-independent notion of provenance that captures interactions between components, internal component information and grouping of interactions, so as to allow us to analyse and reason about the execution of scientific processes. In order to support persistent provenance in heterogeneous applications, we introduce a separate provenance store, in which provenance documentation can be stored, archived and queried independently of the technology used to run the application. Through a series of practical tests, we evaluate the performance impact of such a provenance system. In summary, we demonstrate that provenance recording overhead of our prototype system remains under 10% of execution time, and we show that the recorded information successfully supports our use cases in a performant manner.
Community information management
- 2006
Cited by 39 (9 self)
We introduce Cimple, a joint project between the University of Illinois and the University of Wisconsin. Cimple aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. We first describe the envisioned working of such a software platform and our prototype, DBLife, which is a community portal being developed for the database research community. We then describe the technical challenges in Cimple and our solution approach. Finally, we discuss managing uncertainty and provenance, a crucial task in making our software platform practical.
Provenance of e-Science Experiments - Experience from Bioinformatics
- Proc. UK e-Science All Hands Meeting 2003, 2003
Cited by 36 (12 self)
Like experiments performed at a laboratory bench, the data associated with an e-Science experiment are of reduced value if other scientists are not able to identify the origin, or provenance, of those data. Provenance information is essential if experiments are to be validated and verified by others, or even by those who originally performed them. In this article, we give an overview of our initial work on the provenance of bioinformatics e-Science experiments within myGrid. We use two kinds of provenance: the derivation path of information and annotation. We show how this kind of provenance can be delivered within the myGrid demonstrator WorkBench and we explore how the resulting Webs of experimental data holdings can be mined for useful information and presentations for the e-Scientist.
The requirements of recording and using provenance in e-science experiments
- Journal of Grid Computing, 2005
Cited by 35 (10 self)
In e-Science experiments, it is vital to record the experimental process for later use such as in interpreting results, verifying that the correct process took place or tracing where data came from. The process that led to some data is called the provenance of that data, and a provenance architecture is the software architecture for a system that will provide the necessary functionality to record, store and use process documentation to determine the provenance of data items. However, there has been little principled analysis of what is actually required of a provenance architecture, so it is impossible to determine the functionality they would ideally support. In this paper, we present use cases for a provenance architecture from current experiments in biology, chemistry, physics and computer science, and analyse the use cases to determine the technical requirements of a generic, application-independent architecture. We propose an architecture that meets these requirements and evaluate a preliminary implementation by attempting to realise two of the use cases.

