Provenance in scientific workflow systems
Cited by 20 (7 self)
The automated tracking and storage of provenance information promises to be a major advantage of scientific workflow systems. We discuss issues related to data and workflow provenance, and present techniques for focusing user attention on meaningful provenance through “user views,” for managing the provenance of nested scientific data, and for using information about the evolution of a workflow specification to understand the difference in the provenance of similar data products.
Scientific workflow design for mere mortals
, 2008
Cited by 14 (4 self)
Recent years have seen a dramatic increase in research and development of scientific workflow systems. These systems promise to make scientists more productive by automating data-driven and compute-intensive analyses. Despite many early achievements, the long-term success of scientific workflow technology critically depends on making these systems useable by “mere mortals”, i.e., scientists who have a very good idea of the analysis methods they wish to assemble, but who are neither software developers nor scripting-language experts. With these users in mind, we identify a set of desiderata for scientific workflow systems crucial for enabling scientists to model and design the workflows they wish to automate themselves. As a first step towards meeting these requirements, we also show how the collection-oriented modeling and design (COMAD) approach for scientific workflows, implemented within the Kepler system, can help provide these critical, design-oriented capabilities to scientists.
Provenance in Collection-Oriented Scientific Workflows
, 2002
Cited by 11 (4 self)
We describe a provenance model tailored to scientific workflows based on the Collection-Oriented Modeling and Design paradigm. Our implementation within the Kepler scientific workflow system captures the dependencies of data and collection creation events on preexisting data and collections, and embeds these provenance records within the data stream. A provenance query engine operates on self-contained workflow traces representing serializations of the output data stream for particular workflow runs. We demonstrate this approach in our response to the First Provenance Challenge.
Project Histories: Managing Data Provenance Across Collection-Oriented Scientific Workflow Runs
- of Lecture Notes in Bioinformatics, pp 122–138
, 2007
Cited by 8 (3 self)
While a number of scientific workflow systems support data provenance, they primarily focus on collecting and querying provenance for single workflow runs. Scientific research projects, however, typically involve (1) many interrelated workflows (where data from one or more workflow runs are selected and used as input to subsequent runs) and (2) tasks between workflow runs that cannot be fully automated. This paper addresses the need for recording data dependencies across multiple workflow runs and accommodating data management activities performed between runs. We define a new conceptual model for representing project-level provenance based on the notion of project histories and folders, and describe mechanisms to support this model in the collection-oriented modeling and design framework of KEPLER. Our approach allows users to conveniently organize their projects and data using the familiar folder-hierarchy metaphor, while at the same time integrating this information with detailed provenance of data products generated via automated scientific workflows.
e-Science and biological pathway semantics
- BMC Bioinformatics
, 2007
Cited by 8 (2 self)
Background: The development of e-Science presents a major set of opportunities and challenges for the future progress of biological and life scientific research. Major new tools are required and corresponding demands are placed on the high-throughput data generated and used in these processes. Nowhere is the demand greater than in the semantic integration of these data. Semantic Web tools and technologies afford the chance to achieve this semantic integration. Since pathway knowledge is central to much of the scientific research today, it is a good test-bed for semantic integration. Within the context of biological pathways, the BioPAX initiative, part of a broader movement towards the standardization and integration of life science databases, forms a necessary prerequisite for the successful application of e-Science in health care and life science research. This paper examines whether BioPAX, an effort to overcome the barrier of disparate and heterogeneous pathway data sources, addresses the needs of e-Science. Results: We demonstrate how BioPAX pathway data can be used to ask and answer some useful biological questions. We find that BioPAX comes close to meeting a broad range of e-Science needs, but certain semantic weaknesses mean that these goals are missed. We make a series of
PrIMe: A methodology for developing provenance-aware applications
, 2006
Cited by 7 (4 self)
Provenance refers to the past processes that brought about a given (version of an) object, item or entity. By knowing the provenance of data, users can often better understand, trust, reproduce, and validate it. A provenance-aware application has the functionality to answer questions regarding the provenance of the data it produces, by using documentation of past processes. PrIMe is a software engineering technique for adapting application designs to enable them to interact with a provenance middleware layer, thereby making them provenance-aware. In this article, we specify the steps involved in applying PrIMe, analyse its effectiveness, and illustrate its use with two case studies, in bioinformatics and medicine.
Taverna Workflows: Syntax and Semantics
Cited by 6 (2 self)
This paper presents the formal syntax and the operational semantics of Taverna, a workflow management system with a large user base among the e-Science community. Such formal foundation, which has so far been lacking, opens the way to the translation between Taverna workflows and other process models. In particular, the ability to automatically compile a simple domain-specific process description into Taverna facilitates its adoption by e-scientists who are not expert workflow developers. We demonstrate this potential through a practical use case.
Expressive Reusable Workflow Templates
, 2009
Cited by 5 (4 self)
Workflow systems can manage complex scientific applications with distributed data processing. Although some workflow systems can represent collections of data with very compact abstractions and manage their execution efficiently, there are no approaches to date to manage collections of application components required to express some scientific applications. We present an approach to handle collections of components and data alike in expressive workflow templates whose basic structure is reusable. We also present an algorithm that can elaborate abstract compact workflow templates into execution-ready workflows that enumerate all computations to be carried out. We implemented the proposed approach in the Wings workflow system. Our work is motivated by real-world complex scientific applications that require handling of nested collections of both components and data.
From Computation Models to Models of Provenance: The RWS Approach
, 2002
Cited by 4 (3 self)
Scientific workflows often benefit from or even require advanced modeling constructs, e.g., nesting of subworkflows, cycles for executing loops, data-dependent routing, and pipelined execution. In such settings, an often overlooked aspect of provenance takes center stage: a suitable model of provenance (MoP) for scientific workflows should be based upon the underlying model of computation (MoC) used for executing the workflows. We can derive an adequate MoP from a MoC (such as Kahn’s process networks) by taking into account the assumptions that a MoC entails, and by recording the observables which it affords. In this way, a MoP captures or at least better approximates “real” data dependencies for workflows with advanced modeling constructs. As a specific instance, we elaborate on the Read-Write-ReSet (RWS) model, a simple and flexible MoP suitable for a number of different MoCs.
X-CSR: Dataflow Optimization for Distributed XML Process Pipelines
- In ICDE
, 2009
Cited by 3 (3 self)
XML process networks are a simple yet powerful programming paradigm for loosely coupled, coarse-grained dataflow applications such as data-centric scientific workflows. We describe a framework called ∆-XML that is well suited for applications in which pipelines of data processors modify parts (“deltas”) of XML data collections while keeping the overall collection structure intact. We show how to optimize the execution of ∆-XML process networks by minimizing the data shipping cost in distributed settings. This X-CSR optimization employs static type inference based on XML Schema to determine the XML stream fragments that are relevant to a processor, allowing irrelevant fragments to be bypassed (“shipped”) to downstream pipeline steps. Finally, we present evaluation results for synthetic as well as real-world scientific workflows, showing the range of improvements and practical feasibility of X-CSR.

