A Reference Architecture for Scientific Workflow Management Systems and the VIEW SOA Solution
Cited by 6 (4 self)
Abstract: Scientific workflows have recently emerged as a new paradigm for scientists to formalize and structure complex and distributed scientific processes to enable and accelerate many scientific discoveries. In contrast to business workflows, which are typically control-flow oriented, scientific workflows tend to be dataflow oriented, introducing a new set of requirements for system development. These requirements demand a new architectural design for scientific workflow management systems (SWFMSs). Although several SWFMSs have been developed that provide much experience for future research and development, a study from an architectural perspective is still missing. The main contributions of this paper are: 1) based on a comprehensive survey of the literature and identification of key requirements for SWFMSs, we propose the first reference architecture for SWFMSs; 2) according to the reference architecture, we further propose a service-oriented architecture for VIEW (a VIsual sciEntific Workflow management system); 3) we implemented VIEW to validate the feasibility of the proposed architectures; and 4) we present a VIEW-based scientific workflow application system (SWFAS), called FiberFlow, to showcase the application of our VIEW system. Index Terms: Reference architecture, scientific workflows, scientific workflow management system, SOA, VIEW.
PreDatA- Preparatory Data Analytics on Peta-Scale Machines
Cited by 3 (1 self)
Abstract: Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high-performance storage, and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics ‘hidden’ or ‘latent’ in the massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach for preparing and characterizing data while it is being produced by large-scale simulations running on peta-scale machines. By dedicating additional compute nodes on the peta-scale machine as staging nodes and staging the simulation’s output data through these nodes, PreDatA can exploit their computational power to perform selected data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in-transit manipulations are supported by the PreDatA middleware through RDMA-based data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. As a result, PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulation models. Performance evaluations with several production peta-scale applications on Oak Ridge National Laboratory’s Leadership Computing Facility demonstrate the feasibility and advantages of the PreDatA approach.
A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows
Cited by 2 (2 self)
Recently, there has been an increasing need in scientific workflows to solve the shimming problem: the use of special adaptors, called shims, to link related but incompatible workflow tasks. However, existing techniques produce scientific workflows that are cluttered with many visible shims, which distract scientists from the functional components. Moreover, these techniques do not address a new type of shimming problem that occurs due to the incompatibility between the ports of a task and the inputs/outputs of its internal task component. To address these issues, 1) we propose a task template model which encapsulates the composition and mapping of shims and functional task components within a task interface; 2) we design an XML-based task specification language, called TSL, to realize the proposed task template model; 3) we propose a service-oriented architecture for task management to enable the distributed execution of shims and functional components; and 4) we implement the proposed model, language and architecture and present a case study to validate them. Our technique uniquely addresses both types of shimming problems. To the best of our knowledge, this is the first shimming technique that makes shims invisible at the workflow level, resulting in scientific workflows that are more elegant and readable.
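The idea of hiding shims behind a task interface can be illustrated with a minimal Python sketch. This is not the paper's task template model or its TSL language; the class `TaskTemplate` and the helper functions are hypothetical names chosen for illustration. The sketch only shows the general pattern: adaptors run inside the task, so the workflow level sees a single component.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class TaskTemplate:
    """Hypothetical task wrapper: shims run inside the task, not in the workflow."""
    component: Callable[[Any], Any]
    input_shims: List[Callable[[Any], Any]] = field(default_factory=list)
    output_shims: List[Callable[[Any], Any]] = field(default_factory=list)

    def __call__(self, value: Any) -> Any:
        # Adapt the incoming value to what the functional component expects.
        for shim in self.input_shims:
            value = shim(value)
        result = self.component(value)
        # Adapt the component's result to what downstream tasks expect.
        for shim in self.output_shims:
            result = shim(result)
        return result


def csv_to_floats(text: str) -> List[float]:
    """Input shim: parse a comma-separated string into numbers."""
    return [float(item) for item in text.split(",")]


def mean(values: List[float]) -> float:
    """Functional component: the only part a scientist cares about."""
    return sum(values) / len(values)


def to_report(value: float) -> str:
    """Output shim: format the number for a text-consuming task."""
    return f"mean={value:.2f}"


# At the workflow level, only `mean_task` is visible; both shims are hidden.
mean_task = TaskTemplate(
    component=mean,
    input_shims=[csv_to_floats],
    output_shims=[to_report],
)
```

Calling `mean_task("1,2,3")` passes the string through the input shim, the component, and the output shim in that order, without any adaptor appearing as a separate workflow node.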
Task Decomposition for Adaptive Data Staging in Workflows for Distributed Environments
Cited by 1 (1 self)
Abstract: Scientific workflows are often composed by scientists who are not particularly familiar with the performance and fault-tolerance issues of the underlying layer. The inherent nature of the infrastructure and environment for scientific workflow applications means that the movement of data comes with reliability challenges. Improving the reliability of scientific workflows in distributed environments calls for decoupling data staging from computation activities, so that each aspect can be addressed separately. In this paper, we present an approach to managing scientific workflows that specifically provides constructs for reliable data staging. In our framework, data staging tasks are automatically separated from computation tasks in the definition of the workflow. High-level policies can be provided that allow for dynamic adaptation of the workflow to occur. Our approach permits the separate specification of the functional and non-functional requirements of the application and is dynamic enough to allow for the alteration of the workflow at runtime for optimization.
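As a rough illustration of separating data staging from computation, the following Python sketch splits each task's remote inputs into explicit staging steps and applies a simple retry policy to them. It is a sketch under stated assumptions, not the framework described in the abstract: the names `Task`, `Step`, `decompose`, and `run_with_retries` are hypothetical, and treating any input containing `://` as remote is an assumption made only for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Task:
    name: str
    inputs: Tuple[str, ...]  # assumption: remote inputs are URIs containing "://"


@dataclass(frozen=True)
class Step:
    kind: str    # "stage" or "compute"
    target: str  # URI to stage, or name of the task to run


def decompose(tasks: List[Task]) -> List[Step]:
    """Emit one staging step per distinct remote input, ahead of its first consumer."""
    steps: List[Step] = []
    staged = set()
    for task in tasks:
        for uri in task.inputs:
            if "://" in uri and uri not in staged:
                steps.append(Step("stage", uri))
                staged.add(uri)
        steps.append(Step("compute", task.name))
    return steps


def run_with_retries(action: Callable[[], T], max_attempts: int = 3) -> T:
    """Hypothetical staging policy: retry a transfer that fails with an I/O error."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except IOError:
            if attempt == max_attempts:
                raise
    raise AssertionError("unreachable")
```

Because staging steps are separate objects, a policy such as the retry count can be changed without touching the computation tasks, which is the kind of separation of functional and non-functional concerns the abstract argues for.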
A Dataflow-Based Scientific Workflow Composition Framework
IEEE Transactions on Services Computing
Cited by 1 (1 self)
Scientific workflow has recently become an enabling technology to automate and speed up the scientific discovery process. Although several scientific workflow management systems (SWFMSs) have been developed, a formal scientific workflow composition model in which workflow constructs are fully compositional with one another is still missing. In this paper, we propose a dataflow-based scientific workflow composition framework consisting of: i) a dataflow-based scientific workflow model that separates the declaration of the workflow interface from the definition of its functional body; ii) a set of workflow constructs, including Map, Reduce, Tree, Loop, Conditional, and Curry, which are fully compositional with one another; and iii) a dataflow-based exception handling approach to support hierarchical exception propagation and user-defined exception handling. Our workflow composition framework is unique in that workflows are the only operands for composition; in this way, our approach elegantly solves the two-world problem in existing composition frameworks, in which composition needs to deal with both the world of tasks and the world of workflows. The proposed framework is implemented and several case studies are conducted to validate our techniques.
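The claim that workflows are the only operands for composition can be sketched in Python by modeling a workflow as a plain callable and each construct as a function that takes workflows and returns a workflow. This is an illustrative sketch only: it does not reproduce the paper's formal model, it omits the Tree construct and exception handling, and all function names (`map_wf`, `reduce_wf`, `conditional_wf`, `loop_wf`, `curry_wf`, `pipe`) are hypothetical.

```python
from functools import reduce
from typing import Any, Callable

Workflow = Callable[..., Any]  # assumption: a workflow is modeled as a callable


def map_wf(w: Workflow) -> Workflow:
    """Map: apply workflow w to every element of a collection."""
    return lambda items: [w(item) for item in items]


def reduce_wf(w: Workflow, initial: Any) -> Workflow:
    """Reduce: fold a collection with a two-input workflow w."""
    return lambda items: reduce(w, items, initial)


def conditional_wf(test: Workflow, then_w: Workflow, else_w: Workflow) -> Workflow:
    """Conditional: route the input to one of two workflows."""
    return lambda x: then_w(x) if test(x) else else_w(x)


def loop_wf(test: Workflow, body: Workflow) -> Workflow:
    """Loop: feed the body's output back in while the test holds."""
    def run(x: Any) -> Any:
        while test(x):
            x = body(x)
        return x
    return run


def curry_wf(w: Workflow, first: Any) -> Workflow:
    """Curry: fix the first input of w, yielding a workflow with fewer inputs."""
    return lambda *rest: w(first, *rest)


def pipe(*workflows: Workflow) -> Workflow:
    """Sequential composition: the output of each workflow feeds the next."""
    def run(x: Any) -> Any:
        for w in workflows:
            x = w(x)
        return x
    return run


# Primitive workflows.
square = lambda x: x * x
add = lambda a, b: a + b

# Every construct returns a workflow, so results compose freely.
sum_of_squares = pipe(map_wf(square), reduce_wf(add, 0))
add_five = curry_wf(add, 5)
halve_until_small = loop_wf(lambda x: x > 10, lambda x: x // 2)
clamp = conditional_wf(lambda x: x > 10, halve_until_small, add_five)
```

Since every construct both accepts and returns a `Workflow`, a composite such as `sum_of_squares` can itself be passed to `map_wf` or `conditional_wf`, which is the sense in which there is no separate world of tasks to handle.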
DYNAMICALLY RECONFIGURABLE DATA-INTENSIVE SERVICE COMPOSITION
The distributed nature of services poses significant challenges to building robust service-based applications. A major aspect of this challenge is finding a model of service integration that promotes ease of dynamic reconfiguration in response to internal and external stimuli. Centralized models of composition are not conducive to data-intensive applications such as those in the scientific domain. Decentralized compositions are more complicated to manage, especially since no service has a global view of the interaction. In this paper we identify the requirements for dynamic reconfiguration of data-intensive composite services. A hybrid composition model that combines the attributes of centralization and decentralization is proposed. We argue that this model promotes dynamic reconfiguration of data-intensive service compositions.
Scientific Process Automation and . . .
2009
We introduce and describe scientific workflows, i.e., executable descriptions of automatable scientific processes such as computational science simulations and data analyses. Scientific workflows are often expressed in terms of tasks and their (dataflow) dependencies. This chapter first provides an overview of the characteristic features of scientific workflows and outlines their life cycle. A detailed case study highlights workflow challenges and solutions in simulation management. We then provide a brief overview of how some concrete systems support the various phases of the workflow life cycle, i.e., design, resource management, execution, and provenance management. We conclude with a discussion on community-based workflow sharing.
Scheduling and Management Techniques for Data-Intensive Application Workflows
This chapter presents a comprehensive survey of algorithms, techniques and frameworks used for scheduling and management of data-intensive application workflows. Many complex scientific experiments are expressed in the form of workflows for structured, repeatable, controlled, scalable and automated executions. This chapter focuses on workflows whose tasks process huge amounts of data, usually ranging from hundreds of megabytes to petabytes. Scientists are already using Grid systems that schedule these workflows onto globally distributed resources to optimize various objectives: minimize the total makespan of the workflow, minimize the cost and usage of network bandwidth, minimize the cost of computation and storage, meet the deadline of the application, and so forth. This chapter lists and describes the techniques used in each of these systems for processing huge amounts of data. A survey of workflow management techniques is useful for understanding how Grid systems work and provides insights into performance optimization of scientific applications dealing with data-intensive workloads.
Passive Network Performance Estimation for Large-Scale, Data-Intensive Computing
Abstract: Distributed computing applications are increasingly utilizing distributed data sources. However, the unpredictable cost of data access in large-scale computing infrastructures can lead to severe performance bottlenecks. Providing predictability in data access is, thus, essential to accommodate the large set of newly emerging large-scale, data-intensive computing applications. In this regard, accurate estimation of network performance is crucial to meeting the performance goals of such applications. Passive estimation based on past measurements is attractive for its relatively small overhead compared to relying on explicit probing. In this paper, we take a passive approach for network performance estimation. Our approach is different from existing passive techniques that rely either on past direct measurements of pairs of nodes or on topological similarities. Instead, we exploit secondhand measurements collected by other nodes without any topological restrictions. In this paper, we present Overlay Passive Estimation of Network performance (OPEN), a scalable framework providing end-to-end network performance estimation based on secondhand measurements, and discuss how OPEN achieves cost-effective estimation in a large-scale infrastructure. Our extensive experimental results show that OPEN estimation can be applicable for replica and resource selections commonly used in distributed computing. Index Terms: Network performance estimation, secondhand estimation, data-intensive computing, replica selection, resource selection.

