Gathering at the well: Creating communities for grid I/O
- In Proceedings of Supercomputing 2001, 2001
Cited by 32 (7 self)
Grid applications have demanding I/O needs. Schedulers must bring jobs and data in close proximity in order to satisfy throughput, scalability, and policy requirements. Most systems accomplish this by making either jobs or data mobile. We propose a system that allows jobs and data to meet by binding execution and storage sites together into I/O communities which then participate in the wide-area system. The relationships between participants in a community may be expressed by the ClassAd framework. Extensions to the framework allow community members to express indirect relations. We demonstrate our implementation of I/O communities by improving the performance of a key high-energy physics simulation on an international distributed system.
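The matchmaking idea in this abstract can be illustrated with a small sketch. It is not the ClassAd language or the paper's implementation: the ads are plain Python dictionaries, and every attribute name (`nearest_storage`, `has_dataset`, `requirements`) is a hypothetical stand-in. It only shows how a job might state a requirement about a storage site indirectly, through the execution site that belongs to the same community.

```python
def matches(job, machine):
    """Two-sided match: each ad's requirements must accept the other ad."""
    return (job["requirements"](job, machine)
            and machine["requirements"](machine, job))


# Storage ads (hypothetical attributes).
store_with_data = {"name": "store-1", "has_dataset": True}
store_without_data = {"name": "store-2", "has_dataset": False}

# Execution-site ads; each names the storage site in its own community.
exec_1 = {
    "name": "exec-1",
    "nearest_storage": store_with_data,
    "requirements": lambda me, other: True,  # accepts any job
}
exec_2 = {
    "name": "exec-2",
    "nearest_storage": store_without_data,
    "requirements": lambda me, other: True,
}

# The job never names a storage site. It constrains whatever storage
# the candidate execution site points at: an indirect relation.
job = {
    "owner": "physicist",
    "requirements": lambda me, other: other["nearest_storage"]["has_dataset"],
}

candidates = [m["name"] for m in (exec_1, exec_2) if matches(job, m)]
print(candidates)
```

Only the execution site whose community storage holds the dataset is selected, even though the job's requirement is written against execution sites alone.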
The Internet Backplane Protocol: A Study in Resource Sharing
- In Future Generation Computing Systems, 2002
Cited by 32 (3 self)
In this work we present the Internet Backplane Protocol (IBP), a middleware created to allow the sharing of storage resources, implemented as part of the network fabric. IBP allows an application to control intermediate data staging operations explicitly. Because IBP follows a very simple philosophy, much like the Internet Protocol, the resulting semantics might be too weak for some applications. We therefore introduce the exNode, a data structure that aggregates storage allocations on the Internet.
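As a rough illustration of what "a data structure that aggregates storage allocations" could look like, the sketch below maps byte extents of one logical file onto allocations held by different storage hosts. It is a minimal sketch under assumptions: the class and field names (`Allocation`, `ExNode`, `depot`, `offset`, `length`) and the host names are hypothetical and do not reflect the actual exNode format or the IBP API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Allocation:
    """One storage allocation on a remote host (all fields hypothetical)."""
    depot: str   # host holding the bytes
    offset: int  # logical offset of this extent within the file
    length: int  # number of bytes stored in this allocation


@dataclass
class ExNode:
    """Aggregates allocations into one logical file."""
    allocations: list = field(default_factory=list)

    def add(self, allocation):
        self.allocations.append(allocation)

    def size(self):
        # Logical size is the furthest byte covered by any allocation.
        return max((a.offset + a.length for a in self.allocations), default=0)

    def covering(self, start, end):
        """Return the allocations that overlap the byte range [start, end)."""
        return [a for a in self.allocations
                if a.offset < end and start < a.offset + a.length]


node = ExNode()
node.add(Allocation("depot-a.example.org", 0, 100))
node.add(Allocation("depot-b.example.org", 100, 100))
node.add(Allocation("depot-c.example.org", 0, 100))  # replica of the first extent

print(node.size())
print([a.depot for a in node.covering(50, 60)])
```

Because two allocations cover the first extent, a reader could fetch bytes 50 to 60 from either host, which is one way such an aggregate could offer stronger guarantees than a single allocation.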
A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing
- ACM Comput. Surv., 2006
Cited by 27 (7 self)
Data Grids have been adopted as the platform for scientific communities that need to share, access, transport, process and manage large data collections distributed worldwide. They combine high-end computing technologies with high-performance networking and wide-area storage management techniques. In this paper, we discuss the key concepts behind Data Grids and compare them with other data sharing and distribution paradigms such as content delivery networks, peer-to-peer networks and distributed databases.
The PUNCH Virtual File System: Seamless Access to Decentralized Storage Services in a Computational Grid
- Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC), 2001
Cited by 20 (12 self)
This paper describes a virtual file system that allows data to be transferred on demand between storage and compute servers for the duration of a computing session. The solution works with unmodified applications (even commercial ones) running on standard operating systems and hardware. The virtual file system employs software proxies to broker transactions between standard NFS clients and servers; the proxies are dynamically configured and controlled by computational grid middleware. The approach has been implemented and extensively exercised in the context of the Purdue University Network Computing Hubs, an operational computing portal that has more than 1,500 users across 24 countries. Results show that the virtual file system performs well in comparison to native NFS: performance analyses show that the proxy incurs mean overheads of 1% and 18% with respect to native NFS for a single-client execution of the Andrew benchmark in two representative computing environments, and that the average overhead for eight clients can be reduced to within 1% of native NFS with concurrent proxies.
The Livny and Plank-Beck Problems: Studies in Data Movement on the Computational Grid
- In SC2003, 2003
Cited by 15 (2 self)
Over the last few years the Grid Computing research community has become interested in developing data-intensive applications for the Grid. These applications face significant challenges because their widely distributed nature makes it difficult to access data with reasonable speed. In order to address this problem, we feel that the Grid community needs to develop and explore data movement challenges that represent problems encountered in these applications. In this paper, we identify two such problems that we have dubbed the Livny Problem and the Plank-Beck Problem. We also present data movement scheduling techniques that we have developed to address these problems.
Design, Implementation, and Performance of Checkpointing in NetSolve
- In International Conference on Dependable Systems and Networks (FTCS-30 & DCCA-8), 2000
Cited by 14 (9 self)
While a variety of checkpointing techniques and systems have been documented for long-running programs, they are typically not available to programmers who are not systems experts. This paper details a project that integrates three technologies, NetSolve, Starfish, and IBP, for the seamless integration of fault-tolerance into long-running applications. We discuss the design and implementation of this project, and present performance results executing on both local and wide-area networks.

1 Introduction
Checkpointing and rollback recovery is a well-studied research area for enabling long-running applications to be fault-tolerant. Many basic checkpointing algorithms [6, 11] and optimization techniques [12] have been developed for uniprocessor and parallel computing systems, and several checkpointing libraries and systems have been implemented [1, 5, 8, 10, 14, 17, 18, 20, 22]. However, for the typical scientific user, actually using a checkpointing system is a difficult task. All sys...
Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery
- SC07, 2007
Cited by 13 (13 self)
Procurement and the optimized utilization of Petascale supercomputers and centers are a renewed national priority. Sustaining the performance and availability of such large centers is a key technical challenge that significantly impacts their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve center-wide performance. Fundamental to both mechanisms is the efficient management of transient data: the way it is scheduled and recovered. Collectively, from a center’s standpoint, these techniques optimize resource usage and increase data/service availability. From a user’s standpoint, they reduce job turnaround time and optimize the use of allocated time.
Data Staging Effects in Wide Area Task Farming Applications
- IEEE International Symposium on Cluster Computing and the Grid, 2001
Cited by 11 (1 self)
Recent advances in computing and communication have given rise to the computational grid notion. The core of this computing paradigm is the design of a system for drawing compute power from a confederation of geographically dispersed heterogeneous resources, seamlessly and ubiquitously. If high-performance levels are to be achieved, data locality must be identified and managed. In this paper, we consider the effect of server-side staging on the behavior of a class of wide-area “task farming” applications. We show that staging improves task throughput mainly through increased parallelism rather than through a reduction in overall turnaround time per task. We derive a model for farming applications with and without server-side staging and verify the model through live experiments as well as simulations.
Timely Offloading of Result-Data in HPC Centers
- In The 2008 International Conference on Supercomputing, 2008
Cited by 11 (9 self)
High performance computing is facing an exponential growth in job output dataset sizes. This implies a significant commitment of supercomputing center resources, most notably precious scratch space, in handling data staging and offloading. However, the scratch area is typically managed using simple “purge policies”, without the sophisticated “end-user data services” that are required to balance the center’s resource consumption and user serviceability. End-user data services such as offloading are performed using point-to-point transfers that are unable to reconcile the center’s purge deadlines and users’ delivery deadlines, are unable to adapt to changing dynamics in the end-to-end data path, and are not fault-tolerant. We propose a robust framework for the timely, decentralized offload of result data, addressing the aforementioned significant gaps in extant direct-transfer-based offloading. The decentralized offload is achieved using an overlay of user-specified intermediate nodes and well-known landmark nodes. These nodes provide multiple data-flow paths, thereby maximizing bandwidth, and also provide fail-over capabilities for the offload. We have implemented our techniques within a production job scheduler (PBS) and data transfer tool (BitTorrent), and our evaluation shows that the offloading times can be significantly reduced (90.2% for a 2.1 GB file), while also meeting centeruser
Coupling Prefix Caching and Collective Downloads for Remote Dataset Access
- In Proceedings of the 16th ACM International Conference on Supercomputing, 2006
Cited by 8 (7 self)
Scientific datasets are typically archived at mass storage systems or data centers close to supercomputers/instruments. End users of these datasets, however, usually perform parts of their workflows at their local computers. In such cases, client-side caching can offer significant gains by reducing the cost of wide-area data movement. Scientific data caches, however, traditionally cache entire datasets, which may not be necessary. In this paper, we propose a novel combination of prefix caching and collective download. Prefix caching allows the bootstrapping of dataset downloads by caching only a prefix of the dataset, while collective download facilitates efficient parallel patching of the missing suffix from an external data source. To estimate the optimal prefix size, we further present an analytical model that considers both the initial download overhead and the downloading speed. We implemented our proposed approach in the FreeLoader distributed cache prototype. Experimental results (using multiple scientific data repositories and data transfer tools, as well as a real-world scientific dataset access trace) demonstrate that prefix caching and collective download can be implemented efficiently, our model can select an appropriate prefix size, and the cache hit rate can be improved significantly without hurting the local access rate of cached datasets.
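The abstract does not give the analytical model itself, so the following is only a back-of-the-envelope sketch of how a prefix size could be derived from the two quantities it names, the initial download overhead and the downloading speed. The model is an assumption made for illustration, not the paper's: a client reads the dataset sequentially at `local_rate`, the remote download of the suffix starts delivering after `startup_delay` seconds at `remote_rate`, and the prefix must be large enough that the reader never waits for a missing byte. The function and parameter names are hypothetical.

```python
def min_prefix_size(dataset_size, local_rate, remote_rate, startup_delay):
    """Smallest cached prefix (same unit as dataset_size) that avoids stalls.

    Assumed model: byte x is needed at time x / local_rate, and a suffix
    byte x >= prefix arrives at startup_delay + (x - prefix) / remote_rate.
    The constraint is tightest either at the first suffix byte or at the
    last byte of the dataset, so it is enough to check both.
    """
    if dataset_size <= 0:
        return 0.0
    # Stall-free at the first suffix byte: prefix covers the startup delay.
    at_prefix_end = local_rate * startup_delay
    # Stall-free at the last byte: matters when the remote link is slower
    # than local consumption.
    at_dataset_end = (dataset_size * (1 - remote_rate / local_rate)
                      + remote_rate * startup_delay)
    needed = max(at_prefix_end, at_dataset_end)
    # The prefix can never be negative or exceed the dataset itself.
    return min(max(needed, 0.0), float(dataset_size))


# 1000 MB dataset, 100 MB/s local reads, 2 s startup overhead.
print(min_prefix_size(1000, 100, 50, 2))   # slow remote source
print(min_prefix_size(1000, 100, 200, 2))  # fast remote source
```

Under these assumptions a slow remote source forces a large prefix (600 MB here), while a fast one only needs enough cached data to hide the startup overhead (200 MB).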

