Results 11-20 of 178
RFS: Efficient and Flexible Remote File Access for MPI-IO
- In Proceedings of the IEEE International Conference on Cluster Computing, 2004
Cited by 15 (9 self)
Scientific applications often need to access remote file systems. Because of slow networks and large data size, however, remote I/O can become an even more serious performance bottleneck than local I/O performance. In this work, we present RFS, a high-performance remote I/O facility for ROMIO, which is a well-known MPI-IO implementation. Our simple, portable, and flexible design eliminates the shortcomings of previous remote I/O efforts. In particular, RFS improves the remote I/O performance by adopting active buffering with threads (ABT), which hides I/O cost by aggressively buffering the output data using available memory and performing background I/O using threads while computation is taking place. Our experimental results show that RFS with ABT can significantly reduce the remote I/O visible cost, achieving up to 92% of the theoretical peak throughput. The computation slowdown caused by concurrent I/O activities was 0.2–6.2%, which is dwarfed by the overall performance improvement in application turnaround time.
Lightweight I/O for scientific applications
- 2006
Cited by 14 (2 self)
Today’s high-end massively parallel processing (MPP) machines have thousands to tens of thousands of processors, with next-generation systems planned to have in excess of one hundred thousand processors. For systems of such scale, efficient I/O is a significant challenge that cannot be solved using traditional approaches. In particular, general purpose parallel file systems that limit applications to standard interfaces and access policies do not scale and will likely be a performance bottleneck for many scientific applications. In this paper, we investigate the use of a “lightweight” approach to I/O that requires the application or I/O-library developer to extend a core set of critical I/O functionality with the minimum set of features and services required by its target applications. We argue that this approach allows the development of I/O libraries that are both scalable and secure. We support our claims with preliminary results for a lightweight checkpoint operation on a development cluster at Sandia.
Optimizing Center Performance through Coordinated Data Staging, Scheduling and Recovery
- SC07, 2007
Cited by 13 (13 self)
Procurement and the optimized utilization of Petascale supercomputers and centers is a renewed national priority. Sustained performance and availability of such large centers is a key technical challenge significantly impacting their usability. Storage systems are known to be the primary fault source leading to data unavailability and job resubmissions. This results in reduced center performance, partially due to the lack of coordination between I/O activities and job scheduling. In this work, we propose the coordination of job scheduling with data staging/offloading and on-demand staged data reconstruction to address the availability of job input data and to improve centerwide performance. Fundamental to both mechanisms is the efficient management of transient data: in the way it is scheduled and recovered. Collectively, from a center’s standpoint, these techniques optimize resource usage and increase its data/service availability. From a user’s standpoint, they reduce the job turnaround time and optimize the allocated time usage.
A Case Study of Parallel I/O for Biological Sequence Analysis on Linux Clusters
- Proceedings of the 5th IEEE International Conference on Cluster Computing (Cluster 2003), Hong Kong, 2003
Cited by 13 (7 self)
In this paper we analyze the I/O access patterns of a widely-used biological sequence search tool and implement two variations that employ parallel I/O for data access based on PVFS (Parallel Virtual File System) and CEFT-PVFS (Cost-Effective Fault-Tolerant PVFS). Experiments show that the two variations outperform the original tool when equal or even fewer storage devices are used in the former. It is also found that although the performance of the two variations improves consistently when initially increasing the number of servers, this performance gain from parallel I/O becomes insignificant with further increase in server number. We examine the effectiveness of two read performance optimization techniques in CEFT-PVFS by using this tool as a benchmark. Performance results indicate: (1) doubling the degree of parallelism boosts the read performance to approach that of PVFS; (2) skipping hotspots can substantially improve the I/O performance when the load on data servers is highly imbalanced. The I/O resource contention due to the sharing of server nodes by multiple applications in a cluster has been shown to degrade the performance of the original tool and the variation based on PVFS by up to 10-fold and 21-fold, respectively, whereas the variation based on CEFT-PVFS suffered only a two-fold performance degradation. Keywords: parallel I/O, CEFT-PVFS, PVFS, BLAST
Database support for data-driven scientific applications in the grid
- Parallel Processing Letters, 2003
Cited by 13 (3 self)
In this paper we describe a services-oriented software system to provide basic database support for efficient execution of applications that make use of scientific datasets in the Grid. This system supports two core operations: efficient selection of the data of interest from distributed databases and efficient transfer of data from storage nodes to compute nodes for processing. We present its overall architecture and main components and describe preliminary experimental results.
stdchk: A Checkpoint Storage System for Desktop Grid Computing
Cited by 13 (6 self)
Checkpointing is an indispensable technique to provide fault tolerance for long-running high-throughput applications like those running on desktop grids. This article argues that a checkpoint storage system, optimized to operate in these environments, can offer multiple benefits: reduce the load on a traditional file system, offer high performance through specialization, and, finally, optimize data management by taking into account checkpoint application semantics. Such a storage system can present a unifying abstraction to checkpoint operations, while hiding the fact that there are no dedicated resources to store the checkpoint data. We prototype stdchk, a checkpoint storage system that uses scavenged disk space from participating desktops to build a low-cost storage system, offering a traditional file system interface for easy integration with applications. This article presents the stdchk architecture, key performance optimizations, and its support for incremental checkpointing and increased data availability. Our evaluation confirms that the stdchk approach is viable in a desktop grid setting and offers a low-cost storage system with desirable performance characteristics: high write throughput as well as reduced storage space and network effort to save checkpoint images.
Supporting Efficient Noncontiguous Access in PVFS over InfiniBand
- In Proceedings of Cluster Computing ’03, Hong Kong, 2003
Cited by 12 (6 self)
Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper, we address two main issues on supporting efficient noncontiguous I/O access in cluster file systems over a high performance network. One is noncontiguous data transmission between the client and the I/O server. The second is noncontiguous disk access on the I/O server itself.
Parallel genomic sequence-searching on an ad-hoc grid: Experiences, lessons learned, and implications
- In ACM/IEEE SC 2006: The International Conference on High-Performance Computing, Networking, and Storage (Proceedings of the ACM/IEEE SC 2006 Conference), 2006
Cited by 11 (9 self)
bioinformaticists to characterize an unknown sequence by comparing it against a database of known sequences. The similarity between sequences enables biologists to detect evolutionary relationships and infer biological properties of the unknown sequence. mpiBLAST, our parallel BLAST, decreases the search time of a 300 KB query on the current NT database from over two full days to under 10 minutes on a 128-processor cluster and allows larger query files to be compared. Consequently, we propose to compare the largest query available, the entire NT database, against the largest database available, the entire NT database. The result of this comparison will provide critical information to the biology community, including insightful evolutionary, structural, and functional relationships between every sequence and family in the NT database. Preliminary projections indicated that completing the above task in a reasonable length of time required more processors than were available to us at a single site. Hence, we assembled GreenGene, an ad-hoc grid that was constructed “on the fly” from donated computational, network, and storage resources during last year’s SC|05. GreenGene consisted of 3048 processors from machines that were distributed across the United States. This paper presents a case study of mpiBLAST on GreenGene: specifically, a pre-run characterization of the computation, the hardware and software architectural design, experimental results, and future directions.
//TRACE: Parallel trace replay with approximate causal events
- In Proceedings of the 5th USENIX Symposium on File and Storage Technologies (FAST’07), 2007
Cited by 10 (0 self)
//TRACE is a new approach for extracting and replaying traces of parallel applications to recreate their I/O behavior. Its tracing engine automatically discovers inter-node data dependencies and inter-I/O compute times for each node (process) in an application. This information is reflected in per-node annotated I/O traces. Such annotation allows a parallel replayer to closely mimic the behavior of a traced application across a variety of storage systems. When compared to other replay mechanisms, //TRACE offers significant gains in replay accuracy. Overall, the average replay error for the parallel applications evaluated in this paper is below 6%.
Learning from the Success of MPI
- 2001
Cited by 10 (1 self)
The Message Passing Interface (MPI) has been extremely successful as a portable way to program high-performance parallel computers.

