Supporting Efficient Noncontiguous Access in PVFS over InfiniBand
- In Proceedings of Cluster Computing ’03, Hong Kong, 2003
Cited by 12 (6 self)
Noncontiguous I/O access is the main access pattern in many scientific applications. Noncontiguity exists both in access to files and in access to target memory regions on the client. This characteristic imposes a requirement of native noncontiguous I/O access support in cluster file systems for high performance. In this paper, we address two main issues on supporting efficient noncontiguous I/O access in cluster file systems over a high performance network. One is noncontiguous data transmission between the client and the I/O server. The second is noncontiguous disk access on the I/O server itself.
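As a rough illustration of the access pattern this abstract describes (not the paper's PVFS implementation), a noncontiguous request can be written as a list of (offset, length) file regions that are gathered into one client buffer. The helper name `read_regions` and the 16-byte example file are assumptions made for this sketch.

```python
import os
import tempfile


def read_regions(path, regions):
    """Gather noncontiguous (offset, length) file regions into one bytes object.

    Hypothetical helper: it issues one seek+read per region, which is the
    naive approach that native noncontiguous support aims to improve on.
    """
    out = bytearray()
    with open(path, "rb") as f:
        for offset, length in regions:
            f.seek(offset)
            out += f.read(length)
    return bytes(out)


def demo():
    # Build a small file holding the 16 bytes 0x00 .. 0x0f.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(bytes(range(16)))
        path = tmp.name
    try:
        # Strided pattern: the first 2 bytes out of every 8.
        return read_regions(path, [(0, 2), (8, 2)])
    finally:
        os.remove(path)
```

Each region here costs a separate request. Sending the whole region list in one request is the kind of saving that native noncontiguous support targets.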
Scalable high-level caching for parallel I/O
- In Proceedings of the International Parallel and Distributed Processing Symposium, 2004
Cited by 7 (6 self)
In order for I/O systems to achieve high performance in a parallel environment, they must either sacrifice client-side file caching, or keep caching and deal with complex coherency issues. The most common technique for dealing with cache coherency in multi-client file caching environments uses file locks to bypass the client-side cache. Aside from effectively disabling cache usage, file locking is sometimes unavailable on larger systems. The high-level abstraction layer of MPI allows us to tackle cache coherency with additional information and coordination without using file locks. By approaching the cache coherency issue further up, the underlying I/O accesses can be modified in such a way as to ensure access to coherent data while satisfying the user’s I/O request. We can effectively exploit the benefits of a file system’s client-side cache while minimizing its management costs.
Unifier: Unifying Cache Management and Communication Buffer Management for PVFS over InfiniBand
- In Proceedings of IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 04), 2004
Cited by 6 (1 self)
The advent of networking technologies and high performance transport protocols facilitates the service of storage over networks. However, they pose challenges in integration and interaction among storage server application components and system components. In this paper, we put forward a component, called Unifier, to provide more efficient integration and better interaction among these components. Unifier has three notable features. (1) Unifier integrates cache management and communication buffer management. It offers a single copy data sharing among all components in a server application safely and concurrently. (2) It reduces memory registration and deregistration costs to enable applications to take full advantage of RDMA operations. (3) It provides means to achieve adaptation, application-specific optimization, and better cooperation among different components. This paper presents the design and implementation of Unifier. This component has been deployed and evaluated in a version of PVFS1 implementation over InfiniBand. Experimental results show performance improvements between 30% and 70% over other approaches. Better scalability is also achieved by the PVFS I/O servers.
Investigation Of Leading HPC I/O Performance Using A Scientific-Application Derived Benchmark
Cited by 5 (1 self)
With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle their data analysis requirements. However, to utilize such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel filesystems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3 and Intel Itanium2 cluster; GPFS on IBM Power5 and AMD Opteron platforms; two BlueGene/L installations utilizing GPFS and PVFS2 filesystems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment as well as highly-tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and quantify the volume of computation required to hide a given volume of I/O. Overall our study quantifies the vast differences in performance and functionality of parallel filesystems across state-of-the-art platforms, while providing system designers and computational scientists a lightweight tool for conducting further analyses.
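The idea of "computation required to hide a given volume of I/O" can be sketched with a back-of-envelope estimate. Assuming perfect overlap between computation and asynchronous I/O, writing V bytes at sustained bandwidth B takes V/B seconds, so a processor sustaining R flop/s needs roughly R·V/B floating-point operations of work to stay busy for that long. The function name and all figures below are hypothetical and are not taken from the paper.

```python
def flops_to_hide_io(io_bytes, io_bandwidth, compute_rate):
    """Floating-point operations needed to keep a processor busy during an I/O.

    Assumes perfect overlap of computation and asynchronous I/O.
    io_bandwidth is in bytes/s and compute_rate is in flop/s.
    """
    io_seconds = io_bytes / io_bandwidth
    return io_seconds * compute_rate


# Hypothetical figures, chosen only for illustration:
# 1 GiB written at 256 MiB/s by a processor sustaining 2 Gflop/s.
needed = flops_to_hide_io(2**30, 256 * 2**20, 2e9)
```

With these made-up numbers the write takes 4 seconds, so about 8e9 floating-point operations are needed to cover it. Real systems would need more, since overlap is rarely perfect.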
Noncontiguous Locking Techniques for Parallel File Systems
Cited by 1 (0 self)
Many parallel scientific applications use high-level I/O APIs that offer atomic I/O capabilities. Atomic I/O in current parallel file systems is often slow when multiple processes simultaneously access interleaved, shared files. Current atomic I/O solutions are not optimized for handling noncontiguous access patterns because current locking systems have a fixed file system block-based granularity and do not leverage high-level access pattern information. In this paper we present a hybrid lock protocol that takes advantage of new list and datatype byte-range lock description techniques to enable high performance atomic I/O operations for these challenging access patterns. We implement our scalable distributed lock manager (DLM) in the PVFS parallel file system and show that these techniques improve locking throughput over a naive noncontiguous locking approach by several orders of magnitude in an array of lock-only tests. Additionally, in two scientific I/O benchmarks, we show the benefits of avoiding false sharing with our byte-range granular DLM when compared against a block-based lock system implementation.
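The false-sharing effect mentioned in this abstract can be illustrated with a small sketch. It is not the paper's lock protocol or DLM. It only contrasts how a block-granular check and a byte-range check treat two interleaved, non-overlapping access patterns. The function names, the 4096-byte block size, and the access lists are assumptions made for illustration.

```python
def blocks_touched(ranges, block_size):
    """Return the block indices covered by half-open byte ranges (start < end)."""
    blocks = set()
    for start, end in ranges:
        blocks.update(range(start // block_size, (end - 1) // block_size + 1))
    return blocks


def block_conflict(a, b, block_size):
    """Block-granular locks conflict if both sides lock any common block."""
    return bool(blocks_touched(a, block_size) & blocks_touched(b, block_size))


def byte_range_conflict(a, b):
    """Byte-range locks conflict only if two ranges actually overlap."""
    return any(s1 < e2 and s2 < e1 for s1, e1 in a for s2, e2 in b)


# Two processes access interleaved 1 KiB pieces of the same shared file.
proc0 = [(0, 1024), (2048, 3072)]
proc1 = [(1024, 2048), (3072, 4096)]
```

Here `block_conflict(proc0, proc1, 4096)` reports a conflict because all four pieces fall in block 0, while `byte_range_conflict(proc0, proc1)` does not, since no bytes are actually shared.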
HPC Global File System Performance Analysis Using A Scientific-Application Derived Benchmark
Cited by 1 (1 self)
With the exponential growth of high-fidelity sensor and simulated data, the scientific community is increasingly reliant on ultrascale HPC resources to handle its data analysis requirements. However, to use such extreme computing power effectively, the I/O components must be designed in a balanced fashion, as any architectural bottleneck will quickly render the platform intolerably inefficient. To understand I/O performance of data-intensive applications in realistic computational settings, we develop a lightweight, portable benchmark called MADbench2, which is derived directly from a large-scale Cosmic Microwave Background (CMB) data analysis package. Our study represents one of the most comprehensive I/O analyses of modern parallel file systems, examining a broad range of system architectures and configurations, including Lustre on the Cray XT3, XT4, and Intel Itanium2 clusters; GPFS on IBM Power5 and AMD Opteron platforms; a BlueGene/P installation using GPFS and PVFS2 file systems; and CXFS on the SGI Altix3700. We present extensive synchronous I/O performance data comparing a number of key parameters including concurrency, POSIX- versus MPI-IO, and unique- versus shared-file accesses, using both the default environment as well as highly-tuned I/O parameters. Finally, we explore the potential of asynchronous I/O and show that only two of the nine evaluated systems benefited from MPI-2’s asynchronous MPI-IO. On those systems, experimental results indicate that the computational intensity required to hide I/O effectively is already close to the practical limit of BLAS3 calculations. Overall, our study quantifies vast differences in performance and functionality of parallel file systems across state-of-the-art platforms, showing I/O rates that vary by up to 75x on the examined architectures, while providing system designers and computational scientists a lightweight tool for conducting further analysis.
A New Flexible MPI Collective I/O Implementation
Cited by 1 (0 self)
The MPI-IO standard creates a huge opportunity to break out of the traditional file system I/O methods. As a software layer between the user and the file system, an MPI-IO library can potentially optimize I/O on behalf of the user with little to no user intervention. This is all possible because of the rich data description and communication infrastructure MPI-2 offers. Powerful data descriptions and some of the other desirable features of MPI-2, however, make MPI-IO challenging to implement. By creating a new collective I/O implementation that allows developers to easily tinker and play with new optimizations or combinations of different techniques, research can proceed faster and be quickly and reliably deployed.
Enhancing Checkpoint Performance with Staging IO and SSD
- In International Workshop on Storage Network Architecture and Parallel I/Os, 2010
With the ever-growing size of computer clusters and applications, system failures are becoming inevitable. Checkpointing, a strategy to ensure fault tolerance, has become imperative in such an environment. However, the existing mechanism of writing checkpoints to parallel file systems does not perform well with increasing job size. Solid State Disks (SSDs) are attracting more and more attention due to technical merits such as good random access performance, low power consumption, and shock resistance. However, how to apply SSDs in a parallel storage system to improve checkpoint writing remains an open question. In this paper we propose a new strategy to enhance checkpoint writing performance by aggregating checkpoint writes at the client side and utilizing staging IO on data servers. We also explore the potential of substituting SSDs for traditional hard disks on data servers to achieve better write bandwidth. Our strategy achieves up to 6.3 times higher write bandwidth than a popular parallel file system, PVFS2 [6], with 8 client nodes and 4 data servers. In experiments with real applications using 64 application processes and 4 data servers, our strategy can accelerate checkpoint writing by up to 9.9 times compared to PVFS2.
MPI-I/O on Franklin XT4 System at NERSC
Prior to a software upgrade and hardware maintenance on March 17th, 2009, on the Franklin Cray XT4 machine at the National Energy Research Scientific Computing (NERSC) Center, MPI-IO shared-file performance reached only a small percentage of file-per-processor POSIX performance. The March 17th upgrade unintentionally increased I/O performance significantly for a number of applications. This paper shows the performance differences after the maintenance and explores some of the possible explanations for the dramatic improvements.

