Results 1 - 10
of
78
Disk-directed I/O for MIMD Multiprocessors
, 1994
"... Many scientific applications that run on today’s multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth ..."
Abstract
-
Cited by 217 (18 self)
- Add to MetaCart
Many scientific applications that run on today’s multiprocessors, such as weather forecasting and seismic analysis, are bottlenecked by their file-I/O needs. Even if the multiprocessor is configured with sufficient I/O hardware, the file-system software often fails to provide the available bandwidth to the application. Although libraries and enhanced file-system interfaces can make a significant improvement, we believe that fundamental changes are needed in the file-server software. We propose a new technique, disk-directed I/O, to allow the disk servers to determine the flow of data for maximum performance. Our simulations show that tremendous performance gains are possible. Indeed, disk-directed I/O provided consistent high performance that was largely independent of data distribution, obtained up to 93 % of peak disk bandwidth, and was as much as 16 times faster than traditional parallel file systems.
Automatic Compiler-Inserted I/O Prefetching for Out-of-Core Applications
, 1996
"... Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e ..."
Abstract
-
Cited by 138 (6 self)
- Add to MetaCart
Current operating systems offer poor performance when a numeric application's working set does not fit in main memory. As a result, programmers who wish to solve "out-of-core" problems efficiently are typically faced with the onerous task of rewriting an application to use explicit I/O operations (e.g., read/write). In this paper, we propose and evaluate a fully-automatic technique which liberates the programmer from this task, provides high performance, and requires only minimal changes to current operating systems. In our scheme, the compiler provides the crucial information on future access patterns without burdening the programmer, the operating system supports non-binding prefetch and re- lease hints for managing I/O, and the operating sys- tem cooperates with a run-time layer to accelerate performance by adapting to dynamic behavior and minimizing prefetch overhead. This approach maintains the abstraction of unlimited virtual memory for the programmer, gives the compiler the flexibility to aggressively move prefetches back ahead of references, and gives the operating system the flexibility to arbitrate between the competing resource demands of multiple applications. We have implemented our scheme using the SUIF compiler and the Hurricane operating system. Our experimental results demonstrate that our fully-automatic scheme effectively hides the I/O latency in out-of- core versions of the entire NAS Parallel benchmark suite, thus resulting in speedups of roughly twofold for five of the eight applications, with one application speeding up by over threefold.
On Implementing MPI-IO Portably and with High Performance
- In Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems
, 1999
"... We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portabl ..."
Abstract
-
Cited by 137 (21 self)
- Add to MetaCart
We discuss the issues involved in implementing MPI-IO portably on multiple machines and file systems and also achieving high performance. One way to implement MPI-IO portably is to implement it on top of the basic Unix I/O functions (open, lseek, read, write, and close), which are themselves portable. We argue that this approach has limitations in both functionality and performance. We instead advocatean implementation approach that combines a large portion of portable code and a small portion of code that is optimized separately for different machines and file systems. We have used such an approach to develop a high-performance, portable MPI-IO implementation, called ROMIO. In addition to basic I/O functionality, we consider the issues of supporting other MPI-IO features, such as 64-bit file sizes, noncontiguous accesses, collective I/O, asynchronous I/O, consistency and atomicity semantics, user-supplied hints, shared file pointers, portable data representation, and file preallocati...
File-Access Characteristics of Parallel Scientific Workloads
- IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1996
"... Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving t ..."
Abstract
-
Cited by 92 (10 self)
- Add to MetaCart
Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems. The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.
Data Sieving and Collective I/o In Romio
, 1999
"... The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application’s I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows ..."
Abstract
-
Cited by 79 (14 self)
- Add to MetaCart
The I/O access patterns of parallel programs often consist of accesses to a large number of small, noncontiguous pieces of data. If an application’s I/O needs are met by making many small, distinct I/O requests, however, the I/O performance degrades drastically. To avoid this problem, MPI-IO allows users to access a noncontiguous data set with a single I/O function call. This feature provides MPI-IO implementations an opportunity to optimize data access. We describe how our MPI-IO implementation, ROMIO, delivers high performance in the presence of noncontiguous requests. We explain in detail the two key optimizations ROMIO performs: data sieving for noncontiguous requests from one process and collective I/O for noncontiguous requests from multiple processes. We describe how one can implement these optimizations portably on multiple machines and file systems, control their memory requirements, and also achieve high performance. We demonstrate the performance and portability with performance results for three applications—an astrophysics-application template (DIST3D), the NAS BTIO benchmark, and an unstructured code (UNSTRUC)—on five different parallel machines:
Tuning the Performance of I/O-Intensive Parallel Applications
, 1996
"... Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four appl ..."
Abstract
-
Cited by 70 (24 self)
- Add to MetaCart
Getting good I/O performance from parallel programs is a critical problem for many application domains. In this paper, we report our experience tuning the I/O performance of four application programs from the areas of satellite-data processing and linear algebra. After tuning, three of the four applications achieve application-level I/O rates of over 100 MB/s on 16 processors. The total volume of I/O required by the programs ranged from about 75 MB to over 200 GB. We report the lessons learned in achieving high I/O performance from these applications, including the need for code restructuring, local disks on every node and knowledge of future I/O requests. We also report our experience on achieving high performance on peer-to-peer configurations. Finally, we comment on the necessity of complex I/O interfaces like collective I/O and strided requests to achieve high performance. 1 Introduction I/O has been identified as one of the major obstacles to achieving high performance from para...
CLIP: A Checkpointing Tool for Message-Passing Parallel Programs
, 1997
"... Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpo ..."
Abstract
-
Cited by 60 (9 self)
- Add to MetaCart
Checkpointing is a useful technique for rollback recovery of parallel applications. While extensive research has been performed on checkpointing in parallel environments, there are few checkpointers available to application users on commercial parallel computers. This paper presents one such checkpointer: CLIP. CLIP is a user-level library that provides semitransparent checkpointing for parallel programs on the Intel Paragon multicomputer. It is publicly available to Paragon users at no cost. Conceptually, checkpointing a multicomputer is quite straightforward. However, when creating an actual tool for checkpointing a complex machine like the Paragon, many more issues arise that require careful design decisions to be made. Sometimes ease-of-use must be sacrificed for efficiency and/or correctness. This paper details what these decisions are, and how they were made in CLIP. We also present performance data when checkpointing several long-running Paragon applications with CLIP. The bottom line is that a convenient, general-purpose checkpointing tool like CLIP can provide fault-tolerance on a massively parallel multicomputer like the Paragon with very good performance.
Hfs: A performance-oriented flexible file system based on building-block compositions
- ACM Transactions on Computer Systems
, 1997
"... The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file sy ..."
Abstract
-
Cited by 49 (8 self)
- Add to MetaCart
The Hurricane File System (HFS) is designed for (potentially large-scale) shared-memory multiprocessors. Its architecture is based on the principle that, in order to maximize performance for applications with diverse requirements, a file system must support a wide variety of file structures, file system policies, and I/O interfaces. Files in HFS are implemented using simple building blocks composed in potentially complex ways. This approach yields great flexibility, allowing an application to customize the structure and policies of a file to exactly meet its requirements. As an extreme example, HFS allows a file’s structure to be optimized for concurrent random-access write-only operations by 10 threads, something no other file system can do. Similarly, the prefetching, locking, and file cache management policies can all be chosen to match an application’s access pattern. In contrast, most parallel file systems support a single file structure and a small set of policies. We have implemented HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. We also show that for a number of file access patterns, HFS is able to deliver to the applications the full I/O bandwidth of the disks on our system.
High Performance Visualization of Time-Varying Volume Data over a Wide-Area Network
, 2000
"... This paper presents an end-to-end, low-cost solution for visualizing time-varying volume data rendered on a parallel computer located at a remote site. Pipelining and careful grouping of processors are used to hide I/O time and to maximize processors utilization. Compression is used to significantly ..."
Abstract
-
Cited by 38 (6 self)
- Add to MetaCart
This paper presents an end-to-end, low-cost solution for visualizing time-varying volume data rendered on a parallel computer located at a remote site. Pipelining and careful grouping of processors are used to hide I/O time and to maximize processors utilization. Compression is used to significantly cut down the cost of transferring output images from the parallel computer to a display device through a wide-area network. This complete rendering pipeline makes possible highly efficient rendering and remote viewing of high resolution time-varying data sets in the absence of high-speed network and parallel I/O support. To study the performance of this rendering pipeline and to demonstrate high-performance remote visualization, tests were conducted on a PC cluster in Japan as well as an SGI Origin 2000 operated at the NASA Ames Research Center with the display located at UC Davis. Keywords: High Performance Computing, Image Compression, Parallel Volume Rendering, Pipelining, Remote Visualization, Scientific Visualization, Time-Varying Data, Wide-Area Network 1
Performance Modeling for Realistic Storage Devices
, 1997
"... Managing large amounts of storage is difficult and becoming more so as both the complexity and number of storage devices are increasing. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approx ..."
Abstract
-
Cited by 36 (8 self)
- Add to MetaCart
Managing large amounts of storage is difficult and becoming more so as both the complexity and number of storage devices are increasing. One approach to this problem is a self-managing storage system. Since a self-managing storage system is a real-time system, it requires a model that quickly approximates the behavior of the storage device in a workload-dependent fashion. We develop such a model.
Our approach to modeling storage devices is to model the individual physical components of the device, such as queues, caches, and disk mechanisms, and then compose the component models. Each component model determines its behavior from the specification of the entering workload and the lower-level device behavior. To support the lower level component model in determining its behavior, each component model creates a modified workload specification to support the manner that the physical component would modify the entering workload. Modifying the workload specification allows us, for example, to capture the altered spatial locality that occurs when queues reorder their requests.
Our model predicts the device behavior in terms of response time within a relative error ranging from 2% to 30% for interesting subsets of the domain of devices and workloads. To demonstrate this, the model has been validated with synthetic traces of parallel scientific file system workloads and video-on-demand applications and traces of transaction processing applications.
Our contributions to the area of performance modeling for storage devices include the following:
- An infrastructure for developing a composite model. The infrastructure
supports the development of more complicated devices and workloads
than we have validated.
- Methods to approximate the mean seek time and rotational latency of
a disk mechanism using measures of workload spatial locality.
- Methods to approximate the miss probability and the full- and partial- hit
probabilities in an I/O system's data caches using measures of workload
spatial locality.
- Methods to approximate the queue delay for non-FCFS scheduling algorithms
using a description of the workload arrival process.
These methods can be composed to provide analytic estimation procedures for the behavior of a subset of current storage devices.

