Automatic Parallel I/O Performance Optimization in Panda
- In Proceedings of the 10th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1998
Cited by 9 (2 self)
Parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of applications has become a significant challenge. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects optimal I/O plans automatically without human intervention based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses rule-based and randomized search-based algorithms to select optimal parameter settings in Panda, a parallel I/O library for multidimensional arr...
Reactive Scheduling For Parallel I/O Systems
, 2000
Cited by 9 (1 self)
Parallel computing is integral to high performance computing, but it is not uniquely sufficient. With the adoption of parallel computing, some additional supporting technologies are required. Parallel I/O is one such supporting technology, providing high speed data storage in parallel computing environments. Parallel I/O systems have emerged and are beginning to see use in the mainstream; however, research into optimizing these systems is still an open area. In particular, techniques for optimizing parallel I/O have focused on disk performance optimization when other resources might have equal or greater impact on overall performance. Other work has looked at adaptive techniques for optimizing these systems, but has focused on caching and prefetching only.
Galley: A New Parallel File System For Scientific Workloads
, 1996
Cited by 9 (2 self)
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/O requirements of parallel scientific applications. Most multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access those multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated application and library programmers to use knowledge about their I/O to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. In this work we examine current multiprocessor file systems, as well as how those file systems are used by scientific applications. Contrary to the expectations of the designers of current parallel file systems, the workloads on those systems are dominated by requests to read and write small pieces of data. Furthermore, rather than being accessed sequentially and contiguously, as in uniprocessor and supercomputer workloads, files in multiprocessor file systems are accessed in regular, structured, but non-contiguous patterns. Based on our observations of multiprocessor workloads, we have designed Galley, a new parallel
DPFS: A Distributed Parallel File System
- In Proceedings of the International Conference on Parallel Processing
, 2001
Cited by 7 (0 self)
One of the challenges brought by large-scale scientific applications is how to avoid remote storage access by collectively using enough local storage resources to hold the huge amount of data generated by the simulation while providing high performance I/O. DPFS, a Distributed Parallel File System, is designed and implemented to address this problem. DPFS collects locally distributed unused storage resources as a supplement to the internal storage of parallel computing systems to satisfy the storage capacity requirement of large-scale applications. In addition, like parallel file systems, DPFS provides striping mechanisms that divide a file into small pieces and distribute them across multiple storage devices for parallel data access. The unique feature of DPFS is that it provides three file levels, with each file level corresponding to a file striping method. In addition to the traditional linear striping method, DPFS also provides a novel multidimensional striping method that can solve performance problems of linear striping for many popular access patterns. Other issues such as load balancing and user interface are also addressed in DPFS.
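As a rough illustration of the linear striping mentioned in the DPFS abstract, the Python sketch below maps file byte offsets to storage devices in round-robin order. The function names, stripe size, device count, and array shape are all assumptions made for this example and are not taken from DPFS. The column-wise access is one plausible case of an access pattern that linear striping serves poorly.

```python
def linear_stripe(offset, stripe_size, num_devices):
    """Map a file byte offset to (device index, byte offset on that device).

    Stripes are assigned to devices round-robin: stripe 0 goes to device 0,
    stripe 1 to device 1, and so on, wrapping around after the last device.
    """
    stripe_index = offset // stripe_size
    device = stripe_index % num_devices
    # Number of earlier stripes that already landed on this same device.
    local_stripe = stripe_index // num_devices
    return device, local_stripe * stripe_size + offset % stripe_size


def devices_touched(offsets, stripe_size, num_devices):
    """Return the sorted list of devices that serve the given byte offsets."""
    return sorted({linear_stripe(o, stripe_size, num_devices)[0] for o in offsets})


# Hypothetical setup: 4 devices, 64-byte stripes, and a row-major 2D array
# whose rows are 256 bytes long (exactly one full round of stripes).
STRIPE_SIZE = 64
NUM_DEVICES = 4
ROW_BYTES = 256

# Reading one full row spreads the work across every device.
row_access = devices_touched(range(ROW_BYTES), STRIPE_SIZE, NUM_DEVICES)

# Reading the first byte of 8 consecutive rows (a column) hits one device only.
column_offsets = [row * ROW_BYTES for row in range(8)]
column_access = devices_touched(column_offsets, STRIPE_SIZE, NUM_DEVICES)

print("row access uses devices:", row_access)
print("column access uses devices:", column_access)
```

In this setup the row read is served by all four devices, while the column read is serialized on a single device. That kind of imbalance is what a striping method aware of array dimensions would try to avoid.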
Extending I/O through high performance data services
- In Cluster Computing
, 2007
Cited by 5 (4 self)
The complexity of HPC systems has increased the burden on the developer as applications scale to hundreds of thousands of processing cores. Moreover, additional efforts are required to achieve acceptable I/O performance, where it is important how I/O is performed, which resources are used, and where I/O functionality is deployed. Specifically, by scheduling I/O data movement and by effectively placing operators affecting data volumes or information about the data, tremendous gains can be achieved both in the performance of simulation output and in the usability of output data. Previous studies have shown the value of using asynchronous I/O, of employing a staging area, and of performing select operations on data before it is written to disk. Leveraging such insights, this paper develops and experiments with higher level I/O abstractions, termed “data services”, that manage output data from ‘source to sink’: where/when it is captured, transported towards storage, and filtered or manipulated by service functions to improve its information content. Useful services include data reduction, data indexing, and those that manage how I/O is performed, i.e., the control aspects of data movement. Our data services implementation distinguishes control aspects – the control plane – from data movement – the data plane, so that both may be changed separably. This results in runtime flexibility not only in which services to employ, but also in where to deploy them and how they use I/O resources. The outcome is consistently high levels of I/O performance at large scale, without requiring application change.
I/O Scheduling Service for Multi-Application Clusters
- In Proceedings of IEEE Cluster 2006
, 2006
Cited by 3 (0 self)
Distributed applications, especially the ones being I/O intensive, often access the storage subsystem in a non-sequential way (stride requests). Since such behaviors lower the overall system performance, many applications use parallel I/O libraries such as ROMIO to gather and reorder requests. In the meantime, as cluster usage grows, several applications are often executed concurrently, competing for access to storage subsystems and, thus, potentially canceling optimizations brought by Parallel I/O libraries. The aIOLi project aims at optimizing the I/O accesses within the cluster and providing a simple POSIX API. This article presents an extension of aIOLi to address the issue of disjoint
ViMPIOS, a "Truly" Portable MPI-IO Implementation
, 2000
Cited by 3 (1 self)
We present ViMPIOS, a novel MPI-IO implementation based on ViPIOS, the Vienna Parallel Input Output System. ViMPIOS inherits the defining characteristics of ViPIOS, which makes it a client-server based system focusing on cluster architectures. ViMPIOS stands out from all other MPI-IO implementations by its "truly" portable design, which allows applications not only to be transferred between parallel architectures easily but also to keep their original performance characteristics on the new platform as far as possible. This is achieved by the "smart" AI-Blackboard module of ViPIOS, which is responsible for an appropriate data layout. Specifically, in this paper we concentrate on the algorithm that maps MPI-IO data structures onto the respective ViPIOS structures, and thus allows the ViPIOS properties to be exploited. 1 Introduction Over the last few years the so-called I/O bottleneck turned out as the limiting factor in high-performance computing. Thus, the performance of parallel systems in many a...
The Impact of Spatial Layout of Jobs on Parallel I/O Performance
- In IOPADS ’99: Proceedings of the sixth workshop on I/O in parallel and distributed systems
, 1999
Cited by 3 (0 self)
Input/Output is a big obstacle to effective use of teraflops-scale computing systems. Motivated by earlier parallel I/O measurements on an Intel TFLOPS machine, we conduct studies to determine the sensitivity of parallel I/O performance on multi-programmed mesh-connected machines with respect to number of I/O nodes, number of compute nodes, network link bandwidth, I/O node bandwidth, spatial layout of jobs, and read or write demands of applications. Our extensive simulations and analytical modeling yield important insights into the limitations on parallel I/O performance due to network contention, and into the possible gains in parallel I/O performance that can be achieved by tuning the spatial layout of jobs. Applying these results, we devise a new processor allocation strategy that is sensitive to parallel I/O traffic and the resulting network contention. In performance evaluations driven by synthetic workloads and by a real workload trace captured at the San Diego Supercomputing Cen...
Design of a next generation sampling service for large scale data analysis applications
- In Proceedings of the 19th Annual International Conference on Supercomputing (ICS05)
, 2005
Cited by 3 (1 self)
Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling with great effect to scale up performance on these data sets with small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set, not the sample, which is expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples. To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation
PreDatA - Preparatory Data Analytics on Peta-Scale Machines
Cited by 3 (1 self)
Peta-scale scientific applications running on High End Computing (HEC) platforms can generate large volumes of data. For high performance storage and in order to be useful to science end users, such data must be organized in its layout, indexed, sorted, and otherwise manipulated for subsequent data presentation, visualization, and detailed analysis. In addition, scientists desire to gain insights into selected data characteristics ‘hidden’ or ‘latent’ in the massive datasets while data is being produced by simulations. PreDatA, short for Preparatory Data Analytics, is an approach for preparing and characterizing data while it is being produced by the large scale simulations running on peta-scale machines. By dedicating additional compute nodes on the peta-scale machine as staging nodes and staging the simulation’s output data through these nodes, PreDatA can exploit their computational power to perform selected data manipulations with lower latency than attainable by first moving data into file systems and storage. Such in-transit manipulations are supported by the PreDatA middleware through RDMA-based data movement to reduce write latency, application-specific operations on streaming data that are able to discover latent data characteristics, and appropriate data reorganization and metadata annotation to speed up subsequent data access. As a result, PreDatA enhances the scalability and flexibility of the current I/O stack on HEC platforms and is useful for data pre-processing, runtime data analysis and inspection, as well as for data exchange between concurrently running simulation models. Performance evaluations with several production peta-scale applications on Oak Ridge National Laboratory’s Leadership Computing Facility demonstrate the feasibility and advantages of the PreDatA approach.

