MPI-SIM: Using Parallel Simulation To Evaluate MPI Programs
1998
Cited by 29 (11 self)
This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of MPI programs. MPI-LITE, a portable library that supports multithreaded MPI, is also described. MPI-SIM, which is built on top of MPI-LITE, can be used to predict the performance of existing MPI programs as a function of architectural characteristics, including the number of processors and message communication latencies. The simulation models can be executed sequentially or in parallel. Parallel executions of MPI-SIM models are synchronized using a set of asynchronous conservative protocols. MPI-SIM reduces synchronization overheads by exploiting the communication characteristics of the program that it simulates. The paper presents validation and performance results from the use of MPI-SIM to simulate applications from the NAS Parallel Benchmark suite. Using the techniques described in this paper, we were able to reduce the number of synchronizations in the parallel simula...
Performance Prediction of Large Parallel Applications Using Parallel Simulations
1999
Cited by 26 (11 self)
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct execution-driven, parallel simulator for performance prediction of programs that include both communication and I/O intensive applications. The simulator has been used to predict the performance of such applications on both distributed memory machines like the IBM SP and shared-memory machines like the SGI Origin 2000. The paper illustrates the usefulness of COMPASS as a versatile performance prediction tool. We use both real-world applications and synthetic benchmarks to study application scalability, sensitivity to communication latency, and the interplay between factors like communication pattern and parallel file system caching on application performance. We also show that the simulator is accurate in its predictions and that it is also efficient in its ability to use parallel si...
Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit
INTERN. J. HIGH PERF. COMP. APPLICATIONS, 2005
Cited by 13 (8 self)
This paper describes capabilities, evolution, performance, and applications of the Global Arrays (GA) toolkit. GA was created to provide application programmers with an interface that allows them to distribute data while maintaining the type of global index space and programming syntax similar to that available when programming on a single processor. The goal of GA is to free the programmer from the low level management of communication and allow them to deal with their problems at the level at which they were originally formulated. At the same time, compatibility of GA with MPI enables the programmer to take advantage of the existing MPI software/libraries when available and appropriate. The variety of applications that have been implemented using Global Arrays attests to the
Parallel Simulation of Parallel File Systems and I/O Programs
In Proc. of SuperComputing'97
Cited by 12 (4 self)
Efficient I/O implementations can have a significant impact on the performance of parallel applications. This paper describes the design and implementation of PIOSIM, a parallel simulation library for MPI-IO programs. The simulator can be used to predict the performance of existing MPI-IO programs as a function of architectural characteristics, caching algorithms, and alternative implementations of collective I/O operations. This paper describes the simulator and presents the results of a number of performance studies to evaluate the impact of the preceding factors on a set of MPI-IO benchmarks, including programs from the NAS benchmark suite.
This work was supported by the Advanced Research Projects Agency, ARPA/CSTO, under Contract F-3060294 -C-0273, "Scalable Systems Software Measurement and Evaluation".
1 Introduction
Simulation models are commonly used to evaluate the performance of parallel architectures and predict the impact of algorithmic and architectural innovations on t...
Improving Lookahead in Parallel Discrete Event Simulations of Large-Scale Applications Using Compiler Analysis
In Proc. 15th Workshop on Parallel and Distributed Simulation (PADS 01), Lake Arrowhead, 2001
Cited by 10 (3 self)
This paper addresses the issue of efficient and accurate performance prediction of large-scale message-passing applications on high performance architectures using simulation. Such simulators are often based on parallel discrete event simulation, typically using the conservative protocol to synchronize the simulation threads. The paper considers how a compiler can be used to automatically extract information about the lookahead present in the application, and how this can be used to improve the performance of the null protocol used for synchronization. These techniques are implemented in the MPI-Sim simulator and dHPF compiler, which had previously been extended to work together for optimizing the simulation of local computational components of an application. The results show that the availability of lookahead information improves the runtime of the simulator by factors ranging from 9% up to two orders of magnitude, with 30-60% improvements being typical for the real-world codes. The experiments also show that these improvements are directly correlated with reductions in the number of null messages required by the simulations.
Asynchronous Parallel Simulation of Parallel Programs
2000
Cited by 9 (5 self)
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message-passing, or written in UC, a data parallel language, compiled to use message-passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. This paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the ...
A Complexity-Effective Architecture for Accelerating Full-System Multiprocessor Simulations Using FPGAs
Cited by 8 (2 self)
Functional full-system simulators are powerful and versatile research tools for accelerating architectural exploration and advanced software development. Their main shortcoming is limited throughput when simulating systems with hundreds of processors or more. To overcome this bottleneck, we propose the PROTOFLEX simulation architecture, which uses FPGAs to accelerate simulation. Prior FPGA approaches that prototype a complete system in hardware are either too complex when scaling to large-scale configurations or require significant effort to provide full-system support. In contrast, PROTOFLEX reduces complexity by virtualizing the execution of many logical processors onto a consolidated set of multiple-context execution engines on the FPGA. Through virtualization, the number of engines can be judiciously scaled, as needed, to deliver on necessary simulation performance. To achieve low-complexity full-system support, a hybrid simulation technique called transplanting allows implementing in the FPGA only the frequently encountered behaviors, while a software simulator preserves the abstraction of a complete system. We have created a first instance of the PROTOFLEX simulation architecture, which is an FPGA-based, full-system functional simulator for a 16-way UltraSPARC III symmetric multiprocessor server hosted on a single Xilinx Virtex-II XCV2P70 FPGA. On average, the simulator achieves a 39x speedup (and as high as 49x) over comparable software simulation across a suite of applications, including OLTP on a commercial database server.
Perils And Pitfalls Of Parallel Discrete-Event Simulation
1996
Cited by 6 (0 self)
The design of efficient parallel discrete-event simulation (PDES) models often appears to be a mysterious art practiced primarily by academic researchers who have been rigorously ordained in this task. This tutorial attempts to unravel some of the mysteries. It describes the process of generating an efficient parallel implementation of a discrete-event simulation (DES) model. Common pitfalls in the parallel execution of the models are described together with suggestions on their avoidance.
1 INTRODUCTION
Parallel (or distributed) discrete-event simulation refers to the execution of a discrete-event simulation program on a parallel (or distributed) architecture (Fujimoto 1990). In recent years, interest in exploiting parallelism in the execution of discrete-event simulations in a number of domains including network design and configuration, personal communication systems, parallel programs, digital battlefields, and digital circuits has been growing. This demand has been fueled both ...
Implications of Application Usage Characteristics for Collective Communication Offload
International Journal of High Performance Computing and Networking, 2005
Cited by 4 (1 self)
The performance of collective communication operations is known to have a significant impact on the scalability of some applications. Indeed, the global, synchronous nature of some collective operations directly implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. This fact has led many researchers to try to improve the efficiency of collective operations. One popular approach improves the implementation of MPI collective operations by using intelligent or programmable network interfaces to offload the burden of communication activities from the host processor(s). Such implementations have shown significant improvement for microbenchmarks that isolate collective communication performance, but these results have not been shown to translate to significant increases in performance for real applications. In order for collective offload implementations to benefit real applications, a greater understanding of application behavior is needed. In this paper, we describe several characteristics of applications and application benchmarks that impact collective communication performance. We analyze network resource usage data in order to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI.
Index Terms: MPI, non-blocking, collective, resource usage, resource management, network interface

