MPI-SIM: Using Parallel Simulation To Evaluate MPI Programs
1998
Cited by 29 (11 self)
This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of MPI programs. MPI-LITE, a portable library that supports multithreaded MPI, is also described. MPI-SIM, which is built on top of MPI-LITE, can be used to predict the performance of existing MPI programs as a function of architectural characteristics, including the number of processors and message communication latencies. The simulation models can be executed sequentially or in parallel. Parallel executions of MPI-SIM models are synchronized using a set of asynchronous conservative protocols. MPI-SIM reduces synchronization overheads by exploiting the communication characteristics of the program that it simulates. The paper presents validation and performance results from the use of MPI-SIM to simulate applications from the NAS Parallel Benchmark suite. Using the techniques described in this paper, we were able to reduce the number of synchronizations in the parallel simula...
Memory Disambiguation To Facilitate Instruction-Level Parallelism Compilation
1995
Cited by 28 (1 self)
... to support low-level optimization and scheduling. A dynamic approach, the memory conflict buffer, originally proposed by Chen [1], is analyzed across a large suite of integer and floating-point benchmarks. A new static approach, termed sync arcs, involving the passing of explicit dependence arcs from the source-level code down to the low-level code, is proposed and evaluated. This investigation of both dynamic and static memory disambiguation allows a quantitative analysis of the tradeoffs between the two approaches.
An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors
Journal of Instruction-Level Parallelism, 2001
Cited by 17 (2 self)
We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at run-time. Besides resolving many name and true data dependencies through dynamic renaming and forwarding, respectively, our method supports parallel commit operations. Performance results collected on an architectural simulator and validated on a commercial multiprocessor show that the overhead can be reduced to less than ten instructions per speculative memory operation. Moreover, we demonstrate that a ten-fold speedup is possible on some of the difficult-to-parallelize loops in the Perfect Club benchmark suite on a 16-way multiprocessor.
On The Implementation And Effectiveness Of Autoscheduling For Shared-Memory Multiprocessors
1995
Cited by 16 (2 self)
[Figure 3.4: HPF approach to data partition and distribution.] ... states that iteration i is to be executed by the processor to which A(i) is assigned. Therefore processor p1 executes iterations {1, 2, 3, 4}. The ON clause is a feature borrowed from the language Kali [25]. 3.1.3 HPF. The High Performance Fortran (HPF) [6, 26, 27] language was designed as a set of extensions and modifications to Fortran 90 to support data parallel programming. The ability to achieve top performance on MIMD and SIMD computers with nonuniform memory access was one of the main goals of the project. The design of HPF was influenced by Fortran D and Vienna Fortran [28, 29]. Just as Fortran D approaches the problem of data partitioning and distribution in two stages, HPF uses three. First, arrays are aligned to each other. Second, arrays are distributed across a user-defined rectilinear arrangement of abstract processo...
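The rule quoted in this excerpt, that iteration i runs on the processor to which A(i) is assigned, can be illustrated with a small sketch. The excerpt does not say how A is distributed, so the BLOCK distribution, the array size of 16, the four processors, and the helper names `block_owner` and `iterations_on` are all assumptions made for illustration only.

```python
from math import ceil


def block_owner(i, n, p):
    """Return the 1-based processor that owns element i (1-based) of an
    n-element array under an assumed BLOCK distribution over p processors."""
    if not 1 <= i <= n:
        raise ValueError("index out of range")
    block = ceil(n / p)  # elements per processor; the last block may be short
    return (i - 1) // block + 1


def iterations_on(proc, n, p):
    """List the loop iterations that processor `proc` executes when
    iteration i is placed on the owner of A(i)."""
    return [i for i in range(1, n + 1) if block_owner(i, n, p) == proc]


if __name__ == "__main__":
    # Hypothetical sizes: 16 elements spread over 4 processors.
    print(iterations_on(1, 16, 4))
```

Under these assumed sizes the first processor receives iterations 1 through 4, which matches the set given in the excerpt; other distributions would give a different mapping.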
Low-Cost Thread-Level Data Dependence Speculation on Multiprocessors
In Fourth Workshop on Multithreaded Execution, Architecture and Compilation, 2000
Cited by 15 (2 self)
We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at run-time. Besides resolving many name and true data dependencies through dynamic renaming and forwarding, respectively, our method supports parallel commit operations and allows mis-speculated threads to restart earlier. Preliminary performance results collected on an architectural simulator show that the overhead can be reduced to less than ten instructions per speculative memory operation. Moreover, we demonstrate that a ten-fold speedup is possible on some of the difficult-to-parallelize loops in the Perfect Club benchmark suite on a 16-way multiprocessor.
Simple Register Spilling in a Retargetable Compiler
1995
Cited by 15 (3 self)
This paper describes the management of register spills in a retargetable C compiler. Spills are rare, which means that testing is a bigger problem than performance. The trade-offs have been arranged so that the common case (no spills) generates respectable code quickly and the uncommon case (spills) is less efficient but as simple as possible. The technique has proven practical and is in production use on VAX, Motorola 68020, SPARC and MIPS machines. KEY WORDS: ANSI C; code generation; compilers; register allocation; register spilling. INTRODUCTION: When register allocators run out of registers, they generate code to spill one or more busy registers into temporaries and code to reload those values when they are needed again. The trend in compiling research is increasing the sophistication --- and the implementation and execution costs --- of the techniques that avoid spills.
TOMLAB - A General Purpose, Open MATLAB Environment for Research and Teaching in Optimization
1998
Cited by 12 (11 self)
TOMLAB is a general purpose, open and integrated MATLAB environment for research and teaching in optimization on UNIX and PC systems. The motivation for TOMLAB is to simplify research on practical optimization problems, giving easy access to all types of solvers; at the same time having full access to the power of MATLAB. By using a simple, but general input format, combined with the ability in MATLAB to evaluate string expressions, it is possible to run internal TOMLAB solvers, MATLAB Optimization Toolbox and commercial solvers written in FORTRAN or C/C++ using MEX-file interfaces. Currently MEX-file interfaces have been developed for MINOS, NPSOL, NPOPT, NLSSOL, LPOPT, QPOPT and LSSOL. TOMLAB may either be used totally parameter driven or menu driven. The basic principles will be discussed. The menu system makes it suitable for teaching. Many standard test problems are included. More test problems are easily added. There are many example and demonstration files. Iterati...
An Invariant Subspace Approach in M/G/1 and G/M/1 Type Markov Chains
1995
Cited by 11 (6 self)
Let $A_k$, $k \ge 0$, be a sequence of $m \times m$ nonnegative matrices and let $A(z) = \sum_{k=0}^{\infty} A_k z^k$ be such that $A(1)$ is an irreducible stochastic matrix. The unique power-bounded solution of the nonlinear matrix equation $G = \sum_{k=0}^{\infty} A_k G^k$ has been shown to play a key role in the analysis of Markov chains of M/G/1 type. Assuming that the matrix $A(z)$ is rational, we show that the solution of this matrix equation reduces to finding an invariant subspace of a certain matrix. We present an iterative method for computing this subspace which is globally convergent. Moreover, the method can be implemented with quadratic or higher convergence rate matrix sign function iterations, which brings in a new dimension to the analysis of M/G/1 type Markov chains for which the existing algorithms may suffer from low linear convergence rates. The method can be viewed as a "bridge" between the matrix analytic methods and transform techniques whereas it circumvents the requirement for a large number of iterations which may be encountered in the methods of the former type and the root finding problem of the techniques of the latter type. Similar results are obtained for computing the unique power-summable solution of the matrix equation $R = \sum_{k=0}^{\infty} R^k A_k$, which appears in the analysis of G/M/1 type Markov chains.
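To make the matrix equation for G concrete, the sketch below solves it by plain functional iteration: start from the zero matrix and repeatedly substitute G into the right-hand side, the sum over k of A_k G^k. This is the classical, linearly convergent scheme of the kind the abstract contrasts against, not the invariant subspace or matrix sign function method the paper proposes. The 2 x 2 blocks A0, A1, A2, the truncation to three terms, and the helper names are hypothetical choices for illustration.

```python
def matmul(X, Y):
    """Multiply two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def madd(X, Y):
    """Add two matrices element by element."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def apply_map(A, G):
    """Evaluate sum_k A[k] * G^k with Horner's rule (G multiplies on the right)."""
    n = len(G)
    acc = [[0.0] * n for _ in range(n)]
    for Ak in reversed(A):
        acc = madd(Ak, matmul(acc, G))
    return acc


def solve_G(A, tol=1e-12, max_iter=10000):
    """Functional iteration G <- sum_k A[k] G^k, started from the zero matrix."""
    n = len(A[0])
    G = [[0.0] * n for _ in range(n)]
    for _ in range(max_iter):
        G_next = apply_map(A, G)
        change = max(abs(a - b) for ra, rb in zip(G_next, G) for a, b in zip(ra, rb))
        G = G_next
        if change < tol:
            break
    return G


# Hypothetical blocks; A0 + A1 + A2 has unit row sums (a stochastic matrix).
A0 = [[0.4, 0.2], [0.3, 0.3]]
A1 = [[0.1, 0.1], [0.1, 0.1]]
A2 = [[0.1, 0.1], [0.1, 0.1]]

if __name__ == "__main__":
    G = solve_G([A0, A1, A2])
    for row in G:
        print(row)
```

The iteration count of this scheme grows as the chain approaches instability, which is the slow linear convergence the abstract refers to when motivating faster methods.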
Asynchronous Parallel Simulation of Parallel Programs
2000
Cited by 9 (5 self)
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message-passing, or written in UC, a data parallel language, compiled to use message-passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. This paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the ...
Modulo Scheduling for Control-Intensive General-Purpose Programs
1997
Cited by 7 (0 self)
It is increasingly necessary for the compiler to overlap successive loop iterations in order to find sufficient instruction-level parallelism to effectively utilize the resources of high-performance processors. Two competing methods have been developed for moving instructions across iteration boundaries: unrolling followed by global acyclic scheduling and software pipelining. This dissertation investigates modulo scheduling, a software pipelining technique. Much of the previous work on modulo scheduling has targeted the relatively well-behaved loops in numeric programs. This dissertation develops new techniques that allow modulo scheduling to be effectively applied to control-intensive non-numeric programs. These techniques overcome the restrictions imposed by problematic control flow and loop exits. This dissertation also demonstrates that unrolling-based optimization prior to scheduling improves the performance of modulo scheduled loops and is, in fact, necessary to allow modulo scheduling to surpass the performance of acyclic scheduling for control-intensive general-purpose programs. Modulo scheduling has the following advantages over the acyclic scheduling approach for control-intensive general-purpose programs. First, modulo scheduling increases performance by maintaining the overlap of loop iterations throughout the execution of the loop. Second,

