Results 1–10 of 29
MPI-SIM: Using Parallel Simulation to Evaluate MPI Programs
, 1998
Abstract

Cited by 32 (11 self)
This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of MPI programs. MPI-LITE, a portable library that supports multithreaded MPI, is also described. MPI-SIM, which is built on top of MPI-LITE, can be used to predict the performance of existing MPI programs as a function of architectural characteristics, including number of processors and message communication latencies. The simulation models can be executed sequentially or in parallel. Parallel executions of MPI-SIM models are synchronized using a set of asynchronous conservative protocols. MPI-SIM reduces synchronization overheads by exploiting the communication characteristics of the program that it simulates. The paper presents validation and performance results from the use of MPI-SIM to simulate applications from the NAS Parallel Benchmark suite. Using the techniques described in this paper, we were able to reduce the number of synchronizations in the parallel simula...
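The asynchronous conservative protocols mentioned above can be illustrated with a minimal, hypothetical sketch (not MPI-SIM's actual implementation): a logical process may only simulate up to the minimum timestamp promised by its incoming channels plus a lookahead, which in this setting would come from the simulated program's minimum message-communication latency.

```python
# Hypothetical sketch of an asynchronous conservative synchronization rule:
# a logical process (LP) may safely simulate events up to the minimum
# timestamp promised by its neighbours plus a lookahead (here, a minimum
# message latency). All names and values below are made up for illustration.

def safe_time(channel_clocks, lookahead):
    """Earliest time up to which this LP can process events without risk
    of later receiving a straggler message with a smaller timestamp."""
    return min(channel_clocks.values()) + lookahead

# Latest timestamp received (or promised via a null message) on each
# incoming channel from another simulated process.
clocks = {'rank0': 120.0, 'rank2': 95.0, 'rank3': 110.0}
print(safe_time(clocks, lookahead=5.0))  # 100.0
```

Exploiting the application's communication pattern amounts to raising the lookahead (or skipping synchronization with ranks that provably cannot send a message), which shrinks the number of protocol synchronizations.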
Memory Disambiguation to Facilitate Instruction-Level Parallelism Compilation
, 1995
Abstract

Cited by 29 (1 self)
... to support low-level optimization and scheduling. A dynamic approach, the memory conflict buffer, originally proposed by Chen [1], is analyzed across a large suite of integer and floating-point benchmarks. A new static approach, termed sync arcs, involving the passing of explicit dependence arcs from the source-level code down to the low-level code, is proposed and evaluated. This investigation of both dynamic and static memory disambiguation allows a quantitative analysis of the tradeoffs between the two approaches.
An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors
 Journal of Instruction-Level Parallelism
, 2001
Abstract

Cited by 19 (2 self)
We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at runtime. Besides resolving many name and true data dependencies through dynamic renaming and forwarding, respectively, our method supports parallel commit operations. Performance results collected on an architectural simulator and validated on a commercial multiprocessor show that the overhead can be reduced to less than ten instructions per speculative memory operation. Moreover, we demonstrate that a tenfold speedup is possible on some of the difficult-to-parallelize loops in the Perfect Club benchmark suite on a 16-way multiprocessor.
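The kind of checking such a system performs can be sketched in a few lines. This is a hypothetical Python illustration of the general technique, not the authors' compiler-inserted code: each speculative access records its address; a load forwards the value from the most recent logically earlier store, and a store detects a violation if a logically later thread has already loaded the address.

```python
# Hypothetical sketch of thread-level data-dependence speculation checks.
# Threads have a logical (sequential) order; a store by an earlier thread
# to an address already read by a later thread is a dependence violation.

class SpeculationTable:
    def __init__(self):
        self.loads = {}   # address -> set of thread ids that speculatively loaded it
        self.stores = {}  # address -> latest (in logical order) thread id that stored it

    def spec_load(self, tid, addr, memory):
        self.loads.setdefault(addr, set()).add(tid)
        # Forwarding: read the most recent store from a logically earlier thread.
        writer = self.stores.get(addr)
        if writer is not None and writer < tid:
            return ('forwarded', writer)
        return ('memory', memory.get(addr))

    def spec_store(self, tid, addr):
        # Violation: a logically later thread already loaded this address.
        violators = {t for t in self.loads.get(addr, set()) if t > tid}
        prev = self.stores.get(addr)
        if prev is None or prev < tid:
            self.stores[addr] = tid
        return violators  # these threads must squash and restart

table = SpeculationTable()
mem = {0x100: 7}
print(table.spec_load(tid=2, addr=0x100, memory=mem))  # ('memory', 7)
print(table.spec_store(tid=1, addr=0x100))             # {2}: thread 2 violated
```

The per-access work here is a dictionary lookup plus a small set update, which is consistent with the abstract's point that the overhead can be driven down to a handful of instructions per speculative memory operation.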
Low-Cost Thread-Level Data Dependence Speculation on Multiprocessors
 In Fourth Workshop on Multithreaded Execution, Architecture and Compilation
, 2000
Abstract

Cited by 18 (2 self)
We present a software approach to design a thread-level data dependence speculation system targeting multiprocessors. Highly-tuned checking codes are associated with loads and stores whose addresses cannot be disambiguated by parallel compilers and that can potentially cause dependence violations at runtime. Besides resolving many name and true data dependencies through dynamic renaming and forwarding, respectively, our method supports parallel commit operations and allows mis-speculated threads to restart earlier. Preliminary performance results collected on an architectural simulator show that the overhead can be reduced to less than ten instructions per speculative memory operation. Moreover, we demonstrate that a tenfold speedup is possible on some of the difficult-to-parallelize loops in the Perfect Club benchmark suite on a 16-way multiprocessor.
On the Implementation and Effectiveness of Autoscheduling for Shared-Memory Multiprocessors
, 1995
Abstract

Cited by 16 (2 self)
[Figure 3.4: HPF approach to data partition and distribution.] ... states that iteration i is to be executed by the processor to which A(i) is assigned. Therefore processor p1 executes iterations {1, 2, 3, 4}. The ON clause is a feature borrowed from the language Kali [25]. 3.1.3 HPF: The High Performance Fortran (HPF) [6, 26, 27] language was designed as a set of extensions and modifications to Fortran 90 to support data-parallel programming. The ability to achieve top performance on MIMD and SIMD computers with non-uniform memory access was one of the main goals of the project. The design of HPF was influenced by Fortran D and Vienna Fortran [28, 29]. Just as Fortran D approaches the problem of data partitioning and distribution in two stages, HPF uses three. First, arrays are aligned to each other. Second, arrays are distributed across a user-defined rectilinear arrangement of abstract processo...
Simple Register Spilling in a Retargetable Compiler
 Software: Practice and Experience, Vol. 22(1), 89–99 (January 1992)
, 1992
Abstract

Cited by 16 (4 self)
This paper describes the management of register spills in a retargetable C compiler. Spills are rare, which means that testing is a bigger problem than performance. The tradeoffs have been arranged so that the common case (no spills) generates respectable code quickly and the uncommon case (spills) is less efficient but as simple as possible. The technique has proven practical and is in production use on VAX, Motorola 68020, SPARC and MIPS machines.
TOMLAB – A General Purpose, Open MATLAB Environment for Research and Teaching in Optimization
, 1998
Abstract

Cited by 14 (13 self)
TOMLAB is a general purpose, open and integrated MATLAB environment for research and teaching in optimization on UNIX and PC systems. The motivation for TOMLAB is to simplify research on practical optimization problems, giving easy access to all types of solvers while at the same time having full access to the power of MATLAB. By using a simple but general input format, combined with the ability in MATLAB to evaluate string expressions, it is possible to run internal TOMLAB solvers, the MATLAB Optimization Toolbox, and commercial solvers written in FORTRAN or C/C++ using MEX-file interfaces. Currently, MEX-file interfaces have been developed for MINOS, NPSOL, NPOPT, NLSSOL, LPOPT, QPOPT and LSSOL. TOMLAB may be used either totally parameter driven or menu driven. The basic principles will be discussed. The menu system makes it suitable for teaching. Many standard test problems are included. More test problems are easily added. There are many example and demonstration files. Iterati...
Software distribution using XNETLIB
 ACM Transactions on Mathematical Software
, 1995
Abstract

Cited by 12 (6 self)
Xnetlib is a new tool for software distribution. Whereas its predecessor netlib uses email as the user interface to its large collection of public-domain mathematical software, xnetlib uses an X Window interface and socket-based communication. Xnetlib makes it easy to search through a large distributed collection of software and to retrieve requested software in seconds.
An Invariant Subspace Approach in M/G/1 and G/M/1 Type Markov Chains
, 1995
Abstract

Cited by 11 (6 self)
Let A_k, k ≥ 0, be a sequence of m × m nonnegative matrices and let A(z) = Σ_{k=0}^∞ A_k z^k be such that A(1) is an irreducible stochastic matrix. The unique power-bounded solution of the nonlinear matrix equation G = Σ_{k=0}^∞ A_k G^k has been shown to play a key role in the analysis of Markov chains of M/G/1 type. Assuming that the matrix A(z) is rational, we show that the solution of this matrix equation reduces to finding an invariant subspace of a certain matrix. We present an iterative method for computing this subspace which is globally convergent. Moreover, the method can be implemented with quadratically or higher-order convergent matrix sign function iterations, which brings a new dimension to the analysis of M/G/1 type Markov chains, for which the existing algorithms may suffer from low linear convergence rates. The method can be viewed as a "bridge" between the matrix-analytic methods and transform techniques, while it circumvents the requirement for a large number of iterations which may be encountered in the methods of the former type and the root-finding problem of the techniques of the latter type. Similar results are obtained for computing the unique power-summable solution of the matrix equation R = Σ_{k=0}^∞ R^k A_k, which appears in the analysis of G/M/1 type Markov chains.
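For concreteness, the equation G = Σ_{k=0}^∞ A_k G^k is often solved by the classical fixed-point iteration, whose merely linear convergence is exactly what the invariant-subspace method is designed to avoid. The NumPy sketch below illustrates that baseline iteration; the matrices A_k are made up for the example and are not from the paper.

```python
import numpy as np

# Made-up M/G/1-type example: A_0, A_1, A_2 nonnegative with A(1) stochastic.
A = [np.array([[0.4, 0.1], [0.2, 0.3]]),   # A_0
     np.array([[0.2, 0.1], [0.1, 0.2]]),   # A_1
     np.array([[0.1, 0.1], [0.1, 0.1]])]   # A_2

def solve_G(A, tol=1e-12, max_iter=100_000):
    """Classical fixed-point iteration G <- sum_k A_k G^k.

    Converges only linearly; the invariant-subspace approach in the paper
    replaces this with quadratically (or faster) convergent iterations."""
    m = A[0].shape[0]
    G = np.zeros((m, m))
    for _ in range(max_iter):
        G_next = sum(Ak @ np.linalg.matrix_power(G, k) for k, Ak in enumerate(A))
        if np.max(np.abs(G_next - G)) < tol:
            return G_next
        G = G_next
    raise RuntimeError("fixed-point iteration did not converge")

G = solve_G(A)
# For a positive recurrent chain, G is stochastic (rows sum to 1).
print(G.sum(axis=1))
```

This example is positive recurrent (the mean drift Σ_k k·A_k·1 is 0.7 per row, below 1), so the computed G comes out stochastic.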
Asynchronous Parallel Simulation of Parallel Programs
, 2000
Abstract

Cited by 11 (5 self)
Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task- and data-parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message-passing, or written in UC, a data-parallel language, compiled to use message-passing. The simulation models can be executed sequentially or in parallel. Parallel executions of the models are synchronized using a set of asynchronous conservative protocols. This paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message-passing and data-parallel languages. We present the validation and performance results for the ...