Performance Prediction of Parallel Processing Systems: The PAMELA Methodology
in Proc. 7th ACM Int. Conf. on Supercomputing, 1993
Cited by 37 (17 self)
In this paper we present a new methodology for the performance prediction of parallel programs on parallel platforms ranging from shared-memory to distributed-memory (vector) machines. The methodology comprises a procedural program and machine specification paradigm based on Pamela (PerformAnce ModEling LAnguage), along with a performance calculus, called "serialization analysis". This calculus extends conventional parallel program analysis technology by explicitly accounting for resource contention, yet at the low evaluation cost typical for static techniques. It is shown that, where conventional techniques introduce fundamental errors, predictions from serialization analysis remain realistic. Apart from the merits of the methodology itself, this high reliability/cost ratio makes Pamela an attractive candidate for compile-time application within the performance prediction hierarchy often found in parallel programming environments. 1 Introduction The performance of a concurrent syste...
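The abstract's central claim, that ignoring resource contention produces fundamental prediction errors, can be illustrated with a toy calculation. The sketch below is not the Pamela calculus or serialization analysis itself; it only assumes a hypothetical workload of identical tasks that each compute privately and then use one shared resource that serves a single task at a time. All function and parameter names are invented for this illustration.

```python
def contention_free_estimate(compute_time: float, service_time: float) -> float:
    """Predicted completion time when contention is ignored.

    Assumes every task finds the shared resource idle, so the prediction
    does not depend on the number of tasks.
    """
    return compute_time + service_time


def contention_aware_bound(num_tasks: int, compute_time: float, service_time: float) -> float:
    """Lower bound on completion time with one exclusive shared resource.

    No task can finish before compute_time + service_time, and the resource
    needs num_tasks * service_time in total, so the larger value is a bound.
    """
    critical_path = compute_time + service_time
    total_service_demand = num_tasks * service_time
    return max(critical_path, total_service_demand)


if __name__ == "__main__":
    # Hypothetical parameters: 10 time units of private work, 2 on the resource.
    for num_tasks in (1, 4, 16, 64):
        naive = contention_free_estimate(10.0, 2.0)
        bound = contention_aware_bound(num_tasks, 10.0, 2.0)
        print(f"tasks={num_tasks:3d}  naive={naive:6.1f}  bound={bound:6.1f}")
```

For small task counts the two values coincide; once the total demand on the shared resource exceeds the per-task path, the contention-free estimate falls below a time the system cannot beat.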
The Search for Lost Cycles: A New Approach to Parallel Program Performance Evaluation
In Proceedings of Supercomputing '94, 1993
Cited by 21 (2 self)
Traditional performance debugging and tuning of parallel programs is based on the "measure-modify" approach, in which detailed measurements of program executions are used to guide incremental changes to the program that result in better performance. Unfortunately, the performance of a parallel algorithm is often related to its implementation, input data, and machine characteristics in surprising ways, and the "measure-modify" approach is unsuited to exploring these relationships fully: it is too heavily dependent on experimentation and measurement, which is impractical for studying the large number of variables that can affect parallel program performance. In this paper we argue that the problem of selecting the best implementation of a parallel algorithm requires a new approach to parallel program performance evaluation, one with a greater balance between measurement and modeling. We first present examples that demonstrate that different parallelizations of a program may be necessary ...
Compiling Performance Models from Parallel Programs
In Proceedings of the 8th ACM International Conference on Supercomputing, 1994
Cited by 4 (0 self)
A technique is described to automatically compile performance models in the course of program translation. The performance models are fully symbolic in order to preserve as much diagnostic information as possible. Although compiled statically, the models account for the effects of resource contention, due to the introduction of a novel algorithm within the symbolic compilation scheme. It is shown that the compilation approach fundamentally outperforms traditional static estimation procedures in terms of precision at a negligible increase in cost. This claim is illustrated by a case study of an LU factorization algorithm on a multiprocessor. 1 Introduction Low-cost, compile-time performance prediction provides essential, early feedback to enable program and machine parameter optimization by both the user and the compiler. In this paper we present a technique to automatically compile a symbolic performance model which accurately predicts the execution time of a parallel program given a...
On the Analysis of PAMELA Models
1993
Cited by 4 (3 self)
While last year's report [16] loosely introduced the general concepts behind the Pamela approach toward modeling and analysis of parallel systems, this report exclusively focuses on the calculus of the methodology. In particular, it defines an algorithmic approach toward serialization analysis, which enables (future) mechanization of the analysis. Thus, a technique is developed to automatically compile symbolic performance models in the course of program translation. It is shown that the resulting performance models fundamentally outperform traditional static estimation approaches at a negligible increase in cost. This claim is illustrated by two case studies, i.e., an LU factorization algorithm on a multiprocessor, and a matrix-vector update on a multicomputer. Contents: 1 Introduction; 2 Analysis; 2.1 Mathematical Preliminaries; 2.2 Formalism; 2.3 Homomorphic Mapping ...
The PAMELA Approach To The Performance Simulation Of Parallel And Distributed Systems
in Proc. European Simulation Symposium, 1993
Cited by 3 (3 self)
We present a new methodology for the efficient performance simulation of imperative, explicit parallel programs running on shared-memory or distributed-memory (vector) machines. The methodology is based on a program and machine specification formalism called Pamela (PerformAnce ModEling LAnguage), and an associated calculus which allows performance models to be optimized prior to subsequent simulation. The (compile-time) application of this calculus yields models which retain high prediction accuracy whereas simulation costs may be reduced by orders of magnitude compared to traditional approaches. 1 INTRODUCTION As the peak performance of parallel computer systems steadily increases, application design support tends to fall behind in view of the relatively low performance achieved in practice. Given the need to integrate performance analysis at various intermediate stages of the design process, a performance modeling methodology which offers low-cost predictions rapidly becomes a criti...
The PAMELA Approach to Performance Modeling of Parallel and Distributed Systems
1993
Cited by 3 (0 self)
We present a new methodology for the performance prediction of imperative, explicit parallel programs running on shared-memory or distributed-memory (vector) machines. The methodology is based on an imperative, procedure-oriented program and machine specification formalism called Pamela with an associated calculus which allows performance models to be reduced prior to subsequent simulation. The introduction of this novel compile-time reduction technique enables a combination of the reliability typical for dynamic approaches with the low cost typical for static prediction methods. 1. INTRODUCTION While the peak performance of parallel computer systems steadily increases, application design support tends to fall behind in view of the relatively low performance achieved in practice. Given the need to integrate performance analysis at various intermediate stages of the design process, a performance modeling methodology which offers low-cost predictions rapidly becomes a critical success ...
Performance Estimation for Embedded Systems
2000
Cited by 2 (2 self)
In this document we propose a symbolic performance modeling technique to be used as the basis of the JOSES cost estimator. The approach is inspired by the need for highly parametric cost models in the initial stages in parallel program design, where absolute prediction accuracy is of less priority than solution cost, and where symbolic feedback on the effects of user mapping decisions and machine parameters is of primary concern. As illustrated by the case study, the symbolic approach provides good feedback on the effects of partitioning choices as well as the influence of computation and communication parameters on application performance.
TLB Performance in Multiprocessors
1991
Cited by 1 (0 self)
This paper compares the performance, in highly-parallel shared-memory multiprocessors, of locating translation-lookaside buffers (TLBs) at processors with that of locating TLBs at memory. Our performance comparison is based on results of trace-driven simulations of multiprocessors with logN-stage networks interconnecting N processors and N memory modules. For the systems and workloads studied, memory-based TLBs perform better than processor-based TLBs, provided that memory is organized as multiple paging arenas, where the mapping of pages to arenas is fixed. The cost of a processor-based TLB reload is at least logN because of network transit. The cost of a memory-based TLB reload can be made smaller than that of a processor-based TLB reload, since network transits are not required. Furthermore, with multiple paging arenas, the number of reloads is smaller with memory-based TLBs. For memory-based TLBs to continue to outperform processor-based TLBs for large N, it is likely that the numb...
unknown title
Current analytic solutions to the execution time distribution of an N-ary parallel composition of tasks having independent and identically distributed execution times are computationally complex, except for a limited number of distributions. In this paper we introduce an analytical solution based on approximating the execution time distributions in terms of a limited number of statistical moments. This approach allows the parallel execution time to be approximated with O(1) solution complexity for a wide range of execution time distributions, while the approximation accuracy outperforms comparable techniques known to date. Experiments show that the error of the predicted mean value of the parallel execution time is even less than 4% for parallel loops comprising up to 10,000 tasks whose execution times are normally distributed. Measurements on real programs (NAS-EP benchmark, PSRS sorter, and WATOR simulator) confirm these results provided the task execution distributions are independent and unimodal.
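The quantity this abstract targets, the execution time of N parallel tasks with independent and identically distributed durations, is the maximum of N random variables. The sketch below does not reproduce the paper's moment-based approximation. It only estimates the mean of that maximum by Monte Carlo simulation for normally distributed task times and compares it with the standard upper bound `mean + stddev * sqrt(2 ln N)`, which also costs O(1) to evaluate. The function names, parameters, and seed are assumptions made for this illustration.

```python
import math
import random


def simulate_mean_parallel_time(num_tasks: int, mean: float, stddev: float,
                                trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of num_tasks i.i.d. normal task times]."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    total = 0.0
    for _ in range(trials):
        # A parallel section finishes when its slowest task finishes.
        total += max(rng.gauss(mean, stddev) for _ in range(num_tasks))
    return total / trials


def upper_bound_mean_parallel_time(num_tasks: int, mean: float, stddev: float) -> float:
    """Classic upper bound on the expected maximum of i.i.d. normal variables.

    This is a textbook bound, not the approximation proposed in the paper.
    """
    return mean + stddev * math.sqrt(2.0 * math.log(num_tasks))


if __name__ == "__main__":
    # Hypothetical workload: 100 tasks, mean time 10, standard deviation 1.
    simulated = simulate_mean_parallel_time(100, 10.0, 1.0, trials=2000)
    bound = upper_bound_mean_parallel_time(100, 10.0, 1.0)
    print(f"simulated mean parallel time: {simulated:.3f}")
    print(f"closed-form upper bound:      {bound:.3f}")
```

The simulation cost grows with both the task count and the number of trials, which is the kind of expense a closed-form approximation avoids.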

