## Analyzing the Behavior and Performance of Parallel Programs (1993)

Venue: | Univ. of Wisconsin-Madison, UW CS Tech. Rep |

Citations: | 40 - 5 self |

### BibTeX

@TECHREPORT{Adve93analyzingthe,

author = {Vikram S. Adve},

title = {Analyzing the Behavior and Performance of Parallel Programs},

institution = {Univ. of Wisconsin-Madison, UW CS Tech. Rep},

year = {1993}

}


### Abstract

An analytical performance model for parallel programs can provide qualitative insight as well as efficient quantitative evaluation and prediction of parallel program performance. While stochastic models for parallel programs can represent execution time variance due to communication and resource contention delays, a qualitative assessment of previous models shows that the stochastic assumption makes it extremely difficult to compute synchronization costs and overall execution times. This thesis first re-evaluates the need for the stochastic assumption by examining the influence of non-deterministic communication and resource contention delays on execution times in parallel programs. An analytical model of program behavior, combined with detailed program measurements, provides compelling evidence that in shared-memory programs on current systems as well as programs with similar granularity on foreseeable future systems, such delays introduce extremely low variance into the execution tim...

### Citations

712 |
SPLASH: Stanford Parallel Applications for Shared-Memory
- Singh, Weber, et al.
- 1992
Citation Context ...he applications (MP3D, Locus Route, Water and Barnes) are from the Splash suite, which was developed to provide a realistic set of parallel applications for performance evaluation of parallel systems [SWG92]. The other three are also real applications in the sense that they were written to solve computationally intensive problems of interest to their authors. Hydro is a parallel simulation of particle mo... |

497 |
Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities
- Amdahl
- 1967
Citation Context ...e exploited to derive fundamental principles of program behavior and their impact on performance. (A simple and well-known example of an analytical model fulfilling such a role is Amdahl's Law [Amd67].) Furthermore, because analytical models have the ability to view a program and the underlying system at a higher level of abstraction than measurement or simulation techniques, they can play an impo... |

497 | LogP: Towards a realistic model of parallel computation - Culler, Karp, et al. - 1993 |

347 |
Stochastic Modeling and the Theory of Queues
- Wolff
- 1989
Citation Context ...or Increasing Failure Rate, if $F(0) = 0$ and if, for any $t_0 > 0$, $\frac{1 - F(t)}{1 - F(t + t_0)}$ is monotone increasing in $t$. IFR distributions are continuous and have decreasing mean residual life [Wol89]. For example, the Erlang and exponential are both IFR but the hyperexponential distribution is not. [...] communication demand in each iteration can be equally divided among (an arbitrary number o... |

227 | An Evaluation of Directory Schemes for Cache Coherence
- Agarwal, Simoni, et al.
- 1988
Citation Context ...our conclusions under high communication overhead. We ran the simulator with a 64 kilobyte, 4-way set associative cache per node, and a full-directory non-broadcast invalidate cache coherence protocol [ASH88]. One might expect CV C to be high on this system because some but not all remote requests have to be forwarded from the directory to a third node that will supply the updated copy of the block, and a... |

197 | Process Control and Scheduling Issues for Multiprogrammed Shared-Memory Multiprocessors - Tucker, Gupta - 1989 |

164 |
Probability and Statistics with Reliability, Queueing, and Computer Science Applications
- Trivedi
- 1982
Citation Context ...compared to a Normal for each phase. In addition, the Kolmogorov-Smirnov statistic can be constructed from the measured samples and used to derive a confidence band for the actual parent distribution [Tri82]. This gives an error bound between the estimated and actual parent distribution at a certain level of confidence. The empirical distribution calculated using 300 samples, the upper and lower ends of ... |

138 |
Speedup versus efficiency in parallel systems
- Eager, Zahorjan, et al.
- 1989
Citation Context ...dels is also extremely high, precluding analysis even of programs with fairly small task graphs. Comparing the deterministic model with the parametric speedup bounds of Eager, Zahorjan and Lazowska [EZL89], we show that for essentially the same effort as that required to calculate parameters for the bounds, the deterministic model can be used to obtain estimates of the actual speedup. Besides being qua... |

120 |
Allocating independent subtasks on parallel processors
- Kruskal, Weiss
- 1985
Citation Context ...sk graphs [AIA91, Cve87, DuB82, HeT83, KrW85, TRS90, TsV90, VSS88], and are all much simpler than models described so far. Of these, perhaps the most general is the seminal model of Kruskal and Weiss [KrW85]. They consider a parallel program consisting of N independent parallel tasks executing on P processors, and make two simplifying assumptions about task behavior. They assume task execution times to b... |

110 | IPS-2: The Second Generation of a Parallel Program Measurement System
- Miller, Clark, et al.
- 1990
Citation Context ...Performance evaluation tools in use today for evaluating parallel program performance are based on measurement or simulation. Measurement-based performance analysis tools such as Pablo [RAM92], IPS-2 [MCH90] and numerous others provide the ability to evaluate the performance of a given program on an existing system in detail. Simulation-based tools such as the Rice Parallel Processing Testbed [CMM88], th... |

102 |
Characterizations of parallelism in applications and their use in scheduling
- Sevcik
- 1989
Citation Context ...tter performance is possible if scheduling decisions are based not only on A but also on one or more additional parameters such as maximum parallelism, variance of parallelism and offered system load [Sev89]. We observe that the deterministic model provides an efficient technique for obtaining these various parameters for a particular program. In fact, all these parameters can be derived from the same so... |

95 | An efficient parallel biconnectivity algorithm - Tarjan, Vishkin - 1985 |

87 | Determining average program execution times and their variance
- Sarkar
- 1989
Citation Context ...vious paper focuses on estimating the mean and variance of the processing requirements of tasks in the presence of data-dependent effects such as conditional branch probabilities and loop frequencies [Sar89]. In that work, Sarkar describes a framework for determining the mean and variance of task execution times using frequency information from a counter-based execution profile of the program. For exampl... |

85 | An Overview of the Pablo Performance Analysis Environment
- Reed, Aydt, et al.
- 1992
Citation Context ...configuration. Performance evaluation tools in use today for evaluating parallel program performance are based on measurement or simulation. Measurement-based performance analysis tools such as Pablo [RAM92], IPS-2 [MCH90] and numerous others provide the ability to evaluate the performance of a given program on an existing system in detail. Simulation-based tools such as the Rice Parallel Processing Test... |

82 |
Predicting the Performance of Parallel Computations
- Mak, Lundstrom
- 1990
Citation Context ...the detailed state space). Mak and Lundstrom develop a heuristic and fairly complex graph reduction technique as their higher-level model component, to avoid considering the individual program states [MaL90]. This heuristic, however, restricts their model to programs that have series-parallel task graphs. It also requires the assumption that task residence times are exponentially distributed. Given the... |

74 |
The Rice Parallel Processing Testbed
- Covington, Madala, et al.
- 1988
Citation Context ...S-2 [MCH90] and numerous others provide the ability to evaluate the performance of a given program on an existing system in detail. Simulation-based tools such as the Rice Parallel Processing Testbed [CMM88], the Wisconsin Wind Tunnel [RHL93], and others provide the additional flexibility of evaluating an existing program on varying system sizes and configurations. Thus, these techniques collectively ach... |

69 | Performance analysis of mesh interconnection networks with deterministic routing - Adve, Vernon - 1994 |

48 | An Accurate and Efficient Performance Analysis Technique for Multiprocessor Snooping Cache-Consistency Protocols - Vernon, Lazowska, et al. - 1988 |

45 |
Analytic Queueing Network Models for Parallel Processing of Task Systems
- Thomasian, Bay
- 1986
Citation Context ...vious work with three models that apply to arbitrary task graphs, and which illustrate the above hierarchical framework as well as the principal difficulties of the general problem. Thomasian and Bay [ThB86] developed a 2-level hierarchical model in which task residence times are assumed to be exponentially distributed. This assumption allows the task-level behavior of the program to be modeled as a Mark... |

42 |
Performance of synchronous parallel algorithms with regular structures
- Madala, Sinclair
- 1991
Citation Context ...ls to which we refer in this paper only consider systems with a single executing job. Finally, two previous models [LCB92, MaS91] are restricted to specific task graph structures. Madala and Sinclair [MaS91] propose a model that applies to divide-and-conquer task graphs where tasks at each "level" in the graph have i.i.d. execution times with arbitrary variance, but with the assumption that the number of... |

40 |
Analytic queueing models for programs with internal concurrency
- Heidelberger, Trivedi
- 1983
Citation Context ...ys at barriers does not introduce significant error). We will not describe the other, less general, models for fork-join programs here, except to note that both the models of Heidelberger and Trivedi [HeT83] and Towsley et al. [TRS90] apply to multiprogrammed parallel systems with multiple parallel jobs (each job is assumed to have the same number of tasks in each parallel phase and each task is exponenti... |

35 |
Performance analysis of parallel processing systems
- Nelson, Towsley, et al.
- 1988
Citation Context ...of the conclusion is affected by the choice of distribution used to represent the parallel workload. A result that is not strongly dependent on the distribution assumption is as follows. Nelson et al. [NTT88] showed that in an environment containing mixed sequential (interactive) and large parallel (batch) jobs, an unpartitioned parallel system yields better performance than one in which processors are st... |

30 | An analytic model for multistage interconnection networks - Willick, Eager - 1990 |

27 |
Bounds for the mean runtime of parallel programs
- Hartleb, Mertsiotakis
- 1992
Citation Context ...nt of $V_2$ and $V_2$ is the only child of $V_1$. Parallel reduction combines 2 vertices $V_1$ and $V_2$ into a single vertex if $V_1$ and $V_2$ have exactly the same parents, as well as exactly the same children [HaM92]. This class includes fork-join graphs but excludes, for example, the task graphs in Figures 5.1 (d,e)... |

26 |
Chores: Enhanced Run-Time Support for Shared-Memory Parallel Computing
- Eager, Zahorjan
- 1993
Citation Context ...ask graph of a given program. Some requisite infrastructure for program analysis is already available in parallelizing compilers such as Jade [RSL93] and others, and in run-time systems such as Chores [EaZ93]. In parallelizing compilers, the compiler automatically detects and enforces (a superset of) the data dependencies in a program, and implements the partitioning and scheduling of work. The task graph... |

26 |
Performance Prediction and Calibration for a Class of Multiprocessors
- Vrsalovic, Siewiorek, et al.
- 1988
Citation Context ...e is asymptotically exact as P and N/P, but has been shown to be fairly accurate compared to simulations for small values of P, for a number of task time distributions. The models of Vrsalovic et al. [VSS88], Cvetanovic [Cve87] and Tsuei and Vernon [TsV90] are the three deterministic models mentioned in Chapter 1. The models of Vrsalovic et al. and Cvetanovic apply to iterative parallel programs in which ... |

23 |
A Modeling Methodology for the Analysis of Concurrent Systems and Computations
- Kapelnikov, Muntz, et al.
- 1989
Citation Context ...erage program completion time by sampling different execution paths, instead of analytically solving for the steady-state probability distribution of the Markov chain. Kapelnikov, Muntz and Ercegovac [KME89] also propose a very similar hierarchical model for evaluating programs on distributed systems. Their model assumes a computation control graph, a program representation that is more general than a ta... |

22 |
Performance of Synchronized Iterative Processes in Multiprocessor Systems
- Dubois, Briggs
- 1982
Citation Context ..., stochastic models that allow general distributions of task-time have been applied using different specific distributions, including the normal distribution [DuB82, Gre89, KrW85]. Dubois and Briggs [DuB82] as well as Greenberg [Gre89] argued that a task could be asymptotically normally distributed because it is the sum of a large number of (non-deterministic) instruction execution times. Our proof in A... |

22 |
Performance of Parallel Programs: Model and Analyses, doctoral dissertation
- Mohan
- 1984
Citation Context ...sm in the program (i.e. $O(2^N)$ in the worst case). The authors show the model to be accurate for a task graph with N = 6 tasks, compared to simulations that also assume exponential task times. Mohan [Moh84] earlier described a model equivalent to that of Thomasian and Bay, but used a stochastic simulation to find the average program completion time by sampling different execution paths, instead of analy... |

22 | Stochastic bounds on execution times of parallel programs - Yazici-Pekergin, Vincent - 1991 |

20 |
The Effects of Problem Partitioning, Allocation, and Granularity on the Performance of Multiple-Processor Systems
- Cvetanovic
- 1987
Citation Context ...exact as P and N/P, but has been shown to be fairly accurate compared to simulations for small values of P, for a number of task time distributions. The models of Vrsalovic et al. [VSS88], Cvetanovic [Cve87] and Tsuei and Vernon [TsV90] are the three deterministic models mentioned in Chapter 1. The models of Vrsalovic et al. and Cvetanovic apply to iterative parallel programs in which the computational wo... |

20 |
Loop-level parallelism in numeric and symbolic programs
- Larus
- 1993
Citation Context ...lution of the basic deterministic model ignoring communication. 20. An interesting, related measurement-based technique has been developed by Larus and implemented in a tool called pp [Lar93]. His technique uses a trace of an execution of a sequential program to detect data dependencies and CPU times for the iterations of each parallel loop in the program, and computes the potential speed... |

19 |
Analysis of fork-join program response times on multiprocessors
- Towsley, Rommel, et al.
- 1990
Citation Context ...ork-join Task Graph: A task graph consisting of alternating sequential and parallel phases, where each parallel phase consists of a set of independent tasks and ends in a full barrier synchronization [TRS90]. Series-Parallel Task Graph: A task graph that can be reduced to a single vertex by repeated applications of series reduction or parallel reduction: Series reduction combines two vertices $V_1$ and $V_2$... |

16 | Analysis of Spatial and Temporal Scheduling Policies for Semi-Static and Dynamic Multiprocessor Environments - Leutenegger, Nelson - 1991 |

16 |
A Performance Evaluation of a General Parallel Processing Model
- Nelson
- 1990
Citation Context ... of processors was assumed to be exponentially distributed. In contrast, one result that is significant for exponentially distributed tasks but weaker for tasks with lower variance is Nelson's result [Nel90] that higher variance of parallelism can yield... |

14 |
Guided self-scheduling: a practical scheduling scheme for parallel supercomputers
- Polychronopoulos, Kuck
- 1987
Citation Context ...ists of a single parallel phase with N tasks. Various scheduling functions are possible for such a loop (e.g., static scheduling in blocked or cyclic order, dynamic scheduling, guided self-scheduling [PoK87], etc.), but the task graph is the same in all cases. In some programs, however, the task graph (or some portion thereof) may necessarily depend on the number of processes used during program executio... |

12 | Optimally Profiling and Tracing
- Ball, Larus
- 1992
Citation Context ...s of this program for this input. The most convenient method is to measure these values directly in software using explicit system timers or using software instrumentation tools such as pixie and qpt [BaL92]. Note that some software timing techniques will implicitly include overheads due to communication or shared-resource contention, rather than purely the CPU requirements. Depending on the desired accu... |

11 |
An Analysis of Several Processor Partitioning Policies for Parallel Computers
- Setia, Tripathi
Citation Context ...ch processors are statically partitioned among the two classes. They assumed that the parallel jobs consisted of tasks with exponentially distributed execution times. But, in fact, Setia and Tripathi [SeT91] showed that the same conclusion holds with a completely different assumption about task times. Specifically, they assumed that each job consisted of tasks of equal size, while the total execution tim... |

10 |
Synchronization costs on multiprocessors
- Greenbaum
- 1989
Citation Context ...w general distributions of task-time have been applied using different specific distributions, including the normal distribution [DuB82, Gre89, KrW85]. Dubois and Briggs [DuB82] as well as Greenberg [Gre89] argued that a task could be asymptotically normally distributed because it is the sum of a large number of (non-deterministic) instruction execution times. Our proof in Appendix A is essentially a fo... |

10 | Deterministic timing schema for parallel programs - Shaw - 1991 |

9 | Analyzing multiprocessor cache behaviour through data reference modeling
- Tsai, Agarwal
- 1993
Citation Context ...s, Tsai and Agarwal have shown that it is possible to derive precise analytical estimates of multiprocessor cache miss rates as functions of input size, cache block size, and the number of processors [TsA93]. Finally, when using a model in early stages of the design and development of a program, important parameter values, in particular CPU requirements and perhaps rough communication rates, may have to ... |

8 | Asynchronous analysis of parallel dynamic programming algorithms
- Lewandowski, Condon, et al.
- 1996
Citation Context ...the maximum parallelism, i.e., task scheduling can be ignored. (They also derive models for fork-join programs that are very similar to the results of Kruskal and Weiss.) Lewandowski, Condon and Bach [LCB92] propose a model applicable to programs with pipelined task graphs and i.i.d. exponential task times. 2.5. Summary of the State of the Art We can summarize what is known about the state of th... |

7 |
Diagnosing Parallel Program Speedup Limitations Using Resource Contention Models
- Tsuei, Vernon
- 1990
Citation Context ...een shown to be fairly accurate compared to simulations for small values of P, for a number of task time distributions. The models of Vrsalovic et al. [VSS88], Cvetanovic [Cve87] and Tsuei and Vernon [TsV90] are the three deterministic models mentioned in Chapter 1. The models of Vrsalovic et al. and Cvetanovic apply to iterative parallel programs in which the computational work as well as the... |

6 |
Jade: a high-level, machine independent language for parallel programming
- Rinard, Scales, et al.
- 1993
Citation Context ... programmer annotation) will be required for creating the task graph of a given program. Some requisite infrastructure for program analysis is already available in parallelizing compilers such as Jade [RSL93] and others, and in run-time systems such as Chores [EaZ93]. In parallelizing compilers, the compiler automatically detects and enforces (a superset of) the data dependencies in a program, and impleme... |

5 | Performance Modeling of Parallel Algorithms - Ammar, Islam, et al. - 1990 |

5 |
The Wisconsin Wind Tunnel
- Reinhardt, Hill, et al.
- 1993
Citation Context ...vide the ability to evaluate the performance of a given program on an existing system in detail. Simulation-based tools such as the Rice Parallel Processing Testbed [CMM88], the Wisconsin Wind Tunnel [RHL93], and others provide the additional flexibility of evaluating an existing program on varying system sizes and configurations. Thus, these techniques collectively achieve at least partial success in ad... |

4 |
Foundations of Parallel Computational Microhydrodynamics: Communication Scheduling Strategies, A.I.Ch.E
- Fuentes, Kim
- 1992
Citation Context ...nse that they were written to solve computationally intensive problems of interest to their authors. Hydro is a parallel simulation of particle motion in viscous fluids, with efficient communication [FuK92]. PSIM was developed at Lawrence Livermore Laboratories to simulate the indirect binary n-cube memory server network in a large parallel vector-processing environment [Bro88b]. Bicon is an implementat... |

3 |
APRIL: A
- Agarwal, Lim, et al.
- 1990
Citation Context ...ed resources (note that this would only be useful if there were more processes than processors), an even simpler modification suffices. Possible examples of such systems are multi-threaded processors [ALK90] or parallel applications with significant I/O requirements. Representing such a program with the model only requires modifying the system-level queueing network to represent the individual processes ... |

3 | An Experiment on Predicting and Measuring the Deterministic Execution Times of Parallel Programs on a Multiprocessor - Kim, Shaw - 1990 |

3 |
A Multiprocessor Bus Design Model Validated by System Measurement
- Tsuei, Vernon
- 1992
Citation Context ...hat each can be separately and accurately included [TsV90]. The Sequent Symmetry bus supports an invalidation-based snooping cache protocol. In a previous analytical modeling study of the Sequent bus [TsV92], Tsuei and Vernon showed that two aspects of the bus protocol have a significant impact on performance: (1) at most three read requests can be outstanding at any time from all processors, with at mos... |