## Performance Measurement and Trace Driven Simulation of Parallel CAD and Numeric Applications on a Hypercube Multicomputer (1990)

Venue: IEEE Transactions on Parallel and Distributed Systems

Citations: 28 (0 self)

### BibTeX

@ARTICLE{Hsu90performancemeasurement,

author = {Jiun-ming Hsu and Prithviraj Banerjee},

title = {Performance Measurement and Trace Driven Simulation of Parallel CAD and Numeric Applications on a Hypercube Multicomputer},

journal = {IEEE Transactions on Parallel and Distributed Systems},

year = {1990},

volume = {3},

pages = {260--269}

}

### Abstract

This paper presents the performance evaluation, workload characterization, and trace-driven simulation of a hypercube multicomputer running realistic workloads. Six representative parallel applications were selected as benchmarks, and software monitoring techniques were used to collect execution traces. Based on the measurement results, we investigated both the computation and communication behavior of these parallel programs, including CPU utilization, computation task granularity, message interarrival distribution, the distribution of waiting times in receiving messages, and message length and destination distributions. The localities in communication were also studied. A trace-driven simulation environment was developed to study the behavior of the communication hardware under real workloads. Simulation results on DMA and link utilizations are reported.

1. INTRODUCTION Hypercube multicomputers have recently offered a cost-effective and feasible approach to supercomputing by connect...
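The abstract names message interarrival, length, and destination distributions among the measured quantities. As a minimal sketch of how such statistics could be derived from a per-node message trace, the Python below summarizes a list of send events. The record layout `(send_time, destination_node, length_bytes)` and the function name `summarize_trace` are assumptions made for illustration; they are not the trace format or tooling used in the paper.

```python
from collections import Counter
from statistics import mean, pstdev


def summarize_trace(events):
    """Summarize one node's send events.

    events: list of (send_time, destination_node, length_bytes) tuples,
    sorted by send_time. This record layout is hypothetical.
    """
    times = [t for t, _, _ in events]
    # Interarrival time = gap between consecutive sends.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = mean(gaps)
    # Coefficient of variation of the gaps. An exponential distribution
    # has CV = 1, so a value far from 1 argues against Poisson arrivals.
    cv = pstdev(gaps) / mean_gap
    # Fraction of messages sent to each destination.
    dest_counts = Counter(dest for _, dest, _ in events)
    dest_share = {dest: n / len(events) for dest, n in dest_counts.items()}
    return {
        "mean_gap": mean_gap,
        "cv": cv,
        "dest_share": dest_share,
        "mean_length": mean(length for _, _, length in events),
    }


if __name__ == "__main__":
    trace = [(0.0, 1, 100), (1.0, 1, 100), (2.0, 2, 200), (3.0, 1, 200)]
    print(summarize_trace(trace))
```

The coefficient of variation is only a quick screening statistic; fitting candidate distributions to the empirical data requires a proper regression or goodness-of-fit procedure.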

### Citations

1960 | Matrix Computations
- Golub, Van Loan
- 1989
Citation Context ...me to distribute the faults to be processed at run time. The idle processors broadcast messages to request more work from other busy processors. 2.6. Matrix QR Factorization The QR factorization =-=[19]-=- of a matrix A involves the determination of an orthonormal matrix Q and an upper triangular matrix R such that QA = R, also written as A = Q^T R. The standard sequence for computing Q and R involves s... |

339 | Virtual cut-through: a new computer communication switching technique
- Kermani, Kleinrock
- 1979
Citation Context ...rrival interval, length, and destination distributions. The most common assumptions are Poisson message arrival, exponentially distributed message lengths, and evenly distributed message destinations =-=[4]-=-. Researchers in the load balancing area make various assumptions about task granularity and intertask dependencies in computational task graphs that are mapped on to processors [5]. We wanted to veri... |

323 | Evaluation techniques and storage hierarchies
- Mattson, Gecsei, et al.
- 1970
Citation Context ... for both the spatial and temporal localities of message destinations have been proposed before in [10]. In this study we propose a model for the locality of message size based on the LRU stack model =-=[11]-=-. The accuracy of this model was verified by measurement results. In distributed-memory multicomputers like hypercubes, synchronization and data sharing are achieved by explicit message passing. Hence... |
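The context above mentions a locality model for message size based on the LRU stack model. The excerpt does not give the details of that model, so the following is only a generic stack-distance computation applied to a sequence of message sizes, written as an illustrative sketch. The function names `lru_stack_distances` and `hit_ratio` are hypothetical.

```python
def lru_stack_distances(sizes):
    """Return the LRU stack depth of each message size in sequence.

    Depth 1 means the size equals the most recently seen size. A size
    never seen before has no finite depth and is reported as None.
    """
    stack = []  # most recently used size at index 0
    distances = []
    for size in sizes:
        if size in stack:
            depth = stack.index(size) + 1
            stack.remove(size)
        else:
            depth = None
        stack.insert(0, size)  # the referenced size moves to the top
        distances.append(depth)
    return distances


def hit_ratio(distances, k):
    """Fraction of messages whose size is among the k most recent sizes."""
    hits = sum(1 for d in distances if d is not None and d <= k)
    return hits / len(distances)


if __name__ == "__main__":
    sizes = [8, 8, 64, 8, 64, 1024, 8]
    depths = lru_stack_distances(sizes)
    print(depths)
    print(hit_ratio(depths, 2))
```

A distribution of depths concentrated near 1 indicates strong locality: the next message size is likely to repeat a recently seen size.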

204 | The Cosmic Cube
- Seitz
- 1985
Citation Context ...reported. 1. INTRODUCTION Hypercube multicomputers have recently offered a cost-effective and feasible approach to supercomputing by connecting a large number of low-cost processors with direct links =-=[1]-=-. Each processor has its own local memory. Processes running on these processors communicate via message passing. This type of architecture is more readily scaled up to large numbers of processors tha... |

93 | Partitioning problems in parallel, pipeline, and distributed computing
- Bokhari
- 1988
Citation Context ...ssage destinations [4]. Researchers in the load balancing area make various assumptions about task granularity and intertask dependencies in computational task graphs that are mapped on to processors =-=[5]-=-. We wanted to verify if indeed the above assumptions are valid, and if not, what distributions more accurately model the real world applications on hypercubes. In experimental studies of hypercube pe... |

72 | The NX/2 Operating System
- Pierce
- 1988
Citation Context ...torage, hence the computing nodes access the file system through the host. The host runs the UNIX System V operating system, and the computing node runs a proprietary operating system called the NX/2 =-=[26]-=- which supports, among other things, file access to the host and message passing between the nodes. The hypercube program usually consists of a host program running on the host and a node program runn... |

49 | The iPSC/2 Direct-Connect Communication Technology
- Nugent
- 1988
Citation Context ...ample, Figures 3, 4, and 5 report the various workload distributions, while Table 10 shows the model parameters for the workload. HSIM models the architecture and communication protocol of the iPSC/2 =-=[3, 32]-=-, which uses a circuit-switched message routing scheme. Each communication link consists of two bi-directional, bit-serial channels. Two DMA channels are used to transmit data, one for memory to routing... |

48 | Analysis of the Effects of Delays on Load Sharing
- Mirchandaney, Towsley, et al.
- 1989
Citation Context ...tically tractable. Researchers in the load balancing area also make various assumptions about task granularity and intertask dependencies in computational task graphs that are mapped on to processors =-=[6, 7]-=-. We wanted to verify if indeed the above assumptions are valid, and if not, what distributions more accurately model the real world applications on hypercubes. In experimental studies of hypercube ... |

25 | Matrix Factorization on a Hypercube Multiprocessor
- Geist, Heath
- 1985
Citation Context ...s Ax = b can be solved by performing Gaussian elimination to obtain an upper triangular matrix. On a hypercube, this algorithm is implemented by distributing complete rows of Ax = b to each processor =-=[22]-=-. In each iteration of the computation, a pivoting row is selected by finding the global maximum. This row is then broadcast to every node to eliminate one element in each of the remaining active ro... |
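The context above describes row-distributed Gaussian elimination: each iteration selects a pivot row by a global maximum and broadcasts it so that every node can eliminate one element in its remaining active rows. The Python below is a sequential sketch of that scheme, not the cited implementation. The cyclic row-to-node mapping, the `num_nodes` parameter, and the function name `distributed_gaussian_solve` are assumptions, and the reduction and broadcast are simulated with ordinary data structures rather than messages.

```python
def distributed_gaussian_solve(a, b, num_nodes=2):
    """Solve Ax = b, mimicking row-distributed Gaussian elimination."""
    n = len(a)
    # Augmented rows [A | b]; row i lives on node i % num_nodes (assumed).
    rows = [[float(v) for v in a[i]] + [float(b[i])] for i in range(n)]
    owner = [i % num_nodes for i in range(n)]
    active = set(range(n))
    pivot_order = []  # pivot_order[col] = row chosen as pivot for col

    for col in range(n):
        # Each node proposes its best local candidate for this column.
        local_best = {}
        for i in active:
            node = owner[i]
            best = local_best.get(node)
            if best is None or abs(rows[i][col]) > abs(rows[best][col]):
                local_best[node] = i
        # Global-maximum reduction over the per-node candidates.
        pivot = max(local_best.values(), key=lambda i: abs(rows[i][col]))
        if rows[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        active.remove(pivot)
        pivot_order.append(pivot)
        pivot_row = rows[pivot][:]  # stands in for the broadcast
        # Every node eliminates column col in its remaining active rows.
        for i in active:
            factor = rows[i][col] / pivot_row[col]
            for j in range(col, n + 1):
                rows[i][j] -= factor * pivot_row[j]

    # Back substitution over the pivot rows, last column first.
    x = [0.0] * n
    for col in range(n - 1, -1, -1):
        r = rows[pivot_order[col]]
        known = sum(r[j] * x[j] for j in range(col + 1, n))
        x[col] = (r[n] - known) / r[col]
    return x


if __name__ == "__main__":
    print(distributed_gaussian_solve([[2, 1], [1, 3]], [3, 5]))
```

Rows are never physically swapped here; the pivot order is recorded instead, so the reduced system is upper triangular only up to a row permutation. This avoids moving rows between nodes.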

24 | Performability modeling based on real data: a case study
- Hsueh, Iyer, et al.
- 1988
Citation Context ...butions. Again, based on the results of the measurement, we can verify whether these synthetic benchmarks are close to real applications by modeling the empirical data using statistical distributions =-=[8, 9]-=-. The resulting model will also help us in designing more realistic synthetic benchmarks. The communication patterns of the parallel programs can critically affect the performance of the message passi... |

16 | Hardware Support for Message Routing in a Distributed Memory Multicomputer
- Hsu, Banerjee
- 1990
Citation Context ... by intelligent routing controllers and efficient buffer management strategies. In a circuit-switched message routing paradigm, the circuit can be kept connected even after the message is transmitted =-=[28]-=-. If the next message also goes to the same destination, this circuit can again be used, thereby reducing circuit set-up time. Software overhead can also be reduced by predicting the sizes of the inco... |

14 | Fault partitioning issues in an integrated parallel test generation/fault simulation environment
- Patil, Banerjee
- 1989
Citation Context ...parallel. 2.5. Fault Simulation In fault simulation the objective is to find out how many of a given set of faults are detected by a given set of input patterns. The parallel fault simulation program =-=[18]-=- partitions the faults among the processors initially and uses a dynamic load balancing scheme to distribute the faults to be processed at run time. The idle processors broadcast messages to request f... |

12 | Networks for Parallel Processors: Measurements and Prognostications
- Grunwald, Reed
- 1988
Citation Context ...d to verify if indeed the above assumptions are valid, and if not, what distributions more accurately model the real world applications on hypercubes. In experimental studies of hypercube performance =-=[6, 7]-=-, simulations have been performed using various synthetic communication benchmarks, which assume different kinds of message interval, length and destination distributions. Again, based on the results ... |

12 | A Parallel Simulated Annealing Algorithm for Channel Routing on a Hypercube Multiprocessor
- Brouwer, Banerjee
- 1988
Citation Context ...imated wirelength in a standard cell layout. Processors pair up to perform parallel moves, and synchronize with each other using a ring-based broadcast mechanism. The parallel channel routing program =-=[13]-=- is also based on simulated annealing and uses similar communication patterns to minimize the number of horizontal tracks in a channel. The parallel test pattern generator [14] is based on a parallel ... |

9 | A parallel branch and bound approach to test generation
- Patil, Banerjee
- 1990
Citation Context ...channel routing program [13] is also based on simulated annealing and uses similar communication patterns to minimize the number of horizontal tracks in a channel. The parallel test pattern generator =-=[14]-=- is based on a parallel branch and bound algorithm. The scheduling and control of the parallel search is done by one processor, while other nodes do the searches in parallel. The parallel circuit extr... |

9 | A message passing coprocessor for distributed memory multicomputers
- Hsu, Banerjee
- 1990
Citation Context ...nation, this circuit can again be used, thereby reducing circuit set-up time. Software overhead can also be reduced by predicting the sizes of the incoming messages and preallocating buffers for them =-=[29]-=-. Broadcast and reduce (finding global minimum/maximum) operations are used substantially. If special hardware can be used to perform these operations, the communication speed will be greatly improved... |

8 | CSIM Reference Manual (Revision 13
- Schwetman
- 1989
Citation Context ...mulation Methodology A trace driven simulator for hypercubes, HSIM, has been developed to study the behavior of the communication hardware under real workloads. HSIM, written in C++, is based on CSIM =-=[20]-=-, which is a process-oriented simulation language. The emphasis of HSIM is on communication activities; therefore, message transmission is modeled in detail. The computation time of CPU between mes... |

5 | The iPSC/2 Node Architecture
- Close
- 1988
Citation Context ... than multiprocessor designs based on globally shared memory. Implementations of the hypercube architecture range from experimental prototype systems [2], to commercially available systems from Intel =-=[3]-=-, Ametek, and NCUBE. The evaluation of performance of parallel machines such as hypercubes is extremely important for exploring parallel program characteristics and parallel architecture behavior. One... |

5 | PACE2: An improved parallel VLSI extractor with parametric extraction
- Belkhale, Banerjee
- 1989
Citation Context ...parallel branch and bound algorithm. The scheduling and control of the parallel search is done by one processor, while other nodes do the searches in parallel. The parallel circuit extraction program =-=[15]-=- consists of two phases. The first phase, dominated by communication, is the distribution of data from the host to the node processors in a tree fashion. The second phase is the circuit extraction pha... |

4 | A Parallel Row-Based Algorithm for Standard Cell
- Sargent, Banerjee
- 1989
Citation Context ...I circuit extractor, and a test pattern generator. In addition, two numeric applications, QR factorization and Fast Fourier Transformation, are also used. The parallel standard cell placement program =-=[12]-=- is based on simulated annealing technique whose objective is to minimize the total estimated wirelength in a standard cell layout. Processors pair up to perform parallel moves, and synchronize with e... |

3 | Performance Instrumentation for the Intel IPSC/2
- Rudolph
- 1989
Citation Context ...r performance, to investigate resource utilization, and to determine the characteristics of computation and communication workload. The measurement methodology basically followed the same approach as =-=[18]-=-, which is software monitoring at the operating system level. The user program need not be modified except that two commands, monitor_init() and monitor_end(), are added to the node program to initiat... |

3 | Implementation and Analysis of a Navier-Stokes Algorithm on Parallel Computers
- Fatoohi, Grosch
- 1988
Citation Context ...s finding the global minimum and tree broadcasting. 2.10. Navier-Stokes Algorithm The Navier-Stokes equations for fluid dynamics modeling can be solved numerically by an iterative relaxation scheme =-=[24]-=-. The parallel version of this algorithm partitions the grid points among the processors. After each iteration, the processors communicate to obtain the values of the points abutting their regions. 2.11. EIS... |

3 | EISPACK − A package for solving matrix eigenvalue problems, Argonne National Laboratory
- Dongarra, Moler
- 1977
Citation Context ... points among the processors. After each iteration, the processors communicate to obtain the values of the points abutting their regions. 2.11. EISPACK The last benchmark is the tred2 routine in the EISPACK =-=[25]-=- matrix computation library. Tred2 reduces a symmetric matrix to a symmetric tridiagonal matrix and is used in computing the eigenvalues of a matrix. The parallel version of this routine is generated ... |

2 | File Usage Analysis and Resource Usage Prediction: a Measurement-Based Study
- Devarakonda
- 1987
Citation Context ...butions. Again, based on the results of the measurement, we can verify whether these synthetic benchmarks are close to real applications by modeling the empirical data using statistical distributions =-=[8, 9]-=-. The resulting model will also help us in designing more realistic synthetic benchmarks. The communication patterns of the parallel programs can critically affect the performance of the message passi... |

2 | Orthogonal factorization on a distributed memory multiprocessor
- Pothen, Jha, et al.
- 1987
Citation Context ...stribution of data from the host to the node processors in a tree fashion. The second phase is the circuit extraction phase which is computationally intensive. The parallel QR factorization algorithm =-=[16]-=- maps the matrix onto a ring of processors and eliminates matrix elements in parallel. The parallel Fast Fourier Transform (FFT) program [17] involves an even distribution of the input points among th... |

2 | Linear Optimization via Message-based Parallel Processing
- Stunkel
- 1988
Citation Context ...st transforms these inequalities into equalities while maintaining the x ≥ 0 constraints. It then iterates systematically to minimize the objective (cost) function. In the parallel simplex algorithm =-=[23]-=-, complete rows of Ax = b are divided equally among the processors. It also involves finding the global minimum and tree broadcasting. 2.10. Navier-Stokes Algorithm The Navier-Stokes equations for f... |

1 | Mark IIIfp Hypercube Concurrent Processor Architecture
- Tuazon, Peterson, et al.
- 1988
Citation Context ...e readily scaled up to large numbers of processors than multiprocessor designs based on globally shared memory. Implementations of the hypercube architecture range from experimental prototype systems =-=[2]-=-, to commercially available systems from Intel [3], Ametek, and NCUBE. The evaluation of performance of parallel machines such as hypercubes is extremely important for exploring parallel program chara... |

1 | High Performance Hypercube Communication
- Buzzard, Mudge
- 1988
Citation Context ...d to verify if indeed the above assumptions are valid, and if not, what distributions more accurately model the real world applications on hypercubes. In experimental studies of hypercube performance =-=[6, 7]-=-, simulations have been performed using various synthetic communication benchmarks, which assume different kinds of message interval, length and destination distributions. Again, based on the results ... |

1 | SAS User's Guide: Statistics, Version 5
- SAS Institute Inc.
- 1985
Citation Context ...poexponential, and multi-stage shifted gamma pdf's. To determine the best model for the empirical distributions shown in Figures 5 through 7, we applied a curve fitting technique (nonlinear regression) =-=[19]-=- to fit the distributions to the above models. Similar studies have been reported by [9] for file usage analysis and [8] for resource utilization. The criterion of the nonlinear regression procedure... |