
### Table 2 Operations count.

"... In PAGE 6: ... Efficiency Analysis: In order to study the parallel computational work, the number of elementary operations per grid point per time-step is needed. The number of elementary operations involved in the numerical schemes is listed in Table 2. In this analysis, it is assumed that the workstations are loosely coupled in a local network such as Ethernet.... ..."

Cited by 2

### Table 2.1: A summary of various distributed-memory parallel machine simulators, including PUPPET. The "overall organization" is consistently discrete-event, and hence omitted. For loosely coupled simulations, "non-parallel" means that the second phase (simulation) is serial.

### Table 1. A spectrum of heterogeneity

1995

"... In PAGE 2: ... Choosing the best set of available resources is a difficult problem and is the subject of this paper. Consider the set of machines in Table1 and observe that they have different computation and communication capacities. Loosely-coupled parallel computations with infrequent communication would likely benefit by applying the fastest set of computa- tional resources (perhaps the DEC-Alpha cluster), and may benefit from distribution across many machines.... ..."

Cited by 18

### Table 1. A spectrum of heterogeneity

"... In PAGE 2: ... Choosing the best set of available resources is a difficult problem and is the subject of this paper. Consider the set of machines in Table1 and observe that they have different computation and communication capacities. Loosely-coupled parallel computations with infrequent communication would likely benefit by applying the fastest set of computa- tional resources (perhaps the DEC-Alpha cluster), and may benefit from distribution across many machines.... ..."

### Table 3: Effect of the circulation scheme on the performance of the light contribution computation phase (Phase 4).

1996

"... In PAGE 22: ... As is seen in Table 2, the loosely-coupled circulation scheme on simple ring topology achieves almost the same high efficiency values as the demand-driven scheme in spite of the fact that the demand-driven scheme exploits the rich hypercube topology and the direct-routing facility of iPSC/2. Table 3 illustrates the execution times of the distributed light contribution computations (Phase 4) during a single iteration of the parallel algorithm. The last column of Table 3 illustrates the percent decrease in the parallel execution time obtained by using the contribution vector circulation scheme instead of the form-factor vector circulation scheme.... ..."

Cited by 1

### Table 3: Execution times for loosely coupled simulation. Both estimated and execution times are in seconds. models for less accurate but fast simulation, or use parallel machines to run the simulators [3, 13]. Our work differs from previous work in several ways. In our work we specifically target large scale data-intensive applications on large scale machines. The application emulators presented in this paper lie in between pure analytical models and full applications. They provide a simpler, but parameterized, model of the application by abstracting away the details not related to a performance prediction study. Since an application emulator is a program, it preserves the dynamic nature of the application, and can be simulated using any simulator that can run the full application. The loosely-coupled simulation model reduces the number of interactions between the simulator and the application emulator by embedding the application processing structure into the simulator. As our experimental results show, our optimizations enable simulation of large scale machines on workstations.

in A Performance Prediction Framework for Data Intensive Applications on Large Scale Parallel Machines

1998

Cited by 22
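The key idea in the entry above (fewer simulator/emulator interactions by embedding the application's processing structure into the simulator) can be illustrated with a minimal sketch. All names here (`Event`, `Simulator`, `emulate`) are our own illustrative inventions, not the paper's API; the point is only the interaction-count difference between per-event and per-phase coupling.

```python
from dataclasses import dataclass

@dataclass
class Event:
    time: float   # estimated cost of the operation, in seconds
    kind: str     # e.g. "compute" or "communicate"

class Simulator:
    def __init__(self):
        self.clock = 0.0        # simulated time accumulated so far
        self.interactions = 0   # emulator -> simulator calls made

    def run_event(self, ev):
        # Tightly coupled: the emulator calls in once per event.
        self.interactions += 1
        self.clock += ev.time

    def run_phase(self, events):
        # Loosely coupled: the emulator hands over a whole phase at once,
        # so the simulator advances through it without further callbacks.
        self.interactions += 1
        self.clock += sum(ev.time for ev in events)

def emulate(sim, phases, loosely_coupled):
    """Drive the simulator with a list of phases (lists of events)."""
    for phase in phases:
        if loosely_coupled:
            sim.run_phase(phase)
        else:
            for ev in phase:
                sim.run_event(ev)
```

Run on the same workload, both modes produce the same simulated clock, but the loosely coupled mode makes one interaction per phase rather than one per event, which is the source of the speedup the snippet claims.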

### Table 1 gives the sizes of their reachable state space and the set of recurrent states. The back-annotated version achieves over three times the state compression ratio of the estimated version, which is much higher than that of the FIFOs. This is largely because the modules in the DIFFEQs are loosely coupled and rather sequential. Due to the state compression, the number of power iterations to convergence is dramatically reduced. The curves in Figure 11 illustrate convergence of the distance of the probability vectors from two consecutive iterations. They clearly suggest that the Markov chains in both estimated and back-annotated versions possess ...

1998

"... In PAGE 23: ...9 >4,287 35 >122 8,915 10,040 0.89 Table 1: State compression and iteration number reduction in the DIFFEQ and PCI analyses. Examples CPU time without compression (sec) CPU time with state compression (sec) Speedup Power iteration Compression Power iteration Expansion Total DiffeqEst.... In PAGE 24: ...receiver side is also geometrically distributed with a parameter of 0.9. The mutual exclusion element mutex has a unit delay and is assumed to be fair with simultaneously arriving requests. Table 1 lists the sizes of the reachable state space, the recurrent state set and the state space after compression. The model achieves a state compression ratio close to 6.... ..."

Cited by 6
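The power iteration mentioned in the entry above repeatedly multiplies a probability vector by the chain's transition matrix and stops when consecutive iterates are close; fewer states after compression means faster convergence. A minimal sketch of that iteration (our own code, with an illustrative 2-state chain, not the paper's DIFFEQ or PCI models):

```python
import numpy as np

def stationary_by_power_iteration(P, tol=1e-10, max_iters=10_000):
    """Return (pi, iterations) for a row-stochastic transition matrix P."""
    n = P.shape[0]
    pi = np.full(n, 1.0 / n)              # start from the uniform distribution
    for k in range(1, max_iters + 1):
        nxt = pi @ P                       # one power iteration step: pi <- pi P
        if np.abs(nxt - pi).sum() < tol:   # L1 distance of consecutive iterates
            return nxt, k
        pi = nxt
    return pi, max_iters
```

For the 2-state chain with transition matrix `[[0.9, 0.1], [0.15, 0.85]]`, the iteration converges to the stationary distribution `(0.6, 0.4)`; the "distance of the probability vectors from two consecutive iterations" tracked in the snippet's Figure 11 is exactly the stopping quantity here.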

### Table 1: Scheduling schemes in Grid environments

2006

"... In PAGE 3: ... We obtain the VG by querying our vgES prototype, which has stored resource information corresponding to our synthetic computing environment. Therefore, we conduct 6 different types of experiments, as summarized in Table1 . We provide details on all the above in the following sections.... ..."

Cited by 5

### Table 1 GA and PGA complexities. The PGA has been implemented on the Supernode. The Supernode is a loosely coupled, highly parallel machine based on transputers. One of its most important characteristics is its ability to dynamically reconfigure the network topology by using a programmable VLSI switch device. This architecture offers a range of 16 to 1024 processors, delivering from 24 to 1500 Mflops performance. To achieve this performance, a hierarchical structure has been adopted. The basic component is a T800 transputer. It is a 32-bit microprocessor, with on-chip memory and F.P.U. (Floating Point Unit), delivering 10 Mips and 1.5 Mflops peak performance. Communication between transputers is supported by 4 bidirectional, serial, asynchronous, point-to-point connections

1993

"... In PAGE 8: ...Table1... ..."