The SPLASH-2 programs: Characterization and methodological considerations (1995)
Venue: International Symposium on Computer Architecture
Citations: 1420 (12 self-citations)
Citations
759 | SPLASH: Stanford parallel applications for shared-memory
- Singh, Weber, et al.
- 1992
Citation Context: ...ten, different programs and different problem sizes were used, making comparisons across studies difficult. Many recent studies have used the Stanford ParalleL Applications for SHared memory (SPLASH) [SWG92], a suite of parallel programs written for cache-coherent shared address space machines. While SPLASH has provided a degree of consistency and comparability across studies, like any other suite of app...
526 | Multi-level adaptive solution to boundary-value problems
- Brandt
- 1977
Citation Context: ...grids are conceptually represented as 4-D arrays, with all subgrids allocated contiguously and locally in the nodes that own them, and (iii) it uses a red-black Gauss-Seidel multigrid equation solver [Bra77], rather than an SOR solver. See [WSH93] for more details. Radiosity: This application computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity meth...
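The red-black ordering mentioned above is what makes a Gauss-Seidel smoother easy to parallelize: with the usual 5-point stencil, every neighbour of a "red" point is "black" and vice versa, so each half-sweep updates points that depend only on the other colour. The following is a minimal sketch of one such sweep for -∇²u = f, the kind of smoother a multigrid cycle applies on each level. It is not Ocean's code; the list-of-lists grid, fixed (Dirichlet) boundary values, uniform spacing `h`, and the function names are assumptions made for illustration.

```python
# Red-black Gauss-Seidel smoothing for -laplace(u) = f on a square grid (sketch).
# Assumptions, not taken from SPLASH-2/Ocean: list-of-lists grid, fixed
# (Dirichlet) boundary values, uniform spacing h, 5-point stencil.


def red_black_sweep(u, f, h):
    """One red-black Gauss-Seidel sweep over the interior of u, in place.

    Point (i, j) is red when (i + j) is even and black otherwise. Each red
    point's four neighbours are black (and vice versa), so every update in
    a half-sweep reads only values of the other colour and the points of
    one colour can be processed in any order, e.g. split across processors.
    """
    n = len(u)
    for colour in (0, 1):  # 0: red half-sweep, 1: black half-sweep
        for i in range(1, n - 1):
            # First interior column j >= 1 with (i + j) % 2 == colour.
            start = 1 + (i + 1 + colour) % 2
            for j in range(start, n - 1, 2):
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  + h * h * f[i][j])


def residual_norm(u, f, h):
    """Largest |f + laplace(u)| over interior points (0 at the exact solution)."""
    n = len(u)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            lap = (u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                   - 4.0 * u[i][j]) / (h * h)
            worst = max(worst, abs(f[i][j] + lap))
    return worst
```

For example, on a 9 x 9 grid with the boundary held at 1, the interior started at 0 and f = 0, repeated sweeps drive the interior toward the exact discrete solution u = 1; in a multigrid solver only a few sweeps would be applied per level before moving to a coarser grid.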
432 | The Rapid Evaluation of Potential Fields in Particle Systems
- Greengard
- 1988
Citation Context: ... simulates a system of bodies over a number of timesteps. However, it simulates interactions in two dimensions using a different hierarchical N-body method called the adaptive Fast Multipole Method [Gre87]. As in Barnes, the major data structures are body and tree cells, with multiple particles per leaf cell. FMM differs from Barnes in two respects: (i) the tree is not traversed once per body, but only...
409 | A rapid hierarchical radiosity algorithm
- Hanrahan, Salzman, et al.
- 1991
Citation Context: ...ather than an SOR solver. See [WSH93] for more details. Radiosity: This application computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method [HSA91]. A scene is initially modeled as a number of large input polygons. Light transport interactions are computed among these polygons, and polygons are hierarchically subdivided into patches as necessary...
318 | Parallelism in random access machines
- Fortune, Wyllie
- 1978
Citation Context: ...nvalidations when an invalidating action occurs. All instructions in our simulated multiprocessor complete in a single cycle. The performance of the memory system is assumed to be perfect (PRAM model [FoW78]), so that all memory references complete in a single cycle as well regardless of whether they are cache hits, or whether they are local or remote misses. There are two reasons for this. First, for no...
288 | The working set model of program behavior
- Denning
- 1968
Citation Context: ...of cache size. Often, the relationship between miss rate and cache size is not linear, but contains points of inflection (or knees) at cache sizes where a working set of the program fits in the cache [Den68]. As shown in [RSG93], many parallel applications have a hierarchy of working sets, each corresponding to a different knee in the miss rate versus cache size curve. Some of these working sets are more...
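The knees described above can be located from a reference trace in a single pass using LRU stack distances (the stack-algorithm technique of Mattson et al.): a reference hits in a fully associative LRU cache of C lines exactly when the number of distinct other lines touched since its previous use is less than C. The sketch below assumes a fully associative LRU cache measured in lines, which is a simplification rather than the simulator configuration used in the paper; `two_level_trace` is a hypothetical synthetic trace with two nested working sets, not data from any SPLASH-2 program.

```python
from collections import Counter


def stack_distances(trace):
    """Yield the LRU stack distance of each reference, or None on first touch.

    The distance is the number of distinct other lines referenced since the
    previous reference to the same line. In a fully associative LRU cache of
    C lines, a reference hits exactly when its distance is less than C.
    """
    stack = []  # least recently used first, most recently used last
    for line in trace:
        if line in stack:
            pos = stack.index(line)  # O(len(stack)); fine for a sketch
            dist = len(stack) - 1 - pos
            stack.pop(pos)
        else:
            dist = None  # cold (compulsory) miss at every cache size
        stack.append(line)
        yield dist


def misses_by_cache_size(trace, sizes):
    """Miss count for each fully associative LRU cache size (in lines), in one pass."""
    hist = Counter(stack_distances(trace))
    cold = hist.pop(None, 0)
    return {c: cold + sum(k for d, k in hist.items() if d >= c) for c in sizes}


def two_level_trace(outer=5, blocks=4, block_lines=4, inner=10):
    """Hypothetical trace with two working sets: one block (4 lines) reused
    `inner` times, and all blocks together (16 lines) reused `outer` times."""
    trace = []
    for _ in range(outer):
        for b in range(blocks):
            lines = [b * block_lines + k for k in range(block_lines)]
            for _ in range(inner):
                trace.extend(lines)
    return trace
```

For this 800-reference trace, caches of 1 to 3 lines miss on every reference; at 4 lines the inner working set fits and misses fall to 80; at 16 lines the outer working set fits and only the 16 cold misses remain. Those two drops are the knees in the miss rate versus cache size curve.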
192 | A comparison of sorting algorithms for the connection machine CM-2
- Blelloch
- 1991
167 | A low-overhead coherence solution for multiprocessors with private cache memories
- Papamarcos, Patel
- 1984
Citation Context: ...ibuted memory and one processor per node. Every processor has a single-level cache that is kept coherent using a directory-based Illinois protocol (dirty, shared, valid-exclusive, and invalid states) [PaP84]. Processors are assumed to send replacement hints to the home nodes when shared copies of data are replaced from their caches, so that the list of sharing nodes maintained at the home contains only t...
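The four states named above can be illustrated with a small state machine for a single memory line. This is a hypothetical sketch, not the simulator from the paper: it abstracts the directory as "the set of caches whose copy is not invalid" and does not model home-node messages or data transfer. It does show the property that distinguishes the Illinois protocol: a processor that reads a line no other cache holds receives it in the valid-exclusive state and can later write it without any coherence transaction.

```python
from enum import Enum


class State(Enum):
    INVALID = "I"
    SHARED = "S"
    VALID_EXCLUSIVE = "E"
    DIRTY = "M"


class Line:
    """State of one memory line in each of n private caches (hypothetical sketch).

    The directory's sharer list is modelled simply as the set of caches
    whose state is not INVALID; home-node messages are not modelled.
    """

    def __init__(self, n):
        self.state = [State.INVALID] * n

    def sharers(self):
        return [p for p, s in enumerate(self.state) if s is not State.INVALID]

    def read(self, p):
        """Processor p reads the line; True if a coherence transaction was needed."""
        if self.state[p] is not State.INVALID:
            return False  # read hit in SHARED, VALID_EXCLUSIVE or DIRTY
        others = self.sharers()
        for q in others:
            # A DIRTY owner supplies the data; any DIRTY or
            # VALID_EXCLUSIVE copy drops to SHARED.
            self.state[q] = State.SHARED
        # Illinois: with no other cached copy, the reader gets VALID_EXCLUSIVE.
        self.state[p] = State.SHARED if others else State.VALID_EXCLUSIVE
        return True

    def write(self, p):
        """Processor p writes the line; True if a coherence transaction was needed."""
        if self.state[p] is State.DIRTY:
            return False
        if self.state[p] is State.VALID_EXCLUSIVE:
            self.state[p] = State.DIRTY  # silent upgrade: no other copies exist
            return False
        for q in self.sharers():  # invalidate every other copy
            if q != p:
                self.state[q] = State.INVALID
        self.state[p] = State.DIRTY
        return True

    def replace(self, p):
        """p evicts its copy; a replacement hint keeps the sharer list exact.

        (A DIRTY copy would also be written back to the home node.)
        """
        self.state[p] = State.INVALID
```

For example, with `line = Line(3)`, `line.read(0)` leaves processor 0 in VALID_EXCLUSIVE, and the following `line.write(0)` returns `False` because the upgrade to DIRTY is silent. The `replace` method mirrors the replacement hints mentioned above: without them, the sharer list at the home could contain nodes that no longer hold the line.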
153 | FFTs in external or hierarchical memory
- Bailey
- 1990
Citation Context: ...computation ratio for comparable problem sizes, and (ii) it is not globally synchronized between steps. FFT: The FFT kernel is a complex 1-D version of the radix-√n six-step FFT algorithm described in [Bai90], which is optimized to minimize interprocessor communication. The data set consists of the n complex data points to be transformed, and another n complex data points referred to as the roots of unity...
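The six-step structure follows from splitting the DFT indices. With n = n1·n2, j = j1 + n1·j2 and k = k2 + n2·k1, the transform X_k = Σ_j x_j ω_n^{jk} (ω_n = e^{-2πi/n}) factors as X_{k2+n2·k1} = Σ_{j1} ω_{n1}^{j1·k1} ω_n^{j1·k2} Σ_{j2} x_{j1+n1·j2} ω_{n2}^{j2·k2}: a set of short FFTs, a twiddle-factor multiply, and a second set of short FFTs, with transposes arranging the data so every short FFT works on a contiguous row. The sketch below is a sequential illustration of that factorization, not the SPLASH-2 kernel; `dft`, `transpose`, and `six_step_fft` are hypothetical names, and a direct O(n²) DFT stands in for the per-row FFTs to keep it short. The radix-√n variant corresponds to choosing n1 = n2 = √n.

```python
import cmath


def dft(a):
    """Direct O(n^2) discrete Fourier transform, used as the sub-transform."""
    n = len(a)
    return [sum(a[t] * cmath.exp(-2j * cmath.pi * t * k / n) for t in range(n))
            for k in range(n)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def six_step_fft(x, n1, n2, sub_fft=dft):
    """Length n1*n2 forward DFT of x via transpose / FFTs / twiddle steps."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # x viewed as an n2 x n1 row-major matrix: a[j2][j1] = x[j1 + n1*j2].
    a = [list(x[r * n1:(r + 1) * n1]) for r in range(n2)]
    a = transpose(a)                     # 1. now a[j1][j2]
    a = [sub_fft(row) for row in a]      # 2. n1 FFTs of length n2 -> a[j1][k2]
    a = [[a[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)  # 3. twiddle
          for k2 in range(n2)] for j1 in range(n1)]
    a = transpose(a)                     # 4. now a[k2][j1]
    a = [sub_fft(row) for row in a]      # 5. n2 FFTs of length n1 -> a[k2][k1]
    a = transpose(a)                     # 6. now a[k1][k2]
    # Row-major flattening puts a[k1][k2] at index k1*n2 + k2 = k.
    return [v for row in a for v in row]
```

In a parallel version where each processor owns a contiguous set of rows, steps 2, 3 and 5 need no communication; data moves between processors only in the transposes, which is one plausible reading of why this formulation keeps interprocessor communication low.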
134 | False Sharing and Spatial Locality in Multiprocessor Caches
- Torrellas, Lam, et al.
- 1994
99 | Cache Invalidation Patterns in Shared-Memory Multiprocessors
- Gupta, Weber
- 1992
Citation Context: ... sizes. Hence, although actual data traffic increases as the line size is increased, the total traffic is usually a minimum at between 32 and 128 bytes. Our results reconfirm previous studies such as [GuW92], which shows that the overall network traffic in a distributed shared address space multiprocessor is usually a minimum for cache line sizes of 32 bytes. To summarize, in addition to showing which pr...
94 | The effect of sharing on the cache and bus performance of parallel programs
- Eggers, Katz
- 1989
Citation Context: ...isses due to fragmentation. On parallel machines, long cache lines can also be detrimental if they are used as the units of coherence (which we assume), since a program may then exhibit false sharing [EgK89]. While perfect spatial locality implies no false sharing, a program with quite good spatial locality in each processor’s reference stream (e.g. a processor writes every other element of a contiguous ...
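The example in the excerpt (a processor writing every other element of a contiguous array) can be made concrete by counting which coherence units end up with more than one writer. The sketch below assumes 8-byte elements, 64-byte cache lines and two processors; these sizes and the function name `multi_writer_lines` are hypothetical choices for illustration. Because each element has exactly one writer, every line it reports is falsely shared: the processors never touch the same word, only the same line.

```python
def multi_writer_lines(writes, elem_bytes=8, line_bytes=64):
    """Return the cache lines written by more than one processor.

    `writes` is an iterable of (processor, element_index) pairs for a
    contiguous array. If every element has a single writer, as below, any
    line returned here is falsely shared: processors write different words
    that merely happen to share a coherence unit.
    """
    writers = {}
    for proc, idx in writes:
        line = idx * elem_bytes // line_bytes
        writers.setdefault(line, set()).add(proc)
    return {line for line, procs in writers.items() if len(procs) > 1}


N = 64  # hypothetical array of 64 eight-byte elements, i.e. 8 lines of 64 bytes
# Two processors, each writing every other element (interleaved ownership).
interleaved = [(i % 2, i) for i in range(N)]
# Same amount of work, but each processor owns a contiguous half.
blocked = [(0 if i < N // 2 else 1, i) for i in range(N)]
```

With interleaved ownership all 8 lines have two writers, so every write can invalidate the other processor's copy, while the blocked assignment shares none. If the line size shrinks to one element (`line_bytes=8`), the interleaved pattern also shares no lines, which illustrates how longer lines used as coherence units create false sharing.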
78 | Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors
- Rothberg, Singh, et al.
- 1993
Citation Context: ... the relationship between miss rate and cache size is not linear, but contains points of inflection (or knees) at cache sizes where a working set of the program fits in the cache [Den68]. As shown in [RSG93], many parallel applications have a hierarchy of working sets, each corresponding to a different knee in the miss rate versus cache size curve. Some of these working sets are more important to perform...
75 | Volume Rendering on Scalable Shared-Memory MIMD Architectures
- Nieh, Levoy
- 1992
Citation Context: ...similar to those in Raytrace. The main data structures are the voxels, octree and pixels. Data accesses are input-dependent and irregular, and no attempt is made at intelligent data distribution. See [NiL92] for details. Water-Nsquared: This application is an improved version of the Water program in SPLASH [SWG92]. This application evaluates forces and potentials that occur over time in a system of water...
73 | The Detection and Elimination of Useless Misses in Multiprocessors
- Dubois, Skeppstedt, et al.
- 1993
70 | Simulation of Multiprocessors: Accuracy and Performance
- Goldschmidt
- 1993
Citation Context: ...emory system parameters. 2.2 Approach to Characterization Experimental Environment: We perform our characterization study through execution-driven simulation, using the Tango-Lite reference generator [Gol93] to drive a multiprocessor cache and memory system simulator. The simulator tracks cache misses of various types according to an extension of the classification presented in [DSR+93] developed to hand...
53 | The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors
- Woo, Singh, et al.
- 1994
Citation Context: ... line reuse. To avoid memory hotspotting, submatrices are communicated in a staggered fashion, with processor i first transposing a submatrix from processor i+1, then one from processor i+2, etc. See [WSH94] for more details. FMM: Like Barnes, the FMM application also simulates a system of bodies over a number of timesteps. However, it simulates interactions in two dimensions using a different hierarch...
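The staggered order in the transpose described above can be checked with a few lines of code: if processor i reads from processor (i + s) mod p at step s, each step is a permutation, so every owner serves exactly one reader. If instead every processor walked the other owners in the same fixed order, most processors would read from the same owner at once, the memory hotspot the excerpt refers to. This is a hypothetical sketch of the scheduling idea only; the function names, the processor count, and the naive baseline are assumptions, and each processor's own submatrix (transposed locally) is left out.

```python
from collections import Counter


def staggered_schedule(p):
    """schedule[s][i] = owner of the submatrix processor i transposes at step s.

    At step s (1 <= s < p) processor i reads from (i + s) mod p, so each
    step is a permutation: every owner serves exactly one reader.
    The processor's own submatrix (s = 0) is handled locally and omitted.
    """
    return [[(i + s) % p for i in range(p)] for s in range(1, p)]


def naive_schedule(p):
    """Every processor visits the other owners in the same order 0, 1, 2, ..."""
    return [[[q for q in range(p) if q != i][s] for i in range(p)]
            for s in range(p - 1)]


def max_readers_per_owner(schedule):
    """Worst-case number of processors reading one owner's memory in a step."""
    return max(max(Counter(step).values()) for step in schedule)
```

With 8 processors, `max_readers_per_owner(staggered_schedule(8))` is 1, whereas the naive order sends 7 processors to processor 0 in the first step. Both schedules visit every remote owner exactly once; only the order, and hence the contention, differs.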
53 | Limitations of cache prefetching on a bus-based multiprocessor
- Tullsen, Eggers
- 1993
42 | Effective cache prefetching on bus-based multiprocessors
- Tullsen, Eggers
- 1995
15 | The Performance Advantages of Integrating Message Passing in Cache-Coherent Multiprocessors
- Woo, Singh, et al.
- 1993
Citation Context: ...D arrays, with all subgrids allocated contiguously and locally in the nodes that own them, and (iii) it uses a red-black Gauss-Seidel multigrid equation solver [Bra77], rather than an SOR solver. See [WSH93] for more details. Radiosity: This application computes the equilibrium distribution of light in a scene using the iterative hierarchical diffuse radiosity method [HSA91]. A scene is initially modeled...
14 | Hierarchical N-Body Methods on Shared Address Space Multiprocessors
- Holt, Singh
- 1995
Citation Context: ...in three dimensions over a number of time-steps, using the Barnes-Hut hierarchical N-body method. It differs from the version in SPLASH in two respects: (i) it allows multiple particles per leaf cell [HoS95], and (ii) it implements the cell data structures differently for better data locality. Like the SPLASH application, it represents the computational domain as an octree with leaves containing informat...
8 | Parallel Visualization Algorithms
- JP, Gupta, et al.
- 1994
5 | The Effects of Latency
- Holt, Heinrich, et al.
- 1995