The SPLASH-2 programs: Characterization and methodological considerations
- International Symposium on Computer Architecture
, 1995
Cited by 962 (13 self)
The SPLASH-2 suite of parallel applications has recently been released to facilitate the study of centralized and distributed shared-address-space multiprocessors. In this context, this paper has two goals. One is to quantitatively characterize the SPLASH-2 programs in terms of the fundamental properties and architectural interactions that are important for understanding them well. The properties we study include the computational load balance, communication-to-computation ratio and traffic needs, important working set sizes, and issues related to spatial locality, as well as how these properties scale with problem size and the number of processors. The other, related goal is methodological: to assist people who will use the programs in architectural evaluations to prune the space of application and machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating points in terms of cache size and problem size are representative of realistic situations, which are not, and which are redundant. Using SPLASH-2 as an example, we hope to convey the importance of understanding the interplay of problem size, number of processors, and working sets in designing experiments and interpreting their results.
Integrating Non-blocking Synchronisation in Parallel Applications: Performance Advantages and Methodologies
- In Proceedings of the 3rd ACM Workshop on Software and Performance (WOSP'02)
, 2002
Cited by 18 (2 self)
In this paper we investigate how the performance and speedup of applications would be affected by using non-blocking rather than blocking synchronisation. The results obtained show that, for many applications, non-blocking synchronisation leads to significant speedups for a fairly large number of processors, while it never slows the applications down. As part of this investigation, this paper also provides a set of efficient and simple translations that show how typical blocking operations found in parallel applications, such as simple locks, queues and lock trees, can be translated into non-blocking equivalents that use hardware primitives common in modern multiprocessor systems. With these translations, this paper demonstrates that it is easy for the application designer/programmer to replace the blocking operations commonly found in parallel applications with non-blocking equivalents. For the empirical results, a set of representative applications running on a large-scale ccNUMA machine was used.
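As a rough illustration of the kind of translation this abstract describes, the sketch below replaces a lock-protected counter update with a compare-and-swap (CAS) retry loop. It is not taken from the paper: the type `NonBlockingCounter` and its `add` method are hypothetical names, and a shared counter is assumed here only as the simplest stand-in for a "simple lock" use case.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical example type, not from the paper.
// A blocking version would guard `value += delta` with a mutex;
// here a CAS retry loop on an atomic takes the place of the lock.
struct NonBlockingCounter {
    std::atomic<long> value{0};

    // Adds `delta` and returns the new total.
    long add(long delta) {
        long seen = value.load();
        // If another thread changed `value` in the meantime, the CAS fails,
        // `seen` is refreshed with the current value, and we retry.
        while (!value.compare_exchange_weak(seen, seen + delta)) {
        }
        return seen + delta;
    }
};
```

No thread ever waits on a lock holder in this version: a failed CAS only means some other thread made progress, so the caller retries with the fresh value. The same `add` can be called concurrently from several `std::thread` workers.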
Efficient parallel refinement for hierarchical radiosity on a DSM computer
- In Proceedings of the Third Eurographics Workshop on Parallel Graphics and Visualisation
, 2000
Cited by 2 (2 self)
We introduce a simple yet efficient extension to the hierarchical radiosity algorithm for the simulation of global illumination, taking advantage of a distributed shared memory (DSM) parallel architecture. Our task definition is based on a very fine-grained decomposition of the refinement process at the level of individual pairs of hierarchical elements, which allows a very simple implementation from existing code with minimal modifications. We describe a generic refinement scheme based on a scheduler, allowing both easy parallelization and reordering of refinement tasks, which is useful for interactive and user-driven applications. We show that a very simple task grouping mechanism suffices to avoid wasting excessive time on synchronization. Results obtained on an SGI Origin computer with 64 processors validate the approach, with excellent speedups using the full capacity of the machine.
Hybrid Scheduling for Efficient Ray Tracing of Complex Images
- High Performance Computing for Computer Graphics and Visualisation
, 1995
Cited by 2 (1 self)
Ray tracing is a powerful technique to generate realistic images of 3D scenes. A
Order of Pixel Traversal and Parallel Volume Ray-tracing on the Distributed Shared Volume Buffer
, 1995
Cited by 2 (0 self)
The distributed shared volume buffer (DSVB) is a software package we developed to facilitate general, parallel volume ray-tracing on networked workstations. It is internally implemented with message passing and adopts the cache-coherent shared memory model. Thus the cache efficiency of volume data access is of utmost importance to the performance of a DSVB-based ray-tracer. For a given data set, the data access behavior of a volume ray-tracer depends mostly on the order in which the pixels of the image are traversed. This paper addresses the cache coherence problem and compares three kinds of pixel traversal order: one-way, two-way and along a space-filling curve. Experiments show that traversing pixels along a space-filling curve (e.g. a Hilbert curve) greatly enhances cache efficiency, especially when the size of the cache is small compared to that of the volume data, and at the same time greatly simplifies task distribution and management. Keywords: Pixel Traversal, Parallel Volume Ray-tracing...
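To make the space-filling-curve traversal concrete, the sketch below enumerates the pixels of an n x n image in Hilbert-curve order using the well-known index-to-coordinate conversion. It is an illustration only, not the DSVB code: the names `hilbert_pixel`, `hilbert_order` and `is_contiguous` are hypothetical, and a square image whose side is a power of two is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <utility>
#include <vector>

using Pixel = std::pair<int, int>;  // (x, y)

// Rotate/flip a quadrant of side s so that sub-curves join end to end.
static void rotate_quadrant(int s, int& x, int& y, int rx, int ry) {
    if (ry == 0) {
        if (rx == 1) { x = s - 1 - x; y = s - 1 - y; }
        std::swap(x, y);
    }
}

// Maps position d along the curve (0 <= d < n*n) to a pixel.
// Assumes n is a power of two.
Pixel hilbert_pixel(int n, int d) {
    assert(n > 0 && (n & (n - 1)) == 0);
    int x = 0, y = 0;
    for (int s = 1; s < n; s *= 2) {
        int rx = 1 & (d / 2);
        int ry = 1 & (d ^ rx);
        rotate_quadrant(s, x, y, rx, ry);
        x += s * rx;
        y += s * ry;
        d /= 4;
    }
    return {x, y};
}

// All pixels of an n x n image in Hilbert order.
std::vector<Pixel> hilbert_order(int n) {
    std::vector<Pixel> order;
    order.reserve(static_cast<std::size_t>(n) * n);
    for (int d = 0; d < n * n; ++d) order.push_back(hilbert_pixel(n, d));
    return order;
}

// True if every consecutive pair of pixels is horizontally or vertically adjacent.
bool is_contiguous(const std::vector<Pixel>& order) {
    for (std::size_t i = 1; i < order.size(); ++i) {
        int step = std::abs(order[i].first - order[i - 1].first) +
                   std::abs(order[i].second - order[i - 1].second);
        if (step != 1) return false;
    }
    return true;
}
```

Because consecutive pixels on the curve are always neighbours and the curve finishes one quadrant before entering the next, rays cast for nearby positions in the sequence tend to touch nearby volume data, which is the locality property the abstract relies on. Splitting the sequence into contiguous index ranges is one plausible way to hand out tasks.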
Methodological Considerations and Characterization of the SPLASH-2 Parallel Application Suite
, 1995
We have recently released the SPLASH-2 suite of parallel applications for the study of centralized and/or distributed shared-address-space multiprocessors. In this regard, this paper has two main goals. One is to quantitatively characterize the SPLASH-2 programs in terms of fundamental properties that are important to understanding these parallel programs, and to describe how these properties vary with problem and machine parameters. The properties we study include the concurrency and load balance behavior, the communication-to-computation ratio, the sizes and scaling of the important working sets, and issues related to spatial locality. The other, perhaps more important, goal is methodological: to assist people who will use the programs for architectural evaluations to prune the design space of machine parameters in an informed and meaningful way. For example, by characterizing the working sets of the applications, we describe which operating regions in terms of cache size and...

