Performance characteristics of a network of commodity multiprocessors for the NAS benchmarks using a hybrid memory model
, 1998
Cited by 10 (3 self)
The availability of multiprocessors and high performance networks offers an opportunity to construct CLUsters of MultiProcessors (CLUMPs) and use them as parallel computing platforms. The distinctive feature of CLUMPs over traditional parallel computers is their hybrid memory model (message passing between the nodes and shared memory inside the nodes). In this paper, we investigate the performance characteristics of a CLUMP using a programming model close to the hardware memory model. The programming model is based on MPI for the message passing part and OpenMP for the shared memory part. The paper provides three contributions: a) the performance potential of a biprocessor PC as a single node in the context of shared memory parallel programs, and as the processing node of a parallel platform in the context of MPI programs; b) performance measurements of a cluster of biprocessor PCs for the NAS 2.3 parallel benchmarks using the hybrid memory model; and c) some explanations for the performance results, obtained by examining a breakdown of the benchmarks' execution time and by showing the existence of a theoretical limit for the intra-multiprocessor speedup.
Irregular Parallel Algorithms in Java
- In Irregular'99: Sixth International Workshop on Solving Irregularly Structured Problems in Parallel
, 1999
Cited by 9 (0 self)
The nested data-parallel programming model supports the design and implementation of irregular parallel algorithms. This paper describes work in progress to incorporate nested data parallelism into the object model of Java by developing a library of collection classes and adding a forall statement to the language. The collection classes provide parallel implementations of operations on the collections. The forall statement allows operations over the elements of a collection to be expressed in parallel. We distinguish between shape and data components in the collection classes, and use this distinction to simplify algorithm expression and to improve performance. We present initial performance data on two benchmarks with irregular algorithms, EM3d and Convex Hull, and on several microbenchmark programs.
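The shape/data distinction mentioned in this abstract can be illustrated with a small sketch. This is not the paper's Java library: it is a Python analogy in which the names `NestedSeq`, `from_lists`, `map_data` and `segment_sums` are hypothetical. It assumes a nested collection is stored as one flat list of values (the data) plus a list of inner-sequence lengths (the shape), so an element-wise operation touches only the data and leaves the shape alone.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NestedSeq:
    """Hypothetical nested sequence: flat values plus per-segment lengths."""
    data: List[float]   # all elements of all inner sequences, concatenated
    shape: List[int]    # length of each inner sequence, in order

    @staticmethod
    def from_lists(nested: List[List[float]]) -> "NestedSeq":
        # Flatten the values and record how long each inner list was.
        data = [x for inner in nested for x in inner]
        shape = [len(inner) for inner in nested]
        return NestedSeq(data, shape)

    def map_data(self, fn: Callable[[float], float]) -> "NestedSeq":
        # Element-wise operation: only the data changes, the shape is reused.
        # Each application of fn is independent, so it could run in parallel.
        return NestedSeq([fn(x) for x in self.data], list(self.shape))

    def segment_sums(self) -> List[float]:
        # Per-segment reduction: walk the shape to find segment boundaries.
        sums = []
        start = 0
        for length in self.shape:
            sums.append(sum(self.data[start:start + length]))
            start += length
        return sums


seq = NestedSeq.from_lists([[1.0, 2.0], [], [3.0, 4.0, 5.0]])
doubled = seq.map_data(lambda x: 2 * x)
```

Because irregular inner lengths live only in `shape`, the flat `data` list can be split evenly across workers regardless of how uneven the nesting is.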
Parallel 3D Adaptive Mesh Refinement in Titanium
, 1999
Cited by 6 (3 self)
We describe a 3-dimensional adaptive mesh refinement Poisson solver. The complete program consists of about 3,500 lines of Titanium code and runs on both shared-memory and distributed-memory architectures. This paper focuses on the algorithm and on our experiences in writing AMR and tuning its performance.
1 Introduction
This paper is a case study in the use of an experimental programming language in implementing a useful numerical method, adaptive mesh refinement (AMR), for solving Poisson's equation, $\Delta \phi = \rho$, over the cube $\Omega = [0, 1]^3$. Poisson's equation and its close relatives arise in many applications such as fluid mechanics, gravitation, heat flow, and electromagnetics. Poisson solvers are also used as components of some other PDE solvers. The authors have been involved in the design and implementation of the Titanium language, a dialect of Java intended for use in parallel computation [9]. Here, we attempt to show that Titanium is well-suited to the implementation ...
An Efficient Shared Memory Layer for Distributed Memory Machines
, 1994
Cited by 5 (0 self)
This report describes a system called SAM that simplifies the task of programming machines with distributed address spaces by providing a shared name space and dynamic caching of remotely accessed data. SAM makes it possible to utilize the computational power available in networks of workstations and distributed memory machines, while getting the ease of programming associated with a single address space model. The global name space and caching are especially important for complex scientific applications with irregular communication and parallelism. SAM is based on the principle of tying synchronization with data accesses. Precedence constraints are expressed by accesses to single-assignment values, and mutual exclusion constraints are represented by access to data items called accumulators. Programmers easily express the communication and synchronization between processes using these operations; they can also use alternate paradigms that are built with the SAM primitives. Operations f...
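The two synchronization styles named in this abstract can be sketched with ordinary threads. The following Python code is a shared-memory analogy only, not SAM's API or its distributed implementation; the class names `SingleAssignment` and `Accumulator` and the function `run_demo` are hypothetical. It assumes that a single-assignment value may be written once and that readers wait until it exists (a precedence constraint), while an accumulator is updated under mutual exclusion.

```python
import threading


class SingleAssignment:
    """Hypothetical write-once value: readers block until it is assigned."""

    def __init__(self):
        self._ready = threading.Event()
        self._lock = threading.Lock()
        self._value = None

    def set(self, value):
        with self._lock:
            if self._ready.is_set():
                raise ValueError("value already assigned")
            self._value = value
            self._ready.set()

    def get(self):
        # Synchronization is tied to the data access: wait for the producer.
        self._ready.wait()
        return self._value


class Accumulator:
    """Hypothetical accumulator: every update runs under mutual exclusion."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._value = initial

    def update(self, fn):
        with self._lock:
            self._value = fn(self._value)
            return self._value

    def read(self):
        with self._lock:
            return self._value


def run_demo(num_workers=4):
    scale = SingleAssignment()
    total = Accumulator(0)

    def worker(i):
        s = scale.get()                      # blocks until scale is assigned
        total.update(lambda v: v + s * i)    # exclusive read-modify-write

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_workers)]
    for t in threads:
        t.start()
    scale.set(10)                            # releases all waiting readers
    for t in threads:
        t.join()
    return total.read()
```

With four workers and a scale of 10, `run_demo()` adds 10 * (0 + 1 + 2 + 3) into the accumulator, independent of the order in which the workers run.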
OPTNET: A Cost-Effective Optical Network for Multiprocessors
- In Proceedings of the International Conference on Supercomputing '98
, 1998
Cited by 5 (4 self)
In this paper we propose the OPTNET, a novel optical network and associated coherence protocol for scalable multiprocessors. The network divides its channels into broadcast and point-to-point groups. The broadcast channels are used for memory block request, coherence, and synchronization transactions, while the point-to-point channels are utilized for memory block transfer operations. The three main distinguishing features of the OPTNET are: a) its broadcast channels behave well under high contention; b) its point-to-point channels do not require any access control mechanism; and c) it can achieve good communication performance at a low hardware cost. We use detailed execution-driven simulations of ten applications to evaluate a 16-node OPTNET-based multiprocessor. We compare our multiprocessor against highly-efficient systems based on the DMON and LambdaNet optical interconnects. Our results demonstrate that our system outperforms the DMON multiprocessors consistently for our applications, even though the OPTNET requires no more hardware than DMON. The comparison between our multiprocessor and the LambdaNet system shows performance differences in the range of 0 to 12% in favor of the LambdaNet. However, the LambdaNet requires a factor of p more hardware than the OPTNET, where p is the number of computational nodes in the multiprocessor. Based on these results and on our parameter space study, our main conclusion is that the combination of our network and coherence protocol strikes an excellent cost/performance ratio under most architectural assumptions.
Supporting Software Distributed Shared Memory with an Optimizing Compiler
- In Proc. of the 1998 International Conference on Parallel Processing
, 1998
Cited by 4 (1 self)
To execute a shared memory program efficiently, we have to manage memory consistency with low overhead and utilize as much of the platform's communication bandwidth as possible. A software distributed shared memory (DSM) can solve these problems with proper support from an optimizing compiler. The optimizing compiler can detect shared write operations using interprocedural points-to analysis. It also coalesces shared write commitments onto contiguous regions, and removes redundant write commitments using interprocedural redundancy elimination. A page-based target software DSM system can utilize communication bandwidth, owing to the coalescing optimization. We have implemented the above optimizing compiler and a runtime software DSM on the AP1000+. We have obtained a high speed-up ratio with the SPLASH-2 benchmark suite. The result shows that using an optimizing compiler to assist a software DSM is a promising approach to obtaining good performance. It also shows that the appropriate proto...
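The effect of coalescing write commitments and dropping redundant ones can be shown with a small interval-merging sketch. The paper performs these optimizations at compile time through interprocedural analysis; the Python function below is only a runtime analogy, and the name `coalesce_writes` and the representation of a write commitment as a half-open `(start, end)` address range are assumptions made for illustration.

```python
def coalesce_writes(writes):
    """Merge write ranges into the fewest contiguous regions.

    writes: iterable of (start, end) half-open address ranges.
    Exact duplicates are dropped, and ranges that overlap or touch
    are merged into a single region.
    """
    merged = []
    for start, end in sorted(set(writes)):   # set() removes exact duplicates
        if start >= end:
            continue                         # ignore empty ranges
        if merged and start <= merged[-1][1]:
            # Overlaps or is adjacent to the previous region: extend it.
            last_start, last_end = merged[-1]
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


regions = coalesce_writes([(0, 4), (4, 8), (0, 4), (16, 20), (6, 10)])
```

Here five commitments collapse into two regions, `(0, 10)` and `(16, 20)`, so a page-based system in this analogy would send two larger messages instead of five small ones.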
On the Use and Performance of Explicit Communication Primitives in Cache-coherent Multiprocessor Systems
- In Proceedings of the Third International Symposium on High Performance Computer Architecture
, 1996
Cited by 4 (0 self)
Recent developments in shared-memory multiprocessor systems advocate using off-the-shelf hardware to provide basic communication mechanisms and using software to implement cache coherence policies. The exposure of communication mechanisms to software opens many opportunities for enhancing application performance. In this paper we propose a set of communication primitives that are absent from pure cache coherent schemes. The communication primitives, implemented on a communication co-processor, introduce a flavor of message passing and permit protocol optimization, without sacrificing the simplicity of the shared memory systems. To assess the overhead of the software implementation of the primitives and protocols, we compare, via simulation, the execution of three programs from the SPLASH-2 suite on four environments: a PRAM model, a hardware cache coherence scheme, a software scheme implementing only the basic cache coherence protocol, and an optimized software solution supporting the ...
NetCache: A Network/Cache Hybrid for Multiprocessors
- In Proceedings of the III Workshop on Optics and Computer Science
, 1999
Cited by 3 (3 self)
In this paper we propose the use of an optical network not only as the communication medium, but also as a system-wide cache for the shared data in a multiprocessor. More specifically, the basic idea of our novel network/cache hybrid (and associated coherence protocol), called NetCache, is to use an optical ring network on which some amount of recently-accessed shared data is continually sent around. These data are organized as a cache shared by all processors. We use detailed execution-driven simulations of a dozen applications to evaluate a multiprocessor based on our NetCache architecture. We compare a 16-node multiprocessor with a third-level NetCache against three highly-efficient systems based on the DMON and LambdaNet optical interconnects. Our results demonstrate that the NetCache multiprocessor outperforms the DMON systems consistently for our applications; running time differences can be as significant as 105%. The NetCache system also compares favorably against the LambdaNet multiprocessor. For nine of our applications, the running time advantage of the NetCache machine ranges from 7% for applications with little data reuse in the shared cache to 79% for applications with significant data reuse. For the other applications, the two systems perform similarly. Based on these results and on our parameter space study, our main conclusion is that the NetCache is highly efficient under most architectural assumptions and for most applications.
The C// Data Parallel Language on a Shared Memory Multiprocessor
, 1997
Cited by 2 (0 self)
Image processing applications require both computing and input/output power. The GFLOPS project's aim is to develop a parallel architecture as well as its software environment to implement those applications efficiently. This goal can be achieved only with a real collaboration among the architecture, the compiler and the programming language. This paper investigates the C// language on global address space architectures. The main advantage of our paradigm is that it allows a unique framework to express both data and control parallelism. We will first present the structure of the GFLOPS machine used to implement this language. The C// parallel language will be presented in the next section, and finally we will evaluate the effectiveness of the mechanisms incorporated in the architecture to implement the high level C// structures.
1. Introduction
Most recent MPP systems employ fast sequential microprocessors surrounded by a shell of communication and synchronization logic. The GFLOPS computer...
Gardens: High Performance Objects, Tasking and Migration for Cluster Computing
, 1997
Cited by 2 (1 self)
Gardens is an integrated programming language and system which supports efficient parallel computation across workstation clusters. In particular it addresses the three goals of high performance, adaptive parallelism and abstraction. High performance is the goal of parallel computing, and abstraction simplifies programming. Adaptive parallelism entails a program adapting during its execution to utilise a changing set of otherwise idle workstations. Tasks are used as units of work, and task migration is used to realise adaptive parallelism. Tasking is non-preemptive; compared to preemptive tasking this leads to simpler programming and greater efficiency. Global objects are used for inter-task communication. These support abstraction and, importantly, map directly to active messages: a very efficient messaging system. These features of Gardens are tightly integrated yet orthogonal.
1 Introduction
Gardens is an integrated programming language and system which supports efficient parallel computing...
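Non-preemptive tasking, as described in this abstract, means a task gives up the processor only at points it chooses. The sketch below is not Gardens code: it is a Python analogy built on generators, and the names `run_tasks` and `counter` are hypothetical. It assumes each task is a generator that runs until its next `yield`, at which point a simple round-robin scheduler switches to another task.

```python
from collections import deque


def run_tasks(tasks):
    """Round-robin scheduler for cooperative (non-preemptive) tasks.

    Each task is a generator. It runs uninterrupted until it yields,
    so no locking is needed around the code between two yields.
    """
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)            # run the task up to its next yield
        except StopIteration:
            continue              # task finished: do not reschedule it
        ready.append(task)        # task yielded: put it back in the queue


def counter(name, steps, log):
    # Example task: record one step, then voluntarily give up control.
    for i in range(steps):
        log.append((name, i))
        yield


log = []
run_tasks([counter("a", 2, log), counter("b", 3, log)])
```

Since switches happen only at `yield`, the interleaving recorded in `log` is deterministic, which is one way to read the abstract's claim that non-preemptive tasking simplifies programming.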

