The J-Machine Multicomputer: An Architectural Evaluation
- In Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993
- Cited by 132 (4 self)
The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an integrated multicomputer component, the Message-Driven Processor (MDP), and 1 MByte of DRAM. The MDP provides mechanisms to support efficient communication, synchronization, and naming. A 512 node J-Machine is operational and is due to be expanded to 1024 nodes in March 1993. In this paper we discuss the design of the J-Machine and evaluate the effectiveness of the mechanisms incorporated into the MDP. We measure the performance of the communication and synchronization mechanisms directly and investigate the behavior of four complete applications. 1 Introduction Over the past 40 years, sequential von Neumann processors have evolved a set of mechanisms appropriate for supporting most sequential programming models. It is clear, however, from efforts to build concurrent machines by connecting man...
Micro Benchmark Analysis of the KSR1
- In Supercomputing '93, 1993
- Cited by 34 (2 self)
A new approach, micro benchmarks, has recently been developed. Using this technique, we have analyzed the KSR1, and in particular the "ALLCACHE" memory architecture and ring interconnection. We have been able to elucidate many facets of memory performance. The technique has enabled us to identify and characterize parts of the memory design not described by Kendall Square Research. Our results show that a miss in the local cache can incur a penalty ranging from 7.5 microseconds to 500 microseconds (when a dirty "page" in the local cache must be evicted). The programmer must be very careful in placement and accessing of data to obtain maximum performance from the KSR1; the data presented here will help in understanding the performance actually obtained. 1. Introduction The KSR1 from Kendall Square Research is a novel new parallel computer. It is the first commercial machine embodying a scalable all cache form of shared memory architecture. In addition, there are a number of other inter...
KSR1 Multiprocessor: Analysis of Latency Hiding Techniques in a Sparse Solver
- Proceedings of the 7th International Parallel Processing Symposium, 1993
- Cited by 17 (5 self)
This paper analyzes and evaluates some novel latency hiding features of the KSR1 multiprocessor: prefetch and poststore instructions and automatic updates. As a case study, we analyze the performance of an iterative sparse solver which generates irregular communications. We show that automatic updates significantly reduce the amount of communication. Although prefetch and poststore instructions reduce the coherence miss ratios, they do not significantly improve the sparse solver performance due to the overhead in executing these instructions. 1 Introduction Message-passing distributed-memory systems are scalable to large numbers of processors. However, to use such systems, the programmer must manage the complex details of work distribution, data placement, message generation, and scheduling. These steps are manageable for applications with regular parallelism but are more difficult for problems with irregular parallelism. At the high end of the spectrum, true shared-memory multiproces...
Evaluating the Communication Performance of MPPs Using Synthetic Sparse Matrix Multiplication Workloads
- In Proceedings of the International Conference on Supercomputing, 1993
- Cited by 14 (4 self)
Communication has a dominant impact on the performance of massively parallel processors (MPPs). We propose a methodology to evaluate the internode communication performance of MPPs using a controlled set of synthetic workloads. By generating a range of sparse matrices and measuring the performance of a simple parallel algorithm that repeatedly multiplies a sparse matrix by a dense vector, we can determine the relative performance of different communication workloads. Specifiable communication parameters include the number of nodes, the average amount of communication per node, the degree of sharing among the nodes, and the computation-communication ratio. We describe a general procedure for constructing sparse matrices that have these desired communication and computation parameters, and apply a range of these synthetic workloads to evaluate the hierarchical ring interconnection and cache-only memory architecture (COMA) of the Kendall Square Research KSR1 MPP. This analysis discusses th...
On the importance of parallel application placement in NUMA multiprocessors
- In SEDMS IV. Symposium on Experiences with Distributed and Multiprocessor Systems, 1993
- Cited by 14 (2 self)
The thesis of this paper is that scheduling decisions in large-scale, shared-memory, NUMA (Non-Uniform Memory Access) multiprocessors must consider not only how many processors, but also which processors to allocate to each application. We call the problem of assigning parallel processes of an application to processors application placement. We explore the importance of placement decisions by measuring the execution time of several parallel applications using different placements on a shared-memory NUMA multiprocessor. The results of these experiments lead us to conclude that, as expected, in small-scale mildly NUMA multiprocessors, placement decisions have only a minor effect on the execution time of parallel applications. However, the results also show that placement decisions in large-scale multiprocessors are critical, and that localization that considers the architectural clusters inherent in these systems is essential. Our experiments also show that the importance of placement decisions increases substantially with the size and NUMAness of the system and that the placement of individual processes of an application within the subset of chosen processors also significantly impacts performance.
Linear Algebra Calculations on a virtual shared memory computer
- Int Journal of High Speed Computing, 1992
- Cited by 13 (6 self)
We evaluate the impact of the memory hierarchy of virtual shared memory computers on the design of algorithms for linear algebra. On classical shared memory multiprocessor computers, block algorithms are used for efficiency. We study here the potential and the limitations of such approaches on globally addressable distributed memory computers. The BBN TC2000 belongs to this class of computers and will be used to illustrate our discussion. The BBN TC2000 is a virtual shared memory multiprocessor with up to 512 nodes. Each node contains one RISC processor (a Motorola 88100) and 16 MBytes of memory. The originality of the BBN TC2000 comes from its interconnection network (Butterfly switch) and from its globally addressable memory. Memory references can be either remote or local to one node. The memory hierarchy consists of the disks, the remote memory, the local memory of each node, the local cache of the 88100, and the internal registers of the processor. We describe the implementation o...
Kernel–Kernel communication in a shared-memory multiprocessor
- Concurrency: Practice and Experience, 1993
- Cited by 13 (0 self)
In the standard kernel organization on a bus-based multiprocessor, all processors share the code and data of the operating system; explicit synchronization is used to control access to kernel data structures. Distributed-memory multicomputers use an alternative approach, in which each instance of the kernel performs local operations directly and uses remote invocation to perform remote operations. Either approach to inter-kernel communication can be used in a large-scale shared-memory multiprocessor. In this paper we discuss the issues and architectural features that must be considered when choosing between remote memory access and remote invocation. We focus in particular on experience with the Psyche multiprocessor operating system on the BBN Butterfly Plus. We find that the Butterfly architecture is biased towards the use of remote invocation for kernel operations that perform a significant number of memory references, and that current architectural trends are likely to increase this bias in future machines. This conclusion suggests that straightforward parallelization of existing kernels (e.g. by using semaphores to protect shared data) is unlikely in the future to yield acceptable performance. We note, however, that remote memory access is useful for small, frequently-executed operations, and is likely to remain so.
Trace-Driven Simulation of Data-Alignment and other Factors affecting Update and Invalidate Based Coherent Memory
- 1994
- Cited by 6 (5 self)
The exploitation of locality of reference in shared memory multiprocessors is one of the most important problems in parallel processing today. Locality can be managed at several levels: hardware, operating system, runtime environment of the compiler, and user level. In this paper we investigate the problem of exploiting locality at the operating system level and its interactions with the compiler and the architecture. Our main conclusion, based on trace-driven simulations of real applications, is that exploitation of locality is effective only if all three levels cooperate.
MAD Kernels: An Experimental Testbed to Study Multiprocessor Memory System Behavior
- 1992
- Cited by 4 (1 self)
On large-scale multiprocessors, access to common memory is one of the key performance limiting factors. The shared-memory performance depends not only on the characteristics of the memory hierarchy itself, but also upon the characteristics of the memory address streams and the interaction between the two. We present a technique for multiprocessor workload construction and a family of artificial kernels, called MAD-kernels, to systematically investigate the behavior of the memory hierarchy. The measured performance is independent of any particular application or algorithm. The proposed methodology is demonstrated on two commercial shared-memory systems. Keywords: Performance evaluation, shared-memory multiprocessors, memory hierarchy, interconnection networks, resource contention, synchronization overhead, memory access patterns, unit grain characterization. This work was supported in part by NSF grants ECS-88-14027, MIP-88-11815, CDA-9121641 and MIP-9204066, and DOE grant DE-FG02-93...
Analysis of Memory Latency Factors and their Impact on KSR1 MPP Performance
- 1993
- Cited by 3 (0 self)
The Kendall Square Research KSR1 MPP system has a shared address space, which spreads over physically distributed memory modules. Thus, memory access time can vary over a wide range even when accessing the same variable, depending on how this variable is being referenced and updated by the various processors. Since the processor stalls during this access time, the KSR1 performance depends considerably on the program's locality of reference. The KSR1 provides two novel features to reduce such long memory latencies: prefetch and post-store instructions. This paper analyzes the various memory latency factors that stall the processor during program execution. A suitable model for evaluating these factors is developed for the execution of FORTRAN DO-loops parallelized with the Tile construct using the Slice strategy. The DO-loops used in the benchmark program perform sparse matrix-vector multiply, vector-vector dot product, and vector-vector addition, which are typically executed in an it...

