Results 1 - 10 of 100
The MIT Alewife Machine: Architecture and Performance
In Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995
Cited by 163 (22 self)
Alewife is a multiprocessor architecture that supports up to 512 processing nodes connected over a scalable and cost-effective mesh network at a constant cost per node. The MIT Alewife machine, a prototype implementation of the architecture, demonstrates that a parallel system can be both scalable and programmable. Four mechanisms combine to achieve these goals: software-extended coherent shared memory provides a global, linear address space; integrated message passing allows compiler and operating system designers to provide efficient communication and synchronization; support for fine-grain computation allows many processors to cooperate on small problem sizes; and latency tolerance mechanisms -- including block multithreading and prefetching -- mask unavoidable delays due to communication. Microbenchmarks, together with over a dozen complete applications running on the 32-node prototype, help analyze the behavior of the system. Analysis shows that integrating message passing with sha...
An Integrated Compile-Time/Run-Time Software Distributed Shared Memory System
In Proceedings of the 7th Symposium on Architectural Support for Programming Languages and Operating Systems, 1996
Cited by 76 (10 self)
On a distributed memory machine, hand-coded message passing leads to the most efficient execution, but it is difficult to use. Parallelizing compilers can approach the performance of hand-coded message passing by translating data-parallel programs into message passing programs, but efficient execution is limited to those programs for which precise analysis can be carried out. Shared memory is easier to program than message passing and its domain is not constrained by the limitations of parallelizing compilers, but it lags in performance. Our goal is to close that performance gap while retaining the benefits of shared memory. In other words, our goal is (1) to make shared memory as efficient as message passing, whether hand-coded or compiler-generated, (2) to retain its ease of programming, and (3) to retain the broader class of applications it supports. To this end we have designed and implemented an integrated compile-time and run-time software DSM system. The programming model remain...
Dynamic IPC/Clock Rate Optimization
Proceedings of the 25th International Symposium on Computer Architecture, 1998
Cited by 74 (17 self)
Current microprocessor designs set the functionality and clock rate of the chip at design time based on the configuration that achieves the best overall performance over a range of target applications. The result may be poor performance when running applications whose requirements are not well-matched to the particular hardware organization chosen. We present a new approach called Complexity-Adaptive Processors (CAPs) in which the IPC/clock rate tradeoff can be altered at runtime to dynamically match the changing requirements of the instruction stream. By exploiting repeater methodologies used increasingly in deep sub-micron designs, CAPs achieve this flexibility with potentially no cycle time impact compared to a fixed architecture. Our preliminary results in applying this approach to on-chip caches and instruction queues indicate that CAPs have the potential to significantly outperform conventional approaches on workloads containing both general-purpose and scientific applications.
Impulse: Building a Smarter Memory Controller
In Proceedings of the Fifth Annual Symposium on High Performance Computer Architecture, 1999
Cited by 74 (17 self)
Impulse is a new memory system architecture that adds two important features to a traditional memory controller. First, Impulse supports application-specific optimizations through configurable physical address remapping. By remapping physical addresses, applications control how their data is accessed and cached, improving their cache and bus utilization. Second, Impulse supports prefetching at the memory controller, which can hide much of the latency of DRAM accesses. In this paper
High-Performance Parallel Programming in Java: Exploiting Native Libraries
1998
Cited by 67 (3 self)
With most of today's fast scientific software written in Fortran and C, Java has a lot of catching up to do. In this paper we discuss how new Java programs can capitalize on high-performance libraries for other languages. With the help of a tool we have automatically created Java bindings for several standard libraries: MPI, BLAS, BLACS, PBLAS, ScaLAPACK. Performance results are presented for Java versions of two benchmarks from the NPB and PARKBENCH suites on an IBM SP2 distributed memory machine using JDK and IBM's high-performance Java compiler. The results confirm that fast parallel computing in Java is indeed possible.
The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors
In Proceedings of the Sixth International Conference on Architectural Support for Programming Languages and Operating Systems, 1994
Cited by 43 (8 self)
Integrating support for block data transfer has become an important emphasis in recent cache-coherent shared address space multiprocessors. This paper examines the potential performance benefits of adding this support. A set of ambitious hardware mechanisms is used to study performance gains in five important scientific computations that appear to be good candidates for using block transfer. Our conclusion is that the benefits of block transfer are not substantial for hardware cache-coherent multiprocessors. The main reasons for this are (i) the relatively modest fraction of time applications spend in communication amenable to block transfer, (ii) the difficulty of finding enough independent computation to overlap with the communication latency that remains after block transfer, and (iii) long cache lines often capture many of the benefits of block transfer in efficient cache-coherent machines. In the cases where block transfer improves performance, prefetching can often provide comparable, if not superior, performance benefits. We also examine the impact of varying important communication parameters and processor speed on the effectiveness of block transfer, and comment on useful features that a block transfer facility should support for real applications.
Efficiently Adapting to Sharing Patterns in Software DSMs
In Proc. of the 4th Intl. Symp. on High Performance Computer Architecture, Las Vegas, NV, 1998
Cited by 37 (8 self)
In this paper we introduce a page-based Lazy Release Consistency protocol called ADSM that constantly and efficiently adapts to the applications' sharing patterns. Adaptation in ADSM is based on our dynamic categorization of the type of sharing experienced by each page. Pages can be categorized as falsely-shared, migratory, or producer/consumer(s). Migratory and producer/consumer(s) pages are managed in single-writer mode, while falsely-shared data are managed in multiple-writer mode. Coherence is kept with invalidations for most types of the shared data, but updates are used for lock-protected data in migratory state and barrier-protected data in producer/consumer(s) state. We performed experiments with 6 parallel applications on an 8-node SP2 system, comparing our protocol against standard TreadMarks and a version of TreadMarks that also adapts to sharing patterns. Our results show that ADSM consistently outperforms its competitors; our protocol can improve the TreadMarks speedups b...
Towards Portable Message Passing in Java: Binding MPI
In Recent Advances in PVM and MPI, number 1332 in Lecture Notes in Computer Science, 1997
Cited by 34 (9 self)
In this paper we present a way of successfully tackling the difficulties of binding MPI to Java with a view to ensuring portability. We have created a tool for automatically binding existing native C libraries to Java, and have applied the Java-to-C Interface generating tool (JCI) to bind MPI to Java. The approach of automatic binding by JCI ensures both portability across different platforms and full compatibility with the MPI specification. To evaluate the resulting combination we have run a Java version of the NAS parallel IS benchmark on a distributed-memory IBM SP2 machine.
1 Introduction
It is generally accepted that computers based on the emerging hybrid shared/distributed-memory parallel architectures will become the fastest and most cost-effective supercomputers over the next decade. This, however, makes the search for the most appropriate programming model even more important than it has been so far. Users need a flexible yet comprehensive interface which covers both th...
The Impulse Memory Controller
IEEE Transactions on Computers, 2001
Cited by 33 (6 self)
Impulse is a memory system architecture that adds an optional level of address indirection at the memory controller. Applications can use this level of indirection to remap their data structures in memory. As a result, they can control how their data is accessed and cached, which can improve cache and bus utilization. The Impulse design does not require any modification to processor, cache, or bus designs, since all the functionality resides at the memory controller. As a result, Impulse can be adopted in conventional systems without major system changes. We describe the design of the Impulse architecture and show how an Impulse memory system can be used in a variety of ways to improve the performance of memory-bound applications. Impulse can be used to dynamically create superpages cheaply, to dynamically recolor physical pages, to perform strided fetches, and to perform gathers and scatters through indirection vectors. Our performance results demonstrate the effectiveness of these optimizations in a variety of scenarios. Using Impulse can speed up a range of applications from 20% to over a factor of 5. Alternatively, Impulse can be used by the OS for dynamic superpage creation; the best policy for creating superpages using Impulse outperforms previously known superpage creation policies.
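The strided fetches and gathers through indirection vectors mentioned in this abstract can be pictured with a small software analogy. The sketch below is not the Impulse hardware interface or its API; it only models which elements a remapped, densely packed region would expose to the cache. The function names `strided_view` and `gather_view` and the 4x4 example matrix are hypothetical, chosen for illustration.

```python
# Software analogy only: models the data a remapped dense region would contain.
# All names here are hypothetical and do not come from the Impulse papers.

def strided_view(memory, base, stride, count):
    """Return `count` elements starting at `base`, stepping by `stride`.

    A controller-side strided remapping would present these elements
    contiguously, so the cache never loads the unused elements in between.
    """
    return [memory[base + i * stride] for i in range(count)]


def gather_view(memory, indirection):
    """Return the elements of `memory` selected by an indirection vector."""
    return [memory[index] for index in indirection]


# A 4x4 matrix stored in row-major order as a flat list of 16 values.
matrix = list(range(16))

# The main diagonal is a stride-5 walk; column 1 is a stride-4 walk from offset 1.
diagonal = strided_view(matrix, base=0, stride=5, count=4)
column_one = strided_view(matrix, base=1, stride=4, count=4)

# A gather picks arbitrary elements through an index list.
picked = gather_view(matrix, [3, 0, 12])
```

In this picture the application reads `diagonal` as four adjacent values instead of touching four separate cache lines of the original matrix, which is the kind of cache and bus utilization gain the abstract describes.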
An Approach to Scalability Study of Shared Memory Parallel Systems
1994
Cited by 30 (16 self)
The overheads in a parallel system that limit its scalability need to be identified and separated in order to enable parallel algorithm design and the development of parallel machines. Such overheads may be broadly classified into two components. The first one is intrinsic to the algorithm and arises due to factors such as the work-imbalance and the serial fraction. The second one is due to the interaction between the algorithm and the architecture and arises due to latency and contention in the network. A top-down approach to scalability study of shared memory parallel systems is proposed in this research. We define the notion of overhead functions associated with the different algorithmic and architectural characteristics to quantify the scalability of parallel systems; we isolate the algorithmic overhead and the overheads due to network latency and contention from the overall execution time of an application; we design and implement an execution-driven simulation platform that incorporates these methods for quantifying the overhead functions; and we use this simulator to study the scalability characteristics of five applications on shared memory platforms with different communication topologies.
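The two-way split of overheads described in this abstract can be illustrated with a toy calculation. The sketch below is an assumption-laden example, not the overhead functions defined in the paper: it uses Amdahl's law as a stand-in for the algorithmic component (serial fraction only, ignoring work imbalance) and treats whatever remains of a measured execution time as the interaction component. The function names and the sample numbers are hypothetical.

```python
# Toy decomposition under stated assumptions; not the paper's overhead functions.

def amdahl_time(t_serial, serial_fraction, processors):
    """Execution time on an idealized machine with zero communication cost.

    Amdahl's law: the serial fraction runs on one processor and the rest
    is divided evenly among `processors`.
    """
    parallel_fraction = 1.0 - serial_fraction
    return t_serial * (serial_fraction + parallel_fraction / processors)


def split_overheads(t_serial, serial_fraction, processors, t_measured):
    """Split the gap between measured time and perfect speedup into two parts.

    Returns (algorithmic, interaction):
      algorithmic  - time lost to the serial fraction alone
      interaction  - remaining time, attributed here to latency and contention
    """
    t_ideal = t_serial / processors
    t_algorithm = amdahl_time(t_serial, serial_fraction, processors)
    algorithmic = t_algorithm - t_ideal
    interaction = t_measured - t_algorithm
    return algorithmic, interaction


# Hypothetical numbers: 100 s sequential run, 10% serial, 10 processors,
# and a measured parallel time of 25 s.
algorithmic, interaction = split_overheads(100.0, 0.1, 10, 25.0)
```

A simulator such as the one the abstract describes would measure these components directly rather than infer them from a closed-form model; the calculation above only shows how the two categories add up to the total slowdown.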

