Results 1 - 10 of 19
Limits on Interconnection Network Performance
- IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
- In Proceedings of Workshop on Scalable Shared Memory Multiprocessors, 1991
Cited by 138 (22 self)
The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. A new scalable cache-coherence scheme called LimitLESS directories allows the use of caches for reducing communication latency and network bandwidth requirements. Alewife also employs run-time and compile-time methods for partitioning and placement of data and processes to enhance communication locality. While the above methods attempt to minimize communication latency, communication with distant processors cannot be completely avoided. Alewife's processor, Sparcle, is designed to tolerate these latencies by rapidly switching between threads of computation. This paper describe...
Exploring interconnections in multi-core architectures
- 2005
Cited by 73 (4 self)
This paper examines the area, power, performance, and design issues for the on-chip interconnects on a chip multiprocessor, attempting to present a comprehensive view of a class of interconnect architectures. It shows that the design choices for the interconnect have a significant effect on the rest of the chip, potentially consuming a significant fraction of the real estate and power budget. This research shows that designs that treat interconnect as an entity that can be independently architected and optimized would not arrive at the best multicore design. Several examples are presented showing the need for careful co-design. For instance, increasing interconnect bandwidth requires area that then constrains the number of cores or cache sizes, and does not necessarily increase performance. Also, shared level-2 caches become significantly less attractive when the overhead of the resulting crossbar is accounted for. A hierarchical bus structure is examined which negates some of the performance costs of the assumed baseline architecture.
A Timestamp-based Cache Coherence Scheme
- 1989
Cited by 29 (1 self)
In this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. In Section 2, we give the notation used throughout the paper. Section 3 reviews previous software-assisted methods for enforcing cache coherence. In Section 4, a complete description of our approach is given along with a correctness proof. Section 5 gives a qualitative comparison of our scheme and the directory-based approaches. Section 6 provides some concluding remarks. Definitions In conventional programs, there are four kinds of data dependences: flow-dependence, anti-dependence, output-dependence and input-dependence [14]. Let r and r
Exploiting Operating System Support for Dynamic Page Placement on a NUMA Shared Memory Multiprocessor
- ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 1991
Cited by 19 (5 self)
Shared memory multiprocessors are attractive because they are programmed in a manner similar to uniprocessors. The UMA class of shared memory multiprocessors is the most attractive, from the programmer's point of view, since the programmer need not be concerned with the placement of code and data in the physical memory hierarchy. Scalable shared memory multiprocessors, on the other hand, tend to present at least some degree of non-uniformity of memory access to the programmer, making the NUMA class an important one to consider. In this paper, we investigate the role that DUnX, an operating system supporting dynamic page placement on a BBN GP1000, might play in simplifying the memory model presented to the applications programmer. We consider a case study of psolu, a real scientific application originally targeted for a NUMA architecture. We find that dynamic page placement can dramatically improve the performance of a simpler implementation of psolu targeted for a UMA memory architec...
Design and Analysis of a Scalable Cache Coherence Scheme based on Clocks and Timestamps
- 1992
Cited by 15 (0 self)
In this paper, we restrict ourselves to a study of caching of shared variables. The presence of multiple private caches introduces the well-known cache coherence problem [7]. Hardware-based protocols to solve the cache coherence problem are well understood in a shared-bus environment (e.g., [17, 22, 32, 37]). However, these solutions cannot be extended to the dance-hall multiprocessors since they make use of the instantaneous broadcast and "snoopy" mechanisms provided by the shared bus. Software-assisted [10, 25, 27, 33, 38, 40] and directory-based [1, 4, 7, 36, 41] schemes are usually advocated in such an environment. In this paper, we propose a software-assisted cache coherence scheme which overcomes some of the inefficiencies of previous approaches by using a combination of a compile-time marking of references and a hardware-based local incoherence detection scheme. We also give a performance evaluation of our proposed scheme. In Section 2, we give the notation used throughout the paper. Section 3 reviews previous software-assisted approaches to enforcing cache coherence. In Section 4, a complete description of our approach is given. A correctness proof of our proposed scheme is given elsewhere [29] and is omitted here. Section 5 gives a quantitative comparison of our scheme with previous approaches. Section 6 provides some concluding remarks. 2 Definitions Programs written for shared-memory multiprocessors may use explicit parallel constructs or may be conventional sequential programs transformed into equivalent parallel ones by a restructuring compiler or a preprocessor like Parafrase [24, 39], PFC [3] or PTRAN [2]. The parallelism is constrained by data dependences: flow-dependence, anti-dependence, and
Rigel: An architecture and scalable programming interface for a 1000-core accelerator
- In ISCA ’09
Cited by 12 (4 self)
This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically-organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel’s low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications. We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45nm, which is comparable to high-end GPUs scaled to 45nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load-balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
Special Issue on Group Communication Systems
- In Communications of the ACM, 1996
Cited by 7 (4 self)
The high latency of memory operations is a problem in both sequential and parallel computing. Multithreading is a technique that can be used to eliminate the delays caused by this high latency. It works by letting a processor execute other processes (threads) while one process is waiting for the completion of a memory operation. In this paper we investigate the implementation of multithreading at the processor level. As a result we outline and evaluate a MultiThreaded VLIW processor Architecture with functional unit Chaining (MTAC), which is specially designed for PRAM-style parallelism. According to our experiments, MTAC offers remarkably better performance than a basic pipelined RISC architecture, and chaining improves the exploitation of instruction-level parallelism to a level where the achieved speedup corresponds to the number of functional units in a processor.
An Effective Synchronization Network for Hot-spot Accesses
- 1992
Cited by 7 (0 self)
this paper was presented at the 1991 International Parallel Processing Symposium, Anaheim CA, under the title "An Effective Synchronization Network for Large Multiprocessor Systems". ... Ultracomputer project [GGKM83], for combining fetch-and-op instructions. Some tree-structured hardware has also been proposed for combining hot-spot accesses. Special hardware of this type tends to be too rigid and difficult to use in an environment with process migration or multiprogramming, or when only some of the processors are involved in the synchronization. Lipovski and Vaughan's fetch-and-op tree [LiVa88], for example, is only capable of combining simultaneous accesses to a single hot-spot. It is more suitable for SIMD machines. Other hardware schemes have been proposed for barrier synchronization, such as [BePo90] and [HwSh91], but they are not flexible enough to handle traffic to memory hot-spots. Feedback has been proposed by Scott and Sohi [ScSo89] for avoiding tree saturation, but it does not improve the latency of hot-spot accesses and is not a substitute for combining. Similarly, the intelligent allocation of hardware switch buffers (for example, [Tzen91]) relieves congestion, but does not address the problem of high-latency hot-spot accesses. Some software techniques have been proposed for removing hot-spots (for example, [YeTL87] and [Broo86]). However, software schemes incur a fair amount of overhead. A basic OS operation often requires several hot-spot accesses, and software techniques may not provide the necessary speed. Also, certain compiler optimization techniques (see, for example, cycle shrinking in [Poly88]) are only effective if fast synchronization primitives are available. [MeSc91] introduced efficient algorithms for synchronization, but these are not appli...
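As a rough illustration of what "combining" fetch-and-op requests means in the excerpt above, the following Python sketch merges several fetch-and-add requests aimed at one hot-spot address into a single memory access and then hands each requester the value it would have seen under some serial order. This is a generic textbook-style model, not the synchronization network proposed in the paper; the function names `replies_for` and `combined_fetch_and_add` and the dictionary-based memory are assumptions made for the example.

```python
def replies_for(increments, base):
    """Distribute replies down a (hypothetical) binary combining tree.

    `base` is the value the leftmost request in this subtree observes.
    Each switch is assumed to remember the sum of its left subtree so it
    can offset the replies sent to its right subtree.
    """
    if len(increments) == 1:
        return [base]
    mid = len(increments) // 2
    left, right = increments[:mid], increments[mid:]
    return replies_for(left, base) + replies_for(right, base + sum(left))


def combined_fetch_and_add(memory, addr, increments):
    """Serve every fetch-and-add on `addr` with one access to memory."""
    if not increments:
        return []
    old = memory[addr]
    # The hot spot is touched once, with the combined increment.
    memory[addr] = old + sum(increments)
    return replies_for(increments, old)
```

For instance, with `memory = {0: 10}` and increments `[1, 2, 3, 4]`, the four requesters receive `[10, 11, 13, 16]` and the memory cell ends at 20, which matches applying the four operations one after another.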
On Shortest Path Routing in Single Stage Shuffle-Exchange Networks
- In Seventh Annual ACM Symposium on Parallel Algorithms and Architectures, 1995
Cited by 3 (2 self)
In this paper, we study routing in shuffle-exchange networks. Shuffle-exchange networks can have two different structures: multistage and single stage. Routing in multistage networks with $K \times K$ crossbar switches needs $\lceil \log_K N \rceil$ stage traversals for the connectivity between N inputs and N outputs. In single stage networks, fewer than $\lceil \log_K N \rceil$ traversals may be required depending on source and destination. We establish a theorem for routing from an input terminal to an output terminal at any stage in multistage networks. In the theorem, system size is limited only as a multiple of a crossbar switch size. This condition allows more flexible increments in system size. Based on the theorem, we derive an algorithm that generates routing tags for shortest path routing in single stage networks. We study the impact of shortest path routing on average internode distance, and by using trace-driven simulation, we evaluate its effect on shared memory systems. Our results show that the shortest...
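To make the idea of a shortest-path routing tag concrete, here is a minimal Python sketch for the simplest case, where the network size is an exact power of the switch size (N = K^n). It assumes that one pass through the single stage shifts the node address left by one base-K digit (the shuffle) and lets the switch choose the new low digit (the exchange). The paper's algorithm handles the more general case where N is only a multiple of K; that generalization is not reproduced here, and the function names below are hypothetical.

```python
def to_digits(x, k, n):
    """Return the n base-k digits of x, most significant first."""
    digits = []
    for _ in range(n):
        digits.append(x % k)
        x //= k
    return digits[::-1]


def shortest_routing_tag(src, dst, k, n):
    """Digits to apply, one per pass, to reach dst from src.

    Assumes N = k**n. The tag is shortened by the longest overlap
    between the trailing digits of src and the leading digits of dst.
    """
    s = to_digits(src, k, n)
    d = to_digits(dst, k, n)
    for overlap in range(n, -1, -1):
        if s[n - overlap:] == d[:overlap]:
            return d[overlap:]


def route(src, tag, k, n):
    """Follow a routing tag: shuffle (shift left), then set the low digit."""
    size = k ** n
    node = src
    for digit in tag:
        node = (node * k) % size + digit
    return node
```

With K = 2 and n = 3, routing from node 3 (011) to node 6 (110) needs the one-digit tag `[0]` rather than the full three passes, because the last two digits of the source already match the first two digits of the destination.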

