A Cost and Speed Model for k-ary n-cube Wormhole Routers
- HOT INTERCONNECTS '93
, 1993
Cited by 125 (1 self)
A great deal of research has been published on the performance of wormhole routers with advanced features such as adaptivity and virtual lanes. In most cases, the effectiveness of such novel routers is evaluated on the basis of the achieved network throughput (channel utilization), ignoring the important effects of implementation complexity. In this paper we describe a parameterized cost model for router performance, characterized by two numbers: router delay and flow control time. Grounding the cost model in a 0.8 micron gate array technology, we use it to compare a number of proposed routing algorithms. Based on these design studies, several insights regarding the implementation complexity of adaptive routing are clear. First, header update and selection is expensive in adaptive routers, suggesting that absolute addressing should be reconsidered. Second, virtual channels are expensive in terms of latency and cycle time, so decisions to include them to support adaptivity or even virtual lanes should not be taken lightly. Third, requirements of larger crossbars and more complex arbitration cause some increase in the complexity of adaptive routers, but the rate of increase is small. Finally, the complexity of adaptive routers significantly increases their setup delay and flow control cycle times, implying that claims of performance advantages in channel utilization and low load latency must be carefully balanced against losses in achievable implementation speed.
The DASH Prototype: Logic Overhead and Performance
- IEEE Transactions on Parallel and Distributed Systems
, 1993
Cited by 100 (2 self)
The fundamental premise behind the DASH project is that it is feasible to build large-scale shared-memory multiprocessors with hardware cache coherence. While paper studies and software simulators are useful for understanding many high-level design tradeoffs, prototypes are essential to ensure that no critical details are overlooked. A prototype provides convincing evidence of the feasibility of the design, allows one to accurately estimate both the hardware and the complexity cost of various features, and provides a platform for studying real workloads. A 48-processor prototype of the DASH multiprocessor is now operational. In this paper, we first examine the hardware overhead of directory-based cache coherence in the prototype. The data show that the overhead is only about 10-15%, which appears to be a small cost for the ease of programming offered by coherent caches and the potential for higher performance. We then discuss the performance of the system and show the speedups obtained by a variety of parallel applications running on the prototype. Using a sophisticated hardware performance monitor, we also characterize the effectiveness of coherent caches and the relationship between an application's reference behavior and its speedup. Finally, we present an evaluation of the optimizations incorporated in the DASH protocol in terms of their effectiveness on parallel applications and on atomic tests that stress the memory system.
Index Terms: Directory-based cache coherence, implementation cost, multiprocessor, parallel architecture, performance anal-
Software-Extended Coherent Shared Memory: Performance and Cost
Cited by 54 (5 self)
This paper evaluates the tradeoffs involved in the design of the software-extended memory system of Alewife, a multiprocessor architecture that implements coherent shared memory through a combination of hardware and software mechanisms. For each block of memory, Alewife implements between zero and five coherence directory pointers in hardware and allows software to handle requests when the pointers are exhausted. The software includes a flexible coherence interface that facilitates protocol software implementation. This interface is indispensable for conducting experiments and has proven important for implementing enhancements to the basic system. Simulations of a
An Integration of Network Communication with Workstation Architecture
- ACM COMPUTER COMMUNICATION REVIEW
, 1991
Cited by 32 (11 self)
A workstation may be thought of as a group of cooperatively connected subsystems. Point-to-point channels may be used to create a small-scale Gigabit LAN to which these subsystems are attached as nodes. The architectural focus of such a workstation shifts towards its internal LAN. An attractive attribute of this LAN is that its aggregate capacity scales linearly with the number of nodes attached to it. If the link layer of the internal LAN is made equivalent to the link layer of the external LAN, interior nodes become directly accessible externally. Except for latency, the distinction between a node being inside a workstation and being outside it need not be significant. This property is particularly attractive for distributed communication-intensive applications.
The Cost of Adaptivity and Virtual Lanes in a Wormhole Router
- Journal of VLSI Design
, 1995
Cited by 27 (4 self)
Many studies of advanced router features consider only their benefits, not their costs. In this paper, we examine the cost in router complexity of adaptivity and virtual lanes in a class of wormhole routers by examining a set of router designs. Increased router complexity affects achievable router latency and bandwidth. Our studies establish the cost of adaptivity and virtual lanes, allowing cost to be compared to performance benefit.
The Case for Chaotic Adaptive Routing
- IEEE Transactions on Computers
, 1994
Cited by 27 (0 self)
Chaotic routers are randomizing, non-minimal adaptive packet routers designed for use in the communication networks of parallel computers. Chaotic routing is reviewed along with other contemporary network routing approaches, including the state-of-the-art oblivious routers. Each routing approach is evaluated for its effectiveness as a multicomputer message router. The results indicate that the Chaos router is the most effective of known routing methods.
1 Introduction
In spite of the fact that network routing has been an active research area in recent years, leading to many diverse proposals, practical experience with routers is extremely limited. The routers used in most implemented parallel computers are all from a single class, known as oblivious routers. Most of the non-oblivious routers have appeared only in single instance machines such as the HEP, CM-2, and CM-5 computers, making it difficult to separate fundamental properties of the routers from artifacts of the specific insta...
Chaotic Routing - Design and Implementation of an Adaptive Multicomputer Network Router
, 1993
Cited by 26 (3 self)
Chaotic Routing -- Design and Implementation of an Adaptive Multicomputer Network Router, by Kevin Bolding. Chairperson of Supervisory Committee: Professor Lawrence Snyder, Department of Computer Science and Engineering.
A crucial component of a massively parallel multicomputer is the interconnection network which links all of the nodes of the computer together. This network provides the primary method of communication between the hundreds or thousands of processing nodes and is, thus, critical to the successful operation of the multicomputer. Current state-of-the-art interconnection networks use simple, oblivious routing techniques which achieve very good performance when loading is light, but do not perform well in the presence of non-uniform congestion or faults. Chaotic routing, a non-minimal adaptive routing technique, provides a mechanism which takes into account the presence of congestion and faults when choosing a path for a message and can, thus, achieve better performance. Chaot...
Design of a Router for Fault-Tolerant Networks
- in Proc. Parallel Computer Routing and Communication Workshop
, 1994
Cited by 9 (3 self)
As interconnection networks grow larger and larger, the need for reliable message delivery in the presence of faults grows as well. Unfortunately, most network routing schemes currently in use do not provide graceful tolerance of even the most common faults. Because routing messages around failed components requires non-minimal routing, it makes sense to examine routers which, by design, allow packets to take non-minimal routes. Such routers provide a basic level of fault-tolerance by allowing messages to be routed around faults, without requiring a priori knowledge of their locations. However, the mechanisms can be slow and clumsy at times. We augment Chaotic routing, a non-minimal adaptive routing scheme, with a limited amount of hardware to support fault detection, identification, and reconfiguration so that the network can automatically reconfigure itself when faults occur. We present a high-level design of these mechanisms, driven by the goal of achieving reasonable ...
Performance implications of multiple pointer sizes
- IN: USENIX WINTER
, 1995
Cited by 9 (0 self)
... This paper analyzes several programs and programming techniques to understand the performance implications of different pointer sizes. Many (but not all) programs show small but definite performance consequences, primarily due to cache and paging effects.
High Performance Inter-Chip Signalling
, 1998
Cited by 8 (0 self)
The achievable off-chip bandwidth of digital ICs is a crucial and often limiting factor in the performance of digital systems. In intra-system interfaces where both latency and bandwidth are important, source-synchronous parallel channels have been adopted as the most effective solution. This work investigates receiver and clocking circuit design techniques for increasing the signalling rate and robustness of such channels.

