Results 1 - 10
of
122
Planar-Adaptive Routing: Low-cost Adaptive Networks for Multiprocessors
- In Proceedings of the International Symposium on Computer Architecture
, 1992
"... Network throughput can be increased by allowing multipath, adaptive routing. Adaptive routing allows more freedom in the paths taken by messages, spreading load over physical channels more evenly. The flexibility of adaptive routing introduces new possibilities of deadlock. Previous deadlock avoidan ..."
Abstract
-
Cited by 179 (13 self)
- Add to MetaCart
Network throughput can be increased by allowing multipath, adaptive routing. Adaptive routing allows more freedom in the paths taken by messages, spreading load over physical channels more evenly. The flexibility of adaptive routing introduces new possibilities of deadlock. Previous deadlock avoidance schemes in k-ary ncubes require an exponential number of virtual channels [17]. We describe a family of deadlock-free routing algorithms, called planar-adaptive routing algorithms which require only a constant number of virtual channels, independent of network size and dimension. Planar-adaptive routing algorithms reduce the complexity of deadlock prevention by reducing the number of choices at each routing step. In the fault-free case, planar-adaptive networks are guaranteed to be deadlock-free. In the presence of network faults, the planar-adaptive router can be extended with misrouting to produce a working network which remains provably deadlock free and is provably livelock free. In a...
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
- In Proceedings of Workshop on Scalable Shared Memory Multiprocessors
, 1991
"... The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory arch ..."
Abstract
-
Cited by 138 (22 self)
- Add to MetaCart
The Alewife multiprocessor project focuses on the architecture and design of a large-scale parallel machine. The machine uses a low-dimensional direct interconnection network to provide scalable communication bandwidth, while allowing the exploitation of locality. Despite its distributed-memory architecture, Alewife allows efficient shared-memory programming through a multilayered approach to locality management. A new scalable cache-coherence scheme called LimitLESS directories allows the use of caches for reducing communication latency and network bandwidth requirements. Alewife also employs run-time and compile-time methods for partitioning and placement of data and processes to enhance communication locality. While the above methods attempt to minimize communication latency, communication with distant processors cannot be completely avoided. Alewife's processor, Sparcle, is designed to tolerate these latencies by rapidly switching between threads of computation. This paper describe...
The J-Machine Multicomputer: An Architectural Evaluation
- In Proceedings of the 20th Annual International Symposium on Computer Architecture
, 1993
"... The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an integrated multicomputer component, the Message-Driven Processor (MDP), and 1 MByte of DRAM. The MDP provi ..."
Abstract
-
Cited by 132 (4 self)
- Add to MetaCart
The MIT J-Machine multicomputer has been constructed to study the role of a set of primitive mechanisms in providing efficient support for parallel computing. Each J-Machine node consists of an integrated multicomputer component, the Message-Driven Processor (MDP), and 1 MByte of DRAM. The MDP provides mechanisms to support efficient communication, synchronization, and naming. A 512 node J-Machine is operational and is due to be expanded to 1024 nodes in March 1993. In this paper we discuss the design of the J-Machine and evaluate the effectiveness of the mechanisms incorporated into the MDP. We measure the performance of the communication and synchronization mechanisms directly and investigate the behavior of four complete applications. 1 Introduction Over the past 40 years, sequential von Neumann processors have evolved a set of mechanisms appropriate for supporting most sequential programming models. It is clear, however, from efforts to build concurrent machines by connecting man...
A Cost and Speed Model for k-ary n-cube Wormhole Routers
- HIT INTERCONNECTS '93
, 1993
"... A great deal of research has been published on the performance of wormhole routers with advanced features such as adaptivity and virtual lanes. In most cases, the effectiveness of such novel routers is evaluated on the basis of the achieved network throughput (channel utilization), ignoring the impo ..."
Abstract
-
Cited by 125 (1 self)
- Add to MetaCart
A great deal of research has been published on the performance of wormhole routers with advanced features such as adaptivity and virtual lanes. In most cases, the effectiveness of such novel routers is evaluated on the basis of the achieved network throughput (channel utilization), ignoring the important effects of implementation complexity. In this paper we describe a parameterized cost model for router performance, characterized by two numbers: router delay and flow control time. Grounding the cost model in a 0.8 micron gate array technology, we use it to compare a number of proposed routing algorithms. Based on these design studies, several insights regarding the implementation complexity of adaptive routing are clear. First, header update and selection is expensive in adaptive routers, suggesting the absolute addressing should be reconsidered. Second, virtual channels are expensive in terms of latency and cycle time, so decisions to include them to support adaptivity or even virtual lanes should not be taken lightly. Third, requirements of larger crossbars and more complex arbitration cause some increase in the complexity of adaptive routers, but the rate of increase is small. Finally, the complexity of adaptive routers significantly increases their setup delay and flow control cycle times, implying that claims of performance advantages in channel utilization and low load latency must be carefully balanced against losses in achievable implementation speed.
A Necessary and Sufficient Condition for Deadlock-Free Routing in Cut-Through and Store-and-Forward Networks
, 1995
"... This paper develops the theoretical background for the design of deadlockfree adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, d ..."
Abstract
-
Cited by 111 (15 self)
- Add to MetaCart
This paper develops the theoretical background for the design of deadlockfree adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-fr...
The Cray T3E Network: Adaptive Routing in a High Performance 3D Torus
, 1996
"... This paper describes the interconnection network used in the Cray T3E multiprocessor. The network is a bidirectional 3D torus with fully adaptive routing, optimized virtual channel assignments, integrated barrier synchronization support and considerable fault tolerance. The routers are built with LS ..."
Abstract
-
Cited by 111 (4 self)
- Add to MetaCart
This paper describes the interconnection network used in the Cray T3E multiprocessor. The network is a bidirectional 3D torus with fully adaptive routing, optimized virtual channel assignments, integrated barrier synchronization support and considerable fault tolerance. The routers are built with LSI’s 500K ASIC technology with custom transmitters/ receivers driving low-voltage differential signals at 375 MHz, for a link data payload capacity of approximately 500 MB/s.
Access Normalization: Loop Restructuring for NUMA Computers
- ACM Transactions on Computer Systems
, 1993
"... : In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run wel ..."
Abstract
-
Cited by 68 (19 self)
- Add to MetaCart
: In scalable parallel machines, processors can make local memory accesses much faster than they can make remote memory accesses. In addition, when a number of remote accesses must be made, it is usually more efficient to use block transfers of data rather than to use many small messages. To run well on such machines, software must exploit these features. We believe it is too onerous for a programmer to do this by hand, so we have been exploring the use of restructuring compiler technology for this purpose. In this paper, we start with a language like HPF-FORTRAN with user-specified data distributionand develop a systematic loop transformation strategy called access normalization that restructures loop nests to exploit locality and block transfers. We demonstrate the power of our techniques using routines from the BLAS (Basic Linear Algebra Subprograms) library. An important feature of our approach is that we model loop transformations using invertible matrices and integer lattice theo...
A Comparison of Adaptive Wormhole Routing Algorithms
, 1993
"... . Improvement of message latency and network utilization in torus interconnection networks by increasing adaptivity in wormhole routing algorithms is studied. A recently proposed partially adaptive algorithm and four new fully-adaptive routing algorithms are compared with the well-known e-cube algor ..."
Abstract
-
Cited by 57 (2 self)
- Add to MetaCart
. Improvement of message latency and network utilization in torus interconnection networks by increasing adaptivity in wormhole routing algorithms is studied. A recently proposed partially adaptive algorithm and four new fully-adaptive routing algorithms are compared with the well-known e-cube algorithm for uniform, hotspot, and local traffic patterns. Our simulations indicate that the partially adaptive northlast algorithm, which causes unbalanced traffic in the network, performs worse than the nonadaptive e-cube routing algorithm for all three traffic patterns. Another result of our study is that the performance does not necessarily improve with full-adaptivity. In particular, a commonly discussed fully-adaptive routing algorithm, which uses 2 n virtual channels per physical channel of a k-ary n-cube, performs worse than e-cube for uniform and hotspot traffic patterns. The other three fully-adaptive algorithms, which give priority to messages based on distances traveled, perform ...
A Comprehensive Analytical Model for Wormhole Routing in Multicomputer Systems
, 1994
"... An analytical model for obtaining performance measures in multicomputer networks which use wormhole routing is presented. Unlike previous wormhole routing models, the model introduced in this paper is accurate and quite simple. The model is validated through flit-level simulation experiments and is ..."
Abstract
-
Cited by 41 (1 self)
- Add to MetaCart
An analytical model for obtaining performance measures in multicomputer networks which use wormhole routing is presented. Unlike previous wormhole routing models, the model introduced in this paper is accurate and quite simple. The model is validated through flit-level simulation experiments and is sufficiently general to be extended for several networks, including k-ary n-cubes, and related routing paradigms, such as virtual cut-through. The value of this model is exhibited through its application to current networks to indicate cost-effective augmentations which result in significant performance improvements. NOTE: A List of Symbols is provided in Appendix A. 1 Introduction First generation distributed-memory multicomputers such as the Cosmic Cube [20], used "store-and-forward" packet-switching methods for routing messages among the multicomputer nodes. The present generation of multicomputers are able to reduce hardware communication overheads of first-generation machines by over...
The Effectiveness of Multiple Hardware Contexts
- In Sixth International Conference on Architectural Support for Programming Languages and Operating Systems
, 1994
"... Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache confl ..."
Abstract
-
Cited by 40 (1 self)
- Add to MetaCart
Multithreaded processors are used to tolerate long memory latencies. By executing threads loaded in multiple hardware contexts, an otherwise idle processor can keep busy, thus increasing its utilization. However, the larger size of a multi-thread working set can have a negative effect on cache conflict misses. In this paper we evaluate the two phenomena together, examining their combined effect on execution time. The usefulness of multiple hardware contexts depends on: program data locality, cache organization and degree of multiprocessing. Multiple hardware contexts are most effective on programs that have been optimized for data locality. For these programs, execution time dropped with increasing contexts, over widely varying architectures. With unoptimized applications, multiple contexts had limited value.The best performance was seen with only two contexts, and only on uniprocessors and small multiprocessors. The behavior of the unoptimized applications changed more noticeably with...

