Myrinet: A Gigabit-per-Second Local Area Network
- IEEE Micro
, 1995
Cited by 852 (0 self)
Abstract. Myrinet is a new type of local-area network (LAN) based on the technology used for packet communication and switching within "massively parallel processors" (MPPs). Think of Myrinet as an MPP message-passing network that can span campus dimensions, rather than as a wide-area telecommunications network that is operating in close quarters. The technical steps toward making Myrinet a reality included the development of (1) robust, 25m communication channels with flow control, packet framing, and error control; (2) self-initializing, low-latency, cut-through switches; (3) host interfaces that can map the network, select routes, and translate from network addresses to routes, as well as handle packet traffic; and (4) streamlined host software that allows direct communication between user processes and the network. Background. In order to understand how Myrinet differs from conventional LANs such as Ethernet and FDDI, it is helpful to start with Myrinet's genealogy. Myrinet is rooted in the results of two ARPA-sponsored research projects, the Caltech Mosaic, an experimental, fine-grain multicomputer [1], and the USC Information Sciences Institute (USC/ISI) ATOMIC LAN [2, 3], which was built using Mosaic components. Myricom, Inc., is a startup company founded by members of these two research projects. Multicomputer Message-Passing Networks. A multicomputer [4, 5] is an MPP architecture consisting of a collection of computing nodes, each with its own memory, connected by a message-passing network. The Caltech Mosaic was an experiment to "push the envelope" of multicomputer design and programming toward a system with up to tens of thousands of small, single-chip nodes rather than hundreds of circuit-board-size nodes. The fine-grain multicomputer places more extreme demands on the message-passing network due to the larger number of nodes and a greater interdependence between the computing processes on different nodes.
The message-passing-network technology developed for the Mosaic [6] achieved its goals so well that it was used in several other MPP systems, including the
APRIL: A Processor Architecture for Multiprocessing
- In Proceedings of the 17th Annual International Symposium on Computer Architecture
, 1990
Cited by 254 (23 self)
Processors in large-scale multiprocessors must be able to tolerate large communication latencies and synchronization delays. This paper describes the architecture of a rapid-context-switching processor called APRIL with support for fine-grain threads and synchronization. APRIL achieves high single-thread performance and supports virtual dynamic threads. A commercial RISC-based implementation of APRIL and a run-time software system that can switch contexts in about 10 cycles are described. Measurements taken for several parallel applications on an APRIL simulator show that the overhead for supporting parallel tasks based on futures is reduced by a factor of two over a corresponding implementation on the Encore Multimax. The scalability of a multiprocessor based on APRIL is explored using a performance model. We show that the SPARC-based implementation of APRIL can achieve close to 80% processor utilization with as few as three resident threads per processor in a large-scale cache-based machine with an average base network latency of 55 cycles.
A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks
, 1993
Cited by 176 (23 self)
Second-generation multicomputers use wormhole routing, allowing a very low channel set-up time and drastically reducing the dependency between network latency and internode distance. Deadlock-free routing strategies have been developed, allowing the implementation of fast hardware routers that reduce the communication bottleneck. Also, adaptive routing algorithms with deadlock-avoidance or deadlock-recovery techniques have been proposed for some topologies, being very effective and outperforming static strategies. This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for wormhole networks. Some basic definitions and two theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cycles in the channel dependency graph. Also, two design methodologies are proposed. The first one supplies algorithms with a high degree of freedom, without increasing the number of physical channels...
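As a point of reference for the channel dependency graph mentioned in this abstract, the following minimal Python sketch checks whether such a graph contains a cycle. An acyclic graph is the classic sufficient condition for deadlock freedom under deterministic wormhole routing; the abstract states that adaptive algorithms can remain deadlock-free even when cycles exist, so this check is only a baseline and not the condition the paper derives. The graph encoding (a dict from each channel to the channels a packet holding it may request next), the channel names, and the function name `has_cycle` are assumptions made for illustration.

```python
def has_cycle(dependencies):
    """Return True if the channel dependency graph contains a cycle.

    dependencies: dict mapping each channel to an iterable of channels that a
    packet holding it may request next (hypothetical encoding).
    """
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(channel):
        color[channel] = GRAY
        for nxt in dependencies.get(channel, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: nxt is already on the current path, so a cycle exists.
                return True
            if state == WHITE and visit(nxt):
                return True
        color[channel] = BLACK
        return False

    return any(color.get(c, WHITE) == WHITE and visit(c) for c in list(dependencies))


# Four channels of a unidirectional ring depend on each other cyclically.
ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
# Channels along a simple line have no cyclic dependency.
line = {"c0": ["c1"], "c1": ["c2"], "c2": []}

print(has_cycle(ring))
print(has_cycle(line))
```

A cyclic result here does not by itself imply that an adaptive routing algorithm can deadlock; it only shows that the simpler acyclicity argument does not apply.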
Limits on Interconnection Network Performance
- IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed-form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
Job Scheduling in Multiprogrammed Parallel Systems
, 1997
Cited by 145 (15 self)
Scheduling in the context of parallel systems is often thought of in terms of assigning tasks in a program to processors, so as to minimize the makespan. This formulation assumes that the processors are dedicated to the program in question. But when the parallel system is shared by a number of users, this is not necessarily the case. In the context of multiprogrammed parallel machines, scheduling refers to the execution of threads from competing programs. This is an operating system issue, involved with resource allocation, not a program development issue. Scheduling schemes for multiprogrammed parallel systems can be classified as one or two leveled. Single-level scheduling combines the allocation of processing power with the decision of which thread will use it. Two level scheduling decouples the two issues: first, processors are allocated to the job, and then the job's threads are scheduled using this pool of processors. The processors of a parallel system can be shared i...
Deadlock-Free Multicast Wormhole Routing in 2D Mesh Multicomputers
, 1992
Cited by 121 (22 self)
Multicast communication services, in which the same message is delivered from a source node to an arbitrary number of destination nodes, are being provided in new generation multicomputers. Broadcast is a special case of multicast in which a message is delivered to all nodes in the network. The nCUBE-2, a wormhole-routed hypercube multicomputer, provides hardware support for broadcast and a restricted form of multicast in which the destinations form a subcube. However, the broadcast routing algorithm adopted in the nCUBE-2 is not deadlock-free. In this paper, four multicast wormhole routing strategies for two-dimensional (2D) mesh multicomputers are proposed and studied. All of the algorithms are shown to be deadlock-free. These are the first deadlock-free multicast wormhole routing algorithms ever proposed. A simulation study has been conducted that compares the performance of these multicast algorithms under dynamic network traffic conditions in a 2D mesh. The results ind...
Performance Tradeoffs In Multithreaded Processors
- IEEE Transactions on Parallel and Distributed Systems
, 1991
Cited by 111 (5 self)
... utilization. By maintaining multiple process contexts in hardware and switching among them in a few cycles, multithreaded processors can overlap computation with memory accesses and reduce processor idle time. This paper presents an analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects. The model is validated through our own simulations and by comparison with previously published simulation results. Our results indicate that processors can substantially benefit from multithreading, even in systems with small caches. Large caches yield close to full processor utilization with as few as two to four contexts, while small caches may require up to four times as many contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
A Necessary and Sufficient Condition for Deadlock-Free Routing in Cut-Through and Store-and-Forward Networks
, 1995
Cited by 111 (15 self)
This paper develops the theoretical background for the design of deadlock-free adaptive routing algorithms for virtual cut-through and store-and-forward switching. This theory is valid for networks using either central buffers or edge buffers. Some basic definitions and three theorems are proposed, developing conditions to verify that an adaptive algorithm is deadlock-free, even when there are cyclic dependencies between routing resources. Moreover, we propose a necessary and sufficient condition for deadlock-free routing. Also, a design methodology is proposed. It supplies fully adaptive, minimal and non-minimal routing algorithms, guaranteeing that they are deadlock-free. The theory proposed in this paper extends the necessary and sufficient condition for wormhole switching previously proposed by us. The resulting routing algorithms are more flexible than the ones for wormhole switching. Also, the design methodology is much easier to apply because it automatically supplies deadlock-fr...
Graphical Development Tools for Network-Based Concurrent Supercomputing
- in Proceedings of Supercomputing 91
, 1991
Cited by 83 (8 self)
This paper describes an X-window-based software environment called HeNCE (Heterogeneous Network Computing Environment) designed to assist scientists in developing parallel programs that run on a network of computers. HeNCE is built on top of a software package called PVM, which supports process management and communication between a network of heterogeneous computers. HeNCE is based on a parallel programming paradigm where an application program can be described by a graph. Nodes of the graph represent subroutines and the arcs represent data dependencies. HeNCE is composed of integrated graphical tools for creating, compiling, executing, and analyzing HeNCE programs.
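The graph paradigm described in this abstract (nodes as subroutines, arcs as data dependencies) can be illustrated with a small Python sketch. It groups nodes into "waves": every node in a wave has all of its predecessors in earlier waves, so the nodes of one wave could in principle run in parallel. This is not HeNCE's actual interface or scheduling method; the function name `parallel_waves`, the node names, and the example graph are hypothetical.

```python
def parallel_waves(nodes, arcs):
    """Group nodes into waves whose data dependencies are already satisfied.

    nodes: list of subroutine names (hypothetical).
    arcs:  list of (producer, consumer) pairs, i.e. data dependencies.
    """
    preds = {n: set() for n in nodes}
    for src, dst in arcs:
        preds[dst].add(src)

    done = set()
    waves = []
    while len(done) < len(nodes):
        # A node is ready once every subroutine it depends on has finished.
        ready = sorted(n for n in nodes if n not in done and preds[n] <= done)
        if not ready:
            raise ValueError("dependency graph contains a cycle")
        waves.append(ready)
        done.update(ready)
    return waves


nodes = ["read", "fft", "filter", "merge"]
arcs = [("read", "fft"), ("read", "filter"), ("fft", "merge"), ("filter", "merge")]
waves = parallel_waves(nodes, arcs)
print(waves)  # [['read'], ['fft', 'filter'], ['merge']]
```

In this example `fft` and `filter` depend only on `read`, so they land in the same wave, while `merge` must wait for both.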
The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers
, 1989
Cited by 80 (9 self)
Nectar is a "network backplane" for use in heterogeneous multicomputers. The initial system consists of a star-shaped fiber-optic network with an aggregate bandwidth of 1.6 gigabits/second and a switching latency of 700 nanoseconds. The system can be scaled up by connecting hundreds of these networks together. The Nectar architecture provides a flexible way to handle heterogeneity and task-level parallelism. A wide variety of machines can be connected as Nectar nodes and the Nectar system software allows applications to communicate at a high level. Protocol processing is off-loaded to powerful communication processors so that nodes do not have to support a suite of network protocols. We have designed and built a prototype Nectar system that has been operational since November 1988. This paper presents the motivation and goals for Nectar and describes its hardware and software. The presentation emphasizes how the goals influenced the design decisions and led to the novel aspects of Necta...

