Results 11-20 of 38
ComPaSS: Efficient Communication Services for Scalable Architectures
 in Proceedings of Supercomputing'92
, 1992
Cited by 19 (16 self)
In massively parallel computers (MPCs), efficient communication among processors is critical to performance. This paper describes the initial implementation of the ComPaSS communication library to support scalable software development in MPCs. ComPaSS provides high-level global communication operations for both data manipulation and process control, many of which are based upon a small set of low-level communication primitives. The ComPaSS library is unique in that these low-level operations are provably optimal for a class of architectures representative of many commercial scalable systems, in particular, those using wormhole routing and n-dimensional mesh network topologies. This paper concentrates on the multicast component of the ComPaSS library, which is useful in several data parallel operations. The design of the multicast primitive is described, and an example of its use in a data parallel application is given. Improvements in performance resulting from use of the library on a ...
The Effects of Network Contention on Processor Allocation Strategies
 In Proceedings of the 10th International Parallel Processing Symposium
, 1996
Cited by 17 (0 self)
Various processor allocation strategies have been proposed for scalable parallel computers (SPCs). These strategies try to maximize overall system utilization while avoiding network contention among different processor partitions. This paper provides an intensive simulation study investigating whether contention-free processor allocation strategies are indeed important. Our simulation considers both mesh and MIN-based wormhole parallel computers, the communication characteristics of individual applications, and the impact of communication software latency. We show that for systems with high software latency, there is no need for contention-free processor allocation policies. However, if the software latency is very small or the messages are very long, contention-free allocation policies should be developed. Some suggestions are also made regarding how processors should be allocated in an environment with different application characteristics. This work...
Tailoring Router Architectures to Performance Requirements in Cut-Through Networks
, 1999
Cited by 12 (0 self)
Message-passing parallel machines have emerged as a cost-effective platform for exploiting concurrency in a variety of applications. These multicomputer systems employ a wide range of policies for routing, switching, arbitration, queueing, and flow control, implemented in the router hardware that connects an individual processing node to the interconnection fabric and manages traffic flowing through the node en route to other destinations. To address the requirements of emerging applications, we develop new techniques for designing and evaluating new router architectures that tailor network policies to application characteristics. These results facilitate the development of effective support for communication in real-time systems and local area networks, as well as more traditional multicomputer domains like high-speed scientific computing. Most modern routers employ cut-through switching schemes, such as virtual cut-through and wormhole switching, that permit an arriving packet to proceed directly to an idle outgoing link. We develop analytical models for evaluating cut-through routing algorithms with different degrees of adaptivity. The analytical results permit an efficient evaluation of large networks, while detailed comparisons with simulation results characterize the subtle effects of the simplifying assumptions in the analysis; in particular, cut-through networks introduce unique dependencies between adjacent nodes. Additional simulation experiments show that the network topologies, routing algorithms, and traffic patterns in modern multicomputers exacerbate these effects. Based on these results, we present a routing algorithm that capitalizes on internode dependencies to improve network performance.
The Impact of Packetization in Wormhole-Routed Networks
, 1993
Cited by 12 (4 self)
Packetization is used in a variety of commercial multicomputers because of its potential performance advantages: higher throughput and a better distribution of message latencies. However, packetization has two significant drawbacks: 1) fragmentation and reassembly overhead and 2) increased traffic volume for routing and sequencing information. In this paper, we examine the performance benefits of packetization in existing dimension-order routed networks and in likely future router designs including adaptive routing and virtual lanes. Our studies show that packetization has a mixed effect on performance in dimension-order routers. Packetizing uniform-sized traffic reduces network throughput dramatically. However, if the traffic is a bimodal distribution of sizes, packetization reduces the variance of latencies for short messages and increases the network's overall throughput. On the other hand, packetization has no significant impact on the performance of advanced networks with adaptive routing and virtual lanes. Advanced routers without packetization give nearly identical performance to the corresponding packetizing networks under uniform-sized or bimodal traffic. Packetization may be unnecessary in such networks.
An Evaluation of Planar-Adaptive Routing (PAR)
, 1992
Cited by 12 (5 self)
Network performance can be improved by allowing adaptive routing, but doing so introduces new possibilities of deadlock which can overwhelm the flexibility advantages. Planar-adaptive routing (PAR) resolves this tension by limiting adaptive routing to a series of two-dimensional planes, reducing hardware requirements for deadlock prevention. We explore the performance of planar-adaptive routers for two-, three-, and four-dimensional networks. Under nonuniform traffic loads, the planar-adaptive router significantly outperforms the dimension-order router, while giving comparable performance under uniform loads. With equal resources, the planar-adaptive router provides superior performance to fully adaptive routers because it requires fewer resources for deadlock prevention, freeing resources to increase the number of virtual lanes.
On the Benefit of Supporting Virtual Channels in Wormhole Routers
 In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures
, 1996
Cited by 10 (4 self)
This paper analyzes the impact of virtual channels on the performance of wormhole routing algorithms. We show that in any network in which each physical channel, i.e., communication link, can support up to $B$ virtual channels, it is possible to route any set of messages with $L$ flits each, whose paths have congestion $C$ and dilation $D$, in $(L + D)\,C\,(D \log D)^{1/B}\, 2^{O(\log^{*}(C/D))} / B$ flit steps, where a flit step is the time taken to transmit a single flit across a link. We also prove a nearly matching lower bound, i.e., for any values of $C$, $D$, $B$, and $L$, where $C, D \ge B + 1$ and $L = (1 + \Omega(1))D$, we show how to construct a network and a set of $L$-flit messages whose paths have congestion $C$ and dilation $D$ that require $\Omega(LCD^{1/B}/B)$ flit steps to route. These upper and lower bounds imply that increasing the buffering capacity and the bandwidth of each physical channel by a factor of $B$ can speed up a wormhole routing algorithm by a superlinear factor, i.e., a factor signi...
Cyclic-Cubes: A New Family of Interconnection Networks of Even Fixed-Degrees
 IEEE Transactions on Parallel and Distributed Systems
, 1998
Cited by 9 (0 self)
We introduce a new family of interconnection networks that are Cayley graphs with fixed degrees of any even number greater than or equal to 4. We call the proposed graphs cyclic-cubes because contracting some cycles in such a graph results in a generalized hypercube. These Cayley graphs have optimal fault tolerance and logarithmic diameters. For a comparable number of nodes, a cyclic-cube can have a diameter smaller than previously known fixed-degree networks. The proposed graphs can adopt an optimum routing algorithm known for one of its subfamilies of Cayley graphs. We also show that a graph in the new family has a Hamiltonian cycle, and hence there is an embedding of a ring. Embeddings of meshes and hypercubes are also discussed. Keywords: Cayley graphs, generalized hypercube, fixed degree, interconnection network, fault tolerance, diameter, Hamiltonian cycle, graph embedding. 1 Introduction Desirable properties of an interconnection network include low degree, low diameter, symmet...
Adaptive Multicast Wormhole Routing in 2D Mesh Multicomputers
 in Proc. Parallel Architectures Languages Europe 93
, 1993
Cited by 7 (0 self)
We study the issues of adaptive multicast wormhole routing in 2D mesh multicomputers. Three adaptive multicast wormhole routing strategies are proposed and studied, which include minimal partial adaptive, minimal fully adaptive and nonminimal multicast routing methods. All the algorithms are shown to be deadlock-free. These are the first deadlock-free adaptive multicast wormhole routing algorithms ever proposed. A simulation study has been conducted that compares the performance of these multicast algorithms. The results show that the minimal fully adaptive routing method creates the least traffic; however, double vertical channels are required in order to avoid deadlock. The nonminimal routing algorithm exhibits the best adaptivity, although it creates more network traffic than the other methods. 1 Introduction Efficient communication among nodes is critical to the performance of massively parallel computers. A multicast communication service is one in which the same message is del...
Evaluation of a Hybrid Deterministic/Adaptive Router and Its Implementations
 In Parallel Computer Routing and Communication
, 1997
Cited by 3 (1 self)
A novel routing scheme is proposed for virtual cut-through routing on k-ary n-cube networks. This scheme attempts to combine the low routing delay of deterministic routing with the flexibility and low queuing delays of adaptive routing. In this hybrid routing scheme a message is routed as soon as possible along a minimal path to its destination even though the routing choice may not be optimal. Results show that the disadvantages of making a nonoptimal routing decision are offset by its speed. Two pipelined implementations of this hybrid routing mechanism are evaluated and compared to pure deterministic and adaptive implementations. The experimental evaluations show that both hybrid implementations do indeed achieve their objectives under various types of traffic patterns.
Time-Step Optimal Broadcasting in 3D Meshes with Minimum Total Communication Distance
 Dept. Computer Science, Univ. Texas at Austin
, 2000
Cited by 3 (0 self)
In this paper we propose a new minimum total communication distance (TCD) algorithm and an optimal TCD algorithm for broadcast in a 3-dimensional mesh (3D mesh). The former generates a minimum TCD from a given source node, and the latter guarantees a minimum TCD among all the possible source nodes. These algorithms are based on a divide-and-conquer approach where a 3D mesh is partitioned into eight submeshes of equal size. The source node sends the broadcast message to a special node called an eye in each submesh. The above procedure is then recursively applied in each submesh. These algorithms can be generalized to a d-dimensional mesh or torus. In addition, the proposed approach can potentially be used to solve optimization problems in other collective communication operations. Keywords: Broadcast, communication distance, divide-and-conquer, meshes, optimization problems, wormhole routing. 1 Introduction In a multicomputer system, a collection of processors (also called nodes)...
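The divide-and-conquer structure described in this abstract can be illustrated with a minimal sketch. The abstract does not say how an eye is chosen, so the code below assumes, purely for illustration, that the eye of a submesh is the node in it nearest to the current sender. The names `broadcast` and `manhattan` are hypothetical, and the sketch does not model time steps, wormhole routing, or contention. It is not the paper's algorithm and makes no claim of minimum TCD.

```python
from itertools import product

def manhattan(a, b):
    """Hop distance between two mesh nodes given as coordinate tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))

def broadcast(source, lo, hi, sends):
    """Recursively broadcast from `source` inside the box lo..hi (inclusive).

    Each (sender, receiver) pair is appended to `sends`.
    """
    if lo == hi:
        return  # a single node has nobody left to forward to
    # Halve every dimension that is wider than one node (up to 8 sub-boxes in 3D).
    halves = []
    for l, h in zip(lo, hi):
        mid = (l + h) // 2
        halves.append([(l, mid), (mid + 1, h)] if l < h else [(l, h)])
    for box in product(*halves):
        sub_lo = tuple(b[0] for b in box)
        sub_hi = tuple(b[1] for b in box)
        # ASSUMPTION (not from the paper): the eye is the submesh node closest
        # to the sender, found by clamping the sender's coordinates to the box.
        eye = tuple(min(max(s, l), h) for s, l, h in zip(source, sub_lo, sub_hi))
        if eye != source:
            sends.append((source, eye))
        broadcast(eye, sub_lo, sub_hi, sends)

sends = []
broadcast((0, 0, 0), (0, 0, 0), (3, 3, 3), sends)  # 4 x 4 x 4 mesh
total_distance = sum(manhattan(s, r) for s, r in sends)
print(len(sends), total_distance)
```

In this sketch every node other than the source receives the message exactly once, so a 4 x 4 x 4 mesh produces 63 sends. The summed hop count printed alongside it is the quantity the abstract calls total communication distance, computed here for the assumed eye rule only.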