The Turn Model for Adaptive Routing
In Proceedings of the International Symposium on Computer Architecture, 1992
Cited by 247 (6 self)
We present a model for designing wormhole routing algorithms that are deadlock free, livelock free, minimal or nonminimal, and maximally adaptive. A unique feature of this model is that it is not based on adding physical or virtual channels to network topologies (though it can be applied to networks with extra channels). Instead, the model is based on analyzing the directions in which packets can turn in a network and the cycles that the turns can form. Prohibiting just enough turns to break all of the cycles produces routing algorithms that are deadlock free, livelock free, minimal or nonminimal, and maximally adaptive for the network. In this paper, we focus on the two most common network topologies for wormhole routing, n-dimensional meshes and k-ary n-cubes, without extra channels. In an n-dimensional mesh, just a quarter of the turns must be prohibited to prevent deadlock. The remaining three quarters of the turns permit partial adaptiveness in routing. Partially adaptive routing ...
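The claim that prohibiting a quarter of the turns breaks every cycle can be checked mechanically on a small 2D mesh. The sketch below builds the channel dependency graph (which channel a packet may request after holding another) and tests it for cycles. The specific choice of prohibited turns, the two turns into the west direction, is an assumption made for illustration, since the abstract does not name a particular turn set; all function and variable names are hypothetical.

```python
from itertools import product

# Unit moves for the four directions of a 2D mesh.
MOVES = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
OPPOSITE = {"E": "W", "W": "E", "N": "S", "S": "N"}

# The eight 90-degree turns, written as (incoming direction, outgoing direction).
ALL_TURNS = {(a, b) for a, b in product(MOVES, MOVES) if a != b and OPPOSITE[a] != b}

# Assumed turn set for this sketch: forbid both turns into the west direction.
WEST_FIRST_PROHIBITED = {("N", "W"), ("S", "W")}


def channels(width, height):
    """Yield every unidirectional channel as (x, y, direction) leaving node (x, y)."""
    for x, y in product(range(width), range(height)):
        for d, (dx, dy) in MOVES.items():
            if 0 <= x + dx < width and 0 <= y + dy < height:
                yield (x, y, d)


def dependency_graph(width, height, prohibited):
    """Map each channel to the channels a packet may request next."""
    chans = set(channels(width, height))
    graph = {c: [] for c in chans}
    for (x, y, d) in chans:
        nx, ny = x + MOVES[d][0], y + MOVES[d][1]
        for d2 in MOVES:
            # 180-degree reversals and prohibited turns create no dependency.
            if d2 == OPPOSITE[d] or (d, d2) in prohibited:
                continue
            if (nx, ny, d2) in chans:
                graph[(x, y, d)].append((nx, ny, d2))
    return graph


def has_cycle(graph):
    """Kahn's algorithm: a cycle exists iff some channel is never freed."""
    indegree = {c: 0 for c in graph}
    for successors in graph.values():
        for s in successors:
            indegree[s] += 1
    ready = [c for c, n in indegree.items() if n == 0]
    removed = 0
    while ready:
        c = ready.pop()
        removed += 1
        for s in graph[c]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return removed != len(graph)


if __name__ == "__main__":
    print("all turns allowed, cyclic:", has_cycle(dependency_graph(4, 4, set())))
    print("two turns prohibited, cyclic:",
          has_cycle(dependency_graph(4, 4, WEST_FIRST_PROHIBITED)))
```

Not every pair of prohibited turns works: the two turns must be chosen so that neither the clockwise nor the counterclockwise cycle survives, which is why the check is run on the full dependency graph rather than on the turn count alone.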
Limits on Interconnection Network Performance
IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 166 (4 self)
As the performance of interconnection networks becomes increasingly limited by physical constraints in high-speed multiprocessor systems, the parameters of high-performance network design must be reevaluated, starting with a close examination of assumptions and requirements. This paper models network latency, taking both switch and wire delays into account. A simple closed-form expression for contention in buffered, direct networks is derived and is found to agree closely with simulations. The model includes the effects of packet size and communication locality. Network analysis under various constraints (such as fixed bisection width, fixed channel width, and fixed node size) and under different workload parameters (such as packet size, degree of communication locality, and network request rate) reveals that performance is highly sensitive to these constraints and workloads. A two-dimensional network has the lowest latency only when switch delays and network contention are ignored, but...
Deadlock-Free Multicast Wormhole Routing in 2D Mesh Multicomputers
1992
Cited by 121 (22 self)
Multicast communication services, in which the same message is delivered from a source node to an arbitrary number of destination nodes, are being provided in new generation multicomputers. Broadcast is a special case of multicast in which a message is delivered to all nodes in the network. The nCUBE-2, a wormhole-routed hypercube multicomputer, provides hardware support for broadcast and a restricted form of multicast in which the destinations form a subcube. However, the broadcast routing algorithm adopted in the nCUBE-2 is not deadlock-free. In this paper, four multicast wormhole routing strategies for two-dimensional (2D) mesh multicomputers are proposed and studied. All of the algorithms are shown to be deadlock-free. These are the first deadlock-free multicast wormhole routing algorithms ever proposed. A simulation study has been conducted that compares the performance of these multicast algorithms under dynamic network traffic conditions in a 2D mesh. The results ind...
Performance Tradeoffs In Multithreaded Processors
IEEE Transactions on Parallel and Distributed Systems, 1991
Cited by 111 (5 self)
... utilization. By maintaining multiple process contexts in hardware and switching among them in a few cycles, multithreaded processors can overlap computation with memory accesses and reduce processor idle time. This paper presents an analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects. The model is validated through our own simulations and by comparison with previously published simulation results. Our results indicate that processors can substantially benefit from multithreading, even in systems with small caches. Large caches yield close to full processor utilization with as few as two to four contexts, while small caches may require up to four times as many contexts. Increased network contention due to multithreading has a major effect on performance. The available network bandwidth and the context-switching overhead limit the best possible utilization.
A Survey of Collective Communication in Wormhole-Routed Massively Parallel Computers
IEEE Computer, 1994
Cited by 93 (6 self)
Massively parallel computers (MPC) are characterized by the distribution of memory among an ensemble of nodes. Since memory is physically distributed, MPC nodes communicate by sending data through a network. In order to program an MPC, the user may directly invoke low-level message passing primitives, may use a higher-level communications library, or may write the program in a data parallel language and rely on the compiler to translate language constructs into communication operations. Whichever method is used, the performance of communication operations directly affects the total computation time of the parallel application. Communication operations may be either point-to-point, which involves a single source and a single destination, or collective, in which more than two processes participate. This paper discusses the design of collective communication operations for current systems that use the wormhole routing switching strategy, in which messages are divided into small pieces and...
Basic Techniques for the Efficient Coordination of Very Large Numbers of Cooperating Sequential Processors
1981
Cited by 84 (2 self)
In this paper we implement several basic operating system primitives by using a "replace-add" operation, which can supersede the standard "test and set", and which appears to be a universal primitive for efficiently coordinating large numbers of independently acting sequential processors. We also present a hardware implementation of replace-add that permits multiple replace-adds to be processed nearly as efficiently as loads and stores. Moreover, the crucial special case of concurrent replace-adds updating the same variable is handled particularly well: If every PE simultaneously addresses a replace-add at the same variable, all these requests are satisfied in the time required to process just one request.
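A rough software sketch of the two ideas in this abstract follows: an atomic replace-add, and the combining of several simultaneous requests to one variable into a single memory update. It assumes replace-add returns the updated value (the abstract does not state the return convention), uses a lock as a stand-in for the hardware, and all names are hypothetical.

```python
import threading
from itertools import accumulate


class SharedCell:
    """A shared integer with an atomic replace-add (lock-based stand-in for hardware)."""

    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()

    def replace_add(self, increment):
        # Assumed semantics: add the increment, then return the updated value.
        with self._lock:
            self.value += increment
            return self.value


def combined_replace_add(cell, increments):
    """Serve several requests aimed at one cell with a single memory update.

    Each requester receives the value it would have seen had the requests
    been applied one after another in list order.
    """
    total = sum(increments)
    final = cell.replace_add(total)   # the only access to the shared cell
    base = final - total              # value before the combined update
    return [base + partial for partial in accumulate(increments)]


def claim_tickets(n_workers):
    """Each worker obtains a unique ticket number without test-and-set spinning."""
    cell = SharedCell(0)
    tickets = []

    def worker():
        tickets.append(cell.replace_add(1))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(tickets)


if __name__ == "__main__":
    print(claim_tickets(8))
    print(combined_replace_add(SharedCell(10), [1, 2, 3]))
```

The combining function illustrates why many simultaneous requests can cost about as much as one: the shared cell is touched once, and the individual replies are reconstructed from the increments on the way back.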
Construction of Optimal Multicast Trees Based on the Parameterized Communication Model
In Proceedings of the International Conference on Parallel Processing, 1996
Cited by 29 (2 self)
Many tree-based multicast algorithms have been proposed to provide an efficient software implementation on parallel platforms without hardware multicast support. These algorithms are either architecture-dependent (not portable) or architecture-independent (portable) but do not provide good performance when ported to different parallel platforms. Based on the LogP model, the proposed parameterized communication model can more accurately characterize the communication network of parallel platforms. The model encompasses a number of critical system parameters which can be easily measured on a given parallel platform. Based on the model, efficient methods to construct optimal multicast trees are proposed for both 1-port and α-port communication architectures. Experimental results conducted on the IBM/SP at Argonne National Laboratory are presented to compare the performance of the optimal multicast tree with two other known tree-based multicast algorithms. We claim that our proposed multi...
Coscheduling Based on Run-Time Identification of Activity Working Sets
International Journal of Parallel Programming
Cited by 26 (3 self)
This paper introduces a method for runtime identification of sets of interacting activities ("working sets") with the purpose of coscheduling them, i.e. scheduling them so that all the activities in the set execute simultaneously on distinct processors. The identification is done by monitoring access rates to shared communication objects: activities that access the same objects at a high rate thereby interact frequently, and therefore would benefit from coscheduling. Simulation results show that coscheduling with our runtime identification scheme can give better performance than uncoordinated scheduling based on a single global activity queue. The finer-grained the interactions among the activities in a working set, the better the performance differential. Moreover, coscheduling based on automatic runtime identification achieves about the same performance as coscheduling based on manual identification of working sets by the programmer. Keywords: coscheduling, gang scheduling, on-line ...
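The identification step described above can be sketched as a grouping problem: activities that access the same communication object at a high rate are merged into one working set. The version below is only an illustration under stated assumptions; the fixed rate threshold, the input table layout, and the use of union-find are all hypothetical choices, not details taken from the paper.

```python
from collections import defaultdict


def identify_working_sets(access_rates, threshold):
    """Group activities that share a frequently accessed object.

    access_rates: {activity: {object: accesses per time unit}}  (assumed layout)
    threshold:    minimum rate that counts as frequent           (assumed parameter)
    """
    parent = {activity: activity for activity in access_rates}

    def find(a):
        # Union-find lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    # Collect, per object, the activities that access it at a high rate.
    frequent_users = defaultdict(list)
    for activity, rates in access_rates.items():
        for obj, rate in rates.items():
            if rate >= threshold:
                frequent_users[obj].append(activity)

    # Activities sharing a hot object belong to the same working set.
    for users in frequent_users.values():
        for other in users[1:]:
            parent[find(other)] = find(users[0])

    groups = defaultdict(set)
    for activity in access_rates:
        groups[find(activity)].add(activity)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    rates = {
        "a": {"q1": 50},
        "b": {"q1": 40, "q2": 2},
        "c": {"q2": 30},
        "d": {"q2": 35},
    }
    print(identify_working_sets(rates, threshold=10))
```

Each resulting group is a candidate for coscheduling, that is, for running all of its members simultaneously on distinct processors.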
Near-optimal worst-case throughput routing for two-dimensional mesh networks
In International Symposium on Computer Architecture, 2005
Cited by 25 (0 self)
Minimizing latency and maximizing throughput are important goals in the design of routing algorithms for interconnection networks. Ideally, we would like a routing algorithm to (a) route packets using the minimal number of hops to reduce latency and preserve communication locality, (b) deliver good worst-case and average-case throughput and (c) enable low-complexity (and hence, low-latency) router implementation. In this paper, we focus on routing algorithms for an important class of interconnection networks: two-dimensional (2D) mesh networks. Existing routing algorithms for mesh networks fail to satisfy one or more of the design goals mentioned above. Variously, the routing algorithms suffer from poor worst-case throughput (ROMM [13], DOR [23]), poor latency due to increased packet hops (VALIANT [31]) or increased latency due to hardware complexity (minimal-adaptive [7, 30]). The major contribution of this paper is the design of an oblivious routing algorithm, O1TURN, with provable near-optimal worst-case throughput, good average-case throughput, low design complexity and minimal number of network hops for 2D-mesh networks, thus satisfying all the stated design goals. O1TURN offers optimal worst-case throughput when the network radix (k in a k×k network) is even. When the network radix is odd, O1TURN is within a 1/k^2 factor of optimal worst-case throughput. O1TURN achieves superior or comparable average-case throughput with global traffic as well as local traffic. For example, O1TURN achieves 18.8%, 0.7% and 13.6% higher average-case throughput than DOR, ROMM and VALIANT routing, respectively, when averaged over one million random traffic patterns on an 8×8 network. Finally, we demonstrate that O1TURN is well suited for a partitioned router implementation that is of similar delay complexity as a simple dimension-ordered router. Our implementation incurs a marginal increase in switch arbitration delay that is completely hidden in pipelined routers as it is not on the clock-critical path.
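The abstract does not spell out how O1TURN selects a path. A minimal sketch follows, assuming the commonly described formulation in which each packet picks XY or YX dimension-order routing with equal probability; the function names and the coordinate convention are hypothetical.

```python
import random


def dimension_order_path(src, dst, order):
    """Walk from src to dst, fully correcting one coordinate before the other."""
    pos = list(src)
    path = [tuple(pos)]
    for axis in order:
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            pos[axis] += step
            path.append(tuple(pos))
    return path


def o1turn_path(src, dst, rng=random):
    """Choose XY or YX order with equal probability (assumed O1TURN policy)."""
    order = (0, 1) if rng.random() < 0.5 else (1, 0)
    return dimension_order_path(src, dst, order)


def count_turns(path):
    """Count the direction changes along a path of mesh coordinates."""
    steps = [(b[0] - a[0], b[1] - a[1]) for a, b in zip(path, path[1:])]
    return sum(1 for s, t in zip(steps, steps[1:]) if s != t)


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        path = o1turn_path((0, 0), (3, 2), rng)
        print(path, "turns:", count_turns(path))
```

Because the choice does not depend on network state, the algorithm is oblivious, and because both candidate paths are dimension-ordered, every route is minimal and contains at most one turn.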
Maximally Fully Adaptive Routing in 2D Meshes
In International Conference on Parallel Processing, volume I, 1992
Cited by 24 (2 self)
Previous authors have proposed that wormhole routing in 2D meshes be made fully adaptive by doubling the number of channels in one of the two dimensions. We examine this proposition both using a new model for designing adaptive routing algorithms and using simulations. The model involves analyzing the directions in which packets can turn in a network and the cycles that the turns can form. Breaking all of the cycles by prohibiting a minimum number of turns produces routing algorithms that are deadlock free and maximally adaptive. Applying the model to 2D meshes shows that doubling the channels in one dimension is required for fully adaptive routing; doubling the channels in only one direction or two orthogonal directions is not sufficient. Applying the model to 2D meshes with double channels in one dimension also produces a new, fully adaptive routing algorithm. This algorithm is more adaptive than the previous fully adaptive routing algorithm. Simulations of the fully adaptive routing...

