Results 1 - 9 of 9
LogP: Towards a Realistic Model of Parallel Computation, 1993
Cited by 471 (14 self)
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
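Below is a minimal Python sketch of how the parameters combine in simple cost estimates. It uses the conventional LogP names: L is latency, o is per-message processor overhead, g is the minimum gap between consecutive sends, and P is the processor count. It assumes a greedy single-item broadcast, in which every processor that already holds the datum keeps forwarding it to uninformed processors as early as its overhead and gap allow. The function names are illustrative and do not come from the paper.

```python
import heapq


def logp_point_to_point(L: float, o: float) -> float:
    """Time for one small message: send overhead, network latency, receive overhead."""
    return o + L + o


def logp_greedy_broadcast(P: int, L: float, o: float, g: float) -> float:
    """Completion time of a greedy one-to-all broadcast started by one processor at time 0.

    Each informed processor may start a new send every max(g, o) cycles, and a
    processor reached by a send started at time s is informed at s + L + 2o.
    """
    if P <= 1:
        return 0.0
    # Min-heap of the earliest time each informed processor can start its next send.
    next_send = [0.0]
    finish = 0.0
    for _ in range(P - 1):  # inform the remaining P - 1 processors one at a time
        start = heapq.heappop(next_send)
        informed_at = start + logp_point_to_point(L, o)
        finish = max(finish, informed_at)
        heapq.heappush(next_send, start + max(g, o))  # sender's next free slot
        heapq.heappush(next_send, informed_at)        # new holder can forward at once
    return finish


if __name__ == "__main__":
    print(logp_point_to_point(L=6, o=2))              # 10
    print(logp_greedy_broadcast(P=8, L=6, o=2, g=4))  # 24.0
```

The heap always hands the next message to whichever informed processor can send soonest. Because of this, newly informed processors start forwarding immediately rather than waiting for the root.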
A Theory of Wormhole Routing in Parallel Computers, 1993
Cited by 35 (2 self)
Virtually all theoretical work on message routing in parallel computers has dwelt on packet routing: messages are conveyed as packets, an entire packet can reside at a node of the network, and a packet is sent from the queue of one node to the queue of another node until it reaches its destination. A trend in multicomputer architecture, however, is to use wormhole routing. In wormhole routing a message is transmitted as a contiguous stream of bits, physically occupying a sequence of nodes/edges in the network. Thus, a message resembles a worm burrowing through the network. In this paper we give theoretical analyses of simple wormhole routing algorithms, showing them to be nearly optimal for butterfly and mesh connected networks. Our analysis requires initial random delays in injecting messages into the network. We report simulation results suggesting that the idea of random initial delays may have an impact beyond theoretical analysis. IBM Almaden Research Center, San Jose, CA., IBM A...
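The difference between packet routing and wormhole routing can be seen under an idealized, contention-free model. Here one flit crosses one link per cycle, routers add no extra delay, and no random initial delay is applied. The difference then shows up both in the latency formulas and in which links a message holds at a given moment. This Python sketch is illustrative only. The function names and the flit-level timing model are assumptions, not taken from the paper.

```python
def store_and_forward_cycles(hops: int, flits: int) -> int:
    """Packet routing: every node buffers the whole packet before forwarding it."""
    return hops * flits


def wormhole_cycles(hops: int, flits: int) -> int:
    """Wormhole routing: the header advances one hop per cycle and the body flits
    follow in a pipeline, so the last flit arrives flits - 1 cycles after the header."""
    return hops + flits - 1


def links_held(t: int, hops: int, flits: int) -> list[int]:
    """Links (numbered 1..hops along the path) occupied by the worm after t cycles.

    Flit i (1-based) crosses link j at cycle i + j - 1, so link j is held once the
    header has crossed it (j <= t) and until the tail crosses it (flits + j - 1 > t).
    """
    return [j for j in range(1, hops + 1) if j <= t and flits + j - 1 > t]


if __name__ == "__main__":
    print(store_and_forward_cycles(5, 10), wormhole_cycles(5, 10))  # 50 14
    print(links_held(3, 5, 10), links_held(12, 5, 10))              # [1, 2, 3] [4, 5]
```

A worm stretched across several links blocks every other message that needs one of them. Contention between worms is what the paper's analysis has to control, and per the abstract that analysis relies on random initial delays before injection.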
An Efficient Delay-Optimal Distributed Termination Detection Algorithm
- In Journal of Parallel and Distributed Computing (JPDC), 2001
Cited by 12 (2 self)
One of the important issues to be addressed when solving problems on parallel machines or distributed systems is that of efficient termination detection. Numerous schemes with different performance characteristics have been proposed in the past for this purpose. These schemes, while efficient with regard to one performance metric, prove to be inefficient in terms of other metrics. A significant drawback shared by all previous methods is that they may take as long as $\Theta(P)$ time to detect and signal termination after its actual occurrence, where $P$ is the total number of processing elements. Detection delay is arguably the most important metric to optimize, since it is directly related to the amount of idling of computing resources and to the delay in the utilization of results of the underlying computation. In this paper, we present a novel termination detection algorithm that is simultaneously optimal or near-optimal with respect to all relevant performance measures on any topology. In particular, our algorithm has a best-case detection delay of $\Theta(1)$ and a finite optimal worst-case detection delay on any topology equal in order terms to the time for an optimal one-to-all broadcast on that topology; we derive a general expression for an optimal one-to-all broadcast on an arbitrary topology, which is an interesting new result in itself. On k-ary n-cube tori and meshes, the worst-case delay is $\Theta(D)$, where $D$ is the diameter of the architecture. Further, our algorithm has message and computational complexities of $O(\max(MD, P))$ ($\Theta(\max(M, P))$ on the average for most applications, the same as other message-efficient algorithms) and an optimal space complexity of $\Theta(P)$, where $M$ is the total number of messages used by the underlying computation. We also give a scheme using...
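The abstract ties worst-case detection delay to the time for an optimal one-to-all broadcast, and to the diameter $D$ on k-ary n-cube tori and meshes. As a rough illustration only, the sketch below assumes an all-port model in which every informed node forwards to all its neighbours in one time unit. Under that assumption, a broadcast from a source takes as many steps as the source's eccentricity, and the worst case over all sources is the diameter. This is not the paper's general broadcast expression, and the helper names are hypothetical.

```python
from collections import deque
from itertools import product


def kary_ncube(k: int, n: int, torus: bool = True) -> dict:
    """Adjacency lists of a k-ary n-cube torus, or of an n-dimensional k x ... x k mesh."""
    nodes = list(product(range(k), repeat=n))
    adj = {v: [] for v in nodes}
    for v in nodes:
        for d in range(n):
            for step in (-1, 1):
                c = v[d] + step
                if torus:
                    c %= k                  # wrap-around link
                elif not 0 <= c < k:
                    continue                # mesh: no link past the boundary
                w = v[:d] + (c,) + v[d + 1:]
                if w != v and w not in adj[v]:  # skip self-loops/duplicates when k <= 2
                    adj[v].append(w)
    return adj


def broadcast_steps(adj: dict, src) -> int:
    """All-port broadcast time from src, i.e. the BFS eccentricity of src."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return max(dist.values())


def diameter(adj: dict) -> int:
    """Worst-case broadcast time over all possible sources."""
    return max(broadcast_steps(adj, s) for s in adj)


if __name__ == "__main__":
    print(diameter(kary_ncube(4, 2, torus=False)))  # 6 = n(k - 1)
    print(diameter(kary_ncube(4, 2, torus=True)))   # 4 = n * floor(k / 2)
```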
3-D Topologies for Networks-on-Chip
- In Proc. IEEE Int. SOC Conf., 2006
Cited by 12 (0 self)
Abstract—Several interesting topologies emerge by incorporating the third dimension in networks-on-chip (NoC). The speed and power consumption of 3-D NoC are compared to those of 2-D NoC. Physical constraints, such as the maximum number of planes that can be vertically stacked and the asymmetry between the horizontal and vertical communication channels of the network, are included in speed and power consumption models of these novel 3-D structures. An analytic model for the zero-load latency of each network that considers the effects of the topology on the performance of a 3-D NoC is developed. Tradeoffs between the number of nodes utilized in the third dimension, which reduces the average number of hops traversed by a packet, and the number of physical planes used to integrate the functional blocks of the network, which decreases the length of the communication channel, are evaluated for both the latency and power consumption of a network. A performance improvement of 40% and 36% and a decrease of 62% and 58% in power consumption are demonstrated for 3-D NoC as compared to a traditional 2-D NoC topology for a network size of $N = 128$ and $N = 256$ nodes, respectively. Index Terms—3-D circuits, 3-D integrated circuits (ICs), 3-D integration, networks-on-chip (NoC), topologies.
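The abstract's tradeoff starts from the observation that spreading nodes over a third dimension reduces the average number of hops a packet traverses. The sketch below computes that average exactly under several assumptions: uniform random traffic, minimal routing on a full mesh, and source and destination drawn independently, with self-pairs included. The 16 x 16 and 8 x 8 x 4 layouts are arbitrary illustrations. The sketch does not reproduce the paper's latency or power models.

```python
from fractions import Fraction
from itertools import product


def avg_hops(dims: tuple[int, ...]) -> Fraction:
    """Mean Manhattan distance between two uniformly random nodes of a mesh.

    Along one dimension of size k, E|a - b| = (k^2 - 1) / (3k) for independent
    uniform a, b in {0, ..., k - 1}; minimal-route hop counts add across dimensions.
    """
    return sum((Fraction(k * k - 1, 3 * k) for k in dims), Fraction(0))


def avg_hops_bruteforce(dims: tuple[int, ...]) -> Fraction:
    """Same quantity by enumerating every (source, destination) pair."""
    nodes = list(product(*(range(k) for k in dims)))
    total = sum(sum(abs(a - b) for a, b in zip(u, v)) for u in nodes for v in nodes)
    return Fraction(total, len(nodes) ** 2)


if __name__ == "__main__":
    print(float(avg_hops((16, 16))))   # 10.625 hops: 2-D, 256 nodes
    print(float(avg_hops((8, 8, 4))))  # 6.5 hops: 3-D, 256 nodes
```

The brute-force version exists only to cross-check the closed form on small meshes.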
Communication Throughput of Interconnection Networks
- Proc. 19th Int. Symp. on Mathematical Foundations of Computer Science (MFCS '94), Lecture Notes in Computer Science No. 841, 1994
Cited by 10 (6 self)
Modern flow control techniques used for massively parallel computers have made network capacity a more important parameter for application performance than network latency. Network latency is usually rather low as long as the injection rate stays below a specific value. Nowadays the maximal injection rate is usually approximated by the bisection bandwidth of the network. We describe the state of the art in determining the bisection bandwidth of interconnection systems. Unfortunately, the bisection bandwidth gives only a very rough approximation of the communication capacity of a network. We describe some methods that aim to model the maximal network capacity using probabilistic models. In particular, we present results for multistage interconnection networks, which are often used in parallel computing and in more general communication applications. The presented results agree rather closely with simulation results and therefore have the potential to repla...
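Bisection width is the minimum number of links that must be cut to split the network into two equal halves. It underlies the bisection-bandwidth approximation the abstract criticizes, since multiplying it by the per-link bandwidth gives the bisection bandwidth. The exhaustive Python sketch below only illustrates the definition. It is exponential in the node count, handles an even number of nodes only, and is not one of the methods surveyed in the paper. The helper names are hypothetical.

```python
from itertools import combinations, product


def bisection_width(nodes: list, edges: list) -> int:
    """Minimum number of edges crossing any split into two equal halves (brute force)."""
    n = len(nodes)
    if n % 2:
        raise ValueError("this sketch assumes an even number of nodes")
    first, rest = nodes[0], nodes[1:]
    best = len(edges)
    # Fixing nodes[0] on one side enumerates each balanced partition exactly once.
    for others in combinations(rest, n // 2 - 1):
        side = {first, *others}
        cut = sum((u in side) != (v in side) for u, v in edges)
        best = min(best, cut)
    return best


def mesh_2d(k: int):
    """Nodes and undirected links of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    edges = [((x, y), (x + 1, y)) for x in range(k - 1) for y in range(k)]
    edges += [((x, y), (x, y + 1)) for x in range(k) for y in range(k - 1)]
    return nodes, edges


if __name__ == "__main__":
    print(bisection_width(*mesh_2d(4)))  # 4 links for a 4 x 4 mesh
```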
A Realizable Efficient Parallel Architecture
- In Proceedings of the First International Heinz Nixdorf Symposium: Parallel Architectures and their Efficient Use, 1992
Cited by 10 (5 self)
The near future will bring large-scale parallel computers able to provide computing power of more than one TFlop/s. It is commonly agreed that these systems will be based on the model of asynchronous processors connected by a point-to-point network. A number of different network architectures have been presented in the past. In this paper we present an architectural principle that combines efficiency, realizability for very large systems, and the inherent reliability needed for such large parallel processing systems. The Fat Mesh of Clos network principle presented here can be scaled in many ways to fulfill the special requirements of a system design. Two realizations of this principle are presented. One is based on static switches combined to form a fully reconfigurable system; this architecture has been realized for systems containing up to 320 processors. The other realization uses dynamic routing switches. By combining wormhole routing with randomized and local adaptive ...
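The following is general background only, not the paper's Fat Mesh of Clos construction. The name refers to Clos networks. A symmetric three-stage Clos network has $r$ ingress switches of size $n \times m$, $m$ middle switches of size $r \times r$, and $r$ egress switches of size $m \times n$, and connects $N = rn$ terminals. By Clos's classical result, it is strictly non-blocking when $m \ge 2n - 1$. By the Slepian-Duguid theorem, it is rearrangeably non-blocking when $m \ge n$. The Python sketch compares its crosspoint count with that of an $N \times N$ crossbar. The function names are hypothetical.

```python
def clos_crosspoints(n: int, m: int, r: int) -> int:
    """Crosspoints in a symmetric three-stage Clos network with parameters (n, m, r)."""
    ingress = r * (n * m)   # r switches of size n x m
    middle = m * (r * r)    # m switches of size r x r
    egress = r * (m * n)    # r switches of size m x n
    return ingress + middle + egress


def clos_nonblocking(n: int, m: int) -> str:
    """Classify a symmetric Clos network by its number of middle switches m."""
    if m >= 2 * n - 1:
        return "strict-sense non-blocking"
    if m >= n:
        return "rearrangeably non-blocking"
    return "blocking"


if __name__ == "__main__":
    n = r = 32                                       # N = 1024 terminals
    m = 2 * n - 1
    print(clos_crosspoints(n, m, r), (n * r) ** 2)   # 193536 1048576
    print(clos_nonblocking(n, m))                    # strict-sense non-blocking
```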
On the Communication Throughput of Buffered Multistage Interconnection Networks
- Proc. of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures, 1996
Cited by 6 (0 self)
Multistage interconnection networks (MINs) are used as the interconnection structure in a large number of applications. Their performance is mainly determined by their communication throughput, which in most cases has to be investigated by time-consuming simulations or approximated by simple models. In this paper, we investigate the steady-state throughput of single-buffered multistage interconnection networks using the so-called relaxed blocking model, where a message is deleted if the receiving buffer is occupied. We derive upper and lower bounds on the throughput of MINs of arbitrary height and show that the throughput of single-buffered networks is an order of magnitude higher than the throughput of non-buffered MINs. In detail, we show that the throughput is $\Theta(n/\sqrt{\log n})$ if $n$ is the size of the network. Because the time dynamics of finite-buffered MINs defy any Markov or semi-Markov approach, we analyze the equilibrium situation of the network and give tight upper and lower bounds on t...
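For comparison with the single-buffered result, consider the classical probabilistic analysis of unbuffered MINs built from $k \times k$ switches. It assumes uniform random destinations and drops any message that loses a switch conflict. If each switch input carries a message with probability $p_i$, each output carries one with probability $p_{i+1} = 1 - (1 - p_i/k)^k$. The Python sketch iterates this recurrence. It models only the unbuffered case, not the paper's relaxed blocking model for buffered networks, and the function names are hypothetical.

```python
def unbuffered_accept_rate(stages: int, load: float = 1.0, k: int = 2) -> float:
    """Probability per cycle that a network output receives a message."""
    p = load
    for _ in range(stages):
        # An output of a k x k switch stays idle only if all k inputs miss it.
        p = 1.0 - (1.0 - p / k) ** k
    return p


def unbuffered_throughput(n: int, load: float = 1.0, k: int = 2) -> float:
    """Messages delivered per cycle by an n-input MIN with log_k(n) stages."""
    stages, size = 0, 1
    while size < n:
        size *= k
        stages += 1
    if size != n:
        raise ValueError("n must be a power of k")
    return n * unbuffered_accept_rate(stages, load, k)


if __name__ == "__main__":
    for n in (16, 256, 4096):
        print(n, round(unbuffered_throughput(n), 1))
```

With $2 \times 2$ switches the recurrence reduces to $p_{i+1} = p_i - p_i^2/4$. This decays roughly like $4/m$ after $m$ stages, so unbuffered throughput grows only on the order of $n/\log n$.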
Improvement in Bit Error Rate for Optoelectronic Multicomputer Interconnection Networks Using Cyclic Redundancy Code Error Detection
- IEEE Photonics Technology Letters, 1997
Cited by 2 (2 self)
Abstract—This letter presents test results of an integrated optoelectronic (OE) channel employing hop-by-hop error control circuitry based on cyclic redundancy codes (CRC) to improve the effective bit-error rate (BER). The use of OE interconnects in place of wires in multicomputer networks becomes more attractive as channel bandwidth and power efficiency increase, but these improvements must be accomplished while maintaining an acceptable channel BER. Test results of an integrated OE channel incorporating CRC-based error control circuitry demonstrate a BER reduction of two orders of magnitude at the cost of a 20% bandwidth overhead. This may lead to higher-bandwidth, higher-efficiency OE interconnects. Index Terms—Error detection codes, multicomputer interconnection networks, optoelectronic channels, wormhole routing protocols.
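In hop-by-hop CRC error control, each link receiver recomputes a checksum over what it received and rejects anything that does not match, so detection is not left to the destination. The abstract does not state the polynomial or framing used. The Python sketch therefore assumes the common CRC-16-CCITT polynomial 0x1021 with initial value 0xFFFF, and a hypothetical frame layout of payload followed by a 2-byte CRC.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 with polynomial x^16 + x^12 + x^5 + 1 (0x1021), MSB first."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_frame(payload: bytes) -> bytes:
    """Sender side of one hop: append the CRC to the payload."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")


def check_frame(frame: bytes):
    """Receiver side of one hop: return the payload if the CRC matches, else None.

    A real link would then request retransmission from the previous hop.
    """
    payload, received = frame[:-2], int.from_bytes(frame[-2:], "big")
    return payload if crc16_ccitt(payload) == received else None


if __name__ == "__main__":
    frame = make_frame(b"flitdata")
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit on the link
    print(check_frame(frame), check_frame(corrupted))  # b'flitdata' None
```

Checking at every hop catches an error on the link where it occurs, so only that hop needs to resend the data.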
On constructing the minimum orthogonal convex polygon in 2-D faulty meshes
- Proc. of International Parallel and Distributed Processing Symposium (IPDPS), 2004 (CD-ROM)
Cited by 2 (2 self)
The rectangular faulty block model is the most commonly used fault model for designing fault-tolerant and deadlock-free routing algorithms in mesh-connected multicomputers. The convexity of a rectangle facilitates simple and efficient ways to route messages around fault regions using relatively few or no virtual channels to avoid deadlock. However, such a faulty block may include many non-faulty nodes that are disabled, i.e., not involved in the routing process. Therefore, it is important to define a fault region that is convex and, at the same time, includes a minimum number of non-faulty nodes. In this paper, we propose an optimal solution that can quickly construct a set of minimum faulty polygons, called orthogonal convex polygons, from a given set of faulty blocks in a 2-D mesh (or 2-D torus). The formation...
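To make the distinction concrete, the Python sketch below compares a rectangular block with the smallest orthogonally convex region around a set of faulty nodes. A region is orthogonally convex here if every row and every column meets it in one contiguous run. The sketch is a simple centralized fixpoint computation, assumed here for illustration. It is not the paper's construction from faulty blocks, it ignores torus wrap-around, and its convexity test does not require the region to be connected. Each fill step is forced by the definition, so the fixpoint is the smallest such superset.

```python
def orthogonal_convex_closure(faulty):
    """Smallest node set containing `faulty` whose intersection with every row and
    every column of the mesh is one contiguous run."""
    region = set(faulty)
    changed = True
    while changed:
        changed = False
        rows, cols = {}, {}
        for x, y in region:
            rows.setdefault(y, []).append(x)
            cols.setdefault(x, []).append(y)
        # Fill every gap between the extreme nodes of a row, then of a column.
        for y, xs in rows.items():
            for x in range(min(xs), max(xs) + 1):
                if (x, y) not in region:
                    region.add((x, y))
                    changed = True
        for x, ys in cols.items():
            for y in range(min(ys), max(ys) + 1):
                if (x, y) not in region:
                    region.add((x, y))
                    changed = True
    return region


def bounding_block(faulty):
    """Rectangular faulty block: the bounding rectangle of the faulty nodes."""
    xs = [x for x, _ in faulty]
    ys = [y for _, y in faulty]
    return {(x, y) for x in range(min(xs), max(xs) + 1)
            for y in range(min(ys), max(ys) + 1)}


if __name__ == "__main__":
    faults = {(0, 0), (2, 0), (0, 2)}
    polygon = orthogonal_convex_closure(faults)
    block = bounding_block(faults)
    print(len(polygon - faults), len(block - faults))  # 2 6 non-faulty nodes disabled
```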

