Results 1-10 of 11
LogP: Towards a Realistic Model of Parallel Computation
, 1993
Cited by 562 (15 self)
A vast body of theoretical research has focused either on overly simplistic models of parallel computation, notably the PRAM, or overly specific models that have few representatives in the real world. Both kinds of models encourage exploitation of formal loopholes, rather than rewarding development of techniques that yield performance across a range of current and future parallel machines. This paper offers a new parallel machine model, called LogP, that reflects the critical technology trends underlying parallel computers. It is intended to serve as a basis for developing fast, portable parallel algorithms and to offer guidelines to machine designers. Such a model must strike a balance between detail and simplicity in order to reveal important bottlenecks without making analysis of interesting problems intractable. The model is based on four parameters that specify abstractly the computing bandwidth, the communication bandwidth, the communication delay, and the efficiency of coupling communication and computation. Portable parallel algorithms typically adapt to the machine configuration, in terms of these parameters. The utility of the model is demonstrated through examples that are implemented on the CM-5.
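As a rough illustration of how four such parameters enter a cost estimate, the sketch below uses the conventional LogP names (L for latency, o for per-message processor overhead, g for the minimum gap between consecutive messages, P for the processor count). The abstract describes the parameters but does not name them, and the cost formulas are the commonly quoted small-message estimates, assumed here rather than taken from this listing. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Assumed LogP parameters, all times in the same unit (e.g. cycles)."""
    L: float  # network latency for one small message
    o: float  # processor overhead to send or to receive one message
    g: float  # minimum gap between consecutive sends at one processor
    P: int    # number of processors


def point_to_point(m: LogP) -> float:
    """Time for one small message: send overhead, latency, receive overhead."""
    return m.o + m.L + m.o


def pipelined_sends(m: LogP, n: int) -> float:
    """Time until the last of n back-to-back messages has been received.

    Successive sends are spaced by max(g, o); the last message then needs
    one full point-to-point time.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) * max(m.g, m.o) + point_to_point(m)
```

With the illustrative values L=6, o=2, g=4, a single message costs 10 time units and three pipelined messages cost 18.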
3-D Topologies for Networks-on-Chip
 in Proc. IEEE Int. SOC Conf., 2006
Cited by 64 (2 self)
Several interesting topologies emerge by incorporating the third dimension in networks-on-chip (NoC). The speed and power consumption of 3-D NoC are compared to those of 2-D NoC. Physical constraints, such as the maximum number of planes that can be vertically stacked and the asymmetry between the horizontal and vertical communication channels of the network, are included in speed and power consumption models of these novel 3-D structures. An analytic model for the zero-load latency of each network that considers the effects of the topology on the performance of a 3-D NoC is developed. Tradeoffs between the number of nodes utilized in the third dimension, which reduces the average number of hops traversed by a packet, and the number of physical planes used to integrate the functional blocks of the network, which decreases the length of the communication channel, are evaluated for both the latency and power consumption of a network. A performance improvement of 40% and 36% and a decrease of 62% and 58% in power consumption are demonstrated for 3-D NoC as compared to a traditional 2-D NoC topology for network sizes of N = 128 and N = 256 nodes, respectively. Index Terms: 3-D circuits, 3-D integrated circuits (ICs), 3-D integration, networks-on-chip (NoC), topologies.
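The claim that nodes placed in the third dimension reduce the average hop count can be checked with a small brute-force sketch. It assumes a plain mesh with dimension-ordered (Manhattan) routing and uniform traffic, and it ignores the asymmetry between horizontal and vertical channels that the paper's model includes. The function name is hypothetical.

```python
from itertools import product


def average_hops(dims):
    """Mean Manhattan distance over all ordered source/destination pairs
    (including source == destination) in a mesh with the given side lengths."""
    nodes = list(product(*(range(k) for k in dims)))
    total = sum(
        sum(abs(a - b) for a, b in zip(src, dst))
        for src in nodes
        for dst in nodes
    )
    return total / (len(nodes) ** 2)
```

For 64 nodes, this gives 5.25 hops for an 8 x 8 mesh and 3.75 hops for a 4 x 4 x 4 mesh.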
A theory of wormhole routing in parallel computers
 IEEE Transactions on Computers
, 1996
An Efficient Delay-Optimal Distributed Termination Detection Algorithm
 IN JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING (JPDC)
, 2001
Cited by 13 (2 self)
One of the important issues to be addressed when solving problems on parallel machines or distributed systems is that of efficient termination detection. Numerous schemes with different performance characteristics have been proposed in the past for this purpose. These schemes, while being efficient with regard to one performance metric, prove to be inefficient in terms of other metrics. A significant drawback shared by all previous methods is that they may take as long as Θ(P) time to detect and signal termination after its actual occurrence, where P is the total number of processing elements. Detection delay is arguably the most important metric to optimize, since it is directly related to the amount of idling of computing resources and to the delay in the utilization of results of the underlying computation. In this paper, we present a novel termination detection algorithm that is simultaneously optimal or near-optimal with respect to all relevant performance measures on any topology. In particular, our algorithm has a best-case detection delay of Θ(1) and a finite optimal worst-case detection delay on any topology equal in order terms to the time for an optimal one-to-all broadcast on that topology; we derive a general expression for an optimal one-to-all broadcast on an arbitrary topology, which is an interesting new result in itself. On k-ary n-cube tori and meshes, the worst-case delay is Θ(D), where D is the diameter of the architecture. Further, our algorithm has message and computational complexities of O(max(MD, P)) (Θ(max(M, P)) on the average for most applications, the same as other message-efficient algorithms) and an optimal space complexity of Θ(P), where M is the total number of messages used by the underlying computation. We also give a scheme using...
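Since the worst-case delay on k-ary n-cube tori and meshes is stated in terms of the diameter D, a short sketch of how D grows with k and n may help. It uses the standard diameter formulas for these topologies, which are not given in the abstract, and the function names are hypothetical.

```python
def torus_diameter(k: int, n: int) -> int:
    """Diameter of a k-ary n-cube torus: wraparound links halve the
    longest distance in each of the n dimensions."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return n * (k // 2)


def mesh_diameter(k: int, n: int) -> int:
    """Diameter of a k-ary n-dimensional mesh (no wraparound links):
    corner to opposite corner, k - 1 hops per dimension."""
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive")
    return n * (k - 1)
```

For example, with P = 64 processors arranged as a 4-ary 3-cube, the torus has D = 6 and the mesh has D = 9, both far smaller than P.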
A Realizable Efficient Parallel Architecture
 IN PROCEEDINGS OF THE FIRST INTERNATIONAL HEINZ NIXDORF SYMPOSIUM: PARALLEL ARCHITECTURES AND THEIR EFFICIENT USE
, 1992
Cited by 10 (5 self)
The near future will present large-scale parallel computers able to provide computing power of more than one TFlop per second. It is commonly agreed that these systems will be based on the model of asynchronous processors connected by a point-to-point network. A number of different network architectures have been presented in the past. In this paper we present an architectural principle that combines efficiency, realizability for very large systems, and the inherent reliability needed for such large parallel processing systems. The Fat Mesh of Clos network principle presented here can be scaled in many ways to fulfill the special requirements of a system design. Two realizations of this principle are presented. One is based on static switches combined to form a fully reconfigurable system. This architecture has been realized for systems containing up to 320 processors. The other realization uses dynamic routing switches. By combining wormhole routing with randomized and local adaptive ...
Communication Throughput of Interconnection Networks
 Proc. 19th Int. Symp. on Mathematical Foundations of Computer Science (MFCS '94), Lecture Notes in Computer Science No. 841
, 1994
Cited by 10 (6 self)
Modern flow control techniques used for massively parallel computers have made network capacity a more important parameter for application performance than network latency. Network latency is usually rather low as long as the injection rate is below a specific value. Nowadays the maximal injection rate is usually approximated by the bisection bandwidth of the network. We will describe the state of the art in determining the bisection bandwidth of interconnection systems. Unfortunately, the bisection bandwidth leads only to very vague approximations of the communication capacity of a network. We will describe some methods aiming at modeling the maximal network capacity by using probabilistic models. In particular, we will present results for the multistage interconnection network, which is often used in parallel computing and in more general communication applications. The presented results show a rather close relation to results gained by simulations and therefore have the potential to repla...
Circuit-Switched Gossiping in the 3-Dimensional Torus Networks
, 1997
Cited by 10 (1 self)
In this paper we describe, in the case of short messages, an efficient gossiping algorithm for 3-dimensional torus networks (wrap-around or toroidal meshes) that uses synchronous circuit-switched routing. The algorithm is based on a recursive decomposition of a torus. The protocol requires an optimal number of rounds and a quasi-optimal number of intermediate switch settings to gossip in a 7^i × 7^i × 7^i torus.
On the Communication Throughput of Buffered Multistage Interconnection Networks
 PROC. OF THE 8TH ANNUAL ACM SYMPOSIUM ON PARALLEL ALGORITHMS AND ARCHITECTURES
, 1996
Cited by 7 (0 self)
Multistage interconnection networks (MINs) are used as the interconnection structure in a large number of applications. Their performance is mainly determined by their communication throughput, which, in most cases, has to be investigated by time-consuming simulations or approximated by simple models. In this paper, we investigate the steady-state throughput of single-buffered multistage interconnection networks using the so-called relaxed blocking model, where a message is deleted if the receiving buffer is occupied. We derive upper and lower bounds on the throughput of MINs of arbitrary height and show that the throughput of single-buffered networks is an order of magnitude higher than the throughput of non-buffered MINs. In detail, we show that the throughput is \Theta(n / \sqrt{\log n}), where n is the size of the network. Because the time dynamics of finite-buffered MINs defy every Markov or semi-Markov approach, we analyze the equilibrium situation of the network and give tight upper and lower bounds on t...
On constructing the minimum orthogonal convex polygon in 2-D faulty meshes
 Proc. of International Parallel and Distributed Processing Symposium (IPDPS), 2004 (CD-ROM)
Cited by 4 (2 self)
The rectangular faulty block model is the most commonly used fault model for designing fault-tolerant and deadlock-free routing algorithms in mesh-connected multicomputers. The convexity of a rectangle facilitates simple and efficient ways to route messages around fault regions using relatively few or no virtual channels to avoid deadlock. However, such a faulty block may include many non-faulty nodes which are disabled, i.e., they are not involved in the routing process. Therefore, it is important to define a fault region that is convex and, at the same time, includes a minimum number of non-faulty nodes. In this paper, we propose an optimal solution that can quickly construct a set of minimum faulty polygons, called orthogonal convex polygons, from a given set of faulty blocks in a 2-D mesh (or 2-D torus). The formation
Improvement in Bit Error Rate for Optoelectronic Multicomputer Interconnection Networks Using Cyclic Redundancy Code Error Detection
 IEEE Photonics Technology Letters
, 1997
Cited by 2 (2 self)
This letter presents testing results of an integrated optoelectronic (OE) channel employing hop-by-hop error control circuitry based on cyclic redundancy codes (CRC) to improve the effective bit-error rate (BER). The use of OE interconnect in place of wires in multicomputer networks becomes more attractive as channel bandwidth and power efficiency are increased. But these improvements must be accomplished while maintaining an acceptable channel BER. Test results of an integrated OE channel incorporating CRC-based error control circuitry demonstrate a BER reduction of two orders of magnitude while incurring a 20% bandwidth overhead. This may lead to higher bandwidth and higher efficiency OE interconnects. Index Terms: Error detection codes, multicomputer interconnection networks, optoelectronic channels, wormhole routing protocols.
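The idea of CRC-based error detection at each hop can be sketched in software. The example below assumes CRC-32 from Python's standard zlib module purely for illustration; the abstract does not say which CRC polynomial or frame layout the hardware uses, and the function names are hypothetical.

```python
import zlib

CRC_BYTES = 4  # CRC-32 produces a 32-bit checksum


def add_crc(payload: bytes) -> bytes:
    """Sender side of one hop: append the CRC-32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(CRC_BYTES, "big")


def check_crc(frame: bytes) -> bool:
    """Receiver side of one hop: recompute the CRC and compare it with
    the trailing checksum. A False result would trigger a retransmission
    request in a hop-by-hop scheme."""
    if len(frame) < CRC_BYTES:
        return False
    payload, received = frame[:-CRC_BYTES], frame[-CRC_BYTES:]
    return zlib.crc32(payload).to_bytes(CRC_BYTES, "big") == received
```

A frame that arrives intact passes the check, while a frame with a single flipped bit fails it, so the error is caught at the hop where it occurred instead of at the final destination.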