Results 1-10 of 31
Will Physical Scalability Sabotage Performance Gains?
1997
Cited by 70 (1 self)
Many designers expect processor performance to keep improving at the current rate indefinitely as feature sizes shrink. However, as wire delays become a larger percentage of overall signal delay and as clock speeds grow faster than transistor speed, I believe performance increases will ultimately fall off. These delays are inevitable simply because wires are not keeping pace with the scaling of other features. In fact, for CMOS processes below 0.25 micron, the physical limits of wire scaling may begin to change high-speed processor design. That is, an unacceptably small percentage of the die will be reachable during a single clock cycle. To support my prediction, I have mapped trends in a metric that relates time and distance and projections in clock speed across eight processor generations, from 0.6 to 0.06 micron. During this span (probably 0.1 micron) we'll see a billion-transistor processor. To illustrate how physical scalability could affect the design of processors on this scale, I also compared signal drive distance and clock speed for the span endpoints, 0.6 and 0.06 micron.
Bandwidth-Optimal Complete Exchange on Wormhole-Routed 2D/3D Torus Networks: A Diagonal-Propagation Approach
1997
Cited by 21 (5 self)
All-to-all personalized communication, or complete exchange, is at the heart of numerous applications in parallel computing. Several complete exchange algorithms have been proposed in the literature for wormhole meshes. However, these algorithms, when applied to tori, cannot take advantage of wraparound interconnections to implement complete exchange with reduced latency. In this paper, a new diagonal-propagation approach is proposed to develop a set of complete exchange algorithms for 2D and 3D tori. This approach exploits the symmetric interconnections of tori and allows the development of a communication schedule consisting of several contention-free phases. These algorithms are indirect in nature and they use message combining to reduce the number of phases (message startups). It is shown that these algorithms effectively use the bisection bandwidth of a torus, which is twice that of an equal-sized mesh, to achieve complete exchange in time which is almost half of the best known complet...
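The abstract's remark that a torus has twice the bisection bandwidth of an equal-sized mesh can be checked by counting links across the straight middle cut of a k x k grid. The sketch below is only an illustration, not taken from the paper: the function name `bisection_width` and the choice of a vertical cut through the middle column boundary are assumptions, and k is assumed even and at least 4.

```python
def bisection_width(k: int, wraparound: bool) -> int:
    """Count links crossing the middle vertical cut of a k x k grid.

    Hypothetical helper for illustration. Nodes are (row, col); the cut
    separates columns [0, k/2) from columns [k/2, k). With wraparound=True
    the grid is treated as a 2D torus, otherwise as a 2D mesh.
    """
    if k < 4 or k % 2 != 0:
        raise ValueError("k must be even and at least 4")
    half = k // 2
    crossing = 0
    for row in range(k):
        for col in range(k):
            # Only horizontal links can cross a vertical cut; visit each
            # link once, from its left endpoint to its right neighbor.
            right = col + 1
            if right == k:
                if not wraparound:
                    continue  # a mesh has no link past the last column
                right = 0     # a torus wraps back to column 0
            if (col < half) != (right < half):
                crossing += 1
    return crossing


if __name__ == "__main__":
    for k in (4, 8, 16):
        mesh = bisection_width(k, wraparound=False)
        torus = bisection_width(k, wraparound=True)
        print(f"k={k}: mesh cut = {mesh}, torus cut = {torus}")
```

Each row contributes one crossing link in the mesh and two in the torus (the extra one being the wraparound link), which is where the factor of two comes from.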
BSP-Like External-Memory Computation
 IN PROC. 3RD ITALIAN CONFERENCE ON ALGORITHMS AND COMPLEXITY
Cited by 21 (0 self)
In this paper we present a paradigm for solving external-memory problems, and illustrate it by algorithms for matrix multiplication, sorting, list ranking, transitive closure and FFT. Our paradigm is based on the use of BSP algorithms. The correspondence is almost perfect, and especially the notion of x-optimality carries over to algorithms designed according to our paradigm. The advantages of the approach are similar to the advantages of BSP algorithms for parallel computing: scalability, portability, predictability. The performance measure here is the total work, not only the number of I/O operations as in previous approaches. The predicted performances are therefore more useful for practical applications.
Implementing the Hierarchical PRAM on the 2D Mesh: Analyses and Experiments
1995
Cited by 12 (2 self)
We investigate aspects of the performance of the EREW instance of the Hierarchical PRAM (HPRAM) model, a recursively partitionable PRAM, on the 2D mesh architecture via analysis and simulation experiments. Since one of the ideas behind the HPRAM is to systematically exploit locality in order to negate the need for expensive communication hardware and thus promote cost-effective scalability, our design decisions are based on minimizing implementation costs. The Peano indexing scheme is used as a simple and natural means of allowing the dynamic, recursive partitioning of the mesh into arbitrarily-sized submeshes, as required by the HPRAM. We show that for any submesh the ratio of the largest Manhattan distance between two nodes of the submesh to that of the square mesh with an identical number of processors is at most 3/2, thereby demonstrating the locality-preserving properties of the Peano scheme for arbitrary partitions of the mesh. We provide matching analytical and experimenta...
Incomplete k-ary n-cube and Its Derivatives
 J. Parallel and Distributed Computing, 2004
Cited by 11 (7 self)
Incomplete or pruned k-ary n-cube, n ≥ 3, is derived as follows. All links of dimension n - 1 are left in place and links of the remaining n - 1 dimensions are removed, except for one, which is chosen periodically from the remaining dimensions along the intact dimension n - 1. This leads to a node degree of 4 instead of the original 2n and results in regular networks that are Cayley graphs, provided that n - 1 divides k. For n = 3 (n = 5), the preceding restriction is not problematic, as it only requires that k be even (a multiple of 4). In other cases, changes to the basis network to be pruned, or to the pruning algorithm, can mitigate the problem. Incomplete k-ary n-cube maintains a number of desirable topological properties of its unpruned counterpart despite having fewer links. It is maximally connected, has diameter and fault diameter very close to those of k-ary n-cube, and an average internode distance that is only slightly greater. Hence, the cost/performance tradeoffs offered by our pruning scheme can in fact lead to useful, and practically realizable, parallel architectures. We study pruned k-ary n-cubes in general and offer some additional results for the special case n = 3.
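The pruning rule described in this abstract can be sketched in a few lines to see the degree-4 property concretely. This is one possible reading of the abstract, not the paper's construction: it assumes dimensions are numbered 0 to n - 1, that a node whose coordinate in dimension n - 1 is x keeps its links in dimension x mod (n - 1), and the names `pruned_kary_ncube` and `eccentricity` are hypothetical.

```python
from collections import deque
from itertools import product


def pruned_kary_ncube(k: int, n: int) -> dict:
    """Adjacency sets of a pruned k-ary n-cube under an ASSUMED pruning rule.

    Every node keeps both links of dimension n-1, plus both links of one
    other dimension selected by (coordinate in dimension n-1) mod (n-1).
    """
    if n < 3 or k < 3 or k % (n - 1) != 0:
        raise ValueError("need n >= 3, k >= 3, and (n - 1) dividing k")
    adj = {node: set() for node in product(range(k), repeat=n)}
    for node in adj:
        kept = node[n - 1] % (n - 1)  # assumed periodic choice of dimension
        for dim in (n - 1, kept):
            for step in (1, -1):
                nbr = list(node)
                nbr[dim] = (nbr[dim] + step) % k  # wraparound link
                adj[node].add(tuple(nbr))
    return adj


def eccentricity(adj: dict, source) -> int:
    """Largest BFS distance from source; raises if the graph is disconnected."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    if len(dist) != len(adj):
        raise ValueError("graph is disconnected")
    return max(dist.values())


if __name__ == "__main__":
    net = pruned_kary_ncube(k=4, n=3)
    degrees = {len(neighbors) for neighbors in net.values()}
    print("nodes:", len(net), "degrees:", degrees)
    print("eccentricity of origin:", eccentricity(net, (0, 0, 0)))
```

Under this reading every node has degree 4 regardless of n, compared with 2n in the unpruned network, and the check that n - 1 divides k keeps the periodic pattern consistent around the wraparound ring of dimension n - 1.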
Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds
 Theory of Computing Systems, 1995
Cited by 10 (3 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to nonintuitive orchestrations of the simulation process. Dipart...
Integer Sorting and Routing in Arrays with Reconfigurable Optical Buses
1996
Cited by 9 (0 self)
In this paper we present deterministic algorithms for integer sorting and online packet routing on arrays with reconfigurable optical buses. The main objective is to identify the mechanisms specific to this type of architecture that allow us to build efficient integer sorting, partial permutation routing and h-relations algorithms. The consequences of these results on the PRAM simulation complexity are also investigated. Keywords: Optical pipelined buses, reconfigurable array, sorting, routing. 1. Introduction. In large-scale general-purpose parallel machines based on connection networks, efficient communication capabilities are essential in order to solve most of the problems of interest in a timely manner. Interprocessor communication networks are often the main bottlenecks in parallel machines. One important limitation of these networks concerns the exclusive access to the bus resources, which limits throughput to a function of the end-to-end propagation time. Optical communicati...
Lower Bounds on Processor-Time Tradeoffs under Bounded-Speed Message Propagation
1995
Cited by 7 (0 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to nonintuitive orchestrations of the simulation process.
Augmented Ring Networks
1999
Cited by 6 (1 self)
We study four augmentations of ring networks which are intended to enhance a ring's efficiency as a communication medium significantly, while increasing its structural complexity
Performance, algorithmic, and robustness attributes of perfect difference networks
 IEEE Trans. Parallel Distributed Systems, 2005
Cited by 4 (2 self)
Perfect difference networks (PDNs) that are based on the mathematical notion of perfect difference sets have been shown to comprise an asymptotically optimal method for connecting a number of nodes into a network with diameter 2. Justifications for, and mathematical underpinning of, PDNs appear in a companion paper. In this paper, we compare PDNs and some of their derivatives to interconnection networks with similar cost/performance, including certain generalized hypercubes and their hierarchical variants. Additionally, we discuss point-to-point and collective communication algorithms and derive a general emulation result that relates the performance of PDNs to that of complete networks as ideal benchmarks. We show that PDNs are quite robust, both with regard to node and link failures that can be tolerated and in terms of blandness (not having weak spots). In particular, we prove that the fault diameter of PDNs is no greater than 4. Finally, we study the complexity and scalability aspects of these networks, concluding that PDNs and their derivatives allow the construction of very low diameter networks close to any arbitrary desired size and that, in many respects, PDNs offer optimal performance and fault tolerance relative to their complexity or implementation cost.
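To make the diameter-2 claim concrete, the following sketch checks the perfect difference set property and measures the diameter of the resulting network for two small, well-known sets, {0, 1, 3} mod 7 and {0, 1, 3, 9} mod 13. It assumes the usual construction in which node i is linked to i ± s (mod n) for each nonzero element s of the set; the function names are hypothetical and the code is an illustration rather than the paper's own formulation.

```python
from collections import deque


def is_perfect_difference_set(s, n: int) -> bool:
    """True if every nonzero residue mod n arises exactly once as a - b."""
    counts = [0] * n
    for a in s:
        for b in s:
            if a != b:
                counts[(a - b) % n] += 1
    return all(c == 1 for c in counts[1:])


def pdn_adjacency(s, n: int) -> dict:
    """Assumed PDN construction: link node i to i + s_j and i - s_j (mod n)."""
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for step in s:
            if step % n == 0:
                continue  # the element 0 would only create a self-loop
            adj[i].add((i + step) % n)
            adj[i].add((i - step) % n)
    return adj


def diameter(adj: dict) -> int:
    """Largest shortest-path distance over all node pairs, via BFS."""
    worst = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph is disconnected")
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for s, n in (([0, 1, 3], 7), ([0, 1, 3, 9], 13)):
        print(s, n, is_perfect_difference_set(s, n), diameter(pdn_adjacency(s, n)))
```

The diameter of 2 follows directly from the set property: every nonzero node offset equals some difference s_i - s_j, which is at most one forward step and one backward step.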