Will Physical Scalability Sabotage Performance Gains?
, 1997
Cited by 65 (1 self)
Many designers expect processor performance to keep improving at the current rate indefinitely as feature sizes shrink. However, as wire delays become a larger percentage of overall signal delay and as clock speeds grow faster than transistor speed, I believe performance increases will ultimately fall off. These delays are inevitable simply because wires are not keeping pace with the scaling of other features. In fact, for CMOS processes below 0.25 micron, the physical limits of wire scaling may begin to change high-speed processor design. That is, an unacceptably small percentage of the die will be reachable during a single clock cycle. To support my prediction, I have mapped trends in a metric that relates time and distance and projections in clock speed across eight processor generations, from 0.6 to 0.06 micron. During this span (probably at 0.1 micron) we'll see a billion-transistor processor. To illustrate how physical scalability could affect the design of processors on this scale, I also compared signal drive distance and clock speed for the span endpoints, 0.6 and 0.06 micron.
BSP-Like External-Memory Computation
- IN PROC. 3RD ITALIAN CONFERENCE ON ALGORITHMS AND COMPLEXITY
Cited by 21 (0 self)
In this paper we present a paradigm for solving external-memory problems, and illustrate it by algorithms for matrix multiplication, sorting, list ranking, transitive closure and FFT. Our paradigm is based on the use of BSP algorithms. The correspondence is almost perfect, and especially the notion of x-optimality carries over to algorithms designed according to our paradigm. The advantages of the approach are similar to the advantages of BSP algorithms for parallel computing: scalability, portability, predictability. The performance measure here is the total work, not only the number of I/O operations as in previous approaches. The predicted performances are therefore more useful for practical applications.
Bandwidth-Optimal Complete Exchange on Wormhole-Routed 2D/3D Torus Networks: A Diagonal-Propagation Approach
, 1997
Cited by 20 (5 self)
All-to-all personalized communication, or complete exchange, is at the heart of numerous applications in parallel computing. Several complete exchange algorithms have been proposed in the literature for wormhole meshes. However, these algorithms, when applied to tori, cannot take advantage of wrap-around interconnections to implement complete exchange with reduced latency. In this paper, a new diagonal-propagation approach is proposed to develop a set of complete exchange algorithms for 2D and 3D tori. This approach exploits the symmetric interconnections of tori and allows the development of a communication schedule consisting of several contention-free phases. These algorithms are indirect in nature, and they use message combining to reduce the number of phases (message start-ups). It is shown that these algorithms effectively use the bisection bandwidth of a torus, which is twice that of an equal-sized mesh, to achieve complete exchange in time which is almost half of the best known complet...
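The factor-of-two bisection claim in the abstract above can be checked with a small count. The following is a minimal sketch, not the paper's algorithm: it assumes a k x k network with even k and counts the links crossing one balanced cut (the straight cut between the left and right column halves). The function name `bisection_links` and its parameters are hypothetical, chosen for illustration only.

```python
def bisection_links(k, wraparound):
    """Count links of a k x k mesh (or torus) that cross the cut
    separating columns 0..k/2-1 from columns k/2..k-1."""
    if k < 4 or k % 2 != 0:
        raise ValueError("sketch assumes even k >= 4")

    def in_left_half(col):
        return col < k // 2

    count = 0
    for row in range(k):
        for col in range(k):
            # Only horizontal links can cross a column cut; visit each
            # link once by looking at the right-hand neighbour only.
            right = col + 1
            if right == k:
                if not wraparound:
                    continue  # a mesh has no wrap-around link
                right = 0     # a torus wraps back to column 0
            if in_left_half(col) != in_left_half(right):
                count += 1
    return count


mesh_cut = bisection_links(8, wraparound=False)
torus_cut = bisection_links(8, wraparound=True)
```

Under these assumptions the torus cut contains the mesh's crossing links plus one wrap-around link per row, which is where the doubling comes from.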
Implementing the Hierarchical PRAM on the 2D Mesh: Analyses and Experiments
, 1995
Cited by 11 (2 self)
We investigate aspects of the performance of the EREW instance of the Hierarchical PRAM (H-PRAM) model, a recursively partitionable PRAM, on the 2D mesh architecture via analysis and simulation experiments. Since one of the ideas behind the H-PRAM is to systematically exploit locality in order to negate the need for expensive communication hardware and thus promote cost-effective scalability, our design decisions are based on minimizing implementation costs. The Peano indexing scheme is used as a simple and natural means of allowing the dynamic, recursive partitioning of the mesh into arbitrarily sized sub-meshes, as required by the H-PRAM. We show that for any sub-mesh the ratio of the largest Manhattan distance between two nodes of the sub-mesh to that of the square mesh with an identical number of processors is at most 3/2, thereby demonstrating the locality-preserving properties of the Peano scheme for arbitrary partitions of the mesh. We provide matching analytical and experimenta...
Processor-Time Tradeoffs under Bounded-Speed Message Propagation: Part I, Upper Bounds
- Theory of Computing Systems
, 1995
Cited by 11 (3 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to non-intuitive orchestrations of the simulation process. Dipart...
Incomplete k-ary n-cube and Its Derivatives
- J. Parallel and Distributed Computing
, 2004
Cited by 10 (7 self)
Incomplete or pruned k-ary n-cube, n ≥ 3, is derived as follows. All links of dimension n − 1 are left in place and links of the remaining n − 1 dimensions are removed, except for one, which is chosen periodically from the remaining dimensions along the intact dimension n − 1. This leads to a node degree of 4 instead of the original 2n and results in regular networks that are Cayley graphs, provided that n − 1 divides k. For n = 3 (n = 5), the preceding restriction is not problematic, as it only requires that k be even (a multiple of 4). In other cases, changes to the basis network to be pruned, or to the pruning algorithm, can mitigate the problem. Incomplete k-ary n-cube maintains a number of desirable topological properties of its unpruned counterpart despite having fewer links. It is maximally connected, has diameter and fault diameter very close to those of k-ary n-cube, and an average internode distance that is only slightly greater. Hence, the cost/performance tradeoffs offered by our pruning scheme can in fact lead to useful, and practically realizable, parallel architectures. We study pruned k-ary n-cubes in general and offer some additional results for the special case n = 3.
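The pruning rule described in the abstract above can be sketched in a few lines. This is an illustrative reading, not the authors' code: it assumes dimensions are numbered 0 to n − 1, that the last one is the intact dimension, and that "chosen periodically" means a node keeps the one extra dimension given by its last coordinate taken modulo n − 1. The names `pruned_kary_ncube` and `is_connected` are hypothetical.

```python
from collections import deque
from itertools import product


def pruned_kary_ncube(k, n):
    """Adjacency sets of a pruned k-ary n-cube under the assumed rule."""
    if n < 3 or k < 3 or k % (n - 1) != 0:
        raise ValueError("sketch assumes n >= 3, k >= 3, and (n - 1) divides k")
    adj = {}
    for node in product(range(k), repeat=n):
        # The last dimension is always kept. One further dimension is kept,
        # selected by the node's coordinate along the last dimension.
        kept = (n - 1, node[n - 1] % (n - 1))
        neighbours = set()
        for dim in kept:
            for step in (1, -1):
                other = list(node)
                other[dim] = (other[dim] + step) % k
                neighbours.add(tuple(other))
        adj[node] = neighbours
    return adj


def is_connected(adj):
    """Breadth-first search from an arbitrary node."""
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)


net = pruned_kary_ncube(k=4, n=3)
```

With k = 4 and n = 3 this gives 64 nodes, each with two neighbours along the last dimension and two along the one retained dimension, which matches the node degree of 4 stated in the abstract.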
Integer Sorting and Routing in Arrays with Reconfigurable Optical Buses
, 1996
Cited by 7 (0 self)
In this paper we present deterministic algorithms for integer sorting and on-line packet routing on arrays with reconfigurable optical buses. The main objective is to identify the mechanisms specific to this type of architecture that allow us to build efficient integer sorting, partial permutation routing and h-relations algorithms. The consequences of these results on the PRAM simulation complexity are also investigated. Keywords: Optical pipelined buses, reconfigurable array, sorting, routing. 1. Introduction In large-scale general purpose parallel machines based on connection networks, efficient communication capabilities are essential in order to solve most of the problems of interest in a timely manner. Interprocessor communication networks are often the main bottlenecks in parallel machines. One important limitation of these networks concerns the exclusive access to the bus resources, which limits throughput to a function of the end-to-end propagation time. Optical communicati...
Lower Bounds on Processor-Time Tradeoffs under Bounded-Speed Message Propagation
, 1995
Cited by 6 (0 self)
Upper bounds are derived for the processor-time tradeoffs of machines such as linear arrays and two-dimensional meshes, which are compatible with the physical limitation expressed by bounded-speed propagation of messages (due to the finiteness of the speed of light). It is shown that parallelism and locality combined may yield speedups superlinear in the number of processors. The speedups are inherent, due to the optimality of the obtained tradeoffs as established in a companion paper. Simulations are developed of multiprocessor machines by analogous machines with fewer processors. A crucial role is played by the hierarchical nature of the memory system. A divide-and-conquer technique for hierarchical memories is developed, based on the graph-theoretic notion of topological separator. For multiprocessors, this technique also requires a careful balance of memory access and interprocessor communication costs, which leads to non-intuitive orchestrations of the simulation process.
Augmented Ring Networks
- J. MATH. MODELLING AND SCIENTIFIC COMPUTING
, 1996
Cited by 5 (1 self)
We study three augmentations of ring networks that are intended to decrease a ring's diameter significantly while increasing its structural complexity only modestly. Chordal rings enhance a ring network by adding noncrossing "shortcut" edges, which can be viewed as chords of the ring. Express rings are chordal rings whose chords are oriented either clockwise or counterclockwise, allowing them to be viewed as (noncrossing) arcs of the ring. Multi-rings append subsidiary rings to edges of a ring and, recursively, to edges of appended subrings. Important measures of structural complexity are: the cutwidth of an express ring, viz., the maximum number of arcs that cross "above" any ring edge (counting the edge itself); the depth of a multi-ring, viz., the level of recursive appending of subsidiary subrings. Our first result demonstrates the topological equivalence of these three modes of augmentation: for each augmented ring of one type, there are (graph-theoretically) isomorphic ...
Performance, algorithmic, and robustness attributes of perfect difference networks
- IEEE Trans. Parallel Distributed Systems
, 2005
Cited by 4 (2 self)
Perfect difference networks (PDNs) that are based on the mathematical notion of perfect difference sets have been shown to comprise an asymptotically optimal method for connecting a number of nodes into a network with diameter 2. Justifications for, and mathematical underpinning of, PDNs appear in a companion paper. In this paper, we compare PDNs and some of their derivatives to interconnection networks with similar cost/performance, including certain generalized hypercubes and their hierarchical variants. Additionally, we discuss point-to-point and collective communication algorithms and derive a general emulation result that relates the performance of PDNs to that of complete networks as ideal benchmarks. We show that PDNs are quite robust, both with regard to node and link failures that can be tolerated and in terms of blandness (not having weak spots). In particular, we prove that the fault diameter of PDNs is no greater than 4. Finally, we study the complexity and scalability aspects of these networks, concluding that PDNs and their derivatives allow the construction of very low diameter networks close to any arbitrary desired size and that, in many respects, PDNs offer optimal performance and fault tolerance relative to their complexity or implementation cost.
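The diameter-2 property mentioned in the abstract above can be illustrated on a small case. The sketch below is not taken from the paper. It assumes the usual definition of a perfect difference set (every nonzero residue mod n arises exactly once as a difference of two set elements) and assumes that node i is linked to i ± s (mod n) for each nonzero set element s. The set {0, 1, 3, 9} mod 13 and all function names are illustrative choices.

```python
from collections import deque


def is_perfect_difference_set(s, n):
    """True if each nonzero residue mod n is a difference of two
    distinct elements of s in exactly one way."""
    diffs = [(a - b) % n for a in s for b in s if a != b]
    return sorted(diffs) == list(range(1, n))


def pdn_neighbors(i, s, n):
    """Neighbours of node i under the assumed construction."""
    return {(i + sign * x) % n for x in s if x % n != 0 for sign in (1, -1)}


def diameter_from_zero(s, n):
    """Largest breadth-first distance from node 0. The network is a
    circulant graph, so every node sees the same distances."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in pdn_neighbors(u, s, n):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())


pds, n = [0, 1, 3, 9], 13
diam = diameter_from_zero(pds, n)
```

The reason two hops suffice under these assumptions is that any nonzero offset equals some difference a − b of set elements, so a node can be reached by one step of +a followed by one step of −b (or in a single step when b is 0).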

