Results 1 - 10 of 26
The Network Architecture of the Connection Machine CM-5
Journal of Parallel and Distributed Computing, 1992
Cited by 77 (2 self)
The Connection Machine Model CM-5 Supercomputer is a massively parallel computer system designed to offer performance in the range of 1 teraflops (10^12 floating-point operations per second). The CM-5 obtains its high performance while offering ease of programming, flexibility, and reliability. The machine contains three communication networks: a data network, a control network, and a diagnostic network. This paper describes the organization of these three networks and how they contribute to the design goals of the CM-5.
1 Introduction
In the design of a parallel computer, the engineering principle of economy of mechanism suggests that the machine should employ only a single communication network to convey information among the processors in the system. Indeed, many parallel computers contain only a single network: typically, a hypercube or a mesh. The Connection Machine Model CM-5 Supercomputer has three networks, however, and none is a hypercube or a mesh. This paper describes the...
Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation?
1999
Cited by 42 (11 self)
There has been a great deal of interest recently in the development of general-purpose bridging models for parallel computation. Models such as the BSP and LogP have been proposed as more realistic alternatives to the widely used PRAM model. The BSP and LogP models imply a rather different style for designing algorithms when compared with the PRAM model. Indeed, while many consider data parallelism as a convenient style, and the shared-memory abstraction as an easy-to-use platform, the bandwidth limitations of current machines have diverted much attention to message-passing and distributed-memory models (such as the BSP and LogP) that account more properly for these limitations. In this paper we consider the question of whether a shared-memory model can serve as an effective bridging model for parallel computation. In particular, can a shared-memory model be as effective as, say, the BSP? As a candidate for a bridging model, we introduce the Queuing Shared-Memory (QSM) model, which accounts for limited communication bandwidth while still providing a simple shared-memory abstraction. We substantiate the ability of the QSM to serve as a bridging model by providing a simple work-preserving emulation of the QSM on both the BSP, and on a related model, the (d, x)-BSP. We present evidence that the features of the QSM are essential to its effectiveness as a bridging model. In addition, we describe scenarios ...
Horizons of Parallel Computation
Journal of Parallel and Distributed Computing, 1993
Cited by 39 (3 self)
This paper considers the ultimate impact of fundamental physical limitations, notably speed of light and device size, on parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the consequences of the accomplished attainment of the physical bounds. The main result is that scalability holds only for neighborly interconnections, such as the square mesh, of bounded-size synchronous modules, presumably of the area-universal type. We also discuss the ultimate infeasibility of latency-hiding, the violation of intuitive maximal speedups, and the emerging novel processor-time tradeoffs.
Towards Efficiency and Portability: Programming with the BSP Model
In Proc. 8th ACM Symp. on Parallel Algorithms and Architectures, 1996
Cited by 34 (3 self)
The Bulk-Synchronous Parallel (BSP) model was proposed by Valiant as a model for general-purpose parallel computation. The objective of the model is to allow the design of parallel programs that can be executed efficiently on a variety of architectures. While many theoretical arguments in support of the BSP model have been presented, the degree to which the model can be efficiently utilized on existing parallel machines remains unclear. To explore this question, we implemented a small library of BSP functions, called the Green BSP library, on several parallel platforms. We also created a number of parallel applications based on this library. Here, we report on the performance of six of these applications on three different parallel platforms. Our preliminary results suggest that the BSP model can be used to develop efficient and portable programs for a range of machines and applications.
k-ary n-trees: High Performance Networks for Massively Parallel Architectures
In Proceedings of the 11th International Parallel Processing Symposium, IPPS'97, 1997
Cited by 27 (8 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we study the communication performance of a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of an adaptive algorithm that utilize wormhole routing with one, two and four virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy. In all these cases, the saturation points are between 35-40% of the network capacity with one virtual channel, 55-60% with two virtual channels and around 75% with four virtual channels. The complement traffic, a representative of the class of the congestion-free communication patterns, reaches an optimal performance, with a saturation point at 97% of the capacity for all flow control strategies.
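The abstract above names the bit reversal, transpose and complement traffic patterns without defining them. The following is a minimal Python sketch under the commonly used textbook definitions, which are an assumption here and not taken from the paper; the function names and the 8-bit address width (256 nodes, as in the 4-ary 4-tree) are chosen for illustration only.

```python
def bit_reversal(src: int, bits: int) -> int:
    """Destination address is the source address with its bits reversed."""
    dst = 0
    for i in range(bits):
        if src & (1 << i):
            dst |= 1 << (bits - 1 - i)
    return dst


def transpose(src: int, bits: int) -> int:
    """Destination swaps the upper and lower halves of the address (bits even)."""
    half = bits // 2
    low = src & ((1 << half) - 1)
    high = src >> half
    return (low << half) | high


def complement(src: int, bits: int) -> int:
    """Destination is the bitwise complement of the source address."""
    return ~src & ((1 << bits) - 1)


# Assumed address width: 4**4 = 256 nodes need 8 bits.
BITS = 8
```

Each function is a permutation of the 256 node addresses, so every node sends to exactly one destination and receives from exactly one source.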
The Fat-Pyramid and Universal Parallel Computation Independent of Wire Delay
IEEE Transactions on Computers, 1994
Cited by 20 (4 self)
This paper shows that a fat-pyramid of area Θ(A) requires only O(log A) slowdown to simulate any competing network of area A under very general conditions. The result holds regardless of the processor size (amount of attached memory) and number of processors in the competing network as long as the limitation on total area is met. Furthermore, the result is valid regardless of the relationship between wire length and wire delay. We especially focus on elimination of the common simplifying assumption that unit time suffices to traverse a wire regardless of its length, since the assumption becomes more and more untenable as the size of parallel systems increases. This paper concentrates on simulation using transmission lines (wires along which bits can be pipelined) with the message routing schedule set up off line, but it also discusses the extension to on-line simulation. This paper also examines the capabilities of a fat-pyramid when matched against a substantially larger network ...
Modeling parallel bandwidth: Local vs. global restrictions
Cited by 15 (4 self)
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LogP) account for bandwidth limitations using a per-processor parameter g > 1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter m < p, such that the p processors can send at most m messages in total at each step. This paper provides the first detailed study of the algorithmic implications of modeling parallel bandwidth as a per-processor (local) limitation versus an aggregate (global) limitation. We consider a number of basic problems ...
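To make the local versus global distinction in this abstract concrete, here is a minimal Python sketch. It is a simplification under stated assumptions and not the paper's formal model: only sends are counted, `local_time` charges g·h where h is the largest number of messages sent by any one processor, and `global_steps` gives only the lower bound implied by the aggregate limit of m messages per step. Both function names are hypothetical.

```python
import math


def local_time(sent_per_proc, g):
    """Per-processor limit: a processor sending h messages needs g*h time.

    Assumption: h is taken as the maximum number of messages sent by any
    single processor; receives are ignored in this sketch.
    """
    h = max(sent_per_proc)
    return g * h


def global_steps(sent_per_proc, m):
    """Aggregate limit: at most m messages in total can be sent per step.

    Returns the lower bound on steps that follows from this limit alone.
    """
    total = sum(sent_per_proc)
    return math.ceil(total / m)


# Illustrative numbers: p = 8 processors, g = 4, and m = p / g = 2.
unbalanced = [8, 0, 0, 0, 0, 0, 0, 0]   # one processor sends everything
balanced = [1] * 8                       # every processor sends one message
```

With these illustrative numbers, the balanced pattern costs the same under both bounds, while the unbalanced pattern is penalized only by the per-processor limit.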
The Design and Analysis of Bulk-Synchronous Parallel Algorithms
1998
Cited by 10 (1 self)
The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality. The BSPRAM model can be optimally simulated by a BSP computer for a broad range of algorithms possessing certain characteristic properties: obliviousness, slackness, granularity. We use BSPRAM to design BSP algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation. Some of the presented algorithms are adapted from known BSP algorithms (butterfly dag computation, cube dag computation, matrix multiplication). Other algorithms are obtained by application of established non-BSP techniques (sorting, randomised list contraction, Gaussian elimination without pivoting and with column pivoting, algebraic path computation), or use original techniques specific to the BSP model (deterministic list contraction, Gaussian elimination with nested block pivoting, communication-efficient multiplication of Boolean matrices, synchronisation-efficient shortest paths computation). The asymptotic BSP cost of each algorithm is established, along with its BSPRAM characteristics. We conclude by outlining some directions for future research.
Performance Analysis of Wormhole Routed k-ary n-trees
1998
Cited by 10 (5 self)
The past few years have seen a rise in popularity of massively parallel architectures that use fat-trees as their interconnection networks. In this paper we formalize a parametric family of fat-trees, the k-ary n-trees, built with constant arity switches interconnected in a regular topology. A simple adaptive routing algorithm for k-ary n-trees sends each message to one of the nearest common ancestors of both source and destination, choosing the less loaded physical channels, and then reaches the destination following the unique available path. Through simulation on a 4-ary 4-tree with 256 nodes, we analyze some variants of the adaptive algorithm that utilize wormhole routing with 1, 2 and 4 virtual channels. The experimental results show that the uniform, bit reversal and transpose traffic patterns are very sensitive to the flow control strategy.
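The ascending phase of the routing scheme described in this abstract can be illustrated with a small Python sketch. It assumes a labeling that the abstract does not spell out: each of the k^n processing nodes is identified by n base-k digits, and two nodes that share their first j digits lie in a common subtree n - j switch levels high. The function names are hypothetical, and the choice among the less loaded channels is not modeled.

```python
def to_digits(node: int, k: int, n: int) -> list:
    """Return the n base-k digits of a node label, most significant first."""
    digits = []
    for _ in range(n):
        digits.append(node % k)
        node //= k
    return digits[::-1]


def ascent_levels(src: int, dst: int, k: int, n: int) -> int:
    """Switch levels a message climbs to reach a nearest common ancestor.

    Assumption: nodes sharing a prefix of j digits meet n - j levels up.
    """
    common = 0
    for a, b in zip(to_digits(src, k, n), to_digits(dst, k, n)):
        if a != b:
            break
        common += 1
    return n - common
```

For the 4-ary 4-tree of the abstract, `ascent_levels(0, 1, 4, 4)` is 1 because the two nodes differ only in the last digit, while `ascent_levels(0, 255, 4, 4)` is 4 because they differ in the first digit and the message must climb through all four switch levels.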