Results 1  10
of
149
Efficient Algorithms for AlltoAll Communications in MultiPort MessagePassing Systems
 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS
, 1997
"... We present efficient algorithms for two alltoall communication operations in messagepassing systems: index (or alltoall personalized communication) and concatenation (or alltoall broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any pointto ..."
Abstract

Cited by 83 (0 self)
 Add to MetaCart
We present efficient algorithms for two alltoall communication operations in messagepassing systems: index (or alltoall personalized communication) and concatenation (or alltoall broadcast). We assume a model of a fully connected messagepassing system, in which the performance of any pointtopoint communication is independent of the senderreceiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication startup time, and on the communication bandwidth. In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i th block of processor j with the j th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a tradeoff between the communication startup time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the startup time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP1 parallel system. In the concatenation operation, among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors, and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.
Special Purpose Parallel Computing
 Lectures on Parallel Computation
, 1993
"... A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing ..."
Abstract

Cited by 77 (5 self)
 Add to MetaCart
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given. 1. Introduction Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose storedprogram sequential computer which captured the fundamental principles of...
Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers
 Journal of Computer and System Sciences
, 1996
"... This paper presents a deterministic sorting algorithm, called Sharesort, that sorts n records on an nprocessor hypercube, shuffleexchange, or cubeconnected cycles in O(log n (log log n) 2 ) time in the worst case. The algorithm requires only a constant amount of storage at each processor. Th ..."
Abstract

Cited by 67 (10 self)
 Add to MetaCart
This paper presents a deterministic sorting algorithm, called Sharesort, that sorts n records on an nprocessor hypercube, shuffleexchange, or cubeconnected cycles in O(log n (log log n) 2 ) time in the worst case. The algorithm requires only a constant amount of storage at each processor. The fastest previous deterministic algorithm for this problem was Batcher's bitonic sort, which runs in O(log 2 n) time. Supported by an NSERC postdoctoral fellowship, and DARPA contracts N0001487K825 and N00014 89J1988. 1 Introduction Given n records distributed uniformly over the n processors of some fixed interconnection network, the sorting problem is to route the record with the ith largest associated key to processor i, 0 i ! n. One of the earliest parallel sorting algorithms is Batcher's bitonic sort [3], which runs in O(log 2 n) time on the hypercube [10], shuffleexchange [17], and cubeconnected cycles [14]. More recently, Leighton [9] exhibited a boundeddegree,...
Powerlist: a structure for parallel recursion
 ACM Transactions on Programming Languages and Systems
, 1994
"... Many data parallel algorithms – Fast Fourier Transform, Batcher’s sorting schemes and prefixsum – exhibit recursive structure. We propose a data structure, powerlist, that permits succinct descriptions of such algorithms, highlighting the roles of both parallelism and recursion. Simple algebraic pro ..."
Abstract

Cited by 59 (2 self)
 Add to MetaCart
Many data parallel algorithms – Fast Fourier Transform, Batcher’s sorting schemes and prefixsum – exhibit recursive structure. We propose a data structure, powerlist, that permits succinct descriptions of such algorithms, highlighting the roles of both parallelism and recursion. Simple algebraic properties of this data structure can be exploited to derive properties of these algorithms and establish equivalence of different algorithms that solve the same problem.
Wavelength Division Multiple Access Channel Hypercube Processor Interconnection
 IEEE Transactions on Computers
, 1992
"... A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. ..."
Abstract

Cited by 43 (18 self)
 Add to MetaCart
A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. The system capitalizes of the selfrouting characteristic of wavelength division multiple access to improve performance and reduce complexity. A hypercube based structure is introduced, where optical multiple access channels span the dimensional axes. This severely reduces the required degree, since only one I/O port is required per dimension. However, good performance is maintained through the high capacity characteristics of optical communication. The reduction in degree is shown to have significant system complexity implications. Four starcoupled configurations are studied as the basis for the optical multiple access channels, three of which exhibit the optical selfrouting characterist...
Horizons of Parallel Computation
 JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING
, 1993
"... This paper considers the ultimate impact of fundamental physical limitationsnotably, speed of light and device sizeon parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
This paper considers the ultimate impact of fundamental physical limitationsnotably, speed of light and device sizeon parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the consequences of the accomplished attainment of the physical bounds. The main result is that scalability holds only for neighborly interconnections, such as the square mesh, of boundedsize synchronous modules, presumably of the areauniversal type. We also discuss the ultimate infeasibility of latencyhiding, the violation of intuitive maximal speedups, and the emerging novel processortime tradeoffs.
Interval Routing Schemes
, 1998
"... Interval routing was introduced to reduce the size of routing tables: a router finds the direction where to forward a message by determining which interval contains the destination address of the message, each interval being associated to one particular direction. This way of implementing a routin ..."
Abstract

Cited by 30 (7 self)
 Add to MetaCart
Interval routing was introduced to reduce the size of routing tables: a router finds the direction where to forward a message by determining which interval contains the destination address of the message, each interval being associated to one particular direction. This way of implementing a routing function is quite attractive but very little is known about the topological properties that must satisfy a network to support an interval routing function with particular constraints (shortest paths, limited number of intervals associated to each direction, etc.). In this paper we investigate the study of the interval routing functions. In particular, we characterize the set of networks which support a linear or a linear strict interval routing function with only one interval per direction. We also derive practical tools to measure the efficiency of an interval routing function (number of intervals, length of the paths, etc.), and we describe large classes of networks which support optimal (linear) interval routing functions. Finally, we derive the main properties satisfied by the popular networks used to interconnect processors in a distributed memory parallel computer.
Packet Routing In FixedConnection Networks: A Survey
, 1998
"... We survey routing problems on fixedconnection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, krelation routing, routing to random destinations, dynamic routing, isotonic routing ..."
Abstract

Cited by 29 (3 self)
 Add to MetaCart
We survey routing problems on fixedconnection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, krelation routing, routing to random destinations, dynamic routing, isotonic routing, fault tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
On the Diameter of Finite Groups
 SYMPOSIUM ON FOUNDATIONS OF COMPUTER SCIENCE
, 1990
"... The diameter of a group G with respect to a set S of generators is the maximum over g 2 G of the length of the shortest word in S [ S 1 representing g. This concept arises in the contexts of efficient communication networks and Rubik's cube type puzzles. "Best" generators (giving minimum diameter wh ..."
Abstract

Cited by 29 (4 self)
 Add to MetaCart
The diameter of a group G with respect to a set S of generators is the maximum over g 2 G of the length of the shortest word in S [ S 1 representing g. This concept arises in the contexts of efficient communication networks and Rubik's cube type puzzles. "Best" generators (giving minimum diameter while keeping the number of generators limited) are pertinent to networks, "worst" and "average" generators seem a more adequate model for puzzles. We survey a substantial body of recent work by the authors on these subjects. Regarding the "best" case, we show that while the structure of the group is essentially irrelevant if S is allowed to exceed (log G) 1+c (c > 0), it plays a heavy role when jSj = O(1). In particular, every nonabelian nite simple group has a set of 7 generators giving logarithmic diameter. This cannot happen for groups with an abelian subgroup of bounded index. { Regarding the worst case, we are concerned primarily with permutation groups of degree n and obtain a tight exp((n ln n) 1=2 (1 + o(1))) upper bound. In the average case, the upper bound improves to exp((ln n) 2 (1 + o(1))). As a rst step toward extending this result to simple groups other than An , we establish that almost every pair of elements of a classical simple group G generates G, a result previously proved by J. Dixon for An . In the limited space of this article, we try to illuminate some of the basic underlying techniques.
Systematic Efficient Parallelization of Scan and Other List Homomorphisms
 In Annual European Conference on Parallel Processing, LNCS 1124
, 1996
"... Homomorphisms are functions which can be parallelized by the divideandconquer paradigm. A class of distributable homomorphisms (DH) is introduced and an efficient parallel implementation schema for all functions of the class is derived by transformations in the BirdMeertens formalism. The schema ..."
Abstract

Cited by 26 (7 self)
 Add to MetaCart
Homomorphisms are functions which can be parallelized by the divideandconquer paradigm. A class of distributable homomorphisms (DH) is introduced and an efficient parallel implementation schema for all functions of the class is derived by transformations in the BirdMeertens formalism. The schema can be directly mapped on the hypercube with an unlimited or an arbitrary fixed number of processors, providing provable correctness and predictable performance. The popular scanfunction (parallel prefix) illustrates the presentation: the systematically derived implementation for scan coincides with the practically used "folklore" algorithm for distributedmemory machines.