Special Purpose Parallel Computing
Lectures on Parallel Computation, 1993
Cited by 77 (5 self)
A vast amount of work has been done in recent years on the design, analysis, implementation and verification of special purpose parallel computing systems. This paper presents a survey of various aspects of this work. A long, but by no means complete, bibliography is given.

1. Introduction

Turing [365] demonstrated that, in principle, a single general purpose sequential machine could be designed which would be capable of efficiently performing any computation which could be performed by a special purpose sequential machine. The importance of this universality result for subsequent practical developments in computing cannot be overstated. It showed that, for a given computational problem, the additional efficiency advantages which could be gained by designing a special purpose sequential machine for that problem would not be great. Around 1944, von Neumann produced a proposal [66, 389] for a general purpose stored-program sequential computer which captured the fundamental principles of...
Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers
Journal of Computer and System Sciences, 1996
Cited by 67 (10 self)
This paper presents a deterministic sorting algorithm, called Sharesort, that sorts n records on an n-processor hypercube, shuffle-exchange, or cube-connected cycles in $O(\log n (\log \log n)^2)$ time in the worst case. The algorithm requires only a constant amount of storage at each processor. The fastest previous deterministic algorithm for this problem was Batcher's bitonic sort, which runs in $O(\log^2 n)$ time. Supported by an NSERC postdoctoral fellowship, and DARPA contracts N00014-87-K-825 and N00014-89-J-1988.

1 Introduction

Given n records distributed uniformly over the n processors of some fixed interconnection network, the sorting problem is to route the record with the ith largest associated key to processor i, $0 \le i < n$. One of the earliest parallel sorting algorithms is Batcher's bitonic sort [3], which runs in $O(\log^2 n)$ time on the hypercube [10], shuffle-exchange [17], and cube-connected cycles [14]. More recently, Leighton [9] exhibited a bounded-degree,...
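The baseline named in this abstract, Batcher's bitonic sort, can be sketched as a sequential simulation. This is not Sharesort. It is a minimal illustration that assumes one key per processor and a power-of-two processor count; the function name `bitonic_sort` and the round counter are illustrative choices, not taken from the paper.

```python
def bitonic_sort(keys):
    """Simulate Batcher's bitonic sort with one key per processor.

    Returns the sorted keys and the number of compare-exchange rounds.
    In every round, processor i is paired with i ^ j, which differs from i
    in exactly one bit, i.e. a hypercube neighbour.
    """
    a = list(keys)
    n = len(a)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    rounds = 0
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared processors
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            rounds += 1
            j //= 2
        k *= 2
    return a, rounds


if __name__ == "__main__":
    print(bitonic_sort([5, 3, 8, 1, 9, 2, 7, 4]))
```

For n keys the simulation performs (log n)(log n + 1)/2 rounds, which is where the $O(\log^2 n)$ bound quoted above comes from.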
Efficient Algorithms for All-to-All Communications in Multi-Port Message-Passing Systems
IEEE Transactions on Parallel and Distributed Systems, 1997
Cited by 60 (0 self)
We present efficient algorithms for two all-to-all communication operations in message-passing systems: index (or all-to-all personalized communication) and concatenation (or all-to-all broadcast). We assume a model of a fully connected message-passing system, in which the performance of any point-to-point communication is independent of the sender-receiver pair. We also assume that each processor has k ≥ 1 ports, through which it can send and receive k messages in every communication round. The complexity measures we use are independent of the particular system topology and are based on the communication start-up time, and on the communication bandwidth.

In the index operation among n processors, initially, each processor has n blocks of data, and the goal is to exchange the i-th block of processor j with the j-th block of processor i. We present a class of index algorithms that is designed for all values of n and that features a trade-off between the communication start-up time and the data transfer time. This class of algorithms includes two special cases: an algorithm that is optimal with respect to the measure of the start-up time, and an algorithm that is optimal with respect to the measure of the data transfer time. We also present experimental results featuring the performance tuneability of our index algorithms on the IBM SP-1 parallel system.

In the concatenation operation, among n processors, initially, each processor has one block of data, and the goal is to concatenate the n blocks of data from the n processors, and to make the concatenation result known to all the processors. We present a concatenation algorithm that is optimal, for most values of n, in the number of communication rounds and in the amount of data transferred.

Index Terms: All-to-all broadcast, all-to-all personalized communication, complete exchange, concatenation operation, distributed-memory system, index operation, message-passing system, multiscatter/gather, parallel system.
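The two operations defined in this abstract can be made concrete with a small simulation. The sketch below assumes a single port per processor (k = 1) and uses a naive direct exchange in which processor i sends to processor (i + r) mod n in round r. It illustrates what the operations compute, not the paper's algorithms; the function names are hypothetical.

```python
def index_direct(blocks):
    """Index (all-to-all personalized) operation by naive direct exchange.

    blocks[i][j] is the j-th block initially held by processor i.
    Afterwards processor i holds, in slot j, the i-th block of processor j.
    Returns the final layout and the number of communication rounds.
    """
    n = len(blocks)
    result = [[None] * n for _ in range(n)]
    for i in range(n):
        result[i][i] = blocks[i][i]          # the diagonal block stays put
    rounds = 0
    for r in range(1, n):
        for i in range(n):
            dest = (i + r) % n               # one send per processor per round
            result[dest][i] = blocks[i][dest]
        rounds += 1
    return result, rounds


def concatenation(blocks):
    """Concatenation (all-to-all broadcast): every processor ends up with
    the list of all n initial blocks. Only the result is modelled here."""
    return [list(blocks) for _ in blocks]


if __name__ == "__main__":
    data = [[(p, b) for b in range(3)] for p in range(3)]
    print(index_direct(data))
    print(concatenation(["a", "b", "c"]))
```

The direct exchange uses n - 1 rounds. The abstract's trade-off between start-up time and data transfer time concerns schemes that use fewer rounds at the cost of forwarding more data per round.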
Powerlist: a structure for parallel recursion
ACM Transactions on Programming Languages and Systems, 1994
Cited by 55 (2 self)
Many data parallel algorithms – Fast Fourier Transform, Batcher’s sorting schemes and prefix sum – exhibit recursive structure. We propose a data structure, powerlist, that permits succinct descriptions of such algorithms, highlighting the roles of both parallelism and recursion. Simple algebraic properties of this data structure can be exploited to derive properties of these algorithms and establish equivalence of different algorithms that solve the same problem.
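To illustrate the kind of recursive description the abstract refers to, here is a minimal sketch of prefix sum over a power-of-two-length list, built on a zip-style (interleaving) decomposition. The abstract does not spell out the constructors, so the helper names `unzip`, `interleave` and `prefix_sum`, and the particular recursion, are assumptions based on one common formulation.

```python
from operator import add


def unzip(pl):
    """Split a list into its even-indexed and odd-indexed elements."""
    return pl[0::2], pl[1::2]


def interleave(p, q):
    """Inverse of unzip: alternate the elements of p and q."""
    out = []
    for a, b in zip(p, q):
        out.extend((a, b))
    return out


def prefix_sum(pl, op=add, identity=0):
    """Inclusive prefix sum of a list whose length is a power of two.

    With pl = interleave(p, q): combine p and q pointwise, recurse on the
    half-length list to obtain the odd positions t, then shift t right by
    one and combine it with p to obtain the even positions.
    """
    n = len(pl)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    if n == 1:
        return list(pl)
    p, q = unzip(pl)
    t = prefix_sum([op(a, b) for a, b in zip(p, q)], op, identity)
    shifted = [identity] + t[:-1]
    evens = [op(s, a) for s, a in zip(shifted, p)]
    return interleave(evens, t)


if __name__ == "__main__":
    print(prefix_sum([1, 2, 3, 4, 5, 6, 7, 8]))
```

The pointwise steps at each level are independent of one another, which is the source of parallelism; the recursion depth is log n.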
Wavelength Division Multiple Access Channel Hypercube Processor Interconnection
IEEE Transactions on Computers, 1992
Cited by 43 (18 self)
A multiprocessor system with a large number of nodes can be built at low cost by combining the recent advances in high capacity channels available through optical fiber communication. A highly fault tolerant system is created with good performance characteristics at a reduction in system complexity. The system capitalizes on the self-routing characteristic of wavelength division multiple access to improve performance and reduce complexity. A hypercube based structure is introduced, where optical multiple access channels span the dimensional axes. This severely reduces the required degree, since only one I/O port is required per dimension. However, good performance is maintained through the high capacity characteristics of optical communication. The reduction in degree is shown to have significant system complexity implications. Four star-coupled configurations are studied as the basis for the optical multiple access channels, three of which exhibit the optical self-routing characterist...
Horizons of Parallel Computation
Journal of Parallel and Distributed Computing, 1993
Cited by 36 (3 self)
This paper considers the ultimate impact of fundamental physical limitations (notably, speed of light and device size) on parallel computing machines. Although we fully expect an innovative and very gradual evolution to the limiting situation, we take here the provocative view of exploring the consequences of the accomplished attainment of the physical bounds. The main result is that scalability holds only for neighborly interconnections, such as the square mesh, of bounded-size synchronous modules, presumably of the area-universal type. We also discuss the ultimate infeasibility of latency hiding, the violation of intuitive maximal speedups, and the emerging novel processor-time tradeoffs.
Universal Routing Schemes
Journal of Distributed Computing, 1997
Cited by 30 (6 self)
In this paper, we deal with the compact routing problem, that is implementing routing schemes that use a minimum memory size on each router. A universal routing scheme is a scheme that applies to all n-node networks. In [31], Peleg and Upfal showed that one can not implement a universal routing scheme with less than a total of $\Omega(n^{1+1/(2s+4)})$ memory bits for any given stretch factor $s \ge 1$. We improve this bound for stretch factors $s$, $1 \le s < 2$, by proving that any near-shortest path universal routing scheme uses a total of $\Omega(n^2)$ memory bits in the worst-case. This result is obtained by counting the minimum number of routing functions necessary to route on all n-node networks. Moreover, and more fundamentally, we give a tight bound of $\Theta(n \log n)$ bits for the local minimum memory requirement of universal routing scheme of stretch factors $s$, $1 \le s < 2$. More precisely, for any fixed constant $\varepsilon$, $0 < \varepsilon < 1$, there exists an n-node network G on which at least \O...
Packet Routing In Fixed-Connection Networks: A Survey
1998
Cited by 26 (3 self)
We survey routing problems on fixed-connection networks. We consider many aspects of the routing problem and provide known theoretical results for various communication models. We focus on (partial) permutation, k-relation routing, routing to random destinations, dynamic routing, isotonic routing, fault tolerant routing, and related sorting results. We also provide a list of unsolved problems and numerous references.
Systematic Efficient Parallelization of Scan and Other List Homomorphisms
In Annual European Conference on Parallel Processing, LNCS 1124, 1996
Cited by 25 (7 self)
Homomorphisms are functions which can be parallelized by the divide-and-conquer paradigm. A class of distributable homomorphisms (DH) is introduced and an efficient parallel implementation schema for all functions of the class is derived by transformations in the Bird-Meertens formalism. The schema can be directly mapped on the hypercube with an unlimited or an arbitrary fixed number of processors, providing provable correctness and predictable performance. The popular scan-function (parallel prefix) illustrates the presentation: the systematically derived implementation for scan coincides with the practically used "folklore" algorithm for distributed-memory machines.
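The hypercube scan mentioned at the end of this abstract is commonly described as a butterfly-style exchange. The sketch below simulates that commonly described scheme with one value per processor. Whether it matches the paper's derived implementation in every detail is an assumption, and the name `hypercube_scan` is illustrative.

```python
from operator import add


def hypercube_scan(values, op=add):
    """Simulate an inclusive scan on a hypercube, one value per processor.

    Each processor keeps two numbers: `prefix`, its scan result within the
    subcube processed so far, and `total`, the reduction of that subcube.
    In the round for dimension bit d, partners i and i ^ d swap totals;
    the processor with the higher index also folds the partner's total
    into its prefix.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a positive power of two")
    prefix = list(values)
    total = list(values)
    d = 1
    while d < n:
        new_prefix, new_total = list(prefix), list(total)
        for i in range(n):
            partner = i ^ d
            if partner < i:
                # Partner's subcube precedes ours, so it goes on the left.
                new_prefix[i] = op(total[partner], prefix[i])
                new_total[i] = op(total[partner], total[i])
            else:
                new_total[i] = op(total[i], total[partner])
        prefix, total = new_prefix, new_total
        d *= 2
    return prefix


if __name__ == "__main__":
    print(hypercube_scan([1, 2, 3, 4, 5, 6, 7, 8]))
```

The simulation takes log n rounds, and operand order is preserved, so any associative operator works, including non-commutative ones such as string concatenation.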
Optimal Distributed Algorithms in Unlabelled Tori and Chordal Rings
1996
Cited by 25 (12 self)
We study the message complexity of distributed algorithms in Tori and Chordal Rings when the communication links are unlabelled, which implies that the processors do not have "Sense of Direction". We introduce the paradigm of handrail which allows messages to travel with a consistent direction. We give a distributed algorithm which confirms the conjecture that the Leader Election problem for unlabelled Tori of N processors can be solved using $\Theta(N)$ messages instead of $O(N \log N)$. Using the same handrail paradigm, we solve the Election problem using $\Theta(N)$ messages in unlabelled chordal rings with one chord (of length approximately $\sqrt{N}$). This solves a long-standing open problem of the minimal number of unlabelled chords required to decrease the $O(N \log N)$ message complexity. For each topology, we give an algorithm to compute the Sense of Direction in $\Theta(N)$ messages (improving the $O(N \log N)$ previous results). This proves the more fundamental result that any global...

