## Programming Parallel Algorithms (1996)

### Download Links

- [www.cs.cmu.edu]
- [www.classes.cs.uchicago.edu]
- [www.math.tau.ac.il]
- [www.uni-paderborn.de]
- DBLP

### Other Repositories/Bibliography

Venue: Communications of the ACM

Citations: 191 (9 self)

### BibTeX

```bibtex
@article{Blelloch96programmingparallel,
  author  = {Guy E. Blelloch},
  title   = {Programming Parallel Algorithms},
  journal = {Communications of the ACM},
  year    = {1996},
  volume  = {39},
  pages   = {85--97}
}
```

### Abstract

In the past 20 years there have been a huge number of algorithms designed for parallel computers, most of which have been designed for one of the variants of the Parallel Random Access Machine (PRAM) model. Unfortunately, there has been limited progress in getting practical implementations of these algorithms on any real parallel machine. Although discrepancies between the PRAM model and actual implementations of parallel machines (particularly as regards communication costs) have played a part in this lack of progress, another significant problem is the lack of good programming languages. With the languages that come with existing parallel machines it can be a major project to implement a simple algorithm, and once implemented the code is unlikely to port to any other parallel machine. This paper describes a data-parallel language, Nesl, designed for programming parallel algorithms. Nesl currently runs on the Connection Machine CM-2 and the Cray Y-MP, and generates reasonably efficient code...

### Citations

3966 | Computer Architecture: A Quantitative Approach, 3rd Edition - Hennessy, Patterson - 2003 |

2432 | The Design and Analysis of Computer Algorithms - Aho, Hopcroft, et al. - 1974 |

Citation Context: ...y of edges and the adjacency-list is represented as an array of arrays. Using arrays instead of lists makes it [...] Figure 7: Representations of an undirected graph. (a) A graph G with 5 vertices and 5 edges. (b) An edge-list representation of G: [(0,1), (0,2), (2,3), (3,4), (1,3), (1,0), (2,0), (3,2), (4,3), (3,1)]. (c) The adjacency-list representation: [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]. |
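The context above quotes the paper's Figure 7: the same undirected graph stored as an edge list and as an adjacency list represented by an array of arrays. A minimal sketch of that conversion in Python (an illustrative analogue of the representation, not the paper's NESL code; the function name is ours):

```python
def edge_list_to_adjacency(n, edges):
    """Convert a directed edge list into an adjacency list stored as an
    array of arrays, one per vertex (each undirected edge appears twice,
    once per direction, as in the paper's Figure 7)."""
    adj = [[] for _ in range(n)]
    for (u, v) in edges:
        adj[u].append(v)
    return adj

# The graph from the quoted context: 5 vertices, 5 undirected edges,
# each stored in both directions.
edges = [(0, 1), (0, 2), (2, 3), (3, 4), (1, 3),
         (1, 0), (2, 0), (3, 2), (4, 3), (3, 1)]
print(edge_list_to_adjacency(5, edges))
```

Up to the order of neighbors within each sub-array, this reproduces the adjacency list [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]] shown in the context.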

1760 | Computational Geometry: An Introduction - Preparata - 1985 |

Citation Context: ...ivide-and-conquer, but one does most of the work before the divide step, and the other does most of the work after. 6.2.1 QuickHull: The parallel QuickHull algorithm is based on the sequential version [71], so named because of its similarity to the QuickSort algorithm. As with QuickSort, the strategy is to pick a “pivot” element, split the data based on the pivot, and recurse on each of the split sets... |

1562 | The Definition of Standard ML - Milner, Tofte, et al. - 1990 |

1531 | Distributed Algorithms - Lynch - 1996 |

1304 | Introduction to Parallel Algorithms and Architectures - Leighton - 1991 |

Citation Context: ...sors, and then down the tree. Many algorithms have been designed to run efficiently on particular network topologies such as the mesh or the hypercube. For an extensive treatment of such algorithms, see [55, 67, 73, 80]. Although this approach can lead to very fine-tuned algorithms, it has some disadvantages. First, algorithms designed for one network may not perform well on other networks. Hence, in order to solve... |

1127 | A bridging model for parallel computation - Valiant - 1990 |

636 | An Introduction to Parallel Algorithms - JáJá - 1992 |

Citation Context: ...ickSort and radix sort. Both of these algorithms are easy to program, and both work well in practice. Many more sorting algorithms can be found in the literature. The interested reader is referred to [3, 45, 55] for more complete coverage. 5.1 QuickSort: We begin our discussion of sorting with a parallel version of QuickSort. This algorithm is one of the simplest to code. ALGORITHM: quicksort(A) 1 if |A| =... |
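The quoted context breaks off at the start of the paper's quicksort pseudocode. A minimal sketch of the same three-way-partition QuickSort in Python (list comprehensions standing in for NESL's apply-to-each; this is our illustration, not the paper's code, and the recursive calls run sequentially here, whereas NESL would evaluate them in parallel):

```python
def quicksort(a):
    """Data-parallel-style QuickSort: pick a pivot, split the sequence
    into less / equal / greater in parallel-style comprehensions, and
    recurse on the two outer partitions."""
    if len(a) <= 1:
        return a
    pivot = a[len(a) // 2]
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    return quicksort(less) + equal + quicksort(greater)
```

Each comprehension is O(n) work and O(1) depth in the paper's cost model, giving the expected O(n log n) work and (with parallel recursion) O(log n) expected span of steps discussed later on this page.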

615 | Parallel and Distributed Computation: Numerical Methods - Bertsekas, Tsitsiklis - 1989 |

Citation Context: ...ication [75, Chapter 14]. Hashing is useful for load balancing and mapping addresses to memory [47, 87]. Iterative techniques are useful as a replacement for direct methods for solving linear systems [18]. 3 Basic operations on sequences, lists, and trees: We begin our presentation of parallel algorithms with a collection of algorithms for performing basic operations on sequences, lists, and trees. The... |

497 | LogP: Towards a realistic model of parallel computation - Culler, Karp, et al. - 1993 |

483 | Introduction to Parallel Computing: Design and Analysis of Algorithms (Benjamin/Cummings) - Kumar, Grama, et al. - 1994 |

434 | The SGI Origin: A ccNUMA Highly Scalable Server - Laudon, Lenoski - 1997 |

Citation Context: ...to a shared-memory system. As these machines reach the limit in size imposed by the bus architecture, manufacturers have reintroduced parallel machines based on the hypercube network topology (e.g., [54]). Defining terms: CRCW. A shared memory model that allows for concurrent reads (CR) and concurrent writes (CW) to the memory. CREW. This refers to a shared memory model that allows for concurrent... |

372 | Gaussian elimination is not optimal - Strassen - 1969 |

Citation Context: ...subset of the parallelism to use so that communication costs can be minimized. Sequentially it is known that matrix multiplication can be performed using less than O(n^3) work. Strassen's algorithm [82], for example, requires only O(n^2.81) work. Most of these more efficient algorithms are also easy to parallelize because they are recursive in nature (Strassen's algorithm has O(log n) depth using a... |

366 | A simple parallel algorithm for the maximal independent set problem - Luby - 1986 |

Citation Context: ...re at each vertex is the same, for example if each vertex has the same number of neighbors. As it turns out, the impasse can be resolved by using randomness to break the symmetry between the vertices [58]. Load balancing: A third use of randomness is load balancing. One way to quickly partition a large number of data items into a collection of approximately evenly sized subsets is to randomly assign e... |

287 | Parallel merge sort - Cole - 1988 |

Citation Context: ...becomes D(n) = D(n/2) + O(log log n), (5) which has solution D(n) = O(log n log log n). Using a technique called pipelined divide-and-conquer the depth of mergesort can be further reduced to O(log n) [26]. The idea is to start the merge at the top level before the recursive calls complete. Divide-and-conquer has proven to be one of the most powerful techniques for solving problems in parallel. In this... |

286 | Parallel Algorithms for Shared-Memory Machines - Karp, Ramachandran - 1990 |

Citation Context: ...and D(n) = log2 n. Circuit models have been used for many years to study various theoretical aspects of parallelism, for example to prove that certain problems are difficult to solve in parallel. See [48] for an overview. In a vector model an algorithm is expressed as a sequence of steps, each of which performs an operation on a vector (i.e., sequence) of input values, and produces a vector result [19... |

280 | Parallelism in random access machines - Fortune, Wyllie - 1978 |

260 | Vector Models for Data-Parallel Computing - Blelloch - 1990 |

Citation Context: ...[48] for an overview. In a vector model an algorithm is expressed as a sequence of steps, each of which performs an operation on a vector (i.e., sequence) of input values, and produces a vector result [19, 69]. The work of each step is equal to the length of its input (or output) vector. The work of an algorithm is the sum of the work of its steps. The depth of an algorithm is the number of vector steps. I... |

253 | Sorting and Searching, volume 3 of The Art of Computer Programming - Knuth - 1998 |

Citation Context: ...n this case, the work and depth are given by the recurrences W(n) = 2W(n/2) + O(n) (9) D(n) = D(n/2) + 1 (10) which have solutions W(n) = O(n log n) and D(n) = O(log n). A more sophisticated analysis [50] shows that the expected work and depth are indeed W(n) = O(n log n) and D(n) = O(log n), independent of the values in the input sequence A. In practice, the performance of parallel QuickSort can be i... |

238 | The parallel evaluation of general arithmetic expressions - Brent - 1974 |

Citation Context: ...These translations are work-preserving in the sense that the work performed by both algorithms is the same, to within a constant factor. For example, the following theorem, known as Brent's Theorem [24], shows that an algorithm designed for the circuit model can be translated in a work-preserving fashion to a PRAM model algorithm. Theorem 1.1 (Brent's Theorem) Any algorithm that can be expressed as... |
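The statement of Theorem 1.1 is truncated in the excerpt above. In its commonly stated form (reconstructed from the standard literature, not quoted from this page), Brent's Theorem reads:

```latex
% Brent's Theorem, standard form: any algorithm expressible as a circuit
% of size (total work) $n$ and depth $d$ can be simulated on a
% $p$-processor PRAM in time
T(n, p) \;=\; O\!\left(\frac{n}{p} + d\right).
% The translation is work-preserving: the total operation count
% $p \cdot T(n, p) = O(n + p\,d)$ stays within a constant factor of $n$
% whenever $p = O(n / d)$.
```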

231 | Data parallel algorithms - Hillis, Steele - 1986 |

225 | I-structures: Data structures for parallel computing - Arvind, Pingali - 1989 |

223 | Fat-trees: Universal networks for hardware-efficient supercomputing - Leiserson - 1985 |

Citation Context: ...eflect the fact that, in practice, a processor is capable of generating memory access requests faster than a memory module is capable of servicing them. A fat-tree is a network structured like a tree [56]. Each edge of the tree, however, may represent many communication channels, and each node may represent many network switches (hence the name “fat”). Figure 1.1.1(b) shows a fat-tree with the overall... |

202 | General Purpose Parallel Architectures, Chapter 18 of Handbook of Theoretical Computer Science - Valiant - 1990 |

Citation Context: ...comparably sized guest, however, it is usually much more difficult for the host to perform a work-preserving emulation of the guest. For more information on PRAM emulations, the reader is referred to [43, 86]. 1.5 Model used in this chapter: Because there are so many work-preserving translations between different parallel models of computation, we have the luxury of choosing the model that we feel most clea... |

195 | How to Emulate Shared Memory - Ranade - 1987 |

Citation Context: ...address to a physical memory bank and an address within that bank using a sufficiently powerful hash function. This scheme was first proposed by Karlin and Upfal [47] for the EREW PRAM model. Ranade [72] later presented a more general approach that allowed the butterfly to efficiently emulate CRCW algorithms. Theorem 1.2 Any algorithm that takes time T on a P-processor PRAM model can be translated in... |

180 | Efficient Parallel Algorithms - Gibbons, Rytter - 1988 |

175 | Implementation of a portable nested data-parallel language - Blelloch, Hardwick, et al. - 1994 |

173 | A comparison of sorting algorithms for the connection machine cm-2 - Blelloch, Leiserson, et al. - 1991 |

171 | The Design and Analysis of Parallel Algorithms - Akl - 1989 |

Citation Context: ...defeated by an item with the same value, and those that were defeated by an item with a different value. In our example, V[5] and V[6] (23 and 18) were defeated by items with the same value, and V[4] (42) was defeated by an item with a different value. Items of the first type are set aside because they are duplicates. Items of the second type are retained, and the algorithm repeats the entire pro... |

166 | A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations - Kogge, Stone - 1973 |

Citation Context: ...gorithm described can be used more generally to solve various recurrences, such as the first-order linear recurrences x_i = (x_{i-1} ⊗ a_i) ⊕ b_i, 0 ≤ i ≤ n, where ⊗ and ⊕ are both binary associative operators [51]. Scans have proven so useful in the design of parallel algorithms that some parallel machines provide support for scan operations in hardware. 3.3 Multiprefix and fetch-and-add: The multiprefix operat... |
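To make the context concrete: for the arithmetic case x_i = x_{i-1} * a_i + b_i, the pairs (a_i, b_i) can be combined with an associative operator and the recurrence solved by a single scan. A sketch in Python (our illustration of the Kogge-Stone idea; the scan is sequential here via `accumulate`, but the same associative operator is what allows a parallel scan):

```python
from itertools import accumulate

def solve_linear_recurrence(a, b, x0):
    """Solve x_i = x_{i-1} * a_i + b_i, i = 1..n, by scanning pairs
    (a_i, b_i) with the associative operator
        (a1, b1) o (a2, b2) = (a1*a2, b1*a2 + b2),
    so that the i-th prefix (pa, pb) satisfies x_i = x0*pa + pb."""
    combine = lambda p, q: (p[0] * q[0], p[1] * q[0] + q[1])
    prefix = accumulate(zip(a, b), combine)
    return [x0 * pa + pb for (pa, pb) in prefix]
```

One can verify associativity of `combine` directly, which is exactly the property the cited paper exploits to evaluate such recurrences in logarithmic depth.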

159 | Maintenance of Configurations in the Plane - Overmars, van Leeuwen - 1981 |

Citation Context: ...heir algorithm admits even more parallelism and leads to an O(log^2 n)-depth algorithm with O(n log h) work where h is the number of points on the convex hull. 6.2.2 MergeHull: The MergeHull algorithm [68] is another divide-and-conquer algorithm for solving the planar convex hull problem. Unlike QuickHull, however, it does most of its work after returning from the recursive calls. The function is expre... |

152 | Programming with sets: an introduction to SETL - Schwartz, Dewar, et al. - 1986 |

147 | Synthesis of Parallel Algorithms - Reif - 1993 |

133 | Mathematical Theory of Connecting Networks and Telephone Traffic - Beneš - 1965 |

123 | Multidimensional divide-and-conquer - Bentley - 1975 |

Citation Context: ...n over a sequence of values in parallel. It uses a set-like notation. For example, the expression {a ∗ a : a ∈ [3, −4, −9, 5]} squares each element of the sequence [3, −4, −9, 5], returning the sequence [9, 16, 81, 25]. This can be read: “in parallel, for each a in the sequence [3, −4, −9, 5], square a”. The apply-to-each construct also provides the ability to subselect elements of a sequence based on a filter. For... |
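The apply-to-each construct described in the context maps closely onto list comprehensions; a Python analogue (illustrative only, Python standing in for NESL):

```python
seq = [3, -4, -9, 5]

# NESL's {a * a : a in seq} -- "in parallel, for each a in seq, square a"
squares = [a * a for a in seq]                      # [9, 16, 81, 25]

# With a filter, subselecting elements (NESL's filter form), e.g.
# squaring only the positive elements:
positive_squares = [a * a for a in seq if a > 0]    # [9, 25]
```

The difference, of course, is that NESL's semantics lets all iterations of the comprehension execute in parallel, which is what gives such expressions O(1) depth in the paper's cost model.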

120 | A logarithmic time sort for linear size networks - Reif, Valiant - 1987 |

Citation Context: ...candidates. Hence there is an optimum value of s, typically larger than one, that minimizes the total time. The sorting algorithm that selects partition elements in this fashion is called sample sort [23, 89, 76]. 5.2 Radix sort: Our next sorting algorithm is radix sort, an algorithm that performs well in practice. Unlike QuickSort, radix sort is not a comparison sort, meaning that it does not compare keys dir... |
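The context introduces radix sort, which examines keys digit by digit rather than comparing them. A minimal least-significant-digit radix sort sketch in Python (our illustration, not the paper's NESL version; the per-pass bucketing is the step that the paper parallelizes with scans):

```python
def radix_sort(keys, bits=16, r=4):
    """LSD radix sort on non-negative integers with at most `bits` bits,
    processing r bits per pass. Each pass is a stable bucket sort on
    one r-bit digit, so after all passes the keys are fully sorted."""
    for shift in range(0, bits, r):
        buckets = [[] for _ in range(1 << r)]
        for k in keys:
            buckets[(k >> shift) & ((1 << r) - 1)].append(k)
        keys = [k for b in buckets for k in b]   # concatenate buckets
    return keys
```

The correctness hinges on each pass being stable, so earlier (lower-order) digit orderings are preserved when later digits tie.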

116 | An O(log n) parallel connectivity algorithm - Shiloach, Vishkin - 1982 |

Citation Context: ...improvements to the two basic connected component algorithms we described. Here we mention some of them. The deterministic algorithm can be improved to run in O(log n) depth with the same work bounds [13, 79]. The basic idea is to interleave the hooking steps with the pointer-jumping steps. By a pointer-jumping step we mean a single step of point to root. This means that each tree is only partially contra... |

115 | A randomized linear-time algorithm to find minimum spanning trees - Karger, Klein, Tarjan - 1995 |

Citation Context: ...This technique, first applied to parallel algorithms, has since been used to improve some sequential algorithms, such as deriving the first linear-work algorithm for finding a minimum spanning tree [46]. Another improvement is to use the EREW model instead of requiring concurrent reads and writes [42]. However this comes at the cost of greatly complicating the algorithm. The basic idea is to keep ci... |

113 | Computational Aspects of VLSI - Ullman - 1984 |

110 | Designing broadcasting algorithms in the postal model for message-passing systems - Bar-Noy, Kipnis - 1992 |

Citation Context: ...h can be expressed as the minimum gap g between successive injections of messages into the network. Three models that characterize a network in terms of its latency and bandwidth are the Postal model [14], the Bulk-Synchronous Parallel (BSP) model [85], and the LogP model [29]. In the Postal model, a network is described by a single parameter L, its latency. The Bulk-Synchronous Parallel model adds a... |

103 | Parallel algorithms for shared memory machines - Karp, Ramachandran - 1990 |

100 | The ultimate planar convex hull algorithm - Kirkpatrick, Seidel - 1986 |

Citation Context: ...The remaining points are defined recursively. That is, the points become arbitrarily close to xmin (see Figure 15). Figure 15: Contrived set of points for worst case QuickHull. Kirkpatrick and Seidel [49] have shown that it is possible to modify QuickHull so that it makes provably good partitions. Although the technique is shown for a sequential algorithm, it is easy to parallelize. A simplification o... |

97 | A report on the Sisal language project - Feo, Cann, et al. - 1990 |

95 | An efficient parallel biconnectivity algorithm - Tarjan, Vishkin - 1985 |

Citation Context: ...e on the way down and once on the way up. By keeping a linked structure that represents the Euler tour of a tree it is possible to compute many functions on the tree, such as the size of each subtree [83]. This technique uses linear work, and parallel depth that is independent of the depth of the tree. The Euler tour can often be used to replace a standard traversal of a tree, such as a depth-first tr... |

95 | NESL: A Nested Data-Parallel Language (version 2.6 - Blelloch - 1993 |

93 | Parallel Computation: Models and Methods - Akl - 1997 |

Citation Context: ...he function distribute creates a sequence of identical elements. For example, the expression distribute(3,5) creates the sequence [3,3,3,3,3]. The ++ function appends two sequences. For example, [2,1]++[5,0,3] creates the sequence [2,1,5,0,3]. The flatten function converts a nested sequence (a sequence for which each element is itself a sequence) into a flat sequence. For example, flatten([[3,5],[3,2],[1,5]... |
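The three sequence functions named in the context have direct Python analogues; a sketch (our stand-ins for NESL's built-ins, not NESL itself):

```python
def distribute(v, n):
    """NESL's distribute: n copies of v, e.g. distribute(3,5) -> [3,3,3,3,3]."""
    return [v] * n

def append_seq(a, b):
    """NESL's ++ : append two sequences, e.g. [2,1] ++ [5,0,3] -> [2,1,5,0,3]."""
    return a + b

def flatten(nested):
    """Flatten one level of nesting, e.g. [[3,5],[3,2],[1,5]] -> [3,5,3,2,1,5]."""
    return [x for seq in nested for x in seq]
```

In NESL these are constant- or logarithmic-depth primitives; here they serve only to pin down the input/output behavior the context describes.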

88 | Fast parallel matrix inversion algorithms - Csanky - 1976 |

Citation Context: ...pth, although the work can be reduced by using a more efficient matrix multiplication algorithm. There are also more sophisticated, but less practical, work-efficient algorithms with depth O(log^2 n) [28, 70]. Parallel algorithms for many other matrix operations have been studied, and there has also been significant work on algorithms for various special forms of matrices, such as tridiagonal, triangul... |

84 | Parallel Sorting Algorithms - Akl - 1989 |

Citation Context: ...allel. For example, suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which has an odd index, i.e., A[0] is paired with A[1], A[2] with A[3], and so on. The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value as the sum that we wish to compute. This pairing and summing step can be repeated until, after ⌈log2 n⌉ steps, a s... |
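The pairing-and-summing reduction described in the context can be sketched directly (an illustrative Python version; each round's comprehension corresponds to one parallel step, and the rounds themselves are sequential here):

```python
def pairwise_sum(a):
    """Tree reduction of a non-empty sequence: pair A[2i] with A[2i+1]
    and sum, halving the sequence each round; after ceil(log2 n)
    rounds a single element remains, the total sum."""
    while len(a) > 1:
        a = [a[i] + a[i + 1] if i + 1 < len(a) else a[i]
             for i in range(0, len(a), 2)]
    return a[0]
```

Since every round halves the sequence length and does O(n) total work across all rounds, this is the O(n)-work, O(log n)-depth summation the context is building toward.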

82 | LogP: A practical model of parallel computation - Culler, Karp, et al. - 1996 |

Citation Context: ...messages into the network. Three models that characterize a network in terms of its latency and bandwidth are the Postal model [14], the Bulk-Synchronous Parallel (BSP) model [85], and the LogP model [29]. In the Postal model, a network is described by a single parameter L, its latency. The Bulk-Synchronous Parallel model adds a second parameter g, the minimum ratio of computation steps to communicati... |

77 | Interconnection Networks for Large-Scale Parallel Processing: Theory and Case Studies - Siegel - 1990 |

Citation Context: ...sors, and then down the tree. Many algorithms have been designed to run efficiently on particular network topologies such as the mesh or the hypercube. For an extensive treatment of such algorithms, see [55, 67, 73, 80]. Although this approach can lead to very fine-tuned algorithms, it has some disadvantages. First, algorithms designed for one network may not perform well on other networks. Hence, in order to solve... |