## The Design and Analysis of Bulk-Synchronous Parallel Algorithms (1998)

Citations: 10 (1 self)

### BibTeX

@TECHREPORT{Tiskin98thedesign,
  author = {Alexandre Tiskin},
  title = {The Design and Analysis of Bulk-Synchronous Parallel Algorithms},
  institution = {},
  year = {1998}
}


### Abstract

The model of bulk-synchronous parallel (BSP) computation is an emerging paradigm of general-purpose parallel computing. This thesis presents a systematic approach to the design and analysis of BSP algorithms. We introduce an extension of the BSP model, called BSPRAM, which reconciles shared-memory style programming with efficient exploitation of data locality. The BSPRAM model can be optimally simulated by a BSP computer for a broad range of algorithms possessing certain characteristic properties: obliviousness, slackness, granularity. We use BSPRAM to design BSP algorithms for problems from three large, partially overlapping domains: combinatorial computation, dense matrix computation, graph computation. Some of the presented algorithms are adapted from known BSP algorithms (butterfly dag computation, cube dag computation, matrix multiplication). Other algorithms are obtained by application of established non-BSP techniques (sorting, randomised list contraction, Gaussian elimination without pivoting and with column pivoting, algebraic path computation), or use original techniques specific to the BSP model (deterministic list contraction, Gaussian elimination with nested block pivoting, communication-efficient multiplication of Boolean matrices, synchronisation-efficient shortest paths computation). The asymptotic BSP cost of each algorithm is established, along with its BSPRAM characteristics. We conclude by outlining some directions for future research.

### Citations

2438 |
The Design and Analysis of Computer Algorithms
- Aho, Hopcroft, et al.
- 1974
Citation Context: ...rward method is standard matrix multiplication, of sequential complexity Θ(n³). There are also subcubic methods, including Kronrod's algorithm (also known as the Four Russians' algorithm, see e.g. [AHU76]), and a recent algorithm from [BKM95]. The lowest known exponent is achieved by fast Strassen-type multiplication. In this method, the Boolean matrices are viewed as (0,1)-matrices over the ring of ...

1136 |
Extremal graph theory
- Bollobas
- 1978
Citation Context: ...sed as a maximal Boolean factor. It also gives a characterisation of such matrices by maximal triangle-free graphs. One of the few references to maximal equitripartite triangle-free graphs appears in [Bol78]. In particular, [Bol78, pages 324-325] states the problem of finding the minimum possible density of such a graph; it is easy to see from the discussion above that this problem is closely related to...

822 |
A Note on Two Problems in Connection with Graphs
- Dijkstra
- 1959
Citation Context: ...s the problem with BSP cost W = O(n³/p), H = O(n²/p^α), S = O(p^α), for an arbitrary α, 1/2 ≤ α ≤ 2/3. Alternatively, the problem with nonnegative lengths can be solved by Dijkstra's algorithm ([Dij59], see also [CLR90]). This greedy algorithm finds all shortest paths from a fixed source in order of increasing length. The sequential time complexity of Dijkstra's algorithm is Θ(n²). To comput...
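
The greedy rule quoted in the context above (settle vertices in order of increasing tentative distance) can be sketched as a minimal array-based variant, assuming nonnegative edge lengths. This is only an illustration of the Θ(n²) sequential bound, not the thesis's BSP formulation; the function name is ours:

```python
import math

def dijkstra(adj, source):
    """All shortest-path lengths from `source`, scanning vertices in
    order of increasing tentative distance (Dijkstra's greedy rule).
    adj[u][v] is a nonnegative edge length, or math.inf if absent.
    The array-based scan gives the Theta(n^2) sequential bound."""
    n = len(adj)
    dist = [math.inf] * n
    dist[source] = 0
    done = [False] * n
    for _ in range(n):
        # pick the unsettled vertex with the smallest tentative distance
        u = min((v for v in range(n) if not done[v]),
                key=lambda v: dist[v], default=None)
        if u is None or dist[u] == math.inf:
            break
        done[u] = True
        # relax all edges out of the newly settled vertex
        for v in range(n):
            dist[v] = min(dist[v], dist[u] + adj[u][v])
    return dist
```

With nonnegative lengths, each vertex is settled exactly once, which is what makes the greedy order correct.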

800 |
Matrix multiplication via arithmetic progressions
- Coppersmith, Winograd
- 1987
Citation Context: ... a commutative ring with unit. However, no lower bound asymptotically better than the trivial Ω(n²) has been found, nor is there any indication that the current O(n^2.376) algorithm from [CW90] is close to optimal. The natural computational model for matrix multiplication over a commutative ring with unit is the model of arithmetic circuits. It is not difficult to see (see e.g. [HK71]) that...

706 | Data Structures and Algorithms - Aho, Hopcroft, et al. - 1983 |

635 | An Introduction to Parallel Algorithms - JaJa - 1992 |

373 |
Gaussian elimination is not optimal
- Strassen
- 1969
Citation Context: ... for commutative rings with unit, which allow "fast" matrix multiplication algorithms. The first such algorithm was proposed by Strassen in his groundbreaking paper [Str69]. Since then, much work has been done on the complexity of matrix multiplication over a commutative ring with unit. However, no lower bound asymptotically better than the trivial Ω(n²) has...
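
Strassen's seven-multiplication scheme from [Str69], referred to above, can be sketched recursively for matrices of power-of-two order (an illustrative version; the helper names are ours):

```python
def strassen(A, B):
    """Strassen's seven-multiplication recursion for n x n matrices,
    n a power of two: seven recursive products replace the eight of the
    naive 2x2 block scheme."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]
    h = n // 2
    def quad(M, r, c):  # extract an h x h quadrant
        return [row[c:c + h] for row in M[r:r + h]]
    def add(X, Y):
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    def sub(X, Y):
        return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A, 0, 0), quad(A, 0, h), quad(A, h, 0), quad(A, h, h)
    B11, B12, B21, B22 = quad(B, 0, 0), quad(B, 0, h), quad(B, h, 0), quad(B, h, h)
    # the seven Strassen products
    M1 = strassen(add(A11, A22), add(B11, B22))
    M2 = strassen(add(A21, A22), B11)
    M3 = strassen(A11, sub(B12, B22))
    M4 = strassen(A22, sub(B21, B11))
    M5 = strassen(add(A11, A12), B22)
    M6 = strassen(sub(A21, A11), add(B11, B12))
    M7 = strassen(sub(A12, A22), add(B21, B22))
    # recombine into the four quadrants of C
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bot = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bot
```

The recurrence T(n) = 7 T(n/2) + O(n²) gives the familiar O(n^{log₂ 7}) ≈ O(n^{2.81}) bound.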

286 |
Parallel Merge Sort
- Cole
- 1988
Citation Context: ... log n / log(n/p)). For n ≥ p³, the algorithm is identical to PSRS. For smaller values of n, it uses a pipelined tree merging technique similar to the one employed by Cole's algorithm (see e.g. [Col93]). Despite its asymptotic optimality, the algorithm from [Goo96] is unlikely to be practical in the case of n ≈ p. A more practical BSP sorting algorithm for small values of n is described in [GS96]. 3...

284 | Designing and Building Parallel Programs - Foster - 1996 |

201 | Szemerédi’s regularity lemma and its applications in graph theory
- Komlós, Simonovits
- 1993
Citation Context: ...density problem was "completely unresolved". Since then, however, a general approach to problems of this kind has been developed. The basis of this approach is Szemerédi's Regularity Lemma (see e.g. [KS96]). Here we apply this lemma directly to the Boolean matrix multiplication problem; it might also be applicable to similar extremal graph problems, including the minimum density problem. In the definit...

180 | A regular layout for parallel adders
- Brent, Kung
- 1982
Citation Context: ...; ...; x_0 ∘ x_1 ∘ ··· ∘ x_{n-1}), where ∘ is an associative operator computable in time O(1). A standard method of computing all-prefix sums in parallel, proposed in [BK82] (see also [LD94]), can be represented by a dag allpref(n), shown in Figure 3.3 for n = 8. Here, the action of a node with inputs x, y is to output x and x ∘ y...
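
The all-prefix-sums dag described above has a standard two-phase (up-sweep then down-sweep) structure; a sequential emulation of that structure for n a power of two might look as follows (an illustration under that assumption, not the dag from Figure 3.3 verbatim):

```python
def all_prefix_sums(xs, op):
    """Inclusive all-prefix computation x0, x0∘x1, ..., x0∘···∘x(n-1),
    following a two-phase (up-sweep / down-sweep) dag for n a power of
    two, with `op` the associative operator."""
    a = list(xs)
    n = len(a)
    # up-sweep: combine pairs at strides 2, 4, ..., n
    d = 2
    while d <= n:
        for i in range(d - 1, n, d):
            a[i] = op(a[i - d // 2], a[i])
        d *= 2
    # down-sweep: fill in the remaining prefixes at decreasing strides
    d //= 2
    while d >= 2:
        for i in range(d - 1 + d // 2, n, d):
            a[i] = op(a[i - d // 2], a[i])
        d //= 2
    return a
```

Both phases perform O(n) applications of ∘ in O(log n) rounds, which is what makes the dag attractive for parallel evaluation.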

180 | Efficient Parallel Algorithms - Gibbons, Rytter - 1988 |

172 | Graphs and Algorithms - Gondran, Minoux - 1984 |

156 | Efficient algorithms for shortest paths in sparse networks - Johnson - 1977 |

140 | Introduction to Parallel and Vector Solutions of Linear Systems - Ortega - 1988 |

136 | Towards an architectureindependent analysis of parallel algorithms - Papadimitriou, Yannakakis - 1990 |

134 | Models and languages for parallel computation - Skillicorn, Talia - 1998 |

128 | Vorlesungen über Inhalt, Oberfläche und Isoperimetrie. Die Grundlehren der Mathematischen Wissenschaften - Hadwiger - 1957 |

120 |
Parallel tree contraction and its applications
- Miller, Reif
- 1985
Citation Context: ...lem is rather more complicated on parallel models. The easiest way to obtain an efficient parallel list contraction algorithm is by randomisation. Paper [MR85] introduced a technique of random mating. The random mating algorithm proceeds in a sequence of rounds. In each round every item is marked either forward-looking or backward-looking by flipping an ind...
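
A round of random mating, as described above, can be simulated sequentially: every live item flips a coin, and a forward-looking item absorbs its successor whenever that successor is backward-looking. The sketch below is only an illustration of the technique (the function and parameter names are ours, and it is not the BSP implementation):

```python
import random

def random_mating_contract(succ, merge, values):
    """Contract a linked list by random mating. succ maps each item to
    its successor (None at the tail); values maps items to their data;
    merge combines an item's value with its successor's. Each round,
    a forward-looking item absorbs a backward-looking successor;
    the expected number of rounds is O(log n)."""
    live = set(succ)
    while len(live) > 1:
        # flip an independent coin per live item: True = forward-looking
        coin = {i: random.random() < 0.5 for i in live}
        for u in list(live):
            if u not in live:
                continue  # already absorbed earlier in this round
            v = succ.get(u)
            if v is not None and v in live and coin[u] and not coin[v]:
                # u absorbs its backward-looking successor v
                values[u] = merge(values[u], values[v])
                succ[u] = succ.get(v)
                live.discard(v)
    (last,) = live
    return values[last]
```

Because each item has a unique predecessor, the merges within one round never conflict, which is the point of the mating rule.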

115 | A randomized linear-time algorithm to find minimum spanning trees
- Karger, Tarjan, et al.
- 1995
Citation Context: ...ymptotic complexity have been proposed, but it is not known if an O(m) deterministic algorithm exists. However, if the input edges are sorted by weight, the greedy algorithms work in time O(m). Paper [KKT95] describes a randomised O(m) MST algorithm. A standard PRAM solution to the problem is provided by another greedy algorithm, attributed to Borůvka and Sollin (see e.g. [JáJá92]). The algorithm works ...
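
The Borůvka/Sollin greedy algorithm mentioned above proceeds in rounds: every component selects its cheapest outgoing edge, and all selected edges are added at once. A sequential sketch using union-find (our names; distinct edge weights assumed so the MST is unique):

```python
def boruvka_mst(n, edges):
    """Boruvka's (Sollin's) greedy MST on n vertices. Each round, every
    component picks its cheapest outgoing edge; all picked edges join
    the tree, halving the component count. edges: (weight, u, v) with
    distinct weights. Returns the total MST weight."""
    parent = list(range(n))
    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    total, components = 0, n
    while components > 1:
        cheapest = {}  # component root -> its cheapest outgoing edge
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                if ru not in cheapest or w < cheapest[ru][0]:
                    cheapest[ru] = (w, ru, rv)
                if rv not in cheapest or w < cheapest[rv][0]:
                    cheapest[rv] = (w, ru, rv)
        if not cheapest:
            break  # graph is disconnected
        for w, ru, rv in cheapest.values():
            a, b = find(ru), find(rv)
            if a != b:  # the same edge may be picked by both endpoints
                parent[a] = b
                total += w
                components -= 1
    return total
```

Since every round at least halves the number of components, O(log n) rounds suffice, matching the dense-graph bound quoted in the [JM95] context below.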

109 |
Introduction to Algorithms. The MIT Electrical Engineering and Computer Science Series
- Cormen, Leiserson, et al.
- 2001
Citation Context: ...ta ⊕ x_m; y), where ⊕ is a commutative and associative operator, and both f and ⊕ are computable in time O(1). This corresponds to resolving concurrent writing in PRAM by combining (see e.g. [CLR90]). In a similar way to the BSP model, the cost of a BSPRAM superstep is defined as w + h·g + l. Here w is the maximum number of local operations performed by each processor, and h = h′ + h″...
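
The superstep cost w + h·g + l quoted above accumulates over a program's supersteps in the obvious way; a one-line sketch (parameter names are ours):

```python
def bsp_cost(supersteps, g, l):
    """Total BSP cost of a program given as a list of supersteps, each
    a pair (w, h): local work w, plus h*g for communication, plus the
    synchronisation latency l per superstep."""
    return sum(w + h * g + l for w, h in supersteps)
```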

108 |
Deterministic coin tossing with applications to optimal list ranking
- Cole, Vishkin
- 1986
Citation Context: ...rithm for list contraction. Known efficient deterministic algorithms for PRAM (see e.g. [JáJá92, RMMM93]) typically involve the method of symmetry breaking by deterministic coin tossing introduced in [CV86]. Such algorithms are complicated and often assume non-standard arithmetic capabilities of the computational model, e.g. bitwise operations on integers. As in the case of randomised algorithms, it is ...

103 |
Parallel algorithms for shared memory machines
- Karp, Ramachandran
- 1990
Citation Context: ... on the inputs). An oblivious algorithm can be represented as a computation of a uniform family of circuits (for the definition of a uniform family of circuits, see e.g. [KR90]). We say that a BSPRAM algorithm is communication-oblivious, if the sequence of communication and synchronisation operations executed by a processor is the same for any input of a given size (no such ...

101 | Parallel sorting by regular sampling
- Shi, Schaeffer
- 1992
Citation Context: ...ment). Let ⟨a, b⟩ denote an open interval, i.e. the set of all x such that a < x < b. Probably the simplest parallel sorting algorithm is parallel sorting by regular sampling (PSRS), proposed in [SS92] (see also its discussion in [LLS+93]). Paper [HJB] describes an optimised version of the algorithm, and its efficient implementation on a variety of platforms. The PSRS algorithm proceeds as follow...
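
The PSRS scheme referred to above can be simulated sequentially: sort p blocks, draw regular samples from each, choose p−1 pivots from the sorted samples, route elements to buckets by pivot interval, and sort each bucket. A simplified sketch under those assumptions (our names; the real algorithm runs the per-block and per-bucket steps on p processors):

```python
import bisect

def psrs_sort(xs, p):
    """Parallel sorting by regular sampling (PSRS), simulated
    sequentially: sort p blocks, take up to p regular samples per block,
    pick p-1 pivots from the sorted samples, route every element to the
    bucket selected by the pivots, and sort each bucket."""
    n = len(xs)
    size = -(-n // p)  # ceiling division: block size
    blocks = [sorted(xs[i * size:(i + 1) * size]) for i in range(p)]
    # regular samples: evenly spaced elements of each sorted block
    samples = sorted(s for b in blocks
                     for s in b[::max(1, len(b) // p)][:p])
    pivots = samples[p - 1::p][:p - 1]  # every p-th sample as a splitter
    buckets = [[] for _ in range(p)]
    for b in blocks:
        for x in b:
            buckets[bisect.bisect_left(pivots, x)].append(x)
    # concatenating the sorted buckets yields the sorted array
    return [x for bucket in buckets for x in sorted(bucket)]
```

Regular sampling bounds each bucket's size, which is what makes the final per-bucket work balanced across processors.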

95 | Prefix Sums and Their Applications - Blelloch - 1990 |

84 |
A Complexity Theory of Efficient Parallel Algorithms
- Kruskal, Rudolph, et al.
- 1990
Citation Context: ...stic. The first step in making them more realistic was to introduce a new complexity measure, efficiency, depending on the number of processors used by the algorithm (see [KRS90]). New parallel models were gradually introduced to account for resources other than the number of processors. Currently, dozens of such models exist; see [LMR96, MMT95, ST98] for a survey. Among the...

83 | Universal computing
- McColl
Citation Context: .... 3.3 Cube dag computation. The cube dag defines the dependence pattern that characterises a large number of scientific algorithms. Here we describe a BSPRAM version of the BSP cube dag algorithm from [McC95]. For simplicity, we consider the computation of a three-dimensional cube dag. The algorithm for other dimensions is similar. The three-dimensional cube dag cube_3(n) with inputs x^(1)_jk, x^(2)_ik ...

77 | General purpose parallel computing - McColl - 1993 |

76 | Parallel algorithms for dense linear algebra computations - Gallivan, Plemmons, et al. - 1990 |

71 | On the exponent of the all pairs shortest path problem - Alon, Galil, et al. - 1997 |

70 | Scientific computing on bulk synchronous parallel architectures - Bisseling, McColl - 1994 |

64 | Communication-efficient parallel sorting
- Goodrich
- 1999
Citation Context: ...ples). Lower bounds on communication complexity of sorting for various parallel models can be found e.g. in [SS92, ABK95]. The asymptotic BSP costs of Algorithm 4 are independently optimal. Paper [Goo96] presents a more complex BSP sorting algorithm, asymptotically optimal for any n ≥ p. Its BSP costs are W = O(n log n / p), H = O((n/p) · log n / log(n/p)), S = O(log n / log(n/p))...

52 |
Triangular factorization and inversion by fast matrix multiplication
- Bunch, Hopcroft
- 1974
Citation Context: ... rank. A similar approach applies to computation of the QR decomposition of a real matrix by Givens rotations. Another pivoting method suitable for block triangular decomposition has been proposed in [BH74]. Since the approach of [BH74] requires a search for the pivot along a matrix row, its BSP cost is higher than the cost of nested block pivoting. To describe nested block pivoting, we consider Gaussia...

52 | Models of parallel computation: a survey and synthesis - Maggs, Matheson, et al. - 1995 |

48 | Deterministic sorting and randomized median finding on the BSP model
- Gerbessiotis, Siniolakis
- 1996
(Show Context)
Citation Context ...g. [Col93]). Despite its asymptotic optimality, the algorithm from [Goo96] is unlikely to be practical in the case of nsp. A more practical BSP sorting algorithm for small values of n is described in =-=[GS96]-=-. 3.5 List contraction This and the following sections consider BSPRAM computation on pointer structures, such as linked lists and trees. A linked list is a sequence of items. The order of items is de... |

42 | The QRQW PRAM: Accounting for contention in parallel algorithms - Gibbons, Matias, et al. - 1994 |

42 |
An inequality related to the isoperimetric inequality
- Loomis, Whitney
- 1949
Citation Context: ...t is the Loomis-Whitney inequality, relating the volume of a compact set in R^m, m ≥ 2, to the areas of its orthogonal projections onto r-dimensional coordinate subspaces, 1 ≤ r ≤ m. It was introduced in [LW49] (see also [Had57, BZ88]) to simplify the proof of the classical isoperimetric "volume-to-surface" inequality. The discrete analog of the Loomis-Whitney inequality relates the size of a finite set of...

39 | A three-dimensional approach to parallel matrix multiplication - Agarwal, Balle, et al. - 1995 |

38 | Communication-efficient parallel algorithms for distributed random-access machines
- Leiserson, Maggs
- 1988
Citation Context: ... distance from the head (or the tail) of the list (see e.g. [CLR90, JáJá92, RMMM93]). List ranking can be applied to more general list problems, such as computing all-prefix sums on a list. Following [LM88], we view these problems as instances of an abstract problem of list contraction: given an abstract operation of merging two adjacent items as a primitive, contract the list to a single item. Implemen...

30 | A parallel algorithm for computing minimum spanning trees
- Johnson, Metaxas
- 1995
Citation Context: ...g n) rounds are sufficient for a dense graph. The contraction of tree components in each round can take up to O(log n) steps, therefore the total PRAM complexity of the algorithm is O(log² n). Paper [JM95] presents a more efficient PRAM algorithm, with complexity O(log^{3/2} n). The BSPRAM model suggests an alternative, coarse-grain approach to the problem. We assume that initially the edges of the graph...

28 | Analysis and Design of Parallel Algorithms - Lakshmivarahan, Dhall - 1990 |

27 | Linear algebra in dioids - a survey of recent results - Gondran, Minoux - 1984 |

27 | A tensor product formulation of Strassen’s matrix multiplication algorithm with memory reduction - Kumar, Huang, et al. - 1993 |

27 | Path problems in graphs - Rote - 1990 |

26 | Parallel sorting with limited bandwidth - Adler, Byers, et al. - 1995 |

25 |
A Simple Randomized Parallel Algorithm for List-Ranking
- Anderson, Miller
- 1990
Citation Context: ... randomised list contraction. An algorithm from [RM96] is time-processor optimal. Although it is slightly suboptimal in time, it performs better in practice than the more sophisticated algorithm from [AM90], optimal both in time and in the time-processor product. Optimal efficiency for randomised list contraction is much easier to achieve in the BSPRAM model, given sufficient slackness. The following st...

25 |
Lower bounds and efficient algorithms for multiprocessor scheduling of dags with communication delays
- Jung, Kirousis, et al.
- 1989
Citation Context: ...nication cost model, where a nonlocal edge incurs a fixed communication delay. The number of processors is unbounded. A node may be computed, in general, more than once by different processors. Paper [JKS93] shows that such recomputation of nodes is necessary for an asymptotically optimal computation of certain dags in the given model. In a BSP dag computation, we also allow recomputation of nodes. Howev...

22 | A new deterministic parallel sorting algorithm with an experimental evaluation
- Helman, JáJá, et al.
- 1998
Citation Context: ...t of all x such that a < x < b. Probably the simplest parallel sorting algorithm is parallel sorting by regular sampling (PSRS), proposed in [SS92] (see also its discussion in [LLS+93]). Paper [HJB] describes an optimised version of the algorithm, and its efficient implementation on a variety of platforms. The PSRS algorithm proceeds as follows. First, the array x is partitioned into p subarrays...

21 |
On minimizing the number of multiplications necessary for matrix multiplication
- HOPCROFT, KERR
- 1971
Citation Context: ...where A, B, C are n × n matrices over a semiring. We aim to parallelise the standard Θ(n³) method, asymptotically optimal for sequential matrix multiplication over a general semiring (see [HK71]). The method consists in straightforward computation of the family of bilinear forms C[i,k] = Σ_{j=1}^{n} A[i,j] · B[j,k], 1 ≤ i,k ≤ n (4.11). Following (4.11), we need to set C[i,k] ← 0 for i,k = 1...
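
The bilinear forms (4.11) can be computed directly for any semiring by parameterising the two operations; a straightforward Θ(n³) sketch (our names), which works equally for the usual (+, ×) ring and for, e.g., the Boolean (or, and) semiring:

```python
def semiring_matmul(A, B, add, mul, zero):
    """Standard Theta(n^3) evaluation of the bilinear forms (4.11):
    C[i][k] = add-sum over j of mul(A[i][j], B[j][k]), over any
    semiring given by (add, mul) with additive identity `zero`."""
    n = len(A)
    C = [[zero for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for k in range(n):
            acc = zero  # corresponds to the initialisation C[i,k] <- 0
            for j in range(n):
                acc = add(acc, mul(A[i][j], B[j][k]))
            C[i][k] = acc
    return C
```

Instantiating (add, mul, zero) = (min, +, ∞) gives the distance-product used in shortest-paths computations, which is why the semiring formulation matters in the graph chapters.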

21 |
A communication-time tradeoff
- Papadimitriou, Ullman
- 1987
Citation Context: ... = O(n³/p) requires (i) W = Θ(n³/p), (ii) H = Ω(n²/p^{1/2}), (iii) S = Ω(p^{1/2}). Proof. (i) Trivial. (ii) The proof is an extension of the proof given in [PU87] for the diamond dag. Let W = O(n³/p). Partition the cube dag into p^{3/2} cubic blocks of size n/p^{1/2}. Consider 3p chains of blocks parallel to the main diagonal. In every chain, the computation of...