## Accounting for memory bank contention and delay in high-bandwidth multiprocessors (1997)

Venue: | In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures |

Citations: | 30 - 4 self |

### BibTeX

@INPROCEEDINGS{Blelloch97accountingfor,
    author = {Guy E. Blelloch and Phillip B. Gibbons and Yossi Matias and Marco Zagha},
    title = {Accounting for memory bank contention and delay in high-bandwidth multiprocessors},
    booktitle = {In Proc. 7th ACM Symp. on Parallel Algorithms and Architectures},
    year = {1997},
    pages = {84--94}
}

### Abstract

For years, the computation rate of processors has been much faster than the access rate of memory banks, and this divergence in speeds has been constantly increasing in recent years. As a result, several shared-memory multiprocessors consist of more memory banks than processors. The object of this paper is to provide a simple model (with only a few parameters) for the design and analysis of irregular parallel algorithms that will give a reasonable characterization of performance on such machines. For this purpose, we extend Valiant’s bulk-synchronous parallel (BSP) model with two parameters: a parameter for memory bank delay, the minimum time for servicing requests at a bank, and a parameter for memory bank expansion, the ratio of the number of banks to the number of processors. We call this model the (d, x)-BSP. We show experimentally that the (d, x)-BSP captures the impact of bank contention and delay on the CRAY C90 and J90 for irregular access patterns, without modeling machine-specific details of these machines. The model has clarified the performance characteristics of several unstructured algorithms on the CRAY C90 and J90, and allowed us to explore tradeoffs and optimizations for these algorithms. In addition to modeling individual algorithms directly, we also consider the use of the (d, x)-BSP as a bridging model for emulating a very high-level abstract model, the Parallel Random Access Machine (PRAM). We provide matching upper and lower bounds for emulating the EREW and QRQW PRAMs on the (d, x)-BSP.
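To make the roles of d and x concrete, the following is a minimal sketch, not the authors' code. It assumes that one superstep is charged max(L, g·h_s, d·h_r), where g and L are the usual BSP gap and latency parameters, h_s is the largest number of requests issued by any processor, and h_r is the largest number of requests received by any bank. This cost rule, the function names, and the round-robin request pattern are illustrative assumptions and are not stated in the abstract.

```python
from collections import Counter


def superstep_cost(requests, g, d, L):
    """Assumed cost of one superstep: max(L, g*h_s, d*h_r).

    requests: list of (processor, bank) pairs issued in the superstep.
    h_s is the most requests sent by any processor; h_r is the most
    requests received by any bank.
    """
    h_s = max(Counter(p for p, _ in requests).values(), default=0)
    h_r = max(Counter(b for _, b in requests).values(), default=0)
    return max(L, g * h_s, d * h_r)


def balanced_requests(p, x, h):
    """Hypothetical pattern: p processors each issue h requests,
    spread round-robin over x*p banks (x is the bank expansion)."""
    banks = x * p
    return [(i, (i * h + j) % banks) for i in range(p) for j in range(h)]


if __name__ == "__main__":
    g, d, L = 1, 6, 4  # illustrative values only, not measured machine parameters
    for x in (4, 8):
        cost = superstep_cost(balanced_requests(2, x, 8), g, d, L)
        print(f"x={x}: cost={cost}")
```

Under these assumed values, x = 4 is below d/g = 6 and the bank term d·h_r dominates, while x = 8 is above d/g and the processor term g·h_s dominates.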

### Citations

2423 | The art of computer programming - Knuth - 2005 |

1612 | Probability inequalities for sums of bounded random variables - Hoeffding - 1963 |

1216 |
A bridging model for parallel computation
- Valiant
- 1990
(Show Context)
Citation Context ...signed with the goal of hiding many details of parallel machines while still providing guidance in developing efficient algorithms. Examples of such models include the Bulk Synchronous Parallel (BSP) =-=[53]-=- and LOGP [16] models, which both aim to serve as high-level performance models of message-passing machines. The important feature of these two models is that they provide a simple abstraction of the ... |

724 |
Universal classes of hash functions
- Carter, Wegman
- 1979
(Show Context)
Citation Context ...niversal”). The function $h_a$, which is called the multiplicative hashing scheme in [35, p. 509], was recently shown by Dietzfelbinger et al. [20] to be two-universal in the sense of Carter and Wegman =-=[12]-=-: For any two distinct numbers $x, y \in [0..2^u - 1]$, $\mathrm{Prob}(h_a(x) = h_a(y)) \le 1/2^{m-1}$; i.e., the collision probability is approximately the same as for a random ... |
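The excerpt above does not define the hash function itself. As a hedged sketch, the block below assumes the usual form of the multiplicative scheme, h_a(x) = (a·x mod 2^u) div 2^(u−m) with an odd multiplier a, and checks the quoted collision bound by brute force for small parameters. The function names and the exhaustive check are illustrative and not taken from the cited papers.

```python
def multiplicative_hash(a, x, u, m):
    """Map a u-bit key x to an m-bit value using an odd u-bit multiplier a.

    Assumed form: (a * x mod 2**u) div 2**(u - m).
    """
    assert a % 2 == 1 and 0 < a < (1 << u)
    return ((a * x) & ((1 << u) - 1)) >> (u - m)


def worst_collision_rate(u, m):
    """Largest fraction of odd multipliers on which any one pair of
    distinct u-bit keys collides (exhaustive, so only for small u)."""
    multipliers = range(1, 1 << u, 2)
    worst = 0
    for x in range(1 << u):
        for y in range(x + 1, 1 << u):
            hits = sum(
                multiplicative_hash(a, x, u, m) == multiplicative_hash(a, y, u, m)
                for a in multipliers
            )
            worst = max(worst, hits)
    return worst / len(multipliers)


if __name__ == "__main__":
    u, m = 6, 3
    print(worst_collision_rate(u, m), "<=", 1 / 2 ** (m - 1))
```

The check enumerates every pair of keys and every odd multiplier, so it is practical only for toy sizes such as u = 6.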

680 |
An introduction to parallel algorithms
- JáJá
- 1992
(Show Context)
Citation Context ...y, by much less. The effect of expansion for the CRAY C90 and CRAY J90 is shown in Fig. 2. Third, we explore scenarios under which two very high-level models for algorithm design, the EREW PRAM (e.g., =-=[32]-=-) and the stronger QRQW PRAM [25], can be effectively mapped onto high-bandwidth machines (small g) when properly accounting for memory bank delay. For the case x < d/g, we observe that (d/x) is an in... |

525 | LogP: Towards a realistic model of parallel computation
- Culler, Karp, et al.
- 1993
(Show Context)
Citation Context ...e goal of hiding many details of parallel machines while still providing guidance in developing efficient algorithms. Examples of such models include the Bulk Synchronous Parallel (BSP) [53] and LOGP =-=[16]-=- models, which both aim to serve as high-level performance models of message-passing machines. The important feature of these two models is that they provide a simple abstraction of the machine’s inte... |

394 | The Tera computer system - Alverson, Callahan, et al. - 1990 |

295 |
Parallel Algorithms for Shared-Memory Machines, volume A
- Karp, Ramachandran
- 1990
(Show Context)
Citation Context ...algorithm design, the QRQW PRAM, can be effectively mapped onto a (d, x)-BSP and, hence, onto high-bandwidth machines. The QRQW PRAM [25] is a variant of the well-studied PRAM model (see, e.g., [32], =-=[34]-=-) that allows for concurrent reading and writing to shared memory locations, but assumes that multiple reads/writes to a location queue up and are serviced one at a time (named the “queue-read queue-w... |

287 |
Probabilistic construction of deterministic algorithms: approximating packing integer programs
- Raghavan
- 1988
(Show Context)
Citation Context ...ntion of bank b greatly exceeds this expected value, we will use the following theorem by Raghavan and Spencer, which provides a tail inequality for the weighted sum of Bernoulli trials: THEOREM 5.2 (=-=[40]-=-). Let $a_1, \ldots, a_m$ be reals in (0, 1]. Let $x_1, \ldots, x_m$ be independent Bernoulli trials with $E(x_j) = \rho_j$. Let $Y_b = \sum_{j=1}^{m} a_j x_j$. If $E(Y_b) > 0$, then, for any $\nu > 0$, $\mathrm{Prob}(Y_b > (1+\nu) E(Y_b)) < \left( e^{\nu} / (1+\nu)^{(1+\nu)} \right)^{E(Y_b)}$. ... |

276 |
Nonuniversal critical dynamics in monte carlo simulations
- Swendsen, Wang
- 1987
(Show Context)
Citation Context ...s is the fastest reported code for the NAS CG benchmark [45]. As another example, the graph connectivity problem is the dominant cost in simulating Ising Spin models using the Swendsen Wang algorithm =-=[50]-=-. The four problems arise from diverse domains, with the intention that the memory access patterns of the algorithms studied will reflect patterns exhibited by a large class of unstructured algorithms... |

262 | Sorting and Searching, volume 3 of The Art of Computer Programming - Knuth - 1998 |

200 |
How to emulate shared memory
- Ranade
- 1987
(Show Context)
Citation Context ...k can be ignored when using random mappings of memory locations to memory banks. Many researchers have studied the effect of randomly mapping memory to banks (e.g., [2], [29], [33], [36], [37], [41], =-=[42]-=-, [43], [53]). If there is sufficient parallel “slackness” (extra parallelism) so that each bank is receiving multiple requests, it has been shown [33], [37], [42], [53] that, with high probability, t... |

196 | Efficient Parallel Algorithms - Gibbons, Rytter - 1988 |

180 | A comparison of sorting algorithms for the Connection Machine CM-2 - Blelloch, Leiserson, et al. - 1991 |

130 |
A Logarithmic Time Sort on Linear Size Networks
- Reif, Valiant
- 1987
(Show Context)
Citation Context ...s a simple parallel binary search to look up n keys in a balanced binary search tree of size m [23]. Such binary searching is an important substep in several algorithms for sorting and merging (e.g., =-=[44]-=-). The algorithm replicates nodes of the search tree to avoid contention, and, at each level, selects one of the replicated nodes at random. This is an interesting problem from the point of view of th... |

111 |
Randomized and Deterministic Simulations of PRAMs by Parallel Machines with Restricted Granularity of Parallel Memories
- Mehlhorn, Vishkin
- 1984
(Show Context)
Citation Context ...a single bank can be ignored when using random mappings of memory locations to memory banks. Many researchers have studied the effect of randomly mapping memory to banks (e.g., [2], [29], [33], [36], =-=[37]-=-, [41], [42], [43], [53]). If there is sufficient parallel “slackness” (extra parallelism) so that each bank is receiving multiple requests, it has been shown [33], [37], [42], [53] that, with high pr... |

101 | Sparcle: An evolutionary processor design for large-scale multiprocessors
- Agarwal, Kubiatowicz, et al.
- 1993
(Show Context)
Citation Context ..., it restricts the kinds of programs that can be used. Multithreading was suggested and implemented for hiding latency on the HEP [46] and was later used in the design of the TERA MTA [2] and Sparcle =-=[1]-=-. Multithreading is more complicated to implement than vectorization, but permits the use of a wider class of programs. Prefetching and nonblocking caches are becoming common on commodity processors, ... |

88 | Pseudo-randomly interleaved memory
- Rau
- 1991
(Show Context)
Citation Context ...be ignored when using random mappings of memory locations to memory banks. Many researchers have studied the effect of randomly mapping memory to banks (e.g., [2], [29], [33], [36], [37], [41], [42], =-=[43]-=-, [53]). If there is sufficient parallel “slackness” (extra parallelism) so that each bank is receiving multiple requests, it has been shown [33], [37], [42], [53] that, with high probability, the mem... |

79 | A pipelined shared resource MIMD computer - Smith - 1978 |

73 | Scientific computing on bulk synchronous parallel architectures
- Bisseling, McColl
- 1994
(Show Context)
Citation Context ...tation to account for aspects that are not considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], =-=[6]-=-, [14], [16], [17], [22], [26], [30], [38]. In this paper, we introduce and evaluate a model with ———————————————— • G.E. Blelloch is with the School of Computer Science, Carnegie Mellon University, P... |

59 |
A reliable randomized algorithm for the closest-pair problem
- Dietzfelbinger, Hagerup, et al.
- 1997
(Show Context)
Citation Context ...le properties for any given input (i.e., that they are “universal”). The function $h_a$, which is called the multiplicative hashing scheme in [35, p. 509], was recently shown by Dietzfelbinger et al. =-=[20]-=- to be two-universal in the sense of Carter and Wegman [12]: For any two distinct numbers $x, y \in [0..2^u - 1]$, $\mathrm{Prob}(h_a(x) = h_a(y)) \le 1/2^{m-1}$; i.e., the collision probability is approximately th... |

58 |
Parallel hashing - an efficient implementation of shared memory
- Karlin, Upfal
- 1986
(Show Context)
Citation Context ...residing in a single bank can be ignored when using random mappings of memory locations to memory banks. Many researchers have studied the effect of randomly mapping memory to banks (e.g., [2], [29], =-=[33]-=-, [36], [37], [41], [42], [43], [53]). If there is sufficient parallel “slackness” (extra parallelism) so that each bank is receiving multiple requests, it has been shown [33], [37], [42], [53] that, ... |

58 |
A library for bulk synchronous parallel programming
- Miller
- 1993
(Show Context)
Citation Context ...t considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], [6], [14], [16], [17], [22], [26], [30], =-=[38]-=-. In this paper, we introduce and evaluate a model with ... |

52 | Methods for message routing in parallel machines - Leighton - 1994 |

48 | Deterministic sorting and randomized median finding on the BSP model
- Gerbessiotis, Siniolakis
- 1996
(Show Context)
Citation Context ...spects that are not considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], [6], [14], [16], [17], =-=[22]-=-, [26], [30], [38]. In this paper, we introduce and evaluate a model with ... |

45 | Can a shared-memory model serve as a bridging model for parallel computation? Theory of Comput
- Gibbons, Matias, et al.
- 1999
(Show Context)
Citation Context ...standing traffic due to the particular cache-coherence protocol. Another area for future work is to study other high-level models that can be efficiently emulated on the (d, x)-BSP. In a recent paper =-=[24]-=-, the high-level Queuing Shared Memory (QSM) model was shown to have a work-preserving emulation on the (d, x)-BSP as long as x ≥ d/g (without restrictions... |

45 | Radix sort for vector multiprocessors
- Zagha, Blelloch
- 1991
(Show Context)
Citation Context ...s of algorithms, we limit ourselves to algorithms involving irregular memory access patterns. This work was motivated by the study of algorithms with irregular memory access patterns, such as sorting =-=[55]-=-, sparse-matrix vector product [7], and graph algorithms [27], [54], on the CRAY C90. In our analysis, we found previous models either quite detailed, or inadequate for describing the key performance ... |

43 | The QRQW PRAM: Accounting for contention in parallel algorithms - Gibbons, Matias, et al. - 1999 |

35 | Towards Efficiency and Portability: Programming with the BSP - Goudreau, Lang, et al. - 1996 |

33 | On the Effective Bandwidth of Interleaved Memories in Vector Processor Systems - Oed, Lange - 1985 |

33 | List ranking and list scan on the Cray C-90 - Reid-Miller - 1994 |

32 | Efficient low-contention parallel algorithms
- Gibbons, Matias, et al.
- 1996
(Show Context)
Citation Context ...ations to memory banks, but, often, it suffices to randomly order the data at the beginning of an algorithm. These PRAM emulations on the (d, x)-BSP generalize the PRAM emulations on the BSP given in =-=[23]-=-, [53]. Finally, we experiment with four algorithms with irregular memory access patterns: a QRQW binary search algorithm, a QRQW random permutation algorithm, a sparse matrix multiply, and a CRCW con... |

32 | Parallel algorithms for personalized communication and sorting with an experimental study
- Helman, Bader, et al.
- 1996
(Show Context)
Citation Context ...are not considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], [6], [14], [16], [17], [22], [26], =-=[30]-=-, [38]. In this paper, we introduce and evaluate a model with ... |

30 | Performance of cached DRAM organizations in vector supercomputers - Hsu, Smith - 1993 |

29 | A comparison of data-parallel algorithms for connected components
- Greiner
- 1994
(Show Context)
Citation Context ...rregular memory access patterns. This work was motivated by the study of algorithms with irregular memory access patterns, such as sorting [55], sparse-matrix vector product [7], and graph algorithms =-=[27]-=-, [54], on the CRAY C90. In our analysis, we found previous models either quite detailed, or inadequate for describing the key performance characteristics of the algorithms. For example, we found that... |

29 |
Block, multistride vector, and FFT accesses in parallel memory systems
- Harper
- 1991
(Show Context)
Citation Context ... IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 8, NO. 9, SEPTEMBER 1997 sors. On the other hand, previous models of multibank memory systems [4], [5], [9], [10], [11], [13], [15], [18], =-=[28]-=-, [29], [47], [48], [49], [52] are highly detailed, and the studies have only considered either regular or random access patterns. In this paper, we are interested in modeling algorithms with irregula... |

29 | Bulk synchronous parallel computing – a paradigm for transportable software - Cheatham, Fahmy, et al. - 1995 |

29 | A simulation study of the Cray X-MP memory system - Cheung, Smith - 1986 |

28 | Segmented operations for sparse matrix computation on vector multiprocessors
- Blelloch, Heroux, et al.
- 1993
(Show Context)
Citation Context ... to algorithms involving irregular memory access patterns. This work was motivated by the study of algorithms with irregular memory access patterns, such as sorting [55], sparse-matrix vector product =-=[7]-=-, and graph algorithms [27], [54], on the CRAY C90. In our analysis, we found previous models either quite detailed, or inadequate for describing the key performance characteristics of the algorithms.... |

27 | Fast parallel sorting under LogP: from theory to practice
- Culler, Dusseau, et al.
- 1993
(Show Context)
Citation Context ... for aspects that are not considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], [6], [14], [16], =-=[17]-=-, [22], [26], [30], [38]. In this paper, we introduce and evaluate a model with ... |

27 | Polynomial hash functions are reliable - Dietzfelbinger, Gil, et al. - 1992 |

26 | Vector Computer Memory Bank Contention
- Bailey
- 1987
(Show Context)
Citation Context ...relative speed of memory banks and processors. On the other hand, previous models of multibank memory systems =-=[4]-=-, [5], [9], [10], [11], [13], [15], [18], [28], [29], [47], [48], [49], [52] are highly detailed, and the studies have only considered either regular or random access patterns. In this paper, we are i... |

26 |
High-bandwidth interleaved memories for vector processors: A simulation study
- Sohi
- 1993
(Show Context)
Citation Context ...sors. On the other hand, previous models of multibank memory systems [4], [5], [9], [10], [11], [13], [15], [18], [28], [29], [47], [48], =-=[49]-=-, [52] are highly detailed, and the studies have only considered either regular or random access patterns. In this paper, we are interested in modeling algorithms with irregular, but not necessarily r... |

25 | Practical Parallel Algorithms for Dynamic Data Redistribution, Median Finding, and Selection - Bader, JaJa - 1995 |

24 | An experimental analysis of parallel sorting algorithms
- Blelloch, Leiserson, et al.
- 1998
(Show Context)
Citation Context ...iments are needed to get an accurate measure of this component. Typically, a small experiment will suffice to get an accurate prediction of work over a range of problem sizes and number of processors =-=[8]-=-, [16]. Another consideration regarding the local environment is in accounting for the use of caches. In cache-based symmetric multiprocessors (SMPs), understanding the cache behavior is often necessa... |

24 | The Queue-Read Queue-Write PRAM Model: Accounting for Contention in Parallel Algorithms
- GIBBONS, MATIAS, et al.
- 1999
(Show Context)
Citation Context ...pansion for the CRAY C90 and CRAY J90 is shown in Fig. 2. Third, we explore scenarios under which two very high-level models for algorithm design, the EREW PRAM (e.g., [32]) and the stronger QRQW PRAM =-=[25]-=-, can be effectively mapped onto high-bandwidth machines (small g) when properly accounting for memory bank delay. For the case x < d/g, we observe that (d/x) is an inevitable work overhead due to the... |

24 | An improved supercomputer sorting benchmark. In: Supercomputing ’92
- Thearling, Smith
- 1992
(Show Context)
Citation Context ... and one using successive ANDings of random keys. As expected, the curves are nearly identical. accesses, we constructed an experiment using the entropy distributions suggested by Thearling and Smith =-=[51]-=-. The distributions are generated by starting with a set of random keys and then bitwise ANDing together each key with another key selected at random. Iterating this process generates a family of dist... |
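The generation procedure described in this excerpt (start from random keys, then repeatedly AND each key with another key chosen at random) can be sketched as follows. This is an illustrative reading of the excerpt, not the benchmark's reference code; the function name, the seeding, and the choice to draw the partner key from the previous round's keys are assumptions.

```python
import random


def entropy_family(n, bits, rounds, seed=0):
    """Return rounds + 1 key lists of length n.

    The first list holds uniformly random `bits`-bit keys. Each later
    list is built by ANDing every key of the previous list with a key
    picked at random from that same previous list, so set bits can
    only disappear from one round to the next.
    """
    rng = random.Random(seed)
    keys = [rng.getrandbits(bits) for _ in range(n)]
    family = [keys]
    for _ in range(rounds):
        # The comprehension reads the old list; `keys` is rebound afterwards.
        keys = [k & keys[rng.randrange(n)] for k in keys]
        family.append(keys)
    return family


if __name__ == "__main__":
    for r, ks in enumerate(entropy_family(n=1024, bits=16, rounds=4)):
        print(r, len(set(ks)))  # number of distinct keys after r rounds
```

Each round can only clear bits, so later members of the family concentrate keys on fewer distinct values.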

23 | On Randomly Interleaved Memories - Raghavan, Hayes - 1990 |

17 | Comparison of Hash Functions for Emulated Shared Memory
- Engelmann, Keller
- 1993
(Show Context)
Citation Context ... of a hash function may be influenced by several factors, including its degree of universality, its evaluation cost, and its congestion behavior, both theoretically (see [19]) and experimentally (see =-=[21]-=-). 5 HIGH-LEVEL PROGRAMMING MODEL In this section and the next, we explore scenarios under which a high-level model for algorithm design, the QRQW PRAM, can be effectively mapped onto a (d, x)-BSP and... |

13 |
Bulk synchronous parallel computing-a paradigm for transportable software
- Cheatham, Fahmy, et al.
- 2005
(Show Context)
Citation Context ...n to account for aspects that are not considered by the models, such as local computation times. As such, they have been quite successful, leading to practical designs of various algorithms [3], [6], =-=[14]-=-, [16], [17], [22], [26], [30], [38]. In this paper, we introduce and evaluate a model with ... |