## Radix Sort For Vector Multiprocessors (1991)

Venue: | In Proceedings Supercomputing '91 |

Citations: | 44 - 6 self |

### BibTeX

@INPROCEEDINGS{Zagha91radixsort,

author = {Marco Zagha and Guy E. Blelloch},

title = {Radix Sort For Vector Multiprocessors},

booktitle = {In Proceedings Supercomputing '91},

year = {1991},

pages = {712--721}

}

### Abstract

We have designed a radix sort algorithm for vector multiprocessors and have implemented the algorithm on the CRAY Y-MP. On one processor of the Y-MP, our sort is over 5 times faster on large sorting problems than the optimized library sort provided by CRAY Research. On eight processors we achieve an additional speedup of almost 5, yielding a routine over 25 times faster than the library sort. Using this multiprocessor version, we can sort at a rate of 15 million 64-bit keys per second. Our sorting algorithm is adapted from a data-parallel algorithm previously designed for a highly parallel Single Instruction Multiple Data (SIMD) computer, the Connection Machine CM-2. To develop our version we introduce three general techniques for mapping data-parallel algorithms onto vector multiprocessors. These techniques allow us to fully vectorize and parallelize the algorithm. The paper also derives equations that model the performance of our algorithm on the Y-MP. These equations are then used t...
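For readers unfamiliar with the underlying method, the following is a minimal serial sketch of least-significant-digit radix sort in plain Python. It is not the authors' vectorized CRAY implementation; it only illustrates the three phases (histogram, scan, permute) that each pass of a radix sort performs. The function name `radix_sort` and the parameters `key_bits` and `radix_bits` are assumptions chosen for illustration, and the default radix of 8 bits is arbitrary.

```python
def radix_sort(keys, key_bits=64, radix_bits=8):
    """Sort non-negative integers below 2**key_bits, one digit per pass.

    Illustrative sketch only: names and defaults are assumptions,
    not taken from the paper.
    """
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1
    for shift in range(0, key_bits, radix_bits):
        # Histogram: count how many keys carry each digit value.
        counts = [0] * num_buckets
        for k in keys:
            counts[(k >> shift) & mask] += 1
        # Scan: an exclusive prefix sum gives each bucket's start offset.
        offsets = [0] * num_buckets
        total = 0
        for digit, count in enumerate(counts):
            offsets[digit] = total
            total += count
        # Permute: move keys to their bucket positions, preserving the
        # relative order of equal digits so that each pass is stable.
        out = [0] * len(keys)
        for k in keys:
            digit = (k >> shift) & mask
            out[offsets[digit]] = k
            offsets[digit] += 1
        keys = out
    return keys
```

A smaller radix means more passes over the keys with smaller bucket tables, and a larger radix means fewer passes with larger tables, which is the trade-off behind choosing a radix size.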

### Citations

307 |
Advanced Compiler Optimizations for Supercomputers
- Padua, Wolfe
- 1986
Citation Context ... be fully vectorized. Loop Raking: A new technique we call loop raking is used for vectorizing the loops. Instead of loading contiguous blocks of a vector on each vector load, as done in strip mining =-=[16]-=-, loop-raking uses a constant stride to load a set of elements evenly distributed across the input vector---as if a rake was placed over the vector. Loop raking is necessary to keep each pass of the r... |

277 | Parallel Prefix Computation
- Ladner, Fischer
- 1980
Citation Context ...culated by flattening the matrix into column major order and executing a SCAN-BUCKETS on the flattened matrix. The SCAN-BUCKETS operation can be parallelized using a tree-summing or similar algorithm =-=[12]-=-. The parallel version generates the same permutation as the serial algorithm so the sort remains stable. [Figure: key values 0 to 3 and the keys held by proc 0, proc 1, ...] ... |

178 | A comparison of sorting algorithms for the Connection Machine CM-2
- Blelloch, Leiserson, et al.
- 1991
Citation Context ...ng our own macro assembler to generate Cray Assembly Language (CAL) [9]. Our sorting algorithm is adapted from a data-parallel radix-sort algorithm previously designed for the Connection Machine CM-2 =-=[3]-=-. To generate an efficient vector multiprocessor algorithm from the data-parallel algorithm, we use the following techniques: Virtual Processors: Each element of a vector register is viewed as a virtua... |

162 | Scans as Primitive Parallel Operations
- Blelloch
- 1989
Citation Context ...ns stable. [Figure: keys held by proc 0 through proc 3, their per-processor histograms for key values 0 to 3, and the resulting Bucket[0], Bucket[1], Bucket=-=[2]-=-, and Bucket[3] offsets into the result vector.] Figure 3: The scan step of parallel radix sort. The algorithm is illustrated with 4 processors and 4 buckets for the ... |

81 | Introduction to Algorithms - Rivest - 1992 |

59 |
Deterministic coin tossing and accelerating cascades: micro and macro techniques for designing parallel algorithms
- Cole, Vishkin
- 1986
Citation Context ... for its own N/P subset of the keys and tabulates them in its own set of buckets. A restricted version of this algorithm was described by Cole and Vishkin as part of an optimal 2-ruling set algorithm =-=[5]-=-. The full algorithm was also implemented on the Connection Machine CM-2 [3]. In this parallel version of radix sort, the buckets can be viewed as a matrix Buckets[i, j]: Buckets (j) 0 1 2 .. m -... |

39 | Scan primitives for vector computers
- Chatterjee, Blelloch, et al.
- 1990
Citation Context ...process items $\langle i,\ i + s,\ i + 2 \cdot s,\ \ldots,\ i + (L - 1) \cdot s \rangle$. To avoid excessive bank conflicts during a vector load, L is selected so that s is odd (see discussion of shape factors in =-=[4]-=-). To extend loop raking to multiple processors, a vector is divided into P parts, and each part is raked by one of the processors. As will be discussed in Section 4, loop raking is used in all three ... |

14 | Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast
- Natvig
- 1990
Citation Context ...the sorted keys rather than the actual sorted keys. Our implementation of SORT is about 10% faster than our implementation of ORDERS. sorting, the large constant factors hidden by asymptotic analysis =-=[15]-=-, and the fact that although the theoretical sorts are efficient in some theoretical models of a parallel machine, these models do not accurately portray real machines. Many of the sorting algorithms ... |

13 | Algorithms from P to NP - Moret, Shapiro - 1991 |

6 | Timing results of some internal sorting algorithms on vector computers - Roensch, Strauss - 1987 |

2 |
Sorting networks and their applications
- Batcher
- 1968
Citation Context ...gorithm on the Y-MP. These equations are then used to optimize the radix size. 1 Introduction Sorting is one of the most heavily studied problems in computer science. A sorting algorithm, bitonic sort =-=[1]-=-, was one of the first algorithms designed for parallel machines, and since then hundreds of articles on parallel sorting have appeared in the literature. Although this work has developed a robust the... |

2 | Sorting and Searching, volume 3 of The Art of Computer Programming - Knuth - 1973 |

2 |
A Fully Vectorized Quicksort
- Levin
- 1990
Citation Context ...than on cache based machines. This seems to be true. For large data sets, our single processor sort runs between 3 and 5 times faster than a fully vectorized version of quicksort implemented by Levin =-=[13]-=-. 1 Serial Radix Sort We first review the serial radix sort since our parallel algorithm will include the same phases (see [6, Section 9.3] for more details). Radix sort works by breaking keys into di... |

1 |
The Cray X-MP/Model 24: A Case Study
- Robbins, Robbins
- 1989
Citation Context ... same bank, the second access will be delayed. A memory location X is contained in bank X mod B, causing sequential accesses to be very efficient, since locations are striped across the memory banks =-=[17]-=-. But when performing data dependent access to the buckets in the radix sort, memory accesses may refer to the same banks. Ideally, we would like to arrange the buckets so that each access would be gu... |
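
The citation contexts above describe two ideas that a short sketch may make more concrete: the strided index sets used by loop raking, and one pass of the parallel radix sort in which each processor keeps its own buckets and a single scan over the flattened bucket matrix yields the offsets. The Python below is a serial simulation under stated assumptions, not the authors' CRAY code. The names `rake_indices` and `parallel_radix_pass` are hypothetical, `lanes` stands in for the vector length L, and the scan order (all processors' counts for digit 0, then digit 1, and so on) is an interpretation of the "column major" flattening mentioned in the excerpts.

```python
def rake_indices(n, lanes):
    """Index sets touched by successive strided loads over n items.

    Load i touches i, i + s, i + 2*s, ..., where s = ceil(n / lanes),
    so each lane walks its own contiguous block of the input.
    """
    stride = -(-n // lanes)  # ceiling division
    return [
        [i + lane * stride for lane in range(lanes) if i + lane * stride < n]
        for i in range(stride)
    ]


def parallel_radix_pass(keys, procs, shift, radix_bits):
    """One stable radix pass, simulating procs processors serially."""
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1
    n = len(keys)
    chunk = -(-n // procs)  # each simulated processor owns about n/procs keys
    parts = [keys[p * chunk:(p + 1) * chunk] for p in range(procs)]

    # Histogram: buckets[p][d] counts keys with digit d held by processor p.
    buckets = [[0] * num_buckets for _ in range(procs)]
    for p, part in enumerate(parts):
        for k in part:
            buckets[p][(k >> shift) & mask] += 1

    # Scan: exclusive prefix sum over the matrix in digit-major order, so
    # processor p's keys with digit d land after those of processors < p.
    offsets = [[0] * num_buckets for _ in range(procs)]
    total = 0
    for d in range(num_buckets):
        for p in range(procs):
            offsets[p][d] = total
            total += buckets[p][d]

    # Permute: each processor writes its keys to its reserved positions.
    out = [0] * n
    for p, part in enumerate(parts):
        for k in part:
            d = (k >> shift) & mask
            out[offsets[p][d]] = k
            offsets[p][d] += 1
    return out
```

Because the scan visits processors in order within each digit, this pass produces the same permutation as a serial pass over the same keys, which is consistent with the stability claim in the excerpt for [12].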