## GPU-ABiSort: Optimal parallel sorting on stream architectures (2006)

Venue: | IN PROCEEDINGS OF THE 20TH IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS ’06) (APR |

Citations: | 34 - 0 self |

### BibTeX

@INPROCEEDINGS{Greß06gpu-abisort:optimal,

author = {Alexander Greß and Gabriel Zachmann},

title = {GPU-ABiSort: Optimal parallel sorting on stream architectures},

booktitle = {IN PROCEEDINGS OF THE 20TH IEEE INTERNATIONAL PARALLEL AND DISTRIBUTED PROCESSING SYMPOSIUM (IPDPS ’06) (APR},

year = {2006},

pages = {45},

publisher = {}

}

### Abstract

In this paper, we present a novel approach for parallel sorting on stream processing architectures. It is based on adaptive bitonic sorting. For sorting n values utilizing p stream processor units, this approach achieves the optimal time complexity O((n log n)/p). This makes our approach competitive with common sequential sorting algorithms from a theoretical viewpoint, and it is also very fast in practice. This is achieved by using efficient linear stream memory accesses and by combining the optimal time approach with algorithms optimized for small input sequences. We present an implementation on modern programmable graphics hardware (GPUs). On recent GPUs, our optimal parallel sorting approach has been shown to be remarkably faster than sequential sorting on the CPU, and it is also faster than previous non-optimal sorting approaches on the GPU for sufficiently large input sequences. Because of the excellent scalability of our algorithm with the number of stream processor units p (up to n / log² n or even n / log n units, depending on the stream architecture), our approach profits heavily from the trend of increasing numbers of fragment processor units on GPUs, so that we can expect further speed improvement with upcoming GPU generations.
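As a quick check of the processor bounds quoted in the abstract, the largest admissible values of p can be substituted into the stated running time. The following is only arithmetic on the abstract's formulas (T is a symbol introduced here for the running time), not a result taken from the paper:

```latex
T(n,p) = O\!\left(\frac{n \log n}{p}\right)
\quad\Longrightarrow\quad
T\!\left(n, \frac{n}{\log n}\right) = O(\log^2 n),
\qquad
T\!\left(n, \frac{n}{\log^2 n}\right) = O(\log^3 n).
```

In other words, with the maximum number of units the running time stays polylogarithmic in n, while the total work p · T remains O(n log n).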

### Citations

504 | Sorting networks and their applications
- Batcher
- 1968
Citation Context: ...ng can be adapted to a stream processor, which does not have the ability of random-access writes, as we will show in this paper. Adaptive bitonic sorting is based on Batcher’s bitonic sorting network [4], which is a conceptually simpler approach that achieves only the non-optimal parallel running time O(log² n) for a sorting network of n nodes. 2.2. GPU-based sorting Several sorting approaches on st...
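To make the O(log² n) figure for Batcher's network concrete, here is a minimal sequential sketch that simulates a bitonic sorting network and counts its stages. It is an illustration only: the function name `bitonic_sort_network` is hypothetical, the code is plain Python rather than a GPU kernel, and it shows the classic (non-adaptive) bitonic network, not the adaptive bitonic sort the paper builds on. Like the network-based implementations mentioned in the citation contexts, it assumes a power-of-two input length.

```python
def bitonic_sort_network(values):
    """Simulate Batcher's bitonic sorting network on a power-of-two-length list.

    Returns (sorted_list, stage_count). Every stage consists of independent
    compare-exchange operations, so on a parallel machine with enough units
    the stage count corresponds to the parallel running time.
    """
    data = list(values)
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")

    stages = 0
    k = 2                      # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2             # distance between compared elements
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    # Blocks alternate between ascending and descending order
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            stages += 1
            j //= 2
        k *= 2
    return data, stages


if __name__ == "__main__":
    result, stages = bitonic_sort_network([7, 3, 6, 1, 8, 2, 5, 4])
    print(result, stages)
```

For n = 2^k inputs the loop structure runs 1 + 2 + ... + k = k(k+1)/2 stages, which is where the O(log² n) parallel running time comes from.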

286 | Parallel Merge Sort
- Cole
- 1988
Citation Context: ...hms for sorting on a CREW-PRAM or EREW-PRAM model have been extensively studied. Ajtai, Komlos, and Szemeredi [1] showed how optimal asymptotic complexity can be achieved with a sorting network. Cole [7] presented a parallel merge sort approach for the CREW-PRAM as well as for the EREW-PRAM, which achieves optimal asymptotic complexity on that architecture. However, although asymptotically optimal, i...

209 | An O(n log n) sorting network
- Ajtai, Komlós, et al.
- 1983
Citation Context: ...r the reader to [2]. Especially parallel sorting using sorting networks as well as algorithms for sorting on a CREW-PRAM or EREW-PRAM model have been extensively studied. Ajtai, Komlos, and Szemeredi [1] showed how optimal asymptotic complexity can be achieved with a sorting network. Cole [7] presented a parallel merge sort approach for the CREW-PRAM as well as for the EREW-PRAM, which achieves optim...

181 | Efficient Parallel Algorithms
- Gibbons, Rytter
- 1988
Citation Context: ...itecture. However, although asymptotically optimal, it has been shown that neither the AKS sorting network nor Cole’s parallel merge sort is fast in practice for reasonable numbers of values to sort [8, 15]. Adaptive bitonic sorting [5] is another optimal parallel sorting approach for a shared-memory EREW-PRAM architecture (also called PRAC for parallel random access computer). It requires a smaller numb...

174 | Brook for GPUs: stream computing on graphics hardware

146 | Brook for GPUs: stream computing on graphics hardware
- Buck, Foley, et al.
- 2004

126 | Photon mapping on programmable graphics hardware
- Purcell, Donner, et al.
- 2003
Citation Context: ...s achieve only the non-optimal time complexity O((n log² n)/p) on a stream architecture with p processor units (in worst and average case since sorting networks are data-independent). Purcell et al. [18] presented a bitonic sorting network implementation for the GPU which is based on an equivalent implementation for the Imagine stream processor by Kapasi et al. [12]. Kipfer et al. [13, 14] implemente...

126 | Programmable stream processors
- Kapasi, Rixner, et al.

104 | GPUTeraSort: high performance graphics co-processor sorting for large database management
- Govindaraju, Gray, et al.
- 2006
Citation Context: ... on which optimal parallel sorting can be implemented. As in other bitonic sorting network based approaches, their GPU implementation is restricted to power-of-two sequence lengths. In a recent paper [GGKM05], Govindaraju et al. embedded the GPU-based bitonic sorting algorithm into a hybrid CPU/GPU sorting approach which is capable of processing large out-of-core databases and wide sort keys. This is achi...

84 | Parallel Sorting Algorithms
- Akl
- 1989
Citation Context: ...ated work 2.1. Optimal parallel sorting Many innovative parallel sorting algorithms have been proposed for several different parallel architectures. For a comprehensive review, we refer the reader to [2]. Especially parallel sorting using sorting networks as well as algorithms for sorting on a CREW-PRAM or EREW-PRAM model have been extensively studied. Ajtai, Komlos, and Szemeredi [1] showed how opti...

78 | UberFlow: a GPU-based particle engine
- Kipfer, Segal, et al.
Citation Context: ...Purcell et al. [18] presented a bitonic sorting network implementation for the GPU which is based on an equivalent implementation for the Imagine stream processor by Kapasi et al. [12]. Kipfer et al. [13, 14] implemented a bitonic as well as an odd-even merge sort network on the GPU. Govindaraju et al. presented an implementation based on the periodic balanced sorting network [10] and, more recently, also...

63 | The Design and Analysis of a Cache Architecture for Texture Mapping
- Hakura, Gupta
- 1997
Citation Context: ... the use case of accessing 2D texture data during rasterization, for which in general a cache architecture where each cache block holds a square or near-square region of the texture data is favorable [HG97]. As a consequence, for streaming reads from a rectangular memory block (substream) of a 2D stream the maximum read bandwidth is only achieved if this substream has a square or near-square shape (as i...

46 | Fast and approximate stream mining of quantiles and frequencies using graphics processors
- Govindaraju, Raghuvanshi, et al.
- 2005
Citation Context: ... [12]. Kipfer et al. [13, 14] implemented a bitonic as well as an odd-even merge sort network on the GPU. Govindaraju et al. presented an implementation based on the periodic balanced sorting network [10] and, more recently, also an implementation based on the bitonic sorting network [9]. The latter has been highly optimized for cache efficiency and is the fastest of the approaches above. On an NVIDIA...

44 | Efficient Conditional Operations for Data-parallel Architectures
- Kapasi, Dally, et al.
- 2000
Citation Context: ... data-independent). Purcell et al. [18] presented a bitonic sorting network implementation for the GPU which is based on an equivalent implementation for the Imagine stream processor by Kapasi et al. [12]. Kipfer et al. [13, 14] implemented a bitonic as well as an odd-even merge sort network on the GPU. Govindaraju et al. presented an implementation based on the periodic balanced sorting network [10] ...

34 | Adaptive bitonic sorting: An optimal parallel algorithm for shared memory machines
- Bilardi, Nicolau
- 1989
Citation Context: ...se time, but to our knowledge no sorting algorithms for stream processors with optimal time complexity O((n log n)/p) have been proposed so far. Our approach, which is based on Adaptive Bitonic Sorting [5], achieves this optimal time complexity on stream architectures with up to p = n / log n processor units. The approach can even be implemented on stream architectures with the restriction that a stream...

24 | Streaming architectures and technology trends
- Owens
- 2005
Citation Context: ...tonic sorting network based approaches, their implementation is restricted to sequence lengths that are a power of two. 3. The stream programming model 3.1. The basics In the stream programming model [12, 17, 6, 16], the basic program structure is described by streams of data passing through computation kernels. A stream is an ordered set of data of an arbitrary (simple or complex) datatype. Kernels perform comp...

21 | Improved GPU sorting
- Kipfer, Westermann
- 2005
Citation Context: ...Purcell et al. [18] presented a bitonic sorting network implementation for the GPU which is based on an equivalent implementation for the Imagine stream processor by Kapasi et al. [12]. Kipfer et al. [13, 14] implemented a bitonic as well as an odd-even merge sort network on the GPU. Govindaraju et al. presented an implementation based on the periodic balanced sorting network [10] and, more recently, also...

19 | A cache-efficient sorting algorithm for database and data mining computations using graphics processors
- Govindaraju, Raghuvanshi, et al.
- 2005
Citation Context: ... network on the GPU. Govindaraju et al. presented an implementation based on the periodic balanced sorting network [10] and, more recently, also an implementation based on the bitonic sorting network [9]. The latter has been highly optimized for cache efficiency and is the fastest of the approaches above. On an NVIDIA GeForce 7800 GTX GPU it performs about twice as fast as the best quick sort impleme...

17 | Computer Graphics on a Stream Architecture
- Owens
- 2002
Citation Context: ...tonic sorting network based approaches, their implementation is restricted to sequence lengths that are a power of two. 3. The stream programming model 3.1. The basics In the stream programming model [12, 17, 6, 16], the basic program structure is described by streams of data passing through computation kernels. A stream is an ordered set of data of an arbitrary (simple or complex) datatype. Kernels perform comp...

14 | Logarithmic Time Cost Optimal Parallel Sorting is Not Yet Fast in Practice
- Natvig
- 1990
Citation Context: ...itecture. However, although asymptotically optimal, it has been shown that neither the AKS sorting network nor Cole’s parallel merge sort is fast in practice for reasonable numbers of values to sort [8, 15]. Adaptive bitonic sorting [5] is another optimal parallel sorting approach for a shared-memory EREW-PRAM architecture (also called PRAC for parallel random access computer). It requires a smaller numb...

8 | Multicores from the compiler’s perspective: A blessing or a curse
- Amarasinghe
- 2005
Citation Context: ...e parallelism of an algorithm more effectively. For developing efficient applications on such architectures with maximum programmer productivity, alternative programming paradigms seem to be required [3]. The stream programming model has been shown to be a promising approach going in this direction. Furthermore, the stream programming model provided the foundations for the architecture of modern programma...

7 | An O(n log n) sorting network
- Ajtai, Komlós, Szemerédi
- 1983
Citation Context: ...e reader to [Akl90]. Especially parallel sorting using sorting networks as well as algorithms for sorting on a CREW-PRAM or EREW-PRAM model have been extensively studied. Ajtai, Komlos, and Szemeredi [AKS83] showed how optimal asymptotic complexity can be achieved with a sorting network. Cole [Col88] presented a parallel merge sort approach for the CREW-PRAM as well as for the EREW-PRAM, which achieves o...

5 | A computer-oriented geodetic data base and a new technique in file sequencing
- Morton
- 1966
Citation Context: ...ed above, we propose the usage of an alternative, GPU-cache-optimized mapping between 1D and 2D streams where the 2D space is mapped to 1D along a space-filling curve known as Z-order or Morton order [Mor66]: Assume that a 1D integer index a is given, which has the bit representation (a31, . . . , a1, a0). Then this index is mapped to the 2D index (ax, ay) where ax has the bit representation (a30, . . ...
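The Z-order mapping described in this context can be sketched as a bit de-interleaving. The excerpt is truncated after "(a30, . . .", so the code below assumes the standard Morton layout in which ax collects the even-numbered bits (a30, ..., a2, a0) and ay the odd-numbered bits (a31, ..., a3, a1); that completion is an assumption, and the function names are hypothetical rather than taken from the paper.

```python
def _compact_even_bits(a):
    """Gather bits 0, 2, 4, ..., 30 of a 32-bit integer into a 16-bit integer."""
    result = 0
    for bit in range(16):
        result |= ((a >> (2 * bit)) & 1) << bit
    return result


def morton_decode(a):
    """Map a 1D index to a 2D index (ax, ay) along the Z-order curve.

    Assumption: ax takes the even-numbered bits of a, ay the odd-numbered bits.
    """
    ax = _compact_even_bits(a)
    ay = _compact_even_bits(a >> 1)
    return ax, ay


def morton_encode(ax, ay):
    """Inverse mapping: interleave the bits of ax (even slots) and ay (odd slots)."""
    a = 0
    for bit in range(16):
        a |= ((ax >> bit) & 1) << (2 * bit)
        a |= ((ay >> bit) & 1) << (2 * bit + 1)
    return a


if __name__ == "__main__":
    # The first 16 indices of the 1D stream cover a 4x4 block of the 2D stream.
    print([morton_decode(a) for a in range(16)])
```

Under this layout any aligned run of 4^k consecutive 1D indices covers a square 2^k × 2^k block in 2D, which matches the stated preference for square or near-square substreams.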

2 | Efficient conditional operations for data-parallel architectures
- Kapasi, Dally, et al.