## Designing Efficient Sorting Algorithms for Manycore GPUs (2009)

Citations: 68 (4 self)

### BibTeX

@MISC{Satish09designingefficient,
  author = {Nadathur Satish and Mark Harris and Michael Garland},
  title = {Designing Efficient Sorting Algorithms for Manycore GPUs},
  year = {2009}
}

### Abstract

We describe the design of high-performance parallel radix sort and merge sort routines for manycore GPUs, taking advantage of the full programmability offered by CUDA. Our radix sort is the fastest GPU sort and our merge sort is the fastest comparison-based sort reported in the literature. Our radix sort is up to 4 times faster than the graphics-based GPUSort and greater than 2 times faster than other CUDA-based radix sorts. It is also 23% faster, on average, than even a very carefully optimized multicore CPU sorting routine. To achieve this performance, we carefully design our algorithms to expose substantial fine-grained parallelism and decompose the computation into independent tasks that perform minimal global communication. We exploit the high-speed on-chip shared memory provided by NVIDIA’s GPU architecture and efficient data-parallel primitives, particularly parallel scan. While targeted at GPUs, these algorithms should also be well-suited for other manycore processors.
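To make the role of parallel scan in radix sort concrete, here is a minimal sequential sketch in Python. It is not the paper's optimized CUDA implementation; it only illustrates the basic idea of a least-significant-bit-first radix sort in which each pass is a stable "split" on one key bit, with output positions computed by an exclusive prefix sum. The function names (`exclusive_scan`, `split_by_bit`, `radix_sort`) and the `key_bits` parameter are hypothetical, and keys are assumed to be non-negative integers.

```python
from itertools import accumulate


def exclusive_scan(values):
    """Exclusive prefix sum: out[i] is the sum of values[0..i-1]."""
    if not values:
        return []
    inclusive = list(accumulate(values))
    return [0] + inclusive[:-1]


def split_by_bit(keys, bit):
    """Stable partition: keys whose `bit` is 0 come first, then keys whose `bit` is 1."""
    flags = [(k >> bit) & 1 for k in keys]      # 1 where the bit is set
    zeros = [1 - f for f in flags]              # 1 where the bit is clear
    zero_pos = exclusive_scan(zeros)            # rank among the 0-keys
    one_pos = exclusive_scan(flags)             # rank among the 1-keys
    total_zeros = sum(zeros)
    out = [None] * len(keys)
    # On a GPU every element would be scattered by its own thread;
    # here the scatter is an ordinary loop.
    for i, k in enumerate(keys):
        dest = total_zeros + one_pos[i] if flags[i] else zero_pos[i]
        out[dest] = k
    return out


def radix_sort(keys, key_bits=32):
    """LSD radix sort of non-negative integers, one split pass per bit."""
    keys = list(keys)
    for bit in range(key_bits):
        keys = split_by_bit(keys, bit)
    return keys
```

For example, `radix_sort([5, 3, 0, 7, 3, 1], key_bits=3)` yields the keys in ascending order. One pass per bit keeps the sketch short but touches the whole array once for every bit of the key, which is why practical implementations process several bits per pass.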

### Citations

9061 | Introduction to Algorithms
- Cormen, Leiserson, et al.
- 2001
Citation Context ...te favorably with a reference CPU implementation running on an 8-core system. 2. Related Work The study of sorting techniques has a long history and countless algorithmic variants have been developed [18, 5]. Many important classes of algorithms rely on sort or sort-like primitives. Database systems make extensive use of sorting operations [9]. The construction of spatial data structures that are essentia...

2023 | Mapreduce: simplified data processing on large clusters
- Dean, Ghemawat
- 2008
Citation Context ...fundamentally a sorting process. Efficient sort routines are also a useful building block in implementing algorithms like sparse matrix multiplication and parallel programming patterns like MapReduce [7, 14]. Parallel Sorting. The importance of sorting has led to the design of efficient sorting algorithms for a variety of parallel architectures [1]. One notable vein of research in this area has been on ...

744 | The Art of Computer Programming, Volume 3: Sorting and Searching
- Knuth
- 1973
Citation Context ...te favorably with a reference CPU implementation running on an 8-core system. 2. Related Work The study of sorting techniques has a long history and countless algorithmic variants have been developed [18, 5]. Many important classes of algorithms rely on sort or sort-like primitives. Database systems make extensive use of sorting operations [9]. The construction of spatial data structures that are essentia...

545 | Sorting networks and their applications
- Batcher
- 1968
Citation Context ...ms for a variety of parallel architectures [1]. One notable vein of research in this area has been on parallel sorting networks, of which the most frequently used is Batcher’s bitonic sorting network [2]. Sorting networks are inherently parallel as they are formalized in terms of physically parallel comparator devices. Algorithms based on sorting networks are particularly attractive on platforms wher...

290 | Parallel merge sort - Cole - 1988 |

270 | Vector Models for Data-Parallel Computing
- Blelloch
- 1990
Citation Context ...for efficient algorithm design on such architectures. In radix sort, we exploit the inherent fine-grained parallelism of the algorithm by building our algorithm upon efficient parallel scan operations [3]. We expose fine-grained parallelism in merge sort by developing an algorithm for pairwise parallel merging of sorted sequences, adapting schemes for parallel splitting [12] and binary search [4] prev...

177 | A Comparison of Sorting Algorithms for the Connection - Blelloch, Leiserson, et al. - 1991 |

161 | NVIDIA Tesla: A unified graphics and computing architecture
- Lindholm, Nickolls, et al.
Citation Context ...drive towards increased chip-level parallelism for some time and are already fundamentally manycore processors. Current NVIDIA GPUs, for example, contain up to 128 scalar processing elements per chip [19], and in contrast to earlier generations of GPUs, they can be programmed directly in C using CUDA [20, 21]. In this paper, we describe the design of efficient sorting algorithms for such manycore GPUs...

158 | Heuristic Ray Shooting Algorithms - Havran |

133 | Scalable parallel programming with CUDA
- Nickolls, Buck, et al.
Citation Context ...ocessors. Current NVIDIA GPUs, for example, contain up to 128 scalar processing elements per chip [19], and in contrast to earlier generations of GPUs, they can be programmed directly in C using CUDA [20, 21]. In this paper, we describe the design of efficient sorting algorithms for such manycore GPUs using CUDA. The programming flexibility provided by CUDA and the current generation of GPUs allows us to ...

131 | Scan primitives for GPU computing
- Sengupta, Harris, et al.
- 2007
Citation Context ...the system as well. Modern GPUs, supported by the CUDA software environment, provide much greater flexibility to explore a broader range of parallel algorithms. Harris et al. [13] and Sengupta et al. [22] developed efficient implementations of scan and segmented scan data-parallel primitives, using these to implement both radix sort and quicksort. Le Grand [10] proposed a radix sort algorithm using a ...

127 | A logarithmic time sort for linear size networks - Reif, Valiant - 1987 |

114 | GPUTeraSort: High performance graphics co-processor sorting for large database management
- Govindaraju, Gray, et al.
- 2006
Citation Context ...only. Because of these restrictions, the most successful GPU sorting routines implemented via the graphics API have been based on bitonic sort [17]. The GPUSort system developed by Govindaraju et al. [8] is one of the best performing graphics-based sorts, although it suffers from the O(n log² n) work complexity typical of bitonic methods. Greß and Zachmann [11] improve the complexity of their GPU-ABi...

92 | A more practical PRAM model - Gibbons - 1989 |

89 | Parallel Sorting Algorithms
- Akl
- 1985
Citation Context ...on and parallel programming patterns like MapReduce [7, 14]. Parallel Sorting. The importance of sorting has led to the design of efficient sorting algorithms for a variety of parallel architectures [1]. One notable vein of research in this area has been on parallel sorting networks, of which the most frequently used is Batcher’s bitonic sorting network [2]. Sorting networks are inherently parallel ...

71 | Mars: a MapReduce framework on graphics processors
- He, Fang, et al.
- 2008
Citation Context ...fundamentally a sorting process. Efficient sort routines are also a useful building block in implementing algorithms like sparse matrix multiplication and parallel programming patterns like MapReduce [7, 14]. Parallel Sorting. The importance of sorting has led to the design of efficient sorting algorithms for a variety of parallel architectures [1]. One notable vein of research in this area has been on ...

61 | MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs
- Stratton, Stone, et al.
- 2008
Citation Context ...n a parallel device. Typically, the host program executes on the CPU and the parallel kernels execute on the GPU, although CUDA kernels may also be compiled for efficient execution on multi-core CPUs [23]. A kernel executes a scalar sequential program across a set of parallel threads. The programmer organizes these threads into thread blocks; a kernel thus consists of a grid of ...

56 | Parallel Prefix Sum (Scan) with CUDA
- Harris, Sengupta, et al.
- 2007
Citation Context ... measured performance of the system as well. Modern GPUs, supported by the CUDA software environment, provide much greater flexibility to explore a broader range of parallel algorithms. Harris et al. [13] and Sengupta et al. [22] developed efficient implementations of scan and segmented scan data-parallel primitives, using these to implement both radix sort and quicksort. Le Grand [10] proposed a radi...

55 | Parallel sorting and data partitioning by sampling - Huang, Chow - 1983 |

54 | Fast Parallel Sorting Under LogP: Experience with the CM-5 - Dusseau, Culler, et al. - 1996 |

52 | Samplesort: A sampling approach to minimal storage tree sorting - Frazer, McKellar - 1970 |

44 | Radix Sort for Vector Multiprocessors
- Zagha, Blelloch
- 1991
Citation Context ...x sort and quicksort can be implemented using parallel scan and segmented scan primitives, respectively. Such techniques often translate well to vector machines, as demonstrated by Zagha and Blelloch [24], and are a good fit for the GPU as well. Sorting on GPUs. Most previous attempts at designing sorting algorithms for the GPU have been made using the graphics API and the pixel shader programs that i...

36 | Scalable parallel programming with - Nickolls, Buck, et al. |

36 | GPU-ABiSort: Optimal parallel sorting on stream architectures
- Greß, Zachmann
- 2006
Citation Context ...t system developed by Govindaraju et al. [8] is one of the best performing graphics-based sorts, although it suffers from the O(n log² n) work complexity typical of bitonic methods. Greß and Zachmann [11] improve the complexity of their GPU-ABiSort system to O(n log n) by using an adaptive data structure that enables merges to be done in linear time and also demonstrate that this improves the measured ...

35 | CUDA Programming Guide
- NVIDIA Corporation
- 2007
Citation Context ...ocessors. Current NVIDIA GPUs, for example, contain up to 128 scalar processing elements per chip [19], and in contrast to earlier generations of GPUs, they can be programmed directly in C using CUDA [20, 21]. In this paper, we describe the design of efficient sorting algorithms for such manycore GPUs using CUDA. The programming flexibility provided by CUDA and the current generation of GPUs allows us to ...

24 | An improved supercomputer sorting benchmark. In: Supercomputing ’92 - Thearling, Smith - 1992 |

22 | Improved GPU sorting
- Kipfer, Westermann
- 2005
Citation Context ...wed and all memory regions are either read-only or write-only. Because of these restrictions, the most successful GPU sorting routines implemented via the graphics API have been based on bitonic sort [17]. The GPUSort system developed by Govindaraju et al. [8] is one of the best performing graphics-based sorts, although it suffers from the O(n log² n) work complexity typical of bitonic methods. Greß a...

21 | Optimal merging and sorting
- Hagerup, Rüb
- 1989
Citation Context ...nt parallel scan operations [3]. We expose fine-grained parallelism in merge sort by developing an algorithm for pairwise parallel merging of sorted sequences, adapting schemes for parallel splitting [12] and binary search [4] previously described in the literature. We demonstrate how to impose a block-wise structure on the sorting algorithms, allowing us to exploit the fast on-chip memory provided by...

19 | Efficient gather and scatter operations on graphics processors
- He, Govindaraju, et al.
- 2007
Citation Context ...l primitives, using these to implement both radix sort and quicksort. Le Grand [10] proposed a radix sort algorithm using a larger radix and per-processor histograms to improve performance. He et al. [15] used a similar strategy to reduce scattering in their Most Significant Digit (MSD) radix sort implementation. These radix sort algorithms have been some of the fastest GPU sorting systems. 3. Paralle...

18 | A practical quicksort algorithm for graphics processors - Cederman, Tsigas - 2008 |

16 | Implementing sorting in database systems
- Graefe
- 2006
Citation Context ...ory and countless algorithmic variants have been developed [18, 5]. Many important classes of algorithms rely on sort or sort-like primitives. Database systems make extensive use of sorting operations [9]. The construction of spatial data structures that are essential in computer graphics and geographic information systems is fundamentally a sorting process. Efficient sort routines are also a useful b...

10 | Broad-Phase collision detection with CUDA - Grand - 2007 |

5 | CUDA data parallel primitives library. http://www.gpgpu.org/developer/cudpp
- 2008
Citation Context ...based on 1-bit keys as a “split” operation, and has been implemented in CUDA by Harris et al. [13]. An implementation is publicly available as part of the CUDA Data-Parallel Primitives (CUDPP) library [6]. This approach to implementing radix sort is conceptually quite simple. Given scan and permute primitives, it is straightforward to implement. However, it is not particularly efficient when the array...

5 | Broad-phase collision detection with CUDA
- Le Grand
- 2007
Citation Context .... Harris et al. [13] and Sengupta et al. [22] developed efficient implementations of scan and segmented scan data-parallel primitives, using these to implement both radix sort and quicksort. Le Grand [10] proposed a radix sort algorithm using a larger radix and per-processor histograms to improve performance. He et al. [15] used a similar strategy to reduce scattering in their Most Significant Digit (...

4 | Merging with parallel processors - Gavril - 1975 |

3 | Efficient implementation of sorting on multicore - Chhugani, Nguyen, et al. - 2008 |

2 | Efficient parallel binary search on sorted arrays, with applications
- Chen
- 1995
Citation Context ...ions [3]. We expose fine-grained parallelism in merge sort by developing an algorithm for pairwise parallel merging of sorted sequences, adapting schemes for parallel splitting [12] and binary search [4] previously described in the literature. We demonstrate how to impose a block-wise structure on the sorting algorithms, allowing us to exploit the fast on-chip memory provided by the GPU architecture....