## 2.2 Sorting Numbers on GPUs

### BibTeX

@MISC{Bandyopadhyay_2.2sorting,
  author = {Shibdas Bandyopadhyay and Sartaj Sahni},
  title = {2.2 Sorting Numbers on GPUs},
  year = {}
}

### Abstract

- 2.2.2.2 Step 2–Prefix sum of tile histograms
- 2.2.2.3 Step 3–Positioning numbers in a tile
- 2.2.3 SRTS Radix Sort
- 2.2.3.1 Step 1–Bottom level reduce

### Citations

538 | Sorting networks and their applications
- Batcher
- 1968
Citation Context ...rs are chosen. First, random samples are taken out of the elements in the buckets. A set of splitters is then chosen from these random samples. Finally, splitters are sorted using odd-even merge sort [4] in shared memory and a Binary Search Tree of splitters is created to facilitate the process of finding the bucket for an element. 2.2.4.2 Step 2–Finding buckets Each thread block is assigned a part o...

267 | Vector Models for Data-Parallel Computing
- Blelloch
- 1990
Citation Context ...orks in phases where each phase sorts on a digit of the key using, typically, either a count sort or a bucket sort. The counting to be done in each phase may be carried out using a prefix sum or scan [5] operation that is quite efficiently done on a GPU [27]. Harris et al.’s [33] adaptation of radix sort to GPUs uses the radix 2 (i.e., each phase sorts on a bit of the key) and uses the bitsplit techn...

159 | NVIDIA Tesla: A unified graphics and computing architecture
- Lindholm, Nickolls, et al.
- 2008
Citation Context ...1 Graphics Processing Units Contemporary graphics processing units (GPUs) are massively parallel manycore processors. NVIDIA’s Tesla GPUs, for example, have 240 scalar processing cores (SPs) per chip [22]. These cores are partitioned into 30 Streaming Multiprocessors (SMs) with each SM comprising 8 SPs. Each SM shares a 16 KB local memory (called shared memory) and has a total of 16,384 32-bit regis...

111 | GPUTeraSort: high performance graphics co-processor sorting for large database management
- Govindaraju, Gray, et al.
- 2006
Citation Context ...in registers. 2.2 Sorting Numbers on GPUs One of the very first GPU sorting algorithms, an adaptation of bitonic sort, was developed by Govindaraju et al. [12]. Since this algorithm was developed before the advent of CUDA, the algorithm was implemented using GPU pixel shaders. Zachmann et al. [13] improved on this sort algorithm by using BitonicTrees to red...

101 | The Art of Computer Programming: Sorting and Searching (Volume 3)
- Knuth
- 1998

94 | Fundamentals of Data Structures
- Horowitz, Sahni
- 1982
Citation Context ...ed by copying records from the fieldsArray to a new array placing the records into their sorted positions in the new array or in-place using a cycle chasing algorithm as described for a table sort in [15]. The second strategy is to extend a number sort so as to move an entire record every time its key is moved by the number sort. We call the first strategy indirect and the second strategy direct...

67 | Designing Efficient Sorting Algorithms for Manycore GPUs
- Satish, Harris, et al.
- 2009
Citation Context ...earch directed toward expanding the applicability of GPUs from their native computer graphics applications to a wide variety of high-performance computing applications. FIGURE 2.1: NVIDIA’s Tesla GPU [26] GPUs operate under the master-slave computing model (see, e.g., [25]) in which there is a host or master processor to which is attached a collection of slave processors. A possible configuration wo...

60 | Power efficient processor architecture and the Cell processor
- Hofstee
- 2005

55 | Anomalies in parallel branch-and-bound algorithms
- Lai, Sahni
- 1984

35 | GPU-ABiSort: Optimal parallel sorting on stream architectures
- Greß, Zachmann
- 2006
Citation Context ...aptation of bitonic sort, was developed by Govindaraju et al. [12]. Since this algorithm was developed before the advent of CUDA, the algorithm was implemented using GPU pixel shaders. Zachmann et al. [13] improved on this sort algorithm by using BitonicTrees to reduce the number of comparisons while merging the bitonic sequences. Cederman et al. [7] have adapted quick sort for GPUs. Their adaptation f...

32 | Cellsort: High performance sorting on the cell processor
- Gedik, Bordawekar, et al.
- 2007

27 | Analysis of shellsort and related algorithms
- Sedgewick
- 1996

22 | Fast parallel GPU-sorting using a hybrid algorithm
- Sintorn, Assarsson
- 2008
Citation Context ...ges the sorted subsequences in parallel. A hybrid sort algorithm that splits the data using bucket sort and then merges the data using a vectorized version of merge sort is proposed by Sintorn et al. [28]. Satish et al. [26] have developed an even faster merge sort. In this merge sort, two sorted sequences A and B are merged by a thread block to produce the sequence C when A and B have fewer than 256 e...

20 | A Programming Example
- Chow, Fossum, et al.
- 2005

15 | Revisiting sorting for GPGPU stream architectures
- Merrill, Grimshaw
- 2010
Citation Context ...verage, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [26]. [27, 33, 19, 26, 23, 1] have adapted radix sort to GPUs. Radix sort works in phases where each phase sorts on a digit of the key using, typically, either a count sort or a bucket sort. The counting to be done in each phase ...

14 | GPU sample sort
- Leischner, Osipov, et al.
- 2010
Citation Context ...tal results reported in [31] indicate that Warpsort is about 30% faster than the merge sort algorithm of [26]. Another comparison-based sort for GPUs–GPU sample sort–was developed by Leischner et al. [20]. Sample sort is reported to be about 30% faster than the merge sort of [26], on average, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For ...

13 | A fast, easy sort
- Lacey, Box
- 1991

10 | Scheduling master-slave multiprocessor systems
- Sahni
- 1996
Citation Context ... their native computer graphics applications to a wide variety of high-performance computing applications. FIGURE 2.1: NVIDIA’s Tesla GPU [26] GPUs operate under the master-slave computing model (see, e.g., [25]) in which there is a host or master processor to which is attached a collection of slave processors. A possible configuration would have a GPU card attached to the bus of a PC. The PC CPU wo...

8 | GPU-Quicksort: A practical quicksort algorithm for graphics processors
- Cederman, Tsigas
- 2009
Citation Context ...mplemented using GPU pixel shaders. Zachmann et al. [13] improved on this sort algorithm by using BitonicTrees to reduce the number of comparisons while merging the bitonic sequences. Cederman et al. [7] have adapted quick sort for GPUs. Their adaptation first partitions the sequence to be sorted into subsequences, sorts these subsequences in parallel, and then merges the sorted subsequences in paral...

7 | An efficient variation of Bubble Sort
- Dobosiewicz
- 1980

6 | High performance comparison-based sorting algorithm on many-core GPUs
- Ye, Fan, et al.
- 2010
Citation Context ...from the two sequences in such a way that the interval between successive splitters is small enough to be merged by a thread block. The fastest GPU merge sort algorithm known at this time is Warpsort [31]. Warpsort first creates sorted sequences using bitonic sort; each sorted sequence being created by a thread warp. The sorted sequences are merged in pairs until only a small number of sequences remai...

5 | The performance of randomized Shellsort-like network sorting algorithms. SCAMP working paper P18/94, Institute for Defense Analyses
- Lemke
- 1994

5 | Efficient sorting algorithms for the cell broadband engine
- Sharma, Thapar, et al.
- 2008

4 | Scan primitives for GPU computing. Graphics Hardware
- SenGupta, Harris, et al.
- 2007

3 | Worst case for Comb Sort
- Drozdek
- 2005

3 | Broad-Phase Collision Detection with
- Grand
- 2007
Citation Context ...verage, when the keys are 32-bit integers. This would make sample sort competitive with Warpsort for 32-bit keys. For 64-bit keys, sample sort is twice as fast, on average, as the merge sort of [26]. [27, 33, 19, 26, 23, 1] have adapted radix sort to GPUs. Radix sort works in phases where each phase sorts on a digit of the key using, typically, either a count sort or a bucket sort. The counting to be done in each phase ...

3 | Hypercube-to-host sorting
- Won, Sahni
- 1989

2 | Sorting Large Records on a Cell Broadband Engine
- Bandyopadhyay, Sahni
- 2010
Citation Context ...imensional array fieldsArray[][] with fieldsArray[i][0] = k_i and fieldsArray[i][j] = f_{ij}, 1 ≤ j ≤ m, 1 ≤ i ≤ n. When this array is mapped to memory in column-major order, we get the ByField layout of [2]. This layout was used also for the AA-sort algorithm developed for the Cell Broadband Engine in [16] and is essentially the same as that used by the GPU radix sort algorithm of [26]. When the fields ...

2 | Sorting on a Cell Broadband Engine
- Bandyopadhyay, Sahni
- 2009

2 | Benchmarking GPUs to Tune Dense
- Volkov, Demmel
- 2008
Citation Context ...tion should be organized so that, at any given time, the threads in a half warp access either words in different banks of shared memory or they access the same word of shared memory. 4. Volkov et al. [30] have observed greater throughput using operands in registers than operands in shared memory. So, data that is to be used often should be stored in registers rather than in shared memory. 5. Loop unro...

1 | AA-sort: A new parallel algorithm for multi-core
- Inoue, Moriyama, et al.
- 2007
Citation Context ..., countersSum, ranks4) { // Read the numbers from keysIn4 and put them in // sorted order in keysOut // sTileCnt stores tile histogram // sGOffset stores global prefix-summed histogram shared sTileCnt[16], sGOffset[16]; // storage for numbers shared int sKeys[t]; int4 k4, r4; // Read the histograms from the global memory if(tid < 16) { sTileCnt[tid] = counters[tid * nTiles + bid]; sGOffset[tid] = coun...