## A High-Performance Sorting Algorithm for Multicore Single-Instruction Multiple-Data Processors

### BibTeX

@MISC{Inoue_ahigh-performance,
    author = {Hiroshi Inoue and Takao Moriyama and Hideaki Komatsu and Toshio Nakatani},
    title = {A High-Performance Sorting Algorithm for Multicore Single-Instruction Multiple-Data Processors},
    year = {}
}

### Abstract

Many sorting algorithms have been studied in the past, but there are only a few algorithms that can effectively exploit both SIMD instructions and thread-level parallelism. In this paper, we propose a new high-performance sorting algorithm, called Aligned-Access sort (AA-sort), for exploiting both the SIMD instructions and thread-level parallelism available on today's multicore processors. Our algorithm consists of two phases, an in-core sorting phase and an out-of-core merging phase. The in-core sorting phase uses our new sorting algorithm that extends combsort to exploit SIMD instructions. The out-of-core algorithm is based on mergesort with our novel vectorized merging algorithm. Both phases can take advantage of SIMD instructions. The key to high performance is eliminating unaligned memory accesses that would reduce the effectiveness of SIMD instructions in both phases. We implemented and evaluated the AA-sort on PowerPC 970MP and Cell Broadband Engine platforms. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM’s optimized sequential sorting library by 1.8 times and bitonic merge sort using SIMD instructions by 3.3 times on PowerPC 970MP when sorting 32 million random 32-bit integers. Also, a parallel version of AA-sort demonstrated better scalability with increasing numbers of cores than a parallel version of bitonic merge sort on both platforms.
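
The two-phase structure described in the abstract can be illustrated with a minimal scalar sketch. This is not the authors' vectorized AA-sort: it uses plain combsort on fixed-size blocks (standing in for the in-core phase) followed by ordinary pairwise merging (standing in for the out-of-core phase), with no SIMD instructions, no alignment handling, and no threads. The names `combsort`, `merge`, and `two_phase_sort`, the shrink factor of 1.3, and the `block_size` parameter are assumptions made for illustration only.

```python
def combsort(a, shrink=1.3):
    """Sort list `a` in place with scalar combsort and return it."""
    gap = len(a)
    swapped = True
    # Keep making passes until the gap has shrunk to 1 and a full pass swaps nothing.
    while gap > 1 or swapped:
        gap = max(1, int(gap / shrink))
        swapped = False
        for i in range(len(a) - gap):
            if a[i] > a[i + gap]:
                a[i], a[i + gap] = a[i + gap], a[i]
                swapped = True
    return a


def merge(left, right):
    """Merge two sorted lists into one new sorted list."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def two_phase_sort(data, block_size=4):
    """Sort each block with combsort, then merge the sorted blocks pairwise."""
    # Phase 1: sort fixed-size blocks independently (hypothetical block size).
    blocks = [combsort(list(data[i:i + block_size]))
              for i in range(0, len(data), block_size)]
    # Phase 2: repeatedly merge neighbouring blocks until one block remains.
    while len(blocks) > 1:
        merged = []
        for k in range(0, len(blocks), 2):
            if k + 1 < len(blocks):
                merged.append(merge(blocks[k], blocks[k + 1]))
            else:
                merged.append(blocks[k])
        blocks = merged
    return blocks[0] if blocks else []
```

In this sketch the blocks are independent in phase 1 and the merges within one round are independent in phase 2, which is where thread-level parallelism could in principle be applied.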

### Citations

125 | Photon mapping on programmable graphics hardware
- Purcell, Donner, et al.
- 2005
Citation Context ...mispredictions. Moreover, our algorithm makes it possible to take advantage of the data parallelism of SIMD instructions. There are some sorting algorithms suitable for exploiting SIMD instructions [13, 14, 15]. They were originally proposed in the context of sorting on graphics processing units (GPUs), which were powerful programmable processors with SIMD instruction sets. Govindaraju et al. [15] presented...

121 | The design and implementation of a first-generation Cell processor
- Pham
- 2005
Citation Context ...n is published in Software: Practice and Experience. http://onlinelibrary.wiley.com/doi/10.1002/spe.1102/abstract processor and a system with 16 cores of the Cell Broadband Engine (Cell BE) processor [7]. In summary, a sequential version of the AA-sort using SIMD instructions outperformed IBM’s optimized sequential sorting library by 1.8 times and the bitonic merge sort that uses SIMD instructions, t...

104 | Scalable parallel programming with CUDA
- Nickolls, Buck, et al.
- 2008
Citation Context ...ccess, each slot of the GPU’s vector instructions can act as a separate thread. NVIDIA calls this processor architecture SIMT (single-instruction, multiple-thread) in contrast to the traditional SIMD [20]. Our AA-sort targets the SIMD processors, which have more limitations than the SIMT processors. 4. AA-SORT ALGORITHM In this section, we present our new sorting algorithm called AA-sort. We use 32-bi...

102 | GPUTeraSort: high performance graphics co-processor sorting for large database management
- Govindaraju, Gray, et al.
- 2006
Citation Context ...mispredictions. Moreover, our algorithm makes it possible to take advantage of the data parallelism of SIMD instructions. There are some sorting algorithms suitable for exploiting SIMD instructions [13, 14, 15]. They were originally proposed in the context of sorting on graphics processing units (GPUs), which were powerful programmable processors with SIMD instruction sets. Govindaraju et al. [15] presented...

46 | Fast and approximate stream mining of quantiles and frequencies using graphics processors
- Govindaraju, Raghuvanshi, et al.
- 2005
Citation Context ...mispredictions. Moreover, our algorithm makes it possible to take advantage of the data parallelism of SIMD instructions. There are some sorting algorithms suitable for exploiting SIMD instructions [13, 14, 15]. They were originally proposed in the context of sorting on graphics processing units (GPUs), which were powerful programmable processors with SIMD instruction sets. Govindaraju et al. [15] presented...

28 | CellSort: High performance sorting on the Cell processor
- Gedik, Yu
- 2007
Citation Context ...•log(N)), which is the optimal complexity for any comparison-based sorting algorithm, while the complexity for the GPUTeraSort (or other bitonic merge sort variants) is O(N•(log(N))²). Gedik et al. [17] presented a sorting algorithm for Cell BE called the CellSort. They also used the bitonic merge sort as their computing kernel to exploit the SIMD instruction set and thread-level parallelism of the ...

17 | A practical quicksort algorithm for graphics processors
- Cederman, Tsigas
- 2008
Citation Context ...nefits for the SSE instructions and the VMX instructions. The AA-sort can take advantage of SIMD instructions not only in the last part of the sorting, but also for entire stages. Cederman and Tsigas [19] demonstrated that their quicksort implementation on recent NVIDIA GPUs achieved much better performance than quicksort on general-purpose CPUs or the GPUTeraSort running on the same GPUs. Their quick...

15 | Implementing sorting in database systems
- Graefe
- 2006
Citation Context ...erations, by removing branch overhead. Sorting is one of the most important building blocks for operating systems and many commercial and scientific applications, such as database management systems [2]. Hence many sequential and (This is a pre peer-reviewed version. The final version is published in Software: Practice and Experience. http://onlinelibrary.wiley.com/doi/10.1002/spe.1102/abstract) paral...

13 | Efficient implementation of sorting on multi-core SIMD CPU architecture
- Chhugani, Nguyen, et al.
Citation Context ...her SIMD instruction sets, such as the SPE instruction set of Cell BE and the SSE4 instruction set of the x86. We show our implementation of our algorithm on Cell BE in this paper and Chhugani et al. [9] described an implementation of a part of our algorithm using the SSE4 [10]. 3. RELATED WORK Many sorting algorithms have been proposed in the past. Quicksort is one of the fastest algorithms used in ...

12 | A fast, easy sort
- Lacey, Box
- 1991
Citation Context ...-core merging phase. Both phases can take advantage of the SIMD instructions and can also run in parallel with multiple threads. The in-core sorting phase uses our new algorithm that extends combsort [5] for exploiting SIMD instructions. This makes it possible to eliminate all unaligned memory accesses and fully exploit the SIMD instructions. The key idea to improve combsort is to first sort the inpu...

12 | A Benchmark Parallel Sort for Shared Memory Multiprocessors
- Francis, Mathieson
- 1988
Citation Context ... the out-of-core phase, the number of blocks becomes smaller than the number of threads, and hence multiple threads must cooperate on one merge operation to fully exploit the thread-level parallelism [21]. The entire AA-sort has the computational complexity of O(N•log(N)), where O(N•log(B)) for the in-core phase and O(N•log(N/B)) for the out-of-core phase, even for the worst case. Also it can be execu...

11 | Super scalar sample sort
- Sanders, Winkel
- 2004
Citation Context ...d thus the radixsort may suffer from a poor scalability on multicore processors because the memory bandwidth tends to become a bottleneck in systems with multicore processors [11]. Sanders and Winkel [12] pointed out that the performance of sorting on today’s processors is often dominated by pipeline stalls caused by branch mispredictions. They proposed a new sorting algorithm, named super-scalar samp...

7 | The art of computer programming, vol 3: Sorting and searching
- Knuth, D. E.
- 1973
Citation Context ...parallel sorting algorithms have been studied in the past [3, 4]. However, popular sorting algorithms, such as quicksort, are not able to exploit SIMD instructions efficiently. For example, a VMX instruction or an SSE instruction can load or store 128 bits of data b...

5 | Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
- Furtak, Amaral, et al.
Citation Context ...[Fig. 1. Data structure of the array to be sorted.] ours. Furtak et al. [18] showed the benefits of exploiting SIMD instructions for sorting very small arrays. They demonstrated that replacing only the last few steps of quicksort by a sorting network implemented with SIMD ins...

4 | AA-sort: A new parallel sorting algorithm for multi-core SIMD processors
- Inoue, Moriyama, et al.
- 2007
Citation Context ...ort on a system with 4 cores of the PowerPC 970MP † A preliminary version of this paper was published in proceedings of the Sixteenth IEEE Parallel Architecture and Compilation Techniques (PACT 2007) [6]. This paper adds more descriptions of our new algorithm. It also includes more detailed analysis of the results of our measurements, including the effects of important parameters on performance. ‡ lo...

4 | A study of memory management for web-based applications on multicore processors
- Inoue, Komatsu, et al.
Citation Context ... main memory bandwidth and thus the radixsort may suffer from a poor scalability on multicore processors because the memory bandwidth tends to become a bottleneck in systems with multicore processors [11]. Sanders and Winkel [12] pointed out that the performance of sorting on today’s processors is often dominated by pipeline stalls caused by branch mispredictions. They proposed a new sorting algorithm...

3 | PowerPC Microprocessor Family: Vector/SIMD Multimedia Extension Technology Programming Environments Manual
- IBM Corp.
Citation Context ...Section 5 discusses our experimental environment and gives a summary of our results. Finally, Section 6 draws conclusions. 2. SIMD INSTRUCTION SET In this paper we use the Vector Multimedia eXtension [8] (VMX, also known as AltiVec) instructions of the PowerPC instruction set to present our new sorting algorithm. It provides a set of 128-bit vector registers, each of which can be used as sixteen 8-bi...

3 | Sorting networks and their applications
- Batcher, K. E.
- 1968
Citation Context ...sing units (GPUs), which were powerful programmable processors with SIMD instruction sets. Govindaraju et al. [15] presented a sorting algorithm called GPUTeraSort that improved on bitonic merge sort [16]. The bitonic merge sort has computational complexity of O(N•(log(N))²) and it can be executed by up to N processors in parallel. The GPUTeraSort improves this algorithm by altering the order of com...

1 | Introspective Sorting and Selection Algorithms
- Musser, D. R.
Citation Context ...o evaluated two library functions, IBM’s Engineering and Scientific Subroutine Library (ESSL) version 4.2 and the STL library delivered with gcc that implements the quicksort variant called introsort [23], on the PowerPC 970MP. Table 1 summarizes the characteristics of each algorithm. The PowerPC 970MP system used for our evaluation was equipped with two 2.5 GHz dual-core PowerPC 970MP processors and ...
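
The complexity figures quoted in the citation contexts above can be checked with a short worked derivation. The symbols N (number of elements) and B (block size of the in-core phase) come from the quoted text; the use of base-2 logarithms and the neglect of constant factors are assumptions made here for illustration.

$$
O(N \log B) + O\!\left(N \log \frac{N}{B}\right)
= O\bigl(N (\log B + \log N - \log B)\bigr)
= O(N \log N)
$$

By comparison, the bound quoted for bitonic merge sort is larger by a factor of $\log N$:

$$
\frac{N (\log N)^2}{N \log N} = \log N, \qquad \log_2\!\left(32 \times 10^{6}\right) \approx 24.9
$$

This ratio compares asymptotic bounds only. Constant factors are ignored, so it should not be read as a prediction of the measured speedups reported in the abstract.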