Rearrangeability of (2n − 1)-stage shuffle-exchange networks
SIAM Journal on Computing, 2003
Cited by 2 (0 self)
Abstract. Rearrangeable networks can realize each and every permutation in one pass through the network. Shuffle-exchange networks provide an efficient interconnection scheme for implementing various types of parallel processes. Whether (2n − 1)-stage shuffle-exchange networks with N = 2^n inputs/outputs are rearrangeable has remained an open question for approximately three decades. This paper answers the question affirmatively. An important corollary of the main result is the proof that two passes through an Omega network are sufficient and necessary to implement any permutation. In obtaining the main results of this paper, frames that look like grids with horizontal links of different lengths are shown to be remarkable tools for identifying and characterizing the binary matrix representations of permutations.
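As a rough illustration of the network the abstract refers to, the sketch below simulates a single shuffle-exchange stage on N = 2^n lines: a perfect shuffle (left rotation of the n-bit line address) followed by a column of 2x2 switches. The function names and the `swap_flags` control input are hypothetical and chosen only for this example; the sketch does not reproduce the paper's proof technique or any routing algorithm.

```python
def perfect_shuffle(index: int, n: int) -> int:
    """Rotate the n-bit line address left by one bit (the perfect shuffle)."""
    mask = (1 << n) - 1
    return ((index << 1) | (index >> (n - 1))) & mask


def shuffle_exchange_stage(data, n, swap_flags):
    """Apply one shuffle-exchange stage to a list of N = 2**n items.

    swap_flags[k] (a hypothetical control input) is True when the k-th
    2x2 switch should exchange the items on lines 2k and 2k + 1.
    """
    size = 1 << n
    assert len(data) == size and len(swap_flags) == size // 2
    # Shuffle: the item on line i moves to line rotate_left(i).
    shuffled = [None] * size
    for i, item in enumerate(data):
        shuffled[perfect_shuffle(i, n)] = item
    # Exchange: each switch either passes its pair straight or crosses it.
    for k, swap in enumerate(swap_flags):
        if swap:
            a, b = 2 * k, 2 * k + 1
            shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
    return shuffled
```

Cascading 2n − 1 such stages, each with its own switch settings, gives the kind of network whose rearrangeability the paper addresses; choosing the settings for a given permutation is the hard part and is not attempted here.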
Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms
2007
Cited by 2 (1 self)
Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, the data is loaded into the vector registers, manipulated in-register, and the result is stored back to memory. Three implementations of sorting with two different SIMD machineries (x86-64’s SSE2 and G5’s AltiVec) demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
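The control flow described in the abstract (recurse until a threshold, then load, sort in place with a fixed instruction sequence, and store) can be sketched in scalar Python. This is only a stand-in: Python has no access to SSE2 or AltiVec registers, so a four-element list plays the role of a vector register and a fixed compare-exchange network plays the role of the SIMD instruction sequence. The threshold of 4, the network, and all names are assumptions made for this example and are not taken from the paper or from DTSL.

```python
import math

THRESHOLD = 4  # hypothetical cutoff; the paper's actual threshold may differ
# A standard 5-comparator sorting network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_tail(a, lo, hi):
    """Sort a[lo:hi] (at most THRESHOLD numbers) with a fixed network."""
    count = hi - lo
    # "Load": copy into a fixed-size buffer, padded with +inf sentinels.
    reg = a[lo:hi] + [math.inf] * (THRESHOLD - count)
    # "In-register" work: a data-independent compare-exchange sequence.
    for i, j in NETWORK_4:
        if reg[i] > reg[j]:
            reg[i], reg[j] = reg[j], reg[i]
    # "Store": write back only the real elements (sentinels sort last).
    a[lo:hi] = reg[:count]


def hybrid_quicksort(a, lo=0, hi=None):
    """Quicksort on the list a[lo:hi] that hands small tails to sort_tail."""
    if hi is None:
        hi = len(a)
    if hi - lo <= THRESHOLD:
        sort_tail(a, lo, hi)
        return
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi - 1
    while i <= j:  # Hoare-style partition around the pivot value
        while a[i] < pivot:
            i += 1
        while a[j] > pivot:
            j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1
            j -= 1
    hybrid_quicksort(a, lo, j + 1)
    hybrid_quicksort(a, i, hi)
```

Because the comparator sequence is the same for every input, this kind of tail lends itself to branch-free min/max vector instructions, which is the property a SIMD implementation would exploit.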

