Results 1 - 5 of 5
Efficient sorting using registers and caches
- WAE, Workshop on Algorithm Engineering, Lecture Notes in Computer Science, 2000
Cited by 7 (0 self)
Modern computer systems have increasingly complex memory systems. Common machine models for algorithm analysis do not reflect many of the features of these systems, e.g., large register sets, lockup-free caches, cache hierarchies, associativity, cache line fetching, and streaming behavior. Inadequate models lead to poor algorithmic choices and an incomplete understanding of algorithm behavior on real machines. A key step toward developing better models is to quantify the performance effects of features not reflected in the models. This paper explores the effect of memory system features on sorting performance. We introduce a new cache-conscious sorting algorithm, R-merge, which achieves better performance in practice over algorithms that are superior in the theoretical models. R-merge is designed to minimize memory stall cycles rather than cache misses by considering features common to many system designs.
MCSTL: The Multi-Core Standard Template Library
Cited by 7 (3 self)
Future gain in computing performance will not stem from increased clock rates, but from even more cores in a processor. Since automatic parallelization is still limited to easily parallelizable sections of the code, most applications will soon have to support parallelism explicitly. The Multi-Core Standard Template Library (MCSTL) simplifies parallelization by providing efficient parallel implementations of the algorithms in the C++ Standard Template Library. Thus, simple recompilation will provide partial parallelization of applications that make consistent use of the STL. We present performance measurements on several architectures. For example, our sorter achieves a speedup of 21 on an 8-core 32-thread SUN T1.
Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms
- 2007
Cited by 2 (1 self)
Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Three implementations of sorting with two different SIMD machineries (x86-64's SSE2 and G5's AltiVec) demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL), this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
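The threshold idea in this abstract can be illustrated with a small sketch. This is not the paper's implementation: it assumes a quicksort-style recursion, a hypothetical cutoff of four elements (`kThreshold`), and a scalar four-input sorting network standing in for the SIMD code. The names `compare_exchange`, `sort4_network`, `sort_tail`, and `hybrid_sort` are invented for illustration. The local array `regs` plays the role of the vector registers: the tail is loaded into it, sorted with branch-free min/max steps, and stored back.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical cutoff: subarrays of at most this many elements
// are handed to the fixed sorting network instead of recursing.
constexpr std::size_t kThreshold = 4;

// Branch-free compare-exchange: afterwards a <= b.
// This is a scalar stand-in for a pair of SIMD min/max instructions.
inline void compare_exchange(int& a, int& b) {
    const int lo = std::min(a, b);
    const int hi = std::max(a, b);
    a = lo;
    b = hi;
}

// Sorts exactly four values with a fixed five-comparator network.
inline void sort4_network(int* v) {
    compare_exchange(v[0], v[1]);
    compare_exchange(v[2], v[3]);
    compare_exchange(v[0], v[2]);
    compare_exchange(v[1], v[3]);
    compare_exchange(v[1], v[2]);
}

// Load up to kThreshold elements into a local buffer (the "registers"),
// sort them there, and store the result back to memory.
void sort_tail(int* first, std::size_t n) {
    assert(n <= kThreshold);
    int regs[kThreshold];
    for (std::size_t i = 0; i < kThreshold; ++i) {
        // Unused slots are padded with the largest int so they sort last.
        regs[i] = (i < n) ? first[i] : std::numeric_limits<int>::max();
    }
    sort4_network(regs);
    for (std::size_t i = 0; i < n; ++i) {
        first[i] = regs[i];
    }
}

// Quicksort-style recursion over [first, last) that switches to
// sort_tail once a subarray is no larger than the threshold.
void hybrid_sort(int* first, int* last) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= kThreshold) {
        sort_tail(first, n);
        return;
    }
    const int pivot = first[n / 2];
    // Three-way split: [< pivot) [== pivot) [> pivot).
    int* mid1 = std::partition(first, last,
                               [pivot](int x) { return x < pivot; });
    int* mid2 = std::partition(mid1, last,
                               [pivot](int x) { return !(pivot < x); });
    hybrid_sort(first, mid1);
    hybrid_sort(mid2, last);
}

// Convenience wrapper for std::vector<int>.
void hybrid_sort(std::vector<int>& v) {
    hybrid_sort(v.data(), v.data() + v.size());
}
```

Calling `hybrid_sort(v)` on a `std::vector<int>` sorts it in ascending order. The network uses no data-dependent branches, which is the property that lets each compare-exchange map onto vector min/max instructions; a real SIMD version would hold several such networks side by side in the lanes of one register.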
Data Intensive Computation in a Compute/Storage Hierarchy
- 2002
We are acquiring and storing ever-increasing volumes of data. Extracting useful information from these large datasets poses challenges throughout the memory/storage hierarchy. One solution is to reduce the amount of data movement between different levels of memory. The two-level external memory (EM) model and its variants are used to design such algorithms. We show how application of EM techniques can yield significant performance improvement for a GIS application. We also show that the derived cache model does not adequately represent the memory system at the cache/register level. The other

