Results 1 - 9 of 9
Cache-efficient string sorting using copying
 In submission, 2006
Cited by 8 (3 self)
Abstract. Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.
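The copying idea in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the names `TrieNode`, `BUCKET_LIMIT`, `insert`, `burst`, `collect`, and `burstsort` are hypothetical, the burst threshold is arbitrarily tiny, and Python's built-in `sorted` stands in for whatever in-cache sort a real implementation would apply to each bucket. The sketch only shows that each bucket stores the unexamined tail of a key rather than a pointer to the original key, and that a bucket "bursts" into a new trie node once it grows too large.

```python
BUCKET_LIMIT = 4  # hypothetical burst threshold; a real one would be sized to the cache


class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> TrieNode, or a list used as a bucket
        self.exhausted = 0   # number of keys that end exactly at this node


def insert(node, key):
    """Insert key below node, copying only its unexamined tail into a bucket."""
    depth = 0
    while True:
        if depth == len(key):
            node.exhausted += 1
            return
        ch = key[depth]
        child = node.children.get(ch)
        if child is None:
            child = node.children[ch] = []
        if isinstance(child, TrieNode):
            node, depth = child, depth + 1
            continue
        child.append(key[depth + 1:])      # copy the tail; the original key is not referenced again
        if len(child) > BUCKET_LIMIT:
            node.children[ch] = burst(child)
        return


def burst(bucket):
    """Replace an over-full bucket by a trie node holding smaller buckets."""
    new_node = TrieNode()
    for tail in bucket:
        insert(new_node, tail)
    return new_node


def collect(node, prefix, out):
    """In-order traversal: rebuild keys from the trie path plus the stored tail."""
    out.extend([prefix] * node.exhausted)
    for ch in sorted(node.children):
        child = node.children[ch]
        if isinstance(child, TrieNode):
            collect(child, prefix + ch, out)
        else:
            for tail in sorted(child):     # stand-in for an in-cache bucket sort
                out.append(prefix + ch + tail)


def burstsort(keys):
    root = TrieNode()
    for key in keys:
        insert(root, key)
    out = []
    collect(root, "", out)
    return out
```

Calling `burstsort(["band", "ban", "apple", "bank"])` returns the keys in ascending order. Because the tails are copied, the final traversal touches only the trie and the buckets, which is the locality property the abstract emphasizes.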
Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms
 in Proceedings of the 19th annual ACM Symposium on Parallel Algorithms and Architectures, 2007
Cited by 6 (1 self)
Most contemporary processors offer some version of Single Instruction Multiple Data (SIMD) machinery: vector registers and instructions to manipulate data stored in such registers. The central idea of this paper is to use these SIMD resources to improve the performance of the tail of recursive sorting algorithms. When the number of elements to be sorted reaches a set threshold, data is loaded into the vector registers, manipulated in-register, and the result stored back to memory. Three implementations of sorting with two different SIMD machineries, x86-64's SSE2 and G5's AltiVec, demonstrate that this idea delivers significant speed improvements. The improvements provided are orthogonal to the gains obtained through empirical search for a suitable sorting algorithm [11]. When integrated with the Dynamically Tuned Sorting Library (DTSL) this new code generation strategy reduces the time spent by DTSL by up to 22% for moderately-sized arrays, with greater relative reductions for small arrays. Wall-clock performance of d-heaps is improved by up to 39% using a similar technique.
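The structure described here, a recursive sort whose tail is handed to fixed in-register code once the subproblem is small, can be sketched in scalar form. Everything below is an assumption made for illustration: the threshold of 4, the use of a five-comparator sorting network, and the names `NETWORK_4`, `sort_tail`, and `hybrid_quicksort` do not come from the paper, and Python has no access to SSE2 or AltiVec. The `min`/`max` pairs merely play the role that vector minimum and maximum instructions would play on real hardware.

```python
import math

THRESHOLD = 4  # hypothetical cut-over size; the paper's threshold is not given in the abstract

# A standard 5-comparator sorting network for 4 inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_tail(a, lo, hi):
    """Sort a[lo:hi] (at most THRESHOLD numbers) with a fixed compare-exchange network."""
    n = hi - lo
    v = a[lo:hi] + [math.inf] * (THRESHOLD - n)   # "load into registers", padded with +inf
    for i, j in NETWORK_4:
        # Data-independent compare-exchange: no branch on the comparison result.
        v[i], v[j] = min(v[i], v[j]), max(v[i], v[j])
    a[lo:hi] = v[:n]                              # "store back to memory"


def hybrid_quicksort(a, lo=0, hi=None):
    """Quicksort on numbers that hands subarrays of size <= THRESHOLD to sort_tail."""
    if hi is None:
        hi = len(a)
    while hi - lo > THRESHOLD:
        pivot = a[(lo + hi) // 2]
        i, j = lo, hi - 1
        while i <= j:                             # Hoare-style partition
            while a[i] < pivot:
                i += 1
            while a[j] > pivot:
                j -= 1
            if i <= j:
                a[i], a[j] = a[j], a[i]
                i += 1
                j -= 1
        hybrid_quicksort(a, lo, j + 1)
        lo = i
    sort_tail(a, lo, hi)
```

The point of the design is that the tail code executes the same instruction sequence regardless of the data, which is what makes it a candidate for vector registers.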
An Experimental Study of Sorting and Branch Prediction
Cited by 2 (0 self)
Sorting is one of the most important and well-studied problems in Computer Science. Many good algorithms are known which offer various tradeoffs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and branch predictors are two such features, and while there has been a significant amount of research into the cache performance of general-purpose sorting algorithms, there has been little research on their branch prediction properties. In this paper we empirically examine the behaviour of the branches in all the most common sorting algorithms. We also consider the interaction of cache optimization on the predictability of the branches in these algorithms. We find insertion sort to have the fewest branch mispredictions of any comparison-based sorting algorithm, that bubble and shaker sort operate in a fashion which makes their branches highly unpredictable, that the unpredictability of shellsort's branches improves its caching behaviour and that several cache optimizations have little effect on mergesort's branch mispredictions. We find also that optimizations to quicksort, for example the choice of pivot, have a strong influence on the predictability of its branches. We point out a simple way of removing branch instructions from a classic heapsort implementation, and show also that unrolling a loop in a cache-optimized heapsort implementation improves the predictability of its branches. Finally, we note that when sorting random data two-level adaptive branch predictors are usually no better than simpler bimodal predictors. This is despite the fact that two-level adaptive predictors are almost always superior to bimodal predictors in general.
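The abstract does not spell out how branches are removed from heapsort, so the following is only one well-known possibility, offered as an assumption rather than as the paper's method: when sifting down, the larger child is selected by adding the 0/1 result of a comparison to the left child's index instead of using an if/else. The names `sift_down` and `heapsort` are illustrative. Python itself does not expose hardware branches, so the sketch shows the index arithmetic only; in a compiled language the same expression can avoid a conditional jump.

```python
def sift_down(a, root, end):
    """Restore the max-heap property for the subtree at root within a[:end]."""
    value = a[root]
    left = 2 * root + 1
    while left + 1 < end:                         # node has two children
        # Comparison result (False/True == 0/1) is added to the index:
        # this picks the larger child without an if/else.
        child = left + (a[left + 1] > a[left])
        if a[child] <= value:
            break
        a[root] = a[child]
        root = child
        left = 2 * root + 1
    if left + 1 == end and a[left] > value:       # last node may have a single child
        a[root] = a[left]
        root = left
    a[root] = value


def heapsort(a):
    """In-place heapsort using the sift_down above."""
    n = len(a)
    for root in range(n // 2 - 1, -1, -1):        # build the heap
        sift_down(a, root, n)
    for end in range(n - 1, 0, -1):               # repeatedly move the maximum to the end
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)
```

The remaining comparison against `value` is still a branch; only the child-selection branch, whose outcome is close to random on random data, has been turned into arithmetic.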
To all algorithmicists Preface
, 2008
Algorithms are at the heart of every nontrivial computer application. Therefore every computer scientist and every professional programmer should know about the basic algorithmic toolbox: structures that allow efficient organization and retrieval of data, frequently used algorithms, and generic techniques for modeling, understanding, and solving algorithmic problems. This book is a concise introduction to this basic toolbox, intended for students and professionals familiar with programming and basic mathematical language. We have used the book in undergraduate courses on algorithmics. In our graduate-level courses, we make most of the book a prerequisite, and concentrate on the starred sections and the more advanced material. We believe that, even for undergraduates, a concise yet clear and simple presentation makes material more accessible, as long as it includes examples, pictures, informal explanations, exercises, and some linkage to the real world. Most chapters have the same basic structure. We begin by discussing a problem as it occurs in a real-life situation. We illustrate the most important applications and
unknown title
Sorting is the computational process of rearranging a given sequence of items from some total order into ascending or descending order. Because sorting is a task at the very core of Computer Science, efficient algorithms were developed early. The first practical and industrial applications of computers had many uses for sorting. It is still a very frequently occurring problem, often appearing as a preliminary step to some other computational
Super Scalar Sample Sort
Abstract. Sample sort, a generalization of quicksort that partitions the input into many pieces, is known as the best practical comparison-based sorting algorithm for distributed-memory parallel computers. We show that sample sort is also useful on a single processor. The main algorithmic insight is that element comparisons can be decoupled from expensive conditional branching using predicated instructions. This transformation facilitates optimizations like loop unrolling and software pipelining. The final implementation, albeit cache efficient, is limited by a linear number of memory accesses rather than the O(n log n) comparisons. On an Itanium 2 machine, we obtain a speedup of up to 2 over std::sort from the GCC STL library, which is known as one of the fastest available quicksort implementations.
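One way to decouple comparisons from branching when partitioning into many pieces is to store the splitters as an implicit binary search tree and to descend it with index arithmetic. The sketch below assumes that layout; the abstract itself only mentions predicated instructions. The names `build_splitter_tree`, `classify`, and `sample_sort`, the bucket count `k`, and the fixed seed are hypothetical, there is no oversampling, and `sorted` stands in for the recursive step.

```python
import random


def build_splitter_tree(splitters):
    """Lay out k-1 sorted splitters as an implicit search tree (1-based heap order)."""
    k = len(splitters) + 1                 # number of buckets
    assert k & (k - 1) == 0, "k must be a power of two"
    tree = [None] * k                      # tree[0] is unused

    def fill(node, lo, hi):                # place splitters[lo:hi] in the subtree at node
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[node] = splitters[mid]
        fill(2 * node, lo, mid)
        fill(2 * node + 1, mid + 1, hi)

    fill(1, 0, len(splitters))
    return tree


def classify(x, tree):
    """Return the bucket index of x using comparisons as 0/1 values, not as if/else."""
    k = len(tree)
    j = 1
    while j < k:                           # exactly log2(k) steps, independent of x
        j = 2 * j + (x > tree[j])
    return j - k


def sample_sort(data, k=4, seed=0):
    """One level of sample sort over a list; buckets are finished with sorted()."""
    if len(data) <= k:
        return sorted(data)
    rng = random.Random(seed)
    splitters = sorted(rng.sample(data, k - 1))
    tree = build_splitter_tree(splitters)
    buckets = [[] for _ in range(k)]
    for x in data:
        buckets[classify(x, tree)].append(x)
    out = []
    for bucket in buckets:
        out.extend(sorted(bucket))         # stand-in for recursion on each bucket
    return out
```

With splitters 10, 20, 30, the call `classify(15, build_splitter_tree([10, 20, 30]))` yields bucket 1, since 15 lies between the first and second splitter. Because the loop trip count does not depend on the element, the classification loop is the kind of code that can be unrolled and software-pipelined.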
AN AUTOMATICALLY-TUNED SORTING LIBRARY ERAN BIDA AND SIVAN TOLEDO
We present atsl, an automatically-tuned sorting library. Atsl generates an in-core sorting routine optimized to the target machine for a specific data type. Atsl finds a high-performance sorting routine by searching an algorithmic space that we have defined. The search space includes basic sorting algorithms and automatically-generated compositions of sorting algorithms. Performance measurements are used both for ranking candidate algorithms and for characterizing the behavior of candidates in specific settings (ranges of array sizes). These characterizations allow atsl to generate hybrid algorithms that intelligently exploit the strengths of particular algorithms, such as high speed at specific input-size ranges. Many sorting algorithms can be tuned using numeric parameters. Atsl searches these parameter spaces to find values that yield high performance on the target machine. The building blocks from which atsl synthesizes sorting algorithms include adaptations of many of the most effective hand-tuned sorting routines, including several that are tuned for cache efficiency. An extensive experimental evaluation shows that atsl generates high-performance codes that are well tuned for the target machine and data type. The experiments were conducted on six different machines, of several architectures, and with three different compilers. The algorithms that are generated are fast; in particular, they beat the hand-tuned building blocks and the compiler's C++ built-in sorting routine. The algorithms that atsl generates on different machines and using different compilers are different from each other.
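The idea of searching a numeric parameter space by measurement can be shown with a toy example. This is not atsl's search procedure or API: the functions `insertion_sort`, `hybrid_mergesort`, and `tune_cutoff`, and the choice of an insertion-sort cutoff as the tuned parameter, are assumptions made for illustration. The sketch times a hybrid merge sort with several candidate cutoffs on the current machine and keeps the fastest one.

```python
import math
import random
import time


def insertion_sort(a):
    """Sort the list a in place and return it."""
    for i in range(1, len(a)):
        x = a[i]
        j = i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def hybrid_mergesort(a, cutoff):
    """Merge sort that switches to insertion sort at or below cutoff elements."""
    if len(a) <= max(cutoff, 1):
        return insertion_sort(list(a))
    mid = len(a) // 2
    left = hybrid_mergesort(a[:mid], cutoff)
    right = hybrid_mergesort(a[mid:], cutoff)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            out.append(right[j])
            j += 1
        else:
            out.append(left[i])
            i += 1
    out.extend(left[i:])
    out.extend(right[j:])
    return out


def tune_cutoff(candidates, n=2000, trials=3, seed=0):
    """Return the candidate cutoff with the lowest measured time on this machine."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    best, best_time = None, math.inf
    for cutoff in candidates:
        elapsed = math.inf
        for _ in range(trials):                   # keep the best of several runs
            start = time.perf_counter()
            hybrid_mergesort(data, cutoff)
            elapsed = min(elapsed, time.perf_counter() - start)
        if elapsed < best_time:
            best, best_time = cutoff, elapsed
    return best
```

A call such as `tune_cutoff([4, 16, 64])` may return different values on different machines or interpreter versions, which mirrors the abstract's observation that the generated algorithms differ across machines and compilers.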