## An Experimental Study of Sorting and Branch Prediction

Citations: | 2 - 0 self |

### BibTeX

@MISC{Biggar_anexperimental,
    author = {Paul Biggar and Nicholas Nash and Kevin Williams and David Gregg},
    title = {An Experimental Study of Sorting and Branch Prediction},
    year = {}
}

### Abstract

Sorting is one of the most important and well-studied problems in computer science. Many good algorithms are known, offering various trade-offs in efficiency, simplicity, memory use, and other factors. However, these algorithms do not take into account features of modern computer architectures that significantly influence performance. Caches and branch predictors are two such features. While there has been a significant amount of research into the cache performance of general-purpose sorting algorithms, there has been little research on their branch prediction properties. In this paper we empirically examine the behaviour of the branches in all the most common sorting algorithms. We also consider the effect of cache optimizations on the predictability of the branches in these algorithms. We find that insertion sort has the fewest branch mispredictions of any comparison-based sorting algorithm, that bubble and shaker sort operate in a fashion that makes their branches highly unpredictable, that the unpredictability of shellsort’s branches improves its caching behaviour, and that several cache optimizations have little effect on mergesort’s branch mispredictions. We also find that optimizations to quicksort, for example the choice of pivot, have a strong influence on the predictability of its branches. We point out a simple way of removing branch instructions from a classic heapsort implementation, and also show that unrolling a loop in a cache-optimized heapsort implementation improves the predictability of its branches. Finally, we note that when sorting random data, two-level adaptive branch predictors are usually no better than simpler bimodal predictors, even though two-level adaptive predictors are almost always superior to bimodal predictors in general.
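
To make the abstract's claim about insertion sort more concrete, the following is a minimal sketch of a bimodal predictor applied to the inner-loop branch of insertion sort. It is not the authors' experimental setup (the paper reports using SimpleScalar and Pentium 4 performance counters). The sketch assumes 2-bit saturating counters initialised to "weakly not-taken", a table of 2^12 entries, and an inner loop modelled as a single conditional branch, as if a sentinel had removed the bounds check. The names `BimodalPredictor`, `INNER_LOOP_BRANCH`, and `mispredictions_per_key` are hypothetical.

```python
import random


class BimodalPredictor:
    """Table of 2-bit saturating counters indexed by a branch identifier.

    Assumption: counters start at 1 (weakly not-taken); values 2 and 3
    predict "taken", values 0 and 1 predict "not taken".
    """

    def __init__(self, table_bits=12):
        self.mask = (1 << table_bits) - 1
        self.table = [1] * (1 << table_bits)
        self.branches = 0
        self.mispredictions = 0

    def record(self, branch_id, taken):
        """Predict the branch, compare with the real outcome, then train."""
        idx = branch_id & self.mask
        counter = self.table[idx]
        predicted_taken = counter >= 2
        self.branches += 1
        if predicted_taken != taken:
            self.mispredictions += 1
        # Saturating update towards the observed direction.
        self.table[idx] = min(3, counter + 1) if taken else max(0, counter - 1)


INNER_LOOP_BRANCH = 0  # hypothetical identifier standing in for a branch address


def insertion_sort(a, predictor):
    """Sort list `a` in place, reporting every inner-loop branch outcome."""
    for i in range(1, len(a)):
        key = a[i]
        j = i - 1
        while True:
            # Modelled as one branch: "taken" means keep shifting items right.
            keep_shifting = j >= 0 and a[j] > key
            predictor.record(INNER_LOOP_BRANCH, keep_shifting)
            if not keep_shifting:
                break
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a


def mispredictions_per_key(n=2000, seed=1):
    """Average inner-loop mispredictions per key on uniformly random keys."""
    rng = random.Random(seed)
    data = [rng.randrange(1 << 30) for _ in range(n)]
    predictor = BimodalPredictor()
    insertion_sort(data, predictor)
    return predictor.mispredictions / n


if __name__ == "__main__":
    print(mispredictions_per_key())
```

Under these assumptions the inner-loop branch is taken for a long run and then falls through once per key, so the simulated rate should come out close to one misprediction per key on random input. A real processor also executes the outer-loop branch and uses a different indexing scheme, so this figure is only illustrative.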

### Citations

9303 | Introduction to Algorithms - Cormen, Leiserson, et al. - 1990 |
Citation Context ... multi-mergesort described by Aggarwal and Vitter [1988] is optimal in the external-memory model, while several mergesort variations have been described which are optimal in the cache-oblivious model [Frigo et al. 1999; Brodal et al. 2007]. We have chosen the mergesort algorithms above because they are relatively simple and include multi-way merging, which is generally used by the more elaborate mergesort variation...

4433 | Computer Architecture: A Quantitative Approach, 4th ed - Hennessy, Patterson - 2007 |

576 | The input/output complexity of sorting and related problems - Aggarwal, Vitter - 1988 |

478 | Algorithms - Sedgewick - 1983 |

176 | The microarchitecture of the Pentium 4 processor - Hinton, Sager, et al. - 2001 |
Citation Context ... We used power of two sized tables for the branch predictors. We used tables containing between $2^{11}$ and $2^{14}$ predictors, because these are close to the size of the $2^{12}$ entry table of the Pentium 4 [Hinton et al. 2001]. When presenting branch prediction results, the predictor configurations will be described precisely. SimpleScalar’s sim-bpred provides total results over all the branches in a program. It is often ...

167 | Handbook of Algorithms and Data Structures - Gonnet, Baeza-Yates - 1991 |

127 | The influence of caches on the performance of sorting - LaMarca, Ladner - 1997 |
Citation Context ... lower level of cache or even main memory. This has spawned a great deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dramatically is the conditional branch. Modern pipelined processors depend on br...

97 | Implementing quicksort programs - Sedgewick - 1978 |

72 | The influence of caches on the performance of heaps - LaMarca, Ladner - 1996 |

65 | AlphaSort: A RISC Machine Sort - Nyberg, Barclay, et al. - 1994 |
Citation Context ...the data can be found in the first-level cache, or must be fetched from a lower level of cache or even main memory. This has spawned a great deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dr...

62 | Engineering a sort function - Bentley, McIlroy - 1993 |

42 | Using advanced compiler technology to exploit the performance of the Cell Broadband Engine (TM) architecture - Eichenberger - 2006 |

40 | A Super Scalar Sort Algorithm for RISC Processors - Agarwal - 1996 |
Citation Context ...d in the first-level cache, or must be fetched from a lower level of cache or even main memory. This has spawned a great deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dramatically is ...

36 | The art of computer programming, volume 3 (2nd ed.): sorting and searching - Knuth - 1998 |
Citation Context ...] is sorted, allowing a[i] to be placed in the correct position as shown in the inner-loop above. On average, insertion sort performs approximately $n^2/4$ comparisons and $n^2/4$ assignments in total [Knuth 1998]. Thus insertion sort performs around half as many branches as selection sort. A very pleasing property of insertion sort is that it generally causes just a single branch misprediction per key, as Fi...

32 | A high-speed sorting procedure - Shell - 1959 |
Citation Context ...for almost all other types of branches [Uht et al. 1997]. In fact, a bimodal predictor out-performs a two-level adaptive predictor for shaker sort, as is shown in Figure 1(b). 5. SHELLSORT Shellsort [Shell 1959] was the first in-place sorting algorithm with time complexity better than $O(n^2)$. Algorithms like selection and bubble sort use each comparison to resolve at most one inversion (an inversion is a p...

25 | Engineering a cache-oblivious sorting algorithm - Brodal, Fagerberg, et al. |
Citation Context ...scribed by Aggarwal and Vitter [1988] is optimal in the external-memory model, while several mergesort variations have been described which are optimal in the cache-oblivious model [Frigo et al. 1999; Brodal et al. 2007]. We have chosen the mergesort algorithms above because they are relatively simple and include multi-way merging, which is generally used by the more elaborate mergesort variations. 7.2 Branch Predic...

22 | The art of computer programming, volume 1 (3rd ed.): fundamental algorithms - Knuth - 1997 |
Citation Context ... asymptotic bounds and Knuth’s MIX machine code both make drastically simplifying assumptions about the cost of machine instructions [Knuth 1997]. More recently researchers have recognized that on modern computers the cost of accessing memory can vary dramatically depending on whether the data can be found in the first-level cache, or must be...

22 | Optimizing sorting with genetic algorithms - Li, Garzaran, et al. - 2005 |

19 | Sorting on Electronic Computer Systems - Friend - 1956 |
Citation Context ...are performance counters. 9. RADIX SORT The purpose of this paper is to show the relevance of branch mispredictions to sorting. We now show how to develop a simple least significant digit radix sort [Friend 1956] implementation for 32-bit integers that, in all our tests, is more efficient than all other sorting algorithms presented in this paper. This is in part due to the tiny number of branch misprediction...

19 | Algorithm 232 - Williams - 1964 |
Citation Context ...sured using Pentium 4 hardware performance counters. (see Section 6 for further details on our heapsort implementations). 6. HEAPSORT Another well known general purpose sorting algorithm is heapsort [Williams 1964]. Heapsort’s running time is O(n log n). Heapsort begins by constructing a heap: a binary tree in which every level except possibly the deepest is entirely filled, with the deepest level filled from ...

17 | Caches and Algorithms - LaMarca - 1996 |
Citation Context ...fetched from a lower level of cache or even main memory. This has spawned a great deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dramatically is the conditional branch. Modern pipeline...

17 | Adapting radix sort to the memory hierarchy - Rahman, Raman - 2001 |
Citation Context ...This has spawned a great deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dramatically is the conditional branch. Modern pipelined processors depend on branch prediction for much of their perform...

16 | random.org: True random number service - Haahr - 2006 |

16 | Branch effect reduction techniques - Uht, Sindagi, et al. - 1997 |
Citation Context ... will cost around 30 cycles. Fortunately, the branches in most programs are very predictable, so branch mispredictions are usually rare. Indeed, prediction accuracies of greater than 90% are typical [Uht et al. 1997] with the best predictors. The cost of executing branches is particularly important for sorting because the inner-loops of most sorting algorithms consist of comparisons of items to be sorted. Thus, ...

14 | Super Scalar Sample Sort - Sanders, Winkel - 2004 |

12 | Limits of Branch Prediction - Mudge, Chen, et al. - 1996 |

11 | On the adaptiveness of Quicksort - Brodal, Fagerberg, et al. |

9 | Efficient sorting using registers and caches - Wickremesinghe, Arge, et al. |
Citation Context ...t deal of research on cache-efficient searching and sorting [Nyberg et al. 1994; Agarwal 1996; LaMarca and Ladner 1996; LaMarca 1996; LaMarca and Ladner 1997; Xiao et al. 2000; Rahman and Raman 2001; Wickremesinghe et al. 2002]. Another type of instruction whose cost can vary dramatically is the conditional branch. Modern pipelined processors depend on branch prediction for much of their performance. If the direction of a ...

6 | Treesort3 (Algorithm 245) - Floyd - 1964 |

3 | SimpleScalar tutorial (for release 4.0) - Austin, Ernst, et al. - 2001 |
Citation Context ...ost keys. Our results are averaged over these chunks of data. In order to experiment with a variety of cache and branch prediction results we used the SimpleScalar PISA processor simulator version 3 [Austin et al. 2001]. We used sim-cache and sim-bpred to generate results for caching and branch prediction characteristics respectively. We used a variety of cache configurations; generally speaking we used an 8 KB lev...

3 | Tradeoffs between branch mispredictions and comparisons for sorting algorithms - Brodal, Moruz - 2005 |

2 | Sorting in the Presence of Branch Prediction and Caches - Biggar, Gregg - 2005 |
Citation Context ...tant, we will describe the precise configuration used. We present results here only for direct mapped caches. The results for fully associative caches are similar, as we show in our technical report [Biggar and Gregg 2005], which contains complete results for a large number of branch predictor and cache configurations. For the measurements taken using SimpleScalar, we used power of two sized sets of random keys rangin...

2 | How Branch Mispredictions Affect Quicksort - Kaligosi, Sanders - 2006 |
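
The citation context for Friend (1956) above mentions a simple least significant digit radix sort for 32-bit integers whose efficiency is partly attributed to its very small number of branch mispredictions. The following is a minimal sketch of that general technique, not the paper's implementation. The 8-bit digit width, the counting-sort layout of each pass, and the function name `lsd_radix_sort` are assumptions made for illustration.

```python
def lsd_radix_sort(keys, bits_per_digit=8):
    """Return a sorted copy of `keys` (non-negative integers below 2**32).

    Each pass is a stable counting sort on one digit, starting from the
    least significant digit. No pass compares two keys with each other,
    so the only branches are loop-control branches.
    """
    radix = 1 << bits_per_digit
    mask = radix - 1
    src = list(keys)
    for k in src:
        if not 0 <= k < (1 << 32):
            raise ValueError("keys must fit in 32 unsigned bits")

    for shift in range(0, 32, bits_per_digit):
        # Pass 1: histogram of the current digit.
        counts = [0] * radix
        for k in src:
            counts[(k >> shift) & mask] += 1

        # Prefix sums turn counts into the first output slot of each digit.
        starts = [0] * radix
        position = 0
        for digit in range(radix):
            starts[digit] = position
            position += counts[digit]

        # Pass 2: stable scatter into the destination buffer.
        dst = [0] * len(src)
        for k in src:
            digit = (k >> shift) & mask
            dst[starts[digit]] = k
            starts[digit] += 1
        src = dst
    return src


if __name__ == "__main__":
    print(lsd_radix_sort([170, 45, 75, 2**32 - 1, 0, 45]))
```

With 8-bit digits the sketch makes four passes over the data. Because the inner loops contain no key comparisons, their branches depend only on loop counters and not on the data, which is the property the context attributes the low misprediction count to. In Python the interpreter adds its own branches, so the sketch illustrates the structure rather than the measured performance.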