## Using random sampling to build approximate tries for efficient string sorting

Venue: | In Proc. International Workshop on Efficient and Experimental |

Citations: | 1 - 1 self |

### BibTeX

@INPROCEEDINGS{Sinha_usingrandom,

author = {Ranjan Sinha and Justin Zobel},

title = {Using random sampling to build approximate tries for efficient string sorting},

booktitle = {In Proc. International Workshop on Efficient and Experimental},

year = {2004},

publisher = {Springer Verlag}

}

### Abstract

Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
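The idea described in the abstract can be illustrated with a small sketch: build a trie from a random sample, then route every string through that fixed trie into buckets and sort each bucket. This is a simplified illustration and not the authors' implementation. The names `Node`, `insert`, `sample_burstsort`, and the parameters `sample_fraction` and `sample_burst_limit` are hypothetical, and Python's built-in `sorted` stands in for whatever bucket-sorting routine the paper uses.

```python
import random

END = ""  # hypothetical bucket key for strings that end exactly at a node


class Node:
    def __init__(self):
        self.children = {}  # character -> child Node
        self.buckets = {}   # character (or END) -> list of strings


def insert(node, s, depth=0, burst_limit=None):
    # Walk down existing trie nodes, consuming one character per level.
    while depth < len(s) and s[depth] in node.children:
        node = node.children[s[depth]]
        depth += 1
    key = s[depth] if depth < len(s) else END
    bucket = node.buckets.setdefault(key, [])
    bucket.append(s)
    # Bursting only happens while the trie is built from the sample.
    if burst_limit is not None and key != END and len(bucket) > burst_limit:
        child = Node()
        node.children[key] = child
        del node.buckets[key]
        for t in bucket:
            insert(child, t, depth + 1, burst_limit)


def clear_buckets(node):
    # Keep the trie shape learned from the sample, drop the sampled strings.
    node.buckets = {}
    for child in node.children.values():
        clear_buckets(child)


def collect(node, out):
    # Strings ending at this node are all equal and precede longer strings.
    out.extend(node.buckets.get(END, []))
    keys = sorted(set(node.children) | (set(node.buckets) - {END}))
    for k in keys:
        if k in node.children:
            collect(node.children[k], out)
        else:
            out.extend(sorted(node.buckets[k]))  # stand-in bucket sort


def sample_burstsort(strings, sample_fraction=0.01, sample_burst_limit=4, seed=0):
    if not strings:
        return []
    rng = random.Random(seed)
    k = max(1, int(len(strings) * sample_fraction))
    root = Node()
    for s in rng.sample(strings, k):           # phase 1: approximate trie
        insert(root, s, burst_limit=sample_burst_limit)
    clear_buckets(root)
    for s in strings:                          # phase 2: no bursting, so no
        insert(root, s)                        # string is redistributed later
    out = []
    collect(root, out)                         # phase 3: in-order traversal
    return out
```

Because the trie is fixed before the full pass, each string is placed in a bucket once and never moved again, which is the re-access cost that the sampling variants aim to avoid. The sample fraction and burst limit here are arbitrary illustrative values.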

### Citations

1871 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context: ...reduced numbers of cache misses compared to previous techniques. Randomised algorithms. A randomised algorithm is one that makes random choices during its execution. According to Motwani and Raghavan [10], “two benefits of randomised algorithms have made them popular: simplicity and efficiency. For many applications, a randomised algorithm is the simplest available, or the fastest, or both.” One appli...

147 | Fast algorithms for sorting and searching strings
- Bentley, Sedgewick
- 1997
Citation Context: ...ore efficient than previous methods for large string sets. (In this paper, for reference we compare against three of the best previous string sorting algorithms: MBM radixsort [9], multikey quicksort [3], and adaptive radixsort [1, 11].) However, burstsort is not perfect. A key shortcoming is that individual strings must be re-accessed as the trie grows, to redistribute them into sub-buckets. If the ...

130 | Overview of the second text retrieval conference
- Harman
- 1995
Citation Context: ... have sufficient volumes of data. We have drawn on web data and genomic data. For the latter, we have parsed nucleotide strings into overlapping 9-grams. For the former, derived from the TREC project [5, 6], we extracted both words—alphabetic strings delimited by non-alphabetic characters—and URLs. For the words, we considered sets with and without duplicates, in both cases in order of occurrence in the...

113 | The influence of caches on the performance of sorting
- LaMarca
- 1997
Citation Context: ...t realistic on modern computer architectures, where the levels of memory have different latencies. While algorithms can be made more efficient by reducing the number of instructions, current research [8, 15, 17] shows that an algorithm can afford to increase the number of instructions if doing so improves the locality of memory accesses and thus reduces the ⋆ Citation details: R. Sinha and J. Zobel, “Using r...

62 | Results and challenges in Web search evaluation
- Hawking, Craswell, et al.
- 1999
Citation Context: ... have sufficient volumes of data. We have drawn on web data and genomic data. For the latter, we have parsed nucleotide strings into overlapping 9-grams. For the former, derived from the TREC project [5, 6], we extracted both words—alphabetic strings delimited by non-alphabetic characters—and URLs. For the words, we considered sets with and without duplicates, in both cases in order of occurrence in the...

28 | Burst tries: a fast, efficient data structure for string keys
- Heinz, Zobel, Williams
- 2002
Citation Context: ...on its length and how much of it needs to be read. In our previous work [15, 16], we introduced burstsort, a new cache-efficient string sorting algorithm. It is based on the burst trie data structure [7], where a set of strings is organised as a collection of buckets indexed by a small access trie. In burstsort, the trie is built dynamically as the strings are processed. During the first phase, at mo...

26 | On sorting strings in external memory
- Arge, Ferragina, et al.
- 1997
Citation Context: ... Brazil, May 2004, pp. 529–544. number of cache misses. In particular, recent work [8, 13, 17] has successfully adapted algorithms for sorting integers to memory hierarchies. According to Arge et al. [2] “string sorting is the most general formulation of sorting because it comprises integer sorting (i.e., strings of length one), multikey sorting (i.e., equal-length strings) and variable-length key so...

25 | Improving memory performance of sorting algorithms
- Xiao, Zhang, et al.
- 2000
Citation Context: ...t realistic on modern computer architectures, where the levels of memory have different latencies. While algorithms can be made more efficient by reducing the number of instructions, current research [8, 15, 17] shows that an algorithm can afford to increase the number of instructions if doing so improves the locality of memory accesses and thus reduces the ⋆ Citation details: R. Sinha and J. Zobel, “Using r...

22 | On Randomization in Sequential and Distributed Algorithms
- Gupta, Smolka, et al.
- 1994
Citation Context: ...t, or both.” One application of randomisation for sorting is to rearrange the input in order to remove any existing patterns, to ensure that the expected running time matches the average running time [4]. The best-known example of this is in quicksort, where randomisation of the input lessens the chance of quadratic running time. Input randomisation can also be used in cases such as binary search tre...

22 | Random Sampling from Databases: A Survey
- Olken, Rotem
- 1995
Citation Context: ...randomisation is to process a small sample from a larger collection. In simple random sampling, each individual key in a collection has an equal chance of being selected. According to Olken and Rotem [12], Random sampling is used on those occasions when processing the entire dataset is unnecessary and too expensive . . . The savings generated by sampling may arise either from reductions in the cost of...

18 | Implementing Radixsort
- Andersson, Nilsson
- 1998
Citation Context: ...methods for large string sets. (In this paper, for reference we compare against three of the best previous string sorting algorithms: MBM radixsort [9], multikey quicksort [3], and adaptive radixsort [1, 11].) However, burstsort is not perfect. A key shortcoming is that individual strings must be re-accessed as the trie grows, to redistribute them into sub-buckets. If the trie could be constructed ahead ...

15 | Radix Sorting & Searching
- Nilsson
- 1996
Citation Context: ...methods for large string sets. (In this paper, for reference we compare against three of the best previous string sorting algorithms: MBM radixsort [9], multikey quicksort [3], and adaptive radixsort [1, 11].) However, burstsort is not perfect. A key shortcoming is that individual strings must be re-accessed as the trie grows, to redistribute them into sub-buckets. If the trie could be constructed ahead ...

15 | Adapting Radix Sort to the Memory Hierarchy
- Rahman, Raman
Citation Context: ...rkshop On Experimental Algorithmics, C.C. Ribeiro and S.L. Martins (eds), Springer-Verlag, LNCS 3059, Angra dos Reis, Brazil, May 2004, pp. 529–544. number of cache misses. In particular, recent work [8, 13, 17] has successfully adapted algorithms for sorting integers to memory hierarchies. According to Arge et al. [2] “string sorting is the most general formulation of sorting because it comprises integer so...

14 | Engineering radix sort
- McIlroy, Bostic
- 1993
Citation Context: ...d burstsort to be much more efficient than previous methods for large string sets. (In this paper, for reference we compare against three of the best previous string sorting algorithms: MBM radixsort [9], multikey quicksort [3], and adaptive radixsort [1, 11].) However, burstsort is not perfect. A key shortcoming is that individual strings must be re-accessed as the trie grows, to redistribute them i...

12 | Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
- Sinha, Zobel
Citation Context: ...t realistic on modern computer architectures, where the levels of memory have different latencies. While algorithms can be made more efficient by reducing the number of instructions, current research [8, 15, 17] shows that an algorithm can afford to increase the number of instructions if doing so improves the locality of memory accesses and thus reduces the ⋆ Citation details: R. Sinha and J. Zobel, “Using r...

3 | Efficient trie-based sorting of large sets of strings
- Sinha, Zobel
- 2003
Citation Context: .... Buckets are represented as arrays of 16, 128, 1024, or 8192 pointers, growing from one size to the next as the number of strings to be stored increases, as we have described elsewhere for burstsort [16]. DRL-burstsort. For the largest sets of strings, the trie is much too large to be cache resident. That is, there is a trade-off between whether the largest bucket can fit in cache and whether the tri...

1 | Valgrind—memory and cache profiler
- Seward
- 2001
Citation Context: ...lliseconds of CPU time has been measured; the time taken for I/O or to parse the collection are not included as these are in common for all algorithms. For the cache simulations, we have used valgrind [14]. 5 Results We present results in three forms: time to sort each data set, instruction counts, and L2 cache misses. Times for sorting are shown in Tables 2 to 6. Instruction counts are shown in Figure...