IP-Address Lookup Using LC-Tries, 1998
Cited by 85 (0 self)
There has recently been a notable interest in the organization of routing information to enable fast lookup of IP addresses. The interest is primarily motivated by the goal of building multi-Gb/s routers for the Internet, without having to rely on multi-layer switching techniques. We address this problem by using an LC-trie, a trie structure with combined path and level compression. This data structure enables us to build efficient, compact and easily searchable implementations of an IP routing table. The structure can store both unicast and multicast addresses with the same average search times. The search depth increases as Θ(log log n) with the number of entries in the table for a large class of distributions and it is independent of the length of the addresses. A node in the trie can be coded with four bytes. Only the size of the base vector, which contains the search strings, grows linearly with the length of the addresses when extended from 4 to 16 bytes, as mandated by the shift from IP version 4 to version 6. We present the basic structure, as well as an adaptive version that roughly doubles the number of lookups per second. More general classifications of packets that are needed for link sharing, quality of service provisioning and for multicast and multipath routing are also discussed. Our experimental results compare favorably with those reported previously in the research literature.
Implementing Sorting in Database Systems
- ACM Comput. Surv., 2006
Cited by 12 (3 self)
Most commercial database systems do (or should) exploit many sorting techniques that are publicly known, but not readily available in the research literature. These techniques improve both sort performance on modern computer systems and the ability to adapt gracefully to resource fluctuations in multiuser operations. This survey collects many of these techniques for easy reference by students, researchers, and product developers. It covers in-memory sorting, disk-based external sorting, and considerations that apply specifically to sorting in database systems.
Cache-Conscious Sorting of Large Sets of Strings with Dynamic Tries
Cited by 10 (4 self)
Ongoing changes in computer performance are affecting the efficiency of string sorting algorithms. The size of main memory in typical computers continues to grow, but memory accesses require increasing numbers of instruction cycles, which is a problem for the most efficient of the existing string-sorting algorithms as they do not utilise cache particularly well for large data sets. We propose a new sorting algorithm for strings, burstsort, based on dynamic construction of a compact trie in which strings are kept in buckets. It is simple, fast, and efficient. We experimentally compare burstsort to existing string-sorting algorithms on large and small sets of strings with a range of characteristics. These experiments show that, for large sets of strings, burstsort is almost twice as fast as any previous algorithm, due primarily to a lower rate of cache miss.
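The bucket-and-burst idea described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Node`, `insert`, `burst`, `traverse`, and `burstsort` are hypothetical, the `BURST_LIMIT` threshold is an arbitrary small value chosen so that bursting is visible on toy input, and Python's built-in `sorted` stands in for whatever in-cache sort a real implementation would apply to each bucket.

```python
BURST_LIMIT = 4  # assumed toy threshold; a real bucket would hold far more strings


class Node:
    """Trie node: each child is either another Node or a bucket (a plain list)."""

    def __init__(self):
        self.children = {}  # next character -> Node or list of strings
        self.ended = []     # strings that end exactly at this node


def insert(node, s, depth=0):
    """Walk the trie along s and drop s into the bucket it reaches."""
    while True:
        if depth == len(s):
            node.ended.append(s)
            return
        c = s[depth]
        child = node.children.get(c)
        if child is None:
            node.children[c] = [s]          # start a new bucket
            return
        if isinstance(child, list):
            child.append(s)
            if len(child) > BURST_LIMIT:    # bucket too large: burst it
                burst(node, c, depth + 1)
            return
        node, depth = child, depth + 1      # descend into an existing trie node


def burst(parent, c, depth):
    """Replace the bucket under character c with a node and redistribute it."""
    bucket = parent.children[c]
    new_node = Node()
    parent.children[c] = new_node
    for s in bucket:
        insert(new_node, s, depth)


def traverse(node, out):
    """In-order walk: strings ending here first, then children by character."""
    out.extend(node.ended)
    for c in sorted(node.children):
        child = node.children[c]
        if isinstance(child, list):
            out.extend(sorted(child))       # small bucket, sorted on its own
        else:
            traverse(child, out)


def burstsort(strings):
    root = Node()
    for s in strings:
        insert(root, s)
    out = []
    traverse(root, out)
    return out
```

Calling `burstsort(words)` on a list of strings returns them in the same order as `sorted(words)`. The sketch shows only the data-structure logic; the cache behaviour the abstract measures depends on memory layout and cannot be reproduced in pure Python.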
Using masks, suffix array-based data structures, and multidimensional arrays to compute positional n-gram statistics from corpora
- In Proceedings of the Workshop on Multiword Expressions of the 41st Annual Meeting of the Association for Computational Linguistics, 2003
Cited by 5 (2 self)
This paper describes an implementation to compute positional ngram statistics (i.e., Frequency and Mutual Expectation) based on masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extraction of discontinuous collocations from large corpora. However, its computation is heavy. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus in a seven-word window context. In comparison, only 700,000 ngrams would be computed for the classical ngram model. It is clear that huge efforts need to be made to process positional ngram statistics in reasonable time and space. Our solution shows O(h(F) N log N) time complexity, where N is the corpus size and h(F) is a function of the window context.
Cache-efficient string sorting using copying
- In submission, 2006
Cited by 5 (3 self)
Burstsort is a cache-oriented sorting technique that uses a dynamic trie to efficiently divide large sets of string keys into related subsets small enough to sort in cache. In our original burstsort, string keys sharing a common prefix were managed via a bucket of pointers represented as a list or array; this approach was found to be up to twice as fast as the previous best string sorts, mostly because of a sharp reduction in out-of-cache references. In this paper we introduce C-burstsort, which copies the unexamined tail of each key to the bucket and discards the original key to improve data locality. On both Intel and PowerPC architectures, and on a wide range of string types, we show that sorting is typically twice as fast as our original burstsort, and four to five times faster than multikey quicksort and previous radixsorts. A variant that copies both suffixes and record pointers to buckets, CP-burstsort, uses more memory but provides stable sorting. In current computers, where performance is limited by memory access latencies, these new algorithms can dramatically reduce the time needed for internal sorting of large numbers of strings.
Using random sampling to build approximate tries for efficient string sorting
- In Proc. International Workshop on Efficient and Experimental
Cited by 1 (1 self)
Algorithms for sorting large datasets can be made more efficient with careful use of memory hierarchies and reduction in the number of costly memory accesses. In earlier work, we introduced burstsort, a new string sorting algorithm that on large sets of strings is almost twice as fast as previous algorithms, primarily because it is more cache-efficient. The approach in burstsort is to dynamically build a small trie that is used to rapidly allocate each string to a bucket. In this paper, we introduce new variants of our algorithm: SR-burstsort, DR-burstsort, and DRL-burstsort. These algorithms use a random sample of the strings to construct an approximation to the trie prior to sorting. Our experimental results with sets of over 30 million strings show that the new variants reduce cache misses further than did the original burstsort, by up to 37%, while simultaneously reducing instruction counts by up to 24%. In pathological cases, even further savings can be obtained.
Fast String Sorting Using Order-Preserving Compression
We give experimental evidence for the benefits of order-preserving compression in sorting algorithms. While, in general, any algorithm might benefit from compressed data because of reduced paging requirements, we identified two natural candidates that would further benefit from order-preserving compression, namely string-oriented sorting algorithms and word-RAM algorithms for keys of bounded length. The word-RAM model has some of the fastest known sorting algorithms in practice. These algorithms are designed for keys of bounded length, usually 32 or 64 bits, which limits their direct applicability for strings. One possibility is to use an order-preserving compression scheme, so that a bounded-key-length algorithm can be applied. For the case of standard algorithms, we took what is considered to be among the fastest non-word-RAM string sorting algorithms, Fast MKQSort, and measured its performance on compressed data. The Fast MKQSort algorithm of Bentley and Sedgewick is optimized to handle text strings. Our experiments show that order-preserving compression techniques result in savings of approximately 15% over the same algorithm on noncompressed data. For the word-RAM, we modified Andersson’s sorting algorithm to handle variable-length keys. The resulting algorithm is faster than the standard Unix sort by a factor of 1.5. Last, we used an order-preserving scheme that is within a constant additive term
Implementation of Sorting in Database Systems
It has often been said that sorting algorithms are very instructional in their own right as well as representative of a variety of computer algorithms, and that the performance of sorting is indicative of the performance of a variety of other data management tasks. Therefore, there is a fair amount of literature on the theory of sorting as well as on specific benchmark results. On the other hand, most commercial implementations of sorting do (or should!) exploit many techniques that are publicly known but not readily available in the research literature. This survey collects them for easy reference by students, researchers, and product developers. Its main purpose is not to introduce new algorithmic techniques or to evaluate experimentally the effectiveness of any one individual technique; instead, it gathers and organizes such techniques in order to enable, stimulate, and focus future research and development.
Using Masks, Suffix Array-based Data Structures and Multidimensional Arrays to Compute Positional Ngram Statistics from Corpora
- Analysis, Acquisition and Treatment, pp. 25-32
This paper describes an implementation to compute positional ngram statistics (i.e., Frequency and Mutual Expectation) based on masks, suffix array-based data structures and multidimensional arrays. Positional ngrams are ordered sequences of words that represent continuous or discontinuous substrings of a corpus. In particular, the positional ngram model has shown successful results for the extraction of discontinuous collocations from large corpora. However, its computation is heavy. For instance, 4,299,742 positional ngrams (n=1..7) can be generated from a 100,000-word corpus in a seven-word window context. In comparison, only 700,000 ngrams would be computed for the classical ngram model. It is clear that huge efforts need to be made to process positional ngram statistics in reasonable time and space. Our solution shows O(h(F) N log N) time complexity, where N is the corpus size and h(F) is a function of the window context.
Post BWT Stages of the . . .
The lossless Burrows-Wheeler compression algorithm has received considerable attention over recent years for both its simplicity and effectiveness. It is based on a permutation of the input sequence (the Burrows-Wheeler transformation), which groups symbols with a similar context close together. In the original version, this permutation was followed by a Move-To-Front transformation and a final entropy coding stage. Later versions placed different algorithms after the Burrows-Wheeler transformation, since these subsequent stages have a significant influence on the compression rate. This article describes different algorithms and improvements for these post-BWT stages, including a new context-based approach. Results for compression rates are presented together with compression and decompression times on the Calgary corpus, the Canterbury corpus, the large Canterbury corpus and the Lukas 2D 16-bit medical image corpus.
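For orientation, the first two stages of the original pipeline mentioned in this abstract (the Burrows-Wheeler transformation followed by Move-To-Front) can be sketched as below. This is a naive textbook illustration rather than the article's method: the function names are hypothetical, the `"\0"` sentinel is an assumption (it must not occur in the input), rotations are sorted explicitly instead of via a suffix array, and the entropy coding stage is omitted entirely.

```python
def bwt(text, sentinel="\0"):
    """Burrows-Wheeler transformation: last column of the sorted rotations."""
    s = text + sentinel  # assumed sentinel marks the end of the original text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="\0"):
    """Rebuild the sorted rotation table column by column (quadratic, demo only)."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


def move_to_front(data):
    """Replace each symbol by its index in a recency list, then move it to the front."""
    recency = sorted(set(data))
    output = []
    for symbol in data:
        index = recency.index(symbol)
        output.append(index)
        recency.insert(0, recency.pop(index))
    return output
```

Because the transformation groups symbols with similar contexts, the Move-To-Front output tends to contain many small indices (a repeated symbol is coded as 0), which is what a subsequent entropy coder would exploit.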

