Results 11–20 of 150
Burst Tries: A Fast, Efficient Data Structure for String Keys
ACM Transactions on Information Systems, 2002
Abstract

Cited by 31 (10 self)
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
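As a concrete illustration of the idea in the abstract, here is a minimal burst-trie sketch in Python. The names and the bursting threshold are mine, not the authors' code: trie nodes are plain dicts, leaves are small containers of (suffix, count) pairs, and a container that outgrows its limit "bursts" into a new trie node.

```python
BURST_LIMIT = 4  # container capacity before bursting; the paper tunes this experimentally


def bt_insert(node, word, count=1):
    """Insert `word` into trie `node` (a dict). Leaves are lists of [suffix, count]."""
    if word == "":
        node[""] = node.get("", 0) + count   # word ends exactly at this node
        return
    c, rest = word[0], word[1:]
    child = node.get(c)
    if child is None:
        node[c] = [[rest, count]]            # start a fresh container
    elif isinstance(child, dict):
        bt_insert(child, rest, count)        # descend into a trie node
    else:
        for entry in child:                  # linear scan of the small container
            if entry[0] == rest:
                entry[1] += count
                return
        child.append([rest, count])
        if len(child) > BURST_LIMIT:
            node[c] = bt_burst(child)        # container is full: burst it


def bt_burst(container):
    """Replace a full container with a new trie node holding its strings."""
    new = {}
    for suffix, count in container:
        bt_insert(new, suffix, count)
    return new


def bt_count(node, word):
    if word == "":
        return node.get("", 0)
    child = node.get(word[0])
    if child is None:
        return 0
    if isinstance(child, dict):
        return bt_count(child, word[1:])
    return next((n for s, n in child if s == word[1:]), 0)


def bt_words(node, prefix=""):
    """Yield (word, count) in sorted order, the property hash tables lack."""
    for key in sorted(node):
        child = node[key]
        if key == "":
            yield prefix, child
        elif isinstance(child, dict):
            yield from bt_words(child, prefix + key)
        else:
            for suffix, count in sorted(child):
                yield prefix + key + suffix, count


root = {}
for w in ["the", "a", "the", "trie", "tried", "a", "the"]:
    bt_insert(root, w)
```

Because node keys and container suffixes are iterated in sorted order, the structure yields its strings lexicographically, which is what distinguishes it from a hash table for this task.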
In-memory Hash Tables for Accumulating Text Vocabularies
Information Processing Letters, 2001
Abstract

Cited by 30 (10 self)
In this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most efficient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably effective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequently-accessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%–40% slower than a table with around one string per slot (while a table without move-to-front is perhaps 40% slower again), and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more efficient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average.
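The refinement the abstract describes is easy to sketch. The chained hash table below with move-to-front lists is a minimal illustration; the class and method names are mine, and Python's built-in `hash` stands in for the paper's string hash function.

```python
class MTFHash:
    """Chained hash table whose chains are move-to-front lists.

    On every hit, the accessed node is moved to the head of its chain,
    so frequent words cluster at the front and most lookups touch only
    the first node or two, even at high load.
    """

    def __init__(self, slots=1024):      # fixed size, as the paper recommends
        self.slots = slots
        self.table = [[] for _ in range(slots)]

    def add(self, word):
        chain = self.table[hash(word) % self.slots]
        for i, entry in enumerate(chain):
            if entry[0] == word:
                entry[1] += 1
                if i:                    # move-to-front on access
                    chain.insert(0, chain.pop(i))
                return
        chain.insert(0, [word, 1])       # new words also start at the front

    def count(self, word):
        for w, n in self.table[hash(word) % self.slots]:
            if w == word:
                return n
        return 0
```

With a skewed word distribution, the common words migrate to chain heads, which is why a small, heavily loaded table stays fast.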
On Sorting Strings in External Memory
1997
Abstract

Cited by 27 (12 self)
Lars Arge, Paolo Ferragina, Roberto Grossi, Jeffrey Scott Vitter. Abstract. In this paper we address for the first time the I/O complexity of the problem of sorting strings in external memory, which is a fundamental component of many large-scale text applications. In the standard unit-cost RAM comparison model, the complexity of sorting K strings of total length N is $\Theta(K \log_2 K + N)$. By analogy, in the external memory (or I/O) model, where the internal memory has size M and the block transfer size is B, it would be natural to guess that the I/O complexity of sorting strings is $\Theta(\frac{K}{B} \log_{M/B} \frac{K}{B} + \frac{N}{B})$, but the known algorithms do not come even close to achieving this bound. Our results show, somewhat counterintuitively, that the I/O complexity of string sorting depends upon the length of the strings relative to the block size. We first consider a simple comparison I/O model, where one is not allowed to break the strings into their characters, and we sho...
Fast lightweight suffix array construction and checking
14th Annual Symposium on Combinatorial Pattern Matching, 2003
Abstract

Cited by 26 (5 self)
We describe an algorithm that, for any $v \in [2, n]$, constructs the suffix array of a string of length n in $O(vn + n \log n)$ time using $O(v + n/\sqrt{v})$ space in addition to the input (the string) and the output (the suffix array). By setting $v = \log n$, we obtain an $O(n \log n)$ time algorithm using $O(n/\sqrt{\log n})$ additional space.
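For readers unfamiliar with the object being built: a suffix array is just the permutation of starting positions that sorts all suffixes of the string. The naive construction below (my own illustration, not the paper's algorithm) makes the definition concrete, but it copies suffixes during comparison and can cost far more time and space than the paper's bounds.

```python
def suffix_array(s):
    """Return the suffix array of s: start positions in sorted-suffix order.

    Naive version for illustration only; comparing whole suffix slices can
    take O(n^2 log n) time in the worst case, whereas lightweight algorithms
    achieve near-linear time in small working space.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


sa = suffix_array("banana")
# suffixes in sorted order: "a", "ana", "anana", "banana", "na", "nana"
```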
Modifications of the Burrows and Wheeler Data Compression Algorithm
Proceedings of the IEEE Data Compression Conference, 1999
Abstract

Cited by 25 (3 self)
In this paper we improve upon these previous results on the BW algorithm. Based on the context tree model, we consider the specific statistical properties of the data at the output of the BWT. We describe six important properties, three of which have not been described elsewhere. These considerations lead to modifications of the coding method, which in turn improve the coding efficiency. We briefly describe how to compute the BWT with low complexity in time and space, using suffix trees in two different representations. Finally, we present experimental results about the compression rate and running time of our method, and compare these results to previous achievements. More references on the methods described in this paper can be found in [1, 5].
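The forward transform itself is compact when phrased via a suffix array. The sketch below is my own illustration (with a naive suffix sort standing in for the paper's low-complexity suffix-tree constructions): the BWT output is the last column of the sorted rotations of the input plus a sentinel.

```python
def bwt(s, sentinel="$"):
    """Burrows-Wheeler transform of s via a suffix array.

    Assumes `sentinel` does not occur in s and sorts below every character,
    so sorting suffixes of s + sentinel equals sorting the rotations.
    """
    s += sentinel
    sa = sorted(range(len(s)), key=lambda i: s[i:])   # naive suffix sort
    # The BWT is the character preceding each suffix in sorted order
    # (wrapping around for the suffix that starts at position 0).
    return "".join(s[i - 1] for i in sa)
```

It is exactly this suffix-sorting step that dominates the cost of Block Sorting compression, which is why low-complexity suffix structures matter here.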
A Fast Algorithm for Making Suffix Arrays and for Burrows-Wheeler Transformation
Proceedings of the IEEE Data Compression Conference, Snowbird, Utah, March 30 – April 1, 1998
Abstract

Cited by 25 (3 self)
We propose a fast and memory-efficient algorithm for sorting the suffixes of a text in lexicographic order. It is important to sort suffixes because an array of indexes of suffixes, called a suffix array, is a memory-efficient alternative to the suffix tree. Sorting suffixes is also used for the Burrows-Wheeler transformation in Block Sorting text compression; therefore, fast sorting algorithms are desired. We compare
The Analysis of Hybrid Trie Structures
1998
Abstract

Cited by 24 (2 self)
This paper provides a detailed analysis of various implementations of digital tries, including the “ternary search tries” of Bentley and Sedgewick. The methods employed combine symbolic uses of generating functions, Poisson models, and Mellin transforms. Theoretical results are matched against real-life data and justify the claim that ternary search tries are a highly efficient dynamic dictionary structure for strings and textual data.
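A minimal sketch of the ternary search trie analyzed in the paper, in the style of Bentley and Sedgewick (node layout and names are mine): each node stores one character and three links, less-than, equal (advance to the next character of the key), and greater-than.

```python
class TSTNode:
    """One node of a ternary search trie: a character and three children."""
    __slots__ = ("ch", "lo", "eq", "hi", "end")

    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.end = ch, None, None, None, False


def tst_insert(node, word):
    """Insert a non-empty word; returns the (possibly new) subtree root."""
    ch = word[0]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word)        # branch left on mismatch
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word)        # branch right on mismatch
    elif len(word) > 1:
        node.eq = tst_insert(node.eq, word[1:])    # match: consume one character
    else:
        node.end = True                            # word ends at this node
    return node


def tst_contains(node, word):
    if node is None:
        return False
    ch = word[0]
    if ch < node.ch:
        return tst_contains(node.lo, word)
    if ch > node.ch:
        return tst_contains(node.hi, word)
    if len(word) > 1:
        return tst_contains(node.eq, word[1:])
    return node.end
```

The hybrid nature that the paper analyzes is visible here: within one character position the structure behaves like a binary search tree, while the `eq` links form the trie skeleton.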
An asymptotic theory for Cauchy-Euler differential equations with applications to the analysis of algorithms
2002
Abstract

Cited by 24 (10 self)
Cauchy-Euler differential equations surfaced naturally in a number of sorting and searching problems, notably in quicksort and binary search trees and their variations. The asymptotics of coefficients of functions satisfying such equations have been studied for several special cases in the literature. We study in this paper the most general framework for Cauchy-Euler equations and propose an asymptotic theory that covers almost all applications where Cauchy-Euler equations appear. Our approach is very general and requires almost no background on differential equations. Indeed, the whole theory can be stated in terms of recurrences instead of functions. Old and new applications of the theory are given. New phase changes of limit laws of new variations of quicksort are systematically derived. We apply our theory to about a dozen diverse examples in quicksort, binary search trees, urn models, increasing trees, etc.
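For orientation only, here is the classical second-order Cauchy-Euler equation (the textbook special case, not the paper's general framework), together with the substitution that solves it and the reason such equations restate as recurrences:

```latex
% Classical homogeneous second-order Cauchy-Euler equation:
x^2 y''(x) + \alpha x\, y'(x) + \beta\, y(x) = 0.
% Substituting y = x^r yields the indicial polynomial
r(r-1) + \alpha r + \beta = 0,
% whose roots give the basis solutions x^{r_1}, x^{r_2}.
% For a generating function f(x) = \sum_n a_n x^n, the operator x\,\tfrac{d}{dx}
% maps a_n \mapsto n\, a_n, which is why equations built from such operators
% translate directly into recurrences on the coefficient sequence (a_n).
```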
Implementing Radixsort
ACM Journal of Experimental Algorithmics, 1998
Abstract

Cited by 19 (1 self)
We present and evaluate several new optimization and implementation techniques for string sorting. In particular, we study a recently published radix sorting algorithm, Forward radixsort, that has a provably good worst-case behavior. Our experimental results indicate that radix sorting is considerably faster (often more than twice as fast) than comparison-based sorting methods. This is true even for small input sequences. We also show that it is possible to implement a radix sort with good worst-case running time without sacrificing average-case performance. Our implementations are competitive with the best previously published string sorting algorithms. Code, test data, and test results are available from the World Wide Web.

1. Introduction. Radix sorting is a simple and very efficient sorting method that has received too little attention. A common misconception is that a radix sorting algorithm either has to inspect all the characters of the input or use an inordinate amount of extra...
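For context, the following is the textbook MSD (most-significant-digit-first) radix sort for strings, not Forward radixsort itself: strings are bucketed on the character at the current position, and each bucket is recursively sorted on the next position.

```python
def msd_radix_sort(strings, d=0):
    """Sort a list of strings by bucketing on the character at position d.

    Strings shorter than d+1 characters are exhausted and sort first;
    each non-empty bucket is then sorted recursively on position d+1.
    """
    if len(strings) <= 1:
        return strings
    done, buckets = [], {}
    for s in strings:
        if len(s) == d:
            done.append(s)                       # exhausted: sorts before extensions
        else:
            buckets.setdefault(s[d], []).append(s)
    for ch in sorted(buckets):                   # process buckets in character order
        done.extend(msd_radix_sort(buckets[ch], d + 1))
    return done
```

Note the property the abstract's "misconception" refers to: recursion stops as soon as a bucket has at most one string, so the algorithm need not inspect every character of every input string.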
Anytime classification using the nearest neighbor algorithm with applications to stream mining
IEEE International Conference on Data Mining (ICDM), 2006
Abstract

Cited by 18 (8 self)
For many real-world problems we must perform classification under widely varying amounts of computational resources. For example, if asked to classify an instance taken from a bursty stream, we may have from milliseconds to minutes to return a class prediction. For such problems an anytime algorithm may be especially useful. In this work we show how we can convert the ubiquitous nearest neighbor classifier into an anytime algorithm that can produce an instant classification, or, if given the luxury of additional time, can utilize the extra time to increase classification accuracy. We demonstrate the utility of our approach with a comprehensive set of experiments on data from diverse domains.