Results 1 - 10
of
10
In-memory Hash Tables for Accumulating Text Vocabularies
- Information Processing Letters
, 2001
"... this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and ch ..."
Abstract
-
Cited by 24 (9 self)
- Add to MetaCart
this paper we experimentally evaluate the performance of several data structures for building vocabularies, using a range of data collections and machines. Given the well-known properties of text and some initial experimentation, we chose to focus on the most promising candidates, splay trees and chained hash tables, also reporting results with binary trees. Of these, our experiments show that hash tables are by a considerable margin the most e#cient. We propose and measure a refinement to hash tables, the use of move-to-front lists. This refinement is remarkably e#ective: as we show, using a small table in which there are large numbers of strings in each chain has only limited impact on performance. Moving frequentlyaccessed words to the front of the list has the surprising property that the vast majority of accesses are to the first or second node. For example, our experiments show that in a typical case a table with an average of around 80 strings per slot is only 10%--40% slower than a table with around one string per slot---while a table without move-to-front is perhaps 40% slower again---and is still over three times faster than using a tree. We show, moreover, that a move-to-front hash table of fixed size is more e#cient in space and time than a hash table that is dynamically doubled in size to maintain a constant load average
Burst Tries: A Fast, Efficient Data Structure for String Keys
- ACM Transactions on Information Systems
, 2002
"... Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, t ..."
Abstract
-
Cited by 21 (10 self)
- Add to MetaCart
Many applications depend on efficient management of large sets of distinct strings in memory. For example, during index construction for text databases a record is held for each distinct word in the text, containing the word itself and information such as counters. We propose a new data structure, the burst trie, that has significant advantages over existing options for such applications: it requires no more memory than a binary tree; it is as fast as a trie; and, while not as fast as a hash table, a burst trie maintains the strings in sorted or near-sorted order. In this paper we describe burst tries and explore the parameters that govern their performance. We experimentally determine good choices of parameters, and compare burst tries to other structures used for the same task, with a variety of data sets. These experiments show that the burst trie is particularly effective for the skewed frequency distributions common in text collections, and dramatically outperforms all other data structures for the task of managing strings while maintaining sort order.
Searchable Words on the Web
- International Journal of Digital Libraries
, 2001
"... In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space req ..."
Abstract
-
Cited by 14 (6 self)
- Add to MetaCart
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of world-wide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
Self-Adjusting Trees in Practice for Large Text Collections
- Software - Practice and Experience
, 2002
"... Splay and randomised search trees are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree. Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collecti ..."
Abstract
-
Cited by 13 (4 self)
- Add to MetaCart
Splay and randomised search trees are self-balancing binary tree structures with little or no space overhead compared to a standard binary search tree. Both trees are intended for use in applications where node accesses are skewed, for example in gathering the distinct words in a large text collection for index construction. We investigate the efficiency of these trees for such vocabulary accumulation. Surprisingly, unmodified splaying and randomised search trees are on average around 25% slower than using a standard binary tree. We investigate heuristics to limit splay tree reorganisation costs and show their effectiveness in practice. In particular, a periodic rotation scheme improves the speed of splaying by 27%, while other proposed heuristics are less effective. We also report the performance of efficient bit-wise hashing and red-black trees for comparison.
Language Inference from Function Words
, 1993
"... Language surface structures demonstrate regularities that make it possible to learn a capacity for producing an infinite number of well-formed expressions. This paper outlines a system that uncovers and characterizes regularities through principled wholesale pattern analysis of copious amounts of ma ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
Language surface structures demonstrate regularities that make it possible to learn a capacity for producing an infinite number of well-formed expressions. This paper outlines a system that uncovers and characterizes regularities through principled wholesale pattern analysis of copious amounts of machine-readable text. The system uses the notion of closed-class lexemes to divide the input into phrases, and from these phrases infers lexical and syntactic information. The set of closed-class lexemes is derived from the text, and then these lexemes are clustered into functional types. Next the open-class words are categorized according to how they tend to appear in phrases and then clustered into a smaller number of open-class types. Finally these types are used to infer, and generalize, grammar rules. Statistical criteria are employed for each of these inference operations. The result is a relatively compact grammar that is guaranteed to cover every sentence in the source text that was u...
Self-adjusting of ternary search tries using conditional rotations and randomized heuristics
- Comput. J
, 2005
"... A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable for strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori, and is ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable for strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori, and is time-invariant. Our aim is to adaptively restructure the TST so as to yield the best access or retrieval time. Unlike the case of lists and binary search trees, where numerous methods have been proposed, in the case of the TST, currently, the number of reported adaptive schemes are few. In this paper, we consider various self-organizing schemes that were applied to Binary Search Trees, and apply them to TSTs. Three new schemes, which are the splaying, the conditional rotation and the randomization heuristics, have been proposed, tested and comparatively presented. The results demonstrate that the conditional rotation heuristic is the best when compared to other heuristics that are considered in the paper.
Performance of Data Structures for Small Sets of Strings
- Proc. of the Australasian conference on Computer Science
, 2002
"... Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Fundamental structures such as trees and hash tables are used for managing data in a huge variety of circumstances. Making the right choice of structure is essential to efficiency. In previous work we have explored the performance of a range of data structures -- different forms of trees, tries, and hash tables -- for the task of managing sets of millions of strings, and have developed new variants of each that are more efficient for this task than previous alternatives. In this paper we test the performance of the same data structures on small sets of strings, in the context of document processing for index construction. Our results show that the new structures, in particular our burst trie, are the most efficient choice for this task, thus demonstrating that they are suitable for managing sets of hundreds to millions of distinct strings, and for input of hundreds to billions of occurrences.
Phoneme Frequencies Follow a Yule Distribution (The Form of the Phonemic Distribution in World Languages)
"... Frequency of occurrence of words in a language is well described by Zipf’s (1949) law. However, Zipf’s law does not well describe the distribution of the phonemes from which words are composed. Examination of frequency of occurrence in 95 languages shows that phoneme frequencies are best described b ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Frequency of occurrence of words in a language is well described by Zipf’s (1949) law. However, Zipf’s law does not well describe the distribution of the phonemes from which words are composed. Examination of frequency of occurrence in 95 languages shows that phoneme frequencies are best described by an equation first developed by Yule (1924) that also describes the distribution of DNA codons. The Yule equation fits the distribution of phoneme frequencies better than the Zipf equation or equations proposed by Sigurd (1968) and Borodovsky and Gusein-Zade (1989).
A Model for Word Clustering
- Proc. ACM-SIGMOD Int
, 1992
"... It is common to model the distribution of words in text by measures such as the Poisson approximation. However, these measures ignore e#ects such as clustering: our analysis of document collections demonstrates that the Poisson approximation can significantly overestimate the probability that a ..."
Abstract
- Add to MetaCart
It is common to model the distribution of words in text by measures such as the Poisson approximation. However, these measures ignore e#ects such as clustering: our analysis of document collections demonstrates that the Poisson approximation can significantly overestimate the probability that a document contains a word. Based on our analysis, we propose a new model for distribution of words in text, and show how this model can be used to estimate the probability that a document contains a word and the number of distinct words in a document.

