Searchable Words on the Web
International Journal of Digital Libraries, 2001
Cited by 18 (6 self)
In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word, and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 gigabytes of worldwide web documents, and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large data sets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
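The vocabulary accumulation described in this abstract can be illustrated with a minimal sketch. The word definition used here (a maximal run of ASCII letters, lowercased) is an assumption made for illustration and is not one of the definitions proposed in the paper; the function name `vocabulary_growth` is likewise hypothetical.

```python
import re

# Assumed word definition: a maximal run of ASCII letters, lowercased.
# The paper proposes its own definitions; this one is only illustrative.
WORD_RE = re.compile(r"[A-Za-z]+")


def vocabulary_growth(documents):
    """Return (total_occurrences, distinct_words, growth).

    growth[i] is the vocabulary size after processing document i.
    """
    vocabulary = set()
    total = 0
    growth = []
    for text in documents:
        for match in WORD_RE.finditer(text):
            total += 1
            vocabulary.add(match.group().lower())
        growth.append(len(vocabulary))
    return total, len(vocabulary), growth


docs = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
    "A cat and a dog.",
]
total, distinct, growth = vocabulary_growth(docs)
print(total, distinct, growth)
```

Dividing the total number of occurrences by the number of distinct words gives the kind of "1 word in N was new" ratio the abstract reports.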
Cache-conscious collision resolution in string hash tables
In SPIRE, 2005
Redesigning the String Hash Table, Burst Trie, and BST to Exploit Cache
2011
Cited by 5 (1 self)
A key decision when developing in-memory computing applications is choice of a mechanism to store and retrieve strings. The most efficient current data structures for this task are the hash table with move-to-front chains and the burst trie, both of which use linked lists as a substructure, and variants of binary search tree. These data structures are computationally efficient, but typical implementations use large numbers of nodes and pointers to manage strings, which is not efficient in use of cache. In this article, we explore two alternatives to the standard representation: the simple expedient of including the string in its node, and, for linked lists, the more drastic step of replacing each list of nodes by a contiguous array of characters. Our experiments show that, for large sets of strings, the improvement is dramatic. For hashing, in the best case the total space overhead is reduced to less than 1 bit per string. For the burst trie, over 300MB of strings can be stored in a total of under 200MB of memory with significantly improved search time. These results, on a variety of data sets, show that cache-friendly variants of fundamental data structures can yield remarkable gains in performance.
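As a rough illustration of the move-to-front chaining this abstract refers to, the sketch below keeps each chain as a Python list and moves a key to the front of its chain whenever a search finds it. It shows only the heuristic; it does not reproduce the cache-conscious layouts the article studies, and the class name `MoveToFrontHashTable` and its parameters are hypothetical.

```python
class MoveToFrontHashTable:
    """String hash table with chaining and a move-to-front heuristic."""

    def __init__(self, num_slots=1024):
        # One chain (a plain list) per slot.
        self.slots = [[] for _ in range(num_slots)]

    def _chain(self, key):
        return self.slots[hash(key) % len(self.slots)]

    def search(self, key):
        """Return True if key is present, moving it to the chain front."""
        chain = self._chain(key)
        for i, stored in enumerate(chain):
            if stored == key:
                if i:  # already at the front when i == 0
                    chain.insert(0, chain.pop(i))
                return True
        return False

    def insert(self, key):
        """Insert key if absent; return True if it was newly added."""
        if self.search(key):
            return False
        self._chain(key).insert(0, key)
        return True


# A single slot forces every key into the same chain, which makes the
# reordering easy to observe.
table = MoveToFrontHashTable(num_slots=1)
for word in ("a", "b", "c"):
    table.insert(word)
found = table.search("a")
print(found, table.slots[0])
```

Frequently accessed strings drift toward the front of their chains, so skewed access patterns such as word frequencies in text need fewer comparisons per lookup.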
Self-adjusting of ternary search tries using conditional rotations and randomized heuristics
Comput. J., 2005
Cited by 5 (1 self)
A Ternary Search Trie (TST) is a highly efficient dynamic dictionary structure applicable for strings and textual data. The strings are accessed based on a set of access probabilities and are to be arranged using a TST. We consider the scenario where the probabilities are not known a priori, and are time-invariant. Our aim is to adaptively restructure the TST so as to yield the best access or retrieval time. Unlike the case of lists and binary search trees, where numerous methods have been proposed, in the case of the TST, few adaptive schemes have been reported to date. In this paper, we consider various self-organizing schemes that were applied to Binary Search Trees, and apply them to TSTs. Three new schemes, which are the splaying, the conditional rotation and the randomization heuristics, have been proposed, tested and comparatively presented. The results demonstrate that the conditional rotation heuristic is the best when compared to other heuristics that are considered in the paper.
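For readers unfamiliar with the structure, a plain ternary search trie can be sketched as follows. This is a static TST with insertion and exact-match lookup only; it includes none of the splaying, conditional rotation, or randomization heuristics the paper evaluates, and all names in it are hypothetical.

```python
class Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch          # character stored at this node
        self.lo = None        # subtree for characters smaller than ch
        self.eq = None        # subtree for the next character of the key
        self.hi = None        # subtree for characters larger than ch
        self.is_end = False   # True if a stored key ends at this node


class TernarySearchTrie:
    def __init__(self):
        self.root = None

    def insert(self, word):
        if not word:
            raise ValueError("empty strings are not supported in this sketch")
        self.root = self._insert(self.root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_end = True
        return node

    def contains(self, word):
        if not word:
            return False
        node, i = self.root, 0
        while node is not None:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node = node.eq
                i += 1
            else:
                return node.is_end
        return False


tst = TernarySearchTrie()
for w in ("cat", "cap", "car", "dog"):
    tst.insert(w)
print(tst.contains("cat"), tst.contains("ca"), tst.contains("cats"))
```

The `lo` and `hi` links form a binary search tree over the characters at each position, which is the part of the structure that rotation-based heuristics restructure.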
S.: Optimal self-adjusting trees for dynamic string data in secondary storage
2007
Cited by 3 (0 self)
We present a self-adjusting layout scheme for suffix trees in secondary storage that provides an optimal number of disk accesses for a sequence of string or substring queries. This has been an open problem since Sleator and Tarjan presented their splaying technique to create self-adjusting binary search trees in 1985. In addition to resolving this open problem, our scheme provides two additional advantages: 1) The partitions are slowly readjusted, requiring fewer disk accesses than splaying methods, and 2) the initial state of the layout is balanced, making it useful even when the sequence of queries is not highly skewed. Our method is also applicable to PATRICIA trees, and potentially to other data structures.
Efficient Adaptive Data Compression Using Fano Binary Search Trees
In The 20th International Symposium on Computer and Information Sciences, 2005
Cited by 1 (0 self)
In this paper, we show an effective way of using adaptive self-organizing data structures in enhancing compression schemes. We introduce a new data structure, the Partitioning Binary Search Tree (PBST), which is based on the well-known Binary Search Tree (BST), and when used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST). The PBST and FBST can be maintained adaptively and in a self-organizing manner by using new tree-based operators, namely the ShiftToLeft (STL) and the ShiftToRight (STR) operators. The encoding and decoding procedures that also update the FBST have been implemented, and show that the adaptive Fano coding using FBSTs, the Huffman, and the greedy adaptive Fano coding achieve similar compression ratios.
Reducing Splaying by Taking Advantage of Working Sets
Access requests to keys stored in a data structure often exhibit locality of reference in practice. Such a regularity can be modeled, e.g., by working sets. In this paper we study to what extent the existence of working sets can be taken advantage of in splay trees. In order to reduce the number of costly splay operations we monitor for information on the current working set and its change. We introduce a simple algorithm which attempts to splay only when necessary. Under worst-case analysis the algorithm guarantees an amortized logarithmic bound. In empirical experiments it is 5% more efficient than randomized splay trees and at most 10% more efficient than the original splay tree. We also briefly analyze the usefulness of the commonly used Zipf's distribution as a general model of locality of reference.
Self-Optimizing Distributed Trees
We present a novel protocol for restructuring a tree-based overlay network in response to the workload of the application running over it. Through low-cost restructuring operations, our protocol incrementally adapts the tree so as to bring nodes that tend to communicate with one another closer together in the tree. It achieves this while respecting degree bounds on nodes so that, e.g., no node degenerates into a “hub” for the overlay. Moreover, it limits restructuring to those parts of the tree over which communication takes place, avoiding restructuring other parts of the tree unnecessarily. We show via experiments on PlanetLab that our protocol can significantly reduce communication latencies in workloads dominated by clusters of communicating nodes.
An Efficient Compression Scheme for Data Communication Which Uses a New Family of Self-Organizing Binary Search Trees
In this paper, we demonstrate that we can effectively use results from the field of adaptive self-organizing data structures in enhancing compression schemes. Unlike adaptive lists, which have already been used in compression, to the best of our knowledge, adaptive self-organizing trees have not been used in this regard. To achieve this, we introduce a new data structure, the Partitioning Binary Search Tree (PBST) which, although based on the well-known Binary Search Tree (BST), also appropriately partitions the data elements into mutually exclusive sets. When used in conjunction with Fano encoding, the PBST leads to the so-called Fano Binary Search Tree (FBST), which, indeed, incorporates the required Fano coding (nearly-equal-probability) property into the BST. We demonstrate how both the PBST and FBST can be maintained adaptively and in a self-organizing manner. The updating procedure that converts a PBST into an FBST, and the corresponding new tree-based operators, namely the ShiftToLeft (STL) and the ShiftToRight (STR) operators, are explicitly presented. The encoding and decoding procedures that also update the FBST have been implemented and rigorously tested. Our empirical results on files of the well-known benchmark, the Canterbury corpus, show that the adaptive Fano coding using FBSTs, the Huffman, and the greedy adaptive Fano coding achieve similar compression ratios. However, in terms of encoding/decoding speed, the new scheme is much faster than the latter two in the encoding phase, and they achieve approximately the same speed in the decoding phase. We believe that the same philosophy, namely that of using an adaptive self-organizing BST to maintain the frequencies, can also be utilized for other data encoding mechanisms, even as the Fenwick scheme has been used in arithmetic coding.