Cache-Oblivious String B-trees
- IN: PROC. OF PRINCIPLES OF DATABASE SYSTEMS, 2006
Cited by 23 (5 self)
B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally
• when keys are long or of variable length,
• when keys are compressed, even when using front compression, the standard B-tree compression scheme,
• for range queries, and
• with respect to memory effects such as disk prefetching.
This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways:
• The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally.
• It maintains an index whose size is proportional to the front-compressed size of the dictionary. Furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner.
• It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block.
• It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
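For readers unfamiliar with front compression, the following is a minimal sketch of the standard scheme the abstract refers to, not of the COSB-tree itself. Each key in a sorted list is stored as the length of the prefix it shares with the previous key plus the remaining suffix. The function names and the sample keys are illustrative assumptions.

```python
def front_compress(sorted_keys):
    """Encode each key as (shared_prefix_len, suffix) relative to the previous key."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        # Length of the longest common prefix with the previous key.
        lcp = 0
        limit = min(len(prev), len(key))
        while lcp < limit and prev[lcp] == key[lcp]:
            lcp += 1
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded


def front_decompress(encoded):
    """Rebuild all keys by replaying the encoding from the first entry."""
    keys = []
    prev = ""
    for lcp, suffix in encoded:
        key = prev[:lcp] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical sample dictionary, already sorted.
keys = ["car", "card", "care", "cart", "dog"]
encoded = front_compress(keys)
restored = front_decompress(encoded)
```

In this plain scheme, recovering a single key requires replaying the entries before it, which is one way to read the abstract's remark about the cost of decompressing standard front-compressed strings.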
Perfect hashing for strings: Formalization and Algorithms
- IN PROC 7TH CPM, 1996
Cited by 6 (2 self)
Numbers and strings are two objects manipulated by most programs. Hashing has been well-studied for numbers and it has been effective in practice. In contrast, basic hashing issues for strings remain largely unexplored. In this paper, we identify and formulate the core hashing problem for strings that we call substring hashing. Our main technical results are highly efficient sequential/parallel (CRCW PRAM) Las Vegas type algorithms that determine a perfect hash function for substring hashing. For example, given a binary string of length n, one of our algorithms finds a perfect hash function in O(log n) time, O(n) work, and O(n) space; the hash value for any substring can then be computed in O(log log n) time using a single processor. Our approach relies on a novel use of the suffix tree of a string. In implementing our approach, we design optimal parallel algorithms for the problem of determining weighted ancestors on an edge-weighted tree that may be of independent interest.
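To make the goal concrete, here is a brute-force sketch of what a perfect hash for substrings could look like, under the assumption that "perfect" means equal substrings receive equal values and distinct substrings receive distinct values. It is not the suffix-tree algorithm of the paper and uses quadratic space; the function names are hypothetical.

```python
def build_substring_ids(text):
    """Assign a distinct small integer to every distinct substring of text.

    Brute force: enumerates all O(n^2) substrings, unlike the O(n)-space
    construction described in the abstract.
    """
    ids = {}
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            # setdefault keeps the first id given to a substring.
            ids.setdefault(text[i:j], len(ids))
    return ids


def substring_hash(text, ids, i, j):
    """Return the id of text[i:j] (half-open interval, 0 <= i < j <= len(text))."""
    return ids[text[i:j]]


text = "abab"
ids = build_substring_ids(text)
```

With this table, the two occurrences of "ab" at positions 0 and 2 map to the same value, while "a" and "b" map to different values.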
Optimal Parallel Dictionary Matching and Compression (Extended Abstract)
- 7th Annual ACM Symposium on Parallel Algorithms and Architectures, 1995
Cited by 6 (3 self)
Martin Farach and S. Muthukrishnan; Rutgers University, DIMACS; April 26, 1995
Emerging applications in multi-media and the Human Genome Project require storage and searching of large databases of strings -- a task for which parallelism seems the only hope. In this paper, we consider the parallelism in some of the fundamental problems in compressing strings and in matching large dictionaries of patterns against texts. We present the first work-optimal algorithms for these well-studied problems including the classical dictionary matching problem, optimal compression with a static dictionary and the universal data compression with dynamic dictionary of Lempel and Ziv. All our algorithms are randomized and they are of the Las Vegas type. Furthermore, they are fast, working in time logarithmic in the input size. Additionally, our algorithms seem suitable for a distributed implementation.
1 Introduction Large data bases of strings from multi-media applications and the Human G...
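As background for the dynamic-dictionary compression mentioned above, the following is a sequential sketch of an LZ78-style parse. Which Lempel-Ziv variant the paper treats is an assumption here, and this sketch is neither parallel nor randomized. The dictionary grows by one phrase per emitted pair; the function names are hypothetical.

```python
def lz78_parse(text):
    """Split text into (dictionary_index, next_char) pairs, LZ78 style."""
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty phrase
    pairs = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            # Keep extending the current match.
            phrase += ch
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase; emit it with no extra character.
        pairs.append((dictionary[phrase], ""))
    return pairs


def lz78_unparse(pairs):
    """Invert lz78_parse by rebuilding the same dictionary on the fly."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        phrases.append(phrase)
        out.append(phrase)
    return "".join(out)


pairs = lz78_parse("abababab")
```

The decoder needs no copy of the dictionary because it can reconstruct each phrase from the pairs it has already seen.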
Dictionary Compression on the PRAM
, 1994
Cited by 1 (0 self)
Parallel algorithms for lossless data compression via dictionary compression using optimal, longest fragment first (LFF), and greedy parsing strategies are described. Dictionary compression removes redundancy by replacing substrings of the input by references to strings stored in a dictionary. Given a static dictionary stored as a suffix tree, we present a CREW PRAM algorithm for optimal compression which runs in O(M + log M log n) time with O(nM^2) processors, where it is assumed that M is the maximum length of any dictionary entry. Under the same model, we give an algorithm for LFF compression which runs in O(log^2 n) time with O(n / log n) processors where it is assumed that the maximum dictionary entry is of length O(log n). We also describe an O(M + log n) time and O(n) processor algorithm for greedy parsing given a static or sliding-window dictionary. For sliding-window compression, a different approach finds the greedy parsing in O(log n) time using O(nM log M / log n) proces...
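The difference between greedy and optimal parsing can be seen in a small sequential sketch. It assumes every dictionary reference costs the same, so an optimal parse is one with the fewest phrases; the PRAM algorithms of the paper are not reproduced. The function names and the sample dictionary are hypothetical.

```python
def greedy_parse(text, dictionary):
    """At each position, take the longest dictionary entry that matches."""
    max_len = max(map(len, dictionary))
    phrases, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + length] in dictionary:
                phrases.append(text[i:i + length])
                i += length
                break
        else:
            raise ValueError(f"no dictionary entry matches at position {i}")
    return phrases


def optimal_parse(text, dictionary):
    """Fewest-phrases parse by dynamic programming from right to left."""
    n = len(text)
    max_len = max(map(len, dictionary))
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i] = fewest phrases covering text[i:]
    step = [0] * (n + 1)    # step[i] = length of the first phrase in that parse
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for length in range(1, min(max_len, n - i) + 1):
            if text[i:i + length] in dictionary and best[i + length] + 1 < best[i]:
                best[i] = best[i + length] + 1
                step[i] = length
    if best[0] == inf:
        raise ValueError("text cannot be parsed with this dictionary")
    phrases, i = [], 0
    while i < n:
        phrases.append(text[i:i + step[i]])
        i += step[i]
    return phrases


# Hypothetical static dictionary on which greedy parsing is not optimal.
dictionary = {"a", "b", "ab", "bbbb"}
greedy = greedy_parse("abbbb", dictionary)
optimal = optimal_parse("abbbb", dictionary)
```

On this input the greedy parse commits to "ab" and then needs three single-character phrases, while the optimal parse starts with the shorter "a" so that "bbbb" can cover the rest.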

