Results 1 – 6 of 6
Cache-Oblivious String B-trees
In: Proc. of Principles of Database Systems, 2006
Abstract

Cited by 26 (5 self)
B-trees are the data structure of choice for maintaining searchable data on disk. However, B-trees perform suboptimally
• when keys are long or of variable length,
• when keys are compressed, even when using front compression, the standard B-tree compression scheme,
• for range queries, and
• with respect to memory effects such as disk prefetching.
This paper presents a cache-oblivious string B-tree (COSB-tree) data structure that is efficient in all these ways:
• The COSB-tree searches asymptotically optimally and inserts and deletes nearly optimally.
• It maintains an index whose size is proportional to the front-compressed size of the dictionary. Furthermore, unlike standard front-compressed strings, keys can be decompressed in a memory-efficient manner.
• It performs range queries with no extra disk seeks; in contrast, B-trees incur disk seeks when skipping from leaf block to leaf block.
• It utilizes all levels of a memory hierarchy efficiently and makes good use of disk locality by using cache-oblivious layout strategies.
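The "front compression" the abstract refers to (also called front coding) stores each key in a sorted list as the length of the prefix it shares with the previous key, plus its remaining suffix. A minimal sketch is below; the function names are illustrative, not from the paper:

```python
def front_encode(sorted_keys):
    """Front coding: each key after the first is stored as
    (length of shared prefix with the previous key, remaining suffix)."""
    encoded = []
    prev = ""
    for key in sorted_keys:
        # longest common prefix length with the previous key
        lcp = 0
        while lcp < min(len(prev), len(key)) and prev[lcp] == key[lcp]:
            lcp += 1
        encoded.append((lcp, key[lcp:]))
        prev = key
    return encoded

def front_decode(encoded):
    """Reconstruct the full key list from the front-coded form."""
    keys, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        keys.append(prev)
    return keys

keys = ["apple", "applet", "apply", "banana", "band"]
enc = front_encode(keys)
# enc == [(0, 'apple'), (5, 't'), (4, 'y'), (0, 'banana'), (3, 'd')]
assert front_decode(enc) == keys
```

Note that `front_decode` must scan from the start of the block to recover a key, which is exactly the memory-inefficiency of standard front-compressed strings that the COSB-tree avoids.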
Efficient Randomized Dictionary Matching Algorithms (Extended Abstract), 1992
Abstract

Cited by 18 (5 self)
The standard string matching problem involves finding all occurrences of a single pattern in a single text. While this approach works well in many application areas, there are some domains in which it is more appropriate to deal with dictionaries of patterns. A dictionary is a set of patterns; the goal of dictionary matching is to find all dictionary patterns in a given text, simultaneously. In string matching, randomized algorithms have primarily made use of randomized hashing functions which convert strings into "signatures" or "fingerprints". We explore the use of fingerprints in conjunction with other randomized and deterministic techniques and data structures. We present several new algorithms for dictionary matching, along with parallel algorithms which are simpler or more efficient than previously known algorithms.
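The fingerprint idea behind this line of work can be illustrated with a simple sequential Karp-Rabin-style dictionary matcher for equal-length patterns: hash every pattern once, then slide a rolling fingerprint over the text. This is a sketch of the general technique, not the paper's algorithm:

```python
def dictionary_match(text, patterns, base=256, mod=(1 << 61) - 1):
    """Fingerprint-based dictionary matching for equal-length patterns."""
    m = len(patterns[0])
    assert all(len(p) == m for p in patterns)

    def fingerprint(s):
        h = 0
        for ch in s:
            h = (h * base + ord(ch)) % mod
        return h

    targets = {fingerprint(p): p for p in patterns}
    pow_m = pow(base, m - 1, mod)  # weight of the character leaving the window

    matches = []
    h = fingerprint(text[:m])
    for i in range(len(text) - m + 1):
        p = targets.get(h)
        if p is not None and text[i:i + m] == p:  # verify to rule out collisions
            matches.append((i, p))
        if i + m < len(text):
            # roll the window one character to the right
            h = ((h - ord(text[i]) * pow_m) * base + ord(text[i + m])) % mod
    return matches

print(dictionary_match("abracadabra", ["abr", "cad"]))
# → [(0, 'abr'), (4, 'cad'), (7, 'abr')]
```

With the explicit verification step the algorithm never reports a false match, making it Las Vegas rather than Monte Carlo, at the cost of occasional extra comparisons.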
Perfect Hashing for Strings: Formalization and Algorithms
In: Proc. 7th CPM, 1996
Abstract

Cited by 10 (2 self)
Numbers and strings are two objects manipulated by most programs. Hashing has been well-studied for numbers and it has been effective in practice. In contrast, basic hashing issues for strings remain largely unexplored. In this paper, we identify and formulate the core hashing problem for strings that we call substring hashing. Our main technical results are highly efficient sequential/parallel (CRCW PRAM) Las Vegas type algorithms that determine a perfect hash function for substring hashing. For example, given a binary string of length n, one of our algorithms finds a perfect hash function in O(log n) time, O(n) work, and O(n) space; the hash value for any substring can then be computed in O(log log n) time using a single processor. Our approach relies on a novel use of the suffix tree of a string. In implementing our approach, we design optimal parallel algorithms for the problem of determining weighted ancestors on an edge-weighted tree, which may be of independent interest.
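To make the substring hashing problem concrete: after linear preprocessing of a string s, one wants the hash of any substring s[i:j] on demand. The paper achieves a *perfect* hash via suffix trees; the sketch below instead uses the simpler, randomized prefix-fingerprint technique (O(1) query, small collision probability) purely to illustrate the problem statement:

```python
class SubstringHasher:
    """Prefix fingerprints: after O(n) preprocessing, the fingerprint of
    any substring s[i:j] can be read off in O(1) time."""

    def __init__(self, s, base=256, mod=(1 << 61) - 1):
        self.base, self.mod = base, mod
        n = len(s)
        self.pref = [0] * (n + 1)  # pref[k] = fingerprint of s[:k]
        self.pw = [1] * (n + 1)    # pw[k] = base**k mod mod
        for i, ch in enumerate(s):
            self.pref[i + 1] = (self.pref[i] * base + ord(ch)) % mod
            self.pw[i + 1] = (self.pw[i] * base) % mod

    def hash(self, i, j):
        """Fingerprint of s[i:j] (half-open interval)."""
        return (self.pref[j] - self.pref[i] * self.pw[j - i]) % self.mod

h = SubstringHasher("mississippi")
assert h.hash(1, 4) == h.hash(4, 7)  # "iss" == "iss"
```

Unlike this fingerprint scheme, the paper's perfect hash function guarantees no collisions at all once constructed, which is why the Las Vegas construction step is the interesting part.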
Optimal Parallel Dictionary Matching and Compression (Extended Abstract)
In: 7th Annual ACM Symposium on Parallel Algorithms and Architectures, 1995
Abstract

Cited by 7 (3 self)
Martin Farach and S. Muthukrishnan, Rutgers University / DIMACS, April 26, 1995. Abstract: Emerging applications in multimedia and the Human Genome Project require storage and searching of large databases of strings, a task for which parallelism seems the only hope. In this paper, we consider the parallelism in some of the fundamental problems in compressing strings and in matching large dictionaries of patterns against texts. We present the first work-optimal algorithms for these well-studied problems, including the classical dictionary matching problem, optimal compression with a static dictionary, and the universal data compression with a dynamic dictionary of Lempel and Ziv. All our algorithms are randomized and they are of the Las Vegas type. Furthermore, they are fast, working in time logarithmic in the input size. Additionally, our algorithms seem suitable for a distributed implementation.
1 Introduction
Large databases of strings from multimedia applications and the Human G...
Dictionary Compression on the PRAM, 1994
Abstract

Cited by 1 (0 self)
Parallel algorithms for lossless data compression via dictionary compression using optimal, longest fragment first (LFF), and greedy parsing strategies are described. Dictionary compression removes redundancy by replacing substrings of the input by references to strings stored in a dictionary. Given a static dictionary stored as a suffix tree, we present a CREW PRAM algorithm for optimal compression which runs in O(M + log M log n) time with O(nM^2) processors, where it is assumed that M is the maximum length of any dictionary entry. Under the same model, we give an algorithm for LFF compression which runs in O(log^2 n) time with O(n / log n) processors, where it is assumed that the maximum dictionary entry is of length O(log n). We also describe an O(M + log n) time and O(n) processor algorithm for greedy parsing given a static or sliding-window dictionary. For sliding-window compression, a different approach finds the greedy parsing in O(log n) time using O(nM log M / log n) proces...
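The greedy parsing strategy the abstract contrasts with optimal and LFF parsing is simple to state sequentially: at each position, emit the longest dictionary entry that matches, falling back to a single literal character. A sequential sketch (the paper's contribution is the parallel version):

```python
def greedy_parse(text, dictionary):
    """Greedy static-dictionary parsing: longest matching entry at each
    position, or a single literal character if nothing in the dictionary
    matches."""
    entries = sorted(dictionary, key=len, reverse=True)  # try longest first
    out, i = [], 0
    while i < len(text):
        for e in entries:
            if text.startswith(e, i):
                out.append(e)
                i += len(e)
                break
        else:
            out.append(text[i])  # literal fallback
            i += 1
    return out

print(greedy_parse("ababc", ["ab", "abc", "b"]))
# → ['ab', 'abc']
```

The example also shows why greedy can be suboptimal: an optimal parser could cover "ababc" differently if a better split existed, which is the gap that the optimal and LFF strategies close.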
Practical Parallel Lempel-Ziv Factorization
Abstract
In the age of big data, the need for efficient data compression algorithms has grown. A widely used data compression method is the Lempel-Ziv-77 (LZ77) method, being a subroutine in popular compression packages such as gzip and PKZIP. There has been a lot of recent effort on developing practical sequential algorithms for Lempel-Ziv factorization (equivalent to LZ77 compression), but research in practical parallel implementations has been less satisfactory. In this work, we present a simple work-efficient parallel algorithm for Lempel-Ziv factorization. We show theoretically that our algorithm requires linear work and runs in O(log^2 n) time (randomized) for constant alphabets and O(n^ε) time (ε < 1) for integer alphabets. We present experimental results showing that our algorithm is efficient and achieves good speedup with respect to the best sequential implementations of Lempel-Ziv factorization.
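For reference, the Lempel-Ziv factorization being parallelized can be defined with a naive quadratic sequential sketch: each factor is the longest prefix of the remaining suffix that occurs earlier in the string, or a single fresh character. Practical implementations use suffix arrays instead; this is only an executable definition:

```python
def lz_factorize(s):
    """Naive LZ77 factorization: factor = (source_index, length) for the
    longest earlier occurrence, or (char, None) for a fresh literal.
    Matches may overlap the current position, as in classic LZ77."""
    factors, i, n = [], 0, len(s)
    while i < n:
        best_len, best_src = 0, -1
        for j in range(i):  # candidate earlier starting position
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_src = l, j
        if best_len == 0:
            factors.append((s[i], None))  # character not seen before
            i += 1
        else:
            factors.append((best_src, best_len))
            i += best_len
    return factors

print(lz_factorize("abababb"))
# → [('a', None), ('b', None), (0, 4), (1, 1)]
```

Note the self-overlapping factor (0, 4): it starts copying at position 0 while the copy destination begins at position 2, which is what lets LZ77 represent runs compactly.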