Results 1-10 of 17
Succinct Data Structures for Retrieval and Approximate Membership
Cited by 14 (6 self)
Abstract. The retrieval problem is the problem of associating data with keys in a set. Formally, the data structure must store a function f: U → {0,1}^r that has specified values on the elements of a given set S ⊆ U, |S| = n, but may have any value on elements outside S. All known methods (e.g., those based on perfect hash functions) induce a space overhead of Θ(n) bits over the optimum, regardless of the evaluation time. We show that for any k, query time O(k) can be achieved using space that is within a factor 1 + e^{-k} of optimal, asymptotically for large n. The time to construct the data structure is O(n), expected. If we allow logarithmic evaluation time, the additive overhead can be reduced to O(log log n) bits whp. A general reduction transfers the results on retrieval into analogous results on approximate membership, a problem traditionally addressed using Bloom filters. Thus we obtain space bounds arbitrarily close to the lower bound for this problem as well. The evaluation procedures of our data structures are extremely simple. For the results stated above we assume free access to fully random hash functions. This assumption can be justified using space o(n) to simulate full randomness on a RAM.
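One standard way to turn a retrieval structure into an approximate membership structure is to store an r-bit fingerprint of each key and accept a query when the stored value matches the query's fingerprint. The Python sketch below illustrates only that idea and is not the paper's construction: `ToyRetrieval` is a hypothetical, non-succinct stand-in (a plain dict) that mimics the retrieval interface by returning an arbitrary r-bit value for keys outside S, and the SHA-256-based helper `h` is an assumed substitute for a fully random hash function.

```python
import hashlib


def h(x: str, salt: str, bits: int) -> int:
    """Hypothetical hash: the top `bits` bits (bits <= 64) of SHA-256(salt:x)."""
    digest = hashlib.sha256((salt + ":" + x).encode()).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


class ToyRetrieval:
    """Stand-in for a retrieval structure: exact on S, arbitrary elsewhere.

    NOT succinct. A dict is used only to reproduce the input/output
    behaviour that a real retrieval structure would have.
    """

    def __init__(self, mapping, r):
        self._table = dict(mapping)
        self._r = r

    def query(self, x):
        if x in self._table:
            return self._table[x]
        # Outside S the answer may be anything; return some r-bit value.
        return h(x, "outside", self._r)


class ApproxMembership:
    """Approximate membership built on top of a retrieval structure."""

    def __init__(self, keys, r):
        self._r = r
        # Associate each key with its own r-bit fingerprint.
        self._f = ToyRetrieval({x: h(x, "fp", r) for x in keys}, r)

    def __contains__(self, x):
        # Members always match; a non-member matches with probability
        # about 2^-r if the hash values behave like random bits.
        return self._f.query(x) == h(x, "fp", self._r)


keys = [f"key{i}" for i in range(1000)]
amq = ApproxMembership(keys, r=16)
false_positives = sum(f"other{i}" in amq for i in range(10000))
```

With r fingerprint bits there are no false negatives, and the false positive rate is roughly 2^-r, so the space of the membership structure is governed by the space of the underlying retrieval structure for r-bit values.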
Stream-based randomised language models for SMT
In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009
Cited by 13 (1 self)
Randomised techniques allow very big language models to be represented succinctly. However, being batch-based they are unsuitable for modelling an unbounded stream of language whilst maintaining a constant error rate. We present a novel randomised language model which uses an online perfect hash function to efficiently deal with unbounded text streams. Translation experiments over a text stream show that our online randomised model matches the performance of batch-based LMs without incurring the computational overhead associated with full retraining. This opens up the possibility of randomised language models which continuously adapt to the massive volumes of texts published on the Web each day.
Counting Inversions, Offline Orthogonal Range Counting, and Related Problems
Cited by 9 (3 self)
We give an O(n √(lg n))-time algorithm for counting the number of inversions in a permutation on n elements. This improves a long-standing previous bound of O(n lg n / lg lg n) that followed from Dietz’s data structure [WADS’89], and answers a question of Andersson and Petersson [SODA’95]. As Dietz’s result is known to be optimal for the related dynamic rank problem, our result demonstrates a significant improvement in the offline setting. Our new technique is quite simple: we perform a “vertical partitioning” of a trie (akin to van Emde Boas trees), and use ideas from external memory. However, the technique finds numerous applications: for example, we obtain
• in d dimensions, an algorithm to answer n offline orthogonal range counting queries in time O(n lg^{d-2+1/d} n);
• an improved construction time for online data structures for orthogonal range counting;
• an improved update time for the partial sums problem;
• faster Word RAM algorithms for finding the maximum depth in an arrangement of axis-aligned rectangles, and for the slope selection problem.
As a bonus, we also give a simple (1 + ε)-approximation algorithm for counting inversions that runs in linear time, improving the previous O(n lg lg n) bound by Andersson and Petersson.
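For readers unfamiliar with the problem, an inversion is a pair of positions i < j with a[i] > a[j]. The sketch below is the textbook merge-sort baseline, which runs in O(n lg n) time; it is included only to make the problem concrete and is not the faster algorithm described in the abstract. The function name `count_inversions` is an illustrative choice.

```python
def count_inversions(seq):
    """Count pairs i < j with seq[i] > seq[j] using merge sort (O(n lg n))."""
    a = list(seq)  # work on a copy so the caller's sequence is untouched

    def sort(lo, hi):
        """Sort a[lo:hi] in place and return the inversions inside it."""
        if hi - lo <= 1:
            return 0
        mid = (lo + hi) // 2
        inv = sort(lo, mid) + sort(mid, hi)
        merged = []
        i, j = lo, mid
        while i < mid and j < hi:
            if a[i] <= a[j]:
                merged.append(a[i])
                i += 1
            else:
                # a[j] is smaller than every remaining left element,
                # so it forms an inversion with each of them.
                merged.append(a[j])
                j += 1
                inv += mid - i
        merged.extend(a[i:mid])
        merged.extend(a[j:hi])
        a[lo:hi] = merged
        return inv

    return sort(0, len(a))


example = count_inversions([2, 4, 1, 3, 5])  # pairs (2,1), (4,1), (4,3)
```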
The limits of buffering: A tight lower bound for dynamic membership in the external memory model
In Proc. ACM Symposium on Theory of Computing, 2010
Cited by 4 (2 self)
We study the dynamic membership (or dynamic dictionary) problem, which is one of the most fundamental problems in data structures. We study the problem in the external memory model with cell size b bits and cache size m bits. We prove that if the amortized cost of updates is at most 0.999 (or any other constant < 1), then the query cost must be Ω(log_{b log n}(n/m)), where n is the number of elements in the dictionary. In contrast, when the update time is allowed to be 1 + o(1), then a bit vector or hash table gives query time O(1). Thus, this is a threshold phenomenon for data structures. This lower bound answers a folklore conjecture of the external memory community. Since almost any data structure task can solve membership, our lower bound implies a dichotomy between two alternatives: (i) make the amortized update time at least 1 (so the data structure does not buffer, and we lose one of the main potential advantages of the cache), or (ii) make the query time at least roughly logarithmic in n. Our result holds even when the updates and queries are chosen uniformly at random and there are no deletions; it holds for randomized data structures, holds when the universe size is O(n), and does not make any restrictive assumptions such as indivisibility. All of the lower bounds we prove hold regardless of the space consumption of the data structure, while the upper bounds only need linear space. The lower bound has some striking implications for external memory data structures. It shows that the query complexities of many problems such as 1D-range counting, predecessor, rank-select, and many others, are all the same
On dynamic bit-probe complexity
2005
Cited by 2 (0 self)
This work presents several advances in the understanding of dynamic data structures in the bit-probe model:
• We improve the lower bound record for dynamic language membership problems to Ω((lg n / lg lg n)^2). Surpassing Ω(lg n) was listed as the first open problem in a survey by Miltersen.
• We prove a bound of Ω(lg n / lg lg lg n) for maintaining partial sums in Z/2Z. Previously, the known bounds were Ω(lg n / lg lg n) and O(lg n).
• We prove a surprising and tight upper bound of O(lg n / lg lg n) for the greater-than problem, and several predecessor-type problems. We use this to obtain the same upper bound for dynamic word and prefix problems in group-free monoids.
We also obtain new lower bounds for the partial-sums problem in the cell-probe and external-memory models. Our lower bounds are based on a surprising improvement of the classic chronogram technique of Fredman and Saks [1989], which makes it possible to prove logarithmic lower bounds by this approach. Before the work of Pǎtraşcu and Demaine [2004], this was the only known technique for dynamic lower bounds, and surpassing Ω(lg n / lg lg n) was a central open problem in cell-probe complexity.
Lower Bound Techniques for Data Structures
2008
Cited by 1 (0 self)
We describe new techniques for proving lower bounds on data-structure problems, with the following broad consequences:
• the first Ω(lg n) lower bound for any dynamic problem, improving on a bound that had been standing since 1989;
• for static data structures, the first separation between linear and polynomial space. Specifically, for some problems that have constant query time when polynomial space is allowed, we can show Ω(lg n / lg lg n) bounds when the space is O(n · polylog n).
Using these techniques, we analyze a variety of central data-structure problems, and obtain improved lower bounds for the following:
• the partial-sums problem (a fundamental application of augmented binary search trees);
• the predecessor problem (which is equivalent to IP lookup in Internet routers);
• dynamic trees and dynamic connectivity;
• orthogonal range stabbing;
• orthogonal range counting, and orthogonal range reporting;
• the partial match problem (searching with wildcards);
• (1 + ε)-approximate near neighbor on the hypercube;
• approximate nearest neighbor in the ℓ∞ metric.
Our new techniques lead to surprisingly nontechnical proofs. For several problems, we obtain simpler proofs for bounds that were already known.
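To make the partial-sums problem from the list above concrete: the task is to maintain an array under point updates while answering prefix-sum queries. The sketch below uses a Fenwick (binary indexed) tree, a common alternative to the augmented binary search trees the abstract mentions, chosen here only for brevity. It takes O(lg n) time per operation; the class and method names are illustrative assumptions.

```python
class FenwickTree:
    """Partial sums over an array A[0..n-1], initially all zero."""

    def __init__(self, n):
        self._n = n
        self._tree = [0] * (n + 1)  # 1-based internal layout

    def update(self, i, delta):
        """Add `delta` to A[i]."""
        i += 1
        while i <= self._n:
            self._tree[i] += delta
            i += i & -i  # move to the next node covering index i

    def prefix_sum(self, i):
        """Return A[0] + ... + A[i] (inclusive)."""
        i += 1
        total = 0
        while i > 0:
            total += self._tree[i]
            i -= i & -i  # strip the lowest set bit
        return total


sums = FenwickTree(8)
sums.update(0, 5)
sums.update(3, 2)
sums.update(7, -1)
```

Both operations touch O(lg n) cells, which is the upper-bound side that the Ω(lg n) dynamic lower bound in the abstract is measured against.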
Dynamic Indexability: The Query-Update Tradeoff for One-Dimensional Range Queries
, 811
The B-tree is a fundamental secondary index structure that is widely used for answering one-dimensional range reporting queries. Given a set of N keys, a range query can be answered in O(log_B(N/M) + K/B) I/Os, where B is the disk block size, K the output size, and M the size of the main memory buffer. When keys are inserted or deleted, the B-tree is updated in O(log_B N) I/Os, if we require the resulting changes to be committed to disk right away. Otherwise, the memory buffer can be used to buffer the recent updates, and changes can be written to disk in batches, which significantly lowers the amortized update cost. A systematic way of batching up updates is to use the logarithmic method, combined with fractional cascading, resulting in a dynamic B-tree that supports insertions in O ( 1
Computational Geometry through the Information Lens
2007
revisits classic problems in computational geometry from the modern algorithmic
Google Research Award Proposal: Data Structures
Data structures are essential components of computer systems in general and Google in particular. We believe this area of research is in an auspicious position where practical and theoretical goals are well aligned, implying that deep algorithmic ideas can also have significant practical impact. We illustrate with a few examples from our past research, which address problems of universal value, and should have important applications in real systems.
Cache-oblivious B-trees: B-trees are a fundamental tool for representing large sets of data in external memory. But what is “external memory”? Modern computers have complicated memory hierarchies, including L1 cache, L2 cache, main memory, disk, and often network storage. Even if one decides to concentrate on one level of the hierarchy, choosing the optimal branching factor involves nontrivial tuning. A surprising, clean alternative is to design a B-tree which works in the optimal O(log_B n) time without knowing the memory block size B! Then the B-tree will work optimally on all levels of the memory hierarchy simultaneously. Our initial paper [BDFC05] showing that this is possible has been very influential in the further study of cache-obliviousness.
Bloomier filters: Suppose we want to represent a set S of items, and answer queries of the form
How to Approximate A Set Without Knowing Its Size In Advance
The dynamic approximate membership problem asks to represent a set S of size n, whose elements are provided in an online fashion, supporting membership queries without false negatives and with a false positive rate at most ϵ. That is, the membership algorithm must be correct on each x ∈ S, and may err with probability at most ϵ on each x ∉ S. We study a well-motivated, yet insufficiently explored, variant of this problem where the size n of the set is not known in advance. Existing optimal approximate membership data structures require that the size is known in advance, but in many practical scenarios this is not a realistic assumption. Moreover, even if the eventual size n of the set is known in advance, it is desirable to have the smallest possible space usage also when the current number of inserted elements is smaller than n. Our contribution consists of the following results: • We show a superlinear gap between the space complexity when the size is known in advance and the space complexity when the size is not known in advance. When the size is known in advance, it is well known that Θ(n log(1/ϵ)) bits of space are necessary and sufficient (Bloom ’70, Carter et al. ’78). However, when the size is not known in advance, we prove