Results 1 - 10 of 10
Theory and Practise of Monotone Minimal Perfect Hashing
Cited by 9 (5 self)
Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, order-preserving minimal perfect hash functions [12] have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed [1] that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, list of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in [1], and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practise, and provide a balance between access speed, ease of construction, and space usage.
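Illustrative note (not from the paper): the sketch below only shows what a monotone minimal perfect hash function computes, namely the rank of each key in its sorted list, and where the n log n figure for arbitrary orders comes from (one of n! orders needs log2(n!) bits). The names build_monotone_mph and arbitrary_order_lower_bound_bits are hypothetical, and the binary search stands in for the compact data structures the abstract refers to; it stores the keys themselves, which those structures avoid.

```python
import math
from bisect import bisect_left


def build_monotone_mph(sorted_keys):
    """Reference (NOT space-efficient) monotone minimal perfect hash.

    Maps each key of a strictly increasing list to its rank 0..n-1.
    The result for keys outside the list is unspecified, as is usual
    for perfect hash functions.
    """
    keys = list(sorted_keys)
    if any(a >= b for a, b in zip(keys, keys[1:])):
        raise ValueError("keys must be strictly increasing")

    def h(key):
        # Assumption: plain binary search over the stored keys.
        return bisect_left(keys, key)

    return h


def arbitrary_order_lower_bound_bits(n):
    """log2(n!) bits: enough to single out one of the n! possible orders."""
    return math.lgamma(n + 1) / math.log(2)


# Hypothetical example keys, already in lexicographical order.
urls = ["a.example/", "a.example/x", "b.example/", "c.example/y"]
h = build_monotone_mph(urls)
ranks = [h(u) for u in urls]
```

Because the keys are hashed in their intrinsic order, the function does not have to encode an arbitrary permutation, which is why the monotone variant can sidestep the log2(n!) cost.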
SILT: A Memory-Efficient, High-Performance Key-Value Store
- In Proc. 23rd ACM SOSP, Cascais, 2011
Cited by 8 (4 self)
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
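Illustrative note (not from the paper): a back-of-the-envelope calculation using only the two figures quoted in the abstract, 0.7 bytes of DRAM per entry and 1.01 flash reads per lookup. The function names and the one-billion-entry workload are assumptions chosen for the example.

```python
def index_dram_bytes(num_entries, bytes_per_entry=0.7):
    """DRAM needed for the in-memory index at the quoted per-entry cost."""
    return num_entries * bytes_per_entry


def expected_flash_reads(num_lookups, reads_per_lookup=1.01):
    """Expected number of flash reads for a batch of lookups."""
    return num_lookups * reads_per_lookup


# Hypothetical workload: one billion stored items, one million lookups.
entries = 10**9
dram_gb = index_dram_bytes(entries) / 10**9  # decimal gigabytes
reads = expected_flash_reads(10**6)
```

At the quoted rates, a billion entries need well under one gigabyte of DRAM for the index, and lookups cost only about one percent more flash reads than the one-read-per-lookup minimum.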
Alphabet-independent compressed text indexing
- In ESA, 2011
Cited by 3 (3 self)
Self-indexes can represent a text in asymptotically optimal space under the k-th order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabet-independence in the time complexities of self-indexes, while retaining space optimality. We obtain also some relevant byproducts on compressed suffix trees.
Improved compressed indexes for full-text document retrieval
- In Proc. 18th SPIRE, 2011
Cited by 3 (2 self)
We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies. 1 Introduction and Related Work Full-text document retrieval is the problem of, given a collection of D documents (i.e., general sequences over alphabet [1, σ]), concatenated into a text T[1, n],
Storing a Compressed Function with Constant Time Access
Cited by 2 (0 self)
We consider the problem of representing, in a space-efficient way, a function f: S → Σ such that any function value can be computed in constant time on a RAM. Specifically, our aim is to achieve space usage close to the 0th order entropy of the sequence of function values. Our technique works for any set S of machine words, without storing S, which is crucial for applications. Our contribution consists of two new techniques, of independent interest, that we use in combination with an existing result of Dietzfelbinger and Pagh (ICALP 2008). First of all, we introduce a way to support more space efficient approximate membership queries (Bloom filter functionality) with arbitrary false positive rate. Second, we present a variation of Huffman coding using approximate membership, providing an alternative that improves the classical bounds of Gallager (IEEE Trans. Information Theory, 1978) in some cases. The end result is an entropy-compressed function supporting constant time random access to values associated with a given set S. This improves both space and time compared to a recent result by Talbot and Talbot (ANALCO 2008).
Theory and Practice of Monotone Minimal Perfect Hashing
supported by the MIUR PRIN projects “Mathematical aspects and forthcoming applications of automata and formal languages” and “Grafi del web e ranking”, and by a Yahoo! Faculty Grant. Permission to make digital/hard copy of all or part of this material without fee for personal or classroom use provided that the copies are not made or distributed for profit or commercial advantage, the ACM copyright/server notice, the title of the publication, and its date appear, and
Order-Preserving Encryption Revisited: Improved Security Analysis and Alternative Solutions
We further the study of order-preserving symmetric encryption (OPE), a primitive for allowing efficient range queries on encrypted data, recently initiated (from a cryptographic perspective) by Boldyreva et al. (Eurocrypt ’09). First, we address the open problem of characterizing what encryption via a random order-preserving function (ROPF) leaks about underlying data (ROPF being the “ideal object” in the security definition, POPF, satisfied by their scheme.) In particular, we show that, for a database of randomly distributed plaintexts and appropriate choice of parameters, ROPF encryption leaks neither the precise value of any plaintext nor the precise distance between any two of them. The analysis here introduces useful new techniques. On the other hand, we show that ROPF encryption leaks approximate value of any plaintext as well as approximate distance between any two plaintexts, each to an accuracy of about square root of the domain size. We then study schemes that are not order-preserving, but which nevertheless allow efficient range queries and achieve security notions stronger than POPF. In a setting where the entire database is known in advance of key-generation (considered in several prior works), we show that recent constructions of “monotone minimal perfect hash functions” allow to efficiently achieve (an adaptation of) the notion
Theory and Practice of Monotone Minimal Perfect Hashing
- Djamal Belazzougui
Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, order-preserving minimal perfect hash functions [12] have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed [1] that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, list of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in [1], and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practice, and provide a balance between access speed, ease of construction, and space usage.
Theory General Terms
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
Practical Batch-Updatable External Hashing with Sorting
This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast index construction (151 K items/s for 1-KiB key-value pairs). Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction for dynamic datasets. We evaluate our scheme by building an external dictionary on flash-based drives and demonstrate our scheme’s high performance, compactness, and practicality.

