SILT: A Memory-Efficient, High-Performance Key-Value Store
In Proc. 23rd ACM SOSP, Cascais, 2011
Cited by 20 (7 self)
SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores, each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
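To put the per-entry figures in this abstract into perspective, the short calculation below scales them to a whole store. It is only back-of-the-envelope arithmetic: the function names and the one-billion-entry store size are illustrative assumptions, not part of SILT.

```python
def index_dram_bytes(num_entries, bytes_per_entry=0.7):
    """DRAM needed for the index alone, using the 0.7 bytes/entry figure."""
    return num_entries * bytes_per_entry


def expected_flash_reads(num_lookups, reads_per_lookup=1.01):
    """Average number of flash reads for a batch of lookups (1.01 per lookup)."""
    return num_lookups * reads_per_lookup


# Hypothetical store size chosen for illustration: one billion entries.
entries = 10**9
dram_gb = index_dram_bytes(entries) / 10**9   # decimal gigabytes
reads = expected_flash_reads(1_000_000)       # reads for one million lookups
print(f"index DRAM: about {dram_gb:.1f} GB; flash reads: about {reads:,.0f}")
```

Under these assumptions the index for a billion entries needs roughly 0.7 GB of DRAM, and a million lookups cost roughly ten thousand flash reads more than the ideal of one read each.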
Alphabet-independent compressed text indexing
In ESA, 2011
Cited by 14 (11 self)
Self-indexes can represent a text in asymptotically optimal space under the k-th order entropy model, give access to text substrings, and support indexed pattern searches. Their time complexities are not optimal, however: they always depend on the alphabet size. In this paper we achieve, for the first time, full alphabet-independence in the time complexities of self-indexes, while retaining space optimality. We also obtain some relevant byproducts on compressed suffix trees.
Theory and Practise of Monotone Minimal Perfect Hashing
Cited by 12 (6 self)
Minimal perfect hash functions have been shown to be useful to compress data in several data management tasks. In particular, order-preserving minimal perfect hash functions [12] have been used to retrieve the position of a key in a given list of keys: however, the ability to preserve any given order leads to an unavoidable Ω(n log n) lower bound on the number of bits required to store the function. Recently, it was observed [1] that very frequently the keys to be hashed are sorted in their intrinsic (i.e., lexicographical) order. This is typically the case of dictionaries of search engines, lists of URLs of web graphs, etc. We refer to this restricted version of the problem as monotone minimal perfect hashing. We analyse experimentally the data structures proposed in [1], and along our way we propose some new methods that, albeit asymptotically equivalent or worse, perform very well in practise, and provide a balance between access speed, ease of construction, and space usage.
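The contract of a monotone minimal perfect hash function can be illustrated with a deliberately naive sketch: given n distinct keys in lexicographic order, it maps each key to its rank in 0..n-1. The helper name `build_monotone_mph` and the sample URLs are hypothetical. Unlike the structures studied in the paper, this version keeps all keys in memory, so it shows only the interface, not the space savings.

```python
from bisect import bisect_left


def build_monotone_mph(sorted_keys):
    """Return h such that h(key) is the 0-based rank of key in sorted_keys.

    Illustration only: the keys are stored, which a real monotone minimal
    perfect hash function avoids.
    """
    keys = list(sorted_keys)
    if any(a >= b for a, b in zip(keys, keys[1:])):
        raise ValueError("keys must be distinct and in increasing order")

    def h(key):
        # The result is only meaningful for keys that belong to the set;
        # for other keys the return value is arbitrary, as with any
        # perfect hash function.
        return bisect_left(keys, key)

    return h


# Hypothetical sample keys, already in lexicographic order.
urls = ["a.example/", "b.example/", "b.example/x", "c.example/"]
h = build_monotone_mph(urls)
ranks = [h(u) for u in urls]
```

Because the hash values follow the key order, `ranks` is simply 0, 1, 2, 3: the function is both minimal (it uses exactly n values) and monotone.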
Improved compressed indexes for full-text document retrieval
In Proc. 18th SPIRE, 2011
Cited by 10 (7 self)
We give new space/time tradeoffs for compressed indexes that answer document retrieval queries on general sequences. On a collection of D documents of total length n, current approaches require at least |CSA| + O(n lg D / lg lg D) or 2|CSA| + o(n) bits of space, where CSA is a full-text index. Using monotone minimum perfect hash functions, we give new algorithms for document listing with frequencies and top-k document retrieval using just |CSA| + O(n lg lg lg D) bits. We also improve current solutions that use 2|CSA| + o(n) bits, and consider other problems such as colored range listing, top-k most important documents, and computing arbitrary frequencies.
New lower and upper bounds for representing sequences
CoRR
Cited by 9 (8 self)
Sequence representations supporting queries access, select and rank are at the core of many data structures. There is a considerable gap between different upper bounds, and the few lower bounds, known for such representations, and how they interact with the space used. In this article we prove a strong lower bound for rank, which holds for rather permissive assumptions on the space used, and give matching upper bounds that require only a compressed representation of the sequence. Within this compressed space, operations access and select can be solved within almost-constant time.
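For readers unfamiliar with the three queries named in this abstract, the sketch below gives linear-time reference definitions over a plain Python string. The indexing conventions (0-based positions, rank counted over the prefix before position i, select counted from the first occurrence) are assumptions made for this illustration; the paper's compressed structures answer the same queries far faster and in far less space.

```python
def access(seq, i):
    """Symbol stored at position i (0-based)."""
    return seq[i]


def rank(seq, c, i):
    """Number of occurrences of symbol c in the prefix seq[0:i]."""
    return seq[:i].count(c)


def select(seq, c, j):
    """Position of the j-th occurrence of c (j >= 1), or -1 if there is none."""
    seen = 0
    for pos, sym in enumerate(seq):
        if sym == c:
            seen += 1
            if seen == j:
                return pos
    return -1


text = "abracadabra"
print(access(text, 4), rank(text, "a", 5), select(text, "a", 3))
```

With these conventions rank and select are inverses in the sense that `rank(text, c, select(text, c, j))` equals `j - 1` whenever the j-th occurrence exists.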
Encodings for Range Selection and Top-k Queries
Cited by 3 (3 self)
We study the problem of encoding the positions of the top-k elements of an array A[1..n] for a given parameter 1 ≤ k ≤ n. Specifically, for any i and j, we wish to create a data structure that reports the positions of the largest k elements in A[i..j] in decreasing order, without accessing A at query time. This is a natural extension of the well-known encoding range-maxima query problem, where only the position of the maximum in A[i..j] is sought, and finds applications in document retrieval and ranking. We give (sometimes tight) upper and lower bounds for this problem and some variants thereof.
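The query described in this abstract can be pinned down with a naive reference implementation. Note the caveat: this sketch reads A at query time, which is exactly what the encodings in the paper avoid, so it specifies only the expected answers. The function name, the 0-based inclusive range, and the tie-breaking rule (smaller position first) are assumptions made for illustration.

```python
def top_k_positions(A, i, j, k):
    """Positions of the k largest elements of A[i..j] (inclusive, 0-based),
    listed in decreasing order of value.

    Ties are broken by smaller position first (an assumption of this sketch).
    """
    if not (0 <= i <= j < len(A)):
        raise IndexError("range out of bounds")
    window = range(i, j + 1)
    return sorted(window, key=lambda p: (-A[p], p))[:k]


A = [3, 9, 2, 7, 5, 8]
top2 = top_k_positions(A, 1, 4, 2)   # values in range: 9, 2, 7, 5
top1 = top_k_positions(A, 2, 5, 1)   # values in range: 2, 7, 5, 8
```

With k = 1 the query reduces to the range-maxima problem mentioned in the abstract: `top1` holds the single position of the maximum in the range.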
Distributed indexing for semantic search
In: SEMSEARCH ’10 Proceedings of the 3rd International Semantic Search Workshop
Cited by 2 (1 self)
In this paper we describe the process of building indices for semantic search using MapReduce. We compare the two most straightforward representations of RDF data, the horizontal index structure using parallel indices and the vertical index structure using fields. We measure the cost of building indices and also compare retrieval performance on keyword queries and queries restricted to particular properties.
Storing a Compressed Function with Constant Time Access
Cited by 2 (0 self)
We consider the problem of representing, in a space-efficient way, a function f: S → Σ such that any function value can be computed in constant time on a RAM. Specifically, our aim is to achieve space usage close to the 0th order entropy of the sequence of function values. Our technique works for any set S of machine words, without storing S, which is crucial for applications. Our contribution consists of two new techniques, of independent interest, that we use in combination with an existing result of Dietzfelbinger and Pagh (ICALP 2008). First of all, we introduce a way to support more space-efficient approximate membership queries (Bloom filter functionality) with arbitrary false positive rate. Second, we present a variation of Huffman coding using approximate membership, providing an alternative that improves the classical bounds of Gallager (IEEE Trans. Information Theory, 1978) in some cases. The end result is an entropy-compressed function supporting constant time random access to values associated with a given set S. This improves both space and time compared to a recent result by Talbot and Talbot (ANALCO 2008).
Spaces, trees and colors: The algorithmic landscape of document retrieval on sequences
CoRR
Cited by 2 (2 self)
Document retrieval is one of the best established information retrieval activities since the sixties, pervading all search engines. Its aim is to obtain, from a collection of text documents, those most relevant to a pattern query. Current technology is mostly oriented to “natural language” text collections, where inverted indexes are the preferred solution. As successful as this paradigm has been, it fails to properly handle various East Asian languages and other scenarios where the “natural language” assumptions do not hold. In this survey we cover the recent research in extending the document retrieval techniques to a broader class of sequence collections, which has applications in bioinformatics, data and Web mining, chemoinformatics, software engineering, multimedia information retrieval, and many other fields. We focus on the algorithmic aspects of the techniques, uncovering a rich world of relations between document retrieval challenges and fundamental problems on trees, strings, range queries, discrete geometry, and other areas.
Practical Batch-Updatable External Hashing with Sorting
Cited by 2 (1 self)
This paper presents a practical external hashing scheme that supports fast lookup (7 microseconds) for large datasets (millions to billions of items) with a small memory footprint (2.5 bits/item) and fast index construction (151 K items/s for 1 KiB key-value pairs). Our scheme combines three key techniques: (1) a new index data structure (Entropy-Coded Tries); (2) the use of sorting as the main data manipulation method; and (3) support for incremental index construction for dynamic datasets. We evaluate our scheme by building an external dictionary on flash-based drives and demonstrate our scheme’s high performance, compactness, and practicality.