Results 1–10 of 18
Compressed Bloom Filters
, 2001
Abstract

Cited by 208 (10 self)
A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Although Bloom filters allow false positives, for many applications the space savings outweigh this drawback when the probability of an error is sufficiently low. We introduce compressed Bloom filters, which improve performance when the Bloom filter is passed as a message, and its transmission size is a limiting factor. For example, Bloom filters have been suggested as a means for sharing Web cache information. In this setting, proxies do not share the exact contents of their caches, but instead periodically broadcast Bloom filters representing their cache. By using compressed Bloom filters, proxies can reduce the number of bits broadcast, the false positive rate, and/or the amount of computation per lookup. The cost is the processing time for compression and decompression, which can use simple arithmetic coding, and more memory use at the proxies, which utilize the larger uncompressed form of the Bloom filter.
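The membership scheme the abstract describes can be sketched as follows. This is a minimal illustrative Bloom filter (a bit array plus k hash positions derived by double hashing), not the paper's compressed variant; the class name and parameters are hypothetical:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: m bits, k hash positions per item."""
    def __init__(self, m: int, k: int):
        self.m, self.k = m, k
        self.bits = bytearray((m + 7) // 8)

    def _positions(self, item: str):
        # Derive k positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str):
        # May answer True for items never added (a false positive),
        # but never False for an item that was added.
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

bf = BloomFilter(m=1024, k=4)
bf.add("example.com/page1")
print("example.com/page1" in bf)  # True: no false negatives
```

The false-positive rate falls as m grows relative to the number of inserted items; the compressed variant discussed above tunes m and k so that the transmitted (compressed) size, not the in-memory size, is minimized.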
Towards Compressing Web Graphs
 In Proc. of the IEEE Data Compression Conference (DCC)
, 2000
Abstract

Cited by 83 (1 self)
In this paper, we consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web.
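The abstract does not spell out the algorithms, but a common baseline that web-graph compressors build on is gap encoding of each sorted adjacency list with a variable-length byte code. The sketch below illustrates only that baseline, not the paper's model-driven schemes; all names are hypothetical:

```python
def encode_adjacency(neighbors):
    """Encode a sorted adjacency list as gaps, each gap as a varint
    (7 data bits per byte, high bit set means 'more bytes follow')."""
    out = bytearray()
    prev = 0
    for v in sorted(neighbors):
        gap = v - prev
        prev = v
        while gap >= 0x80:
            out.append(0x80 | (gap & 0x7F))
            gap >>= 7
        out.append(gap)
    return bytes(out)

def decode_adjacency(data):
    """Invert encode_adjacency: accumulate varints, then undo the gaps."""
    neighbors, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            neighbors.append(prev)
            value, shift = 0, 0
    return neighbors

blob = encode_adjacency([5, 100, 103, 5000])
print(decode_adjacency(blob))  # [5, 100, 103, 5000]
```

Because links cluster locally, the gaps are small far more often than the raw IDs are, which is what makes a short-code-for-small-numbers scheme pay off.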
The Scalable Hyperlink Store
 HT'09
, 2009
Abstract

Cited by 10 (5 self)
This paper describes the Scalable Hyperlink Store, a distributed in-memory “database” for storing large portions of the web graph. SHS is an enabler for research on structural properties of the web graph as well as new link-based ranking algorithms. Previous work on specialized hyperlink databases focused on finding efficient compression algorithms for web graphs. By contrast, this work focuses on the systems issues of building such a database. Specifically, it describes how to build a hyperlink database that is fast, scalable, fault-tolerant, and incrementally updateable.
On Compressing Permutations and Adaptive Sorting
, 2013
Abstract

Cited by 3 (3 self)
We prove that, given a permutation π over [1..n] formed of nRuns sorted blocks of sizes given by the vector R = 〈r1, ..., rnRuns〉, there exists a compressed data structure encoding π in n(1 + H(R)) = n + ∑_{i=1}^{nRuns} ri log2(n/ri) ≤ n(1 + log2 nRuns) bits while supporting access to the values of π() and π⁻¹() in time O(log nRuns / log log n) in the worst case and O(H(R) / log log n) on average, when the argument is uniformly distributed over [1..n]. This data structure can be constructed in time O(n(1 + H(R))), which yields an improved adaptive sorting algorithm. Similar results on compressed data structures for permutations and adaptive sorting algorithms are proved for other preorder measures of practical and theoretical interest.
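The connection between run decomposition and adaptive sorting can be illustrated with a classic natural mergesort: detect the nRuns maximal ascending runs, then merge them pairwise, for roughly O(n log nRuns) total work. This is a textbook sketch of the idea, not the paper's data-structure-based algorithm:

```python
from heapq import merge  # stable two-way merge of sorted iterables

def runs(a):
    """Split a into its maximal ascending runs."""
    out, start = [], 0
    for i in range(1, len(a)):
        if a[i] < a[i - 1]:
            out.append(a[start:i])
            start = i
    out.append(a[start:])
    return out

def natural_mergesort(a):
    """Sort by merging the existing runs: fewer runs, less work."""
    if not a:
        return []
    rs = runs(a)
    # Merge runs pairwise until one remains: O(log nRuns) rounds,
    # each touching every element once.
    while len(rs) > 1:
        rs = [list(merge(rs[i], rs[i + 1])) if i + 1 < len(rs) else rs[i]
              for i in range(0, len(rs), 2)]
    return rs[0]

print(natural_mergesort([2, 3, 1, 5, 4]))  # [1, 2, 3, 4, 5]
```

An already-sorted input is a single run and is returned after one pass, which is the "adaptive" behavior the entropy bound H(R) above refines.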
DACs: Bringing Direct Access to Variable-Length Codes
, 2012
Abstract

Cited by 3 (1 self)
We present a new variable-length encoding scheme for sequences of integers, Directly Addressable Codes (DACs), which enables direct access to any element of the encoded sequence without the need for any sampling method. Our proposal is a kind of implicit data structure that introduces synchronism in the encoded sequence without using asymptotically any extra space. We show some experiments demonstrating that the technique is not only simple, but also competitive in time and space with existing solutions in several applications, such as the representation of LCP arrays or high-order entropy-compressed sequences.
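A minimal sketch of the DAC idea, assuming b-bit chunks and plain prefix-sum arrays in place of the compressed rank structures a real implementation would use (class and method names are hypothetical):

```python
class DAC:
    """Sketch of Directly Addressable Codes: each integer is split into
    b-bit chunks stored level by level; a flag per position marks whether
    the value continues on the next level, and rank over those flags maps
    a position to its position on the next level."""
    def __init__(self, values, b=4):
        self.b = b
        self.chunks, self.flags, self.rank = [], [], []
        positions = list(values)
        while positions:
            level_chunks, level_flags, nxt = [], [], []
            for v in positions:
                level_chunks.append(v & ((1 << b) - 1))
                rest = v >> b
                level_flags.append(1 if rest else 0)
                if rest:
                    nxt.append(rest)
            # Prefix sums of the flags: rank[level][i] = how many of the
            # first i values on this level continue to the next level.
            pref = [0]
            for f in level_flags:
                pref.append(pref[-1] + f)
            self.chunks.append(level_chunks)
            self.flags.append(level_flags)
            self.rank.append(pref)
            positions = nxt

    def access(self, i):
        """Decode the i-th value without scanning its neighbors."""
        value, shift, level = 0, 0, 0
        while True:
            value |= self.chunks[level][i] << shift
            if not self.flags[level][i]:
                return value
            i = self.rank[level][i]   # position on the next level
            shift += self.b
            level += 1

d = DAC([5, 300, 7], b=4)
print(d.access(1))  # 300, decoded without touching its neighbors
```

This is the synchronism the abstract refers to: a plain varint stream must be scanned from the start (or sampled), whereas here rank queries jump straight to the continuation of any one value.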
An efficient two-party protocol for approximate matching in private record linkage
 In: AusDM, CRPIT
, 2011
Abstract

Cited by 1 (1 self)
The task of linking multiple databases with the aim of identifying records that refer to the same entity is occurring increasingly in many application areas. If unique identifiers for the entities are not available in all the databases to be linked, techniques that calculate approximate similarities between records must be used for the identification of matching pairs of records. Often, the records to be linked contain personal information such as names and addresses. In many applications, the exchange of attribute values that contain such personal details between organisations is not allowed due to privacy concerns. The linking of records between databases without revealing the actual attribute values in these records is the ...
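As an illustration of the kind of approximate similarity such techniques compute (this is the common Dice coefficient over character bigrams, not the paper's privacy-preserving protocol):

```python
def bigrams(s):
    """Set of overlapping two-character substrings, case-folded."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(a, b):
    """Dice coefficient over bigram sets: 2|A∩B| / (|A|+|B|).
    1.0 for identical strings, near 0 for unrelated ones."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

print(dice_similarity("night", "nacht"))  # 0.25
```

Tolerating typographical variation ("Smith" vs. "Smyth") is exactly why exact identifiers are not required; the privacy question the paper addresses is how two parties can compute such scores without exchanging the underlying names.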
On the Hardness of Finding Optimal Multiple Preset Dictionaries
Abstract

Cited by 1 (0 self)
We show that the following simple compression problem is NP-hard: given a collection of documents, find the pair of Huffman dictionaries that minimizes the total compressed size of the collection, where the best dictionary from the pair is used to compress each document. We also show the NP-hardness of finding optimal multiple preset dictionaries for LZ’77-based compression schemes. Our reductions make use of the catalog segmentation problem, a natural partitioning problem. Our results justify heuristic attacks used in practice. Index Terms—Huffman coding, LZ’77, NP-completeness, preset dictionaries, two-stage compression.
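Since the problem is NP-hard, exact solutions are only feasible for tiny inputs. The sketch below brute-forces the stated objective: try every bipartition of the documents, build one Huffman dictionary per side, and charge each document to the cheaper dictionary. It counts code bits only, ignoring the cost of transmitting the dictionaries themselves, and all names are illustrative:

```python
import heapq
from collections import Counter
from itertools import combinations

def huffman_lengths(freq):
    """Code length per symbol for a Huffman code built from freq."""
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    heap = [(f, [s]) for s, f in freq.items()]
    lengths = {s: 0 for s in freq}
    heapq.heapify(heap)
    while len(heap) > 1:
        fa, sa = heapq.heappop(heap)
        fb, sb = heapq.heappop(heap)
        for s in sa + sb:      # every merge deepens these symbols by 1
            lengths[s] += 1
        heapq.heappush(heap, (fa + fb, sa + sb))
    return lengths

def compressed_size(doc, lengths):
    counts = Counter(doc)
    if any(s not in lengths for s in counts):
        return float("inf")   # dictionary cannot encode this document
    return sum(c * lengths[s] for s, c in counts.items())

def best_dictionary_pair(docs):
    """Exhaustive search over bipartitions (exponential, tiny inputs only)."""
    best = (float("inf"), None)
    n = len(docs)
    for r in range(1, n):
        for group in combinations(range(n), r):
            a = [i for i in range(n) if i in group]
            b = [i for i in range(n) if i not in group]
            la = huffman_lengths(Counter("".join(docs[i] for i in a)))
            lb = huffman_lengths(Counter("".join(docs[i] for i in b)))
            total = sum(min(compressed_size(d, la), compressed_size(d, lb))
                        for d in docs)
            if total < best[0]:
                best = (total, (a, b))
    return best

print(best_dictionary_pair(["abababab", "cdcdcdcd"]))  # (16, ([0], [1]))
```

The NP-hardness result says no algorithm is known to do fundamentally better than such exhaustive search in the worst case, which is what justifies the heuristic clustering used in practice.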
Distributions in text
, 2005
Abstract
The frequency of words and other linguistic units plays a central role in all branches of corpus linguistics. Indeed, the use of frequency information distinguishes corpus-based methodology from other approaches to language. Thus, not surprisingly, the distribution of frequencies of words and combinations of ...
New Algorithms on Wavelet Trees and Applications to Information Retrieval
, 2011
Abstract
Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular, we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
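A range quantile query of the kind mentioned above (the k-th smallest value in seq[l:r]) can be sketched with a simple pointer-based wavelet tree. Real implementations use bitmaps with o(n)-space rank support rather than the explicit child sequences and prefix-sum arrays used here:

```python
class WaveletTree:
    """Sketch of a wavelet tree over integers supporting range quantile:
    each node halves the alphabet and records, per prefix, how many
    symbols route to the left child."""
    def __init__(self, seq, lo=None, hi=None):
        self.lo = min(seq) if lo is None else lo
        self.hi = max(seq) if hi is None else hi
        if self.lo == self.hi or not seq:
            self.left = self.right = None
            return
        mid = (self.lo + self.hi) // 2
        # prefix[i] = how many of the first i symbols go left (<= mid).
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], self.lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, self.hi)

    def quantile(self, l, r, k):
        """k-th smallest (1-based) value in the original seq[l:r]."""
        if self.lo == self.hi:
            return self.lo
        in_left = self.prefix[r] - self.prefix[l]
        if k <= in_left:   # answer lies in the lower half of the alphabet
            return self.left.quantile(self.prefix[l], self.prefix[r], k)
        # Otherwise recurse right, re-mapping the interval via rank.
        return self.right.quantile(l - self.prefix[l], r - self.prefix[r],
                                   k - in_left)

wt = WaveletTree([3, 1, 4, 1, 5, 9, 2, 6])
print(wt.quantile(2, 7, 3))  # 4: 3rd smallest of [4, 1, 5, 9, 2]
```

Each query walks one root-to-leaf path, so it costs O(log σ) for an alphabet of size σ, with no per-query sorting of the interval.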