Compressed Bloom Filters
, 2001
Cited by 162 (10 self)
A Bloom filter is a simple space-efficient randomized data structure for representing a set in order to support membership queries. Although Bloom filters allow false positives, for many applications the space savings outweigh this drawback when the probability of an error is sufficiently low. We introduce compressed Bloom filters, which improve performance when the Bloom filter is passed as a message, and its transmission size is a limiting factor. For example, Bloom filters have been suggested as a means for sharing Web cache information. In this setting, proxies do not share the exact contents of their caches, but instead periodically broadcast Bloom filters representing their cache. By using compressed Bloom filters, proxies can reduce the number of bits broadcast, the false positive rate, and/or the amount of computation per lookup. The cost is the processing time for compression and decompression, which can use simple arithmetic coding, and more memory use at the proxies, which utilize the larger uncompressed form of the Bloom filter.
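The trade-off described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's method: zlib stands in for the arithmetic coding mentioned above, and the class name, the SHA-256 double hashing, and the parameters num_bits and num_hashes are all assumptions made for the example. The idea shown is that a large, sparse filter is kept in memory while only its compressed form is sent as a message.

```python
import hashlib
import zlib


class BloomFilter:
    """Toy Bloom filter whose bit array can be shipped in compressed form."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing (an assumption of this sketch): derive all
        # bit positions from two 64-bit slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return True for an item never added (false positive),
        # but never returns False for an item that was added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )

    def to_message(self):
        # zlib is a stand-in for arithmetic coding; a sparse bit
        # array compresses well with either.
        return zlib.compress(bytes(self.bits), 9)

    @classmethod
    def from_message(cls, message, num_bits, num_hashes):
        bf = cls(num_bits, num_hashes)
        bf.bits = bytearray(zlib.decompress(message))
        return bf
```

A sender would call to_message() on its filter and broadcast the result; a receiver would rebuild the full-size filter with from_message() and answer lookups against the uncompressed bits, which is where the extra memory cost noted in the abstract comes from.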
Towards Compressing Web Graphs
In Proc. of the IEEE Data Compression Conference (DCC)
, 2000
Cited by 68 (1 self)
In this paper, we consider the problem of compressing graphs of the link structure of the World Wide Web. We provide efficient algorithms for such compression that are motivated by recently proposed random graph models for describing the Web.
On the Hardness of Finding Optimal Multiple Preset Dictionaries
Cited by 1 (0 self)
Abstract—We show that the following simple compression problem is NP-hard: given a collection of documents, find the pair of Huffman dictionaries that minimizes the total compressed size of the collection, where the best dictionary from the pair is used to compress each document. We also show the NP-hardness of finding optimal multiple preset dictionaries for LZ’77-based compression schemes. Our reductions make use of the catalog segmentation problem, a natural partitioning problem. Our results justify heuristic attacks used in practice. Index Terms—Huffman coding, LZ’77, NP-completeness, preset dictionaries, two-stage compression.
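To make the objective in this abstract concrete, here is a small brute-force sketch in Python. It rests on assumptions that are not in the abstract: symbols are single characters, the cost of storing the dictionaries themselves is ignored, and the function names are invented for illustration. Under those assumptions, choosing the best pair of Huffman dictionaries amounts to choosing the best split of the documents into two groups, since each group is served best by the Huffman code built from its own pooled symbol counts. The search tries every split, so its running time is exponential in the number of documents.

```python
import heapq
from collections import Counter


def huffman_cost(freq):
    """Total bits needed to encode symbols with these counts using Huffman coding."""
    if len(freq) == 1:
        # A lone symbol still needs one bit per occurrence.
        return sum(freq.values())
    heap = list(freq.values())
    heapq.heapify(heap)
    total = 0
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        # Each merge adds one bit to every occurrence beneath it.
        total += a + b
        heapq.heappush(heap, a + b)
    return total


def best_pair_cost(documents):
    """Minimum total size over all splits of the documents into two groups.

    Requires at least one document. The last document is pinned to
    group 0 so that mirrored splits are not examined twice.
    """
    counts = [Counter(doc) for doc in documents]
    n = len(counts)
    best = None
    for mask in range(1 << (n - 1)):
        groups = [Counter(), Counter()]
        for i, c in enumerate(counts):
            groups[(mask >> i) & 1].update(c)
        cost = sum(huffman_cost(g) for g in groups if g)
        if best is None or cost < best:
            best = cost
    return best
```

For the two documents "abab" and "cdcd", one shared dictionary needs 2 bits per character (16 bits in total), while giving each document its own two-symbol dictionary needs 1 bit per character (8 bits in total).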
Www.elsevier.com/locate/jvlc
Content-based image retrieval (CBIR) is a challenging task. Current research works attempt to obtain and use the semantics of images to perform better retrieval. Towards this goal, segmentation of an image into regions has been used in recent years, since local properties of regions can help in matching objects between images and thereby contribute towards a more effective CBIR. This paper reports on a CBIR technique called SNL (Sridhar, Nascimento, Li) that utilizes the region-based properties of the images. In SNL each image is segmented and features (the color, shape, size and spatial position) of the obtained regions are extracted. Regions are then compared using the integrated region matching (IRM) distance measure, which is not a metric, which prevents the use of metric access structures or filtering techniques based on the triangular inequality. We overcome this issue by using [name illegible in source], a true metric distance to compare segmented images. The resulting approach, a variant of SNL, can be used in conjunction with a filtering technique to reduce substantially the number of images compared. Albeit metric, this distance is computationally expensive. We address this drawback with an approach in which we replace the expensive metric distance by the original (non-metric) IRM distance. We found that one can still make use of the same filtering technique, at the expense of little loss in retrieval effectiveness. Thus, the main contribution of this paper is a very effective and highly efficient region-based image retrieval technique. © 2002 Elsevier Science Ltd. All rights reserved. *Corresponding author. Tel.: +1-780-492-5678; fax: +1-780-492-1071.
39 Distributions in text
, 2005
The frequency of words and other linguistic units plays a central role in all branches of corpus linguistics. Indeed, the use of frequency information distinguishes corpus-based methodology from other approaches to language. Thus, not surprisingly, the distribution of frequencies of words and combinations of
New Algorithms on Wavelet Trees and Applications to Information Retrieval
Wavelet trees are widely used in the representation of sequences, permutations, text collections, binary relations, discrete points, and other succinct data structures. We show, however, that this still falls short of exploiting all of the virtues of this versatile data structure. In particular we show how to use wavelet trees to solve fundamental algorithmic problems such as range quantile queries, range next value queries, and range intersection queries. We explore several applications of these queries in Information Retrieval, in particular document retrieval in hierarchical and temporal documents, and in the representation of inverted lists.
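As an illustration of one of the queries named in this abstract, the following Python sketch answers range quantile queries (the k-th smallest value in a range of the sequence) on a pointer-based wavelet tree. It is a teaching sketch under stated assumptions, not the paper's data structure: symbols are integers, plain Python lists of prefix counts stand in for the rank-capable bitmaps a succinct implementation would use, and the class and method names are invented here.

```python
class WaveletTree:
    """Pointer-based wavelet tree over a non-empty sequence of integers."""

    def __init__(self, seq, lo=None, hi=None):
        if lo is None:
            lo, hi = min(seq), max(seq)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not seq:
            return  # leaf, or an empty node that is never visited
        mid = (lo + hi) // 2
        # prefix[i] = how many of the first i symbols go to the left child
        self.prefix = [0]
        for x in seq:
            self.prefix.append(self.prefix[-1] + (x <= mid))
        self.left = WaveletTree([x for x in seq if x <= mid], lo, mid)
        self.right = WaveletTree([x for x in seq if x > mid], mid + 1, hi)

    def quantile(self, l, r, k):
        """Return the k-th smallest value of seq[l:r]; k = 0 is the minimum."""
        if not 0 <= k < r - l:
            raise IndexError("k out of range for this interval")
        node = self
        while node.lo != node.hi:
            left_before = node.prefix[l]
            left_inside = node.prefix[r] - left_before
            if k < left_inside:
                # The answer is among the symbols routed to the left child.
                l, r = left_before, node.prefix[r]
                node = node.left
            else:
                # Skip the left-routed symbols and continue on the right.
                k -= left_inside
                l, r = l - left_before, r - node.prefix[r]
                node = node.right
        return node.lo
```

Each query walks one root-to-leaf path, so its cost is proportional to the height of the tree (logarithmic in the size of the value range) rather than to the length of the queried interval.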
DACs: Bringing Direct Access to Variable-Length Codes
We present a new variable-length encoding scheme for sequences of integers, Directly Addressable Codes (DACs), which enables direct access to any element of the encoded sequence without the need of any sampling method. Our proposal is a kind of implicit data structure that introduces synchronism in the encoded sequence without using asymptotically any extra space. We show some experiments demonstrating that the technique is not only simple, but also competitive in time and space with existing solutions in several applications, such as the representation of LCP arrays or high-order entropy-compressed sequences.
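The layered layout behind this kind of scheme can be sketched as follows. This is a simplified Python illustration under assumptions of its own: every level uses the same chunk width (chunk_bits, a name invented here), values are non-negative integers, and ordinary lists with precomputed prefix counts replace the compact bitmaps with rank support that a real implementation would need to meet the space bounds claimed in the abstract. Each level stores one chunk per surviving value plus a flag saying whether that value continues on the next level; the rank of the flag gives the position to read there.

```python
class DAC:
    """Level-wise variable-length code with direct access to any element."""

    def __init__(self, values, chunk_bits=4):
        self.chunk_bits = chunk_bits
        self.length = len(values)
        self.levels = []  # one (chunks, continues, rank) triple per level
        mask = (1 << chunk_bits) - 1
        current = list(values)
        while current:
            chunks = [v & mask for v in current]
            continues = [1 if v >> chunk_bits else 0 for v in current]
            # rank[i] = number of set flags in continues[:i]
            rank = [0]
            for bit in continues:
                rank.append(rank[-1] + bit)
            self.levels.append((chunks, continues, rank))
            # Only values with remaining high bits move to the next level.
            current = [v >> chunk_bits for v in current if v >> chunk_bits]

    def __getitem__(self, i):
        if not 0 <= i < self.length:
            raise IndexError("index out of range")
        value, shift = 0, 0
        for chunks, continues, rank in self.levels:
            value |= chunks[i] << shift
            if not continues[i]:
                break
            # Position of this element among those continuing to the next level.
            i = rank[i]
            shift += self.chunk_bits
        return value
```

Reading element i touches at most one chunk per level, so small values are decoded after one or two steps and no sampling of the encoded stream is required.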

