Results 1–10 of 53
Why simple hash functions work: Exploiting the entropy in a data stream
 In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Abstract

Cited by 40 (6 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby …
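As a concrete illustration of the 2-universal families this abstract contrasts with ideal hashing, here is a minimal sketch of the classic Carter–Wegman construction; the prime, range size, and function names are illustrative choices, not taken from the paper:

```python
import random

# Carter-Wegman 2-universal family: h_{a,b}(x) = ((a*x + b) mod p) mod m.
# p must be a prime larger than any key; a and b are drawn once per function.
P = 2_305_843_009_213_693_951  # Mersenne prime 2^61 - 1 (illustrative)

def make_hash(m, rng=random):
    """Draw one function h from the family, mapping integer keys to range(m)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

h = make_hash(1024)
```

For any two distinct keys, the probability (over the draw of a and b) that they collide is at most 1/m, which is the property the rigorous-bounds line of analysis relies on.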
E-SmallTalker: A Distributed Mobile System for Social Networking in Physical Proximity
Abstract

Cited by 23 (5 self)
Abstract—Small talk is an important social lubricant that helps people, especially strangers, initiate conversations and make friends with each other in physical proximity. However, due to difficulties in quickly identifying significant topics of common interest, real-world small talk tends to be superficial. The mass popularity of mobile phones can help improve the effectiveness of small talk. In this paper, we present E-SmallTalker, a distributed mobile communications system that facilitates social networking in physical proximity. It automatically discovers and suggests topics such as common interests for more significant conversations. We build on Bluetooth Service Discovery Protocol (SDP) to exchange potential topics by customizing service attributes to publish non-service-related information without establishing a connection. We propose a novel iterative Bloom filter (IBF) protocol that encodes topics to fit in SDP attributes and achieves a low false positive rate. We have implemented the system in Java ME for ease of deployment. Our experiments on real-world phones show that it is efficient enough at the system level to facilitate social interactions among strangers in physical proximity. To the best of our knowledge, E-SmallTalker is the first distributed mobile system to achieve the same purpose.
The Dynamic Bloom Filters
 In Proc. IEEE INFOCOM, 2006
Abstract

Cited by 23 (3 self)
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets and design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone applications as well as distributed applications. Index Terms—Bloom filters, dynamic Bloom filters, information representation.
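A minimal sketch of the capacity-expansion idea described above, assuming a plain Bloom filter with SHA-256-derived hash functions (all sizes, names, and the hashing scheme are illustrative, not the paper's implementation):

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter; the k hash functions are derived from SHA-256."""
    def __init__(self, m, k):
        self.m, self.k, self.n = m, k, 0
        self.bits = bytearray(m)  # one byte per bit, for simplicity

    def _indexes(self, item):
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item):
        for i in self._indexes(item):
            self.bits[i] = 1
        self.n += 1

    def __contains__(self, item):
        return all(self.bits[i] for i in self._indexes(item))

class DynamicBloomFilter:
    """Once the active filter reaches its designed capacity, a fresh one is
    appended; a membership query must consult every filter in the list."""
    def __init__(self, m, k, capacity):
        self.m, self.k, self.capacity = m, k, capacity
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        if self.filters[-1].n >= self.capacity:
            self.filters.append(BloomFilter(self.m, self.k))
        self.filters[-1].add(item)

    def __contains__(self, item):
        return any(item in f for f in self.filters)
```

Because each sub-filter is sized for a fixed capacity, its individual false positive rate stays bounded as the set grows; the cost is that queries fan out across all sub-filters.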
Space-efficient and exact de Bruijn graph representation based on a Bloom filter
Abstract

Cited by 16 (1 self)
Background: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥ 30 GB). Results: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. Conclusions: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
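The Bloom-filter encoding of the graph can be sketched as follows. This omits the paper's additional structure for removing critical false positives, and the k-mer length and hash scheme are illustrative:

```python
import hashlib

K = 5  # k-mer length (illustrative; assemblers typically use a much larger k)

class Bloom:
    def __init__(self, m=1 << 16, k=4):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _idx(self, s):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}|{s}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, s):
        for i in self._idx(s):
            self.bits[i] = 1

    def __contains__(self, s):
        return all(self.bits[i] for i in self._idx(s))

def successors(bloom, kmer):
    """The graph is stored implicitly: a node's out-neighbours are the four
    possible one-base extensions of its (k-1)-suffix, kept if the filter
    reports them present (false positives would add spurious edges)."""
    return [kmer[1:] + c for c in "ACGT" if kmer[1:] + c in bloom]

# Insert the k-mers of a short read; the filter now answers edge queries.
read = "ACGTACGTAC"
bloom = Bloom()
kmers = [read[i:i + K] for i in range(len(read) - K + 1)]
for km in kmers:
    bloom.add(km)
```

Only node membership is stored, never explicit edge lists, which is where the order-of-magnitude space saving comes from.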
Privacy-preserving record linkage using Bloom filters
 BMC Medical Informatics and Decision Making
SplitScreen: Enabling Efficient, Distributed Malware Detection
Abstract

Cited by 14 (6 self)
We present the design and implementation of a novel anti-malware system called SplitScreen. SplitScreen performs an additional screening step prior to the signature matching phase found in existing approaches. The screening step filters out most non-infected files (90%) and also identifies malware signatures that are not of interest (99%). The screening step significantly improves end-to-end performance because safe files are quickly identified and are not processed further, and malware files can subsequently be scanned using only the signatures that are necessary. Our approach naturally leads to a network-based anti-malware solution in which clients only receive the signatures they need, not every malware signature ever created as with current approaches. We have implemented SplitScreen as an extension to ClamAV [13], the most popular open-source anti-malware software. We evaluated our implementation and found a more than 2× increase in scanning speed and a 2× decrease in memory consumption.
A Sequential Indexing Scheme for Flash-Based Embedded Systems
 In EDBT, 2009
Abstract

Cited by 11 (3 self)
NAND Flash has become the most popular stable storage medium for embedded systems. As onboard storage capacity increases, the need for efficient indexing techniques arises. Such techniques are very challenging to design due to a combination of NAND Flash constraints (for example the block-erase-before-page-rewrite constraint and limited number of erase cycles) and embedded system constraints (for example tiny RAM and resource consumption predictability). Previous work adapted traditional indexing methods to cope with Flash constraints by deferring index updates using a log and batching them to decrease the number of rewrite operations in Flash memory. However, these methods were not designed with embedded system constraints in mind and do not address them. In this paper, we propose a new alternative for indexing Flash-resident data that specifically addresses the embedded context. This approach, called PBFilter, organizes the index structure in a purely sequential way. Key lookups are sped up thanks to two principles called Summarization and Partitioning. We instantiate these principles with data structures and algorithms based on Bloom filters and show the effectiveness of this approach through a comprehensive performance study.
Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives
 In Proc. CoNEXT, 2006
Abstract

Cited by 10 (2 self)
Abstract—Where distributed agents must share voluminous set membership information, Bloom filters provide a compact, though lossy, way for them to do so. Numerous recent networking papers have examined the tradeoffs between the bandwidth consumed by the transmission of Bloom filters and the error rate, which takes the form of false positives and rises the more the filters are compressed. In this paper, we introduce the retouched Bloom filter (RBF), an extension that makes the Bloom filter more flexible by permitting the removal of false positives, at the expense of introducing false negatives, and that allows a controlled tradeoff between the two. We analytically show that RBFs created through a random process maintain an overall error rate, expressed as a combination of the false positive rate and the false negative rate, that is equal to the false positive rate of the corresponding Bloom filters. We further provide computationally inexpensive heuristics that, when creating RBFs, decrease the false positive rate by more than the corresponding increase in the false negative rate. Finally, we demonstrate the advantages of an RBF over a Bloom filter in a distributed network topology measurement application, where information about large stop sets must be shared among route tracing monitors.
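The removal mechanism can be sketched in a few lines: clearing any one bit that an unwanted positive maps to suppresses it, while any true member sharing that cleared bit becomes a false negative. The hash scheme and sizes below are illustrative assumptions:

```python
import hashlib

class RetouchedBloom:
    """Sketch of the RBF idea: bits may be cleared after construction to
    eliminate selected false positives, trading them for false negatives."""
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _idx(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}|{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for i in self._idx(item):
            self.bits[i] = 1

    def __contains__(self, item):
        return all(self.bits[i] for i in self._idx(item))

    def retouch(self, unwanted):
        # Clearing any one set bit the unwanted item maps to removes it from
        # the filter; items that shared that bit become false negatives.
        for i in self._idx(unwanted):
            if self.bits[i]:
                self.bits[i] = 0
                return
```

The paper's heuristics concern *which* bit to clear so that few true members are harmed; this sketch simply clears the first set bit it finds.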
Buffered Bloom filters on solid state storage
 In First Intl. Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS '10), 2010
Abstract

Cited by 9 (1 self)
Bloom filters are widely used in many applications including database management systems. With a certain allowable error rate, this data structure provides an efficient solution for membership queries. The error rate is inversely proportional to the size of the Bloom filter. Currently, Bloom filters are stored in main memory because the low locality of operations makes them impractical on secondary storage. In multi-user database management systems, where there is high contention for the shared memory heap, the limited memory available for allocating a Bloom filter may cause a high rate of false positives. In this paper we propose a technique to reduce the memory requirement for Bloom filters with the help of solid state storage devices (SSDs). By using a limited memory space for buffering the read/write requests, we can afford a larger SSD space for the actual Bloom filter bit vector. In our experiments we show that, with a significantly smaller memory requirement and fewer hash functions, the proposed technique reduces the false positive rate effectively. In addition, the proposed data structure runs faster than traditional Bloom filters by grouping the inserted records with respect to their locality on the filter.
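The buffering idea can be sketched by queuing bit updates per page and applying them in batches, so each flush touches whole pages of the bit vector rather than issuing scattered single-bit writes. The page size, flush threshold, and the in-memory dict standing in for SSD pages are all illustrative assumptions:

```python
import hashlib

PAGE = 4096  # bits per simulated SSD page (illustrative)

class BufferedBloom:
    """Bit updates are queued per page in RAM and applied in batches, so a
    flush writes whole pages of the (simulated) SSD-resident bit vector."""
    def __init__(self, m=1 << 15, k=4, flush_at=64):
        self.m, self.k, self.flush_at = m, k, flush_at
        self.pages = {}    # page number -> bytearray (stands in for SSD)
        self.buffer = {}   # page number -> pending bit offsets
        self.pending = 0

    def _idx(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for b in self._idx(item):
            self.buffer.setdefault(b // PAGE, []).append(b % PAGE)
            self.pending += 1
        if self.pending >= self.flush_at:
            self.flush()

    def flush(self):
        for page, offsets in self.buffer.items():
            blob = self.pages.setdefault(page, bytearray(PAGE))
            for off in offsets:
                blob[off] = 1
        self.buffer.clear()
        self.pending = 0

    def __contains__(self, item):
        self.flush()  # queries must see all buffered updates
        empty = bytearray(PAGE)
        return all(self.pages.get(b // PAGE, empty)[b % PAGE]
                   for b in self._idx(item))
```

Grouping pending updates by page is what converts the filter's inherently random access pattern into the sequential, page-sized I/O that SSDs handle well.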
Simple Summaries for Hashing with Choices
 IEEE/ACM Transactions on Networking, 2008
Abstract

Cited by 8 (2 self)
In a multiple-choice hashing scheme, each item is stored in one of P possible hash table buckets. The availability of these multiple choices allows for a substantial reduction in the maximum load of the buckets. However, a lookup may now require examining each of the locations. For applications where this cost is undesirable, Song et al. propose keeping a summary that allows one to determine which of the locations is appropriate for each item, where the summary may allow false positives for items not in the hash table. We propose alternative, simple constructions of such summaries that use less space for both the summary and the underlying hash table. Moreover, our constructions are easily analyzable and tunable.
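The underlying multiple-choice scheme can be sketched as follows, here with two choices per item; the summary structures the paper proposes are not shown, and the table size, hash scheme, and names are illustrative:

```python
import hashlib

D = 2   # candidate buckets per item ("power of two choices")
M = 64  # number of buckets (illustrative)

def choices(item):
    """D candidate buckets, derived from independent SHA-256-based hashes."""
    return [int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8],
                           "big") % M
            for i in range(D)]

def insert(table, item):
    # Place the item in the lighter of its candidate buckets; this choice
    # is what drives the maximum load down versus a single hash function.
    bucket = min(choices(item), key=lambda b: len(table[b]))
    table[bucket].append(item)

def lookup(table, item):
    # Without a summary structure, every candidate bucket must be probed.
    return any(item in table[b] for b in choices(item))

table = [[] for _ in range(M)]
for x in range(200):
    insert(table, x)
```

A summary, as discussed in the abstract, would record which of the D candidates actually holds each item, letting a lookup probe a single bucket at the cost of occasional false positives.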