Why simple hash functions work: Exploiting the entropy in a data stream
In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Cited by 49 (9 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby
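As a rough illustration of the kind of "weak" hash function the abstract above refers to, the following sketch draws one function from the classic Carter-Wegman 2-universal family, h(x) = ((a·x + b) mod p) mod m. The prime `P`, the helper name `make_hash`, and the integer key set are assumptions chosen for this example; they are not taken from the paper.

```python
import random

# Carter-Wegman family: h(x) = ((a*x + b) mod P) mod m, with P prime.
# Assumption: keys are integers in [0, P).
P = (1 << 61) - 1  # a Mersenne prime

def make_hash(m, rng):
    """Draw one function from the family, mapping integer keys to range(m)."""
    a = rng.randrange(1, P)  # a must be non-zero
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % m

rng = random.Random(0)   # fixed seed so the run is repeatable
m = 64                   # number of buckets (hypothetical)
h = make_hash(m, rng)

keys = range(1000)
buckets = [0] * m
for k in keys:
    buckets[h(k)] += 1   # count how many keys land in each bucket

print("max bucket load:", max(buckets))
```

Only two random numbers (`a` and `b`) describe the function, in contrast to the exponential description length of a truly random function mentioned in the abstract.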
Space-efficient and exact de Bruijn graph representation based on a Bloom filter
Cited by 32 (6 self)
Background: The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, e.g. de novo assemblers, rely on in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require a large amount of memory (≥ 30 GB). Results: We propose a new encoding of the de Bruijn graph, which occupies an order of magnitude less space than current representations. The encoding is based on a Bloom filter, with an additional structure to remove critical false positives. Conclusions: An assembly software implementing this structure, Minia, performed a complete de novo assembly of human genome short reads using 5.7 GB of memory in 23 hours.
E-SmallTalker: A Distributed Mobile System for Social Networking in Physical Proximity
Cited by 28 (5 self)
Small talk is an important social lubricant that helps people, especially strangers, initiate conversations and make friends with each other in physical proximity. However, due to difficulties in quickly identifying significant topics of common interest, real-world small talk tends to be superficial. The mass popularity of mobile phones can help improve the effectiveness of small talk. In this paper, we present E-SmallTalker, a distributed mobile communications system that facilitates social networking in physical proximity. It automatically discovers and suggests topics such as common interests for more significant conversations. We build on Bluetooth Service Discovery Protocol (SDP) to exchange potential topics by customizing service attributes to publish non-service-related information without establishing a connection. We propose a novel iterative Bloom filter (IBF) protocol that encodes topics to fit in SDP attributes and achieves a low false positive rate. We have implemented the system in Java ME for ease of deployment. Our experiments on real-world phones show that it is efficient enough at the system level to facilitate social interactions among strangers in physical proximity. To the best of our knowledge, E-SmallTalker is the first distributed mobile system to achieve the same purpose.
The Dynamic Bloom Filters
In Proc. IEEE Infocom, 2006
Cited by 25 (3 self)
A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets and design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone applications as well as distributed applications. Index Terms: Bloom filters, dynamic Bloom filters, information representation.
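The capacity-expansion idea described in the abstract above can be sketched as a list of fixed-size Bloom filters, where a fresh filter is appended once the active one has absorbed a chosen number of items. This is a minimal sketch and not the paper's exact algorithm: the class names, the per-filter `capacity` threshold, and the salted SHA-256 hashing are assumptions made for illustration, and deletion and filter union are omitted.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter with m bits and k hash positions per item."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, for clarity over compactness
        self.count = 0            # number of items inserted so far

    def _positions(self, item):
        # Assumption: k positions derived from salted SHA-256 digests.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class DynamicBloomFilter:
    """Grows by appending a new Bloom filter when the active one is full."""

    def __init__(self, m, k, capacity):
        self.m, self.k, self.capacity = m, k, capacity
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        if self.filters[-1].count >= self.capacity:
            self.filters.append(BloomFilter(self.m, self.k))
        self.filters[-1].add(item)

    def __contains__(self, item):
        # An item is reported present if any constituent filter reports it.
        return any(item in f for f in self.filters)


dbf = DynamicBloomFilter(m=256, k=3, capacity=20)
items = [f"item-{i}" for i in range(100)]
for it in items:
    dbf.add(it)

print("filters in use:", len(dbf.filters))
```

Because each constituent filter holds at most `capacity` items, its individual false positive probability stays bounded as the set grows, at the cost of querying every filter.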
Privacy-preserving record linkage using Bloom filters
BMC Med Inform Decis Mak
SplitScreen: Enabling Efficient, Distributed Malware Detection
Cited by 14 (6 self)
We present the design and implementation of a novel anti-malware system called SplitScreen. SplitScreen performs an additional screening step prior to the signature matching phase found in existing approaches. The screening step filters out most non-infected files (90%) and also identifies malware signatures that are not of interest (99%). The screening step significantly improves end-to-end performance because safe files are quickly identified and are not processed further, and malware files can subsequently be scanned using only the signatures that are necessary. Our approach naturally leads to a network-based anti-malware solution in which clients only receive the signatures they need, not every malware signature ever created as with current approaches. We have implemented SplitScreen as an extension to ClamAV [13], the most popular open source anti-malware software. We evaluated our implementation and found a > 2× increase in scanning speed and a 2× decrease in memory consumption.
A Sequential Indexing Scheme for flash-based embedded systems
EDBT, 2009
Cited by 12 (3 self)
NAND Flash has become the most popular stable storage medium for embedded systems. As on-board storage capacity increases, the need for efficient indexing techniques arises. Such techniques are very challenging to design due to a combination of NAND Flash constraints (for example the block-erase-before-page-rewrite constraint and limited number of erase cycles) and embedded system constraints (for example tiny RAM and resource consumption predictability). Previous work adapted traditional indexing methods to cope with Flash constraints by deferring index updates using a log and batching them to decrease the number of rewrite operations in Flash memory. However, these methods were not designed with embedded system constraints in mind and do not address them. In this paper, we propose a new alternative for indexing Flash-resident data that specifically addresses the embedded context. This approach, called PBFilter, organizes the index structure in a purely sequential way. Key lookups are sped up thanks to two principles called Summarization and Partitioning. We instantiate these principles with data structures and algorithms based on Bloom Filters and show the effectiveness of this approach through a comprehensive performance study.
Retouched bloom filters: allowing networked applications to trade off selected false positives against false negatives
In Proc. CoNEXT, 2006
Cited by 12 (2 self)
Where distributed agents must share voluminous set membership information, Bloom filters provide a compact, though lossy, way for them to do so. Numerous recent networking papers have examined the trade-offs between the bandwidth consumed by the transmission of Bloom filters, and the error rate, which takes the form of false positives, and which rises the more the filters are compressed. In this paper, we introduce the retouched Bloom filter (RBF), an extension that makes the Bloom filter more flexible by permitting the removal of false positives, at the expense of introducing false negatives, and that allows a controlled trade-off between the two. We analytically show that RBFs created through a random process maintain an overall error rate, expressed as a combination of the false positive rate and the false negative rate, that is equal to the false positive rate of the corresponding Bloom filters. We further provide computationally inexpensive heuristics that decrease the false positive rate more than the corresponding increase in the false negative rate, when creating RBFs. Finally, we demonstrate the advantages of an RBF over a Bloom filter in a distributed network topology measurement application, where information about large stop sets must be shared among route tracing monitors.
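The trade-off described in the abstract above can be illustrated with a small sketch: clearing one bit that a known false positive maps to removes that false positive, but any member that relied on the same bit becomes a false negative. The filter size `M`, hash count `K`, the item names, and the choice of clearing the first bit are all assumptions for this example; the paper's random process and selection heuristics are not reproduced here.

```python
import hashlib

M, K = 64, 3  # filter size in bits and hashes per item (hypothetical values)

def positions(item):
    """K bit positions for an item, derived from salted SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % M
        for i in range(K)
    ]

bits = [0] * M
members = [f"member-{i}" for i in range(12)]
for x in members:
    for pos in positions(x):
        bits[pos] = 1

def contains(item):
    return all(bits[pos] for pos in positions(item))

# Find a non-member that the filter wrongly reports as present.
false_positive = next(
    c for c in (f"probe-{i}" for i in range(10000)) if contains(c)
)

# "Retouch" the filter: clear one of the bits the false positive maps to.
bits[positions(false_positive)[0]] = 0

# Members that shared the cleared bit are now false negatives.
false_negatives = [x for x in members if not contains(x)]

print("removed false positive:", false_positive)
print("false negatives introduced:", len(false_negatives))
```

Since only members set bits, the cleared bit must have been set by at least one member, so removing a false positive this way always introduces at least one false negative in this sketch.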
Simple Summaries for Hashing with Choices
IEEE/ACM TRANSACTIONS ON NETWORKING, 2008
Cited by 11 (2 self)
In a multiple-choice hashing scheme, each item is stored in one of P possible hash table buckets. The availability of these multiple choices allows for a substantial reduction in the maximum load of the buckets. However, a lookup may now require examining each of the locations. For applications where this cost is undesirable, Song et al. propose keeping a summary that allows one to determine which of the locations is appropriate for each item, where the summary may allow false positives for items not in the hash table. We propose alternative, simple constructions of such summaries that use less space for both the summary and the underlying hash table. Moreover, our constructions are easily analyzable and tunable.
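The basic multiple-choice scheme that the abstract above starts from can be sketched as follows: each item hashes to several candidate buckets, is placed in the least loaded one, and a lookup without a summary must probe every candidate. The function names, the parameter `d` for the number of choices, and the salted SHA-256 hashing are assumptions for illustration; the summary structures the paper proposes are not shown.

```python
import hashlib

def candidate_buckets(item, d, n_buckets):
    """The d candidate bucket indices for an item (salted SHA-256, an assumption)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % n_buckets
        for i in range(d)
    ]

def insert(table, item, d):
    """Place the item in the least loaded of its d candidate buckets."""
    choices = candidate_buckets(item, d, len(table))
    target = min(choices, key=lambda b: len(table[b]))
    table[target].append(item)

def lookup(table, item, d):
    """Without a summary, a lookup may have to examine all d candidates."""
    return any(item in table[b] for b in candidate_buckets(item, d, len(table)))

n_buckets, d = 32, 2
table = [[] for _ in range(n_buckets)]
items = [f"key-{i}" for i in range(200)]
for it in items:
    insert(table, it, d)

print("max load with", d, "choices:", max(len(b) for b in table))
```

The `lookup` function makes the cost mentioned in the abstract visible: every query touches up to `d` buckets, which is what a summary of item locations is meant to avoid.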
Using cascading bloom filters to improve the memory usage for de brujin graphs
In: WABI, volume 8126 of LNCS, 2013
Cited by 10 (0 self)
De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing (NGS) data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of Chikhi and Rizk (WABI, 2012) that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to their method, with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to their method. This is, to our knowledge, the best practical representation for de Bruijn graphs.