Why simple hash functions work: Exploiting the entropy in a data stream
- In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms, 2008
Cited by 27 (6 self)
Hashing is fundamental to many algorithms and data structures widely used in practice. For theoretical analysis of hashing, there have been two main approaches. First, one can assume that the hash function is truly random, mapping each data item independently and uniformly to the range. This idealized model is unrealistic because a truly random hash function requires an exponential number of bits to describe. Alternatively, one can provide rigorous bounds on performance when explicit families of hash functions are used, such as 2-universal or O(1)-wise independent families. For such families, performance guarantees are often noticeably weaker than for ideal hashing. In practice, however, it is commonly observed that weak hash functions, including 2-universal hash functions, perform as predicted by the idealized analysis for truly random hash functions. In this paper, we try to explain this phenomenon. We demonstrate that the strong performance of universal hash functions in practice can arise naturally from a combination of the randomness of the hash function and the data. Specifically, following the large body of literature on random sources and randomness extraction, we model the data as coming from a “block source,” whereby
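As a rough illustration of the kind of "weak" hash function this abstract refers to, the sketch below draws one function from the classic Carter-Wegman family h(x) = ((a·x + b) mod p) mod m. This is a generic textbook construction, not code from the paper; the names `PRIME`, `make_universal_hash`, and `max_load` are hypothetical, and keys are assumed to be non-negative integers smaller than `PRIME`.

```python
import random

# Assumption of this sketch: a Mersenne prime larger than any key we hash.
PRIME = (1 << 61) - 1


def make_universal_hash(num_buckets, rng=None):
    """Draw one function h(x) = ((a*x + b) mod p) mod m from the family."""
    rng = rng or random.Random()
    a = rng.randrange(1, PRIME)  # a must be non-zero
    b = rng.randrange(0, PRIME)

    def h(key):
        return ((a * key + b) % PRIME) % num_buckets

    return h


def max_load(keys, num_buckets, rng=None):
    """Hash all keys with one randomly drawn function and return the fullest bucket's size."""
    h = make_universal_hash(num_buckets, rng)
    loads = [0] * num_buckets
    for key in keys:
        loads[h(key)] += 1
    return max(loads)
```

Comparing `max_load` on real key sets against the value predicted for a truly random function is one simple way to observe the phenomenon the abstract describes.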
SplitScreen: Enabling Efficient, Distributed Malware Detection
Cited by 7 (4 self)
We present the design and implementation of a novel anti-malware system called SplitScreen. SplitScreen performs an additional screening step prior to the signature matching phase found in existing approaches. The screening step filters out most non-infected files (90%) and also identifies malware signatures that are not of interest (99%). The screening step significantly improves end-to-end performance because safe files are quickly identified and are not processed further, and malware files can subsequently be scanned using only the signatures that are necessary. Our approach naturally leads to a network-based anti-malware solution in which clients only receive the signatures they need, not every malware signature ever created as with current approaches. We have implemented SplitScreen as an extension to ClamAV [13], the most popular open source anti-malware software. We evaluated our implementation and found a >2× increase in scanning speed and a 2× decrease in memory consumption.
Simple Summaries for Hashing with Choices
- IEEE/ACM Transactions on Networking, 2008
Cited by 6 (2 self)
In a multiple-choice hashing scheme, each item is stored in one of P possible hash table buckets. The availability of these multiple choices allows for a substantial reduction in the maximum load of the buckets. However, a lookup may now require examining each of the locations. For applications where this cost is undesirable, Song et al. propose keeping a summary that allows one to determine which of the locations is appropriate for each item, where the summary may allow false positives for items not in the hash table. We propose alternative, simple constructions of such summaries that use less space for both the summary and the underlying hash table. Moreover, our constructions are easily analyzable and tunable.
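The baseline scheme described above (without any summary) can be sketched as follows. This is a minimal illustration and not the construction from the paper: the names `bucket_choices` and `MultiChoiceTable` are hypothetical, and deriving the candidate buckets from salted SHA-256 digests is an assumption made only to keep the example deterministic.

```python
import hashlib


def bucket_choices(item, num_buckets, num_choices):
    """Derive the candidate buckets for an item from salted SHA-256 digests."""
    choices = []
    for i in range(num_choices):
        digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
        choices.append(int.from_bytes(digest[:8], "big") % num_buckets)
    return choices


class MultiChoiceTable:
    def __init__(self, num_buckets, num_choices):
        self.num_choices = num_choices
        self.buckets = [[] for _ in range(num_buckets)]

    def insert(self, item):
        # Place the item in the least-loaded of its candidate buckets.
        candidates = bucket_choices(item, len(self.buckets), self.num_choices)
        target = min(candidates, key=lambda b: len(self.buckets[b]))
        self.buckets[target].append(item)

    def lookup(self, item):
        # Without a summary, every candidate bucket may have to be examined.
        candidates = bucket_choices(item, len(self.buckets), self.num_choices)
        return any(item in self.buckets[b] for b in candidates)

    def max_load(self):
        return max(len(bucket) for bucket in self.buckets)
```

The `lookup` method shows the cost the abstract mentions: it may probe all candidate locations, which is what a summary structure is meant to avoid.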
The Dynamic Bloom Filters
- In Proc. IEEE Infocom, 2006
Cited by 5 (0 self)
A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets and design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone applications as well as distributed applications. Index Terms: Bloom filters, dynamic Bloom filters, information representation.
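One way to picture "expanding its capacity as the set cardinality increases" is a list of fixed-size Bloom filters, where a new filter is appended once the active one has absorbed a set number of items. The sketch below follows that reading and is not taken from the paper; the class names, the `capacity` parameter, and the salted SHA-256 hashing are all assumptions, and deletion and union are omitted.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits
        self.count = 0  # number of items added so far

    def _positions(self, item):
        # Hypothetical hashing choice: one salted SHA-256 digest per hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


class DynamicBloomFilter:
    def __init__(self, num_bits, num_hashes, capacity):
        self.params = (num_bits, num_hashes)
        self.capacity = capacity  # items per filter before a new one is opened
        self.filters = [BloomFilter(*self.params)]

    def add(self, item):
        # Open a fresh filter once the active one is full.
        if self.filters[-1].count >= self.capacity:
            self.filters.append(BloomFilter(*self.params))
        self.filters[-1].add(item)

    def __contains__(self, item):
        # A query must consult every filter created so far.
        return any(item in f for f in self.filters)
```

Because each filter holds at most `capacity` items, its individual false positive probability stays bounded, at the price of queries that touch every filter in the list.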
A Power Management Proxy with a New Best-of-N Bloom Filter Design to Reduce False Positives
- In IEEE International Performance Computing and Communications Conference, 2007
Cited by 4 (2 self)
Bloom filters are a probabilistic data structure used to evaluate set membership. A group of hash functions is used to map elements into a Bloom filter and to test elements for membership. In this paper, we propose using multiple groups of hash functions and selecting the group that generates the Bloom filter instance with the smallest number of bits set to 1. We evaluate the performance of this new Best-of-N method using order statistics and an actual implementation. Our analysis shows that significant reduction in the probability of a false positive can be achieved. We also propose and evaluate a new method that uses a Random Number Generator (RNG) to generate multiple hashes from one initial “seed” hash. This RNG method (motivated by a method from Kirsch and Mitzenmacher) makes the computational expense of the Best-of-N method very modest. The target application is a power management proxy for P2P applications executing in a resource-constrained “SmartNIC”.
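The two ideas in this abstract can be combined in a short sketch: each hash group seeds a random number generator from a single hash of the item, and the group whose filter ends up with the fewest set bits is kept. This is only an interpretation of the abstract, not the authors' implementation; the function names are hypothetical, and the use of SHA-256 for the seed hash and Python's `random.Random` as the RNG are assumptions.

```python
import hashlib
import random


def rng_positions(item, group, num_hashes, num_bits):
    """Expand one seed hash of (group, item) into num_hashes bit positions."""
    digest = hashlib.sha256(f"{group}:{item}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.randrange(num_bits) for _ in range(num_hashes)]


def build_filter(items, group, num_hashes, num_bits):
    """Build a Bloom filter bit vector using the given hash group."""
    bits = [False] * num_bits
    for item in items:
        for pos in rng_positions(item, group, num_hashes, num_bits):
            bits[pos] = True
    return bits


def best_of_n(items, num_groups, num_hashes, num_bits):
    """Build one filter per group and keep the one with the fewest bits set."""
    filters = [build_filter(items, g, num_hashes, num_bits) for g in range(num_groups)]
    best_group = min(range(num_groups), key=lambda g: sum(filters[g]))
    return best_group, filters[best_group]


def might_contain(bits, item, group, num_hashes):
    """Membership test; the caller must use the group chosen at build time."""
    return all(bits[pos] for pos in rng_positions(item, group, num_hashes, len(bits)))
```

Note that the selected group index has to be stored or transmitted alongside the bit vector, since queries only work with the same hash group that built the filter.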
Buffered Bloom filters on solid state storage
- In First Intl. Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures (ADMS'10), 2010
Cited by 3 (1 self)
Bloom filters are widely used in many applications including database management systems. With a certain allowable error rate, this data structure provides an efficient solution for membership queries. The error rate is inversely proportional to the size of the Bloom filter. Currently, Bloom filters are stored in main memory because the low locality of operations makes them impractical on secondary storage. In multi-user database management systems, where there is high contention for the shared memory heap, the limited memory available for allocating a Bloom filter may cause a high rate of false positives. In this paper we propose a technique to reduce the memory requirement for Bloom filters with the help of solid state storage devices (SSD). By using a limited memory space for buffering the read/write requests, we can afford a larger SSD space for the actual Bloom filter bit vector. In our experiments we show that, with a significantly smaller memory requirement and fewer hash functions, the proposed technique reduces the false positive rate effectively. In addition, the proposed data structure runs faster than traditional Bloom filters by grouping the inserted records with respect to their locality on the filter.
Fast cache for your text: Accelerating exact pattern matching with feed-forward bloom filters
- 2009
Cited by 2 (2 self)
This paper presents an algorithm for exact pattern matching based on a new type of Bloom filter that we call a feed-forward Bloom filter. Besides filtering the input corpus, a feed-forward Bloom filter is also able to reduce the set of patterns needed for the exact matching phase. We show that this technique, along with a CPU architecture aware design of the Bloom filter, can provide speedups between 2× and 30×, and memory consumption reductions as large as 50× when compared with grep, while the filtering speed can be as much as 5× higher than that of a normal Bloom filter.
Cache-, Hash- and Space-Efficient Bloom Filters
Cited by 2 (0 self)
A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives. We propose several new variants of Bloom filters and replacements with similar functionality. All of them have better cache efficiency and need fewer hash bits than regular Bloom filters. Some use SIMD functionality, while the others provide even better space efficiency. As a consequence, we get a more flexible trade-off between false positive rate, space efficiency, cache efficiency, hash efficiency, and computational effort. We analyze the efficiency of Bloom filters and the proposed replacements in detail, in terms of the false positive rate, the number of expected cache misses, and the number of required hash bits. We also describe and experimentally evaluate the performance of highly tuned implementations. For many settings, our alternatives perform better than the methods proposed so far.
E-SmallTalker: A Distributed Mobile System for Social Networking in Physical Proximity
Cited by 2 (1 self)
Small talk is an important social lubricant that helps people, especially strangers, initiate conversations and make friends with each other in physical proximity. However, due to difficulties in quickly identifying significant topics of common interest, real-world small talk tends to be superficial. The mass popularity of mobile phones can help improve the effectiveness of small talk. In this paper, we present E-SmallTalker, a distributed mobile communications system that facilitates social networking in physical proximity. It automatically discovers and suggests topics such as common interests for more significant conversations. We build on Bluetooth Service Discovery Protocol (SDP) to exchange potential topics by customizing service attributes to publish non-service-related information without establishing a connection. We propose a novel iterative Bloom filter (IBF) protocol that encodes topics to fit in SDP attributes and achieves a low false positive rate. We have implemented the system in Java ME for ease of deployment. Our experiments on real-world phones show that it is efficient enough at the system level to facilitate social interactions among strangers in physical proximity. To the best of our knowledge, E-SmallTalker is the first distributed mobile system to achieve the same purpose.
Censorship-resistant Publishing
- In Technical Report CS2010-0956, 2010
Cited by 2 (2 self)
As the web evolves, it is becoming easier to form communities based on shared interests, and to create, publish, and query data on a wide variety of topics. In order to fully deliver on the promise of free data exchange, any community-supporting infrastructure needs to enforce the key requirement to preserve privacy of the association of content providers with potentially sensitive published information. This privacy-preserving publishing requirement prevents censorship, harassment, or discrimination of users by third parties. We propose a novel privacy-preserving distributed infrastructure in which data resides only with the publishers owning it. The infrastructure disseminates user queries to publishers, who answer them at their own discretion. The infrastructure enforces a publisher k-anonymity guarantee, which prevents leakage of information about which publishers are capable of answering a certain query. Given the virtual nature of the global data collection, we study the challenging problem of efficiently locating publishers in the community that contain data items matching a specified query. We propose a distributed index structure, UQDT, that is organized as a union of Query Dissemination Trees (QDTs), and realized on an overlay (i.e., logical) network infrastructure. Each QDT has data publishers as its leaf nodes, and overlay network nodes as its internal nodes; each internal node routes queries to publishers, based on a summary of the data advertised by publishers in its subtrees. We experimentally evaluate design tradeoffs, and demonstrate that UQDT can maximize throughput by preventing any overlay network node from becoming a bottleneck.

