Results 1–10 of 35
Theory and Practice of Bloom Filters for Distributed Systems
"... Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are h ..."
Abstract

Cited by 30 (0 self)
Many network solutions and overlay networks utilize probabilistic techniques to reduce information processing and networking costs. This survey article presents a number of frequently used and useful probabilistic techniques. Bloom filters and their variants are of prime importance, and they are heavily used in various distributed systems. This has been reflected in recent research and many new algorithms have been proposed for distributed systems that are either directly or indirectly based on Bloom filters. In this survey, we give an overview of the basic and advanced techniques, reviewing over 20 variants and discussing their application in distributed systems, in particular for caching, peer-to-peer systems, routing and forwarding, and measurement data summarization.
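As background for the surveyed variants, the standard Bloom filter the abstract refers to can be sketched in a few lines. This is a minimal illustration, not any particular variant from the survey; deriving the k indices from two halves of one digest (double hashing) is a common implementation choice, not part of the definition:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a bit array plus k derived hash positions.
    Membership answers may be false positives, never false negatives."""

    def __init__(self, m_bits, k_hashes):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item):
        # Derive k indices from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(1024, 4)
bf.add("alice")
print("alice" in bf)    # True: inserted items always match
print("mallory" in bf)  # False with high probability
```

Deletion is not supported here: clearing a bit could erase evidence of another item, which is exactly the gap the counting and dynamic variants below address.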
The Dynamic Bloom Filters
In Proc. IEEE Infocom, 2006
"... Abstract—A Bloom filter is an effective, spaceefficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a suffic ..."
Abstract

Cited by 25 (3 self)
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets and design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone applications as well as distributed applications. Index Terms—Bloom filters, dynamic Bloom filters, information representation.
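The capacity-expansion idea described in the abstract can be sketched as a chain of fixed-size standard filters: when the active filter reaches its target capacity, a fresh one is appended, and queries probe the whole chain. This is an illustrative reading of the scheme, not the paper's exact construction or parameter choices:

```python
import hashlib

class _BF:
    """Plain fixed-size Bloom filter used as a building block."""
    def __init__(self, m, k):
        self.m, self.k, self.n = m, k, 0
        self.bits = [0] * m

    def _pos(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._pos(item):
            self.bits[p] = 1
        self.n += 1

    def query(self, item):
        return all(self.bits[p] for p in self._pos(item))

class DynamicBloomFilter:
    """Sketch of a dynamic Bloom filter: a chain of fixed-size filters."""
    def __init__(self, m=1024, k=4, capacity=100):
        self.m, self.k, self.capacity = m, k, capacity
        self.filters = [_BF(m, k)]

    def add(self, item):
        if self.filters[-1].n >= self.capacity:
            # Active filter is full: append a fresh sub-filter so the
            # per-filter false positive rate stays bounded as the set grows.
            self.filters.append(_BF(self.m, self.k))
        self.filters[-1].add(item)

    def query(self, item):
        return any(f.query(item) for f in self.filters)

dbf = DynamicBloomFilter(m=1024, k=4, capacity=100)
for i in range(250):
    dbf.add(f"item-{i}")
print(len(dbf.filters))  # 3 sub-filters: 250 items at capacity 100
```

Note the trade-off the paper analyzes: each extra sub-filter adds its own false positive probability to a query, which is why the expansion policy and per-filter capacity matter.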
Improved approximate detection of duplicates for data streams over sliding windows
 J. of Computer Science and Technology
"... Abstract Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios, and, on the other hand, the elements in data streams are always time sensit ..."
Abstract

Cited by 9 (0 self)
Abstract Detecting duplicates in data streams is an important problem that has a wide range of applications. In general, precisely detecting duplicates in an unbounded data stream is not feasible in most streaming scenarios; on the other hand, the elements in data streams are always time-sensitive. This makes it particularly important to approximately detect duplicates among newly arrived elements of a data stream within a fixed time frame. In this paper, we present a novel data structure, the Decaying Bloom Filter (DBF), as an extension of the Counting Bloom Filter, that effectively removes stale elements as new elements continuously arrive over sliding windows. Based on the DBF, we present an efficient algorithm to approximately detect duplicates over sliding windows. Our algorithm may produce false positive errors, but not false negative errors as in many previous results. We analyze the time complexity and detection accuracy, and give a tight upper bound on the false positive rate. For a given space of G bits and sliding window size W, our algorithm has an amortized time complexity of O(G/W). Both analytical and experimental results on synthetic data demonstrate that our algorithm is superior in both execution time and detection accuracy to previous results.
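The decay mechanism the abstract describes can be illustrated as follows. This is a toy version: it decays every live counter on each arrival for clarity, whereas the paper's algorithm decays only a small block of counters per arrival to reach the O(G/W) amortized bound; all parameter values here are illustrative:

```python
import hashlib

class DecayingBloomFilter:
    """Toy decaying Bloom filter over a sliding window of size W:
    inserting an element sets its k counters to W; every arrival ages
    all live counters by one, so elements older than W fade out."""

    def __init__(self, num_counters, k, window):
        self.g, self.k, self.w = num_counters, k, window
        self.counters = [0] * num_counters

    def _pos(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big")
        return [(h1 + i * h2) % self.g for i in range(self.k)]

    def seen_recently(self, item):
        # Duplicate test runs BEFORE inserting the new element.
        return all(self.counters[p] > 0 for p in self._pos(item))

    def insert(self, item):
        # Decay: age every live counter by one arrival.
        for i in range(self.g):
            if self.counters[i] > 0:
                self.counters[i] -= 1
        # Refresh this element's counters to the full window lifetime.
        for p in self._pos(item):
            self.counters[p] = self.w

dbf = DecayingBloomFilter(num_counters=4096, k=3, window=3)
dbf.insert("a")
for newer in ["b", "c", "d"]:
    dbf.insert(newer)
print(dbf.seen_recently("a"), dbf.seen_recently("d"))  # "a" has aged out
```

As in the paper, errors are one-sided in the false positive direction: counters can stay nonzero due to collisions, but a recently inserted element always holds full counters within the window.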
False Negative Problem of Counting Bloom Filter
"... Abstract—Bloom filter is effective, spaceefficient data structure for concisely representing a data set and supporting approximate membership queries. Traditionally, researchers often believe that it is possible that a Bloom filter returns a false positive, but it will never return a false negative ..."
Abstract

Cited by 9 (2 self)
Abstract—A Bloom filter is an effective, space-efficient data structure for concisely representing a data set and supporting approximate membership queries. Traditionally, researchers have often believed that a Bloom filter may return a false positive, but will never return a false negative under well-behaved operations. By investigating the mainstream variants, however, we observe that a Bloom filter does return false negatives in many scenarios. In this work, we show that the undetectable incorrect deletion of false positive items and the detectable incorrect deletion of multi-address items are two general causes of false negatives in a Bloom filter. We then measure the potential and exposed false negatives theoretically and practically. Inspired by the fact that the potential false negatives are usually not fully exposed, we propose a novel Bloom filter scheme, which increases the ratio of bits set to a value larger than one without decreasing the ratio of bits set to zero. Mathematical analysis and comprehensive experiments show that this design can reduce the number of exposed false negatives as well as decrease the likelihood of false positives. To the best of our knowledge, this is the first work dealing with both the false positive and false negative problems of Bloom filters systematically while supporting standard usages of item insertion, query, and deletion operations. Index Terms—Bloom filter, false negative, multi-choice counting Bloom filter.
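The first cause named above, deleting an item that was only ever a false positive, can be demonstrated with a counting Bloom filter whose hash positions are hand-picked to force the collision. This is a contrived illustration of the mechanism, not the paper's multi-choice scheme:

```python
class CountingBloomFilter:
    """Counting Bloom filter with contrived k=2 hash positions so the
    false-negative mechanism is reproducible."""

    # Hand-picked positions (illustrative only, not a real hash).
    POSITIONS = {
        "x": (0, 1),
        "y": (2, 3),
        "z": (1, 2),  # never inserted, but its counters are covered
                      # by "x" and "y": a false positive
    }

    def __init__(self, m=4):
        self.counters = [0] * m

    def insert(self, item):
        for p in self.POSITIONS[item]:
            self.counters[p] += 1

    def query(self, item):
        return all(self.counters[p] > 0 for p in self.POSITIONS[item])

    def delete(self, item):
        for p in self.POSITIONS[item]:
            self.counters[p] -= 1

cbf = CountingBloomFilter()
cbf.insert("x")
cbf.insert("y")
print(cbf.query("z"))  # True: a false positive
cbf.delete("z")        # undetectable incorrect deletion of a false positive
print(cbf.query("x"))  # False: a genuine member now reads as absent
```

The deletion of "z" looks legitimate to the filter (all its counters were positive), which is why the paper calls this case undetectable: the damage surfaces only later, as false negatives for "x" and "y".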
Addressing click fraud in content delivery systems
In Proceedings of the 26th IEEE INFOCOM
"... ..."
(Show Context)
A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Architectures
"... As GPU’s compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarsegrained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache metadata storage. These coarsegrained memory acce ..."
Abstract

Cited by 6 (1 self)
As GPUs' compute capabilities grow, their memory hierarchy increasingly becomes a bottleneck. Current GPU memory hierarchies use coarse-grained memory accesses to exploit spatial locality, maximize peak bandwidth, simplify control, and reduce cache metadata storage. These coarse-grained memory accesses, however, are a poor match for emerging GPU applications with irregular control flow and memory access patterns. Meanwhile, the massive multithreading of GPUs and the simplicity of their cache hierarchies make CPU-specific memory system enhancements ineffective for improving the performance of irregular GPU applications. We design and evaluate a locality-aware memory hierarchy for throughput processors, such as GPUs. Our proposed design retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality. As such, our locality-aware memory hierarchy improves GPU performance, energy efficiency, and memory throughput for a large range of applications.
Query by Document via a Decomposition-Based Two-Level Retrieval Approach
"... Retrieving similar documents from a largescale text corpus according to a given document is a fundamental technique for many applications. However, most of existing indexing techniques have difficulties to address this problem due to special properties of a document query, e.g. high dimensionality, ..."
Abstract

Cited by 4 (0 self)
Retrieving similar documents from a large-scale text corpus according to a given document is a fundamental technique for many applications. However, most existing indexing techniques have difficulty addressing this problem due to special properties of a document query, e.g., high dimensionality, sparse representation, and semantic issues. To address this problem, we propose a two-level retrieval solution based on a document decomposition idea. A document is decomposed into a compact vector and a few document-specific keywords by a dimension reduction approach. The compact vector embodies the major semantics of a document, and the document-specific keywords complement the discriminative power lost in the dimension reduction process. We adopt locality sensitive hashing (LSH) to index the compact vectors, which makes it possible to quickly find a set of related documents according to the vector of a query document. Then we re-rank documents in this set by their document
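The LSH indexing step can be illustrated with the classic random-hyperplane signature for cosine similarity: similar vectors agree on most signature bits, so they land in the same hash buckets. This is a generic LSH sketch over made-up toy vectors, not the paper's decomposition pipeline:

```python
import random

def simhash_signature(vec, hyperplanes):
    """Random-hyperplane LSH: the sign pattern of dot products with
    random hyperplanes gives a short bit signature; vectors close in
    cosine similarity collide on most bits."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )

random.seed(0)
dim, n_bits = 8, 16
planes = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]

doc_a = [1.0, 0.9, 0.0, 0.2, 0.0, 0.0, 0.8, 0.1]
doc_b = [0.9, 1.0, 0.1, 0.2, 0.0, 0.0, 0.7, 0.0]  # near doc_a
doc_c = [0.0, 0.0, 1.0, 0.0, 0.9, 1.0, 0.0, 0.8]  # unrelated

sig_a = simhash_signature(doc_a, planes)
sig_b = simhash_signature(doc_b, planes)
sig_c = simhash_signature(doc_c, planes)

hamming = lambda s, t: sum(x != y for x, y in zip(s, t))
print(hamming(sig_a, sig_b), hamming(sig_a, sig_c))  # near pair differs
                                                     # on fewer bits (w.h.p.)
```

In a two-level design like the one above, candidates retrieved by signature collision would then be re-ranked by a finer criterion, here the document-specific keywords.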
Capacity and Robustness Tradeoffs in Bloom Filters for Distributed Applications
"... Abstract—The Bloom filter is a spaceefficient data structure often employed in distributed applications to save bandwidth during data exchange. These savings, however, come at the cost of errors in the shared data, which are usually assumed low enough to not disrupt the application. We argue that t ..."
Abstract

Cited by 3 (0 self)
Abstract—The Bloom filter is a space-efficient data structure often employed in distributed applications to save bandwidth during data exchange. These savings, however, come at the cost of errors in the shared data, which are usually assumed low enough not to disrupt the application. We argue that this assumption does not hold in a more hostile environment, such as the Internet, where attackers can send a carefully crafted Bloom filter in order to break the application. In this paper, we propose the concatenated Bloom filter (CBF), a robust Bloom filter that prevents the attacker from interfering with the shared information, protecting the application data while still providing space efficiency. Instead of using a single large filter, the CBF concatenates small sub-filters to improve both the filter robustness and capacity. We propose three CBF variants and provide analytical results that show the efficacy of the CBF for different scenarios. We also evaluate the performance of our filter in an IP traceback application, and simulation results confirm the effectiveness of the proposed mechanism in the face of attackers. Index Terms—Bloom filters, distributed applications, security, IP traceback
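One way to read the concatenation idea is to route each item to a single small sub-filter chosen by a hash of the item, so a crafted or saturated region of the bit array only affects items mapped to that sub-filter. This is a loose sketch of the principle; the actual sub-filter selection, sizes, and the paper's three CBF variants may differ:

```python
import hashlib

class ConcatenatedBloomFilter:
    """Sketch: the bit array is a concatenation of small sub-filters,
    and each item is inserted into exactly one sub-filter chosen by a
    hash of the item, localizing any damage to that region."""

    def __init__(self, num_subfilters=8, sub_bits=128, k=3):
        self.s, self.m, self.k = num_subfilters, sub_bits, k
        self.bits = [0] * (num_subfilters * sub_bits)

    def _slots(self, item):
        d = hashlib.sha256(item.encode()).digest()
        sub = int.from_bytes(d[:4], "big") % self.s   # which sub-filter
        h1 = int.from_bytes(d[4:12], "big")
        h2 = int.from_bytes(d[12:20], "big")
        base = sub * self.m
        return [base + (h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._slots(item):
            self.bits[p] = 1

    def query(self, item):
        return all(self.bits[p] for p in self._slots(item))

cbf = ConcatenatedBloomFilter()
cbf.add("alpha")
print(cbf.query("alpha"))  # True
```

Under this reading, an attacker who fills one sub-filter with ones can at worst force false positives for the fraction of items hashed to it, rather than corrupting queries across the entire filter.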
Finding duplicates in a data stream
In Proc. 20th Annual Symposium on Discrete Algorithms (SODA), 2009
"... Given a data stream of length n over an alphabet [m] where n> m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m) 3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this pro ..."
Abstract

Cited by 3 (2 self)
Given a data stream of length n over an alphabet [m] where n > m, we consider the problem of finding a duplicate in a single pass. We give a randomized algorithm for this problem that uses O((log m)^3) space. This answers a question of Muthukrishnan [Mut05] and Tarui [Tar07], who asked if this problem could be solved using sublinear space and one pass over the input. Our algorithm solves the more general problem of finding a positive frequency element in a stream given by frequency updates where the sum of all frequencies is positive. Our main tool is an Isolation Lemma that reduces this problem to the task of detecting and identifying a dictatorial variable in a Boolean halfspace. We present various relaxations of the condition n > m under which one can find duplicates efficiently.
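For contrast with the one-pass result above, the classic multi-pass baseline for n > m is binary search on the value range with a pigeonhole count per pass: whichever half of the range holds more stream elements than distinct values must contain a duplicate. This is a standard textbook approach, not the paper's algorithm:

```python
def find_duplicate(stream_factory, m):
    """Multi-pass duplicate finder for a stream of n > m values in
    [1, m]. Each pass counts elements in the lower half of the current
    range; O(log m) passes, O(log m) bits of working state.
    `stream_factory` re-yields the stream for each pass."""
    lo, hi = 1, m
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(1 for x in stream_factory() if lo <= x <= mid)
        if count > mid - lo + 1:   # more elements than distinct values
            hi = mid               # pigeonhole: duplicate in lower half
        else:
            lo = mid + 1           # invariant moves to the upper half
    return lo

data = [3, 1, 4, 2, 5, 4, 6]       # n = 7 > m = 6, duplicate: 4
print(find_duplicate(lambda: iter(data), 6))  # → 4
```

The paper's contribution is removing the need for repeated passes: its randomized algorithm finds a duplicate in a single pass while keeping the space polylogarithmic in m.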
Towards “Intelligent Compression” in Streams: A Biased Reservoir Sampling based Bloom Filter Approach
"... With the explosion of information stored worldwide, data intensive computing has emerged as a central area of research. Efficient management and processing of this massively exponential amount of data from diverse sources, such as telecommunication call data records, telescope imagery, online trans ..."
Abstract

Cited by 2 (1 self)
With the explosion of information stored worldwide, data-intensive computing has emerged as a central area of research. Efficient management and processing of this exponentially growing amount of data from diverse sources, such as telecommunication call data records, telescope imagery, online transaction records, web pages, stock markets, medical records (monitoring critical health conditions of patients), climate warning systems, etc., has become a necessity. Removing redundancy from such huge (multi-billion record) datasets results in resource and compute efficiency for downstream processing and constitutes an important area of study. “Intelligent compression” or deduplication in streaming scenarios, for precise identification and elimination of duplicates from an unbounded data stream, is a greater challenge given the real-time nature of data arrival. Stable Bloom