Sharing aggregate computation for distributed queries
In SIGMOD, 2007
Cited by 18 (0 self)
An emerging challenge in modern distributed querying is to efficiently process multiple continuous aggregation queries simultaneously. Processing each query independently may be infeasible, so multi-query optimizations are critical for sharing work across queries. The challenge is to identify overlapping computations that may not be obvious in the queries themselves. In this paper, we reveal new opportunities for sharing work in the context of distributed aggregation queries that vary in their selection predicates. We identify settings in which a large set of q such queries can be answered by executing k ≪ q different queries. The k queries are revealed by analyzing a boolean matrix capturing the connection between data and the queries that they satisfy, in a manner akin to familiar techniques like Gaussian elimination. Indeed, we identify a class of linear aggregate functions (including SUM, COUNT and AVERAGE), and show that the sharing potential for such queries can be optimally recovered using standard matrix decompositions from computational linear algebra. For some other typical aggregation functions (including MIN and MAX) we find that optimal sharing maps to the NP-hard set basis problem. However, for those scenarios, we present a family of heuristic algorithms and demonstrate that they perform well for moderate-sized matrices. We also present a dynamic distributed system architecture to exploit sharing opportunities, and experimentally evaluate the benefits of our techniques via a novel, flexible random workload generator we develop for this setting.
Categories and Subject Descriptors: H.2.4 [Systems]: Distributed databases
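The idea of answering q linear-aggregate queries from only k executed queries can be illustrated with a small sketch. It is not the paper's algorithm: it uses plain Gaussian elimination over the rationals to pick k independent rows of a boolean query matrix and to express every other row through them. The function name `share_plan`, the term "fragment" (a group of data items that satisfy the same set of queries), and the example matrix are all assumptions made for illustration.

```python
from fractions import Fraction

def share_plan(A):
    """Pick k independent rows of A and express every row through them.

    Returns (basis, coeffs): basis lists the chosen row indices; coeffs[i]
    maps a position in basis to the weight of that basis row in row i.
    """
    basis, echelon, coeffs = [], [], []
    for i, row in enumerate(A):
        r = [Fraction(x) for x in row]
        acc = {}  # invariant: row i == r + sum(acc[j] * A[basis[j]])
        for pivot, vec, expr in echelon:
            c = r[pivot]
            if c:
                r = [a - c * b for a, b in zip(r, vec)]
                for j, w in expr.items():
                    acc[j] = acc.get(j, 0) + c * w
        if any(r):
            # Row i is independent of the rows seen so far: keep it as a basis row.
            p = len(basis)
            basis.append(i)
            pivot = next(col for col, v in enumerate(r) if v)
            scale = r[pivot]
            expr = {j: -w / scale for j, w in acc.items()}
            expr[p] = Fraction(1) / scale
            echelon.append((pivot, [v / scale for v in r], expr))
            coeffs.append({p: Fraction(1)})
        else:
            # Row i is a linear combination of basis rows: no need to execute it.
            coeffs.append({j: w for j, w in acc.items() if w})
    return basis, coeffs

# Rows are queries, columns are fragments (hypothetical example data).
A = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],  # equals row 0 - row 1 + row 2
]
fragment_sums = [10, 20, 30, 40]  # per-fragment SUM values

basis, coeffs = share_plan(A)
# Execute only the k basis queries ...
partial = [sum(a * f for a, f in zip(A[i], fragment_sums)) for i in basis]
# ... and rebuild all q answers from them.
answers = [int(sum(w * partial[j] for j, w in c.items())) for c in coeffs]
print(len(basis), "of", len(A), "queries executed:", answers)
```

The fourth query is rebuilt by subtracting one partial result from the others. That works for SUM but has no counterpart for MIN or MAX, which is consistent with the abstract's remark that those aggregates lead to a harder problem.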
The Dynamic Bloom Filters
In Proc. IEEE Infocom, 2006
Cited by 17 (2 self)
A Bloom filter is an effective, space-efficient data structure for concisely representing a set and supporting approximate membership queries. Traditionally, the Bloom filter and its variants just focus on how to represent a static set and decrease the false positive probability to a sufficiently low level. By investigating mainstream applications based on the Bloom filter, we reveal that dynamic data sets are more common and important than static sets. However, existing variants of the Bloom filter cannot support dynamic data sets well. To address this issue, we propose dynamic Bloom filters to represent dynamic sets as well as static sets and design necessary item insertion, membership query, item deletion, and filter union algorithms. The dynamic Bloom filter can control the false positive probability at a low level by expanding its capacity as the set cardinality increases. Through comprehensive mathematical analysis, we show that the dynamic Bloom filter uses less expected memory than the Bloom filter when representing dynamic sets with an upper bound on set cardinality, and also that the dynamic Bloom filter is more stable than the Bloom filter due to infrequent reconstruction when addressing dynamic sets without an upper bound on set cardinality. Moreover, the analysis results hold in standalone applications as well as distributed applications.
Index Terms: Bloom filters, dynamic Bloom filters, information representation.
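A minimal sketch of capacity expansion, assuming the dynamic filter is kept as a list of fixed-size standard Bloom filters and a fresh one is appended once the active filter has absorbed `capacity` insertions. Only insertion and membership query are shown; deletion and union are omitted. The class names, the SHA-256 based hashing, and all parameter values are illustrative choices, not taken from the paper.

```python
import hashlib

class BloomFilter:
    """Standard Bloom filter with m bits and k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [False] * m
        self.count = 0  # number of insertions, duplicates included

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True
        self.count += 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

class DynamicBloomFilter:
    """Grows by appending a new fixed-size filter when the active one is full."""

    def __init__(self, m=256, k=4, capacity=32):
        self.m, self.k, self.capacity = m, k, capacity
        self.filters = [BloomFilter(m, k)]

    def add(self, item):
        active = self.filters[-1]
        if active.count >= self.capacity:
            active = BloomFilter(self.m, self.k)
            self.filters.append(active)
        active.add(item)

    def __contains__(self, item):
        # An item is reported present if any constituent filter reports it.
        return any(item in f for f in self.filters)

dbf = DynamicBloomFilter()
for n in range(100):
    dbf.add(f"item-{n}")
print(len(dbf.filters), "filters in use")
```

Because every constituent filter is consulted on a query, the overall false positive probability grows with the number of filters. The per-filter capacity therefore trades memory against accuracy.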
An optimal Bloom filter replacement based on matrix solving
In CSR, 2009
Cited by 14 (0 self)
We suggest a method for holding a dictionary data structure, which maps keys to values, in the spirit of Bloom Filters. The space requirements of the dictionary we suggest are much smaller than those of a hash table. We allow storing n keys, each mapped to a value that is a string of k bits. Our suggested method requires nk + o(n) bits of space to store the dictionary and O(n) time to produce the data structure, and allows answering a membership query in O(1) memory probes. The dictionary size does not depend on the size of the keys. However, reducing the space requirements of the data structure comes at a certain cost. Our dictionary has a small probability of a one-sided error. When attempting to obtain the value for a key that is stored in the dictionary we always get the correct answer. However, when testing for membership of an element that is not stored in the dictionary, we may get an incorrect answer, and when requesting the value of such an element we may get a certain random value. Our method is based on solving equations in GF(2^k) and using several hash functions. Another significant advantage of our suggested method is that we do not require sophisticated hash functions. We only require pairwise independent hash functions. We also suggest a data structure that requires only nk bits of space, has O(n^2) preprocessing time, and has O(log n) query time. However, this data structure requires uniform hash functions. In order to replace a Bloom Filter of n elements with an error probability of 2^-k, we require nk + o(n) memory bits, O(1) query time, O(n) preprocessing time, and only pairwise independent hash functions. Even the most advanced previously known Bloom Filter would require nk + O(n) space and uniform hash functions, so our method is significantly less space consuming, especially when k is small. Our suggested dictionary can replace Bloom Filters, and has many applications. A few application examples are dictionaries for storing bad passwords, differential files in databases, Internet caching and distributed storage systems.
Scalable Bloom Filters
Cited by 13 (0 self)
Bloom Filters provide space-efficient storage of sets at the cost of a probability of false positives on membership queries. The size of the filter must be defined a priori based on the number of elements to store and the desired false positive probability, and it is impossible to store extra elements without increasing the false positive probability. This typically leads to a conservative assumption regarding maximum set size, possibly by orders of magnitude, and a consequent waste of space. This paper proposes Scalable Bloom Filters, a variant of Bloom Filters that can adapt dynamically to the number of elements stored, while assuring a maximum false positive probability.
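One way a growing filter can still guarantee a maximum false positive probability is sketched below. The setup is an assumption, not something the abstract states: the structure is a series of plain Bloom filters, and stage i is built with false positive probability P_i = P_0 r^i for a tightening ratio 0 < r < 1. A query reports a hit if any stage does, so the union bound and a geometric series give:

```latex
P_{\mathrm{total}} \;=\; 1 - \prod_{i \ge 0} (1 - P_i)
\;\le\; \sum_{i \ge 0} P_0\, r^{i}
\;=\; \frac{P_0}{1 - r},
\qquad 0 < r < 1 .
```

Choosing P_0 = (1 - r) P_max then keeps the compound probability at or below P_max for any number of stages. The symbols P_0, r and P_max are introduced here for illustration.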
Global document frequency estimation in peer-to-peer web search
In WebDB, 2006
Cited by 11 (6 self)
Information retrieval (IR) in peer-to-peer (P2P) networks, where the corpus is spread across many loosely coupled peers, has recently gained importance. In contrast to IR systems on a centralized server or server farm, P2P IR faces the additional challenge of either being oblivious to global corpus statistics or having to compute the global measures from local statistics at the individual peers in an efficient, distributed manner. One specific measure of interest is the global document frequency for different terms, which would be very beneficial as term-specific weights in the scoring and ranking of merged search results that have been obtained from different peers. This paper presents an efficient solution for the problem of estimating global document frequencies in a large-scale P2P network with very high dynamics where peers can join and leave the network on short notice. In particular, the developed method takes into account the fact that the local document collections of autonomous peers may arbitrarily overlap, so that global counting needs to be duplicate-insensitive. The method is based on hash sketches as a technique for compact data synopses. Experimental studies demonstrate the estimator’s accuracy, scalability, and ability to cope with high dynamics. Moreover, the benefit for ranking P2P search results is shown by experiments with real-world Web data and queries.
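Duplicate-insensitive counting can be illustrated with a hash sketch in the Flajolet-Martin style, assuming that is the kind of sketch meant here. Each document sets one bit chosen by the position of the lowest 1 bit of its hash, and two peers combine their sketches with a bitwise OR, so a document held by both peers contributes only once. The function names, the 32-bit width, the SHA-256 hash, and the document identifiers are illustrative assumptions. A single bitmap gives only a coarse estimate, and practical use would average many bitmaps.

```python
import hashlib

BITS = 32  # sketch width in bits (an illustrative choice)

def rho(value):
    """Index of the least significant 1 bit of value (BITS - 1 if value is 0)."""
    if value == 0:
        return BITS - 1
    return (value & -value).bit_length() - 1

def sketch(items):
    """Build a one-bitmap hash sketch for a collection of string identifiers."""
    bitmap = 0
    for item in items:
        h = int.from_bytes(hashlib.sha256(item.encode()).digest()[:4], "big")
        bitmap |= 1 << rho(h)
    return bitmap

def merge(a, b):
    """Combine two sketches; duplicates across peers set the same bits."""
    return a | b

def estimate(bitmap):
    """Estimate the distinct count from the position of the lowest 0 bit."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return 2 ** r / 0.77351  # standard Flajolet-Martin correction factor

# Two peers whose local collections overlap on documents 400..599.
peer_a = [f"doc-{n}" for n in range(0, 600)]
peer_b = [f"doc-{n}" for n in range(400, 1000)]

merged = merge(sketch(peer_a), sketch(peer_b))
print("estimated distinct documents:", round(estimate(merged)))
```

Merging the two peer sketches yields exactly the sketch of the combined collection, which is what makes the count insensitive to overlap between peers.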
Secure Selecticast for collaborative intrusion detection systems
In Workshop on Distributed Event-Based Systems, 2002
Cited by 11 (3 self)
The problem domain of Collaborative Intrusion Detection Systems (CIDS) introduces distinctive data routing challenges, which we show are solvable through a sufficiently flexible publish-subscribe system. CIDS share intrusion detection data among organizations, usually to predict impending attacks earlier and more accurately, e.g., from Internet worms that tend to attack many sites at once. CIDS participants collect lists of suspect IP addresses, and want to be notified if others are suspicious of the same addresses. The matching must be done efficiently and anonymously, as most organizations are reluctant to share potentially revealing information about their networks. Alerts regarding external probes should only be visible to other CIDS participants experiencing probes from the same source(s). We term this type of simultaneous publish/subscribe “selecticast.” We present a potential solution using the secure Bloom filter data structure propagated over the MEET publish-subscribe framework.
A novel exact active statistics counter architecture
In Proc. ANCS
Cited by 10 (4 self)
In this paper, we present an exact active statistics ...
Retouched Bloom filters: allowing networked applications to trade off selected false positives against false negatives
In Proc. CoNEXT, 2006
Cited by 10 (2 self)
Where distributed agents must share voluminous set membership information, Bloom filters provide a compact, though lossy, way for them to do so. Numerous recent networking papers have examined the tradeoffs between the bandwidth consumed by the transmission of Bloom filters, and the error rate, which takes the form of false positives, and which rises the more the filters are compressed. In this paper, we introduce the retouched Bloom filter (RBF), an extension that makes the Bloom filter more flexible by permitting the removal of false positives, at the expense of introducing false negatives, and that allows a controlled tradeoff between the two. We analytically show that RBFs created through a random process maintain an overall error rate, expressed as a combination of the false positive rate and the false negative rate, that is equal to the false positive rate of the corresponding Bloom filters. We further provide computationally inexpensive heuristics that decrease the false positive rate more than the corresponding increase in the false negative rate, when creating RBFs. Finally, we demonstrate the advantages of an RBF over a Bloom filter in a distributed network topology measurement application, where information about large stop sets must be shared among route tracing monitors.
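The trade-off between false positives and false negatives can be sketched as follows, assuming the simplest possible retouching rule: to remove a known false positive, clear the first of the bits it hashes to. Any stored key that shares that bit then becomes a false negative. The paper's heuristics choose the bit more carefully. The class name, the `retouch` method, the hashing scheme, and the sizes below are illustrative assumptions.

```python
import hashlib

class RetouchedBloomFilter:
    """Bloom filter that allows clearing bits to remove chosen false positives."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k positions by salting one cryptographic hash (an assumption).
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

    def retouch(self, item):
        """Clear one bit of item so that it is no longer reported as present."""
        self.bits[self._positions(item)[0]] = False

members = [f"member-{n}" for n in range(20)]
rbf = RetouchedBloomFilter()
for key in members:
    rbf.add(key)

# Look for a key that was never inserted but is reported as present.
probes = (f"probe-{n}" for n in range(10000))
false_positive = next((p for p in probes if p in rbf), None)
if false_positive is not None:
    rbf.retouch(false_positive)
    false_negatives = [key for key in members if key not in rbf]
    print("removed", false_positive, "and introduced", len(false_negatives), "false negatives")
```

Whether removing a given false positive is worthwhile depends on how many stored keys share the cleared bit, which is the quantity a selective heuristic would try to minimise.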
Scavenger: A New Last Level Cache Architecture with Global Block Priority
In Proceedings of the 40th International Symposium on Microarchitecture, 2007
Cited by 9 (0 self)
Addresses suffering from cache misses typically exhibit repetitive patterns due to the temporal locality inherent in the access stream. However, we observe that the number of intervening misses at the last-level cache between the eviction of a particular block and its reuse can be very large, preventing traditional victim caching mechanisms from exploiting this repeating behavior. In this paper, we present Scavenger, a new architecture for last-level caches. Scavenger divides the total storage budget into a conventional cache and a novel victim file architecture, which employs a skewed Bloom filter in conjunction with a pipelined priority heap to identify and retain the blocks that most frequently missed in the conventional part of the cache in the recent past. When compared against a baseline configuration with a 1MB 8-way L2 cache, a Scavenger configuration with a 512kB 8-way conventional cache and a 512kB victim file achieves an IPC improvement of up to 63% and on average (geometric mean) 14.2% for nine memory-bound SPEC 2000 applications. On a larger set of sixteen SPEC 2000 applications, Scavenger achieves an average speedup of 8%.
Rank-indexed hashing: A compact construction of Bloom filters and variants
Cited by 9 (1 self)
The Bloom filter and its variants have found widespread use in many networking applications. For these applications, minimizing storage cost is paramount as these filters often need to be implemented using scarce and costly (on-chip) SRAM. Besides supporting membership queries, Bloom filters have been generalized to support deletions and the encoding of information. Although a standard Bloom filter construction has proven to be extremely space-efficient, it is unnecessarily costly when generalized. Alternative constructions based on storing fingerprints in hash tables have been proposed that offer the same functionality as some Bloom filter variants, but using less space. In this paper, we propose a new fingerprint hash table construction called Rank-Indexed Hashing that can achieve very compact representations. A rank-indexed hashing construction that offers the same functionality as a counting Bloom filter can be achieved with a factor of three or more in space savings even for a false positive probability of just 1%. Even for a basic Bloom filter function that only supports membership queries, a rank-indexed hashing construction requires less space for a false positive probability as high as 0.1%, which is significant since a standard Bloom filter construction is widely regarded as extremely space-efficient for approximate membership problems.