Results 1  10
of
13
What’s hot and what’s not: Tracking most frequent items dynamically
 In Proceedings of ACM Principles of Database Systems
, 2003
"... Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, endbiased histograms keep the hot items as part of ..."
Abstract

Cited by 173 (13 self)
 Add to MetaCart
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, endbiased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and when required, quickly output all hot items, without rescanning the relation in the database. With userspecified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing”. They are simple to implement, and have provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
Increasing internet capacity using local search
 Computational Optimization and Applications
, 2004
"... but often the main goal is to avoid congestion, i.e. overloading of links, and the standard heuristic recommended by Cisco (a major router vendor) is to make the weight of a link inversely proportional to its capacity. We study the problem of optimizing OSPF weights for a given a set of projected de ..."
Abstract

Cited by 70 (8 self)
 Add to MetaCart
but often the main goal is to avoid congestion, i.e. overloading of links, and the standard heuristic recommended by Cisco (a major router vendor) is to make the weight of a link inversely proportional to its capacity. We study the problem of optimizing OSPF weights for a given a set of projected demands so as to avoid congestion. We show this problem is NPhard and propose a local search heuristic to solve it. We also provide worstcase results about the performance of OSPF routing vs. an optimal multicommodity flow routing. Our numerical experiments compare the results obtained with our local search heuristic to the optimal multicommodity flow routing, as well as simple and commonly used heuristics for setting the weights. Experiments were done with a proposed nextgeneration AT&T WorldNet backbone as well as synthetic internetworks.
Finding Frequent Items in Data Streams
 PVLDB
, 2008
"... The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, ..."
Abstract

Cited by 38 (5 self)
 Add to MetaCart
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
External perfect hashing for very large key sets
 In Proceedings of the 16th ACM Conference on Information and Knowledge Management (CIKM’07
, 2007
"... A perfect hash function (PHF) h: S → [0, m − 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functio ..."
Abstract

Cited by 14 (2 self)
 Add to MetaCart
A perfect hash function (PHF) h: S → [0, m − 1] for a key set S ⊆ U of size n, where m ≥ n and U is a key universe, is an injective function that maps the keys of S to unique values. A minimal perfect hash function (MPHF) is a PHF with m = n, the smallest possible range. Minimal perfect hash functions are widely used for memory efficient storage and fast retrieval of items from static sets. In this paper we present a distributed and parallel version of a simple, highly scalable and nearspace optimal perfect hashing algorithm for very large key sets, recently presented in [4]. The sequential implementation of the algorithm constructs a MPHF for a set of 1.024 billion URLs of average length 64 bytes collected from the Web in approximately 50 minutes using a commodity PC. The parallel implementation proposed here presents the following performance using 14 commodity PCs: (i) it constructs a MPHF for the same set of 1.024 billion URLs in approximately 4 minutes; (ii) it constructs a MPHF for a set of 14.336 billion 16byte random integers in approximately 50 minutes with a performance degradation of 20%; (iii) one version of the parallel algorithm distributes the description of the MPHF among the participating machines and its evaluation is done in a distributed way, faster than the centralized function.
Practical Verified Computation with Streaming Interactive Proofs
"... When delegating computation to a service provider, as in the cloud computing paradigm, we seek some reassurance that the output is correct and complete. Yet recomputing the output as a check is inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore ..."
Abstract

Cited by 14 (2 self)
 Add to MetaCart
When delegating computation to a service provider, as in the cloud computing paradigm, we seek some reassurance that the output is correct and complete. Yet recomputing the output as a check is inefficient and expensive, and it may not even be feasible to store all the data locally. We are therefore interested in what can be validated by a streaming (sublinear space) user, who cannot store the full input, or perform the full computation herself. Our aim in this work is to advance a recent line of work on “proof systems ” in which the service provider proves the correctness of its output to a user. The goal is to minimize the time and space costs of both parties in generating and checking the proof. Only very recently have there been attempts to implement such proof systems, and thus far these have been quite limited in
Efficient hash probes on modern processors
 In Proceedings of the 23nd International Conference on Data Engineering
, 2007
"... Bucketized versions of Cuckoo hashing can achieve 95– 99 % occupancy, without any space overhead for pointers or other structures. However, such methods typically need to consult multiple hash buckets per probe, and have therefore been seen as having worse probe performance than conventional techniq ..."
Abstract

Cited by 11 (0 self)
 Add to MetaCart
Bucketized versions of Cuckoo hashing can achieve 95– 99 % occupancy, without any space overhead for pointers or other structures. However, such methods typically need to consult multiple hash buckets per probe, and have therefore been seen as having worse probe performance than conventional techniques for large tables. We consider workloads typical of database and stream processing, in which keys and payloads are small, and in which a large number of probes are processed in bulk. We show how to improve probe performance by (a) eliminating branch instructions from the probe code, enabling better scheduling and latencyhiding by modern processors, and (b) using SIMD instructions to process multiple keys/payloads in parallel. We show that on modern architectures, probes to a bucketized Cuckoo hash table can be processed much faster than conventional hash table probes, for both small and large memoryresident tables. On a Pentium 4, a probe is two to four times faster, while on the Cell SPE processor a probe is ten times faster. 1
Methods for Finding Frequent Items in Data Streams
 THE VLDB JOURNAL
, 2009
"... The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent item ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
Small Synopses for GroupBy Query Verification on Outsourced Data Streams
"... This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version. ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
This is a preliminary release of an article accepted by ACM Transactions on Database Systems. The definitive version is currently in production at ACM and, when released, will supersede this version.
Tabulation Based 5Universal Hashing and Linear Probing
"... Previously [SODA’04] we devised the fastest known algorithm for 4universal hashing. The hashing was based on small precomputed4universal tables. This led to a fivefold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the precomputed tables a ..."
Abstract

Cited by 3 (3 self)
 Add to MetaCart
Previously [SODA’04] we devised the fastest known algorithm for 4universal hashing. The hashing was based on small precomputed4universal tables. This led to a fivefold improvement in speed over direct methods based on degree 3 polynomials. In this paper, we show that if the precomputed tables are made 5universal, then the hash value becomes 5universal without any other change to the computation. Relatively this leads to even bigger gains since the direct methods for 5universal hashing use degree 4 polynomials. Experimentally, we find that our method can gain up to an order of magnitude in speed over direct 5universal hashing. Some of the most popular randomized algorithms have been proved to have the desired expected running time using
Approximating Data with the CountMin Data Structure
, 2011
"... Algorithmic problems such as tracking the contents of a set arise frequently in the course of building systems. Given the variety of possible solutions, the choice of appropriate data structures for such tasks is at the heart of building efficient and effective software. Modern languages are now aug ..."
Abstract
 Add to MetaCart
Algorithmic problems such as tracking the contents of a set arise frequently in the course of building systems. Given the variety of possible solutions, the choice of appropriate data structures for such tasks is at the heart of building efficient and effective software. Modern languages are now augmented with large libraries of algorithms and data structures to help the programmer without