What’s hot and what’s not: Tracking most frequent items dynamically
In Proceedings of ACM Principles of Database Systems, 2003
Cited by 174 (14 self)
Abstract:
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small-space data structures that monitor the transactions on the relation and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing”. They are simple to implement, and have provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
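For illustration, the bit-counter idea underlying this kind of group-testing approach can be sketched as follows, specialized to tracking a single majority item under both inserts and deletes (the paper generalizes to many hot items via groupings; the class and method names here are illustrative, not the authors' code):

```python
class MajorityBitSketch:
    """Tracks a candidate majority item under inserts *and* deletes.

    Keeps one counter per bit position: counter j counts how many
    currently live items have bit j set. If some item occupies a
    strict majority of the multiset, then bit j of that item is set
    exactly when counters[j] > total / 2, so the item can be
    reconstructed without rescanning the relation.
    """

    def __init__(self, bits=32):
        self.bits = bits
        self.total = 0
        self.counters = [0] * bits

    def insert(self, x):
        self.total += 1
        for j in range(self.bits):
            if (x >> j) & 1:
                self.counters[j] += 1

    def delete(self, x):
        self.total -= 1
        for j in range(self.bits):
            if (x >> j) & 1:
                self.counters[j] -= 1

    def candidate(self):
        """Reconstruct the majority item bit by bit (arbitrary if none exists)."""
        x = 0
        for j in range(self.bits):
            if 2 * self.counters[j] > self.total:
                x |= 1 << j
        return x
```

Because every update is reversible, deletions are handled symmetrically to insertions, which is exactly what the earlier counter-based schemes could not do.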
Approximate counts and quantiles over sliding windows
In PODS, 2004
Cited by 84 (1 self)
Abstract:
We consider the problem of maintaining ε-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O((1/ε) polylog(1/ε, N)) space. For quantiles, this space requirement is an improvement over the previous best bound of O((1/ε²) polylog(1/ε, N)). We believe that no previous work on space-efficient approximate counts over sliding windows exists.
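For contrast with the sublinear-space algorithms described above, here is a naive exact baseline for the fixed-size window that simply stores all N elements, i.e. uses O(N) space (names are illustrative; this is the baseline the paper improves on, not the paper's algorithm):

```python
from collections import Counter, deque


class ExactWindowCounts:
    """Exact element counts over a fixed-size sliding window.

    Stores every element currently in the window, so space is O(N);
    the paper's algorithms approximate these counts to within eps*N
    using only O((1/eps) polylog(1/eps, N)) space.
    """

    def __init__(self, window_size):
        self.window_size = window_size
        self.window = deque()
        self.counts = Counter()

    def append(self, x):
        """Slide the window forward by one element."""
        self.window.append(x)
        self.counts[x] += 1
        if len(self.window) > self.window_size:
            old = self.window.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def count(self, x):
        return self.counts[x]
```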
A nearoptimal algorithm for computing the entropy of a stream
In ACM-SIAM Symposium on Discrete Algorithms, 2007
Cited by 54 (20 self)
Abstract:
We describe a simple algorithm for approximating the empirical entropy of a stream of m values in a single pass, using O(ε⁻² log(δ⁻¹) log m) words of space. Our algorithm is based upon a novel extension of a method introduced by Alon, Matias, and Szegedy [1]. We show a space lower bound of Ω(ε⁻² / log(ε⁻¹)), meaning that our algorithm is near-optimal in terms of its dependency on ε. This improves over previous work on this problem [8, 13, 17, 5]. We show that generalizing to kth-order entropy requires close to linear space for all k ≥ 1, and give additive approximations using our algorithm. Lastly, we show how to compute a multiplicative approximation to the entropy of a random walk on an undirected graph.
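The quantity being approximated is the empirical entropy H = −Σᵢ (mᵢ/m) log₂(mᵢ/m), where mᵢ is the frequency of value i. An exact baseline, using space linear in the number of distinct values (which the streaming algorithm above avoids), looks like this:

```python
import math
from collections import Counter


def empirical_entropy(stream):
    """Exact empirical entropy H = -sum_i (m_i / m) * log2(m_i / m).

    Requires one counter per distinct value, so its space grows with
    the domain; the single-pass algorithm above approximates H using
    only O(eps^-2 log(delta^-1) log m) words.
    """
    counts = Counter(stream)
    m = sum(counts.values())
    return -sum((c / m) * math.log2(c / m) for c in counts.values())
```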
Efficient Computation of Frequent and Top-k Elements in Data Streams
In ICDT, 2005
Cited by 47 (6 self)
Abstract:
We propose an approximate integrated approach for solving both problems of finding the most popular k elements, and finding frequent elements in a data stream coming from a large domain. Our solution is space efficient and reports both frequent and top-k elements with tight guarantees on errors. For general data distributions, our top-k algorithm returns k elements that have roughly the highest frequencies; and it uses limited space for calculating frequent elements. For realistic Zipfian data, the space requirement of the proposed algorithm for solving the exact frequent elements problem decreases dramatically with the parameter of the distribution; and for top-k queries, the analysis ensures that only the top-k elements, in the correct order, are reported. The experiments, using real and synthetic data sets, show space reductions with no loss in accuracy. Having proved the effectiveness of the proposed approach through both analysis and experiments, we extend it to be able to answer continuous queries about frequent and top-k elements. Although the problems of incremental reporting of frequent and top-k elements are useful in many applications, to the best of our knowledge, no solution has been proposed.
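The counter-replacement scheme at the core of this line of work, widely known as Space-Saving, can be sketched as follows (a minimal sketch of the counting idea only, without the paper's stream-summary ordering structure or its top-k guarantee machinery):

```python
def space_saving(stream, k):
    """Space-Saving sketch: maintain at most k (item, count) pairs.

    When a new item arrives and all k counters are occupied, evict
    the item with the minimum count and let the new item inherit
    that count plus one. Any item with true frequency greater than
    m/k (m = stream length) is guaranteed to be among the counters,
    and each count overestimates the truth by at most the evicted
    minimum.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k:
            counters[x] = 1
        else:
            victim = min(counters, key=counters.get)
            cmin = counters.pop(victim)
            counters[x] = cmin + 1  # inherit the evicted count
    return counters
```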
Efficient External-Memory Data Structures and Applications
1996
Cited by 38 (12 self)
Abstract:
In this thesis we study the Input/Output (I/O) complexity of large-scale problems arising e.g. in the areas of database systems, geographic information systems, VLSI design systems and computer graphics, and design I/O-efficient algorithms for them. A general theme in our work is to design I/O-efficient algorithms through the design of I/O-efficient data structures. One of our philosophies is to try to isolate all the I/O-specific parts of an algorithm in the data structures, that is, to try to design I/O algorithms from internal-memory algorithms by exchanging the data structures used in internal memory with their external-memory counterparts. The results in the thesis include a technique for transforming an internal-memory tree data structure into an external data structure which can be used in a batched dynamic setting, that is, a setting where, for example, we do not require that the result of a search operation be returned immediately. Using this technique we develop batched dynamic external versions of the (one-dimensional) range tree and the segment tree, and we develop an external priority queue. Following our general philosophy we show how these structures can be used in standard internal-memory sorting algorithms.
MJRTY - A Fast Majority Vote Algorithm
1982
Cited by 37 (7 self)
Abstract:
A new algorithm is presented for determining which, if any, of an arbitrary number of candidates has received a majority of the votes cast in an election. The number of comparisons required is at most twice the number of votes. Furthermore, the algorithm uses storage in a way that permits an efficient use of magnetic tape. A Fortran version of the algorithm is exhibited. The Fortran code has been proved correct by a mechanical verification system for Fortran. The system and the proof are discussed. (Robert S. Boyer and J Strother Moore. The work described here was conducted in the Computer Science Laboratory of SRI International and supported in part by NASA Contract NAS1-15528, NSF Grant MCS-7904081, and ONR Contract N00014-75-C-0816.)
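The pairing idea behind MJRTY (cancel one vote for the current candidate against each disagreeing vote) can be sketched in Python; the original, as the abstract notes, was verified Fortran:

```python
def mjrty(votes):
    """Boyer-Moore majority vote: one candidate, one counter.

    First pass pairs off disagreeing votes, so a true majority, if
    one exists, survives as the candidate; a second pass verifies
    it. Uses O(1) extra space and at most about 2n comparisons,
    matching the bound stated in the abstract.
    """
    candidate, count = None, 0
    for v in votes:
        if count == 0:
            candidate, count = v, 1
        elif v == candidate:
            count += 1
        else:
            count -= 1
    # Verification pass: the surviving candidate may not be a majority.
    if candidate is not None and votes.count(candidate) * 2 > len(votes):
        return candidate
    return None
```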
Finding Frequent Items in Data Streams
PVLDB, 2008
Cited by 37 (5 self)
Abstract:
The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
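One of the classic counter-based algorithms such surveys compare, the Misra-Gries "Frequent" summary, can be sketched as follows (a minimal sketch, not one of the paper's tuned baseline implementations):

```python
def misra_gries(stream, k):
    """Misra-Gries 'Frequent' summary with at most k - 1 counters.

    When a new item arrives and all counters are occupied, decrement
    every counter and discard those that reach zero. Any item
    occurring more than m/k times in a stream of length m is
    guaranteed to survive, and each kept count underestimates the
    true frequency by at most m/k.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```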
A General Lower Bound on the I/O-Complexity of Comparison-based Algorithms
In Proc. Workshop on Algorithms and Data Structures, LNCS 709, 1993
Cited by 32 (11 self)
Abstract:
We show a general relationship between the number of comparisons and the number of I/O-operations needed to solve a given problem. This relationship enables one to show lower bounds on the number of I/O-operations needed to solve a problem whenever a lower bound on the number of comparisons is known. We use the result to show lower bounds on the I/O-complexity of a number of problems where known techniques only give trivial bounds. Among these are the problem of removing duplicates from a multiset, a problem of great importance in e.g. relational database systems, and the problem of determining the mode (the most frequently occurring element) of a multiset. We develop algorithms for these problems in order to show that the lower bounds are tight.
A Simpler and More Efficient Deterministic Scheme for Finding Frequent Items over Sliding Windows
In Proc. ACM SIGMOD-SIGACT-SIGART Symp. on Princ. of Database Sys. (PODS), 2006
Cited by 24 (3 self)
Abstract:
In this paper, we give a simple scheme for identifying ε-approximate frequent items over a sliding window of size n. Our scheme is deterministic and does not make any assumption on the distribution of the item frequencies. It supports O(1/ε) update and query time, and uses O(1/ε) space. It is very simple; its main data structures are just a few short queues whose entries store the positions of some items in the sliding window. We also extend our scheme to variable-size windows. This extended scheme uses O((1/ε) log(εn)) space.
A Fast Majority Vote Algorithm
In Automated Reasoning: Essays in Honor of Woody Bledsoe, 1981
Cited by 23 (5 self)
Abstract:
A new algorithm is presented for determining which, if any, of an arbitrary number of candidates has received a majority of the votes cast in an election. The number of comparisons required is at most twice the number of votes. Furthermore, the algorithm uses storage in a way that permits an efficient use of magnetic tape.