## What’s hot and what’s not: Tracking most frequent items dynamically (2003)

Venue: Proceedings of the ACM Symposium on Principles of Database Systems (PODS)

Citations: 174 (14 self)

### BibTeX

```bibtex
@INPROCEEDINGS{Cormode03whatshot,
  author    = {Graham Cormode and S. Muthukrishnan},
  title     = {What's hot and what's not: Tracking most frequent items dynamically},
  booktitle = {Proceedings of the ACM Symposium on Principles of Database Systems (PODS)},
  year      = {2003},
  pages     = {296--306}
}
```

### Abstract

Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small-space data structures that monitor the transactions on the relation and, when required, quickly output all hot items without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing”. They are simple to implement, and have provable quality, space, and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees cannot handle deletions, and those that handle deletions cannot make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
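The abstract's core trick, counters that survive deletions, is easiest to see in the special case k = 1: keep one total counter plus one counter per bit position, incrementing on inserts and decrementing on deletes. If some item holds a strict majority, its binary representation can be read off the bit counters. A minimal sketch (the class name and the fixed 32-bit universe are illustrative assumptions, not the paper's notation):

```python
class MajorityTracker:
    BITS = 32  # assumed upper bound on log2 of the item universe

    def __init__(self):
        self.total = 0
        self.bit = [0] * self.BITS

    def update(self, x, delta):
        """delta = +1 for an insert of x, -1 for a delete."""
        self.total += delta
        for j in range(self.BITS):
            if (x >> j) & 1:
                self.bit[j] += delta

    def candidate(self):
        """If some item holds a strict majority, this returns it."""
        x = 0
        for j in range(self.BITS):
            if 2 * self.bit[j] > self.total:
                x |= 1 << j
        return x

t = MajorityTracker()
for v in [5, 9, 5, 5, 9]:
    t.update(v, +1)
t.update(9, -1)          # after one deletion, 5 holds 3 of 4 occurrences
print(t.candidate())     # -> 5
```

Roughly speaking, the paper's general method for up to k hot items applies this same bit-counter decoding inside hash-selected groups of items, which is where the "group testing" framing comes from.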

### Citations

1864 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...g(1/δ)) Ω(k/ε² log n) [Charikar et al. 2002] Randomized (MC) Table I. Summary of previous results on insert-only methods. LV (Las Vegas) and MC (Monte Carlo) are types of randomized algorithms. See [Motwani and Raghavan 1995] for details. output all hot items with probability at least 1 − δ, for some constant δ, must also use Ω(m) space. This follows by observing that the above reduction corresponds to the Index problem ... |

703 | Data Structures and Algorithms
- Aho, Hopcroft, et al.
- 1985
Citation Context ...RELIMINARIES If one is allowed O(m) space, then a simple heap data structure will process each insert or delete operation in O(log m) time and find the hot items in O(k log m) time in the worst case [Aho et al. 1987]. Our focus here is on algorithms that only maintain a summary data structure, that is, one that uses sublinear space as it monitors inserts and deletes to the data. In a fundamental paper, Alon, Mat... |

700 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999
Citation Context ...is area studies related, relaxed versions of the problems. For example, finding hot items, that is, items each of which has frequency above 1/(k + 1), is one such related problem. The lower bound of [Alon et al. 1996] does not directly apply to this problem. But a simple information theory argument suffices to show that solving this problem exactly requires the storage of a large amount of information if we give ... |

667 | Universal classes of hash functions
- Carter, Wegman
- 1979
Citation Context ...function is defined by a and b, which are integers less than P. P itself is chosen to be O(m), and so the space required to represent each hash function is O(log m) bits. Fact 3.2 (Proposition 7 of [Carter and Wegman 1979]): Over all choices of a and b, for x ≠ y, Pr[f_{a,b}(x) = f_{a,b}(y)] ≤ 1/W. We can now describe the data structures that we will keep in order to allow us to find up to k hot items. Non-Adaptive Group Tes... |
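Fact 3.2 is concrete enough to sketch: the family f_{a,b}(x) = ((a·x + b) mod P) mod W, for random a and b less than a prime P, gives pairwise collision probability at most about 1/W. A small illustration (the particular prime P and width W below are arbitrary choices for the example, not the paper's parameters):

```python
import random

P = 2_147_483_647        # a Mersenne prime, assumed >= the item universe
W = 512                  # number of hash buckets

def make_hash(rng):
    """Draw one member f_{a,b} of the Carter-Wegman-style family."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % W

rng = random.Random(42)
f = make_hash(rng)
print(0 <= f(12345) < W)   # every value lands in a bucket 0..W-1
```

Since a and b each fit in O(log P) = O(log m) bits, storing a hash function really does take only O(log m) space, as the excerpt states.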

374 | Data streams: algorithms and applications - Muthukrishnan - 2005 |

330 | Approximate Frequency Counts Over Data Streams - Manku, Motwani |

293 | Improved data stream summaries: The count-min sketch and its applications - Cormode, Muthukrishnan - 2005 |

259 | Finding frequent items in data streams
- Charikar, Chen, et al.
- 2002
Citation Context ...e do not focus on this problem, since there are many existing solutions which can be applied to the problem of, given x, estimate nx, in the presence of insertions and deletions [Gilbert et al. 2002; Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, we observe that for the solutions we propose, no additional storage is needed, since the information needed to make estimates of the count of items is alre... |

204 | New sampling-based summary statistics for improving approximate query answers
- Gibbons, Matias
- 1998
Citation Context ...analysis of how it can be used as part of a solution to the hot items problem. Insert and Delete Algorithms. Previous work that studied hot items in the presence of both inserts and deletes is sparse [Gibbons and Matias 1998; 1999]. These papers propose methods to maintain a sample and a count of the times the sample appears in the data set, and focus on the harder problem of monitoring the k most frequent items. These methods ... |

170 | Distributed top-k monitoring - Babcock, Olston - 2003 |

157 | Fast incremental maintenance of approximate histograms
- Gibbons, Matias, et al.
- 1997
Citation Context ...vably for the insert-only case, but provide no guarantees for the fully dynamic case with deletions. However, the authors study how effective these samples are for the deletion case through experiments. [Gibbons et al. 1997] presents methods to maintain various histograms in the presence of inserts and deletes using a “backing sample”, but these methods too need access to a large portion of the data periodically in the presence... |

145 | Frequency estimation of internet packet streams with limited space - Demaine, López-Ortiz, et al. - 2002 |

142 | A Simple Algorithm for Finding Frequent Elements in Streams and Bags. Available from http://www.cs.berkeley.edu/~christos/iceberg.ps
- Karp, Papadimitriou, et al.
- 2002
Citation Context ... giving a warning sign if this pattern begins to change unexpectedly. This has been studied extensively in the context of anomaly detection [Barbara et al. 2001; Demaine et al. 2002; Gilbert et al. 2001; Karp et al. 2003]. Our focus in this paper is on dynamically maintaining hot items in the presence of delete and insert transactions. In many of the motivating applications above, the underlying data distribution cha... |

135 | Computing Iceberg Queries Efficiently
- Fang, Shivakumar, et al.
- 1998
Citation Context ...are hot when they reach the filter, it cannot retrieve items from the past which have since become frequent. The earliest filter method appears to be due to [Fang et al. 1998], where it is used in the context of iceberg queries. The authors advocate a second pass over the data to count exactly those items which passed the filter. A paper which has stimulated interest in fi... |

134 | Balancing histogram optimality and practicality for query result size estimation
- Ioannidis, Poosala
- 1995
Citation Context ...This gives a useful measure of the skew of the data. High-biased and end-biased histograms [Ioannidis and Christodoulakis 1993; Ioannidis and Poosala 1995] specifically focus on hot items to summarize data distributions for selectivity estimation. Iceberg queries generalize the notion of hot items in the relation to aggregate functions over an attribut... |

115 | Nonrandom binary superimposed codes - Kautz, Singleton - 1964 |

106 | Tracking join and self-join sizes in limited storage
- Alon, Gibbons, et al.
- 1999
Citation Context ...t oracle, which can be updated to reflect the arrival or departure of any item. We now list three examples of these and give their space and update time bounds: — The “tug of war sketch” technique of [Alon et al. 1999] uses space and time O(1/ε² log(1/δ)) to approximate any count up to εn with probability at least 1 − δ. — The method of Random Subset Sums described in [Gilbert et al. 2002] uses space and time O(1... |
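The count oracle described in this excerpt can be sketched as follows. This is a toy variant in the spirit of the tug-of-war idea, not the construction of Alon et al.: each copy maintains Z = Σ_x n_x·s(x) for a random sign function s, so Z·s(y) is an unbiased estimate of n_y, and taking a median over independent copies controls the error. The seeded-PRNG sign function below is an illustrative stand-in for the 4-wise independent hash the real sketch requires:

```python
import random
from statistics import median

class TugOfWarCounter:
    """Toy tug-of-war-style count oracle supporting inserts and deletes."""

    def __init__(self, copies=64, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(2**32) for _ in range(copies)]
        self.z = [0] * copies          # one running sum Z per copy

    def _sign(self, i, x):
        # stand-in for a 4-wise independent sign function s_i(x) in {-1, +1}
        r = random.Random(self.seeds[i] ^ (x * 0x9E3779B1))
        return 1 if r.random() < 0.5 else -1

    def update(self, x, delta):
        """delta = +1 for an insert of x, -1 for a delete."""
        for i in range(len(self.z)):
            self.z[i] += delta * self._sign(i, x)

    def estimate(self, y):
        """Median over copies of the unbiased estimate Z_i * s_i(y)."""
        return median(self.z[i] * self._sign(i, y) for i in range(len(self.z)))

c = TugOfWarCounter()
for _ in range(40):
    c.update(7, +1)
for _ in range(5):
    c.update(3, +1)
print(abs(c.estimate(7) - 40) <= 5)   # within the cross-talk from item 3
```

The error of each copy's estimate is governed by the second frequency moment of the other items, which is why the space bound in the excerpt carries the 1/ε² factor.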

105 | small-space algorithms for approximate histogram maintenance - Fast - 2002 |

104 | How to summarize the universe: Dynamic maintenance of quantiles
- Gilbert, Kotidis, et al.
- 2002
Citation Context ... nx of these items. We do not focus on this problem, since there are many existing solutions which can be applied to the problem of, given x, estimate nx, in the presence of insertions and deletions [Gilbert et al. 2002; Charikar et al. 2002; Cormode and Muthukrishnan 2004a]. However, we observe that for the solutions we propose, no additional storage is needed, since the information needed to make estimates of the ... |

98 | Querying and mining data streams: You only get one look - GAROFALAKIS, GEHRKE, et al. |

81 | Finding repeated elements
- Misra, Gries
- 1982
Citation Context ...

| Algorithm | Type | Time per item | Space |
| --- | --- | --- | --- |
| Lossy Counting [Manku and Motwani 2002] | Deterministic | O(log(n/k)) amortized | Ω(k log(n/k)) |
| Misra-Gries [Misra and Gries 1982] | Deterministic | O(log k) amortized | O(k) |
| Frequent [Demaine et al. 2002] | Randomized (LV) | O(1) expected | O(k) |
| Count Sketch [Charikar et al. 2002] | Approximate, Randomized (MC) | O(log(1/δ)) | Ω(k/ε² log n) |

Table I. Summary of previous results on ... |
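The Misra-Gries row of Table I can be sketched in a few lines: k−1 counters, all decremented together when a new item arrives and no counter is free, guarantee that any item with frequency above n/k still holds a counter at the end. This applies to insert-only streams, and the stored counts are underestimates:

```python
def misra_gries(stream, k):
    """Return a summary of at most k-1 counters; every item with
    frequency > n/k is guaranteed to appear in it (insert-only)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            for y in list(counters):   # decrement all; evict zeroed counters
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters

stream = [1, 1, 2, 1, 3, 1, 4, 1]
print(misra_gries(stream, 3))   # item 1 (frequency 5/8 > 1/3) survives
```

Each decrement step cancels k distinct stream elements against each other, so an item occurring more than n/k times can never be cancelled away entirely, which is the O(k)-space guarantee the table cites.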

72 | Detecting novel network intrusions using Bayes estimators
- Barbara, Wu, et al.
Citation Context ...g the most traffic allows management of the network, as well as giving a warning sign if this pattern begins to change unexpectedly. This has been studied extensively in the context of anomaly detection [Barbara et al. 2001; Demaine et al. 2002; Gilbert et al. 2001; Karp et al. 2003]. Our focus in this paper is on dynamically maintaining hot items in the presence of delete and insert transactions. In many of the motivat... |

67 | What’s New: Finding Significant Differences in Network Data Streams - Cormode, Muthukrishnan - 2005 |

66 | Online data mining for co-evolving time sequences - Yi, Sidiropoulos, et al. - 2000 |

55 | Optimal histograms for limiting worst-case error propagation in the size of join results - Ioannidis, Christodoulakis - 1993 |

46 | QuickSAND: Quick summary and analysis of network data
- Gilbert, Kotidis, et al.
Citation Context ...e network, as well as giving a warning sign if this pattern begins to change unexpectedly. This has been studied extensively in the context of anomaly detection [Barbara et al. 2001; Demaine et al. 2002; Gilbert et al. 2001; Karp et al. 2003]. Our focus in this paper is on dynamically maintaining hot items in the presence of delete and insert transactions. In many of the motivating applications above, the underlying dat... |

38 | Combinatorial Group Testing and its Applications - Du, Hwang - 2000 |

37 | Fast, small-space algorithms for approximate histogram maintenance - Gilbert, Guha, et al. - 2002 |

23 | A fast majority vote algorithm
- Boyer, Moore
- 1981
Citation Context ...ed are summarized in Table I. Insert-only Algorithms with Item Counts. The earliest work on finding frequent items considered the problem of finding an item which occurred more than half of the time [Boyer and Moore 1982; Fischer and Salzberg 1982]. This procedure can be viewed as a two pass algorithm: after one pass over the data a candidate is found, which is guaranteed to be the majority element if any such elemen... |
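The two-pass procedure this excerpt describes is the classic Boyer-Moore majority vote: pass one can only leave the true majority element (if one exists) as the surviving candidate, and pass two verifies it. A compact sketch:

```python
def majority(items):
    """Return the strict majority element of items, or None."""
    candidate, count = None, 0
    for x in items:                  # pass 1: pair off unequal elements
        if count == 0:
            candidate, count = x, 1
        elif x == candidate:
            count += 1
        else:
            count -= 1
    # pass 2: verify the candidate really exceeds half the occurrences
    if candidate is not None and items.count(candidate) * 2 > len(items):
        return candidate
    return None

print(majority(['a', 'b', 'a', 'a', 'c', 'a']))   # -> a
print(majority(['a', 'b', 'c']))                  # -> None
```

The verification pass is essential: on a stream with no majority element, pass one still emits some candidate, which is why the excerpt calls this a two pass algorithm.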

18 | Even strongly universal hashing is pretty fast - Thorup - 2000 |

11 | Finding a majority among n votes: Solution to problem 81-5
- Fischer, Salzberg
Citation Context ...able I. Insert-only Algorithms with Item Counts. The earliest work on finding frequent items considered the problem of finding an item which occurred more than half of the time [Boyer and Moore 1982; Fischer and Salzberg 1982]. This procedure can be viewed as a two pass algorithm: after one pass over the data a candidate is found, which is guaranteed to be the majority element if any such element exists. A second pass ver... |

7 | Synopsis structures for massive data sets - Gibbons, Matias - 1999 |

1 | New directions in traffic measurement and accounting - Estan, Varghese - 2002 |

1 | How to summarize the universe: Dynamic maintenance of quantiles - Gilbert, Kotidis, et al. - 2002