Results 1 - 10
of
160
An improved data stream summary: The Count-Min sketch and its applications
- J. Algorithms
, 2004
"... Abstract. We introduce a new sublinear space data structure—the Count-Min Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applie ..."
Abstract
-
Cited by 202 (33 self)
- Add to MetaCart
Abstract. We introduce a new sublinear space data structure—the Count-Min Sketch — for summarizing data streams. Our sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly; in addition, it can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc. The time and space bounds we show for using the CM sketch to solve these problems significantly improve those previously known — typically from 1/ε 2 to 1/ε in factor. 1
Comparing top k lists
- In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms
, 2003
"... Motivated by several applications, we introduce various distance measures between “top k lists.” Some of these distance measures are metrics, while others are not. For each of these latter distance measures, we show that they are “almost ” a metric in the following two seemingly unrelated aspects: ( ..."
Abstract
-
Cited by 150 (3 self)
- Add to MetaCart
Motivated by several applications, we introduce various distance measures between “top k lists.” Some of these distance measures are metrics, while others are not. For each of these latter distance measures, we show that they are “almost ” a metric in the following two seemingly unrelated aspects: (i) they satisfy a relaxed version of the polygonal (hence, triangle) inequality, and (ii) there is a metric with positive constant multiples that bound our measure above and below. This is not a coincidence—we show that these two notions of almost being a metric are formally identical. Based on the second notion, we define two distance measures to be equivalent if they are bounded above and below by constant multiples of each other. We thereby identify a large and robust equivalence class of distance measures. Besides the applications to the task of identifying good notions of (dis-)similarity between two top k lists, our results imply polynomial-time constant-factor approximation algorithms for the rank aggregation problem [DKNS01] with respect to a large class of distance measures. To appear in SIAM J. on Discrete Mathematics. Extended abstract to appear in 2003 ACM-SIAM Symposium on Discrete Algorithms (SODA ’03).
What’s hot and what’s not: Tracking most frequent items dynamically
- In Proceedings of ACM Principles of Database Systems
, 2003
"... Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of ..."
Abstract
-
Cited by 139 (12 self)
- Add to MetaCart
Most database management systems maintain statistics on the underlying relation. One of the important statistics is that of the “hot items ” in the relation: those that appear many times (most frequently, or more than some threshold). For example, end-biased histograms keep the hot items as part of the histogram and are used in selectivity estimation. Hot items are used as simple outliers in data mining, and in anomaly detection in many applications. We present new methods for dynamically determining the hot items at any time in a relation which is undergoing deletion operations as well as inserts. Our methods maintain small space data structures that monitor the transactions on the relation, and when required, quickly output all hot items, without rescanning the relation in the database. With user-specified probability, all hot items are correctly reported. Our methods rely on ideas from “group testing”. They are simple to implement, and have provable quality, space and time guarantees. Previously known algorithms for this problem that make similar quality and performance guarantees can not handle deletions, and those that handle deletions can not make similar guarantees without rescanning the database. Our experiments with real and synthetic data show that our algorithms are accurate in dynamically tracking the hot items independent of the rate of insertions and deletions.
Distributed top-k monitoring
- In SIGMOD
, 2003
"... The querying and analysis of data streams has been a topic of much recent interest, motivated by applications from the fields of networking, web usage analysis, sensor instrumentation, telecommunications, and others. Many of these applications involve monitoring answers to continuous queries over da ..."
Abstract
-
Cited by 137 (2 self)
- Add to MetaCart
The querying and analysis of data streams has been a topic of much recent interest, motivated by applications from the fields of networking, web usage analysis, sensor instrumentation, telecommunications, and others. Many of these applications involve monitoring answers to continuous queries over data streams produced at physically distributed locations, and most previous approaches require streams to be transmitted to a single location for centralized processing. Unfortunately, the continual transmission of a large number of rapid data streams to a central location can be impractical or expensive. We study a useful class of queries that continuously report the k largest values obtained from distributed data streams (“top-k monitoring queries”), which are of particular interest because they can be used to reduce the overhead incurred while running other types of monitoring queries. We show that transmitting entire data streams is unnecessary to support these queries and present an alternative approach that reduces communication significantly. In our approach, arithmetic constraints are maintained at remote stream sources to ensure that the most recently provided top-k answer remains valid to within a userspecified error tolerance. Distributed communication is only necessary on occasion, when constraints are violated, and we show empirically through extensive simulation on real-world data that our approach reduces overall communication cost by an order of magnitude compared with alternatives that offer the same error guarantees. 1
Interpreting the Data: Parallel Analysis with Sawzall
- Scientific Programming Journal, Special Issue on Grids and Worldwide Computing Programming Models and Infrastructure
"... Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be ..."
Abstract
-
Cited by 128 (0 self)
- Add to MetaCart
Very large data sets often have a flat but regular structure and span multiple disks and machines. Examples include telephone call records, network logs, and web document repositories. These large data sets are not amenable to study using traditional database techniques, if only because they can be too large to fit in a single relational database. On the other hand, many of the analyses done on them can be expressed using simple, easily distributed computations: filtering, aggregation, extraction of statistics, and so on. We present a system for automating such analyses. A filtering phase, in which a query is expressed using a new procedural programming language, emits data to an aggregation phase. Both phases are distributed over hundreds or even thousands of computers. The results are then collated and saved to a file. The design—including the separation into two phases, the form of the programming language, and the properties of the aggregators—exploits the parallelism inherent in having data and computation distributed across many machines. 1
Frequency estimation of internet packet streams with limited space
- In Proceedings of the 10th Annual European Symposium on Algorithms
, 2002
"... Abstract. We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a re ..."
Abstract
-
Cited by 117 (1 self)
- Add to MetaCart
Abstract. We consider a router on the Internet analyzing the statistical properties of a TCP/IP packet stream. A fundamental difficulty with measuring traffic behavior on the Internet is that there is simply too much data to be recorded for later analysis, on the order of gigabytes a second. As a result, network routers can collect only relatively few statistics about the data. The central problem addressed here is to use the limited memory of routers to determine essential features of the network traffic stream. A particularly difficult and representative subproblem is to determine the top k categories to which the most packets belong, for a desired value of k and for a given notion of categorization such as the destination IP address. We present an algorithm that deterministically finds (in particular) all categories having a frequency above 1/(m + 1) using m counters, which we prove is best possible in the worst case. We also present a sampling-based algorithm for the case that packet categories follow an arbitrary distribution, but their order over time is permuted uniformly at random. Under this model, our algorithm identifies flows above a frequency threshold of roughly 1 / √ nm with high probability, where m is the number of counters and n is the number of packets observed. This guarantee is not far off from the ideal of identifying all flows (probability 1/n), and we prove that it is best possible up to a logarithmic factor. We show that the algorithm ranks the identified flows according to frequency within any desired constant factor of accuracy. 1
Issues in Data Stream Management
, 2003
"... Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require sup ..."
Abstract
-
Cited by 105 (5 self)
- Add to MetaCart
Traditional databases store sets of relatively static records with no pre-defined notion of time, unless timestamp attributes are explicitly added. While this model adequately represents commercial catalogues or repositories of personal information, many current and emerging applications require support for online analysis of rapidly changing data streams. Limitations of traditional DBMSs in supporting streaming applications have been recognized, prompting research to augment existing technologies and build new systems to manage streaming data. The purpose of this paper is to review recent work in data stream management systems, with an emphasis on application requirements, data models, continuous query languages, and query evaluation.
Sketch-based Change Detection: Methods, Evaluation, and Applications
- IN INTERNET MEASUREMENT CONFERENCE
, 2003
"... Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic patt ..."
Abstract
-
Cited by 95 (11 self)
- Add to MetaCart
Traffic anomalies such as failures and attacks are commonplace in today's network, and identifying them rapidly and accurately is critical for large network operators. The detection typically treats the traffic as a collection of flows that need to be examined for significant changes in traffic pattern (e.g., volume, number of connections) . However, as link speeds and the number of flows increase, keeping per-flow state is either too expensive or too slow. We propose building compact summaries of the traffic data using the notion of sketches. We have designed a variant of the sketch data structure, k-ary sketch, which uses a constant, small amount of memory, and has constant per-record update and reconstruction cost. Its linearity property enables us to summarize traffic at various levels. We then implement a variety of time series forecast models (ARIMA, Holt-Winters, etc.) on top of such summaries and detect significant changes by looking for flows with large forecast errors. We also present heuristics for automatically configuring the model parameters. Using a
Approximate counts and quantiles over sliding windows
- In PODS
, 2004
"... We consider the problem of maintaining ɛ-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding win ..."
Abstract
-
Cited by 63 (1 self)
- Add to MetaCart
We consider the problem of maintaining ɛ-approximate counts and quantiles over a stream sliding window using limited space. We consider two types of sliding windows depending on whether the number of elements N in the window is fixed (fixed-size sliding window) or variable (variable-size sliding window). In a fixed-size sliding window, both the ends of the window slide synchronously over the stream. In a variable-size sliding window, an adversary slides the window ends independently, and therefore has the ability to vary the number of elements N in the window. We present various deterministic and randomized algorithms for approximate counts and quantiles. All of our algorithms require O ( 1 1 polylog ( , N)) space. For quantiles, this space ɛ ɛ requirement is an improvement over the previous best bound of O ( 1 ɛ2 polylog ( 1, N)). We believe that no previous work ɛ on space-efficient approximate counts over sliding windows exists. 1.

