Results 1  10
of
102
Models and issues in data stream systems
 IN PODS
, 2002
"... In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, timevarying data streams. In addition to reviewing past work releva ..."
Abstract

Cited by 620 (19 self)
 Add to MetaCart
In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, timevarying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.
Data Streams: Algorithms and Applications
, 2005
"... In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerg ..."
Abstract

Cited by 375 (21 self)
 Add to MetaCart
In the data stream scenario, input arrives very rapidly and there is limited memory to store the input. Algorithms have to work with one or few passes over the data, space less than linear in the input size or time significantly less than the input size. In the past few years, a new theory has emerged for reasoning about algorithms that work within these constraints on space, time, and number of passes. Some of the methods rely on metric embeddings, pseudorandom computations, sparse approximation theory and communication complexity. The applications for this scenario include IP network traffic analysis, mining text message streams and processing massive data sets in general. Researchers in Theoretical Computer Science, Databases, IP Networking and Computer Systems are working on the data stream challenges. This article is an overview and survey of data stream algorithmics and is an updated version of [175].1
Approximate Frequency Counts over Data Streams
 VLDB
, 2002
"... We present algorithms for computing frequency counts exceeding a userspecified threshold over data streams. Our algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a userspecified parameter. Our algorithms can e ..."
Abstract

Cited by 330 (1 self)
 Add to MetaCart
We present algorithms for computing frequency counts exceeding a userspecified threshold over data streams. Our algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a userspecified parameter. Our algorithms can easily be deployed for streams of singleton items like those found in IP network monitoring. We can also handle streams of variable sized sets of items exemplified by a sequence of market basket transactions at a retail store. For such streams, we describe an optimized implementation to compute frequent itemsets in a single pass.
Maintaining Stream Statistics over Sliding Windows (Extended Abstract)
, 2002
"... Mayur Datar Aristides Gionis y Piotr Indyk z Rajeev Motwani x Abstract We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic ..."
Abstract

Cited by 228 (8 self)
 Add to MetaCart
Mayur Datar Aristides Gionis y Piotr Indyk z Rajeev Motwani x Abstract We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic problem: Given a stream of bits, maintain a count of the number of 1's in the last N elements seen from the stream. We show that using O( 1 ffl log 2 N) bits of memory, we can estimate the number of 1's to within a factor of 1 + ffl. We also give a matching lower bound of \Omega\Gamma 1 ffl log 2 N) memory bits for any deterministic or randomized algorithms. We extend our scheme to maintain the sum of the last N positive integers. We provide matching upper and lower bounds for this more general problem as well. We apply our techniques to obtain efficient algorithms for the Lp norms (for p 2 [1; 2]) of vectors under the sliding window model. Using the algorithm for the basic counting problem, one can adapt many other techniques to work for the sliding window model, with a multiplicative overhead of O( 1 ffl log N) in memory and a 1 + ffl factor loss in accuracy. These include maintaining approximate histograms, hash tables, and statistics or aggregates such as sum and averages.
Reductions in Streaming Algorithms, with an Application to Counting Triangles in Graphs
"... We introduce reductions in the streaming model as a tool in the design of streaming algorithms. We develop the concept of listefficient streaming algorithms that are essential to the design of efficient streaming algorithms through reductions. Our results include a suite of listefficient streaming ..."
Abstract

Cited by 116 (5 self)
 Add to MetaCart
We introduce reductions in the streaming model as a tool in the design of streaming algorithms. We develop the concept of listefficient streaming algorithms that are essential to the design of efficient streaming algorithms through reductions. Our results include a suite of listefficient streaming algorithms for basic statistical primitives. Using the reduction paradigm along with these tools, we design streaming algorithms for approximately counting the number of triangles in a graph presented as a stream. A specific highlight of our work is the first algorithm for the number of distinct elements in a data stream that achieves arbitrary approximation factors. (Independently, Trevisan [Tre01] has solved this problem via a different approach; our algorithm has the advantage of being listefficient.)
Clustering data streams: Theory and practice
 IEEE TKDE
, 2003
"... Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little ..."
Abstract

Cited by 106 (2 self)
 Add to MetaCart
Abstract—The data stream model has recently attracted attention for its applicability to numerous types of data, including telephone records, Web documents, and clickstreams. For analysis of such data, the ability to process the data in a single pass, or a small number of passes, while using little memory, is crucial. We describe such a streaming algorithm that effectively clusters large data streams. We also provide empirical evidence of the algorithm’s performance on synthetic and real data streams. Index Terms—Clustering, data streams, approximation algorithms. 1
Approximate Join Processing Over Data Streams
, 2003
"... We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We first discuss alternate architectural m ..."
Abstract

Cited by 97 (2 self)
 Add to MetaCart
We consider the problem of approximating sliding window joins over data streams in a data stream processing system with limited resources. In our model, we deal with resource constraints by shedding load in the form of dropping tuples from the data streams. We first discuss alternate architectural models for data stream join processing, and we survey suitable measures for the quality of an approximation of a setvalued query result. We then consider the number of generated result tuples as the quality measure, and we give optimal offline and fast online algorithms for it. In a thorough experimental study with synthetic and real data we show the efficacy of our solutions. For applications with demand for exact results we introduce a new Archivemetric which captures the amount of work needed to complete the join in case the streams are archived for later processing.
Approximating extent measure of points
 Journal of ACM
"... We present a general technique for approximating various descriptors of the extent of a set of points in�when the dimension�is an arbitrary fixed constant. For a given extent measure�and a parameter��, it computes in time a subset�of size, with the property that. The specific applications of our tec ..."
Abstract

Cited by 96 (28 self)
 Add to MetaCart
We present a general technique for approximating various descriptors of the extent of a set of points in�when the dimension�is an arbitrary fixed constant. For a given extent measure�and a parameter��, it computes in time a subset�of size, with the property that. The specific applications of our technique include�approximation algorithms for (i) computing diameter, width, and smallest bounding box, ball, and cylinder of, (ii) maintaining all the previous measures for a set of moving points, and (iii) fitting spheres and cylinders through a point set. Our algorithms are considerably simpler, and faster in many cases, than previously known algorithms. 1
Processing sliding window multijoins in continuous queries over data streams
 Proceedings of the 29th international
, 2003
"... We study sliding window multijoin processing in continuous queries over data streams. Several algorithms are reported for performing continuous, incremental joins, under the assumption that all the sliding windows fit in main memory. The algorithms include multiway incremental nested loop joins (NL ..."
Abstract

Cited by 95 (9 self)
 Add to MetaCart
We study sliding window multijoin processing in continuous queries over data streams. Several algorithms are reported for performing continuous, incremental joins, under the assumption that all the sliding windows fit in main memory. The algorithms include multiway incremental nested loop joins (NLJs) and multiway incremental hash joins. We also propose join ordering heuristics to minimize the processing cost per unit time. We test a possible implementation of these algorithms and show that, as expected, hash joins are faster than NLJs for performing equijoins, and that the overall processing cost is influenced by the strategies used to remove expired tuples from the sliding windows. 1
Characterizing Memory Requirements for Queries over Continuous Data
 In PODS
, 2002
"... This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data stre ..."
Abstract

Cited by 83 (9 self)
 Add to MetaCart
This paper deals with continuous conjunctive queries with arithmetic comparisons and optional aggregation over multiple data streams. An algorithm is presented for determining whether or not any given query can be evaluated using a bounded amount of memory for all possible instances of the data streams. For queries that can be evaluated using bounded memory, an execution strategy based on constantsized synopses of the data streams is proposed. For queries that cannot be evaluated using bounded memory, data stream scenarios are identified in which evaluating the queries requires memory linear in the size of the unbounded streams