## Models and issues in data stream systems (2002)


Venue: PODS

Citations: 636 (19 self)

### BibTeX

@INPROCEEDINGS{Babcock02modelsand,
    author = {Brian Babcock and Shivnath Babu and Mayur Datar and Rajeev Motwani and Jennifer Widom},
    title = {Models and issues in data stream systems},
    booktitle = {PODS},
    year = {2002},
    pages = {1--16}
}


### Abstract

In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

### Citations

1874 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context ...the data stream model. 6.1 Random Samples. Random samples can be used as a summary structure in many scenarios where a small sample is expected to capture the essential characteristics of the data set [65]. It is perhaps the easiest form of summarization in a DSMS and other synopses can be built from a sample itself. In fact, the join synopsis in the AQUA system [2] is nothing but a uniform sample of t...

700 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999
Citation Context ...h area in the algorithms community in recent years, as discussed in detail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27...

616 | Communication Complexity
- Kushilevitz, Nisan
- 1997
Citation Context ...am model. Henzinger, Raghavan, and Rajagopalan [49] provided space lower bounds for concrete problems in the data stream model. These lower bounds are derived from results in communication complexity [56]. To understand the connection, observe that the memory used by any one-pass algorithm for a function $f$, after seeing a prefix of the data stream, is lower bounded by the one-way ...

505 | NiagaraCQ: A scalable continuous query system for internet databases
- Chen, DeWitt, et al.
Citation Context ...capability over network packet streams. The Tangram stream query processing system [68, 69] uses stream processing techniques to analyze large quantities of stored data. The OpenCQ [57] and NiagaraCQ [24] systems support continuous queries for monitoring persistent data sets spread over a wide-area network, e.g., web sites over the Internet. OpenCQ uses a query processing algorithm based on incrementa...

424 | Fast subsequence matching in time-series databases
- Faloutsos, Ranganathan, et al.
- 1994
Citation Context ...eed to refer to the sequencing aspect of streams, particularly in the form of sliding windows over streams. Related work in this category also includes work on temporal [80] and time-series databases [31], where the ordering of tuples implied by time can be used in querying, indexing, and query optimization. The body of work on materialized views relates to continuous queries, since materialized views...

347 | Eddies: Continuously adaptive query processing
- Avnur, Hellerstein
- 2000
Citation Context ...vidual data items, and they do not directly support the continuous queries [84] that are typical of data stream applications. Furthermore, it is recognized that both approximation [13] and adaptivity [8] are key ingredients in executing queries and performing other processing (e.g., data analysis and mining) over rapid data streams, while traditional DBMS’s focus largely on the opposite goal of preci...

322 | External Memory Algorithms and Data Structures
- Vitter
- 1999
Citation Context ...ts Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound. While external memory algorithms [91] for handling data sets larger than main memory have been studied, such algorithms are not well suited to data stream applications since they do not support continuous queries and are typically too sl...

312 | Online Aggregation
- Hellerstein, Haas, et al.
- 1997
Citation Context ...of the data stream rather than over the entire data stream. We obtain an approximate answer, but in some cases one can give confidence bounds on the degree of error introduced by the sampling process [48]. Unfortunately, for many situations (including most queries involving joins [20, 22]), sampling-based approaches cannot give reliable approximation guarantees. Designing sampling-based algorithms tha...

304 | Efficient Filtering of XML Documents for Selective Dissemination of Information
- Altinel, Franklin
- 2000

297 | Mining High-Speed Data Streams
- Domingos, Hulten
- 2000
Citation Context ...pourri of algorithmic results for data streams. Data Mining. Decision trees are another form of synopsis used for prediction. Domingos et al. [28, 29] have studied the problem of maintaining decision trees over data streams. Clustering is yet another way to summarize data. Consider the $k$-median formulation for clustering: Given data points in a ...

261 | Stable distributions, pseudorandom generators, embeddings and data stream computation
- Indyk
- 2006
Citation Context ...o the pertinent bit positions that are set to 1. Feigenbaum et al. [33] showed how to construct such a family of range-summable $\pm 1$-valued hash functions with limited (four-way) independence. Indyk [50] provided a uniform framework to compute the $L_p$ norm (for $p \in (0, 2]$) using the so-called $p$-stable distributions, improving upon the previous paper [33] for estimating the $L_1$ norm, in that it a...

261 | Random sampling with a reservoir
- Vitter
- 1985
Citation Context ...s. If so, the larger the windows (stored in available memory), the better the approximation. Other examples include duplicate elimination using limited-size hash tables, and sampling using reservoirs [90]. The Aurora system [16] also proposes adaptivity and approximations, and uses load-shedding techniques based on application-specified measures of quality of service for graceful degradation in the fa...
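As a rough illustration of the reservoir sampling mentioned in the context above, the following Python sketch keeps a uniform random sample of fixed size over a stream of unknown length. It shows only the basic scheme in which each new item replaces a random slot with decreasing probability (commonly called Algorithm R), not the faster skip-based variants. The function name `reservoir_sample` and its parameters are assumptions made for this example and do not come from the paper.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items chosen uniformly at random from an iterable.

    The stream is read once and only k items are held in memory.
    `rng` is an optional random.Random instance (hypothetical parameter,
    included so that runs can be made repeatable).
    """
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

Memory use stays at `k` items regardless of stream length, which is why a sample of this kind suits the limited-memory setting the survey describes.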

254 | Mining Time-Changing Data Streams
- Hulten, Spencer, et al.
- 2001
Citation Context ...pourri of algorithmic results for data streams. Data Mining. Decision trees are another form of synopsis used for prediction. Domingos et al. [28, 29] have studied the problem of maintaining decision trees over data streams. Clustering is yet another way to summarize data. Consider the $k$-median formulation for clustering: Given data points in a ...

253 | Clustering data streams
- Guha, Mishra, et al.
- 2001
Citation Context ... such that the sum of the errors over the data points is minimized. The “error” for each data point is the distance from that point to the nearest of the $k$ chosen representative points. Guha et al. [44] presented a single-pass algorithm for maintaining approximate $k$-medians (cluster centers) that uses $O(n^{\epsilon})$ space for some $\epsilon < 1$ using […] amortized time per data element, to...
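The clustering objective quoted above can be stated compactly: the cost of a set of representative points is the sum, over all data points, of the distance to the nearest representative. The Python sketch below evaluates that cost for a given set of centers. It assumes Euclidean distance and points given as coordinate tuples, and the name `kmedian_cost` is hypothetical. It only scores a candidate solution; it is not the single-pass algorithm of Guha et al.

```python
import math


def kmedian_cost(points, centers):
    """Sum over all points of the distance to the nearest center.

    Assumes Euclidean distance (an illustrative choice; the objective
    is defined for any metric space).
    """
    if not centers:
        raise ValueError("at least one center is required")
    return sum(min(math.dist(p, c) for c in centers) for p in points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
    print(kmedian_cost(pts, [(0.0, 0.0), (10.0, 0.0)]))
```

A streaming algorithm for this problem would try to keep this cost within a constant factor of the optimum while storing far fewer than all the points.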

251 | Fjording the stream: An architecture for queries over streaming sensor data
- Madden, Franklin
- 2002
Citation Context ...distributed clickstream analyses, e.g., to track heavily accessed web pages as part of their real-time performance monitoring. There are several emerging applications in the area of sensor monitoring [16, 58] where a large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed. The application domain that we use for more d...

250 | Continuous queries over data streams
- Babu, Widom
Citation Context ...etwork traffic management, which involves monitoring network packet header information across a set of routers to obtain information on traffic flow patterns. Based on a description of Babu and Widom [10], we delve into this example in some detail to help illustrate that continuous queries arise naturally in real applications and that conventional DBMS technology does not adequately support such queri...

231 | Continuously adaptive continuous queries over streams
- Madden, Shah, et al.
- 2002
Citation Context ... deal with append-only input data, they may provide approximate rather than exact answers, and their processing strategy may adapt as characteristics of the data streams change. The Telegraph project [8, 47, 58, 59] shares some target applications and basic technical ideas with a DSMS. Telegraph uses an adaptive query engine (based on the Eddy concept [8]) to process queries efficiently in volatile and unpredict...

228 | Maintaining stream statistics over sliding windows
- Datar, Gionis, et al.
- 2002
Citation Context ...e buffered in memory, there are also theoretical challenges in designing algorithms that can give approximate answers using only the available memory. Some recent results in this vein can be found in [9, 26]. While existing work on sequence and temporal databases has addressed many of the issues involved in time-sensitive queries (a class that includes sliding window queries) in a relational database con...

210 | Wavelet-Based Histograms for Selectivity Estimation
- Matias, Vitter, et al.
- 1998
Citation Context ... the difference between the original signal and the dyadic interval with constant value. Recent papers have demonstrated the efficacy of wavelets for different tasks such as selectivity estimation [63], data cube approximation [93] and computing multi-dimensional aggregates [92]. This body of work indicates that estimates obtained from wavelets were more accurate than those obtained from histograms...

208 | Trajectory sampling for direct traffic observation
- Duffield, Grossglauser
- 2000
Citation Context ...ventional DBMS technology does not adequately support such queries. Consider the network traffic management system of a large network, e.g., the backbone network of an Internet Service Provider (ISP) [30]. Such systems monitor a variety of continuous data streams that may be characterized as unpredictable and arriving at a high rate, including both packet traces and network performance measurements. T...

204 | Multiple-query optimization
- Sellis
- 1988
Citation Context ...ciently find the plan that, with the best memory allocation, minimizes approximation? Should plans be modified when conditions change? Even further, since synopses could be shared among query plans [75], how do we optimally consider a set of queries, which may be weighted by importance? In addition to memory management, we are faced with the problem of scheduling multiple query plans in a DSMS. The sch...

196 | Surfing wavelets on streams: One-pass summaries for approximate aggregate queries
- Gilbert, Kotidis, et al.
Citation Context ...roximate query answering. For example, recent work [27, 37] develops histogram-based techniques to provide approximate answers for correlated aggregate queries over data streams, and Gilbert et al. [40] present a general approach for building small-space summaries over data streams to provide approximate answers for many classes of aggregate queries. However, research problems abound in the area of a...

185 | An adaptive query execution system for data integration
- Ives, Florescu, et al.
- 1999
Citation Context ...klin [58] focus on query execution strategies over data streams generated by sensors, and Madden et al. [59] discuss adaptive processing techniques for multiple continuous queries. The Tukwila system [53] also supports adaptive query processing, in order to perform dynamic data integration over autonomous data sources. The Aurora project [16] is building a new data processing system targeted exclusi...

182 | Continuous queries over append-only databases
- Terry, Goldberg, et al.
- 1992
Citation Context ...anagement system (DBMS) and operate on it there. Traditional DBMS’s are not designed for rapid and continuous loading of individual data items, and they do not directly support the continuous queries [84] that are typical of data stream applications. Furthermore, it is recognized that both approximation [13] and adaptivity [8] are key ingredients in executing queries and performing other processing (e...

181 | Space Efficient Online Computation of Quantile Summaries
- Greenwald, Khanna
- 2001
Citation Context ...arallel database systems employ value range data partitioning that requires generation of quantiles or splitters that partition the data into approximately equal parts. Recently, Greenwald and Khanna [41] presented a single-pass deterministic algorithm for efficient computation of quantiles. Their algorithm needs $O(\frac{1}{\epsilon} \log \epsilon N)$ space and guarantees a precision of $\epsilon N$. They employ a novel data structure that maintains a sample of the values seen so far (quantiles), along with a range of possible ranks that t...

175 | Approximate Query Processing Using Wavelets
- Chakrabarti, Garofalakis, et al.
- 2000
Citation Context ...imensional aggregates [92]. This body of work indicates that estimates obtained from wavelets were more accurate than those obtained from histograms with the same amount of memory. Chakrabarti et al. [17] propose the use of wavelets for general purpose approximate query processing and demonstrate how to compute joins, aggregations, and selections entirely in the wavelet coefficient domain. To extend t...

170 | Approximate Computation of Multidimensional Aggregates on Sparse Data Using Wavelets
- Vitter, Wang
- 1999
Citation Context ...ail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-based techniques to provide approximate answers...

163 | Continual Queries for Internet Scale Event-Driven Information Delivery
- Liu, Pu, et al.
- 1999
Citation Context ...estricted querying capability over network packet streams. The Tangram stream query processing system [68, 69] uses stream processing techniques to analyze large quantities of stored data. The OpenCQ [57] and NiagaraCQ [24] systems support continuous queries for monitoring persistent data sets spread over a wide-area network, e.g., web sites over the Internet. OpenCQ uses a query processing algorithm ...

161 | Updating derived relations: Detecting irrelevant and autonomously computable updates
- Blakeley, Coburn, et al.
- 1989
Citation Context ...ous queries, since materialized views are effectively queries that need to be reevaluated or incrementally updated whenever the base data changes. Of particular importance is work on self-maintenance [15, 45, 71], ensuring that enough data has been saved to maintain a view even when the base data is unavailable, and the related problem of data expiration [36], determining when certain base data can be discarded...

161 | Processing complex aggregate queries over data streams
- Dobra, Garofalakis, et al.
- 2002
Citation Context ...35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-based techniques to provide approximate answers for correlated aggregate queries over data streams, and Gilbert et al. [40] present a general approach for building small-space summa...

159 | Computing on data streams
- Henzinger, Raghavan, et al.
- 1998
Citation Context ...gorithm maintains a data structure which can be used to compute the value of the function on demand, and then the time required to process each such query also becomes of interest. Henzinger et al. [49] defined a similar model but also allowed the algorithm to make multiple passes over the stream data, making the number of passes itself a complexity measure. We will restrict our attention to algorit...

143 | Join Synopses for Approximate Query Answering
- Acharya, Gibbons, et al.
- 1999
Citation Context ...community in recent years, as discussed in detail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-bas...

143 | On Random Sampling over Joins
- Chaudhuri, Motwani, et al.
- 1999
Citation Context ...community in recent years, as discussed in detail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-bas...

139 | Optimal histograms with quality guarantees
- Jagadish, Koudas, et al.
- 1998
Citation Context ...nt of such frequent items is related to Iceberg queries [32]. We give an overview of recent work on computing such histograms over data streams. V-Optimal Histograms over Data Streams. Jagadish et al. [54] showed how to compute optimal V-Optimal Histograms for a given data set using dynamic programming. The algorithm uses […] space and requires $O(N^2 B)$ time, where $N$ is the size of the data se...

136 | A taxonomy of time in databases
- Snodgrass, Ahn
- 1985
Citation Context ...lications, continuous queries need to refer to the sequencing aspect of streams, particularly in the form of sliding windows over streams. Related work in this category also includes work on temporal [80] and time-series databases [31], where the ordering of tuples implied by time can be used in querying, indexing, and query optimization. The body of work on materialized views relates to continuous qu...

135 | Computing Iceberg Queries Efficiently
- Fang, Shivakumar, et al.
- 1998
Citation Context ... counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution. Maintaining the count of such frequent items is related to Iceberg queries [32]. We give an overview of recent work on computing such histograms over data streams. V-Optimal Histograms over Data Streams. Jagadish et al. [54] showed how to compute optimal V-Optimal Histograms for ...

134 | On computing correlated aggregates over continual data streams
- Gehrke, Korn, et al.
- 2001
Citation Context ...35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-based techniques to provide approximate answers for correlated aggregate queries over data streams, and Gilbert et al. [40] present a general approach for building small-space summa...

129 | Data-streams and histograms
- Guha, Koudas, et al.
- 2001
Citation Context ...rogramming. The algorithm uses […] space and requires $O(N^2 B)$ time, where $N$ is the size of the data set and $B$ is the number of buckets. This is prohibitive for data streams. Guha, Koudas and Shim [43] adapted this algorithm to sorted data streams. Their algorithm constructs an arbitrarily-close V-Optimal Histogram (i.e., with error arbitrarily close to that of the optimal histogr...

125 | Making Views Self-Maintainable for Data Warehousing
- Quass, Gupta, et al.
- 1995
Citation Context ...ous queries, since materialized views are effectively queries that need to be reevaluated or incrementally updated whenever the base data changes. Of particular importance is work on self-maintenance [15, 45, 71], ensuring that enough data has been saved to maintain a view even when the base data is unavailable, and the related problem of data expiration [36], determining when certain base data can be discarded...

120 | An Efficient Cost-Driven Index Selection Tool for Microsoft SQL
- Chaudhuri, Narasayya
- 1997
Citation Context ...e good approximate answers to a broad range of possible future queries. The problem is similar in some ways to problems in physical database design such as selection of indexes and materialized views [23]. However, there is an important difference: in a traditional database system, when an index or view is lacking, it is possible to go to the underlying relation, albeit at an increased cost. In the da...

118 | Selection and sorting with limited storage
- Munro, Paterson
- 1980
Citation Context ...o circumvent the negative results. Saks and Sun [73] provide space lower bounds for distance approximation between two vectors under the $L_p$ norm, for […], in the data stream model. Munro and Paterson [66] showed that any algorithm that computes quantiles exactly in $p$ passes requires $\Omega(N^{1/p})$ space. Space lower bounds for maintaining simple statistics like count, sum, min/max, and number of distinct values under the sliding windows mo...

118 | Xjoin: A reactively-scheduled pipelined join operator
- Urhan, Franklin
Citation Context ... be able to handle the average stream rate quite comfortably by buffering the streams when their rate is high and catching up during the slow periods. This is the approach used in the XJoin algorithm [88]. Sampling. In the second scenario, computeAnswer may be fast, but the update operation is slow: it takes longer than the average inter-arrival time of the data elements. It is futile to attempt to make...

117 | Reductions in streaming algorithms, with an application to counting triangles in graphs
- Bar-Yossef, Kumar, et al.
Citation Context ... unary representation of the vector. It has […] bit positions (elements), where […] is the dimension of the underlying vector. A […] in the […] unary ... Footnote 2: As discussed in Section 6.7, recently Bar-Yossef et al. [12] and Gibbons and Tirthapura [38] have devised algorithms which, under certain conditions, provide arbitrarily small approximation factors without recourse to perfect hash functions. Footnote 3: Hash functions w...

117 | Rate-based query optimization for streaming information sources
- Viglas, Naughton
- 2002
Citation Context ...ies for efficient evaluation. Within the NiagaraCQ project, Shanmugasundaram et al. [79] discuss the problem of supporting blocking operators in query plans over data streams, and Viglas and Naughton [89] propose rate-based optimization for queries over data streams, a new optimization methodology that is based on stream-arrival and data-processing rates. The Chronicle data model [55] introduced appen...

115 | Sampling-based estimation of the number of distinct values of an attribute
- Haas, Naughton, et al.
- 1995
Citation Context ...fficiently estimating the number of distinct values ($F_0$) has received particular attention in the database literature, particularly in the context of using single pass or random sampling algorithms [18, 46]. A sketching technique to compute $F_0$ was presented earlier by Flajolet and Martin [35]; however, this had the drawback of requiring explicit families of hash functions with very strong independence ...

114 | Approximate medians and other quantiles in one pass and with limited memory
- Manku, Rajagopalan, et al.
- 1998
Citation Context ... merge quantiles with “similar” errors so long as the error for the combined quantile does not exceed […]. This algorithm improves upon the previous set of results by Manku, Rajagopalan, and Lindsay [61, 62] and Chaudhuri, Motwani, and Narasayya [21]. End-Biased Histograms and Iceberg Queries. Many applications maintain simple aggregate...

109 | Tracking join and self-join sizes in limited storage
- Alon, Gibbons, et al.
- 1999
Citation Context ...uses only […] space and provides arbitrarily small approximation factors. This technique has found many applications in the database literature, including join size estimation [4], estimating $L_1$ norm of vectors [33], and processing complex aggregate queries over multiple streams [27, 37]. It remains an open problem to come up with techniques to maintain correlated aggregates [37]...

107 | Random sampling for histogram construction: How much is enough
- Chaudhuri, Motwani, et al.
- 1998
Citation Context ...as the error for the combined quantile does not exceed […]. This algorithm improves upon the previous set of results by Manku, Rajagopalan, and Lindsay [61, 62] and Chaudhuri, Motwani, and Narasayya [21]. End-Biased Histograms and Iceberg Queries. Many applications maintain simple aggregate...

106 | Fast, small-space algorithms for approximate histogram maintenance
- Gilbert, Guha, et al.
- 2002
Citation Context ...e V-Optimal Histogram (i.e., with error arbitrarily close to that of the optimal histogram), using […] space and […] time per data element. In a recent paper, Gilbert et al. [39] removed the restriction that the data stream be sorted, providing algorithms based on the sketching technique described earlier for computing norms. The idea is to view each data element as an u...

101 | A First Course in Database Systems
- Ullman, Widom
- 1997
Citation Context ...nal example, […], is a continuous query for monitoring the source-destination pairs in the top 5 percent in terms of backbone traffic. For ease of exposition, we employ the WITH construct from SQL99 [87]. […]: WITH Load AS (SELECT src, dest, sum(len) AS traffic FROM […] GROUP BY src, dest) SELECT src, dest, traffic FROM Load AS […] WHERE (SELECT count(*) FROM Load AS […] WHERE […].t...