Results 1 - 7 of 7
Sliding Window Query Processing over Data Streams
2006
Cited by 11 (0 self)
Database management systems (DBMSs) have been used successfully in traditional business applications that require persistent data storage and an efficient querying mechanism. Typically, it is assumed that the data are static, unless explicitly modified or deleted by a user or application. Database queries are executed when issued and their answers reflect the current state of the data. However, emerging applications, such as sensor networks, real-time Internet traffic analysis, and on-line financial trading, require support for processing of unbounded data streams. The fundamental assumption of a data stream management system (DSMS) is that new data are generated continually, making it infeasible to store a stream in its entirety. At best, a sliding window of recently arrived data may be maintained, meaning that old data must be removed as time goes on. Furthermore, as the contents of the sliding windows evolve over time, it makes
Sketching asynchronous streams over a sliding window
In PODC, 2006
Cited by 10 (2 self)
We study the problem of maintaining sketches of recent elements of a data stream. Motivated by applications involving network data, we consider streams that are asynchronous, in which the observed order of data is not the same as the time order in which the data was generated. The notion of recent elements of a stream is modeled by the sliding timestamp window, which is the set of elements with timestamps that are close to the current time. We design algorithms for maintaining sketches of all elements within the sliding timestamp window that can give provably accurate estimates of two basic aggregates, the sum and the median, of a stream of numbers. The space taken by the sketches, the time needed for querying the sketch, and the time for inserting new elements into the sketch are all polylog with respect to the maximum window size and the values of the data items in the window. Our sketches can be easily combined in a lossless and compact way, making them useful for distributed computations over data streams. Previous works on sketching recent elements of a data stream have all considered the more restrictive scenario of synchronous streams, where the observed order of data is the same as the time order in which the data was generated. Our notion of recency of elements is more general than that studied in previous work, and thus our sketches are more robust to network delays and asynchrony.
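The sliding timestamp window model described in this abstract can be illustrated with an exact baseline. The following is a minimal sketch, not the paper's algorithm: it stores every in-window element (linear space rather than polylogarithmic), and the class name `TimestampWindow`, the half-open window `(now - width, now]`, and the assumption that `now` never decreases are all illustrative choices.

```python
import heapq
import statistics


class TimestampWindow:
    """Exact sum/median over elements whose timestamps lie in (now - width, now].

    Elements may arrive out of timestamp order (asynchronous stream).
    Hypothetical baseline for illustration; it keeps all in-window elements.
    """

    def __init__(self, width):
        self.width = width
        self._heap = []  # min-heap of (timestamp, value), oldest timestamp first

    def _expire(self, now):
        # Drop elements whose timestamps have fallen out of the window.
        while self._heap and self._heap[0][0] <= now - self.width:
            heapq.heappop(self._heap)

    def insert(self, timestamp, value, now):
        # A late arrival that is already outside the window is ignored.
        if timestamp > now - self.width:
            heapq.heappush(self._heap, (timestamp, value))
        self._expire(now)

    def values(self, now):
        self._expire(now)
        return [v for _, v in self._heap]

    def total(self, now):
        return sum(self.values(now))

    def median(self, now):
        vals = self.values(now)
        return statistics.median(vals) if vals else None
```

Because expiry is driven by the element's own timestamp rather than its arrival position, an element observed late is still counted for exactly as long as its timestamp remains recent.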
Time-Decaying Sketches for Sensor Data Aggregation
2007
Cited by 8 (3 self)
We present a new sketch for summarizing network data. The sketch has the following properties, which make it useful for communication-efficient aggregation in distributed streaming scenarios such as sensor networks: the sketch is duplicate-insensitive, i.e., re-insertions of the same data will not affect the sketch, and hence the estimates of aggregates. Unlike previous duplicate-insensitive sketches for sensor data aggregation [26, 12], it is also time-decaying, so that the weight of a data item in the sketch can decrease with time according to a user-specified decay function. The sketch can give provable approximation guarantees for various aggregates of data, including the sum, median, quantiles, and frequent elements. The size of the sketch and the time taken to update it are both polylogarithmic in the size of the relevant data. Further, multiple sketches computed over distributed data can be combined without losing the accuracy guarantees. To our knowledge, this is the first sketch that combines all the above properties.
Distinct-values estimation over data streams
In Data Stream Management: Processing High-Speed Data
Cited by 4 (1 self)
In this chapter, we consider the problem of estimating the number of distinct values in a data stream with repeated values. Distinct-values estimation was one of the first data stream problems studied: in the mid-1980s, Flajolet and Martin gave an effective algorithm that uses only logarithmic space. Recent work has built upon their technique, improving the accuracy guarantees on the estimation, proving lower bounds, and considering other settings such as sliding windows, distributed streams, and sensor networks.
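The Flajolet and Martin technique mentioned above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the chapter's exact algorithm: the class name `FMSketch`, the use of `hashlib.blake2b` as the hash function, and plain averaging over several independently seeded bitmaps are all choices made here for brevity. Items are distinguished by their string form.

```python
import hashlib

PHI = 0.77351  # bias-correction constant from Flajolet and Martin's analysis


def _hash64(item, seed):
    # Deterministic 64-bit hash; the seed selects one of several hash functions.
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def _rho(x, width=64):
    # Position of the least-significant 1 bit of x (width if x is zero).
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


class FMSketch:
    """Bitmap-based distinct-count sketch (illustrative, hypothetical API)."""

    def __init__(self, num_maps=64):
        self.bitmaps = [0] * num_maps

    def add(self, item):
        # Setting a bit is idempotent, so repeated items leave the sketch unchanged.
        for seed in range(len(self.bitmaps)):
            self.bitmaps[seed] |= 1 << _rho(_hash64(item, seed))

    def merge(self, other):
        # Bitwise OR yields the sketch of the union of the two streams.
        merged = FMSketch(len(self.bitmaps))
        merged.bitmaps = [a | b for a, b in zip(self.bitmaps, other.bitmaps)]
        return merged

    def estimate(self):
        # R = index of the lowest unset bit in each bitmap; average R, then scale.
        total_r = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):
                r += 1
            total_r += r
        return (2 ** (total_r / len(self.bitmaps))) / PHI
```

Each bitmap needs only about log2 of the number of distinct values bits, which is where the logarithmic space bound comes from. Note that this simplified estimator returns 1/PHI rather than 0 for an empty sketch.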
Approximate Algorithms for Computing Spatial Distance Histograms with Accuracy Guarantees
Cited by 2 (1 self)
Particle simulation has become an important research tool in many scientific and engineering fields. Data generated by such simulations poses great challenges for database storage and query processing. One of the queries against particle simulation data, the spatial distance histogram (SDH) query, is the building block of many high-level analytics, and requires quadratic time to compute using a straightforward algorithm. Previous work has developed efficient algorithms that compute exact SDHs with time complexity $O(N^{3/2})$ for two-dimensional data and $O(N^{5/3})$ for three-dimensional data, where $N$ is the number of particles in the simulation system. While beating the naive solution, such algorithms are still not practical for processing SDH queries against large-scale simulation data. In this paper, we take a different path to tackle this problem by focusing on approximate algorithms with provable error bounds. We first present a solution derived from the aforementioned exact SDH algorithm; its running time is independent of the input size $N$. While an error bound can be easily identified, experimental results show that the accuracy of such an algorithm is significantly higher than what is given by such a (loose) bound. To study the difference between the experimental results and the theoretical bound, we develop a mathematical model to analyze the mechanism that leads to errors in the basic approximate algorithm. Our model provides insights on how the algorithm can be improved to achieve higher accuracy and efficiency. Such insights give rise to a new approximate algorithm with improved time/accuracy tradeoff. Experimental results confirm our analysis. Index Terms: molecular simulation, particle simulation, spatial distance histogram, radial distribution functions, quad-tree, scientific databases
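For reference, the straightforward quadratic-time SDH computation that the abstract contrasts against might look like the sketch below. The function name `naive_sdh`, the equal-width buckets, and the choice to clamp over-range distances into the last bucket are assumptions made for illustration; the paper's own algorithms are not reproduced here.

```python
import math
from itertools import combinations


def naive_sdh(points, bucket_width, num_buckets):
    """Histogram of all pairwise Euclidean distances, computed in O(N^2) time.

    Bucket k counts distances in [k * bucket_width, (k + 1) * bucket_width).
    Distances beyond the last bucket are clamped into it (an assumption).
    """
    hist = [0] * num_buckets
    for p, q in combinations(points, 2):
        d = math.dist(p, q)  # works for 2D and 3D points alike
        idx = min(int(d // bucket_width), num_buckets - 1)
        hist[idx] += 1
    return hist
```

The loop visits all N(N-1)/2 pairs, so the bucket counts always sum to that number; this is the cost the exact and approximate algorithms in the abstract aim to avoid.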
Norm, Point, and Distance Estimation Over Multiple Signals Using Max-Stable Distributions
Cited by 1 (1 self)
Consider a set of signals f_s: {1,..., N} → [0,..., M] appearing as a stream of tuples (i, f_s(i)) in arbitrary order of i and s. We would like to devise one-pass approximate algorithms for estimating various functionals on the dominant signal f_max, defined as f_max = {(i, max_s f_s(i)), ∀i}. Examples include the "worst case influence", which is the F1-norm of the dominant signal [7], general Fp-norms, and special types of distances between dominant signals. The only known previous works in this setting are the algorithms of Cormode and Muthukrishnan [7] and Pavan and Tirthapura [18], which can estimate only the F1-norm over f_max. No previous work addressed more general norms or distance estimation. In this work, we use a novel sketch, based on the properties of max-stable distributions, for these more general problems. The max-stable sketch is a significant improvement over previous alternatives in terms of simplicity of implementation, space requirements, and insertion cost, while providing similar approximation guarantees. To support our claims, we also conduct an experimental evaluation using real datasets.
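To make the definition of the dominant signal concrete, the following exact baseline computes its norm by storing the running maximum for every index. It is only an illustration of the quantity being estimated, not the max-stable sketch: the function name `dominance_norm` is hypothetical, it uses space linear in the number of indices, and it takes the Fp-norm to mean the sum of p-th powers (the frequency-moment convention), which is an assumption about the paper's notation.

```python
def dominance_norm(stream, p=1):
    """Exact sum over i of (max over s of f_s(i)) ** p.

    `stream` yields (i, value) tuples in arbitrary order; the signal id s
    is not needed because only the per-index maximum matters.
    """
    fmax = {}
    for i, value in stream:
        # Signals are non-negative, so an unseen index has maximum 0.
        if value > fmax.get(i, 0):
            fmax[i] = value
    return sum(v ** p for v in fmax.values())
```

For example, `dominance_norm([(1, 3), (2, 5), (1, 7), (2, 2)])` keeps 7 for index 1 and 5 for index 2 and returns 12.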
Two Improved Range-Efficient Algorithms for F0 Estimation
We present two new algorithms for the range-efficient F0 estimation problem that improve on the previously best known result, proposed by Pavan and Tirthapura [15]. The algorithms presented in our paper also improve the previously best known result for the Max-Dominance Norm Problem.