Results 1 - 4 of 4
Space-Efficient Estimation of Statistics over Sub-Sampled Streams
Abstract

Cited by 2 (2 self)
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sample a small fraction of the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, the quantities that need to be computed on the sampled stream are often different from the original quantities of interest and their estimation requires new algorithms. We present upper and lower bounds (often matching) for estimating frequency moments, support size, entropy, and heavy hitters of the original stream from the data observed in the sampled stream.
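The point that the sampled stream calls for a different computation than the original can be illustrated with the second frequency moment, F2 = sum of f_i^2. The sketch below is not the paper's space-efficient algorithm; it assumes Bernoulli sampling with a known rate `p` and keeps exact counts of the sample. The function names are hypothetical. If g_i is the sampled count of item i, then E[g_i(g_i - 1)] = p^2 f_i(f_i - 1) and E[g_i] = p f_i, so computing F2 of the sample and rescaling is biased, while the corrected estimator below is unbiased.

```python
import random
from collections import Counter


def bernoulli_sample(stream, p, rng):
    """Keep each element independently with probability p (assumed sampling model)."""
    return [x for x in stream if rng.random() < p]


def estimate_f2(sampled, p):
    """Unbiased estimate of F2 = sum_i f_i^2 of the ORIGINAL stream.

    Uses exact counts of the sample, so it is not space-efficient;
    it only illustrates the bias correction.
    """
    counts = Counter(sampled)
    # sum_i g_i (g_i - 1) has expectation p^2 * sum_i f_i (f_i - 1)
    collisions = sum(g * (g - 1) for g in counts.values())
    # sum_i g_i has expectation p * sum_i f_i
    length = sum(counts.values())
    return collisions / p**2 + length / p


def exact_f2(stream):
    """F2 computed directly on a fully observed stream."""
    return sum(f * f for f in Counter(stream).values())
```

With `p = 1.0` the estimator reduces to the exact F2 of the observed stream; for smaller `p` individual estimates vary, and only their expectation matches.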
Perfectly Balanced Allocation with Estimated Average Using Approximately Constant Retries
National University of Defense Technology, 2011
Load Shedding using Window Aggregation Queries on Data Streams
Abstract
The process of extracting knowledge structures from continuous, rapid records is known as data stream mining. The main issue in stream mining is handling streams whose elements arrive so rapidly that it is infeasible to store everything in active storage. To overcome this problem of handling voluminous data, we propose a novel load shedding system that uses a window-based aggregate function of the data stream, in which we accept those tuples in the stream that meet a criterion. Accepted tuples are passed to another process as a stream, while the remaining tuples are dropped. The proposed model conceptually segregates the input data stream into windows and probabilistically decides which tuples to drop based on the window function. The best window aggregate function for dropping tuples is identified using three prediction models from data mining: Decision Tree, Naïve Bayes, and Logistic Regression. The results show that the cumulative distance and density rank functions outperform the remaining methods. Unlike prior methods, our method preserves the uniformity of windows throughout a query plan and consistently delivers subsets of the original query results with negligible loss in result quality.
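The general shape of window-based probabilistic load shedding can be sketched as follows. This is an illustration under stated assumptions, not the authors' system: the abstract does not define its cumulative distance or density rank functions, so the sketch substitutes a hypothetical rule that keeps a tuple with probability proportional to its distance from the window mean. The names `shed_load`, `_shed_window`, `window_size`, and `keep_fraction` are all invented for this example.

```python
import random
from statistics import mean


def shed_load(stream, window_size, keep_fraction, rng):
    """Yield surviving tuples, processing one tumbling window at a time."""
    window = []
    for value in stream:
        window.append(value)
        if len(window) == window_size:
            yield from _shed_window(window, keep_fraction, rng)
            window = []
    if window:  # final, possibly shorter window
        yield from _shed_window(window, keep_fraction, rng)


def _shed_window(window, keep_fraction, rng):
    """Hypothetical window function: favor tuples far from the window mean."""
    center = mean(window)
    distances = [abs(v - center) for v in window]
    total = sum(distances)
    for v, d in zip(window, distances):
        if total == 0:
            # All tuples are identical: fall back to uniform shedding.
            keep_prob = keep_fraction
        else:
            # Expected number kept per window is at most keep_fraction * len(window).
            keep_prob = min(1.0, keep_fraction * len(window) * d / total)
        if rng.random() < keep_prob:
            yield v
```

Accepted tuples come out as a stream in their original order, so a downstream operator can consume `shed_load(...)` directly; swapping in a different window function only requires changing how `keep_prob` is computed.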
Stochastic Streams: Sample Complexity vs. Space Complexity
Abstract
We address the tradeoff between the computational resources needed to process a large data set and the number of samples available from the data set. Specifically, we consider the following abstraction: we receive a potentially infinite stream of IID samples from some unknown distribution D, and are tasked with computing some function f(D). If the stream is observed for time t, how much memory, s, is required to estimate f(D)? We refer to t as the sample complexity and s as the space complexity. The main focus of this paper is investigating the tradeoffs between the space and sample complexity. We study these tradeoffs for two canonical problems: undirected graph connectivity and estimating frequency moments. Our algorithms are based on techniques for emulating random walks and simulating different sampling procedures given a sequence of IID samples.
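One extreme of the space versus sample tradeoff can be made concrete with a minimal sketch, which is not one of the paper's algorithms. For IID samples from D with item probabilities p_i, two independent samples are equal with probability sum of p_i^2, the second frequency moment of D. Comparing disjoint consecutive pairs therefore gives an unbiased estimate using constant memory, at the cost of needing many samples for a low-variance answer. The function name is hypothetical.

```python
def estimate_collision_probability(samples):
    """Estimate sum_i p_i^2 from IID samples using O(1) memory.

    Samples are consumed in disjoint consecutive pairs; the estimate is
    the fraction of pairs whose two samples are equal.
    """
    pairs = 0
    matches = 0
    previous = None
    have_previous = False
    for x in samples:
        if have_previous:
            pairs += 1
            matches += (x == previous)
            have_previous = False
        else:
            previous = x
            have_previous = True
    if pairs == 0:
        raise ValueError("need at least two samples")
    return matches / pairs
```

Only one stored sample and two counters are kept regardless of the stream length t, so here s is constant while accuracy is bought entirely with more samples.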