Results 1–10 of 18
Sliding-window top-k queries on uncertain streams
In VLDB, 2008
"... Query processing on uncertain data streams has attracted a lot of attentions lately, due to the imprecise nature in the data generated from a variety of streaming applications, such as readings from a sensor network. However, all of the existing works on uncertain data streams study unbounded stream ..."
Abstract

Cited by 31 (5 self)
Query processing on uncertain data streams has attracted a lot of attention lately, due to the imprecise nature of the data generated from a variety of streaming applications, such as readings from a sensor network. However, all of the existing works on uncertain data streams study unbounded streams. This paper takes the first step towards the important and challenging problem of answering sliding-window queries on uncertain data streams, with a focus on arguably one of the most important types of queries: top-k queries. The challenge of answering sliding-window top-k queries on uncertain data streams stems from the strict space and time requirements of processing both arriving and expiring tuples in high-speed streams, combined with the difficulty of coping with the exponential blowup in the number of possible worlds induced by the uncertain data model. In this paper, we design a unified framework for processing sliding-window top-k queries on uncertain streams. We show that all the existing top-k definitions in the literature can be plugged into our framework, resulting in several succinct synopses that use space much smaller than the window size while also being highly efficient in terms of processing time. In addition to the theoretical space and time bounds that we prove for these synopses, we also present a thorough experimental report to verify their practical efficiency on both synthetic and real data.
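The possible-worlds semantics behind such queries can be illustrated with a brute-force sketch (a hypothetical helper, not the paper's synopsis): each tuple carries an existence probability, and the probability that a tuple appears in the top-k is summed over all 2^n possible worlds. The exponential enumeration below is exactly the blowup the paper's synopses are designed to avoid.

```python
from itertools import product

def topk_probabilities(tuples, k):
    """tuples: list of (score, probability) pairs, one per uncertain tuple.
    Returns, for each tuple, the probability that it appears in the top-k,
    by brute-force enumeration of all 2^n possible worlds."""
    n = len(tuples)
    result = [0.0] * n
    for world in product([0, 1], repeat=n):
        # probability of this particular possible world
        p = 1.0
        for (_, pr), present in zip(tuples, world):
            p *= pr if present else 1.0 - pr
        # rank the tuples that exist in this world by score
        present_idx = sorted((i for i in range(n) if world[i]),
                             key=lambda i: -tuples[i][0])
        for i in present_idx[:k]:
            result[i] += p
    return result
```

For two tuples with scores 10 and 5 and probabilities 0.5 and 1.0, each is the top-1 in worlds of total probability 0.5, so the top-1 probabilities come out to 0.5 apiece.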
Robust lower bounds for communication and stream computation
In Proceedings of the 40th Annual ACM Symposium on Theory of Computing, 2008
"... We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable partition models such as the bestcase and ..."
Abstract

Cited by 28 (7 self)
We study the communication complexity of evaluating functions when the input data is randomly allocated (according to some known distribution) amongst two or more players, possibly with information overlap. This naturally extends previously studied variable partition models such as the best-case and worst-case partition models [32, 29]. We aim to understand whether the hardness of a communication problem holds for almost every allocation of the input, as opposed to holding for perhaps just a few atypical partitions. A key application is to the heavily studied data stream model. There is a strong connection between our communication lower bounds and lower bounds in the data stream model that are “robust” to the ordering of the data. That is, we prove lower bounds for when the order of the items in the stream is chosen not adversarially but rather uniformly (or near-uniformly) from the set of all permutations. This random-order data stream model has attracted recent interest, since lower bounds here give stronger evidence for the inherent hardness of streaming problems. Our results include the first random-partition communication lower bounds for problems including multiparty set disjointness and gap-Hamming-distance. Both are tight. We also extend and improve previous results [19, 7] for a form of pointer jumping that is relevant to the problem of selection (in particular, median finding). Collectively, these results yield lower bounds for a variety of problems in the random-order data stream model, including estimating the number of distinct elements, approximating frequency moments, and quantile estimation.
Tight Bounds for Distributed Functional Monitoring
"... We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites each tracking their input streams and communicating with a central coordinator. The coo ..."
Abstract

Cited by 21 (10 self)
We resolve several fundamental questions in the area of distributed functional monitoring, initiated by Cormode, Muthukrishnan, and Yi (SODA, 2008), and receiving recent attention. In this model there are k sites, each tracking its input stream and communicating with a central coordinator. The coordinator’s task is to continuously maintain an approximate output to a function computed over the union of the k streams. The goal is to minimize the number of bits communicated. Let the p-th frequency moment be defined as F_p = ∑_i f_i^p, where f_i is the frequency of item i in the union of the streams.
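For concreteness, the frequency moment is easy to compute exactly when the whole stream is available in one place; the minimal sketch below (illustrative only, since the monitoring model of course rules out such centralized computation) evaluates F_p directly from its definition:

```python
from collections import Counter

def frequency_moment(stream, p):
    """F_p = sum over distinct items i of f_i**p, where f_i is the
    number of occurrences of item i in the stream."""
    freq = Counter(stream)
    return sum(f ** p for f in freq.values())
```

Note that F_0 counts distinct items, F_1 is the stream length, and F_2 is the self-join size, which is why these moments recur throughout the streaming literature.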
STREAM ORDER AND ORDER STATISTICS: QUANTILE ESTIMATION IN RANDOM-ORDER STREAMS ∗
"... Abstract. When trying to process a datastream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable when the ordering is worstcase, that can be solved (with high probability) when the order is chosen uniformly at random? If we consider the str ..."
Abstract

Cited by 18 (3 self)
Abstract. When trying to process a data stream in small space, how important is the order in which the data arrives? Are there problems that are unsolvable when the ordering is worst-case, but that can be solved (with high probability) when the order is chosen uniformly at random? If we consider the stream as if ordered by an adversary, what happens if we restrict the power of the adversary? We study these questions in the context of quantile estimation, one of the most studied problems in the data-stream model. Our results include an O(polylog n)-space, O(log log n)-pass algorithm for exact selection in a randomly ordered stream of n elements. This resolves an open question of Munro and Paterson [Theor. Comput. Sci., 23 (1980), pp. 315–323]. We then demonstrate an exponential separation between the random-order and adversarial-order models: using O(polylog n) space, exact selection requires Ω(log n / log log n) passes in the adversarial-order model. This lower bound, in contrast to previous results, applies to fully general randomized algorithms and is established via a new bound on the communication complexity of a natural pointer-chasing style problem. We also prove the first fully general lower bounds in the random-order model: finding an element with rank n/2 ± n^δ in the single-pass random-order model with probability at least 9/10 requires Ω(√(n^(1−3δ)/log n)) space.
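A naive way to see how passes trade against space in selection is the following O(1)-memory sketch (a simplified illustration assuming distinct elements, not the paper's algorithm): each pass takes the first surviving element as a pivot, computes its exact rank, and narrows the search interval. Under a random order, pivots behave like random elements, so the candidate range typically shrinks quickly.

```python
def multipass_select(make_stream, k):
    """Return the k-th smallest element (1-indexed) of a stream of
    distinct values, using O(1) working memory per pass.
    make_stream() must return a fresh iterator over the same data."""
    lo, hi = float("-inf"), float("inf")
    below = 0  # number of stream elements known to be <= lo
    while True:
        pivot, count = None, 0
        for x in make_stream():
            if lo < x < hi:
                if pivot is None:        # first surviving element is the pivot
                    pivot, count = x, 1
                elif x <= pivot:
                    count += 1
        rank = below + count             # exact rank of the pivot
        if rank == k:
            return pivot
        if rank < k:                     # answer lies above the pivot
            lo, below = pivot, rank
        else:                            # answer lies below the pivot
            hi = pivot
```

Each pass eliminates at least the pivot from the candidate range, so the loop always terminates; the interesting question, which the paper makes precise, is how fast the range shrinks under random versus adversarial orderings.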
A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences ∗
, 2009
"... The GapHammingDistance problem arose in the context of proving space lower bounds for a number of key problems in the data stream model. In this problem, Alice and Bob have to decide whether the Hamming distance between their nbit input strings is large (i.e., at least n/2 + √ n) or small (i.e., ..."
Abstract

Cited by 14 (2 self)
The Gap-Hamming-Distance problem arose in the context of proving space lower bounds for a number of key problems in the data stream model. In this problem, Alice and Bob have to decide whether the Hamming distance between their n-bit input strings is large (i.e., at least n/2 + √n) or small (i.e., at most n/2 − √n); they do not care if it is neither large nor small. This Θ(√n) gap in the problem specification is crucial for capturing the approximation allowed to a data stream algorithm. Thus far, for randomized communication, an Ω(n) lower bound on this problem was known only in the one-way setting. We prove an Ω(n) lower bound for randomized protocols that use any constant number of rounds. As a consequence we conclude, for instance, that ε-approximately counting the number of distinct elements in a data stream requires Ω(1/ε²) space, even with multiple (a constant number of) passes over the input stream. This extends earlier one-pass lower bounds, answering a long-standing open question. We obtain similar results for approximating the frequency moments and for approximating the empirical entropy of a data stream. In the process, we also obtain tight n − Θ(√(n log n)) lower and upper bounds on the one-way deterministic communication complexity of the problem. Finally, we give a simple combinatorial proof of an Ω(n) lower bound on the one-way randomized communication complexity.
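The promise problem itself is easy to state in code; this small sketch (illustrative, with hypothetical naming) simply evaluates which side of the gap an instance falls on, returning None when the promise is violated:

```python
import math

def gap_hamming(x, y):
    """Decide the Gap-Hamming-Distance promise problem on two n-bit
    strings: 'large' if the Hamming distance is at least n/2 + sqrt(n),
    'small' if it is at most n/2 - sqrt(n), and None when the instance
    falls inside the gap (the promise is violated)."""
    n = len(x)
    dist = sum(a != b for a, b in zip(x, y))
    if dist >= n / 2 + math.sqrt(n):
        return "large"
    if dist <= n / 2 - math.sqrt(n):
        return "small"
    return None
```

The hard part, of course, is not evaluating this function but deciding it when x and y are held by different players with limited communication, which is what the lower bound addresses.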
Comparison-Based Time–Space Lower Bounds for Selection
"... We establish the first nontrivial lower bounds on timespace tradeoffs for the selection problem. We prove that any comparisonbased randomized algorithm for finding the median requires Ω(n log logS n) expected time in the RAM model (or more generally in the comparison branching program model), if we ..."
Abstract

Cited by 10 (1 self)
We establish the first nontrivial lower bounds on time-space tradeoffs for the selection problem. We prove that any comparison-based randomized algorithm for finding the median requires Ω(n log log_S n) expected time in the RAM model (or, more generally, in the comparison branching program model), if we have S bits of extra space besides the read-only input array. This bound is tight for all S ≫ log n, and remains true even if the array is given in a random order. Our result thus answers a 16-year-old question of Munro and Raman, and also complements recent lower bounds that are restricted to sequential access, as in the multi-pass streaming model [Chakrabarti et al., SODA 2008]. We also prove that any comparison-based, deterministic, multi-pass streaming algorithm for finding the median requires Ω(n log*(n/s) + n log_s n) worst-case time (in scanning plus comparisons), if we have s cells of space. This bound is also tight for all s ≫ log² n. We get deterministic lower bounds for I/O-efficient algorithms as well. All proofs in this paper involve “elementary” techniques only.
The Average-Case Complexity of Counting Distinct Elements
"... We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1±ɛ) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any 1pass algorithm requires Ω(1/ɛ 2) bits of space to perform this task. T ..."
Abstract

Cited by 6 (1 self)
We continue the study of approximating the number of distinct elements in a data stream of length n to within a (1 ± ε) factor. It is known that if the stream may consist of arbitrary data arriving in an arbitrary order, then any one-pass algorithm requires Ω(1/ε²) bits of space to perform this task. To try to bypass this lower bound, the problem was recently studied in a model in which the stream may consist of arbitrary data, but it arrives to the algorithm in a random order. However, even in this model an Ω(1/ε²) lower bound was established. This is because the adversary can still choose the data arbitrarily. This leaves open the possibility that the problem is only hard under a pathological choice of data, which would be of little practical relevance. We study the average-case complexity of this problem under certain distributions. Namely, we study the case when each successive stream item is drawn independently and uniformly at random from an unknown subset of d items for an unknown value of d. This captures the notion of random uncorrelated data. For a wide range of values of d and n, we design a one-pass algorithm that bypasses the Ω(1/ε²) lower bound that holds in the adversarial and random-order models, thereby showing that this model admits more space-efficient algorithms. Moreover, the update time of our algorithm is optimal. Despite these positive results, for a certain range of values of d and n we show that estimating the number of distinct elements requires Ω(1/ε²) bits of space even in this model. Our lower bound subsumes previous bounds, showing that even for natural choices of data the problem is hard.
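A standard small-space baseline for this problem is a k-minimum-values (KMV) sketch, which tracks only the k smallest hash values seen; the version below (an illustrative textbook baseline, not the paper's algorithm) estimates the distinct count as (k − 1)/v_k, where v_k is the k-th smallest hash mapped into [0, 1):

```python
import hashlib
import heapq

def kmv_distinct(stream, k=64):
    """K-minimum-values estimate of the number of distinct elements:
    keep the k smallest distinct hash values in [0, 1) and estimate
    the count as (k - 1) / v_k, with v_k the k-th smallest hash."""
    heap = []     # max-heap (values negated) holding the k smallest hashes
    seen = set()  # hash values currently in the heap, for deduplication
    for item in stream:
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16) / 16 ** 40
        if h in seen:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            seen.add(h)
        elif h < -heap[0]:            # smaller than the current k-th smallest
            evicted = -heapq.heappushpop(heap, -h)
            seen.discard(evicted)
            seen.add(h)
    if len(heap) < k:
        return len(heap)              # fewer than k distinct items: exact
    return int((k - 1) / (-heap[0]))
```

The relative error of KMV scales roughly as 1/√k, which is exactly the 1/ε² space behavior the abstract's lower bounds are about.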
Revisiting the Direct Sum Theorem and Space Lower Bounds in Random Order Streams
, 2009
"... Estimating frequency moments and Lp distances are well studied problems in the adversarial data stream model and tight space bounds are known for these two problems. There has been growing interest in revisiting these problems in the framework of randomorder streams. The best space lower bound know ..."
Abstract

Cited by 6 (0 self)
Estimating frequency moments and L_p distances are well-studied problems in the adversarial data stream model, and tight space bounds are known for these two problems. There has been growing interest in revisiting these problems in the framework of random-order streams. The best space lower bound known for computing the k-th frequency moment in random-order streams is Ω(n^(1−2.5/k)) by Andoni et al., and it is conjectured that the true lower bound should be Ω(n^(1−2/k)). In this paper, we resolve this conjecture. In our approach, we revisit the direct sum theorem developed by Bar-Yossef et al. in a random-partition private-messages model and provide a tight Ω(n^(1−2/k)/ℓ) space lower bound for any ℓ-pass algorithm that approximates the frequency moment in the random-order stream model to a constant factor. Finally, we also introduce the notion of space-entropy tradeoffs in random-order streams, as a means of studying intermediate models between adversarial and fully random order streams. We show an almost tight space-entropy tradeoff for the L∞ distance and a nontrivial tradeoff for L_p distances.
Space-efficient estimation of robust statistics and distribution testing
In ICS, 2010
"... Abstract: The generic problem of estimation and inference given a sequence of i.i.d. samples has been extensively studied in the statistics, property testing, and learning communities. A natural quantity of interest is the sample complexity of the particular learning or estimation problem being cons ..."
Abstract

Cited by 5 (1 self)
Abstract: The generic problem of estimation and inference given a sequence of i.i.d. samples has been extensively studied in the statistics, property testing, and learning communities. A natural quantity of interest is the sample complexity of the particular learning or estimation problem being considered. While sample complexity is an important component of the computational efficiency of the task, it is also natural to consider the space complexity: do we need to store all the samples as they are drawn, or is it sufficient to use memory that is significantly sublinear in the sample complexity? Surprisingly, this aspect of the complexity of estimation has received significantly less attention in all but a few specific cases. While space-bounded, sequential computation is the purview of the field of data-stream computation, almost all of the literature on the algorithmic theory of data streams considers only “empirical problems”, where the goal is to compute a function of the data present in the stream rather than to infer something about the source of the stream. Our contributions are twofold. First, we provide results connecting space efficiency to the estimation of robust statistics from a sequence of i.i.d. samples. Robust statistics are a particularly interesting class of statistics in our setting because, by definition, they are resilient to noise or errors in the sampled data. We show that this property is enough to ensure that very space-efficient stream algorithms exist for their estimation. In contrast, the numerical value of a “non-robust” statistic can change dramatically with additional samples, and this limits the utility of any finite-length sequence of samples. Second, we present a general result that captures a tradeoff between sample and space complexity in the context of distributional property testing.
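The robustness property the abstract relies on is easy to see numerically: corrupting a single sample barely moves the median but can move the mean arbitrarily far. The following sketch (illustrative only; the function name and parameters are hypothetical, and it says nothing about the space-efficiency half of the argument) compares the two shifts:

```python
import random
import statistics

def compare_robustness(n=1000, outlier=1e9, seed=0):
    """Shift in the median vs. the mean when one of n Gaussian samples
    is replaced by a single huge outlier."""
    rng = random.Random(seed)
    samples = [rng.gauss(0, 1) for _ in range(n)]
    corrupted = samples[:-1] + [outlier]  # corrupt exactly one sample
    med_shift = abs(statistics.median(corrupted) - statistics.median(samples))
    mean_shift = abs(statistics.fmean(corrupted) - statistics.fmean(samples))
    return med_shift, mean_shift
```

Here the median (a robust statistic) moves by at most one gap between adjacent order statistics, while the mean absorbs roughly outlier/n of the corruption.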
Best-Order Streaming Model
"... Abstract. We study a new model of computation called stream checking on graph problems where a spacelimited verifier has to verify a proof sequentially (i.e., it reads the proof as a stream). Moreover, the proof itself is nothing but a reordering of the input data. This model has a close relationsh ..."
Abstract

Cited by 4 (0 self)
Abstract. We study a new model of computation, called stream checking, on graph problems where a space-limited verifier has to verify a proof sequentially (i.e., it reads the proof as a stream). Moreover, the proof itself is nothing but a reordering of the input data. This model has a close relationship to many models of computation in other areas, such as data streams, communication complexity, and proof checking, and could be used in applications such as cloud computing. In this paper we focus on graph problems where the input is a sequence of edges. We show that checking whether a graph has a perfect matching is impossible to do deterministically using small space. To contrast this, we show that randomized verifiers are powerful enough to check whether a graph has a perfect matching or is connected.
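The model can be illustrated with a simplistic verifier sketch (a hypothetical helper, not the paper's protocol): the prover reorders the edge stream so that a claimed perfect matching arrives as the first n/2 edges, and the verifier checks it with one bit per vertex. This uses Θ(n) bits, far more than the small-space regime the paper's lower bound concerns; it only shows how a reordering of the input can serve as a proof.

```python
def check_matching_proof(n, edge_stream):
    """Best-order verifier sketch for a graph on vertices 0..n-1: the
    prover promises that the first n//2 edges of the stream form a
    perfect matching. With one bit per vertex, verify that those edges
    touch every vertex exactly once; later edges need no inspection."""
    matched = [False] * n
    for i, (u, v) in enumerate(edge_stream):
        if i < n // 2:
            if u == v or matched[u] or matched[v]:
                return False          # claimed matching is not a matching
            matched[u] = matched[v] = True
    return all(matched)               # every vertex must be covered
```

Note that the verifier never stores an edge: soundness comes entirely from the prover being forced to present the witness edges first.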