Results 1  10
of
26
Forward Decay: A Practical Time Decay Model for Streaming Systems
"... Abstract — Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. ..."
Abstract

Cited by 17 (0 self)
 Add to MetaCart
(Show Context)
Abstract — Temporal data analysis in data warehouses and data streaming systems often uses time decay to reduce the importance of older tuples, without eliminating their influence, on the results of the analysis. While exponential time decay is commonly used in practice, other decay functions (e.g. polynomial decay) are not, even though they have been identified as useful. We argue that this is because the usual definitions of time decay are “backwards”: the decayed weight of a tuple is based on its age, measured backward from the current time. Since this age is constantly changing, such decay is too complex and unwieldy for scalable implementation. In this paper, we propose a new class of “forward ” decay functions based on measuring forward from a fixed point in time. We show that this model captures the more practical models already known, such as exponential decay and landmark windows, but also includes a wide class of other types of time decay. We provide efficient algorithms to compute a variety of aggregates and draw samples under forward decay, and show that these are easy to implement scalably. Further, we provide empirical evidence that these can be executed in a production data stream management system with little or no overhead compared to the undecayed computations. Our implementation required no extensions to the query language or the DSMS, demonstrating that forward decay represents a practical model of time decay for systems that deal with timebased data. I.
Stream sampling for varianceoptimal estimation of subset sums
 IN ACMSIAM SYMPOSIUM ON DISCRETE ALGORITHMS
, 2009
"... From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of online reservoir sampling, thinking of the generic sample as a reservoir. We present ..."
Abstract

Cited by 11 (8 self)
 Add to MetaCart
(Show Context)
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of online reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of estimation quality. VarOptk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any offline scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worstcase bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time, which is optimal even on the word RAM. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting.
Leveraging discarded samples for tighter estimation of multipleset aggregates
 In Proceedings of the ACM SIGMETRICS’09 Conference
, 2009
"... Many datasets such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy ..."
Abstract

Cited by 9 (7 self)
 Add to MetaCart
Many datasets such as market basket data, text or hypertext documents, and sensor observations recorded in different locations or time periods, are modeled as a collection of sets over a ground set of keys. We are interested in basic aggregates such as the weight or selectivity of keys that satisfy some selection predicate defined over keys ’ attributes and membership in particular sets. This general formulation includes basic aggregates such as the Jaccard coefficient, Hamming distance, and association rules. On massive data sets, exact computation can be inefficient or infeasible. Sketches based on coordinated random samples are classic summaries that support approximate query processing. Queries are resolved by generating a sketch (sample) of the union of sets used in the predicate from the sketches these sets and then applying an estimator to this unionsketch. We derive novel tighter (unbiased) estimators that leverage sampled keys that are present in the union of applicable sketches but excluded from the union sketch. We establish analytically that our estimators dominate estimators applied to the unionsketch for all queries and data sets. Empirical evaluation on synthetic and real data reveals that on typical applications we can expect a 25%4 fold reduction in estimation error. 1.
Continuous Sampling from Distributed Streams
, 2011
"... A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed ..."
Abstract

Cited by 7 (3 self)
 Add to MetaCart
A fundamental problem in data management is to draw and maintain a sample of a large data set, for approximate query answering, selectivity estimation, and query planning. With large, streaming data sets, this problem becomes particularly difficult when the data is shared across multiple distributed sites. The main challenge is to ensure that a sample is drawn uniformly across the union of the data while minimizing the communication needed to run the protocol on the evolving data. At the same time, it is also necessary to make the protocol lightweight, by keeping the space and time costs low for each participant. In this paper, we present communicationefficient protocols for continuously maintaining a sample (both with and without replacement) from k distributed streams. These apply to the case when we want a sample from the full streams, and to the sliding window cases of only the W most recent elements, or arrivals within the last w time units. We show that our protocols are optimal (up to logarithmic factors), not just in terms of the communication used, but also the time and space costs for each participant.
Coordinated Weighted Sampling for Estimating Aggregates Over Multiple Weight Assignments
, 2009
"... Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attribute ..."
Abstract

Cited by 7 (6 self)
 Add to MetaCart
Many data sources are naturally modeled by multiple weight assignments over a set of keys: snapshots of an evolving database at multiple points in time, measurements collected over multiple time periods, requests for resources served at multiple locations, and records with multiple numeric attributes. Over such vectorweighted data we are interested in aggregates with respect to one set of weights, such as weighted sums, and aggregates over multiple sets of weights such as the L1 difference. Samplebased summarization is highly effective for data sets that are too large to be stored or manipulated. The summary facilitates approximate processing queries that may be specified after the summary was generated. Current designs, however, are geared for data sets where a single scalar weight is associated with each key. We develop a sampling framework based on coordinated weighted samples that is suited for multiple weight assignments and obtain estimators that are orders of magnitude tighter than previously possible. We demonstrate the power of our methods through an extensive empirical evaluation on diverse data sets ranging from IP network to stock quotes data.
EFFICIENT STREAM SAMPLING FOR VARIANCEOPTIMAL ESTIMATION OF SUBSET SUMS
"... From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of online reservoir sampling, thinking of the generic sample as a reservoir. We presen ..."
Abstract

Cited by 4 (4 self)
 Add to MetaCart
From a high volume stream of weighted items, we want to maintain a generic sample of a certain limited size k that we can later use to estimate the total weight of arbitrary subsets. This is the classic context of online reservoir sampling, thinking of the generic sample as a reservoir. We present an efficient reservoir sampling scheme, VarOptk, that dominates all previous schemes in terms of estimation quality. VarOptk provides variance optimal unbiased estimation of subset sums. More precisely, if we have seen n items of the stream, then for any subset size m, our scheme based on k samples minimizes the average variance over all subsets of size m. In fact, the optimality is against any offline scheme with k samples tailored for the concrete set of items seen. In addition to optimal average variance, our scheme provides tighter worstcase bounds on the variance of particular subsets than previously possible. It is efficient, handling each new item of the stream in O(log k) time. Finally, it is particularly well suited for combination of samples from different streams in a distributed setting. Key words. Subset sum estimation, weighted sampling, sampling without replacement, reservoir sampling.
K.: Solving nonzero sum multiagent network flow security games with attack costs
 In: AAMAS
, 2012
"... ABSTRACT Moving assets through a transportation network is a crucial challenge in hostile environments such as future battlefields where malicious adversaries have strong incentives to attack vulnerable patrols and supply convoys. Intelligent agents must balance network costs with the harm that can ..."
Abstract

Cited by 4 (1 self)
 Add to MetaCart
(Show Context)
ABSTRACT Moving assets through a transportation network is a crucial challenge in hostile environments such as future battlefields where malicious adversaries have strong incentives to attack vulnerable patrols and supply convoys. Intelligent agents must balance network costs with the harm that can be inflicted by adversaries who are in turn acting rationally to maximize harm while trading off against their own costs to attack. Furthermore, agents must choose their strategies even without full knowledge of their adversaries' capabilities, costs, or incentives. In this paper we model this problem as a nonzero sum game between two players, a sender who chooses flows through the network and an adversary who chooses attacks on the network. We advance the state of the art by: (1) moving beyond the zerosum games previously considered to nonzero sum games where the adversary incurs attack costs that are not incorporated into the payoff of the sender; (2) introducing a refinement of the Stackelberg equilibrium that is more appropriate to network security games than previous solution concepts; and (3) using Bayesian games where the sender is uncertain of the capabilities, payoffs, and costs of the adversary. We provide polynomial time algorithms for finding equilibria in each of these cases. We also show how our approach can be used for games where there are multiple adversaries.
What you can do with coordinated samples
 In The 17th. International Workshop on Randomization and Computation (RANDOM
, 2013
"... ar ..."
(Show Context)
Independent range sampling
 In PODS
, 2014
"... This paper studies the independent range sampling problem. The input is a set P of n points in R. Given an interval q = [x, y] and an integer t ≥ 1, a query returns t elements uniformly sampled (with/without replacement) from P ∩ q. The sampling result must be independent from those returned by the ..."
Abstract

Cited by 3 (1 self)
 Add to MetaCart
(Show Context)
This paper studies the independent range sampling problem. The input is a set P of n points in R. Given an interval q = [x, y] and an integer t ≥ 1, a query returns t elements uniformly sampled (with/without replacement) from P ∩ q. The sampling result must be independent from those returned by the previous queries. The objective is to store P in a structure for answering all queries efficiently. If P fits in memory, the problem is interesting when P is dynamic (i.e., allowing insertions and deletions). The state of the art is a structure of O(n) space that answers a query in O(t log n) time, and supports an update in O(log n) time. We describe a new structure of O(n) space that answers a query in O(log n+t) expected time, and supports an update in O(log n) time. If P does not fit in memory, the problem is challenging even when P is static. The best known structure incurs O(logB n + t) I/Os per query, where B is the block size. We develop a new structure of O(n/B) space that answers a query in O(log⋆(n/B)+logB n+(t/B) logM/B(n/B)) amortized expected I/Os, where M is the memory size, and log⋆(n/B) is the number of iterative log2(.) operations we need to perform on n/B before going below a constant. We also give a lower bound argument showing that this is nearly optimal—in particular, the multiplicative term logM/B(n/B) is necessary. Categories and Subject Descriptors F.2.2 [Analysis of algorithms and problem complex