## Randomized Synopses for Query Assurance on Data Streams

Citations: 7 (1 self)

### BibTeX

@MISC{Yi_randomizedsynopses,
    author = {Ke Yi and Feifei Li and Marios Hadjieleftheriou and Divesh Srivastava and George Kollios},
    title = {Randomized Synopses for Query Assurance on Data Streams},
    year = {}
}

### Abstract

Due to the overwhelming flow of information in many data stream applications, many companies may not be willing to acquire the necessary resources for deploying a Data Stream Management System (DSMS), choosing instead to outsource the data stream and the desired computations to a third party. But data outsourcing and remote computations intrinsically raise issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Consider a setting where a continuous “GROUP BY, SUM” query is processed using a remote, untrusted server. A client with limited processing capabilities, observing exactly the same stream as the server, registers the query on the server’s DSMS and receives results upon request. The client wants to verify the integrity of the results using significantly fewer resources than evaluating the query locally. Towards that goal, we propose a probabilistic verification algorithm for selection and aggregate/group-by queries that uses constant space irrespective of the result-set size, has low update cost per stream element, and can have arbitrarily small probability of failure. We generalize this algorithm to allow some tolerance on the number of erroneous groups detected, in order to support semantic load shedding on the server. We also discuss the hardness of supporting random load shedding. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
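The abstract does not spell out the synopsis, but the citation contexts on this page quote the PIRS-1 construction: pick a prime $p$ with $\max\{m/\delta, n\} < p \le 2\max\{m/\delta, n\}$, draw $\alpha$ uniformly from $\mathbb{Z}_p$, and maintain $X(v) = \prod_{i=1}^{n} (\alpha - i)^{v_i} \bmod p$. The Python sketch below illustrates that idea under assumptions of our own: the class name `Pirs1`, the helper `next_prime_above`, the trial-division prime search, and the `update`/`verify` method names are hypothetical and are not taken from the paper.

```python
import math
import random


def next_prime_above(lo):
    """Return the smallest prime strictly greater than lo (trial division, demo only)."""
    def is_prime(k):
        if k < 2:
            return False
        if k % 2 == 0:
            return k == 2
        d = 3
        while d * d <= k:
            if k % d == 0:
                return False
            d += 2
        return True

    k = lo + 1
    while not is_prime(k):
        k += 1
    return k


class Pirs1:
    """Hypothetical sketch of a PIRS-1 style synopsis over groups 1..n.

    State is three integers (p, alpha, x), independent of the number of groups.
    """

    def __init__(self, n, m, delta, rng=None):
        rng = rng or random.Random()
        bound = max(math.ceil(m / delta), n)
        # Bertrand's Postulate guarantees a prime in (bound, 2*bound].
        self.p = next_prime_above(bound)
        self.alpha = rng.randrange(self.p)  # secret evaluation point in Z_p
        self.x = 1                          # X(v) for the all-zero vector

    def update(self, i, u):
        """Apply stream element (i, u): group i changes by u (u may be negative)."""
        base = (self.alpha - i) % self.p
        if u < 0:
            # Z_p has no division, so use the multiplicative inverse instead.
            # pow(..., -1, p) needs Python 3.8+ and raises ValueError if base == 0.
            base = pow(base, -1, self.p)
            u = -u
        self.x = (self.x * pow(base, u, self.p)) % self.p

    def verify(self, w):
        """Check a claimed result vector w (w[0] is group 1) against the synopsis."""
        y = 1
        for i, w_i in enumerate(w, start=1):
            y = (y * pow((self.alpha - i) % self.p, w_i, self.p)) % self.p
        return y == self.x
```

A client would call `update` once per stream element and `verify` when the server returns its per-group sums. A correct answer always passes; because two distinct answers give distinct polynomials of degree at most $m$ in $\alpha$, a wrong answer passes for at most $m$ of the $p$ possible choices of $\alpha$, which is below $\delta$ for the prime range quoted above.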

### Citations

2324 | The Art of Computer Programming
- Knuth
- 1973
Citation Context: ...or count queries), as the field $\mathbb{Z}_p$ is not equipped with division. We need first to compute $(\alpha - i)^{-1}$, the multiplicative inverse of $(\alpha - i)$ modulo $p$, in $O(\log p)$ time (using Euclid’s gcd algorithm [26]), and then compute $\left((\alpha - i)^{-1}\right)^{|u|}$. PIRS-2. When $n \ll m$ we can actually do slightly better with PIRS-2. Now we choose the prime $p$ between $\max\{m, n/\delta\}$ and $2\max\{m, n/\delta\}$. For $\alpha$ chosen uniformly at random...

1917 | Randomized Algorithms
- Motwani, Raghavan
- 1995
Citation Context: ...osen point. The technique of verifying polynomial identities can be traced back to the seventies [16]. It has found applications in verifying matrix multiplications and pattern matching, among others [17]. PIRS-1. Let $p$ be a prime s.t. $\max\{m/\delta, n\} < p \le 2\max\{m/\delta, n\}$. According to Bertrand’s Postulate such a $p$ always exists [18]. For PIRS-1, we choose $\alpha$ from $\mathbb{Z}_p$ uniformly at random and compute $X(v) = $...

713 | The space complexity of approximating the frequency moments
- Alon, Matias, et al.
- 1999
Citation Context: ...tly the same procedure as the server, in essence maintaining all the $v_i$’s. This simple method consumes Θ(n) space, and makes outsourcing meaningless. Therefore, as with all data stream problems [10], [12], we are only interested in solutions that use space significantly smaller than O(n). Next, we briefly mention some intuitive solutions and discuss why they do not solve the CQV problem. We focus on c...

672 | Models and issues in data stream systems
- Babcock, Babu, et al.
- 2002
Citation Context: ...ally supports the tumbling window semantics. Furthermore, the proposed scheme can be easily extended for sliding window semantics (cf. Section VII). Readers are referred to two excellent papers [10], [11] for detailed discussions of data stream models. Problem Definition. The problem of Continuous Query Verification on data streams (CQV) is defined as follows: Definition 1: Given a data stream S, a ...

418 | TelegraphCQ: Continuous dataflow processing for an uncertain world
- Chandrasekaran, Cooper, et al.
- 2003
Citation Context: ...er of commercial Data Stream Management Systems (DSMS) have been developed recently to handle the continuous nature of data being generated by a variety of applications, like telephony and networking [1], [2], [3], [4], [5], [6]. Companies deploy DSMSs for gathering invaluable statistics about day-to-day operations. Due to the overwhelming data flow observed, some companies are not willing to acquire...

411 | Fast probabilistic algorithms for verification of polynomial identities
- Schwartz
- 1980
Citation Context: ...c techniques combined with randomization, like PIRS. Examples include verifying univariate polynomial multiplication, multivariate polynomial identities, and verifying equality of strings [16], [17], [29]. An excellent related discussion appears in [17]. X. CONCLUSION Verifying query results in an outsourced data stream setting is a problem that has not been addressed before. We proposed various space...

404 | Data Streams: Algorithms and Applications
- Muthukrishnan
- 2005
Citation Context: ... have $u^{\tau} = 1$ for all $\tau$. We assume that the $L_1$ norm of $v^{\tau}$ is always bounded by some large $m$, i.e., at any $\tau$, $\|v^{\tau}\|_1 = \sum_{i=1}^{n} v_i^{\tau} \le m$. Our streaming model is the same as the general Turnstile model of [10], and our algorithms are designed to work under this model. Our solution naturally supports the tumbling window semantics. Furthermore, the proposed scheme can be easily extended for sliding window se...

360 | Probabilistic counting algorithms for data base applications
- Flajolet, Martin
- 1985
Citation Context: ...l for practical r’s and γ’s. Thus, random sampling can at most reduce the space cost by a tiny fraction. Sketches. Recent years have witnessed a large number of sketching techniques (e.g. [12], [13], [14]) that are designed to summarize high-volume streaming data with small space. It is tempting to maintain such a sketch for the purpose of verification. However, although it is imaginable that such an ...

348 | New hash functions and their use in authentication and set equality
- Wegman, Carter
- 1981
Citation Context: ... synopses, we also need to generate the γ-wise independent random numbers. Using standard techniques we can generate them on-the-fly using $O(\gamma \log n)$ truly random bits. Specifically, the technique of [19] for constructing k-universal hash families can be used. Let $p$ be some prime between $n$ and $2n$, and $\alpha_0, \ldots, \alpha_{\gamma-1}$ be $\gamma$ random numbers chosen uniformly and independently from $\mathbb{Z}_p$. Then we set $b_i$ = ((α...

339 | Approximate frequency counts over data streams
- Manku, Motwani
- 2002
Citation Context: ...emory usage for PIRS; similar results were observed for PIRS$^{\gamma}$ and PIRS$^{\pm\gamma}$. IX. RELATED WORK As discussed in Section III, sketching techniques cannot solve the verification problem [12], [13], [14], [23]. Various other authentication techniques though are related. By viewing v as a message, the client can compute an authenticated signature σ(v) and any alteration to v will be detected. However, a fun...

313 | An improved data stream summary: The count-min sketch and its applications
- Cormode, Muthukrishnan
Citation Context: ...o small for practical r’s and γ’s. Thus, random sampling can at most reduce the space cost by a tiny fraction. Sketches. Recent years have witnessed a large number of sketching techniques (e.g. [12], [13], [14]) that are designed to summarize high-volume streaming data with small space. It is tempting to maintain such a sketch for the purpose of verification. However, although it is imaginable that su...

313 | Designing programs that check their work
- Blum, Kannan
- 1989
Citation Context: ...mating certain statistics of the stream (e.g. the frequency moments), they cannot solve the verification problem. Another approach for solving the CQV problem is to use program execution verification [13, 40] on the DSMS of the server. We briefly discuss two representatives from this field [17, 42]. The main idea behind these approaches is for the client to precompute hash values in nodes of the control...

237 | Maintaining stream statistics over sliding windows
- Datar, Gionis, et al.
- 2002
Citation Context: ...osable, i.e., for any $v_1, v_2$, $X(v_1 + v_2) = X(v_1) \cdot X(v_2)$ (and for PIRS-2, $X(v_1 + v_2) = X(v_1) + X(v_2)$). This property allows us to extend PIRS for periodically sliding windows using standard techniques [20]. Take for example the following query: SELECT SUM(packet_size) FROM IP_Trace GROUP BY source_ip, destination_ip WITHIN LAST 1 hour SLIDE EVERY 5 minutes. In this case, we build one PIRS-1 for every 5-...

204 | The Design of the Borealis Stream Processing Engine. CIDR ’05
- Abadi, Ahmad, et al.
Citation Context: ...implement our techniques and perform an empirical evaluation using live network traffic. 1 Introduction A large number of commercial Data Stream Management Systems (DSMS) have been developed recently [16, 25, 4, 19, 14, 2], mainly driven by the continuous nature of the data being generated by a variety of real-world applications, like telephony and networking. Many companies deploy such DSMSs for the purpose of gatheri...

193 | Load Shedding in a data stream manager
- Tatbul, Cetintemel, et al.
- 2003
Citation Context: ...l resources to provide fraudulent results. In other cases, incorrect answers might be a result of faulty software, or due to load shedding strategies, which are essential for dealing with bursty data [8]. In critical applications real time execution assurance is essential. Ideally, the client should be able to verify the integrity of the computations performed in real time using significantly fewer r...

148 | Counting distinct elements in a data stream
- Bar-Yossef, Jayram, et al.
- 2002
Citation Context: ...sly too small for practical r’s and γ’s. Thus, random sampling can at most reduce the space cost by a tiny fraction. Sketches. Recent years have witnessed a large number of sketching techniques (e.g. [3, 18, 10, 21]) that are designed to summarize high-volume streaming data with small space. It is tempting to maintain such a sketch K(v) for the purpose of verification. When the server returns some w, we compute ...

126 | XOR MACs: new methods for message authentication using block ciphers
- Bellare, Guérin, et al.
- 1995
Citation Context: ...setting the message v is constantly updated. Cryptography researchers have devoted considerable effort in designing incremental cryptography [11]. Among them incremental signature and incremental MAC [11, 12] are especially interesting for this work. However, these techniques only support updates for block edit operations such as insert and delete, i.e., by viewing v as blocks of bits, they are able to co...

114 | Software reliability via run-time result-checking
- Blum, Wasserman
- 1997
Citation Context: ...mating certain statistics of the stream (e.g. the frequency moments), they cannot solve the verification problem. Another approach for solving the CQV problem is to use program execution verification [13, 40] on the DSMS of the server. We briefly discuss two representatives from this field [17, 42]. The main idea behind these approaches is for the client to precompute hash values in nodes of the control...

102 | Load Shedding for Aggregation Queries over Data Streams. ICDE
- Babcock, Datar, et al.
- 2004
Citation Context: ... spurious answers). In other cases, incorrect answers might simply be a result of faulty software, or due to load shedding strategies, which are essential tools for dealing with bursty streaming data [39, 5, 8, 38]. Ideally, the client should be able to verify the integrity of the computations performed by the server using significantly fewer resources than evaluating the queries locally. Moreover, the client s...

94 | STREAM: The Stanford Stream Data Manager
- Arasu, Babcock, et al.
- 2003
Citation Context: ...ercial Data Stream Management Systems (DSMS) have been developed recently to handle the continuous nature of data being generated by a variety of applications, like telephony and networking [1], [2], [3], [4], [5], [6]. Companies deploy DSMSs for gathering invaluable statistics about day-to-day operations. Due to the overwhelming data flow observed, some companies are not willing to acquire the neces...

86 | Approximate counts and quantiles over sliding windows
- Arasu, Manku
- 2004
Citation Context: ... spurious answers). In other cases, incorrect answers might simply be a result of faulty software, or due to load shedding strategies, which are essential tools for dealing with bursty streaming data [39, 5, 8, 38]. Ideally, the client should be able to verify the integrity of the computations performed by the server using significantly fewer resources than evaluating the queries locally. Moreover, the client s...

81 | Incremental cryptography: The case of hashing and signing
- Bellare, Goldreich, et al.
- 1994
Citation Context: ...eration to v will be detected. However, a fundamental problem is performing incremental updates at the client side, without storing v. Considerable effort has been devoted in incremental cryptography [24] for that purpose. In particular, incremental signatures and incremental MAC [24] are closely related, but they support updates only for block edit operations, and not arithmetic updates. There is als...

79 | Introduction to number theory
- Nagell
- 1951
Citation Context: ...t has found applications in e.g. verifying matrix multiplications and pattern matching [30]. PIRS-1. Let $p$ be some prime such that $\max\{m/\delta, n\} < p \le 2\max\{m/\delta, n\}$. According to Bertrand’s Postulate [32] such a $p$ always exists. We will work in the field $\mathbb{Z}_p$, i.e., all additions and multiplications are done modulo $p$. For the first PIRS, denoted PIRS-1, we choose $\alpha$ from $\mathbb{Z}_p$ uniformly at random and comput...

74 | Maintaining variance and k-medians over data stream windows
- Babcock, Datar, et al.
- 2003
Citation Context: ...sults were observed for PIRS$^{\gamma}$ and PIRS$^{\pm\gamma}$. 9 Related Work PIRS is a way of summarizing the underlying data streams. In that respect our work is related with the line of work on sketching techniques [3, 29, 9, 18, 21, 23]. As discussed in Section 3, since these sketches are mainly designed for estimating certain statistics of the stream (e.g. the frequency moments), they cannot solve the verification problem. Another ...

72 | Secure hierarchical in-network aggregation in sensor networks
- Chan, Perrig, et al.
- 2006
Citation Context: ...ations, and not arithmetic updates. There is also considerable work on authenticated queries in outsourced databases [7], [25], but they do not apply for online, one-pass streaming scenarios. Work in [26], [27] has studied secure in-network aggregation in sensor networks using cryptographic tools. Nevertheless, the problem setting therein is fundamentally different from the one discussed here. Authent...

61 | Dynamic authenticated index structures for outsourced databases
- Li, Hadjieleftheriou, et al.
- 2006
Citation Context: ...ches are inapplicable. There is also considerable work on authenticating query execution in an outsourced database setting [37, 33, 27, 28]. Here, the client queries the publisher’s data through a third party, namely the server, and the goal is to design efficient solutions to enable the client to authenticate the query results. In [33, ... [Figure panel (c): PIRS$^{\pm\gamma}$, γ = 10; x-axis: number of faulty groups.]

53 | Fast probabilistic algorithms
- Freivalds
- 1979
Citation Context: .... The synopses, as the name suggests, are based on testing the identity of polynomials at a randomly chosen point. The technique of verifying polynomial identities can be traced back to the seventies [16]. It has found applications in verifying matrix multiplications and pattern matching, among others [17]. PIRS-1. Let $p$ be a prime s.t. $\max\{m/\delta, n\} < p \le 2\max\{m/\delta, n\}$. According to Bertrand’s Postulat...

48 | Verifying completeness of relational query results in data publishing
- Pang, Jain, et al.
- 2005
Citation Context: ...AC [24] are closely related, but they support updates only for block edit operations, and not arithmetic updates. There is also considerable work on authenticated queries in outsourced databases [7], [25], but they do not apply for online, one-pass streaming scenarios. Work in [26], [27] has studied secure in-network aggregation in sensor networks using cryptographic tools. Nevertheless, the problem s...

47 | Query execution assurance for outsourced databases
- Sion
- 2005
Citation Context: ...a consequence, outsourced query assurance on data streams is a problem with important practical implications. This problem has been studied before in the context of static outsourced data [7]. To the best of our knowledge, this is the first work to address query assurance on data streams. Consider a setting where continuous queries are processed using a remote, untrusted server (that can ...

47 | On universal classes of fast high performance hash functions, their time-space tradeoff, and their applications
- Siegel
- 1989
Citation Context: ...the time bounds. We can trade space for faster update times by using other γ-wise independent random number generation schemes. For instance by using an extra $O(n^{\epsilon})$ words per layer, the technique of [36] can generate a $b_i$ in $O(1)$ time provided that $\gamma \le n^{\epsilon^3/2}$, for $\epsilon > 0$. The update and verification times become $O(\log \frac{1}{\delta})$ and $O(n \log \frac{1}{\delta})$, and the space bound $O(n^{\epsilon} \log \frac{1}{\delta} \log n)$ bits. 5.2 PIRS$^{\pm\gamma}$...

40 | Nile: A query processing engine for data streams
- Hammad, Mokbel, et al.
- 2004
Citation Context: ... commercial Data Stream Management Systems (DSMS) have been developed recently to handle the continuous nature of data being generated by a variety of applications, like telephony and networking [1], [2], [3], [4], [5], [6]. Companies deploy DSMSs for gathering invaluable statistics about day-to-day operations. Due to the overwhelming data flow observed, some companies are not willing to acquire the ...

39 | Monitoring streams—A new class of data management applications
- Carney, Cetintemel, et al.
- 2002
Citation Context: ...a Stream Management Systems (DSMS) have been developed recently to handle the continuous nature of data being generated by a variety of applications, like telephony and networking [1], [2], [3], [4], [5], [6]. Companies deploy DSMSs for gathering invaluable statistics about day-to-day operations. Due to the overwhelming data flow observed, some companies are not willing to acquire the necessary resou...

34 | Window-aware load shedding for aggregation queries over data streams
- Tatbul, Zdonik
- 2006
Citation Context: ... spurious answers). In other cases, incorrect answers might simply be a result of faulty software, or due to load shedding strategies, which are essential tools for dealing with bursty streaming data [39, 5, 8, 38]. Ideally, the client should be able to verify the integrity of the computations performed by the server using significantly fewer resources than evaluating the queries locally. Moreover, the client s...

31 | Oblivious hashing: A stealthy software integrity verification primitive
- Chen, Venkatesan, et al.
- 2002
Citation Context: ... verification problem. Another approach for solving the CQV problem is to use program execution verification [13, 40] on the DSMS of the server. We briefly discuss two representatives from this field [17, 42]. The main idea behind these approaches is for the client to precompute hash values in nodes of the control... [Figure: y-axis: ratio of raising alarms; series: PIRS$^{10}$, PIRS$^{\pm 10}$.]

27 | Multiple aggregations over data streams
- Zhang, Koudas, et al.
- 2005
Citation Context: ...de applications in monitoring and statistical analysis of data streams (e.g., in networking and telephony applications). Previous work has addressed exactly these types of queries numerous times (cf. [9] and related work therein). For example, a query that appears frequently in network monitoring applications is the following: SELECT SUM(packet_size) FROM IP_Trace GROUP BY source_ip, destination_ip I...

22 | Proof sketches: Verifiable in-network aggregation
- Garofalakis, Hellerstein, et al.
- 2007
Citation Context: ..., and not arithmetic updates. There is also considerable work on authenticated queries in outsourced databases [7], [25], but they do not apply for online, one-pass streaming scenarios. Work in [26], [27] has studied secure in-network aggregation in sensor networks using cryptographic tools. Nevertheless, the problem setting therein is fundamentally different from the one discussed here. Authenticatio...

21 | Proof-infused streams: Enabling authentication of sliding window queries on streams
- Li, Yi, et al.
- 2007
Citation Context: ...g cryptographic tools. Nevertheless, the problem setting therein is fundamentally different from the one discussed here. Authentication of sliding window queries over data streams has been studied in [28]. However, in that model the client does not have access to the data stream and the data publisher is responsible for injecting “proofs” into the stream. Verifying the identity of polynomials is a fin...

11 | Tracking set-expression cardinalities over continuous update streams
- Ganguly, Garofalakis, et al.
Citation Context: ...sults were observed for PIRS$^{\gamma}$ and PIRS$^{\pm\gamma}$. 9 Related Work PIRS is a way of summarizing the underlying data streams. In that respect our work is related with the line of work on sketching techniques [3, 29, 9, 18, 21, 23]. As discussed in Section 3, since these sketches are mainly designed for estimating certain statistics of the stream (e.g. the frequency moments), they cannot solve the verification problem. Another ...

11 | Trust but verify: monitoring remotely executing programs for progress and correctness
- Yang, Butt, et al.
- 2005
Citation Context: ... verification problem. Another approach for solving the CQV problem is to use program execution verification [13, 40] on the DSMS of the server. We briefly discuss two representatives from this field [17, 42]. The main idea behind these approaches is for the client to precompute hash values in nodes of the control... [Figure: y-axis: ratio of raising alarms; series: PIRS$^{10}$, PIRS$^{\pm 10}$.]

9 | Introduction to Number Theory, 2nd ed
- Nagell
- 1964
Citation Context: ...ns in verifying matrix multiplications and pattern matching, among others [17]. PIRS-1. Let $p$ be a prime s.t. $\max\{m/\delta, n\} < p \le 2\max\{m/\delta, n\}$. According to Bertrand’s Postulate such a $p$ always exists [18]. For PIRS-1, we choose $\alpha$ from $\mathbb{Z}_p$ uniformly at random and compute $X(v) = (\alpha - 1)^{v_1} \cdot (\alpha - 2)^{v_2} \cdots (\alpha - n)^{v_n}$, (1) where all subtractions and multiplications are performed in the field $\mathbb{Z}_p$. G...

9 | CADS: Continuous authentication on data streams
- Papadopoulos, Yang, et al.
- 2007
Citation Context: ...eless, the problem setting therein is fundamentally different from the one discussed here. Authentication of sliding window queries and continuous queries over data streams have been studied in [28], [29]. However, in that model the client does not have access to the data stream and the data publisher is responsible for injecting “proofs” into the stream. Verifying the identity of polynomials is a fin...

5 | Pseudo-Random number generation for sketch-based estimations
- Rusu, A
- 2007
Citation Context: ... we will argue that for certain $w \ne v$, the chance that $\sum_{i=1}^{n} h(i) v_i = \sum_{i=1}^{n} h(i) w_i$ is high, thus the sketch will miss $w$ unless many repetitions are used. This AMS sketch uses the BCH4 scheme (cf. [34]) to construct a 4-wise independent random hash function $f : [n] \to \{0, 1\}$, and then set $h(i) = 2f(i) - 1$. Since $\sum_{i=1}^{n} h(i)(v_i - w_i) = 2\sum_{i=1}^{n} f(i)(v_i - w_i) - \sum_{i=1}^{n} (v_i - w_i)$, it is sufficient to con...

4 | http://www.acm.org/sigcomm/ITA
- Arlitt, Jin
- 1998
Citation Context: ...ynopsis in bulk, we incur a smaller, amortized update processing cost per tuple. 8 Empirical Evaluation In this section we evaluate the performance of the proposed synopses over two real data streams [6, 1]. The experimental study demonstrates that our synopses: 1. use very small space; 2. support fast updates; 3. h... [Table 1. Average update time per tuple. Count: 0.98 µs (WC), 0.98 µs (IPs); Sum: 8.01 µs (WC), 6.69 µs (IPs).]

3 | Gigascope: A stream database for internet databases
- Cranor, Johnson, et al.
Citation Context: ...l Data Stream Management Systems (DSMS) have been developed recently to handle the continuous nature of data being generated by a variety of applications, like telephony and networking [1], [2], [3], [4], [5], [6]. Companies deploy DSMSs for gathering invaluable statistics about day-to-day operations. Due to the overwhelming data flow observed, some companies are not willing to acquire the necessary ...

1 | Randomized synopses for query verification on data streams
- Yi, Li, et al.
Citation Context: ...ikely to catch most unintentional errors such as bad communication links, the fact that they are not designed for verification leaves them vulnerable under certain attacks. More precisely, we show in [15] that there are certain $w \ne v$ that correspond to identical sketches with high probability. This means that using the sketch either poses a security risk, or has to incur high space and update costs a...

1 | http://www.acm.org/sigcomm/ITA/, iTA
- Arlitt, Jin
- 1998
Citation Context: ...sis in bulk, we incur a smaller, amortized update processing cost per tuple. VIII. EMPIRICAL EVALUATION In this section we evaluate the performance of the proposed synopses over two real data streams [21], [22]. The experimental study demonstrates that our synopses: 1. use very small space; 2. support fast updates; 3. have very high accuracy; 4. support multiple queries; and 5. are easy to implement. ...