Results 1 – 7 of 7
Randomized Synopses for Query Assurance on Data Streams
Abstract

Cited by 9 (2 self)
Due to the overwhelming flow of information in many data stream applications, many companies may not be willing to acquire the necessary resources for deploying a Data Stream Management System (DSMS), choosing, alternatively, to outsource the data stream and the desired computations to a third party. But data outsourcing and remote computations intrinsically raise issues of trust, making outsourced query assurance on data streams a problem with important practical implications. Consider a setting where a continuous “GROUP BY, SUM” query is processed using a remote, untrusted server. A client with limited processing capabilities, observing exactly the same stream as the server, registers the query on the server’s DSMS and receives results upon request. The client wants to verify the integrity of the results using significantly fewer resources than evaluating the query locally. Towards that goal, we propose a probabilistic verification algorithm for selection and aggregate/group-by queries that uses constant space irrespective of the result-set size, has low update cost per stream element, and can have arbitrarily small probability of failure. We generalize this algorithm to allow some tolerance on the number of erroneous groups detected, in order to support semantic load shedding on the server. We also discuss the hardness of supporting random load shedding. Finally, we implement our techniques and perform an empirical evaluation using live network traffic.
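The flavor of such a constant-space verifier can be illustrated with a simple algebraic fingerprint. This is a hedged sketch of the general idea, not the paper's exact construction: the client folds every stream element (group, value) into a single field element using a secret random evaluation point, and the server's claimed result set is accepted only if it fingerprints to the same value.

```python
import random

# Probabilistic verification of a continuous GROUP BY, SUM query.
# The client maintains F = sum(value * r**group) mod p over the stream
# with a secret random point r; a correct result set always matches,
# and a tampered one matches only with probability O(max_group / p).

P = (1 << 61) - 1  # a Mersenne prime; field size bounds the failure probability

def make_verifier(seed=None):
    rng = random.Random(seed)
    r = rng.randrange(1, P)  # secret evaluation point, hidden from the server
    state = {"fp": 0}

    def update(group, value):
        # constant space, one modular exponentiation per stream element
        state["fp"] = (state["fp"] + value * pow(r, group, P)) % P

    def verify(claimed):
        # claimed: iterable of (group, total) pairs returned by the server
        fp = 0
        for g, total in claimed:
            fp = (fp + total * pow(r, g, P)) % P
        return fp == state["fp"]

    return update, verify

update, verify = make_verifier(seed=42)
for g, v in [(3, 10), (7, 5), (3, 2), (1, 4)]:
    update(g, v)

assert verify([(1, 4), (3, 12), (7, 5)])      # correct totals accepted
assert not verify([(1, 4), (3, 13), (7, 5)])  # tampered sum rejected
```

Because two distinct low-degree polynomials agree at a random point of a large field only rarely, the client detects any altered group total with high probability while storing just two numbers.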
K.: Beyond simple aggregates: indexing for summary queries
 In: Proceedings of the 2011 ACM SIGMOD/PODS Conference, 2011
Abstract

Cited by 6 (0 self)
Database queries can be broadly classified into two categories: reporting queries and aggregation queries. The former retrieves a collection of records from the database that match the query's conditions, while the latter returns an aggregate, such as count, sum, average, or max (min), of a particular attribute of these records. Aggregation queries are especially useful in business intelligence and data analysis applications where users are interested not in the actual records but in statistics computed over them. They can also be executed much more efficiently than reporting queries, by embedding properly precomputed aggregates into an index. However, reporting and aggregation queries provide only two extremes for exploring the data. Data analysts often need more insight into the data distribution than what those simple aggregates provide, and yet certainly do not want the sheer volume of data returned by reporting queries. In this paper, we design indexing techniques that allow for extracting a statistical summary of all the records in the query. The summaries we support include frequent items, quantiles, various sketches, and wavelets, all of which are of central importance in massive data analysis. Our indexes require linear space and extract a summary with the optimal or near-optimal query cost.
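One of the summary types the abstract lists, frequent items, can be maintained in small space with the classic Misra–Gries algorithm. The sketch below illustrates the summary itself, not the paper's indexing scheme for precomputing and merging such summaries:

```python
def misra_gries(stream, k):
    """Misra-Gries frequent-items summary using at most k-1 counters.
    Every item whose true frequency exceeds n/k is guaranteed to appear
    in the output, and each reported count undercounts by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # decrement every counter; evict those that reach zero
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

data = ["a"] * 6 + ["b"] * 4 + ["c", "d", "e"]
summary = misra_gries(data, k=3)
assert "a" in summary  # frequency 6 > 13/3, so "a" must survive
```

An index for summary queries would store such small summaries over precomputed ranges and merge them at query time, which is what keeps the query cost near-optimal.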
Sketching sampled data streams
 In: ICDE, 2009
Abstract

Cited by 4 (0 self)
Abstract—Sampling is used as a universal method to reduce the running time of computations – the computation is performed on a much smaller sample and then the result is scaled to compensate for the difference in size. Sketches are a popular approximation method for data streams, and they have proved useful for estimating frequency moments and aggregates over joins. One way to further improve the time performance of sketches is to compute the sketch over a sample of the stream rather than the entire data stream. In this paper we analyze the behavior of the sketch estimator when computed over a sample of the stream, not the entire data stream, for the join-size and self-join-size estimation problems. Our analysis is developed for a generic sampling process. We instantiate the results of the analysis for all three major types of sampling – Bernoulli sampling, which is used for load shedding; sampling with replacement, which is used to generate i.i.d. samples from a distribution; and sampling without replacement, which is used by online aggregation engines – and compare these particular results with the results of the basic sketch estimator. Our experimental results show that the accuracy of the sketch computed over a small sample of the data is, in general, close to the accuracy of the sketch estimator computed over the entire data, even when the sample size is only 10% or less of the dataset size. This is equivalent to a speedup factor of at least 10 when updating the sketch.
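A minimal sketch of the sample-then-sketch idea, assuming an AGMS-style estimator for the self-join size F2 = Σᵢ fᵢ² (a generic illustration, not the paper's exact estimator or its variance analysis):

```python
import hashlib
import random

def sign(item, seed):
    # pseudo-random +/-1 hash of an item; the AGMS analysis assumes
    # 4-wise independence, for which a cryptographic hash is overkill
    # but simple to write down
    h = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return 1 if h[0] & 1 else -1

def ams_f2(stream, num_estimators=100, seed=0):
    # AGMS sketch for the self-join size: each estimator keeps
    # Z = sum over stream elements of sign(x), and E[Z^2] = F2.
    zs = [0] * num_estimators
    for x in stream:
        for j in range(num_estimators):
            zs[j] += sign(x, f"{seed}-{j}")
    return sum(z * z for z in zs) / num_estimators

def f2_over_sample(stream, p, seed=0):
    # Sketch a Bernoulli sample instead of the full stream, then unbias:
    # E[F2_sample] = p^2 * F2 + p(1-p) * F1, and the sample size n
    # estimates p * F1, so F2 ~= (F2_sample - (1-p) * n) / p^2.
    rng = random.Random(seed)
    sample = [x for x in stream if rng.random() < p]
    return (ams_f2(sample, seed=seed) - (1 - p) * len(sample)) / (p * p)
```

On a stream with a single distinct item every estimator is exact (Z = ±f, so Z² = f²), which makes the sketch easy to sanity-check; on mixed streams the estimate concentrates around F2 as the number of estimators grows, with the Bernoulli sampling adding a second, separately analyzable source of variance.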
Distribution-free Bounds for Relational Classification
Abstract

Cited by 1 (1 self)
Statistical Relational Learning (SRL) is a subarea of Machine Learning which addresses the problem of performing statistical inference on data that is correlated and not independently and identically distributed (i.i.d.), as is generally assumed. For the traditional i.i.d. setting, distribution-free bounds exist, such as the Hoeffding bound, which are used to provide confidence bounds on the generalization error of a classification algorithm given its holdout error on a sample of size N. Bounds of this form do not currently exist for the types of interactions in the data that are considered by relational classification algorithms. In this paper we extend the Hoeffding bounds to the relational setting. In particular, we derive distribution-free bounds for certain classes of data generation models that do not produce i.i.d. data and are based on the types of interactions that are considered by relational classification algorithms developed in SRL. We conduct empirical studies on synthetic and real data which show that these data generation models are indeed realistic and the derived bounds are tight enough for practical use.
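The i.i.d. baseline the abstract starts from is easy to state concretely: by the two-sided Hoeffding inequality, with probability at least 1 − δ the generalization error lies within sqrt(ln(2/δ) / (2N)) of the holdout error on N i.i.d. examples. A small illustration of this standard bound (not the paper's relational extension):

```python
import math

def hoeffding_radius(n, delta):
    """Two-sided Hoeffding confidence radius for the mean of n i.i.d.
    [0,1]-bounded variables: with probability >= 1 - delta, the true
    mean lies within this distance of the empirical mean."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

# e.g. holdout error 0.10 on N = 10,000 examples at 95% confidence:
radius = hoeffding_radius(10_000, 0.05)   # ~0.0136
upper_bound = 0.10 + radius               # generalization error <= ~0.114 w.p. 0.95
```

The radius shrinks as 1/sqrt(N), which is exactly the dependence the relational bounds in the paper have to recover for correlated, non-i.i.d. data.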