@MISC{Riondato_sampling-basedrandomized,
  author = {Matteo Riondato},
  title  = {Sampling-based Randomized Algorithms for Big Data Analytics},
  year   = {}
}


Abstract

Analyzing huge datasets becomes prohibitively slow when the dataset does not fit in main memory. For most applications, approximations of the results with guaranteed high quality are sufficient, and they can be obtained very fast by analyzing a small random part of the data that fits in memory. We study the use of Vapnik-Chervonenkis (VC) dimension theory to analyze the trade-off between the sample size and the quality of the approximation for fundamental problems in knowledge discovery (frequent itemsets), graph analysis (betweenness centrality), and database management (query selectivity).

We show that the sample size needed to compute a high-quality approximation of the collection of frequent itemsets depends only on the VC-dimension of the problem, which is (tightly) bounded from above by an easy-to-compute characteristic quantity of the dataset. This bound leads to a fast algorithm for mining frequent itemsets, which we also adapt to the MapReduce framework for parallel/distributed computation. We exploit similar ideas to avoid the inclusion of false positives in mining results.

The betweenness centrality index of a vertex in a network measures the relative importance of that vertex by counting the fraction of shortest paths going through it. We show that it is possible to compute a high-quality approximation of the betweenness of all the vertices by sampling shortest paths at random. The sample size depends on the VC-dimension of the problem, which is upper bounded by the logarithm of the maximum number of vertices in a shortest path. The tight bound collapses to a constant when there is a unique shortest path between any two vertices.

The selectivity of a database query is the ratio between the size of its output and the product of the sizes of its input tables. Database Management Systems estimate the selectivity of queries for scheduling and optimization purposes.
We show that it is possible to bound the VC-dimension of queries in terms of their SQL expressions, and to use this bound to compute a sample of the database that allows a much more accurate estimation of the selectivity than is possible using histograms.
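To make the first result concrete, here is a minimal Python sketch of sampling-based frequent itemset mining: draw a uniform sample whose size follows a generic VC-style bound, then report itemsets frequent in the sample at a threshold lowered by eps/2. The constant `c`, the dimension bound `d`, and the restriction to itemsets of size at most 2 are illustrative assumptions, not the dissertation's exact algorithm or constants.

```python
import math
import random
from collections import Counter
from itertools import combinations

def vc_sample_size(d, eps, delta, c=0.5):
    """Illustrative VC-style bound: n >= (c / eps^2) * (d + ln(1/delta)).

    The constant c and the VC-dimension bound d are placeholders; the
    dissertation bounds d by an easy-to-compute characteristic of the dataset.
    """
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))

def approx_frequent_itemsets(transactions, theta, eps, d, delta=0.1):
    """Return itemsets whose frequency in a uniform sample is >= theta - eps/2."""
    n = min(vc_sample_size(d, eps, delta), len(transactions))
    sample = random.sample(transactions, n)
    counts = Counter()
    for t in sample:
        for k in (1, 2):  # itemsets of size <= 2, for brevity
            for itemset in combinations(sorted(t), k):
                counts[itemset] += 1
    # Lowering the threshold by eps/2 keeps truly frequent itemsets in the
    # output with high probability (avoids false negatives).
    return {s: c / n for s, c in counts.items() if c / n >= theta - eps / 2}
```

The key point, matching the abstract, is that `vc_sample_size` depends only on `d`, `eps`, and `delta`, never on the total number of transactions.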
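The betweenness result can likewise be sketched: sample pairs of vertices, pick one shortest path between each pair, and credit 1/r to every internal vertex of the sampled path. Note one simplification in this sketch: choosing a predecessor uniformly at each backward step is not exactly uniform over all shortest paths between the pair (the dissertation's algorithm samples a path uniformly by weighting choices appropriately), and the fixed sample count `r` stands in for the VC-dimension-based sample size.

```python
import random
from collections import defaultdict, deque

def random_shortest_path(adj, s, t, rng):
    # BFS from s recording all shortest-path predecessors, then walk back
    # from t choosing a predecessor at random at each step (a simplification;
    # this is not exactly uniform over all shortest paths).
    preds, dist = defaultdict(list), {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
            if dist[w] == dist[u] + 1:
                preds[w].append(u)
    if t not in dist:
        return None
    path, v = [t], t
    while v != s:
        v = rng.choice(preds[v])
        path.append(v)
    return path[::-1]

def approx_betweenness(adj, r, rng=random):
    # Each internal vertex of a sampled shortest path receives 1/r.
    # In the dissertation, r is chosen from a VC bound that grows only with
    # the log of the maximum number of vertices on a shortest path.
    nodes = list(adj)
    bc = defaultdict(float)
    for _ in range(r):
        s, t = rng.sample(nodes, 2)
        path = random_shortest_path(adj, s, t, rng)
        if path:
            for v in path[1:-1]:
                bc[v] += 1.0 / r
    return dict(bc)
```

On a star graph, only the center vertex ever lies strictly inside a shortest path, so only it accumulates betweenness, as expected.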
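Finally, the definition of selectivity suggests a simple Monte Carlo sketch: evaluate the join predicate on uniformly sampled row pairs and return the hit rate. The fixed sample count `n` here is an illustrative placeholder; the dissertation instead sizes the sample from the VC-dimension of the query class derived from the SQL expression, which is what yields the accuracy guarantee over histograms.

```python
import random

def estimate_selectivity(table_a, table_b, predicate, n, rng=random):
    # Selectivity of a two-table join: |output| / (|A| * |B|).
    # Estimated as the fraction of n uniformly sampled row pairs
    # that satisfy the join predicate.
    hits = 0
    for _ in range(n):
        a = rng.choice(table_a)
        b = rng.choice(table_b)
        if predicate(a, b):
            hits += 1
    return hits / n
```

For an equi-join of two 10-row tables on a key column with distinct values, the true selectivity is 10 / (10 * 10) = 0.1, and the estimate concentrates around that value as `n` grows.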