Results 1-10 of 20
(at Yahoo Labs Barcelona and while affiliated with Brown University)
Abstract We study the problem of graph summarization. Given a large graph we aim at producing a concise lossy representation (a summary) that can be stored in main memory and used to approximately answer queries about the original graph much faster than by using the exact representation. In this work we study a very natural type of summary: the original set of vertices is partitioned into a small number of supernodes connected by superedges to form a complete weighted graph. The superedge weights are the edge densities between vertices in the corresponding supernodes. To quantify the dissimilarity between the original graph and a summary, we adopt the reconstruction error and the cut-norm error. By exposing a connection between graph summarization and geometric clustering problems (i.e., k-means and k-median), we develop the first polynomial-time approximation algorithms to compute the best possible summary of a certain size under both measures. We discuss how to use our summaries to store a (lossy or lossless) compressed graph representation and to approximately answer a large class of queries about the original graph, including adjacency, degree, eigenvector centrality, and triangle and subgraph counting. Using the summary to answer queries is very efficient as the running time to compute the answer depends on the number of supernodes in the summary, rather than the number of nodes in the original graph.
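The connection to geometric clustering can be illustrated with a short sketch: cluster the rows of the adjacency matrix into supernodes, then record edge densities between the clusters as superedge weights. This is a plain Lloyd-style k-means with farthest-first seeding, not the paper's approximation algorithm, and all names are invented:

```python
import numpy as np

def summarize_graph(adj, k, iters=20):
    """Partition the vertices of `adj` (a 0/1 adjacency matrix) into k
    supernodes by k-means on the adjacency rows, and return the supernode
    assignment plus the k x k matrix of superedge densities."""
    rows = adj.astype(float)
    # Deterministic farthest-first initialization of the k centroids.
    centroids = [rows[0]]
    for _ in range(k - 1):
        d = np.min([((rows - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(rows[int(d.argmax())])
    centroids = np.array(centroids)
    for _ in range(iters):
        # Assign each vertex to its nearest centroid (squared Euclidean).
        dists = ((rows[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = rows[labels == c].mean(axis=0)
    # Superedge weight = edge density between the two supernodes.
    density = np.zeros((k, k))
    for a in range(k):
        for b in range(k):
            block = adj[np.ix_(labels == a, labels == b)]
            density[a, b] = block.mean() if block.size else 0.0
    return labels, density

def approx_adjacency(u, v, labels, density):
    """Answer 'are u and v adjacent?' approximately as the superedge
    weight; the cost depends only on the summary, not the graph size."""
    return density[labels[u], labels[v]]
```

An adjacency query then costs a constant-time lookup in the k x k density matrix, independent of the number of nodes in the original graph.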
Fast approximation of betweenness centrality
Learning-based Query Performance Modeling and Prediction

Cited by 18 (2 self)
Abstract — Accurate query performance prediction (QPP) is central to effective resource management, query optimization and query scheduling. Analytical cost models, used in the current generation of query optimizers, have been successful in comparing the costs of alternative query plans, but they are poor predictors of execution latency. As a more promising approach to QPP, this paper studies the practicality and utility of sophisticated learning-based models, which have recently been applied to a variety of predictive tasks with great success, in both static (i.e., fixed) and dynamic query workloads. We propose and evaluate predictive modeling techniques that learn query execution behavior at different granularities, ranging from coarse-grained plan-level models to fine-grained operator-level models. We demonstrate that these two extremes offer a trade-off between high accuracy for static workload queries and generality to unforeseen queries in dynamic workloads, respectively, and introduce a hybrid approach that combines their respective strengths by selectively composing them in the process of QPP. We discuss how we can use a training workload to (i) pre-build and materialize such models offline, so that they are readily available for future predictions, and (ii) build new models online as new predictions are needed. All prediction models are built using only static features (available prior to query execution) and the performance values obtained from the offline execution of the training workload. We fully implemented all these techniques and extensions on top of PostgreSQL and evaluated them experimentally by quantifying their effectiveness over analytical workloads, represented by the well-established TPC-H data and queries. The results provide quantitative evidence that learning-based modeling for QPP is both feasible and effective for both static and dynamic workload scenarios.
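A plan-level model of the kind described can be sketched as a regression from static plan features to measured latency. The features and training data below are invented placeholders, and plain least squares stands in for the paper's learning techniques:

```python
import numpy as np

# Hypothetical training data: one row per executed training query, with
# static plan-level features (optimizer cost estimate, estimated rows,
# operator count) and the measured execution latency in seconds.
features = np.array([
    [10.0, 1e3, 3],
    [50.0, 5e3, 5],
    [200.0, 2e4, 8],
    [500.0, 1e5, 12],
])
latency = np.array([0.2, 0.9, 3.5, 9.0])

# Fit the plan-level model offline: latency ~ X @ w (with an intercept),
# using only features available prior to query execution.
X = np.hstack([features, np.ones((len(features), 1))])
w, *_ = np.linalg.lstsq(X, latency, rcond=None)

def predict_latency(cost, rows, n_ops):
    """Predict execution latency for an unseen plan from its static features."""
    return float(np.array([cost, rows, n_ops, 1.0]) @ w)
```

The materialized model (here just the weight vector `w`) can be stored and reused for future predictions, matching the offline/online split the abstract describes.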
Sampling-based Data Mining Algorithms: Modern Techniques and Case Studies
Abstract. Sampling a dataset for faster analysis and looking at it as a sample from an unknown distribution are two faces of the same coin. We discuss the use of modern techniques involving the Vapnik-Chervonenkis (VC) dimension to study the trade-off between sample size and accuracy of data mining results that can be obtained from a sample. We report two case studies where we and collaborators employed these techniques to develop efficient sampling-based algorithms for the problems of betweenness centrality computation in large graphs and extracting statistically significant Frequent Itemsets from transactional datasets.

1 Sampling the data and data as samples

There exist two possible uses of sampling in data mining. On the one hand, sampling means selecting a small random portion of the data, which will then be given as input to an algorithm. The output will be an approximation of the results that would have been obtained if all available data was analyzed but, thanks to the small size of the selected portion, the approximation can be obtained much more quickly. On the other hand, from a more statistically inclined point of view, the entire dataset can be seen as a collection of samples from an unknown distribution. In this case the goal of analyzing the data is to gain a better understanding of the unknown distribution. Both scenarios share the same underlying question: how well does the sample resemble the entire dataset or the unknown distribution? There is a trade-off between the size of the sample and the quality of the approximation that can be obtained from it. Given the randomness involved in the sampling process, this trade-off must be studied in a probabilistic setting.
In this nectar paper we discuss the use of techniques related to the Vapnik-Chervonenkis (VC) dimension of the problem at hand to analyze the trade-off between sample size and approximation quality, and we report two case studies where we and collaborators successfully employed these techniques to develop efficient algorithms for the problems of betweenness centrality computation in large graphs [8] (“sampling the data” scenario) and extracting statistically significant frequent itemsets [10] (“data as samples” scenario).
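The trade-off discussed here has a standard quantitative form: a random sample of size m ≥ (c/ε²)(d + ln(1/δ)), where d is the VC-dimension of the range space at hand, is an ε-approximation with probability at least 1 − δ. A minimal sketch, in which the universal constant c is an assumption (set to 0.5 for illustration):

```python
import math

def vc_sample_size(d, eps, delta, c=0.5):
    """Sample size sufficient for an eps-approximation of a range space
    of VC-dimension d, with failure probability at most delta:
        m >= (c / eps^2) * (d + ln(1 / delta))
    The universal constant c is a placeholder here (0.5); its exact
    value depends on which bound from the literature is used."""
    return math.ceil((c / eps ** 2) * (d + math.log(1.0 / delta)))
```

Note that m depends only on d, ε and δ, not on the size of the dataset being sampled — the key property exploited in both case studies.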
The VC-Dimension of SQL Queries and Selectivity Estimation Through Sampling (Journal of Machine Learning Research, submitted 10/13)
In this work we show how the Vapnik-Chervonenkis (VC) dimension, a fundamental concept in statistical learning theory, can be used to evaluate the selectivity (output cardinality) of SQL queries, a core problem in large database management. The major theoretical contribution of this work, which is of independent interest, is an explicit bound on the VC-dimension of a range space defined by all possible outcomes of a collection of queries. We prove that the VC-dimension can be bounded by a quantity that is a function of the maximum number of Boolean operations in the selection predicate and of the maximum number of select and join operations in any individual query in the collection, but it is neither a function of the number of queries in the collection nor of the size (number of tuples) of the database. Leveraging this result, we develop a method that, given a class of queries, builds a concise random sample of a database that is small enough to be stored in main memory and is such that, with high probability, the execution of any query in the class on the sample provides an accurate estimate for the selectivity of the query on the original large database. The error probability holds simultaneously for the selectivity estimates of all queries in the collection, thus the same sample can be used to evaluate the selectivity of multiple queries, and the sample needs to be refreshed only following major changes in the database. We present extensive experimental results, validating our theoretical analysis and demonstrating the advantage of our technique when compared to complex selectivity estimation techniques used in PostgreSQL and Microsoft SQL Server.
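At query time, the scheme the abstract describes reduces to running the query's selection predicate on an in-memory sample. A minimal sketch under assumptions (single-table selection only, uniform sampling with replacement, invented names; the paper's actual contribution is the VC-dimension analysis that sizes the sample, which is not reproduced here):

```python
import random

def build_sample(table, m, seed=7):
    """Draw a uniform random sample of m tuples, with replacement.
    In the paper, the sample size m comes from the VC-dimension bound."""
    rng = random.Random(seed)
    return [rng.choice(table) for _ in range(m)]

def estimate_selectivity(sample, predicate):
    """Estimate a query's selectivity as the fraction of sampled tuples
    satisfying its selection predicate."""
    return sum(1 for t in sample if predicate(t)) / len(sample)
```

The same sample serves every query in the class, so it is built once and refreshed only after major changes to the underlying table.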
The Case for Predictive Database Systems: Opportunities and Challenges

Cited by 4 (0 self)
This paper argues that next generation database management systems should incorporate a predictive model management component to effectively support both inward-facing applications, such as self-management, and user-facing applications, such as data-driven predictive analytics. We draw an analogy between model management and data management functionality and discuss how model management can leverage profiling, physical design and query optimization techniques, as well as the pertinent challenges. We then describe the early design and architecture of Longview, a predictive DBMS prototype that we are building at Brown, along with a case study of how models can be used to predict query execution performance.
Mining top-K frequent itemsets through progressive sampling. Data Mining and Knowledge Discovery 21, 2010

Cited by 5 (3 self)
Abstract. We study the use of sampling for efficiently mining the top-K frequent itemsets of cardinality at most w. To this purpose, we define an approximation to the top-K frequent itemsets to be a family of itemsets which includes (resp., excludes) all very frequent (resp., very infrequent) itemsets, together with an estimate of these itemsets’ frequencies with a bounded error. Our first result is an upper bound on the sample size which guarantees that the top-K frequent itemsets mined from a random sample of that size approximate the actual top-K frequent itemsets, with probability larger than a specified value. We show that the upper bound is asymptotically tight when w is constant. Our main algorithmic contribution is a progressive sampling approach, combined with suitable stopping conditions, which on appropriate inputs is able to extract approximate top-K frequent itemsets from samples whose sizes are smaller than the general upper bound. In order to test the stopping conditions, this approach maintains the frequency of all itemsets encountered, which is practical only for small w. However, we show how this problem can be mitigated by using a variation of Bloom filters. A number of experiments conducted on both synthetic and real benchmark datasets show that using samples substantially smaller than the original dataset (i.e., of size defined by the upper bound or reached through the progressive sampling approach) enables approximating the actual top-K frequent itemsets with accuracy much higher than what is analytically proven.
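The progressive-sampling idea can be sketched as follows. This illustration makes simplifying assumptions: itemsets of cardinality w = 1 (single items), a fresh sample drawn at each round rather than an extended one, and a Hoeffding-plus-union-bound stopping condition in place of the paper's sharper conditions; all names are invented:

```python
import math
import random
from collections import Counter

def topk_progressive(dataset, k, eps, delta, m0=1000, seed=3):
    """Progressive sampling for approximate top-K frequent items
    (cardinality-1 itemsets, for brevity). The sample grows geometrically
    until a Hoeffding-style condition certifies that every empirical
    frequency is within eps/2 of the true one (prob. >= 1 - delta)."""
    rng = random.Random(seed)
    m = m0
    while True:
        sample = [rng.choice(dataset) for _ in range(m)]
        freqs = Counter(sample)
        n_items = len(freqs)
        # Hoeffding bound + union bound over the items seen in the sample.
        err = math.sqrt(math.log(2 * max(n_items, 1) / delta) / (2 * m))
        if err <= eps / 2:
            top = freqs.most_common(k)
            return [(item, c / m) for item, c in top], m
        m *= 2  # stopping condition failed: double the sample size
```

On easy inputs the loop stops well below the worst-case sample size, which is the behavior the abstract's experiments report for the itemset setting.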
F.: Finding the True Frequent Itemsets. CoRR abs/1301.1218, 2013

Cited by 3 (2 self)
Frequent Itemsets (FIs) mining is a fundamental primitive in data mining that requires identifying all itemsets appearing in a fraction at least θ of a transactional dataset D. Often, though, the ultimate goal of mining D is not an analysis of the dataset per se, but the understanding of the underlying process that generated D. Specifically, in many applications D is a collection of samples obtained from an unknown probability distribution π on transactions, and by extracting the FIs in the dataset D one attempts to infer itemsets that are frequently generated by π, which we call the True Frequent Itemsets (TFIs). Due to the inherently random nature of the generative process, the set of FIs is only a rough approximation of the set of TFIs, as it often contains a huge number of spurious itemsets, i.e., itemsets that are not among the TFIs. In this work we present two methods to identify a collection of itemsets that contains only TFIs with probability at least 1 − δ (i.e., the methods have Family-Wise Error Rate bounded by δ), for some user-specified δ, without imposing any restriction on π. Our methods are distribution-free and make use of results from statistical learning theory involving the (empirical) VC-dimension of the problem at hand. This allows us to identify a larger fraction of the TFIs (i.e., to achieve higher statistical power) than what could be done using traditional multiple hypothesis testing corrections. In the experimental evaluation we compare our methods to established techniques (Bonferroni correction, holdout) and show that they return a very large subset of the TFIs, achieving very high statistical power while controlling the Family-Wise Error Rate.
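The core idea — report an itemset only when its empirical frequency is far enough above θ that it is unlikely to be spurious — can be sketched for the single-item case. This sketch uses a Hoeffding/union-bound deviation in place of the paper's sharper, VC-dimension-based ε, and all names are invented:

```python
import math
from collections import Counter

def true_frequent_items(dataset, theta, delta, n_hyp):
    """Report an item only if its empirical frequency exceeds theta + eps,
    where eps is a Hoeffding/union-bound deviation over the n_hyp candidate
    items tested; with probability >= 1 - delta, every reported item has
    true frequency above theta. (The paper derives a smaller, VC-based eps.)"""
    n = len(dataset)
    eps = math.sqrt(math.log(n_hyp / delta) / (2 * n))
    counts = Counter(dataset)
    return {item for item, c in counts.items() if c / n > theta + eps}
```

Raising the acceptance threshold from θ to θ + ε trades some statistical power for control of the Family-Wise Error Rate; a smaller ε, like the VC-based one, loses less power.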
Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees
The tasks of extracting the (top-K) Frequent Itemsets (FI’s) and Association Rules (AR’s) are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist and are widely used, but their running time is hindered by the need to scan the entire dataset, possibly multiple times. High-quality approximations of FI’s and AR’s are sufficient for most practical uses. Sampling techniques can be used for fast discovery of approximate solutions, but previous works exploring this approach did not provide satisfactory performance guarantees on the quality of the approximation, due to the difficulty of bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. We circumvent this issue by applying the statistical concept of Vapnik-Chervonenkis (VC) dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation of the (top-K) FI’s and AR’s within user-specified parameters. The resulting sample size is linearly dependent on the VC-dimension of a range space associated with the dataset. We analyze the VC-dimension of this range space and show that it is upper bounded by an easy-to-compute characteristic quantity of the dataset, the d-index, namely the maximum integer d such that the dataset contains at least d transactions of length at least d, no one of which is a superset of, or equal to, another. We show that this bound is tight for a large class of datasets. The resulting sample size is a significant improvement over previously known results.
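The d-index lends itself to a short sketch. The version below computes an h-index-style quantity over distinct transactions and omits the requirement that no selected transaction be a superset of, or equal to, another, so it is only an upper bound on the d-index as defined in the abstract; names are invented:

```python
def d_index_upper_bound(dataset):
    """Upper bound on the d-index: the largest d such that the dataset
    contains at least d *distinct* transactions of length at least d
    (an h-index over transaction lengths). The exact d-index additionally
    requires the chosen transactions to form an antichain under set
    inclusion; this sketch omits that check, so it may overestimate."""
    distinct = {frozenset(t) for t in dataset}
    lengths = sorted((len(t) for t in distinct), reverse=True)
    d = 0
    for i, ell in enumerate(lengths, start=1):
        if ell >= i:
            d = i  # at least i distinct transactions of length >= i
        else:
            break
    return d
```

Since the sample size grows linearly with the VC-dimension, and the d-index upper-bounds that dimension, this one-pass-computable quantity directly sizes the sample.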