Radius Plots for Mining Tera-byte Scale Graphs: Algorithms, Patterns, and Observations
Cited by 13 (10 self)
Given large, multi-million node graphs (e.g., Facebook, web crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the diameter of massive graphs, which runs on top of the Hadoop/MapReduce system with excellent scale-up on the number of available machines; (b) We run HADI on several real-world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multi-modal/bi-modal shape of the Radius Plot, and its palindrome motion over time.
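To make the notion of a Radius Plot concrete, here is a minimal exact sketch for a tiny in-memory graph. It is not HADI: it runs one breadth-first search per node, which is exactly the cost that becomes prohibitive at scale. The abstract does not spell out the radius definition, so this sketch assumes a node's radius is the largest hop count to any node it can reach; the function names and the toy graph are hypothetical.

```python
from collections import Counter, deque

def node_radius(adj, source):
    """Largest BFS hop count from `source` to any node it can reach."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def radius_plot(adj):
    """Return (fraction of nodes per radius value, diameter)."""
    radii = {v: node_radius(adj, v) for v in adj}
    counts = Counter(radii.values())
    n = len(adj)
    plot = {r: c / n for r, c in sorted(counts.items())}
    # Assumption: the diameter is taken as the largest node radius.
    return plot, max(radii.values())

# Hypothetical toy graph: an undirected path a - b - c - d - e.
path = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
plot, diameter = radius_plot(path)
```

On the path graph the middle node has the smallest radius and the two endpoints the largest, which is the "central nodes versus outliers" reading the abstract describes.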
An Optimal Algorithm for the Distinct Elements Problem
Cited by 9 (2 self)
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. This problem has applications to query optimization, Internet routing, network topology, and data mining. For a stream of indices in {1, ..., n}, our algorithm computes a (1 ± ε)-approximation using an optimal O(ε^-2 + log(n)) bits of space with 2/3 success probability, where 0 < ε < 1 is given. This probability can be amplified by independent repetition. Furthermore, our algorithm processes each stream update in O(1) worst-case time, and can report an estimate at any point midstream in O(1) worst-case time, thus settling both the space and time complexities simultaneously.
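For readers unfamiliar with the problem, the following is a small illustration of streaming distinct-count estimation. It is not the optimal algorithm of this paper: it is the simpler, well-known k-minimum-values idea, which hashes items to the unit interval and infers the count from how small the k-th smallest hash is. The function names and the default k are assumptions made for the example.

```python
import hashlib
import heapq

def unit_hash(item):
    """Deterministically map an item to a roughly uniform float in the unit interval."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def kmv_estimate(stream, k=256):
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []     # negated hashes, so heap[0] is the largest hash currently kept
    kept = set()  # hashes currently in the heap, used to skip repeated items
    for item in stream:
        h = unit_hash(item)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Insert the new hash and evict the largest one in a single step.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes seen: count is exact
    return (k - 1) / -heap[0]    # -heap[0] is the k-th smallest hash value
```

The relative error of this estimator shrinks roughly like 1/sqrt(k), so memory grows with the accuracy demanded, which is the trade-off the space bound in the abstract characterizes.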
HADI: Mining radii of large graphs
- ACM Transactions on Knowledge Discovery from Data, 2010
Cited by 9 (4 self)
Given large, multi-million node graphs (e.g., Facebook, web crawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers? In this paper we define the Radius plot of a graph and show how it can answer these questions. However, computing the Radius plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and fine-tuned algorithm to compute the radii and the diameter of massive graphs, which runs on top of the Hadoop/MapReduce system with excellent scale-up on the number of available machines; (b) We run HADI on several real-world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multi-modal/bi-modal shape of the Radius plot, and its palindrome motion over time.
Hashed Samples: Selectivity Estimators For Set Similarity Selection Queries
- 2008
Cited by 8 (1 self)
We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (both in CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
Multidimensional content exploration
- In VLDB, 2008
Cited by 6 (2 self)
Content Management Systems (CMS) store enterprise data such as insurance claims, insurance policies, legal documents, patent applications, or archival data as in the case of digital libraries. Search over content allows for information retrieval, but does not provide users with great insight into the data. A more analytical view is needed through analysis, aggregations, groupings, trends, pivot tables or charts, and so on. Multidimensional Content eXploration (MCX) is about effectively analyzing and exploring large amounts of content by combining keyword search with OLAP-style aggregation, navigation, and reporting. We focus on unstructured data or, generally speaking, documents or content with limited metadata, as is typically encountered in CMS. We formally present how CMS content and metadata should be organized in a well-defined multidimensional structure, so that sophisticated queries can be expressed and evaluated. The CMS metadata provide traditional OLAP static dimensions that are combined with dynamic dimensions discovered from the analyzed keyword search result, as well as measures for document scores based on the link structure between the documents. In addition, we provide means for multidimensional content exploration through traditional OLAP roll-up/drill-down operations on the static and dynamic dimensions, solutions for multi-cube analysis, and dynamic navigation of the content. We present our prototype, called DBPubs, which stores research publications as documents that can be searched and, most importantly, analyzed and explored. Finally, we present experimental results of the efficiency and effectiveness of our approach.
Tighter Estimation using Bottom-k Sketches
Cited by 5 (4 self)
Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records' attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling [22] and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data, including distributed databases and data streams, and support coordinated and all-distances sketches. We derive novel unbiased estimators and confidence bounds for subpopulation weight. Our rank conditioning (RC) estimator is applicable when the total weight of the sketched set cannot be computed by the summarization algorithm without a significant use of additional resources (such as for sketches of network neighborhoods), and the tighter subset conditioning (SC) estimator is applicable when the total weight is available (sketches of data streams). Our estimators are derived using clever applications of the Horvitz-Thompson estimator (which is not directly applicable to bottom-k sketches). We develop efficient computational methods and conduct performance evaluation using a range of synthetic and real data sets. We demonstrate considerable benefits of the SC estimator on larger subpopulations (over all other estimators); of the RC estimator (over existing estimators for weighted sampling without replacement); and of our confidence bounds (over all previous approaches).
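As a rough illustration of the sketch format, the code below builds a bottom-k sketch with priority-sampling ranks (rank = u / weight for uniform u) and estimates a subpopulation weight with Horvitz-Thompson style adjusted weights conditioned on the (k+1)-th smallest rank. This is a sketch under stated assumptions and not the paper's estimators as published: the function names, the choice of rank distribution, and the toy data in the usage note are all hypothetical.

```python
import heapq
import random

def bottom_k_sketch(weights, k, rng):
    """Keep the k items of smallest rank plus the (k+1)-th smallest rank.

    `weights` maps item key -> positive weight. Ranks are u / w with u in (0, 1].
    Returns (sketch, threshold); threshold is None when every item fits.
    """
    ranked = [((1.0 - rng.random()) / w, key) for key, w in weights.items()]
    smallest = heapq.nsmallest(k + 1, ranked)
    if len(smallest) <= k:
        return {key: weights[key] for _, key in smallest}, None
    threshold = smallest[k][0]
    return {key: weights[key] for _, key in smallest[:k]}, threshold

def subpopulation_weight(sketch, threshold, predicate):
    """Estimate the total weight of items whose key satisfies `predicate`."""
    if threshold is None:
        # The sketch holds the whole data set, so the answer is exact.
        return sum(w for key, w in sketch.items() if predicate(key))
    # Given the threshold, an item of weight w is included with probability
    # min(1, w * threshold); dividing w by it gives max(w, 1 / threshold).
    return sum(
        max(w, 1.0 / threshold)
        for key, w in sketch.items()
        if predicate(key)
    )
```

For example, with weights {0: 1, 1: 2, ..., 99: 100} and the predicate "key is even", the true subpopulation weight is 2500; averaging the estimate over many independently seeded sketches with k = 20 should land close to that value.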
Diagnosing Estimation Errors in Page Counts Using Execution Feedback
- In Proceedings of ICDE 2008
Cited by 4 (0 self)
Errors in estimating page counts can lead to poor choice of access methods and in turn to poor quality plans. Although there is past work in using execution feedback for accurate cardinality estimation, the problem of inaccurate estimation of page counts has not been addressed. In this paper, we present novel mechanisms for diagnosing errors in page count by monitoring query execution at low overhead. Detection of inaccuracy in the optimizer estimates of page count can be leveraged by database administrators to improve plan quality. We have prototyped our techniques in the Microsoft SQL Server engine, and our experiments demonstrate the ability to estimate page counts accurately using execution feedback with low overhead. For queries on several real-world databases, we observe significant improvement in plan quality when page counts obtained from execution feedback are used instead of the traditional optimizer estimations.
Sampling time-based sliding windows in bounded space
- In Proc. SIGMOD (2008)
Cited by 4 (1 self)
Random sampling is an appealing approach to build synopses of large data streams because random samples can be used for a broad spectrum of analytical tasks. Users are often interested in analyzing only the most recent fraction of the data stream in order to avoid outdated results. In this paper, we focus on sampling schemes that sample from a sliding window over a recent time interval; such windows are a popular and highly comprehensible method to model recency. In this setting, the main challenge is to guarantee an upper bound on the space consumption of the sample while using the allotted space efficiently at the same time. The difficulty arises from the fact that the number of items in the window is unknown in advance and may vary significantly over time, so that the sampling fraction has to be adjusted dynamically. We consider uniform sampling schemes, which produce each sample of the same size with equal probability, and stratified sampling schemes, in which the window is divided into smaller strata and a uniform sample is maintained per stratum. For uniform sampling, we prove that it is impossible to guarantee a minimum sample size in bounded space. We then introduce a novel sampling scheme called bounded priority sampling (BPS), which requires only bounded space. We derive a lower bound on the expected sample size and show that BPS quickly adapts to changing data rates. For stratified sampling, we propose a merge-based stratification scheme (MBS), which maintains strata of approximately equal size. Compared to naive stratification, MBS has the advantage that the sample is evenly distributed across the window, so that no part of the window is over- or underrepresented. We conclude the paper with a feasibility study of our algorithms on large real-world datasets.
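The space problem described above can be seen in a plain priority sampler for a time-based window, sketched below. This is not BPS or MBS: it is the simpler, well-known scheme that gives each item a random priority and reports the highest-priority item still inside the window, and its memory use is not bounded in the worst case. The class name and its parameters are assumptions for the example, and timestamps are assumed to arrive in non-decreasing order.

```python
import random
from collections import deque

class WindowPrioritySampler:
    """Uniform size-1 sample over the time window (now - window, now]."""

    def __init__(self, window, rng=None):
        self.window = window
        self.rng = rng if rng is not None else random.Random()
        # Entries are (timestamp, priority, item); priorities strictly
        # decrease from the front of the deque to the back.
        self.candidates = deque()

    def add(self, timestamp, item):
        priority = self.rng.random()
        # Older items with lower priority expire first and can never win
        # against the new item, so they are dropped immediately.
        while self.candidates and self.candidates[-1][1] <= priority:
            self.candidates.pop()
        self.candidates.append((timestamp, priority, item))

    def sample(self, now):
        # Discard candidates that have fallen out of the window.
        while self.candidates and self.candidates[0][0] <= now - self.window:
            self.candidates.popleft()
        return self.candidates[0][2] if self.candidates else None
```

The candidate list stays short on average but can grow with the number of items in the window, which illustrates why a scheme with a hard space bound, as the abstract proposes, needs a different design.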
Rewriting Queries on SPARQL Views
- 2011
Cited by 3 (1 self)
The problem of answering SPARQL queries over virtual SPARQL views is commonly encountered in a number of settings, including when enforcing security policies to access RDF data, or when integrating RDF data from disparate sources. We approach this problem by rewriting SPARQL queries over the views to equivalent queries over the underlying RDF data, thus avoiding the costs entailed by view materialization and maintenance. We show that SPARQL query rewriting combines the most challenging aspects of rewriting for the relational and XML cases: like the relational case, SPARQL query rewriting requires synthesizing multiple views; like the XML case, the size of the rewritten query is exponential in the size of the query and the views. In this paper, we present the first native query rewriting algorithm for SPARQL. For an input SPARQL query over a set of virtual SPARQL views, the rewritten query resembles a union of conjunctive queries and can be of exponential size. We propose optimizations over the basic rewriting algorithm to (i) minimize each conjunctive query in the union; (ii) eliminate conjunctive queries with empty results from evaluation; and (iii) efficiently prune out big portions of the search space of empty rewritings. The experiments, performed on two RDF stores, show that our algorithms are scalable and independent of the underlying RDF stores. Furthermore, our optimizations yield order-of-magnitude improvements over the basic rewriting algorithm in both the rewriting size and evaluation time.
Tighter Estimation using Bottom-k Sketches
Cited by 2 (0 self)
Summaries of massive data sets support approximate query processing over the original data. A basic aggregate over a set of records is the weight of subpopulations specified as a predicate over records' attributes. Bottom-k sketches are a powerful summarization format of weighted items that includes priority sampling [22] and the classic weighted sampling without replacement. They can be computed efficiently for many representations of the data, including distributed databases and data streams, and support coordinated and all-distances sketches. We derive novel unbiased estimators and confidence bounds for subpopulation weight. Our rank conditioning (RC) estimator is applicable when the total weight of the sketched set cannot be computed by the summarization algorithm without a significant use of additional resources (such as for sketches of network neighborhoods), and the tighter subset conditioning (SC) estimator is applicable when the total weight is available (sketches of data streams). Our estimators are derived using clever applications of the Horvitz-Thompson estimator (which is not directly applicable to bottom-k sketches). We develop efficient computational methods and conduct performance evaluation using a range of synthetic and real data sets. We demonstrate considerable benefits of the SC estimator on larger subpopulations (over all other estimators); of the RC estimator (over existing estimators for weighted sampling without replacement); and of our confidence bounds (over all previous approaches).

