## Graph summarization with bounded error (2008)

Venue: SIGMOD 2008: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data

Citations: 41 (6 self)

### BibTeX

@INPROCEEDINGS{Navlakha08graphsummarization,
    author = {Saket Navlakha and Rajeev Rastogi and Nisheeth Shrivastava},
    title = {Graph summarization with bounded error},
    booktitle = {SIGMOD 2008: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data},
    year = {2008},
    pages = {419--432},
    publisher = {ACM}
}

### Abstract

We propose a highly compact two-part representation of a given graph G consisting of a graph summary and a set of corrections. The graph summary is an aggregate graph in which each node corresponds to a set of nodes in G, and each edge represents the edges between all pairs of nodes in the two sets. On the other hand, the corrections portion specifies the list of edge-corrections that should be applied to the summary to recreate G. Our representations allow for both lossless and lossy graph compression with bounds on the introduced error. Further, in combination with the MDL principle, they yield highly intuitive coarse-level summaries of the input graph G. We develop algorithms to construct highly compressed graph representations with small sizes and guaranteed accuracy, and validate our approach through an extensive set of experiments with multiple real-life graph data sets. To the best of our knowledge, this is the first work to compute graph summaries using the MDL principle, and use the summaries (along with corrections) to compress graphs with bounded error.
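The two-part representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `reconstruct`, the `(sign, u, v)` encoding of corrections, and the toy graph are all assumptions made for illustration, and only the lossless case for an undirected graph is shown. Each superedge is expanded into all node pairs between its two supernodes, and the corrections are then applied to recover the edge set of G.

```python
from itertools import combinations


def reconstruct(supernodes, superedges, corrections):
    """Rebuild the edge set of G from a summary and a correction list.

    supernodes:  dict mapping a supernode name to the set of original nodes it holds
    superedges:  iterable of (A, B) supernode-name pairs; A == B means a self-loop
    corrections: iterable of (sign, u, v) with sign "+" (add edge) or "-" (remove edge)
    All three encodings are hypothetical choices for this sketch.
    """
    edges = set()
    for a, b in superedges:
        if a == b:
            # A superedge from a supernode to itself stands for all internal pairs.
            pairs = combinations(supernodes[a], 2)
        else:
            # A superedge between two supernodes stands for every cross pair.
            pairs = ((u, v) for u in supernodes[a] for v in supernodes[b])
        edges.update(frozenset(p) for p in pairs)
    for sign, u, v in corrections:
        edge = frozenset((u, v))
        if sign == "+":
            edges.add(edge)      # edge present in G but not implied by the summary
        else:
            edges.discard(edge)  # edge implied by the summary but absent from G
    return edges


# Toy example (made-up data): six original edges stored as 1 superedge + 2 corrections.
supernodes = {"X": {"a", "b"}, "Y": {"c", "d", "e"}, "Z": {"f"}}
superedges = [("X", "Y")]
corrections = [("-", "b", "e"), ("+", "a", "f")]

graph_edges = reconstruct(supernodes, superedges, corrections)
print(len(graph_edges))  # 6
```

In this toy case the summary plus corrections holds three items in place of six edges. Counting superedges plus corrections is used here only as a rough size proxy; the abstract itself does not spell out the cost function.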

### Citations

11418 | Computers and Intractability: A Guide to the Theory of NP-Completeness
- Garey, Johnson
- 1979
Citation Context ...y be other approaches which can take advantage of overlapping subsets (supernodes) to get better compression (similar to the minimum clique cover or minimum complete bipartite subgraph cover problems [11]); however, we will only consider the disjoint case because we want S to be a graph that is easy to visualize. Observe that the disjoint case is conceptually similar to graph clustering, where the aim...

3598 | The anatomy of a large-scale hypertextual web search engine
- Brin, Page
- 1998
Citation Context ...e Web has a natural graph structure with a node for each page and a directed edge for each hyperlink. This link structure of the Web has been exploited very successfully by search engines like Google [4] to improve search quality. Other contemporary research works mine the Web graph to find dense bipartite cliques, and through them Web communities [21] and link spam [12]. Recent estimates from search...

2969 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1998
Citation Context ...pages. Much of the work has focused on lossless compression of Web pages so that the compact Web-graph representations can then be used to calculate measures such as PageRank [4] or authority vectors [19]. Several studies [1, 3, 27, 30, 26] take advantage of well-established properties of the Web graph, e.g., pages largely pointing to other pages on the same host, and new pages adding links by copying...

2303 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...provide any insight into the structure of the graph. An exception here is [26] which computes graph summaries by grouping Web pages based on a combination of their URL patterns and k-means clustering [9]. In contrast, our summaries are computed using the MDL principle, which has sound information-theoretic underpinnings. In a very different setting, [20, 12] devise algorithms to extract large dense s...

1239 | Modeling by Shortest Data Description
- Rissanen
- 1978
Citation Context ...cted graphs. However, for simplicity of exposition, we will only consider undirected graphs in the remainder of the paper. 1.2 MDL Representation Rissanen’s Minimum Description Length (MDL) principle [28] has its roots in information theory. It roughly states that the best theory to infer from a set of data is the one which minimizes the sum of (A) the size of the theory, and (B) the size of the data ...

1176 | On spectral clustering: analysis and an algorithm
- Ng, Jordan, et al.
- 2002
Citation Context ...nodes such that the graph representation is as compact as possible. Another problem with many of the widely-used clustering algorithms, such as METIS [17], Graclus [8], k-means and spectral clustering [24], is that they require the user to specify the number of partitions beforehand, which is typically hard to estimate and not required in our setting. Like us, AutoPart [5] uses the MDL principle to com...

851 | A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing
- Karypis, Kumar
Citation Context ...munities, customer segments); specifically, it provides insight into the high-level structure of the graph, and the dominant relationships among the various node clusters. Unlike clustering algorithms [17, 8] that group nodes based on their similarity or distances, our summary is computed using information-theoretic principles. • Our representation allows for a high degree of compression to be achieved fo...

494 | Cytoscape: a software environment for integrated models of biomolecular interaction networks
- Shannon
- 2003
Citation Context ...ges and supernodes (about 10% of the original graph) indicates the usefulness of the summary for visualization and trend analysis. Figure 6 shows a visualization (constructed using the Cytoscape tool [29]) of the original input and the corresponding summary S for the CNR-10k dataset. Apart from being much smaller in size and hence less cluttered, there are many interesting patterns that stand out. One ...

326 | Online Aggregation
- Hellerstein, Haas, et al.
- 1997
Citation Context ...bors of an element node or recreate the original graph with bounded errors on neighbor sets like we do. Approximate Query Processing. There is a vast body of work on maintaining synopses like samples [13], histograms [15], and wavelets [6] to provide approximate answers to relational queries. However, these have limited applicability to our graph scenario for the following reasons. First, many of the ...

313 | Trawling the web for emerging cyber-communities
- Kumar, Raghavan, et al.
- 1999
Citation Context ...d very successfully by search engines like Google [4] to improve search quality. Other contemporary research works mine the Web graph to find dense bipartite cliques, and through them Web communities [21] and link spam [12]. Recent estimates from search engines put the size of the Web graph at around 3 billion nodes and more than 50 billion arcs [3]. (Note that these are clearly lower bounds since the...

294 | Index Structures for Path Expressions
- Milo, Suciu
Citation Context ...thods exploit the inherent data hierarchy and spatial properties of database tables, and thus cannot be generalized to compress general graph structures. XML Synopsis Construction. Many recent papers [25, 18, 23] have proposed path-index structures for XML data graphs to estimate the selectivities of complex path expressions over XML documents. The basic idea is to group identically labeled element nodes in t...

276 | Gigascope: A Stream Database for Network Applications
- Cranor, Johnson, et al.
- 2003
Citation Context ...l communication patterns, security vulnerabilities, hosts that are infected by a virus or a worm, and malicious attacks against machines. These graphs, however, can be large – it has been reported in [7] that the AT&T IP backbone network alone generates 500 GB of IP flow data per day (about ten billion fifty-byte records). • Market Basket Data. Market basket data contains information about products b...

190 | The Web as a Graph
- Kumar, Raghavan, et al.
- 2000
Citation Context ... practical graphs to realize space savings. For instance, it is well known that in Web graphs, because of link copying between Web pages, there are clusters of pages with very similar adjacency lists [27, 26]. Similarly, communities in social networks and the Web frequently contain nodes that are densely inter-linked with one another [21]. Now, in such graphs, if two nodes have edges to the same set (or v...

189 | Approximate query processing using wavelets
- Chakrabarti, Garofalakis, et al.
- 2000
Citation Context ...the original graph with bounded errors on neighbor sets like we do. Approximate Query Processing. There is a vast body of work on maintaining synopses like samples [13], histograms [15], and wavelets [6] to provide approximate answers to relational queries. However, these have limited applicability to our graph scenario for the following reasons. First, many of the approximation techniques like sampl...

172 | The WebGraph framework I: Compression techniques
- Boldi, Vigna
- 2004
Citation Context ...bipartite cliques, and through them Web communities [21] and link spam [12]. Recent estimates from search engines put the size of the Web graph at around 3 billion nodes and more than 50 billion arcs [3]. (Note that these are clearly lower bounds since the Web graph has been growing rapidly over the years as more of the Web gets discovered and indexed.) Thus, the Web graph can easily occupy many tera...

110 | Extracting Large-Scale Knowledge Bases from the Web
- Kumar, Raghavan, et al.
- 1999
Citation Context ...on of their URL patterns and k-means clustering [9]. In contrast, our summaries are computed using the MDL principle, which has sound information-theoretic underpinnings. In a very different setting, [20, 12] devise algorithms to extract large dense subgraphs from the Web graph, since these typically correspond to online communities or link spam. Thus, the objective in [20, 12] is very different from ours...

83 | Towards compressing web graphs
- Adler, Mitzenmacher
- 2001
Citation Context ...ion. Thus, with our highly space-efficient representations, approximate graphs can be stored in main memory and efficiently analyzed using graph algorithms. In contrast, most of the existing proposals [1, 30, 3] only support lossless compression for Web graphs. 1.1 A Generic Graph Representation Given a graph G = (VG, EG), our representation for it R = (S, C) consists of a graph summary S = (VS, ES) and a se...

80 | Histogram-based approximations to set-valued query answers
- Ioannidis, Poosala
- 1999
Citation Context ...t node or recreate the original graph with bounded errors on neighbor sets like we do. Approximate Query Processing. There is a vast body of work on maintaining synopses like samples [13], histograms [15], and wavelets [6] to provide approximate answers to relational queries. However, these have limited applicability to our graph scenario for the following reasons. First, many of the approximation tec...

72 | Extracting large dense subgraphs in massive graphs
- Gibson, Kumar, et al.
- 2005
Citation Context ... by search engines like Google [4] to improve search quality. Other contemporary research works mine the Web graph to find dense bipartite cliques, and through them Web communities [21] and link spam [12]. Recent estimates from search engines put the size of the Web graph at around 3 billion nodes and more than 50 billion arcs [3]. (Note that these are clearly lower bounds since the Web graph has been...

68 | A Measure of Similarity between Graph Vertices: Applications to Synonym Extraction and Web Searching
- Blondel, Gajardo, et al.
- 2004
Citation Context ... setting, the similarity among (unlabeled) nodes can be defined using standard measures like the min-hop distance, the Jaccard Coefficient [9] on their neighbor sets, or linear matrix transformations [2]. Although clustering algorithms may give meaningful insights into the dominant patterns in the graph, they typically employ distance- or similarity-based metrics to compute the clusters containing si...

52 | Compressing the graph structure of the Web
- Suel, Yuan
- 2001
Citation Context ...ion. Thus, with our highly space-efficient representations, approximate graphs can be stored in main memory and efficiently analyzed using graph algorithms. In contrast, most of the existing proposals [1, 30, 3] only support lossless compression for Web graphs. 1.1 A Generic Graph Representation Given a graph G = (VG, EG), our representation for it R = (S, C) consists of a graph summary S = (VS, ES) and a se...

48 | Exploiting local similarity for indexing paths in graph-structured data
- Kaushik, Shenoy, et al.
- 2002
Citation Context ...thods exploit the inherent data hierarchy and spatial properties of database tables, and thus cannot be generalized to compress general graph structures. XML Synopsis Construction. Many recent papers [25, 18, 23] have proposed path-index structures for XML data graphs to estimate the selectivities of complex path expressions over XML documents. The basic idea is to group identically labeled element nodes in t...

42 | An efficient reduction technique for degree-constrained subgraph and bidirected network problems
- Gabow
- 1983
Citation Context ...atching M in G. (When all b_v = 1 (εn_v = 1), then this reduces to the standard maximum matching problem.) The b-matching problem can be solved in O(m · min{m log n, n^2}) time using Gabow’s algorithm [10] – here n is the number of nodes and m is the number of edges in G. In our setting, we will remove the corrections in C by converting it into an instance of the b-matching problem. We construct a new ...

40 | Network Monitoring using Traffic Dispersion Graphs (TDGs)
- Iliofotou, Pappu, et al.
- 2007
Citation Context ...P Network Monitoring. IP routers export records containing source and destination IP addresses, number of bytes transmitted, duration, etc. for each IP communication flow. Recently, Iliofotou et al. [14] proposed the idea of extracting Traffic Dispersion Graphs (TDGs) from network traces, where each node corresponds to an IP address and there is an edge between any two IP addresses that sent traffic t...

37 | A fast kernel-based multilevel algorithm for graph clustering
- Dhillon, Guan, et al.
- 2005
Citation Context ...munities, customer segments); specifically, it provides insight into the high-level structure of the graph, and the dominant relationships among the various node clusters. Unlike clustering algorithms [17, 8] that group nodes based on their similarity or distances, our summary is computed using information-theoretic principles. • Our representation allows for a high degree of compression to be achieved fo...

33 | Semantic compression and pattern extraction with fascicles
- Jagadish, Madar, et al.
- 1999
Citation Context ...oint out here that since graphs are traditionally represented as a two-column relation with one tuple per edge (that stores its two vertex endpoints), relational compression techniques like fascicles [16] will not work well for graphs. Network Visualization. In a recent paper [14], graph structures called Traffic Dispersion Graphs (TDG), extracted from network traffic on a router, were proposed as a m...

18 | The generalized MDL approach for summarization
- Lakshmanan, Ng, et al.
- 2002
Citation Context ...node groups. Further, AutoPart only does lossless compression – so its performance relative to our Greedy scheme degrades even further when the compressed graph is permitted to have bounded error. In [22], the authors apply the MDL principle to summarize cells of interest in OLAP data by means of a covering with regions. However, their methods exploit the inherent data hierarchy and spatial properties...

17 | Role Classification of Hosts Within Enterprise Networks based on Connection Patterns
- Tan, Poletto, et al.
- 2003
Citation Context ... means to detect unknown applications on a network. However, the aim in [14] is solely to extract the relevant data at network speeds and display it as a graph, which is complementary to our work. In [31], the authors use neighbor-similarity based clustering techniques to classify hosts into groups (having similar “roles”, e.g., mail-servers), and to visualize these groups for hosts on a network domai...

11 | XSKETCH Synopses for XML Data Graphs
- Polyzotis, Garofalakis
Citation Context ...thods exploit the inherent data hierarchy and spatial properties of database tables, and thus cannot be generalized to compress general graph structures. XML Synopsis Construction. Many recent papers [25, 18, 23] have proposed path-index structures for XML data graphs to estimate the selectivities of complex path expressions over XML documents. The basic idea is to group identically labeled element nodes in t...

1 | The link database: Fast access to graphs of the web. DCC
- Randall, Stata, et al.
- 2002