Results 1  10
of
53
PEGASUS: A PetaScale Graph Mining System Implementation and Observations
 IEEE INTERNATIONAL CONFERENCE ON DATA MINING
, 2009
"... Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga, Tera or P ..."
Abstract

Cited by 128 (26 self)
 Add to MetaCart
(Show Context)
Abstract—In this paper, we describe PEGASUS, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node and finding the connected components. As the size of graphs reaches several Giga, Tera or Petabytes, the necessity for such a library grows too. To the best of our knowledge, PEGASUS is the first such library, implemented on the top of the HADOOP platform, the open source version of MAPREDUCE. Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrixvector multiplication. In this paper we describe a very important primitive for PEGASUS, called GIMV (Generalized Iterated MatrixVector multiplication). GIMV is highly optimized, achieving (a) good scaleup on the number of available machines (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the nonoptimized version of GIMV. Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web Graphs, thanks to Yahoo!, with ≈ 6,7 billion edges. KeywordsPEGASUS; graph mining; hadoop I.
BitShred: Feature Hashing Malware for Scalable Triage and Semantic Analysis
"... The sheer volume of new malware found each day is growing at an exponential pace. This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why. In this paper, we present BitShred, a system for largescale malware simil ..."
Abstract

Cited by 46 (2 self)
 Add to MetaCart
(Show Context)
The sheer volume of new malware found each day is growing at an exponential pace. This growth has created a need for automatic malware triage techniques that determine what malware is similar, what malware is unique, and why. In this paper, we present BitShred, a system for largescale malware similarity analysis and clustering, and for automatically uncovering semantic inter and intrafamily relationships within clusters. The key idea behind BitShred is using feature hashing to dramatically reduce the highdimensional feature spaces that are common in malware analysis. Feature hashing also allows us to mine correlated features between malware families and samples using coclustering techniques. Our evaluation shows that BitShred speeds up typical malware triage tasks by up to 2,365x and uses up to 82x less memory on a single CPU, all with comparable accuracy to previous approaches. We also develop a parallelized version of BitShred, and demonstrate scalability within the Hadoop framework.
Data Mining with Big Data
"... Abstract: Big Data concerns largevolume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including physical, biological an ..."
Abstract

Cited by 34 (0 self)
 Add to MetaCart
Abstract: Big Data concerns largevolume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and the data collection capacity, Big Data is now rapidly expanding in all science and engineering domains, including physical, biological and biomedical sciences. This article presents a HACE theorem that characterizes the features of the Big Data revolution, and proposes a Big Data processing model, from the data mining perspective. This datadriven model involves demanddriven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the datadriven model and also in the Big Data revolution.
HADI: Mining radii of large graphs
 ACM Transactions on Knowledge Discovery from Data
, 2010
"... Given large, multimillion node graphs (e.g., Facebook, webcrawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers? In this paper we define the Radius plot of a graph and show how it can answer these questions. However, computing the Radius p ..."
Abstract

Cited by 33 (10 self)
 Add to MetaCart
(Show Context)
Given large, multimillion node graphs (e.g., Facebook, webcrawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers? In this paper we define the Radius plot of a graph and show how it can answer these questions. However, computing the Radius plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and finetuned algorithm to compute the radii and the diameter of massive graphs, that runs on the top of the Hadoop/MapReduce system, with excellent scaleup on the number of available machines (b) We run HADI on several real world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multimodal/bimodal shape of the Radius plot, and its palindrome motion over time.
HADI: Fast Diameter Estimation and Mining in Massive Graphs with Hadoop
, 2008
"... How can we quickly find the diameter of a petabytesized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks and many more. The size of graphs of interest has been increasing rapidly in recent years and with it al ..."
Abstract

Cited by 25 (2 self)
 Add to MetaCart
How can we quickly find the diameter of a petabytesized graph? Large graphs are ubiquitous: social networks (Facebook, LinkedIn, etc.), the World Wide Web, biological networks, computer networks and many more. The size of graphs of interest has been increasing rapidly in recent years and with it also the need for algorithms that can handle tera and petabyte graphs. A promising direction for coping with such sizes is the emerging map/reduce architecture and its opensource implementation, ’HADOOP’. Estimating the diameter of a graph, as well as the radius of each node, is a valuable operation that can help us spot outliers and anomalies. We propose HADI (HAdoop based DIameter estimator), a carefully designed algorithm to compute the diameters of petabytescale graphs. We run the algorithm to analyze the largest public web graph ever analyzed, with billions of nodes and edges. Additional contributions include the following: (a) We propose several performance optimizations (b) we achieve excellent scaleup, and (c) we report interesting observations including outliers and related patterns, on this real graph (116Gb), as well as several other real, smaller graphs. One of the observations is that the Albert et al. conjecture about the diameter of Networked systems are ubiquitous. The analysis of networks such as the World Wide Web, social, computer and biological networks has attracted much attention recently. Some of the typical measures to compute are
Beyond ‘Caveman Communities’: Hubs and Spokes for Graph Compression and Mining
"... Abstract—Given a real world graph, how should we layout its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find cliquelike communities, like the ‘cavemen graph’, and compress them. We show that the blockdiagonal mental image of the ‘cavemen ..."
Abstract

Cited by 24 (9 self)
 Add to MetaCart
(Show Context)
Abstract—Given a real world graph, how should we layout its edges? How can we compress it? These questions are closely related, and the typical approach so far is to find cliquelike communities, like the ‘cavemen graph’, and compress them. We show that the blockdiagonal mental image of the ‘cavemen graph ’ is the wrong paradigm, in full agreement with earlier results that real world graphs have no good cuts. Instead, we propose to envision graphs as a collection of hubs connecting spokes, with superhubs connecting the hubs, and so on, recursively. Based on the idea, we propose the SLASHBURN method (burn the hubs, and slash the remaining graph into smaller connected components). Our view point has several advantages: (a) it avoids the ‘no good cuts ’ problem, (b) it gives better compression, and (c) it leads to faster execution times for matrixvector operations, which are the backbone of most graph processing tools. Experimental results show that our SLASHBURN method consistently outperforms other methods on all datasets, giving good compression and faster running time.
Radius Plots for Mining Terabyte Scale Graphs: Algorithms, Patterns, and Observations
"... Given large, multimillion node graphs (e.g., FaceBook, webcrawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot ..."
Abstract

Cited by 22 (16 self)
 Add to MetaCart
(Show Context)
Given large, multimillion node graphs (e.g., FaceBook, webcrawls, etc.), how do they evolve over time? How are they connected? What are the central nodes and the outliers of the graphs? We show that the Radius Plot (pdf of node radii) can answer these questions. However, computing the Radius Plot is prohibitively expensive for graphs reaching the planetary scale. There are two major contributions in this paper: (a) We propose HADI (HAdoop DIameter and radii estimator), a carefully designed and finetuned algorithm to compute the diameter of massive graphs, that runs on the top of the HADOOP /MAPREDUCE system, with excellent scaleup on the number of available machines (b) We run HADI on several real world datasets including YahooWeb (6B edges, 1/8 of a Terabyte), one of the largest public graphs ever analyzed. Thanks to HADI, we report fascinating patterns on large networks, like the surprisingly small effective diameter, the multimodal/bimodal shape of the Radius Plot, and its palindrome motion over time. 1
CopyCatch: Stopping Group Attacks by Spotting Lockstep Behavior in Social Networks
"... How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning illgotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our ..."
Abstract

Cited by 19 (7 self)
 Add to MetaCart
(Show Context)
How can web services that depend on user generated content discern fraudulent input by spammers from legitimate input? In this paper we focus on the social network Facebook and the problem of discerning illgotten Page Likes, made by spammers hoping to turn a profit, from legitimate Page Likes. Our method, which we refer to as CopyCatch, detects lockstep Page Like patterns on Facebook by analyzing only the social graph between users and Pages and the times at which the edges in the graph (the Likes) were created. We offer the following contributions: (1) We give a novel problem formulation, with a simple concrete definition of suspicious behavior in terms of graph structure and edge constraints. (2) We offer two algorithms to find such suspicious lockstep behavior one provablyconvergent iterative algorithm and one approximate, scalable MapReduce implementation. (3) We show that our method severely limits “greedy attacks ” and analyze the bounds from the application of the Zarankiewicz problem to our setting. Finally, we demonstrate and discuss the effectiveness of CopyCatch at Facebook and on synthetic data, as well as potential extensions to anomaly detection problems in other domains. CopyCatch is actively in use at Facebook, searching for attacks on Facebook’s social graph of over a billion users, many millions of Pages, and billions of Page Likes.
Scalable clustering algorithm for Nbody simulations in a
"... sharednothing cluster ..."
(Show Context)
Centralities in Large Networks: Algorithms and Observations
"... Node centrality measures are important in a large number of graph applications, from search and ranking to social and biological network analysis. In this paper we study node centrality for very large graphs, up to billions of nodes and edges. Various definitions for centrality have been proposed, r ..."
Abstract

Cited by 16 (2 self)
 Add to MetaCart
(Show Context)
Node centrality measures are important in a large number of graph applications, from search and ranking to social and biological network analysis. In this paper we study node centrality for very large graphs, up to billions of nodes and edges. Various definitions for centrality have been proposed, ranging from very simple (e.g., node degree) to more elaborate. However, measuring centrality in billionscale graphs poses several challenges. Many of the “traditional ” definitions such as closeness and betweenness were not designed with scalability in mind. Therefore, it is very difficult, if not impossible, to compute them both accurately and efficiently. In this paper, we propose centrality measures suitable for very large graphs, as well as scalable methods to effectively compute them. More specifically, we propose effective closeness and LINERANK which are designed for billionscale graphs. We also develop algorithms to compute the proposed centrality measures in MAPREDUCE, a modern paradigm for largescale, distributed data processing. We present extensive experimental results on both synthetic and real datasets, which demonstrate the scalability of our approach to very large graphs, as well as interesting findings and anomalies. 1