Results 1–10 of 51
A Few Chirps About Twitter
Abstract

Cited by 141 (5 self)
Web 2.0 has brought about several new applications that enable arbitrary subsets of users to communicate with each other on a social basis. Such communication increasingly happens not just on Facebook and MySpace but on several smaller network applications such as Twitter and Dodgeball. We present a detailed characterization of Twitter, an application that allows users to send short messages. We gathered three datasets (covering nearly 100,000 users), including constrained crawls of the Twitter network using two different methodologies and a sampled collection from the publicly available timeline. We identify distinct classes of Twitter users and their behaviors, characterize the network's geographic growth patterns and current size, and compare crawl results obtained under rate-limiting constraints.
Walking in Facebook: A Case Study of Unbiased Sampling of OSNs
 in Proc. IEEE INFOCOM, 2010
Abstract

Cited by 49 (11 self)
With more than 250 million active users [1], Facebook (FB) is currently one of the most important online social networks. Our goal in this paper is to obtain a representative (unbiased) sample of Facebook users by crawling its social graph. In this quest, we consider and implement several candidate techniques. Two approaches that are found to perform well are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth" (UNI, obtained through true uniform sampling of FB user IDs). In contrast, the traditional Breadth-First Search (BFS) and Random Walk (RW) perform quite poorly, producing substantially biased results. In addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these can be used to effectively determine when a random walk sample is of adequate size and quality for subsequent use (i.e., when it is safe to cease sampling). Using these methods, we collect the first, to the best of our knowledge, unbiased sample of Facebook. Finally, we use one of our representative datasets, collected through MHRW, to characterize several key properties of Facebook. Index Terms—Measurements, online social networks, Facebook, graph sampling, crawling, bias.
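The MHRW idea in this abstract can be sketched in a few lines: at the current node, propose a uniformly chosen neighbor and accept the move with probability min(1, deg(v)/deg(w)), which makes the walk's stationary distribution uniform over nodes. The adjacency-dict representation and function name below are illustrative, not the authors' implementation.

```python
import random

def mh_random_walk(graph, start, num_steps, rng=random.Random(0)):
    """Metropolis-Hastings random walk yielding (approximately) uniform
    node samples on an undirected graph given as {node: [neighbors]}.

    At node v, propose a uniformly chosen neighbor w and accept the move
    with probability min(1, deg(v)/deg(w)); otherwise stay at v.
    Fixed-seed rng keeps this sketch reproducible.
    """
    v = start
    samples = []
    for _ in range(num_steps):
        w = rng.choice(graph[v])
        if rng.random() < min(1.0, len(graph[v]) / len(graph[w])):
            v = w          # accept the move
        samples.append(v)  # a repeated node still counts as a sample
    return samples
```

On a star graph, where a plain random walk would visit the hub half the time, the acceptance rule rebalances the visit frequencies toward uniform.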
Characterizing files in the modern gnutella network: A measurement study
 in Proceedings of SPIE/ACM Multimedia Computing and Networking, 2006
Abstract

Cited by 43 (4 self)
The Internet has witnessed an explosive increase in the popularity of Peer-to-Peer (P2P) file-sharing applications during the past few years. As these applications become more popular, it becomes increasingly important to characterize their behavior in order to improve their performance and quantify their impact on the network. In this paper, we present a measurement study on characteristics of available files in the modern Gnutella system. We developed a new methodology to capture accurate "snapshots" of available files in a large-scale P2P system. This methodology was implemented in a parallel crawler that captures the entire overlay topology of the system, where each peer in the overlay is annotated with its available files. We have captured tens of snapshots of the Gnutella system and conducted three types of analysis on available files: (i) static analysis, (ii) topological analysis, and (iii) dynamic analysis. Our results reveal several interesting properties of available files in Gnutella that can be leveraged to improve the design and evaluation of P2P file-sharing applications.
Estimating and sampling graphs with multidimensional random walks
 2010
Abstract

Cited by 26 (4 self)
Estimating characteristics of large graphs via sampling is a vital part of the study of complex networks. Current sampling methods such as (independent) random vertex sampling and random walks are useful but have drawbacks. Random vertex sampling may require too many resources (time, bandwidth, or money). Random walks, which normally require fewer resources per sample, can suffer from large estimation errors in the presence of disconnected or loosely connected graphs. In this work we propose a new m-dimensional random walk that uses m dependent random walkers. We show that the proposed sampling method, which we call Frontier sampling, exhibits all of the nice sampling properties of a regular random walk. At the same time, our simulations over large real-world graphs show that, in the presence of disconnected or loosely connected components, Frontier sampling exhibits lower estimation errors than regular random walks. We also show that Frontier sampling is more suitable than random vertex sampling for sampling the tail of the degree distribution of the graph.
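A minimal sketch of the m-walker idea: keep m dependent walkers, and at each step advance one walker, chosen with probability proportional to its current node's degree, recording the traversed edge. The data structures and names are hypothetical, and the paper's full Frontier sampling method includes analysis and refinements not shown here.

```python
import random

def frontier_sampling(graph, m, num_steps, rng=random.Random(1)):
    """Sketch of Frontier sampling with m dependent random walkers.

    Start m walkers at uniformly chosen nodes of an undirected graph
    given as {node: [neighbors]}. At each step, choose a walker with
    probability proportional to its current node's degree, move it to
    a uniformly chosen neighbor, and record the traversed edge.
    """
    walkers = [rng.choice(list(graph)) for _ in range(m)]
    edges = []
    for _ in range(num_steps):
        degs = [len(graph[v]) for v in walkers]
        i = rng.choices(range(m), weights=degs)[0]  # degree-biased pick
        w = rng.choice(graph[walkers[i]])
        edges.append((walkers[i], w))
        walkers[i] = w
    return edges
```

Because several walkers run in parallel, some can land in components that a single walk would never reach, which is the intuition behind the lower estimation errors on loosely connected graphs.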
A walk in facebook: Uniform sampling of users in online social networks
 2009
Abstract

Cited by 22 (1 self)
The popularity of online social networks (OSNs) has given rise to a number of measurement studies that provide a first step towards their understanding. So far, such studies have been based either on complete data sets provided directly by the OSN itself or on Breadth-First Search (BFS) crawling of the social graph, which does not guarantee good statistical properties of the collected sample. In this paper, we crawl the publicly available social graph and present the first unbiased sampling of Facebook (FB) users, using a Metropolis-Hastings random walk with multiple chains. We study the convergence properties of the walk and demonstrate the uniformity of the collected sample with respect to multiple metrics of interest. We provide a comparison of our crawling technique to baseline algorithms, namely BFS and simple random walk, as well as to the "ground truth" obtained through truly uniform sampling of user IDs. Our contributions lie both in the measurement methodology and in the collected sample. With regard to the methodology, our measurement technique (i) applies and combines known results from random walk sampling specifically in the OSN context and (ii) addresses system implementation aspects that have made the measurement of Facebook challenging so far. With respect to the collected sample: (i) it is the first representative sample of FB users and we plan to make it publicly available; (ii) we perform a characterization of several key properties of the data set, and find that some of them are substantially different from what was previously believed based on non-representative OSN samples.
The many facets of Internet topology and traffic
 Networks and Heterogeneous Media
Abstract

Cited by 19 (10 self)
The Internet's layered architecture and organizational structure give rise to a number of different topologies, with the lower layers defining more physical and the higher layers more virtual/logical types of connectivity structures. These structures are very different, and successful Internet topology modeling requires annotating the nodes and edges of the corresponding graphs with information that reflects their network-intrinsic meaning. These structures also give rise to different representations of the traffic that traverses the heterogeneous Internet, and a traffic matrix is a compact and succinct description of the traffic exchanges between the nodes in a given connectivity structure. In this paper, we summarize recent advances in Internet research related to (i) inferring and modeling the router-level topologies of individual service providers (i.e., the physical connectivity structure of an ISP, where nodes are routers/switches and links represent physical connections), (ii) estimating the intra-AS traffic matrix when the AS's router-level topology and routing configuration are known, (iii) inferring and modeling the Internet's AS-level topology, and (iv) estimating the inter-AS traffic matrix. We also discuss recent work on Internet connectivity structures that arise at the higher layers in the TCP/IP protocol stack and are more virtual and dynamic; e.g., overlay networks like the WWW graph, where nodes are web pages and edges represent existing hyperlinks, or P2P networks like Gnutella, where nodes represent peers and two peers are connected if they have an active network connection.
Respondent-driven Sampling for Characterizing Unstructured Overlays
Abstract

Cited by 14 (0 self)
This work presents Respondent-driven Sampling (RDS) as a promising technique to derive unbiased estimates of node properties in unstructured overlay networks such as Gnutella. Using RDS and a previously proposed technique, namely Metropolized Random Walk (MRW) sampling, we examine the efficiency of estimating node properties in unstructured overlays and identify some of the key factors that determine the accuracy of sampling techniques. We evaluate the RDS and MRW techniques using simulation over a wide range of static and dynamic graphs, as well as experiments over a widely deployed Gnutella network. Our study sheds light on how the connectivity structure among nodes and its dynamics affect the accuracy and efficiency of the two sampling techniques. Both techniques exhibit rather similar performance over a wide range of scenarios. However, RDS significantly outperforms MRW when the overlay structure exhibits a combination of highly skewed node degrees and highly skewed (local) clustering coefficients.
Multigraph Sampling of Online Social Networks
 in IEEE J. Sel. Areas Commun. on Measurement of Internet Topologies, 2011
Abstract

Cited by 11 (6 self)
State-of-the-art techniques for probability sampling of users of online social networks (OSNs) are based on random walks on a single social relation (typically friendship). While powerful, these methods rely on the social graph being fully connected. Furthermore, the mixing time of the sampling process strongly depends on the characteristics of this graph. In this paper, we observe that there often exist other relations between OSN users, such as membership in the same group or participation in the same event. We propose to exploit the graphs these relations induce, by performing a random walk on their union multigraph. We design a computationally efficient way to perform multigraph sampling by randomly selecting the graph on which to walk at each iteration. We demonstrate the benefits of our approach through (i) simulation in synthetic graphs, and (ii) measurements of Last.fm, an Internet website for music with social networking features. More specifically, we show that multigraph sampling can obtain a representative sample and faster convergence, even when the individual graphs fail, i.e., are disconnected or highly clustered.
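The union-multigraph walk described above can be sketched as follows: at each step, pick one relation with probability proportional to the current node's degree in that relation (equivalent to choosing uniformly among all of the node's multigraph edges), then move to a uniform neighbor in the chosen relation. Names and data structures here are illustrative, not the paper's implementation.

```python
import random

def multigraph_walk(relations, start, num_steps, rng=random.Random(2)):
    """Sketch of a random walk on the union multigraph of several
    relations (e.g. friendship, group co-membership), each given as
    {node: [neighbors]}.

    At node v, pick a relation with probability proportional to v's
    degree in that relation, then move to a uniform neighbor there.
    """
    v = start
    visited = [v]
    for _ in range(num_steps):
        degs = [len(rel.get(v, [])) for rel in relations]
        if sum(degs) == 0:
            break  # node is isolated in every relation
        i = rng.choices(range(len(relations)), weights=degs)[0]
        v = rng.choice(relations[i][v])
        visited.append(v)
    return visited
```

The point of walking on the union is visible in a toy example: if the friendship graph alone is disconnected, a second relation can bridge its components, so the combined walk still reaches every node.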
Practical recommendations on crawling online social networks
 in IEEE Journal on Selected Areas in Communications, 2011
Abstract

Cited by 11 (1 self)
Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows one to estimate any user property, and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth." In contrast, using Breadth-First Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.
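As a hedged illustration of the RWRW idea: a simple random walk visits node v with probability proportional to deg(v), so weighting each visited node by 1/deg(v) (a Hansen–Hurwitz-style estimator) corrects the degree bias when estimating the mean of a node property. The function below is a sketch under that assumption, not the paper's code.

```python
def rwrw_estimate(samples, degree, prop):
    """Re-weighted random walk estimator for the population mean of a
    node property.

    `samples` is the node sequence produced by a simple random walk on
    an undirected graph; `degree(v)` and `prop(v)` return v's degree
    and property value. Each sample is weighted by 1/degree(v) to undo
    the walk's degree-proportional visiting bias.
    """
    num = sum(prop(v) / degree(v) for v in samples)
    den = sum(1.0 / degree(v) for v in samples)
    return num / den
```

For instance, with samples ["a", "b", "b"], degrees {a: 2, b: 1}, and property {a: 1.0, b: 0.0}, the estimate is (0.5) / (0.5 + 1 + 1) = 0.2, down-weighting the low-degree node's repeat visits relative to a naive average.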
Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks
 in Proc. ACM SIGMETRICS, 2011
Abstract

Cited by 10 (7 self)
Our objective is to sample the node set of a large unknown graph via crawling, to accurately estimate a given metric of interest. We design a random walk on an appropriately defined weighted graph that achieves high efficiency by preferentially crawling those nodes and edges that convey greater information regarding the target metric. Our approach begins by employing the theory of stratification to find optimal node weights, for a given estimation problem, under an independence sampler. While optimal under independence sampling, these weights may be impractical under graph crawling due to constraints arising from the structure of the graph. Therefore, the edge weights for our random walk should be chosen so as to lead to an equilibrium distribution that strikes a balance between approximating the optimal weights under an independence sampler and achieving fast convergence. We propose a heuristic approach (stratified weighted random walk, or S-WRW) that achieves this goal while using only limited information about the graph structure and the node properties. We evaluate our technique in simulation and experimentally, by collecting a sample of Facebook college users. We show that S-WRW requires 13–15 times fewer samples than the simple re-weighted random walk (RW) to achieve the same estimation accuracy for a range of metrics.
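The building block here is a random walk whose next hop is chosen with probability proportional to an edge weight; the walk's stationary probability at a node is proportional to the node's total incident weight, which is how weight choices steer sampling effort toward informative regions. Below is a minimal sketch of that building block only (the weight function and names are hypothetical, and the stratification machinery of S-WRW is omitted).

```python
import random

def weighted_random_walk(graph, weight, start, num_steps,
                         rng=random.Random(3)):
    """Random walk on an edge-weighted undirected graph.

    `graph` maps each node to its neighbor list; `weight(u, v)` is a
    caller-supplied symmetric edge weight. From node v, the walk moves
    to neighbor w with probability proportional to weight(v, w), so
    heavier edges and their endpoints are visited more often.
    """
    v = start
    visited = [v]
    for _ in range(num_steps):
        nbrs = graph[v]
        v = rng.choices(nbrs, weights=[weight(v, w) for w in nbrs])[0]
        visited.append(v)
    return visited
```

On a triangle where one edge carries ten times the weight of the others, the walk concentrates on that edge's two endpoints, illustrating how a weighting scheme acts as the "magnifying glass" of the title.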