Results 1-10 of 23
Practical recommendations on crawling online social networks
IEEE Journal on Selected Areas in Communications, 2011
Cited by 37 (1 self)
Our goal in this paper is to develop a practical framework for obtaining a uniform sample of users in an online social network (OSN) by crawling its social graph. Such a sample allows one to estimate any user property and some topological properties as well. To this end, first, we consider and compare several candidate crawling techniques. Two approaches that can produce approximately uniform samples are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the “ground truth.” In contrast, using Breadth-First-Search (BFS) or an unadjusted Random Walk (RW) leads to substantially biased results. Second, and in addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these diagnostics can be used to effectively determine when a random walk sample is of adequate size and quality. Third, as a case study, we apply the above methods to Facebook and we collect the first, to the best of our knowledge, representative sample of Facebook users. We make it publicly available and employ it to characterize several key properties of Facebook.
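The contrast between MHRW, RWRW, and an unadjusted RW can be illustrated on a toy graph. The following is a minimal sketch, not the paper's implementation: the function names, the `wheel` test graph, and the choice of mean degree as the estimated property are all assumptions made for illustration. MHRW accepts a proposed move from v to w with probability min(1, deg(v)/deg(w)); RWRW runs a plain random walk and weights each sample by 1/deg.

```python
import random

def wheel(n_rim):
    """Hypothetical test graph: hub 0 linked to rim nodes 1..n_rim, which form a ring."""
    adj = {0: list(range(1, n_rim + 1))}
    for i in range(1, n_rim + 1):
        left = n_rim if i == 1 else i - 1
        right = 1 if i == n_rim else i + 1
        adj[i] = [0, left, right]
    return adj

def mhrw(adj, start, steps, rng):
    """Metropolis-Hastings random walk whose target distribution is uniform over nodes."""
    v, visited = start, []
    for _ in range(steps):
        w = rng.choice(adj[v])
        # Accept with probability min(1, deg(v)/deg(w)); otherwise stay at v.
        if rng.random() < len(adj[v]) / len(adj[w]):
            v = w
        visited.append(v)
    return visited

def simple_rw(adj, start, steps, rng):
    """Unadjusted random walk; visits nodes in proportion to their degree."""
    v, visited = start, []
    for _ in range(steps):
        v = rng.choice(adj[v])
        visited.append(v)
    return visited

def rwrw_mean(samples, adj, f):
    """Re-weighted estimate of the mean of f: each RW sample is weighted by 1/deg."""
    num = sum(f(v) / len(adj[v]) for v in samples)
    den = sum(1.0 / len(adj[v]) for v in samples)
    return num / den
```

On `wheel(10)` the true mean degree is 40/11 (about 3.64), while an unadjusted walk spends a quarter of its time at the hub, so its naive sample mean tends toward 4.75. Averaging degrees over an `mhrw` sample, or calling `rwrw_mean` on a `simple_rw` sample, should land near the true value for a long enough walk.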
Towards unbiased BFS sampling
IEEE Journal on Selected Areas in Communications, 2011
Cited by 26 (4 self)
Breadth First Search (BFS) is a widely used approach for sampling large graphs. However, it has been empirically observed that BFS sampling is biased toward high-degree nodes, which may strongly affect the measurement results. In this paper, we quantify and correct the degree bias of BFS. First, we consider a random graph RG(p_k) with an arbitrary degree distribution p_k. For this model, we calculate the node degree distribution expected to be observed by BFS as a function of the fraction f of covered nodes. We also show that, for RG(p_k), all commonly used graph traversal techniques (BFS, DFS, Forest Fire, Snowball Sampling, RDS) have exactly the same bias. Next, we propose a practical BFS-bias correction procedure that takes as input a collected BFS sample together with the fraction f. Our correction technique is exact (i.e., leads to unbiased estimation) for RG(p_k). Furthermore, it performs well when applied to a broad range of Internet topologies and to two large BFS samples of Facebook and Orkut networks.
Beyond random walk and Metropolis-Hastings samplers: Why you should not backtrack for unbiased graph sampling
2012
Coarse-Grained Topology Estimation via Graph Sampling
2012
Cited by 10 (4 self)
In many online networks, nodes are partitioned into categories (e.g., countries or universities in OSNs), which naturally defines a weighted category graph, i.e., a coarse-grained version of the underlying network. In this paper, we show how to efficiently estimate the category graph from a probability sample of nodes. We prove consistency of our estimators and evaluate their efficiency via simulation. We also apply our methodology to a sample of Facebook users to obtain a number of category graphs, such as the college friendship graph and the country friendship graph. We share and visualize the resulting data at www.geosocialmap.com.
2.5K-Graphs: from Sampling to Generation
Cited by 5 (2 self)
Understanding network structure and having access to realistic graphs play a central role in computer and social networks research. In this paper, we propose a complete, practical methodology for generating graphs that resemble a real graph of interest. The metrics of the original topology that we aim to match are the joint degree distribution (JDD) and the degree-dependent average clustering coefficient c̄(k). We start by developing efficient estimators for these two metrics based on a node sample collected via either independence sampling or random walks. Then, we process the output of the estimators to ensure that the target metrics are realizable. Finally, we propose an efficient algorithm for generating topologies that have the exact target JDD and a c̄(k) close to the target. Extensive simulations using real-life graphs show that the graphs generated by our methodology are similar to the original graph with respect to not only the two target metrics, but also a wide range of other topological metrics. Furthermore, our generator is orders of magnitude faster than state-of-the-art techniques.
Faster random walks by rewiring online social networks on-the-fly
In ICDE, 2013
Cited by 5 (2 self)
Many online social networks feature restrictive web interfaces which only allow the query of a user’s local neighborhood through the interface. To enable analytics over such an online social network through its restrictive web interface, many recent efforts reuse the existing Markov Chain Monte Carlo methods such as random walks to sample the social network and support analytics based on the samples. The problem with such an approach, however, is the large number of queries often required (i.e., a long “mixing time”) for a random walk to reach a desired (stationary) sampling distribution. In this paper, we consider a novel problem of enabling a faster random walk over online social networks by “rewiring” the social network on-the-fly. Specifically, we develop Modified TOpology (MTO)-Sampler which, by using only information exposed by the restrictive web interface, constructs a “virtual” overlay topology of the social network while performing a random walk, and ensures that the random walk follows the modified overlay topology rather than the original one. We show that MTO-Sampler not only provably enhances the efficiency of sampling, but also achieves significant savings on query cost over real-world online social networks such as Google Plus, Epinion, etc.
Graph Size Estimation
Cited by 3 (2 self)
Many online networks are not fully known and are often studied via sampling. Random Walk (RW) based techniques are the current state-of-the-art for estimating nodal attributes and local graph properties, but estimating global properties remains a challenge. In this paper, we are interested in a fundamental property of this type: the graph size N, i.e., the number of its nodes. Existing methods for estimating N (i) are inefficient and (ii) cannot be easily used with RW sampling due to dependence between successive samples. In this paper, we address both problems. First, we propose IE (Induced Edges), an efficient technique for estimating N from an independence sample of a graph’s nodes. IE exploits the edges induced on the sampled nodes. Second, we introduce SafetyMargin, a method that corrects estimators for dependence in RW samples. Finally, we combine these two stand-alone techniques to obtain an RW-based graph size estimator. We evaluate our approach in simulations on a wide range of real-life topologies, and on several samples of Facebook. IE with SafetyMargin typically requires at least 10 times fewer samples than the state-of-the-art techniques (over 100 times in the case of Facebook) for the same estimation error.
Keywords: graph size estimation, network sampling, random walk, online social networks, measurement
Sampling social networks using shortest paths
Cited by 2 (2 self)
In recent years, online social networks (OSNs) have emerged as a platform for sharing a variety of information about people and their interests, activities, events, and news from the real world. Due to the large scale and access limitations (e.g., privacy policies) of online social network services such as Facebook and Twitter, it is difficult to access the whole public network in a limited amount of time. For this reason, researchers try to study and characterize OSNs by taking appropriate and reliable samples from the network. In this paper, we propose to use the concept of shortest paths for sampling social networks. The proposed sampling method first finds the shortest paths between several pairs of nodes selected according to some criteria. Then the edges in these shortest paths are ranked according to the number of times that each edge has appeared in the set of found shortest paths. The sampled network is then computed as a subgraph of the social network which contains a percentage of highly ranked edges. In order to investigate the performance of the proposed sampling method, we provide a number of experiments on synthetic and real networks. Experimental results show that the proposed sampling method outperforms existing methods such as random edge sampling, random node sampling, random walk sampling, and Metropolis-Hastings random walk sampling in terms of relative error (RE), normalized root mean square error (NMSE), and the Kolmogorov-Smirnov (KS) test.
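The three steps described above (find shortest paths, rank edges by how often they appear, keep the top-ranked edges) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: node pairs are drawn uniformly at random because the abstract leaves the selection criteria open, the kept "percentage" is taken relative to the edges that appeared on at least one path, and all function and parameter names are hypothetical.

```python
import random
from collections import Counter, deque

def shortest_path(adj, src, dst):
    """BFS shortest path in an unweighted graph; returns a node list, or None if unreachable."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for w in adj[u]:
            if w not in parent:
                parent[w] = u
                queue.append(w)
    return None

def shortest_path_sample(adj, num_pairs, keep_fraction, rng):
    """Rank edges by how often they lie on sampled shortest paths; keep the top fraction."""
    nodes = sorted(adj)
    counts = Counter()
    for _ in range(num_pairs):
        s, t = rng.sample(nodes, 2)          # assumption: uniform random node pairs
        path = shortest_path(adj, s, t)
        if path is None:
            continue
        for a, b in zip(path, path[1:]):
            counts[frozenset((a, b))] += 1   # undirected edge
    ranked = [edge for edge, _ in counts.most_common()]
    # Assumption: the fraction is relative to edges seen on at least one path.
    keep = max(1, int(keep_fraction * len(ranked)))
    return [tuple(sorted(edge)) for edge in ranked[:keep]]
```

For two triangles joined by a single bridge edge, most sampled pairs must cross the bridge, so a small `keep_fraction` is expected to retain the bridge first.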
Online myopic network covering
CoRR
Cited by 2 (1 self)
Efficient marketing or awareness-raising campaigns seek to recruit n influential individuals (where n is the campaign budget) that are able to cover a large target audience through their social connections. So far most of the related literature on maximizing this network cover assumes that the social network topology is known. Even in such a case the optimal solution is NP-hard. In practice, however, the network topology is generally unknown and needs to be discovered on-the-fly. In this work we consider an unknown topology where recruited individuals disclose their social connections (a feature known as one-hop lookahead). The goal of this work is to provide an efficient greedy online algorithm that recruits individuals so as to maximize the size of the target audience covered by the campaign. We propose a new greedy online algorithm, Maximum Expected d-Excess Degree (MEED), and provide, to the best of our knowledge, the first detailed theoretical analysis of the cover size of a variety of well-known network sampling algorithms on finite networks. Our proposed algorithm greedily maximizes the expected size of the cover. For a class of random power law networks we show that MEED simplifies into a straightforward procedure, which we denote MOD (Maximum Observed Degree). We substantiate our analytical results with extensive simulations and show that MOD significantly outperforms all analyzed myopic algorithms. We note that performance may be further improved if the node degree distribution is known or can be estimated online during the campaign.
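A greedy recruiter in the spirit of MOD might look like the sketch below. The abstract does not spell out how candidates are scored, so this is only one possible reading: a candidate's observed degree is assumed to be the number of already recruited individuals who disclosed a link to it, ties are broken by smallest node id, and the cover is assumed to be the recruits plus their neighbors. All names are hypothetical.

```python
def mod_recruit(adj, seed, budget):
    """Greedy recruitment by maximum observed degree under one-hop lookahead (assumed reading)."""
    recruited = [seed]
    recruited_set = {seed}
    observed = {}  # candidate -> number of recruited neighbors that disclosed it

    def disclose(v):
        # A recruited node reveals its neighbors (one-hop lookahead).
        for w in adj[v]:
            if w not in recruited_set:
                observed[w] = observed.get(w, 0) + 1

    disclose(seed)
    while len(recruited) < budget and observed:
        # Highest observed degree first; smallest id breaks ties (assumption).
        best = min(observed, key=lambda w: (-observed[w], w))
        del observed[best]
        recruited.append(best)
        recruited_set.add(best)
        disclose(best)

    # Assumed cover: recruited nodes together with all of their neighbors.
    covered = set(recruited)
    for v in recruited:
        covered.update(adj[v])
    return recruited, covered
```

Only the adjacency lists of recruited nodes are read, which mirrors the setting where the topology is discovered as the campaign proceeds.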
Leveraging History for Faster Sampling of Online Social Networks
With a vast amount of data available on online social networks, how to enable efficient analytics over such data has been an increasingly important research problem. Given the sheer size of such social networks, many existing studies resort to sampling techniques that draw random nodes from an online social network through its restrictive web/API interface. While these studies differ widely in analytics tasks supported and algorithmic design, almost all of them use the exact same underlying technique of random walk, a Markov Chain Monte Carlo based method which iteratively transits from one node to its random neighbor. Random walk fits naturally with this problem because, for most online social networks, the only query we can issue through the interface is to retrieve the neighbors of a given node (i.e., no access to the full graph topology). A problem with random walks, however, is the "burn-in" period which requires a large number of transitions/queries before the sampling distribution converges to a stationary value that enables the drawing of samples in a statistically valid manner. In this paper, we consider a novel problem of speeding up the fundamental design of random walks (i.e., reducing the number of queries it requires) without changing the stationary distribution it achieves, thereby enabling a more efficient "drop-in" replacement for existing sampling-based analytics techniques over online social networks. Technically, our main idea is to leverage the history of random walks to construct a higher-ordered Markov chain. We develop two algorithms, Circulated Neighbors and Groupby Neighbors Random Walk (CNRW and GNRW), and rigorously prove that, no matter what the social network topology is, CNRW and GNRW offer better efficiency than baseline random walks while achieving the same stationary distribution. We demonstrate through extensive experiments on real-world social networks and synthetic graphs the superiority of our techniques over the existing ones.
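The idea of using walk history can be sketched as a history-aware walk in the style of CNRW. The abstract gives no algorithmic detail, so the rule below is an assumption about how "circulated neighbors" works: after each transition u to v, the next node is drawn from the neighbors of v without replacement, and the pool for that (u, v) pair is refilled only once every neighbor has been used. Function and variable names are hypothetical.

```python
import random

def cnrw(adj, start, steps, rng):
    """History-aware random walk: circulate through neighbors per incoming transition."""
    remaining = {}  # (prev, cur) -> neighbors of cur not yet chosen after that transition
    prev, cur = None, start
    walk = [cur]
    for _ in range(steps):
        key = (prev, cur)
        pool = remaining.get(key)
        if not pool:
            # First visit via this transition, or pool exhausted: start a fresh random order.
            pool = list(adj[cur])
            rng.shuffle(pool)
            remaining[key] = pool
        nxt = pool.pop()
        prev, cur = cur, nxt
        walk.append(cur)
    return walk
```

Each step still issues a single neighbor query for the current node, so this sketch could stand in wherever a plain random walk step is used; the extra cost is the memory for the per-transition pools.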