Results 1 - 7 of 7
Sketch-based influence maximization and computation: Scaling up with guarantees
 In International Conference on Information and Knowledge Management (CIKM), 2014
Abstract

Cited by 4 (0 self)
Propagation of contagion through networks is a fundamental process. It is used to model the spread of information, influence, or a viral infection. Diffusion patterns can be specified by a probabilistic model, such as Independent Cascade (IC), or captured by a set of representative traces. Basic computational problems in the study of diffusion are influence queries (determining the potency of a specified seed set of nodes) and Influence Maximization (identifying the most influential seed set of a given size). Answering each influence query involves many edge traversals, and does not scale when there are many queries on very large graphs. The gold standard for Influence Maximization is the greedy algorithm, which iteratively adds to the seed set a node maximizing the marginal gain in influence. Greedy has a guaranteed approximation ratio of at least (1 − 1/e) and actually produces a sequence of nodes, with each prefix having an approximation guarantee with respect to the same-size optimum. Since Greedy does not scale well beyond a few million edges, for larger inputs one must currently use either heuristics or alternative algorithms designed for a pre-specified small seed set size. We develop a novel sketch-based design for influence computation. Our greedy Sketch-based Influence Maximization (SKIM) algorithm scales to graphs with billions of edges, with one to two orders of magnitude speedup over the best greedy methods. It still has a guaranteed approximation ratio, and in practice its quality nearly matches that of exact greedy. We also present influence oracles, which use linear-time preprocessing to generate a small sketch for each node, allowing the influence of any seed set to be quickly answered from the sketches of its nodes.
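The exact greedy baseline that the abstract contrasts with SKIM can be sketched in a few lines. The following is a toy Monte Carlo version, not the paper's sketch-based algorithm; the dict-of-lists graph, activation probability p, and simulation count are illustrative assumptions:

```python
import random

def simulate_ic(adj, seeds, p, rng):
    """One Independent Cascade simulation: each newly activated node
    tries once to activate each out-neighbor, with probability p."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return len(active)

def greedy_im(adj, k, p=0.1, sims=200, seed=0):
    """Exact-style greedy: repeatedly add the node with the largest
    estimated marginal gain in expected spread (Monte Carlo estimate)."""
    rng = random.Random(seed)
    nodes = set(adj) | {v for vs in adj.values() for v in vs}
    chosen = []
    for _ in range(k):
        base = (sum(simulate_ic(adj, chosen, p, rng) for _ in range(sims)) / sims
                if chosen else 0.0)
        best, best_gain = None, -1.0
        for u in nodes - set(chosen):
            est = sum(simulate_ic(adj, chosen + [u], p, rng)
                      for _ in range(sims)) / sims
            if est - base > best_gain:
                best, best_gain = u, est - base
        chosen.append(best)
    return chosen
```

Each prefix of the returned sequence is itself a greedy solution for that seed-set size, mirroring the prefix guarantee mentioned above; the cost of the repeated simulations is exactly what makes this baseline stop scaling at a few million edges.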
Computing Classic Closeness Centrality, at Scale
, 2014
Abstract

Cited by 1 (0 self)
Closeness centrality, first considered by Bavelas (1948), is an importance measure of a node in a network which is based on the distances from the node to all other nodes. The classic definition, proposed by Bavelas (1950), Beauchamp (1965), and Sabidussi (1966), is (the inverse of) the average distance to all other nodes. We propose the first highly scalable (near linear-time processing and linear space overhead) algorithm for estimating, within a small relative error, the classic closeness centralities of all nodes in the graph. Our algorithm applies to undirected graphs, as well as to centrality computed with respect to round-trip distances in directed graphs. For directed graphs, we also propose an efficient algorithm that approximates generalizations of classic closeness centrality to outbound and inbound centralities. Although it does not provide worst-case theoretical approximation guarantees, it is designed to perform well on real networks. We perform extensive experiments on large networks, demonstrating high scalability and accuracy.
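The classic definition above (inverse of the average distance to all other nodes) is easy to state in code. This is an illustrative exact computation via BFS on a small unweighted graph, not the paper's scalable estimator; the adjacency-dict representation is an assumption:

```python
from collections import deque

def closeness(adj, v):
    """Classic closeness centrality of v: the inverse of the average
    BFS distance from v to every other reachable node."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj.get(u, ()):
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    others = [d for node, d in dist.items() if node != v]
    if not others:
        return 0.0  # isolated node
    return len(others) / sum(others)  # inverse of the average distance
```

Running one BFS per node makes the exact computation O(nm), which is exactly why the abstract's near linear-time estimator is needed at scale.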
Estimation for monotone sampling: Competitiveness and customization
 In PODC, ACM
, 2014
Scalable Facility Location for Massive Graphs on Pregel-like Systems
, 2015
Abstract
We propose a new scalable algorithm for the facility-location problem. We study the graph setting, where the cost of serving a client from a facility is represented by the shortest-path distance on a graph. This setting is applicable to various problems arising in the Web and social media, and allows us to leverage the inherent sparsity of such graphs. To obtain truly scalable performance, we design a parallel algorithm that operates on clusters of shared-nothing machines. In particular, we target modern Pregel-like architectures, and we implement our algorithm on Apache Giraph. Our work builds upon previous results: a facility-location algorithm for the PRAM model, a recent distance-sketching method for massive graphs, and a parallel algorithm for finding maximal independent sets. The main challenge is to adapt those building blocks to the distributed graph setting, while maintaining the approximation guarantee and limiting the amount of distributed communication. Extensive experimental results show that our algorithm scales gracefully to graphs with billions of edges, while, in terms of quality, being competitive with state-of-the-art sequential algorithms.
GRECS: Graph Encryption for Approximate Shortest Distance Queries
Abstract
We propose graph encryption schemes that efficiently support approximate shortest distance queries on large-scale encrypted graphs. Shortest distance queries are one of the most fundamental graph operations and have a wide range of applications. Using such graph encryption schemes, a client can outsource large-scale privacy-sensitive graphs to an untrusted server without losing the ability to query them. Other applications include encrypted graph databases and controlled disclosure systems. We propose GRECS (which stands for GRaph EnCryption for approximate Shortest distance queries), which includes three oracle encryption schemes that are provably secure against any semi-honest server. Our first construction makes use of only symmetric-key operations, resulting in a computationally efficient construction. Our second scheme makes use of somewhat-homomorphic encryption and is less computationally efficient but achieves optimal communication complexity (i.e., uses a minimal amount of bandwidth). Finally, our third scheme is both computationally efficient and achieves optimal communication complexity at the cost of a small amount of additional leakage. We implemented and evaluated the efficiency of our constructions experimentally. The experiments demonstrate that our schemes are efficient and can be applied to graphs that scale up to 1.6 million nodes and 11 million edges.
Hashing for statistics over k-partitions*
Abstract
In this paper we propose a hash function for k-partitioning a set into bins so that we get good concentration bounds when combining statistics from different bins. To understand this point, suppose we have a fully random hash function applied to a set X of red and blue balls. We want to estimate the fraction f of red balls. The idea of MinHash is to sample the ball with the smallest hash value. This sample is uniformly random and is red with probability f. The standard method is to repeat the experiment k times with independent hash functions to reduce variance. Consider the alternative experiment using a single hash function, where we use some bits of the hash value to partition X into k bins, and then use the remaining bits as a local hash value. We pick the ball with the smallest hash value in each bin. The big difference between the two schemes is that the second one runs Ω(k) times faster. In the first experiment, each ball participated in k independent experiments, but in the second one with k-partitions, each ball picks its bin, and then only participates in the local experiment for that bin. Thus, essentially, we get k experiments for the price of one. However, no realistic hash function is known to give the desired concentration bounds, because the contents of different bins may be too correlated even if the marginal distribution for a single bin is random. Here, we present and analyze a hash function, showing that it does yield statistics similar to those of a fully random hash function when k-partitioning a set into bins. In this process we also give more insight into simple tabulation and show new results regarding the power of choice and moment estimation. *This article subsumes [11].
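The single-pass k-partition experiment contrasted above can be sketched directly. This toy version uses Python's built-in hash in place of the paper's tabulation-based hash function (an assumption for illustration); the low bits of each ball's hash choose its bin and the remaining bits serve as the local hash value:

```python
import random

def kpartition_estimate(balls, k, seed=0):
    """Estimate the fraction of red balls in one pass: each ball lands
    in one of k bins via its hash value, each bin keeps the ball with
    the smallest local hash, and the red fraction is averaged over the
    non-empty bins. `balls` is a list of (ball_id, is_red) pairs."""
    rng = random.Random(seed)
    salt = rng.getrandbits(64)      # stand-in for choosing a hash function
    best = {}                       # bin -> (local hash value, is_red)
    for ball, is_red in balls:
        h = hash((ball, salt))
        b = h % k                   # some bits pick the bin
        local = h // k              # remaining bits are the local hash value
        if b not in best or local < best[b][0]:
            best[b] = (local, is_red)
    samples = [is_red for _, is_red in best.values()]
    return sum(samples) / len(samples)
```

Each ball is touched exactly once, illustrating the "k experiments for the price of one" speedup; the paper's contribution is showing that a realistic (tabulation-based) hash function actually delivers the concentration bounds this sketch quietly assumes.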
Multi-Objective Weighted Sampling
Abstract
Key-value data sets of the form {(x, w_x)} where w_x > 0 are prevalent. Common queries over such data are segment f-statistics Q(f, H) = Σ_{x∈H} f(w_x), specified for a segment H of the keys and a function f. Different choices of f correspond to count, sum, moments, capping, and threshold statistics. When the data set is large, we can compute a smaller sample from which we can quickly estimate statistics. A weighted sample of keys taken with respect to f(w_x) provides estimates with statistically guaranteed quality for f-statistics. Such a sample S(f) can be used to estimate g-statistics for g ≠ f, but quality degrades with the disparity between g and f. In this paper we address applications that require quality estimates for a set F of different functions. A naive solution is to compute and work with a different sample S(f) for each f ∈ F. Instead, this can be achieved more effectively and seamlessly using a single multi-objective sample S(F) of a much smaller size. We review multi-objective sampling schemes and place them in our context of estimating f-statistics. We show that a multi-objective sample for F provides quality estimates for any f that is a positive linear combination of functions from F. We then establish a surprising and powerful result when the target set M is all monotone non-decreasing functions, noting that M includes most natural statistics. We provide efficient multi-objective sampling algorithms for M and show that a sample size of k ln n (where n is the number of active keys) provides the same estimation quality, for any f ∈ M, as a dedicated weighted sample of size k for f.
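A dedicated weighted sample S(f) for a single f, as described above, can be sketched with a with-replacement sampler and an inverse-probability estimator. The Hansen-Hurwitz estimator and the dict-based data layout here are illustrative assumptions, not the paper's multi-objective scheme:

```python
import random

def weighted_sample_estimate(data, f, k, seed=0):
    """Draw k keys with replacement, each with probability proportional
    to f(w_x), and return an estimator for segment sums Q(f, H).
    `data` maps key x -> weight w_x."""
    rng = random.Random(seed)
    keys = list(data)
    weights = [f(data[x]) for x in keys]
    total = sum(weights)
    sample = rng.choices(keys, weights=weights, k=k)

    def estimate(H):
        # Hansen-Hurwitz: a sampled x contributes f(w_x) / (k * p_x),
        # and p_x = f(w_x) / total, so each in-segment hit adds total / k.
        return sum(total / k for x in sample if x in H)

    return estimate
```

The estimator is unbiased for any segment H, but using it for a different statistic g ≠ f inflates the variance by the disparity between g and f, which is the gap the paper's single multi-objective sample S(F) closes.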