Results 1–4 of 4
Scalable similarity estimation in social networks: closeness, node labels, and random edge lengths
In COSN, 2013
Abstract

Cited by 6 (4 self)
Similarity estimation between nodes based on structural properties of graphs is a basic building block used in the analysis of massive networks for diverse purposes such as link prediction, product recommendations, advertisement, collaborative filtering, and community discovery. While local similarity measures, based on properties of immediate neighbors, are easy to compute, those relying on global properties have better recall. Unfortunately, this better quality comes with a computational price tag. Aiming for both accuracy and scalability, we make several contributions. First, we define closeness similarity, a natural measure that compares two nodes based on the similarity of their relations to all other nodes. Second, we show how the all-distances sketch (ADS) node labels, which are efficient to compute, can support the estimation of closeness similarity and shortest-path (SP) distances in logarithmic query time. Third, we propose the randomized edge lengths (REL) technique and define the corresponding REL distance, which captures both path length and path multiplicity and therefore improves over the SP distance as a similarity measure. The REL distance can also be the basis of closeness similarity and can be estimated using SP computation or the ADS labels. We demonstrate the effectiveness of our measures and the accuracy of our estimates through experiments on social networks with up to tens of millions of nodes.
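The core idea of closeness similarity — comparing two nodes through their distances to all other nodes — can be sketched with exact BFS distances. Note that the decaying weight and the Jaccard-style ratio below are illustrative assumptions, not the paper's exact definition, and the paper replaces exact BFS with ADS labels to make this scalable:

```python
from collections import deque

def bfs_distances(adj, source):
    """Single-source shortest-path distances by BFS (unit edge lengths)."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def closeness_similarity(adj, u, w, decay=0.5):
    """Compare u and w via their distance vectors to all other nodes.

    Each distance d is mapped to a weight decay**d, and the two weight
    vectors are compared with a Jaccard-style min/max ratio. Both the
    decay function and the ratio form are hypothetical choices made for
    illustration only.
    """
    du, dw = bfs_distances(adj, u), bfs_distances(adj, w)
    fu = {v: decay ** d for v, d in du.items()}
    fw = {v: decay ** d for v, d in dw.items()}
    nodes = set(fu) | set(fw)
    num = sum(min(fu.get(v, 0.0), fw.get(v, 0.0)) for v in nodes)
    den = sum(max(fu.get(v, 0.0), fw.get(v, 0.0)) for v in nodes)
    return num / den if den else 0.0
```

On a small example graph, a node is maximally similar to itself, and two nodes sharing their neighborhood score higher than a distant pair, which matches the intuition the abstract describes.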
Is MinWise Hashing Optimal for Summarizing Set Intersection?
Abstract

Cited by 2 (1 self)
Minwise hashing is an important method for estimating the size of the intersection of sets, based on a succinct summary (a "minhash") independently computed for each set. One application is estimating the number of data points that satisfy the conjunction of m ≥ 2 simple predicates, where a minhash is available for the set of points satisfying each predicate. This has applications in query optimization and in approximate computation of COUNT aggregates. In this paper we address the question: how many bits must be allocated to each summary in order to get an estimate with 1 ± ε relative error? The state-of-the-art technique for minimizing the encoding size, for any desired estimation error, is b-bit minwise hashing due to Li and König (Communications of the ACM, 2011). We give new lower and upper bounds:
• Using information complexity arguments, we show that b-bit minwise hashing is space-optimal for m = 2 predicates, in the sense that the estimator's variance is within a constant factor of the smallest possible among all summaries with the given space usage. But for conjunctions of m > 2 predicates we show that the performance of b-bit minwise hashing (and, more generally, of any method based on "k-permutation" minhash) deteriorates as m grows.
• We describe a new summary that nearly matches our lower bound for m ≥ 2. It asymptotically outperforms all k-permutation schemes (by roughly a factor Ω(m / log m)), as well as methods based on subsampling (by a factor Ω(log n_max), where n_max is the maximum set size).
Estimation for monotone sampling: Competitiveness and customization
In PODC, ACM, 2014