Results 1  10
of
174
Deeper inside pagerank
 Internet Mathematics
, 2004
"... Abstract. This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existe ..."
Abstract

Cited by 158 (5 self)
 Add to MetaCart
Abstract. This paper serves as a companion or extension to the “Inside PageRank” paper by Bianchini et al. [Bianchini et al. 03]. It is a comprehensive survey of all issues associated with PageRank, covering the basic PageRank model, available and recommended solution methods, storage issues, existence, uniqueness, and convergence properties, possible alterations to the basic model, suggested alternatives to the traditional solution methods, sensitivity and conditioning, and finally the updating problem. We introduce a few new results, provide an extensive reference list, and speculate about exciting areas of future research. 1.
Know your neighbors: Web spam detection using the web topology
 In Proceedings of SIGIR
, 2007
"... Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link ..."
Abstract

Cited by 75 (9 self)
 Add to MetaCart
Web spam can significantly deteriorate the quality of search engine results. Thus there is a large incentive for commercial search engines to detect spam pages efficiently and accurately. In this paper we present a spam detection system that uses the topology of the Web graph by exploiting the link dependencies among the Web pages, and the content of the pages themselves. We find that linked hosts tend to belong to the same class: either both are spam or both are nonspam. We demonstrate three methods of incorporating the Web graph topology into the predictions obtained by our base classifier: (i) clustering the host graph, and assigning the label of all hosts in the cluster by majority vote, (ii) propagating the predicted labels to neighboring hosts, and (iii) using the predicted labels of neighboring hosts as new features and retraining the classifier. The result is an accurate system for detecting Web spam that can be applied in practice to largescale Web data.
Towards scaling fully personalized PageRank
 In Proceedings of the 3rd Workshop on Algorithms and Models for the WebGraph (WAW
, 2004
"... Abstract Personalized PageRank expresses backlinkbased page quality around userselected pages in a similar way as PageRank expresses quality over the entire Web. Existing personalized PageRank algorithms can however serve online queries only for a restricted choice of page selection. In this pape ..."
Abstract

Cited by 75 (2 self)
 Add to MetaCart
Abstract Personalized PageRank expresses backlinkbased page quality around userselected pages in a similar way as PageRank expresses quality over the entire Web. Existing personalized PageRank algorithms can however serve online queries only for a restricted choice of page selection. In this paper we achieve full personalization by a novel algorithm that computes a compact database of simulated random walks; this database can serve arbitrary personal choices of small subsets of web pages. We prove that for a fixed error probability, the size of our database is linear in the number of web pages. We justify our estimation approach by asymptotic worstcase lower bounds; we show that exact personalized PageRank values can only be obtained from a database of quadratic size. 1
A survey on pagerank computing
 Internet Mathematics
, 2005
"... Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. T ..."
Abstract

Cited by 73 (0 self)
 Add to MetaCart
Abstract. This survey reviews the research related to PageRank computing. Components of a PageRank vector serve as authority weights for web pages independent of their textual content, solely based on the hyperlink structure of the web. PageRank is typically used as a web search ranking component. This defines the importance of the model and the data structures that underly PageRank processing. Computing even a single PageRank is a difficult computational task. Computing many PageRanks is a much more complex challenge. Recently, significant effort has been invested in building sets of personalized PageRank vectors. PageRank is also used in many diverse applications other than ranking. We are interested in the theoretical foundations of the PageRank formulation, in the acceleration of PageRank computing, in the effects of particular aspects of web graph structure on the optimal organization of computations, and in PageRank stability. We also review alternative models that lead to authority indices similar to PageRank and the role of such indices in applications other than web search. We also discuss linkbased search personalization and outline some aspects of PageRank infrastructure from associated measures of convergence to link preprocessing. 1.
A reference collection for Web spam
 SIGIR Forum
, 2006
"... We describe the WEBSPAMUK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by ..."
Abstract

Cited by 58 (13 self)
 Add to MetaCart
We describe the WEBSPAMUK2006 collection, a large set of Web pages that have been manually annotated with labels indicating if the hosts are include Web spam aspects or not. This is the first publicly available Web spam collection that includes page contents and links, and that has been labelled by a large and diverse set of judges. 1
Piccolo: Building Fast, Distributed Programs with Partitioned Tables
"... Piccolo is a new datacentric programming model for writing parallel inmemory applications in data centers. Unlike existing dataflow models, Piccolo allows computation running on different machines to share distributed, mutable state via a keyvalue table interface. Piccolo enables efficient appli ..."
Abstract

Cited by 44 (1 self)
 Add to MetaCart
Piccolo is a new datacentric programming model for writing parallel inmemory applications in data centers. Unlike existing dataflow models, Piccolo allows computation running on different machines to share distributed, mutable state via a keyvalue table interface. Piccolo enables efficient application implementations. In particular, applications can specify locality policies to exploit the locality of shared state access and Piccolo’s runtime automatically resolves writewrite conflicts using userdefined accumulation functions. Using Piccolo, we have implemented applications for several problem domains, including the PageRank algorithm, kmeans clustering and a distributed crawler. Experiments using 100 Amazon EC2 instances and a 12 machine cluster show Piccolo to be faster than existing data flow models for many problems, while providing similar faulttolerance guarantees and a convenient programming interface. 1
Efficient semistreaming algorithms for local triangle counting in massive graphs
 in KDD’08, 2008
"... In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph ha ..."
Abstract

Cited by 42 (4 self)
 Add to MetaCart
In this paper we study the problem of local triangle counting in large graphs. Namely, given a large graph G = (V, E) we want to estimate as accurately as possible the number of triangles incident to every node v ∈ V in the graph. The problem of computing the global number of triangles in a graph has been considered before, but to our knowledge this is the first paper that addresses the problem of local triangle counting with a focus on the efficiency issues arising in massive graphs. The distribution of the local number of triangles and the related local clustering coefficient can be used in many interesting applications. For example, we show that the measures we compute can help to detect the presence of spamming activity in largescale Web graphs, as well as to provide useful features to assess content quality in social networks. For computing the local number of triangles we propose two approximation algorithms, which are based on the idea of minwise independent permutations (Broder et al. 1998). Our algorithms operate in a semistreaming fashion, using O(V ) space in main memory and performing O(log V ) sequential scans over the edges of the graph. The first algorithm we describe in this paper also uses O(E) space in external memory during computation, while the second algorithm uses only main memory. We present the theoretical analysis as well as experimental results in massive graphs demonstrating the practical efficiency of our approach. Luca Becchetti was partially supported by EU Integrated
PageRank as a Function of the Damping Factor
, 2005
"... PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor # that spreads uniformly part of the rank. The choice of # is eminently empirical, and in most cases the original suggestion # = 0.85 ..."
Abstract

Cited by 41 (10 self)
 Add to MetaCart
PageRank is defined as the stationary state of a Markov chain. The chain is obtained by perturbing the transition matrix induced by a web graph with a damping factor # that spreads uniformly part of the rank. The choice of # is eminently empirical, and in most cases the original suggestion # = 0.85 by Brin and Page is still used. Recently, however, the behaviour of PageRank with respect to changes in # was discovered to be useful in linkspam detection [21]. Moreover, an analytical justification of the value chosen for # is still missing. In this paper, we give the first mathematical analysis of PageRank when # changes. In particular, we show that, contrarily to popular belief, for realworld graphs values of # close to 1 do not give a more meaningful ranking. Then, we give closedform formulae for PageRank derivatives of any order, and an extension of the Power Method that approximates them with convergence O for the kth derivative. Finally, we show a tight connection between iterated computation and analytical behaviour by proving that the kth iteration of the Power Method gives exactly the PageRank value obtained using a Maclaurin polynomial of degree k. The latter result paves the way towards the application of analytical methods to the study of PageRank.
Graph summarization with bounded error
 In SIGMOD 2008: Proceedings of the 2008 ACM SIGMOD International Conference on Management of data
, 2008
"... We propose a highly compact twopart representation of a given graph G consisting of a graph summary and a set of corrections. The graph summary is an aggregate graph in which each node corresponds to a set of nodes in G, and each edge represents the edges between all pair of nodes in the two sets. ..."
Abstract

Cited by 41 (6 self)
 Add to MetaCart
We propose a highly compact twopart representation of a given graph G consisting of a graph summary and a set of corrections. The graph summary is an aggregate graph in which each node corresponds to a set of nodes in G, and each edge represents the edges between all pair of nodes in the two sets. On the other hand, the corrections portion specifies the list of edgecorrections that should be applied to the summary to recreate G. Our representations allow for both lossless and lossy graph compression with bounds on the introduced error. Further, in combination with the MDL principle, they yield highly intuitive coarselevel summaries of the input graph G. We develop algorithms to construct highly compressed graph representations with small sizes and guaranteed accuracy, and validate our approach through an extensive set of experiments with multiple reallife graph data sets. To the best of our knowledge, this is the first work to compute graph summaries using the MDL principle, and use the summaries (along with corrections) to compress graphs with bounded error.
Efficient Aggregation for Graph Summarization
"... Graphs are widely used to model real world objects and their relationships, and large graph datasets are common in many application domains. To understand the underlying characteristics of large graphs, graph summarization techniques are critical. However, existing graph summarization methods are mo ..."
Abstract

Cited by 39 (3 self)
 Add to MetaCart
Graphs are widely used to model real world objects and their relationships, and large graph datasets are common in many application domains. To understand the underlying characteristics of large graphs, graph summarization techniques are critical. However, existing graph summarization methods are mostly statistical (studying statistics such as degree distributions, hopplots and clustering coefficients). These statistical methods are very useful, but the resolutions of the summaries are hard to control. In this paper, we introduce two databasestyle operations to summarize graphs. Like the OLAPstyle aggregation methods that allow users to drilldown or rollup to control the resolution of summarization, our methods provide an analogous functionality for large graph datasets. The first operation, called SNAP, produces a summary graph by grouping nodes based on userselected node attributes and relationships. The second operation, called kSNAP, further allows users to control the resolutions of summaries and provides the “drilldown ” and “rollup ” abilities to navigate through summaries with different resolutions. We propose an efficient algorithm to evaluate the SNAP operation. In addition, we prove that the kSNAP computation is NPcomplete. We propose two heuristic methods to approximate the kSNAP results. Through extensive experiments on a variety of real and synthetic datasets, we demonstrate the effectiveness and efficiency of the proposed methods.