## Do your worst to make the best: Paradoxical effects in PageRank incremental computations (2004)

### Download Links

- [vigna.dsi.unimi.it]
- [vigna.di.unimi.it]
- [cs.wellesley.edu]
- DBLP

### Other Repositories/Bibliography

Venue: Proceedings of the third Workshop on Web Graphs (WAW), volume 3243 of Lecture Notes in Computer Science

Citations: 14 (1 self)

### BibTeX

```bibtex
@inproceedings{Boldi04doyour,
  author    = {Paolo Boldi and Massimo Santini and Sebastiano Vigna},
  title     = {Do your worst to make the best: Paradoxical effects in {PageRank} incremental computations},
  booktitle = {Proceedings of the third Workshop on Web Graphs (WAW)},
  series    = {Lecture Notes in Computer Science},
  volume    = {3243},
  pages     = {168--180},
  publisher = {Springer},
  year      = {2004}
}
```

### Abstract

Deciding which kind of visit accumulates high-quality pages more quickly is one of the most often debated issues in the design of web crawlers. It is known that breadth-first visits work well, as they tend to discover pages with high PageRank early in the crawl. Indeed, this visit order is much better than depth-first, which is in turn even worse than a random visit; nevertheless, breadth-first can be outperformed by an omniscient visit that chooses, at every step, the node of highest PageRank in the frontier. This paper discusses a related, and previously overlooked, measure of effectiveness for crawl strategies: whether the graph obtained after a partial visit is in some sense representative of the underlying web graph as far as the computation of PageRank is concerned. More precisely, we are interested in determining how rapidly the computation of PageRank over the visited subgraph yields relative ranks that agree with the ones the nodes have in the complete graph; ranks are compared using Kendall’s τ. We describe a number of large-scale experiments that show the following paradoxical effect: visits that gather PageRank more quickly (e.g., highest-quality-first) are also those that tend to miscalculate PageRank. Finally, we perform the same kind of experimental analysis on some synthetic random graphs, generated using well-known web-graph models: the results are almost opposite to those obtained on real web graphs.
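The measurement the abstract describes can be sketched end to end: compute PageRank on the full graph and on the visited subgraph, then compare the induced ranks with Kendall's τ over the visited nodes. The toy graph, the visit prefix, and the iteration count below are illustrative assumptions, not the paper's data.

```python
from itertools import combinations

def pagerank(g, alpha=0.85, iters=60):
    """Plain power iteration; rank of dangling nodes is spread uniformly."""
    n = len(g)
    r = {u: 1.0 / n for u in g}
    for _ in range(iters):
        nr = {u: (1 - alpha) / n for u in g}
        for u, succ in g.items():
            if succ:
                for v in succ:
                    nr[v] += alpha * r[u] / len(succ)
            else:
                for v in g:
                    nr[v] += alpha * r[u] / n
        r = nr
    return r

def tau(xs, ys):
    """Brute-force Kendall tau over all index pairs (tie pairs count in neither side)."""
    pairs = list(combinations(range(len(xs)), 2))
    c = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) > 0)
    d = sum(1 for i, j in pairs if (xs[i] - xs[j]) * (ys[i] - ys[j]) < 0)
    return (c - d) / len(pairs)

# hypothetical toy web graph and a partial visit that has not reached "e"
full = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["a"], "e": ["a"]}
visited = {"a", "b", "c", "d"}
sub = {u: [v for v in full[u] if v in visited] for u in visited}

full_pr = pagerank(full)   # ranks in the complete graph
sub_pr = pagerank(sub)     # ranks the crawler would compute so far
nodes = sorted(visited)
t = tau([full_pr[u] for u in nodes], [sub_pr[u] for u in nodes])
```

In the paper's experiments the subgraph is the one seen by the crawler after a given number of page fetches under each strategy; here a single hypothetical partial visit stands in for it.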

### Citations

3249 | The anatomy of a large-scale hypertextual web search engine
- Brin, Page
- 1998
Citation Context: ...ea of using bubble sort appears in Kendall’s original paper [19]), but you cannot expect to run bubble sort on a graph with 100 million nodes. 3 Measuring the Quality of a Page with PageRank PageRank [7] is one of the best-known methods to measure page quality; it is a static method (in that it does not depend on a specific query but rather it measures the absolute authoritativeness of each page), an...

2136 | The PageRank citation ranking: Bringing order to the Web
- Page, Brin, et al.
- 1998
Citation Context: ...ecause, as noted in [11], even after crawling well over a billion pages, the number of uncrawled pages still far exceeds the number of crawled pages. 1 Usually, a basic measure of quality is PageRank [28], in one of its many variants. Hence, as a first step we can compare two strategies by looking at how fast the cumulative PageRank (i.e., the sum of the PageRanks of all the visited pages up to a cert...

325 | Rank Aggregation Methods for the Web
- Dwork, Kumar, et al.
- 2001
Citation Context: ...idely used and intuitive is Kendall’s τ; this classical nonparametric correlation index has recently received much attention within the web community for its possible applications to rank aggregation [13, 12, 10] and for determining the convergence speed in the computation of PageRank [17]. Kendall’s τ is usually defined as follows 3 : Definition 1 ([20], pages 34–36) Let ri, si ∈ R (i = 1, 2, . . . , n) be t...

290 | Efficient crawling through url ordering
- Cho, Garcia-Molina, et al.
- 1998
Citation Context: ... obtained on real web graphs. 1 Introduction The search for the better crawling strategy (i.e., a strategy that gathers early pages of high quality) is by now an almost old research topic (see, e.g., [8]). Being able to collect quickly high-quality pages is one of the major design goals of a crawler; this issue is particularly important, because, as noted in [11], even after crawling well over a bill...

277 | Rank Correlation Methods
- Kendall
- 1975
Citation Context: ...r its possible applications to rank aggregation [13, 12, 10] and for determining the convergence speed in the computation of PageRank [17]. Kendall’s τ is usually defined as follows 3 : Definition 1 ([20], pages 34–36) Let ri, si ∈ R (i = 1, 2, . . . , n) be two rankings. Given a pair of distinct indices 1 ≤ i, j ≤ n, we say that the pair is: • concordant iff ri − rj and si − sj are both nonzero and h...
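The definition excerpted above translates directly into a brute-force O(n²) computation. The sketch below handles only the tie-free case (the tie-corrected τ_b normalization is omitted):

```python
def kendall_tau(r, s):
    """Brute-force O(n^2) Kendall tau from the concordant/discordant
    pair counts of Definition 1; ties are assumed absent."""
    n = len(r)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (r[i] - r[j]) * (s[i] - s[j])
            if prod > 0:       # differences have the same sign: concordant
                concordant += 1
            elif prod < 0:     # opposite signs: discordant
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0: identical orders
print(kendall_tau([1, 2, 3, 4], [4, 3, 2, 1]))  # -1.0: reversed orders
```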

214 | Stochastic models for the web graph
- Kumar, Raghavan, et al.
Citation Context: ...; of course, this order must be restricted to the nodes of the subgraph. In this paper, we use Kendall’s τ as a measure of concordance between the two ranked lists (a similar approach was followed in [22] to motivate the usefulness of a hierarchical algorithm to compute PageRank). The results we obtain are paradoxical: many strategies that accumulate PageRank quickly explore subgraphs with badly corre...

193 | Comparing top k lists
- Fagin, Kumar, et al.
- 2003
Citation Context: ...idely used and intuitive is Kendall’s τ; this classical nonparametric correlation index has recently received much attention within the web community for its possible applications to rank aggregation [13, 12, 10] and for determining the convergence speed in the computation of PageRank [17]. Kendall’s τ is usually defined as follows 3 : Definition 1 ([20], pages 34–36) Let ri, si ∈ R (i = 1, 2, . . . , n) be t...

183 | A New Measure of Rank Correlation
- Kendall
- 1938
Citation Context: ...ily calculated using a bubble sort: you just need two exchanges to move 0 to the front, and then one to move 1 to its final position (the idea of using bubble sort appears in Kendall’s original paper [19]), but you cannot expect to run bubble sort on a graph with 100 million nodes. 3 Measuring the Quality of a Page with PageRank PageRank [7] is one of the best-known methods to measure page quality; it...

160 | The Web Graph Framework I: Compression Techniques
- Boldi, Vigna
- 2004
Citation Context: ...ultiprocessor servers (the latter were essential in handling the largest samples). Overall, we used about 1600 hours of CPU user time. Tools. All our tests were performed using the WebGraph framework [6, 5]. Both the Italian and the WebBase graphs are available at http://webgraph-data.dsi.unimi.it/ (recall, however, that we only used the portion of WebBase that can be reached from the giant component). ...

144 | Deeper Inside PageRank
- Langville, Meyer
Citation Context: ...othing to do with the convergence speed of PageRank, but rather with its tolerance to graph modifications, or, if you prefer, with its stability. There is a rich stream of research (see, for example, [24, 27, 1, 3, 23, 25]) concerning the robustness of PageRank with respect to graph modifications (node and/or link perturbation, deletion and insertion). Many authors observed, in particular, that PageRank is quite stable...

133 | Extrapolation methods for accelerating pagerank computations
- Kamvar, Haveliwala, et al.
- 2003
Citation Context: ...x has recently received much attention within the web community for its possible applications to rank aggregation [13, 12, 10] and for determining the convergence speed in the computation of PageRank [17]. Kendall’s τ is usually defined as follows 3 : Definition 1 ([20], pages 34–36) Let ri, si ∈ R (i = 1, 2, . . . , n) be two rankings. Given a pair of distinct indices 1 ≤ i, j ≤ n, we say that the pa...

132 | Efficient Computation of PageRank
- Haveliwala
- 1999
Citation Context: ...ctor r satisfying Ar = r. The PageRank of a node x, PRG(x), is exactly rx. The computation of PageRank is a rather straightforward task that can be accomplished with standard tools of linear algebra [14, 17]. The damping factor α influences both the convergence speed and the results obtained, but it is by now customary to choose α = 0.85. 4 Using Kendall’s τ to Contrast Crawl Strategies The rank assigned...
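One standard way to obtain the vector r satisfying Ar = r is power iteration, with α = 0.85 as the excerpt notes. The sketch below is a generic implementation under that reading; the toy graph, tolerance, and the uniform redistribution of dangling-node rank are illustrative conventions, not the paper's exact setup.

```python
def pagerank(out_links, alpha=0.85, tol=1e-10, max_iter=200):
    """Power iteration for the PageRank vector; the rank of dangling
    nodes (no out-links) is redistributed uniformly."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(max_iter):
        new = {u: 0.0 for u in nodes}
        dangling = 0.0
        for u in nodes:
            succ = out_links[u]
            if succ:
                share = rank[u] / len(succ)
                for v in succ:
                    new[v] += share
            else:
                dangling += rank[u]
        for u in nodes:
            new[u] = (1 - alpha) / n + alpha * (new[u] + dangling / n)
        delta = sum(abs(new[u] - rank[u]) for u in nodes)
        rank = new
        if delta < tol:   # stop once the L1 change is negligible
            break
    return rank

# toy graph: three pages point to "a", which links back to "b"
g = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a"]}
r = pagerank(g)  # "a" ends up with the highest rank
```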

129 | Exploiting the block structure of the web for computing PageRank
- Kamvar, Haveliwala, et al.
- 2003
Citation Context: ...se growth is quite steady. Our explanation for this fact is that, at the beginning of a visit, the crawl tends to remain confined within a single website (or a set of websites). Indeed, a recent work [16] makes an analysis quite similar to ours with a different purpose: to prove that PageRank computation can be made more efficient by computing local PageRanks first. The authors motivate their algorith...

106 | Stable algorithms for link analysis
- Ng, Zheng, et al.
- 2001
Citation Context: ...othing to do with the convergence speed of PageRank, but rather with its tolerance to graph modifications, or, if you prefer, with its stability. There is a rich stream of research (see, for example, [24, 27, 1, 3, 23, 25]) concerning the robustness of PageRank with respect to graph modifications (node and/or link perturbation, deletion and insertion). Many authors observed, in particular, that PageRank is quite stable...

101 | Breadth-first search crawling yields high-quality pages
- Najork, Wiener
- 2001
Citation Context: ...k in a growingly quicker way. This is to be expected, as the omniscient visit will point immediately to pages of high quality. The fact that breadth-first visit yields high-quality pages was noted in [26]. There is, however, another and also quite relevant problem, which has been previously overlooked in the literature: if we assume that the crawler has no previous knowledge of the web region it has t...

98 | UbiCrawler: A Scalable Fully Distributed Web Crawler
- Boldi, Codenotti, et al.
Citation Context: ...this strategy may also be adopted if we have already performed a crawl and so we have the (old) PageRank values of (at least some of the) pages. Both common sense and experiments (see, in particular, [4]) suggest that the above visits accumulate PageRank in a growingly quicker way. This is to be expected, as the omniscient visit will point immediately to pages of high quality. The fact that breadth-f...

96 | WebBase: A repository of web pages
- Hirai, Raghavan, et al.
Citation Context: ...napshot of the Italian domain .it, performed with UbiCrawler [4] (later on referred to as the Italian graph); • a 118,142,155-nodes graph obtained from the 2001 crawl performed by the WebBase crawler [15] (later on referred to as the WebBase graph); • a 10,000,000-nodes synthetic graph generated using the Copying Model proposed in [22], with suitable parameters and rewiring (later on referred to as th...

94 | Ranking the Web frontier
- Eiron, McCurley, et al.
- 2004
Citation Context: ... almost old research topic (see, e.g., [8]). Being able to collect quickly high-quality pages is one of the major design goals of a crawler; this issue is particularly important, because, as noted in [11], even after crawling well over a billion pages, the number of uncrawled pages still far exceeds the number of crawled pages. 1 Usually, a basic measure of quality is PageRank [28], in one of its many...

84 | Adaptive on-line page importance computation
- Abiteboul, Preda, et al.
- 2003
Citation Context: ...othing to do with the convergence speed of PageRank, but rather with its tolerance to graph modifications, or, if you prefer, with its stability. There is a rich stream of research (see, for example, [24, 27, 1, 3, 23, 25]) concerning the robustness of PageRank with respect to graph modifications (node and/or link perturbation, deletion and insertion). Many authors observed, in particular, that PageRank is quite stable...

74 | Inside PageRank
- Bianchini, Gori, et al.

53 | Searching the workplace web
- Fagin, Kumar, et al.
Citation Context: ...idely used and intuitive is Kendall’s τ; this classical nonparametric correlation index has recently received much attention within the web community for its possible applications to rank aggregation [13, 12, 10] and for determining the convergence speed in the computation of PageRank [17]. Kendall’s τ is usually defined as follows 3 : Definition 1 ([20], pages 34–36) Let ri, si ∈ R (i = 1, 2, . . . , n) be t...

25 | The WebGraph framework II: Codes for the World Wide Web
- Boldi, Vigna
- 2003
Citation Context: ...ultiprocessor servers (the latter were essential in handling the largest samples). Overall, we used about 1600 hours of CPU user time. Tools. All our tests were performed using the WebGraph framework [6, 5]. Both the Italian and the WebBase graphs are available at http://webgraph-data.dsi.unimi.it/ (recall, however, that we only used the portion of WebBase that can be reached from the giant component). ...

10 | A computer method for calculating Kendall’s tau with ungrouped data
- Knight
- 1966
Citation Context: ... so common, and there is a remarkable scarcity of literature on the subject of computing τ in an efficient way. Clearly, the brute-force O(n²) approach is easy to implement, but inefficient. Knight [21] presented in the sixties an O(n log n) algorithm for the computation of τ, but the only implementation we are aware of belongs to the SAS system. Because of the large size of our data, we decided to ...
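An O(n log n) computation in the spirit of Knight's algorithm can be sketched by sorting one ranking and counting, via merge sort, the inversions of the other; in the tie-free case τ = 1 − 4D/(n(n−1)), where D is the number of discordant pairs. This is a sketch of the idea, not Knight's full tie-handling procedure:

```python
def count_inversions(a):
    """Merge sort that returns (sorted list, number of inversions) --
    the exchanges a bubble sort would perform."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, inv_left = count_inversions(a[:mid])
    right, inv_right = count_inversions(a[mid:])
    merged, inv = [], inv_left + inv_right
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
            inv += len(left) - i  # right[j] is inverted with every remaining left element
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, inv

def kendall_tau_fast(r, s):
    """O(n log n) Kendall tau, tie-free case: sort by r, then count
    inversions among the corresponding s values."""
    n = len(r)
    s_by_r = [sv for _, sv in sorted(zip(r, s))]
    _, d = count_inversions(s_by_r)
    return 1 - 4 * d / (n * (n - 1))
```

With 100 million nodes, as the excerpt notes, this O(n log n) counting is the practical route where bubble sort is not.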

9 | Perturbation of the hyper-linked environment
- Lee, Borodin
- 2003

7 | Albert-László Barabási, and Hawoong Jeong
- Albert
- 1999
Citation Context: ... proposed in [22], with suitable parameters and rewiring (later on referred to as the copying-model graph); • a 10,000,000-nodes synthetic graph generated using the Evolving Network Model proposed in [2], with suitable parameters and rewiring (later on referred to as the evolving-model graph). The two synthetic graphs are much smaller than the others, because by their random nature there would be no ...

6 | Rank stability and rank similarity of link-based web ranking algorithms in authority-connected graphs
- Lempel, Moran

3 | A library of software tools for performing measures on large networks
- Donato, Laura, et al.
- 2004
Citation Context: ...able at http://webgraph-data.dsi.unimi.it/ (recall, however, that we only used the portion of WebBase that can be reached from the giant component). The two synthetic graphs were produced using COSIN [9].
[Figure: four panels of curves for the visit strategies BF, DF, RF, QF, IQF, RAND, LEX]