## Efficient and decentralized pagerank approximation in a peer-to-peer web search network (2006)

Venue: In VLDB, 2006

Citations: 17 (5 self)

### BibTeX

@INPROCEEDINGS{Parreira06efficientand,
  author = {Josiane Xavier Parreira and Debora Donato and Sebastian Michel and Gerhard Weikum},
  title = {Efficient and decentralized pagerank approximation in a peer-to-peer web search network},
  booktitle = {VLDB},
  year = {2006},
  pages = {415--426}
}

### Abstract

PageRank-style (PR) link analyses are a cornerstone of Web search engines and Web mining, but they are computationally expensive. Recently, various techniques have been proposed for speeding up these analyses by distributing the link graph among multiple sites. However, none of these advanced methods is suitable for a fully decentralized PR computation in a peer-to-peer (P2P) network with autonomous peers, where each peer can independently crawl Web fragments according to the user’s thematic interests. In such a setting the graph fragments that different peers have locally available or know about may arbitrarily overlap among peers, creating additional complexity for the PR computation. This paper presents the JXP algorithm for dynamically and collaboratively computing PR scores of Web pages that are arbitrarily distributed in a P2P network. The algorithm runs at every peer, and it works by combining locally computed PR scores with random meetings among the peers in the network. It is scalable as the number of peers on the network grows, and experiments as well as theoretical arguments show that JXP scores converge to the true PR scores that one would obtain by a centralized computation.
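For orientation, the centralized computation that the abstract uses as its reference point can be sketched as a plain power iteration. This is a minimal illustration, not the JXP algorithm itself: the function name `pagerank`, the damping factor of 0.85, the uniform handling of pages without out-links, and the toy graph are all assumptions made for this example and are not taken from the paper.

```python
def pagerank(links, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over a dict {page: [pages it links to]}.

    Hypothetical helper for illustration; parameter values are assumptions.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)

    # Start from the uniform distribution.
    scores = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Score mass held by pages without out-links is spread uniformly.
        dangling = sum(scores[p] for p in pages if not links.get(p))
        new = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes an equal share of its score to every page it links to.
        for p in pages:
            targets = links.get(p, [])
            if targets:
                share = damping * scores[p] / len(targets)
                for q in targets:
                    new[q] += share
        # Stop once the L1 change between iterations is negligible.
        delta = sum(abs(new[p] - scores[p]) for p in pages)
        scores = new
        if delta < tol:
            break
    return scores


# Toy graph: page "c" is linked from "a", "b", and "d".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = pagerank(graph)
```

In this toy graph the scores sum to one and page `c`, which receives the most in-links, ends up with the highest score. A decentralized method such as JXP would be judged by how closely each peer's local scores approach the output of a global computation of this kind.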

### Citations

3460 | Chord: A scalable peer-to-peer lookup service for internet applications
- Stoica, Morris, et al.
- 2001
(Show Context)
Citation Context ... is a compelling paradigm for large-scale file sharing, publish-subscribe, and collaborative work, as it provides great scalability and robustness to failures and very high dynamics (so-called churn) =-=[1, 38, 32, 33]-=-. Another intriguing P2P application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a numb... |

3249 | The anatomy of a large-scale hypertextual web search engine
- Brin, Page
- 1998
(Show Context)
Citation Context ... computation. 1. INTRODUCTION One of the cornerstones of Web search engines and Web mining is link analysis for authority scoring, most notably, the two seminal methods PageRank (PR) by Brin and Page =-=[8]-=- and HITS by Kleinberg [23]. Both methods are Eigenvector-based algorithms that determine the importance of a page based on the importance of the pages that point to it. Their computation is quite exp... |

2703 | Authoritative sources in a hyperlinked environment
- Kleinberg
- 1999
(Show Context)
Citation Context ...ION One of the cornerstones of Web search engines and Web mining is link analysis for authority scoring, most notably, the two seminal methods PageRank (PR) by Brin and Page [8] and HITS by Kleinberg =-=[23]-=-. Both methods are Eigenvector-based algorithms that determine the importance of a page based on the importance of the pages that point to it. Their computation is quite expensive as it involves itera... |

2689 | A scalable content addressable network
- Ratnasamy, Francis, et al.
- 2001
(Show Context)
Citation Context ... is a compelling paradigm for large-scale file sharing, publish-subscribe, and collaborative work, as it provides great scalability and robustness to failures and very high dynamics (so-called churn) =-=[1, 38, 32, 33]-=-. Another intriguing P2P application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a numb... |

1460 | Space/Time Trade-Offs in Hash Coding with Allowable Errors
- Bloom
- 1970
(Show Context)
Citation Context ...) = |SA ∩ SB|/|SB|. So containment represents the fraction of elements in SB that are also in SA. Fundamentals for statistical synopses of sets have a rich literature, including work on Bloom filters =-=[6, 18]-=-, hash sketches [19], and min-wise independent permutations [10]. In this paper we focus on the min-wise independent permutations (MIPs). The MIPs technique assumes that the set elements can be ordere... |

1416 | Pastry: scalable, decentralized object location, and routing for large-scale peer-to-peer systems
- Rowstron, Druschel
- 2001
(Show Context)
Citation Context ... is a compelling paradigm for large-scale file sharing, publish-subscribe, and collaborative work, as it provides great scalability and robustness to failures and very high dynamics (so-called churn) =-=[1, 38, 32, 33]-=-. Another intriguing P2P application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a numb... |

773 | Finite Markov chains
- Kemeny, Snell
- 1960
(Show Context)
Citation Context ...Links(GB, WB) 5: G′A ← (GA + WA) 6: LA ← combineLists(LA, LB) 7: PR ← PageRank(G′A) 8: update(LA) 9: Discard(GB, WB, LB) Our analysis builds on the theory of state aggregation in Markov chains =-=[16, 37, 29, 22]-=-. However, applying this theory to our setting is not straightforward at all, and we use it only for particular aspects. State-aggregation techniques assume complete knowledge of the Markov chain and ... |

687 | Summary cache: a scalable wide-area web cache sharing protocol
- Fan, Cao, et al.
(Show Context)
Citation Context ...) = |SA ∩ SB|/|SB|. So containment represents the fraction of elements in SB that are also in SA. Fundamentals for statistical synopses of sets have a rich literature, including work on Bloom filters =-=[6, 18]-=-, hash sketches [19], and min-wise independent permutations [10]. In this paper we focus on the min-wise independent permutations (MIPs). The MIPs technique assumes that the set elements can be ordere... |

580 | Introduction to the Numerical Solution of Markov Chains
- Stewart
- 1994
(Show Context)
Citation Context ...Links(GB, WB) 5: G′A ← (GA + WA) 6: LA ← combineLists(LA, LB) 7: PR ← PageRank(G′A) 8: update(LA) 9: Discard(GB, WB, LB) Our analysis builds on the theory of state aggregation in Markov chains =-=[16, 37, 29, 22]-=-. However, applying this theory to our setting is not straightforward at all, and we use it only for particular aspects. State-aggregation techniques assume complete knowledge of the Markov chain and ... |

355 | Matrix analysis and applied linear algebra
- Meyer, editor
- 2000
(Show Context)
Citation Context ...Links(GB, WB) 5: G′A ← (GA + WA) 6: LA ← combineLists(LA, LB) 7: PR ← PageRank(G′A) 8: update(LA) 9: Discard(GB, WB, LB) Our analysis builds on the theory of state aggregation in Markov chains =-=[16, 37, 29, 22]-=-. However, applying this theory to our setting is not straightforward at all, and we use it only for particular aspects. State-aggregation techniques assume complete knowledge of the Markov chain and ... |

341 | On the resemblance and containment of documents
- Broder
- 1997
(Show Context)
Citation Context ...overlap” and “containment”. Given two sets, SA and SB, the overlap between these two sets is defined as |SA ∩ SB|, i.e., the cardinality of the intersection. The notion of containment was proposed in =-=[9]-=- and is defined as Containment(SA, SB) = |SA ∩ SB|/|SB|. So containment represents the fraction of elements in SB that are also in SA. Fundamentals for statistical synopses of sets have a rich literat... |
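The two set measures defined in this excerpt can be computed exactly when both sets are available in full. The sketch below does just that. The function names and the example URL sets are hypothetical, and the exact computation shown here is only a reference point: the excerpt goes on to discuss statistical synopses, which estimate these quantities without exchanging the full sets.

```python
def overlap(sa, sb):
    """Overlap of two sets: the cardinality of their intersection."""
    return len(sa & sb)


def containment(sa, sb):
    """Containment(SA, SB) = |SA ∩ SB| / |SB|.

    This is the fraction of elements of sb that are also in sa.
    """
    if not sb:
        raise ValueError("containment is undefined for an empty second set")
    return len(sa & sb) / len(sb)


# Hypothetical page sets held by two peers.
pages_a = {"u1", "u2", "u3", "u4"}
pages_b = {"u3", "u4", "u5"}

shared = overlap(pages_a, pages_b)            # two pages in common
b_in_a = containment(pages_a, pages_b)        # 2 of B's 3 pages are in A
a_in_b = containment(pages_b, pages_a)        # 2 of A's 4 pages are in B
```

Note that containment is not symmetric: swapping the arguments changes the denominator, so the two peers generally see different values.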

340 | Probabilistic Counting Algorithms for Data Base Applications
- Flajolet, Martin
- 1985
(Show Context)
Citation Context ...containment represents the fraction of elements in SB that are also in SA. Fundamentals for statistical synopses of sets have a rich literature, including work on Bloom filters [6, 18], hash sketches =-=[19]-=-, and min-wise independent permutations [10]. In this paper we focus on the min-wise independent permutations (MIPs). The MIPs technique assumes that the set elements can be ordered (which is trivial ... |

255 | Mining the Web: Discovering Knowledge from Hypertext Data
- Chakrabarti
(Show Context)
Citation Context ...rted with the seminal works of Brin and Page [8] and Kleinberg [23], and after these, many other models and techniques have followed. Good surveys of the many improvements and variations are given in =-=[12, 26, 7, 5]-=-. 2.1 PageRank The basic idea of PR is that if page p has a link to page q then the author of p is implicitly endorsing q, i.e., giving some importance to page q. How much p contributes to the importa... |

245 | P-Grid: A self-organizing access structure for p2p information systems
- Aberer
- 2001
(Show Context)
Citation Context |

193 | Min-wise independent permutations
- Broder, Charikar, et al.
(Show Context)
Citation Context ...nts in SB that are also in SA. Fundamentals for statistical synopses of sets have a rich literature, including work on Bloom filters [6, 18], hash sketches [19], and min-wise independent permutations =-=[10]-=-. In this paper we focus on the min-wise independent permutations (MIPs). The MIPs technique assumes that the set elements can be ordered (which is trivial for integer keys, e.g., hash keys of URLs) a... |

193 | Comparing top k lists
- Fagin, Kumar, et al.
- 2003
(Show Context)
Citation Context ...be the average over its different scores. The total top-k ranking given by the JXP algorithm and the top-k ranking given by traditional, centralized PR are compared using Spearman’s footrule distance =-=[17]-=-, defined as $F(\sigma_1, \sigma_2) = \sum_{i=1}^{k} |\sigma_1(i) - \sigma_2(i)|$, where $\sigma_1(i)$ and $\sigma_2(i)$ are the positions of the page $i$ in the first and second ranking. In case a page is present in one of the top-k rankings and does ... |
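The footrule distance described in this excerpt can be sketched as follows. The excerpt is cut off before it states how a page that appears in only one of the two top-k lists is treated, so this example assumes such a page is placed at position k + 1 in the list that lacks it. That convention, the function name, and the sample rankings are assumptions made for illustration, and the sum here runs over all pages that occur in either list.

```python
def footrule_top_k(ranking1, ranking2):
    """Spearman's footrule distance between two top-k lists.

    Each ranking is a list of page ids, best first. Assumption: a page
    missing from one list is treated as if it were at position k + 1.
    """
    k = len(ranking1)
    if len(ranking2) != k:
        raise ValueError("both rankings must have the same length k")

    # Map each page to its 1-based position in each list.
    pos1 = {page: i for i, page in enumerate(ranking1, start=1)}
    pos2 = {page: i for i, page in enumerate(ranking2, start=1)}
    missing = k + 1

    # Sum the absolute position differences over every page seen in either list.
    pages = set(pos1) | set(pos2)
    return sum(abs(pos1.get(p, missing) - pos2.get(p, missing)) for p in pages)


same = footrule_top_k(["a", "b", "c"], ["a", "b", "c"])
swapped = footrule_top_k(["a", "b", "c"], ["b", "a", "d"])
```

Identical rankings give a distance of zero. In the second call, `a` and `b` each move by one position, and `c` and `d` each sit one position away from the assumed k + 1 slot, for a total of four.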

191 | Analysis of the evolution of peer-to-peer systems
- Liben-Nowell, Balakrishnan, et al.
- 2002
(Show Context)
Citation Context ... churn. But this applies also to other, conceptually simpler, properties of P2P systems in general, such as DHT performance guarantees or full correctness under particularly “nasty” failure scenarios =-=[28]-=-. On the positive side, JXP has been designed to handle high dynamics, and the algorithms themselves can easily cope with changes in the Web graph, repeated crawls, or peer churn. Extending the mathem... |

150 | Specifying Systems, The TLA+ Language and Tools for Hardware and Software Engineers
- Lamport
- 2002
(Show Context)
Citation Context ...o show liveness in the sense that JXP makes effective progress towards the true PR scores. The argument for this part is based on the notion of fairness from concurrent programming theory (see, e.g., =-=[24]-=-): a sequence of events is fair with respect to event e if every infinite sequence has an infinite number of e occurrences. In our setting, this requires that in an infinite number of P2P meetings, ev... |

144 | Deeper Inside PageRank
- Langville, Meyer
(Show Context)
Citation Context ...rted with the seminal works of Brin and Page [8] and Kleinberg [23], and after these, many other models and techniques have followed. Good surveys of the many improvements and variations are given in =-=[12, 26, 7, 5]-=-. 2.1 PageRank The basic idea of PR is that if page p has a link to page q then the author of p is implicitly endorsing q, i.e., giving some importance to page q. How much p contributes to the importa... |

132 | Efficient Computation of PageRank
- Haveliwala
- 1999
(Show Context)
Citation Context ...isher, ACM. VLDB’06, September 12–15, 2006, Seoul, Korea. Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09. Web. Recent work has made progress on efficiently computing PR-style authority scores =-=[20, 11, 14, 27]-=-, but the high storage demand of the – sparse but nonetheless huge – underlying matrix seems to limit this kind of link analysis to a central server with very large memory. Recently, various technique... |

129 | Exploiting the block structure of the web for computing PageRank
- Kamvar, Haveliwala, et al.
- 2003
(Show Context)
Citation Context ...is kind of link analysis to a central server with very large memory. Recently, various techniques have been proposed for speeding up these analyses by distributing the link graph among multiple sites =-=[21, 40, 2]-=-. In fact, given that Web data is originally distributed across many owner sites, it seems a much more natural (but obviously also more challenging) computational model to perform parts of the PR comp... |

112 | Decomposability: Queueing and Computer System Applications
- Courtois
- 1977
(Show Context)
Citation Context |

88 | ODISSEA: A Peer-to-Peer Architecture for Scalable Web
- Suel, Mathur, et al.
- 2003
(Show Context)
Citation Context ... application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a number of research projects =-=[39, 31, 4]-=- and could offer various key advantages: lighter load and smaller data volume per peer, and thus more computational resources per query and data unit, could enable more powerful linguistic or statisti... |

84 | Adaptive on-line page importance computation
- Abiteboul, Preda, et al.
- 2003
(Show Context)
Citation Context ...the stationary distribution inside the host and the stationary distribution inter-hosts. A storage-efficient approach to computing authority scores is the OPIC algorithm developed by Abiteboul et al. =-=[3]-=-. This method avoids having the entire link graph in one site, which, albeit sparse, is very large and usually exceeds the available main memory size. It does so by randomly (or otherwise fairly) visi... |

66 | A survey on pagerank computing
- Berkhin
- 2005
(Show Context)
Citation Context ...rted with the seminal works of Brin and Page [8] and Kleinberg [23], and after these, many other models and techniques have followed. Good surveys of the many improvements and variations are given in =-=[12, 26, 7, 5]-=-. 2.1 PageRank The basic idea of PR is that if page p has a link to page q then the author of p is implicitly endorsing q, i.e., giving some importance to page q. How much p contributes to the importa... |

42 | Computing pagerank in a distributed internet search system
- Wang, DeWitt
- 2004
(Show Context)
Citation Context ...is kind of link analysis to a central server with very large memory. Recently, various techniques have been proposed for speeding up these analyses by distributing the link graph among multiple sites =-=[21, 40, 2]-=-. In fact, given that Web data is originally distributed across many owner sites, it seems a much more natural (but obviously also more challenging) computational model to perform parts of the PR comp... |

38 | MINERVA: Collaborative P2P search
- Bender, Michel, et al.
- 2005
(Show Context)
Citation Context ... application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a number of research projects =-=[39, 31, 4]-=- and could offer various key advantages: lighter load and smaller data volume per peer, and thus more computational resources per query and data unit, could enable more powerful linguistic or statisti... |

35 | Efficient PageRank approximation via graph aggregation
- Broder, Lempel, et al.
- 2004
(Show Context)
Citation Context ...isher, ACM. VLDB’06, September 12–15, 2006, Seoul, Korea. Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09. Web. Recent work has made progress on efficiently computing PR-style authority scores =-=[20, 11, 14, 27]-=-, but the high storage demand of the – sparse but nonetheless huge – underlying matrix seems to limit this kind of link analysis to a central server with very large memory. Recently, various technique... |

34 | A Fast Two-stage Algorithm for Computing Page Rank
- Lee, H, et al.
- 2003
(Show Context)
Citation Context ...isher, ACM. VLDB’06, September 12–15, 2006, Seoul, Korea. Copyright 2006 VLDB Endowment, ACM 1-59593-385-9/06/09. Web. Recent work has made progress on efficiently computing PR-style authority scores =-=[20, 11, 14, 27]-=-, but the high storage demand of the – sparse but nonetheless huge – underlying matrix seems to limit this kind of link analysis to a central server with very large memory. Recently, various technique... |

28 | Markov chain sensitivity measured by mean first passage times, Linear Algebra and its Applications vol 316 number 1–3
- Cho, Meyer
- 2000
(Show Context)
Citation Context ...m 5.1. The JXP score of the world node, at every peer in the network, is monotonically non-increasing. Proof. The proof is based on the study of the sensitivity of Markov Chains made by Cho and Meyer =-=[15]-=-. From there we can state that by increasing $p_{wi}$ by $\delta$ and decreasing $p_{ww}$ by the same amount, the following holds: $\frac{\alpha_w^{t-1} - \alpha_w^{t}}{\alpha_w^{t-1}} = \alpha_w^{t} \, \delta \, m_{iw}$ (21), where $m_{iw}$ is the mean first passage time fro... |

27 | The BINGO! System for Information portal Generation and Expert
- Sizov, Biwer
- 2003
(Show Context)
Citation Context ... some of the original categories, so in the end we had a total of 10 categories (e.g., “computers”, “science”, etc.). The Web collection was obtained in January 2005, using the Bingo! focused crawler =-=[36]-=-. We first trained the crawler with a manually selected set of pages and after that, new pages were fetched and automatically classified into one of 10 pre-defined categories such as “sports”, “music”... |

22 | Link evolution: Analysis and algorithms
(Show Context)
Citation Context |

21 | Link analysis ranking: algorithms, theory, and experiments
- 2005
(Show Context)
Citation Context |

21 | Local methods for estimating pagerank values
- Chen, Gan, et al.
- 2004
(Show Context)
Citation Context ... strong constraint, given that in most P2P networks peers are completely autonomous and crawl and index Web data at their discretion, resulting in arbitrarily overlapping graph fragments. Chen et al. =-=[13]-=- proposed a way of approximating the PR value of a page locally, by expanding a small subgraph around the page of interest, placing an estimated PR at the boundary nodes of the subgraph, and running t... |

16 | Distributed page ranking in structured p2p networks
- Shi, Yu, et al.
- 2003
(Show Context)
Citation Context ...lgorithm in which the PR computation is performed at the network level, with peers constantly updating the scores of their local pages and sending these updated values through the network. Shi et al. =-=[35]-=- also compute PR at the network level, but they reduce the communication among peers by distributing the pages among the peers according to some load-sharing function. In contrast to these P2P-style a... |

14 | Updating the stationary vector of an irreducible Markov chain
- Langville, Meyer
- 2002
(Show Context)
Citation Context ...c P2P network. The JXP algorithm, on the other hand, requires much less interaction among peers, and with the new peer selection strategy, the number of interactions is even smaller. Other techniques =-=[25, 14]-=- for approximating PR-style authority scores with partial knowledge of the global graph use state-aggregation technique from the stationary analysis of large Markov chains. These techniques have been ... |

13 | Framework for Decentralized Ranking in Web Information Retrieval
- Aberer, Wu
- 2003
(Show Context)
Citation Context ...is kind of link analysis to a central server with very large memory. Recently, various techniques have been proposed for speeding up these analyses by distributing the link graph among multiple sites =-=[21, 40, 2]-=-. In fact, given that Web data is originally distributed across many owner sites, it seems a much more natural (but obviously also more challenging) computational model to perform parts of the PR comp... |

13 | Wayfinder: Navigating and sharing information in a decentralized world
- Peery, Cuenca-Acuna, et al.
- 2004
(Show Context)
Citation Context ... application could be Web search: spreading the functionality and data of a search engine across thousands or millions of peers. Such an architecture is being pursued in a number of research projects =-=[39, 31, 4]-=- and could offer various key advantages: lighter load and smaller data volume per peer, and thus more computational resources per query and data unit, could enable more powerful linguistic or statisti... |

7 | Using a layered markov model for distributed web ranking computation
- Wu, Aberer
- 2005
(Show Context)
Citation Context ...hority scores to each server in the network, based on the inter-server links, and then approximate global PR values by combining local page authority scores and server authority values. Wu and Aberer =-=[41]-=- pursue a similar approach based on a layered Markov model. Both of these approaches are in turn closely related to the work by Haveliwala et al. [21] that postulates a block structure of the link mat... |

6 | Pagerank computation and keyword search on distributed systems and P2P networks
- Sankaralingam, Yalamanchi, et al.
- 2003
(Show Context)
Citation Context ...upon each such visit. The bookkeeping for tracking the gradually approximated authority of all pages is carried out at a central site, the Web-warehouse server. This is not a P2P algorithm either. In =-=[34]-=-, Sankaralingam et al. presented a P2P algorithm in which the PR computation is performed at the network level, with peers constantly updating the scores of their local pages and sending these updated... |

5 | Jxp: Global authority scores in a p2p network
- Parreira, Weikum
- 2005
(Show Context)
Citation Context ...spread across many autonomous peers with arbitrary overlapping and the peers are a priori unaware of other peers’ fragments. The ideas for JXP have appeared in a preliminary short paper at a workshop =-=[30]-=-. The current paper elaborates these ideas, provides mathematical underpinnings, including a convergence proof (which were missing in the workshop paper), and develops novel extensions and run-time en... |