Exploiting the Block Structure of the Web for Computing PageRank, 2003
Cited by 106 (5 self)
The web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not still link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank with a 3-stage algorithm: (1) the local PageRanks of the pages on each host are computed independently using the link structure of that host; (2) these local PageRanks are weighted by the "importance" of the corresponding host; and (3) the standard PageRank algorithm is run using the weighted concatenation of the local PageRanks as its starting vector. Empirically, this algorithm speeds up the computation of PageRank by a factor of 2 in realistic scenarios. Further, we develop a variant of this algorithm that efficiently computes many different "personalized" PageRanks, and a variant that efficiently recomputes PageRank after node updates.
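The three stages described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`pagerank`, `blockrank`), the toy graph, the damping value 0.85, and the choice to estimate host importance by running PageRank on a host graph weighted by the count of cross-host links are all assumptions made for illustration.

```python
def pagerank(nodes, links, damping=0.85, start=None, tol=1e-10, max_iter=1000):
    """Plain power iteration. Returns (rank dict, iterations used).

    Links that point outside `nodes` are ignored, so the same routine
    can compute a local PageRank restricted to one host.
    """
    nodes = list(nodes)
    node_set = set(nodes)
    n = len(nodes)
    rank = dict(start) if start else {u: 1.0 / n for u in nodes}
    for iteration in range(1, max_iter + 1):
        nxt = {u: 0.0 for u in nodes}
        dangling = 0.0  # rank held by nodes with no usable out-links
        for u in nodes:
            outs = [v for v in links.get(u, []) if v in node_set]
            if outs:
                share = rank[u] / len(outs)
                for v in outs:
                    nxt[v] += share
            else:
                dangling += rank[u]
        for u in nodes:
            nxt[u] = damping * (nxt[u] + dangling / n) + (1 - damping) / n
        delta = sum(abs(nxt[u] - rank[u]) for u in nodes)
        rank = nxt
        if delta < tol:
            return rank, iteration
    return rank, max_iter


def blockrank(links, host_of, damping=0.85):
    """Three-stage sketch: local ranks, host weights, then global PageRank."""
    pages = list(host_of)
    hosts = {}
    for p in pages:
        hosts.setdefault(host_of[p], []).append(p)

    # Stage 1: local PageRank of each host, using only that host's links.
    local = {}
    for members in hosts.values():
        ranks, _ = pagerank(members, links, damping)
        local.update(ranks)

    # Stage 2 (assumed simplification): host importance is PageRank on a
    # host graph with one edge per cross-host hyperlink.
    host_links = {h: [] for h in hosts}
    for u in pages:
        for v in links.get(u, []):
            if host_of[v] != host_of[u]:
                host_links[host_of[u]].append(host_of[v])
    host_rank, _ = pagerank(list(hosts), host_links, damping)

    # Stage 3: standard PageRank started from the weighted concatenation.
    start = {p: local[p] * host_rank[host_of[p]] for p in pages}
    return pagerank(pages, links, damping, start=start)


# Hypothetical toy graph: two hosts, mostly intra-host links.
toy_links = {
    "a.com/1": ["a.com/2", "a.com/3"],
    "a.com/2": ["a.com/1"],
    "a.com/3": ["a.com/1", "b.com/1"],
    "b.com/1": ["b.com/2"],
    "b.com/2": ["b.com/1", "a.com/1"],
}
toy_hosts = {page: page.split("/")[0] for page in toy_links}

block_ranks, block_iters = blockrank(toy_links, toy_hosts)
plain_ranks, plain_iters = pagerank(list(toy_hosts), toy_links)
```

Both runs converge to the same ranking, because the starting vector changes only how many iterations the final stage needs, not its fixed point. Whether the block-structured start actually saves iterations depends on the graph; the toy example is too small to say anything about the factor-of-2 speedup reported in the abstract.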
ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval - In WebDB, 2003
Cited by 86 (3 self)
this paper appears in [15], and updated information is available at http://cis.poly.edu/westlab/odissea/
Effective Page Refresh Policies for Web Crawlers - ACM Transactions on Database Systems, 2003
Cited by 50 (3 self)
In this paper we study how to keep local copies of remote data sources "fresh" when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so we need to poll the sources periodically to keep the copies up to date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up to date. This paper
OverCite: A Cooperative Digital Research Library, 2005
Cited by 24 (9 self)
CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a design for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture harnesses donated resources at many sites to provide document search and retrieval service to researchers worldwide. A preliminary evaluation of an initial OverCite prototype shows that it can service more queries per second than a centralized system, and that it increases total storage capacity by a factor of n/4 in a system of n nodes. OverCite can exploit these additional resources by supporting new features such as document alerts, and by scaling to larger data sets.
Balancing Volume, Quality and Freshness in Web Crawling - In Soft Computing Systems - Design, Management and Applications, 2002
Cited by 18 (11 self)
We describe crawling software designed for high-performance, large-scale information discovery and gathering on the Web. This crawler allows the administrator to seek a balance between the volume of a Web collection and its freshness, and also provides flexibility for defining a quality metric to prioritize certain pages.
Effective Web Crawling, 2004
Cited by 17 (2 self)
The key factors for the success of the World Wide Web are its large size and the lack of a centralized control over its contents. Both issues are also the most important source of problems for locating information. The Web is a context in which traditional Information Retrieval methods are challenged, and given the volume of the Web and its speed of change, the coverage of modern search engines is relatively small. Moreover, the distribution of quality is very skewed, and interesting pages are scarce in comparison with the rest of the content. Web crawling is the process used by search engines to collect pages from the Web. This thesis studies Web crawling at several different levels, ranging from the long-term goal of crawling important pages first, to the short-term goal of using the network connectivity efficiently, including implementation issues that are essential for crawling in practice. We start by designing a new model and architecture for a Web crawler that tightly integrates the crawler with the rest of the search engine, providing access to the metadata and links of the documents that can be used to guide the crawling process effectively. We implement this design in the WIRE project as an efficient Web crawler that provides an experimental framework for this research. In fact, we have used our crawler to
Crawling the infinite Web: five levels are enough - In Proceedings of the third Workshop on Web Graphs (WAW), 2004
Cited by 16 (4 self)
A large number of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites that can create arbitrarily many pages. In this article, several probabilistic models for browsing "infinite" Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 "clicks" away from the start page, to reach 90% of the pages that users actually visit.
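To make the depth question concrete, here is a deliberately simple Python sketch. It is an assumed toy model, not one of the models proposed in the paper: a visitor starts at depth 0 and, after each page view, goes one level deeper with a fixed probability `q` or leaves the site. The function name `min_depth` and the parameter values are hypothetical.

```python
def min_depth(q, target=0.9, max_depth=1000):
    """Smallest depth k such that page views at depths 0..k make up at
    least `target` of all page views, under the assumed toy model where
    the share of views at depth d is (1 - q) * q**d.
    """
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    covered = 0.0
    for depth in range(max_depth + 1):
        covered += (1.0 - q) * q ** depth
        if covered >= target:
            return depth
    raise ValueError("target not reached within max_depth")


# With an assumed continue-probability of 0.5, the views at depths
# 0..3 add up to 1 - 0.5**4 = 0.9375 of all page views.
depth_for_half = min_depth(0.5, target=0.9)
```

Under this toy model, coverage after depth k is 1 - q^(k+1), so the required depth grows only logarithmically as the target approaches 1. The paper's own models and their fit to real page-view data should be consulted for the actual analysis.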
Efficient URL Caching for World Wide Web Crawling - In Proceedings of the twelfth international conference on World Wide Web (WWW2003), 2003
Cited by 14 (1 self)
Crawling the web is deceptively simple: the basic algorithm is (a) fetch a page, (b) parse it to extract all linked URLs, and (c) for all the URLs not seen before, repeat (a)-(c). However, the size of the web (estimated at over 4 billion pages) and its rate of change (estimated at 7% per week) move this plan from a trivial programming exercise to a serious algorithmic and system design challenge. Indeed, these two factors alone imply that for a reasonably fresh and complete crawl of the web, step (a) must be executed about a thousand times per second, and thus the membership test (c) must be done well over ten thousand times per second against a set too large to store in main memory. This requires a distributed architecture, which further complicates the membership test.
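The basic algorithm (a)-(c) from this abstract can be sketched in a few lines of Python. This is an illustrative single-machine version with an in-memory "seen" set, which is exactly the part the abstract says does not scale; the function name `crawl`, its parameters, and the toy in-memory web are assumptions, and no network access is performed.

```python
from collections import deque


def crawl(seed, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl from `seed`.

    `fetch(url)` returns a page and `extract_links(page)` returns the
    URLs it links to; both are supplied by the caller. Returns the URLs
    in the order they were fetched.
    """
    seen = {seed}              # membership test for step (c)
    frontier = deque([seed])   # URLs discovered but not yet fetched
    fetched = []
    while frontier and len(fetched) < max_pages:
        url = frontier.popleft()
        page = fetch(url)                  # (a) fetch a page
        fetched.append(url)
        for link in extract_links(page):   # (b) extract linked URLs
            if link not in seen:           # (c) only URLs not seen before
                seen.add(link)
                frontier.append(link)
    return fetched


# Hypothetical toy web: each "page" is simply its list of out-links.
toy_web = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": [],
}

crawl_order = crawl("a", fetch=toy_web.__getitem__, extract_links=lambda page: page)
```

Every extracted link triggers one lookup in `seen`, which is why the membership test runs far more often than the fetch step. A real crawler would replace the in-memory set with a cache in front of a disk-based or distributed structure.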
Distributed page ranking in structured p2p networks - In ICPP, 2003
Cited by 14 (0 self)
This paper discusses techniques for performing distributed page ranking on top of structured peer-to-peer networks. Distributed page ranking is needed because the size of the web grows at a remarkable speed and centralized page ranking is not scalable. Open System PageRank is presented in this paper, based on the traditional PageRank used by Google. We then propose some distributed page ranking algorithms, partially prove their convergence, and discuss some of their interesting properties. Indirect transmission is introduced in this paper to reduce communication overhead between page rankers and to achieve scalable communication. The relationship between convergence time and bandwidth consumed is also discussed. Finally, we verify some of the discussions by experiments based on real datasets.

