Results 1 - 10
of
188
The PageRank Citation Ranking: Bringing Order to the Web
- Stanford InfoLab
, 1999
"... The importance of a Web page is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively ..."
Abstract
-
Cited by 1600 (1 self)
- Add to MetaCart
The importance of a Web page is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them. We compare PageRank to an idealized random Web surfer. We show how to efficiently compute PageRank for large numbers of pages. And, we show how to apply PageRank to search and to user navigation.
Focused crawling: a new approach to topic-specific Web resource discovery
, 1999
"... The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevan ..."
Abstract
-
Cited by 411 (8 self)
- Add to MetaCart
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
The Evolution of the Web and Implications for an Incremental Crawler
, 1999
"... In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of the collecti ..."
Abstract
-
Cited by 182 (16 self)
- Add to MetaCart
In this paper we study how to build an effective incremental crawler. The crawler selectively and incrementally updates its index and/or local collection of web pages, instead of periodically refreshing the collection in batch mode. The incremental crawler can improve the "freshness" of the collection significantly and bring in new pages in a more timely manner. We first present results from an experiment conducted on more than half million web pages over 4 months, to estimate how web pages evolve over time. Based on these experimental results, we compare various design choices for an incremental crawler and discuss their trade-offs. We propose an architecture for the incremental crawler, which combines the best design choices.
Crawling the Hidden Web
- In VLDB
, 2001
"... Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. ..."
Abstract
-
Cited by 173 (2 self)
- Add to MetaCart
Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration.
Automating the Construction of Internet Portals with Machine Learning
- Information Retrieval
, 2000
"... Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible ..."
Abstract
-
Cited by 141 (3 self)
- Add to MetaCart
Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are ...
Mercator: A scalable, extensible web crawler
- Word Wide Web
, 1999
"... This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on alte ..."
Abstract
-
Cited by 121 (5 self)
- Add to MetaCart
This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published. 1
Topical Locality in the Web
- In Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (SIGIR 2000
, 2000
"... Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describe the pages to which they point, is the foundation for a usable WorldWide Web. In this paper, we examine to what extent these ideas hold by empi ..."
Abstract
-
Cited by 108 (8 self)
- Add to MetaCart
Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describe the pages to which they point, is the foundation for a usable WorldWide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find that the likelihood of linked pages having similar textual content to be high; the similarity of sibling pages increases when the links from the parent are close together; titles, descriptions, and anchor text represent at least part of the target page; and that anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.
Searching the Web
- ACM TRANSACTIONS ON INTERNET TECHNOLOGY
, 2001
"... We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and im ..."
Abstract
-
Cited by 108 (1 self)
- Add to MetaCart
We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance. The most common design and implementation techniques for each of these components are presented. For this presentation we draw from the literature and from our own experimental search engine testbed. Emphasis is on introducing the fundamental concepts and the results of several performance analyses we conducted to compare different designs.
Small-World Phenomena and the Dynamics of Information
- In Advances in Neural Information Processing Systems (NIPS) 14
, 2001
"... Introduction The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized ap ..."
Abstract
-
Cited by 102 (5 self)
- Add to MetaCart
Introduction The problem of searching for information in networks like the World Wide Web can be approached in a variety of ways, ranging from centralized indexing schemes to decentralized mechanisms that navigate the underlying network without knowledge of its global structure. The decentralized approach appears in a variety of settings: in the behavior of users browsing the Web by following hyperlinks; in the design of focused crawlers [4, 5, 8] and other agents that explore the Web's links to gather information; and in the search protocols underlying decentralized peer-to-peer systems such as Gnutella [10], Freenet [7], and recent research prototypes [21, 22, 23], through which users can share resources without a central server. In recent work, we have been investigating the problem of decentralized search in large information networks [14, 15]. Our initial motivation was an experiment that dealt directly with the search problem in a decidedly pre-Internet context: Stanley Milgram
Context in Web Search
- IEEE Data Engineering Bulletin
, 2000
"... Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. Nextgeneration search engines will make increasing use of context information, either by using explicit or implici ..."
Abstract
-
Cited by 100 (0 self)
- Add to MetaCart
Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. Nextgeneration search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. Greater use of context in web search may help increase competition and diversity on the web.

