Results 1 - 10
of
83
Authoritative Sources in a Hyperlinked Environment
- JOURNAL OF THE ACM
, 1999
"... The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and repo ..."
Abstract
-
Cited by 2222 (9 self)
- Add to MetaCart
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of “authoritative ” information sources on such topics. We propose and test an algorithmic formulation of the notion of authority, based on the relationship between a set of relevant authoritative pages and the set of “hub pages ” that join them together in the link structure. Our formulation has connections to the eigenvectors of certain matrices associated with the link graph; these connections in turn motivate additional heuristics for link-based analysis.
Harvest: A Scalable, Customizable Discovery and Access System
, 1995
"... Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, buil ..."
Abstract
-
Cited by 159 (7 self)
- Add to MetaCart
Rapid growth in data volume, user base, and data diversity render Internet-accessible information increasingly difficult to use effectively. In this paper we introduce Harvest, a system that provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet. The system interoperates with WWW clients and with HTTP,FTP, Gopher, and NetNews information resources. We discuss the design and implementation of Harvest and its subsystems, give examples of its uses, and provide measurements indicating that Harvest can significantly reduce server load, network traffic, and space requirements when building indexes, compared with previous systems. We also discuss several popular indexes wehave built using Harvest, underscoring the customizability and scalability of the system.
Mercator: A scalable, extensible web crawler
- Word Wide Web
, 1999
"... This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on alte ..."
Abstract
-
Cited by 121 (5 self)
- Add to MetaCart
This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on alternatives and tradeoffs in their design, and describe the particular components used in Mercator. We also describe Mercator’s support for extensibility and customizability. Finally, we comment on Mercator’s performance, which we have found to be comparable to that of other crawlers for which performance numbers have been published. 1
Topical Locality in the Web
- In Proceedings of the 23rd Annual International Conference on Research and Development in Information Retrieval (SIGIR 2000
, 2000
"... Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describe the pages to which they point, is the foundation for a usable WorldWide Web. In this paper, we examine to what extent these ideas hold by empi ..."
Abstract
-
Cited by 108 (8 self)
- Add to MetaCart
Most web pages are linked to others with related content. This idea, combined with another that says that text in, and possibly around, HTML anchors describe the pages to which they point, is the foundation for a usable WorldWide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find that the likelihood of linked pages having similar textual content to be high; the similarity of sibling pages increases when the links from the parent are close together; titles, descriptions, and anchor text represent at least part of the target page; and that anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.
Effective Site Finding using Link Anchor Information
, 2001
"... Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main en ..."
Abstract
-
Cited by 108 (14 self)
- Add to MetaCart
Link-based ranking methods have been described in the literature and applied in commercial Web search engines. However, according to recent TREC experiments, they are no better than traditional content-based methods. We conduct a different type of experiment, in which the task is to find the main entry point of a specific Web site. In our experiments, ranking based on link anchor text is twice as effective as ranking based on document content, even though both methods used the same BM25 formula. We obtained these results using two sets of 100 queries on a 18.5 million docu- ment set and another set of 100 on a 0.4 million document set. This site finding effectiveness begins to explain why many search engines have adopted link methods. It also opens a rich new area for effectiveness improvement, where traditional methods fail.
Shmueli: W3QS: A query system for the World Wide Web
, 1995
"... The World-Wide Web (WWW) is an ever growing, distributed, non-administered, global information resource. It resides on the worldwide computer network and allows access to heterogeneous information: text, image, video, sound and graphic data. Currently, this wealth of information is difficult to mine ..."
Abstract
-
Cited by 73 (3 self)
- Add to MetaCart
The World-Wide Web (WWW) is an ever growing, distributed, non-administered, global information resource. It resides on the worldwide computer network and allows access to heterogeneous information: text, image, video, sound and graphic data. Currently, this wealth of information is difficult to mine. One can either manually, slowly and tediously navigate through the WWW or utilize indexes and libraries which are built by automatic search engines (called knowbots or robots). We have designed and are now implementing a high level SQL-like language to support effective and flexible query processing, which addresses the structure and content of WWW nodes and their varied sorts of data. Query results are intuitively presented and continuously maintained when desired. The language itself integrates new utilities and existing Unix tools (e.g. grep, awk). The implementation strategy is to employ existing WWW browsers and Unix tools to the extent possible. 1
Parallel crawlers
- In Proceedings of the 11th international conference on World Wide Web
, 2002
"... In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and ident ..."
Abstract
-
Cited by 71 (3 self)
- Add to MetaCart
In this paper we study how we can design an effective parallel crawler. As the size of the Web grows, it becomes imperative to parallelize a crawling process, in order to finish downloading pages in a reasonable amount of time. We first propose multiple architectures for a parallel crawler and identify fundamental issues related to parallel crawling. Based on this understanding, we then propose metrics to evaluate a parallel crawler, and compare the proposed architectures using 40 million pages collected from the Web. Our results clarify the relative merits of each architecture and provide a good guideline on when to adopt which architecture. 1
Dissemination of Collection Wide Information in a Distributed Information Retrieval System
- In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval
, 1995
"... We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocat ..."
Abstract
-
Cited by 64 (10 self)
- Add to MetaCart
We find that dissemination of collection wide information (CWI) in a distributed collection of documents is needed to achieve retrieval effectiveness comparable to a centralized collection. Complete dissemination is unnecessary. The required dissemination level depends upon how documents are allocated among sites. Low dissemination is needed for random document allocation, but higher levels are needed when documents are allocated based on content. We define parameters to control dissemination and document allocation and present results from four test collections. We define the notion of iso-knowledge lines with respect to the number of sites and level of dissemination in the distributed archive, and show empirically that iso-knowledge lines are also isoeffectiveness lines when documents are randomly allocated. 1 Introduction In the rapidly evolving internetworks of today, a vast diversity of information is becoming electronically available. The information environment is highly dist...
What can you do with a web in your pocket
- Data Engineering Bulletin
, 1998
"... The amount of information available online has grown enormously over the past decade. Fortunately, computing power, disk capacity, and network bandwidth have also increased dramatically. It is currently possible for a university research project to store and process the entire World Wide Web. Since ..."
Abstract
-
Cited by 50 (1 self)
- Add to MetaCart
The amount of information available online has grown enormously over the past decade. Fortunately, computing power, disk capacity, and network bandwidth have also increased dramatically. It is currently possible for a university research project to store and process the entire World Wide Web. Since there is a limit on how much text humans can generate, it is plausible that within a few decades one will be able to store and process all the human-generated text on the Web in a shirt pocket. The Web is a very rich and interesting data source. In this paper, we describe the Stanford WebBase, a local repository of a significant portion of the Web. Furthermore, we describe a number of recent experiments that leverage the size and the diversity of the WebBase. First, we have largely automated the process of extracting a sizable relation of books (title, author pairs) from hundreds of data sources spread across the World Wide Web using a technique we call Dual Iterative Pattern Relation Extraction. Second, we have developed a global ranking of Web pages called PageRank based on the link structure of the Web that has properties that are useful for search and navigation. Third, we have used PageRank to develop a novel search engine called Google, which also makes heavy use of anchor text. All of these experiments rely significantly on the size and diversity of the WebBase. 1
Analysis of Anchor Text for Web Search
, 2003
"... It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries. By examining properties of anchor text in a large intranet, we hope to shed light on why this is the case. Our main premise is that anchor text behaves very ..."
Abstract
-
Cited by 50 (1 self)
- Add to MetaCart
It has been observed that anchor text in web documents is very useful in improving the quality of web text search for some classes of queries. By examining properties of anchor text in a large intranet, we hope to shed light on why this is the case. Our main premise is that anchor text behaves very much like real user queries and consensus titles. Thus an understanding of how anchor text is related to a document will likely lead to better understanding of how to translate a user's query into high quality search results. Our approach is experimental, based on a study of a large corporate intranet, including the content as well as a large stream of queries against that content. We conduct experiments to investigate several aspects of anchor text, including their relationship to titles, the frequency of queries that can be satisfied by anchortext alone, and the homogeneity of results fetched by anchor text.

