Focused crawling: a new approach to topic-specific Web resource discovery
, 1999
Cited by 411 (8 self)
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
The Structure of Broad Topics on the Web
- INTERNATIONAL WORLD WIDE WEB CONFERENCE
, 2002
Cited by 43 (1 self)
The Web graph is a giant social network whose properties have been measured and modeled extensively in recent years. Most such studies concentrate on the graph structure alone, and do not consider textual properties of the nodes. Consequently, Web communities have been characterized purely in terms of graph structure and not on page content. We propose that a topic taxonomy such as Yahoo! or the Open Directory provides a useful framework for understanding the structure of content-based clusters and communities. In particular, using a topic taxonomy and an automatic classifier, we can measure the background distribution of broad topics on the Web, and analyze the capability of recent random walk algorithms to draw samples which follow such distributions. In addition, we can measure the probability that a page about one broad topic will link to another broad topic. Extending this experiment, we can measure how quickly topic context is lost while walking randomly on the Web graph. Estimates of this topic mixing distance may explain why a global PageRank is still meaningful in the context of broad queries. In general, our measurements may prove valuable in the design of community-specific crawlers and link-based ranking systems.
Distributed Hypertext Resource Discovery Through Examples
, 1999
Cited by 27 (2 self)
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we exploit the properties that pages tend to cite pages with related topics, and given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the...
The Look of the Link - Concepts for the User Interface of Extended Hyperlinks
- University of Aarhus
, 2001
Cited by 17 (4 self)
The design of hypertext systems has been subject to intense research. Apparently, one topic was mostly neglected: how to visualize and interact with link markers.
A Case for Automated Large Scale Semantic Annotations
- Journal of Web Semantics
, 2003
Cited by 16 (0 self)
This paper describes Seeker, a platform for large-scale text analytics, and SemTag, an application written on the platform to perform automated semantic tagging of large corpora. We apply SemTag to a collection of approximately 264 million web pages, and generate approximately 434 million automatically disambiguated semantic tags, published to the web as a label bureau providing metadata regarding the 434 million annotations. To our knowledge, this is the largest scale semantic tagging effort to date. We describe the Seeker platform, discuss the architecture of the SemTag application, describe a new disambiguation algorithm specialized to support ontological disambiguation of large-scale data, evaluate the algorithm, and present our final results with information about acquiring and making use of the semantic tags. We argue that automated large scale semantic tagging of ambiguous content can bootstrap and accelerate the creation of the semantic web.
Combining Text-, Link-, and Classification-based Retrieval Methods to Enhance Information Discovery on the Web
, 2002
An Augmented Web Space for Digital Cities
- In Proceedings of the 2001 Symposium on Applications and the Internet (SAINT-2001)
, 2001
Cited by 5 (1 self)
We propose an augmented Web space and its query language to support geographical querying and sequential plan creation utilizing a digital city that is a city-based information space on the Internet. The augmented Web space involves a new approach to integrate the World Wide Web (WWW) and a geographic information system (GIS). The augmented Web space consists of home pages (HP), hyperlinks, and generic links that represent geographical relations between HPs. The generic links are created dynamically using geographical evaluation functions included in a user's search query each time one is issued. A query also includes a path expression showing how to navigate the HPs, hyperlinks, and generic links. Since the path expression is an extended regular expression, we can describe an arbitrary sequence of users' search actions for navigating the augmented Web space. We have applied the proposed augmented Web space to Digital City Kyoto, a city information service system that is accessed through a 3D walk-through implementation and a map-based interface. Each time a user's query is issued through the 3D and 2D interfaces, Digital City Kyoto creates an augmented Web space, and navigates the Web information space based on the path expression in the query.
Geospatial Mapping and Navigation of the Web
, 2001
Web pages may be organized, indexed, searched, and navigated along several different feature dimensions. We investigate different approaches to discovering geographic context for web pages, and describe a navigational tool for browsing web resources by geographic proximity.
Literature Review
, 2001
this paper, IR will imply text-based retrieval unless explicitly stated otherwise.

