Focused crawling: a new approach to topic-specific Web resource discovery
, 1999
Cited by 411 (8 self)
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
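The classifier-guided crawl described in this abstract can be illustrated with a minimal best-first sketch. This is not the authors' system: the in-memory `TOY_WEB` graph, the word-overlap `relevance` function (a stand-in for a trained classifier), and the `threshold` and `max_pages` parameters are all assumptions made for illustration.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "seed": ("cycling bikes and racing", ["a", "b"]),
    "a": ("mountain bikes cycling gear", ["c"]),
    "b": ("stock market news", ["d"]),
    "c": ("bike racing results", []),
    "d": ("bond yields and stocks", []),
}


def relevance(text, topic_terms):
    """Stand-in for a trained classifier: fraction of words that are topic terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)


def focused_crawl(seed, topic_terms, threshold=0.2, max_pages=10):
    """Best-first crawl: fetch the most promising frontier URL first and
    do not follow links out of pages scored below `threshold`."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text, topic_terms)
        visited.append((url, score))
        if score < threshold:
            continue  # irrelevant page: do not expand this region
        for link in links:
            if link not in seen:
                seen.add(link)
                # An unfetched link inherits the score of the page citing it.
                heapq.heappush(frontier, (-score, link))
    return visited


if __name__ == "__main__":
    topic = {"cycling", "bikes", "bike", "racing"}
    for url, score in focused_crawl("seed", topic):
        print(url, round(score, 2))
```

On this toy graph the page `d` is never fetched, because its only inbound link comes from a page scored as irrelevant. That pruning is the source of the resource savings the abstract refers to.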
Hybrid Neural Plausibility Networks for News Agents
- In Proceedings of the National Conference on Artificial Intelligence
, 1998
Cited by 20 (15 self)
This paper describes a learning news agent HyNeT which uses hybrid neural network techniques for classifying news titles as they appear on an internet newswire. Recurrent plausibility networks with local memory are developed and examined for learning robust text routing. HyNeT is described for the first time in this paper. We show that a careful hybrid integration of techniques from neural network architectures, learning and information retrieval can reach consistent recall and precision rates of more than 92% on an 82 000 word corpus; this is demonstrated for 10 000 unknown news titles from the Reuters newswire. This new synthesis of neural networks, learning and information retrieval techniques allows us to scale up to a real-world task and demonstrates a lot of potential for hybrid plausibility networks for semantic text routing agents on the internet.
Complementing Search Engines with Online Web Mining Agents
, 2002
Cited by 18 (6 self)
While search engines have become the major decision support tools for the Internet, there is a growing disparity between the image of the World Wide Web stored in search engine repositories and the actual dynamic, distributed nature of Web data. We propose to attack this problem using an adaptive population of intelligent agents mining the Web online at query time. We discuss the benefits and shortcomings of using dynamic search strategies versus the traditional static methods in which search and retrieval are disjoint. This paper presents a public Web intelligence tool called MySpiders, a threaded multiagent system designed for information discovery. The performance of the system is evaluated by comparing its effectiveness in locating recent, relevant documents with that of search engines. We present results suggesting that augmenting search engines with adaptive populations of intelligent search agents can lead to a significant competitive advantage. We also discuss some of the challenges of evaluating such a system on current Web data, introduce three novel metrics for this purpose, and outline some of the lessons learned in the process.
Crawling the infinite Web: five levels are enough
- In Proceedings of the third Workshop on Web Graphs (WAW
, 2004
Cited by 16 (4 self)
A large amount of publicly available Web pages are generated dynamically upon request, and contain links to other dynamically generated pages. This usually produces Web sites which can create arbitrarily many pages. In this article, several probabilistic models for browsing “infinite” Web sites are proposed and studied. We use these models to estimate how deep a crawler must go to download a significant portion of the Web site content that is actually visited. The proposed models are validated against real data on page views in several Web sites, showing that, in both theory and practice, a crawler needs to download just a few levels, no more than 3 to 5 “clicks” away from the start page, to reach 90% of the pages that users actually visit.
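The depth estimate in this abstract can be made concrete with a deliberately simple model. This is an assumption for illustration, not one of the paper's models: suppose every session starts at level 0 and, at each level, the user goes one level deeper with a fixed probability `q`. Expected page views at level k are then proportional to q^k, so the share of views at levels 0 through d is 1 - q^(d+1). The function names and the sample value of `q` are hypothetical.

```python
def coverage_at_depth(q, depth):
    """Share of page views at levels 0..depth when a user continues one
    level deeper with probability q at every step (geometric assumption)."""
    return 1.0 - q ** (depth + 1)


def min_depth_for_coverage(q, target=0.9):
    """Smallest crawl depth whose coverage reaches `target`."""
    if not 0.0 <= q < 1.0:
        raise ValueError("q must be in [0, 1)")
    if not 0.0 < target < 1.0:
        raise ValueError("target must be in (0, 1)")
    depth = 0
    while coverage_at_depth(q, depth) < target:
        depth += 1
    return depth


if __name__ == "__main__":
    # With q = 0.6 (an illustrative value, not a measured one),
    # levels 0..4 account for at least 90% of page views.
    print(min_depth_for_coverage(0.6))
```

Under this assumption the required depth grows only logarithmically as the target approaches 1, which matches the abstract's qualitative claim that a few levels suffice.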
Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives
- In Proc. 7th ECDL
, 2003
Cited by 5 (0 self)
This paper argues that the growing importance of the World Wide Web means that Web sites are key candidates for digital preservation. After a brief outline of some of the main reasons why the preservation of Web sites can be problematic, a review of selected Web archiving initiatives shows that most current initiatives are based on combinations of three main approaches: automatic harvesting, selection and deposit. The paper ends with a discussion of issues relating to collection and access policies, software, costs and preservation.
Size Estimation Using Multiple Lists
, 1998
... index the objects, and $j = 1, \ldots, J$ index the lists. Our basic model has $N \times J$ random variables, $X_{ij}$, such that $X_{ij} = 1$ if object $i$ appears on list $j$; ...

