The structure and function of complex networks
- SIAM REVIEW, 2003
Cited by 913 (7 self)
Inspired by empirical studies of networked systems such as the Internet, social networks, and biological networks, researchers have in recent years developed a variety of techniques and models to help us understand or predict the behavior of these systems. Here we review developments in this field, including such concepts as the small-world effect, degree distributions, clustering, network correlations, random graph models, models of network growth and preferential attachment, and dynamical processes taking place on networks.
Accelerated Focused Crawling through Online Relevance Feedback
- 2002
Cited by 58 (2 self)
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.
Effective Page Refresh Policies for Web Crawlers
- ACM TRANSACTIONS ON DATABASE SYSTEMS, 2003
Cited by 50 (3 self)
In this paper we study how we can maintain local copies of remote data sources "fresh," when the source data is updated autonomously and independently. In particular, we study the problem of Web crawlers that maintain local copies of remote Web pages for Web search engines. In this context, remote data sources (Web sites) do not notify the copies (Web crawlers) of new changes, so we need to periodically poll the sources to maintain the copies up-to-date. Since polling the sources takes significant time and resources, it is very difficult to keep the copies completely up-to-date. This paper
Topical web crawlers: Evaluating adaptive algorithms
- ACM Transactions on Internet Technology, 2004
Cited by 35 (11 self)
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
A General Evaluation Framework for Topical Crawlers
- INFORMATION RETRIEVAL, 2005
Cited by 28 (10 self)
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
MySpiders : Evolve your own intelligent Web crawlers
- 2002
Cited by 24 (8 self)
The dynamic nature of the World Wide Web makes it a challenge to find information that is both relevant and recent. Intelligent agents can complement the power of search engines to meet this challenge. We present a Web tool called MySpiders, which implements an evolutionary algorithm managing a population of adaptive crawlers that browse the Web autonomously. Each agent acts as an intelligent client on behalf of the user, driven by a user query and by textual and linkage clues in the crawled pages. Agents autonomously decide which links to follow, which clues to internalize, when to spawn offspring to focus the search near a relevant source, and when to starve. The tool is available to the public as a threaded Java applet. We discuss the development and deployment of such a system.
Complementing Search Engines with Online Web Mining Agents
- 2002
Cited by 18 (6 self)
While search engines have become the major decision support tools for the Internet, there is a growing disparity between the image of the World Wide Web stored in search engine repositories and the actual dynamic, distributed nature of Web data. We propose to attack this problem using an adaptive population of intelligent agents mining the Web online at query time. We discuss the benefits and shortcomings of using dynamic search strategies versus the traditional static methods in which search and retrieval are disjoint. This paper presents a public Web intelligence tool called MySpiders, a threaded multiagent system designed for information discovery. The performance of the system is evaluated by comparing its effectiveness in locating recent, relevant documents with that of search engines. We present results suggesting that augmenting search engines with adaptive populations of intelligent search agents can lead to a significant competitive advantage. We also discuss some of the challenges of evaluating such a system on current Web data, introduce three novel metrics for this purpose, and outline some of the lessons learned in the process.
Focused Crawls, Tunneling, and Digital Libraries
- In Proceedings of the European Conference on Digital Libraries (ECDL), 2002
Cited by 16 (1 self)
Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990's after crawler technology was developed for the benefit of search engines. Now, Web crawling is being seriously considered as an important strategy for building large-scale digital libraries. This paper considers some of the crawl technologies that might be exploited for collection building. For example, to make such collection-building crawls more effective, focused crawling was developed, in which the goal was to make a "best-first" crawl of the Web. We are using powerful crawler software to implement a focused crawl but use tunneling to overcome some of the limitations of a pure best-first approach. Tunneling has been described by others as not only prioritizing links from pages according to the page's relevance score, but also estimating the value of each link and prioritizing on that as well. We add to this mix by devising a tunneling focused crawling strategy which evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
Crawling the Web
- In Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Edited by M. Levene and A. Poulovassilis, 2004
Cited by 16 (1 self)
this document to represent the vocabulary (feature space) that will be used for classification
Exploration versus Exploitation in Topic Driven Crawlers
- WWW02 WORKSHOP ON WEB DYNAMICS, 2002
Cited by 15 (9 self)
The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.

