Results 1 - 8 of 8
Evaluating Topic-Driven Web Crawlers, 2001
Cited by 72 (19 self)
Due to limited bandwidth, storage, and computational resources, and to the dynamic nature of the Web, search engines cannot index every Web page, and even the covered portion of the Web cannot be monitored continuously for changes. Therefore it is essential to develop effective crawling strategies to prioritize the pages to be indexed. The issue is even more important for topic-specific search engines, where crawlers must make additional decisions based on the relevance of visited pages. However, it is difficult to evaluate alternative crawling strategies because relevant sets are unknown and the search space is changing. We propose three different methods to evaluate crawling strategies. We apply the proposed metrics to compare three topic-driven crawling algorithms based on similarity ranking, link analysis, and adaptive agents.
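
As a rough illustration of the similarity-ranking family of crawlers mentioned in this abstract, the following is a minimal sketch of a best-first crawl. It is not the authors' implementation: the in-memory `pages` dictionary, the function names, and the choice of bag-of-words cosine similarity are all assumptions made for the example. Each outlink inherits the topic similarity of the page it was found on, and the frontier always expands the highest-scoring URL next.

```python
import heapq
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_first_crawl(seed, topic, pages, max_pages):
    """Visit up to max_pages URLs, most promising first.

    pages is a hypothetical in-memory web: url -> (text, [outlinks]).
    A real crawler would fetch and parse pages instead.
    """
    topic_vec = Counter(topic.lower().split())
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = pages[url]
        visited.append(url)
        # Score the fetched page; its outlinks inherit this score as priority.
        score = cosine(topic_vec, Counter(text.lower().split()))
        for link in links:
            if link in pages and link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited


# Toy web, invented for the example.
pages = {
    "seed": ("web crawler index", ["a", "b"]),
    "a": ("cooking recipes", ["c"]),
    "b": ("topical web crawler evaluation", ["d"]),
    "c": ("pasta", []),
    "d": ("crawler metrics", []),
}
visited = best_first_crawl("seed", "web crawler", pages, max_pages=4)
```

With a budget of four pages, the link found on the on-topic page `b` is expanded before the link found on the off-topic page `a`, so `c` is never visited.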
Topical web crawlers: Evaluating adaptive algorithms - ACM Transactions on Internet Technology, 2004
Cited by 35 (11 self)
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
A General Evaluation Framework for Topical Crawlers - Information Retrieval, 2005
Cited by 28 (10 self)
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through well-defined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of different nature and difficulty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and efficiency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing different relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling effectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is effective at evaluating, comparing, differentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
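
To make the precision and recall measures mentioned in this abstract concrete, here is a minimal sketch of their plain set-based form, applied to a crawl and a set of known relevant target URLs. The abstract does not define its generalized versions, so this should be read as the textbook baseline only; the function name and the sample URLs are hypothetical.

```python
def crawl_precision_recall(crawled, targets):
    """Set-based precision and recall of a crawl against known target URLs.

    precision = |crawled AND targets| / |crawled|
    recall    = |crawled AND targets| / |targets|
    """
    crawled_set = set(crawled)
    target_set = set(targets)
    hits = crawled_set & target_set
    precision = len(hits) / len(crawled_set) if crawled_set else 0.0
    recall = len(hits) / len(target_set) if target_set else 0.0
    return precision, recall


# Hypothetical crawl of four pages, two of which are among three targets.
precision, recall = crawl_precision_recall(
    crawled=["a", "b", "c", "d"],
    targets=["b", "d", "e"],
)
```

Here half of the crawled pages are relevant (precision 0.5) and two of the three targets were found (recall 2/3).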
Link Analysis in Web Information Retrieval - IEEE Data Engineering Bulletin, 2000
Cited by 25 (0 self)
The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.
Topic-Driven Crawlers: Machine Learning Issues - ACM TOIT, Submitted, 2002
Cited by 9 (2 self)
Topic driven crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers.
Web Information Retrieval - an Algorithmic Perspective - Proceedings of the 8th Annual European Symposium on Algorithms (ESA), 2000
Cited by 5 (0 self)
In this paper we survey algorithmic aspects of Web information retrieval. As an example, we discuss ranking of search engine results using connectivity analysis.
Target Seeking Crawlers and their Topical Performance, 2002
Cited by 2 (0 self)
Topic driven crawlers can complement search engines by targeting relevant portions of the Web. A topic driven crawler must exploit the information available about the topic and its underlying context. In this paper we extend our previous research on the design and evaluation of topic driven crawlers by comparing seven different crawlers on a harder problem, namely, seeking highly relevant target pages. We find that exploration is an important aspect of a crawling strategy. We also study how the performance of crawler strategies depends on a number of topical characteristics based on notions of topic generality, cohesiveness, and authoritativeness. Our results reveal that topic generality is an obstacle for most crawlers, that three crawlers tend to perform better when the target pages are clustered together, and that two of these also display better performance when topic targets are highly authoritative.
On near-uniform URL sampling - www.elsevier.com/locate/comnet
We consider the problem of sampling URLs uniformly at random from the Web. A tool for sampling URLs uniformly can be used to estimate various properties of Web pages, such as the fraction of pages in various Internet domains or written in various languages. Moreover, uniform URL sampling can be used to determine the sizes of various search engines relative to the entire Web. In this paper, we consider sampling approaches based on random walks of the Web graph. In particular, we suggest ways of improving sampling based on random walks to make the samples closer to uniform. We suggest a natural test bed based on random graphs for testing the effectiveness of our procedures. We then use our sampling approach to estimate the distribution of pages over various Internet domains and to estimate the coverage of

