Topical web crawlers: Evaluating adaptive algorithms
- ACM Transactions on Internet Technology, 2004
Cited by 35 (11 self)
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
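The exploration/exploitation tradeoff mentioned in this abstract can be illustrated with a small sketch. It is not the authors' implementation: it is a generic priority-frontier crawler in which a batch size `n` controls how greedy the crawl is (`n = 1` always follows the single best-scored link, larger `n` commits to several links before re-ranking). The toy link graph, the scores, and the names `best_n_first`, `TOY_WEB`, and `TOY_SCORE` are all assumptions made for illustration; a real crawler would fetch pages over the network and estimate link scores from page content.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical in-memory "web": each URL maps to the links found on that page.
TOY_WEB: Dict[str, List[str]] = {
    "seed": ["a", "b"],
    "a": ["a1"],
    "b": ["b1"],
    "a1": [],
    "b1": [],
}

# Hypothetical relevance estimates for each link (higher is better).
TOY_SCORE: Dict[str, float] = {
    "seed": 1.0, "a": 0.9, "b": 0.5, "a1": 0.6, "b1": 0.95,
}


def best_n_first(
    seeds: List[str],
    fetch_links: Callable[[str], List[str]],
    score: Callable[[str], float],
    n: int,
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages URLs, taking the n best-scored frontier URLs per batch."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier: List[Tuple[float, str]] = [(-score(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        # Commit to a whole batch before newly found links are considered.
        batch = [heapq.heappop(frontier)[1] for _ in range(min(n, len(frontier)))]
        for url in batch:
            if len(visited) >= max_pages:
                break
            visited.append(url)
            for link in fetch_links(url):
                if link not in seen:
                    seen.add(link)
                    heapq.heappush(frontier, (-score(link), link))
    return visited


if __name__ == "__main__":
    greedy = best_n_first(["seed"], TOY_WEB.__getitem__, TOY_SCORE.__getitem__, 1, 4)
    explorative = best_n_first(["seed"], TOY_WEB.__getitem__, TOY_SCORE.__getitem__, 2, 4)
    print(greedy)       # ['seed', 'a', 'a1', 'b']
    print(explorative)  # ['seed', 'a', 'b', 'b1']
```

In this toy graph the greedy crawl (`n = 1`) spends its budget under the best-looking link `a` and never reaches the highly scored page `b1`, while the more explorative crawl (`n = 2`) also follows the weaker link `b` and finds it within the same budget.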
Crawling the Web
- In Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Edited by M. Levene and A. Poulovassilis, 2004
Cited by 16 (1 self)
this document to represent the vocabulary (feature space) that will be used for classification
Topical Crawling for Business Intelligence
- In Proc. 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003)
Cited by 11 (6 self)
The Web provides us with a vast resource for business intelligence. However, the large size of the Web and its dynamic nature make the task of foraging appropriate information challenging. General-purpose search engines and business portals may be used to gather some basic intelligence. Topical crawlers, driven by richer contexts, can then leverage this basic intelligence to facilitate in-depth and up-to-date research. In this paper we investigate the use of topical crawlers in creating a small document collection that helps locate relevant business entities. The problem of locating business entities is encountered when an organization looks for competitors, partners or acquisitions. We formalize the problem, create a test bed, introduce a metric to measure the performance of crawlers, and compare the results of four different crawlers. Our results underscore the importance of exploiting DOM-based link contexts and identifying good hubs for accelerating the crawl and improving the overall results. We further verify our findings on a real-world example.
Web Crawling Agents for Retrieving Biomedical Information
- 2002
Cited by 9 (1 self)
Autonomous agents for topic-driven retrieval of information from the Web are currently a very active area of research. The ability to conduct real-time searches for information is important for many users, including biomedical scientists, health care professionals and the general public. We present preliminary research on different retrieval agents tested on their ability to retrieve biomedical information, whose relevance is assessed using both genetic and ontological expertise. In particular, the agents are judged on their performance in fetching information about diseases when given information about genes. We discuss several key insights into the particular challenges of agent-based retrieval learned from our initial experience in the biomedical domain.
Topic-Driven Crawlers: Machine Learning Issues
- ACM TOIT, Submitted, 2002
Cited by 9 (2 self)
Topic-driven crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers.
Combining text and link analysis for focused crawling–an application for vertical search engines
- Information Systems, 2007
Cited by 8 (0 self)
The number of vertical search engines and portals has rapidly increased over recent years, making the importance of a topic-driven (focused) crawler evident. In this paper, we develop a latent semantic indexing classifier that combines link analysis with text content in order to retrieve and index domain-specific web documents. We compare its efficiency with other well-known web information retrieval techniques. Our implementation presents a different approach to focused crawling and aims to overcome the limitation of needing to provide initial training data while maintaining a high recall/precision ratio.
Search Engine-Crawler Symbiosis: Adapting to Community Interests
- In Proc. 7th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2003), 2003
Cited by 7 (1 self)
Web crawlers have been used for nearly a decade as a search engine component to create and update large collections of documents.
Hyperlink Analysis: Techniques and Applications
- 2002
Cited by 2 (2 self)
Learnable topic-specific web crawler
- Computer Applications xx, 2004
Cited by 1 (0 self)
A topic-specific web crawler collects web pages relevant to topics of interest from the Internet. Many previous studies have focused on algorithms for web page crawling. The main purpose of those algorithms is to gather as many relevant web pages as possible, and most of them only detail the approach for the first crawl. However, some important questions have not been addressed, such as how the crawler performs during subsequent crawling attempts and whether the crawler can learn from experience to crawl more relevant web pages incrementally. In this paper, we present an algorithm that covers both the first and the consecutive crawls. To make the next crawl more efficient, we derive information from previous crawling attempts to build several knowledge bases: starting URLs, topic keywords and URL prediction. These knowledge bases form the experience of the learnable topic-specific web crawler and are used to produce better results in the next crawl. A preliminary evaluation illustrates that the proposed web crawler can learn from experience to better collect the web pages of interest during the early period of consecutive crawling attempts.
Chaining with Memory in Xcerpt
- 2008
Moving from single-rule Xcerpt programs, as described in previous deliverables, to full Xcerpt programs requires addressing the issue of efficient rule chaining. In this deliverable, we first survey existing approaches for efficient rule chaining (using some form of memoization) in logic programming and then briefly outline first results and challenges when extending these results to Xcerpt.

