Results 1 - 10
of
47
Accelerated Focused Crawling through Online Relevance Feedback
, 2002
"... The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and there ..."
Abstract
-
Cited by 58 (2 self)
- Add to MetaCart
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier, and reduce the number of irrelevant pages which are fetched and discarded.
Sindice.com: A document-oriented lookup index for open linked data
- International Journal of Metadata, Semantics and Ontologies
"... Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: how and where to find statements about encountered resources. The “linked data” approach mandates that resource URIs should be de-referenced to return resource metadata. But for data discove ..."
Abstract
-
Cited by 49 (9 self)
- Add to MetaCart
Developers of Semantic Web applications face a challenge with respect to the decentralised publication model: how and where to find statements about encountered resources. The “linked data” approach mandates that resource URIs should be de-referenced to return resource metadata. But for data discovery linkage itself is not enough, and crawling and indexing of data is necessary. Existing Semantic Web search engines are focused on database-like functionality, compromising on index size, query performance and live updates. We present Sindice, a lookup index over resources crawled on the Semantic Web. Our index allows applications to automatically locate documents containing information about a given resource. In addition, we allow resource retrieval through uniquely identifying inverse-functional properties, offer a full-text search and index SPARQL endpoints. Finally we introduce an extension to the sitemap protocol which allows us to efficiently index large Semantic Web datasets with minimal impact on the data providers.
Topical web crawlers: Evaluating adaptive algorithms
- ACM Transactions on Internet Technology
, 2004
"... Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficien ..."
Abstract
-
Cited by 35 (11 self)
- Add to MetaCart
Topical crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. The context available to such crawlers can guide the navigation of links with the goal of efficiently locating highly relevant target pages. We developed a framework to fairly evaluate topical crawling algorithms under a number of performance metrics. Such a framework is employed here to evaluate different algorithms that have proven highly competitive among those proposed in the literature and in our own previous research. In particular we focus on the tradeoff between exploration and exploitation of the cues available to a crawler, and on adaptive crawlers that use machine learning techniques to guide their search. We find that the best performance is achieved by a novel combination of explorative and exploitative bias, and introduce an evolutionary crawler that surpasses the performance of the best non-adaptive crawler after sufficiently long crawls. We also analyze the computational complexity of the various crawlers and discuss how performance and complexity scale with available resources. Evolutionary crawlers achieve high efficiency and scalability by distributing the work across concurrent agents, resulting in the best performance/cost ratio.
A General Evaluation Framework for Topical Crawlers
- INFORMATION RETRIEVAL
, 2005
"... Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks ..."
Abstract
-
Cited by 28 (10 self)
- Add to MetaCart
Topical crawlers are becoming important tools to support applications such as specialized Web portals, online searching, and competitive intelligence. As the Web mining field matures, the disparate crawling strategies proposed in the literature will have to be evaluated and compared on common tasks through welldefined performance measures. This paper presents a general framework to evaluate topical crawlers. We identify a class of tasks that model crawling applications of di#erent nature and di#culty. We then introduce a set of performance measures for fair comparative evaluations of crawlers along several dimensions including generalized notions of precision, recall, and e#ciency that are appropriate and practical for the Web. The framework relies on independent relevance judgements compiled by human editors and available from public directories. Two sources of evidence are proposed to assess crawled pages, capturing di#erent relevance criteria. Finally we introduce a set of topic characterizations to analyze the variability in crawling e#ectiveness across topics. The proposed evaluation framework synthesizes a number of methodologies in the topical crawlers literature and many lessons learned from several studies conducted by our group. The general framework is described in detail and then illustrated in practice by a case study that evaluates four public crawling algorithms. We found that the proposed framework is e#ective at evaluating, comparing, di#erentiating and interpreting the performance of the four crawlers. For example, we found the IS crawler to be most sensitive to the popularity of topics.
An adaptive crawler for locating hidden-Web entry points
- In Proceedings of WWW
, 2007
"... In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages ..."
Abstract
-
Cited by 21 (10 self)
- Add to MetaCart
In this paper we describe new adaptive crawling strategies to efficiently locate the entry points to hidden-Web sources. The fact that hidden-Web sources are very sparsely distributed makes the problem of locating them especially challenging. We deal with this problem by using the contents of pages to focus the crawl on a topic; by prioritizing promising links within the topic; and by also following links that may not lead to immediate benefit. We propose a new framework whereby crawlers automatically learn patterns of promising links and adapt their focus as the crawl progresses, thus greatly reducing the amount of required manual setup and tuning. Our experiments over real Web pages in a representative set of domains indicate that online learning leads to significant gains in harvest rates—the adaptive crawlers retrieve up to three times as many forms as crawlers that use a fixed focus strategy.
Ontology-Focused Crawling of Web Documents
, 2003
"... The Web, the largest unstructured database of the world, has greatly improved access to documents. However, documents on the Web are largely disorganized. Due to the distributed nature of the World Wide Web it is difficult to use it as a tool for information and knowledge management. Therefore, user ..."
Abstract
-
Cited by 17 (0 self)
- Add to MetaCart
The Web, the largest unstructured database of the world, has greatly improved access to documents. However, documents on the Web are largely disorganized. Due to the distributed nature of the World Wide Web it is difficult to use it as a tool for information and knowledge management. Therefore, users doing the difficult task of exploring the Web have to be supported by intelligent means.
Crawling the Web
- In Web Dynamics: Adapting to Change in Content, Size, Topology and Use. Edited by M. Levene and A. Poulovassilis
, 2004
"... this document to represent the vocabulary (feature space) that will be used for classification ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
this document to represent the vocabulary (feature space) that will be used for classification
Exploration versus Exploitation in Topic Driven Crawlers
- WWW02 WORKSHOP ON WEB DYNAMICS
, 2002
"... The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed de ..."
Abstract
-
Cited by 15 (9 self)
- Add to MetaCart
The dynamic nature of the Web highlights the scalability limitations of universal search engines. Topic driven crawlers can address the problem by distributing the crawling process across users, queries, or even client computers. The context available to a topic driven crawler allows for informed decisions about how to prioritize the links to be visited. Here we focus on the balance between a crawler's need to exploit this information to focus on the most promising links, and the need to explore links that appear suboptimal but might lead to more relevant pages. We investigate the issue for two different tasks: (i) seeking new relevant pages starting from a known relevant subset, and (ii) seeking relevant pages starting a few links away from the relevant subset. Using a framework and a number of quality metrics developed to evaluate topic driven crawling algorithms in a fair way, we find that a mix of exploitation and exploration is essential for both tasks, in spite of a penalty in the early stage of the crawl.
Web Crawling Agents for Retrieving Biomedical Information
, 2002
"... Autonomous agents for topic driven retrieval of information from the Web are currently a very active area of research. The ability to conduct real time searches for information is important for many users including biomedical scientists, health care professionals and the general public. We present p ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
Autonomous agents for topic driven retrieval of information from the Web are currently a very active area of research. The ability to conduct real time searches for information is important for many users including biomedical scientists, health care professionals and the general public. We present preliminary research on different retrieval agents tested on their ability to retrieve biomedical information, whose relevance is assessed using both genetic and ontological expertise. In particular, the agents are judged on their performance in fetching information about diseases when given information about genes. We discuss several key insights into the particular challenges of agent based retrieval learned from our initial experience in the biomedical domain.
Topic-Driven Crawlers: Machine Learning Issues
- ACM TOIT, Submitted
, 2002
"... Topic driven crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers. ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
Topic driven crawlers are increasingly seen as a way to address the scalability limitations of universal search engines, by distributing the crawling process across users, queries, or even client computers.

