Focused crawling: a new approach to topic-specific Web resource discovery
, 1999
Cited by 411 (8 self)
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
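The link-prioritization idea in this abstract can be illustrated with a minimal best-first frontier. This is only a sketch under stated assumptions, not the authors' system: the in-memory `graph` stands in for fetching pages, the `relevance` table stands in for the classifier, and the names `focused_crawl`, `toy_graph`, and `toy_relevance` are hypothetical.

```python
import heapq


def focused_crawl(graph, relevance, seeds, budget):
    """Best-first crawl over a toy in-memory link graph.

    graph:     dict page -> list of outlinked pages (stand-in for fetching)
    relevance: dict page -> score in [0, 1] (stand-in for a classifier)
    """
    # heapq is a min-heap, so priorities are negated to pop the best first.
    frontier = [(-1.0, seed) for seed in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        score = relevance.get(page, 0.0)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                # Assumption: an unvisited link inherits the relevance
                # of the page that links to it.
                heapq.heappush(frontier, (-score, link))
    return visited


toy_graph = {
    "seed": ["cycling", "news"],
    "cycling": ["bikes", "tours"],
    "news": ["politics", "sports"],
}
toy_relevance = {"seed": 1.0, "cycling": 0.9, "news": 0.1,
                 "bikes": 0.8, "tours": 0.7}
order = focused_crawl(toy_graph, toy_relevance, ["seed"], budget=5)
```

With a budget of five pages, the links found on the low-scoring "news" page stay at the back of the frontier and are never fetched, which is the resource saving the abstract describes.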
Intelligent Crawling on the World Wide Web with Arbitrary
- Discovery
, 2001
Cited by 68 (2 self)
The enormous growth of the world wide web in recent years has made it important to perform resource discovery efficiently. Consequently, several new ideas have been proposed in recent years; among them a key technique is focused crawling which is able to crawl particular topical portions of the world wide web quickly without having to explore all web pages. In this paper, we propose the novel concept of intelligent crawling which actually learns characteristics of the linkage structure of the world wide web while performing the crawling. Specifically, the intelligent crawler uses the inlinking web page content, candidate URL structure, or other behaviors of the inlinking web pages or siblings in order to estimate the probability that a candidate is useful for a given crawl. This is a much more general framework than the focused crawling technique which is based on a pre-defined understanding of the topical structure of the web. The techniques discussed in this paper are applicable for crawling web pages which satisfy arbitrary user-defined predicates such as topical queries, keyword queries or any combinations of the above. Unlike focused crawling, it is not necessary to provide representative topical examples, since the crawler can learn its way into the appropriate topic. We refer to this technique as intelligent crawling because of its adaptive nature in adjusting to the web page linkage structure. The learning crawler is capable of reusing the knowledge gained in a given crawl in order to provide more efficient crawling for closely related predicates.
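One way to picture a crawler that learns while it crawls is a running estimate, per URL token, of how often candidates carrying that token satisfied the predicate. The sketch below is an assumption-laden illustration, not the model from the paper: the class `UrlTokenLearner`, the choice of URL tokens as the only feature, and the Laplace-smoothed average are all hypothetical.

```python
from collections import defaultdict


class UrlTokenLearner:
    """Tracks how often pages whose URL contains a token satisfied the predicate."""

    def __init__(self):
        self.hits = defaultdict(int)    # token -> pages that satisfied the predicate
        self.total = defaultdict(int)   # token -> pages seen with this token

    def update(self, url_tokens, satisfied):
        """Record the outcome for one crawled page."""
        for token in set(url_tokens):
            self.total[token] += 1
            if satisfied:
                self.hits[token] += 1

    def estimate(self, url_tokens):
        """Average of Laplace-smoothed per-token hit rates; 0.5 with no evidence."""
        tokens = set(url_tokens)
        if not tokens:
            return 0.5
        rates = [(self.hits[t] + 1) / (self.total[t] + 2) for t in tokens]
        return sum(rates) / len(rates)


learner = UrlTokenLearner()
learner.update(["edu", "cs"], satisfied=True)
learner.update(["com", "shop"], satisfied=False)
p_edu = learner.estimate(["edu"])
p_new = learner.estimate(["unseen"])
p_shop = learner.estimate(["shop"])
```

Candidates whose tokens have previously led to matching pages score above unseen tokens, which in turn score above tokens that led to misses; a crawler could use such scores to order its frontier.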
Design and Implementation of a High-Performance Distributed Web Crawler
- In Proc. of the Int. Conf. on Data Engineering
, 2002
Cited by 61 (10 self)
Broad web search engines as well as many more specialized search tools rely on web crawlers to acquire large collections of pages for indexing and analysis. Such a web crawler may interact with millions of hosts over a period of weeks or months, and thus issues of robustness, flexibility, and manageability are of major importance. In addition, I/O performance, network resources, and OS limits must be taken into account in order to achieve high performance at a reasonable cost.
Compressing the graph structure of the web
- In IEEE Data Compression Conference (DCC)
, 2001
Cited by 43 (2 self)
A large amount of research has recently focused on the graph structure (or link structure) of the World Wide Web. This structure has proven to be extremely useful for improving the performance of search engines and other tools for navigating the web. However, since the graphs in these scenarios involve hundreds of millions of nodes and even more edges, highly space-efficient data structures are needed to fit the data in memory. A first step in this direction was taken by the DEC Connectivity Server, which stores the graph in compressed form. In this paper, we describe techniques for compressing the graph structure of the web, and give experimental results of a prototype implementation. We attempt to exploit a variety of different sources of compressibility of these graphs and of the associated set of URLs in order to obtain good compression performance on a large web graph.
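As a generic illustration of one source of compressibility in link graphs, the sketch below stores a sorted adjacency list as gaps between consecutive node IDs, each written as a variable-length byte code. This is a common textbook technique and an assumption on our part; the abstract does not say which encodings the paper uses. The function names are hypothetical, and node IDs are assumed to be non-negative integers.

```python
def encode_adjacency(neighbors):
    """Gap-encode a list of non-negative node IDs using 7-bit varints."""
    out = bytearray()
    prev = 0
    for node in sorted(neighbors):
        gap = node - prev
        prev = node
        # Emit 7 bits per byte, low bits first; the high bit marks "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_adjacency(data):
    """Invert encode_adjacency, returning node IDs in ascending order."""
    nodes, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += gap
            nodes.append(prev)
            gap, shift = 0, 0
    return nodes


links = [1000000, 1000003, 1000010, 1000200]
encoded = encode_adjacency(links)
decoded = decode_adjacency(encoded)
```

Because pages tend to link to pages with nearby IDs when IDs follow URL order, most gaps are small and fit in one or two bytes instead of a fixed four.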
Focused Crawls, Tunneling, and Digital Libraries
- In Proceedings of the European Conference on Digital Libraries (ECDL)
, 2002
Cited by 16 (1 self)
Crawling the Web to build collections of documents related to pre-specified topics became an active area of research during the late 1990s after crawler technology was developed for the benefit of search engines. Now, Web crawling is being seriously considered as an important strategy for building large-scale digital libraries. This paper considers some of the crawl technologies that might be exploited for collection building. For example, to make such collection-building crawls more effective, focused crawling was developed, in which the goal was to make a "best-first" crawl of the Web. We are using powerful crawler software to implement a focused crawl but use tunneling to overcome some of the limitations of a pure best-first approach. Tunneling has been described by others as not only prioritizing links from pages according to the page's relevance score, but also estimating the value of each link and prioritizing on that as well. We add to this mix by devising a tunneling focused crawling strategy which evaluates the current crawl direction on the fly to determine when to terminate a tunneling activity. Results indicate that a combination of focused crawling and tunneling could be an effective tool for building digital libraries.
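A minimal sketch of tunneling with a termination rule follows. It assumes a fixed cutoff on consecutive off-topic pages, which is a simplification chosen for illustration and not the on-the-fly evaluation the paper describes; `crawl_with_tunneling`, `threshold`, and `max_tunnel` are hypothetical names.

```python
from collections import deque


def crawl_with_tunneling(graph, relevance, seed, threshold=0.5, max_tunnel=2):
    """Collect relevant pages, passing through at most max_tunnel off-topic pages in a row."""
    # Each queue entry carries the number of consecutive off-topic pages on its path.
    queue = deque([(seed, 0)])
    seen = {seed}
    collected = []
    while queue:
        page, off_topic_run = queue.popleft()
        if relevance.get(page, 0.0) >= threshold:
            collected.append(page)
            off_topic_run = 0          # back on topic: reset the tunnel length
        else:
            off_topic_run += 1
            if off_topic_run > max_tunnel:
                continue               # terminate this tunnel, do not expand further
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, off_topic_run))
    return collected


chain_graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
chain_relevance = {"a": 0.9, "b": 0.1, "c": 0.2, "d": 0.8}
long_tunnel = crawl_with_tunneling(chain_graph, chain_relevance, "a", max_tunnel=2)
short_tunnel = crawl_with_tunneling(chain_graph, chain_relevance, "a", max_tunnel=1)
```

In the toy chain, the relevant page "d" sits behind two off-topic pages, so it is reached only when the tunnel is allowed to run for two steps.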
Data mining trends and developments: the key data mining technologies and applications for the 21st century
- Woratschek (eds), The Proceedings of ISECON 2002, v 19 (San Antonio): 224b. AITP Foundation for Information Technology Education
, 2002
Cited by 8 (0 self)
This paper discusses a number of technologies, approaches, and research areas which have been identified as having critical and future promise in the field of data mining. There is currently an explosion in the amount of data which we now produce and have access to, and mining from these sources can uncover important information. The extensive use of handheld, wireless, and other ubiquitous devices is a developing area, since a lot of information being created and transmitted would be maintained and stored only on these kinds of devices. Other areas under development and investigation, for which applications are being identified, include hypertext and hypermedia data mining, phenomenal data mining, distributed/collective data mining, constraint-based data mining, and other related methods.
Information Retrieval on the World Wide Web and Active Logic: A Survey and Problem Definition
, 2002
Cited by 3 (0 self)
As more information becomes available on the World Wide Web (there are currently over 4 billion pages covering most areas of human endeavor), it becomes more difficult to provide effective search tools for information access. Today, people access web information through two main kinds of search interfaces: Browsers (clicking and following hyperlinks) and Query Engines (queries in the form of a set of keywords showing the topic of interest). The first process is tentative and time consuming and the second may not satisfy the user because of many inaccurate and irrelevant results. Better support is needed for expressing one's information need and returning high quality search results by web search tools. There appears to be a need for systems that do reasoning under uncertainty and are flexible enough to recover from the contradictions, inconsistencies, and irregularities that such reasoning involves.
Design of a Priority Based Frequency Regulated Incremental Crawler
Cited by 3 (1 self)
The World Wide Web is a huge source of hyperlinked information contained in hypertext documents. Search engines use web crawlers to collect these documents from the web for storage and indexing. However, many of these documents contain dynamic information that changes on a daily, weekly, monthly, or yearly basis, and hence the search engine's storage must be refreshed so that the latest information is made available to the user. An incremental crawler revisits the web repeatedly at specific intervals to update its collection. This paper proposes a novel mechanism and a novel architecture for an incremental crawler to regulate the revisit frequency.
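Regulating revisit frequency can be sketched with a simple adaptive rule: shorten the interval when a page was found changed, lengthen it when it was not. The abstract does not specify the paper's mechanism, so the halve/double rule, the bounds, and the name `next_interval` below are assumptions made purely for illustration.

```python
def next_interval(current_hours, changed, min_hours=1.0, max_hours=720.0):
    """Return the next revisit interval in hours for one page.

    Halve the interval if the page changed since the last visit,
    double it otherwise, and clamp the result to [min_hours, max_hours].
    """
    factor = 0.5 if changed else 2.0
    return max(min_hours, min(max_hours, current_hours * factor))


after_change = next_interval(24.0, changed=True)
after_no_change = next_interval(24.0, changed=False)
floor_hit = next_interval(1.0, changed=True)
ceiling_hit = next_interval(500.0, changed=False)
```

Pages that change often drift toward the minimum interval, while static pages drift toward the maximum, so crawl effort concentrates where the collection is most likely to be stale.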
Hyperlink Analysis: Techniques and Applications
, 2002
Cited by 2 (2 self)

