Focused crawling: a new approach to topic-specific Web resource discovery, 1999
Cited by 411 (8 self)
The rapid growth of the World-Wide Web poses unprecedented scaling challenges for general-purpose crawlers and search engines. In this paper we describe a new hypertext resource discovery system called a Focused Crawler. The goal of a focused crawler is to selectively seek out pages that are relevant to a pre-defined set of topics. The topics are specified not using keywords, but using exemplary documents. Rather than collecting and indexing all accessible Web documents to be able to answer all possible ad-hoc queries, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant for the crawl, and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources, and helps keep the crawl more up-to-date. To achieve such goal-directed crawling, we designed two hypertext mining programs that guide our crawler: a classifier that evaluates the relevance of a hypertext document with respect to the focus topics, ...
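The idea of steering a crawl by relevance can be illustrated with a minimal sketch. It assumes a hypothetical in-memory web (`TOY_WEB`), a keyword-overlap stand-in for the classifier (`relevance`), and a best-first frontier in which each unvisited link inherits the score of the page that cites it. None of these names or choices come from the paper; the actual system trains its classifier on exemplary documents.

```python
import heapq

# Hypothetical toy web: page -> (text, outgoing links). Illustration only.
TOY_WEB = {
    "seed": ("cycling bikes and races", ["a", "b"]),
    "a": ("bike repair and cycling gear", ["c"]),
    "b": ("stock market news", ["d"]),
    "c": ("cycling tours", []),
    "d": ("bond yields", []),
}

# Assumption: a keyword set stands in for a classifier trained on examples.
TOPIC_TERMS = {"cycling", "bike", "bikes"}


def relevance(text):
    """Toy relevance score: fraction of words that are topic terms."""
    words = text.split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def focused_crawl(seed, budget):
    """Best-first crawl: follow links found on the most relevant pages first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        text, links = TOY_WEB[url]
        score = relevance(text)
        visited.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # An unvisited link inherits the score of the citing page.
                heapq.heappush(frontier, (-score, link))
    return visited


if __name__ == "__main__":
    for url, score in focused_crawl("seed", budget=4):
        print(url, round(score, 2))
```

With a budget of four fetches, the crawl reaches the page behind the relevant page `a` and never spends a fetch on `d`, which is linked only from the off-topic page `b`.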
Data mining for hypertext: A tutorial survey - ACM SIGKDD Explorations, 2000
Cited by 61 (0 self)
With over 800 million pages covering most areas of human endeavor, the World-wide Web is a fertile ground for data mining research to make a difference to the effectiveness of information search. Today, Web surfers access the Web through two dominant interfaces: clicking on hyperlinks and searching via keyword queries. This process is often tentative and unsatisfactory. Better support is needed for expressing one's information need and dealing with a search result in more structured ways than available now. Data mining and machine learning have significant roles to play towards this end. In this paper we will survey recent advances in learning and mining problems related to hypertext in general and the Web in particular. We will review the continuum of supervised to semi-supervised to unsupervised learning problems, highlight the specific challenges which distinguish data mining in the hypertext domain from data mining in the context of data warehouses, and summarize the key areas of ...
Distributed Hypertext Resource Discovery Through Examples, 1999
Cited by 27 (2 self)
We describe the architecture of a hypertext resource discovery system using a relational database. Such a system can answer questions that combine page contents, metadata, and hyperlink structure in powerful ways, such as "find the number of links from an environmental protection page to a page about oil and natural gas over the last year." A key problem in populating the database in such a system is to discover web resources related to the topics involved in such queries. We argue that a keyword-based "find similar" search based on a giant all-purpose crawler is neither necessary nor adequate for resource discovery. Instead we exploit two properties: pages tend to cite pages with related topics, and given that a page u cites a page about a desired topic, it is very likely that u cites additional desirable pages. We exploit these properties by using a crawler controlled by two hypertext mining programs: (1) a classifier that evaluates the relevance of a region of the web to the...
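The kind of query quoted in this abstract can be sketched against a small relational schema. The tables `page(url, topic)` and `link(src, dst, found)`, the example URLs, and the dates are all hypothetical and chosen only for illustration; the abstract does not describe the system's actual schema.

```python
import sqlite3

# Hypothetical schema and rows; the real system's schema is not given here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (url TEXT PRIMARY KEY, topic TEXT);
CREATE TABLE link (src TEXT, dst TEXT, found TEXT);  -- found: ISO date
INSERT INTO page VALUES
  ('epa.example', 'environmental protection'),
  ('oil.example', 'oil and natural gas'),
  ('news.example', 'news');
INSERT INTO link VALUES
  ('epa.example', 'oil.example', '1999-03-01'),
  ('epa.example', 'news.example', '1999-04-01'),
  ('news.example', 'oil.example', '1999-05-01'),
  ('epa.example', 'oil.example', '1997-01-01');
""")


def count_links(src_topic, dst_topic, since):
    """Count links from pages on src_topic to pages on dst_topic since a date."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM link
        JOIN page AS s ON s.url = link.src
        JOIN page AS d ON d.url = link.dst
        WHERE s.topic = ? AND d.topic = ? AND link.found >= ?
        """,
        (src_topic, dst_topic, since),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    print(count_links("environmental protection",
                      "oil and natural gas", "1998-06-01"))
```

ISO-formatted date strings compare correctly as text, which is why a plain `>=` suffices for the time window in this sketch.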
Link Analysis in Web Information Retrieval - IEEE DATA ENGINEERING BULLETIN, 2000
Cited by 25 (0 self)
The analysis of the hyperlink structure of the web has led to significant improvements in web information retrieval. This survey describes two successful link analysis algorithms and the state of the art of the field.
Data mining trends and developments: the key data mining technologies and applications for the 21st century - Woratschek (eds), The Proceedings of ISECON 2002, v 19 (San Antonio): 224b. AITP Foundation for Information Technology Education, 2002
Cited by 8 (0 self)
This paper discusses a number of technologies, approaches, and research areas that have been identified as being critical and having future promise in the field of data mining. There is currently an explosion in the amount of data that we produce and have access to, and mining these sources can uncover important information. The extensive use of handheld, wireless, and other ubiquitous devices is a developing area, since much of the information being created and transmitted would be maintained and stored only on these kinds of devices. Other areas being developed and investigated, and for which applications are being identified, include hypertext and hypermedia data mining, phenomenal data mining, distributed/collective data mining, constraint-based data mining, and other related methods.
Web Information Retrieval - an Algorithmic Perspective - Proceedings of the 8th Annual European Symposium on Algorithms, (ESA, 2000
Cited by 5 (0 self)
In this paper we survey algorithmic aspects of Web information retrieval. As an example, we discuss ranking of search engine results using connectivity analysis.
Reference Manual
contributions from the following people on the indicated sections of the code: Russell Bernard and Peter Killworth (rank order sociometric data), Miguel Guilarte (cluster analysis, debugging, eigenvectors, random numbers and spatial display), Xia Fang (debugging, power and t-tests), Stephen Johnson (cluster analysis), Nan Lin (path distances), Peter Marsden (eigenvectors), Mark Mizruchi (reflected prominence), Ronan van Rossem (triad pattern frequencies), Seymour Spilerman (eigenvectors), and Tetsuji Uchiyama (debugging and density tables).
STRUCTURE Reference Manual
CONTENTS
Network Analysis
Running the Program
COMMANDS
ANALYZE --- sign

