The organization of HTML into a tag tree, which browsers render as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on the links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page with respect to an information need, using only information available on the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.