Abstract:
The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
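The ranking by expected entropy loss mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard information-gain definition (prior class entropy minus the expected class entropy after observing whether a feature is present) and treats each document as a set of words. The function names, the set-of-words representation, and the tie-breaking rule are hypothetical choices made for illustration; the abstract does not specify the authors' exact procedure.

```python
import math


def entropy(p):
    """Binary entropy, in bits, of a class probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Prior class entropy minus expected entropy after observing the feature.

    pos_with / neg_with: positive / negative documents containing the feature.
    pos_total / neg_total: sizes of the positive / negative sets.
    """
    total = pos_total + neg_total
    if total == 0:
        return 0.0
    with_f = pos_with + neg_with
    without_f = total - with_f
    prior = entropy(pos_total / total)
    h_with = entropy(pos_with / with_f) if with_f else 0.0
    h_without = entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    posterior = (with_f / total) * h_with + (without_f / total) * h_without
    return prior - posterior


def rank_features(pos_docs, neg_docs):
    """Rank every word by expected entropy loss, highest first.

    pos_docs and neg_docs are lists of sets of words (an assumed representation).
    Ties are broken alphabetically so the output is deterministic.
    """
    vocab = set().union(*pos_docs, *neg_docs)
    scored = []
    for word in vocab:
        pos_with = sum(1 for doc in pos_docs if word in doc)
        neg_with = sum(1 for doc in neg_docs if word in doc)
        score = expected_entropy_loss(pos_with, len(pos_docs), neg_with, len(neg_docs))
        scored.append((word, score))
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))
```

Under this definition a word that separates the two sets perfectly scores highly whether it marks the positive or the negative set, so a naming application would presumably keep only high-scoring words that are more frequent in the positive set. That filter is an assumption and is not implemented here.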