A major challenge in indexing unstructured hypertext databases is to automatically extract meta-data that enables structured search using topic taxonomies, circumvents keyword ambiguity, and improves the quality of search and profile-based routing and filtering. Therefore, an accurate classifier is an essential component of a hypertext database. Hyperlinks pose new problems not addressed in the extensive text classification literature. Links clearly contain high-quality semantic clues that are lost on a purely term-based classifier, but exploiting link information is non-trivial because it is noisy. Naive use of terms in the link neighborhood of a document can even degrade accuracy. Our contribution is to propose robust statistical models and a relaxation labeling technique for better classification by exploiting link information in a small neighborhood around documents. Our technique also adapts gracefully to the fraction of neighboring documents having known topics. We experimented ...