Abstract:
We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be substantially improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of those links on a referring page. These features are unusual: rather than being scalar measurements such as word counts, they are tree-structured, each describing the position of an item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.
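To make the idea of a tree-structured feature concrete, the following is a minimal sketch of one way a link's URL could be read as a position in a tree, with labels counted along the path from the root. It is an illustration only and is not the model or algorithm developed in the paper: the function `url_to_tree_path`, the class `TreeCounts`, the reversed-host ordering, and the back-off with add-one smoothing are all assumptions chosen for simplicity.

```python
from urllib.parse import urlsplit


def url_to_tree_path(url):
    """Return a root-to-leaf path for a URL (hypothetical encoding).

    The host labels are reversed so that the most general part comes
    first, followed by the non-empty path segments.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").split(".")
    host_labels = [label for label in reversed(host) if label]
    segments = [seg for seg in parts.path.split("/") if seg]
    return host_labels + segments


class TreeCounts:
    """Toy classifier: label counts stored at every node of the URL tree.

    This is an assumed stand-in for illustration, not the paper's method.
    """

    def __init__(self):
        # Maps a path prefix (tuple) to [negative_count, positive_count].
        self.counts = {}

    def add(self, url, label):
        """Record one labeled URL at every ancestor node, including the root."""
        path = url_to_tree_path(url)
        for depth in range(len(path) + 1):
            node = tuple(path[:depth])
            self.counts.setdefault(node, [0, 0])[int(bool(label))] += 1

    def score(self, url):
        """Estimate P(positive) from the deepest ancestor seen in training."""
        path = url_to_tree_path(url)
        for depth in range(len(path), -1, -1):
            node = tuple(path[:depth])
            if node in self.counts:
                neg, pos = self.counts[node]
                # Add-one smoothing keeps the estimate away from 0 and 1.
                return (pos + 1) / (neg + pos + 2)
        return 0.5  # Nothing has been observed at all.


model = TreeCounts()
model.add("http://ads.example.com/banners/a.gif", True)
model.add("http://ads.example.com/banners/b.gif", True)
model.add("http://www.example.com/news/story1.html", False)
ad_score = model.score("http://ads.example.com/banners/c.gif")
news_score = model.score("http://www.example.com/news/story2.html")
```

In this sketch, an unseen URL inherits evidence from its nearest known ancestor in the tree, so a new image under the same `banners` directory scores as a likely advertisement without any text being read. The visual placement of a link on the referring page could be encoded the same way, as a path through the page's layout tree, though that is not shown here.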