Using URLs and Table Layout for Web Classification Tasks (2004) [14 citations — 0 self]

by Lawrence Kai Shih

Abstract:

We propose new features and algorithms for automating Web-page classification tasks such as content recommendation and ad blocking. We show that the automated classification of Web pages can be much improved if, instead of looking at their textual content, we consider each link's URL and the visual placement of that link on the referring page. These features are unusual: rather than being scalar measurements like word counts, they are tree-structured, describing the position of the item in a tree. We develop a model and algorithm for machine learning using such tree-structured features. We apply our methods in automated tools for recognizing and blocking Web advertisements and for recommending "interesting" news stories to a reader. Experiments show that our algorithms are both faster and more accurate than those based on the text content of Web documents.
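
To make the idea of a tree-structured feature concrete, here is a minimal sketch of how a link's URL could be read as a position in a tree. It is an illustration under assumptions, not the tokenization or model described in the paper: the function names `url_to_tree_path` and `build_tree`, the choice to reverse the domain labels, and the example URLs are all hypothetical.

```python
from urllib.parse import urlsplit


def url_to_tree_path(url):
    """Split a URL into tokens ordered from most general to most specific.

    Assumption for illustration: domain labels are reversed (com -> example -> www)
    and followed by the non-empty path segments.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = [label for label in reversed(host.split(".")) if label]
    segments = [segment for segment in parts.path.split("/") if segment]
    return labels + segments


def build_tree(urls):
    """Insert each URL's token path into a nested dict acting as the tree."""
    root = {}
    for url in urls:
        node = root
        for token in url_to_tree_path(url):
            # Descend into the child for this token, creating it if needed.
            node = node.setdefault(token, {})
    return root


# Hypothetical example URLs.
example_urls = [
    "http://www.example.com/ads/banner.gif",
    "http://www.example.com/news/story1.html",
]
tree = build_tree(example_urls)
```

In this sketch, two links that share a long prefix (for example everything under `/ads/`) end up in the same subtree, which is the kind of positional information a scalar word count would not capture.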

Citations

3011 Pattern Classification and Scene Analysis – Duda, Hart - 1973
1839 The Anatomy of a Large-Scale Hypertextual Web Search Engine – Brin, Page - 1998
1240 A tutorial on support vector machines for pattern recognition – Burges - 1998
408 Wrapper induction for information extraction – Kushmerick, Doorenbos, et al. - 1997
301 Hierarchically classifying documents using very few words – Koller, Sahami - 1997
250 A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. ICML-97 – Joachims - 1997
207 Quantifying inductive bias: AI learning algorithms and Valiant's learning framework – Haussler - 1988
198 Learning and revising user profiles: The identification of interesting web sites – Pazzani, Billsus - 1997
169 Improving text classification by shrinkage in a hierarchy of classes – McCallum, Rosenfeld, et al. - 1998
148 On discriminative vs. generative classifiers: a comparison of logistic regression and naïve Bayes – Ng, Jordan - 2002
122 Nonparametric Statistical Methods – Hollander, A - 1973
88 Using reinforcement learning to spider the web efficiently – Rennie, McCallum - 1999
73 A hybrid user model for news story classification – Billsus, Pazzani - 1999
47 Accelerated focused crawling through online relevance feedback – Chakrabarti, Punera, et al. - 2002
45 On integrating catalogs – Agrawal, Srikant - 2001
45 Inferring strategies for sentence ordering in multidocument summarization – Barzilay, Elhadad, et al. - 2002
33 Learning to remove internet advertisements – Kushmerick - 1999
20 Web montage: a dynamic personalized start page – Anderson, Horvitz - 2002
1 Machine Learning of Web Documents – Shih - 2004