Classifying web documents in a hierarchy of categories: a comprehensive study (2007)

by M. Ceci, D. Malerba
Venue: J. Intell. Inf. Syst.

Results 1 - 4 of 4

Regression Accuracy Measures

by Lachlan Henderson (supervisor: Scott Sanner)
DMOZ is the largest, most comprehensive human-edited directory of the web: 718,201 hierarchical categories and 4,815,303 validated entries in 79 languages, available as an RDF download. What is classification? A process that, given an uncategorized web document, predicts its DMOZ category: given an input vector x, assign it to one of K discrete classes. Methods may be parametric or nonparametric, and may use unsupervised, semi-supervised or supervised learning.

Automated Text Classification in the DMOZ Hierarchy- Project Plan

by Lachlan Henderson, 2009
The goal of this project is to build a text classifier [2, 4, 7] for the DMOZ hierarchy of classification labels. Based on previous successes [3, 5, 6], this project will focus on non-parametric methods such as nearest-neighbour algorithms, exploring different feature representations. Various approaches will be implemented and evaluated in the initial stage, focusing on the use of the classification hierarchy to improve performance and on the development of language-independent classification techniques. In the latter stage, work will concentrate on making the most successful approach(es) as efficient as possible. 1.2 Motivation. The growth in the availability of on-line digital text documents has spurred considerable interest in Information Retrieval and Text Classification. The Internet in particular represents a considerable opportunity for many corporations and individuals to exchange ideas and to access products and services. Automating the management of this wealth of Internet hypertext is becoming increasingly important as new material continues to appear at a substantial rate. The DMOZ open directory project [1] is an on-line service which provides a searchable and browsable hierarchically organised directory to facilitate access to the Internet's resources. DMOZ is a collaborative effort of over 56,000 volunteers who contribute and renew Internet content for a growing list of over 718 thousand categories. This represents a considerable convenience for users of the Internet and also a valuable resource for Data Mining applications.

Automated Text Classification in the DMOZ Hierarchy

by Lachlan Henderson, 2009
The growth in the availability of on-line digital text documents has prompted considerable interest in Information Retrieval and Text Classification. Automating the management of this wealth of textual data is becoming increasingly important as new material continues to appear at a substantial rate. The Open Directory Project (ODP), also known as DMOZ, is an on-line service which provides a searchable and browsable hierarchically organised directory to facilitate access to the Internet's resources. This resource is highly useful for the construction of intelligent systems for on-line content management. In this report the utility of the publicly available Open Directory Project data for the classification of World Wide Web (WWW) text documents is investigated. The resource is sampled and a range of algorithms is applied to the task, namely Support Vector Machines (SVM), Multi-class Rocchio (Centroid), k-Nearest Neighbour, and Naïve Bayes (NB). The theoretical and implementation details of the four text classification systems are discussed. Results from the tuning and performance of these algorithms are analysed and compared with published results. Related work from the areas of both text classification and classification in general is surveyed. Some of the unique issues of large-scale multi-class text classification are identified and analysed.
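As an illustration of one of the four methods named in this abstract, a multi-class Rocchio (centroid) classifier can be sketched in a few lines. This is a minimal sketch and not the report's implementation: the whitespace tokenizer, the raw term-frequency weighting, the function names and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def normalize(vec):
    """Scale a sparse term vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def train_centroids(docs):
    """docs: list of (text, label). Returns {label: unit-length centroid}."""
    sums = defaultdict(Counter)
    for text, label in docs:
        # Add each document's unit-length term vector to its class total.
        for term, weight in normalize(Counter(tokenize(text))).items():
            sums[label][term] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(text, centroids):
    """Return the label whose centroid is most cosine-similar to the text."""
    query = normalize(Counter(tokenize(text)))

    def score(label):
        centroid = centroids[label]
        return sum(w * centroid.get(t, 0.0) for t, w in query.items())

    return max(centroids, key=score)


# Hypothetical toy data; the labels only mimic top-level directory categories.
TOY_DOCS = [
    ("python java compiler code", "Computers"),
    ("linux kernel code driver", "Computers"),
    ("football goal league match", "Sports"),
    ("tennis match serve court", "Sports"),
]
CENTROIDS = train_centroids(TOY_DOCS)
```

With these toy documents, `classify("kernel compiler", CENTROIDS)` selects `Computers` because neither query term occurs in any `Sports` document. A realistic system would typically replace raw term counts with a weighting such as TF-IDF.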

Active Learning for Hierarchical Text Classification

by Xiao Li, Da Kuang, Charles X. Ling
Hierarchical text classification plays an important role in many real-world applications, such as webpage topic classification, product categorization and user feedback classification. Usually a large number of training examples are needed to build an accurate hierarchical classification system. Active learning has been shown to reduce the number of training examples significantly, but it has not been applied to hierarchical text classification due to several technical challenges. In this paper, we study active learning for hierarchical text classification. We propose a realistic multi-oracle setting as well as a novel active learning framework, and devise several novel leveraging strategies under this new framework. The hierarchical relation between different categories is explored and leveraged to improve active learning further. Experiments show that our methods are quite effective in reducing the number of oracle queries (by 74% to 90%) in building accurate hierarchical classification systems. As far as we know, this is the first work that studies active learning in hierarchical text classification with promising results.
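To make the idea of reducing oracle queries concrete, the sketch below shows generic pool-based uncertainty sampling with a margin criterion. It is not the multi-oracle framework or the leveraging strategies proposed in the paper, which this abstract does not specify; the function names and the probabilities in the example pool are hypothetical.

```python
def margin(probs):
    """Gap between the two highest class probabilities; small means uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def select_queries(pool_probs, budget):
    """Pick the documents the current model is least sure about.

    pool_probs: {doc_id: [class probabilities]} for the unlabeled pool.
    Returns up to `budget` doc ids, smallest margin first; these would be
    sent to the oracle (a human labeler) for annotation.
    """
    ranked = sorted(pool_probs, key=lambda doc_id: margin(pool_probs[doc_id]))
    return ranked[:budget]


# Hypothetical predicted class probabilities for three unlabeled documents.
EXAMPLE_POOL = {
    "a": [0.90, 0.05, 0.05],  # confident prediction, large margin
    "b": [0.40, 0.35, 0.25],  # nearly tied, smallest margin
    "c": [0.50, 0.30, 0.20],
}
```

Here `select_queries(EXAMPLE_POOL, 2)` ranks document `b` first and `c` second, so the confidently classified document `a` is never sent for labeling.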