Incremental algorithms for hierarchical classification
- Journal of Machine Learning Research
, 2004
Cited by 42 (2 self)
We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce a new algorithm that incrementally learns a linear-threshold classifier for each node of the taxonomy. A hierarchical classification is obtained by evaluating the trained node classifiers in a top-down fashion. To evaluate classifiers in our multipath framework, we define a new hierarchical loss function, the H-loss, capturing the intuition that whenever a classification mistake is made on a node of the taxonomy, no loss should be charged for any additional mistake occurring in the subtree of that node. Making no assumptions on the mechanism generating the data instances, and assuming a linear noise model for the labels, we bound the H-loss of our on-line algorithm in terms of the H-loss of a reference classifier knowing the true parameters of the label-generating process. We show that, in expectation, the excess cumulative H-loss grows at most logarithmically in the length of the data sequence. Furthermore, our analysis reveals the precise dependence of the rate of convergence on the eigenstructure of the data each node observes. Our theoretical results are complemented by a number of experiments on textual corpora. In these experiments we show that, after only one epoch of training, our algorithm performs much better than Perceptron-based hierarchical classifiers, and reasonably close to a hierarchical support vector machine.
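The H-loss intuition in this abstract can be illustrated with a minimal Python sketch. It assumes unit cost per node and 0/1 node labels; the abstract does not state either, and the function name `h_loss` and the four-node taxonomy are hypothetical. A node is charged only when it is misclassified and every one of its ancestors is classified correctly.

```python
def h_loss(parent, predicted, actual):
    """Count nodes that are wrong while all of their ancestors are right.

    parent    -- dict mapping each node to its parent (None for a root)
    predicted -- dict mapping each node to a predicted 0/1 label
    actual    -- dict mapping each node to the true 0/1 label
    Unit cost per node is an assumption made for this sketch.
    """
    loss = 0
    for node in parent:
        if predicted[node] == actual[node]:
            continue
        # Walk towards the root; stop at the first misclassified ancestor.
        ancestor = parent[node]
        while ancestor is not None and predicted[ancestor] == actual[ancestor]:
            ancestor = parent[ancestor]
        # Reaching None means no ancestor was wrong, so this mistake is charged.
        if ancestor is None:
            loss += 1
    return loss


# Hypothetical taxonomy: root -> a -> {b, c}
parent = {"root": None, "a": "root", "b": "a", "c": "a"}
actual = {"root": 1, "a": 1, "b": 1, "c": 0}
predicted = {"root": 1, "a": 0, "b": 0, "c": 0}
print(h_loss(parent, predicted, actual))
```

In this example both `a` and `b` are misclassified, but `b` lies in the subtree of `a`, so only the mistake on `a` is charged.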
Clustering Documents in a Web Directory
, 2003
Cited by 18 (6 self)
growing interest due to the widespread proliferation of topic hierarchies for text documents. The worst problem of hierarchical supervised classifiers is their high demand in terms of labeled examples, whose amount is related to the number of topics in the taxonomy. Hence, bootstrapping a huge hierarchy with a proper set of labeled examples is a critical issue. In this paper, we propose some solutions for the bootstrapping problem, implicitly or explicitly using a taxonomy definition: a baseline approach where documents are classified according to class labels, and two clustering approaches, where training is constrained by the a priori knowledge of the taxonomy structure, at both the terminological and the topological level. In particular, we propose the TaxSOM model, which clusters a set of documents in a predefined hierarchy of classes, directly exploiting the knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google Web directory.
Hierarchical classification: Combining Bayes with SVM
- In Proceedings of the 23rd International Conference on Machine Learning
, 2006
Cited by 17 (0 self)
We study hierarchical classification in the general case when an instance could belong to more than one class node in the underlying taxonomy. Experiments done in previous work showed that a simple hierarchy of Support Vector Machines (SVMs) with a top-down evaluation scheme performs surprisingly well on this kind of task. In this paper, we introduce a refined evaluation scheme which turns the hierarchical SVM classifier into an approximator of the Bayes optimal classifier with respect to a simple stochastic model for the labels. Experiments on synthetic datasets, generated according to this stochastic model, show that our refined algorithm outperforms the simple hierarchical SVM. On real-world data, however, the advantage brought by our approach is a bit less clear. We conjecture this is due to a higher noise rate for the training labels in the low levels of the taxonomy.
Large-scale text categorization by batch mode active learning
- In Proceedings of the International World Wide Web Conference
, 2006
Cited by 14 (5 self)
Large-scale text categorization is an important research topic for Web data mining. One of the challenges in large-scale text categorization is how to reduce the human effort in labeling text documents for building reliable classification models. In the past, there have been many studies on applying active learning methods to automatic text categorization, which try to select the most informative documents for manual labeling. Most of these studies focused on selecting a single unlabeled document in each iteration. As a result, the text categorization model has to be retrained after each labeled document is solicited. In this paper, we present a novel active learning algorithm that selects a batch of text documents for manual labeling in each iteration. The key to batch mode active learning is how to reduce the redundancy among the selected examples such that each example provides unique information for model updating. To this end, we use the Fisher information matrix as the measurement of model uncertainty and choose the set of documents that can efficiently minimize the Fisher information matrix of a classification model. Extensive experiments with three different datasets have shown that our algorithm is more effective than the state-of-the-art active learning techniques for text categorization and can be a promising tool toward large-scale text categorization on the World Wide Web.
Bootstrapping for Hierarchical Document Classification
- In Proceedings of the Twelfth ACM International Conference on Information and Knowledge Management (CIKM03)
, 2003
Cited by 12 (5 self)
Managing the hierarchical organization of data is starting to play a key role in the knowledge management community due to the great amount of human resources needed to create and maintain these organized repositories of information. The machine learning community has in part addressed this problem by developing hierarchical supervised classifiers that help maintainers to categorize new resources within given hierarchies. Although such learning models succeed in exploiting relational knowledge, they are highly demanding in terms of labeled examples, because the number of categories is related to the dimension of the corresponding hierarchy. Hence, the creation of new directories or the modification of existing ones requires strong investments.
Predicting Library of Congress Classifications from Library of Congress Subject Headings
, 2004
Cited by 10 (1 self)
This paper addresses the problem of automatically assigning a Library of Congress Classification (LCC) to a work given its set of Library of Congress Subject Headings (LCSH). LCC are organized in a tree: the root node of this hierarchy comprises all possible topics, and leaf nodes correspond to the most specialized topic areas defined. We describe a procedure that, given a resource identified by its LCSH, automatically places that resource in the LCC hierarchy. The procedure uses machine learning techniques and training data from a large library catalog to learn a classification model mapping from sets of LCSH to nodes in the LCC tree. We present empirical results for our technique showing its accuracy on an independent collection of 50,000 LCSH/LCC pairs.
Clustering documents into a web directory for bootstrapping a supervised classification
- Journal of Data & Knowledge Engineering – Elsevier
, 2005
Cited by 9 (2 self)
The management of hierarchically organized data is starting to play a key role in the knowledge management community due to the proliferation of topic hierarchies for text documents. The creation and maintenance of such organized repositories of information requires a great deal of human intervention. The machine learning community has partially addressed this problem by developing hierarchical supervised classifiers that help people categorize new resources within given hierarchies. The worst problem of hierarchical supervised classifiers, however, is their high demand in terms of labeled examples. The number of examples required is related to the number of topics in the taxonomy. Bootstrapping a huge hierarchy with a proper set of labeled examples is therefore a critical issue. This paper proposes some solutions for the bootstrapping problem that implicitly or explicitly use the taxonomy definition: a baseline approach that classifies documents according to the class labels, and two clustering approaches, whose training is constrained by the a priori knowledge encoded in the taxonomy structure, which consists of both terminological and relational aspects. In particular, we propose the TaxSOM model, which clusters a set of documents in a predefined hierarchy of classes, directly exploiting the knowledge of both their topological organization and their lexical description. Experimental evaluation was performed on a set of taxonomies taken from the Google™ and LookSmart™ web directories, obtaining good results.
Improving protein function prediction using the hierarchical structure of the Gene Ontology
- In Proc. IEEE CIBCB
, 2005
Cited by 9 (1 self)
High-performance and accurate protein function prediction is an important problem in molecular biology. Many contemporary ontologies, such as Gene Ontology (GO), have a hierarchical structure that can be exploited to improve the prediction accuracy, and lower the computational cost, of protein function prediction. We leverage the hierarchical structure of the ontology in two ways. First, we present a method of creating hierarchy-aware training sets for machine-learned classifiers and we show that, in the case of GO molecular function, it is more accurate than not considering the hierarchy during training. Second, we use the hierarchy to reduce the computational cost of classification. We also introduce a sound methodology for evaluating hierarchical classifiers using global cross-validation. Biologists often use sequence similarity (e.g. BLAST) to identify a "nearest neighbor" sequence and use the database annotations of this neighbor to predict protein function. In these cases, we use the hierarchy to improve accuracy by a small amount. When no similar sequences can be found (which is true for up to 40% of some common proteomes), our technique can improve accuracy by a more significant amount. Although this paper focuses on a specific important application (protein function prediction for the GO hierarchy), the techniques may be applied to any classification problem over a hierarchical ontology.
Hierarchical Dirichlet model for document classification
- In ICML 2005
, 2005
Cited by 8 (2 self)
The proliferation of text documents on the web as well as within institutions necessitates their convenient organization to enable efficient retrieval of information. Although text corpora are frequently organized into concept hierarchies or taxonomies, the classification of the documents into the hierarchy is expensive in terms of human effort. We present a novel and simple hierarchical Dirichlet generative model for text corpora and derive an efficient algorithm for the estimation of model parameters and the unsupervised classification of text documents into a given hierarchy. The class conditional feature means are assumed to be inter-related due to the hierarchical Bayesian structure of the model. We show that the algorithm provides robust estimates of the classification parameters by performing smoothing or regularization. We present experimental evidence on real web data that our algorithm achieves significant gains in accuracy over simpler models.
Regret bounds for hierarchical classification with linear-threshold functions
- Proceedings of the 17th Annual Conference on Learning Theory
, 2004
Cited by 7 (3 self)
We study the problem of classifying data in a given taxonomy when classifications associated with multiple and/or partial paths are allowed. We introduce an incremental algorithm using a linear-threshold classifier at each node of the taxonomy. These classifiers are trained and evaluated in a hierarchical top-down fashion. We then define a hierarchical and parametric data model and prove a bound on the probability that our algorithm guesses the wrong multilabel for a random instance compared to the same probability when the true model parameters are known. Our bound decreases exponentially with the number of training examples and depends in a detailed way on the interaction between the process parameters and the taxonomy structure. Preliminary experiments on real-world data provide support to our theoretical results.
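As a rough illustration of top-down evaluation with linear-threshold node classifiers, the Python sketch below visits a node only if its parent was labeled 1 and labels every node under a rejected parent 0. That rule, the threshold at zero, the function name `top_down_predict`, and the toy weights are assumptions made for the example; the abstract does not spell them out, and training is not shown.

```python
def top_down_predict(children, roots, weights, x):
    """Label each taxonomy node 0/1 by evaluating classifiers from the roots down.

    children -- dict mapping a node to the list of its child nodes
    roots    -- list of root nodes
    weights  -- dict mapping each node to its weight vector (list of floats)
    x        -- instance as a list of floats, same length as the weight vectors
    """
    labels = {node: 0 for node in weights}  # unvisited nodes stay at 0
    stack = list(roots)
    while stack:
        node = stack.pop()
        # Linear-threshold classifier: label 1 when the dot product is >= 0.
        score = sum(w * xi for w, xi in zip(weights[node], x))
        if score >= 0:
            labels[node] = 1
            # Children are evaluated only when the parent is accepted.
            stack.extend(children.get(node, []))
    return labels


# Hypothetical taxonomy: a -> {b, c}, with hand-picked weight vectors.
children = {"a": ["b", "c"]}
roots = ["a"]
weights = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.0, -1.0]}

print(top_down_predict(children, roots, weights, [1.0, 1.0]))
print(top_down_predict(children, roots, weights, [-1.0, 1.0]))
```

For the second instance the classifier at `b` would fire on its own, but because `a` is rejected its subtree is never evaluated, so the output is always a set of paths starting at a root.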

