Hierarchically Classifying Documents Using Very Few Words
, 1997
Cited by 363 (8 self)
The proliferation of topic hierarchies for text documents has resulted in a need for tools that automatically classify new documents within such hierarchies. Existing classification schemes which ignore the hierarchical structure and treat the topics as separate classes are often inadequate in text classification, where there is a large number of classes and a huge number of relevant features needed to distinguish between them. We propose an approach that utilizes the hierarchical topic structure to decompose the classification task into a set of simpler problems, one at each node in the classification tree. As we show, each of these smaller problems can be solved accurately by focusing only on a very small set of features, those relevant to the task at hand. This set of relevant features varies widely throughout the hierarchy, so that, while the overall relevant feature set may be large, each classifier only examines a small subset. The use of reduced feature sets allows us to util...
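The decomposition described in this abstract (one small classifier per tree node, each looking at only a few words) can be illustrated with a minimal Python sketch. The topic tree, the per-node word lists, and the word-count scoring rule are all hypothetical stand-ins chosen for illustration; the abstract does not specify the classifier or the feature selection method used.

```python
from collections import Counter

# Hypothetical two-level topic tree: each internal node maps to its children.
TREE = {
    "root": ["science", "sports"],
    "science": ["physics", "biology"],
    "sports": ["soccer", "tennis"],
}

# Hypothetical per-node feature sets: each internal node keeps only the few
# words that help choose among its own children.
NODE_FEATURES = {
    "root": {"science": {"experiment", "theory"}, "sports": {"match", "player"}},
    "science": {"physics": {"quantum", "particle"}, "biology": {"cell", "gene"}},
    "sports": {"soccer": {"goal", "penalty"}, "tennis": {"serve", "racket"}},
}


def classify(words, node="root"):
    """Descend from the root, letting each node's small classifier pick a child."""
    counts = Counter(words)
    while node in TREE:
        features = NODE_FEATURES[node]
        # Score each child using only the words relevant at this node.
        node = max(TREE[node], key=lambda c: sum(counts[w] for w in features[c]))
    return node


doc = "the experiment measured a quantum particle".split()
print(classify(doc))
```

Each node consults at most four words here, even though the union of all feature sets is larger, which is the property the abstract emphasizes.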
Building Hierarchical Classifiers Using Class Proximity
, 1999
Cited by 46 (3 self)
In this paper, we address the need to automatically classify text documents into topic hierarchies like those in ACM Digital Library and Yahoo!. The existing local approach constructs a classifier at each split of the topic hierarchy. However, the local approach does not address the closeness of classification in hierarchical classification where the concern often is how close a classification is, rather than simply correct or wrong. Also, the local approach puts its bet on classification at higher levels where the classification structure often diminishes. To address these issues, we propose the notion of class proximity and cast the hierarchical classification as a flat classification with the class proximity modeling the closeness of classes. Our approach is global in that it constructs a single classifier based on the global information about all classes and class proximity. We leverage generalized association rules as the rule/feature space to address several other issues in hierarchical classification.
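To make the idea of "how close a classification is" concrete, here is a minimal Python sketch that scores a prediction by the number of tree edges between the predicted and the true class. This is only one plausible notion of closeness, assumed for illustration; the abstract does not give the paper's definition of class proximity, and the hierarchy below is hypothetical.

```python
# Hypothetical hierarchy given as child -> parent links.
PARENT = {
    "physics": "science", "biology": "science",
    "soccer": "sports", "tennis": "sports",
    "science": "root", "sports": "root",
}


def path_to_root(label):
    """Return the list of nodes from label up to the root, inclusive."""
    path = [label]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def tree_distance(a, b):
    """Number of edges on the path between classes a and b."""
    pa, pb = path_to_root(a), path_to_root(b)
    depth_in_pa = {node: i for i, node in enumerate(pa)}
    for j, node in enumerate(pb):
        if node in depth_in_pa:
            # First shared ancestor: edges up from a plus edges up from b.
            return depth_in_pa[node] + j
    raise ValueError("classes are not in the same tree")


print(tree_distance("physics", "biology"))  # sibling classes
print(tree_distance("physics", "tennis"))   # classes in different subtrees
```

Under this measure, confusing two sibling classes counts as a smaller error than confusing classes from different subtrees, which a plain correct/wrong score cannot express.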
Using Machine Learning To Improve Information Access
, 1999
Cited by 38 (0 self)
The explosion of on-line information has given rise to many query-based search engines (such as Alta Vista) and manually constructed topic hierarchies (such as Yahoo!). But with the current growth rate in the amount of information, query results grow incomprehensibly large and manual classification in topic hierarchies creates an immense information bottleneck. Therefore, these tools are rapidly becoming inadequate for addressing users' information needs. In this dissertation, we address these problems with a system for topical information space navigation that combines the query-based and taxonomic approaches. Our system, named SONIA (Service for Organizing Networked Information Autonomously), is implemented as part of the Stanford Digital Libraries testbed. It enables the creation of dynamic hierarchical document categorizations based on the full-text of articles. Using probability theory as a formal foundation, we develop several Machine Learning methods to allow document collections to be automatically organized at a topical level. First, to generate such topical hierarchies, we employ a novel probabilistic clustering scheme that outperforms traditional methods used in both Information Retrieval and Probabilistic Reasoning. Furthermore, we develop methods for classifying new articles into such automatically generated, or existing manually generated, hierarchies. In contrast to standard classification approaches which do not make use of the taxonomic relations in a topic hierarchy, our method explicitly uses the existing hierarchical relationships between topics, leading to improvements in classification accuracy. Much of this improvement is derived from the fact that the classification decisions in such a hierarchy can be made by considering only the presence (o...
Hierarchical Classification of HTML Documents with WebClassII
- In Proc. of the 25th European Conf. on Information Retrieval (ECIR'03)
, 2003
Cited by 16 (1 self)
This paper describes a new method for the classification of an HTML document into a hierarchy of categories. The hierarchy of categories is involved in all phases of automated document classification, namely feature extraction, learning, and classification of a new document. The innovative aspects of this work are the feature selection process, the automated threshold determination for classification scores, and an experimental study on real-world Web documents that can be associated with any node in the hierarchy. Moreover, a new measure for the evaluation of system performance has been introduced in order to compare three different techniques (flat, hierarchical with proper training sets, hierarchical with hierarchical training sets). The method has been implemented in the context of a client-server application, named WebClassII. Results show that for hierarchical techniques it is better to use hierarchical training sets.
Hierarchical Classification of Documents with Error Control
, 2001
Cited by 11 (0 self)
Classification is a function that matches a new object with one of the predefined classes. Document classification is characterized by the large number of attributes involved in the objects (documents). The traditional method of building a single classifier to do all the classification work would incur a high overhead. Hierarchical classification is a more efficient method: instead of a single classifier, we use a set of classifiers distributed over a class taxonomy, one for each internal node. However, once a misclassification occurs at a high-level class, it may result in a class that is far from the correct one. An existing approach to coping with this problem requires terms also to be arranged hierarchically. In this paper, instead of overhauling the classifier itself, we propose mechanisms to detect misclassification and take appropriate actions. We then discuss an alternative that masks the misclassification based on a well-known software fault tolerance technique. Our experiments show our algorithms represent a good trade-off between speed and accuracy in most applications. Keywords: Hierarchical document classification, naive Bayesian classifier, error control, class taxonomy, parallel algorithm
Decision-tree induction from time-series data based on standard-example split test
- In Proceedings of the 20th International Conference on Machine Learning (ICML03)
, 2003
Cited by 10 (3 self)
This paper proposes a novel decision tree for a data set with time-series attributes. Our time-series tree has a value (i.e. a time sequence) of a time-series attribute in its internal node, and splits examples based on dissimilarity between a pair of time sequences. Our method selects, for a split test, a time sequence which exists in data by exhaustive search based on class and shape information. Experimental results confirm that our induction method constructs comprehensive and accurate decision trees. Moreover, a medical application shows that our time-series tree is promising for knowledge discovery.
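The split described in this abstract, where an internal node holds a time sequence and examples are divided by their dissimilarity to it, can be sketched as follows. This is a rough illustration only: the Euclidean distance, the fixed threshold, and the function names are assumptions introduced here, and the abstract does not state which dissimilarity measure or split criterion the paper uses.

```python
import math


def euclidean(a, b):
    """Stand-in dissimilarity; assumes both sequences have the same length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_by_standard_example(examples, standard, threshold, dissimilarity=euclidean):
    """Partition (sequence, label) pairs by dissimilarity to a reference sequence.

    Examples closer than `threshold` to `standard` go to the first list,
    all others to the second.
    """
    near, far = [], []
    for seq, label in examples:
        target = near if dissimilarity(seq, standard) < threshold else far
        target.append((seq, label))
    return near, far


# Hypothetical toy data: the reference sequence is itself taken from the data.
examples = [([1, 2, 3], "a"), ([1, 2, 4], "a"), ([9, 9, 9], "b")]
near, far = split_by_standard_example(examples, standard=[1, 2, 3], threshold=2.0)
print([label for _, label in near], [label for _, label in far])
```

A tree learner in this style would try each sequence in the data as the reference and keep the split that best separates the classes; that search is omitted here.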
Learning Accurate and Concise Naive Bayes Classifiers from Attribute Value Taxonomies and Data
In many application domains, there is a need for learning algorithms that can effectively exploit attribute value taxonomies (AVT) - hierarchical groupings of attribute values - to learn compact, comprehensible, and accurate classifiers from data - including data that are partially specified. This paper describes AVT-NBL, a natural generalization of the Naive Bayes learner (NBL), for learning classifiers from AVT and data. Our experimental results show that AVT-NBL is able to generate classifiers that are substantially more compact and more accurate than those produced by NBL on a broad range of data sets with different percentages of partially specified values. We also show that AVT-NBL is more efficient in its use of training data: AVT-NBL produces classifiers that outperform those produced by NBL using substantially fewer training examples.
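As a small illustration of what an attribute value taxonomy makes possible, the Python sketch below counts each observed value at its own node and at every ancestor, so that a partially specified value (an internal node of the taxonomy) still contributes to the statistics. The taxonomy, the attribute, and the counting function are hypothetical; this is not the AVT-NBL algorithm, whose details the abstract does not give.

```python
from collections import Counter

# Hypothetical attribute value taxonomy for a "degree" attribute: child -> parent.
AVT_PARENT = {
    "MS": "graduate", "PhD": "graduate",
    "BS": "undergraduate", "BA": "undergraduate",
    "graduate": "any", "undergraduate": "any",
}


def counts_with_ancestors(observed_values):
    """Count each observed value at its own node and at every ancestor node."""
    counts = Counter()
    for value in observed_values:
        node = value
        counts[node] += 1
        while node in AVT_PARENT:
            node = AVT_PARENT[node]
            counts[node] += 1
    return counts


# "graduate" is a partially specified value: the exact degree is unknown.
counts = counts_with_ancestors(["MS", "PhD", "graduate", "BS"])
print(counts["graduate"], counts["undergraduate"], counts["any"])
```

Counts at the more abstract levels are what would let a learner trade detail for compactness by working with "graduate" instead of separate "MS" and "PhD" values.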
Under consideration for publication in Knowledge and Information
In many application domains, there is a need for learning algorithms that can effectively exploit attribute value taxonomies (AVT) - hierarchical groupings of attribute values - to learn compact, comprehensible, and accurate classifiers from data - including data that are partially specified. This paper describes AVT-NBL, a natural generalization of the Naive Bayes learner (NBL), for learning classifiers from AVT and data. Our experimental results show that AVT-NBL is able to generate classifiers that are substantially more compact and more accurate than those produced by NBL on a broad range of data sets with different percentages of partially specified values. We also show that AVT-NBL is more efficient in its use of training data: AVT-NBL produces classifiers that outperform those produced by NBL using substantially fewer training examples.

