Results 1–10 of 13
Impact of similarity measures on webpage clustering
 in Workshop on Artificial Intelligence for Web Search (AAAI)
, 2000
"... Abstract Clustering of web documents enables (semi)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the p ..."
Abstract

Cited by 205 (26 self)
 Add to MetaCart
(Show Context)
Abstract Clustering of web documents enables (semi)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo! that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hypergraph partitioning, generalized k-means, weighted graph partitioning), on high-dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
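The two winning measures from this abstract can be sketched for real-valued document vectors: cosine is invariant to vector magnitude, while the extended (Tanimoto) Jaccard coefficient also accounts for length, which is one reason the two can rank document pairs differently. A minimal sketch with made-up vectors:

```python
import numpy as np

def cosine_sim(x, y):
    # Cosine similarity: dot product of length-normalized vectors,
    # so it is invariant to document length.
    return float(x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))

def extended_jaccard(x, y):
    # Extended (Tanimoto) Jaccard for real-valued vectors:
    # x.y / (||x||^2 + ||y||^2 - x.y); sensitive to magnitude.
    dot = float(x @ y)
    return dot / (float(x @ x) + float(y @ y) - dot)

x = np.array([1.0, 2.0, 0.0, 1.0])
y = np.array([2.0, 4.0, 0.0, 2.0])   # same direction, twice the length
print(cosine_sim(x, y))              # 1.0: cosine ignores the scaling
print(extended_jaccard(x, y))        # ~0.667: extended Jaccard penalizes it
```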
Investigation of the random forest framework for classification of hyperspectral data
"... Abstract—Statistical classification of byperspectral data is challenging because the inputs are high in dimension and represent multiple classes that are sometimes quite mixed, while the amount and quality of ground truth in the form of labeled data is typically limited. The resulting classifiers ar ..."
Abstract

Cited by 54 (12 self)
 Add to MetaCart
(Show Context)
Abstract—Statistical classification of hyperspectral data is challenging because the inputs are high in dimension and represent multiple classes that are sometimes quite mixed, while the amount and quality of ground truth in the form of labeled data is typically limited. The resulting classifiers are often unstable and have poor generalization. This paper investigates two approaches based on the concept of random forests of classifiers implemented within a binary hierarchical multiclassifier system, with the goal of achieving improved generalization of the classifier in analysis of hyperspectral data, particularly when the quantity of training data is limited. A new classifier is proposed that incorporates bagging of training samples and adaptive random subspace feature selection within a binary hierarchical classifier (BHC), such that the number of features that is selected at each node of the tree is dependent on the quantity of associated training data. Results are compared to a random forest implementation based on the framework of classification and regression trees. For both methods, classification results obtained from experiments on data acquired
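The bagging-plus-adaptive-subspace idea at each BHC node can be sketched as below; the rule tying subspace size to the amount of training data is a hypothetical stand-in for illustration, not the paper's formula:

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_sample(X, y):
    # Bagging: bootstrap-sample the node's training data with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def node_feature_subset(n_train, n_features, base=10):
    # Hypothetical adaptive rule: draw more spectral bands at nodes with
    # more labeled samples, capped at the full band count. The paper ties
    # subspace size to training-data quantity; the exact formula here is
    # invented for illustration.
    k = min(n_features, base + int(np.sqrt(n_train)))
    return rng.choice(n_features, size=k, replace=False)

bands = node_feature_subset(n_train=100, n_features=200)
# 20 distinct band indices for this node's random-subspace classifier
```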
Integrating support vector machines in a hierarchical output decomposition framework
 In 2004 International Geosci. and Remote Sens. Symposium
, 2004
"... Abstract — This paper presents a new approach called Hierarchical Support Vector Machines (HSVM), to address multiclass problems. The method solves a series of maxcut problems to hierarchically and recursively partition the set of classes into twosubsets, till pure leaf nodes that have only one c ..."
Abstract

Cited by 18 (5 self)
 Add to MetaCart
(Show Context)
Abstract — This paper presents a new approach, called Hierarchical Support Vector Machines (HSVM), to address multiclass problems. The method solves a series of max-cut problems to hierarchically and recursively partition the set of classes into two subsets, until pure leaf nodes that have only one class label are obtained. An SVM is applied at each internal node to construct the discriminant function for a binary metaclass classifier. Because max-cut unsupervised decomposition uses distance measures to investigate the natural class groupings, HSVM has a fast and intuitive SVM training process that requires little tuning and yields both high accuracy levels and good generalization. The HSVM method was applied to Hyperion hyperspectral data collected over the Okavango Delta of Botswana. Classification accuracies and generalization capability are compared to those achieved by the Best Basis Binary Hierarchical Classifier, a Random Forest CART binary decision tree classifier and Binary Hierarchical Support Vector Machines.
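The max-cut class partitioning can be illustrated with a brute-force version over a small class-distance matrix: maximizing the between-group distance places dissimilar classes on opposite sides of the split, keeping similar classes together. This exhaustive search is feasible only for a handful of classes (the paper's method scales better), and the distance values are invented:

```python
from itertools import combinations
import numpy as np

def max_cut_partition(dist):
    # Exhaustive max-cut over class labels: choose the two-subset split
    # that maximizes the summed between-group distance.
    C = dist.shape[0]
    labels = list(range(C))
    best, best_split = -1.0, None
    for r in range(1, C // 2 + 1):
        for left in combinations(labels, r):
            right = [c for c in labels if c not in left]
            cut = sum(dist[i, j] for i in left for j in right)
            if cut > best:
                best, best_split = cut, (set(left), set(right))
    return best_split

# Class-mean distances for 4 classes: 0 and 1 are close, 2 and 3 are close.
d = np.array([[0, 1, 9, 9],
              [1, 0, 9, 9],
              [9, 9, 0, 1],
              [9, 9, 1, 0]], dtype=float)
left, right = max_cut_partition(d)   # splits {0, 1} from {2, 3}
```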
An approach to parallel growing and training of neural networks
 in Proc. 2000 IEEE Int. Symp. Intell. Signal Processing Commun. Syst. (ISPACS 2000)
"... Abstract—In order to find an appropriate architecture for a largescale realworld application automatically and efficiently, a natural method is to divide the original problem into a set of subproblems. In this paper, we propose a simple neuralnetwork task decomposition method based on output para ..."
Abstract

Cited by 14 (10 self)
 Add to MetaCart
Abstract—In order to find an appropriate architecture for a large-scale real-world application automatically and efficiently, a natural method is to divide the original problem into a set of subproblems. In this paper, we propose a simple neural-network task decomposition method based on output parallelism. By using this method, a problem can be divided flexibly into several subproblems as chosen, each of which is composed of the whole input vector and a fraction of the output vector. Each module (for one subproblem) is responsible for producing a fraction of the output vector of the original problem. The hidden structures for the original problem’s output units are decoupled. These modules can be grown and trained in parallel on parallel processing elements. Incorporated with a constructive learning algorithm, our method does not require excessive computation or any prior knowledge concerning decomposition. The feasibility of output parallelism
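The core decomposition step (each module sees the full input but learns only a slice of the outputs) amounts to splitting the target matrix column-wise; the dimensions below are made up for illustration:

```python
import numpy as np

def split_outputs(Y, n_modules):
    # Output parallelism: each module keeps the whole input vector but
    # is trained on only a contiguous slice of the output vector.
    return np.array_split(Y, n_modules, axis=1)

Y = np.arange(12).reshape(2, 6)   # 2 samples, 6 output units
parts = split_outputs(Y, 3)
# Three independent sub-problems of 2 output units each, trainable in
# parallel; horizontally concatenating their predictions recovers Y.
```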
Adaptive Feature Selection for Hyperspectral Data Analysis
 Proceedings of SPIE, The International Society for Optical Engineering, vol. 5238, Image and Signal Processing for Remote Sensing IX
, 2004
"... Abstract High dimensional inputs coupled with scarcity of labeled data are among the greatest challenges for classification of hyperspectral data. These problems are exacerbated if the number of classes is large. High dimensional output classes may be handled effectively by decomposition into multi ..."
Abstract

Cited by 13 (0 self)
 Add to MetaCart
(Show Context)
Abstract High-dimensional inputs coupled with scarcity of labeled data are among the greatest challenges for classification of hyperspectral data. These problems are exacerbated if the number of classes is large. High-dimensional output classes may be handled effectively by decomposition into multiple two-class problems, where each subproblem is solved using a suitable binary classifier, and the outputs of this collection of classifiers are combined in a suitable manner to obtain the answer to the original multiclass problem. This approach is taken by the binary hierarchical classifier (BHC). The advantages of the BHC for output decomposition can be further exploited for hyperspectral data analysis by integrating a feature selection methodology with the classifier. Building upon the previously developed best bases BHC algorithm with greedy feature selection, a new method is developed that selects a subset of band groups within metaclasses using reactive tabu search. Experimental results obtained from analysis of Hyperion data acquired over the Okavango Delta in Botswana are superior to those of the greedy feature selection approach and more robust than either the original BHC or the BHC with greedy feature selection.
A Hierarchical Multiclassifier System for Hyperspectral Data Analysis
, 2000
"... Many real world classification problems involve high dimensional inputs and a large number of classes. Feature extraction and modular learning approaches can be used to simplify such problems. In this paper, we introduce a hierarchical multiclassifier paradigm in which a Cclass problem is recursive ..."
Abstract

Cited by 12 (7 self)
 Add to MetaCart
(Show Context)
Many real-world classification problems involve high-dimensional inputs and a large number of classes. Feature extraction and modular learning approaches can be used to simplify such problems. In this paper, we introduce a hierarchical multiclassifier paradigm in which a C-class problem is recursively decomposed into C − 1 two-class problems. A generalized modular learning framework is used to partition a set of classes into two disjoint groups called metaclasses. The coupled problems of finding a good partition and of searching for a linear feature extractor that best discriminates the resulting two metaclasses are solved simultaneously at each stage of the recursive algorithm. This results in a binary tree whose leaf nodes represent the original C classes. The proposed hierarchical multiclassifier architecture was used to classify 12 types of landcover from 183-dimensional hyperspectral data. The classification accuracy was significantly improved by 4 to 10% relative to other...
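The tree arithmetic in this abstract is easy to verify: recursively splitting a class set until singleton leaves always yields C − 1 internal (two-class) nodes. The sketch below uses a naive halving split in place of the paper's learned partition:

```python
def build_bhc_tree(classes):
    # Recursively split a class set into two meta-classes until each leaf
    # holds one class. The real BHC chooses the partition jointly with a
    # linear feature extractor; this sketch just halves the list.
    if len(classes) == 1:
        return classes[0]
    mid = len(classes) // 2
    return (build_bhc_tree(classes[:mid]), build_bhc_tree(classes[mid:]))

def count_internal(node):
    # Each internal node corresponds to one two-class problem.
    if not isinstance(node, tuple):
        return 0
    return 1 + count_internal(node[0]) + count_internal(node[1])

tree = build_bhc_tree(list(range(12)))   # 12 land-cover classes
print(count_internal(tree))              # 11 = C - 1 binary classifiers
```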
An empirical comparison of hierarchical vs. two-level approaches to multiclass problems
 in Lecture Notes in Computer Science
, 2004
"... Abstract. The ECOC framework provides a powerful and popular method for solving multiclass problems using a multitude of binary classifiers. We had recently introduced the Binary Hierarchical Classifier (BHC) architecture that addresses multiclass classification problems using a set of binary classi ..."
Abstract

Cited by 10 (4 self)
 Add to MetaCart
(Show Context)
Abstract. The ECOC framework provides a powerful and popular method for solving multiclass problems using a multitude of binary classifiers. We had recently introduced the Binary Hierarchical Classifier (BHC) architecture that addresses multiclass classification problems using a set of binary classifiers arranged as a tree. Unlike ECOCs, the BHC groups classes according to their natural affinities in order to make each binary problem easier. However, it cannot exploit the powerful error-correcting properties of an ECOC ensemble, which can provide good results even when individual classifiers are weak. Using well-tuned SVMs as the base classifiers, we provide a comparison of these two diverse approaches using a variety of datasets. The results show that while there is no clear advantage to either technique in terms of classification accuracy, BHCs typically achieve this performance using fewer classifiers, and have the added advantage of automatically generating a hierarchy of classes. Such hierarchies often provide a valuable tool for extracting domain knowledge, and achieve better results when coarser granularity of the output space is acceptable.
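The error-correcting property that the BHC gives up can be seen in a toy ECOC decoder: with codewords at pairwise Hamming distance 4, any single binary classifier can be wrong and nearest-codeword decoding still recovers the class. The codeword matrix below is invented for illustration:

```python
import numpy as np

# Toy ECOC: each of 4 classes gets a 6-bit codeword; each column defines
# one binary classifier. Decoding picks the class whose codeword is
# nearest (in Hamming distance) to the vector of binary outputs.
codes = np.array([[0, 0, 0, 0, 0, 0],
                  [0, 1, 1, 1, 1, 0],
                  [1, 0, 1, 1, 0, 1],
                  [1, 1, 0, 0, 1, 1]])

def ecoc_decode(bits):
    # Nearest codeword by Hamming distance.
    return int(np.argmin((codes != bits).sum(axis=1)))

noisy = codes[2].copy()
noisy[0] ^= 1                 # one of the six classifiers errs
print(ecoc_decode(noisy))     # 2: the error is corrected
```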
Impact of Similarity Measures on Webpage Clustering
 in Workshop on Artificial Intelligence for Web Search (AAAI 2000)
, 2000
"... Clustering of web documents enables (semi)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, ..."
Abstract

Cited by 6 (0 self)
 Add to MetaCart
Clustering of web documents enables (semi)automated categorization, and facilitates certain types of search. Any clustering method has to embed the documents in a suitable similarity space. While several clustering methods and the associated similarity measures have been proposed in the past, there is no systematic comparative study of the impact of similarity metrics on cluster quality, possibly because the popular cost criteria do not readily translate across qualitatively different metrics. We observe that in domains such as Yahoo that provide a categorization by human experts, a useful criterion for comparisons across similarity metrics is indeed available. We then compare four popular similarity measures (Euclidean, cosine, Pearson correlation and extended Jaccard) in conjunction with several clustering techniques (random, self-organizing feature map, hypergraph partitioning, generalized k-means, weighted graph partitioning), on high-dimensional sparse data representing web documents. Performance is measured against a human-imposed classification into news categories and industry categories. We conduct a number of experiments and use t-tests to assure statistical significance of results. Cosine and extended Jaccard similarities emerge as the best measures to capture human categorization behavior, while Euclidean performs poorest. Also, weighted graph partitioning approaches are clearly superior to all others.
Clustering Guidance And Quality Evaluation Using RelationshipBased Visualization
, 2000
"... In this paper, we introduce Clusion, a clustering visualization toolkit, that facilitates data exploration and validation of clustering results. Clusion is especially suitable for very highdimensional spaces since it is based on pairwise relationships of the data points rather than their original ..."
Abstract

Cited by 3 (0 self)
 Add to MetaCart
(Show Context)
In this paper, we introduce Clusion, a clustering visualization toolkit that facilitates data exploration and validation of clustering results. Clusion is especially suitable for very high-dimensional spaces since it is based on pairwise relationships of the data points rather than their original vector representation. Given a similarity semantic, guidance to answer the questions 'What is the right number of clusters?', 'Which cluster should be split?', and 'Which clusters should be merged?' can be drawn from the visualization. The visualization also induces a quality metric not biased to increase with the number of clusters. We present examples from market-basket data characterized by high-dimensional (d > 10,000), highly sparse customer-product matrices with positive ordinal attribute values and a significant amount of outliers.
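Clusion's central operation, reordering the pairwise similarity matrix by cluster label so that a good clustering shows up as bright diagonal blocks, can be sketched as follows; the similarity values below are invented:

```python
import numpy as np

def clusion_matrix(S, labels):
    # Reorder a pairwise similarity matrix so points with the same cluster
    # label are adjacent; rendered as an image, a good clustering shows
    # high-similarity blocks on the diagonal and dark blocks elsewhere.
    order = np.argsort(labels, kind="stable")
    return S[np.ix_(order, order)]

# Two clean clusters with interleaved labels [0, 1, 0, 1]:
S = np.array([[1.0, 0.1, 0.9, 0.1],
              [0.1, 1.0, 0.1, 0.9],
              [0.9, 0.1, 1.0, 0.1],
              [0.1, 0.9, 0.1, 1.0]])
M = clusion_matrix(S, np.array([0, 1, 0, 1]))
# M now has two 2x2 high-similarity blocks on the diagonal.
```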