Results 1-10 of 22
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
1998
Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
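The building blocks named in this abstract can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the function names are hypothetical, and the abstract does not state the combined measure, so the weighted sum below (squared Euclidean distance on numeric attributes plus a weight gamma times the mismatch count) is an assumption.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(objects):
    """Frequency-based mode: the most frequent category of each attribute.

    Expects a non-empty list of equal-length tuples.
    """
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def assign(objects, modes):
    """Give each object the index of its nearest mode (ties go to the lowest index)."""
    return [
        min(range(len(modes)), key=lambda j: matching_dissimilarity(obj, modes[j]))
        for obj in objects
    ]


def combined_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """ASSUMED form of a mixed-type measure: numeric part plus weighted mismatches."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


if __name__ == "__main__":
    cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
    print(cluster_mode(cluster))
    print(assign(cluster, [("red", "small"), ("blue", "large")]))
```

A full k-modes loop would alternate `assign` and `cluster_mode` until the assignments stop changing, in the same way k-means alternates assignment and mean updates.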
Combining multiple clusterings using evidence accumulation
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
Cited by 51 (5 self)
We explore the idea of evidence accumulation (EAC) for combining the results of multiple clusterings. First, a clustering ensemble, a set of object partitions, is produced. Given a data set (n objects or patterns in d dimensions), different ways of producing data partitions are: (1) applying different clustering algorithms, and (2) applying the same clustering algorithm with different values of parameters or initializations. Further, combinations of different data representations (feature spaces) and clustering algorithms can also provide a multitude of significantly different data partitionings. We propose a simple framework for extracting a consistent clustering, given the various partitions in a clustering ensemble. According to the EAC concept, each partition is viewed as independent evidence of data organization, individual data partitions being combined, based on a voting mechanism, to generate a new n × n similarity matrix between the n patterns. The final data partition of the n patterns is obtained by applying a hierarchical agglomerative clustering algorithm on this matrix. We have developed a theoretical framework for the analysis of the proposed clustering combination strategy and its evaluation, based on the concept of mutual information between data partitions. Stability of the results is evaluated using bootstrapping techniques. A detailed discussion of an evidence accumulation-based clustering algorithm, using a split-and-merge strategy based on the K-means clustering algorithm, is presented. Experimental results of the proposed method on several synthetic and real data sets are compared with other combination strategies, and with individual clustering results produced by well-known clustering algorithms.
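The voting step described in this abstract can be illustrated with a short Python sketch. It is not the authors' code: the function names are hypothetical, and the abstract only says "a hierarchical agglomerative clustering algorithm", so the use of single-link clustering cut at a similarity threshold of 0.5 is an assumption made to keep the example small. Cutting a single-link hierarchy at a threshold is the same as taking connected components of the thresholded similarity graph, which is what the second function does.

```python
def co_association(partitions):
    """n x n matrix: fraction of partitions in which objects i and j share a cluster."""
    n = len(partitions[0])
    votes = [[0] * n for _ in range(n)]
    for labels in partitions:
        if len(labels) != n:
            raise ValueError("every partition must label all n objects")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    votes[i][j] += 1
    return [[v / len(partitions) for v in row] for row in votes]


def single_link_cut(similarity, threshold=0.5):
    """ASSUMED final step: merge objects whose similarity exceeds the threshold."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] > threshold:
                parent[find(i)] = find(j)

    # Relabel component roots as consecutive cluster ids in order of appearance.
    ids = {}
    return [ids.setdefault(find(i), len(ids)) for i in range(n)]


if __name__ == "__main__":
    ensemble = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 0, 0]]
    matrix = co_association(ensemble)
    print(single_link_cut(matrix))
```

Note that cluster labels need not agree across the input partitions: only co-membership is counted, which is why the third partition above can use swapped labels.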
Parametric and Nonparametric Unsupervised Cluster Analysis
Pattern Recognition, 1996
Cited by 50 (6 self)
Much work has been published on methods for assessing the probable number of clusters or structures within unknown data sets. This paper aims to look in more detail at two methods: a broad parametric method, based around the assumption of Gaussian clusters, and a nonparametric method which utilises scale-space filtering to extract robust structures within a data set. It is shown that, whilst both methods are capable of determining cluster validity for data sets in which clusters tend towards a multivariate Gaussian distribution, the parametric method inevitably fails for clusters which have a non-Gaussian structure, whilst the scale-space method is more robust.
Key words: Cluster analysis, maximum likelihood methods, scale-space filtering, probability density estimation.
1 Introduction
Most scientific disciplines generate experimental data from an observed system about which we may have little understanding of the data generating function. The notion that com...
Hierarchical Bayesian clustering for automatic text classification
In IJCAI, 1995
Cited by 23 (2 self)
Text classification, the grouping of texts into several clusters, has been used as a means of improving both the efficiency and the effectiveness of text retrieval/categorization. In this paper we propose a hierarchical clustering algorithm that constructs a set of clusters having the maximum Bayesian posterior probability, the probability that the given texts are classified into clusters. We call the algorithm Hierarchical Bayesian Clustering (HBC). The advantages of HBC are experimentally verified from several viewpoints. (1) HBC can reconstruct the original clusters more accurately than do other non-probabilistic algorithms. (2) When
Scale-based Clustering using the Radial Basis Function Network
IEEE Trans. Neural Networks, 1996
Cited by 17 (3 self)
This paper shows how scale-based clustering can be done using the Radial Basis Function (RBF) Network, with the RBF width as the scale parameter and a dummy target as the desired output. The technique suggests the "right" scale at which the given data set should be clustered, thereby providing a solution to the problem of determining the number of RBF units and the widths required to get a good network solution. The network compares favorably with other standard techniques on benchmark clustering examples. Properties that are required of non-Gaussian basis functions, if they are to serve in alternative clustering networks, are identified. The work on the whole points out an important role played by the width parameter in RBFN, when observed over several scales, and provides a fundamental link to the scale-space theory developed in computational vision. The work described here is supported in part by the National Science Foundation under grant ECS9307632 and in part by ONR Contract N...
Cluster Analysis and Workload Classification
SIGMETRICS Performance Evaluation Review, 1993
Cited by 7 (0 self)
Clustering techniques are widely recommended tools for workload classification. The k-means algorithm is widely accepted as the "standard" technique of detecting workload classes automatically from measurement data. This paper examines the validity of the obtained workload classes when the current system and workload are analyzed by a queueing network model and mean value analysis. Our results, based on one week's accounting data of a VAX 8600, indicate that the results of queueing network analysis are not stable when the classes of workload are constructed through the k-means algorithm. Therefore, we cannot recommend that the most widely used clustering technique be used in any workload characterization study without careful validation.
1. Introduction
Cluster analysis has been the most popular statistical technique for automatically dividing the workload into workload classes. Since the late 1970's, a number of authors have published numerous papers on clustering in workload c...
Structure in Document Browsing Spaces
1996
Cited by 6 (1 self)
This study proposes and evaluates a document analysis strategy for information retrieval with visualization interfaces. The goal of document analysis is to highlight structure that helps searchers make their own relevance judgments, rather than to shift judgments from humans onto machines. Searchers can investigate that structure with tools for visualizing multidimensional data. The structure of interest in this study is discrimination of documents into clusters. Two diagnostic measures may inform selection of document attributes for cluster discrimination: term discrimination value and the sum of pairwise term-vector correlations. A series of experiments tests the reliability of these measures for predicting clustering tendency, as measured by proportion of elongated triples and skewness of the distribution of document dissimilarities.
Using Boundary Methods for Estimating Class Separability
1998
Cited by 3 (0 self)
Designing and operating a classification system becomes drastically more difficult as the data dimensionality increases. A feature extraction (FE) step is often used to reduce the data dimensionality to mitigate this complexity. Thus FE may be viewed as a form of data compression whose objective is to minimize the consequences that reducing the dimensionality has on class separability. This differs from the normal objective of data compression, which is to minimize distortion, typically measured in the mean squared sense. It is often unclear whether the resulting features from an FE method provide an optimum set for classification. Further, extracting discrimination features from finite data sets increases in difficulty as the dimensionality of the data increases. The need for features to reduce complexity, combined with the difficulties of extracting features, justifies the need for studying ways of ranking feature sets for classification, i.e. feature set evaluation (FSE) techniques. This ...
Segmentation and Visualization of Multispectral Medical Images With Interactive Control of Parameters for a Set of Unsupervised Classifiers
Proc. SPIE - The International Society for Optical Engineering, 1995
Cited by 2 (1 self)
Multispectral classification uses registered 3D image volumes from more than one imaging modality or from different sequences within a modality to classify tissues within those volumes. The complementary information contained within the different image volumes may allow for the separation of tissue class types in multidimensional feature space when the same tissue classes would be indistinct using just one image volume. When segmentation is complete, attributes of these classes may be determined (e.g., volumes), or the classes may be visualized as objects in 3D. There are two main types of classification algorithms: supervised and unsupervised. Unsupervised classifiers offer the promise of totally automated classification of tissue types and calculation of tissue volumes and other tissue properties in medical images. This would have two benefits: (1) elimination of the time-consuming process of manual segmentation by medical experts, and (2) ensuring reproducible results. While accur...