Results 1-9 of 9
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 157 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
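The ideas named in this abstract (simple matching dissimilarity, modes in place of means, frequency-based mode updates, and a combined measure for mixed data) can be illustrated with a minimal Python sketch. This is not the authors' implementation: it uses a plain batch assign-then-update loop, all function names are hypothetical, and the weight `gamma` and the use of squared Euclidean distance for the numeric part are assumptions made for illustration only.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two records differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(records):
    """Frequency-based update: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*records))


def k_modes(records, initial_modes, max_iter=20):
    """Batch k-modes sketch: assign each record to its nearest mode, then
    recompute every mode, until the assignment stops changing."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Assumed combined measure for mixed data: squared Euclidean distance on
    numeric attributes plus gamma times simple matching on categorical ones."""
    numeric = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric + gamma * matching_dissimilarity(cat_a, cat_b)


records = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
labels, modes = k_modes(records, initial_modes=[records[0], records[3]])
print(labels)
print(modes)
```

The toy records and the choice of the first and fourth records as initial modes are made up for the example; in practice the result depends on the initialisation.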
Modeling Proportional Membership in Fuzzy Clustering
, 2003
Abstract

Cited by 7 (4 self)
To provide feedback from a cluster structure to the data from which it has been determined, we propose a framework for mining typological structures based on a fuzzy clustering model of how the data are generated from a cluster structure. To relate data entities to cluster prototypes, we assume that the observed entities share parts of the prototypes in such a way that the membership of an entity to a cluster expresses the proportion of the cluster's prototype reflected in the entity (proportional membership). In the generic version of the model, any entity may independently relate to any prototype, which is similar to the assumption underlying the fuzzy c-means criterion. The model is referred to as fuzzy clustering with proportional membership (FCPM). Several versions of the model relaxing the generic assumptions are presented and alternating minimization techniques for them are developed. The results of experimental studies of FCPM versions and the fuzzy c-means algorithm are presented and discussed, especially addressing the issues of fitting the underlying clustering model. An example is given with data in the medical field in which our approach is shown to be better suited than more conventional methods.
Structure in Document Browsing Spaces
, 1996
Abstract

Cited by 6 (1 self)
This study proposes and evaluates a document analysis strategy for information retrieval with visualization interfaces. The goal of document analysis is to highlight structure that helps searchers make their own relevance judgments, rather than to shift judgments from humans onto machines. Searchers can investigate that structure with tools for visualizing multidimensional data. The structure of interest in this study is discrimination of documents into clusters. Two diagnostic measures may inform selection of document attributes for cluster discrimination: term discrimination value and the sum of pairwise term-vector correlations. A series of experiments tests the reliability of these measures for predicting clustering tendency, as measured by proportion of elongated triples and skewness of the distribution of document dissimilarities.
K-Subspace Clustering
Abstract

Cited by 1 (0 self)
The widely used K-means clustering deals with ball-shaped (spherical Gaussian) clusters. In this paper, we extend K-means clustering to accommodate extended clusters in subspaces, such as line-shaped clusters, plane-shaped clusters, and ball-shaped clusters. The algorithm retains much of the flavor of K-means clustering: it is easy to implement and fast to converge. A model selection procedure is incorporated to determine the cluster shape. As a result, our algorithm can recognize a wide range of subspace clusters studied in the literature, as well as global ball-shaped clusters (living in all dimensions). We carry out extensive experiments on both synthetic and real-world datasets, and the results demonstrate the effectiveness of our algorithm.
Clustering Multiway Data via Adaptive Subspace Iteration
Abstract

Cited by 1 (1 self)
Clustering multiway data is an important research topic because of the rich intrinsic structure of real-world datasets. In this paper, we propose a subspace clustering algorithm for multiway data, called ASIT (Adaptive Subspace Iteration on Tensor). ASIT is a special version of High Order SVD (HOSVD), and it simultaneously performs subspace identification using 2DSVD and data clustering using K-means. The experimental results on synthetic data and real-world data demonstrate the effectiveness of ASIT.
Stability-Based Cluster Analysis Applied To Microarray Data
 in Proceedings of the Seventh International Symposium on Signal Processing and its Applications
, 2003
Abstract
This paper studies the estimation of the number of clusters using the so-called stability-based approach, where clusters obtained for two subsets of the data set are compared via a similarity index and the decision regarding the number of clusters is taken based on the statistics of the index over randomly selected subsets. We introduce a new similarity index, and analyze the consistency of the estimator of the number of classes when the k-means algorithm is used in conjunction with it. Various similarity indices are experimentally evaluated when comparing the "true" data partition with the partition obtained at each level of a hierarchical clustering tree. Finally, experimental results with real data are reported for a glioma microarray dataset.
A Separability Index for Distance-based Clustering and Classification Algorithms
Abstract
We propose a separability index that quantifies the degree of difficulty in a hard clustering problem under assumptions of a multivariate Gaussian distribution for each cluster. A preliminary index is first defined and several of its properties are explored both theoretically and numerically. Adjustments are then made to this index so that the final refinement is also interpretable in terms of the Adjusted Rand Index between a true grouping and its hypothetical idealized clustering, taken as a surrogate of clustering complexity. Our derived index is used to develop a data-simulation algorithm that generates samples according to the prescribed value of the index. This algorithm is particularly useful for systematically generating datasets with varying degrees of clustering difficulty, which can be used to evaluate the performance of different clustering algorithms. The index is also shown to be useful in providing a summary of the distinctiveness of classes in grouped datasets.
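The abstract above relates its index to the Adjusted Rand Index between a true grouping and a clustering. As background only, here is a minimal Python sketch of the standard pair-counting form of the Adjusted Rand Index. It does not implement the separability index itself; the function name is hypothetical, and returning 1.0 for degenerate inputs is an assumption of this sketch.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two labelings of the same objects."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("labelings must have the same length")
    if n < 2:
        return 1.0  # assumed convention: no pairs to disagree on

    # Pairs grouped together in both labelings, in the first, in the second.
    together_both = sum(
        comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values()
    )
    together_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    together_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = together_true * together_pred / comb(n, 2)
    maximum = (together_true + together_pred) / 2
    if maximum == expected:
        return 1.0  # assumed convention for trivial partitions
    return (together_both - expected) / (maximum - expected)


# Identical partitions up to relabeling score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that never agree on a pair score below zero.
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

The index is 1 for identical partitions, close to 0 for agreement at chance level, and can be negative when agreement is worse than chance.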