Results 1 – 10 of 19
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
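The combined dissimilarity that k-prototypes builds on — squared Euclidean distance on the numeric attributes plus a user-chosen weight gamma times the number of categorical mismatches — can be sketched roughly as follows. The function name, the tuple representation, and the gamma default are illustrative assumptions, not the paper's notation:

```python
def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """k-prototypes-style combined dissimilarity: squared Euclidean
    distance on the numeric part plus gamma times the count of
    categorical mismatches (simple matching)."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric + gamma * categorical
```

The weight gamma balances the two scales: a larger gamma makes categorical disagreement count for more relative to numeric spread.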
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
 In Research Issues on Data Mining and Knowledge Discovery
, 1997
Abstract

Cited by 82 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of ...
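The k-modes loop described above — nearest-mode assignment under simple matching dissimilarity, then a frequency-based mode update, repeated until the modes stabilise — can be sketched like this. This is a minimal Python sketch; the function names and the stopping rule are illustrative assumptions:

```python
from collections import Counter

def simple_matching(x, y):
    """Number of attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))

def k_modes(data, modes, max_iter=100):
    """Assign each object to the nearest mode, then recompute each mode
    as the per-attribute most frequent category; stop when stable."""
    for _ in range(max_iter):
        clusters = [[] for _ in modes]
        for obj in data:
            j = min(range(len(modes)), key=lambda i: simple_matching(obj, modes[i]))
            clusters[j].append(obj)
        new_modes = [
            tuple(Counter(col).most_common(1)[0][0] for col in zip(*c)) if c else m
            for c, m in zip(clusters, modes)
        ]
        if new_modes == modes:
            break
        modes = new_modes
    return modes, clusters
```

Empty clusters keep their previous mode here; real implementations differ on that detail.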
An alternative extension of the k-means algorithm for clustering categorical data
 Int. J. Appl. Math. Comput. Sci
, 2004
Abstract

Cited by 14 (0 self)
Most of the earlier work on clustering has mainly been focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” to a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, the soybean disease and nursery databases.
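One way to realise a "cluster center" for categorical objects, in the spirit of the abstract, is to keep per-attribute relative frequencies and measure dissimilarity as one minus the frequency of the object's own category. This is a hedged illustration of the idea, not the paper's exact formulation:

```python
from collections import Counter

def categorical_center(cluster):
    """Per attribute, the relative frequency of each category among
    the cluster's members (a distribution-valued 'center')."""
    n = len(cluster)
    return [
        {cat: cnt / n for cat, cnt in Counter(col).items()}
        for col in zip(*cluster)
    ]

def dissim_to_center(obj, center):
    """Per attribute, 1 minus the frequency of the object's own
    category; an unseen category contributes the maximum, 1."""
    return sum(1.0 - freqs.get(val, 0.0) for val, freqs in zip(obj, center))
```

An object matching the majority category on every attribute scores low; one with unseen categories scores the number of attributes.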
Clustering Spatial Data in the Presence of Obstacles: a Density-Based Approach
 Sixth International Database Engineering and Applications Symposium (IDEAS 2002)
, 2002
Abstract

Cited by 5 (1 self)
Clustering spatial data is a well-known problem that has been extensively studied. Grouping similar data in large 2-dimensional spaces to find hidden patterns or meaningful subgroups has many applications such as satellite imagery, geographic information systems, medical image analysis, marketing, computer vision, etc. Although many methods have been proposed in the literature, very few have considered physical obstacles that may have significant consequences on the effectiveness of the clustering. Taking these constraints into account during the clustering process is costly, and the modeling of the constraints is paramount for good performance. In this paper, we investigate the problem of clustering in the presence of constraints such as physical obstacles and introduce a new approach to model these constraints using polygons. We also propose a strategy to prune the search space and reduce the number of polygons to test during clustering. We devise a density-based clustering algorithm, DBCluC, which takes advantage of our constraint modeling to efficiently cluster data objects while considering all physical constraints. The algorithm can detect clusters of arbitrary shape and is insensitive to noise, the input order, and the complexity of constraints. Its average running complexity is O(N log N), where N is the number of data points.
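Modeling obstacles as polygons typically reduces reachability between two points to a visibility test: the straight segment between them must cross no obstacle edge. A minimal sketch of such a test, using the standard orientation-based segment-intersection check (the function names are illustrative, and DBCluC's actual polygon reduction is more involved than this):

```python
def ccw(a, b, c):
    """Signed area test: >0 if a->b->c turns counter-clockwise."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def segments_intersect(p1, p2, q1, q2):
    """Proper crossing test for segments p1p2 and q1q2: each segment's
    endpoints must lie on opposite sides of the other segment."""
    d1, d2 = ccw(q1, q2, p1), ccw(q1, q2, p2)
    d3, d4 = ccw(p1, p2, q1), ccw(p1, p2, q2)
    return ((d1 > 0) != (d2 > 0)) and ((d3 > 0) != (d4 > 0))

def visible(p, q, obstacle_edges):
    """p and q are mutually reachable iff the segment pq crosses
    no obstacle (polygon) edge."""
    return not any(segments_intersect(p, q, a, b) for a, b in obstacle_edges)
```

In a density-based setting, this test gates whether a neighbor within the radius actually counts toward a point's neighborhood.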
Computation of Initial Modes for K-modes Clustering Algorithm using Evidence Accumulation
 IJCAI07
Abstract

Cited by 5 (0 self)
Abstract. Clustering accuracy of a partitional clustering algorithm for categorical data depends primarily on the choice of initial data points to instigate the clustering process, and hence the clustering results cannot be generated and repeated consistently. In this paper we present an approach to compute initial modes for the K-modes partitional clustering algorithm to cluster categorical data sets. Here we utilize the idea of evidence accumulation for combining the results of multiple clusterings. Initially, n F-dimensional data objects are decomposed into a large number of compact clusters; the K-modes algorithm performs this decomposition, with several clusterings obtained by N random initializations of the K-modes algorithm, and the modes thus obtained for every random initialization are stored in a ModePool, PN. The objective is to investigate the contribution of those data objects/patterns that are less vulnerable to the choice of random selection of modes and to choose the most diverse set of modes from the available ModePool that can be utilized as initial modes for the K-modes clustering algorithm. Experimentally we found that this method yields initial modes that are very similar to the actual/desired modes and gives consistent and better clustering results with less variance of error than the traditional method of choosing random modes.
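The pooling-and-selection idea can be sketched as: run k-modes from N random initialisations, accumulate every resulting mode in a pool, then pick the k most mutually dissimilar pooled modes as the initial modes. A rough Python sketch under those assumptions (the greedy farthest-first selection is one plausible reading of "most diverse", not necessarily the paper's method; run_kmodes is a stand-in for any k-modes implementation):

```python
import random

def simple_matching(x, y):
    """Count of attributes on which two categorical tuples disagree."""
    return sum(a != b for a, b in zip(x, y))

def build_mode_pool(data, k, n_runs, run_kmodes, seed=0):
    """Run k-modes n_runs times from random initial modes and pool
    every final mode. run_kmodes(data, initial_modes) -> final modes."""
    rng = random.Random(seed)
    pool = []
    for _ in range(n_runs):
        init = rng.sample(data, k)
        pool.extend(run_kmodes(data, init))
    return pool

def diverse_modes(pool, k):
    """Greedy farthest-first: start from the first pooled mode, then
    repeatedly add the pooled mode farthest from those already chosen."""
    chosen = [pool[0]]
    while len(chosen) < k:
        best = max(pool, key=lambda m: min(simple_matching(m, c) for c in chosen))
        chosen.append(best)
    return chosen
```

The chosen modes then seed a final, deterministic k-modes run.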
Fuzzy Clustering of Categorical Attributes and its Use in Analyzing Cultural Data
 International Journal of Computational Intelligence
, 2004
Abstract

Cited by 2 (1 self)
Abstract — We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters. Keywords — Categorical data, cultural data, fuzzy logic clustering, fuzzy c-modes, cluster validity index.
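The fuzzy partition in the second step assigns each object a graded membership in every cluster. A common fuzzy c-modes-style membership update, using simple matching dissimilarity and a fuzzifier m > 1, might look like this (an illustrative sketch; the zero-distance handling is one standard convention, not necessarily this paper's):

```python
def fuzzy_memberships(obj, modes, m=2.0):
    """Membership of one categorical object in each cluster:
    u_j proportional to (1/d_j)^(1/(m-1)), normalised to sum to 1,
    where d_j is simple matching dissimilarity to mode j."""
    d = [sum(a != b for a, b in zip(obj, z)) for z in modes]
    if 0 in d:  # object coincides with a mode: crisp membership there
        hits = [1.0 if di == 0 else 0.0 for di in d]
        s = sum(hits)
        return [h / s for h in hits]
    exp = 1.0 / (m - 1.0)
    inv = [(1.0 / di) ** exp for di in d]
    s = sum(inv)
    return [v / s for v in inv]
```

Larger m flattens the memberships toward uniform; m close to 1 approaches a crisp assignment.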
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Abstract
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
An Improved K-means Algorithm for Clustering Categorical Data
Abstract
Abstract: Most of the earlier work on clustering is mainly focused on numerical data, whose inherent geometric properties can be exploited to naturally define distance functions between the data points. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. This paper shows how to apply the notion of “cluster centers” to a dataset of categorical objects, and a k-means-like algorithm for clustering categorical data is introduced.
Refunion-Generalization Conceptual Clustering Algorithm
Abstract
In this paper we introduce a new algorithm for the conceptual analysis of a mixed incomplete data set. This is a Logical Combinatorial Pattern Recognition (LCPR) based tool for the conceptual structuralization of spaces. Starting from the limitations of existing conceptual algorithms, our laboratories are working on applying the methods, the techniques and, in general, the philosophy of Logical Combinatorial Pattern Recognition with the aim of overcoming those limitations. An extension of Michalski's concept of l-complex for any similarity measure, a generalization operator for symbolic variables, and an extension of Michalski's Refunion operator are introduced. Finally, the performance of the RGC algorithm is analyzed and a comparison with several known conceptual algorithms is presented. Keywords: conceptual algorithms, Logical Combinatorial Pattern Recognition, Refunion operator, generalization rules, data analysis.
HIMIC: A Hierarchical Mixed Type Data Clustering Algorithm
, 2005
Abstract
Clustering is an important data mining technique. There are many algorithms that cluster either numeric or categorical data, but few cluster mixed-type datasets with both numerical and categorical attributes. In this paper, we propose a similarity measure between two clusters that enables hierarchical clustering of data with numerical and categorical attributes. This similarity measure is derived from a frequency vector of attribute values in a cluster. Experimental results establish that our algorithm produces good-quality clusters.
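A similarity between two clusters derived from per-attribute frequency vectors could, for illustration, be the average overlap of the two clusters' category distributions. This is an assumption-laden sketch in that spirit, not HIMIC's actual measure:

```python
from collections import Counter

def freq_vectors(cluster):
    """Per attribute, the relative frequency of each category
    among a cluster's members."""
    n = len(cluster)
    return [{c: v / n for c, v in Counter(col).items()} for col in zip(*cluster)]

def cluster_similarity(c1, c2):
    """Average, over attributes, of the overlap between the two
    clusters' frequency distributions (1 = identical, 0 = disjoint)."""
    f1, f2 = freq_vectors(c1), freq_vectors(c2)
    overlaps = [
        sum(min(a.get(cat, 0.0), b.get(cat, 0.0)) for cat in set(a) | set(b))
        for a, b in zip(f1, f2)
    ]
    return sum(overlaps) / len(overlaps)
```

In an agglomerative scheme, the pair of clusters with the highest such similarity would be merged at each step.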