Results 1-10 of 35
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 235 (3 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
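The combined dissimilarity measure mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes one common form: squared Euclidean distance on the numeric attributes plus a weighted count of mismatches on the categorical attributes. The function name `mixed_dissimilarity` and the weight parameter `gamma` are illustrative choices, not taken from the abstract.

```python
def mixed_dissimilarity(x_num, x_cat, q_num, q_cat, gamma=1.0):
    """Dissimilarity between an object and a cluster prototype with mixed attributes.

    x_num, q_num: numeric attribute values of the object and the prototype.
    x_cat, q_cat: categorical attribute values of the object and the prototype.
    gamma: assumed weight balancing the categorical part against the numeric part.
    """
    # Numeric part: squared Euclidean distance, as in k-means.
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, q_num))
    # Categorical part: simple matching, i.e. the number of mismatched attributes.
    categorical_part = sum(a != b for a, b in zip(x_cat, q_cat))
    return numeric_part + gamma * categorical_part


# One numeric attribute differs by 2 (squared: 4) and one of two categories differs.
print(mixed_dissimilarity([1.0, 2.0], ["a", "b"], [1.0, 4.0], ["a", "c"], gamma=0.5))
```

A larger `gamma` makes categorical mismatches dominate the assignment; a value of 0 reduces the measure to the numeric k-means distance.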
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Abstract

Cited by 110 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
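The three ingredients named in this abstract (a matching dissimilarity, modes in place of means, and a frequency-based mode update) can be sketched in a few lines of Python. This is a simplified batch variant written for illustration, assuming the initial modes are supplied by the caller; the function names are hypothetical and the paper's own procedure may differ, for example in when modes are updated.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based update: the most frequent value of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=100):
    """Batch k-modes sketch: alternate assignment and mode update until stable."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to the nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda k: matching_dissimilarity(row, modes[k]))
            for row in data
        ]
        if new_labels == labels:
            break  # No object changed cluster, so the partition is stable.
        labels = new_labels
        # Update step: recompute each mode from its current members.
        for k in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == k]
            if members:  # Keep the old mode if a cluster becomes empty.
                modes[k] = column_modes(members)
    return labels, modes


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = k_modes(data, initial_modes=[("a", "x"), ("b", "z")])
print(labels, modes)
```

As with k-means, the result depends on the initial modes, so in practice several initializations are usually compared.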
A fuzzy k-modes algorithm for clustering categorical data
 IEEE Transactions on Fuzzy Systems
, 1999
Abstract

Cited by 54 (5 self)
©1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.
An alternative extension of the k-means algorithm for clustering categorical data
 Int. J. Appl. Math. Comput. Sci
, 2004
Abstract

Cited by 19 (0 self)
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, soybean disease and nursery databases.
Clustering Spatial Data in the Presence of Obstacles: a Density-Based Approach
 Sixth International Database Engineering and Applications Symposium (IDEAS 2002)
, 2002
Abstract

Cited by 9 (1 self)
Clustering spatial data is a well-known problem that has been extensively studied. Grouping similar data in large 2-dimensional spaces to find hidden patterns or meaningful subgroups has many applications such as satellite imagery, geographic information systems, medical image analysis, marketing, computer vision, etc. Although many methods have been proposed in the literature, very few have considered physical obstacles that may have significant consequences on the effectiveness of the clustering. Taking into account these constraints during the clustering process is costly and the modeling of the constraints is paramount for good performance. In this paper, we investigate the problem of clustering in the presence of constraints such as physical obstacles and introduce a new approach to model these constraints using polygons. We also propose a strategy to prune the search space and reduce the number of polygons to test during clustering. We devise a density-based clustering algorithm, DBCluC, which takes advantage of our constraint modeling to efficiently cluster data objects while considering all physical constraints. The algorithm can detect clusters of arbitrary shape and is insensitive to noise, the input order, and the difficulty of constraints. Its average running complexity is O(N log N) where N is the number of data points.
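To make the idea of polygon-modeled obstacles concrete, the following Python sketch checks whether the straight line between two points is blocked by any obstacle edge. It is a brute-force illustration only: the function names are hypothetical, it does not reproduce DBCluC's constraint modeling or its pruning strategy, and it deliberately ignores degenerate cases such as a segment that merely touches a vertex or runs along an edge.

```python
def _orient(p, q, r):
    """Signed area test: > 0 if r is left of the line p->q, < 0 if right, 0 if collinear."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])


def segments_cross(a, b, c, d):
    """True if segments ab and cd properly cross (touching and collinear cases excluded)."""
    d1, d2 = _orient(c, d, a), _orient(c, d, b)
    d3, d4 = _orient(a, b, c), _orient(a, b, d)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_visible(p, q, obstacles):
    """True if the segment pq crosses no edge of any obstacle polygon.

    obstacles: list of polygons, each a list of (x, y) vertices in order.
    """
    for polygon in obstacles:
        n = len(polygon)
        for i in range(n):
            # Edge from vertex i to the next vertex, wrapping around at the end.
            if segments_cross(p, q, polygon[i], polygon[(i + 1) % n]):
                return False
    return True


wall = [(1, -1), (2, -1), (2, 1), (1, 1)]  # A square obstacle.
print(is_visible((0, 0), (3, 0), [wall]))  # Line passes through the square.
print(is_visible((0, 0), (0, 3), [wall]))  # Line stays clear of the square.
```

In a density-based algorithm, a test of this kind would decide whether a neighbouring point may be counted as reachable; testing every edge of every polygon is exactly the cost that a pruning strategy aims to avoid.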
A Supervised Clustering and Classification Algorithm for Mining Data With Mixed Variables
 IEEE Transactions on Systems, Man, and Cybernetics, Part A
Abstract

Cited by 5 (1 self)
This paper presents a data mining algorithm based on supervised clustering to learn data patterns and use these patterns for data classification. This algorithm enables a scalable incremental learning of patterns from data with both numeric and nominal variables. Two different methods of combining numeric and nominal variables in calculating the distance between clusters are investigated. In one method, separate distance measures are calculated for numeric and nominal variables, respectively, and are then combined into an overall distance measure. In another method, nominal variables are converted into numeric variables, and then a distance measure is calculated using all variables. We analyze the computational complexity, and thus, the scalability, of the algorithm, and test its performance on a number of data sets from various application domains. The prediction accuracy and reliability of the algorithm are analyzed, tested, and compared with those of several other data mining algorithms. Index Terms: Classification, clustering, computer intrusion detection, dissimilarity measures.
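The two ways of combining numeric and nominal variables described in this abstract can be contrasted with a short Python sketch. The specific choices here are assumptions made for illustration, not the paper's definitions: Euclidean distance for the numeric part, mismatch counting for the nominal part, a weighted sum to combine them, and one-hot encoding as the nominal-to-numeric conversion. All function and parameter names are hypothetical.

```python
import math


def separate_then_combine(x_num, x_nom, y_num, y_nom, nominal_weight=1.0):
    """Method 1 (assumed form): separate distances, then a weighted sum."""
    numeric = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_num, y_num)))
    nominal = sum(a != b for a, b in zip(x_nom, y_nom))
    return numeric + nominal_weight * nominal


def one_hot(value, categories):
    """Encode a nominal value as a 0/1 vector over its known categories."""
    return [1.0 if value == c else 0.0 for c in categories]


def convert_then_measure(x_num, x_nom, y_num, y_nom, categories_per_var):
    """Method 2 (assumed form): convert nominal to numeric, then one distance."""
    x_vec, y_vec = list(x_num), list(y_num)
    for xv, yv, cats in zip(x_nom, y_nom, categories_per_var):
        x_vec += one_hot(xv, cats)
        y_vec += one_hot(yv, cats)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x_vec, y_vec)))


# Two records with two numeric variables and one nominal variable.
print(separate_then_combine([0.0, 3.0], ["tcp"], [4.0, 0.0], ["udp"]))
print(convert_then_measure([0.0, 3.0], ["tcp"], [4.0, 0.0], ["udp"], [["tcp", "udp"]]))
```

The first method keeps the nominal contribution explicit and tunable through the weight; the second lets a single metric treat all variables uniformly, at the price of a higher-dimensional vector.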
Computation of Initial Modes for K-modes Clustering Algorithm using Evidence Accumulation
 IJCAI-07
Abstract

Cited by 5 (0 self)
The clustering accuracy of partitional clustering algorithms for categorical data depends primarily on the choice of the initial data points used to start the clustering process, and hence the clustering results cannot be generated and repeated consistently. In this paper we present an approach to compute initial modes for the K-modes partitional clustering algorithm to cluster categorical data sets. Here we use the idea of evidence accumulation for combining the results of multiple clusterings. Initially, n F-dimensional data is decomposed into a large number of compact clusters; the K-modes algorithm performs this decomposition, with several clusterings obtained by N random initializations of the K-modes algorithm, and the modes thus obtained for every random initialization are stored in a ModePool, PN. The objective is to investigate the contribution of those data objects/patterns that are less vulnerable to the random selection of modes and to choose the most diverse set of modes from the available ModePool that can be used as initial modes for the K-modes clustering algorithm. Experimentally we found that this method yields initial modes that are very similar to the actual/desired modes and gives consistent and better clustering results, with less variance of error, than the traditional method of choosing random modes.
Fuzzy Clustering of Categorical Attributes and its Use in Analyzing Cultural Data
 International Journal of Computational Intelligence
, 2004
Abstract

Cited by 2 (1 self)
We develop a three-step fuzzy logic-based algorithm for clustering categorical attributes, and we apply it to analyze cultural data. In the first step the algorithm employs an entropy-based clustering scheme, which initializes the cluster centers. In the second step we apply the fuzzy c-modes algorithm to obtain a fuzzy partition of the data set, and the third step introduces a novel cluster validity index, which decides the final number of clusters. Keywords: Categorical data, cultural data, fuzzy logic clustering, fuzzy c-modes, cluster validity index.
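The fuzzy partition produced in the second step can be illustrated with the membership update commonly used in fuzzy c-modes and fuzzy k-modes style algorithms. The Python sketch below assumes the simple matching dissimilarity and a fuzziness exponent greater than 1; the function name and parameter names are hypothetical, and the entropy-based initialization and the validity index from this abstract are not reproduced.

```python
def fuzzy_memberships(obj, modes, fuzziness=2.0):
    """Membership degrees of one categorical object in each cluster.

    obj: tuple of categorical values.
    modes: list of cluster modes (tuples of the same length as obj).
    fuzziness: exponent > 1; values near 1 give almost crisp memberships.
    """
    # Simple matching dissimilarity between the object and every mode.
    dists = [sum(a != b for a, b in zip(obj, mode)) for mode in modes]
    if 0 in dists:
        # The object coincides with a mode: assign it fully to the first such cluster.
        first = dists.index(0)
        return [1.0 if k == first else 0.0 for k in range(len(modes))]
    exponent = 1.0 / (fuzziness - 1.0)
    # Standard fuzzy update: closer modes receive larger membership, summing to 1.
    return [1.0 / sum((d_k / d_l) ** exponent for d_l in dists) for d_k in dists]


modes = [("a", "b", "x"), ("x", "y", "z")]
print(fuzzy_memberships(("a", "b", "c"), modes))  # Closer to the first mode.
```

Each object thus belongs to every cluster to some degree, which is what allows a validity index to compare partitions with different numbers of clusters.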
REFUNION-GENERALIZATION-CONCEPTUAL CLUSTERING ALGORITHM
Abstract
In this paper we introduce a new conceptual algorithm for the conceptual analysis of a mixed incomplete data set. It is a Logical Combinatorial Pattern Recognition (LCPR) based tool for the conceptual structuralization of spaces. Starting from the limitations of existing conceptual algorithms, our laboratories are applying the methods, the techniques and, in general, the philosophy of Logical Combinatorial Pattern Recognition to overcome those limitations. An extension of Michalski’s concept of l-complex to any similarity measure, a generalization operator for symbolic variables and an extension of Michalski’s Refunion operator are introduced. Finally, the performance of the RGC algorithm is analyzed. A comparison with several known conceptual algorithms is presented.
Categorical Clustering By Converting Associated Information
Abstract
The lack of an inherent “natural” dissimilarity measure between objects in a categorical dataset presents special difficulties in clustering analysis. However, each categorical attribute of a given dataset provides natural probability and information in the sense of Shannon. In this paper, we propose a novel method which heuristically converts categorical attributes to numerical values by exploiting such associated information. We conduct an experimental study with a real-life categorical dataset. The experiment demonstrates the effectiveness of our approach. Keywords: Categorical, Clustering, Converting, Information.