## An alternative extension of the k-means algorithm for clustering categorical data (2004)

Venue: Int. J. Appl. Math. Comput. Sci.

Citations: 16 (0 self)

### BibTeX

@ARTICLE{San04analternative,

author = {Ohn Mar San and Van-nam Huynh and Yoshiteru Nakamori},

title = {An alternative extension of the k-means algorithm for clustering categorical data},

journal = {Int. J. Appl. Math. Comput. Sci},

year = {2004},

volume = {14},

pages = {241--247}

}

### Abstract

Most of the earlier work on clustering has focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, because it works only on numerical data, it cannot be used directly for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, the soybean disease and nursery databases.

### Citations

3141 | UCI repository of machine learning databases
- Blake, Merz
- 1998
Citation Context ... In this section we present two experimental tests with the proposed algorithm on soybean disease and nursery databases taken from the UCI Repository of Machine Learning Databases and Domain Theorem (Blake and Merz, 1998). As is well known, the primary aim of clustering algorithms is to discover classes that exist inherently in data. With this purpose in mind, we first assume that a structure may exist in a given dat... |

2162 | Some methods for classification and analysis of multivariate observations
- MacQueen
- 1967
Citation Context ...1999). Recently, clustering data with categorical attributes have drawn some attention (Ganti et al., 1999; Gibson et al., 1998; Guha et al., 2000; Huang, 1998). As is well known, k-means clustering (MacQueen, 1967) has been a very popular technique for partitioning large data sets with numerical attributes. Ralambondrainy (1995) proposed a hybrid numeric-symbolic method that integrates an extended version of t... |

832 | Cluster Analysis for Applications
- Anderberg
- 1973
Citation Context ...jects in a database into clusters/groups such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity (Anderberg, 1973; Jain and Dubes, 1988; Kaufman and Rousseeuw, 1990). Traditionally, numerical clustering methods have been viewed in opposition to conceptual clustering methods developed in Artificial Intelligence. ... |

687 | Knowledge acquisition via incremental conceptual clustering
- Fisher
- 1987
Citation Context ...ng low-level descriptions of clusters (Anderberg, 1973; Kaufman and Rousseeuw, 1990), while a conceptual approach is more concerned with high-level (i.e. more understandable) descriptions of classes (Fisher, 1987; Michalski and Stepp, 1983). Most of the earlier work on clustering has been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions betw... |

364 | Rock: A robust clustering algorithm for categorical attributes
- Guha, Rastogi, et al.
- 2000
Citation Context ...s on which distance functions are not naturally defined (Ganti et al., 1999). Recently, clustering data with categorical attributes have drawn some attention (Ganti et al., 1999; Gibson et al., 1998; Guha et al., 2000; Huang, 1998). As is well known, k-means clustering (MacQueen, 1967) has been a very popular technique for partitioning large data sets with numerical attributes. Ralambondrainy (1995) proposed a hyb... |

185 | Extensions to the k-means algorithm for clustering large data sets with categorical values
- Huang
- 1998
Citation Context ... functions are not naturally defined (Ganti et al., 1999). Recently, clustering data with categorical attributes have drawn some attention (Ganti et al., 1999; Gibson et al., 1998; Guha et al., 2000; Huang, 1998). As is well known, k-means clustering (MacQueen, 1967) has been a very popular technique for partitioning large data sets with numerical attributes. Ralambondrainy (1995) proposed a hybrid numeric-s... |

163 | A New Approach to Clustering
- Ruspini
- 1969
Citation Context ...l with the problem of not well-defined boundaries between clusters, the notion of fuzzy partitions has been applied successfully to the clustering problem resulting in the so-called fuzzy clustering (Ruspini, 1969; Bezdek, 1980; Ismail and Selim, 1986). However, we do not consider this topic in the present paper. As is shown in (Huang, 1998), the k-means algorithm has the following characteristics: • It is eff... |

152 | Clustering categorical data: an approach based on dynamical systems
- Gibson, Kleinberg, et al.
Citation Context ...categorical attributes on which distance functions are not naturally defined (Ganti et al., 1999). Recently, clustering data with categorical attributes have drawn some attention (Ganti et al., 1999; Gibson et al., 1998; Guha et al., 2000; Huang, 1998). As is well known, k-means clustering (MacQueen, 1967) has been a very popular technique for partitioning large data sets with numerical attributes. Ralambondrainy (1... |

138 | K-means-type algorithms: a generalized convergence theorem and characterization of local optimality
- Selim, Ismail
- 1984
Citation Context ...ry conditions for W to minimize P. Then we fix W and minimize P according to Q. Basically, the k-means algorithm iterates through a three-step process until $P(W, Q)$ converges to some local minimum (Selim and Ismail, 1984): 1. Select an initial $Q^{(0)} = \{Q^{(0)}_1, \ldots, Q^{(0)}_k\}$, and set $t = 0$. 2. Keep $Q^{(t)}$ fixed and solve $P(W, Q^{(t)})$ to obtain $W^{(t)}$, i.e., regarding $Q^{(t)}$ as the cluster centers, assign each obje... |
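
The alternating scheme quoted above can be sketched in a few lines of Python. This is a minimal illustration for numerical data, not the authors' code: the excerpt is cut off before the third step, so the center-update step and the stopping rule below are assumed from the standard k-means formulation, and the names `k_means`, `squared_distance`, `max_iter`, and `seed` are hypothetical.

```python
import random


def squared_distance(x, q):
    """Squared Euclidean distance between a point and a center."""
    return sum((a - b) ** 2 for a, b in zip(x, q))


def k_means(points, k, max_iter=100, seed=0):
    """Alternate between assigning objects (W) and updating centers (Q)."""
    rng = random.Random(seed)
    # Step 1: select initial centers Q^(0) (here: k distinct data points).
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 2: keep Q fixed and assign each object to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Assumed stopping rule: stop once the partition no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3 (assumed): keep W fixed and recompute each center as the mean.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # leave an empty cluster's center unchanged
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers
```

Calling `k_means([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)` separates the two well-spaced pairs. As the excerpt notes, the procedure only reaches a local minimum in general, so the result on less clear-cut data depends on the initial centers.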

105 | A clustering technique for summarizing multivariate data
- Ball, Hall
- 1967
Citation Context ...sidered the clustering problem for mixed-type data objects where some domains are numeric, while others are categorical. 3. k-Means Clustering The general algorithm was introduced by Cox (1957), and (Ball and Hall, 1967; MacQueen, 1967) first named it k-means. Since then it has become widely popular and is classified as a partitional or non-hierarchical clustering method (Jain and Dubes, 1988). It is defined as follo... |

101 | Automated construction of classifications: Conceptual clustering versus numerical taxonomy
- Michalski, Stepp
- 1983
Citation Context ...escriptions of clusters (Anderberg, 1973; Kaufman and Rousseeuw, 1990), while a conceptual approach is more concerned with high-level (i.e. more understandable) descriptions of classes (Fisher, 1987; Michalski and Stepp, 1983). Most of the earlier work on clustering has been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. However, d... |

96 | CACTUS - Clustering Categorical Data Using Summaries
- Ganti, Gehrke, et al.
- 1999
Citation Context ...ce functions between data points. However, data mining applications frequently involve many datasets that also consist of categorical attributes on which distance functions are not naturally defined (Ganti et al., 1999). Recently, clustering data with categorical attributes have drawn some attention (Ganti et al., 1999; Gibson et al., 1998; Guha et al., 2000; Huang, 1998). As is well known, k-means clustering (MacQ... |

89 | A convergence theorem for the fuzzy ISODATA clustering algorithms
- Bezdek
- 1980
Citation Context ...lem of not well-defined boundaries between clusters, the notion of fuzzy partitions has been applied successfully to the clustering problem resulting in the so-called fuzzy clustering (Ruspini, 1969; Bezdek, 1980; Ismail and Selim, 1986). However, we do not consider this topic in the present paper. As is shown in (Huang, 1998), the k-means algorithm has the following characteristics: • It is efficient in proc... |

59 | Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...se into clusters/groups such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity (Anderberg, 1973; Jain and Dubes, 1988; Kaufman and Rousseeuw, 1990). Traditionally, numerical clustering methods have been viewed in opposition to conceptual clustering methods developed in Artificial Intelligence. Numerical techniques e... |

50 | Symbolic clustering using a new dissimilarity measure
- Gowda, Diday
- 1991
Citation Context ...fined as categorical if it is finite and unordered, e.g., that only a comparison operation is allowed in $D_i$. That is, for any $a, b \in D_i$ either $a = b$ or $a \neq b$. Symbolic data objects as considered in (Gowda and Diday, 1991) are not discussed in the present paper. Logically, each data object X in the dataset is also represented as a conjunction of attribute-value pairs $[A_1 = x_1] \wedge \ldots \wedge [A_m = x_m]$, where $x_i \in D_i$ for 1 ... |

46 | A fuzzy k-modes algorithm for clustering categorical data - Huang, Ng - 1999 |

42 | Clustering large data sets with mixed numeric and categorical values
- Huang
- 1997
Citation Context ...acteristics: • It is efficient in processing large data sets. • It often terminates at a local optimum. • It works only on numerical data. • The clusters have convex shapes. It was also shown in (Huang, 1997; Huang, 1998) that the k-means method can be extended to categorical data by using a simple matching distance measure for categorical objects with a majority-vote strategy to define the “cluster cent... |
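
The two ingredients named in this excerpt, a simple matching distance and a majority-vote cluster center, can be illustrated as follows. This is a hedged sketch of the k-modes idea that the excerpt attributes to Huang (1997, 1998), not the alternative extension proposed in the paper indexed here; the function names and the tuple-of-strings encoding of categorical objects are assumptions made for the example.

```python
from collections import Counter


def simple_matching_distance(x, y):
    """Count the attributes on which two categorical objects disagree."""
    return sum(a != b for a, b in zip(x, y))


def mode_center(cluster):
    """Majority vote per attribute: take the most frequent value in each column."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*cluster))


# Hypothetical objects with two categorical attributes (colour, size).
cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
center = mode_center(cluster)
```

Here `center` is `("red", "small")`, and `simple_matching_distance(("blue", "large"), center)` is 2 because both attributes differ. When two values are equally frequent in a column, the vote has no unique winner and this sketch simply keeps the value that `Counter` returns first.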

34 | A conceptual version of the K-means algorithm - Ralambondrainy - 1995 |

17 | Finding Groups
- Kaufman, Rousseeuw
- 1990
Citation Context ...s such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity (Anderberg, 1973; Jain and Dubes, 1988; Kaufman and Rousseeuw, 1990). Traditionally, numerical clustering methods have been viewed in opposition to conceptual clustering methods developed in Artificial Intelligence. Numerical techniques emphasize the determination of... |

15 | Note on grouping - Cox - 1957 |

10 | Fuzzy C-Means: Optimality of Solutions and Effective Termination of the Algorithms
- Ismail, Selim
- 1986
Citation Context ...l-defined boundaries between clusters, the notion of fuzzy partitions has been applied successfully to the clustering problem resulting in the so-called fuzzy clustering (Ruspini, 1969; Bezdek, 1980; Ismail and Selim, 1986). However, we do not consider this topic in the present paper. As is shown in (Huang, 1998), the k-means algorithm has the following characteristics: • It is efficient in processing large data sets. ... |

1 | Local convergence of the c-means algorithms - Hathaway, Bezdek - 1986 |