## A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining

### Download Links

- [www.cit.gu.edu.au]
- [www.ece.northwestern.edu]
- DBLP

### Other Repositories/Bibliography

Citations: 84 (2 self)

### BibTeX

@MISC{Huang_afast,

author = {Zhexue Huang},

title = {A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining},

year = {}

}

### Abstract

Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, it works only on numeric values, which limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
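The three ingredients named in the abstract (a dissimilarity measure for categorical objects, modes in place of means, and a frequency-based mode update) can be illustrated with a minimal Python sketch. This is not the author's implementation. It assumes simple matching (the number of mismatched attributes) as the dissimilarity measure, random selection of distinct objects as initial modes, and a batch update of modes after each assignment pass. The names `matching_dissimilarity`, `column_modes`, and `k_modes` are hypothetical.

```python
from collections import Counter
import random


def matching_dissimilarity(x, y):
    """Assumed measure: number of attributes on which x and y differ."""
    return sum(a != b for a, b in zip(x, y))


def column_modes(members):
    """Frequency-based update: most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*members))


def k_modes(objects, k, max_iter=100, seed=0):
    """Cluster equal-length tuples of categories; returns (modes, labels, cost)."""
    rng = random.Random(seed)
    # Assumption: initial modes are k distinct objects chosen at random.
    # Sorting requires the category values to be mutually comparable (e.g. str).
    modes = rng.sample(sorted(set(objects)), k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each object goes to its nearest mode.
        new_labels = [
            min(range(k), key=lambda l: matching_dissimilarity(obj, modes[l]))
            for obj in objects
        ]
        if new_labels == labels:
            break  # no object changed cluster
        labels = new_labels
        # Update step: recompute the mode of every non-empty cluster.
        for l in range(k):
            members = [obj for obj, lab in zip(objects, labels) if lab == l]
            if members:
                modes[l] = column_modes(members)
    # Clustering cost: total dissimilarity between objects and their modes.
    cost = sum(
        matching_dissimilarity(obj, modes[lab])
        for obj, lab in zip(objects, labels)
    )
    return modes, labels, cost


data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("green", "large", "square"),
]
modes, labels, cost = k_modes(data, k=2)
print(modes, labels, cost)
```

Like k-means, a procedure of this kind can stop at a local optimum, so the result depends on the initial modes. Running the sketch with several `seed` values and keeping the lowest-cost result is one simple mitigation.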

### Citations

7338 |
Genetic Algorithms
- Goldberg
- 1989
Citation Context ... a local optimum (MacQueen 1967, Selim and Ismail 1984). To find out the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 1990) and genetic algorithms (Goldberg 1989, Murthy and Chowdhury 1996) can be incorporated with the k-means algorithm. 3. It works only on numeric values because it minimises a cost function by calculating the means of clusters. 4. The cluste... |

3527 | Optimization by Simulated Annealing
- Kirkpatrick, Gelatt, et al.
- 1983
Citation Context ...nal complexity is $O(n^2)$ (Murtagh 1992). 2. It often terminates at a local optimum (MacQueen 1967, Selim and Ismail 1984). To find out the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 1990) and genetic algorithms (Goldberg 1989, Murthy and Chowdhury 1996) can be incorporated with the k-means algorithm. 3. It works only on numeric values because it minimises a cost fun... |

2147 |
Algorithms for Clustering Data
- Jain, Dubes
- 1988
Citation Context ...to clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria. Statistical clustering methods (Anderberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987). The most distinct ... |

1857 | Some Methods for Classification and Analysis of Multivariate Observations
- MacQueen
- 1967
Citation Context ...h focus (Shafer et al. 1996). In this paper we present a fast clustering algorithm used to cluster categorical data. The algorithm, called k-modes, is an extension to the well known k-means algorithm (MacQueen 1967). Compared to other clustering methods the k-means algorithm and its variants (Anderberg 1973) are efficient in clustering large data sets, thus very suitable for data mining. However, their use is o... |

708 |
Cluster Analysis for Applications
- Anderberg
- 1973
Citation Context ...et of objects into clusters such that objects in the same cluster are more similar to each other than objects in different clusters according to some defined criteria. Statistical clustering methods (Anderberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987... |

222 |
Theory and Applications of Correspondence Analysis
- Greenacre
- 1984
Citation Context ... (4) where $n_{x_j}$, $n_{y_j}$ are the numbers of objects in the data set that have categories $x_j$ and $y_j$ for attribute $j$. Because $d_{\chi^2}(X, Y)$ is similar to the chi-square distance in (Greenacre 1984), we call it chi-square distance. This dissimilarity measure gives more importance to rare categories than frequent ones. Eq. (4) is useful in discovering under-represented object clusters such as fr... |

142 |
A new approach to clustering
- Ruspini
- 1969
Citation Context ... means (Anderberg 1973, Bobrowski and Bezdek 1991). The sophisticated variants of the k-means algorithm include the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973). Most k-means type algorithms have been proved convergent (MacQueen 1967, Bezdek 1980, Selim and Ismail 1984). The k-means algorithm has the following important properties. 1. It is efficient ... |

139 |
Discrimination and Classification
- Hand
- 1981
Citation Context ...ction $E = \sum_{l=1}^{k} \sum_{i=1}^{n} y_{i,l}\, d(X_i, Q_l)$ (1) where $n$ is the number of objects in a data set $X$, $X_i \in X$, $Q_l$ is the mean of cluster $l$, and $y_{i,l}$ is an element of a partition matrix $Y_{n \times l}$ as in (Hand 1981). $d$ is a dissimilarity measure usually defined by the squared Euclidean distance. There exist a few variants of the k-means algorithm which differ in selection of the initial k means, dissimilarity c... |

121 |
K-Means-Type Algorithms: A Generalized Convergence Theorem and Characterization of Local Optimality
- Selim, Ismail
- 1984
Citation Context ...lude the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973). Most k-means type algorithms have been proved convergent (MacQueen 1967, Bezdek 1980, Selim and Ismail 1984). The k-means algorithm has the following important properties. 1. It is efficient in processing large data sets. The computational complexity of the algorithm is $O(tkmn)$, where $m$ is the number of at... |

106 | A General Coefficient of Similarity and Some of its Properties
- Gower
- 1971
Citation Context ...works on categorical attributes and produces the cluster modes, which describe the clusters, thus very useful to the user in interpreting the clustering results. Using Gower's similarity coefficient (Gower 1971) and other dissimilarity measures (Gowda and Diday 1991) one can use a hierarchical clustering method to cluster categorical or mixed data. However, the hierarchical clustering methods are not effici... |

92 | Automated Construction of Classifications: Conceptual Clustering Versus Numerical Taxonomy
- Michalski, Stepp
- 1983
Citation Context ...al clustering methods (Anderberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987). The most distinct characteristic of data mining is that it deals with very large data sets (gigabytes or even terabytes). This requires the algorithms used in data mining to be scalabl... |

86 | A Clustering Technique for Summarizing Multivariate Data - Ball, Hall - 1967 |

69 |
A convergence theorem for the fuzzy ISODATA clustering algorithm
- Bezdek
- 1980
Citation Context ...algorithm include the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973). Most k-means type algorithms have been proved convergent (MacQueen 1967, Bezdek 1980, Selim and Ismail 1984). The k-means algorithm has the following important properties. 1. It is efficient in processing large data sets. The computational complexity of the algorithm is $O(tkmn)$, wher... |

45 |
Symbolic Clustering Using a New Dissimilarity Measure
- Gowda, Diday
- 1991
Citation Context ...he cluster modes, which describe the clusters, thus very useful to the user in interpreting the clustering results. Using Gower's similarity coefficient (Gower 1971) and other dissimilarity measures (Gowda and Diday 1991) one can use a hierarchical clustering method to cluster categorical or mixed data. However, the hierarchical clustering methods are not efficient in processing large data sets. Their use is limited ... |

38 | Clustering large data sets with mixed numeric and categorical values
- Huang
- 1997
Citation Context ...ins are not ordered. The k-modes algorithm in this paper removes this limitation and extends the k-means paradigm to categorical domains whilst preserving the efficiency of the k-means algorithm. In (Huang 1997) we have proposed an algorithm, called k-prototypes, to cluster large data sets with mixed numeric and categorical values. In the k-prototypes algorithm we define a dissimilarity measure that takes i... |

37 |
Knowledge Acquisition Via
- Fisher
- 1987
Citation Context ...erberg 1973, Jain and Dubes 1988) use similarity measures to partition objects whereas conceptual clustering methods cluster objects according to the concepts objects carry (Michalski and Stepp 1983, Fisher 1987). The most distinct characteristic of data mining is that it deals with very large data sets (gigabytes or even terabytes). This requires the algorithms used in data mining to be scalable. However, m... |

30 | A conceptual version of K-means algorithm - Ralambondrainy - 1995 |

13 | Comments on “Parallel Algorithms for Hierarchical Clustering and Cluster Validity” - Murtagh - 1992 |

5 |
Learning Based on Conceptual Distance
- Kodratoff, Tecuci
- 1988
Citation Context ...categorical domains and used to represent missing values. To simplify the dissimilarity measure we do not consider the conceptual inclusion relationships among values in a categorical domain like in (Kodratoff and Tecuci 1988) such that car and vehicle are two categorical values in a domain and conceptually a car is also a vehicle. However, such relationships may exist in real world databases. 2.2 Categorical Objects Like... |

5 | New Experimental Results in Fuzzy Clustering - Ruspini - 1973 |

4 |
c-Means Clustering with the $l_1$ and $l_\infty$ Norms
- Bobrowski, Bezdek
- 1991
Citation Context ...n distance. There exist a few variants of the k-means algorithm which differ in selection of the initial k means, dissimilarity calculations and strategies to calculate cluster means (Anderberg 1973, Bobrowski and Bezdek 1991). The sophisticated variants of the k-means algorithm include the well-known ISODATA algorithm (Ball and Hall 1967) and the fuzzy k-means algorithms (Ruspini 1969, 1973). Most k-means type algorithms... |

4 |
Comments on “Parallel Algorithms for Hierarchical Clustering and Cluster Validity”
- Murtagh
- 1992
Citation Context ...e whole data set. Usually, $k, m, t \ll n$. In clustering large data sets the k-means algorithm is much faster than the hierarchical clustering algorithms whose general computational complexity is $O(n^2)$ (Murtagh 1992). 2. It often terminates at a local optimum (MacQueen 1967, Selim and Ismail 1984). To find out the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 19... |

4 | c-Means Clustering with the $l_1$ and $l_\infty$ Norms - Bobrowski, Bezdek - 1991 |

2 |
A Deterministic Annealing Approach to
- Rose, Gurewitz, et al.
- 1990
Citation Context ... (Murtagh 1992). 2. It often terminates at a local optimum (MacQueen 1967, Selim and Ismail 1984). To find out the global optimum, techniques such as deterministic annealing (Kirkpatrick et al. 1983, Rose et al. 1990) and genetic algorithms (Goldberg 1989, Murthy and Chowdhury 1996) can be incorporated with the k-means algorithm. 3. It works only on numeric values because it minimises a cost function by calculati... |