## Clustering large data sets with mixed numeric and categorical values (1997)

Venue: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining

Citations: 38 (3 self)

### BibTeX

@INPROCEEDINGS{Huang97clusteringlarge,

author = {Zhexue Huang},

title = {Clustering large data sets with mixed numeric and categorical values},

booktitle = {The First Pacific-Asia Conference on Knowledge Discovery and Data Mining},

year = {1997},

pages = {21--34}

}

### Abstract

Efficient partitioning of large data sets into homogeneous clusters is a fundamental problem in data mining. Standard hierarchical clustering methods provide no solution to this problem because of their computational inefficiency. The k-means based methods are promising for their efficiency in processing large data sets, but their use is often limited to numeric data. In this paper we present a k-prototypes algorithm, which is based on the k-means paradigm but removes the numeric-data limitation whilst preserving its efficiency. In the algorithm, objects are clustered against k prototypes. A method is developed to dynamically update the k prototypes in order to maximise the intra-cluster similarity of objects. When applied to numeric data the algorithm is identical to k-means. To assist interpretation of clusters we use decision tree induction algorithms to create rules for clusters. These rules, together with other statistics about clusters, can help data miners understand and identify interesting clusters.
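The mixed dissimilarity implied by the abstract (and by the gamma-weighted Sigma term in the pseudocode quoted in the citation contexts) combines a squared Euclidean distance on numeric attributes with a weight gamma times the number of categorical mismatches; when there are no categorical attributes the gamma term vanishes and the measure reduces to the k-means distance. A minimal sketch, with function and variable names of our own choosing rather than the paper's:

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma):
    """k-prototypes style dissimilarity between an object and a prototype:
    squared Euclidean distance on the numeric attributes plus gamma times
    the count of categorical mismatches. (Sketch based on the abstract;
    names are assumptions, not the paper's notation.)"""
    numeric_part = float(np.sum((np.asarray(x_num) - np.asarray(proto_num)) ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part
```

The single parameter gamma controls the relative influence of the categorical attributes against the numeric ones.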

### Citations

7342 | Genetic Algorithms and - Goldberg, Holland - 1988 |

4936 | C4.5: Programs for Machine Learning - Quinlan - 1993 |

3910 | Classification and Regression Trees - Breiman - 1984 |

Citation Context: ...a data set X, one for each cluster. (2) Allocate each object in X to a cluster whose prototype is the nearest to it according to Eq. (2.3). Update the prototype of the cluster after each allocation. (3) After all objects have been allocated to a cluster, retest the similarity of objects against the current prototypes. If an object is found such that its nearest prototype belongs to another cluster r...

3529 | Optimization by simulated annealing - Gelatt, Vecchi - 1983 |

1858 | Some Methods for classification and Analysis of Multivariate Observations - MacQueen - 1967 |

710 | Cluster Analysis for Applications - Anderberg - 1973 |

Citation Context: ...urrent clusters of the object are updated. Variable moves records the number of objects which have changed clusters in the process. FOR i = 1 TO NumberOfObjects Mindistance = Distance(X[i], O_prototypes[1]) + gamma * Sigma(X[i], C_prototypes[1]) FOR j = 1 TO NumberOfClusters distance = Distance(X[i], O_prototypes[j]) + gamma * Sigma(X[i], C_prototypes[j]) IF (distance < Mindistance) Mindistance = distance clust...
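The truncated FOR-loop fragment above searches, for each object, for the prototype minimising Distance plus gamma times Sigma. A runnable Python rendering of that inner search, assuming (as the fragment suggests) that Distance is squared Euclidean on the numeric part and Sigma counts categorical mismatches:

```python
def nearest_cluster(x_num, x_cat, O_prototypes, C_prototypes, gamma):
    """Return the index of the nearest prototype for one object, mirroring
    the quoted FOR-loop: O_prototypes holds the numeric parts of the k
    prototypes, C_prototypes the categorical parts. The helper definitions
    of Distance and Sigma are assumptions based on the snippet."""
    def distance(p_num):                       # squared Euclidean, numeric part
        return sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    def sigma(p_cat):                          # categorical mismatch count
        return sum(a != b for a, b in zip(x_cat, p_cat))
    best_j, best_d = 0, None
    for j in range(len(O_prototypes)):
        d = distance(O_prototypes[j]) + gamma * sigma(C_prototypes[j])
        if best_d is None or d < best_d:
            best_j, best_d = j, d
    return best_j
```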

643 | Knowledge acquisition via incremental conceptual clustering - Fisher - 1987 |

471 | Cluster Analysis - Everitt - 1974 |

139 | Discrimination and Classification - Hand - 1981 |

106 | A General Coefficient of Similarity and Some of its Properties - Gower - 1971 |

99 | Experiments with incremental concept formation - Lebowitz - 1987 |

92 | A deterministic annealing approach to clustering - Rose, Gurewitz, et al. - 1990 |

8 | A non-greedy approach to tree-structured clustering - Miller, Rose - 1994 |

4 | c-Means Clustering with the l1 and l∞ Norms - Bobrowski, Bezdek - 1991 |

Citation Context: ...eavy investment in real estate. To perform such analyses at least the following two problems have to be solved: (1) efficient partitioning of a large data set into homogeneous groups or clusters, and (2) effective interpretation of clusters. This paper proposes a solution to the first problem and suggests a solution to the second. A number of data partitioning methods can be employed for the first pr...

4 | Vector quantisation with complexity costs - Buhmann, Kuhnel - 1993 |

Citation Context: ...otypes. If an object is found such that its nearest prototype belongs to another cluster rather than its current one, reallocate the object to that cluster and update the prototypes of both clusters. (4) Repeat (3) until no object has changed clusters after a full cycle test of X. The algorithm is built upon three processes: initial prototypes selection, initial allocation, and re-allocation. The fir...
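The citation contexts, read together, describe an iterate-until-stable scheme: select k initial prototypes, allocate each object to its nearest prototype (updating prototypes after each allocation), then keep reallocating until a full pass over X moves no object. A self-contained Python sketch of that scheme, under the same assumptions as above (numeric prototype = per-attribute mean, categorical prototype = per-attribute mode; all names are ours, not the paper's):

```python
import random
from collections import Counter

import numpy as np

def k_prototypes(X_num, X_cat, k, gamma, max_iter=100, seed=0):
    """Sketch of the allocate/reallocate loop from steps (1)-(4): stop when
    a full cycle over X changes no object's cluster. Returns (assignments,
    numeric prototypes, categorical prototypes)."""
    rng = random.Random(seed)
    n = len(X_num)
    idx = rng.sample(range(n), k)                 # initial prototype selection
    O = [list(X_num[i]) for i in idx]             # numeric prototype parts
    C = [list(X_cat[i]) for i in idx]             # categorical prototype parts
    assign = [-1] * n

    def dist(i, j):
        num = sum((a - b) ** 2 for a, b in zip(X_num[i], O[j]))
        cat = sum(a != b for a, b in zip(X_cat[i], C[j]))
        return num + gamma * cat

    def update(j):
        members = [i for i in range(n) if assign[i] == j]
        if not members:
            return                                # leave an emptied prototype as-is
        O[j] = list(np.mean([X_num[i] for i in members], axis=0))
        for a in range(len(C[j])):                # per-attribute mode
            C[j][a] = Counter(X_cat[i][a] for i in members).most_common(1)[0][0]

    for _ in range(max_iter):
        moves = 0
        for i in range(n):
            j = min(range(k), key=lambda c: dist(i, c))
            if j != assign[i]:
                old, assign[i] = assign[i], j
                update(j)                         # update prototype after each move
                if old != -1:
                    update(old)
                moves += 1
        if moves == 0:                            # full cycle with no changes: done
            break
    return assign, O, C
```

Updating prototypes after every individual move (rather than once per pass) is what the quoted step (2) describes, and it is what lets the loop converge without a separate batch-update phase.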

2 | Automated construction of classifications: Clustering versus numerical taxonomy - Michalski, Stepp - 1983 |