Results 1 - 5 of 5
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, 1998
Abstract

Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prevents it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
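The combined dissimilarity measure described in this abstract can be sketched as follows; the function names and the weight `gamma` are illustrative, not taken from the paper, but the structure (squared Euclidean distance on numeric attributes plus a weighted simple-matching count on categorical ones) follows the k-prototypes idea:

```python
def matching_dissimilarity(a, b):
    """Simple matching: the number of categorical attributes that differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def kprototypes_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Combined measure for mixed-type objects: squared Euclidean distance
    on the numeric part plus gamma times the simple-matching count on the
    categorical part. gamma is a user-chosen weight balancing the two."""
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical = matching_dissimilarity(x_cat, proto_cat)
    return numeric + gamma * categorical
```

With `gamma = 0` the measure reduces to the k-means distance; as `gamma` grows, the categorical attributes dominate the assignment step.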
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Abstract

Cited by 84 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is well suited to this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested on the well-known soybean disease data set, the algorithm demonstrated very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
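The k-means-style loop described here (assign each object to its nearest mode under simple matching, then recompute each cluster's mode attribute-by-attribute from frequencies) can be sketched as below. This is a minimal illustration, not the paper's implementation; initialisation and convergence handling are simplified:

```python
from collections import Counter
import random

def mode_of(cluster):
    """Frequency-based mode update: per attribute, the most frequent category."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, iters=10, seed=0):
    """Cluster categorical tuples: assign each object to the nearest mode
    (simple matching dissimilarity), then recompute each cluster's mode.
    Empty clusters keep their previous mode."""
    rng = random.Random(seed)
    modes = rng.sample(objects, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for obj in objects:
            d = [sum(a != b for a, b in zip(obj, m)) for m in modes]
            clusters[d.index(min(d))].append(obj)
        modes = [mode_of(c) if c else modes[i] for i, c in enumerate(clusters)]
    return modes, clusters
```

Because the update step only counts category frequencies, each iteration is linear in the number of objects, which is what makes the approach scale to large data sets.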
Quasi-continuous histograms, 2009
Abstract
Histograms are very useful for summarizing statistical information associated with a set of observed data. They are one of the most frequently used density estimators due to their ease of implementation and interpretation. However, histograms suffer from a high sensitivity to the choice of both the reference interval and the bin width. This paper addresses this difficulty by means of a fuzzy partition. We propose a new density estimator based on transferring the counts associated with each cell of the fuzzy partition to any subset of the reference interval. We introduce three different methods of achieving this transfer. The properties of each method are illustrated with a classic real observation set. The density estimator obtained relates to the Parzen–Rosenblatt kernel density estimation technique. In this paper, we only consider the univariate case with precise and imprecise observations.
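The counting step behind a fuzzy-partition histogram can be illustrated with a uniform triangular partition: each observation splits its unit mass between the two neighbouring fuzzy cells in proportion to its membership, rather than falling entirely into one crisp bin. This is a minimal sketch of that idea only; it does not reproduce the paper's three transfer methods:

```python
def fuzzy_histogram(data, lo, hi, n_bins):
    """Fuzzy counts on a uniform triangular fuzzy partition of [lo, hi].
    Each x in data contributes (1 - frac) to the cell on its left and
    frac to the cell on its right, where frac is its fractional position
    between the two cell centres. Assumes lo <= x <= hi."""
    centers = [lo + i * (hi - lo) / (n_bins - 1) for i in range(n_bins)]
    width = (hi - lo) / (n_bins - 1)
    counts = [0.0] * n_bins
    for x in data:
        pos = (x - lo) / width
        i = min(int(pos), n_bins - 2)  # index of the left cell
        frac = pos - i                 # membership in the right cell
        counts[i] += 1.0 - frac
        counts[i + 1] += frac
    return centers, counts
```

Unlike a crisp histogram, a small shift of the data moves mass gradually between neighbouring cells, which is the source of the reduced sensitivity to the partition that the abstract mentions.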
Interval-valued probability density estimation based on quasi-continuous histograms: Proof of the conjecture, 2011
Obtaining the Minimal Polygonal Representation of a Curve by Means of a Fuzzy Clustering
Abstract
The problem of obtaining a minimal polygonal representation of a plane digital curve is treated using a fuzzy clustering method. The fuzzy clustering is realized by relations of similarity and dissimilarity defined on the planar digital curve.