Results 1 - 6 of 6
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 252 (3 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, because it works only on numeric values, it cannot be used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and to domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow the clustering of objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
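The combined dissimilarity measure behind k-prototypes can be illustrated with a minimal sketch: squared Euclidean distance on the numeric attributes plus a weight times the simple matching count on the categorical attributes. The function names and the `gamma` default here are illustrative assumptions, not the authors' implementation.

```python
def matching_dissimilarity(a_cat, b_cat):
    # Simple matching measure: count categorical attributes that differ.
    return sum(x != y for x, y in zip(a_cat, b_cat))

def prototype_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    # Combined measure for mixed data: squared Euclidean distance on the
    # numeric part plus gamma times the matching dissimilarity on the
    # categorical part. gamma balances the two attribute types.
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    return numeric + gamma * matching_dissimilarity(x_cat, p_cat)
```

For example, an object `((1.0, 2.0), ("a",))` compared against a prototype `((1.0, 4.0), ("b",))` with `gamma=0.5` gives a numeric part of 4.0 and a categorical part of 0.5.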
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Abstract

Cited by 115 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited to this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining, because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, which extends the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace the means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested on the well-known soybean disease data set, the algorithm demonstrated very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in both the number of clusters and the number of records.
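The k-modes procedure described in this abstract — matching dissimilarity, modes in place of means, frequency-based mode updates — can be sketched as follows. This is an illustrative reconstruction under stated assumptions (random initial modes, a fixed iteration count), not the authors' code.

```python
from collections import Counter
import random

def matching_dissimilarity(a, b):
    # Simple matching: count attributes on which the two objects differ.
    return sum(x != y for x, y in zip(a, b))

def k_modes(objects, k, n_iters=10, seed=0):
    # objects: list of tuples of categorical values, all the same length.
    rng = random.Random(seed)
    modes = rng.sample(objects, k)  # assumed initialisation: random objects
    clusters = [[] for _ in range(k)]
    for _ in range(n_iters):
        # Assignment step: each object joins the cluster of its nearest mode.
        clusters = [[] for _ in range(k)]
        for obj in objects:
            j = min(range(k), key=lambda i: matching_dissimilarity(obj, modes[i]))
            clusters[j].append(obj)
        # Frequency-based update: the new mode takes, for each attribute,
        # the most frequent category among the cluster's objects.
        for j, cluster in enumerate(clusters):
            if cluster:
                modes[j] = tuple(
                    Counter(col).most_common(1)[0][0] for col in zip(*cluster)
                )
    return modes, clusters
```

Replacing means with modes keeps every cluster representative a valid categorical object, which is what lets the k-means-style alternation carry over to categorical data.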
Interval-valued probability density estimation based on quasi-continuous histograms: Proof of the conjecture
, 2011
Solving Factorable Programs with Applications to Cluster Analysis
, 2005
Abstract
Despite recent advances in optimization research and computing technology, deriving globally optimal solutions to nonconvex optimization problems remains a daunting task. Existing approaches for solving such formidable problems are typically heuristic in nature, often leading to significantly suboptimal solutions. This motivates the need to develop a framework for optimally solving a broad class of nonconvex programming problems that still retains sufficient flexibility to exploit inherent special structures. Toward this end, we focus in this dissertation on a variety of applications that occur in practice as instances of polynomial programming problems or more general nonconvex factorable programs, and we employ a central theme based on the Reformulation-Linearization Technique (RLT) to design theoretically convergent and practically effective and robust solution methodologies. We begin our discussion in this dissertation by providing a basis for developing efficient solution methodologies for solving the class of nonconvex factorable programming problems. Recognizing the ability of the RLT to solve polynomial programs to (global) optimality, the basic idea is to solve the given nonconvex program
Obtaining the Minimal Polygonal Representation of a Curve by Means of a Fuzzy Clustering
Abstract
The problem of obtaining a minimal polygonal representation of a plane digital curve is treated using a fuzzy clustering method. The fuzzy clustering is realized by relations of similarity and dissimilarity defined on the planar digital curve.
Quasi-continuous histograms
, 2009
Abstract
Histograms are very useful for summarizing the statistical information associated with a set of observed data. They are one of the most frequently used density estimators due to their ease of implementation and interpretation. However, histograms suffer from a high sensitivity to the choice of both the reference interval and the bin width. This paper addresses this difficulty by means of a fuzzy partition. We propose a new density estimator based on transferring the counts associated with each cell of the fuzzy partition to any subset of the reference interval. We introduce three different methods of achieving this transfer. The properties of each method are illustrated with a classic real observation set. The resulting density estimator relates to the Parzen–Rosenblatt kernel density estimation technique. In this paper, we only consider the monovariate case with precise and imprecise observations.
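The starting point of such estimators — a histogram accumulated on a fuzzy partition rather than on crisp bins — can be illustrated with a minimal sketch. This is not the paper's estimator or its transfer methods; it only shows, under the common assumption of a uniform partition with triangular membership functions, how each observation splits its unit count between the two cells it partially belongs to.

```python
def fuzzy_histogram_counts(samples, lo, hi, n_cells):
    # Accumulate counts on a uniform fuzzy partition of [lo, hi] with
    # triangular membership functions centred on the cell centres
    # (requires n_cells >= 2). Each observation contributes its
    # membership degree to every cell, so its total contribution is 1
    # for any sample inside the reference interval.
    width = (hi - lo) / (n_cells - 1)
    centers = [lo + i * width for i in range(n_cells)]
    counts = [0.0] * n_cells
    for x in samples:
        for i, c in enumerate(centers):
            membership = max(0.0, 1.0 - abs(x - c) / width)
            counts[i] += membership
    return counts
```

Unlike a crisp histogram, nudging an observation across a bin boundary changes the counts gradually rather than by a whole unit, which is what reduces the sensitivity to bin placement that the abstract describes.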