Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values, 1998
Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
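The k-modes iteration described in this abstract (simple matching dissimilarity, modes in place of means, frequency-based mode updates) can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation; in particular, initialising from distinct objects is an assumption made here for simplicity.

```python
import random
from collections import Counter

def matching_dissimilarity(x, mode):
    """Simple matching: the number of attributes on which x differs from the mode."""
    return sum(a != b for a, b in zip(x, mode))

def k_modes(data, k, iters=20, seed=0):
    """Minimal k-modes sketch over a list of categorical tuples: assign
    each object to its nearest mode, then recompute each mode
    attribute-wise as the most frequent category in its cluster."""
    rng = random.Random(seed)
    # Initialise from k distinct objects (an assumption; the paper's
    # initialisation may differ).
    modes = [list(m) for m in rng.sample(sorted(set(data)), k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            nearest = min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
            clusters[nearest].append(x)
        for j, members in enumerate(clusters):
            if members:  # frequency-based mode update, one attribute at a time
                modes[j] = [Counter(column).most_common(1)[0][0]
                            for column in zip(*members)]
    labels = [min(range(k), key=lambda j: matching_dissimilarity(x, modes[j]))
              for x in data]
    return labels, modes
```

The k-prototypes idea then combines this matching measure on categorical attributes with a numeric distance on the remaining attributes via a combined dissimilarity measure.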
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
In Research Issues on Data Mining and Knowledge Discovery, 1997
Cited by 82 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of ...
HARP: A practical projected clustering algorithm
IEEE Transactions on Knowledge and Data Engineering, 2004
Cited by 23 (3 self)
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications. Index Terms—Data mining, mining methods and algorithms, clustering, bioinformatics.
A Fast and Robust General Purpose Clustering Algorithm
In Pacific Rim International Conference on Artificial Intelligence, 2000
Cited by 16 (2 self)
General-purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient and generally applicable to multidimensional data, but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
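The discrete-median idea can be sketched as below. The farthest-first initialisation and the Euclidean distance are assumptions made here to keep the example self-contained, not details taken from the paper.

```python
def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def k_discrete_medians(data, k, iters=20):
    """k-means-style iteration that estimates each cluster center with
    the discrete median: the member minimising the total distance to
    its co-members.  Unlike the mean, this center is always an actual
    data point, so a single outlier cannot drag it arbitrarily far."""
    # Deterministic farthest-first initialisation (an assumed choice).
    centers = [data[0]]
    while len(centers) < k:
        centers.append(max(data, key=lambda x: min(euclidean(x, c) for c in centers)))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in data:
            clusters[min(range(k), key=lambda j: euclidean(x, centers[j]))].append(x)
        for j, members in enumerate(clusters):
            if members:
                centers[j] = min(members,
                                 key=lambda m: sum(euclidean(m, y) for y in members))
    labels = [min(range(k), key=lambda j: euclidean(x, centers[j])) for x in data]
    return labels, centers
```

The discrete-median update costs O(n²) distance evaluations per cluster, versus O(n) for the mean; the trade-off the abstract highlights is robustness rather than speed.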
Cluster Analysis using Triangulation, 1997
Cited by 14 (0 self)
This paper looks at clustering using tools from graph theory. It first triangulates the data, then partitions the edges of the resulting graph into inter- and intra-cluster edges. The technique is unaffected by the actual shape of the clusters, thus allowing a far more general version of the clustering problem to be solved. Section 2 of the paper is a general introduction to clustering, which includes a brief description of the commonly used k-means technique. Following this is a discussion of the problems which arise in the k-means (and related) methods and why there is a need for graph-based methods. Sections 4 and 6 explain the proposed new method, and give examples of its success. Section 5 discusses a few existing graph-based methods and why they can be improved upon. The test programs, which provide the results discussed in this paper, are currently written for two-dimensional data sets, but Section 7 explains how the same principles can be extended to higher-dimensional problems.
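One way the edge-partitioning step can be realised is sketched below, under two assumptions not taken from the paper: the triangulation is a Delaunay triangulation, and edges are labelled inter-cluster when their length exceeds a simple mean-plus-multiple-of-standard-deviation threshold.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulation_clusters(points, factor=2.0):
    """Cluster 2-D points by triangulating them, discarding edges much
    longer than typical (treated as inter-cluster edges), and returning
    the connected components of the remaining intra-cluster graph."""
    tri = Delaunay(points)
    edges = set()
    for a, b, c in tri.simplices:  # collect the unique edges of all triangles
        for u, v in ((a, b), (b, c), (a, c)):
            edges.add((min(u, v), max(u, v)))
    edges = np.array(sorted(edges))
    lengths = np.linalg.norm(points[edges[:, 0]] - points[edges[:, 1]], axis=1)
    keep = lengths < lengths.mean() + factor * lengths.std()  # assumed threshold
    # Union-find over the kept edges yields the clusters.
    parent = list(range(len(points)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for u, v in edges[keep]:
        parent[find(u)] = find(v)
    roots = np.array([find(i) for i in range(len(points))])
    _, labels = np.unique(roots, return_inverse=True)  # relabel as 0..k-1
    return labels
```

Because only edge lengths are thresholded, arbitrarily shaped (non-convex, elongated) clusters survive intact, which is the generality the abstract claims over centroid-based methods.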
An alternative extension of the k-means algorithm for clustering categorical data
Int. J. Appl. Math. Comput. Sci., 2004
Cited by 14 (0 self)
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” to a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, the soybean disease and nursery databases.
Keep it Simple: A Case-Base Maintenance Policy Based on Clustering and Information Theory
In Proc. of the Canadian AI Conference, 2000
Cited by 7 (0 self)
Today’s case-based reasoning applications face several challenges. In a typical application, the case bases grow at a very fast rate and their contents become increasingly diverse, making it necessary to partition a large case base into several smaller ones. Their users are overloaded with vast amounts of information during the retrieval process. These problems call for the development of effective case-base maintenance methods. As a result, many researchers have been driven to design sophisticated case-base structures or maintenance methods. In contrast, we hold a different point of view: we maintain that the structure of a case base should be kept as simple as possible, and that the maintenance method should be as transparent as possible. In this paper we propose a case-base maintenance method that avoids building sophisticated structures around a case base or performing complex operations on it. Our method partitions cases into clusters such that the cases in the same cluster are more similar to each other than to cases in other clusters. In addition to the content of textual cases, the clustering method we propose can also be based on the values of attributes that may be attached to the cases. Clusters can be converted to new case bases, which are smaller in size and, when stored in a distributed manner, entail simpler maintenance operations. The contents of the new case bases are more focused and easier to retrieve and update. To support retrieval in this distributed case-base network, we present a method based on a decision forest built with the attributes that are obtained through an innovative modification of the ID3 algorithm.
A Highly-usable Projected Clustering Algorithm for Gene Expression Profiles
In Proc. of the 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD’03), 2003
Cited by 5 (0 self)
Projected clustering has become a hot research topic due to its ability to cluster high-dimensional data. However, most existing projected clustering algorithms depend on some critical user parameters in determining the relevant attributes of each cluster. If wrong parameter values are used, the clustering performance will be seriously degraded. Unfortunately, correct parameter values are rarely known in real datasets. In this paper, we propose a projected clustering algorithm that does not depend on user inputs in determining relevant attributes. Instead, it responds to the clustering status and adjusts its internal thresholds dynamically. In our experiments, the algorithm shows much higher usability than the other projected clustering algorithms in the comparison study. It also works well with a gene expression dataset for studying lymphoma. The high usability of the algorithm and the encouraging results suggest that projected clustering can be a practical tool for analyzing gene expression profiles.
Identifying projected clusters from gene expression profiles
Journal of Biomedical Informatics (JBI)
Cited by 3 (1 self)
In microarray gene expression data, clusters may hide in certain subspaces. For example, a set of co-regulated genes may have similar expression patterns in only a subset of the samples in which certain regulating factors are present. Their expression patterns could be dissimilar when measured in the full input space. Traditional clustering algorithms that rely on such similarity measurements may fail to identify the clusters. In recent years a number of algorithms have been proposed to identify this kind of projected cluster, but many of them rely on critical parameters whose proper values are hard for users to determine. In this paper a new algorithm that dynamically adjusts its internal thresholds is proposed. It has a low dependency on user parameters while allowing users to input domain knowledge should it be available. Experimental results show that the algorithm is capable of identifying some interesting projected clusters.
Fuzzy Clustering of Quantitative and Qualitative Data
Cited by 3 (0 self)
In many applications the objects to cluster are described by quantitative as well as qualitative features. A variety of algorithms has been proposed for unsupervised classification when fuzzy partitions and descriptive cluster prototypes are desired. However, most of these methods are designed for data sets with variables measured in the same scale type (only categorical, or only metric). We propose a new fuzzy clustering approach based on a probabilistic distance measure. This avoids a major drawback of existing methods, namely their tendency to favor one type of attribute over the other.