Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Cited by 109 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
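The combined dissimilarity behind k-prototypes can be sketched as follows. This is a minimal illustration, assuming squared Euclidean distance on the numeric attributes plus a weighted count of categorical mismatches; the function names and the weight `gamma` are illustrative choices, not details stated in the abstract.

```python
def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two categorical tuples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def mixed_dissimilarity(num_a, cat_a, num_b, cat_b, gamma=1.0):
    """Combined measure for objects with numeric and categorical parts.

    Assumption: the numeric part uses squared Euclidean distance and
    `gamma` (hypothetical name) weights the categorical mismatches.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    categorical_part = matching_dissimilarity(cat_a, cat_b)
    return numeric_part + gamma * categorical_part


# One numeric attribute differs by 2 and one categorical attribute differs.
d = mixed_dissimilarity([1.0, 2.0], ["a", "b"], [1.0, 4.0], ["a", "c"], gamma=0.5)
```

With `gamma = 0` the measure reduces to the numeric (k-means-style) distance, and with no numeric attributes it reduces to simple matching, which is how a single measure can cover both cases.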
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
- In Research Issues on Data Mining and Knowledge Discovery
, 1997
Cited by 70 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well known soybean disease data set the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of ...
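The three ingredients named in this abstract (a matching dissimilarity, modes in place of means, and a frequency-based mode update) can be sketched as a small batch procedure. This is a simplified illustration rather than the paper's exact algorithm: it reassigns all objects and then recomputes all modes in each pass, and the names `k_modes`, `column_modes`, and `initial_modes` are hypothetical.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical tuples differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def column_modes(rows):
    """Frequency-based update: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=20):
    """Batch k-modes sketch: assign to the nearest mode, then recompute modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest mode under simple matching.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Update step: per-attribute modes of each non-empty cluster.
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes


data = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("c", "z")]
labels, modes = k_modes(data, initial_modes=[("a", "x"), ("b", "z")])
```

Each pass touches every object once per mode, so the cost per iteration grows linearly with the number of objects, which is consistent with the scalability claim made for k-means-style methods.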
HARP: A practical projected clustering algorithm
- IEEE Transactions on Knowledge and Data Engineering
, 2004
Cited by 14 (1 self)
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional clustering methods. A number of algorithms have been proposed to identify such projected clusters, but most of them rely on some user parameters to guide the clustering process. The clustering accuracy can be seriously degraded if incorrect values are used. Unfortunately, in real situations, it is rarely possible for users to supply the parameter values accurately, which causes practical difficulties in applying these algorithms to real data. In this paper, we analyze the major challenges of projected clustering and suggest why these algorithms need to depend heavily on user parameters. Based on the analysis, we propose a new algorithm that exploits the clustering status to adjust the internal thresholds dynamically without the assistance of user parameters. According to the results of extensive experiments on real and synthetic data, the new method has excellent accuracy and usability. It outperformed the other algorithms even when correct parameter values were artificially supplied to them. The encouraging results suggest that projected clustering can be a practical tool for various kinds of real applications. Index Terms: Data mining, mining methods and algorithms, clustering, bioinformatics.
Cluster Analysis using Triangulation
, 1997
Cited by 13 (0 self)
This paper looks at clustering using tools from graph theory. It first triangulates the data, then partitions the edges of the resulting graph into inter- and intra-cluster edges. The technique is unaffected by the actual shape of the clusters, thus allowing a far more general version of the clustering problem to be solved. Section 2 of the paper is a general introduction to clustering, which includes a brief description of the commonly used k-means technique. Following this is a discussion of the problems which arise in the k-means (and related) methods and why there is a need for graph-based methods. Sections 4 and 6 explain the proposed new method, and give examples of its success. Section 5 discusses a few existing graph-based methods and why they can be improved upon. The test programs, which provide the results discussed in this paper, are currently written for two dimensional data sets, but Section 7 explains how the same principles can be extended to higher dimensional problems.
A Fast and Robust General Purpose Clustering Algorithm
- In Pacific Rim International Conference on Artificial Intelligence
, 2000
Cited by 12 (2 self)
General purpose and highly applicable clustering methods are usually required during the early stages of knowledge discovery exercises. k-Means has been adopted as the prototype of iterative model-based clustering because of its speed, simplicity and capability to work within the format of very large databases. However, k-Means has several disadvantages derived from its statistical simplicity. We propose an algorithm that remains very efficient, generally applicable, multi-dimensional but is more robust to noise and outliers. We achieve this by using the discrete median rather than the mean as the estimator of the center of a cluster. Comparison with k-Means, Expectation Maximization and Gibbs sampling demonstrates the advantages of our algorithm.
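The robustness argument for using a discrete median instead of a mean can be illustrated with a small example. This sketch assumes the discrete median is the cluster member that minimises the total Euclidean distance to all other members; it uses a naive quadratic search for clarity and is not the paper's efficient algorithm. The function names are illustrative.

```python
import math


def mean_center(points):
    """Coordinate-wise arithmetic mean of a set of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def discrete_median(points):
    """Member of `points` with the smallest total distance to all members.

    Naive O(n^2) search, used here only to illustrate the estimator.
    """
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Four points near the origin plus one far outlier.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (100.0, 100.0)]
center_mean = mean_center(points)        # dragged toward the outlier
center_median = discrete_median(points)  # stays among the bulk of the data
```

Because the discrete median must be one of the data points, a single distant outlier cannot pull the center away from the dense region the way it pulls the mean.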
An alternative extension of the k-means algorithm for clustering categorical data
- Int. J. Appl. Math. Comput. Sci
, 2004
Cited by 9 (0 self)
Most of the earlier work on clustering has mainly been focused on numerical data whose inherent geometric properties can be exploited to naturally define distance functions between data points. Recently, the problem of clustering categorical data has started drawing interest. However, the computational cost makes most of the previous algorithms unacceptable for clustering very large databases. The k-means algorithm is well known for its efficiency in this respect. At the same time, working only on numerical data prohibits it from being used for clustering categorical data. The main contribution of this paper is to show how to apply the notion of “cluster centers” on a dataset of categorical objects and how to use this notion for formulating the clustering problem of categorical objects as a partitioning problem. Finally, a k-means-like algorithm for clustering categorical data is introduced. The clustering performance of the algorithm is demonstrated with two well-known data sets, namely, soybean disease and nursery databases.
A Highly-usable Projected Clustering Algorithm for Gene Expression Profiles
- In Proc. of the 3rd ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD’03)
, 2003
Cited by 4 (0 self)
Projected clustering has become a hot research topic due to its ability to cluster high-dimensional data. However, most existing projected clustering algorithms depend on some critical user parameters in determining the relevant attributes of each cluster. In case wrong parameter values are used, the clustering performance will be seriously degraded. Unfortunately, correct parameter values are rarely known in real datasets. In this paper, we propose a projected clustering algorithm that does not depend on user inputs in determining relevant attributes. It responds to the clustering status and adjusts the internal thresholds dynamically. From experimental results, our algorithm shows a much higher usability than the other projected clustering algorithms used in our comparison study. It also works well with a gene expression dataset for studying lymphoma. The high usability of the algorithm and the encouraging results suggest that projected clustering can be a practical tool for analyzing gene expression profiles.
Keep it Simple: A Case-Base Maintenance Policy Based on Clustering and Information Theory
- In Proc. of the Canadian AI Conference
, 2000
Cited by 4 (0 self)
Today’s case based reasoning applications face several challenges. In a typical application, the case bases grow at a very fast rate and their contents become increasingly diverse, making it necessary to partition a large case base into several smaller ones. Their users are overloaded with vast amounts of information during the retrieval process. These problems call for the development of effective case-base maintenance methods. As a result, many researchers have been driven to design sophisticated case-base structures or maintenance methods. In contrast, we hold a different point of view: we maintain that the structure of a case base should be kept as simple as possible, and that the maintenance method should be as transparent as possible. In this paper we propose a case-base maintenance method that avoids building sophisticated structures around a case base or performing complex operations on a case base. Our method partitions cases into clusters where the cases in the same cluster are more similar than cases in other clusters. In addition to the content of textual cases, the clustering method we propose can also be based on values of attributes that may be attached to the cases. Clusters can be converted to new case bases, which are smaller in size and, when stored in a distributed fashion, can entail simpler maintenance operations. The contents of the new case bases are more focused and easier to retrieve and update. To support retrieval in this distributed case-base network, we present a method that is based on a decision forest built with the attributes that are obtained through an innovative modification of the ID3 algorithm.
SemBiosphere: A Semantic Web Approach to Recommending Microarray Clustering Services
- In Proc. of the Pacific Symposium on Biocomputing
, 2006
Cited by 2 (1 self)
Clustering is a popular method for analyzing microarray data. Given the large number of clustering algorithms available, it is difficult to identify the most suitable ones for a particular task. It is also difficult to locate, download, install and run the algorithms. This paper describes a matchmaking system, SemBiosphere, which solves both problems. It recommends clustering algorithms based on some minimal user requirement inputs and the data properties. An ontology was developed in OWL, an expressive ontological language, for describing what the algorithms are and how they perform, in addition to how they can be invoked. This allows machines to “understand” the algorithms and make the recommendations. The algorithms can be implemented by different groups and in different languages, and run on different platforms at geographically distributed sites. Through the use of XML-based web services, they can all be invoked in the same standard way. The current clustering services were transformed from the non-semantic web services of the Biosphere system, which includes a variety of algorithms that have been applied to microarray gene expression data analysis. New algorithms can be incorporated into the system without too much effort. The SemBiosphere system and the complete clustering ontology can be accessed at
Identifying projected clusters from gene expression profiles
- Journal of Biomedical Informatics (JBI)
Cited by 2 (1 self)
In microarray gene expression data, clusters may hide in certain subspaces. For example, a set of co-regulated genes may have similar expression patterns in only a subset of the samples in which certain regulating factors are present. Their expression patterns could be dissimilar when measured in the full input space. Traditional clustering algorithms that make use of such similarity measurements may fail to identify the clusters. In recent years a number of algorithms have been proposed to identify this kind of projected cluster, but many of them rely on some critical parameters whose proper values are hard for users to determine. In this paper a new algorithm that dynamically adjusts its internal thresholds is proposed. It has a low dependency on user parameters while allowing users to input some domain knowledge should it be available. Experimental results show that the algorithm is capable of identifying some interesting projected clusters.

