Results 1 - 5 of 5
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 174 (3 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
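The two dissimilarity measures named in this abstract can be illustrated with a minimal sketch. The abstract does not give formulas, so the code below assumes the common formulation: simple matching counts mismatched categorical attributes, and the combined measure adds a squared Euclidean term on the numeric attributes to a weighted mismatch count. The function names and the weight parameter `gamma` are illustrative assumptions, not taken from the listing.

```python
from typing import Sequence


def simple_matching(a: Sequence[str], b: Sequence[str]) -> int:
    """Count the attribute positions where two categorical objects differ."""
    if len(a) != len(b):
        raise ValueError("objects must have the same number of attributes")
    return sum(x != y for x, y in zip(a, b))


def mixed_dissimilarity(
    num_a: Sequence[float],
    cat_a: Sequence[str],
    num_b: Sequence[float],
    cat_b: Sequence[str],
    gamma: float = 1.0,  # assumed weight balancing numeric vs. categorical parts
) -> float:
    """Squared Euclidean distance on numeric attributes plus a weighted
    simple-matching count on categorical attributes (assumed formulation)."""
    if len(num_a) != len(num_b):
        raise ValueError("numeric parts must have the same length")
    squared = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return squared + gamma * simple_matching(cat_a, cat_b)


if __name__ == "__main__":
    print(simple_matching(["red", "small", "round"], ["red", "large", "round"]))
    print(mixed_dissimilarity([1.0, 2.0], ["a", "b"], [2.0, 4.0], ["a", "c"], gamma=0.5))
```

A larger `gamma` makes categorical mismatches dominate the assignment; with `gamma = 0` the measure reduces to the numeric k-means distance.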
A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining
Abstract

Cited by 92 (2 self)
Partitioning a large set of objects into homogeneous clusters is a fundamental operation in data mining. The k-means algorithm is best suited for implementing this operation because of its efficiency in clustering large data sets. However, working only on numeric values limits its use in data mining because data sets in data mining often contain categorical values. In this paper we present an algorithm, called k-modes, to extend the k-means paradigm to categorical domains. We introduce new dissimilarity measures to deal with categorical objects, replace means of clusters with modes, and use a frequency-based method to update modes in the clustering process to minimise the clustering cost function. Tested with the well-known soybean disease data set, the algorithm has demonstrated a very good classification performance. Experiments on a very large health insurance data set consisting of half a million records and 34 categorical attributes show that the algorithm is scalable in terms of both the number of clusters and the number of records.
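The assign-then-update loop described in this abstract can be sketched roughly as follows. This is a simplified batch variant written for illustration, not the paper's exact procedure: it assumes the mismatch count as the dissimilarity, takes the initial modes as an argument, and recomputes each mode as the most frequent category per attribute only after all records have been assigned. All names here are hypothetical.

```python
from collections import Counter


def mismatches(a, b):
    """Simple matching dissimilarity: number of differing attributes."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based update: most frequent category in each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, initial_modes, max_iter=20):
    """Batch k-modes sketch (assumed variant): assign, then update modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = []
    for _ in range(max_iter):
        # Assign every record to the nearest mode.
        new_labels = [
            min(range(len(modes)), key=lambda j: mismatches(r, modes[j]))
            for r in records
        ]
        if new_labels == labels:  # no record changed cluster: converged
            break
        labels = new_labels
        # Recompute each mode from its current members.
        for j in range(len(modes)):
            members = [r for r, lab in zip(records, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


if __name__ == "__main__":
    data = [("a", "x"), ("a", "y"), ("b", "z"), ("b", "z"), ("a", "x")]
    print(k_modes(data, [("a", "x"), ("b", "z")]))
```

Each pass costs time proportional to the number of records times the number of clusters times the number of attributes, which is consistent with the scalability claim in the abstract, although the sketch makes no attempt to reproduce the reported experiments.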
Flexible Matching for Noisy Structural Descriptions
 In Proceedings of the 12th IJCAI
, 1991
Abstract

Cited by 10 (5 self)
Uncertainty in data often makes the task of perfectly matching two descriptions quite ineffective. In this case, flexible matching, which measures the similarity of two descriptions rather than their equality, is more useful. Following the convention of connecting similarity to the most common concept of distance, we present a definition of a distance measure, based on a probabilistic interpretation of the matching predicate, which can cope with structural deformations. As the problem of matching two formulas of the FOPL is NP-complete, two methods are presented in order to cope with complexity: first, a branch-and-bound algorithm, and second, a heuristic method. These ideas are applied to the problem of recognizing office documents in digital form according to their page layout.
TOPICAL CLUSTERING OF BIOMEDICAL ABSTRACT by SELF ORGANIZING MAPS
Abstract
One of the major challenges in the postgenomic era is speeding up the identification of molecules involved in a specific disease (molecular targets). Even though experimental procedures have greatly enhanced analytical capability, textual data analysis still plays a central role in the design of experimental activity or in data collection. The extraction of useful information from published papers still depends strongly on human expertise in the selection and retrieval of relevant papers. Searching for abstracts in the MEDLINE or PubMed databases is a common activity for researchers. Navigation in textual databases is often not simple, and in many cases the user can retrieve only a list of abstracts without any additional information about how the abstract content relates to the submitted query. In the last decade the application of natural language processing tools has acquired some relevance in the bioinformatics field. The possibility of retrieving and organizing textual information according to specific topics allows the user to select and analyse only a reduced set of papers. In our work we present the application of a document clustering system, based on Self-Organizing Maps, to reorganize in a hierarchical way the cluster of abstracts retrieved by a PubMed query. The system is available at the following site