Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
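The three ingredients named in this abstract (simple matching dissimilarity, modes in place of means, frequency-based mode updates) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names are hypothetical, modes are recomputed once per pass rather than after each allocation, and the form of the combined measure (squared Euclidean distance plus a weight `gamma` times the matching term) is an assumption, since the abstract does not spell it out.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based update: most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign objects to the nearest mode, then update modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in data
        ]
        if new_labels == labels:  # no object changed cluster: converged
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [obj for obj, lab in zip(data, labels) if lab == j]
            if members:  # an empty cluster keeps its previous mode
                modes[j] = column_modes(members)
    return labels, modes


def mixed_dissimilarity(a_num, a_cat, b_num, b_cat, gamma=1.0):
    """Assumed combined measure: squared Euclidean part plus weighted matching part."""
    numeric = sum((x - y) ** 2 for x, y in zip(a_num, b_num))
    return numeric + gamma * matching_dissimilarity(a_cat, b_cat)
```

Calling `k_modes(data, [data[0], data[3]])` on a list of equal-length tuples of category labels returns one cluster index per object together with the final modes. A k-prototypes variant would use `mixed_dissimilarity` for assignment and update numeric attributes by their means and categorical attributes by their modes.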
Descriptive Clustering as a Method for Exploring Text Collections
, 2006
Abstract

Polish title: Grupowanie opisowe jako metoda eksploracji zbiorów dokumentów tekstowych (Descriptive clustering as a method for exploring collections of text documents)
Determination of Clustering Tendency With ART Neural
 IN: PROCEEDINGS OF 4TH INTL. CONF. ON RECENT ADVANCES IN SOFT COMPUTING
, 2002
Abstract

We describe how Adaptive Resonance Theory (ART) neural networks can be used to establish the clustering tendency of binary data. Clustering tendency is the important yet poorly investigated problem of determining whether or not there is natural structure in data.
Statistical Data Compression by Optimal Segmentation: Theory, Algorithms and Experimental Results
, 1999
Abstract

Contents
1 Introduction
1.1 The Optimization Problem
1.2 Crucial Questions
2 Optimization Problems
2.1 Classic Problems
2.2 Equivalent Maximization Problems
2.3 Generalized Problems
2.4 Methods for Solution
2.4.1 Fixpoint Method
2.4.2 Adaptive Methods
2.4.3 Degeneration
3 Basic Algorithms
3.1 Fixpoint Algorithm
3.1.1 K-means (Fixpoint Method)
3.1.2 Generalized Fixpoint Method
3.2 Neural Gas Algorithm
3.2.1 Basic Neural Gas
3.2.2 Generalized Neural Gas
4 Improved Fixpoint Algor
Hybrid Minimal Spanning Tree and Mixture of Gaussians based Clustering Algorithm
Abstract

Clustering is an important tool to explore the hidden structure of large databases. There are several algorithms based on different approaches (hierarchical, partitional, density-based, model-based, etc.). Most of these algorithms have some shortcomings, e.g. they are not able to detect clusters with non-convex shapes, the number of clusters must be known a priori, and they suffer from numerical problems such as sensitivity to the initialization. In this paper we introduce a new clustering algorithm based on the synergistic combination of the hierarchical and graph-theoretic minimal spanning tree based clustering and the partitional Gaussian mixture model-based clustering algorithms. The aim of this hybridization is to increase the robustness and consistency of the clustering results and to decrease the number of heuristically defined parameters of these algorithms, and thereby the influence of the user on the clustering results. As the examples used to illustrate the operation of the new algorithm will show, the proposed algorithm can detect clusters of arbitrary shape in data and does not suffer from the numerical problems of the Gaussian mixture based clustering algorithms.
Group-based estimation of missing hydrological data. I. Approach and general
, 2000
Abstract

In this first paper in a set of two, the problem of estimating missing segments in streamflow records is described. The group approach, which differs from the traditional single-valued approach, is proposed and explained. The approach perceives the hydrological data as a sequence of groups rather than single-valued observations. The techniques suggested to handle the group approach are regression, time series analysis, partitioning modelling, and artificial neural networks. Pertinent literature is reviewed and background material is used to support the group approach. Implementation and comparisons of the models' performance are deferred to the second paper. French title: L'approche de groupe pour l'estimation des données hydrologiques manquantes: I. Présentation et méthodologie. French summary (translated): In this first of two papers, we describe the problem of estimating runs of missing data in streamflow records. We present and explain the group approach, which differs from traditional approaches focused on estimating single values. This new approach treats hydrological data as sequences of groups rather than sequences of single observations. The techniques that can serve it are regression, time series analysis, segmentation, and artificial neural networks. We present a literature review from which we drew arguments in favour of the group approach. The implementation and evaluation of the group approach are the subject of the second paper.
An a contrario approach to hierarchical clustering validity assessment
Learning feature weights for K-Means clustering using the Minkowski metric
, 2011
Abstract
Thesis submitted in fulfilment of the requirements for the degree of PhD. Declaration: I hereby declare that the work presented in this thesis is my own, and that it has not previously been submitted for a degree or award at this or any other academic institution. Signed: Renato Cordeiro de Amorim. K-Means is arguably the most popular clustering algorithm; this is why it is of great interest to tackle its shortcomings. The drawback at the heart of this project is that this algorithm gives the same level of relevance to all the features in a dataset. This can have disastrous consequences when the features are taken from a database just because they are available. Another issue of concern is that K-Means results are highly dependent on the initial centroids. To address the issue of unequal relevance of the features we use a three-stage extension of the generic K-Means in which a third step is added to the usual two steps
A unified framework for detecting groups and application to shape recognition
Abstract
Internal publication no. 1746, September 2005, 36 pages. A unified a contrario detection method is proposed to solve three classical problems in clustering analysis. The first is to evaluate the validity of a cluster candidate. The second is that meaningful clusters can contain or be contained in other meaningful clusters, so a rule is needed to define locally optimal clusters by inclusion. The third is the definition of a correct merging rule between meaningful clusters, which decides whether they should stay separate or unite. The motivation of this theory is shape recognition. Matching algorithms usually compute correspondences between more or less local features (called shape elements) of the images to be compared. This paper aims to group matching shape elements into spatially coherent shapes. Each pair of matching shape elements leads to a unique transformation (similarity or affine map). As an application, the present theory on the choice of the right clusters is used to group these shape elements into shapes by detecting clusters in the transformation space. Keywords: Cluster validity, merging criterion, number of false alarms, shape recognition
Partitioning clustering algorithms for protein sequence data sets
, 2009