Results 1 - 10 of 16
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Abstract

Cited by 156 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real-world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well-known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real-world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
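The ideas named in this abstract (simple matching dissimilarity, frequency-based mode updates, and a combined measure for mixed data) can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the batch-style update loop, the squared Euclidean numeric term, and the weight `gamma` are all assumptions made for illustration.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: count the attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def cluster_mode(objects):
    """Frequency-based mode: the most common category of each attribute."""
    return tuple(Counter(column).most_common(1)[0][0] for column in zip(*objects))


def k_modes(data, initial_modes, max_iter=20):
    """Batch k-modes sketch (assumed variant): assign, then recompute modes."""
    modes = list(initial_modes)
    labels = None
    for _ in range(max_iter):
        # Assign every object to the nearest mode under simple matching.
        new_labels = [
            min(range(len(modes)), key=lambda j: matching_dissimilarity(obj, modes[j]))
            for obj in data
        ]
        if new_labels == labels:
            break  # assignments are stable, so the modes will not change
        labels = new_labels
        # Recompute each mode from the objects currently assigned to it.
        for j in range(len(modes)):
            members = [obj for obj, label in zip(data, labels) if label == j]
            if members:
                modes[j] = cluster_mode(members)
    return labels, modes


def mixed_dissimilarity(num_a, num_b, cat_a, cat_b, gamma=1.0):
    """Combined measure sketch: squared Euclidean part plus weighted matching part.

    gamma is a hypothetical weight balancing numeric and categorical attributes.
    """
    numeric_part = sum((x - y) ** 2 for x, y in zip(num_a, num_b))
    return numeric_part + gamma * matching_dissimilarity(cat_a, cat_b)


data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "large", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "oval"),
    ("blue", "small", "square"),
]
labels, modes = k_modes(data, initial_modes=[data[0], data[3]])
```

Like k-means, a procedure of this kind depends on the initial modes, so different starting points may give different partitions.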
Descriptive Clustering as a Method for Exploring Text Collections
, 2006
Abstract

Cited by 7 (3 self)
Descriptive clustering as a method for exploring collections of text documents
Determination of Clustering Tendency With ART Neural
 In: Proceedings of the 4th Intl. Conf. on Recent Advances in Soft Computing
, 2002
Abstract

Cited by 4 (3 self)
We describe how Adaptive Resonance Theory (ART) neural networks can be used to establish binary data clustering tendency. Clustering tendency is the important yet poorly investigated problem of determining whether or not there is natural structure in data.
Statistical Data Compression by Optimal Segmentation: Theory, Algorithms and Experimental Results
, 1999
Abstract

Cited by 2 (0 self)
Contents
1 Introduction
  1.1 The Optimization Problem
  1.2 Crucial Questions
2 Optimization Problems
  2.1 Classic Problems
  2.2 Equivalent Maximization Problems
  2.3 Generalized Problems
  2.4 Methods for Solution
    2.4.1 Fixpoint Method
    2.4.2 Adaptive Methods
    2.4.3 Degeneration
3 Basic Algorithms
  3.1 Fixpoint Algorithm
    3.1.1 K-means (Fixpoint Method)
    3.1.2 Generalized Fixpoint Method
  3.2 Neural Gas Algorithm
    3.2.1 Basic Neural Gas
    3.2.2 Generalized Neural Gas
4 Improved Fixpoint Algor
Group-based estimation of missing hydrological data. I. Approach and general
, 2000
Abstract

Cited by 2 (2 self)
In this first paper in a set of two, the problem of estimating missing segments in streamflow records is described. The group approach, different from the traditional single-valued approach, is proposed and explained. The approach perceives the hydrological data as a sequence of groups rather than single-valued observations. The techniques suggested to handle the group approach are regression, time series analysis, partitioning modelling, and artificial neural networks. Pertinent literature is reviewed and background material is used to support the group approach. Implementation and comparisons of the models' performance are deferred to the second paper.
French title: L'approche de groupe pour l'estimation des données hydrologiques manquantes: I. Présentation et méthodologie (The group approach to estimating missing hydrological data: I. Presentation and methodology)
Résumé (translated from French): In this first of two papers, we describe the problem of estimating runs of missing data in streamflow records. We present and explain the group approach, which differs from traditional approaches focused on estimating single values. This new approach treats hydrological data as sequences of groups rather than sequences of single observations. The techniques that can support it are regression, time series analysis, segmentation, and artificial neural networks. We present a literature review from which we draw arguments in favour of the group approach. The implementation and evaluation of the group approach are the subject of the second paper.
Hybrid Minimal Spanning Tree and Mixture of Gaussians based Clustering Algorithm
Abstract

Cited by 2 (0 self)
Clustering is an important tool to explore the hidden structure of large databases. There are several algorithms based on different approaches (hierarchical, partitional, density-based, model-based, etc.). Most of these algorithms have some shortcomings, e.g. they are not able to detect clusters with convex shapes, the number of clusters must be known a priori, and they suffer from numerical problems such as sensitivity to the initialization. In this paper we introduce a new clustering algorithm based on the synergistic combination of the hierarchical and graph-theoretic minimal spanning tree based clustering and the partitional Gaussian mixture model-based clustering algorithms. The aim of this hybridization is to increase the robustness and consistency of the clustering results and to decrease the number of heuristically defined parameters of these algorithms, and thereby the influence of the user on the clustering results. As the examples used to illustrate the operation of the new algorithm will show, the proposed algorithm can detect clusters of arbitrary shape and does not suffer from the numerical problems of Gaussian mixture based clustering algorithms.
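To make the graph-theoretic half of this idea concrete, here is a minimal Python sketch of plain minimal spanning tree clustering. It is not the hybrid algorithm of the paper: the Gaussian mixture step is omitted entirely, and the cutting rule (remove the k - 1 longest tree edges) and the function names are assumptions chosen for illustration.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph; returns (weight, i, j) edges."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection of each point to the tree
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges


def mst_clusters(points, k):
    """Cut the k - 1 longest MST edges (assumed rule) and label the components."""
    edges = sorted(minimum_spanning_tree(points))
    kept = edges[: len(points) - k]  # the tree has n - 1 edges, so n - k remain
    root = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for _, i, j in kept:
        root[find(i)] = find(j)
    label_of_root = {}
    return [label_of_root.setdefault(find(i), len(label_of_root))
            for i in range(len(points))]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mst_clusters(points, k=2)
```

Because the clusters are connected components of a tree, they can follow elongated or curved shapes, which is the property the abstract contrasts with mixture-model clustering.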
An a contrario approach to hierarchical clustering validity assessment
, 1647
Abstract
In this paper we present a method to detect natural groups in a data set, based on hierarchical clustering. A measure of the meaningfulness of clusters, derived from a background model assuming no class structure in the data, provides a way to compare clusters, and leads to a cluster validity criterion. This criterion is applied to every cluster in the nested structure. While all clusters passing the validity test are meaningful in themselves, the set of all of them will probably provide a redundant data representation. By selecting a subset of the meaningful clusters, a good data representation, which also discards outliers, can be achieved. The strategy we propose combines a new merging criterion (also derived from the background model) with a selection of local maxima of the meaningfulness with respect to inclusion, in the nested hierarchical structure.
Document clustering and Visualization PingLin
Abstract
Document clustering is an approach to organize unstructured text information into meaningful groups. It can be applied to documents in a database to improve performance in information retrieval, or it can be used to organize query results from the web or other types of large, heterogeneous text collections. In this report, we describe a new clustering algorithm to categorize and spatially cluster text documents. We employ TF-IDF and term co-occurrence to measure document-to-document similarities. Next, a modified minimum spanning tree algorithm is used to cluster similar documents. Finally, we apply multidimensional scaling on the clusters to represent them as spatial clusters on a 2D plane. The system is tested with sets of articles generated by an Internet search engine for certain topic areas. Our results show that the system is capable of distinguishing different topics and producing recognizable and informative clusters.
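The TF-IDF similarity step mentioned in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions: raw term counts for TF, log(N / df) for IDF, whitespace tokenization, and cosine similarity as the comparison are all choices made here, and the report's term co-occurrence measure, modified spanning tree, and multidimensional scaling are not shown.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (dicts) with raw counts and idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in counts.items()})
    return vectors


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zero."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "cluster text documents",
    "cluster web documents",
    "protein sequence alignment",
]
vecs = tfidf_vectors(docs)
sim_related = cosine_similarity(vecs[0], vecs[1])
sim_unrelated = cosine_similarity(vecs[0], vecs[2])
```

A pairwise similarity matrix built this way is the kind of input a graph-based clustering step could then work on.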
Document Classification Methods for Organizing Explicit Knowledge
Abstract
In this paper we describe the two classification approaches (i.e. categorization and clustering) and their preceding steps. For each approach we give a brief description of the underlying theory and outline the advantages and disadvantages of the different methods. Finally we specify potential application areas in knowledge management and illustrate one topic in detail as an example. We describe the enhancement of queries for illustration purposes because it is a common research problem in information retrieval and, thus, in knowledge management.
Article URL
, 2009
Abstract
Partitioning clustering algorithms for protein sequence data sets. This PDF corresponds to the article as it appeared upon acceptance; fully formatted PDF and full-text (HTML) versions will be made available soon.