Results 1 - 10 of 14
Extensions to the k-Means Algorithm for Clustering Large Data Sets with Categorical Values
, 1998
Cited by 109 (2 self)
The k-means algorithm is well known for its efficiency in clustering large data sets. However, working only on numeric values prohibits it from being used to cluster real world data containing categorical values. In this paper we present two algorithms which extend the k-means algorithm to categorical domains and domains with mixed numeric and categorical values. The k-modes algorithm uses a simple matching dissimilarity measure to deal with categorical objects, replaces the means of clusters with modes, and uses a frequency-based method to update modes in the clustering process to minimise the clustering cost function. With these extensions the k-modes algorithm enables the clustering of categorical data in a fashion similar to k-means. The k-prototypes algorithm, through the definition of a combined dissimilarity measure, further integrates the k-means and k-modes algorithms to allow for clustering objects described by mixed numeric and categorical attributes. We use the well known soybean disease and credit approval data sets to demonstrate the clustering performance of the two algorithms. Our experiments on two real world data sets with half a million objects each show that the two algorithms are efficient when clustering large data sets, which is critical to data mining applications.
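The three ingredients of k-modes named in this abstract (a simple matching dissimilarity, modes in place of means, and a frequency-based mode update) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it reassigns all objects and then recomputes modes in batches, and the function names, the toy data, and the choice of initial modes are assumptions made for the example.

```python
from collections import Counter


def matching_dissimilarity(a, b):
    """Simple matching: number of attributes on which two objects differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Frequency-based update: the most frequent category of each attribute."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(data, initial_modes, max_iter=100):
    """Batch k-modes sketch: assign to nearest mode, then recompute modes."""
    modes = [tuple(m) for m in initial_modes]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(modes)),
                key=lambda j: matching_dissimilarity(row, modes[j]))
            for row in data
        ]
        if new_labels == labels:  # assignments stable: stop
            break
        labels = new_labels
        for j in range(len(modes)):
            members = [row for row, lab in zip(data, labels) if lab == j]
            if members:  # keep the old mode if a cluster becomes empty
                modes[j] = column_modes(members)
    return labels, modes


# Hypothetical toy data: (colour, size, shape)
data = [
    ("red", "small", "round"),
    ("red", "small", "oval"),
    ("red", "medium", "round"),
    ("blue", "large", "square"),
    ("blue", "large", "round"),
    ("green", "large", "square"),
]
labels, modes = k_modes(data, initial_modes=[data[0], data[3]])
```

As with k-means, the result depends on the initial modes, so in practice one would try several initialisations and keep the clustering with the lowest total dissimilarity.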
Feature selection for unsupervised learning
- Journal of Machine Learning Research
, 2004
Cited by 69 (3 self)
In this paper, we identify two issues involved in developing an automated feature subset selection algorithm for unlabeled data: the need for finding the number of clusters in conjunction with feature selection, and the need for normalizing the bias of feature selection criteria with respect to dimension. We explore the feature selection problem and these issues through FSSEM (Feature Subset Selection using Expectation-Maximization (EM) clustering) and through two different performance criteria for evaluating candidate feature subsets: scatter separability and maximum likelihood. We present proofs on the dimensionality biases of these feature criteria, and present a cross-projection normalization scheme that can be applied to any criterion to ameliorate these biases. Our experiments show the need for feature selection, the need for addressing these two issues, and the effectiveness of our proposed solutions.
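For orientation, a scatter separability criterion is commonly written in terms of within-cluster and between-cluster scatter matrices. The form below is the standard textbook one and is an assumption here, since the abstract does not spell out the exact definition; the symbols $\pi_j$ (cluster proportion), $\mu_j$ and $\Sigma_j$ (cluster mean and covariance), $M_o$ (overall mean) and $k$ (number of clusters) are introduced only for illustration.

```latex
S_w = \sum_{j=1}^{k} \pi_j \, \Sigma_j ,
\qquad
S_b = \sum_{j=1}^{k} \pi_j \, (\mu_j - M_o)(\mu_j - M_o)^{\top} ,
\qquad
M_o = \sum_{j=1}^{k} \pi_j \, \mu_j ,
\qquad
\text{separability} = \operatorname{trace}\!\left( S_w^{-1} S_b \right)
```

Larger values correspond to compact clusters whose means lie far from the overall mean. Because such a value is computed in the feature subspace under evaluation, scores for subsets of different dimension are not directly comparable, which is the kind of bias the normalization scheme in the abstract is meant to address.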
Configurations of Inter-Organizational Relationships: A Comparison Between US and Japanese Automakers
, 1995
Cited by 28 (1 self)
This paper seeks to uncover dominant configurations of inter-organizational relationships across the USA and Japan in the automotive industry. We integrate relevant theoretical concepts from transaction cost economics, organization theory and political economy to develop a conceptual model of inter-organizational relationships based on the fit between information processing needs and information processing capabilities. This model is employed to collect data on 447 buyer-supplier relationships in these two countries. We empirically uncover a set of five naturally occurring patterns of inter-organizational relationships. These configurations provide rich explanations of the complexity of inter-organizational relationships and offer differential insights across the US and Japan. We discuss implications for further research pertaining to the logic and development of configurations.
Comparison and validation of community structures in complex networks
- Physica A: Statistical Mechanics and its Applications, 367
, 2006
Cited by 8 (0 self)
The issue of partitioning a network into communities has attracted a great deal of attention recently. Most authors seem to equate this issue with that of finding the maximum value of the modularity, as defined by Newman. Since the problem formulated this way is believed to be NP-hard, most effort has gone into the construction of search algorithms, and less into the question of other measures of community structure, similarities between various partitionings, and validation with respect to external information. Here we concentrate on a class of computer-generated networks and on three well-studied real networks which constitute a benchmark for network studies: the karate club, the US college football teams and a gene network of yeast. We utilize some standard ways of clustering data (originally not designed for finding community structures in networks) and show that these classical methods sometimes outperform the newer ones. We discuss various measures of the strength of the modular structure, and show by example their features and drawbacks. Further, we compare different partitions by applying some graph-theoretic concepts of distance, which indicate that one of the quality measures of the degree of modularity corresponds quite well with the distance from the true partition. Finally, we introduce a way to validate the partitionings with respect to external data when the nodes are classified but the network structure is unknown. This is possible here since we know everything about the computer-generated networks, as well as the historical answer to how the karate club and the football teams are partitioned in reality. The partitioning of the gene network is validated by use of the Gene Ontology
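The modularity that this abstract refers to can be computed for a given partition as sketched below. This is a minimal illustration assuming an undirected, unweighted graph given as an edge list; the function name, the `community` mapping and the toy graph are hypothetical and not taken from the paper. It uses the usual per-community form: the fraction of edges inside a community minus the squared fraction of edge endpoints attached to it.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to its community label.
    """
    m = len(edges)
    internal = {}    # number of edges with both endpoints in the community
    degree_sum = {}  # total degree of the nodes in the community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical toy graph: two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
one_group = {node: "A" for node in range(6)}

q_split = modularity(edges, two_groups)   # 5/14, about 0.357
q_single = modularity(edges, one_group)   # 0 (up to floating-point rounding)
```

Putting every node in one community always gives zero, which is why a positive value is read as evidence of community structure relative to a random graph with the same degrees.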
Determination of Clustering Tendency With ART Neural
- In: Proceedings of 4th Intl. Conf. on Recent Advances in Soft Computing
, 2002
Cited by 4 (3 self)
We describe how Adaptive Resonance Theory (ART) neural networks can be used to establish binary data clustering tendency. Clustering tendency is the important yet poorly investigated problem of determining whether or not there is natural structure in data.
Comparison of chemical clustering methods using graph- and fingerprint-based similarity measures
- J Mol Graph Model
Cited by 1 (0 self)
This paper compares several published methods for clustering chemical structures, using both fingerprint-based and graph-based similarity measures. The clusterings from each method were compared to determine the degree of cluster overlap. Each method was also evaluated on how well it grouped structures into clusters possessing a non-trivial substructural commonality. The methods which employ adjustable parameters were tested to determine the stability of each parameter for datasets of varying size and composition. Our experiments suggest that both fingerprint-based and graph-based similarity measures can be used effectively for generating chemical clusterings; they also suggest that the CAST method, proposed recently for the clustering of gene expression patterns, may prove effective for the clustering of 2D chemical structures.
Optimal Matching and the Social Sciences
- In IATUR - XXVIII Annual Conference
, 2006
Cited by 1 (0 self)
Working papers do not reflect the position of INSEE but only the views of the authors. 1 Observatoire sociologique du changement (Science-po & CNRS) and Laboratoire de Sociologie Quantitative
Truecluster: robust scalable clustering with
, 2007
Cited by 1 (1 self)
Data-based classification is fundamental to most branches of science. While recent years have brought enormous progress in various areas of statistical computing and clustering, some general challenges in clustering remain: model selection, robustness, and scalability to large datasets. We consider the important problem of deciding on the optimal number of clusters, given an arbitrary definition of space and clusteriness. We show how to construct a cluster information criterion that allows objective model selection. Differing from other approaches, our truecluster method does not require specific assumptions about underlying distributions, dissimilarity definitions or cluster models. Truecluster puts arbitrary clustering algorithms into a generic unified (sampling-based) statistical framework. It is scalable to big datasets and provides robust cluster assignments and case-wise diagnostics. Truecluster will make clustering more objective, allows for automation, and will save time and costs. Free R software is available.
Combinatorial Optimization in Clustering
Contents
1 Introduction
2 Types of Data
3 Cluster Structures
4 Clustering Criteria
5 Single Cluster Clustering
5.1 Clustering Approaches
5.1.1 Definition-based Clusters
5.1.2 Direct Algorithms
5.1.3 Optimal Clusters
5.2 Single and Monotone Linkage Clusters
5.2.1 MST and Single Linkage Clustering
5.2.2 Monotone Linkage Clusters
5.2.3 Modeling Skeletons in Digital Image Processing
5.2.4 Linkage-based Convex Criteria
5.3 Moving Center and Approximation Clusters
5.3.1 Criteria for Moving Center Methods
5.3.2 Principal Cluster
5.3.3 Additive Cluster
5.3.4 Seriation with Returns
6 Partitioning
Approximation Clustering: a Mine of Semidefinite Programming Problems
Clustering is a discipline devoted to finding homogeneous groups of data entities. In contrast to conventional clustering, which involves data processing in terms of either entities or variables, approximation clustering is aimed at processing of the data matrices as they are. Currently, approximation clustering is a set of clustering models and methods based on approximate decomposition of the data table into scalar product matrices representing weighted subsets, partitions or hierarchies as the sought clustering structures. Some of the problems involved are of semidefinite programming, the others seem quite similar. 1 Introduction Clustering models may differ depending on the nature of data. We distinguish here among three types of data: column-conditional, similarity and aggregable ones. The first two are those usually considered in clustering: a column-conditional data set is represented by an entity-to-variable matrix so that the entries within any column (variable) can be c...

