Results 1-10 of 88
An Empirical Comparison of Four Initialization Methods for the K-Means Algorithm
, 1999
Abstract

Cited by 106 (0 self)
In this paper, we aim to compare empirically four initialization methods for the K-Means algorithm: random, Forgy, MacQueen and Kaufman. Although this algorithm is known for its robustness, it is widely reported in the literature that its performance depends upon two key points: initial clustering and instance order. We conduct a series of experiments to draw up (in terms of mean, maximum, minimum and standard deviation) the probability distribution of the square-error values of the final clusters returned by the K-Means algorithm, independently of any initial clustering and of any instance order, when each of the four initialization methods is used. The results of our experiments illustrate that the random and the Kaufman initialization methods outperform the rest of the compared methods, as they make the K-Means algorithm more effective and less dependent on initial clustering and on instance order. In addition, we compare the convergence speed of the K-Means algorithm when using each o...
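The abstract names the four initializations but not their mechanics. As a rough sketch (function names are ours, and this is not the paper's code), two of the simpler schemes can be written as follows: Forgy draws k data points as the initial centers, while a random-partition scheme averages randomly assigned groups.

```python
import numpy as np

def forgy_init(X, k, rng):
    # Forgy: draw k distinct data points and use them as initial centers.
    idx = rng.choice(len(X), size=k, replace=False)
    return X[idx]

def random_partition_init(X, k, rng):
    # Random partition: assign every point to a random cluster, then
    # use the resulting cluster centroids as the initial centers.
    # (An empty cluster is possible in principle; not handled here.)
    labels = rng.integers(0, k, size=len(X))
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
print(forgy_init(X, 3, rng).shape)             # (3, 2)
print(random_partition_init(X, 3, rng).shape)  # (3, 2)
```

Note the qualitative difference: Forgy centers start spread out wherever the data lie, while random-partition centers all start near the overall data mean.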
The effectiveness of Lloyd-type methods for the k-means problem
 In FOCS
, 2006
Abstract

Cited by 54 (4 self)
We investigate variants of Lloyd’s heuristic for clustering high-dimensional data in an attempt to explain its popularity (a half century after its introduction) among practitioners, and in order to suggest improvements in its application. We propose and justify a clusterability criterion for data sets. We present variants of Lloyd’s heuristic that quickly lead to provably near-optimal clustering solutions when applied to well-clusterable instances. This is the first performance guarantee for a variant of Lloyd’s heuristic. The provision of a guarantee on output quality does not come at the expense of speed: some of our algorithms are candidates for being faster in practice than currently used variants of Lloyd’s method. In addition, our other algorithms are faster on well-clusterable instances than recently proposed approximation algorithms, while maintaining similar guarantees on clustering quality. Our main algorithmic contribution is a novel probabilistic seeding process for the starting configuration of a Lloyd-type iteration.
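The abstract does not spell out its seeding process. A common distance-weighted seeding of this general flavor, followed by a plain Lloyd iteration, can be sketched as follows; this is illustrative only and not the paper's exact scheme.

```python
import numpy as np

def probabilistic_seed(X, k, rng):
    # Distance-weighted seeding: the first center is picked uniformly;
    # each later center is drawn with probability proportional to the
    # squared distance from the nearest center chosen so far.
    # (Illustrative; the paper's actual procedure may differ.)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

def lloyd(X, centers, iters=20):
    # Classic Lloyd iteration: assign each point to its nearest center,
    # then move every center to the mean of its assigned points.
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return centers, labels

rng = np.random.default_rng(0)
blobs = np.concatenate([rng.normal(loc=c, scale=0.1, size=(50, 2))
                        for c in ([0, 0], [5, 5], [0, 5])])
centers, labels = lloyd(blobs, probabilistic_seed(blobs, 3, rng))
print(centers.shape)  # (3, 2)
```

On a well-clusterable instance like these three tight blobs, distance-weighted seeding tends to place one starting center per blob, which is precisely the situation in which Lloyd-type iterations behave well.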
Detecting stable clusters using principal component analysis
 In Functional Genomics: Methods and Protocols, M.J. Brownstein and A. Kohodursky (eds.), Humana Press, 2003
Abstract

Cited by 27 (1 self)
Clustering is one of the most commonly used tools in the analysis of gene expression data (1, 2). Its use in grouping genes is based on the premise that co-expression is a result of co-regulation. It is thus a preliminary step in extracting gene networks and in the inference of gene function (3, 4). Clustering of experiments can be used to discover novel
MOA: Massive Online Analysis, a framework for stream classification and clustering
 Journal of Machine Learning Research  Proceedings Track, 11:44–50
, 2010
Abstract

Cited by 11 (3 self)
In today’s applications, massive, evolving data streams are ubiquitous. Massive Online Analysis (MOA) is a software environment for implementing algorithms and running experiments for online learning from evolving data streams. MOA is designed to deal with the challenging problems of scaling up the implementation of state-of-the-art algorithms to real-world dataset sizes and of making algorithms comparable in benchmark streaming settings. It contains a collection of offline and online algorithms for both classification and clustering, as well as tools for evaluation. Researchers benefit from MOA by getting insights into the workings and problems of different approaches; practitioners can easily compare several algorithms and apply them to real-world data sets and settings. MOA supports bidirectional interaction with WEKA, the Waikato Environment for Knowledge Analysis, and is released under the GNU GPL license. Besides providing algorithms and measures for evaluation and comparison, MOA is easily extensible with new contributions and allows the creation of benchmark scenarios through storing and sharing setting files.
Optimal Data Partitioning and a Test Case for Ray-Finned Fishes (Actinopterygii) Based on Ten Nuclear Loci. Syst Biol 57(4
, 2008
Abstract

Cited by 8 (1 self)
This Article is brought to you for free and open access by the Department of Biology at
Automated recognition of partial discharges
 IEEE Trans. Diel. Insul
, 1995
Abstract

Cited by 8 (0 self)
In this work an overview of automated recognition of partial discharges (PD) is given. The selection of PD patterns, the extraction of relevant information for PD recognition, and the structure of a database for PD recognition are discussed. Mathematical methods useful for the design of the database are examined. Classification methods are interpreted from a geometrical point of view. Some problems encountered in the automation of PD recognition are also addressed.
Clustering n objects into k groups under optimal scaling of variables
 Psychometrika
, 1989
Abstract

Cited by 8 (1 self)
We propose a method to reduce many categorical variables to one variable with k categories, or stated otherwise, to classify n objects into k groups. Objects are measured on a set of nominal, ordinal or numerical variables or any mix of these, and they are represented as n points in p-dimensional Euclidean space. Starting from homogeneity analysis, also called multiple correspondence analysis, the essential feature of our approach is that these object points are restricted to lie at only one of k locations. It follows that these k locations must be equal to the centroids of all objects belonging to the same group, which corresponds to a sum of squared distances clustering criterion. The problem is not only to estimate the group allocation, but also to obtain an optimal transformation of the data matrix. An alternating least squares algorithm and an example are given.
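The claim that the k restricted locations must equal the group centroids follows from the sum-of-squared-distances criterion: with the group allocation fixed, the centroid is the unique minimizer of that loss for each group. A small numerical check (synthetic data, our own setup) illustrates this:

```python
import numpy as np

# With a group's membership fixed, the location minimizing the sum of
# squared distances to its members is the group centroid.
rng = np.random.default_rng(1)
pts = rng.normal(size=(50, 3))  # one group of 50 points in 3-D

def ssq(loc):
    # Sum of squared distances from every member to a candidate location.
    return float(np.sum((pts - loc) ** 2))

centroid = pts.mean(axis=0)

# Any perturbation of the centroid strictly increases the loss.
for _ in range(200):
    perturbed = centroid + rng.normal(scale=0.5, size=3)
    assert ssq(perturbed) >= ssq(centroid)
print("centroid minimizes the sum of squared distances")
```

This is the identity ssq(loc) = ssq(centroid) + n * ||loc - centroid||^2, which is why alternating between reallocating objects and recomputing centroids can never increase the loss.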
CLUSTER ANALYSIS AND CLASSIFICATION TREE METHODOLOGY AS AN AID TO IMPROVE UNDERSTANDING OF BENIGN PROSTATIC HYPERPLASIA
, 1994
Abstract

Cited by 7 (0 self)
Clear scientifically defined guidelines for diagnosing benign prostatic hyperplasia have not been developed, and commonly used urologic measures characterizing the disease have shown a lack of correlation. However, most reports in the literature are based on studies in referred patients or other nonrepresentative samples and additionally have not considered the multivariate relationship among these measures. Such commonly used measures were collected during the baseline phase of a community-based study initiated in Olmsted County, Minnesota, to study the prevalence and progression of disease in a randomly selected sample of untreated men aged 40-79 without a history of prostate cancer or prior prostate surgery. In the absence of a clinical diagnosis, hierarchical group-average cluster analysis and the kth nearest neighbor nonparametric density estimation (NPDE) approach were applied to group men after first standardizing variables using a robust measure. As the number of clusters has been shown to be a monotonically decreasing function of smoothing parameter k, graphical tools
Integration of ART2 neural network and genetic K-means algorithm for analyzing Web browsing paths in electronic commerce, Decision Support Systems
Three-mode partitioning
 Comput. Stat. Data Anal
, 2006
Abstract

Cited by 4 (3 self)
The three-mode partitioning model is a clustering model for three-way three-mode data sets that implies a simultaneous partitioning of all three modes involved in the data. In the associated data analysis, a data array is approximated by a model array that can be represented by a three-mode partitioning model of a pre-specified rank, minimizing a least squares loss function in terms of differences between data and model. Algorithms have been proposed for this minimization, but their performance is not yet clear. A framework for alternating least-squares methods is described in order to offset the performance problem. Furthermore, a number of both existing and novel algorithms are discussed within this framework. An extensive simulation study is reported in which these algorithms are evaluated and compared according to sensitivity to local optima. The recovery of the truth underlying the data is investigated in order to assess the optimal estimates. When a collection of four empirical data sets is used, the ordering of the algorithms with respect to performance in finding the optimal solution appears to change relative to the results obtained from the simulation study. This finding is attributed to violations of the implicit stochastic model underlying both the least-squares loss function and the simulation study. Support for the latter attribution is found in a second simulation study.