Results 1  10
of
145
Consistency of spectral clustering
, 2004
"... Consistency is a key property of statistical algorithms, when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of a popular family of spe ..."
Abstract

Cited by 567 (15 self)
 Add to MetaCart
(Show Context)
Consistency is a key property of statistical algorithms, when the data is drawn from some underlying probability distribution. Surprisingly, despite decades of work, little is known about consistency of most clustering algorithms. In this paper we investigate consistency of a popular family of spectral clustering algorithms, which cluster the data with the help of eigenvectors of graph Laplacian matrices. We show that one of the two of major classes of spectral clustering (normalized clustering) converges under some very general conditions, while the other (unnormalized), is only consistent under strong additional assumptions, which, as we demonstrate, are not always satisfied in real data. We conclude that our analysis provides strong evidence for the superiority of normalized spectral clustering in practical applications. We believe that methods used in our analysis will provide a basis for future exploration of Laplacianbased methods in a statistical setting.
Consensus clustering  A resamplingbased method for class discovery and visualization of gene expression microarray data
 MACHINE LEARNING 52 (2003) 91–118 FUNCTIONAL GENOMICS SPECIAL ISSUE
, 2003
"... ..."
(Show Context)
Support Vector Clustering
, 2001
"... We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several compo ..."
Abstract

Cited by 211 (1 self)
 Add to MetaCart
We present a novel clustering method using the approach of support vector machines. Data points are mapped by means of a Gaussian kernel to a high dimensional feature space, where we search for the minimal enclosing sphere. This sphere, when mapped back to data space, can separate into several components, each enclosing a separate cluster of points. We present a simple algorithm for identifying these clusters. The width of the Gaussian kernel controls the scale at which the data is probed while the soft margin constant helps coping with outliers and overlapping clusters. The structure of a dataset is explored by varying the two parameters, maintaining a minimal number of support vectors to assure smooth cluster boundaries. We demonstrate the performance of our algorithm on several datasets.
Spectral biclustering of microarray data: Coclustering genes and conditions
 Genome Research
, 2003
"... ..."
(Show Context)
Comparing clusterings: an axiomatic view
 In ICML ’05: Proceedings of the 22nd international conference on Machine learning
, 2005
"... This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unad ..."
Abstract

Cited by 109 (3 self)
 Add to MetaCart
(Show Context)
This paper views clusterings as elements of a lattice. Distances between clusterings are analyzed in their relationship to the lattice. From this vantage point, we first give an axiomatic characterization of some criteria for comparing clusterings, including the variation of information and the unadjusted Rand index. Then we study other distances between partitions w.r.t these axioms and prove an impossibility result: there is no “sensible” criterion for comparing clusterings that is simultaneously (1) aligned with the lattice of partitions, (2) convexely additive, and (3) bounded. 1.
CLICK and EXPANDER: a system for clustering and visualizing gene expression data
 Bioinformatics
, 2003
"... Motivation: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar exp ..."
Abstract

Cited by 101 (6 self)
 Add to MetaCart
(Show Context)
Motivation: Microarrays have become a central tool in biological research. Their applications range from functional annotation to tissue classification and genetic network inference. A key step in the analysis of gene expression data is the identification of groups of genes that manifest similar expression patterns. This translates to the algorithmic problem of clustering genes based on their expression patterns. Results: We present a novel clustering algorithm, called CLICK, and its applications to gene expression analysis. The algorithm utilizes graphtheoretic and statistical techniques to identify tight groups (kernels) of highly similar elements, which are likely to belong to the same true cluster. Several heuristic procedures are then used to expand the kernels into the full clusters. We report on the application of CLICK to a variety of gene expression data sets. In all those applications it outperformed extant algorithms according to several common figures of merit. We also point out that CLICK can be successfully used for the identification of common regulatory motifs in the upstream regions of coregulated genes. Furthermore, we demonstrate how CLICK can be used to accurately classify tissue samples into disease types, based on their expression profiles. Finally, we present a new javabased graphical tool, called EXPANDER, for gene expression analysis and visualization, which incorporates CLICK and several other popular clustering algorithms.
StabilityBased Validation of Clustering Solutions
, 2004
"... Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a ..."
Abstract

Cited by 94 (7 self)
 Add to MetaCart
Data clustering describes a set of frequently employed techniques in exploratory data analysis to extract “natural” group structure in data. Such groupings need to be validated to separate the signal in the data from spurious structure. In this context, finding an appropriate number of clusters is a particularly important model selection question. We introduce a measure of cluster stability to assess the validity of a cluster model. This stability measure quantifies the reproducibility of clustering solutions on a second sample, and it can be interpreted as a classification risk with regard to class labels produced by a clustering algorithm. The preferred number of clusters is determined by minimizing this classification risk as a function of the number of clusters. Convincing results are achieved on simulated as well as gene expression data sets. Comparisons to other methods demonstrate the competitive performance of our method and its suitability as a general validation tool for clustering solutions in realworld problems.
Comparing Clusterings
, 2002
"... This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information measures the amount of information that is lost or gained in changing from clustering C to dustering C '. The criterion makes no assumptions ..."
Abstract

Cited by 67 (4 self)
 Add to MetaCart
This paper proposes an information theoretic criterion for comparing two clusterings of the same data set. The criterion, called variation of information measures the amount of information that is lost or gained in changing from clustering C to dustering C '. The criterion makes no assumptions about how the dusterings were generated and applies to both soft and hard dusterings.The basic properties of VI are presented and discussed from the point of view of comparing c!usterings. In particular, the VI is positive, symmetric and transitive and thus, surprisingly enough, is a true metric on the space of c1usterings.
A sober look at clustering stability
 In COLT
, 2006
"... Abstract. Stability is a common tool to verify the validity of sample based algorithms. In clustering it is widely used to tune the parameters of the algorithm, such as the number k of clusters. In spite of the popularity of stability in practical applications, there has been very little theoretical ..."
Abstract

Cited by 44 (5 self)
 Add to MetaCart
Abstract. Stability is a common tool to verify the validity of sample based algorithms. In clustering it is widely used to tune the parameters of the algorithm, such as the number k of clusters. In spite of the popularity of stability in practical applications, there has been very little theoretical analysis of this notion. In this paper we provide a formal definition of stability and analyze some of its basic properties. Quite surprisingly, the conclusion of our analysis is that for large sample size, stability is fully determined by the behavior of the objective function which the clustering algorithm is aiming to minimize. If the objective function has a unique global minimizer, the algorithm is stable, otherwise it is unstable. In particular we conclude that stability is not a wellsuited tool to determine the number of clusters it is determined by the symmetries of the data which may be unrelated to clustering parameters. We prove our results for centerbased clusterings and for spectral clustering, and support our conclusions by many examples in which the behavior of stability is counterintuitive. 1
Gaussian mixture clustering and imputation of microarray data
 Bioinformatics
, 2004
"... Motivation: In microarray experiments, missing entries arise from blemishes on the chips. In largescale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries ..."
Abstract

Cited by 42 (1 self)
 Add to MetaCart
(Show Context)
Motivation: In microarray experiments, missing entries arise from blemishes on the chips. In largescale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses.This study compares methods of missing value estimation. Results: Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of misclustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both timeseries (correlated) and nontime series (uncorrelated) data sets. Availability: Matlab code is available on request from the authors. Contact: