Results 1 -
7 of
7
Relating clustering stability to properties of cluster boundaries
"... In this paper, we investigate stability-based methods for cluster model selection, in particular to select the number K of clusters. The scenario under consideration is that clustering is performed by minimizing a certain clustering quality function, and that a unique global minimizer exists. On the ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
In this paper, we investigate stability-based methods for cluster model selection, in particular to select the number K of clusters. The scenario under consideration is that clustering is performed by minimizing a certain clustering quality function, and that a unique global minimizer exists. On the one hand we show that stability can be upper bounded by certain properties of the optimal clustering, namely by the mass in a small tube around the cluster boundaries. On the other hand, we provide counterexamples which show that a reverse statement is not true in general. Finally, we provide some examples and arguments why, from a theoretic point of view, using clustering stability in a high sample setting can be problematic. It can be seen that distribution-free guarantees bounding the difference between the finite sample stability and the “true stability ” cannot exist, unless one makes strong assumptions on the underlying distribution. 1
On the reliability of clustering stability in the large sample regime
- In Advances in Neural Information Processing Systems
"... Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well under ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well understood theoretically, and present some difficulties. In particular, when the data is assumed to be sampled from an underlying distribution, the solutions returned by the clustering algorithm will usually become more and more stable as the sample size increases. This raises a potentially serious practical difficulty with these methods, because it means there might be some hard-to-compute sample size, beyond which clustering stability estimators ’break down ’ and become unreliable in detecting the most stable model. In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime. In contrast to previous work, which concentrated on specific toy distributions or specific idealized clustering frameworks, here we make no such assumptions. We then exemplify how these conditions apply to several important families of clustering algorithms, such as maximum likelihood clustering, certain types of kernel clustering, and centroid-based clustering with any Bregman divergence. In addition, we explicitly derive the non-trivial asymptotic behavior of these estimators, for any framework satisfying our conditions. This may help us understand what is considered a ’stable ’ model by these estimators, at least for large enough samples. 1
Discovering multi–level structures in bio-molecular data through the Bernstein inequality
"... Background: The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the ”optimal ” n ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Background: The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the ”optimal ” number of clusters in bio-molecular data. A major problem with stability-based methods is the detection of multi-level structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed. Results: To assess the statistical significance and to discover multi-level structures in bio-molecular data, a new method based on Bernstein’s inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method. Conclusions: The Bernstein test, due to its loose assumptions, is more sensitive than the chi-square test to the detection of multiple structures simultaneously present in the data. Nevertheless it is less selective, that is subject to more false positives, but adding independence assumptions, a more selective variant of the Bernstein
Discovering multi–level structures in bio-molecular data through the Bernstein inequality
, 2008
"... with malignancies differentiated at bio-molecular level can be discovered through clustering algorithms, and several other tasks related to the analysis of bio-molecular data require the development and application of unsupervised clustering techniques [2-4]. Anyway, in most bioinfrom ..."
Abstract
- Add to MetaCart
with malignancies differentiated at bio-molecular level can be discovered through clustering algorithms, and several other tasks related to the analysis of bio-molecular data require the development and application of unsupervised clustering techniques [2-4]. Anyway, in most bioinfrom
Stability and Model Selection in k-means Clustering
, 2010
"... Clustering Stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known ..."
Abstract
- Add to MetaCart
Clustering Stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically on why or when do they work, or even what kind of assumptions they make in choosing an ’appropriate ’ model. Moreover, recent theoretical work has shown that they might ’break down ’ for large enough samples. In this paper, we focus on the behavior of clustering stability using k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in R n satisfying mild regularity conditions. From this, we can show that clustering stability does not ’break down ’ even for arbitrarily large samples, at least for the k-means framework. Moreover, it allows us to identify the factors which eventually determine the behavior of clustering stability. This leads to some basic observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
RESEARCH Speeding up the Consensus Clustering methodology for microarray data analysis
"... Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be ..."
Abstract
- Add to MetaCart
Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus. That is, the closely related algorithm FC (Fast Consensus) that would have the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that
analysis of genomic data
, 2008
"... © 2008 Zhu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ..."
Abstract
- Add to MetaCart
© 2008 Zhu et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License

