Results 1–10 of 28
On the reliability of clustering stability in the large sample regime
 In Advances in Neural Information Processing Systems
Abstract

Cited by 10 (3 self)
Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well understood theoretically, and present some difficulties. In particular, when the data is assumed to be sampled from an underlying distribution, the solutions returned by the clustering algorithm will usually become more and more stable as the sample size increases. This raises a potentially serious practical difficulty with these methods, because it means there might be some hard-to-compute sample size, beyond which clustering stability estimators 'break down' and become unreliable in detecting the most stable model. In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime. In contrast to previous work, which concentrated on specific toy distributions or specific idealized clustering frameworks, here we make no such assumptions. We then exemplify how these conditions apply to several important families of clustering algorithms, such as maximum likelihood clustering, certain types of kernel clustering, and centroid-based clustering with any Bregman divergence. In addition, we explicitly derive the non-trivial asymptotic behavior of these estimators, for any framework satisfying our conditions. This may help us understand what is considered a 'stable' model by these estimators, at least for large enough samples.
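The resampling idea behind such estimators can be sketched minimally: cluster independent draws from the same subsample and score how consistently pairs of points are grouped together. This is an illustrative toy only; the `kmeans`, `pair_agreement`, and `stability` helpers below are assumptions of this sketch, not the paper's estimators, which are scaled and analyzed far more carefully.

```python
import numpy as np

def kmeans(X, k, iters=30, restarts=5, seed=0):
    # Plain Lloyd's algorithm with a few random restarts; returns labels.
    rng = np.random.default_rng(seed)
    best, best_cost = None, np.inf
    for _ in range(restarts):
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            labels = ((X[:, None] - centers) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        cost = ((X - centers[labels]) ** 2).sum()
        if cost < best_cost:
            best, best_cost = labels, cost
    return best

def pair_agreement(a, b):
    # Fraction of point pairs on which two labelings agree
    # (co-clustered in both, or separated in both).
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    iu = np.triu_indices(len(a), 1)
    return float(np.mean(same_a[iu] == same_b[iu]))

def stability(X, k, n_subsamples=10, seed=0):
    # Average agreement between two differently initialized clusterings
    # of the same random subsample: higher means more stable.
    rng = np.random.default_rng(seed)
    scores = []
    for t in range(n_subsamples):
        idx = rng.choice(len(X), len(X) // 2, replace=False)
        a = kmeans(X[idx], k, seed=2 * t)
        b = kmeans(X[idx], k, seed=2 * t + 1)
        scores.append(pair_agreement(a, b))
    return float(np.mean(scores))

# Two well-separated blobs: k = 2 should be highly stable.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (60, 2)), rng.normal(5, 0.3, (60, 2))])
print(round(stability(X, 2), 3))
```

On this toy data the k = 2 score sits near 1; the paper's point is precisely that such raw scores tend toward perfect stability for every k as the sample grows, which is why the asymptotic analysis is needed.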
Relating clustering stability to properties of cluster boundaries
Abstract

Cited by 10 (2 self)
In this paper, we investigate stability-based methods for cluster model selection, in particular to select the number K of clusters. The scenario under consideration is that clustering is performed by minimizing a certain clustering quality function, and that a unique global minimizer exists. On the one hand we show that stability can be upper bounded by certain properties of the optimal clustering, namely by the mass in a small tube around the cluster boundaries. On the other hand, we provide counterexamples which show that a reverse statement is not true in general. Finally, we provide some examples and arguments why, from a theoretic point of view, using clustering stability in a high sample setting can be problematic. It can be seen that distribution-free guarantees bounding the difference between the finite sample stability and the 'true stability' cannot exist, unless one makes strong assumptions on the underlying distribution.
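The quantity the upper bound involves, the probability mass in a small tube around the optimal cluster boundary, is easy to estimate empirically. A minimal 1-D sketch (the mixture, the boundary at the midpoint between the means, and the tube half-width `gamma` are all assumptions of this illustration, not taken from the paper):

```python
import numpy as np

# Sample from an equal mixture of N(-2, 1) and N(2, 1); the optimal
# two-cluster boundary for this symmetric mixture is the midpoint 0.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 5000), rng.normal(2, 1, 5000)])

boundary = 0.0
gamma = 0.1
# Empirical mass in the tube (boundary - gamma, boundary + gamma).
tube_mass = float(np.mean(np.abs(x - boundary) < gamma))
print(round(tube_mass, 3))
```

Because little probability mass sits near the boundary here, the tube mass is small, which is the regime in which the bound certifies high stability.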
Clustering stability: An overview
 Foundations and Trends in Machine Learning
Discovering multi-level structures in biomolecular data through the Bernstein inequality
 BMC BIOINFORMATICS
, 2008
Abstract

Cited by 5 (1 self)
Background: The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the 'optimal' number of clusters in biomolecular data. A major problem with stability-based methods is the detection of multi-level structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed. Results: To assess the statistical significance and to discover multi-level structures in biomolecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method. Conclusions: The Bernstein test, due to its loose assumptions, is more sensitive than the chi-square test to the detection of multiple structures simultaneously present in the data. Nevertheless it is less selective, that is, subject to more false positives; by adding independence assumptions, a more selective variant of the Bernstein-inequality-based test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of biomolecular data.
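The distribution-free ingredient here is the classical Bernstein inequality itself. The sketch below only evaluates the standard two-sided bound for a sample mean of bounded i.i.d. variables; how the paper turns such a bound into a significance test for clustering structures is not reproduced, and the example numbers are assumptions of this illustration.

```python
import math

def bernstein_bound(n, eps, sigma2, M):
    """Two-sided Bernstein bound for a mean of n i.i.d. variables with
    variance sigma2 and |X - mu| <= M:
        P(|mean - mu| >= eps) <= 2 exp(-n eps^2 / (2 sigma2 + (2/3) M eps))
    """
    return 2.0 * math.exp(-n * eps**2 / (2.0 * sigma2 + (2.0 / 3.0) * M * eps))

# E.g. similarity scores in [0, 1]: M = 1, worst-case variance 1/4.
# With n = 300 resampled scores, a deviation of 0.1 is already unlikely.
p = bernstein_bound(n=300, eps=0.1, sigma2=0.25, M=1.0)
print(round(p, 4))
```

Because the bound needs only boundedness and a variance proxy, not a distributional form, it can be applied where a chi-square test's assumptions would fail, at the price of being conservative.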
Ensemble clustering with a fuzzy approach
Abstract

Cited by 1 (0 self)
Summary. Ensemble clustering is a novel research field that extends to unsupervised learning the approach originally developed for classification and supervised learning problems. In particular, ensemble clustering methods have been developed to improve the robustness and accuracy of clustering algorithms, as well as the ability to capture the structure of complex data. In many clustering applications an example may belong to multiple clusters, and the introduction of fuzzy set theory concepts can provide the flexibility needed to model the uncertainty underlying real data in several application domains. In this paper, we propose an unsupervised fuzzy ensemble clustering approach that combines the flexibility of fuzzy sets with the robustness of ensemble methods. Our algorithmic scheme can generate different ensemble clustering algorithms that allow the final consensus clustering to be obtained in both crisp and fuzzy form.
Speeding up the Consensus Clustering methodology for microarray data analysis
Abstract

Cited by 1 (1 self)
Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensitive enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus, that is, the closely related algorithm FC (Fast Consensus), which has the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that ...
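The core loop of Consensus Clustering can be sketched compactly: cluster many random subsamples and record, for each pair of points, how often they are co-sampled and how often they then land in the same cluster. This is a minimal illustration only; the trivial 1-D base clusterer (split at the subsample mean) and all parameter values are assumptions of this sketch, and the FC speed-up from the paper is not reproduced.

```python
import numpy as np

def consensus_matrix(x, n_runs=50, frac=0.8, seed=0):
    # x: 1-D data. Returns the consensus matrix C, where C[i, j] is the
    # fraction of runs in which i and j were co-sampled AND co-clustered.
    rng = np.random.default_rng(seed)
    n = len(x)
    together = np.zeros((n, n))   # times co-clustered
    sampled = np.zeros((n, n))    # times co-sampled
    for _ in range(n_runs):
        idx = rng.choice(n, int(frac * n), replace=False)
        # Toy base clusterer: two clusters split at the subsample mean.
        labels = (x[idx] > x[idx].mean()).astype(int)
        same = labels[:, None] == labels[None, :]
        sampled[np.ix_(idx, idx)] += 1
        together[np.ix_(idx, idx)] += same
    return together / np.maximum(sampled, 1)

# Two well-separated 1-D blobs: the consensus matrix is near block-diagonal.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 0.5, 30), rng.normal(3, 0.5, 30)])
C = consensus_matrix(x)
print(round(C[:30, :30].mean(), 2), round(C[:30, 30:].mean(), 2))
```

When the chosen number of clusters matches the data, consensus entries concentrate near 0 and 1; ambiguous intermediate values are the signal that the model is wrong, and the cost of the many clustering runs is what makes Consensus a natural candidate for a speed-up.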
Stability and Model Selection in k-means Clustering
, 2010
"... Clustering Stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known ..."
Abstract
Clustering Stability methods are a family of widely used model selection techniques for data clustering. Their unifying theme is that an appropriate model should result in a clustering which is robust with respect to various kinds of perturbations. Despite their relative success, not much is known theoretically on why or when they work, or even what kind of assumptions they make in choosing an 'appropriate' model. Moreover, recent theoretical work has shown that they might 'break down' for large enough samples. In this paper, we focus on the behavior of clustering stability using k-means clustering. Our main technical result is an exact characterization of the distribution to which suitably scaled measures of instability converge, based on a sample drawn from any distribution in R^n satisfying mild regularity conditions. From this, we can show that clustering stability does not 'break down' even for arbitrarily large samples, at least for the k-means framework. Moreover, it allows us to identify the factors which eventually determine the behavior of clustering stability. This leads to some basic observations about what kind of assumptions are made when using these methods. While often reasonable, these assumptions might also lead to unexpected consequences.
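One common instability measure underlying such analyses is the minimal matching distance between two labelings of the same points: the smallest fraction of disagreements over all permutations of the k cluster labels. A minimal sketch (brute-force over permutations, so only sensible for small k; the helper name is this sketch's own, not the paper's):

```python
from itertools import permutations

import numpy as np

def minimal_matching_distance(a, b, k):
    # Smallest disagreement fraction between labelings a and b over all
    # ways of matching b's k labels onto a's. O(k!) -- fine for small k.
    best = 1.0
    for perm in permutations(range(k)):
        relabeled = np.array([perm[lbl] for lbl in b])
        best = min(best, float(np.mean(a != relabeled)))
    return best

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])   # same partition, labels permuted
print(minimal_matching_distance(a, b, 3))  # prints 0.0
```

Since identical partitions give distance 0 regardless of labeling, the interesting object is how this distance, computed between clusterings of independent samples and suitably rescaled, is distributed as the sample size grows, which is what the paper characterizes.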