Results 1–10 of 32
Clustering stability: An overview
 Foundations and Trends in Machine Learning
"... ..."
(Show Context)
On the reliability of clustering stability in the large sample regime
 In Advances in Neural Information Processing Systems
"... Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well under ..."
Abstract
Cited by 10 (3 self)
Clustering stability is an increasingly popular family of methods for performing model selection in data clustering. The basic idea is that the chosen model should be stable under perturbation or resampling of the data. Despite being reasonably effective in practice, these methods are not well understood theoretically, and present some difficulties. In particular, when the data is assumed to be sampled from an underlying distribution, the solutions returned by the clustering algorithm will usually become more and more stable as the sample size increases. This raises a potentially serious practical difficulty with these methods, because it means there might be some hard-to-compute sample size, beyond which clustering stability estimators 'break down' and become unreliable in detecting the most stable model. In this paper, we provide a set of general sufficient conditions, which ensure the reliability of clustering stability estimators in the large sample regime. In contrast to previous work, which concentrated on specific toy distributions or specific idealized clustering frameworks, here we make no such assumptions. We then exemplify how these conditions apply to several important families of clustering algorithms, such as maximum likelihood clustering, certain types of kernel clustering, and centroid-based clustering with any Bregman divergence. In addition, we explicitly derive the nontrivial asymptotic behavior of these estimators, for any framework satisfying our conditions. This may help us understand what is considered a 'stable' model by these estimators, at least for large enough samples.
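The resampling idea this abstract describes rests on comparing partitions obtained from perturbed data. A minimal sketch of one common comparison, the co-membership (pair-counting) agreement between two labelings, is shown below; this is illustrative only, not the paper's estimator, and the function name is our own:

```python
# Illustrative sketch: stability estimators repeatedly re-cluster
# (sub)samples and measure how similar the resulting partitions are.
# One standard partition-similarity score counts, over all point pairs,
# how often the two clusterings agree on co-membership.
from itertools import combinations

def pair_agreement(labels_a, labels_b):
    """Fraction of point pairs on whose co-membership both clusterings agree."""
    n = len(labels_a)
    pairs = list(combinations(range(n), 2))
    agree = 0
    for i, j in pairs:
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += (same_a == same_b)
    return agree / len(pairs)

# Identical partitions (up to label renaming) score perfectly:
print(pair_agreement([0, 0, 1, 1], [1, 1, 0, 0]))  # -> 1.0
# Moving one point to another cluster lowers the score:
print(pair_agreement([0, 0, 1, 1], [0, 0, 0, 1]))  # -> 0.5
```

The paper's point is about the large-sample behavior of estimators built on scores like this: as the sample grows, almost any reasonable algorithm yields scores near 1 for every candidate model, so the estimator must be normalized carefully to remain informative.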
Relating clustering stability to properties of cluster boundaries
"... In this paper, we investigate stabilitybased methods for cluster model selection, in particular to select the number K of clusters. The scenario under consideration is that clustering is performed by minimizing a certain clustering quality function, and that a unique global minimizer exists. On the ..."
Abstract
Cited by 10 (2 self)
In this paper, we investigate stability-based methods for cluster model selection, in particular to select the number K of clusters. The scenario under consideration is that clustering is performed by minimizing a certain clustering quality function, and that a unique global minimizer exists. On the one hand, we show that stability can be upper bounded by certain properties of the optimal clustering, namely by the mass in a small tube around the cluster boundaries. On the other hand, we provide counterexamples which show that the reverse statement is not true in general. Finally, we provide some examples and arguments why, from a theoretical point of view, using clustering stability in a high sample setting can be problematic. It can be seen that distribution-free guarantees bounding the difference between the finite sample stability and the "true stability" cannot exist, unless one makes strong assumptions on the underlying distribution.
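The quantity driving the upper bound, the probability mass in a small tube around the cluster boundaries, can be estimated empirically. The one-dimensional sketch below is a hypothetical illustration (the function name and the 1-D setting are our own, not from the paper):

```python
# Illustrative sketch: empirical mass within distance eps of a cluster
# boundary. Per the paper's bound, little mass near the boundary means
# resampled clusterings rarely flip points across it, hence high stability.
def tube_mass(sample, boundary, eps):
    """Fraction of sample points lying within eps of the boundary."""
    inside = sum(1 for x in sample if abs(x - boundary) < eps)
    return inside / len(sample)

# Two well-separated groups put no mass near the boundary at 0.55:
sample = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
print(tube_mass(sample, boundary=0.55, eps=0.1))  # -> 0.0
```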
Discovering multilevel structures in biomolecular data through the Bernstein inequality
 BMC Bioinformatics, 2008
"... Background: The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the ”optimal ” n ..."
Abstract
Cited by 6 (1 self)
Background: The unsupervised discovery of structures (i.e. clusterings) underlying data is a central issue in several branches of bioinformatics. Methods based on the concept of stability have been recently proposed to assess the reliability of a clustering procedure and to estimate the "optimal" number of clusters in biomolecular data. A major problem with stability-based methods is the detection of multilevel structures (e.g. hierarchical functional classes of genes), and the assessment of their statistical significance. In this context, a chi-square based statistical test of hypothesis has been proposed; however, to assure the correctness of this technique some assumptions about the distribution of the data are needed. Results: To assess the statistical significance and to discover multilevel structures in biomolecular data, a new method based on Bernstein's inequality is proposed. This approach makes no assumptions about the distribution of the data, thus assuring a reliable application to a large range of bioinformatics problems. Results with synthetic and DNA microarray data show the effectiveness of the proposed method. Conclusions: The Bernstein test, due to its loose assumptions, is more sensitive than the chi-square test to the detection of multiple structures simultaneously present in the data. Nevertheless, it is less selective, that is, subject to more false positives; by adding independence assumptions, a more selective variant of the Bernstein inequality-based test is also presented. The proposed methods can be applied to discover multiple structures and to assess their significance in different types of biomolecular data.
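For reference, a standard two-sided form of Bernstein's inequality bounds the deviation of a sample mean from its expectation without any distributional assumption beyond boundedness and a variance estimate, which is why a test built on it applies so broadly. A minimal sketch (not the paper's test itself; the function name is our own):

```python
# Illustrative sketch of a two-sided Bernstein tail bound for the mean
# of n i.i.d. variables with variance `var` and |X - E[X]| <= M:
#   P(|mean - E[X]| >= eps) <= 2 * exp(-n*eps^2 / (2*var + (2/3)*M*eps))
# A stability test can declare a structure significant when this bound
# on the observed deviation falls below the chosen significance level.
import math

def bernstein_bound(n, eps, var, M):
    """Upper bound on the probability that the sample mean deviates by >= eps."""
    return 2.0 * math.exp(-n * eps**2 / (2.0 * var + (2.0 / 3.0) * M * eps))
```

As expected, the bound tightens as the sample size grows, so larger samples allow smaller deviations to be declared significant.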
Speeding up the Consensus Clustering methodology for microarray data analysis
"... Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be ..."
Abstract
Cited by 3 (3 self)
Background: The inference of the number of clusters in a dataset, a fundamental problem in Statistics, Data Analysis and Classification, is usually addressed via internal validation measures. The stated problem is quite difficult, in particular for microarrays, since the inferred prediction must be sensible enough to capture the inherent biological structure in a dataset, e.g., functionally related genes. Despite the rich literature present in that area, the identification of an internal validation measure that is both fast and precise has proved to be elusive. In order to partially fill this gap, we propose a speed-up of Consensus (Consensus Clustering), a methodology whose purpose is the provision of a prediction of the number of clusters in a dataset, together with a dissimilarity matrix (the consensus matrix) that can be used by clustering algorithms. As detailed in the remainder of the paper, Consensus is a natural candidate for a speed-up. Results: Since the time-precision performance of Consensus depends on two parameters, our first task is to show that a simple adjustment of the parameters is not enough to obtain a good precision-time trade-off. Our second task is to provide a fast approximation algorithm for Consensus: that is, the closely related algorithm FC (Fast Consensus), which has the same precision as Consensus with a substantially better time performance. The performance of FC has been assessed via extensive experiments on twelve benchmark datasets that ...
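The consensus matrix mentioned in the abstract is built by counting, over repeated clustering runs, how often each pair of points lands in the same cluster. A minimal sketch of that construction (illustrative only; the function name is our own, and the subsampling loop that Consensus wraps around it is omitted):

```python
# Illustrative sketch: the consensus (co-association) matrix. Entry
# (i, j) is the fraction of clustering runs in which points i and j
# share a cluster; 1 - C[i][j] then serves as a dissimilarity that
# downstream clustering algorithms can consume.
def consensus_matrix(labelings):
    """Build the consensus matrix from a list of label vectors."""
    runs = len(labelings)
    n = len(labelings[0])
    C = [[0.0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    C[i][j] += 1.0
    return [[c / runs for c in row] for row in C]

# Two runs that disagree only on the middle point:
C = consensus_matrix([[0, 0, 1], [0, 1, 1]])
print(C[0][1], C[1][2], C[0][0])  # -> 0.5 0.5 1.0
```

The cost of filling this n-by-n matrix over many runs is exactly why the methodology is, as the authors put it, a natural candidate for a speed-up.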
Ensemble clustering with a fuzzy approach
"... Summary. Ensemble clustering is a novel research field that extends to unsupervised learning the approach originally developed for classification and supervised learning problems. In particular ensemble clustering methods have been developed to improve the robustness and accuracy of clustering algor ..."
Abstract
Cited by 3 (0 self)
Summary. Ensemble clustering is a novel research field that extends to unsupervised learning the approach originally developed for classification and supervised learning problems. In particular, ensemble clustering methods have been developed to improve the robustness and accuracy of clustering algorithms, as well as the ability to capture the structure of complex data. In many clustering applications an example may belong to multiple clusters, and the introduction of fuzzy set theory concepts can improve the level of flexibility needed to model the uncertainty underlying real data in several application domains. In this paper, we propose an unsupervised fuzzy ensemble clustering approach that combines the flexibility of fuzzy sets with the robustness of ensemble methods. Our algorithmic scheme can generate different ensemble clustering algorithms that obtain the final consensus clustering in both crisp and fuzzy formats.
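One simple way such a scheme can produce both fuzzy and crisp outputs is to average the ensemble members' membership matrices and then harden by maximum membership. The sketch below is a hypothetical illustration under the strong assumption that cluster columns are already aligned across ensemble members (real methods must solve that correspondence); the function name is our own:

```python
# Illustrative sketch: fuzzy consensus by averaging membership matrices.
# memberships: list of r matrices, each n points x k clusters, with the
# ASSUMPTION that cluster columns are aligned across ensemble members.
def fuzzy_consensus(memberships):
    """Return (average fuzzy memberships, crisp argmax assignment)."""
    r = len(memberships)
    n = len(memberships[0])
    k = len(memberships[0][0])
    avg = [[sum(m[i][c] for m in memberships) / r for c in range(k)]
           for i in range(n)]
    # Crisp consensus: each point goes to its highest average membership.
    crisp = [max(range(k), key=lambda c, row=row: row[c]) for row in avg]
    return avg, crisp

# Two members, two points, two clusters:
avg, crisp = fuzzy_consensus([[[0.9, 0.1], [0.2, 0.8]],
                              [[0.7, 0.3], [0.4, 0.6]]])
print(crisp)  # -> [0, 1]
```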