## Model order selection for bio-molecular data clustering (2007)

Venue: BMC Bioinformatics

Citations: 25 (5 self-citations)

### BibTeX

@ARTICLE{Bertoni07modelorder,
  author  = {Alberto Bertoni and Giorgio Valentini},
  title   = {Model order selection for bio-molecular data clustering},
  journal = {BMC Bioinformatics},
  year    = {2007}
}


### Abstract

Background: Cluster analysis has been widely applied for investigating structure in bio-molecular data. A drawback of most clustering algorithms is that they cannot automatically detect the "natural" number of clusters underlying the data, and in many cases we do not have enough "a priori" biological knowledge to evaluate either the number of clusters or their validity. Recently several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters, but despite their successful application to the analysis of complex bio-molecular data, the assessment of the statistical significance of the discovered clustering solutions and the detection of multiple structures simultaneously present in high-dimensional bio-molecular data are still major problems. Results: We propose a stability method based on randomized maps that exploits the high dimensionality and relatively low cardinality that characterize bio-molecular data, by selecting subsets of randomized linear combinations of the input variables, and by using stability indices based on the overall distribution of similarity measures between multiple pairs of clusterings performed on the randomly projected data. A χ²-based statistical test is proposed to assess the significance of the clustering solutions and to detect significant and, if possible, multi-level structures simultaneously present in the data (e.g. hierarchical structures).
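The overall procedure described in the abstract — project the data twice with a randomized map, cluster each projection, and record the similarity of the two clusterings over many repetitions and candidate numbers of clusters — can be sketched in a few lines. This is a minimal illustration under our own assumptions (toy data, a naive k-means, Bernoulli projections, the Fowlkes-Mallows pair-counting similarity; all function names are ours, not the paper's MOSRAM code):

```python
import numpy as np

rng = np.random.default_rng(0)

def bernoulli_projection(X, d_new, rng):
    """Randomized map: R = (1/sqrt(d')) (r_ij) with r_ij uniform in {-1, +1}."""
    R = rng.choice([-1.0, 1.0], size=(d_new, X.shape[1])) / np.sqrt(d_new)
    return X @ R.T

def kmeans(X, k, rng, iters=20):
    """Minimal Lloyd's k-means returning integer cluster labels."""
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def fowlkes_mallows(a, b):
    """Pair-counting similarity between two labelings, in [0, 1]."""
    n = len(a)
    c = np.zeros((a.max() + 1, b.max() + 1))
    for i, j in zip(a, b):
        c[i, j] += 1
    tk = (c ** 2).sum() - n
    pk = (c.sum(axis=1) ** 2).sum() - n
    qk = (c.sum(axis=0) ** 2).sum() - n
    return tk / np.sqrt(pk * qk) if pk > 0 and qk > 0 else 1.0

# Toy high-dimensional data: 40 samples, 200 features, two separated groups.
X = rng.normal(size=(40, 200))
X[:20] += 3.0

m, ks = 10, [2, 3, 4]
M = np.zeros((m, max(ks) + 1))   # M[i, k]: similarity of the i-th pair at k
for k in ks:
    for i in range(m):
        la = kmeans(bernoulli_projection(X, 50, rng), k, rng)
        lb = kmeans(bernoulli_projection(X, 50, rng), k, rng)
        M[i, k] = fowlkes_mallows(la, lb)

for k in ks:
    print(k, round(M[:, k].mean(), 3))   # estimate of the stability index E[S_k]
```

The matrix `M` of pairwise similarities is exactly the object on which the stability indices and the χ² test described below operate.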

### Citations

1429 |
Finding Groups in Data: An Introduction to Cluster Analysis
- KAUFMAN, ROUSSEEUW
- 1990
Citation Context ... clusters in gene expression data, and we compare the results with other algorithms for model order selection. In our experiments we used the classical k-means [26] and Partitioning Around Medoids (PAM) [27] clustering algorithms, and we applied the Bernoulli, Achlioptas and Normal random projections, but in this section we show only the results obtained with Bernoulli projections, since with the other r... |

1367 | Data clustering: a review
- Jain, Murty, et al.
- 1999
Citation Context ...t the shape of the clusters, and in principle any clustering algorithm C, randomized map µ, and clustering similarity measure sim may be used (e.g. the Jaccard or the Fowlkes and Mallows coefficients [23]). Stability indices based on the distribution of the similarity measures Using the similarity measures obtained through the MOSRAM algorithm, we may compute stability indices to assess the reliabilit... |

579 |
Mathematical methods of statistics
- Cramer
- 1946
Citation Context ... θ̂(1 − θ̂), we conclude that the following statistic Y = Σ_{k∈K} (X_k − m θ̂)² / (m θ̂(1 − θ̂)) ∼ χ²_{|K|−1} (9) is approximately distributed according to χ²_{|K|−1} (see, e.g. [24] chapter 12, or [25] chapter 30 for more details). A realization x_k of the random variable X_k (and the corresponding realization y of Y) can be computed by using the output of the MOSRAM algorithm: x_k = Σ_{i=1}^{m} I(M(i, k) > t_o) (10) ... |
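The χ²-based significance test quoted in this excerpt can be written down directly. The sketch below is a hypothetical implementation (the function name, the threshold value and the toy similarity matrix are our assumptions, not the paper's code): it counts, for each candidate k, the runs whose similarity exceeds the threshold t_o (eq. 10), pools the counts into the estimate θ̂, and compares the resulting statistic Y against a χ² distribution with |K| − 1 degrees of freedom (eq. 9).

```python
import numpy as np
from scipy.stats import chi2

def chi2_stability_test(M, ks, t_o=0.9):
    """Assumed implementation of the chi-square significance test (eqs. 9-10).

    x_k counts the runs whose similarity exceeds the threshold t_o;
    Y is compared against chi2 with |K| - 1 degrees of freedom.
    """
    m = M.shape[0]
    x = np.array([np.sum(M[:, k] > t_o) for k in ks], dtype=float)
    theta = x.sum() / (m * len(ks))                 # pooled estimate of theta
    if theta == 0.0 or theta == 1.0:                # degenerate: all counts equal
        return 0.0, 1.0
    Y = np.sum((x - m * theta) ** 2) / (m * theta * (1.0 - theta))
    return Y, chi2.sf(Y, df=len(ks) - 1)            # statistic and p-value

# Hypothetical similarity matrix: k = 2 looks stable, k = 3 and k = 4 do not.
M = np.zeros((10, 5))
M[:, 2], M[:, 3], M[:, 4] = 0.95, 0.5, 0.4
Y, p = chi2_stability_test(M, ks=[2, 3, 4])   # Y = 30.0, p << 0.001
```

A small p-value indicates that the counts x_k differ across the candidate k, i.e. that at least one candidate structure is significantly more stable than the others.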

557 |
Hierarchical grouping to optimize an objective function
- Ward
- 1963
Citation Context ..., while with ɛ = 0.4 the results are less reliable due to the relatively large distortion induced (data not shown). Also using the PAM [27] and hierarchical clustering algorithms with the Ward method [33] we obtained a two-level structure with 2 and 3 clusters at α = 10⁻⁵ significance level. Lymphoma Three different lymphoid malignancies are represented in the Lymphoma gene expression data set [1]: D... |

458 |
Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403
- Alizadeh, Eisen, et al.
- 2000
Citation Context ...this end, for each k ∈ K = {2, 3, ..., H + 1}, let us consider the random variable S_k defined in eq. 2, whose expectation is our proposed stability index. For all k and for a fixed threshold t_o ∈ [0, 1] consider the Bernoulli random variable B_k = I(S_k > t_o), where I is the indicator function: I(P) = 1 if P is True, I(P) = 0 if P is False. The sum X_k = Σ_{j=1}^{m} B_k^(j) of i.i.d. copies of B_k is distr... |

384 | The random subspace method for constructing decision forests - Ho - 1998 |

277 | Estimating the Number of Clusters in a Data Set via the Gap Statistic - Tibshirani, Walther, et al. - 2001 |

234 |
Some Methods for classification and Analysis of Multivariate Observations
- MacQueen
Citation Context ...ithm to discover the "natural" number of clusters in gene expression data, and we compare the results with other algorithms for model order selection. In our experiments we used the classical k-means [26] and Partitioning Around Medoids (PAM) [27] clustering algorithms, and we applied the Bernoulli, Achlioptas and Normal random projections, but in this section we show only the results obtained with Berno... |

234 |
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
- Golub, Slonim, et al.
- 1999
Citation Context ...able (at α significance level). Experiments with DNA microarray data To show the effectiveness of our methods with gene expression data we applied MOSRAM and the proposed statistical test to Leukemia [29] and Lymphoma [1] samples. These data sets have been analyzed with other model order selection algorithms previously proposed [10, 13, 30–32]: at the end of this section we compare the results obtaine... |

175 | Applications of resampling methods to estimate the number of clusters and to improve the accuracy of a clustering method
- Fridlyand, Dudoit
- 2001
Citation Context ...original 7129 gene expression values. We further selected the 100 genes with the highest variance across samples, since low variance genes are unlikely to be informative for the purpose of clustering [10, 31]. We analyzed both the 3571-dimensional data and the data restricted to the 100 genes with highest variance, using respectively Bernoulli projections with ɛ ∈ {0.1, 0.2, 0.3, 0.4} and projections to 8... |
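The variance-based gene filtering step mentioned in this excerpt is straightforward to reproduce. A minimal sketch (the helper name and the matrix shape are our assumptions; we use a random "Leukemia-like" matrix, not the real data):

```python
import numpy as np

def top_variance_genes(X, n_genes=100):
    """Keep the n_genes columns (genes) with the highest variance across samples."""
    idx = np.argsort(X.var(axis=0))[::-1][:n_genes]
    return X[:, idx], idx

# Hypothetical expression matrix (samples x genes) standing in for real data.
rng = np.random.default_rng(1)
X = rng.normal(size=(72, 3571))
X[:, :5] *= 10.0                       # plant a few clearly high-variance genes
Xf, idx = top_variance_genes(X, n_genes=100)
```

The planted high-variance columns are guaranteed to survive the filter, mimicking the assumption that informative genes vary strongly across samples.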

167 | Consensus Clustering: A Resampling-Based Method for Class Discovery and Visualization of Gene Expression Microarray Data
- Monti
Citation Context ...g is considered reliable if it is approximately maintained across multiple perturbations. Different procedures have been introduced to randomly perturb the data, ranging from bootstrapping techniques [9, 12, 13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15, 16]. In particular, Smolkin and Ghosh [17] applied an unsupervised version of the random subspace met... |

132 |
Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning
- Shipp, et al.
- 2002
Citation Context ... In particular studies based on the gene expression signatures of the DLBCL patients [1] and on their supervised analysis [35], showed the existence of two subclasses of DLBCLs. Moreover Shipp et al. [36] highlighted that FL patients frequently evolve over time and acquire the clinical features of DLBCLs, and Lange et al. [10] found that a 3-clustering solution groups together FL, CLL and a subgroup o... |

101 | Bootstrapping cluster analysis: Assessing the reliability of conclusions from microarray experiments
- Kerr, Churchill
- 2000
Citation Context ...g is considered reliable if it is approximately maintained across multiple perturbations. Different procedures have been introduced to randomly perturb the data, ranging from bootstrapping techniques [9, 12, 13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15, 16]. In particular, Smolkin and Ghosh [17] applied an unsupervised version of the random subspace met... |

92 |
Computational cluster validation in post-genomic data analysis
- Handl, Knowles, et al.
- 2005
Citation Context ... classes) [7]. To deal with these problems, several methods for assessing the validity of the discovered clusters and to test the existence of biologically meaningful clusters have been proposed (see [8] for a review). Recently, several methods based on the concept of stability have been proposed to estimate the "optimal" number of clusters in complex bio-molecular data [9–11]. In this conceptual fra... |

87 |
Comparisons and validation of statistical clustering techniques for microarray gene expression data
- Datta, Datta
Citation Context ...a, we need to assess the reliability of the discovered clusters, and to solve the model order selection problem, that is the proper selection of the "natural" number of clusters underlying the data [5, 6]. From a machine learning standpoint, this is an intrinsically "ill-posed" problem, since in unsupervised learning we lack an external objective criterion, that is, we do not have an equivalent of a prior... |

75 | Stability-based validation of clustering solutions
- Lange, Roth, et al.
- 2004
Citation Context ...original 7129 gene expression values. We further selected the 100 genes with the highest variance across samples, since low variance genes are unlikely to be informative for the purpose of clustering [10, 31]. We analyzed both the 3571-dimensional data and the data restricted to the 100 genes with highest variance, using respectively Bernoulli projections with ɛ ∈ {0.1, 0.2, 0.3, 0.4} and projections to 8... |

69 | Resampling method for unsupervised estimation of cluster validity
- Levine, Domany
- 2001
Citation Context ...of Merit measure is based on a resampling approach too, but the stability of the solutions is assessed by directly comparing the solution obtained on the full sample with that obtained on the subsamples [32]. We considered also stability-based methods that apply supervised algorithms to assess the quality of the discovered clusterings instead of comparing pairs of perturbed clusterings [10, 31]: the main... |

43 |
Cluster stability scores for microarray data in cancer studies
- Smolkin, Ghosh
Citation Context ...urb the data, ranging from bootstrapping techniques [9, 12, 13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15, 16]. In particular, Smolkin and Ghosh [17] applied an unsupervised version of the random subspace method [18] to estimate the stability of clustering solutions. By this approach, subsets of features are randomly selected multiple times, and c... |
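The random subspace perturbation described in this excerpt — drawing feature subsets at random, then comparing the clusterings computed on each subset — has a very small core. A sketch of the sampling step (the sizes and the helper name are arbitrary assumptions):

```python
import numpy as np

def random_subspace(X, n_features, rng):
    """Draw one random feature subset (one perturbation of the data)."""
    cols = rng.choice(X.shape[1], size=n_features, replace=False)
    return X[:, cols]

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 500))                       # samples x features
subsets = [random_subspace(X, 50, rng) for _ in range(10)]
```

Each subset would then be clustered and the resulting partitions compared, exactly as with the projection-based perturbations.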

29 | Methods of assessing reproducibility of clustering patterns observed in analyses of microarray data - McShane, Radmacher, et al. - 2002 |

15 |
Randomized maps for assessing the reliability of patients clusters in DNA microarray data analyses
- Bertoni, Valentini
- 2006
Citation Context ...ures have been introduced to randomly perturb the data, ranging from bootstrapping techniques [9, 12, 13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15, 16]. In particular, Smolkin and Ghosh [17] applied an unsupervised version of the random subspace method [18] to estimate the stability of clustering solutions. By this approach, subsets of features are r... |

14 |
A stability based method for discovering structure in clustered data. Pac Symp Biocomput 2002
- Ben-Hur, Elisseeff, et al.
Citation Context ...g is considered reliable if it is approximately maintained across multiple perturbations. Different procedures have been introduced to randomly perturb the data, ranging from bootstrapping techniques [9, 12, 13], to noise injection into the data [14] or random projections into lower dimensional subspaces [15, 16]. In particular, Smolkin and Ghosh [17] applied an unsupervised version of the random subspace met... |

14 | Clusterv: a tool for assessing the reliability of clusters discovered in DNA microarray data - Valentini |

13 |
Extensions of Lipschitz mappings into a Hilbert space
- Johnson, Lindenstrauss
- 1984
Citation Context ...ps from higher to lower-dimensional subspaces, in order to reduce the distortion induced by random projections. Moreover, we introduce a principled method based on the Johnson and Lindenstrauss lemma [19] to properly choose the dimension of the projected subspace. Our proposed stability indices are related to those proposed by Ben-Hur et al. [13]: their stability measures are obtained from the distrib... |
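The Johnson-Lindenstrauss lemma gives a principled way to pick the projected dimension d′. A small sketch using the commonly quoted bound d′ ≥ 4 ln(n) / (ε²/2 − ε³/3) (this particular constant is the standard textbook form and an assumption on our part, not necessarily the exact bound used in the paper):

```python
import math

def jl_min_dim(n, eps):
    """Smallest d' from the standard JL bound d' >= 4 ln(n) / (eps^2/2 - eps^3/3),
    guaranteeing pairwise distances of n points are distorted by at most (1 +/- eps)."""
    return math.ceil(4.0 * math.log(n) / (eps ** 2 / 2.0 - eps ** 3 / 3.0))

d_prime = jl_min_dim(72, 0.2)   # e.g. 72 samples, 20% distortion -> d' = 987
```

Note that the bound depends only on the number of points n and the tolerated distortion ε, not on the original dimension d, which is exactly why it suits high-dimensional, low-cardinality bio-molecular data.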

The advantage of functional prediction based on clustering of yeast genes and its correlation with non-sequence based classifications
- Bilu, Linial
- 2002
Citation Context ...eed, integrating by parts: E[S_k] = ∫₀¹ s f_k(s) ds = ∫₀¹ s F′_k(s) ds = 1 − ∫₀¹ F_k(s) ds = 1 − g(k) (4). Fact 2: Var[S_k] ≤ g(k)(1 − g(k)). Since 0 ≤ S_k ≤ 1 it follows S_k² ≤ S_k; therefore, using Fact 1: Var[S_k] = E[S_k²] − E[S_k]² ≤ E[S_k] − E[S_k]² = g(k)(1 − g(k)) (5). In conclusion, if g(k) ≈ 0 then E[S_k] ≈ 1 and Var[S_k] ≈ 0, i.e. S_k is centered close to 1. As a consequence, E[S_k] can be used as an... |

7 |
Friedlich M, Fromer M, Linial M: A functional hierarchical organization of the protein sequence space
- Kaplan
Citation Context ...e density function of S_k and F_k(s) its cumulative distribution function. A parameter of concentration implicitly used in [13] is the integral g(k) of the cumulative distribution: g(k) = ∫₀¹ F_k(s) ds (3). Note that if S_k is centered in 1, g(k) is close to 0, and hence it can be used as a measure of stability. Moreover, the following facts show that g(k) is strictly related to both the expectation E[S_k... |
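Given a sample of similarity values, g(k) and the variance bound can be checked numerically. A small sketch with simulated similarities (the Gaussian parameters are arbitrary assumptions for illustration): since E[S_k] = 1 − g(k), the natural estimate of g(k) is one minus the sample mean, and the sample variance must respect Var[S_k] ≤ g(k)(1 − g(k)) because every value lies in [0, 1].

```python
import numpy as np

rng = np.random.default_rng(3)
# Simulated similarity values S_k for one candidate k, concentrated near 1.
s = np.clip(rng.normal(loc=0.92, scale=0.03, size=1000), 0.0, 1.0)

g_hat = 1.0 - s.mean()            # estimate of g(k) = 1 - E[S_k]   (eq. 4)
var_hat = s.var()                 # empirical Var[S_k]
bound = g_hat * (1.0 - g_hat)     # Var[S_k] <= g(k)(1 - g(k))      (eq. 5)
```

The inequality holds for the empirical moments as well, because each sᵢ ∈ [0, 1] implies sᵢ² ≤ sᵢ, mirroring the argument in the excerpt.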

7 | Reproducible clusters from microarray research: whither - Garge, Page, et al. - 2005 |

6 |
The Lymphochip: a specialized cDNA microarray for genomic-scale analysis of gene expression in normal and malignant lymphocytes
- Alizadeh
- 2001
Citation Context ...eukemia (CLL). The gene expression measurements are obtained with a cDNA microarray specialized for genes related to lymphoid diseases, the Lymphochip, which provides expression levels for 4026 genes [34]. The 62 available samples are subdivided into 42 DLBCL, 11 CLL and 9 FL. We performed pre-processing of the data according to [1], replacing missing values with 0 and then normalizing the data to zero ... |

6 |
Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artif Intell Med 2002;26:281–304
- Valentini
Citation Context ...ular characteristics and to clinical outcome classes of non-Hodgkin lymphomas. In particular studies based on the gene expression signatures of the DLBCL patients [1] and on their supervised analysis [35], showed the existence of two subclasses of DLBCLs. Moreover Shipp et al. [36] highlighted that FL patients frequently evolve over time and acquire the clinical features of DLBCLs, and Lange et al. [1... |

5 |
Random projection in dimensionality reduction: applications to image and text data. KDD '01
- Bingham, Mannila
- 2001
Citation Context ...rom the dimension d of the original space. The embedding exhibited in [19] consists in projections from R^d into random d′-dimensional subspaces. Similar results may be obtained by using simpler maps [20, 21], represented through random d′ × d matrices R = 1/√d′ (r_ij), where r_ij are random variables such that E[r_ij] = 0, Var[r_ij] = 1. Strictly speaking, these are not projections, but for sake of simp... |

4 |
An integrated tool for microarray data clustering and cluster validity assessment
- Bolshakova, Azuaje, et al.
- 2005
Citation Context ...a, we need to assess the reliability of the discovered clusters, and to solve the model order selection problem, that is the proper selection of the "natural" number of clusters underlying the data [5, 6]. From a machine learning standpoint, this is an intrinsically "ill-posed" problem, since in unsupervised learning we lack an external objective criterion, that is, we do not have an equivalent of a prior... |

4 |
Towards a novel classification of human malignancies based on gene expression
- Alizadeh, Ross, et al.
- 2001
Citation Context ...valuate both the number of clusters (e.g. the number of biologically distinct tumor classes), as well as the validity of the discovered clusters (e.g. the reliability of new discovered tumor classes) [7]. To deal with these problems, several methods for assessing the validity of the discovered clusters and to test the existence of biologically meaningful clusters have been proposed (see [8] for a rev... |

3 |
Random Projections for High Dimensional Data Clustering: A Cluster Ensemble Approach
- Fern, Brodley
- 2003
Citation Context ...{−√3, 0, √3}, such that Prob(r_ij = 0) = 2/3, Prob(r_ij = √3) = Prob(r_ij = −√3) = 1/6. In this case also we have E[r_ij] = 0 and Var[r_ij] = 1 and the JL lemma holds. 3. Normal random projections [21, 22]: this JL-lemma-compliant randomized map is represented by a d′ × d matrix R = 1/√d′ (r_ij), where r_ij are distributed according to a Gaussian with 0 mean and unit variance. 4. Random Subspac... |
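The JL-compliant randomized maps enumerated in this excerpt (Bernoulli, Achlioptas, Normal) can each be generated in a few lines: since every entry has E[r_ij] = 0 and Var[r_ij] = 1, dividing by √d′ preserves squared norms, and hence pairwise distances, in expectation. A sketch under our own naming (not the paper's code):

```python
import numpy as np

def bernoulli_matrix(d_new, d, rng):
    """r_ij uniform in {-1, +1}."""
    return rng.choice([-1.0, 1.0], size=(d_new, d)) / np.sqrt(d_new)

def achlioptas_matrix(d_new, d, rng):
    """r_ij in {-sqrt(3), 0, +sqrt(3)} with probabilities 1/6, 2/3, 1/6."""
    vals = np.array([-np.sqrt(3.0), 0.0, np.sqrt(3.0)])
    return rng.choice(vals, size=(d_new, d), p=[1/6, 2/3, 1/6]) / np.sqrt(d_new)

def normal_matrix(d_new, d, rng):
    """r_ij ~ N(0, 1)."""
    return rng.normal(size=(d_new, d)) / np.sqrt(d_new)

# Empirical check: the norm of a projected vector stays close to the original.
rng = np.random.default_rng(4)
x = rng.normal(size=1000)
ratios = []
for make in (bernoulli_matrix, achlioptas_matrix, normal_matrix):
    R = make(200, 1000, rng)
    ratios.append(float(np.linalg.norm(R @ x) / np.linalg.norm(x)))
```

The Achlioptas map is sparse (two thirds of the entries are zero), which makes the projection correspondingly cheaper to apply with the same distance-preserving guarantee.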

2 |
Sequence variability and candidate gene analysis in complex disease: association of mu opioid receptor gene variation with substance dependence
- Hoehe, Kopke, et al.
- 2000
Citation Context ...om projection) and sim a suitable similarity measure between two clusterings (e.g. the Fowlkes and Mallows similarity). We may define the random variable S_k, 0 ≤ S_k ≤ 1: S_k = sim(C(D_1, k), C(D_2, k)) (2), where D_1 = ρ⁽¹⁾(D) and D_2 = ρ⁽²⁾(D) are obtained through random and independent perturbations of the data set D; the intuitive idea is that if S_k is concentrated close to 1, the corresponding clu... |

1 |
A Sober Look at Clustering Stability
- Ben-David, von Luxburg, et al.
Citation Context ...a given clustering may converge to a suboptimal solution owing to the shape of the data manifold and not to the real structure of the data, thus introducing bias in the stability indices. Moreover in [37] it has been shown that stability-based methods based on resampling techniques, when cost-based clustering algorithms are used, may fail to detect the correct number of clusters, if the data are not s... |

1 | A Sober Look at Clustering Stability. In 19th Annual Conference on Learning Theory, COLT 2006, Volume 4005 of Lecture Notes in Computer Science, Springer; 2006:5-19. - Ben-David, von Luxburg, et al. |