## Model-Based Clustering and Data Transformations for Gene Expression Data (2001)

Citations: 127 (8 self)

### BibTeX

@MISC{Yeung01model-basedclustering,

author = {K. Y. Yeung and C. Fraley and A. Murua and A. E. Raftery and W. L. Ruzzo},

title = {Model-Based Clustering and Data Transformations for Gene Expression Data},

year = {2001}

}

### Abstract

Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
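The idea that choosing the number of clusters becomes a model selection problem can be sketched on toy data. The example below is only an illustration, not the authors' implementation: it fits one-dimensional Gaussian mixtures by EM in plain Python and compares them with BIC in the form 2 log L - (number of parameters) log n, where larger is better. The function names (`fit_mixture_1d`, `bic`, `select_num_clusters`), the quantile-based initialization, and the variance floor are all assumptions made for this sketch.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_mixture_1d(data, k, n_iter=200, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture by EM; return the log-likelihood.

    Hypothetical helper: deterministic initialization at evenly spaced
    quantiles, with a variance floor to avoid degenerate components.
    """
    n = len(data)
    ordered = sorted(data)
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    grand_mean = sum(data) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in data) / n, var_floor)
    variances = [start_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, var_floor)

    return sum(
        math.log(sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)))
        for x in data
    )


def bic(loglik, n_params, n_obs):
    """BIC in the 'larger is better' convention: 2 log L - n_params * log(n_obs)."""
    return 2.0 * loglik - n_params * math.log(n_obs)


def select_num_clusters(data, candidates):
    """Return (best_k, {k: BIC}) over the candidate numbers of components."""
    scores = {}
    for k in candidates:
        # k means + k variances + (k - 1) free mixing proportions.
        n_params = 3 * k - 1
        scores[k] = bic(fit_mixture_1d(data, k), n_params, len(data))
    return max(scores, key=scores.get), scores


toy = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
best_k, scores = select_num_clusters(toy, [1, 2])
print(best_k, scores)  # the two-component model scores higher on this toy data
```

Real gene expression data are multivariate, so the parameter count would also depend on the covariance structure chosen for the mixture components.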

### Citations

2299 | Estimating the dimension of a model - Schwarz - 1978 |
Citation Context ...alized to more than two models. The main difficulty in using the Bayes factor is the evaluation of the integrated likelihood. We used an approximation called the Bayesian Information Criterion (BIC) (Schwarz 1978): $2 \log p(D \mid M_k) \approx 2 \log p(D \mid \hat{\theta}_k, M_k) - \nu_k \log(n) = \mathrm{BIC}_k$ (4), where $\nu_k$ is the number of parameters to be estimated in model $M_k$ and $\hat{\theta}_k$ is the maximum likelihood estimate for the parameter vector of model $M_k$. For discussion of its use and justification... |

1815 | Cluster analysis and display of genome-wide expression patterns - Eisen, Spellman, et al. - 1998 |

979 | Bayes factors - Kass, Raftery - 1995 |
Citation Context ...prior distribution of $\theta_k$. The integrated likelihood $p(D \mid M_k)$ represents the probability that data $D$ is observed given that the underlying model is $M_k$. The Bayes factor (Kass and Raftery 1995) is defined as the ratio of the integrated likelihoods of the two models, i.e., $B_{12} = p(D \mid M_1) / p(D \mid M_2)$. In other words, the Bayes factor... |

979 | Human Behavior and the Principle of Least Effort - Zipf, K - 1949 |

524 | Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation - Tamayo - 1999 |

517 | Comparing partitions - Hubert, Arabie - 1985 |
Citation Context ... clustering result can be considered as a partition of objects into groups. Thus, comparing two clustering results is equivalent to assessing the agreement of two partitions. The adjusted Rand index (Hubert and Arabie 1985) assesses the degree of agreement between two partitions. Milligan and Cooper (1986) recommended the adjusted Rand index as the measure of agreement even when comparing partitions with different numb... |
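As a rough illustration of the adjusted Rand index described in this context, the sketch below computes it from pair counts using the standard Hubert and Arabie formula. The function name `adjusted_rand_index` and the handling of trivial inputs are assumptions made for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions given as label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both partitions must label the same objects")
    n = len(labels_a)
    if n < 2:
        return 1.0  # assumption: trivial partitions count as perfect agreement

    # Pairs of objects placed together in both, in the first, and in the second partition.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = together_a * together_b / comb(n, 2)  # expected agreement under chance
    maximum = (together_a + together_b) / 2.0
    if maximum == expected:
        return 1.0  # assumption: e.g. both partitions put everything in one cluster
    return (together_both - expected) / (maximum - expected)


# Cluster labels are arbitrary, so a relabeled copy of the same partition agrees fully.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

The index is corrected for chance, so it can be negative when two partitions agree less than random labelings would on average.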

515 | Systematic determination of genetic network architecture - Tavazoie, Hughes, et al. - 1999 |

466 | Mixture Models: Inference and Applications to Clustering - McLachlan, Basford - 1988 |
Citation Context ...tions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (for example Banfield and Raftery 1993, Celeux and Govaert 1993, McLachlan and Basford 1988). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering method become statistical model choice problems (Dasgupta and Raf... |

448 | Objective criteria for the evaluation of clustering methods - Rand - 1971 |

417 | An analysis of transformations - Box, Cox - 1964 |
Citation Context ... assumption, we explored the degree of normality of each class after applying different data transformations. In particular, we studied two types of data transformations: the Box-Cox transformations (Box and Cox 1964), and the standardization of each gene (or clone) to have mean 0 and standard deviation 1. The Box-Cox transformation (Box and Cox 1964) is a parametric family of transformations from $y$ to $y^{(\lambda)}$ with ... |
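The two transformations named in this context can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `box_cox` and `standardize` are hypothetical, the Box-Cox form used is the standard one ($(y^\lambda - 1)/\lambda$ for $\lambda \neq 0$, $\log y$ for $\lambda = 0$), and the sample standard deviation (divisor $n - 1$) is an assumed choice because the excerpt does not say which divisor was used.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox transformation with parameter lam to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]  # log transformation as the lam = 0 case
    return [(v ** lam - 1.0) / lam for v in values]


def standardize(values):
    """Rescale one gene's expression profile to mean 0 and standard deviation 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values to standardize")
    mean = sum(values) / n
    # Assumption: sample standard deviation with divisor n - 1.
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        raise ValueError("constant profile cannot be standardized")
    return [(v - mean) / sd for v in values]


profile = [1.0, 2.0, 4.0, 8.0]  # hypothetical expression levels for one gene
print(box_cox(profile, 0))      # log transformation
print(standardize(profile))
```

In practice the Box-Cox parameter would be estimated from the data rather than fixed by hand.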

393 | Knowledge-based analysis of microarray gene expression data by using support vector machines - Brown, Grundy, et al. - 2000 |

346 | A genome-wide transcriptional analysis of the mitotic cell cycle - Cho, Campbell, et al. - 1998 |
Citation Context ...sion levels of approximately 6000 genes over two cell cycles (17 time points). We used two different subsets of this data with independent external criteria. The first subset (the 5-phase criterion) (Cho et al. 1998) consists of 384 genes peaking at different time points corresponding to the five phases of cell cycle. Since the 384 genes were identified according to the peak times of genes, we expect clustering ... |

330 | Clustering gene expression patterns - Ben-Dor, Shamir, et al. - 1999 |
Citation Context ...elf-organizing maps (Tamayo, Slonim, Mesirov, Zhu, Kitareewan, Dmitrovsky, Lander, and Golub 1999), k-means (Tavazoie, Hughes, Campbell, Cho, and Church 1999), graph-theoretic approaches (for example, Ben-Dor and Yakhini 1999 and Hartuv, Schmitt, Lange, Meier-Ewert, Lehrach, and Shamir 1999), and support vector machines (Brown, Grundy, Lin, Cristianini, Sugnet, Furey, Ares, and Haussler 2000). Success in applications has... |

311 | Model-based Gaussian and non-Gaussian clustering - Banfield, Raftery - 1993 |
Citation Context ... a finite mixture of underlying probability distributions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (for example Banfield and Raftery 1993, Celeux and Govaert 1993, McLachlan and Basford 1988). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering method becom... |

277 | How many clusters? Which clustering method? Answers via model-based cluster analysis - Fraley, Raftery - 1998 |
Citation Context ... underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering method become statistical model choice problems (Dasgupta and Raftery 1998, Fraley and Raftery 1998, Fraley and Raftery 2000). This is an advantage over heuristic clustering algorithms, in which there is no established method to determine the number of clusters or the best clustering method. Detail... |

265 | Model-Based Clustering, Discriminant Analysis, and Density Estimation - Fraley, Raftery |
Citation Context ...odel, the problems of determining the number of clusters and of choosing an appropriate clustering method become statistical model choice problems (Dasgupta and Raftery 1998, Fraley and Raftery 1998, Fraley and Raftery 2000). This is an advantage over heuristic clustering algorithms, in which there is no established method to determine the number of clusters or the best clustering method. Details of the model-based appr... |

260 | Estimating the number of clusters in a data set via the gap statistic - Tibshirani, Walther, et al. - 2001 |

200 | Statistical Analysis of Compositional Data - Aitchison - 1986 |

121 | A classification EM algorithm for clustering and two stochastic versions - Celeux, Govaert - 1992 |
Citation Context ...rithm, first proposed as a heuristic clustering algorithm, has been shown to be closely related to model-based clustering using the equal volume spherical model (EI), as computed by the EM algorithm (Celeux and Govaert 1992). K-means has been successfully used for a wide variety of clustering tasks, including clustering of gene expression data. This is not surprising, from the model-based perspective, given k-means’ i... |

119 | Measures of multivariate skewness and kurtosis with applications - Mardia - 1970 |
Citation Context ...s and kurtosis are defined by $b_{1,p} = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left[ (x_i - \bar{x})^T S^{-1} (x_j - \bar{x}) \right]^3$ and $b_{2,p} = \frac{1}{n} \sum_{i=1}^{n} \left[ (x_i - \bar{x})^T S^{-1} (x_i - \bar{x}) \right]^2$, and null distributions are available for both the multivariate skewness and kurtosis (Mardia 1970). A small p-value suggests the multivariate normal assumption to be questionable. Maximum likelihood estimation of the transformation parameters: The parameter $\lambda$ in the Box-Cox transformation in Equa... |

97 | Gaussian parsimonious clustering models - Celeux, Govaert - 1995 |

87 | Validating clustering for gene expression data - Yeung, Haynor, et al. - 2001 |

82 | Detecting features in spatial point processes with clutter via model-based clustering - Dasgupta, Raftery - 1998 |
Citation Context ...and Basford 1988). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering method become statistical model choice problems (Dasgupta and Raftery 1998, Fraley and Raftery 1998, Fraley and Raftery 2000). This is an advantage over heuristic clustering algorithms, in which there is no established method to determine the number of clusters or the best ... |

69 | A study of the comparability of external criteria for hierarchical cluster analysis - Milligan, Cooper - 1986 |

62 | Principal Component Analysis for clustering gene expression data - Yeung, Ruzzo |

52 | MCLUST: software for model-based cluster analysis - Fraley, Raftery - 1999 |
Citation Context ...tical components. In the experiments reported in this paper, we considered the equal volume spherical (EI), unequal volume spherical (VI), EEE and unconstrained (VVV) models as implemented in MCLUST (Fraley and Raftery 1999), and the diagonal model as implemented by Murua, Tantrum, Stuetzle, and Sieberts (2001). In both the MCLUST implementation and the diagonal model implementation, the model parameters are estimated b... |

52 | Array of hope - Lander - 1999 |

49 | Context-specific bayesian clustering for gene expression data - Barash, Friedman |

45 | An algorithm for clustering cDNAs for gene expression analysis - Hartuv, Schmitt, et al. - 1999 |

44 | An empirical study of principal component analysis for clustering gene expression data. Bioinformatics - Yeung, Ruzzo - 2001 |

42 | MIPS: a database for protein sequences and complete genomes - Mewes, Heumann, et al. - 1999 |

39 | Comparative hybridization of an array of 21,500 ovarian cDNAs for the discovery of genes overexpressed in ovarian carcinomas - Schummer - 1999 |

35 | Finding regulatory elements using joint likelihoods for sequence and expression profile data - Holmes, Bruno - 2000 |

34 | Applied Multivariate Data Analysis - Jobson - 1992 |

21 | Comparison of the mixture and the classification maximum likelihood in cluster analysis when data are binary - Govaert, Nadif - 1996 |
Citation Context ...ying probability distributions such as multivariate normal distributions. The Gaussian mixture model has been shown to be a powerful tool for many applications (for example Banfield and Raftery 1993, Celeux and Govaert 1993, McLachlan and Basford 1988). With the underlying probability model, the problems of determining the number of clusters and of choosing an appropriate clustering method become statistical model choic... |

16 | Validating clustering for gene expression data. Bioinformatics - Yeung, Haynor, et al. - 2001 |

5 | Model based document classification and clustering. Manuscript in preparation - Murua, Tantrum, et al. - 2001 |

5 | Estimating the number of clusters in a dataset via the Gap statistic - Tibshirani, Walther, et al. - 2000 |

4 | Model-based clustering and data transformations for gene expression data - Yeung, Fraley, et al. - 2001 |

3 | Methods for assessing multivariate normality - Andrews, Gnanadesikan, et al. - 1973 |

3 | Speed group microarray page: Hints and prejudices. Http://statwww. berkeley.edu/users/terry/zarray/Html/hintsindex.html - Speed - 2000 |
Citation Context ...$y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ if $\lambda \neq 0$, and $y^{(\lambda)} = \log y$ if $\lambda = 0$. The Box-Cox transformation subsumes many commonly used transformations, including the log transformation which is very popular for microarray data (for example, Speed 2000). Standardizing each gene (or clone) to have mean 0 and standard deviation 1 is another very popular data transformation step before clustering; see, for example, Tamayo, Slonim, Mesirov, Zhu, Kitare... |

3 | Validating clustering for gene expression data - Yeung, Haynor, et al. - 2001 |

1 | Context-specific Bayesian clustering for gene expression data - Barash, Friedman - 2001 |

1 | Applied multivariate data analysis. New York - Jobson - 1991 |

1 | MIPS: a database for protein sequences and complete genomes - Mewes, Heumann, et al. - 1999 |

1 | Model-based Gaussian and non-Gaussian clustering - Banfield, Raftery - 1993 |

1 | Model-based clustering for gene expression data - Lyon - 2000 |

1 | A genome-wide transcriptional analysis of the mitotic cell cycle - Cho, Campbell, et al. - 1998 |

1 | Detecting features in spatial point processes with clutter via model-based clustering - Dasgupta, Raftery - 1998 |

1 | Cluster analysis and display of genome-wide expression patterns - Eisen, Spellman, et al. - 1998 |