Results 1 - 10
of
49
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering
, 2004
"... Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity f ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field. Index Terms—Microarray technology, gene expression data, clustering.
A Comparison of Spectral Clustering Algorithms
, 2003
"... Spectral Clustering has become quite popular over the last few years and several new algorithms have been published. In this paper, we compare several of the best-known algorithms from the point of view of clustering quality over arti cial and real datasets. We implement many variations of the ex ..."
Abstract
-
Cited by 47 (2 self)
- Add to MetaCart
Spectral Clustering has become quite popular over the last few years and several new algorithms have been published. In this paper, we compare several of the best-known algorithms from the point of view of clustering quality over arti cial and real datasets. We implement many variations of the existing spectral algorithms and compare their performance to see which features are more important. We also demonstrate that spectral methods show competitive performance on real dataset with respect to existing methods.
A Unified Framework for Model-based Clustering
- Journal of Machine Learning Research
, 2003
"... Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonaliti ..."
Abstract
-
Cited by 43 (6 self)
- Add to MetaCart
Model-based clustering techniques have been widely used and have shown promising results in many applications involving complex data. This paper presents a unified framework for probabilistic model-based clustering based on a bipartite graph view of data and models that highlights the commonalities and differences among existing model-based clustering algorithms. In this view, clusters are represented as probabilistic models in a model space that is conceptually separate from the data space. For partitional clustering, the view is conceptually similar to the ExpectationMaximization (EM) algorithm. For hierarchical clustering, the graph-based view helps to visualize critical/important distinctions between similarity-based approaches and model-based approaches.
A quantitative study of gene regulation involved in the immune response of Anopheline mosquitoes: An application of Bayesian hierarchical clustering of curves
- Journal of the American Statistical Association
, 2006
"... Malaria represents one of the major worldwide challenges to public health. A recent breakthrough in the study of the disease follows the annotation of the genome of the malaria parasite Plasmodium falciparum and the mosquito vector 1 Anopheles. Of particular interest is the molecular biology underly ..."
Abstract
-
Cited by 26 (1 self)
- Add to MetaCart
Malaria represents one of the major worldwide challenges to public health. A recent breakthrough in the study of the disease follows the annotation of the genome of the malaria parasite Plasmodium falciparum and the mosquito vector 1 Anopheles. Of particular interest is the molecular biology underlying the immune response system of Anopheles which actively fights against Plasmodium infection. This paper reports a statistical analysis of gene expression time profiles from mosquitoes which have been infected with a bacterial agent. Specifically, we introduce a Bayesian model-based hierarchical clustering algorithm for curve data to investigate mechanisms of regulation in the genes concerned; that is, we aim to cluster genes having similar expression profiles. Genes displaying similar, interesting profiles can then be highlighted for further investigation by the experimenter. We show how our approach reveals structure within the data not captured by other approaches. One of the most pertinent features of the data is the sample size, which records the expression levels of 2771 genes at six time points. Additionally, the time points are unequally spaced and there is expected non-stationary behaviour in the gene profiles. We demonstrate our approach to be readily implementable under these conditions, and highlight some crucial computational savings that can be made in the context of a fully Bayesian analysis.
Bayesian regularization for normal mixture estimation and model-based clustering
, 2005
"... Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum ..."
Abstract
-
Cited by 13 (3 self)
- Add to MetaCart
Normal mixture models are widely used for statistical modeling of data, including cluster analysis. However maximum likelihood estimation (MLE) for normal mixtures using the EM algorithm may fail as the result of singularities or degeneracies. To avoid this, we propose replacing the MLE by a maximum a posteriori (MAP) estimator, also found by the EM algorithm. For choosing the number of components and the model parameterization, we propose a modified version of BIC, where the likelihood is evaluated at the MAP instead of the MLE. We use a highly dispersed proper conjugate prior, containing a small fraction of one observation’s worth of information. The resulting method avoids degeneracies and singularities, but when these are not present it gives similar results to the standard method using MLE, EM and BIC. Key words: BIC; EM algorithm; mixture models; model-based clustering; conjugate prior; posterior mode. 1
Translation-Invariant Mixture Models for Curve Clustering
- In Proc. Ninth ACM SIGKDD Inter. Conf. on Knowledge Discovery and Data Mining, Washington D.C., August 24–27
, 2003
"... In this paper we present a family of algorithms that can simultaneously align and cluster sets of multidimensional curves defined on a discrete time grid. Our approach assumes that the data are being generated from a finite mixture of curve models. Each mixture component uses (a) a mean curve ba ..."
Abstract
-
Cited by 10 (2 self)
- Add to MetaCart
In this paper we present a family of algorithms that can simultaneously align and cluster sets of multidimensional curves defined on a discrete time grid. Our approach assumes that the data are being generated from a finite mixture of curve models. Each mixture component uses (a) a mean curve based on a flexible non-parametric representation, (b) additive measurement noise, (c) randomly selected discrete-valued shifts of each curve with respect to the independent variable (i.e., typically along the time axis), and (d) random real-valued o#sets of each curve with respect to the observed variable. We show that the Expectation-Maximization (EM) algorithm can be used to simultaneously recover both the curve models for each cluster, and the most likely shifts, o#sets, and cluster memberships for each curve. We demonstrate how Bayesian estimation methods can improve the results for small sample sizes by enforcing smoothness in the cluster mean curves. We evaluate the methodology on two real-world data sets, time-course gene expression data and storm trajectory data. Experimental results show that models that incorporate curve alignment systematically provide improvements in predictive power on test data sets. The proposed approach provides a non-parametric, computationally e#cient, and robust methodology for clustering broad classes of curve data.
Gene Clustering Based on Clusterwide Mutual Information
- of the school of engineering. Michel Verleysen was born in 1965 in Belgium. He received the M.S. and Ph.D. degrees in electrical engineering from the Université catholique de Louvain (Belgium) in 1987 and 1992, respectively. He was an Invited Professor at
, 2004
"... Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a ge ..."
Abstract
-
Cited by 10 (3 self)
- Add to MetaCart
Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.
Network constrained clustering for gene microarray data
- Bioinformatics
, 2005
"... doi:10.1093/bioinformatics/bti655 ..."
Statistical challenges in functional genomics
- Statist. Sci
, 2003
"... On February 12, 2001 the Human Genome Project announced that it had assembled a draft physical map of the human genome- the genetic blueprint for a human being. Now the challenge is to annotate this map, by understanding the functions of genes and their interplay with proteins and the environment to ..."
Abstract
-
Cited by 8 (0 self)
- Add to MetaCart
On February 12, 2001 the Human Genome Project announced that it had assembled a draft physical map of the human genome- the genetic blueprint for a human being. Now the challenge is to annotate this map, by understanding the functions of genes and their interplay with proteins and the environment to create complex, dynamic living systems. This is the goal of functional genomics. Recent technological advances enable biomedical investigators to observe the genome of entire organisms in action by simultaneously measuring the level of activation of thousands of genes under the same experimental conditions. This technology, known as microarrays, provides today unparalleled discovery opportunities and it is reshaping biomedical sciences. One of the main aspects of this revolution is the introduction of heavily quantitative data-analytical methods in biomedical research. This paper reviews the foundations of this technology
Analysis of Microarray Gene Expression Data
- in ‘Handbook of Statistical Genetics’, 2nd edn
, 2003
"... This article reviews the methods utilized in processing and analysis of gene expression data generated using DNA microarrays. This type of experiment allows to determine relative levels of mRNA abundance in a set of tissues or cell populations for thousands of genes simultaneously. Naturally, suc ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
This article reviews the methods utilized in processing and analysis of gene expression data generated using DNA microarrays. This type of experiment allows to determine relative levels of mRNA abundance in a set of tissues or cell populations for thousands of genes simultaneously. Naturally, such an experiment requires computational and statistical analysis techniques. At the outset of the processing pipeline, the computational procedures are largely determined by the technology and experimental setup that are used. Subsequently, as more reliable intensity values for genes emerge, pattern discovery methods come into play. The most striking peculiarity of this kind of data is that one usually obtains measurements for thousands of genes for only a much smaller number of conditions. This is at the root of several of the statistical questions discussed here.

