Model-Based Clustering and Data Transformations for Gene Expression Data
2001
Cited by 88 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
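The mixture formulation described above can be written out as a short sketch. The notation ($G$ components, mixing weights $\pi_k$, $m_M$ free parameters, $n$ observations) is introduced here for illustration. The abstract does not name a model selection criterion; the Bayesian Information Criterion (BIC) shown below is one standard choice and is an assumption, not necessarily the authors' method.

```latex
% Finite Gaussian mixture density for an observation x_i
p(x_i \mid \theta) = \sum_{k=1}^{G} \pi_k \, \phi(x_i \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

% BIC for a candidate model M (a covariance structure plus a value of G),
% with maximized likelihood L_M, m_M free parameters and n observations.
% Under this sign convention a larger value indicates a better model.
\mathrm{BIC}_M = 2 \log L_M(\hat{\theta}_M) - m_M \log n
```

Comparing this score across candidate covariance structures and values of $G$ is how the choice of clustering method and of the number of clusters becomes a single model selection problem.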
Rich Probabilistic Models for Gene Expression
2001
Cited by 59 (5 self)
Clustering is commonly used for analyzing gene expression data. Despite their successes, clustering methods suffer from a number of limitations. First, these methods reveal similarities that exist over all of the measurements, while obscuring relationships that exist over only a subset of the data. Second, clustering methods cannot readily incorporate additional types of information, such as clinical data or known attributes of genes. To circumvent these shortcomings, we propose the use of a single coherent probabilistic model that encompasses much of the rich structure in the genomic expression data, while incorporating additional information such as experiment type, putative binding sites, or functional information. We show how this model can be learned from the data, allowing us to discover patterns in the data and dependencies between the gene expression patterns and additional attributes. The learned model reveals context-specific relationships that exist only over a subset of the experiments in the dataset. We demonstrate the power of our approach on synthetic data and on two real-world gene expression data sets for yeast. For example, we demonstrate a novel functionality that falls naturally out of our framework: predicting the “cluster” of the array resulting from a gene mutation based only on the gene’s expression pattern in the context of other mutations.
From Promoter Sequence to Expression: A Probabilistic Framework
2002
Cited by 44 (5 self)
We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's promoter region, and the prediction of mRNA expression from combinations of gene regulation events in different settings. Our approach has several advantages. By learning promoter sequence motifs that are directly predictive of expression data, it can improve the identification of binding site patterns. It is also able to identify combinatorial regulation via interactions of different transcription factors. Finally, the general framework allows us to integrate additional data sources, including data from the recent binding localization assays. We demonstrate our approach on the cell cycle data of Spellman et al., combined with the binding localization information of Simon et al. We show that the learned model predicts expression from sequence, and that it identifies coherent co-regulated groups with significant transcription factor motifs. It also provides valuable biological insight into the domain via these co-regulated "modules" and the combinatorial regulation effects that govern their behavior.
Context-Specific Bayesian Clustering for Gene Expression Data
2002
Cited by 41 (5 self)
The recent growth in genomic data and measurements of genome-wide expression patterns allows us to apply computational tools to examine gene regulation by transcription factors.
Computational Analysis of Microarray Gene Expression Profiles: Clustering, Classification, and Beyond
2002
Cited by 9 (1 self)
Gene array studies can assess the global expression patterns of thousands of genes under multiple conditions. This technology can provide important insights about the underlying genetic causes of many important biological questions, and can change our understanding of diseases, ultimately allowing the development of novel chemical entities as potential drug candidates. The informatics analysis and integration of gene expression patterns are critical for interpreting gene array studies. In this paper, we discuss the computational analysis of three important tasks: (1) the identification of differentially expressed genes, (2) the discovery of gene clusters, and (3) the classification of biological samples. In addition, we discuss how gene sequence and chemical structures can be profitably combined with microarray studies. Detailed examples are given throughout. Programs written in the open-source R language for achieving each of these tasks are freely available at gila.engr.uic.edu/genex. © 2002 Elsevier Science B.V. All rights reserved.
A Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene Expression Data
Cited by 6 (3 self)
When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during the density estimation process. To overcome this difficulty, we attempt a methodological extension of factor analysis. Our approach enables us not only to prevent overfitting, but also to handle the issues of clustering, data compression, and extracting a set of genes relevant to explaining the group structure. The potential usefulness is demonstrated with an application to the leukemia dataset.
Combining Sequence and Time Series Expression Data to Learn Transcriptional Modules
Cited by 3 (0 self)
Our goal is to cluster genes into transcriptional modules: sets of genes where similarity in expression is explained by common regulatory mechanisms at the transcriptional level. We want to learn modules from both time series gene expression data and genome-wide motif data that are now readily available for organisms such as S. cerevisiae as a result of prior computational studies or experimental results. We present
Computational identification and analysis of eukaryotic promoters: new algorithms on the traces of gene regulation
2000
Cited by 1 (0 self)
With the information of the complete DNA sequence of several higher eukaryotes as well as expression patterns of thousands of genes under a variety of conditions in our hands, we now have the possibility to computationally identify and analyze the parts of a genome believed to be largely responsible for transcription control: the promoters. This article gives a short overview of the state-of-the-art techniques for promoter localization and analysis, and comments on the most recent advances in the field. Understanding gene regulation is one of the most exciting topics within molecular genetics. To learn how the interplay among thousands of genes leads to the existence of a complex eukaryotic organism is one of the great challenges, and the availability of large amounts of information gained in the sequencing and gene expression projects both demands and enables us to use computers to solve this task. A key role in gene regulation is played by promoter sequences. We define a promoter here as the region proximal to the transcription start site (TSS) of protein-encoding genes, those transcribed by RNA polymerase II, and leave aside distal regions such as enhancers. We want to outline the recent developments within two areas of bioinformatics that deal with promoters: the general recognition of eukaryotic promoters, and the analysis of these regions to identify the regulatory elements hidden in them. This is the first step on the way to complex models of regulatory networks. We focus on the computational point of view, pointing out some classic and many recent publications, and leave a more elaborate description, especially of the underlying biology, to the cited papers and reviews.
Model-based clustering with Hidden Markov Models and its application to financial time-series data
2002
Cited by 1 (0 self)
We have developed a method to partition a set of data into clusters by use of Hidden Markov Models. Given a number of clusters, each of which is represented by one Hidden Markov Model, an iterative procedure finds the combination of cluster models and an assignment of data points to cluster models which maximizes the joint likelihood of the clustering.
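As a rough illustration of the assignment half of such an iterative procedure, the sketch below scores each sequence under every cluster's HMM with the scaled forward algorithm and assigns it to the highest-likelihood model. It is a minimal sketch under stated assumptions, not the authors' implementation: the discrete two-symbol alphabet, the two toy models, and the names `log_likelihood` and `assign_to_models` are all hypothetical, and the re-estimation of each model from its assigned sequences (for example with Baum-Welch) is omitted.

```python
import math


def log_likelihood(seq, start, trans, emit):
    """Scaled forward algorithm: log P(seq | HMM) for a discrete-output HMM."""
    n_states = len(start)
    # alpha[i] = P(first symbol observed, hidden state is i)
    alpha = [start[i] * emit[i][seq[0]] for i in range(n_states)]
    log_prob = 0.0
    for t in range(len(seq)):
        if t > 0:
            # Propagate one step through the transition matrix, then emit seq[t].
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][seq[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # sequence is impossible under this model
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]  # rescale to avoid underflow
    return log_prob


def assign_to_models(sequences, models):
    """Assign each sequence to the model that gives it the highest likelihood."""
    labels = []
    joint = 0.0
    for seq in sequences:
        scores = [log_likelihood(seq, *model) for model in models]
        best = max(range(len(models)), key=scores.__getitem__)
        labels.append(best)
        joint += scores[best]  # joint log-likelihood of the clustering
    return labels, joint


# Hypothetical toy setup: symbol 1 = "up" move, symbol 0 = "down" move.
start = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]
model_up = (start, trans, [[0.2, 0.8], [0.3, 0.7]])    # emits mostly "up"
model_down = (start, trans, [[0.8, 0.2], [0.7, 0.3]])  # emits mostly "down"

sequences = [[1, 1, 1, 0, 1, 1], [0, 0, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1]]
labels, joint = assign_to_models(sequences, [model_up, model_down])
print(labels, joint)
```

In a full procedure, this assignment step would alternate with refitting each model on its assigned sequences until the joint likelihood stops improving.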
Inferring Regulatory Networks from Multiple Sources of Genomic Data
2004
This thesis addresses the problems of modeling the gene regulatory system from multiple sources of large-scale datasets. In the first part, we develop a computational framework of building and validating simple, mechanistic models of gene regulation from multiple sources of data. These models, which we call physical network models, annotate the network of molecular interactions with several types of attributes (variables). We associate model attributes with physical interaction and knock-out gene expression data according to the confidence measures of data and the hypothesis that gene regulation is achieved via molecular interaction cascades. By applying standard model inference algorithms, we are able to obtain the configurations of model attributes which optimally fit the data. Because existing datasets do not provide sufficient constraints to the models, there are many optimal configurations which fit the data equally well. In the second part, we develop an information theoretic score to measure the expected capacity of new knock-out experiments in terms of reducing the model uncertainty. We collaborate with biologists to perform suggested knockout
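The idea of scoring a candidate experiment by how much it is expected to reduce model uncertainty can be illustrated with a generic expected entropy reduction. The abstract does not give the thesis's actual score, so this is only a sketch under simplifying assumptions: a finite set of equally well-fitting configurations, a deterministic predicted outcome per configuration, and hypothetical experiment names (`ko_A`, `ko_B`, `ko_C`).

```python
import math


def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(predictions, weights=None):
    """Expected drop in entropy over configurations after one experiment.

    predictions[c] is the outcome that configuration c predicts for the
    experiment; weights[c] is the prior probability of configuration c
    (uniform if omitted, i.e. all configurations fit the data equally well).
    """
    n = len(predictions)
    if weights is None:
        weights = [1.0 / n] * n
    prior_entropy = entropy(weights)

    # Group configurations by the outcome they predict.
    groups = {}
    for outcome, w in zip(predictions, weights):
        groups.setdefault(outcome, []).append(w)

    # Average the posterior entropy over the possible outcomes.
    expected_posterior = 0.0
    for ws in groups.values():
        p_outcome = sum(ws)
        posterior = [w / p_outcome for w in ws]
        expected_posterior += p_outcome * entropy(posterior)
    return prior_entropy - expected_posterior


# Four equally plausible configurations and three hypothetical knock-outs.
candidate_experiments = {
    "ko_A": ["up", "up", "down", "down"],    # splits configurations in half
    "ko_B": ["up", "up", "up", "up"],        # all agree: uninformative
    "ko_C": ["up", "down", "none", "flip"],  # every configuration differs
}
gains = {name: expected_information_gain(p) for name, p in candidate_experiments.items()}
best = max(gains, key=gains.get)
print(gains, best)
```

An experiment on which all remaining configurations agree scores zero, while one whose predicted outcomes differ across configurations scores highest, which matches the intuition of preferring experiments that discriminate between otherwise indistinguishable models.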

