Model-Based Clustering and Data Transformations for Gene Expression Data
, 2001
Cited by 88 (8 self)
Motivation: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications.
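As a sketch of what "reduced to model selection" can mean here: a finite Gaussian mixture with G components has the density below, and candidate models (different G, different covariance structures) can be compared with a penalized-likelihood score. The Bayesian Information Criterion (BIC) is shown as one common choice; the abstract above does not name the criterion, so treat BIC and the symbols G, \nu, and n as assumptions made for illustration.

```latex
% Finite Gaussian mixture density with G components
f(x) = \sum_{k=1}^{G} \pi_k \, \phi(x \mid \mu_k, \Sigma_k),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{G} \pi_k = 1

% BIC in the "larger is better" sign convention:
% \hat{L} = maximized likelihood, \nu = number of free parameters,
% n = number of observations
\mathrm{BIC} = 2 \log \hat{L} - \nu \log n
```

Under this convention, the number of clusters would be taken as the G whose fitted model gives the largest BIC.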
Rich Probabilistic Models for Gene Expression
, 2001
Cited by 59 (5 self)
Clustering is commonly used for analyzing gene expression data. Despite their successes, clustering methods suffer from a number of limitations. First, these methods reveal similarities that exist over all of the measurements, while obscuring relationships that exist over only a subset of the data. Second, clustering methods cannot readily incorporate additional types of information, such as clinical data or known attributes of genes. To circumvent these shortcomings, we propose the use of a single coherent probabilistic model that encompasses much of the rich structure in the genomic expression data, while incorporating additional information such as experiment type, putative binding sites, or functional information. We show how this model can be learned from the data, allowing us to discover patterns in the data and dependencies between the gene expression patterns and additional attributes. The learned model reveals context-specific relationships that exist only over a subset of the experiments in the dataset. We demonstrate the power of our approach on synthetic data and on two real-world gene expression data sets for yeast. For example, we demonstrate a novel functionality that falls naturally out of our framework: predicting the “cluster” of the array resulting from a gene mutation based only on the gene’s expression pattern in the context of other mutations.
Modeling Dependencies in Protein-DNA Binding Sites
, 2003
Cited by 54 (1 self)
The availability of whole genome sequences and high-throughput genomic assays opens the door for in silico analysis of transcription regulation. This includes methods for discovering and characterizing the binding sites of DNA-binding proteins, such as transcription factors. A common representation of transcription factor binding sites is a position specific score matrix (PSSM). This representation makes the strong assumption that binding site positions are independent of each other. In this work, we explore Bayesian network representations of binding sites that provide different tradeoffs between complexity (number of parameters) and the richness of dependencies between positions. We develop the formal machinery for learning such models from data and for estimating the statistical significance of putative binding sites. We then evaluate the ramifications of these richer representations in characterizing binding site motifs and predicting their genomic locations. We show that these richer representations improve over the PSSM model in both tasks.
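To make the independence assumption of a PSSM concrete, here is a minimal Python sketch, not the paper's method. It builds log-odds columns from a few aligned example sites and scores a candidate sequence as a plain sum of per-position terms, which is exactly what "positions are independent" amounts to. The function names (`build_pssm`, `score`, `best_hit`), the uniform background, the pseudocount of 1, and the example sites are all assumptions made for illustration.

```python
import math

BASES = "ACGT"


def build_pssm(sites, background=None, pseudocount=1.0):
    """Build a PSSM (list of per-position log2-odds dicts) from aligned sites."""
    if background is None:
        # Assumption: uniform background nucleotide frequencies.
        background = {b: 0.25 for b in BASES}
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for i in range(length):
        # Pseudocounts keep unseen bases from getting a score of -infinity.
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        pssm.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in BASES}
        )
    return pssm


def score(pssm, seq):
    """Score a sequence as a sum of independent per-position terms."""
    return sum(column[base] for column, base in zip(pssm, seq))


def best_hit(pssm, sequence):
    """Return (score, offset) of the best-scoring window in a longer sequence."""
    width = len(pssm)
    return max(
        (score(pssm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical aligned binding sites, used only to exercise the sketch.
example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
example_pssm = build_pssm(example_sites)
```

Because the score is a sum over columns, no term can express a preference such as "C at position 3 only when G is at position 4"; capturing that kind of dependency is what the richer Bayesian network representations in the abstract are meant to allow.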
From Promoter Sequence to Expression: A Probabilistic Framework
, 2002
Cited by 44 (5 self)
We present a probabilistic framework that models the process by which transcriptional binding explains the mRNA expression of different genes. Our joint probabilistic model unifies the two key components of this process: the prediction of gene regulation events from sequence motifs in the gene's promoter region, and the prediction of mRNA expression from combinations of gene regulation events in different settings. Our approach has several advantages. By learning promoter sequence motifs that are directly predictive of expression data, it can improve the identification of binding site patterns. It is also able to identify combinatorial regulation via interactions of different transcription factors. Finally, the general framework allows us to integrate additional data sources, including data from the recent binding localization assays. We demonstrate our approach on the cell cycle data of Spellman et al., combined with the binding localization information of Simon et al. We show that the learned model predicts expression from sequence, and that it identifies coherent co-regulated groups with significant transcription factor motifs. It also provides valuable biological insight into the domain via these co-regulated "modules" and the combinatorial regulation effects that govern their behavior.
A Simple Hyper-Geometric Approach for Discovering Putative Transcription Factor Binding Sites
- Algorithms in Bioinformatics: Proc. First International Workshop, number 2149 in LNCS
, 2001
Cited by 36 (6 self)
A central issue in molecular biology is understanding the regulatory mechanisms that control gene expression. The recent flood of genomic and post-genomic data opens the way for computational methods elucidating the key components that play a role in these mechanisms.
Handling Very Large Numbers of Association Rules in the Analysis of Microarray Data
- In Proceedings of SIGKDD’02
Cited by 17 (0 self)
The problem of analyzing microarray data has become one of the important topics in bioinformatics over the past several years, and different data mining techniques have been proposed for the analysis of such data. In this paper, we propose to use association rule discovery methods for determining associations among expression levels of different genes. One of the main problems related to the discovery of these associations is scalability. Microarrays usually contain very large numbers of genes, sometimes measured in the tens of thousands. Therefore, analysis of such data can generate a very large number of associations, often measured in the millions. The paper addresses this problem by presenting a method that enables biologists to evaluate these very large numbers of discovered association rules during the post-analysis stage of the data mining process. This is achieved by providing several rule evaluation operators, including rule grouping, filtering, browsing, and data inspection operators, that allow biologists to validate multiple individual gene regulation patterns at a time. By iteratively applying these operators, biologists can explore a significant part of all the initially generated rules in an acceptable period of time and thus answer the biological questions that are of particular interest to them. To validate our method, we tested our system on microarray data pertaining to studies of environmental hazards and their influence on gene expression processes. As a result, we managed to answer several questions that were of interest to the biologists who had collected this data.
A Mixed Factors Model for Dimension Reduction and Extraction of a Group Structure in Gene Expression Data
Cited by 6 (3 self)
When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during the density estimation process. To overcome this difficulty, we attempt a methodological extension of factor analysis. Our approach enables us not only to prevent overfitting, but also to handle the issues of clustering, data compression, and extraction of a set of genes relevant to explaining the group structure. Its potential usefulness is demonstrated with an application to the leukemia dataset.
Process Pathway Inference via Time Series Analysis
- Experimental Mechanics
, 2003
Cited by 6 (2 self)
Motivated by recent experimental developments in functional genomics, we construct and test a numerical technique for inferring process pathways, in which one process calls another process, from time series data. We validate using a case in which data are readily available and formulate an extension, appropriate for genetic regulatory networks, which exploits Bayesian inference and in which the present-day undersampling is compensated for by prior understanding of genetic regulation. Preprint number: NSF-ITP-02-47
S.: Independent subspaces of gene expression data
- In: Proc. IASTED Int’l Conf. Artificial Intelligence and Applications
, 2005
Cited by 3 (1 self)
Independent subspace analysis (ISA) is a linear model-based method which generalizes independent component analysis (ICA) by incorporating the invariant feature subspace into multidimensional ICA. In this paper we apply ISA to the problem of gene expression data analysis and show the useful behavior of the independent subspaces of gene expression data in the task of gene clustering and gene-gene interaction analysis. Key words: DNA chip data, gene clustering, gene-gene interaction analysis, independent component analysis, independent subspace analysis.

