Results 1 - 10
of
76
Biclustering algorithms for biological data analysis: a survey
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
, 2004
"... Abstract—A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of ..."
Abstract
-
Cited by 184 (7 self)
- Add to MetaCart
Abstract—A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix has been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred in the literature as coclustering and direct clustering, among others names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications. Index Terms—Biclustering, simultaneous clustering, coclustering, subspace clustering, bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode clustering, two-sided clustering, microarray data analysis, biological data analysis, gene expression data. 1
Discovering Statistically Significant Biclusters in Gene Expression Data
- In Proceedings of ISMB 2002
, 2002
"... In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph theoretic coupled with statistical modelling of the data. Under p ..."
Abstract
-
Cited by 137 (4 self)
- Add to MetaCart
In gene expression data, a bicluster is a subset of the genes exhibiting consistent patterns over a subset of the conditions. We propose a new method to detect significant biclusters in large expression datasets. Our approach is graph theoretic coupled with statistical modelling of the data. Under plausible assumptions, our algorithm is polynomial and is guaranteed to find the most significant biclusters. We tested our method on a collection of yeast expression profiles and on a human cancer dataset. Cross validation results show high specificity in assigning function to genes based on their biclusters, and we are able to annotate in this way 196 uncharacterized yeast genes. We also demonstrate how the biclusters lead to detecting new concrete biological associations. In cancer data we are able to detect and relate finer tissue types than was previously possible. We also show that the method outperforms the biclustering algorithm of Cheng and Church (2000).
Spectral biclustering of microarray data: Coclustering genes and conditions
- Genome Research
, 2003
"... service ..."
Algorithmic Approaches to Clustering Gene Expression Data
- Current Topics in Computational Biology
, 2001
"... Technologies for generating high-density arrays of cDNAs and oligonucleotides are developing rapidly, and changing the landscape of biological and biomedical research. They enable, for the first time, a global, simultaneous view on the transcription levels of many thousands of genes, when the cell u ..."
Abstract
-
Cited by 53 (2 self)
- Add to MetaCart
Technologies for generating high-density arrays of cDNAs and oligonucleotides are developing rapidly, and changing the landscape of biological and biomedical research. They enable, for the first time, a global, simultaneous view on the transcription levels of many thousands of genes, when the cell undergoes specific conditions or processes. For several organisms that had their genomes completely sequenced, the full set of genes can already be monitored this way today. The potential of such technologies is tremendous: The information obtained by monitoring gene expression levels in different developmental stages, tissue types, clinical conditions and di erent organisms can help understanding gene function and gene networks, and assist in the diagnostic of disease conditions and of effects of medical treatments. Undoubtedly, other applications will emerge in coming years. A key step in the analysis of gene expression data is the identification of groups of genes that manifest...
Defining transcription modules using large-scale gene expression data
- Bioinformatics
, 2004
"... Running title: Defining modules using large-scale expression data Motivation: Large-scale gene expression data comprising a variety of cellular conditions holds the promise of a global view on the transcription program. While conventional clustering algorithms have been successfully applied to small ..."
Abstract
-
Cited by 49 (1 self)
- Add to MetaCart
Running title: Defining modules using large-scale expression data Motivation: Large-scale gene expression data comprising a variety of cellular conditions holds the promise of a global view on the transcription program. While conventional clustering algorithms have been successfully applied to smaller datasets, the utility of many algorithms for the analysis of large-scale data is limited by their inability to capture combinatorial and conditionspecific co-regulation. In addition, there is an increasing need to integrate the rapidly accumulating body of other high-throughput biological data with the expression analysis. In a previous work, we introduced the Signature Algorithm, which overcomes the problems of conventional clustering and allows for intuitive integration of additional biological data. However, the applicability of this approach to global analyses is constrained by the comprehensiveness of relevant external data and by its lacking capability of capturing hierarchical organization of the transcription network. Methods: We present a novel method for the analysis of large-scale expression data, which assigns genes into context-dependent and potentially overlapping regulatory units. We introduce
Cluster Analysis for Gene Expression Data: A Survey
- IEEE Transactions on Knowledge and Data Engineering
, 2004
"... Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity f ..."
Abstract
-
Cited by 48 (3 self)
- Add to MetaCart
Abstract—DNA microarray technology has now made it possible to simultaneously monitor the expression levels of thousands of genes during important biological processes and across collections of related samples. Elucidating the patterns hidden in gene expression data offers a tremendous opportunity for an enhanced understanding of functional genomics. However, the large number of genes and the complexity of biological networks greatly increases the challenges of comprehending and interpreting the resulting mass of data, which often consists of millions of measurements. A first step toward addressing this challenge is the use of clustering techniques, which is essential in the data mining process to reveal natural structures and identify interesting patterns in the underlying data. Cluster analysis seeks to partition a given data set into groups based on specified features so that the data points within a group are more similar to each other than the points in different groups. A very rich literature on cluster analysis has developed over the past three decades. Many conventional clustering algorithms have been adapted or directly applied to gene expression data, and also new algorithms have recently been proposed specifically aiming at gene expression data. These clustering algorithms have been proven useful for identifying biologically relevant groups of genes and samples. In this paper, we first briefly introduce the concepts of microarray technology and discuss the basic elements of clustering on gene expression data. In particular, we divide cluster analysis for gene expression data into three categories. Then, we present specific challenges pertinent to each clustering category and introduce several representative approaches. We also discuss the problem of cluster validation in three aspects and review various methods to assess the quality and reliability of clustering results. Finally, we conclude this paper and suggest the promising trends in this field. Index Terms—Microarray technology, gene expression data, clustering.
Multi-way distributional clustering via pairwise interactions
- In ICML
, 2005
"... We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximi ..."
Abstract
-
Cited by 47 (8 self)
- Add to MetaCart
We present a novel unsupervised learning scheme that simultaneously clusters variables of several types (e.g., documents, words and authors) based on pairwise interactions between the types, as observed in co-occurrence data. In this scheme, multiple clustering systems are generated aiming at maximizing an objective function that measures multiple pairwise mutual information between cluster variables. To implement this idea, we propose an algorithm that interleaves top-down clustering of some variables and bottom-up clustering of the other variables, with a local optimization correction routine. Focusing on document clustering we present an extensive empirical study of two-way, three-way and four-way applications of our scheme using six real-world datasets including the 20 Newsgroups (20NG) and the Enron email collection. Our multi-way distributional clustering (MDC) algorithms consistently and significantly outperform previous state-of-the-art information theoretic clustering algorithms. 1.
Functional modules by relating protein interaction networks and gene expression
- Nucleic Acids Res
, 2003
"... gene expression ..."
Variable Selection for Model-Based Clustering
- Journal of the American Statistical Association
, 2006
"... We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in m ..."
Abstract
-
Cited by 27 (4 self)
- Add to MetaCart
We consider the problem of variable or feature selection for model-based clustering. We recast the problem of comparing two nested subsets of variables as a model comparison problem, and address it using approximate Bayes factors. We develop a greedy search algorithm for finding a local optimum in model space. The resulting method selects variables (or features), the number of clusters, and the clustering model simultaneously. We applied the method to several simulated and real examples, and found that removing irrelevant variables often improved performance. Compared to methods based on all the variables, our variable selection method consistently yielded more accurate estimates of the number of clusters, and lower classification error rates, as well as more parsimonious clustering models and easier visualization of results.
Biclustering microarray data by Gibbs sampling
- Bioinformatics
, 2003
"... Motivation: Gibbs sampling has become a method of choice for the discovery of noisy patterns, known as motifs, in DNA and protein sequences. Because handling noise in microarray data presents similar challenges, we have adapted this strategy to the biclustering of discretized microarray data. ..."
Abstract
-
Cited by 26 (2 self)
- Add to MetaCart
Motivation: Gibbs sampling has become a method of choice for the discovery of noisy patterns, known as motifs, in DNA and protein sequences. Because handling noise in microarray data presents similar challenges, we have adapted this strategy to the biclustering of discretized microarray data.

