Results 1-10 of 300
Biclustering algorithms for biological data analysis: a survey
 IEEE/ACM Transactions on Computational Biology and Bioinformatics
, 2004
Cited by 481 (15 self)
Abstract—A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results from the application of standard clustering methods to genes are limited. This limitation is imposed by the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the data matrix have been proposed. The goal is to find submatrices, that is, subgroups of genes and subgroups of conditions, where the genes exhibit highly correlated activities for every condition. In this paper, we refer to this class of algorithms as biclustering. Biclustering is also referred to in the literature as co-clustering and direct clustering, among other names, and has also been used in fields such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering, and classify them in accordance with the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, the approaches used to evaluate the solution, and the target applications. Index Terms—Biclustering, simultaneous clustering, co-clustering, subspace clustering, bidimensional clustering, direct clustering, block clustering, two-way clustering, two-mode clustering, two-sided clustering, microarray data analysis, biological data analysis, gene expression data.
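The abstract above describes a bicluster as a submatrix (a subset of genes and a subset of conditions) on which the genes are highly correlated. The following is a minimal sketch of that idea only, not any algorithm from the survey: it scores a given submatrix by the mean pairwise Pearson correlation of its rows. The function names, the scoring choice, and the toy `expression` matrix are all illustrative assumptions.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


def submatrix(data, rows, cols):
    """Select the given gene rows and condition columns."""
    return [[data[r][c] for c in cols] for r in rows]


def mean_row_correlation(data, rows, cols):
    """Hypothetical coherence score: mean pairwise row correlation on the submatrix."""
    sub = submatrix(data, rows, cols)
    return mean(pearson(a, b) for a, b in combinations(sub, 2))


# Toy data (made up): genes 0, 1 and 3 move together on conditions 0-2 only.
expression = [
    [1.0, 2.0, 3.0, 9.0],
    [2.0, 4.0, 6.0, 1.0],
    [3.0, 1.0, 2.0, 5.0],
    [0.5, 1.0, 1.5, 7.0],
]

bicluster_score = mean_row_correlation(expression, rows=[0, 1, 3], cols=[0, 1, 2])
all_conditions_score = mean_row_correlation(expression, rows=[0, 1, 3], cols=[0, 1, 2, 3])
```

On this toy matrix the same three genes score higher when restricted to the three coherent conditions than across all four, which is the limitation of one-dimensional clustering that the abstract points to.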
Approximating the cut-norm via Grothendieck’s inequality
 Proc. of the 36th ACM STOC
, 2004
Defining transcription modules using large-scale gene expression data
 Bioinformatics
, 2004
Cited by 102 (2 self)
Running title: Defining modules using large-scale expression data Motivation: Large-scale gene expression data comprising a variety of cellular conditions holds the promise of a global view on the transcription program. While conventional clustering algorithms have been successfully applied to smaller datasets, the utility of many algorithms for the analysis of large-scale data is limited by their inability to capture combinatorial and condition-specific co-regulation. In addition, there is an increasing need to integrate the rapidly accumulating body of other high-throughput biological data with the expression analysis. In a previous work, we introduced the Signature Algorithm, which overcomes the problems of conventional clustering and allows for intuitive integration of additional biological data. However, the applicability of this approach to global analyses is constrained by the comprehensiveness of relevant external data and by its inability to capture the hierarchical organization of the transcription network. Methods: We present a novel method for the analysis of large-scale expression data, which assigns genes into context-dependent and potentially overlapping regulatory units. We introduce
Identification of Protein Complexes by Comparative Analysis of Yeast and Bacterial Protein Interaction Data
 JOURNAL OF COMPUTATIONAL BIOLOGY
, 2004
Cited by 90 (11 self)
Mounting evidence shows that many protein complexes are conserved in evolution. Here we use conservation to find complexes that are common to the yeast S. cerevisiae and the bacterium H. pylori. Our analysis combines protein interaction data, which are available for each of the two species, and orthology information based on protein sequence comparison. We develop a detailed probabilistic model for protein complexes in a single species, and a model for the conservation of complexes between two species. Using these models, one can recast the question of finding conserved complexes as a problem of searching for heavy subgraphs in an edge- and node-weighted graph, whose nodes are orthologous protein pairs. We tested
Parallel stochastic gradient algorithms for large-scale matrix completion
 Mathematical Programming Computation
, 2013
Cited by 71 (7 self)
This paper develops Jellyfish, an algorithm for solving data-processing problems with matrix-valued decision variables regularized to have low rank. Particular examples of problems solvable by Jellyfish include matrix completion problems and least-squares problems regularized by the nuclear norm or γ2-norm. Jellyfish implements a projected incremental gradient method with a biased, random ordering of the increments. This biased ordering allows for a parallel implementation that admits a speedup nearly proportional to the number of processors. On large-scale matrix completion tasks, Jellyfish is orders of magnitude more efficient than existing codes. For example, on the Netflix Prize data set, prior art computes rating predictions in approximately 4 hours, while Jellyfish solves the same problem in under 3 minutes on a 12-core workstation.
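To make "projected incremental gradient method" concrete, here is a small serial sketch for matrix completion. It is not the Jellyfish code: it uses a plain uniform shuffle rather than the biased ordering the abstract mentions, runs on one core, and enforces a low-complexity constraint by rescaling each factor row onto a Euclidean ball, which is an assumed stand-in for the paper's regularizers. All names, the step size, and the toy data are hypothetical.

```python
import random


def project_row(row, bound):
    """Scale a factor row back onto the Euclidean ball of radius `bound`."""
    norm = sum(v * v for v in row) ** 0.5
    if norm > bound:
        scale = bound / norm
        return [v * scale for v in row]
    return row


def complete(entries, n_rows, n_cols, rank=2, step=0.05, bound=5.0, epochs=500, seed=0):
    """Fit X ~ L R^T to observed (i, j, value) triples, one entry at a time."""
    rng = random.Random(seed)
    left = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    right = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    order = list(entries)
    for _ in range(epochs):
        rng.shuffle(order)  # uniform shuffle; the paper's biased ordering is not reproduced
        for i, j, value in order:
            li, rj = left[i], right[j]
            err = sum(a * b for a, b in zip(li, rj)) - value
            # Gradient step on the squared error of this single entry.
            new_li = [a - step * err * b for a, b in zip(li, rj)]
            new_rj = [b - step * err * a for a, b in zip(li, rj)]
            # Projection step keeps every factor row inside the norm ball.
            left[i] = project_row(new_li, bound)
            right[j] = project_row(new_rj, bound)
    return left, right


# Toy problem (made up): five observed entries of the rank-1 matrix [1,2,3]^T [1,2].
observed = [(0, 0, 1.0), (0, 1, 2.0), (1, 0, 2.0), (1, 1, 4.0), (2, 0, 3.0)]
left, right = complete(observed, n_rows=3, n_cols=2)

squared_errors = [
    (sum(a * b for a, b in zip(left[i], right[j])) - value) ** 2
    for i, j, value in observed
]
train_rmse = (sum(squared_errors) / len(squared_errors)) ** 0.5
```

Each update touches only row `i` of the left factor and row `j` of the right factor, so updates on entries that share no row or column are independent of each other.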
Nonsmooth nonnegative matrix factorization (nsNMF)
 IEEE transactions on
, 2006
Cited by 64 (4 self)
Abstract—We propose a novel nonnegative matrix factorization model that aims at finding localized, part-based representations of nonnegative multivariate data items. Unlike the classical nonnegative matrix factorization (NMF) technique, this new model, denoted “nonsmooth nonnegative matrix factorization” (nsNMF), corresponds to the optimization of an unambiguous cost function designed to explicitly represent sparseness, in the form of nonsmoothness, which is controlled by a single parameter. In general, this method produces a set of basis and encoding vectors that are not only capable of representing the original data, but also extract highly localized patterns, which generally lend themselves to improved interpretability. The properties of this new method are illustrated with several data sets. Comparisons to previously published methods show that the new nsNMF method has some advantages in keeping faithfulness to the data while achieving a high degree of sparseness for both the estimated basis and the encoding vectors, and in better interpretability of the factors. Index Terms—nonnegative matrix factorization, constrained optimization, data mining, mining methods and algorithms, pattern analysis, feature extraction or construction, sparse, structured, and very large systems.
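The abstract does not spell out how the single parameter controls nonsmoothness. As the nsNMF model is usually stated, the factorization is V ≈ W S H with a smoothing matrix S = (1 − θ) I + (θ/q) 1 1ᵀ, where q is the number of factors and θ lies in [0, 1]; that formula is taken from the general literature, not from the text above. The sketch below only builds S and applies it to a vector; the function names are hypothetical.

```python
def smoothing_matrix(q, theta):
    """Build S = (1 - theta) * I + (theta / q) * ones(q, q) as a list of rows."""
    off_diagonal = theta / q
    return [
        [(1.0 - theta) + off_diagonal if i == j else off_diagonal for j in range(q)]
        for i in range(q)
    ]


def apply(matrix, vector):
    """Matrix-vector product for plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


# A maximally sparse encoding vector with total mass 3 (made-up example).
sparse_vector = [3.0, 0.0, 0.0]

unchanged = apply(smoothing_matrix(3, 0.0), sparse_vector)   # theta = 0: S is the identity
half_smooth = apply(smoothing_matrix(3, 0.5), sparse_vector)  # partial smoothing
fully_smooth = apply(smoothing_matrix(3, 1.0), sparse_vector)  # theta = 1: every entry is the mean
```

Because S smooths whatever it multiplies, W and H have to become sparser to reproduce the data as θ grows, which is how one parameter trades smoothness in S against sparseness in both factors.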
Biclustering microarray data by Gibbs sampling
 Bioinformatics
, 2003
Cited by 55 (3 self)
Motivation: Gibbs sampling has become a method of choice for the discovery of noisy patterns, known as motifs, in DNA and protein sequences. Because handling noise in microarray data presents similar challenges, we have adapted this strategy to the biclustering of discretized microarray data.
TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data
 In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international
Cited by 36 (2 self)
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene×sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in the real microarray datasets.
Techniques for clustering gene expression data
 COMPUT BIOL MED
, 2007
Cited by 34 (3 self)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications that recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
FABIA: factor analysis for bicluster acquisition
 Bioinformatics
, 2010
Cited by 33 (0 self)
Motivation: Biclustering of transcriptomic data groups genes and samples simultaneously. It is emerging as a standard tool for extracting knowledge from gene expression measurements. We propose a novel generative approach for biclustering called “FABIA: Factor Analysis for Bicluster Acquisition”. FABIA is based on a multiplicative model, which accounts for linear dependencies between gene expression and conditions, and also captures heavy-tailed distributions as observed in real-world transcriptomic data. The generative framework allows the use of well-founded model selection methods and the application of Bayesian techniques. Results: On 100 simulated data sets with known true, artificially implanted biclusters, FABIA clearly outperformed all 11 competitors. On these data sets, FABIA was able to separate spurious biclusters from true biclusters by ranking biclusters according to their information content. FABIA was tested on three microarray data sets with known subclusters, where it was the best method on two and the second-best on one among the compared biclustering approaches. Availability: FABIA is available as an R package on Bioconductor
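A multiplicative bicluster model of the kind the abstract mentions can be pictured as a sum of outer products: each bicluster contributes a sparse gene vector times a sparse sample vector, and the nonzero entries of the two vectors name its members. The sketch below only generates noise-free data from such a model and reads the members back from known factors. It does not perform FABIA's inference, and it is written in Python for illustration rather than using the R package. All names and numbers are made up.

```python
def outer(u, v):
    """Outer product of two vectors as a list of rows."""
    return [[a * b for b in v] for a in u]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def members(vector, threshold=0.0):
    """Indices whose absolute value exceeds the threshold."""
    return [i for i, v in enumerate(vector) if abs(v) > threshold]


# Two hypothetical biclusters over 4 genes and 3 samples.
gene_loadings = [[2.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 3.0]]
sample_factors = [[1.0, -1.0, 0.0], [0.0, 0.0, 2.0]]

data = [[0.0] * 3 for _ in range(4)]
for loading, factor in zip(gene_loadings, sample_factors):
    data = add(data, outer(loading, factor))

biclusters = [
    (members(loading), members(factor))
    for loading, factor in zip(gene_loadings, sample_factors)
]
```

Within a bicluster, every member gene's profile is a scaled copy of the same sample factor, which is the sense in which the model is multiplicative and captures linear dependencies between genes and conditions.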