Results 1-10 of 59
Biclustering algorithms for biological data analysis: a survey.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2004
Cited by 481 (15 self)
A large number of clustering approaches have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the results of applying standard clustering methods to genes are limited. This limitation stems from the existence of a number of experimental conditions where the activity of genes is uncorrelated. A similar limitation exists when clustering of conditions is performed. For this reason, a number of algorithms that perform simultaneous clustering on the row and column dimensions of the gene expression matrix have been proposed to date. This simultaneous clustering, usually designated as biclustering, seeks to find submatrices, that is, subgroups of genes and subgroups of columns, where the genes exhibit highly correlated activities for every condition. This type of algorithm has also been proposed and used in other fields, such as information retrieval and data mining. In this comprehensive survey, we analyze a large number of existing approaches to biclustering and classify them according to the type of biclusters they can find, the patterns of biclusters that are discovered, the methods used to perform the search, and the target applications.
A systematic comparison and evaluation of biclustering methods for gene expression data
2006
TRICLUSTER: an effective algorithm for mining coherent clusters in 3D microarray data
In SIGMOD ’05: Proceedings of the 2005 ACM SIGMOD international
Cited by 38 (2 self)
In this paper we introduce a novel algorithm called triCluster, for mining coherent clusters in three-dimensional (3D) gene expression datasets. triCluster can mine arbitrarily positioned and overlapping clusters, and depending on different parameter values, it can mine different types of clusters, including those with constant or similar values along each dimension, as well as scaling and shifting expression patterns. triCluster relies on a graph-based approach to mine all valid clusters. For each time slice, i.e., a gene×sample matrix, it constructs the range multigraph, a compact representation of all similar value ranges between any two sample columns. It then searches for constrained maximal cliques in this multigraph to yield the set of biclusters for this time slice. Then triCluster constructs another graph using the biclusters (as vertices) from each time slice; mining cliques from this graph yields the final set of triclusters. Optionally, triCluster merges/deletes some clusters having large overlaps. We present a useful set of metrics to evaluate the clustering quality, and we show that triCluster can find significant triclusters in the real microarray datasets.
Biclustering in gene expression data by tendency
In Proceedings of the Computational Systems Bioinformatics, 2004
Cited by 29 (1 self)
The advent of DNA microarray technologies has revolutionized the experimental study of gene expression. Clustering is the most popular approach to analyzing gene expression data and has indeed proven to be successful in many applications. Our work focuses on discovering a subset of genes which exhibit similar expression patterns along a subset of conditions in the gene expression matrix. Specifically, we are looking for Order Preserving clusters (OPClusters), in each of which a subset of genes induces a similar linear ordering along a subset of conditions. The pioneering work of the OPSM model [3], which enforces the strict order shared by the genes in a cluster, is included in our model as a special case. Our model is more robust than OPSM because similarly expressed conditions are allowed to form order-equivalent groups and no restriction is placed on the order within a group. Guided by our model, we design and implement a deterministic algorithm, namely OPCTree, to discover OPClusters. An experimental study on two real datasets demonstrates the effectiveness of the algorithm in the application of tissue classification and cell cycle identification. In addition, a large percentage of OPClusters exhibit significant enrichment of one or more function categories, which implies that OPClusters indeed carry significant biological relevance.
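The strict-order special case mentioned in this abstract (the OPSM-style requirement that all genes in a cluster share one ordering of the conditions) can be illustrated with a minimal sketch. This is not the OPCTree algorithm; the function names `condition_order` and `is_order_preserving` are hypothetical, and ties are broken by the position of the condition in the input list, which is an assumption made only for this example.

```python
from typing import Sequence, Tuple


def condition_order(profile: Sequence[float], conditions: Sequence[int]) -> Tuple[int, ...]:
    """Return the given conditions sorted by increasing expression value.

    sorted() is stable, so tied values keep their input order (an assumption
    of this sketch, not part of any published model).
    """
    return tuple(sorted(conditions, key=lambda c: profile[c]))


def is_order_preserving(
    matrix: Sequence[Sequence[float]],
    genes: Sequence[int],
    conditions: Sequence[int],
) -> bool:
    """True if every listed gene induces the same ordering of the conditions."""
    orders = {condition_order(matrix[g], conditions) for g in genes}
    return len(orders) == 1


# Toy expression matrix: rows are genes, columns are conditions.
expression = [
    [1.0, 5.0, 3.0],
    [10.0, 50.0, 20.0],
    [3.0, 2.0, 1.0],
]

print(is_order_preserving(expression, [0, 1], [0, 1, 2]))
print(is_order_preserving(expression, [0, 1, 2], [0, 1, 2]))
```

Genes 0 and 1 differ greatly in magnitude but rank the three conditions identically, which is the kind of tendency-based similarity the abstract describes. Relaxing the check so that similarly expressed conditions form order-equivalent groups would require a grouping threshold that the abstract does not specify.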
Mining shifting-and-scaling co-regulation patterns on . . .
Cited by 20 (5 self)
In this paper, we propose a new model for coherent clustering of gene expression data called reg-cluster. The proposed model allows (1) the expression profiles of genes in a cluster to follow any shifting-and-scaling patterns in subspace, where the scaling can be either positive or negative, and (2) the expression value changes across any two conditions of the cluster to be significant. No previous work measures up to the task that we have set: the density-based subspace clustering algorithms require genes to have similar expression levels to each other in subspace; the pattern-based biclustering algorithms only allow pure shifting or pure scaling patterns; and the tendency-based biclustering algorithms have no coherence guarantees. We also develop a novel pattern-based biclustering algorithm for identifying shifting-and-scaling co-regulation patterns, satisfying both the coherence constraint and the regulation constraint. Our experimental results show that the reg-cluster algorithm is able to detect a significant number of clusters missed by previous models, and these clusters are potentially of high biological significance.
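To make the shifting-and-scaling idea concrete, the sketch below tests whether one expression profile is an affine image of another over the same conditions, i.e. other = scale * base + shift, with the scale allowed to be positive or negative. It is an exact-match illustration under a numerical tolerance only; the function name `fit_shift_scale` and the tolerance are assumptions, and the coherence and regulation thresholds of the actual model are not given in the abstract and are not reproduced here.

```python
from typing import Optional, Sequence, Tuple


def fit_shift_scale(
    base: Sequence[float],
    other: Sequence[float],
    tol: float = 1e-9,
) -> Optional[Tuple[float, float]]:
    """Return (scale, shift) such that other[k] == scale * base[k] + shift
    for every condition k (within tol), or None if no such pair exists.

    Hypothetical helper for illustration, not the published algorithm.
    """
    if len(base) != len(other) or len(base) < 2:
        return None
    # Find a second condition where the base profile actually changes,
    # so the scale can be estimated from two points.
    pivot = next((k for k in range(1, len(base)) if abs(base[k] - base[0]) > tol), None)
    if pivot is None:
        return None
    scale = (other[pivot] - other[0]) / (base[pivot] - base[0])
    shift = other[0] - scale * base[0]
    if abs(scale) <= tol:
        return None  # a flat profile carries no scaling relationship
    matches = all(abs(scale * b + shift - o) <= tol for b, o in zip(base, other))
    return (scale, shift) if matches else None


gene_a = [1.0, 2.0, 4.0]
gene_b = [-1.0, -3.0, -7.0]   # equals -2 * gene_a + 1 (negative scaling)
gene_c = [1.0, 3.0, 4.0]      # not an affine image of gene_a

print(fit_shift_scale(gene_a, gene_b))
print(fit_shift_scale(gene_a, gene_c))
```

A pure-shifting model would accept only scale = 1 and a pure-scaling model only shift = 0, which is why `gene_b` would be missed by either one alone.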
Biclustering of Expression Data with Evolutionary Computation
IEEE Transactions on Knowledge & Data Engineering, 2006
Cited by 18 (1 self)
Microarray techniques are leading to the development of sophisticated algorithms capable of extracting novel and useful knowledge from a biomedical point of view. In this work, we address the biclustering of gene expression data with evolutionary computation. Our approach is based on evolutionary algorithms, which have been proven to have excellent performance on complex problems, and searches for biclusters following a sequential covering strategy. The goal is to find biclusters of maximum size with mean squared residue lower than a given δ. In addition, we pay special attention to finding high-quality biclusters with large variation, i.e., with a relatively high row variance, and with a low level of overlap among biclusters. The quality of the biclusters found by our evolutionary approach is discussed and the results are compared to those reported by Cheng and Church, and Yang et al. In general, our approach, named SEBI, shows excellent performance at finding patterns in gene expression data.
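The mean squared residue used as the quality threshold here is the score introduced by Cheng and Church: for a submatrix, each entry's residue is the entry minus its row mean, minus its column mean, plus the overall mean, and the score is the average squared residue. A minimal sketch of that score follows; the function name and the toy data are illustrative assumptions, and this is not the SEBI implementation.

```python
from typing import Sequence


def mean_squared_residue(
    matrix: Sequence[Sequence[float]],
    rows: Sequence[int],
    cols: Sequence[int],
) -> float:
    """Mean squared residue of the bicluster given by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall
            total += residue * residue
    return total / (n_rows * n_cols)


# Rows 0-2 follow a purely additive (shifting) pattern; row 3 does not.
data = [
    [1.0, 2.0, 3.0],
    [2.0, 3.0, 4.0],
    [4.0, 5.0, 6.0],
    [3.0, 1.0, 9.0],
]

print(mean_squared_residue(data, [0, 1, 2], [0, 1, 2]))
print(mean_squared_residue(data, [0, 1, 2, 3], [0, 1, 2]))
```

A perfectly additive bicluster scores (numerically close to) zero, so a candidate is accepted when its score stays below the chosen δ. Constant biclusters also score zero, which is why the abstract additionally asks for a relatively high row variance.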
A framework for ontology-driven subspace clustering
In Proceedings of the tenth ACM SIGKDD international
Cited by 15 (1 self)
Traditional clustering is a descriptive task that seeks to identify homogeneous groups of objects based on the values of their attributes. While domain knowledge is always the best way to justify clustering, few clustering algorithms have ever taken domain knowledge into consideration. In this paper, the domain knowledge is represented by a hierarchical ontology. We develop a framework that directly incorporates domain knowledge into the clustering process, yielding a set of clusters with strong ontology implication. During the clustering process, ontology information is used to efficiently prune the exponential search space of the subspace clustering algorithms. Meanwhile, the algorithm generates an automatic interpretation of the clustering result by mapping the naturally hierarchically organized subspace clusters with significant categorical enrichment onto the ontology hierarchy. Our experiments on a set of gene expression data using gene ontology demonstrate that our ontology-driven pruning technique significantly improves the clustering performance with minimal degradation of the cluster quality. Meanwhile, many hierarchical organizations of gene clusters corresponding to subhierarchies in gene ontology were also successfully captured.
Deriving Quantitative Models for Correlation Clusters
In Proc. 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining, 2006
Cited by 14 (6 self)
Correlation clustering aims at grouping the data set into correlation clusters such that the objects in the same cluster exhibit a certain density and are all associated with a common arbitrarily oriented hyperplane of arbitrary dimensionality. Several algorithms for this task have been proposed recently. However, all algorithms only compute the partitioning of the data into clusters. This is only a first step in the pipeline of advanced data analysis and system modelling. The second (post-clustering) step of deriving a quantitative model for each correlation cluster has not been addressed so far. In this paper, we describe an original approach to handle this second step. We introduce a general method that can extract quantitative information on the linear dependencies within a correlation clustering. Our concepts are independent of the clustering model and can thus be applied as a post-processing step to any correlation clustering algorithm. Furthermore, we show how these quantitative models can be used to predict the probability distribution that an object is created by these models. Our broad experimental evaluation demonstrates the beneficial impact of our method on several applications of significant practical importance.
Minimax Localization of Structural Information in Large Noisy Matrices
Cited by 13 (2 self)
We consider the problem of identifying a sparse set of relevant columns and rows in a large data matrix with highly corrupted entries. This problem of identifying groups from a collection of bipartite variables such as proteins and drugs, biological species and gene sequences, malware and signatures, etc., is commonly referred to as biclustering or co-clustering. Despite its great practical relevance, and although several ad hoc methods are available for biclustering, theoretical analysis of the problem is largely nonexistent. The problem we consider is also closely related to structured multiple hypothesis testing, an area of statistics that has recently witnessed a flurry of activity. We make the following contributions:
1. We prove lower bounds on the minimum signal strength needed for successful recovery of a bicluster as a function of the noise variance, size of the matrix and bicluster of interest.
2. We show that a combinatorial procedure based on the scan statistic achieves this optimal limit.
3. We characterize the SNR required by several computationally tractable procedures for biclustering including elementwise thresholding, column/row average thresholding and a convex relaxation approach to sparse singular vector decomposition.
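Two of the computationally tractable procedures named in the third contribution can be sketched in a few lines. The version below is a simplified illustration under assumptions: the thresholds are chosen by hand rather than derived from the noise level, the function names are hypothetical, and the way surviving entries are mapped back to rows and columns is one plausible reading rather than the paper's exact procedure.

```python
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def elementwise_threshold(matrix: Matrix, tau: float) -> Tuple[List[int], List[int]]:
    """Keep entries above tau and report the rows and columns that contain them."""
    hits = [(i, j) for i, row in enumerate(matrix) for j, x in enumerate(row) if x > tau]
    rows = sorted({i for i, _ in hits})
    cols = sorted({j for _, j in hits})
    return rows, cols


def average_threshold(matrix: Matrix, tau_row: float, tau_col: float) -> Tuple[List[int], List[int]]:
    """Select rows and columns whose mean value exceeds the given thresholds."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    rows = [i for i in range(n_rows) if sum(matrix[i]) / n_cols > tau_row]
    cols = [
        j for j in range(n_cols)
        if sum(matrix[i][j] for i in range(n_rows)) / n_rows > tau_col
    ]
    return rows, cols


# Toy matrix: a bicluster on rows {0, 1} and columns {1, 2}, plus small noise.
noisy = [
    [0.2, 5.1, 4.8, -0.3],
    [-0.1, 4.9, 5.2, 0.1],
    [0.3, -0.2, 0.1, 0.0],
    [0.0, 0.2, -0.1, -0.2],
]

print(elementwise_threshold(noisy, tau=2.0))
print(average_threshold(noisy, tau_row=1.0, tau_col=1.0))
```

Averaging pools evidence across a whole row or column, so it can succeed with weaker per-entry signal when the bicluster is large, whereas elementwise thresholding depends on each individual entry standing out from the noise. The abstract's SNR characterization quantifies that trade-off.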
On exploring complex relationships of correlation clusters
In Proceedings of the 19th International Conference on Scientific and Statistical Database Management (SSDBM), 2007
Cited by 10 (9 self)
In high-dimensional data, clusters often only exist in arbitrarily oriented subspaces of the feature space. In addition, these so-called correlation clusters may have complex relationships between each other. For example, a correlation cluster in a 1D subspace (forming a line) may be enclosed within one or even several correlation clusters in 2D superspaces (forming planes). In general, such relationships can be seen as a complex hierarchy that allows multiple inclusions, i.e. clusters may be embedded in several superclusters rather than only in one. Obviously, uncovering the hierarchical relationships between the detected correlation clusters is an important information gain. Since existing approaches cannot detect such complex hierarchical relationships among correlation clusters, we propose the algorithm ERiC to tackle this problem and to visualize the result by means of a graph-based representation. In our experimental evaluation, we show that ERiC finds more information than state-of-the-art correlation clustering methods and outperforms existing competitors in terms of efficiency.