Approximation algorithms for co-clustering
- In Proceedings PODS 2008
, 2008
Cited by 19 (0 self)
Co-clustering is the simultaneous partitioning of the rows and columns of a matrix such that the blocks induced by the row/column partitions are good clusters. Motivated by several applications in text mining, market-basket analysis, and bioinformatics, this problem has attracted considerable attention in the past few years. Unfortunately, to date, most of the algorithmic work on this problem has been heuristic in nature. In this work we obtain the first approximation algorithms for the co-clustering problem. Our algorithms are simple and obtain constant-factor approximation solutions to the optimum. We also show that co-clustering is NP-hard, thereby complementing our algorithmic result.
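To make "blocks induced by the row/column partitions" concrete, here is a minimal sketch that scores a given co-clustering. The abstract does not state the objective, so the squared-error cost around each block mean is an assumption of this example, and the function name `cocluster_cost` is hypothetical.

```python
from collections import defaultdict

def cocluster_cost(matrix, row_labels, col_labels):
    """Sum of squared deviations of each entry from the mean of its block.

    A block is identified by (row cluster, column cluster). The squared-error
    objective is an assumption of this sketch, not taken from the abstract.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for i, row in enumerate(matrix):
        for j, value in enumerate(row):
            block = (row_labels[i], col_labels[j])
            sums[block] += value
            counts[block] += 1
    means = {block: sums[block] / counts[block] for block in sums}
    return sum(
        (value - means[(row_labels[i], col_labels[j])]) ** 2
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    )

matrix = [
    [1.0, 1.0, 5.0],
    [1.0, 1.0, 5.0],
    [9.0, 9.0, 2.0],
]
# Partition that matches the visible block structure: every block is constant.
good = cocluster_cost(matrix, [0, 0, 1], [0, 0, 1])
# Partition that cuts across the blocks: entries in a block disagree.
bad = cocluster_cost(matrix, [0, 1, 1], [0, 1, 1])
```

A lower cost means the induced blocks are more homogeneous. Searching over all partitions for the minimum is the hard part that the approximation algorithms address; this sketch only evaluates one candidate.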
A scalable framework for discovering coherent co-clusters in noisy data
- In ICML ’08
Cited by 19 (4 self)
QUBIC: a qualitative biclustering algorithm for analyses of gene expression data
, 2009
Attribute clustering for grouping, selection, and classification of gene expression
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
, 2005
Cited by 18 (0 self)
This paper presents an attribute clustering method which is able to group genes based on their interdependence so as to mine meaningful patterns from the gene expression data. It can be used for gene grouping, selection and classification. The partitioning of a relational table into attribute subgroups allows a small number of attributes within or across the groups to be selected for analysis. By clustering attributes, the search dimension of a data mining algorithm is reduced. The reduction of search dimension is especially important to data mining in gene expression data because such data typically consist of a huge number of genes (attributes) and a small number of gene expression profiles (tuples). Most data mining algorithms are typically developed and optimized to scale to the number of tuples instead of the number of attributes. The situation becomes even worse when the number of attributes overwhelms the number of tuples, in which case the likelihood of reporting patterns that are actually irrelevant due to chance becomes rather high. It is for the aforementioned reasons that gene grouping and selection are important preprocessing steps for many data mining algorithms to be effective when applied to gene expression data. This paper defines the problem of attribute clustering and introduces a methodology for solving it. Our proposed method groups interdependent attributes into clusters by optimizing a criterion
Identification of regulatory modules in time-series gene expression data using a linear time biclustering algorithm
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
Cited by 18 (2 self)
Several non-supervised machine learning methods have been used in the analysis of gene expression data obtained from microarray experiments. Recently, biclustering, a non-supervised approach that performs simultaneous clustering on the row and column dimensions of the data matrix, has been shown to be remarkably effective in a variety of applications. The goal of biclustering is to find subgroups of genes and subgroups of experimental conditions, where the genes exhibit highly correlated behaviors. These correlated behaviors correspond to coherent expression patterns and can be used to identify potential regulatory modules possibly involved in regulatory mechanisms. Many specific versions of the biclustering problem have been shown to be NP-complete. However, when we are interested in identifying biclusters in time series expression data, we can restrict the problem by finding only maximal biclusters with contiguous columns. This restriction leads to a tractable problem. Its motivation is the fact that biological processes start and finish in an identifiable contiguous period of time, leading to increased (or decreased) activity of sets of genes forming biclusters with contiguous
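The contiguous-column restriction described in this abstract can be illustrated with a brute-force sketch. It is not the linear-time algorithm the title refers to. It assumes, beyond what the abstract says, that each gene's time series has already been discretized into one symbol per time point (here `U`, `D`, `N`, chosen only for illustration), and the function name `contiguous_biclusters` is hypothetical.

```python
from collections import defaultdict

def contiguous_biclusters(rows, min_rows=2):
    """Enumerate maximal biclusters whose columns form one contiguous window.

    rows: equal-length strings, one per gene, one symbol per time point
          (the discretization into symbols is an assumption of this sketch).
    Returns a set of (gene_indices, start, end) covering columns start..end-1.
    """
    n_cols = len(rows[0]) if rows else 0
    found = set()
    for start in range(n_cols):
        for end in range(start + 1, n_cols + 1):
            # Genes sharing the same pattern in this window form a candidate.
            # Grouping by pattern makes each candidate maximal in its rows.
            groups = defaultdict(list)
            for gene, pattern in enumerate(rows):
                groups[pattern[start:end]].append(gene)
            for genes in groups.values():
                if len(genes) < min_rows:
                    continue
                # Maximal in columns: the window cannot grow left or right
                # while all genes in the group still agree.
                grows_left = start > 0 and len({rows[g][start - 1] for g in genes}) == 1
                grows_right = end < n_cols and len({rows[g][end] for g in genes}) == 1
                if not (grows_left or grows_right):
                    found.add((tuple(genes), start, end))
    return found

rows = ["UDUN", "UDUD", "NDUN"]  # three genes, four time points
result = contiguous_biclusters(rows)
```

This enumeration checks every window against every gene, so it is far from linear time. It only shows why restricting to contiguous columns shrinks the search space from all column subsets to all column intervals.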
Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features
, 2009
Cited by 18 (12 self)
In this study we use bipartite spectral graph partitioning to simultaneously cluster varieties and identify their most distinctive linguistic features in Dutch dialect data. While clustering geographical varieties with respect to their features, e.g. pronunciation, is not new, the simultaneous identification of the features which give rise to the geographical clustering presents novel opportunities in dialectometry. Earlier methods aggregated sound differences and clustered on the basis of aggregate differences. The determination of the significant features which co-vary with cluster membership was carried out on a post hoc basis. Bipartite spectral graph clustering simultaneously seeks groups of individual features which are strongly associated, even while seeking groups of sites which share subsets of these same features. We show that the application of this method results in clear and sensible geographical groupings and discuss and analyze the importance of the concomitant features.
Redescription mining: Structure theory and algorithms
- In AAAI
, 2005
Cited by 17 (5 self)
We introduce a new data mining problem, redescription mining, that unifies considerations of conceptual clustering, constructive induction, and logical formula discovery. Redescription mining begins with a collection of sets, views it as a propositional vocabulary, and identifies clusters of data that can be defined in at least two ways using this vocabulary. The primary contributions of this paper are conceptual and theoretical: (i) we formally study the space of redescriptions underlying a dataset and characterize their intrinsic structure, (ii) we identify impossibility as well as strong possibility results about when mining redescriptions is feasible, (iii) we present several scenarios of how we can custom-build redescription mining solutions for various biases, and (iv) we outline how many problems studied in the larger machine learning community are really special cases of redescription mining. By highlighting its broad scope and relevance, we aim to establish the importance of redescription mining and make the case for a thrust in this new line of research.
Coclustering of Human Cancer Microarrays Using Minimum Sum-Squared Residue
Cited by 17 (7 self)
It is a consensus in microarray analysis that identifying potential local patterns, characterized by coherent groups of genes and conditions, may shed light on the discovery of previously undetectable biological cellular processes of genes, as well as macroscopic phenotypes of related samples. In order to simultaneously cluster genes and conditions, we have previously developed a fast coclustering algorithm, Minimum Sum-Squared Residue Coclustering (MSSRCC), which employs an alternating minimization scheme and generates what we call coclusters in a “checkerboard” structure. In this paper, we propose specific strategies that enable MSSRCC to escape poor local minima and resolve the degeneracy problem in partitional clustering algorithms. The strategies include binormalization, deterministic spectral initialization, and incremental local search. We assess the effects of various strategies on both synthetic gene expression data sets and real human cancer microarrays and provide empirical evidence that MSSRCC with the proposed strategies performs better than existing coclustering and clustering algorithms. In particular, the combination of all three strategies leads to the best performance. Furthermore, we illustrate coherence of the resulting coclusters in a checkerboard structure, where genes in a cocluster manifest the phenotype structure of corresponding specific samples, and evaluate the enrichment of functional annotations in Gene Ontology (GO). Index Terms: Microarray analysis, coclustering, binormalization, deterministic spectral initialization, local search, gene ontology.
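The abstract names a sum-squared residue objective without giving its formula. The sketch below uses a residue definition that is common in the biclustering literature (entry minus row mean minus column mean plus block mean). Whether MSSRCC uses exactly this form is an assumption here, and the function name `sum_squared_residue` is hypothetical.

```python
def sum_squared_residue(block):
    """Sum over a cocluster of (a_ij - row_mean_i - col_mean_j + block_mean)^2.

    block: non-empty list of equal-length rows of floats (genes x conditions).
    The residue formula is an assumed, commonly used definition.
    """
    n_rows, n_cols = len(block), len(block[0])
    row_means = [sum(row) / n_cols for row in block]
    col_means = [sum(row[j] for row in block) / n_rows for j in range(n_cols)]
    block_mean = sum(row_means) / n_rows
    return sum(
        (block[i][j] - row_means[i] - col_means[j] + block_mean) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )

# Second gene equals the first shifted by +2: a perfectly coherent pattern.
additive = sum_squared_residue([[1.0, 2.0, 4.0], [3.0, 4.0, 6.0]])
# Same block with one perturbed entry: coherence is broken.
noisy = sum_squared_residue([[1.0, 2.0, 4.0], [3.0, 4.0, 9.0]])
```

Under this definition a block whose rows differ only by a constant shift has residue zero up to floating-point error, so the measure rewards coherent trends rather than equal values.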