Methods in comparative genomics: Genome correspondence, gene identification, and regulatory motif discovery
Journal of Computational Biology, 2004
Cited by 14 (4 self)
In Kellis et al. (2003), we reported the genome sequences of S. paradoxus, S. mikatae, and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and noncoding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. (1) We present methods for the automatic determination of genome correspondence. The algorithms enabled the automatic identification of orthologs for more than 90% of genes and intergenic regions across the four species despite the large number of duplicated genes in the yeast genome. The remaining ambiguities in the gene correspondence revealed recent gene family expansions in regions of rapid genomic change. (2) We present methods for the identification of protein-coding genes based on their patterns of nucleotide conservation across related species. We observed the pressure to conserve the reading frame of functional proteins and developed a test for gene identification with high sensitivity and specificity. We used this test to revisit the genome of S. cerevisiae, reducing the overall gene count by 500 genes (10% of previously
CSV: Visualizing and Mining Cohesive Subgraphs
Cited by 10 (3 self)
Extracting dense sub-components from graphs efficiently is an important objective in a wide range of application domains ranging from social network analysis to biological network analysis, from the World Wide Web to stock market analysis. Motivated by this need, we have recently seen several new algorithms that tackle this problem based on the (frequent) pattern mining paradigm. A limitation of most of these methods is that they are highly sensitive to parameter settings, rely on exhaustive enumeration with exponential time complexity, and often fail to help users understand the underlying distribution of components embedded within the host graph. In this article we propose an approximate algorithm to mine and visualize cohesive subgraphs (dense sub-components) within a large graph. The approach, referred to as Cohesive Subgraph Visualization (CSV), relies on a novel mapping strategy that maps edges and nodes to a multidimensional space wherein dense areas in the mapped space correspond to cohesive subgraphs. The algorithm then walks through the dense regions in the mapped space to output a visual plot that effectively captures the overall dense sub-component distribution of the graph. Unlike extant algorithms with exponential complexity, CSV has a complexity of O(V^2 log V) when the mapping dimension parameter is fixed, where V is the number of vertices in the graph, although for many real datasets the performance is typically sub-quadratic. We demonstrate the utility of CSV as a stand-alone tool for visual graph exploration and as a pre-filtering step to significantly scale up exact subgraph mining algorithms such
Clustering of Unevenly Sampled Gene Expression Time-Series Data
2003
Cited by 7 (2 self)
Motivation: Time course measurements are becoming a common type of experiment in the use of microarrays. Conventional clustering algorithms based on the Euclidean distance or the Pearson correlation coefficient are not able to include temporal information in the distance metric. The temporal order of the data and the varying length of sampling intervals are important and should be considered in clustering time-series. However, the shortness of gene expression time-series data limits the use of conventional statistical models and techniques for time-series analysis. To address this problem, this paper proposes the Fuzzy Short Time-Series (FSTS) clustering algorithm, which is able to cluster profiles based on the similarity of their relative change of expression level and the corresponding temporal information. One of the major advantages of fuzzy clustering is that genes can belong to more than one group, revealing distinctive features of each gene's function and regulation. Results:
Fuzzy J-Means and VNS Methods for Clustering Genes from Microarray Data
Bioinformatics, 2004
Cited by 7 (0 self)
Motivation: In the interpretation of gene expression data from a group of microarray experiments that include samples from either different patients or conditions, special consideration must be given to the pleiotropic and epistatic roles of genes, as observed in the variation of gene co-expression patterns. Crisp clustering methods assign each gene to one cluster, thereby omitting information about the multiple roles of genes. Results: Here we present the application of a local search heuristic, Fuzzy J-Means, embedded into the Variable Neighborhood Search metaheuristic for the clustering of microarray gene expression data. We show that for all data sets studied this algorithm outperforms the standard Fuzzy C-Means heuristic. Different methods for the utilization of cluster membership information in determining gene co-regulation are presented. The clustering and data analyses were performed on simulated data sets as well as experimental cDNA microarray data for breast cancer and human blood from the Stanford Microarray Database. Availability: The source code of the clustering software (C programming language) is freely available
Whole-genome comparative annotation and regulatory motif discovery in multiple yeast species
Proceedings of the 7th International Conference on Research in Computational Molecular Biology, 2003
Cited by 5 (0 self)
In [13] we reported the genome sequences of S. paradoxus, S. mikatae and S. bayanus and compared these three yeast species to their close relative, S. cerevisiae. Genome-wide comparative analysis allowed the identification of functionally important sequences, both coding and non-coding. In this companion paper we describe the mathematical and algorithmic results underpinning the analysis of these genomes. We developed methods for the automatic comparative annotation of the four species and the determination of orthologous genes and intergenic regions. The algorithms enabled the automatic identification of orthologs for more than 90% of genes despite the large number of duplicated genes in the yeast genome, and the discovery of recent gene family expansions and genome rearrangements. We also developed a test to validate
Gene expression analysis using fuzzy k-means clustering
Genome Informatics, 2003
Cited by 4 (0 self)
Recent advances in array technologies have made it possible to monitor huge amounts of gene expression data. Clustering, for example hierarchical clustering, self-organizing maps (SOM), and k-means clustering, has become an important analysis method for such gene expression data. We have applied the Fuzzy adaptive resonance theory (Fuzzy ART) [5] to the gene clustering of DNA microarray data
yMGV: a cross-species expression data mining tool
2004
"... The yeast Microarray Global Viewer (yMGV @ ..."
A Simulated Annealing Approach to Find the Optimal Parameters for Fuzzy Clustering Microarray Data
Cited by 2 (0 self)
Rapid advances in microarray technologies are making it possible to analyze and manipulate large amounts of gene expression data. Clustering algorithms, such as hierarchical clustering, self-organizing maps, k-means clustering and fuzzy k-means clustering, have become important tools for expression analysis of microarray data. However, the need for prior knowledge of the number of clusters, k, and the fuzziness parameter, b, limits the usage of fuzzy clustering. Few approaches have been proposed for assigning the best possible values to these parameters. In this paper, we use simulated annealing and fuzzy k-means clustering to determine the optimal parameters, namely the number of clusters, k, and the fuzziness parameter, b. Our results show that a nearly optimal pair of k and b can be obtained without exploring the entire search space.
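To make the role of the fuzziness parameter b concrete, the following is a minimal sketch of the standard fuzzy k-means membership update, in which each point receives a degree of membership in every cluster. It is not the simulated annealing procedure from the paper, which the abstract does not describe in detail. The function name `fuzzy_memberships` and the toy data are hypothetical and chosen only for illustration.

```python
import math


def fuzzy_memberships(points, centers, b=2.0):
    """Return u[i][j], the membership of point i in cluster j.

    Uses the textbook fuzzy k-means rule
        u_ij = 1 / sum_k (d_ij / d_ik) ** (2 / (b - 1)),
    where d_ij is the Euclidean distance from point i to center j.
    Each row sums to 1. The name and signature are illustrative assumptions.
    """
    if b <= 1.0:
        raise ValueError("fuzziness parameter b must be greater than 1")
    exponent = 2.0 / (b - 1.0)
    memberships = []
    for p in points:
        dists = [math.dist(p, c) for c in centers]
        if 0.0 in dists:
            # The point sits exactly on one or more centers:
            # share full membership equally among those centers.
            hits = [d == 0.0 for d in dists]
            share = 1.0 / sum(hits)
            memberships.append([share if h else 0.0 for h in hits])
            continue
        row = [
            1.0 / sum((d_j / d_k) ** exponent for d_k in dists)
            for d_j in dists
        ]
        memberships.append(row)
    return memberships


# Toy data: one point lies between two centers, closer to the first.
points = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]

u_sharp = fuzzy_memberships(points, centers, b=2.0)
u_soft = fuzzy_memberships(points, centers, b=3.0)
print(u_sharp[1])  # memberships of the middle point with b = 2
print(u_soft[1])   # the same point with b = 3: memberships move toward equal
```

In this toy case, raising b from 2 to 3 moves the middle point's memberships closer to equal, which is why the choice of b (together with k) changes the clustering outcome and is worth searching over.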
Techniques for clustering gene expression data
Comput Biol Med
Cited by 2 (0 self)
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered. Key words: Gene Expression, Clustering, Bi-clustering, Microarray Analysis
Efficient Bayesian methods for clustering
2008
Cited by 2 (0 self)
I, Katherine Ann Heller, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis. One of the most important goals of unsupervised learning is to discover meaningful clusters in data. Clustering algorithms strive to discover groups, or clusters, of data points which belong together because they are in some way similar. The research presented in this thesis focuses on using Bayesian statistical techniques to cluster data. We take a model-based Bayesian approach to defining a cluster, and evaluate cluster membership in this paradigm. Due to the fact that large data sets are increasingly common in practice, our aim is for the methods in this thesis to be efficient while still retaining the desirable properties which result from a Bayesian paradigm. We develop a Bayesian Hierarchical Clustering (BHC) algorithm which efficiently addresses many of the drawbacks of traditional hierarchical clustering algorithms. The goal of BHC is to construct a hierarchical representation of the data, incorporating both

