Combining phylogenetic data with co-regulated genes to identify regulatory motifs
- BIOINFORMATICS, 2003
Cited by 136 (11 self)
Motivation: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a ‘multiple genes, single species’ approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called ‘single gene, multiple species’. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs. Results: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data. Availability: Software available upon request from the authors.
Gibbs recursive sampler: finding transcription factor binding sites
- Nucleic Acids Res, 2003
Cited by 92 (7 self)
The Gibbs Motif Sampler is a software package for locating common elements in collections of biopolymer sequences. In this paper we describe a new variation of the Gibbs Motif Sampler, the Gibbs Recursive Sampler, which has been developed specifically for locating multiple transcription factor binding sites for multiple transcription factors simultaneously in unaligned DNA sequences that may be heterogeneous in DNA composition. Here we describe the basic operation of the web-based version of this sampler. The sampler may be accessed at
Computational identification of transcriptional regulatory elements in DNA sequence
- 2006
Cited by 55 (0 self)
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequence from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, have contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in the metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on the recent developments, and current challenges.
Finding Subtle Motifs by Branching from Sample Strings
- 2003
Cited by 43 (0 self)
Many motif finding algorithms apply local search techniques to a set of seeds. For example, GibbsDNA (Lawrence et al., 1993) applies Gibbs sampling to random seeds, and MEME (Bailey and Elkan, 1994) applies the EM algorithm to selected sample strings, i.e. substrings of the sample. In the case of subtle motifs, recent benchmarking efforts show that both random seeds and selected sample strings may never get close to the globally optimal motif. We propose a new approach which searches motif space by branching from sample strings, and implement this idea in both pattern-based and profile-based settings. Our PatternBranching and ProfileBranching algorithms achieve favorable results relative to other motif finding algorithms.
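The idea of branching from sample strings in the pattern-based setting can be illustrated with a minimal sketch. It assumes that a pattern is scored by its total Hamming distance to the closest window in each sequence and that each branching step moves to the best single-substitution neighbor; these choices, along with the function names and the `max_steps` parameter, are illustrative assumptions and not the authors' implementation.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def distance_to_sequence(pattern, seq):
    """Smallest Hamming distance between pattern and any window of seq."""
    k = len(pattern)
    return min(hamming(pattern, seq[i:i + k]) for i in range(len(seq) - k + 1))


def total_distance(pattern, seqs):
    """Assumed score: sum over sequences of the distance to the best window."""
    return sum(distance_to_sequence(pattern, s) for s in seqs)


def neighbors(pattern, alphabet="ACGT"):
    """All patterns that differ from pattern by exactly one substitution."""
    for i, c in enumerate(pattern):
        for a in alphabet:
            if a != c:
                yield pattern[:i] + a + pattern[i + 1:]


def branch_from(seed, seqs, max_steps):
    """Walk from a seed to its best neighbor for max_steps steps.

    The walk always moves, even to a worse pattern, and the best
    pattern seen along the way is returned with its score.
    """
    best, best_score = seed, total_distance(seed, seqs)
    current = seed
    for _ in range(max_steps):
        current = min(neighbors(current), key=lambda p: total_distance(p, seqs))
        score = total_distance(current, seqs)
        if score < best_score:
            best, best_score = current, score
    return best, best_score


def pattern_branching(seqs, k, max_steps):
    """Use every length-k sample string as a seed and keep the best result."""
    best, best_score = None, None
    for s in seqs:
        for i in range(len(s) - k + 1):
            cand, score = branch_from(s[i:i + k], seqs, max_steps)
            if best_score is None or score < best_score:
                best, best_score = cand, score
    return best, best_score
```

Calling `pattern_branching(seqs, 8, 2)` on a small list of DNA strings returns a length-8 pattern together with its total distance. Because seeds are substrings of the input, the search starts near plausible motif instances instead of at random points in pattern space.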
Rare Events and Conditional Events on Random Strings
- DMTCS, 2004
Cited by 24 (4 self)
this paper is twofold. First, a single word is given. We study the tail distribution of the number of its occurrences. Sharp large deviation estimates are derived. Second, we assume that a given word is overrepresented. The conditional distribution of a second word is studied; formulae for the expectation and the variance are derived. In both cases, the formulae are precise and can be computed efficiently. These results have applications in computational biology, where a genome is viewed as a text
MotifCut: regulatory motif finding with maximum density subgraphs
- In Proceedings of International Conference on Intelligent Systems and Molecular Biology, 2006
doi:10.1093/bioinformatics/btl243
A highly scalable algorithm for the extraction of cis-regulatory regions
- In Proc. APBC’05, 2005
Cited by 20 (2 self)
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential on the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.
Composite Module Analyst: A fitness-based tool for identification of transcription factor binding site combinations
- BIOINFORMATICS
Cited by 17 (2 self)
Motivation: Functionally related genes involved in the same molecular-genetic, biochemical, or physiological process are often regulated coordinately. Such regulation is provided by precisely organized binding of a multiplicity of special proteins (transcription factors) to their target sites (cis-elements) in regulatory regions of genes. Cis-element combinations provide a structural basis for the generation of unique patterns of gene expression. Results: Here we present a new approach for defining promoter models based on the composition of transcription factor binding sites and their pairs. We utilize a multicomponent fitness function for selection of the promoter model that fits best to the observed gene expression profile. We demonstrate examples of successful application of the fitness function with the help of a genetic algorithm for the analysis of functionally related or co-expressed genes as well as testing on simulated and permutated data. Availability: The CMA program is freely available for noncommercial users.
RISOTTO: Fast extraction of motifs with mismatches
- Proceedings of the 7th Latin American Theoretical Informatics Symposium, volume 3887 of LNCS:757–768, 2006
Cited by 16 (5 self)
We present in this paper an exact algorithm for motif extraction. Efficiency is achieved by means of an improvement in the algorithm and data structures that applies to the whole class of motif inference algorithms based on suffix trees. An average case complexity analysis shows a gain over the best known exact algorithm for motif extraction, when applied to extract long motifs. A full implementation was developed and made available online. Experimental results show that the proposed algorithm is more than two times faster than the best known exact algorithm for motif extraction, confirming in this way the theoretical results obtained.
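To make the problem of exact motif extraction with mismatches concrete, here is a naive brute-force baseline. It is not the suffix-tree algorithm described in the abstract: it enumerates every candidate pattern over the alphabet and keeps those that occur, with at most `d` mismatches, in at least `quorum` sequences. The function names and the `quorum` parameter are assumptions made for illustration.

```python
from itertools import product


def occurs_within(pattern, seq, d):
    """True if some window of seq matches pattern with at most d mismatches."""
    k = len(pattern)
    for i in range(len(seq) - k + 1):
        mismatches = 0
        for x, y in zip(pattern, seq[i:i + k]):
            if x != y:
                mismatches += 1
                if mismatches > d:
                    break  # this window is too far from the pattern
        else:
            return True  # the window was compared in full without exceeding d
    return False


def extract_motifs(seqs, k, d, quorum, alphabet="ACGT"):
    """Exhaustively list length-k patterns supported by at least quorum sequences.

    Runs in time proportional to len(alphabet) ** k, so it is only
    practical for short motifs; it serves as a reference definition.
    """
    motifs = []
    for letters in product(alphabet, repeat=k):
        pattern = "".join(letters)
        support = sum(occurs_within(pattern, s, d) for s in seqs)
        if support >= quorum:
            motifs.append(pattern)
    return motifs
```

The exponential cost of this enumeration in the motif length is the reason exact methods index the input, for example with suffix trees, and explore only patterns that can still meet the quorum.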