Finding motifs using random projections
, 2001
Cited by 211 (5 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.
1. CHALLENGING MOTIF PROBLEMS
Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge: Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions. One instantiation that they labeled "The Challenge Problem" was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are
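The planted-motif setup and the projection idea in this abstract can be illustrated with a short sketch. It is not the authors' implementation: the function names, the use of 20 sequences of length 600 as illustrative instance sizes, and the projection size k = 7 are all assumptions made for this example. The sketch only builds a planted instance and hashes every length-l substring by its letters at k randomly chosen positions; it does not include any later refinement step.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def plant_instance(l, d, t, n, rng):
    """Build t random sequences of length n, each containing one copy of a
    random motif of length l altered by exactly d point substitutions."""
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, positions = [], []
    for _ in range(t):
        variant = list(motif)
        for i in rng.sample(range(l), d):
            # Substitute a letter that differs from the motif at position i.
            variant[i] = rng.choice([c for c in ALPHABET if c != motif[i]])
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = variant
        sequences.append("".join(background))
        positions.append(start)
    return motif, sequences, positions

def project_buckets(sequences, l, k, rng):
    """Group every l-mer by its letters at k randomly chosen positions."""
    mask = sorted(rng.sample(range(l), k))
    buckets = defaultdict(list)
    for seq_index, seq in enumerate(sequences):
        for start in range(len(seq) - l + 1):
            key = "".join(seq[start + i] for i in mask)
            buckets[key].append((seq_index, start))
    return mask, buckets

# Illustrative parameters (assumed for this sketch).
rng = random.Random(0)
motif, seqs, planted = plant_instance(l=15, d=4, t=20, n=600, rng=rng)
mask, buckets = project_buckets(seqs, l=15, k=7, rng=rng)
largest = max(buckets.values(), key=len)
print(len(buckets), "buckets; largest holds", len(largest), "substrings")
```

Planted variants whose substitutions all fall outside the chosen k positions hash to the same bucket, which is why unusually full buckets are natural places to look for the motif.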
Algorithms for Extracting Structured Motifs Using a Suffix Tree With an Application to Promoter and Regulatory Site Consensus Identification
, 2000
Cited by 88 (7 self)
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p − 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes, that is, the motifs themselves, are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N^2 n, where n is the average length of the sequences and N their number. An application t...
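To make the notion of a structured motif concrete, the following sketch checks whether a known structured model occurs in one sequence. It is a naive scan, not the suffix-tree algorithms of the paper, and it solves the easier verification problem rather than the extraction problem. The representation is an assumption: each box is a string, each substitution rate is modeled as a maximum number of mismatches, and each interval is a (lo, hi) pair bounding the gap between the end of one box and the start of the next. The example boxes are illustrative only.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs(seq, boxes, max_subs, gaps):
    """Return True if seq contains the boxes in order, box i matching with at
    most max_subs[i] substitutions, and gaps[i] = (lo, hi) bounding the number
    of letters between the end of box i and the start of box i + 1."""
    def match_from(box_index, start):
        box = boxes[box_index]
        end = start + len(box)
        if end > len(seq):
            return False
        if hamming(seq[start:end], box) > max_subs[box_index]:
            return False
        if box_index == len(boxes) - 1:
            return True
        lo, hi = gaps[box_index]
        return any(match_from(box_index + 1, end + g) for g in range(lo, hi + 1))

    return any(match_from(0, start) for start in range(len(seq)))

# Two boxes (p = 2), one mismatch allowed per box, gap of 16 to 18 letters.
boxes = ["TTGACA", "TATAAT"]
max_subs = [1, 1]
gaps = [(16, 18)]

inside = "GG" + "TTGTCA" + "A" * 17 + "TATAAT" + "CC"
outside = "GG" + "TTGTCA" + "A" * 20 + "TATAAT" + "CC"
print(occurs(inside, boxes, max_subs, gaps))   # True: gap of 17 is allowed
print(occurs(outside, boxes, max_subs, gaps))  # False: gap of 20 is too wide
```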
VOTING ALGORITHMS FOR DISCOVERING LONG MOTIFS
Cited by 28 (6 self)
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter regions of co-regulated genes. Here l is the length of the motif and d is the maximum Hamming distance allowed between the motif and each of the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee that the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time not only for the challenging (9,2)-, (11,3)-, and (15,5)-motif problems but also for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.
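The abstract does not spell out how the voting works, so the sketch below shows one plausible reading, stated as an assumption: every length-l substring votes for each string within Hamming distance d of it, each input sequence contributes at most one vote per candidate, and a candidate is reported when it collects a vote from every sequence. The function names are hypothetical, and the enumeration is exponential in d, so this is only suitable for tiny inputs.

```python
from itertools import combinations, product

ALPHABET = "ACGT"

def neighbors(word, d):
    """All strings within Hamming distance d of word."""
    result = set()
    for k in range(d + 1):
        for positions in combinations(range(len(word)), k):
            # At each chosen position, try every letter that differs.
            choices = [[c for c in ALPHABET if c != word[i]] for i in positions]
            for letters in product(*choices):
                variant = list(word)
                for i, c in zip(positions, letters):
                    variant[i] = c
                result.add("".join(variant))
    return result

def vote(sequences, l, d):
    """Return candidates that receive a vote from every input sequence."""
    votes = {}
    for seq in sequences:
        voted = set()  # a sequence votes at most once per candidate
        for start in range(len(seq) - l + 1):
            voted |= neighbors(seq[start:start + l], d)
        for candidate in voted:
            votes[candidate] = votes.get(candidate, 0) + 1
    return sorted(c for c, v in votes.items() if v == len(sequences))

seqs = ["GGACGAGG", "TTTCGTTT", "CCACCTCC"]
candidates = vote(seqs, l=4, d=1)
print("ACGT" in candidates)  # True: each sequence holds a 1-mismatch variant
```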
Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals
 J. Mol. Biol
, 2000
Cited by 24 (3 self)
binding site. Email address of the corresponding author:
Extracting structured motifs using a suffix tree: algorithms and application to promoter consensus identification
 In Proceedings of RECOMB 2000
, 2000
Computational identification of transcriptional regulatory elements in DNA sequence
, 2006
Cited by 24 (0 self)
Identification and annotation of all the functional elements in the genome, including genes and the regulatory sequences, is a fundamental challenge in genomics and computational biology. Since regulatory elements are frequently short and variable, their identification and discovery using computational algorithms is difficult. However, significant advances have been made in the computational methods for modeling and detection of DNA regulatory elements. The availability of complete genome sequences from multiple organisms, as well as mRNA profiling and high-throughput experimental methods for mapping protein-binding sites in DNA, has contributed to the development of methods that utilize these auxiliary data to inform the detection of transcriptional regulatory elements. Progress is also being made in the identification of cis-regulatory modules and higher-order structures of the regulatory sequences, which is essential to the understanding of transcription regulation in metazoan genomes. This article reviews the computational approaches for modeling and identification of genomic regulatory elements, with an emphasis on recent developments and current challenges.
The at most k-deep factor tree
, 2003
Cited by 17 (4 self)
This article presents a new indexing structure closely related to the suffix tree. The structure indexes all factors of length at most k of a string. Its construction time and memory footprint are linear in the length of the string (as for the suffix tree). However, for small values of k, the factor tree offers a substantial memory saving compared with the suffix tree. Keywords: suffix tree, factor tree, indexing structure.
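As a rough illustration of what such an index stores, the sketch below builds an uncompacted trie of every factor (substring) of length at most k. This is an assumption-laden simplification: it uses nested dictionaries, takes O(nk) time rather than the linear-time construction the abstract describes, and does not compact unary paths as a suffix tree would. The function names are hypothetical.

```python
def build_factor_trie(text, k):
    """Insert the first k letters of every suffix of text into a trie, so that
    the trie spells exactly the factors of text of length at most k."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:start + k]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, factor):
    """Return True if factor is spelled by a path starting at the root."""
    node = trie
    for ch in factor:
        if ch not in node:
            return False
        node = node[ch]
    return True

def count_nodes(trie):
    """Number of non-root nodes, one per distinct indexed factor."""
    return sum(1 + count_nodes(child) for child in trie.values())

trie = build_factor_trie("banana", k=2)
print(contains(trie, "an"))   # True
print(contains(trie, "nan"))  # False: longer than k, so not indexed
print(count_nodes(trie))      # 6 distinct factors: b, a, n, ba, an, na
```

Bounding the depth at k is what caps the number of nodes for small k, which is the source of the memory saving the abstract mentions.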
On the Parameterized Intractability of Closest Substring and Related Problems
 In Proc. 19th STACS, volume 2285 of LNCS
, 2002
Cited by 13 (4 self)
We show that Closest Substring, one of the most important problems in the field of biological sequence analysis, is W[1]-hard with respect to the number k of input strings (even over a binary alphabet). This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f and constant c independent of k; effectively, the problem can be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete and both possess polynomial-time approximation schemes. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our main W[1]-hardness result generalizes to Consensus Patterns, a problem of similar significance in computational biology.
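For readers unfamiliar with the problem, the sketch below states Closest Substring as an exhaustive search, under the usual definition (assumed here, since the abstract does not restate it): given input strings, a length, and a distance bound d, find a string of that length that is within Hamming distance d of some substring of every input. The search tries every candidate over the alphabet, so its running time is exponential in the length; it illustrates the problem statement, not an efficient method.

```python
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def closest_substring(strings, length, d, alphabet="01"):
    """Return the first candidate (in lexicographic order of the alphabet)
    that lies within distance d of a substring of every input, or None."""
    for letters in product(alphabet, repeat=length):
        candidate = "".join(letters)
        if all(
            any(
                hamming(candidate, s[i:i + length]) <= d
                for i in range(len(s) - length + 1)
            )
            for s in strings
        ):
            return candidate
    return None

print(closest_substring(["0011", "1101", "0110"], length=2, d=0))  # 01
print(closest_substring(["00", "11"], length=2, d=0))              # None
print(closest_substring(["00", "11"], length=2, d=1))              # 01
```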
A highly scalable algorithm for the extraction of cis-regulatory regions
 In Proc. APBC’05
, 2005
Cited by 12 (2 self)
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential in the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.
On the parameterized intractability of motif search problems
 Combinatorica
, 2006
Cited by 11 (4 self)
We show that Closest Substring, one of the most important problems in the field of biological sequence analysis, is W[1]-hard when parameterized by the number k of input strings (and remains so, even over a binary alphabet). This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f of k and constant c independent of k. The problem can therefore be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our W[1]-hardness result for Closest Substring generalizes to Consensus Patterns, a problem of similar significance in computational biology.