Spelling approximate repeated or common motifs using a suffix tree (1998)

by M F Sagot
Venue: Lecture Notes Comput. Sci.

Results 1 - 10 of 37

Finding motifs using random projections

by Jeremy Buhler, Martin Tompa, 2001
Cited by 174 (5 self)
Pevzner and Sze [23] considered a precise version of the motif discovery problem and simultaneously issued an algorithmic challenge: find a motif M of length 15, where each planted instance differs from M in 4 positions. Whereas previous algorithms all failed to solve this (15,4)-motif problem, Pevzner and Sze introduced algorithms that succeeded. However, their algorithms failed to solve the considerably more difficult (14,4)-, (16,5)-, and (18,6)-motif problems. We introduce a novel motif discovery algorithm based on the use of random projections of the input's substrings. Experiments on simulated data demonstrate that this algorithm performs better than existing algorithms and, in particular, typically solves the difficult (14,4)-, (16,5)-, and (18,6)-motif problems quite efficiently. A probabilistic estimate shows that the small values of d for which the algorithm fails to recover the planted (l,d)-motif are in all likelihood inherently impossible to solve. We also present experimental results on realistic biological data by identifying ribosome binding sites in prokaryotes as well as a number of known transcriptional regulatory motifs in eukaryotes.

1. CHALLENGING MOTIF PROBLEMS

Pevzner and Sze [23] considered a very precise version of the motif discovery problem of computational biology, which had also been considered by Sagot [26]. Based on this formulation, they issued an algorithmic challenge:

Planted (l,d)-Motif Problem: Suppose there is a fixed but unknown nucleotide sequence M (the motif) of length l. The problem is to determine M, given t nucleotide sequences each of length n, and each containing a planted variant of M. More precisely, each such planted variant is a substring that is M with exactly d point substitutions.

One instantiation that they labeled "The Challenge Problem" was parameterized as finding a planted (15,4)-motif in t = 20 sequences each of length n = 600. These values of n, t, and l are ...
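The planted motif problem stated in this abstract can be made concrete with a small instance generator. The sketch below is an illustration only, not code from the paper. It uses l for the motif length, d for the number of substitutions, t for the number of sequences, and n for the sequence length. The function names (`mutate`, `plant_instance`, `hamming`), the uniform random background, and the small sizes in the usage note are assumptions chosen for readability.

```python
import random

ALPHABET = "ACGT"

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def mutate(motif, d, rng):
    """Return a copy of motif with exactly d point substitutions."""
    variant = list(motif)
    for pos in rng.sample(range(len(motif)), d):
        # Choose a different nucleotide so the position really changes.
        variant[pos] = rng.choice([c for c in ALPHABET if c != motif[pos]])
    return "".join(variant)

def plant_instance(l, d, t, n, seed=0):
    """Build t random sequences of length n, each holding one planted variant.

    Assumption: background nucleotides are drawn uniformly at random.
    """
    rng = random.Random(seed)
    motif = "".join(rng.choice(ALPHABET) for _ in range(l))
    sequences, starts = [], []
    for _ in range(t):
        background = [rng.choice(ALPHABET) for _ in range(n)]
        start = rng.randrange(n - l + 1)
        background[start:start + l] = mutate(motif, d, rng)
        sequences.append("".join(background))
        starts.append(start)
    return motif, sequences, starts
```

For example, `plant_instance(15, 4, 5, 60)` returns a hidden motif, five sequences of length 60, and the planted start positions; a motif finder would be given only the sequences and asked to recover the motif.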

Algorithms for Extracting Structured Motifs Using a Suffix Tree With an Application to Promoter and Regulatory Site Consensus Identification

by Laurent Marsan, Marie-France Sagot, 2000
Cited by 71 (5 self)
This paper introduces two exact algorithms for extracting conserved structured motifs from a set of DNA sequences. Structured motifs may be described as an ordered collection of p ≥ 1 "boxes" (each box corresponding to one part of the structured motif), p substitution rates (one for each box) and p - 1 intervals of distance (one for each pair of successive boxes in the collection). The contents of the boxes -- that is, the motifs themselves -- are unknown at the start of the algorithm. This is precisely what the algorithms are meant to find. A suffix tree is used for finding such motifs. The algorithms are efficient enough to be able to infer site consensi, such as, for instance, promoter sequences or regulatory sites, from a set of unaligned sequences corresponding to the noncoding regions upstream from all genes of a genome. In particular, the time complexity of both algorithms scales linearly with N²n, where n is the average length of the sequences and N their number. An application t...
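The description of a structured motif (p boxes, one substitution bound per box, and p - 1 distance intervals) maps naturally onto a small record type. The sketch below is a brute-force occurrence check for a motif that is already known. It is not the suffix-tree extraction algorithm of the paper, which has to discover the box contents. The class and function names are hypothetical, each substitution rate is assumed to be a maximum number of mismatches, and each distance is assumed to be measured from the end of one box to the start of the next.

```python
from dataclasses import dataclass

@dataclass
class StructuredMotif:
    boxes: list     # p box strings
    max_subs: list  # p mismatch bounds, one per box (assumed interpretation)
    gaps: list      # p - 1 (min, max) distances between successive boxes

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def occurs_at(motif, seq, start, box=0):
    """True if boxes[box:] match with boxes[box] beginning at index start."""
    word = motif.boxes[box]
    end = start + len(word)
    if end > len(seq) or hamming(word, seq[start:end]) > motif.max_subs[box]:
        return False
    if box == len(motif.boxes) - 1:
        return True
    lo, hi = motif.gaps[box]
    # Try every allowed distance to the next box.
    return any(occurs_at(motif, seq, end + gap, box + 1)
               for gap in range(lo, hi + 1))

def occurrences(motif, seq):
    """All start positions in seq where the whole structured motif occurs."""
    return [i for i in range(len(seq)) if occurs_at(motif, seq, i)]
```

As a usage illustration with made-up data, a two-box motif `StructuredMotif(["TTGACA", "TATAAT"], [1, 1], [(16, 18)])` is found at position 2 of the string `"GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"`.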

Extracting structured motifs using a suffix tree - algorithms and application to promoter consensus identification

by Laurent Marsan, Marie-France Sagot - In Proceedings of RECOMB 2000, 2000
Cited by 23 (1 self)

Inferring regulatory elements from a whole genome. An analysis of Helicobacter pylori sigma(80) family of promoter signals

by Anne Vanet, Laurent Marsan, Agnès Labigne, Marie-France Sagot - J. Mol. Biol., 2000
Cited by 21 (3 self)

Voting Algorithms for Discovering Long Motifs

by Francis Y. L. Chin, Henry C. M. Leung
Cited by 19 (6 self)
Pevzner and Sze [14] have introduced the Planted (l,d)-Motif Problem to find similar patterns (motifs) in sequences which represent the promoter region of co-regulated genes. Here l is the length of the motif and d is the maximum Hamming distance around the similar patterns. Many algorithms have been developed to solve this motif problem. However, these algorithms either have long running times or do not guarantee the motif can be found. In this paper, we introduce new algorithms to solve the motif problem. Our algorithms can find motifs in reasonable time for not only the challenging (9,2), (11,3), (15,5)-motif problems but for even longer motifs, say (20,7), (30,11) and (40,15), which have never been seriously attempted by other researchers because of heavy time and space requirements.

The at Most k-Deep Factor Tree

by Julien Allali, Marie-France Sagot, 2004
Cited by 13 (3 self)
We present a new data structure to index strings that is very similar to a suffix tree.

On the Parameterized Intractability of Closest Substring and Related Problems

by Michael R. Fellows, Jens Gramm, Rolf Niedermeier - In Proc. 19th STACS, volume 2285 of LNCS, 2002
Cited by 12 (4 self)
We show that Closest Substring, one of the most important problems in the field of biological sequence analysis, is W[1]-hard with respect to the number k of input strings (even over a binary alphabet). This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f and constant c independent of k; effectively, the problem can be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete and both possess polynomial time approximation schemes. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our main W[1]-hardness result generalizes to Consensus Patterns, a problem of similar significance in computational biology.

A highly scalable algorithm for the extraction of cis-regulatory regions

by Alexandra M. Carvalho, Ana T. Freitas, Arlindo L. Oliveira - In Proc. APBC'05, 2005
Cited by 10 (1 self)
In this paper we propose a new algorithm for identifying cis-regulatory modules in genomic sequences. In particular, the algorithm extracts structured motifs, defined as a collection of highly conserved regions with pre-specified sizes and spacings between them. This type of motif is extremely relevant in the research of gene regulatory mechanisms since it can effectively represent promoter models. The proposed algorithm uses a new data structure, called box-link, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. The complexity analysis shows a time and space gain over previous algorithms that is exponential in the spacings between binding sites. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than two orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensi.

Parameterized intractability of motif search problems

by Michael R. Fellows, Jens Gramm, Rolf Niedermeier - Combinatorica, 2002
Cited by 7 (3 self)
We show that Closest Substring, one of the most important problems in the field of consensus string analysis, is W[1]-hard when parameterized by the number k of input strings (and remains so, even over a binary alphabet). This is done by giving a "strongly structure-preserving" reduction from the graph problem Clique to Closest Substring. This problem is therefore unlikely to be solvable in time O(f(k) · n^c) for any function f of k and constant c independent of k, i.e., the combinatorial explosion seemingly inherent to this NP-hard problem cannot be restricted to parameter k. The problem can therefore be expected to be intractable, in any practical sense, for k ≥ 3. Our result supports the intuition that Closest Substring is computationally much harder than the special case of Closest String, although both problems are NP-complete. We also prove W[1]-hardness for other parameterizations in the case of unbounded alphabet size. Our W[1]-hardness result for Closest Substring generalizes to Consensus Patterns, a problem arising in computational biology.

Using Suffix Trees for Gapped Motif Discovery

by Emily Rocke - In Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, 2000
Cited by 6 (0 self)
Gibbs sampling is a local search method that can be used to find novel motifs in a text string. In previous work [8], we have proposed a modified Gibbs sampler that can discover novel gapped motifs of varying lengths and occurrence rates in DNA or protein sequences. The Gibbs sampling method requires repeated searching of the text for the best match to a constantly evolving collection of aligned strings, and each search pass previously required Θ(nl) time, where l is the length of the motif and n the length of the original sequence. This paper presents a novel method for using suffix trees to greatly improve the performance of the Gibbs sampling approach.
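The cost of one naive search pass, proportional to n times l, can be seen in a direct scan that scores every length-l window of the text against the current collection of aligned strings. The sketch below shows only that baseline; it is not the suffix-tree method the abstract announces. The scoring rule (summing, per column, how many aligned strings carry the window's character) and the function names are assumptions made for illustration.

```python
from collections import Counter

def column_counts(aligned):
    """One Counter per column of the equal-length aligned strings."""
    return [Counter(column) for column in zip(*aligned)]

def best_match(text, aligned):
    """Scan every window of text and return (start, score) of the best one.

    The double loop performs (n - l + 1) * l lookups, which is the
    per-pass cost the abstract refers to.
    """
    counts = column_counts(aligned)
    l = len(counts)
    best_start, best_score = -1, -1
    for start in range(len(text) - l + 1):
        score = sum(counts[j][text[start + j]] for j in range(l))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score
```

A sampler would call `best_match` once per iteration after updating the aligned strings, which is why reducing the cost of a single pass matters.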