RISOTTO: Fast extraction of motifs with mismatches
- PROCEEDINGS OF THE 7TH LATIN AMERICAN THEORETICAL INFORMATICS SYMPOSIUM, 3887 OF LNCS:757–768
, 2006
Cited by 16 (5 self)
We present in this paper an exact algorithm for motif extraction. Efficiency is achieved by means of an improvement in the algorithm and data structures that applies to the whole class of motif inference algorithms based on suffix trees. An average case complexity analysis shows a gain over the best known exact algorithm for motif extraction, when applied to extract long motifs. A full implementation was developed and made available online. Experimental results show that the proposed algorithm is more than two times faster than the best known exact algorithm for motif extraction, confirming in this way the theoretical results obtained.
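As a point of reference for the problem this abstract addresses, the sketch below is a brute-force baseline for extracting motifs with mismatches. It is not the suffix-tree algorithm of the paper. The function names, the `quorum` parameter (minimum number of sequences that must contain an occurrence) and the DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif, seq, e):
    """True if some window of seq is within e mismatches of motif."""
    k = len(motif)
    return any(hamming(motif, seq[i:i + k]) <= e
               for i in range(len(seq) - k + 1))


def extract_motifs(seqs, k, e, quorum, alphabet="ACGT"):
    """Brute force: test every k-mer over the alphabet against every sequence.

    Hypothetical baseline; its cost grows as len(alphabet) ** k, which is
    the kind of blow-up that suffix-tree based extractors are meant to avoid.
    """
    motifs = []
    for letters in product(alphabet, repeat=k):
        motif = "".join(letters)
        # Count the sequences holding at least one approximate occurrence.
        if sum(occurs(motif, s, e) for s in seqs) >= quorum:
            motifs.append(motif)
    return motifs
```

For example, `extract_motifs(["ACGT", "ACGA", "TTTT"], k=4, e=1, quorum=2)` keeps `"ACGT"`, because it matches the first sequence exactly and the second with one mismatch.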
Lossless filter for finding long multiple approximate repetitions using a new data structure, the bi-factor array
- STRING PROCESSING AND INFORMATION RETRIEVAL (SPIRE 2005)
, 2005
Cited by 11 (5 self)
Similarity search in texts, notably biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the resolution of the problem. However, previous filters were made for speeding up pattern matching, or for finding repetitions between two sequences or occurring twice in the same sequence. In this paper, we present an algorithm called NIMBUS for filtering sequences prior to finding repetitions occurring more than twice in a sequence or in more than two sequences. NIMBUS uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when a data set is preprocessed with NIMBUS before searching for functional elements with a multiple local alignment tool such as GLAM ([7]), the overall execution time can be reduced from 10 hours to 6 minutes while obtaining exactly the same results.
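The idea of indexing gapped seeds can be illustrated with a small sketch. It assumes that a gapped seed is a pair of k-mers separated by a fixed gap of g characters, and it uses a plain dictionary as a stand-in for the bi-factor array. The abstract does not describe the layout of that data structure, so the names and the representation here are hypothetical.

```python
from collections import defaultdict


def bifactor_index(text, k, g):
    """Map each (left k-mer, right k-mer) pair separated by g characters
    to the list of start positions where it occurs in text."""
    index = defaultdict(list)
    span = 2 * k + g  # total length covered by one gapped seed
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + g:i + span]
        index[(left, right)].append(i)
    return index


def repeated_bifactors(text, k, g, r):
    """Keep only the gapped seeds that occur at least r times.

    Positions not covered by any surviving seed are candidates for
    being filtered out before a more expensive repetition search.
    """
    return {bf: pos for bf, pos in bifactor_index(text, k, g).items()
            if len(pos) >= r}
```

With `text = "ACAGTACCGTACTGT"`, `k = 2` and `g = 1`, the pair `("AC", "GT")` is found at positions 0, 5 and 10 even though the middle character differs each time, which is the property that makes gapped seeds tolerant to mismatches.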
An efficient algorithm for the identification of structured motifs in DNA promoter sequences
- IEEE TRANS. COMPUT. BIOL. BIOINFORM
, 2006
Cited by 4 (1 self)
We propose a new algorithm for identifying cis-regulatory modules in genomic sequences. The proposed algorithm, named RISO, uses a new data structure, called boxlink, to store the information about conserved regions that occur in a well-ordered and regularly spaced manner in the dataset sequences. Such conserved regions, called structured motifs, are highly relevant to research on gene regulatory mechanisms, since they can effectively represent promoter models. The complexity analysis shows a time and space gain over the best known exact algorithms that is exponential in the spacings between binding sites. A full implementation of the algorithm was developed and made available online. Experimental results show that the algorithm is much faster than existing ones, sometimes by more than four orders of magnitude. The application of the method to biological datasets shows its ability to extract relevant consensus sequences.
An efficient multicore implementation of planted motif problem
- In Proceedings of the International Conference On High Performance Computing and Simulation
, 2010
Cited by 4 (1 self)
In this paper we propose a parallel algorithm for the planted motif problem that arises in computational biology. A variety of algorithms have been proposed in the literature to solve this problem. The drawback of all these algorithms is that they have been designed to work on serial computers and are not suitable for parallelization on current multicore architectures. We have implemented the proposed algorithm on a 4 Quad-Core Intel Xeon X5550 2.67GHz processor for a total of 16 cores. We compare our performance results with the best performance results reported in the literature and show that the performance of our algorithm scales linearly with the number of cores. We also solved the challenging (21, 8) instance on 16 cores in 6.9 hrs.
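For readers unfamiliar with the (l, d) notation, the following serial sketch solves tiny instances of the planted motif problem by enumerating the d-neighborhood of every l-mer in the first sequence and keeping the candidates that occur in all other sequences with at most d mismatches. This is a textbook-style approach chosen for illustration, not the parallel algorithm of the paper, and all names are assumptions. The neighborhood grows very quickly with l and d, so it is not practical for an instance such as (21, 8).

```python
from itertools import combinations, product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def neighbors(lmer, d, alphabet="ACGT"):
    """All strings within Hamming distance d of lmer."""
    result = {lmer}
    for r in range(1, d + 1):
        for positions in combinations(range(len(lmer)), r):
            # At each chosen position, try every letter except the original.
            choices = [[c for c in alphabet if c != lmer[p]] for p in positions]
            for subs in product(*choices):
                chars = list(lmer)
                for p, c in zip(positions, subs):
                    chars[p] = c
                result.add("".join(chars))
    return result


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """Return every l-mer that occurs in each sequence with <= d mismatches."""
    first, rest = seqs[0], seqs[1:]
    candidates = set()
    for i in range(len(first) - l + 1):
        candidates |= neighbors(first[i:i + l], d, alphabet)

    def in_seq(motif, seq):
        return any(hamming(motif, seq[j:j + l]) <= d
                   for j in range(len(seq) - l + 1))

    return sorted(m for m in candidates if all(in_seq(m, s) for s in rest))
```

Each candidate is checked independently of the others, so the candidate set could in principle be partitioned across cores. How the paper actually distributes the work is not stated in the abstract.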
G: Efficient composite pattern finding from monad patterns
- Int J Bioinf Res Appl
Cited by 3 (0 self)
Automatically identifying frequent composite patterns in DNA sequences is an important task in bioinformatics, especially when all the basic elements (or monad patterns) of a composite pattern are weak. In this paper, we compare one straightforward approach to assembling the monad patterns into composite patterns with two other, rather complex approaches. Both our theoretical analysis and empirical results show that this overlooked straightforward method can be several orders of magnitude faster. Furthermore, contrary to previous understanding, the empirical results show that the runtime superiority among the three approaches is closely related to the insignificance of the monad patterns.
Lossless filter for multiple repetitions with Hamming distance
, 2007
Cited by 2 (0 self)
Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the solution of the problem. However, previous filters were made for speeding up pattern matching, or for finding repetitions between two strings or occurring twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bi-factor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: when a data set is preprocessed with Nimbus before searching for functional elements with a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes.
Suffix Tree Characterization of Maximal Motifs in Biological Sequences
Cited by 2 (0 self)
Finding motifs in biological sequences is one of the most intriguing problems for designers of string algorithms due to, on the one hand, the numerous applications of this problem in molecular biology and, on the other hand, the challenging aspects of the computational problem. Indeed, when dealing with biological sequences it is necessary to work with approximations (that is, to identify fragments that are not necessarily identical, but just similar, according to a given similarity notion) and this complicates the problem. Existing algorithms run in time linear with respect to the input size. Nevertheless, the output size can be very large due to the approximation (namely exponential in the approximation degree). This often makes the output unreadable, besides slowing down the inference itself. A high degree of redundancy has been detected in the set of motifs that satisfy traditional requirements, even for exact motifs. Moreover, it has been observed many times that only a subset of these motifs, namely the maximal motifs, could be enough to provide the information of all of them. In this paper, we aim at removing such redundancy. We extend some notions of maximality already defined for exact motifs to the case of approximate motifs with Hamming distance, and we give a characterization of maximal motifs on the suffix tree. Given that this data structure is used by a whole class of motif extraction tools, we show how these tools can be modified to include the maximality requirement without changing the asymptotic complexity.