Algorithms for computing variants of the longest common subsequence problem
 In ISAAC
, 2006
Cited by 14 (5 self)
Abstract. The longest common subsequence (LCS) problem is one of the classical and well-studied problems in computer science. The computation of the LCS is a frequent task in DNA sequence analysis, and has applications to genetics and molecular biology. In this paper we define new variants, introducing the notion of gap-constraints in the LCS problem, and present efficient algorithms to solve them. The new variants are motivated by practical applications in molecular biology.
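To make the problem concrete, the following Python sketch shows the textbook dynamic program for the classical LCS length, followed by a naive gap-constrained variant. The gap-constrained definition used here (consecutive matches may skip at most `k` characters in each string) is an assumption made for illustration, as are the function names; the abstract does not spell out its variants, and this is not the paper's algorithm.

```python
def lcs_length(x, y):
    """Classical LCS length via the textbook O(n*m) dynamic program."""
    n, m = len(x), len(y)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]


def gapped_lcs_length(x, y, k):
    """Naive gap-constrained LCS length (assumed definition, see lead-in).

    end[i][j] is the length of the longest valid common subsequence whose
    last match pairs x[i] with y[j]. The previous match must lie at most
    k + 1 positions back in both strings. Runs in O(n * m * k^2) time.
    """
    n, m = len(x), len(y)
    end = [[0] * m for _ in range(n)]
    best = 0
    for i in range(n):
        for j in range(m):
            if x[i] != y[j]:
                continue
            prev = 0
            for pi in range(max(0, i - k - 1), i):
                for pj in range(max(0, j - k - 1), j):
                    prev = max(prev, end[pi][pj])
            end[i][j] = prev + 1
            best = max(best, end[i][j])
    return best
```

With `k = 0` the second function degenerates to the longest common substring, and a sufficiently large `k` recovers the classical LCS length.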
The gapped-factor tree
, 2006
Cited by 4 (0 self)
Abstract. We present a data structure to index a specific kind of factors, that is, of substrings, called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during the indexation. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed size of gap, and only those. The construction of this data structure is done on-line in O(n × |Σ|) time and space, with n the length of the text and |Σ| the size of the alphabet. Such a data structure may play an important role in some pattern matching and motif inference problems, for instance in text filtration.
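As a rough illustration of what indexing gapped-factors means, the Python sketch below builds a naive hash-map index in which each key is the pair of words on either side of a fixed-size gap. The parameter names `k`, `d` and `k2` (left word length, gap size, right word length) and the function name are assumptions for this example. The dictionary stands in for the suffix-tree-based structure described in the abstract and does not reproduce its on-line construction or its complexity bounds.

```python
from collections import defaultdict


def gapped_factor_index(text, k, d, k2):
    """Map each (left, right) word pair to the start positions where
    `left` (length k) is followed by d ignored characters and then
    `right` (length k2). Hypothetical helper, not the gapped-factor tree.
    """
    index = defaultdict(list)
    span = k + d + k2
    for i in range(len(text) - span + 1):
        left = text[i:i + k]
        right = text[i + k + d:i + span]  # characters in the gap are skipped
        index[(left, right)].append(i)
    return index


# Usage: positions 0 and 3 both carry 'a', one ignored character, then 'b'.
idx = gapped_factor_index("axbayb", 1, 1, 1)
print(idx[("a", "b")])
```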
Indexing gapped-factors using a tree
 INTERNATIONAL JOURNAL OF FOUNDATIONS OF COMPUTER SCIENCE
, 2008
Cited by 2 (0 self)
We present a data structure to index a specific kind of factors, that is, of substrings, called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during the indexation. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed size of gap, and only those. The construction of this data structure is done on-line in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
Lossless filter for multiple repetitions with Hamming distance
, 2007
Cited by 2 (0 self)
Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the solution of the problem. However, previous filters were made for speeding up pattern matching, or for finding repetitions between two strings or occurring twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bifactor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: preprocessing with Nimbus a data set where one wants to find functional elements using a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes.
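The idea of a lossless filter can be sketched with the classic q-gram counting lemma: two strings of equal length L that differ in at most k positions share at least L - q + 1 - k·q q-grams, so any pair below that threshold can be discarded without losing a true match. The Python sketch below illustrates this general principle only. It is not the Nimbus algorithm, it does not use bifactors or the bifactor array, and all function names are assumptions.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(ca != cb for ca, cb in zip(a, b))


def qgrams(s, q):
    """Multiset of all length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_filter(a, b, q, k):
    """Necessary condition for hamming(a, b) <= k (q-gram lemma).

    Returns False only when the pair certainly has more than k mismatches;
    a True result still has to be verified by an exact comparison.
    """
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    threshold = len(a) - q + 1 - k * q
    shared = sum((qgrams(a, q) & qgrams(b, q)).values())
    return shared >= threshold


# Usage: one mismatch survives the filter, a fully different pair does not.
print(passes_filter("ACGTACGT", "ACGTTCGT", 3, 1))
print(passes_filter("AAAAAAAA", "CCCCCCCC", 3, 1))
```

The filter is one-sided by design: it may let through pairs that later fail verification, but it never rejects a pair within the allowed distance.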
Lossless Filter for Long Multiple Repetitions with Edit Distance
 TECHNICAL REPORT
, 2011
Cited by 1 (0 self)
Identifying local similarity between two or more sequences, or identifying repetitions occurring at least twice in a sequence, is an essential part in the analysis of biological sequences and of their phylogenetic relationship. Finding fragments that are conserved among several given sequences, or inside a unique sequence, while allowing for a certain number of insertions, deletions, and substitutions, is however known to be a computationally expensive task, and consequently exact methods can usually not be applied in practice. The filter we introduce in this paper, called Ed’Nimbus, provides a possible solution to this problem. It can be used as a preprocessing step to any multiple alignment method, eliminating an important fraction of the input that is guaranteed not to contain any approximate repetition. It consists in the verification of a strong necessary condition. This condition concerns the number and order of exactly repeated words shared by the approximate repetitions. The efficiency of the filter is due to this condition, that we show how to check in a fast way. The speed of the method is achieved thanks also to the use of a simple and efficient data structure, that we describe in this paper, as well as its linear time and space construction. Our results show that using Ed’Nimbus allows us to sensibly reduce the
Compressed Spaced Suffix Arrays
, 2014
Cited by 1 (1 self)
Spaced seeds are important tools for similarity search in bioinformatics, and using several seeds together often significantly improves their performance. With existing approaches, however, for each seed we keep a separate linear-size data structure, either a hash table or a spaced suffix array (SSA). In this paper we show how to compress SSAs relative to normal suffix arrays (SAs) and still support fast random access to them. We first prove a theoretical upper bound on the space needed to store an SSA when we already have the SA. We then present experiments indicating that our approach works even better in practice.
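For readers unfamiliar with the two arrays being compared, the Python sketch below builds both naively by sorting. It assumes that an SSA orders text positions by the characters found under the seed's care positions (the `1` symbols of the seed), breaking ties by position; this definition, the seed encoding as a string of `0`/`1`, and the function names are assumptions for illustration. The sketch shows only what the arrays contain, not the compression scheme of the paper.

```python
def suffix_array(text):
    """Plain suffix array by direct sorting (quadratic space in the worst
    case, fine for small examples)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def spaced_suffix_array(text, seed):
    """Positions sorted by the characters under the seed's care positions.

    `seed` is a string over {'0', '1'}; '1' marks a care position. Care
    positions that fall past the end of the text are dropped. Ties are
    broken by text position (assumed convention).
    """
    care = [off for off, c in enumerate(seed) if c == "1"]

    def key(i):
        masked = "".join(text[i + off] for off in care if i + off < len(text))
        return (masked, i)

    return sorted(range(len(text)), key=key)


# Usage: the two orders for the same text can be compared side by side.
print(suffix_array("banana"))
print(spaced_suffix_array("banana", "101"))
```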
Indexing gapped-factors using a tree
 International Journal of Foundations of Computer Science, © World Scientific Publishing Company
We present a data structure to index a specific kind of factors, that is, of substrings, called gapped-factors. A gapped-factor is a factor containing a gap that is ignored during the indexation. The data structure presented is based on the suffix tree and indexes all the gapped-factors of a text with a fixed size of gap, and only those. The construction of this data structure is done on-line in linear time and space. Such a data structure may play an important role in various pattern matching and motif inference problems, for instance in text filtration.
Lossless filter for multiple repetitions with Hamming distance
 JOURNAL OF DISCRETE ALGORITHMS
, 2008
Similarity search in texts, notably in biological sequences, has received substantial attention in the last few years. Numerous filtration and indexing techniques have been created in order to speed up the solution of the problem. However, previous filters were made for speeding up pattern matching, or for finding repetitions between two strings or occurring twice in the same string. In this paper, we present an algorithm called Nimbus for filtering strings prior to finding repetitions occurring twice or more in a string, or in two or more strings. Nimbus uses gapped seeds that are indexed with a new data structure, called a bifactor array, that is also presented in this paper. Experimental results show that the filter can be very efficient: preprocessing with Nimbus a data set where one wants to find functional elements using a multiple local alignment tool such as Glam, the overall execution time can be reduced from 7.5 hours to 2 minutes. Key words: approximate repetitions, k-factors, multiple local alignment, bifactors, bifactor array
Project-Team sequoia - Algorithms for large-scale sequence analysis for molecular biology - INRIA Activity Report
, 2008
The main goal of the SEQUOIA project-team is to define appropriate combinatorial models and efficient algorithms for large-scale sequence analysis in molecular biology. Emphasis is placed on the annotation of non-coding regions in genomes (RNA genes and regulatory sequences) via comparative genomics methods. This task involves several complementary issues such as sequence comparison; prediction, analysis and manipulation of RNA secondary structures; and identification and processing of regulatory sequences. Efficient algorithms and parallelism on high-performance computing architectures make large-scale instances of such issues tractable. Our aim is to tackle all those issues in an integrated fashion and to put together the developed software tools into a common platform for annotation of non-coding regions. We also explore complementary problems of protein sequence analysis. Those include new approaches to protein sequence comparison on the one hand, and a system for storing and manipulating non-ribosomal peptides on the other hand. Special attention is given to the development of robust software, its validation on biological data and its availability from the software platform of the team and by other means. Most research projects are carried out in collaboration with biologists.