Detection of significant patterns by compression algorithms: the case of Approximate Tandem Repeats in DNA sequences.
, 1997
Cited by 18 (3 self)
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property: the more a sequence is compressed by the algorithm, the more significant the property is for that sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence-DNA): approximate tandem repeats of small motifs (i.e. of lengths < 4). This algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes. The algorithms, written in C, are available on the World Wide Web (URL: http://www.lifl.fr/ rivals/Doc/RTA/ ). Introduction: A compression algorithm detects significant patterns in a text if it encodes such patterns and thereby achieves a concise description of the whole text. The shorter the output description is, the more significant the patterns are. Consequently for a given text, the significance of the detect...
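The compression-as-significance idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it handles exact tandem repeats only, assumes a flat 2 bits per nucleotide for the uncompressed text, and uses a naive integer code for the segment length. The function name `exact_tandem_gain` and the cost model are hypothetical, chosen only for illustration.

```python
import math

def exact_tandem_gain(segment: str, motif: str):
    """Bits saved by coding `segment` as (motif, length) rather than 2 bits/base.

    Returns None when `segment` is not an exact tandem repeat of `motif`
    with at least two copies. The cost model is an illustrative assumption.
    """
    if not motif:
        return None
    copies, rest = divmod(len(segment), len(motif))
    if copies < 2 or segment != motif * copies + motif[:rest]:
        return None
    raw_bits = 2 * len(segment)                       # 4-letter alphabet, 2 bits each
    coded_bits = 2 * len(motif) + math.ceil(math.log2(len(segment) + 1))
    return raw_bits - coded_bits                      # larger gain = more significant

print(exact_tandem_gain("ACACACACACAC", "AC"))
```

Under this toy model a longer repeat of the same motif yields a larger gain, which mirrors the claim that stronger compression indicates a more significant pattern.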
Sequence Alignment with Tandem Duplication
- J. Comp. Biol
, 1997
Cited by 17 (2 self)
Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events, which assumes that modification of sequences proceeds through any of the operations of substitution, insertion or deletion (the latter two collectively termed indels).
Identifying Satellites in Nucleic Acid Sequences
, 1998
Cited by 16 (3 self)
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. Thus it has the advantage over previous work of being able to report a consensus model, say m, of the repeated un...
Identifying Satellites and Periodic Repetitions in Biological Sequences
, 1998
Cited by 10 (2 self)
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repe...
Approximate String Matching with Gaps
, 2002
Cited by 9 (1 self)
In this paper we consider several new versions of approximate string matching with gaps. The main characteristic of these new versions is the existence of gaps in the matching of a given pattern in a text. Algorithms are sketched for each version and their time and space complexity is stated. The specific versions of approximate string matching have various applications in computerized music analysis.
Tandem cyclic alignment
- Proceedings of the 12th annual symposium on combinatorial pattern matching, LNCS
, 2000
Cited by 4 (1 self)
We present a solution for the following problem. Given two sequences X = x_1 x_2 ... x_n and Y = y_1 y_2 ... y_m, n ≤ m, find the best scoring alignment of X' = X^k[i] vs. Y over all possible pairs (k, i), for k = 1, 2, ... and 1 ≤ i ≤ n, where X[i] is the cyclic permutation of X, X^k[i] is the concatenation of k complete copies of X[i] (k tandem copies), and the alignment must include all of Y and all of X'. Our algorithm allows any alignment scoring scheme with additive gap costs and runs in time O(nm log n). We have used it to identify related tandem repeats in the C. elegans genome as part of the development of a multi-genome database of tandem repeats.
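To make the problem statement concrete, here is a brute-force sketch that tries every rotation X[i] and every copy count k, and scores each candidate X^k[i] against Y with a global alignment. It is not the O(nm log n) algorithm from the abstract. It assumes unit-cost edit distance as the scoring scheme (lower is better), a 0-based rotation offset instead of the 1-based i used above, and a hypothetical cap `max_k` on the number of copies.

```python
def edit_distance(a: str, b: str) -> int:
    """Global alignment cost with unit substitution, insertion and deletion costs."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]

def best_tandem_cyclic(x: str, y: str, max_k=None):
    """Return (cost, k, i) minimising the cost of aligning k copies of rotation i of x with y."""
    n, m = len(x), len(y)
    if max_k is None:
        max_k = m // n + 1          # assumed search bound, not taken from the paper
    best = None
    for i in range(n):              # 0-based rotation offset
        rotated = x[i:] + x[:i]
        for k in range(1, max_k + 1):
            cost = edit_distance(rotated * k, y)
            if best is None or cost < best[0]:
                best = (cost, k, i)
    return best

print(best_tandem_cyclic("ACG", "CGACGA"))
```

The exhaustive search costs roughly O(n · max_k) full alignments, which is why a dedicated algorithm is needed for genome-scale inputs.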
Pattern Inference under many Guises
Cited by 2 (0 self)
This paper surveys some of the main combinatorial methods for inferring patterns from a string, or a set of strings. The types of problems that will be addressed are repeat identification and common pattern inference. The strings that will concern us represent biological entities: nucleic acid and protein sequences or, in some cases, structures. As is well known, exact ("identical") patterns hardly make sense in biology; we consider here two types of similar ("nonidentical") patterns. One comes from looking at what "hides" behind each letter of the DNA/RNA or protein alphabet, while the other corresponds to the more familiar notion of "errors". The errors concern mutational events that may affect a molecule during DNA replication.
Three Heuristics for δ-Matching: δ-BM Algorithms
, 2002
Cited by 1 (1 self)
We consider a version of pattern matching useful in processing large musical data: delta-matching, which consists in finding matches which are delta-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a - b|. We present delta-matching algorithms that are fast on average provided that the pattern is "non-flat" and the alphabet interval is large. The pattern is "flat" if its structure does not vary substantially. We also consider (delta, gamma)-matching, where gamma is a bound on the total number of errors. The algorithms, named delta-BM1, delta-BM2 and delta-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. The algorithms are fast on average. This is the first paper on the subject; previously only "occurrence heuristics" have been considered. Our heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use delta-versions of...
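The matching criterion itself can be sketched with a naive scan over integer sequences such as MIDI pitches. This is not one of the delta-BM heuristics, which skip positions in Boyer-Moore fashion. It only checks the definition at every shift. It also assumes that the "total number of errors" bounded by gamma is the sum of the absolute differences |a - b| over the window, and the function name `delta_gamma_matches` is hypothetical.

```python
def delta_gamma_matches(text, pattern, delta, gamma=None):
    """Return start positions where pattern delta-matches text.

    A window matches when every symbol differs by at most `delta`; if `gamma`
    is given, the summed absolute differences must also be at most `gamma`
    (assumed reading of the gamma bound).
    """
    m = len(pattern)
    if m == 0:
        return []
    hits = []
    for start in range(len(text) - m + 1):
        diffs = [abs(text[start + j] - pattern[j]) for j in range(m)]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            hits.append(start)
    return hits

melody = [60, 62, 64, 65, 67, 60, 63, 64]
print(delta_gamma_matches(melody, [60, 62, 64], delta=1))
print(delta_gamma_matches(melody, [60, 62, 64], delta=1, gamma=0))
```

The scan takes O(nm) time in the worst case, which is the baseline that skipping heuristics aim to beat on average.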
Marie-France Sagot
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. Thus it has the advantage over previous work of being able to report a consensus model, say m, of the repeated un...

