Results 1-10 of 17
Detection of significant patterns by compression algorithms: the case of Approximate Tandem Repeats in DNA sequences.
, 1997
Cited by 21 (3 self)
We use compression algorithms to analyse genetic sequences. The basic idea is that a compression algorithm is associated with a property: the more the algorithm compresses a sequence, the more significant that property is for the sequence. Here we present an algorithm to detect a particular type of dosDNA (Defined Ordered Sequence DNA): approximate tandem repeats of small motifs (i.e. of lengths < 4). The algorithm has been tested on four yeast chromosomes. The presence of approximate tandem repeats seems to be a uniform structural property of yeast chromosomes. The algorithms, written in C, are available on the World Wide Web (URL: http://www.lifl.fr/ rivals/Doc/RTA/ ).
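The idea of scoring a property by how much it compresses a sequence can be illustrated with a minimal Python sketch. It is not the authors' C implementation: the function name `tandem_repeat_gain`, the 2-bits-per-base baseline, and the substitution-only cost model are all assumptions made for illustration.

```python
import math


def tandem_repeat_gain(seq: str, motif: str) -> float:
    """Bits saved by coding seq as 'motif repeated + substitutions'.

    Hypothetical cost model (an assumption, not the paper's):
      baseline : 2 bits per nucleotide
      encoded  : the motif itself, the region length, and for every
                 mismatch its position plus one of 3 other bases.
    Indels are not handled in this sketch.
    """
    n = len(seq)
    baseline = 2.0 * n
    # Perfect tandem repeat of the motif, cut to the region length.
    perfect = (motif * (n // len(motif) + 1))[:n]
    mismatches = sum(1 for a, b in zip(seq, perfect) if a != b)
    # Bits needed to write a number between 0 and n.
    pos_bits = math.ceil(math.log2(n + 1))
    cost = 2.0 * len(motif) + pos_bits + mismatches * (pos_bits + math.log2(3))
    return baseline - cost


perfect_gain = tandem_repeat_gain("ACACACACACACACAC", "AC")
mutated_gain = tandem_repeat_gain("ACACACATACACACAC", "AC")
random_gain = tandem_repeat_gain("ACGTTGCAAGCTTCGA", "AC")
```

Under this toy model a positive gain suggests that the region is better explained as an approximate tandem repeat than as arbitrary sequence, while a negative gain suggests the opposite.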
Sequence Alignment with Tandem Duplication
 J. Comp. Biol
, 1997
Cited by 19 (2 self)
Algorithm development for comparing and aligning biological sequences has, until recently, been based on the SI model of mutational events, which assumes that modification of sequences proceeds through any of the operations of substitution, insertion or deletion (the latter two collectively termed indels).
Identifying Satellites in Nucleic Acid Sequences
, 1998
Cited by 16 (3 self)
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. Thus it has the advantage over previous work of being able to report a consensus model, say m, of the repeated un...
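The relation between a consensus model and its approximate copies can be sketched in Python. This is only an illustration of the error bound, not the filter or the model search described in the abstract; the helper names are hypothetical, and Hamming distance (substitutions only) stands in for whatever distance the paper uses.

```python
import math


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def count_tandem_units(seq: str, model: str, eps: float, start: int = 0) -> int:
    """Count consecutive units, from `start`, within eps*len(model) of model.

    Assumption: units all have the model's length and differ from it by
    substitutions only; indels are ignored in this sketch.
    """
    p = len(model)
    max_err = math.floor(eps * p + 1e-9)  # tolerate float rounding
    count, pos = 0, start
    while pos + p <= len(seq) and hamming(seq[pos:pos + p], model) <= max_err:
        count += 1
        pos += p
    return count


model = "ACGTACGTAC"
units = ["ACGTACGTAC", "ACGTTCGTAC", "ACCTACGTAA", "TTTTTTTTTT"]
seq = "".join(units)

copies_at_20 = count_tandem_units(seq, model, 0.20)  # up to 2 errors per unit
copies_at_10 = count_tandem_units(seq, model, 0.10)  # up to 1 error per unit
# Two units that are each within eps of the model can differ from each
# other by up to 2*eps (triangle inequality).
pairwise = hamming(units[1], units[2])
```

With ε = 20% the first three units are accepted, and the second and third differ from each other in three positions, more than either differs from the model but within the 2ε bound.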
Identifying Satellites and Periodic Repetitions in Biological Sequences
, 1998
Cited by 14 (2 self)
We present in this paper an algorithm for identifying satellites in DNA sequences. Satellites (simple, micro, or mini) are repeats in number between 30 and as many as 1,000,000 whose lengths vary between 2 and hundreds of base pairs and that appear, with some mutations, in tandem along the sequence. We concentrate here on short to moderately long (up to 30-40 base pairs) approximate tandem repeats where copies may differ up to ε = 15-20% from a consensus model of the repeating unit (implying individual units may vary by 2ε from each other). The algorithm is composed of two parts. The first one consists of a filter that basically eliminates all regions whose probability of containing a satellite is less than one in 10^4 when ε = 10%. The second part realizes an exhaustive exploration of the space of all possible models for the repeating units present in the sequence. It therefore has the advantage over previous work of being able to report a consensus model, say m, of the repe...
Approximate String Matching with Gaps
, 2002
Cited by 10 (1 self)
In this paper we consider several new versions of approximate string matching with gaps. The main characteristic of these new versions is the existence of gaps in the matching of a given pattern in a text. Algorithms are sketched for each version and their time and space complexities are stated. These versions of approximate string matching have various applications in computerized music analysis.
Tandem cyclic alignment
 Proceedings of the 12th annual symposium on combinatorial pattern matching, LNCS
, 2000
Cited by 5 (1 self)
We present a solution for the following problem. Given two sequences X = x_1 x_2 ··· x_n and Y = y_1 y_2 ··· y_m, n ≤ m, find the best scoring alignment of X′ = X^k[i] vs. Y over all possible pairs (k, i), for k = 1, 2, ... and 1 ≤ i ≤ n, where X[i] is the cyclic permutation of X, X^k[i] is the concatenation of k complete copies of X[i] (k tandem copies), and the alignment must include all of Y and all of X′. Our algorithm allows any alignment scoring scheme with additive gap costs and runs in time O(nm log n). We have used it to identify related tandem repeats in the C. elegans genome as part of the development of a multigenome database of tandem repeats.
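The problem statement can be made concrete with a brute-force Python sketch that tries every rotation and every copy count and scores each candidate with a standard global alignment. This is deliberately not the O(nm log n) algorithm of the paper; the scoring values (+1 match, -1 mismatch, -1 per gap symbol), the cap `max_k`, and the 0-based rotation offset are assumptions chosen for illustration.

```python
def global_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Needleman-Wunsch global alignment score with additive gap costs."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def best_tandem_cyclic(x: str, y: str, max_k=None):
    """Return (score, k, i) maximising the alignment of k copies of the
    rotation of x starting at 0-based offset i against all of y.

    Brute force: one full alignment per (k, i) pair.
    """
    n = len(x)
    if max_k is None:
        max_k = 2 * len(y) // n + 1  # hypothetical cap on the copy count
    best = None
    for i in range(n):
        rot = x[i:] + x[:i]
        for k in range(1, max_k + 1):
            s = global_score(rot * k, y)
            if best is None or s > best[0]:
                best = (s, k, i)
    return best


result = best_tandem_cyclic("ACG", "CGACGACGA")
```

Here `result` is `(9, 3, 1)`: three tandem copies of the rotation CGA reproduce Y exactly. The offset 1 is 0-based, so it corresponds to i = 2 in the 1-based indexing of the abstract.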
Pattern Inference under many Guises
Cited by 2 (0 self)
This paper surveys some of the main combinatorial methods for inferring patterns from a string, or a set of strings. The types of problems that will be addressed are repeat identification and common pattern inference. The strings that will concern us represent biological entities, nucleic acid and protein sequences or, in some cases, structures. As is well-known, exact ("identical") patterns hardly make sense in biology; we consider here two types of similar ("non-identical") patterns. One comes from looking at what "hides" behind each letter of the DNA/RNA or protein alphabet while the other corresponds to the more familiar notion of "errors". The errors concern mutational events that may affect a molecule during DNA replication.
Three Heuristics for δ-Matching: δ-BM Algorithms
, 2002
Cited by 1 (1 self)
We consider a version of pattern matching useful in processing large musical data: δ-matching, which consists in finding matches which are δ-approximate in the sense of the distance measured as maximum difference between symbols. The alphabet is an interval of integers, and the distance between two symbols a, b is measured as |a - b|. We present δ-matching algorithms that are fast on average provided that the pattern is "non-flat" and the alphabet interval is large. The pattern is "flat" if its structure does not vary substantially. We also consider (δ, γ)-matching, where γ is a bound on the total number of errors. The algorithms, named δ-BM1, δ-BM2 and δ-BM3, can be thought of as members of the generalized Boyer-Moore family of algorithms. The algorithms are fast on average. This is the first paper on the subject; previously only "occurrence heuristics" have been considered. Our heuristics are much stronger and refer to larger parts of texts (not only to single positions). We use δ-versions of...
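The matching condition itself can be shown with a naive Python scan over all alignments. This is not one of the Boyer-Moore style heuristics from the paper, only the definition they speed up. The function name is hypothetical, and reading the γ bound as a limit on the sum of the absolute differences is an assumption, since the abstract does not spell it out.

```python
def delta_matches(text, pattern, delta, gamma=None):
    """Start positions where every symbol differs from the pattern by <= delta.

    text and pattern are sequences of integers (e.g. MIDI pitches).
    If gamma is given, the summed absolute differences must also be
    <= gamma (an assumed reading of the abstract's gamma bound).
    Naive O(len(text) * len(pattern)) scan.
    """
    m = len(pattern)
    if m == 0:
        return []
    hits = []
    for s in range(len(text) - m + 1):
        diffs = [abs(text[s + j] - pattern[j]) for j in range(m)]
        if max(diffs) <= delta and (gamma is None or sum(diffs) <= gamma):
            hits.append(s)
    return hits


melody = [60, 62, 64, 65, 67, 61, 63, 64]
motif = [60, 62, 64]

loose = delta_matches(melody, motif, delta=1)            # [0, 5]
strict = delta_matches(melody, motif, delta=1, gamma=1)  # [0]
```

The window starting at position 5 is (61, 63, 64), which is within 1 of the motif at every position but has a total difference of 2, so it is rejected once γ = 1 is imposed.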
Bioinformatics
, 2003
Selection of significant genes via expression patterns is an important problem in microarray experiments. Owing to small sample size and the large number of variables (genes), the selection process can be unstable. This paper proposes a hierarchical Bayesian model for gene (variable) selection. We employ latent variables to specialize the model to a regression setting and use a Bayesian mixture prior to perform the variable selection. We control the size of the model by assigning a prior distribution over the dimension (number of significant genes) of the model. The posterior distributions of the parameters are not in explicit form, so we use a combination of truncated sampling and Markov Chain Monte Carlo (MCMC) based computation techniques to simulate the parameters from the posteriors. The Bayesian model is flexible enough to identify significant genes as well as to perform future predictions. The method is applied to cancer classification via cDNA microarrays where the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the method is used to identify a set of significant genes. The method is also applied successfully to the leukemia data.