Results 1 - 7 of 7
The Transformation Distance : A Dissimilarity Measure Based On Movements Of Segments
, 1998
Abstract

Cited by 9 (0 self)
Evolution acts in several ways on DNA: either by mutating a base, or by inserting, deleting or copying a segment of the sequence [17, 18, ?]. Classical alignment methods deal with point mutations [19]; genome-level mutations are studied using genome rearrangement distances [1, 2, 8, 9]. Those distances are mostly evaluated by a number of transpositions of genes. Here we define a new distance, called transformation distance, which quantifies the dissimilarity between two sequences in terms of segment-based events (without requiring a preliminary identification of genes). Those events are weighted by their description length. The transformation distance from S to T is the Minimum Description Length among all possible scripts that build the sequence T knowing the sequence S with segment-based operations. The underlying idea is related to Kolmogorov complexity theory. Herein, we focus on the case where segment-copy, reverse-copy and insertion operations are allowed. We present an algorithm which computes the transformation distance. A biological application on Tnt1 tobacco retrotransposon is presented.
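The idea of a minimum-description-length script can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the algorithm from the paper: the function name `transformation_cost` and the bit costs are hypothetical, and "reverse-copy" is taken to mean plain reversal of a segment of S (for DNA one might prefer the reverse complement).

```python
import math

def transformation_cost(s: str, t: str) -> float:
    """Cost in bits of the cheapest script that builds t knowing s.

    Assumed cost model (hypothetical, not taken from the paper):
      - every operation pays log2(3) bits to name its type;
      - an insertion of k bases pays a length field plus 2 bits per base;
      - a copy or reverse-copy pays a start position in s plus a length field.
    """
    n = len(t)
    op_bits = math.log2(3)
    pos_bits = math.log2(max(len(s), 2))   # fixed-width position in s
    len_bits = math.log2(max(n, 2))        # fixed-width segment length
    rev = s[::-1]                          # source for reverse-copies

    # best[j] = cheapest cost of producing the prefix t[:j]
    best = [math.inf] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        for j in range(i + 1, n + 1):
            seg = t[i:j]
            # Insertion is always possible: spell the segment out base by base.
            insert = best[i] + op_bits + len_bits + 2 * (j - i)
            best[j] = min(best[j], insert)
            # Copy (seg occurs in s) or reverse-copy (seg occurs in reversed s).
            if seg in s or seg in rev:
                copy = best[i] + op_bits + pos_bits + len_bits
                best[j] = min(best[j], copy)
    return best[n]

print(transformation_cost("ACGTACGT", "ACGTACGT"))
print(transformation_cost("ACGTACGT", "TTTTTTTT"))
```

Under this cost model a target that can be copied from S in one operation is much cheaper than one that has to be inserted base by base, which is the asymmetry the distance is meant to capture. The substring tests make the sketch roughly cubic in the length of T, so it is suitable for short examples only.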
Sequence Complexity for Biological Sequence Analysis
, 2000
Abstract

Cited by 8 (0 self)
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence, and it is shown how to compute the total probability of the data given the model. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view and it is argued that this is a good way to tackle intelligent sequence analysis in general.
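To make "information content, either in total or base by base" concrete, here is a minimal sketch that charges each base -log2 of its predicted probability. It deliberately swaps in a much simpler predictor than the approximate-repeat model described above: an adaptive order-k context model with add-one smoothing. The function name `per_base_information` and its parameters are assumptions made for illustration.

```python
import math
from collections import defaultdict

def per_base_information(seq, order=2, alphabet="ACGT"):
    """Return the cost in bits of each base under an adaptive order-k model.

    Assumption: this is a plain context model, not the repeat model from
    the abstract. Every symbol of seq must belong to `alphabet`.
    """
    # counts[context][base], initialised to 1 for add-one smoothing
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    bits = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        table = counts[context]
        total = sum(table.values())
        bits.append(-math.log2(table[base] / total))
        table[base] += 1   # update only after the base has been charged
    return bits

costs = per_base_information("AT" * 50)
print(sum(costs), "bits in total, versus", 2 * len(costs), "uncompressed")
```

A sequence with little structure stays near 2 bits per base under such a model, while a repetitive one falls well below that; plotting the per-base costs along the sequence is one simple way to look for low-information regions.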
Compression and Approximate Matching
 The Computer Journal
, 1999
Abstract

Cited by 5 (2 self)
A population of sequences is called nonrandom if there is a statistical model and an associated compression algorithm that allows members of the population to be compressed, on average. Any available statistical model of a population should be incorporated into algorithms for alignment of the sequences and doing so changes the rank order of possible alignments in general. The model should also be used in deciding if a resulting approximate match between two sequences is significant or not. It is shown how to do this for two plausible interpretations involving pairs of sequences that might or might not be related. Efficient alignment algorithms are described for quite general statistical models of sequences. The new alignment algorithms are more sensitive to what might be termed 'features' of the sequences. A natural significance test is shown to be rarely fooled by apparent similarities between two sequences that are merely typical of all or most members of the population, even unrelated members. The Computer Journal, Volume 42, Issue 1, pp. 1-10, 1999. http://www.csse.monash.edu.au/~lloyd/tildeStrings/
Compression of Strings with Approximate Repeats
 Intelligent Systems in Molecular Biology, ISMB ’98
, 1998
Abstract

Cited by 5 (3 self)
We describe a model for strings of characters that is loosely based on the Lempel-Ziv model with the addition that a repeated substring can be an approximate match to the original substring; this is close to the situation of DNA, for example. Typically there are many explanations for a given string under the model, some optimal and many suboptimal. Rather than commit to one optimal explanation, we sum the probabilities over all explanations under the model because this gives the probability of the data under the model. The model has a small number of parameters and these can be estimated from the given string by an expectation-maximization (EM) algorithm. Each iteration of the EM algorithm takes O(n^2) time and a few iterations are typically sufficient. O(n^2) complexity is impractical for strings of more than a few tens of thousands of characters and a faster approximation algorithm is also given. The model is further extended to include approximate reverse complementary repeats when analyzing DNA strings. Tests include the recovery of parameter estimates from known sources and applications to real DNA strings. http://www.csse.monash.edu.au/~lloyd/tildeStrings/Compress/1998ISMB.html
Discovering Patterns In Plasmodium Falciparum genomic DNA
, 2001
Abstract

Cited by 4 (1 self)
A method has been developed for discovering patterns in DNA sequences. Loosely based on the well-known Lempel-Ziv model for text compression, the model detects repeated sequences in DNA. The repeats can be forward or inverted, and they need not be exact. The method is particularly useful for detecting distantly related sequences, and for finding patterns in sequences of biased nucleotide composition, where spurious patterns are often observed because the bias leads to coincidental nucleotide matches. We show here the utility of the method by applying it to genomic sequences of Plasmodium falciparum. A single scan of chromosomes 2 and 3 of P. falciparum, using our method and no other a priori information about the sequences, reveals regions of low complexity in both telomeric and central regions, long repeats in the subtelomeric regions, and shorter repeat areas in dense coding regions. Application of the method to a recently sequenced contig of chromosome 10 that has a particularly biased base composition detects a long internal repeat more readily than does the conventional dot matrix plot. Space requirements are linear, so the method can be used on large sequences. The observed repeat patterns may be related to large-scale chromosomal organization and control of gene expression. The method has general application in detecting patterns of potential interest in newly sequenced genomic material.
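As a rough illustration of scanning a sequence once for forward and inverted repeats, the sketch below indexes exact k-mers and reports pairs of positions that share a k-mer or its reverse complement. This is a swapped-in simplification: it finds exact seeds only, whereas the method in the abstract also handles inexact repeats. The names `seed_repeats` and `reverse_complement` and the default `k` are hypothetical.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def seed_repeats(seq, k=8):
    """Return (forward, inverted) lists of (earlier_pos, later_pos) pairs.

    forward:  the k-mer at later_pos already occurred at earlier_pos.
    inverted: the reverse complement of the k-mer at later_pos occurred
              at earlier_pos.
    Assumption: exact k-mer seeds only, no approximate matching.
    """
    index = defaultdict(list)   # k-mer -> earlier start positions
    forward, inverted = [], []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        forward.extend((j, i) for j in index.get(kmer, ()))
        inverted.extend((j, i) for j in index.get(reverse_complement(kmer), ()))
        index[kmer].append(i)   # index after lookup, so pairs are never (i, i)
    return forward, inverted

print(seed_repeats("AAACTTAAAC", k=4))
print(seed_repeats("AAACGTTT", k=4))
```

The index holds one entry per k-mer occurrence, so its size grows linearly with the sequence, although the lists of reported pairs can grow faster on highly repetitive input.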
CONSERV: a tool for finding exact matching conserved sequences in biological sequences
 Genome Informatics
"... We initiate a new class of string matching problems called Substring Compression Problems. Given a string ..."