Results 1 - 5 of 5
The Transformation Distance : A Dissimilarity Measure Based On Movements Of Segments
, 1998
Abstract

Cited by 9 (0 self)
Evolution acts on DNA in several ways: by mutating a base, or by inserting, deleting or copying a segment of the sequence [17, 18, ?]. Classical alignment methods deal with point mutations [19]; genome-level mutations are studied using genome rearrangement distances [1, 2, 8, 9]. Those distances are mostly evaluated by a number of transpositions of genes. Here we define a new distance, called the transformation distance, which quantifies the dissimilarity between two sequences in terms of segment-based events (without requiring a preliminary identification of genes). Those events are weighted by their description length. The transformation distance from S to T is the Minimum Description Length among all possible scripts that build the sequence T, knowing the sequence S, with segment-based operations. The underlying idea is related to Kolmogorov complexity theory. Herein, we focus on the case where segment-copy, reverse-copy and insertion operations are allowed. We present an algorithm which computes the transformation distance. A biological application to the Tnt1 tobacco retrotransposon is presented.
Sequence Complexity for Biological Sequence Analysis
, 2000
Abstract

Cited by 8 (0 self)
A new statistical model for DNA considers a sequence to be a mixture of regions with little structure and regions that are approximate repeats of other subsequences, i.e. instances of repeats do not need to match each other exactly. Both forward and reverse-complementary repeats are allowed. The model has a small number of parameters which are fitted to the data. In general there are many explanations for a given sequence, and it is shown how to compute the total probability of the data given the model. Computer algorithms are described for these tasks. The model can be used to compute the information content of a sequence, either in total or base by base. This amounts to looking at sequences from a data-compression point of view, and it is argued that this is a good way to tackle intelligent sequence analysis in general.
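To illustrate what "information content, either in total or base by base" can mean in practice, here is a minimal Python sketch. It does not implement the approximate-repeat model of the abstract; it swaps in a much simpler adaptive order-k context model with add-one smoothing, and the function name `per_base_information` and its parameters are hypothetical. Each base is charged -log2 of the probability the model assigned to it before seeing it.

```python
import math
from collections import defaultdict


def per_base_information(seq, order=2, alphabet="ACGT"):
    """Return the cost in bits of each base under an adaptive order-k model.

    Assumption: this is a stand-in context model, not the repeat model
    from the abstract. Bases outside `alphabet` (e.g. 'N') are not handled.
    """
    # Every context starts with a count of 1 per symbol (add-one smoothing).
    counts = defaultdict(lambda: dict.fromkeys(alphabet, 1))
    bits = []
    for i, base in enumerate(seq):
        context = seq[max(0, i - order):i]
        ctx = counts[context]
        total = sum(ctx.values())
        # Cost of this base given everything seen so far.
        bits.append(-math.log2(ctx[base] / total))
        # Update the model only after the base has been charged.
        ctx[base] += 1
    return bits


if __name__ == "__main__":
    costs = per_base_information("AC" * 50)
    print(f"total: {sum(costs):.1f} bits, mean: {sum(costs) / len(costs):.2f} bits/base")
```

A sequence with no exploitable structure costs about 2 bits per base under a four-letter alphabet; a mean clearly below that indicates the model has found regularity, which is the data-compression view the abstract argues for.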
Compression and Approximate Matching
 The Computer Journal
, 1999
Abstract

Cited by 5 (2 self)
A population of sequences is called non-random if there is a statistical model and an associated compression algorithm that allows members of the population to be compressed, on average. Any available statistical model of a population should be incorporated into algorithms for alignment of the sequences; doing so changes the rank order of possible alignments in general. The model should also be used in deciding whether a resulting approximate match between two sequences is significant. It is shown how to do this for two plausible interpretations involving pairs of sequences that might or might not be related. Efficient alignment algorithms are described for quite general statistical models of sequences. The new alignment algorithms are more sensitive to what might be termed 'features' of the sequences. A natural significance test is shown to be rarely fooled by apparent similarities between two sequences that are merely typical of all or most members of the population, even unrelated members. The Computer Journal, Volume 42, Issue 1, pp. 1-10, 1999. http://www.csse.monash.edu.au/~lloyd/tildeStrings/
Discovering Patterns In Plasmodium Falciparum genomic DNA
, 2001
Abstract

Cited by 3 (1 self)
A method has been developed for discovering patterns in DNA sequences. Loosely based on the well-known Lempel-Ziv model for text compression, the model detects repeated sequences in DNA. The repeats can be forward or inverted, and they need not be exact. The method is particularly useful for detecting distantly related sequences, and for finding patterns in sequences of biased nucleotide composition, where spurious patterns are often observed because the bias leads to coincidental nucleotide matches. We show here the utility of the method by applying it to genomic sequences of Plasmodium falciparum. A single scan of chromosomes 2 and 3 of P. falciparum, using our method and no other a priori information about the sequences, reveals regions of low complexity in both telomeric and central regions, long repeats in the subtelomeric regions, and shorter repeat areas in dense coding regions. Application of the method to a recently sequenced contig of chromosome 10 that has a particularly biased base composition detects a long internal repeat more readily than does the conventional dot matrix plot. Space requirements are linear, so the method can be used on large sequences. The observed repeat patterns may be related to large-scale chromosomal organization and control of gene expression. The method has general application in detecting patterns of potential interest in newly sequenced genomic material.
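The Lempel-Ziv idea of explaining each stretch of a sequence as a copy of something already seen can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the method of the abstract: it finds exact repeats only (the abstract allows inexact ones), it treats an "inverted" repeat as a reverse complement, and its brute-force search is quadratic in the worst case rather than matching the abstract's linear space bound. The names `reverse_complement`, `longest_earlier_match`, `parse_repeats` and the `min_len` parameter are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Reverse complement of a DNA string over the alphabet ACGT."""
    return s.translate(_COMPLEMENT)[::-1]


def longest_earlier_match(seq, pos, min_len=4):
    """Longest exact repeat starting at `pos` whose source lies in seq[:pos].

    Returns (kind, source_start, length) or None if nothing of at least
    `min_len` bases is found. Assumption: exact matches only.
    """
    history = seq[:pos]
    best = None
    length = min_len
    while pos + length <= len(seq):
        word = seq[pos:pos + length]
        fwd = history.find(word)
        inv = history.find(reverse_complement(word))
        if fwd >= 0:
            best = ("forward", fwd, length)
        elif inv >= 0:
            best = ("inverted", inv, length)
        else:
            break  # neither orientation extends any further
        length += 1
    return best


def parse_repeats(seq, min_len=4):
    """Greedy left-to-right parse; returns (pos, kind, source_start, length)."""
    pos, events = 0, []
    while pos < len(seq):
        match = longest_earlier_match(seq, pos, min_len)
        if match is None:
            pos += 1  # treat the base as a literal and move on
        else:
            kind, start, length = match
            events.append((pos, kind, start, length))
            pos += length
    return events


if __name__ == "__main__":
    # "ACCGTA", a spacer, a forward copy, then its reverse complement.
    for event in parse_repeats("ACCGTAGGACCGTATACGGT"):
        print(event)
```

Handling inexact repeats, as the abstract describes, would require scoring mismatches within a copy instead of stopping at the first one; that is deliberately left out of this sketch.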
CONSERV: A Tool for Finding Exact Matching Conserved Sequences in Biological Sequences
 Genome Informatics
, 2000
Abstract

Cited by 2 (2 self)
Introduction: Complete genome sequences of more than 30 organisms have been determined to date. Once many complete genome sequences are available, one of the first questions is which regions are conserved among them. Most existing tools are unsuitable for this purpose, because they cannot handle sequences as large as complete genomes or, when they can, they are often too slow. We have developed a software tool, CONSERV, which detects all exactly matching common regions in two or more complete genome sequences. CONSERV can be used not only for nucleic acid sequences but also for amino acid sequences. CONSERV can only detect exact matches, but it is very fast. For example, to compute all exact matching common regions longer than 1 bases in Mycoplasma pneumoniae, Chlamydia trachomatis, Archaeoglobus fulgidus, and Esc