## A Guaranteed Compression Scheme for Repetitive DNA Sequences (1995)

Citations: 15 (0 self)

### BibTeX

@MISC{Rivals95aguaranteed,
    author = {Eric Rivals and Jean-Paul Delahaye and Max Dauchet and Olivier Delgrange},
    title = {A Guaranteed Compression Scheme for Repetitive DNA Sequences},
    year = {1995}
}

### Abstract

We present a text compression scheme dedicated to DNA sequences. The algorithm has two phases. In the parsing phase, a suffix tree is built to select repeats for the dictionary. In the encoding phase, only the selected repeats for which a gain is guaranteed are encoded. We prove a theorem that guarantees the compression gain and report comparisons with classical compression schemes. Complete results are available by anonymous FTP at ftp.lifl.fr:/pub/BIC/biologie/Cfact. These experiments establish that DNA sequences require special compression methods and show our algorithm's utility for classification purposes in biology.
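The two-phase idea in the abstract can be illustrated with a small sketch. This is not the authors' Cfact implementation. A brute-force search stands in for the suffix tree. The cost model is an assumption made for illustration: 2 bits per base as the baseline, and a pointer made of two Fibonacci-coded integers, with no flag bits to separate literals from pointers. All function names are hypothetical.

```python
BITS_PER_BASE = 2  # assumed baseline cost of one nucleotide over {A, C, G, T}


def fib_code_length(n: int) -> int:
    """Bit length of the Fibonacci code of n >= 1, final '1' terminator included."""
    if n < 1:
        raise ValueError("Fibonacci code is defined for n >= 1")
    a, b, k = 1, 2, 1  # k = number of Fibonacci terms (1, 2, 3, 5, ...) that are <= n
    while b <= n:
        a, b = b, a + b
        k += 1
    return k + 1


def gain(position: int, length: int) -> int:
    """Bits saved by replacing `length` bases with a (position, length) pointer.
    Assumed cost model: both integers are written in Fibonacci code."""
    pointer_cost = fib_code_length(position + 1) + fib_code_length(length)
    return BITS_PER_BASE * length - pointer_cost


def longest_earlier_match(text: str, i: int) -> tuple[int, int]:
    """Brute-force stand-in for the suffix tree: the longest factor starting
    at i that also occurs entirely inside text[:i]. Returns (position, length)."""
    best_pos, best_len = 0, 0
    for j in range(i):
        k = 0
        while i + k < len(text) and j + k < i and text[j + k] == text[i + k]:
            k += 1
        if k > best_len:
            best_pos, best_len = j, k
    return best_pos, best_len


def parse(text: str) -> list[tuple]:
    """Emit a repeat token only when its gain is strictly positive."""
    tokens, i = [], 0
    while i < len(text):
        pos, length = longest_earlier_match(text, i)
        if length > 0 and gain(pos, length) > 0:
            tokens.append(("rep", pos, length))
            i += length
        else:
            tokens.append(("lit", text[i]))
            i += 1
    return tokens


def decode(tokens: list[tuple]) -> str:
    """Rebuild the text; every repeat copies from already decoded output."""
    out = ""
    for tok in tokens:
        if tok[0] == "lit":
            out += tok[1]
        else:
            _, pos, length = tok
            out += out[pos:pos + length]
    return out
```

Under these assumptions, `parse("ACGTTGCAAT" * 3)` yields ten literal tokens followed by two `("rep", 0, 10)` tokens. Short repeats are left as literals because their pointer would cost more than the bases it replaces.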

### Citations

9231 | Elements of Information Theory - Cover, Thomas - 1990 |

1221 | A universal algorithm for sequential data compression - Ziv, Lempel - 1977 |

776 | Compression of individual sequences via variable rate coding - Lempel, Ziv - 1978 |

644 | Modeling for text compression - Bell, Witten, et al. - 1989 |
Citation Context: ... for each of them an element in LEZ. Except non selected occurrences, the parsing classifies other factor occurrences in three types: [Footnote 1: We use the convenient notations of the compression schemes from [BCW90].] Reference Occurrence: the leftmost occurrence of the factor in the text is never encoded, it would be referred to, if another occurrence is encoded; 2nd Occurrence: the second occurrence is the fi...

458 | A technique for high performance data compression - Welch - 1984 |
Citation Context: ...mes allows that, Cfact does not. Selected compression schemes. The compressors in competition are: Cfact; LZSS, a LZ77 scheme from [Bel86]; LZW15, a LZWelch scheme with 15 bits dictionary index (cf. [Wel84]); Arith1 and Arith2, two arithmetic encoders used with finite context models of order 1 and 2. All but Cfact are adapted from [Nel91]. We contacted the authors of [GT93] to compare with their algori...

421 | Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison - Sankoff, Kruskal - 1983 |
Citation Context: ... repartition of sequences repeats is clearly established, nor their quantitative importance precisely assessed. Aside to well-known methods for identifying local similarities between given sequences ([SK83]), there is a real need for global comparison methods. A specific need emerges for methods that perform sequences classification upon various criteria, one of which is the sequence repetitiveness (cf....

115 | An Introduction to Kolmogorov Complexity - Li, Vitányi - 2008 |

83 | The Data Compression Book - Nelson |
Citation Context: ...l86]; LZW15, a LZWelch scheme with 15 bits dictionary index (cf. [Wel84]); Arith1 and Arith2, two arithmetic encoders used with finite context models of order 1 and 2. All but Cfact are adapted from [Nel91]. We contacted the authors of [GT93] to compare with their algorithm, but it is no more available. Observations. The results shown here, have been selected to point out that repetitive sequences can b...

66 | Data Compression with Finite Windows - Fiala, Greene - 1989 |
Citation Context: ...e encoding of one index. 1.1 Use of the Suffix Tree. The suffix tree is a well-known data structure that represents all factors of a text (cf. [CR94]), it has already been used in text compression in [FG89] but with a scheme derived from LZ78. The suffix tree is a tree in which a node: • has a degree different of 1 (the node degree gives its number of childs), • stores a factor of the text by a coup...

64 | Linear algorithm for data compression via string matching - Rodeh, Pratt, et al. |

30 | Compression of DNA sequences - Grumbach, Tahi - 1993 |
Citation Context: ...texts, but are unable to achieve compression on repetitive DNA sequences (see experiments results in section 5). A first attempt to build a compression scheme dedicated to genetic sequences is due to [GT93]: their algorithm is able to encode direct repeats and genetic palindromes (a special kind of symmetry between words), but is derived from a LZR scheme¹, i.e. a LZ77 scheme which avoids the limitati...

29 | Robust transmission of unbounded strings using Fibonacci representations - Apostolico, Fraenkel - 1987 |
Citation Context: ...r may reach, nor on how long a factor may be, the integers must be written in a self-delimited format. For integers, we always use the Fibonacci code which is proved to be asymptotically optimal (cf. [AF85]), but our algorithm works with any optimal code. Moreover, each time a 2nd occurrence is encoded, it is inserted in an adaptive dictionary and is associated with an index. For the encoding of a nth o...

18 | Information, complexité et hasard - Delahaye - 1994 |

15 | Discovering sequence similarity by the algorithmic significance method - Milosavljević - 1993 |

9 | Better OPM/L text compression - Bell - 1986 |
Citation Context: ... additional bit to announce this special case. Not all encoding schemes allows that, Cfact does not. Selected compression schemes. The compressors in competition are: Cfact; LZSS, a LZ77 scheme from [Bel86]; LZW15, a LZWelch scheme with 15 bits dictionary index (cf. [Wel84]); Arith1 and Arith2, two arithmetic encoders used with finite context models of order 1 and 2. All but Cfact are adapted from [Nel...

6 | A First Step Towards Chromosome Analysis by Compression Algorithms - Rivals, Delgrange, et al. - 1995 |
Citation Context: ...pression [CT91, Del94, LV93]. Practical applications of sequence analysis by compression algorithms appear in [Mil93, Riv94, RDDD94] and has been developed towards complete biological applications in [RDDD95]. In practice classical dictionary compression schemes perform really well on usual texts, but are unable to achieve compression on repetitive DNA sequences (see experiments results in section 5). A f...

3 | Compression and Sequence Comparison - Rivals, Delgrange, et al. - 1994 |

1 | Escherichia coli in silico - Hénaut, Danchin - 1994 |
Citation Context: ..., there is a real need for global comparison methods. A specific need emerges for methods that perform sequences classification upon various criteria, one of which is the sequence repetitiveness (cf. [HD94]). Various mechanisms such as DNA segments transpositions, inequal cross-over or defective DNA replication generate duplications of DNA contigs. Those exact repeats undergo elementary mutations mostly...