## A Guaranteed Compression Scheme for Repetitive DNA Sequences (1995)

### Abstract

We present a text compression scheme dedicated to DNA sequences. The algorithm has two phases. In the parsing phase, a suffix tree of the sequence is built and used to select repeats for the dictionary. In the encoding phase, only those selected repeats for which a gain is guaranteed are encoded. We prove a theorem that guarantees the compression gain and report comparisons with classical compression schemes. Complete results are available by anonymous FTP at ftp.lifl.fr:/pub/BIC/biologie/Cfact. These experiments establish that DNA sequences require special compression methods and show the usefulness of our algorithm for classification purposes in biology.
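
The two-phase idea can be illustrated with a minimal sketch. It is not the authors' implementation. A naive quadratic scan stands in for the suffix tree, and the cost model (2 bits per base, a fixed-width position and length pair per repeat, no overhead for telling literals from pointers) is an assumption chosen for brevity. All function names are hypothetical.

```python
from math import ceil, log2

BITS_PER_BASE = 2  # assumption: A, C, G, T stored naively on 2 bits each


def pointer_bits(n):
    """Assumed cost of a fixed-width (position, length) pair for a sequence of length n >= 1."""
    width = max(1, ceil(log2(n)))
    return 2 * width


def longest_earlier_repeat(seq, i):
    """Return (pos, length) of the longest factor starting at i that also
    occurs entirely before i. Naive scan standing in for a suffix tree."""
    best_pos, best_len = -1, 0
    for j in range(i):
        k = 0
        while i + k < len(seq) and j + k < i and seq[j + k] == seq[i + k]:
            k += 1
        if k > best_len:
            best_pos, best_len = j, k
    return best_pos, best_len


def encode(seq):
    """Greedy left-to-right parse; a repeat is kept only if it saves bits."""
    cost = pointer_bits(len(seq)) if seq else 0
    tokens, i = [], 0
    while i < len(seq):
        pos, length = longest_earlier_repeat(seq, i)
        if length * BITS_PER_BASE > cost:  # strict gain required
            tokens.append(("rep", pos, length))
            i += length
        else:
            tokens.append(("lit", seq[i]))
            i += 1
    return tokens


def decode(tokens):
    """Rebuild the sequence by copying literals and earlier factors."""
    out = []
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, pos, length = tok
            out.extend(out[pos:pos + length])
    return "".join(out)


def estimated_bits(tokens, n):
    """Size of the parse under the assumed cost model."""
    cost = pointer_bits(n) if n else 0
    return sum(BITS_PER_BASE if t[0] == "lit" else cost for t in tokens)
```

Because a repeat is accepted only when its pointer costs fewer bits than the bases it replaces, the estimated size under this cost model never exceeds `2 * len(seq)`. On a sequence with no usable repeats, the parse falls back to literals only.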

