Results 1–10 of 16
A compression algorithm for DNA sequences and its applications in genome comparison
1999
"... We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main ..."
Abstract

Cited by 70 (4 self)
 Add to MetaCart
(Show Context)
We present a lossless compression algorithm, GenCompress, for genetic sequences, based on searching for approximate repeats. Our algorithm achieves the best compression ratios for benchmark DNA sequences. Significantly better compression results show that the approximate repeats are one of the main hidden regularities in DNA sequences. We then describe a theory of measuring the relatedness between two DNA sequences. Using our algorithm, we present strong experimental support for this theory, and demonstrate its application in comparing genomes and constructing evolutionary trees.
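The relatedness measure described in this abstract can be pictured with a short Python sketch. GenCompress itself is not reproduced here; `zlib` stands in for the DNA-specific compressor, and the function names and the normalized-compression-distance form are this sketch's assumptions, not the paper's exact definition:

```python
import random
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib stands in for a DNA-specific compressor."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: smaller for related sequences."""
    cx, cy, cxy = csize(x), csize(y), csize(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

random.seed(1)
a = b"ACGTACGT" * 75                                    # a repetitive "genome"
b = a[:-30] + b"TTTGGG" * 5                             # a close variant of a
r = bytes(random.choice(b"ACGT") for _ in range(600))   # an unrelated sequence

assert ncd(a, b) < ncd(a, r)   # the related pair is measurably closer
```

Pairwise distances of this kind are what the abstract's evolutionary trees are built from: a better compressor yields a sharper distance.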
Off-line compression by greedy textual substitution
PROC. IEEE, 2000
"... Greedy offline textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the ..."
Abstract

Cited by 27 (1 self)
 Add to MetaCart
Greedy offline textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring until substrings capable of producing contractions can no longer be found. This paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
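As a rough illustration of the greedy paradigm described above (not the paper's actual implementation, which replaces all instances of w except one by pointer pairs), the following Python sketch repeatedly picks the substring whose substitution yields the largest estimated contraction; the gain model, the side dictionary, and the private-use pointer characters are simplifications made for the sketch:

```python
from collections import Counter

def best_substring(text: str, max_len: int = 12) -> tuple[str, int]:
    """Return the substring (length >= 2) with the largest estimated saving."""
    counts = Counter()
    for length in range(2, max_len + 1):
        for i in range(len(text) - length + 1):
            counts[text[i:i + length]] += 1   # overlaps overcounted; fine for a heuristic
    best, best_gain = "", 0
    for w, k in counts.items():
        if k >= 2:
            gain = (k - 1) * (len(w) - 1)     # replace k-1 copies by a 1-char pointer
            if gain > best_gain:
                best, best_gain = w, gain
    return best, best_gain

def greedy_compress(text: str, rounds: int = 8):
    """Contract text round by round, recording substitutions in a side dictionary."""
    dictionary = {}
    code_point = 0xE000                        # private-use characters act as pointers
    for _ in range(rounds):
        w, gain = best_substring(text)
        if gain <= 0:                          # no contracting substring left: stop
            break
        sym = chr(code_point)
        code_point += 1
        dictionary[sym] = w
        text = text.replace(w, sym)
    return text, dictionary

def decompress(text: str, dictionary: dict) -> str:
    """Undo the substitutions, latest first."""
    for sym, w in reversed(list(dictionary.items())):
        text = text.replace(sym, w)
    return text
```

The round-trip is exact: `decompress(*greedy_compress(t)) == t`, while the contracted string is shorter whenever the input contains repeats.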
Compressed q-gram Indexing for Highly Repetitive Biological Sequences
"... Abstract—The study of compressed storage schemes for highly repetitive sequence collections has been recently boosted by the availability of cheaper sequencing technologies and the flood of data they promise to generate. Such a storage scheme may range from the simple goal of retrieving whole indivi ..."
Abstract

Cited by 18 (13 self)
 Add to MetaCart
(Show Context)
The study of compressed storage schemes for highly repetitive sequence collections has recently been boosted by the availability of cheaper sequencing technologies and the flood of data they promise to generate. Such a storage scheme may range from the simple goal of retrieving whole individual sequences to the more advanced one of providing fast searches in the collection. In this paper we study alternatives for implementing a particularly popular index, namely, one able to find all the positions in the collection of substrings of fixed length (q-grams). We introduce two novel techniques and show that they constitute practical alternatives for handling this scenario. They excel particularly in two cases: when q is small (up to 6), and when the collection is extremely repetitive (less than 0.01% mutations).
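The compressed representations studied in the paper are not reproduced here, but the plain q-gram index they emulate can be sketched in a few lines of Python (the function name is ours):

```python
from collections import defaultdict

def build_qgram_index(collection, q):
    """Map each q-gram to every (sequence_id, offset) at which it occurs."""
    index = defaultdict(list)
    for sid, seq in enumerate(collection):
        for i in range(len(seq) - q + 1):
            index[seq[i:i + q]].append((sid, i))
    return index

# A tiny, highly repetitive collection: one mutation between the two sequences.
seqs = ["ACGTACGT", "ACGTTCGT"]
idx = build_qgram_index(seqs, 4)
assert idx["ACGT"] == [(0, 0), (0, 4), (1, 0)]
```

In a repetitive collection the posting lists of near-identical sequences are themselves near-identical, which is the redundancy a compressed q-gram index can exploit.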
A Simple Statistical Algorithm for Biological Sequence Compression
DATA COMPRESSION CONFERENCE, 2007
"... This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities ..."
Abstract

Cited by 17 (1 self)
 Add to MetaCart
This paper introduces a novel algorithm for biological sequence compression that makes use of both statistical properties and repetition within sequences. A panel of experts is maintained to estimate the probability distribution of the next symbol in the sequence to be encoded. Expert probabilities are combined to obtain the final distribution. The resulting information sequence provides insight for further study of the biological sequence. Each symbol is then encoded by arithmetic coding. Experiments show that our algorithm outperforms existing compressors on typical DNA and protein sequence datasets while maintaining a practical running time.
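The expert-combination step can be pictured with a small Bayesian-mixture sketch in Python; the weighting rule and all names here are illustrative assumptions, not the paper's exact scheme:

```python
def mix(expert_probs, weights):
    """Blend the experts' next-symbol distributions using normalized weights."""
    total = sum(weights)
    return {s: sum(w * p[s] for w, p in zip(weights, expert_probs)) / total
            for s in expert_probs[0]}

def update_weights(weights, expert_probs, observed):
    """Reward each expert in proportion to the probability it gave the observed symbol."""
    return [w * p[observed] for w, p in zip(weights, expert_probs)]

uniform = {s: 0.25 for s in "ACGT"}
biased = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}

w = [1.0, 1.0]
m = mix([biased, uniform], w)
assert abs(sum(m.values()) - 1.0) < 1e-9   # the mixture is a proper distribution

w = update_weights(w, [biased, uniform], "A")
assert w[0] > w[1]                          # the expert that predicted "A" gains weight
```

The mixed distribution is what drives the arithmetic coder; experts that predict well accumulate weight and dominate subsequent blends.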
DNA Coding using Finite-Context Models and Arithmetic Coding
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2009), 2009
"... The interest in DNA coding has been growing with the availability of extensive genomic databases. Although only two bits are sufficient to encode the four DNA bases, efficient lossless compression methods are still needed due to the size of DNA sequences and because standard compression algorithms ..."
Abstract

Cited by 4 (0 self)
 Add to MetaCart
The interest in DNA coding has been growing with the availability of extensive genomic databases. Although only two bits are sufficient to encode the four DNA bases, efficient lossless compression methods are still needed due to the size of DNA sequences and because standard compression algorithms do not perform well on them. As a result, several specific coding methods have been proposed. Most of these methods are based on searching procedures for finding exact or approximate repeats; low-order finite-context models have only been used as secondary, fall-back mechanisms. In this paper, we show that finite-context models can also be used as the main DNA encoding method. We propose a coding method based on two finite-context models that compete for the encoding of the data on a block-by-block basis. The experimental results confirm the effectiveness of the proposed method.
Index Terms — DNA coding, source coding, finite-context modeling, bioinformatics, arithmetic coding.
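A minimal sketch of the competition idea, assuming Laplace-smoothed order-k models and a one-bit flag per block to record the winner (the class name, smoothing, and block size are this sketch's choices, not the paper's):

```python
import math
from collections import defaultdict

class FCM:
    """Order-k finite-context model with add-one-smoothed counts over ACGT."""
    def __init__(self, k, alphabet="ACGT"):
        self.k, self.alphabet = k, alphabet
        self.counts = defaultdict(lambda: defaultdict(int))

    def bits(self, context, symbol):
        c = self.counts[context]
        total = sum(c.values()) + len(self.alphabet)
        return -math.log2((c[symbol] + 1) / total)

    def update(self, context, symbol):
        self.counts[context][symbol] += 1

def encode_cost(seq, models, block=64):
    """Charge each block to the cheaper model, plus one flag bit per block."""
    total_bits = 0.0
    for start in range(0, len(seq), block):
        chunk = seq[start:start + block]
        costs = [sum(m.bits(seq[max(0, start + i - m.k):start + i], s)
                     for i, s in enumerate(chunk))
                 for m in models]
        total_bits += min(costs) + 1        # the decoder needs to know which model won
        for m in models:                    # both models keep learning from the data
            for i, s in enumerate(chunk):
                m.update(seq[max(0, start + i - m.k):start + i], s)
    return total_bits

seq = "ACGT" * 256
cost = encode_cost(seq, [FCM(2), FCM(8)])
assert cost < 2 * len(seq)   # beats the 2-bit/base baseline on repetitive input
```

An arithmetic coder would realize these code lengths in an actual bitstream; the sketch only accounts for them.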
A three-state model for DNA protein-coding regions
IEEE Transactions on Biomedical Engineering, 2006
"... ..."
(Show Context)
EXPLORING THREE-BASE PERIODICITY FOR DNA COMPRESSION AND MODELING
"... To explore the threebase periodicity often found in proteincoding DNA regions, we introduce a DNA model based on three deterministic states, where each state implements a finitecontext model. The results obtained show compression gains in relation to the single finitecontext model counterpart. Add ..."
Abstract

Cited by 2 (2 self)
 Add to MetaCart
(Show Context)
To explore the three-base periodicity often found in protein-coding DNA regions, we introduce a DNA model based on three deterministic states, where each state implements a finite-context model. The results obtained show compression gains in relation to the single finite-context model counterpart. Additionally, and potentially more interesting than the compression gain on its own, is the observation that the entropy associated with each of the three states differs, and that this variation is not the same among the organisms analyzed.
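The three-state idea can be illustrated with a short Python sketch: three order-k models, selected by position modulo 3, each accumulating its own code length (the function name and the add-one smoothing are sketch assumptions):

```python
import math
from collections import defaultdict

def periodic_bits(seq, k=2, alphabet="ACGT"):
    """Code length, in bits, accumulated separately by each of three phase models."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(3)]
    bits = [0.0, 0.0, 0.0]
    for pos, s in enumerate(seq):
        state = pos % 3                      # deterministic state = codon phase
        ctx = seq[max(0, pos - k):pos]
        c = counts[state][ctx]
        total = sum(c.values()) + len(alphabet)
        bits[state] += -math.log2((c[s] + 1) / total)
        c[s] += 1
    return bits

# Phases 0 and 1 are deterministic here, while phase 2 cycles through all four bases:
seq = "".join("AT" + "ACGT"[i % 4] for i in range(200))
b0, b1, b2 = periodic_bits(seq)
assert b2 > b0 and b2 > b1   # the per-state entropies differ, as the abstract observes
```

Splitting the statistics by phase is exactly what lets the model capture a periodicity that a single finite-context model averages away.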
Inverted-Repeats-Aware Finite-Context Models for DNA Coding
In Proceedings of the 16th European Signal Processing Conference (EUSIPCO 2008), 2008
"... Finitecontext models have been used for DNA sequence compression as secondary, fall back mechanisms, the generalized opinion being that models with order larger than two or three are inappropriate. In this paper we show that finitecontext models can also be used as the main encoding method, and t ..."
Abstract

Cited by 2 (1 self)
 Add to MetaCart
(Show Context)
Finite-context models have been used for DNA sequence compression as secondary, fall-back mechanisms, the generalized opinion being that models with order larger than two or three are inappropriate. In this paper we show that finite-context models can also be used as the main encoding method, and that they are effective for model orders at least as high as thirteen. Moreover, we propose a new model-updating scheme that takes into account inverted repeats, a common characteristic of DNA sequences.
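The inverted-repeat-aware update can be sketched as follows: every observed (context, symbol) event also counts its reverse-complement image, so statistics gathered on one strand inform predictions on the other (function names are ours, not the paper's):

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(s: str) -> str:
    """Reverse complement of a DNA string."""
    return s.translate(COMPLEMENT)[::-1]

def update(counts, context, symbol):
    """Count the event and, additionally, its inverted-repeat image."""
    counts[context][symbol] += 1
    rc = revcomp(context + symbol)   # the same word read on the opposite strand
    counts[rc[:-1]][rc[-1]] += 1

counts = defaultdict(lambda: defaultdict(int))
update(counts, "AC", "G")
assert counts["AC"]["G"] == 1
assert counts["CG"]["T"] == 1   # revcomp("ACG") == "CGT": context "CG" -> symbol "T"
```

One observation thus feeds two contexts, which is why inverted repeats stop looking like novel material to the model.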
Survey of Compression of DNA Sequence
"... Compression of large collections of data can lead to improvements in retrieval times by offsetting the CPU decompression costs with the cost of seeking and retrieving data from disk. In this paper, the author has study the different compression method which can compress the large DNA sequence. In th ..."
Abstract
 Add to MetaCart
(Show Context)
Compression of large collections of data can lead to improvements in retrieval times by offsetting CPU decompression costs against the cost of seeking and retrieving data from disk. In this paper, the authors survey different compression methods capable of compressing large DNA sequences. They examine the DNA compression method COMRAD and compare it with the dictionary-based compression methods LZ77, LZ78, and LZW, and with the general-purpose compression method RAY, analyzing which algorithm is best suited to compressing large collections of DNA sequences. A compression table and a line graph show which algorithm achieves the better compression ratio and compressed size, as well as the better compression and decompression times.