On prediction using variable order Markov models
- JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
Cited by 42 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
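As a rough illustration of the evaluation measure mentioned in this abstract, the average log-loss of a sequential predictor can be sketched as below. This is not one of the six algorithms from the paper: the fixed-order Markov predictor with add-one smoothing, the function name `average_log_loss`, and the `order` parameter are all assumptions chosen to keep the example small.

```python
import math
from collections import defaultdict


def average_log_loss(sequence, order=2, alphabet=None):
    """Average log-loss in bits per symbol of a simple online predictor.

    Hypothetical stand-in predictor: a fixed-order Markov model with
    add-one (Laplace) smoothing, updated after each symbol is scored.
    """
    alphabet = sorted(set(sequence)) if alphabet is None else list(alphabet)
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(sequence):
        # Context is the (up to) `order` symbols preceding position i.
        context = sequence[max(0, i - order):i]
        seen = counts[context]
        # Smoothed probability assigned to the symbol before seeing it.
        p = (seen[symbol] + 1) / (sum(seen.values()) + len(alphabet))
        total_bits -= math.log2(p)
        # Only now does the model learn from the symbol.
        seen[symbol] += 1
    return total_bits / len(sequence)
```

On a strictly alternating sequence, an order-1 context makes the next symbol almost certain, so the loss falls well below one bit per symbol, while an order-0 model stays near one bit. A lower average log-loss corresponds to a shorter code length under an ideal arithmetic coder, which is the link between prediction and lossless compression.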
Biological Sequence Compression Algorithms
- Genome Informatics
, 2000
Cited by 13 (0 self)
Today, more and more DNA sequences are becoming available. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will keep growing in the future; therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences and only expand them in size. On the other hand, CTW (Context Tree Weighting Method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is called palindromes or reverse complements, and the other is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them to less than two bits per symbol. In this paper, we improve CTW so that characteristic structures of DNA sequences are available. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If there is a palindrome or an approximate repeat of sufficient length, then our algorithm represents it with a length and a distance. By using this preprocessing, a new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
LIPT: A reversible lossless text transform to improve compression performance
- In Proceedings of the IEEE Data Compression Conference 2001, Snowbird
, 2001
Cited by 5 (0 self)
Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family,
LIPT: A Lossless Text Transform to improve compression
- In Proceedings of International Conference on Information Technology: Coding and Computing, Las Vegas
, 2001
Cited by 4 (2 self)
We propose an approach to develop a dictionary-based reversible lossless text transformation, called LIPT (Length Index Preserving Transform), which can be applied to a source text to improve existing algorithms' ability to compress. In LIPT, the length of the input word and the offset of the word in the dictionary are denoted with letters of the alphabet. Our encoding scheme makes use of the recurrence of words of the same length in the English language to create context in the transformed text that the entropy coders can exploit. LIPT achieves some compression at the preprocessing stage as well and retains enough context and redundancy for the compression algorithms to give better results. Bzip2 with LIPT gives a 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD with LIPT gives a 4.46% improvement in average BPC over PPMD without LIPT, for our test corpus.
Dictionary-Based Fast Transform for Text Compression
Cited by 2 (0 self)
In this paper we present StarNT, a dictionary-based fast lossless text transform algorithm. With a static generic dictionary, StarNT achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. The algorithm uses a ternary search tree to expedite transform encoding. Experimental results show that the average compression time has improved by orders of magnitude compared with our previous algorithm LIPT, and that the additional time overhead it introduces to the backend compressor is unnoticeable. Based on StarNT, we propose StarZip, a domain-specific lossless text compression utility. Using domain-specific static dictionaries embedded in the system, StarZip achieves an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Cited by 2 (0 self)
We present worst case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multialphabet prediction performance of CTW-based algorithms.
Implementing the context tree weighting method for content recognition
- in Proc. Data Compression Conf., Snowbird, UT, Mar. 2004, p. 536
Cited by 1 (0 self)
The context tree weighting method (CTW) is a statistics-based universal data compression algorithm that is capable of achieving superior performance compared to Lempel-Ziv based algorithms [1], [2]. Motivated by this fact, we investigate the usability of CTW for applications involving content recognition. Recently, various authors have explored the application of other data compression algorithms for content recognition, e.g., see [3], [4], [5]. Given a test file that needs to be classified among a set of several reference files that represent different classes, the reference file which leads to the best compression of the test file when both files are appended is selected as the most probable match. Moreover, we modify CTW for content recognition purposes by introducing the concept of context tree freezing after the reference sequence is encoded to avoid learning the memory structure of the appended test sequence. Results show that CTW with the proposed freezing technique achieves a clearly superior performance compared to a wide range of other compression algorithms for content recognition problems such as language recognition, authorship attribution, and DNA data classification. For more details, the reader is referred to the full paper version available at [6].
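The classification rule described in this abstract (pick the reference file that best compresses the test file when the two are appended) can be sketched with any off-the-shelf compressor. The example below substitutes `zlib` for CTW and does not implement the context tree freezing step; the helper names `compressed_size` and `classify` and the sample data are assumptions made for illustration only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def classify(test: bytes, references: dict) -> str:
    """Return the label of the reference that best compresses `test`.

    The cost of a class is the number of extra compressed bytes needed
    when `test` is appended to that class's reference data.
    """
    def extra_cost(reference: bytes) -> int:
        return compressed_size(reference + test) - compressed_size(reference)

    return min(references, key=lambda label: extra_cost(references[label]))


# Hypothetical toy references: one DNA-like class and one English-like class.
references = {
    "dna": b"ACGTACGTTTGACA" * 20,
    "text": b"the quick brown fox jumps over the lazy dog " * 10,
}
label = classify(b"ACGTACGTTTGACA" * 3, references)
```

A dictionary-based compressor such as zlib keeps adapting while it reads the appended test data, which is the behaviour the freezing technique in the abstract is meant to avoid for CTW.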
Classification Tree Sources
The separation of source coding into two stages, modeling and encoding, is a highly successful approach. We propose meta-modeling as an additional stage. As an application, we use this paradigm to deduce an efficient and optimal algorithm for a novel and powerful model set: the classification tree sources. Our results on classification tree sources unify and generalize prior results for tree sources. Moreover, we point out applications in text and image compression.
Combining Models in Data Compression
We propose Beta Weighting as a simple linear weighting scheme for combining different models in data compression. Suppose we are given a finite number of models. Under the assumption that with a given a priori probability distribution one of the models is the best – but we do not know which one and we do not have further knowledge about the models – Beta Weighting is optimal in the sense that it yields minimum redundancy. Every single update of each weight requires only a constant number of arithmetic operations.
Biological Sequence Compression Algorithms
Today, more and more DNA sequences are becoming available. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will keep growing in the future; therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. Standard compression algorithms such as gzip or compress cannot compress DNA sequences and only expand them in size. On the other hand, CTW (Context Tree Weighting Method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is called palindromes or reverse complements, and the other is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them to less than two bits per symbol. In this paper, we improve CTW so that characteristic structures of DNA sequences are available. Before encoding the next symbol, the algorithm searches for an approximate repeat and a palindrome using hashing and dynamic programming. If there is a palindrome or an approximate repeat of sufficient length, then our algorithm represents it with a length and a distance. By using this preprocessing, a new program achieves a slightly higher compression ratio than existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
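To make the "palindrome or reverse complement" structure concrete, the sketch below computes a reverse complement and performs a naive exact search for one in the already-seen data, returning a (length, distance) pair. The paper's search uses hashing and dynamic programming and also handles approximate repeats; none of that is reproduced here. The function names, the `min_len` parameter, and the convention that distance is measured back from the end of the history are assumptions for illustration.

```python
# Translation table for the Watson-Crick base pairs A-T and C-G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_reverse_complement(history: str, upcoming: str, min_len: int = 4):
    """Naive exact search (not the paper's hash/DP method).

    Finds the longest prefix of `upcoming` whose reverse complement occurs
    in `history`. Returns (length, distance), where distance counts back
    from the end of `history` to the start of the match, or None.
    """
    for length in range(len(upcoming), min_len - 1, -1):
        target = reverse_complement(upcoming[:length])
        pos = history.rfind(target)
        if pos != -1:
            return length, len(history) - pos
    return None


match = find_reverse_complement("TTAACGGA", "CCGTTAT")
```

With a four-letter alphabet, a plain fixed-length code already needs log2(4) = 2 bits per base, which is why the abstract treats two bits per symbol as the baseline to beat.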

