Results 1-10 of 11
On prediction using variable order Markov models
 JOURNAL OF ARTIFICIAL INTELLIGENCE RESEARCH
, 2004
Cited by 56 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
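The abstract above ranks predictors by average log-loss. As a rough illustration of that measure only, the sketch below scores a plain fixed-order Markov predictor with add-one smoothing. This is an assumption made for brevity, not one of the six algorithms the paper compares, and the helper names `train_order_k` and `average_log_loss` are hypothetical.

```python
import math
from collections import defaultdict


def train_order_k(sequence, k, alphabet):
    """Build a fixed-order-k predictor with add-one smoothing.

    Hypothetical helper for illustration; not CTW, PPM, or PST.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for i, symbol in enumerate(sequence):
        context = sequence[max(0, i - k):i]
        counts[context][symbol] += 1

    def predict(context, symbol):
        # Use at most the last k symbols of the history as context.
        ctx = context[-k:] if k > 0 else ""
        seen = counts.get(ctx, {})
        total = sum(seen.values())
        return (seen.get(symbol, 0) + 1) / (total + len(alphabet))

    return predict


def average_log_loss(predict, sequence):
    """Average of -log2 P(x_i | x_1..x_{i-1}) over the sequence, in bits."""
    total = 0.0
    for i, symbol in enumerate(sequence):
        total -= math.log2(predict(sequence[:i], symbol))
    return total / len(sequence)


# An untrained predictor over four symbols is uniform, so it costs 2 bits per symbol.
uniform = train_order_k("", 0, "ACGT")
uniform_loss = average_log_loss(uniform, "ACGT")

# A predictor trained on an alternating sequence does better than 1 bit per symbol.
trained = train_order_k("ABABABAB", 1, "AB")
trained_loss = average_log_loss(trained, "ABAB")
```

A lower average log-loss means the predictor assigns higher probability to what actually occurs, which is also why any lossless compressor can, in principle, be read as a predictor.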
Biological Sequence Compression Algorithms
 Genome Informatics
, 2000
Cited by 16 (0 self)
Today, more and more DNA sequences are becoming available. The information about DNA sequences is stored in molecular biology databases. The size and importance of these databases will be bigger and bigger in the future, therefore this information must be stored or communicated efficiently. Furthermore, sequence compression can be used to define similarities between biological sequences. The standard compression algorithms such as gzip or compress cannot compress DNA sequences but only expand them in size. On the other hand, CTW (Context Tree Weighting Method) can compress DNA sequences to less than two bits per symbol. These algorithms do not use special structures of biological sequences. Two characteristic structures of DNA sequences are known. One is called palindromes or reverse complements and the other structure is approximate repeats. Several specific algorithms for DNA sequences that use these structures can compress them to less than two bits per symbol. In this paper, we improve the CTW so that characteristic structures of DNA sequences are available. Before encoding the next symbol, the algorithm searches for an approximate repeat and palindrome using hash and dynamic programming. If there is a palindrome or an approximate repeat with enough length then our algorithm represents it with length and distance. By using this preprocessing, a new program achieves a little higher compression ratio than that of existing DNA-oriented compression algorithms. We also describe a new compression algorithm for protein sequences.
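To make the "palindrome" (reverse complement) structure concrete, here is a minimal sketch that looks for the upcoming symbols as the reverse complement of an earlier region and reports it as a (length, distance) pair. It is an assumption-laden simplification: it finds exact matches only, uses a naive scan rather than the hashing and dynamic programming the abstract mentions, and the function names are hypothetical.

```python
# Translation table for the DNA complement: A<->T, C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_reverse_complement_match(seq, pos, min_len):
    """Find the longest run starting at `pos` whose reverse complement
    occurs entirely before `pos`.

    Returns (length, distance), where distance counts back from `pos` to
    the start of the earlier region, or None if no run of at least
    `min_len` symbols matches. Naive exact search, for illustration only.
    """
    max_len = min(pos, len(seq) - pos)
    for length in range(max_len, min_len - 1, -1):
        target = reverse_complement(seq[pos:pos + length])
        # Restrict the search to seq[:pos] so the match is already encoded.
        start = seq.find(target, 0, pos)
        if start != -1:
            return length, pos - start
    return None


# "ACGGT" is the reverse complement of "ACCGT", so the second half of this
# string can be described by a (length, distance) pair instead of literals.
match = find_reverse_complement_match("ACCGTACGGT", 5, 3)
no_match = find_reverse_complement_match("AAAAAAAA", 4, 2)
```

In a real coder the pair would only be used when it is cheaper than encoding the symbols directly, which is the role of the "enough length" condition in the abstract.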
LIPT: A Lossless Text Transform to improve compression
 In Proceedings of International Conference on Information and Theory : Coding and Computing, Las Vegas
, 2001
Cited by 9 (2 self)
We propose an approach to develop a dictionary-based reversible lossless text transformation, called LIPT (Length Index Preserving Transform), which can be applied to a source text to improve an existing algorithm’s ability to compress. In LIPT, the length of the input word and the offset of the word in the dictionary are denoted with letters of the alphabet. Our encoding scheme makes use of the recurrence of words of the same length in the English language to create context in the transformed text that the entropy coders can exploit. LIPT achieves some compression at the preprocessing stage as well and retains enough context and redundancy for the compression algorithms to give better results. Bzip2 with LIPT gives a 5.24% improvement in average BPC over Bzip2 without LIPT, and PPMD with LIPT gives a 4.46% improvement in average BPC over PPMD without LIPT, for our test corpus.
LIPT: A reversible lossless text transform to improve compression performance
 In Proceedings of the IEEE Data Compression Conference 2001, Snowbird
, 2001
Cited by 6 (0 self)
Lossless compression researchers have developed highly sophisticated approaches, such as Huffman encoding, arithmetic encoding, the Lempel-Ziv family,
Superior Guarantees for Sequential Prediction and Lossless Compression via Alphabet Decomposition
Cited by 3 (0 self)
We present worst-case bounds for the learning rate of a known prediction method that is based on hierarchical applications of binary context tree weighting (CTW) predictors. A heuristic application of this approach that relies on Huffman’s alphabet decomposition is known to achieve state-of-the-art performance in prediction and lossless compression benchmarks. We show that our new bound for this heuristic is tighter than the best known performance guarantees for prediction and lossless compression algorithms in various settings. This result substantiates the efficiency of this hierarchical method and provides a compelling explanation for its practical success. In addition, we present the results of a few experiments that examine other possibilities for improving the multi-alphabet prediction performance of CTW-based algorithms.
Dictionary-Based Fast Transform for Text Compression
Cited by 2 (0 self)
In this paper we present StarNT, a dictionary-based fast lossless text transform algorithm. With a static generic dictionary, StarNT achieves a better compression ratio than almost all other recent efforts based on BWT and PPM. The algorithm uses a ternary search tree to expedite transform encoding. Experimental results show that the average compression time has improved by orders of magnitude compared with our previous algorithm LIPT, and the additional time overhead it introduces to the backend compressor is unnoticeable. Based on StarNT, we propose StarZip, a domain-specific lossless text compression utility. Using domain-specific static dictionaries embedded in the system, StarZip achieves an average improvement in compression performance (in terms of BPC) of 13% over bzip2-9, 19% over gzip-9, and 10% over PPMD.
Combining Models in Data Compression
Cited by 2 (0 self)
We propose Beta Weighting as a simple linear weighting scheme for combining different models in data compression. Suppose we are given a finite number of models. Under the assumption that with a given a priori probability distribution one of the models is the best – but we do not know which one and we do not have further knowledge about the models – Beta Weighting is optimal in the sense that it yields minimum redundancy. Every single update of each weight requires only a constant number of arithmetic operations.
Implementing the context tree weighting method for content recognition
 in Proc. Data Compression Conf., Snowbird, UT, Mar. 2004, p. 536.
Cited by 1 (0 self)
The context tree weighting method (CTW) is a statistics-based universal data compression algorithm that is capable of achieving superior performance compared to Lempel-Ziv based algorithms [1], [2]. Motivated by this fact, we investigate the usability of CTW for applications involving content recognition. Recently, various authors have explored the application of other data compression algorithms for content recognition, e.g. see [3], [4], [5]. Given a test file that needs to be classified among a set of several reference files that represent different classes, the reference file which leads to the best compression of the test file when both files are appended is selected as the most probable match. Moreover, we modify CTW for content recognition purposes by introducing the concept of context tree freezing after the reference sequence is encoded, to avoid learning the memory structure of the appended test sequence. Results show that CTW with the proposed freezing technique achieves a clearly superior performance compared to a wide range of other compression algorithms for content recognition problems such as language recognition, authorship attribution, and DNA data classification. For more details, the reader is referred to the full paper version available at [6].
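The append-and-compress decision rule described above can be sketched in a few lines. Note the substitution: this sketch uses Python's standard `zlib` (a Lempel-Ziv based compressor) as a stand-in, not CTW, and it does not model the paper's tree-freezing step. The function names and the toy reference data are assumptions for illustration.

```python
import zlib


def compressed_size(data):
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def classify(test, references):
    """Return the name of the reference that compresses `test` best.

    `references` maps a class name to reference bytes. The cost of a class
    is the number of extra bytes needed to encode the test data once it is
    appended to that class's reference.
    """
    def extra_cost(name):
        ref = references[name]
        return compressed_size(ref + test) - compressed_size(ref)

    return min(references, key=extra_cost)


# Toy data (hypothetical): one English-like and one DNA-like reference.
references = {
    "english": b"the quick brown fox jumps over the lazy dog " * 20,
    "dna": b"ACGTTGCAAGCTTAGC" * 50,
}
label = classify(b"the lazy dog jumps over the quick brown fox", references)
```

Because the stand-in compressor keeps adapting while it reads the test data, it also learns from the test sequence itself; the freezing technique in the abstract is meant to prevent exactly that.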
RIDBE: A Lossless, Reversible Text Transformation Scheme for better Compression
Cited by 1 (0 self)
In this paper, we propose RIDBE (Reinforced Intelligent Dictionary Based Encoding), a dictionary-based reversible lossless text transformation algorithm. The basic philosophy of our secure compression is to preprocess the text and transform it into some intermediate form which can be compressed with better efficiency and which exploits the natural redundancy of the language in making the transformation. In RIDBE, the length of the input word is denoted by the ASCII characters 232–253 and the offset of the word in the dictionary is denoted with the letters A-Z. The backend algorithm’s ability to compress is seen to improve considerably when this approach is applied to source text and used in conjunction with BWT. A sufficient level of security of the transmitted information is also maintained. RIDBE achieves better compression at the preprocessing stage and enough redundancy is retained for the compression algorithms to get better results. The experimental results of this compression method are analysed. RIDBE gives 19.08% improvement over Simple BWT, 9.40% improvement over BWT with *encode, 3.20% improvement over BWT
Classification Tree Sources
The separation of source coding into two stages, modeling and encoding, is a highly successful approach. We propose metamodeling as an additional stage. As an application, we use this paradigm to deduce an efficient and optimal algorithm for a novel and powerful model set: the classification tree sources. Our results on classification tree sources unify and generalize prior results for tree sources. Moreover, we point out applications in text and image compression.