Results 1-10 of 147
The String B-Tree: A New Data Structure for String Search in External Memory and its Applications
Journal of the ACM, 1998
"... We introduce a new textindexing data structure, the String BTree, that can be seen as a link between some traditional externalmemory and stringmatching data structures. In a short phrase, it is a combination of Btrees and Patricia tries for internalnode indices that is made more effective by a ..."
Abstract

Cited by 138 (12 self)
We introduce a new text-indexing data structure, the String B-Tree, that can be seen as a link between some traditional external-memory and string-matching data structures. In a short phrase, it is a combination of B-trees and Patricia tries for internal-node indices that is made more effective by adding extra pointers to speed up search and update operations. Consequently, the String B-Tree overcomes the theoretical limitations of inverted files, B-trees, prefix B-trees, suffix arrays, compacted tries and suffix trees. String B-trees have the same worst-case performance as B-trees but they manage unbounded-length strings and perform much more powerful search operations such as the ones supported by suffix trees. String B-trees are also effective in main memory (RAM model) because they improve the online suffix tree search on a dynamic set of strings. They can also be successfully applied to database indexing and software duplication.
Unbounded length contexts for PPM
in Proc. Data Compression Conf. (DCC '95), 1995
"... ..."
(Show Context)
On prediction using variable order Markov models
Journal of Artificial Intelligence Research, 2004
"... This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Cont ..."
Abstract

Cited by 102 (1 self)
This paper is concerned with algorithms for prediction of discrete sequences over a finite alphabet, using variable order Markov models. The class of such algorithms is large and in principle includes any lossless compression algorithm. We focus on six prominent prediction algorithms, including Context Tree Weighting (CTW), Prediction by Partial Match (PPM) and Probabilistic Suffix Trees (PSTs). We discuss the properties of these algorithms and compare their performance using real-life sequences from three domains: proteins, English text and music pieces. The comparison is made with respect to prediction quality as measured by the average log-loss. We also compare classification algorithms based on these predictors with respect to a number of large protein classification tasks. Our results indicate that a “decomposed” CTW (a variant of the CTW algorithm) and PPM outperform all other algorithms in sequence prediction tasks. Somewhat surprisingly, a different algorithm, which is a modification of the Lempel-Ziv compression algorithm, significantly outperforms all algorithms on the protein classification problems.
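The variable-order idea the abstract describes can be sketched minimally. The following is an illustrative back-off predictor, not the paper's CTW or PPM algorithms; the class and method names are invented for the example:

```python
from collections import defaultdict, Counter

class VOMM:
    """Minimal variable-order Markov model sketch: count every context up
    to max_order during training, then predict with the longest training
    context that matches a suffix of the query (a simple back-off rule)."""

    def __init__(self, max_order=3):
        self.max_order = max_order
        self.counts = defaultdict(Counter)  # context string -> next-symbol counts

    def train(self, seq):
        for i in range(len(seq)):
            for k in range(self.max_order + 1):
                if i - k < 0:
                    break
                # Record: after context seq[i-k:i], symbol seq[i] occurred.
                self.counts[seq[i - k:i]][seq[i]] += 1

    def predict(self, context):
        # Back off from the longest usable suffix down to the empty context.
        for k in range(min(self.max_order, len(context)), -1, -1):
            ctx = context[len(context) - k:] if k else ""
            if ctx in self.counts:
                return self.counts[ctx].most_common(1)[0][0]
        return None

m = VOMM(max_order=2)
m.train("abracadabra")
print(m.predict("ab"))  # "ab" is always followed by "r" in the training string
```

Real PPM additionally assigns escape probabilities to unseen symbols at each order; this sketch only picks the most frequent continuation.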
Analysis of Branch Prediction via Data Compression
in Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, 1996
"... Branch prediction is an important mechanism in modem microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. Knowing this theoretical basis helps us to ..."
Abstract

Cited by 91 (3 self)
Branch prediction is an important mechanism in modern microprocessor design. The focus of research in this area has been on designing new branch prediction schemes. In contrast, very few studies address the theoretical basis behind these prediction schemes. Knowing this theoretical basis helps us to evaluate how good a prediction scheme is and how much we can expect to improve its accuracy.
Adding Compression to a Full-Text Retrieval System
1995
"... We describe the implementation of a data compression scheme as an integral and transparent layer within a fulltext... ..."
Abstract

Cited by 90 (25 self)
We describe the implementation of a data compression scheme as an integral and transparent layer within a full-text...
A Corpus for the Evaluation of Lossless Compression Algorithms
1997
"... This paper investigates how the reliability of these evaluations can be ensured, particularly the repeatability of experiments, in line with scientific method. The evaluation of compression methods can be analytical or empirical. Analytical results are generally expressed in terms of the compression ..."
Abstract

Cited by 68 (1 self)
This paper investigates how the reliability of these evaluations can be ensured, particularly the repeatability of experiments, in line with scientific method. The evaluation of compression methods can be analytical or empirical. Analytical results are generally expressed in terms of the compression of a system relative to the entropy of the source, which is assumed to belong to a specified class. Such results tend to have only asymptotic significance; for example, the LZ78 method [ZL78] converges to the entropy for very large inputs, but in practical situations files are far too short for this convergence to have any significance. For this reason empirical results are needed to establish the practical worth of a method. The main factor measured in typical empirical experiments is the amount of compression achieved on some set of files. Researchers also often report the speed of compression, and the amount of primary memory required to perform the compression. The speed and memory requirements can be different for the encoding and decoding processes, and may depend on the file being compressed. This can result in a daunting number of factors that need to be presented. A number of authors have used the "Calgary corpus" of texts to provide empirical results for lossless compression algorithms. This corpus was collected in 1987, although it was not published until 1990 [BCW90]. Recent advances with compression algorithms have been achieving relatively small improvements in compression, measured using the Calgary corpus. There is a concern that algorithms are being fine-tuned to this corpus, and that small improvements measured in this way may not apply to other files. Furthermore, the corpus is almost ten years old, and over
Augmenting Naive Bayes Classifiers with Statistical Language Models
2003
"... We augment naive Bayes models with statistical ngram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier ..."
Abstract

Cited by 65 (0 self)
We augment naive Bayes models with statistical n-gram language models to address shortcomings of the standard naive Bayes text classifier. The result is a generalized naive Bayes classifier
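One minimal way to realize this idea is to score a document under a per-class character n-gram language model with add-one smoothing and pick the higher-likelihood class. This is an illustrative sketch of that general scheme, not the paper's classifier, and the toy training data is invented:

```python
import math
from collections import defaultdict, Counter

def train_ngram_lm(texts, n=2):
    """Character n-gram counts for one class (contexts of length n-1)."""
    counts = defaultdict(Counter)
    vocab = set()
    for t in texts:
        t = " " * (n - 1) + t  # pad so the first character has a context
        for i in range(n - 1, len(t)):
            counts[t[i - n + 1:i]][t[i]] += 1
            vocab.add(t[i])
    return counts, vocab

def log_prob(text, model, n=2):
    """Log-likelihood of text under the model, add-one smoothed."""
    counts, vocab = model
    V = len(vocab) + 1  # +1 reserves mass for unseen symbols
    t = " " * (n - 1) + text
    lp = 0.0
    for i in range(n - 1, len(t)):
        ctx = counts.get(t[i - n + 1:i], Counter())
        lp += math.log((ctx[t[i]] + 1) / (sum(ctx.values()) + V))
    return lp

# Naive-Bayes-style decision: the class whose language model assigns the
# document the highest likelihood wins (uniform priors assumed).
spam = train_ngram_lm(["buy now cheap", "cheap buy offer"])
ham = train_ngram_lm(["meeting at noon", "see you at the meeting"])
doc = "cheap offer now"
label = "spam" if log_prob(doc, spam) > log_prob(doc, ham) else "ham"
print(label)  # -> spam
```

Replacing the per-word multinomial of standard naive Bayes with a sequence model like this is what lets the classifier use local character or word order.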
Universal Lossless Source Coding With the Burrows-Wheeler Transform
IEEE Transactions on Information Theory, 2002
"... The Burrows Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless sourcecoding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWTbased compression schemes ar ..."
Abstract

Cited by 44 (4 self)
The Burrows-Wheeler Transform (BWT) is a reversible sequence transformation used in a variety of practical lossless source-coding algorithms. In each, the BWT is followed by a lossless source code that attempts to exploit the natural ordering of the BWT coefficients. BWT-based compression schemes are widely touted as low-complexity algorithms giving lossless coding rates better than those of the Ziv-Lempel codes (commonly known as LZ'77 and LZ'78) and almost as good as those achieved by prediction by partial matching (PPM) algorithms. To date, the coding performance claims have been made primarily on the basis of experimental results. This work gives a theoretical evaluation of BWT-based coding. The main results of this theoretical evaluation include: 1) statistical characterizations of the BWT output on both finite strings and sequences of length , 2) a variety of very simple new techniques for BWT-based lossless source coding, and 3) proofs of the universality and bounds on the rates of convergence of both new and existing BWT-based codes for finite-memory and stationary ergodic sources. The end result is a theoretical justification and validation of the experimentally derived conclusions: BWT-based lossless source codes achieve universal lossless coding performance that converges to the optimal coding performance more quickly than the rate of convergence observed in Ziv-Lempel style codes and, for some BWT-based codes, within a constant factor of the optimal rate of convergence for finite-memory sources.
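The transform itself is compact to state. A textbook sketch follows; real implementations use suffix arrays rather than materializing all rotations, and the "$" sentinel (assumed absent from the input and smaller than every input symbol) is one common convention for making the transform invertible:

```python
def bwt(s, sentinel="$"):
    """Forward BWT via sorted rotations (O(n^2 log n); for demonstration).
    Appends a sentinel, sorts all rotations, returns the last column."""
    s += sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def ibwt(t, sentinel="$"):
    """Inverse BWT by repeatedly prepending the transform column and
    sorting; the row ending in the sentinel is the original string."""
    table = [""] * len(t)
    for _ in range(len(t)):
        table = sorted(t[i] + table[i] for i in range(len(t)))
    row = next(r for r in table if r.endswith(sentinel))
    return row.rstrip(sentinel)

print(bwt("banana"))        # -> annb$aa  (equal symbols cluster into runs)
print(ibwt(bwt("banana")))  # -> banana
```

The clustering of equal symbols in the output is exactly what the simple source codes that follow the BWT exploit.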
A Fast Block-sorting Algorithm for Lossless Data Compression
1996
"... I describe a fast blocksorting algorithm and its implementation to be used as frontend to simple lossless data compression algorithms like movetofront coding. I also compare it with widely available data compression algorithms running on the same Hardware. My algorithm achieves speed above c ..."
Abstract

Cited by 44 (0 self)
I describe a fast block-sorting algorithm and its implementation, to be used as a front-end to simple lossless data compression algorithms like move-to-front coding. I also compare it with widely available data compression algorithms running on the same hardware. My algorithm achieves higher speed than comparable algorithms while maintaining the same good compression. Since it is a derivative of the algorithm published by M. Burrows and D.J. Wheeler, the size of the input blocks must be large to achieve good compression. Unlike their method, execution speed here does not depend on the block size used. I will also present improvements to the back-end of block-sorting compression methods.

Michael Schindler, "A fast block-sorting algorithm for lossless data compression", 1 Introduction: Today's popular lossless data compression algorithms are mainly based on the sequential data compression published by Lempel and Ziv in 1977 [1] and 1978 [2]. There were improvements like in [3] or the developme...
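The move-to-front back-end mentioned in the abstract can be sketched as follows. This is the generic textbook coder, not Schindler's implementation; on block-sorted input, runs of a repeated symbol become runs of zeros, which a final entropy coder compresses well:

```python
def mtf_encode(data, alphabet):
    """Move-to-front coding: emit each symbol's current position in the
    table, then move that symbol to the front, so recently seen symbols
    get small indices."""
    table = list(alphabet)
    out = []
    for ch in data:
        i = table.index(ch)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

def mtf_decode(codes, alphabet):
    """Inverse: replay the same table updates from the indices."""
    table = list(alphabet)
    out = []
    for i in codes:
        ch = table[i]
        out.append(ch)
        table.insert(0, table.pop(i))
    return "".join(out)

codes = mtf_encode("annb$aa", "$abn")  # a BWT-style input containing runs
print(codes)                            # -> [1, 3, 0, 3, 3, 3, 0]
print(mtf_decode(codes, "$abn"))        # -> annb$aa
```

Note the zeros produced by the "nn" and "aa" runs; longer runs in real block-sorted data yield long zero runs.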