Results 21-30 of 51
PPM performance with BWT complexity: A fast and effective data compression algorithm
Proceedings of the IEEE, 2000
Abstract

Cited by 8 (0 self)
This paper introduces a new data compression algorithm. The goal underlying this new code design is to achieve a single lossless compression algorithm with the excellent compression ratios of the Prediction by Partial Mapping (PPM) algorithms and the low complexity of codes based on the Burrows-Wheeler Transform (BWT). Like the BWT-based codes, the proposed algorithm requires worst-case O(n) computational complexity and memory; in contrast, the unbounded-context PPM algorithm, called PPM*, requires worst-case O(n^2) computational complexity. Like PPM*, the proposed algorithm allows the use of unbounded contexts. Using standard data sets for comparison, the proposed algorithm achieves compression performance better than that of the BWT-based codes and comparable to that of PPM*. In particular, the proposed algorithm yields an average rate of 2.29 bits per character (bpc) on the Calgary corpus; this result compares favorably with the 2.33 and 2.34 bpc of PPM5 and PPM* (PPM algorithms), the 2.43 bpc of BW94 (the original BWT-based code), and the 3.64 and 2.69 bpc of compress and gzip (popular Unix compression algorithms based on Lempel-Ziv (LZ) coding techniques) on the same data set. The given code does not, however, match the best reported compression performance, 2.12 bpc with PPMZ, listed on the Calgary corpus results web page at the time of this publication. Results on the Canterbury corpus give a similar relative standing. The proposed algorithm gives an average rate of 2.15 bpc on the Canterbury corpus, while the Canterbury corpus web page gives average rates of 1.99 bpc for PPMZ, 2.11 bpc for PPM5, 2.15 bpc for PPM7, 2.23 bpc for BZIP2 (a popular BWT-based code), and 3.31 and 2.53 bpc for compress and gzip, respectively.
Keywords: Burrows-Wheeler Transform, lossless source coding, prediction by partial mapping algorithm, suffix trees, text compression.
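As background for the BWT-based codes compared above, the transform itself can be sketched in a few lines of Python. This is an illustration only, not the paper's algorithm: it uses the naive rotation-sorting construction, which is O(n^2 log n), whereas the codes discussed reach O(n) via suffix sorting; the function name is ours and the sentinel character '$' is assumed absent from the input.

```python
def bwt(s, sentinel="$"):
    """Naive Burrows-Wheeler Transform: sort all rotations of s + sentinel
    and output the last column.  O(n^2 log n); practical BWT codes use
    suffix sorting or suffix trees to reach linear time."""
    t = s + sentinel
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # annb$aa
```

Runs of equal characters in the output (here "nn" and the trailing "aa") are what make the transformed string easy to compress with simple local models.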
Finding Repeats With Fixed Gap
In: Proc. of the 7th Int'l Symp. on String Processing and Information Retrieval (SPIRE). Washington: IEEE Computer Society, 2000
Abstract

Cited by 7 (2 self)
We propose an algorithm for finding, in a word, all pairs of occurrences of the same subword with a given distance r between them. The obtained complexity is O(n log r + S), where S is the size of the output. We also show how the algorithm can be modified to find all such pairs of occurrences separated by a given word. The solution uses an algorithm for finding all quasi-squares in two strings, a problem that generalizes the known problem of searching for squares.
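For intuition, the fixed-gap case admits a simple sketch (ours, not the paper's O(n log r + S) algorithm): positions i with w[i] == w[i+r] form maximal blocks, and every pair of equal subwords at distance r lies inside exactly one such block.

```python
def fixed_gap_blocks(w, r):
    """Return maximal blocks (start, length) with
    w[start:start+length] == w[start+r:start+r+length].
    Naive O(n) scan for one fixed r; the paper's algorithm handles the
    full problem in O(n log r + S) over all reported pairs."""
    n = len(w)
    blocks = []
    i = 0
    while i + r < n:
        if w[i] == w[i + r]:
            j = i
            while j + r < n and w[j] == w[j + r]:
                j += 1
            blocks.append((i, j - i))
            i = j
        else:
            i += 1
    return blocks

print(fixed_gap_blocks("abaabab", 2))  # [(0, 1), (3, 2)]
```

A block (start, L) compactly encodes the pairs w[a:b] == w[a+r:b+r] for all start <= a < b <= start + L, which is why the complexity is naturally stated in terms of the output size S.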
Augmenting Suffix Trees, with Applications
1998
Abstract

Cited by 6 (2 self)
Information retrieval and data compression are the two main application areas where the rich theory of string algorithmics plays a fundamental role. In this paper, we consider one algorithmic problem from each of these areas and present highly efficient (linear or near-linear time) algorithms for both problems. Our algorithms rely on augmenting the suffix tree, a fundamental data structure in string algorithmics. The augmentations are nontrivial and form the technical crux of this paper. In particular, they consist of adding extra edges to suffix trees, resulting in Directed Acyclic Graphs (DAGs). Our algorithms construct these "suffix DAGs" and manipulate them to solve the two problems efficiently.
On the Optimality of Parsing in Dynamic Dictionary Based Data Compression
Abstract

Cited by 5 (1 self)
This paper considers the following question: once a (dynamic) dictionary construction scheme is selected, is there an efficient dynamic parsing method that results in the smallest number of phrases possible for the selected scheme, for all input strings? It is shown that greedy parsing, a method used in almost all dictionary-based algorithms (including Unix compress, GIF image compression, and fax and modem standards), can be far from optimal for certain input strings. On the positive side, a simple parsing method is introduced which, for any selected dictionary scheme that has the prefix property (this includes virtually all the variations on Lempel-Ziv algorithms), is optimal with respect to the selected scheme for any input string. Also introduced is a simple data structure that enables an efficient dynamic implementation of the parsing method, in O(1) time per character and with an optimal space requirement.
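The contrast between greedy and optimal parsing can be illustrated with a small sketch. The dictionary below is a hypothetical static one with the prefix property; the paper's setting is dynamic dictionaries and its optimal parser runs in O(1) time per character, whereas this DP sketch is O(n * |dictionary|).

```python
def greedy_parse(s, dic):
    """Greedy parsing: always take the longest dictionary phrase."""
    out, i = [], 0
    while i < len(s):
        best = max((p for p in dic if s.startswith(p, i)), key=len)
        out.append(best)
        i += len(best)
    return out

def optimal_parse(s, dic):
    """Fewest-phrases parsing via dynamic programming over positions
    (a shortest path in the phrase graph)."""
    n = len(s)
    INF = float("inf")
    cost = [0] + [INF] * n          # cost[i] = min phrases covering s[:i]
    back = [None] * (n + 1)
    for i in range(n):
        if cost[i] == INF:
            continue
        for p in dic:
            if s.startswith(p, i) and cost[i] + 1 < cost[i + len(p)]:
                cost[i + len(p)] = cost[i] + 1
                back[i + len(p)] = p
    phrases, i = [], n
    while i > 0:
        phrases.append(back[i])
        i -= len(back[i])
    return phrases[::-1]

# Hypothetical prefix-property dictionary on which greedy is suboptimal
D = {"a", "ab", "abc", "c", "cd", "cde", "d", "e"}
print(greedy_parse("abcde", D))   # ['abc', 'd', 'e']  (3 phrases)
print(optimal_parse("abcde", D))  # ['ab', 'cde']      (2 phrases)
```

Greedy grabs "abc" and is then forced into two single-character phrases, while the optimal parse covers the string in two phrases.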
Optimal Lossless Compression of a Class of Dynamic Sources
Proc. Data Compression Conference, edited by J.A. Storer and J.H. Reif. IEEE Computer Society Press, Los Alamitos, CA, 1997
Abstract

Cited by 4 (0 self)
The usual assumption for proofs of the optimality of lossless encoding is a stationary ergodic source. Dynamic sources with nonstationary probability distributions occur in many practical situations where the data source is constructed by a composition of distinct sources, for example, a document with multiple authors, a multimedia document, or the composition of distinct packets sent over a communication channel. There is a vast literature of adaptive methods used to tailor the compression to dynamic sources. However, little is known about optimal or near-optimal methods for lossless compression of strings generated by sources that are not stationary ergodic. Here we do not assume the source is stationary. Instead we assume that the source produces an infinite sequence of concatenated finite strings s_1, s_2, ..., where (i) each finite string s_i is generated by a sampling of a (possibly distinct) stationary ergodic source S_i, and (ii) the length of each of the s_i is lower b...
On Finding Duplication in Strings and Software
Journal of Algorithms, 1993
Abstract

Cited by 4 (1 self)
This paper investigates finding duplication within a string. The results are phrased in terms of maximal matches, which are pairs of substrings that match but for which the match cannot be extended to the left or right in the input. An algorithm is given to find all maximal exact matches over a threshold length. For a finite alphabet S, it runs in time O(n log |S| + m), where n is the input length and m is the number of matches reported. The algorithm has been implemented and has been used as the basis of a program for finding code duplication in large software systems. (Brenda S. Baker, AT&T Bell Laboratories, Murray Hill, New Jersey 07974.)
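A naive quadratic sketch (ours, not Baker's O(n log |S| + m) algorithm) makes the definition of a maximal match concrete: a pair is reported only if it cannot be extended to the left, and right-maximality holds automatically because the extension stops at a mismatch or at the end of the string.

```python
def maximal_matches(s, threshold):
    """Naive O(n^2) enumeration of maximal matches of length >= threshold:
    triples (i, j, L) with s[i:i+L] == s[j:j+L] that extend neither left
    nor right."""
    n = len(s)
    out = []
    for i in range(n):
        for j in range(i + 1, n):
            # left-maximal: preceding characters differ (or i is at the start)
            if i > 0 and s[i - 1] == s[j - 1]:
                continue
            L = 0
            while j + L < n and s[i + L] == s[j + L]:
                L += 1
            # right-maximal by construction: the scan stopped at a mismatch
            # or at the end of the string
            if L >= threshold:
                out.append((i, j, L))
    return out

print(maximal_matches("abcXabcYabc", 3))  # [(0, 4, 3), (0, 8, 3), (4, 8, 3)]
```

The three occurrences of "abc" yield exactly three maximal matches; the shifted pairs such as (1, 5) are suppressed by the left-maximality test.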
Incremental Algorithms on Lists
Proceedings SION Computing Science in the Netherlands, 1991
Abstract

Cited by 4 (4 self)
Incremental computations can improve the performance of interactive programs such as spreadsheet programs, program development environments, text editors, etc. Incremental algorithms describe how to compute a required value depending on the input after the input has been edited. By considering the possible different edit actions on the data type of lists, the basic data type used in spreadsheet programs and text editors, we define incremental algorithms on lists. Some theory for the construction of incremental algorithms is developed, and we give an incremental algorithm for a more involved example: formatting a text. CR categories and descriptors: D.1.1 [Software]: Programming Techniques - Applicative Programming; D.3.3 [Software]: Programming Languages - Language Constructs; I.2.2 [Artificial Intelligence]: Automatic Programming - Program Transformation. General terms: algorithm, design, theory. Additional keywords and phrases: Bird-Meertens calculus for program construction, incremen...
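A minimal model of incrementality on lists, under our own simplifying assumptions (a single "replace element" edit action and a commutative, associative operator), keeps the partial results of a fold in a segment tree so that an edit updates the overall value in O(log n) rather than recomputing in O(n). The class name is ours; this is a toy sketch, not the paper's Bird-Meertens derivation.

```python
class FoldTree:
    """Incremental fold over a list: a segment tree of partial results of
    an associative, commutative operator (for non-power-of-two lengths this
    layout may reorder leaves, which is harmless for commutative ops)."""
    def __init__(self, items, op, identity):
        self.op, self.identity = op, identity
        self.n = len(items)
        self.tree = [identity] * self.n + list(items)
        for i in range(self.n - 1, 0, -1):
            self.tree[i] = op(self.tree[2 * i], self.tree[2 * i + 1])

    def replace(self, i, value):          # edit action: replace element i
        i += self.n
        self.tree[i] = value
        while i > 1:                       # recompute only the path to the root
            i //= 2
            self.tree[i] = self.op(self.tree[2 * i], self.tree[2 * i + 1])

    def fold(self):                        # current value of the fold
        return self.tree[1] if self.n else self.identity

# e.g. the longest line length of a text being edited
t = FoldTree([len(l) for l in ["short", "a much longer line", "mid-sized"]],
             max, 0)
print(t.fold())                # 18
t.replace(1, len("now tiny"))  # edit the second line
print(t.fold())                # 9
```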
Linear-time computation of local periods
Theoret. Comput. Sci.
Abstract

Cited by 3 (0 self)
We present a linear-time algorithm for computing all local periods of a given word. This subsumes (but is substantially more powerful than) the computation of the (global) period of the word and, on the other hand, the computation of a critical factorization, implied by the Critical Factorization Theorem.
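The notion of a local period can be made concrete with a naive sketch (ours; the paper computes all of them in overall linear time): the local period at boundary i is the smallest p such that a period-p square is centered between w[:i] and w[i:], where the square may overhang either end of the word.

```python
def local_period(w, i):
    """Smallest p >= 1 such that w[j] == w[j + p] for every j where both
    sides of the centered square fall inside the word.  Naive O(n^2) per
    boundary position 0 < i < len(w)."""
    n = len(w)
    for p in range(1, n + 1):
        if all(w[j] == w[j + p]
               for j in range(max(0, i - p), min(n - p, i))):
            return p
    return n  # unreachable for 0 < i < n, kept as a guard

w = "abaab"
print([local_period(w, i) for i in range(1, len(w))])  # [2, 3, 1, 3]
```

Note that boundary 2 attains the global period 3 of "abaab", as the Critical Factorization Theorem guarantees some boundary must.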
Maximal repetitions and Application to DNA sequences
2000
Abstract

Cited by 2 (1 self)
In this paper we describe an implementation of the Main-Kolpakov-Kucherov algorithm [9] for linear-time search for maximal repetitions in sequences. We first present a theoretical background and sketch the main components of the method. We also discuss how the method can be generalized to finding approximate repetitions. Then we discuss implementation decisions and present test examples of running the programs on real DNA data.
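For intuition, maximal repetitions (runs) can be found by a naive sketch (ours, not the implemented linear-time algorithm): for each candidate period p, maximal blocks of positions with s[k] == s[k+p] of length at least p give candidate runs, filtered so that p is the minimal period of the interval.

```python
def min_period(t):
    """Smallest period of t (naive check)."""
    return next(p for p in range(1, len(t) + 1)
                if all(t[i] == t[i + p] for i in range(len(t) - p)))

def runs(s):
    """Naive maximal-repetition finder: triples (start, end, p) where
    s[start:end] has minimal period p and exponent >= 2, and the interval
    cannot be extended in either direction.  O(n^2) overall."""
    n = len(s)
    found = []
    for p in range(1, n // 2 + 1):
        k = 0
        while k + p < n:
            if s[k] == s[k + p]:
                start = k
                while k + p < n and s[k] == s[k + p]:
                    k += 1
                end = k + p                       # the run covers s[start:end]
                if end - start >= 2 * p and min_period(s[start:end]) == p:
                    found.append((start, end, p))
            else:
                k += 1
    return found

print(runs("aabaabaa"))  # [(0, 2, 1), (3, 5, 1), (6, 8, 1), (0, 8, 3)]
```

The example recovers the three "aa" squares and the period-3 run spanning the whole word; the minimal-period filter discards intervals reported again at multiples of their true period.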
Efficient string matching algorithms for combinatorial universal denoising
In Proc. of IEEE Data Compression Conference (DCC), Snowbird, 2005
Abstract

Cited by 2 (0 self)
Inspired by the combinatorial denoising method DUDE [13], we present efficient algorithms for implementing this idea for arbitrary contexts or for using it within subsequences. We also propose effective, efficient denoising error estimators so we can find the best denoising of an input sequence over different context lengths. Our methods are simple, drawing from string matching methods and radix sorting. We also present experimental results of our proposed algorithms.