Results 1 - 10
of
53
Unbounded Length Contexts for PPM
- The Computer Journal
, 1995
"... uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we ..."
Abstract
-
Cited by 102 (7 self)
- Add to MetaCart
uses considerably greater computational resources (both time and space). The next section describes the basic PPM compression scheme. Following that we motivate the use of contexts of unbounded length, introduce the new method, and show how it can be implemented using a trie data structure. Then we give some results that demonstrate an improvement of about 6% over the old method. Finally, a recently-published and seemingly unrelated compression scheme [2] is related to the unbounded-context idea that forms the essential innovation of PPM*. 1 PPM: Prediction by partial match The basic idea of PPM is to use the last few characters in the input stream to predict the upcoming one. Models that condition their predictions on a few immediately preceding symbols are called "finite-context" models of order k, where k is the number of preceding symbols used. PPM employs a suite of fixed-order context models with different values of k
Introduction to Programmable Active Memories
, 1989
"... We introduce the concept of PAM, Programmable Active Memory and present results obtained with our Perle-0 prototype board, featuring: ffl A software silicon foundry for a 50K gate array, with a 50 milliseconds turn-around time. ffl A 3000 one bit processors universal machine with an arbitrary inter ..."
Abstract
-
Cited by 57 (2 self)
- Add to MetaCart
We introduce the concept of PAM, Programmable Active Memory and present results obtained with our Perle-0 prototype board, featuring: ffl A software silicon foundry for a 50K gate array, with a 50 milliseconds turn-around time. ffl A 3000 one bit processors universal machine with an arbitrary interconnect structure specified by 400K bits of nano-code. ffl A programmable hardware co-processor with an initial library including: a long multiplier, an image convolver, a data compressor, etc. Each of these hardware designs speeds up the corresponding software application by at least an order of magnitude.
A New Challenge for Compression Algorithms: Genetic Sequences
- Information Processing & Management
, 1994
"... Universal data compression algorithms fail to compress genetic sequences. It is due to the specificity of this particular kind of "text". We analyze in some details the properties of the sequences, which cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, ..."
Abstract
-
Cited by 55 (0 self)
- Add to MetaCart
Universal data compression algorithms fail to compress genetic sequences. It is due to the specificity of this particular kind of "text". We analyze in some details the properties of the sequences, which cause the failure of classical algorithms. We then present a lossless algorithm, biocompress-2, to compress the information contained in DNA and RNA sequences, based on the detection of regularities, such as the presence of palindromes. The algorithm combines substitutional and statistical methods, and to the best of our knowledge, lead to the highest compression of DNA. The results, although not satisfactory, gives insight to the necessary correlation between compression and comprehension of genetic sequences. 1 Introduction There are plenty of specific types of data which need to be compressed, for ease of storage and communication. Among them are texts (such as natural language and programs), images, sounds, etc. In this paper, we focus on the compression of a specific kin...
Lempel-Ziv parsing and sublinear-size index structures for string matching (Extended Abstract)
- Proc. 3rd South American Workshop on String Processing (WSP'96
, 1996
"... String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing ..."
Abstract
-
Cited by 46 (1 self)
- Add to MetaCart
String matching over a long text can be significantly speeded up with an index structure formed by preprocessing the text. For very long texts, the size of such an index can be a problem. This paper presents the first sublinear-size index structure. The new structure is based on Lempel-Ziv parsing of the text and has size linear in N, the size of the Lempel-Ziv parse. For a text of length n, N = O(n = log n) and can be still smaller if the text is compressible. With the new index structure, all occurrences of a pattern string of length m can be found in time O(m 2
A General Practical Approach to Pattern Matching over Ziv-Lempel Compressed Text
, 1998
"... . We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matc ..."
Abstract
-
Cited by 38 (7 self)
- Add to MetaCart
. We address the problem of string matching on Ziv-Lempel compressed text. The goal is to search a pattern in a text without uncompressing it. This is a highly relevant issue to keep compressed text databases where efficient searching is still possible. We develop a general technique for string matching when the text comes as a sequence of blocks. This abstracts the essential features of Ziv-Lempel compression. We then apply the scheme to each particular type of compression. We present the first algorithm to find all the matches of a pattern in a text compressed using LZ77. When we apply our scheme to LZ78, we obtain a much more efficient search algorithm, which is faster than uncompressing the text and then searching on it. Finally, we propose a new hybrid compression scheme which is between LZ77 and LZ78, being in practice as good to compress as LZ77 and as fast to search in as LZ78. 1 Introduction String matching is one of the most pervasive problems in computer science, with appli...
Extended Application of Suffix Trees to Data Compression
- In Data Compression Conference
, 1996
"... A practical scheme for maintaining an index for a sliding window in optimal time and space, by use of a suffix tree, is presented. The index supports location of the longest matching substring in time proportional to the length of the match. The total time for build and update operations is proporti ..."
Abstract
-
Cited by 35 (2 self)
- Add to MetaCart
A practical scheme for maintaining an index for a sliding window in optimal time and space, by use of a suffix tree, is presented. The index supports location of the longest matching substring in time proportional to the length of the match. The total time for build and update operations is proportional to the size of the input. The algorithm, which is simple and straightforward, is presented in detail. The most prominent lossless data compression scheme, when considering compression performance, is prediction by partial matching with unbounded context lengths (PPM*). However, previously presented algorithms are hardly practical, considering their extensive use of computational resources. We show that our scheme can be applied to PPM*-style compression, obtaining an algorithm that runs in linear time, and in space bounded by an arbitrarily chosen window size. Application to Ziv--Lempel '77 compression methods is straightforward and the resulting algorithm runs in linear time. 1 Introdu...
Estimating Alphanumeric Selectivity in the Presence of Wildcards
- In SIGMOD
, 1996
"... Success of commercial query optimizers and database management systems (object-oriented or relational) depend on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determin ..."
Abstract
-
Cited by 33 (2 self)
- Add to MetaCart
Success of commercial query optimizers and database management systems (object-oriented or relational) depend on accurate cost estimation of various query reorderings [BGI]. Estimating predicate selectivity, or the fraction of rows in a database that satisfy a selection predicate, is key to determining the optimal join order. Previous work has concentrated on estimating selectivity for numeric fields [ASW, HaSa, IoP, LNS, SAC, WVT]. With the popularity of textual data being stored in databases, it has become important to estimate selectivity accurately for alphanumeric fields. A particularly problematic predicate used against alphanumeric fields is the SQL like predicate [Dat]. Techniques used for estimating numeric selectivity are not suited for estimating In this paper, we study for the first time the problem of estimating alphanumeric selectivity in the presence of wildcards. Based on the intuition that the model built by a data compressor on an input text encapsulates information about common substrings in the text, we develop a technique based on the suffix tree data structure to estimate alphanumeric selectivity. In a statistics generation pass over the database, we construct a compact suffix tree-based structure from the columns of the database. We then look at three families of methods that utilize this structure to estimate selectivity during query plan costing, when a query with predicates on alphanumeric attributes contains wildcards in the predicate. We evaluate our methods empirically in the context of the TPC-D benchmark. We study our methods experimentally against a variety of query patterns and identify five techniques that hold promise.
Linear-Time, Incremental Hierarchy Inference for Compression
- Data Compression Conference, Snowbird, Utah, IEEE Computer Society
, 1997
"... this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence c ..."
Abstract
-
Cited by 27 (3 self)
- Add to MetaCart
this paper, we present three new results that characterize SEQUITUR's computational and compression performance. First, we prove that SEQUITUR operates in time linear in n, the length of the input sequence, despite its ability to build a hierarchy as deep as log(n). Second, we show that a sequence can be compressed incrementally, improving on the non-incremental algorithm that was described by Nevill-Manning et al. (1994), and making on-line compression feasible. Third, we present an intriguing result that emerged during benchmarking; whereas PPMC (Moffat, 1990) outperforms SEQUITUR on most files in the Calgary corpus, SEQUITUR regains the lead when tested on multimegabyte sequences. We make some tentative conclusions about the underlying reasons for this phenomenon, and about the nature of current compression benchmarking.
Off-line compression by greedy textual substitution
- PROC. IEEE
, 2000
"... Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the ..."
Abstract
-
Cited by 23 (1 self)
- Add to MetaCart
Greedy off-line textual substitution refers to the following approach to compression or structural inference. Given a long textstring x, a substring w is identified such that replacing all instances of w in x except one by a suitable pair of pointers yields the highest possible contraction of x; the process is then repeated on the contracted textstring until substrings capable of producing contractions can no longer be found. This paper examines computational issues arising in the implementation of this paradigm and describes some applications and experiments.
Online Prediction Algorithms for Databases and Operating Systems
, 1995
"... In making online decisions, computer systems are inherently trying to predict future events. Typical decision problems in computer systems translate to three prediction scenarios: predicting what event is going to happen in the future, when a specific event will take place, or how much of something ..."
Abstract
-
Cited by 16 (1 self)
- Add to MetaCart
In making online decisions, computer systems are inherently trying to predict future events. Typical decision problems in computer systems translate to three prediction scenarios: predicting what event is going to happen in the future, when a specific event will take place, or how much of something is going to happen. In this thesis, we develop practical algorithms for specific instances of these three prediction scenarios, and prove the goodness of our algorithms via analytical and experimental methods. We study each of the three prediction scenarios via motivating systems problems. The problem of prefetching requires a prediction of which page is going to be next requested by a user. The problem of disk spindown in mobile machines, modeled by the rent-to-buy framework, requires an estimate of when the next disk access is going to happen. Query optimizers choose a database access strategy by predicting or estimating selectivity, i.e., by estimating the size of a query result. We an...

