An Experimental Study of an Opportunistic Index
- In SODA
, 2001
Cited by 61 (6 self)
The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression always appears an attractive choice, if not a mandatory one. However, space overhead is not the only resource to be optimized when managing large data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information. Approaches that combine compression and indexing techniques are nowadays receiving more and more attention. A first step towards the design of a compressed full-text index achieving guaranteed performance in the worst case was recently taken in [10]. This index combines the compression algorithm proposed by Burrows and Wheeler [5] with the suffix array data structure [16]. The index is opportunistic in that it takes advantage of the compressibility of the input data by decreasing the space occupancy at no significant asymptotic slowdown in the query performance. In this paper we present an implementation of this index and perform an extensive set of experiments on various text collections. The experiments show that our index is compact (its space occupancy is close to the one achieved by the best known compressors), it is fast in counting the number of pattern occurrences, and the cost of their retrieval is reasonable when they are few (i.e., in case of a selective query). In addition, our experiments show that the FM-index is flexible in that it is possible to trade space occupancy for search time by choosing the amount of auxiliary information stored in it.
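The counting operation this abstract refers to can be illustrated with a minimal Python sketch of backward search over a Burrows-Wheeler transformed text. This is not the authors' implementation: the function names are hypothetical, the suffix sorting is naive, and the occurrence table is stored uncompressed, whereas a real compressed index would sample or compress it.

```python
def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) of text plus a sentinel, using naive suffix sorting."""
    s = text + "\0"  # assumption: "\0" does not occur in text and sorts first
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character of each sorted suffix is the character preceding it (cyclically).
    bwt = "".join(s[i - 1] for i in sa)
    return bwt, sa


def build_tables(bwt):
    """Build C (characters smaller than c) and occ (prefix counts of c in bwt)."""
    alphabet = sorted(set(bwt))
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += bwt.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed, for clarity only)
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return C, occ


def count_occurrences(pattern, C, occ, n):
    """Count occurrences of pattern by scanning it right to left (backward search)."""
    lo, hi = 0, n  # half-open range of rows in the sorted suffix list
    for ch in reversed(pattern):
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


bwt, sa = bwt_via_suffix_array("mississippi")
C, occ = build_tables(bwt)
print(count_occurrences("ssi", C, occ, len(bwt)))
```

The loop performs one step per pattern character and never touches the original text, which is why counting can be fast even when locating each occurrence costs more.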
Boosting textual compression in optimal linear time
- Journal of the ACM
, 2005
Cited by 34 (19 self)
We provide a general boosting technique for Textual Data Compression. Qualitatively, it takes a good compression algorithm and turns it into an algorithm with a better compression ...
[Footnote: Extended abstracts related to this article appeared in Proceedings of CPM 2001 and Proceedings of ACM-SIAM SODA 2004, and were combined due to their strong relatedness and complementarity. The work of P. Ferragina was partially supported by the Italian MIUR projects "Algorithms for the Next ...]
Second step algorithms in the Burrows-Wheeler compression algorithm
- Software Practice and Experience
, 2001
Cited by 19 (0 self)
In this paper we focus on the second step algorithms of the Burrows-Wheeler compression algorithm, which in the original version is the Move To Front transform. We discuss many of the replacements proposed so far and compare the compression results obtained with them. Then we propose a new algorithm that yields a better compression ratio than the previous ones.
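For reference, the Move To Front transform named in this abstract can be sketched in a few lines of Python. This shows only the classic transform over a byte alphabet, not the replacement algorithm the paper proposes; the function names are hypothetical.

```python
def mtf_encode(data):
    """Replace each byte by its current position in a recency list, then move it to the front."""
    table = list(range(256))  # initial list: byte values in increasing order
    codes = []
    for b in data:
        idx = table.index(b)
        codes.append(idx)
        table.pop(idx)
        table.insert(0, b)
    return codes


def mtf_decode(codes):
    """Invert mtf_encode by replaying the same list updates."""
    table = list(range(256))
    out = bytearray()
    for idx in codes:
        b = table.pop(idx)
        out.append(b)
        table.insert(0, b)
    return bytes(out)


print(mtf_encode(b"aaabbb"))
```

Runs of equal symbols, which the Burrows-Wheeler transform tends to produce, become runs of zeros, which a later entropy coder can compress well.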
Compression Boosting in Optimal Linear Time Using the Burrows-Wheeler Transform
- Journal of the ACM
, 2004
Cited by 11 (5 self)
In this paper we provide the first compression booster that turns a zeroth order compressor into a more effective k-th order compressor without any loss in time efficiency. More precisely, let A be an algorithm that compresses a string s within $\lambda |s| H_0^*(s) + \mu$ bits of storage in $O(T(|s|))$ time, where $H_0^*(s)$ is the zeroth order entropy of the string s. Our booster improves A by compressing s within $\lambda |s| H_k^*(s) + \log_2 |s| + g_k$ bits still using $O(T(|s|))$ time, where $H_k^*(s)$ is the k-th order entropy of s.
An Experimental Study of a Compressed Index
, 2001
Cited by 10 (6 self)
The size of electronic data is currently growing at a faster rate than computer memory and disk storage capacities. For this reason compression always appears an attractive choice, if not a mandatory one. However, space overhead is not the only resource to be optimized when managing large data collections; in fact, data turn out to be useful only when properly indexed to support search operations that efficiently extract the user-requested information.
Improvements to Burrows-Wheeler Compression Algorithm
, 2000
Cited by 9 (3 self)
In 1994 Burrows and Wheeler presented a new algorithm for lossless data compression. The compression ratio that can be achieved using their algorithm is comparable with that of the best known algorithms, whilst its complexity is relatively small. In this paper we explain the internals of this algorithm and discuss the various modifications that have been presented so far. Then we propose new improvements to its effectiveness. They allow us to obtain a compression ratio of 2.271 bpc for the Calgary Corpus files, which is the best result in the class of Burrows-Wheeler Transform based algorithms.
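As background for the algorithm whose internals this paper explains, here is a minimal Python sketch of the Burrows-Wheeler Transform and its inverse. It sorts all rotations naively, which is far slower than what a practical compressor would do, and it assumes a sentinel character that does not occur in the input; the function names are hypothetical and none of the paper's improvements are shown.

```python
def bwt(text, sentinel="\0"):
    """Return the last column of the sorted rotations of text + sentinel."""
    s = text + sentinel  # assumption: sentinel is absent from text and sorts first
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last, sentinel="\0"):
    """Recover the original text from the last column alone."""
    n = len(last)
    # A stable sort of positions by character maps each row's first character
    # to the position of the same character occurrence in the last column.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # the row ending with the sentinel is the text itself
    out = []
    for _ in range(n):
        row = order[row]
        out.append(last[row])
    return "".join(out[:-1])  # drop the trailing sentinel


transformed = bwt("banana")
print(repr(transformed), inverse_bwt(transformed))
```

The transform only permutes the input, so any gain in compression comes from the later stages that exploit the clustering of equal characters.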
Computing Repeated Factors With a Factor Oracle
- Proceedings of the 11th Australasian Workshop On Combinatorial Algorithms
, 2000
Cited by 8 (1 self)
We present in this article a linear time and space method for the computation of the length of a repeated suffix for each prefix of a given word p. Our method is based on the utilization of the factor oracle of p, which is a new and very compact structure introduced in [ACR99], used for representing all the factors of p. We also exhibit applications where our method really speeds up the computation of repetitions in words, including a text data compression scheme.
Keywords: combinatorics on words, string algorithms, repetitions, factor oracle, suffix link, data compression.
1 Introduction
There have been a large number of studies on the problem of finding repetitions (or repeats) in a given word p (see [Smy00]). For instance, the Morris and Pratt [MP70] shift function gives for each prefix of p the length of its longest border. The Boyer-Moore matching shift function (see [BM77], [KMP77] and [Ryt80]) (also called good suffix shift function) gives for each suffix the position of its rightmost r...
Unifying Text Search And Compression - Suffix Sorting, Block Sorting and Suffix Arrays
, 2000
Cited by 6 (0 self)
Today many electronic documents are available, such as newspaper articles, dictionaries, books, DNA sequences, etc., and they are stored in databases. Many documents are also available on the Internet and as e-mail. Therefore, fast queries on such a huge number of documents, and their compression to reduce the cost of storing or transferring them, are important. In this thesis, a unified method for improving the efficiency of search and compression for huge text data is proposed. All search methods and compression methods used in this thesis are related to a data structure called the suffix array. The suffix array is a text search data structure, and it is used in a text compression method called block sorting. Both are promising methods for search and compression respectively, and there are many studies on them. At present, a data structure called the inverted file is used for queries over a huge number of documents. Though it is widely used, query unit is a document in order to reduce disk space to sto...
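To make the suffix array concrete, the following Python sketch builds one by naive sorting and uses binary search to find all occurrences of a pattern. It is an illustration only: the thesis concerns much more efficient suffix sorting, and the function names here are hypothetical.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of text in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted starting positions of pattern in text using two binary searches."""
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ssi"))
```

All suffixes that begin with the pattern are adjacent in the sorted order, so the query unit is an arbitrary substring rather than a whole document.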
The Burrows-Wheeler Transform: Theory and Practice
- Lecture Notes in Computer Science
, 1999
Cited by 5 (1 self)
In this paper we describe the Burrows-Wheeler Transform (BWT), a completely new approach to data compression which is the basis of some of the best compressors available today. Although it is easy to intuitively understand why the BWT helps compression, the analysis of BWT-based algorithms requires a careful study of every single algorithmic component. We describe two algorithms which use the BWT and we show that their compression ratio can be bounded in terms of the k-th order empirical entropy of the input string for any k ≥ 0. Intuitively, this means that these algorithms are able to make use of all the regularity which is in the input string.
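The k-th order empirical entropy mentioned here can be computed directly from its usual definition: group the characters of the string by the k characters that precede them, and average the zeroth order entropy of each group, weighted by group size. The Python sketch below assumes that standard definition (not any modified variant); the function names are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth order empirical entropy, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return -sum(c / n * log2(c / n) for c in Counter(symbols).values())


def hk(s, k):
    """k-th order empirical entropy of s, in bits per symbol."""
    if k == 0:
        return h0(s)
    if len(s) == 0:
        return 0.0
    # followers[w] collects the characters that appear right after context w in s.
    followers = defaultdict(list)
    for i in range(k, len(s)):
        followers[s[i - k:i]].append(s[i])
    return sum(len(f) * h0(f) for f in followers.values()) / len(s)


print(h0("abababab"), hk("abababab", 1))
```

In the example string each character is fully determined by the one before it, so the first order entropy drops to zero while the zeroth order entropy stays at one bit per symbol.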
Pattern Matching in Compressed Text and Images
, 2001
Cited by 4 (4 self)
Normally compressed data needs to be decompressed before it is processed, but if the compression has been done in the right way, it is often possible to search the data without having to decompress it, or at least with only partial decompression. The problem can be divided into lossless and lossy compression methods, and then in each of these cases the pattern matching can be either exact or inexact. Much work has been reported in the literature on techniques for all of these cases, including algorithms that are suitable for pattern matching for various compression methods, and compression methods designed specifically for pattern matching. This work is surveyed in this paper. The paper also exposes the important relationship between pattern matching and compression, and proposes some performance measures for compressed pattern matching algorithms. Ideas and directions for future work are also described.

