Results **11 - 15** of **15**

### The SBC-Tree: An Index for Run-Length Compressed Sequences

, December 2005

"... Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases. multimedia: and facsimile transmission. One of the main challenges is how to operate, e.g., indexing: searching, and retriexral: on the compressed data without decompr ..."

Abstract

Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes O(m logB(N + m)) I/O operations. Substring matching, prefix matching, and range search execute in an optimal O(logB N + T) I/O operations, where p is the length of the compressed query pattern and T is the query output size. We present also two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.
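As a concrete illustration of the compression scheme the abstract operates on, here is a minimal Run-Length Encoding sketch in Python; the function names are illustrative and not taken from the paper.

```python
# Minimal sketch of Run-Length Encoding (RLE): a sequence is stored
# as (symbol, run_length) pairs. Names here are illustrative only.
from itertools import groupby

def rle_encode(seq):
    """Compress a sequence into (symbol, run_length) pairs."""
    return [(sym, len(list(run))) for sym, run in groupby(seq)]

def rle_decode(pairs):
    """Expand (symbol, run_length) pairs back into the original string."""
    return "".join(sym * length for sym, length in pairs)

pairs = rle_encode("AAAABBBCCD")
print(pairs)              # [('A', 4), ('B', 3), ('C', 2), ('D', 1)]
print(rle_decode(pairs))  # AAAABBBCCD
```

For sequences with long runs (as in protein or facsimile data), the pair list is much shorter than the raw sequence, which is exactly the representation the SBC-tree indexes without decompressing.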

### A Unified Algorithm for Accelerating Edit-Distance Computation (Symposium on Theoretical Aspects of Computer Science, www.stacs-conf.org)

"... Abstract. The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N 2) t ..."

Abstract

Abstract. The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes.
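For reference, the standard O(N²) dynamic-programming solution the abstract refers to can be sketched as follows; this is the textbook algorithm, not the compression-accelerated variant the paper develops.

```python
# Classic O(N^2) edit-distance (Levenshtein) DP, kept to a single
# rolling row so only O(min(m, n)) extra space is used.
def edit_distance(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))          # distances for the empty prefix of a
    for i in range(1, m + 1):
        prev_diag, dp[0] = dp[0], i  # dp[0] = distance from a[:i] to ""
        for j in range(1, n + 1):
            prev_diag, dp[j] = dp[j], min(
                dp[j] + 1,                           # delete a[i-1]
                dp[j - 1] + 1,                       # insert b[j-1]
                prev_diag + (a[i - 1] != b[j - 1]),  # substitute or match
            )
    return dp[n]

print(edit_distance("kitten", "sitting"))  # 3
```

Every cell does O(1) work over an (m+1) × (n+1) table, which is where the quadratic bound comes from; the compression-based techniques below aim to avoid touching every cell.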

### Unified Compression-Based Acceleration of Edit-Distance Computation

"... The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N²) time. To thi ..."

Abstract

The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N²) time. To this date, this quadratic upper-bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit-distance of these two strings.
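A straight-line program (SLP) is a context-free grammar in which every nonterminal has exactly one production and therefore derives exactly one string. A toy sketch, with an illustrative dict encoding not taken from the paper:

```python
# Toy straight-line program: terminal rules map a nonterminal to a
# character; binary rules concatenate two earlier nonterminals.
def expand(rules, symbol):
    """Recursively expand an SLP nonterminal into the string it derives."""
    rhs = rules[symbol]
    if isinstance(rhs, str):   # terminal rule, e.g. X1 -> "a"
        return rhs
    left, right = rhs          # binary rule, e.g. X3 -> X1 X2
    return expand(rules, left) + expand(rules, right)

# An SLP of 4 rules deriving "abab"; chaining doubling rules like X4
# lets a grammar of linear size derive an exponentially long string,
# which is the source of the compression.
rules = {
    "X1": "a",
    "X2": "b",
    "X3": ("X1", "X2"),   # derives "ab"
    "X4": ("X3", "X3"),   # derives "abab"
}
print(expand(rules, "X4"))  # abab
```

The paper's algorithms work directly on the n grammar rules instead of the N derived characters, which is how the O(N²) barrier is beaten for compressible inputs.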

### Improved Compression-Based Acceleration of Edit-Distance Computation

"... Abstract. We focus on accelerating the known solutions for the classical edit-distance problem via compression techniques. Using straight-line programs we show a single edit-distance algorithm that works for strings which compress well under many popular compression schemes including the LZfamily, R ..."

Abstract

Abstract. We focus on accelerating the known solutions for the classical edit-distance problem via compression techniques. Using straight-line programs we show a single edit-distance algorithm that works for strings which compress well under many popular compression schemes including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n^{2/3}N^{4/3}) time algorithm for arbitrary scoring functions. Our new result, while providing a significant speed-up for highly compressible strings, does not surpass the quadratic time bound even in the worst-case scenario. Supported by the Adams Fellowship of the Israel Academy of Sciences and Humanities.

### Sorting a Compressed List

, 2012

"... We consider the task of sorting and performing kth order statistics on a list that is stored in compressed form. The most common approach to this problem is to first decompress the array (usually in linear time), and then apply standard algorithmic tools. This approach, however, ignores the rich inf ..."

Abstract

We consider the task of sorting and performing kth order statistics on a list that is stored in compressed form. The most common approach to this problem is to first decompress the array (usually in linear time), and then apply standard algorithmic tools. This approach, however, ignores the rich information about the input that is implicit in the compressed form. In particular, exploiting this information from the compression may eliminate the need to decompress, and may also enable algorithmic improvements that provide substantial speedups. We thus suggest a more rigorous study of what we call compression-aware algorithms. Already the string-matching community has applied this idea to developing surprisingly efficient pattern matching and edit distance algorithms on compressed strings. In this paper, we begin to study the problem of sorting on compressed lists. Given an LZ77 representation of size C that decompresses to an array of length n, our algorithm can output an LZ77-compressed representation of the sorted dataset in O(C + |Σ| log |Σ| + n) time, with Σ as the alphabet. Secondly, we consider a compression scheme in which an n-integer array is represented as the union of C arithmetic sequences. Using priority queues, we can sort the array in O(n log C) time. Lastly, given an array compressed with a context-free grammar of size C we can find the sorted array in O(C · |Σ|) time, where Σ is the alphabet of the string. Additionally we present algorithms for indexing an LZ77-compressed string in O(C), and ...
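The priority-queue idea for the union-of-arithmetic-sequences case can be sketched as a C-way merge; the (start, step, count) encoding and function name are assumptions for illustration, and non-negative steps are assumed so each sequence is already ascending.

```python
# C-way merge of arithmetic sequences with a min-heap: the heap holds
# one candidate per sequence, so each of the n output elements costs
# one O(log C) pop and at most one push, giving O(n log C) overall.
import heapq

def sort_arithmetic_union(sequences):
    """Sort the union of (start, step, count) sequences; steps must be >= 0."""
    heap = [(start, step, count) for start, step, count in sequences if count > 0]
    heapq.heapify(heap)                           # O(C)
    out = []
    while heap:
        value, step, count = heapq.heappop(heap)  # smallest remaining element
        out.append(value)
        if count > 1:                             # advance this sequence
            heapq.heappush(heap, (value + step, step, count - 1))
    return out

# Union of {0, 3, 6} and {1, 2, 3, 4}:
print(sort_arithmetic_union([(0, 3, 3), (1, 1, 4)]))  # [0, 1, 2, 3, 3, 4, 6]
```

Note the n in O(n log C) counts decompressed elements: the output here is the full sorted array, so unlike the LZ77 counting result, this algorithm's cost grows with the uncompressed length.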