Results

**11 - 19** of **19**

### A Unified Algorithm for Accelerating Edit-Distance Computation

Symposium on Theoretical Aspects of Computer Science (STACS), www.stacs-conf.org

Abstract

Abstract. The edit distance problem is a classical and fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N²) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods.
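The quadratic dynamic program this abstract refers to is the classical Wagner-Fischer recurrence; a minimal sketch (the function name and the row-by-row space optimization are our choices, not the paper's):

```python
def edit_distance(s: str, t: str) -> int:
    """Classic dynamic program: O(|s| * |t|) time, O(|t|) space."""
    prev = list(range(len(t) + 1))           # distances from "" to prefixes of t
    for i, a in enumerate(s, start=1):
        cur = [i]                            # distance from s[:i] to ""
        for j, b in enumerate(t, start=1):
            cur.append(min(prev[j] + 1,             # delete a
                           cur[j - 1] + 1,          # insert b
                           prev[j - 1] + (a != b))) # substitute or match
        prev = cur
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` returns 3 (two substitutions and one insertion).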

### Sorting a Compressed List

, 2012

Abstract

We consider the task of sorting and performing k-th order statistics on a list that is stored in compressed form. The most common approach to this problem is to first decompress the array (usually in linear time), and then apply standard algorithmic tools. This approach, however, ignores the rich information about the input that is implicit in the compressed form. In particular, exploiting this information from the compression may eliminate the need to decompress, and may also enable algorithmic improvements that provide substantial speedups. We thus suggest a more rigorous study of what we call compression-aware algorithms. Already the string-matching community has applied this idea to developing surprisingly efficient pattern matching and edit distance algorithms on compressed strings. In this paper, we begin to study the problem of sorting on compressed lists. Given an LZ77 representation of size C that decompresses to an array of length n, our algorithm can output an LZ77-compressed representation of the sorted dataset in O(C + |Σ| log |Σ| + n) time, with Σ as the alphabet. Secondly, we consider a compression scheme in which an n-integer array is represented as the union of C arithmetic sequences. Using priority queues, we can sort the array in O(n log C) time. Lastly, given an array compressed with a context-free grammar of size C, we can find the sorted array in O(C · |Σ|) time, where Σ is the alphabet of the string. Additionally, we present algorithms for indexing an LZ77-compressed string in O(C) time.
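The "sort the distinct symbols, not the elements" idea can be illustrated on a plain run-length-encoded input (a simplified analogue of the paper's LZ77 setting; the function name and run format are our assumptions): tally per-symbol totals directly from the runs, sort the distinct symbols, and emit the sorted result back in compressed form.

```python
from collections import Counter

def sort_rle(runs):
    """Sort the array described by (symbol, length) runs without expanding it.
    Tally per-symbol totals from the runs (O(C)), sort the distinct symbols
    (O(|Sigma| log |Sigma|)), and emit one run per symbol, still compressed."""
    totals = Counter()
    for sym, length in runs:
        totals[sym] += length
    return [(sym, totals[sym]) for sym in sorted(totals)]
```

For example, `sort_rle([("b", 3), ("a", 2), ("b", 1), ("c", 4)])` yields `[("a", 2), ("b", 4), ("c", 4)]` without ever materializing the 10-element array.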

### Unified Compression-Based Acceleration of Edit-Distance Computation

Abstract

The edit distance problem is a classical and fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit-distance between a pair of strings of total length O(N) in O(N²) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound in case the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression scheme. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform for representing many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an ...
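As a concrete illustration of the straight-line-program platform, here is a toy SLP (a grammar in which every nonterminal has exactly one rule and derives exactly one string) together with a naive expander; the rule names and dictionary encoding are ours:

```python
def expand(slp, symbol):
    """Expand a straight-line program from a start symbol. `slp` maps each
    nonterminal to the sequence of terminals/nonterminals on its right-hand
    side; anything without a rule is treated as a terminal character."""
    rule = slp.get(symbol)
    if rule is None:                 # terminal symbol
        return symbol
    return "".join(expand(slp, s) for s in rule)

# An SLP with six rules for the Fibonacci word "abaababa" (length 8).
# Iterating the doubling pattern shows why exponentially long strings
# can have SLPs of only linear (in the exponent) size.
fib = {
    "X1": ["b"],
    "X2": ["a"],
    "X3": ["X2", "X1"],   # "ab"
    "X4": ["X3", "X2"],   # "aba"
    "X5": ["X4", "X3"],   # "abaab"
    "X6": ["X5", "X4"],   # "abaababa"
}
```

Here `expand(fib, "X6")` returns `"abaababa"`; compressed edit-distance algorithms operate on grammars like `fib` directly, without ever calling such an expander.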

### Re-Use Dynamic Programming for Sequence Alignment: An Algorithmic Toolkit

Abstract

The problem of comparing two sequences S and T to determine their similarity is one of the fundamental problems in pattern matching. In this manuscript we will be primarily concerned with sequences as our objects and with various string comparison metrics. Our goal is to survey a methodology for utilizing repetitions in sequences in order to speed up the comparison process. Within this framework we consider various methods of parsing the sequences in order to frame their repetitions, and present a toolkit of various solutions whose time complexity depends both on the chosen parsing method as well as on the string-comparison metric used for the alignment.

### Musical Sequence Comparison for Melodic and Rhythmic Similarities

Abstract

We address the problem of musical sequence comparison for melodic similarity. Starting with a very simple similarity measure, we improve it step by step to finally obtain an acceptable measure. While the measure is still simple and has only two tuning parameters, it improves on the measure proposed by Mongeau and Sankoff (1990): it distinguishes variations on a particular theme from a mixed collection of variations on multiple themes by Mozart more successfully than the Mongeau-Sankoff measure does. We also present a measure for quantifying rhythmic similarity and evaluate its performance on popular Japanese songs.

### The SBC-Tree: An Index for Run-Length Compressed Sequences

, December 2005

Abstract

Run-Length Encoding (RLE) is a data compression technique that is used in various applications, e.g., biological sequence databases, multimedia, and facsimile transmission. One of the main challenges is how to operate, e.g., indexing, searching, and retrieval, on the compressed data without decompressing it. In this paper, we present the String B-tree for Compressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure. The SBC-tree supports substring as well as prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences and B is the disk page size. The insertion and deletion of all suffixes of a compressed sequence of length m takes O(m log_B(N + m)) I/O operations. Substring matching, prefix matching, and range search execute in an optimal O(log_B N + (p + T)/B) I/O operations, where p is the length of the compressed query pattern and T is the query output size. We present also two variants of the SBC-tree: the SBC-tree that is based on an R-tree instead of the 3-sided structure, and the one-level SBC-tree that does not use a two-dimensional index. These variants do not have provable worst-case theoretical bounds for search operations, but perform well in practice. The SBC-tree index is realized inside PostgreSQL in the context of a biological protein database application. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, up to 30% reduction in I/Os for the insertion operations, and retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.

### Approximate Matching of Run-Length Compressed Strings

Abstract

The problem of compressed pattern matching is, given a compressed text T and a (possibly compressed) pattern P, to find all occurrences of P in T without decompressing T (and P). The goal is to search faster than by using the basic scheme: decompression followed by a search.
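A naive illustration of searching without full decompression, assuming both text and pattern are run-length encoded (this O(C·m) sketch is ours, not an algorithm from the paper): an occurrence aligns the pattern's first run with a suffix of a text run, its middle runs with whole text runs, and its last run with a prefix of a text run.

```python
def rle(s):
    """Run-length encode a string into (symbol, length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1
        else:
            runs.append([ch, 1])
    return [(c, n) for c, n in runs]

def count_occurrences(text_runs, pat_runs):
    """Count occurrences of the pattern in the text, both in RLE form,
    without materializing the decompressed text."""
    if len(pat_runs) == 1:
        sym, k = pat_runs[0]
        # k consecutive symbols inside a run of length n start at n-k+1 places
        return sum(max(n - k + 1, 0) for c, n in text_runs if c == sym)
    total = 0
    m = len(pat_runs)
    (pc0, pn0), (pcL, pnL) = pat_runs[0], pat_runs[-1]
    for i in range(len(text_runs) - m + 1):
        window = text_runs[i:i + m]
        ok = (window[0][0] == pc0 and window[0][1] >= pn0 and   # suffix of run
              window[-1][0] == pcL and window[-1][1] >= pnL and  # prefix of run
              window[1:-1] == list(pat_runs[1:-1]))              # exact middles
        total += ok
    return total
```

For example, `count_occurrences(rle("aaabbbaab"), rle("aab"))` returns 2. The point of compressed matching is that the loop runs over the C runs rather than the N characters.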

### Improved Compression-Based Acceleration of Edit-Distance Computation

Abstract

Abstract. We focus on accelerating the known solutions to the classical edit-distance problem via compression techniques. Using straight-line programs we show a single edit-distance algorithm that works for strings which compress well under many popular compression schemes, including the LZ-family, Run-Length Encoding, Byte-Pair Encoding, and dictionary methods. For two strings of total length N having straight-line program representations of total size n, we present an algorithm running in O(nN lg(N/n)) time for computing the edit-distance of these two strings under any rational scoring function, and an O(n^(2/3) N^(4/3)) time algorithm for arbitrary scoring functions. Our new result, while providing a significant speed-up for highly compressible strings, does not surpass the quadratic time bound even in the worst-case scenario. (Supported by the Adams Fellowship of the Israel Academy of Sciences and Humanities.)

### Quantization of random sequences and related statistical problems

Abstract

We consider quantization of signals in a probabilistic framework. In practice, signals (or random processes) are observed at sampling points. We study probabilistic models for the run-length encoding (RLE) method. This method is characterized by the compression efficiency coefficient (or quantization rate) and is widely used, for example, in digital signal and image compression. Some properties of the RLE quantization rate are investigated. Statistical inference for the mean RLE quantization rate is considered. In particular, the asymptotic normality of mean RLE quantization rate estimates is studied. Numerical experiments demonstrating the rate of convergence in the obtained asymptotic results are presented.
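For illustration, a toy computation of an RLE quantization rate for a sampled signal; the definition used here (number of runs divided by number of samples) and the uniform quantizer are our simplifying assumptions, not necessarily the paper's:

```python
def rle_rate(samples, step):
    """Quantize a sampled signal to levels of width `step`, then return the
    RLE quantization rate: runs / samples (illustrative definition -- smaller
    means the quantized signal compresses better under RLE)."""
    levels = [int(x // step) for x in samples]
    if not levels:
        return 0.0
    # a new run starts at each point where the quantized level changes
    runs = 1 + sum(a != b for a, b in zip(levels, levels[1:]))
    return runs / len(levels)
```

A slowly varying signal sampled densely yields long constant-level runs and hence a small rate, which is the regime the probabilistic models in the abstract describe.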