Results 1–10 of 66
Random Access to Grammar-Compressed Strings
, 2011
"... Let S be a string of length N compressed into a contextfree grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is ..."
Abstract

Cited by 30 (3 self)
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is the inverse of the kth row of Ackermann’s function. Our representations also efficiently support decompression of any substring of S: we can decompress any substring of length m in the same complexity as a single random access query plus additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
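The core idea behind random access to a grammar-compressed string can be sketched naively: annotate every symbol of the straight-line program with the length of its expansion, then descend toward the queried position. The sketch below (rule encoding and names are my own, not the paper's) costs O(grammar height) per access rather than the paper's O(log N), which requires the heavy-path machinery the abstract mentions.

```python
def expansion_lengths(rules):
    """rules maps symbol -> terminal char or (left, right) pair.
    Returns symbol -> length of its full expansion."""
    length = {}
    def size(sym):
        if sym not in length:
            rhs = rules[sym]
            length[sym] = 1 if isinstance(rhs, str) else size(rhs[0]) + size(rhs[1])
        return length[sym]
    for sym in rules:
        size(sym)
    return length

def access(rules, length, start, i):
    """Return character i (0-based) of the expansion of `start`
    without decompressing anything else."""
    sym = start
    while not isinstance(rules[sym], str):
        left, right = rules[sym]
        if i < length[left]:
            sym = left
        else:
            i -= length[left]
            sym = right
    return rules[sym]
```

For example, with rules {'A': 'a', 'B': 'b', 'C': ('A', 'B'), 'D': ('C', 'C')}, symbol 'D' expands to "abab" and each character is reachable by a short descent.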
Transposition invariant string matching
, 2003
"... Given strings A = a1a2...am and B = b1b2...bn over an alphabet Σ ⊆ U, whereU is some numerical universe closed under addition and subtraction, and a distance function d(A,B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is ..."
Abstract

Cited by 25 (6 self)
Given strings A = a1a2...am and B = b1b2...bn over an alphabet Σ ⊆ U, where U is some numerical universe closed under addition and subtraction, and a distance function d(A,B) that gives the score of the best (partial) matching of A and B, the transposition invariant distance is
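A brute-force reference for the definition above, restricted to the special case of equal-length integer strings under Hamming distance (the paper's algorithms are substantially faster): since any transposition t that aligns no position at all mismatches everywhere, only shifts of the form t = b_i − a_i need to be tried.

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def transposition_invariant_hamming(a, b):
    """min over transpositions t of hamming(a + t, b), where a + t shifts
    every element of a by t. Only t = b_i - a_i candidates can improve on
    the trivial all-mismatch outcome, so O(m) shifts suffice."""
    assert len(a) == len(b)
    candidates = {y - x for x, y in zip(a, b)} | {0}
    return min(hamming([x + t for x in a], b) for t in candidates)
```

This is O(m²) overall and serves only to pin down the definition; the paper's contribution is doing such computations far more efficiently for various distances d.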
Protein similarity search with subset seeds on a dedicated reconfigurable hardware
 PROCEEDINGS OF THE 2ND WORKSHOP ON PARALLEL BIOCOMPUTING WORKSHOP (PBC'07)
, 2008
"... With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for large scale genome and proteome comparisons. Modern seedbased techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/sp ..."
Abstract

Cited by 24 (18 self)
With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for large-scale genome and proteome comparisons. Modern seed-based techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/specificity ratio. We present an implementation of such a seed-based technique on parallel specialized hardware embedding a reconfigurable architecture (FPGA), where the FPGA is tightly connected to large-capacity Flash memories. This parallel system allows large databases to be fully indexed and rapidly accessed. Compared to traditional approaches represented by the Blastp software, we obtain both a significant speedup and better results. To the best of our knowledge, this is the first attempt to exploit efficient seed-based algorithms for parallelizing the sequence similarity search.
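To make the family of techniques named above concrete, here is an illustrative software-only sketch of spaced-seed indexing (the seed pattern and toy DNA data are my own, and this says nothing about the paper's FPGA design): a seed like "1101" keeps only the '1' positions of each window as the index key, so a lookup tolerates mismatches at the '0' positions.

```python
from collections import defaultdict

def seed_key(s, pos, seed):
    """Characters of s at the '1' positions of the seed, starting at pos."""
    return ''.join(s[pos + i] for i, c in enumerate(seed) if c == '1')

def build_index(db, seed):
    """Map each spaced-seed key to the database positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(db) - len(seed) + 1):
        index[seed_key(db, pos, seed)].append(pos)
    return index

def candidate_hits(index, query, seed):
    """Database positions sharing a seed key with some query window;
    these candidates would then be verified by full alignment."""
    hits = set()
    for qpos in range(len(query) - len(seed) + 1):
        hits.update(index.get(seed_key(query, qpos, seed), []))
    return hits
```

For example, indexing "ACGTACGT" with seed "1101" lets the query window "ACTT" hit both occurrences of "ACGT" even though its third character differs.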
Approximate Matching of Run-Length Compressed Strings
 Algorithmica
, 2001
"... We focus on the problem of approximate matching of strings that have been compressed using runlength encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m runs. We extend an existi ..."
Abstract

Cited by 23 (0 self)
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m′ and n′ runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m′n′(m′ + n′)) complexity.
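As background for the entry above, a minimal sketch of the two ingredients it combines: run-length encoding itself, and the plain O(mn) LCS dynamic program that run-length-aware algorithms improve on by working run-by-run instead of character-by-character. (This is the uncompressed baseline, not the paper's algorithm.)

```python
from itertools import groupby

def rle(s):
    """Run-length encode: 'aaabcc' -> [('a', 3), ('b', 1), ('c', 2)]."""
    return [(ch, len(list(g))) for ch, g in groupby(s)]

def lcs(a, b):
    """Classic O(mn) longest-common-subsequence length."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The run-based algorithms exploit the fact that inside a block of the DP table corresponding to a pair of runs, all entries follow a regular pattern, so a whole m′ × n′ grid of blocks can replace the m × n grid of cells.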
Privacy-Preserving Genomic Computation Through Program Specialization
"... In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that makes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it ..."
Abstract

Cited by 22 (3 self)
In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that makes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it is shared across humans (and hence public), and consequently relatively little of it is sensitive. Based on this observation, we propose a privacy-protection framework that partitions a genomic computation, distributing the part on sensitive data to the data provider and the part on the public data to the user of the data. Such a partition is achieved through program specialization that enables a biocomputing program to perform a concrete execution on public data and a symbolic execution on sensitive data. As a result, the program is simplified into an efficient query program that takes only sensitive genetic data as inputs. We prove the effectiveness of our techniques on a set of dynamic programming algorithms common in genomic computing. We develop a program transformation tool that automatically instruments a legacy program for specialization operations. We also demonstrate that our techniques can greatly facilitate secure multi-party computations on large biocomputing problems.
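The specialization idea can be illustrated with a toy example (this is a generic sketch of partial evaluation, not the paper's tool; the function names and the match-counting task are hypothetical): all work that depends only on the public input runs once at specialization time, leaving a residual "query program" that takes only the sensitive input.

```python
def specialize_match_count(public_ref):
    """Specialize a symbol-counting computation against a fixed public
    reference sequence. The precomputation below touches only public data."""
    positions = {}
    for i, ch in enumerate(public_ref):
        positions.setdefault(ch, []).append(i)

    def residual(private_seq):
        """The residual query program: the only code that ever sees the
        sensitive sequence, and it needs just the precomputed table."""
        return sum(len(positions.get(ch, [])) for ch in private_seq)

    return residual

# Specialization time (public data only):
query = specialize_match_count("GATTACA")
# Query time (sensitive data only): query("AT")
```

The point of the partition is that `residual` could be shipped to the data provider, who evaluates it on the sensitive sequence without ever exposing that sequence to the user.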
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
"... We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete timeindependent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [33], as well as to the forwardbackward and BaumWelch [ ..."
Abstract

Cited by 17 (5 self)
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run-length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte pair encoding (BPE). Compared to Viterbi’s algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r / log r) using RLE, Ω(log n / k) using LZ78, Ω(r / k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence, and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi’s algorithm, our algorithms are highly parallelizable.
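For reference, the baseline being accelerated above is plain Viterbi decoding, O(nk²) for n observations and k hidden states; the paper's methods avoid redoing identical DP steps on repeated substrings of the observation sequence. A minimal sketch (the two-state demo HMM at the bottom is invented for illustration):

```python
import math

def viterbi(obs, states, log_init, log_trans, log_emit):
    """Most likely hidden-state path for an observation sequence,
    in log-space to avoid underflow. O(n k^2) time."""
    v = {s: log_init[s] + log_emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        nv, bp = {}, {}
        for s in states:
            prev = max(states, key=lambda p: v[p] + log_trans[p][s])
            bp[s] = prev
            nv[s] = v[prev] + log_trans[prev][s] + log_emit[s][o]
        back.append(bp)
        v = nv
    path = [max(states, key=lambda s: v[s])]   # best final state
    for bp in reversed(back):                  # follow back-pointers
        path.append(bp[path[-1]])
    return path[::-1]

# Toy 2-state HMM ('H'/'C') emitting symbols 'h'/'c' (demo values only).
STATES = ['H', 'C']
INIT = {s: math.log(0.5) for s in STATES}
TRANS = {'H': {'H': math.log(0.8), 'C': math.log(0.2)},
         'C': {'H': math.log(0.1), 'C': math.log(0.9)}}
EMIT = {'H': {'h': math.log(0.9), 'c': math.log(0.1)},
        'C': {'h': math.log(0.1), 'c': math.log(0.9)}}
```

Note that the inner recurrence depends on the observation only through the emission term, which is exactly why identical observation substrings induce identical DP transformations, the property the compression-based speedups exploit.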
All semi-local longest common subsequences in subquadratic time
 In Proceedings of CSR
, 2006
"... subquadratic time ..."
(Show Context)
Optimal neighborhood indexing for protein similarity search
, 2008
"... Background: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by stori ..."
Abstract

Cited by 12 (9 self)
Background: Similarity inference, one of the main bioinformatics tasks, faces the exponential growth of biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet.
Results: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a reduction of 35% of the memory involved in the process, without sacrificing the quality of results or the computational time. Second, our approach led us to develop a new kind of substitution score matrix and its associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing these matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum.
Conclusions: We propose a practical index size reduction of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
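The alphabet-reduction step can be sketched in a few lines: map the 20 amino acids into a handful of groups so that neighborhood keys draw from a smaller alphabet and the index shrinks. The grouping below is a common illustrative physico-chemical partition, not necessarily the one used in the paper.

```python
# Illustrative 6-group partition of the 20 standard amino acids
# (an assumption for this sketch, not the paper's reduction).
GROUPS = {
    'AVLIMC': 'h',  # hydrophobic
    'FWYH':   'r',  # aromatic
    'STNQ':   'p',  # polar
    'KR':     '+',  # basic
    'DE':     '-',  # acidic
    'GP':     's',  # glycine / proline
}
REDUCE = {aa: tag for group, tag in GROUPS.items() for aa in group}

def reduce_alphabet(seq):
    """Re-encode a protein sequence over the reduced alphabet; keys over
    6 symbols need far fewer bits per position than keys over 20."""
    return ''.join(REDUCE[aa] for aa in seq)
```

A neighborhood of length L then has 6^L possible keys instead of 20^L, which is what permits the longer neighborhoods at smaller index size described in the Results paragraph.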
A UNIFIED ALGORITHM FOR ACCELERATING EDIT-DISTANCE COMPUTATION via . . .
, 2009
"... The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamicprogramming solution for this problem computes the editdistance between a pair of strings of total length O(N) in O(N²) time. To th ..."
Abstract

Cited by 12 (1 self)
The edit distance problem is a classical fundamental problem in computer science in general, and in combinatorial pattern matching in particular. The standard dynamic-programming solution for this problem computes the edit distance between a pair of strings of total length O(N) in O(N²) time. To date, this quadratic upper bound has never been substantially improved for general strings. However, there are known techniques for breaking this bound when the strings are known to compress well under a particular compression scheme. The basic idea is to first compress the strings, and then to compute the edit distance between the compressed strings. As it turns out, practically all known o(N²) edit-distance algorithms work, in some sense, under the same paradigm described above. It is therefore natural to ask whether there is a single edit-distance algorithm that works for strings which are compressed under any compression scheme. A rephrasing of this question is to ask whether a single algorithm can exploit the compressibility properties of strings under any compression method, even if each string is compressed using a different compression. In this paper we set out to answer this question by using straight-line programs. These provide a generic platform
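The standard quadratic dynamic program referred to above can be written compactly; this version also keeps only one row at a time, giving O(mn) time and O(min(m, n)) space.

```python
def edit_distance(a, b):
    """Classic Levenshtein DP: O(len(a) * len(b)) time, one row of space."""
    if len(a) < len(b):
        a, b = b, a                      # keep the shorter string as columns
    prev = list(range(len(b) + 1))       # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        cur = [i]                        # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution / match
        prev = cur
    return prev[-1]
```

The compression-based algorithms the abstract surveys all shortcut this table by processing repeated phrases of the inputs in one step instead of cell by cell.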
New efficient algorithms for LCS and constrained LCS problem
 In Broersma et al
"... Abstract. In this paper, we study the classic and wellstudied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In the CLCS problem, the computed LCS must also be a supersequence of a third given string. In this paper, we first present a ..."
Abstract

Cited by 11 (2 self)
Abstract. In this paper, we study the classic and well-studied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In the CLCS problem, the computed LCS must also be a supersequence of a third given string. We first present an efficient algorithm for the traditional LCS problem that runs in O(R log log n + n) time, where R is the total number of ordered pairs of positions at which the two strings match and n is the length of the two given strings. Then, using this algorithm, we devise an algorithm for the CLCS problem having time complexity O(pR log log n + n) in the worst case, where p is the length of the third string.
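The match-pair framework behind the O(R log log n + n) bound can be sketched as follows: list the R matching position pairs row by row, then find a longest strictly increasing chain of column positions via patience sorting. Using `bisect` as below gives O(R log n); the paper's improvement comes from replacing the binary search with log log n-time predecessor queries (e.g. a van Emde Boas structure), which this sketch does not attempt.

```python
from bisect import bisect_left
from collections import defaultdict

def lcs_length(a, b):
    """Hunt-Szymanski-style LCS: work only on the R matching pairs."""
    occ = defaultdict(list)
    for j, ch in enumerate(b):
        occ[ch].append(j)
    # tails[k] = smallest end position in b of an increasing chain of length k+1
    tails = []
    for ch in a:
        # reversed so matches within one row of a cannot chain with each other
        for j in reversed(occ[ch]):
            k = bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)
```

When the strings are similar, R is far smaller than n², which is why bounds parameterized by R beat the classic quadratic DP on such inputs.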