Results 1–10 of 42
Approximate Matching of Run-Length Compressed Strings
Algorithmica, 2001
"... We focus on the problem of approximate matching of strings that have been compressed using runlength encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m runs. We extend an existi ..."
Abstract

Cited by 18 (0 self)
 Add to MetaCart
We focus on the problem of approximate matching of strings that have been compressed using run-length encoding. Previous studies have concentrated on the problem of computing the longest common subsequence (LCS) between two strings of length m and n, compressed to m′ and n′ runs. We extend an existing algorithm for the LCS to the Levenshtein distance, achieving O(m′n + n′m) complexity.
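As background for the entry above, here is a minimal sketch of run-length encoding together with the classic character-level LCS dynamic program that run-compressed algorithms improve upon; the run-level algorithm itself is not reproduced, and all names are illustrative:

```python
from itertools import groupby

def rle_encode(s):
    # Collapse each maximal run of equal characters into a (char, count) pair.
    return [(ch, len(list(g))) for ch, g in groupby(s)]

def lcs_length(a, b):
    # Classic O(len(a) * len(b)) dynamic program over individual characters.
    # Compressed-matching algorithms work per *run* instead of per character.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```

The intuition exploited by run-level algorithms is that `dp` behaves uniformly inside the rectangle spanned by a pair of runs, so one run pair can be processed in bulk.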
Protein similarity search with subset seeds on a dedicated reconfigurable hardware
Proceedings of the 2nd Workshop on Parallel Biocomputing (PBC'07), 2008
"... With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for large scale genome and proteome comparisons. Modern seedbased techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/sp ..."
Abstract

Cited by 18 (15 self)
 Add to MetaCart
With a sharp increase of available DNA and protein sequence data, new precise and fast similarity search methods are needed for large-scale genome and proteome comparisons. Modern seed-based techniques of similarity search (spaced seeds, multiple seeds, subset seeds) provide a better sensitivity/specificity ratio. We present an implementation of such a seed-based technique on parallel specialized hardware embedding a reconfigurable architecture (FPGA), where the FPGA is tightly connected to large-capacity Flash memories. This parallel system allows large databases to be fully indexed and rapidly accessed. Compared to the traditional approach represented by the Blastp software, we obtain both a significant speedup and better results. To the best of our knowledge, this is the first attempt to exploit efficient seed-based algorithms for parallelizing the sequence similarity search.
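To illustrate the seed idea mentioned above, here is a toy spaced-seed hit finder in Python. The seed pattern and function names are hypothetical; real tools (let alone FPGA implementations) encode residues numerically and choose seeds to optimize sensitivity:

```python
def seed_key(s, pos, seed):
    # Project the window starting at `pos` onto the seed's match positions ('1's).
    # '0' positions are "don't care" and tolerate mismatches.
    return "".join(s[pos + i] for i, c in enumerate(seed) if c == "1")

def seed_hits(query, subject, seed="101"):
    # Illustrative spaced seed "101": positions 0 and 2 must match, position 1 may differ.
    span = len(seed)
    index = {}
    for j in range(len(subject) - span + 1):
        index.setdefault(seed_key(subject, j, seed), []).append(j)
    hits = []
    for i in range(len(query) - span + 1):
        for j in index.get(seed_key(query, i, seed), []):
            hits.append((i, j))  # candidate homology anchor, to be extended/verified
    return hits
```

Note how `seed_hits("AXG", "AYG")` still reports a hit at (0, 0): the "don't care" position absorbs the substitution, which is what gives spaced seeds their sensitivity edge over contiguous k-mers.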
Privacy-Preserving Genomic Computation Through Program Specialization
"... In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that makes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it ..."
Abstract

Cited by 12 (2 self)
 Add to MetaCart
In this paper, we present a new approach to performing important classes of genomic computations (e.g., search for homologous genes) that makes a significant step towards privacy protection in this domain. Our approach leverages a key property of the human genome, namely that the vast majority of it is shared across humans (and hence public), and consequently relatively little of it is sensitive. Based on this observation, we propose a privacy-protection framework that partitions a genomic computation, distributing the part on sensitive data to the data provider and the part on the public data to the user of the data. Such a partition is achieved through program specialization that enables a biocomputing program to perform a concrete execution on public data and a symbolic execution on sensitive data. As a result, the program is simplified into an efficient query program that takes only sensitive genetic data as inputs. We prove the effectiveness of our techniques on a set of dynamic programming algorithms common in genomic computing. We develop a program transformation tool that automatically instruments a legacy program for specialization operations. We also demonstrate that our techniques can greatly facilitate secure multi-party computations on large biocomputing problems.
Speeding Up HMM Decoding and Training by Exploiting Sequence Repetitions
"... We present a method to speed up the dynamic program algorithms used for solving the HMM decoding and training problems for discrete timeindependent HMMs. We discuss the application of our method to Viterbi’s decoding and training algorithms [33], as well as to the forwardbackward and BaumWelch [ ..."
Abstract

Cited by 12 (5 self)
 Add to MetaCart
We present a method to speed up the dynamic programming algorithms used for solving the HMM decoding and training problems for discrete time-independent HMMs. We discuss the application of our method to Viterbi's decoding and training algorithms [33], as well as to the forward-backward and Baum-Welch [6] algorithms. Our approach is based on identifying repeated substrings in the observed input sequence. Initially, we show how to exploit repetitions of all sufficiently small substrings (this is similar to the Four Russians method). Then, we describe four algorithms based alternatively on run-length encoding (RLE), Lempel-Ziv (LZ78) parsing, grammar-based compression (SLP), and byte-pair encoding (BPE). Compared to Viterbi's algorithm, we achieve speedups of Θ(log n) using the Four Russians method, Ω(r/log r) using RLE, Ω(log n/k) using LZ78, Ω(r/k) using SLP, and Ω(r) using BPE, where k is the number of hidden states, n is the length of the observed sequence and r is its compression ratio (under each compression scheme). Our experimental results demonstrate that our new algorithms are indeed faster in practice. Furthermore, unlike Viterbi's algorithm, our algorithms are highly parallelizable.
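For reference, the baseline this entry accelerates is standard Viterbi decoding. A minimal log-space version follows; the integer-state encoding and dict-based emission tables are illustrative choices, not the paper's:

```python
def viterbi(obs, k, log_init, log_trans, log_emit):
    # Standard O(n * k^2) Viterbi decoding over integer states 0..k-1.
    # log_init[s], log_trans[t][s], log_emit[s][symbol] are log-probabilities.
    # The compression-based speedups precompute the combined effect of repeated
    # substrings of `obs`; the recurrence itself is unchanged.
    states = range(k)
    v = [log_init[s] + log_emit[s][obs[0]] for s in states]
    back = []
    for o in obs[1:]:
        prev = v
        v, ptr = [], []
        for s in states:
            t_best = max(states, key=lambda t: prev[t] + log_trans[t][s])
            v.append(prev[t_best] + log_trans[t_best][s] + log_emit[s][o])
            ptr.append(t_best)
        back.append(ptr)
    # Trace the best path backwards through the stored pointers.
    path = [max(states, key=lambda s: v[s])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```

Because each observation step is a (max, +) matrix product, repeated substrings of `obs` induce identical products, which is exactly the redundancy the paper exploits.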
All semi-local longest common subsequences in subquadratic time
In Proceedings of CSR, 2006
"... subquadratic time ..."
New efficient algorithms for LCS and constrained LCS problem
In Broersma et al.
"... Abstract. In this paper, we study the classic and wellstudied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In the CLCS problem, the computed LCS must also be a supersequence of a third given string. In this paper, we first present a ..."
Abstract

Cited by 10 (1 self)
 Add to MetaCart
In this paper, we study the classic and well-studied longest common subsequence (LCS) problem and a recent variant of it, namely the constrained LCS (CLCS) problem. In the CLCS problem, the computed LCS must also be a supersequence of a third given string. We first present an efficient algorithm for the traditional LCS problem that runs in O(R log log n + n) time, where R is the total number of ordered pairs of positions at which the two strings match and n is the length of the two given strings. Then, using this algorithm, we devise an algorithm for the CLCS problem with worst-case time complexity O(pR log log n + n), where p is the length of the third string.
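The match-pair parameter R above comes from the Hunt–Szymanski family of LCS algorithms, which process only matching position pairs by reduction to longest increasing subsequence. This sketch uses `bisect` for an O(R log n) bound rather than the faster predecessor structures behind O(R log log n); all names are illustrative:

```python
import bisect
from collections import defaultdict

def lcs_via_matches(a, b):
    # Hunt-Szymanski-style LCS: visit only the R pairs (i, j) with a[i] == b[j],
    # keeping tails[L] = smallest j ending a common subsequence of length L+1.
    occ = defaultdict(list)
    for j, ch in enumerate(b):
        occ[ch].append(j)
    tails = []
    for ch in a:
        # Visit j's in decreasing order so one position of `a` contributes
        # to at most one subsequence length per pass.
        for j in reversed(occ.get(ch, [])):
            k = bisect.bisect_left(tails, j)
            if k == len(tails):
                tails.append(j)
            else:
                tails[k] = j
    return len(tails)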
Algorithms for Transposition Invariant String Matching (Extended Abstract)
Journal of Algorithms, 2002
"... Given strings A and B over an alphabet Σ ⊆ U, where U is some numerical universe closed... ..."
Abstract

Cited by 9 (5 self)
 Add to MetaCart
Given strings A and B over an alphabet Σ ⊆ U, where U is some numerical universe closed...
Random Access to Grammar-Compressed Strings
2011
"... Let S be a string of length N compressed into a contextfree grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · αk(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, αk(n) is ..."
Abstract

Cited by 9 (0 self)
 Add to MetaCart
Let S be a string of length N compressed into a context-free grammar S of size n. We present two representations of S achieving O(log N) random access time, and either O(n · α_k(n)) construction time and space on the pointer machine model, or O(n) construction time and space on the RAM. Here, α_k(n) is the inverse of the kth row of Ackermann's function. Our representations also efficiently support decompression of any substring in S: we can decompress any substring of length m in the same complexity as a single random access query plus additional O(m) time. Combining these results with fast algorithms for uncompressed approximate string matching leads to several efficient algorithms for approximate string matching on grammar-compressed strings without decompression. For instance, we can find all approximate occurrences of a pattern P with at most k errors in time O(n(min{|P|k, k⁴ + |P|} + log N) + occ), where occ is the number of occurrences of P in S. Finally, we are able to generalize our results to navigation and other operations on grammar-compressed trees. All of the above bounds significantly improve the currently best known results. To achieve these bounds, we introduce several new techniques and data structures of independent interest, including a predecessor data structure, two "biased" weighted ancestor data structures, and a compact representation of heavy paths in grammars.
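The underlying object, a straight-line program (SLP), already supports a naive random access by descending the grammar with precomputed expansion lengths. This sketch runs in O(grammar height) per access, which is the baseline the paper improves to O(log N) independent of height; the rule encoding is illustrative:

```python
def slp_access(rules, sym, i):
    # Return the character at position i of the string derived by `sym`.
    # rules maps each symbol either to a terminal character (str) or to a
    # pair (left, right) of symbols whose expansions concatenate.
    lengths = {}
    def length(s):
        # Memoized expansion length of symbol s.
        if s not in lengths:
            rule = rules[s]
            lengths[s] = 1 if isinstance(rule, str) else length(rule[0]) + length(rule[1])
        return lengths[s]
    while True:
        rule = rules[sym]
        if isinstance(rule, str):
            return rule
        left, right = rule
        if i < length(left):
            sym = left          # position falls in the left expansion
        else:
            i -= length(left)   # shift into the right expansion
            sym = right
```

Since an SLP of size n can derive a string of length N exponential in n, even this naive descent never materializes the text, only a root-to-leaf path.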
Rapid Homology Search with Neighbor Seeds
2005
"... Using a seed to rapidly "hit" possible homologies for further scrutiny is a common practice to speed up homology search in molecular sequences. ..."
Abstract

Cited by 8 (0 self)
 Add to MetaCart
Using a seed to rapidly "hit" possible homologies for further scrutiny is a common practice to speed up homology search in molecular sequences.
Rapid Homology Search with Two-Stage Extension and Daughter Seeds
2005
"... Using a seed to rapidly "hit" possible homologies for further examination is a common practice to speed up homology search in molecular sequences. It has been shown that a collection of higher weight seeds have better sensitivity than a single lower weight seed at the same speed. However, huge me ..."
Abstract

Cited by 7 (0 self)
 Add to MetaCart
Using a seed to rapidly "hit" possible homologies for further examination is a common practice to speed up homology search in molecular sequences. It has been shown that a collection of higher-weight seeds has better sensitivity than a single lower-weight seed at the same speed. However, huge memory requirements diminish the advantages of high-weight seeds. This paper describes a two-stage extension method, which simulates high-weight seeds with modest memory requirements. The paper also proposes the use of so-called daughter seeds, an extension of the previously studied vector seed idea. Daughter seeds, especially when combined with two-stage extension, provide the flexibility to maximize the independence between the seeds, a well-known criterion for maximizing sensitivity. Some other practical techniques to reduce memory usage are also discussed in the paper.
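A toy rendering of the two-stage idea: index only a low-weight contiguous core, then confirm candidate hits at a few extra offsets, approximating a higher-weight seed without its memory cost. The function names, core length, and offsets here are all illustrative, not the paper's parameters:

```python
def matches(query, subject, i, j, offsets):
    # Stage-2 check: require equal characters at each extra offset past the core.
    return all(
        i + o < len(query) and j + o < len(subject) and query[i + o] == subject[j + o]
        for o in offsets
    )

def two_stage_hits(query, subject, core_len=3, extra_offsets=(3, 4)):
    # Stage 1: exact hits of a short contiguous core via a small k-mer index.
    index = {}
    for j in range(len(subject) - core_len + 1):
        index.setdefault(subject[j:j + core_len], []).append(j)
    # Stage 2: filter the candidates, simulating a higher-weight seed while
    # only ever indexing the low-weight core.
    hits = []
    for i in range(len(query) - core_len + 1):
        for j in index.get(query[i:i + core_len], []):
            if matches(query, subject, i, j, extra_offsets):
                hits.append((i, j))
    return hits
```

The memory saving is that the index is keyed on core_len characters rather than core_len plus the extra positions; the trade-off is extra per-candidate work in stage 2.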