Results 1-10 of 62
Algorithms for the longest common subsequence problem
J. ACM, 1977
Cited by 210 (2 self)
Two algorithms are presented that solve the longest common subsequence problem. The first algorithm is applicable in the general case and requires O(pn + n log n) time, where p is the length of the longest common subsequence. The second algorithm requires time bounded by O(p(m + 1 - p) log n). In the common special case where p is close to m, this algorithm takes much less time than n^2.
Key words and phrases: subsequence, common subsequence, algorithm
CR categories: 3.73, 3.79, 5.25, 5.39
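For orientation, the quadratic-time dynamic program that bounds like these are usually compared against can be sketched as follows. This is the textbook table-filling method, not either of the two algorithms from the paper, and the function name `lcs_length` is chosen here purely for illustration.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of a longest common subsequence via the textbook O(mn) table.

    Illustrative baseline only; it is not the O(pn + n log n) or
    O(p(m + 1 - p) log n) algorithm described in the abstract.
    """
    # prev[j] holds the LCS length of the rows of `a` processed so far and b[:j].
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best match that ends just before both symbols.
                curr.append(prev[j - 1] + 1)
            else:
                # Drop the last symbol of one string or the other.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept at a time, so memory is linear in the length of the second string while the running time stays proportional to the product of the two lengths.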
Bounds on the complexity of the longest common subsequence problem
Journal of the ACM, 1976
Cited by 69 (1 self)
The problem of finding a longest common subsequence of two strings is discussed. This problem arises in data processing applications such as comparing two files and in genetic applications such as studying molecular evolution. The difficulty of computing a longest common subsequence of two strings is examined using the decision tree model of computation, in which vertices represent "equal-unequal" comparisons. It is shown that unless a bound on the total number of distinct symbols is assumed, every solution to the problem can consume an amount of time that is proportional to the product of the lengths of the two strings. A general lower bound as a function of the ratio of alphabet size to string length is derived. The case where comparisons between symbols of the same string are forbidden is also considered, and it is shown that this problem is of linear complexity for a two-symbol alphabet and quadratic for an alphabet of three or more symbols.
Key words and phrases: longest common subsequence, algorithm, computational complexity, file comparison, molecular evolution
CR categories: 3.12, 3.73, 5.25
Combinatorial algorithms for DNA sequence assembly
Algorithmica, 1993
Cited by 53 (3 self)
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover, it uses a limited form ...
Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix
PLoS Comput Biol 3, 2007
Cited by 33 (6 self)
It has become clear that noncoding RNAs (ncRNA) play important roles in cells, and emerging studies indicate that there might be a large number of unknown ncRNAs in mammalian genomes. There exist computational methods that can be used to search for ncRNAs by comparing sequences from different genomes. One main problem with these methods is their computational complexity, and heuristics are therefore employed. Two heuristics are currently very popular: prefolding and prealigning. However, these heuristics are not ideal, as prealigning is dependent on sequence similarity that may not be present and prefolding ignores the comparative information. Here, pruning of the dynamical programming matrix is presented as an alternative novel heuristic constraint. All subalignments that do not exceed a length-dependent minimum score are discarded as the matrix is filled out, thus giving the advantage of providing the constraints dynamically. This has been included in a new implementation of the FOLDALIGN algorithm for pairwise local or global structural alignment of RNA sequences. It is shown that time and memory requirements are dramatically lowered while overall performance is maintained. Furthermore, a new divide-and-conquer method is introduced to limit the memory requirement during global alignment and backtrack of local alignment. All branch points in the computed RNA structure are found and used to divide the structure into smaller unbranched segments. Each segment is then realigned and backtracked in a normal fashion. Finally, the FOLDALIGN algorithm has also been updated with a better memory implementation and an improved energy model. With these improvements in the algorithm, the FOLDALIGN software package provides the molecular biologist with an efficient and user-friendly tool for searching for new ncRNAs. The software package is available for download at
Longest Common Subsequences
In Proc. of 19th MFCS, number 841 in LNCS, 1994
Cited by 33 (1 self)
The length of a longest common subsequence (LLCS) of two or more strings is a useful measure of their similarity. The LLCS of a pair of strings is related to the `edit distance', or number of mutations/errors/editing steps required in passing from one string to the other. In this talk, we explore some of the combinatorial properties of the sub- and supersequence relations, survey various algorithms for computing the LLCS, and introduce some results on the expected LLCS for pairs of random strings.
1 Introduction
The set $\Sigma^*$ of finite strings over an unordered finite alphabet $\Sigma$ admits of several natural partial orders. Some, such as the substring, prefix, and suffix relations, depend on contiguity and lead to many interesting combinatorial questions with practical applications to string matching. An excellent survey is given by Aho in [1]. In this talk, however, we will focus on the `subsequence' partial order. We say that $u = u_1 \cdots u_m$ is a subsequence of ...
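One concrete form of the relation between the LLCS and edit distance can be sketched as below, under the assumption that only single-symbol insertions and deletions count as edit steps (no substitutions). In that restricted model the distance between strings of lengths m and n is m + n - 2 * LLCS. The function names `llcs` and `indel_distance` are illustrative and do not come from the talk.

```python
def llcs(a: str, b: str) -> int:
    """Length of a longest common subsequence (standard dynamic program)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def indel_distance(a: str, b: str) -> int:
    """Edit distance when only insertions and deletions are allowed.

    Assumption: substitutions are not permitted as a single step.
    Every symbol outside a longest common subsequence is deleted from `a`
    or inserted from `b`, which gives len(a) + len(b) - 2 * LLCS.
    """
    return len(a) + len(b) - 2 * llcs(a, b)
```

For example, "kitten" and "sitting" share the subsequence "ittn" of length 4, so the insertion/deletion distance is 6 + 7 - 8 = 5.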
On Information Transmission over a Finite Buffer Channel
IEEE Trans. Inform. Theory, 2006
Cited by 29 (3 self)
We study information transmission through a finite buffer queue. We model the channel as a finite-state channel whose state is given by the buffer occupancy upon packet arrival; a loss occurs when a packet arrives at a full queue. We study this problem in two contexts: one where the state of the buffer is known at the receiver, and the other where it is unknown. In the former case, we show that the capacity of the channel depends on the long-term loss probability of the buffer. Thus, even though the channel itself has memory, the capacity depends only on the stationary loss probability of the buffer. The main focus of this paper is on the latter case. When the receiver does not know the buffer state, this leads to the study of deletion channels, where symbols are randomly dropped and a subsequence of the transmitted symbols is received. In deletion channels, unlike erasure channels, there is no side information about which symbols are dropped. We study the achievable rate for deletion channels, and focus our attention on simple (mismatched) decoding schemes. We show that even with simple decoding schemes, with i.i.d. input codebooks, the achievable rate in deletion channels differs from that of erasure channels by at most H0(pd) − pd log K
Upper Bounds for the Expected Length of Longest Common Subsequences
1996

Cited by 22 (3 self)
Let f(n) be the expected length of a longest common subsequence of two random sequences over a fixed alphabet of size k. It is known that f(n) → c_k n for some constant c_k. We define a collation as a pair of sequences with marked matches. A dominated collation is a collation that is not matched optimally. Upper bounds for c_k can be derived from upper bounds for the number of nondominated collations. Using local properties of matches we can eliminate many nondominated collations and improve upper bounds for c_k.
1 Introduction
The problem of finding longest common subsequences arises in various situations. Typical examples are approximate string matching and text comparison (e.g. the diff function in UNIX) [1, 11]. Another important area where the longest common subsequence problem appears is molecular biology. The longest common subsequence problem is a special case of the more general sequence alignment problem. A survey on the longest common subsequence problem can be found in...
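To make the quantity f(n) concrete, the ratio f(n)/n can be estimated by simulation. The sketch below is only a Monte Carlo illustration under assumed settings (uniform random symbols, a fixed seed, a handful of trials); it is not the collation-counting argument the paper uses to derive its upper bounds, and the names `lcs_length` and `estimate_ratio` are hypothetical.

```python
import random


def lcs_length(a, b):
    """Length of a longest common subsequence of two sequences (standard DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def estimate_ratio(n, k, trials, seed=0):
    """Monte Carlo estimate of f(n)/n for uniform random sequences.

    n: sequence length, k: alphabet size, trials: number of random pairs.
    The seed is fixed only so that repeated runs give the same estimate.
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = [rng.randrange(k) for _ in range(n)]
        b = [rng.randrange(k) for _ in range(n)]
        total += lcs_length(a, b)
    return total / (trials * n)
```

A call such as `estimate_ratio(200, 2, 5)` gives a rough empirical value for a binary alphabet. Such estimates only suggest where c_k lies; they do not prove a bound.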
N-gram similarity and distance
Proc. Twelfth Int’l Conf. on String Processing and Information Retrieval, 2005
Cited by 21 (0 self)
In many applications, it is necessary to algorithmically quantify the similarity exhibited by two strings composed of symbols from a finite alphabet. Numerous string similarity measures have been proposed. Particularly well-known measures are based on edit distance and the length of the longest common subsequence. We develop a notion of n-gram similarity and distance. We show that edit distance and the length of the longest common subsequence are special cases of n-gram distance and similarity, respectively. We provide formal, recursive definitions of n-gram similarity and distance, together with efficient algorithms for computing them. We formulate a family of word similarity measures based on n-grams, and report the results of experiments that suggest that the new measures outperform their unigram equivalents.
Expected Length of Longest Common Subsequences
Cited by 19 (2 self)
Contents
1 Introduction
2 Notation and preliminaries
2.1 Notation and basic definitions
2.2 Longest common subsequences
2.3 Computing longest common subsequences
2.4 Expected length of longest common subsequences
3 Lower Bounds
3.1 Css machines
3.2 Analysis of css machines
3.3 Design of css machines
3.4 Labeled css machines
4 Upper bounds
4.1 Collations
4.2 Previous upper bounds
4.3 Simple upper bound (binary alphabet)
4.4 Simple upper bound (alphabet size 3)
4.5 Upper bounds for binary alphabet ...
Statistical Significance of Probabilistic Sequence Alignment and Related Local Hidden Markov Models
J. Comp. Biol., 2001
Cited by 18 (3 self)
The score statistics of probabilistic gapped local alignment of random sequences is investigated both analytically and numerically. The full probabilistic algorithm (e.g., the “local” version of maximum-likelihood or hidden Markov model method) is found to have anomalous statistics. A modified “semi-probabilistic” alignment consisting of a hybrid of Smith–Waterman and probabilistic alignment is then proposed and studied in detail. It is predicted that the score statistics of the hybrid algorithm is of the Gumbel universal form, with the key Gumbel parameter λ taking on a fixed asymptotic value for a wide variety of scoring systems and parameters. A simple recipe for the computation of the “relative entropy,” and from it the finite-size correction to λ, is also given. These predictions compare well with direct numerical simulations for sequences of lengths between 100 and 1,000 examined using various PAM substitution scores and affine gap functions. The sensitivity of the hybrid method in the detection of sequence homology is also studied using correlated sequences generated from toy mutation models. It is found to be comparable to that of the Smith–Waterman alignment and significantly better than the Viterbi version of the probabilistic alignment.