## Combinatorial algorithms for DNA sequence assembly (1993)

Venue: Algorithmica

Citations: 42 (3 self)

### BibTeX

@ARTICLE{Kececioglu93combinatorialalgorithms,
    author = {John D. Kececioglu and Eugene W. Myers},
    title = {Combinatorial algorithms for DNA sequence assembly},
    journal = {Algorithmica},
    year = {1993},
    volume = {13},
    pages = {7--51}
}

### Abstract

The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form ...

### Citations

1865 | Numerical Recipes in C: The Art of Scientific Computing - Press, Teukolsky, et al. - 1992 |

1461 | Identification of common molecular subsequences - Smith, Waterman - 1981 |
Citation Context: ...nd the suffix and prefix of each fragment is assumed to match only one other fragment in an overlap longer than a given threshold. Huang [16] applies a local alignment algorithm of Smith and Waterman [34] to compute an overlap that maximizes a linear function of the number of exact matches and errors in the alignment, and uses a filtering technique of Chang and Lawler [4] to avoid considering some of t...
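The local alignment recurrence of Smith and Waterman mentioned in this excerpt can be sketched as follows. This is a minimal illustration only: the function name and the scoring parameters `match`, `mismatch`, and `gap` are assumptions chosen for the example, not the linear function used by Huang or by the paper.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally with a match or a substitution.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

With these assumed scores, two identical fragments of length 4 score 8, and two fragments with no character in common score 0.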

644 | Suffix arrays: a new method for on-line string searches - Manber, Myers - 1990 |
Citation Context: ... all pairs of fragments in time linear in the size of the input and output, if no errors are permitted in the overlaps. Cull and Holloway [6] apply the suffix array data structure of Manber and Myers [22] to find overlaps, where fragments are assumed to contain only substitution errors, and the suffix and prefix of each fragment is assumed to match only one other fragment in an overlap longer than a g...

575 | Fibonacci heaps and their uses in improved network optimization algorithms - Fredman, Tarjan - 1987 |
Citation Context: ...ng layout. Constructing the graph takes O(E + V) time, where V is the number of fragments and E is the number of pairs of fragments with nonzero same or opp. Tree T can be found in O(E + V log V) time [9]. We can locate R in O(V) time by two passes over T. The first pass computes the total distance of each vertex A to all vertices in the subtree rooted at A bottom-up, along with the size of the subt...
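The two-pass computation described in this excerpt resembles the standard rerooting technique for distance sums in a tree. The sketch below is written under assumptions: the excerpt is truncated, so the second pass and the choice of R as the vertex with the smallest total distance are guesses for illustration, and the names `total_distances` and `best_root` are hypothetical.

```python
from collections import defaultdict

def total_distances(n, edges):
    """total[v] = sum of path lengths from v to every other vertex of a tree.

    `edges` lists (u, v, weight) triples over vertices 0..n-1.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Iterative DFS from vertex 0; every parent precedes its children in `order`.
    parent, parent_w = [-1] * n, [0] * n
    seen = [False] * n
    seen[0] = True
    order, stack = [], [0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, w in adj[u]:
            if not seen[v]:
                seen[v] = True
                parent[v], parent_w[v] = u, w
                stack.append(v)

    # Pass 1 (bottom-up): subtree sizes and distance sums inside each subtree.
    size, down = [1] * n, [0] * n
    for u in reversed(order):
        p = parent[u]
        if p >= 0:
            size[p] += size[u]
            down[p] += down[u] + parent_w[u] * size[u]

    # Pass 2 (top-down): moving from p to child u brings size[u] vertices
    # closer by the edge weight and pushes the other n - size[u] farther.
    total = [0] * n
    total[0] = down[0]
    for u in order[1:]:
        total[u] = total[parent[u]] + parent_w[u] * (n - 2 * size[u])
    return total

def best_root(n, edges):
    """Assumed selection rule: the vertex minimizing total distance."""
    total = total_distances(n, edges)
    return min(range(n), key=total.__getitem__)
```

Both passes touch each vertex once, which matches the O(V) bound quoted in the excerpt.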

104 | Minimal mutation trees of sequences - Sankoff - 1975 |
Citation Context: ...ructure. When no error is present, the sequences are identical, and the alignment graph is a series of columns, each column a complete subgraph. (Footnote 12: This observation, expressed in different language, can be found in many papers. Perhaps the first occurrence is in [29].) When a rare error is present, its effect on this structure is to displace or delete some edges local to the defect. For such graphs,...

79 | A procedure for computing the k best solutions to discrete optimization problems and its application to the shortest path problem - Lawler - 1972 |
Citation Context: ... subproblems. Let I and O be the in- and out-sets for P. One subproblem receives constraints I and $O \cup \{e\}$, and the other receives constraints $I \cup \{e\}$ and $O \cup \{f\}$. This follows a general method of Lawler [20] for generating next-best solutions to combinatorial optimization problems. The resulting collection of problems is conveniently represented by a computation tree. Each node in the tree contains an in...

73 | Linear approximation of shortest superstrings - Blum, Jiang, et al. - 1994 |
Citation Context: ...hat a simple greedy algorithm finds a superstring whose amount of compression is within a factor of 1/2 of the maximum, and give efficient implementations. Blum, Jiang, Li, Tromp and Yannakakis [1] prove that the greedy algorithm delivers a superstring at most 4 times longer than the shortest, and that a simple variant delivers a superstring at most 3 times longer than the shortest. It is not k...

52 | Longest common subsequences of two random sequences - Chvátal, Sankoff - 1975 |
Citation Context: ...f course the other possibility is that the sequence contains a repeat. ...bet $\{a, c, g, t\}$. While the exact probability of an alignment is unknown even for this model, a result of Chvátal and Sankoff [5] on random common subsequences gives a good upper bound. The alignments that we compute match a pair of characters only when they are equal. These matches give a common subsequence of the fragments, a...

48 | The pairing heap: A new form of self-adjusting heap - Fredman, Sedgewick, et al. - 1986 |
Citation Context: ...of coverage depth D, this is $O(D^2)$ time. In practice we recommend a different spanning tree algorithm. The O(E + V log V) time algorithm requires a Fibonacci heap [9] or in practice a pairing heap [8], and since a spanning tree is computed for every column, the overhead of these data structures is unappealing. Moreover, our spanning tree problems often vary only slightly from window to window, in w...

46 | Approximate string matching in sublinear expected time - Chang, Lawler - 1990 |
Citation Context: ...gorithm of Smith and Waterman [34] to compute an overlap that maximizes a linear function of the number of exact matches and errors in the alignment, and uses a filtering technique of Chang and Lawler [4] to avoid considering some of the pairs of fragments whose alignment score is below a fixed threshold. Our work may be distinguished from prior theoretical investigations in that we address both seque...

45 | Finding optimum branchings - Tarjan - 1977 |
Citation Context: ...og V) time and O(K + E + V) space, as shown by Camerini, Fratta and Maffioli [3]. Our method of generating branchings is similar to Camerini et al., which applies the branchings algorithm of Tarjan [37], but has some differences. These differences are due to our particular application, namely generating branchings to meet a dovetail-chain constraint, which allows us to apply the algorithm of Gabo...

43 | On finding minimal length superstrings - Gallant, Maier, et al. - 1980 |
Citation Context: ...e aligned to the superstring within a length-relative error threshold of $\epsilon$. In fact, Superstring can be reduced to the sequence reconstruction problem with $\epsilon = 0$. Since Superstring is NP-complete [13], this implies that Reconstruct is NP-complete. Details may be found in [18]. Related work. Prior work related to DNA sequence assembly may be classified into three categories. In the first class of pa...

42 | A contig assembly program based on sensitive detection of fragment overlaps - Huang - 1992 |
Citation Context: ...re fragments are assumed to contain only substitution errors, and the suffix and prefix of each fragment is assumed to match only one other fragment in an overlap longer than a given threshold. Huang [16] applies a local alignment algorithm of Smith and Waterman [34] to compute an overlap that maximizes a linear function of the number of exact matches and errors in the alignment, and uses a filtering t...

38 | Approximation Algorithms for the Shortest Common Superstring Problem - Turner - 1989 |
Citation Context: ...he shortest common superstring problem, which we have indicated is equivalent to the sequence reconstruction problem without error and with fragment orientation known. Tarhio and Ukkonen [36], Turner [38], and Ukkonen [39] show that a simple greedy algorithm finds a superstring whose amount of compression is within a factor of 1/2 of the maximum, and give efficient implementations. Blum, Jiang,...

27 | Exact and Approximation Algorithms for DNA Sequence Reconstruction - Kececioglu - 1991 |
Citation Context: .... In fact, Superstring can be reduced to the sequence reconstruction problem with $\epsilon = 0$. Since Superstring is NP-complete [13], this implies that Reconstruct is NP-complete. Details may be found in [18]. Related work. Prior work related to DNA sequence assembly may be classified into three categories. In the first class of papers, Shapiro [32], Hutchinson [17], Smetanic and Polozov [33], Gallant [12]...

26 | Two Algorithms for Generating Weighted Spanning Trees in Order - Gabow - 1978 |
Citation Context: ...ation tree appears to require O(KE) space (it has O(K) nodes and each node has an in- and out-list of size O(E)), but this can be reduced to constant space per node using the following idea of Gabow [10]. The in- and out-sets for a left child l in the computation tree may be obtained from its parent p by adding one edge e to p's out-set to form l's out-set, and by copying p's in-set. The in- and o...

17 | A note on finding optimum branchings - Camerini, Fratta, et al. - 1979 |

15 | Incremental alignment algorithms and their applications - Myers - 1986 |
Citation Context: ...me by the standard dynamic programming algorithm [31], this gives an $O(m^2 n^2)$ time algorithm, and it is easy to bring this down to $O(m^2 n)$ time by combining subproblems. Here we assume $m \le n$. Myers [25] has shown that it is possible to solve all $O(mn)$ subproblems in $O(\delta n)$ time, where $\delta$ is the maximum edit distance allowed. In our application, $\delta = \lfloor \epsilon m + \epsilon n \rfloor$, so this gives an $O(\epsilon n^2)$ ti...

14 | An efficient algorithm for the all pairs suffix-prefix problem - Gusfield, Landau, et al. - 1992 |
Citation Context: ...e reconstruction that is output. In addition, three papers have recently come to our attention that look at the subtask of computing overlaps between pairs of fragments. Gusfield, Landau and Schieber [15] show that with the suffix tree data structure the longest overlap between a suffix of one fragment and a prefix of another can be determined for all pairs of fragments in time linear in the size of th...

13 | Efficient algorithms for finding minimum spanning trees in undirected and directed graphs - Gabow, Galil, Spencer, Tarjan - 1986 |
Citation Context: ... in order of decreasing weight. A maximum weight branching over a graph of E edges and V vertices can be computed in O(E + V log V) time and O(E + V) space, as shown by Gabow, Galil, Spencer and Tarjan [11]. The K branchings of greatest weight can be generated in O(KE log V) time and O(K + E + V) space, as shown by Camerini, Fratta and Maffioli [3]. Our method of generating branchings is similar to Ca...

10 | Towards a DNA sequencing theory - Li - 1990 |
Citation Context: ...s a superstring at most 4 times longer than the shortest, and that a simple variant delivers a superstring at most 3 times longer than the shortest. It is not known whether these bounds are tight. Li [21] examines sequence assembly from the viewpoint of computational learning theory and shows that an approximation algorithm for Superstring will learn the underlying sequence in polynomial time in the P...

9 | A strategy of DNA sequencing employing computer programs, Nucleic Acids Research - Staden - 1979 |
Citation Context: ...m for Superstring will learn the underlying sequence in polynomial time in the PAC model of learning, given fragments without error and with known orientation. In the third category of papers, Staden [35], Gingeras, Milazzo, Sciaky and Roberts [14], and Peltola, Soderlund and Ukkonen [27] develop software for sequence assembly. Peltola, Soderlund, Tarhio and Ukkonen [26] describe the algorithms used ...

8 | Data Structures and Algorithms. Volume 1: Sorting and Searching - Mehlhorn - 1984 |

5 | The k-Best Spanning Arborescences of a Network - Camerini, Fratta, et al. - 1980 |
Citation Context: ... space, as shown by Gabow, Galil, Spencer and Tarjan [11]. The K branchings of greatest weight can be generated in O(KE log V) time and O(K + E + V) space, as shown by Camerini, Fratta and Maffioli [3]. Our method of generating branchings is similar to Camerini et al., which applies the branchings algorithm of Tarjan [37], but has some differences. These differences are due to our particular applic...

5 | The Complexity of the Overlap Method for Sequencing Biopolymers - Gallant - 1983 |

5 | Computer Programs for the Assembly of DNA Sequences - Gingeras, Milazzo, et al. - 1979 |
Citation Context: ...sequence in polynomial time in the PAC model of learning, given fragments without error and with known orientation. In the third category of papers, Staden [35], Gingeras, Milazzo, Sciaky and Roberts [14], and Peltola, Soderlund and Ukkonen [27] develop software for sequence assembly. Peltola, Soderlund, Tarhio and Ukkonen [26] describe the algorithms used in [27], and also give the first statement ...

5 | A linear time algorithm for finding approximate shortest common superstrings - Ukkonen - 1990 |
Citation Context: ... superstring problem, which we have indicated is equivalent to the sequence reconstruction problem without error and with fragment orientation known. Tarhio and Ukkonen [36], Turner [38], and Ukkonen [39] show that a simple greedy algorithm finds a superstring whose amount of compression is within a factor of 1/2 of the maximum, and give efficient implementations. Blum, Jiang, Li, Tromp and Yan...

4 | SEQAID: a DNA sequence assembling program based on a mathematical model - Peltola, Söderlund, Ukkonen - 1984 |
Citation Context: ...del of learning, given fragments without error and with known orientation. In the third category of papers, Staden [35], Gingeras, Milazzo, Sciaky and Roberts [14], and Peltola, Soderlund and Ukkonen [27] develop software for sequence assembly. Peltola, Soderlund, Tarhio and Ukkonen [26] describe the algorithms used in [27], and also give the first statement of the sequence reconstruction problem. T...

4 | An upper-bound technique for lengths of common subsequences - Chvátal, Sankoff - 1983 |
Citation Context: ...nted as a deletion error followed by an insertion error. The quantities that we measure for an alignment are l, the length of the common subsequence, and d, the number of errors. Sankoff and Chvátal [30] show that the number of sequences of length l + d over an alphabet of size s that contain a fixed subsequence of length l is $$N_s(l, d) = \sum_{0 \le i \le d} \binom{l + d}{i} (s - 1)^i, \tag{1}$$ independent of the par...
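The count in equation (1) of this excerpt, the sum over i from 0 to d of C(l + d, i) (s - 1)^i, is easy to evaluate directly. The following is a small sketch; the function name `count_supersequences` is a hypothetical label introduced here.

```python
from math import comb

def count_supersequences(l, d, s):
    """N_s(l, d): number of strings of length l + d over an s-letter alphabet
    that contain one fixed string of length l as a subsequence."""
    return sum(comb(l + d, i) * (s - 1) ** i for i in range(d + 1))
```

As a sanity check on a case small enough to enumerate by hand: over the DNA alphabet (s = 4), the strings of length 3 that contain "aa" as a subsequence are exactly those with at least two a's, which is 3 * 3 + 1 = 10, and `count_supersequences(2, 1, 4)` gives the same value.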

3 | Evaluation of polymer sequence fragment data using graph theory - Hutchinson - 1969 |
Citation Context: ... is NP-complete. Details may be found in [18]. Related work. Prior work related to DNA sequence assembly may be classified into three categories. In the first class of papers, Shapiro [32], Hutchinson [17], Smetanic and Polozov [33], Gallant [12], and Foulser [7] examine an early model of the problem where fragments do not contain errors and are partitioned into classes such that concatenating the frag...

3 | A procedural interface for a fragment assembly tool - Kececioglu, Myers - 1989 |
Citation Context: ...D + E + $\epsilon N^2$) space. 6 Experimental results. To explore the viability of this approach to sequence reconstruction, we have implemented a software package embodying the preceding suite of algorithms [19]. In both the orientation and layout phases, the exact algorithms are run first. If the size of a search tree becomes too large, for example when K = 500, the phases switch to the approximation algori...

3 | An algorithm for reconstructing protein and RNA sequences - Shapiro - 1967 |
Citation Context: ... that Reconstruct is NP-complete. Details may be found in [18]. Related work. Prior work related to DNA sequence assembly may be classified into three categories. In the first class of papers, Shapiro [32], Hutchinson [17], Smetanic and Polozov [33], Gallant [12], and Foulser [7] examine an early model of the problem where fragments do not contain errors and are partitioned into classes such that conca...

2 | A linear time algorithm for DNA sequencing - Foulser - 652 |
Citation Context: ... Prior work related to DNA sequence assembly may be classified into three categories. In the first class of papers, Shapiro [32], Hutchinson [17], Smetanic and Polozov [33], Gallant [12], and Foulser [7] examine an early model of the problem where fragments do not contain errors and are partitioned into classes such that concatenating the fragments within each class, in some order, gives the underlyi...

2 | On the algorithms for determining the primary structure of biopolymers - Smetanic, Polozov - 1979 |
Citation Context: ...y be found in [18]. Related work. Prior work related to DNA sequence assembly may be classified into three categories. In the first class of papers, Shapiro [32], Hutchinson [17], Smetanic and Polozov [33], Gallant [12], and Foulser [7] examine an early model of the problem where fragments do not contain errors and are partitioned into classes such that concatenating the fragments within each class, in...

1 | Reconstructing sequences from shotgun data - Cull, Holloway - 1992 |
Citation Context: ... of one fragment and a prefix of another can be determined for all pairs of fragments in time linear in the size of the input and output, if no errors are permitted in the overlaps. Cull and Holloway [6] apply the suffix array data structure of Manber and Myers [22] to find overlaps, where fragments are assumed to contain only substitution errors, and the suffix and prefix of each fragment is assumed...

1 | Complete nucleotide sequence of the rabbit β-like globin gene cluster: Analysis of intergenic sequences and comparison with the human β-like globin gene cluster - Margot, Demers, et al. - 1989 |
Citation Context: ...n the data. A random sequence has no structure, while biological sequences contain repeats. In the remaining experiments, numbered 4 through 12, we used the human β-like globin gene cluster sequence [23]. This 73,360 character sequence contains many approximate repeats, and presents a challenging reconstruction problem. Thirteen short interspersed Alu repeats are present, nine in the forward directio...

1 | Algorithms for some string matching problems arising in molecular genetics - Peltola, Söderlund, Tarhio, Ukkonen - 1983 |
Citation Context: ...ird category of papers, Staden [35], Gingeras, Milazzo, Sciaky and Roberts [14], and Peltola, Soderlund and Ukkonen [27] develop software for sequence assembly. Peltola, Soderlund, Tarhio and Ukkonen [26] describe the algorithms used in [27], and also give the first statement of the sequence reconstruction problem. These papers deal with error, and with orientation, but do not characterize the quali...

1 | A greedy approximation algorithm for constructing shortest common superstrings - Tarhio, Ukkonen - 1988 |
Citation Context: ...orithms for the shortest common superstring problem, which we have indicated is equivalent to the sequence reconstruction problem without error and with fragment orientation known. Tarhio and Ukkonen [36], Turner [38], and Ukkonen [39] show that a simple greedy algorithm finds a superstring whose amount of compression is within a factor of 1/2 of the maximum, and give efficient implementations. Blum, ...
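The simple greedy algorithm referred to in this excerpt repeatedly merges the two strings with the largest suffix-prefix overlap. A naive sketch follows, assuming error-free fragments with known orientation as the excerpt states. It is a quadratic illustration, not one of the efficient implementations from the cited papers, and the function names are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_superstring(fragments):
    """Merge the pair with maximum overlap until one string remains."""
    # Drop duplicates and fragments that occur inside another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_k, best_i, best_j = -1, 0, 1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        # Append the non-overlapping tail of the second string to the first.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)
    return frags[0] if frags else ""
```

For the fragments `ACGT`, `GTAC`, and `TACC`, the sketch first merges `GTAC` and `TACC` (overlap 3) and then prepends `ACGT` (overlap 2), giving `ACGTACC`. Handling sequencing errors and reverse complements, which the paper's formulation adds, is outside what this sketch covers.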