Indexing and Retrieval for Genomic Databases
- IEEE Transactions on Knowledge and Data Engineering, 2002
Cited by 40 (6 self)
Genomic sequence databases are widely used by molecular biologists for homology searching. Amino-acid and nucleotide databases are increasing in size exponentially, and mean sequence lengths are also increasing. In searching such databases, it is desirable to use heuristics to perform computationally intensive local alignments on selected sequences only and to reduce the costs of the alignments that are attempted. We present an index-based approach both for selecting sequences that display broad similarity to a query and for fast local alignment. We show experimentally that the indexed approach results in significant savings in computationally intensive local alignments, and that index-based searching is as accurate as existing exhaustive search schemes.
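The coarse selection step described in this abstract can be illustrated with a small inverted index over fixed-length substrings (k-mers). This is only a sketch of the general idea: the function names, the value of `k`, and the ranking by shared distinct k-mers are assumptions made for illustration, not the index structure or ranking function of the paper.

```python
from collections import defaultdict

def build_index(sequences, k=3):
    """Map each k-mer to the set of sequence ids that contain it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def coarse_rank(index, query, k=3, top=2):
    """Rank sequences by how many distinct query k-mers they contain.

    Only the top-ranked sequences would then be passed to an expensive
    local alignment (not shown here).
    """
    counts = defaultdict(int)
    kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in kmers:
        for seq_id in index.get(kmer, ()):
            counts[seq_id] += 1
    # Highest count first; ties broken by sequence id.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top]

collection = ["ACGTACGTGA", "TTTTGGGGCC", "ACGTTCGTGA"]
index = build_index(collection)
candidates = coarse_rank(index, "ACGTACG")
print(candidates)
```

In this toy collection the second sequence shares no k-mer with the query, so it never reaches the alignment stage, which is the saving the abstract refers to.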
Recent developments in linear-space alignment methods: A survey
- J. Comput. Biol, 1994
Cited by 19 (4 self)
A dynamic-programming strategy for sequence alignment first proposed in 1975 by Dan Hirschberg can be adapted to yield a number of extremely space-efficient algorithms. Specifically, these algorithms align two sequences using only "linear space", i.e., an amount of computer memory that is proportional to the sum of the lengths of the two sequences being aligned. This paper begins by reviewing the basic idea, as it applies to the global (i.e., end-to-end) alignment of two DNA or protein sequences. Three of our recent extensions of the technique are then outlined. The first extension computes an optimal alignment subject to the constraint that each position, i, of the first sequence must be aligned somewhere between positions L[i] and U[i] of the second sequence, for given values of L and U. The second finds all aligned position pairs (i.e., potential columns of the alignment) that occur in an alignment whose score exceeds a given threshold. The third treats the case where each of the two sequences is allowed to be an alignment (e.g., a sequence of aligned pairs), using a sensitive scoring scheme. We also describe two linear-space methods for computing k best local (i.e., involving only a part of each sequence) alignments, where k ≥ 1. One is a linear-space version of the algorithm of Waterman and Eggert (1987), and the other is based on the strategy proposed by Wilbur and Lipman (1983). Finally, we describe programs that implement various combinations of these techniques to provide a multi-sequence alignment method that is especially suited to handling a few very long sequences. The utility of these programs is illustrated by analysis of the locus control region of the β-like globin gene cluster of several mammals.
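The basic linear-space idea reviewed in this abstract can be sketched as follows, for global alignment only. The scoring values (match +1, mismatch -1, linear gap penalty -1) and the function names are assumptions chosen for illustration; the extensions described in the abstract (constrained regions, suboptimal pairs, local alignments) are not covered.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-1):
    """Last row of the global-alignment score table, keeping two rows only."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev

def hirschberg(a, b, match=1, mismatch=-1, gap=-1):
    """Return an optimal global alignment of a and b as two gapped strings."""
    if len(a) == 0:
        return "-" * len(b), b
    if len(b) == 0:
        return a, "-" * len(a)
    if len(a) == 1:
        # Base case: pair the single character with its best partner in b,
        # unless leaving it unpaired (two extra gaps) scores higher.
        scores = [match if a == c else mismatch for c in b]
        j = max(range(len(b)), key=scores.__getitem__)
        if scores[j] >= 2 * gap:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    mid = len(a) // 2
    # Forward scores for the first half, reverse scores for the second half.
    left = nw_last_row(a[:mid], b, match, mismatch, gap)
    right = nw_last_row(a[mid:][::-1], b[::-1], match, mismatch, gap)
    # Split b where the two halves combine to the best total score.
    split = max(range(len(b) + 1), key=lambda j: left[j] + right[len(b) - j])
    top_l, bot_l = hirschberg(a[:mid], b[:split], match, mismatch, gap)
    top_r, bot_r = hirschberg(a[mid:], b[split:], match, mismatch, gap)
    return top_l + top_r, bot_l + bot_r

top, bottom = hirschberg("AGTACGCA", "TATGC")
print(top)
print(bottom)
```

Only score rows of length len(b) + 1 are held at any time, so working memory grows with the sum of the sequence lengths rather than their product, at the cost of roughly doubling the computation.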
Linear-Space Algorithms that Build Local Alignments from Fragments
- Algorithmica, 1995
Cited by 9 (6 self)
This paper presents practical algorithms for building an alignment of two long sequences from a collection of "alignment fragments," such as all occurrences of identical 5-tuples in each of two DNA sequences. We first combine a time-efficient algorithm developed by Galil and coworkers with a space-saving approach of Hirschberg to obtain a local alignment algorithm that uses O((M + N + F log N) log M) time and O(M + N) space to align sequences of lengths M and N from a pool of F alignment fragments. Ideas of Huang and Miller are then employed to develop a time- and space-efficient algorithm that computes n best nonintersecting alignments for any n > 1. An example illustrates the utility of these methods.
Near Optimal Multiple Alignment Within a Band In Polynomial Time
- In Proc. of 32nd ACM STOC, 2000
Cited by 8 (0 self)
Multiple sequence alignment is one of the most important problems in computational biology. Because of its notorious difficulties, aligning sequences within a constant band is a popular practice in bioinformatics with good results [17; 13; 14; 15; 1; 3; 6; 20; 18]. However, the problem is still NP-hard for multiple sequences. In this paper, we present polynomial time approximation schemes (PTAS) for multiple sequence alignment within a constant band, under standard models of SP alignment and consensus (star) alignment. The algorithms work for very general score schemes. In order to prove our main results, we also present a PTAS for SP alignment and a PTAS for consensus alignment, allowing only a constant number of insertion and deletion gaps (of arbitrary length) per sequence on the average.
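To make the notion of aligning "within a band" concrete, here is a pairwise simplification: a global alignment score computed only over table cells (i, j) with |i - j| <= band. It is not the PTAS of the paper, which addresses multiple sequences; the function name and the scoring values are assumptions for illustration.

```python
def banded_alignment_score(a, b, band, match=1, mismatch=-1, gap=-1):
    """Best global alignment score using only cells with |i - j| <= band.

    Returns None when the length difference exceeds the band, because no
    alignment path can then stay inside it.
    """
    n, m = len(a), len(b)
    if abs(n - m) > band:
        return None
    # Row 0: only columns 0..band lie inside the band.
    prev = {j: j * gap for j in range(min(m, band) + 1)}
    for i in range(1, n + 1):
        curr = {}
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if j == 0:
                curr[j] = i * gap
                continue
            sub = match if a[i - 1] == b[j - 1] else mismatch
            best = prev[j - 1] + sub  # the diagonal neighbour is always in the band
            if j in prev:
                best = max(best, prev[j] + gap)      # a[i-1] aligned to a gap
            if j - 1 in curr:
                best = max(best, curr[j - 1] + gap)  # b[j-1] aligned to a gap
            curr[j] = best
        prev = curr
    return prev[m]

narrow = banded_alignment_score("TTAAAA", "AAAATT", band=0)
wide = banded_alignment_score("TTAAAA", "AAAATT", band=2)
print(narrow, wide)
```

Each row touches at most 2 * band + 1 cells, so the pairwise cost is proportional to band times sequence length. A band that is too narrow can exclude the best alignment, as the two calls above show.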
Stochastic models of sequence evolution including insertion-deletion events
- 2008
Cited by 4 (3 self)
Comparison of sequences that have descended from a common ancestor based on an explicit stochastic model of substitutions, insertions and deletions has risen to prominence in the last decade. Making statements about the positions of insertions-deletions (abbr. indels) is central in sequence and genome analysis and is called alignment. This statistical approach is harder, conceptually and computationally, than competing approaches based on choosing an alignment according to some optimality criteria. But it has major practical advantages in terms of testing evolutionary hypotheses and parameter estimation. Basic dynamic approaches can allow the analysis of up to 4-5 sequences. MCMC techniques can bring this to about 10-15 sequences. Beyond this, different or heuristic approaches must be used. Besides the computational challenges, increasing realism in the underlying models is presently being addressed. A recent development that has been especially fruitful is combining statistical alignment with the problem of sequence annotation, making statements about the function of each nucleotide/amino acid. So far gene finding, protein secondary structure prediction and regulatory signal detection have been tackled within this framework. Much progress can be reported, but clearly major challenges remain if this approach is to be central in the analyses of large incoming sequence data sets.
Parameterized Complexity and Biopolymer Sequence Comparison
- 2007
Cited by 3 (0 self)
The paper surveys parameterized algorithms and complexities for computational tasks on biopolymer sequences, including the problems of longest common subsequence, shortest common supersequence, pairwise sequence alignment, multiple sequence alignment, structure–sequence alignment and structure–structure alignment. Algorithm techniques, built on the structural-unit level as well as on the residue level, are discussed.
Comparing Compressed Sequences for Faster Nucleotide BLAST Searches
- IEEE/ACM Transactions on Computational Biology and Bioinformatics
Cited by 2 (0 self)
Molecular biologists, geneticists, and other life scientists use the blast homology search package as their first step for discovery of information about unknown or poorly annotated genomic sequences. There are two main variants of blast: blastp for searching protein collections and blastn for nucleotide collections. Surprisingly, blastn has had very little attention; for example, the algorithms it uses do not follow those described in the 1997 blast paper (Altschul, Madden, Schaffer, Zhang, Zhang, Miller & Lipman 1997) and no exact description has been published. It is important that blastn is state-of-the-art: nucleotide collections such as GenBank dwarf the protein collections in size, they double in size almost yearly, and take many minutes to search on modern general-purpose workstations. This paper proposes significant improvements to the blastn algorithms. Each of our schemes is based on compressed bytepacked formats that allow queries and collection sequences to be compared four bases at a time, permitting very fast query evaluation using lookup tables and numeric comparisons. Our most significant innovations are two new, fast gapped alignment schemes that allow accurate sequence alignment without decompression of the collection sequences. Overall, our innovations more than double the speed of blastn with no effect on accuracy and have been integrated into our new version of blast that is freely available for download from
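The phrase "compared four bases at a time" can be illustrated with a 2-bit encoding that stores four nucleotides per byte and a 256-entry lookup table. This is a hypothetical, ungapped sketch: the encoding order, the table, and the function names are assumptions, and it does not reproduce the blastn formats or the gapped alignment schemes of the paper.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    """Pack a nucleotide string into bytes, four 2-bit bases per byte.

    For simplicity this sketch requires a length that is a multiple of four.
    """
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            byte = (byte << 2) | CODE[base]
        out.append(byte)
    return bytes(out)

# MATCHES[x] is the number of all-zero 2-bit fields in x. When x is the XOR
# of two packed bytes, that is the number of identical bases among the four.
MATCHES = [sum(((x >> s) & 3) == 0 for s in (0, 2, 4, 6)) for x in range(256)]

def count_matches(packed_a, packed_b):
    """Count identical positions with one XOR and one lookup per four bases."""
    return sum(MATCHES[x ^ y] for x, y in zip(packed_a, packed_b))

matches = count_matches(pack("ACGTACGT"), pack("ACGAACTT"))
print(matches)
```

The comparison works directly on the packed bytes, so the collection never has to be expanded back to one character per base.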
Fast discovery of similar sequences in large genomic collections
- In Proc. European Conference on Information Retrieval, 2006
Cited by 1 (1 self)
Detection of highly similar sequences within genomic collections has a number of applications, including the assembly of expressed sequence tag data, genome comparison, and clustering sequence collections for improved search speed and accuracy. While several approaches exist for this task, they are becoming infeasible, either in space or in time, as genomic collections continue to grow at a rapid pace. In this paper we present an approach based on document fingerprinting for identifying highly similar sequences. Our approach uses a modest amount of memory and executes in a time roughly proportional to the size of the collection. We demonstrate substantial speed improvements compared to the CD-HIT algorithm, the most successful existing approach for clustering large protein sequence collections.
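Document fingerprinting in general works by hashing substrings, keeping a subset of the hashes as the fingerprint, and comparing fingerprints instead of full texts. The sketch below uses one common selection rule (keep hashes divisible by a fixed number); the abstract does not say which selection rule the paper uses, so the rule, the parameters, and the function names here are all assumptions.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

def fingerprints(seq, k=8, keep_every=4):
    """Hash every k-mer and keep the hashes that are 0 modulo keep_every."""
    kept = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.sha1(seq[i:i + k].encode()).digest()
        value = int.from_bytes(digest[:8], "big")
        if value % keep_every == 0:
            kept.add(value)
    return kept

def similar_pairs(sequences, min_shared=2, k=8, keep_every=4):
    """Return {(id_a, id_b): shared} for pairs sharing enough fingerprints."""
    postings = defaultdict(list)
    for seq_id, seq in enumerate(sequences):
        for fp in fingerprints(seq, k, keep_every):
            postings[fp].append(seq_id)
    shared = defaultdict(int)
    for ids in postings.values():
        # Quadratic in the posting-list length; acceptable for a sketch.
        for pair in combinations(ids, 2):
            shared[pair] += 1
    return {pair: n for pair, n in shared.items() if n >= min_shared}

collection = ["ACGTACGTTTGACCA", "ACGTACGTTTGACCA", "GGGGGGGGGGGGGGG"]
# keep_every=1 keeps every k-mer hash, which makes this tiny example deterministic.
pairs = similar_pairs(collection, keep_every=1)
print(pairs)
```

With a larger `keep_every`, fewer hashes are stored per sequence, trading memory for the chance of missing a shared region. The pair enumeration here does not attempt to match the running-time behaviour reported in the abstract.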
Discrete Pattern Matching Over Sequences And Interval Sets
- 1993
Cited by 1 (0 self)
Finding matches, both exact and approximate, between a sequence of symbols A and a pattern P has long been an active area of research in algorithm design. Some of the more well-known byproducts from that research are the diff program and grep family of programs. These problems form a sub-domain of a larger area of problems called discrete pattern matching which has been developed recently to characterise the wide range of pattern matching problems. This dissertation presents new algorithms for discrete pattern matching over sequences and develops a new sub-domain of problems called discrete pattern matching over interval sets. The problems and algorithms presented here are characterised by pattern matching over interval sets.

