A Robust Model for Finding Optimal Evolutionary Trees
, 1993
Constructing evolutionary trees for species sets is a fundamental problem in computational biology. One of the standard models assumes the ability to compute distances between every pair of species and seeks to find an edge-weighted tree $T$ in which the distance $d^T_{ij}$ in the tree between the leaves of $T$ corresponding to the species $i$ and $j$ exactly equals the observed distance, $d_{ij}$. When such a tree exists, this is expressed in the biological literature by saying that the distance function or matrix is additive, and trees can be constructed from additive distance matrices in $O(n^2)$ time. Real distance data is hardly ever additive, and we therefore need ways of modeling the problem of finding the best-fit tree as an optimization problem. In this paper we present several natural and realistic ways of modeling the inaccuracies in the distance data. In one model we assume that we have upper and lower bounds for the distances between pairs of species and try to find an additive distanc...
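To make the notion of an additive matrix concrete, the following minimal sketch tests additivity with the classical four-point condition: for every four species, the two largest of the three pairwise sums must be equal. This is an illustration only, not the paper's method. The function name `is_additive` and the tolerance `tol` are assumptions, the input is assumed to be a symmetric metric with a zero diagonal, and the brute-force check takes $O(n^4)$ time rather than the $O(n^2)$ construction the abstract mentions.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Four-point test on a distance matrix d (list of lists).

    Assumes d is symmetric with a zero diagonal and already satisfies
    the triangle inequality; only distinct quadruples are checked.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        # The three ways of pairing up four leaves.
        sums = sorted([
            d[i][j] + d[k][l],
            d[i][k] + d[j][l],
            d[i][l] + d[j][k],
        ])
        # In a tree metric the two largest sums coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Quartet tree ((a,b),(c,d)) with every edge of length 1.
tree_like = [
    [0, 2, 3, 3],
    [2, 0, 3, 3],
    [3, 3, 0, 2],
    [3, 3, 2, 0],
]
# Same matrix with the a-c distance perturbed, as noisy data might be.
noisy = [row[:] for row in tree_like]
noisy[0][2] = noisy[2][0] = 5
```

Here `is_additive(tree_like)` holds, while the perturbed matrix `noisy` fails the test, which is the situation that motivates treating tree fitting as an optimization problem.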
Sentence Fusion for Multidocument News Summarization
 Lexical cohesion, the thesaurus, and the structure of text. Computational Linguistics, 17(1):21–48. Nenkova, Ani
, 1991
A system that can produce informative summaries, highlighting common information found in many online documents, will help Web users to pinpoint information that they need without extensive reading. In this article, we introduce sentence fusion, a novel text-to-text generation technique for synthesizing common information across documents. Sentence fusion involves bottom-up local multisequence alignment to identify phrases conveying similar information and statistical generation to combine common phrases into a sentence. Sentence fusion moves the summarization field from the use of purely extractive methods to the generation of abstracts that contain sentences not found in any of the input documents and can synthesize information across sources.
Combinatorial algorithms for DNA sequence assembly
 Algorithmica
, 1993
The trend towards very large DNA sequencing projects, such as those being undertaken as part of the human genome initiative, necessitates the development of efficient and precise algorithms for assembling a long DNA sequence from the fragments obtained by shotgun sequencing or other methods. The sequence reconstruction problem that we take as our formulation of DNA sequence assembly is a variation of the shortest common superstring problem, complicated by the presence of sequencing errors and reverse complements of fragments. Since the simpler superstring problem is NP-hard, any efficient reconstruction procedure must resort to heuristics. In this paper, however, a four-phase approach based on rigorous design criteria is presented, and has been found to be very accurate in practice. Our method is robust in the sense that it can accommodate high sequencing error rates and list a series of alternate solutions in the event that several appear equally good. Moreover it uses a limited form ...
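As a point of reference for the superstring formulation, here is a minimal sketch of the textbook greedy heuristic for the error-free shortest common superstring problem: repeatedly merge the two fragments with the largest exact overlap. It is not the four-phase method of the paper, and it ignores sequencing errors and reverse complements. The names `overlap` and `greedy_superstring` are hypothetical.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(fragments):
    """Greedy merge heuristic; returns a (not necessarily shortest) superstring."""
    # Drop duplicates, then fragments contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best = (-1, 0, 1)  # (overlap length, left index, right index)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        merged = frags[i] + frags[j][k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (i, j)]
        frags.append(merged)
    return frags[0] if frags else ""


reads = ["ACGT", "GTCA", "CATT"]
assembled = greedy_superstring(reads)
```

For these three reads the heuristic merges `ACGT` with `GTCA` and then appends `CATT`, giving a superstring of length 8 that contains every read. With real data, exact overlaps would have to be replaced by approximate ones, which is part of what makes the assembly problem harder than the plain superstring problem.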
On Distances between Phylogenetic Trees
, 1997
Different phylogenetic trees for the same group of species are often produced either by procedures that use diverse optimality criteria [18] or from different genes [12] in the study of molecular evolution. Comparing these trees to find their similarities (e.g. agreement or consensus) and dissimilarities, i.e. distance, is thus an important issue in computational molecular biology. The nearest neighbor interchange (nni) distance [26, 24, 32, 4, 5, 3, 16, 17, 19, 30, 20, 21, 23] and the subtree-transfer distance [12, 13, 15] are two major distance metrics that have been proposed and extensively studied for different reasons. Despite their many appealing aspects such as simplicity and sensitivity to tree topologies, computing these distances has remained very challenging. This article studies the complexity and efficient approximation algorithms for computing the nni distance and a natural extension of the subtree-transfer distance, called the linear-cost subtree-transfer distance. The ...
A novel method for multiple alignment of sequences with repeated and shuffled elements
, 2004
Optimal expression evaluation for data parallel architectures
 Journal of Parallel and Distributed Computing
, 1991
General Time-Reversible Distances with Unequal Rates across Sites: Mixing Γ and Inverse Gaussian Distributions with Invariant Sites
, 1997
This paper aims to explain to biologists the assumptions of these distances and to clarify some earlier misconceptions. Importantly, nearly all of the currently used distance estimates (including those of Tamura, 1992; Tamura and Nei, 1994) are special cases (restrictions) of the general time-reversible distance (see Zharkikh, 1994; Swofford et al., 1996).
A more Efficient Approximation Scheme for Tree Alignment
 SIAM Journal on Computing
, 1997
We present a new polynomial time approximation scheme (PTAS) for tree alignment, which is an important variant of multiple sequence alignment. As in the existing PTASs in the literature, the basic approach of our algorithm is to partition the given tree into overlapping components of a constant size and then apply local optimization on each such component. But the new algorithm uses a clever partitioning strategy and achieves a better efficiency for the same performance ratio. For example, to achieve approximation ratios 1.6 and 1.5, the best existing PTAS has to spend time $O(kdn^5)$ and $O(kdn^9)$, respectively, where $n$ is the length of each leaf sequence and $d$, $k$ are the depth and number of leaves of the tree, while the new PTAS only has to spend time $O(kdn^4)$ and $O(kdn^5)$. Moreover, the performance of the PTAS is more sensitive to the size of the components, which basically determines the running time, and we obtain an improved approximation ratio for each size. Some experiments of the algorithm on simulated and real data are also given.
Approximation Algorithms for Multiple Sequence Alignment Under a Fixed Evolutionary Tree
, 1995
We consider the problem of aligning sequences related by a given evolutionary tree: given a fixed tree with its leaves labeled with sequences, find ancestral sequences to label the internal nodes so as to minimize the total cost of all the edges in the tree. The cost of an edge is the edit distance between the sequences labeling its endpoints. In this paper, we consider the case when the given tree is a regular $d$-ary tree for some fixed $d$ and provide a $\frac{d+1}{d-1}$-approximation algorithm for this problem that runs in time $O(d(2kn)^d + n^2 k^{2d})$ where $k$ is the number of leaves in the tree and $n$ is the maximum length of any of the sequences labeling the leaves. We also consider a new bottleneck objective in labeling the internal nodes. In this version, we wish to find the labeling of the internal nodes that minimizes the maximum cost of any edge in the tree. For this problem we provide a simple $(2\delta + 1)$-approximation algorithm where $\delta$ is the depth of the given undirected tree def...
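The two objectives described above can be illustrated with a minimal sketch that scores a tree whose nodes are already fully labeled: the total cost is the sum of edge edit distances, and the bottleneck cost is the largest single edge cost. This only evaluates a given labeling and is not the approximation algorithm of the paper. Unit-cost edit distance, the function names, and the example tree are all assumptions made for illustration.

```python
def edit_distance(s, t):
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(
                prev[j] + 1,             # delete a
                cur[j - 1] + 1,          # insert b
                prev[j - 1] + (a != b),  # match or substitute
            ))
        prev = cur
    return prev[-1]


def tree_costs(edges, labels):
    """Return (total cost, bottleneck cost) of a labeled tree.

    edges  -- non-empty list of (node, node) pairs
    labels -- dict mapping every node to its sequence
    """
    costs = [edit_distance(labels[u], labels[v]) for u, v in edges]
    return sum(costs), max(costs)


# Hypothetical example: one internal node "r" with two leaves.
labels = {"r": "ACGT", "x": "ACGA", "y": "GT"}
edges = [("r", "x"), ("r", "y")]
total, bottleneck = tree_costs(edges, labels)
```

In this example the edge to `x` costs 1 and the edge to `y` costs 2, so the total cost is 3 and the bottleneck cost is 2. A different choice of label for `r` could trade one objective against the other, which is why the two versions of the problem call for separate algorithms.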