Results 1 - 10
of
62
Optimal, efficient reconstruction of phylogenetic networks with constrained recombination
- J. Bioinformatics and Computational Biology
, 2003
"... gusfield,eddhu¡ A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not treelike. With the growth of genomic data, much of which does not fit ideal tree models, there is greater need to understand the algorithmics and combinatorics of phylogenet ..."
Abstract
-
Cited by 85 (14 self)
- Add to MetaCart
gusfield,eddhu¡ A phylogenetic network is a generalization of a phylogenetic tree, allowing structural properties that are not treelike. With the growth of genomic data, much of which does not fit ideal tree models, there is greater need to understand the algorithmics and combinatorics of phylogenetic networks [10, 11]. However, to date, very little has been published on this, with the notable exception of the paper by Wang et al.[12]. Other related papers include [4, 5, 7] We consider the problem introduced in [12], of determining whether the sequences can be derived on a phylogenetic network where the recombination cycles are node disjoint. In this paper, we call such a phylogenetic network a “galled-tree”. By more deeply analysing the combinatorial constraints on cycle-disjoint phylogenetic networks, we obtain an efficient algorithm that is guaranteed to be both a necessary and sufficient test for the existence of a galledtree for the data. If there is a galled-tree, the algorithm constructs one and obtains an implicit representation of all the galled trees for the data, and can create these in linear time for each one. We also note two additional results related to galled trees: first, any set of sequences that can be derived on a galled tree can be derived on a true tree (without recombination cycles), where at most one back mutation is allowed per site; second, the site compatibility problem (which is NP-hard in general) can be solved in linear time for any set of sequences that can be derived on a galled tree. The combinatorial constraints we develop apply (for the most part) to node-disjoint cycles in any phylogenetic network (not just galled-trees), and can be used for example to prove that a given site cannot be on a node-disjoint cycle in any phylogenetic network. Perhaps more important than the specific results about galled-trees, we introduce an approach that can be used to study recombination in phylogenetic networks that go beyond galled-trees.
Efficient reconstruction of haplotype structure via perfect phylogeny
- Journal of Bioinformatics and Computational Biology
, 2003
"... Each person’s genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person’s genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a h ..."
Abstract
-
Cited by 57 (10 self)
- Add to MetaCart
Each person’s genome contains two copies of each chromosome, one inherited from the father and the other from the mother. A person’s genotype specifies the pair of bases at each site, but does not specify which base occurs on which chromosome. The sequence of each chromosome separately is called a haplotype. The determination of the haplotypes within a population is essential for understanding genetic variation and the inheritance of complex diseases. The haplotype mapping project, a successor to the human genome project, seeks to determine the common haplotypes in the human population. Since experimental determination of a person’s genotype is less expensive than determining its component haplotypes, algorithms are required for computing haplotypes from genotypes. Two observations aid in this process: first, the human genome contains short blocks within which only a few different haplotypes occur; second, as suggested by Gusfield, it is reasonable to assume that the haplotypes observed within a block have evolved according to a perfect phylogeny, in which at most one mutation event has occurred at any site, and no recombination occurred at the given region. We present a simple and efficient polynomial-time algorithm for inferring haplotypes from the genotypes of a set of individuals assuming a perfect phylogeny. Using a reduction to 2-SAT we extend this algorithm to handle constraints that apply when we have genotypes from both parents and child. We also present a hardness result for the problem of removing the minimum number of individuals from a population to ensure that the genotypes of the remaining individuals are consistent with a perfect phylogeny. Our algorithms have been tested on real data and give biologically meaningful results. Our webserver
Haplotype reconstruction from genotype data using imperfect phylogeny
- Bioinformatics
, 2004
"... Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize the genetic variation between differe ..."
Abstract
-
Cited by 53 (4 self)
- Add to MetaCart
Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize the genetic variation between different people, we must determine an individual’s haplotype or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes which shows that SNPs are organized in highly correlated “blocks”. In a few recent studies (see Daly et al. (2001); Patil et al. (2001)), considerable parts of the human genome were partitioned into blocks, such that the majority of the sequenced genotypes have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and for each block, we predict the common haplotypes and each individual’s haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (less than ¢ ¡ over the data from Daly et al. (2001).) when taking into account the predictions for the uncommon haplotypes. Our method is extremely efficient compared to previous methods, such as PHASE and HAPLOTYPER. Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data and to work with large data sets. Availability: The algorithm is available via webserver at
Large scale reconstruction of haplotypes from genotype data
- In Proc. RECOMB’03
, 2003
"... Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize an individual’s variation, we must de ..."
Abstract
-
Cited by 38 (1 self)
- Add to MetaCart
Critical to the understanding of the genetic basis for complex diseases is the modeling of human variation. Most of this variation can be characterized by single nucleotide polymorphisms (SNPs) which are mutations at a single nucleotide position. To characterize an individual’s variation, we must determine an individual’s haplotype or which nucleotide base occurs at each position of these common SNPs for each chromosome. In this paper, we present results for a highly accurate method for haplotype resolution from genotype data. Our method leverages a new insight into the underlying structure of haplotypes which shows that SNPs are organized in highly correlated “blocks”. The majority of individuals have one of about four common haplotypes in each block. Our method partitions the SNPs into blocks and for each block, we predict the common haplotypes and each individual’s haplotype. We evaluate our method over biological data. Our method predicts the common haplotypes perfectly and has a very low error rate (0.47%) when taking into account the predictions for the uncommon haplotypes. Our method is extremely efficient compared to previous methods, (a matter of seconds where previous methods needed hours). Its efficiency allows us to find the block partition of the haplotypes, to cope with missing data and to work with large data sets such as genotypes for thousands of SNPs for hundreds of individuals. The algorithm is available via webserver
A survey of computational methods for determining haplotypes
- Lecture Notes in Computer Science (2983): Computational Methods for SNPs and Haplotype Inference
, 2004
"... Abstract. It is widely anticipated that the study of variation in the human genome will provide a means of predicting risk of a variety of complex diseases. Single nucleotide polymorphisms (SNPs) are the most common form of genomic variation. Haplotypes have been suggested as one means for reducing ..."
Abstract
-
Cited by 28 (4 self)
- Add to MetaCart
Abstract. It is widely anticipated that the study of variation in the human genome will provide a means of predicting risk of a variety of complex diseases. Single nucleotide polymorphisms (SNPs) are the most common form of genomic variation. Haplotypes have been suggested as one means for reducing the complexity of studying SNPs. In this paper we review some of the computational approaches that have been taking for determining haplotypes and suggest new approaches. 1
Efficient Rule-Based Haplotyping Algorithms for Pedigree Data (Extended Abstract)
, 2003
"... Jing Li jili@cs.ucr.edu Tao Jiang + University of California - Riverside & Shanghai Center for Bioinform. Technology jiang@cs.ucr.edu ABSTRACT We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that th ..."
Abstract
-
Cited by 28 (9 self)
- Add to MetaCart
Jing Li jili@cs.ucr.edu Tao Jiang + University of California - Riverside & Shanghai Center for Bioinform. Technology jiang@cs.ucr.edu ABSTRACT We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum -recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very e#cient and can be used for large pedigrees with a large number of markers, especially for those data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero recombinant assumption, and represents them using a system of linear equations over the cyclic group Z2 . By using a simple method based on Gaussian elimination, we could obtain all possible feasible haplotype configurations. We have tested the block-extension algorithm on simulated data generated on three pedigree structures. The results show that the algorithm performs very well on both multi-allelic and biallelic data, especially when the number of recombinants is small.
A linear-time algorithm for the perfect phylogeny haplotyping (PPH) problem
- In International Conference on Research in Computational Molecular Biology (RECOMB
, 2005
"... Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002 (Gusfield, 2002), the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In ..."
Abstract
-
Cited by 26 (6 self)
- Add to MetaCart
Since the introduction of the Perfect Phylogeny Haplotyping (PPH) Problem in RECOMB 2002 (Gusfield, 2002), the problem of finding a linear-time (deterministic, worst-case) solution for it has remained open, despite broad interest in the PPH problem and a series of papers on various aspects of it. In this paper, we solve the open problem, giving a practical, deterministic linear-time algorithm based on a simple data structure and simple operations on it. The method is straightforward to program and has been fully implemented. Simulations show that it is much faster in practice than prior nonlinear methods. The value of a linear-time solution to the PPH problem is partly conceptual and partly for use in the inner loop of algorithms for more complex problems, where the PPH problem must be solved repeatedly. Key words: Perfect Phylogeny Haplotyping (PPH) Problem, Haplotype Inference Problem, linear-time algorithm, shadow tree. 1.
Efficient Inference of Haplotypes from Genotypes on a Pedigree
- J Bioinfo Comp Biol
, 2003
"... We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the pr ..."
Abstract
-
Cited by 25 (9 self)
- Add to MetaCart
We study haplotype reconstruction under the Mendelian law of inheritance and the minimum recombination principle on pedigree data. We prove that the problem of finding a minimum-recombinant haplotype configuration (MRHC) is in general NP-hard. This is the first complexity result concerning the problem to our knowledge. An iterative algorithm based on blocks of consecutive resolved marker loci (called block-extension) is proposed. It is very efficient and can be used for large pedigrees with a large number of markers, especially for those data sets requiring few recombinants (or recombination events). A polynomial-time exact algorithm for haplotype reconstruction without recombinants is also presented. This algorithm first identifies all the necessary constraints based on the Mendelian law and the zero recombinant assumption, and represents them using a system of linear equations over the cyclic group Z 2 . By using a simple method based on Gaussian elimination, we could obtain all possible feasible haplotype configurations. A C++ implementation of the block-extension algorithm, called PedPhase, has been tested on both simulated data and real data. The results show that the program performs very well on both types of data and will be useful for large scale haplotype inference projects.
An approximation algorithm for haplotype inference by maximum parsimony
- Journal of Computational Biology
, 2005
"... This paper studies haplotype inference by maximum parsimony using population data. We define the optimal haplotype inference (OHI) problem as given a set of genotypes and a set of related haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that OHI is NP-hard ..."
Abstract
-
Cited by 24 (1 self)
- Add to MetaCart
This paper studies haplotype inference by maximum parsimony using population data. We define the optimal haplotype inference (OHI) problem as given a set of genotypes and a set of related haplotypes, find a minimum subset of haplotypes that can resolve all the genotypes. We prove that OHI is NP-hard and can be formulated as an integer quadratic programming (IQP) problem. To solve the IQP problem, we propose an iterative semi-definite programming based approximation algorithm, (called SDPHapInfer). We show that this algorithm finds a solution within a factor of O(log n) of the optimal solution, where n is the number of genotypes. This algorithm has been implemented and tested on a variety of simulated and biological data. In comparison with three other methods: (1) HAPAR, which was implemented based on the branching and bound algorithm, (2) HAPLOTYPER, which was implemented based on the Expectation-Maximization algorithm, and (3) PHASE, which combined the Gibbs sampling algorithm with an approximate coalescent prior, the experimental results indicate that SDPHapInfer and HAPLOTYPER have similar error rates. In addition, the results generated by PHASE have lower error rates on some data but higher error rates on others. The error rates of HAPAR are higher than the others on biological data. In
The Haplotyping Problem: An Overview of Computational Models and Solutions
- Journal of Computer Science and Technology
, 2003
"... The investigation of genetic di#erences among humans has given evidence that mutations in DNA sequences are responsible for some genetic diseases. The most common mutation is the one that involves only a single nucleotide of the DNA sequence, which is called a single nucleotide polymorphism (SNP) ..."
Abstract
-
Cited by 22 (5 self)
- Add to MetaCart
The investigation of genetic di#erences among humans has given evidence that mutations in DNA sequences are responsible for some genetic diseases. The most common mutation is the one that involves only a single nucleotide of the DNA sequence, which is called a single nucleotide polymorphism (SNP). As a consequence, computing a complete map of all SNPs occurring in the human populations is one of the primary goals of recent studies in human genomics. The construction of such a map requires to determine the DNA sequences that from all chromosomes. In diploid organisms like humans, each chromosome consists of two sequences called haplotypes.

