Model Selection and Model Averaging in Phylogenetics: Advantages of Akaike Information Criterion and Bayesian Approaches Over Likelihood Ratio Tests
, 2004
Model selection is a topic of special relevance in molecular phylogenetics that affects many, if not all, stages of phylogenetic inference. Here we discuss some fundamental concepts and techniques of model selection in the context of phylogenetics. We start by reviewing different aspects of the selection of substitution models in phylogenetics from a theoretical, philosophical and practical point of view, and summarize this comparison in table format. We argue that the most commonly implemented model selection approach, the hierarchical likelihood ratio test, is not the optimal strategy for model selection in phylogenetics, and that approaches like the Akaike Information Criterion (AIC) and Bayesian methods offer important advantages. In particular, the latter two methods are able to simultaneously compare multiple nested or nonnested models, assess model selection uncertainty, and allow for the estimation of phylogenies and model parameters using all available models (model-averaged inference or multimodel inference). We also describe how the relative importance of the different parameters included in substitution models can be depicted. To illustrate some of these points, we have applied AIC-based model averaging to 37 mitochondrial DNA sequences from the subgenus Ohomopterus (genus Carabus) ground beetles described by Sota and Vogler (2001).
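The AIC comparison, model selection uncertainty, and parameter importance mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard definitions AIC = -2 ln L + 2K and Akaike weights w_i = exp(-Δ_i/2) / Σ_j exp(-Δ_j/2), and the log-likelihoods and parameter counts in the example are hypothetical values made up for illustration.

```python
import math

def aic(log_likelihood, num_params):
    """Akaike Information Criterion: -2 ln L + 2K."""
    return -2.0 * log_likelihood + 2.0 * num_params

def akaike_weights(aic_scores):
    """Map each model name to its Akaike weight (weights sum to 1)."""
    best = min(aic_scores.values())
    # Relative likelihood of each model given its AIC difference to the best model.
    rel = {name: math.exp(-0.5 * (score - best)) for name, score in aic_scores.items()}
    total = sum(rel.values())
    return {name: value / total for name, value in rel.items()}

def parameter_importance(weights, models_with_param):
    """Sum the weights of all candidate models that include a given parameter."""
    return sum(weights[name] for name in models_with_param)

# Hypothetical fits: (maximized log-likelihood, number of free parameters).
fits = {
    "JC": (-5210.4, 0),
    "HKY": (-5105.9, 4),
    "HKY+G": (-5051.2, 5),
    "GTR+G": (-5048.7, 9),
}
scores = {name: aic(lnl, k) for name, (lnl, k) in fits.items()}
weights = akaike_weights(scores)
# Importance of the gamma rate-heterogeneity parameter across the candidate set.
gamma_importance = parameter_importance(weights, ["HKY+G", "GTR+G"])
```

A model-averaged estimate of any quantity is then the weight-weighted mean of the per-model estimates, which is what lets all candidate models contribute to the inference.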
Species trees from gene trees: Reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions
 SYSTEMATIC BIOLOGY
, 2007
The estimation of species trees has become popular as a considerable amount of multilocus molecular data is available for inferring the evolutionary history of species. However, the current phylogenetic paradigm reconstructs gene trees to represent the species tree, and commonly used methods such as the concatenation method, the consensus tree method, or the gene tree parsimony method may be either inconsistent or highly biased. In this paper, we propose a Bayesian hierarchical model to estimate the phylogeny of a group of species using multiple estimated gene tree distributions such as those that arise in a Bayesian analysis of DNA sequence data. Our model employs substitution models used in traditional phylogenetics, but also uses coalescent theory to explain genealogical signals from species trees to gene trees and from gene trees to sequence data, thereby forming a stochastic model to estimate gene trees, species trees, ancestral population sizes and species divergence times simultaneously. Our model is founded on the assumption that gene trees, even of unlinked loci, are correlated due to being derived from a single species tree and therefore should be estimated jointly. We apply the method to two multilocus DNA sequence datasets. The estimates of the
Fast and Accurate Phylogeny Reconstruction Algorithms Based on the Minimum-Evolution Principle
 JOURNAL OF COMPUTATIONAL BIOLOGY
, 2002
The Minimum Evolution (ME) approach to phylogeny estimation has been shown to be statistically consistent when it is used in conjunction with ordinary least-squares (OLS) fitting of a metric to a tree structure. The traditional approach to using ME has been to start with the Neighbor Joining (NJ) topology for a given matrix and then do a topological search from that starting point. The first stage requires O(n³) time, where n is the number of taxa, while the current implementations of the second are in O(p n³) or more, where p is the number of swaps performed by the program. In this paper, we examine a greedy approach to minimum evolution which produces a starting topology in O(n²) time. Moreover, we provide an algorithm that searches for the best topology using nearest neighbor interchanges (NNIs), where the cost of doing p NNIs is O(n² + p n), i.e., O(n²) in practice because p is always much smaller than n. The Greedy Minimum Evolution (GME) algorithm, when used in combination with NNIs, produces trees which are fairly close to NJ trees in terms of topological accuracy. We also examine ME under a balanced weighting scheme, where sibling subtrees have equal weight, as opposed to the standard “unweighted” OLS, where
Distributions of tree comparison metrics – some new results. Syst. Biol
, 1993
Abstract.—Measures of dissimilarity (metrics) for comparing trees are important tools in the quantitative analysis of evolutionary trees, but many of their properties are incompletely known. The present paper reports formulae for the distributions of three classes of tree comparison metrics: the partition (or symmetric difference) metric, the quartet metric (which compares subsets of four taxa), and a metric based on path-length differences between pairs of taxa. The properties studied include the mean and variance for several underlying distributions of trees, the range, the effect of the number of taxa, and methods of calculation. Three basic theorems and their proofs are reported, one for each class of tree comparison metric. The partition metric generates an asymptotic Poisson distribution for most distributions of trees (its mean is given for three tree distributions). Exact expressions are derived for the variance of the quartet metric and the mean square value of a metric based on path differences. Factors that affect the choice of a metric for a particular study include the degree of similarity of the trees being compared and the type of hypothesis being tested (e.g., whether the trees estimate the same underlying phylogeny or are simply related in some, perhaps unknown, way). [Evolutionary trees; tree comparison metrics; quartet metric; partition metric; path-length differences.]
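As a concrete illustration of the partition (symmetric difference) metric named in this abstract, the sketch below counts the nontrivial bipartitions of the taxa that appear in exactly one of two trees. It is a minimal example under stated assumptions, not the paper's method of calculation: trees are written as nested Python tuples with string leaf names, the tuple's top level is treated only as an arbitrary rooting of an unrooted tree, and the function names are hypothetical.

```python
def leaf_set(tree):
    """Return the set of leaf names in a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))

def splits(tree):
    """Return the nontrivial bipartitions induced by the internal edges.

    Each bipartition is stored as the side that does NOT contain a fixed
    reference taxon, so the same split always has the same representation.
    """
    all_leaves = leaf_set(tree)
    reference = min(all_leaves)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if reference in below else below
        # Keep only splits with at least two taxa on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            found.add(side)
        return below

    visit(tree)
    return found

def partition_metric(tree_a, tree_b):
    """Number of bipartitions present in exactly one of the two trees."""
    if leaf_set(tree_a) != leaf_set(tree_b):
        raise ValueError("trees must be defined on the same taxa")
    return len(splits(tree_a) ^ splits(tree_b))

t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("d", "e"))
distance = partition_metric(t1, t2)  # the trees differ in one split each
```

Storing each split as the side without the reference taxon avoids counting a bipartition and its complement as two different objects.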
Estimating Species Phylogenies Using Coalescence Times among Sequences
, 2009
The estimation of species trees (phylogenies) is one of the most important problems in evolutionary biology, and recently, there has been greater appreciation of the need to estimate species trees directly rather than using gene trees as a surrogate. A Bayesian method constructed under the multispecies coalescent model can consistently estimate species trees but involves intensive computation, which can hinder its application to the phylogenetic analysis of large-scale genomic data. Many summary statistics–based approaches, such as shallowest coalescences (SC) and Global LAteSt Split (GLASS), have been developed to infer species phylogenies for multilocus data sets. In this paper, we propose 2 methods, species tree estimation using average ranks of coalescences (STAR) and species tree estimation using average coalescence times (STEAC), based on the summary statistics of coalescence times. It can be shown that the 2 methods are statistically consistent under the multispecies coalescent model. STAR uses the ranks of coalescences and is thus resistant to variable substitution rates along the branches in gene trees. A simulation study suggests that STAR consistently outperforms STEAC, SC, and GLASS when the substitution rates among lineages are highly variable. Two real genomic data sets were analyzed by the 2 methods and produced species trees that are consistent with previous results. [Coalescent model; gene tree; species tree.]
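The summary statistic behind the "average coalescence times" idea can be sketched as below. This is only an illustration of the averaging step, under loud assumptions: each gene tree is assumed to be already reduced to a table of pairwise coalescence times with one sequence per species, the input format and function name are hypothetical, and the later step of building a species tree from the averaged table is not shown because the abstract does not describe it.

```python
from itertools import combinations

def average_coalescence_times(gene_tables):
    """Average pairwise coalescence times over gene trees.

    gene_tables: list of dicts mapping frozenset({species_x, species_y})
    to the coalescence time of that pair in one gene tree (assumed format).
    Returns a dict with the same keys holding the mean time per pair.
    """
    if not gene_tables:
        raise ValueError("need at least one gene tree")
    species = sorted(set().union(*gene_tables[0].keys()))
    averaged = {}
    for x, y in combinations(species, 2):
        pair = frozenset((x, y))
        # Every gene tree must supply a time for every species pair.
        averaged[pair] = sum(table[pair] for table in gene_tables) / len(gene_tables)
    return averaged

# Two hypothetical gene trees on species A, B, C (times in arbitrary units).
gene1 = {frozenset("AB"): 1.0, frozenset("AC"): 4.0, frozenset("BC"): 4.0}
gene2 = {frozenset("AB"): 3.0, frozenset("AC"): 2.0, frozenset("BC"): 3.0}
mean_times = average_coalescence_times([gene1, gene2])
```

A rank-based variant would replace each time by the rank of the corresponding coalescence before averaging, which is what the abstract credits for STAR's resistance to rate variation among lineages.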
Southern hemisphere biogeography inferred by event-based models: plant versus animal patterns
 Systematic Biology
, 2004
Abstract.—The Southern Hemisphere has traditionally been considered as having a fundamentally vicariant history. The common trans-Pacific disjunctions are usually explained by the sequential breakup of the supercontinent Gondwana during the last 165 million years, causing successive division of an ancestral biota. However, recent biogeographic studies, based on molecular estimates and more accurate paleogeographic reconstructions, indicate that dispersal may have been more important than traditionally assumed. We examined the relative roles played by vicariance and dispersal in shaping Southern Hemisphere biotas by analyzing a large data set of 54 animal and 19 plant phylogenies, including marsupials, ratites, and southern beeches (1,393 terminals). Parsimony-based tree fitting in conjunction with permutation tests was used to examine to what extent Southern Hemisphere biogeographic patterns fit the breakup sequence of Gondwana and to identify concordant dispersal patterns. Consistent with other studies, the animal data are congruent with the geological sequence of Gondwana breakup: (Africa(New Zealand(southern South America, Australia))). Trans-Antarctic dispersal (Australia ↔ southern South America) is also significantly more frequent than any other dispersal event in animals, which may be explained by the long period of geological contact between Australia and South America via Antarctica. In contrast, the dominant pattern in plants, (southern South America(Australia, New Zealand)), is better explained by dispersal, particularly the prevalence of trans-Tasman dispersal between New Zealand and Australia. Our results also confirm the hybrid origin of the South American biota: there has been surprisingly little biotic exchange between the northern tropical and the southern
Multiple Sequence Alignment Accuracy and Phylogenetic Inference
Phylogenies are often thought to be more dependent upon the specifics of the sequence alignment than on the method of reconstruction. Simulation of sequences containing insertion and deletion events was performed in order to determine the role that alignment accuracy plays during phylogenetic inference. Data sets were simulated for pectinate, balanced, and random tree shapes under different conditions (ultrametric equal branch length, ultrametric random branch length, nonultrametric random branch length). Comparisons between hypothesized alignments and true alignments enabled determination of two measures of alignment accuracy, that of the total data set and that of individual branches. In general, our results indicate that as alignment error increases, topological accuracy decreases. This trend was much more pronounced for data sets derived from more pectinate topologies. In contrast, for balanced, ultrametric, equal branch length tree shapes, alignment inaccuracy had little average effect on tree reconstruction. These conclusions are based on average trends of many analyses under different conditions, and any one specific analysis, independent of the alignment accuracy, may recover very accurate or inaccurate topologies. Maximum likelihood and Bayesian methods, in general, outperformed neighbor joining and maximum parsimony in terms of tree reconstruction accuracy. Results also indicated that as the length of the branch and of the neighboring branches increase, alignment accuracy decreases, and the length of the neighboring branches is the major factor in topological accuracy. Thus, multiple-sequence alignment can be an important factor in downstream effects on topological reconstruction. [Bayesian; maximum likelihood; maximum parsimony; multiple sequence alignment; neighbor
A Framework for Representing Reticulate Evolution
 ANNALS OF COMBINATORICS
, 2004
Acyclic directed graphs (ADGs) are increasingly being viewed as more appropriate for representing certain evolutionary relationships, particularly in biology, than rooted trees. In this paper, we develop a framework for the analysis of these graphs which we call hybrid phylogenies. We are particularly interested in the problem whereby one is given a set of phylogenetic trees and wishes to determine a hybrid phylogeny that ‘embeds’ each of these trees and which requires the smallest number of hybridisation events. We show that this quantity can be greatly reduced if additional species are involved, and investigate other combinatorial aspects of this and related questions.
The Importance of Proper Model Assumption in Bayesian Phylogenetics
, 2004
We studied the importance of proper model assumption in the context of Bayesian phylogenetics by examining >5,000 Bayesian analyses and six nested models of nucleotide substitution. Model misspecification can strongly bias bipartition posterior probability estimates. These biases were most pronounced when rate heterogeneity was ignored. The type of bias seen at a particular bipartition appeared to be strongly influenced by the lengths of the branches surrounding that bipartition. In the Felsenstein zone, posterior probability estimates of bipartitions were biased when the assumed model was underparameterized but were unbiased when the assumed model was overparameterized. For the inverse Felsenstein zone, however, both underparameterization and overparameterization led to biased bipartition posterior probabilities, although the bias caused by overparameterization was less pronounced and disappeared with increased sequence length. Model parameter estimates were also affected by model misspecification. Underparameterization caused a bias in some parameter estimates, such as branch lengths and the gamma shape parameter, whereas overparameterization caused a decrease in the precision of some parameter estimates. We caution researchers to assure that the most appropriate model is assumed by employing both a priori model choice methods and a posteriori model adequacy tests. [Bayesian phylogenetic inference; convergence; Markov chain Monte Carlo; maximum likelihood; model choice; posterior probability.] Model choice is becoming a critical issue as the number of available models of nucleotide evolution increases rapidly. Recent studies have shown that adequate
Case Study: Visualizing Sets of Evolutionary Trees
, 2002
We describe a visualization tool which allows a biologist to explore a large set of hypothetical evolutionary trees. Interacting with such a dataset allows the biologist to identify distinct hypotheses about how different species or organisms evolved, which would not have been clear from traditional analyses. Our system integrates a point-set visualization of the distribution of hypothetical trees with detail views of an individual tree, or of a consensus tree summarizing a subset of trees. Efficient algorithms were required for the key tasks of computing distances between trees, finding consensus trees, and laying out the point-set visualization.