
## Parsimony, likelihood, and the role of models in molecular phylogenetics (2000)

Venue: Mol. Biol. Evol.

Citations: 70 (11 self)

### BibTeX

```bibtex
@ARTICLE{Steel00likelihoodand,
  author  = {Mike Steel and David Penny},
  title   = {Parsimony, likelihood, and the role of models in molecular phylogenetics},
  journal = {Mol. Biol. Evol.},
  year    = {2000},
  pages   = {839--850}
}
```


### Abstract

Methods such as maximum parsimony (MP) are frequently criticized as being statistically unsound and not being based on any "model." On the other hand, advocates of MP claim that maximum likelihood (ML) has some fundamental problems. Here, we explore the connection between the different versions of MP and ML methods, particularly in light of recent theoretical results. We describe links between the two methods; for example, we describe how MP can be regarded as an ML method when there is no common mechanism between sites (such as might occur with morphological data and certain forms of molecular data). In the process, we clarify certain historical points of disagreement between proponents of the two methodologies, including a discussion of several forms of the ML optimality criterion. We also describe some additional results that shed light on how much needs to be assumed about underlying models of sequence evolution in order to successfully reconstruct evolutionary trees.

Key words: phylogeny, maximum likelihood, maximum parsimony, site substitution models.

Address for correspondence and reprints: Mike Steel, Biomathematics Research Centre, University of Canterbury, Private Bag 4800, Christchurch, New Zealand. E-mail: m.steel@math.canterbury.ac.nz.

### Introduction

Maximum parsimony (MP) is a popular technique for phylogeny reconstruction. However, MP is often criticized as being a statistically unsound method and one that fails to make explicit an underlying "model" of evolution. Discussion is further clouded by claims that MP variously is, or is not, a form of maximum likelihood (ML), and by the promotion of "zones" within which either method performs worse than the other in recovering the true tree. There is little agreement on how, or even whether, MP should be justified.
The simplicity of a method like MP (and its embellishments that allow weightings on characters and transition types), together with its apparent lack of assumptions about underlying models, made it popular in phylogeny (see, e.g., Farris 1973), particularly in the 1970s and the 1980s. Furthermore, it is possible to state sufficient conditions on the process by which characters evolve such that MP will recover the true tree. Essentially, these conditions amount to requiring that convergent evolution and reversals occur in (sufficiently) low numbers in comparison with the characters that identify edges of the tree (a more precise formulation is given by the lemma in section (a) of the appendix). The main problem with such simple criteria is that they are very unlikely to be satisfied for most real data sets, and even when they are, it may be impossible to tell this directly from the data (without knowing the true tree in advance). MP is still widely used, but model-based approaches have come to rival, and even dominate, phylogenetic methodology, particularly over the last decade. While ML is the leading alternative, other approaches include distance-based methods that use transformed or inferred distances, for example, logdet/paralinear distances. Nevertheless, ML methodology enjoys far from universal acceptance. Objections to ML include the following:

- Concern about the validity and exact form of any underlying stochastic model (e.g., concern about the choice of underlying parameters/distributions, and about the idea that by selecting an appropriate model one could perhaps reconstruct any favored tree).
- The concern that ML estimation of a tree (and statistical tests between different trees) involving the optimization of "nuisance" (supplementary) parameters is statistically problematic.
- Suggestions that the Felsenstein Zone rarely, if ever, arises for real data.
- The existence of a "Farris Zone," where MP outperforms ML.
- The analysis of new types of genome data, e.g., gene order and short interspersed nuclear elements (SINEs), for which MP may be more appropriate.
- Concern about the computational complexity of ML. Even on a given tree, optimizing the likelihood can be problematic (unlike with MP, for which Fitch's [1971b] algorithm computes the parsimony score in linear time).

In this paper, we explore most of these objections and survey some recent theoretical results that shed light on the interplay between the two methodologies and on the limits of what one can hope to achieve in phylogeny reconstruction. Before proceeding, it is necessary to clarify some terminology. We have already pointed out that the Principle of Parsimony (Ockham's razor) has two general applications: one as justification for an attempt to analyze data without reference to an underlying model, the other as a tree selection criterion (MP) that minimizes mutations. This latter usage, however, combines two aspects: selecting a tree with a minimal number of mutations, and using only observed data (not corrected for any multiple changes). These are independent concepts and can be used in different combinations; for example, minimization of the number of mutations can be applied after correction for multiple changes (corrected parsimony). In general, we prefer to treat a "method" for inferring evolutionary trees as being composed of three largely independent parts: the choice of optimality criterion, the search strategy over the space of trees, and the assumptions about the model of evolution. It is useful to make a three-way division of the model of evolution.
This consists of a tree T (or, more generally, a graph, when median networks or splits graphs are considered), a stochastic mechanism of evolution (such as whether or not it is neutral, whether it is Kimura 3ST, whether it exhibits rate heterogeneity), and the initial conditions (e.g., interspeciation times or rates on each edge [branch] of the tree). An additional factor is that the researcher may be hoping to recover different aspects of the model. Most frequently, perhaps, it is just the unweighted tree, regardless of the amount of mutation on each edge of the tree. In addition, the tree will usually be unrooted unless an outgroup or an assumption about a molecular clock is used. Frequently, however, the rates of mutation will be required in order to estimate times of divergence. Others will also wish to estimate the character states at the internal nodes. It is thus too simple just to compare "parsimony" and "likelihood." Indeed, likelihood itself comes in many flavors, and these will be discussed next. The usual form of ML is "maximum average likelihood," an example of "maximum relative likelihood." These and other distinctions we discuss below have also been noted by other authors.

### What Is ML, and What Does It Maximize?

The likelihood of the hypothesis H, given data D and a specific model, is proportional to P(D | H), the conditional probability of observing D given that H is correct. In the context of phylogeny reconstruction from sequences, D typically counts the number of "site patterns" that occur in a collection of aligned sequences. The order in which these patterns occur (and the phylogenetic information that this might convey) is usually discarded, although some authors have explicitly incorporated it into their analysis. Nuisance parameters (and the associated problems they cause) arise widely in many statistical settings.
They have been discussed in the phylogeny setting by several authors. Two frequent assumptions concerning substitution models are that aligned sites evolve independently and according to identical processes: the so-called "i.i.d." assumption. Note that the i.i.d. assumption still allows sites to evolve at different rates, by regarding the rate of a site as being randomly and independently selected from an appropriate distribution (such as a gamma distribution). Of course, in real sequences there is clustering of "conserved" and "hypervariable" sites (so the real process is definitely not i.i.d. across sites), but when one passes to the frequencies of site patterns (i.e., the data D), the process can be modeled by an i.i.d. process. Similarly, certain covarion-style mechanisms (where sites can alternate between invariable and variable during evolution) can be modeled using an i.i.d. process. The i.i.d. assumption allows one to readily compute P(D | T, θ) by identifying it with the product of the probabilities of evolving each particular site. Occasionally, more intricate models have been proposed and analyzed. These include models that allow a limited degree of nonindependence between sites (e.g., pairwise interactions in stem regions; Schöniger and von Haeseler 1994) and models that work with nonaligned sequences and explicitly model the insertion-deletion process as well as the site substitution process.

### Maximum Integrated Likelihood Versus Maximum Relative Likelihood

If the nuisance parameters θ and the phylogeny T are generated according to some known prior distribution (e.g., a Yule pure-birth process), one can formally integrate out these nuisance parameters and thereby take P(D | T) to be this average value.
That is, if Φ(θ | T) denotes the distribution function of the nuisance parameters conditional on the underlying tree T, then

P(D | T) = ∫ P(D | T, θ) dΦ(θ | T).

This approach is sometimes referred to as "integrated likelihood," and we will refer to a tree T that maximizes P(D | T) as a maximum integrated likelihood (MIL) tree. MIL, and, more generally, the assignment of posterior probabilities to trees based on sequence data (using Markov chain Monte Carlo techniques to approximate the integral in the above equation), has recently been developed independently by several authors. Assume for the moment that one possesses such a prior distribution (e.g., based on a Yule process). A natural question arises: in what sense is maximum integrated likelihood an optimal method for selecting a tree? In particular, is it the method that is most likely (on average) to return the true tree? To formalize this question, suppose we have a tree reconstruction method M, and we apply it to sequences that have been generated by a model with underlying parameters T and θ. The reconstruction probability, denoted ρ(M, T, θ), is the probability that the sequences so generated return the correct tree T when method M is applied. Since we have a distribution on trees and the nuisance parameters, let ρ(M) denote the expected reconstruction probability of method M, obtained by integrating ρ(M, T, θ) over the joint parameter space. That is,

ρ(M) = Σ_T p(T) ∫ ρ(M, T, θ) dΦ(θ | T),

where p(T) is the probability of the tree T under the prior distribution (we will assume that only binary trees have positive probability). The following theorem precisely describes the method that maximizes the expected reconstruction probability (for a proof, see Székely and Steel 1999).

THEOREM 1. Under the conditions described, the method M that maximizes the expected reconstruction probability ρ(M) is precisely the method that selects, for any data D, the tree(s) T maximizing p(T)P(D | T).
The tree(s) maximizing p(T)P(D | T) is sometimes referred to as the maximum a posteriori probability (MAP) estimate. It is precisely the MIL tree(s) whenever the prior distribution on binary trees is uniform (i.e., when all binary trees are equally likely). Consequently, assuming that the prior distribution assigns equal probabilities to all binary trees, MIL maximizes one's average chance of recovering the correct tree. However, if the distribution on binary trees is not uniform (for example, if it is described by a Yule process), then the optimal selection criteria are slightly different. In any case, an obvious question is how to agree upon a biologically reasonable distribution on trees and parameters. The alternative approach, which is more widely adopted, is sometimes called maximum relative likelihood (MRL). One simply assumes that the nuisance parameters θ take values that, simultaneously with an optimal tree T, maximize P(D | T, θ). Usually, one then discards θ and outputs just the tree(s) T. Such an approach can be problematic in general statistical settings where D depends on both continuous (nuisance) parameters θ and a discrete parameter x of interest. In this situation, there may be one "unlikely" value of θ that, for x = x1, gives a higher value of P(D | x1, θ) than max_θ P(D | x2, θ), yet for most "likely" values of θ, the probability P(D | x2, θ) is the greater. This property means that MRL may make selections different from those of MIL, and this seems to have been a fundamental issue in the exchange between Felsenstein and Sober (1986) on the relative merits of MP and ML. Moreover, in the phylogenetic setting, MRL may select different trees from the MIL method described above, even when all binary trees are equally likely (at least for certain distributions on the edge parameters of the tree). An example of this is described at the end of Can MP Outperform MavL? below.
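The potential disagreement between relative (profile) and integrated likelihood can be seen in a toy numerical example. This sketch is ours, not from the paper, and every likelihood value in it is made up: a single "unlikely" nuisance value gives hypothesis 1 a likelihood spike, so optimizing over the nuisance parameter prefers hypothesis 1, while averaging over the same nuisance grid prefers hypothesis 2.

```python
# Toy illustration (hypothetical numbers): maximum relative (profile)
# likelihood and integrated likelihood can pick different hypotheses.

# P(D | x, theta) tabulated on a 10-point nuisance grid with a uniform prior.
n_grid = 10
lik_x1 = [0.9] + [0.01] * (n_grid - 1)   # spike at one "unlikely" theta
lik_x2 = [0.5] * n_grid                  # moderately high for every theta

profile = {1: max(lik_x1), 2: max(lik_x2)}        # MRL: optimize out theta
integrated = {1: sum(lik_x1) / n_grid,            # MIL: average out theta
              2: sum(lik_x2) / n_grid}

mrl_choice = max(profile, key=profile.get)        # hypothesis 1 (0.9 > 0.5)
mil_choice = max(integrated, key=integrated.get)  # hypothesis 2 (0.5 > 0.099)
print(mrl_choice, mil_choice)
```

The same effect, with trees in place of the discrete parameter x and edge lengths in place of theta, underlies the MRL/MIL divergence discussed above.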
For the remainder of this paper, we will generally assume that no prior distribution is given for trees and edge parameters, and so all forms of ML involve MRL. With this in mind, we review some further distinctions.

### Maximum Average Likelihood, Most-Parsimonious Likelihood, and Evolutionary Pathway Likelihood

In fitting sequence data to a tree, the sequences at the leaves (tips) of the tree are given, but those at the internal vertices (speciation or branching points) of the tree are not. In the usual implementation of MRL in molecular phylogenetics, one effectively averages over all possible assignments of sequences to these internal vertices. Following Barry and Hartigan (1987), we call this maximum average likelihood and denote it MavL. However, one could also assign sequences to the internal vertices (along with the other parameters) so as to maximize the likelihood; such an approach has been explicitly suggested in the literature. We pause here to note that Goldman (1990) has already noted one link between MP and most-parsimonious likelihood. He showed that under a symmetric two-state mutation model, if one imposes the rather artificial constraint that the mutation probability associated with each edge of any binary tree is set equal to some common value p, then the MP tree(s) are exactly the most-parsimonious likelihood tree(s). This result applies either with p fixed or with p optimized. Given the most-parsimonious likelihood approach, one might ask, what is so special about the sequences at the internal vertices of the tree? That is, perhaps one might carry the approach further and select sequences for each time interval right through the tree (jointly with the other parameters) so as to maximize the probability of observing the given sequences at the leaves.
Thus, one would associate with each edge of the tree a series of sequences, corresponding to their evolution at frequently sampled time intervals; such an approach has also been suggested in the literature.

### Does MP = MavL Under Some Model?

Most-parsimonious likelihood and evolutionary pathway likelihood both entail specifying a choice of sequences at points inside the tree. Although a particular selection of sequences may be the most probable, the attraction of MavL is that it effectively allows all possible assignments of sequences to the interior of the tree. These are weighted according to their probability and then summed to give the marginal probability of evolving the sequences observed at the leaves. The question then arises as to whether MP can be regarded as a MavL method under some model. Suppose we take the simplest type of substitution model at a particular site: a Poisson model in which each of the possible substitutions at that site occurs with equal probability. This model, sometimes called the Neyman model (or the Jukes-Cantor model, when dealing with exactly four states), will be referred to here simply as the Poisson model. Now suppose the rates of evolution on each branch of the tree can vary freely from site to site. In this case, we have some constraints on the underlying type of substitution model (i.e., Jukes-Cantor type) but no constraints on the edge parameters from site to site. We refer to this as "no common mechanism." This is even more general than the type of approach considered by Olsen (see Swofford et al. 1996, p. 443), in which the rate at which a site evolves can vary freely from site to site, but the ratios of the edge lengths are equal across sites. For the Poisson model with no common mechanism (not even the same rates for different characters), the following result applies.

THEOREM 2. Under the model described (with no common mechanism), the maximum average likelihood tree(s) is precisely the maximum parsimony tree(s).
This result, by Tuffley and Steel (1997a), generalizes an earlier special case. The argument used to establish Theorem 2 also shows that, under a Poisson model, if we are given just a tree and a single character (and no information as to the edge lengths), the ML estimate of the state at any internal vertex of the tree (given the states at the leaves) is precisely the MP estimate. For a further link between ML and MP, suppose we take any sequence data and add a sufficiently large number of unvaried sites. Then, under a Poisson model, the ML tree for this extended data set is always an MP tree. Of course, this type of underlying model (in Theorem 2) is almost certainly too flexible, since it allows many new parameters for each edge. It might be regarded as the model one would start with if one knew virtually nothing about any common underlying mechanism linking the evolution of different characters on a tree (e.g., as with some morphological characters). For processes like nucleotide substitution, as one learns more about the common mechanisms involved, it would seem desirable to use this information. This leads to the more usual implementations of MavL, where the model parameters (such as edge lengths) are constant across sites. Indeed, advocates of Ockham's razor (the Principle of Parsimony) might well invoke the principle at this point, as illustrated by the following example: consider sequences of a pseudogene, with each sequence being over 10,000 nt long. Again, this conclusion must be taken with care. Such a model may not apply to other sequence data and would not often apply to morphological data (e.g., where the evolution of numbers of legs may differ from that of wing color). It is clear that we still need to learn more about the processes leading to different types of insertion and deletion events in sequence data before postulating a common mechanism.
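As an aside, the parsimony score that Theorem 2 equates with the no-common-mechanism likelihood is cheap to compute: Fitch's algorithm, mentioned earlier, scores a fixed binary tree with one post-order pass per character. The sketch below is ours (the nested-tuple tree encoding and function names are illustrative, not from the paper):

```python
# A minimal sketch of Fitch's (1971) small-parsimony algorithm for one
# character on a fixed binary tree. A tree is either a leaf state (str)
# or a (left, right) pair of subtrees.

def fitch_score(tree):
    """Return (state_set, score): the Fitch state set at the root and the
    minimum number of state changes needed on this (sub)tree."""
    if isinstance(tree, str):                  # leaf: singleton set, no cost
        return {tree}, 0
    lset, lcost = fitch_score(tree[0])
    rset, rcost = fitch_score(tree[1])
    inter = lset & rset
    if inter:                                  # children agree: intersect
        return inter, lcost + rcost
    return lset | rset, lcost + rcost + 1      # disagree: union, +1 change

# Quartet ((a,b),(c,d)) with character pattern A A G G: one change suffices.
print(fitch_score((("A", "A"), ("G", "G")))[1])  # -> 1
# Pattern A G A G on the same tree shape needs two changes.
print(fitch_score((("A", "G"), ("A", "G")))[1])  # -> 2
```

Summing this score over independent sites gives the parsimony length of the tree, which is why scoring a candidate tree is linear-time even though searching over all trees is not.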
In summary, this subsection suggests two ironies: first, the parsimonious approach suggested by Ockham's razor can, given information about a common mechanism, support the usual forms of ML over MP for sequence data; second, by Theorem 2, when we generalize traditional substitution models (like Jukes-Cantor) sufficiently far, namely, to allow different edge parameters at different sites, the usual ML approach arrives back at MP.

### When Is MP Statistically Consistent?

Given a model of site substitution, a tree reconstruction method is said to be statistically consistent if the probability of its reconstructing the true tree converges to certainty as the sequence length tends to infinity, regardless of what values the structural nuisance parameters take. Note that the reconstructed tree is considered correct if it matches the generating tree up to the position of any root vertex in the latter tree, since the root generally cannot be determined without additional assumptions (e.g., a molecular clock). The concept of statistical consistency is always relative to the model in question, and methods that are consistent for one class of models may be inconsistent for others (Chang 1996b). Statistical consistency is often seen as a desirable, if not essential, property of an estimator in most statistical settings, although this viewpoint is sometimes questioned in phylogenetics. Nevertheless, the issue of consistency has tended to dominate much of the discussion concerning the relative merits of ML over MP, particularly since Felsenstein's (1978) classic paper showing that MP (and the related maximum-compatibility method) can be inconsistent.
However, distance methods applied to uncorrected data can also be inconsistent; indeed, under the symmetric two-state model, the conditions for inconsistency of some standard distance methods (applied to uncorrected distances) are identical to those for MP on four-taxon trees. A seductive, but erroneous, belief is that if the mutation probabilities on the edges are all sufficiently small, then MP is statistically consistent under simple models. However, Felsenstein's (1978) counterexample allows arbitrarily low mutation probabilities. Nevertheless, if one fixes the relative branch lengths on any tree, one can easily show that if the rate of substitution is sufficiently small, then MP is statistically consistent. For four sequences, it is possible to say exactly when MP will be statistically consistent (in terms of the edge parameters), at least for simple models such as the symmetric two-state model. If the branch lengths satisfy a molecular clock, then the Felsenstein Zone disappears for four-taxon trees, at least for symmetric models like the Kimura 3ST and Jukes-Cantor models. A curious consequence of these results arises when a molecular clock applies: if one uses MP on the entire data set, the method may be statistically inconsistent, yet if one had used MP to reconstruct trees on quartets of taxa and then combined these quartet trees, the method would be statistically consistent. Note that, just as with distances, it is possible for some models (e.g., the Kimura 3ST model) to transform sequence data so that MP applied to the new data will always be consistent (this approach, called "corrected parsimony," is described in Steel, Penny, and Hendy [1993] and Penny et al. [1996]). An unresolved issue is to what extent such inconsistency occurs with biological (as distinct from computer-simulated) sequence data.
Examples of tree-building inconsistency arising from the use of inappropriate analysis models have also been suggested in the literature. A further relevant factor is the size of the state space of characters. With site substitutions, one generally has a state space of size 2 (purines/pyrimidines) or, more usually, 4 (the four nucleotides), while for amino acid and codon data, the state space has size 20 or 64, respectively. With other types of genomic data, for example, gene order (Blanchette, Kunisawa, and Sankoff 1999) and SINEs (Nikaido, Rooney, and Okada 1999), there is a much larger state space. In this case, if the states evolve by a simple Markov model, then one might expect MP (and related methods like maximum compatibility) to behave better, since there is less likelihood of returning to a state that was present earlier in the tree. We formalize this as follows. Suppose we generate characters independently and by an identical process according to a tree-based Markov model, in which there are r states that evolve on a tree T with n leaves. We will suppose that the probability of a mutation on an edge e of the tree, conditional on there having been any given number of mutations earlier in the tree, lies strictly between a and b, where 0 < a ≤ b < 1. We will also suppose that, conditional on (1) a mutation occurring on edge e = (u, v) and (2) the state at u, the probability that the state at v is any one of the particular r − 1 alternative states is at most c/(r − 1) for some constant c. For example, in a Poisson model, where each of the r − 1 different states is equally likely to be selected if a mutation occurs, we have c = 1. This model allows some transition events to have very low (or zero) probability, since we only require c/(r − 1) to be an upper bound on these conditional transition probabilities. We summarize the relevant constraints on this model by the quadruple (n, a, b, c).

THEOREM 3.
If the number of states (r) is large enough (relative to the other constraints n, a, b, and c), then MP is statistically consistent for all binary trees with n leaves.

Thus, for simple mutation models with bounded mutation probabilities, if the state space is large enough, then there is hope of escaping the Felsenstein Zone. However, this claim needs qualifying: it does not imply that any simple enlargement of the state space will automatically make MP statistically consistent. For example, suppose one enlarges the state space by considering pairs (2-tuples), triples (3-tuples), or, more generally, k-tuples of sites (in which case the size of the state space is r = 4^k if we have four-state sites). We suppose that changing one k-tuple of states into a different k-tuple costs 1 unit, regardless of the number of site changes involved. Note that MP applied to pairs (or, more generally, to k-tuples) of sites may lead to different trees than MP applied to single sites, even for four sequences. Nevertheless, for four sequences, MP will be consistent when applied to pairs of sites if and only if it is statistically consistent on the original single-site data. Formally, we have:

THEOREM 3A. For four sequences and any i.i.d. model of sequence evolution, MP is statistically consistent on k-tuple-site data if and only if MP is statistically consistent on single-site data.

A proof of Theorem 3A is given in section (b) of the appendix. Note that Theorem 3A does not contradict Theorem 3, since if we take k-tuples of sites, then the effective mutation probability increases toward 1 as k increases, so b is not fixed as r grows (i.e., as we put the sites together, the effective rate of mutation increases). We note in passing that a simple corollary of Theorem 3 is the consistency of MP under the type of "infinite-sites" model employed in population genetics.
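To make the k-tuple recoding in Theorem 3A concrete, the sketch below (ours, with made-up sequences) counts, for four taxa, the site patterns supporting each quartet split, which is all MP needs on four sequences, first on single sites and then after recoding adjacent pairs of sites into single super-characters:

```python
# Illustration (hypothetical data) of recoding aligned sites into k-tuples:
# each k-tuple becomes one character over a larger state space, and any
# change between k-tuples costs 1 unit, however many sites differ.

def quartet_split_support(seqs):
    """Count site patterns supporting each split for four taxa a, b, c, d
    (on four sequences, MP reduces to comparing these three counts)."""
    a, b, c, d = seqs
    support = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0}
    for w, x, y, z in zip(a, b, c, d):
        if w == x and y == z and w != y: support["ab|cd"] += 1
        elif w == y and x == z and w != x: support["ac|bd"] += 1
        elif w == z and x == y and w != x: support["ad|bc"] += 1
    return support

def recode(seq, k):
    """Turn a sequence into non-overlapping k-tuple 'super-characters'."""
    return [tuple(seq[i:i + k]) for i in range(0, len(seq) - k + 1, k)]

seqs = ["ACACAC", "ACGTAC", "GTACGT", "GTGTGT"]
single = quartet_split_support(seqs)                      # per single site
paired = quartet_split_support([recode(s, 2) for s in seqs])  # per 2-tuple
print(single, paired)
```

Here both codings happen to favor the same split (ab|cd), but as the text notes, the two codings can in general rank the splits differently for finite data; Theorem 3A concerns only their limiting (consistency) behavior.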
Leaving MP briefly, one can also consider the consequences of a molecular clock for tree reconstruction methods that use uncorrected distances (i.e., the distance between each pair of sequences is taken to be the proportion of sites at which there is a substitution). In this case, under most models, even those that allow an (unknown!) distribution of rates across sites, the uncorrected distances will, in expectation, already be treelike. Thus, there is no need to correct them, and doing so can be problematic, since (1) the correction depends on the (unknown) distribution of rates across sites, and (2) the corrected distances typically have higher variance (and are biased upward) compared with the uncorrected distances. Formally stated (a proof is given in section (c) of the appendix), we have the following result, where by a "standard" site substitution model we mean a model that satisfies two conditions, namely, that it is stationary (unvaried across the tree) and reversible (the process appears the same whether viewed into the past or into the future).

THEOREM 4. For standard site substitution models with a distribution of rates across sites, the expected uncorrected Hamming (observed) distances between pairs of sequences are additive on the underlying tree.

Thus, if a molecular clock applies, then as far as reconstructing the tree is concerned (without regard to branch lengths), it may be preferable to work with uncorrected distances. Once the tree is reconstructed, it is clearly preferable to estimate the branch lengths using the corrected distances (or ML estimation) instead of the uncorrected distances.

### Can MP Outperform MavL?

It is easy to construct examples where MavL will be inconsistent if the model used in the ML analysis differs from the model that generated the sequences.
However, some investigators have noted that MP can perform better than MavL even when the model underlying the ML analysis matches the generating model. To make this idea more precise, by the "performance" of a tree reconstruction method M (on sequence data generated under a tree-indexed Markov model) we again mean the reconstruction probability ρ(M, T, θ) described in What Is ML, and What Does It Maximize? (the probability that the method will correctly return the true tree T). This quantity depends not just on M but also on T and the parameters on the edges of the tree. Now, for each tree T, there exist parameters for which MP has a higher probability of returning the "true tree" T than MavL. Of course, it is trivial to construct a method that can have a higher reconstruction probability than MavL for a given underlying tree: simply ignore the data and always output a fixed (favorite) tree. This "method" performs splendidly if the favored tree is the true tree, but otherwise it performs very badly. So why is the construction we discuss here any less trivial? The crucial difference is that MP has a higher reconstruction probability than MavL not just on one four-species tree, but on any of the underlying trees (provided the other associated parameters are chosen appropriately), and this is something a trivial method like the one described clearly cannot achieve. Again, this should not be overinterpreted: it does not mean that we should be using MP. It may well be that on average (under some prior on trees and their parameters) MavL outperforms MP, even though it does not globally outperform MP in the sense described above. In more detail, consider a fully resolved tree T on four species, say, a, b, c, and d, with the topology ab|cd, and the simple symmetric two-state model with mutation probability p(e) = ε on the two edges incident with leaves a and b, while p(e) > 0.5 − ε on the other three edges, where ε is small but positive.
Thus, three edges involve long interspeciation times (and/or high mutation rates) and so are near site saturation, while two sister taxa are recently separated (and/or have low mutation rates on their incident edges). Note that such a situation is entirely possible under a molecular clock. Suppose we evolve k sites independently on this tree. Let P1(k) be the probability that MP recovers the true tree T, and let P2(k) be the probability that MavL recovers T from the k sites.

THEOREM 5. As ε converges to 0 (with the number of sites k fixed), P1(k) can exceed P2(k); in particular, the probability that MP correctly reconstructs T can be higher than the corresponding probability for MavL for any fixed sequence length k ≥ 4.

A similar result was stated without proof in Székely and Steel (1999); we outline a proof in section (d) of the appendix. Note that for ε very small (but positive), MP will recover T with probability 0.99 with just 16 sites, yet MavL could potentially take 10^10 sites to achieve the same probability of correctly reconstructing T (in which case, for realistic-length sequences, other effects, e.g., deviations from the model, might have more effect on the reconstructed tree than the sequence data). This is, of course, an extreme situation; nevertheless, it shows that there are situations in which we would expect MavL to require much longer sequences than MP to recover the true tree. Note that we actually only require p(e) > 0.5 − ε on two of the three edges, but we have opted to allow three edges to be near site saturation, since then the example can arise under a molecular clock. In contrast, the Felsenstein Zone cannot arise under a molecular clock; yet, to be fair, if we want to impose a molecular clock, we should implement ML with a molecular clock, and if we did, ML would no longer behave as described above.
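The quartet scenario above can be explored numerically. The rough simulation below is ours, not from the paper: it evolves sites under the symmetric two-state model on the tree ab|cd, with a small mutation probability eps on the edges to a and b and probabilities just below 0.5 on the other three edges, and estimates how often MP (which on four taxa just compares the counts of the three informative patterns) recovers the true split from 16 sites. All numerical settings are illustrative.

```python
# Rough simulation of the Theorem 5 scenario: two-state symmetric model on
# the quartet ab|cd, eps on the a- and b-edges, near-0.5 on the rest.
import random

def simulate_site(rng, eps, big):
    flip = lambda s, p: s ^ (rng.random() < p)   # two-state substitution
    u = rng.randrange(2)                          # state at internal node u
    a, b = flip(u, eps), flip(u, eps)             # short edges to a and b
    v = flip(u, big)                              # near-saturated internal edge
    c, d = flip(v, big), flip(v, big)             # near-saturated edges to c, d
    return a, b, c, d

def mp_recovers(rng, k, eps=0.01, big=0.49):
    """True if MP on k sites strictly prefers the true split ab|cd."""
    counts = {"ab|cd": 0, "ac|bd": 0, "ad|bc": 0}
    for _ in range(k):
        a, b, c, d = simulate_site(rng, eps, big)
        if a == b and c == d and a != c: counts["ab|cd"] += 1
        elif a == c and b == d and a != b: counts["ac|bd"] += 1
        elif a == d and b == c and a != b: counts["ad|bc"] += 1
    return counts["ab|cd"] > max(counts["ac|bd"], counts["ad|bc"])

rng = random.Random(1)
trials = 2000
hits = sum(mp_recovers(rng, k=16) for _ in range(trials))
print(hits / trials)   # high, consistent with the text's 0.99 figure
```

The intuition the simulation captures is the one given in the text: with eps tiny, a and b almost always agree, c and d are essentially coin flips, so almost every informative site supports ab|cd, and competing patterns require the rare event a ≠ b.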
Also, this example does not demonstrate any inconsistency of MavL, since if the edge mutation probabilities are fixed (and strictly between 0 and 0.5), then MavL will eventually recover the true tree with probability converging to certainty as k tends to infinity. This example can also be modified to demonstrate that MavL can differ from maximum integrated likelihood, even when all trees have equal prior probabilities (provided the prior distribution on the edge lengths is sufficiently contrived). Specifically, suppose that each of the three binary trees on sequences a, b, c, and d has equal probability and that the prior distribution on the edge lengths allows all possible values for the mutation probabilities, but with probability 1 − δ we have p(e) ≤ ε on the two edges incident with two sister leaves and p(e) > 0.5 − ε on the other three edges. Then it can be shown that for ε and δ sufficiently small (but positive), MIL can select a different tree than MavL on certain data.

### The Limits to Models: Recent Developments and Future Directions

As models become increasingly sophisticated and parameter-rich, one risks losing the ability to discriminate between different underlying trees. Even if one knows the distribution of rates across sites, nonstationarity can also lead to a similar nonidentifiability phenomenon, at least for pairwise comparisons, as Baake (1998) has shown. Baake's example was particularly simple: exactly half the sites are invariable, while the other half evolve according to the same Markov process. It is an open question whether this nonidentifiability of the tree persists if one simultaneously uses all of the sequence information. There are other related problems where reducing data to pairwise information destroys information about the underlying structural parameters.
For example, Chang (1996a) showed that for a nonstationary model (and without rates across sites), triplewise comparisons of sequences generally suffice to determine all of the edge parameters (i.e., relative rates of substitution between the different nucleotides), but pairwise comparisons generally do not. Independently, Lake (1997) also described a triplewise technique for reconstructing these edge parameters. The question of phylogeny reconstruction can also be viewed from an information-theoretic perspective. One such approach (based on the concept of Fisher information) has been presented by Goldman (1998) and developed as a tool for experimental design. In phylogeny reconstruction, it is helpful to regard each site as containing some information concerning the underlying tree and to note that this signal depends on the other underlying structural nuisance parameters, such as the edge lengths. For example, very many sites will be required in order to reliably recover a very short internal edge, and a very long external edge (i.e., one leading to a distant outgroup sequence) will likewise need very many sites in order to be correctly placed in the tree. A fundamental question, in terms of these edge lengths and the number of sequences, is, how many sites are required to accurately reconstruct the underlying tree? Recently, this question has been shown to have a rather surprising answer: for simple models, if the underlying structural parameters are sufficiently constrained, then the sequence length required to reconstruct the true tree can grow even more slowly than the number of sequences, even though the number of possible trees grows exponentially with the number of sequences (for details, see Erdős et al. 1999). These theoretical results are relevant to recent simulation studies (and the surrounding controversy) suggesting that trees on large numbers of sequences can sometimes be reconstructed from surprisingly short sequences.