Evolutionary meandering of intermolecular interactions along the drift barrier
Venue: | Proceedings of the National Academy of Sciences of the United States of America. 2015. 112:E30–E38. doi: 10.1073/pnas.1421641112 PMID: 25535374 |
Citations: | 1 - 0 self |
BibTeX
@INPROCEEDINGS{Lynch_evolutionarymeandering,
author = {Michael Lynch and Kyle Hagner},
title = {Evolutionary meandering of intermolecular interactions along the drift barrier},
booktitle = {Proceedings of the National Academy of Sciences of the United States of America. 2015. 112:E30–E38. doi: 10.1073/pnas.1421641112 PMID: 25535374},
year = {2015}
}
Abstract
Many cellular functions depend on highly specific intermolecular interactions, for example transcription factors and their DNA binding sites, microRNAs and their RNA binding sites, the interfaces between heterodimeric protein molecules, the stems in RNA molecules, and kinases and their response regulators in signal-transduction systems. Despite the need for complementarity between interacting partners, such pairwise systems seem to be capable of high levels of evolutionary divergence, even when subject to strong selection. Such behavior is a consequence of the diminishing advantages of increasing binding affinity between partners, the multiplicity of evolutionary pathways between selectively equivalent alternatives, and the stochastic nature of evolutionary processes. Because mutation pressure toward reduced affinity conflicts with selective pressure for greater interaction, situations can arise in which the expected distribution of the degree of matching between interacting partners is bimodal, even in the face of constant selection. Although biomolecules with larger numbers of interacting partners are subject to increased levels of evolutionary conservation, their more numerous partners need not converge on a single sequence motif or be increasingly constrained in more complex systems. These results suggest that most phylogenetic differences in the sequences of binding interfaces are not the result of adaptive fine tuning but a simple consequence of random genetic drift.
Keywords: cellular evolution | molecular interaction | transcription | random genetic drift | coevolution
Much of biology relies on the specificity of intermolecular interactions: the regulation of gene expression, the transmission of information via signal transduction, the assembly of monomeric subunits into multimers, vesicle sorting in eukaryotic cells, toxin-antitoxin systems in microbes, mating-type recognition, and many other cellular features.
Although the basic structural features of such fitness-related traits would seem to be under strong purifying selection, molecular specificity often seems to be highly flexible in evolutionary time. For example, transcription-factor binding-site motifs often vary dramatically among orthologous genes in different species and even among similarly regulated genes within the same species (1-3). The binding interfaces of multimeric proteins can vary substantially among species, sometimes with no overlap at all (4, 5). The key amino acid sequences involved in intermolecular cross-talk in signal-transduction systems can evolve at high rates (6, 7), and growing evidence suggests that the locations of sites involved in posttranslational modification in individual proteins are under much weaker selective constraints than their absolute numbers (8). Although it is often argued that subtle differences in the motifs involved in intermolecular interactions are molded by the demands of natural selection, seldom has any direct evidence ever been provided in support of such arguments. The theory provided below makes the case that substantial divergence of the motifs involved in molecular cross-talk is expected even in the face of strong selection for high specificity. Such behavior is a natural consequence of several factors: the degrees of freedom typically associated with the biophysical aspects of molecular binding, the opposing pressures of mutation and selection, the evolutionary noise produced by coevolutionary interactions, and the limits to the efficiency of natural selection. In the following pages, we show how the joint application of theory from biophysics and population genetics helps move our understanding of evolutionary aspects of cell biology beyond the too-common view that all aspects of biodiversity are molded by the unbridled power of natural selection. 
Significance
Many cellular functions depend on highly specific intermolecular interactions, with mutational changes in each component of the interaction imposing coevolutionary pressure on the remaining members (e.g., a transcription factor and its DNA binding sites). The conflict between mutation pressure toward reduced affinity and selective pressure for greater interaction results in an evolutionary equilibrium distribution for the affinity between interacting partners. Nevertheless, conditional on the maintenance of a critical level of molecular recognition, the sites containing the key residues of binding interfaces are free to evolve. The theory developed suggests that most such evolution is a simple consequence of random genetic drift and not an outcome of adaptive fine tuning.
Results
Although intermolecular interactions underlie a wide variety of issues in cell biology, the following theory will be phrased in the context of the evolution of a transcription-factor binding site (TFBS) and its cognate transcription factor (TF). The general formulations can be equally applied to the complementary interfaces involving an untranslated region of an mRNA and a microRNA, the monomeric subunits of a dimeric molecule, a cargo protein and its molecular motor, an intron-exon junction and the spliceosome, a sensor kinase and a response regulator protein, and so on. Gene expression, which in turn influences individual fitness, will be assumed to be a function of binding-site affinity. The DNA-binding domain of any TF defines a specific TFBS motif that maximizes the binding strength. However, owing to the recurrent introduction of mutations, variation will inevitably arise among the TFBS sequences of different genes serviced by the same TF within a species as well as among orthologous genes in different species. Selection will prevent extreme TFBS degeneration, but there are diminishing returns on increasing the binding-site strength beyond the point at which the associated gene is in a near-optimal state of transcriptional activation. Thus, levels of TF-TFBS matching can be expected to wander along the boundaries dictated by the prevailing features of mutation, selection, and random genetic drift. Because there are typically numerous ways in which selectively equivalent binding affinities can be achieved, this means that substantial variation in binding-site signatures may arise even in the face of strong selection.
One Evolvable Component.
As a simple entrée into the problem, consider a single TFBS interacting with a TF whose binding domain is unable to evolve, at least on a time scale comparable to that which allows TFBS flexibility. It will be shown below that this situation is closely approximated for TFs servicing multiple target genes. Such a scenario also applies to a protein binding domain for a nonevolvable substrate such as an inorganic ion or an intermediate metabolite. Several features of this model can be evaluated analytically, the initial point being the probability distribution of the possible binding motifs under drift-mutation-selection equilibrium.
We start with two simple assumptions: (i) that all TFBSs with the same number of matches ($m$) to the TF are equivalent with respect to binding affinity, regardless of the position of the mismatches, and (ii) that each of the four nucleotides (A, C, G, and T) mutates to each of the other three states at the same rate $\mu$. With a TF recognition motif $\ell$ nucleotides in length, there are then $\ell + 1$ TFBS matching classes to consider, each consisting of multiple subclasses with equal expected probabilities under selection-mutation equilibrium. For example, matching class $m = \ell - 1$ has a multiplicity of $3\ell$ TFBS types, because the single mismatch can reside in sites 1 to $\ell$ and involve any of the three possible mismatching nucleotides.
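The counting argument for matching classes can be checked with a few lines of Python. This is a minimal sketch, not part of the original analysis: the function names `class_multiplicity` and `all_multiplicities` are hypothetical, and the only assumption is the one stated in the text, namely that a class with m matches contains every placement of the ℓ − m mismatches combined with any of three mismatching nucleotides at each of them.

```python
from math import comb

def class_multiplicity(ell: int, m: int) -> int:
    """Number of distinct TFBS sequences of length ell with exactly m matches.

    comb(ell, m) counts the positions of the matches; each of the
    ell - m mismatched sites can hold any of 3 non-matching nucleotides.
    """
    if not 0 <= m <= ell:
        raise ValueError("m must lie between 0 and ell")
    return comb(ell, m) * 3 ** (ell - m)

def all_multiplicities(ell: int) -> list[int]:
    """Multiplicities of the ell + 1 matching classes, indexed by m."""
    return [class_multiplicity(ell, m) for m in range(ell + 1)]

if __name__ == "__main__":
    ell = 8  # illustrative motif length
    counts = all_multiplicities(ell)
    print(counts)
    # The classes partition all 4**ell possible sequences.
    print(sum(counts) == 4 ** ell)
```

For a motif of length 8, the class with a single mismatch contains 3 × 8 = 24 sequences, in line with the 3ℓ multiplicity quoted in the text, and the perfectly matching class contains exactly one.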
The general approach used here assumes a population that usually resides in a nearly monomorphic TFBS state, with the average interval between stochastic state transitions being long enough that single mutational changes are fixed in a sequential manner. Letting $P(m, t)$ denote the probability that a TFBS resides in matching class $m$ at time $t$, where $\ell - m$ denotes the number of mismatches, the time-dependent behavior of the system is described by
$$\frac{dP(m,t)}{dt} = N\mu \left\{ 3(m+1)\,\phi_{m+1,m}\,P(m+1,t) - \left[ (\ell-m)\,\phi_{m,m+1} + 3m\,\phi_{m,m-1} \right] P(m,t) + (\ell-m+1)\,\phi_{m-1,m}\,P(m-1,t) \right\}, \tag{1}$$
with the terms involving class $m + 1$ being dropped when $m = \ell$ and those involving class $m - 1$ being dropped when $m = 0$. Here, we assume a haploid population of $N$ individuals (for a diploid population, $2N$ should be substituted for $N$ throughout).
This dynamical equation consists of three terms, the first denoting the influx of probability from the next higher (more beneficial) class, with each of the $(m + 1)$ matching sites mutating to nonmatching states at rate $3\mu$ in each gene copy (the 3 accounting for mutation to three alternative nucleotide types), and becoming fixed in the population with probability $\phi_{m+1,m}$. The terms involving $P(m, t)$ account for the efflux from class $m$ to the next higher and lower classes ($m + 1$ and $m - 1$), again accounting for the number of possible mutations that cause such movement and their probabilities of fixation. The final term describes the influx from the next lower class, which has $\ell - m + 1$ mismatches, each back-mutating to a matching state at rate $\mu$.
The fixation probabilities are provided by Kimura's (9) diffusion equation for newly arisen mutations,
$$\phi_{x,y} = \frac{1 - e^{-2 N_e s_{x,y}/N}}{1 - e^{-2 N_e s_{x,y}}}, \tag{2}$$
where $N_e$ is the effective population size, $1/N$ is the initial frequency of a mutation (for a haploid population), and $s_{x,y}$ is the fractional selective advantage of allelic class $y$ over $x$. Because there are nonzero transition probabilities between all adjacent classes, Eq. 1 converges on a global equilibrium probability distribution after a sufficiently long period, regardless of the starting conditions.
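The convergence claim can be illustrated numerically. The sketch below is an illustration under stated assumptions rather than the authors' implementation: it uses the standard haploid form of Kimura's fixation probability for a new mutation, writes out the influx and efflux terms of the dynamical equation as described in the text, and relaxes the system with a plain forward-Euler scheme (a numerical convenience chosen here, not taken from the paper). The function names, the per-class selective advantages `s`, and all parameter values are hypothetical.

```python
import math

def fixation_probability(s_xy, n_e, n):
    """Kimura fixation probability of a new mutation (initial frequency 1/n,
    haploid) with selective advantage s_xy and effective size n_e."""
    x = 2.0 * n_e * s_xy
    if abs(x) < 1e-12:
        return 1.0 / n            # neutral limit
    if x < -700.0:
        return 0.0                # strongly deleterious; avoids float overflow
    return math.expm1(-x / n) / math.expm1(-x)

def master_equation_rhs(p, s, mu, n_e, n):
    """dP(m)/dt for every matching class m = 0..ell (sequential-fixation model)."""
    ell = len(p) - 1

    def phi(x, y):
        return fixation_probability(s[y] - s[x], n_e, n)

    dp = []
    for m in range(ell + 1):
        total = 0.0
        if m < ell:
            total += 3 * (m + 1) * phi(m + 1, m) * p[m + 1]    # influx from class m+1
            total -= (ell - m) * phi(m, m + 1) * p[m]          # efflux to class m+1
        if m > 0:
            total -= 3 * m * phi(m, m - 1) * p[m]              # efflux to class m-1
            total += (ell - m + 1) * phi(m - 1, m) * p[m - 1]  # influx from class m-1
        dp.append(n * mu * total)
    return dp

def relax_to_equilibrium(p0, s, mu, n_e, n, dt, steps):
    """Forward-Euler integration; dt must be small relative to 1/(largest rate)."""
    p = list(p0)
    for _ in range(steps):
        dp = master_equation_rhs(p, s, mu, n_e, n)
        p = [pi + dt * di for pi, di in zip(p, dp)]
    return p

if __name__ == "__main__":
    ell, n, n_e, mu = 4, 1000, 1000, 1e-8
    s = [0.001 * m for m in range(ell + 1)]  # hypothetical advantage of each class
    start = [1.0] + [0.0] * ell              # begin with no matching sites
    p = relax_to_equilibrium(start, s, mu, n_e, n, dt=1e6, steps=20000)
    print([round(v, 4) for v in p])
```

Starting the same run from the fully matching class instead of the fully mismatched one should give the same final distribution, which is the independence from starting conditions noted in the text.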
At this point, the total fluxes into and out of each class are equal, a condition known as detailed balance. One simple approach to obtaining the equilibrium for a linear array of TFBS states is to view the system as a flow diagram with connecting arrows denoting the flux rates between adjacent classes, which yields
$$\tilde{P}(m) = C \left[ \binom{\ell}{m} 3^{\ell-m} \right] e^{2 N_e s(m)}, \tag{3}$$
where $s(m)$ is the selective advantage of matching class $m$ and $C$ is a normalization constant [equal to the reciprocal of the sum of the terms to the right of $C$ for all $m$, often referred to as the partition function (10, 11)]. The exponential term on the right is the ratio of fixation probabilities in the upward to downward directions, which is derived from Eq. 2 in SI Text. Given a constant set of population-genetic parameters, $\tilde{P}(m)$ can be interpreted in two ways: (i) for a single TFBS, it represents the long-term proportion of evolutionary time spent in the various matching states and (ii) for a set of different TFBSs under the same selective constraints, it represents the expected distribution of states at any point in time.
There are several notable features of this solution. First, because each transition rate in the linear chain is proportional to $\mu$, the equilibrium probabilities are completely independent of the mutation rate. This would be true even with a different mutation spectrum, although the factor of 3 (here the ratio of deleterious to advantageous mutations) would change. Second, the expression within the square brackets defines the expected distribution of matching states in the absence of selection, that is, the neutral expectation $\tilde{P}_n(m)$. This term is equivalent to the number of unique ways in which a sequence of length $\ell$ can harbor $m$ sites matching the optimal binding motif, accounting for both the $\binom{\ell}{m}$ distinct spatial configurations of matches and the fact that there are three alternative inappropriate nucleotides for each mismatch. Eq. 3 shows that the effect of selection on the equilibrium probability distribution is equivalent to a simple transformation of the neutral expectation, with each state probability being weighted by an exponential function of the product of its relative selective advantage and the effective population size ($N_e$). Because $1/N_e$ is a measure of the power of random genetic drift, the weighting terms are equivalent to the ratio of the power of selection to that of drift.
Figure (flow diagram of transitions between adjacent matching classes): The transition rates are given on the arrows for the case of neutrality, where the rate of fixation is equal to the mutation rate per site, $3\mu$ in the case of single-site losses (arrows to the right) because each appropriate nucleotide can mutate to three others, and $\mu$ in the case of site improvement (arrows to the left) because each mismatch can only mutate to the appropriate state in one way. With selection in operation, each coefficient needs to be multiplied by the number of individuals ($N$) and the associated probability of fixation.
Lynch and Hagner | PNAS | Published online December 22, 2014
What remains is the definition of the fitness function. The universality of the basic mode of transcription, the interaction of a specific protein (the TF) with a specific DNA binding site (the TFBS), provides a mechanistic basis for addressing this issue in biophysical terms (12). The most common approach is to consider individual fitness to be a linear function of $p_{on}(m)$, the fraction of time that a TFBS with $m$ matching sites is expected to be bound by its cognate TF (a minimum requirement for expression of the associated gene),
$$W(m) = 1 + s(m) = 1 + \alpha\, p_{on}(m), \tag{4}$$
with $\alpha$ being a scaling factor relating binding probability to fitness. Using results from statistical-mechanics theory, an expression for $p_{on}(m)$ can be obtained in terms of the binding energy of a motif ($\beta$) and background interference ($B$) within the cell,
$$p_{on}(m) = \frac{1}{1 + B e^{-\beta m}} \tag{5}$$
(SI Text).
The exponential term $\beta m$ is a measure of the total strength of binding (in Boltzmann units of 0.6 kcal/mol), under the assumption that binding strength scales linearly with the number of matching sites ($m$) between a TFBS and the optimal binding motif of its TF. Multiple empirical studies involving single-base changes in TFBSs suggest an average energetic cost of a mismatch of $\beta \approx 2.0$.
Table note: Motif size is based on consensus sequences. The estimated costs of mismatches are obtained from binding-strength experiments in which single-base changes were made in motifs. Costs of single-base mismatches are in units of kilocalories per mole; these average to 1.4 across the full set of studies, or in terms of Boltzmann units ($K_B T \approx 0.6$ kcal/mol) to 2.3.
The solution of Eqs. 3-5 illustrates several general principles.
Second, regardless of the set of parameter values, substantial variation in $m$ is almost always expected, even in the face of constant directional selection. Unless the motif size is on the low end of the range typically seen ($\ell = 8$) and levels of background interference and selection pressures are very high, the vast majority of binding sites are expected to contain mismatches with respect to the optimum motif. With larger population sizes the distribution is pushed toward larger $m$, but with an optimum motif size of 16 bp essentially no TFBS is expected to be perfect. This behavior can be understood by considering the selection coefficients, $\alpha p_{on}(m)$, associated with adjacent matching classes. Because $p_{on}(m)$ is sigmoid, the fitness function approaches an asymptotic slope of zero at high values of $m$, owing to the near-certain level of binding. The point at which the selective difference between adjacent classes becomes smaller than the power of drift, $1/N_e$, represents the barrier beyond which selection is incapable of influencing the rate of allelic substitution.
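To make the drift-barrier argument concrete, the following sketch evaluates the equilibrium distribution as the neutral class multiplicities weighted by exp(2 N_e s(m)), with s(m) = α p_on(m). Several pieces are assumptions made for illustration: the logistic form p_on(m) = 1/(1 + B e^(−βm)) is a common two-state binding expression consistent with the sigmoid behavior described above, the value B = 10^6 and the other parameters are arbitrary, and the function names (including `drift_barrier`, which locates the first class on the upper shoulder of the sigmoid where one more match gains less than 1/N_e) are hypothetical.

```python
import math

def p_on(m, beta, b):
    """Assumed logistic binding probability for a site with m matches."""
    return 1.0 / (1.0 + b * math.exp(-beta * m))

def equilibrium_distribution(ell, alpha, n_e, beta=2.0, b=1e6):
    """Neutral multiplicities weighted by exp(2 * n_e * alpha * p_on(m)),
    normalized in log space for numerical stability."""
    log_w = []
    for m in range(ell + 1):
        log_mult = math.log(math.comb(ell, m)) + (ell - m) * math.log(3.0)
        log_w.append(log_mult + 2.0 * n_e * alpha * p_on(m, beta, b))
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

def drift_barrier(ell, alpha, n_e, beta=2.0, b=1e6):
    """First class m, at or above the steepest step of the fitness function,
    where the gain from one more match drops below 1/n_e.
    Returns None if selection is weaker than drift everywhere."""
    gains = [alpha * (p_on(m + 1, beta, b) - p_on(m, beta, b)) for m in range(ell)]
    steepest = max(range(ell), key=gains.__getitem__)
    if gains[steepest] < 1.0 / n_e:
        return None
    for m in range(steepest, ell):
        if gains[m] < 1.0 / n_e:
            return m
    return ell

if __name__ == "__main__":
    ell, alpha, n_e = 12, 1e-3, 1e6   # illustrative values only
    dist = equilibrium_distribution(ell, alpha, n_e)
    mean_m = sum(m * p for m, p in enumerate(dist))
    print("mean matches:", round(mean_m, 2))
    print("drift barrier at m =", drift_barrier(ell, alpha, n_e))
```

With these illustrative values the last step toward a perfect match yields a fitness gain below 1/N_e, so most of the probability mass sits on imperfect motifs even though selection is strong at intermediate m.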
Third, with relatively weak selection pressure ($N_e \alpha < 1$), $\tilde{P}(m)$ is very heavily skewed toward small numbers of matches (converging on the neutral expectation). This intrinsic weighting toward low numbers of matches is a result of the biased mutation pressure toward mismatches and the increasing multiplicity of configurations leading to the same $m$ with increasing numbers of mismatches. Fourth, because the neutral distribution is strongly weighted toward low $m$, there can be a strong "phase transition" in the form of the probability distribution as $N_e \alpha$ crosses the threshold of $\approx 1.0$.
The preceding results can be used to evaluate the validity of the popular notion that TFBSs can be detected by searching for relatively conserved intergenic patches of orthologous nucleotide sequence in genome comparisons. For genes in an early stage of divergence, that is, with on the order of single-nucleotide substitutions per motif, as would be the case for closely related species, the rate of evolution relative to the neutral expectation, $\omega$, is defined by Eq. 6 (derived in SI Text). Effective neutrality results in $\omega \to 1$, whereas strong purifying selection results in $\omega \to 0$. Analysis of Eq. 6 demonstrates that, despite their centrality to gene expression, TFBS sequences are expected to be frequently under only weak purifying selection on short time scales, unless the motif size is very small and $N_e$ is very large. The asymptotic level of sequence similarity is given by Eq. 7a, which involves the expected number of identical sites for a pair of random motifs with $x$ and $y$ sites matching the optimum TF motif. Solution of Eq. 7a reveals a wide range of parameter space for which the asymptotic level of sequence similarity is <50%, even in the face of fairly efficient selection, especially when the recognition motif of the TF exceeds 10 bp or so in length.
A Coevolving Two-Component System.
The preceding model can be readily extended to the situation in which both the TFBS and the optimal binding motif of the TF are capable of evolving.
Although such a scenario opens up the possibility that a suboptimal TFBS can be restored to more favorable binding through a compensatory mutation in the TF, it also provides additional and typically more numerous routes to suboptimal binding through the accumulation of deleterious TF modifications. The net effect of such coevolution will be the random wandering of the joint system over the entire domain of possible sequence space, constrained only by the joint maintenance of a level of cooperativity defined by the drift barrier. This type of scenario should apply to a number of other specialized pairwise intracellular interactions, such as bacterial two-component signaling systems and toxin-antitoxin systems. To accommodate the evolution of trans effects in the TF, we retain a focus on the number of matches between a TFBS and its cognate TF, while making two modifications to the flow diagram.
The Consequence of Multiple Downstream Components.
A common view, especially among developmental biologists, is that key evolutionary changes associated with gene regulation are much more likely to involve modifications at the level of cis-regulatory (TFBS) sites than at the trans (TF) level. The implicit assumption underlying this idea is that the evolution of transcription factors becomes increasingly constrained with increasing numbers of target genes. To explore the validity of this idea, we now move beyond the one-to-one situation outlined in the preceding section to allow for the coevolution of the optimal recognition motif of a TF with the TFBSs of multiple genes. To keep things reasonably tractable in this initial exploration, we treat the TF recognition motif in a manner parallel to that of the TFBSs, that is, as being a sequence of fixed length $\ell$ with four possible states at each position, with $\delta = 3$, $\nu = \mu$, and $\Delta = 3$, so that the relevant sites in the TF motif are subject to the same mutation pressure as those in the TFBSs.
Under this computational setting, the number of matches ($m$) between the TF and any TFBS is defined by the states of the corresponding positions. This particular approach would be especially appropriate for the analysis of the coevolution of microRNAs and their binding sites. By adhering to the sequential model, the evolution of the entire system can be followed in a stepwise manner over time, with the transition probabilities to alternative states being defined by the products of the mutation rates and fixation probabilities to alternative adjacent states. The approach is identical to that used above except for the more intricate details: the precise sequence of the TF and each TFBS must be monitored. We assumed that total fitness is determined by the product of the locus-specific fitnesses of the TFBSs, each defined in the manner outlined above using Eq. 4. Three general results emerge from this analysis.
Stabilizing Selection on the Rate of Expression.
In the preceding analyses, it was assumed that selection favors maximum binding strength. Such a fitness function is justified by analyses of TFBSs in E. coli and S. cerevisiae that consistently infer a monotonic increase of fitness with increasing binding strength. With stabilizing selection for an intermediate optimum binding probability $\theta$, fitness can instead be described by a Gaussian function,
$$W(p_{on}) = \exp\left[ -\left( \frac{p_{on} - \theta}{\sigma} \right)^{2} \right]. \tag{8}$$
The gradient of this fitness function around the optimum is defined by $\sigma$, with higher $\sigma$ causing a flatter (and hence more neutral) fitness scenario. Denoting the term within parentheses as $d$, $W(p_{on}) \approx 1 - d^2$ for $d \ll 1$, so for example a 0.01 deviation of $p_{on}$ from $\theta$ in units of $\sigma$ leads to a 0.0001 reduction in fitness relative to the optimal value of 1.0. Recalling Eq. 3, this implies that the equilibrium distribution of $m$ under this model is primarily determined by the composite parameter $N_e/\sigma^2$, which is a measure of the ratio of the power of stabilizing selection to random genetic drift.
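The role of the composite parameter N_e/σ² can be explored with a short calculation. This is a sketch under explicit assumptions, not the authors' code: fitness is taken to be Gaussian in the deviation d = (p_on − θ)/σ, which matches the approximation W ≈ 1 − d² quoted above; p_on(m) is given an assumed logistic form 1/(1 + B e^(−βm)); the selective advantage entering the equilibrium weights is s(m) = W(m) − 1; and the function names and parameter values are hypothetical.

```python
import math

def p_on(m, beta, b):
    """Assumed logistic binding probability for a site with m matches."""
    return 1.0 / (1.0 + b * math.exp(-beta * m))

def stabilizing_fitness(p, theta, sigma):
    """Assumed Gaussian fitness; approximately 1 - d**2 for small d."""
    d = (p - theta) / sigma
    return math.exp(-d * d)

def equilibrium_under_stabilizing_selection(ell, n_e, theta, sigma,
                                            beta=2.0, b=1e6):
    """Neutral multiplicities weighted by exp(2 * n_e * s(m)),
    with s(m) = W(m) - 1, normalized in log space."""
    log_w = []
    for m in range(ell + 1):
        s_m = stabilizing_fitness(p_on(m, beta, b), theta, sigma) - 1.0
        log_mult = math.log(math.comb(ell, m)) + (ell - m) * math.log(3.0)
        log_w.append(log_mult + 2.0 * n_e * s_m)
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    z = sum(w)
    return [v / z for v in w]

if __name__ == "__main__":
    ell, theta = 12, 0.5                      # illustrative values only
    for n_e, sigma in [(1.0, 1e3), (1e6, 0.1)]:
        dist = equilibrium_under_stabilizing_selection(ell, n_e, theta, sigma)
        mode = max(range(ell + 1), key=dist.__getitem__)
        print(f"N_e/sigma^2 = {n_e / sigma**2:.3g}: modal m = {mode}")
```

Because m is discrete, the class whose binding probability lies closest to θ dominates when N_e/σ² is large, whereas for small N_e/σ² the weights are nearly equal and the distribution falls back to the neutral multiplicities.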
Because of the nonlinear relationship between $p_{on}$ and $m$ and the discrete nature of $m$, this model exhibits several unique features. All of the methods used above can be used to explore the consequences of stabilizing selection (or any other kind of fitness function). For example, the equilibrium distribution of matching sites can be obtained by applying Eq. 8 directly to Eq. 3. Under this model, the scaled selection parameter $N_e/\sigma^2 = 1$ defines an approximate cutoff below which the evolutionary distribution is close to the neutral expectation regardless of the optimum level of expression.
Discussion
Although the preceding analyses have been couched in terms of transcription factors and their binding sites, they should apply to a diversity of other coevolutionary issues involving pairwise molecular interactions. These include microRNAs and their binding sites, bacterial two-component signaling systems and toxin-antitoxin systems, the interfaces within heterodimeric molecules, proteins involved in vesicle sorting, and so on. Thus, the preceding results have potentially general implications for evolutionary cell biology.
First, owing to both the diminishing-returns aspect of intermolecular binding with increasing numbers of participating residues and the multiplicity of effectively equivalent alternative motifs, substantial variation is typically expected for the sequences underlying pairwise interactions, even in the face of strong selection for conserved function. Exceedingly high intensities of selection are generally required for the maintenance of sequence identity across the full length of a motif, and the degrees of freedom associated with the locations and nucleotide identities of mismatching sites provide numerous paths along which motifs can wander neutrally in sequence space.
Thus, although it has been argued that selection favors short TFBSs as a means for minimizing mutational breakdown (14), less than maximum TF/TFBS matching lengths arise naturally as a consequence of mutation-selection balance, without any direct selection for mutational robustness.
Second, with intermediate levels of selection, bimodal distributions of binding strengths can emerge among motifs exposed to identical selection pressures, raising questions about the interpretation of species differences in TFBS affinities as indicators of lineage-specific differences in selection pressures. Such behavior is a simple consequence of the conflict between mutational bias toward low affinity and selection bias toward high affinity. Because the motifs within the different peaks of such distributions will differ in both length and sequence, this result may help explain the widespread use of secondary TFBS motifs by TFs in mammals and land plants.
Third, unless effective population sizes are enormous and selection is exceedingly strong, 50% or more of the mutations arising in binding motifs are typically free to move toward fixation. As a consequence of this drift process, despite being critical to fitness, orthologous motifs in closely related species can often evolve at rates of at least half the mutation rate, and those in distantly related lineages will frequently have asymptotic levels of divergence of up to 50%. For motifs that are only 10 bp or so in length, this level of divergence can be exceedingly difficult to discriminate from the neutral expectation. Consistent with this view, many studies have documented substantial within-species variation in the sequences of orthologous regulatory elements (36-39). In addition, studies in diverse lineages, sometimes involving closely related species, have routinely shown near-complete scrambling of regulatory-region sequence, often with no apparent functional consequences (40-48).
Fourth, if both members of an interacting pair are free to evolve, even greater variation and divergence in binding-site affinity is expected. This is because selection operates only on the overall degree of matching between participants, with the precise motifs involved in the interaction being largely irrelevant (so long as spurious cross-talk with noncognate systems is avoided). Provided the degree of overall affinity remains within certain bounds, the degrees of freedom beyond the drift barrier present ample opportunities for unbounded molecular wandering of the individual motifs. Such within-species drift will passively give rise to incompatibilities among isolated lineages as the reciprocal partners in heterospecific combinations no longer recognize each other (49, 50).
Finally, the pattern of evolution of participating partners is expected to vary with the number of transactions involved. With a highly specific (one-to-one) association, long-term rates of evolution must be identical for both members of the pair. Otherwise, fitness would necessarily decline as the coevolutionary loop is broken. In contrast, under a one-to-many scenario, where one member of the pair interacts with multiple representatives of the other, there can be substantial asymmetry in the rates of evolution of the two components. The member of the pair servicing multiple partners (i.e., having more pleiotropic effects) experiences the strongest selective constraint, with the overall rate of evolutionary divergence declining with increasing numbers of partners. In contrast, the more numerous partners evolve in an essentially independent fashion, with distribution and rate features identical to what would be expected in a one-to-one system.
In effect, with multiple interacting partners the master controlling element becomes increasingly constrained to accepting only the reduced subset of mutations that is either effectively neutral for all partners or the even smaller subset with a net overall positive impact. Consistent with this prediction, within the γ-Proteobacteria TFs with larger numbers of target genes are more evolutionarily conserved at the amino acid sequence level, including in the TFBS recognition sequence (51). In addition, the decline in TF binding-site specificity with increasing numbers of genes serviced in both E. coli and yeast (14, 52) seems to be consistent with the evolution of more generalized recognition systems in TFs with greater pleiotropic effects.
Collectively, the preceding theory indicates that, without direct empirical evidence, there is little justification for interpreting motif variation (either among multiple genes within genomes or among orthologous sequences across species) as evidence for the adaptive fine-tuning of individual loci. Rather, such variation is expected to be a natural consequence of the degrees of freedom associated with binding interfaces, the diminishing advantages of increased binding affinity, and the limits to the power of natural selection. Thus, the common inability to locate orthologous TFBSs in comparative studies is most likely not an artifact of inadequate computational tools but an inherent consequence of the evolutionary features of such sequences.
It has recently been argued that some genomic features may exhibit evolutionary behaviors that are nearly independent of population size, particularly when such features are linearly related to a phenotypic trait under stabilizing selection (53). However, a central conclusion from the preceding theory is that the evolutionary behavior of binding interfaces is strongly influenced by $N_e$, even when the level of gene expression is under stabilizing selection.
Consistent with the theory, comparison of TF systems in Drosophila and mammals (cases of relatively high vs. low $N_e$) suggests higher rates of evolution in the latter (3). Such work will need to be repeated over many other phylogenetic lineages before definitive conclusions can be drawn. However, as large, curated databases continue to develop for multiple organisms (e.g., for TFs and their binding sites; see refs. 54-56), it will become possible to test the various evolutionary hypotheses concerning the roles played by effective population size, numbers of molecules per cell, number of interacting partners, and so on.
A number of additional avenues of inquiry remain for the future. For example, although we have provided a fairly extensive analysis of one-to-one and one-to-many types of interactions, the many-to-one and many-to-many scenarios remain to be explored. Many-to-one scenarios include the regulation of the large proportion of genes in eukaryotes that require multiple TFs for gene activation. Many-to-many interactions include higher-order gene networks and heteromeric proteins, and long chains of interactions are relevant to many signal-transduction cascades in eukaryotes. In all of these cases, the issue of epistasis (nonmultiplicative interactions among fitness-determining loci) will merit investigation, because this might influence the degree to which the partners at equivalent levels in such systems evolve in an independent manner. Recombination will have no influence on the preceding results provided the population size is small enough that the mode of evolution is consistent with the sequential-fixation model, but will become increasingly important in cases involving multiple loci in large populations where jointly segregating polymorphic sites have an appreciable probability of occurrence.
ACKNOWLEDGMENTS.
We thank J. Gunawardena, M. Lässig, and G. Marinov for helpful comments.