Literature DB >> 36099311

Strong evidence for the adaptive walk model of gene evolution in Drosophila and Arabidopsis.

Ana Filipa Moutinho^1,2, Adam Eyre-Walker², Julien Y Dutheil^1,3.

Abstract

Understanding the dynamics of species adaptation to their environments has long been a central focus of the study of evolution. Theories of adaptation propose that populations evolve by "walking" in a fitness landscape. This "adaptive walk" is characterised by a pattern of diminishing returns, where populations further away from their fitness optimum take larger steps than those closer to their optimal conditions. Hence, we expect young genes to evolve faster and experience mutations with stronger fitness effects than older genes because they are further away from their fitness optimum. Testing this hypothesis, however, constitutes an arduous task. Young genes are small, encode proteins with a higher degree of intrinsic disorder, are expressed at lower levels, and are involved in species-specific adaptations. Since all these factors lead to increased protein evolutionary rates, they could be masking the effect of gene age. While controlling for these factors, we used population genomic data sets of Arabidopsis and Drosophila and estimated the rate of adaptive substitutions across genes from different phylostrata. We found that a gene's evolutionary age significantly impacts the molecular rate of adaptation. Moreover, we observed that substitutions in young genes tend to have larger physicochemical effects. Our study, therefore, provides strong evidence that molecular evolution follows an adaptive walk model across a large evolutionary timescale.

Entities: Chemical

Mesh：

Year: 2022 PMID： 36099311 PMCID： PMC9470001 DOI： 10.1371/journal.pbio.3001775

Source DB: PubMed Journal: PLoS Biol ISSN： 1544-9173 Impact factor: 9.593

Introduction

How does adaptive evolution proceed in space and in time? This question has long intrigued evolutionary biologists. Fisher [1] proposed that adaptation relies on mutations with small effect sizes at the phenotypic level. He presented the geometric model of adaptation where phenotypic evolution occurs continuously and gradually toward an optimum fitness [1]. At the molecular level, Wright [2,3] first introduced the idea that populations evolve in the space of all possible gene combinations to acquire higher fitness. He characterised this model of evolution as a “walk” in an adaptive landscape. Wright consequently proposed the shifting balance theory of adaptation, which combines the effects of drift and selection. Drift acts by moving the population away from its local peak, while natural selection directs the population to higher fitness, the so-called “global optimum” in the fitness landscape. With the rise of molecular genetics, Maynard Smith [4] extended this idea to a sequence-based model of adaptation. He introduced the idea of an “adaptive walk,” where a protein “walks” in the space of all possible amino acid sequences towards the ones with increasingly higher fitness values. Wright’s and Smith’s adaptation model was further extended by Gillespie [5-7], who presented the “move rule” in an adaptive landscape. Following Kimura’s work on the effect sizes of mutations [8], Gillespie suggested that adaptation proceeds in large steps, where mutations with higher fitness effects are more likely to reach fixation. The adaptive walk model was later fully developed by Allen Orr [9,10], who extended Fisher’s geometric model of adaptation and showed that adaptive walks lead to a pattern of diminishing returns. A sequence further away from its local optimum tends to accumulate large-effect mutations at the beginning of the walk, small-effect mutations being then only fixed when the sequence is approaching its optimum fitness. Therefore, under this model, adaptation relies both on mutations of large and small fitness effects, in a time-dependent manner. Experimental studies tracing the evolution of bacteria [11-14] and fungi [15] have supported this model. Experimental studies, however, can only assess patterns of adaptation at relatively short time scales in artificial environments and are limited to certain organisms. The challenge lies in studying adaptation across long evolutionary time scales: How does the distribution of beneficial mutations vary over time? While long-term evolutionary processes are not directly observable, they leave a signature in genome sequences. Species share genes to variable extents, according to their degree of divergence. Some ancient genes, which evolved early, are shared by many species. Others, which evolved more recently, are only shared by a few related species [16,17]. Consequently, the age of a gene can be inferred from its phyletic pattern, i.e., its presence or absence across the phylogeny [18]. These phyletic patterns are reconstructed using sequence similarity searches performed by tools like BLAST [19]. A gene is considered “old” if a homolog is identified in several taxa over a deep evolutionary scale, or “young” or lineage-specific if the recognised homologs are only present in closely related species. This approach is known as phylostratigraphy [20]. Previous studies suggested that young or lineage-specific protein-coding genes evolve faster than older ones [17,21-27]. Albà and Castresana [27] showed a negative correlation between the ratio of the nonsynonymous (d) to synonymous (d) substitutions rates, ω, and gene age in the divergence between humans and mice, with young genes presenting a higher ω. Similar correlations between ω and gene age have been observed in primates [22], fungi [25], Drosophila [23,28,29], bacteria [30], viruses [31], plants [32,33], and protozoan parasites [34]. Despite the observed consistency across taxa, the underlying causal mechanisms remain debated [21]. As the average d/d ratio integrates both positive and negative selection, a comparatively higher ratio can result from a less stringent purifying selection, the occurrence of positive selection, or both. By looking at polymorphism data, Cai and Petrov [22] suggested that the faster evolution in young primate genes may be due to the lack of selective constraint posed by purifying selection, while analyses of adaptive evolution were inconclusive. Analysing the effect of gene age on adaptive evolution is a complex task, as young and old genes differ in their structural properties, expression level, and protein function. Young genes tend to be smaller [22,24,35], have a higher degree of intrinsic disorder [36], and are expressed at lower levels [17,22,24,26]. Moreover, young genes tend to encode proteins involved in developing species-specific characteristics and immune and stress responses [16,37,38]. As the macromolecular structure [39,40], gene expression levels [39,41] and protein function [39,42,43] are known determinants of the rate of protein adaptation, they could be confounding the effect of gene age. Several studies reported the substantial impact of gene expression on the adaptive rate of proteins, where highly expressed proteins are significantly more constrained and have lower adaptation rates [39,41,44,45]. At the macromolecular level, some studies showed that highly disordered [39,40] and exposed residues [39] present higher rates of adaptive evolution. Finally, there is evidence that proteins involved in the immune and stress response have higher molecular adaptive rates [39,43,46,47]. Thus, controlling for these confounding factors when assessing the impact of gene age on the rate of molecular adaptation is crucial. Here, we used a population genomic approach to test the adaptive walk model. We make 2 predictions: first, that younger genes are undergoing faster rates of adaptive evolution, and second, the evolutionary steps they make are larger. We tested the first prediction by estimating rates of adaptive and nonadaptive protein evolution using an extension of the MacDonald–Kreitman (MK) test [48], which uses counts of polymorphism and substitution at selected and neutral sites. We quantified the rates of adaptive and nonadaptive evolution using the statistics ω and ω, which denote the rates of adaptive and nonadaptive nonsynonymous substitution relative to the mutation rate. We investigated whether protein length, gene expression, relative solvent accessibility (RSA), intrinsic protein disorder, BLAST’s false-negative rate, and protein function act as confounding factors of the effect of gene age. To test the second prediction, we considered the rates of substitution between amino acids separated by different physicochemical distances as a function of gene age. We tested our hypotheses in 2 pairs of species with different life history traits: the Diptera Drosophila melanogaster and Drosophila simulans and the Brassica Arabidopsis thaliana and Arabidopsis lyrata. In each species pair, we compared their most recent genes with those dating back to the origin of cellular organisms.

Results

Young genes have a higher rate of adaptive substitutions

We tested the adaptive walk model of sequence evolution by assessing the impact of gene age on the rate of adaptive (ω) and nonadaptive (ω) nonsynonymous substitutions. We used Grapes [48] to estimate ω and ω, which accounts for segregating slightly deleterious and advantageous mutations by specifically modelling the distribution of fitness effects (DFE). Gene age data were obtained from published data sets [28,33]. We found that gene age is significantly correlated to estimates of ω, ω, and ω in both species’ pairs (Table 1 and Fig 1B; data available in S1 Data). This result suggests that the higher ω ratio of more recently evolved genes is due to a higher rate of adaptive and nonadaptive nonsynonymous substitutions. As X-linked genes are known to evolve faster due to the male hemizygosity [49-51], we assessed whether the relationship between evolutionary rates and gene age differed between chromosomes in Drosophila (Fig 1B). We compared models with and without the chromosome’s effect (see Material and methods and S1 File) and found only low support for a chromosomal effect (p = 0.041 for ω and p = 0.094 for ω). We, therefore, combined data from all chromosomes for subsequent analyses.

Table 1

Kendall’s correlation coefficients for the relationship between ω, ω, and ω and gene age, for the analysis of gene age and the combined analyses of gene age with the respective cofactors: protein length, gene expression, protein intrinsic disorder, and the mean relative solvent accessibility per gene.

The combined probabilities for each cofactor within and across species are presented in the fields “Weighted Z” and “Weighted Z across species,” respectively, for ω, ω, and ω.

		Arabidopsis			Drosophila			Weighted Z across species
		ω	ω _na	ω _a	ω	ω _na	ω _a	ω	ω _na	ω _a
Gene Age		0.962 ***	0.848 ***	0.733 ***	0.727 ***	0.697 **	0.636 **
Protein Length	Long	1.000 ^**	0.867 ^*	−0.200	0.867 *	0.600 ^(.)	0.867 *	1.56 × 10^{−4 ***}	7.71 × 10^{−5 ***}	7.98 × 10^{−3 **}
	Short	1.000 ^**	0.867 *	0.600 ^(.)	0.733 *	0.867 *	0.467
	Weighted Z	6.46 × 10^{−4 ***}	1.61 × 10^{−3 **}	0.133	2.64 × 10^{−3 **}	5.29 × 10^{−3 **}	0.0105 *
Gene Expression	High	0.867 ^*	0.867 *	0.467	0.867 *	1.000 **	0.600 ^(.)	6.93 × 10^{−5 ***}	6.89 × 10^{−6 ***}	3.53 × 10^{−3 **}
	Low	0.867 ^*	1.000 ^**	0.333	0.867 *	0.733 *	1.000 **
	Weighted Z	1.51 × 10^{−3 **}	3.71 × 10^{−4 ***}	0.186	1.09 × 10^{−3 **}	1.68 × 10^{−3 **}	2.24 × 10^{−3 **}
Protein Intrinsic Disorder	High	1.000 ^***	0.939 ^***	0.636 **	0.670 **	0.303	0.515 *	<2 × 10^{−216 ***}	6.60 × 10^{−6 ***}	2.53 × 10^{−3 **}
	Low	0.970 ^***	0.909 ^***	0.454 *	0.630 **	0.576 **	0.273
	Weighted Z	<2 × 10^{−216 ***}	<2 × 10^{−216 ***}	1.20 × 10^{−3 **}	3.85 × 10^{−5 ***}	5.80 × 10^{−3 **}	4.18 × 10^{−2 *}
Mean Relative Solvent Accessibility	High	0.944 ^***	0.889 ^***	0.722 **	0.636 **	0.673 **	0.564 *	1.00 × 10^{−7 ***}	9.00 × 10^{−7 ***}	1.37 × 10^{−5 ***}
	Low	1.000 ^***	0.778 ^**	0.667 *	0.636 **	0.491 *	0.564 *
	Weighted Z	6.20 × 10^{−6 ***}	1.41 × 10^{−5 ***}	1.24 × 10^{−3 **}	3.67 × 10^{−4 ***}	7.76 × 10^{−4 ***}	1.55 × 10^{−3 **}

For each variable, the correlation coefficient and the combined probabilities are shown with the respective significance (*P < 0.05; **P < 0.01; ***P < 0.001; “.” 0.05 ≤ P < 0.10) for ω, ω, and ω in Arabidopsis and Drosophila. As the effect of gene age was assessed by combining genes in each age class, these correlation coefficients do not measure the intrinsic gene-level strength of correlation between gene age and molecular rates of evolution.

Fig 1

(a) Phylogenetic definition of the strata used in the analyses for A. thaliana (top) and D. melanogaster (bottom). The number of genes mapped to each clade is shown. (b) Relationship between the rate of protein evolution (ω), nonadaptive nonsynonymous substitutions (ω), and adaptive nonsynonymous substitutions (ω) with gene age in A. thaliana (top) and in D. melanogaster (bottom). Clades are ordered according to (a). In D. melanogaster, the results for X-linked, autosomal, and total genes are shown. Mean values of ω, ω, and ω for each category are represented with the black points. Error bars denote for the 95% confidence interval for each category, computed over 100 bootstrap replicates. The data (S1 Data) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

Kendall’s correlation coefficients for the relationship between ω, ω, and ω and gene age, for the analysis of gene age and the combined analyses of gene age with the respective cofactors: protein length, gene expression, protein intrinsic disorder, and the mean relative solvent accessibility per gene.

The combined probabilities for each cofactor within and across species are presented in the fields “Weighted Z” and “Weighted Z across species,” respectively, for ω, ω, and ω. For each variable, the correlation coefficient and the combined probabilities are shown with the respective significance (*P < 0.05; **P < 0.01; ***P < 0.001; “.” 0.05 ≤ P < 0.10) for ω, ω, and ω in Arabidopsis and Drosophila. As the effect of gene age was assessed by combining genes in each age class, these correlation coefficients do not measure the intrinsic gene-level strength of correlation between gene age and molecular rates of evolution.

The effect of gene age on the rate of molecular adaptation is robust to multiple confounding factors

Genes of different ages intrinsically differ in their features [16,22,36]. As such traits significantly impact the rate of molecular evolution [39], they may be confounding the faster adaptive rates observed in young genes. Previous studies reported that younger genes encode shorter proteins [24,35,52] and are expressed at lower levels [17,22,24,26], a pattern that we also observed in our data set (gene age versus protein length for D. melanogaster and A. thaliana, respectively: Kendall’s τ = −0.485, p = 2.82 × 10−2; τ = −0.848, p = 1.06 × 10−5, S1A Fig, data available in S2 Data; gene age versus gene expression for D. melanogaster and A. thaliana, respectively: τ = −0.595, p = 7.35 × 10−3; τ = −0.790, p = 4.00 × 10−5, S1B Fig, data available in S3 Data). As younger proteins are shorter than older ones, they have a higher proportion of exposed residues [39]: Gene age is significantly positively correlated with the average RSA per gene (τ = 0.636, p = 3.98 × 10−3; τ = 0.695, p = 3.03 × 10−4, for D. melanogaster and A. thaliana, respectively; S2A Fig, data available in S4 Data). Because exposed residues are more flexible [53], young genes tend to encode proteins with a higher degree of intrinsic disorder, a pattern previously reported in mice [36]. We confirmed this pattern in D. melanogaster (τ = 0.606, p = 6.10 × 10−3; S2B Fig) and A. thaliana (τ = 0.467, p = 1.53 × 10−2; S2B Fig, data available in S5 Data). Given that ω and ω are negatively correlated to length and gene expression and positively correlated to RSA and protein disorder [39], these could be driving the effect of gene age on ω and ω; i.e., the correlation between ω and ω and age could be incidental. To control for the impact of each confounding factor using Grapes, we split our data into 2 roughly equal-sized groups according to protein length, expression level, average RSA, and average intrinsic disorder and reran the analysis within the “high” and “low” groups, combining probabilities from the 2 analyses using the weighted Z-method [54]. As estimates of the rate of adaptive substitutions for a small number of genes exhibit large sampling variances [55,56], each putative confounding factor could only be tested separately. Some phylostrata were further combined when underrepresented in some gene categories (see Material and methods). We found that ω, ω, and ω remain significantly correlated to gene age, except when controlling for protein length and gene expression for ω in Arabidopsis (Fig 2 and Table 1; data available in S6–S9 Data tables). This weaker effect may be a consequence of how the most recent clades were combined in these analyses, as there was little data available for those genes (see Material and methods). Nonetheless, when combining probabilities across the 2 species, we observed a significant correlation between all measures of evolutionary rate and gene age controlling each of the cofactors (Table 1; data available in S10–S13 Data tables).

Fig 2

Estimates of ω, ω, and ω plotted as a function of (a) protein length and (b) mean expression levels, (c) RSA, and (d) protein intrinsic disorder with gene age in A. thaliana (top) and D. melanogaster (bottom). Analyses were performed by comparing short and long genes (a), lowly and highly expressed genes (b), proteins with low and high mean RSA values (c), and proteins with low and high average intrinsic disorder (d) across age categories (see Material and methods). Legend as in Fig 1. Significance levels are shown for each correlation between gene age and the rates of protein evolution (*P < 0.05; **P < 0.01; ***P < 0.001; “.” 0.05 ≤ P < 0.10; “ns” P > 0.10). The data (S6–S9 Data tables) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

We next asked whether the observed effect could be explained by the residual correlation of the cofactors with gene age within each “high” and “low” group. We performed 3 different tests: (1) the correlation between ω and the mean value of each cofactor in each age class within the “high” and “low” groups (data available in S14–S17 Data tables); (2) the correlation between the cofactor and gene age in each “high” and “low” group (data available in S18 Data); and (3) a linear model where ω is the response variable, and gene age, category, species (Arabidopsis or Drosophila), and the within category cofactor values are explanatory variables (data available in S14–S17 Data tables). While we do observe a general correlation of the cofactor with gene age within each group, we found that nearly all correlations between ω and the cofactor are nonsignificant (S1 Table). Moreover, the linear model analyses showed that in all cases but gene length, gene age is the most significant variable, with an effect consistent with that observed in Table 1. For gene length, gene age was only significant as an interaction with the species variable, having a positive effect only in Drosophila (S2 Table). These findings, therefore, suggest that the effect of gene age on ω is independent of the cofactor. To jointly estimate the effect of the potential confounding factors, we applied a recently developed method that extends the MK test with a generalized linear model [57]. This approach disentangles the effects of each factor on the rate of adaptive substitutions per nucleotide site. However, this method does not model the DFE and hence cannot account for segregating slightly deleterious mutations, which can bias estimates of the rate of adaptive substitutions [58]. Hence, following the approach suggested in Huang [57], we removed sites for which the derived allele frequency was below 50% to minimize any potential bias. Despite the large reduction in the data set, this analysis revealed a significant effect of gene age (S3 Table; data available in S19–S22 Data tables). Our findings, therefore, suggest that the effect of gene age on rates of protein evolution is robust to the tested confounding factors and that a gene’s age acts as a significant determinant of the rate of adaptive and nonadaptive evolution in both species. Lastly, we aimed at assessing the effect size of gene age on ω relative to other factors. Because correlation coefficients were computed from values averaged over multiple genes and genes were categorized differently for each analysis, the comparison of correlation coefficients does not provide a reliable estimate of relative effect sizes. Therefore, we assessed the relative contribution of each variable to the MK regression model by computing likelihood pseudo-R2s [59]. These were estimated by comparing the log-likelihood of the full model (all factors included) with the log-likelihood of reduced models, where each single factor was removed independently from the full model. These estimates suggest that, despite its very significant effect, gene age contributes comparatively little to the explained variance compared to factors such as protein function (S4 Table). To note, however, that these estimates refer to the relative contribution of each variable per nucleotide site and not per gene as this is the unit used by the MK regression, which results in generally low R2 values. Estimates of ω, ω, and ω plotted as a function of (a) protein length and (b) mean expression levels, (c) RSA, and (d) protein intrinsic disorder with gene age in A. thaliana (top) and D. melanogaster (bottom). Analyses were performed by comparing short and long genes (a), lowly and highly expressed genes (b), proteins with low and high mean RSA values (c), and proteins with low and high average intrinsic disorder (d) across age categories (see Material and methods). Legend as in Fig 1. Significance levels are shown for each correlation between gene age and the rates of protein evolution (*P < 0.05; **P < 0.01; ***P < 0.001; “.” 0.05 ≤ P < 0.10; “ns” P > 0.10). The data (S6–S9 Data tables) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

The effect of gene age on the molecular rate of adaptation is robust to BLAST’s false negative rates

The phylostratigraphy approach has been previously used to date the emergence of new genes, and some studies have pointed out its potential limitations [28,60-63]. Because BLAST homology searches might fail to identify homologs in short or rapidly evolving genes, such genes could be mistakenly classified as young. As expected, we observed that genes in younger phylostrata present higher E-values in both species (τ = 0.564, p = 0.025; τ = 0.951, p = 1.20 × 10−6, for D. melanogaster and A. thaliana, respectively; S3A Fig; data available in S23 Data). To control this effect, we reran our analyses using genes with very low E-values that are likely to be detected in most age strata and for which correlation between E-values and gene age was no longer significant (see Material and methods and S2 File) (τ = 0.408, p = 0.111; τ = 0.354, p = 0.141, for D. melanogaster and A. thaliana, respectively; S3B Fig; data available in S23 Data). We observed that the effect of gene age remained for all estimates in the 2 species when using this restricted data set (ω: τ = 0.929, p = 1.30 × 10−3; ω: τ = 0.786, p = 6.49 × 10−3; ω: τ = 0.643, p = 2.59 × 10−2 in A. thaliana; and ω: τ = 0.697, p = 1.61 × 10−3; ω: τ = 0.636, p = 3.98 × 10−3; ω: τ = 0.636, p = 3.98 × 10−3 in D. melanogaster, S4 Fig; data available in S24 Data). These results suggest that the correlation of gene age with the rate of adaptive evolution cannot be attributed to errors in dating the emergence of a gene stemming from the failure of identifying homologs in older taxa.

The effect of gene age on the rate of molecular adaptation does not depend on protein function

Lineage-specific genes are known to be involved in species-specific adaptive processes, such as the evolution of morphological diversity [64] and immune and stress responses [17,34,64]. As proteins encoding such functions tend to have higher molecular rates of adaptation [39,42,43,46,47,65], we further assessed whether the observed effect of gene age could be due to younger genes being enriched in functions with higher evolutionary rates. We first examined which functions are encoded by young genes in A. thaliana and D. melanogaster. In A. thaliana, young genes (clades 12 to 15 in Fig 1A) are mostly involved in a large variety of cellular processes, stress response and external stimulus, protein binding, and signal transduction (S5A Fig; data available in S25 Data). In D. melanogaster, young genes (clades 11 and 12 in Fig 1A) encode mostly functions involved in the cell’s anatomic structure, stress response, nervous system processes, enzyme regulators, immune system mechanisms, and a wide range of metabolic processes (S5B Fig; data available in S25 Data). To correct for the potential bias of protein function in the relationship between the rate of adaptive evolution and gene age, we split the genes into different functional categories based on gene ontology (GO) terms. To simultaneously control for the effect of gene age, we further divided the genes into 3 age categories, trying to keep a similar number of genes in each (see Material and methods). Grapes was then used to estimate ω, ω, and ω in each combined category. As some gene functions were biased towards some age categories (S5 Fig), we could not do this analysis for all GO terms. We, therefore, only used the GO terms with a sufficient number of annotated genes in each age class (see Material and methods). In A. thaliana, we found that the impact of gene age on ω is stronger in proteins linked to stress response and cellular components, where younger genes present higher molecular adaptive rates (S6A Fig and S3 File; data available in S26 Data). Although the GO term cellular component represents a comprehensive annotation, it denotes the cellular compartments where processes such as signal transduction and membrane trafficking occur, essential for maintaining the cell homeostasis [66,67]. In D. melanogaster, we observed a strong effect of gene age on ω for proteins encoding chromosomal organisation, protein complex, stress response, signal transduction, and involved in the cell cycle (S6B Fig and S3 File; data available in S26 Data). Even though these functions cover a wide range of molecular processes, they are involved in DNA replication, genome stability, and immune and stress responses, which are critical functions for the coevolutionary arms race between hosts and parasites [47]. When looking at the nonadaptive rate of evolution, our analyses revealed a strong influence of gene age in ω within most functions analysed in both species, where young genes present higher rates of nonadaptive substitutions (S6 Fig and S3 File; data available in S26 Data). These results suggest that, when restricting the analysis to proteins involved in defence mechanisms, which are known to adapt faster [42,43,47,65], gene age still has an impact on the efficiency of selection acting upon a protein.

Substitutions in young genes have larger effect sizes

Our second prediction under the adaptive walk model is that substitutions in young genes should have larger fitness effects than in older genes. To test this prediction, we used Grantham’s physicochemical distances between amino acids [68] as a proxy for the fitness effects of amino acid mutations. We looked at the fixed differences separated by 1 mutational step between each pair of species (i.e., fixed differences in the ingroup when compared to the outgroup species) and reported the average Grantham’s distances between residues within each age stratum. We observed that substitutions in young genes tend to occur between less biochemically similar residues (Arabidopsis: τ = 1, p = 2.00 × 10−7; Drosophila: τ = 0.788, p = 3.628 × 10−4; Fig 3 and S4 File; data available in S27 Data), suggesting that substitutions in these genes have larger physicochemical effects than in old ones. To further test whether these larger effects had a positive or negative impact on younger genes, we estimated the average Grantham’s distance among adaptive () and nonadaptive () nonsynonymous substitutions (see Material and methods and S4 File; data available in S28 Data). We expect a positive correlation between and and gene age, i.e., the average Grantham’s distance for adaptive and nonadaptive substitutions should increase towards the present. This is indeed what we find: We observed a positive correlation between and gene age in both species, which is individually significant in Drosophila (p = 0.006) and significant across species when we combine probabilities (p = 0.009) (S7 Fig). We observed the same pattern in the analysis of with a significant correlation in Arabidopsis (p = 0.004) and a significant correlation when combining the probabilities from both species (p = 0.0007) (S7 Fig).

Fig 3

Relationship between gene age and Grantham’s distance between amino acids for A. thaliana (a) and D. melanogaster (b). For each clade, the median value of Grantham’s distance between residues is depicted with the black dot. The shaded area represents the physicochemical distances within the first and third quartile. The data (S27 Data) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

Discussion

We used a population genomic approach to disentangle the effects of positive and negative selection on the rate of nonsynonymous substitutions. Using complete genome data from 2 Arabidopsis and Drosophila species, we showed that the higher rate of nonsynonymous substitutions in younger genes results both from relaxed purifying selection (higher ω) and a higher rate of adaptive substitutions (higher ω) (Fig 1B). Although gene age is not the variable contributing most to the rate of molecular adaptation (S4 Table), its effect is consistent across comparisons and robust to multiple confounding factors (Fig 2 and Tables 1 and S2 and S3). Moreover, we observed that young genes undergo substitutions that are larger in terms of physicochemical properties than older genes. A question remains: What are the drivers of these effects?

The magnitude effect of gene age on adaptive evolution is species-specific

Although we observed a strong impact of gene age on the molecular adaptive rate in both species pairs, the shape of the relation between these 2 variables differs. While the relationship between gene age and ω is monotonously increasing in Arabidopsis, it has several peaks in Drosophila (Fig 1B). This pattern is particularly evident if we discard the 2 youngest clades. In Drosophila, the correlation becomes much weaker and nonsignificant for ω (ω: τ = 0.600, p = 0.016; ω: τ = 0.556, p = 0.025; ω: τ = 0.467, p = 0.060), whereas in Arabidopsis, the effect of gene age persists (ω: τ = 0.9487, p = 6.342 × 10−6; ω: τ = 0.872, p = 3.345 × 10−5; ω: τ = 0.692, p = 9.86 × 10−4). Intriguingly, this multimodal distribution of ω observed in Drosophila resembles the pattern of gene emergence in this species [17]. The peak in the adaptive substitution rate observed for clades 6 and 7 (Fig 1B) coincided with the animal phyla’s major radiation at the time of extensive periods of glacial cycles [69]. When looking at the functions coded by these proteins, we found that they are linked to a wide range of vital cellular and biological processes, such as defence mechanisms and cell differentiation (S8 Fig and S3 File; data available in S25 Data). This pattern suggests that these genes might be experiencing higher molecular adaptive rates due to their role in such vital processes. However, for these genes to keep such high rates of adaptive substitutions until recent times (i.e., in the branch between D. melanogaster and D. simulans), epistatic interactions might be at play. Studies across taxa have proposed that functional epistasis is an important factor in the evolution of genes involved in defence mechanisms and adaptation to new environmental stresses [70-74]. We posit that such gene interactions keep these proteins further from their optimum throughout time due to the rugged shape of the fitness landscape, leading to the high molecular adaptive rates observed in the branch between D. melanogaster and D. simulans. To further test this hypothesis, we used the degree of protein–protein interactions (PPIs) as a proxy for epistatic interactions and analysed its relationship with gene age. We observed that genes in clade 7 have a slightly higher degree of PPI than other strata (S9 Fig and S5 File; data available in S29 Data), suggesting that these genes might be experiencing relatively more epistatic interactions. These findings are consistent with epistasis influencing the evolution of these genes, potentially explaining their continued higher rates of molecular adaptation. In contrast, the burst of the emergence of new genes in Arabidopsis coincided with the plant-specific radiation right before the emergence of Brassicaceae [17,75]. This trend is consistent with our results from A. thaliana, where the bursts of ω occur in younger clades (after clades 11 and 12 in Fig 1B). These distinct patterns observed between species suggest that the role of a gene’s age in molecular adaptation is complex, as also evidenced by the lack of a significant correlation with ω previously reported in humans [22]. The authors proposed that this result may be a consequence of the generally low molecular adaptive rates observed in primates [22,48]. Despite these species-specific trends, our analyses revealed a strong correlation between ω and gene age extending through hundreds of millions of years (Figs 1 and 2 and S2 and S3 Tables). These findings suggest a consistent effect of a gene’s age on the rate of molecular adaptation across taxa.

An adaptive walk model of gene evolution

Our study highlighted that, after their emergence, young genes evolve through relaxed selection, as first proposed by Ohno [76], but also by acquiring beneficial mutations, as described in the “adaptive-conflict” model [37,77]. Ohno’s idea of evolution was “non-Darwinian” in its nature, as he believed that “natural selection merely modified while redundancy created” [76]. He proposed that new genes evolve by accumulating “forbidden” mutations, where they are only preserved if the development of a formerly nonexistent function occurs, a process known as neofunctionalisation. In this scenario, natural selection only acts at the stage of acquiring a new function. Further extensions of this theory suggested that the preservation of a new gene can also occur through subfunctionalisation, where the accumulation of deleterious mutations leads to a complementary loss of function in both copies of the gene [78,79]. In contrast, the “adaptive-conflict” model assumes that the ancestral gene could carry more than 2 pleiotropically constrained functions [37,77]. Once the duplication event occurs, each copy then becomes specialised in one of the ancestral functions. In this case, the ancestral gene’s split proceeded through positive Darwinian selection [37,77]. These theories are based on the evolution of gene duplicates and agree with the idea of evolution as a “tinkerer” proposed by Jacob [80], where evolution adjusts the already existing elements. In de novo evolution, however, new genes emerge by acquiring new functions from the noncoding fragments of the genome [17,81,82]. This process is thought to proceed through a stochastic phase followed by the successive accumulation of beneficial mutations, ultimately leading to a new function with a species-specific selective advantage [83-86]. When looking at the fundamental ideas behind these theories, one can draw 1 prominent feature that portrays the evolution of new genes: Young genes are further away from their fitness optimum. Hence, we posit that these genes follow an adaptive walk model of gene evolution to reach their fitness peak [3,4,87]. As their full potential has yet to be met, more consecutive beneficial mutations are theoretically needed to reach their fitness optimum, leading to the higher molecular adaptive rates observed in these genes. In turn, older genes are closer to their optimal features and less robust to large effects’ mutations, thus only accumulating mutations with small fitness effects. Such slightly advantageous mutations are more difficult to select for, leading to lower adaptive rates in these proteins. We further tested this hypothesis using Grantham’s physicochemical distances [68] as a proxy for the fitness effect of substitutions. This analysis showed that substitutions in young genes tend to occur between more dissimilar residues (Fig 3), suggesting that the evolution of young genes proceeds in larger steps compared to old ones. Moreover, we observed that adaptive substitutions have greater physicochemical effects (i.e., the average Grantham’s distance of adaptive substitutions () is correlated to gene age; S7 Fig). However, the strength of this signal differs between species. While we observed a strong positive correlation between and gene age in Drosophila, this relationship is weaker in Arabidopsis. The opposite pattern is observed for the correlation between and gene age, where this relationship is only significant in Arabidopsis (S7 Fig). These patterns suggest that, in Arabidopsis, there is a stronger influence of relaxed purifying selection when compared to positive selection. This weaker adaptive signal may be a consequence of their mating system. The transition from out-crossing to self-fertilization in A. thaliana exposed these plants to less efficient purifying selection due to their reduced effective population sizes [88,89], thus accumulating more deleterious mutations. This higher load of segregating deleterious mutations can interfere with nearby beneficial mutations, preventing their fixation and slowing down adaptation. Indeed, previous studies have shown that segregating recessive deleterious mutations can substantially decrease the fixation rate of advantageous mutations, leading to a weaker adaptive signal in some parts of the genome [90,91]. In contrast, the large effective population sizes of Drosophila species make them more efficient in removing deleterious mutations and prone to stronger positive selection [92-94]. These patterns could explain the observed differences in the strength of selection in young genes between species. An alternative hypothesis to the adaptive walk theory is a static model of adaptation. This model is based on the level of constraint of a gene: As some genes are less constrained than others, they accumulate both advantageous and deleterious mutations, some of which may lead to the loss of a gene. Under this hypothesis, less constrained genes tend to be young because they are more likely to be lost. Additionally, as such genes accumulate a larger number of deleterious mutations, they also represent a bigger mutational target for adaptive mutations, which could also explain the overall correlation between ω and ω. We can test this hypothesis by assessing whether the correlation between ω and gene age is independent of ω. We controlled for the effect of ω by running Grapes on a subset of genes for which the correlation between p/p and gene age was no longer significant. We found that gene age was still significantly correlated with ω whereas ω was not (S10 Fig; data available in S30 Data), thus suggesting that the rate of adaptive evolution is independent of the nonadaptive rate. While these results do not completely discard the possibility of a stationary view of adaptation, they agree with an adaptive walk model of gene evolution. Overall, our findings support an adaptive walk model hypothesis of gene evolution over time, where protein-coding genes adapt following a pattern of diminishing returns in Drosophila and Arabidopsis. The increasing availability of population genomic data sets will allow the detailed characterization of the adaptive walk model of gene evolution in an increasingly diverse range of taxa.

Material and methods

We assessed the role of gene age on adaptive evolution using the divergence and polymorphism data published in Moutinho and colleagues [39]. For D. melanogaster, the data included 10,318 protein-coding genes in 114 genomes for 1 chromosome arm of the 2 large autosomes (2L, 2R, 3L, and 3R) and 1 sex chromosome (X) pooled from an admixed sub-Saharan population from Phase 2 of the Drosophila Genomics Project [DPGP2] [95] and divergence estimates from D. simulans (S31 Data). For A. thaliana, we analysed 18,669 protein-coding genes in 110 genomes comprising polymorphism data from a Spanish population (1001 Genomes Project) [96] and divergence out to A. lyrata (S32 Data). We used these data sets to infer the synonymous and nonsynonymous unfolded site frequency spectrum (SFS), and synonymous and nonsynonymous divergence from the rate of synonymous and nonsynonymous substitutions. Sites with a missing outgroup allele were considered as missing data. All estimates were obtained using the “bppPopStats” program from the Bio++ Program Suite (version 2.4.0) [97]. Gene age data were obtained from published data sets, wherein 9,004 Drosophila [28] and 17,732 Arabidopsis [33] genes were used. Analyses were performed by dividing genes into 12 and 15 phylostrata for D. melanogaster and A. thaliana (Fig 1), respectively, numbered from the oldest (stratum 1) to the most recent (strata 12 and 15 in D. melanogaster and A. thaliana, respectively). The most recent clades include orthologous genes present in each species and their respective outgroups. The analyses on the X-linked and autosomal genes in D. melanogaster were performed with 1,478 and 7,526 genes, respectively. We fitted models of the DFE across different age classes and gene categories to estimate the molecular rate of adaptation [48].

Estimation of the adaptive and nonadaptive rate of nonsynonymous substitutions

The rates of adaptive nonsynonymous substitutions were estimated with the Grapes program [48], using the Gamma-Exponential DFE, as this model was previously shown to best fit the data [39]. As this method does not accommodate SNPs with missing information in some individuals, we reduced the original data set to n = 110 and n = 105 genomes for D. melanogaster and A. thaliana, respectively, by randomly down-sampling sites with at least n genotypes available and discarding positions where less than n genotypes were present. Estimates of substitution rates and their confidence intervals were obtained with a bootstrap analysis by sampling genes in each category, with replacement. We performed a total of 100 replicates, and the DFE model was fitted for each replicate with Grapes. Results for ω, ω, and ω were plotted using the R package “ggplot2” [98] by taking the mean value and the 95% confidence interval of the 100 bootstrap replicates performed for each category (see detailed R scripts in the supplementary files in the supplementary data online, https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430).

Gene age versus protein length and gene expression

Data on protein length and gene expression were included in the data sets of Moutinho and colleagues [39] (S31 and S32 Data). Gene expression data were downloaded from the database Expression Atlas [99], taking 1 baseline experiment for each species [39]. Mean gene expression levels were then obtained by averaging across samples and tissues for each gene. To correct for these 2 factors, we divided our data set into 2 equally sized groups based on the factor we wished to control. Short proteins had a size up to 366 and 389 amino acids, and long proteins had a size up to 4,674 and 5,098 amino acids in A. thaliana and D. melanogaster, respectively. We further merged phylostrata containing a low number of genes. For D. melanogaster, we categorised gene age into 6 main clades by combining clades 3 to 4, 5 to 6, 7 to 10, and 11 to 12, keeping the others unchanged. In A. thaliana, we combined the 15 clades into 6 main groups by merging clades 5 to 8 and 9 to 15. For gene expression, we used a total of 17,126 and 6,247 genes for A. thaliana and D. melanogaster, respectively, being categorised as lowly and highly expressed. Genes were classified as lowly expressed if the mean expression levels were up to 10.3 and 6.8, and highly expressed genes were the ones with expression up to 6,632.8 and 4,237.0 in A. thaliana and D. melanogaster, respectively. For D. melanogaster, we categorised gene age into 6 categories by combining clades 3 to 5, 6 to 9, and 10 to 12. In A. thaliana, we combined the data in 6 clades, merging clades 4 to 7, 8 to 11, and 12 to 15 (S6 File).

Gene age versus protein structure

Since most young genes lack a defined three-dimensional structure [36], they do not have information on the residue’s solvent accessibility. Hence, we used a deep learning approach, NetSurfP-2.0, which predicts the RSA of each residue from the amino acid sequence [100] by using as a training model the HH-suite sequence alignment tool for protein similarity searches [101]. To assess whether this approach provided reliable results, we compared the RSA estimates of NetSurfP-2.0 with those obtained from the PDB structures in our data set [39]. We found a good correlation between the 2 approaches for both species (Kendall’s τ = 0.571, p < 2 × 10−216; τ = 0.462, p < 2 × 10−216, for D. melanogaster and A. thaliana, respectively). Using NetSurfP-2.0, RSA estimates were successfully obtained for a total of 4,238,686 (88% of the total codon sites) and 7,479,807 (99% of the total codon sites) amino acid residues for D. melanogaster and A. thaliana, respectively. To assess the impact of RSA at the gene level, we analysed the total number of genes in both species by making 2 categories of genes according to the average RSA value per gene. Genes with lower RSA had mean values between 0.127 and 0.389 in Drosophila and 0.217 and 0.386 in Arabidopsis. Genes with a higher RSA had mean values between 0.390 and 0.894 in Drosophila and 0.387 and 0.898 in Arabidopsis. The phylostrata groups were defined by combining clades 7 to 8 in D. melanogaster and 8 to 11 and 12 to 15 in A. thaliana (S6 File). For the analysis of the residue intrinsic disorder, we used the estimates included in the data sets of Moutinho and colleagues [39]. These data were obtained with the software DisEMBL [102], where intrinsic disorder was estimated per site and classified according to the degree of “hot loops,” i.e., highly mobile loops. This analysis was performed for 17,732 and 7,410 genes for A. thaliana and D. melanogaster, respectively. Genes were combined into 2 categories according to the mean value of their residue’s intrinsic disorder. Genes with a low level of intrinsic disorder had values between 0.029 and 0.080 in Drosophila and among 0.041 and 0.084 in Arabidopsis. Genes with a higher degree of intrinsic disorder had values between 0.081 and 0.554 in Drosophila and among 0.085 to 0.551 in Arabidopsis. In D. melanogaster, all the 12 phylostrata could be used. In A. thaliana, the 15 strata were combined in 12 categories by merging clades 9 to 10, 11 to 12, and 13 to 14 (S6 File).

Correcting for BLAST E-values

We analysed the robustness of the gene age’s effect by correcting the variation in the Expect (E) value estimates in BLAST’s searches between our focus species and their respective outgroups. By reducing the variation in E-values estimates, we could correct for potential failures in BLAST’s homology searches. To do so, we limited the analysis to genes with particularly low E-values that are more likely to be identified in all age strata: 12,472 genes with an E-value lower than 1 × 10−150 for A. thaliana and 7,104 genes with an E-value lower than 1 × 10−100 for D. melanogaster (S2 File). For A. thaliana, analyses were carried out by combining clades 8 to 13, with no genes left in clades 14 and 15. For D. melanogaster, analyses were performed with the 12 strata. By using these genes, we found no correlation between E-value and age.

Gene age versus protein function

GO terms were obtained from the “dmelanogaster_gene_ensembl” and the “athaliana_eg_gene” tables in the Ensembl database (version 103), through the R package “biomaRt” [103]. A total of 7,253 (approximately 70% of the genes) and 15,604 (approximately 80% of the genes) genes were mapped in D. melanogaster and A. thaliana, respectively. To check whether the effect of gene age prevailed across functional protein classes, we analysed the GO terms with the highest number of young genes mapped: more than 50 genes present in clades 11 and 12 in D. melanogaster and more than 30 genes present in clades 12 to 15 in A. thaliana. This filtering step resulted in 6,637 genes across 23 GO categories in D. melanogaster (S31 Data) and 15,410 genes across 10 GO categories in A. thaliana (S32 Data). To analyse the effect of gene age, we compared 3 age classes. In D. melanogaster, the first age category spanned over phylostrata 1 to 3, the second category covered clades 4 to 7, and the third one included clades 8 to 12. In A. thaliana, the first category comprised genes belonging to clades 1 to 6, the second category spanned over clades 7 to 11, and the third one included the phylostrata between clades 12 and 15 (Fig 1A).

Gene age versus protein–protein interactions (PPIs)

We obtained PPI data for D. melanogaster from the STRING database [104], which includes both physical and functional interactions (https://string-db.org/). This database included 13,046 proteins with annotated interactions, which were used to analyse the distribution of protein networks across phylostrata.

The fitness effects of amino acid substitutions

We used Grantham’s physicochemical distances between amino acids [68] as a proxy for the fitness effects of amino acid substitutions and performed 2 analyses (see detailed R script in S4 File). In the first one, we calculated the average Grantham’s distance between amino acid substitutions for each phylostrata. In the second analysis, we estimated the average Grantham’s distance for adaptive () and nonadaptive () nonsynonymous substitutions within each age stratum. This analysis was performed by running Grapes across amino acid pairs separated by a single mutational step within categories of gene age. To run Grapes, we first estimated the synonymous and nonsynonymous SFS using the same approach as in Bergman and Eyre-Walker [105]. This method compares the nonsynonymous SFS for a particular amino acid pair, for example, alanine and aspartic acid, which are separated by a A <> C mutation, with the synonymous SFS of 4-fold degenerate sites separated only by A <> C mutations (SFS4). For amino acid pairs separated by more than 1 mutation type, we estimated the weighted average of the synonymous SFS of 4-fold sites for the different types of mutations, weighting by the frequency of the respective codons. For example, serine and cysteine are separated by A <> T and C <> G mutations. The synonymous SFS for this amino acid pair was estimated as SFS4(weighted) = [(f + f + f + f) * SFS4 +(f + f + f + f) * SFS4] / (f + f + f + f + f + f, where f corresponds to the frequency of each codon (e.g., f is the frequency of the codon AGT). It was not possible to estimate the rates of adaptive and nonadaptive substitution for each pair of amino acids for each phylostratum because we have insufficient data. So, we combined amino acid pairs into 10 categories based on their Grantham’s distance, and we combined clades 5 to 6, 8 to 11, and 12 to 15 in Arabidopsis and clades 3 to 4, 5 to 7, and 8 to 10 in Drosophila. We then ran Grapes to estimate ω and ω for the 10 categories of Grantham’s physicochemical distances for each combined age stratum. We further calculated the average Grantham’s distance for adaptive () and nonadaptive () nonsynonymous substitutions using the values of ω and ω estimated with Grapes and applying the following equations: and , where the sum is across the 10 categories of Grantham’s distances within each age class, ω and ω are the rates of adaptive and nonadaptive substitution in the ith category for each age class, is the average Grantham’s distance among the amino acid pairs within each Grantham’s distance category, and N is the number of nonsynonymous sites in the ith category in each age class. Higher values of (the same for with nonadaptive substitutions) mean that there is a higher proportion of adaptive substitutions across Grantham’s distance categories.

Assessing the correlation between ω and ω

To test whether the impact of gene age on ω was independent of ω, we ran Grapes for a subset of genes for which the correlation between ω and gene age was no longer significant (see detailed R script in S7 File). We controlled for the variation in ω by discarding genes according to their p/p values. To ensure a sufficient number of genes, we kept the genes that were within 0.02 and 0.05 standard deviation from the modal value of the p/p distribution in Arabidopsis (mode: 0.05) and Drosophila (mode: 0.01), respectively. The subset of data included 8,652 and 2,570 genes for Arabidopsis and Drosophila, respectively.

Statistical analyses

Assessing the effect of gene age within each protein functional class was performed by comparing rate estimates between all pairs of age categories. A total of 100 bootstrap replicates were generated and ω and ω were estimated for each resampling, allowing us to compute the rate differences between categories. A one-tailed p-value can be obtained using the formula P = (k + 1)/ (N + 1), where N = 100 is the number of bootstrap replicates and k is the number of times the computed difference was greater (resp. lower) than 0. Here, we used a two-tailed version of this test, computing the p-value as P = [2 * min (k−, k+) + 1] / (N + 1), where k− is the number of times the difference was negative, and k+ is the number of times the difference was positive. P-values for all pairwise comparisons were corrected for multiple testing using the FDR method [106] as implemented in R [107] (see detailed R script in S3 File). For the analysis with PPI and gene age, statistical significance was assessed using nonparametric post hoc tests, as implemented in the “Kruskal” method of the R package “agricolae” using the FDR method to correct for multiple testing [108] (see detailed R script in S5 File). For the rest of the analyses, statistical significance was assessed with Kendall’s correlation tests using the mean value of the 100 bootstrap replicates for each category (see detailed script in S6 File). To estimate the combined p-value for each cofactor, we used the weighted-Z method using the R package “metap” [109]. We estimated the weight of each p-value using a linear modelling approach with ω and ω as response variables and gene age and potential cofactors as explanatory variables and inferred the reciprocal of the squared standard error of the residuals in each model (see detailed R scripts in S8 File). To determine whether the chromosome impacted gene age’s effect on estimates of ω and ω, we performed an analysis of covariance (ANCOVA) by comparing a model M1 that included the impact of the chromosome, age, and their interaction, with a model M0 that included age only (see detailed R script in S1 File). Normality, homoscedasticity, and independence of the error terms of the model were assessed with the package “lmtest” [110] in R. Finally, to further assess the effect of residual correlations between gene age and each cofactor, we fitted linear models with the rates of protein evolution (i.e., ω, ω and ω) as response variables and the average value of the cofactor for each age class as an explanatory variable (nested within the cofactor category, low or high), together with gene age, species (Drosophila or Arabidopsis), cofactor category, and their interaction. A model selection was conducted using the R function “step” and the significance of the model coefficients was assessed for the selected models (see detailed R script in S9 File).

MK regression

The MK regression analysis was performed per nucleotide site, where we analysed the combined effect of RSA, intrinsic protein disorder, protein length, gene expression, protein function, Grantham’s distances, and the sex chromosome in Drosophila. As this method does not automatically handle categorical variables, we created a binary variable for protein function based on the gene’s presence (1) or absence (0) in stress-related proteins (i.e., response to stress, response to external stimulus and signal transduction). We used the same rationale for the X chromosome effect in Drosophila. We performed the MK regression analysis by comparing polymorphic and divergence sites between nonsynonymous and 4-fold degenerated sites in both species. As this method does not account for the effects of weak selection [57], we analysed only sites with an allele frequency above 50%. To assess the relative contribution of each variable to the regression model, we compared the likelihood pseudo-R2 following the Cox and Snell method [59]. These were estimated by comparing the log-likelihood of the full model (all factors included) with that from each reduced model (each factor removed) using the following equation: where N represents the number of sites analysed with the MK regression, ln represents the log-likelihood of the model with all factors included, and ln represents the log-likelihood of the model excluding each factor (e.g., to assess the contribution of the gene age variable, the full model was compared to a model including all variables except gene age).

Kendall’s correlation coefficients for the relationship between the mean value of each cofactor in each age class and ω, ω, and ω for each “high” and “low” group.

The Kendall’s correlation coefficients for the relationship between the cofactor and gene age in each “high” and “low” group are presented in the column “co-factor~Age”. The data (S14–S18 Data tables) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (XLSX) Click here for additional data file.

Linear models’ estimates when using the rates of protein evolution (ω, ω, and ω) as response variables and the mean value of the cofactor per category as a putative explanatory variable, together with gene age, the species (Arabidopsis or Drosophila), cofactor-category, and their interaction with gene age.

The data (S14–S17 Data tables) and code needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (XLSX) Click here for additional data file.

MK regression estimates, z-scores, and respective p-values. The data (S19–S22 Data tables) needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

(XLSX) Click here for additional data file.

Partial R2 estimates for each factor analysed in the MK regression analysis.

(XLSX) Click here for additional data file. Relationship between gene age and gene length (a) and gene expression (b) for This analysis was performed by categorizing gene age according to the clades defined in Fig 1A. For each clade, the median value of gene length and gene expression is depicted with the black dot. The shaded area represents the values of gene length and mean expression levels within the first and third quartile. The data (S2 and S3 Data tables) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file. Relationship between gene age and RSA (a) and protein intrinsic disorder (b) for Legend as in S1 Fig. The data (S4 and S5 Data tables) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file. Relationship between gene age and E-values before (a) and after (b) the E value correction for Each black dot represents a gene and median E value for each clade is represented with the black line in the boxplot. The data (S23 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Estimates of ω, ω, and ω plotted as a function of gene age by correcting for the E-value on BLAST’s searches in A. thaliana (top) and D. melanogaster (bottom).

Mean values of ω, ω, and ω for each category are represented with the black points. Error bars denote for the 95% confidence interval for each category, computed over 100 bootstrap replicates. The data (S24 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file. The number of young genes for the respective protein function in (a) The data (S25 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file. Estimates of ω, , and plotted as a function of protein function and gene age in (a) Categories are ordered according to the values of ω. Gene age categories are ordered from old (1) to young (3). Mean values of ω, ω, and ω for each class are represented with the black points. Error bars denote the 95% confidence interval for each category, computed over 100 bootstrap replicates. The data (S26 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Relationship between the proportion of adaptive () and nonadaptive () substitutions and gene age.

Each point represents the weighted average for each age category. A linear model was fitted between gene age and Grantham’s distances values and is represented with the blue line. Statistical significance was assessed with a Pearson’s correlation test and the respective correlation coefficient (R) and p-values (p) are shown in each plot. The data (S28 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Frequency of genes belonging to clades 6–7 for the respective protein function in D. melanogaster.

The frequency of genes for each functional category was estimated by dividing the number of genes present in these clades by the total number of genes annotated with the respective function. The data (S25 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Distribution of the degree of PPI values in D. melanogaster.

The statistical group for each clade is represented. The black line represents the median value of PPI for each clade and black dots denote the outliers of the distribution. The y-axis is scaled with a square root function. The data (S29 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Estimates of ω, ω, and ω plotted as a function of gene age by correcting for the variation in p/p in A. thaliana (top) and D. melanogaster (bottom).

The Kendall’s correlation coefficients are shown with the respective significance (*P < 0.05; **P < 0.01; ***P < 0.001; “.” 0.05 ≤ P < 0.10). Legend as in S4 Fig. The data (S30 Data) and code needed to generate this figure can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430. (PDF) Click here for additional data file.

Gene age analysis (Fig 1B).

(CSV) Click here for additional data file.

Correlation between gene age and protein length (S1A Fig).

(CSV) Click here for additional data file.

Correlation between gene age and gene expression (S1B Fig).

(CSV) Click here for additional data file.

Correlation between gene age and RSA (S2A Fig).

(CSV) Click here for additional data file.

Correlation between gene age and protein disorder (S2B Fig).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω while controlling for protein length (Fig 2A).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω while controlling for gene expression (Fig 2B).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω while controlling for RSA (Fig 2C).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω while controlling for protein disorder (Fig 2D).

(CSV) Click here for additional data file.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of protein length.

(CSV) Click here for additional data file.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of gene expression.

(CSV) Click here for additional data file.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of RSA.

(CSV) Click here for additional data file.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of protein disorder.

(CSV) Click here for additional data file.

Correlation between the mean value of RSA within each category and ω, ω, and ω.

(CSV) Click here for additional data file.

Correlation between the mean value of protein disorder within each category and ω, ω, and ω.

(CSV) Click here for additional data file.

Correlation between the mean value of protein length within each category and ω, ω, and ω.

(CSV) Click here for additional data file.

Correlation between the mean value of gene expression within each category and ω, ω, and ω.

(CSV) Click here for additional data file.

Correlation between the mean value of each cofactor within each category and gene age.

(CSV) Click here for additional data file.

Input data for 4-fold degenerate sites to run the MKRegression in Drosophila.

(GZ) Click here for additional data file.

Input data for 0-fold degenerate sites to run the MKRegression in Drosophila.

(GZ) Click here for additional data file.

Input data for 4-fold degenerate sites to run the MKRegression in Arabidopsis.

(GZ) Click here for additional data file.

Input data for 0-fold degenerate sites to run the MKRegression in Arabidopsis.

(GZ) Click here for additional data file.

Correlation between E-values and gene age before and after the correction (S3 Fig).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω after controlling for the variation in E-values (S4 Fig).

(CSV) Click here for additional data file.

Correlation between gene age and protein function (S5 and S8 Figs).

(CSV) Click here for additional data file.

Correlation between gene age and ω, ω, and ω while controlling for protein function (S6 Fig).

(CSV) Click here for additional data file.

Correlation between gene age and Grantham’s distance (Fig 3).

(CSV) Click here for additional data file.

Correlation between gene age and and (S7 Fig).

(CSV) Click here for additional data file.

Correlation between gene age and protein–protein interactions (S9 Fig).

(GZ) Click here for additional data file.

Correlation between gene age and ω, ω, and ω after controlling for p/p (S10 Fig).

(CSV) Click here for additional data file.

All data per gene for Drosophila melanogaster.

(GZ) Click here for additional data file.

All data per gene for Arabidopsis thaliana.

(GZ) Click here for additional data file.

Detailed R script of the gene age analysis.

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the variation in E-values.

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for protein function.

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and Grantham’s distances.

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and protein–protein interactions (PPI).

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the protein length, gene expression, RSA, and protein intrinsic disorder.

(PDF) Click here for additional data file.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the variation in p/p.

(PDF) Click here for additional data file.

Detailed R script of the analysis of the combined probabilities for each of the cofactors within and across species using the weighted Z-method.

(PDF) Click here for additional data file.

Detailed R script of the extra analyses performed to correct for intracategory variation of the cofactors.

(PDF) Click here for additional data file. 10 Nov 2021 Dear Dr Moutinho, Thank you for submitting your manuscript entitled "Testing the adaptive walk model of gene evolution" for consideration as a Research Article by PLOS Biology. Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review. However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire. Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed the checks it will be sent out for review. If your manuscript has been previously reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time. If you would like to send your previous reviewer reports to us, please specify this in the cover letter, mentioning the name of the previous journal and the manuscript ID the study was given, and include a point-by-point response to reviewers that details how you have or plan to address the reviewers' concerns. Please contact me at the email that can be found below my signature if you have questions. Please re-submit your manuscript within two working days, i.e. by Nov 12 2021 11:59PM. Login to Editorial Manager here: https://www.editorialmanager.com/pbiology During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit. Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible. Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission. Kind regards, Roli Roberts Roland Roberts Senior Editor PLOS Biology rroberts@plos.org 20 Jan 2022 Dear Dr Moutinho, Thank you for submitting your manuscript "Testing the adaptive walk model of gene evolution" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers. You'll see that while the reviewers are broadly positive about your study, they each raise a number of concerns that must be addressed before further consideration. These include several potential confounders that may not be fully controlled for, requests for further methodological clarification, concerns about your statistical treatment, some potential alternative explanations, and requests for more open-minded interpretation. In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers. We expect to receive your revised manuscript within 3 months. Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology. **IMPORTANT - SUBMITTING YOUR REVISION** Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript: 1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript. *NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point. You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response. 2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type. *Re-submission Checklist* When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record. Please make sure to read the following important policies and guidelines while preparing your revision: *Published Peer Review* Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details: https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/ *PLOS Data Policy* Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5 *Blot and Gel Data Policy* We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements *Protocols deposition* To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments. Sincerely, Roli Roberts Roland Roberts Senior Editor PLOS Biology rroberts@plos.org ***************************************************** REVIEWERS' COMMENTS: Reviewer #1: The manuscript by Moutinho et al. seeks to examine molecular adaptation and adaptive walks in natural populations of Drosophila and Arabidopsis. The adaptive walk model predicts that the initial steps of adaptation will consist primarily of many mutations of larger effects. Then, as the population gets closer to the fitness optimum, more small-effect mutations will predominate. Testing this prediction has been challenging for a number of reasons, some of which are related to the myriad of genomic confounders of genomic studies in natural populations. In this study, the authors hypothesize that newly arisen genes will be in the earlier stages of adaptive walks, experiencing more adaptive mutations and larger effect mutations while older genes will have smaller effect mutations. Indeed, by analyzing patterns of polymorphism and divergence in multiple species, the authors find support for this model. Specifically, they find that younger genes tend to have a higher rate of adaptive evolution than older genes. Further, substitutions in younger genes appear to result in changes that are less chemically similar to substitutions seen in older genes. Overall, I found this paper to be a very creative and elegant study of an important fundamental question in evolutionary biology. The methods appear generally robust and the manuscript is clearly written. However, I have a number of concerns and suggestions on how to improve the manuscript. Major comments: 1. Figure 2 and related analyses of confounders: I appreciate that the authors have attempted control for many of the confounders of other factors like gene expression level, protein length, and the amount of intrinsic disorder among proteins that can affect the rate of evolution and rate of adaptation. Indeed, the correlation between rate of adaptation and gene age seems to persist, even when only analyzing genes that are matched on the confounder. However, the analyses seemed to only control for a single confounder at a time. Do associations between the rate of adaptation and the other confounders persist when controlling for a single confounder. Put another way, does the association between rate of adaptation and protein length persist when controlling for expression level, for example? I wonder whether the main effect of gene age on evolutionary rate would still persist when somehow controlling for multiple confounders simultaneously? I realize that the authors indicate they cannot stratify into additional groups because they will not have enough genes in each group for a meaningful comparison. However, might some of the confounders also be correlated with each other? If so, might it possible to control for the joint effects of the confounder? 2. Related to point 1 above and controlling for confounders—In figure 2 the authors report diving the genes into 2 equal groups based on the confounder and then testing for the main effect within each group. I'm not sure whether this is sufficient to adequately remove the effects of the confounder. For example, in Figure 2b, do genes in the "high expression" group still show a correlation between gene expression and rates of evolution? I worry that there will still be some variability in gene expression within each of the two large groups that could account for some of the association between the gene age and the rate of evolution. My comment holds for the other confounders as well. 3. Lines 212-241: The authors correctly are concerned that young and old genes may have different biological functions. If the young and old genes have different biological function, they may have different distributions of fitness effects and could have different rates of adaptation because of the differences in gene function, rather than the genes only differing in terms of where they are in adaptative walks. The authors attempt to control for this confounder using GO analyses. However, I'm not sure their control is adequate. Might it be possible to match the sets of "younger" and "older" genes based on having a similar set of GO terms and then testing whether the rates of adaptation differ between the younger and older genes? This matching on GO terms might better ensure that genes being compared have similar functional properties. 4. Lines 493-508: I like the analysis of amino acid exchangeability and gene age. However, I had a really hard time understanding how "G_a" and "G_na" were estimated in these analyses. I found the explanation in the Methods section to be rather vague and unclear, For example, "f_AGT" wasn't defined. Related, how well does this method using the SFS to estimate the rate of adaptation actually work for estimating G_a? Given the importance of these results for the overall conclusions of the paper, I think more description is required as well as some way of showing that the methods work. I tried looking at Bergman and Eyre-Walker [106], but that didn't help clarify things as much as needed. Minor comments: 1. Line 76: Also, experimental studies are limited to only certain organisms. 2. Line 204: The use of the word "prevailed" here seems a little awkward. Maybe say "remained" or "persisted" instead. 3. It would be informative to include the P-values somehow on each plot in Figure 2. 4. Figure S7: P_a and P_na are listed in the caption but G_a and G_na are listed in the figure labels. Are these the same thing? Please clarify. I'm also confused by the point of this figure more generally. Please explain more. Reviewer #2: In this manuscript the authors test Fisher's geometric model and the idea of the adaptive walk where populations further away from the fitness optimum are more likely to fix beneficial mutations of larger effect first and of smaller effects later. They do so by testing this hypothesis in old vs new genes in Arabidopsis and Drosophila. Overall, I find the idea interesting and definitely worth testing. However, I'm not convinced of the results. Here are some specific and detailed comments: Major Comments: - I'm not entirely convinced of the results, which seem a bit over-sold throughout the manuscript without the explicit listing of caveats. a) For instance, from reading the abstract it appears that you have really estimated the selection coefficients of the adaptive substitutions; however you have used only a proxy for that. b) Table S1 does show a positive correlation with gene age but there's a much stronger correlation with other factors like gene length and RSA. c) I'm also not convinced of the positive correlations between w_a and gene age, especially after you account for gene length and disorder (Figure 2a Arabidopsis; Figure 2d both). They appear extremely weak, unfortunately. I would urge the authors to tone down the interpretation of their observations and discuss the caveats more. -It would be a good idea to more thoroughly describe the idea behind using Grantham's distance in the Results section. It's not exactly a standard test and is not used very widely. -In this particular study, I'm not really sure that using Grantham's distance is the best way to show that fitness effects of substitutions are large for young genes. If young genes are more likely to be disordered, they might also be less likely to have strong fitness effects when the physicochemical distances are larger. So this appears to be a confounding factor here. Have you looked at the correlations between Grantham's distance and gene age while accounting for the confounding factors? If I missed it, I apologize! Minor Comments: -Lines 131-134 -> perhaps elaborate on the methods used here? -line 139 -> briefly explain how you got gene age. Reviewer #3: [identifies himself as Nicolas Lartillot] This manuscript explores the relation between gene age (such as determined by phylostratigraphy) and the rate and patterns of adaptive and non adaptive evolution (such as measured by MacDonald Kreitman approaches), in Drosophila and Arabidopsis. The main results are as follows: - younger genes undergo higher rates of molecular evolution, both adaptive and non adaptive; this correlation appears to be robust when controlling for multiple confounding factors; - younger genes tend to make more radical amino-acid changes, and this, both for adaptive and non-adaptive events. These results are interpreted in terms of an adaptive walk model of gene adaptation and subsequent molecular evolution. This is a very interesting article. The results presented in it appear to be robust, and the evolutionary hypotheses are stimulating. The article is also well written: very clear, it is a pleasure to read it. I highly recommend it for publication. I would just have two main comments, one concerning the statistical details of the methods, and another one concerning the interpretation of the results. Then, there are a few very minor comments below at the end. 1. Statistical analyses: strength of evidence versus strength of correlation. As a general rule, it is useful to discriminate between strength of evidence, on one side, and strength of correlation on the other side. In a typical statistical analysis, information is given separately about the two aspects (typically, a p-value for the strength of evidence, and then a correlation coefficient for the strength of correlation). Strength of evidence scales with sample size (p-values converge to 0 for larger data samples), whereas strength of correlation is a population-level property (correlation coefficients don't systematically increase, they are just more accurately estimated, with larger data samples). In the present case, however, it seems to me that there is a latent confusion between these two aspects. For instance, the kendall's tau values presented in table 1 are correlation coefficients, so at first sight, one would be tempted to interpret them as a measure of the strength of correlation between gene age and, e.g. the rate of adaptive evolution. Sentences such as 'the effect of gene age prevailed for all estimates in the two species (omega: tau = 0.929, p =1.30e-03)' implicitly convey this message that tau is giving a measure of the strength of correlation, while p is giving a measure of the strength of evidence for this effect size. However, these tau values are very close to 1, some of them are even essentially equal to 1. This surprised me at first sight. But then I realized that, given the design of the experiment, this is just a consequence of sample size: these tau values measure the correlation between bin median age and bin *mean* effect. The mean effect of a given bin is an average over many genes, and thus much of the variance in gene adaptive or non-adaptive rate has already been factored out at that step. Ultimately, these tau values should all be equal to 1 for very large numbers of genes per bin, which suggests that they should certainly not be taken as a measure of the intrinsic, population-level, strength of the correlation between gene age and other quantities such as omega_a. They cannot be intrinsic, since they scale with sample size (or, here, with bin size). Similarly, the confidence intervals displayed on the figures such as figure 1 are obtained by bootstrapping the genes within each bin. But then this means that their width should shrink as 1/sqrt(n), with n the number of genes per bin. So, again, the confidence regions displayed on the figures represent the strength of evidence for the correlation patterns, but not directly the strength of the correlation itself: again, they scale with sample size. Conversely, I don't see any statistical quantity across the article, whether on the text, tables, or figures, that could be taken as a measure of the intrinsic strength of correlation at the gene level: basically, a measure of how much the age of a randomly chosen gene can predict the evolutionary/adaptive rate of this gene (typically, how much of the variance is explained by the covariate). This is missing a lot, I think. In this respect, there is a point in the discussion: "we observe that young genes present a 25-fold higher rate of adaptation than older genes in Drosophila species and around 30-fold higher in Arabidopsis. " -> this sounds like a measure of effect size; but again, it is only about the mean for a given age class; also it is related to the slope of the regression of omega versus gene age, but not to the strength of the correlation. If we take the linear regression case, which is simpler to understand, you can have a steep relation between Y and X (i.e. mean[Y|X=x_highest] can be much higher than mean[Y|X=x_lowest], while still having a weak correlation (var[Y | X] still large, compared to the variance of mean[Y|X] over X). And thus, I was wondering if the Authors could think a bit about this point and clarify and, perhaps even, enrich, this aspect of their statistical analyses. - clarify: clearly say, in the main text, or figures, or table legends, whenever a correlation coeff or a confidence interval scales with sample size (possibly indirectly, i.e. by playing with the number of bins, discretization scheme, etc). perhaps expand a bit on the fact that all this does not really measure the intrinsic strength of the correlation between gene age and gene rate of evolution; - enrich: if possible, give some meaningful measure of the intrinsic gene-level strength of correlation between gene age and molecular evolutionary rates, or proportion of variance explained. I think one way to estimate the proportion of variance explained would be to compare the bootstrap variance obtained in the experiments done here, with the same bootstrap variance but in a control experiment where genes have been randomly reshuffled across bins, while keeping the same number of genes per bin (thus erasing all information about gene age). If gene age is a good predictor, then the first variance should be smaller than the second, and 1 - v1/v2 should be a measure of the percentage of variance explained by gene age. Another quicker way would be as follows: assuming that omega_a (or omega_na) is an additive property, such that the mean omega_a for a set of genes is just, conceptually, the mean of the n gene-specific omega_a's, then it would make sense to just inflate the bootstrap variance estimates by n; this should give a rough estimate of the intra-class (i.e. gene-level) true variance of the effect being measured. And then it is relatively simple to compare this intra-class variance (averaged over all bins) with the inter-class variance of the means. One problem is that genes are of varying length, so one should perhaps inflate the variances, not by the true but by the effective number of genes: n_eff = (sum_i L_i)^2 / (sum_i L_i^2), where L_i is the length of gene i. There are probably better approaches than those suggested here. Of note, I don't think it is a problem if gene age turns out to explain a small proportion of the total variance. Molecular evolutionary patterns are always rather subtle, so one should not consider this possible outcome as a weakness in itself. But it is just that it would be nice to have at least some hint, some quantitative evaluation, or some discussion, about this in the manuscript. In any case, it is important to clarify and to avoid any misunderstanding about the meaning of the statistical measures of strength of signal that are presented. 2. Interpretation of the results. The interpretation of the results in terms of adaptive walk is definitely an interesting one. However, I could think of other interpretations. For instance: Less constrained genes are more easily lost: they can more easily accommodate mis-sense mutations, so they can probably also more easily accommodate non-sense mutations. Therefore, on average, less constrained genes are younger (because the older ones have been lost). In addition, since they accept a larger number of mutations that are otherwise not too deleterious for the folding and for the primary function of the protein, less constrained genes also represent a bigger mutation target for serendipitous adaptations. So less constrained proteins show a higher rate of adaptive evolution. And thus, younger genes, which are enriched in less constrained genes, show a higher rate of both non-adaptive and adaptive evolution. Of note, this mutational target size argument also explains why omega_na and omega_a are correlated, irrespective of gene age. Also, note that the alternative interpretation just suggested is based on a stationary scenario. In contrast, the adaptive walk idea is fundamentally non-stationary. To be clear: this is not to dismiss the interpretation proposed by the Authors. But it is just that I found the manuscript perhaps a bit too exclusively oriented toward one single 'story' for explaining the pattern, and this at the cost of a broader - and richer - discussion about what could be responsible for these observations. Perhaps, in their discussion, the Authors could give some hints as to other possible interpretations; also, they could give some suggestions as to how one could, in the future, discriminate between these alternative explanations. For instance, since the interpretation proposed by the Author is deeply committed to a non-stationary pattern, it should be detectable by estimating variation in dN/dS across a phylogenetic tree: analyses along those lines would definitely be an interesting perspective, as a way to discriminate between stationary or non-stationary explanations of these findings more generally. Minor points: The overall effect of gene age on omega_a and omega_na in each co-factor was assessed by... -> not clear what is meant by overall effect in each co-factor. perhaps rephrase? To do so, we first assessed the correlation of gene age with the rates of molecular evolution in distinct categories of genes, according to a putative confounding factor. -> not totally clear, phrased like this. Am I correct to understand this: We first sorted genes into classes, according to a putative confounding factor, and then assessed the correlation of gene age with the rates of molecular evolution within each class? page 5, lines 59: tau = -8.48 ?? not clear why there are two p-values and two tau values. perhaps be more explicit. - 'controls for confounding effects are only considering two categories (low and high)' is this control sufficiently tight? Isn't there still some gene age / gene length (or other confounding factor) stratification within each class? In fact, this point could be tested internally: do you still see a significant correlation between gene age and e.g. gene length within the high or within the low class ? An alternative control would have relied on a bins of differing gene ages that are matched for their underlying distribution for gene length (or for any other counfounding factor), although it is not clear to me whether it would be easy to do such matched subsampling in the present case while still guaranteeing sufficiently large sample size within each bin. Of note, the Authors are also using an alternative approach, using linear regression, based on Huang, 2021, which makes their analysis much less dependent on this single control experiment. - 'We first examined which functions are encoded by young genes in A. thaliana and D. melanogaster...' -> why only young genes? Wouldn't that make sense to contrast young versus old ? Exactly like the correlation between gene length and gene age was verified before controlling for gene length, earlier in the manuscript, wouldn't that make sense here to test whether some functions are over- or under-represented in young genes versus old genes, before trying to control for this? - 'When looking at omega_na, our analyses revealed a strong influence of gene age in most functions analysed in both species, where young genes present higher rates of non-adaptive substitutions.' -> phrasing is slightly ambiguous. Strong influence of gene age on omega_na within most functional classes? 8 Jun 2022 Submitted filename: Response_reviewers-v5.pdf Click here for additional data file. 7 Jul 2022 Dear Dr Moutinho, Thank you for your patience while we considered your revised manuscript "Testing the adaptive walk model of gene evolution" for publication as a Research Article at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors, the Academic Editor and the original reviewers. Based on the reviews, we are likely to accept this manuscript for publication, provided you satisfactorily address the remaining points raised by the reviewers and the following data and other policy-related requests. IMPORTANT: a) Please make your title more explicit and declarative. We suggest "Data from Arabidopsis and Drosophila provide strong evidence for the adaptive walk model of gene evolution" or "Strong evidence for the adaptive walk model of gene evolution in both Drosophila and Arabidopsis." b) Please attend to the remaining requests from reviewer #1. c) Please address my Data Policy requests below; specifically, we need you to supply the numerical values underlying Figs 2ABCDEF, 3ABCD, 4ABCDEFGH, 5AB, S1, S2, S3AB, S4AB, S5ABCDEF, S6AB, S7ABC, S8. I note that your Github deposition currently contains code for the simulations, but please also add the output data. In addition, we need a citeable, permanent record of the data, e.g. in Zenodo, Dryad etc. d) Please also cite the location of the data clearly in each Fig legend, e.g. “The data and code needed to generate this Figure can be found in https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/XXXXXX.” As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript. We expect to receive your revised manuscript within two weeks. To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following: - a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list - a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable) - a track-changes file indicating any changes that you have made to the manuscript. NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines: https://journals.plos.org/plosbiology/s/supporting-information *Published Peer Review History* Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details: https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/ *Press* Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article. *Protocols deposition* To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols Please do not hesitate to contact me should you have any questions. Sincerely, Roli Roberts Roland Roberts, PhD Senior Editor, rroberts@plos.org, PLOS Biology ------------------------------------------------------------------------ DATA POLICY: You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797 Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms: 1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore). 2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication. Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 2ABCDEF, 3ABCD, 4ABCDEFGH, 5AB, S1, S2, S3AB, S4AB, S5ABCDEF, S6AB, S7ABC, S8. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values). IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend. Please ensure that your Data Statement in the submission system accurately describes where your data can be found. ------------------------------------------------------------------------ DATA NOT SHOWN? - Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s). ------------------------------------------------------------------------ REVIEWERS' COMMENTS: Reviewer #1: The authors have addressed my previous concerns about the analyses and have vastly improved the manuscript. I just have a few minor comments to help improve presentation: 1) Lines 245-250: This part was a little unclear. I think a sentence or two are needed toward the beginning of this paragraph giving an overview of what the authors are doing. My understanding is that the authors wish to control for the effect of protein function as a potential confounder on the relationship between omega and gene age. So, they split the genes into different functional categories based on GO terms. Then, within a given GO term, the genes were further divided into different age categories. Grapes was then used to estimate omega. Consider giving an overview like this before delving into the details of how the groups were assigned, etc. 2) Figure S7: The caption here is a little unclear still. Ga is called the proportion of adaptive substitutions and later it's called the "Grantham's distance values" (when referring to the regression line). I think these are the same quantities, but it would be cleaner to use one term in the figure. 3) Methods, section starting on line 520: While this is a little clearer than the previous version of the manuscript, I still think it's hard to follow. I might suggest re-ordering different parts. Specifically, I would first talk about the Grantham distances (like what is done up to line 526). Then, I would suggest talking about estimating omega and Grapes next (ie I would move the parts in lines 539-550 earlier). Lastly, I would talk about the details of inferring the proportion of adaptive nonsynonymous substitutions using the method of Bergman and Eyre-Walker. Right now, it's hard to know why that method is used and what it's being used for. Explicitly explaining that too would make the manuscript clearer here. Reviewer #2: The authors have sufficiently addressed my concerns. Reviewer #3: The Authors have addressed my comments. I find the experiment controlling for omega_na particularly interesting, in the idea to try to discriminate between the adaptive walk hypothesis and the static model. I still think that more direct tests of the adaptive walk could be considered, e.g. one should see a trend for omega over time. But that's left for future work.. congrats for this beautiful manuscript. 22 Jul 2022 Submitted filename: Response_reviewers_r2.pdf Click here for additional data file. 1 Aug 2022 Dear Dr Moutinho, Thank you for the submission of your revised Research Article "Strong evidence for the adaptive walk model of gene evolution in Drosophila and Arabidopsis" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Claudia Bank, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes. Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process. PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf. We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/. Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. Sincerely, Roli Roberts Roland G Roberts, PhD, PhD Senior Editor PLOS Biology rroberts@plos.org

96 in total

1. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment.

Authors: Michael Remmert; Andreas Biegert; Andreas Hauser; Johannes Söding
Journal: Nat Methods Date: 2011-12-25 Impact factor: 28.547

2. Estimation of the neutrality index.

Authors: Nina Stoletzki; Adam Eyre-Walker
Journal: Mol Biol Evol Date: 2010-09-13 Impact factor: 16.240

3. GLOOME: gain loss mapping engine.

Authors: Ofir Cohen; Haim Ashkenazy; Frida Belinky; Dorothée Huchon; Tal Pupko
Journal: Bioinformatics Date: 2010-09-27 Impact factor: 6.937

4. Rosid radiation and the rapid rise of angiosperm-dominated forests.

Authors: Hengchang Wang; Michael J Moore; Pamela S Soltis; Charles D Bell; Samuel F Brockington; Roolse Alexandre; Charles C Davis; Maribeth Latvis; Steven R Manchester; Douglas E Soltis
Journal: Proc Natl Acad Sci U S A Date: 2009-02-17 Impact factor: 11.205

5. Bio++: efficient extensible libraries and tools for computational molecular evolution.

Authors: Laurent Guéguen; Sylvain Gaillard; Bastien Boussau; Manolo Gouy; Mathieu Groussin; Nicolas C Rochette; Thomas Bigot; David Fournier; Fanny Pouyet; Vincent Cahais; Aurélien Bernard; Céline Scornavacca; Benoît Nabholz; Annabelle Haudry; Loïc Dachary; Nicolas Galtier; Khalid Belkhir; Julien Y Dutheil
Journal: Mol Biol Evol Date: 2013-05-21 Impact factor: 16.240

Introduction

Results

Young genes have a higher rate of adaptive substitutions

The effect of gene age on the rate of molecular adaptation is robust to multiple confounding factors

The effect of gene age on the molecular rate of adaptation is robust to BLAST’s false negative rates

The effect of gene age on the rate of molecular adaptation does not depend on protein function

Substitutions in young genes have larger effect sizes

Discussion

The magnitude effect of gene age on adaptive evolution is species-specific

An adaptive walk model of gene evolution

Material and methods

Estimation of the adaptive and nonadaptive rate of nonsynonymous substitutions

Gene age versus protein length and gene expression

Gene age versus protein structure

Correcting for BLAST E-values

Gene age versus protein function

Gene age versus protein–protein interactions (PPIs)

The fitness effects of amino acid substitutions

Assessing the correlation between ω and ω

Statistical analyses

MK regression

Kendall’s correlation coefficients for the relationship between the mean value of each cofactor in each age class and ω, ω, and ω for each “high” and “low” group.

MK regression estimates, z-scores, and respective p-values. The data (S19–S22 Data tables) needed to generate this table can be found at https://gitlab.gwdg.de/molsysevol/supplementarydata_geneage and https://zenodo.org/record/6828430.

Partial R2 estimates for each factor analysed in the MK regression analysis.

Estimates of ω, ω, and ω plotted as a function of gene age by correcting for the E-value on BLAST’s searches in A. thaliana (top) and D. melanogaster (bottom).

Relationship between the proportion of adaptive () and nonadaptive () substitutions and gene age.

Frequency of genes belonging to clades 6–7 for the respective protein function in D. melanogaster.

Distribution of the degree of PPI values in D. melanogaster.

Estimates of ω, ω, and ω plotted as a function of gene age by correcting for the variation in p/p in A. thaliana (top) and D. melanogaster (bottom).

Gene age analysis (Fig 1B).

Correlation between gene age and protein length (S1A Fig).

Correlation between gene age and gene expression (S1B Fig).

Correlation between gene age and RSA (S2A Fig).

Correlation between gene age and protein disorder (S2B Fig).

Correlation between gene age and ω, ω, and ω while controlling for protein length (Fig 2A).

Correlation between gene age and ω, ω, and ω while controlling for gene expression (Fig 2B).

Correlation between gene age and ω, ω, and ω while controlling for RSA (Fig 2C).

Correlation between gene age and ω, ω, and ω while controlling for protein disorder (Fig 2D).

Statistical results of the correlation between gene age and ω, ω, and ω within each category of protein length.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of gene expression.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of RSA.

Statistical results of the correlation between gene age and ω, ω, and ω within each category of protein disorder.

Correlation between the mean value of RSA within each category and ω, ω, and ω.

Correlation between the mean value of protein disorder within each category and ω, ω, and ω.

Correlation between the mean value of protein length within each category and ω, ω, and ω.

Correlation between the mean value of gene expression within each category and ω, ω, and ω.

Correlation between the mean value of each cofactor within each category and gene age.

Input data for 4-fold degenerate sites to run the MKRegression in Drosophila.

Input data for 0-fold degenerate sites to run the MKRegression in Drosophila.

Input data for 4-fold degenerate sites to run the MKRegression in Arabidopsis.

Input data for 0-fold degenerate sites to run the MKRegression in Arabidopsis.

Correlation between E-values and gene age before and after the correction (S3 Fig).

Correlation between gene age and ω, ω, and ω after controlling for the variation in E-values (S4 Fig).

Correlation between gene age and protein function (S5 and S8 Figs).

Correlation between gene age and ω, ω, and ω while controlling for protein function (S6 Fig).

Correlation between gene age and Grantham’s distance (Fig 3).

Correlation between gene age and and (S7 Fig).

Correlation between gene age and protein–protein interactions (S9 Fig).

Correlation between gene age and ω, ω, and ω after controlling for p/p (S10 Fig).

All data per gene for Drosophila melanogaster.

All data per gene for Arabidopsis thaliana.

Detailed R script of the gene age analysis.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the variation in E-values.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for protein function.

Detailed R script of the correlation between gene age and Grantham’s distances.

Detailed R script of the correlation between gene age and protein–protein interactions (PPI).

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the protein length, gene expression, RSA, and protein intrinsic disorder.

Detailed R script of the correlation between gene age and ω, ω, and ω while controlling for the variation in p/p.

Detailed R script of the analysis of the combined probabilities for each of the cofactors within and across species using the weighted Z-method.

Detailed R script of the extra analyses performed to correct for intracategory variation of the cofactors.

Review 7. The genomic rate of adaptive evolution.