Yanzhu Ji1,2, Shaohong Feng3,4,5, Lei Wu1,6, Qi Fang3,7, Anna Brüniche-Olsen8, J Andrew DeWoody9, Yalin Cheng1, Dezhi Zhang1, Yan Hao1, Gang Song1, Yanhua Qu1, Alexander Suh10,11, Guojie Zhang4,5,12,7,13,14, Shannon J Hackett2, Fumin Lei1,6,15. 1. Key Laboratory of Zoological Systematics and Evolution, Institute of Zoology, Chinese Academy of Sciences, Beijing 100101, China. 2. Negaunee Integrative Research Center, Field Museum of Natural History, Chicago, IL 60605, USA. 3. BGI-Shenzhen, Beishan Industrial Zone, Shenzhen 518083, China. 4. Future Health Laboratory, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing 314100, China. 5. Evolutionary and Organismal Biology Research Center, Zhejiang University School of Medicine, Hangzhou, China. 6. University of the Chinese Academy of Sciences, Beijing 100049, China. 7. Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark. 8. Section for Computational and RNA Biology, Department of Biology, University of Copenhagen, Copenhagen DK-2200, Denmark. 9. Departments of Forestry and Natural Resources and Biological Sciences, Purdue University, West Lafayette, IN 47906, USA. 10. School of Biological Sciences, Organism and Environment, University of East Anglia, NR4 7TU, Norwich, UK. 11. Department of Organismal Biology, Systematic Biology, Evolutionary Biology Centre (EBC), Science for Life Laboratory, Uppsala University, Uppsala SE-752 36, Sweden. 12. Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou 311121, China. 13. State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China. 14. Women's Hospital, School of Medicine, Zhejiang University, Shangcheng District, Hangzhou, 310006, China. 15. Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650201, China.
Abstract
The rate of mutation accumulation in germline cells can be affected by cell replication and/or DNA damage, which are further related to life history traits such as generation time and body mass. Leveraging the existing datasets of 233 neoavian bird species, here, we investigated whether generation time and body mass contribute to the interspecific variation of orthologous microsatellite length, transposable element (TE) length, and deletion length and how these genomic attributes affect genome sizes. In nonpasserines, we found that generation time is correlated to both orthologous microsatellite length and TE length, and body mass is negatively correlated to DNA deletions. These patterns are less pronounced in passerines. In all species, we found that DNA deletions relate to genome size similarly as TE length, suggesting a role of body mass dynamics in genome evolution. Our results indicate that generation time and body mass shape the evolution of genomic attributes in neoavian birds.
The rate of mutation accumulation in germline cells can be affected by cell replication and/or DNA damage, which are further related to life history traits such as generation time and body mass. Leveraging the existing datasets of 233 neoavian bird species, here, we investigated whether generation time and body mass contribute to the interspecific variation of orthologous microsatellite length, transposable element (TE) length, and deletion length and how these genomic attributes affect genome sizes. In nonpasserines, we found that generation time is correlated to both orthologous microsatellite length and TE length, and body mass is negatively correlated to DNA deletions. These patterns are less pronounced in passerines. In all species, we found that DNA deletions relate to genome size similarly as TE length, suggesting a role of body mass dynamics in genome evolution. Our results indicate that generation time and body mass shape the evolution of genomic attributes in neoavian birds.
Mutations are the primary material for evolution. Various mutations have been identified that underlie both adaptive and nonadaptive evolution. In addition, the rates of mutation accumulation vary wildly across species [e.g., ()] and covary with life history traits (see below). For example, the rate of nucleotide substitution covaries with body mass in mammals, birds, and poikilotherms (–). Universal machanisms, such as effects related to metabolism and/or generation time, have been posited to explain these data (, ).The correlation between body mass and mutation accrual is believed to reflect two covariates of body mass, both of which are related to the mechanisms of how mutations are generated (). Mutations that can be inherited arise in germline cells, either during DNA replication or when DNA is damaged but fails to be repaired. It is thus expected that, when compared across species, the accumulation of mutations would reflect the rate of DNA replication in germline cells and/or the degree of DNA damage and its repair, depending on the type of mutations (). For mutations that are generated during DNA replication, the rate of germline cell replication (i.e., the number of germline cell replications over time) dictates how fast these mutations accumulate. If we assume that species all undergo a relatively fixed number of cell cycles during one generation, then the generation time would be negatively related to the rate of germline cell replication and thus negatively affect how many mutations are generated within a given time frame. In contrast, the accumulation of damage-caused mutations that are not repaired efficiently should track time elapsed or the strength of mutation genesis. Recent studies have emphasized the previously overlooked effects of DNA damage on primate mutation accumulation (–). As exogenous factors that cause DNA damage tend to relate to environment and are highly unpredictable and transient, one of the main endogenous factors, the generation of reactive oxygen species (ROS), is tightly linked to the physiology of animals (, ). ROS are the by-product of mitochondrial oxidative phosphorylation (, ) and are determined by the level of aerobic activities, which is further related to body mass [e.g., (, )].We chose birds as our study system because the accumulated data on life history and phylogenomics make cross-species comparative studies feasible. Here, we focus on the effect of life history traits on three genomic attributes that reflect accumulated mutations in avian genomes. The genomic attributes include orthologous microsatellite length and transposable element (TE) length as representatives of mutations that at least partially reflect DNA replication. We also included DNA deletions as another genomic attribute to represent mutations that are caused by DNA damage and those mediated by nonallelic homologous recombination (NAHR), the latter possibly related to the density of TEs (). Specifically, microsatellites mutate when DNA strands fail to match each other during DNA synthesis, forming either insertions or deletions of repeat units. Furthermore, the mutations in microsatellites are believed to be biased toward insertions (–). We also include TE length, because the heritable mobilization of TEs, while related to the transcriptional and/or (retro)transpositional activities of TEs themselves and host defenses against TE invasions, is highly dependent on resetting the epigenetic marks in germline cells during replication that likely influence the rate of TE incorporation into avian genomes [e.g., ()]. The length of orthologous microsatellites and TEs (under the assumption of similar activities) should reflect rounds of germline cell replication and be associated with generation time. In comparison, the formation of DNA deletions is frequently accompanied by imperfect repair of DNA double-strand breaks (–).Together, the compiled data on avian generation time (), body mass (), and phylogenomics (, ) enabled us to explicitly test the association between avian life history traits (i.e., generation time and body mass) and genomic evolution. Our a priori hypotheses were that (i) generation time is expected to negatively correlate with accumulated mutations with shared origins related to DNA replication (i.e., both orthologous microsatellite length and TE length) and that (ii) body mass is expected to negatively correlate with mutations associated with improper DNA repair (i.e., DNA deletions). Our data revealed that both hypotheses were mostly supported in nonpasserine Neoaves. In passerines, we found a distinct pattern with a correlation between generation time and deletion length that required alternative explanations.
RESULTS
Quantification of genomic attributes
To quantify orthologous microsatellites, we first downloaded/gathered raw sequencing reads from online sources (tables S1 and S2). Of all 278 species, after read trimming, we collected 11.3- to 25.4-Gb DNA sequences per species, with a median of 21.1 Gb.Our effort to identify orthologous microsatellite first discovered 421 orthologous microsatellite loci between chicken and zebra finch, of which 331 loci have annotated genes in either or both assemblies, and all annotated genes were matched (data file S1). With this pipeline applied on the first 43 species (table S1), orthologous microsatellite loci were found in 23 of 45 supermotifs with the motif sizes of 2 to 4 base pairs (bp) (table S3). Last, we identified 695 loci using reads of the rest of the 235 species (table S2). After genotyping the orthologous microsatellites and filtering out 204 loci with invariant length across species, we found the species-averaged orthologous microsatellite length, which represents the number of repeat units that differs from the most common allele (Δ units), ranged from 0.006 to 0.229 for Neoaves that were analyzed subsequently (Fig. 1A and data file S2). TE length, calculated as the proportions of TEs [data from ()] times the assembly sizes, ranged from 0.073 to 0.400 Gb (Fig. 1A and data file S2).
Fig. 1.
Evolution of genomic attributes (orthologous microsatellite length, TE length, and deletion length), together with life history traits and genome (assembly) size.
(A) The overview of genomic attributes evolution, together with body mass, generation time, and assembly size, across the evolutionary tree. Orders (or taxonomic groups) that include ≥5 species are highlighted with black or white bars. Species with elevated TEs are highlighted with blue bars. Silhouettes are from phylopic.org, most of them unchanged but a few are mirrored, with credit to F. Sayol, S. Wegner-Larson, B. McCormish (photo by Avenue), “annaleeblysse,” M. Michaud, S. Traver, and C. Schmidt, under Public Domain Dedication 1.0 (http://creativecommons.org/publicdomain/zero/1.0/), Creative Commons Attribution-ShareAlike 3.0 Unported (https://creativecommons.org/licenses/by-sa/3.0/), or Creative Commons Attribution 3.0 Unported (https://creativecommons.org/licenses/by/3.0/). The figure is generated by the R package ggtree (). Or. msat. length, orthologous microsatellite length. (B) Scatterplots between TE length and deletion length, highlighting the distinct trends between passerines (yellow) and nonpasserines (gray) and between those with (blue circles) or without (no circles) high TE contents. Lines represent modeling results, with passerines represented by the yellow line, nonpasserines by the gray line with a blue asterisk, and nonpasserines without outliers by the gray line. Dashed line represents statistical insignificant results, and solid lines represent significant results.
Evolution of genomic attributes (orthologous microsatellite length, TE length, and deletion length), together with life history traits and genome (assembly) size.
(A) The overview of genomic attributes evolution, together with body mass, generation time, and assembly size, across the evolutionary tree. Orders (or taxonomic groups) that include ≥5 species are highlighted with black or white bars. Species with elevated TEs are highlighted with blue bars. Silhouettes are from phylopic.org, most of them unchanged but a few are mirrored, with credit to F. Sayol, S. Wegner-Larson, B. McCormish (photo by Avenue), “annaleeblysse,” M. Michaud, S. Traver, and C. Schmidt, under Public Domain Dedication 1.0 (http://creativecommons.org/publicdomain/zero/1.0/), Creative Commons Attribution-ShareAlike 3.0 Unported (https://creativecommons.org/licenses/by-sa/3.0/), or Creative Commons Attribution 3.0 Unported (https://creativecommons.org/licenses/by/3.0/). The figure is generated by the R package ggtree (). Or. msat. length, orthologous microsatellite length. (B) Scatterplots between TE length and deletion length, highlighting the distinct trends between passerines (yellow) and nonpasserines (gray) and between those with (blue circles) or without (no circles) high TE contents. Lines represent modeling results, with passerines represented by the yellow line, nonpasserines by the gray line with a blue asterisk, and nonpasserines without outliers by the gray line. Dashed line represents statistical insignificant results, and solid lines represent significant results.By analyzing the “hal” file of avian 363-way whole-genome alignment, mean DNA deletion length ranged from 77.3 to 182.8 bp per 1000 bp in Neoaves (Fig. 1A and data file S2). Note that sequences extracted in this way only included orthologous sequences that are present in the chicken, so that any chicken-specific deletions were not included (fig. S1). Furthermore, any lineage-specific insertions from chicken, Galliformes, Galloanserae, and Neognathae are present as well (fig. S1). As it is impossible to polarize Neognathae-specific insertions without a nonavian outgroup, we therefore excluded species of Galloanserae and Paleognathae from our analyses. After overlapping the remaining neoavian species and those with available life history traits, we ended up with 233 species for subsequent statistical analyses.
Phylogenetic comparative analysis
Because inspection of pairwise scatterplots (e.g., Fig. 1B) further revealed that TE-related trends were not linear, we split our data into nonpasserines and passerines to better explain these data. Twelve nonpasserine species had a substantial TE burden, as the log(TE length) passed third quartile + 1.5 × interquartile range (IQR) and corresponded to the total percentage of TEs ranging from 12.85 to 31.47%. The species included all eight species sampled from Piciformes, the eurasion hoopoe (Upupa epops) and the common scimitarbill (Rhinopomastus cyanomelas) in Bucerotiformes, the hooded pitta (Pitta sordida) in Passeriformes, and the squirrel cuckoo (Piaya cayana) in Cuculiformes (Fig. 1). As the trend in nonpasserine birds was different from the trend in this clade without the species with high TEs (Fig. 1B), we further excluded the species with high TE length in nonpasserines as the third group. Last, we tested our a priori hypotheses in the three groups of species: (i) nonpasserine birds, (ii) nonpasserine birds without species with high TE length, and (iii) passerine birds (without species with high TE or deletion length; the species with high deletion length became apparent once passerines were separated). Unfortunately, we could not run models focused on species with high TE length because of the limited sample size.We built three piecewise structural equation models (PSEMs) that represent three scenarios among the relationships of life history traits, genomic attributes, and genome (or assembly) size (Fig. 2). Model 1 describes the general basic relationships among life history traits, and between genomic attributes and genome size, without relationships between life history traits and genomic attributes (Fig. 2A). Notably, we included a correlation between TE length and deletion length, as Kapusta et al. () proposed an “accordion” model of genome size evolution that suggests that TE insertions drive the formation of deletions through NAHR. Model 2, as described by our a priori hypotheses, predicts that both orthologous microsatellite length and TE length are related to generation time, and DNA deletion is related to body mass, in addition to the relationships in model 1 (Fig. 2B). Last, model 3 represented a full model in which each of the three genomic attributes is related to both body mass and generation time (Fig. 2C). Note that, although the PSEMs were defined by a series of hypothetical causational (single-ended) and correlated (double-ended) relationships, here, we interpreted the relationships as correlations. Our results, as measured by correlations, do not demonstrate causations but rather lend support to these hypothetical causal models ().
Fig. 2.
Models that are tested by PSEMs.
The models describe how avian genomic attributes, including orthologous microsatellite length, TE length, and deletion length, evolved under the covariation of DNA loss and gain (the arrow between TE length and deletion length), effect of life history traits on genomic attributes (arrows between body mass, generation time, and genomic attributes), and contributions of genomic attributes to genome size (or assembly size). Gray arrows indicate relationships that are common across all three models. Blue/red arrows refer to model-specific relationships. Single-directional arrows indicate relationships that are presumably causational, and bidirectional arrows indicate correlated errors among variables. (A) Model 1: The evolution of genomic attributes is not related to life history traits. (B) Model 2: Life history traits correlate with the evolution of genomic attributes as predicted by a priori hypothesis. (C) Model 3: A full model that makes every possible connection between life history traits and genomic attributes.
Models that are tested by PSEMs.
The models describe how avian genomic attributes, including orthologous microsatellite length, TE length, and deletion length, evolved under the covariation of DNA loss and gain (the arrow between TE length and deletion length), effect of life history traits on genomic attributes (arrows between body mass, generation time, and genomic attributes), and contributions of genomic attributes to genome size (or assembly size). Gray arrows indicate relationships that are common across all three models. Blue/red arrows refer to model-specific relationships. Single-directional arrows indicate relationships that are presumably causational, and bidirectional arrows indicate correlated errors among variables. (A) Model 1: The evolution of genomic attributes is not related to life history traits. (B) Model 2: Life history traits correlate with the evolution of genomic attributes as predicted by a priori hypothesis. (C) Model 3: A full model that makes every possible connection between life history traits and genomic attributes.After running all three models using our three subsets of data, we first noticed that models 2 and 3 were supported by our data: Models 2 and 3 were both supported by all nonpasserines, whereas model 2 was supported by nonpasserines without outliers and model 3 was supported by passerines (Table 1). By examining standardized regression coefficients (coef.) as a way to measure the strength and direction of the relationships in PSEMs (for results, see table S4), we checked whether the modeling results were consistent with our hypotheses. Basically, we found that it depended on the dataset whether our three focal correlations (i.e., among generation time or body mass and genomic attributes) were significant or not. In nonpasserines, all of our three focal correlations were significant in model 2 (generation time versus orthologous microsatellite length: coef. = −0.24, P = 0.003; generation time versus TE length: coef. = −0.23, P = 0.0005; body mass versus deletion length: coef. = −0.31, P < 0.001), but the one between generation time and orthologous microsatellite length was insignificant in model 3 (generation time versus orthologous microsatellite length: coef. = −0.08, P > 0.05; generation time versus TE length: coef. = −0.19, P = 0.04; body mass versus deletion length: coef. = −0.27, P = 0.004; Figs. 3A and 4). In nonpasserines without outliers, we found that all three correlations were significant (generation time versus orthologous microsatellite length: coef. = −0.19, P = 0.03; generation time versus TE length: coef. = −0.29, P = 0.005; body mass versus deletion length: coef. = −0.40, P < 0.001; Figs. 3B and 4). Last, in passerines, the model showed that all three aforementioned correlations were insignificant (all P > 0.05; table S4), but the correlation between generation time and deletions was significant (coef. = −0.38, P = 0.001; Fig. 3C, fig. S2, and table S4). The correlation between TE length and deletion length, reflecting that increases in TE length cause larger values of deletion length, was significant only in all nonpasserines when including outliers (model 2: coef. = 0.26, P = 0.001; model 3: coef. = 0.25, P = 0.002; table S4; Fig. 3, A versus B), consistent with the accordion model that TEs may facilitate the removal of DNA by NAHR (). Given that NAHR may result in larger deletions, this is further supported by the presence of longer deletions in the TE-rich clades than those in the rest clades (fig. S3).
Table 1.
Summary of piecewise structural equation model (PSEM) results across datasets.
In the table, df denotes the degree of freedom, C denotes C statistic, P denotes P values for d-separation tests (note that P values of >0.05 indicate proper model fit), AICc denotes Akaike information criterion corrected for small sample sizes, and ΔAICc denotes changes in ΔAICc. Models with ΔAICc < 2 are shown in bold.
Nonpasserines
Models
df
C
P
AICc
ΔAICc
Model 2
8
6.66
0.57
54
0
Model 3
2
0.45
0.79
55.4
1.4
Model 1
12
55.7
<0.05
89.8
35.8
Nonpasserines without outliers
Models
df
C
P
AICc
ΔAICc
Model 2
8
5.73
0.67
54
0
Model 3
2
0.92
0.63
57.5
3.5
Model 1
12
50
<0.05
84.5
30.5
Passerines
Models
df
C
P
AICc
ΔAICc
Model 3
2
1.73
0.42
57.4
0
Model 1
12
34.61
<0.05
66.2
8.8
Model 2
8
25.53
<0.05
77.2
19.8
Fig. 3.
PSEM results in different taxa being tested.
The panels represent modeling results in (A) nonpasserines, (B) nonpasserines without outliers, and (C) passerines. Gray arrows are used to indicate shared correlations among models, whereas red/blue arrows indicate model-specific correlations with positive/negative directions. The thickness of arrows represents the strength of standardized regression coefficient (also see table S4). This figure shows that the correlations of our three predictions are supported in nonpasserines but not in passerines. ns, not significant.
Fig. 4.
Intercorrelation among life history traits and three genomic attributes.
(A) Generation time [ln(year)] and orthologous microsatellite length (Δ units); (B) generation time [ln(year)] and TE length [ln(Gb)]; (C) body mass [ln(g)] and deletion length (bp/1000 bp); and (D) change in ancestral body mass [ln(g)] and change in deletion length (bp). In these plots, we used colors to differentiate species from nonpasserines (gray) and passerines (yellow), and those with high TE length (circled with blue) and without (no circles). Regression lines from PGLS models are plotted in each, with gray lines with asterisks representing the dataset with all nonpasserine species, gray lines without asterisks representing species without high TE lengths, and yellow lines representing passerine species. Significant regressions are represented by solid lines, and insignificant results are represented by dashed lines. This figure shows the three correlations between life history traits and the genomic attributes that we focused on (A to C) and the correlation between ancestral body mass and change in deletion length (D) further supported the correlation between deletion length and body mass (C).
Summary of piecewise structural equation model (PSEM) results across datasets.
In the table, df denotes the degree of freedom, C denotes C statistic, P denotes P values for d-separation tests (note that P values of >0.05 indicate proper model fit), AICc denotes Akaike information criterion corrected for small sample sizes, and ΔAICc denotes changes in ΔAICc. Models with ΔAICc < 2 are shown in bold.
PSEM results in different taxa being tested.
The panels represent modeling results in (A) nonpasserines, (B) nonpasserines without outliers, and (C) passerines. Gray arrows are used to indicate shared correlations among models, whereas red/blue arrows indicate model-specific correlations with positive/negative directions. The thickness of arrows represents the strength of standardized regression coefficient (also see table S4). This figure shows that the correlations of our three predictions are supported in nonpasserines but not in passerines. ns, not significant.
Intercorrelation among life history traits and three genomic attributes.
(A) Generation time [ln(year)] and orthologous microsatellite length (Δ units); (B) generation time [ln(year)] and TE length [ln(Gb)]; (C) body mass [ln(g)] and deletion length (bp/1000 bp); and (D) change in ancestral body mass [ln(g)] and change in deletion length (bp). In these plots, we used colors to differentiate species from nonpasserines (gray) and passerines (yellow), and those with high TE length (circled with blue) and without (no circles). Regression lines from PGLS models are plotted in each, with gray lines with asterisks representing the dataset with all nonpasserine species, gray lines without asterisks representing species without high TE lengths, and yellow lines representing passerine species. Significant regressions are represented by solid lines, and insignificant results are represented by dashed lines. This figure shows the three correlations between life history traits and the genomic attributes that we focused on (A to C) and the correlation between ancestral body mass and change in deletion length (D) further supported the correlation between deletion length and body mass (C).Our modeling results also showed that the two relationships reflecting correlated errors between orthologous microsatellite length versus TE length and deletion length, respectively, were also consistent across datasets and models, although they were stronger in all nonpasserines (Fig. 3, fig. S4, and table S4). Last, comparable coefficients of deletion length and TE length to the assembly size in nonpasserines were identified in all models (Fig. 3, fig. S5, and table S4).
Potential impact of sequencing depth on PSEMs
To explore a possible effect of the sequence data on our results, we used sequencing depth as an indicator of sequencing quantity and quality. We reasoned that most genomes were generated by the B10K project using similar sequencing platforms and assembly methods, so that the variation in sequencing and assembly methods should be relatively minor. In comparison, the depth of focal species ranged from 24X to 122X, with a median of 49X, suggesting that the impact of depth should be considered.Further analyses showed that passerines and nonpasserines did not differ in sequencing depth [phylogenetic generalized least squares (PGLS): r = 0.004, P > 0.05; fig. S6A]. Similarly, species with high TE length did not have higher sequencing depth (PGLS: r = 0.03, P > 0.05; fig. S6B). When correlating sequencing depth with genomic attributes and assembly size using each dataset, we did find depth to be positively correlated with orthologous microsatellite length in nonpasserines without outliers (PGLS: r = 0.17, P = 0.05; fig. S6C and table S5) outliers and depth to be positively correlated with assembly size in nonpasserines with and without outliers (PGLS: r = 0.23, P = 0.01 and r = 0.24, P = 0.01, respectively; fig. S6D and table S5).To exclude the effect of depth on models, we incorporated depth as a confounding variable in two of the three datasets that showed significant correlations, assuming that increasing sequencing depth would result in longer orthologous microsatellites and larger assembly sizes (fig. S7). The updated PSEMs showed that the addition of depth had limited influence on the standardized regression coefficients and significance (fig. S8 and table S4). This evidence suggests that, while sequencing depth does have impact on some genomic attribute and assembly size, the relationships among body mass, generation time, genomic attributes, and assembly size remain unchanged.
Evidence supporting the correlation between deletions and body mass
We further explored whether between-node changes in ancestral deletions correlated with between-node changes in ancestral body mass. Briefly, ancestral deletions were defined as deletions that can be confidently assigned to a branch in the phylogeny where all descendent species (but no others) harbor the deletion. As running through all 100,000 1-kb blocks to assign the phylogenetic position of each deletion was time-consuming, we randomly subsampled 20,000 1-kb blocks for subsequent analyses. The result with 10,000 blocks (fig. S9A) was consistent with that of 20,000 blocks (Fig. 4D), suggesting that we have sampled sufficient data for accurate statistical inferences. After filtering out blocks with missing species, we ended up with 34.8% (n = 6,958) blocks for further analyses. Deletions that can be confidently traced back to a branch consisted of 79.8% of all deletions (n = 3,244,677), with deletions of 0 to 20 bp dominating the overall length distribution (proportion of deletions of 0 to 20 bp: 90.8%; fig. S9B).Pearson’s correlations revealed that the length of deletions mapped back to each branch of the phylogeny was negatively correlated with the change in ancestral body mass as estimated by maximum likelihood (see Materials and Methods; r = −0.27, P < 0.001; Fig. 4D). The correlation in nonpasserines was stronger than that in passerines (nonpasserines: r = −0.33, P < 0.001; passerines: r = −0.12, P = 0.02; Fig. 4D), consistent with our PSEMs where, in passerines, a correlation between deletion and body mass was too weak to be detected. The correlation in nonpasserines was higher than that without the species/clades with high TE length (r = −0.30, P < 0.001).We also examined whether the correlation between life history and genomic attributes holds in other taxa when such data were available. In mammals where both whole-genome alignment () and body mass data () were available for 186 species (data file S2), a negative trend between body mass and deletion length was also detected (r = −0.17, P < 0.01; fig. S10).
Possible effects of longevity on DNA deletions
Mutations may be repaired more efficiently in long-lived species [e.g., (, )]. In our dataset, we found that long-lived species from Procellariiformes (as represented by albatrosses, petrels and shearwaters, and storm petrels) exhibit the shortest deletions compared to most other birds, which is evident after controlling for phylogeny (fig. S11A). Using a subset of 36 species with data on maximum longevity of wild animals collected from AnAge database (data file S2) (), we observed a positive correlation between the residuals of deletion length versus the residuals of maximum longevity, both of which were calculated when regressed against body mass (r = −0.34, P = 0.01; fig. S11B). However, the lack of data of maximum longevity for all of our focal species excluded us from incorporating longevity into the PSEM.
DISCUSSION
The intercorrelations among genomic attributes, life history traits, and genome size among 233 neoavian bird species show that our three a priori hypotheses are mostly supported in nonpasserines, whereas in passerines, the patterns are less pronounced. We also show that DNA deletions covary with TEs in nonpasserines when species with high TE content are included, which is consistent with the study of Kapusta et al. (). Our results highlight how life history traits influence cellular processes that ultimately shape genomic architecture in birds.
Plausible effects of effective population size
Our study has focused on the effect of the generation of mutations, which is reflected by generation time and body mass, on the accumulation of mutations. However, processes that affect the fixation and/or repair of mutations may also play significant roles in the observed pattern. First, whether the probability of fixation has an effect on the accumulation of mutations depends on the neutrality of mutations. For neutral mutations in a diploid population of the effective population size of Ne with mutation rate of μ, the fixation rate of mutations is calculated as the rate of neutral mutations (2Neμ) multiplied by the probability of fixation (1/2Ne). Therefore, the fixation rate of neutral mutations completely depends on the mutation rate as the effect of Ne is canceled out.In contrast, the fixation of slightly deleterious mutations depends on Ne (, ). Previous studies have shown that populations with smaller Ne fix more slightly deleterious mutations compared to those with larger Ne (–). As species with small Ne often have large body mass and long generation time, a positive correlation between body mass and the number of fixed mutations would be expected, which is in contrast to our observation. Further, for cross-species comparisons where the direction and strength of selection are different across species, Ne is not expected to exert directional impact on the accumulation of mutations. It depends on the biology of each species whether an individual insertion or deletion is deleterious or not. Considering that TEs and deletions lead to incremental genome size changes, it is uncertain whether the species are under the same direction and strength of selection with regard to genome size (or a correlated trait). Second, Ne can also positively affect the time for mutations to fix, as it takes (on average) 4Ne generations for neutral mutations to be fixed in a population (). It thus may take much longer time for neutral mutations to reach fixation in populations/species with larger Ne than those in species with smaller Ne. Assuming the same evolutionary time and mutation rate across species, fewer mutations are predicted to be fixed in species with large Ne than those in small Ne. This prediction translates to a positive correlation between body mass (or generation time) and accumulated mutations, which is contradictory to our observation. Together, this evidence suggests that Ne is not the major driver responsible for the pattern exhibited in our analyses.
Deletions are associated with the evolution of flight, body mass, and genome size
Previous studies have implicated the role of flight in reducing genome size, as evidenced by smaller genome sizes across flighted vertebrates including birds, bats, and pterosaurs (–). It has been hypothesized that bird genomes are constrained due to the metabolic demands of flight, as (i) smaller cells have higher surface-to-volume ratios and thus higher metabolic rates and (ii) small cells also tend to have small genomes (, , ). In other studies, a declining trend for both body mass (–) and genome size () has been documented in theropods, the lineage from which birds have evolved. Both the associations between genome size versus flight and body mass can be potentially explained by the accumulation of deletions. Both reduction in body mass and flight are tightly associated with shifts in metabolic rates, either through metabolic allometry (, ) or through energetic demands of flight. The increase in metabolism subsequently induced double-strand breaks where deletions are formed. As it is apparent in our study that longer deletions tend to relate to smaller genome sizes (or smaller assembly sizes; fig. S5), it is thus probable that the accumulation of deletions is the link between both body mass reduction [also see ()] and genome size reduction before the evolution of flight.The correlation between body mass and DNA deletions may be applicable in taxa beyond birds. In mammals, we have also observed a negative trend between body mass and DNA deletions (fig. S10). In addition, the correlations between life history traits, genomic attributes, and genome size can also be found in amphibians, where much larger genome sizes with much larger variance are exhibited compared to birds. For example, one of the smallest genomes in frogs, found in ornate burrowing frog (Platyplectrum ornatum), was characterized by short introns, deletion bias in TEs, and suppressed TE activities (). Moreover, P. ornatum and two other frog species with small genomes converge on life history traits including rapid tadpole development (). On the contrary, the gigantic genome sizes of salamanders have been shown to have slow removal rates of TEs (, ), whereas, recently, a link between large genome size and slow developmental time has been identified (). These studies suggest a link between life history traits (including body mass), genomic attributes, and genome size evolution in different vertebrate groups.
The distinct pattern in Passeriformes
Compared to PSEM results in nonpasserines, the results in passerines show a distinct pattern in that none of the three expected correlations are detected. Instead, a correlation between generation time and deletions is supported (Fig. 2C and fig. S2). Recently, a study of mathematical models revealed that the accumulation of damage-caused mutations may track cell division rate as well, if some of them are repaired efficiently before DNA replication (). Still, it remains unknown why this correlation is significant only in passerines, i.e., whether passerines are better in terms of DNA repair than nonpasserines (see the “Limitation of the results” section below). Meanwhile, the correlated errors between orthologous microsatellite length versus TE length and deletions are still present, suggesting that some unknown factors may be still in control of the correlations in passerines.Notably, the observation that the assembly sizes tend to scatter once deletion lengths are longer (fig. S5B) suggests uncharacterized genome expansion events in these passerine species. For instance, in nonpasserines, most species with elevated TE levels tend to deviate from the negative correlation between assembly size and deletion. This pattern can be explained by the presence of TEs, which increased the assembly sizes. In comparison, in passerines, only one species that deviated from the negative correlation (hooded pitta, P. sordida) has been characterized to have more TEs than most other species. It remains to be characterized why other passerine species deviated from the trend, including white-throated oxylab (Oxylabes madagascariensis), chipping sparrow (Spizella passerine), common sunbird-asity (Neodrepanis coruscans), red-billed oxpecker (Buphagus erythrorhynchus), and Crossley’s vanga (Mystacornis crossleyi).One of the closely related species of the chipping sparrow, the yellow-throated bunting (Emberiza elegans), has exceptionally high levels of repetitive sequence in a genome assembly sequenced with PacBio and scaffolded with Hi-C (PRJNA778594). In addition, three of the five species (O. madagascariensis, N. coruscans, and M. crossleyi) are island birds endemic to Madagascar. Island birds are known to have small Ne and thus experience stronger levels of drift (). It also happens that another endemic bird appeared to accumulate more microsatellite DNA in Piciformes (), suggesting a potential link between island endemics and DNA expansion.
Limitation of the results
Overall, our results could be limited by a general lack of variation of the life history traits under investigation. Compared to mammals, birds (especially neoavian birds) are generally smaller and exhibit less variance in body masses, potentially due to the demand of flight [e.g., ()]. Similarly, avian genomes sizes also show limited variation (). It is thus expected that detecting correlations between constrained variables is inherently hard for our focal species. Nevertheless, the fact that we were able to find correlations between body mass and genomic attributes, and between genomic attributes and genome sizes, indicates that these signals have been strong enough to be detected. The distinct pattern between passerines and nonpasserines (Fig. 3) can also be explained by the lack of variation in body mass, generation time, genome size, and genomic attributes of passerines.Second, the study is conducted using genomic data mostly assembled from short reads. The annotation of TEs and the estimation of genome size by assembly size, for example, can be greatly improved by third-generation sequencing (). However, we think that DNA deletions (especially smaller ones; fig. S9B) are less impaired by the sequencing technology, as their identification relied on sequence similarity of nonrepetitive DNA. Regardless, we expect that the increasing available assemblies using third-generation sequencing technique will extensively advance our understanding of the repetitive part of avian genomes.Last, because of limited data availability regarding avian longevity, we were unable to include the longevity of birds into our PSEMs to test the role of longevity systematically. Compared to the rapidly advancing field of genome sequencing, the accumulation of “traditional” data on life history of birds proceeds much slowly but continues to be critical for meaningful interpretation of genomic data.
MATERIALS AND METHODS
Identification and genotyping orthologous microsatellite loci
To account for microsatellite underrepresentation in whole-genome assemblies due to their repetitiveness, we used raw sequencing reads to identify and quantify avian microsatellites. For the 43 species published by Zhang et al. (), we downloaded raw reads from National Center for Biotechnology Information Sequence Read Archive libraries with a targeted size of approximately 15 Gb (~12X given an average genome size of 1.3 Gb; table S1). We tried to download libraries with similar sequencing strategy as far as possible, with paired-end sequencing of read length of 100 bp and an insertion size of 800 bp. For an additional 235 bird species published by Feng et al. (), we extracted raw sequencing reads with a target coverage of 20X (table S2) and a target insertion size of 500 bp. The reads were trimmed using Trimmomatic (), from which microsatellites of motif sizes of 1 to 7 nucleotides (nt) were identified using the Tandem Repeat Finder ().We identified orthologous microsatellite loci across species using the microsatellites uncovered in the sequencing reads. Because of the limitation of a downstream software [STR-FM (short tandem repeat profiling using a flanking-based mapping approach)], we only focused on microsatellite with motif sizes of 2 to 4 nt for orthologous microsatellites. We first identified orthologous microsatellites in the 43 species published by Zhang et al. (). We relied on the sequence similarity among the 15-bp flanking regions of orthologous loci to help identify them (for further information, see Supplementary Methods). To identify orthologous microsatellites in the 235 species (), we searched for the already-identified orthologous loci from the 43 species in each of the 235 species. We imputed genotypes of the orthologous microsatellite loci from sequencing reads using STR-FM (see Supplementary Methods for details) (). For each locus, we calculated the change in allele size by subtracting the allele size that was represented most frequently across species from the allele size of each allele of each locus in each species. For heterozygous loci, the changes in allele size were averaged. Last, a loci-averaged number was calculated to represent the average change in allele size of the orthologous microsatellites for each species. All related custom scripts for microsatellite quantification are available at https://github.com/yanzhu-ji/avian-msats.
Calculation of DNA deletion
To calculate DNA deletion length, we used multiple 1-kb syntenic alignments (n = 100,000) that were randomly sampled from the 363-way genome alignment generated by the B10K project using Progressive Cactus (, ). Random blocks of sequences that are neither genes nor repetitive regions were extracted using hal2maf with the genome of chicken (Gallus gallus) as a reference (refer to Supplementary Methods for details) (). Note that sequences extracted in this way only included orthologous sequences that are present in the chicken, so that any chicken-specific deletions were not included (fig. S1). Furthermore, any lineage-specific insertions from chicken, Galliformes, Galloanserae, and Neognathae are present as well (fig. S1). As it is impossible to polarize Neognathae-specific insertions without an outgroup, we therefore excluded species of Galloanserae and Paleognathae from our analyses. For each species, deletion lengths were calculated as 1000 (bp) subtracted by the syntenic alignment length, which was further averaged across blocks. Note here that the deletion length is a relative measurement instead of absolute length, as the extent of the consistent chicken-specific (and closely related species) insertions was unknown.
Collection of other biological data from literature
Overall, we compiled a dataset of generation time, body mass, assembly size, TE length, and a phylogeny from existing literature. Specifically, we collected generation time from Bird et al. () and body mass from the CRC Handbook of Avian Body Masses () and log-transformed these data before analyses. We collected the assembly size and total TE proportion (including LINE/SINE/LTR/DNA/RC/Unknown) from Feng et al. (). The total TE length was calculated by multiplying the total TE proportion and assembly size. For the phylogeny, we relied on Burleigh et al. ().
Phylogenetic comparative analyses
To select and describe the model among life history traits, genomic attributes, and genome size, we used PSEM as implemented in the R package piecewiseSEM (), given the phylogenetic nonindependence of our data. Basically, we set up three scenarios for PSEM to test against (Fig. 1). Furthermore, we ran the models for three datasets: The first dataset included all neoavian nonpasserine birds, the second dataset excluded species with TE length as identified as outliers judged by the criteria of third quartile + 1.5 × IQR, and the third dataset consisted of all passerine birds. For each dataset, we identified the model that was best supported by the data as defined by ΔAICc < 2.We obtained sequencing depth from Feng et al. (). To test whether sequencing depth has any effect on (or can help explain) our results, we first tested whether passerines have higher sequencing depth than nonpasserines and whether species with higher TE length have higher depth than species without using PGLS (implemented in R package phylolm) () with the status of species coded into dummy variables of 0 or 1. We also tested whether depth is correlated with any of the genomic attributes or assembly size using PGLS tests. If depth was found to be correlated with any of the variables, then PSEMs were updated with depth as a confounding variable (fig. S7).
Comparison of ancestral states of deletions and body mass
To identify whether the detected (if any) correlations between genomic attributes and life history traits can be further traced back to ancestral nodes in a phylogeny, we reconstructed ancestral states of life history traits of each node using fastAnc function in phytools (), and the changes in life history traits of each branch were calculated using the adjacent node and tip values. The ancestral states of one of the genomic attributes, deletions, were reconstructed by in-house scripts. Specifically, after blocks of sequences were extracted from hal files, we first defined the deletions by the start and end positions in the block, assuming that deletions with same starts and ends originate from one deletion event. We next located the phylogenetic branch of each deletion by the branch preceding their most common ancestor. We discarded the blocks that fail to cover all species in phylogeny or the deletions that have evolved multiple times independently across the tree. The change in life history traits and the change in deletion length were subsequently correlated using Pearson correlations in R. All the above statistical analyses were performed in R v4.1.2 ().
Authors: Zev N Kronenberg; Arang Rhie; Sergey Koren; Gregory T Concepcion; Paul Peluso; Katherine M Munson; David Porubsky; Kristen Kuhn; Kathryn A Mueller; Wai Yee Low; Stefan Hiendleder; Olivier Fedrigo; Ivan Liachko; Richard J Hall; Adam M Phillippy; Evan E Eichler; John L Williams; Timothy P L Smith; Erich D Jarvis; Shawn T Sullivan; Sarah B Kingan Journal: Nat Commun Date: 2021-04-28 Impact factor: 14.919