Yves Clément1,2,3, Gautier Sarah4,5, Yan Holtz1, Felix Homa5,6, Stéphanie Pointet5,7,8, Sandy Contreras5,9, Benoit Nabholz2, François Sabot5,10, Laure Sauné11, Morgane Ardisson4, Roberto Bacilieri4, Guillaume Besnard12, Angélique Berger7, Céline Cardi7, Fabien De Bellis7, Olivier Fouet7, Cyril Jourda7,13, Bouchaib Khadari4, Claire Lanaud7, Thierry Leroy7, David Pot7, Christopher Sauvage14, Nora Scarcelli10, James Tregear10, Yves Vigouroux10, Nabila Yahiaoui7, Manuel Ruiz5,7, Sylvain Santoni4, Jean-Pierre Labouisse7, Jean-Louis Pham10, Jacques David1, Sylvain Glémin2,15.
Abstract
Base composition is highly variable among and within plant genomes, especially at third codon positions, ranging from GC-poor and homogeneous species to GC-rich and highly heterogeneous ones (particularly Monocots). Consequently, synonymous codon usage is biased in most species, even when base composition is relatively homogeneous. The causes of these variations are still under debate, with three main forces possibly involved: mutational bias, selection and GC-biased gene conversion (gBGC). So far, both selection and gBGC have been detected in some species, but how their relative strength varies among and within species remains unclear. Population genetics approaches allow the intensity of selection, gBGC and mutational bias to be estimated jointly. We extended a recently developed method and applied it to a large population genomic dataset based on transcriptome sequencing of 11 angiosperm species spread across the phylogeny. We found that at synonymous positions, base composition is far from mutation-drift equilibrium in most genomes and that gBGC is a widespread and stronger process than selection. gBGC could strongly contribute to base composition variation among plant species, implying that it should be taken into account in plant genome analyses, especially for GC-rich ones.
Year: 2017 PMID: 28531201 PMCID: PMC5460877 DOI: 10.1371/journal.pgen.1006799
Source DB: PubMed Journal: PLoS Genet ISSN: 1553-7390 Impact factor: 5.917
Fig 1. Phylogeny of the species used in this study.
Phylogenetic relationship of the species used in this study. The phylogeny was computed with PhyML [75] on a set of 33 1–1 orthologous protein clusters obtained with SiLiX [76] and the resulting tree was made ultrametric (see untransformed trees in S5 and S6 Figs). Images for S. bicolor, T. monococcum, D. abyssinica and O. europaea come from the pixabay website. Images for S. pimpinellifolium and M. acuminata are provided by the authors. All other images come from the Wikimedia website.
List of studied species and datasets characteristics.
| Species | Name | Group | Mating system | Outgroup 1 | Outgroup 2 | Reference | # of individuals |
|---|---|---|---|---|---|---|---|
| | Sorghum | Monocot - Commelinid | Mixed | | | Genome | 9 |
| | Pearl millet | Monocot - Commelinid | Outcrossing | | | Transcriptome | 10 |
| | Einkorn wheat | Monocot - Commelinid | Selfing | | | Transcriptome | 10 |
| | Banana | Monocot - Commelinid | Outcrossing | | | Transcriptome | 10 |
| | Oil palm tree | Monocot - Commelinid | Outcrossing | | | Transcriptome | 10 |
| | Yam | Monocot - Basal | Outcrossing | | | Transcriptome | 5 |
| | Coffee tree | Eudicot - Asterid | Outcrossing | | | Transcriptome | 12 |
| | Tomato | Eudicot - Asterid | Mixed | | | Genome | 10 |
| | Olive tree | Eudicot - Asterid | Outcrossing | | | Transcriptome | 10 |
| | Cocoa | Eudicot - Rosid | Outcrossing | | | Genome | 10 |
| | Grape vine | Eudicot - Rosid | Outcrossing | | | Genome | 12 |
* Referred to simply as Olea europaea in the rest of the article
Global statistics for each dataset.
| Species | # of contigs (total) | # of contigs (genotyped) | # of contigs (with outgroup) | Total length | # of SNPs (total) | # of SNPs (polarized) | GC | GC3 | Average ENC | Codon preference (a) | Cor(GC3, expression) (b) | Polymorphism: πS | Polymorphism: πN | Polymorphism: πN/πS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 29448 | 18518 | 3884 | 25849393 | 77703 | 12201 | 0.52 | 0.56 | 40.33 | 15 / 7 | 0.30 | 0.407 | 0.065 | 0.161 |
| | 24618 | 12443 | 9616 | 8870196 | 95068 | 78360 | 0.48 | 0.53 | 39.75 | 13 / 10 | 0.27 | 0.710 | 0.121 | 0.170 |
| | 33381 | 3766 | 1319 | 1758789 | 4409 | 3522 | 0.46 | 0.48 | 40.06 | 26 / 2 | 0.38 | 0.272 | 0.033 | 0.122 |
| | 36115 | 14366 | 10546 | 6796494 | 113585 | 89793 | 0.49 | 0.52 | 39.42 | 28 / 1 | 0.31 | 1.223 | 0.237 | 0.194 |
| | 26791 | 14970 | 9144 | 10623105 | 28097 | 27514 | 0.47 | 0.47 | 39.33 | 28 / 4 | 0.28 | 0.175 | 0.046 | 0.261 |
| | 30551 | 18497 | 11544 | 16125630 | 84961 | 49552 | 0.46 | 0.46 | 41.10 | 26 / 12 | 0.17 | 0.417 | 0.085 | 0.205 |
| | 28975 | 13290 | 9064 | 11180913 | 115483 | 78519 | 0.45 | 0.42 | 40.68 | 27 / 6 | 0.22 | 0.593 | 0.145 | 0.245 |
| | 34727 | 12357 | 1074 | 9438177 | 25392 | 3253 | 0.43 | 0.38 | 42.79 | 22 / 8 | 0.18 | 0.213 | 0.051 | 0.238 |
| | 45389 | 12816 | 8512 | 6718947 | 90397 | 68299 | 0.44 | 0.42 | 39.09 | 28 / 6 | 0.23 | 1.070 | 0.231 | 0.216 |
| | 28798 | 9918 | 7901 | 5510955 | 37455 | 32674 | 0.45 | 0.42 | 44.06 | 27 / 8 | 0.31 | 0.484 | 0.124 | 0.257 |
| | 29971 | 12398 | 9325 | 12513219 | 101351 | 68315 | 0.46 | 0.45 | 44.30 | 27 / 8 | 0.21 | 0.744 | 0.147 | 0.197 |
GC and GC3 have been computed on the total number of contigs
a # of preferred codons ending in G or C / ending in A or T
b correlation between GC at third codon positions and gene expression (log10(RPKM))
ENC: effective number of codons (computed with method X)
πS: nucleotide diversity at synonymous sites
πN: nucleotide diversity at non-synonymous sites
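As an illustration of the base composition columns, GC and GC3 can be computed from a coding sequence as in the minimal sketch below. The function names `gc_content` and `gc3_content` are hypothetical, and the sketch assumes an in-frame sequence starting at the first codon position; it is not the pipeline used in the study.

```python
def gc_content(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return float("nan")
    return sum(base in "GC" for base in counted) / len(counted)


def gc3_content(cds: str) -> float:
    """GC fraction at third codon positions.

    Assumption: `cds` is in frame and starts at a first codon position.
    A trailing incomplete codon is ignored.
    """
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)
```

For example, for the sequence `ATGGCCAAA` the third positions are G, C and A, so GC3 is 2/3 while overall GC is 4/9.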
Fig 2. Patterns of codon preference among the 11 studied species.
The colour scale indicates the magnitude of Δ RSCU, the difference in the Relative Synonymous Codon Usage between highly and lowly expressed genes. The greenest codons are the most preferred and the reddest the least preferred. Codons ending in G or C are in red and those ending in A or T in blue.
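The Δ RSCU quantity in this caption can be sketched as follows. The code uses the usual definition of RSCU (observed codon count divided by the mean count over the synonymous codons of the same amino acid) and the standard genetic code; the function names `rscu` and `delta_rscu` are hypothetical, and how genes are assigned to the high and low expression sets is left to the caller.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """RSCU per codon, pooled over a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if len(codons) < 2 or total == 0:
            continue  # Met, Trp, or amino acid absent from the data
        for c in codons:
            values[c] = counts[c] * len(codons) / total
    return values


def delta_rscu(high_expr_cds, low_expr_cds):
    """RSCU in highly expressed genes minus RSCU in lowly expressed genes."""
    high, low = rscu(high_expr_cds), rscu(low_expr_cds)
    return {c: high[c] - low[c] for c in high if c in low}
```

A positive Δ RSCU marks a codon used relatively more often in highly expressed genes, which is the sense in which a codon is called preferred here.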
Fig 3. Relationship between the frequency of optimal codons (FOP) and expression in the 11 studied species.
For each species, genes were split into eight equal-sized expression categories (based on RPKM), and the mean FOP of each category is plotted with its 95% confidence interval.
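The binning described in this caption can be sketched as below. The caption does not say how the confidence interval was obtained, so the normal approximation (mean ± 1.96 standard errors) is an assumption, as is the hypothetical function name `mean_fop_by_expression`. The input must contain at least `n_bins` genes.

```python
import math
import statistics


def mean_fop_by_expression(genes, n_bins=8):
    """Mean FOP per expression category.

    `genes` is a list of (rpkm, fop) pairs. Genes are ranked by RPKM and
    cut into `n_bins` categories of (nearly) equal size. Returns one
    (mean, ci_low, ci_high) tuple per category, lowest expression first.
    Assumption: 95% CI taken as mean +/- 1.96 standard errors.
    """
    ranked = sorted(genes, key=lambda gene: gene[0])
    n = len(ranked)
    summary = []
    for k in range(n_bins):
        chunk = ranked[k * n // n_bins:(k + 1) * n // n_bins]
        fops = [fop for _, fop in chunk]
        mean = statistics.fmean(fops)
        if len(fops) > 1:
            half_width = 1.96 * statistics.stdev(fops) / math.sqrt(len(fops))
        else:
            half_width = float("nan")  # CI undefined for a single gene
        summary.append((mean, mean - half_width, mean + half_width))
    return summary
```

An increase of mean FOP across categories is the pattern expected when codon usage is shaped by expression level.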
Skewness, neutrality index (NI) and direction of selection (DoS) statistics for GC content and codon usage.
| Species | GC content: mean frequency of GC alleles | GC content: skewness | GC content: p-value (a) | GC content: NI | GC content: DoS | GC content: p-value (b) | Codon usage: mean frequency of Pref alleles | Codon usage: skewness | Codon usage: p-value (a) | Codon usage: NI | Codon usage: DoS | Codon usage: p-value (b) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | 0.576 | -0.351 | | 0.834 | 0.043 | | 0.535 | -0.164 | | 0.94 | 0.02 | 0.256 |
| | 0.562 | -0.294 | | 0.963 | 0.009 | | 0.534 | -0.158 | | 0.87 | 0.03 | |
| | 0.547 | -0.222 | | 0.728 | 0.078 | | 0.550 | -0.236 | | 0.71 | 0.08 | |
| | 0.570 | -0.343 | | 0.827 | 0.047 | | 0.570 | -0.344 | | 0.83 | 0.05 | |
| | 0.540 | -0.201 | | 0.819 | 0.050 | | 0.535 | -0.170 | | 0.82 | 0.05 | |
| | 0.554 | -0.277 | | 0.856 | 0.037 | | 0.549 | -0.252 | | 0.87 | 0.03 | 0.112 |
| | 0.450 | 0.234 | | 0.913 | 0.022 | | 0.458 | 0.199 | | 0.92 | 0.02 | |
| | 0.534 | -0.152 | | 1.132 | -0.031 | 0.051 | 0.539 | -0.174 | | 0.73 | 0.08 | |
| | 0.509 | -0.047 | | 0.884 | 0.031 | | 0.510 | -0.051 | | 0.89 | 0.03 | |
| | 0.515 | -0.071 | | 0.838 | 0.044 | | 0.510 | -0.045 | 0.053 | 0.88 | 0.03 | |
| | 0.550 | -0.229 | | 0.737 | 0.075 | | 0.538 | -0.172 | | 0.66 | 0.10 | |
a Null hypothesis: skewness = 0
b Null hypothesis: NI = 1 / DoS = 0 (equivalent test done on the same contingency table).
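The NI and DoS statistics can be illustrated with the sketch below, which applies the standard McDonald-Kreitman-style definitions to a 2 x 2 table of polymorphic and fixed changes. Mapping W→S (or unpreferred→preferred) changes to the class that replaces non-synonymous changes, and S→W (or preferred→unpreferred) to the class that replaces synonymous changes, is an assumption of this sketch, as are the function and parameter names.

```python
def neutrality_index(poly_ws, poly_sw, div_ws, div_sw):
    """NI = (P_ws / P_sw) / (D_ws / D_sw).

    poly_*: counts of polymorphic W->S and S->W changes.
    div_*:  counts of fixed (divergence) W->S and S->W changes.
    NI < 1 indicates an excess of fixed W->S changes relative to polymorphism.
    """
    return (poly_ws / poly_sw) / (div_ws / div_sw)


def direction_of_selection(poly_ws, poly_sw, div_ws, div_sw):
    """DoS = D_ws / (D_ws + D_sw) - P_ws / (P_ws + P_sw).

    DoS > 0 indicates a fixation bias in favour of W->S changes.
    """
    return div_ws / (div_ws + div_sw) - poly_ws / (poly_ws + poly_sw)
```

With these definitions NI < 1 always goes with DoS > 0 and NI > 1 with DoS < 0, which matches the sign pattern of the values in the table.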
Fig 4. DoS statistics as a function of GC3 and expression level.
Correlation between GC3 and DoS computed on WS changes (left panel), and between expression level (measured through RPKM) and DoS computed on UP changes (right panel). Pearson correlation coefficients are given for each species (red: significant at the 5% level; blue: non-significant).
Fig 5. Combined effect of GC3 and expression level on DoS statistics.
The DoS statistic was computed on W/S (gBGC) or U/P (SCU) changes for four gene categories: GC-rich and highly expressed, GC-rich and lowly expressed, GC-poor and highly expressed, GC-poor and lowly expressed.
Fig 6. Schematic presentation of the method to estimate recent and ancestral gBGC or SCU.
In addition to the polymorphic derived mutations used to infer recent gBGC or selection (B1/S1) as in [38], we also consider substitutions (i.e. fixed derived mutations) on the branch leading to the focal species. Each box corresponds to a site position in a sequence alignment. Both kinds of mutations are polarized with the same two outgroups and are thus sensitive to the same probability of polarization error. We assume that gBGC and selection may have changed over time, so that fixed mutations may have undergone a different intensity. Note that these two B or S values correspond to averages of potentially more complex variations over the two periods.
Separate estimates of recent and ancestral gBGC (B = 4Nb) and SCU (S = 4Ns).
| 1.61 [1.51–2.69] | 0.378 [0.290–0.516] | 0.078 [-0.492–0.739] | 0.758 | 0.189 | ||
| 1.73 [1.69–1.83] | 0.224 [0.189–0.261] | 0.524 [0.383–0.661] | ||||
| 1.99 [1.67–2.25] | 0.448 [0.269–0.613] | -0.008 [-0.824–0.691] | 0.985 | 0.164 | ||
| 1.71 [1.66–1.80] | 0.313 [0.253–0.370] | 0.397 [0.234–0.546] | 0.343 | |||
| 1.84 [1.77–1.93] | 0.328 [0.267–0.400] | 0.516 [0.328–0.702] | ||||
| 2.20 [2.10–2.47] | 1.171 [0.127–4.067] | 0.008 [-0.221–0.264] | 0.949 | 0.072 | ||
| 1.05 [1.02–1.10] | 0.154 [0.110–0.202] | 0.243 [0.113–0.366] | 0.171 | |||
| 2.05 [1.74–2.63] | 0.114 [-0.057–0.392] | 0.759 [-0.491–3.785] | 0.215 | 0.153 | 0.193 | |
| 1.58 [1.53–1.64] | 0.167 [0.080–0.268] | 0.031 [-0.127–0.168] | 0.687 | 0.132 | ||
| 1.67 [1.59–1.74] | 0.316 [0.258–0.377] | 0.465 [0.222–0.683] | 0.135 | |||
| 2.15 [2.08–2.22] | 0.360 [0.318–0.413] | 0.024 [-0.101–0.153] | 0.71 | |||
| 2.04 [1.70–2.47] | 0.139 [0.023–0.260] | 0.439 [-0.251–1.083] | 0.143 | 0.341 | ||
| 1.76 [1.70–1.87] | 0.181 [0.137–0.226] | 0.126 [-0.062–0.289] | 0.165 | 0.484 | ||
| 2.84 [2.33–3.31] | 0.534 [0.353–0.718] | 0.236 [-0.610–1.029] | 0.581 | 0.409 | ||
| 2.02 [1.96–2.15] | 0.315 [0.256–0.362] | 0.392 [0.221–0.553] | 0.394 | |||
| 1.58 [1.50–1.66] | 0.324 [0.233–0.396] | 0.512 [0.322–0.704] | ||||
| 1.68 [1.39–1.74] | 1.909 [0.306–9.994] | -0.101 [-0.311–0.135] | 0.470 | |||
| 0.89 [0.86–0.95] | 0.148 [0.079–0.197] | 0.196 [0.039–0.330] | 0.515 | |||
| 1.56 [1.32–2.05] | 0.465 [0.270–0.857] | 0.566 [-0.567–3.900] | 0.285 | 0.834 | ||
| 1.18 [1.13–1.22] | 0.148 [0.040–0.241] | 0.025 [-0.162–0.186] | 0.772 | 0.214 | ||
| 1.09 [1.02–1.16] | 0.245 [0.167–0.339] | 0.397 [0.107–0.673] | 0.185 | |||
| 1.26 [1.22–1.32] | 0.470 [0.421–0.525] | 0.118 [-0.028–0.258] | 0.103 | |||
Best model for the joint estimation of recent and ancestral gBGC (B = 4Nb) and SCU (S = 4Ns).
| Species | 4 | 4 | 4 | 4 |
|---|---|---|---|---|
| | 0.439 [0.334–0.525] | 0 | 0 | 0 |
| | 0.218 [0.182–0.253] | 0.561 [0.393–0.689] | 0.139 [0.106–0.175] | 0 |
| | 0.264 [0.042–0.443] | 0 | 0.247 [0.027–0.468] | 0 |
| | 0.312 [0.281–0.395] | 0.394 [0.241–0.580] | 0 | 0 |
| | 0 | 0 | 0.317 [0.284–0.400] | 0.398 [0.176–0.540] |
| | 0.329 [0.241–0.383] | 0.517 [0.234–0.744] | 0 | 0 |
| | 1.256 [0.564–2.202] | 0 | 0 | 0 |
| | 0.154 [0.119–0.227] | 0.244 [0.070–0.361] | 0 | 0 |
| | 0 | 0 | 0.459 [0.311–0.603] | 0 |
| | 0.168 [0.074–0.250] | 0 | 0 | 0 |
| | 0.318 [0.241–0.383] | 0.474 [0.234–0.744] | 0 | 0 |
| | 0.256 [0.216–0.295] | 0 | 0.380 [0.323–0.439] | 0 |
For Musa acuminata the two best models with very close AIC values are given.
Fig 7. GC3 and gBGC gradients along genes.
A. gBGC strength estimates (4Nb) for first exons (first 252 bp of contigs) and the rest of the gene. Error bars indicate 95% confidence intervals. With the exception of D. abyssinica and S. pimpinellifolium, all species exhibit stronger gBGC in first exons than in the rest of genes. B. Correlations between GC3 and gBGC strength in first exons (red) and the rest of genes (blue). Each dot corresponds to one species. GC3 and 4Nb tend to be positively correlated in both regions, although neither correlation is significant at the 5% level: ρSpearman = 0.591, p-value = 0.061 for first exons and ρSpearman = 0.382, p-value = 0.248 for the rest of genes. C. Comparison of 4Nb estimates between first exons and the rest of genes for Commelinids (all Monocots with the exception of D. abyssinica, left panel) and other species (right panel). 4Nb values are higher in first exons than in the rest of genes in Commelinid species, while other species exhibit no difference between first exons and the rest of genes.