Haoxuan Liu, Jianzhi Zhang.
Abstract
A study of the plant Arabidopsis thaliana detected lower mutation rates in genomic regions where mutations are more likely to be deleterious, challenging the principle that mutagenesis is blind to its consequence. To examine the generality of this finding, we analyze large mutational data from baker's yeast and humans. The yeast data do not exhibit this trend, whereas the human data show an opposite trend that disappears upon the control of potential confounders. We find that the Arabidopsis study identified substantially more mutations than reported in the original data-generating studies and expected from Arabidopsis' mutation rate. These extra mutations are enriched in polynucleotide tracts and have relatively low sequencing quality, so they are likely sequencing errors. Furthermore, the polynucleotide "mutations" can produce the purported mutational trend in Arabidopsis. Together, our results do not support lower mutagenesis of genomic regions of stronger selective constraints in the plant, fungal, and animal models examined.
Keywords: Arabidopsis; human; mutation; natural selection; yeast
Year: 2022 PMID: 35907247 PMCID: PMC9372563 DOI: 10.1093/molbev/msac169
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 8.800
Relationships among Mutation Rate, dN/dS, and Six Other Factors in Yeast and Humans.
| Species | Partial correlation between mutation rate and dN/dS: controlled variable | Partial correlation: rank correlation (P-value) | Partial correlation: linear correlation (P-value) | Multiple linear regression: independent variable | Multiple linear regression: coefficient | Multiple linear regression: P-value |
|---|---|---|---|---|---|---|
| Yeast | None | −0.0063 (0.68) | −0.0149 (0.33) | dN/dS | −5.63 × 10⁻⁵ | 0.43 |
| | Expression level | −0.0145 (0.34) | −0.0114 (0.46) | Expression level | 3.85 × 10⁻¹⁰ | 0.03 |
| | Gene length | −0.0202 (0.19) | −0.0149 (0.33) | Gene length | 5.26 × 10⁻⁹ | 0.51 |
| | Nucleosome occupancy | −0.0137 (0.37) | −0.0150 (0.33) | Nucleosome occupancy | −2.55 × 10⁻⁷ | 0.67 |
| | Replication timing | −0.0082 (0.59) | −0.0171 (0.26) | Replication timing | −8.59 × 10⁻⁵ | 2.6 × 10⁻³ |
| | GC content | −0.0230 (0.13) | −0.0138 (0.37) | GC content | 4.24 × 10⁻⁴ | 0.44 |
| | DNA curvature | −0.0134 (0.38) | −0.0169 (0.27) | DNA curvature | 5.64 × 10⁻⁵ | 0.41 |
| | All of the above | −0.0167 (0.28) | −0.0122 (0.43) | | | |
| Humans | None | −0.0633 (4.6 × 10⁻¹³) | −0.0126 (0.15) | dN/dS | 9.51 × 10⁻⁷ | 0.62 |
| | Expression level | −0.0596 (9.2 × 10⁻¹²) | −0.0125 (0.15) | Expression level | −3.53 × 10⁻⁹ | 0.38 |
| | Gene length | −0.0432 (7.8 × 10⁻⁷) | −0.0135 (0.12) | Gene length | 9.41 × 10⁻¹² | 0.14 |
| | Nucleosome occupancy | −0.0625 (8.7 × 10⁻¹³) | 0.0021 (0.81) | Nucleosome occupancy | 8.27 × 10⁻⁷ | 3.9 × 10⁻⁶ |
| | Replication timing | −0.0635 (3.9 × 10⁻¹³) | −0.0113 (0.20) | Replication timing | −6.57 × 10⁻⁵ | 3.8 × 10⁻³ |
| | GC content | −0.0675 (1.1 × 10⁻¹⁴) | 0.00042 (0.96) | GC content | 3.14 × 10⁻⁵ | 0.18 |
| | DNA curvature | −0.0620 (1.3 × 10⁻¹²) | −0.0036 (0.68) | DNA curvature | −1.12 × 10⁻⁵ | 0.24 |
| | All of the above | −0.0095 (0.28) | 0.0044 (0.62) | | | |
Mutation rate is the dependent variable in the multiple linear regression.
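The partial correlations above control for one covariate at a time. As a rough illustration only, and not the authors' code (this record does not say which software or estimator was used), a first-order partial correlation can be derived from the three pairwise correlations; applying the same formula to ranks gives a partial rank correlation. The function names and the toy data are hypothetical.

```python
from math import sqrt

def ranks(values):
    """Return 1-based ranks, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = mean_rank
        i = j + 1
    return out

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z, rank=False):
    """Correlation of x and y controlling for a single covariate z.

    With rank=True the inputs are rank-transformed first, which yields
    a partial Spearman (rank) correlation.
    """
    if rank:
        x, y, z = ranks(x), ranks(y), ranks(z)
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Toy values (hypothetical): x ~ mutation rate, y ~ dN/dS, z ~ one confounder.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]
z = [1, -1, -1, 1]
print(partial_corr(x, y, z), partial_corr(x, y, z, rank=True))
```

Controlling for several covariates at once, as in the "All of the above" rows, would need a regression-based or recursive version of this formula; the sketch covers only the single-covariate case.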
Differences in the Number of Mutations Identified Between Monroe et al.'s Study and Former Studies.
| Dataset | Sample size | No. of mutations in Monroe et al. | No. of mutations in original studies | Sequencing depth |
|---|---|---|---|---|
| 1. Training dataset | 107 MA lines × 25 generations | 8,574 | 2,209 | 36× |
| 2. New dataset | 400 MA lines × 8.3 generations | 359,133 | NA | ∼30× |
| 3. Somatic dataset | 64 somatic samples from two plants | 773,141 | 17 | 52.3× |
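To make the size of the discrepancy concrete, the counts in this table can be turned into fold differences and into calls per MA line per generation. This is simple arithmetic on the tabulated values, not an analysis taken from the paper; the function names are hypothetical.

```python
def fold_difference(monroe_count, original_count):
    """How many times more mutations Monroe et al. called than the original study."""
    return monroe_count / original_count

def per_line_generation(count, lines, generations):
    """Mutation calls per MA line per generation (a rough normalization)."""
    return count / (lines * generations)

# Counts copied from the table; the new dataset has no original count (NA).
print(fold_difference(8574, 2209))       # training dataset
print(fold_difference(773141, 17))       # somatic dataset
print(per_line_generation(8574, 107, 25))     # training, Monroe et al.
print(per_line_generation(2209, 107, 25))     # training, original studies
print(per_line_generation(359133, 400, 8.3))  # new dataset, Monroe et al.
```

The per-line normalization only applies to the two mutation accumulation (MA) datasets; the somatic samples have no generation count in the table.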
Fig. 1. Relationships among AT content, density of polynucleotides, and dN/dS in Arabidopsis thaliana. (A) Pearson's correlation (r) and Spearman's correlation (ρ) between the coding region AT content and dN/dS across genes. Each dot represents a gene. The blue line is the linear regression. (B) Correlations between the number of poly(A) + poly(T) tracts per 1000 nucleotides (i.e., density) in the coding sequence (CDS) of a gene and its dN/dS. Each dot represents a gene. The blue line is the linear regression. (C) AT content in coding, intron, and intergenic regions, respectively. Errors are too small to present. (D) Mean density of poly(A) + poly(T) tracts in coding, intron, and intergenic regions, respectively. (E) Number of mutations per site in coding, intron, and intergenic regions, respectively, calculated using the sum of the three datasets in Monroe et al. (F) Correlations between the number of mutations per site generated by simulation and dN/dS across genes. Each dot represents a gene. In the simulation, 70% of mutations are randomly distributed at non-polynucleotide sites across the genome while the remaining mutations are randomly distributed among poly(A) and poly(T) tracts. A similar Pearson's correlation is observed upon log transformations of the data. (G) Numbers of simulated mutations per site in coding, intron, and intergenic regions, respectively. In (D), (E), and (G), error bars represent 95% confidence intervals predicted by Poisson distributions.