Alan M Rice, Atahualpa Castillo Morales, Alexander T Ho, Christine Mordstein, Stefanie Mühlhausen, Samir Watson, Laura Cano, Bethan Young, Grzegorz Kudla, Laurence D Hurst.
Abstract
Large-scale re-engineering of synonymous sites is a promising strategy to generate vaccines either through synthesis of attenuated viruses or via codon-optimized genes in DNA vaccines. Attenuation typically relies on deoptimization of codon pairs and maximization of CpG dinucleotide frequencies. So as to formulate evolutionarily informed attenuation strategies that aim to force nucleotide usage against the direction favored by selection, here, we examine available whole-genome sequences of SARS-CoV-2 to infer patterns of mutation and selection on synonymous sites. Analysis of mutational profiles indicates a strong mutation bias toward U. In turn, analysis of observed synonymous site composition implicates selection against U. Accounting for dinucleotide effects reinforces this conclusion, observed UU content being a quarter of that expected under neutrality. Possible mechanisms of selection against U mutations include selection for higher expression, for high mRNA stability or lower immunogenicity of viral genes. Consistent with gene-specific selection against CpG dinucleotides, we observe systematic differences of CpG content between SARS-CoV-2 genes. We propose an evolutionarily informed approach to attenuation that, unusually, seeks to increase usage of the already most common synonymous codons. Comparable analysis of H1N1 and Ebola finds that GC3 deviated from neutral equilibrium is not a universal feature, cautioning against generalization of results.
Keywords: SARS-CoV-2; mutation equilibrium; selection; synonymous mutations; vaccine design; viral attenuation
Year: 2021 PMID: 32687176 PMCID: PMC7454790 DOI: 10.1093/molbev/msaa188
Source DB: PubMed Journal: Mol Biol Evol ISSN: 0737-4038 Impact factor: 16.240
The 4×4 Mutational Matrix for 1,151 Mutations at 4-Fold Synonymous Sites (in italics) and from 5,482 Mutations Observed Anywhere in Codons (not italics).
| Reference allele | Derived A | Derived U | Derived C | Derived G |
|---|---|---|---|---|
| A | — | 0.02204 | 0.01722 | 0.10067 |
| U | 0.01753 | — | 0.08912 | 0.01296 |
| C | 0.03545 | 0.40877 | — | 0.00896 |
| G | 0.12389 | 0.18060 | 0.02111 | — |
Note.—Rates are defined as the number of observed changes per incidence of the nucleotide in the reference genome at 4-fold third sites (italics) or in codons. Note that because of different normalizations, the two sets of numbers are not directly comparable in absolute terms.
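The mutation bias toward U can be made concrete by asking what base composition the rates in the table above would produce if mutation acted alone. The following is a minimal sketch, not the authors' procedure: it treats the tabulated per-site rates as one-step transition probabilities (an assumption that holds here because each row sums to less than 1) and iterates until the composition stops changing. The names `RATES` and `equilibrium` are illustrative.

```python
# Rates copied from the SARS-CoV-2 table above: RATES[reference][derived].
RATES = {
    "A": {"U": 0.02204, "C": 0.01722, "G": 0.10067},
    "U": {"A": 0.01753, "C": 0.08912, "G": 0.01296},
    "C": {"A": 0.03545, "U": 0.40877, "G": 0.00896},
    "G": {"A": 0.12389, "U": 0.18060, "C": 0.02111},
}
BASES = "AUCG"


def equilibrium(rates, n_iter=10_000, tol=1e-12):
    """Return the base composition at which mutational gains equal losses.

    Assumes every row of `rates` sums to at most 1, so that the rates can
    be read as per-step probabilities of leaving each base.
    """
    pi = {b: 0.25 for b in BASES}  # start from equal base frequencies
    for _ in range(n_iter):
        new = {}
        for b in BASES:
            # Fraction of base b that does not mutate in this step.
            stay = pi[b] * (1.0 - sum(rates[b].values()))
            # Fraction arriving at b from each of the other bases.
            gain = sum(pi[a] * rates[a][b] for a in BASES if a != b)
            new[b] = stay + gain
        converged = max(abs(new[b] - pi[b]) for b in BASES) < tol
        pi = new
        if converged:
            break
    return pi


pi = equilibrium(RATES)
for base in BASES:
    print(base, round(pi[base], 3))
```

With these rates, U comes out as the most frequent base at mutational equilibrium, which is the sense in which the mutational profile is biased toward U. Comparing such an expectation with the observed composition at synonymous sites is what allows selection against U to be inferred.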
Fig. 1. Chord diagram displaying the rate of flux from one dinucleotide to another in the coding sequence of SARS-CoV-2. For each node, the direction of flux is indicated by the indentation of the connecting links: the outermost layer represents flux into the node and the inner layer represents flux out. The frequency of the flux exchange is represented by the width of any given link where it meets the outer axis. Dinucleotide nodes are colored according to their GC-content. Hence, it is evident that there is high flux away from GC-rich dinucleotides whereas AU-rich dinucleotides are largely conserved.
Fig. 2. Dinucleotide content across SARS-CoV-2 compared with neutral expectations. Error bars represent bootstrapped 95% upper and lower confidence bounds.
Fig. 3. (a) CpG enrichment across the genes of SARS-CoV-2. Gray line, no enrichment. (b) Relationship between CpG enrichment and GC3.
Fig. 4. GC content across genes of SARS-CoV-2 at codon sites 1, 2, 3, and averaged across the gene.
Fig. 5. Base composition at codon third sites across genes of SARS-CoV-2.
Fig. 6. (a) UpA enrichment across genes of SARS-CoV-2 and (b) correlation with CpG enrichment. Gray line is the line of slope 1 through the origin.
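Dinucleotide enrichment scores such as CpG enrichment are commonly computed as an observed-to-expected ratio, where the expectation is the product of the two mononucleotide frequencies and a value of 1 means no enrichment. The sketch below assumes that common definition; this record does not state the exact normalization used for the figures, and the function name `dinucleotide_enrichment` is illustrative.

```python
def dinucleotide_enrichment(seq, dinuc):
    """Observed/expected ratio for one dinucleotide in one sequence.

    Expected frequency is f(X) * f(Y) for dinucleotide XY (an assumed,
    commonly used definition). DNA input is converted to RNA letters.
    """
    seq = seq.upper().replace("T", "U")
    dinuc = dinuc.upper().replace("T", "U")
    n = len(seq)
    if n < 2 or len(dinuc) != 2:
        raise ValueError("need a sequence of length >= 2 and a 2-letter dinucleotide")

    # Count overlapping occurrences across the n - 1 dinucleotide positions.
    hits = sum(1 for i in range(n - 1) if seq[i:i + 2] == dinuc)
    observed = hits / (n - 1)

    expected = (seq.count(dinuc[0]) / n) * (seq.count(dinuc[1]) / n)
    if expected == 0:
        return float("nan")  # undefined when either base is absent
    return observed / expected


# Toy sequences, not viral data.
print(dinucleotide_enrichment("CGCGCGCG", "CG"))  # CpG over-represented
print(dinucleotide_enrichment("GGGGCCCC", "CG"))  # CpG absent
```

The same function applied with `"UA"`, `"AU"`, or `"GC"` gives the other enrichment scores discussed in the figures and the correlation table.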
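GC content by codon position, including GC3, can be computed by slicing an in-frame coding sequence at every third base. This is a minimal sketch assuming the sequence starts in frame and has a length divisible by 3; the function name `gc_by_codon_position` is illustrative and not taken from the paper.

```python
def gc_by_codon_position(cds):
    """Return GC fraction at codon sites 1, 2, 3 and across the whole CDS.

    Assumes `cds` is in frame and its length is a multiple of 3.
    """
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")

    result = {}
    for pos in range(3):
        bases = cds[pos::3]  # every third base starting at this codon site
        result[f"GC{pos + 1}"] = sum(b in "GC" for b in bases) / len(bases)
    result["GC"] = sum(b in "GC" for b in cds) / len(cds)
    return result


# Toy sequence with codons ATG, GCC, AAG.
print(gc_by_codon_position("ATGGCCAAG"))
```

Because most changes at third sites are synonymous, GC3 is the measure most directly comparable with the mutational equilibrium discussed in the abstract.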
Between-Gene Correlations in Dinucleotide Enrichment Scores (Pearson product moment correlation r values).
| | UpAe | ApUe | CpGe | GpCe |
|---|---|---|---|---|
| UpAe | — | 0.20 | −0.18 | |
| ApUe | | — | −0.15 | −0.16 |
| CpGe | | | — | 0.007 |
| GpCe | | | | — |
Note.—Significant correlations in italics: P < 0.005.
Fig. 7. (a) GpC and (b) ApU enrichment across the genes of SARS-CoV-2.
Fig. 8. Correlation between expression level and CpG enrichment, GC content, and GC3.
Fig. 9. GC content of cytoplasmic and nuclear viruses. Cytoplasmic viruses have significantly lower values for all three measures (Mann–Whitney U test: GC: P = 6.42e-18, CpG enrichment: P = 1.35e-13, UpA enrichment: P = 9.1e-29).
The 4×4 Mutational Matrix for 1,522 Mutations at Synonymous Sites (in italics) and from 2,571 Mutations Observed Anywhere in Codons (not italics) for H1N1.
| Reference allele | Derived A | Derived U | Derived C | Derived G |
|---|---|---|---|---|
| A | — | 0.04597 | 0.0451 | 0.25542 |
| U | 0.05143 | — | 0.24889 | 0.03429 |
| C | 0.11426 | 0.30675 | — | 0.02607 |
| G | 0.32052 | 0.05027 | 0.0207 | — |
Note.—Rates are defined as the number of observed changes per incidence of the nucleotide in the reference genome at third sites (italics) or in codons.
The 4×4 Mutational Matrix for 1,682 Mutations at Synonymous Sites (in italics) and from 3,523 Mutations Observed Anywhere in Codons (not italics) for Ebola.
| Reference allele | Derived A | Derived U | Derived C | Derived G |
|---|---|---|---|---|
| A | — | 0.05077 | 0.06722 | 0.14803 |
| U | 0.05152 | — | 0.13429 | 0.04786 |
| C | 0.08086 | 0.14868 | — | 0.04845 |
| G | 0.16051 | 0.05139 | 0.05139 | — |
Note.—Rates are defined as the number of observed changes per incidence of the nucleotide in the reference genome at third sites (italics) or in codons.