Victoria R Caudill1,2, Sarina Qin1,3, Ryan Winstead1, Jasmeen Kaur1, Kaho Tisthammer1, E Geo Pineda1, Caroline Solis1, Sarah Cobey4, Trevor Bedford5, Oana Carja6, Rosalind M Eggo7, Katia Koelle8, Katrina Lythgoe9, Roland Regoes10, Scott Roy1, Nicole Allen1, Milo Aviles1, Brittany A Baker1, William Bauer1, Shannel Bermudez1, Corey Carlson1, Edgar Castellanos1, Francisca L Catalan1,11, Angeline Katia Chemel1, Jacob Elliot1, Dwayne Evans1,12, Natalie Fiutek1, Emily Fryer1,13, Samuel Melvin Goodfellow1,14, Mordecai Hecht1, Kellen Hopp1, E Deshawn Hopson1, Amirhossein Jaberi1, Christen Kinney1, Derek Lao1, Adrienne Le1, Jacky Lo1, Alejandro G Lopez1, Andrea López1, Fernando G Lorenzo1, Gordon T Luu1, Andrew R Mahoney1, Rebecca L Melton1,15, Gabriela Do Nascimento1, Anjani Pradhananga1, Nicole S Rodrigues1,16, Annie Shieh1, Jasmine Sims1,17, Rima Singh1, Hasan Sulaeman1, Ricky Thu1, Krystal Tran1, Livia Tran1, Elizabeth J Winters1, Albert Wong1, Pleuni S Pennings1.
Abstract
Mutations can occur throughout the virus genome and may be beneficial, neutral or deleterious. We are interested in mutations that yield a C next to a G, producing CpG sites. CpG sites are rare in eukaryotic and viral genomes. In eukaryotes, CpG sites are thought to be rare because they are prone to mutation when methylated. In viruses, we know less about why CpG sites are rare. A previous study in HIV suggested that CpG-creating transition mutations are more costly than similar non-CpG-creating mutations. To determine if this is the case in other viruses, we analyzed the allele frequencies of CpG-creating and non-CpG-creating mutations across various strains, subtypes, and genes of viruses using existing data obtained from Genbank, HIV Databases, and Virus Pathogen Resource. Our results suggest that CpG sites are indeed costly for most viruses. By understanding the cost of CpG sites, we can obtain further insights into the evolution and adaptation of viruses.
Keywords: CpG sites; Fitness costs; Mutations; Viruses
Year: 2020 PMID: 32508375 PMCID: PMC7245597 DOI: 10.1007/s10682-020-10039-z
Source DB: PubMed Journal: Evol Ecol ISSN: 0269-7653 Impact factor: 2.074
Information on each dataset: virus name, amount and source of available data, and statistical results
| Dataset | # Sequences | # Nucleotides | Source | A: CpG vs non-CpG creating (synonymous) | A: CpG vs non-CpG creating (non-synonymous) | A: Syn vs nonsyn | A: Ratio non-CpG/CpG (A -> G Syn) | T: CpG vs non-CpG creating (synonymous) | T: CpG vs non-CpG creating (non-synonymous) | T: Syn vs nonsyn | T: Ratio non-CpG/CpG |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dengue 1 (WG) | 1783 | 10176 | Genbank | < 0.01 | < 0.01 | < 0.01 | 3.06 | < 0.01 | < 0.01 | < 0.01 | 4.37 |
| Dengue 2 (WG) | 1466 | 10173 | Genbank | < 0.01 | 0.0128 | < 0.01 | 1.57 | < 0.01 | 0.0297 | < 0.01 | 1.89 |
| Dengue 3 (WG) | 959 | 10170 | Genbank | < 0.01 | 0.0191 | < 0.01 | 2.90 | < 0.01 | < 0.01 | < 0.01 | 3.54 |
| Dengue 4 (WG) | 256 | 10206 | Genbank | < 0.01 | 0.0724 | < 0.01 | 1.83 | < 0.01 | < 0.01 | < 0.01 | 2.42 |
| HCV 1A (WG) | 414 | 9033 | Genbank | 0.021 | 0.385 | < 0.01 | 1.33 | 0.027 | 0.193 | < 0.01 | 1.35 |
| HCV 1B (WG) | 243 | 9033 | Genbank | < 0.01 | 0.891 | < 0.01 | 1.27 | < 0.01 | < 0.01 | < 0.01 | 1.25 |
| HIV pol gene | 2956 | 3231 | HIV Database | < 0.01 | < 0.01 | < 0.01 | 2.71 | < 0.01 | 0.156 | < 0.01 | 6.34 |
| H Parainfluenza 1 HN | 340 | 1728 | Genbank | < 0.01 | 0.68 | < 0.01 | 0.22 | 0.0132 | 0.214 | < 0.01 | 0.70 |
| H Parainfluenza 3 HN | 702 | 1725 | Genbank | 0.271 | 0.45 | < 0.01 | 2.05 | < 0.01 | 0.782 | < 0.01 | 1.73 |
| H Parainfluenza 1 (WG) | 99 | 14397 | Genbank | < 0.01 | 0.378 | < 0.01 | 1.69 | < 0.01 | < 0.01 | < 0.01 | 2.12 |
| H Parainfluenza 3 (WG) | 452 | 14469 | Genbank | < 0.01 | 0.0151 | < 0.01 | 1.75 | < 0.01 | < 0.01 | < 0.01 | 2.49 |
| Influenza A NA H3N2 | 19095 | 1710 | Genbank | < 0.01 | 0.224 | < 0.01 | 7.20 | < 0.01 | 0.126 | < 0.01 | 2.31 |
| Influenza A HA H1N1 | 24005 | 1701 | Genbank | < 0.01 | 0.328 | < 0.01 | 4.44 | < 0.01 | < 0.01 | < 0.01 | 2.94 |
| Influenza A HA H3N2 | 19226 | 1410 | Genbank | < 0.01 | 0.0628 | < 0.01 | 2.40 | < 0.01 | < 0.01 | < 0.01 | 2.85 |
| Influenza A NA H1N1 | 2428 | 1407 | Genbank | < 0.01 | 0.19 | < 0.01 | 9.03 | < 0.01 | < 0.01 | < 0.01 | 4.05 |
| Influenza B HA | 1054 | 1755 | Genbank | < 0.01 | 0.0946 | < 0.01 | 1.62 | < 0.01 | 0.162 | < 0.01 | 0.96 |
| Influenza B NA | 3852 | 1398 | Genbank | < 0.01 | 0.0413 | < 0.01 | 1.62 | < 0.01 | 0.199 | < 0.01 | 2.42 |
| Entero A VP1 EVA71 | 3866 | 894 | VPR | < 0.01 | 0.384 | < 0.01 | 1.77 | 0.0221 | 0.253 | < 0.01 | 1.64 |
| Entero A VP2 EVA71 | 575 | 294 | VPR | 0.0783 | 0.477 | < 0.01 | 2.22 | 0.0847 | 0.545 | < 0.01 | 1.62 |
| Entero B VP1 Echovirus30 | 2419 | 876 | VPR | 0.0178 | 0.664 | < 0.01 | 0.39 | 0.412 | 0.204 | < 0.01 | 0.37 |
| Entero B VP2 Echovirus30 | 413 | 750 | VPR | < 0.01 | < 0.01 | < 0.01 | 1.23 | 0.128 | 0.196 | < 0.01 | 1.32 |
| Entero C VP1 Polio2 | 1342 | 906 | VPR | < 0.01 | 0.288 | < 0.01 | 1.75 | 0.0247 | 0.126 | < 0.01 | 1.57 |
| Entero C VP2 Polio2 | 574 | 813 | VPR | < 0.01 | 0.926 | < 0.01 | 1.57 | < 0.01 | 0.488 | < 0.01 | 2.07 |
| Entero D68 VP1 | 528 | 546 | VPR | 0.0158 | 0.12 | < 0.01 | 3.45 | < 0.01 | 0.897 | < 0.01 | 3.49 |
| H Respiratory Syncytial | 1071 | 13437 | Genbank | < 0.01 | 0.154 | < 0.01 | 4.40 | < 0.01 | < 0.01 | < 0.01 | 3.60 |
| Measles HH | 799 | 1851 | Genbank | < 0.01 | 0.0929 | < 0.01 | 1.73 | < 0.01 | 0.0258 | < 0.01 | 2.20 |
| Rhino B (WG) | 41 | 6579 | Genbank | < 0.01 | < 0.01 | < 0.01 | 4.84 | < 0.01 | 0.0426 | < 0.01 | 3.17 |
| Rhino C (WG) | 69 | 6531 | Genbank | < 0.01 | < 0.01 | 1 | 1.75 | 0.38 | < 0.01 | 1 | 1.81 |
| Rota A VP6 | 4331 | 1197 | Genbank | 0.0105 | 0.0228 | < 0.01 | 1.75 | 0.0201 | 0.0346 | < 0.01 | 1.55 |
| Bk Polyoma VP1 | 3164 | 1089 | Genbank | < 0.01 | 0.158 | < 0.01 | 100.29 | < 0.01 | 0.189 | < 0.01 | 237.84 |
| H Boca 1 VP1 | 211 | 2013 | Genbank | 0.195 | 0.0873 | < 0.01 | 15.70 | 0.258 | 0.141 | < 0.01 | 9.00 |
| HBV A Polymerase | 264 | 852 | Genbank | 0.389 | 0.347 | 0.192 | 1.17 | 0.15 | < 0.01 | 0.41 | 59.65 |
| HBV A S | 263 | 837 | Genbank | 0.291 | 0.262 | 0.153 | 1.77 | 0.155 | < 0.01 | 0.392 | 77.12 |
| HBV B Polymerase | 298 | 909 | Genbank | 0.351 | 0.832 | 0.017 | 3.21 | 0.154 | 0.453 | < 0.01 | 1.70 |
| HBV B PreC-Core | 344 | 639 | Genbank | 0.764 | 0.199 | < 0.01 | 1.08 | 0.212 | 0.526 | < 0.01 | 6.53 |
| HBV C polymerase | 499 | 1635 | Genbank | 0.0176 | 0.218 | < 0.01 | 1.70 | 0.212 | 0.545 | < 0.01 | 1.60 |
| HBV C PreC-Core | 583 | 639 | Genbank | 0.047 | 0.876 | < 0.01 | 13.10 | 0.788 | 0.599 | < 0.01 | 1.37 |
| HBV C S | 2224 | 834 | Genbank | 0.646 | 0.153 | 0.0179 | 0.48 | 0.249 | 0.213 | 0.82 | 28.31 |
| Herpes 2 Glycoprotein G | 312 | 2109 | Genbank | 0.193 | 0.529 | 0.678 | 4.99 | 0.972 | 0.268 | < 0.01 | 0.03 |
| H Papilloma 16 L1 | 1104 | 1518 | Genbank | 0.0965 | 0.456 | < 0.01 | 1.09 | 0.102 | 0.577 | < 0.01 | 0.57 |
| Parvo B19 NS1 | 155 | 2016 | Genbank | < 0.01 | 0.805 | < 0.01 | 4.65 | < 0.01 | 0.22 | < 0.01 | 3.49 |
| Parvo B19 VP1 | 268 | 2343 | Genbank | < 0.01 | 0.645 | < 0.01 | 20.16 | < 0.01 | 0.711 | < 0.01 | 3.99 |
The word “Human” at the beginning of a virus name is shortened to “H”. If the whole genome was used for a virus, it is indicated by (WG)
Fig. 1 A pictorial representation of the 12 transition mutation groups. Each nucleotide can mutate to exactly one other nucleotide by a transition. Each mutation (and site) was categorized as synonymous or non-synonymous according to the resulting amino acid. For A and T, we further separated the groups into CpG-creating and non-CpG-creating mutations (transitions at C and G cannot create CpG sites). Most comparisons in this study are between adjacent yellow and blue mutation categories (CpG-creating vs non-CpG-creating).
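The synonymous versus non-synonymous split described in this caption can be sketched in Python as follows. This is an illustrative sketch, not the authors' code: the function name `classify_transition` is hypothetical, and it assumes an in-frame coding sequence, the standard genetic code, and 0-based positions.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order (assumption:
# the standard code applies to the coding sequence being analyzed).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Each nucleotide has exactly one transition partner
# (purine <-> purine, pyrimidine <-> pyrimidine).
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def classify_transition(seq, pos):
    """Return (mutation label, 'syn' or 'nonsyn') for the transition at pos.

    Hypothetical helper: seq is assumed to be an in-frame coding sequence
    and pos a 0-based index into it.
    """
    ref = seq[pos]
    alt = TRANSITION[ref]
    offset = pos % 3                      # position within the codon
    start = pos - offset                  # first base of the codon
    codon = seq[start:start + 3]
    mutant = codon[:offset] + alt + codon[offset + 1:]
    same_aa = CODON_TABLE[codon] == CODON_TABLE[mutant]
    return f"{ref}->{alt}", "syn" if same_aa else "nonsyn"
```

For example, in the toy sequence `ATGGCA` the final A sits at the third position of the alanine codon GCA, so its A->G transition is synonymous, while the A->G transition at the first A changes methionine to valine.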
Fig. 3 Observed transition mutation frequencies of CpG-creating and non-CpG-creating mutations in select viral datasets (a: the whole genome of Dengue 1 virus; c: the HA gene of Influenza A virus H3N2; e: the glycoprotein gene of Human Respiratory Syncytial virus). Each panel on the left (a, c, e) displays transition mutation frequencies, with the means and standard errors (black lines). The Wilcoxon test results are shown on the right (b, d, f). The shade of blue in each p value cell represents the significance level; the darker the shade, the more significant the result (dark blue: p < 0.01; medium blue: 0.01–0.05; light blue: p > 0.05).
Fig. 4 Overview of the cost associated with CpG-creating mutations. Each dot represents the ratio of the average frequency of non-CpG-creating mutations to the average frequency of CpG-creating mutations for one virus dataset. The bottom half of the figure depicts the total amount of data in each virus dataset (the number of sequences × the number of nucleotides).
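The ratio plotted in this figure (and listed in the Ratio columns of the dataset table) can be illustrated with a short Python sketch. The function name `cost_ratio` and the input frequencies are hypothetical and chosen only to show the arithmetic; they are not taken from the study.

```python
from statistics import mean


def cost_ratio(non_cpg_freqs, cpg_freqs):
    """Ratio of mean non-CpG-creating to mean CpG-creating mutation frequency.

    A value above 1 means CpG-creating mutations are observed at lower
    frequencies. Returns None when the CpG mean is zero (ratio undefined).
    """
    mean_cpg = mean(cpg_freqs)
    if mean_cpg == 0:
        return None
    return mean(non_cpg_freqs) / mean_cpg


# Made-up per-site frequencies, for illustration only.
example = cost_ratio([0.02, 0.04], [0.01, 0.01])
```

With these made-up inputs the means are 0.03 and 0.01, so the ratio is about 3: non-CpG-creating mutations would be roughly three times as frequent as CpG-creating ones.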
The number of datasets (out of 42) for which the Wilcoxon test was significant (percentages in parentheses), indicating that non-CpG-creating mutations were observed at higher frequencies than otherwise similar CpG-creating mutations
| Summary of comparisons | A | T |
|---|---|---|
| Synonymous: CpG versus non-CpG | 32 (76.2%) | 28 (66.7%) |
| Non-synonymous: CpG versus non-CpG | 10 (23.8%) | 17 (40.5%) |
| Synonymous versus non-synonymous | 38 (90.5%) | 38 (90.5%) |
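The kind of comparison counted in this table can be sketched with a rank-based test in Python. The source names only "the Wilcoxon test", so several choices here are assumptions rather than the authors' method: an unpaired rank-sum (Mann-Whitney) form, a one-sided alternative, and a normal approximation without continuity or tie correction of the variance. The function names are hypothetical.

```python
from math import erfc, sqrt


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """One-sided p value for 'x tends to be larger than y'.

    Uses the normal approximation to the Mann-Whitney U statistic
    (assumption: no continuity or tie correction of the variance).
    """
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return 0.5 * erfc(z / sqrt(2))  # upper-tail probability P(Z > z)
```

In this sketch `x` would hold the frequencies of non-CpG-creating mutations and `y` those of CpG-creating mutations; a small p value is what the table counts as "significant". For real analyses an established statistics library is preferable to this approximation, especially with small samples or many ties.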
Fig. 2 How CpG sites are created. A. There are two ways for a CpG site to be formed by a transition mutation: (1) a C precedes an A (CA) and the A mutates to a G, and (2) a T precedes a G (TG) and the T mutates to a C. B. In this study, we compare mutations that create CpG sites with similar mutations (A→G and T→C) that do not create CpG sites.
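The two CpG-creating cases in this caption translate directly into a small check. The following Python sketch is illustrative only; the function name `creates_cpg` is hypothetical and positions are assumed to be 0-based.

```python
def creates_cpg(seq, pos):
    """True if the transition at 0-based pos would create a new CpG site."""
    base = seq[pos]
    if base == "A":
        # A->G creates a CpG when the A is preceded by C: CA -> CG.
        return pos > 0 and seq[pos - 1] == "C"
    if base == "T":
        # T->C creates a CpG when the T is followed by G: TG -> CG.
        return pos + 1 < len(seq) and seq[pos + 1] == "G"
    # Transitions at C (C->T) and G (G->A) cannot create a CpG site.
    return False
```

In the toy sequence `CATGAT`, the first A (preceded by C) and the first T (followed by G) are CpG-creating sites, while the second A and the final T are not.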
Fig. 5 Each point represents one dataset. Its location corresponds to the number of sequences (on the x axis) and the number of sites with CpG-creating mutations (on the y axis) for that dataset. The colors and shapes indicate which Wilcoxon tests were significant: blue triangles if both A→G and T→C were significant, green squares if only one was significant (partially significant), and red circles if neither was significant. In general, we are more likely to find significant effects for viruses for which we have more data (towards the top and the right).