Leszek J Klimczak1, Thomas A Randall1, Natalie Saini2, Jian-Liang Li1, Dmitry A Gordenin2. 1. Integrative Bioinformatics Support Group, National Institute of Environmental Health Sciences, NIH, Durham, North Carolina, United State of America. 2. Mechanisms of Genome Dynamics Group, National Institute of Environmental Health Sciences, NIH, Durham, North Carolina, United State of America.
Abstract
Genomes of tens of thousands of SARS-CoV2 isolates have been sequenced across the world and the total number of changes (predominantly single base substitutions) in these isolates exceeds ten thousand. We compared the mutational spectrum in the new SARS-CoV-2 mutation dataset with the previously published mutation spectrum in hypermutated genomes of rubella-another positive single stranded (ss) RNA virus. Each of the rubella virus isolates arose by accumulation of hundreds of mutations during propagation in a single subject, while SARS-CoV-2 mutation spectrum represents a collection events in multiple virus isolates from individuals across the world. We found a clear similarity between the spectra of single base substitutions in rubella and in SARS-CoV-2, with C to U as well as A to G and U to C being the most prominent in plus strand genomic RNA of each virus. Of those, U to C changes universally showed preference for loops versus stems in predicted RNA secondary structure. Similarly, to what was previously reported for rubella virus, C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBEC cytidine deaminase being a source of these substitutions. We also found enrichment of several other trinucleotide-centered mutation motifs only in SARS-CoV-2-likely indicative of a mutation process characteristic to this virus. Altogether, the results of this analysis suggest that the mutation mechanisms that lead to hypermutation of the rubella vaccine virus in a rare pathological condition may also operate in the background of the SARS-CoV-2 viruses currently propagating in the human population.
Genomes of tens of thousands of SARS-CoV2 isolates have been sequenced across the world and the total number of changes (predominantly single base substitutions) in these isolates exceeds ten thousand. We compared the mutational spectrum in the new SARS-CoV-2 mutation dataset with the previously published mutation spectrum in hypermutated genomes of rubella-another positive single stranded (ss) RNA virus. Each of the rubella virus isolates arose by accumulation of hundreds of mutations during propagation in a single subject, while SARS-CoV-2 mutation spectrum represents a collection events in multiple virus isolates from individuals across the world. We found a clear similarity between the spectra of single base substitutions in rubella and in SARS-CoV-2, with C to U as well as A to G and U to C being the most prominent in plus strand genomic RNA of each virus. Of those, U to C changes universally showed preference for loops versus stems in predicted RNA secondary structure. Similarly, to what was previously reported for rubella virus, C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBECcytidine deaminase being a source of these substitutions. We also found enrichment of several other trinucleotide-centered mutation motifs only in SARS-CoV-2-likely indicative of a mutation process characteristic to this virus. Altogether, the results of this analysis suggest that the mutation mechanisms that lead to hypermutation of the rubella vaccine virus in a rare pathological condition may also operate in the background of the SARS-CoV-2 viruses currently propagating in the human population.
RNA viruses can show a high mutation rate [1], which often results in fast emergence of viral quasispecies—populations of viruses differing in several genomic positions from the original virus [2]. Errors made by the RNA-dependent RNA-polymerase (RdRP) viral replicase are a source of mutations, however in coronaviruses some of these errors can be corrected by a proofreading RNA exonuclease ExoN [3, 4]. Another source of mutations comes from RNA base editing by two classes of endogenous enzymes: adenine deaminases (ADAR) and cytosine deaminases APOBECs (APOlipoprotein B mRNA Editing Complex like polypeptides) which have a broad range of functions spanning from site-specific editing of cellular mRNAs to inhibiting viral and retrotransposon proliferation [5-7].ADARs (ADAR1 and ADAR2) are double-strand (ds) RNA-specific enzymes converting adenine to inosine (A to I). Since inosine pairs with cytosine, this will result in A to G changes after the next round of replication. The preference of ADARs for certain deamination motifs—reflecting a combination of immediate nucleotide context and the anticipated dsRNA formed by folding—was assessed for in vitro editing of several RNA substrates. Based on these data, software was developed aimed to assign predictive ADAR deamination scores to any A position in a given RNA molecule [8]. The ADAR editing sites that were deduced in RNAs of cultured stimulated immune cells [9] agreed with the preferences defined in the in vitro study. It remains to be established whether these preferences would hold for a wide variety of RNA substrates in conditions of controlled in vivo expression of either ADAR1 or ADAR2.Unlike ADARs, the structure of APOBEC enzymes allow deamination only in single-strand (ss) RNA or in ssDNA. At least two APOBECs, APOBEC1 and APOBEC3A are capable of deaminating cytosine to uracil in RNA, however an RNA editing capacity of other APOBECs cannot be excluded [10-13]. Cytosine deamination in RNA creates the normal RNA base–uracil, which can be then accurately copied in subsequent rounds of RNA replication. DNA deamination motifs or mutation signatures (i.e., the immediate nucleotide contexts around deaminated Cs) of several humanAPOBECs were first defined in model microbial systems and then found in genomic DNAs of humancancers, where they can cause hypermutation clusters [14-19]. The preferred DNA deamination motif of APOBEC3G (A3G) is nCc to nTc (n = any base; the mutated nucleotide and the resulting nucleotide are capitalized). Other APOBECs show preference for the tCn to tTn deamination motif or to a more stringently defined trinucleotide. For example, both APOBEC3A (A3A) and APOBEC3B (A3B) prefer tCa as a target, both in the yeast model and in humancancers [15].An indication of frequent RNA editing was recently found in the isolates of the hypermutated plus-strand ssRNArubella vaccine virus from cutaneous granulomas of children with primary immunodeficiencies [20]. Altogether, genomes of six independent isolates of the hypermutated vaccine-derived viruses contained 993 mutations. Most changes were C to U in the genomic plus-strand RNA. These C to U changes showed high enrichment of a uCa to uUa RNA editing motif–a match to the characteristic A3A or A3B mutagenic motif in DNA. While the similarity between the C to U RNA editing motif in rubella virus and the DNA editing motifs strongly suggested the nature of the editing enzyme, signature motifs of APOBECcytosine deamination in RNA are yet to be confirmed in a direct study involving expression of an APOBEC enzyme and collection of in vivo-editing spectrum data. The second most prevalent type of editing event was A to G change in the rubella plus or minus-strand, revealed as either U to C or G to A changes in the reported plus-strand sequence, respectively. These changes would be expected to result from ADAR editing. Minus-strand RNA in rubella virus as well as in Coronaviridae would often exist within completely- or partially double-stranded RNA [21, 22], which would be the right substrate for an ADAR. This strand is a template for the multiple rounds of transcription generating many plus-strand partial- or full-size genomic RNAs. Thus, an A to G editing event in the minus-strand of dsRNA at the beginning of replication cycle would be carried as a U to C change to multiple rubella virus genomes. A to G editing events in the plus-strand of the dsRNA intermediate may directly contribute to the mutation spectrum in plus-strand viral genomes or propagate the mutation via the subsequent rounds of replication within the same cell. Besides ADAR editing, U to C (or complementary A to G) changes can result from uracil modifications by enzymes normally acting on specific uracils in tRNAs [23, 24]In summary, the previous analysis of the mutation spectrum and mutational signatures of hypermutated rubella virus genomes provided a strong indication of hyperediting by APOBECcytidine deaminases as well as suggested editing by ADARadenine deaminases [20]. Both rubella and Coroniviridae are positive ssRNA viruses which produce many copies of the genomic positive RNA strand and also have dsRNA intermediates in their replication cycles [21, 22, 25], which can serve as substrates for APOBECs and for ADARs, respectively. Indeed, recent analyses suggested APOBEC and ADAR editing in SARS-CoV-2 based on an excess of C to U changes and A to G in sequencing reads from lavages of two COVID-19patients or in genome alignments [26, 27]. Based on the similarity between the preferred RNA editing motifs in rubella virus and the APOBEC DNA hypermutation motifs, we sought to determine whether similar mutational signature motifs can be detected in a collection of 32,341 whole genome sequences of multiple SARS-CoV-2 isolates that have been sequenced during the current COVID-19 pandemic. We present here the evidence indicating a similarity between the RNA editing spectra and mutational signatures between the hypermutated rubella virus isolates and the load of editing changes accumulated in this collection of SARS-CoV-2 genomes. We also found several new trinucleotide-centered mutational motifs unique to SARS-CoV-2.
Materials and methods
SARS-CoV-2 genomes
SARS-CoV-2 genome sequences in the FASTA format were downloaded from https://www.epicov.org/epi3/frontend# at 13:10 EST on 2020/06/24 after applying the following download filters: (i) “complete sequence”; (ii) “high coverage”; (iii) “human”; (iv) “hCOV-19/…”. The downloaded 32,341 FASTA entries were edited to remove spaces from FASTA headers (fatal defects for many tools) and reformatted to a consistent line length of 80 characters. Several samples with non-standard FASTA problems (many of them contain hyphens) that cannot be reasonably fixed and failed at the stage of alignment with the reference, therefore only 32,115 isolates were included into mutation calling.
Mutation calls in SARS-CoV-2 genomes
Mutations in individual isolates were identified using MUMmer 3.23 ([28] and http://mummer.sourceforge.net/) by making pairwise alignments with the original Wuhan isolate (GenBank entry NC_045512.2) using the command:nucmer NC_045512.2.fasta query.fastaThe SNP variants output was generated using the command:show-snps -T -Clr out.deltaand concatenating the individual results into a single tab-delimited text file.For compatibility with other mutation analysis tools, the variant tables were created using the Mutation Annotation Format (MAF): https://software.broadinstitute.org/software/igv/MutationAnnotationFormat but any suitable mutation representation format can be used instead. Functional annotation of the mutations was performed using the standard protocol of ANNOVAR ([29] and https://doc-openbio.readthedocs.io/projects/annovar/en/latest/) based on the genome annotations in GenBank entry NC_045512.2.Out of 251,481 mutations initially called in 32,115 isolates, 251,273 were retained after removing redundant DNA symbols (anything but A,C,G,T) as well as mutation calls separated by less than 20 nt from either end of the reference, of which 243,454 were SNVs in 32070 isolates. Those mutations, redundantly spread in multiple isolates, were collapsed into a non-duplicated MAF designated as NoDups (up to three substitution types at each individual base position in the genome) of 13,736 mutations, 12,156 of which were SNVs. The NoDups filtered MAF was further subdivided into two MAFs: (i) NoDupsNonFunc MAF containing only 4,740 base substitutions that either caused a synonymous change in protein or were located in non-coding regions and therefore were annotated as non-functional; (ii) NoDupsFunc MAF containing only 7,416 base substitutions causing either aminoacid change or protein-truncation and therefore annotated as functional.
Rubella virus genome and mutation data
The set of 993 base substitutions identified in six hypermutated isolates of rubellaRA27/3 vaccine strain listed in MAF format were obtained from a previous study [20]. RA27/3 strain reference sequence GenBank entry FJ211588 was used for RNA-fold and nucleotide context annotations. Rubella mutation calls were compared with de-duplicated sets of SARS-CoV-2 mutation calls from 32,115 isolates contained in three versions of filtered MAFs (see “Design of the analysis” in Results).
Comparison of SARS-CoV-2 and rubella virus base substitution spectra
The first indication of certain mutagenic mechanisms prevailing in generation of mutation load is a non-uniform distribution of base substitutions. Base substitution counts in each virus depend on both the relative probability of a given base substitution within the group of three possible substitutions of a given base and on the prevalence of each of four bases in a viral genome. Thus, in order to correct for the latter, we calculated densities of each of twelve possible base substitutions in each SARS-CoV-2 and rubella MAFs, dividing a base substitution count by the number of the mutated base in the reference sequence. We then assessed similarity of base substitution densities distributions between rubella virus and each of SARS-CoV-2 filtered MAF using non-parametric Spearman correlation with the null hypothesis that, there is no positive correlation between spectra in rubella and SARS-CoV-2.
Statistical evaluation of mutagenesis in trinucleotide-centered mutation motifs
Calculating enrichment and statistical evaluation of mutagenesis in a small number of trinucleotide-centered mutation motifs identified from mechanistic knowledge turned productive in our prior assessments of mutagenesis associated with established mechanisms and known preference to certain trinucleotide motifs [15, 20, 30, 31]. In this study we extended statistical evaluation to all 192 possible trinucleotide centered motifs.Trinucleotide and single-nucleotide frequencies in the genomic background were calculated using two alternative methods:context-based–counts in the 41 nt windows centered around each mutation location;reference-based–counts in the whole reference genome.In both cases, Jellyfish ([32] and https://www.cbcb.umd.edu/software/jellyfish/) was used to calculate the counts of tri- and mononucleotides (k-mers with k equal 3 or 1, respectively) in the appropriate FASTA sequences (multiple FASTA entries for context, single entry for the reference). Each of the three substitution types in each of the 64 trinucleotides (total of 192) centered around the mutated base were counted with a set of 192 counters based on string-indexed arrays implemented as simple commands in Awk.Counts of single nucleotide mutations, mutated trinucleotide motifs as well as trinucleotide and single-nucleotide frequencies in the genomic background were used to calculate enrichment with mutagenesis in each of 192 motifs over the presence expected for random mutagenesis as follows:Enrichment (E) of xYz to xMz mutations calculated as
WhereY and M are the original nucleotide and the nucleotide resulting from mutation, respectively,y is the nucleotide in the context identical to Y in mutation motifx and z are 5’ and 3’ flanking nucleotides in a motif, respectivelyStatistical evaluation of Enrichment values was performed by two-tailed Fisher's exact test p-value comparing two ratios:((xYz:M_counts)/(Y:M_counts-xYz:M_counts)) vs (xyz_counts/y_counts-xyz_counts)P-values were then corrected for multiple hypotheses testing by Benjamini-Hochberg FDR including all 192 motifs. Only values passing FDR = 0.05 were considered statistically significant.A minimum estimate of the number of mutations in a sample caused by xYz to xMz specific mutagenesis in excess of what would be expected by random mutagenesis was calculated as follows:Calculated values were rounded to the nearest whole number. xYz:M_MutLoad_MinEstimate was calculated only for samples passing FDR = 0.05, signifying a statistical over-representation of motif-specific mutagenesis. Samples with FDR>0.05 received a value of 0.
Statistical evaluation of preference to loop or stem locations in predicted RNA secondary structure
The RNAfold function of the ViennaRNA Package 2.0 [33] was used to determine the secondary structure of the complete FASTA sequences of the reference genomes for the SARS-CoV-2 virus (NC_045512.2) and the RA27/3rubella vaccine virus (FJ211588). A sample command for generating the secondary structure of SARS-CoV-2 genome shown below:RNAfold -d2—noLP < nc_045512.2.ref.fasta > nc_045512.2.ref.RNAfold.outThe output for each analysis (.out) in dot-bracket notation was input into BBEdit (https://www.barebones.com/products/bbedit/) and all characters in both sequence and notation rows were made space delimited. Each of these rows were pasted into Excel and turned into space delimited cells. Sequence and notation were separately copied and pasted using the “Transform” function into a new Excel spreadsheet. A column with the nucleotide position was added and the file saved as a tab delimited text file *RNAfold.txt. For each resulting file the first column was the nucleotide position, the second column is the nucleotide, and the third column was the annotation of that nucleotide in dot-bracket notation. The *RNAfold.txt files were used to add a stem-loop annotation column “RNAfold” to all MAF files using the vlookup function in Excel and saved as a tab delimited text file.For searching for motifs and trinucleotides, *RNAfold.txt files were used to create a searchable text files as follows. Columns two and three of each file were copied into a two new text files. On command line, the two columns in each file were merged usingawk '{$(NF-1) = $(NF-1)""$NF;$NF = ""}1' OFS = "\t"The output file from this was opened in BBEdit, the line breaks were removed, resulting in a file containing nucleotides and annotation of those nucleotides in a single row as an interleaved and searchable format as CoV2_annot_final.txt and Rubella_annot_final.txt.These files, displayed in BBEdit, were used to separately count all single nucleotides and all 64 trinucleotides classified as either stem or loop location based on the stem or loop annotation of the individual nucleotide position or of a central position in each trinucleotide.Statistical evaluation of differences between loop vs stem single base substitution mutagenesis or trinucleotide motif associated mutagenesis was by comparing mutation densities in loop vs stem:mutLoop/refLoop—density of a substitution type or a trinucleotide motif mutation type in loopswheremutLoop and refLoop are counts in loops of a given type of events mutations or nucleotides in reference, respectively,andmutStem/refStem—density of a substitution type or a trinucleotide motif mutation type in stemswheremutStem and refStem are counts in stems of a given type of events mutations or nucleotides in reference, respectively.Statistical evaluation of loop vs stem mutagenesis was performed by two-tailed Fisher's exact test comparing ratios (mutLoop/(refLoop-mutLoop)) and (mutStem/(refStem-mutStem)) for either base substitutions or for trinucleotide motifs. Fisher's exact test p-value was corrected by Benjamini-Hochberg for the set of 12 possible base substitutions or for 16 possible tri-nucleotide centered around a given base substitution.
Results
Design of the analysis
The overarching hypothesis of this study was that some of the processes generating RNA mutation load in population of SARS-CoV-2 genomes are similar (but not necessarily identical) to the processes that generated changes in genomic RNAs of hypermutated rubella viruses. For that purpose, we obtained the viral genome FASTA files and processed them to obtain unique mutation calls and the mutation signatures as outlined in Fig 1.
Fig 1
Analysis workflow.
MAF–mutation annotation format table. Each filtered MAF combines mutations from all samples into a single dataset. Details of mutation call filtering and grouping as well as abbreviations are explained in the text of the "Design of the analysis" section.
Analysis workflow.
MAF–mutation annotation format table. Each filtered MAF combines mutations from all samples into a single dataset. Details of mutation call filtering and grouping as well as abbreviations are explained in the text of the "Design of the analysis" section.32,341 FASTA files were downloaded from the GISAID Initiative [34] web site (https://www.gisaid.org/) on 06/22/2020, each containing a consensus whole genome-sequence of a SARS-CoV-2 virus isolated from a human subject and sequenced at high coverage. Based on the published analysis of the GISAID data for a subset of around 4000 of SARS-CoV-2 isolates across the world performed with the use of the Nextstrain package ([35] and https://nextstrain.org/ncov/global?l=clock), an average lineage of SARS-CoV-2 virus successfully transmitted from one subject to another would accumulate approximately 22 base substitutions per year (12–13 base substitutions for the period of December—June, 2020); a similar estimate was also obtained in [36]). The final FASTA sequence files of the individual isolates in GISAID represent a consensus derived from high coverage sequencing reads and contain information about the mutations present with high frequency in a sequenced viral isolate and therefore belong to viral particles capable of proliferation. We aligned each sequence against the sequence of presumably the earliest isolate of a SARS-CoV-2 genome (NC_045512. 2) and listed each change in a separate row of a mutation annotation file (MAF) Fig 1 and S1A Table. We annotated each of 251,273 mutation events in each isolate by surrounding +/- 20 nucleotides of genomic context around position of each mutation, by location in one or in several overlapping ORFs, by potential amino acid change or protein truncation effect, as well as by location of a change in self complementary area (predicted stem) or outside of such area (predicted loop) in plus-strand genomic RNA. Many independent isolates could have originated from the already mutated virus spreading the same mutation(s) into genomes of multiple (up to thousands) downstream isolates (see column “times_refPos_mutated_inMAF” in S1A Table). Therefore, we also annotated each mutation in a sample by the number of different samples in which such a mutation was found. Since each genome of an individual isolate contained only few mutations and many of these mutations were identical in multiple isolates, we built our analysis to evaluate the overall spectrum of non-redundant mutation events that have accumulated in the human population through the current pandemic rather than the mutation spectra in individual isolates We de-duplicated the starting MAF and created three groups of mutations (S2A–S2C Table). The first group contained a pooled non-redundant set with no duplicates (NoDups) that listed each individual mutation only once regardless of how many isolates contained the same mutation (total 12,156 mutation events). While the individual isolates are not listed in this group any more, it contains a set of distinct events most closely representing the spectrum of unrelated mutation events rather than a complex downstream process of distributing mutant forms in human population. However, even in the NoDups list (S2C Table) the base substitutions in many positions could be under positive or negative selection, which could skew the spectrum of the observed changes from the mechanistic mutation spectrum that accurately reflects the underlying mutagenic processes. Therefore, we subdivided this group based on whether the changes yielded functional effects in the SARS-CoV-2 genome. Non-synonymous amino acid changes and changes introducing or removing stop codons were designated as functional (Func), while synonymous changes or changes outside ORFs were designated and non-functional (NonFunc). The content of NoDupsNonFunc group (S2A Table) would be the least affected by functional selection and thus, most accurately represent the impact of unconstrained mutational processes operating on the viral genome. While this group is smaller, it still contains a sufficient number of changes (4,740 mutation events) for detecting trends in the mutational patterns. The mutation spectrum of 7,416 NoDupsFunc events (S2B Table) was also analyzed. Each of the three SARS-CoV-2 mutation spectra was compared to the combined mutation spectrum (993 base substitutions) from six independent isolates originated from the hypermutated rubella-vaccine virus [20] (S3A Table). Unlike many SARS-CoV-2 isolates, where individual mutated event could be carried from one isolate to another, each rubella virus isolate contained mutations that had occurred independently from the vaccine virus in each subject. Thus, the total of the mutational events in six rubella isolates was, at least in part, representative of the mutation spectrum. However, mutation spectra in each rubella isolate may represent an unknown level of selection. Indeed, a number of mutations was observed in more than one rubella virus isolate (see column “times_refPos_mutated_inMAF” in S3A Table). Some level of selection was also indicated by analysis of synonymous and nonsynonymous substitutions in each codon [20]. Therefore, we made separate comparisons of the rubella virus mutation spectra with each of the three SARS-CoV-2 non-redundant MAFs (Fig 1): the non-duplicated mutation events (NoDups), and its two subsets–the non-duplicated mutation events with potential of functional significance (NoDupsFunc) and the non-duplicated non-functional mutation events (NoDupsNonFunc).All mutations are reported based on the plus (genomic) strand of the virus. We started from conventional comparisons of all possible single base substitutions and the mutation preference for potential loop or stem parts of ssRNA secondary structure. Unlike in our previous analysis of the mutation spectrum and signatures in the genomes of hypermutated rubella virus isolates, where we followed only a limited set of motifs based on specific hypotheses, we used here an “agnostic” approach analyzing all 192 possible trinucleotide-centered mutation motifs for enrichment in the viral genomes. We also used existing software to calculate ADAR editing scores [8]. Overall, our methodology allows to detect the mutational signatures that predominate in the viral genomes. Comparisons with the hypermutated rubella genomes further demonstrated the similarities in the mutational processes operating on both viral genomes.
Similarity of base substitution spectra between hypermutated rubella virus genomes and SARS-CoV-2
We compared the distribution of densities of the 12 possible single base substitutions (counts of each base substitution normalized by the presence of the unmutated base in the genome). While density distributions reflect the contributions of different mutagenic processes in each dataset, density values for specific base substitutions cannot be compared directly between two viruses because they were obtained from vastly different genome numbers. Importantly, there was a statistically significant similarity between the distributions of base substitution densities in rubella and in each of three filtered SARS-CoV-2MAFs as well as a similarity in several prevailing types of base substitutions (Fig 2 and S4 Table).
Fig 2
Comparison of base substitution spectra between rubella virus and SARS-CoV-2 datasets from filtered MAFs.
Creation of filtered SARS-CoV-2 MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups, are described in Fig 1 and in associated text. The spectrum from each SARS-CoV-2 filtered MAF was compared with rubella mutation spectrum. Bars represent densities of base substitutions in each dataset calculated by dividing counts of each base substitution by counts of the substituted base in the reference sequence. Connecting lines visualize overall parallelism between rubella and each filtered MAF. Insert boxes show Spearman r, its 95% CI, and one-tailed p-value for hypothesis about positive correlation between rubella and SARS-CoV-2 spectra. Source data are in S4 Table.
Comparison of base substitution spectra between rubella virus and SARS-CoV-2 datasets from filtered MAFs.
Creation of filtered SARS-CoV-2MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups, are described in Fig 1 and in associated text. The spectrum from each SARS-CoV-2 filtered MAF was compared with rubella mutation spectrum. Bars represent densities of base substitutions in each dataset calculated by dividing counts of each base substitution by counts of the substituted base in the reference sequence. Connecting lines visualize overall parallelism between rubella and each filtered MAF. Insert boxes show Spearman r, its 95% CI, and one-tailed p-value for hypothesis about positive correlation between rubella and SARS-CoV-2 spectra. Source data are in S4 Table.In both viruses, there was a very high frequency of the C to U changes, consistent with the hypothesis of cytidine deamination in the plus-strand (genomic) RNA. C to U changes in the minus-strand, which would be reported as G to A in the plus-strand, were less abundant in both viruses. Another class of highly abundant changes in both viruses were U to C changes in the plus RNA strand which could originate from A to G changes caused by ADARadenine deaminase in the minus-strand. The corresponding A to G changes in the plus-strand were less abundant in rubella but were comparable with the C to U changes in SARS-CoV-2.A prior study of hypermutated rubella genomes found small, but statistically significant increase in ADAR scores (calculated as described by [8]) in U to C and A to G ADAR-like base substitutions compared to two other types of substitutions in U or A nucleotides [20]. However, no statistically significant increase in ADAR scores was found for the U to C and A to G changes in the SARS-CoV-2 dataset analyzed in a similar way (S1 Fig and S1 Data). Since the ADAR score tool was developed based on in vitro deamination of a perfectly paired dsRNA substrate, there could be a difference in sequence preferences between this substrate and the actual substrate of in vivo editing of SARS-CoV-2 RNA. Alternatively, abundant U to C and A to G ADAR-like editing could be due to mechanisms not involving ADARs.The only apparent discrepancy between the two viruses was in a high density of the G to U changes in the plus-strand of SARS-CoV-2, while they were nearly absent in rubella. We note that the density of G to U changes in minus-strand (reported as the complementary C to A changes in plus-strand in Fig 2) was similar to other low abundant changes in both viruses. A possible origin of the increased G to U changes in SARS-CoV-2 genomes will be detailed in Discussion.
Several types of base substitutions show preference for regions prone to loop formation in viral RNA secondary structure
A high abundance of C to U (or G to A) mutations was already noticed in several recent analyses of SARS-CoV-2 mutation data and inferred to either APOBEC mutagenesis or to errors in RdRp copying of the minus-strand [26, 36, 37]. C to U mutations in RNA can be also caused by non-enzymatic deamination of cytidines similar to such deamination described in DNA [16, 38]. Recently it was revealed that APOBEC3A has a preference for deaminating cytosines in regions prone to forming loops in ssDNA secondary structure [39]. Therefore we annotated all positions in the SARS-CoV-2 and rubella genomes for either preference for loop or stem location in potential secondary structure formed by the RNA plus-strand ([33] S1B and S3B Tables and Methods). We then compared mutation counts in loop vs stem for each type of base substitutions (Fig 3 and S5 Table).
Fig 3
Comparison of base substitution mutagenesis between locations prone to loop or stem formation in viral RNA genomes.
Creation of filtered SARS-CoV-2 MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups, are described in Fig 1 and in associated text. Bars represent densities of base substitutions in stem- or in loop-forming sections. Densities are calculated by dividing counts of each base substitution in either loop or in stem by counts of the substituted base in the loop-forming or in stem-forming regions of the reference sequence. Statistical comparison between mutagenesis in stem vs loop for every base substitution was done by two-tailed Fisher’s exact test. P-values were considered after correcting by FDR. Brackets indicate pairs passing FDR = 0.05. *<0.05, ** <0.005, *** <0.0005. Source data including exact p-values are in S5 Table.
Comparison of base substitution mutagenesis between locations prone to loop or stem formation in viral RNA genomes.
Creation of filtered SARS-CoV-2MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups, are described in Fig 1 and in associated text. Bars represent densities of base substitutions in stem- or in loop-forming sections. Densities are calculated by dividing counts of each base substitution in either loop or in stem by counts of the substituted base in the loop-forming or in stem-forming regions of the reference sequence. Statistical comparison between mutagenesis in stem vs loop for every base substitution was done by two-tailed Fisher’s exact test. P-values were considered after correcting by FDR. Brackets indicate pairs passing FDR = 0.05. *<0.05, ** <0.005, *** <0.0005. Source data including exact p-values are in S5 Table.In both viruses there was a highly significant preference for loop location with C to U changes in plus-strand. The second type of base changes prevalent in both SARS-CoV-2 and in rubella, the U to C changes in plus-strand (corresponding to A to G changes in minus-strand) did not show statistically significant differences between loop and stem. If the U to C (A to G) changes were to come from ADARadenine deaminase acting on dsRNA, secondary structure effects of ssRNA intermediate folding would not be expected. Alternatively, these changes could be not ADAR driven. The only other type of changes showing statistically significant difference between loop and stem locations in all groups of SARS-CoV-2 mutations were G to U changes. Same as for the base substitution spectra, this outstanding feature of the G to U changes showed up in SARS-CoV-2, but not in rubella (see above).
Mutational motif preferences in SARS-CoV-2 and rubella virus genomes suggest APOBEC cytidine deaminases as a source of C to U base substitutions in the plus RNA strand
Since base substitution spectra in SARS-CoV-2 and rubella virus are correlated, it is likely that they also shared common mechanisms which generated these changes. It is well established that several mechanisms of mutagenesis in DNA of tumors and normal cells can have not only distinctive base substitutions spectra but also diagnostic preference for trinucleotide mutation motifs [30, 40, 41]. Currently there is very little information about motif preference in RNA editing or mutagenesis. Therefore, we assessed enrichment using all possible 192 trinucleotide mutation motifs (96 in plus and 96 in minus RNA strand) of each virus. Enrichment values for each motif were calculated based on counts of mutations in a motif normalized for the motif content in the genomic background (see Methods). Statistical evaluation of enrichments showed significance for several motifs even after FDR<0.05 correction to individual P-values was applied (S6A–S6C Table). However, base substitutions for the most-enriched motifs were present in low numbers, so these results require validation in independent studies (also see Discussion). Therefore, we concentrated on the motifs representing the most abundant types of base substitutions present in both viruses, i.e., on the C to U and their complement G to A, as well as U to C and their complement A to G changes. For statistically-significant enriched trinucleotide motifs containing one of these four base substitutions, we calculated the minimum estimates of mutation load (MutLoad) that can be assigned to mechanism(s) with preference for a significantly enriched motif (Fig 4, S6A–S6D Table and Materials and methods).
Fig 4
Trinucleotide-centered mutation motifs with statistically significant enrichment over random mutagenesis.
Creation of filtered SARS-CoV-2 MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups are described in Fig 1 and in the associated text. The order of data in each group is the same as in the panel’s legend—Rubella, NoDupsNonFunc, NoDupsFunc, NoDups. Zero values are shown for the convenience of following the order of values within each group. Bars represent minimum estimates of mutation load that can be assigned to motif-specific mutagenic mechanism (MutLoad) as described in Methods. Statistical evaluation of enrichments was done by two-tailed Fisher’s exact test and corrected by FDR including p-values for all 192 possible trinucleotide-centered base substitution motifs. MutLoad for FDR>0.05 = 0. Only results for motifs which included the most frequent base substitutions in the plus-strand, C to U, G to A, A to G, U to C, are shown. Reverse complement motifs in the plus-strand corresponding to the statistically significant motifs mutated in the minus-strand are shown in parentheses. If both plus-strand motifs in the reverse complement pair were statistically-significantly enriched in at least one dataset (in rubella or in a filtered SARS-CoV-2 MAF), they are highlighted in red font. Source data including calculations for all 192 motifs are in S6 Table.
Trinucleotide-centered mutation motifs with statistically significant enrichment over random mutagenesis.
Creation of filtered SARS-CoV-2MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups are described in Fig 1 and in the associated text. The order of data in each group is the same as in the panel’s legend—Rubella, NoDupsNonFunc, NoDupsFunc, NoDups. Zero values are shown for the convenience of following the order of values within each group. Bars represent minimum estimates of mutation load that can be assigned to motif-specific mutagenic mechanism (MutLoad) as described in Methods. Statistical evaluation of enrichments was done by two-tailed Fisher’s exact test and corrected by FDR including p-values for all 192 possible trinucleotide-centered base substitution motifs. MutLoad for FDR>0.05 = 0. Only results for motifs which included the most frequent base substitutions in the plus-strand, C to U, G to A, A to G, U to C, are shown. Reverse complement motifs in the plus-strand corresponding to the statistically significant motifs mutated in the minus-strand are shown in parentheses. If both plus-strand motifs in the reverse complement pair were statistically-significantly enriched in at least one dataset (in rubella or in a filtered SARS-CoV-2MAF), they are highlighted in red font. Source data including calculations for all 192 motifs are in S6 Table.The only revealed similarity between statistically-significant enriched motifs in rubella virus and in SARS-CoV-2 was for the uCn to uUn changes, consistent with the tCn to tTn ssDNA mutagenesis specificity of a subgroup of APOBECcytidine deaminases. However, even within the APOBEC-like group of motifs there was a difference between strong enrichment with uCa to uUa motif in rubella and the lack of statistically significant preference for this motif in SARS-CoV-2. There were also three groups of motifs significantly enriched in SARS-CoV-2, but not in rubella (see Table 1 and Discussion for possible mechanistic assignment of these motifs).
Table 1
Analyses of base substitutions prevailing in rubella virus and in SARS-CoV-2 plus-strand genomic RNAs.
Feature
Virus
C to U
G to A
A to G
U to C
Prevalence of a base substitution
rubella
High frequency
Frequent; Less frequent than C to U
High frequency; Less frequent than U to C
High frequency; More frequent than A to G
SARS-CoV-2
High frequency
Frequent; Less frequent than C to U
High frequency
High frequency
Secondary structure element preferred by base substitution
rubella
Prefers loops over stems
ND
ND
ND
SARS-CoV-2
Prefers loops over stems
ND
ND
ND
Enriched trinucleotide motif(s)
rubella
uCn to uUn
ND
ND
ND
SARS-CoV-2
uCn to uUn; aCn to aUn
cGn to cAn (reverse complement for nCg to nTg in minus-strand);
nAu to nGu (reverse complement to motif preferred by U to C)
aUn to aCn (reverse complement to motif preferred by A to G)
Suggested mechanism
rubella and/or SARS-CoV-2
(i) Frequent C to U editing by tCn (uCn) -specific APOBEC(s) in plus-strand; (ii) new motif aCn to aUn
(i) Increased C-deamination in nCg (CpG) minus-strand motif (SARS-CoV-2 only); (ii) No statistical support to APOBEC editing in minus-strands of either virus
A to G editing by ADAR(s) in plus-strand
A to G editing by ADAR(s) in minus-strand
ND–not detected
ND–not detectedWe also assessed the potential loop vs stem preference for trinucleotide motifs containing C to U and G to U single base substitutions that showed overall loop vs stem preference. None of trinucleotide motifs containing C to U base substitutions showed loop or stem preference in selection-free SARS-CoV-2 NoDupsNonFunc filtered dataset and in rubella (S2 Fig and S7A and S7D Table). Several C to U containing trinucleotide motifs in SARS-CoV-2 datasets, where functional selection cannot be excluded, showed statistically significant bias towards mutations in loops (S2 Fig and S7B and S7C Table), however more data accumulation is required in order to exclude the confounding effects of functional selection in specific sites. No loop vs stem preference was detected in trinucleotide motifs containing G to U substitutions either in SARS-CoV-2 or in rubella (S3 Fig and S8A–S8C Table).In summary, our agnostic analysis of trinucleotide signature motifs demonstrated that the uCn (tCn) APOBEC-like mutagenesis, which is a major component in rubella virus hypermutation, also contributes towards the mutations accumulated in the genomes of infectious SARS-CoV-2 spreading in the current pandemic.
Discussion
Previously, we demonstrated that hypermutation of the live attenuated rubella vaccine virus [42] can generate infectious virus particles in immunocompromised children [20]. Based on this work, we hypothesized that similar mutagenic processes may act upon the genomes of other similar plus-strand RNA viruses like SARS-CoV-2. The large-scale sequencing efforts producing genomes of tens of thousands of SARS-CoV-2 isolates allowed us to accurately identify mutations, build a mutation catalog for this virus, highlight similarities with hypermutated rubella and reveal unique features of SARS-CoV-2 mutagenesis (summarized in Table 1).In agreement with the previous analyses performed separately for rubella [20] and for SARS-CoV2 [26, 27, 36, 37] comparisons between SARS-CoV-2 and hypermutated rubella strains demonstrated that the base substitution spectra correlate between the two viruses. Two types of base substitutions–C to U (and its complementary G to A) and A to G (and its complementary U to C), expected from endogenous mutagenesis by APOBECs and ADARs, respectively, prevailed in both viruses (Fig 2).Since RNA can form secondary structures, any mutagenic processes active upon ssRNA would preferentially be formed in the loop regions of the secondary structures. Analysis of A to G and U to C substitutions consistent with the biochemical specificity of ADARs did not reveal any preference for mutagenesis in loops versus stem regions. ADARs are known to act on dsRNA substrates. Thus, if ADARs did in fact contribute to induction of these substitutions, they should be acting on a dsRNA form, wherein we do not expect RNA to fold into secondary structures. We also found in SARS-CoV-2, but not in rubella, statistically significant enrichment of the nAu to nGu mutations along with its reverse complement aUn to aCn. While these could reflect adenine deamination by one of the ADARs in either strand of a dsRNA intermediate, these motifs are different from the motif preference revealed by in vitro editing of artificial dsRNA substrate [8]. Also, there was no increase in ADAR scores in the SARS-CoV-2 A to G or U to C mutations (S1 Fig and S1 Data), thus indicating that ADARs may not be the primary source of these changes. If A to G or U to C changes are really stemming from ADAR activity, it is possible that these enriched motifs are preferred by ADARs in SARS-CoV-2 but not in RNA substrates used to generate the editing consensus in [8]. Alternatively, A to G and U to C changes could be caused by one or more mechanisms unconnected to ADARs.Unlike A to G changes, in-depth analysis of C to U substitutions revealed that they were predominantly present in the RNA plus-strands of both viruses and demonstrated a preference for loops versus stems in the RNA secondary structure (Fig 3 and S5 Table). This phenomenon is similar to the preference of ssDNA and mRNA editing in loops by the APOBEC3Acytidine deaminase [39, 43], which so far is the only RNA-editing APOBEC explored for this feature. Agnostic analysis of enrichments in all 192 possible trinucleotide mutation motifs highlighted statistically significant excess of uCa to uUa motif in rubella, however these changes were not prevalent in SARS-CoV-2. Mutations in plus-strands of both viruses showed statistically significant enrichments with uCg to uUg and uCu to uUu motifs (Fig 4A and 4B). These motifs belong to a group uCn to uUn (tCn to tTn in DNA) which is characteristic of several APOBECcytidine deaminases ([15] and references therein). We note that although these signatures were enriched in the non-functional mutations (NoDupsNonFunc), they did not pass the 0.05 FDR threshold in filtered datasets that included mutations with potential functional effects (NoDupsFunc). These differences in the mutation signatures between SARS-CoV-2 and rubella may be due to different APOBEC family members performing editing or due to the confounding presence of other sources of C to U mutagenesis, such as spontaneous cytosine deamination that frequently occurs in ssDNA [44] or oxidative mutagenesis capable of generating C to T mutations in ssDNA in vivo [45, 46]. In support of the role of oxidative damage in SARS-CoV-2 genomes, is the increased prevalence of G to U substitutions which is consistent with the oxidation of guanines in the RNA plus-strand (Fig 2). G to U changes could be caused by an increased level of oxidative damage generating 8-oxoG in viral RNA within cells or during sequencing library preparation [47, 48]. Frequent copying of 8-oxoG with A, would show up as G to U changes in the strand, where 8-oxoG was present. However, since we analyze the consensus sequences of the viral genomes and not individual reads, errors during library preparation would most likely be filtered out and would not be represented in the viral genome sequence. We also note that the recent study [27] indicated the overall low chance of sequencing errors reflected in SARS-CoV-2 consensus sequences in the dataset analyzed in that work. On the other hand, G to U changes were present only at low density in hypermutated rubella genomes indicating physiological differences between the two viruses.There were two more groups of trinucleotide mutation motifs involving C to U (and complementary G to A) substitutions in plus RNA strand specifically enriched for SARS-CoV-2 (Fig 4A and 4B). The aCn to aUn (reverse complement nGu to nAu) group of motifs may represent a preference previously unknown for APOBECs in RNA or just a mutagenic mechanism yet to be defined. The cGn to cAn group of motifs seen in the plus-strand may be in fact due to mutations of the reverse complement motif nCg to nUg in the minus-strand. nCg to nTg (CpG to TpG) germline and somatic mutagenesis is universally present in DNA of species with 5-methylcytosine and is generated by systems specialized to mutagenesis in methylated CpG sequences. However various studies have demonstrated that CpG to TpG mutagenesis can occur independent of cytosine methylation [49, 50]. Several studies have shown that CpG dinucleotides are depleted in the genomes of SARS viruses indicating functional selection and/or increased frequency of cytosine deamination in these viral genomes [51-54]. Our study shows with high statistical confidence that nCg to nUg (CpG to UpG) mutagenesis in the minus strand is enriched (Fig 4B) supporting the role of nCg- (CpG)-specific cytosine deamination in minus RNA strand in SARS-CoV-2 genomic mutagenesis.In summary, comparison of base substitution spectra and signatures between hypermutated rubella virus isolates and the SARS-CoV-2 multi-genome dataset demonstrates both similarities and differences in the mutational processes active upon the two plus-strand RNA viruses. It is important to understand the mechanisms that contribute to mutagenesis of viral genomes, since hypermutation of even inactivated rubella vaccine virus was shown to generate reactivated viral particles [20]. We demonstrate here that the APOBEC-specific uCa to uUa changes that are highly enriched in hypermutated rubella, are much less prevalent in SARS-CoV-2. We propose that assessment of uCa to uUa signature in viral genomes can provide insights into the potential hypermutation risk of SARS-CoV-2. Moreover, understanding the genomic mutational patterns is important for predicting virus evolution. Our study has highlighted several distinct features of SARS-CoV-2 mutational spectrum that, after validation with independent dataset(s) can be used to build predictive models for this and related SARS viruses.
Mean values of ADAR scores.
ADAR scores were calculated using the Web tool http://hci-bio-app.hci.utah.edu:8081/Bass/InosinePredict. Source data are in S1 Data.(PDF)Click here for additional data file.
Comparison of trinucleotide-centered motif C to U mutation densities between locations prone to loop or to stem formation in viral RNA genomes.
Bars represent densities of base substitutions in stem- or in loop-forming regions. Densities are calculated by dividing counts of each motif mutations in either loop or in stem by counts of this motif in the loop-forming or in stem-forming regions of the reference sequence. Statistical comparison between mutagenesis in stem vs loop for every base substitution was done by two-tailed Fisher’s exact test. P-values were corrected by FDR including 16 motifs containing C to U base substitution. Brackets indicate pairs passing FDR = 0.05. * <0.05, ** <0.005. Source data are in S7 Table.(PDF)Click here for additional data file.
Comparison of trinucleotide-centered motif G to U mutation densities between locations prone to loop or to stem formation in viral RNA genomes.
Bars represent densities of base substitutions in stem- or in loop-forming regions. Densities are calculated by dividing counts of each motif mutations in either loop or in stem by counts of this motif in the loop-forming or in stem-forming regions of the reference sequence. Statistical comparison between mutagenesis in stem vs loop for every base substitution was done by two-tailed Fisher’s exact test. P-values were corrected by FDR including 16 motifs containing G to U base substitution. Brackets indicate pairs passing FDR = 0.05. * <0.05. Source data are in S8 Table.(PDF)Click here for additional data file.
ADAR scores and complete ADAR analysis for SARS-CoV-2 filtered MAFs (contains source data for S1 Fig).
(ZIP)Click here for additional data file.
Mutation calls and RNA fold prediction in SARS-CoV-2 genomes.
S1A. Complete list of mutation calls in all SARS-CoV-2 genomes in TCGA compatible Mutation Annotation Format (MAF); nucleotides named as in DNA. S1B. Annotation of predicted RNA-fold in SARS-CoV-2 reference positions.(XLSB)Click here for additional data file.
S2A. NoDupsNonFunc—de-duplicated set of mutations from all samples of the dataset; non-functional. S2B. NoDupsFunc—de-duplicated set of mutations from all samples of the dataset; aminoacid changes or protein-truncating. S2C. NoDups—de-duplicated set of mutations from all samples of the dataset.(XLSX)Click here for additional data file.
Mutation calls and RNA fold prediction in rubella virus genomes.
S3A. The list of 993 mutations in six rubella isolates (from [20]). Sequences are shown in DNA format (T instead of U) to maintain compatibility with other outputs of the mutation signature R-script. S1B. Annotation of predicted RNA fold in rubella reference.(XLSX)Click here for additional data file.
Counts and densities of single base substitutions in SARS-CoV-2 and in rubella virus.
S4A. Counts of base substitutions. S4B. Densities of base substitutions (Source data for Fig 2).(XLSX)Click here for additional data file.
Comparison of base substitution densities between locations prone to loop or stem formation in viral RNA genomes.
(Source data for Fig 3). S5A. SARS-CoV-2, NoDupsNonFunc—de-duplicated set of mutations from all samples of the dataset; non-functional. S5B. SARS-CoV-2, NoDupsFunc—de-duplicated set of mutations from all samples of the dataset; aminoacid changes or protein-truncating. S5C. SARS-CoV-2, NoDups—de-duplicated set of mutations from all samples of the dataset. S5D. Rubella, all mutations.(XLSX)Click here for additional data file.
Statistical evaluation of mutagenesis in 192 trinucleotide-centered mutation motifs.
Sequences are shown in DNA format (T instead of U) to maintain compatibility with other outputs of the mutation signature R-script. S6A. SARS-CoV-2, NoDupsNonFunc—de-duplicated set of mutations from all samples of the dataset; non-functional. S6B. SARS-CoV-2, NoDupsFunc—de-duplicated set of mutations from all samples of the dataset; aminoacid changes or protein-truncating. S6C. SARS-CoV-2, NoDups—de-duplicated set of mutations from all samples of the dataset. S6D. Rubella, all mutations.(XLSX)Click here for additional data file.
Comparison of C to U trinucleotide motif substitution densities between locations prone to loop or stem formation in viral RNA genomes.
S7A. SARS-CoV-2, NoDupsNonFunc—de-duplicated set of mutations from all samples of the dataset; non-functional. S7B. SARS-CoV-2, NoDupsFunc—de-duplicated set of mutations from all samples of the dataset; aminoacid changes or protein-truncating. S7C. SARS-CoV-2, NoDups—de-duplicated set of mutations from all samples of the dataset. S7D. Rubella, all mutations.(XLSX)Click here for additional data file.
Comparison of G to U trinucleotide motif substitution densities between locations prone to loop or stem formation in viral RNA genomes.
S8A. SARS-CoV-2, NoDupsNonFunc—de-duplicated set of mutations from all samples of the dataset; non-functional. S8B. SARS-CoV-2, NoDupsFunc—de-duplicated set of mutations from all samples of the dataset; aminoacid changes or protein-truncating. S8C. SARS-CoV-2, NoDups—de-duplicated set of mutations from all samples of the dataset. S8D. Rubella, all mutations.(XLSX)Click here for additional data file.
Acknowledgments to research groups and individuals provided SARS-CoV-2 genome sequences to GISAID’s EpiCoV™ Database.
(XLSX)Click here for additional data file.9 Sep 2020PONE-D-20-23843Similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemicPLOS ONEDear Dr. Gordenin,Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.1) Make the suggested minor changes requested by the reviewers. While the additional data analysis suggested by Reviewer #1 might strengthen the manuscript, it is not essential for acceptance for publication (in case it is challenging or technically difficult).2) I want to emphasize that novelty is not a central criterion for publication in PLoS ONE - hence I would like to encourage you to not put too much focus on what is novel (compared to previously published competing studies) but also present and discuss comprehensively aspects where there is overlap (as the tools employed differ between the studies).3) Please contact me about anything that is unclear in the reviewers comments (or if there are difficulties in making the suggested changes), so that a I can help to speed up the revision process.Please submit your revised manuscript by Oct 24 2020 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.Please include the following items when submitting your revised manuscript:A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocolsWe look forward to receiving your revised manuscript.Kind regards,Sebastian D. Fugmann, Ph.D.Academic EditorPLOS ONEJournal Requirements:When submitting your revision, we need you to address these additional requirements.1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found athttps://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf andhttps://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf[Note: HTML markup is below. Please do not edit.]Reviewers' comments:Reviewer's Responses to QuestionsComments to the Author1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********2. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: N/AReviewer #2: YesReviewer #3: Yes**********3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes**********5. Review Comments to the AuthorPlease use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)Reviewer #1: In the manuscript “Similarity between mutation spectra in hypermutated genomes of rubella virus and SARS-CoV-2 genomes accumulated during the COVID-19 pandemic”, by Klimczak et al., the authors use bioinformatics and statistical tools to compare the mutational spectra of rubella and SARS-CoV-2 single-stranded RNA viruses. They extracted four main conclusions: 1) The mutational spectra are similar, “with C to U as well as A to G and U to C being the most prominent in plus-strand genomic RNA of each virus”; 2) “U to C changes universally showed a preference for loops versus stems in predicted RNA secondary structure”; 3) “C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBECcytidine deaminase being a source of these substitutions”; 4) There was an “Enrichment of several other trinucleotide- centered mutation motifs only in SARS-CoV-2 - likely indicative of a mutation process characteristic to this virus.” The analysis is competent, and the presentation suitable for publication. However, as there is an extensive overlap between this manuscript and the recently published work of the group of Dr. Silvo Conticello (Di Giorgio et al., Sci. Adv. 2020; 6: eabb5813), it would be beneficial if the authors identify the similar and distinct conclusions concerning this work. Furthermore, besides the two additional references already cited in the manuscript, a recent paper by Dr. Simmonds (https://doi.org/10.1128/mSphere.00408-20) should also be referenced and discussed, particularly given its thorough analysis of the C to U changes in the Sars-Cov-2 genomes. The authors should engage in a thorough comparison between their data and the recently published results. To make the manuscript shine, it is also perhaps a good idea to highlight what seems to be the only aspect that the four published papers on the mutational spectra of Sars-Cov-2 have failed to notice: conclusion 2 (U to C changes universally show a preference for loops). Below we comment on each of the first three conclusions.The mutational spectra are similarAlthough this reviewer suspects that the contribution of mutations introduced by NGS is negligible, the authors should address this issue in the methods or results, instead of leaving it to the discussion of the G to U changes (Dr. Simmonds deals with this problem compellingly).The rational beyond the generation of the NoDups, NoDupsFunc, and NoDupsNonFunc is clear. However, by removing the information on the number of repeated mutations the authors could be missing hotspots. It should be possible to use the phylogenetic data to determine whether repeated mutations are due to independent mutational events or a shared founder effect.“…density values for specific base substitutions cannot be compared directly between two viruses because they were obtained from vastly different genome numbers”. Given that the major focus of the work is on the comparison between the rubeola and Sars-Cov-2 viruses, it is not apparent why the authors did not subsample the Sars-Cov-2 data to equilibrate the genome numbers and perform the comparison. The comparison may not work for lack of power, but that would be another issue.Figure 2. 1) The definition of the null hypothesis (“positive correlation between rubella and SARS-CoV-2 spectra”) is misleading. Since the P values are all below 0.05, and the authors conclude that there is a correlation, the null hypothesis must be “lack of positive correlation between rubella and SARS-CoV-2 spectra”. 2) There is a slight drop in the correlation when only silent mutations are considered. Could this mean that part of the correlation is explained by selection?More information is needed on the “ADAR score tool.”U to C changes universally show a preference for loopsThe authors seem to favor the scenario in which A3A mostly drives the U to C changes. Since the group of Conticello favors the action of APOBEC1, it would be useful to read a discussion on the strengths and weaknesses of each scenario. To boost the impact of the manuscript, the authors should also consider changing the title to highlight the possible role of APOBEC3A. Regardless of their decision on the title of the manuscript, it is essential to 1) clarify whether the preference for RNA loops is an exclusive feature of A3A or common to other APOBECs editing RNA and 2) deal with the fact that “none of the trinucleotide motifs containing C to U base substitutions showed loop or stem preference in selection-free SARS-CoV-2 NoDupsNonFunc filtered dataset.” Concerning this latter point, the authors could perhaps estimate how often the significance is lost when considering random subsets of NoDups with the same number of elements of NoDupsNonFunct. In addition, it might be useful to incorporate the number of independent mutations in the same position to increase the sample size and perhaps reach significance in the NoDupsNonFunct. Finally, an analysis of the frequency of codons prone to give rise to non-synonymous changes upon C to U mutation in the stem versus loop regions could help evaluate whether selection explains the bias toward loops. For instance, if the frequency of these codons is considerably higher in the loop regions, than the lack of significance in the NoDupsNonFunct could not be as problematic as it seems.C to U changes showed enrichment in the uCn motifIt would be helpful to know what is the degree of overlap between the datasets of the studies already published and the data analyzed in this manuscript.In the introduction, this topic is presented in a very confusing manner. First, the authors focus on the motif preferred by APOBEC3G, which is only known to target (c)DNA. Then, they mention the target preference of APOBEC3A (A3A) and APOBEC3B (A3B) and another trinucleotide that is also a match. It would be better to remove APOBEC3G from the picture and refer the top three (DNA) trinucleotide preferences of A3A and A3B (as percentages). In addition, the authors could use the data available for APOBEC1and A3A to estimate if the trinucleotide motifs in RNA can be predicted from the trinucleotide motifs in DNA (https://doi.org/10.1038/s41467-020-16802-8 provides some hints).Minor issuesA figure with a model would be useful. Furthermore, the authors could consider a complete map of the loop and stem regions along the Sars-Cov-2 genome. This may require some creativity due to scale issues, but for a quick reading any other form of data presentation is better than a table.The writing could be improved. There are several passages with weird punctuation or grammar that need editing (e.g., “Similarly, to what,” “with the null hypothesis that, there is,” “genome shown below,” “generating RNA mutation load in population of SARS-CoV-2 genomes”, “in individual isolates We de-duplicated”…)Reviewer #2: In this manuscript by Dmitry A. Gordenin et al. make effective use of published data sets to address a relevant and straightforward research question: Is there a similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemic. The genomes of six independent isolates of hypermutated 60 vaccine-derived rubella viruses provide a total of 993 mutations, as previously published. This relatively small data set was compared to a large collection of 32,341 whole genome sequence of multiple SARS-CoV-2 isolates (FASTA entries) were included in the mutation calling, of which 251,273 mutations were retained.Mutational patterning and motif searches suggest that the mutation mechanisms underlying hypermutation of the rubella vaccine virus are very similar or even identical to those of the SARS-CoV-2 viruses propagating in the human population with regard to the uCnAPOBEC signature. An enrichment of several other trinucleotide-centered mutation motifs was also found in the SARS-CoV-2 . Despite the fact, that the latter have not been characterized in further detail and compared (limited rubella data set) this is a relevant and interesting study that deserves publication in its present form.Reviewer #3: Thank you for the opportunity to review the manuscript of Gordenin et al., on the comparative mutation-spectrum evaluation between two ssRNA viruses (Rubella and SARS-CoV-2), which I had previously read on bioRxiv. The strong points of this study are that the computational methods including the mathematical and statistical calculations that, as applied, make independent and diverse datasets comparable. On the other hand, this study has a few weaknesses with regards to the theoretical background and results explanations that I will summarise point-by-point below. Overall this is the first study that highlights the evolutionary impact of APOBEC and (very likely) ADAR editing of ssRNA viruses from the population-genetics perspective, and also the first one to link this to the mutability and infectivity of the pandemic-responsible SARS-CoV-2. I therefore recommend that the authors add in the discussion a paragraph that walks through a couple of such examples to connect the present study with previous ones. Other than that, this is an excellent study that should be published.Point-by-point commentsIntroduction• Lines 24-26: RdRPs may be a source of mutation, not a question, however as shown downstream in this work it seems that it’s rather of a “background-frequency” when SNVs are catalogued in aggregate. I suggest adding a “may” be a source of mutation in this sentence.• Lines 66-71: A U-to-C change on the (+) annotated strand of the ss-RNA viruses rubella and SARS-CoV-2 can only be an ADAR effect only if the complementing (-) strand is not from the same virus-genome, as the authors very correctly highlight in lines 72-74. It can perhaps be another RNA or DNA molecule (intermediate? Another viral genome copy?). When SNVs are called by alignment to single-stranded viral genomes the A-to-G on the “other side” which is formed through a loop will be still read as A-to-G. Which leaves an open question about potentially other modifications that may lead to U-to-C on RNAs. I recommend that the authors should have a look into this too (https://www.nature.com/articles/s41467-019-13525-3) and cite it as it highlights potential other sources for U-to-C changes on the mRNA, though not primarily focusing on that. In addition, the fact that A-to-G and U-to-C are probably not always ADAR-related is also supported by the data presented: in Figure 2 it shows that the A-to-G and T-to-C density values are never the close in the robust datasets (A and B panels). The same misconception is also published in reference [23]. (Lines 356 - 364) additional statistical data support this discrepancy.Results• Lines 246-251:This should move at the end of the introduction, between the two sentences ending and starting in line 90. Like this it will set the main goal of this study straight out and help the reader get on board faster.• Lines 257- 326: This part is very technical I suggest that it gets shortened in the results and be included in the Materials and Methods.• Lines 253-255: Please, provide a more self explanatory figure legend. Also in 339-346 legend Figure 2 explain the Dups, NoDups etc• Line 349: For accuracy please consider changing the term “genomic” RNA, which could be very confusing for the reader, and instead I recommend using something like “viral RNA genome”.• Lines 366-371: there is a good chance here to introduce the contribution of the RdRP errors up to the background level frequency.• Lines 402-404: This sentence is a bit self-contradicting. Since secondary structure effects can not be predicted, how can the U-to-C can be explained? Maybe these are cases where these changes are not ADAR-driven? It is confusing, please rephrase clearly.• Lines 424-426: please rephrase here since there’s a separate calculation and visualisation (Fig 4) for C-to-U, G-to-A, U-to-C, A-to-G. Another point that it’s worth keeping in mind is, that in the APOBEC motif sought by the authors it’s not impossible that different enzymes are targeting the (+) or (-) strand in the same double-stranded structure. For instance, the ds structure could be an RNA-DNA hybrid - (some APOBEC3s show preference to such hybrids).Discussion• Lines 498-500: this is a very intriguing point which can also be supported by the findings in reference [23] by the no-specificity in the ADAR motif (25% base composition per nucleotide in the flanking bases). Please, make it a separate paragraph and explain how this can work. Moreover, in Lines 500-501: please rephrase and explain in detail what you mean by “ADARs only in vivo”.Figures• All figures should be reformatted to a better quality.• Figure 1: the text should be minimized.• Figure 2: The panel titles (NoDupsNonFunc etc) should be better explained (also in Figure 3). Perhaps keep the one “dominant” panel (A,B,C?) that really summarizes the mutations in SARS-CoV-2 which can be immediately compared with the corresponding one of Rubella (D). Provide the rest both for SARS-CoV-2 and Rubella in the supplementary material (both for Figure 2 and 3).• Figure 3: in panels A,B,C there is a higher rate of background density from the least dominant mutations that seems higher than the one in Rubella (D). I presume that this is due to the different depth. However if this is an error rate by the polymerase, please calculate it and perform a statistical comparison (wilcoxon test - to consider the rank association of the background?) to Rubella.• Figure 4: please remove the 0 to achieve a less busy plot. Again, keep one of the NoDupsNonFunc, NoDups or Dups. Whichever you think is the most representative to be compared to Rubella (the rest put in the supplementary). And please then colour the bars the same way to highlight potential complementarity between C-to-U / G-to-A or A-to-G and U-to-C (which I can already see that there’s no complementarity to A-to-G and U-to-C. So maybe they aren’t really related?) Please discuss this in Lines 498 and 500.**********6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.If you choose “no”, your identity will remain anonymous but your review may still be made public.Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.Reviewer #1: NoReviewer #2: NoReviewer #3: Yes: F Nina Papavasiliou[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.18 Sep 2020The formatted version of this text is provided as a Word file in the resubmission package and is viewable at the end of combined PDF.--------------------------------------------------------------------------------------------------------------------------------Point by Point Responseto reviewer’s general evaluation and to Comments to the AuthorAuthor’s text in Response is typed in blue font under relevant sections of reviews, which are left in black font. Quotations from the revised text are in red fontAll references to text lines are as in the original version.General evaluation by reviewers.Reviewer's Responses to Questions1. Is the manuscript technically sound, and do the data support the conclusions?The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes________________________________________2. Has the statistical analysis been performed appropriately and rigorously?Reviewer #1: N/AReviewer #2: YesReviewer #3: Yes________________________________________3. Have the authors made all data underlying the findings in their manuscript fully available?The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes________________________________________4. Is the manuscript presented in an intelligible fashion and written in standard English?PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.Reviewer #1: YesReviewer #2: YesReviewer #3: Yes________________________________________Author’s responses to general evaluation :We thank all three reviewers for their careful analysis of our manuscript and for agreeing that the submitted work is technically sound, all data support the conclusions, and statistical analysis was performed appropriately and rigorously. We also thank for the comments aimed to improve our manuscript. These will be specifically addressed in the point-by-point response below. Preceding the specific responses, it may be helpful to reiterate the main question of this study – is there a similarity between the features of the mutation load slowly accumulating in SARS-CoV-2 genomes during the current pandemic and the burst-like mutation load in the previously discovered hypermutated genomes of attenuated rubella-vaccine virus. This question is important because, while there were already indications of similarity of the SARS-CoV-2 and hypermutated rubella mutation spectra, such similarity has not been validated by rigorous statistical analysis of a large dataset. Our study design and data analysis have provided strong statistical support to such a similarity between several features of the slowly accumulating mutation load in SARS-CoV-2 and the burst-like mutation load accumulated in attenuated rubella-vaccine virus. While this could be a coincidental similarity, it may also be a consequence of similarity between the background mutational processes operating in SARS-CoV-2 and the active mutational processes that resulted in hypermutation of rubella virus. The importance of our question and of the performed analysis is underscored by the fact that the hypermutation of the live attenuated rubella vaccine virus revealed in our previous work generated infectious virus particles. While this was documented only in rare cases, granulomas of immunocompromised children, the knowledge about potential background level presence of similar mutational mechanisms in SARS-CoV-2 should be accounted in future studies.Along with the reviewers, we state that our data support this conclusion as well as the conclusions about several incidental findings. Therefore, while the additional analyses suggested below may lead to more incidental findings, they are not required to support the conclusions of our work. Moreover, the suggested analyses appear challenging or technically difficult and will just delay publication of our conclusions that have already been validated by all three reviewers. We have addressed the points raised by the suggested analyses in the point-by-point response below.We also thank the reviewers and Editor for the specific comments aimed to improve our manuscript. In response to reviewer’s and Editor’s suggestions we extended citation of studies that came to conclusions similar to ours in the part related specifically to SARS-CoV-2. To the best of our knowledge, no detailed comparison between SARS-CoV-2 and hypermutated rubella mutation spectra and signatures have been published so far. These as well as all other reviewer’s comments will be addressed individually in the point-by-point response below.All references to text lines are as in the original version.Author’s text in the response is shown in blue. Changes in the text quoted in the response are shown in red both here and in a separately uploaded marked-up copy.Point-by-point-response to:5. Review Comments to the AuthorReviewer #1:In the manuscript “Similarity between mutation spectra in hypermutated genomes of rubella virus and SARS-CoV-2 genomes accumulated during the COVID-19 pandemic”, by Klimczak et al., the authors use bioinformatics and statistical tools to compare the mutational spectra of rubella and SARS-CoV-2 single-stranded RNA viruses. They extracted four main conclusions: 1) The mutational spectra are similar, “with C to U as well as A to G and U to C being the most prominent in plus-strand genomic RNA of each virus”; 2) “U to C changes universally showed a preference for loops versus stems in predicted RNA secondary structure”; 3) “C to U changes showed enrichment in the uCn motif, which suggested a subclass of APOBECcytidine deaminase being a source of these substitutions”; 4) There was an “Enrichment of several other trinucleotide- centered mutation motifs only in SARS-CoV-2 - likely indicative of a mutation process characteristic to this virus.” The analysis is competent, and the presentation suitable for publication.We thank the reviewer for the accurate summary and for the positive evaluation of our original submission.However, as there is an extensive overlap between this manuscript and the recently published work of the group of Dr. Silvo Conticello (Di Giorgio et al., Sci. Adv. 2020; 6: eabb5813), it would be beneficial if the authors identify the similar and distinct conclusions concerning this work.Di Giorgio et al., 2020 was cited (ref. 23 in the initial submission, ref. 26 in revision) in two places of our manuscript as well as two other publications (ref 32, 33 in the initial version, refs 36, 37 in revision) suggesting the roles for RNA editing enzymes in generating the mutation load in SARS-CoV-2.Furthermore, besides the two additional references already cited in the manuscript, a recent paper by Dr. Simmonds (https://doi.org/10.1128/mSphere.00408-20) should also be referenced and discussed, particularly given its thorough analysis of the C to U changes in the Sars-Cov-2 genomes.We have added a reference to Simmonds, 2020 (new ref 27 on lines 86 and 377). This and the other above mentioned citations are associated with the previously existing text (lines 83-86, lines 375-379) and with the modified text (lines 483-486 shown below) pointing to similarity with our analysis in the part of high abundance of C to U mutations as well as A to G mutations.Modification in lines 483-486:In agreement with the previous analyses performed separately for rubella [20] and for SARS-CoV2 [26, 27, 36, 37] comparisons between SARS-CoV-2 and hypermutated rubella strains demonstrated that the base substitution spectra correlate between the two viruses. Two types of base substitutions – C to U (and its complementary G to A) and A to G ( and its complementary U to C), expected from endogenous mutagenesis by APOBECs and ADARs, respectively, prevailed in both viruses (Fig 2). [a portion of the old text also shown in black to provide the logical context].The authors should engage in a thorough comparison between their data and the recently published results.Even if the suggested comparisons were feasible, the additional analysis would not alter any conclusions of our manuscript. Since the data and the analyses formats in those prior works are different from ours, the analytical comparisons are not likely to be possible or would be at least technically difficult and would just delay publication of our conclusions that have already been validated by all three reviewers. We also note that the comparison of the fine details of SARS-CoV-2 mutation spectra revealed by our statistical analyses with the other SARS-CoV-2 studies goes beyond the scope of our study devoted to comparison of SARS-CoV-2 and rubella mutation loads.To make the manuscript shine, it is also perhaps a good idea to highlight what seems to be the only aspect that the four published papers on the mutational spectra of Sars-Cov-2 have failed to notice: conclusion 2 (U to C [reviewer probably meant C to U] changes universally show a preference for loops).This feature was sufficiently highlighted by a separate section of Results (lines 373-408) and in Discussion (lines 488-493 and lines 503-507), Figure 3, S3 Fig., section of Table 1. Since this is an incidental finding that would benefit from validation in more data analyses and experimental studies, we do not see the need for additional highlighting of this finding.Below we comment on each of the first three conclusions.The mutational spectra are similarAlthough this reviewer suspects that the contribution of mutations introduced by NGS is negligible, the authors should address this issue in the methods or results, instead of leaving it to the discussion of the G to U changes (Dr. Simmonds deals with this problem compellingly).In line with the scope of our study to compare the mutation spectra of SARS-CoV-2 and hypermutated rubella, we concentrated on G to U changes because these changes showed the only detectable discrepancy between the two viruses (lines 366-371). In the discussion on lines 519-528 we reasoned (in agreement with this reviewer) that sequencing errors are unlikely to be the cause of such a discrepancy.We thank the reviewer for pointing to the analysis of Dr. Simmons indicating that a low chance of sequencing errors significantly shifts the SARS-CoV-2 mutation spectrum in the dataset and we have inserted the following passage on line 527:We also note that the recent study [27] indicated the overall low chance of sequencing errors reflected in SARS-CoV-2 consensus sequences in the dataset analyzed in that work.The rational beyond the generation of the NoDups, NoDupsFunc, and NoDupsNonFunc is clear. However, by removing the information on the number of repeated mutations the authors could be missing hotspots. It should be possible to use the phylogenetic data to determine whether repeated mutations are due to independent mutational events or a shared founder effect.Looking for hotspots in the SARS-CoV-2 mutation load and distinguishing between functional selection versus mutagenic mechanisms preference is beyond the scope of our study and requires extensive separate analyses exploring several methods and statistical hypotheses. Such analyses can be certainly attempted by others using the complete list of 251,273 unfiltered redundant mutation calls generated by our study and provided in S1A Table.“…density values for specific base substitutions cannot be compared directly between two viruses because they were obtained from vastly different genome numbers”. Given that the major focus of the work is on the comparison between the rubeola and Sars-Cov-2 viruses, it is not apparent why the authors did not subsample the Sars-Cov-2 data to equilibrate the genome numbers and perform the comparison. The comparison may not work for lack of power, but that would be another issue.We agree with the reviewer that such a comparison would lack power because most of the viral RNA genomes would contain the majority or even all mutations coming from the predecessors present in the same dataset. Determining the complete set of new mutations in each viral RNA genome would require distinguishing between the roles of functional selection and mutational preferences which cannot be done unambiguously. We noted that unlike the SARS-CoV-2 dataset, each mutation in each hypermutated rubella virus occurred independently from similar mutations in other rubella RNA genomes present in the same dataset (see lines 304-306 of the original submission).Figure 2.1) The definition of the null hypothesis (“positive correlation between rubella and SARS-CoV-2 spectra”) is misleading. Since the P values are all below 0.05, and the authors conclude that there is a correlation, the null hypothesis must be “lack of positive correlation between rubella and SARS-CoV-2 spectra”.Thank you for noticing this inaccurate statement. The requested change has been entered into the legend of Figure 2 (line 345).2) There is a slight drop in the correlation when only silent mutations are considered. Could this mean that part of the correlation is explained by selection?There could be several explanations for this really “slight drop”, in particular the lower statistical power provided by this smallest filtered subset containing only 4,740 mutation events as compared with the other two larger subsets (7,416 and 12,156 mutations) analyzed (see numbers in Figure 1)More information is needed on the “ADAR score tool.”We have added the following to S1 Fig legend (line 760):ADAR scores were calculated using the Web tool http://hci-bio-app.hci.utah.edu:8081/Bass/InosinePredictU to C changes universally show a preference for loopsThe authors seem to favor the scenario in which A3A mostly drives the U to C changes. Since the group of Conticello favors the action of APOBEC1, it would be useful to read a discussion on the strengths and weaknesses of each scenario.We do not suggest that APOBEC3A is the most likely candidate for making C to U mutations in SARS-CoV-2 genome. We mention only APOBEC3A, but not the other APOBECs, in discussion of loop-over-stem preference, because so far it is the only APOBEC enzyme with a documented biochemical preference for stems over loops, however this feature may be also among the characteristics of other APOBECs, including APOBEC1 acting on viral RNA. Model studies of APOBEC editing of viral RNAs, including the models with expression of individual APOBECs, is required for justified speculations on this subject, which we already stressed on lines 62-68 in the original submission. In order to eliminate the impression that APOBEC3A is the favorite hypothesis we have added the following passage on line 507:…which [APOBEC3A] so far is the only RNA-editing APOBEC explored for this feature.To boost the impact of the manuscript, the authors should also consider changing the title to highlight the possible role of APOBEC3A.Based on the above considerations, we feel it would be premature to include APOBEC3A in the title.Regardless of their decision on the title of the manuscript, it is essential to1) clarify whether the preference for RNA loops is an exclusive feature of A3A or common to other APOBECs editing RNASee our addition to line 507 described above.and 2) deal with the fact that “none of the trinucleotide motifs containing C to U base substitutions showed loop or stem preference in selection-free SARS-CoV-2 NoDupsNonFunc filtered dataset.” Concerning this latter point, the authors could perhaps estimate how often the significance is lost when considering random subsets of NoDups with the same number of elements of NoDupsNonFunct.The suggested analysis will not affect our conclusions (which this reviewer agreed to be supported by the data) regardless of the results. If “significance is lost” with some subsets it could be caused by the lower statistical power provided by the smaller sample sizes.In addition, it might be useful to incorporate the number of independent mutations in the same position to increase the sample size and perhaps reach significance in the NoDupsNonFunct.Different substitutions of the same nucleotide were already included in NoDupsNonFunct, wherever such substitutions occurred in the same nucleotide (see lines 126-128).Finally, an analysis of the frequency of codons prone to give rise to non-synonymous changes upon C to U mutation in the stem versus loop regions could help evaluate whether selection explains the bias toward loops. For instance, if the frequency of these codons is considerably higher in the loop regions, than the lack of significance in the NoDupsNonFunct could not be as problematic as it seems.Both, mutations in non-coding regions as well as mutations in coding regions resulting in synonymous codons were included in Non-functional category so the results of the suggested analysis would allow more than a single interpretation.C to U changes showed enrichment in the uCn motifIt would be helpful to know what is the degree of overlap between the datasets of the studies already published and the data analyzed in this manuscript.GSAID was the most complete dataset of SARS-CoV-2 RNA genome sequences and included most of the pandemic related SARS-CoV-2 genomes analyzed in prior publications. Specific overlaps can be determined by comparing the Excel acknowledgement tables that list the origins of SARS-CoV-2 sequences in each publication. Relative re-evaluation of the statistical power defining the specific features of SARS-CoV-2 mutation spectra in the different publications is beyond the scope of our study aimed at comparison of the mutation loads of SARS-CoV-2 and hypermutated rubella viruses. This said, we have highlighted other works that came to similar conclusions about the prevalence of C to U (G to A) and A to G (U to C) changes in SARS-CoV-2 (discussed above and referenced in the revision as 25, 26, 35, 36).In the introduction, this topic [“this” apparently refers to “C to U changes showed enrichment in the uCn motif”] is presented in a very confusing manner. First, the authors focus on the motif preferred by APOBEC3G, which is only known to target (c)DNA. Then, they mention the target preference of APOBEC3A (A3A) and APOBEC3B (A3B) and another trinucleotide that is also a match. It would be better to remove APOBEC3G from the picture and refer the top three (DNA) trinucleotide preferences of A3A and A3B (as percentages).In fact, the confusion about this section of Introduction was caused by our accidentally dropping a reference implying RNA editing capability for APOBEC3G. We have brought back the reference to A3G RNA editing [new ref 13] to line 46.In addition, the authors could use the data available for APOBEC1and A3A to estimate if the trinucleotide motifs in RNA can be predicted from the trinucleotide motifs in DNA (https://doi.org/10.1038/s41467-020-16802-8 provides some hints).The suggested paper by Jalili et al., 2020 (already referenced (ref 35 in the original submission; ref 39 in revision) does not contain any mention of APOBEC1Minor issuesA figure with a model would be useful. Furthermore, the authors could consider a complete map of the loop and stem regions along the Sars-Cov-2 genome. This may require some creativity due to scale issues, but for a quick reading any other form of data presentation is better than a table.Models of plus-strand ssRNA virus replication and potential mutation generation are provided in several reviews cited in our work as well as in Conticello group recent paper (ref 23 in the original submission, ref 26 in revision). We believe that the summary of preferences organized in Table 1 better serves the purpose of discussion targeted to the main questions addressed by our analysis.The writing could be improved. There are several passages with weird punctuation or grammar that need editing (e.g., “Similarly, to what,” “with the null hypothesis that, there is,” “genome shown below,” “generating RNA mutation load in population of SARS-CoV-2 genomes”, “in individual isolates We de-duplicated”…)We have corrected punctuation and grammar in the positions quoted above (lines 12, 152, 201, 247, 283).In summary, we reiterate our thanks to this reviewer for the careful analysis and positive evaluation of our original submission. We hope to use some of their suggestions in future analyses of RNA editing signatures that go beyond the scope of this submission.Reviewer #2:In this manuscript by Dmitry A. Gordenin et al. make effective use of published data sets to address a relevant and straightforward research question: Is there a similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemic. The genomes of six independent isolates of hypermutated 60 vaccine-derived rubella viruses provide a total of 993 mutations, as previously published. This relatively small data set was compared to a large collection of 32,341 whole genome sequence of multiple SARS-CoV-2 isolates (FASTA entries) were included in the mutation calling, of which 251,273 mutations were retained.Mutational patterning and motif searches suggest that the mutation mechanisms underlying hypermutation of the rubella vaccine virus are very similar or even identical to those of the SARS-CoV-2 viruses propagating in the human population with regard to the uCnAPOBEC signature. An enrichment of several other trinucleotide-centered mutation motifs was also found in the SARS-CoV-2 . Despite the fact, that the latter have not been characterized in further detail and compared (limited rubella data set) this is a relevant and interesting study that deserves publication in its present form.We thank this reviewer for the positive evaluation of our work. Several differences between the rubella and SARS-CoV-2 mutational spectra and signatures were highlighted in Discussion. In the last paragraph of Discussion, we speculated on the future diagnostic use of one of these differences.Reviewer #3:Thank you for the opportunity to review the manuscript of Gordenin et al., on the comparative mutation-spectrum evaluation between two ssRNA viruses (Rubella and SARS-CoV-2), which I had previously read on bioRxiv. The strong points of this study are that the computational methods including the mathematical and statistical calculations that, as applied, make independent and diverse datasets comparable.We thank the reviewer for highlighting the main reason that made us to undertake this analysis derived from the statistical methods of evaluating tri- (or tetra-)nucleotide-centered mutational signatures that we had initially developed for analyzing somatic mutagenesis in human genome (refs. 14 and 26 in the original submission, refs 15 and 30 in revision).On the other hand, this study has a few weaknesses with regards to the theoretical background and results explanations that I will summarise point-by-point below.We have addressed several apparent weakness points in the point-by-point response below.Overall this is the first study that highlights the evolutionary impact of APOBEC and (very likely) ADAR editing of ssRNA viruses from the population-genetics perspective, and also the first one to link this to the mutability and infectivity of the pandemic-responsible SARS-CoV-2. I therefore recommend that the authors add in the discussion a paragraph that walks through a couple of such examples to connect the present study with previous ones. Other than that, this is an excellent study that should be published.A note about the comparison with previous studies was inserted at the beginning of the second paragraph of Discussion (see also response to reviewer 1). Along with a general PLOS One policy we did not stressed on the novelty of conclusions following from our analyses, but just indicated related resuls of other groups.Point-by-point commentsIntroduction• Lines 24-26: RdRPs may be a source of mutation, not a question, however as shown downstream in this work it seems that it’s rather of a “background-frequency” when SNVs are catalogued in aggregate. I suggest adding a “may” be a source of mutation in this sentence.Done• Lines 66-71: A U-to-C change on the (+) annotated strand of the ss-RNA viruses rubella and SARS-CoV-2 can only be an ADAR effect only if the complementing (-) strand is not from the same virus-genome, as the authors very correctly highlight in lines 72-74. It can perhaps be another RNA or DNA molecule (intermediate? Another viral genome copy?). When SNVs are called by alignment to single-stranded viral genomes the A-to-G on the “other side” which is formed through a loop will be still read as A-to-G. Which leaves an open question about potentially other modifications that may lead to U-to-C on RNAs. I recommend that the authors should have a look into this too (https://www.nature.com/articles/s41467-019-13525-3) and cite it as it highlights potential other sources for U-to-C changes on the mRNA, though not primarily focusing on that. In addition, the fact that A-to-G and U-to-C are probably not always ADAR-related is also supported by the data presented: in Figure 2 it shows that the A-to-G and T-to-C density values are never the close in the robust datasets (A and B panels). The same misconception is also published in reference [23]. (Lines 356 - 364) additional statistical data support this discrepancy.A possibility that A to G and U to C changes in SARS-CoV-2 may not be stemming from ADAR action was already presented in the original submission (lines 499-500). We thank the reviewer for highlighting uracil modifications as a possible other source for U to C (or A to G) changes. We have added the following text:at the end of line 76:Besides ADAR editing, U to C (or complementary A to G) changes can result from uracil modifications by enzymes normally acting on specific uracils in tRNAs [23, 24].Results• Lines 246-251:This should move at the end of the introduction, between the two sentences ending and starting in line 90. Like this it will set the main goal of this study straight out and help the reader get on board faster.Thank you for this good suggestion. We have moved this passage as suggested.• Lines 257- 326: This part is very technical I suggest that it gets shortened in the results and be included in the Materials and Methods.Should these lines be moved to Material and Methods, they would have to be split between several parts of this section. We preferred to leave it in one place preceding the description of the outputs, because it defines the structure of the filtered data used in the analyses.• Lines 253-255: Please, provide a more self explanatory figure legend.A self-explanatory Figure 1 legend (lines 253-255) would require repeating a significant part of the section text. Instead, we have added the following clarifying sentence to the Figure 1 legend:Details of mutation call filtering and grouping as well as abbreviations are explained in the text of the "Design of the analysis" section.Also in 339-346 legend Figure 2 explain the Dups, NoDups etcWe have added the following clarification at the beginning of the Figure 2, Figure 3 and Figure 4 legendsCreation of filtered SARS-CoV-2MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups, are described in Fig 1 and in associated text.• Line 349: For accuracy please consider changing the term “genomic” RNA, which could be very confusing for the reader, and instead I recommend using something like “viral RNA genome”.Replaced as suggested• Lines 366-371: there is a good chance here to introduce the contribution of the RdRP errors up to the background level frequency.Lines 366-371 of the result section introduced G to U changes which are prominent in SARS-CoV-2, but are nearly absent in rubella. In lines 521-528 of the original submission, we did discuss one of the several possible origins of these discrepancies between the SARS-CoV-2 and rubella spectra. While we cannot exclude an RdRP error as a potential source of such a discrepancy, we felt it would be premature to invoke this speculation because Coronaviridae have a mechanism of RdRP error correction, while rubella does not.• Lines 402-404: This sentence is a bit self-contradicting. Since secondary structure effects can not be predicted, how can the U-to-C can be explained? Maybe these are cases where these changes are not ADAR-driven? It is confusing, please rephrase clearly.Rephrased lines 402-404 as follows:If the U to C (A to G) changes were to come from ADARadenine deaminase acting on dsRNA, secondary structure effects of ssRNA intermediate folding would not be expected. Alternatively, these changes could be not ADAR driven.• Lines 424-426: please rephrase here since there’s a separate calculation and visualisation (Fig 4) for C-to-U, G-to-A, U-to-C, A-to-G.Rephrased as follows:Therefore, we concentrated on the motifs representing the most abundant types of base substitutions present in both viruses, i.e., on the C to U and their complement G to A, as well as U to C and their complement A to G changes.Another point that it’s worth keeping in mind is, that in the APOBEC motif sought by the authors it’s not impossible that different enzymes are targeting the (+) or (-) strand in the same double-stranded structure. For instance, the ds structure could be an RNA-DNA hybrid - (some APOBEC3s show preference to such hybrids).By our knowledge, DNA-RNA hybrids have not been reported in the replication cycles of rubella and SARS-CoV-2Discussion• Lines 498-500: this is a very intriguing point which can also be supported by the findings in reference [23] by the no-specificity in the ADAR motif (25% base composition per nucleotide in the flanking bases). Please, make it a separate paragraph and explain how this can work.The ADAR scores as well as the online tool for ADAR score calculation was cited in the original submission as ref 8 and were described on lines 34-41. We have added URL of the web tool to the legend of S1 Fig (see response to reviewer 1). We note that the result of the ADAR score analysis was negative and thus could have multiple explanations, including insufficient sample size. Thus, we did not further elaborate on details of the tool, which are well described in ref 8 and at the Web location.Moreover, in Lines 500-501: please rephrase and explain in detail what you mean by “ADARs only in vivo”.Replaced as the text in lines 500-501 to account for this, as well as for the below comment to lines 498-500 as follows:If A to G or U to C changes are really stemming from ADAR activity, it is possible that these enriched motifs are preferred by ADARs in SARS-CoV-2 but not in RNA substrates used to generate the editing consensus in [8]. Alternatively, A to G and U to C changes could be caused by one or more mechanisms unconnected to ADARs.Figures• All figures should be reformatted to a better quality.All main and supplementary figures used for the original submission were verified with PACE web tool. High resolution main figures were available via links in the combined PDF created by the PLOS One submission system• Figure 1: the text should be minimized.We respectfully disagree. We feel that Figure 1, as presented in the original submission, allowed all three reviewers to grasp the details of our analysis rationale and workflow.• Figure 2: The panel titles (NoDupsNonFunc etc) should be better explained (also in Figure 3). Perhaps keep the one “dominant” panel (A,B,C?) that really summarizes the mutations in SARS-CoV-2 which can be immediately compared with the corresponding one of Rubella (D). Provide the rest both for SARS-CoV-2 and Rubella in the supplementary material (both for Figure 2 and 3).Panel title details were added to Figures 2, 3, 4 legends. The reasons for comparing the rubella data with each of the three filtered SARS-CoV-2MAFs (NoDupsNonFunc, NoDupsFunc, NoDups) was explained on lines 301-315. In short, selection would be a confounding factor in each of these comparisons by an extent that is not possible to evaluate. Thus, in our view, presenting all three comparisons on main figures would add convenience, because it would not require a reader continuously switching between the main and supplemental display items.• Figure 3: in panels A,B,C there is a higher rate of background density from the least dominant mutations that seems higher than the one in Rubella (D). I presume that this is due to the different depth. However if this is an error rate by the polymerase, please calculate it and perform a statistical comparison (wilcoxon test - to consider the rank association of the background?) to Rubella.Lines 330-337 in the original submission explain the reason, i.e. the vastly different number of SARS-CoV-2 genomes vs the number of rubella genomes used to generate the mutation catalogs, by which we cannot directly compare densities of base substitutions between SARS-CoV2 and rubella and therefore should only compare distribution of base substitution densities obtained from each dataset.• Figure 4: please remove the 0 to achieve a less busy plot.Again, keep one of the NoDupsNonFunc, NoDups or Dups. Whichever you think is the most representative to be compared to Rubella (the rest put in the supplementary). And please then colour the bars the same way to highlight potential complementarity between C-to-U / G-to-A or A-to-G and U-to-C (which I can already see that there’s no complementarity to A-to-G and U-to-C.We understand that Figure 4 is busy and may appear counter-intuitive, because, unlike in mutagenesis of genomic DNA, viral strands can mutate independently. Therefore showing complementary nucleotides by the same color may unintentionally mislead a reader. Instead (as was already explained in Figure 4 legend) we annotated complementary motifs under each graph and highlighted in red the few cases in which statistically significant enrichment was detected in complementary motifs. These cases are summarized in Table 1 and in Discussion. The reason for not splitting the results obtained with the three filtered MAFs (NoDupsNonFunc, NoDupsFunc and NoDups) is provided in the above response to a comment about Figure 2. This said, we agree that all efforts should be made to help the reader going through this figure. Therefore we have supplemented the Figure 4 legend with the following text:Creation of filtered SARS-CoV-2MAFs and their abbreviated names NoDupsNonFunc, NoDupsFunc, NoDups are described in Fig 1 and in the associated text. The order of data in each group is the same as in the panel’s legend - Rubella, NoDupsNonFunc, NoDupsFunc, NoDups. Zero values are shown for the convenience of following the order of values within each group.So maybe they aren’t really related?) Please discuss this in Lines 498 and 500.We have added the following text at the end of this paragraph:Alternatively, A to G and U to C changes could be caused by one or more mechanisms unconnected to ADARs.In summary, we thank this reviewer for the high evaluation of our manuscript, careful analysis and useful advice that we have tried to follow as much as possible.Additional changes advised by an expert in rubella virus:A. Replaced “rubella” with “rubella virus” in several places, where the word “rubella” may be misinterpreted as referring to a disease rather that to a virus.B. Clarified the nature of rubella vaccine virus to stress biological effect of hypermutation.Lines 472-473:Previously, we demonstrated that hypermutation of the live attenuated rubella vaccine virus [42] can generate infectious virus particles in immunocompromised children [20].[Shown in context with pre-existing text to stress the rationale for the change]Submitted filename: PONE_point_by_point_Klimczak_etal.docxClick here for additional data file.22 Sep 2020Similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemicPONE-D-20-23843R1Dear Dr. Gordenin,We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.Kind regards,Sebastian D. Fugmann, Ph.D.Academic EditorPLOS ONEAdditional Editor Comments (optional):Reviewers' comments:24 Sep 2020PONE-D-20-23843R1Similarity between mutation spectra in hypermutated genomes of rubella virus and in SARS-CoV-2 genomes accumulated during the COVID-19 pandemicDear Dr. Gordenin:I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.If we can help with anything else, please email us at plosone@plos.org.Thank you for submitting your work to PLOS ONE and supporting open access.Kind regards,PLOS ONE Editorial Office Staffon behalf ofDr. Sebastian D. FugmannAcademic EditorPLOS ONE
Authors: Natalie Saini; Joan F Sterling; Cynthia J Sakofsky; Camille K Giacobone; Leszek J Klimczak; Adam B Burkholder; Ewa P Malc; Piotr A Mieczkowski; Dmitry A Gordenin Journal: Nucleic Acids Res Date: 2020-04-17 Impact factor: 16.971
Authors: Natalya P Degtyareva; Natalie Saini; Joan F Sterling; Victoria C Placentra; Leszek J Klimczak; Dmitry A Gordenin; Paul W Doetsch Journal: PLoS Biol Date: 2019-05-08 Impact factor: 8.029
Authors: Kin Chan; Steven A Roberts; Leszek J Klimczak; Joan F Sterling; Natalie Saini; Ewa P Malc; Jaegil Kim; David J Kwiatkowski; David C Fargo; Piotr A Mieczkowski; Gad Getz; Dmitry A Gordenin Journal: Nat Genet Date: 2015-08-10 Impact factor: 38.330
Authors: Pégah Jalili; Danae Bowen; Adam Langenbucher; Shinho Park; Kevin Aguirre; Ryan B Corcoran; Angela G Fleischman; Michael S Lawrence; Lee Zou; Rémi Buisson Journal: Nat Commun Date: 2020-06-12 Impact factor: 14.919
Authors: Tobias Mourier; Mukhtar Sadykov; Michael J Carr; Gabriel Gonzalez; William W Hall; Arnab Pain Journal: Biochem Biophys Res Commun Date: 2020-11-05 Impact factor: 3.575
Authors: Anastasia Meshcheryakova; Peter Pietschmann; Philip Zimmermann; Igor B Rogozin; Diana Mechtcheriakova Journal: Front Immunol Date: 2021-07-01 Impact factor: 7.561
Authors: Fareeda M Barzak; Timothy M Ryan; Maksim V Kvach; Harikrishnan M Kurup; Hideki Aihara; Reuben S Harris; Vyacheslav V Filichev; Elena Harjes; Geoffrey B Jameson Journal: Viruses Date: 2021-02-12 Impact factor: 5.048