"Mutation blacklist" and "mutation whitelist" of SARS-CoV-2.

Yamin Sun^1,2, Min Wang^2,3, Wenchao Lin^2, Wei Dong^2, Jianguo Xu^1,4,5.

Abstract

Over the past two years, scientists throughout the world have completed more than 6 million SARS-CoV-2 genome sequences. Today, the number of SARS-CoV-2 genomes exceeds the total number of all other viral genomes. These genomes are a record of the evolution of SARS-CoV-2 in the human host, and provide information on the emergence of mutations. In this study, analysis of these sequenced genomes identified 296,728 de novo mutations (DNMs), and found that six types of base substitutions reached saturation in the sequenced genome population. Based on this analysis, a "mutation blacklist" of SARS-CoV-2 was compiled. The loci on the "mutation blacklist" are highly conserved, and these mutations likely have detrimental effects on virus survival, replication, and transmission. This information is valuable for SARS-CoV-2 research on gene function, vaccine design, and drug development. Through association analysis of DNMs and viral transmission rates, we identified 185 DNMs that positively correlated with the SARS-CoV-2 transmission rate, and these DNMs were classified as the "mutation whitelist" of SARS-CoV-2. The mutations on the "mutation whitelist" are beneficial for SARS-CoV-2 transmission and could therefore be used to evaluate the transmissibility of new variants. The occurrence of mutations and the evolution of viruses are dynamic processes. To more effectively monitor the mutations and variants of SARS-CoV-2, we built a SARS-CoV-2 mutation and variant monitoring and pre-warning system (MVMPS), which can monitor the occurrence and development of mutations and variants of SARS-CoV-2, as well as provide pre-warning for the prevention and control of SARS-CoV-2 (https://www.omicx.cn/). Additionally, this system could be used in real time to update the "mutation whitelist" and "mutation blacklist" of SARS-CoV-2.

Keywords: De novo mutations; Mutation saturation; SARS-CoV-2; Transmission

Year: 2022 PMID: 35845149 PMCID: PMC9273572 DOI: 10.1016/j.jobb.2022.06.006

Source DB: PubMed Journal: J Biosaf Biosecur ISSN: 2588-9338

Introduction

Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the causative agent of the ongoing coronavirus disease 2019 (COVID-19) pandemic, belongs to the subgenus Sarbecovirus (genus Betacoronavirus) in the Coronaviridae family.1, 2 The first cases of SARS-CoV-2 infection were detected in December 2019, and the virus quickly spread globally,3, 4, 5, 6 leading the World Health Organization (WHO) to declare a pandemic in March 2020. As of February 11, 2022, 404,910,528 confirmed cases of SARS-CoV-2 infection and 5,783,776 deaths had been reported to the WHO. As of January 18, 2022, more than 6 million genomes had been reported, which exceeded the total number of genomes for all other viruses. These genome sequences record the evolution of SARS-CoV-2 during the pandemic, and provide important mutation information. Since the COVID-19 pandemic began, SARS-CoV-2 has continuously evolved, with many variants emerging across the world. These variants are categorized as variants of interest (VOI), variants of concern (VOC), and variants under monitoring (VUM) based on their transmission potential. As of February 2022, five SARS-CoV-2 lineages had been designated as VOC (Alpha, Beta, Gamma, Delta, and Omicron). VOC have increased transmissibility compared with the original virus and the potential for increased disease severity.10, 11 In addition, VOC exhibit decreased susceptibility to vaccine-induced or infection-induced immunity, and thus can re-infect previously infected and recovered individuals. A mutation is defined as an alteration in the DNA or RNA sequence of a genome, which consequently confers a new genotype and sometimes a new phenotype.
Mutations can be beneficial, neutral, or harmful for the virus.12, 13, 14 Beneficial mutations may help the virus spread or replicate more efficiently, providing an advantage over other strains.15, 16 Harmful mutations impair virus replication and transmission and are therefore not retained and recorded.17, 18 This evolutionary process guides SARS-CoV-2 adaptation to its new host. A large number of SARS-CoV-2 mutations have been recorded in more than 6 million genomes, raising the question of which mutations are beneficial or deleterious for SARS-CoV-2. Since the rapid emergence of mutations in viral RNA could potentially render vaccines ineffective, make drug therapy unsuccessful, and lead to false detection results, it is critical to analyze and understand the implications of SARS-CoV-2 single-nucleotide mutations.20, 21 In this study, analysis of SARS-CoV-2 mutations identified six types of base substitutions that reached saturation in the population. Deleterious mutations with a negative impact on virus survival, replication, and transmission were identified and classified as the “mutation blacklist”. Beneficial mutations associated with an increased transmission rate were identified and classified as the “mutation whitelist”. The “mutation blacklist” and “mutation whitelist” determined for SARS-CoV-2 in this study are of great value for research on gene function, vaccine design, drug development, and the identification of new VOC.

Materials and methods

Data collection

SARS-CoV-2 sequences were retrieved from the Global Initiative on Sharing Avian Influenza Data (GISAID) database (as of 18 January 2022, https://www.gisaid.org).22, 23 Complete genomes with an N-content lower than 0.01% and high coverage were selected for subsequent analysis. A Multiple Alignment using Fast Fourier Transform (MAFFT)-generated alignment of high-coverage complete genome sequences was downloaded from the GISAID website.

Mutation analysis

The complete genome of the SARS-CoV-2 isolate Wuhan-Hu-1 (NC_045512.2) was used as the reference genome; mutations in all other samples were compared to this reference isolate. Detected mutations were confirmed using Integrative Genomics Viewer (IGV) and annotated with the SnpEff program.

Construction of a phylogenetic tree

Construction of the phylogenetic tree was performed as previously described. The amount of computation needed to construct an evolutionary tree for the 2.8 million genomes is substantial. Hence, to improve computational efficiency, SARS-CoV-2 genomes were classified by pangolin lineages using the pangoLEARN algorithm. The 2.8 million genomes were divided into 1,514 subsets according to their pangolin lineage. The RAxML software was used to determine the topological relationship among the subsets according to their common mutations, and to construct the evolutionary tree as a “root-tree”. The maximum likelihood phylogenetic tree was constructed based on the General Time Reversible + Invariant + gamma sites (GTR + I + G) model of nucleotide substitution with 1000 bootstrap replicates. Then, 1,514 evolutionary trees were constructed as “branch-trees”, one for each of the 1,514 subsets, using the FastTree software with the Jukes–Cantor model. Finally, the “root-tree” and “branch-trees” were merged to generate the “final-tree” by an in-house script. The flowchart of evolutionary tree construction is provided in the supplementary information (Fig. S1).

De novo mutation detection

Construction of the phylogenetic tree was performed as previously described. The information on the distribution of each mutation in the different clades of the “final-tree” was determined using an in-house-developed script. For each mutation, we scanned the “final-tree” step by step from root to tip to determine the proportion of genomes carrying the mutation in each clade. When >50% of the genomes in a clade contained a particular mutation, we assumed that the mutation arose as a de novo mutation (DNM) at the ancestor node of the clade. To avoid the identification of inherited mutations as DNMs because of inaccurate terminal branching, we merged the DNMs that satisfied all of the following conditions: (1) they share the same mutation type, such as C10029T (base position 10,029 in the genome is mutated from C to T), (2) they appear in the same clade and the clade size is < 2,000 genomes, (3) they were isolated from the same country, and (4) they had a time span of < 6 months. We used these criteria because of the very low probability of detecting multiple DNMs in the same country among 2,000 genomes within 6 months. If the mutation rate for SARS-CoV-2 is taken as $3 \times 10^{-3}$ nucleotide substitutions per site per year, the probability of detecting the same DNMs in 2,000 genomes within 6 months should be $p = (3 \times 10^{-3}) \times (3 \times 10^{-3}) \times 2000 \times 0.5 = 0.009$. To avoid the impact of sequencing errors on DNM detection, we filtered out DNMs supported by only a single genome. The flowchart of DNM detection is provided in the supplementary information (Fig. S2).
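The root-to-tip scan described above can be illustrated with a minimal Python sketch. This is not the authors' in-house script: the `Clade` class, the `find_dnm_origins` function, and the toy tree are hypothetical, and the choice to stop descending once a clade passes the >50% threshold is an assumption made for illustration. The `min_support` parameter mirrors the filter that drops DNMs supported by a single genome.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Clade:
    """Hypothetical tree node: genomes attached here plus child clades."""
    name: str
    genomes: List[Set[str]] = field(default_factory=list)  # one mutation set per genome
    children: List["Clade"] = field(default_factory=list)


def genomes_under(clade):
    """Collect the mutation sets of every genome in the clade (node plus descendants)."""
    collected = list(clade.genomes)
    for child in clade.children:
        collected.extend(genomes_under(child))
    return collected


def find_dnm_origins(root, mutation, threshold=0.5, min_support=2):
    """Scan from root to tip; report clades whose ancestor node is assumed to carry the DNM.

    Assumption: once a clade exceeds the threshold, its descendants inherit the
    mutation, so the scan does not descend further into that clade.
    """
    origins = []
    stack = [root]
    while stack:
        clade = stack.pop()
        members = genomes_under(clade)
        carriers = sum(mutation in genome for genome in members)
        if carriers < min_support:
            continue  # absent, or supported by too few genomes to trust
        if carriers / len(members) > threshold:
            origins.append(clade.name)
        else:
            stack.extend(clade.children)
    return sorted(origins)


# Toy tree: the mutation is a minority at the root but a majority in clade A.
clade_a = Clade("A", genomes=[{"C10029T"}, {"C10029T", "G5T"}, set()])
clade_b = Clade("B", genomes=[set(), {"G5T"}, set(), set()])
toy_root = Clade("root", children=[clade_a, clade_b])
print(find_dnm_origins(toy_root, "C10029T"))
```

In the toy tree, two of seven genomes carry C10029T at the root (below the threshold), so the scan descends and flags clade A, where two of three genomes carry it.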

Mutation saturation

To analyze whether all 12 base substitutions (A->T, A->C, A->G, C->A, C->T, C->G, T->A, T->C, T->G, G->A, G->T, and G->C) in SARS-CoV-2 are saturated, we first grouped the DNMs into 12 subsets according to the base substitution type, and then sorted them according to the time of occurrence. We counted the number of newly occurring non-redundant DNMs week by week, and used R scripts to draw saturation curves of 12 base substitution types. Mutation saturation was defined as the timepoint when the number of non-redundant DNMs of a base substitution type stagnates, indicative of a plateau in the saturation curve.
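As a rough illustration of the weekly counting behind the saturation curves, the sketch below groups DNMs by substitution type, counts newly occurring non-redundant DNMs per week, and reports the week at which the cumulative count stops growing. The authors used R scripts; this Python version, the function names, and the `plateau_weeks` cutoff (how many flat weeks count as "stagnation") are assumptions, since the source does not state a numerical plateau criterion.

```python
from datetime import date


def substitution_type(mutation):
    """Map a mutation name such as 'C10029T' to its substitution type 'C->T'."""
    return f"{mutation[0]}->{mutation[-1]}"


def weekly_cumulative_counts(events, start, end):
    """Cumulative number of non-redundant DNMs per week between start and end.

    events: iterable of (date, mutation_name); repeats of a mutation count once,
    at its earliest date.
    """
    first_seen = {}
    for day, mutation in events:
        if mutation not in first_seen or day < first_seen[mutation]:
            first_seen[mutation] = day
    n_weeks = (end - start).days // 7 + 1
    new_per_week = [0] * n_weeks
    for day in first_seen.values():
        week = (day - start).days // 7
        if 0 <= week < n_weeks:
            new_per_week[week] += 1
    cumulative, total = [], 0
    for count in new_per_week:
        total += count
        cumulative.append(total)
    return cumulative


def saturation_week(cumulative, plateau_weeks=4):
    """Index of the week the curve reaches its final value, if it then stays flat
    for at least plateau_weeks (an assumed cutoff); otherwise None."""
    final = cumulative[-1]
    if final == 0:
        return None
    first_final = cumulative.index(final)
    if len(cumulative) - 1 - first_final >= plateau_weeks:
        return first_final
    return None


# Toy data: three distinct G->T DNMs, one repeat, and one C->T DNM that is filtered out.
toy_events = [
    (date(2020, 1, 7), "G100T"),
    (date(2020, 1, 15), "G200T"),
    (date(2020, 1, 16), "G100T"),
    (date(2020, 1, 20), "C50T"),
    (date(2020, 1, 28), "G300T"),
]
g_to_t = [e for e in toy_events if substitution_type(e[1]) == "G->T"]
toy_curve = weekly_cumulative_counts(g_to_t, date(2020, 1, 6), date(2020, 3, 1))
print(toy_curve, saturation_week(toy_curve))
```

A stricter or looser `plateau_weeks` changes which substitution types are called saturated, so any reuse of this sketch should state the cutoff explicitly.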

Association between DNMs and the transmission rate

Positive effects on the transmission rate were estimated from the increase in genomic prevalence of a specific DNM among subsequently sequenced genomes. For each DNM, the proportion of genomes containing this mutation, out of all genomes sequenced in the 10 weeks following its emergence, was calculated. These data were then analyzed by linear least-squares regression using SciPy to derive the proportion growth slope of each DNM. The slope value of each DNM represents its influence on the transmission potential of each SARS-CoV-2 variant, with larger values reflecting a greater positive impact on the viral transmission rate.
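The slope calculation can be sketched as follows. The authors report using SciPy; to keep the example dependency-free, this sketch computes the same ordinary least-squares slope in closed form with the standard library. The function name, the input layout (weekly carrier counts and weekly totals), and the use of fractions rather than percentages are assumptions; the 0.001 cutoff is the whitelist threshold reported in the Results.

```python
WHITELIST_SLOPE_THRESHOLD = 0.001  # cutoff reported in the Results section


def growth_slope(weekly_carriers, weekly_totals):
    """Ordinary least-squares slope of the weekly carrier proportion against week index.

    weekly_carriers[i]: genomes carrying the DNM sequenced in week i after emergence.
    weekly_totals[i]:   all genomes sequenced in week i (weeks with 0 are skipped).
    """
    points = [
        (week, carriers / total)
        for week, (carriers, total) in enumerate(zip(weekly_carriers, weekly_totals))
        if total > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two weeks with sequenced genomes")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Toy 10-week window: the proportion rises by 0.002 per week.
toy_carriers = [10, 30, 50, 70, 90, 110, 130, 150, 170, 190]
toy_totals = [10000] * 10
toy_slope = growth_slope(toy_carriers, toy_totals)
print(toy_slope > WHITELIST_SLOPE_THRESHOLD)
```

A slope alone does not separate a mutation that drives transmission from one that rides along with a fast-growing lineage, which is the "driver" versus "passenger" caveat raised in the Results.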

Results and discussion

DNM is a term used in genetics to describe a type of genetic mutation that develops in a family member for the first time. Virus evolution begins with a DNM in the viral genome. In this study, we detected a total of 297,826 DNMs in SARS-CoV-2, which covered 88% of the total genome of the virus. This raises the question of why no mutations were identified in the remaining 12% of the genome. Theoretically, this could be due to chance if the number of sequenced genomes is too small to provide full coverage, or alternatively because mutations at those sites are fatal. To address these non-exclusive hypotheses, we analyzed the mutation saturation of 12 types of base substitution (A->T, A->C, A->G, C->A, C->T, C->G, T->A, T->C, T->G, G->A, G->T, and G->C). Previous studies showed that different base substitutions in SARS-CoV-2 had different mutation rates; e.g., G->T and C->T had a high mutation rate, while A->T, T->A, C->G and G->C had a low mutation rate. Our analysis showed that six base substitution types, C->T, G->T, G->A, C->A, T->C, and A->G, reached saturation by January 2022. Mutation saturation means that no novel mutations of that substitution type are expected to appear in the future, even if more genomes are sequenced. Among them, the base substitution type G->T reached saturation as early as December 27, 2020, and the base substitution type C->T reached saturation on February 21, 2021, while G->A, C->A, T->C, and A->G reached saturation on April 4, 2021, December 19, 2021, November 14, 2021, and October 10, 2021, respectively (Fig. 1). The other six base substitution types (C->G, G->C, A->T, T->A, T->G, and A->C) had not reached mutation saturation as of February 2022.
These observations are consistent with a previous report showing that these six base substitution types have a lower mutation frequency compared with other substitution types.8, 29 The mutation saturation of the six types of base substitution means that the subsequent variants in SARS-CoV-2 tend to be a combination of the existing high-frequency DNMs, rather than the emergence of a new mutation. To avoid the effect of sequencing errors on mutation saturation, complete genomes with an N-content lower than 0.01% and high coverage were selected for subsequent analysis. In addition, we filtered out DNMs supported by a single genome. Therefore, we believe that sequencing error had little effect on the mutation saturation.

Fig. 1

Mutation saturation curves of 12 types of base substitution. Identified DNMs in the SARS-CoV-2 sequenced genomes were grouped by base substitution type, and the cumulative saturation percentages of C->T and G->A (A), C->A and G->T (B), T->C and A->G (C), A->T and T->A (D), C->G and G->C (E), and T->G and A->C (F) were plotted over time. The curve plateau reflects 100% saturation, which is marked in the figure by a vertical line.

“Mutation blacklist” of SARS-CoV-2

Among the six saturated base substitution types, we identified 5,945 potential nucleotide mutations, located at 4,178 loci, that never occurred in any sequenced SARS-CoV-2 genome. Of these 5,945 potential nucleotide mutations, 1,039 were in protein coding regions and resulted in premature termination codons and consequent protein truncation, which explains the fatal phenotype. Excluding these, 4,906 potential nucleotide mutations at 3,308 loci never occurred in any sequenced SARS-CoV-2 genome. These mutations may reflect significant changes in gene function with deleterious effects on viral survival, replication, and transmission. Therefore, these were designated as the “mutation blacklist” (Table S1). On the “mutation blacklist”, 19 mutations were located in non-coding regions, accounting for 0.41%, and the remaining 4,887 mutations were located in coding regions, accounting for 99.59%. Among the 4,887 mutations located in the coding region, 4,863 mutations (99.51%) were non-synonymous mutations, and a total of 3,778 mutations resulted in more hydrophilic amino acid residues, accounting for 77.31% of all non-synonymous mutations. This change in hydrophilicity may affect the structure of the corresponding protein and consequently gene function. Meanwhile, this result also implied that SARS-CoV-2 is evolving toward avoiding hydrophilic amino acids. On the “mutation blacklist”, we found that 172 amino acid residues have never changed. These amino acids may play an important role in maintaining the corresponding protein structure and gene function. For example, in the receptor-binding domain (RBD) region of the spike protein, D398 is not mutated in any sequenced genome. Previous studies showed that D398 contributes to variations in pH-dependence between the locked, closed, and open forms of the RBD.
Thus, these 172 unchanged amino acids at various loci on the “mutation blacklist” may reveal functionally and structurally important amino acid residues. Unlike mutations in coding regions, mutations in non-coding regions do not affect gene function or amino acid residues, and are generally subject to high mutation rates. In this study, we found highly conserved loci in non-coding regions of the genome. This suggests that these nucleotides may encode functional non-coding RNA or enzyme recognition sites. Analysis of the genomic distribution of the “mutation blacklist” identified a highly conserved region between 128 and 135 bp in the 5ʹUTR of the genome. Further analysis of this region identified a sequence motif (5ʹ-TATAATTA-3ʹ), which was similar to a TATA box and may therefore be important for transcriptional regulation (Fig. 2A). Another example was the 5ʹ-ACGAAC-3ʹ motif in the intergenic region. Non-mutated nucleotides are also enriched in this motif, and previous studies demonstrated that ACGAAC is a core transcription-regulating sequence guiding the discontinuous RNA synthesis of SARS-CoV-2 (Fig. 2B).
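The way a blacklist of this kind can be assembled is sketched below on a toy sequence: enumerate every possible substitution of the saturated types at each reference position, subtract the mutations that have been observed, and set aside those that create a premature stop codon. This is an illustrative assumption about the procedure, not the authors' code; the function names, the single-CDS toy reference, and the `X<pos>Y` naming are hypothetical. The six substitution types are taken from the saturation dates reported earlier in the Results.

```python
# Saturated substitution types (ref, alt), per the saturation dates in the Results.
SATURATED_TYPES = {("C", "T"), ("G", "T"), ("G", "A"), ("C", "A"), ("T", "C"), ("A", "G")}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def never_observed(reference, observed, saturated=SATURATED_TYPES):
    """List saturated-type substitutions (e.g. 'C4T', 1-based) absent from `observed`."""
    missing = []
    for pos, ref_base in enumerate(reference, start=1):
        for ref, alt in sorted(saturated):
            if ref == ref_base:
                name = f"{ref}{pos}{alt}"
                if name not in observed:
                    missing.append(name)
    return missing


def creates_stop(reference, cds_start, mutation):
    """True if the substitution turns its codon into a stop codon.

    Assumes a single forward-strand CDS starting at 1-based position cds_start.
    """
    pos, alt = int(mutation[1:-1]), mutation[-1]
    offset = (pos - cds_start) % 3
    codon_start = pos - offset
    codon = list(reference[codon_start - 1:codon_start + 2])
    codon[offset] = alt
    return "".join(codon) in STOP_CODONS


# Toy reference: three codons, ATG CAA TGG.
toy_reference = "ATGCAATGG"
all_candidates = never_observed(toy_reference, set())
toy_observed = set(all_candidates) - {"T2C", "C4T", "G8A"}

toy_missing = never_observed(toy_reference, toy_observed)
toy_blacklist = [m for m in toy_missing if not creates_stop(toy_reference, 1, m)]
print(toy_missing, toy_blacklist)
```

In the toy case, C4T (CAA to TAA) and G8A (TGG to TAG) introduce stop codons and are set aside, mirroring the exclusion of the 1,039 truncating mutations, while T2C remains as a blacklist candidate.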

Fig. 2

A is the sequence conservation analysis in the 5ʹ-UTR. The x-axis is the position on the genome, and the y-axis is the number of mutations that have never been detected. B is the sequence conservation analysis in the intergenic region of the orf1ab and spike genes. The x-axis is the position on the genome, and the y-axis is the number of mutations that have never been detected.

Based on the above analysis, we believe that the “mutation blacklist” identifies important amino acid residues in the coding region and the regulatory non-coding regions of the genome. These sites on the “mutation blacklist” are potential targets for the development of SARS-CoV-2 drugs or broadly reactive antibodies.

“Mutation whitelist” of SARS-CoV-2

Some mutations may affect the virus transmission rate. For example, since the emergence of the “star mutation” D614G, the proportion of SARS-CoV-2 variants with this mutation increased rapidly in the sequenced population, reflecting its high transmission rate. Hence, we next determined the relationship between DNMs and variant transmission rates by tracking the changes in prevalence of each DNM in all genomes sequenced up to 10 weeks after its emergence. If the prevalence of the mutation increased weekly, then it is likely to be positively related to virus transmission, which can be measured by the slope of the corresponding linear regression model. Of all DNMs, a total of 185 mutations were significantly positively correlated with viral transmission (slope > 0.001). The genomic distribution of these mutations showed that most were concentrated in the RBD regions of SARS-CoV-2 Spike protein, M protein, and N protein, with a few located within the orf1ab gene and other genes encoding accessory proteins. Moreover, these mutation sites cover almost all of the important mutation sites of VOC and VOI (Fig. 3, Table S2). Therefore, these mutations were designated as the “mutation whitelist” that may benefit virus transmission. It should be noted that some of these mutations are “driver” mutations with a real biological impact on the viral transmission rate, while others may be “passenger” mutations resulting from a “free-riding tendency”. The method used in this study could not distinguish between “driver” and “passenger” mutations.

Fig. 3

Association between DNMs and transmission rate. The x-axis represents the SARS-CoV-2 genome position, and the y-axis represents the weekly growth slope of each DNM in the SARS-CoV-2 population. Mutations with a growth slope greater than 0.001 are marked in red and the corresponding amino acid mutations are noted.

Mutation and variant monitoring and the pre-warning system

The occurrence of mutations and the evolution of viruses are dynamic processes. Consequently, the “mutation whitelist” and “mutation blacklist” of SARS-CoV-2 will change with the evolution of the virus. To monitor the mutations and the variants of SARS-CoV-2, we built a website (https://www.omicx.cn/) that presents a dynamic curve in real time with the proportions of each mutation and variant among all SARS-CoV-2 strains. Through this real-time dynamic curve, we will be able to monitor the epidemic trend for mutations and variants. It should be noted that it can take ∼2 weeks from sampling to submitting the sequenced genome data into public databases. Therefore, the information shown on this website lags by about 2 weeks. This is a universal problem for all monitoring systems based on sequence data. The system consists of four functional modules with the following capabilities: (1) the system can collect SARS-CoV-2 published data resources from public databases (GISAID and NCBI) in real time; (2) the system can calculate real-time statistics on the proportion of mutations and variants worldwide and for specific countries; (3) the system can assign each mutation and variant a growth rate according to the changing trend for the mutation and variant worldwide and for specific countries, and can estimate the possibility of each mutation and variant becoming a major epidemic strain in the world and various countries according to the growth rate; (4) the system can update the “mutation blacklist” and “mutation whitelist” of SARS-CoV-2 in real time, and highlight the mutation sites that have appeared for the first time. Three data categories were included in this website: variant monitoring, mutation monitoring, and mutation blacklist and whitelist (Fig. 4A).

Fig. 4

Features of the mutation and variant monitoring and pre-warning system (MVMPS). A is the website portal. B is the variant monitoring module, which shows the global distribution of each variant by time frame and geography. C is the mutation monitoring module, which shows the global distribution of each mutation by time frame and geography. D is the mutation blacklist and whitelist module.

Variant monitoring can track in real time the weekly trend in transmission of each variant in the world and in specific major countries, and evaluate the epidemic trend of these variants (Fig. 4B). This module calculates and analyzes the proportion of each variant at different times and different places. Thus, through analyzing changes in variant proportions, the possibility of a variant becoming a pandemic variant could be determined. For example, we found that in April 2022, the proportion of variant BA.2.12 increased rapidly at a weekly growth rate of 4 percentage points, which implied that this variant had the potential to become a major epidemic variant. Mutation monitoring can track in real time the trend in transmission of each mutation in the world and in specific major countries, evaluate the impact of each mutation on transmission, determine the mutations that have a positive impact on virus transmission, and then provide early pre-warning (Fig. 4C). Compared with variant monitoring, mutation monitoring is more sensitive because mutations occur before the virus becomes a new variant. The emergence of each variant is a result of the accumulation of many mutations. Thus, through mutation monitoring, some potentially important mutations could be found before the emergence of a new variant. Mutation monitoring can also detect some important mutations distributed in different variants, which are usually produced by different variants through convergent evolution. For example, through mutation monitoring, we found that the proportion of SARS-CoV-2 genomes with mutations at Spike protein L452 rapidly increased in April 2022. These results suggest that a mutation at L452 may help to improve the transmissibility of the virus or enable the virus to escape host immunity. The latest research by Xie et al. showed that mutations at L452 can enable the virus to escape host immunity.
Our further analysis found that L452 mutations in Spike protein mainly exist in four variants: L452M (BA.2.13), L452R (BA.4/BA.5), and L452Q (BA.2.12). These results imply that these variants may obtain mutations at L452 through convergent evolution. The mutation blacklist and whitelist can be updated in real time according to mutation monitoring and variant monitoring information (Fig. 4D). Generally, mutations affect gene function through their effect on protein structure. Therefore, it is of great value to evaluate the impact of each mutation on protein structure. In future work, we will add a structural model to the SARS-CoV-2 mutation and variant monitoring and pre-warning system (MVMPS), especially a structural model of Spike protein, and evaluate the impact of each mutation on protein structure. We believe that such a model will have important applications in the fields of gene function research, host adaptation, and drug discovery.

Conclusion

This study identified six base substitutions that reached saturation in the SARS-CoV-2 population. This means that currently undetected mutations will not be detected even if more genomes are sequenced. These undetected mutations were classified as the “mutation blacklist”, and it is fair to assume that they have potential deleterious effects on the survival, replication, and transmission of SARS-CoV-2. Therefore, this mutation blacklist is of great value for the development of novel therapeutics to control and prevent SARS-CoV-2. The loci of mutations on the “mutation blacklist” likely play key roles in gene function, and further investigation may therefore deepen our understanding of these genes. Furthermore, the loci of mutations on the “mutation blacklist” are also conserved “weak spots” of SARS-CoV-2, which can be targeted by new drugs or vaccines. Analysis of the association between DNMs and transmission rate identified 185 DNMs that were positively correlated with virus transmission. These were classified as the “mutation whitelist”. These mutations seem to benefit viral transmission and could be used to identify new variants with significant epidemic potential. To improve real-time monitoring of mutations and variants of SARS-CoV-2, and to update the mutation blacklist and whitelist, we built a SARS-CoV-2 mutation and variant monitoring website, which can evaluate the epidemic trend of mutations and variants and provide pre-warning for the prevention and control of SARS-CoV-2.

CRediT authorship contribution statement

Yamin Sun: Writing – original draft, Visualization. Min Wang: Writing – original draft, Visualization. Wenchao Lin: Writing – original draft. Wei Dong: Software, Data curation. Jianguo Xu: Supervision, Writing – review & editing.

Declaration of Competing Interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
