
Ongoing positive selection drives the evolution of SARS-CoV-2 genomes.

Yali Hou¹, Shilei Zhao¹, Qi Liu¹, Xiaolong Zhang¹, Tong Sha¹, Yankai Su¹, Wenming Zhao¹, Yiming Bao¹, Yongbiao Xue², Hua Chen³.

Abstract

SARS-CoV-2 is a new RNA virus affecting humans and has spread extensively through world populations since its first outbreak in December 2019. Whether the transmissibility and pathogenicity of SARS-CoV-2 in humans after zoonotic transfer are actively evolving, and driven by adaptation to the new host and environments, is still under debate. Understanding the evolutionary mechanism underlying the epidemiological and pathological characteristics of COVID-19 is essential for predicting the epidemic trend and for providing guidance for disease control and treatment. Using novel strategies for identifying natural selection from within-species polymorphisms, and interrogating 3,674,076 SARS-CoV-2 genome sequences from 169 countries as of December 30, 2021, we demonstrate with population genetic evidence that during the course of the SARS-CoV-2 pandemic in humans, (i) SARS-CoV-2 genomes are overall conserved under purifying selection, especially for the 14 genes related to viral RNA replication, transcription, and assembly; (ii) ongoing positive selection is actively driving the evolution of 6 genes (e.g., S, ORF3a, and N) that play critical roles in molecular processes involving pathogen-host interactions, including viral invasion into and egress from host cells and viral inhibition or evasion of the host immune response, possibly leading to high transmissibility and mild symptoms in SARS-CoV-2 evolution. Based on an established haplotype phylogenetic relationship of 138 viral clusters, a spatial and temporal landscape of 556 critical mutations is constructed according to their divergence among viral haplotype clusters or their repeated increase in frequency within at least 2 clusters; multiple mutations potentially conferring alterations in viral transmissibility, pathogenicity, and virulence of SARS-CoV-2 are highlighted, warranting attention.


Keywords: COVID-19; Darwinian selection; Natural selection; SARS-CoV-2; Viral evolution

Year: 2022 PMID: 35760317 PMCID: PMC9233880 DOI: 10.1016/j.gpb.2022.05.009

Source DB: PubMed Journal: Genomics Proteomics Bioinformatics ISSN: 1672-0229 Impact factor: 6.409

Introduction

A newly emerged betacoronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has caused a worldwide pandemic of coronavirus disease 2019 (COVID-19), presenting a devastating threat to public health owing to its high infectivity and fatality [1], [2], [3], [4]. As of 25 April 2022, COVID-19 had resulted in 509,553,015 confirmed infections and 6,218,082 deaths across 200 countries and regions (https://coronavirus.jhu.edu/map.html). RNA viruses usually have high mutation rates and tend to evolve rapidly [5]. For a new RNA virus affecting humans, such as SARS-CoV-2, a recent host shift likely decreases its fitness and impels the virus to adapt to the new host environment and to public health interventions [6], [7]. Natural selection may act on the transmissibility and virulence of SARS-CoV-2 through specific adaptive mutations, as has been observed in Ebola, Zika, and other viruses [6], [7]. SARS-CoV-2 has circulated globally since its first outbreak in December 2019 and has accumulated many genetic mutations within this short period. There are currently more than 5,124,203 complete SARS-CoV-2 genome sequences publicly available, in which 29,735 nucleotide substitutions have been identified (https://bigd.big.ac.cn/ncov/), constituting the largest population genomic dataset of a non-human organism so far. The SARS-CoV-2 lineages have exhibited considerable variation in transmission and clinical characteristics. For example, the basic reproduction numbers (R0) range from 2.2 to 5.9 [8], [9], and the mortality rates range from 0.8% to 14.5% [10]. Genetic variants with elevated contagiousness and escape from vaccine-derived immunity have emerged and circulated with the outbreaks of the Alpha, Delta, and Omicron strains [11], [12]. To predict epidemic trends, control disease, and design vaccines, it is critical to understand whether and how natural selection drives the evolution of transmissibility and virulence of SARS-CoV-2 during the pandemic.
If the virus is under selection, further research is warranted to identify the functional mutants contributing to its evolving epidemiological and pathogenic characteristics. SARS-CoV-2 evolution is composed of two phases: evolution in animal hosts to acquire the ability to infect humans, and evolution in human populations after zoonotic transfer [13]. So far, assessment of natural selection on SARS-CoV-2 has mainly focused on the host-shifting phase from animals to humans, by analyzing the sequence divergence between SARS-CoV-2 and closely related viruses such as BatCoV-RaTG13 [14], [15], [16], [17], [18]. In contrast, few studies target the latter phase, owing to the lack of efficient methods that can handle indeterminate ancestral sequences, extensive sampling bias, and clustering infections of SARS-CoV-2 [13], [14], [19], [20], [21]. Some studies have explored specific mutations of potential significance in evolution or molecular function [10], [22], [23], [24]. Nevertheless, these analyses are based on the allele frequency change of individual mutations, which is not necessarily due to natural selection. Such analyses are useful for screening candidate mutant loci but do not serve as persuasive proof of the presence of natural selection. Despite the individual functional mutants identified in the aforementioned studies, “there is a lack of compelling evidence” of mutations that “impact the progression, severity, or transmission of COVID-19 in an adaptive manner” [6]. Indeed, the evolutionary driving forces underlying the SARS-CoV-2 epidemiological dynamics and pathogenic changes remain elusive. Furthermore, a genome-wide survey of the evolutionary landscape of functional mutations and their epidemiological implications has not been fully accomplished either.
Herein, we assess natural selection on SARS-CoV-2 evolution during its pandemic in humans by adopting a novel strategy that is relatively robust to viral clustering infections, founder effects, and sampling bias commonly existing in viral genomic data. The analysis validates the hypothesis that ongoing positive selection is indeed actively acting on the SARS-CoV-2 genomes, shaping the epidemic dynamics of COVID-19. We then partition the viral worldwide samples into clusters according to genomic similarity as a result of global transmission and clustering outbreaks. A spatial and temporal landscape of mutations is constructed on top of the clusters, and the critical mutations potentially conferring pathogenic and clinical characteristics of SARS-CoV-2 are highlighted. Our results provide a reference of viral evolution and genomic mutations for epidemic prediction, surveillance, vaccine, and clinical treatments of COVID-19.

Results

We interrogated 3,674,076 publicly available SARS-CoV-2 genome sequences from 169 countries, collected as of December 30, 2021, and identified 69,359 mutations. Since the virus populations are undergoing multiple clustering infections and founder effects, it is hard to use allele frequencies directly to draw reliable conclusions on virus evolution [25]. On the premise of neutral evolution, the ratio of nonsynonymous vs. synonymous mutations (N

Purifying selection dominates the viral genomic evolution in humans

According to diffusion theory for the allele frequency in a large population, the population dynamics of mutations at very low frequency are identical to those of mutations under neutrality; likewise, when the sample size is very large, the population behavior of deleterious or beneficial mutations at very low derived frequency is essentially the same as that of neutral mutations [26]. Therefore, we first calculated the Nm/Sm ratio of mutations at very low frequency (f < 0.0001) as 3.1662, which represents the relative abundance of Nm and Sm when the genomes evolve under neutrality. We then checked the relative occurrence of Nm and Sm at higher frequencies. As shown in Table 1, the Nm/Sm ratio of mutations with f > 0.001 decreases to 1.2828 and is significantly lower than that with f < 0.0001 (P = 2.2E–16, Chi-squared test), indicating negative selection against Nm. We further carried out the same analysis for subsets of mutations in different allele frequency ranges (e.g., (0.0001, 0.001] and (0.001, 0.01]). The pattern of reduced Nm/Sm ratios with increasing f is consistently observed: the ratio is 1.3854 for SNPs with 0.0001 < f <= 0.001, and 1.2050 for 0.001 < f <= 0.01. The trend is consistent with the fact that mutations with higher derived allele frequency are usually older than those with lower frequency and have been under purifying selection for a longer duration.

Table 1

Chi-squared tests to compare the Nm/Sm ratios between different mutation groups

Group	Grouping criterion	No. of Nm sites	No. of Sm sites	Nm/Sm ratio	Chi-squared test
Variations with low frequency of derived alleles	< 0.0001	45,292	14,305	3.17	Pa = 2.2E–16, Pb = 2.2E–16, Pc = 0.0128
Variations with high frequency of derived alleles (a)	> 0.001	998	778	1.28
Variations with high frequency of derived alleles (b)	0.0001 < f < 0.001	4407	3181	1.39
Variations with high frequency of derived alleles (c)	0.001 < f < 0.01	870	722	1.20
Widespread variations	> 30 countries	3606	2395	1.51	P = 2.2E–16
Non-widespread variations	< 15 countries	41,398	11,347	3.65	P = 2.2E–16
Long-time spanning mutations	> 300 days	27,855	14,482	1.92	P = 2.2E–16
Short-time spanning mutations	< 150 days	14,998	1967	7.62	P = 2.2E–16

Note: Nm, nonsynonymous mutations; Sm, synonymous mutations.
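
The group comparisons in Table 1 can be sketched as a chi-squared test on a 2×2 table of Nm and Sm counts. The snippet below is a minimal illustration using only the counts listed in Table 1; the function name `chi2_2x2` and the use of the Yates continuity correction are assumptions for this sketch, not a description of the authors' software.

```python
import math

def chi2_2x2(n1, s1, n2, s2, yates=True):
    """Pearson chi-squared test on the 2x2 table [[n1, s1], [n2, s2]].

    Returns (statistic, p_value) for 1 degree of freedom.
    """
    total = n1 + s1 + n2 + s2
    diff = abs(n1 * s2 - s1 * n2)
    if yates:
        # Continuity correction (assumed here), never below zero.
        diff = max(0.0, diff - total / 2)
    denom = (n1 + s1) * (n2 + s2) * (n1 + n2) * (s1 + s2)
    stat = total * diff ** 2 / denom
    # Survival function of chi-squared with 1 d.f.: P(X > x) = erfc(sqrt(x / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))

# (Nm, Sm) site counts copied from Table 1.
low = (45292, 14305)    # f < 0.0001, used as the neutral reference
high_a = (998, 778)     # f > 0.001
mid_b = (4407, 3181)    # 0.0001 < f < 0.001
mid_c = (870, 722)      # 0.001 < f < 0.01

ratio_low = low[0] / low[1]
ratio_high = high_a[0] / high_a[1]
stat_a, p_a = chi2_2x2(*low, *high_a)
stat_bc, p_bc = chi2_2x2(*mid_b, *mid_c)

print(f"Nm/Sm {ratio_low:.4f} vs {ratio_high:.4f}: chi2 = {stat_a:.1f}, p = {p_a:.3g}")
print(f"group (b) vs group (c): chi2 = {stat_bc:.2f}, p = {p_bc:.4f}")
```

Under these assumptions the low-frequency versus high-frequency comparison gives a p-value far below 2.2E–16, and the comparison of groups (b) and (c) gives a p-value of roughly 0.013. The table does not state which pair of groups each listed P value refers to, so this correspondence is only suggestive.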

We further compared the numbers of Nm and Sm for widespread mutations (defined as mutations observed in viral samples from more than 30 countries) and non-widespread mutations (defined as mutations observed in samples from fewer than 15 countries). The widespread mutations tend to have persisted in the populations for a longer time than non-widespread ones. The Nm/Sm ratio of widespread mutations is significantly lower (P = 2.2E–16, Chi-squared test; Table 1), suggesting that purifying selection has been acting on these mutations. We also grouped mutations according to their spanning time, calculated as the duration between the earliest and the most recent collection time of viral genomes carrying the mutation. The Nm/Sm ratio of long-spanning-time mutations (> 300 days) is significantly lower than that of short-spanning-time ones (< 150 days) (P = 2.2E–16, Chi-squared test; Table 1), indicating more selective constraint on long-spanning-time mutations and again confirming the effect of purifying selection. All the analyses consistently reveal overall purifying selection on SARS-CoV-2 genomes.
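
The spanning-time grouping described above can be sketched as follows. The 300-day and 150-day cut-offs come from the text; the function names, the "intermediate" label, and the example collection dates are hypothetical and serve only as illustration.

```python
from datetime import date

def spanning_days(collection_dates):
    """Days between the earliest and latest genome carrying a mutation."""
    return (max(collection_dates) - min(collection_dates)).days

def span_group(days, long_cutoff=300, short_cutoff=150):
    """Label a mutation using the > 300 / < 150 day cut-offs from the text."""
    if days > long_cutoff:
        return "long"
    if days < short_cutoff:
        return "short"
    return "intermediate"  # falls in neither of the two compared groups

# Hypothetical collection dates, for illustration only.
example = {
    "mutA": [date(2020, 3, 1), date(2020, 9, 15), date(2021, 6, 30)],
    "mutB": [date(2021, 1, 10), date(2021, 2, 20)],
}
groups = {name: span_group(spanning_days(dates)) for name, dates in example.items()}
print(groups)
```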

Positive selection drives the adaptive evolution of genes conferring pathogenicity and infectivity

Even though SARS-CoV-2 is under genome-wide negative selection, a small fraction of the viral genome may have undergone positive selection, whose genetic polymorphism pattern may be diluted in the genome-wide Nm/Sm ratios. To detect positive or purifying selection acting on individual SARS-CoV-2 genes, we again rely on the fact that mutations with higher derived allele frequency have usually undergone a longer duration of natural selection and present increased or decreased Nm/Sm ratios compared with those at very low frequency, whose population dynamics are identical to those under neutrality. Instead of comparing the Nm/Sm ratio between two groups of mutations (the NSRF1 method), we carried out stricter statistical tests that investigate the increasing or decreasing trend of Nm/Sm ratios with increasing allele frequency (the NSRF2 method), as a more robust indicator of the footprint of natural selection. Three trend tests, including 2 nonparametric tests (the Mann–Kendall (M&K) and Cox–Stuart (C&S) tests) and a linear regression method (LinRegress), were applied to detect the trend of the Nm/Sm ratio as a function of mutant allele frequency (or mutant allele count). Six genes, including the spike (S) and nucleocapsid (N) genes, open reading frame 3a (ORF3a), ORF8, non-structural protein 4 (NSP4), and NSP13, show strong signals of positive selection, i.e., a significantly increasing trend of Nm/Sm ratios with increasing allele frequency (P < 0.01; Figures 1A and 2A; Table S1). Of these, S, N, ORF3a, and ORF8 have previously been reported to have elevated protein evolutionary rates in SARS-CoV-2 evolution [27]. These genes play critical roles in molecular processes involving pathogen-host interactions, including viral invasion into and egress from host cells and viral inhibition or evasion of the host immune response, contributing to divergent pathogenic outcomes.
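
As an illustration of one of the trend tests named above, the following is a minimal sketch of a two-sided Mann–Kendall test using the normal approximation without tie correction. The function name and the series of Nm/Sm ratios are hypothetical; the frequency bins used in the study are not given in this section, so the input is purely illustrative.

```python
import math

def mann_kendall(values):
    """Two-sided Mann-Kendall trend test (normal approximation, no tie correction).

    Returns (S, z, p_value); S > 0 suggests an increasing trend.
    """
    n = len(values)
    s = 0
    for i in range(n - 1):
        for j in range(i + 1, n):
            d = values[j] - values[i]
            s += (d > 0) - (d < 0)  # sign of each pairwise difference
    var_s = n * (n - 1) * (2 * n + 5) / 18
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))
    return s, z, p

# Hypothetical Nm/Sm ratios for one gene across increasing frequency bins.
ratios_by_bin = [1.1, 1.3, 1.2, 1.6, 1.9, 2.4, 2.2, 3.0]
s, z, p = mann_kendall(ratios_by_bin)
print(f"S = {s}, z = {z:.2f}, p = {p:.4f}")
```

In this sketch a positive S with a small p-value would correspond to the increasing trend interpreted in the text as a signal of positive selection, and a negative S to the decreasing trend interpreted as purifying selection.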
Intriguingly, the S protein is under positive selection, indicating that it has experienced adaptive alterations in its binding affinity to human angiotensin-converting enzyme 2 (ACE2) to gain cellular entry efficiency and viral infectivity during the pandemic. The N protein is one of the most crucial structural components facilitating viral replication, assembly, and release, and acts as an important immunodominant antigen. It has been reported to promote NLRP3 inflammasome activation and induce excessive inflammatory responses [28]. The N protein was previously regarded as highly conserved [29], yet it is under positive selection during the SARS-CoV-2 pandemic. ORF3a has been demonstrated to induce cellular apoptosis, lysosomal exocytosis-mediated viral egress, and inhibition of the type I IFN response, as well as a potential cytokine storm, all of which are key processes determining viral infectivity, pathogenicity, and virulence [30], [31]. ORF8 mediates host immune evasion through down-regulation of MHC-I and inhibition of the type I IFN response, promotes viral replication, induces apoptosis, and modulates ER stress [32]. NSP4 possibly participates in membrane rearrangement to benefit the formation of the viral replication and transcription complex; it may also have experienced positive selection when shifting from non-primate hosts to humans, with some mutations potentially contributing to unique biological, pathological, and epidemiological features of SARS-CoV-2 [14]. NSP13 inhibits the type I IFN response by interacting with TBK1, and counteracts antiviral immunity by hijacking the host deubiquitinase USP13 [33]. These findings highlight that during the COVID-19 global pandemic, positive selection is very likely an essential driving force acting on viral invasion and on the interplay between infection and host immune defense, reshaping the viral features of infectivity, pathogenicity, and virulence.

Figure 1

Evidence of natural selection acting on the SARS-CoV-2 genome. A. Genes showing significant signals of positive selection in this study are marked in red, and those showing significant signals of negative selection in this study are in blue. B. The genetic diversity of each gene, which is indicated by Theta (w), calculated as nucleotide diversity per site in the sequences. C. The mutation frequency spectrum. D. The gene structures.

Figure 2

Illustration of the trends of Nm/Sm ratio along with the increased allele frequencies for genes with strong evidence of positive or purifying selection. A. Significantly increasing trends of Nm/Sm ratio with increasing allele frequency for the N, ORF3a, and S genes, indicative of positive selection. B. Non-significant trends for the ORF7a, NSP8, and NSP9 genes, showing no detectable signal of selection. C. Significantly decreasing trends of Nm/Sm ratio with increasing allele frequency for the NSP1, NSP7, and M genes, indicative of purifying selection. Nm/Sm, nonsynonymous vs. synonymous mutations.

In contrast, fourteen viral genes demonstrate a significantly decreasing trend of Nm/Sm ratio with increasing f (P < 0.01; Figures 1A and 2C; Table S1), consistent with the finding of the preceding section that purifying selection dominates. Ten of the negatively selected genes (NSP1, NSP2, NSP3, NSP5, NSP6, NSP7, NSP10, NSP12, NSP15, and NSP16) encode non-structural proteins of SARS-CoV-2, including components of the replication and transcription complex such as RNA-dependent RNA polymerase (RdRp), papain-like protease (PLpro), main proteinase (Mpro), and RNA primase, which are all essential to viral RNA replication, transcription, and translation [34], [35]. NSP10 and NSP16 (2'-O-methyltransferase) form a complex during the coronavirus life cycle that can methylate the 5' cap of viral RNAs, enhancing their translation and mimicking cellular mRNAs to prevent recognition by host innate immunity [36].
Two other negatively selected genes, the accessory genes ORF6 and ORF10, play roles in evading host immune restriction. ORF6 has been demonstrated to inhibit the type I IFN response by blocking nuclear translocation of STAT1, STAT2, and IRF3, and to prevent the host immune response via nuclear imprisonment of host mRNA, serving as an antagonist of host immunity [37]. The remaining negatively selected genes are structural genes encoding the viral envelope (E) and membrane (M) proteins, which are involved in the assembly of progeny virions [38], [39]. These results indicate that proteins conferring fundamental coronaviral molecular functions, such as viral replication, translation, and assembly, as well as functions in evading the host innate and adaptive immune systems such as mimicking or imprisoning host mRNA, are under significant purifying selection.

Clustering pattern of viral lineages

As we have demonstrated, although the viral genomes are overall under purifying selection, positive selection has been driving the evolution of genes related to coronavirus infection and host immune defense, probably shaping the epidemic and pathogenic diversification of viral populations. A further step is to understand the spatial-temporal dynamics of this diversification and to identify the putative functional mutations subject to positive selection. Some studies have provided lists of candidate mutations, most of which were identified according to trends of allele frequency change in the overall global samples. However, the viral populations have been evolving and spreading at heterogeneous rates, demonstrating a clustering pattern. Investigating allele frequencies in samples pooled from multiple populations has two limitations: first, it has limited power to identify mutants that arose recently in a local population but are still at low frequency in the global population; second, it provides little information on the spatial and temporal origin (when and where) of these functional mutations. To track the evolutionary dynamics of genomic variants at a fine scale, we partitioned the sample of 3,328,405 genomes into distinct clusters according to their sequence similarity and evolutionary relationship, and identified 138 worldwide predominant clusters of SARS-CoV-2, denominated as C10, C23, …, and C299, respectively (see the Methods section for details of the partition approach). The clusters and their genealogical relationship on a haplotype network are presented in Figure 3.

Figure 3

The genealogical relationship of worldwide haplotype clusters of SARS-CoV-2. The nodes represent different haplotype clusters, with node sizes proportional to the number of sequences belonging to each cluster. The number of line segments separated by dots between adjacent nodes indicates the Hamming distance between clusters. Within each node, its geographical distribution is presented. The listed mutations differentiating adjacent clusters are marked in purple for those within genes under positive selection, in cyan for those within genes under purifying selection, and in red for those that occurred at least twice in distinct phylogenetic relationships.

The genealogical relationship reflects the establishment and evolutionary routes of diversified viral clusters, consisting of repeated processes of viral emergence, transmission across populations and countries, and mutation accumulation. As shown in Figure 3, the viruses show extensive transmission across continents and countries over time. The haplotype clusters comprise the currently circulating variants of concern (VOC) and variants of interest (VOI), such as the B.1.1.7/Alpha (UK), B.1.351/Beta (South Africa), P.1/Gamma (Brazil), B.1.617/Delta (India), B.1.1.529/Omicron (South Africa), B.1.427/429/Epsilon (USA/California), B.1.525/Eta (UK, Nigeria), P.2/Zeta (Brazil), and B.1.526/Iota (USA/New York) lineages. Emergence of VOCs and VOIs is usually accompanied by accumulation of an excess of mutations, while a VOC is characterized by a signal of enhanced transmission [40]. The Alpha and Delta variants are the most abundant and widely distributed haplotype clusters so far, having spread broadly across the USA and Europe (especially the UK). Of note, the clusters are still subject to ongoing evolution and branching, which warrants further surveillance.
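
The Hamming distance used for the edges of the haplotype network is simply the number of sites at which two haplotypes differ. A minimal sketch follows, assuming each cluster haplotype is encoded as a string of alleles over the same ordered set of variable sites; this encoding and the example haplotypes are hypothetical.

```python
def hamming(hap_a, hap_b):
    """Number of sites at which two equal-length haplotypes differ."""
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same sites")
    return sum(a != b for a, b in zip(hap_a, hap_b))

# Hypothetical haplotypes over six variable sites, for illustration only.
parent = "CTAGGA"
child = "TTAGAA"
print(hamming(parent, child))  # two sites differ
```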

Tracking the spatial and temporal occurrence of putatively selected mutations along the pandemic dynamics

Nucleotide mutations that are predominant in one cluster and absent or at low frequency in others are potentially of functional importance for viral pathogenicity and transmissibility, and some of them may be targets of positive selection. Following this criterion, we identified 545 protein-coding variations differentiated among the 138 delineated haplotype clusters, including 361 Nm and 175 Sm sites. We mapped the occurrence of some of these mutations to the branches connecting the clusters on the haplotype network (Figure 3). The number of inter-cluster mutations per branch varies from 1 to 47. We emphasize that the allele frequency of a single mutant locus is neither informative nor robust enough to test the effect of natural selection; the mutations identified in this section serve as the most plausible candidate loci under positive selection for further functional investigation. The evidence for natural selection was discussed in the preceding sections using the trend of the Nm/Sm ratio. C112 is likely the earliest cluster if sample collection dates are used as the criterion (also supported by the inferred Time to the Most Recent Common Ancestor (TMRCA), results not shown). The Nm L84S in the ORF8 protein, together with a tightly linked Sm variant (S76S) in the NSP4 protein, emerged on the branch leading to other primary early clusters such as C109 and C100 (Figure 3). The L84S replacement together with S76S was used in former studies to define two major haplotype groups in the early epidemic stage: the L and S lineages. The proportion of the L lineage in samples collected before and after the Wuhan lockdown differed markedly (99% vs. 70%), and it was hypothesized that the frequency change of the L lineage may be due to different pressures of negative selection from containment measures [41]. This interpretation is disputable, because the allele frequency change can also be caused by sampling bias and clustering infection [25].
On the haplotype network, L84S is pinpointed to the branch connecting C112 and the ancestral node of C109 and C100 (Figure 3), with a very low frequency of 0.2627% in C112 that increased to 100% in both C109 and C100, demonstrating an obvious cluster-expansion pattern at the fine scale. According to the COVID-3D database [42], the L84S amino acid alteration was predicted to eliminate 4 hydrophobic bonds and lead to destabilization of the ORF8 protein. Another study using computational protein modeling proposed that L84S can mitigate the binding of ORF8 to human complement C3b, which is negatively regulated by the C-terminal serine-protease catalytic domain of human complement factor 1, and activates the host complement system [43]. Therefore, the L84S mutation possibly impacts the normal function of ORF8 and plays an important role in host immune responses and infection outcome. The Nm D614G in the epitope region (the receptor binding domain) of the Spike glycoprotein occurred in the common ancestor of C100 and the majority of the clusters (Figure 3); it has been shown to be associated with enhanced binding to human ACE2 and with increased viral replication, transmissibility, and loads in the upper respiratory tract, indicating a competitive fitness advantage in humans [10], [23], [24], [44]. Another Nm, P323L in NSP12 (an RdRp), also occurred at this stage, evolving together with D614G. This mutation might regulate the activity of RdRp and is related to viral replication and fidelity, altering SARS-CoV-2 mutation rates [45]. Derived from C100, two Nm, R203K and G204R in the phosphoprotein domain of the N protein, rose to 100% in C178. Both clusters are widely distributed in European, Asian, and American countries.
According to the structural prediction provided by the COVID-3D database [42], R203K and G204R both destabilize the N protein; the predicted changes in free energy (ΔΔG) obtained with the mutation Cutoff Scanning Matrix (ΔΔGstability mCSM) are −1.71 and −1.07 kcal·mol⁻¹, respectively, resulting in altered molecular interactions with other amino acids, such as carbonyl, polar, and hydrogen bonds (Figure S1A–D). Mutations in the N protein may be functionally relevant to viral replication and assembly, and may participate in immune evasion and viral infection [39], [46], [47], [48]. Diverging from C100, the Q57H mutation in ORF3a, one of the positively selected genes, arose and became fixed in the C212, C240, and C242 clusters. This variant was predicted to exert structural destabilization (ΔΔGstability mCSM = −1.55 kcal·mol⁻¹) (Figure S1E and F) and deleterious effects on protein function [49]. Another mutation, I82T, lies within the third membrane-spanning helix of the M protein, which is implicated in viral glucose transport, and separates C102 and C293 (the early clusters of the Delta variant) from C100. This mutation is structurally predicted to destabilize the M protein with ΔΔGstability mCSM = −2.9 kcal·mol⁻¹, and it may be associated with the emergence of clusters carrying an excess of mutations. Mutations in NSP14, an error-correcting exonuclease gene, may impair ExoN proofreading activity and thus result in elevated mutation rates during viral replication [40], [50]. In our data, the majority of mutations in NSP14 were associated with elevated mutation rates, especially A394V and I42V, which were relevant to the emergence of the Delta (C293) and Omicron (C281) variants. A394V was predicted to destabilize NSP14 with an estimated ΔΔGstability mCSM of −0.55 kcal·mol⁻¹, and I42V with a ΔΔGstability mCSM of −1.06 kcal·mol⁻¹. We propose that NSP14 mutations could be potential predictors of clusters with a high mutation rate.
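
The divergence criterion stated at the start of this section (predominant in one cluster, absent or at low frequency in others) can be sketched as a simple filter over per-cluster allele frequencies. The L84S frequencies below are those quoted in the text; the thresholds `high` and `low`, the function name, and the second mutation are assumptions made only for this illustration, since the cut-offs used in the study are not given here.

```python
def differentiating_mutations(freqs, high=0.9, low=0.1):
    """Flag mutations predominant in some clusters and rare in others.

    freqs maps mutation -> {cluster: derived allele frequency}.
    The thresholds are illustrative assumptions, not values from the study.
    """
    hits = {}
    for mutation, by_cluster in freqs.items():
        dominant = sorted(c for c, f in by_cluster.items() if f >= high)
        rare = sorted(c for c, f in by_cluster.items() if f <= low)
        if dominant and rare:
            hits[mutation] = (dominant, rare)
    return hits

freqs = {
    # Frequencies of ORF8 L84S as reported in the text.
    "ORF8:L84S": {"C112": 0.002627, "C109": 1.0, "C100": 1.0},
    # Hypothetical mutation at intermediate frequency everywhere.
    "hypothetical:X1Y": {"C112": 0.40, "C109": 0.50, "C100": 0.45},
}
print(differentiating_mutations(freqs))
```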

Mutations with rapid increase of frequency within clusters as potential targets of selection

Besides mutations showing nearly fixed divergence among clusters, mutations showing a prominent increase in frequency over sampling times within a cluster are also potential sites of evolutionary or functional significance in the COVID-19 epidemic. Mutations whose frequency increases independently within at least 2 clusters are, with high confidence, of evolutionary and functional significance. We profiled these within-cluster mutations according to their frequency dynamics during a period of 30 sequential sampling times. We identified 11 mutations demonstrating a significant trend of increasing frequency over sampling times (simultaneously tested by the M&K, C&S, and LinRegress tests, P < 0.005) independently within at least 2 clusters (Figure S2). Within the C162 and C39 clusters, both of which predominate in Denmark, the Nm L85F in the positively selected gene ORF3a independently increased in frequency from 0 to 0.62 and from 0 to 1 in 2 different periods, November 09, 2020–March 01, 2021 and March 29, 2021–August 18, 2021, respectively. This variation is located in the second transmembrane segment of the ORF3a protein and can potentially affect virus-induced cell apoptosis and viral egress of SARS-CoV-2, as well as host immune responses and clinical outcomes. The Nm V1264L in the S gene independently rose in frequency from 0 to 0.30 within C276 and from 0 to 0.20 within C129. This variation is located in the cysteine-rich intravirion region at the C-terminus of the coronavirus S protein, in which the cysteine residues are targets of palmitoylation, necessary for efficient S incorporation into virions and for S-mediated membrane fusion, which impacts the efficiency of host cell entry and thus viral infectivity [51]. As mentioned above, mutations in NSP14, the error-correcting exonuclease gene, may be strongly associated with elevated mutation rates and should be a first priority for monitoring.
We observed that the M72I substitution in NSP14 independently increased in frequency from 0 to ∼0.60 within C295 and from 0 to 0.10 within the Delta cluster C129. Intriguingly, we observed consistent ongoing rises in frequency across multiple European countries, including the UK, Norway, Belgium, Italy, Germany, and the Netherlands (Figure S2). M72I is close to sites at the heterodimer interface of the NSP14/NSP10 complex, which stimulates the ExoN activity of NSP14; the substitution may therefore elevate the mutation rate of SARS-CoV-2. Moreover, it is predicted to structurally destabilize NSP14 or the NSP14/NSP10 complex, with a ΔΔGstability mCSM of −1.89 kcal·mol⁻¹. Again, the findings suggest that mutations in NSP14 should be kept under constant surveillance, and that the clusters of the Delta variant are still under ongoing selection, which warrants further attention.
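
The screening rule used in this section (a significant upward trend within at least 2 clusters) can be sketched with a one-sided Cox–Stuart sign test. The text applies three tests simultaneously; this sketch uses only Cox–Stuart for brevity. The P < 0.005 threshold, the 30 sampling times, and the two-cluster requirement come from the text, while the function names, cluster labels, and frequency series are hypothetical.

```python
import math

def cox_stuart_increasing(series):
    """One-sided Cox-Stuart sign test for an upward trend.

    Pairs x[i] with x[i + m], m = ceil(n / 2); ties are dropped.
    Returns the exact binomial p-value P(positives >= observed | p = 0.5).
    """
    n = len(series)
    m = (n + 1) // 2
    diffs = [series[i + m] - series[i] for i in range(n - m)]
    pos = sum(d > 0 for d in diffs)
    neg = sum(d < 0 for d in diffs)
    k = pos + neg
    if k == 0:
        return 1.0
    return sum(math.comb(k, j) for j in range(pos, k + 1)) / 2 ** k

def recurrent_risers(freq_series, alpha=0.005, min_clusters=2):
    """Mutations with a significant rise in at least min_clusters clusters."""
    hits = {}
    for mutation, by_cluster in freq_series.items():
        rising_in = sorted(
            c for c, s in by_cluster.items() if cox_stuart_increasing(s) < alpha
        )
        if len(rising_in) >= min_clusters:
            hits[mutation] = rising_in
    return hits

# Hypothetical frequency series over 30 sequential sampling times.
rising = [0.62 * i / 29 for i in range(30)]
flat = [0.05] * 30
freq_series = {
    "mut1": {"clusterA": rising, "clusterB": [0.5 * f for f in rising]},
    "mut2": {"clusterA": rising, "clusterB": flat},
}
print(recurrent_risers(freq_series))
```

Only `mut1` is reported in this toy input, because `mut2` rises in a single cluster and therefore fails the independence requirement.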

Discussion

Identifying the evolutionary dynamics of SARS-CoV-2 during its pandemic in worldwide human populations faces great challenges. The dynamics of viral populations show a series of founder events caused by clustering infections or local epidemic bursts. In addition, genomic samples are usually collected disproportionately across times and locations (sampling bias). Both factors significantly affect allele frequencies and complicate the analysis of viral genomic data with most population genetic methods [25]. Comparing the relative excess of nonsynonymous and synonymous substitutions is relatively robust to population size changes and represents an efficient approach for evaluating the effects of natural selection on SARS-CoV-2. The approaches we used in this study are logically similar to the well-known McDonald–Kreitman test [52], [53], [54] in molecular evolution, which compares the ratio of nonsynonymous to synonymous substitutions in between-species divergence with that in within-species polymorphisms, and uses the latter as an internal control under neutrality. In contrast, the method we propose here, referred to as the NSRF1 method, compares only genetic polymorphisms within a species. The method is novel in using the Nm/Sm ratio of mutations at very low frequency as the internal control under neutral evolution. The premise that the population dynamics of mutants at very low frequency are identical to those under neutrality is valid according to diffusion theory for the allele frequency in a large population [26].
When identifying natural selection acting on individual genes, we further investigated the increasing or decreasing trend of Nm/Sm ratios with increasing mutant allele frequency, under the assumption that mutations at higher frequencies tend to have experienced a longer duration of natural selection and therefore show proportionally increasing or decreasing Nm/Sm ratios (the NSRF2 method). Such trends are more efficient and robust indicators of natural selection. By adopting these strategies, we demonstrate with multiple lines of evidence that SARS-CoV-2 genomes have overall been constrained by purifying selection during the pandemic. Nevertheless, evidence of positive selection is observed for specific genes that participate in coronavirus infection and host immune evasion. These results indicate that ongoing positive selection is actively driving tighter affinity for the human host and escape from host antiviral immunity, leading to high transmissibility and mild symptoms in the long-run evolution of SARS-CoV-2. Such a trend is supported by studies analyzing immunological and epidemiological data on endemic human coronaviruses [55]. We further partitioned the viral genomic samples into 138 haplotype clusters according to their sequence similarity and genealogical relationships. Superimposing on these 138 worldwide transmission clusters, we provide a list of 556 mutations as putative target sites of natural selection. Although there is no concrete evidence of their functional significance during the outbreaks, mutations showing between-cluster divergence or a within-cluster frequency increase may explain differences in pathogenicity and infectivity. Thus, the list of mutations provides a basis for further functional studies and clinical treatment.

Materials and methods

SARS-CoV-2 genomes downloaded from public databases

SARS-CoV-2 genomic sequences were downloaded from the 2019 Novel Coronavirus Resource (RCoV19, https://bigd.big.ac.cn/ncov/) [56] and the Global Initiative on Sharing All Influenza Data (GISAID, https://www.gisaid.org/). In total, 3,328,405 sequences from 169 countries were included, with sampling dates ranging from December 24, 2019 to December 30, 2021.

Identification of nucleotide mutations

All sequences were aligned using MUSCLE [57] with default parameter settings. The 265 bp of the 5′-untranslated region (UTR) and the 229 bp of the 3′-UTR were trimmed, leaving a final length of 29,409 nucleotides. Nucleotide mutations were called by comparing these sequences with the reference sequence (NC_045512).
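The trimming and mutation-calling step can be sketched in Python as follows. This is a minimal illustration and not the pipeline used in the study: it assumes that each sample is already aligned column-by-column to the reference with no gap columns in the reference, and it skips gaps and ambiguous bases. The function name and the toy sequences are hypothetical. The UTR lengths come from the text; the full reference length of 29,903 nucleotides is consistent with the stated 29,409 retained positions (29,903 − 265 − 229).

```python
from typing import List, Tuple

FIVE_UTR = 265    # bp trimmed from the 5' end (from the text)
THREE_UTR = 229   # bp trimmed from the 3' end (from the text)

def call_mutations(ref_aln: str, sample_aln: str) -> List[Tuple[int, str, str]]:
    """Return (1-based reference position, ref base, sample base) mismatches.

    Assumes the two strings are aligned and the reference has no gap columns,
    so a column index maps directly to a reference coordinate.
    """
    if len(ref_aln) != len(sample_aln):
        raise ValueError("aligned sequences must have equal length")
    calls = []
    for i in range(FIVE_UTR, len(ref_aln) - THREE_UTR):
        r, a = ref_aln[i].upper(), sample_aln[i].upper()
        # Skip gaps and ambiguous bases such as N.
        if r != a and r in "ACGT" and a in "ACGT":
            calls.append((i + 1, r, a))
    return calls

# Toy example: an artificial all-A "reference" of genome length.
ref = "A" * 29903
bases = list(ref)
bases[10] = "C"      # inside the 5'-UTR, ignored
bases[300] = "G"     # retained region, reported as position 301
bases[29900] = "T"   # inside the 3'-UTR, ignored
sample = "".join(bases)
print(call_mutations(ref, sample))
```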

Identification of selection on genomes or individual genes

Selection on viral genomes was detected by comparing the relative abundance of N and S between mutations with high and low allele frequencies (referred to as the NSRF1 method), between widespread and non-widespread mutations, and between long- and short-spanning-time mutations. Selection on individual genes was identified as follows. We divided mutations into 4000 bins corresponding to ≥ 5, ≥ 10, ..., and ≥ 20,000 derived allele counts. Let R_j denote the N/S ratio for mutations with derived allele counts ≥ j. Under purifying selection, R_j values are expected to decrease with j, whereas under positive selection an increasing trend of R_j values is expected. Thus, we applied three statistical methods to detect increasing or decreasing trends of R_j as a function of j: the M&K and C&S tests and LinRegress (together referred to as the NSRF2 method). False discovery rate correction was applied to control for false positives.
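The trend analysis for a single gene can be sketched as below. This Python example covers only the linear-regression part of the procedure: it computes the N/S ratio for each cumulative threshold of derived allele counts and fits an ordinary least-squares slope. It does not implement the M&K or C&S tests, significance testing, or the false discovery rate correction. The function names and toy counts are assumptions for illustration, and skipping thresholds with no synonymous mutations is a choice made here, not something stated in the text.

```python
from bisect import bisect_left
from typing import List, Sequence, Tuple

# Thresholds 5, 10, ..., 20,000 give the 4000 bins described in the text.
THRESHOLDS = range(5, 20001, 5)

def ns_ratio_by_threshold(n_counts: Sequence[int], s_counts: Sequence[int],
                          thresholds=THRESHOLDS) -> List[Tuple[int, float]]:
    """N/S ratio among mutations with derived allele count >= j, for each j."""
    n_sorted, s_sorted = sorted(n_counts), sorted(s_counts)
    out = []
    for j in thresholds:
        n = len(n_sorted) - bisect_left(n_sorted, j)
        s = len(s_sorted) - bisect_left(s_sorted, j)
        if s > 0:  # skip thresholds where the ratio is undefined
            out.append((j, n / s))
    return out

def ols_slope(points: List[Tuple[int, float]]) -> float:
    """Ordinary least-squares slope of ratio against threshold."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Toy gene: derived allele counts of nonsynonymous and synonymous mutations.
toy_n = [5, 5, 5, 5, 10, 10, 15]
toy_s = [5, 10, 15, 20]
ratios = ns_ratio_by_threshold(toy_n, toy_s)
print(ratios, ols_slope(ratios))
```

A negative slope, as in this toy gene, matches the decreasing trend expected under purifying selection; a positive slope would match the increasing trend expected under positive selection.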

Clustering definition of viral lineages based on the haplotype network analysis

Each viral haplotype was assigned to a cluster following the steps of the classification tree shown in Figure S3. A total of 545 SNPs were chosen as the features for the classification, and each haplotype was assigned to a cluster according to its alleles at these 545 SNP loci, following the order of the features.
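The assignment procedure can be pictured as walking a decision tree in which each internal node tests the allele at one SNP locus. The Python sketch below shows this idea on a toy tree; the positions, alleles, and cluster labels are all hypothetical and do not reproduce the tree in Figure S3 or any of the 545 SNPs.

```python
from typing import Dict, Optional, Tuple, Union

# A node is either a cluster label (str) or a pair
# (SNP position, {allele: child node}). All values here are made up.
Node = Union[str, Tuple[int, Dict[str, "Node"]]]

toy_tree: Node = (100, {
    "A": "cluster_1",
    "G": (200, {"C": "cluster_2", "T": "cluster_3"}),
})

def assign_cluster(haplotype: Dict[int, str], node: Node) -> Optional[str]:
    """Follow the tree from the root, testing SNPs in the order they appear.

    Returns None when the haplotype carries an allele the tree does not cover.
    """
    while not isinstance(node, str):
        position, branches = node
        allele = haplotype.get(position)
        if allele not in branches:
            return None
        node = branches[allele]
    return node

print(assign_cluster({100: "G", 200: "T"}, toy_tree))
```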

Prediction of the effects of mutations on protein function

The online resource COVID-3D (http://biosig.unimelb.edu.au/covid3d/) was used to predict the effects of SARS-CoV-2 mutations on protein structure [42].

Competing interests

The authors have declared no competing interests.

CRediT authorship contribution statement

Yali Hou: Formal analysis, Methodology, Writing – original draft, Writing – review & editing. Shilei Zhao: Methodology, Formal analysis, Visualization. Qi Liu: Formal analysis, Methodology, Visualization. Xiaolong Zhang: Formal analysis, Visualization. Tong Sha: Formal analysis. Yankai Su: Formal analysis. Wenming Zhao: Resources. Yiming Bao: Resources. Yongbiao Xue: Conceptualization, Writing – original draft, Supervision. Hua Chen: Conceptualization, Methodology, Writing – original draft, Writing – review & editing, Supervision.

53 in total

Review 1. Viral quasispecies evolution.

Authors: Esteban Domingo; Julie Sheldon; Celia Perales
Journal: Microbiol Mol Biol Rev Date: 2012-06 Impact factor: 11.056

Review 2. The origins and potential future of SARS-CoV-2 variants of concern in the evolving COVID-19 pandemic.

Authors: Sarah P Otto; Troy Day; Julien Arino; Caroline Colijn; Jonathan Dushoff; Michael Li; Samir Mechai; Gary Van Domselaar; Jianhong Wu; David J D Earn; Nicholas H Ogden
Journal: Curr Biol Date: 2021-06-23 Impact factor: 10.834

3. A pneumonia outbreak associated with a new coronavirus of probable bat origin.

Authors: Peng Zhou; Xing-Lou Yang; Xian-Guang Wang; Ben Hu; Lei Zhang; Wei Zhang; Hao-Rui Si; Yan Zhu; Bei Li; Chao-Lin Huang; Hui-Dong Chen; Jing Chen; Yun Luo; Hua Guo; Ren-Di Jiang; Mei-Qin Liu; Ying Chen; Xu-Rui Shen; Xi Wang; Xiao-Shuang Zheng; Kai Zhao; Quan-Jiao Chen; Fei Deng; Lin-Lin Liu; Bing Yan; Fa-Xian Zhan; Yan-Yi Wang; Geng-Fu Xiao; Zheng-Li Shi
Journal: Nature Date: 2020-02-03 Impact factor: 69.504

4. SARS-CoV-2 non-structural protein 13 (nsp13) hijacks host deubiquitinase USP13 and counteracts host antiviral immune response.

Authors: Guijie Guo; Ming Gao; Xiaochen Gao; Bibo Zhu; Jinzhou Huang; Kuntian Luo; Yong Zhang; Jie Sun; Min Deng; Zhenkun Lou
Journal: Signal Transduct Target Ther Date: 2021-03-11